# Exploring how to find hair products in comments
This is an exploratory notebook on looking for product names and types in the Reddit comments. I used spaCy and gensim to do some lemmatization and to look for bigrams. Because people don't necessarily write out full product names, these were not effective ways to find product names. 

In [None]:
import pandas as pd
import numpy as np
import re

## Drop duplicate rows and posts with no comments

In [55]:
# matched_posts.csv is a file that contains the Reddit comments describing the hair routine used
# and matched to the appropriate image
curly_df = pd.read_csv('matched_posts.csv')

# Drop every row whose sub_id appears more than once (keep=False removes all copies),
# then drop missing, deleted, and removed comments
curly_df.drop_duplicates(subset='sub_id', keep=False, inplace=True)
curly_df.dropna(subset=['comm_text'], inplace=True)
curly_df = curly_df[curly_df['comm_text'] != '[deleted]']
curly_df = curly_df[curly_df['comm_text'] != '[removed]']
curly_df.index = range(len(curly_df))

In [56]:
# Get rid of URLs, numbers, and punctuation
text = [re.sub(r'http\S+', '', t) for t in curly_df['comm_text']] # remove links
text = [re.sub(r'[0-9]+', ' ', t).lower() for t in text] # remove all numbers and lowercase the text
# replace punctuation and symbols (everything except apostrophes) with spaces
text = [re.sub(r'(!|"|#|\$|%|&|\(|\)|\*|\+|,|-|\.|/|:|;|<|=|>|\?|@|\[|\\|\]|\^|_|`|{|\||}|~)+', ' ', t)
        for t in text]
curly_df['comm_text'] = [re.sub(r'\s+', ' ', t).strip() for t in text] # collapse runs of whitespace into single spaces
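
To check what the cleaning steps do to a single comment, here is a minimal sketch that wraps them in one function. The name `clean_comment` and the sample string are made up for illustration; the punctuation set is `string.punctuation` minus the apostrophe, which matches the characters listed in the regex above.

```python
import re
import string

# Punctuation to strip: everything in string.punctuation except the apostrophe,
# so that words like "sally's" survive (same character set as the regex above).
PUNCT_RE = re.compile('[' + re.escape(string.punctuation.replace("'", "")) + ']+')

def clean_comment(text):
    """Hypothetical helper: apply the notebook's cleaning steps to one string."""
    text = re.sub(r'http\S+', '', text)           # remove links
    text = re.sub(r'[0-9]+', ' ', text).lower()   # remove numbers, lowercase
    text = PUNCT_RE.sub(' ', text)                # punctuation -> spaces
    return re.sub(r'\s+', ' ', text).strip()      # collapse whitespace

sample = "Used 2 pumps of DevaCurl No-Poo!! See https://example.com/routine"
print(clean_comment(sample))
```

Note that hyphenated names such as "No-Poo" are split into two words, so they no longer match the product list's spelling ("NoPoo").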

## Get products
The 'curly_products.csv' file is a master list of curly-girl-approved products taken from the r/curlyhair subreddit and augmented by me.

In [None]:
products = pd.read_csv('curly_products.csv')
products.rename(columns={"Unnamed: 0": "product", "Unnamed: 1": "type"}, inplace=True)
products.drop(columns=['Unnamed: 2','Unnamed: 3','Unnamed: 4'], inplace=True)
products.describe()

In [None]:
products['product'].to_string()

In [None]:
import spacy

In [None]:
#! python -m spacy download en_core_web_sm  # Uncomment this line to download the model if spacy.load raises an OSError
nlp = spacy.load('en_core_web_sm')

In [None]:
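# Run each cleaned comment through the spaCy pipeline and keep lowercased lemmas longer than one character
docs = [[t.lemma_.lower() for t in doc if len(t.orth_) > 1] for doc in nlp.pipe(curly_df['comm_text'])]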

In [None]:
from gensim.models import Phrases

bigram = Phrases(docs, min_count=1)
tokens = []

for idx in range(len(docs)):
    for token in bigram[docs[idx]]:
        #if '_' in token:  # bigrams can be recognized by the "_" that joins the individual words
        docs[idx].append(token)
        tokens.append(token)
            
print(list(set(tokens)))

So I'm not a huge fan of those phrases. I will try to make my own.

In [48]:
text = [p for p in products['product']]
text = [re.sub(r'(!|"|#|\$|%|&|\(|\)|\*|\+|,|-|\.|/|:|;|<|=|>|\?|@|\[|\\|\]|\^|_|`|{|\||}|~)+', ' ', t)
        for t in text]
print(text)

['As I Am Cleansing Pudding', 'Bobeam Shampoo Bars', 'Chagrin Valley Shampoo Bars', 'Desert Essence Coconut Shampoo', 'Desert Essence No Fragrance Shampoo', 'DevaCurl Delight LowPo', 'DevaCurl LowPoo ', 'DevaCurl Decadence NoPoo', 'DevaCurl NoPoo', 'Eden Bodyworks Peppermint Tea Tree Shampoo', 'Enjoy Sulfate free Hydrating Shampoo', 'Everyday Shea Shampoos', 'Giovanni Shampoos', 'Jason Dandruff Relief', 'Jason Normalizing Tea Tree Shampoo', 'Jessicurl Gentle Lather Shampoo', 'Kinky Curly Come Clean Shampoo', 'Kevin Murphy Angel Wash', 'Kevin Murphy Born Again Wash', 'L’oreal EverSleek Sulfate Free Shampoo', 'L’oreal EverCurl Sulfate Free', 'L’oreal OleoTherapy Oil Infused Shampoo', "Miss Jessie's Cleansing Cream", 'Natural instinct Hydrating Daily shampoo', 'Nature’s Gate various shampoos', "Not Your Mother's Naturals Linseed Chia Blend   French Plum Seed Oil Volume Boost Shampoo", 'Not your Mother’s Tahitian Gardenia Flower   Mango Butter Shampoo', 'Organix Coconut Milk Shampoo', 'Org

In [None]:
text = [p.replace(' ', '_') for p in products['product']]
print(text)
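
One way to "make my own" phrases is to look for fragments of each product name in a comment, since people rarely write the full name. The following is only a sketch of that idea, not what this notebook ends up doing: `normalize`, `name_ngrams`, and `find_products` are hypothetical helpers, and the short product list is copied from the printed output above.

```python
import re

def normalize(text):
    """Lowercase and keep only letters, apostrophes, and single spaces."""
    text = re.sub(r"[^a-z' ]+", ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

def name_ngrams(name, min_len=2):
    """All contiguous word n-grams of a product name, longest first."""
    words = normalize(name).split()
    grams = []
    for n in range(len(words), min_len - 1, -1):
        for i in range(len(words) - n + 1):
            grams.append(' '.join(words[i:i + n]))
    return grams

def find_products(comment, product_names, min_len=2):
    """Map each product to the longest fragment of its name found in the comment."""
    padded = ' ' + normalize(comment) + ' '   # pad so only whole words match
    hits = {}
    for name in product_names:
        for gram in name_ngrams(name, min_len):
            if ' ' + gram + ' ' in padded:
                hits[name] = gram
                break                          # longest fragment wins
    return hits

product_names = ['Kinky Curly Come Clean Shampoo', 'DevaCurl NoPoo',
                 'Everyday Shea Shampoos']
comment = 'i used the kinky curly hair custard and plopped all night'
print(find_products(comment, product_names))
```

In this example the only hit is the brand fragment "kinky curly", which is attributed to the shampoo even though the comment is about the custard. Fragment matching like this finds brands more reliably than specific products, so the hits would still need a check against the product type.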

### Here I messed around with vectorizing the comments.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [66]:
comment_text = [t for t in curly_df['comm_text']]
print(comment_text[3])

# Fit on the first comment only, so the vocabulary holds just that comment's words
vectorizer.fit(comment_text[0:1])

so i finally started cowashing after having really dry tangly hair i found a cleansing conditioner at sally's that i really enjoy plus i use the shea moisture weightless conditioner to detangle then i wash that out and then do a revised s c with the same stuff i used the kinky curly hair custard and plopped all night i wear a cpap so i'm not able to do it with a t shirt i used a satin head wrap from sally's after i got most of the moisture out since its not as bulky as a t shirt i can fit my cpap head gear over it comfortably anyway today was the first day anyone's asked me if my hair is naturally curly and said it looked nice now i just need to figure out a cpap friendly way to preseve this for tomorrow


CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [67]:
print('Vocabulary: ')
print(vectorizer.vocabulary_)


Vocabulary: 
{'still': 69, 'doing': 16, 'trial': 77, 'and': 1, 'error': 21, 'on': 49, 'my': 44, 'routine': 59, 'some': 67, 'days': 14, 'it': 33, 'looks': 38, 'great': 28, 'no': 47, 'so': 66, 'much': 43, 'this': 75, 'was': 82, 'good': 26, 'day': 13, 'washed': 84, 'at': 4, 'night': 46, 'with': 88, 'suave': 71, 'naturals': 45, 'rinse': 58, 'completely': 11, 'out': 52, 'then': 74, 'turn': 79, 'head': 31, 'upside': 80, 'down': 17, 'put': 57, 'la': 34, 'bella': 8, 'gel': 24, 'in': 32, 'gently': 25, 'flop': 23, 'hair': 29, 'back': 5, 'place': 54, 'let': 37, 'drip': 18, 'dry': 19, 'towel': 76, 'around': 3, 'shoulders': 64, 'after': 0, 'slept': 65, 'oh': 48, 'wash': 83, 'maybe': 41, 'every': 22, 'other': 51, 'week': 86, 'shea': 63, 'moisture': 42, 'shampoo': 62, 'do': 15, 'once': 50, 'but': 9, 'hateeeeeeee': 30, 'the': 73, 'scent': 60, 'previously': 56, 'plopping': 55, 'been': 7, 'making': 40, 'straight': 70, 'parts': 53, 'by': 10, 'ears': 20, 'stick': 68, 'which': 87, 'way': 85, 've': 81, 'tri
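
Because the vectorizer was fit on `comment_text[0:1]`, the vocabulary only contains words from the first comment, and any other word in later comments is ignored when they are transformed. The sketch below imitates that behavior in plain Python to make it visible on a tiny example. It is not scikit-learn's implementation: `fit_vocabulary` and `transform` are hypothetical stand-ins, the two comments are made up, and the token pattern is the default one shown in the `CountVectorizer` repr above.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r'\b\w\w+\b')  # default token_pattern: words of 2+ characters

def fit_vocabulary(docs):
    """Assign an index to every distinct token, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in TOKEN_RE.findall(doc.lower())})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count vocabulary tokens per document; tokens outside the vocabulary are dropped."""
    ordered = sorted(vocabulary, key=vocabulary.get)
    rows = []
    for doc in docs:
        counts = Counter(TOKEN_RE.findall(doc.lower()))
        rows.append([counts.get(tok, 0) for tok in ordered])
    return rows

comments = ['wash hair and then gel', 'gel gel and a diffuser']
vocab = fit_vocabulary(comments[0:1])   # fit on the first comment only
print(vocab)
print(transform(comments, vocab))
```

Here "diffuser" never shows up in the second row because it is not in the vocabulary, which is the same reason rows after the first in the full vector are mostly zeros.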

In [70]:
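import sys

# Count the fitted vocabulary's words in every comment
vector = vectorizer.transform(comment_text)
# Print arrays in full instead of truncating them with "..."
np.set_printoptions(threshold=sys.maxsize)
# Our final vector:
print('Full vector: ')
print(vector.toarray())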


Full vector: 
[[ 1  4  1  1  1  1  1  1  1  3  1  1  1  1  2  1  1  1  1  1  1  1  2  1
   1  1  1  1  2  1  1  1  4  8  1  1  1  1  1  1  1  1  1  1  6  1  1  1
   1  3  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  3  1  1  1  1
   1  1  2  2  1  1  1  1  1  1  3  1  1  1  2  1  3  1]
 [ 1  8  0  0  0  0  0  0  0  2  0  0  0  1  1  0  0  0  0  0  0  1  2  0
   0  0  0  1  0  1  0  0  4  8  0  0  0  0  0  0  0  0  0  0  1  0  0  2
   0  0  0  1  1  0  0  1  0  1  0  0  0  0  0  0  0  0  0  1  0  1  0  0
   0  4  3  0  2  1  3  0  0  3  2  2  0  3  0  0  6  0]
 [ 2 16  0  1  2  0  0  0  0  0  0  1  1  0  0  0  0  4  0  4  0  0  0  0
   3  0  0  0  0  9  0  5  2  4  0  0  0  1  0  0  0  0  0  0 32  0  0  1
   0  5  0  0  1  0  0  0  0  0  1  1  0  0  0  0  0  0  2  0  0  0  0  0
   0 15  4  0  2  0  0  0  1  0  1  1  0  0  0  0  4  0]
 [ 2  3  0  0  1  0  0  0  0  0  0  0  2  1  0  2  0  0  0  1  0  0  0  0
   0  0  0  0  0  3  0  2  0  3  0  0  0  0  0  0  0  0  2  0  2  0  1  0
 