## Notebook Summary - Pre-Processing & Feature Engineering
---
#### The contents of this notebook include:
- Pre-processing steps to get the data ready for modeling (contractions function, removal of special characters, lemmatizer + POS tagging)
- EDA part II after tokenizing words via CountVectorizer & TfidfVectorizer
- Train-test split of the dataset in preparation for modeling

In [1]:
#import libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 2000

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.naive_bayes import MultinomialNB, BernoulliNB    #classifiers commonly used for NLP data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, mean_squared_error
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.preprocessing import StandardScaler

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
import re
import os
from sklearn import metrics
import pickle

# Import CountVectorizer and TFIDFVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# skopt imports
from skopt.space import Integer, Real, Categorical   #these are necessary to get BayesSearchCV to run
from skopt import BayesSearchCV

In [2]:
#read in data
#sentiment analysis should be 
subs = pd.read_csv('cleaned_datasets/subs_sentiment_analysis.csv')
subs.head()

Unnamed: 0.1,Unnamed: 0,author,text,text_length,word_count,subreddit,neg,pos,neu,compound,compound_cat
0,0,Missy_Pantone,Does anyone have experience with Sidmool Saccharo Ferment Sparkle First Ampoule?,83,11,asian_beauty,0.0,0.219,0.781,0.4215,positive
1,1,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty,0.0,0.514,0.486,0.8122,positive
2,2,flckeringfox_,What’s your best eye cream to brighten the area?,51,9,asian_beauty,0.0,0.504,0.496,0.7964,positive
3,3,etoileneha,Has anyone tried the Beauty of Joseon - Red Bean Refreshing Pore Mask?,73,13,asian_beauty,0.0,0.257,0.743,0.5859,positive
4,4,bully-maguire23,klavuu pure pearlsation micro collagen cleansing water??,59,7,asian_beauty,0.0,0.0,1.0,0.0,neutral


## 1. Pre-Processing
---

### Apply Contractions to adjust contractions in the text

In [3]:
#install contractions
# !pip install contractions

#import contractions
import contractions

In [4]:
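def contr_fixer(df, col_list):
    # expand contractions (e.g. "What's" -> "What is") in each listed column
    for col in col_list:
        fixed_items = [contractions.fix(text) for text in df[col]]
        df[col] = fixed_items      # write back to the same column that was read
    return df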
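
To make the effect of this step concrete, here is a minimal sketch of contraction expansion using only the standard library. The `CONTRACTIONS` mapping and the `expand_contractions` helper are hypothetical stand-ins written for illustration; the `contractions` package used above covers far more forms than this toy table.

```python
import re

# Hypothetical, tiny mapping for illustration only.
CONTRACTIONS = {
    "what's": "what is",
    "don't": "do not",
    "i've": "i have",
    "i'm": "i am",
    "it's": "it is",
}

# Match any key as a whole word, case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace known contractions; curly apostrophes are normalized first."""
    text = text.replace("\u2019", "'")

    def _swap(match):
        expanded = CONTRACTIONS[match.group(0).lower()]
        # Preserve a leading capital, e.g. "What's" -> "What is".
        if match.group(0)[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_swap, text)


print(expand_contractions("What’s your best eye cream? I don't know."))
```

Normalizing the curly apostrophe first matters for this dataset, since posts such as "What’s your best eye cream..." use it instead of the ASCII one.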

In [5]:
contr_fixer(subs, ['text'])

Unnamed: 0.1,Unnamed: 0,author,text,text_length,word_count,subreddit,neg,pos,neu,compound,compound_cat
0,0,Missy_Pantone,Does anyone have experience with Sidmool Saccharo Ferment Sparkle First Ampoule?,83,11,asian_beauty,0.000,0.219,0.781,0.4215,positive
1,1,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty,0.000,0.514,0.486,0.8122,positive
2,2,flckeringfox_,What is your best eye cream to brighten the area?,51,9,asian_beauty,0.000,0.504,0.496,0.7964,positive
3,3,etoileneha,Has anyone tried the Beauty of Joseon - Red Bean Refreshing Pore Mask?,73,13,asian_beauty,0.000,0.257,0.743,0.5859,positive
4,4,bully-maguire23,klavuu pure pearlsation micro collagen cleansing water??,59,7,asian_beauty,0.000,0.000,1.000,0.0000,neutral
...,...,...,...,...,...,...,...,...,...,...,...
12870,12870,iinuzukaa,"Please help my terrible skin - Lately I have had some major breakouts on my cheeks and it seems to be getting worse. I have tried everything possible but my breakouts are taking forever to dissapear. I am currently using oxy and it seems to be working alright but I would like to know if anyone has another suggestion.\n(I have not been putting any makeup on it either not to infect it or something). Also, what can I use to prevent scars?",436,81,skincare_addiction,0.049,0.125,0.826,0.6486,positive
12871,12871,Amber_Owl,"How do you reset your face? - That probably sounds weird, but it is the only way I can figure out how to describe it. \n\nHere is the thing: Lately, I have been suffering from a lot of breakouts, blackheads, the bigger red spots. I am not sure what I should and should not be doing. \n\nBasically, I want to eliminate every process and/or product that goes onto my face except what is absolutely necessary. I want to figure out exactly what kind of skin type I have. I want to work on doing a daily regimen that is beneficial and will produce favorable results, but I do not know where to start. I want to reverse the damage that has already been done. \n\nWhere do I start? Does anyone have any tips? Product recommendations? Links? Videos? Advice? Seriously, I appreciate anything you can give me. \n\n",789,140,skincare_addiction,0.097,0.124,0.779,0.4619,positive
12872,12872,phantom_poo,"Facials: Worth it or not? - I am contemplating going to a salon nearby and getting a facial, mostly because it is something I have never done before.\n\nIs it worth it, or is it just an hour-long face massage? What is a decent amount to pay? Should I just get the regular kind, or go with a special ""oxygen facial"" or ""acne healing facial?""\n\nI would love to hear some of your experiences!\n\nThanks!",386,70,skincare_addiction,0.018,0.241,0.741,0.9527,positive
12873,12873,[deleted],"Biting/peeling lips - Hi all, I have a habit of biting and pulling at the skin on my lips because they are constantly chapped or peeling. They often look terrible. Any tips on how to keep them from peeling so I do not rip them to bits? Suggestions for moisturizing lip balms?",276,50,skincare_addiction,0.071,0.000,0.929,-0.5362,negative


### Remove Special Characters

In [6]:
# tokenizer & removing special characters

def tokenizr(df, col_list):
    token = RegexpTokenizer(r'[\w\’\`\‘\'\´]+')
    for j in col_list:
        token_items = [token.tokenize(i.lower()) for i in df[j]]
        df[f'tokenized_{j}'] = token_items
    return df
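
As a quick check of what this pattern keeps and drops, the sketch below applies the same character class with `re.findall`. It assumes that `RegexpTokenizer` with its default settings simply returns the regex matches, so the standard-library call should behave comparably; the `tokenize` helper is a name made up for this example.

```python
import re

# Same character class as the RegexpTokenizer pattern above:
# word characters plus several apostrophe-like marks.
TOKEN_PATTERN = re.compile(r"[\w’`‘'´]+")


def tokenize(text: str) -> list:
    """Lowercase the text and return every maximal run of allowed characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokenize("Is it just an hour-long face massage??"))
print(tokenize("What’s your best eye cream?"))
```

Hyphens and question marks are not in the character class, so "hour-long" splits into two tokens and punctuation disappears, while an apostrophe inside a word is kept.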

In [7]:
# create a df with just text col
post = subs[['text']]

# run through tokenizer function
# (iterating over a DataFrame yields its column names, so this tokenizes the 'text' column)
tokenizr(subs, post)

Unnamed: 0.1,Unnamed: 0,author,text,text_length,word_count,subreddit,neg,pos,neu,compound,compound_cat,tokenized_text
0,0,Missy_Pantone,Does anyone have experience with Sidmool Saccharo Ferment Sparkle First Ampoule?,83,11,asian_beauty,0.000,0.219,0.781,0.4215,positive,"[does, anyone, have, experience, with, sidmool, saccharo, ferment, sparkle, first, ampoule]"
1,1,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty,0.000,0.514,0.486,0.8122,positive,"[love, my, hair, color, blended, nicely, with, my, grays]"
2,2,flckeringfox_,What is your best eye cream to brighten the area?,51,9,asian_beauty,0.000,0.504,0.496,0.7964,positive,"[what, is, your, best, eye, cream, to, brighten, the, area]"
3,3,etoileneha,Has anyone tried the Beauty of Joseon - Red Bean Refreshing Pore Mask?,73,13,asian_beauty,0.000,0.257,0.743,0.5859,positive,"[has, anyone, tried, the, beauty, of, joseon, red, bean, refreshing, pore, mask]"
4,4,bully-maguire23,klavuu pure pearlsation micro collagen cleansing water??,59,7,asian_beauty,0.000,0.000,1.000,0.0000,neutral,"[klavuu, pure, pearlsation, micro, collagen, cleansing, water]"
...,...,...,...,...,...,...,...,...,...,...,...,...
12870,12870,iinuzukaa,"Please help my terrible skin - Lately I have had some major breakouts on my cheeks and it seems to be getting worse. I have tried everything possible but my breakouts are taking forever to dissapear. I am currently using oxy and it seems to be working alright but I would like to know if anyone has another suggestion.\n(I have not been putting any makeup on it either not to infect it or something). Also, what can I use to prevent scars?",436,81,skincare_addiction,0.049,0.125,0.826,0.6486,positive,"[please, help, my, terrible, skin, lately, i, have, had, some, major, breakouts, on, my, cheeks, and, it, seems, to, be, getting, worse, i, have, tried, everything, possible, but, my, breakouts, are, taking, forever, to, dissapear, i, am, currently, using, oxy, and, it, seems, to, be, working, alright, but, i, would, like, to, know, if, anyone, has, another, suggestion, i, have, not, been, putting, any, makeup, on, it, either, not, to, infect, it, or, something, also, what, can, i, use, to, prevent, scars]"
12871,12871,Amber_Owl,"How do you reset your face? - That probably sounds weird, but it is the only way I can figure out how to describe it. \n\nHere is the thing: Lately, I have been suffering from a lot of breakouts, blackheads, the bigger red spots. I am not sure what I should and should not be doing. \n\nBasically, I want to eliminate every process and/or product that goes onto my face except what is absolutely necessary. I want to figure out exactly what kind of skin type I have. I want to work on doing a daily regimen that is beneficial and will produce favorable results, but I do not know where to start. I want to reverse the damage that has already been done. \n\nWhere do I start? Does anyone have any tips? Product recommendations? Links? Videos? Advice? Seriously, I appreciate anything you can give me. \n\n",789,140,skincare_addiction,0.097,0.124,0.779,0.4619,positive,"[how, do, you, reset, your, face, that, probably, sounds, weird, but, it, is, the, only, way, i, can, figure, out, how, to, describe, it, here, is, the, thing, lately, i, have, been, suffering, from, a, lot, of, breakouts, blackheads, the, bigger, red, spots, i, am, not, sure, what, i, should, and, should, not, be, doing, basically, i, want, to, eliminate, every, process, and, or, product, that, goes, onto, my, face, except, what, is, absolutely, necessary, i, want, to, figure, out, exactly, what, kind, of, skin, type, i, have, i, want, to, work, on, doing, a, daily, regimen, that, is, beneficial, ...]"
12872,12872,phantom_poo,"Facials: Worth it or not? - I am contemplating going to a salon nearby and getting a facial, mostly because it is something I have never done before.\n\nIs it worth it, or is it just an hour-long face massage? What is a decent amount to pay? Should I just get the regular kind, or go with a special ""oxygen facial"" or ""acne healing facial?""\n\nI would love to hear some of your experiences!\n\nThanks!",386,70,skincare_addiction,0.018,0.241,0.741,0.9527,positive,"[facials, worth, it, or, not, i, am, contemplating, going, to, a, salon, nearby, and, getting, a, facial, mostly, because, it, is, something, i, have, never, done, before, is, it, worth, it, or, is, it, just, an, hour, long, face, massage, what, is, a, decent, amount, to, pay, should, i, just, get, the, regular, kind, or, go, with, a, special, oxygen, facial, or, acne, healing, facial, i, would, love, to, hear, some, of, your, experiences, thanks]"
12873,12873,[deleted],"Biting/peeling lips - Hi all, I have a habit of biting and pulling at the skin on my lips because they are constantly chapped or peeling. They often look terrible. Any tips on how to keep them from peeling so I do not rip them to bits? Suggestions for moisturizing lip balms?",276,50,skincare_addiction,0.071,0.000,0.929,-0.5362,negative,"[biting, peeling, lips, hi, all, i, have, a, habit, of, biting, and, pulling, at, the, skin, on, my, lips, because, they, are, constantly, chapped, or, peeling, they, often, look, terrible, any, tips, on, how, to, keep, them, from, peeling, so, i, do, not, rip, them, to, bits, suggestions, for, moisturizing, lip, balms]"


### Get POS for tokens & Lemmatize Text

#### Apply POS tagging

In [8]:
# apply POS tagging

subs['POS_tag'] = subs['tokenized_text'].apply(nltk.pos_tag)
subs.head()

Unnamed: 0.1,Unnamed: 0,author,text,text_length,word_count,subreddit,neg,pos,neu,compound,compound_cat,tokenized_text,POS_tag
0,0,Missy_Pantone,Does anyone have experience with Sidmool Saccharo Ferment Sparkle First Ampoule?,83,11,asian_beauty,0.0,0.219,0.781,0.4215,positive,"[does, anyone, have, experience, with, sidmool, saccharo, ferment, sparkle, first, ampoule]","[(does, VBZ), (anyone, NN), (have, VB), (experience, NN), (with, IN), (sidmool, NN), (saccharo, NN), (ferment, NN), (sparkle, NN), (first, RB), (ampoule, NN)]"
1,1,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty,0.0,0.514,0.486,0.8122,positive,"[love, my, hair, color, blended, nicely, with, my, grays]","[(love, VB), (my, PRP$), (hair, NN), (color, NN), (blended, VBD), (nicely, RB), (with, IN), (my, PRP$), (grays, NNS)]"
2,2,flckeringfox_,What is your best eye cream to brighten the area?,51,9,asian_beauty,0.0,0.504,0.496,0.7964,positive,"[what, is, your, best, eye, cream, to, brighten, the, area]","[(what, WP), (is, VBZ), (your, PRP$), (best, JJS), (eye, NN), (cream, NN), (to, TO), (brighten, VB), (the, DT), (area, NN)]"
3,3,etoileneha,Has anyone tried the Beauty of Joseon - Red Bean Refreshing Pore Mask?,73,13,asian_beauty,0.0,0.257,0.743,0.5859,positive,"[has, anyone, tried, the, beauty, of, joseon, red, bean, refreshing, pore, mask]","[(has, VBZ), (anyone, NN), (tried, VBD), (the, DT), (beauty, NN), (of, IN), (joseon, NN), (red, JJ), (bean, NN), (refreshing, VBG), (pore, NN), (mask, NN)]"
4,4,bully-maguire23,klavuu pure pearlsation micro collagen cleansing water??,59,7,asian_beauty,0.0,0.0,1.0,0.0,neutral,"[klavuu, pure, pearlsation, micro, collagen, cleansing, water]","[(klavuu, JJ), (pure, NN), (pearlsation, NN), (micro, NN), (collagen, NN), (cleansing, NN), (water, NN)]"


In [9]:
import spacy

In [10]:
# create function to lemmatize text using SpaCy
nlp = spacy.load('en_core_web_lg')

def lemmalist(r):
    lemma_ls = []              #create an empty list
    for tup_ls in r:
        a = tup_ls[0]          #index the word
        b = tup_ls[1]          #index the POS
        if b[0] in ('N', 'J', 'V', 'R'):                                   #if the first char of the POS is N, J, V, or R
            lemma_ls.append(nlp(a)[0].lemma_)                              #lemmatize using spacy
        else:
            lemma_ls.append(a)                                             #otherwise just append the word back to list
    return lemma_ls
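
The gating logic in `lemmalist` can be tried without loading a spaCy model. The sketch below swaps in a hypothetical lookup table (`TOY_LEMMAS`) for the real lemmatizer and uses a made-up helper name (`lemmatize_tagged`), so it only illustrates which tokens get lemmatized, not how spaCy derives lemmas.

```python
# Hypothetical lookup standing in for a real lemmatizer such as spaCy.
TOY_LEMMAS = {"does": "do", "blended": "blend", "grays": "gray", "tried": "try"}

# First letters of Penn Treebank tags for nouns, adjectives, verbs, adverbs.
CONTENT_TAG_PREFIXES = ("N", "J", "V", "R")


def lemmatize_tagged(tagged_tokens, lemmatize=lambda w: TOY_LEMMAS.get(w, w)):
    """Lemmatize only content words; pass every other token through unchanged."""
    out = []
    for word, tag in tagged_tokens:
        if tag.startswith(CONTENT_TAG_PREFIXES):
            out.append(lemmatize(word))
        else:
            out.append(word)
    return out


tagged = [("love", "VB"), ("my", "PRP$"), ("hair", "NN"), ("color", "NN"),
          ("blended", "VBD"), ("nicely", "RB"), ("with", "IN"),
          ("my", "PRP$"), ("grays", "NNS")]
print(lemmatize_tagged(tagged))
```

Passing the lemmatizer in as an argument keeps the tag filter separate from the model, which makes the filter easy to test on a few rows before running the slow full pass.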

#### Commenting out Lemmatizer so I don't rerun

In [11]:
# commented out so it doesn't rerun
# subs['words_lemmatized'] = subs['POS_tag'].apply(lemmalist)

In [12]:
# # commenting out so it doesn't rerun
# # join tokens back together into a string
# subs['text_cleaned'] = [' '.join(i) for i in subs['words_lemmatized']]

In [13]:
# #commenting out so it doesn't re-run
# #export dataset to csv
# subs.to_csv('cleaned_datasets/subs_lemmatized.csv', index = False)

In [14]:
#read in lemmatized df
subs = pd.read_csv('cleaned_datasets/subs_lemmatized.csv')
subs.head(2)

Unnamed: 0,author,text,text_length,word_count,subreddit,neg,pos,neu,compound,compound_cat,tokenized_text,POS_tag,words_lemmatized,text_cleaned
0,Missy_Pantone,Does anyone have experience with Sidmool Saccharo Ferment Sparkle First Ampoule?,83,11,asian_beauty,0.0,0.219,0.781,0.4215,positive,"['does', 'anyone', 'have', 'experience', 'with', 'sidmool', 'saccharo', 'ferment', 'sparkle', 'first', 'ampoule']","[('does', 'VBZ'), ('anyone', 'NN'), ('have', 'VB'), ('experience', 'NN'), ('with', 'IN'), ('sidmool', 'NN'), ('saccharo', 'NN'), ('ferment', 'NN'), ('sparkle', 'NN'), ('first', 'RB'), ('ampoule', 'NN')]","['do', 'anyone', 'have', 'experience', 'with', 'sidmool', 'saccharo', 'ferment', 'sparkle', 'first', 'ampoule']",do anyone have experience with sidmool saccharo ferment sparkle first ampoule
1,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty,0.0,0.514,0.486,0.8122,positive,"['love', 'my', 'hair', 'color', 'blended', 'nicely', 'with', 'my', 'grays']","[('love', 'VB'), ('my', 'PRP$'), ('hair', 'NN'), ('color', 'NN'), ('blended', 'VBD'), ('nicely', 'RB'), ('with', 'IN'), ('my', 'PRP$'), ('grays', 'NNS')]","['love', 'my', 'hair', 'color', 'blend', 'nicely', 'with', 'my', 'gray']",love my hair color blend nicely with my gray


In [15]:
subs_final = subs.drop(columns=['text','tokenized_text','POS_tag','words_lemmatized','author'])

In [16]:
subs_final.rename(columns={'text_cleaned': 'text'}, inplace=True)

In [17]:
#re-ordering df cols
subs_final = subs_final[['subreddit', 'text', 'text_length', 'word_count','neg', 'pos', 'neu', 'compound', 'compound_cat']]
subs_final.head(2)

Unnamed: 0,subreddit,text,text_length,word_count,neg,pos,neu,compound,compound_cat
0,asian_beauty,do anyone have experience with sidmool saccharo ferment sparkle first ampoule,83,11,0.0,0.219,0.781,0.4215,positive
1,asian_beauty,love my hair color blend nicely with my gray,51,9,0.0,0.514,0.486,0.8122,positive


In [18]:
subs_final.drop(columns='compound_cat', inplace=True)

In [19]:
#export dataset to csv
subs_final.to_csv('cleaned_datasets/subs_cleaned_forCVEC.csv', index = False)

---
## 2. CountVectorizing for EDA II Only

In [20]:
words = pd.read_csv('cleaned_datasets/subs_cleaned_forCVEC.csv')
words.head(2)

Unnamed: 0,subreddit,text,text_length,word_count,neg,pos,neu,compound
0,asian_beauty,do anyone have experience with sidmool saccharo ferment sparkle first ampoule,83,11,0.0,0.219,0.781,0.4215
1,asian_beauty,love my hair color blend nicely with my gray,51,9,0.0,0.514,0.486,0.8122


In [21]:
# Import CountVectorizer and TFIDFVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [22]:
#instantiate countvectorizer
cvec = CountVectorizer(stop_words='english',
                       ngram_range=(1,3),
                       token_pattern=r'[\w\’\`\‘\'\´]+',
                       max_features=5400)

#fit & transform dataset
cvec.fit_transform(words['text'])


words_cvec = pd.DataFrame(
    cvec.transform(words['text']).todense(),
    columns = cvec.get_feature_names_out())

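# drop the token column named 'subreddit' (the word itself appears in posts)
# so it doesn't clash with the 'subreddit' label column when concatenated
words_cvec.drop(columns='subreddit', inplace=True)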

words_cvec.head()

Unnamed: 0,',0,0 025,0 05,0 1,0 25,0 5,00,01,02,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
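
Column names such as `zinc oxide` and `zinc pyrithione` come from `ngram_range=(1,3)`. The sketch below shows how 1- to 3-grams are formed and counted per document. It is a simplified illustration, not scikit-learn's implementation: it skips stop-word removal and `max_features`, and the helper names `ngrams` and `count_ngrams` are invented for this example.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Yield space-joined n-grams for every n in [n_min, n_max]."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_ngrams(docs, n_min=1, n_max=3):
    """Return one Counter of n-gram counts per document."""
    return [Counter(ngrams(doc.split(), n_min, n_max)) for doc in docs]


docs = ["zinc oxide sunscreen", "hyaluronic acid serum with zinc"]
counts = count_ngrams(docs)
print(sorted(counts[0]))
print(counts[0]["zinc oxide"], counts[1]["zinc oxide"])
```

A three-token post already produces six features, which is why a cap like `max_features` is useful once bigrams and trigrams are included.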


In [23]:
#concat it back to words df
words = pd.concat([words, words_cvec], axis=1)
words.head(2)

Unnamed: 0,subreddit,text,text_length,word_count,neg,pos,neu,compound,',0,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
0,asian_beauty,do anyone have experience with sidmool saccharo ferment sparkle first ampoule,83,11,0.0,0.219,0.781,0.4215,0,0,...,0,0,0,0,0,0,0,0,0,0
1,asian_beauty,love my hair color blend nicely with my gray,51,9,0.0,0.514,0.486,0.8122,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
#create a list of columns that need to be aggregated by count
col_list = words_cvec.columns
col_list

Index([''', '0', '0 025', '0 05', '0 1', '0 25', '0 5', '00', '01', '02',
       ...
       'yul', 'zero', 'zinc', 'zinc 1', 'zinc oxide', 'zinc pyrithione', 'zit',
       'zone', 'zone dry', 'zone oily'],
      dtype='object', length=5399)

---

In [25]:
words_mean = words.groupby('subreddit').agg({'text_length':'mean',
           'word_count':'mean',
           'neg':'mean',
           'pos':'mean',
           'neu':'mean',
           'compound':'mean'})

words_mean

subreddit,text_length,word_count,neg,pos,neu,compound
asian_beauty,358.888297,60.830274,0.032,0.138478,0.829519,0.361997
skincare_addiction,523.183658,92.430584,0.053486,0.10784,0.838676,0.294394


In [26]:
#add subreddit_col into words_cvec
tokens = pd.concat([words['subreddit'], words_cvec], axis=1)
tokens.shape

(12875, 5400)

In [27]:
method = 'sum'

In [28]:
col_count = dict.fromkeys(col_list, method)

In [29]:
words_count = tokens.groupby('subreddit').agg(col_count)

words_count

subreddit,',0,0 025,0 05,0 1,0 25,0 5,00,01,02,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
asian_beauty,78,36,2,0,3,1,4,11,44,21,...,25,41,79,0,61,1,23,41,8,2
skincare_addiction,86,339,40,34,70,16,28,20,22,5,...,13,32,260,61,56,18,123,155,12,21


#### Taking a look at the aggregates of words grouped by subreddit

In [30]:
words_agg = pd.concat([words_mean, words_count], axis=1)
words_agg.head()

subreddit,text_length,word_count,neg,pos,neu,compound,',0,0 025,0 05,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
asian_beauty,358.888297,60.830274,0.032,0.138478,0.829519,0.361997,78,36,2,0,...,25,41,79,0,61,1,23,41,8,2
skincare_addiction,523.183658,92.430584,0.053486,0.10784,0.838676,0.294394,86,339,40,34,...,13,32,260,61,56,18,123,155,12,21


In [31]:
words_agg.filter(like='hyaluronic')

subreddit,hyaluronic,hyaluronic acid,hyaluronic acid 2,hyaluronic acid serum,labo hyaluronic,ordinary hyaluronic,ordinary hyaluronic acid
asian_beauty,100,82,0,1,5,0,0
skincare_addiction,312,259,33,35,12,36,35


## 3. Train-Test-Split
Splitting the data into training and test sets before vectorizing and scaling, in preparation for modeling

---

In [86]:
subs_final.shape

(12875, 8)

#### Target encoding: skincare_addiction = 1 / asian_beauty = 0

In [87]:
# changing target variable to binary
subs_final['subreddit'] = np.where(subs_final['subreddit'] == 'skincare_addiction', 1, 0)

In [88]:
subs_final.head(1)

Unnamed: 0,subreddit,text,text_length,word_count,neg,pos,neu,compound
0,0,do anyone have experience with sidmool saccharo ferment sparkle first ampoule,83,11,0.0,0.219,0.781,0.4215


In [89]:
subs_final.columns

Index(['subreddit', 'text', 'text_length', 'word_count', 'neg', 'pos', 'neu',
       'compound'],
      dtype='object')

In [90]:
#assign X & y
X=subs_final.drop(columns='subreddit')
y=subs_final['subreddit']

In [91]:
#split data into training & testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=17)
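
The `stratify=y` argument keeps the class proportions roughly equal in both splits. The sketch below illustrates the idea with the standard library only. `stratified_split` is a hypothetical helper, the labels are toy data, and scikit-learn's own allocation of samples per class may differ in detail.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=17):
    """Return (train_idx, test_idx) with each class split at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 60 + [0] * 40                       # toy 60/40 class balance
train_idx, test_idx = stratified_split(labels)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_share, test_share)
```

Because each class is sampled separately, the share of class 1 is the same in both splits here, which is the property the value counts in the baseline section check for the real data.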

In [92]:
print('original df:', subs.shape)
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

original df: (12875, 14)
X_train: (9656, 7)
y_train: (9656,)
X_test: (3219, 7)
y_test: (3219,)


### Baseline Accuracy

In [93]:
y.value_counts(normalize=True)

1    0.606447
0    0.393553
Name: subreddit, dtype: float64

In [94]:
y_train.value_counts(normalize=True)

1    0.606462
0    0.393538
Name: subreddit, dtype: float64
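
The baseline here is the accuracy of always predicting the most frequent class. A minimal sketch of that calculation on toy labels (the `majority_baseline` helper is a name introduced for this example):

```python
from collections import Counter


def majority_baseline(labels):
    """Return (majority_label, accuracy of always predicting that label)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


print(majority_baseline([1, 1, 1, 0, 0]))
```

Applied to `y_train`, the same calculation should give the larger of the two proportions printed above.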

## 4. Running a BayesSearchCV on TfidfVectorizer & CountVectorizer with BernoulliNB
Accidentally used BernoulliNB here, but should have used MultinomialNB
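
The difference matters because BernoulliNB models each feature as present or absent, while MultinomialNB works with the counts themselves. scikit-learn's `BernoulliNB` has a `binarize` parameter (default `0.0`) that thresholds every input feature. The sketch below shows what that thresholding does to one row; `binarize` here is a plain function written for illustration, not the library call.

```python
def binarize(row, threshold=0.0):
    """Map each feature to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]


token_counts = [3, 0, 1, 7]        # toy token counts for one post
passthrough = [523, 92, 0.294]     # toy text_length, word_count, compound values

print(binarize(token_counts))      # multiplicity is discarded
print(binarize(passthrough))       # always-positive columns collapse to 1
```

Under this assumption, how often a token occurs in a post is ignored, and numeric columns that are positive for every row (such as a text length) become a constant and carry no signal for the classifier.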

### BayesSearchCV on CountVectorizer

In [41]:
#make a pipeline for countvectorizer & Bernoulli nb

#instantiate CVEC
cvec = CountVectorizer(stop_words='english',
                       ngram_range=(1,3),
                       token_pattern=r'[\w\’\`\‘\'\´]+')   #keep word characters and apostrophe-like marks within tokens

sc = StandardScaler()

#make column transformer that count-vectorizes just the text column
col_trans = make_column_transformer(
    (cvec, 'text'),
    # (sc, ['text_length', 'word_count', 'neg', 'pos', 'neu','compound']), #do i standardize binary target val?
    remainder='passthrough',
    verbose_feature_names_out=False
)
    
#create a pipeline
bs_pipe = Pipeline([
    ('transformer', col_trans),
    ('bn', BernoulliNB())  # 'mnb', MultinomialNB()
])

# Search over the following values of hyperparameters:
# max_features = Maximum number of features (tokens/n-grams) kept
# min_df = Minimum number of documents a token must appear in to be included
# max_df = Maximum proportion of documents a token may appear in to be included
# ngram_range is fixed at (1,3) in the vectorizer above rather than searched

#pipe params
bs_params = {
    'transformer__countvectorizer__max_features': Integer(1,20000),  # 50000 was originally the best param so re-tuning, 100_000
    'transformer__countvectorizer__min_df': Integer(1,500),     
    'transformer__countvectorizer__max_df': Real(0.50,0.95),
    # 'transformer__countvectorizer__ngram_range': [(1,1), (1,2)]   #need to put up top since I'm using BayesSearchCV
}


#Instantiate BayesSearchCV
bs = BayesSearchCV(
    estimator = bs_pipe,
    search_spaces = bs_params,
    scoring = 'f1_weighted',
    n_iter = 50,
    verbose = 1,
    cv = 5,
    n_jobs=-1
)
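
To see what `min_df` and `max_df` do to the vocabulary, here is a small standard-library sketch of document-frequency filtering. It follows the convention used in the search space above (an integer `min_df` is a document count, a float `max_df` is a fraction of documents). `filter_vocabulary` is a hypothetical helper and the four documents are toy data.

```python
def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep tokens whose document frequency lies within the given bounds.

    min_df is an absolute document count; max_df is a fraction of documents.
    """
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for token in set(doc.split()):             # count each token once per doc
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return sorted(
        token for token, df in doc_freq.items()
        if min_df <= df <= max_df * n_docs
    )


docs = ["skin routine help", "skin mask review", "skin serum routine", "skin toner"]
print(filter_vocabulary(docs))                          # nothing filtered
print(filter_vocabulary(docs, min_df=2, max_df=0.95))   # rare and ubiquitous tokens dropped
```

With `min_df=2` the one-off tokens disappear, and with `max_df=0.95` the token that occurs in every document is removed as well, leaving only `routine`.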

In [42]:
#commenting out so I don't re-run

# #fit on training data
# bs.fit(X_train, y_train)

In [43]:
#commenting out because it's already been pickled

# pickle.dump(bs, open('bs.pkl', 'wb'))

In [44]:
# Loading model to compare the results
bs = pickle.load(open('pickles/bs.pkl','rb'))
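
A small aside on the save/load pattern: opening the file in a `with` block makes sure the handle is closed even if pickling fails. The sketch below round-trips a stand-in object through a temporary directory; the file name `search.pkl` and the `results` dictionary are placeholders, not the fitted search object from this notebook.

```python
import os
import pickle
import tempfile

results = {"name": "demo", "values": [1, 2, 3]}    # stand-in for a fitted object

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "search.pkl")     # hypothetical file name
    with open(path, "wb") as fh:                   # write in binary mode
        pickle.dump(results, fh)
    with open(path, "rb") as fh:                   # read back in binary mode
        restored = pickle.load(fh)

print(restored == results)
```

As usual with pickle, only load files you created yourself, and expect that a pickled estimator may not load cleanly under a different library version.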

In [45]:
bs.estimator.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'transformer', 'bn', 'transformer__n_jobs', 'transformer__remainder', 'transformer__sparse_threshold', 'transformer__transformer_weights', 'transformer__transformers', 'transformer__verbose', 'transformer__verbose_feature_names_out', 'transformer__countvectorizer', 'transformer__standardscaler', 'transformer__countvectorizer__analyzer', 'transformer__countvectorizer__binary', 'transformer__countvectorizer__decode_error', 'transformer__countvectorizer__dtype', 'transformer__countvectorizer__encoding', 'transformer__countvectorizer__input', 'transformer__countvectorizer__lowercase', 'transformer__countvectorizer__max_df', 'transformer__countvectorizer__max_features', 'transformer__countvectorizer__min_df', 'transformer__countvectorizer__ngram_range', 'transformer__countvectorizer__preprocessor', 'transformer__countvectorizer__stop_words', 'transformer__countvectorizer__strip_accents', 'transformer__countvectorizer__token_pattern', 'transformer__co

In [46]:
bs.best_params_

#first time I ran it, I noticed that the max_features is at the top of my range so going to retune

OrderedDict([('transformer__countvectorizer__max_df', 0.95),
             ('transformer__countvectorizer__max_features', 5355),
             ('transformer__countvectorizer__min_df', 1)])

In [47]:
bs.best_score_

#score is lower than when I tried 50_000/100_000, but that just seemed like way too many features

0.7313585875080324

In [48]:
bs.score(X_train, y_train)

0.6783961818918343

In [49]:
bs.score(X_test, y_test)

0.6806236924103521

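#### *The weighted F1 score (about 0.68) is above the majority-class baseline accuracy (about 0.61) and about the same on train and test; since these are different metrics, the comparison is only approximate*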
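
Since the searches above use `scoring='f1_weighted'`, the numbers returned by `.score()` are weighted F1 values rather than accuracies. The sketch below computes a support-weighted F1 from scratch on toy labels to show how it can differ from accuracy; `f1_weighted` is a helper written for this example, not the scikit-learn function.

```python
def f1_weighted(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    total = len(y_true)
    score = 0.0
    for cls in set(y_true):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn                          # true members of this class
        score += f1 * support / total
    return score


y_true = [1, 1, 1, 0, 0]
always_one = [1, 1, 1, 1, 1]                       # majority-class predictor

accuracy = sum(t == p for t, p in zip(y_true, always_one)) / len(y_true)
print(accuracy, f1_weighted(y_true, always_one))
```

On this toy example the majority-class predictor has a higher accuracy than weighted F1, because the class it never predicts contributes an F1 of zero. A fairer baseline for these models would be the weighted F1 of the majority-class predictor on the real labels.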

### BayesSearchCV on TfidfVectorizer

In [50]:
#instantiate TfidfVectorizer
tvec = TfidfVectorizer(stop_words='english',
                       ngram_range=(1,3),
                       token_pattern=r'[\w\’\`\‘\'\´]+')   #keep word characters and apostrophe-like marks within tokens

sc = StandardScaler()

#make column transformer that tf-idf vectorizes the text column and scales the numeric columns
col_trans = make_column_transformer(
    (tvec, 'text'),
    (sc, ['text_length', 'word_count', 'neg', 'pos', 'neu','compound']), #do i standardize binary target val?
    remainder='passthrough',
    verbose_feature_names_out=False
)
    
#create a pipeline
bs2_pipe = Pipeline([
    ('transformer', col_trans),
    ('bn', BernoulliNB())  # 'mnb', MultinomialNB()
])

# Search over the following values of hyperparameters:
# max_features = Maximum number of features (tokens/n-grams) kept
# min_df = Minimum number of documents a token must appear in to be included
# max_df = Maximum proportion of documents a token may appear in to be included
# ngram_range is fixed at (1,3) in the vectorizer above rather than searched

#pipe params
bs2_params = {
    'transformer__tfidfvectorizer__max_features': Integer(1,20000),  # 50000 was originally the best param so re-tuning, 100_000
    'transformer__tfidfvectorizer__min_df': Integer(1,500),     
    'transformer__tfidfvectorizer__max_df': Real(0.50,0.95),
    # 'transformer__tfidfvectorizer__ngram_range': [(1,1), (1,2)]   #need to put up top since I'm using BayesSearchCV
}


#Instantiate BayesSearchCV
bs2 = BayesSearchCV(
    estimator = bs2_pipe,
    search_spaces = bs2_params,
    scoring = 'f1_weighted',
    n_iter = 50,
    verbose = 1,
    cv = 5,
    n_jobs=-1
)

In [51]:
bs2.estimator.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'transformer', 'bn', 'transformer__n_jobs', 'transformer__remainder', 'transformer__sparse_threshold', 'transformer__transformer_weights', 'transformer__transformers', 'transformer__verbose', 'transformer__verbose_feature_names_out', 'transformer__tfidfvectorizer', 'transformer__standardscaler', 'transformer__tfidfvectorizer__analyzer', 'transformer__tfidfvectorizer__binary', 'transformer__tfidfvectorizer__decode_error', 'transformer__tfidfvectorizer__dtype', 'transformer__tfidfvectorizer__encoding', 'transformer__tfidfvectorizer__input', 'transformer__tfidfvectorizer__lowercase', 'transformer__tfidfvectorizer__max_df', 'transformer__tfidfvectorizer__max_features', 'transformer__tfidfvectorizer__min_df', 'transformer__tfidfvectorizer__ngram_range', 'transformer__tfidfvectorizer__norm', 'transformer__tfidfvectorizer__preprocessor', 'transformer__tfidfvectorizer__smooth_idf', 'transformer__tfidfvectorizer__stop_words', 'transformer__tfidfvectorize

In [52]:
#commenting out so I don't rerun

# bs2.fit(X_train, y_train)

In [53]:
#commenting out because it's already been pickled

# pickle.dump(bs2, open('bs2.pkl', 'wb'))

In [54]:
# Loading model to compare the results
bs2 = pickle.load(open('pickles/bs2.pkl','rb'))

In [55]:
bs2.best_params_

OrderedDict([('transformer__tfidfvectorizer__max_df', 0.7353778477590758),
             ('transformer__tfidfvectorizer__max_features', 5479),
             ('transformer__tfidfvectorizer__min_df', 1)])

In [56]:
bs2.best_score_

0.73112820425187

In [57]:
bs2.score(X_train, y_train)

0.6776047290341419

In [58]:
bs2.score(X_test, y_test)

0.6806908459343058

#### *Scores are similar to those from the CountVectorizer search, but the best max_features differs (5479 here vs. 5355)*

## 5. Transform data with TfidfVectorizer using the best params from bs2

In [95]:
X_train.head(5)

Unnamed: 0,text,text_length,word_count,neg,pos,neu,compound
3493,just instal asian beauty shelve yay,46,6,0.0,0.661,0.339,0.8309
1629,mini review on utena premium puresa golden jelly mask collagen,67,11,0.0,0.0,1.0,0.0
8718,routine help incorporate to azelaic acid 10 hello all I have be on a journey to fade red pie mark after a gruesome trentinoin purge recently I decide to give the ordinary azelaic acid 10 suspension a try however I be not sure how to incorporate it into my current routine be cleanse with panoxyl 4 moisturize with cerave be spf 30 to wet skin then apply another layer after it dry pm cleanse with vanicream gentle cleanser apply neutrogena hydroboost to wet skin apply cerave pm while skin be still wet from prior step apply cerave resurfacing retinol every other night I would like to use the azelaic acid twice a day but I be open to any suggestion thank,707,122,0.013,0.067,0.921,0.7477
3403,what be the good japanese anti aging product my friend be go to japan and have generously offer to bring me back some souvenir be there any anti aging miracle I should be ask for crossposte to r skincareaddiction and r 30plusskincare,252,39,0.0,0.237,0.763,0.89
9054,routine help moisturize question hey I just get into skincare recently and buy some product I have a question about moisturizer if I put on a moisturize in the morning how long will I need to wait for it to absorb so I can wash my face without remove it,272,51,0.0,0.0,1.0,0.0


In [96]:
#instantiate TfidfVectorizer with the best params found by bs2
tfidvec = TfidfVectorizer(
    max_df=0.7353778477590758,
    max_features=5479,  #BayesSearchCV on the tf-idf pipeline returned this as the optimal value
    min_df=1,
    stop_words='english',
    ngram_range=(1,3),
    token_pattern=r'[\w\’\`\‘\'\´]+'
)

In [97]:
#fit & transform the training corpus
#convert to a df since the result is returned as a sparse matrix
X_train_vec = pd.DataFrame(
    tfidvec.fit_transform(X_train['text']).todense(),
    columns = tfidvec.get_feature_names_out(),
    index = X_train.index)

In [98]:
X_train_vec

Unnamed: 0,',0,0 025,0 05,0 1,0 25,0 3,0 5,00,01,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
3493,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1629,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8718,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3403,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9054,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9802,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7532,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
130,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [99]:
#transform the test corpus with the vectorizer fitted on the training data (no refit)
#convert to a df since the result is returned as a sparse matrix
X_test_vec = pd.DataFrame(
    tfidvec.transform(X_test['text']).todense(),
    columns = tfidvec.get_feature_names_out(),
    index = X_test.index)

X_test_vec.head()

Unnamed: 0,',0,0 025,0 05,0 1,0 25,0 3,0 5,00,01,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
3030,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10355,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9394,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Concatenate vectorized df with original dfs (Train/Test)

In [100]:
#concatenate X_train & X_test with vectorized
X_train2 = pd.concat([X_train, X_train_vec], axis = 1)
X_test2 = pd.concat([X_test, X_test_vec], axis = 1)
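
`pd.concat(..., axis=1)` lines rows up by index label, which is why the vectorized frames were built with the original indexes. One thing worth checking afterwards is duplicate column labels: if a token such as `text` made it into the vocabulary, it would collide with the original `text` column. The sketch below uses made-up miniature frames; whether such a collision occurs in the real data is not shown in this notebook.

```python
import pandas as pd

# Hypothetical miniature versions of X_train and X_train_vec.
meta = pd.DataFrame(
    {"text": ["acne help", "text post"], "word_count": [2, 2]},
    index=[3493, 1629],
)
tfidf = pd.DataFrame(
    {"acne": [0.7, 0.0], "help": [0.7, 0.0], "post": [0.0, 0.7], "text": [0.0, 0.7]},
    index=[3493, 1629],
)

# Rows are matched on index labels, not on position.
combined = pd.concat([meta, tfidf], axis=1)

# Column labels that appear more than once after concatenation.
dupes = combined.columns[combined.columns.duplicated()].tolist()
print(dupes)
```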

In [101]:
X_train2

Unnamed: 0,text,text_length,word_count,neg,pos,neu,compound,',0,0 025,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
3493,just instal asian beauty shelve yay,46,6,0.000,0.661,0.339,0.8309,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1629,mini review on utena premium puresa golden jelly mask collagen,67,11,0.000,0.000,1.000,0.0000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8718,routine help incorporate to azelaic acid 10 hello all I have be on a journey to fade red pie mark after a gruesome trentinoin purge recently I decide to give the ordinary azelaic acid 10 suspension a try however I be not sure how to incorporate it into my current routine be cleanse with panoxyl 4 moisturize with cerave be spf 30 to wet skin then apply another layer after it dry pm cleanse with vanicream gentle cleanser apply neutrogena hydroboost to wet skin apply cerave pm while skin be still wet from prior step apply cerave resurfacing retinol every other night I would like to use the azelaic acid twice a day but I be open to any suggestion thank,707,122,0.013,0.067,0.921,0.7477,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3403,what be the good japanese anti aging product my friend be go to japan and have generously offer to bring me back some souvenir be there any anti aging miracle I should be ask for crossposte to r skincareaddiction and r 30plusskincare,252,39,0.000,0.237,0.763,0.8900,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9054,routine help moisturize question hey I just get into skincare recently and buy some product I have a question about moisturizer if I put on a moisturize in the morning how long will I need to wait for it to absorb so I can wash my face without remove it,272,51,0.000,0.000,1.000,0.0000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9889,misc friend say washing face be like waterboarde finally convince him to put on a moisturizer at night his worry be that bacteria would feed on the moisture and oil on his newly non desert skin and because acne ok maybe you have a point despite you never touch your face there be not any acne then just splash with water in the morning to get rid of the moisturizer I guess but that be like waterboarde eventually I stop laugh I will tell him to get another pillowcase he can swap out to address his bacteria worry it be just that his face be like sandpaper and kind of splotchy red I think the good course of action be moisturize and gentle exfoliation,693,119,0.057,0.188,0.755,0.9610,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9802,can someone tell me how to get rid of this warn acne pic,66,13,0.000,0.000,1.000,0.0000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7532,skin concern if you see me on the beach with this will you get disgust beach and watersport mean wear bikini all the time I be extremely embarrass about my dark butt area will this ever go away any product recom I be thai by the way,249,45,0.135,0.000,0.865,-0.7739,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
130,new beauty of joseon ginseng cleanse oil,46,7,0.000,0.405,0.595,0.6239,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [102]:
X_test2

Unnamed: 0,text,text_length,word_count,neg,pos,neu,compound,',0,0 025,...,yul,zero,zinc,zinc 1,zinc oxide,zinc pyrithione,zit,zone,zone dry,zone oily
3030,discussion update on purchase from sweet corea so I just want to post an update since I just receive my sweet corea order I pick the cheap ship option at around 11 dollar item ship on the 10th I receive it on the 16th that be so insane I have never have any overseas order arrive that fast some domestic us package do not even arrive in that short of time I be worry that the product be not authentic because they we be so cheap but I have absolutely no concern they be fake in fact the ljh ampoule have the same lot number and expiration date as the one I just order from target packaging look exactly the same with the same texture clear sticker seal the item and inside information product smell the same and feel the same I do get kind of a skimpy sample of just one small packet but oh well this be such a steal at 9 dollar a bottle I be giddy the only thing I wish be that I order more item I think this will be my go to shop for k beauty I be very happy with my experience thank you so much for your advice guy,1085,204,0.091,0.134,0.774,0.8915,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8022,acne I be a tran guy on testosterone and I need help get rid of my acne,79,16,0.000,0.172,0.828,0.4019,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10355,the secret to be a decisive person,40,7,0.000,0.275,0.725,0.2263,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,klavuu pure pearlsation micro collagen cleanse water,59,7,0.000,0.000,1.000,0.0000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9394,acne my honest story on acne acne scarring and psychological advice,73,11,0.000,0.248,0.752,0.5106,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11002,acne my face be an acne infest flaky dry mess and I have no idea what to do for about two month now I have be use proactiv twice daily and have be take a pill every morning it help somewhat I feel like if I do not use it at all my face would be even bad but clearly it be not work as well as I want it to my nose have be a mess for almost a year now I have about two day of relatively clear skin by my standard last week but then I have five whitehead all pop up at once despite twice daily treatment sometimes I feel like all proactiv do be make my face really dry and flaky do anybody have any advice I mean I be not sure what else I could do really here be some picture if that help thank http iob imgur com uygg vfivfnbnny,776,148,0.084,0.131,0.785,0.8532,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7295,do anyone know what this be just appear randomly,56,9,0.000,0.000,1.000,0.0000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7649,routine how to exfoliate on water only routine,52,8,0.000,0.000,1.000,0.0000,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2070,email correspondence with iunik their entire line be vegan apart from those mention,92,14,0.000,0.188,0.812,0.4588,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Dropping the `'` column: a stray apostrophe token that pre-processing did not remove

In [103]:
# drop the stray apostrophe column from both sets
X_train2.drop(columns="'", inplace=True)
X_test2.drop(columns="'", inplace=True)
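
The `'` column most likely exists because the custom `token_pattern` includes apostrophe characters in its character class, so a free-standing apostrophe counts as a token on its own. A small sketch with the standard `re` module, using a made-up sample string:

```python
import re

# Same pattern that is passed as token_pattern to TfidfVectorizer in this notebook.
token_pattern = r'[\w\’\`\‘\'\´]+'

# Hypothetical sample text containing a free-standing apostrophe.
sample = "my skin ' s routine don't work"

tokens = re.findall(token_pattern, sample)
print(tokens)
```

The lone `'` comes back as its own token, while `don't` stays in one piece.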

In [104]:
#export datasets to csv
X_train2.to_csv('cleaned_datasets/Modeling/X_train.csv')
X_test2.to_csv('cleaned_datasets/Modeling/X_test.csv')
y_train.to_csv('cleaned_datasets/Modeling/y_train.csv')
y_test.to_csv('cleaned_datasets/Modeling/y_test.csv')
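
Because `to_csv` writes the index by default, reading these files back with a plain `pd.read_csv` produces an extra `Unnamed: 0` column. Passing `index_col=0` restores the original index instead. A minimal sketch with an in-memory buffer and a made-up frame:

```python
import io
import pandas as pd

# Hypothetical two-row frame with the same kind of non-sequential index.
frame = pd.DataFrame({"word_count": [6, 11]}, index=[3493, 1629])

buffer = io.StringIO()
frame.to_csv(buffer)  # index is written as an unnamed first column

buffer.seek(0)
default_read = pd.read_csv(buffer)               # index comes back as a data column
buffer.seek(0)
indexed_read = pd.read_csv(buffer, index_col=0)  # first column becomes the index again

print(default_read.columns.tolist())
print(indexed_read.index.tolist())
```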

### 6. Additional EDA after TfidfVectorizing the Combined Dataset
After settling on the TfidfVectorizer, I vectorized the combined dataset to look at the top words, as I did earlier with the CountVectorizer.

In [69]:
words = pd.read_csv('cleaned_datasets/subs_cleaned_forCVEC.csv')
words.head(2)

Unnamed: 0,subreddit,text,text_length,word_count,neg,pos,neu,compound
0,asian_beauty,do anyone have experience with sidmool saccharo ferment sparkle first ampoule,83,11,0.0,0.219,0.781,0.4215
1,asian_beauty,love my hair color blend nicely with my gray,51,9,0.0,0.514,0.486,0.8122


In [70]:
#instantiate TfidfVectorizer with the best params found by bs2
tfidvec = TfidfVectorizer(
    max_df=0.7353778477590758,
    max_features=5479,  #BayesSearchCV on the tf-idf pipeline returned this as the optimal value
    min_df=1,
    stop_words='english',
    ngram_range=(1,3),
    token_pattern=r'[\w\’\`\‘\'\´]+'
)

In [73]:
#fit & transform the combined corpus
#convert to a df since the result is returned as a sparse matrix
df_tfidvec = pd.DataFrame(
    tfidvec.fit_transform(words['text']).todense(),
    columns = tfidvec.get_feature_names_out(),
    index = words.index)

In [74]:
#create a df of top 50 words
top1 = pd.DataFrame(df_tfidvec.sum().sort_values(ascending=False)[:50])

In [75]:
#create a df of the next 51 words (ranks 51-101; the slice end is exclusive)
top2 = pd.DataFrame(df_tfidvec.sum().sort_values(ascending=False)[50:101])
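
Summing each column gives one total tf-idf weight per term, and sorting those totals ranks the vocabulary. The sorted totals only need to be computed once and can then be sliced into bands. Python slices exclude the end position, so `[:50]` and `[50:100]` would split the first 100 terms without overlap. A sketch on a made-up three-term frame:

```python
import pandas as pd

# Hypothetical tf-idf weights: one row per document, one column per term.
weights = pd.DataFrame(
    {"skin": [0.5, 0.4, 0.0], "acne": [0.0, 0.3, 0.4], "zinc": [0.1, 0.0, 0.0]}
)

# Total weight per term, highest first (computed once, then sliced).
totals = weights.sum().sort_values(ascending=False)

first_band = totals.iloc[:2]  # positions 0 and 1
second_band = totals.iloc[2:]  # everything after that

print(first_band.index.tolist())
print(second_band.index.tolist())
```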

In [76]:
#dropping words that are not helpful to visualization
wordstodrop = ['use', 'try', 'like', 'just', 'know', 'look', 'good', 'make', 'question', 'really', 'year', 'start', 'time', 'think', 'want','thank','rid', 'new',
               'feel', 'work','need', 'month', 'buy', 'bad', 'week', 'com', 'discussion']
top1.drop(wordstodrop, axis=0, inplace=True)

In [77]:
top1

Unnamed: 0,0
skin,552.106765
product,380.249843
acne,331.019938
help,307.873206
routine,273.302439
face,259.521521
sunscreen,185.623457
concern,180.146262
cream,174.415852
skin concern,171.992897


In [78]:
#dropping words that are not helpful to visualization
wtodrop = ['2', 'care','come','comment','water','love','apply','clear','thing', 'lot', 'break', '3', 'wonder', 'long', '1','post','right', 'sure']
top2.drop(wtodrop,axis=0, inplace=True)
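
`drop` raises a `KeyError` if any label in the list is missing from the index, which can happen when the vectorizer settings change and a word falls out of the top band. Passing `errors='ignore'` skips missing labels instead. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical slice of the top-words frame.
top = pd.DataFrame({0: [552.1, 380.2, 120.0]}, index=["skin", "product", "use"])

# 'try' is deliberately absent from the index.
to_drop = ["use", "try"]

# errors='ignore' removes the labels that exist and skips the rest.
filtered = top.drop(index=to_drop, errors="ignore")
print(filtered.index.tolist())
```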

In [79]:
top2

Unnamed: 0,0
makeup,106.534224
red,106.299603
oily,106.177416
spot,105.612953
beauty,102.874998
product question,102.254347
scar,101.554261
pimple,100.945461
advice,97.553396
wash,97.386282


In [80]:
#concat these dfs back together to create a topwords df
topwords = pd.concat([top1, top2])

In [81]:
topwords

Unnamed: 0,0
skin,552.106765
product,380.249843
acne,331.019938
help,307.873206
routine,273.302439
face,259.521521
sunscreen,185.623457
concern,180.146262
cream,174.415852
skin concern,171.992897


In [82]:
topwords.rename(columns={0: 'Weight'}, inplace=True)

In [83]:
topwords = topwords.reset_index()
topwords

Unnamed: 0,index,Weight
0,skin,552.106765
1,product,380.249843
2,acne,331.019938
3,help,307.873206
4,routine,273.302439
5,face,259.521521
6,sunscreen,185.623457
7,concern,180.146262
8,cream,174.415852
9,skin concern,171.992897
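
The rename and reset steps can also be chained, and naming the index first gives the word column a more descriptive label than `index`. A sketch under the assumption that `word` is an acceptable column name for the later visualization (the notebook itself keeps `index`):

```python
import pandas as pd

# Hypothetical two-row version of the topwords frame before renaming.
topwords = pd.DataFrame({0: [552.1, 380.2]}, index=["skin", "product"])

tidy = (
    topwords.rename(columns={0: "Weight"})  # name the weight column
    .rename_axis("word")                    # name the index before resetting it
    .reset_index()                          # turn the index into a regular column
)

print(tidy.columns.tolist())
```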


In [84]:
topwords.head(2)

Unnamed: 0,index,Weight
0,skin,552.106765
1,product,380.249843


In [85]:
#export to csv for data visualization later
topwords.to_csv('cleaned_datasets/topwords.csv')