# Project 3 - Web APIs & NLP

## Notebook Summary
---
This notebook covers the problem statement, background research, and initial data cleaning.

## Problem Statement
---

A new company is preparing to enter the cosmetics market with a focus on skincare. It wants insight into current skincare trends to decide which type of product to develop first, and it wants additional data to support customer segmentation and its paid search strategy. I am tasked with analyzing two subreddits, AsianBeauty and SkincareAddiction. Based on the text data, which ingredients and skin-related issues are top of mind for consumers? Are there differences in text attributes between the two subreddits? I will use NLP techniques to build and evaluate binary classification models that predict which subreddit a post came from.

## Background & Outside Research
---

**McKinsey - 'Taking a good look at the beauty industry'** ([*source*](https://www.mckinsey.com/industries/retail/our-insights/taking-a-good-look-at-the-beauty-industry))
- Sophie Marchessou: It’s important to define what it is, as a retailer or beauty brand, that you want to stand for and what consumer experience you want to provide—and stick to it. The answer doesn’t need to be the same for everyone. There are, depending on your customer targets, features that might be more or less relevant, so it’s not about going after the gimmicky things and having technology enhancements in the store just for the sake of having them. It’s about figuring out, in the consumer journey, what are potential pain points? And how do you then say, “Those three things I’ll prioritize. That will be how I deliver this omnichannel experience.” Then, make sure you trickle that down through the organization so that not just your digital team but also your store team is aware of the experience you want to provide, and explain why it matters.

**NPD - 'Empowered Consumers Want Clean Ingredients and Brand Transparency from Skincare Products'** ([*source*](https://www.npd.com/news/press-releases/2019/empowered-consumers-want-clean-ingredients-and-brand-transparency-from-skincare-products/))
- “These engaged consumers are looking to become more educated about the ingredients in their skincare regimen, particularly in those more basic products such as cleansers, moisturizers and anti-aging serums,” stated Larissa Jensen, executive director and beauty industry analyst, The NPD Group.
- In fact, 46 percent of facial skincare users report purchasing products free of sulfates, phthalates and/or gluten, representing a 6 point up-tick over the past two years. In addition, more than half of women look for skincare products made from organic ingredients. The report also found that brands making a public commitment to ingredient transparency have become top-of-mind for consumers, with several of the more well-known transparent brands ranking among the Top 25 in highest awareness-to-purchase conversions.

**NPD - 'More U.S. Women Are Using Facial Skincare Products Today, Reports The NPD Group'** ([*source*](https://www.npd.com/news/press-releases/2020/more-u-s-women-are-using-facial-skincare-products-today-reports-the-npd-group/))
- Overall, close to 40% of facial skincare users report using their products more often today. Usage of basic care products such as cleansers and moisturizers, and treatments including exfoliators/scrubs and masks saw the most significant increases since last year.
- “Using an average of five products daily, consumers are committed to their baseline facial skincare routine, which includes a combination of basic care and targeted treatments. The effects of COVID-19, including spending more time at home, have brought a greater focus on self-care, and skincare has reaped the benefits.”
- Core skincare product sales, including facial cleansers, creams, and serums, grew between 15% and 24%, versus 2020. Sales of targeted products, like eye and lip treatments, also increased. Clinical skincare brands contributed the highest revenue gains to the category. In 2021, clinical surpassed natural as the largest brand type in skincare, based on revenue.

**Statista - 'Cosmetics industry - Statistics & Facts'** ([*source*](https://www.statista.com/topics/3137/cosmetics-industry/#topicOverview))
- Skincare was the leading category, accounting for about 42 percent of the global cosmetics market (haircare, makeup). 

**Cleveland Clinic - 'Top Skincare Ingredients'** ([*source*](https://health.clevelandclinic.org/skin-care-ingredients-explained/))
- Alpha-hydroxy acids (AHA) - creams, lotions to reduce fine lines and pigmentation
- Glycolic acid - exfoliators
- Lactic acid - exfoliators, moisturizers
- Beta hydroxy acids (salicylic acid) - pore perfectors with BHA
- Kojic acid - pigment treatment and age spots
- Retinol - improve acne and scarring
- L-ascorbic acid (vitamin C)
- Hyaluronic acid 
- Niacinamide (vitamin B3)
- Dimethicone - second most common ingredient in moisturizers
- Copper peptide
- Glycerin - lip balms or face creams

**Allure - Additional Ingredients** ([*source*](https://www.allure.com/story/how-to-cocktail-skin-care-ingredients))
- SPF
- beta hydroxy acids (BHAs)
- ceramides

### Datasets
---

This project uses two datasets, both included in the [`datasets`](./datasets/) folder.
The data was scraped from Reddit via Pushshift's API. See the Jupyter notebook '00-Project3-Pushshift' for details on scraping data from subreddits.

* [`raw_asianbeauty_initial_scrape.csv`](./datasets/raw_asianbeauty_initial_scrape.csv): raw data collected from the 'AsianBeauty' subreddit
* [`raw_skincare_initial_scrape.csv`](./datasets/raw_skincare_initial_scrape.csv): raw data collected from the 'SkincareAddiction' subreddit

--- 
# Part 1 - EDA & Data Cleaning

---

In [1]:
# import packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import warnings
warnings.filterwarnings('ignore')

In [2]:
# read in the data
beauty_df = pd.read_csv('datasets/raw_asianbeauty_initial_scrape.csv')
beauty_df.head()

Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,...,crosspost_parent,crosspost_parent_list,steward_reports,removed_by,updated_utc,rte_mode,brand_safe,approved_at_utc,banned_at_utc,author_created_utc
0,0,[],False,doyoufeelspeciallmao,,[],,text,t2_75epmpji,False,...,,,,,,,,,,
1,1,[],False,horny_twink_bottomm,,[],,text,t2_qur9fy2h,False,...,,,,,,,,,,
2,2,[],False,Missy_Pantone,,[],,text,t2_6mc5kcji,False,...,,,,,,,,,,
3,3,[],False,H-yaRayPark,,[],,text,t2_ceyhuule,False,...,,,,,,,,,,
4,4,[],False,acaipie,,[],,text,t2_3006tlol,False,...,,,,,,,,,,


In [3]:
# read in the data
skincare_df = pd.read_csv('datasets/raw_skincare_initial_scrape.csv')
skincare_df.head()

Unnamed: 0.1,Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,...,updated_utc,gilded,rte_mode,brand_safe,distinguished,approved_at_utc,author_created_utc,banned_at_utc,mod_reports,user_reports
0,0,[],False,yuura1,notag,[],4d48f1aa-6c35-11e9-81f1-0acf30770a48,,dark,text,...,,,,,,,,,,
1,1,[],False,SKKKKRRT2,,[],,,,text,...,,,,,,,,,,
2,2,[],False,goldpunch,notag,[],4d48f1aa-6c35-11e9-81f1-0acf30770a48,,dark,text,...,,,,,,,,,,
3,3,[],False,Levangeline,notag,[],4d48f1aa-6c35-11e9-81f1-0acf30770a48,,dark,text,...,,,,,,,,,,
4,4,[],False,snd1zzi,notag,[],4d48f1aa-6c35-11e9-81f1-0acf30770a48,,dark,text,...,,,,,,,,,,


In [4]:
# keep only the columns needed; .copy() makes each subset independent of the raw frame
beauty = beauty_df[['author','selftext','title']].copy()
skincare = skincare_df[['author','selftext','title']].copy()
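
Selecting columns from a DataFrame returns a subset that pandas may treat as a slice of the original, and modifying that subset later can raise a `SettingWithCopyWarning`. The sketch below uses a small made-up frame (`raw`, its columns, and its values are hypothetical, not the scraped data) to show one way to take an explicit copy and then fill missing values without in-place edits:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the scraped data; the real frames have many more columns.
raw = pd.DataFrame({
    'author': ['user_a', 'user_b'],
    'selftext': ['some body text', np.nan],
    'title': ['first title', 'second title'],
    'score': [1, 2],
})

# .copy() makes the subset an independent frame, so later edits do not touch `raw`.
subset = raw[['author', 'selftext', 'title']].copy()

# fillna returns a new frame; reassigning avoids in-place edits on a slice.
subset = subset.fillna('')

print(subset['selftext'].tolist())
```

With an independent copy, the blanket `warnings.filterwarnings('ignore')` is less likely to be hiding anything relevant.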

In [5]:
print(beauty.shape)
print(skincare.shape)

(50076, 3)
(50090, 3)


In [6]:
skincare.head()

Unnamed: 0,author,selftext,title
0,yuura1,"Hello, i have self harm scars that are not dee...",[Product request] self harm scar help
1,SKKKKRRT2,,"29 y.o F, was on Orilissa, a GnRH antagonist. ..."
2,goldpunch,I am searching vitamin v serums right now. I s...,[Product Question] liposamal vitamin c vs asco...
3,Levangeline,[removed],"My skin, fresh out of the shower. Shiny, red n..."
4,snd1zzi,,can anyone tell me what these bumps are on my ...


In [7]:
beauty.head()

Unnamed: 0,author,selftext,title
0,doyoufeelspeciallmao,[removed],I love PyunKang Yul but the amount of products...
1,horny_twink_bottomm,[removed],i left my cicaplast baume b5+ in my bathroom. ...
2,Missy_Pantone,,Does anyone have experience with Sidmool Sacch...
3,H-yaRayPark,[removed],Beauty of joseon sunscreen made me purple when...
4,acaipie,[removed],"need sunscreen help! (BoJ, biore uv didn’t wor..."


### Checking for Nulls

In [8]:
skincare.isna().sum()

author         0
selftext    8442
title          0
dtype: int64

In [9]:
beauty.isna().sum()

author          0
selftext    20434
title           0
dtype: int64

In [10]:
# replace NaNs with empty strings (reassign instead of editing a slice in place)
skincare = skincare.fillna('')
beauty = beauty.fillna('')

### Removing posts that were removed by Reddit or a moderator

In [11]:
# drop rows where selftext contains '[removed]' (regex=False treats the brackets literally)
skincare = skincare[~skincare['selftext'].str.contains('[removed]', regex=False)]

# drop rows where selftext contains '[deleted]'
skincare = skincare[~skincare['selftext'].str.contains('[deleted]', regex=False)]

In [12]:
# drop rows where selftext contains '[removed]'
beauty = beauty[~beauty['selftext'].str.contains('[removed]', regex=False)]

# drop rows where selftext contains '[deleted]'
beauty = beauty[~beauty['selftext'].str.contains('[deleted]', regex=False)]
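
Whether to drop posts whose body merely contains a placeholder or only those whose body is exactly the placeholder is a judgment call. The sketch below, on a made-up frame (`posts` and its values are hypothetical), compares an exact match via `isin` with a literal substring match via `str.contains(..., regex=False)`:

```python
import pandas as pd

# Hypothetical sample; the real data comes from the scraped CSVs.
posts = pd.DataFrame({
    'selftext': ['[removed]', 'My routine is simple.', '[deleted]', ''],
    'title': ['a', 'b', 'c', 'd'],
})

placeholders = ['[removed]', '[deleted]']

# Exact match: drops rows whose whole body is one of the placeholders.
exact = posts[~posts['selftext'].isin(placeholders)]

# Literal substring match: regex=False means '[' and ']' need no escaping.
mask = (
    posts['selftext'].str.contains('[removed]', regex=False)
    | posts['selftext'].str.contains('[deleted]', regex=False)
)
substring = posts[~mask]

print(exact['title'].tolist(), substring['title'].tolist())
```

On this sample both approaches keep the same rows; they differ only when a genuine post happens to quote a placeholder inside longer text.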

In [13]:
print(skincare.shape)
print(beauty.shape)

(46660, 3)
(44743, 3)


### Removing posts by the moderators

In [14]:
#list of mods for each sub

beauty_mods = ['thecakepie', 'AutoModerator', 'AsianBeautyMod', 'Ronrinesu', 'kitty_paw', 'fjordling_', 'beelzeybob',
               'justherefortheAB', 'Khalano', 'weavesunlight']

skincare_mods = ['buttermilk_biscuit', '_ihavemanynames_', 'TertiaryPumpkin', 'sunscreenpuppy',
                 'ScA_Bot', 'ScAModerator', 'AutoModerator', 'mastiii',
                 'jasminekitten02', 'mayamys']

In [15]:
beauty = beauty[~beauty['author'].isin(beauty_mods)]
skincare = skincare[~skincare['author'].isin(skincare_mods)]
print(beauty.shape)
print(skincare.shape)

(44046, 3)
(46634, 3)


### Create a new column that concatenates title & selftext
I chose to do this because in some posts the title itself is the entire content of the post.

In [16]:
beauty['text'] = beauty['title'] + ' - ' + beauty['selftext']
skincare['text'] = skincare['title'] + ' - ' + skincare['selftext']

In [17]:
# strip the trailing ' - ' left behind when selftext is empty
# note: rstrip(' - ') removes any run of trailing spaces and hyphens, not only the literal separator
beauty['text_cleaned'] = [txt.rstrip(' - ') for txt in beauty['text']]
skincare['text_cleaned'] = [txt.rstrip(' - ') for txt in skincare['text']]
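
`str.rstrip` treats its argument as a set of characters rather than a suffix, so it can remove more than the separator when a title itself ends in a hyphen. The sketch below contrasts it with a small helper that removes exactly one trailing separator (`strip_separator` and the sample strings are hypothetical, for illustration only):

```python
def strip_separator(text: str, sep: str = ' - ') -> str:
    """Remove one trailing separator if present; leave everything else alone."""
    return text[:-len(sep)] if text.endswith(sep) else text


# title + ' - ' + '' for a title-only post
title_only = 'Love my hair color! - '
# a title that itself ends with a hyphen, followed by the separator
hyphen_title = 'My skin before - and after - - '

print(repr(title_only.rstrip(' - ')))        # separator removed
print(repr(strip_separator(title_only)))     # same result here
print(repr(hyphen_title.rstrip(' - ')))      # also eats the title's own hyphen
print(repr(strip_separator(hyphen_title)))   # keeps the title's own hyphen
```

On Python 3.9 and later, `str.removesuffix(' - ')` behaves like the helper.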

### Drop duplicate submissions

In [18]:
skincare.drop_duplicates(subset='text', inplace=True)
skincare.shape

(8186, 5)

In [19]:
beauty.drop_duplicates(subset='text', inplace=True)
beauty.shape

(5897, 5)

### Filter out texts that are very short

#### Create text_length column

In [20]:
# create a new column with the character length of each post
# note: computed on 'text', so title-only posts include the 3-character ' - ' separator
beauty['text_length'] = [len(st) for st in beauty['text']]
skincare['text_length'] = [len(st) for st in skincare['text']]

In [21]:
beauty.head(1)

Unnamed: 0,author,selftext,title,text,text_cleaned,text_length
2,Missy_Pantone,,Does anyone have experience with Sidmool Sacch...,Does anyone have experience with Sidmool Sacch...,Does anyone have experience with Sidmool Sacch...,83


In [22]:
skincare.head(1)

Unnamed: 0,author,selftext,title,text,text_cleaned,text_length
0,yuura1,"Hello, i have self harm scars that are not dee...",[Product request] self harm scar help,"[Product request] self harm scar help - Hello,...","[Product request] self harm scar help - Hello,...",505


In [23]:
# keep only posts longer than 30 characters
beauty = beauty[beauty['text_length'] > 30]
beauty.shape

(5362, 6)

In [24]:
skincare = skincare[skincare['text_length'] > 30]
skincare.shape

(7879, 6)

### Drop unnecessary columns

In [25]:
beauty.drop(columns = ['selftext', 'title', 'text'], inplace=True)
skincare.drop(columns = ['selftext', 'title', 'text'], inplace=True)

In [26]:
#rename text_cleaned col
beauty.rename(columns={'text_cleaned': 'text'}, inplace=True)
skincare.rename(columns={'text_cleaned': 'text'}, inplace=True)

In [27]:
beauty.head()

Unnamed: 0,author,text,text_length
2,Missy_Pantone,Does anyone have experience with Sidmool Sacch...,83
6,Rntonie35,Love my hair color! Blended nicely with my grays,51
8,flckeringfox_,What’s your best eye cream to brighten the area?,51
19,DefinitionAdvanced25,ASMR FLAWLESS SKIN KOREAN SKINCARE ROUTINE,45
22,etoileneha,Has anyone tried the Beauty of Joseon - Red Be...,73


In [28]:
skincare.head()

Unnamed: 0,author,text,text_length
0,yuura1,"[Product request] self harm scar help - Hello,...",505
1,SKKKKRRT2,"29 y.o F, was on Orilissa, a GnRH antagonist. ...",76
2,goldpunch,[Product Question] liposamal vitamin c vs asco...,241
4,snd1zzi,can anyone tell me what these bumps are on my ...,54
5,Prestigious-Ad5884,[routine help] should I use retinol? Im 14. I ...,104


#### Create word_count column

In [29]:
# create a column with the number of words in each post
beauty['word_count'] = [len(st.split()) for st in beauty['text']]
beauty.head()

Unnamed: 0,author,text,text_length,word_count
2,Missy_Pantone,Does anyone have experience with Sidmool Sacch...,83,11
6,Rntonie35,Love my hair color! Blended nicely with my grays,51,9
8,flckeringfox_,What’s your best eye cream to brighten the area?,51,9
19,DefinitionAdvanced25,ASMR FLAWLESS SKIN KOREAN SKINCARE ROUTINE,45,6
22,etoileneha,Has anyone tried the Beauty of Joseon - Red Be...,73,13


In [30]:
skincare['word_count'] = [len(st.split()) for st in skincare['text']]
skincare.head()

Unnamed: 0,author,text,text_length,word_count
0,yuura1,"[Product request] self harm scar help - Hello,...",505,100
1,SKKKKRRT2,"29 y.o F, was on Orilissa, a GnRH antagonist. ...",76,13
2,goldpunch,[Product Question] liposamal vitamin c vs asco...,241,45
4,snd1zzi,can anyone tell me what these bumps are on my ...,54,11
5,Prestigious-Ad5884,[routine help] should I use retinol? Im 14. I ...,104,19


### Create column indicating subreddit name

In [31]:
skincare['subreddit'] = 'skincare_addiction'
beauty['subreddit'] = 'asian_beauty'

### Export data to csv

In [32]:
#export initially cleaned dataset to csv

beauty.to_csv('cleaned_datasets/beauty_cleaned.csv', index = False)
skincare.to_csv('cleaned_datasets/skincare_cleaned.csv', index = False)

### Combine datasets into 1 dataframe

In [33]:
subs_combined = pd.concat([beauty, skincare], axis=0)
subs_combined.reset_index(inplace=True)

In [34]:
subs_combined.head()

Unnamed: 0,index,author,text,text_length,word_count,subreddit
0,2,Missy_Pantone,Does anyone have experience with Sidmool Sacch...,83,11,asian_beauty
1,6,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty
2,8,flckeringfox_,What’s your best eye cream to brighten the area?,51,9,asian_beauty
3,19,DefinitionAdvanced25,ASMR FLAWLESS SKIN KOREAN SKINCARE ROUTINE,45,6,asian_beauty
4,22,etoileneha,Has anyone tried the Beauty of Joseon - Red Be...,73,13,asian_beauty


In [35]:
#drop index column
subs_combined.drop(columns='index', inplace=True)
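
The reset-then-drop sequence above can also be expressed in one step. As a sketch on made-up frames (`left` and `right` are hypothetical stand-ins for the two subreddit frames), `pd.concat` accepts `ignore_index=True`, which discards the old row labels and numbers the result from 0:

```python
import pandas as pd

# Hypothetical frames with non-contiguous indices, as left behind by earlier filtering.
left = pd.DataFrame({'text': ['a', 'b']}, index=[2, 6])
right = pd.DataFrame({'text': ['c']}, index=[0])

# ignore_index=True builds a fresh 0..n-1 index, so no 'index' column is created.
combined = pd.concat([left, right], axis=0, ignore_index=True)

print(combined.index.tolist())
print(combined.columns.tolist())
```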

### Filter out non-english rows

In [38]:
# !pip install spacy-langdetect

In [37]:
import spacy
from spacy_langdetect import LanguageDetector

In [39]:
# !pip install langdetect

In [40]:
from langdetect import detect

In [41]:
# detect() raises an error on some rows; iterate to record which row indices fail
from langdetect.lang_detect_exception import LangDetectException

lst = []
for idx, row in subs_combined.iterrows():
    try:
        detect(row['text'])
    except LangDetectException:
        lst.append(idx)

In [42]:
#list of indices to remove from dataset
lst

[102, 433, 446, 1075, 1464, 1882, 2575, 2736, 12862]

In [43]:
#checking the rows that need to be dropped
subs_combined.iloc[433]

author                                            changewithnoor
text           𝙐𝙨𝙚𝙧 𝙚𝙭𝙥𝙚𝙧𝙞𝙖𝙣𝙘𝙚 𝙖𝙣𝙙 𝙘𝙤𝙢𝙥𝙖𝙧𝙞𝙨𝙤𝙣 𝙗𝙚𝙩𝙬𝙚𝙚𝙣 𝙆𝙤𝙨𝙚 𝙎𝙤...
text_length                                                  128
word_count                                                    17
subreddit                                           asian_beauty
Name: 433, dtype: object

#### Drop the rows at the indices listed above

In [44]:
subs_combined.drop(index=lst, inplace=True)
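
The two passes above (one to find failing rows, one to label languages) could be folded into a single pass by wrapping the detector so that failures return `None`. The sketch below is an assumption-laden illustration: `safe_detect` and `toy_detector` are hypothetical names, and the toy detector only stands in for `langdetect.detect` so the example runs without that package installed:

```python
from typing import Callable, Optional


def safe_detect(text: str, detector: Callable[[str], str]) -> Optional[str]:
    """Return the detector's language code, or None if the detector raises."""
    try:
        return detector(text)
    except Exception:
        return None


def toy_detector(text: str) -> str:
    """Hypothetical stand-in: raises when the text has no ASCII letters."""
    if not any(ch.isascii() and ch.isalpha() for ch in text):
        raise ValueError('no usable features in text')
    return 'en'


texts = ['What is your best eye cream?', '12345 !!!', '']
langs = [safe_detect(t, toy_detector) for t in texts]
print(langs)
```

In the notebook this would presumably look like `subs_combined['text'].apply(lambda t: safe_detect(t, detect))`, with the `None` rows dropped afterwards. Note that langdetect's README describes the algorithm as non-deterministic and suggests setting `DetectorFactory.seed = 0` before the first call for repeatable results; short or all-caps titles can still be mislabeled.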

In [45]:
#create a new col with the language detected
subs_combined['lang'] = subs_combined['text'].apply(detect)

In [46]:
subs_combined.head()

Unnamed: 0,author,text,text_length,word_count,subreddit,lang
0,Missy_Pantone,Does anyone have experience with Sidmool Sacch...,83,11,asian_beauty,en
1,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty,en
2,flckeringfox_,What’s your best eye cream to brighten the area?,51,9,asian_beauty,en
3,DefinitionAdvanced25,ASMR FLAWLESS SKIN KOREAN SKINCARE ROUTINE,45,6,asian_beauty,de
4,etoileneha,Has anyone tried the Beauty of Joseon - Red Be...,73,13,asian_beauty,en


In [47]:
# check language counts per subreddit to make sure there is enough English data for the model
subs_combined['lang'].groupby(subs_combined['subreddit']).value_counts()

subreddit           lang 
asian_beauty        en       5067
                    vi         42
                    de         40
                    fr         27
                    it         16
                    ro         15
                    da         13
                    id         13
                    nl         13
                    af         12
                    et         11
                    no         11
                    tl         10
                    fi          9
                    ca          8
                    ko          7
                    es          6
                    sw          4
                    zh-cn       4
                    hr          3
                    ru          3
                    sv          3
                    te          3
                    ar          2
                    cy          2
                    so          2
                    bn          1
                    fa          1
                    hi

#### Dropping all rows where the detected language isn't English

In [49]:
# # increasing the # of rows I can view
# pd.set_option('display.max_rows', 400)

In [48]:
# keep only rows where the detected language is English
subs_combined = subs_combined[subs_combined['lang'] == 'en']

In [49]:
#dropping the lang column because I don't need it anymore
subs_combined.drop(columns='lang', inplace=True)

In [50]:
subs_combined.shape

(12875, 5)

In [51]:
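# preview the frame with a fresh index (inplace=False returns a copy, so subs_combined itself is unchanged;
# the CSV export at the end uses index=False, so the old index is never written out)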
subs_combined.reset_index(drop=True, inplace=False)

Unnamed: 0,author,text,text_length,word_count,subreddit
0,Missy_Pantone,Does anyone have experience with Sidmool Sacch...,83,11,asian_beauty
1,Rntonie35,Love my hair color! Blended nicely with my grays,51,9,asian_beauty
2,flckeringfox_,What’s your best eye cream to brighten the area?,51,9,asian_beauty
3,etoileneha,Has anyone tried the Beauty of Joseon - Red Be...,73,13,asian_beauty
4,bully-maguire23,klavuu pure pearlsation micro collagen cleansi...,59,7,asian_beauty
...,...,...,...,...,...
12870,iinuzukaa,Please help my terrible skin - Lately I have h...,436,81,skincare_addiction
12871,Amber_Owl,How do you reset your face? - That probably so...,789,140,skincare_addiction
12872,phantom_poo,Facials: Worth it or not? - I'm contemplating ...,386,70,skincare_addiction
12873,[deleted],"Biting/peeling lips - Hi all, I have a habit o...",276,50,skincare_addiction


### Replacing '&amp;' with 'and'

In [52]:
# locate posts whose text contains the HTML entity '&amp;' (preview the first two)
subs_combined[subs_combined['text'].str.contains('&amp;')][:2]

Unnamed: 0,author,text,text_length,word_count,subreddit
13,H4km4N,"Lost sex trafficking victim, who's now happily...",306,48,asian_beauty
15,painislife4real,Which sunscreen do you recommend for dry skin ...,127,23,asian_beauty


In [54]:
# replace every '&amp;' with 'and' (plain string match, no regex needed)
subs_combined['text'] = subs_combined['text'].str.replace('&amp;', 'and', regex=False)
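
`&amp;` is an HTML entity, and scraped Reddit text can contain others such as `&lt;` and `&gt;`. As a sketch of a more general cleanup (the sample string is made up), the standard library's `html.unescape` decodes all entities at once, after which the ampersand can still be spelled out to match the convention used here:

```python
import html

# Hypothetical post text containing two different HTML entities.
raw_text = 'Cleanser &amp; toner for dry skin &lt;3'

# Decode every HTML entity in one call.
decoded = html.unescape(raw_text)

# Optionally keep the notebook's convention of writing the ampersand as 'and'.
spelled_out = decoded.replace('&', 'and')

print(decoded)
print(spelled_out)
```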

### Export dataset to CSV

In [55]:
#export dataset to csv
subs_combined.to_csv('cleaned_datasets/subreddits_combined.csv', index = False)