### Sentiment Analysis on Sephora Product Reviews: Data Wrangling
In this notebook, I use the Sephora Products and Skincare Reviews dataset from Kaggle to create two datasets: one for EDA, which keeps most of the original columns, and a second for modeling, which keeps only the review text, the preprocessed text, and the recommendation label.

In [1]:
# import libraries

import numpy as np
import pandas as pd

import string
import re

import pickle

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
# import product info files
prod_info = pd.read_csv('product_info.csv')

In [4]:
# import review files

#reviews_1 = pd.read_csv('reviews_0_250.csv')

reviews_2 = pd.read_csv('reviews_250_500.csv')
reviews_3 = pd.read_csv('reviews_500_750.csv')
reviews_4 = pd.read_csv('reviews_750_1000.csv')
reviews_5 = pd.read_csv('reviews_1000_1500.csv')
reviews_6 = pd.read_csv('reviews_1500_end.csv')

In [5]:
# view product info dataframe

prod_info.head()

Unnamed: 0,product_id,product_name,brand_id,brand_name,loves_count,rating,reviews,size,variation_type,variation_value,...,online_only,out_of_stock,sephora_exclusive,highlights,primary_category,secondary_category,tertiary_category,child_count,child_max_price,child_min_price
0,P473671,Fragrance Discovery Set,6342,19-69,6320,3.6364,11.0,,,,...,1,0,0,"['Unisex/ Genderless Scent', 'Warm &Spicy Scen...",Fragrance,Value & Gift Sets,Perfume Gift Sets,0,,
1,P473668,La Habana Eau de Parfum,6342,19-69,3827,4.1538,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,85.0,30.0
2,P473662,Rainbow Bar Eau de Parfum,6342,19-69,3253,4.25,16.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
3,P473660,Kasbah Eau de Parfum,6342,19-69,3018,4.4762,21.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
4,P473658,Purple Haze Eau de Parfum,6342,19-69,2691,3.2308,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0


In [6]:
# view reviews dataframes

reviews_2.head()

Unnamed: 0.1,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,0,2190293206,2,0.0,,0,0,0,2023-03-19,Used to swear by this product but hate the sme...,,lightMedium,brown,combination,brown,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
1,1,9113341005,5,1.0,,0,0,0,2023-03-18,I’ve only been using this for a week and my sk...,More tolerable than The Ordinary,deep,brown,normal,black,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
2,2,23866342710,1,0.0,1.0,13,0,13,2023-03-12,"Why, why, why would you change the formula?!!!...",New formula is awful very sad,fairLight,blue,combination,blonde,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
3,3,1328806527,1,0.0,0.941176,17,1,16,2023-03-12,I have used this product for years and it has ...,Recently reformulated and the new formula is A...,light,brown,combination,gray,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
4,4,31262847082,5,1.0,1.0,1,0,1,2023-03-09,Great product for anti-aging Also great for da...,Must have product in my nighttime skincare rou...,lightMedium,hazel,combination,brown,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99


In [7]:
reviews_3.head()

Unnamed: 0.1,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,0,2190293206,2,0.0,,0,0,0,2023-03-19,Used to swear by this product but hate the sme...,,lightMedium,brown,combination,brown,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
1,1,9113341005,5,1.0,,0,0,0,2023-03-18,I’ve only been using this for a week and my sk...,More tolerable than The Ordinary,deep,brown,normal,black,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
2,2,23866342710,1,0.0,1.0,13,0,13,2023-03-12,"Why, why, why would you change the formula?!!!...",New formula is awful very sad,fairLight,blue,combination,blonde,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
3,3,1328806527,1,0.0,0.941176,17,1,16,2023-03-12,I have used this product for years and it has ...,Recently reformulated and the new formula is A...,light,brown,combination,gray,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99
4,4,31262847082,5,1.0,1.0,1,0,1,2023-03-09,Great product for anti-aging Also great for da...,Must have product in my nighttime skincare rou...,lightMedium,hazel,combination,brown,P443842,Retinol Anti-Aging Serum,The INKEY List,12.99


In [8]:
reviews_4.head()

Unnamed: 0.1,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,0,2079014373,5,1.0,,0,0,0,2023-03-14,These are the only pimple patches I’ve used th...,Best Pimple Patches,medium,blue,normal,,P442857,Focuspot Micro Tip Patches,Dr. Jart+,20.0
1,1,12631885517,4,1.0,,0,0,0,2023-02-08,One of my ingrown hair turned inflamed and sor...,It works!,mediumTan,brown,combination,black,P442857,Focuspot Micro Tip Patches,Dr. Jart+,20.0
2,2,2321761961,5,1.0,1.0,1,0,1,2023-02-05,I have tried 10 different acne/blemish patches...,Good for a large or painful breakout! Sleep in...,,hazel,combination,blonde,P442857,Focuspot Micro Tip Patches,Dr. Jart+,20.0
3,3,1380382883,4,1.0,,0,0,0,2023-01-24,"Love these for my mid-size breakouts, specifyi...",Micro tips are a plus!!,light,brown,combination,black,P442857,Focuspot Micro Tip Patches,Dr. Jart+,20.0
4,4,8871759068,4,1.0,1.0,1,0,1,2023-01-15,Best so far - though still not particularly ef...,,,,,,P442857,Focuspot Micro Tip Patches,Dr. Jart+,20.0


In [9]:
reviews_5.head()

Unnamed: 0.1,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,0,8554483509,2,0.0,,0,0,0,2023-03-21,This was gifted by Supergoop! in exchange for ...,Nice packaging but easy to overuse,light,brown,combination,,P467976,(Re)setting 100% Mineral Powder Sunscreen SPF ...,Supergoop!,35.0
1,1,24710523057,2,0.0,1.0,2,0,2,2023-03-07,I didn’t like it; too much product comes out w...,Packaging is not suits le,,brown,combination,,P467976,(Re)setting 100% Mineral Powder Sunscreen SPF ...,Supergoop!,35.0
2,2,8429283179,5,1.0,0.941176,34,2,32,2023-03-01,Y’all….I’m begging for everyone to read instru...,PLS READ THIS LOL,light,green,normal,brown,P467976,(Re)setting 100% Mineral Powder Sunscreen SPF ...,Supergoop!,35.0
3,3,8105185455,1,0.0,0.0,5,5,0,2023-02-27,I have not figured out how to use this product...,At a loss,tan,brown,combination,black,P467976,(Re)setting 100% Mineral Powder Sunscreen SPF ...,Supergoop!,35.0
4,4,1515931062,1,0.0,0.0,7,7,0,2023-02-27,I’m at a loss as to how to use this thing!!! C...,NOT for me!!,fair,blue,combination,blonde,P467976,(Re)setting 100% Mineral Powder Sunscreen SPF ...,Supergoop!,35.0


In [10]:
reviews_6.head()

Unnamed: 0.1,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,0,1945004256,5,1.0,0.0,2,2,0,2022-12-10,I absolutely L-O-V-E this oil. I have acne pro...,A must have!,lightMedium,green,combination,,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
1,1,5478482359,3,1.0,0.333333,3,2,1,2021-12-17,I gave this 3 stars because it give me tiny li...,it keeps oily skin under control,mediumTan,brown,oily,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
2,2,29002209922,5,1.0,1.0,2,0,2,2021-06-07,Works well as soon as I wash my face and pat d...,Worth the money!,lightMedium,brown,dry,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
3,3,7391078463,5,1.0,1.0,2,0,2,2021-05-21,"this oil helped with hydration and breakouts, ...",best face oil,lightMedium,brown,combination,blonde,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
4,4,1766313888,5,1.0,1.0,13,0,13,2021-03-29,This is my first product review ever so that s...,Maskne miracle,mediumTan,brown,combination,black,P379064,Lotus Balancing & Hydrating Natural Face Treat...,Clarins,65.0
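
The `Unnamed: 0` column in the review previews above probably comes from an index that was written into the CSV files when they were saved. That is an assumption about how the files were produced, not something the notebook confirms. A minimal sketch with a hypothetical two-row CSV shows how `index_col=0` reads such a column back as the index rather than as data:

```python
import io

import pandas as pd

# Hypothetical CSV that was saved together with its index (note the empty first header field).
csv_text = ",rating,review_text\n0,5,Great serum\n1,2,Too greasy\n"

# Default read: the saved index comes back as a data column named "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))

# Reading with index_col=0 restores that column as the index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

Reading the files this way would make the later `drop` of `Unnamed: 0` unnecessary, although dropping it afterwards, as this notebook does, gives the same columns.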


In [11]:
# review shapes of each review dataframe

print(reviews_2.shape)
print(reviews_3.shape)
print(reviews_4.shape)
print(reviews_5.shape)
print(reviews_6.shape)

(206725, 19)
(206725, 19)
(116262, 19)
(119317, 19)
(49977, 19)
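
`reviews_2` and `reviews_3` report the same shape, and their previews show the same first five rows, which may mean the same records were loaded twice. The sketch below uses small hypothetical frames (not the Sephora data) to show one way to check for this before concatenating and to drop exact repeats afterwards:

```python
import pandas as pd

# Toy stand-ins (hypothetical data) for two files that hold the same records.
chunk_a = pd.DataFrame(
    {"author_id": [1, 2], "product_id": ["P1", "P2"], "rating": [5, 1]}
)
chunk_b = chunk_a.copy()

# equals() is True only when shape, values, and dtypes all match.
same_content = chunk_a.equals(chunk_b)

# After stacking, duplicated() flags every row that repeats an earlier one.
combined = pd.concat([chunk_a, chunk_b], ignore_index=True)
n_duplicates = int(combined.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduplicated = combined.drop_duplicates(ignore_index=True)

print(same_content, n_duplicates, len(deduplicated))
```

Whether the real files overlap can only be settled by running a check like `reviews_2.equals(reviews_3)` on the actual data.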


In [12]:
# concat review dataframes into one

frames = [reviews_2, reviews_3, reviews_4, reviews_5, reviews_6]
reviews = pd.concat(frames)

In [13]:
print(reviews.shape)

(699006, 19)
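
The combined row count matches the sum of the five shapes printed earlier. By default, `pd.concat` keeps each frame's original index, so index labels repeat across the stacked frames until the index is reset. A small sketch with hypothetical toy frames illustrates both points:

```python
import pandas as pd

# Row counts reported for the five review files in this notebook.
row_counts = [206725, 206725, 116262, 119317, 49977]
total_rows = sum(row_counts)
print(total_rows)  # 699006, the number of rows in the concatenated frame

# Toy frames (hypothetical data) to show how the index behaves.
a = pd.DataFrame({"rating": [5, 4]})
b = pd.DataFrame({"rating": [1, 2, 3]})

stacked = pd.concat([a, b])                         # index labels: 0, 1, 0, 1, 2
renumbered = pd.concat([a, b], ignore_index=True)   # index labels: 0, 1, 2, 3, 4

print(stacked.index.is_unique, renumbered.index.is_unique)
```

This notebook resets the index later, after sampling, so the repeated labels do not reach the saved files.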


In [14]:
# subset recommended and not recommended to result in a balanced dataframe

p_reviews = reviews[reviews["is_recommended"] == 1]
n_reviews = reviews[reviews["is_recommended"] == 0]

# take samples

p_reviews = p_reviews.sample(n=50000)
n_reviews = n_reviews.sample(n=50000)

# concat

reviews = pd.concat([p_reviews, n_reviews])

In [15]:
print(reviews.shape)

(100000, 19)
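
Because `sample` is called without a seed, each run of the notebook draws a different set of 100,000 reviews. A hedged sketch of a repeatable version follows; the helper name `balanced_sample`, the seed value, and the toy data are all hypothetical and not part of the original notebook:

```python
import pandas as pd

# Toy frame (hypothetical data): 6 recommended rows and 4 not-recommended rows.
toy = pd.DataFrame(
    {
        "review_text": [f"review {i}" for i in range(10)],
        "is_recommended": [1.0] * 6 + [0.0] * 4,
    }
)

N_PER_CLASS = 3  # the notebook draws 50000 per class
SEED = 42        # arbitrary; any fixed integer makes the draw repeatable


def balanced_sample(df: pd.DataFrame, n: int, seed: int) -> pd.DataFrame:
    """Draw n rows from each class and stack them, positives first."""
    pos = df[df["is_recommended"] == 1].sample(n=n, random_state=seed)
    neg = df[df["is_recommended"] == 0].sample(n=n, random_state=seed)
    return pd.concat([pos, neg], ignore_index=True)


first = balanced_sample(toy, N_PER_CLASS, SEED)
second = balanced_sample(toy, N_PER_CLASS, SEED)

print(first.equals(second))
print(first["is_recommended"].value_counts().to_dict())
```

Rows where `is_recommended` is missing match neither filter, so they are left out of both samples here, just as in the cell above.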


In [16]:
reviews.head()

Unnamed: 0.1,Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
80468,80468,1778824487,5,1.0,1.0,6,0,6,2019-10-13,This is an amazing cleanser!! I was looking fo...,Will solve your clogged pore problems!,olive,green,combination,black,P474937,Vinopure Pore Purifying Gel Cleanser,Caudalie,30.0
37751,37751,10980920650,5,1.0,,0,0,0,2017-08-28,I have really enjoyed using my new Kate Somerv...,Love this product!,fair,hazel,normal,brown,P421276,ExfoliKate Glow Moisturizer,Kate Somerville,76.0
90881,90881,30056652681,5,1.0,,0,0,0,2023-02-02,Where has this been my adult life?! After one ...,My new BFF,fair,blue,normal,red,P504007,Ceramidin Skin Barrier Moisturizing Cream,Dr. Jart+,48.0
42077,42077,23288080407,5,1.0,,0,0,0,2023-02-10,I love using the Caudalie Instant Detox Clay M...,Caudalie Detox Clay Mask!,medium,brown,combination,,P395615,Pore Minimizing Instant Detox Mask,Caudalie,42.0
70790,70790,7588651785,5,1.0,,0,0,0,2023-03-14,I recently started using the gua sha and neede...,Works great with Gua sha,,brown,combination,,P504429,Hydrating Serum,SEPHORA COLLECTION,20.0


In [17]:
# drop unnecessary columns, reset index

reviews.drop(columns=['Unnamed: 0', 'author_id'], inplace=True)
reviews.reset_index(drop=True, inplace=True)

In [18]:
reviews.head()

Unnamed: 0,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,5,1.0,1.0,6,0,6,2019-10-13,This is an amazing cleanser!! I was looking fo...,Will solve your clogged pore problems!,olive,green,combination,black,P474937,Vinopure Pore Purifying Gel Cleanser,Caudalie,30.0
1,5,1.0,,0,0,0,2017-08-28,I have really enjoyed using my new Kate Somerv...,Love this product!,fair,hazel,normal,brown,P421276,ExfoliKate Glow Moisturizer,Kate Somerville,76.0
2,5,1.0,,0,0,0,2023-02-02,Where has this been my adult life?! After one ...,My new BFF,fair,blue,normal,red,P504007,Ceramidin Skin Barrier Moisturizing Cream,Dr. Jart+,48.0
3,5,1.0,,0,0,0,2023-02-10,I love using the Caudalie Instant Detox Clay M...,Caudalie Detox Clay Mask!,medium,brown,combination,,P395615,Pore Minimizing Instant Detox Mask,Caudalie,42.0
4,5,1.0,,0,0,0,2023-03-14,I recently started using the gua sha and neede...,Works great with Gua sha,,brown,combination,,P504429,Hydrating Serum,SEPHORA COLLECTION,20.0


In [19]:
reviews.tail()

Unnamed: 0,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
99995,3,0.0,,0,0,0,2020-10-20,I’ve been using this for a few weeks. It’s sti...,,lightMedium,brown,normal,gray,P443370,BioLumin-C Vitamin C Serum,Dermalogica,95.0
99996,1,0.0,0.8,10,2,8,2019-11-17,"I hardly write reviews here, but compelled aft...",Waste of money,mediumTan,brown,combination,brown,P442838,Barrier+ Triple Lipid-Boost 360° Brightening E...,Skinfix,44.0
99997,2,0.0,0.75,4,1,3,2023-02-19,I really wanted to like this it’s so cute and ...,,lightMedium,brown,dry,black,P482535,Strawberry Smooth BHA + AHA Salicylic Acid Serum,Glow Recipe,42.0
99998,1,0.0,0.4,5,3,2,2020-10-21,Got this scrub recently and it made my face wo...,not worth that money,,brown,normal,black,P122661,7 Day Face Scrub Cream Rinse-Off Formula,CLINIQUE,26.0
99999,1,0.0,0.6,10,4,6,2019-09-02,I got a really terrible reaction to this produ...,i usually love the ordinary products but...,mediumTan,brown,oily,blonde,P427404,Ascorbyl Tetraisopalmitate Solution 20% in Vit...,The Ordinary,19.8


In [20]:
# save to pickle for EDA use

reviews.to_pickle("reviews.pkl")

In [21]:
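# create a new dataframe for text and sentiment only
# (.copy() makes it independent of `reviews`, so the column assignments below do not act on a slice)

reviews_text_only = reviews[['review_text', 'is_recommended']].copy()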

In [22]:
reviews_text_only.head()

Unnamed: 0,review_text,is_recommended
0,This is an amazing cleanser!! I was looking fo...,1.0
1,I have really enjoyed using my new Kate Somerv...,1.0
2,Where has this been my adult life?! After one ...,1.0
3,I love using the Caudalie Instant Detox Clay M...,1.0
4,I recently started using the gua sha and neede...,1.0


In [23]:
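# fix dtypes

# cast the review text to str
reviews_text_only["review_text"] = reviews_text_only["review_text"].values.astype('str')

# replace infinite values with NaN
# note: this modifies `reviews`, which was already pickled in cell 20, so reviews.pkl is unaffected
reviews.replace([np.inf, -np.inf], np.nan, inplace=True)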
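
One side effect of casting with `.values.astype('str')` is that a missing review does not stay missing: it becomes the literal string `'nan'`. The sketch below uses a hypothetical two-item series to show this, along with one possible alternative that substitutes an empty string first. Whether the sampled reviews contain any missing text is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing review.
texts = pd.Series(["Love it!", np.nan])

# The cast used in the cell above turns NaN into the string 'nan'.
as_str = texts.values.astype("str")

# One alternative: fill missing values with an empty string before casting.
filled = texts.fillna("").astype(str)

print(as_str.tolist())
print(filled.tolist())
```

A model would treat `'nan'` as an ordinary token, so filling or dropping missing reviews before the cast may be preferable.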

In [24]:
# preprocess text - remove punctuation (any character that is not a word character or whitespace)

r = re.compile(r'[^\w\s]+')

reviews_text_only['text_preproc'] = [r.sub('', s) for s in reviews_text_only['review_text'].tolist()]

In [25]:
# convert to lowercase

reviews_text_only['text_preproc'] = reviews_text_only['text_preproc'].str.lower()
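
The two preprocessing steps above can be sketched as a single function. The helper name `preprocess` and the sample strings are hypothetical; the regular expression is the one used in this notebook. The examples show that curly apostrophes and slashes are removed, while digits and underscores are kept because `\w` matches them:

```python
import re

import pandas as pd

# Same pattern as in the notebook: anything that is not a word character or whitespace.
PUNCT = re.compile(r"[^\w\s]+")


def preprocess(text: str) -> str:
    """Strip punctuation, then lowercase."""
    return PUNCT.sub("", text).lower()


# Hypothetical sample reviews.
samples = pd.Series(["It’s AMAZING!!", "Didn't work... 2/5", "snake_case stays"])

cleaned = samples.map(preprocess)

# Equivalent vectorized form using pandas string methods.
vectorized = samples.str.replace(r"[^\w\s]+", "", regex=True).str.lower()

print(cleaned.tolist())
```

Removing apostrophes turns contractions such as "didn't" into "didnt", which is worth keeping in mind when choosing a tokenizer or stop-word list later.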

In [26]:
# reorder dataframe

reviews_text_only = reviews_text_only[['review_text', 'text_preproc', 'is_recommended']]

In [27]:
reviews_text_only.head()

Unnamed: 0,review_text,text_preproc,is_recommended
0,This is an amazing cleanser!! I was looking fo...,this is an amazing cleanser i was looking for ...,1.0
1,I have really enjoyed using my new Kate Somerv...,i have really enjoyed using my new kate somerv...,1.0
2,Where has this been my adult life?! After one ...,where has this been my adult life after one us...,1.0
3,I love using the Caudalie Instant Detox Clay M...,i love using the caudalie instant detox clay m...,1.0
4,I recently started using the gua sha and neede...,i recently started using the gua sha and neede...,1.0


In [28]:
reviews_text_only.tail()

Unnamed: 0,review_text,text_preproc,is_recommended
99995,I’ve been using this for a few weeks. It’s sti...,ive been using this for a few weeks its still ...,0.0
99996,"I hardly write reviews here, but compelled aft...",i hardly write reviews here but compelled afte...,0.0
99997,I really wanted to like this it’s so cute and ...,i really wanted to like this its so cute and s...,0.0
99998,Got this scrub recently and it made my face wo...,got this scrub recently and it made my face wo...,0.0
99999,I got a really terrible reaction to this produ...,i got a really terrible reaction to this produ...,0.0


In [29]:
# save to pickle

reviews_text_only.to_pickle("reviews_text_only.pkl")
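
A quick round-trip check can confirm that a pickled dataframe loads back unchanged. This is a minimal sketch with a hypothetical toy frame and a temporary file, not the files written above:

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical data) with the same three columns as reviews_text_only.
frame = pd.DataFrame(
    {
        "review_text": ["Love it!", "Too greasy."],
        "text_preproc": ["love it", "too greasy"],
        "is_recommended": [1.0, 0.0],
    }
)

# Write to a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_reviews.pkl")
    frame.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() checks values, dtypes, and shape.
round_trip_ok = frame.equals(restored)
print(round_trip_ok)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.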