# Capstone Project :  
Skincare Product Analysis - Decoding your Skincare Product
------------------------------------------------------------


# Background

The global skincare market is projected to grow from USD 130 billion in 2020 to over USD 177 billion by 2025 [[Statista, 2021]](https://www.statista.com/topics/4517/us-skin-care-market/#dossierKeyfigures). This growth reflects consumers' sustained interest in beauty regimes, as well as in basic personal care and hygiene practices.

With the explosion of skincare brands around the world, consumers face so many choices that it is harder to make informed purchasing decisions. Beyond this choice paralysis, there are costs that raise the barrier to purchase. Researching a myriad of products and ingredients takes time. Trial and error is costly in both time and money: skincare products are not always cheap, and results take time to show. There is also a risk of adverse reactions, which is a major concern for consumers with sensitive skin.

By helping consumers understand what makes a product cheap or expensive, and what they are actually paying for, we can help them make more informed decisions. 
Consumers with a particular skin type, e.g. sensitive skin, may also want to know whether there are similar alternatives to their tried-and-tested products that would help them avoid adverse reactions.
If a product is not worth its price tag, consumers may want to know whether a cheaper alternative promises a similar effect.

### Objectives of Project

This project aims to:

1) Build a classification model to identify the product characteristics that affect the prices of skincare products by analysing the ingredients, the skincare concerns addressed, the skin types the product is suitable for, brand, popularity, etc. 

2) Build a recommender to help consumers find the closest alternative product based on ingredient similarity. 
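
To make "product similarity of the ingredients" concrete, here is a minimal sketch of one candidate measure, the Jaccard similarity between two ingredient sets. It is an illustration only: the helper `jaccard_similarity` and the two ingredient lists are hypothetical, and the metric actually adopted is explored in Part 4.

```python
def jaccard_similarity(ingredients_a, ingredients_b):
    """Return |A ∩ B| / |A ∪ B| for two iterables of ingredient names."""
    # Normalise case and whitespace so 'Water' and ' water' count as the same ingredient
    set_a = {item.strip().lower() for item in ingredients_a}
    set_b = {item.strip().lower() for item in ingredients_b}
    union = set_a | set_b
    if not union:
        return 0.0  # two empty lists: define similarity as 0 to avoid dividing by zero
    return len(set_a & set_b) / len(union)


# Hypothetical ingredient lists, for illustration only
serum_a = ["Water", "Glycerin", "Niacinamide", "Sodium Hyaluronate"]
serum_b = ["Water", "Glycerin", "Glycolic Acid", "Sodium Hyaluronate"]

similarity = jaccard_similarity(serum_a, serum_b)
print(f"Jaccard similarity: {similarity:.2f}")
```

A score of 1 means the two products list exactly the same ingredients, and 0 means they share none. This measure ignores ingredient order and concentration, which may matter in practice.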

### Problem Statement

1) Help consumers understand what makes a skincare product cheap or expensive so that they can make more informed decisions.

2) Help consumers discover similar alternatives to their tried-and-tested products or even ‘dupe products’.

### Data Used

- [`sephora.csv`](../data/sephora.csv): The dataset is from [`Kaggle`](https://www.kaggle.com/raghadalharbi/all-products-available-on-sephora-website) and was extracted from [Sephora's website](https://www.sephora.com/).

- Sephora's data was used because it is one of the largest beauty retailers in the United States.

- Scope: Only skincare products are covered in this project to keep the analysis targeted. Furthermore, skincare products made up the largest share (42 percent) of the global cosmetic market in 2020 [[Statista, 2021]](https://www.statista.com/statistics/243967/breakdown-of-the-cosmetic-market-worldwide-by-product-category/), making it a compelling area of research.

### Data Dictionary 

The key features used in the project are listed below:

|Feature|Type|Description|
|:---|:---:|:---:|
|id|*int*|product id| 
|brand|*object*|product brand|
|category|*object*|product type|
|name|*object*|product name|
|size|*object*|volume of product|
|rating|*float*|ratings of product from 1 to 5| 
|number_of_reviews|*int*|number of reviews submitted| 
|love|*int*|number of times the product was loved|
|price|*float*|discounted price of product|
|value_price|*float*|original price of product|
|URL|*object*|link to product|
|MarketingFlags|*bool*|whether or not the product had marketing tags| 
|MarketingFlags_content|*object*|type of marketing tags|
|options|*object*|different product sizes available|
|details|*object*|product description| 
|how_to_use|*object*|instructions to use product|
|ingredients|*object*|product ingredients|
|online_only|*int*|1 if the product has online marketing tags and 0 for none|
|exclusive|*int*|1 if the product has exclusive marketing tags and 0 for none|
|limited_edition|*int*|1 if the product has limited edition marketing tags and 0 for none|
|limited_time_offer|*int*|1 if the product has limited time offer marketing tags and 0 for none|
|skintype|*int*|skintype that the product is suitable for|
|concerns|*int*|skincare concerns that the product is targeted for|
|pref|*int*|specific preferences of skincare product composition|
|skincareacids|*int*|specific acids present in skincare product|
|excluded|*int*|ingredients that are excluded in the product|
|formulation|*int*|product formulation type|
|award|*int*|whether or not the product has won Allure awards or the like|
|clinical_results|*int*|whether or not the product was clinically tested|
|size_ml|*int*|volume of product in ml|
|price_per_unit_vol|*int*|derived by dividing value_price by size_ml|
|price_range|*int*|whether the product is cheap, average or expensive|
|ingredients_list|*int*|list of ingredients in the product extracted from details|
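
The derived features `size_ml` and `price_per_unit_vol` can be illustrated with a small sketch. The regular expression and the helper `parse_size_ml` below are assumptions made for illustration, not the parsing logic used later in the project. The size strings follow the format seen in the dataset (e.g. `0.7 oz/ 20 mL`).

```python
import re

import pandas as pd

# Assumed pattern: a number (optionally with decimals) followed by 'mL'
ML_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*mL", flags=re.IGNORECASE)


def parse_size_ml(size):
    """Return the first millilitre value found in a size string, or NaN if none."""
    match = ML_PATTERN.search(size)
    return float(match.group(1)) if match else float("nan")


# Toy rows for illustration
sample = pd.DataFrame({
    "size": ["0.7 oz/ 20 mL", "1 oz/ 30 mL", "1 Mask"],
    "value_price": [66.0, 95.0, 6.0],
})

sample["size_ml"] = sample["size"].map(parse_size_ml)
# price_per_unit_vol = value_price / size_ml (NaN where no volume could be parsed)
sample["price_per_unit_vol"] = sample["value_price"] / sample["size_ml"]
print(sample)
```

Sizes without a millilitre value (such as `1 Mask`) yield `NaN`, and multi-pack strings such as `5 x 0.16oz/5mL` would return only the per-item volume, so a real implementation would need to handle those cases explicitly.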

### Notebook Organisation

**<u>Part 1</u>** (Current notebook)
1. **Data Cleaning** 

**[Part 2](../code/2_eda_preprocessing.ipynb)**

2. **Exploratory Data Analysis**  

**[Part 3](../code/3_data_cleaning_ingr.ipynb)**

3. **Data Cleaning of Ingredients**
4. **Exploratory Data Analysis of Ingredients**

**[Part 4](../code/4_recommender.ipynb)**

5. **Exploring Metrics for Recommendation System**
6. **Final Recommender System**

**[Part 5](../code/5_modelling_insights.ipynb)**

7. **Modelling**
8. **Insights**

# 1. Data Cleaning

## 1.1. Data Import

We first import the required libraries and load the dataset.

In [1]:
# Import libraries 
import numpy as np
import pandas as pd
import scipy.stats as stats

# Visual
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Pre-processing
#!pip install nltk
import nltk
#!pip install wordcloud
from wordcloud import WordCloud, STOPWORDS
#!pip install regex
import re
#!pip install python-Levenshtein
#!pip install fuzzywuzzy
import fuzzywuzzy
from fuzzywuzzy import process
from fuzzywuzzy import fuzz
from itertools import chain

# Modelling
from sklearn.preprocessing import StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import text 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
from sklearn.metrics import confusion_matrix, classification_report, plot_roc_curve, roc_auc_score, roc_curve, auc,\
                            accuracy_score, precision_score, recall_score, f1_score, plot_confusion_matrix

# NLP
from nltk.corpus import stopwords
import string

# Bokeh
from bokeh.io import show, curdoc, output_notebook, push_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool, Select, Paragraph, TextInput
from bokeh.layouts import widgetbox, column, row
from ipywidgets import interact

# Similarity
from numpy import dot
from sklearn.manifold import TSNE

# Others
import time
import datetime as dt

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 100)

# Warnings
import warnings 
warnings.filterwarnings('ignore')

In [2]:
# Load data
df = pd.read_csv('../data/sephora.csv')

## 1.2. Data Inspection

We will then inspect the dataset before cleaning.

In [3]:
# Inspect first 2 records
df.head(2)

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,URL,MarketingFlags,MarketingFlags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer
0,2218774,Acqua Di Parma,Fragrance,Blu Mediterraneo MINIATURE Set,5 x 0.16oz/5mL,4.0,4,3002,66.0,75.0,https://www.sephora.com/product/blu-mediterraneo-minature-set-P443401?icid2=products grid:p443401,True,online only,no options,This enchanting set comes in a specially handcrafted blue box- and includes a selection of fragr...,Suggested Usage:-Fragrance is intensified by the warmth of your own body. Apply in the creases o...,Arancia di Capri Eau de Toilette: Alcohol Denat.- Water- Fragrance- Limonene- Linalool- Ethylhex...,1,0,0,0
1,2044816,Acqua Di Parma,Cologne,Colonia,0.7 oz/ 20 mL,4.5,76,2700,66.0,66.0,https://www.sephora.com/product/colonia-P163604?icid2=products grid:p163604,True,online only,- 0.7 oz/ 20 mL Spray - 1.7 oz/ 50 mL Eau de Cologne Spray - 3.4 oz/ 101 mL Eau de Cologne Spray,An elegant timeless scent filled with a fresh- luminous blend of natural ingredients like Bulgar...,no instructions,unknown,1,0,0,0


In [4]:
# Display the number of rows and columns 
df.shape

(9168, 21)

In [5]:
# Display the summary statistics
df.describe()

Unnamed: 0,id,rating,number_of_reviews,love,price,value_price,online_only,exclusive,limited_edition,limited_time_offer
count,9168.0,9168.0,9168.0,9168.0,9168.0,9168.0,9168.0,9168.0,9168.0,9168.0
mean,1962952.0,3.99002,282.13918,16278.59,50.063237,51.82359,0.234839,0.264725,0.091841,0.000327
std,385971.4,1.007707,890.642028,42606.51,47.164989,49.45902,0.423921,0.441211,0.288817,0.018087
min,50.0,0.0,0.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0
25%,1819453.0,4.0,10.0,1600.0,24.0,25.0,0.0,0.0,0.0,0.0
50%,2072354.0,4.0,46.0,4800.0,35.0,35.0,0.0,0.0,0.0,0.0
75%,2230591.0,4.5,210.0,13800.0,59.0,60.0,0.0,1.0,0.0,0.0
max,2359685.0,5.0,19000.0,1300000.0,549.0,549.0,1.0,1.0,1.0,1.0


In [6]:
# Display non-null count and dtype
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9168 entries, 0 to 9167
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      9168 non-null   int64  
 1   brand                   9168 non-null   object 
 2   category                9168 non-null   object 
 3   name                    9168 non-null   object 
 4   size                    9168 non-null   object 
 5   rating                  9168 non-null   float64
 6   number_of_reviews       9168 non-null   int64  
 7   love                    9168 non-null   int64  
 8   price                   9168 non-null   float64
 9   value_price             9168 non-null   float64
 10  URL                     9168 non-null   object 
 11  MarketingFlags          9168 non-null   bool   
 12  MarketingFlags_content  9168 non-null   object 
 13  options                 9168 non-null   object 
 14  details                 9168 non-null   

In [7]:
# Get column names
df.columns

Index(['id', 'brand', 'category', 'name', 'size', 'rating',
       'number_of_reviews', 'love', 'price', 'value_price', 'URL',
       'MarketingFlags', 'MarketingFlags_content', 'options', 'details',
       'how_to_use', 'ingredients', 'online_only', 'exclusive',
       'limited_edition', 'limited_time_offer'],
      dtype='object')

In [8]:
# Rename columns to lowercase
df.rename(str.lower, axis = 1, inplace = True)

## 1.3. Selecting Categories

As the full dataset covers a wide range of product types, including skincare, haircare, cosmetics and facial tools, we will restrict the scope to skincare products for the purpose of this project.

In [9]:
# Examine the categories
df['category'].unique()

array(['Fragrance', 'Cologne', 'Perfume', 'Body Mist & Hair Mist',
       'Body Lotions & Body Oils', 'Body Sprays & Deodorant',
       'Perfume Gift Sets', 'no category', 'Rollerballs & Travel Size',
       'Lip Balm & Treatment', 'Lotions & Oils', 'Eye Palettes',
       'Highlighter', 'Cheek Palettes', 'Lipstick', 'Face Serums',
       'Moisturizers', 'Value & Gift Sets', 'Eye Creams & Treatments',
       'Face Sunscreen', 'Lip Balms & Treatments', 'Mini Size',
       'Face Masks', 'Face Wash & Cleansers', 'Decollete & Neck Creams',
       'Face Oils', 'Hand Cream & Foot Cream', 'Face Primer',
       'Color Correct', 'Mists & Essences', 'Tinted Moisturizer',
       'Concealer', 'Beauty Supplements', 'Facial Peels', 'Exfoliators',
       'Conditioner', 'Shampoo', 'Hair Styling Products',
       'Scalp & Hair Treatments', 'Hair Masks', 'Hair Spray', 'Hair Oil',
       'Hair Primers', 'Dry Shampoo', 'Hair', 'Hair Thinning & Hair Loss',
       'Hair Straighteners & Flat Irons', 'Hair Dry

Based on Sephora's website, the following categories are labelled under skincare.

In [10]:
# Categories of interest
cat_interest = ['Moisturizers',
                'Face Serums',
                'Face Wash & Cleansers',
                'Eye Creams & Treatments',
                'Face Masks',
                'Lip Balms & Treatments',
                'Toners',
                'Face Sunscreen',
                'Mists & Essences',
                'Face Oils',
                'Sheet Masks',
                'Facial Peels',
                'Exfoliators',
                'Night Creams',
                'Blemish & Acne Treatments',
                'BB & CC Cream',
                'Makeup Removers',
                'Face Wipes',
                'Eye Masks',
                'Body Sunscreen',
                'Decollete & Neck Creams'
]

In [11]:
# Selecting only the categories of interest (i.e. Skincare)
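# .copy() makes the subset an independent DataFrame, so the in-place drops below do not act on a view
df = df.loc[df['category'].isin(cat_interest)].copy()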
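
Because `isin` silently ignores labels that match nothing, a misspelt or differently worded category would simply drop out of the selection. For instance, the category output above lists both 'Lip Balm & Treatment' and 'Lip Balms & Treatments', and only the latter is in `cat_interest`. The sketch below, on a toy frame with hypothetical names (`sample`, `wanted`), shows one way to list the requested labels that found no match:

```python
import pandas as pd

# Toy data for illustration only
sample = pd.DataFrame({
    "category": ["Moisturizers", "Lipstick", "Face Serums", "Lip Balm & Treatment"]
})
wanted = ["Moisturizers", "Face Serums", "Lip Balms & Treatments", "Toners"]

# Requested labels that do not occur in the data at all
present = set(sample["category"].unique())
unmatched = sorted(set(wanted) - present)
print("Unmatched labels:", unmatched)

# Keep only the matching rows, as an independent copy
skincare = sample.loc[sample["category"].isin(wanted)].copy()
print(skincare)
```

An unmatched label is not necessarily an error, but it is a cheap check to run before trusting the filtered row count.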

## 1.4. Investigate Null/ Zero Values

In [12]:
# Check for null values
df.isnull().sum()

id                        0
brand                     0
category                  0
name                      0
size                      0
rating                    0
number_of_reviews         0
love                      0
price                     0
value_price               0
url                       0
marketingflags            0
marketingflags_content    0
options                   0
details                   0
how_to_use                0
ingredients               0
online_only               0
exclusive                 0
limited_edition           0
limited_time_offer        0
dtype: int64

The check reports no null values. However, several products carry placeholder strings instead of real values: 'no size' in 'size' and 'unknown' in 'ingredients'. As we intend to use 'size' and 'ingredients' as model features, we will examine these records and drop them.
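
Placeholder strings such as 'no size' and 'unknown' are not detected by `isnull()`. One possible way to surface them, sketched here on a toy frame (the `placeholders` mapping and the `sample` rows are illustrative, not taken from the notebook), is to map them to `NaN` first and then count:

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only
sample = pd.DataFrame({
    "size": ["1 oz/ 30 mL", "no size", "no size"],
    "ingredients": ["Water- Glycerin", "unknown", "Water- Alcohol Denat"],
})

# Column -> placeholder string that stands in for a missing value
placeholders = {"size": "no size", "ingredients": "unknown"}

masked = sample.copy()
for column, token in placeholders.items():
    masked[column] = masked[column].replace(token, np.nan)

# isnull() now reports the placeholders as missing
missing_counts = masked.isnull().sum()
print(missing_counts)
```

Working on a copy keeps the original strings available for inspection, which is what the cells that follow rely on.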

In [13]:
# Display records with 'no size'
df[df['size'] == 'no size']

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer
63,1957182,Algenist,Face Serums,GENIUS Liquid Collagen,no size,4.0,656,41300,115.0,115.0,https://www.sephora.com/product/genius-liquid-collagen-P421277?icid2=products grid:p421277,True,exclusive,no options,What it is: A vegan collagen treatment serum that helps to restore skin's natural bounce and re...,Suggested Usage:-After cleansing and toning- apply to entire face- neck- and décolleté. For opti...,-Patented Alguronic Acid: Naturally sourced and sustainably produced from algae- it’s clinicall...,0,1,0,0
66,1582477,Algenist,Moisturizers,GENIUS Ultimate Anti-Aging Cream,no size,4.5,462,16200,112.0,112.0,https://www.sephora.com/product/genius-ultimate-anti-aging-cream-P384537?icid2=products grid:p38...,False,0,no options,What it is: A vegan- collagen-infused advanced moisturizer that combats the look of lines and w...,Suggested Usage:-After cleansing- apply to the entire face- neck- and chest- avoiding the eye ar...,-Patented Alguronic Acid: Naturally sourced and sustainably produced from algae- it’s clinicall...,0,0,0,0
68,1420223,Algenist,Face Sunscreen,SUBLIME DEFENSE Ultra Lightweight UV Defense Fluid SPF 50,no size,4.5,422,21300,28.0,28.0,https://www.sephora.com/product/sublime-defense-ultra-lightweight-uv-defense-fluid-spf-50-P31114...,False,0,no options,What it is: An ultra-sheer- oil-free- SPF 50- and lightweight sunscreen for everyday protection...,Suggested Usage:-Shake bottle well.-Apply to the entire face in the morning or prior to sun expo...,-Patented Alguronic Acid: Helps minimize the appearance of fine lines and wrinkles and support ...,0,0,0,0
74,1328822,Algenist,Moisturizers,Regenerative Anti-Aging Moisturizer,no size,4.5,215,5500,94.0,94.0,https://www.sephora.com/product/regenerative-anti-aging-moisturizer-P282935?icid2=products grid:...,False,0,no options,What it is:A moisturizer with the exclusive antiaging ingredient Alguronic Acid. What it is form...,Suggested Usage:-Apply to the entire face- neck and chest area twice a day. -Use only as directed.,-Alguronic Acid: Increases cell regeneration and elastin synthesis. -Vitamin C: Boosts radiance...,0,0,0,0
115,2206134,Alpha-H,Face Serums,Liquid Gold Exfoliating Treatment,no size,4.5,633,6400,60.0,60.0,https://www.sephora.com/product/liquid-gold-with-glycolic-acid-P440949?icid2=products grid:p440949,True,exclusive,no options,What it is: An overnight resurfacing treatment utilizing a state-of-the-art- low pH delivery sys...,Suggested Usage:-Use on alternate evenings. -Moisten cotton pad with solution and apply to face-...,Water- Alcohol Denat- Glycolic Acid- Glycerin- Hydrolyzed Silk- Potassium Hydroxide- Phenoxyetha...,0,1,0,0
239,2187698,AMOREPACIFIC,Facial Peels,Treatment Enzyme Exfoliating Powder Cleanser,no size,4.5,1000,55700,60.0,60.0,https://www.sephora.com/product/treatment-enzyme-peel-P232931?icid2=products grid:p232931,False,0,no options,What it is: A powder to foam exfoliating daily cleanser powered by plant-derived enzymes that re...,Suggested Usage:-Use morning and night- after removing makeup.\n-The bottle is designed to dispe...,-Green Tea-derived Probiotic Enzyme (Lactobacillus ferment): Helps to remove dead skin cells.-P...,0,0,0,0
240,1161686,AMOREPACIFIC,Face Wash & Cleansers,Treatment Cleansing Foam,no size,4.5,449,14500,50.0,50.0,https://www.sephora.com/product/treatment-cleansing-foam-P232930?icid2=products grid:p232930,False,0,no options,Which skin type is it good for?✔ Normal✔ Oily✔ CombinationWhat it is:A foaming cream cleanser th...,Suggested Usage:-Can be used daily- morning and night. -Dispense pearl-size amount onto fingerti...,-BioGF1K Complex™: Delivers the most active compounds of red ginseng which can be absorbed by t...,0,0,0,0
242,1984715,AMOREPACIFIC,Mists & Essences,Vintage Single Extract Essence,no size,4.5,152,15300,145.0,145.0,https://www.sephora.com/product/vintage-single-extract-essence-P422238?icid2=products grid:p422238,False,0,no options,What it is: An anti-aging essence with only six ingredients to improve skin’s firmness- clarity...,Suggested Usage:-Apply after cleansing and toning in the morning and before moisturizing at nigh...,-Green Tea Leaf Extract: Helps visibly improve skin’s clarity- texture- and elasticity with ant...,0,0,0,0
243,1328046,AMOREPACIFIC,Moisturizers,MOISTURE BOUND Rejuvenating Crème,no size,4.0,186,5600,150.0,150.0,https://www.sephora.com/product/moisture-bound-rejuvenating-creme-P282613?icid2=products grid:p2...,False,0,no options,Which skin type is it good for?✔ Normal✔ Combination✔ Dry✔ SensitiveWhat it is:An intensely hyd...,Suggested Usage:-Apply evenly to clean skin twice a day- morning and night. Precautions:-For ex...,-Bamboo Sap: Supports natural cellular turnover and regeneration.-Aqua Sponge Complex™: Supports...,0,0,0,0
263,1775915,AMOREPACIFIC,Sheet Masks,MOISTURE BOUND Intensive Serum Masque,no size,3.5,16,3200,90.0,90.0,https://www.sephora.com/product/moisture-bound-intensive-serum-masque-P405404?icid2=products gri...,True,online only,no options,What it is:A unique two-in-one ampoule and masque that provides the instant hydration- firmness-...,Suggested Usage:-Apply the full contents of the ampoule to the face with the fingers- avoiding t...,-Bamboo Sap: Rrich in minerals and amino acids; hydrates for long-lasting suppleness.\n-Five Hy...,1,0,0,0


In [14]:
# Display number of records with 'no size'
df[df['size'] == 'no size'].shape

(502, 21)

We will drop records with 'no size' because we will be using 'size' as one of our features.

In [15]:
# Drop records with 'no size'
df.drop(df[df['size'] == 'no size'].index, axis = 0, inplace = True)

In [16]:
# Check records with 'no size' after cleaning
df[df['size'] == 'no size']

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer


All records with 'no size' have been dropped.

In [17]:
# Display records with 'unknown' ingredients
df[df['ingredients'] == 'unknown']

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer
844,2352524,Biossance,Face Oils,Squalane + Vitamin C Rose Oil,1.01 oz/ 30 mL,4.5,629,60000,72.0,72.0,https://www.sephora.com/product/squalane-vitamin-c-rose-oil-P416563?icid2=products grid:p416563,True,limited edition · exclusive,no options,What it is: Reveal your most radiant skin yet with influencer Aimee Song’s favorite brightening...,Suggested Usage:-Massage a few drops onto cleansed face and neck- following serum and moisturize...,unknown,0,1,1,0
2160,2167435,Dior,Face Serums,Capture Youth Intense Rescue Age-Delay Revitalizing Oil-Serum,1 oz/ 30 mL,3.5,3,0,95.0,95.0,https://www.sephora.com/product/capture-youth-intense-rescue-age-delay-revitalizing-oil-serum-P4...,False,0,no options,What it is: A nourishing oil-serum that protects and strengthens the skin's natural barrier. \...,Suggested Usage:-Apply the Intense Rescue Serum daily before your cream for targeted treatment.-...,unknown,0,0,0,0
2194,2167443,Dior,Eye Creams & Treatments,Capture Youth Age-Delay Advanced Eye Treatment,0.5 oz/ 15 mL,4.0,20,1700,75.0,75.0,https://www.sephora.com/product/capture-youth-age-delay-advanced-eye-treatment-P440923?icid2=pro...,False,0,no options,What it is: An eye treatment that brightens- smooths- and depuffs the eye area.Skin Type: Normal...,Suggested Usage:-Gently massage from the eye contour to the upper eyelid- every morning and even...,unknown,0,0,0,0
2202,2232247,Dior,Facial Peels,Capture Youth New Skin Effect Enzyme Solution Age-Delay Resurfacing Water,5 oz/ 150 mL,4.5,7,2000,65.0,65.0,https://www.sephora.com/product/capture-youth-new-skin-effect-enzyme-solution-age-delay-resurfac...,False,0,no options,What it is: A lotion that clarifies- brightens- and hydrates the skin- leaving it silky soft.Sk...,Suggested Usage:-Pour into hand and gently pat on with your fingertips or saturate a cotton pad ...,unknown,0,0,0,0
2227,2264786,Dior,Face Serums,Capture Dreamskin Care & Perfect Refill,1.6 oz/ 50 mL,5.0,16,5100,128.0,128.0,https://www.sephora.com/product/capture-totale-dreamskin-advanced-refill-P416728?icid2=products ...,True,online only,no options,What it is: A refill of the skin-perfecting emulsion to hydrate- blur- and even out skin tone- o...,Suggested Usage:-Apply this emulsion as the last step of your skincare routine to perfect your c...,unknown,1,0,0,0
2423,2353225,Dr. Dennis Gross Skincare,Face Masks,DRx Blemish Solutions™ Clarifying Mask with Colloidal Sulfur,1 oz/ 30 g,4.5,344,12600,28.0,28.0,https://www.sephora.com/product/clarifying-colloidal-sulfur-mask-P375270?icid2=products grid:p37...,False,0,no options,What it is:An overnight acne medication that treats blackheads and pores while absorbing excess ...,Suggested Usage:-Cleanse skin thoroughly before applying and apply a thin layer to the affected ...,unknown,0,0,0,0
2674,2340107,Erno Laszlo,Face Wash & Cleansers,The Famous Black Bar,3.4 oz/ 100 g,4.5,84,5500,38.0,38.0,https://www.sephora.com/product/sea-mud-deep-cleansing-bar-P392631?icid2=products grid:p392631,True,online only,no options,What it is: A detoxifying facial bar that combines Dead Sea mud and charcoal to cleanse- draw o...,Suggested Usage:-Use directly after cleansing oil- morning and night.,unknown,1,0,0,0
2706,2183523,Estée Lauder,Moisturizers,Resilience Multi-Effect Tri-Peptide Face and Neck Crème,1.7 oz/ 50 mL,3.5,20,5200,95.0,95.0,https://www.sephora.com/product/resilience-lift-firming-sculpting-face-neck-creme-broad-spectrum...,True,online only,no options,What it is: An intensely nourishing- multi-effect moisturizer- infused with tri-peptide complex-...,Suggested Usage:-Apply AM after your serum.,unknown,1,0,0,0
2712,2340396,Estée Lauder,Mists & Essences,Micro Essence Skin Activating Treatment Lotion Fresh with Sakura Ferment,6.8 oz/ 200 mL,0.0,0,613,120.0,120.0,https://www.sephora.com/product/estee-lauder-micro-essence-skin-activating-treatment-lotion-fres...,True,online only,no options,"What it is: A fresh- water-light ""essence in lotion"" that is infused with sakura ferment for sm...",Suggested Usage:-Apply to clean skin with a cotton pad each morning and evening.,unknown,1,0,0,0
2738,2183507,Estée Lauder,Face Serums,Perfectionist Pro Rapid Renewal Retinol Treatment,1 oz/ 30 mL,3.0,1,380,85.0,85.0,https://www.sephora.com/product/perfectionist-pro-rapid-renewal-retinol-treatment-P440648?icid2=...,True,online only,no options,What it is: A formula that helps to renew radiance and reduce the look of fine lines. Skin Type:...,Suggested Usage:-Massage a pea-sized amount over clean- dry skin at night before moisturizer. -A...,unknown,1,0,0,0


In [18]:
# Display number of records with 'unknown' ingredients
df[df['ingredients'] == 'unknown'].shape

(29, 21)

We will drop records with 'unknown' ingredients because we will be using 'ingredients' as one of our features.

In [19]:
# Drop records with 'unknown' ingredients
df.drop(df[df['ingredients'] == 'unknown'].index, axis = 0, inplace = True)

In [20]:
# Check records with 'unknown' ingredients after cleaning
df[df['ingredients'] == 'unknown']

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer


All records with 'unknown' ingredients have been dropped.

In [21]:
# Reset index
df.reset_index(drop=True, inplace=True)

## 1.5. Investigate Duplicates

We will check for duplicate records in the dataset in this section. 

In [22]:
# Check for exact duplicates
df[df.duplicated()]

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer


While there are no exact duplicates, we also want to check whether there are duplicates on 'id'.

In [23]:
# Check for duplicate records on 'id'
df[df.duplicated(subset=['id'], keep = False)]     # keep = False to show all duplicates

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer
222,2163400,CLINIQUE,Moisturizers,Clinique iD™ Custom-Blend Hydrator Collection,4.2 oz/ 125 mL,3.5,185,44600,39.0,39.0,https://www.sephora.com/product/clinique-id-your-custom-blend-hydrator-P439643?icid2=products gr...,False,0,no options,What it is: A revolutionary- custom-blend hydrator that gives you the power to hydrate your way...,Suggested Usage:How to assemble:-Unscrew the top on the Hydration Base.-For initial use- remove ...,-Taurine: Helps support skin’s energy and revive its visible glow.-Salicyclic and Glycolic Acid...,0,0,0,0
255,2163400,CLINIQUE,Moisturizers,Clinique iD™: Moisturizer + Concentrate for Fatigue,4.2 oz/ 125 mL,3.0,4,804,39.0,39.0,https://www.sephora.com/product/clinique-id-your-custom-blend-hydrator-for-fatigue-P439646?icid2...,False,0,no options,What it is: A first-of-its-kind custom blend of Dramatically Different™ Moisturizing BB-gel wit...,Suggested Usage:-Cleansed- exfoliated skin accepts moisture better- so for best results- apply a...,-Taurine: Helps support skin’s energy and revive its visible glow.-Salicyclic and Glycolic Acid...,0,0,0,0
332,2028397,Dior,Face Serums,Capture Youth Serum Collection,1 oz/ 30 mL,4.5,102,31500,95.0,95.0,https://www.sephora.com/product/capture-youth-serum-P428255?icid2=products grid:p428255,False,0,no options,Which skin type is it good for?✔ Normal✔ Oily✔ Combination✔ Dry✔ Sensitive\n\nWhat it is: An age...,Suggested Usage:-Apply the serum(s) of your choice daily before your cream for targeted treatmen...,Aqua (Water)- Dipropylene Glycol- Glycerin- Glycolic Acid- 3-O-Ethyl Ascorbic Acid- Ppg-26-Butet...,0,0,0,0
334,2028397,Dior,Face Serums,Capture Youth Glow Booster Age-Delay Illuminating Serum,1 oz/ 30 mL,4.5,14,0,95.0,95.0,https://www.sephora.com/product/capture-youth-glow-booster-age-delay-illuminating-serum-P427731?...,False,0,no options,Which skin type is it good for?✔ Normal✔ Combination\nWhat it is: An anti-fatigue weapon against...,Suggested Usage:-Apply the Glow Booster Serum daily before your cream for targeted treatment.-Fo...,Aqua (Water)- Dipropylene Glycol- Glycerin- Glycolic Acid- 3-O-Ethyl Ascorbic Acid- Ppg-26-Butet...,0,0,0,0
455,1723881,Dr. Jart+,Face Masks,Sheet Masks,1 Mask,4.0,422,85800,6.0,6.0,https://www.sephora.com/product/sheet-masks-P429529?skuId=1723881&icid2=products grid:p429529,False,0,no options,Which skin type is it good for?✔ Normal✔ Oily✔ Combination✔ Dry✔ SensitiveWhat it is:A collectio...,Suggested Usage:Sheet masks:-Remove film and apply mask over cleansed face. -Leave on for 15 to ...,-Aquaxyl: Derived from plant glucose; replenishes and protects vital moisture in your skin.\n-X...,0,0,0,0
456,2021897,Dr. Jart+,Face Masks,Shake & Shot™ Rubber Masks,1 mask,4.0,704,68200,12.0,12.0,https://www.sephora.com/product/shake-shot-rubber-masks-P428658?icid2=products grid:p428658,False,0,no options,Which skin type is it good for?\n✔ Normal\n✔ Oily\n✔ Combination\n✔ Dry\n✔ Sensitive\n\nWhat it ...,Suggested Usage:-Open and remove the spatula from the lid. -Combine both STEP 01 Super Booster a...,-Vitamin C (Ascorbic Acid): Brightens the look of skin tone and texture.\n-Sea Buckthorn Fruit E...,0,0,0,0
460,2196467,Dr. Jart+,Face Serums,Focuspot™ Micro Tip™ Patches,6 patches,3.0,405,38800,18.0,18.0,https://www.sephora.com/product/focuspot-micro-tip-patches-P442857?icid2=products grid:p442857,True,exclusive,no options,What it is: A set of self-dissolving micro tips that melt deep into the skin surface to target s...,Suggested Usage:-Open the pouch and carefully remove patch from tray. Remove the white film. Do...,Sodium Hyaluronate- Trehalose- Clitoria Ternatea Flower Extract- Niacinamide- Oligopeptide-76- W...,0,1,0,0
461,1723881,Dr. Jart+,Sheet Masks,Dermask Water Jet Vital Hydra Solution™,1 Mask,4.5,162,13500,6.0,6.0,https://www.sephora.com/product/dermask-water-jet-vital-hydra-solution-P397623?icid2=products gr...,False,0,no options,Which skin type is it good for?\n✔ Normal\n✔ Oily\n✔ Combination\n✔ Dry\n✔ Sensitive\n\nWhat it ...,Suggested Usage:\n-Remove film and apply mask over cleansed face. \n-Leave on for 15 to 30 minut...,-Aquaxyl: Derived from plant glucose; replenishes and protects vital moisture in your skin.\n-X...,0,0,0,0
467,2196467,Dr. Jart+,Face Serums,Focuspot™ Blemish Micro Tip™ Patch,6 patches,4.0,91,0,18.0,18.0,https://www.sephora.com/product/focuspot-blemish-micro-tip-patch-P442853?icid2=products grid:p44...,True,exclusive,no options,What it is: A set of self-dissolving micro tips that melt deep into the skin surface to target a...,Suggested Usage:-Open the pouch and carefully remove patch from tray. Remove the white film. Do...,Sodium Hyaluronate- Trehalose- Clitoria Ternatea Flower Extract- Niacinamide- Oligopeptide-76- W...,0,1,0,0
471,2021897,Dr. Jart+,Face Masks,Shake & Shot™ Rubber Brightening Mask,1 mask,4.5,17,0,12.0,12.0,https://www.sephora.com/product/shake-shot-rubber-brightening-mask-P428655?icid2=products grid:p...,False,0,no options,Which skin type is it good for?✔ Normal✔ Oily✔ Combination✔ Dry✔ SensitiveWhat it is:A fun- DIY ...,Suggested Usage:-Open and remove the spatula from the lid. -Combine both STEP 01 Super Booster a...,-Vitamin C (Ascorbic Acid): Brightens the look of skin tone and texture.\n-Sea Buckthorn Fruit E...,0,0,0,0


As seen above, some records share the same 'id' but differ in 'name', 'category' or 'url'. Hence, 'id' alone may not be a reliable indicator of whether a record is unique. We will check for duplicates on 'id', 'brand', 'ingredients', 'price' and 'value_price'.

In [24]:
# Check for duplicate records on 'id', 'brand', 'ingredients', 'price' and 'value_price'
df[df.duplicated(subset=['id', 'brand','ingredients','price','value_price'], keep = False)].sort_values('id')

Unnamed: 0,id,brand,category,name,size,rating,number_of_reviews,love,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer
455,1723881,Dr. Jart+,Face Masks,Sheet Masks,1 Mask,4.0,422,85800,6.0,6.0,https://www.sephora.com/product/sheet-masks-P429529?skuId=1723881&icid2=products grid:p429529,False,0,no options,Which skin type is it good for?✔ Normal✔ Oily✔ Combination✔ Dry✔ SensitiveWhat it is:A collectio...,Suggested Usage:Sheet masks:-Remove film and apply mask over cleansed face. -Leave on for 15 to ...,-Aquaxyl: Derived from plant glucose; replenishes and protects vital moisture in your skin.\n-X...,0,0,0,0
461,1723881,Dr. Jart+,Sheet Masks,Dermask Water Jet Vital Hydra Solution™,1 Mask,4.5,162,13500,6.0,6.0,https://www.sephora.com/product/dermask-water-jet-vital-hydra-solution-P397623?icid2=products gr...,False,0,no options,Which skin type is it good for?\n✔ Normal\n✔ Oily\n✔ Combination\n✔ Dry\n✔ Sensitive\n\nWhat it ...,Suggested Usage:\n-Remove film and apply mask over cleansed face. \n-Leave on for 15 to 30 minut...,-Aquaxyl: Derived from plant glucose; replenishes and protects vital moisture in your skin.\n-X...,0,0,0,0
1469,1764950,SEPHORA COLLECTION,Sheet Masks,Face Mask - Pearl - Brightening,1 Mask,4.5,86,4800,6.0,6.0,https://www.sephora.com/product/face-mask-P410159?icid2=products grid:p410159,True,exclusive,no options,To redeem your offer: \n1. SELECT FOUR MASKS: Choose any 4 face masks. Duplicates of the same ma...,Suggested Usage:-Unfold the mask.\n-Apply the mask onto a cleansed- dried face.\n-After 15 minut...,-Pearl Extract of Natural Origin: Provides remineralizing properties against dull- fatigued ski...,0,1,0,0
1468,1764950,SEPHORA COLLECTION,Sheet Masks,Face Mask,1 Mask,4.5,1000,106900,6.0,6.0,https://www.sephora.com/product/face-mask-P407006?icid2=products grid:p407006,True,exclusive,no options,To redeem your offer: \n1. SELECT FOUR MASKS: Choose any 4 face masks. Duplicates of the same ma...,Suggested Usage:\n-Unfold the mask and peel off the pink film.\n-Apply the mask to clean skin.\n...,-Pearl Extract of Natural Origin: Provides remineralizing properties against dull- fatigued ski...,0,1,0,0
1428,1802677,SEPHORA COLLECTION,Face Wipes,Cleansing & Exfoliating Wipes,25 Wipes,4.5,3000,224700,8.0,8.0,https://www.sephora.com/product/cleansing-exfoliating-wipes-P409800?icid2=products grid:p409800,True,exclusive,no options,What it is: A collection of cleansing and exfoliating wipes that are the ideal solution for cle...,Suggested Usage:Cleansing Wipes: -Gently wipe over the face and eyes. -No need to rinse. -Close ...,-Coconut Water Extract.\n\n Water- Caprylic/Capric Triglyceride- Fragrance- Phenoxyethanol- Gly...,0,1,0,0
1430,1802677,SEPHORA COLLECTION,Face Wipes,Cleansing Wipes - Coconut Water,25 Wipes,4.5,81,0,8.0,8.0,https://www.sephora.com/product/cleansing-exfoliating-wipes-P410144?icid2=products grid:p410144,True,exclusive,no options,no details,no instructions,-Coconut Water Extract.\n\n Water- Caprylic/Capric Triglyceride- Fragrance- Phenoxyethanol- Gly...,0,1,0,0
1429,1973700,SEPHORA COLLECTION,Face Masks,Face Mask,1 Mask,4.5,1000,117100,6.0,6.0,https://www.sephora.com/product/face-mask-P429722?icid2=products grid:p429722,True,exclusive,no options,To redeem your offer: \n1. SELECT FOUR MASKS: Choose any 4 face masks. Duplicates of the same ma...,Suggested Usage:-Unfold the mask and peel off the pink film. -Apply the mask to clean skin. -Lea...,-Aloe Vera Extract of natural origin: Helps skin stay well hydrated. Water- Butylene Glycol- Al...,0,1,0,0
1445,1973700,SEPHORA COLLECTION,Face Masks,Face Mask - Aloe Vera - Replenishing,1 Mask,4.5,17,0,6.0,6.0,https://www.sephora.com/product/face-mask-aloe-vera-P432068?icid2=products grid:p432068,True,exclusive,no options,To redeem your offer: \n1. SELECT FOUR MASKS: Choose any 4 face masks. Duplicates of the same ma...,Suggested Usage:-Unfold the mask and peel off the pink film. -Apply the mask to clean skin. -Lea...,-Aloe Vera Extract of natural origin: Helps skin stay well hydrated. Water- Butylene Glycol- Al...,0,1,0,0
1722,1973841,SEPHORA COLLECTION,Eye Masks,Eye Mask - Grape - Smoothing,1 Pair,3.5,3,0,5.0,5.0,https://www.sephora.com/product/eye-mask-grape-P432073?icid2=products grid:p432073,True,exclusive,no options,To redeem your offer: \n1. SELECT FOUR MASKS: Choose any 4 eye masks. Duplicates of the same mas...,Suggested Usage:-Remove the protective film from both white fiber patches. -Position the patches...,Water- Butylene Glycol- Glycerin- Aloe Barbadensis Leaf Juice- 1-2-Hexanediol- Hydroxyacet...,0,1,0,0
1434,1973841,SEPHORA COLLECTION,Eye Masks,Eye Mask,1 Pair,4.0,827,57300,5.0,5.0,https://www.sephora.com/product/eye-mask-P429721?icid2=products grid:p429721,True,exclusive,no options,To redeem your offer: \n1. SELECT FOUR MASKS: Choose any 4 eye masks. Duplicates of the same mas...,Suggested Usage:\n-Remove the protective film from both white fiber patches. \n-Position the pat...,Water- Butylene Glycol- Glycerin- Aloe Barbadensis Leaf Juice- 1-2-Hexanediol- Hydroxyacet...,0,1,0,0


The check on 'id', 'brand', 'ingredients', 'price' and 'value_price' confirms that these records are indeed duplicates. Although they have different 'url' values, they are essentially the same product listed with different 'rating', 'number_of_reviews' and 'love' values.

We will collapse these duplicated records into one record per 'id' by taking the mean of 'rating', the sum of 'number_of_reviews' and the sum of 'love' across the same 'id'.

In [25]:
# Group records by 'id' and calculate mean on 'rating'
df['av_rating'] = df.groupby('id').rating.transform('mean')

In [26]:
# Group records by 'id' and calculate sum on 'number_of_reviews'
df['sum_reviews'] = df.groupby('id').number_of_reviews.transform('sum')

In [27]:
# Group records by 'id' and calculate sum on 'love'
df['sum_love'] = df.groupby('id').love.transform('sum')

The duplicate records can now be dropped without information loss. The 'rating', 'number_of_reviews' and 'love' columns can also be dropped since their values have already been aggregated into 'av_rating', 'sum_reviews' and 'sum_love'.

In [28]:
# Drop duplicate records on 'id', 'brand', 'ingredients', 'price' and 'value_price'
# keep = 'first' drops all duplicates except the first occurrence
df.drop_duplicates(subset=['id', 'brand','ingredients','price','value_price'], keep = 'first', inplace = True)  

In [29]:
# Drop 'rating', 'number_of_reviews' and 'love' columns
df.drop(columns = ['rating', 'number_of_reviews', 'love'], inplace = True)

In [30]:
# Check for duplicate records on 'id' after cleaning
df[df.duplicated(subset=['id'], keep = False)]

Unnamed: 0,id,brand,category,name,size,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer,av_rating,sum_reviews,sum_love


The duplicate records have been dropped.
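The collapse described above (mean of 'rating', sum of 'number_of_reviews' and 'love' per 'id') can also be sketched as a single named aggregation. This is a minimal illustration on made-up rows, not the notebook's data; the frame `toy` and its values are assumptions.

```python
import pandas as pd

# Hypothetical toy data: two listings share id 1, one listing has id 2
toy = pd.DataFrame({
    'id': [1, 1, 2],
    'name': ['Face Mask', 'Face Mask - Pearl', 'Toner'],
    'rating': [4.5, 3.5, 4.0],
    'number_of_reviews': [1000, 86, 10],
    'love': [106900, 4800, 50],
})

# One row per 'id': keep the first name, average the ratings, sum the counts
collapsed = toy.groupby('id', as_index=False).agg(
    name=('name', 'first'),
    av_rating=('rating', 'mean'),
    sum_reviews=('number_of_reviews', 'sum'),
    sum_love=('love', 'sum'),
)
print(collapsed)
```

Note that an unweighted mean gives a listing with 17 reviews the same influence on 'av_rating' as one with 1,000 reviews; a review-weighted mean would be one alternative.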

In [31]:
# Reset index
df.reset_index(drop=True, inplace=True)

In [32]:
# Preview column names
df.columns

Index(['id', 'brand', 'category', 'name', 'size', 'price', 'value_price',
       'url', 'marketingflags', 'marketingflags_content', 'options', 'details',
       'how_to_use', 'ingredients', 'online_only', 'exclusive',
       'limited_edition', 'limited_time_offer', 'av_rating', 'sum_reviews',
       'sum_love'],
      dtype='object')

## 1.6. Investigate 'Refill' Products

Refills contain the same ingredients as their parent products. Hence, we will identify the refill products and drop them.

In [33]:
# Extract records where 'name' contains 'refill'
refill = df[df['name'].str.lower().str.contains('refill')]
refill

Unnamed: 0,id,brand,category,name,size,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer,av_rating,sum_reviews,sum_love
359,1790245,Dior,Moisturizers,Capture Totale Multi-Perfection Creme Refill,2 oz/ 60 mL,140.0,140.0,https://www.sephora.com/product/capture-totale-multi-perfection-creme-refill-P414794?icid2=produ...,True,online only,no options,What it is:A face cream that delivers immediate and lasting comfort- and corrects visible signs ...,Suggested Usage:-Apply morning and evening to face and neck after One Essential and/or Capture T...,Aqua (Water)- Limnanthes Alba (Meadowfoam) Seed Oil- Isotridecyl Isononanoate- Glycerin- Isonony...,1,0,0,0,5.0,4,1000
1702,2320687,Yves Saint Laurent,Face Serums,Pure Shots Night Reboot Resurfacing Serum Refill,1 oz/ 30 mL,70.0,70.0,https://www.sephora.com/product/yves-saint-laurent-pure-shots-night-reboot-resurfacing-serum-ref...,True,online only,no options,What it is: An environmentally-conscious refillable package to refill your Pure Shots Night Reb...,Suggested Usage:-To release the refill from the outer packaging- twist the inner capsule to the ...,-Glycolic Acid: Exfoliates for improved skin radiance and visible clarity- revealing a youthful-...,1,0,0,0,0.0,0,210
1703,2320653,Yves Saint Laurent,Moisturizers,Pure Shots Perfect Plumper Face Cream Refill,1.6 oz/ 50 mL,70.0,70.0,https://www.sephora.com/product/yves-saint-laurent-pure-shots-perfect-plumper-face-cream-refill-...,True,online only,no options,What it is: An environmentally-conscious refillable packaging so you can refill your Perfect Pl...,Suggested Usage:-Scoop a small amount of cream using the included spatula.\n-Warm by softly rubb...,-98-Percent Pure Anti-Aging Ribose: Helps improve the appearance of skin elasticity and helps to...,1,0,0,0,0.0,0,390
1704,2320661,Yves Saint Laurent,Face Serums,Pure Shots Lines Away Anti-Aging Serum Refill,1 oz/ 30 mL,70.0,70.0,https://www.sephora.com/product/yves-saint-laurent-pure-shots-lines-away-anti-aging-serum-refill...,True,online only,no options,What it is: An environmentally-conscious refill package of Lines Away Anti-Aging Serum to refil...,Suggested Usage:-To release the refill from the outer packaging- twist the inner capsule to the ...,-High and Low Molecular Weight Hyaluronic Acid: Helps provide short- and long-term hydration. -I...,1,0,0,0,5.0,1,110
1705,2320679,Yves Saint Laurent,Face Serums,Pure Shots Light Up Brightening Serum Refill,1 oz/ 30 mL,70.0,70.0,https://www.sephora.com/product/yves-saint-laurent-pure-shots-light-up-brightening-serum-refill-...,True,online only,no options,What it is: An environmentally-conscious refillable package to refill your Pure Shots Light Up ...,Suggested Usage:-To release the refill from the outer packaging- twist the inner capsule to the ...,-Vitamin Cg: Helps to reduce the look of yellowness and dark spots while brightening the skin. -...,1,0,0,0,0.0,0,72
1706,2320646,Yves Saint Laurent,Face Serums,Pure Shots Y Shape Firming Serum Refill,1 oz/ 30 mL,70.0,70.0,https://www.sephora.com/product/yves-saint-laurent-pure-shots-y-shape-firming-serum-refill-P4540...,True,online only,no options,What it is: An environmentally-conscious refillable package to refill your Pure Shots Y Shape F...,Suggested Usage:-To release the refill from the outer packaging- twist the inner capsule to the ...,-100-Percent Pure Firming Peptide: Helps visibly smooth- nurture- and hydrate skin. -Barbary Fig...,1,0,0,0,0.0,0,45


In [34]:
# Drop records where 'name' contains 'refill' 
df.drop(refill.index, axis = 0, inplace = True)

In [35]:
# Check for records where 'name' contains 'refill' after cleaning
df[df['name'].str.lower().str.contains('refill')]

Unnamed: 0,id,brand,category,name,size,price,value_price,url,marketingflags,marketingflags_content,options,details,how_to_use,ingredients,online_only,exclusive,limited_edition,limited_time_offer,av_rating,sum_reviews,sum_love


All records where 'name' contains 'refill' have been dropped.

In [36]:
# Reset index
df.reset_index(drop=True, inplace=True)

## 1.7. Handling Redundant Columns

We will drop columns that are not key features of interest.

In [37]:
# Drop 'options', 'how_to_use' and the marketing flag columns
df.drop(columns = ['options', 'how_to_use', 'marketingflags', 'marketingflags_content',
       'online_only', 'exclusive', 'limited_edition', 'limited_time_offer'], inplace = True)

The 'price' column will be dropped since it reflects discounted prices. We will instead retain 'value_price', which is the original price of the products.

In [38]:
# Drop 'price' column
df.drop(columns = ['price'], inplace = True)

## 1.8. New Features

In this section, we engineer more features for our model to learn from.

In [39]:
# Create new column by combining 'brand' and 'name'
df['product_name'] = df['brand'] + ' ' + df['name']

The 'details' column contains useful information that we can extract as features. However, as the data is not sufficiently clean, we will create a function to clean the text in the 'details' column before extracting the features.

In [40]:
# Define a function to clean text
import re
import string

def cleaner(text):
    # Remove punctuation (keep only characters that are not in string.punctuation)
    text = ''.join([char for char in text if char not in string.punctuation])
    
    # Convert text to lower case
    text = text.lower()
    
    # Remove characters beyond Basic Multilingual Plane (BMP) of Unicode (e.g. maths symbols)
    text = ''.join(char for char in text if char <= '\uFFFF') 
    
    # Normalise unicodedata
    #text= unicodedata.normalize("NFKD", text)

    # Remove non-ASCII characters
    text = text.encode('ascii', 'ignore').decode('ascii')
    
    # Collapse whitespace (including new line characters) - commented out since it is not required in our extraction of features  
    #text = re.sub(r'\s\s+', ' ', text)
    
    # Replace non-letters with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    
    return text

In [41]:
# Apply cleaner function to clean 'details' in order to extract features
df['details'] = df['details'].apply(cleaner)
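To see what the cleaning step produces, the sketch below applies the same sequence of operations to a short string modelled on the 'details' text shown earlier. The sample string and the helper name `clean_text` are assumptions for illustration. It shows why fused tokens such as 'sensitivewhat' appear after cleaning: the source text has no space between 'Sensitive' and 'What it is'.

```python
import re
import string

def clean_text(text):
    # Same steps as the cleaner above: punctuation, lower case, non-ASCII, non-letters
    text = ''.join(char for char in text if char not in string.punctuation)
    text = text.lower()
    text = text.encode('ascii', 'ignore').decode('ascii')
    return re.sub('[^a-zA-Z]', ' ', text)

# Hypothetical sample modelled on the scraped 'details' text
sample = 'Which skin type is it good for?✔ Dry✔ SensitiveWhat it is:A fun- DIY mask'
tokens = clean_text(sample).split()
print(tokens)
```

Because the check mark is dropped without inserting a space and 'SensitiveWhat' is already fused in the source, the term lists used later include variants such as 'sensitivewhat' and 'drywhat'.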

In [42]:
# Create a function to search for features in 'details' and return 1 if feature is found, otherwise 0
# Note: matching is done on whole whitespace-separated tokens, so multi-word terms cannot match
def create_col(feature_group, feature, term_list, col):
    terms = set(term_list)
    # Split the text by whitespace into tokens and flag the row if any token is in the term list
    df[feature_group + '_' + feature] = df[col].apply(
        lambda text: int(any(token in terms for token in text.split())))
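Since the function above compares terms against single whitespace-separated tokens, a term containing a space (for example 'vitamin c') cannot match. One possible alternative, sketched here under the assumption that whole-word or whole-phrase matching is the intent, is a word-boundary regular expression. The helper `flag_terms` and the sample texts are hypothetical and are not part of the notebook.

```python
import re
import pandas as pd

def flag_terms(texts, term_list):
    """Return 1 where any term occurs as a whole word or phrase, else 0."""
    # Escape each term and wrap the alternatives in a non-capturing group
    pattern = r'\b(?:' + '|'.join(re.escape(term) for term in term_list) + r')\b'
    return texts.str.contains(pattern, regex=True).astype(int)

# Hypothetical cleaned 'details' snippets
details = pd.Series([
    'rich in vitamin c and retinol',
    'a gentle multivitamin cream',
    'no actives listed',
])
flags = flag_terms(details, ['vitaminc', 'vitamin c'])
print(flags.tolist())
```

The word boundaries keep 'multivitamin cream' from being flagged as 'vitamin c', which a plain substring search would do.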

In [43]:
# Create new features based on skintype
create_col('skintype', 'sensitive', ['sensitive', 'sensitivewhat', 'mild', 'gentle', 'irritation'], 'details')
create_col('skintype', 'combination', ['combination', 'combinationwhat', 'balance', 'balanced'], 'details')
create_col('skintype', 'normal', ['normal', 'normalwhat'], 'details')
create_col('skintype', 'dry', ['dry', 'drywhat', 'dryness', 'drynessformulation', 'moisture', 'hydration', 'hydrate', 
                               'hydrating', 'ultrahydrating', 'moisturized','moisturizing'], 'details')
create_col('skintype', 'oily', ['oily', 'oilywhat', 'oiliness', 'oilinessformulation', 'mattifying', 'lightweight'], 'details')

In [44]:
# Check for list of products with no skintype
no_skintype= df[(df['skintype_sensitive'] == 0) & (df['skintype_combination'] == 0) & (df['skintype_normal'] == 0) & 
                 (df['skintype_dry'] == 0) & (df['skintype_oily'] == 0)]

In [45]:
# Assign 'normal' as the default skintype for products where no skintype was found
df.loc[no_skintype.index, 'skintype_normal'] = 1

In [46]:
# Create new features based on concerns
create_col('concerns', 'dryness', ['dryness', 'drynessformulation', 'moisture', 'hydration', 'hydrate', 'hydrating', 
                                   'ultrahydrating', 'moisturized','moisturizing'], 'details')
create_col('concerns', 'dullness', ['dullness', 'dullnessformulation', 'uneven', 'smoothen', 'brighten', 'brightens', 
                                    'brightening', 'radiance'], 'details')
create_col('concerns', 'elasticity', ['elasticity', 'elasticityformulation', 'firmness', 'elastin'], 'details')
create_col('concerns', 'darkspots', ['spots', 'spot', 'darkspot', 'darkspots', 'melatonin'], 'details')
create_col('concerns', 'darkcircles', ['circles', 'circle', 'darkcircle', 'darkcircles' ], 'details')
create_col('concerns', 'puffiness', ['puffiness', 'puffinessformulation'], 'details')
create_col('concerns', 'pores', ['pores', 'pore', 'poreformulation', 'poresformulation'], 'details')
create_col('concerns', 'wrinkles', ['wrinkles', 'wrinkle', 'antiwrinkle', 'antiwrinkles', 'fine', 'line', 
                                    'lines', 'linewrinkles', 'linewrinkle', 'lineswrinkles', 'smooth', 'smoothen'], 'details')
create_col('concerns', 'aging', ['aging', 'antiaging', 'ageing', 'antiageing', 'youth', 'youthful',
                                'youthfullooking'], 'details')
create_col('concerns', 'redness', ['redness', 'red', 'rednessformulation', 'redformulation'], 'details')
create_col('concerns', 'oiliness', ['oiliness', 'oilinessformulation', 'mattifying', 'lightweight'], 'details')
create_col('concerns', 'acne', ['acne', 'acnes', 'blemish', 'blemishes', 'acneformulation', 'acnesformulation',
                               'blemishformulation', 'blemishesformulation', 'noncomedogenic', 'mattifying'], 'details')

In [47]:
# Check for list of products with no concerns
no_concerns = df[(df['concerns_dryness'] == 0) & (df['concerns_dullness'] == 0) & (df['concerns_elasticity'] == 0) & 
                 (df['concerns_darkspots'] == 0) & (df['concerns_darkcircles'] == 0) & (df['concerns_puffiness'] == 0) &
                 (df['concerns_pores'] == 0) & (df['concerns_wrinkles'] == 0) & (df['concerns_aging'] == 0) & 
                 (df['concerns_redness'] == 0) & (df['concerns_oiliness'] == 0) & (df['concerns_acne'] == 0)]

In [48]:
# Flag products where none of the above concerns was found as 'others'
df['concerns_others'] = 0
df.loc[no_concerns.index, 'concerns_others'] = 1

In [49]:
# Create new features based on preferences 
create_col('pref', 'vegan', ['vegan', 'vegans', 'veganwhat', 'veganwhats'], 'details')
create_col('pref', 'crueltyfree', ['cruelty', 'crueltyfree', 'crueltywhat', 'crueltyfreewhat'], 'details')
create_col('pref', 'glutenfree', ['gluten', 'glutenfree', 'glutens', 'glutenwhat', 'glutenfreewhat',
                                  'glutenswhat'], 'details')
create_col('pref', 'antioxidant', ['antioxidants', 'antioxidant', 'antioxidantwhat', 'antioxidantswhat'], 'details')
create_col('pref', 'hydration', ['hydration', 'hydrating', 'hydrate','hydrates'], 'details')

In [50]:
# Create new features based on skincare acids
create_col('skincareacids', 'hyaluronicacid', ['hyaluronic', 'hyaluronicwhat', 'hyaluronicacid', 'hyaluronicacids'], 'details')
create_col('skincareacids', 'salicylicacid', ['salicylic', 'salicylicwhat', 'salicylicacid', 'salicylicacids'], 'details')
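# 'details' has been lower-cased by cleaner, so the search term must be 'aha' rather than 'AHA'
create_col('skincareacids', 'AHA', ['aha', 'glycolicacid', 'glycolic', 'glycolicacidwhat',
                                   'glycolicacids'], 'details')
# Note: create_col matches single tokens, so the multi-word terms below (e.g. 'vitamin c') never match
create_col('skincareacids', 'vitaminc', ['vitaminc', 'vitamin c', 'vitamins c'], 'details')
create_col('skincareacids', 'retinol', ['retinol', 'retinoid', 'vitamins a', 'vitamin a'], 'details')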

In [51]:
# Create new features based on ingredients excluded
create_col('excluded', 'parabens', ['paraben', 'parabens', 'parabenfree', 'parabensfree', 'parabenwhat', 
                                   'parabenswhat', 'parabenfreewhat', 'parabensfreewhat'], 'details')
create_col('excluded', 'sulfates', ['sulfate', 'sulfatefree', 'sulfates', 'sulfatewhat', 'sulfatefreewhat',
                                   'sulfateswhat', 'sulphate', 'sulphates', 'sulphatewhat', 'sulphateswhat'], 'details')
create_col('excluded', 'formaldehydes', ['formaldehyde', 'formaldehydes', 'formaldehydefree', 'formaldehydesfree', 
                                          'formaldehydewhat', 'formaldehydeswhat', 'formaldehydefreewhat', 
                                          'formaldehydesfreewhat'], 'details')
create_col('excluded', 'phthalates', ['phthalate', 'phthalates', 'phthalatefree', 'phthalatesfree', 
                                          'phthalatewhat', 'phthalateswhat', 'phthalatefreewhat', 
                                          'phthalatesfreewhat'], 'details')
create_col('excluded', 'silicones', ['silicone', 'silicones', 'siliconefree', 'siliconesfree', 
                                          'siliconewhat', 'siliconeswhat', 'siliconefreewhat', 
                                          'siliconesfreewhat'], 'details')

In [52]:
# Create new features based on formulation 
create_col('formulation', 'cream', ['cream', 'creamhighlighted', 'creamingredient'], 'details')
create_col('formulation', 'serum', ['serum', 'serumhighlighted', 'serumingredient'], 'details')
create_col('formulation', 'liquid', ['liquid', 'liquidhighlighted', 'liquidingredient'], 'details')
create_col('formulation', 'gel', ['gel', 'gelhighlighted', 'gelingredient'], 'details')
create_col('formulation', 'mask', ['mask', 'maskhighlighted', 'maskingredient'], 'details')
create_col('formulation', 'spray', ['spray', 'sprayhighlighted', 'sprayingredient'], 'details')
create_col('formulation', 'balm', ['balm', 'balmhighlighted', 'balmingredient'], 'details')

In [53]:
# Create new features based on whether the product is award winning
create_col('award', 'allure', ['award', 'awards'], 'details')

In [54]:
# Create new features based on whether the product has clinical results
create_col('clinical', 'results', ['clinicalresults', 'results', 'clinical', 'clinic', 'clinically', 
                                   'test', 'tests'], 'details')

As there are too many category types, we will collapse some of these categories.

In [55]:
# Check product count by category
df['category'].value_counts()

Moisturizers                 334
Face Serums                  301
Eye Creams & Treatments      175
Face Wash & Cleansers        175
Face Masks                   169
Lip Balms & Treatments        68
Face Sunscreen                62
Toners                        61
Mists & Essences              58
Face Oils                     55
Sheet Masks                   40
Facial Peels                  35
Exfoliators                   35
Night Creams                  28
Blemish & Acne Treatments     27
BB & CC Cream                 21
Face Wipes                    18
Makeup Removers               17
Eye Masks                     14
Body Sunscreen                13
Decollete & Neck Creams        9
Name: category, dtype: int64

In [56]:
# Combine some categories together 
df['category'] = df['category'].str.replace('Decollete & Neck Creams', 'Night Creams')
df['category'] = df['category'].str.replace('Eye Masks', 'Eye Creams & Treatments')
df['category'] = df['category'].str.replace('Body Sunscreen', 'Sunscreen')
df['category'] = df['category'].str.replace('Face Sunscreen', 'Sunscreen')
df['category'] = df['category'].str.replace('Face Masks', 'Masks')
df['category'] = df['category'].str.replace('Sheet Masks', 'Masks')
df['category'] = df['category'].str.replace('Makeup Removers', 'Face Wash & Cleansers')
df['category'] = df['category'].str.replace('Face Wipes', 'Face Wash & Cleansers')

In [57]:
# Combine some categories together (for those that did not work with str.replace)
# Exact-match assignment is used because str.replace matches substrings and can re-match inside the new label
df.loc[df['category'].isin(['Exfoliators', 'Facial Peels']), 'category'] = 'Exfoliators & Facial Peels'

In [58]:
# Combine some categories together (for those that did not work with str.replace)
df.loc[df['category'].isin(['Moisturizers', 'Night Creams', 'BB & CC Cream', 'Face Oils']),
       'category'] = 'Moisturizers & Creams'

In [59]:
# Check product count by category after cleaning
df['category'].value_counts()

Moisturizers & Creams         447
Face Serums                   301
Face Wash & Cleansers         210
Masks                         209
Eye Creams & Treatments       189
Sunscreen                      75
Exfoliators & Facial Peels     70
Lip Balms & Treatments         68
Toners                         61
Mists & Essences               58
Blemish & Acne Treatments      27
Name: category, dtype: int64
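The category collapse above can also be expressed as one mapping applied with `Series.replace`, which matches whole values rather than substrings. The sketch below uses a hypothetical four-row series and a partial mapping. It also shows one plausible way a chain of `str.replace` calls can go wrong when an old label is a substring of the new one.

```python
import pandas as pd

# Hypothetical subset of the category labels
categories = pd.Series(['Exfoliators', 'Facial Peels', 'Sheet Masks', 'Toners'])

# Whole-value mapping: each old label maps to its collapsed label
category_map = {
    'Exfoliators': 'Exfoliators & Facial Peels',
    'Facial Peels': 'Exfoliators & Facial Peels',
    'Face Masks': 'Masks',
    'Sheet Masks': 'Masks',
}
collapsed = categories.replace(category_map)

# Chained substring replacement re-matches 'Facial Peels' inside the new label
chained = (categories
           .str.replace('Exfoliators', 'Exfoliators & Facial Peels', regex=False)
           .str.replace('Facial Peels', 'Exfoliators & Facial Peels', regex=False))

print(collapsed.tolist())
print(chained.tolist())
```

Keeping the mapping in one dictionary also makes the grouping decisions easy to review in a single place.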

We inspected the 'size' column and noticed that different units (oz, mL and g) are used across products. Some products are also listed with more than one unit. We will need to convert the 'size' column to a common volumetric unit.

In [60]:
# Inspect 'size' column
df['size'].sample(10)

1526     4.2 oz/ 125 mL
777      0.33 oz/ 10 mL
1290               5 oz
1157        1 oz/ 30 mL
480              1 mask
1417    4.05 oz/ 120 mL
750         1 oz/ 30 mL
82       2.53 oz/ 75 mL
1087               6 oz
461              1 mask
Name: size, dtype: object

In [61]:
# Create a function to scan through the 'size' string for the first location where the regex pattern matches,
# and return the matched text. Return None if no position in the string matches the pattern.
# Source: https://docs.python.org/3/library/re.html#match-objects 
def return_match(pattern, text):
    m = re.search(pattern, text)
    if m:
        return m.group()
    return None

In [62]:
# Create separate columns to capture all volumetric units
df['size_oz']=df['size'].apply(lambda x: return_match('[0-9. ]+o[Zz]', x))
df['size_ml']=df['size'].apply(lambda x: return_match('[0-9. ]+m[Ll]', x))
df['size_g']=df['size'].apply(lambda x: return_match('[0-9. ]+g', x))
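As an alternative sketch, `Series.str.extract` with a capture group can pull out the number directly, so the unit suffix does not need to be stripped afterwards. The sample strings below are modelled on the 'size' values shown above; the variable names and the exact patterns are assumptions, not the notebook's code.

```python
import pandas as pd

# Hypothetical sample of 'size' strings
sizes = pd.Series(['4.2 oz/ 125 mL', '0.35 oz/ 10g', '5 oz', '1 mask'])

# Number pattern: optional integer part, optional decimal point, at least one digit
number = r'(\d*\.?\d+)'

# Capture only the number that directly precedes each unit
size_ml = pd.to_numeric(sizes.str.extract(number + r'\s*m[lL]', expand=False))
size_g = pd.to_numeric(sizes.str.extract(number + r'\s*g\b', expand=False))
size_oz = pd.to_numeric(sizes.str.extract(number + r'\s*o[zZ]', expand=False))

print(pd.DataFrame({'size': sizes, 'size_ml': size_ml, 'size_g': size_g, 'size_oz': size_oz}))
```

Rows without a given unit come back as NaN, which matches how the null checks in the following cells are used.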

In [63]:
# Inspect which 'size' is most dominant
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1715 entries, 0 to 1714
Data columns (total 58 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            1715 non-null   int64  
 1   brand                         1715 non-null   object 
 2   category                      1715 non-null   object 
 3   name                          1715 non-null   object 
 4   size                          1715 non-null   object 
 5   value_price                   1715 non-null   float64
 6   url                           1715 non-null   object 
 7   details                       1715 non-null   object 
 8   ingredients                   1715 non-null   object 
 9   av_rating                     1715 non-null   float64
 10  sum_reviews                   1715 non-null   int64  
 11  sum_love                      1715 non-null   int64  
 12  product_name                  1715 non-null   object 
 13  ski

Although 'oz' is the most common unit among the products, 'ml' is more commonly used in the local context. Hence, we will use 'ml' as the common unit for all products.

In [64]:
# Remove trailing and leading whitespace and associated volumetric unit
df['size_ml'] = df['size_ml'].str.replace('mL', '')
df['size_ml'] = df['size_ml'].str.replace('ml', '')
df['size_ml'] = df['size_ml'].str.strip()

df['size_g'] = df['size_g'].str.replace('g', '')
df['size_g'] = df['size_g'].str.strip()

df['size_oz'] = df['size_oz'].str.replace('oz', '')
df['size_oz'] = df['size_oz'].str.strip()

In [65]:
# Convert 'size_ml', 'size_g' and 'size_oz' into numeric type
df['size_ml'] = pd.to_numeric(df['size_ml'])
df['size_g'] = pd.to_numeric(df['size_g'])
df['size_oz'] = pd.to_numeric(df['size_oz'])

As we want to use product 'size_ml' as one of our features, we will remove products that cannot be converted to 'ml'.

In [66]:
# Display records where 'size_ml', 'size_g' and 'size_oz' are all null
cannot_convert = df[df['size_ml'].isnull() & df['size_g'].isnull() & 
                    df['size_oz'].isnull()][['name', 'size', 'size_ml', 'size_g','size_oz']]
cannot_convert.head()

Unnamed: 0,name,size,size_ml,size_g,size_oz
69,Bright Eyes Collagen-Infused Brightening Colloidal Silver Eye Masks,15 pairs,,,
73,GloPRO® Prep Pads Clarifying Skin Cleansing Wipes with Peptides,30 pads,,,
430,Ferulic + Retinol Wrinkle Recovery Peel,16 treatments,,,
438,DRx Acne Eliminating Pads,45 Treatments,,,
439,Hyaluronic Marine Hydrating Modeling Mask,4 treatments,,,


In [67]:
# Drop records that 'cannot_convert' into 'size_ml' and reset index
df.drop(cannot_convert.index, axis = 0, inplace = True)
df.reset_index(drop=True, inplace=True)

We can impute the null values in 'size_ml' using 'size_oz' and 'size_g'.

In [68]:
# Check number of null values for 'size_ml' which we need to impute
df['size_ml'].isnull().sum()

247

In [69]:
# Display null values for 'size_ml' 
df[df['size_ml'].isnull()][['name', 'size', 'size_ml', 'size_g','size_oz']]

Unnamed: 0,name,size,size_ml,size_g,size_oz
45,Color Control Cushion Compact Broad Spectrum SPF 50+,1.05 oz,,,1.05
62,Youth Revolution Radiance Sheet Masque,1.4 oz/ 40 g x 1 Sheet,,40.0,1.4
68,Prime Time™ BB Tinted Primer Broad Spectrum SPF 30,1.0 oz,,,1.0
71,The Quench Eye Reviving Quadralipid Eye Balm,0.5 oz/ 14 g,,14.0,0.5
87,Problem Solution Moisturizer,4.22 oz,,,4.22
88,Problem Solution Toner,6.75 oz,,,6.75
91,Problem Solution Cleansing Foam,3.38 oz,,,3.38
115,Squalane+ Rose Vegan Lip Balm,0.35 oz/ 10g,,10.0,0.35
116,Extra Eye Repair Cream,0.5 oz,,,0.5
119,Extra Repair Moisturizing Balm SPF 25,1.7 oz,,,1.7


In [70]:
# Display null values for 'size_ml' which we can impute using 'size_g'
convert_g_to_ml = df[df['size_ml'].isnull() & df['size_g'].notnull()][['name', 'size', 'size_ml', 'size_g']]
convert_g_to_ml.head()

Unnamed: 0,name,size,size_ml,size_g
62,Youth Revolution Radiance Sheet Masque,1.4 oz/ 40 g x 1 Sheet,,40.0
71,The Quench Eye Reviving Quadralipid Eye Balm,0.5 oz/ 14 g,,14.0
115,Squalane+ Rose Vegan Lip Balm,0.35 oz/ 10g,,10.0
142,Green Tea Oil-Free Moisturizer,1.7 oz/ 50g,,50.0
166,Lip Conditioner,0.15 oz/ 4.5 g,,4.5


In [71]:
# Impute null values in 'size_ml' using associated 'size_g' values (assuming 1 g = 1 ml)
# Label-based .loc is used because convert_g_to_ml.index holds index labels, not positions
df.loc[convert_g_to_ml.index, 'size_ml'] = df.loc[convert_g_to_ml.index, 'size_g']

In [72]:
# Display null values for 'size_ml' which we can impute using 'size_oz'
convert_oz_to_ml = df[df['size_ml'].isnull() & df['size_oz'].notnull()][['name', 'size', 'size_ml', 'size_oz']]
convert_oz_to_ml.head()

Unnamed: 0,name,size,size_ml,size_oz
45,Color Control Cushion Compact Broad Spectrum SPF 50+,1.05 oz,,1.05
68,Prime Time™ BB Tinted Primer Broad Spectrum SPF 30,1.0 oz,,1.0
87,Problem Solution Moisturizer,4.22 oz,,4.22
88,Problem Solution Toner,6.75 oz,,6.75
91,Problem Solution Cleansing Foam,3.38 oz,,3.38


In [73]:
# Impute null values in 'size_ml' using associated 'size_oz' values (i.e. 1 US fluid oz = 29.5735 ml)
# Label-based .loc is used because convert_oz_to_ml.index holds index labels, not positions
df.loc[convert_oz_to_ml.index, 'size_ml'] = df.loc[convert_oz_to_ml.index, 'size_oz'] * 29.5735

In [74]:
# Check if there are any remaining null values for 'size_ml' 
df['size_ml'].isnull().sum()

0
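The two imputation steps can also be written as a single chained `fillna`. The sketch below is an illustration on a toy frame, not the notebook's own code. `toy`, `fill_size_ml` and `ML_PER_OZ` are hypothetical names, and it keeps the same assumptions as above (1 g = 1 ml, 1 oz = 29.5735 ml).

```python
import numpy as np
import pandas as pd

ML_PER_OZ = 29.5735  # same conversion factor as used above

# Toy data with a non-default index to show that alignment is by label
toy = pd.DataFrame(
    {
        'size_ml': [50.0, np.nan, np.nan],
        'size_g': [np.nan, 40.0, np.nan],
        'size_oz': [1.7, 1.4, 1.05],
    },
    index=[10, 62, 45],
)

def fill_size_ml(frame):
    """Fill missing 'size_ml' from 'size_g' first, then from 'size_oz'."""
    filled = (
        frame['size_ml']
        .fillna(frame['size_g'])               # assumes 1 g = 1 ml
        .fillna(frame['size_oz'] * ML_PER_OZ)  # ounces converted to ml
    )
    return frame.assign(size_ml=filled)

result = fill_size_ml(toy)
print(result['size_ml'])
```

Because `fillna` aligns on index labels, this form does not depend on whether the index is a default `RangeIndex`.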

All product sizes are now expressed in 'ml'. Next, we perform a sanity check and correct any errors in the unit conversions.

In [75]:
# Inspect 'size'-related features to identify errors in unit conversions 
df[['name','size', 'size_ml', 'size_g','size_oz']];
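Rather than scanning every row by eye, one option is to flag size strings that look like multi-unit packs, since those are the cases where a per-unit size is easily mistaken for the total. The sketch below is a heuristic under stated assumptions: `MULTIPACK_PATTERN` and `flag_multipack_sizes` are hypothetical names, and the pattern only looks for a standalone `x`, a `*`, or the word `each`.

```python
import pandas as pd

# Hypothetical heuristic: standalone 'x', an asterisk, or the word 'each'
MULTIPACK_PATTERN = r'\bx\b|\*|\beach\b'

def flag_multipack_sizes(sizes):
    """Return the size strings that look like multi-unit packs."""
    mask = sizes.str.contains(MULTIPACK_PATTERN, case=False, regex=True, na=False)
    return sizes[mask]

sample_sizes = pd.Series([
    '0.5 oz/ 14 g',
    '2 x 1 oz/ 30 mL',
    '0.33 oz/ 10 mL *4',
    '7 Sachets; 0.33 oz/ 10 mL each',
    'Four 0.27 oz/ 8 mL Vials',
])

flagged = flag_multipack_sizes(sample_sizes)
print(flagged)
```

The heuristic is deliberately incomplete: a string such as 'Four 0.27 oz/ 8 mL Vials' spells out its count and is not flagged, so a manual review is still needed.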

In [76]:
# Correct errors in unit conversions for multi-unit products (total = count x unit size)
# regex=False matches each string literally; several contain regex metacharacters such as '(', ')' and '*'
size_corrections = {
    '6 x 0.78 oz Sheets/ 6 x 23 mL Sheets': 6 * 23,
    '3 x 0.17 oz/ 5 mL': 3 * 5,
    '2 x 2.02 oz/ 60 mL': 2 * 60,
    '6 Patches x 3 g': 6 * 3,
    '28 x 0.01 oz/ 0.5 g packets': 28 * 0.5,
    '7 Sachets; 0.33 oz/ 10 mL each': 7 * 10,
    '7 Ampoules, .0625 oz/ 1.8 mL each': 7 * 1.8,
    '2 x 1 oz/ 30 mL': 2 * 30,
    '6 x 2 patches (0.15 oz/ 4.5g)': 6 * 2 * 4.5,
    '90 x 0.012 oz/ 0.3 g': 90 * 0.3,
    '2 x 1.7 oz/ 50.28 mL': 2 * 50.28,
    '3 x 1.7 oz/ 50.28 mL': 3 * 50.28,
    '1.18 oz/ 2 x 35 mL': 2 * 35,
    '12 x 2 packettes 0.02 oz/ 0.75 g': 12 * 2 * 0.75,
    '4 Satchets x 1.1 oz/ 30 g': 4 * 30,
    '2 x 0.33 oz/ 10 mL': 2 * 10,
    'Four 0.27 oz/ 8 mL Vials': 4 * 8,
    '1.7 oz/ 50g *2 Usages': 50 * 2,
    '0.33 oz/ 10 mL *4': 10 * 4,
    '6 x 0.91 oz Sheets/ 6 x 26 mL Sheets': 6 * 26,
}
for size_text, total_ml in size_corrections.items():
    df.loc[df['size'].str.contains(size_text, regex=False, na=False), 'size_ml'] = total_ml
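One detail worth checking in corrections like these is that `Series.str.contains` treats its pattern as a regular expression by default. Size strings containing `*`, `(` or `)` therefore may not match themselves unless they are matched literally. The short sketch below illustrates this on a toy series; `sizes` and `target` are hypothetical names.

```python
import re

import pandas as pd

sizes = pd.Series([
    '6 x 2 patches (0.15 oz/ 4.5g)',
    '1.7 oz/ 50g *2 Usages',
    '2 x 1 oz/ 30 mL',
])

target = '1.7 oz/ 50g *2 Usages'

# Default behaviour: ' *' is read as "zero or more spaces", so the literal '*' never matches
as_regex = sizes.str.contains(target)

# Literal matching, either via regex=False or by escaping the pattern
as_literal = sizes.str.contains(target, regex=False)
as_escaped = sizes.str.contains(re.escape(target))

print(as_regex.tolist(), as_literal.tolist(), as_escaped.tolist())
```

When a correction appears to have no effect, checking `mask.sum()` for each pattern is a quick way to confirm that at least one row was matched.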

In [77]:
# Round 'size_ml' to 2 decimal places
df['size_ml'] = df['size_ml'].round(2)

We will retain only the 'size_ml' column and drop all other 'size'-related columns. Since we assume that price increases with product volume, we will also create a potential target variable: price per unit volume.

In [78]:
# Drop redundant 'size', 'size_g' and 'size_oz' features and retain only 'size_ml'
df.drop(columns = ['size', 'size_g', 'size_oz'], inplace = True)

In [79]:
# Create a potential alternative target variable to 'value_price': price adjusted for product size (price per ml)
df['price_per_unit_vol'] = (df['value_price'] / df['size_ml']).round(2)
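As a worked check of the formula, 98.0 / 60.0 gives 1.63 and 68.0 / 15.0 gives 4.53 after rounding. The sketch below reproduces this on toy values and also guards against a size of zero, which would otherwise produce `inf`. The guard is an assumption added for illustration; `toy` and `price_per_ml` are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'value_price': [98.0, 68.0, 35.0, 20.0],
    'size_ml': [60.0, 15.0, 10.0, 0.0],  # the last row has an invalid size of zero
})

def price_per_ml(frame):
    """Price divided by size in ml, rounded to 2 decimals; zero sizes become NaN."""
    size = frame['size_ml'].replace(0, np.nan)  # avoid division by zero
    return (frame['value_price'] / size).round(2)

toy['price_per_unit_vol'] = price_per_ml(toy)
print(toy)
```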

In [80]:
# Inspect output
df.head()

Unnamed: 0,id,brand,category,name,value_price,url,details,ingredients,av_rating,sum_reviews,sum_love,product_name,skintype_sensitive,skintype_combination,skintype_normal,skintype_dry,skintype_oily,concerns_dryness,concerns_dullness,concerns_elasticity,concerns_darkspots,concerns_darkcircles,concerns_puffiness,concerns_pores,concerns_wrinkles,concerns_aging,concerns_redness,concerns_oiliness,concerns_acne,concerns_others,pref_vegan,pref_crueltyfree,pref_glutenfree,pref_antioxidant,pref_hydration,skincareacids_hyaluronicacid,skincareacids_salicylicacid,skincareacids_AHA,skincareacids_vitaminc,skincareacids_retinol,excluded_parabens,excluded_sulfates,excluded_formaldehydes,excluded_phthalates,excluded_silicones,formulation_cream,formulation_serum,formulation_liquid,formulation_gel,formulation_mask,formulation_spray,formulation_balm,award_allure,clinical_results,size_ml,price_per_unit_vol
0,2170827,Algenist,Moisturizers & Creams,GENIUS Sleeping Collagen,98.0,https://www.sephora.com/product/genius-sleeping-collagen-P439055?icid2=products grid:p439055,what it is a vegan buttery collagen sleeping cream that delivers essential nutrients and nurtur...,-Patented Alguronic Acid: Naturally sourced and sustainably produced from algae- it’s clinicall...,4.5,1000,18200,Algenist GENIUS Sleeping Collagen,0,1,1,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,1,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,0,0,1,60.0,1.63
1,1328855,Algenist,Eye Creams & Treatments,Complete Eye Renewal Balm,68.0,https://www.sephora.com/product/complete-eye-renewal-balm-P282938?icid2=products grid:p282938,what it is a multitasking eye balm that primes hydrates soothes diminishes the look of dark und...,-Patented Alguronic Acid: Visibly minimizes the appearance of fine lines and wrinkles and boost...,4.0,873,27500,Algenist Complete Eye Renewal Balm,0,1,1,1,1,1,0,0,0,1,1,0,1,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,1,1,1,0,1,0,0,0,0,0,1,0,1,15.0,4.53
2,1644376,Algenist,Eye Creams & Treatments,GENIUS Ultimate Anti-Aging Eye Cream,74.0,https://www.sephora.com/product/genius-ultimate-anti-aging-eye-cream-P388262?icid2=products grid...,what it is an antiaging eye cream that visibly firms lifts combats fine lines and wrinkles brig...,-Alguronic Acid: Combats fine lines and wrinkles and brightens the skin as a scientifically tes...,4.0,100,5100,Algenist GENIUS Ultimate Anti-Aging Eye Cream,0,0,1,0,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,0,0,0,1,15.0,4.93
3,2063014,Algenist,Lip Balms & Treatments,GENIUS Liquid Collagen Lip,35.0,https://www.sephora.com/product/genius-liquid-collagen-lip-P432045?icid2=products grid:p432045,what it is a vegan collagen moisturizing lip treatment that visibly restores lip fullness and c...,-Patented Alguronic Acid: Naturally sourced and sustainably produced from algae- it’s clinicall...,4.0,491,32100,Algenist GENIUS Liquid Collagen Lip,0,0,0,1,0,1,1,0,0,0,0,0,1,1,0,0,0,0,1,1,1,0,1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1,10.0,3.5
4,1675974,Algenist,Face Serums,GENIUS Ultimate Anti-Aging Vitamin C+ Serum,118.0,https://www.sephora.com/product/genius-ultimate-anti-aging-vitamin-c-serum-P392945?icid2=product...,which skin type is it good for normal oily combination dry sensitivewhat it isa powerful serum ...,-Alguronic Acid: Improves the appearance of firmness which results in more toned and younger loo...,4.0,221,8900,Algenist GENIUS Ultimate Anti-Aging Vitamin C+ Serum,1,1,1,1,1,0,1,0,1,0,0,0,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,30.0,3.93


In [81]:
# Save data
df.to_csv('../data/sephora_clean.csv', index = False)