# Data cleaning

________________________________________________________________________________________________________________

**Filenames of scraped data**
- eye_products.json
- moisturizers_full.json
- treatments_full.json
- cleansers_full.json

__________________________________________________________________________________________________________________

## 0.0 Definitions

#### Variables
- brand
- product_name: product brand and name
- product_type: general category of product (i.e. essences, serums and treatments, moisturizers and creams, face wash and cleansers, eye creams and treatments, toners, exfoliators and peels)
- num_likes: number of likes
- ingredients: list of ingredients as published on Sephora
- rating: rating from 1-5 stars
- num_reviews: number of reviews received
- sensitive_type: whether product is suitable for sensitive skin
- combination_type: whether product is suitable for combination skin
- oily_type: whether product is suitable for oily skin
- normal_type: whether product is suitable for normal skin
- dry_type: whether product is suitable for dry skin
- clean_sephora: product does not contain the sulfates SLS and SLES, parabens, formaldehydes and formaldehyde-releasing agents, phthalates, mineral oil, retinyl palmitate, oxybenzone, coal tar, hydroquinone, triclosan, and triclocarban
- cruelty_free: not tested on animals
- vegan
- skin_concerns: skin care concern targeted, as highlighted on the Sephora website
- excl_ingr: specific ingredients not used in the formulation of the product
- best_for_skintype: whether it is highlighted to be best for a specific skin type
- acids: notes whether the product contains any of Hyaluronic Acid, Salicylic Acid, AHA/Glycolic Acid, Vitamin C
- award: notes if the product is an allure winner or a community favorite
- pricepervol: price per oz
- highlighted_ingr: highlighted ingredients as noted in the 'about the product' section in Sephora
- clinical_results: whether clinical results have been noted in the 'about the product' section in Sephora
- formulation_type: whether it is a cream, serum, liquid, gel, oil, lotion or other
- richness: indicates whether the product is lightweight, normal or heavy


#### Custom functions
- search_in_list --> Returns 1 if any item in the list contains the term and 0 otherwise (the term must be lowercase)
- extract_info2 --> Either 1) searches for a term/phrase in the list and extracts the words immediately after it, or 2) searches for a term/phrase and, if found, adds it to the retrieved list
- find_and_encode --> Takes a dictionary mapping each search term to the string to be returned if that term is found in the list
- return_match --> Like re.search, but returns the exact matched text; returns None if there is no match
- find_in_list --> Searches for a pattern in the list and returns a list of the matches
- extract_next --> Searches for a phrase in the list and extracts the next item in the list
- find_and_encode2 --> Takes a dictionary mapping each search term to the value to return if that term is found; if no term is found, the default is returned
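
The bodies of these helpers are not shown in this section. As a minimal sketch, assuming each 'about the product' and 'highlights' entry is a list of strings (as in the samples further down), three of them might look like the following. The implementations and the sample lists are assumptions based only on the one-line descriptions above, not the notebook's actual code.

```python
import re

def search_in_list(items, term):
    """Return 1 if any item in the list contains the (lowercase) term, else 0."""
    return int(any(term in str(item).lower() for item in items))

def find_in_list(items, pattern):
    """Return every regex match of pattern found across the items of the list."""
    matches = []
    for item in items:
        matches.extend(re.findall(pattern, str(item)))
    return matches

def extract_next(items, phrase):
    """Return the item that follows the first item containing phrase, or None."""
    for i, item in enumerate(items[:-1]):
        if phrase in str(item):
            return items[i + 1]
    return None

# Hypothetical lists shaped like the scraped columns (not real scraped rows)
about = ["What it is: ", "A gentle foaming cleanser", "✔ Normal", "✔ Oily"]
highlights = ["Clean at Sephora", "Good for: Redness", "Good for: Pores"]

print(search_in_list(about, "oily"))                  # flag for oily_type
print(find_in_list(highlights, r"Good for: (\w+)"))   # skin concerns
print(extract_next(about, "What it is"))              # product description
```

Lowercasing each item inside `search_in_list` is what makes the "term - lowercase" convention work: `"✔ Oily"` only matches `"oily"` after normalisation.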

__________________________________________________________________________________________________________________

## 1.0 Loading the files

In [1]:
#Import necessary libraries
import json 
import re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas_profiling import ProfileReport

In [2]:
#Adjusting settings
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

#Allow multiple execution of commands in one cell 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<font color='red'> QUESTION BELOW </font>

#Building automated loader
#Assigning to the loop variable inside the function only rebinds a local name,
#so the dataframes are collected in a dictionary and returned instead
files= ['cleansers_full.json', 'eye_products.json', 'moisturizers_full.json']
raw_df=['cleansers_raw', 'eyeproducts_raw', 'moisturizers_raw']

def json_to_df(file_list, names):
    '''
    Accepts a list of files and a list of names; returns a dictionary mapping each name to the dataframe loaded from the matching file
    '''
    dataframes = {}
    for filename, name in zip(file_list, names):
        print(filename, name)
        with open(filename, 'r') as file:
            raw_data= json.load(file)
        dataframes[name]=pd.DataFrame.from_dict(raw_data)
    return dataframes

#Access a dataframe by name, e.g. raw_dfs['cleansers_raw']
raw_dfs=json_to_df(files, raw_df)

In [3]:
#Storing each raw json file in a dataframe 
with open('../raw_data/cleansers_full.json', 'r') as file:
    raw_data= json.load(file)
cleansers_raw=pd.DataFrame.from_dict(raw_data)

with open('../raw_data/eye_products.json', 'r') as file:
    raw_data= json.load(file)
eyeproducts_raw=pd.DataFrame.from_dict(raw_data)

with open('../raw_data/moisturizers_full.json', 'r') as file:
    raw_data= json.load(file)
moisturizers_raw=pd.DataFrame.from_dict(raw_data)

with open('../raw_data/treatments_full.json', 'r') as file:
    raw_data= json.load(file)
treatments_raw=pd.DataFrame.from_dict(raw_data)

In [4]:
#Inspecting the data
cleansers_raw.sample(5)
moisturizers_raw.sample(5)
eyeproducts_raw.sample(5)
treatments_raw.sample(5)

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights
402,Dr. Barbara Sturm,Cleanser,"[What it is: , \nA gentle foaming cleanser th...",https://www.sephora.com/product/coconut-melt-P...,Face Wash & Cleansers,cleansers,4K,https://www.sephora.com/productimages/sku/s226...,$70.00,5 oz/ 150 mL,"[ -Mild Tensides: Provide thorough, yet gentle...",4.5 stars,51,"[Fragrance Free, Cruelty-Free, Good for: Acne/..."
278,Dermalogica,Daily Microfoliant Exfoliator,"[What it is: , A gentle, rice-based powder ex...",https://www.sephora.com/product/watermelon-glo...,Exfoliators,cleansers,48.4K,https://www.sephora.com/productimages/sku/s200...,$59.00,2.6 oz/ 74g,"[ -Salicylic Acid: Clears clogged pores, draws...",4.5 stars,954,"[Best for Oily, Combo, Normal Skin, Good for: ..."
169,Indie Lee,Brightening Cleanser,"[What it is: , A cleanser, makeup remover, an...",https://www.sephora.com/product/the-method-pol...,Face Wash & Cleansers,cleansers,31.8K,https://www.sephora.com/productimages/sku/s207...,$34.00,4.2 oz/ 125 mL,"[ -Strawberry Seed Oil: Provides vitamins A, B...",4.5 stars,462,"[Clean at Sephora, Vegan]"
198,Drunk Elephant,Baby Pekee Bar™ + Juju Bar Travel Duo,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/problem-soluti...,Value & Gift Sets,cleansers,51.2K,https://www.sephora.com/productimages/sku/s180...,$16.00,,[ -Heilmoor Clay: Tones and detoxifies the ski...,4.5 stars,571,"[Clean at Sephora, Best for Dry, Combo, Normal..."
38,Mario Badescu,Buffering Lotion,"[What it is: , A niacimamide and zinc oxide so...",https://www.sephora.com/product/bi-facil-face-...,Toners,cleansers,4K,https://www.sephora.com/productimages/sku/s218...,$19.00,1 oz/ 29 mL,"[Isopropyl Alcohol, Propylene Glycol, Zinc Oxi...",4 stars,35,[]


Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights
122,OLEHENRIKSEN,Goodnight Glow Retin-ALT Sleeping Crème,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/goodnight-glow...,Night Creams,moisturizers,50.8K,https://www.sephora.com/productimages/sku/s210...,$55.00,1.7 oz/ 50 mL,[ -Bakuchiol: Derived from the Ayurvedic babch...,4 stars,478,"[Clean at Sephora, AHA/Glycolic Acid, Good for..."
616,Lord Jones,Acid Mantle Repair Moisturizer With 250mg CBD ...,"[What it is: , A soothing moisturizer packed ...",https://www.sephora.com/product/epic-moisture-...,Moisturizers,moisturizers,6.3K,https://www.sephora.com/productimages/sku/s234...,$75.00,1.7 oz/ 50 mL,"[-Whole Hemp Plant-Derived CBD: A restorative,...",5 stars,82,"[Best for Dry, Combo, Normal Skin, Good for: D..."
521,COOLA,Full Spectrum 360° Mineral Sun Silk Moisturiz...,"[What it is: , An ultra-rich priming cream th...",https://www.sephora.com/product/birds-nest-blu...,Face Sunscreen,moisturizers,997,https://www.sephora.com/productimages/sku/s233...,$42.00,1.5 oz/ 44 mL,[-Zinc Oxide 13.5%: Provides broad-spectrum UV...,3.5 stars,15,"[Clean at Sephora, SPF]"
326,Tata Harper,Nourishing Makeup Removing Oil Cleanser,"[What it is: , A lightweight multi-vitamin cle...",https://www.sephora.com/product/shiseido-vital...,Face Wash & Cleansers,moisturizers,7.9K,https://www.sephora.com/productimages/sku/s167...,$82.00,4.1 oz/ 125 mL,[-Alfalfa: Nourishes skin with biotin and vita...,4 stars,88,"[Best for Dry, Combo, Normal Skin, Good for: D..."
89,Origins,Original Skin™ Matte Moisturizer with Willowherb,"[What it is: , A deliciously hydrating lush w...",https://www.sephora.com/product/wrinkle-warrio...,Moisturizers,moisturizers,53.2K,https://www.sephora.com/productimages/sku/s196...,$36.00,1.7 oz/ 50 mL,[-Willowherb: Helps skin appear vibrant and sm...,4.5 stars,2.1K,"[Clean at Sephora, Community Favorite]"


Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights
2,First Aid Beauty,Eye Duty Niacinamide Brightening Eye Cream,"[What it is: , An illuminating eye cream that...",https://www.sephora.com/product/the-pearl-tint...,Eye Creams & Treatments,eye products,36.1K,https://www.sephora.com/productimages/sku/s233...,$36.00,0.5 oz/ 15 mL,[ -Niacinamide: Fades the appearance of brown ...,4 stars,283,"[Fragrance Free, Cruelty-Free, Best for Normal..."
212,First Aid Beauty,Eye Duty Triple Remedy A.M. Gel Cream,"[What it is: , A fast-absorbing treatment tha...",https://www.sephora.com/product/spectralite-ey...,Eye Creams & Treatments,eye products,23.5K,https://www.sephora.com/productimages/sku/s178...,$36.00,0.5 oz/ 15 mL,[ -Peptides: Smooth the look of skin and minim...,4 stars,200,"[Clean at Sephora, Good for: Dark Circles, Goo..."
84,CLINIQUE,All About Eyes Serum De-Puffing Eye Massage,"[What it is:, A rollerball that massages away ...",https://www.sephora.com/product/sea-pack-your-...,Eye Creams & Treatments,eye products,20.2K,https://www.sephora.com/productimages/sku/s123...,$35.00,0.5 oz/ 15 mL,"[Water, Butylene Glycol, Glycerin, Caffeine, P...",4 stars,893,[]
98,fresh,Black Tea Firming Eye Serum,"[What it is: , A fluid gel that provides a cor...",https://www.sephora.com/product/wide-awake-bri...,Eye Creams & Treatments,eye products,11.9K,https://www.sephora.com/productimages/sku/s217...,$72.00,0.5 oz/ 15 mL,"[ Water, Glycerin, Butylene Glycol, Pentylene ...",4 stars,56,"[Clean at Sephora, Good for: Loss of firmness]"
117,Murad,Retinol Youth Renewal Eye Serum,"[What it is: , A visibly preventative and cor...",https://www.sephora.com/product/laneige-water-...,Eye Creams & Treatments,eye products,15K,https://www.sephora.com/productimages/sku/s231...,$85.00,0.5 oz/ 15 mL,[-Retinol Tri-Active Technology: Visibly fight...,4.5 stars,155,"[Retinol, Good for: Anti-Aging, Best for Dry, ..."


Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights
291,Caudalie,Vinopure Natural Salicylic Acid Pore Minimizin...,"[What it is: , A pore-diminishing gel serum p...",https://www.sephora.com/product/paula-s-choice...,Face Serums,treatments,72.4K,https://www.sephora.com/productimages/sku/s211...,$49.00,1 oz/ 30 mL,[ -Natural Salicylic Acid (Derived from Winte...,4.5 stars,1.1K,"[Clean at Sephora, Good for: Acne/Blemishes, G..."
382,Primera,Mild Facial Peeling Gel,"[What it is: , A non-abrasive gel that improv...",https://www.sephora.com/product/radiance-resur...,Facial Peels,treatments,3.4K,https://www.sephora.com/productimages/sku/s226...,$38.00,5.0 oz/ 150 mL,[ -Broccoli Sprout Extract: Clarifies skin’s a...,4 stars,29,"[Best for Dry, Combo, Normal Skin, Good for: D..."
423,Kate Somerville,EradiKate® Daily Cleanser Acne Treatment,"[What it is: , A medicated daily foaming acne...",https://www.sephora.com/product/ferulic-retino...,Blemish & Acne Treatments,treatments,33.2K,https://www.sephora.com/productimages/sku/s190...,$40.00,4 oz/ 120 mL,[ -Sulfur: Works to eliminate existing imperfe...,4.5 stars,572,"[Good for: Acne/Blemishes, Good for: Pores , B..."
147,CLINIQUE,Acne Solutions™ Cleansing Bar For Face and Body,"[What it is: , A mild soap powered by salicyl...",https://www.sephora.com/product/sos-save-our-s...,Face Wash & Cleansers,treatments,15.5K,https://www.sephora.com/productimages/sku/s102...,$16.50,5.2 oz/ 147 g,"[Sodium Palmate, Sodium Cocoate, Water, Palmit...",4.5 stars,447,[]
71,LANCER Skincare,Active Rejuvenation Serum with Triple Dermal C...,"[What it is: , A powerful serum that addresse...",https://www.sephora.com/product/perfectionist-...,Face Serums,treatments,828,https://www.sephora.com/productimages/sku/s237...,$150.00,1 oz/ 30 mL,[ -Cell-Renewal Complex: Helps energize the lo...,4.5 stars,105,[]


## 2.0 Pre-processing each column

### Summary of steps taken:

1) Created new column to include both brand and product name

2) Inspected for missing values and deleted rows where appropriate

3) Removed 'stars' from rating and converted column into numeric type

4) Removed currency sign from price and converted column into numeric type

5) Converted abbreviated counts (e.g. 4K) in number of likes and number of reviews into digits and stored them as numeric data

6) Created new features based on the skin types the product is compatible with, derived from the 'about the product' section (one new column containing 1 or 0 for each of normal, combination, oily, dry, sensitive)

7) Created new features based on highlights

8) Created a new (uniform) feature for 'size'

9) Derived a feature for price adjusted for the size or volume of the product

10) Created new feature based on highlighted ingredients

11) Created a new feature capturing whether clinical results are published in the 'about the product' section

12) Extracted the formulation type of each product and captured it in separate feature(s)
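
Steps 3, 4, 5, 8 and 9 are all string-to-number conversions. The cells that implement them fall outside this section, so the following is only a minimal sketch, assuming the raw formats seen in the samples above (`4.5 stars`, `$70.00`, `48.4K`, `5 oz/ 150 mL`). The `parse_*` and `price_per_oz` helper names are hypothetical, and reading `K` as thousands and `M` as millions is an assumption.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

def parse_rating(text):
    """'4.5 stars' -> 4.5; missing or unparseable values -> None."""
    match = re.search(NUMBER, str(text))
    return float(match.group()) if match else None

def parse_price(text):
    """'$70.00' -> 70.0; thousands separators are dropped first."""
    match = re.search(NUMBER, str(text).replace(",", ""))
    return float(match.group()) if match else None

def parse_count(text):
    """'48.4K' -> 48400, '954' -> 954 (assumes K = thousand, M = million)."""
    match = re.fullmatch(rf"\s*({NUMBER})\s*([KkMm]?)\s*", str(text))
    if not match:
        return None
    multiplier = {"": 1, "k": 1_000, "m": 1_000_000}[match.group(2).lower()]
    return round(float(match.group(1)) * multiplier)

def parse_size_oz(text):
    """'5 oz/ 150 mL' -> 5.0; sizes without an 'oz' figure -> None."""
    match = re.search(rf"({NUMBER})\s*oz", str(text), flags=re.IGNORECASE)
    return float(match.group(1)) if match else None

def price_per_oz(price, size_oz):
    """Price divided by size in oz; None when either value is missing or zero."""
    if price is None or not size_oz:
        return None
    return price / size_oz

# Hypothetical row in the raw scraped format
row = {"rating": "4.5 stars", "price": "$70.00",
       "num_likes": "48.4K", "size": "5 oz/ 150 mL"}
price = parse_price(row["price"])
size_oz = parse_size_oz(row["size"])
print(parse_rating(row["rating"]), price, parse_count(row["num_likes"]), size_oz)
print(price_per_oz(price, size_oz))
```

With pandas, each helper can be applied column-wise via `Series.map`, for example `cleansers_raw['rating'].map(parse_rating)`. Returning `None` for unparseable values keeps missing sizes and prices visible as nulls rather than silently turning them into zeros.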

**1) Created new column to include both brand and product name**

In [53]:
#Creating new column for full product name 
cleansers_raw['product_name']=cleansers_raw['brand']+cleansers_raw['name']
treatments_raw['product_name']=treatments_raw['brand']+treatments_raw['name']
moisturizers_raw['product_name']=moisturizers_raw['brand']+moisturizers_raw['name']
eyeproducts_raw['product_name']=eyeproducts_raw['brand']+eyeproducts_raw['name']
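
Plain `+` concatenation joins the two strings without a separator, which is why values such as `Drunk ElephantSlaai™ ...` appear in the `product_name` column of the outputs that follow. If a space-separated name is preferred, a sketch along these lines could be used instead. `full_product_name` is a hypothetical helper, not part of the notebook, and it assumes brand or name may be missing or carry stray whitespace.

```python
def full_product_name(brand, name):
    """Join brand and name with a single space; return None if either is missing."""
    if not isinstance(brand, str) or not isinstance(name, str):
        return None
    return f"{brand.strip()} {name.strip()}"

print(full_product_name("Drunk Elephant", "Slaai™ Makeup-Melting Butter Cleanser"))
print(full_product_name("Briogeo", " B. Well Organic"))   # leading space is stripped
print(full_product_name(None, "Cleanser"))                # missing brand -> None
```

It could be applied row-wise, e.g. `cleansers_raw.apply(lambda r: full_product_name(r['brand'], r['name']), axis=1)`.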

**2) Inspected for missing values and deleted rows where appropriate**

In [54]:
#Inspect missing values 

#see total number of items
cleansers_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 409 entries, 0 to 408
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   brand              407 non-null    object
 1   name               407 non-null    object
 2   about the product  409 non-null    object
 3   weblink            409 non-null    object
 4   sub_category       407 non-null    object
 5   main_category      409 non-null    object
 6   num_likes          407 non-null    object
 7   img_link           409 non-null    object
 8   price              403 non-null    object
 9   size               363 non-null    object
 10  ingredients        409 non-null    object
 11  rating             409 non-null    object
 12  num_reviews        402 non-null    object
 13  highlights         409 non-null    object
 14  product_name       407 non-null    object
dtypes: object(15)
memory usage: 48.1+ KB


In [55]:
cleansers_raw[cleansers_raw['size'].isnull()]

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights,product_name
1,Drunk Elephant,Slaai™ Makeup-Melting Butter Cleanser,"[What it is: , An innovative cleansing balm t...",https://www.sephora.com/product/cleansing-exfo...,Face Wash & Cleansers,cleansers,66.6K,https://www.sephora.com/productimages/sku/s217...,$34.00,,[ -Nourishing Fruit Salad Blend: A mix of non-...,4 stars,1.3K,[allure 2019 Best of Beauty Award Winner: Clea...,Drunk ElephantSlaai™ Makeup-Melting Butter Cl...
6,SEPHORA COLLECTION,Cleansing & Exfoliating Wipes,"[What it is: , A collection of cleansing and ...",https://www.sephora.com/product/cleansing-exfo...,Face Wipes,cleansers,258.6K,https://www.sephora.com/productimages/sku/s180...,$8.00,,"[ -Coconut Water Extract.\n, \n, Water, Capry...",4.5 stars,3.7K,"[Best for Dry, Combo, Normal Skin, Good for: D...",SEPHORA COLLECTIONCleansing & Exfoliating Wipes
12,Tatcha,The Rice Polish Foaming Enzyme Powder,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/polished-rice-...,Exfoliators,cleansers,133.2K,https://www.sephora.com/productimages/sku/s212...,$65.00,,[ -Japanese Rice Bran: Gently exfoliates to so...,4.5 stars,1.6K,"[Clean at Sephora, Good for: Pores , Good for:...",TatchaThe Rice Polish Foaming Enzyme Powder
36,Estée Lauder,Nutritious Super-Pomegranate Radiant Energy Lo...,"[What it is: , A rich, cushioning treatment lo...",https://www.sephora.com/product/bi-facil-face-...,Toners,cleansers,5.6K,https://www.sephora.com/productimages/sku/s218...,$58.00,,"[Water\Aqua\Eau, Alcohol Denat., Glycerin, Dip...",4 stars,10,[],Estée LauderNutritious Super-Pomegranate Radia...
48,Clarins,Instant Eye Makeup Remover,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/instant-eye-ma...,Makeup Removers,cleansers,5K,https://www.sephora.com/productimages/sku/s153...,$30.00,,[ -Volatile Ultra-fine Oils: Remove heavy or w...,4.5 stars,148,[],ClarinsInstant Eye Makeup Remover
57,Drunk Elephant,Baby Bar Travel Duo with Case,"[What it is: , A baby Juju Bar and Pekee Bar™...",https://www.sephora.com/product/one-size-go-of...,Value & Gift Sets,cleansers,17.6K,https://www.sephora.com/productimages/sku/s217...,$22.00,,"[ -Virgin Marula Oil: Hydrates, protects, heal...",4.5 stars,114,"[Clean at Sephora, Best for Dry, Combo, Normal...",Drunk ElephantBaby Bar Travel Duo with Case
68,SEPHORA COLLECTION,Hemp Cleansing Wipes,"[What it is: , A cleansing wipe that is formu...",https://www.sephora.com/product/laura-mercier-...,Face Wipes,cleansers,0,https://www.sephora.com/productimages/sku/s223...,$8.00,,[-Hemp from Natural Origin: Provides the look ...,4 stars,10,"[Best for Dry, Combo, Normal Skin, Good for: D...",SEPHORA COLLECTIONHemp Cleansing Wipes
74,MILK MAKEUP,Matcha Cleanser,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/mario-badescu-...,Face Wash & Cleansers,cleansers,18.6K,https://www.sephora.com/productimages/sku/s191...,$30.00,,"[Water, Butylene Glycol, Sodium Stearate, Bis-...",4 stars,182,"[Clean at Sephora, Good for: Redness]",MILK MAKEUPMatcha Cleanser
77,Clarins,Hydrating Gentle Foaming Cleanser,"[What it is: , A creamy, rinse-off foaming cl...",https://www.sephora.com/product/mario-badescu-...,Face Wash & Cleansers,cleansers,115,https://www.sephora.com/productimages/sku/s244...,$27.00,,[-Saponaria Balm Extracts: Natural properties ...,1 star,1,"[Cream Formula, Best for Dry Skin, Without Min...",ClarinsHydrating Gentle Foaming Cleanser
80,Shiseido,Instant Eye And Lip Makeup Remover,"[What it is:, A gentle, dual-phase formula to ...",https://www.sephora.com/product/fresh-pressed-...,Makeup Removers,cleansers,4.3K,https://www.sephora.com/productimages/sku/s164...,$30.00,,"[Water, Cyclomethicone, Butylene Glycol, Sd Al...",4.5 stars,90,[],ShiseidoInstant Eye And Lip Makeup Remover
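
To compare missing values across tables without reading each `info()` printout, the counts can be gathered in a single frame. This is a minimal sketch on made-up rows; the rule for which rows to drop (those with no brand or name) is an assumption, since the summary only says rows were deleted where appropriate.

```python
import pandas as pd

# Made-up stand-ins for the raw dataframes (not real scraped data)
frames = {
    "cleansers": pd.DataFrame({"brand": ["A", None, "C"],
                               "name": ["x", None, "z"],
                               "size": ["5 oz", None, None]}),
    "treatments": pd.DataFrame({"brand": ["D", "E"],
                                "name": ["u", "v"],
                                "size": [None, "1 oz"]}),
}

# One row per table, one column per variable, values = number of missing entries
missing = pd.DataFrame({label: df.isnull().sum() for label, df in frames.items()}).T
print(missing)

# Assumed rule: a row without brand or name cannot be identified, so drop it
cleaned = {label: df.dropna(subset=["brand", "name"]).reset_index(drop=True)
           for label, df in frames.items()}
print({label: len(df) for label, df in cleaned.items()})
```

Rows that are only missing `size` are kept here, because the size can still be handled later when the uniform size feature is built.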


In [56]:
treatments_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 578 entries, 0 to 577
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   brand              578 non-null    object
 1   name               578 non-null    object
 2   about the product  578 non-null    object
 3   weblink            578 non-null    object
 4   sub_category       578 non-null    object
 5   main_category      578 non-null    object
 6   num_likes          578 non-null    object
 7   img_link           578 non-null    object
 8   price              573 non-null    object
 9   size               498 non-null    object
 10  ingredients        578 non-null    object
 11  rating             578 non-null    object
 12  num_reviews        572 non-null    object
 13  highlights         578 non-null    object
 14  product_name       578 non-null    object
dtypes: object(15)
memory usage: 67.9+ KB


In [57]:
treatments_raw[treatments_raw['size'].isnull()]

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights,product_name
13,fresh,Umbrian Clay Pore Purifying Face Mask,"[What it is: , A mineral-rich clay mask and f...",https://www.sephora.com/product/b-oil-P442757?...,Face Masks,treatments,100.7K,https://www.sephora.com/productimages/sku/s182...,$58.00,,"[ -Umbrian Clay: Works to help balance, purify...",4.5 stars,875,[],freshUmbrian Clay Pore Purifying Face Mask
14,First Aid Beauty,Ultra Repair® Hydrating Serum,"[What it is: , A water-based serum that gives...",https://www.sephora.com/product/b-oil-P442757?...,Face Serums,treatments,50.2K,https://www.sephora.com/productimages/sku/s220...,$38.00,,[-Colloidal Oatmeal: Helps calm and soothe dry...,4 stars,671,"[Fragrance Free, Cruelty-Free, Best for Dry Sk...",First Aid BeautyUltra Repair® Hydrating Serum
18,CLINIQUE,"Clean Skin, Fresh Start Acne Solutions Set","[What it is: , A set of fast-acting formulas ...",https://www.sephora.com/product/time-in-bottle...,Value & Gift Sets,treatments,920,https://www.sephora.com/productimages/sku/s242...,$19.50,,[-Salicylic Acid and Acetyl Glucosamine: Remov...,3 stars,3,[],"CLINIQUEClean Skin, Fresh Start Acne Solutions..."
19,Perricone MD,Vitamin C Ester Brightening Serum,"[Which skin type is it good for?, \n, ✔ Normal...",https://www.sephora.com/product/time-in-bottle...,Face Serums,treatments,14K,https://www.sephora.com/productimages/sku/s212...,$69.00,,"[ -Vitamin C Ester: Brightens and smooths.\n,...",4.5 stars,439,"[Best for Dry, Combo, Normal Skin, Good for: D...",Perricone MDVitamin C Ester Brightening Serum
35,Murad,Rapid Collagen Infusion,"[What it is:, \n, An antiaging treatment that ...",https://www.sephora.com/product/ningaloo-coppe...,Face Serums,treatments,8.3K,https://www.sephora.com/productimages/sku/s221...,$79.00,,[ -Collagen Support Complex: Supports natural...,4 stars,121,[],MuradRapid Collagen Infusion
49,SEPHORA COLLECTION,Clarifying Lotion,"[What it is: , A 90 percent natural-ingredien...",https://www.sephora.com/product/youth-or-dare-...,Blemish & Acne Treatments,treatments,3.5K,https://www.sephora.com/productimages/sku/s225...,$16.00,,[ -Natural-Origin Salicylic Acid: Exfoliates. ...,3.5 stars,23,"[Clean at Sephora, Vegan]",SEPHORA COLLECTIONClarifying Lotion
56,Jack Black,Power Peel Multi-Acid Resurfacing Pads,"[What it is:, \n\n, A high-potency formula tha...",https://www.sephora.com/product/superserum-6-a...,Moisturizer & Treatments,treatments,5.3K,https://www.sephora.com/productimages/sku/s138...,$38.00,,"[Water, Alcohol Denat., Butylene Glycol, Glyco...",4.5 stars,51,[],Jack BlackPower Peel Multi-Acid Resurfacing Pads
57,Dr. Jart+,Focuspot™ Dark Circle Micro Tip™ Patch,"[What it is: , A set of self-dissolving micro ...",https://www.sephora.com/product/superserum-6-a...,Face Serums,treatments,0,https://www.sephora.com/productimages/sku/s219...,$18.00,,"[Sodium Hyaluronate, Glycerin, Oxidized Glutat...",3 stars,107,[],Dr. Jart+Focuspot™ Dark Circle Micro Tip™ Patch
74,HUM Nutrition,Big Chill™ Stress Management Supplement,"[Beauty Benefit:, Stress Relief , What it is:...",https://www.sephora.com/product/kx-active-conc...,Beauty Supplements,treatments,14K,https://www.sephora.com/productimages/sku/s223...,$20.00,,[-Rhodiola Rosea: Supports adrenal balance to ...,4.5 stars,68,[],HUM NutritionBig Chill™ Stress Management Supp...
84,Kiehl's Since 1851,Clearly Corrective Accelerated Clarity & Renew...,"[What it is: , A two-week ampoule brightening...",https://www.sephora.com/product/clinique-id-yo...,Face Serums,treatments,774,https://www.sephora.com/productimages/sku/s232...,$95.00,,[-Activated C: Visibly brightens skin and redu...,4 stars,145,[],Kiehl's Since 1851Clearly Corrective Accelerat...


In [58]:
eyeproducts_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 220 entries, 0 to 219
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   brand              220 non-null    object
 1   name               220 non-null    object
 2   about the product  220 non-null    object
 3   weblink            220 non-null    object
 4   sub_category       220 non-null    object
 5   main_category      220 non-null    object
 6   num_likes          220 non-null    object
 7   img_link           220 non-null    object
 8   price              216 non-null    object
 9   size               190 non-null    object
 10  ingredients        220 non-null    object
 11  rating             220 non-null    object
 12  num_reviews        216 non-null    object
 13  highlights         220 non-null    object
 14  product_name       220 non-null    object
dtypes: object(15)
memory usage: 25.9+ KB


In [59]:
eyeproducts_raw[eyeproducts_raw['size'].isnull()]

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights,product_name
0,Peace Out,Retinol Eye Stick,"[What it is: , A concentrated serum balm that...",https://www.sephora.com/product/the-pearl-tint...,Eye Creams & Treatments,eye products,73.4K,https://www.sephora.com/productimages/sku/s241...,$28.00,,[-Encapsulated Retinol: Softens the look of fi...,4.5 stars,495,"[Retinol, Good for: Anti-Aging, Good for: Loss...",Peace OutRetinol Eye Stick
3,Wander Beauty,Baggage Claim Eye Masks,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/the-pearl-tint...,Eye Masks,eye products,32.8K,https://www.sephora.com/productimages/sku/s215...,$25.00,,[-Peptides: Help to improve the texture and to...,4 stars,446,"[Cruelty-Free, allure 2020 Best of Beauty Awar...",Wander BeautyBaggage Claim Eye Masks
6,Tatcha,The Pearl Tinted Eye Illuminating Treatment,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/the-pearl-tint...,Eye Creams & Treatments,eye products,69.6K,https://www.sephora.com/productimages/sku/s207...,$48.00,,[ -Liquid Extract from Akoya Pearls: A natural...,4 stars,1.1K,"[Clean at Sephora, Community Favorite, Good fo...",TatchaThe Pearl Tinted Eye Illuminating Treatment
8,IT Cosmetics,Your Hydrating Moisturizer and Eye Cream Set,"[What it is: , An essential duo set that cont...",https://www.sephora.com/product/maracuja-c-bri...,Value & Gift Sets,eye products,460,https://www.sephora.com/productimages/sku/s244...,$65.00,,"[-Ceramides: Help support the skin barrier, ke...",4 stars,,[],IT CosmeticsYour Hydrating Moisturizer and Eye...
21,KNC Beauty,All Natural Retinol Infused Eye Mask- 5 Pack,"[What it is: , A eye mask set that will give ...",https://www.sephora.com/product/l-occitane-imm...,Eye Masks,eye products,6.2K,https://www.sephora.com/productimages/sku/s222...,$25.00,,"[ -Collagen: Helps plum., -Hyaluronic acid: He...",4 stars,21,"[allure 2020 Best of Beauty Award Winner , Goo...",KNC BeautyAll Natural Retinol Infused Eye Mask...
25,Anthony,High Performance Continuous Moisture Eye Cream,"[What it is:, \n, An advanced, time-released,...",https://www.sephora.com/product/l-occitane-imm...,Eye Cream,eye products,2.9K,https://www.sephora.com/productimages/sku/s160...,$40.00,,[ -Shea Butter: Helps reduce the appearance of...,4 stars,38,[],AnthonyHigh Performance Continuous Moisture Ey...
26,Perricone MD,No Makeup Concealer Broad Spectrum SPF 20,"[What it is: , A multitasking concealer that i...",https://www.sephora.com/product/l-occitane-imm...,Concealer,eye products,9K,https://www.sephora.com/productimages/sku/s221...,$35.00,,"[Water, Dimethicone, Cyclopentasiloxane, Butyl...",4 stars,187,[],Perricone MDNo Makeup Concealer Broad Spectrum...
33,REN Clean Skincare,Vita Mineral™ Active 7 Eye Gel,"[What it is: , A light, vegan, cooling eye ge...",https://www.sephora.com/product/sobel-skin-rx-...,Eye Creams & Treatments,eye products,3.7K,https://www.sephora.com/productimages/sku/s211...,$36.00,,"[ -Wild Rumex: Illuminates the eye area. , -Ar...",3.5 stars,150,"[Clean at Sephora, Good for: Dryness, Good for...",REN Clean SkincareVita Mineral™ Active 7 Eye Gel
37,Clarins,Eye Contour Gel,"[What it is:, A cooling eye gel to minimize pu...",https://www.sephora.com/product/the-quench-eye...,Eye Creams & Treatments,eye products,3.1K,https://www.sephora.com/productimages/sku/s137...,$43.00,,[ -Certified Organic Ginkgo Biloba: Minimizes ...,4 stars,102,[],ClarinsEye Contour Gel
38,BeautyBio,GloPRO® EYE MicroTip™ Attachment Head,"[What it is: , An attachment head that is des...",https://www.sephora.com/product/the-quench-eye...,Anti-Aging,eye products,4.3K,https://www.sephora.com/productimages/sku/s216...,$35.00,,"[Suggested Usage:, -Use GloPRO® EYE in the PM ...",3 stars,6,[],BeautyBioGloPRO® EYE MicroTip™ Attachment Head


In [60]:
moisturizers_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 645 entries, 0 to 644
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   brand              645 non-null    object
 1   name               645 non-null    object
 2   about the product  645 non-null    object
 3   weblink            645 non-null    object
 4   sub_category       645 non-null    object
 5   main_category      645 non-null    object
 6   num_likes          645 non-null    object
 7   img_link           645 non-null    object
 8   price              642 non-null    object
 9   size               569 non-null    object
 10  ingredients        645 non-null    object
 11  rating             645 non-null    object
 12  num_reviews        635 non-null    object
 13  highlights         645 non-null    object
 14  product_name       645 non-null    object
dtypes: object(15)
memory usage: 75.7+ KB


In [62]:
moisturizers_raw[moisturizers_raw['size'].isnull()]

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights,product_name
8,IT Cosmetics,CC+ Cream Illumination with SPF 50+,"[What it is: , A color-correcting, full-covera...",https://www.sephora.com/product/jlo-beauty-tha...,BB & CC Cream,moisturizers,60K,https://www.sephora.com/productimages/sku/s186...,$39.50,,[ -Collagen: Enhances visible elasticity for a...,4 stars,732,"[Full Coverage, SPF, Cream Formula, Radiant Fi...",IT CosmeticsCC+ Cream Illumination with SPF 50+
15,Murad,Nutrient-Charged Water Gel,"[What it is: , A lightweight, oil-free gel mo...",https://www.sephora.com/product/face-finishing...,Moisturizers,moisturizers,18.5K,https://www.sephora.com/productimages/sku/s221...,$62.00,,"[-Vitamins B3, B5, B6, B9, and E: Alleviate dr...",4.5 stars,611,"[Good for: Dullness/Uneven Texture, Good for: ...",MuradNutrient-Charged Water Gel
20,Briogeo,B. Well Organic + Australian 100% Tea Tree Sk...,"[What it is: , A 100 percent pure oil with pot...",https://www.sephora.com/product/b-well-organic...,Hair Oil,moisturizers,20.3K,https://www.sephora.com/productimages/sku/s218...,$32.00,,[ -Melaleuca Alternifolia (Tea Tree) Leaf Oil:...,5 stars,155,"[Clean at Sephora, Good for: Flaky/Dry Scalp, ...",Briogeo B. Well Organic + Australian 100% Tea ...
23,Briogeo,B. Well Organic + Cold-Pressed 100% Castor Oil,"[What it is: , A cold-pressed, organic*, fair-...",https://www.sephora.com/product/b-well-organic...,Hair Oil,moisturizers,34K,https://www.sephora.com/productimages/sku/s218...,$26.00,,[ -100% Organic* Cold-Pressed Castor Oil: Oil ...,4.5 stars,269,"[Clean at Sephora, Good for: Dryness, Fragranc...",BriogeoB. Well Organic + Cold-Pressed 100% Cas...
35,bareMinerals,Prime Time™ BB Tinted Primer Broad Spectrum S...,"[Skin type:, , ✔ Combination , ✔ Oily, Wha...",https://www.sephora.com/product/bienfait-multi...,BB & CC Cream,moisturizers,20.2K,https://www.sephora.com/productimages/sku/s157...,$27.00,,"[Cyclopentasiloxane, Dimethicone Crosspolymer,...",4 stars,334,[],bareMineralsPrime Time™ BB Tinted Primer Broa...
40,Dr. Brandt Skincare,Hydro Biotic™ Recovery Sleeping Mask,"[What it is: , A leave-on mask that works ove...",https://www.sephora.com/product/hyaluronic-mar...,Face Masks,moisturizers,16.5K,https://www.sephora.com/productimages/sku/s202...,$52.00,,[ -Biotic Balancing Complex: Maintains skin's ...,4.5 stars,1.1K,"[Best for Dry, Combo, Normal Skin, Good for: D...",Dr. Brandt SkincareHydro Biotic™ Recovery Slee...
47,CLINIQUE,Acne Solutions™ BB Cream SPF 40,"[What it is: , A mattifying BB Cream specific...",https://www.sephora.com/product/kate-somervill...,BB & CC Cream,moisturizers,43.5K,https://www.sephora.com/productimages/sku/s167...,$39.50,,[ -Mica and Titanium Dioxide: Light-diffusing ...,4 stars,696,[],CLINIQUEAcne Solutions™ BB Cream SPF 40
56,Dr. Jart+,Black Label Detox BB Beauty Balm,"[What it is: , A beauty balm that supports ant...",https://www.sephora.com/product/goop-goopgenes...,BB & CC Cream,moisturizers,28.4K,https://www.sephora.com/productimages/sku/s142...,$36.00,,"[Water, Cyclopentasiloxane, Phenyl Trimethicon...",4 stars,1.5K,[],Dr. Jart+Black Label Detox BB Beauty Balm
81,Dior,Capture Totale Intensive Night Restorative Crème,"[What it is:, A face cream that works to comba...",https://www.sephora.com/product/double-duty-fa...,Moisturizers,moisturizers,3.6K,https://www.sephora.com/productimages/sku/s163...,$175.00,,"[Aqua (Water), Glycerin, Decyl Oleate, Hydroge...",4.5 stars,37,"[Good for: Anti-Aging, Good for: Dark spots, G...",DiorCapture Totale Intensive Night Restorative...
119,Murad,Essential-C Day Moisture Broad Spectrum SPF 30...,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/goodnight-glow...,Moisturizers,moisturizers,19.5K,https://www.sephora.com/productimages/sku/s120...,$65.00,,"[ -Avobenzone 3%\n, -Homosalate 6.5%\n, -Octin...",4 stars,457,"[SPF, Vitamin C, Best for Oily, Combo, Normal ...",MuradEssential-C Day Moisture Broad Spectrum S...


Most rows with no 'size' value are masks, supplements, devices, peels, make-up removers, gift sets, eye masks, eye gels, massagers/rollers and similar items. These are not the main focus of the study, so the rows are deleted from each dataframe.

In [63]:
print('Initial shape: ', cleansers_raw.shape)
cleansers_raw.drop(cleansers_raw[cleansers_raw['size'].isnull()].index, inplace=True)
print('After deletion: ', cleansers_raw.shape)

Initial shape:  (409, 15)
After deletion:  (363, 15)


In [64]:
print('Initial shape: ', treatments_raw.shape)
treatments_raw.drop(treatments_raw[treatments_raw['size'].isnull()].index, inplace=True)
print('After deletion: ', treatments_raw.shape)

Initial shape:  (578, 15)
After deletion:  (498, 15)


In [65]:
print('Initial shape: ', eyeproducts_raw.shape)
eyeproducts_raw.drop(eyeproducts_raw[eyeproducts_raw['size'].isnull()].index, inplace=True)
print('After deletion: ', eyeproducts_raw.shape)

Initial shape:  (220, 15)
After deletion:  (190, 15)


In [66]:
print('Initial shape: ', moisturizers_raw.shape)
moisturizers_raw.drop(moisturizers_raw[moisturizers_raw['size'].isnull()].index, inplace=True)
print('After deletion: ', moisturizers_raw.shape)

Initial shape:  (645, 15)
After deletion:  (569, 15)
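
The same drop is repeated for each of the four dataframes. A minimal sketch of doing it in a loop, so that the before and after shapes are always read from the same frame, could look like this. The names `frames`, `cleaned` and `drop_missing` are hypothetical and the data is a toy stand-in for the scraped files.

```python
import pandas as pd

def drop_missing(frame, column):
    """Return a copy of `frame` without the rows where `column` is null."""
    return frame.dropna(subset=[column]).copy()

# Toy stand-ins for the scraped dataframes (hypothetical data).
frames = {
    "cleansers": pd.DataFrame({"size": ["5 oz", None, "1 oz"]}),
    "moisturizers": pd.DataFrame({"size": [None, "1.7 oz"]}),
}

cleaned = {}
for label, frame in frames.items():
    cleaned[label] = drop_missing(frame, "size")
    # Both shapes refer to the same category, so they cannot be mixed up.
    print(label, frame.shape, "->", cleaned[label].shape)
```

Returning a new frame instead of dropping in place also leaves the raw data available for later checks.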


In [67]:
#Delete rows with no target variable (price) - fewer than 10 per category
eyeproducts_raw.drop(eyeproducts_raw[eyeproducts_raw['price'].isnull()].index, inplace=True)
cleansers_raw.drop(cleansers_raw[cleansers_raw['price'].isnull()].index, inplace=True)
treatments_raw.drop(treatments_raw[treatments_raw['price'].isnull()].index, inplace=True)
moisturizers_raw.drop(moisturizers_raw[moisturizers_raw['price'].isnull()].index, inplace=True)

In [68]:
#Delete rows with no reviews 
eyeproducts_raw.drop(eyeproducts_raw[eyeproducts_raw['num_reviews'].isnull()].index, inplace=True)
cleansers_raw.drop(cleansers_raw[cleansers_raw['num_reviews'].isnull()].index, inplace=True)
treatments_raw.drop(treatments_raw[treatments_raw['num_reviews'].isnull()].index, inplace=True)
moisturizers_raw.drop(moisturizers_raw[moisturizers_raw['num_reviews'].isnull()].index, inplace=True)

Combining the dataframes before further cleaning...

In [69]:
#pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([cleansers_raw, treatments_raw, moisturizers_raw, eyeproducts_raw])

In [70]:
df.shape

(1586, 15)

**3) Removed 'stars' from rating and converted column into numeric type**

In [71]:
#Remove 'stars' from rating, convert to numeric
df['rating']= df.rating.str.partition(' ')[0]
df['rating']=pd.to_numeric(df['rating'])

**4) Removed the currency sign from price and converted the column into numeric type**

In [72]:
#Remove $ from price, convert to numeric 
df['price']=df.price.str.strip('$')
df['price']=pd.to_numeric(df['price'])

**5) Translated both number of likes and number of reviews into digits and converted into numeric data**

In [73]:
#Cleaning num_likes column
df.loc[df.num_likes.str.endswith('K'), 'num_likes']=df.loc[df.num_likes.str.endswith('K'), 'num_likes'].str.strip('K').astype('float64')*1000
df['num_likes']=df['num_likes'].astype('int64')

In [74]:
#Cleaning num_reviews column
df.loc[df.num_reviews.str.endswith('K'), 'num_reviews']=df.loc[df.num_reviews.str.endswith('K'), 'num_reviews'].str.strip('K').astype('float64')*1000
df['num_reviews']=df['num_reviews'].astype('int64')
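
The two cells above apply the same rule: a trailing 'K' means thousands. As a hedged sketch, that rule can be written once as a small helper and mapped over either column. `parse_count` is a hypothetical name, and the sketch assumes 'K' is the only suffix that occurs.

```python
def parse_count(text):
    """Convert strings such as '18.5K' or '732' to an integer count."""
    text = text.strip()
    if text.upper().endswith("K"):
        # round() guards against floating-point artefacts before casting
        return int(round(float(text[:-1]) * 1000))
    return int(text)

for raw in ["60K", "18.5K", "1.1K", "732"]:
    print(raw, "->", parse_count(raw))
```

With pandas this would be applied as `df['num_likes'].map(parse_count)`, and likewise for `num_reviews`.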

**6) Created new features for the skin types each product is compatible with, derived from the 'about the product' section.**

In [76]:
#Objective is to extract the skin type the product is compatible with from the 'about the product' section.
#Then create a new feature for each of the skin types

def search_in_list(term, list_desc):
    '''
    Returns 1 if any item on the list contains the term and 0 otherwise (term - lowercase)
    '''
    if any(term in i.lower() for i in list_desc):
        return 1
    else:
        return 0

In [77]:
#Create new features based on skintype
df['sensitive']= df['about the product'].apply(lambda row: search_in_list('sensitive', row))
df['combination']= df['about the product'].apply(lambda row: search_in_list('combination', row))
df['oily']= df['about the product'].apply(lambda row: search_in_list('oily', row))
df['normal']= df['about the product'].apply(lambda row: search_in_list('normal', row))
df['dry']= df['about the product'].apply(lambda row: search_in_list('dry', row))
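
Note that a plain substring test also fires on longer words, so 'dry' matches 'Dryness'. If that is not intended, one possible alternative is a whole-word match. The sketch below is an assumption about the desired behaviour, not the method used in this notebook; `mentions_word` and the sample list are hypothetical.

```python
import re

def mentions_word(term, lines):
    """Return 1 if `term` occurs as a whole word in any line, else 0."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", flags=re.IGNORECASE)
    return int(any(pattern.search(line) for line in lines))

# Hypothetical 'about the product' entry
about = ["Skin type: ", "✔ Combination ", "✔ Oily", "Good for: Dryness"]

print(mentions_word("oily", about))  # whole word present
print(mentions_word("dry", about))   # only 'Dryness' present
```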

In [78]:
#Inspect the output
df.head()

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights,product_name,sensitive,combination,oily,normal,dry
0,Glow Recipe,Watermelon Glow PHA +BHA Pore-Tight Toner,"[What it is: , A bestselling gentle, PHA- and ...",https://www.sephora.com/product/cleansing-exfo...,Toners,cleansers,125100,https://www.sephora.com/productimages/sku/s234...,34.0,5.07 oz/ 150 mL,"[ -Watermelon Extract: Hydrates, delivers esse...",4.5,1900,"[Good for: Pores , Good for: Dullness/Uneven T...",Glow RecipeWatermelon Glow PHA +BHA Pore-Tight...,0,1,1,1,1
2,Tatcha,Pure One Step Camellia Oil Cleanser,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/cleansing-exfo...,Face Wash & Cleansers,cleansers,107600,https://www.sephora.com/productimages/sku/s167...,48.0,5.1 oz/ 150 mL,[ -Japanese Camellia Oil (Tsubaki): Seals in m...,4.5,1700,"[Clean at Sephora, Hydrating, Best for Dry, Co...",TatchaPure One Step Camellia Oil Cleanser,1,1,1,1,1
3,goop,GOOPGLOW Microderm Instant Glow Exfoliator,"[What it is: , A clinically tested, dual-acti...",https://www.sephora.com/product/cleansing-exfo...,Exfoliators,cleansers,12900,https://www.sephora.com/productimages/sku/s231...,125.0,1.7 oz/ 50 mL,"[ -Micro-exfoliating Minerals (Quartz, Garnet,...",4.5,1200,"[Clean at Sephora, Best for Dry, Combo, Normal...",goopGOOPGLOW Microderm Instant Glow Exfoliator,0,1,1,1,1
4,Lancôme,Bi-Facil Double-Action Eye Makeup Remover,"[What it is: , An award-winning makeup remove...",https://www.sephora.com/product/cleansing-exfo...,Makeup Removers,cleansers,58700,https://www.sephora.com/productimages/sku/s534...,32.0,4.2 oz/ 125 mL,"[Aqua/Water, Cyclopentasiloxane, Isohexadecane...",4.5,3700,[],LancômeBi-Facil Double-Action Eye Makeup Remover,1,1,1,1,1
5,OLEHENRIKSEN,Balancing Force™ Oil Control Toner,"[Which skin type is it good for?, \n, ✔ Normal...",https://www.sephora.com/product/cleansing-exfo...,Toners,cleansers,92500,https://www.sephora.com/productimages/sku/s191...,29.0,6.5 oz/ 193 mL,[ -Green Fusion Complex™: Proprietary blend o...,4.5,1600,"[Salicylic Acid, AHA/Glycolic Acid, Good for: ...",OLEHENRIKSENBalancing Force™ Oil Control Toner,1,1,1,1,1


**7) Created new features based on highlights**

In [79]:
#First inspect what are the top categories for highlights 
highlights_list={}
for row in df['highlights']:
    for i in row:
        if i not in highlights_list.keys():
            highlights_list[i]=1 
        else:
            highlights_list[i]+=1
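
The same tally can be sketched with `collections.Counter` from the standard library, which also provides the sorted-by-count view directly. The `rows` list is a toy stand-in for `df['highlights']`.

```python
from collections import Counter

# Toy stand-in for the list-valued 'highlights' column
rows = [["Vegan", "Clean at Sephora"], ["Clean at Sephora"], []]

# Flatten the lists and count each highlight
highlight_counts = Counter(tag for row in rows for tag in row)

# most_common() yields (highlight, count) pairs, highest count first
for tag, count in highlight_counts.most_common():
    print(tag, count)
```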

In [80]:
#highlights sorted by count
for item in sorted(highlights_list, key=highlights_list.get, reverse=True):
    print(item, highlights_list[item])

Clean at Sephora 577
Cruelty-Free 311
Good for: Dullness/Uneven Texture 291
Best for Dry, Combo, Normal Skin 285
Good for: Dryness 275
Good for: Anti-Aging 267
Vegan 234
Without Parabens  197
Best for Oily, Combo, Normal Skin 157
Good for: Loss of firmness 150
Hyaluronic Acid 146
Good for: Pores  134
Without Sulfates SLS & SLES 127
Community Favorite 121
Hydrating 106
Good for: Acne/Blemishes 105
Vitamin C 94
Recyclable Packaging 93
Fragrance Free 89
Salicylic Acid 77
AHA/Glycolic Acid 72
Without Phthalates  72
Good for: Dark Circles 71
Good for: Dark spots 68
Oil Free 56
Retinol 49
Niacinamide 45
Good for: Redness 44
Gluten Free 38
Alcohol Free 37
Best for Dry Skin 27
SPF 27
Hypoallergenic 25
Without Silicones 24
Best for Normal Skin 20
Lactic Acid 20
allure 2020 Best of Beauty Award Winner  19
Collagen 18
Best for Oily Skin 14
Without Mineral Oil  14
Sustainable Packaging 13
CBD 13
Plumping  11
Radiant Finish 9
Best for Combination Skin 8
allure 2018 Best of Beauty Award Winner 8
Ree

In [81]:
#highlights sorted alphabetically
for item in sorted(highlights_list):
    print(item, highlights_list[item])

AHA/Glycolic Acid 72
Alcohol Free 37
Best for Combination Skin 8
Best for Dry Skin 27
Best for Dry, Combo, Normal Skin 285
Best for Normal Skin 20
Best for Oily Skin 14
Best for Oily, Combo, Normal Skin 157
CBD 13
Clean at Sephora 577
Collagen 18
Community Favorite 121
Cream Formula 7
Cruelty-Free 311
Fragrance Free 89
Fresh Scent 2
Gluten Free 38
Good for: Acne/Blemishes 105
Good for: Anti-Aging 267
Good for: Dark Circles 71
Good for: Dark spots 68
Good for: Dryness 275
Good for: Dullness/Uneven Texture 291
Good for: Hair Dryness 2
Good for: Loss of firmness 150
Good for: Pores  134
Good for: Redness 44
High Shine Finish 1
Hyaluronic Acid 146
Hydrating 106
Hypoallergenic 25
Increases Shine 1
Lactic Acid 20
Liquid Formula 4
Long-wearing 1
Loose Powder Formula 1
Matte Finish 3
Natural Finish 1
Niacinamide 45
Oil Free 56
Plumping  11
Radiant Finish 9
Recyclable Packaging 93
Reef Safe SPF 8
Refill Available 3
Retinol 49
SPF 27
Salicylic Acid 77
Shimmer Finish 1
Stick Formula 1
Sustainable

Based on the results above, several highlights are merged into more general categories. This level of detail is not necessary, and merging also avoids creating too many features.

1) Good for: 
- Acne/Blemishes
- Anti-Aging 
- Dark Circles  or Dark spots
- Dryness 
- Dullness/Uneven Texture 
- Loss of firmness
- Pores
- Redness
- Hydrating (to be added separately)
    
2) Without x / 'dirty' ingredients not used 
- Formaldehydes, Mineral Oil, Parabens, Phthalates, Retinyl Palmitate, Silicones
- Fragrance free (to be added separately)

3) Best for: 
- Combination (Combo), Dry, Normal, Oily

4) Clean at Sephora 

5) Cruelty-Free 

6) Vegan

7) Skincare acids
- Hyaluronic Acid
- Salicylic Acid 
- AHA/Glycolic Acid
- Vitamin C (to be added separately) 

8) Awards
- Community favorite
- any type of allure award

In [82]:
#Tackling 4,5,6 - more straight-forward 
df['clean']= df['highlights'].apply(lambda row: search_in_list('clean', row))
df['cruelty-free']= df['highlights'].apply(lambda row: search_in_list('cruelty', row))
df['vegan']= df['highlights'].apply(lambda row: search_in_list('vegan', row))

In [85]:
def extract_info2(phrase_add=None, phrase=None, list_=None):
    '''
    For each highlight in list_:
    1) if it contains phrase, keep the text that immediately follows phrase
    2) if phrase_add is given and the highlight contains it, keep phrase_add itself
    Returns the collected items, or None if nothing matched
    '''
    parsed = []
    for highlight in list_:
        if phrase in highlight:
            parsed.append(highlight.partition(phrase)[2])
        if phrase_add is not None and phrase_add in highlight:
            parsed.append(phrase_add)
    return parsed if parsed else None

In [87]:
#Using the custom function above, a similar process is undertaken for 1,2,3
df['skin_concerns']=df['highlights'].apply(lambda row: extract_info2('Hydrating', 'Good for: ', row))
df['excl_ingr']=df['highlights'].apply(lambda row: extract_info2('Fragrance', 'Without ', row))
df['best for']= df['highlights'].apply(lambda row: extract_info2(None, 'Best for ', row))

In [88]:
#Check output
df.iloc[11,13] #highlights
df.iloc[11, 23] #good for

['Alcohol Free',
 'Good for: Dullness/Uneven Texture',
 'Hydrating',
 'AHA/Glycolic Acid',
 'Clean at Sephora',
 'Community Favorite']

['Dullness/Uneven Texture', 'Hydrating']

In [89]:
#Check output
df.iloc[51, 13] #highlights
df.iloc[51, 24] #excluded ingredients

['Best for Oily, Combo, Normal Skin',
 'Good for: Acne/Blemishes',
 'Salicylic Acid',
 'Without Parabens ',
 'Without Sulfates SLS & SLES',
 'Fragrance Free']

['Parabens ', 'Sulfates SLS & SLES', 'Fragrance']

In [90]:
#Check output
df.iloc[534, 13] #highlights
df.iloc[534, 23] 
df.iloc[534, 24]
df.iloc[534, 25] 

'Snow Mushroom Water Serum'

['Best for Dry, Combo, Normal Skin',
 'Good for: Anti-Aging',
 'Hyaluronic Acid',
 'Cruelty-Free',
 'Vegan',
 'Clean at Sephora']

['Anti-Aging']

['Dry, Combo, Normal Skin']

In [91]:
#Inspect output
df.sample(12)

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights,product_name,sensitive,combination,oily,normal,dry,clean,cruelty-free,vegan,skin_concerns,excl_ingr,best for
576,Edible Beauty,No. 2 Citrus Rhapsody Toner Mist,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/vitamin-c-sunc...,Mists & Essences,moisturizers,3600,https://www.sephora.com/productimages/sku/s205...,36.0,3.4 oz/ 100 mL,[ -Sweet Orange Extract: Balances oil and supp...,4.5,38,[],Edible BeautyNo. 2 Citrus Rhapsody Toner Mist,1,1,1,1,1,0,0,0,,,
13,BeautyBio,Bright Eyes Collagen-Infused Brightening Collo...,"[What it is: , A pair of eye masks that de-pu...",https://www.sephora.com/product/skin-to-die-fo...,Eye Masks,eye products,8900,https://www.sephora.com/productimages/sku/s216...,40.0,15 pairs,[ -Illuminating Pearl: Helps eyes appear brigh...,3.5,41,"[Clean at Sephora, Good for: Dark Circles, Goo...",BeautyBioBright Eyes Collagen-Infused Brighten...,0,1,1,1,1,1,1,0,"[Dark Circles, Loss of firmness, Dullness/Unev...",,
115,SUNDAY RILEY,Pink Drink Firming Resurfacing Essence,"[What it is: , A peptide-infused essence that...",https://www.sephora.com/product/goodnight-glow...,Mists & Essences,moisturizers,9500,https://www.sephora.com/productimages/sku/s237...,48.0,1.7 oz/ 50 mL,[ -Peptide Complex: Visibly firms and resurfac...,3.5,34,[],SUNDAY RILEYPink Drink Firming Resurfacing Ess...,0,1,1,1,1,0,0,0,,,
173,Edible Beauty,No. 4 Vanilla Silk Hydrating Lotion,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/repairwear-las...,Moisturizers,moisturizers,4600,https://www.sephora.com/productimages/sku/s205...,47.0,1.7 oz/ 50 mL,[ -Cocoa Butter: Rich in antioxidants (polyphe...,4.5,155,[],Edible BeautyNo. 4 Vanilla Silk Hydrating Lotion,1,1,1,1,1,0,0,0,,,
405,Sulwhasoo,Bloomstay Vitalizing Cream,"[What it is: , A cushiony cream that helps vi...",https://www.sephora.com/product/volition-beaut...,Moisturizers,moisturizers,1700,https://www.sephora.com/productimages/sku/s229...,155.0,1.69 oz/ 50 mL,[ -Green Plum Blossom Extracts: Rich in antiox...,4.5,9,"[Best for Dry, Combo, Normal Skin, Good for: A...",SulwhasooBloomstay Vitalizing Cream,0,1,1,1,1,0,0,0,"[Anti-Aging, Dullness/Uneven Texture, Dryness,...",[Parabens ],"[Dry, Combo, Normal Skin]"
415,Origins,Dr. Andrew Weil For Origins™ Mega-Mushroom Rel...,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/advanced-c-rad...,Toners,moisturizers,8600,https://www.sephora.com/productimages/sku/s210...,21.0,3.4 oz/ 100 mL,[ -Reishi Mushroom: Works over time to help st...,4.0,109,"[Clean at Sephora, Good for: Dryness]",OriginsDr. Andrew Weil For Origins™ Mega-Mushr...,1,1,1,1,1,1,0,0,[Dryness],,
270,AMOREPACIFIC,Treatment Enzyme Exfoliating Powder Cleanser,"[What it is: , A powder to foam exfoliating da...",https://www.sephora.com/product/dr-barbara-stu...,Facial Peels,treatments,64000,https://www.sephora.com/productimages/sku/s218...,60.0,1.76 oz/ 50 g,[ -Green Tea-derived Probiotic Enzyme (Lactoba...,4.5,1600,"[Best for Dry, Combo, Normal Skin, Good for: D...",AMOREPACIFICTreatment Enzyme Exfoliating Powde...,0,1,1,1,1,0,0,0,"[Dullness/Uneven Texture, Hydrating, Dryness]",[Parabens ],"[Dry, Combo, Normal Skin]"
41,Eve Lom,WHITE Advanced Brightening Serum,"[Which skin type is it good for?, \n, ✔ Normal...",https://www.sephora.com/product/deep-sea-colla...,Face Serums,treatments,1400,https://www.sephora.com/productimages/sku/s185...,150.0,1 oz/ 30 mL,[ -Dermapep: Improves pigmentation and unifies...,4.0,12,[],Eve LomWHITE Advanced Brightening Serum,1,1,0,1,1,0,0,0,,,
213,MAELYS Cosmetics,B-PERKY Lift & Firm Breast Mask,"[What it is: , An ultra-revolutionary, breast...",https://www.sephora.com/product/dior-capture-t...,Face Masks,moisturizers,3600,https://www.sephora.com/productimages/sku/s226...,39.0,3.38 oz/ 100 mL,"[Water (Aqua), Isopropyl Myristate, Glycerin, ...",4.0,22,[],MAELYS CosmeticsB-PERKY Lift & Firm Breast Mask,0,0,0,0,0,0,0,0,,,
214,Shiseido,White Lucent Brightening Gel Cream,"[What it is: , A refreshing, light, and smoot...",https://www.sephora.com/product/dior-capture-t...,Moisturizers,moisturizers,3400,https://www.sephora.com/productimages/sku/s221...,70.0,1.7 oz/ 50 mL,[-ReNeura Technology+™: Helps skin to self-rep...,3.5,28,[],ShiseidoWhite Lucent Brightening Gel Cream,0,1,1,1,1,0,0,0,,,


Create new features based on:

- Skincare acids: Hyaluronic Acid, Salicylic Acid, AHA/Glycolic Acid, Vitamin C
- Awards: Community favorite, Allure winner

In [115]:
#Create a new feature for skincare acids used and any awards received - 7,8 
def find_and_encode(mapping, list_):
    '''
    mapping: dictionary of {term to search for: string to record when the term is found}
    Returns the recorded strings for all highlights in list_ that contain a term, or None if there are none
    '''
    word_list = []
    for highlight in list_:
        #check every search term against this highlight
        for term in mapping:
            if term in highlight:
                word_list.append(mapping[term])
    return word_list if word_list else None

In [116]:
#For skincare acids
term_map={'Hyaluronic': 'Hyaluronic Acid', 
'Salicylic':'Salicylic Acid',
'AHA':'AHA/Glycolic Acid',
'Vitamin C':'Vitamin C'}

In [117]:
df['acids']= df['highlights'].apply(lambda row: find_and_encode(term_map, row))

In [119]:
#Check output
df.iloc[12, 13] #highlights
df.iloc[12, 26]

['Clean at Sephora',
 'Community Favorite',
 'Salicylic Acid',
 'Good for: Acne/Blemishes',
 'Good for: Pores ',
 'Cruelty-Free']

In [121]:
#Check output
df.acids.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Hyaluronic Acid]                       127
[Vitamin C]                              78
[Salicylic Acid]                         58
[AHA/Glycolic Acid]                      46
[AHA/Glycolic Acid, Salicylic Acid]      12
[Vitamin C, Hyaluronic Acid]              7
[AHA/Glycolic Acid, Hyaluronic Acid]      6
[Hyaluronic Acid, Vitamin C]              4
[Salicylic Acid, AHA/Glycolic Acid]       4
[Vitamin C, AHA/Glycolic Acid]            2
[Salicylic Acid, Vitamin C]               2
[Hyaluronic Acid, AHA/Glycolic Acid]      1
[Salicylic Acid, Hyaluronic Acid]         1
[AHA/Glycolic Acid, Vitamin C]            1
Name: acids, dtype: int64
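
The `TypeError` above is raised because the column holds Python lists, which are not hashable. The counts are also order-sensitive, so `[Vitamin C, Hyaluronic Acid]` and `[Hyaluronic Acid, Vitamin C]` are tallied separately. A minimal sketch of two ways around this, on a toy series standing in for `df['acids']`:

```python
import pandas as pd

# Toy stand-in for the list-valued 'acids' column
acids = pd.Series([["Vitamin C", "Hyaluronic Acid"],
                   ["Hyaluronic Acid", "Vitamin C"],
                   ["Salicylic Acid"],
                   None])

# Order-insensitive combinations: sort each list and join it into one string
combos = acids.dropna().map(lambda items: " + ".join(sorted(items)))
combo_counts = combos.value_counts()

# Per-acid totals: explode() gives one row per list element
acid_counts = acids.explode().value_counts()

print(combo_counts)
print(acid_counts)
```

The same approach would apply to the `award` column.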

In [122]:
#For awards
term_map2={'Community': 'Community favorite', 'allure':'allure winner'}

In [123]:
df['award']= df['highlights'].apply(lambda row: find_and_encode(term_map2, row))

In [124]:
#Check output
df.iloc[657, 13] #highlights
df.iloc[657, 27]

['Good for: Dullness/Uneven Texture',
 'Good for: Pores ',
 'Best for Dry, Combo, Normal Skin',
 'Lactic Acid',
 'Cruelty-Free',
 'Community Favorite']

['Community favorite']

In [126]:
#Check output
df.award.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Community favorite]                   111
[allure winner]                         24
[allure winner, Community favorite]     10
Name: award, dtype: int64

**8) Create a new (uniform) feature for 'size'**

In [127]:
#Inspect size
df['size'].unique()

array(['5.07 oz/ 150 mL', '5.1 oz/ 150 mL', '1.7 oz/ 50 mL',
       '4.2 oz/ 125 mL', '6.5 oz/ 193 mL', '1.69 oz/ 50 mL',
       '5.41 oz/ 160 mL', '5 oz/ 150 mL', '1 oz/ 30 mL', '3.4 oz/ 100 mL',
       '6 oz/ 170.1 g', '6.7 oz/ 200 mL', '3.4 oz/100 mL', '8 oz/ 236 mL',
       '3.4 oz/ 101 mL', '4.0 oz/ 120 mL', '10.14 oz/ 300 mL',
       ' 2 oz / 67 mL', '20 Biodegradable Wipes', '2.7 oz/ 75 mL',
       '1 oz/ 29 mL', '3.3 oz/ 100 mL', 'Standard Size - 3.38 oz/ 100 mL',
       '6.76 oz/ 200 mL', '4.2 oz', '1 oz/ 28 g', '6 oz/ 180 mL',
       '3.2oz/100mL', '8.45 oz/ 250 mL', '12.85 oz/ 380 mL',
       '2 oz/ 59 mL', '5 oz/ 148 mL', '4 oz/ 120 ml', '4 oz/ 120 mL',
       '3.4 oz/ 115 ml', '0.68 oz/ 20 mL', '4.7 oz/ 125 mL',
       '5 oz/ 150 ml', '4 oz/120 mL', '8.4 oz/ 250 mL', '3.4 oz/ 100 g',
       '25 Wipes', '3 oz/ 88 g', '1.69 fl oz/ 50 mL', '3.2 oz/ 95 mL',
       '7.1 oz/ 201 g', '6.8 oz/ 200 mL', '28 x 0.01 oz/ 0.5 g packets',
       '4.05 oz', '3.5 oz/ 105 mL', '6.7 oz/ 198

Units used to note size are oz, ml, g.

From initial inspection, some entries have to be manually corrected. 

- 28 x 0.01 oz/ 0.5 g packets
- 10 x 2.27 oz cloths/ 10 x 67 mL cloths
- 50 x 0.04 oz/ 1.25 mL
- 7 Ampoules, .0625 oz/ 1.8 mL each
- Four 0.27 oz/ 8 mL Vials
- 2 x 1.7 oz/ 50 mL
- 0.5 oz/ 15 mL x3
- 2 x 1 oz/ 30 mL
- 12 x 0.08 oz/ 2.2 g pre-saturated chemical peel pads
- 3 x 1.7 oz/ 50.28 mL
- 2 x 2.02 oz/ 60 mL

In [128]:
import re

def return_match(pattern, source):
    '''
    Wrapper around re.search that returns the matched text, or None if there is no match
    '''
    m = re.search(pattern, source)
    return m.group() if m else None

In [129]:
#Create separate columns to capture all the data available - final unit to be decided on later
df['size_oz']=df['size'].apply(lambda x: return_match('[0-9. ]+o[Zz]', x))
df['size_ml']=df['size'].apply(lambda x: return_match('[0-9. ]+m[Ll]', x))
df['size_g']=df['size'].apply(lambda x: return_match('[0-9. ]+g', x))
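
The ounce pattern above requires the number to sit directly before 'oz', so an entry written as '1.69 fl oz' yields only ' oz'. As a hedged alternative, the sketch below allows an optional 'fl' and returns the number as a float. `OZ_PATTERN` and `extract_oz` are hypothetical names, and multi-pack entries such as '28 x 0.01 oz' would still need manual correction.

```python
import re

# number, optional whitespace, optional 'fl'/'fl.', then 'oz' (any case)
OZ_PATTERN = re.compile(r"(\d*\.?\d+)\s*(?:fl\.?\s*)?oz", flags=re.IGNORECASE)

def extract_oz(size_text):
    """Return the first ounce quantity in `size_text` as a float, else None."""
    match = OZ_PATTERN.search(size_text)
    return float(match.group(1)) if match else None

for text in ["5.07 oz/ 150 mL", "1.69 fl oz/ 50 mL", "200ml / 6.76 fl oz",
             "3.2oz/100mL", "20 Biodegradable Wipes"]:
    print(repr(text), "->", extract_oz(text))
```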

In [131]:
#Inspect the values 
df['size_oz'].value_counts()

1 oz         287
1.7 oz       249
0.5 oz       177
5 oz          52
2 oz          50
1.69 oz       43
6.7 oz        42
3.4 oz        37
4 oz          35
4.2 oz        29
1.6 oz        29
1.0 oz        28
5.07 oz       19
1.01 oz       19
6 oz          19
5.0 oz        16
5.1 oz        14
2.5 oz        14
3.3 oz        13
8.4 oz        12
 oz           12
8 oz          11
1.35 oz       11
2.53 oz       10
1.5 oz        10
6.76 oz        9
1.3 oz         9
8.5 oz         9
6.75 oz        9
3.38 oz        8
4.0 oz         7
6.8 oz         7
3 oz           7
2.6 oz         7
4.22 oz        6
0.85 oz        6
4.05 oz        6
0.67 oz        6
2.7 oz         5
1.02 oz        5
0.3 oz         5
1.18 oz        5
4.1 oz         5
0.68 oz        4
0.34 oz        4
0.35 oz        4
2.02 oz        3
6.5 oz         3
1.68 oz        3
5.41 oz        3
2.0 oz         3
1.8 oz         3
0.51 oz        3
3.6 oz         3
 1.7 oz        3
2.3 oz         3
0.95 oz        3
0.7 oz         3
0.33 oz       

Unit to be used --> oz

In [132]:
#Clean size_oz, removing trailing and leading whitespace and 'oz'
df['size_oz']=df['size_oz'].str.replace(' oz', '')
df['size_oz']=df['size_oz'].str.replace('oz', '')
df['size_oz']=df['size_oz'].str.strip()

In [133]:
#Convert into numeric type
df['size_oz']=pd.to_numeric(df['size_oz'])
df['size_oz'].dtype

Correct the errors spotted earlier by changing each entry individually  

In [136]:
#Manually correct the multi-pack entries: total size = number of units x size per unit
#regex=False so that '.' is matched literally
df.loc[df['size'].str.contains('28 x 0.01 oz', regex=False), 'size_oz']=0.28
df.loc[df['size'].str.contains('10 x 2.27 oz', regex=False), 'size_oz']=22.7
df.loc[df['size'].str.contains('50 x 0.04 oz', regex=False), 'size_oz']=2
df.loc[df['size'].str.contains('7 Ampoules', regex=False), 'size_oz']=0.4375
df.loc[df['size'].str.contains('Four 0.27 oz', regex=False), 'size_oz']=1.08
df.loc[df['size'].str.contains('2 x 1.7 oz', regex=False), 'size_oz']=3.4
df.loc[df['size'].str.contains('15 mL x3', regex=False), 'size_oz']=1.5
df.loc[df['size'].str.contains('2 x 1 oz', regex=False), 'size_oz']=2
df.loc[df['size'].str.contains('12 x 0.08 oz', regex=False), 'size_oz']=0.96
df.loc[df['size'].str.contains('3 x 1.7 oz', regex=False), 'size_oz']=5.1
df.loc[df['size'].str.contains('2 x 2.02 oz', regex=False), 'size_oz']=4.04

In [137]:
#Inspect the output
df[['size_ml', 'size_oz', 'size_g']]

Unnamed: 0,size_ml,size_oz,size_g
0,150 mL,5.07,
2,150 mL,5.10,
3,50 mL,1.70,
4,125 mL,4.20,
5,193 mL,6.50,
...,...,...,...
213,10 mL,0.34,
214,15 mL,0.50,
215,15 mL,0.50,
216,15 mL,0.50,


In [138]:
#Inspect missing values for size and fill them out using other info where possible -- 67 missing entries
df[df['size_oz'].isnull()]

Unnamed: 0,brand,name,about the product,weblink,sub_category,main_category,num_likes,img_link,price,size,ingredients,rating,num_reviews,highlights,product_name,sensitive,combination,oily,normal,dry,clean,cruelty-free,vegan,skin_concerns,excl_ingr,best for,acids,award,size_oz,size_ml,size_g
31,Kopari,Coconut Melt Wipes,"[What it is: , A pack of 20 coconut oil multi...",https://www.sephora.com/product/water-drench-h...,Face Wipes,cleansers,2300,https://www.sephora.com/productimages/sku/s225...,20.0,20 Biodegradable Wipes,"[-Coconut Oil: A super oil that nurtures, hydr...",4.5,15,"[Clean at Sephora, Cruelty-Free]",KopariCoconut Melt Wipes,1,1,1,1,1,1,1,0,,,,,,,,
71,SEPHORA COLLECTION,Cleansing Wipes - Rose - Moisturizing,[undefined],https://www.sephora.com/product/mario-badescu-...,Face Wipes,cleansers,0,https://www.sephora.com/productimages/sku/s180...,8.0,25 Wipes,"[ -Natural Rose Extract.\n, \n, Water, Capryl...",4.5,84,"[Best for Dry, Combo, Normal Skin, Good for: D...",SEPHORA COLLECTIONCleansing Wipes - Rose - Moi...,0,0,0,0,0,1,0,0,[Dryness],,"[Dry, Combo, Normal Skin]",,,,,
73,SEPHORA COLLECTION,Mini Triple Action Cleansing Water - Cleanse +...,"[What it is: , A cleansing water, formulated ...",https://www.sephora.com/product/mario-badescu-...,Mini Size,cleansers,3300,https://www.sephora.com/productimages/sku/s216...,7.0,1.69 fl oz/ 50 mL,[ -Zinc of Natural Origin: Known to purify the...,4.0,14,"[Best for Dry, Combo, Normal Skin, Good for: P...",SEPHORA COLLECTIONMini Triple Action Cleansing...,0,1,1,1,1,1,0,1,"[Pores , Acne/Blemishes]",,"[Dry, Combo, Normal Skin]",,,,50 mL,
95,SEPHORA COLLECTION,Triple Action Cleansing Water,"[What it is: , A quick-acting, no-rinse clean...",https://www.sephora.com/product/lancer-method-...,Face Wash & Cleansers,cleansers,0,https://www.sephora.com/productimages/sku/s216...,12.0,200ml / 6.76 fl oz,[-Calendula Extract: Known to soothe the skin....,4.5,30,[],SEPHORA COLLECTIONTriple Action Cleansing Water,0,1,1,1,1,0,0,0,,,,,,,200ml,
118,SEPHORA COLLECTION,Waterprooof Eye Makeup Remover,"[What it is: , A waterproof eye makeup remove...",https://www.sephora.com/product/innisfree-volc...,Makeup Removers,cleansers,1500,https://www.sephora.com/productimages/sku/s216...,7.0,4.22 fl oz/ 125 mL,[ -Cornflower Floral Water from Natural Origin...,3.0,72,"[Clean at Sephora, Vegan]",SEPHORA COLLECTIONWaterprooof Eye Makeup Remover,0,1,1,1,1,1,0,1,,,,,,,125 mL,
130,BeautyBio,GloPRO® Prep Pads Clarifying Skin Cleansing Wi...,"[What it is: , A skin-clarifying alternative ...",https://www.sephora.com/product/mega-mushroom-...,Face Wipes,cleansers,3200,https://www.sephora.com/productimages/sku/s216...,35.0,30 pads,[ -SteriGLO™ Complex: Proprietary peptide tech...,3.5,13,"[Clean at Sephora, Good for: Dullness/Uneven T...",BeautyBioGloPRO® Prep Pads Clarifying Skin Cle...,0,1,1,1,1,1,0,0,[Dullness/Uneven Texture],,,,,,,
136,Dr. Dennis Gross Skincare,DRx Acne Eliminating Pads,"[What it is:, A daily acne treatment featuring...",https://www.sephora.com/product/redness-soluti...,Exfoliators,cleansers,13700,https://www.sephora.com/productimages/sku/s149...,42.0,45 Treatments,"[-Salicylic Acid 2%., Water, Alcohol Denat. (S...",4.0,282,"[Clean at Sephora, Good for: Acne/Blemishes, S...",Dr. Dennis Gross SkincareDRx Acne Eliminating ...,0,0,0,0,0,1,0,0,[Acne/Blemishes],,,[Salicylic Acid],,,,
141,NuFACE,Prep-N-Glow™ Cloths,"[Which skin type is it good for?, \n, ✔ Normal...",https://www.sephora.com/product/bear-naked-wip...,Facial Peels,cleansers,12500,https://www.sephora.com/productimages/sku/s189...,20.0,20 Individually Packed Cloths,"[ -Hyaluronic Acid: Hydrates the skin.\n, \n...",4.5,174,"[Good for: Anti-Aging, Good for: Dullness/Unev...",NuFACEPrep-N-Glow™ Cloths,1,1,1,1,1,0,0,0,"[Anti-Aging, Dullness/Uneven Texture]",[Sulfates SLS & SLES],,[Hyaluronic Acid],,,,
145,Josie Maran,Bear Naked Wipes,"[Which skin type is it good for?, ✔ Normal, ✔ ...",https://www.sephora.com/product/bear-naked-wip...,Makeup Removers,cleansers,13400,https://www.sephora.com/productimages/sku/s112...,12.0,30 Wipes,[ -100% Pure Argan Oil: Repairs and replenishe...,4.0,1300,[],Josie MaranBear Naked Wipes,1,1,1,1,1,0,0,0,,,,,,,,
147,SEPHORA COLLECTION,Triple Action Cleansing Water,"[What it is: , A quick-acting, no-rinse clean...",https://www.sephora.com/product/coconut-rose-t...,Face Wash & Cleansers,cleansers,33800,https://www.sephora.com/productimages/sku/s216...,12.0,200ml / 6.76 fl oz,[-Calendula Extract: Known to soothe the skin....,4.0,1200,[],SEPHORA COLLECTIONTriple Action Cleansing Water,0,1,1,1,1,0,0,0,,,,,,,200ml,


In [139]:
#Inspecting each row
crit=df.size_oz.isnull()& df.size_ml.isnull()
df.loc[crit, ['size', 'size_oz', 'size_ml', 'size_g']]

Unnamed: 0,size,size_oz,size_ml,size_g
31,20 Biodegradable Wipes,,,
71,25 Wipes,,,
130,30 pads,,,
136,45 Treatments,,,
141,20 Individually Packed Cloths,,,
145,30 Wipes,,,
172,30 facial cleansing cloths,,,
254,40 sheets,,,
335,"10 wipes - 7"" x 10""",,,
363,"21 wipes - 7"" x 10""",,,


In [140]:
df.loc[df['size'].str.contains('6 Patches x 3 g'), 'size_oz']=0.106

In [141]:
#Other rows need to be dropped as there is no data available 
df.shape

(1586, 31)

In [142]:
df.drop(df[crit].index, inplace=True)

In [143]:
df.shape

(1426, 31)

In [144]:
#There may be entries for ml or g that we can use to compute oz
crit2=df.size_oz.isnull()& (df.size_ml.notnull() | df.size_g.notnull())
df.loc[crit2, ['size', 'size_oz', 'size_ml', 'size_g']]

Unnamed: 0,size,size_oz,size_ml,size_g
73,1.69 fl oz/ 50 mL,,50 mL,
95,200ml / 6.76 fl oz,,200ml,
147,200ml / 6.76 fl oz,,200ml,
302,1.69 fl oz/ 50 mL,,50 mL,
188,1 fl oz/ 30mL,,30mL,
221,100 mL,,100 mL,
299,30 ml/ 1.01 fl oz\t\t\t\t\t\t,,30 ml,
570,0.34fl oz/10mL,,10mL,
38,1.69 fl oz/ 50 mL,,50 mL,
377,1.69 fl oz/ 50 mL,,50 mL,


In [145]:
subset_ml=df.loc[crit2, 'size_ml']

In [146]:
#Clean the subset of size_ml required to calculate size_oz
subset_ml=subset_ml.str.lower().str.strip(' ml').astype('float64')

In [147]:
#Using 1 ml = 0.033814 fl oz
df.loc[crit2,'size_oz']=subset_ml*0.033814

In [148]:
#Inspect the output
df.loc[crit2, ['size', 'size_oz', 'size_ml', 'size_g']]

Unnamed: 0,size,size_oz,size_ml,size_g
73,1.69 fl oz/ 50 mL,1.6907,50 mL,
95,200ml / 6.76 fl oz,6.7628,200ml,
147,200ml / 6.76 fl oz,6.7628,200ml,
302,1.69 fl oz/ 50 mL,1.6907,50 mL,
188,1 fl oz/ 30mL,1.01442,30mL,
221,100 mL,3.3814,100 mL,
299,30 ml/ 1.01 fl oz\t\t\t\t\t\t,1.01442,30 ml,
570,0.34fl oz/10mL,0.33814,10mL,
38,1.69 fl oz/ 50 mL,1.6907,50 mL,
377,1.69 fl oz/ 50 mL,1.6907,50 mL,
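Note that `str.strip(' ml')` removes any of the characters space, `m` and `l` from both ends of the string rather than the literal suffix. That works for the values shown above, but a regular expression states the intent more explicitly. A minimal sketch, assuming the size strings look like the samples in this section (`ml_to_oz` is a hypothetical helper, not part of the notebook):

```python
import re

ML_TO_OZ = 0.033814  # same conversion factor as used above

def ml_to_oz(size_text):
    """Return the size in fl oz parsed from text such as '200ml' or '50 mL', else None."""
    # First number that is directly followed by 'ml' (optional whitespace in between)
    m = re.search(r'(\d+(?:\.\d+)?)\s*ml', size_text.lower())
    if m is None:
        return None
    return float(m.group(1)) * ML_TO_OZ

print(ml_to_oz('200ml'))
print(ml_to_oz('1.69 fl oz/ 50 mL'))
print(ml_to_oz('30 pads'))  # no millilitre value, so None
```

Because the pattern looks for a number followed by `ml`, it can be applied to the raw `size` text as well as to the already extracted `size_ml` values.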


**9) Derive a feature for price adjusted for the size or volume of the product**

In [149]:
#Create potential target variable (vs. price)
df['pricepervol']=df['price']/df['size_oz']
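
As a quick check of the arithmetic, the sketch below computes the price per ounce on a toy frame. The first row mirrors the 200 ml cleansing water priced at 12.0 shown earlier; the second row is made up.

```python
import pandas as pd

# Toy data: row 0 mirrors a product from the output above, row 1 is hypothetical
toy = pd.DataFrame({'price': [12.0, 42.0], 'size_oz': [6.7628, 1.0]})

# Element-wise division gives the price per fluid ounce
toy['pricepervol'] = toy['price'] / toy['size_oz']
print(toy)
```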

**10) Create new feature based on highlighted ingredients**

In [None]:
#Inspect
df.iloc[92, 2]

In [151]:
#Extract the name of the ingredients highlighted in the 'about the product' section.
#Upon inspection, these ingredients follow a similar pattern so the approach will use regex

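import re

def find_in_list(regex, list_):
    '''
    Searches each item of the list for the pattern and returns a list of the
    first match per item; returns None if nothing matches
    '''
    parsed=[]
    for line in list_:
        try:
            m=re.search(regex, line)
        except TypeError:
            #Skip items that are not strings
            continue
        #re.search returns None when the pattern is not found
        if m:
            parsed.append(m.group(0))

    return parsed if parsed else None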

In [152]:
df['highlighted_ingr']=df['about the product'].apply(lambda x: find_in_list('(?<=- ).*?(?=:)', x))
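
To show what the pattern captures, here is a small sketch on made-up lines written in the style of an 'about the product' entry. The lookbehind requires a dash followed by a space, so lines without that exact prefix are not matched.

```python
import re

pattern = r'(?<=- ).*?(?=:)'

# Hypothetical lines, not taken from the dataset
lines = [
    '- Hyaluronic Acid: Hydrates the skin.',
    '-Vitamin C: Brightens.',   # no space after the dash, so the lookbehind fails
    'What it is:',              # no leading "- ", so no match
]

# Keep the matched text for every line where the pattern is found
matches = [m.group(0) for m in (re.search(pattern, line) for line in lines) if m]
print(matches)
```

If ingredient names in the data are sometimes written without the space after the dash, the pattern would need to be relaxed accordingly.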

**11) Create a new feature capturing whether clinical results are published in the about section**

In [153]:
df['clinical_results']=df['about the product'].apply(lambda x: search_in_list('results',x))

**12) Extract formulation type for each ingredient and capture this in a separate feature(s)**

In [154]:
#Creating custom function to capture type of formulation specified in 'about' section

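def extract_next(phrase, list_):
    '''
    Searches for phrase in the list and returns the item that follows the first match;
    returns None if the phrase is not found or nothing follows the match
    '''
    for position, highlight in enumerate(list_):
        if phrase in highlight and position+1<len(list_):
            return list_[position+1]
    return None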

In [155]:
#Testing if it works
extract_next('Formulation:', df.iloc[19, 2])

' Cream'

In [156]:
#Create new column extracting the formulation
df['formulation']=df['about the product'].apply(lambda x: extract_next('Formulation:', x))

In [157]:
#Inspect the output
df['formulation'].value_counts()

 Cream                                           75
Cream                                            64
Lightweight Serum                                63
Lightweight Liquid                               43
Lightweight Cream                                38
 Gel                                             31
Gel                                              28
Serum                                            27
Lightweight Serum\n                              22
 Serum                                           21
 Liquid                                          21
 Lightweight Serum                               21
Lightweight Gel                                  19
 Lightweight Cream                               19
Rich Cream                                       17
 Lightweight Gel                                 16
 Lightweight Serum                               15
 Lightweight Liquid                              15
Cream\n                                          14
Oil         

The output needs cleaning (whitespace, \n, capitalisation) and consolidation.
The values appear to encode both the richness of the formula (lightweight or rich) and the formulation itself (e.g. oil, serum, lotion), so after cleaning, two features will be created to capture this information.
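
A minimal sketch of the cleaning on a toy series shows how the variants collapse into a single label once whitespace and case are normalised. The values below imitate the raw labels in the output above.

```python
import pandas as pd

# Toy labels imitating the raw variants listed above
raw = pd.Series([' Cream', 'Cream', 'Cream\n', ' Lightweight Serum', 'Lightweight Serum\n'])

# str.strip() with no argument removes leading and trailing whitespace, including newlines
clean = raw.str.strip().str.lower()
print(clean.value_counts())
```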

In [159]:
#Cleaning the formulation column
df['formulation']=df['formulation'].str.lstrip().str.rstrip('\n')

In [160]:
df['formulation']=df['formulation'].str.rstrip().str.lower()

In [162]:
df['formulation'].value_counts()

cream                                            167
lightweight serum                                138
lightweight liquid                                86
gel                                               77
lightweight cream                                 74
lightweight gel                                   54
serum                                             54
liquid                                            47
oil                                               41
rich cream                                        37
lightweight lotion                                31
scrub                                             20
lightweight oil                                   17
mask                                              14
lotion                                            12
rich serum                                         8
spray                                              6
rich lotion                                        5
lightweight mask                              

The following categories will be used:
- richness - light, normal, rich/heavy
- formulation - lotion, cream, serum, liquid, gel, oil, others

In [165]:
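def find_and_encode2(mapping, default, row):
    '''
    Takes in a dictionary mapping search terms to the value to return when a term is found in row;
    the first matching term (in dictionary order) wins, and the default is returned if no term
    is found or if row cannot be searched (e.g. NaN)
    '''
    try:
        for term, value in mapping.items():
            if term in row:
                return value
    except TypeError:
        #row is not a string or container (e.g. NaN)
        pass
    return default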

In [None]:
#Creating the formulation column
formulation_map={'lotion': 'lotion', 'cream': 'cream', 'serum':'serum', 'liquid':'liquid', 'gel':'gel', 'oil':'oil'}

In [166]:
df['formulation_type']=df['formulation'].apply(lambda x: find_and_encode2(formulation_map, 'others', x))

In [167]:
df['formulation_type'].value_counts()

others    558
cream     288
serum     202
liquid    135
gel       132
oil        63
lotion     48
Name: formulation_type, dtype: int64

In [168]:
#Creating richness column
formula_map={'light': 'light', 'creamy': 'heavy', 'rich': 'heavy', 'heavy': 'heavy'}

In [169]:
df['richness']=df['formulation'].apply(lambda x: find_and_encode2(formula_map, 'normal', x))

### Check each column of interest

- product_name
- brand
- sub_category
- main_category
- rating
- num_reviews
- sensitive
- combination
- oily
- normal
- dry
- clean
- acids
- award
- cruelty-free
- vegan
- skin_concerns
- excl_ingr
- best for
- highlighted_ingr
- clinical_results
- formulation_type
- richness


- price
- pricepervol

In [171]:
df.sub_category.value_counts()

Face Serums                  337
Moisturizers                 311
Face Wash & Cleansers        190
Eye Creams & Treatments      153
Toners                        78
Face Oils                     71
Mists & Essences              69
Exfoliators                   33
Facial Peels                  32
Face Masks                    25
Blemish & Acne Treatments     24
Makeup Removers               17
Night Creams                  17
Mini Size                     10
Face Sunscreen                 8
Face Primer                    8
Setting Spray & Powder         7
Decollete & Neck Creams        7
Body Lotions & Body Oils       5
Moisturizer & Treatments       4
Eye Masks                      3
BB & CC Creams                 2
Face Wash                      2
For Face                       2
Skincare                       2
Makeup                         1
Scrub & Exfoliants             1
BB & CC Cream                  1
Eye Primer                     1
Eye Cream                      1
Highlighte

In [172]:
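#Override subcategory for 'moisturizer and treatments' with the main category
#(assign a Series, not a one-column DataFrame, so pandas does not align on mismatched column labels)
is_moist_treat=df.sub_category=='Moisturizer & Treatments'
df.loc[is_moist_treat,'sub_category']=df.loc[is_moist_treat,'main_category']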
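
When the right-hand side of a `.loc` assignment is a DataFrame, pandas aligns it on column labels as well as on the index, so copying values between differently named columns is safer with a Series on each side. A small sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical rows for illustration
toy = pd.DataFrame({
    'sub_category': ['Moisturizer & Treatments', 'Toners'],
    'main_category': ['moisturizers', 'toners'],
})

mask = toy['sub_category'] == 'Moisturizer & Treatments'

# Series on both sides: values are matched on the row index only
toy.loc[mask, 'sub_category'] = toy.loc[mask, 'main_category']
print(toy)
```

The recorded `df.info()` output further down lists 1422 non-null `sub_category` values out of 1426, which would be consistent with four rows having been set to NaN in the original run; those rows are later filled from `main_category` when the product type is derived.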

In [173]:
df['sub_category']=df.sub_category.str.lower()

In [174]:
df.iloc[170, 4]

'face wash & cleansers'

In [175]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1426 entries, 0 to 217
Data columns (total 37 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   brand              1426 non-null   object 
 1   name               1426 non-null   object 
 2   about the product  1426 non-null   object 
 3   weblink            1426 non-null   object 
 4   sub_category       1422 non-null   object 
 5   main_category      1426 non-null   object 
 6   num_likes          1426 non-null   int64  
 7   img_link           1426 non-null   object 
 8   price              1426 non-null   float64
 9   size               1426 non-null   object 
 10  ingredients        1426 non-null   object 
 11  rating             1426 non-null   float64
 12  num_reviews        1426 non-null   int64  
 13  highlights         1426 non-null   object 
 14  product_name       1426 non-null   object 
 15  sensitive          1426 non-null   int64  
 16  combination        1426 n

In [176]:
category_mapping={
    'exfoli': 'exfoliators and peels',
    'peels': 'exfoliators and peels',
    'eye': 'eye creams and treatments',
    'oils': 'essences, serums and treatments',
    'serum': 'essences, serums and treatments',
    'wash': 'face wash and cleansers',
    'essence': 'essences, serums and treatments',
    'masks': 'essences, serums and treatments',
    'blemish': 'essences, serums and treatments',
    'treatment': 'essences, serums and treatments',
    'creams': 'moisturizers and creams',
    'toners': 'toners',
    'moisturizers': 'moisturizers and creams',
    'cleansers': 'face wash and cleansers'
}

In [177]:
df['product_type']=df['sub_category'].apply(lambda x: find_and_encode2(category_mapping, 'others', x))

In [178]:
df['product_type'].value_counts()

essences, serums and treatments    532
moisturizers and creams            337
face wash and cleansers            193
eye creams and treatments          158
toners                              78
exfoliators and peels               66
others                              62
Name: product_type, dtype: int64
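
Because the lookup returns on the first key it finds, the order of the keys in `category_mapping` matters whenever a sub-category contains more than one search term. A minimal sketch with a two-key excerpt of the mapping (`first_match` is a hypothetical stand-in for `find_and_encode2`):

```python
def first_match(mapping, default, text):
    """Return the value of the first key found in text, else default."""
    for term, value in mapping.items():
        if term in text:
            return value
    return default

# Excerpt of the mapping above; dicts preserve insertion order
eye_first = {'eye': 'eye creams and treatments', 'creams': 'moisturizers and creams'}
creams_first = {'creams': 'moisturizers and creams', 'eye': 'eye creams and treatments'}

label = 'eye creams & treatments'
print(first_match(eye_first, 'others', label))
print(first_match(creams_first, 'others', label))
print(first_match(eye_first, 'others', 'face sunscreen'))
```

Keeping the more specific terms (such as 'eye') ahead of the broader ones (such as 'creams') is what sends eye creams to the intended group.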

In [179]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1426 entries, 0 to 217
Data columns (total 38 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   brand              1426 non-null   object 
 1   name               1426 non-null   object 
 2   about the product  1426 non-null   object 
 3   weblink            1426 non-null   object 
 4   sub_category       1422 non-null   object 
 5   main_category      1426 non-null   object 
 6   num_likes          1426 non-null   int64  
 7   img_link           1426 non-null   object 
 8   price              1426 non-null   float64
 9   size               1426 non-null   object 
 10  ingredients        1426 non-null   object 
 11  rating             1426 non-null   float64
 12  num_reviews        1426 non-null   int64  
 13  highlights         1426 non-null   object 
 14  product_name       1426 non-null   object 
 15  sensitive          1426 non-null   int64  
 16  combination        1426 n

In [180]:
#Change others to the main category
df.loc[df.product_type=='others','product_type']=df.loc[df.product_type=='others','main_category']

In [181]:
df['product_type']=df['product_type'].apply(lambda x: find_and_encode2(category_mapping, 'others', x))

23  skin_concerns      748 non-null    object 
 24  excl_ingr          286 non-null    object 
 25  best for           457 non-null    object 
 26  acids              1426 non-null   object 
 27  award              1426 non-null   object 
 28  size_oz            1426 non-null   float64
 29  size_ml            1271 non-null   object 
 30  size_g             70 non-null     object 
 31  pricepervol        1426 non-null   float64
 32  highlighted_ingr   898 non-null    object 
 33  clinical_results   1426 non-null   int64  
 34  formulation        934 non-null    object 
 35  formulation_type   1426 non-null   object 
 36  richness           1426 non-null   object 

__________________________________________________________________________________________

## 2.0 The dataset to be used

In [185]:
df.columns

Index(['brand', 'name', 'about the product', 'weblink', 'sub_category',
       'main_category', 'num_likes', 'img_link', 'price', 'size',
       'ingredients', 'rating', 'num_reviews', 'highlights', 'product_name',
       'sensitive', 'combination', 'oily', 'normal', 'dry', 'clean',
       'cruelty-free', 'vegan', 'skin_concerns', 'excl_ingr', 'best for',
       'acids', 'award', 'size_oz', 'size_ml', 'size_g', 'pricepervol',
       'highlighted_ingr', 'clinical_results', 'formulation',
       'formulation_type', 'richness', 'product_type'],
      dtype='object')

In [186]:
columns_required= [
    'brand', 
    'product_name',
    'product_type',
    'num_likes', 
    'ingredients',
    'rating', 
    'num_reviews', 
    'sensitive', 
    'combination', 
    'oily', 
    'normal', 
    'dry', 
    'clean',
    'cruelty-free',
    'vegan', 
    'skin_concerns', 
    'excl_ingr', 
    'best for',
    'acids',
    'award', 
    'pricepervol',
    'highlighted_ingr',
    'clinical_results',
    'formulation_type', 
    'richness']

In [191]:
#Take a copy so that later edits to data do not act on a view of df
data=df[columns_required].copy()

In [215]:
#Rename columns
new_colnames= [
    'brand', 
    'product_name',
    'product_type',
    'num_likes', 
    'ingredients',
    'rating', 
    'num_reviews', 
    'sensitive_type', 
    'combination_type', 
    'oily_type', 
    'normal_type', 
    'dry_type', 
    'clean_sephora',
    'cruelty_free',
    'vegan', 
    'skin_concerns', 
    'excl_ingr', 
    'best_for_skintype',
    'acids',
    'award', 
    'pricepervol',
    'highlighted_ingr',
    'clinical_results',
    'formulation_type', 
    'richness']

In [216]:
columns_map=dict(zip(columns_required, new_colnames))
data=data.rename(columns=columns_map)
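
The select-then-rename step can be sketched on a toy frame as follows. Taking an explicit copy and using the frame returned by `rename` avoids modifying a slice of the original frame in place. The column values are made up.

```python
import pandas as pd

# Toy frame with three of the original column names
toy = pd.DataFrame({'best for': [['Dry Skin']], 'cruelty-free': [0], 'vegan': [1]})
old_names = ['best for', 'cruelty-free', 'vegan']
new_names = ['best_for_skintype', 'cruelty_free', 'vegan']

# .copy() gives an independent frame; rename returns a new frame with the mapped labels
subset = toy[old_names].copy().rename(columns=dict(zip(old_names, new_names)))
print(list(subset.columns))
print(list(toy.columns))  # the source frame keeps its original names
```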

In [203]:
data.shape

(1426, 26)

<font color='red'> BELOW STILL DRAFT </font> 

## 3.0 Inspecting and refining the selected dataset

Additional steps taken:
- dropping duplicates
- addressing missing values
- checking distributions for possible outliers

In [202]:
#Check for duplicates
duplicates= data.duplicated(subset=['product_name'], keep=False)
data[duplicates].sort_values('product_name')

Unnamed: 0,index,brand,product_name,product_type,num_likes,ingredients,rating,num_reviews,sensitive,combination,oily,normal,dry,clean,cruelty-free,vegan,skin_concerns,excl_ingr,best for,acids,award,pricepervol,highlighted_ingr,clinical_results,formulation_type,richness
258,338,AMOREPACIFIC,AMOREPACIFICTreatment Cleansing Oil Makeup Rem...,"essences, serums and treatments",8300,"[ -Botanical Oil Blend (Green Tea Seed, Babass...",4.0,315,1,1,1,1,1,0,0,0,"[Dryness, Hydrating, Dullness/Uneven Texture]","[Parabens , Sulfates SLS & SLES]","[Dry, Combo, Normal Skin]",,,7.352941,,0,others,normal
1076,419,AMOREPACIFIC,AMOREPACIFICTreatment Cleansing Oil Makeup Rem...,"essences, serums and treatments",8300,"[ -Botanical Oil Blend (Green Tea Seed, Babass...",4.0,315,1,1,1,1,1,0,0,0,"[Dryness, Hydrating, Dullness/Uneven Texture]","[Parabens , Sulfates SLS & SLES]","[Dry, Combo, Normal Skin]",,,7.352941,,0,others,normal
997,312,Algenist,AlgenistAdvanced Anti-Aging Repairing Oil,"essences, serums and treatments",9100,[ -Microalgae Oil: Improves the appearance of ...,4.5,240,1,1,1,1,1,0,0,0,,,,,,82.0,,1,others,normal
748,569,Algenist,AlgenistAdvanced Anti-Aging Repairing Oil,"essences, serums and treatments",9100,[ -Microalgae Oil: Improves the appearance of ...,4.5,240,1,1,1,1,1,0,0,0,,,,,,82.0,,1,others,normal
367,68,Algenist,AlgenistELEVATE Advanced Lift Contouring Cream,moisturizers and creams,2700,[ -Alguronic Acid: Provides anti-aging and ant...,4.5,37,1,1,1,1,1,0,0,0,,,,,,48.0,,0,others,normal
1108,455,Algenist,AlgenistELEVATE Advanced Lift Contouring Cream,moisturizers and creams,2700,[ -Alguronic Acid: Provides anti-aging and ant...,4.5,37,1,1,1,1,1,0,0,0,,,,,,48.0,,0,others,normal
231,297,Algenist,AlgenistHydrating Essence Toner,"essences, serums and treatments",10900,[ -Alguronic Acid: Helps create a more youthfu...,4.5,171,0,0,0,0,0,0,0,0,,,,,,5.0,,0,others,normal
1059,393,Algenist,AlgenistHydrating Essence Toner,"essences, serums and treatments",10900,[ -Alguronic Acid: Helps create a more youthfu...,4.5,171,0,0,0,0,0,0,0,0,,,,,,5.0,,0,others,normal
1321,77,Algenist,AlgenistPOWER Advanced Wrinkle Fighter 360° Ey...,eye creams and treatments,8000,"[Water, Dimethicone, Isododecane, Glycerin, Bu...",4.0,81,1,1,1,1,1,0,0,0,,,,,,140.0,,1,others,normal
459,199,Algenist,AlgenistPOWER Advanced Wrinkle Fighter 360° Ey...,eye creams and treatments,8000,"[Water, Dimethicone, Isododecane, Glycerin, Bu...",4.0,81,1,1,1,1,1,0,0,0,,,,,,140.0,,1,others,normal


In [207]:
data=data.drop_duplicates(subset=['product_name'], keep='first')
data.shape

(1338, 26)
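
The two `keep` settings used above behave differently, which the following toy example illustrates. The product names are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'product_name': ['A Cream', 'B Serum', 'A Cream'],
    'rating': [4.0, 4.5, 4.0],
})

# keep=False flags every member of a duplicate group, which is handy for inspection
flagged = toy.duplicated(subset=['product_name'], keep=False)
print(flagged.tolist())

# keep='first' retains the first row of each group and drops the rest
deduped = toy.drop_duplicates(subset=['product_name'], keep='first')
print(deduped.shape)
```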

In [218]:
#Checking for missing values
data.isna().sum()

index                   0
brand                   0
product_name            0
product_type            0
num_likes               0
ingredients             0
rating                  0
num_reviews             0
sensitive_type          0
combination_type        0
oily_type               0
normal_type             0
dry_type                0
clean_sephora           0
cruelty_free            0
vegan                   0
skin_concerns         643
excl_ingr            1066
best_for_skintype     905
acids                1056
award                1215
pricepervol             0
highlighted_ingr      491
clinical_results        0
formulation_type        0
richness                0
dtype: int64

Missing values

- skin_concerns: 643 (approach TBD)
- excl_ingr: 1066 (approach TBD)
- best_for_skintype: 905 (approach TBD)
- acids: 1056
- award: 1215
- highlighted_ingr: 491

In [212]:
#Inspect the values of each column with missing entries before deciding how to impute
data.skin_concerns.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Dryness]                                                                      74
[Dullness/Uneven Texture]                                                      49
[Anti-Aging]                                                                   42
[Acne/Blemishes]                                                               36
[Pores ]                                                                       35
[Hydrating]                                                                    24
[Anti-Aging, Dullness/Uneven Texture]                                          18
[Dryness, Dullness/Uneven Texture]                                             17
[Anti-Aging, Loss of firmness]                                                 17
[Loss of firmness]                                                             16
[Acne/Blemishes, Pores ]                                                       14
[Dark Circles]                                                                 12
[Dark spots]    
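
The `unhashable type: 'list'` messages appear because each cell in these columns holds a Python list. One way to count individual values instead of whole lists, sketched here on a toy series, is to expand the lists with `explode` first. The values imitate the output above, including the trailing space in 'Pores '.

```python
import numpy as np
import pandas as pd

# Toy series of list-valued cells with one missing entry
toy = pd.Series([['Dryness'], ['Pores ', 'Dryness'], np.nan, ['Pores ']])

# explode gives one row per list element; strip removes stray spaces such as 'Pores '
counts = toy.explode().str.strip().value_counts()
print(counts)
```

This counts each concern once per product it appears in, so the totals differ from the per-combination counts shown above.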

In [213]:
data.excl_ingr.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Parabens ]                                                 60
[Fragrance]                                                 51
[Sulfates SLS & SLES]                                       24
[Sulfates SLS & SLES, Parabens , Phthalates ]               20
[Parabens , Sulfates SLS & SLES]                            18
[Parabens , Phthalates ]                                    16
[Silicones]                                                 15
[Sulfates SLS & SLES, Parabens ]                             7
[Fragrance, Parabens ]                                       6
[Sulfates SLS & SLES, Phthalates , Parabens ]                5
[Phthalates ]                                                4
[Mineral Oil ]                                               4
[Sulfates SLS & SLES, Mineral Oil ]                          4
[Parabens , Sulfates SLS & SLES, Phthalates ]                3
[Fragrance, Sulfates SLS & SLES]                             3
[Parabens , Phthalates , Sulfates SLS & SLES]          

In [219]:
data.best_for_skintype.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Dry, Combo, Normal Skin]                              242
[Oily, Combo, Normal Skin]                             127
[Dry Skin]                                              23
[Normal Skin]                                           16
[Oily Skin]                                             12
[Combination Skin]                                       6
[Dry, Combo, Normal Skin, Oily, Combo, Normal Skin]      3
[Oily, Combo, Normal Skin, Dry, Combo, Normal Skin]      2
[Oily Skin, Combination Skin]                            2
Name: best_for_skintype, dtype: int64

In [220]:
data.acids.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Hyaluronic Acid]                       102
[Vitamin C]                              68
[Salicylic Acid]                         45
[AHA/Glycolic Acid]                      38
[AHA/Glycolic Acid, Salicylic Acid]       7
[Vitamin C, Hyaluronic Acid]              5
[Hyaluronic Acid, Vitamin C]              4
[AHA/Glycolic Acid, Hyaluronic Acid]      3
[Salicylic Acid, AHA/Glycolic Acid]       3
[Vitamin C, AHA/Glycolic Acid]            2
[Salicylic Acid, Vitamin C]               2
[Hyaluronic Acid, AHA/Glycolic Acid]      1
[Salicylic Acid, Hyaluronic Acid]         1
[AHA/Glycolic Acid, Vitamin C]            1
Name: acids, dtype: int64

In [222]:
data.award.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Community favorite]                   94
[allure winner]                        20
[allure winner, Community favorite]     9
Name: award, dtype: int64

In [223]:
data.highlighted_ingr.value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas/_libs/hashtable_class_helper.pxi", line 4588, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[Jeju Green Tea Extract, Jeju Green Tea Seed Oil, Panthenol]                                                         3
[Compound K, Ginsenoside Re]                                                                                         3
[Reishi]                                                                                                             3
[Hydrogen Bio Water™, Dead Sea Salt, Coconut Water]                                                                  3
[Pitera™, InfinitPower Technology]                                                                                   3
                                                                                                                    ..
[Lactic Acid, Vitamin E, Chamomile Extract]                                                                          1
[Japanese Purple Rice, Okinawa Algae Blend and Hyaluronic Acid, Botanical Extracts]                                  1
[Salicylic Acid (BHA) 0.5%, Arginine, Provitamin

In [None]:
#Check for any outliers etc
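
The outlier check is still to be written. One common option, shown here only as a sketch on made-up numbers, is the 1.5 x IQR rule applied to a numeric column such as `pricepervol`. Both the rule and the threshold are assumptions, not decisions taken in this notebook.

```python
import pandas as pd

# Made-up values standing in for a numeric column
values = pd.Series([5.0, 7.4, 48.0, 82.0, 140.0, 2500.0], name='pricepervol')

# Interquartile range and the usual 1.5 x IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(values[is_outlier])
```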

In [194]:
#Reset index
data.reset_index(inplace=True)

In [195]:
#Save as json or csv for next step 

RangeIndex(start=0, stop=1426, step=1)

In [1]:
df.product_type.value_counts()

NameError: name 'df' is not defined

In [193]:
profile=ProfileReport(data, title='Profiling Report', explorative=True)
profile.to_widgets()


## 4.0 Saving the dataset