# Berland Team Junior Data Scientist Assignment

## Analysis of Sephora Web Scraping

Hayley Caddes

The data was obtained using the Python packages Scrapy and Selenium.

View the whole notebook, including the interactive Plotly plots, at <a href='http://nbviewer.jupyter.org/github/hkcaddes/berlandteam_assignment/blob/master/sephora_analysis.ipynb'>this link</a>.

#### Import libraries

In [1]:
import pandas as pd 
import numpy as np
import re
import ast
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

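# import plotly.plotly as py  # moved to the separate chart_studio package in plotly 4; only needed for the commented-out py.iplot call below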
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import colorlover as cl

#### Load datasets

In [2]:
columns1 = ['Brand', 'Item', 'ProductId', 'Price', 'OverallStars']
products = pd.read_csv('products1.csv', encoding='utf-8', header=None)
products.columns = columns1
# strip the dollar sign so Price can be converted to a number later
products['Price'] = products['Price'].replace({r'\$': ''}, regex=True)

In [3]:
products.head()

Unnamed: 0,Brand,Item,ProductId,Price,OverallStars
0,DRUNK ELEPHANT,Protini™ Polypeptide Cream,p427421,68.0,4.4669
1,DRUNK ELEPHANT,C-Firma Day Serum,p400259,80.0,4.0507
2,SUNDAY RILEY,Good Genes All-In-One Lactic Acid Treatment,p309308,158.0,4.328
3,LA MER,Crème de la Mer,p416341,175.0,4.1648
4,SUMMER FRIDAYS,Jet Lag Mask,p429952,48.0,4.1639


In [4]:
products['Price'] = pd.to_numeric(products['Price'])
products.dtypes

Brand            object
Item             object
ProductId        object
Price           float64
OverallStars    float64
dtype: object
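
As a small illustration of the price cleanup above, the sketch below strips the dollar sign and converts to numbers in one step. The sample values are made up, and `errors='coerce'` is an optional safeguard that the notebook itself does not use.

```python
import pandas as pd

# made-up price strings, standing in for the scraped 'Price' column
raw = pd.Series(['$68.00', '$29.50', '175', 'n/a'])

# remove the literal dollar sign, then convert; unparseable values become NaN
clean = pd.to_numeric(raw.str.replace('$', '', regex=False), errors='coerce')

print(clean.tolist())
print(clean.dtype)
```

Coercing to NaN keeps a single malformed price from raising an error, at the cost of having to check for missing values afterwards.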

In [5]:
columns2 = ['ProductId', 'Loves', 'Ingredients']
ingredients = pd.read_csv('ingredients.csv', encoding='utf-8', header=None)
ingredients = ingredients.iloc[:, 0:3]
ingredients.columns = columns2

In [6]:
ingredients.head()

Unnamed: 0,ProductId,Loves,Ingredients
0,p427421,80000,"['9 Signal Peptide Complex', 'Soybean Folic Ac..."
1,p400259,90000,['Potent Antioxidant Complex (15% L-ascorbic a...
2,p309308,130000,"['High Potency', 'Purified Grade Lactic Acid',..."
3,p416341,40000,"['Algae (Seaweed) Extract', 'Mineral Oil', 'Pe..."
4,p392246,110000,"['T.L.C. Framboos™ AHA Blend 12%', 'Salicylic ..."
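
Each entry in the `Ingredients` column is a Python list stored as text, which is why the parsing function further down calls `ast.literal_eval`. A minimal sketch with a made-up cell value:

```python
import ast

# made-up cell value in the same shape as the 'Ingredients' column
cell = "['Lactic Acid', 'Aloe Leaf Juice', 'NA']"

# literal_eval safely turns the text back into a list of strings
parsed = ast.literal_eval(cell)

print(type(parsed).__name__, len(parsed))
print(parsed[0])
```

Unlike `eval`, `ast.literal_eval` only accepts Python literals, so scraped text cannot execute arbitrary code.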


### Ingredients

Ingredient categories are defined manually. Grouping them could be treated as an unsupervised learning problem, but a hand-built list informed by domain knowledge should be more accurate.

In [7]:
# manually define ingredient categories

categories = ['Peptides Complex', 'Alfalfa Powder', 'Algae (Seaweed, Spirulina) Extract', 'Aloe', 'Ascorbic Acid', 'Aspartic Acid', 'Folic Acid',
             'Soybean Extract', 'Agave Extract', 'Cactus Extract', 'Camellia Sinensis Leaf Extract', 'Citric Acid', 'Coconut', 'Eucalyptus Oil', 'Ferulic Acid',
             'Flower Extract', 'Glycolic Acid', 'Grape Extract', 'Lactic Acid', 'Pumpkin Extract', "Lady's Slipper Orchid Extract",
             'Lemongrass', 'Licorice', 'Lime Extract', 'Linoleic Acid', 'Magnesium', 'Mineral Oil', 'Misc Root Extract',
             'Ficus-Indica Extract', 'Misc Fruit Extract', 'Sodium Hyaluronate', 'Sesame Oil', 'Sodium Benzoate',
             'AHAs', 'Yeast Extract', 'Misc Seed Extract', 'Linolenic Acid', 'Antioxidants', 'Prickly Pear', 'Vitamin-E', 'BHAs', 'Salicylic Acid',
             'Raspberry Extract', 'Pomegranate Extract', 'Almond Extract', 'Carbon', 'Ceramides', 'Zinc Oxide', 'Vitamin-C', 'Silk Extract',
             'Cucumber Extract', 'Rose (Rosa, Rosewood, Rosehips) Extract', 'Orange Extract', 'Collagen', 'Chestnut Extract', 'Peppermint Oil',
             'Olive Extract', 'Hyaluronic Acid', 'Magnolia Bark Extract', 'Cassia Extract', 'Shea Butter', 'Retinol Ester', 'Avocado Oil', 'Blackberry Extract',
             'Chia Oil', 'Ylang Ylang', 'Willow Extract', 'Mandelic Acid', 'Ivy Extract', 'Calendula Extract', 'Stearic Acid', 'Squalane', 'Vegetable Oil',
             'Papaya Extract', 'Bergamot Oil', 'Eggplant Extract', 'Turmeric Extract', 'Ginger Oil', 'Carrot Oil', 'Meadowfoam Oil', 'Sandalwood Oil',
             'Titanium Dioxide', 'Sorbic Acid', 'Fig Extract', 'Strawberry Extract', 'Apple Extract', 'Kiwi Extract', 'Microalgae', 'Plantain Extract',
             'Mistletoe Extract', 'Marula Oil', 'Neroli Oil', 'Pitera™', 'Beta Carotene', 'Banana Extract', 'Peony Extract', 'Cabbage Extract', 'Sweet Potato Extract',
             'Castor Oil', 'Vitamin-B5', 'Lentil Extract', 'Thyme Extract', 'Sugar Cane', 'Jojoba Oil', 'Birch Extract', 'Witch Hazel', 'Moringa', 
             'Malic Acid', 'Tamarind', 'Coffee (Caffeine)', 'Lavender Oil', 'Citrus Extract', 'Kaolin (Clay)', 'Oatmeal', 'Chrysanthemum Extract', 'Ginseng Extract', 
             'Rosemary Oil', 'Chlorella Extract', 'Artichoke Extract', 'Blueberry Extract', 'Cranberry Extract', 'Flax Oil', 'Apricot Extract', 'Basil Extract', 'Peach Extract',
             'Evening Primrose Oil', 'Argan Oil', 'Mushroom', 'Probiotics', 'Vitamin-A', 'Jasmine Oil', 'Cocoa Extract', 'Tangerine Extract', 'Bamboo Extract', 'Stem Cell',
             'Charcoal', 'Honeysuckle Extract', 'Kukui Oil', 'Plum', 'Chamomile Extract', 'Walnut Powder', 'Geranium', 'Sulfur', 'Coco-Glucoside', 'Greek Yogurt', 'Matricaria Oil',
             'Clove Oil', 'Mango Extract', 'Henna', 'Royal Jelly Extract', 'Kelp Extract', 'Coco-Betaine', 'Barley Extract', 'Maracuja Oil', 'Palm Oil', 'Radish Extract',
             'Coriander Extract', 'Omegas', 'Broccoli Extract', 'Sage Oil', 'Niacinamide']

<strong>Define a function that parses the ingredient list of a product</strong>:
* Strip filler words (e.g. Powder, Extract, Acid, Oil, Complex)
* Search the category keys and, for every key that matches, add 1 to that key's value in a temporary ingredients dictionary
    * Ingredients with no match at all are skipped; collecting them in a list for later analysis would show whether more categories (keys) are needed
* Use the list of dictionaries to create a new dataframe that counts how many times a given ingredient category shows up in each product's ingredient list


In [8]:
def generateIngredients(lst):
    # one counter per category, all starting at zero
    temp_dict = {key: 0 for key in categories}
    
    # the ingredient list is stored as a string, so parse it back into a list
    for ind in ast.literal_eval(lst):
        # break down each ingredient so it can be searched against the keys
        if ind != 'NA':
            temp = re.sub('-', ' ', ind)
            temp = re.sub(r'\+|%|\.|\*', '', temp)
            temp = re.sub(r'\\+', '', temp)
            
            # filler words and punctuation to strip
            x = r'Acid|Extract|/|\(|\)|Oil|Leaf|Complex|Ferment|Sodium|Powder|Oxides|Oxide|CI |\]|\[|Ci |Dioxide|And'
            
            temp = re.sub('[Vv]itamin.([A-Z])', r'Vitamin-\1', temp)
            temp = re.sub(x, ' ', temp)
            temp = re.sub('acid|extract|oil|leaf|complex|ferment|sodium|powder|oxide|and ', ' ', temp)
            # drop leftover one- and two-character words
            temp = re.sub(r'\b\w{1,2}\b', '', temp).strip()
    
            # the remaining words become alternatives in a single regex
            # NOTE: if nothing is left, the empty pattern matches every category
            temp = "|".join(temp.split()).lower()
            
            pattern = re.compile(temp)
            temp_keys = [item for item in categories if re.search(pattern, item.lower())]
            
            for key in temp_keys:
                temp_dict[key] += 1
    
    return temp_dict
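
The description above mentions setting aside ingredients that match no category. A minimal sketch of that idea follows; `match_categories`, `demo_categories` and the filler-word set are hypothetical names chosen for illustration, and the matching is simplified relative to `generateIngredients`.

```python
import re

# Hypothetical, shortened category list for illustration only.
demo_categories = ['Lactic Acid', 'Mineral Oil', 'Aloe']

# Assumed filler words; the notebook's own list is longer.
FILLER = {'acid', 'extract', 'oil', 'leaf', 'complex', 'powder', 'and'}

def match_categories(ingredient, categories):
    """Return the categories that share at least one word with the ingredient."""
    # keep alphabetic words of three or more letters, minus filler words
    words = re.findall(r'[a-z]{3,}', ingredient.lower())
    tokens = [w for w in words if w not in FILLER]
    if not tokens:
        # an empty pattern would match every category, so stop here
        return []
    pattern = re.compile('|'.join(re.escape(t) for t in tokens))
    return [c for c in categories if pattern.search(c.lower())]

unmatched = []
for ing in ['Purified Grade Lactic Acid', 'Dimethicone', 'Oil']:
    hits = match_categories(ing, demo_categories)
    if not hits:
        unmatched.append(ing)  # candidates for new categories
    print(ing, '->', hits)

print('unmatched:', unmatched)
```

Reviewing `unmatched` after a full run is one way to decide whether the category list needs more keys.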
    

Apply the function to each product in the 'ingredients' dataframe

In [9]:
blank = []
sums = []
for i in range(0, ingredients.shape[0]):
    temp_dict = generateIngredients(ingredients['Ingredients'].iloc[i])
    _sum = sum(temp_dict.values())
    # keep the total number of matches per product for inspection
    sums.append((i, _sum))
    if _sum < 90:
        blank.append(temp_dict)
    else:
        # totals of 90 or more are treated as products with no usable
        # ingredient list: append an all-zero dict instead
        blank.append({key: 0 for key in categories})

# list of dictionaries to dataframe      
df1 = pd.DataFrame(blank)

Concatenate this data frame with the ingredients dataframe

In [10]:
df2 = pd.concat([ingredients, df1], axis = 1)
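
`pd.concat(..., axis=1)` lines rows up by index label, so this step relies on `ingredients` and `df1` sharing the same default 0..n-1 index. A small sketch with made-up frames shows both the aligned and the misaligned case:

```python
import pandas as pd

left = pd.DataFrame({'ProductId': ['p1', 'p2']})
flags = pd.DataFrame([{'Aloe': 1, 'Sulfur': 0}, {'Aloe': 0, 'Sulfur': 2}])

# same index on both sides: rows pair up one-to-one
combined = pd.concat([left, flags], axis=1)
print(combined.shape)

# a shifted index misaligns the rows and introduces NaN
shifted = flags.copy()
shifted.index = [1, 2]
misaligned = pd.concat([left, shifted], axis=1)
print(misaligned.shape)
```

If either frame had been filtered or re-sorted beforehand, calling `reset_index(drop=True)` on both before concatenating would restore the one-to-one pairing.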

Merge with 'products' dataframe

In [11]:
df = pd.merge(products, df2, on='ProductId', how = 'outer')
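
An outer merge keeps rows whose `ProductId` appears in only one of the two frames, so the result can have more rows than either input. The sketch below uses made-up frames and the `indicator=True` option of `pd.merge` to show where each row came from; the frame names are illustrative.

```python
import pandas as pd

products_demo = pd.DataFrame({'ProductId': ['p1', 'p2'], 'Price': [10.0, 20.0]})
ingredients_demo = pd.DataFrame({'ProductId': ['p2', 'p3'], 'Loves': [5, 7]})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = pd.merge(products_demo, ingredients_demo,
                  on='ProductId', how='outer', indicator=True)

print(len(merged))
print(merged[['ProductId', '_merge']])
```

Rows marked `left_only` or `right_only` carry NaN in the columns from the other frame, which is worth checking before computing proportions.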

In [12]:
df.sample(5)

Unnamed: 0,Brand,Item,ProductId,Price,OverallStars,Loves,Ingredients,AHAs,Agave Extract,Alfalfa Powder,...,Vitamin-A,Vitamin-B5,Vitamin-C,Vitamin-E,Walnut Powder,Willow Extract,Witch Hazel,Yeast Extract,Ylang Ylang,Zinc Oxide
152,LA MER,The Moisturizing Cool Gel Cream,p429637,175.0,3.3023,3366.0,['It all started when Dr. Max Huber – unable t...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,KIEHL'S SINCE 1851,Ultra Facial Cream,p421996,29.5,4.4253,30000.0,['Antarcticine (Glacial Glycoprotein Extract)'...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
133,CLINIQUE,Dramatically Different Moisturizing Lotion+,p381030,28.0,3.8719,60000.0,['Mineral Oil/Paraffinum Liquidum/Huile Minera...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
28,FOREO,LUNA mini 2,p404444,139.0,4.6043,30000.0,['Foreo took the beauty industry by storm with...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
154,KORRES,Wild Rose Vitamin C Active Brightening Oil,p405289,54.0,4.4232,30000.0,"['Super Vitamin C', 'Wild Rose Oil', 'Camapu E...",0.0,0.0,0.0,...,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
df.shape[0]

205

In [14]:
#df[df['Retinol Ester'] > 0]

Create dataframe with counts of each ingredient

In [15]:
df_sums = df.iloc[:, 7:].fillna(0).sum().sort_values(ascending=False).to_frame().reset_index()
df_sums.columns = ['Ingredient', 'Count']


In [16]:
df_sums.head()

Unnamed: 0,Ingredient,Count
0,Misc Seed Extract,122.0
1,Misc Root Extract,88.0
2,Sodium Hyaluronate,88.0
3,Sodium Benzoate,84.0
4,Misc Fruit Extract,83.0


Create a "long" dataframe with one observation for every ingredient of each product

* Drop all rows with value = 0, so that only rows recording the occurrence of an ingredient remain

In [17]:
df_long = pd.melt(df, id_vars = ['ProductId', 'Price', 'Loves'], value_vars = categories).fillna(0)
# keep rows where no column is 0; this also drops products whose Price or Loves is 0 or missing
df_long = df_long[(df_long != 0).all(axis=1)].reset_index(drop=True).rename(columns={'variable': 'Ingredient', 'value': 'Count'})
df_long.head()

Unnamed: 0,ProductId,Price,Loves,Ingredient,Count
0,p427421,68.0,80000.0,Peptides Complex,1.0
1,p433435,38.0,20000.0,Peptides Complex,1.0
2,p429515,64.0,30000.0,Peptides Complex,1.0
3,p432668,36.0,30000.0,Peptides Complex,1.0
4,p419223,60.0,20000.0,Peptides Complex,1.0
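
The wide-to-long step can be sketched on a tiny made-up frame. The example also contrasts filtering on the count column alone with filtering on every column, as the cell above does; the frame and column values are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({
    'ProductId': ['p1', 'p2'],
    'Price': [10.0, 0.0],   # p2 has a zero price on purpose
    'Aloe': [1, 0],
    'Sulfur': [0, 2],
})

long_df = pd.melt(wide, id_vars=['ProductId', 'Price'],
                  value_vars=['Aloe', 'Sulfur'],
                  var_name='Ingredient', value_name='Count')

# filter on the count column only
by_count = long_df[long_df['Count'] != 0].reset_index(drop=True)

# filter on every column: the zero-price product disappears as well
by_all = long_df[(long_df != 0).all(axis=1)].reset_index(drop=True)

print(by_count[['ProductId', 'Ingredient', 'Count']])
print(len(by_all))
```

Which filter is appropriate depends on whether a zero in `Price` or `Loves` should count as missing data.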


In [18]:
# flag the ingredients with more than x observations

x = 40

bigs = df_long.groupby('Ingredient')['Count'].sum() > x

In [19]:
# now create new df with just the ingredients that have more than x observations

df_bigs = df_long[df_long['Ingredient'].isin(bigs[bigs].index)]

Mid-range of ingredient occurrences

In [20]:
# flag the ingredients with at least x and fewer than y observations
# (a count of exactly 40 lands in neither this group nor the one above)

x = 20
y = 40

mid = df_long.groupby('Ingredient')['Count'].sum().isin(range(x, y))

In [21]:
df_mids = df_long[df_long['Ingredient'].isin(mid[mid].index)]
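
The grouping into common and mid-range ingredients can be sketched with explicit comparisons. This is an alternative formulation under assumed thresholds, not the notebook's exact cut-offs: here the common group uses `>=`, so no count falls between the two groups.

```python
import pandas as pd

# made-up long-format data: A appears 3 times, B twice, C once
demo = pd.DataFrame({'Ingredient': ['A', 'A', 'A', 'B', 'B', 'C'],
                     'Count': [1, 1, 1, 1, 1, 1]})

totals = demo.groupby('Ingredient')['Count'].sum()

lo, hi = 2, 3   # hypothetical thresholds; the notebook uses 20 and 40
common = totals[totals >= hi].index.tolist()
mid_range = totals[(totals >= lo) & (totals < hi)].index.tolist()
rare = totals[totals < lo].index.tolist()

print(common, mid_range, rare)
```

Every ingredient lands in exactly one of the three lists, which makes it easy to confirm that none was dropped.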

### API for comparing bestsellers' ingredients to common ingredients

In [22]:
import requests
import json

In [23]:
base_url = 'https://skincare-api.herokuapp.com/product?q='
base_url2 = 'https://skincare-api.herokuapp.com/products'

response1 = requests.get(base_url2)
data1 = response1.json()
api_tot = len(data1)      # total number of unique products in this API
best_tot = len(products)  # total number of unique bestsellers

In [24]:
df_sums['BestProp'] = df_sums['Count']/best_tot

In [25]:
# format categories for API search
searches = []
for cat in categories:
    temp = re.sub('Powder|Extract|Leaf|Oil|Misc', '', cat)
    temp = re.sub(r'\(.*\)', '', temp)
    temp = re.sub(r'-|\(|\)|,', " ", temp).strip()
    # drop a trailing "s" from each word (e.g. 'AHAs' -> 'AHA'; note 'Citrus' also becomes 'Citru')
    temp = re.sub(r's\b', '', temp)
    temp = "+".join(temp.split()).lower()
    searches.append((cat, temp))
    
# proportion of API products that match each search term
proportions = []
for cat, search in searches:
    response = requests.get(base_url + search)
    data = response.json()
    proportions.append((cat, len(data)/api_tot))
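
To see which query strings the normalization produces without calling the API, the same steps can be wrapped in a function. `to_search_term` is a hypothetical name; the regular expressions are the ones from the cell above.

```python
import re

def to_search_term(category):
    """Mirror the notebook's normalization of a category name into a query."""
    temp = re.sub('Powder|Extract|Leaf|Oil|Misc', '', category)
    temp = re.sub(r'\(.*\)', '', temp)             # drop parenthesized aliases
    temp = re.sub(r'-|\(|\)|,', ' ', temp).strip()
    temp = re.sub(r's\b', '', temp)                # drop a trailing "s" per word
    return '+'.join(temp.split()).lower()

for name in ['Vitamin-C', 'AHAs', 'Citrus Extract',
             'Algae (Seaweed, Spirulina) Extract']:
    print(name, '->', to_search_term(name))
```

The trailing-"s" rule turns plurals into singulars but also shortens words such as "Citrus", so the resulting term may match more or fewer products than intended.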


In [26]:
# make list of tuples into dataframe
props = pd.DataFrame(proportions)
props.columns = ['Ingredient', 'APIProp']

In [27]:
# merge with df_sums
Props = pd.merge(df_sums, props, on = 'Ingredient')

In [28]:
# add difference column
Props['Diff'] = Props['BestProp'] - Props['APIProp']
avg_diff = np.average(Props['Diff'])

In [29]:
Props = Props.reindex(Props['Diff'].abs().sort_values(ascending = False).index)

In [30]:
Props_top = Props.iloc[0:20]
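
The ranking used above (largest absolute difference first, sign preserved) can be checked on a small made-up table. The values below are illustrative, not results from the analysis.

```python
import pandas as pd

demo = pd.DataFrame({'Ingredient': ['C', 'B', 'A'],
                     'BestProp': [0.10, 0.025, 0.39],
                     'APIProp': [0.09, 0.280, 0.04]})
demo['Diff'] = demo['BestProp'] - demo['APIProp']

# sort the absolute differences, then reorder the rows by that index
order = demo['Diff'].abs().sort_values(ascending=False).index
ranked = demo.reindex(order)

print(ranked[['Ingredient', 'Diff']])
```

Because only the ordering uses the absolute value, ingredients that are under-represented among bestsellers (negative `Diff`) can still rank near the top.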

In [31]:
#Props

## Plots

#### Price distribution of products with most common ingredients

In [32]:
init_notebook_mode(connected=True)

In [33]:
# interactive plot price distribution of ingredients
# ingredients list
greds = df_bigs['Ingredient'].unique()

N = 14

c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(180, 300, N)]


# plot price distribution
trace = []
for i in range(0, len(greds)):
    trace.append(go.Box(x = df_bigs[df_bigs['Ingredient'] == greds[i]]['Price'], 
                       name = re.sub(r'\(.*\)', '', greds[i]),
                       marker = dict(color = c[i]),
                       boxmean = True))
    
layout = go.Layout(title = 'Most Common Skincare Ingredients: Sephora Bestsellers', 
                   yaxis=dict(tickangle=-45), 
                   margin = go.layout.Margin(l=150,
                                             r=50,
                                             b=100,
                                             t=100,
                                             pad=4),
                   showlegend = False, 
                   xaxis = dict(range = [0, 250], title = 'Price', tickformat = "$.0f"),
                   autosize = True)

iplot(go.Figure(data = trace, layout = layout))
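
The colour lists in these plotting cells are built for a fixed `N`, so a plot with more ingredients than `N` would fail with an `IndexError` at `c[i]`. A minimal sketch of a helper that sizes the palette from the number of traces instead; `hsl_palette` and its defaults are hypothetical.

```python
import numpy as np

def hsl_palette(n, start=180, stop=300):
    """Return n evenly spaced HSL colour strings between two hue angles."""
    return [f'hsl({int(h)},50%,50%)' for h in np.linspace(start, stop, n)]

# made-up ingredient names standing in for df_bigs['Ingredient'].unique()
greds_demo = ['Aloe', 'Sulfur', 'Lactic Acid']
colors = hsl_palette(len(greds_demo))

print(colors)
```

Passing `len(greds)` keeps the palette and the trace loop in step when the ingredient thresholds change.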

#### Popularity distribution of products with most common ingredients

In [34]:
# interactive plot popularity distribution of ingredients
# ingredients list
greds = df_bigs['Ingredient'].unique()

#N = 14

#c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(180, 300, N)]

N = 9
M = 5

c = ['hsl('+str(int(h))+',50%'+',50%)' for h in np.linspace(300, 360, N)]
c += ['hsl('+str(int(h))+',50%'+',50%)' for h in np.linspace(0, 30, M)]

# plot popularity (Loves) distribution
trace = []
for i in range(0, len(greds)):
    trace.append(go.Box(x = df_bigs[df_bigs['Ingredient'] == greds[i]]['Loves'], 
                        name = re.sub(r'\(.*\)', '', greds[i]),
                        marker = dict(color = c[i]),
                        boxmean = True))

    
layout = go.Layout(title = 'Most Common Skincare Ingredients: Sephora Bestsellers', 
                   yaxis=dict(tickangle=-45), 
                   margin = go.layout.Margin(
                       l=150,
                       r=50,
                       b=100,
                       t=100,
                       pad=4), 
                   showlegend = False, 
                   xaxis = dict(range = [0, 160000], title = 'Loves'),
                   autosize = True)

iplot(go.Figure(data = trace, layout = layout))

In [35]:
# interactive violin plot popularity distribution of ingredients
# ingredients list
greds = df_bigs['Ingredient'].unique()

N = 14

c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(180, 300, N)]


# plot popularity (Loves) distribution
trace = []
for i in range(0, len(greds)):
    trace.append(go.Violin(x = df_bigs[df_bigs['Ingredient'] == greds[i]]['Loves'], 
                       name = re.sub(r'\(.*\)', '', greds[i]),
                       marker = dict(color = c[i]),
                          box = dict(visible = True)))

    
layout = go.Layout(title = 'Most Common Skincare Ingredients: Sephora Bestsellers', 
                   yaxis=dict(tickangle=-45), 
                   margin = go.layout.Margin(
                       l=150,
                       r=50,
                       b=100,
                       t=100,
                       pad=4), 
                   showlegend = False, 
                   xaxis = dict(range = [0, 160000], title = 'Loves'),
                   height = 700)

iplot(go.Figure(data = trace, layout = layout))

#### Look at ingredients in the mid-range of occurrences

In [39]:
# interactive plot price distribution of ingredients
# ingredients list
greds = df_mids['Ingredient'].unique()

# N = 13
# M = 7

# c = ['hsl('+str(int(h))+',50%'+',50%)' for h in np.linspace(300, 360, N)]
# for i in range(0, M):
#     c.append(['hsl('+str(int(h))+',50%'+',50%)' for h in np.linspace(0, 30, M)][i])

N = 24
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(180, 300, N)]

# plot price distribution
trace = []
for i in range(0, len(greds)):
    trace.append(go.Box(x = df_mids[df_mids['Ingredient'] == greds[i]]['Price'], 
                        name = re.sub(r'\(.*\)', '', greds[i]),
                        marker = dict(color = c[i]),
                        boxmean = True))

    
layout = go.Layout(title = 'Most Common Skincare Ingredients: Sephora Bestsellers', 
                   yaxis=dict(tickangle=-45), 
                   margin = go.layout.Margin(
                       l=150,
                       r=50,
                       b=100,
                       t=100,
                       pad=4), 
                   showlegend = False, 
                   xaxis = dict(range = [0, 250], title = 'Price', tickformat = "$.0f"),
                   autosize = True)
#py.iplot(go.Figure(data = trace, layout = layout), filename='mid_prices')
iplot(go.Figure(data = trace, layout = layout))

In [41]:
# interactive plot popularity distribution of ingredients
# ingredients list
greds = df_mids['Ingredient'].unique()

N = 16
M = 8

c = ['hsl('+str(int(h))+',50%'+',50%)' for h in np.linspace(300, 360, N)]
c += ['hsl('+str(int(h))+',50%'+',50%)' for h in np.linspace(0, 30, M)]


# plot popularity (Loves) distribution
trace = []
for i in range(0, len(greds)):
    trace.append(go.Box(x = df_mids[df_mids['Ingredient'] == greds[i]]['Loves'], 
                        name = re.sub(r'\(.*\)', '', greds[i]),
                        marker = dict(color = c[i]),
                        boxmean = True))

    
layout = go.Layout(title = 'Most Common Skincare Ingredients: Sephora Bestsellers', 
                   yaxis=dict(tickangle=-45), 
                   margin = go.layout.Margin(
                       l=150,
                       r=50,
                       b=100,
                       t=100,
                       pad=4), 
                   showlegend = False, 
                   xaxis = dict(range = [400, 150000], title = 'Loves'),
                   autosize = True)

iplot(go.Figure(data = trace, layout = layout))

#### Plot Percentage of Products with Ingredient X in Bestsellers vs. in General Population

Compare how often each ingredient shows up in the roughly 200 bestsellers versus the roughly 2,000 products of the general population returned by the API.

In [42]:
Props_top.head()

Unnamed: 0,Ingredient,Count,BestProp,APIProp,Diff
6,Glycolic Acid,78.0,0.39,0.035159,0.354841
3,Sodium Benzoate,84.0,0.42,0.083573,0.336427
7,Citric Acid,72.0,0.36,0.096254,0.263746
110,Citrus Extract,5.0,0.025,0.280115,-0.255115
9,Chrysanthemum Extract,55.0,0.275,0.027089,0.247911


In [43]:
len(Props_top)

20

In [44]:
# interactive plot of BestProp and APIProp

N = 2

#c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(270, 300, N)]
c = ['hsl('+str(210)+',50%'+',50%)', 'hsl('+str(170)+',50%'+',50%)']

trace0 = go.Bar(x = Props_top['Ingredient'],
               y = Props_top['BestProp'],
               name = 'Bestsellers',
               marker = dict(color = c[0]))

trace1 = go.Bar(x = Props_top['Ingredient'],
               y = Props_top['APIProp'],
               name = 'General',
               marker = dict(color = c[1]))
data = [trace0, trace1]

layout = go.Layout(title = 'Percentage of Products with a Given Ingredient', 
                   yaxis = dict(title = 'Percentage', tickformat = ".0%"),
                   margin = go.layout.Margin(
                       l=50,
                       r=50,
                       b=150,
                       t=50,
                       pad=4), 
                   xaxis = dict(title = '', tickangle=45),
                   autosize = True, 
                   barmode = 'group',
                  legend = dict(x = 0.8, y = 1))

iplot(go.Figure(data = data, layout = layout))

#### Specific "new" ingredients

In [45]:
df[df['Retinol Ester'] > 0]

Unnamed: 0,Brand,Item,ProductId,Price,OverallStars,Loves,Ingredients,AHAs,Agave Extract,Alfalfa Powder,...,Vitamin-A,Vitamin-B5,Vitamin-C,Vitamin-E,Walnut Powder,Willow Extract,Witch Hazel,Yeast Extract,Ylang Ylang,Zinc Oxide
12,SUNDAY RILEY,Power Couple Duo: Total Transformation Kit,p402718,85.0,4.3553,90000.0,"['Trans-retinol Ester', 'Pharmaceutical Grade ...",0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
17,TATCHA,"Smooth, Poreless Skin Obento Box",p433963,79.0,4.8077,30000.0,"['Japanese Luffa Fruit Exfoliant', 'Japanese L...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
33,SUNDAY RILEY,Luna Sleeping Night Oil,p393718,105.0,4.1198,90000.0,"['Trans-retinoic Acid Ester', 'Blue Tansy and ...",0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0
85,SHISEIDO,Benefiance WrinkleResist24 Pure Retinol Expres...,p173619,65.0,4.4282,30000.0,"['Pure Retinol Micro-infusion Technology', 'Wr...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
91,GLAMGLOW,GLOWSTARTER™ Mega Illuminating Moisturizer,p408739,49.0,3.9812,40000.0,"['Dimethicone', 'Butylene Glycol', 'Cetyl Rici...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
132,DR. DENNIS GROSS SKINCARE,Ferulic + Retinol Anti-Aging Moisturizer,p384536,75.0,4.1171,8120.0,"['Ferulic Acid', 'Retinol', 'ECG Complex™ (Pro...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
163,DR. DENNIS GROSS SKINCARE,Ferulic + Retinol Triple Correction Eye Serum,p377531,69.0,3.8498,20000.0,"['Ferulic Acid', 'Retinol', 'Licorice Root Ext...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
177,CAUDALIE,Favorites Set,p429685,39.0,4.3421,10000.0,"['Grape-seed Polyphenols', 'Peptides and Caffe...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [46]:
len(df[df['Retinol Ester'] > 0])

8

In [47]:
# general population: number of API products matching 'retin'
response = requests.get(base_url + 'retin')
data = response.json()
print(len(data))


59


In [48]:
df[df['Sulfur'] > 0]

Unnamed: 0,Brand,Item,ProductId,Price,OverallStars,Loves,Ingredients,AHAs,Agave Extract,Alfalfa Powder,...,Vitamin-A,Vitamin-B5,Vitamin-C,Vitamin-E,Walnut Powder,Willow Extract,Witch Hazel,Yeast Extract,Ylang Ylang,Zinc Oxide
113,KATE SOMERVILLE,EradiKate® Daily Cleanser Acne Treatment,p415667,38.0,4.2651,10000.0,"['Sulfur 3%', 'Botanical Complex', 'Natural Oa...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
131,KATE SOMERVILLE,EradiKate Acne Treatment,p232903,26.0,4.2771,60000.0,"['Sulfur 10%', 'AHAs', 'Zinc Oxide', 'Isopropy...",1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [256]:
# general population: number of API products matching 'sulfur'
response = requests.get(base_url + 'sulfur')
data = response.json()
print(len(data))

18


In [49]:
df[df['Niacinamide'] > 0]

Unnamed: 0,Brand,Item,ProductId,Price,OverallStars,Loves,Ingredients,AHAs,Agave Extract,Alfalfa Powder,...,Vitamin-A,Vitamin-B5,Vitamin-C,Vitamin-E,Walnut Powder,Willow Extract,Witch Hazel,Yeast Extract,Ylang Ylang,Zinc Oxide
7,IT COSMETICS,Your Skin But Better CC+ Cream Oil-Free Matte ...,p433435,38.0,3.9735,20000.0,"['Collagen', 'Peptides', 'Niacin', 'Antioxidan...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
199,TARTE,Knockout Tingling Treatment,p427427,39.0,4.4955,20000.0,"['Niacinamide', 'Salicylic Acid', 'Lactic Acid...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [50]:
# general population: number of API products matching 'niacinamide'
response = requests.get(base_url + 'niacinamide')
data = response.json()
print(len(data))

326


In [51]:
# general population: number of API products matching 'heparan sulfate'
response = requests.get(base_url + 'heparan+sulfate')
data = response.json()
print(len(data))

0
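
The four lookups above repeat the same three lines, so they could be folded into one helper. The sketch below is an assumption about how that might look: `count_matches` and `fake_fetch` are hypothetical names, and the canned responses are made up so the example runs offline.

```python
from urllib.parse import quote_plus

BASE_URL = 'https://skincare-api.herokuapp.com/product?q='

def count_matches(term, fetch):
    """Return how many products the API lists for a search term.

    `fetch` is any callable that takes a URL and returns the decoded JSON list.
    """
    return len(fetch(BASE_URL + quote_plus(term)))

# stand-in for the network call (made-up data, not real API results)
def fake_fetch(url):
    canned = {'sulfur': [{'id': 1}, {'id': 2}], 'heparan+sulfate': []}
    return canned[url.split('q=')[1]]

print(count_matches('sulfur', fake_fetch))
print(count_matches('heparan sulfate', fake_fetch))
```

In the notebook, `fetch` could be `lambda url: requests.get(url).json()`, which reproduces the calls made in the cells above.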


## Next Steps:

* Analyze the brands that have popular ingredients
* Find a data source for "new" products: search the "new" section of Sephora
* Model popularity of products
* Search dermatology studies