1. Skin-care, Chemicals... it's complicated 

Choosing a new cosmetic or skin-care can be overwhelming and even scary. Many of us have had skin issues from trying new products. Though the info is on each product, understanding ingredient lists can be hard unless you're a chemist. Instead of guessing, we can use data science to predict which products might work for us. In this project, we'll build a recommendation system based on cosmetics' chemical components. We'll analyze ingredient lists for 1,472 products from Sephora using word embedding and a Document-Term Matrix (DTM). Then, we'll visualize ingredient similarities using t-SNE and Bokeh. Let's start by checking our data.

In [77]:
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE

df = pd.read_csv("cosmetics.csv")
display(df.sample(5))
df.Label.value_counts()

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive
398,Cleanser,LANEIGE,Essential Power Skin Toner for Normal to Dry Skin,28,4.5,"Water, Glycereth-26, Alcohol, Butylene Glycol,...",1,1,1,0,1
1146,Eye cream,DR. BRANDT SKINCARE,needles no more® NO MORE BAGGAGE™ eye de-puffi...,42,3.8,"Water, Sodium Magnesium Silicate, Propanediol,...",1,1,1,1,1
885,Face Mask,TOO COOL FOR SCHOOL,Egg Cream Mask Hydration,24,4.3,"Water, Butylene Glycol, Glycerin, PEG-32, Cycl...",1,1,1,0,1
956,Face Mask,TATA HARPER,Resurfacing Mask,62,4.3,"Aloe barbadensis Leaf Juice*, Salix Alba (Will...",1,1,1,1,1
27,Moisturizer,ORIGINS,Dr. Andrew Weil For Origins™ Mega-Mushroom Rel...,34,4.4,"Water, Butylene Glycol, PEG-4, Citrus Aurantiu...",1,1,1,1,1


Label
Moisturizer    298
Cleanser       281
Face Mask      266
Treatment      248
Eye cream      209
Sun protect    170
Name: count, dtype: int64

2. Focus on one type of product and one skin type
In our dataset, we have six types of products (moisturizers, cleansers, face masks, eye creams, and sun protection) and five skin types (combination, dry, normal, oily, and sensitive). Since everyone's skincare needs and skin types vary, let's set up our workflow to customize its outputs—a t-SNE model and a visualization of that model. For this example, let's narrow down to moisturizers for people with dry skin by filtering the data accordingly.

In [78]:
moisturizers = df[df.Label == "Moisturizer"]
moisturizers_dry = moisturizers[moisturizers.Dry==1]
moisturizers_dry = moisturizers_dry.reset_index(drop = True)

3. Tokenizing the ingredients
To get to our end goal of comparing ingredients in each product, we first need to do some preprocessing tasks and bookkeeping of the actual words in each product's ingredients list. The first step will be tokenizing the list of ingredients in Ingredients column. After splitting them into tokens, we'll make a binary bag of words. Then we will create a dictionary with the tokens, ingredient_idx, which will have the following format:

{ "ingredient": index value, ... }

In [79]:
ingredient_idx = {}
c = []
idx = 0

for i in range(len(moisturizers_dry)):
    ingredients = moisturizers_dry["Ingredients"][i]
    ingredients_low = ingredients.lower()
    tokens = ingredients_low.split(', ')
    c.append(tokens)
    for ingredients in tokens:
        if ingredients not in ingredient_idx:
            ingredient_idx[ingredients] = idx
            idx +=1

print("the index of Decyl oleate", ingredient_idx['decyl oleate'])

the index of Decyl oleate 25


4. Initializing a document-term matrix (DTM)
The next step is making a document-term matrix (DTM). Here each cosmetic product will correspond to a document, and each chemical composition will correspond to a term. This means we can think of the matrix as a “cosmetic-ingredient” matrix.To create this matrix, we'll first make an empty matrix filled with zeros. The length of the matrix is the total number of cosmetic products in the data. The width of the matrix is the total number of ingredients. After initializing this empty matrix, we'll fill it in the following tasks.

In [80]:
M = len(moisturizers_dry)
N = len(ingredient_idx)
J = np.zeros((M,N))

5. Before we can fill the matrix, let's create a function to count the tokens (i.e., an ingredients list) for each row. Our end goal is to fill the matrix with 1 or 0: if an ingredient is in a cosmetic, the value is 1. If not, it remains 0. The name of this function, oh_hoyiyaa, will become clear next.

In [81]:
def oh_hoyiyaa(tokens):
    x = np.zeros(N)
    for ingredient in tokens:
        idx = ingredient_idx[ingredient]
        x[idx] = 1
    return x

6. The Cosmetic-Ingredient matrix!
Now we'll apply the oh_hoyiyaa() functon to the tokens in c(list) and set the values at each row of this matrix. So the result will tell us what ingredients each item is composed of. For example, if a cosmetic item contains water, niacin, decyl aleate and sh-polypeptide-1, the outcome of this item will be as follows.
[[1 1 0 0 1 ... 0]
 [0 1 1 0 0 ... 1]
 [1 0 0 1 1 ... 0]
 [0 1 0 0 0 ... 0]
 [1 0 1 0 0 ... 1]]
This is what we called one-hot encoding. By encoding each ingredient in the items, the Cosmetic-Ingredient matrix will be filled with binary values.

In [82]:
i = 0
for tokens in c:
    J[i, :] = oh_hoyiyaa(tokens)
    i = i+1

7. Dimension reduction with t-SNE
The dimensions of the existing matrix is (190, 2233), which means there are 2233 features in our data. For visualization, we should downsize this into two dimensions. We'll use t-SNE for reducing the dimension of the data here.

T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimensionality reduction technique that is well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, this technique can reduce the dimension of data while keeping the similarities between the instances. This enables us to make a plot on the coordinate plane, which can be said as vectorizing. All of these cosmetic items in our data will be vectorized into two-dimensional coordinates, and the distances between the points will indicate the similarities between the items.

In [83]:
mod = TSNE(n_components=2, learning_rate=200, random_state=42)
tsne_feats = mod.fit_transform(J)
moisturizers_dry['X'] = tsne_feats[:,0]
moisturizers_dry['Y'] = tsne_feats[:,1]

X_embedded = mod.fit_transform(tsne_feats)
tsne_df = pd.DataFrame(X_embedded, columns=['t-SNE 1', 't-SNE 2'])
tsne_df['Product Name'] = df['Name']
tsne_df['Brand'] = df['Brand']
tsne_df['Price'] = df['Price']

8. Let's map the items with Bokeh
We are now ready to start creating our plot. With the t-SNE values, we can plot all our items on the coordinate plane. And the coolest part here is that it will also show us the name, the brand, the price and the rank of each item. Let's make a scatter plot using Bokeh and add a hover tool to show that information. Note that we won't display the plot yet as we will make some more additions to it.

In [84]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.transform import factor_cmap

output_notebook()

source = ColumnDataSource(tsne_df)
p = figure(title="t-SNE Visualization of Cosmetic Ingredients",
           x_axis_label='t-SNE 1', y_axis_label='t-SNE 2', tools=[hover])

p.scatter('t-SNE 1', 't-SNE 2', size=8, source=source, fill_alpha=0.6,
         color=factor_cmap('Brand', palette='Viridis256', factors=tsne_df['Brand'].unique()))



9. Adding a hover tool
Why don't we add a hover tool? Adding a hover tool allows us to check the information of each item whenever the cursor is directly over a glyph. We'll add tooltips with each product's name, brand, price, and rank (i.e., rating).

In [85]:

hover = HoverTool(tooltips=[
    ("Product", "@{Product Name}"),
    ("Brand", "@Brand"),
    ("Price", "@Price")
])
p.add_tools(hover)

10. Mapping the cosmetic items
Finally, it's show time! Let's see how the map we've made looks like. Each point on the plot corresponds to the cosmetic items. Then what do the axes mean here? The axes of a t-SNE plot aren't easily interpretable in terms of the original data. Like mentioned above, t-SNE is a visualizing technique to plot high-dimensional data in a low-dimensional space. Therefore, it's not desirable to interpret a t-SNE plot quantitatively.

Instead, what we can get from this map is the distance between the points (which items are close and which are far apart). The closer the distance between the two items is, the more similar the composition they have. Therefore this enables us to compare the items without having any chemistry background.

In [86]:
show(p)

11. Comparing two products
Since there are so many cosmetics and so many ingredients, the plot doesn't have many super obvious patterns that simpler t-SNE plots can have. Our plot requires some digging to find insights, but that's okay!

Say we enjoyed a specific product, there's an increased chance we'd enjoy another product that is similar in chemical composition. Say we enjoyed AmorePacific's Color Control Cushion Compact Broad Spectrum SPF 50+. We could find this product on the plot and see if a similar product(s) exist. And it turns out it does! If we look at the points furthest left on the plot, we see LANEIGE's BB Cushion Hydra Radiance SPF 50 essentially overlaps with the AmorePacific product. By looking at the ingredients, we can visually confirm the compositions of the products are similar (though it is difficult to do, which is why we did this analysis in the first place!), plus LANEIGE's version is $22 cheaper and actually has higher ratings.

It's not perfect, but it's useful. In real life, we can actually use our little ingredient-based recommendation engine help us make educated cosmetic purchase choices.

In [87]:

sc_1 = moisturizers_dry[moisturizers_dry['Name'] == "Color Control Cushion Compact Broad Spectrum SPF 50+"]
sc_2 = moisturizers_dry[moisturizers_dry['Name'] == "BB Cushion Hydra Radiance SPF 50"]

display(sc_1)
print(sc_1.Ingredients.values)
display(sc_2)
print(sc_2.Ingredients.values)

Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,X,Y
45,Moisturizer,AMOREPACIFIC,Color Control Cushion Compact Broad Spectrum S...,60,4.0,"Phyllostachis Bambusoides Juice, Cyclopentasil...",1,1,1,1,1,-9.419198,-335.968231


['Phyllostachis Bambusoides Juice, Cyclopentasiloxane, Cyclohexasiloxane, Peg-10 Dimethicone, Phenyl Trimethicone, Butylene Glycol, Butylene Glycol Dicaprylate/Dicaprate, Alcohol, Arbutin, Lauryl Peg-9 Polydimethylsiloxyethyl Dimethicone, Acrylates/Ethylhexyl Acrylate/Dimethicone Methacrylate Copolymer, Polyhydroxystearic Acid, Sodium Chloride, Polymethyl Methacrylate, Aluminium Hydroxide, Stearic Acid, Disteardimonium Hectorite, Triethoxycaprylylsilane, Ethylhexyl Palmitate, Lecithin, Isostearic Acid, Isopropyl Palmitate, Phenoxyethanol, Polyglyceryl-3 Polyricinoleate, Acrylates/Stearyl Acrylate/Dimethicone Methacrylate Copolymer, Dimethicone, Disodium Edta, Trimethylsiloxysilicate, Ethylhexyglycerin, Dimethicone/Vinyl Dimethicone Crosspolymer, Water, Silica, Camellia Japonica Seed Oil, Camillia Sinensis Leaf Extract, Caprylyl Glycol, 1,2-Hexanediol, Fragrance, Titanium Dioxide, Iron Oxides (Ci 77492, Ci 77491, Ci77499).']


Unnamed: 0,Label,Brand,Name,Price,Rank,Ingredients,Combination,Dry,Normal,Oily,Sensitive,X,Y
55,Moisturizer,LANEIGE,BB Cushion Hydra Radiance SPF 50,38,4.3,"Water, Cyclopentasiloxane, Zinc Oxide (CI 7794...",1,1,1,1,1,-31.170496,-359.229309


['Water, Cyclopentasiloxane, Zinc Oxide (CI 77947), Ethylhexyl Methoxycinnamate, PEG-10 Dimethicone, Cyclohexasiloxane, Phenyl Trimethicone, Iron Oxides (CI 77492), Butylene Glycol Dicaprylate/Dicaprate, Niacinamide, Lauryl PEG-9 Polydimethylsiloxyethyl Dimethicone, Acrylates/Ethylhexyl Acrylate/Dimethicone Methacrylate Copolymer, Titanium Dioxide (CI 77891 , Iron Oxides (CI 77491), Butylene Glycol, Sodium Chloride, Iron Oxides (CI 77499), Aluminum Hydroxide, HDI/Trimethylol Hexyllactone Crosspolymer, Stearic Acid, Methyl Methacrylate Crosspolymer, Triethoxycaprylylsilane, Phenoxyethanol, Fragrance, Disteardimonium Hectorite, Caprylyl Glycol, Yeast Extract, Acrylates/Stearyl Acrylate/Dimethicone Methacrylate Copolymer, Dimethicone, Trimethylsiloxysilicate, Polysorbate 80, Disodium EDTA, Hydrogenated Lecithin, Dimethicone/Vinyl Dimethicone Crosspolymer, Mica (CI 77019), Silica, 1,2-Hexanediol, Polypropylsilsesquioxane, Chenopodium Quinoa Seed Extract, Magnesium Sulfate, Calcium Chloride

In [90]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

cosine_sim = cosine_similarity(tsne_feats)

def get_recommendations(product_name, cosine_sim=cosine_sim):
    if product_name not in df['Name'].values:
        return f"Product '{product_name}' not found in the dataset."
    
    idx = df[df['Name'] == product_name].index[0]
    s_scores = list(enumerate(cosine_sim[idx]))
    s_scores = sorted(s_scores, key=lambda x: x[1], reverse=True)
    s_scores = s_scores[1:11]
    product_indices = [i[0] for i in s_scores]
    return df.iloc[product_indices][['Brand', 'Name', 'Price']]

# Example usage
# Enter the product_name(coloum3 from the dataset) to get recomandations
# It gives top 10 similar products (excluding the product itself) in decending order
print(get_recommendations('Facial Treatment Essence Mini'))

                  Brand                                            Name  Price
117              TATCHA       The Indigo Cream Soothing Skin Protectant     85
8    KIEHL'S SINCE 1851                              Ultra Facial Cream     29
81    PETER THOMAS ROTH             Water Drench Hyaluronic Cloud Cream     52
183          L'OCCITANE                         Immortelle Divine Cream    110
169              KOPARI                                    Coconut Melt     28
141           BIOSSANCE            Squalane + Probiotic Gel Moisturizer     52
57         ESTÉE LAUDER  Micro Essence Skin Activating Treatment Lotion    100
120            SHISEIDO  Bio-Performance Advanced Super Restoring Cream    127
189            CAUDALIE                               Premier Cru Cream    140
64             CAUDALIE                  Vinosource Moisturizing Sorbet     39
