# Visualization and Recommender

This notebook builds the visualization and the recommender tool, starting from the cleaned dataset.

In [26]:
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from numpy.linalg import norm
from bokeh.io import show, curdoc, output_notebook, push_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool, Select, Paragraph, TextInput
from bokeh.layouts import column, row
from ipywidgets import interact, widgets
from fuzzywuzzy import process



#### Data Preprocessing

In [27]:
#read in cleaned dataset
yesstyle = pd.read_csv('cleaned_yesstyle.csv')
del yesstyle['Unnamed: 0']

Let's check out the products from Isntree, one of my all-time favorite brands!

In [28]:
isntree = yesstyle[yesstyle['brand'].str.contains('Isntree')]
isntree

Unnamed: 0,price,name,ingredients,rating,reviews,label,brand,product
94,21.0,Isntree - Hyaluronic Acid Low pH Cleansing Foam,"Water, Sodium Cocoyl Isethionate, Glycerin, So...",4.7,23,cleanser,Isntree,Hyaluronic Acid Low pH Cleansing Foam
151,15.4,Isntree - Green Tea Fresh Cleanser,"Water, Glycerin, Sodium Cocoyl Alaninate, Diso...",5.0,9,cleanser,Isntree,Green Tea Fresh Cleanser
228,15.1,Isntree - Sensitive Balancing Cleansing Foam,"Water, Sorbitan Olivate, Trehalose, Allantoin,...",4.0,5,cleanser,Isntree,Sensitive Balancing Cleansing Foam
283,24.48,Isntree - Spot Saver Mugwort Powder Wash 25pcs,"Zea Mays Starch, Sodium Cocoyl Isethionate, So...",4.3,22,cleanser,Isntree,Spot Saver Mugwort Powder Wash 25pcs
317,21.7,Isntree - Onion Newpair Cleansing Foam,"Water, Glycerin, Potassium Myristate, Allium C...",4.7,4,cleanser,Isntree,Onion Newpair Cleansing Foam
387,15.6,Isntree - Green Tea Fresh Toner 200ml,"Camellia Sinensis Leaf Extract, Water, Ginkgo ...",4.6,1226,toner,Isntree,Green Tea Fresh Toner 200ml
422,23.6,Isntree - Hyaluronic Acid Toner Plus,"Water, Propanediol, 1,2-Hexanediol, Sodium Hya...",4.6,10,toner,Isntree,Hyaluronic Acid Toner Plus
427,4.08,Isntree - Hyaluronic Acid Toner Plus Mini,"Water, Propanediol, 1,2-Hexanediol, Sodium Hya...",4.5,128,toner,Isntree,Hyaluronic Acid Toner Plus Mini
474,18.4,Isntree - Clear Skin BHA Toner,"Purified Water, Glycerin, Butylene Glycol, 1,2...",4.5,180,toner,Isntree,Clear Skin BHA Toner
489,5.0,Isntree - Hyaluronic Acid Toner Mini,"Sodium Hyaluronate, Water, Glycerin, Butylene ...",4.5,51,toner,Isntree,Hyaluronic Acid Toner Mini


In [29]:
#tokenization
ingredient_idx = {}
corpus = []
idx = 0

for i in range(len(yesstyle)):
    ingredients = yesstyle['ingredients'][i]
    ingredients_lower = ingredients.lower()
    tokens = ingredients_lower.split(', ')
    corpus.append(tokens)
    for ingredient in tokens:
        if ingredient not in ingredient_idx:
            ingredient_idx[ingredient] = idx
            idx +=1

In [30]:
#get the number of items and tokens
M = len(yesstyle)
N = len(ingredient_idx)

#initialize a matrix of zeroes
A = np.zeros((M,N))
A

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [31]:
#shape of A
print('The shape of A is', A.shape)

The shape of A is (1574, 4930)


The shape of A is (1574, 4930): one row for each of the 1574 products and one column for each of the 4930 distinct ingredients in the entire dataset.

In [32]:
#define oh_encoder function
def oh_encoder(tokens):
    x = np.zeros(N)
    for ingredient in tokens:
        #retrieve the index for each ingredient
        idx = ingredient_idx[ingredient]
        #place 1 at the corresponding indices
        x[idx] = 1
    return x

In [33]:
#make a document-term matrix (item-ingredient matrix)
i = 0

for tokens in corpus:
    A[i, :] = oh_encoder(tokens)
    i +=1
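
To make the preprocessing steps above concrete, here is a minimal sketch of the same pipeline (tokenize, index the vocabulary, one-hot encode) on a toy corpus. The three ingredient lists are made up for illustration, and plain Python lists stand in for the NumPy matrix so the snippet runs without the dataset.

```python
# Toy ingredient lists (hypothetical, not taken from the dataset)
toy_products = [
    "Water, Glycerin, Niacinamide",
    "Water, Butylene Glycol, Glycerin",
    "Aloe Barbadensis Leaf Extract, Water",
]

# Tokenize: lowercase, then split on ', ' as in the notebook
toy_corpus = [p.lower().split(', ') for p in toy_products]

# Assign each distinct ingredient a column index in order of first appearance
toy_idx = {}
for tokens in toy_corpus:
    for ingredient in tokens:
        if ingredient not in toy_idx:
            toy_idx[ingredient] = len(toy_idx)

# Build the binary item-ingredient matrix, one row per product
toy_matrix = []
for tokens in toy_corpus:
    row = [0] * len(toy_idx)
    for ingredient in tokens:
        row[toy_idx[ingredient]] = 1
    toy_matrix.append(row)

print(toy_idx)
print(toy_matrix)
```

The matrix has as many rows as products and as many columns as distinct ingredients, which is exactly how the (1574, 4930) shape of A arises.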

#### Dimensionality Reduction with t-SNE

In [34]:
#dimension reduction with t-SNE
model = TSNE(n_components=2, learning_rate=200, random_state=69)
tsne_features = model.fit_transform(A)
yesstyle['X'] = tsne_features[:, 0]
yesstyle['Y'] = tsne_features[:, 1]

t-SNE is stochastic: unless the random seed is fixed (here with random_state=69), you won't get exactly the same output each time you run it, though the results are likely to be similar. Products that land REALLY close to each other in one run tend to be REALLY close the next time t-SNE is run.


In [35]:
#convert price and rating to floats and round to 2 decimal places
yesstyle['rating'] = yesstyle['rating'].astype('float').round(2)
yesstyle['price'] = yesstyle['price'].astype('float').round(2)
yesstyle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1574 entries, 0 to 1573
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        1574 non-null   float64
 1   name         1574 non-null   object 
 2   ingredients  1574 non-null   object 
 3   rating       1574 non-null   float64
 4   reviews      1574 non-null   int64  
 5   label        1574 non-null   object 
 6   brand        1574 non-null   object 
 7   product      1574 non-null   object 
 8   X            1574 non-null   float32
 9   Y            1574 non-null   float32
dtypes: float32(2), float64(2), int64(1), object(5)
memory usage: 110.8+ KB


#### Mapping and Visualizing Skincare Products with Bokeh

In [36]:
source = ColumnDataSource(data=yesstyle)

#create figure
plot = figure(x_axis_label='T-SNE 1',
              y_axis_label='T-SNE 2',
              width=500, height=400)

#plot data points
plot.circle(x='X', y='Y', source=source,
            size=10, color='#FF7373', alpha=.8)

# hover tool
hover = HoverTool(tooltips=[
    ('Product', '@product'),
    ('Brand', '@brand'),
    ('Price', '$ @price'),
    ('Rating', '@rating'),
    ('Reviews', '@reviews')
])
plot.add_tools(hover)

#define the update function
def update(selected_label):
    selected_data = yesstyle[yesstyle['label'] == selected_label]
    source.data = {
        'X': selected_data['X'],
        'Y': selected_data['Y'],
        'product': selected_data['product'],
        'brand': selected_data['brand'],
        'price': selected_data['price'],
        'rating': selected_data['rating'],
        'reviews': selected_data['reviews'],
    }
    show(plot)

#create a list of all the labels for a drop-down menu
label_options = yesstyle['label'].unique().tolist()

dropdown = widgets.Dropdown(options=label_options, description='Product Type')

#interact the plot with callback
output_notebook()
interact(update, selected_label=dropdown)



interactive(children=(Dropdown(description='Product Type', options=('cleanser', 'toner', 'serum', 'moisturizer…

<function __main__.update(selected_label)>

Every label except spf forms a blob, i.e., a region where the majority of its products lie, with some outliers. The spf products are more spread out, likely because their ingredients vary more, e.g., the wide variety of chemical and physical filters they use.

In [290]:
isntree = yesstyle[yesstyle['brand'].str.contains('Isntree')]
isntree

Unnamed: 0,price,name,ingredients,rating,reviews,label,brand,product,X,Y
94,21.0,Isntree - Hyaluronic Acid Low pH Cleansing Foam,"Water, Sodium Cocoyl Isethionate, Glycerin, So...",4.7,23,cleanser,Isntree,Hyaluronic Acid Low pH Cleansing Foam,-32.442444,41.15052
151,15.4,Isntree - Green Tea Fresh Cleanser,"Water, Glycerin, Sodium Cocoyl Alaninate, Diso...",5.0,9,cleanser,Isntree,Green Tea Fresh Cleanser,9.008814,-5.102948
228,15.1,Isntree - Sensitive Balancing Cleansing Foam,"Water, Sorbitan Olivate, Trehalose, Allantoin,...",4.0,5,cleanser,Isntree,Sensitive Balancing Cleansing Foam,27.025011,-12.745646
283,24.48,Isntree - Spot Saver Mugwort Powder Wash 25pcs,"Zea Mays Starch, Sodium Cocoyl Isethionate, So...",4.3,22,cleanser,Isntree,Spot Saver Mugwort Powder Wash 25pcs,-3.51525,-20.6077
317,21.7,Isntree - Onion Newpair Cleansing Foam,"Water, Glycerin, Potassium Myristate, Allium C...",4.7,4,cleanser,Isntree,Onion Newpair Cleansing Foam,24.837965,-9.212627
387,15.6,Isntree - Green Tea Fresh Toner 200ml,"Camellia Sinensis Leaf Extract, Water, Ginkgo ...",4.6,1226,toner,Isntree,Green Tea Fresh Toner 200ml,-14.748699,-25.239874
422,23.6,Isntree - Hyaluronic Acid Toner Plus,"Water, Propanediol, 1,2-Hexanediol, Sodium Hya...",4.6,10,toner,Isntree,Hyaluronic Acid Toner Plus,-21.910812,38.223812
427,4.08,Isntree - Hyaluronic Acid Toner Plus Mini,"Water, Propanediol, 1,2-Hexanediol, Sodium Hya...",4.5,128,toner,Isntree,Hyaluronic Acid Toner Plus Mini,-21.910349,38.223385
474,18.4,Isntree - Clear Skin BHA Toner,"Purified Water, Glycerin, Butylene Glycol, 1,2...",4.5,180,toner,Isntree,Clear Skin BHA Toner,-20.347439,5.189004
489,5.0,Isntree - Hyaluronic Acid Toner Mini,"Sodium Hyaluronate, Water, Glycerin, Butylene ...",4.5,51,toner,Isntree,Hyaluronic Acid Toner Mini,-23.018829,-9.999369


#### Using Cosine Similarity to Recommend Products

Now let's start building the recommender. I use **cosine similarity** as the similarity metric, which measures the cosine of the angle between two vectors in a multidimensional space. It's widely used in recommendation systems, especially with high-dimensional data, because it is invariant to the magnitude of the vectors. The closer the cosine similarity is to 1, the more similar the items are; the further below 1 it falls, the less similar they are. In the function below, it is computed on each product's 2-D t-SNE coordinates (X, Y).
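
As a small worked example of the metric, the sketch below implements $\cos\theta = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$ in plain Python and evaluates it on a few hand-picked vectors. The helper name cosine_sim and the vectors are illustrative assumptions, not part of the notebook.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different magnitude
same_direction = cosine_sim([1.0, 2.0], [10.0, 20.0])
# Perpendicular vectors
orthogonal = cosine_sim([1.0, 0.0], [0.0, 1.0])
# Opposite directions
opposite = cosine_sim([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

The first pair illustrates the magnitude invariance: scaling a vector by 10 leaves the score at (approximately) 1, while perpendicular and opposite vectors score 0 and -1.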

**Note**: For the function below, it's important to spell the product name *exactly* as it appears in the dataframe. The lookup is a case-insensitive substring match on the name column, so a misspelled name returns no result. I tried fuzzy matching to bypass this restriction, but after thorough testing I found that, while convenient, it was not completely accurate.
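
For readers curious what a fuzzy lookup could look like, here is a minimal sketch. It deliberately swaps in the standard library's difflib.get_close_matches instead of fuzzywuzzy (which the notebook imports), so it runs without extra packages; the helper closest_product and the 0.6 cutoff are assumptions for illustration, not the approach I tested.

```python
from difflib import get_close_matches

# A few product names taken from the tables in this notebook
product_names = [
    "Matcha Hydrating Foam Cleanser",
    "Green Tea Fresh Cleanser",
    "Heartleaf Acne Facial Cleanser",
    "Clear Skin BHA Toner",
]

def closest_product(query, names, cutoff=0.6):
    """Return the best fuzzy match for query, or None if nothing clears the cutoff."""
    # Compare in lowercase, but return the name with its original casing
    lowered = {name.lower(): name for name in names}
    matches = get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None

print(closest_product("macha hydrating foam cleansr", product_names))
print(closest_product("xyz", product_names))
```

A lower cutoff tolerates more typos but also makes wrong matches more likely, which is the kind of inaccuracy mentioned in the note.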

In [291]:
def yesstyle_recommender(product, label, df):

    #filter df by label
    filtered_df = df[df['label'] == label].reset_index(drop=True)

    #find the product by name (case-insensitive substring match;
    #regex=False so characters like '.' or '(' in a name are taken literally)
    myItem = filtered_df[filtered_df['name'].str.contains(product, case=False, regex=False)]

    if myItem.empty:
        print("Product not found.")
        return None

    # extract tsne values for the target item (first match)
    X = myItem.iloc[0]['X']
    Y = myItem.iloc[0]['Y']

    # 1-D vector so that np.dot below returns a scalar
    point_1 = np.array([X, Y])

    # note: despite its name, 'dist' stores the cosine similarity
    filtered_df['dist'] = 0.0


    # iterate through df and calculate cos sim with the target item
    for i in range(len(filtered_df)):
        point_2 = np.array([filtered_df.at[i, 'X'], filtered_df.at[i, 'Y']])
        filtered_df.at[i, 'dist'] = np.dot(point_1, point_2) / (norm(point_1) * norm(point_2))


    # sort by similarity, most similar first
    filtered_df = filtered_df.sort_values('dist', ascending=False).reset_index(drop=True)

    # the target item itself scores 1.0, so 11 rows = target + 10 recommendations
    top_10_recommendations = filtered_df[['product','brand','price','dist']].iloc[:11]

    return top_10_recommendations
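
The row-by-row loop in yesstyle_recommender can also be written as a single vectorized computation. The sketch below is one way to do that with NumPy; the helper cosine_to_target and the sample coordinates are hypothetical and only stand in for the X and Y columns.

```python
import numpy as np

def cosine_to_target(points, target):
    """Cosine similarity between each row of `points` and the vector `target`."""
    points = np.asarray(points, dtype=float)
    target = np.asarray(target, dtype=float)
    # One dot product per row
    dots = points @ target
    # Product of the row norms and the target norm
    norms = np.linalg.norm(points, axis=1) * np.linalg.norm(target)
    return dots / norms

# Hypothetical 2-D coordinates standing in for the X and Y columns
coords = np.array([[3.0, 4.0], [6.0, 8.0], [-4.0, 3.0]])
sims = cosine_to_target(coords, coords[0])
print(sims)
```

Inside the function, this would presumably amount to passing filtered_df[['X', 'Y']].to_numpy() as points and the target's coordinates as target. A point lying exactly at the origin has zero norm and would produce nan, in the loop version as well.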


Let's try this using an example product, the **Matcha Hydrating Foam Cleanser** from B.LAB. 

In [292]:
# example 
yesstyle_recommender('Matcha Hydrating Foam Cleanser','cleanser',yesstyle)

Unnamed: 0,product,brand,price,dist
0,Matcha Hydrating Foam Cleanser,B.LAB,10.08,1.0
1,No. 3 All Green pH Balancing Cleanser,numbuzin,13.36,0.999998
2,ACSEN Oil Cut Cleansing 120ml,TROIAREUKE,40.6,0.999978
3,Gentle Cleansing Foam Mini,Sulwhasoo,10.8,0.999863
4,Gentle Black Facial Cleanser,"Dear, Klairs",18.8,0.999542
5,Cleansing Gel Be Clean Be Moist,Huxley,20.0,0.999474
6,Apple Seed Lip & Eye Remover 100ml,innisfree,12.9,0.999454
7,Heartleaf Acne Facial Cleanser,Anua,16.7,0.999333
8,Collagen Bubble Cleanser,VILLAGE 11 FACTORY,12.08,0.999246
9,Cleansing Water Be Clean Be Moist 200ml,Huxley,18.88,0.999245


Now let's see how similar the ingredient formulations are between the **top** recommendation, **No. 3 All Green pH Balancing Cleanser** from numbuzin and the target product. The cosine similarity metric is 0.999998, which means the 2 products are really similar.

<img src="matcha hydrating and no3.png" alt="Drawing" style="width: 500px;"/> <br>


<img src="matcha hydrating and no3 ingredients.png" alt="Drawing" style="width: 500px;"/> <br><br>

As you can see, these 2 products share a whopping **20 ingredients**:

- $\frac{20}{31}$ or 64.52% of the ingredients found in the **Matcha Hydrating Foam Cleanser** are also found in the **No. 3 All Green pH Balancing Cleanser**

- $\frac{20}{27}$ or 74.07% of the ingredients found in the **No. 3 All Green pH Balancing Cleanser** are also found in the **Matcha Hydrating Foam Cleanser**.

<br>

Since the **Matcha Hydrating Foam Cleanser** is about $3 USD cheaper than the **No. 3 All Green pH Balancing Cleanser**, I would go as far as to say that the former is a dupe for the latter!

Special thanks to INCIDecoder, a website dedicated to analyzing and explaining product ingredient lists!

Let's examine another recommendation further down the list, the **Heartleaf Acne Facial Cleanser** from Anua, which is #7 on the list. Its cosine similarity is 0.999333, so it is still rated very similar to the target product, though it should be less similar than the **No. 3 All Green pH Balancing Cleanser** from numbuzin. <br> <br>

<img src="matcha hydrating and anua.png" alt="Drawing" style="width: 500px;"/> <br>

<img src="matcha hydrating and anua ingredients.png" alt="Drawing" style="width: 500px;"/> <br>

As you can see, these 2 products share **15 ingredients**:

- $\frac{15}{31}$ or 48.39% of the ingredients found in the **Matcha Hydrating Foam Cleanser** are also found in the **Heartleaf Acne Facial Cleanser**

- $\frac{15}{34}$ or 44.12% of the ingredients found in the **Heartleaf Acne Facial Cleanser** are also found in the **Matcha Hydrating Foam Cleanser**.

<br>

As expected, these 2 products share a lot of similar ingredients, but not as many as the previous product! This is still a very valid recommendation! <br><br>
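
The counts above come from INCIDecoder, but the same kind of overlap figures can be computed directly from two ingredient strings. Here is a minimal sketch using set intersection; the function name ingredient_overlap and the two short ingredient lists are made up for illustration.

```python
def ingredient_overlap(ingredients_a, ingredients_b):
    """Return (shared count, fraction of A found in B, fraction of B found in A)."""
    # Same tokenization as the notebook: lowercase, then split on ', '
    set_a = set(ingredients_a.lower().split(', '))
    set_b = set(ingredients_b.lower().split(', '))
    shared = set_a & set_b
    return len(shared), len(shared) / len(set_a), len(shared) / len(set_b)

# Hypothetical short ingredient lists, for illustration only
a = "Water, Glycerin, Camellia Sinensis Leaf Extract, Citric Acid"
b = "Water, Glycerin, Houttuynia Cordata Extract, Citric Acid, 1,2-Hexanediol"

shared, frac_a, frac_b = ingredient_overlap(a, b)
print(f"{shared} shared: {frac_a:.2%} of A, {frac_b:.2%} of B")
```

Because the two fractions use different denominators, the overlap is not symmetric, which is why each comparison above reports two percentages.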

**Observation**: When comparing these lists, I noticed something important. INCIDecoder treats "Fig Extract" and "Ficus Carica (Fig) Fruit Extract" as the same ingredient, but with my data preprocessing approach they end up as 2 different ingredients, "fig extract" and "ficus carica fruit extract", since I removed anything enclosed in parentheses and the two strings don't match character for character. <br>

While I suspect there are other cases like this that add to the overall complexity of the dataset by inflating the dimensionality of the sparse matrix, these are simply edge cases, and I believe there's not much that can be done to control for this problem.
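
For the handful of synonyms that are already known, one partial mitigation would be a small hand-curated alias table applied during tokenization. The sketch below is only an illustration under that assumption: the ALIASES dictionary and normalize_ingredient helper are hypothetical, and the table would have to be maintained by hand.

```python
import re

# Hypothetical, hand-curated alias table mapping variant names to one canonical name
ALIASES = {
    "fig extract": "ficus carica fruit extract",
}

def normalize_ingredient(raw):
    """Lowercase, drop parenthesized text, collapse spaces, then apply aliases."""
    # Remove anything enclosed in parentheses, e.g. "(Fig)"
    name = re.sub(r"\([^)]*\)", " ", raw.lower())
    # Collapse the leftover runs of whitespace into single spaces
    name = " ".join(name.split())
    return ALIASES.get(name, name)

print(normalize_ingredient("Ficus Carica (Fig) Fruit Extract"))
print(normalize_ingredient("Fig Extract"))
```

Both inputs map to the same token here, so they would share one column in the matrix. Synonyms that are not in the table would still be counted as separate ingredients.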

