<a href="https://colab.research.google.com/github/jjone36/Cosmetic/blob/master/cosmtic_map.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cosmetic Recommendation based on Chemical Composition

This is the project for mapping cosmetic items based on similarities of chemical composition and giving content-based  recommendation. The dataset was prepared in advance and the details from [scraping](https://github.com/jjone36/Cosmetic/blob/master/cosmetic_1_scraping.py) to [modeling](https://github.com/jjone36/Cosmetic/blob/master/cosmetic_2_ML.py) can be found through the link. 



After preprocessing, the ingredients are tokenzied just like tokens from Natural Language Preprocessing. With the truncated SVD, I reduced the dimensionality, which will also be possible to get the contextual similarities of tokens. This concept is applied to get the similarities of cosmetic items and visualization with an interactive bokeh application. It's possible to use this plot as a map of cosmetic items and a recommendation for new items with similiar property. This notebook starts from visualizing the interactive bokeh plot, which also can offer the options to choose.  

## 1. Importing the necessary libraries and the dataset

In [0]:
# import libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

from bokeh.io import show, curdoc, output_notebook, push_notebook
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, HoverTool, Select, Paragraph, TextInput
from bokeh.layouts import widgetbox, column, row
from ipywidgets import interact 

In [2]:
df = pd.read_csv('cosmetic_svd.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4480 entries, 0 to 4479
Data columns (total 7 columns):
Label    4480 non-null object
brand    4480 non-null object
name     4480 non-null object
price    4480 non-null int64
rank     4480 non-null float64
SVD1     4480 non-null float64
SVD2     4480 non-null float64
dtypes: float64(3), int64(1), object(3)
memory usage: 245.1+ KB


Unnamed: 0,Label,brand,name,price,rank,SVD1,SVD2
0,Moisturizer_Combination,LA MER,Crème de la Mer,175,4.1,1.721741,2.937995
1,Moisturizer_Combination,SK-II,Facial Treatment Essence,179,4.1,0.566024,0.069162
2,Moisturizer_Combination,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,2.605622,-1.06302
3,Moisturizer_Combination,LA MER,The Moisturizing Soft Cream,175,3.8,3.931477,4.717803
4,Moisturizer_Combination,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,3.950338,-4.08327


All the steps until the decomposition is done already and I combine all data into one with all possible combination. `brand`,  `name`, `price` and `rank` is the data of each item scraped from [Sephora](https://www.sephora.com). 

In [3]:
# cosmetic filtering options
option_1 = ['Moisturizer', 'Cleanser', 'Treatment', 'Face Mask', 'Eye cream', 'Sun protect']
option_2 = ['Combination', 'Dry', 'Normal', 'Oily', 'Sensitive']

print(option_1)
print(option_2)

# the 30 different combinations of options
df.Label.unique()

['Moisturizer', 'Cleanser', 'Treatment', 'Face Mask', 'Eye cream', 'Sun protect']
['Combination', 'Dry', 'Normal', 'Oily', 'Sensitive']


array(['Moisturizer_Combination', 'Moisturizer_Dry', 'Moisturizer_Normal',
       'Moisturizer_Oily', 'Moisturizer_Sensitive',
       'Cleanser_Combination', 'Cleanser_Dry', 'Cleanser_Normal',
       'Cleanser_Oily', 'Cleanser_Sensitive', 'Treatment_Combination',
       'Treatment_Dry', 'Treatment_Normal', 'Treatment_Oily',
       'Treatment_Sensitive', 'Face Mask_Combination', 'Face Mask_Dry',
       'Face Mask_Normal', 'Face Mask_Oily', 'Face Mask_Sensitive',
       'Eye cream_Combination', 'Eye cream_Dry', 'Eye cream_Normal',
       'Eye cream_Oily', 'Eye cream_Sensitive', 'Sun protect_Combination',
       'Sun protect_Dry', 'Sun protect_Normal', 'Sun protect_Oily',
       'Sun protect_Sensitive'], dtype=object)

There are 6 different categories of items and 5 skin tpye options. So `Label` column has all possible 30 combinations as above. To make a selecting option and filtering application on them, I calculated the similarities separately. Users can choice each one from option_1 and option_2 and get the filtered plot accordingly.

## 2. Mapping with Bokeh

In [4]:
output_notebook()

To work with Bokeh server on jupyter notebook, made a connection first.

In [0]:
# make a source and scatter bokeh plot  
source = ColumnDataSource(df)
plot = figure(x_axis_label = 'SVD1', y_axis_label = 'SVD2', width = 500, height = 400)
plot.circle(x = 'SVD1', y = 'SVD2', source = source, size = 8, color = 'Salmon', alpha = .4)

plot.background_fill_color = "beige"
plot.background_fill_alpha = 0.2

# add hover tool
hover = HoverTool(tooltips = [
        ('Item', '@name'),
        ('brand', '@brand'),
        ('Price', '$ @price'),
        ('Rank', '@rank')])
plot.add_tools(hover)

The x-axis and y-axis will be the value of SVD, getting the similarity resulting map. The hover tool is also added, showing the name of item with the name of its brand, the price and the rank from Sephora.  

In [0]:
## defining the callback
def update(Category = option_1[0], Skin_type = option_2[0]):
    a_b = Category + '_' + Skin_type
    new_data = {
        'SVD1' : df[df['Label'] == a_b]['SVD1'],
        'SVD2' : df[df['Label'] == a_b]['SVD2'],
        'name' : df[df['Label'] == a_b]['name'],
        'brand' : df[df['Label'] == a_b]['brand'],
        'price' : df[df['Label'] == a_b]['price'],
        'rank' : df[df['Label'] == a_b]['rank'],
    }
    source.data = new_data
    push_notebook()

This callback function will make it able to filter the items according to users' choice. When the option is given, it will filter and update the source.

In [0]:
# interact the plot with callback 
output_notebook()

interact(update, Category = option_1, Skin_type = option_2)
t1 = Paragraph(text = '> Zoom in the plot using the second button on the right', width = 400)
t2 = Paragraph(text = '> Please press the reset button when you make a change the option. It\'s the third botton from the buttom.', width = 300)

show(row(plot, widgetbox(t1, t2)), notebook_handle = True)

interactive(children=(Dropdown(description='Category', options=('Moisturizer', 'Cleanser', 'Treatment', 'Face …

This is the resulting map for cosmetics. Each point on the plot is each cusmetic item. You can choose the options depending on the item category and your skin type. For example, if you choose 'Clenaser' for catergoy and 'Oily' for your skin type, the plot is updated with the cleanser items for Oily skin customers. You can comprehend the distance between the points as a similarities between the items. The longer the distance is, the more different components the items have. Therefore if you want to find one that is similiar with what you've used, you can look up for the ones around that item. 

# 3. Cosine similarity

Now each item is plotted on the plane we can simply calculate the cosine similarities between each point. I took [Peat Miracle Revital Cream](https://www.sephora.com/product/peat-miracle-revital-cream-P412440) from Belif as an example. 

In [11]:
df_2 = df[df.Label == 'Moisturizer_Dry'].reset_index().drop('index', axis = 1)
myItem = df_2[df_2.name.str.contains('Peat Miracle Revital')]
myItem

Unnamed: 0,Label,brand,name,price,rank,SVD1,SVD2
87,Moisturizer_Dry,BELIF,Peat Miracle Revital Cream,58,4.7,2.52592,-0.017333


In [12]:
# initiate the column
df_2['dist'] = 0.0
df_2.head()

Unnamed: 0,Label,brand,name,price,rank,SVD1,SVD2,dist
0,Moisturizer_Dry,LA MER,Crème de la Mer,175,4.1,1.764602,2.845905,0.0
1,Moisturizer_Dry,SK-II,Facial Treatment Essence,179,4.1,0.567817,0.032575,0.0
2,Moisturizer_Dry,DRUNK ELEPHANT,Protini™ Polypeptide Cream,68,4.4,2.586603,-1.131287,0.0
3,Moisturizer_Dry,LA MER,The Moisturizing Soft Cream,175,3.8,3.934386,4.586911,0.0
4,Moisturizer_Dry,IT COSMETICS,Your Skin But Better™ CC+™ Cream with SPF 50+,38,4.1,3.929549,-3.999283,0.0


In [17]:
# getting the array for myItem
P1 = np.array([myItem.SVD1.values, myItem.SVD2.values]).reshape(1, -1)
P1

array([[ 2.52591969, -0.01733327]])

In [18]:
# cosine similarities with other items
for i in range(len(df_2)):
    P2 = np.array([df_2['SVD1'][i], df_2['SVD2'][i]]).reshape(-1, 1)
    df_2.dist[i] = (P1 * P2).sum() / (np.sqrt(np.sum(P1))*np.sqrt(np.sum(P2)))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
  This is separate from the ipykernel package so we can avoid doing imports until


If we sort the result in ascending order, we can see the top 5 closest cosmetic items like below.

In [16]:
df_2 = df_2.sort_values('dist')
df_2[['name', 'brand', 'dist']].head()

Unnamed: 0,name,brand,dist
95,Midnight Secret Late Night Recovery Treatment,GUERLAIN,4.337068e-08
171,Water Drop Hydrating Moisturizer,DR. JART+,2.001145e-07
114,Coconut Melt,KOPARI,2.168347e-07
109,Black Tea Age-Delay Cream,FRESH,3.303227e-07
108,Abeille Royale Youth Watery Oil,GUERLAIN,3.422469e-07


These are the top 5 cosmetics that have similar properties with myItem. With this list, we can produce a recommendation for new products. If we sort them in descending way, then the list could be used as *'the worst choice for you'*.