This is a simple cosmetic product recommender.

It mixes content-based and collaborative filtering, depending on how many items the user inputs.

Dataset from https://www.kaggle.com/datasets/nadyinky/sephora-products-and-skincare-reviews.

I took inspiration from the following GitHub repositories:
 - https://github.com/Rubina-Bansal/Sephora-Recommendation-System/tree/main
 - https://github.com/Techie5879/moviepp/tree/main

In [1]:
import numpy as np
import pandas as pd

**Product Dataset Handling**

In [2]:
products_df = pd.read_csv('product_info.csv')
products_df.head()

Unnamed: 0,product_id,product_name,brand_id,brand_name,loves_count,rating,reviews,size,variation_type,variation_value,...,online_only,out_of_stock,sephora_exclusive,highlights,primary_category,secondary_category,tertiary_category,child_count,child_max_price,child_min_price
0,P473671,Fragrance Discovery Set,6342,19-69,6320,3.6364,11.0,,,,...,1,0,0,"['Unisex/ Genderless Scent', 'Warm &Spicy Scen...",Fragrance,Value & Gift Sets,Perfume Gift Sets,0,,
1,P473668,La Habana Eau de Parfum,6342,19-69,3827,4.1538,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,85.0,30.0
2,P473662,Rainbow Bar Eau de Parfum,6342,19-69,3253,4.25,16.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
3,P473660,Kasbah Eau de Parfum,6342,19-69,3018,4.4762,21.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0
4,P473658,Purple Haze Eau de Parfum,6342,19-69,2691,3.2308,13.0,3.4 oz/ 100 mL,Size + Concentration + Formulation,3.4 oz/ 100 mL,...,1,0,0,"['Unisex/ Genderless Scent', 'Layerable Scent'...",Fragrance,Women,Perfume,2,75.0,30.0


The same category can sit at different levels for different products (a primary category for product A may be a secondary category for product B), so for consistency I combine the three category columns into a single list per product.

In [3]:
primary = products_df['primary_category'].unique().tolist()
secondary = products_df['secondary_category'].unique().tolist()
tertiary = products_df['tertiary_category'].unique().tolist()

categories = set()
categories.update(primary)
categories.update(secondary)
categories.update(tertiary)

In [4]:
products_df['all_categories'] = products_df[['primary_category', 'secondary_category', 'tertiary_category']].apply(
    lambda x : list(x.dropna()), axis=1)
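
To make the effect of the row-wise `apply` above concrete, here is a minimal sketch on a made-up two-row frame. The rows are invented for illustration; only the column names come from the dataset. Missing levels are dropped, so the resulting lists can differ in length.

```python
import pandas as pd

# Toy data: the second product only has a primary category.
toy = pd.DataFrame({
    'primary_category': ['Fragrance', 'Skincare'],
    'secondary_category': ['Women', None],
    'tertiary_category': ['Perfume', None],
})

cols = ['primary_category', 'secondary_category', 'tertiary_category']

# For each row, keep the non-null category levels as a plain Python list.
toy['all_categories'] = toy[cols].apply(lambda row: list(row.dropna()), axis=1)

print(toy['all_categories'].tolist())
```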

The `highlights` column stores each list as a single string, so this step parses it into a list of strings, like `all_categories`.

In [5]:
products_df['highlights'][0]

"['Unisex/ Genderless Scent', 'Warm &Spicy Scent', 'Woody & Earthy Scent', 'Fresh Scent']"

In [6]:
import ast

products_df['highlights'] = products_df['highlights'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)

In [7]:
products_df['highlights'][0]

['Unisex/ Genderless Scent',
 'Warm &Spicy Scent',
 'Woody & Earthy Scent',
 'Fresh Scent']
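
The parsing step can also be wrapped in a small helper. The sketch below is one possible variant, not the notebook's code: `parse_list_cell` is a hypothetical name, and returning `None` for malformed strings is an assumption about how such cells should be handled.

```python
import ast

def parse_list_cell(cell):
    """Parse a stringified Python list; pass non-strings (e.g. NaN) through."""
    if not isinstance(cell, str):
        return cell
    try:
        # literal_eval only evaluates Python literals, so it is safer than eval().
        return ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # Assumption: treat cells that are not valid literals as missing.
        return None

raw = "['Unisex/ Genderless Scent', 'Warm &Spicy Scent']"
parsed = parse_list_cell(raw)
print(parsed)
```

With pandas this would be applied as `products_df['highlights'].apply(parse_list_cell)`.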

Same thing for `ingredients`.

In [8]:
products_df['ingredients'] = products_df['ingredients'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)



This dataset has a lot of features. I do think some of them are useful, but I wanted a simpler model, so I drop most of them.

In [9]:
products_df.drop(columns=['brand_id', 'primary_category', 'secondary_category', 
                          'tertiary_category', 'child_count', 'child_max_price',
                          'child_min_price', 'value_price_usd', 'sale_price_usd',
                          'rating', 'reviews', 'variation_type', 'size',
                          'variation_value', 'variation_desc', 'loves_count',
                          'limited_edition', 'new', 'online_only', 'out_of_stock',
                          'sephora_exclusive'],
                 inplace=True)

In [10]:
products_df.head()

Unnamed: 0,product_id,product_name,brand_name,ingredients,price_usd,highlights,all_categories
0,P473671,Fragrance Discovery Set,19-69,"[Capri Eau de Parfum:, Alcohol Denat. (SD Alco...",35.0,"[Unisex/ Genderless Scent, Warm &Spicy Scent, ...","[Fragrance, Value & Gift Sets, Perfume Gift Sets]"
1,P473668,La Habana Eau de Parfum,19-69,"[Alcohol Denat. (SD Alcohol 39C), Parfum (Frag...",195.0,"[Unisex/ Genderless Scent, Layerable Scent, Wa...","[Fragrance, Women, Perfume]"
2,P473662,Rainbow Bar Eau de Parfum,19-69,"[Alcohol Denat. (SD Alcohol 39C), Parfum (Frag...",195.0,"[Unisex/ Genderless Scent, Layerable Scent, Wo...","[Fragrance, Women, Perfume]"
3,P473660,Kasbah Eau de Parfum,19-69,"[Alcohol Denat. (SD Alcohol 39C), Parfum (Frag...",195.0,"[Unisex/ Genderless Scent, Layerable Scent, Wa...","[Fragrance, Women, Perfume]"
4,P473658,Purple Haze Eau de Parfum,19-69,"[Alcohol Denat. (SD Alcohol 39C), Parfum (Frag...",195.0,"[Unisex/ Genderless Scent, Layerable Scent, Wo...","[Fragrance, Women, Perfume]"


Check for null values, then drop the rows that contain them.

In [11]:
products_df.isnull().sum()

product_id           0
product_name         0
brand_name           0
ingredients        945
price_usd            0
highlights        2207
all_categories       0
dtype: int64

In [12]:
products_df.dropna(inplace=True)

In [13]:
products_df.shape

(5820, 7)

Some products appear more than once because they are sold in different sizes.

I'll use (product name + brand name) as the key and average the price across sizes, so I can easily merge with `reviews_df` later.

In [14]:
products_df.drop(columns=['product_id'], inplace=True)

In [15]:
aggregated_products = products_df.groupby(['product_name', 'brand_name']).agg({
    'price_usd': 'mean',
    'ingredients': 'first',
    'highlights': 'first',
    'all_categories': 'first',
}).reset_index()

aggregated_products['product_plus_brand'] = aggregated_products['product_name'] + ' ' + aggregated_products['brand_name']

aggregated_products.head()

Unnamed: 0,product_name,brand_name,price_usd,ingredients,highlights,all_categories,product_plus_brand
0,"""B"" Oil",The Ordinary,11.1,"[Caprylic/Capric Triglyceride, Squalane, Cramb...","[Vegan, Good for: Dullness/Uneven Texture, Goo...","[Skincare, Moisturizers, Face Oils]","""B"" Oil The Ordinary"
1,"""Buffet"" + Copper Peptides 1%",The Ordinary,30.9,"[Water, Glycerin, Lactococcus Ferment Lysate, ...","[Vegan, Oil Free, Without Silicones, Alcohol F...","[Skincare, Treatments, Face Serums]","""Buffet"" + Copper Peptides 1% The Ordinary"
2,"""The Martini"" Emotional Detox Bath Soak",goop,40.0,"[Sodium Chloride, Magnesium Sulfate, Passiflor...","[Vegan, Clean at Sephora]","[Bath & Body, Bath & Shower, Bath Soaks & Bubb...","""The Martini"" Emotional Detox Bath Soak goop"
3,#BombBrows Full ’n Fluffy Volumizing Fiber Gel,HUDA BEAUTY,19.0,"[Water/Aqua/Eau, Synthetic Fluorphlogopite, Ny...","[Long-wearing, Fragrance Free, Cruelty-Free, V...","[Makeup, Eye, Eyebrow]",#BombBrows Full ’n Fluffy Volumizing Fiber Gel...
4,#BombBrows Microshade Brow Pencil,HUDA BEAUTY,17.0,"[Stearic Acid, Hydrogenated Castor Oil, Rhus S...","[Waterproof, Long-wearing, Vegan]","[Makeup, Eye, Eyebrow]",#BombBrows Microshade Brow Pencil HUDA BEAUTY


In [16]:
aggregated_products.shape

(5804, 7)
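
The deduplication step can be illustrated on a toy frame. The product and brand names below are invented; the sketch only shows how `groupby(...).agg(...)` collapses size variants into one row, with the price averaged and the first list kept.

```python
import pandas as pd

# Toy data: "Serum X" is listed twice (two sizes at different prices).
toy = pd.DataFrame({
    'product_name': ['Serum X', 'Serum X', 'Balm Y'],
    'brand_name': ['BrandA', 'BrandA', 'BrandB'],
    'price_usd': [10.0, 30.0, 12.0],
    'highlights': [['Vegan'], ['Vegan'], ['Fragrance Free']],
})

# One row per (product, brand): mean price, first highlights list.
agg = toy.groupby(['product_name', 'brand_name']).agg({
    'price_usd': 'mean',
    'highlights': 'first',
}).reset_index()

# Same key construction as in the notebook.
agg['product_plus_brand'] = agg['product_name'] + ' ' + agg['brand_name']

print(agg)
```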

**Reviews Dataset Handling**

In [17]:
reviews1 = pd.read_csv("reviews_0-250.csv", index_col=0)
reviews2 = pd.read_csv("reviews_250-500.csv", index_col=0)
reviews3 = pd.read_csv("reviews_500-750.csv", index_col=0)
reviews4 = pd.read_csv("reviews_750-1250.csv", index_col=0)
reviews5 = pd.read_csv("reviews_1250-end.csv", index_col=0)

reviews_df = pd.concat([reviews1, reviews2, reviews3, reviews4, reviews5], ignore_index=True)



In [18]:
reviews_df.head()

Unnamed: 0,author_id,rating,is_recommended,helpfulness,total_feedback_count,total_neg_feedback_count,total_pos_feedback_count,submission_time,review_text,review_title,skin_tone,eye_color,skin_type,hair_color,product_id,product_name,brand_name,price_usd
0,1741593524,5,1.0,1.0,2,0,2,2023-02-01,I use this with the Nudestix “Citrus Clean Bal...,Taught me how to double cleanse!,,brown,dry,black,P504322,Gentle Hydra-Gel Face Cleanser,NUDESTIX,19.0
1,31423088263,1,0.0,,0,0,0,2023-03-21,I bought this lip mask after reading the revie...,Disappointed,,,,,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0
2,5061282401,5,1.0,,0,0,0,2023-03-21,My review title says it all! I get so excited ...,New Favorite Routine,light,brown,dry,blonde,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0
3,6083038851,5,1.0,,0,0,0,2023-03-20,I’ve always loved this formula for a long time...,Can't go wrong with any of them,,brown,combination,black,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0
4,47056667835,5,1.0,,0,0,0,2023-03-20,"If you have dry cracked lips, this is a must h...",A must have !!!,light,hazel,combination,,P420652,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE,24.0


In [19]:
reviews_df['is_recommended'].value_counts()

is_recommended
1.0    778160
0.0    148263
Name: count, dtype: int64

There are too many features. Let's keep it simple.

In [20]:
reviews_df.drop(columns=['is_recommended', 'helpfulness', 'total_feedback_count', 
                         'total_neg_feedback_count', 'total_pos_feedback_count', 
                         'submission_time', 'review_text', 'review_title', 
                         'price_usd', 'skin_tone', 'eye_color', 'hair_color',
                         'product_id'], # don't need product_id since I'm merging by product + brand name
                inplace=True)

In [21]:
reviews_df.head()

Unnamed: 0,author_id,rating,skin_type,product_name,brand_name
0,1741593524,5,dry,Gentle Hydra-Gel Face Cleanser,NUDESTIX
1,31423088263,1,,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE
2,5061282401,5,dry,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE
3,6083038851,5,combination,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE
4,47056667835,5,combination,Lip Sleeping Mask Intense Hydration with Vitam...,LANEIGE


In [22]:
reviews_df.dropna(inplace=True)

In [23]:
reviews_df.shape

(982854, 5)

Now let's move on to the recommender itself.

**Content-based Filtering**

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [25]:
vectorizer = TfidfVectorizer()

Similarity based on product + brand name.

In [26]:
product_names = aggregated_products.product_plus_brand

In [27]:
tfidf_matrix_names = vectorizer.fit_transform(product_names)

In [28]:
similarity_matrix_names = cosine_similarity(tfidf_matrix_names, tfidf_matrix_names)

In [29]:
similarity_names_df = pd.DataFrame(similarity_matrix_names, index=product_names, columns=product_names)

In [30]:
similarity_names_df

product_plus_brand,"""B"" Oil The Ordinary","""Buffet"" + Copper Peptides 1% The Ordinary","""The Martini"" Emotional Detox Bath Soak goop",#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,#BombBrows Microshade Brow Pencil HUDA BEAUTY,#FauxFilter Luminous Matte Buildable Coverage Crease Proof Concealer HUDA BEAUTY,#FauxFilter Luminous Matte Foundation HUDA BEAUTY,#FauxFilter Skin Finish Buildable Coverage Foundation Stick HUDA BEAUTY,#Lipstories Set SEPHORA COLLECTION,#OribeObsessed Hair Set Oribe,...,Éclat Soleil Luminous Bronzer Gucci,Éternité De Beauté 24 Hour Full Coverage Luminous Matte Finish Foundation Gucci,’REPLICA’ Beach Walk Maison Margiela,’REPLICA’ Beach Walk Travel Spray Maison Margiela,’REPLICA’ By The Fireplace Travel Spray Maison Margiela,’REPLICA’ By the Fireplace Maison Margiela,’REPLICA’ Jazz Club Maison Margiela,’REPLICA’ Jazz Club Refill Set Maison Margiela,’REPLICA’ Jazz Club Travel Spray Maison Margiela,’REPLICA’ On a Date Scented Candle Maison Margiela
"""B"" Oil The Ordinary",1.000000,0.382940,0.088345,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.104734,0.112401,0.000000,0.000000,0.000000,0.000000
"""Buffet"" + Copper Peptides 1% The Ordinary",0.382940,1.000000,0.044858,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.053180,0.057073,0.000000,0.000000,0.000000,0.000000
"""The Martini"" Emotional Detox Bath Soak goop",0.088345,0.044858,1.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.042859,0.045997,0.000000,0.000000,0.000000,0.000000
#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,0.000000,0.000000,0.000000,1.000000,0.420875,0.132271,0.187807,0.150516,0.000000,0.000000,...,0.0,0.085184,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#BombBrows Microshade Brow Pencil HUDA BEAUTY,0.000000,0.000000,0.000000,0.420875,1.000000,0.148800,0.211275,0.169324,0.000000,0.000000,...,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
’REPLICA’ By the Fireplace Maison Margiela,0.112401,0.057073,0.045997,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.546968,0.512090,0.931785,1.000000,0.536839,0.485906,0.503750,0.464237
’REPLICA’ Jazz Club Maison Margiela,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.517161,0.484184,0.500219,0.536839,1.000000,0.905123,0.938363,0.438938
’REPLICA’ Jazz Club Refill Set Maison Margiela,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.076734,0.077129,...,0.0,0.000000,0.468094,0.438246,0.452760,0.485906,0.905123,1.000000,0.849334,0.397293
’REPLICA’ Jazz Club Travel Spray Maison Margiela,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.0,0.000000,0.485285,0.575794,0.594862,0.503750,0.938363,0.849334,1.000000,0.411884
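
A matrix like the one above can be queried for the nearest neighbours of a product. The following is a minimal sketch on three made-up keys; `most_similar` is a hypothetical helper, not something defined in the notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy product + brand keys, invented for illustration.
names = pd.Series([
    'Jazz Club Maison Margiela',
    'Jazz Club Travel Spray Maison Margiela',
    'Brow Pencil HUDA BEAUTY',
])

# TF-IDF over the words in each key, then pairwise cosine similarity.
tfidf = TfidfVectorizer().fit_transform(names)
sim = pd.DataFrame(cosine_similarity(tfidf), index=names, columns=names)

def most_similar(name, k=2):
    """Return the k most similar products to `name`, excluding itself."""
    return sim[name].drop(name).nlargest(k)

top = most_similar('Jazz Club Maison Margiela', k=1)
print(top)
```

Keys that share no words, such as the fragrance and the brow pencil here, get a similarity of exactly zero, which is why the real matrix is mostly zeros outside brand and product-line clusters.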


Similarity of prices.

In [31]:
# Work on a log scale so that similarity depends on price ratios, not absolute gaps.
prices = np.log(aggregated_products.price_usd)

In [32]:
price_diff_matrix = np.abs(prices.values[:, np.newaxis] - prices.values[np.newaxis, :])

In [33]:
similarity_matrix_prices = 1 / (1 + price_diff_matrix)

In [34]:
similarity_prices_df = pd.DataFrame(similarity_matrix_prices, index=product_names, columns=product_names)

In [35]:
similarity_prices_df

product_plus_brand,"""B"" Oil The Ordinary","""Buffet"" + Copper Peptides 1% The Ordinary","""The Martini"" Emotional Detox Bath Soak goop",#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,#BombBrows Microshade Brow Pencil HUDA BEAUTY,#FauxFilter Luminous Matte Buildable Coverage Crease Proof Concealer HUDA BEAUTY,#FauxFilter Luminous Matte Foundation HUDA BEAUTY,#FauxFilter Skin Finish Buildable Coverage Foundation Stick HUDA BEAUTY,#Lipstories Set SEPHORA COLLECTION,#OribeObsessed Hair Set Oribe,...,Éclat Soleil Luminous Bronzer Gucci,Éternité De Beauté 24 Hour Full Coverage Luminous Matte Finish Foundation Gucci,’REPLICA’ Beach Walk Maison Margiela,’REPLICA’ Beach Walk Travel Spray Maison Margiela,’REPLICA’ By The Fireplace Travel Spray Maison Margiela,’REPLICA’ By the Fireplace Maison Margiela,’REPLICA’ Jazz Club Maison Margiela,’REPLICA’ Jazz Club Refill Set Maison Margiela,’REPLICA’ Jazz Club Travel Spray Maison Margiela,’REPLICA’ On a Date Scented Candle Maison Margiela
"""B"" Oil The Ordinary",1.000000,0.494117,0.438225,0.650409,0.701130,0.493330,0.429051,0.443141,0.674115,0.387291,...,0.363380,0.353712,0.272611,0.465462,0.465462,0.272611,0.272611,0.268179,0.465462,0.351921
"""Buffet"" + Copper Peptides 1% The Ordinary",0.494117,1.000000,0.794835,0.672804,0.625961,0.996779,0.765162,0.811158,0.649189,0.641755,...,0.578661,0.554523,0.378155,0.889211,0.889211,0.378155,0.378155,0.369680,0.889211,0.550134
"""The Martini"" Emotional Detox Bath Soak goop",0.438225,0.794835,1.000000,0.573250,0.538890,0.796881,0.953480,0.975307,0.556017,0.769169,...,0.680270,0.647154,0.419060,0.882199,0.882199,0.419060,0.419060,0.408677,0.882199,0.641184
#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,0.650409,0.672804,0.573250,1.000000,0.899907,0.671344,0.557653,0.581692,0.948706,0.489106,...,0.451581,0.436745,0.319414,0.620767,0.620767,0.319414,0.319414,0.313346,0.620767,0.434017
#BombBrows Microshade Brow Pencil HUDA BEAUTY,0.701130,0.625961,0.538890,0.899907,1.000000,0.624698,0.525084,0.546344,0.945932,0.463871,...,0.429984,0.416512,0.308455,0.580675,0.580675,0.308455,0.308455,0.302793,0.580675,0.414030
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
’REPLICA’ By the Fireplace Maison Margiela,0.272611,0.378155,0.419060,0.319414,0.308455,0.378618,0.427807,0.414660,0.313991,0.479343,...,0.521841,0.543163,1.000000,0.396853,0.396853,1.000000,1.000000,0.942841,0.396853,0.547442
’REPLICA’ Jazz Club Maison Margiela,0.272611,0.378155,0.419060,0.319414,0.308455,0.378618,0.427807,0.414660,0.313991,0.479343,...,0.521841,0.543163,1.000000,0.396853,0.396853,1.000000,1.000000,0.942841,0.396853,0.547442
’REPLICA’ Jazz Club Refill Set Maison Margiela,0.268179,0.369680,0.408677,0.313346,0.302793,0.370122,0.416992,0.404492,0.308126,0.465806,...,0.505839,0.525848,0.942841,0.387529,0.387529,0.942841,0.942841,1.000000,0.387529,0.529857
’REPLICA’ Jazz Club Travel Spray Maison Margiela,0.465462,0.889211,0.882199,0.620767,0.580675,0.891774,0.845794,0.902353,0.600609,0.697527,...,0.623622,0.595678,0.396853,1.000000,1.000000,0.396853,0.396853,0.387529,1.000000,0.590616
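
The price similarity above is 1 / (1 + |ln p_i - ln p_j|). Because a difference of logs is the log of a ratio, two products score the same whenever their prices differ by the same factor. A small sketch with made-up prices shows this:

```python
import numpy as np

# Made-up prices in USD, for illustration only.
prices = np.array([10.0, 20.0, 50.0, 100.0])
log_prices = np.log(prices)

# Pairwise absolute log-price differences via broadcasting: shape (4, 4).
diff = np.abs(log_prices[:, np.newaxis] - log_prices[np.newaxis, :])

# Identical prices map to 1; the score decays toward 0 as the ratio grows.
similarity = 1.0 / (1.0 + diff)

# 10 vs 20 and 50 vs 100 are both a 2x ratio, so they get the same score.
print(np.round(similarity, 3))
```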


Similarity of ingredients.

In [36]:
product_ingredients = aggregated_products.ingredients.apply(lambda x: ' '.join(x))
tfidf_matrix_ingredients = vectorizer.fit_transform(product_ingredients)
similarity_matrix_ingredients = cosine_similarity(tfidf_matrix_ingredients)
similarity_ingredients_df = pd.DataFrame(similarity_matrix_ingredients, index=product_names, columns=product_names)

In [37]:
similarity_ingredients_df

product_plus_brand,"""B"" Oil The Ordinary","""Buffet"" + Copper Peptides 1% The Ordinary","""The Martini"" Emotional Detox Bath Soak goop",#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,#BombBrows Microshade Brow Pencil HUDA BEAUTY,#FauxFilter Luminous Matte Buildable Coverage Crease Proof Concealer HUDA BEAUTY,#FauxFilter Luminous Matte Foundation HUDA BEAUTY,#FauxFilter Skin Finish Buildable Coverage Foundation Stick HUDA BEAUTY,#Lipstories Set SEPHORA COLLECTION,#OribeObsessed Hair Set Oribe,...,Éclat Soleil Luminous Bronzer Gucci,Éternité De Beauté 24 Hour Full Coverage Luminous Matte Finish Foundation Gucci,’REPLICA’ Beach Walk Maison Margiela,’REPLICA’ Beach Walk Travel Spray Maison Margiela,’REPLICA’ By The Fireplace Travel Spray Maison Margiela,’REPLICA’ By the Fireplace Maison Margiela,’REPLICA’ Jazz Club Maison Margiela,’REPLICA’ Jazz Club Refill Set Maison Margiela,’REPLICA’ Jazz Club Travel Spray Maison Margiela,’REPLICA’ On a Date Scented Candle Maison Margiela
"""B"" Oil The Ordinary",1.000000,0.000000,0.297452,0.089060,0.099321,0.022464,0.018848,0.035572,0.022808,0.178766,...,0.081563,0.036845,0.000000,0.000000,0.009569,0.009569,0.020648,0.020648,0.020648,0.000000
"""Buffet"" + Copper Peptides 1% The Ordinary",0.000000,1.000000,0.015701,0.054147,0.015769,0.051605,0.039156,0.040747,0.011145,0.083082,...,0.043917,0.073651,0.004608,0.004608,0.006273,0.006273,0.001951,0.001951,0.001951,0.000000
"""The Martini"" Emotional Detox Bath Soak goop",0.297452,0.015701,1.000000,0.139479,0.083173,0.030750,0.015616,0.024853,0.021658,0.175297,...,0.049641,0.039436,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,0.089060,0.054147,0.139479,1.000000,0.376839,0.277409,0.271902,0.371208,0.228281,0.100080,...,0.271965,0.257044,0.079236,0.079236,0.124049,0.124049,0.129456,0.129456,0.129456,0.007246
#BombBrows Microshade Brow Pencil HUDA BEAUTY,0.099321,0.015769,0.083173,0.376839,1.000000,0.204694,0.148351,0.250044,0.173570,0.075141,...,0.245071,0.193892,0.037194,0.037194,0.067506,0.067506,0.072831,0.072831,0.072831,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
’REPLICA’ By the Fireplace Maison Margiela,0.009569,0.006273,0.000000,0.124049,0.067506,0.082066,0.091921,0.182340,0.188353,0.060098,...,0.136107,0.143950,0.307237,0.307237,1.000000,1.000000,0.576345,0.576345,0.576345,0.012538
’REPLICA’ Jazz Club Maison Margiela,0.020648,0.001951,0.000000,0.129456,0.072831,0.088539,0.096199,0.213168,0.219716,0.049336,...,0.157620,0.153685,0.299545,0.299545,0.576345,0.576345,1.000000,1.000000,1.000000,0.013527
’REPLICA’ Jazz Club Refill Set Maison Margiela,0.020648,0.001951,0.000000,0.129456,0.072831,0.088539,0.096199,0.213168,0.219716,0.049336,...,0.157620,0.153685,0.299545,0.299545,0.576345,0.576345,1.000000,1.000000,1.000000,0.013527
’REPLICA’ Jazz Club Travel Spray Maison Margiela,0.020648,0.001951,0.000000,0.129456,0.072831,0.088539,0.096199,0.213168,0.219716,0.049336,...,0.157620,0.153685,0.299545,0.299545,0.576345,0.576345,1.000000,1.000000,1.000000,0.013527
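
Joining the ingredients with spaces means the default `TfidfVectorizer` tokenizer splits multi-word ingredients into separate words. As a possible alternative, and not what this notebook does, each whole ingredient can be treated as one token by passing a callable `analyzer`. The sketch below uses invented ingredient lists, and `ingredient_tokens` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy ingredient lists, invented for illustration.
ingredient_lists = [
    ['Water', 'Glycerin', 'Squalane'],
    ['Water', 'Glycerin', 'Parfum'],
    ['Alcohol Denat.', 'Parfum'],
]

def ingredient_tokens(ingredients):
    """Treat each full ingredient name as a single normalized token."""
    return [item.strip().lower() for item in ingredients]

# A callable analyzer receives each raw document (here: a list) unchanged.
whole_vectorizer = TfidfVectorizer(analyzer=ingredient_tokens)
matrix = whole_vectorizer.fit_transform(ingredient_lists)
ingredient_sim = cosine_similarity(matrix)

print(sorted(whole_vectorizer.vocabulary_))
```

Whether this beats word-level tokens depends on how consistently ingredient names are spelled in the data, so it is worth comparing both before switching.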


Similarity of highlights.

In [38]:
product_highlights = aggregated_products.highlights.apply(lambda x: ' '.join(x))
tfidf_matrix_highlights = vectorizer.fit_transform(product_highlights)
similarity_matrix_highlights = cosine_similarity(tfidf_matrix_highlights)
similarity_highlights_df = pd.DataFrame(similarity_matrix_highlights, index=product_names, columns=product_names)

In [39]:
similarity_highlights_df

product_plus_brand,"""B"" Oil The Ordinary","""Buffet"" + Copper Peptides 1% The Ordinary","""The Martini"" Emotional Detox Bath Soak goop",#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,#BombBrows Microshade Brow Pencil HUDA BEAUTY,#FauxFilter Luminous Matte Buildable Coverage Crease Proof Concealer HUDA BEAUTY,#FauxFilter Luminous Matte Foundation HUDA BEAUTY,#FauxFilter Skin Finish Buildable Coverage Foundation Stick HUDA BEAUTY,#Lipstories Set SEPHORA COLLECTION,#OribeObsessed Hair Set Oribe,...,Éclat Soleil Luminous Bronzer Gucci,Éternité De Beauté 24 Hour Full Coverage Luminous Matte Finish Foundation Gucci,’REPLICA’ Beach Walk Maison Margiela,’REPLICA’ Beach Walk Travel Spray Maison Margiela,’REPLICA’ By The Fireplace Travel Spray Maison Margiela,’REPLICA’ By the Fireplace Maison Margiela,’REPLICA’ Jazz Club Maison Margiela,’REPLICA’ Jazz Club Refill Set Maison Margiela,’REPLICA’ Jazz Club Travel Spray Maison Margiela,’REPLICA’ On a Date Scented Candle Maison Margiela
"""B"" Oil The Ordinary",1.000000,0.328846,0.072871,0.132454,0.047538,0.027651,0.063588,0.087044,0.0,0.389589,...,0.029036,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
"""Buffet"" + Copper Peptides 1% The Ordinary",0.328846,1.000000,0.077356,0.332453,0.050464,0.029353,0.137393,0.218476,0.0,0.063622,...,0.030824,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
"""The Martini"" Emotional Detox Bath Soak goop",0.072871,0.077356,1.000000,0.110521,0.124820,0.072603,0.080527,0.072630,0.0,0.071683,...,0.076240,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,0.132454,0.332453,0.110521,1.000000,0.468423,0.272463,0.305225,0.426552,0.0,0.041406,...,0.286113,0.234908,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#BombBrows Microshade Brow Pencil HUDA BEAUTY,0.047538,0.050464,0.124820,0.468423,1.000000,0.581660,0.052533,0.047381,0.0,0.046764,...,0.323131,0.265301,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
’REPLICA’ By the Fireplace Maison Margiela,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.678456,0.678456,1.000000,1.000000,1.000000,1.000000,1.000000,0.406955
’REPLICA’ Jazz Club Maison Margiela,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.678456,0.678456,1.000000,1.000000,1.000000,1.000000,1.000000,0.406955
’REPLICA’ Jazz Club Refill Set Maison Margiela,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.678456,0.678456,1.000000,1.000000,1.000000,1.000000,1.000000,0.406955
’REPLICA’ Jazz Club Travel Spray Maison Margiela,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,...,0.000000,0.000000,0.678456,0.678456,1.000000,1.000000,1.000000,1.000000,1.000000,0.406955


Similarity of categories.

In [40]:
product_categories = aggregated_products.all_categories.apply(lambda x: ' '.join(x))
tfidf_matrix_categories = vectorizer.fit_transform(product_categories)
similarity_matrix_categories = cosine_similarity(tfidf_matrix_categories)
similarity_categories_df = pd.DataFrame(similarity_matrix_categories, index=product_names, columns=product_names)

In [41]:
similarity_categories_df

product_plus_brand,"""B"" Oil The Ordinary","""Buffet"" + Copper Peptides 1% The Ordinary","""The Martini"" Emotional Detox Bath Soak goop",#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,#BombBrows Microshade Brow Pencil HUDA BEAUTY,#FauxFilter Luminous Matte Buildable Coverage Crease Proof Concealer HUDA BEAUTY,#FauxFilter Luminous Matte Foundation HUDA BEAUTY,#FauxFilter Skin Finish Buildable Coverage Foundation Stick HUDA BEAUTY,#Lipstories Set SEPHORA COLLECTION,#OribeObsessed Hair Set Oribe,...,Éclat Soleil Luminous Bronzer Gucci,Éternité De Beauté 24 Hour Full Coverage Luminous Matte Finish Foundation Gucci,’REPLICA’ Beach Walk Maison Margiela,’REPLICA’ Beach Walk Travel Spray Maison Margiela,’REPLICA’ By The Fireplace Travel Spray Maison Margiela,’REPLICA’ By the Fireplace Maison Margiela,’REPLICA’ Jazz Club Maison Margiela,’REPLICA’ Jazz Club Refill Set Maison Margiela,’REPLICA’ Jazz Club Travel Spray Maison Margiela,’REPLICA’ On a Date Scented Candle Maison Margiela
"""B"" Oil The Ordinary",1.000000,0.291989,0.0,0.0,0.0,0.152143,0.165498,0.165498,0.000000,0.000000,...,0.000000,0.165498,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
"""Buffet"" + Copper Peptides 1% The Ordinary",0.291989,1.000000,0.0,0.0,0.0,0.172105,0.187211,0.187211,0.000000,0.000000,...,0.000000,0.187211,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
"""The Martini"" Emotional Detox Bath Soak goop",0.000000,0.000000,1.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,0.000000,0.000000,0.0,1.0,1.0,0.123302,0.134125,0.134125,0.090446,0.000000,...,0.100416,0.134125,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
#BombBrows Microshade Brow Pencil HUDA BEAUTY,0.000000,0.000000,0.0,1.0,1.0,0.123302,0.134125,0.134125,0.090446,0.000000,...,0.100416,0.134125,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
’REPLICA’ By the Fireplace Maison Margiela,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,1.000000,0.419876,0.419876,1.000000,1.000000,0.314043,0.419876,0.121424
’REPLICA’ Jazz Club Maison Margiela,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,1.000000,0.419876,0.419876,1.000000,1.000000,0.314043,0.419876,0.121424
’REPLICA’ Jazz Club Refill Set Maison Margiela,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.252300,0.821138,...,0.000000,0.000000,0.314043,0.088390,0.088390,0.314043,0.314043,1.000000,0.088390,0.057437
’REPLICA’ Jazz Club Travel Spray Maison Margiela,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.419876,1.000000,1.000000,0.419876,0.419876,0.088390,1.000000,0.084182


Combine similarity matrices.

Content-based filtering (CBF) is only a small part of the recommender, there to mitigate the cold-start problem, so I'll just pick arbitrary weights to keep it simple.

In [42]:
name_weight = 0.1
price_weight = 0.2
ingredients_weight = 0.3
highlights_weight = 0.2
categories_weight = 0.2

overall_similarity_df = (name_weight * similarity_names_df
                      + price_weight * similarity_prices_df
                      + ingredients_weight * similarity_ingredients_df
                      + highlights_weight * similarity_highlights_df
                      + categories_weight * similarity_categories_df)

In [43]:
overall_similarity_df

product_plus_brand,"""B"" Oil The Ordinary","""Buffet"" + Copper Peptides 1% The Ordinary","""The Martini"" Emotional Detox Bath Soak goop",#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,#BombBrows Microshade Brow Pencil HUDA BEAUTY,#FauxFilter Luminous Matte Buildable Coverage Crease Proof Concealer HUDA BEAUTY,#FauxFilter Luminous Matte Foundation HUDA BEAUTY,#FauxFilter Skin Finish Buildable Coverage Foundation Stick HUDA BEAUTY,#Lipstories Set SEPHORA COLLECTION,#OribeObsessed Hair Set Oribe,...,Éclat Soleil Luminous Bronzer Gucci,Éternité De Beauté 24 Hour Full Coverage Luminous Matte Finish Foundation Gucci,’REPLICA’ Beach Walk Maison Margiela,’REPLICA’ Beach Walk Travel Spray Maison Margiela,’REPLICA’ By The Fireplace Travel Spray Maison Margiela,’REPLICA’ By the Fireplace Maison Margiela,’REPLICA’ Jazz Club Maison Margiela,’REPLICA’ Jazz Club Refill Set Maison Margiela,’REPLICA’ Jazz Club Travel Spray Maison Margiela,’REPLICA’ On a Date Scented Candle Maison Margiela
"""B"" Oil The Ordinary",1.000000,0.261285,0.200289,0.183291,0.179530,0.141364,0.137282,0.149808,0.141665,0.209006,...,0.102952,0.114895,0.054522,0.093092,0.106437,0.068633,0.060717,0.059830,0.099287,0.070384
"""Buffet"" + Copper Peptides 1% The Ordinary",0.261285,1.000000,0.183634,0.217296,0.140016,0.255129,0.229700,0.255593,0.133181,0.166000,...,0.135072,0.170442,0.077013,0.179225,0.185042,0.083220,0.076216,0.074521,0.178427,0.110027
"""The Martini"" Emotional Detox Bath Soak goop",0.200289,0.183634,1.000000,0.178598,0.157694,0.183122,0.211486,0.217044,0.117701,0.220760,...,0.166194,0.141262,0.083812,0.176440,0.180726,0.088412,0.083812,0.081735,0.176440,0.128237
#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY,0.183291,0.217296,0.178598,1.000000,0.628805,0.309872,0.299752,0.354888,0.276315,0.136127,...,0.249212,0.246787,0.087654,0.147924,0.161368,0.101097,0.102720,0.101506,0.162990,0.088977
#BombBrows Microshade Brow Pencil HUDA BEAUTY,0.179530,0.140016,0.157694,0.628805,1.000000,0.342220,0.207981,0.237516,0.259347,0.124669,...,0.244228,0.221355,0.072849,0.127293,0.136387,0.081943,0.083540,0.082408,0.137984,0.082806
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
’REPLICA’ By the Fireplace Maison Margiela,0.068633,0.083220,0.088412,0.101097,0.081943,0.100343,0.113138,0.137634,0.119304,0.113898,...,0.145200,0.151818,0.682559,0.442417,0.756524,1.000000,0.826587,0.672871,0.586624,0.265349
’REPLICA’ Jazz Club Maison Margiela,0.060717,0.076216,0.083812,0.102720,0.083540,0.102285,0.114421,0.146883,0.128713,0.110669,...,0.151654,0.154738,0.677271,0.437319,0.586271,0.826587,1.000000,0.841889,0.757182,0.263116
’REPLICA’ Jazz Club Refill Set Maison Margiela,0.059830,0.074521,0.081735,0.101506,0.082408,0.100586,0.112258,0.144849,0.185673,0.279903,...,0.148454,0.151275,0.523741,0.364563,0.513363,0.672871,0.841889,1.000000,0.680117,0.242637
’REPLICA’ Jazz Club Travel Spray Maison Margiela,0.099287,0.178427,0.176440,0.162990,0.137984,0.204916,0.198018,0.244421,0.186037,0.154306,...,0.172011,0.165241,0.437429,0.683134,0.832390,0.586624,0.757182,0.680117,1.000000,0.261597


In [44]:
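# retrieve top 10 CBF recommendations

def cbf_rec(product):
    """Return the 10 products most similar to `product` by overall similarity."""
    row = overall_similarity_df.loc[product]
    # drop the product itself explicitly rather than assuming it sorts first
    others = row.drop(labels=product)
    top_10_similar = others.sort_values(ascending=False).head(10)
    return top_10_similar.index.tolist()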

In [45]:
# test function

cbf_rec('#BombBrows Full ’n Fluffy Volumizing Fiber Gel HUDA BEAUTY')

['#BombBrows Microshade Brow Pencil HUDA BEAUTY',
 'Laminated Look Brow Kit Anastasia Beverly Hills',
 'Brow Flick Microfine Detailing Eyebrow Pen Glossier',
 'Arch Brow Volumizing Fiber Gel Hourglass',
 'Brow 1980 Volumizing Eyebrow Pomade Gel MERIT',
 'Volumizing Fiber Brow Gel SEPHORA COLLECTION',
 'Mini Arch Brow Micro Sculpting Pencil Hourglass',
 'Brow Tint Eyebrow Gel REFY',
 "Fluff'n Brow Pencil - 3-in-1 Brow Pencil and Balm Velour Lashes",
 'Shape Up Soft Fill Eyebrow Pencil LAWLESS']

**Collaborative Filtering**

In [46]:
combined_df = pd.merge(aggregated_products, reviews_df, on=['product_name', 'brand_name'], how='inner')

In [47]:
combined_df['product_plus_brand'] = combined_df['product_name'] + ' ' + combined_df['brand_name']

In [48]:
combined_df.head()

Unnamed: 0,product_name,brand_name,price_usd,ingredients,highlights,all_categories,product_plus_brand,author_id,rating,skin_type
0,"""B"" Oil",The Ordinary,11.1,"[Caprylic/Capric Triglyceride, Squalane, Cramb...","[Vegan, Good for: Dullness/Uneven Texture, Goo...","[Skincare, Moisturizers, Face Oils]","""B"" Oil The Ordinary",27162108348,5,dry
1,"""B"" Oil",The Ordinary,11.1,"[Caprylic/Capric Triglyceride, Squalane, Cramb...","[Vegan, Good for: Dullness/Uneven Texture, Goo...","[Skincare, Moisturizers, Face Oils]","""B"" Oil The Ordinary",5049431408,5,combination
2,"""B"" Oil",The Ordinary,11.1,"[Caprylic/Capric Triglyceride, Squalane, Cramb...","[Vegan, Good for: Dullness/Uneven Texture, Goo...","[Skincare, Moisturizers, Face Oils]","""B"" Oil The Ordinary",11803640603,5,normal
3,"""B"" Oil",The Ordinary,11.1,"[Caprylic/Capric Triglyceride, Squalane, Cramb...","[Vegan, Good for: Dullness/Uneven Texture, Goo...","[Skincare, Moisturizers, Face Oils]","""B"" Oil The Ordinary",56672561,5,combination
4,"""B"" Oil",The Ordinary,11.1,"[Caprylic/Capric Triglyceride, Squalane, Cramb...","[Vegan, Good for: Dullness/Uneven Texture, Goo...","[Skincare, Moisturizers, Face Oils]","""B"" Oil The Ordinary",27025237093,1,combination


In [49]:
combined_df.shape

(876900, 10)

First, I'll use 5-fold cross-validation to choose between three matrix factorisation models: SVD, SVD++ and NMF.
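
For reference, the two error measures reported by `cross_validate` are RMSE and MAE over the held-out ratings. Here is a minimal sketch of what they compute, with invented ratings and predictions (not taken from the dataset):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average size of a miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Invented example: true ratings vs. model predictions on a 1-5 scale
actual = [5, 3, 1, 4]
predicted = [4, 3, 3, 4]

print(rmse(actual, predicted), mae(actual, predicted))
```

Lower is better for both. RMSE is never smaller than MAE on the same predictions.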

In [50]:
from surprise import Dataset, Reader, SVD, SVDpp, NMF
from surprise.model_selection import cross_validate, GridSearchCV

In [51]:
algo = SVD()

reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(combined_df[['author_id', 'product_plus_brand', 'rating']], reader)

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0171  1.0176  1.0188  1.0180  1.0219  1.0187  0.0017  
MAE (testset)     0.7494  0.7488  0.7503  0.7497  0.7523  0.7501  0.0012  
Fit time          5.24    6.29    5.51    5.64    5.70    5.68    0.35    
Test time         0.57    0.75    0.47    0.42    0.42    0.52    0.12    


{'test_rmse': array([1.01705437, 1.01761613, 1.01878221, 1.01795208, 1.02187437]),
 'test_mae': array([0.74943505, 0.74879469, 0.75025433, 0.7496667 , 0.75233395]),
 'fit_time': (5.236186981201172,
  6.290385007858276,
  5.513761043548584,
  5.644951105117798,
  5.703460931777954),
 'test_time': (0.566227912902832,
  0.7470211982727051,
  0.4680309295654297,
  0.41750311851501465,
  0.4154949188232422)}

In [52]:
algo = SVDpp()

reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(combined_df[['author_id', 'product_plus_brand', 'rating']], reader)

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0161  1.0137  1.0167  1.0146  1.0103  1.0143  0.0023  
MAE (testset)     0.7270  0.7268  0.7264  0.7286  0.7258  0.7269  0.0009  
Fit time          5.25    5.46    5.19    5.22    5.30    5.28    0.10    
Test time         1.27    1.17    1.15    0.99    1.01    1.12    0.10    


{'test_rmse': array([1.01613978, 1.01371857, 1.01667122, 1.01456051, 1.0103034 ]),
 'test_mae': array([0.72702978, 0.72682869, 0.72644011, 0.72864098, 0.72579918]),
 'fit_time': (5.2482287883758545,
  5.457863807678223,
  5.189410209655762,
  5.215213060379028,
  5.301656007766724),
 'test_time': (1.2654681205749512,
  1.1681432723999023,
  1.1483259201049805,
  0.994927167892456,
  1.0091660022735596)}

In [53]:
algo = NMF()

reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(combined_df[['author_id', 'product_plus_brand', 'rating']], reader)

cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.1310  1.1130  1.1259  1.1233  1.1261  1.1239  0.0060  
MAE (testset)     0.8369  0.8175  0.8327  0.8321  0.8344  0.8307  0.0068  
Fit time          22.36   21.65   22.54   24.06   22.50   22.62   0.79    
Test time         0.75    0.39    0.61    0.48    0.62    0.57    0.13    


{'test_rmse': array([1.1309771 , 1.11300095, 1.12588183, 1.12327814, 1.12613494]),
 'test_mae': array([0.83694717, 0.81754977, 0.8326818 , 0.83206755, 0.83442387]),
 'fit_time': (22.358661890029907,
  21.648234128952026,
  22.54408597946167,
  24.055536031723022,
  22.497864961624146),
 'test_time': (0.7524809837341309,
  0.3876838684082031,
  0.6133291721343994,
  0.4806489944458008,
  0.6170177459716797)}

SVD++ has the lowest mean RMSE (1.0143, against 1.0187 for SVD and 1.1239 for NMF) and the lowest mean MAE (0.7269), so let's use that.
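
The comparison can also be made explicit from the fold scores printed above. This sketch copies the per-fold RMSE values from the three runs and picks the algorithm with the lowest mean. The dictionary name `fold_rmse` is just for illustration.

```python
from statistics import mean

# Per-fold test RMSE, copied from the cross_validate outputs above
fold_rmse = {
    "SVD":   [1.0171, 1.0176, 1.0188, 1.0180, 1.0219],
    "SVDpp": [1.0161, 1.0137, 1.0167, 1.0146, 1.0103],
    "NMF":   [1.1310, 1.1130, 1.1259, 1.1233, 1.1261],
}

mean_rmse = {name: mean(scores) for name, scores in fold_rmse.items()}
best = min(mean_rmse, key=mean_rmse.get)

# Print the algorithms from best to worst mean RMSE
for name, score in sorted(mean_rmse.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f}")
print("lowest mean RMSE:", best)
```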

In [54]:
model = SVDpp()

reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(combined_df[['author_id', 'product_plus_brand', 'rating']], reader)

In [55]:
trainset = data.build_full_trainset()
model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x2c5ffafc0>

In [56]:
pd.DataFrame(model.qi)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,-0.264105,0.095209,0.053088,-0.551046,0.114269,-0.417563,0.139759,0.410604,-0.243925,0.313093,-0.809048,-0.051948,-0.535472,0.141738,0.168246,0.068505,0.426470,0.817488,0.349635,-0.207652
1,0.355707,-0.028891,0.462035,0.340225,0.135431,-0.209157,0.574760,0.102670,-0.452626,0.363016,0.131923,-0.366255,-0.580687,0.611573,0.261965,0.318538,-0.484818,-0.356960,-0.513519,-0.268533
2,0.004306,0.256737,-0.004622,0.046444,-0.256158,0.048749,-0.043178,0.154407,0.078727,0.288402,0.126624,0.067543,0.097578,-0.062225,-0.030545,-0.124875,0.131328,-0.003080,-0.267871,-0.021156
3,-0.417119,0.669556,0.863620,0.119556,-0.375960,0.076344,-0.131053,-1.299948,0.310964,0.717349,-0.653215,0.900715,1.820080,0.549087,-0.537430,-0.700695,-0.418760,0.234708,0.335076,0.661901
4,0.796674,0.648841,0.180466,-0.609661,-0.292246,-0.949169,-0.825861,-0.939984,-0.865343,-0.687816,-0.908989,0.450199,-0.756128,0.001919,0.300516,-0.433541,0.430695,0.468552,-0.090249,-0.618608
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1856,0.063960,0.009030,0.049141,0.049662,0.118953,-0.019865,0.016119,0.230228,0.018293,-0.011457,0.202466,0.199628,0.010208,0.023422,0.103854,-0.018139,-0.077322,-0.025927,0.163133,-0.241227
1857,0.039048,-0.433612,0.060744,0.267998,-0.218258,0.126668,-0.043235,-0.059833,0.291686,0.270097,0.071400,-0.226354,-0.362356,-0.006021,0.155652,-0.139924,0.345599,0.213197,0.024997,0.305618
1858,-0.100399,-0.355660,-0.098571,-0.372561,0.577303,0.042286,-0.283992,-0.222340,0.039703,-0.221695,0.004120,-0.085235,-0.101681,-0.165926,-0.008050,-0.233545,0.221167,-0.006037,0.146702,-0.134308
1859,-0.074912,0.320173,-0.111947,-0.189012,-0.177017,-0.303994,-0.257226,-0.251996,-0.062924,0.056665,-0.308962,0.117160,-0.101934,0.123773,0.018813,-0.039131,0.345338,-0.263219,0.076914,-0.228924


The code below is adapted from https://github.com/Techie5879/moviepp/tree/main

In [57]:
from scipy.spatial.distance import cosine as cosine_distance

In [58]:
def get_vector(raw_id: str, trained_model=model) -> np.ndarray:
    """Returns the latent features of a product in the form of a numpy array"""
    # to_inner_iid is the public way to map a raw item id to its row in qi
    product_row_idx = trained_model.trainset.to_inner_iid(raw_id)
    return trained_model.qi[product_row_idx]

To handle several liked products, I simply average their latent vectors and look for the items closest to that mean vector.
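
As a toy illustration of this idea, the sketch below averages two invented 3-dimensional vectors and measures cosine distance to the result. The real latent vectors come from `model.qi` and have 20 dimensions. The `cosine_distance` here is a plain-Python stand-in with the same meaning as SciPy's.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally long vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 2 means opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Invented "latent vectors" for two liked products
liked = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
profile = mean_vector(liked)  # [0.5, 0.5, 1.0]

print(cosine_distance(profile, [1.0, 1.0, 2.0]))   # same direction as the profile
print(cosine_distance(profile, [1.0, -1.0, 0.0]))  # orthogonal to the profile
```

Cosine distance ignores vector length, so averaging and summing the liked vectors would give the same ranking.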

In [59]:
def get_recs(liked_products: list[str], num_recs=5, model=model):
    """Returns the num_recs products closest to the liked products, 5 being the default

    This function averages the latent vectors of `liked_products`, then iterates over
    every product in the trainset and calculates the cosine distance between that
    product's vector and the mean vector.
    """
    try:
        vectors = []

        for liked_product in liked_products:
            # Look up the raw id; raises IndexError if the product has no reviews
            liked_raw_id = combined_df[combined_df['product_plus_brand']==liked_product]["product_plus_brand"].iloc[0]
            vectors.append(get_vector(liked_raw_id, model))

        product_vector = sum(vectors) / len(vectors)

        distance_table = []

        # Iterate over every other product and calculate its distance to the mean vector
        for other_raw_id in model.trainset._raw2inner_id_items.keys():
            if other_raw_id not in liked_products:
                other_product_vector = get_vector(other_raw_id, model)
                distance = cosine_distance(other_product_vector, product_vector)
                recommended_product = combined_df[combined_df['product_plus_brand']==other_raw_id]['product_plus_brand'].iloc[0]
                # skip exact matches (distance 0)
                if distance != 0:
                    distance_table.append((distance, recommended_product))

        # sort products by ascending distance (most similar first)
        recs = pd.DataFrame(sorted(distance_table), columns=["vector cosine distance", "Product"])
        return recs.head(num_recs)
    # Exception for if there isn't enough info about a product
    except Exception:
        print("Not enough info about product")
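
Since the raw item ids in the trainset are already the `product_plus_brand` strings, the per-item DataFrame lookups inside the loop are probably not needed, and they may account for much of the running time (this is a guess, not a profiled result). Below is a sketch of the same ranking over a plain dictionary of vectors, using `heapq.nsmallest`. The `item_vectors` dictionary, its values and the function name `rank_by_distance` are made up, standing in for `model.qi`.

```python
import heapq
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def rank_by_distance(liked, item_vectors, num_recs=5):
    """Return the num_recs (distance, item) pairs closest to the mean liked vector."""
    liked_vectors = [item_vectors[name] for name in liked]
    profile = [sum(c) / len(liked_vectors) for c in zip(*liked_vectors)]
    candidates = (
        (cosine_distance(vec, profile), name)
        for name, vec in item_vectors.items()
        if name not in liked
    )
    # nsmallest avoids sorting the whole candidate list
    return heapq.nsmallest(num_recs, candidates)

# Hypothetical item vectors standing in for model.qi
item_vectors = {
    "moisturizer A": [1.0, 0.1],
    "moisturizer B": [0.9, 0.2],
    "brow gel":      [0.1, 1.0],
    "face oil":      [0.8, 0.3],
}

top = rank_by_distance(["moisturizer A"], item_vectors, num_recs=2)
print([name for _, name in top])
```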

In [60]:
owned_products = ['Active Moist Moisturizer Dermalogica', 'Acne Solutions Clinical Clearing Gel CLINIQUE']

get_recs(owned_products)

Unnamed: 0,vector cosine distance,Product
0,0.31803,Vinosource-Hydra SOS Intense Hydration Moistur...
1,0.364592,Cica Recover & Repair Multi-Use Balm Dior
2,0.382223,Hyaluronic Eye Cream Mario Badescu
3,0.388779,The Moisturizing Matte Lotion La Mer
4,0.402947,Juno Antioxidant + Superfood Face Oil Sunday R...


**Hybrid Approach**

 - One product - top 5 content-based, top 5 collaborative
 - Two products - top 2 content-based for each product, top 6 collaborative
 - Three products - top 1 content-based for each product, top 7 collaborative
 - Four or more - top 10 collaborative only
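
The split above can be written as a small lookup, which also makes it easy to check that every case adds up to 10 recommendations. The function name `split_counts` is illustrative and not part of the notebook.

```python
def split_counts(n_owned):
    """Return (collaborative recs, content recs per owned product) for n_owned products."""
    if n_owned <= 0:
        raise ValueError("need at least one owned product")
    table = {1: (5, 5), 2: (6, 2), 3: (7, 1)}
    # Four or more owned products: rely on collaborative filtering only
    return table.get(n_owned, (10, 0))

totals = []
for n in range(1, 6):
    collab, per_product = split_counts(n)
    total = collab + per_product * n
    totals.append(total)
    print(n, collab, per_product, total)
```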

In [63]:
def recommend(owned_products):
    """Return a set of up to 10 recommendations, mixing CF and CBF by input size."""
    # number of owned products -> (collaborative recs, content recs per product)
    split = {1: (5, 5), 2: (6, 2), 3: (7, 1)}

    n = len(owned_products)
    if n == 0:
        print('Product list is empty!')
        return

    # four or more products: collaborative filtering only
    n_collab, n_content = split.get(n, (10, 0))
    res = set(get_recs(owned_products, n_collab)['Product'].tolist())

    if n_content == 0:
        return res

    for product in owned_products:
        added = 0
        for candidate in cbf_rec(product):
            if added >= n_content:
                break
            # skip duplicates and products the user already owns
            if candidate not in res and candidate not in owned_products:
                res.add(candidate)
                added += 1

    return res

In [65]:
import time

start_time = time.time()

owned_products = ['Active Moist Moisturizer Dermalogica', 'Acne Solutions Clinical Clearing Gel CLINIQUE', 
                  'Advanced Génifique Radiance Boosting Face Serum Lancôme', 'Ageless Revitalizing Neck Cream Tatcha']

print(recommend(owned_products))

end_time = time.time()

print('wait time: ', end_time - start_time)

{'Cleanser Dr. Barbara Sturm', 'Prep It Self-Tan Priming Spray Isle of Paradise', 'Green Tea Hyaluronic Acid Face Cleanser innisfree', 'Mini Acne Solutions Cleansing Foam CLINIQUE', 'Mini Melt Moisturizer with Bakuchiol and Squalane alpyn beauty', 'Vitamin C & Bearberry Instant Glow Serum alpyn beauty', 'BioLumin-C Vitamin C Serum Dermalogica', 'Pore Minimizing Instant Detox Mask Caudalie', 'Acne Treatment Gel SEPHORA COLLECTION', 'High Performance Eye Cream for Dark Circles with Hyaluronic Acid MACRENE actives'}
wait time:  43.93125510215759


**Conclusion**

Possible improvements:
 - Find a faster algorithm for the collaborative filtering part, since it seems to be what makes the recommendation engine take so long (about 44 seconds in the test above).
 - Find a better way to combine the two methods, for example by attaching weights based on an exponential function.
 - Ratings are subjective to each user, so find a good way to normalise them. I looked into it a little during data processing, but every normalisation method I researched had significant drawbacks, so I ended up leaving the ratings alone.
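
As one illustration of the last point, a common and simple normalisation is to subtract each user's mean rating, so that a 4 from a harsh rater counts for more than a 4 from a generous one. This is only a sketch on invented ratings, not something the notebook does. It also shows one drawback: a user with a single review always ends up at 0.

```python
from collections import defaultdict
from statistics import mean

# Invented (author_id, product, rating) rows
ratings = [
    ("u1", "cleanser", 5), ("u1", "serum", 4), ("u1", "toner", 3),
    ("u2", "cleanser", 2), ("u2", "serum", 4),
    ("u3", "toner", 5),
]

# Collect each user's ratings, then compute their personal mean
by_user = defaultdict(list)
for user, _, rating in ratings:
    by_user[user].append(rating)
user_mean = {user: mean(values) for user, values in by_user.items()}

# Mean-centred rating: how far above or below the user's own average
centred = [(user, product, rating - user_mean[user]) for user, product, rating in ratings]

for row in centred:
    print(row)
```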