# Final Project | `Skin-Scout`

Batch        : FTDS-BSD-006

Group        : 3

Team members : 
- Achmad Abdillah Ghifari : Data Analyst
- Celine Clarissa         : Data Scientist
- Evan Juanto             : Data Engineer

HuggingFace      : [Skin-Scout Deployment Link](https://huggingface.co/spaces/celineclarissa/Skin-Scout)

Original Dataset : [Original Dataset Link](https://www.kaggle.com/datasets/nadyinky/sephora-products-and-skincare-reviews)

Team GitHub      : [GitHub Link](https://github.com/juanto26/p2-final-project-skinscout)

---
---

## i. Introduction

Before loading the data, we define the background and the problem statement that guide our work. The background explains why we are using this dataset, and the problem statement describes the problem we want to solve with it.

### i.1. Background

The current skincare market is flooded with countless products, each with its own ingredients and highlights. Consumers often struggle to tell which products other consumers actually recommend, because every product has a large number of reviews and reading all of them takes too much time and effort. Star ratings are available on most skincare websites, but relying on them alone to judge the quality of a product is unreliable. Research has shown that star ratings have several problems, such as negativity bias, where a single negative aspect can lead a user to give a low rating even though the product excels in other areas. The review text and the star rating a user gives can also disagree, with some research finding only a moderate correlation between the two. Hence, consumers are left to go through multiple reviews to get an accurate picture of a skincare product. Our team's goal is therefore to create an application that makes this process easier by predicting whether a user would recommend a product based on their review.

### i.2. Problem Statement and Objective

We want to create an application that uses Natural Language Processing (NLP) and a recommender system to predict whether a customer will recommend a product and to recommend similar skincare products. Our goal is to create a model with an F1-score of 80%, using techniques such as SVC and cosine similarity. With this application, our objective is to make the process of finding the right skincare product more time-efficient and less frustrating.

---
---

## ii. Import Libraries

The following are the libraries used in the making of our group's Recommender System.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

---
---

## iii. Data Loading

The data contains information about skincare products sold on the beauty e-commerce site Sephora.

The data was obtained from [kaggle.com](https://www.kaggle.com/datasets/nadyinky/sephora-products-and-skincare-reviews) and then merged. Our Data Engineer processed the data with Airflow to produce the cleaned dataset.

In [2]:
df_ori = pd.read_csv('finalproject_clean.csv')

In [3]:
df = df_ori.copy()
df

| | rating | is_recommended | helpfulness | total_feedback_count | total_neg_feedback_count | total_pos_feedback_count | submission_time | review_text | skin_tone | eye_color | ... | ingredients | limited_edition | new | online_only | out_of_stock | sephora_exclusive | highlights | secondary_category | tertiary_category | child_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 1.0 | 1.0 | 0 | 0.0 | 0 | 2020-04-05 | I’m very impressed with the price and how well... | fair | green | ... | ['Water, Butylene Glycol, Glycerin, Sodium Hya... | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ['Good for: Dullness/Uneven Texture', 'Hyaluro... | Treatments | Face Serums | 0.0 |
| 1 | 5 | 1.0 | 1.0 | 0 | 0.0 | 0 | 2017-11-15 | Just picked up this product to replace my curr... | tan | brown | ... | ['Diisostearyl Malate, Hydrogenated Polyisobut... | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ['allure 2019 Best of Beauty Award Winner', 'C... | Lip Balms & Treatments | Moisturizers | 3.0 |
| 2 | 4 | 1.0 | 0.0 | 1 | 1.0 | 0 | 2018-02-16 | stops future breakouts from happening! The por... | medium | brown | ... | ['Aqua (Water), Niacinamide, Pentylene Glycol,... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ['Vegan', 'Community Favorite', 'Oil Free', 'W... | Treatments | Face Serums | 1.0 |
| 3 | 4 | 1.0 | 1.0 | 0 | 0.0 | 0 | 2019-09-12 | First positive impression is how smooth and so... | lightMedium | brown | ... | ['Water, Glycerin, Alcohol Denat., Dipropylene... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ['allure 2019 Best of Beauty Award Winner', 'C... | Cleansers | Toners | 0.0 |
| 4 | 5 | 1.0 | 1.0 | 0 | 0.0 | 0 | 2021-09-03 | This smells incredible! It really brightens my... | lightMedium | blue | ... | ['Alcohol Denat., Fragrance (Parfum), Benzyl C... | 0 | 0.0 | 1.0 | 0.0 | 1.0 | ['Vegan', 'Unisex/ Genderless Scent', 'Woody &... | Wellness | Holistic Wellness | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 39924 | 5 | 1.0 | 1.0 | 0 | 0.0 | 0 | 2022-05-31 | My hubby hates sunscreen because of the way it... | light | brown | ... | ['Avobenzone 2.3%, Homosalate 10.0%, Octisalat... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ['Without Oxybenzone', 'Hyaluronic Acid', 'Goo... | Sunscreen | Face Sunscreen | 0.0 |
| 39925 | 5 | 1.0 | 1.0 | 0 | 0.0 | 0 | 2019-09-26 | I have only heard good things about this and c... | lightMedium | brown | ... | ['Aqua/Water/Eau, Glycerin, Propanediol, Cetea... | 0 | 0.0 | 0.0 | 0.0 | 1.0 | ['Good for: Dullness/Uneven Texture', 'Good fo... | Eye Care | Eye Masks | 0.0 |
| 39926 | 3 | 0.0 | 1.0 | 0 | 0.0 | 0 | 2022-10-15 | I received a sample sachet size of this produc... | fair | blue | ... | ['Water/Aqua/Eau, Tridecyl Trimellitate, Glyce... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ['Good for: Anti-Aging', 'Hyaluronic Acid', 'F... | Eye Care | Eye Creams & Treatments | 0.0 |
| 39927 | 3 | 0.0 | 1.0 | 0 | 0.0 | 0 | 2021-12-22 | my skin is combination in winter and more oily... | light | brown | ... | ['Water, Stearic Acid, Peg-8, Myristic Acid, G... | 0 | 0.0 | 0.0 | 0.0 | 0.0 | ['allure 2019 Best of Beauty Award Winner', 'C... | Cleansers | Face Wash & Cleansers | 0.0 |


---
---

## iv. Feature Engineering

### iv.1. Clean Values in Used Columns

In [4]:
# Show columns 'ingredients' and 'highlights' before feature engineering
df[['ingredients','highlights']]

| | ingredients | highlights |
|---|---|---|
| 0 | ['Water, Butylene Glycol, Glycerin, Sodium Hya... | ['Good for: Dullness/Uneven Texture', 'Hyaluro... |
| 1 | ['Diisostearyl Malate, Hydrogenated Polyisobut... | ['allure 2019 Best of Beauty Award Winner', 'C... |
| 2 | ['Aqua (Water), Niacinamide, Pentylene Glycol,... | ['Vegan', 'Community Favorite', 'Oil Free', 'W... |
| 3 | ['Water, Glycerin, Alcohol Denat., Dipropylene... | ['allure 2019 Best of Beauty Award Winner', 'C... |
| 4 | ['Alcohol Denat., Fragrance (Parfum), Benzyl C... | ['Vegan', 'Unisex/ Genderless Scent', 'Woody &... |
| ... | ... | ... |
| 39924 | ['Avobenzone 2.3%, Homosalate 10.0%, Octisalat... | ['Without Oxybenzone', 'Hyaluronic Acid', 'Goo... |
| 39925 | ['Aqua/Water/Eau, Glycerin, Propanediol, Cetea... | ['Good for: Dullness/Uneven Texture', 'Good fo... |
| 39926 | ['Water/Aqua/Eau, Tridecyl Trimellitate, Glyce... | ['Good for: Anti-Aging', 'Hyaluronic Acid', 'F... |
| 39927 | ['Water, Stearic Acid, Peg-8, Myristic Acid, G... | ['allure 2019 Best of Beauty Award Winner', 'C... |


In [5]:
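# Clean values in columns 'ingredients' and 'highlights' by joining each stringified list into one string
# ast.literal_eval only parses Python literals, so it is safer than eval on text read from a file
import ast

df['ingredients'] = df['ingredients'].apply(lambda x: ' '.join(ast.literal_eval(x)) if isinstance(x, str) else '')
df['highlights'] = df['highlights'].apply(lambda x: ' '.join(ast.literal_eval(x)) if isinstance(x, str) else '')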

In [6]:
# Show columns 'ingredients' and 'highlights' after cleaning
df[['ingredients','highlights']]

| | ingredients | highlights |
|---|---|---|
| 0 | Water, Butylene Glycol, Glycerin, Sodium Hyalu... | Good for: Dullness/Uneven Texture Hyaluronic A... |
| 1 | Diisostearyl Malate, Hydrogenated Polyisobuten... | allure 2019 Best of Beauty Award Winner Commun... |
| 2 | Aqua (Water), Niacinamide, Pentylene Glycol, Z... | Vegan Community Favorite Oil Free Without Sili... |
| 3 | Water, Glycerin, Alcohol Denat., Dipropylene G... | allure 2019 Best of Beauty Award Winner Commun... |
| 4 | Alcohol Denat., Fragrance (Parfum), Benzyl Cin... | Vegan Unisex/ Genderless Scent Woody & Earthy ... |
| ... | ... | ... |
| 39924 | Avobenzone 2.3%, Homosalate 10.0%, Octisalate ... | Without Oxybenzone Hyaluronic Acid Good for: A... |
| 39925 | Aqua/Water/Eau, Glycerin, Propanediol, Ceteary... | Good for: Dullness/Uneven Texture Good for: Lo... |
| 39926 | Water/Aqua/Eau, Tridecyl Trimellitate, Glyceri... | Good for: Anti-Aging Hyaluronic Acid Fragrance... |
| 39927 | Water, Stearic Acid, Peg-8, Myristic Acid, Gly... | allure 2019 Best of Beauty Award Winner Commun... |


---

### iv.2. Join Used Columns

In [7]:
# Join values in columns 'ingredients' and 'highlights' to column 'combined_text'
df['combined_text'] = df['highlights'] + ' ' + df['ingredients']

In [8]:
# Show combined values
df['combined_text']

0        Good for: Dullness/Uneven Texture Hyaluronic A...
1        allure 2019 Best of Beauty Award Winner Commun...
2        Vegan Community Favorite Oil Free Without Sili...
3        allure 2019 Best of Beauty Award Winner Commun...
4        Vegan Unisex/ Genderless Scent Woody & Earthy ...
                               ...                        
39924    Without Oxybenzone Hyaluronic Acid Good for: A...
39925    Good for: Dullness/Uneven Texture Good for: Lo...
39926    Good for: Anti-Aging Hyaluronic Acid Fragrance...
39927    allure 2019 Best of Beauty Award Winner Commun...
39928    Vegan Good for: Loss of firmness Collagen Hypo...
Name: combined_text, Length: 39929, dtype: object

---

### iv.3. Vectorization

We will use a TF-IDF Vectorizer to convert the combined text into a matrix, where each row corresponds to one row of `combined_text` and each column to one word. TF, which stands for Term Frequency, measures how often a word occurs in a given text. IDF, which stands for Inverse Document Frequency, weights each word by how rare it is across all texts. As a result, words that occur in only a few texts receive a higher weight than words that occur in almost every text.
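
To make the weighting concrete, the following is a minimal pure-Python sketch of TF-IDF on a toy corpus. The three example texts and the helper name `tfidf_vectors` are hypothetical and not part of our pipeline. The sketch assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which scikit-learn documents as the default behaviour of `TfidfVectorizer`, and it simplifies tokenization to whitespace splitting.

```python
import math
from collections import Counter

# Hypothetical toy texts standing in for rows of 'combined_text'
docs = [
    "vegan hyaluronic acid water glycerin",
    "vegan niacinamide water glycerin",
    "retinol water glycerin",
]


def tfidf_vectors(docs):
    """Return (vocabulary, idf weights, L2-normalized TF-IDF vectors)."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)

    # Document frequency: in how many texts does each word appear?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

    # Smoothed inverse document frequency (assumed scikit-learn default)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        raw = [term_freq[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, idf, vectors


vocab, idf, vectors = tfidf_vectors(docs)

# Words found in every text ('water') get the lowest weight,
# words found in a single text ('retinol') get the highest.
for word in ("water", "vegan", "retinol"):
    print(word, round(idf[word], 3))
```

In this toy corpus, `water` and `glycerin` appear in every text and therefore contribute least to distinguishing one product from another, which is the effect we rely on for ingredient lists that all start with water.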

In [9]:
# Define vectorizer
tfidf = TfidfVectorizer(stop_words='english')

# Fit and transform text with vectorizer
tfidf_matrix = tfidf.fit_transform(df['combined_text'])

---
---

## v. Modelling

### v.1. Calculate Cosine Similarity

Next, we will calculate the cosine similarity between products based on the matrix we got from the Vectorization process.

In [10]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
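
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths, so it depends on the direction of the vectors and not on their magnitude. The sketch below is a minimal pure-Python illustration of that definition. The helper name `cosine` and the example vectors are hypothetical, and the choice to return 0.0 for an all-zero vector is an assumption made for this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption for this sketch: an all-zero vector is similar to nothing
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical word-weight vectors for three products
serum_a = [0.7, 0.7, 0.0]
serum_b = [0.6, 0.8, 0.0]
cream_c = [0.0, 0.0, 1.0]

print(cosine(serum_a, serum_b))  # close to 1: mostly shared words
print(cosine(serum_a, cream_c))  # 0.0: no shared words
```

Because TF-IDF weights are never negative, the scores for our data fall between 0 (no words in common) and 1 (same word distribution).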

---

### v.2. Define Recommender System

We will define the Recommender System with a function.

In [11]:
# Define function to generate product recommendation
def get_recommendations_by_name(product_name, cosine_sim=cosine_sim, df=df, num_recommendations=5):
    '''
    This function is used to get product recommendation based on the product name that the user inserts.
    '''

    # Check if the product inserted by user is in the dataframe
    ## Create condition for when the product inserted is not in dataframe
    if product_name not in df['product_name'].values:
        return "Product not found."
    
    # Find product index based on product name
    index = df[df['product_name'] == product_name].index[0]
    
    # Calculate cosine similarity score between all products and inserted product
    sim_scores = list(enumerate(cosine_sim[index]))
    
    # Sort products based on cosine similarity (except for the inserted product itself)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:]
    
    # Get recommended product index
    item_indices = [i[0] for i in sim_scores]
    
    # Create dataframe for recommended products
    recommendations = df.iloc[item_indices]
    
    # Remove inserted product from recommendations
    recommendations = recommendations[recommendations['product_name'] != product_name]
    
    # Drop duplicates based on column 'product_name'
    unique_recommendations = recommendations.drop_duplicates(subset='product_name')
    
    # Return recommendations
    return unique_recommendations.head(num_recommendations)
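
The function above relies on the full pairwise similarity matrix, which for the 39,929 rows in our data has about 1.6 billion entries (roughly 12.8 GB if stored as 64-bit floats). As a possible alternative, the sketch below computes the similarity of one product against all rows only when it is requested. The function name `recommend_on_demand`, the toy dataframe, and the choice to return an empty dataframe instead of a message when the product is missing are all assumptions for illustration, not part of our deployed code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend_on_demand(product_name, df, tfidf_matrix, num_recommendations=5):
    """Recommend similar products without a precomputed N x N matrix."""
    # Row positions (not index labels) of the requested product
    positions = np.flatnonzero((df['product_name'] == product_name).to_numpy())
    if len(positions) == 0:
        # Assumption: an empty dataframe keeps the return type consistent
        return df.iloc[0:0]
    pos = int(positions[0])

    # Similarity of this single row against every row: shape (1, N)
    scores = cosine_similarity(tfidf_matrix[pos:pos + 1], tfidf_matrix).ravel()

    ranked = df.assign(similarity=scores).sort_values(
        'similarity', ascending=False, kind='stable'
    )
    # Drop the requested product itself, then keep one row per product
    ranked = ranked[ranked['product_name'] != product_name]
    return ranked.drop_duplicates(subset='product_name').head(num_recommendations)


# Hypothetical toy data with a duplicated product, as in a review-level table
toy = pd.DataFrame({
    'product_name': ['Serum A', 'Serum A', 'Serum B', 'Cream C'],
    'combined_text': [
        'vegan niacinamide water glycerin',
        'vegan niacinamide water glycerin',
        'vegan niacinamide water zinc',
        'shea butter beeswax lanolin',
    ],
})
toy_matrix = TfidfVectorizer().fit_transform(toy['combined_text'])

result = recommend_on_demand('Serum A', toy, toy_matrix, num_recommendations=2)
print(result[['product_name', 'similarity']])
```

The trade-off is that each request pays for one row of similarity computation, while the precomputed matrix answers lookups immediately but must fit in memory.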

---
---

## vi. Recommender System Trial

Now, we will test the Recommender System that we made by inserting a product name.

In [12]:
# Get recommendations for inserted product name
recommendations = get_recommendations_by_name("Alpha Beta Pore Perfecting & Refining Serum")

# Clean recommendations: keep the product and brand columns and reset the index
recommendations = recommendations[['product_name','brand_name']].reset_index(drop=True)

# Print recommendations
print("You like 'Alpha Beta Pore Perfecting & Refining Serum'. Therefore, we recommend you to try:")
recommendations

You like 'Alpha Beta Pore Perfecting & Refining Serum'. Therefore, we recommend you to try:


| | product_name | brand_name |
|---|---|---|
| 0 | Alpha Beta Ultra Gentle Daily Peel Pads for Se... | Dr. Dennis Gross Skincare |
| 1 | Age Arrest Eye Cream | Kate Somerville |
| 2 | So Poreless Deep Exfoliating Blackhead Scrub | TULA Skincare |
| 3 | Vitamin C Lactic Dewy Deep Cream | Dr. Dennis Gross Skincare |
| 4 | Vitamin C Lactic Oil-Free Radiant Moisturizer | Dr. Dennis Gross Skincare |
