# Final Project | `Skin-Scout`

Batch        : FTDS-BSD-006

Group        : 3

Team members : 
- Achmad Abdillah Ghifari : Data Analyst
- Celine Clarissa         : Data Scientist
- Evan Juanto             : Data Engineer

HuggingFace      : [Skin-Scout Deployment Link](https://huggingface.co/spaces/celineclarissa/Skin-Scout)

Original Dataset : [Original Dataset Link](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data)

Team GitHub      : [GitHub Link](https://github.com/juanto26/p2-final-project-skinscout)

---
---

## i. Introduction

Before loading the data, we define the background and problem statement that frame this project. The background explains why we chose this dataset, and the problem statement describes the problem we want to solve with it.

### i.1. Background

The skincare market is flooded with products, each with its own ingredients and selling points. Consumers often struggle to tell which products other consumers actually recommend, because each product has a large number of reviews and reading all of them takes too much time and effort. Most skincare websites also show a star rating, but relying on star ratings alone to judge a product is unreliable. Research has shown that star ratings suffer from problems such as negativity bias, where a single negative aspect leads a user to give a low rating even though the product excels in other areas. The review text and the star rating a user gives can also disagree, with some research finding only a moderate correlation between the two. As a result, consumers have to go through many reviews to get an accurate picture of a skincare product. Our team's goal is therefore to build an application that makes this process easier by predicting, from the review text, whether a user would recommend the product.

### i.2. Problem Statement and Objective

We want to create an application that uses Natural Language Processing (NLP) and a recommender system to predict whether a customer will recommend a product, and to recommend similar skincare products. Our goal is a model with an accuracy of 80%. We build it with techniques such as SVC and cosine similarity. With this model, our objective is to make finding the right skincare product faster and less frustrating.
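
To make the cosine similarity idea concrete, here is a minimal sketch that compares two texts by their raw word counts. This notebook does not show which features the project's recommender actually uses, so the word-count representation, the `cosine_similarity` helper, and the product descriptions below are illustrative assumptions only.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two texts, using raw word counts as vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the words that appear in both texts
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)

    # Euclidean length of each count vector
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))

    # An empty text has no direction, so treat it as not similar to anything
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical product descriptions, for illustration only
serum = "hydrating serum with hyaluronic acid"
essence = "hydrating essence with hyaluronic acid"
sunscreen = "mineral sunscreen spf 50"

print(cosine_similarity(serum, essence))    # four of five words shared: high score
print(cosine_similarity(serum, sunscreen))  # no words shared: score of zero
```

A score close to 1 means the two texts use nearly the same words, and a score of 0 means they share none, which is why the measure suits a "similar products" lookup.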

---
---

## ii. Import Libraries and Model

The following libraries are used for our group's Review Classification (NLP) model inference. We also load the model that was previously saved from the notebook `Final_Project_NLP.ipynb`.

In [21]:
# Import libraries
import pandas as pd
import re
import pickle

# Import for Feature Engineering
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [22]:
with open('Final_Project_Model_NLP.pkl', 'rb') as file_1:
    Final_Project_Model_NLP = pickle.load(file_1)
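
If the pickle file is missing or the notebook runs from a different working directory, the plain `open` call fails with a bare path error. The sketch below wraps the loading step in a hypothetical `load_model` helper (not part of the original notebook) that reports the full path it looked for. Note that `pickle.load` can execute arbitrary code, so it should only be used on files from a trusted source.

```python
import pickle
from pathlib import Path


def load_model(path):
    """Load a pickled object, raising a clear error if the file is missing."""
    model_path = Path(path)
    if not model_path.is_file():
        raise FileNotFoundError(f"Model file not found: {model_path.resolve()}")

    # pickle can run arbitrary code while loading, so only open trusted files
    with model_path.open('rb') as model_file:
        return pickle.load(model_file)


# Usage with the file name from this notebook:
# Final_Project_Model_NLP = load_model('Final_Project_Model_NLP.pkl')
```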

---
---

## iii. Preprocessing

Now, we will define the function that will be used to preprocess text before inputting it into the model.

In [23]:
# Define stopwords
stopwords_eng = stopwords.words('english')

# Create text preprocessing function
def text_preprocessing(text):
  '''
  Preprocess a single review: change the text to lowercase, remove numbers and punctuation symbols,
  tokenize, remove English stopwords, lemmatize, and join the tokens back into one string.
  '''
  # Change text to lowercase
  text = text.lower()

  # Remove numbers
  text = re.sub(r'\d+', '', text)

  # Remove comma
  text = text.replace(',', '')

  # Remove period symbol
  text = text.replace('.', '')

  # Remove exclamation mark
  text = text.replace('!', '')

  # Remove question mark
  text = text.replace('?', '')

  # Remove quotation mark
  text = text.replace('"', '')
  text = text.replace("'", '')
  text = text.replace('’', '')

  # Replace hyphens and dashes with a space
  text = text.replace('-', ' ')
  text = text.replace('—', ' ')

  # Replace ampersand with the word 'and'
  text = text.replace('&', 'and')

  # Remove leading and trailing whitespace
  text = text.strip()

  # Tokenization
  tokens = word_tokenize(text)

  # Remove stopwords
  tokens = [word for word in tokens if word not in stopwords_eng]

  # Lemmatization: reduce each word to its base form
  lemmatizer = WordNetLemmatizer()
  tokens = [lemmatizer.lemmatize(word) for word in tokens]

  # Combine tokens
  text = ' '.join(tokens)

  return text
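
The character cleanup at the start of `text_preprocessing` is a chain of `replace` calls. As a sketch only, the same steps could be expressed with a single translation table. The helper name `clean_characters` and the module-level constants are hypothetical, and this version covers only the cleanup steps, not tokenization, stopword removal, or lemmatization.

```python
import re

# Characters that are deleted outright, and characters replaced by a space
CHARS_TO_DELETE = ',.!?"\'’'
CHARS_TO_SPACE = '-—'
CLEANUP_TABLE = str.maketrans(CHARS_TO_SPACE, ' ' * len(CHARS_TO_SPACE), CHARS_TO_DELETE)


def clean_characters(text: str) -> str:
    """Lowercase the text and apply the same character cleanup as text_preprocessing."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)        # remove numbers
    text = text.translate(CLEANUP_TABLE)   # delete punctuation, turn dashes into spaces
    text = text.replace('&', 'and')        # spell out ampersands
    return text.strip()                    # remove leading and trailing whitespace


print(clean_characters("B-Hydra, 2 uses!"))
```

Because the deployed model was trained on text cleaned by the original function, any refactor like this should be checked to produce identical output before it replaces the original.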

Next, we define a dictionary that maps the numeric class predicted by the model to its class name.

In [24]:
# Create class dictionary
dict_class = {0: 'not recommended',
              1: 'recommended'}

---
---

## iv. Inference Data

To test the model, we use two new reviews that the model has never seen during training. In the first review, the user recommends the product; in the second, the user does not.

### iv.1. Recommended

We will use the review of Facial Treatment Essence (Pitera Essence) by the brand SK-II from [byrdie.com](https://www.byrdie.com/sk-ii-pitera-facial-treatment-essence-review-5188402).

In [25]:
# create inference data
inf_data = {'product_name': 'Facial Treatment Essence (Pitera Essence)',
            'brand_name': 'SK-II',
            'review_text': "Honestly, my skin feels more rejuvenated within my first week of using SK-II's Pitera Facial Treatment Essence than it has all year, which is a complete win in my book. The brand claims that their essence will help your skin look and feel even brighter within 28 days, so I'll be continuing my use of this product to see the results for myself. Although SK-II's essence is priced higher than others on the market, the quality of their ingredients (and over 40 years of expertise in their field) lets them speak for themselves. If you're looking for a step to add to your skincare routine that is sure to revive your skin, no matter what type, look no further than SK-II"}

# put inference data into dataframe
inf_data = pd.DataFrame(inf_data, index=[0])

# show dataframe
inf_data

| | product_name | brand_name | review_text |
|---|---|---|---|
| 0 | Facial Treatment Essence (Pitera Essence) | SK-II | Honestly, my skin feels more rejuvenated withi... |


Then, we apply the `text_preprocessing` function defined in the steps above.

In [26]:
inf_data['text_processed'] = inf_data['review_text'].apply(text_preprocessing)
inf_data

| | product_name | brand_name | review_text | text_processed |
|---|---|---|---|---|
| 0 | Facial Treatment Essence (Pitera Essence) | SK-II | Honestly, my skin feels more rejuvenated withi... | honestly skin feel rejuvenated within first we... |


In [27]:
# predict() returns an array with one label per row, so take the first element
y_pred_inf = Final_Project_Model_NLP.predict(inf_data.text_processed)
print(f'Based on your review, the product is {dict_class[int(y_pred_inf[0])]}.')

Based on your review, the product is recommended.


The result above shows that the model correctly classifies this review as 'recommended'.

---

### iv.2. Not Recommended

We will use the review of B-Hydra Intensive Hydration Serum with Hyaluronic Acid by the brand Drunk Elephant from [community.sephora.com](https://community.sephora.com/t5/Skincare-Aware/DO-NOT-BUY-DRUNK-ELEPHANT/m-p/6764331).

In [28]:
# create inference data
inf_data = {'product_name': 'B-Hydra Intensive Hydration Serum with Hyaluronic Acid',
            'brand_name': 'Drunk Elephant',
            'review_text': "Ok, so I was sucked into drunk elephants advertising and aesthetic packaging. It was all over TikTok, YouTube, instagram, you name it. I decided that since everyone was saying it was so good that I would try it. I bought the B-hydra intensive hydration serum. FIRST OF ALL: the price tag for this product was 50 dollars! Which in my opinion is a total rip- off. According to founder Tiffany Masterton, she said it was safe for all skin. She believes in the “ suspicious 6” which is totally not true. Coming from a woman who doesn’t even wash her face every day!! I used this and tested it on a small portion of my skin to make sure it was safe. I woke up to that part of skin inflamed and covered with bumps. I immediately returned the product and got my money back. This is a warning for everyone who is interested in buying drunk elephant. DONT WASTE YOUR TIME OR MONEY!!! Even if this product works well on your skin I think it is not work the price. I found the perfect dupe, ordinary B5 hyaluronic acid serum which is only 8 dollars and still delivers hydration to your face. In conclusion, this whole drunk elephant is a scam and is just a brand that was over hyped by popular influencers. I hope I saved you all from wasting your money!!! Happy shopping!!!"}

# put inference data into dataframe
inf_data = pd.DataFrame(inf_data, index=[0])

# show dataframe
inf_data

| | product_name | brand_name | review_text |
|---|---|---|---|
| 0 | B-Hydra Intensive Hydration Serum with Hyaluro... | Drunk Elephant | Ok, so I was sucked into drunk elephants adver... |


Then, we apply the `text_preprocessing` function defined in the steps above.

In [29]:
inf_data['text_processed'] = inf_data['review_text'].apply(text_preprocessing)
inf_data

| | product_name | brand_name | review_text | text_processed |
|---|---|---|---|---|
| 0 | B-Hydra Intensive Hydration Serum with Hyaluro... | Drunk Elephant | Ok, so I was sucked into drunk elephants adver... | ok sucked drunk elephant advertising aesthetic... |


In [30]:
# predict() returns an array with one label per row, so take the first element
y_pred_inf = Final_Project_Model_NLP.predict(inf_data.text_processed)
print(f'Based on your review, the product is {dict_class[int(y_pred_inf[0])]}.')

Based on your review, the product is not recommended.


The result above shows that the model correctly classifies this review as 'not recommended'.
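
Both inference examples repeat the same three steps: preprocess the review, call `predict`, and map the result to a class name. These steps could be wrapped in one function, sketched below. The name `predict_recommendation` and its parameters are hypothetical, and the sketch assumes the saved model accepts a list of strings, which this notebook does not confirm (it only passes a pandas Series). A small stand-in model is used here so the example runs without the pickle file.

```python
def predict_recommendation(review_text, model, preprocess, class_names):
    """Preprocess one review, run the model on it, and return the class name."""
    processed = preprocess(review_text)
    # Assumption: predict() takes a sequence of documents and returns one label each
    prediction = model.predict([processed])
    return class_names[int(prediction[0])]


class KeywordModel:
    """Stand-in for the real classifier, for demonstration only."""

    def predict(self, texts):
        # Label a text 0 if it mentions 'scam', otherwise 1
        return [0 if 'scam' in text else 1 for text in texts]


class_names = {0: 'not recommended', 1: 'recommended'}

print(predict_recommendation('This brand is a SCAM', KeywordModel(), str.lower, class_names))
print(predict_recommendation('My skin feels great', KeywordModel(), str.lower, class_names))
```

In the notebook itself, the call would pass `Final_Project_Model_NLP`, `text_preprocessing`, and `dict_class` in place of the stand-ins.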