## Problem statement

E-commerce is widespread today. Instead of taking orders from each customer in person, a company launches a website to sell its items to the end consumer, and customers order the products they need from that site. Well-known examples of such e-commerce companies are Amazon, Flipkart, Myntra, Paytm and Snapdeal.

Suppose you are working as a Machine Learning Engineer in an e-commerce company named 'Ebuss'. Ebuss has captured a huge market share in many fields, and it sells products in categories such as household essentials, books, personal care products, medicines, cosmetic items, beauty products, electrical appliances, kitchen and dining products and health care products.

As technology advances, Ebuss must grow quickly to become a leader in the e-commerce market, because it competes with the likes of Amazon and Flipkart, which are already market leaders.

As a senior ML Engineer, you are asked to build a model that improves the recommendations shown to users, based on their past reviews and ratings.


The steps to be performed for the first task are given below.

- Exploratory data analysis
- Data cleaning
- Text preprocessing
- Feature extraction: To extract features from the text data, you may choose any suitable method, such as bag-of-words, TF-IDF vectorization or word embeddings.
- Training a text classification model: Build at least three of the following four ML models, analyse the performance of each and choose the best one. Where required, handle class imbalance and perform hyperparameter tuning.
    1. Logistic regression
    2. Random forest
    3. XGBoost
    4. Naive Bayes

Out of the models you build, select one classification model based on its performance.
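To make the feature-extraction step concrete, here is a minimal sketch of what TF-IDF weighting computes, written in plain Python. It uses the textbook formula tf × log(N/df) without smoothing, so the numbers will differ from those of scikit-learn's `TfidfVectorizer`, which applies smoothing and L2 normalisation by default. The `tfidf` helper and the toy reviews are illustrative assumptions, not part of the project.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF, no smoothing)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency within the document * inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy reviews, only for illustration
reviews = ["great product great price", "bad product", "great value"]
weights = tfidf(reviews)
for review, w in zip(reviews, weights):
    print(review, "->", w)
```

Terms that occur in few reviews (such as "bad") receive a higher weight than terms that occur in most of them (such as "product"), which is what makes the representation useful for a sentiment classifier.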

- Building a recommendation system

As you learnt earlier, you can use the following types of recommendation systems.
1. User-based recommendation system
2. Item-based recommendation system

Your task is to analyse the recommendation systems and select the one that is best suited in this case. 
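As a rough illustration of the user-based variant, the sketch below scores the products a user has not rated yet by summing other users' ratings, weighted by cosine similarity to the target user. It is a simplified, plain-Python sketch under stated assumptions (missing ratings treated as zero, no mean-centering), not the procedure used to build this project's rating matrix. The names `cosine`, `predict_scores` and the toy `ratings` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two {item: rating} dicts; missing ratings count as 0."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_scores(target, ratings):
    """Rank items the target user has not rated by similarity-weighted ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == target:
            continue
        sim = cosine(ratings[target], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[target]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy ratings: user -> {product: rating}
ratings = {
    "alice": {"p1": 5, "p2": 4},
    "bob": {"p1": 5, "p2": 5, "p3": 4},
    "carol": {"p1": 1, "p4": 5},
}
ranked = predict_scores("alice", ratings)
print(ranked)
```

Here "bob" rates products much like "alice", so his unseen product "p3" ranks above "carol"'s "p4". An item-based system applies the same idea to similarities between products instead of between users.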

Once you have the best-suited recommendation system, the next task is to recommend 20 products that a user is most likely to purchase based on the ratings.

You can use the `reviews_username` column of the dataset to identify your user.

- Improving the recommendations using the sentiment analysis model

Now, the next task is to link this recommendation system with the sentiment analysis model that was built earlier (recall that we asked you to select one ML model out of the four options). Once you recommend 20 products to a particular user using the recommendation engine, you need to narrow them down to the 5 best products based on the sentiment of the reviews of those 20 products.

In this way, you will get an ML model (for sentiments) and the best-suited recommendation system. 
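The filtering step described above can be sketched in a few lines of plain Python: for each recommended product, count how many of its reviews the sentiment model predicts as positive, convert that to a percentage and keep the products with the highest share. The function name `top_n_by_positive_share` and the toy predictions are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict


def top_n_by_positive_share(predictions, n=5):
    """predictions: iterable of (product_name, predicted_label), where 1 = positive."""
    positive = defaultdict(int)
    total = defaultdict(int)
    for name, label in predictions:
        positive[name] += label
        total[name] += 1
    # Percentage of reviews predicted positive, per product
    share = {name: round(100 * positive[name] / total[name], 2) for name in total}
    return sorted(share.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical per-review predictions for three recommended products
toy = [
    ("soap", 1), ("soap", 1), ("soap", 0),
    ("chips", 1), ("chips", 1),
    ("dvd", 0), ("dvd", 1),
]
top2 = top_n_by_positive_share(toy, n=2)
print(top2)
```

Note that a percentage computed from very few reviews is less reliable than one computed from thousands, so the review count is worth reporting next to the percentage.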



# **Improving the recommendations using the sentiment analysis model**

### Fine-Tuning the Recommendation System and Recommendation of Top 5 Products

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [3]:
import pickle

In [4]:
import warnings
warnings.filterwarnings("ignore")

#### Load the cleaned dataframe

In [5]:
clean_df = pd.read_pickle("savedData/preprocessed-dataframe.pkl")
clean_df_recommended = clean_df[['id','name','reviews_complete_text', 'user_sentiment']]

#### Load the user final rating

In [6]:
user_final_rating = pd.read_pickle("savedData/user_final_rating.pkl")

In [7]:
clean_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 29255 entries, 0 to 29999
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype              
---  ------                   --------------  -----              
 0   id                       29255 non-null  object             
 1   brand                    29255 non-null  object             
 2   categories               29255 non-null  object             
 3   manufacturer             29255 non-null  object             
 4   name                     29255 non-null  object             
 5   reviews_date             29255 non-null  datetime64[ns, UTC]
 6   reviews_rating           29255 non-null  int64              
 7   reviews_text             29255 non-null  object             
 8   reviews_title            29255 non-null  object             
 9   reviews_username         29255 non-null  object             
 10  user_sentiment           29255 non-null  int64              
 11  reviews_preprocess_text  29255 no

#### Load the Vectorizer

In [8]:
# The context manager closes the file even if unpickling fails
with open("savedData/tfidf-vectorizer.pkl", 'rb') as file:
    vectorizer = pickle.load(file)

#### Load the Classification model

In [9]:
# Logistic regression trained with SMOTE
with open("savedData/models/logistic_regression_20241114-105215.pkl", 'rb') as file:
    lr_smote_obj = pickle.load(file)

# XGBoost
with open("savedData/models/xgboost_classifier_20241114-105221.pkl", 'rb') as file:
    xg_obj = pickle.load(file)

# Random forest
with open("savedData/models/random_forest_classifier_20241114-111106.pkl", 'rb') as file:
    rf_obj = pickle.load(file)

In [10]:
user_input = "rebecca"

# **Task 6: Recommendation of Top 20 Products to a Specified User**

In [11]:
def get_top20_products_for_user(user):
    # Get the top 20 recommendations for this user from user_final_rating
    top20_reco = user_final_rating.loc[user].sort_values(ascending=False)[0:20]
    # Name the key column 'id' explicitly so the merge does not depend on the index name
    recommendations = pd.DataFrame({'id': top20_reco.index, 'similarity_score': top20_reco.values})
    # Attach product names; drop the duplicates created by one row per review
    result = pd.merge(recommendations, clean_df_recommended, on="id")[["id", "name", "similarity_score"]].drop_duplicates()

    return result

In [12]:
get_top20_products_for_user(user_input)

| | id | name | similarity_score |
|---|---|---|---|
| 0 | AVpfPaoqLJeJML435Xk9 | Godzilla 3d Includes Digital Copy Ultraviolet ... | 29.592094 |
| 3325 | AVpf2tw1ilAPnD_xjflC | Red (special Edition) (dvdvideo) | 10.672246 |
| 3994 | AVpe59io1cnluZ0-ZgDU | My Big Fat Greek Wedding 2 (blu-Ray + Dvd + Di... | 9.868508 |
| 4662 | AVpfJP1C1cnluZ0-e3Xy | Clorox Disinfecting Bathroom Cleaner | 9.466633 |
| 6701 | AVpe41TqilAPnD_xQH3d | Mike Dave Need Wedding Dates (dvd + Digital) | 8.450727 |
| 7458 | AVpfM_ytilAPnD_xXIJb | Tostitos Bite Size Tortilla Chips | 6.623078 |
| 7722 | AVpe8gsILJeJML43y6Ed | Pendaflex174 Divide It Up File Folder, Multi S... | 6.537746 |
| 8032 | AVpf5olc1cnluZ0-tPrO | Chester's Cheese Flavored Puffcorn Snacks | 5.489594 |
| 8204 | AVpf63aJLJeJML43F__Q | Burt's Bees Lip Shimmer, Raisin | 4.742484 |
| 9077 | AVpe6FfKilAPnD_xQmHi | Chex Muddy Buddies Brownie Supreme Snack Mix | 4.202777 |


# **Task 7: Fine-Tuning the Recommendation System and Recommendation of Top 5 Products**

### Fine tune and optimise the recommendation using the user-recommendation and classification model

In [13]:
def get_top5_user_recommendations(user, model):
    if user not in user_final_rating.index:
        print(f"User name {user} doesn't exist")
        return None

    # Get the top 20 recommendations for this user from user_final_rating
    top20_reco = list(user_final_rating.loc[user].sort_values(ascending=False)[0:20].index)
    # Collect every review of those products; copy so that adding a column does not modify a slice
    common_top20_reco = clean_df_recommended[clean_df_recommended['id'].isin(top20_reco)].copy()
    # Apply the TF-IDF vectorizer to the review text to get the format the model expects
    X = vectorizer.transform(common_top20_reco['reviews_complete_text'].values.astype(str))

    # Predict the sentiment of each review with the model passed in
    # (the model object is expected to expose set_test_data() and predict())
    model.set_test_data(X)
    common_top20_reco['sentiment_pred'] = model.predict()

    # Per product: number of reviews predicted positive and total number of reviews
    sent_df = common_top20_reco.groupby(by='name')['sentiment_pred'].agg(
        positive_sentiment_count='sum', total_sentiment_count='count')
    # Calculate the positive sentiment percentage
    sent_df['positive_sentiment_percent'] = np.round(
        sent_df['positive_sentiment_count'] / sent_df['total_sentiment_count'] * 100, 2)
    # Return the top 5 recommended products for the user
    return sent_df.sort_values(by='positive_sentiment_percent', ascending=False)[:5]

## Recommendations from `Logistic Regression SMOTE` and `User-User` filtering

In [14]:
get_top5_user_recommendations(user_input, lr_smote_obj)

| name | positive_sentiment_count | total_sentiment_count | positive_sentiment_percent |
|---|---|---|---|
| Clorox Disinfecting Bathroom Cleaner | 2010 | 2039 | 98.58 |
| Chester's Cheese Flavored Puffcorn Snacks | 165 | 172 | 95.93 |
| The Resident Evil Collection 5 Discs (blu-Ray) | 801 | 845 | 94.79 |
| Red (special Edition) (dvdvideo) | 621 | 669 | 92.83 |
| Jolly Time Select Premium Yellow Pop Corn | 25 | 27 | 92.59 |


## Recommendations from `XGBoost` and `User-User` filtering

In [15]:
get_top5_user_recommendations(user_input, xg_obj)

| name | positive_sentiment_count | total_sentiment_count | positive_sentiment_percent |
|---|---|---|---|
| Clorox Disinfecting Bathroom Cleaner | 2022 | 2039 | 99.17 |
| Chester's Cheese Flavored Puffcorn Snacks | 168 | 172 | 97.67 |
| Vaseline Intensive Care Lip Therapy Cocoa Butter | 153 | 158 | 96.84 |
| Chex Muddy Buddies Brownie Supreme Snack Mix | 28 | 29 | 96.55 |
| The Resident Evil Collection 5 Discs (blu-Ray) | 815 | 845 | 96.45 |


## Recommendations from `Random Forest` and `User-User` filtering

In [16]:
get_top5_user_recommendations(user_input, rf_obj)

| name | positive_sentiment_count | total_sentiment_count | positive_sentiment_percent |
|---|---|---|---|
| Clorox Disinfecting Bathroom Cleaner | 2014 | 2039 | 98.77 |
| Red (special Edition) (dvdvideo) | 648 | 669 | 96.86 |
| Chester's Cheese Flavored Puffcorn Snacks | 166 | 172 | 96.51 |
| Jolly Time Select Premium Yellow Pop Corn | 26 | 27 | 96.3 |
| The Resident Evil Collection 5 Discs (blu-Ray) | 813 | 845 | 96.21 |
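One quick way to compare the three result tables above is to look at how much their top-5 lists overlap. The sketch below does this with plain Python sets; the product names are transcribed from the tables, while the variable names are only illustrative.

```python
# Top-5 product names transcribed from the three result tables above
lr_top5 = {
    "Clorox Disinfecting Bathroom Cleaner",
    "Chester's Cheese Flavored Puffcorn Snacks",
    "The Resident Evil Collection 5 Discs (blu-Ray)",
    "Red (special Edition) (dvdvideo)",
    "Jolly Time Select Premium Yellow Pop Corn",
}
xgb_top5 = {
    "Clorox Disinfecting Bathroom Cleaner",
    "Chester's Cheese Flavored Puffcorn Snacks",
    "Vaseline Intensive Care Lip Therapy Cocoa Butter",
    "Chex Muddy Buddies Brownie Supreme Snack Mix",
    "The Resident Evil Collection 5 Discs (blu-Ray)",
}
rf_top5 = {
    "Clorox Disinfecting Bathroom Cleaner",
    "Red (special Edition) (dvdvideo)",
    "Chester's Cheese Flavored Puffcorn Snacks",
    "Jolly Time Select Premium Yellow Pop Corn",
    "The Resident Evil Collection 5 Discs (blu-Ray)",
}

# Products every model places in its top 5
common_to_all = lr_top5 & xgb_top5 & rf_top5
# Products only the XGBoost-based ranking surfaces
only_xgb = xgb_top5 - (lr_top5 | rf_top5)

print("Common to all three:", sorted(common_to_all))
print("Only in XGBoost top 5:", sorted(only_xgb))
print("LR and RF pick the same products:", lr_top5 == rf_top5)
```

On these tables, three products appear in every list, logistic regression and random forest select the same five products in a different order, and only the XGBoost-based ranking includes the Vaseline and Chex products.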


# **Our selection: `XGBoost` with `User-User` filtering**
#### **This is our final top 5 recommendation for the user**

In [17]:
get_top5_user_recommendations(user_input, xg_obj)

| name | positive_sentiment_count | total_sentiment_count | positive_sentiment_percent |
|---|---|---|---|
| Clorox Disinfecting Bathroom Cleaner | 2022 | 2039 | 99.17 |
| Chester's Cheese Flavored Puffcorn Snacks | 168 | 172 | 97.67 |
| Vaseline Intensive Care Lip Therapy Cocoa Butter | 153 | 158 | 96.84 |
| Chex Muddy Buddies Brownie Supreme Snack Mix | 28 | 29 | 96.55 |
| The Resident Evil Collection 5 Discs (blu-Ray) | 815 | 845 | 96.45 |
