
***Problem Statement***

The e-commerce business is quite popular today. Here, you do not need to take orders by going to each customer. A company launches its website to sell the items to the end consumer, and customers can order the products that they require from the same website. Famous examples of such e-commerce companies are Amazon, Flipkart, Myntra, Paytm and Snapdeal.

Suppose you are working as a Machine Learning Engineer/Data Scientist in an e-commerce company named 'Door-Dash'. Door-Dash has captured a huge market share in many fields, and it sells the products in various categories such as household essentials, books, personal care products, medicines, cosmetic items, beauty products, electrical appliances, kitchen and dining products and health care products.

As technology advances, Door-Dash must grow quickly to become a major player in the e-commerce market, because it has to compete with established market leaders such as Amazon and Flipkart.

As a Data Scientist/ML Engineer, you are asked to build a model that will improve the recommendations given to users based on their past reviews and ratings of various products.

To do this, you plan to build a sentiment-based (NLP) product recommendation system, which includes the following tasks:

Data sourcing and sentiment analysis, building a recommendation system, and improving the recommendations using the sentiment analysis model.


***High Level Overview***

* Data Sourcing
* Sentiment Analysis
* Building a recommendation system
* Improving the recommendation system using sentiment analysis model


In [2]:
""" Install General Purpose Libraries"""
# pip install -U scikit-learn scipy matplotlib pandas numpy seaborn

' Install General Purpose Libraries'

In [15]:
""" Import General Purpose Libraries"""
import re    #regex
import pandas as pd #pandas library for dataframe handling
import numpy as np #numpy library vector calculation
import seaborn as sns # plotting library
import matplotlib.pyplot as plt #plotting library
from collections import Counter 
from datetime import datetime
import warnings 
import pickle #Object serialization (saving models to disk)

# Import Pre-Processing Tools
from imblearn.over_sampling import SMOTE #Over sampling
warnings.filterwarnings("ignore") 

# Set Pandas options
pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 300)
pd.set_option("display.precision", 2)

# import image module
from IPython.display import Image

#Helper Functions
# from utils import (
#     clean_stopwords,
#     clean_punctuation, 
#     calc_missing_rowcount    
# )



In [3]:
""" Install nltk libraries for sentiment analysis"""
#pip install -U nltk
# Note: ssl is part of the Python standard library and does not need to be installed with pip

' Install nltk libraries for sentiment analysis'

In [None]:
# nltk packages
import nltk #Natural Language Toolkit
import ssl #SSL wrapper for socket objects

# Work around SSL certificate errors when downloading nltk data.
# Note: this disables certificate verification, so use it only on a trusted network.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')



In [5]:
""" nltk libraries"""
from nltk.corpus import stopwords 
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet as wn

In [6]:
""" ML Modelling Libraries"""

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer #Converting text to vectors
from sklearn.linear_model import LogisticRegression #Classification Algorithm
from sklearn.ensemble import RandomForestClassifier #Ensemble Algorithm
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score #Evaluation Package

In [7]:
import sklearn
print(sklearn.__version__) #1.2.1
print(np.__version__) #1.24.2
print(pd.__version__) #1.5.3
print(nltk.__version__) #3.7



1.2.1
1.24.2
1.5.3
3.7


#### Load the Dataset
User reviews are collected and stored in CSV format, each with a date timestamp.

In [8]:
# Load the dataset
import os
cwd = os.getcwd() #gets the present working directory
# df_reviews = pd.read_csv(cwd + "/user_reviews.csv", parse_dates= ['reviews_date'])
# df_reviews.shape


In [None]:
#sample the dataset


In [None]:
#Inspect the dataframe to understand the given data


*   Some columns are not needed for the sentiment and recommender models. We will inspect them first and drop them during data cleaning.



#### Exploratory Data Analysis (EDA) - Data Cleaning and Pre-Processing

In [9]:
# write function to calculate missing row count
def calc_missing_rowcount(df):
    pass
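
One possible shape for `calc_missing_rowcount`, offered as a minimal sketch rather than the intended solution. The output column names (`missing_count`, `missing_pct`) and the toy dataframe are assumptions made for illustration; only the function name comes from this notebook.

```python
import pandas as pd

def calc_missing_rowcount(df):
    """Return per-column missing-value counts and percentages, worst first."""
    missing_count = df.isnull().sum()
    missing_pct = (missing_count / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing_count": missing_count,
                            "missing_pct": missing_pct})
    # Keep only the columns that actually contain missing values
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

# Toy dataframe with hypothetical values, standing in for the reviews data
toy = pd.DataFrame({
    "reviews_title": ["Great", None, "Bad", None],
    "reviews_username": ["a", "b", None, "d"],
    "reviews_rating": [5, 4, 1, 2],
})
missing_summary = calc_missing_rowcount(toy)
print(missing_summary)
```

Returning a dataframe instead of printing inside the function keeps the result reusable, for example to decide which columns to drop.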

In [12]:
# Drop columns that don't add any value to the overall goal
# df_clean = 

#### Handling NULL values in reviews_title

#### Handling NULL values in reviews_username

#### Handling NULL values in user_sentiment


#### Target Analysis
Identify the target column and label the output column.
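
As a rough sketch of labelling the target, the snippet below assumes that `user_sentiment` holds the strings `Positive` and `Negative`; the actual values are not shown in this notebook, so treat both the labels and the toy data as hypothetical.

```python
import pandas as pd

# Hypothetical sample; the real column values may differ
toy = pd.DataFrame({"user_sentiment": ["Positive", "Negative", "Positive", None]})

# Drop rows without a label, then encode the target as 1 (positive) / 0 (negative)
labelled = toy.dropna(subset=["user_sentiment"]).copy()
labelled["user_sentiment"] = labelled["user_sentiment"].map({"Positive": 1, "Negative": 0})
print(labelled["user_sentiment"].value_counts())
```

Dropping unlabelled rows is only one option; imputing the label from `reviews_rating` would be another reasonable choice.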

**Analyze a single company for in-depth intuition**

In [25]:
"""
Updating the user_sentiment so that it is relevant to the model
Part of pre-processing
"""


'\nUpdating the user_sentiment so that it is relevant to the model\nPart of pre-processing\n'

#### Training Analysis

Checking Distribution of reviews_rating column

##### Checking Top 5 Brands with negative reviews

##### Type Conversion

Before we start the pre-processing steps, we need to make sure that all the text columns are converted to string type for future text operations.

In [None]:
# Convert all the text columns to string for performing text operations
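
A minimal sketch of the conversion, assuming the text columns are `reviews_title` and `reviews_text` (the list of columns and the toy values are illustrative). Missing values are filled first because `astype(str)` would otherwise turn them into the literal string `"nan"`.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviews_title": ["Great", None],
    "reviews_text": ["Works well", "Broke in a week"],
    "reviews_rating": [5, 1],
})

text_columns = ["reviews_title", "reviews_text"]  # assumed list of text columns
# Fill missing values before casting so they become empty strings, not "nan"
toy[text_columns] = toy[text_columns].fillna("").astype(str)
print(toy.dtypes)
```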

#### Pre-Processing

In [None]:
# Get a copy of dataframe for pre-processing



Combining reviews_text and reviews_title columns into reviews_combined and dropping the initial fields
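
The combination step could look roughly like this. The separator (a single space) and the toy rows are assumptions; the column names are the ones mentioned above.

```python
import pandas as pd

toy = pd.DataFrame({
    "reviews_title": ["Great", None],
    "reviews_text": ["Works well", "Broke in a week"],
})

# Treat a missing title or text as empty so the concatenation never yields NaN
title = toy["reviews_title"].fillna("")
text = toy["reviews_text"].fillna("")
toy["reviews_combined"] = (title + " " + text).str.strip()

# Drop the original fields once they have been merged
toy = toy.drop(columns=["reviews_title", "reviews_text"])
print(toy)
```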




Creating dataframe for Sentiment analysis with only the required columns




Handling punctuation
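
One way to implement a `clean_punctuation` helper (the name appears in the commented-out `utils` import; the body here is a hypothetical sketch). It lowercases the text, replaces ASCII punctuation with spaces, and collapses repeated whitespace.

```python
import re
import string

# Character class matching any ASCII punctuation character
_PUNCT_RE = re.compile(f"[{re.escape(string.punctuation)}]")

def clean_punctuation(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = _PUNCT_RE.sub(" ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_punctuation("Great product!!! Works well, isn't it?")
print(cleaned)
```

Replacing punctuation with a space rather than deleting it avoids gluing words together (for example `well,it` becoming `wellit`), at the cost of splitting contractions such as `isn't` into two tokens.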




Handling Lemmatization


Create Wordcloud

Get Most Common Words

Get n-gram frequency of words
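
Word and n-gram frequencies can be computed with `collections.Counter`, which is already imported in this notebook. The sketch below assumes the reviews have been cleaned and can be tokenized by whitespace; `ngram_counts` and the sample reviews are hypothetical.

```python
from collections import Counter

def ngram_counts(texts, n=2):
    """Count n-grams (as tuples of tokens) across an iterable of strings."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # zip over n shifted views of the token list to form the n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts

reviews = ["great product works well", "great product fast delivery", "works well"]
top_words = ngram_counts(reviews, n=1).most_common(3)
top_bigrams = ngram_counts(reviews, n=2).most_common(2)
print(top_words)
print(top_bigrams)
```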

#### Feature Extraction

In this part, we will perform the following steps:

1. Feature extraction using TF-IDF

2. Check for and handle class imbalance

3. Perform the train-test split

Feature Extraction using TF-IDF

Convert the raw texts to a matrix of TF-IDF features.

max_df is used for removing terms that appear too frequently, also known as "corpus-specific stop words". For example, max_df = 0.95 means "ignore terms that appear in more than 95% of the reviews".

min_df is used for removing terms that appear too infrequently. For example, min_df = 5 means "ignore terms that appear in fewer than 5 reviews".
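
To see how the two thresholds prune the vocabulary, here is a small sketch on a four-document toy corpus. The corpus is made up, and the thresholds are scaled down (`min_df=2`, `max_df=0.75`) because values like 5 and 0.95 only make sense on a full dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "great product works well",
    "great product fast delivery",
    "product broke after a week",
    "product works fine",
]

# min_df=2: keep terms found in at least 2 documents
# max_df=0.75: drop terms found in more than 75% of documents
vectorizer = TfidfVectorizer(min_df=2, max_df=0.75)
X = vectorizer.fit_transform(corpus)
vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X.shape)
```

Here `product` appears in every document and is removed by `max_df`, while terms that occur only once are removed by `min_df`, leaving `great` and `works`.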





Train - Test Split
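
A sketch of a stratified split, using synthetic arrays in place of the TF-IDF matrix and sentiment labels. The 75/25 ratio and `random_state` are assumptions; `stratify=y` keeps the class proportions the same in both parts, which matters when one sentiment dominates.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 20 samples, 80% positive (1) and 20% negative (0)
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 16 + [0] * 4)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```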




Class Imbalance


SMOTE


Model Building

We will build the following two ML models to predict sentiment from the text and title of the reviews, and compare their performance:

**Logistic Regression**

**Random Forest Classifier**



In [27]:
# Evaluation score function
def evaluation_scores(classifier, X_test, y_test):
    pass
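
A possible body for `evaluation_scores`, sketched with the metrics imported earlier in this notebook. The choice of returned keys and the synthetic data are assumptions, and the function assumes a binary classifier that supports `predict_proba`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

def evaluation_scores(classifier, X_test, y_test):
    """Print a classification report and return headline metrics as a dict."""
    y_pred = classifier.predict(X_test)
    y_prob = classifier.predict_proba(X_test)[:, 1]  # probability of class 1
    print(classification_report(y_test, y_pred))
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }

# Synthetic data standing in for the TF-IDF features and sentiment labels
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
clf = LogisticRegression(solver="liblinear").fit(X_train, y_train)
scores = evaluation_scores(clf, X_test, y_test)
```

With imbalanced sentiment labels, accuracy alone can be misleading, which is why the sketch also reports ROC AUC and the confusion matrix.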

#### Log-Reg Model

##### Hyperparameter Tuning

In [28]:
logreg_grid = {"C": [100, 10, 5, 4, 3, 2, 1, 0.1, 0.01],
               "solver": ["liblinear"]}

In [29]:
## Setup grid hyperparameter search for LogisticRegression
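
The grid search could be set up along these lines. This is a sketch on synthetic data with a shortened grid to keep it fast; the number of folds (`cv=5`) and the scoring metric (`roc_auc`) are assumptions, not choices stated in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the resampled TF-IDF training set
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Shortened version of the grid, for illustration only
small_logreg_grid = {"C": [10, 1, 0.1], "solver": ["liblinear"]}

grid_search = GridSearchCV(
    LogisticRegression(),
    param_grid=small_logreg_grid,
    cv=5,
    scoring="roc_auc",
)
grid_search.fit(X, y)
best_params = grid_search.best_params_
print(best_params, round(grid_search.best_score_, 3))
```

After fitting, `grid_search.best_estimator_` is already refit on the full training data and can be passed directly to an evaluation function.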

In [None]:
# Checking the best parameters

**HyperParameter Tuned Log-Reg Model**

In [30]:
# Getting the scores of the tuned model

#### Random Forest Classifier

##### Hyperparameter Tuning RF Classifier
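
Because a random-forest search space grows quickly, an exhaustive grid search can be expensive; `RandomizedSearchCV` samples a fixed number of combinations instead. The sketch below uses synthetic data and a deliberately reduced grid (`small_rf_grid`, a hypothetical name) so that it runs in seconds; `n_iter`, `cv` and the scoring metric are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the resampled TF-IDF training set
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# Reduced search space, for illustration only
small_rf_grid = {
    "n_estimators": np.arange(10, 60, 20),
    "max_depth": np.arange(10, 50, 20),
    "min_samples_split": np.arange(15, 60, 15),
    "min_samples_leaf": np.arange(5, 20, 5),
}

rf_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=small_rf_grid,
    n_iter=5,           # number of sampled parameter combinations
    cv=3,
    scoring="roc_auc",
    random_state=42,
)
rf_search.fit(X, y)
best_rf_params = rf_search.best_params_
print(best_rf_params)
```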

In [None]:
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": np.arange(10, 50, 5),
           "min_samples_split": np.arange(15, 500, 15),
           "min_samples_leaf": np.arange(5, 50, 5)}

##### Model Evaluation

**Save Sentiment Model**

In [31]:
#Save a pickle file
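
Saving and reloading with `pickle` follows the same pattern for the sentiment model and the TF-IDF vectorizer. In this sketch a plain dictionary stands in for the fitted object, and the file name `sentiment_model.pkl` and the temporary directory are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model or vectorizer; any picklable object works the same way
model = {"name": "logreg", "C": 1.0}

path = os.path.join(tempfile.gettempdir(), "sentiment_model.pkl")

# Write the object in binary mode
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back to confirm the round trip
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Pickled scikit-learn objects should be loaded with the same library version that saved them, and pickle files from untrusted sources should never be loaded.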

**Save Tfidf-vectorizer**

**Save Cleaned Data**