### The project implements supervised machine learning to predict a rating from review text.
<br>
- Supervised ML is used because the dataset contains a target variable (`review_rating`).<br>
- Predicting the rating is a multi-class classification problem, so two classifiers, SVM and Naive Bayes, are trained and used to make predictions.

In [1]:
# dependencies
!pip install scikit-learn
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git
!pip install spacy==2.2.3
!python -m spacy download en_core_web_sm
!pip install beautifulsoup4==4.9.1
!pip install textblob==0.15.3
!pip install emoji

Collecting scikit-learn
  Downloading https://files.pythonhosted.org/packages/bf/6c/31c623a656ab66e938b2432cb8e41dc5497dec670ff07e1e74989ee0ab77/scikit_learn-0.24.2-cp38-cp38-macosx_10_13_x86_64.whl (7.2MB)
Collecting joblib>=0.11 (from scikit-learn)
  Downloading https://files.pythonhosted.org/packages/55/85/70c6602b078bd9e6f3da4f467047e906525c355a4dacd4f71b97a35d9897/joblib-1.0.1-py3-none-any.whl (303kB)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Downloading https://files.pythonhosted.org/packages/c6/e8/c216b9b60cbba4642d3ca1bae7a53daa0c24426f662e0e3ce3dc7f6caeaa/threadpoolctl-2.2.0-py3-none-any.whl
Installing collected packages: joblib, threadpoolctl, scikit-learn
Successfully installed joblib-1.0.1 scikit-learn-0.24.2 threadpoolctl-2.2.0
You should consider upgrading via the 'pip install --upgrade pip' command.


In [1]:
# read data from csv file
import pandas as pd
import numpy as np

df = pd.read_csv('reviews.csv')
df

Unnamed: 0,product_id,brand,price_usd,category,review_text,review_rating,review_title,shop,review_published
0,b42e0e8e-63f9-41c1-a6c1-4f7c24b6a501,By Terry,155.00,Makeup - Foundation - Mousse and Cream Foundation,I love this - just makes you look so healthy a...,5,Great,spacenk.com,2021-01-21
1,b42e0e8e-63f9-41c1-a6c1-4f7c24b6a501,By Terry,155.00,Makeup - Foundation - Mousse and Cream Foundation,Great coverage while allowing your skin to shi...,5,Beautiful coverage,mecca.com.au,2021-01-17
2,b42e0e8e-63f9-41c1-a6c1-4f7c24b6a501,By Terry,155.00,Makeup - Foundation - Mousse and Cream Foundation,Absolutely wonderful product. I use this as a ...,5,Lovely glowy base product!,spacenk.com,2021-01-14
3,b42e0e8e-63f9-41c1-a6c1-4f7c24b6a501,By Terry,155.00,Makeup - Foundation - Mousse and Cream Foundation,Amazing price and quick delivery My favourite ...,5,,catch.com.au,2021-01-12
4,b42e0e8e-63f9-41c1-a6c1-4f7c24b6a501,By Terry,155.00,Makeup - Foundation - Mousse and Cream Foundation,"So very expensive, but probably the best found...",5,Eclat Opulent,byterry.com,2021-01-08
...,...,...,...,...,...,...,...,...,...
999995,914bf776-64b8-44b9-8b95-6cbbcf8944f5,Paul Sebastian,9.64,Men - Shaving - Post Shave,"One of our favorite fragrances, at a great pri...",5,Great Fragrance!,zulily.com,2019-02-13
999996,914bf776-64b8-44b9-8b95-6cbbcf8944f5,Paul Sebastian,9.64,Men - Shaving - Post Shave,"This is one of my husband favorite cologne, so...",5,,Overstock.com,2019-02-10
999997,914bf776-64b8-44b9-8b95-6cbbcf8944f5,Paul Sebastian,9.64,Men - Shaving - Post Shave,all of the above reasons. Its my husbands favo...,5,Awesome fragrance,ebay.com,2019-02-06
999998,914bf776-64b8-44b9-8b95-6cbbcf8944f5,Paul Sebastian,9.64,Men - Shaving - Post Shave,great product and super fast shipping !!!!!!!!!,5,"great price, fast shipping !!!!!!!!!!!!!!!!!!!...",fragrancex.com,2019-02-06


In [2]:
# use only the first 500,000 rows to keep the computation manageable
sample_df = df.iloc[:500000]

In [3]:
sample_df.columns

Index(['product_id', 'brand', 'price_usd', 'category', 'review_text',
       'review_rating', 'review_title', 'shop', 'review_published'],
      dtype='object')

### Data cleaning and preprocessing
- Drop rows with empty review text
- Drop rows with duplicate review text
- Remove punctuation and convert all text to lowercase
- Remove numbers
- Remove extra whitespace
- Remove stop words (extremely common words such as "a", "and", and "are" that carry little analytic value)
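
The cleaning steps listed above can be sketched with the standard library alone. This is only an illustration, not the notebook's actual pipeline (which relies on `preprocess_kgptalkie` and the vectorizer's built-in stop-word list); the function name `basic_clean` and the tiny `STOP_WORDS` set are hypothetical.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "is", "the", "this", "to"}

def basic_clean(text: str) -> str:
    """Lowercase, strip punctuation and digits, collapse whitespace, drop stop words."""
    text = text.lower()
    # remove ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # remove numbers
    text = re.sub(r"\d+", "", text)
    # split() with no arguments also collapses runs of whitespace
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(basic_clean("This   foundation is GREAT, and lasts 12 hours!!"))
# foundation great lasts hours
```

Removing stop words at this stage is optional when the vectorizer is later given its own stop-word list, so a real pipeline would normally do it in one place only.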

In [5]:
# check for empty rows or null values
sample_df.isna().sum()

product_id               0
brand                    0
price_usd                0
category                 0
review_text              0
review_rating            0
review_title        234119
shop                     0
review_published         0
dtype: int64

In [6]:
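# drop rows with any null value (in this sample only review_title has nulls,
# so this removes every review that has no title)
sample_df = sample_df.dropna()

# drop rows whose review text duplicates an earlier row;
# copy() so the column assignment later does not act on a view
selected_df = sample_df.drop_duplicates(['review_text']).copy()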

In [None]:
# preprocess the review text with the preprocess_kgptalkie package
# https://github.com/laxmimerit/preprocess_kgptalkie

import preprocess_kgptalkie as ps
import emoji
import re

def remove_emoji(row):
    # emoji.get_emoji_regexp() was removed in emoji 2.0;
    # replace_emoji (emoji >= 2.0) strips emoji directly
    return emoji.replace_emoji(row, replace='')

def preprocess_data(row):
    # lowercase, drop backslashes, and treat underscores as spaces
    row = str(row).lower().replace('\\', '').replace('_', ' ')
    row = ps.cont_exp(row)               # expand contractions
    row = ps.remove_accented_chars(row)  # strip accents
    row = ps.remove_special_chars(row)   # drop special characters
    row = re.sub(r"(.)\1{2,}", r"\1", row)  # collapse runs of 3+ identical characters
    row = remove_emoji(row)
    return row

# apply the preprocessing function to every review
selected_df['review_text'] = selected_df['review_text'].apply(preprocess_data)
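
The `re.sub` call in `preprocess_data` collapses any character repeated three or more times into a single occurrence. A small standalone check (the helper name `collapse_repeats` is hypothetical) shows the effect, including a side effect worth knowing about: elongated words are not always restored to their dictionary spelling.

```python
import re

def collapse_repeats(text: str) -> str:
    # (.) captures one character; \1{2,} matches two or more further copies,
    # so any run of 3+ identical characters is replaced by a single one.
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(collapse_repeats("fast shipping !!!!!!!!!"))  # fast shipping !
print(collapse_repeats("sooo goood"))               # so god
print(collapse_repeats("good"))                     # good (runs of two are untouched)
```

Replacing with `r"\1\1"` instead would keep double letters ("goood" becomes "good"), at the cost of leaving "soo" for "sooo"; which trade-off is better depends on the data.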

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# build TF-IDF features (X) from the cleaned text; the ratings are the labels (y)
# note: the vectorizer is fitted on all rows before the split, so the test rows
# also influence the vocabulary and IDF weights
tfidf = TfidfVectorizer(max_features=20000, sublinear_tf=True, norm='l2', ngram_range=(1, 2), stop_words='english')

X = tfidf.fit_transform(selected_df['review_text'])
y = selected_df['review_rating']

# hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X.shape, y.shape

((242503, 20000), (242503,))
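
To make the `sublinear_tf=True` and `norm='l2'` settings more concrete, here is a pure-Python sketch of the weighting for a single document. It assumes scikit-learn's documented defaults (smoothed IDF with natural logarithms) and uses a made-up three-document corpus; it ignores n-grams, stop words, and `max_features`, and the function name `tfidf_vector` is hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already tokenised.
docs = [
    ["great", "product", "great", "price"],
    ["fake", "product"],
    ["great", "delivery"],
]

def tfidf_vector(doc, corpus):
    """Return {term: weight} for one document, L2-normalised."""
    n_docs = len(corpus)
    weights = {}
    for term, tf in Counter(doc).items():
        df = sum(term in d for d in corpus)            # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1    # smoothed IDF
        weights[term] = (1 + math.log(tf)) * idf       # sublinear term frequency
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(docs[0], docs)
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

With sublinear scaling, a term that appears twice is weighted `1 + ln(2)` times (roughly 1.7 times) as much as a term that appears once, rather than twice as much, which damps the effect of repeated words in long reviews.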

In [15]:
# import classifiers
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

def train_classifier(clf, X_train, X_test, y_train, y_test):
    # fit the model on the training split, then predict the test split
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return y_pred, clf

# map a display name to each (unfitted) classifier
model_clf = {
    'Naive Bayes': MultinomialNB(alpha=1.0, fit_prior=False),
    'svm': LinearSVC(C=20, class_weight='balanced'),
}

# train each classifier and print its report
for name, clf in model_clf.items():
    pred, clf = train_classifier(clf, X_train, X_test, y_train, y_test)

    # keep a reference to each fitted model for later use
    if name == 'svm':
        svm_clf = clf
    else:
        nb_clf = clf

    # get and print report
    report = classification_report(y_test, pred)
    print(f"\n For {name}:")
    #print(f"Confusion matrix:\n{confusion_matrix(y_test, pred)}")
    print(f"Report: \n{report}")


 For Naive Bayes:
Report: 
              precision    recall  f1-score   support

           1       0.41      0.71      0.52      3261
           2       0.23      0.21      0.22      2334
           3       0.27      0.35      0.30      3256
           4       0.32      0.48      0.38      6952
           5       0.90      0.71      0.80     32698

    accuracy                           0.63     48501
   macro avg       0.42      0.49      0.44     48501
weighted avg       0.71      0.63      0.66     48501


 For svm:
Report: 
              precision    recall  f1-score   support

           1       0.45      0.54      0.49      3261
           2       0.21      0.27      0.23      2334
           3       0.24      0.29      0.26      3256
           4       0.35      0.32      0.33      6952
           5       0.86      0.83      0.85     32698

    accuracy                           0.67     48501
   macro avg       0.42      0.45      0.43     48501
weighted avg       0.69      
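
The gap between the macro and weighted averages reflects class imbalance: about two thirds of the test reviews are 5-star (32698 of 48501). As a sanity check, the Naive Bayes averages can be recomputed from the per-class rows of its report above. Because the inputs are the rounded values printed in the report, agreement is only expected to two decimals.

```python
# Per-class F1 and support copied from the Naive Bayes report above.
f1 = {1: 0.52, 2: 0.22, 3: 0.30, 4: 0.38, 5: 0.80}
support = {1: 3261, 2: 2334, 3: 3256, 4: 6952, 5: 32698}

total = sum(support.values())

# macro average: every class counts equally
macro_f1 = sum(f1.values()) / len(f1)

# weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(total)                  # 48501
print(round(macro_f1, 2))     # 0.44
print(round(weighted_f1, 2))  # 0.66
```

The weighted figure is pulled up by the dominant 5-star class, so the macro average is the more informative number for judging how well ratings 1 to 4 are predicted.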



### Testing the classifiers on a sample text

In [20]:
# testing SVM
text = 'This product is a bit fake '
text = preprocess_data(text)
vec = tfidf.transform([text])
svm_clf.predict(vec)

array([2])

In [21]:
# testing NB
text = 'This product is a bit fake'
text = preprocess_data(text)
vec = tfidf.transform([text])
nb_clf.predict(vec)

array([4])