# Searching for the best algorithm to predict Amazon review ratings

In this notebook I'm going to analyze the performance of different Machine Learning algorithms to find the best model for this exercise. As in the other notebooks, I will begin by importing the libraries.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pymongo import MongoClient

from functions.nlp import cleaning_review
from functions.preproc import tfidf,svc_dimred
from functions.automl import BestClassifier

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import plot_confusion_matrix

client = MongoClient()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ordovas/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


I will load a subset of the database (stored in a MongoDB collection).

In [2]:
db = client.get_database("amazon")
# Define an alias for the books dataset
books = db.books
# Obtain a random sample from the dataset, selecting only a few fields
# (we will use only overall and reviewText, but I will load a few more just in case I
# want to play with more info...)
res = list(books.aggregate([
    { "$sample": { "size": 50000 }}
    ,{ "$project": {"id": "$_id", "_id": 0, "overall": 1, "reviewText": 1,"summary":1,"reviewerName":1}} 
]))
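# Converting to pandas DataFrame
df = pd.DataFrame(res)

# Drop records with missing fields and renumber the rows
# (drop=True avoids keeping the old index as an extra column)
df = df.dropna()
df = df.reset_index(drop=True)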

Now I will clean the text of the reviews (removing punctuation and stop words) and create the TF-IDF vectors for the cleaned text. After that, I will reduce the dimensionality of the TF-IDF vectors to 500 dimensions (the optimal number, as seen in the other notebook).

In [3]:
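# Clean the review text (punctuation and stop words removed)
df["review_clean"] = df["reviewText"].apply(cleaning_review)
# Build the TF-IDF vectors for the cleaned text
df_tfidf = tfidf(df["review_clean"], 5)
# Fit the dimensionality reduction with 500 components and project the data
comps, var, svd_transformer = svc_dimred(df_tfidf, 500)
data_svd = svd_transformer.transform(df_tfidf)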
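
The helpers `tfidf` and `svc_dimred` live in `functions/preproc.py` and are not shown here. As a rough illustration of the same idea, the following is a minimal sketch using scikit-learn's `TfidfVectorizer` and `TruncatedSVD` on a toy corpus. It assumes the helpers wrap something similar, which this notebook does not confirm; the corpus, `min_df=1` and `n_components=2` are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus standing in for df["review_clean"] (already cleaned text)
docs = [
    "great book loved characters",
    "boring plot weak characters",
    "loved plot great ending",
    "weak ending boring book",
]

# min_df=1 because the corpus is tiny. Whether the 5 passed to tfidf() in the
# notebook is a min_df-style threshold is an assumption, not a known fact.
vectorizer = TfidfVectorizer(min_df=1)
X_tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# 2 components for the toy data; the notebook reduces to 500.
svd = TruncatedSVD(n_components=2, random_state=0)
X_svd = svd.fit_transform(X_tfidf)  # dense array, one row per document

print(X_tfidf.shape, X_svd.shape)
print("Explained variance kept:", svd.explained_variance_ratio_.sum())
```

`TruncatedSVD` works directly on sparse matrices, which is why it is a common choice for reducing TF-IDF features.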

Now let's find out which ML algorithm is best for this problem. The first thing we are going to do is split the dataset into train and test samples.

After that, I will use `BestClassifier`, a class that I created previously (see `functions/automl.py` to see how it works). It requires the input train features (X_train), the target (y_train) and a list of models whose performance will be analyzed. For each model it finds the best-fitting parameters, and then it stores the best-performing model among those tuned candidates. Like any other classifier, `BestClassifier` has `.fit(X,y)`, `.predict(X)`, `.predict_proba(X)` and `.score(X,y)` methods, which use the best model with its best-performing parameters.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    pd.DataFrame(data_svd), df["overall"], test_size=0.2, random_state=42)
clf = BestClassifier(X_train, y_train, ["LogisticRegression", "LinearSVC", "SGDClassifier",
                                        "KNeighborsClassifier", "RandomForestClassifier"])

Analyzing LogisticRegression


KeyboardInterrupt: 
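
The actual search logic is in `functions/automl.py` and is not reproduced in this notebook. To illustrate the general idea of tuning each candidate and keeping the best one, here is a minimal sketch built on scikit-learn's `GridSearchCV`. It is not the implementation of `BestClassifier`: the synthetic data, the `candidates` dictionary and the parameter grids are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 3-class data standing in for the SVD features and the ratings
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Hypothetical candidates: name -> (estimator, parameter grid)
candidates = {
    "LogisticRegression": (LogisticRegression(max_iter=1000),
                           {"C": [0.1, 1.0, 10.0]}),
    "KNeighborsClassifier": (KNeighborsClassifier(),
                             {"n_neighbors": [3, 5, 11]}),
}

best_name, best_search = None, None
for name, (model, grid) in candidates.items():
    # Tune this model with 3-fold cross-validation on the training data only
    search = GridSearchCV(model, grid, cv=3)
    search.fit(X_train, y_train)
    # Keep the model whose best cross-validated score is highest
    if best_search is None or search.best_score_ > best_search.best_score_:
        best_name, best_search = name, search

best_model = best_search.best_estimator_  # refit on all of X_train
test_score = best_model.score(X_test, y_test)
print(best_name, best_search.best_params_)
print("Test accuracy:", test_score)
```

Selecting on the cross-validated score rather than on the test score keeps the test sample untouched for the final evaluation.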

In [None]:
clf.score(X_train,y_train),clf.score(X_test,y_test)

In [None]:
print(clf.bestclassifier_)

In [None]:
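# Distribution of the rating error (real - predicted) as a percentage of each sample.
# The differences are integers, so the left edge of each bin is the error value.
counts, bins = np.histogram(y_train - clf.predict(X_train), bins=np.arange(-4, 6))
plt.plot(bins[:-1], counts * 100 / counts.sum(), label="Train data")
plt.grid()
counts, bins = np.histogram(y_test - clf.predict(X_test), bins=np.arange(-4, 6))
plt.plot(bins[:-1], counts * 100 / counts.sum(), label="Test data")
plt.xlabel("Real Rating - Predicted Rating")
plt.ylabel("Percentage")
plt.legend();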
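
To make the quantity in this plot concrete, the sketch below computes the same percentage-per-error distribution on a handful of made-up ratings. The arrays `y_true` and `y_pred` are hypothetical, and the bins here are centered on the integer errors, which is a slightly different (but equivalent for integer ratings) choice from the cell above.

```python
import numpy as np

# Hypothetical real and predicted ratings on the 1-5 scale
y_true = np.array([5, 4, 5, 3, 1, 2, 5, 4])
y_pred = np.array([5, 5, 4, 3, 3, 2, 5, 1])

residuals = y_true - y_pred        # 0 means the rating was predicted exactly
values = np.arange(-4, 5)          # possible errors: -4 ... 4
edges = np.arange(-4.5, 5.5)       # one bin centered on each integer error

counts, _ = np.histogram(residuals, bins=edges)
percent = counts * 100 / counts.sum()

for value, pct in zip(values, percent):
    print(f"error {value:+d}: {pct:5.1f}%")
```

A distribution that peaks sharply at 0 and stays roughly symmetric suggests the model is not systematically over- or under-rating.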

In [None]:
print("Train subsample:")
plot_confusion_matrix(clf, X_train, y_train,cmap="viridis",normalize="true") 
plt.show();
print("Test subsample:")
plot_confusion_matrix(clf, X_test, y_test,cmap="viridis",normalize="true")
plt.show();
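
With `normalize="true"`, each row of the confusion matrix is divided by the number of samples of that true class, so the diagonal reads as per-class recall. The sketch below shows this on a few made-up labels using `confusion_matrix`, which is imported at the top of this notebook; the label lists are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted ratings (only three rating values for brevity)
y_true = [1, 1, 2, 2, 2, 5, 5, 5, 5]
y_pred = [1, 2, 2, 2, 5, 5, 5, 5, 1]
labels = [1, 2, 5]

# Rows are true ratings, columns are predicted ratings;
# normalize="true" makes every row sum to 1.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

print(np.round(cm, 2))
print("Per-class recall:", np.round(np.diag(cm), 2))
```

Row normalization is useful here because review ratings are usually very imbalanced, and raw counts would be dominated by the most frequent rating.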