# Amazon Reviews Category Classification

In this project, we use Amazon reviews data, made available by [Keith Galli](https://github.com/KeithGalli/pycon2020), together with NLP techniques to build a category classification model. This report is divided into four parts:

* Setting up the data
* Training the model
* Evaluating the model
* Conclusion

## Setting up the data

We start by loading the necessary packages and the data. The data was previously stored in an S3 bucket and is already split into two datasets, one for training and one for testing. The AWS CLI is already installed and set up locally.

In [109]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score
from sklearn import svm
import json
import boto3

bucket_name = 'melo-datascience-projects'
training_path = 'nlp/amazon-reviews/training/'
testing_path = 'nlp/amazon-reviews/test/'
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)
training_files = [file.key.replace(training_path, '') for file in bucket.objects.filter(Prefix=training_path)]
training_files

['train_Automotive.json',
 'train_Beauty.json',
 'train_Books.json',
 'train_Clothing.json',
 'train_Digital_Music.json',
 'train_Electronics.json',
 'train_Grocery.json',
 'train_Patio_Lawn_Garden.json',
 'train_Pet_Supplies.json']

In [107]:
testing_files = [file.key.replace(testing_path, '') for file in bucket.objects.filter(Prefix=testing_path)]
testing_files

['test_Automotive.json',
 'test_Beauty.json',
 'test_Books.json',
 'test_Clothing.json',
 'test_Digital_Music.json',
 'test_Electronics.json',
 'test_Grocery.json',
 'test_Patio_Lawn_Garden.json',
 'test_Pet_Supplies.json']

As we can see, both the training and testing datasets are split across several files. Each file contains the reviews for one category, and that category is the prediction target of our model. The file names are self-descriptive. Let us read the files.
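Since the category comes only from the file name, it is worth extracting it carefully. The sketch below is one possible way to do that; the helper name `category_from_file` and the `train_nutrition.json` example are hypothetical and not part of the original notebook. It slices the prefix off explicitly, because `str.strip` removes a set of characters from both ends rather than a prefix.

```python
from pathlib import PurePosixPath


def category_from_file(file_name: str, prefix: str) -> str:
    """Return the category encoded in a name such as 'train_Books.json'."""
    stem = PurePosixPath(file_name).stem  # drop the '.json' extension
    # Remove the prefix by slicing; str.strip(prefix) would instead remove
    # any of the prefix's characters from both ends of the string.
    return stem[len(prefix):] if stem.startswith(prefix) else stem


print(category_from_file('train_Patio_Lawn_Garden.json', 'train_'))
print(category_from_file('test_Digital_Music.json', 'test_'))

# A hypothetical file name where character stripping would go wrong:
print(category_from_file('train_nutrition.json', 'train_'))
print('train_nutrition.json'.strip('train_').split('.')[0])
```

The last two lines print different values for the same hypothetical file, which is why slicing is the safer choice even though the nine category names in this dataset are not affected.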

In [112]:
train_df = pd.DataFrame()
for file in training_files:
    obj = s3.Object(bucket_name, training_path+file).get()['Body'].read()
    train_df_aux = pd.read_json(obj.decode(), lines=True)
    # Slice off the 'train_' prefix; str.strip('train_') strips characters, not a prefix
    train_df_aux['category'] = file[len('train_'):].split('.')[0]
    train_df = pd.concat([train_df_aux, train_df], ignore_index=True)

train_df.tail()

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22495 | 5 | True | 04 22, 2016 | A34IOVL7T9YYSF | B0002H335A | {'Size:': ' 3 Ton'} | JAF | Exactly What I expected, delivered on time. Hi... | Highly Recommend. | 1461283200 | | | Automotive |
| 22496 | 5 | True | 03 27, 2018 | A1LTFR6UKP7N3Q | B000CITK8S | {'Size:': ' 12V @ 750mA', 'Color:': ' Black/Gr... | Bryan Hargrave | Got this thing taking care of my lawnmower bat... | ... care of my lawnmower battery and it is wor... | 1522108800 | | | Automotive |
| 22497 | 5 | True | 07 22, 2016 | A10DPAG6XHKI25 | B0006I0MVS | {'Size:': ' 1 inch Shaft', 'Color:': ' Beige'} | buyer | Great replacement for old brittle handles. As ... | Great replacement for old brittle handles | 1469145600 | | | Automotive |
| 22498 | 5 | True | 10 11, 2016 | A20JW9PWVAENUP | B000BUU5VS | {'Style:': ' Slide Out Lube'} | michael owens | great | great | 1476144000 | | | Automotive |
| 22499 | 5 | True | 07 23, 2015 | A30DC6PH6NJQL9 | B00017YYI6 | | Susan | My husband likes it, he can fill air bags goin... | Five Stars | 1437609600 | | | Automotive |


In [114]:
test_df = pd.DataFrame()
for file in testing_files:
    obj = s3.Object(bucket_name, testing_path+file).get()['Body'].read()
    test_df_aux = pd.read_json(obj.decode(), lines=True)
    # Slice off the 'test_' prefix; str.strip('test_') strips characters, not a prefix
    test_df_aux['category'] = file[len('test_'):].split('.')[0]
    test_df = pd.concat([test_df_aux, test_df], ignore_index=True)

test_df.tail()

| | overall | verified | reviewTime | reviewerID | asin | reviewerName | reviewText | summary | unixReviewTime | style | vote | image | category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4495 | 5 | True | 08 26, 2015 | A1SRL9RS46YFNY | B0000AXS24 | pedro vazquez | Good productos | Five Stars | 1440547200 | {'Color:': ' Black'} | | | Automotive |
| 4496 | 5 | True | 08 7, 2017 | AYNPB2QQS2QUF | B000CONU1K | dino77dan | Easy to install. Highly re ommended. | Good quality locks. | 1502064000 | | | | Automotive |
| 4497 | 5 | True | 11 1, 2015 | A28YJ3RXJM0R5P | B00062YCGA | Michael | Awesome price for such good oil! | Five Stars | 1446336000 | {'Size:': ' 5 Quart'} | | | Automotive |
| 4498 | 5 | True | 08 23, 2010 | AJU1HQWCM13Y8 | B000CPAEJA | John Dowdell | High quality CV joint grease. I use this greas... | High quality CV joint grease | 1282521600 | {'Item Package Quantity:': ' 1', 'Package Quan... | 8.0 | | Automotive |
| 4499 | 4 | True | 07 5, 2014 | A2G7YC89EHUP2G | B000C9WL6A | review | Well made | Four Stars | 1404518400 | {'Item Package Quantity:': ' 1', 'Package Quan... | | | Automotive |
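The loading loops above read one JSON Lines file per category, tag each frame with its category, and concatenate the results. A minimal sketch of the same pattern without S3 is shown here, assuming made-up in-memory records in place of the bucket objects; it also collects the frames in a list and concatenates once, which is a common alternative to growing a DataFrame inside the loop.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form (one JSON object per line).
# They stand in for the body of an S3 object and are not real reviews.
raw = (
    '{"reviewText": "Great oil filter", "overall": 5}\n'
    '{"reviewText": "Works as described", "overall": 4}\n'
)

frames = []
for category, payload in [("Automotive", raw), ("Pet_Supplies", raw)]:
    # lines=True tells pandas to parse one JSON object per line.
    part = pd.read_json(io.StringIO(payload), lines=True)
    part["category"] = category
    frames.append(part)

# A single concat at the end avoids copying the growing frame on every pass.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Note that this variant keeps the files in listing order, whereas the notebook's loop prepends each new frame, which is why `tail()` above shows the first category, Automotive.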


We have 22,500 rows for training and 4,500 for testing. Of the many columns, we will use only two: `reviewText` and `category`. Let us drop the other columns and then check the category proportions.

In [115]:
columns = ['reviewText', 'category']
train_df = train_df[columns]
test_df = test_df[columns]

train_df.tail()

| | reviewText | category |
|---|---|---|
| 22495 | Exactly What I expected, delivered on time. Hi... | Automotive |
| 22496 | Got this thing taking care of my lawnmower bat... | Automotive |
| 22497 | Great replacement for old brittle handles. As ... | Automotive |
| 22498 | great | Automotive |
| 22499 | My husband likes it, he can fill air bags goin... | Automotive |


In [116]:
test_df.tail()

| | reviewText | category |
|---|---|---|
| 4495 | Good productos | Automotive |
| 4496 | Easy to install. Highly re ommended. | Automotive |
| 4497 | Awesome price for such good oil! | Automotive |
| 4498 | High quality CV joint grease. I use this greas... | Automotive |
| 4499 | Well made | Automotive |


In [118]:
train_df.groupby('category')['reviewText'].count()

category
Automotive           2500
Beauty               2500
Books                2500
Clothing             2500
Digital_Music        2500
Electronics          2500
Grocery              2500
Patio_Lawn_Garden    2500
Pet_Supplies         2500
Name: reviewText, dtype: int64

In [121]:
test_df.groupby('category')['reviewText'].count()

category
Automotive           500
Beauty               500
Books                500
Clothing             500
Digital_Music        500
Electronics          500
Grocery              500
Patio_Lawn_Garden    500
Pet_Supplies         500
Name: reviewText, dtype: int64
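The counts above can also be expressed as proportions, which makes the balance check independent of dataset size. The sketch below uses a small made-up DataFrame (`toy`) rather than the real data, and the `is_balanced` flag is a hypothetical convenience, not something the original notebook computes.

```python
import pandas as pd

# Made-up stand-in for train_df: two reviews in each of three categories.
toy = pd.DataFrame({
    "reviewText": ["a", "b", "c", "d", "e", "f"],
    "category": ["Books", "Books", "Beauty", "Beauty", "Grocery", "Grocery"],
})

# normalize=True returns each category's share of the rows instead of a count.
shares = toy["category"].value_counts(normalize=True)

# The classes are balanced when every category has the same share.
is_balanced = bool(shares.nunique() == 1)

print(shares)
print(is_balanced)
```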

The data is perfectly balanced: every category has 2,500 training reviews and 500 testing reviews. To finish this part, let us check for null or blank values.

In [122]:
train_df['reviewText'].isna().sum()

0

In [123]:
test_df['reviewText'].isna().sum()

0

In [125]:
sum(train_df['reviewText']=='')

0

In [126]:
sum(test_df['reviewText']=='')

0
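The comparison with `''` only catches strictly empty strings. A slightly stricter check, sketched below on made-up data, also counts reviews that contain nothing but whitespace; whether such reviews exist in this dataset is not established by the checks above, so treat this as an optional extra rather than a finding.

```python
import pandas as pd

# Made-up reviews: one empty string, one whitespace-only string, one missing value.
toy = pd.DataFrame({"reviewText": ["great", "", "   ", None, "ok"]})
text = toy["reviewText"]

# Missing values (None/NaN) are counted separately from blank strings.
n_missing = int(text.isna().sum())

# Strip surrounding whitespace first so that '   ' is treated like ''.
# Missing values stay missing after .str.strip() and compare unequal to ''.
n_blank = int(text.str.strip().eq("").sum())

print(n_missing, n_blank)
```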

## Training the model