## 602 Project
For the project you will perform sentiment analysis on reviews from Amazon.  The data to use is located here: http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Cell_Phones_and_Accessories_5.json.gz
Note that it is a gzipped JSON file.  If you'd like to download and extract it directly into Colab, this can be done using the following lines:
!curl http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Cell_Phones_and_Accessories_5.json.gz -o reviews.json.gz
!gunzip reviews.json.gz
To view the file:
!ls

At this point it should be easily ingested into a pandas data frame. A few relevant fields are:

- reviewText: The text from the review
- summary: The summary text from the review
- overall: The score the reviewer gave the item.
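
A minimal sketch of the ingestion step, assuming the file is in JSON Lines format (one review object per line), which is how `pd.read_json(..., lines=True)` is used later in this notebook. The two sample records and the temporary file name are made up for illustration. pandas can read a gzipped file directly, so the `gunzip` step is optional.

```python
import gzip
import json
import os
import tempfile

import pandas as pd

# Two made-up records that mimic the fields described above.
sample_records = [
    {"overall": 5, "reviewText": "Great case, fits well.", "summary": "Love it"},
    {"overall": 2, "reviewText": "Broke after a week.", "summary": "Disappointed"},
]

# Write them as gzipped JSON Lines (one JSON object per line).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_reviews.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for record in sample_records:
        handle.write(json.dumps(record) + "\n")

# lines=True parses one object per line; compression is stated explicitly here.
sample_df = pd.read_json(sample_path, lines=True, compression="gzip")
print(sample_df[["overall", "summary"]])
```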


Your task: 
Perform the necessary transformations to train both regression and classification models to predict the 'overall' field in the data set. This should include creating the correctly sized training and test sets.

When performing the classification task, use overall less than 3 as negative, greater than 3 as positive, and 3 as neutral. If you'd prefer a numeric value, 0 should be negative, 1 neutral and 2 positive.

You may wish to drop the columns image, style and vote.  You may also wish to drop duplicate data.

There are several options for using the summary and reviewText together, such as concatenating the strings, training separate models on each and feeding those results into another model, etc.  You may find that you don't want to use all fields (other than the ones I suggested dropping). I will let you experiment with this; just explain what you did and why.

You should certainly apply vectorization and perhaps a PCA or NMF as well.  Try at least three different classifiers/regressors.  Attempt to get the best possible result, and remember the different metrics we discussed for evaluating models.  Discuss which metric you think you should optimize for and why.  Pipelines and grid search will certainly help in optimizing your results!
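
One way the vectorization, dimensionality reduction, pipeline and grid search steps could fit together is sketched below. This is an illustration on a tiny made-up corpus, not the approach taken in the rest of this notebook (which uses Word2Vec). `TruncatedSVD` stands in for PCA because it works directly on the sparse TF-IDF output; that substitution, the step names and the parameter grid are all assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 0 = negative, 2 = positive (neutral omitted for brevity).
texts = [
    "great phone case love it", "excellent charger works great",
    "love this cable great value", "great screen protector excellent fit",
    "terrible case broke quickly", "awful charger stopped working",
    "bad cable terrible quality", "awful screen protector bad fit",
]
labels = [2, 2, 2, 2, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                            # text -> sparse TF-IDF matrix
    ("svd", TruncatedSVD(n_components=2, random_state=0)),   # reduce dimensionality
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training folds only during cross-validation, which keeps test-fold vocabulary from leaking into the features.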

Write a 5 page or less paper describing your work and the performance you were able to achieve.

### Imports :

!pip install  gensim

In [1]:
import warnings
warnings.filterwarnings("ignore")
import os
from datetime import datetime
import re
import json
import pandas as pd
import string
import nltk
from nltk.tokenize import sent_tokenize,word_tokenize 
import io 
from nltk.corpus import stopwords 
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import accuracy_score, log_loss
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.manifold import TSNE

unable to import 'smart_open.gcs', disabling that module


In [2]:
file = 'Cell_Phones_and_Accessories_5.json'

In [3]:
df = pd.read_json( file , lines= True )

In [4]:
df.head()

| | overall | verified | reviewTime | reviewerID | asin | style | reviewerName | reviewText | summary | unixReviewTime | vote | image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | True | 08 4, 2014 | A24E3SXTC62LJI | 7508492919 | {'Color:': ' Bling'} | Claudia Valdivia | Looks even better in person. Be careful to not... | Can't stop won't stop looking at it | 1407110400 | | |
| 1 | 5 | True | 02 12, 2014 | A269FLZCB4GIPV | 7508492919 | | sarah ponce | When you don't want to spend a whole lot of ca... | 1 | 1392163200 | | |
| 2 | 3 | True | 02 8, 2014 | AB6CHQWHZW4TV | 7508492919 | | Kai | so the case came on time, i love the design. I... | Its okay | 1391817600 | | |
| 3 | 2 | True | 02 4, 2014 | A1M117A53LEI8 | 7508492919 | | Sharon Williams | DON'T CARE FOR IT. GAVE IT AS A GIFT AND THEY... | CASE | 1391472000 | | |
| 4 | 4 | True | 02 3, 2014 | A272DUT8M88ZS8 | 7508492919 | | Bella Rodriguez | I liked it because it was cute, but the studs ... | Cute! | 1391385600 | | |


In [5]:
df.shape

(1128437, 12)

## Data Pre-Processing: 

In [6]:
df.drop( columns = [ 'image', 'style' , 'vote' ] , inplace=True )  #Dropping columns

In [7]:
df.isna().sum() # To check null values in Columns

overall             0
verified            0
reviewTime          0
reviewerID          0
asin                0
reviewerName      135
reviewText        765
summary           517
unixReviewTime      0
dtype: int64

In [8]:
df.dropna(inplace=True)

In [9]:
df[df.duplicated()].shape[0]        #Checking Duplicate Data 

3984

In [10]:
df.drop_duplicates(inplace=True)

In [11]:
df.describe()

| | overall | unixReviewTime |
|---|---|---|
| count | 1123089.0 | 1123089.0 |
| mean | 4.221692 | 1440653000.0 |
| std | 1.231564 | 45256830.0 |
| min | 1.0 | 1035331000.0 |
| 25% | 4.0 | 1416528000.0 |
| 50% | 5.0 | 1444435000.0 |
| 75% | 5.0 | 1470528000.0 |
| max | 5.0 | 1538438000.0 |


In [12]:
df['reviewTime'] = df['reviewTime'].apply(lambda x: datetime.strptime(str(x),'%m %d, %Y'))
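
As a possible alternative to the row-by-row `strptime` call, the same format string can be passed to `pd.to_datetime`, which parses the whole column in one call. The sketch below uses three date strings copied from the `reviewTime` column shown earlier; `sample_times` and `parsed` are illustrative names.

```python
import pandas as pd

# Sample values copied from the reviewTime column shown in df.head() above.
sample_times = pd.Series(["08 4, 2014", "02 12, 2014", "02 8, 2014"])

# One vectorized call instead of a Python-level strptime per row.
parsed = pd.to_datetime(sample_times, format="%m %d, %Y")
print(parsed)
```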

## Text Pre-Processing :

In [14]:
stop_words = set(stopwords.words('english'))   # Built once instead of once per review

def clean(line):          # Function to clean the text
    words = word_tokenize(line)                          # Splitting into words
    words = [word.lower() for word in words]             # All words to lowercase
    words = [word for word in words if word.isalpha()]   # Only alphabetic tokens are kept (punctuation, digits etc. removed)
    return [word for word in words if word not in stop_words]   # Removing stop words

In [15]:
df[ 'ReviewText' ] = df[ 'reviewText' ].str.cat( df[ 'summary' ] , sep=" " )  #  Concatenating reviewText and summary

In [16]:
df.drop( columns = [ 'reviewText', 'summary'  ] , inplace=True )  #Dropping columns

In [17]:
df['ReviewText'] = df['ReviewText'].apply(clean)    #Applying clean function to text

In [19]:
def target(overall):      # Maps the rating to 0 (negative), 1 (neutral) or 2 (positive)
    if overall < 3 :
        op = 0
    elif overall == 3 :
        op = 1
    else :
        op = 2

    return op

In [20]:
df['target']=df['overall'].apply(target)           # New target column 
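
The same three-way mapping can also be written without a per-row Python function. The sketch below uses `np.select` on a small made-up series containing every possible rating; `ratings` and `labels` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1, 2, 3, 4, 5])   # every possible value of 'overall'

# Conditions are checked in order; the default covers ratings above 3.
labels = np.select(
    [ratings < 3, ratings == 3],
    [0, 1],            # 0 = negative, 1 = neutral
    default=2,         # 2 = positive
)
print(labels.tolist())   # [0, 0, 1, 2, 2]
```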

In [22]:
review=df['ReviewText'].tolist()
review[0]      # Tokens of the first cleaned review

['looks',
 'even',
 'better',
 'person',
 'careful',
 'drop',
 'phone',
 'often',
 'rhinestones',
 'fall',
 'duh',
 'decorative',
 'case',
 'protective',
 'say',
 'fits',
 'perfectly',
 'securely',
 'phone',
 'overall',
 'pleased',
 'purchase',
 'ca',
 'stop',
 'wo',
 'stop',
 'looking']

## Text Vectorization: 

In [23]:
w2v_model=Word2Vec(review,min_count=5,size=50,workers=4)    #Word2Vec Model

In [24]:
w2v_words = set(w2v_model.wv.vocab)    # A set gives constant-time membership tests in vec()

In [25]:
def vec(rev):       #Vectorizing words in text
    vectsum = []
    for line in rev:
        vectline = np.zeros(50)
        num=0
        for word in line: 
            if word in w2v_words:
                vectline += w2v_model.wv[word]
                num += 1
        if num != 0:
            vectline /= num
        vectsum.append(vectline)
    return vectsum

In [26]:
df['reviewvectors']=vec(review) 
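
To make the averaging step concrete, here is a small sketch of the same idea with a hand-written dictionary in place of a trained Word2Vec model. The 3-dimensional vectors and the helper name `mean_vector` are made up for illustration.

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; a trained Word2Vec model would supply these.
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "case": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(mean_vector(["great", "case", "unknownword"], toy_vectors, 3))  # [2. 1. 1.]
print(mean_vector(["unknownword"], toy_vectors, 3))                   # [0. 0. 0.]
```

Reviews whose tokens are all out of vocabulary end up as all-zero vectors, so it may be worth checking how many such rows exist before training.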

In [27]:
df.head()

| | overall | verified | reviewTime | reviewerID | asin | reviewerName | unixReviewTime | ReviewText | target | reviewvectors |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | True | 2014-08-04 | A24E3SXTC62LJI | 7508492919 | Claudia Valdivia | 1407110400 | [looks, even, better, person, careful, drop, p... | 2 | [0.9326032150398802, 0.27075807964084325, -0.0... |
| 1 | 5 | True | 2014-02-12 | A269FLZCB4GIPV | 7508492919 | sarah ponce | 1392163200 | [want, spend, whole, lot, cash, want, great, d... | 2 | [-1.631149485707283, -1.7938518196344375, -0.8... |
| 2 | 3 | True | 2014-02-08 | AB6CHQWHZW4TV | 7508492919 | Kai | 1391817600 | [case, came, time, love, design, actually, mis... | 1 | [1.29744592173533, 0.21961938657543875, 0.2175... |
| 3 | 2 | True | 2014-02-04 | A1M117A53LEI8 | 7508492919 | Sharon Williams | 1391472000 | [care, gave, gift, okay, expected, case] | 0 | [-0.04533412059148153, 1.3900722414255142, -0.... |
| 4 | 4 | True | 2014-02-03 | A272DUT8M88ZS8 | 7508492919 | Bella Rodriguez | 1391385600 | [liked, cute, studs, fall, easily, protect, ph... | 2 | [1.403727525460104, 1.0023157136658063, -0.474... |


In [28]:
vecdf = pd.DataFrame(df['reviewvectors'].tolist(), index= df.index)       #Vectors to df

In [32]:
vecdf.shape

(1123089, 50)

tsne2d = TSNE(n_components=2, init='random',     # PCA takes a long time
              random_state=101, method='barnes_hut',
              n_iter=250, verbose=2, angle=0.5).fit_transform(vecdf)

## Classifier

In [68]:
X_train,X_test,Y_train,Y_test = train_test_split(vecdf,df['target'], test_size=0.3)  #Splitting the data
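
Since the classes are heavily imbalanced (most reviews are positive), a stratified and seeded split may be worth considering. The sketch below shows the effect on made-up labels; the `stratify` and `random_state` arguments are not used in the split above, and the toy arrays are purely illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 10% negative, 10% neutral, 80% positive.
toy_y = np.array([0] * 10 + [1] * 10 + [2] * 80)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y,
    test_size=0.3,
    stratify=toy_y,      # keep class proportions the same in both splits
    random_state=42,     # make the split repeatable
)
print(np.bincount(y_te))   # [ 3  3 24]
```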

### DecisionTreeClassifier

In [50]:
tree = DecisionTreeClassifier()
param = {'max_depth':[5,10,15]}
gs = GridSearchCV(tree,param)
gs.fit(X_train,Y_train)
bestgrid = gs.best_estimator_
y_test = bestgrid.predict(X_test)

In [51]:
print("Optimal parameters are",gs.best_params_)
print("Accuracy score = ",bestgrid.score(X_test,Y_test))
print("F1-score = ",f1_score(Y_test,y_test,average='weighted'))
print("Confusion Matrix :",confusion_matrix(Y_test,y_test))

Optimal parameters are {'max_depth': 10}
Accuracy score =  0.838792972958534
F1-score =  0.807390060488788


array([[ 21227,    605,  19601],
       [  5414,   3003,  20912],
       [  7247,    536, 258382]], dtype=int64)
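
Accuracy and weighted F1 are dominated by the positive class, so it can help to read per-class figures off the confusion matrix. The sketch below recomputes them from the matrix printed above, assuming scikit-learn's convention of rows as true classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix reported above (rows = true class, columns = predicted class).
cm = np.array([
    [21227,   605,  19601],
    [ 5414,  3003,  20912],
    [ 7247,   536, 258382],
])

recall = np.diag(cm) / cm.sum(axis=1)      # fraction of each true class recovered
precision = np.diag(cm) / cm.sum(axis=0)   # fraction of each predicted class that is correct
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", np.trace(cm) / cm.sum())
print("recall   :", recall.round(3))
print("macro F1 :", f1.mean().round(3))
```

For this matrix the neutral class has a recall of roughly 0.10 and the macro-averaged F1 comes out near 0.55, well below the weighted F1 of about 0.81. That gap is one argument for reporting macro F1 or per-class recall alongside accuracy.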

### Logistic Regression

In [69]:
lrparam = {}
lrparam['C'] = []
for i in range(-3,3):
    lrparam['C'].append(10 ** i)   # C value is 0.001,0.01,0.1,1,10,100
    
lr = LogisticRegression(class_weight='balanced', penalty='l2', random_state=2)
    
gs = GridSearchCV(lr,lrparam)
gs.fit(X_train,Y_train)
bestgrid = gs.best_estimator_
y_test = bestgrid.predict(X_test)

In [70]:
print("Optimal parameters = ",gs.best_params_)
print("Accuracy score = ",bestgrid.score(X_test,Y_test))
print("F1-score = ",f1_score(Y_test,y_test,average='weighted'))
print("Confusion Matrix :",confusion_matrix(Y_test,y_test))

Optimal parameters =  {'C': 0.01}
Accuracy score =  0.8410130384326575
F1-score =  0.8424723385733639
Confusion Matrix : [[ 30122   4677   6649]
 [  8080  10439  10696]
 [ 11492  11973 242799]]


### Linear SVC

In [71]:
lsvc = LinearSVC(class_weight='balanced',penalty='l2',random_state=1)    
gs = GridSearchCV(lsvc,lrparam)
gs.fit(X_train,Y_train)
bestgrid = gs.best_estimator_
y_test = bestgrid.predict(X_test)

In [72]:
print("Optimal parameters are",gs.best_params_)
print("Accuracy score = ",bestgrid.score(X_test,Y_test))
print("F1-score = ",f1_score(Y_test,y_test,average='weighted'))
print("Confusion Matrix :",confusion_matrix(Y_test,y_test))

Optimal parameters are {'C': 10}
Accuracy score =  0.8449011210143443
F1-score =  0.8354449858356258
Confusion Matrix : [[ 30851   2497   8100]
 [ 10301   5969  12945]
 [ 12912   5502 247850]]


## Regression

In [61]:
X_train, X_test, Y_train, Y_test = train_test_split(vecdf,df['overall'],test_size=0.3)

### Linear Regression

In [62]:
clf = LinearRegression()
clf.fit(X_train,Y_train)

r2_score = clf.score(X_test,Y_test)

predictions = clf.predict(X_test)
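errors = abs(predictions - Y_test)
m = 100 * np.mean(errors / Y_test)     # Mean absolute percentage error (MAPE), in percent
accuracy = 100 - m                     # "Accuracy" here means 100 - MAPE, not classification accuracy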
print("Accuracy = ",accuracy)
print("Coefficient of determination R2 score = ",r2_score)
print("Mean square error = ",mean_squared_error(Y_test,predictions))

Accuracy =  72.37497308676487
Coefficient of determination R2 score =  0.4629704230465106
Mean square error =  0.814794408879482
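
The "Accuracy" figure above is 100 minus the mean absolute percentage error, which weights an error on a low rating more heavily than the same error on a high rating. The toy sketch below, with made-up ratings and predictions, compares it with MAE and RMSE, which are expressed directly in stars.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1, 2, 3, 4, 5])
y_pred = np.array([2, 2, 3, 4, 4])   # off by one star on the 1-star and the 5-star review

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mape = np.mean(np.abs(y_pred - y_true) / y_true)   # the quantity behind "Accuracy" above

print("MAE  =", mae)                  # 0.4
print("RMSE =", round(rmse, 4))       # 0.6325
print("100 - MAPE% =", 100 - 100 * mape)
```

Both errors are one star, yet the miss on the 1-star review contributes five times as much to MAPE as the miss on the 5-star review.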


### DecisionTreeRegressor

In [63]:
parameters = {'max_depth':[4,5,6,7,8]}
dt = DecisionTreeRegressor()
gs = GridSearchCV(dt,parameters)
gs.fit(X_train,Y_train)
y_test = gs.best_estimator_.predict(X_test)   # Predict with the tuned tree, not the earlier LinearRegression (clf)

In [64]:
bestgrid = gs.best_estimator_
r2_score = bestgrid.score(X_test,Y_test)
predictions = bestgrid.predict(X_test)
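errors = abs(predictions - Y_test)
m = 100 * np.mean(errors / Y_test)     # Mean absolute percentage error (MAPE), in percent
accuracy = 100 - m                     # "Accuracy" here means 100 - MAPE, not classification accuracy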
print("Optimal parameters are",gs.best_params_)
print("Accuracy = ",accuracy)
print("Coefficient of determination R2 score = ",r2_score)
print("Mean square error= ",mean_squared_error(Y_test,predictions))

Optimal parameters are {'max_depth': 8}
Accuracy =  74.24650746684881
Coefficient of determination R2 score =  0.4484922804136525
Mean square error=  0.8367610009899834
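
The reported optimum, `max_depth` 8, is the largest value in the grid, so a wider grid may be worth trying. In addition, `GridSearchCV` selects regressors by R² unless a `scoring` argument is given. The sketch below shows how the search could be pointed at mean squared error instead; the synthetic data, grid values and `cv=3` are assumptions made only to keep the example small.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the review vectors and ratings (not the real data).
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(300, 5))
toy_y = np.clip(np.round(3 + toy_X[:, 0] + 0.5 * rng.normal(size=300)), 1, 5)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [2, 4, 8]},
    scoring="neg_mean_squared_error",   # the default for regressors would be R^2
    cv=3,
)
search.fit(toy_X, toy_y)

print(search.best_params_)
print("CV MSE of best setting:", -search.best_score_)
```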
