# Training an ML Model on Amazon Fine Food Data


<b>About</b>

This dataset consists of reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review. We also have reviews from all other Amazon categories.


<b>Data</b>

Number of reviews: 568,454.

Number of users: 256,059.

Number of products: 74,258.

Timespan: Oct 1999 - Oct 2012

Number of Attributes/Columns in data: 

<b>The data consist of</b>

1. Id

2. ProductId

3. UserId: unique ID for every customer

4. Score: rating from 1 to 5

5. HelpfulnessNumerator: number of customers who found the review helpful

6. HelpfulnessDenominator: number of customers who indicated whether the review was helpful or not

7. Text: text of the review

8. Summary

9. Time: timestamp

10. ProfileName

11. Labels: 1 for a positive review and 0 for a negative review

<b>Objective</b>

Determine whether a given review is positive (rating of 4 or 5) or negative (rating of 1 or 2).
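The labelling rule above can be sketched on a few toy ratings. This is only an illustration, not the code from the preprocessing notebook: the `reviews` frame is made up, and dropping 3-star reviews is an assumption, since the objective defines labels only for ratings 1-2 and 4-5.

```python
import pandas as pd

# Toy ratings standing in for the "Score" column (hypothetical data).
reviews = pd.DataFrame({"Score": [5, 4, 3, 2, 1]})

# Assumption: neutral 3-star reviews are dropped, because the objective
# only defines labels for ratings 1-2 and 4-5.
reviews = reviews[reviews["Score"] != 3].copy()

# 1 = positive (rating 4 or 5), 0 = negative (rating 1 or 2).
reviews["Labels"] = (reviews["Score"] > 3).astype(int)

print(reviews)
```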

No preprocessing is done here; I import the fully preprocessed data from another Jupyter notebook of mine.

<b>Note</b> This notebook is an extension of my t-SNE visualization notebook.

In [1]:
# importing required libraries
%matplotlib inline
import os
import pandas as pd  # data analysis
import numpy as np  # scientific computation
import seaborn as sns  # plotting tool
import matplotlib.pyplot as plt  # plotting tool
import sqlite3
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve,auc
from sklearn import metrics


from sklearn.model_selection import train_test_split

In [2]:
link = sqlite3.connect("C:/users/rock/Documents/resulted_data.sqlite")  # connecting to the SQLite database

df = pd.read_sql_query("""SELECT * FROM Reviews""", link)

In [3]:
df.head(5)

Unnamed: 0,index,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text,Labels,CleanedText
0,138706,150524,6641040,ACITT7DI6IDDL,shari zychinski,0,0,positive,939340800,EVERY book is educational,this witty little book makes my son laugh at l...,1,witti littl book make son laugh loud recit car...
1,138688,150506,6641040,A2IW4PEEKO2R0U,Tracy,1,1,positive,1194739200,"Love the book, miss the hard cover version","I grew up reading these Sendak books, and watc...",1,grew read sendak book watch realli rosi movi i...
2,138689,150507,6641040,A1S4A3IQ2MU7V4,"sally sue ""sally sue""",1,1,positive,1191456000,chicken soup with rice months,This is a fun way for children to learn their ...,1,fun way children learn month year learn poem t...
3,138690,150508,6641040,AZGXZ2UUK6X,"Catherine Hallberg ""(Kate)""",1,1,positive,1076025600,a good swingy rhythm for reading aloud,This is a great little book to read aloud- it ...,1,great littl book read nice rhythm well good re...
4,138691,150509,6641040,A3CMRKGE0P909G,Teresa,3,4,positive,1018396800,A great way to learn the months,This is a book of poetry about the months of t...,1,book poetri month year goe month cute littl po...


In [4]:
df.shape

(364171, 13)

In [5]:
df.columns

Index(['index', 'Id', 'ProductId', 'UserId', 'ProfileName',
       'HelpfulnessNumerator', 'HelpfulnessDenominator', 'Score', 'Time',
       'Summary', 'Text', 'Labels', 'CleanedText'],
      dtype='object')

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 364171 entries, 0 to 364170
Data columns (total 13 columns):
index                     364171 non-null int64
Id                        364171 non-null int64
ProductId                 364171 non-null object
UserId                    364171 non-null object
ProfileName               364171 non-null object
HelpfulnessNumerator      364171 non-null int64
HelpfulnessDenominator    364171 non-null int64
Score                     364171 non-null object
Time                      364171 non-null int64
Summary                   364171 non-null object
Text                      364171 non-null object
Labels                    364171 non-null int64
CleanedText               364171 non-null object
dtypes: int64(6), object(7)
memory usage: 36.1+ MB


To achieve our objective, I use only the Text column as input and the Labels column (the binary form of Score) as the target.

Now I'm going to split the data into training and test sets.

In [8]:
X_train,x_test,Y_train,y_test = train_test_split(df["Text"],df["Labels"],random_state=0)

print(X_train.shape)

print(x_test.shape)

(273128,)
(91043,)


Here we can see that the training set has 273,128 reviews and the test set has 91,043 reviews.
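These two counts follow from the default split ratio. The call above passes neither `test_size` nor `train_size`, in which case scikit-learn holds out 25% of the rows, rounding the test count up. A small arithmetic check, using only the row count reported by `df.shape`:

```python
import math

n_samples = 364171   # rows in df, from df.shape above
test_size = 0.25     # scikit-learn's default when neither size is given

n_test = math.ceil(test_size * n_samples)   # test count is rounded up
n_train = n_samples - n_test                # the rest goes to training

print(n_train, n_test)
```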

To feed this data to our algorithm, we need to convert the text to a numerical format. There are different methods for this, such as:

<b>
Bag of words  
    
tf-idf 
    
bigrams and n-grams
</b>

# Bag of words

To learn more about Bag of Words, see https://en.wikipedia.org/wiki/Bag-of-words_model

In BoW we construct a dictionary containing the set of all unique words in our review text, and count how often each word occurs. If there are d unique words in the dictionary, every sentence or review becomes a vector of length d, with the count of each word stored at that word's position in the vector. Since a single review uses only a small fraction of the vocabulary, these vectors are highly sparse.

CountVectorizer therefore returns a sparse matrix.
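To make the idea concrete, here is a minimal sketch on two made-up reviews (they are not taken from the dataset, and the names `toy_reviews`, `toy_vec` and `toy_counts` are only for illustration). It shows the dictionary that gets built and the count vector for each review.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up reviews (hypothetical, not taken from the dataset).
toy_reviews = ["good food good price", "bad food"]

toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_reviews)  # sparse matrix, one row per review

# vocabulary_ maps each unique word to its column index (alphabetical order).
print(sorted(toy_vec.vocabulary_.items(), key=lambda kv: kv[1]))

# Dense view is fine for a toy example; avoid toarray() on the full dataset.
print(toy_counts.toarray())
```

The dictionary here has d = 4 words (bad, food, good, price), so the first review becomes the vector [0, 1, 2, 1] and the second becomes [1, 1, 0, 0].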
    

In [9]:
vec = CountVectorizer()  # initializing the vectorizer
final_vector = vec.fit_transform(X_train)  # learning the vocabulary and transforming the data

In [10]:
type(final_vector)

scipy.sparse.csr.csr_matrix

In [11]:
final_vector.get_shape()

(273128, 100274)

In [12]:
X_train_vectorized =vec.transform(X_train)
X_train_vectorized

<273128x100274 sparse matrix of type '<class 'numpy.int64'>'
	with 14509835 stored elements in Compressed Sparse Row format>

In [13]:
x_tes_vectorized = vec.transform(x_test)

x_tes_vectorized

<91043x100274 sparse matrix of type '<class 'numpy.int64'>'
	with 4815884 stored elements in Compressed Sparse Row format>

 The result is stored in a SciPy sparse matrix, where each row corresponds to a document, and each column is a word from our training vocabulary.
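How sparse is this matrix? A quick back-of-the-envelope check, using only the shape and the number of stored elements printed for `X_train_vectorized` above:

```python
# Figures copied from the X_train_vectorized output above.
n_rows, n_cols = 273128, 100274
stored = 14509835

density = stored / (n_rows * n_cols)  # fraction of cells that are non-zero
avg_words = stored / n_rows           # distinct vocabulary words per review

print(f"density: {density:.4%}")
print(f"average distinct words per review: {avg_words:.1f}")
```

Only about 0.05% of the cells are non-zero, which is why a sparse format is used instead of a dense array.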

# Logistic Regression algorithm

In [14]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train_vectorized, Y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [15]:
from sklearn.metrics import roc_auc_score
predictions = model.predict(vec.transform(x_test))


In [16]:
print('Accuracy: ', metrics.accuracy_score(y_test, predictions))

Accuracy:  0.9307030743714508


In [55]:
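# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions
feature_names = np.array(vec.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()  # feature indices, most negative coefficient first
print('negatives: \n{}\n'.format(feature_names[sorted_coef_index[:10]]))
# note: [:-10:-1] selects the 9 largest coefficients; use [:-11:-1] for 10
print('positives: \n{}\n'.format(feature_names[sorted_coef_index[:-10:-1]]))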

negatives: 
['disappointing' 'unacceptable' 'worst' 'unappealing' 'dissapointing'
 'redeeming' 'disappointment' 'holle' 'weakest' 'undrinkable']

positives: 
['pleasantly' 'addicting' 'ramune' 'skeptical' 'hooked' 'easiest' 'solved'
 'firmly' 'upgraded']



Here you can see that our model predicts well: it reaches an accuracy of about 93%, and the words with the most negative and the most positive coefficients are clearly separated.
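One detail in the coefficient cell above: the positives list contains 9 words while the negatives list contains 10. That comes from the slice `[:-10:-1]`, which stops before position -10. A small sketch with made-up coefficients (the `coefs` values are hypothetical) shows the difference:

```python
import numpy as np

# Hypothetical coefficients for 12 features.
coefs = np.array([0.3, -1.2, 2.5, 0.0, -0.7, 1.1, 0.9, -2.0, 1.8, -0.1, 0.4, 2.1])
order = coefs.argsort()  # feature indices, most negative coefficient first

lowest_10 = order[:10]   # the 10 most negative coefficients
top_9 = order[:-10:-1]   # stops before position -10, so only 9 entries
top_10 = order[:-11:-1]  # the 10 largest coefficients, largest first

print(len(lowest_10), len(top_9), len(top_10))
```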

In [62]:
print(model.predict(vec.transform(['this is the worst food i have ever eaten'])))

[0]


In [61]:
print(model.predict(vec.transform(['this is the best  food i have ever eaten'])))

[1]


Here our model predicts well on some unseen reviews. Whooo, we have achieved our task!

Oh really??

Let's look at another example.

In [64]:
print(model.predict(vec.transform(['this is the not worst  food i have ever eaten'])))

[0]


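Oh, now you can see that our model is misled: it classifies 'this is the not worst  food i have ever eaten' as the negative class, because a model built on single words sees 'worst' but cannot tell that 'not' comes right before it.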

# Bigrams and n-grams with logistic regression


To address this issue we use the n-grams method. Bigrams put adjacent words together and so retain some context; in our test example, the model misclassified the review once the word 'not' was present.

What bigrams actually do is treat pairs of adjacent words as features, e.g. 'not good', 'very disappointing', etc.
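As a small sketch of what `ngram_range=(1, 2)` adds, the snippet below fits two vectorizers on a single made-up sentence and compares their vocabularies. The names `unigram_vec` and `bigram_vec` are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["this is the not worst food"]

# Single words only (the default) versus single words plus adjacent pairs.
unigram_vec = CountVectorizer(ngram_range=(1, 1)).fit(sentence)
bigram_vec = CountVectorizer(ngram_range=(1, 2)).fit(sentence)

print(sorted(unigram_vec.vocabulary_))
print(sorted(bigram_vec.vocabulary_))
```

With bigrams, 'not worst' becomes a feature of its own, so a model can in principle give it a different weight from 'worst' alone. Whether it actually does depends on how often such pairs occur in the training data.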


In [17]:
vector = CountVectorizer(ngram_range=(1,2)).fit(X_train)

X_train_vect =vector.transform(X_train)

In [18]:
type(X_train_vect)

scipy.sparse.csr.csr_matrix

In [19]:
X_train_vect.shape

(273128, 2436364)

In [20]:
m= LogisticRegression()
m.fit(X_train_vect,Y_train)
pred =m.predict(vector.transform(x_test))

print('Accuracy: ', metrics.accuracy_score(y_test, pred))

Accuracy:  0.9497709873356546


In [73]:
print(m.predict(vector.transform(['this is the not worst  food i have ever eaten'])))

[0]


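<b>Yes, with bigrams the test accuracy improves to about 95%.</b> Note, however, that the output shown above for 'this is the not worst  food i have ever eaten' is still [0], so this particular review is still classified as negative in this run.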

# KNN on Bag of Words

In [11]:
from sklearn.neighbors import KNeighborsClassifier

In [12]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from collections import Counter
from sklearn.model_selection import cross_validate

In [18]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_vectorized,Y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [19]:
predic = knn.predict(x_tes_vectorized)


In [23]:
from sklearn.metrics import roc_auc_score
print("accuracy_score",metrics.accuracy_score(y_test,predic))

accuracy_score 0.8502465867776765


In [25]:
print(knn.predict(vec.transform(['this is the not worst  food i have ever eaten'])))

[1]


In [13]:
k_range = range(1, 30, 2)

scores = []
# We append the test accuracy for each k to the list
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_vectorized,Y_train)
    predic = knn.predict(x_tes_vectorized)
    scores.append(metrics.accuracy_score(y_test,predic))
    

print(scores)

KeyboardInterrupt: 

<b>Note</b>

--> With the logistic regression algorithm I get an accuracy of about 95% on bigrams.

--> With KNN (K=5) I get an accuracy of about 85%. I'm not sure this is the best K.

--> Finding the best K takes a lot of time, because a model has to be fitted and evaluated for every candidate value. Since I'm using a local system, the search above was too slow to finish (hence the KeyboardInterrupt).
You can try this on your own just by changing the parameters.
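One possible way to make the search for K affordable is to run it with cross-validation on a small random subsample of the training data rather than on the full set. The sketch below is not what this notebook ran: it uses synthetic data from `make_classification` as a stand-in (an assumption), and the names `X_small`, `y_small`, `cv_scores` and `best_k` are hypothetical. In the notebook, a slice of `X_train_vectorized` and `Y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a small random subsample of the training data
# (assumption: in the notebook this would be a slice of X_train_vectorized / Y_train).
X_small, y_small = make_classification(n_samples=200, n_features=20, random_state=0)

k_range = range(1, 30, 2)
cv_scores = []
for k in k_range:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds; the test set is never touched.
    cv_scores.append(
        cross_val_score(knn_k, X_small, y_small, cv=5, scoring="accuracy").mean()
    )

best_k = k_range[cv_scores.index(max(cv_scores))]
print("best k:", best_k)
```

Choosing K by cross-validation on the training data also keeps the test set free for a final, unbiased accuracy estimate.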