# Predicting reviews in Amazon Fine Food Reviews data set using KNN.

__Here we use 10-fold cross-validation to choose the optimal k, and then use that k to measure accuracy on the test dataset.__

<h2>Introduction to the Dataset</h2>


Data Source: https://www.kaggle.com/snap/amazon-fine-food-reviews

The Amazon Fine Food Reviews dataset consists of reviews of fine foods from Amazon.<br>

Number of reviews: 568,454<br>
Number of users: 256,059<br>
Number of products: 74,258<br>
Timespan: Oct 1999 - Oct 2012<br>
Number of Attributes/Columns in data: 10 

Attribute Information:

1. Id
2. ProductId - unique identifier for the product
3. UserId - unique identifier for the user
4. ProfileName
5. HelpfulnessNumerator - number of users who found the review helpful
6. HelpfulnessDenominator - number of users who indicated whether they found the review helpful or not
7. Score - rating between 1 and 5
8. Time - timestamp for the review
9. Summary - brief summary of the review
10. Text - text of the review


#### Objective:
Given a review, determine whether the review is positive (Rating of 4 or 5) or negative (rating of 1 or 2).

<br>
[Q] How to determine if a review is positive or negative?<br>
<br> 
[Ans] We could use the Score/Rating. A rating of 4 or 5 could be considered a positive review. A rating of 1 or 2 could be considered negative. A rating of 3 is neutral and is ignored. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.




Importing all the necessary packages.

In [1]:
import sqlite3
import string
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
from nltk.stem.porter import PorterStemmer
from sklearn import metrics
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it.
# The alias keeps calls written as cross_validation.train_test_split(...) working.
from sklearn import model_selection as cross_validation
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import accuracy_score, auc, confusion_matrix, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier



##### 2. Connecting to Amazon food review dataset

In [2]:
con=sqlite3.connect('./database.sqlite')
filtered_data=pd.read_sql_query("""select * from reviews where score!=3""",con)
def partition(x):
    if x<3:
        return 'negative'
    else:
        return 'positive'
actual_score=filtered_data['Score']
PositiveNegative=actual_score.map(partition)
filtered_data['Score']=PositiveNegative
print(filtered_data.shape)
filtered_data.head()


(525814, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,positive,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,negative,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,positive,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,negative,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,positive,1350777600,Great taffy,Great taffy at a great price. There was a wid...


##### 3. Sorting our data by ProductId and removing the duplicate reviews

In [3]:
sorted_data=filtered_data.sort_values('ProductId',axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last')
final=sorted_data.drop_duplicates(subset={"UserId","ProfileName","Time","Text"},keep='first',inplace=False)
print(final.shape)

(364173, 10)
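
The deduplication above keeps the first row for each combination of UserId, ProfileName, Time and Text. Below is a minimal pure-Python sketch of the same idea on made-up rows. The helper name `drop_duplicate_reviews` and the sample rows are hypothetical and are not part of the notebook.

```python
def drop_duplicate_reviews(rows, keys=("UserId", "ProfileName", "Time", "Text")):
    """Keep the first row seen for each combination of the key columns."""
    seen = set()
    kept = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept


# Made-up rows: the same review text posted under two product ids.
rows = [
    {"ProductId": "P2", "UserId": "U1", "ProfileName": "a", "Time": 10, "Text": "good"},
    {"ProductId": "P1", "UserId": "U1", "ProfileName": "a", "Time": 10, "Text": "good"},
    {"ProductId": "P3", "UserId": "U2", "ProfileName": "b", "Time": 11, "Text": "bad"},
]
rows.sort(key=lambda r: r["ProductId"])  # sort first so "first" is well defined
unique_rows = drop_duplicate_reviews(rows)
print([r["ProductId"] for r in unique_rows])
```

Sorting before deduplicating matters only for deciding which copy survives; the number of rows kept is the same either way.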


##### 4. We also remove the rows where HelpfulnessNumerator is greater than HelpfulnessDenominator, because that is not possible in practice

In [4]:
final=final[final.HelpfulnessNumerator<=final.HelpfulnessDenominator]

In [5]:
print(final.shape)

(364171, 10)



#### We also clean the text of HTML tags, stop words, and punctuation

In [7]:
# find sentences containing HTML tags

import re
i=0;
for sent in final['Text'].values:
    if (len(re.findall('<.*?>', sent))):
        print(i)
        print(sent)
        break;
    i += 1;    


6
I set aside at least an hour each day to read to my son (3 y/o). At this point, I consider myself a connoisseur of children's books and this is one of the best. Santa Clause put this under the tree. Since then, we've read it perpetually and he loves it.<br /><br />First, this book taught him the months of the year.<br /><br />Second, it's a pleasure to read. Well suited to 1.5 y/o old to 4+.<br /><br />Very few children's books are worth owning. Most should be borrowed from the library. This book, however, deserves a permanent spot on your shelf. Sendak's best.


In [8]:
import re
# Tutorial about Python regular expressions: https://pymotw.com/2/re/
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stop = set(stopwords.words('english')) #set of stopwords
sno = nltk.stem.SnowballStemmer('english') #initialising the snowball stemmer

def cleanhtml(sentence): #function to clean the word of any html-tags
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', sentence)
    return cleantext
def cleanpunc(sentence): #function to clean the word of any punctuation or special characters
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    return  cleaned
print(stop)
print('************************************')
print(sno.stem('tasty'))

{"shan't", "needn't", 'now', 'doesn', 'ours', 'so', 'whom', 'a', "don't", 'o', 'her', 'shan', 'not', 'are', 'against', 'this', 'only', 'them', "hadn't", 'yourselves', 'wasn', "it's", 'did', 'more', 'most', 'can', 'haven', "aren't", 'between', 'an', 'hadn', 'myself', 'had', "hasn't", 'these', 'here', 'what', "that'll", 'that', 'ma', 'down', 'by', 'won', 'while', 'above', 'into', 'he', 'wouldn', 'through', 'and', 'been', 'our', 'isn', 'doing', 'for', 'does', 'any', 'yours', 'y', 'yourself', 'too', 'who', 'we', "you've", "you'd", 'she', "isn't", 'don', 'its', 'having', 'over', 'up', 'himself', 'with', 'just', 'ain', 'you', 'themselves', 'ourselves', 'to', 'd', 'where', "mightn't", 'of', 'mustn', "won't", 'herself', 'or', 'after', 'him', 'until', 'again', 's', 've', 'than', 'further', 'during', "should've", "weren't", 'as', 'if', 'itself', 'some', 'was', "didn't", 'm', 'on', 'hers', 'about', 'do', 'aren', 'before', 'needn', 'at', "you're", 'those', 'same', 'should', 'their', "you'll", 'am'
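
To see what the cleaning helpers do to a single review, here is a small self-contained sketch that uses only the `re` module. It assumes a tiny hypothetical stop-word set instead of NLTK's English list and skips stemming, so its output is only an approximation of what the notebook's pipeline produces. The function name `clean_review` is made up for this illustration.

```python
import re

# Tiny hypothetical stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "is", "and", "this", "a", "it"}


def clean_review(text):
    """Strip HTML tags and punctuation, lowercase, drop stop words and short tokens."""
    text = re.sub(r"<.*?>", " ", text)      # replace HTML tags with a space
    text = re.sub(r"[?!'\"#]", "", text)    # delete these characters outright
    text = re.sub(r"[.,)(|/]", " ", text)   # replace these characters with a space
    tokens = []
    for word in text.split():
        word = word.lower()
        # keep alphabetic tokens longer than two characters that are not stop words
        if word.isalpha() and len(word) > 2 and word not in STOP_WORDS:
            tokens.append(word)
    return " ".join(tokens)


sample = "This book is great!<br /><br />Well suited to kids, and it's fun."
print(clean_review(sample))
```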

In [9]:
#Code for implementing step-by-step the checks mentioned in the pre-processing phase
# this code takes a while to run as it processes every review in `final` (about 364k sentences).
i=0
str1=' '
final_string=[]
all_positive_words=[] # store words from +ve reviews here
all_negative_words=[] # store words from -ve reviews here.
s=''
for sent in final['Text'].values:
    filtered_sentence=[]
    #print(sent);
    sent=cleanhtml(sent) # remove HTMl tags
    for w in sent.split():
        for cleaned_words in cleanpunc(w).split():
            if((cleaned_words.isalpha()) & (len(cleaned_words)>2)):    
                if(cleaned_words.lower() not in stop):
                    s=(sno.stem(cleaned_words.lower())).encode('utf8')
                    filtered_sentence.append(s)
                    if (final['Score'].values)[i] == 'positive': 
                        all_positive_words.append(s) #list of all words used to describe positive reviews
                    if(final['Score'].values)[i] == 'negative':
                        all_negative_words.append(s) #list of all words used to describe negative reviews
                else:
                    continue
            else:
                continue 
    #print(filtered_sentence)
    str1 = b" ".join(filtered_sentence) #final string of cleaned words
    #print("***********************************************************************")
    
    final_string.append(str1)
    i+=1

In [10]:
final['CleanedText']=final_string #adding a column of CleanedText which displays the data after pre-processing of the review 

In [11]:
final.head(3) #below the processed review can be seen in the CleanedText Column 


# store the final table in an SQLite table for future use.
conn = sqlite3.connect('final.sqlite')
c=conn.cursor()
conn.text_factory = str
# the `flavor` argument was removed from recent pandas releases, so it is not passed here
final.to_sql('Reviews', conn, if_exists='replace', index=True)

##### 6. Separating the reviews by Score (positive or negative)
We sample 364,000 reviews and separate them into a positive data frame (306,913 reviews) and a negative data frame (57,087 reviews). We then take the first 50,000 rows of each and concatenate them into one data frame, `bigdata`. The scores of the selected reviews are kept separately in `s1`.
Finally, we divide the selected reviews into train and test data and convert the Text column of each into BoW vectors.


In [None]:
total_data=final.sample(364000)

In [None]:
conn = sqlite3.connect('total_data.sqlite')
c=conn.cursor()
conn.text_factory = str
total_data.to_sql('total', conn, if_exists='replace', index=True)

In [None]:
positive_data=pd.read_sql_query("""select * from total where score='positive'""",conn)
negative_data=pd.read_sql_query("""select * from total where score='negative'""",conn)

In [None]:
print(positive_data.shape)
print(negative_data.shape)

In [None]:
positive_data2000=positive_data.head(50000)
negative_data2000=negative_data.head(50000)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
bigdata = pd.concat([positive_data2000, negative_data2000], ignore_index=True)
print(bigdata.shape)

In [None]:
sorted_data=bigdata

In [None]:
du=sorted_data.sample(100000)

In [None]:
du.head(15)

In [None]:
#Again sorting our data in Ascending order
du=du.sort_values('Time',axis=0,ascending=True,inplace=False,kind='quicksort',na_position='last')

In [None]:
s1=du['Score']

In [None]:
X_1, X_test, y_1, y_test = train_test_split(du, s1, test_size=0.3, random_state=0)

# split the train data set into cross validation train and cross validation test
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.3)

In [None]:
#BOW for train points
count_vect = CountVectorizer(max_features=35354) #in scikit-learn
X_1 = count_vect.fit_transform(X_1['Text'].values)
print(X_1.shape)

In [None]:
#BOW for test points
# reuse the vocabulary learned on the train text so train and test share the same columns
X_test = count_vect.transform(X_test['Text'].values)
print(X_test.shape)
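
A bag-of-words vocabulary is normally learned from the training text and then reused for the test text, so that column j means the same word in both matrices. The sketch below shows that idea in plain Python on made-up documents. The helpers `fit_vocabulary` and `to_count_vector` are hypothetical names for this illustration and are not scikit-learn functions.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct training token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(document, vocabulary):
    """Count vocabulary tokens in one document; unseen tokens are ignored."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


train_docs = ["good tea good price", "bad tea"]
test_docs = ["good coffee"]            # "coffee" never appears in training

vocab = fit_vocabulary(train_docs)     # learned from the training text only
train_matrix = [to_count_vector(d, vocab) for d in train_docs]
test_matrix = [to_count_vector(d, vocab) for d in test_docs]
print(vocab)
print(train_matrix, test_matrix)
```

If a second vocabulary were fitted on the test documents instead, the same column index could stand for different words in the two matrices, and distances between train and test vectors would be meaningless.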

In [None]:
from sklearn.preprocessing import StandardScaler
# with_mean=False skips centering, which keeps the sparse BoW matrix sparse
scaler = StandardScaler(with_mean=False)
standardized_train_data = scaler.fit_transform(X_1)
print(standardized_train_data.shape)
X_1=standardized_train_data

In [None]:
# reuse the scaler fitted on the train data rather than refitting it on the test data
standardizedtest_data = scaler.transform(X_test)
print(standardizedtest_data.shape)
X_test=standardizedtest_data
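
Standardizing with `with_mean=False` divides each feature by its standard deviation without subtracting the mean. A dependency-free sketch of that operation follows, with the statistics taken from the training rows and then applied to the test rows. The helper names, the toy numbers, and the choice to leave zero-variance columns unscaled are assumptions of this sketch.

```python
from statistics import pstdev


def fit_scale(train_rows):
    """Per-column population standard deviation; zero-variance columns use 1.0."""
    columns = list(zip(*train_rows))
    return [pstdev(col) or 1.0 for col in columns]


def apply_scale(rows, scale):
    """Divide each column by its training standard deviation (no centering)."""
    return [[value / s for value, s in zip(row, scale)] for row in rows]


train = [[0, 2, 1], [2, 2, 0], [4, 2, 1], [2, 2, 0]]
test = [[1, 2, 1]]

scale = fit_scale(train)                 # statistics come from the train rows only
scaled_train = apply_scale(train, scale)
scaled_test = apply_scale(test, scale)   # the same scale is reused for the test rows
print(scale)
print(scaled_test)
```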

In [None]:
# creating odd list of K for KNN
myList = list(range(0,21))
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# empty list that will hold cv scores
cv_scores = []

# perform 10-fold cross validation
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_1, y_1, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# changing to misclassification error
MSE = [1 - x for x in cv_scores]

# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('\nThe optimal number of neighbors is %d.' % optimal_k)

# plot misclassification error vs k 
plt.plot(neighbors, MSE)

for xy in zip(neighbors, np.round(MSE,3)):
    plt.annotate('(%s, %s)' % xy, xy=xy, textcoords='data')

plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

print("the misclassification error for each k value is : ", np.round(MSE,3))
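
To make the selection of k concrete, here is a small pure-Python sketch of k-NN with k-fold cross-validation on toy one-dimensional data. It uses simple interleaved folds and squared Euclidean distance, which are assumptions of this sketch and not necessarily how `cross_val_score` builds its folds. All names and data points are made up.

```python
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority label among the k training points closest to `point`."""
    by_distance = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    votes = Counter(train_y[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


def cv_error(xs, ys, k, n_folds):
    """Mean misclassification rate of k-NN over n_folds interleaved folds."""
    errors = []
    for fold in range(n_folds):
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        kept = [i for i in range(len(xs)) if i % n_folds != fold]
        tr_x = [xs[i] for i in kept]
        tr_y = [ys[i] for i in kept]
        wrong = sum(knn_predict(tr_x, tr_y, xs[i], k) != ys[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / n_folds


# Toy data: two clusters, each containing one deliberately mislabeled point.
xs = [[0], [100], [3], [102], [4], [106], [7], [107], [12], [112]]
ys = ["neg", "pos", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "pos"]

candidates = [1, 3, 5]
errors = [cv_error(xs, ys, k, n_folds=5) for k in candidates]
best_k = candidates[errors.index(min(errors))]  # first k with the lowest error
print(errors, best_k)
```

With k = 1 every noisy label misleads its neighbours, while a larger k lets the majority outvote the noise. That trade-off is what the error-versus-k plot is meant to show.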

In [None]:
X_1.shape

In [None]:
X_test.shape

In [None]:
# ============================== KNN with k = optimal_k ===============================================
# instantiate learning model k = optimal_k
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k)

# fitting the model
knn_optimal.fit(X_1, y_1)

# predict the response
pred = knn_optimal.predict(X_test)
acc = accuracy_score(y_test, pred, normalize=True) * float(100)
print('\n Accuracy for Optimal k = %d is %d%%' % (optimal_k, acc))

#### Confusion matrix , Precision, Recall, F-Score

In [None]:
import numpy as np


def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()


In [None]:
# print the confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn import metrics
gb=metrics.confusion_matrix(y_test,pred)
print(gb)
# matrix recorded from the original run: [[1334, 4635], [1183, 4848]]
plot_confusion_matrix(cm           = gb, 
                      normalize    = False,
                      target_names = ['negative', 'positive'],
                      title        = "Confusion Matrix")


In [None]:
# recall from the confusion matrix above
recall=(gb[1,1]+0.0)/sum(gb[1,:])
recall

In [None]:
# precision from the confusion matrix above
pre=(gb[1,1]+0.0)/sum(gb[:,1])
print(pre)

In [None]:
# calculating the F1 score as the harmonic mean of precision and recall, i.e.
# F1 = 2*TP / (2*TP + FP + FN)
F1=(2*pre*recall)/(pre+recall)
F1
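
As a worked check, the sketch below computes precision, recall and F1 in plain Python from the BoW confusion matrix recorded in the plotting cell above. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels, with the classes ordered negative then positive. The function name is hypothetical.

```python
def precision_recall_f1(cm):
    """cm is [[TN, FP], [FN, TP]] with rows = true label, columns = predicted."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# BoW confusion matrix recorded in the notebook (classes: negative, positive).
cm = [[1334, 4635], [1183, 4848]]
precision, recall, f1 = precision_recall_f1(cm)

# The harmonic-mean form and the count form of F1 should agree.
(tn, fp), (fn, tp) = cm
f1_from_counts = 2 * tp / (2 * tp + fp + fn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```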

# Now doing the same process with TF-IDF vectors

In [None]:
X_1, X_test, y_1, y_test = train_test_split(du, s1, test_size=0.3, random_state=0)

# split the train data set into cross validation train and cross validation test
X_tr, X_cv, y_tr, y_cv = train_test_split(X_1, y_1, test_size=0.3)

In [None]:
# now we use TF-IDF vectors to predict our reviews using KNN
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,2),max_features=320494)
new1 = tf_idf_vect.fit_transform(X_1['Text'].values)
new1.get_shape()

In [None]:
# reuse the vectorizer fitted on the train text so train and test share the same columns
new2 = tf_idf_vect.transform(X_test['Text'].values)
new2.get_shape()
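
For intuition about what a TF-IDF vector contains, here is a pure-Python sketch on made-up documents. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is the default documented for scikit-learn's `TfidfVectorizer`. It handles unigrams only, unlike the `ngram_range=(1,2)` setting used above, and the helper names are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Smoothed IDF per training token: ln((1 + n) / (1 + df)) + 1."""
    n = len(train_docs)
    df = Counter(tok for doc in train_docs for tok in set(doc.split()))
    return {tok: math.log((1 + n) / (1 + d)) + 1 for tok, d in df.items()}


def tfidf_vector(doc, idf):
    """L2-normalized TF-IDF weights for one document; unseen tokens are dropped."""
    counts = Counter(tok for tok in doc.split() if tok in idf)
    weights = {tok: c * idf[tok] for tok, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {tok: w / norm for tok, w in weights.items()}


train_docs = ["good tea", "bad tea", "good price"]
idf = fit_idf(train_docs)                        # learned on the training text only
test_vec = tfidf_vector("good tea coffee", idf)  # "coffee" is not in the vocabulary
print(idf)
print(test_vec)
```

Rare training words such as "bad" get a larger IDF than common ones such as "tea", and a word that never occurred in training contributes nothing to the test vector.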

##### Standardizing our Train and Test TF-IDF vectors 

In [None]:
from sklearn.preprocessing import StandardScaler
# fit the scaler on the train vectors only
tfidf_scaler = StandardScaler(with_mean=False)
new11 = tfidf_scaler.fit_transform(new1)


In [None]:
# reuse the scaler fitted on the train vectors for the test vectors
new22 = tfidf_scaler.transform(new2)


#### Using KNN to train for Different values of k using the 10 fold cross validation  

### Now we are applying 10 fold cross validation to our Train dataset and then choosing the best value of k to find out our final accuracy on our test data set

In [None]:
# creating odd list of K for KNN
myList = list(range(0,50))
neighbors = list(filter(lambda x: x % 2 != 0, myList))

# empty list that will hold cv scores
cv_scores = []

# perform 10-fold cross validation
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, new11, y_1, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())

# changing to misclassification error
MSE = [1 - x for x in cv_scores]

# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print('\nThe optimal number of neighbors is %d.' % optimal_k)

# plot misclassification error vs k 
plt.plot(neighbors, MSE)

for xy in zip(neighbors, np.round(MSE,3)):
    plt.annotate('(%s, %s)' % xy, xy=xy, textcoords='data')

plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()

print("the misclassification error for each k value is : ", np.round(MSE,3))

In [None]:
# ============================== KNN with k = optimal_k ===============================================
# instantiate learning model k = optimal_k
knn_optimal = KNeighborsClassifier(n_neighbors=optimal_k)

# fitting the model
knn_optimal.fit(new11, y_1)

# predict the response
pred = knn_optimal.predict(new22)

# evaluate accuracy
acc = accuracy_score(y_test, pred) * 100
print('\nThe accuracy of the knn classifier for k = %d is %f%%' % (optimal_k, acc))

In [None]:
y_test.describe()

In [None]:
# print the confusion matrix
from sklearn.metrics import confusion_matrix
from sklearn import metrics
gb=metrics.confusion_matrix(y_test,pred)
print(gb)
#plotting the confusion matrix
# matrix recorded from the original run: [[10, 5659], [14, 6017]]
plot_confusion_matrix(cm           = gb, 
                      normalize    = False,
                      target_names = ['negative', 'positive'],
                      title        = "Confusion Matrix")

In [None]:
# recall from the confusion matrix above
recall=(gb[1,1]+0.0)/sum(gb[1,:])
recall

In [None]:
# precision from the confusion matrix above
pre=(gb[1,1]+0.0)/sum(gb[:,1])
print(pre)

In [None]:
# calculating the F1 score as the harmonic mean of precision and recall, i.e.
# F1 = 2*TP / (2*TP + FP + FN)
F1=(2*pre*recall)/(pre+recall)
F1

In [None]:
# Train your own Word2Vec model using your own text corpus
i=0
list_of_sent=[]
for sent in final['CleanedText'].values:
    list_of_sent.append(sent.split())

In [None]:
print(final['CleanedText'].values[0])
print("*****************************************************************")
print(list_of_sent[0])

In [None]:
from gensim.models import Word2Vec
# min_count = 5 considers only words that occurred at least 5 times
# (gensim 4+ renames the `size` argument to `vector_size`)
w2v_model=Word2Vec(list_of_sent,min_count=5,size=50, workers=4)

In [None]:
w2v_words = list(w2v_model.wv.vocab)
print("number of words that occurred at least 5 times ",len(w2v_words))
print("sample words ", w2v_words[0:50])

In [None]:
#  Avg W2V,

In [None]:
# average Word2Vec
# compute average word2vec for each review.
sent_vectors = []; # the avg-w2v for each sentence/review is stored in this list
for sent in list_of_sent: # for each review/sentence
    sent_vec = np.zeros(50) # start from a zero vector of length 50, the Word2Vec vector size used above
    cnt_words =0; # num of words with a valid vector in the sentence/review
    for word in sent: # for each word in a review/sentence
        if word in w2v_words:
            vec = w2v_model.wv[word]
            sent_vec += vec
            cnt_words += 1
    if cnt_words != 0:
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
print(len(sent_vectors))
print(len(sent_vectors[0]))
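
The averaging step can be illustrated without a trained model. The sketch below uses a made-up dictionary of 3-dimensional vectors in place of Word2Vec. The function name `average_vector` and the toy embeddings are hypothetical.

```python
def average_vector(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; all zeros if none are known."""
    total = [0.0] * dim
    known = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
            known += 1
    return [t / known for t in total] if known else total


# Made-up 3-dimensional embeddings standing in for a trained Word2Vec model.
toy_embeddings = {"tasti": [1.0, 0.0, 2.0], "tea": [3.0, 2.0, 0.0]}
review_vec = average_vector(["tasti", "tea", "unknownword"], toy_embeddings, dim=3)
empty_vec = average_vector(["unknownword"], toy_embeddings, dim=3)
print(review_vec)
print(empty_vec)
```

Looking words up in a dictionary (or a set) is also much faster than testing membership in a list, which matters when the loop runs over hundreds of thousands of reviews.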

### Conclusion / Summary

(i) Sampled 40k reviews from our dataset.
(ii) Divided the reviews into train and test sets.
(iii) Converted the text of the reviews into vectors using both BoW and TF-IDF vectorizers.
(iv) Applied 10-fold cross-validation to the train set to find the optimal value of k for KNN.
(v) Computed the accuracy on the test set using the optimal value of k.
(vi) Also computed the confusion matrix, precision, recall, and F1 score.


Model :- K nearest Neighbour
HyperParameter:- K

1.FOR BOW

    Optimal K:- 3
    Train Error:- 39 %
    Test Accuracy :-51%
    F1 Score :- 0.62 


2.FOR TF-IDF

    Optimal K:- 1
    Train Error:- 49.9 %
    Test Accuracy :-50%
    F1 Score :- 0.56 
    
With TF-IDF we get slightly lower accuracy on the test dataset than with BoW, and both results are very poor.
KNN might not be a good model for this problem.
In future we will try different models, such as Naive Bayes and Logistic Regression, which may improve the accuracy.