## Analysis of an E-commerce Dataset Part 3 (s2 2023)


In this Portfolio task, you will continue working with the dataset you used in Portfolio 2. The difference is that the ratings have been converted to like (score 1) and dislike (score 0). Your task is to train classification models, such as KNN, to predict whether a user likes or dislikes an item.  


The header of the csv file is shown below. 

| userId | timestamp | review | item | helpfulness | gender | category | item_id | item_price | user_city | rating |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |

Your high-level goal in this notebook is to build and evaluate predictive models for 'rating' from the other available features, that is, to predict like (rating 1) or dislike (rating 0) from some of the other fields. More specifically, you need to complete the following major steps: 
1) Explore the data. Clean the data if necessary. For example, remove abnormal instances and replace missing values.
2) Convert object features into numeric features by using an encoder.
3) Study the correlation between these features. 
4) Split the dataset and train a logistic regression model to predict 'rating' based on other features. Evaluate the accuracy of your model.
5) Split the dataset and train a KNN model to predict 'rating' based on other features. You can set K in an ad hoc manner in this step. Evaluate the accuracy of your model.
6) Tune the hyper-parameter K in KNN to see how it influences the prediction performance.

Note 1: We did not provide a description of each step in the notebook. You should learn how to comment your notebook properly so that it is readable. 

Note 2: You are not being evaluated on the ___accuracy___ of the model but on the ___process___ that you use to generate it. Please use both a ___Logistic Regression model___ and a ___KNN model___ to solve this classification problem, and discuss the performance of the two methods.
    

In [5]:
import pandas as pd
import numpy as np

In [6]:
df = pd.read_csv('/Users/navneetwarraich/Downloads/portfolio_3.csv')
df.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,Not always McCrap,McDonald's,3,M,Restaurants & Gourmet,41,30.74,4,1
1,4081,72000,I dropped the chalupa even before he told me to,Taco Bell,4,M,Restaurants & Gourmet,74,108.3,4,0
2,4081,72000,The Wonderful World of Wendy,Wendy's,4,M,Restaurants & Gourmet,84,69.0,4,1
3,4081,100399,They actually did it,"South Park: Bigger, Longer & Uncut",3,M,Movies,68,143.11,4,1
4,4081,100399,Hey! Gimme some pie!,American Pie,3,M,Movies,6,117.89,4,0


In [7]:
# question 1: explore the data
df.info()  # column dtypes and non-null counts
# every column has 2685 non-null values, so there are no missing values to replace

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2685 entries, 0 to 2684
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   userId       2685 non-null   int64  
 1   timestamp    2685 non-null   int64  
 2   review       2685 non-null   object 
 3   item         2685 non-null   object 
 4   helpfulness  2685 non-null   int64  
 5   gender       2685 non-null   object 
 6   category     2685 non-null   object 
 7   item_id      2685 non-null   int64  
 8   item_price   2685 non-null   float64
 9   user_city    2685 non-null   int64  
 10  rating       2685 non-null   int64  
dtypes: float64(1), int64(6), object(4)
memory usage: 230.9+ KB
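
`df.info()` confirms that no values are missing, but it does not reveal duplicated rows or labels outside {0, 1}. A minimal sketch of those extra checks follows. It runs on a small made-up frame (`toy` and its values are assumptions for illustration, not rows from `portfolio_3.csv`); on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "helpfulness": [3, 4, 4, 3],
    "gender": ["M", "M", "F", None],
    "item_price": [30.74, 108.30, 69.00, 117.89],
    "rating": [1, 0, 1, 0],
})

missing_per_column = toy.isna().sum()          # missing values per column
duplicate_rows = int(toy.duplicated().sum())   # fully duplicated rows
labels_are_binary = bool(toy["rating"].isin([0, 1]).all())  # only 0/1 labels?

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("labels are binary:", labels_are_binary)
```

If any of these checks fail on the real data, the affected rows would need to be dropped or repaired before modelling.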


In [8]:
# question 2: convert object features into numeric features by using an encoder
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(dtype=int)
object_cols = ["review", "item", "gender", "category"]
df[object_cols] = enc.fit_transform(df[object_cols])
df.head()


Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,1618,37,3,1,8,41,30.74,4,1
1,4081,72000,1125,67,4,1,8,74,108.3,4,0
2,4081,72000,2185,77,4,1,8,84,69.0,4,1
3,4081,100399,2243,61,3,1,5,68,143.11,4,1
4,4081,100399,1033,5,3,1,5,6,117.89,4,0
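
To see which integer each label received, the fitted encoder can be inspected. The sketch below uses a tiny made-up frame (`toy` is an assumption for illustration); `categories_` and `inverse_transform` are standard attributes of a fitted `OrdinalEncoder`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two of the object columns; the rows are illustrative only.
toy = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "category": ["Movies", "Restaurants & Gourmet", "Movies"],
})

enc = OrdinalEncoder(dtype=int)
codes = enc.fit_transform(toy[["gender", "category"]])

# categories_ holds one sorted array per input column;
# the integer code of a label is its index in that array.
for column, categories in zip(["gender", "category"], enc.categories_):
    print(column, dict(enumerate(categories)))

# inverse_transform maps the integer codes back to the original labels.
restored = enc.inverse_transform(codes)
print(restored)
```

The codes follow the sorted order of the labels, so they carry no meaningful magnitude. This matters most for free-text columns such as `review`, where nearly every row gets its own code.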


In [9]:
# question 3: study the correlation between the features
df.corr()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
userId,1.0,-0.069176,0.007139,-0.005513,-0.166136,-0.058324,-0.041362,-0.005549,0.024576,-0.030031,0.066444
timestamp,-0.069176,1.0,0.007029,-0.003543,0.014179,-0.003367,0.015009,-0.004452,0.010979,-0.014934,-0.009739
review,0.007139,0.007029,1.0,0.16309,-0.028259,-0.037884,0.00197,0.163544,-0.041421,0.045626,-0.041756
item,-0.005513,-0.003543,0.16309,1.0,-0.020433,0.001925,-0.045988,0.999765,-0.049885,-0.00522,0.057793
helpfulness,-0.166136,0.014179,-0.028259,-0.020433,1.0,0.075947,-0.013408,-0.019882,0.004112,0.012086,-0.010622
gender,-0.058324,-0.003367,-0.037884,0.001925,0.075947,1.0,0.022549,0.00237,-0.040596,-0.065638,-0.022169
category,-0.041362,0.015009,0.00197,-0.045988,-0.013408,0.022549,1.0,-0.045268,-0.115571,0.008017,-0.142479
item_id,-0.005549,-0.004452,0.163544,0.999765,-0.019882,0.00237,-0.045268,1.0,-0.05445,-0.005576,0.057107
item_price,0.024576,0.010979,-0.041421,-0.049885,0.004112,-0.040596,-0.115571,-0.05445,1.0,-0.023427,0.026062
user_city,-0.030031,-0.014934,0.045626,-0.00522,0.012086,-0.065638,0.008017,-0.005576,-0.023427,1.0,-0.034866
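
In the matrix above, `category` has the largest absolute correlation with `rating` (about -0.14), and `item` and `item_id` are almost perfectly correlated (0.999765), so one of the two is redundant. A sketch of how to rank features by their absolute correlation with the target is shown below; `toy` is made-up data, and on the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Toy numeric frame; the values are invented for illustration.
toy = pd.DataFrame({
    "category": [8, 8, 5, 5, 2, 2],
    "item_price": [30.0, 108.0, 69.0, 143.0, 118.0, 50.0],
    "rating": [0, 0, 1, 0, 1, 1],
})

# Correlation of every feature with the target, excluding the target itself.
corr_with_rating = toy.corr()["rating"].drop("rating")

# Rank by absolute value: the sign only indicates the direction of the relationship.
ranked = corr_with_rating.abs().sort_values(ascending=False)
print(ranked)
```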


In [10]:
df.head()

Unnamed: 0,userId,timestamp,review,item,helpfulness,gender,category,item_id,item_price,user_city,rating
0,4081,71900,1618,37,3,1,8,41,30.74,4,1
1,4081,72000,1125,67,4,1,8,74,108.3,4,0
2,4081,72000,2185,77,4,1,8,84,69.0,4,1
3,4081,100399,2243,61,3,1,5,68,143.11,4,1
4,4081,100399,1033,5,3,1,5,6,117.89,4,0


In [11]:
# question 4: splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['rating'], axis=1), df['rating'], test_size=0.2, random_state=42)
print("X train = ", X_train.shape)
print("X test = ", X_test.shape)
print("y train = ", y_train.shape)
print("y test = ", y_test.shape)


X train =  (2148, 10)
X test =  (537, 10)
y train =  (2148,)
y test =  (537,)


In [12]:
# train a logistic regression model to predict 'rating' based on other features.
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)


In [13]:
# evaluation
from sklearn.metrics import accuracy_score
y_pred=clf.predict(X_test)

print(accuracy_score(y_test, y_pred))

0.6368715083798883
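
An accuracy figure is easier to interpret next to a majority-class baseline, which always predicts the most frequent label. The sketch below uses made-up labels (`y_toy` is an assumption, not the notebook's data); on the real split, the baseline would be fit on `X_train, y_train` and scored on `X_test, y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 7 likes and 3 dislikes.
y_toy = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_accuracy)
```

If a trained model does not clearly beat this baseline, its accuracy mostly reflects the class balance rather than anything learned from the features.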


In [14]:
print("Conclusion:")
print("Using logistic regression, the accuracy score is about 64%.")
print("A KNN model is worth trying for comparison.")

Conclusion:
Using logistic regression, the accuracy score is about 64%.
A KNN model is worth trying for comparison.


In [15]:
import warnings
warnings.filterwarnings("ignore")

In [16]:
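# Use recursive feature elimination (RFE) to try to improve accuracy
from sklearn.feature_selection import RFE
selector = RFE(clf, n_features_to_select=3)  # keep the 3 top-ranked features
selector = selector.fit(X_train, y_train)
selector.ranking_  # rank 1 marks a selected feature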


array([7, 8, 6, 1, 3, 1, 1, 2, 5, 4])
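
The ranking array follows the column order of the training features, and rank 1 marks a selected feature. A short sketch that maps the ranks printed above back to column names (`feature_names` is written out by hand here, in the order shown by `df.info()` with `rating` removed):

```python
import numpy as np

# Feature columns in the order they appear in X_train (all columns except 'rating').
feature_names = ["userId", "timestamp", "review", "item", "helpfulness",
                 "gender", "category", "item_id", "item_price", "user_city"]

# Ranking as printed by selector.ranking_ above.
ranking = np.array([7, 8, 6, 1, 3, 1, 1, 2, 5, 4])

# Features with rank 1 are the ones RFE kept.
selected = [name for name, rank in zip(feature_names, ranking) if rank == 1]
print(selected)  # ['item', 'gender', 'category']
```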

In [17]:
# Use the RFE-selected columns as input features to train the logistic model again
X_train, X_test, y_train, y_test = train_test_split(df[["item", "gender", "category"]], df['rating'], test_size=0.2, random_state=42)
print("X_train =", X_train.shape)
print("X_test =", X_test.shape)
print("y_train =", y_train.shape)
print("y_test =", y_test.shape)


X_train = (2148, 3)
X_test = (537, 3)
y_train = (2148,)
y_test = (537,)


In [18]:
clf = LogisticRegression()
clf.fit(X_train, y_train)
y_pred=clf.predict(X_test)
print(accuracy_score(y_test, y_pred))


0.6443202979515829


In [22]:
# question 5: train a KNN model with an ad hoc K (K=3) on the RFE-selected features
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train)


In [23]:
# evaluate the KNN model on the test set
y_pred = neigh.predict(X_test)
print(accuracy_score(y_test, y_pred))

0.6759776536312849
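
KNN is distance-based, so a feature with a wide numeric range (here the `item` codes reach into the tens, while `gender` is 0 or 1) can dominate the neighbour search. One common remedy is to standardise the features inside a pipeline. The sketch below runs on synthetic data from `make_classification`, which is an assumption for illustration only; whether scaling helps on the real data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three selected features and the binary label.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The scaler is fit on the training data only and reused at predict time,
# so no information from the test set leaks into the preprocessing.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)

scaled_accuracy = scaled_knn.score(X_test, y_test)
print(scaled_accuracy)
```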


In [27]:
# question 6: tune K with a cross-validated grid search (5 folds by default)
from sklearn.model_selection import GridSearchCV
parameters = {'n_neighbors': range(1, 100)}
clf = GridSearchCV(neigh, parameters)
clf.fit(X_train, y_train)

In [28]:
clf.best_params_

{'n_neighbors': 22}

In [29]:
# mean cross-validated accuracy on the training data, not the test-set accuracy
clf.best_score_

0.7453504634899983
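
`best_score_` is the mean cross-validated accuracy over the training folds, so it is not directly comparable with the test-set accuracies reported earlier. A sketch of scoring the tuned model on the held-out test set follows. It uses synthetic data and a smaller grid (both assumptions for illustration); on the real data the equivalent would be to score the fitted search object on the notebook's own `X_test` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the real features and labels.
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Search K with 5-fold cross-validation on the training data only.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 31)},
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# By default the search refits the best model on all training data,
# so search.predict uses the best K found.
cv_accuracy = search.best_score_
test_accuracy = accuracy_score(y_test, search.predict(X_test))

print("best K:", search.best_params_["n_neighbors"])
print("cross-validated accuracy:", cv_accuracy)
print("test accuracy:", test_accuracy)
```

Reporting the test accuracy of the tuned KNN alongside the logistic regression test accuracy gives a like-for-like comparison of the two methods.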