## Amazon Review

### Problem Statement

The objective is to classify each review as positive or negative using three text representations (count vectors, TF-IDF, and n-grams) and to compare the results.

### Import Libraries

In [50]:
import pandas as pd
import numpy as np

### Data Collection

In [51]:
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


### Data Understanding

In [52]:
df.shape

(413840, 6)

In [53]:
# Look for missing values
df.isna().sum()

Product Name        0
Brand Name      65171
Price            5933
Rating              0
Reviews            62
Review Votes    12296
dtype: int64

In [54]:
# Drop every row that has a missing value in any column
df.dropna(inplace=True)

In [55]:
df.shape

(334335, 6)
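Calling `dropna()` with no arguments removes a row if any column is missing, which here discards 79,505 rows, most of them only lacking `Brand Name`. Since only `Reviews` and `Rating` are used for modeling, one option is to restrict the check to those columns. The sketch below uses a made-up toy frame, not the real data, and the metrics reported later in this notebook would change if this variant were applied.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "Brand Name": ["Samsung", np.nan, "BLU", "LG"],
    "Rating": [5, 1, 4, 2],
    "Reviews": ["Very pleased", "Stopped working", np.nan, "Too slow"],
})

# Dropping on every column also discards the row that only lacks a brand name.
dropped_all = toy.dropna()

# Restricting the check to the modeling columns keeps that row.
dropped_subset = toy.dropna(subset=["Reviews", "Rating"])

print(len(dropped_all), len(dropped_subset))
```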

### Data Labeling

Reviews with a rating above 3 are labeled positive (1); reviews rated 3 or below are labeled negative (0).

In [57]:
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
df.sample(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
198858,HTC Vivid X710a 16GB Unlocked GSM Android Dual...,HTC,97.99,5,"exelente, 100% perfecto",0.0,1
89033,"BlackBerry Bold 9790 GSM Unlocked Phone, Black",BlackBerry,269.1,1,"after 4 days of working good, now it is showin...",0.0,0
158410,BLU Tank II T193 Unlocked GSM Dual-SIM Cell Ph...,BLU,20.99,1,Bought for my grandma. Worked well for a few d...,0.0,0
302288,"RCA M1 4.0 Unlocked Cell Phone, Dual SIM, 5MP ...",RCA,159.99,5,excellent,4.0,1
92267,Blackberry Curve 8520 Gemini SmartPhone Unlock...,BlackBerry,39.99,3,Please I need an image of the commercial invoi...,0.0,0
220474,LG G3 D855 32GB LTE Unlocked GSM Android Smart...,LG,210.95,5,Best smartphone!,1.0,1
254939,Motorola Moto X Developer GSM Edition Factory ...,Moto X,235.0,5,Excelente producto!!!!,0.0,1
107308,BLU Dash 3.5 II D352L Unlocked GSM Dual-SIM 4G...,BLU,109.99,3,"It's a phone, doesn't have very much internal ...",0.0,0
319991,"Samsung Galaxy J7 J700M, 16GB, Dual SIM LTE, F...",Samsung,227.99,5,Excellent,0.0,1
308681,"Samsung Galaxy Exhibit 4G (T-Mobile), t679",Samsung,119.99,2,This product arrived on time. I returned it be...,0.0,0
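Because the rule is `Rating > 3`, 3-star reviews end up in the negative class, as rows 92267 and 107308 in the sample show. The sketch below applies the same rule to toy ratings, then shows a hypothetical alternative (not used in this notebook) that leaves neutral 3-star reviews out.

```python
import numpy as np
import pandas as pd

ratings = pd.Series([1, 2, 3, 4, 5], name="Rating")

# Same rule as the notebook: strictly greater than 3 is positive.
labels = np.where(ratings > 3, 1, 0)
print(labels.tolist())

# Hypothetical alternative: treat 3-star reviews as neutral and exclude them.
non_neutral = ratings[ratings != 3]
labels_no_neutral = np.where(non_neutral > 3, 1, 0)
print(labels_no_neutral.tolist())
```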


In [58]:
df['Positively Rated'].value_counts(normalize=True)

1    0.689949
0    0.310051
Name: Positively Rated, dtype: float64

### Divide dataset into independent and dependent features

In [59]:
X = df['Reviews']
y = df['Positively Rated']

### Train-Test Split

In [65]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state = 0)

In [66]:
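# Note: X_train[0] selects the row whose index label is 0, not the first row of the split;
# X_train.iloc[0] would give the first row by position.
print('X_train first entry: \n\n', X_train[0])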
print('\n\nX_train shape: ', X_train.shape)

X_train first entry: 

 I feel so LUCKY to have found this used (phone to us & not used hard at all), phone on line from someone who upgraded and sold this one. My Son liked his old one that finally fell apart after 2.5+ years and didn't want an upgrade!! Thank you Seller, we really appreciate it & your honesty re: said used phone.I recommend this seller very highly & would but from them again!!


X_train shape:  (267468,)
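The labels are imbalanced (about 69% positive), so a stratified split is one way to keep that ratio identical in both partitions. The sketch below uses synthetic data and hypothetical variable names. It also shows positional access with `.iloc`, which does not depend on which index labels landed in the training split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with roughly the same 70/30 imbalance as the labels above.
toy_X = pd.Series([f"review {i}" for i in range(100)])
toy_y = pd.Series([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0, stratify=toy_y
)

# With stratify, both splits keep close to the original class ratio.
print(y_tr.mean(), y_te.mean())

# .iloc[0] is positional; X_tr[0] would look up the index label 0 instead,
# which raises KeyError if that row landed in the test split.
first_review = X_tr.iloc[0]
```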


### Model Building with Count Vectorizer

In [67]:
# CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train)

In [68]:
vect.get_feature_names()[::3000]

['00',
 '80mcc',
 'apktrain',
 'bluboo',
 'circling',
 'd535u',
 'dsm',
 'factors',
 'goodmetal',
 'impportant',
 'largeif',
 'minimal',
 'oficinas',
 'plans12',
 'rceived',
 'sailing',
 'solctice',
 'telefonia',
 'uninvited',
 'withdraw']

In [69]:
len(vect.get_feature_names())

58385

In [70]:
# transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)
X_train_vectorized

<267468x58385 sparse matrix of type '<class 'numpy.int64'>'
	with 7299949 stored elements in Compressed Sparse Row format>

### Model Building

In [71]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
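The warning above means the default solver stopped at its iteration cap (`max_iter=100`) before converging, so the fitted coefficients may not be final. One common remedy is to raise `max_iter`. The sketch below demonstrates this on synthetic data, and the value 1000 is an arbitrary choice that may still be too low for the full review matrix.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the real input is the sparse document-term matrix.
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# A larger iteration cap gives the solver room to converge.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used.
print(clf.n_iter_[0])
```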

In [75]:
from sklearn.metrics import roc_auc_score, confusion_matrix,accuracy_score, classification_report

predictions = model.predict(vect.transform(X_test))

print('AUC: ', roc_auc_score(y_test, predictions))
print()
print('Accuracy: ',accuracy_score(y_test, predictions))
print()
print('Confusion Matrix: ',confusion_matrix(y_test, predictions))
print()
print('Classification Report',classification_report(y_test, predictions))

AUC:  0.8839823958024075

Accuracy:  0.9110173927348318

Confusion Matrix:  [[16882  3894]
 [ 2056 44035]]

Classification Report               precision    recall  f1-score   support

           0       0.89      0.81      0.85     20776
           1       0.92      0.96      0.94     46091

    accuracy                           0.91     66867
   macro avg       0.91      0.88      0.89     66867
weighted avg       0.91      0.91      0.91     66867
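Note that `roc_auc_score` is given hard 0/1 predictions here. In that case the value reduces to balanced accuracy, which is why the reported 0.884 equals the mean of the two per-class recalls (about 0.81 and 0.96). A ranking-based AUC needs continuous scores, for example `model.predict_proba(vect.transform(X_test))[:, 1]`. The sketch below illustrates the difference with hand-made labels and scores.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hand-made labels and scores, for illustration only.
y_true = [0, 0, 1, 1, 1]
scores = [0.2, 0.6, 0.55, 0.7, 0.9]      # stand-in for predict_proba(...)[:, 1]
hard = [int(s >= 0.5) for s in scores]   # scores thresholded at 0.5

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard)

# With 0/1 predictions, ROC AUC equals balanced accuracy.
bal_acc = balanced_accuracy_score(y_true, hard)
print(auc_from_scores, auc_from_labels, bal_acc)
```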



In [76]:
# get the feature names as numpy array
feature_names = np.array(vect.get_feature_names())

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest
print('Smallest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}\n'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs: 
['worst' 'junk' 'disappointing' 'upset' 'garbage' 'dirty' 'false'
 'unusable' 'freezes' 'waste']

Largest Coefs: 
['excelent' 'excelente' 'exelente' 'loving' 'perfecto' 'loves' 'excellent'
 'complaints' 'superb' 'happier']



### Model Building with TF-IDF

In [77]:
#Tfidf
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df = 5).fit(X_train)
len(vect.get_feature_names())

19500
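For reference, the weighting that `TfidfVectorizer` applies with its default settings can be reproduced by hand: raw term counts are multiplied by a smoothed inverse document frequency, and each row is then scaled to unit L2 norm. The sketch below checks this on a toy corpus. It omits `min_df`, which only prunes rare terms from the vocabulary before weighting.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["great phone", "bad phone", "great great battery"]

counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed idf, matching the default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight raw counts by idf, then scale each row to unit L2 norm.
manual = counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)

from_sklearn = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, from_sklearn))
```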

In [79]:
X_train_vectorized = vect.transform(X_train)

model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [80]:
print('AUC: ', roc_auc_score(y_test, predictions))
print()
print('Accuracy: ',accuracy_score(y_test, predictions))
print()
print('Confusion Matrix: ',confusion_matrix(y_test, predictions))
print()
print('Classification Report',classification_report(y_test, predictions))

AUC:  0.8999571705584489

Accuracy:  0.9202297097222846

Confusion Matrix:  [[17585  3191]
 [ 2143 43948]]

Classification Report               precision    recall  f1-score   support

           0       0.89      0.85      0.87     20776
           1       0.93      0.95      0.94     46091

    accuracy                           0.92     66867
   macro avg       0.91      0.90      0.91     66867
weighted avg       0.92      0.92      0.92     66867



In [81]:
feature_names = np.array(vect.get_feature_names())

sorted_tfidf_index = X_train_vectorized.max(0).toarray()[0].argsort()

print('Smallest Tfidf: \n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest Tfidf: \n{}\n'.format(feature_names[sorted_tfidf_index[:-11:-1]]))

Smallest Tfidf: 
['brawns' 'reading___' 'messiah' '16nm' 'srgb' '___thank' '401p' '625nits'
 'bigtime' 'tsmc']

Largest Tfidf: 
['greate' 'exelentes' 'exito' 'is' 'machete' 'bulls' 'bulky' 'exellent'
 'exellen' 'exelente']



In [82]:
sorted_coef_index = model.coef_[0].argsort()

print('Smallest coef: \n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest coef: \n{}\n'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest coef: 
['not' 'worst' 'disappointed' 'waste' 'poor' 'return' 'terrible' 'stopped'
 'slow' 'horrible']

Largest coef: 
['love' 'great' 'amazing' 'excellent' 'perfect' 'loves' 'best' 'awesome'
 'perfectly' 'easy']



In [83]:
# Both reviews get the same prediction (negative) from the current unigram model,
# even though the first one is positive

print(model.predict(vect.transform(['Not an issue, phone is working', 
                                   'an issue, phone is not working'])))

[0 0]
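The two reviews get the same prediction because they contain exactly the same words, so a unigram representation maps them to identical vectors regardless of TF-IDF weighting. A minimal sketch, fitted on just these two sentences, shows that adding bigrams separates them:

```python
from sklearn.feature_extraction.text import CountVectorizer

pair = ["Not an issue, phone is working",
        "an issue, phone is not working"]

# Unigrams only: both sentences contain the same six words.
uni = CountVectorizer().fit_transform(pair).toarray()
same_unigrams = bool((uni[0] == uni[1]).all())

# Adding bigrams introduces features such as "not an" and "not working".
bi = CountVectorizer(ngram_range=(1, 2)).fit_transform(pair).toarray()
same_bigrams = bool((bi[0] == bi[1]).all())

print(same_unigrams, same_bigrams)  # True False
```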


### Model Building with N-grams

In [85]:
# n-grams
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df = 5, ngram_range = (1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
len(vect.get_feature_names())

227924

In [86]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

predictions = model.predict(vect.transform(X_test))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [87]:
print('AUC: ', roc_auc_score(y_test, predictions))
print()
print('Accuracy: ',accuracy_score(y_test, predictions))
print()
print('Confusion Matrix: ',confusion_matrix(y_test, predictions))
print()
print('Classification Report',classification_report(y_test, predictions))

AUC:  0.9358685110319043

Accuracy:  0.949436941989322

Confusion Matrix:  [[18699  2077]
 [ 1304 44787]]

Classification Report               precision    recall  f1-score   support

           0       0.93      0.90      0.92     20776
           1       0.96      0.97      0.96     46091

    accuracy                           0.95     66867
   macro avg       0.95      0.94      0.94     66867
weighted avg       0.95      0.95      0.95     66867



In [88]:
feature_names = np.array(vect.get_feature_names())
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coef: \n{}\n'.format(feature_names[sorted_coef_index][:10]))
print('Largest Coef: \n{}\n'.format(feature_names[sorted_coef_index][:-11:-1]))

Smallest Coef: 
['no good' 'junk' 'not satisfied' 'worst' 'not happy' 'not worth'
 'garbage' 'wouldn recommend' 'never worked' 'poor']

Largest Coef: 
['excelent' 'exelente' 'excelente' 'perfecto' 'excellent' 'no issues'
 'perfect' 'loving' 'awsome' 'amazing']



In [89]:
print(model.predict(vect.transform(['Phone is awsome working'])))

[1]
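On the full dataset, the three runs above report test accuracy of about 0.911 (counts), 0.920 (TF-IDF), and 0.949 (counts with 1-grams and 2-grams). The same comparison can be written more compactly with a scikit-learn pipeline, which fits the vectorizer on training text only and reuses it at prediction time. The sketch below runs on a tiny made-up corpus with hypothetical names, so its scores say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only.
train_docs = ["love this phone", "great battery", "works perfectly",
              "worst phone ever", "not working", "waste of money"]
train_labels = [1, 1, 1, 0, 0, 0]
test_docs = ["great phone", "phone not working"]
test_labels = [1, 0]

vectorizers = {
    "count": CountVectorizer(),
    "tfidf": TfidfVectorizer(),
    "count 1-2 grams": CountVectorizer(ngram_range=(1, 2)),
}

results = {}
for name, vectorizer in vectorizers.items():
    # Each pipeline vectorizes the text and then fits logistic regression.
    pipe = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    pipe.fit(train_docs, train_labels)
    results[name] = pipe.score(test_docs, test_labels)  # mean accuracy

print(results)
```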
