In [29]:
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import numpy as np

## Import the data and recognise structure!

In [30]:
reviews = pd.read_csv('./Amazon_Unlocked_Mobile.csv')
reviews.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


The data looks fairly clean, so we turn Rating into a binary label: ratings above 3 count as positive and ratings below 3 as negative. For the time being we leave out the neutral 3-star ratings (though we could later classify them as positive or negative).

In [31]:
reviews.count()

Product Name    413840
Brand Name      348669
Price           407907
Rating          413840
Reviews         413778
Review Votes    401544
dtype: int64

In [32]:
reviews.dropna(inplace=True)
reviews['Positively Rated'] = np.where(reviews['Rating'] > 3, 1, 0)

In [33]:
reviews = reviews[reviews['Rating'] != 3]

In [36]:
reviews.sample(10)

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
171807,CNPGD [U.S. Warranty] All-in-1 Smartwatch and ...,CNPGD,49.99,5,Works great will do business again,0.0,1
318856,Samsung Galaxy J7 J700M 16GB Dual Sim LTE Unlo...,Samsung,227.14,4,Ok so I thought I did all of my research befor...,2.0,1
249771,Motorola KRZR K1 Unlocked Phone with 2 MP Came...,Motorola,44.0,5,Very happy with this purchase. The phone does ...,0.0,1
321439,Samsung Galaxy Mega 6.3 I527 16GB Unlocked GSM...,Samsung,304.99,1,Would have given it 5 stars had I not had to g...,1.0,0
171165,CNPGD [U.S. Warranty] All-in-1 Smartwatch and ...,CNPGD,49.99,1,I like the style but i can't get it to synchro...,0.0,0
270190,Nokia Lumia 820 8GB GSM 4G LTE Windows 8 Smart...,Nokia,99.99,1,Was unworking from the gitgo. Sent back.,0.0,0
306802,Samsung Galaxy Ace 2 i8160 Black Factory Unloc...,Samsung,99.99,2,Given the great reviews on this phone I felt c...,1.0,0
90675,"BlackBerry Classic Factory Unlocked Cellphone,...",BlackBerry,149.99,5,Awesome phone...better than I expected :),2.0,1
212262,LG Electronics LG G3 D855 16GB Unlocked Cell P...,LG Electronics,249.0,4,I had a Galaxy S2 before this phone and strugg...,48.0,1
263737,"Nokia C3-00 Unlocked Cell Phone with QWERTY, D...",Nokia,269.1,5,"My old flip phone was behaving strangely, so I...",4.0,1


In [37]:
reviews.count()

Product Name        308277
Brand Name          308277
Price               308277
Rating              308277
Reviews             308277
Review Votes        308277
Positively Rated    308277
dtype: int64

Now that the data is cleaned, we can move on to analysing the reviews. From here on we use only the Reviews text and the newly created Positively Rated column.

In [38]:
reviews['Positively Rated'].mean()

0.7482686025879323

In [40]:
from sklearn.model_selection import train_test_split

In [42]:
X_train, X_test, y_train, y_test = train_test_split(reviews['Reviews'],reviews['Positively Rated'], random_state=42)

Let's now convert the reviews to a numerical representation that the models can use. CountVectorizer tokenizes the text and counts the tokens in one go. Normalising and weighting can be added later if needed (using Tf-idf).
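To make the counting step concrete, here is a minimal sketch on a two-document toy corpus. The sentences and the name `toy_vectorizer` are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration (not taken from the reviews).
toy_docs = ["great phone, great price", "bad phone"]

toy_vectorizer = CountVectorizer().fit(toy_docs)

# vocabulary_ maps each token to a column index; sorting the keys
# gives the tokens in column order.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)

# One row per document, one column per token, values are raw counts.
toy_counts = toy_vectorizer.transform(toy_docs).toarray()
print(toy_counts)
```

Each review becomes one row of counts, which is why the real matrix further down has one column per vocabulary entry.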

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

In [43]:
vectorizer = CountVectorizer().fit(X_train)

In [51]:
len(vectorizer.get_feature_names())

53438

In [52]:
vectorizer.get_feature_names()[::3000]

['00',
 '975foiblesbattery',
 'assertions',
 'bumpers',
 'conectarte',
 'diabetic',
 'estaré',
 'furnace',
 'huaweis',
 'kindle',
 'microsimcard',
 'okwahts',
 'ponerse',
 'recovers',
 'selback',
 'stickers',
 'tornado',
 'voluntary']

In [53]:
X_train_vectorized = vectorizer.transform(X_train)
X_train_vectorized

<231207x53438 sparse matrix of type '<class 'numpy.int64'>'
	with 6123196 stored elements in Compressed Sparse Row format>

In [103]:
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

In [104]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [105]:
from sklearn.metrics import roc_auc_score

predictions = model.predict(vectorizer.transform(X_test))
print(roc_auc_score(y_test, predictions))

0.9208909315442279


That is a pretty decent score. Note that **any words in X_test that did not appear in X_train are ignored**, because the vectorizer's vocabulary was fitted on X_train only.
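A small sketch of what "ignored" means in practice, using invented toy text and a made-up out-of-vocabulary word (`zzyzx`):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy training text, invented for illustration.
toy_vectorizer = CountVectorizer().fit(["good phone", "bad battery"])

# "zzyzx" was never seen during fit, so it has no column and is silently dropped;
# only "good" contributes to the resulting row.
toy_row = toy_vectorizer.transform(["good zzyzx"]).toarray()[0]
print(toy_row)
```

The unseen word neither raises an error nor adds a column, so it simply contributes nothing to the prediction.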

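Let's look at the features with the largest and smallest coefficients, i.e. the words that push a prediction most strongly towards the positive or the negative class.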

In [106]:
sort_index = model.coef_[0].argsort()
feature_names = np.array(vectorizer.get_feature_names())
print("Largest coefficients:{}".format(feature_names[sort_index[-10:]]))
print("Smallest coefficients:{}".format(feature_names[sort_index[:10]]))

Largest coefficients:['perfect' 'perfecto' 'amazing' 'love' 'loves' 'exelente' 'loving'
 'excellent' 'excelente' 'excelent']
Smallest coefficients:['worst' 'worthless' 'junk' 'false' 'mony' 'garbage' 'horrible' 'terrible'
 'useless' 'nope']


argsort() = *for each position in the sorted order, the index of the element of the original array that belongs there*
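A quick sketch of how argsort() is used to pick the extreme coefficients. The coefficient values and feature names here are invented for illustration.

```python
import numpy as np

# Toy coefficients and matching feature names, invented for illustration.
toy_coef = np.array([0.5, -2.0, 1.5, 0.0])
toy_names = np.array(["ok", "awful", "superb", "the"])

# Indices that would sort toy_coef in ascending order.
toy_order = toy_coef.argsort()
print(toy_order)

# First index -> most negative coefficient, last index -> most positive.
print(toy_names[toy_order[:1]])
print(toy_names[toy_order[-1:]])
```

Indexing the feature names with the first and last few entries of that index array is exactly what the cells above do with `sort_index[:10]` and `sort_index[-10:]`.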

In [107]:
svc = LinearSVC()
svc.fit(X_train_vectorized, y_train)
from sklearn.metrics import roc_auc_score

predictions = svc.predict(vectorizer.transform(X_test))
print(roc_auc_score(y_test, predictions))

sort_index = svc.coef_[0].argsort()
feature_names = np.array(vectorizer.get_feature_names())
print("Largest coefficients:{}".format(feature_names[sort_index[-10:]]))
print("Smallest coefficients:{}".format(feature_names[sort_index[:10]]))

0.936742901145614
Largest coefficients:['disastrous' 'saturated' 'excelent' 'spire' 'sis' 'blockedand' 'ofamanda'
 '4eeeks' 'emoticon' 'talkin']
Smallest coefficients:['rejecting' 'screams' 'downloas' 'reviewphysical' 'mony' 'timethe'
 'theory' '7yr' 'avoiding' 'uncouple']


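We note that LinearSVC gives a better score, but its extreme coefficients are much harder to interpret here: many of them sit on rare or misspelled tokens (for example '4eeeks' and 'ofamanda') rather than on obviously positive or negative words.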

Tf-idf (term frequency-inverse document frequency) is a way of vectorizing text that gives low weight to terms that occur frequently across all documents, and higher weight to terms that occur often in only a few documents and are therefore characteristic of them.
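As a rough illustration of the weighting, the sketch below fits TfidfVectorizer (with its default settings) on an invented three-document corpus and prints the learned idf weights. The documents and the `toy_` names are assumptions made for this example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration: "phone" occurs in every document,
# "works" in two of them, "cracked" and "fine" in only one each.
toy_docs = ["phone works", "phone cracked", "phone works fine"]

toy_tfidf = TfidfVectorizer().fit(toy_docs)

# idf_ holds one inverse-document-frequency weight per vocabulary column.
toy_idf = {term: float(toy_tfidf.idf_[col])
           for term, col in toy_tfidf.vocabulary_.items()}
print(toy_idf)
```

The term that appears everywhere ("phone") ends up with the smallest idf, while the term confined to a single document ("cracked") gets the largest.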

In [109]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [110]:
vect = TfidfVectorizer(min_df=5).fit(X_train)  # keep only features that appear in at least 5 documents

In [111]:
len(vect.get_feature_names())

17989

In [112]:
X_train_vect = vect.transform(X_train)
X_test_vect = vect.transform(X_test)

In [114]:
model.fit(X_train_vect, y_train)
svc.fit(X_train_vect, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [116]:
predictions_logreg = model.predict(X_test_vect)
print("Logistic Regression:{}".format(roc_auc_score(y_test, predictions_logreg)))

predictions_svc = svc.predict(X_test_vect)
print("SVC:{}".format(roc_auc_score(y_test, predictions_svc)))

Logistic Regression:0.926701445245822
SVC:0.9350900501379037
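One caveat about these numbers: roc_auc_score is being fed hard 0/1 predictions. It also accepts continuous scores (for example the output of `decision_function`, which both LogisticRegression and LinearSVC provide), and the two inputs can give different values. A toy sketch with invented labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented for illustration.
toy_true = [0, 0, 1, 1]
toy_scores = [-2.0, -0.5, -0.1, 1.0]             # stand-in for decision_function output
toy_hard = [1 if s > 0 else 0 for s in toy_scores]  # thresholded 0/1 predictions

# With continuous scores, every positive is ranked above every negative.
toy_auc_scores = roc_auc_score(toy_true, toy_scores)

# With hard labels, the ranking information below the threshold is lost.
toy_auc_hard = roc_auc_score(toy_true, toy_hard)

print(toy_auc_scores, toy_auc_hard)
```

Here the continuous scores rank the examples perfectly, while the thresholded predictions give a lower value, so the AUC figures in this notebook are best read as scores for the hard predictions rather than for the underlying ranking.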


In [119]:
sort_index = model.coef_[0].argsort()
feature_names = np.array(vect.get_feature_names())
print("Largest coefficients:{}".format(feature_names[sort_index[-10:]]))
print("Smallest coefficients:{}".format(feature_names[sort_index[:10]]))

Largest coefficients:['perfectly' 'loves' 'best' 'awesome' 'easy' 'amazing' 'perfect'
 'excellent' 'great' 'love']
Smallest coefficients:['not' 'worst' 'terrible' 'return' 'useless' 'waste' 'disappointed' 'poor'
 'horrible' 'returning']


In [120]:
sort_index = svc.coef_[0].argsort()
feature_names = np.array(vect.get_feature_names())
print("Largest coefficients:{}".format(feature_names[sort_index[-10:]]))
print("Smallest coefficients:{}".format(feature_names[sort_index[:10]]))

Largest coefficients:['loving' 'returnyou' 'loves' 'perfect' 'amazing' 'flawlessly' '4eeeks'
 'great' 'love' 'excelent']
Smallest coefficients:['worst' 'mony' 'not' 'lemon' 'useless' 'false' 'nope' 'scammed'
 'unsatisfied' 'paperweight']


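We can also include combinations of two (or at most three) words as features and see whether that improves the score. With single words only, the model cannot tell what a word like "not" refers to, as we can see here: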

In [127]:
svc.predict(vect.transform(['Not an issue, phone is working.','Issue, phone is not working.']))

array([0, 0])

Both are classified as negative, which is wrong: the first review is positive.

In [129]:
#So,
vect = TfidfVectorizer(ngram_range=(1,2), min_df=5).fit(X_train)

In [130]:
X_train_vect = vect.transform(X_train)
X_test_vect = vect.transform(X_test)

model.fit(X_train_vect, y_train)
svc.fit(X_train_vect, y_train)

predictions_logreg = model.predict(X_test_vect)
print("Logistic Regression:{}".format(roc_auc_score(y_test, predictions_logreg)))

predictions_svc = svc.predict(X_test_vect)
print("SVC:{}".format(roc_auc_score(y_test, predictions_svc)))

Logistic Regression:0.9488645714453975
SVC:0.9689819614190661


In [131]:
svc.predict(vect.transform(['Not an issue, phone is working.','Issue, phone is not working.']))

array([1, 0])

Voila!

In [132]:
sort_index = model.coef_[0].argsort()
feature_names = np.array(vect.get_feature_names())
print("Largest coefficients:{}".format(feature_names[sort_index[-10:]]))
print("Smallest coefficients:{}".format(feature_names[sort_index[:10]]))

Largest coefficients:['awesome' 'love this' 'no problems' 'best' 'not bad' 'amazing'
 'excellent' 'perfect' 'love' 'great']
Smallest coefficients:['not' 'disappointed' 'worst' 'poor' 'return' 'terrible' 'horrible'
 'doesn' 'slow' 'broken']


The top features now include two-word phrases such as 'love this', 'no problems' and 'not bad', which is a bonus.
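To see what `ngram_range=(1,2)` adds to the vocabulary, here is a small sketch on two invented sentences. It uses CountVectorizer for simplicity; the `ngram_range` parameter behaves the same way in TfidfVectorizer.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, invented for illustration.
toy_docs = ["not an issue", "issue not fixed"]

# Single words only.
toy_unigrams = sorted(CountVectorizer(ngram_range=(1, 1)).fit(toy_docs).vocabulary_)

# Single words plus every pair of adjacent words.
toy_bigrams = sorted(CountVectorizer(ngram_range=(1, 2)).fit(toy_docs).vocabulary_)

print(toy_unigrams)
print(toy_bigrams)
```

With pairs included, "not an" and "not fixed" become separate features, which is what lets the model treat "not an issue" differently from "not working".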

With that, we have carried out a basic sentiment analysis of text reviews and seen how it is done.
Cheers!