# Sentiment Analysis on Amazon dataset

- An important part of our information-gathering behavior has always been to find out what other people think.
- Advances in text processing and NLP techniques have improved our ability to identify the sentiment or opinion of individuals from reviews, personal blogs, and social media.
- This helps us understand product or service quality, the competitive scenario, and customer expectations.
- Dataset for the case study: https://www.kaggle.com/nehasontakke/amazon-unlocked-mobilecsv

In [1]:
import pandas as pd
import numpy as np

# Read in the data
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')
print(df.shape)

(413840, 6)


<b>Keep only a random 10% sample of the original dataset</b>

In [2]:
df = df.sample(frac=0.1, random_state=10)
print(df.shape)
df.head()

(41384, 6)


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
394349,Sony XPERIA Z2 D6503 FACTORY UNLOCKED Internat...,,244.95,5,Very good one! Better than Samsung S and iphon...,0.0
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0


In [3]:
# Drop reviews with missing values
df.dropna(inplace=True)

# Remove 'neutral' reviews (rating equal to 3); copy so a new column can be added safely
df = df[df['Rating'] != 3].copy()


<b>Encode the ratings as a binary target:</b>
- Encode 4s and 5s as 1 (rated positively)
- Encode 1s and 2s as 0 (rated poorly)

In [4]:
df['Positively Rated'] = np.where(df['Rating'] > 3, 1, 0)
print(df.shape)
df.head(10)

(30737, 7)


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Positively Rated
34377,Apple iPhone 5c 8GB (Pink) - Verizon Wireless,Apple,194.99,1,"The phone needed a SIM card, would have been n...",1.0,0
248521,Motorola Droid RAZR MAXX XT912 M Verizon Smart...,Motorola,174.99,5,I was 3 months away from my upgrade and my Str...,3.0,1
167661,CNPGD [U.S. Office Extended Warranty] Smartwat...,CNPGD,49.99,1,an experience i want to forget,0.0,0
73287,Apple iPhone 7 Unlocked Phone 256 GB - US Vers...,Apple,922.0,5,GREAT PHONE WORK ACCORDING MY EXPECTATIONS.,1.0,1
277158,Nokia N8 Unlocked GSM Touch Screen Phone Featu...,Nokia,95.0,5,I fell in love with this phone because it did ...,0.0,1
100311,Blackberry Torch 2 9810 Unlocked Phone with 1....,BlackBerry,77.49,5,I am pleased with this Blackberry phone! The p...,0.0,1
251669,Motorola Moto E (1st Generation) - Black - 4 G...,Motorola,89.99,5,"Great product, best value for money smartphone...",0.0,1
279878,OtterBox 77-29864 Defender Series Hybrid Case ...,OtterBox,9.99,5,I've bought 3 no problems. Fast delivery.,0.0,1
406017,Verizon HTC Rezound 4G Android Smarphone - 8MP...,HTC,74.99,4,Great phone for the price...,0.0,1
302567,"RCA M1 Unlocked Cell Phone, Dual Sim, 5Mp Came...",RCA,159.99,5,My mom is not good with new technoloy but this...,4.0,1


In [5]:
df['Positively Rated'].value_counts(normalize=True) * 100

1    74.717767
0    25.282233
Name: Positively Rated, dtype: float64

- About 75% of the reviews are positive and 25% are negative.
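Because roughly three quarters of the reviews are positive, a classifier that labels every review as positive already scores well on F1 for the positive class. The sketch below computes that baseline from the class share printed above. It assumes the test split keeps the same class ratio (which a stratified split approximately does) and is offered only as a reference point for reading the F1 scores, not as part of the original analysis.

```python
# Share of positive reviews, taken from the value_counts output above.
positive_share = 0.74717767

# An "always positive" classifier finds every positive review (recall = 1),
# and its precision equals the share of positives in the data.
precision = positive_share
recall = 1.0

# F1 is the harmonic mean of precision and recall.
baseline_f1 = 2 * precision * recall / (precision + recall)
print(f"Always-positive baseline F1: {baseline_f1:.4f}")
```

Any model worth keeping should beat this number by a clear margin.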

In [6]:
from sklearn.model_selection import train_test_split
# Split data into training and test sets, stratified on the label
X_train, X_test, y_train, y_test = train_test_split(
    df['Reviews'], df['Positively Rated'],
    test_size=0.33, random_state=0, stratify=df['Positively Rated'])

In [7]:
print('\nX_train shape: ', X_train.shape)
print('\nX_test shape: ', X_test.shape)


X_train shape:  (20593,)

X_test shape:  (10144,)


<b>Feature extraction using CountVectorizer</b>

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

# Fit the CountVectorizer to the training data
vect = CountVectorizer().fit(X_train)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vect.get_feature_names_out()
print(len(feature_names))
# Print every 1000th feature as a sample of the vocabulary
print('Sample of extracted features: {}'.format(feature_names[::1000].tolist()))
# Transform the documents in the training data to a document-term matrix
X_train_vectorized = vect.transform(X_train)

18482
Sample of extracted features: ['00', 'accesories', 'aunque', 'byers', 'confused', 'deteriorado', 'entertain', 'foreground', 'haver', 'iphone4s', 'machine', 'netorks', 'performed', 'puse', 'rich', 'skills', 'suspicous', 'tweaks', 'who']


In [9]:
from sklearn.linear_model import LogisticRegression
# Train the model
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
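The warning above means the default lbfgs solver reached its iteration limit before converging. One common remedy, suggested by the message itself, is to raise `max_iter`. The sketch below shows the idea on a tiny made-up corpus; `toy_reviews`, `toy_labels`, and `toy_model` are illustrative names and data, not the Amazon reviews. The scores reported in this notebook were produced with the default settings, so they would have to be re-run to see any effect.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative corpus (hypothetical data, not from the Amazon dataset).
toy_reviews = ["great phone, love it", "terrible battery, junk",
               "excellent screen", "worst purchase, very slow"]
toy_labels = [1, 0, 1, 0]

X_toy = CountVectorizer().fit_transform(toy_reviews)

# Turn ConvergenceWarning into an error so a failure to converge is not missed.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    toy_model = LogisticRegression(max_iter=1000).fit(X_toy, toy_labels)

# n_iter_ reports how many iterations the solver actually used.
print("Iterations used:", toy_model.n_iter_[0])
```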

In [10]:
from sklearn.metrics import f1_score  # Metric used to check model performance

# Predict the transformed test documents
predictions = model.predict(vect.transform(X_test))
print('F1 score: ', f1_score(y_test, predictions))

model.coef_[0]

F1 score:  0.9521821384304261


array([-0.36816228, -0.06249667, -0.00541864, ...,  0.08696725,
       -0.16110622,  0.04929907])

In [11]:
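# Get the feature names as a NumPy string array
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feature_names = vect.get_feature_names_out().astype(str)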

# Sort the coefficients from the model
sorted_coef_index = model.coef_[0].argsort()

# Find the 10 smallest and 10 largest coefficients
# The 10 largest coefficients are being indexed using [:-11:-1] 
# so the list returned is in order of largest to smallest

print('Smallest Coefs helps to identify negative reviews:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs to identify positive reviews: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

feature_names[sorted_coef_index]

Smallest Coefs helps to identify negative reviews:
['junk' 'terrible' 'worst' 'poor' 'garbage' 'slow' 'sucks' 'disappointed'
 'freezes' 'defective']

Largest Coefs to identify positive reviews: 
['excellent' 'excelent' 'love' 'excelente' 'loves' 'perfectly' 'perfect'
 'great' 'amazing' 'exactly']


array(['junk', 'terrible', 'worst', ..., 'love', 'excelent', 'excellent'],
      dtype='<U117')
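The slicing used above can be easier to follow on a small example. The sketch below uses plain Python lists with made-up words and coefficients (all hypothetical) to show that `[:k]` on the ascending sort order gives the most negative weights, while `[:-(k + 1):-1]` gives the most positive ones, largest first. The same slice semantics apply to the NumPy index array returned by `argsort()`.

```python
# Hypothetical vocabulary and coefficients, for illustration only.
words = ["junk", "phone", "great", "slow", "love", "case"]
coefs = [-2.1, 0.05, 1.8, -1.2, 2.3, 0.0]

# Indices that sort the coefficients in ascending order
# (the pure-Python counterpart of numpy's argsort).
order = sorted(range(len(coefs)), key=lambda i: coefs[i])

k = 2
most_negative = [words[i] for i in order[:k]]
# A negative step walks backwards from the end; stopping at -(k + 1)
# yields exactly k items, ordered from largest to smallest.
most_positive = [words[i] for i in order[:-(k + 1):-1]]

print(most_negative)  # ['junk', 'slow']
print(most_positive)  # ['love', 'great']
```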

<b>Feature extraction using TfidfVectorizer</b>

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Fit the TfidfVectorizer to the training data, specifying a minimum document frequency of 5
vect = TfidfVectorizer(min_df=5).fit(X_train)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(vect.get_feature_names_out())
X_train_vectorized = vect.transform(X_train)
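To make the weighting concrete, here is a minimal pure-Python sketch of the smoothed TF-IDF formula that scikit-learn documents as the `TfidfVectorizer` default: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The toy corpus and helper names are hypothetical, tokenization is simplified to whitespace splitting, and `min_df` filtering is left out.

```python
import math
from collections import Counter

# Hypothetical toy corpus; whitespace tokenization is a simplification.
docs = ["great phone great price", "slow phone", "great camera"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + f)) + 1 for t, f in doc_freq.items()}

def tfidf_vector(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf_vector(tokens) for tokens in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

In the second document, the rarer word "slow" ends up with a larger weight than "phone", which appears in two of the three documents.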

In [13]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('F1 score: ', f1_score(y_test, predictions))

F1 score:  0.9521466036391893


In [14]:
feature_names = np.array(vect.get_feature_names_out())
sorted_coef_index = model.coef_[0].argsort()

print('Smallest Coefs helps to identify negative reviews:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs to identify positive reviews: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs helps to identify negative reviews:
['not' 'slow' 'disappointed' 'doesn' 'worst' 'poor' 'terrible' 'waste'
 'return' 'never']

Largest Coefs to identify positive reviews: 
['great' 'love' 'excellent' 'good' 'perfect' 'best' 'awesome' 'amazing'
 'excelente' 'far']


In [15]:
# These reviews are treated the same by our current model
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[0 0]
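Both sentences receive the same prediction because a unigram bag-of-words representation ignores word order: the two reviews contain exactly the same words. The sketch below illustrates this with a simplified tokenizer modelled on the default `CountVectorizer` token pattern (lowercased tokens of two or more word characters); the helper names are hypothetical. Adjacent word pairs (bigrams), by contrast, do differ between the two reviews.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters, mirroring the
    # default CountVectorizer token pattern r"(?u)\b\w\w+\b".
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    # Join each run of n consecutive tokens into one feature string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review_a = "not an issue, phone is working"
review_b = "an issue, phone is not working"

unigrams_a = Counter(tokenize(review_a))
unigrams_b = Counter(tokenize(review_b))
bigrams_a = set(ngrams(tokenize(review_a), 2))
bigrams_b = set(ngrams(tokenize(review_b), 2))

print(unigrams_a == unigrams_b)       # True: identical word counts
print(sorted(bigrams_a - bigrams_b))  # ['is working', 'not an']
print(sorted(bigrams_b - bigrams_a))  # ['is not', 'not working']
```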


<b>Feature extraction using CountVectorizer with n-grams</b>

In [16]:
# Fit the CountVectorizer to the training data, specifying a minimum
# document frequency of 5 and extracting 1-grams and 2-grams
vect = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(X_train)
X_train_vectorized = vect.transform(X_train)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(vect.get_feature_names_out())

26024

In [17]:
model = LogisticRegression()
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('F1 score: ', f1_score(y_test, predictions))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 score:  0.9594187720905878


In [18]:
feature_names = np.array(vect.get_feature_names_out())
sorted_coef_index = model.coef_[0].argsort()
print('Smallest Coefs helps to identify negative reviews:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs to identify positive reviews: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs helps to identify negative reviews:
['no good' 'junk' 'poor' 'not good' 'terrible' 'sucks' 'broken'
 'defective' 'slow' 'worst']

Largest Coefs to identify positive reviews: 
['excellent' 'excelente' 'excelent' 'great' 'perfect' 'love' 'not bad'
 'no problems' 'awesome' 'amazing']


#### Make inferences using the built model
<b>Identify the sentiment of the following reviews using the model:</b>
- 'not an issue, phone is working'
- 'an issue, phone is not working'

In [19]:
print(model.predict(vect.transform(['not an issue, phone is working',
                                    'an issue, phone is not working'])))

[1 0]


### Extract sentiment from text using a pre-trained model

TextBlob (built on top of NLTK) is a library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

- https://textblob.readthedocs.io/en/dev/quickstart.html

In [20]:
from textblob import TextBlob
from nltk.corpus import stopwords #To get stopwords from NLTK corpus

English_stopwords=stopwords.words('english')
print(len(English_stopwords),English_stopwords[0:10])

179 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


<b>Words like 'no' and 'not' help us understand the sentiment of a sentence, so for this use case we do not treat them as stop words.</b>

In [21]:
English_stopwords.remove('no')
English_stopwords.remove('not')
print(len(English_stopwords),English_stopwords[0:10])

177 ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


In [22]:
tb=TextBlob('world is beautiful place')
tb.sentiment

Sentiment(polarity=0.85, subjectivity=1.0)

In [23]:
tb=TextBlob('Global warming is worst thing for world')
tb.sentiment

Sentiment(polarity=-0.5, subjectivity=0.5)

###### * Text polarity measures how negative or positive a piece of text is, combining the positive and negative emotions in a sentence. TextBlob reports polarity in the range [-1.0, 1.0] and subjectivity in the range [0.0, 1.0]. Polarity is notoriously hard for computers to predict; it is hard even for people to judge from text alone.

<b>Sentiment calculation using TextBlob</b>

In [24]:
def sentiment_using_text_blob(msg):
    """Return 1 if the TextBlob polarity of msg is positive, otherwise 0."""
    # Remove stop words ('no' and 'not' were taken out of the list above)
    msg = ' '.join([m_ for m_ in msg.split() if m_ not in English_stopwords])
    tb = TextBlob(msg)
    return 1 if tb.sentiment.polarity > 0 else 0

# Score every test review with TextBlob and compare against the true labels
predictions = np.array([sentiment_using_text_blob(x_) for x_ in X_test])
print('F1 score: ', f1_score(y_test, predictions))


F1 score:  0.8575631937943026
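One detail of the function above is worth checking: the stop-word test compares whitespace-separated tokens exactly as written, so capitalized words such as "The" and tokens with attached punctuation are not matched against the lowercase NLTK list. The sketch below demonstrates this with a small hypothetical stop-word set and a hypothetical `remove_stopwords` helper. Whether normalizing case would change the F1 score reported above is not something this notebook measures.

```python
# Small hypothetical stop-word set standing in for the NLTK list
# (with 'no' and 'not' deliberately left out, as in the notebook).
stop_words = {"the", "is", "a", "this"}

def remove_stopwords(msg, lowercase=False):
    """Drop stop words from a whitespace-tokenized message."""
    kept = []
    for token in msg.split():
        # Optionally lowercase the token before the membership test.
        key = token.lower() if lowercase else token
        if key not in stop_words:
            kept.append(token)
    return " ".join(kept)

review = "The phone is not working. This is a problem"

case_sensitive = remove_stopwords(review)
case_insensitive = remove_stopwords(review, lowercase=True)

print(case_sensitive)    # 'The' and 'This' survive the filter
print(case_insensitive)  # 'The' and 'This' are removed as well
```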


In [27]:
# Show the first five test reviews alongside their TextBlob predictions
sample = (list(X_test.iloc[:5]), predictions[:5])
sample

(["get what you pay for. this is the 4th one I've had. They break easily.",
  "I got this phone to replace my Galaxy S that gave me so many problems. This phone is similar when it comes to features and was easy to figure out how to use but I do have complaints. Before I ordered this phone, I never read anywhere that I would need to by an SD memory card in order to do basically anything that wasn't just calls or texts. The screen doesn't light up when you get a text. The sensor that is supposed to keep the screen turned off when you are talking on the phone isn't very sensitive. There is no task manager, the battery drains fast even if you aren't using it AT ALL. You can't get rid of extra home screens that you aren't using. These aren't huge, deal-breaking complaints but I'm not exactly thrilled with the phone either. For the price though, I suppose I'm satisfied.",
  'Works well. I got the 128 gig version and it has huge amounts of space for video and pictures. Very easy to set up. Ba