### Week 3: Programming assignment begins here (using pandas, NumPy and scikit-learn)
### Reference: https://medium.com/tensorist/classifying-yelp-reviews-using-nltk-and-scikit-learn-c58e71e962d9

In [57]:
#Importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%pylab inline


Populating the interactive namespace from numpy and matplotlib


In [58]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

### Import the dataset and inspect its contents

In [59]:
reviews_df = pd.read_csv('amazon_baby.csv')

In [60]:
# Dimensions of the dataset: (number of rows, number of columns)
reviews_df.shape


(183531, 3)

In [61]:
reviews_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [62]:
reviews_df.describe()

Unnamed: 0,rating
count,183531.0
mean,4.120448
std,1.285017
min,1.0
25%,4.0
50%,5.0
75%,5.0
max,5.0


In [63]:
# Let's inspect the data types of the columns
reviews_df.dtypes

name      object
review    object
rating     int64
dtype: object

In [64]:
# Replace missing values (NaN) with empty strings so that empty reviews become ''.
# Note that fillna('') is applied to every column, not only to review.
reviews_df = reviews_df.fillna('')
reviews_df

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5
6,A Tale of Baby\'s Days with Peter Rabbit,"Lovely book, it\'s bound tightly so you may no...",4
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents. We were able to keep ...,5
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,5
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4


In [65]:
# The review column has dtype object, which means it may hold values that are not strings (for example, floats).
# Cast every value to str so that string operations such as len() work on all rows.
# https://stackoverflow.com/questions/23158447/convert-float-to-string-in-pandas
reviews_df['review'] = reviews_df['review'].astype(str)


In [66]:
#Find the length of each review and add a new column
reviews_df['review length'] = reviews_df['review'].apply(len)
#Inspect the dataframe again. You will see that a 4th column containing the length of the review has been added.
reviews_df.head()


Unnamed: 0,name,review,rating,review length
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,452
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,158
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,143
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,390
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,405
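
The length column can also be computed with the vectorized `.str` accessor instead of `apply(len)`. The sketch below is only an illustration: `toy_reviews` and its contents are made up, not taken from `amazon_baby.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for reviews_df['review']; one entry is missing.
toy_reviews = pd.Series(["good", np.nan, "not great"])

# Fill the missing review with an empty string, then take the length of each string.
lengths = toy_reviews.fillna("").str.len()
print(lengths.tolist())
```

A missing review gets length 0 here because it is filled with `''` before measuring.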


### Define what is a positive and a negative sentiment


In [67]:
# Ignore reviews rated 3 (neutral). Ratings of 4 and 5 are considered positive; ratings of 1 and 2 are considered negative.
# A positive sentiment is encoded as +1 and a negative sentiment as -1.
# .copy() makes the filtered frame independent of the original, so adding a column does not raise SettingWithCopyWarning.
reviews_df = reviews_df[reviews_df['rating'] != 3].copy()
# Construct a new column holding the sentiment label; a value of +1 indicates a positive review.
reviews_df['sentiment'] = reviews_df['rating'].apply(lambda rating: +1 if rating > 3 else -1)
reviews_df


Unnamed: 0,name,review,rating,review length,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,158,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,143,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,390,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,405,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,495,1
6,A Tale of Baby\'s Days with Peter Rabbit,"Lovely book, it\'s bound tightly so you may no...",4,217,1
7,"Baby Tracker&reg; - Daily Childcare Journal, S...",Perfect for new parents. We were able to keep ...,5,254,1
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,5,196,1
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,416,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,1012,1
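
The filtering and labeling step can be checked on a small frame. This is a minimal sketch, assuming a hypothetical `toy_df` with invented rows; it uses `np.where` as a vectorized alternative to the `apply` with a lambda.

```python
import numpy as np
import pandas as pd

# Toy stand-in for reviews_df; the rows are invented for illustration only.
toy_df = pd.DataFrame({
    "review": ["love it", "it is ok", "terrible", "great value", "awful smell"],
    "rating": [5, 3, 1, 4, 2],
})

# Drop neutral reviews. .copy() returns an independent DataFrame,
# so assigning a new column afterwards does not warn about a slice.
labeled = toy_df[toy_df["rating"] != 3].copy()

# Vectorized labeling: +1 for ratings above 3, -1 otherwise.
labeled["sentiment"] = np.where(labeled["rating"] > 3, 1, -1)
print(labeled)
```

The row with rating 3 disappears and the remaining rows receive +1 or -1, matching the encoding used in the notebook.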


## First split the data into a training set and a testing set

In [68]:
from sklearn.model_selection import train_test_split
# Define the training set and the testing set. The testing set is held out to evaluate the model.
# X keeps every column of reviews_df (only the review text is vectorized later); y is the sentiment label.
X_train, X_test, y_train, y_test = train_test_split(reviews_df, reviews_df.sentiment, test_size=0.2, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(133401, 5)
(33351, 5)
(133401L,)
(33351L,)
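
The split above is purely random. When one class is much rarer than the other, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both parts. The following sketch shows this on a hypothetical `toy_df`; the notebook itself does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (invented): 16 positive and 4 negative labels.
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "sentiment": [1] * 16 + [-1] * 4,
})

# stratify keeps the 80/20 class ratio in both the training and the testing part.
train_df, test_df = train_test_split(
    toy_df, test_size=0.25, random_state=1, stratify=toy_df["sentiment"]
)

print(train_df["sentiment"].value_counts())
print(test_df["sentiment"].value_counts())
```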


## Vectorizing our dataset

In [69]:
# References for feature extraction using the bag-of-words technique:
# http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
# https://www.youtube.com/watch?v=ZiKMIuYidY0 (Data School video tutorial, highly recommended)
# https://github.com/justmarkham/pycon-2016-tutorial (see tutorial.ipynb)

# Import and instantiate the vectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary=selected_words)

# The vocabulary is fixed to selected_words, so fit() does not learn any new words from the training data.
# fit() modifies the vectorizer in place.
vectorizer.fit(X_train['review'])

# Build the document-term matrix (DTM) for the training reviews
X_train_dtm = vectorizer.transform(X_train['review'])
# Equivalently, combine fit and transform into a single step
X_train_dtm = vectorizer.fit_transform(X_train['review'])

# Build the DTM for X_test with the same vectorizer.
# Do not call fit on the test set: the test data must be encoded with the vocabulary used for training.
X_test_dtm = vectorizer.transform(X_test['review'])


# Examine the vocabulary and the DTM together.
# Converting the DTM to a dense array raised a memory error here, so the line is left commented out.
# It is nevertheless the way to view the DTM alongside its feature names.
#pd.DataFrame(X_train_dtm.toarray(),columns=vectorizer.get_feature_names())
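
To see what a fixed-vocabulary `CountVectorizer` produces, here is a small sketch on three invented sentences. The names `words`, `docs`, `vec` and `dense` are hypothetical and the vocabulary is a shortened version of `selected_words`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, hypothetical vocabulary and invented documents.
words = ["great", "love", "bad"]
docs = [
    "Great stroller, love it. Love the color!",
    "Bad zipper, bad seams",
    "It arrived on time",
]

# With vocabulary=..., only these words are counted; every other token is ignored.
vec = CountVectorizer(vocabulary=words)
dtm = vec.fit_transform(docs)

# The matrix is tiny here, so converting it to a dense DataFrame is safe.
dense = pd.DataFrame(dtm.toarray(), columns=words)
print(dense)
```

Each row is one document and each column counts one vocabulary word, in the order the vocabulary list was given. A document containing none of the words becomes an all-zero row.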


## Building and evaluating the model using logistic regression

In [70]:
#Import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [71]:
#Train the model using X_train_dtm
logreg.fit(X_train_dtm, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [72]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [73]:
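# calculate predicted probabilities for X_test_dtm
# Column 1 of predict_proba corresponds to logreg.classes_[1], which is the positive class (+1) here.
# Logistic regression probabilities are generally reasonably well calibrated.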
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([ 0.79079627,  0.97354921,  0.90147483, ...,  0.97354921,
        0.90147483,  0.79079627])

In [74]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.84756079277982665

In [75]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.68650039258471462
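
An accuracy figure is easier to interpret next to a trivial baseline, especially when the classes may be imbalanced. The sketch below uses invented labels (`y_true` is hypothetical, not the notebook's `y_test`) and scores a "classifier" that always predicts the majority class.

```python
import numpy as np
from sklearn import metrics

# Invented labels: 8 positive, 2 negative.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, -1, -1])

# Baseline that always predicts the majority class (+1).
y_baseline = np.full_like(y_true, 1)

baseline_acc = metrics.accuracy_score(y_true, y_baseline)

# A constant score cannot rank positives above negatives, so its AUC is 0.5.
baseline_auc = metrics.roc_auc_score(y_true, np.full(len(y_true), 0.5))

# Rows are true classes, columns are predicted classes, ordered [-1, +1].
cm = metrics.confusion_matrix(y_true, y_baseline, labels=[-1, 1])

print(baseline_acc, baseline_auc)
print(cm)
```

On these toy labels the baseline reaches high accuracy while its AUC stays at chance level, which is why it can help to report both metrics together with a confusion matrix.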

In [76]:
# Show the test-set rows that the model classified correctly
X_test[y_pred_class == y_test]

Unnamed: 0,name,review,rating,review length,sentiment
178665,15 Plastic Alligator Grip Suspender Pacifier B...,These clips are just what I was looking for. ...,5,120,1
158713,"green sprouts 2 Count Cool Hand Teether, Green...","This was a great buy, the baby really loves ch...",5,125,1
11916,Kidkusion Kid Safe Banister Guard,It\'s a little amusing that this is marketed a...,5,822,1
55010,Mommy\'s Helper Car Seat Sun Shade,I live an area of the US where we get summers ...,5,437,1
44239,Gerber Graduates BPA Free 4 Pack Bunch-A-Bowls...,I ordered these to give to my daughter - she l...,5,241,1
138343,"Gerber 12 Pack Wash Cloth Set, Blue",These were much softer than I thought. I have ...,5,139,1
170161,Pura Kiki Stainless Infant Bottle Stainless St...,Pros:Stainless steel and silicone throughout -...,4,669,1
117226,"Summer Infant Disney Water Squirters, Winnie t...",These were so cheap and my son loves them! The...,5,111,1
102353,"Skip Hop Zoo Lunchie Insulated Lunch Bag, Monkey",Lovely! I use with 2 thermos ice mats (one 9 c...,5,737,1
42744,Stairway Gate Installation Kit (K12) by KidCo,It is a quality product and totally met our ex...,5,124,1


## Print the document term matrix for the training set

In [83]:
#pd.DataFrame(X_train_dtm.toarray(),columns=vectorizer.get_feature_names()) 




## Print the column names in the X_train_dtm

In [84]:
# Print the column names of X_train_dtm
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
columns = vectorizer.get_feature_names()
print(columns)

['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']


## Print the weights learned by the classifier 

In [85]:
# https://stackoverflow.com/questions/47303261/getting-weights-of-features-using-scikit-learn-logistic-regression
# coef_[0] holds one coefficient per feature.
# The values are ordered like the columns of X_train_dtm, that is, in the order of selected_words.
coef = logreg.coef_[0]
print(coef)

[ 1.2264206   0.8839883   0.9133578   1.07063568  1.39194209 -2.24692436
 -1.00144791 -2.22244556 -2.05852225 -0.10411529 -1.42709769]
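
The weights are easier to read when paired with their words. The sketch below copies the feature names and coefficient values from the outputs above into plain lists (so it runs without the trained model) and sorts them with a pandas Series; `weights` is a name introduced here for illustration.

```python
import pandas as pd

# Feature names and coefficients copied from the printed outputs above.
columns = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible',
           'bad', 'terrible', 'awful', 'wow', 'hate']
coef = [1.2264206, 0.8839883, 0.9133578, 1.07063568, 1.39194209, -2.24692436,
        -1.00144791, -2.22244556, -2.05852225, -0.10411529, -1.42709769]

# Index each coefficient by its word and sort from most negative to most positive.
weights = pd.Series(coef, index=columns).sort_values()
print(weights)

most_negative = weights.index[0]
most_positive = weights.index[-1]
```

With the trained model in memory, `pd.Series(logreg.coef_[0], index=columns)` would give the same pairing. In this ordering `horrible` carries the most negative weight and `love` the most positive, while `wow` sits close to zero.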
