In [8]:
import pandas as pd
import numpy as np
import string

dtype_dict = {'name':str, 'review':str, 'rating':int}

products=pd.read_csv('amazon_baby.csv',dtype=dtype_dict)

In [9]:
type(products.index)

pandas.indexes.range.RangeIndex

In [10]:
products.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


In [11]:
len(products)

183531

In [13]:
for c in products.columns:
    print(c, products[c].dtype)

name object
review object
rating int64


### Perform text cleaning

We start by removing punctuation, so that the words "cake." and "cake!" are counted as the same word.

* Write a function remove_punctuation that strips punctuation from a line of text.
* Apply this function to every element in the review column of products, and save the result to a new column review_clean.

Aside. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach to punctuation would preserve phrases such as "I'd", "would've", "hadn't" and so forth. See this page for an example of smart handling of punctuation.
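
As a rough illustration of the smarter approach mentioned in the aside, the sketch below strips punctuation but keeps apostrophes that sit between two letters, so contractions survive. The helper name `remove_punctuation_keep_contractions` is hypothetical and not part of the assignment; the notebook itself uses the simpler `remove_punctuation`.

```python
import re
import string

# All punctuation except the apostrophe, escaped for use in a character class
_PUNCT_EXCEPT_APOSTROPHE = re.escape(string.punctuation.replace("'", ""))
_STRIP_PATTERN = re.compile(
    "[" + _PUNCT_EXCEPT_APOSTROPHE + "]"   # any non-apostrophe punctuation
    + r"|(?<![A-Za-z])'|'(?![A-Za-z])"     # apostrophes not between two letters
)

def remove_punctuation_keep_contractions(text):
    """Hypothetical helper: drop punctuation but keep in-word apostrophes."""
    return _STRIP_PATTERN.sub("", text)

print(remove_punctuation_keep_contractions("I'd say it's 'great', but it hadn't lasted!"))
# → I'd say it's great but it hadn't lasted
```

Quotation marks around a word are still removed, while "I'd" and "hadn't" stay intact. Possessives that end in an apostrophe (such as "kids'") lose it under this rule.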

IMPORTANT. Make sure to fill N/A values in the review column with empty strings (if applicable). The N/A values indicate empty reviews. For instance, the pandas fillna() method lets you replace all N/A values in the review column as follows:

In [14]:
def remove_punctuation(text):
    # Build a translation table that deletes every ASCII punctuation character
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

products = products.fillna({'review':''})

products['review_clean'] = products['review'].apply(remove_punctuation)

In [15]:
products.head()

Unnamed: 0,name,review,rating,review_clean
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3,These flannel wipes are OK but in my opinion n...
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...



We will ignore all reviews with rating = 3, since they tend to have a neutral sentiment.

In [16]:
products = products[products['rating'] != 3]

In [17]:
len(products)

166752

Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the rating column.

In [18]:
products['sentiment'] = products['rating'].apply(lambda rating : +1 if rating > 3 else -1)

In [19]:
products.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5,it came early and was not disappointed i love ...,1
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5,Very soft and comfortable and warmer than it l...,1
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5,This is a product well worth the purchase I h...,1
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5,All of my kids have cried nonstop when I tried...,1
5,Stop Pacifier Sucking without tears with Thumb...,"When the Binky Fairy came to our house, we did...",5,When the Binky Fairy came to our house we didn...,1


In [20]:
test_f = pd.read_json("module-2-assignment-test-idx.json")
train_f = pd.read_json("module-2-assignment-train-idx.json")

In [21]:
train_data=products.iloc[train_f[0].values]
test_data=products.iloc[test_f[0].values]

In [22]:
test_data.head()

Unnamed: 0,name,review,rating,review_clean,sentiment
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,This has been an easy way for my nanny to reco...,1
10,"Baby Tracker&reg; - Daily Childcare Journal, S...",I love this journal and our nanny uses it ever...,4,I love this journal and our nanny uses it ever...,1
16,Nature's Lullabies First Year Sticker Calendar,"I love this little calender, you can keep trac...",5,I love this little calender you can keep track...,1
20,Nature's Lullabies Second Year Sticker Calendar,I had a hard time finding a second year calend...,5,I had a hard time finding a second year calend...,1
28,"Lamaze Peekaboo, I Love You","One of baby's first and favorite books, and it...",4,One of babys first and favorite books and it i...,1


### Build the word count vector for each review

We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as **bag-of-words** features. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

* Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
* Compute the occurrences of the words in each review and collect them into a row vector.
* Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix train_matrix.
* Using the same mapping between words and columns, convert the test data into a sparse matrix test_matrix.

The following cell uses CountVectorizer in scikit-learn. Notice the token_pattern argument in the constructor.

In [23]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn the vocabulary from the training data and assign columns to words,
# then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])
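
The general steps listed above can also be sketched in plain Python on a toy corpus. This is only an illustration of the idea, not how CountVectorizer is implemented: it builds dense lists rather than a sparse matrix, splits on whitespace only, and the names `learn_vocabulary` and `count_vector` are made up for this example.

```python
from collections import Counter

def learn_vocabulary(train_reviews):
    """Map each word seen in the training reviews to a column index."""
    words = sorted({word for review in train_reviews for word in review.lower().split()})
    return {word: column for column, word in enumerate(words)}

def count_vector(review, vocabulary):
    """Count vocabulary words in one review; words outside the vocabulary are ignored."""
    counts = Counter(review.lower().split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector

train_reviews = ["great product great price", "broke after a week"]
test_reviews = ["great stroller broke"]

# The vocabulary comes from the training data only
vocabulary = learn_vocabulary(train_reviews)
train_rows = [count_vector(review, vocabulary) for review in train_reviews]
# The test data reuses the same word-to-column mapping
test_rows = [count_vector(review, vocabulary) for review in test_reviews]
```

Here "stroller" never appears in the training reviews, so it has no column and is dropped from the test vector, which mirrors what happens when transform() is called on the test data.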

In [24]:
train_matrix

<133416x121712 sparse matrix of type '<class 'numpy.int64'>'
	with 7326618 stored elements in Compressed Sparse Row format>

In [25]:
train_data.shape, train_matrix.shape, test_data.shape, test_matrix.shape

((133416, 5), (133416, 121712), (33336, 5), (33336, 121712))

In [26]:
train_matrix[0]

<1x121712 sparse matrix of type '<class 'numpy.int64'>'
	with 24 stored elements in Compressed Sparse Row format>

In [27]:
'hard' in vectorizer.vocabulary_

True

In [29]:
train_matrix[0].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

### Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data.

* Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.

* There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j correspond to weights that cause positive sentiment, while negative weights correspond to negative sentiment. Calculate the number of positive (>= 0, which is actually nonnegative) coefficients.

In [30]:
from sklearn import linear_model

In [31]:
sentiment_model = linear_model.LogisticRegression()

In [34]:
sentiment_model.fit(train_matrix,train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [48]:
sum([x>=0 for x in sentiment_model.coef_[0]])

85858

In [49]:
sum([x<0 for x in sentiment_model.coef_[0]])

35854

### Making predictions with logistic regression

* Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to sample_test_data. The following cell extracts the three data points from the DataFrame test_data and prints their content:

In [51]:
sample_test_data = test_data[10:13]
print(sample_test_data)

                                                 name  \
59                          Our Baby Girl Memory Book   
71  Wall Decor Removable Decal Sticker - Colorful ...   
91  New Style Trailing Cherry Blossom Tree Decal R...   

                                               review  rating  \
59  Absolutely love it and all of the Scripture in...       5   
71  Would not purchase again or recommend. The dec...       2   
91  Was so excited to get this product for my baby...       1   

                                         review_clean  sentiment  
59  Absolutely love it and all of the Scripture in...          1  
71  Would not purchase again or recommend The deca...         -1  
91  Was so excited to get this product for my baby...         -1  


In [55]:
sample_test_data.iloc[0]['review']

'Absolutely love it and all of the Scripture in it.  I purchased the Baby Boy version for my grandson when he was born and my daughter-in-law was thrilled to receive the same book again.'

In [57]:
sample_test_data.iloc[2]['review']

"Was so excited to get this product for my baby girls bedroom!  When I got it the back is NOT STICKY at all!  Every time I walked into the bedroom I was picking up pieces off of the floor!  Very very frustrating!  Ended up having to super glue it to the wall...very disappointing.  I wouldn't waste the time or money on it."

We will now make a class prediction for the sample_test_data. The sentiment_model should predict +1 if the sentiment is positive and -1 if the sentiment is negative. Recall from the lecture that the score (sometimes called margin) for the logistic regression model is defined as:

$$\text{score}_i = \mathbf{w}^\top h(\mathbf{x}_i)$$

where $h(\mathbf{x}_i)$ represents the features for data point $i$. We will write some code to obtain the scores. For each row, the score (or margin) is a number in the range $(-\infty, \infty)$. Use a pre-built function in your tool to calculate the score of each data point in sample_test_data. In scikit-learn, you can call the decision_function() function.

Hint: You'd probably need to convert sample_test_data into the sparse matrix format first.
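
To make the definition concrete, here is a small sketch of the score for a single review stored as word counts. The weights and counts are made-up numbers, the helper name `score` is hypothetical, and an intercept term is included on the assumption that the model fits one (scikit-learn's LogisticRegression does so by default).

```python
def score(word_counts, weights, intercept=0.0):
    """intercept + w^T h(x), summing only over words that occur in the review."""
    return intercept + sum(
        weights.get(word, 0.0) * count for word, count in word_counts.items()
    )

weights = {"love": 1.5, "great": 1.0, "broke": -2.0}     # made-up coefficients
review_counts = {"love": 2, "broke": 1, "stroller": 1}   # h(x) as word -> count

print(score(review_counts, weights, intercept=-0.5))
# → 0.5
```

Words with no learned weight (here "stroller") contribute nothing, and because only nonzero counts are visited, the cost of scoring scales with the length of the review rather than the size of the vocabulary. A positive score corresponds to a +1 prediction.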



In [59]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print(scores)

[  5.60184477  -3.12478355 -10.40573966]


In [61]:
def score_to_sentiment(score):
    if score > 0:
        return 1
    else:
        return -1

In [63]:
[score_to_sentiment(s) for s in scores]

[1, -1, -1]

In [64]:
predictions=sentiment_model.predict(sample_test_matrix)

In [65]:
predictions

array([ 1, -1, -1])

In [66]:
import math
def prob_prediction(score):
    return (1/(1+math.exp(-score)))

In [67]:
[prob_prediction(s) for s in scores]

[0.9963225254257791, 0.042096455236751894, 3.0257395709178456e-05]

In [68]:
sentiment_model.predict_proba(sample_test_matrix)

array([[  3.67747457e-03,   9.96322525e-01],
       [  9.57903545e-01,   4.20964552e-02],
       [  9.99969743e-01,   3.02573957e-05]])

In [69]:
test_prob=sentiment_model.predict_proba(test_matrix)

In [82]:
p=pd.Series(test_prob[:,1])
test_data["predicted_prob"]=pd.Series(p)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
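
The warning above most likely appears because test_data was itself produced by indexing into products. A related point to keep in mind when adding a column this way: assigning a pandas Series aligns on index labels rather than on row position, so a Series with a fresh 0..n-1 index may not line up with a DataFrame such as test_data that keeps the (gappy) index labels of products. The toy example below uses made-up data, not the review data, to contrast the two behaviours; taking an explicit .copy() first is one common way to avoid the warning.

```python
import pandas as pd

# Toy frame whose index labels have gaps, like a filtered DataFrame.
# The explicit copy makes it an independent object that is safe to modify.
frame = pd.DataFrame({"sentiment": [1, -1, 1]}, index=[1, 5, 9]).copy()
probs = [0.9, 0.2, 0.7]  # one value per row, in row order

# A Series built from the list gets index labels 0, 1, 2, so assignment
# matches on labels: only label 1 exists in both, the rest become NaN
frame["by_label"] = pd.Series(probs)

# A plain list (or NumPy array) carries no index and is assigned by position
frame["by_position"] = probs

print(frame)
```

In the by_label column only the row labelled 1 receives a value (0.2, the Series entry with label 1), while by_position holds the three values in row order.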


In [91]:
test_data.sort_values('predicted_prob', ascending=False).head(20)

Unnamed: 0,name,review,rating,review_clean,sentiment,predicted_prob
26830,"Leachco Wrap Strap Anywhere Safety Strap, Red",I have twin boys and find that very few shoppi...,4,I have twin boys and find that very few shoppi...,1,1.0
14482,Evenflo Whisper Connect Sensa Monitor,This unit worked well for the first few months...,1,This unit worked well for the first few months...,-1,1.0
18112,Bugaboo Frog Complete Stroller - Black,It was the best buy for our baby. I'm so happy...,5,It was the best buy for our baby Im so happy w...,1,1.0
15732,"Fisher-Price Infant-To-Toddler Rocker, Blue/Green","I GOT THIS FOR MY NEW GRANDDAUGHTER ,SHE LOVE ...",5,I GOT THIS FOR MY NEW GRANDDAUGHTER SHE LOVE A...,1,1.0
30535,Munchkin Five Sea Squirts,"very good! Recommended moms buy, your baby is ...",5,very good Recommended moms buy your baby is we...,1,1.0
31271,"Prince Lionheart Wheely Bug, Ladybug, Large",Love this mouse! We bought the large for my 18...,5,Love this mouse We bought the large for my 18 ...,1,1.0
18009,Prince Lionheart Jumbo Toy Hammock,"Super stretchy, easy to attach to wall, and so...",5,Super stretchy easy to attach to wall and solv...,1,1.0
20905,Boppy Newborn Lounger,This is perfect for a newborn.Pros: Soft fabri...,5,This is perfect for a newbornPros Soft fabric ...,1,1.0
12823,Baby Trend High Chair Palm Tree,We have had this chair for almost three years ...,5,We have had this chair for almost three years ...,1,1.0
32449,Aquatopia Deluxe Safety Bath Thermometer Alarm...,Works entirely like it's supposed to. The onl...,4,Works entirely like its supposed to The only ...,1,1.0


In [92]:
test_data.sort_values('predicted_prob', ascending=True).head(20)

Unnamed: 0,name,review,rating,review_clean,sentiment,predicted_prob
9655,Baby Einstein Seek &amp; Discover Activity Gym,"I am a big fan of the ""Baby Einstein"" products...",2,I am a big fan of the Baby Einstein products a...,-1,3.337686e-11
31226,Sassy Teething Feeder and 16 Replacement Bags,I LOVE being able to feed my baby whole food. ...,5,I LOVE being able to feed my baby whole food ...,1,6.740963e-10
7310,Avent Isis Manual Breast Pump,"I got this pump when I was still pregnant, I d...",2,I got this pump when I was still pregnant I di...,-1,7.098455e-10
17222,Fisher-Price Ocean Wonders Aquarium Cradle Swing,We choose the Fisher Price swing because they ...,5,We choose the Fisher Price swing because they ...,1,3.01893e-09
17985,Prince Lionheart Jumbo Toy Hammock,"muito boa para guardar os bichos de pelucia, p...",5,muito boa para guardar os bichos de pelucia po...,1,8.633105e-09
13752,Medela Pump &amp; Save Breastmilk Bags - 50 pa...,Exactly what I needed and ordered and came sup...,5,Exactly what I needed and ordered and came sup...,1,2.597869e-08
13572,Mustela 2-In-1 Hair &amp; Body Shampoo 6.76 ou...,"Smells wonderful, nice and creamy so won't dry...",5,Smells wonderful nice and creamy so wont dry o...,1,4.763777e-08
3747,Playtex Diaper Genie - First Refill Included,I love the Diaper Genie! I received this as a...,5,I love the Diaper Genie I received this as a ...,1,1.06949e-07
394,Baby Trend Diaper Champ,Works great - no smells. LOVE that it uses re...,5,Works great no smells LOVE that it uses regu...,1,1.247145e-07
30538,"Munchkin 2 Pack Fresh Food Feeder, Colors May ...",grandson really loves the stuff you put in thi...,5,grandson really loves the stuff you put in thi...,1,1.33617e-07


In [94]:
test_predictions=sentiment_model.predict(test_matrix)

In [95]:
len(test_predictions)

33336

In [103]:
sum([predict==actual for predict, actual in zip(test_predictions,test_data['sentiment'])])

31080

In [104]:
31080/33336.

0.9323254139668826

In [105]:
training_predictions=sentiment_model.predict(train_matrix)

In [106]:
len(training_predictions)

133416

In [110]:
sum([predict==actual for predict, actual in zip(training_predictions,train_data['sentiment'])])

129096

In [111]:
129096/133416.

0.967620075553157

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subset of words that occur in the reviews. For this assignment, we selected 20 words to work with. These are:

In [112]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

Compute a new set of word count vectors using only these words. The CountVectorizer class has a vocabulary parameter that lets you limit the choice of words when building word count vectors.

Compute word count vectors for the training and test data and obtain the sparse matrices train_matrix_word_subset and test_matrix_word_subset, respectively.


In [113]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

In [115]:
simple_model=linear_model.LogisticRegression()

In [116]:
simple_model.fit(train_matrix_word_subset,train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [132]:
simple_model_coef_table = pd.DataFrame({'word':significant_words,
                                         'coefficient':simple_model.coef_.flatten()})
sentiment_model_coef_table = pd.DataFrame({'word2':vectorizer.get_feature_names(),
                                            'coefficient2':sentiment_model.coef_.flatten()})
pd.merge(simple_model_coef_table, sentiment_model_coef_table,how='inner',left_on='word',right_on='word2')

Unnamed: 0,coefficient,word,coefficient2,word2
0,1.36369,love,1.575339,love
1,0.944,great,1.227473,great
2,1.192538,easy,1.357699,easy
3,0.085513,old,0.053986,old
4,0.520186,little,0.638772,little
5,1.509812,perfect,1.858013,perfect
6,1.673074,loves,1.516041,loves
7,0.50376,well,0.539714,well
8,0.190909,able,0.392523,able
9,0.058855,car,0.123899,car


In [124]:
simple_model_coef_table.sort_values('coefficient', ascending=True)

Unnamed: 0,coefficient,word
14,-2.348298,disappointed
19,-2.109331,return
13,-2.033699,waste
10,-1.651576,broke
17,-0.898031,money
15,-0.621169,work
12,-0.51138,even
18,-0.362167,would
16,-0.320556,product
11,-0.209563,less


In [127]:
def accuracy(model, matrix, target):
    # Fraction of rows whose predicted label matches the true label
    predictions = model.predict(matrix)
    if len(predictions) != len(target):
        raise ValueError("predictions and target must have the same length")
    return np.mean(predictions == np.asarray(target))


In [128]:
accuracy(sentiment_model,test_matrix,test_data['sentiment'])

0.93232541396688262

In [129]:
accuracy(simple_model,test_matrix_word_subset,test_data['sentiment'])

0.86936045116390692

In [133]:
accuracy(sentiment_model,train_matrix,train_data['sentiment'])

0.96762007555315699

In [134]:
accuracy(simple_model,train_matrix_word_subset,train_data['sentiment'])

0.8668225700065959