# Load Amazon dataset
1. Load the dataset consisting of baby product reviews on Amazon.com. Store the data in a data frame named products. With pandas, you would run:

In [1]:
import pandas as pd
import numpy as np

In [2]:
products = pd.read_csv('amazon_baby.csv')

# Perform text cleaning
2. We start by removing punctuation, so that the words "cake." and "cake!" are counted as the same word.

- Write a function **remove_punctuation** that strips punctuation from a line of text
- Apply this function to every element in the **review** column of **products**, and save the result to a new column **review_clean**.

Refer to your tool's manual for string processing capabilities. Python lets us express the operation in a succinct way, as follows:

**Note: the code provided in the course instructions fails here, so I use an alternative function that requires string input. Before removing punctuation, the review column must therefore be set to the string data type.**

In [8]:
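# Fill missing reviews first (casting NaN to str can yield the text 'nan'), then set the data type
products['review'] = products['review'].fillna('').astype(str)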

In [12]:
import string

def remove_punctuation(text):
    # Delete every character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuation('I am Niu. How are you?'))

products['review_clean'] = products['review'].apply(remove_punctuation)

I am Niu How are you


**Aside**. In this notebook, we remove all punctuation for the sake of simplicity. A smarter approach to punctuation would preserve phrases such as "I'd", "would've", "hadn't" and so forth. See this page for an example of smart handling of punctuation.
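As a rough illustration of the smarter handling mentioned in the aside, the sketch below keeps apostrophes that sit between two letters and strips everything else. The helper name `remove_punctuation_keep_contractions` is hypothetical and not part of the course material; the rest of this notebook still uses the simpler **remove_punctuation**.

```python
import re

# Hypothetical helper (not from the course): drop punctuation but keep
# apostrophes that sit between two letters, as in "hadn't".
def remove_punctuation_keep_contractions(text):
    # Remove apostrophes that are not surrounded by letters on both sides
    text = re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", "", text)
    # Remove every remaining character that is not a word character,
    # whitespace, or an apostrophe
    return re.sub(r"[^\w\s']", "", text)

print(remove_punctuation_keep_contractions("I'd say it hadn't broken. 'Great' cake!"))
# I'd say it hadn't broken Great cake
```

Note that `\w` also keeps underscores, so this sketch is slightly more permissive than a filter based on `string.punctuation`.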

**IMPORTANT**. Make sure to fill n/a values in the **review** column with empty strings (if applicable). The n/a values indicate empty reviews. For instance, the pandas fillna() method lets you replace all N/A's in the **review** column as follows:

In [13]:
products = products.fillna({'review':''})  # fill in N/A's in the review column
products = products.fillna({'review_clean':''})  # fill in N/A's in the review_clean column

In [17]:
# have a glance at your data
products.iloc[1]

name                                        Planetwise Wipe Pouch
review          it came early and was not disappointed. i love...
rating                                                          5
review_clean    it came early and was not disappointed i love ...
Name: 1, dtype: object

# Extract Sentiments
3. We will **ignore** all reviews with rating = 3, since they tend to have a neutral sentiment. In pandas, for instance,

In [18]:
# .copy() makes products an independent frame, so adding a column later does not trigger SettingWithCopyWarning
products = products[products['rating'] != 3].copy()

4. Now, we will assign reviews with a rating of 4 or higher to be positive reviews, while the ones with a rating of 2 or lower are negative. For the sentiment column, we use +1 for the positive class label and -1 for the negative class label. A good way is to create an anonymous function that converts a rating into a class label and then apply that function to every element in the **rating** column. In pandas, you would use apply():

In [19]:
products['sentiment'] = products['rating'].apply(lambda rating: +1 if rating > 3 else -1)


Now, we can see that the dataset contains an extra column called sentiment which is either positive (+1) or negative (-1).

In [20]:
products.head()

| | name | review | rating | review_clean | sentiment |
|---|---|---|---|---|---|
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 | it came early and was not disappointed i love ... | 1 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 | Very soft and comfortable and warmer than it l... | 1 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 | This is a product well worth the purchase I h... | 1 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 | All of my kids have cried nonstop when I tried... | 1 |
| 5 | Stop Pacifier Sucking without tears with Thumb... | When the Binky Fairy came to our house, we did... | 5 | When the Binky Fairy came to our house we didn... | 1 |
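The filtering and labelling steps can also be sketched without apply(), using numpy.where on a toy frame. The ratings below are made up for illustration; only the column names follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy ratings (illustrative only); rating 3 is dropped as neutral
toy = pd.DataFrame({'rating': [5, 1, 3, 4, 2]})
toy = toy[toy['rating'] != 3].copy()   # work on an independent copy

# +1 for ratings above 3, -1 otherwise
toy['sentiment'] = np.where(toy['rating'] > 3, 1, -1)
print(toy['sentiment'].tolist())   # [1, -1, 1, -1]
```

On a column this size, the vectorized form mainly saves time; the resulting labels should match the lambda-based version.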


# Split into training and test sets
5. Let's perform a train/test split with 80% of the data in the training set and 20% of the data in the test set. If you are using SFrame, make sure to use seed=1 so that you get the same result as everyone else does. (This way, you will get the right numbers for the quiz.)

**Note: the instructions provided by the course do not work here, and you can't use the scikit-learn splitting API to get the answers the course expects. The cells below load the row indices from the provided JSON files instead.**

In [52]:
import json

# The index file is assumed to hold a single JSON array of row positions
with open('module-2-assignment-train-idx.json', 'r') as train_indice:
    train_list = json.load(train_indice)

In [55]:
train_data = products.iloc[train_list]

In [57]:
import json

# Same assumption as for the training indices: one JSON array of row positions
with open('module-2-assignment-test-idx.json', 'r') as test_indice:
    test_list = json.load(test_indice)

In [59]:
test_data = products.iloc[test_list]
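A quick sanity check on the two index lists can catch a bad split early. The sketch below assumes each index file holds a single JSON array of row positions; the miniature arrays are inlined as strings and are purely illustrative.

```python
import json

# Hypothetical miniature index files, inlined as strings for illustration
train_json = '[0, 2, 3, 5]'
test_json = '[1, 4]'

train_idx = json.loads(train_json)
test_idx = json.loads(test_json)

# No row should appear in both the training and the test set
assert set(train_idx).isdisjoint(test_idx)

train_fraction = len(train_idx) / (len(train_idx) + len(test_idx))
print(train_idx, test_idx, round(train_fraction, 2))
# [0, 2, 3, 5] [1, 4] 0.67
```

With the real files, the same fraction should come out close to 0.8.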

# Build the word count vector for each review
6. We will now compute the word count for each word that appears in the reviews. A vector consisting of word counts is often referred to as **bag-of-words features**. Since most words occur in only a few reviews, word count vectors are sparse. For this reason, scikit-learn and many other tools use sparse matrices to store a collection of word count vectors. Refer to appropriate manuals to produce sparse word count vectors. General steps for extracting word count vectors are as follows:

- Learn a vocabulary (set of all words) from the training data. Only the words that show up in the training data will be considered for feature extraction.
- Compute the occurrences of the words in each review and collect them into a row vector.
- Build a sparse matrix where each row is the word count vector for the corresponding review. Call this matrix **train_matrix**.
- Using the same mapping between words and columns, convert the test data into a sparse matrix **test_matrix**.
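To make the steps above concrete, here is a minimal pure-Python sketch on toy reviews. Each sparse row is stored as a dictionary from column index to count, which is only a stand-in for a real sparse matrix; the data and the helper name `to_sparse_row` are illustrative assumptions.

```python
from collections import Counter

train_reviews = ['great great pouch', 'not great']   # toy data, illustrative only
test_reviews = ['great stroller']

# Learn the vocabulary (word -> column index) from the training reviews only
vocabulary = {}
for review in train_reviews:
    for word in review.split():
        vocabulary.setdefault(word, len(vocabulary))

# Turn a review into a sparse row {column index: count};
# words outside the vocabulary are ignored
def to_sparse_row(review):
    counts = Counter(word for word in review.split() if word in vocabulary)
    return {vocabulary[word]: n for word, n in counts.items()}

train_rows = [to_sparse_row(r) for r in train_reviews]
test_rows = [to_sparse_row(r) for r in test_reviews]

print(vocabulary)   # {'great': 0, 'pouch': 1, 'not': 2}
print(train_rows)   # [{0: 2, 1: 1}, {2: 1, 0: 1}]
print(test_rows)    # [{0: 1}]
```

The word "stroller" never appears in the training reviews, so it contributes nothing to the test row.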

The following cell uses CountVectorizer in scikit-learn. Notice the **token_pattern** argument in the constructor.



In [60]:
from sklearn.feature_extraction.text import CountVectorizer

# Use this token pattern to keep single-letter words
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

**Note: Keep in mind that the test data must be transformed in the same way as the training data.**

This ensures that each column (feature) has the same meaning in both sets.
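A small experiment on toy documents shows what the **token_pattern** argument changes and what happens to test words that were never seen in training. The documents are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['i love it', 'a great pouch']   # toy data, illustrative only
test_docs = ['i love strollers']

# The default pattern only keeps tokens of two or more characters
default_vec = CountVectorizer().fit(train_docs)
# The pattern used in this notebook also keeps single-letter words
custom_vec = CountVectorizer(token_pattern=r'\b\w+\b').fit(train_docs)

print(sorted(default_vec.vocabulary_))   # ['great', 'it', 'love', 'pouch']
print(sorted(custom_vec.vocabulary_))    # ['a', 'great', 'i', 'it', 'love', 'pouch']

# transform() reuses the training vocabulary; 'strollers' is silently dropped
toy_test_matrix = custom_vec.transform(test_docs)
print(toy_test_matrix.shape)   # (1, 6)
print(toy_test_matrix.sum())   # 2
```

The test matrix has the same number of columns as the training vocabulary, which is what lets the classifier apply the same weights to both sets.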

# Train a sentiment classifier with logistic regression
We will now use logistic regression to create a sentiment classifier on the training data.

7. Learn a logistic regression classifier using the training data. If you are using scikit-learn, you should create an instance of the LogisticRegression class and then call the method fit() to train the classifier. This model should use the sparse word count matrix (train_matrix) as features and the column sentiment of train_data as the target. Use the default values for other parameters. Call this model sentiment_model.


In [66]:
# explore the data
print(type(train_matrix))
print(train_matrix.shape)
print(train_data.info())

<class 'scipy.sparse.csr.csr_matrix'>
(133416, 121712)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 133416 entries, 1 to 183529
Data columns (total 5 columns):
name            133174 non-null object
review          133416 non-null object
rating          133416 non-null int64
review_clean    133416 non-null object
sentiment       133416 non-null int64
dtypes: int64(2), object(3)
memory usage: 6.1+ MB
None


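In [None]:
# Build a basic model
from sklearn.linear_model import LogisticRegression

# Logistic regression with default parameters, as the assignment asks
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])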

8. There should be over 100,000 coefficients in this sentiment_model. Recall from the lecture that positive weights w_j correspond to features (words) that push a review toward positive sentiment, while negative weights correspond to negative sentiment. Calculate the number of nonnegative (>= 0) coefficients.

**Quiz question**: How many weights are >= 0?
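One way to count the weights is a boolean comparison followed by a sum. The sketch below uses a made-up array as a stand-in for `sentiment_model.coef_`, which scikit-learn stores with shape (1, n_features) for a binary classifier; the values are not real results.

```python
import numpy as np

# Stand-in for sentiment_model.coef_ (made-up values, shape (1, n_features))
coef = np.array([[0.8, -1.2, 0.0, 0.3, -0.1]])

num_nonnegative = int(np.sum(coef >= 0))   # zero counts as nonnegative
num_negative = int(np.sum(coef < 0))
print(num_nonnegative, num_negative)   # 3 2
```

With the trained model, replacing `coef` by `sentiment_model.coef_` should give the number the quiz asks for.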

# Making predictions with logistic regression
9. Now that a model is trained, we can make predictions on the **test data**. In this section, we will explore this in the context of 3 data points in the test data. Take the 11th, 12th, and 13th data points in the test data and save them to **sample_test_data**. The following cell extracts the three data points from the data frame **test_data** and prints their content: