# Programming Assignment 2: Analyzing Product Sentiment

Use a set of key polarizing words to train a sentiment analysis model, and inspect the weight the model assigns to each word. Then compare the results of this classifier with those of a classifier that uses all of the words.

In [1]:
import re
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from collections import Counter



A subset of key polarizing words:

In [2]:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']

### Data pre-processing

In [3]:
products = pd.read_csv("data/amazon_baby.csv")
print("There are " + str(products.shape[0]) + " entries and " + str(products.shape[1]) + " columns.")
products.head()
# products.sample(n=3, random_state=0)

There are 183531 entries and 3 columns.


|  | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


## Task 1: Build a new feature with the counts for each of the selected words
The task is to create a new column in the products DataFrame holding the count of each word in <code>selected_words</code>.

In [None]:
def count_words(sentence):
    # Convert the sentence to lower case and remove basic punctuation
    sentence = re.sub(r'\.|,|!|\?| -', '', sentence.lower())

    # Count how often each whitespace-separated word occurs
    return Counter(sentence.split())

def count_target_words(sent_counter, selected_words):
    # A Counter returns 0 for words it has not seen,
    # so no explicit membership test is needed
    return {word: sent_counter[word] for word in selected_words}
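
As a quick sanity check, the two helpers above can be tried on a single sentence before touching the whole dataset. The sketch below is a minimal illustration, not part of the assignment: the function name `tokenize_counts`, the sample sentence, and the three-word target list are all made up for the example, and the cleaning rule is copied from `count_words`.

```python
import re
from collections import Counter

def tokenize_counts(sentence):
    # Same cleaning rule as count_words: lower-case, then drop . , ! ? and " -"
    cleaned = re.sub(r'\.|,|!|\?| -', '', sentence.lower())
    return Counter(cleaned.split())

# Made-up review text, used only for illustration
counts = tokenize_counts("Great stroller, great price! I love it.")

# Looking up a word the Counter never saw yields 0 rather than a KeyError
targets = {word: counts[word] for word in ['great', 'love', 'hate']}
print(targets)  # {'great': 2, 'love': 1, 'hate': 0}
```

Because punctuation is deleted rather than replaced by a space, text such as `great.Love` (no space after the period) would collapse into the single token `greatlove`.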

In [17]:
# Row-wise helper for .apply(axis=1): counts every word in the review
def count(row):
    # Convert the review to lower case and remove basic punctuation
    sentence = re.sub(r'\.|,|!|\?| -', '', row['review'].lower())

    # Count how often each whitespace-separated word occurs
    return Counter(sentence.split())

# Row-wise helper for .apply(axis=1): counts only the selected words
def count_selected(row, word_list=selected_words):
    # Convert the review to lower case and remove basic punctuation
    sentence = re.sub(r'\.|,|!|\?| -', '', row['review'].lower())

    # Count how often each whitespace-separated word occurs
    word_count = Counter(sentence.split())

    # A Counter returns 0 for words it has not seen,
    # so words absent from the review get a count of 0
    return {word: word_count[word] for word in word_list}

Applying the functions in the cell above directly to the original dataframe raises <code>AttributeError: 'float' object has no attribute 'lower'</code>, because some cells in the review column are empty and pandas stores them as NaN, which is a float. One solution is to remove the rows whose review is empty; another is to replace the NaN values with an empty string.

For more information, see [this page](https://stackoverflow.com/questions/42224700/attributeerror-float-object-has-no-attribute-split).
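
The two options can be sketched on a tiny frame. This is a minimal illustration with made-up rows; the names `toy`, `dropped`, and `filled` are hypothetical and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows; the middle review is missing
toy = pd.DataFrame({
    'review': ['Love it!', np.nan, 'Bad fit.'],
    'rating': [5, 4, 1],
})

# Option 1: drop the rows whose review is missing
dropped = toy[toy['review'].notnull()]

# Option 2: keep every row and treat a missing review as empty text
filled = toy.copy()
filled['review'] = filled['review'].fillna('')

print(len(dropped), len(filled))  # 2 3
```

Dropping rows shrinks the dataset, while filling keeps the row count and gives such rows a count of zero for every word.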

In [18]:
p1 = products.copy(deep=True) # Make a copy of the original dataframe

# Applying count() directly to the dataframe raises
# AttributeError: 'float' object has no attribute 'lower', because the
# 'review' column has cells with NaN values.

# Drop the rows with a missing review before calling .apply()
p1 = p1[p1['review'].notnull()]

# Get word count
# p1['word_count'] = p1.apply(count, axis = 1)

# Get word count of selected words
p1['selected_count'] = p1.apply(count_selected, axis = 1)
p1.head()

|  | name | review | rating | selected_count |
|---|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 | {'awesome': 0, 'great': 0, 'fantastic': 0, 'am... |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 | {'awesome': 0, 'great': 0, 'fantastic': 0, 'am... |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 | {'awesome': 0, 'great': 0, 'fantastic': 0, 'am... |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 | {'awesome': 0, 'great': 0, 'fantastic': 0, 'am... |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 | {'awesome': 0, 'great': 1, 'fantastic': 0, 'am... |


#### Quiz Question 1: 
Use the .sum() method on each of the new columns created and answer the following question: out of the <code>selected_words</code>, which one is used most in the dataset? Which one is used least?
*Save these results to answer the quiz at the end.*
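
Since <code>selected_count</code> stores one dictionary per review, the counts need to be spread into one numeric column per word before .sum() can be applied. The sketch below shows one possible way to do that on a toy frame; the three-word list and the dictionaries are made-up values, not results from the dataset.

```python
import pandas as pd

words = ['great', 'love', 'hate']  # shortened word list, for illustration only

# Made-up per-review dictionaries in the same shape as selected_count
toy = pd.DataFrame({
    'selected_count': [
        {'great': 1, 'love': 0, 'hate': 0},
        {'great': 2, 'love': 1, 'hate': 0},
        {'great': 0, 'love': 1, 'hate': 1},
    ]
})

# Create one integer column per word from the dictionary column
for word in words:
    toy[word] = toy['selected_count'].apply(lambda counts, w=word: counts[w])

# Column-wise totals across all reviews
totals = toy[words].sum()
most_used, least_used = totals.idxmax(), totals.idxmin()
print(totals.to_dict(), most_used, least_used)
```

On the real data, the same loop would run over <code>selected_words</code> and the <code>p1</code> dataframe.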

## Task 2: Create a new sentiment analysis model using only the <code>selected_words</code> as features

- Split the dataset into train and test sets
- Train a logistic regression classifier using just the selected words
- Examine the weights the learned classifier assigned to each of the words in selected_words
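
The steps above can be sketched with PyTorch, which the notebook already imports. This is a minimal sketch under stated assumptions, not the assignment's reference solution: the toy data is made up, the feature list is cut to two words, and the rule "a rating of 4 or more is positive" is an assumption, since the text here does not say how the sentiment label is derived from the rating. The 80/20 split, learning rate, and number of steps are likewise arbitrary choices.

```python
import pandas as pd
import torch
import torch.nn as nn

torch.manual_seed(0)

words = ['great', 'bad']  # shortened feature list, for illustration only

# Made-up word counts and star ratings
toy = pd.DataFrame({
    'great':  [1, 2, 0, 0, 1, 0, 0, 0] * 5,
    'bad':    [0, 0, 1, 2, 0, 1, 0, 0] * 5,
    'rating': [5, 5, 1, 2, 4, 1, 5, 2] * 5,
})

# Assumption: ratings of 4 or more count as positive sentiment
toy['sentiment'] = (toy['rating'] >= 4).astype('float32')

X = torch.tensor(toy[words].to_numpy(), dtype=torch.float32)
y = torch.tensor(toy['sentiment'].to_numpy(), dtype=torch.float32).unsqueeze(1)

# Shuffle once, then hold out 20% of the rows for testing
perm = torch.randperm(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Logistic regression: one linear layer, with the sigmoid folded into the loss
model = nn.Linear(len(words), 1)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

# Full-batch gradient descent on the training rows
for _ in range(300):
    optimizer.zero_grad()
    loss = loss_fn(model(X[train_idx]), y[train_idx])
    loss.backward()
    optimizer.step()

# A logit above 0 corresponds to a predicted probability above 0.5
with torch.no_grad():
    predictions = (model(X[test_idx]) > 0).float()
    accuracy = (predictions == y[test_idx]).float().mean().item()

# Pair each word with its learned weight
weights = dict(zip(words, model.weight.detach().squeeze(0).tolist()))
print(weights, accuracy)
```

In this toy data every review containing "great" is positive and every review containing "bad" is negative, so the learned weight for "great" ends up positive and the one for "bad" negative.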

#### Quiz Question 2: 
Using this approach, sort the learned coefficients according to the 'value' column using .sort() (.sort_values() on a pandas DataFrame). Out of the 11 words, which one got the most positive weight? Which one got the most negative weight? Do these values make sense?
*Save the results to answer the quiz at the end.*
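
Assuming the learned weights have been collected into a pandas DataFrame with 'name' and 'value' columns, the sorting step might look like the sketch below. The column name 'name' and the numbers are placeholders chosen for the example; they are not results from this dataset.

```python
import pandas as pd

# Placeholder weights, NOT results from this dataset
coefficients = pd.DataFrame({
    'name':  ['great', 'bad', 'wow'],
    'value': [0.8, -1.1, 0.1],
})

# Largest weight first; reset the index so positions match the ranking
ranked = coefficients.sort_values('value', ascending=False).reset_index(drop=True)

most_positive = ranked.loc[0, 'name']
most_negative = ranked.loc[len(ranked) - 1, 'name']
print(most_positive, most_negative)  # great bad
```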