# SI 330 - Homework 2: Analysis of Amazon Fine Foods Reviews (including pet foods!)
## Top-level goal:
To create a pandas DataFrame that contains adjectives and their counts for positive and negative reviews from the Amazon Fine Foods Reviews dataset that can be used for text exploration.
<br><br>
For this homework assignment, we suggest that you follow the questions in order, as each builds on the results of the previous one(s). Also note that because you’ll be using random samples of the dataset, everyone’s results will be slightly different (for that matter, yours will be different if you re-run your code).

In [1]:
import numpy as np
import pandas as pd
import spacy

In [2]:
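# Note that Windows users may need to use Evan Hogan's solution for specifying the location of the 'en' dictionary
# The 'en' shortcut works in spaCy 2.x; spaCy 3 removed shortcuts, so load the full model name (e.g. 'en_core_web_sm') there
nlp = spacy.load('en')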

## Q1 (2 points): read the data
From https://www.kaggle.com/snap/amazon-fine-food-reviews/home


In [3]:
file = pd.read_csv('data/Reviews.csv')
file.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


## Q2 (4 points): split the reviews into positive (score = 4 or 5) and negative (score = 1 or 2)

In [4]:
f2 = file.copy()
negative = f2[f2['Score'] <= 2]
positive = f2[f2['Score'] >= 4]

## Q3 (4 points): take a random sample of 500 of each of the positive and negative reviews
Note: this is largely to overcome limitations of spaCy running on individual laptop machines.
The samples will be used for all subsequent analyses.
<br>Hint: look up the pandas method for taking a random sample


In [5]:
pos_samp = positive.sample(500)
neg_samp = negative.sample(500)
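
Because `sample` draws different rows on every call, the numbers further down change between runs. If repeatable output is wanted while debugging, one option is the `random_state` parameter of `DataFrame.sample`. The sketch below uses a made-up toy DataFrame and an arbitrary seed value; neither comes from the assignment.

```python
import pandas as pd

# Toy stand-in for the positive reviews (hypothetical data)
toy = pd.DataFrame({'Score': [5, 4, 5, 4, 5, 4], 'Text': list('abcdef')})

# The same seed returns the same rows on every run
first = toy.sample(n=3, random_state=330)
second = toy.sample(n=3, random_state=330)
print(first.equals(second))  # True
```

Leaving `random_state` out, as the cell above does, keeps the sampling random on each run.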

## Q4 (4 points): strip all HTML tags from the Text column
Hint: 
* look up how to display full (non-truncated) dataframe information to figure out what HTML tags are present in the text column
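
Besides widening the display (for example with the `display.max_colwidth` option), another way to see which tags occur is to collect them with a regular expression. This is only a sketch: the `sample_texts` list is invented, and with the real data the `Text` column of the sample would take its place.

```python
import re
from collections import Counter

# Hypothetical review snippets standing in for the Text column
sample_texts = [
    'Great taste.<br /><br />Will buy again.',
    'See <a href="http://example.com">this page</a> for details.',
    'No markup here.',
]

# Anything between "<" and the next ">" is treated as a tag
tag_pattern = re.compile(r'<[^>]+>')
tag_counts = Counter(
    tag for text in sample_texts for tag in tag_pattern.findall(text)
)
print(tag_counts)
```

The pattern is deliberately simple and would also match a literal `<` followed later by `>` in ordinary prose, so the resulting list is worth a quick manual check.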

In [6]:
import re

def html_strip(txt):
    # Remove every HTML tag (not only <br />) and leave a space so adjacent words stay separate
    return re.sub(r'<[^>]+>', ' ', txt)


In [7]:
pos_samp.Text = pos_samp.Text.apply(html_strip)
neg_samp.Text = neg_samp.Text.apply(html_strip)
pos_samp.Text.iloc[7]

"I don't care for bold coffee so I was a little nervous getting my new coffee maker, I absolutely love this flavor! It is exactly what I was looking for."

## Q5 (4 points): create a "text blob" that combines all the Text into a single string for the positive reviews.  Do the same for the negative reviews
Note: this sets us up to use spaCy efficiently, rather than calling the spaCy parser for each row
Hint: You can treat a column as a list and join all the values
https://stackoverflow.com/questions/41400381/python-pandas-concatenate-a-series-of-strings-into-one-string

In [8]:
# Join with a space so the last word of one review does not fuse with the first word of the next
pos_blob = ' '.join(pos_samp.Text.tolist())
neg_blob = ' '.join(neg_samp.Text.tolist())
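
One detail worth checking when building the blob is the separator passed to `join`, since it decides whether words at review boundaries stay separate tokens. A small illustration with two invented review strings:

```python
reviews = ['Great taffy', 'Not as advertised']  # hypothetical mini-sample

fused = ''.join(reviews)    # no separator: boundary words run together
spaced = ' '.join(reviews)  # a space keeps them apart

print(fused)   # Great taffyNot as advertised
print(spaced)  # Great taffy Not as advertised
```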

## Q6 (8 points): create a list of adjectives that are not stop words for the positive reviews.  Repeat this for the negative reviews.

In [37]:
from spacy.lang.en.stop_words import STOP_WORDS

temp = nlp(pos_blob)
pos_adj = []
for token in temp:
    if token.pos_ == 'ADJ':
        pos_adj.append(token)
pos_nostop = []
for word in pos_adj:
    if word not in STOP_WORDS:
        pos_nostop.append(word)
pos_nostop[:5]


[hard, which, hard, soft, which]

In [38]:
temp2 = nlp(neg_blob)
neg_adj = []
for token in temp2:
    if token.pos_ == 'ADJ':
        neg_adj.append(token)
neg_nostop = []
for word in neg_adj:
    if word not in STOP_WORDS:
        neg_nostop.append(word)
neg_nostop[:5]

[my, old, healthy, that, last]
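
Both outputs above still contain stop words such as `which`, `my`, and `that`. A likely reason is that `word` is a spaCy `Token`, while `STOP_WORDS` is a set of strings, so the membership test does not match. The sketch below compares lowercased strings instead. To stay runnable without a language model it uses `SimpleNamespace` stand-ins and a tiny made-up stop list; with spaCy, `tokens` would be `nlp(pos_blob)` and `stop_words` would be `STOP_WORDS`. The function name is hypothetical.

```python
from types import SimpleNamespace

def adjectives_without_stop_words(tokens, stop_words):
    """Return lowercased adjective strings whose text is not a stop word."""
    result = []
    for token in tokens:
        text = token.text.lower()
        # Compare a string with a set of strings, not the token object itself
        if token.pos_ == 'ADJ' and text not in stop_words:
            result.append(text)
    return result

# Stand-in tokens (hypothetical) exposing the two attributes used above
fake_doc = [
    SimpleNamespace(text='Which', pos_='ADJ'),
    SimpleNamespace(text='hard', pos_='ADJ'),
    SimpleNamespace(text='coffee', pos_='NOUN'),
    SimpleNamespace(text='soft', pos_='ADJ'),
]
fake_stop_words = {'which', 'my', 'that'}

filtered = adjectives_without_stop_words(fake_doc, fake_stop_words)
print(filtered)  # ['hard', 'soft']
```

spaCy tokens also expose an `is_stop` attribute, which could replace the explicit set lookup.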

## Q7 (8 points): create a DataFrame for each of positive and negative reviews that each contains two columns: the adjective and its count

Hint: a possible solution is using collections.Counter

In [44]:
from collections import Counter

# Lowercase the token text so that "Good" and "good" are counted together
pos_txt = [str(x).lower() for x in pos_nostop]
neg_txt = [str(y).lower() for y in neg_nostop]

# (adjective, count) pairs for each set of reviews
p = Counter(pos_txt).items()
n = Counter(neg_txt).items()


In [45]:
def df(wrd):
    l1 = []
    l2 = []
    for x in wrd:
        l1.append(x[0])
        l2.append(x[1])
    return pd.DataFrame({
    'Adjective': l1,
    'Count': l2
    })
    

In [50]:
pos_df = df(p)

In [51]:
neg_df = df(n)
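
As an alternative to filling two lists by hand, the pairs returned by `Counter.most_common()` can be passed straight to the `DataFrame` constructor. A minimal sketch with an invented word list (the names `words` and `counts_df` are not from the notebook):

```python
from collections import Counter
import pandas as pd

words = ['good', 'great', 'good', 'old', 'good']  # hypothetical adjectives

# most_common() yields (adjective, count) tuples sorted by descending count
counts_df = pd.DataFrame(Counter(words).most_common(),
                         columns=['Adjective', 'Count'])
print(counts_df)
```

This gives the same two columns as the helper above, already sorted by count.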

## Q8 (10 points): merge the resulting DataFrames into a single DataFrame to answer the following two questions:
1. How many different adjectives are used?
2. How many adjectives appear in both the positive and negative reviews?

Hint:
* you can either use set_index and then merge using left_index=True, right_index=True or you can skip the set_index step and use left_on='word', right_on='
* an outer join can be used to answer question 1, and an inner join can be used to answer question 2

In [61]:
q1 = pos_df.merge(neg_df, how = 'left', on = 'Adjective')
q1.shape

(828, 3)

In [68]:
q2 = pos_df.merge(neg_df, on = 'Adjective')
q2.shape

(440, 3)

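1. There are 828 different adjectives in this merge. Because it is a left join on `pos_df`, that is the number of distinct adjectives in the positive sample; an outer join would be needed to count adjectives across both samples.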
2. There are 440 adjectives that appear in both the positive and negative reviews. 
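
The hint suggests an outer join for question 1 and an inner join for question 2. Both counts can also be read from a single outer merge by using `indicator=True`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. The sketch below runs on two made-up frames (`pos_toy`, `neg_toy`); the suffix names are an arbitrary choice.

```python
import pandas as pd

pos_toy = pd.DataFrame({'Adjective': ['good', 'great', 'soft'], 'Count': [3, 2, 1]})
neg_toy = pd.DataFrame({'Adjective': ['good', 'stale'], 'Count': [2, 4]})

# The outer join keeps every adjective from either frame
merged = pos_toy.merge(neg_toy, how='outer', on='Adjective',
                       suffixes=('_pos', '_neg'), indicator=True)

n_distinct = len(merged)                        # adjectives used anywhere
n_shared = (merged['_merge'] == 'both').sum()   # adjectives in both frames
print(n_distinct, n_shared)  # 4 1
```

A left join, by contrast, returns one row per adjective in the left frame only, so its row count ignores adjectives that occur solely in the right frame.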

## Q9 (6 points): Using your resulting DataFrame, what are the five most common adjectives in (1) positive reviews, (2) negative reviews, and (3) overall

In [69]:
Counter(pos_txt).most_common(5)

[('my', 420), ('great', 165), ('good', 160), ('that', 110), ('other', 82)]

In [70]:
Counter(neg_txt).most_common(5)

[('my', 354), ('that', 140), ('good', 115), ('which', 104), ('your', 97)]

In [73]:
q2['total'] = q2['Count_x'] + q2['Count_y']
q2.sort_values(by = 'total', ascending = False).head(5)

Unnamed: 0,Adjective,Count_x,Count_y,total
11,my,420,354,774
3,good,160,115,275
46,that,110,140,250
9,great,165,49,214
1,which,71,104,175


1. The most common words in the positive reviews were: my, great, good, that, other.
2. The most common words in the negative reviews were: my, that, good, which, your.
3. The most common words used overall (among adjectives that appear in both samples, since `q2` is an inner join) were: my, good, that, great, which.
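
Since `q2` comes from an inner join, an adjective that occurs in only one sample cannot appear in the overall ranking. One way to include every adjective is to add the two `Counter` objects, which sums counts key by key. A sketch on invented word lists (`pos_words` and `neg_words` stand in for `pos_txt` and `neg_txt`):

```python
from collections import Counter

pos_words = ['good', 'great', 'good', 'soft']  # hypothetical
neg_words = ['good', 'stale', 'stale']         # hypothetical

# Adding Counters sums the counts of shared keys and keeps the rest
overall = Counter(pos_words) + Counter(neg_words)
print(overall.most_common(2))  # [('good', 3), ('stale', 2)]
```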