# SI 330 - Homework 2: Analysis of Amazon Fine Foods Reviews (including pet foods!)
## Top-level goal:
To create a pandas DataFrame that contains adjectives and their counts for positive and negative reviews from the Amazon Fine Foods Reviews dataset that can be used for text exploration.
<br><br>
For this homework assignment, we suggest that you follow the questions in order, as each one builds on the results of the previous ones. Also note that because you’ll be using random samples of the dataset, everyone’s results will be slightly different (for that matter, yours will be different each time you re-run your code).

In [1]:
import numpy as np
import pandas as pd
import spacy

In [2]:
# Note that Windows users may need to use Evan Hogan's solution for specifying the location of the 'en' dictionary
nlp = spacy.load(r'C:\Users\jodiy\Anaconda3\lib\site-packages\en_core_web_sm\en_core_web_sm-2.0.0')

## Q1 (2 points): read the data
From https://www.kaggle.com/snap/amazon-fine-food-reviews/home


In [3]:
import pandas as pd
reviews = pd.read_csv('data/Reviews.csv')

## Q2 (4 points): split the reviews into positive (score = 4 or 5) and negative (score = 1 or 2)

In [4]:
positive_review = reviews[reviews['Score']>3]

In [5]:
negative_review = reviews[reviews['Score']<3]

## Q3 (4 points): take a random sample of 500 of each of the positive and negative reviews
Note: this is largely to overcome limitations of spaCy running on individual laptop machines.
<br>The samples will be used for all subsequent analyses.
<br>Hint: look up the pandas method for taking a random sample
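
As a hedged illustration of the hint, the sketch below uses `DataFrame.sample` on a small made-up table (`toy_reviews` is a stand-in, not the real dataset). Passing `random_state` is optional and is shown only as one way to make a sample repeatable; the seed value 330 is an arbitrary assumption.

```python
import pandas as pd

# Toy stand-in for the reviews table; the real data has many more rows and columns.
toy_reviews = pd.DataFrame({
    "Score": [5, 4, 5, 4, 5, 4],
    "Text": ["a", "b", "c", "d", "e", "f"],
})

# Without random_state, each call may return different rows.
# With a fixed random_state (330 is arbitrary), the same rows come back every time.
sample_a = toy_reviews.sample(n=3, random_state=330)
sample_b = toy_reviews.sample(n=3, random_state=330)

print(sample_a.equals(sample_b))
```

Leaving `random_state` out keeps the behaviour described at the top of the assignment, where results change on every run.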


In [6]:
pos_random = positive_review.sample(n=500)

In [7]:
neg_random = negative_review.sample(n=500)

## Q4 (4 points): strip all HTML tags from the Text column
Hint: 
* look up how to display full (non-truncated) dataframe information to figure out what HTML tags are present in the text column
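
One possible way to follow this hint is sketched below on a made-up `toy` DataFrame (the rows are invented for illustration). It assumes pandas 1.0 or later, where `display.max_colwidth` accepts `None`, and it uses a simple regular expression to list tag names, which is a rough inspection aid rather than a real HTML parser.

```python
import re
import pandas as pd

# Made-up rows standing in for the Text column of the reviews.
toy = pd.DataFrame({"Text": [
    "Great snack!<br />My dog loves it.",
    'See <a href="http://example.com">this link</a> for details.',
    "No markup here.",
]})

# Temporarily show full cell contents instead of truncating them with "...".
with pd.option_context("display.max_colwidth", None):
    print(toy["Text"])

# Collect the distinct tag names that occur anywhere in the column.
tag_pattern = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
tag_names = sorted(
    {name.lower() for text in toy["Text"] for name in tag_pattern.findall(text)}
)
print(tag_names)  # ['a', 'br']
```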

In [8]:
from bs4 import BeautifulSoup

# Pass the parser explicitly; BeautifulSoup warns when it has to guess one.
# "lxml" must be installed; "html.parser" is the built-in alternative.
lst_pos = list()
for x in pos_random['Text'].values:
    soup = BeautifulSoup(x, "lxml")
    lst_pos.append(soup.text)


In [9]:
# Same tag stripping for the negative sample, again with an explicit parser
lst_neg = list()
for x in neg_random['Text'].values:
    soup = BeautifulSoup(x, "lxml")
    lst_neg.append(soup.text)


In [10]:
positive_text=' '.join(lst_pos)

In [11]:
negative_text=' '.join(lst_neg)

## Q6 (8 points): create a list of adjectives that are not stop words for the positive reviews.  Repeat this for the negative reviews.

In [12]:
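from spacy.lang.en.stop_words import STOP_WORDS

# Drop stop words before tagging.
# Note: STOP_WORDS is all lowercase and this check is case sensitive,
# so capitalized stop words (e.g. 'My') are not removed here.
pos_words = positive_text.split()
pos_nostop_lst = list()
for word in pos_words:
    if word not in STOP_WORDS:
        pos_nostop_lst.append(word)

pos_nostop = ' '.join(pos_nostop_lst)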
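
The stop-word check above compares each word exactly as written, so capitalized stop words pass through, which is one plausible reason a word like "my" can still show up later. A minimal sketch of the difference follows; `toy_stop_words` is a tiny made-up set used in place of spaCy's `STOP_WORDS` so the example runs without a language model.

```python
# Small stand-in for spaCy's STOP_WORDS (an assumption to keep the sketch self-contained).
toy_stop_words = {"my", "the", "is", "this"}

words = "My dog loves this food . The taste is great".split()

# Case-sensitive check: "My" and "The" are not in the lowercase set, so they stay.
case_sensitive = [w for w in words if w not in toy_stop_words]

# Lowercasing each word before the lookup removes them as well.
case_insensitive = [w for w in words if w.lower() not in toy_stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase first is a design choice: it filters more words, but it also changes which adjectives end up in the counts.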

In [17]:
doc_pos = nlp(pos_nostop)
# Keep the lowercased text of every token that spaCy tags as an adjective
pos_adj = [token.text.lower() for token in doc_pos if token.pos_ == 'ADJ']
# pos_adj

In [14]:
# Same stop-word filtering for the negative reviews (also case sensitive)
neg_words = negative_text.split()
neg_nostop_lst = list()
for word in neg_words:
    if word not in STOP_WORDS:
        neg_nostop_lst.append(word)

neg_nostop = ' '.join(neg_nostop_lst)

In [18]:
doc_neg = nlp(neg_nostop)
# Keep the lowercased text of every token that spaCy tags as an adjective
neg_adj = [token.text.lower() for token in doc_neg if token.pos_ == 'ADJ']
# neg_adj

## Q7 (8 points): create a DataFrame for each of positive and negative reviews that each contains two columns: the adjective and its count

Hint: a possible solution is using collections.Counter

In [19]:
from collections import Counter
positive_list = Counter(pos_adj).most_common()
pos_dataframe = pd.DataFrame(positive_list, columns=['adjective', 'count'])
pos_dataframe.head(5)

|  | adjective | count |
|---|---|---|
| 0 | great | 178 |
| 1 | good | 161 |
| 2 | my | 74 |
| 3 | best | 68 |
| 4 | little | 66 |


In [20]:
negative_list = Counter(neg_adj).most_common()
neg_dataframe = pd.DataFrame(negative_list, columns=['adjective', 'count'])
neg_dataframe.head(5)

|  | adjective | count |
|---|---|---|
| 0 | good | 104 |
| 1 | my | 78 |
| 2 | little | 65 |
| 3 | bad | 58 |
| 4 | great | 54 |


## Q8 (10 points): merge the resulting DataFrames into a single DataFrame to answer the following two questions:
1. How many different adjectives are used?
2. How many adjectives appear in both the positive and negative reviews?

Hint:
* you can either use set_index and then merge using left_index=True, right_index=True or you can skip the set_index step and use left_on='word', right_on='
* an outer join can be used to answer question 1, and an inner join can be used to answer question 2
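
The difference between the two joins can be sketched on made-up counts. The `pos` and `neg` frames and the `_pos`/`_neg` suffixes below are illustrative assumptions, not the homework data; `how`, `suffixes`, and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Made-up adjective counts standing in for the positive and negative DataFrames.
pos = pd.DataFrame({"adjective": ["great", "good", "tasty"], "count": [5, 4, 2]})
neg = pd.DataFrame({"adjective": ["good", "bad", "stale", "great"], "count": [3, 6, 1, 1]})

# Outer join: one row per adjective that appears in either frame.
outer = pos.merge(neg, on="adjective", how="outer",
                  suffixes=("_pos", "_neg"), indicator=True)
n_distinct = len(outer)

# Inner join: only adjectives that appear in both frames.
inner = pos.merge(neg, on="adjective", how="inner", suffixes=("_pos", "_neg"))
n_shared = len(inner)

# The indicator column from the outer join gives the same shared count.
n_shared_from_outer = int((outer["_merge"] == "both").sum())

print(n_distinct, n_shared, n_shared_from_outer)  # 5 2 2
```

In the outer join, adjectives missing from one side get `NaN` in that side's count column, which is worth keeping in mind before adding the columns together.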

In [21]:
# merge() defaults to an inner join, so only adjectives found in both samples are kept
merge_dataframe = pos_dataframe.merge(neg_dataframe, on='adjective')
merge_dataframe.head(5)

|  | adjective | count_x | count_y |
|---|---|---|---|
| 0 | great | 178 | 54 |
| 1 | good | 161 | 104 |
| 2 | my | 74 | 78 |
| 3 | best | 68 | 24 |
| 4 | little | 66 | 65 |


In [22]:
merge_dataframe.shape

(398, 3)

In [23]:
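# total number of adjective occurrences across both samples
# (this counts repeats; it is not the number of distinct adjectives)
total_adj = len(pos_adj) + len(neg_adj)
total_adj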

6202

## Q9 (6 points): Using your resulting DataFrame, what are the five most common adjectives in (1) positive reviews, (2) negative reviews, and (3) overall?

In [24]:
# count_x holds the positive-review counts and count_y the negative-review counts
merge_dataframe['count'] = merge_dataframe['count_x'] + merge_dataframe['count_y']

In [25]:
merge_dataframe.sort_values('count_x',ascending=False).head(5)

|  | adjective | count_x | count_y | count |
|---|---|---|---|---|
| 0 | great | 178 | 54 | 232 |
| 1 | good | 161 | 104 | 265 |
| 2 | my | 74 | 78 | 152 |
| 3 | best | 68 | 24 | 92 |
| 4 | little | 66 | 65 | 131 |


In [26]:
merge_dataframe.sort_values('count_y',ascending=False).head(5)

|  | adjective | count_x | count_y | count |
|---|---|---|---|---|
| 1 | good | 161 | 104 | 265 |
| 2 | my | 74 | 78 | 152 |
| 4 | little | 66 | 65 | 131 |
| 34 | bad | 16 | 58 | 74 |
| 0 | great | 178 | 54 | 232 |


In [27]:
merge_dataframe.sort_values('count',ascending=False).head(5)

|  | adjective | count_x | count_y | count |
|---|---|---|---|---|
| 1 | good | 161 | 104 | 265 |
| 0 | great | 178 | 54 | 232 |
| 2 | my | 74 | 78 | 152 |
| 4 | little | 66 | 65 | 131 |
| 3 | best | 68 | 24 | 92 |
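
The rankings above come from an inner join, so an adjective that occurs in only one set of reviews cannot appear in them. A hedged sketch of an alternative on made-up counts is shown below: an outer join with missing counts filled with 0, followed by `nlargest`. The frames and the column names `count_pos`, `count_neg`, and `count_total` are illustrative assumptions.

```python
import pandas as pd

# Made-up counts; "bad" appears only in the negative frame.
pos = pd.DataFrame({"adjective": ["great", "good", "tasty"], "count": [5, 4, 2]})
neg = pd.DataFrame({"adjective": ["good", "bad", "stale"], "count": [3, 6, 1]})

# Outer join keeps every adjective; counts missing on one side become NaN.
merged = pos.merge(neg, on="adjective", how="outer", suffixes=("_pos", "_neg"))

# Treat a missing count as zero occurrences and restore integer dtype.
count_cols = ["count_pos", "count_neg"]
merged[count_cols] = merged[count_cols].fillna(0).astype(int)

merged["count_total"] = merged["count_pos"] + merged["count_neg"]

# Three most common adjectives overall.
top_overall = merged.nlargest(3, "count_total")["adjective"].tolist()
print(top_overall)  # ['good', 'bad', 'great']
```

With an inner join on the same toy data, "bad" would be dropped before ranking even though it has the second-highest total.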
