In [None]:
import numpy as np
import pandas as pd
import json
import nltk

import matplotlib.pyplot as plt
import requests
import urllib.request

%matplotlib inline

In [None]:
#data sourced from http://jmcauley.ucsd.edu/data/amazon/
#let's examine how a json file looks!

with urllib.request.urlopen('https://graderdata.s3.amazonaws.com/reviews_Pet_Supplies_5.json') as f:
    data = f.readlines()
    data = [json.loads(line) for line in data]
    
data

In [None]:
df = pd.read_json('https://graderdata.s3.amazonaws.com/reviews_Pet_Supplies_5.json', lines=True)
#lines=True tells pandas the file is in JSON Lines format: one JSON object per line
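
- To make the "one JSON object per line" layout concrete, here is a minimal sketch that uses only the standard library. The two records are made up for illustration and are not taken from the dataset.

```python
import json

# Two hypothetical records in JSON Lines form: one complete JSON object per line.
raw = (
    '{"reviewerName": "A. Reader", "overall": 5, "reviewText": "My dog loves it"}\n'
    '{"reviewerName": "B. Reader", "overall": 2, "reviewText": "Broke in a week"}\n'
)

# json.loads(raw) would fail here, because the text as a whole is not one JSON document.
# Parsing line by line is what the readlines() loop above does, and what lines=True asks pandas to do.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(len(records))
print(records[0]["overall"])
```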

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.sample(5)

In [None]:
#First, there seem to be some "default tags"; let's see how many of those exist in our data set

df[df['reviewerName'].str.contains('Consumer')]

#notice one commonly encountered issue: reviewerName has missing values, so the mask contains NaN and cannot be used for filtering

In [None]:
#to fix it, treat missing names as non-matches with na=False

consumer_df = df[df['reviewerName'].str.contains('Consumer', na=False)]
consumer_df
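
- A small sketch of what `na=False` does, on a made-up Series rather than the real data. The names below are hypothetical; the `None` stands in for a review with no reviewer name. Without `na=False`, the missing entry typically carries through into the mask as a missing value, which boolean indexing rejects.

```python
import pandas as pd

# Hypothetical reviewer names; None stands in for a missing name.
names = pd.Series(["Amazon Customer", None, "Pet Consumer"])

# na=False treats missing names as "no match", so the mask is strictly True/False.
mask = names.str.contains("Consumer", na=False)

print(mask.tolist())
print(names[mask].tolist())
```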

- Review data is generally skewed and may have other issues; let's check whether that's true here.

In [None]:
df.overall.astype('str').value_counts().plot(kind='bar')

- Let's evaluate which words are most frequent in the 5-star reviews, since 5 is by far our most common rating. We'll combine a few steps at once here.

- First we'll create a mask for our data frame. Then we'll lowercase all the review text and join it into a single long string to form what's known as a "corpus".

In [None]:
best_rev_corpus = ' '.join(df[df['overall']==5]['reviewText']).lower()

- Now let's introduce some tools that will assist us with counting our most frequent words (tokens).

In [None]:
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
stop_filters = stopwords.words('english') + list(string.punctuation)

In [None]:
best_rev_tokens = [lemmatizer.lemmatize(tokens) for tokens in word_tokenize(best_rev_corpus) if tokens not in stop_filters]

fdist = FreqDist(best_rev_tokens)
fdist.most_common(50)
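
- The filter-then-count pattern above can be sketched without any NLTK downloads. This is an illustration under stated assumptions: the regex tokenizer and the tiny stop list are stand-ins for `word_tokenize` and `stopwords.words('english')`, the reviews are made up, and `Counter` plays the role of `FreqDist` (which, to my knowledge, is itself built on `Counter`). Lemmatization is left out.

```python
import re
import string
from collections import Counter

# Assumption: a tiny hand-written stop list standing in for stopwords.words('english').
stop_filters = {"the", "and", "it", "my", "is", "this", "a"} | set(string.punctuation)

# Hypothetical review snippets, joined and lowercased into one corpus string.
reviews = ["My dog loves this toy.", "Great toy, and the cat loves it!"]
corpus = " ".join(reviews).lower()

# Assumption: a simple regex tokenizer standing in for nltk's word_tokenize.
tokens = re.findall(r"[a-z']+|[^\w\s]", corpus)

# Keep only tokens that are neither stop words nor punctuation, then count them.
filtered = [tok for tok in tokens if tok not in stop_filters]
counts = Counter(filtered)

print(counts.most_common(2))
```

- Storing the stop words in a set rather than a list keeps each `not in` check fast, which can matter on a corpus of this size.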

**Next steps:**

- We see here that there are still some tokens that are not very helpful. While you may be interested in how often "dog" and "cat" appear, it's unlikely that those tokens are what distinguishes the five-star reviews. Since this will be an iterative process, it makes sense to create a function that does the filtering for us.

In [None]:
## note this uses word_tokenize from nltk

def extra_filter(corpus, stop_tokens):
    '''
    corpus: string format of text data
    stop_tokens: list of tokens you wish to add to stopwords filter
    '''
    from nltk.tokenize import word_tokenize
    
    stop_filters = stopwords.words('english') + list(string.punctuation) + stop_tokens
    filtered_tokens = [lemmatizer.lemmatize(tokens) for tokens in word_tokenize(corpus) 
                       if tokens not in stop_filters]
    return filtered_tokens

In [None]:
extra_stopwords = ["n't", "'s", 'dog', 'cat', '...' ,"''", "'m", '``', '--', 'pet']

best_rev_new_toks = extra_filter(corpus= best_rev_corpus, stop_tokens=extra_stopwords)

In [None]:
fdist = FreqDist(best_rev_new_toks)
fdist.most_common(50)

- The results are better, but we may still be missing some of the context of what people are talking about. So far we have examined tokens strictly in isolation; what if we could capture some of the context around each token?

- One way to do so is to extract bigrams (pairs of adjacent tokens) instead of individual words (unigrams).

In [None]:
best_rev_bigram = list(nltk.bigrams(best_rev_tokens))

In [None]:
fdist_bi = FreqDist(best_rev_bigram)
fdist_bi.most_common(50)
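
- To see what a bigram is on a small scale, here is a minimal sketch on made-up tokens. Pairing each token with its successor via `zip` yields the same kind of adjacent pairs that `nltk.bigrams` produces.

```python
from collections import Counter

# Hypothetical, already-filtered tokens.
tokens = ["dog", "love", "toy", "dog", "love", "treat"]

# Pair each token with the one that follows it.
bigrams = list(zip(tokens, tokens[1:]))

print(Counter(bigrams).most_common(1))
```

- One caveat: because stop words and punctuation are removed before pairing, two tokens in a bigram were not necessarily adjacent in the original review text.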

- Just from examining this list, we can already see some themes rise toward the top: products that the animal seems to enjoy account for a good share of the five-star reviews. Other traits, such as being easy for the owner to use, also factor in.

- We could do additional iterations to filter out more "obvious" factors, but we will leave that direction for future work.

- Instead, let's examine a different library that can replicate our desired effect of examining the most frequent bigrams.

- I.e., different tools for a similar effect.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cnt_vec = CountVectorizer(ngram_range=(2,2), stop_words='english',max_features=50)

In [None]:
cnt_vec.fit_transform(df[df['overall']==5]['reviewText'])

In [None]:
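#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
cnt_vec.get_feature_names_out()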

- Now let's tie this together with other topics we've learned from Data Analysis. For example, suppose we're interested in the top 50 terms and their frequencies for each of the review ratings.

In [None]:
#First we create the groups into groupby objects

rev_groups = df.groupby('overall')

In [None]:
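#next let's build a function that we can apply as an aggregation to our groups

def freq_analysis(txt, stop_tokens=extra_stopwords, num=50):
    '''
    txt: iterable of review strings (e.g. one group's reviewText column)
    stop_tokens: list of tokens to add to the stopwords filter
    num: number of most frequent tokens to return
    '''
    txt = ' '.join(txt).lower()
    stop_filters = stopwords.words('english') + list(string.punctuation) + stop_tokens
    filtered_tokens = [tokens for tokens in word_tokenize(txt) 
                       if tokens not in stop_filters]
    
    # count the filtered tokens and return the num most frequent (token, count) pairs
    fdist = FreqDist(filtered_tokens)
    return fdist.most_common(num)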


In [None]:
rev_top50 = rev_groups.agg({'reviewText' : freq_analysis})

In [None]:
pd.set_option('display.max_colwidth', None)

rev_top50
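
- The group-then-count idea can also be sketched with the standard library alone. The (rating, text) pairs below are made up and stand in for the `overall` and `reviewText` columns. Whitespace splitting is a simplification of `word_tokenize`, and no stop words are removed.

```python
from collections import Counter, defaultdict

# Hypothetical (rating, review text) pairs.
rows = [
    (5, "great toy great price"),
    (5, "great food"),
    (1, "food made him sick"),
    (1, "sick after this food"),
]

# One Counter per rating, updated with the tokens of each review in that group.
counts_by_rating = defaultdict(Counter)
for rating, text in rows:
    counts_by_rating[rating].update(text.lower().split())

# Keep the two most frequent tokens for each rating.
top_terms = {rating: counts.most_common(2) for rating, counts in counts_by_rating.items()}
print(top_terms)
```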

- While we still need to dig deeper, even the review tokens themselves suggest a trend where dog products are reviewed worse than cat products. A fair number of negative reviews seem related to dog food or the aftereffects of said food products.

- Next as a pet owner, I personally will be interested in certain aspects of a product, namely safety as a feature.

- Elsewhere in the program there is a lecture titled "Applications with NLP" where we examine latent "topics" that can provide some of this information, but we'll employ a more basic method here.

- By looking for the word "allergies", we can see which words appear in context with that term within our reviews. Those words may be worth knowing about as potential allergens for either our pet or ourselves.

- To keep the processing time of certain later tasks manageable, we'll use just the five-star reviews for this demonstration.

In [None]:
# Create a nltk Text object

best_rated_text = nltk.Text(best_rev_new_toks)

In [None]:
best_rated_text.concordance('allergies')

- Yikes some nasty stuff!
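
- For intuition about what a concordance view does, here is a simplified keyword-in-context sketch on made-up tokens. `keyword_in_context` is a hypothetical helper, not an NLTK function. It uses a fixed window of tokens on each side, whereas NLTK's `concordance` lays out its lines by character width.

```python
def keyword_in_context(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens on either side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [keyword] + right))
    return hits

# Hypothetical filtered tokens.
tokens = ["food", "helped", "skin", "allergies", "within", "week",
          "vet", "suspected", "grain", "allergies"]

for line in keyword_in_context(tokens, "allergies", width=2):
    print(line)
```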

In [None]:
#We can examine contexts of two terms if needed

best_rated_text.common_contexts(['allergies', 'cats'])

- From here we're able to find some common themes among highly rated products for this "topic" of interest. The reviews that discuss these products often center on food allergies, so that may be something to look for when reading reviews of a product.

- At this stage, we've only scratched the surface of what we can examine in text with our standard data analytics tools. See how you might answer the following questions:
    - Which products appear to be the safest according to your analysis? (Hint: are there proxies of information you can use for this?)
    - Can you find durability ratings for certain toys?
    - Are there products that pets especially seem to like? Dislike?