# <p style="text-align: center;">Analytics for Unstructured Data</p>
## <p style="text-align: center;">Group Assignment #2</p>
### <p style="text-align: center;">Authors: Conoly Cravens, JT Flume, Connor Gilmore, Jessie Lee, Garrett Sooter</p>
### <p style="text-align: center;">10:30AM Section</p>

## Task B

### Import Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
import spacy
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
nlp=spacy.load('en_core_web_sm')

### Import Data

In [3]:
#beer_df = pd.read_csv('beeradvocate.csv')
beer_df = pd.read_csv('beer_review.csv')

In [4]:
beer_df.head()

Unnamed: 0.1,Unnamed: 0,brand,score,review
0,0,Kentucky Brunch Brand Stout,4.56,"Long time waiting to tick this one, and I hav..."
1,1,Kentucky Brunch Brand Stout,5.0,This review is for the 2019 batch. It was bot...
2,2,Kentucky Brunch Brand Stout,5.0,Supreme maple OD! Soooo easy drinking & w...
3,3,Kentucky Brunch Brand Stout,5.0,I have now had 4 different years of KBBS and ...
4,4,Kentucky Brunch Brand Stout,5.0,2020 Bottle.\n\nAbsolutely bonkers Maple Syru...


In [5]:
products = pd.DataFrame(beer_df[['brand', 'score', 'review']])
products.columns = ['product_name', 'user_rating', 'product_review']

In [6]:
products.to_csv('products.csv', index=False)

### Task B

In [7]:
beer_df['reviews'] = beer_df['review'].replace(r'\n',' ', regex=True) 
beer_df['reviews'] = beer_df['reviews'].replace('[0-9]+', '', regex=True)
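
To make the effect of the two `replace` calls concrete, here is a minimal sketch of the same cleaning applied to a single string with the standard `re` module. The helper name `clean_review` and the sample text are hypothetical, chosen only for illustration.

```python
import re

def clean_review(text):
    """Mirror the two cleaning steps above on one string (hypothetical helper)."""
    # Replace each newline with a space so the review becomes a single line.
    text = re.sub(r'\n', ' ', text)
    # Drop every run of digits (years, volumes, ABV values, and so on).
    return re.sub(r'[0-9]+', '', text)

sample = "2020 Bottle.\n\nPours dark, 12 oz."
cleaned = clean_review(sample)
print(repr(cleaned))
```

The cleaning leaves doubled spaces behind where newlines or digits used to be. That should be harmless here, since the counting loop later skips tokens that are empty or contain a space.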

In [8]:
beer_df['reviews'] = beer_df['reviews'].map(lambda x:nlp(x))

In [9]:
# STEP 1: create frequency dictionary for all words, skipping stopwords, punctuation, and whitespace tokens
voc=dict()
for row in range(len(beer_df)):
    text=beer_df['reviews'].iloc[row]
    for word in text:
        if word.is_stop:
            continue
        if word.is_punct:
            continue
        if word.text == '' or ' ' in word.text:
            continue
        if word.text.lower() in voc:
            voc[word.text.lower()]+=1
        else:
            voc[word.text.lower()]=1
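
The same counting logic can be written more compactly with `collections.Counter`. The sketch below is dependency-free so it can be run without the spaCy model: the `STOP_WORDS` set and the regex tokenizer are assumptions standing in for spaCy's `is_stop` and `is_punct` checks, and `count_words` is a hypothetical helper name.

```python
import re
from collections import Counter

# Hypothetical stand-in for spaCy's stop-word list, kept tiny for illustration.
STOP_WORDS = {"the", "a", "is", "and", "this", "of"}

def count_words(reviews):
    """Count lowercase non-stop-word tokens across all reviews."""
    counts = Counter()
    for review in reviews:
        # Keep runs of letters and apostrophes; punctuation and whitespace drop out.
        tokens = re.findall(r"[a-z']+", review.lower())
        counts.update(tok for tok in tokens if tok not in STOP_WORDS)
    return counts

reviews = ["The beer is dark and sweet.", "This beer: sweet, malty!"]
voc_sketch = count_words(reviews)
print(voc_sketch.most_common(2))
```

A `Counter` can be passed to `pd.DataFrame.from_dict(..., orient='index')` in the same way as the plain dictionary.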

In [10]:
# STEP 2: create dataframe from dictionary
word_count = pd.DataFrame.from_dict(voc, orient = 'index')
word_count.columns=['count']
word_sort=word_count.sort_values(by='count',ascending=False)

In [11]:
word_sort[:60]

Unnamed: 0,count
beer,4984
head,3819
taste,3162
chocolate,2904
dark,2830
like,2533
sweet,2495
coffee,2234
bourbon,2187
vanilla,2104


**Three Attributes:**

* Malty
* Fruity 
* Crisp

In [12]:
attributes = pd.DataFrame(['malty', 'fruity', 'crisp'])
attributes.columns = ['attributes']

In [13]:
attributes.to_csv('attributes.csv', index=False)

## Task C

### Import Libraries

In [14]:
import pandas as pd
import numpy as np

In [15]:
import math

In [16]:
import spacy
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
nlp=spacy.load('en_core_web_sm')

### Import Data

In [17]:
products_df = pd.read_csv('products.csv')

In [18]:
attributes = pd.read_csv('attributes.csv')

In [19]:
attributes_list = attributes['attributes'].to_list()

In [20]:
products_df['product_review'] = products_df['product_review'].replace(r'\n',' ', regex=True) 

### Cosine Similarity

In [21]:
def get_sim(review):
    review_tokens = review.split(' ')
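    # vec1: query vector (all three attributes wanted)
    # vec2: attribute-presence vector for this review
    vec1 = [1, 1, 1]
    vec2 = [0, 0, 0]
    for i in range(0, len(attributes_list)):
        # as written, only the first attribute is tested on each pass
        if attributes_list[0] in review_tokens:
            vec2[0] = 1
    # dot product of the two vectors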
    numerator = np.dot(vec1, vec2)
    denominator = (math.sqrt(3) * math.sqrt(3))
    return numerator / denominator 
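
As written, `get_sim` tests only `attributes_list[0]` and divides by a constant, so it returns 1/3 when the first attribute appears in the review and 0 otherwise. For comparison, here is a minimal sketch of a bag-of-words cosine similarity that checks every attribute and normalizes by both vector norms. The helper name `cosine_to_attributes` and the punctuation stripping are assumptions, not part of the original cell.

```python
import math

def cosine_to_attributes(review, attributes):
    """Cosine similarity between an all-ones query vector and the
    review's binary attribute-presence vector (hypothetical helper)."""
    # Lowercase and strip surrounding punctuation so "Malty," matches "malty".
    tokens = {tok.strip('.,!?;:()"\'').lower() for tok in review.split()}
    query = [1] * len(attributes)
    present = [1 if attr in tokens else 0 for attr in attributes]
    dot = sum(q * p for q, p in zip(query, present))
    norm_query = math.sqrt(sum(q * q for q in query))
    norm_present = math.sqrt(sum(p * p for p in present))
    # A review that mentions no attribute has a zero vector; score it 0.
    if norm_present == 0:
        return 0.0
    return dot / (norm_query * norm_present)

attrs = ['malty', 'fruity', 'crisp']
score_two = cosine_to_attributes("Malty, fruity and smooth.", attrs)
score_none = cosine_to_attributes("Tastes like coffee.", attrs)
score_all = cosine_to_attributes("crisp malty fruity", attrs)
print(round(score_two, 4), score_none, round(score_all, 4))
```

With this normalization a review that mentions all three attributes scores 1, and a review that mentions none scores 0.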

In [22]:
products_df['similarity_score'] = products_df['product_review'].map(lambda x:get_sim(x))

In [23]:
new_products_df = pd.DataFrame(products_df[['product_name', 'product_review', 'similarity_score']])

In [24]:
new_products_df.to_csv('products_w_sim.csv', index=False)

## Task D

### VADER Sentiment Analyzer

This code is from https://towardsdatascience.com/sentimental-analysis-using-vader-a3415fef7664.  

### Import Libraries

In [25]:
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [26]:
import pandas as pd
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\conol\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


### Import Data

In [27]:
product_df = pd.read_csv('products.csv')

### Prep Data

In [28]:
product_df['product_review'] = product_df['product_review'].map(lambda x: str(x))

### Get Sentiment

In [29]:
sid = SentimentIntensityAnalyzer()
product_df['review_scores'] = product_df['product_review'].apply(lambda review: sid.polarity_scores(review))
product_df['compound'] = product_df['review_scores'].apply(lambda score_dict: score_dict['compound'])
product_df.head()

Unnamed: 0,product_name,user_rating,product_review,review_scores,compound
0,Kentucky Brunch Brand Stout,4.56,"Long time waiting to tick this one, and I hav...","{'neg': 0.066, 'neu': 0.836, 'pos': 0.098, 'co...",0.1159
1,Kentucky Brunch Brand Stout,5.0,This review is for the 2019 batch. It was bot...,"{'neg': 0.029, 'neu': 0.829, 'pos': 0.142, 'co...",0.8316
2,Kentucky Brunch Brand Stout,5.0,Supreme maple OD! Soooo easy drinking & w...,"{'neg': 0.0, 'neu': 0.828, 'pos': 0.172, 'comp...",0.7955
3,Kentucky Brunch Brand Stout,5.0,I have now had 4 different years of KBBS and ...,"{'neg': 0.0, 'neu': 0.843, 'pos': 0.157, 'comp...",0.9022
4,Kentucky Brunch Brand Stout,5.0,2020 Bottle.\n\nAbsolutely bonkers Maple Syru...,"{'neg': 0.146, 'neu': 0.79, 'pos': 0.064, 'com...",-0.6114


### Aggregate Sentiment

In [30]:
sentiment_df = pd.DataFrame(product_df[['product_name', 'compound']])
averages_df = pd.DataFrame(sentiment_df.groupby('product_name').mean()).reset_index()
averages_df.columns = ['product_name', 'compound']
averages_df.sort_values('compound', ascending = False).head(10)

Unnamed: 0,product_name,compound
51,Cable Car Kriek,0.931096
50,Cable Car,0.904376
113,Genealogy Of Morals - Bourbon Barrel-Aged,0.90322
96,Expedition Stout - Bourbon Barrel-Aged,0.897408
223,Ten FIDY - Bourbon Barrel Aged,0.891664
98,Flora Plum,0.889096
176,Mother Of All Storms,0.884268
126,Hellaboozie (Bourbon Barrel Aged Dark Lord Imp...,0.882748
247,Zenne Y Frontera,0.877152
87,Double Shot,0.874096


In [31]:
averages_df.to_csv('avg_sentiment_by_beer.csv', index=False)

## Task E

### Import Libraries

In [32]:
import pandas as pd

### Import Data

In [33]:
avg_sent = pd.read_csv('avg_sentiment_by_beer.csv')

In [34]:
product_sim = pd.read_csv('products_w_sim.csv')

### Analyze

In [35]:
# numeric_only=True keeps the text column (product_review) out of the mean
avg_sim = pd.DataFrame(product_sim.groupby('product_name').mean(numeric_only=True)).reset_index()

In [36]:
product_scores = avg_sent.merge(avg_sim, 
          on='product_name')
product_scores.rename(columns={"compound": 'sent_score'}, inplace = True)

In [37]:
product_scores['evaluation_score'] = product_scores['sent_score'] + product_scores['similarity_score']

### Recommendation

In [38]:
top_beers = pd.DataFrame(product_scores.sort_values(by = 'evaluation_score', ascending = False))
bow_recs = list(top_beers.head(3)['product_name'])
top_beers.head(3)

Unnamed: 0,product_name,sent_score,similarity_score,evaluation_score
51,Cable Car Kriek,0.931096,0.013333,0.944429
176,Mother Of All Storms,0.884268,0.053333,0.937601
96,Expedition Stout - Bourbon Barrel-Aged,0.897408,0.013333,0.910741


**Our three recommendations based on Bag of Words:**
1. Cable Car Kriek
2. Mother Of All Storms
3. Expedition Stout - Bourbon Barrel-Aged
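
The ranking is simply the sum of the two averaged columns. As a quick arithmetic check, the sketch below recomputes the evaluation score from the three rows shown in the output table above; only those printed values are used.

```python
# (product, average sentiment, average similarity), copied from the table above
rows = [
    ("Cable Car Kriek", 0.931096, 0.013333),
    ("Mother Of All Storms", 0.884268, 0.053333),
    ("Expedition Stout - Bourbon Barrel-Aged", 0.897408, 0.013333),
]

# evaluation_score = sent_score + similarity_score, highest first
ranked = sorted(
    ((name, sent + sim) for name, sent, sim in rows),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.6f}")
```

Because the VADER compound score ranges from -1 to 1 while the averaged similarity values here are only a few hundredths, the sum is driven mostly by sentiment.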

## Task F

### Import Libraries

In [39]:
import pandas as pd
import numpy as np

import spacy
nlp = spacy.load('en_core_web_md')

### Import Data

In [40]:
reviews_df = pd.read_csv('beer_review.csv')

### Analyze Data

In [41]:
reviews_df['review_token'] = reviews_df['review'].map(lambda x:nlp(x))

In [42]:
def vectorize (nlpObject):
    return nlpObject.vector

reviews_df['vector'] = reviews_df['review_token'].map(lambda x:vectorize(x))

In [43]:
attributes = 'malty fruity crisp'
attributes_token = nlp(attributes)
attribute_vector = attributes_token.vector

In [44]:
def getCosineSimilarity(reviewVector, attribute_vector):
    num = np.dot(reviewVector, attribute_vector)
    den = 300
    return num / den
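
`getCosineSimilarity` divides the dot product by a constant 300 (the dimensionality of the `en_core_web_md` vectors) rather than by the product of the two vector norms, so it returns a scaled dot product. A minimal sketch of the normalized form follows, using plain lists so it runs without the model. The toy 3-dimensional vectors are made up, and `mean_vector` imitates how a document vector is formed as the average of its token vectors.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for zero vectors; treat as no similarity
    return dot / (norm_u * norm_v)

# Toy 3-dimensional "word vectors" (made-up values, not from any model).
review_vec = mean_vector([[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]])
attribute_vec = mean_vector([[0.0, 0.0, 1.0], [2.0, 0.0, 1.0]])
sim = cosine_similarity(review_vec, attribute_vec)
print(review_vec, attribute_vec, round(sim, 4))
```

Unlike a dot product divided by a constant, this value does not change when either vector is rescaled, so vector magnitude does not affect the ranking.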

In [45]:
reviews_df['similarity'] = reviews_df['vector'].map(lambda x:getCosineSimilarity(x, attribute_vector))

In [46]:
vector_recs = list(reviews_df.groupby(['brand'])['similarity'].mean().sort_values(ascending=False)[:3].index)
pd.DataFrame(reviews_df.groupby(['brand'])['similarity'].mean()).sort_values(by = 'similarity', ascending=False).head(3)

brand,similarity
Appervation,0.039329
All That Is And All That Ever Will Be,0.038649
Double Dry Hopped Congress Street,0.038565


**Our three recommendations based on Word Vectors:**
1. Appervation
2. All That Is And All That Ever Will Be
3. Double Dry Hopped Congress Street

### Analyze Recommendations
To compare bag of words with word vectors, we computed the share of reviews that mention each attribute for the beers each method recommended. As a reminder:

*Our three recommendations based on Bag of Words:*
1. Cable Car Kriek
2. Mother Of All Storms
3. Expedition Stout - Bourbon Barrel-Aged

*Our three recommendations based on Word Vectors:*
1. Appervation
2. All That Is And All That Ever Will Be
3. Double Dry Hopped Congress Street

In [47]:
def mentionCounts(review, attribute):
    count = 0
    for word in review:
        if word.text == attribute:
            count = 1
    return count

In [48]:
reviews_df['% of comments that mention malty'] = reviews_df['review_token'].map(lambda x:mentionCounts(x,'malty'))
reviews_df['% of comments that mention fruity'] = reviews_df['review_token'].map(lambda x:mentionCounts(x,'fruity'))
reviews_df['% of comments that mention crisp'] = reviews_df['review_token'].map(lambda x:mentionCounts(x,'crisp'))

In [49]:
percentage_counts = pd.DataFrame(reviews_df.groupby(['brand'])[['% of comments that mention malty','% of comments that mention fruity','% of comments that mention crisp']].mean()).reset_index()
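
Averaging a 0/1 indicator per brand gives the fraction of that brand's reviews that mention the attribute, so a value of 0.04 means 4% of the reviews. The sketch below shows the same idea without pandas or spaCy. The helper `mention_fraction`, the sample reviews, and the case-insensitive matching are assumptions for illustration; `mentionCounts` above matches the exact token text.

```python
from collections import defaultdict

def mention_fraction(reviews, attribute):
    """Map each brand to the share of its reviews containing `attribute`.

    `reviews` is a list of (brand, text) pairs; matching is case-insensitive
    on whitespace-split tokens with surrounding punctuation stripped.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for brand, text in reviews:
        tokens = {tok.strip('.,!?;:').lower() for tok in text.split()}
        totals[brand] += 1
        hits[brand] += 1 if attribute in tokens else 0
    return {brand: hits[brand] / totals[brand] for brand in totals}

# Made-up reviews for two hypothetical brands.
sample = [
    ("Brand A", "Fruity and bright."),
    ("Brand A", "Heavy roast, no fruit."),
    ("Brand B", "Very FRUITY, very crisp!"),
]
fractions = mention_fraction(sample, "fruity")
print(fractions)
```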

In [50]:
mask1=percentage_counts['brand'] == vector_recs[0]
mask2=percentage_counts['brand'] == vector_recs[1]
mask3=percentage_counts['brand'] == vector_recs[2]
print('Percent of Comments that Mention each Attribute for Recommendations from Word Vectors')
percentage_counts[mask1|mask2|mask3]

Percent of Comments that Mention each Attribute for Recommendations from Word Vectors


Unnamed: 0,brand,% of comments that mention malty,% of comments that mention fruity,% of comments that mention crisp
10,All That Is And All That Ever Will Be,0.04,0.08,0.0
14,Appervation,0.04,0.0,0.0
79,Double Dry Hopped Congress Street,0.04,0.12,0.0


Compared with the Bag of Words recommendations:

In [51]:
mask1=percentage_counts['brand'] == bow_recs[0]
mask2=percentage_counts['brand'] == bow_recs[1]
mask3=percentage_counts['brand'] == bow_recs[2]
print('Percent of Comments that Mention each Attribute for Recommendations from Bag of Words')
percentage_counts[mask1|mask2|mask3]

Percent of Comments that Mention each Attribute for Recommendations from Bag of Words


Unnamed: 0,brand,% of comments that mention malty,% of comments that mention fruity,% of comments that mention crisp
51,Cable Car Kriek,0.04,0.28,0.12
96,Expedition Stout - Bourbon Barrel-Aged,0.12,0.0,0.0
176,Mother Of All Storms,0.16,0.16,0.0


We see that bag of words actually recommends products whose reviews *literally* mention our three attributes more frequently. By that measure, we could say these recommendations are 'better'. This is partly because word vectors deem some of our attributes to be correlated; in some sense, all three may even be slightly correlated.

## Task F

In [52]:
top_scores = pd.DataFrame(reviews_df.groupby(['brand'])['score'].mean()).reset_index().sort_values(by='score', ascending = False)
top_scores_recs = list(top_scores.head(3)['brand'])
top_scores.head(3)

Unnamed: 0,brand,score
54,Chemtrailmix,4.7716
238,Vanilla Bean Assassin,4.74625
38,Blessed,4.7428


**Our three recommendations based on Score Ratings:**
1. Chemtrailmix
2. Vanilla Bean Assassin
3. Blessed

In [53]:
mask1=percentage_counts['brand'] == top_scores_recs[0]
mask2=percentage_counts['brand'] == top_scores_recs[1]
mask3=percentage_counts['brand'] == top_scores_recs[2]
print('Percent of Comments that Mention each Attribute for Recommendations on Score Rating')
percentage_counts[mask1|mask2|mask3]

Percent of Comments that Mention each Attribute for Recommendations on Score Rating


Unnamed: 0,brand,% of comments that mention malty,% of comments that mention fruity,% of comments that mention crisp
38,Blessed,0.0,0.0,0.0
54,Chemtrailmix,0.0,0.0,0.0
238,Vanilla Bean Assassin,0.0,0.0,0.0


As expected, the three top-rated beers have no reviews that mention the attributes the user requested. Ranking by score alone would not be a viable recommendation system, since users' preferences vary greatly. For instance, one user may want a fruity beer while another may want a complex beer. A better recommendation would take the specific user's preferences into account.