# Beer Sentiment Analysis

## **Task A**

We scrape 25 reviews for each of the top 250 beers as ranked by BeerAdvocate at the following URL:
https://www.beeradvocate.com/beer/top-rated/
We then save the reviews to a CSV file so that later tasks do not need to re-scrape the site.
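
The caching idea can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: `load_or_scrape` and the `scrape_fn` callable are hypothetical names, and only the standard library is used.

```python
import csv
import os


def load_or_scrape(csv_path, scrape_fn):
    """Return review rows from csv_path if it exists; otherwise scrape and save."""
    fieldnames = ["product_name", "product_review", "user_rating"]
    if os.path.exists(csv_path):
        # Cached copy found: read it back instead of hitting the website again.
        # Note that csv.DictReader returns every value as a string.
        with open(csv_path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    # scrape_fn is a hypothetical callable that returns a list of dicts
    # keyed by the three field names above.
    rows = scrape_fn()
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

With this pattern, the slow Selenium loop only runs when `beer_reviews.csv` is missing; deleting the file forces a fresh scrape.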

In [None]:
!pip install selenium
!apt-get -q update
!apt install -yq chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin

import pandas as pd
from selenium import webdriver

# Run Chrome headless; the extra flags are needed inside a container such as Colab
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# chromedriver was copied to /usr/bin above, so it is found on the PATH
driver = webdriver.Chrome(options=chrome_options)

In [None]:
# Get the list of URLs for the top 250 beers
from selenium.webdriver.common.by import By
driver.get("https://www.beeradvocate.com/beer/top-rated/")
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which newer Selenium releases removed
table_rows = driver.find_elements(By.XPATH, "/html/body/div[2]/div/div[2]/div[2]/div[2]/div/div/div[3]/div/div/div[2]/table/tbody/tr/td/a")
links = [row.get_attribute('href') for row in table_rows]

In [None]:
from selenium.webdriver.common.by import By  # needed by find_element(s) below

# Absolute XPath of each review container on a beer page
REVIEW_XPATH = '/html/body/div[2]/div/div[2]/div[2]/div[2]/div/div/div[3]/div/div/div[2]/div[8]/div/div/div[2]'
rows = []
for i, link in enumerate(links, start=1):
    driver.get(link)
    title = driver.find_element(By.XPATH, '//*[@id="content"]/div/div/div[3]/div/div/div[1]/h1').text.replace('\n', ' by ')
    print(i, title)  # progress counter, for reference only
    # The user rating is the second <span> inside each review container
    ratings = [float(elem.text) for elem in driver.find_elements(By.XPATH, REVIEW_XPATH + '/span[2]')]
    reviews = []
    for elem in driver.find_elements(By.XPATH, REVIEW_XPATH):
        lines_list = elem.text.split('\n')
        # Drop the first 5 lines (metadata) and the last 2 (date and a blank line)
        reviews.append(' '.join(lines_list[5:-2]))
    # Pair each review with its rating; zip stops at the shorter of the two lists
    rows.extend([title, review, rating] for review, rating in zip(reviews, ratings))
reviewsdf = pd.DataFrame(rows, columns=['product_name', 'product_review', 'user_rating'])

In [None]:
reviewsdf.to_csv('beer_reviews.csv', index=False)

## **Task B**

In this section, we assume that a customer will be using this recommender system by specifying 3 attributes.

We do a word frequency analysis first to see what attributes frequently appear and then construct a hypothetical customer.
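
As a compact illustration of the word frequency step, the same counting can be written with `collections.Counter`. This is only a sketch: the stop-word set below is a tiny stand-in for NLTK's English list, and the regex tokenizer is a simplification of `word_tokenize`.

```python
import re
from collections import Counter

# Assumption: a tiny stand-in stop-word list; the notebook uses NLTK's English list.
TOY_STOP_WORDS = {"a", "and", "the", "with", "is", "this", "of"}


def term_frequencies(reviews, stop_words=TOY_STOP_WORDS):
    """Count alphabetic, non-stop-word tokens across all reviews."""
    counts = Counter()
    for review in reviews:
        # Simplified tokenizer: lowercase the text and keep runs of letters only.
        tokens = re.findall(r"[a-z]+", review.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


# Made-up reviews, for illustration only.
toy_reviews = [
    "Pours dark with a thick head.",
    "Dark and smooth, with chocolate notes.",
]
print(term_frequencies(toy_reviews).most_common(3))
```

`Counter.most_common(n)` returns the sorted `(word, count)` pairs directly, which would replace the manual dictionary update and the separate `sorted` call.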


In [None]:
import pandas as pd
import numpy as np
#following file comes from task A
reviewsdf = pd.read_csv("beer_reviews.csv").dropna()
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

In [None]:
term_freq = {}
for review in reviewsdf['product_review'].str.lower().values:
    words = word_tokenize(review)
    for word in words:
        if (word in stop_words) or (not word.isalpha()):
            continue
        if word in term_freq:
            term_freq[word] += 1
        else:
            term_freq[word] = 1

In [None]:
word_freq = sorted(term_freq.items(), key=lambda item: item[1], reverse=True)
word_freq[:50]

[('beer', 5057),
 ('head', 3801),
 ('taste', 3157),
 ('dark', 2789),
 ('chocolate', 2752),
 ('like', 2425),
 ('sweet', 2381),
 ('bourbon', 2177),
 ('one', 2169),
 ('nice', 2169),
 ('coffee', 2115),
 ('notes', 2035),
 ('nose', 1993),
 ('vanilla', 1976),
 ('light', 1923),
 ('well', 1914),
 ('finish', 1899),
 ('good', 1855),
 ('aroma', 1779),
 ('orange', 1744),
 ('carbonation', 1733),
 ('pours', 1711),
 ('fruit', 1671),
 ('bottle', 1651),
 ('body', 1629),
 ('bit', 1626),
 ('flavor', 1593),
 ('medium', 1556),
 ('really', 1548),
 ('white', 1541),
 ('overall', 1538),
 ('mouthfeel', 1535),
 ('little', 1527),
 ('great', 1518),
 ('smooth', 1493),
 ('black', 1481),
 ('lacing', 1467),
 ('glass', 1454),
 ('citrus', 1417),
 ('flavors', 1389),
 ('barrel', 1377),
 ('thick', 1349),
 ('oak', 1321),
 ('brown', 1257),
 ('feel', 1241),
 ('poured', 1240),
 ('malt', 1182),
 ('smell', 1169),
 ('bitterness', 1167),
 ('color', 1165)]

In [None]:
important_attr = ['dark', 'thick', 'smooth']

#### Our customer has said that they want a beer that is **dark**, **thick**, and **smooth**.

## **Task C**

In this section, we perform a similarity analysis using cosine similarity with the 3 customer-specified attributes for each review.
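
To make the measure concrete, here is a small worked example of cosine similarity on word-count vectors. It is a simplified sketch that splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules; the review text is made up.

```python
import math
from collections import Counter


def cosine_from_counts(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "dark thick smooth"
review = "dark and smooth dark"  # made-up review text
print(round(cosine_from_counts(query, review), 4))  # 0.7071
```

Here the dot product is 2·1 + 1·1 = 3 and the norms are √3 and √6, giving 3/√18 ≈ 0.7071. Real reviews are much longer than the three-word query, so their norms are large and the scores come out small.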

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
text_attr = ' '.join(important_attr)
similaritydf = pd.DataFrame(columns = ['product_name','product_review','similarity_score'])
for i in range(len(reviewsdf)):
    text_review = reviewsdf.iloc[i]['product_review']
    documents = [text_attr,text_review]
    count_vectorizer = CountVectorizer()
    sparse_matrix = count_vectorizer.fit_transform(documents)
    doc_term_matrix = sparse_matrix.todense()
    # get_feature_names() was removed in newer scikit-learn; use get_feature_names_out()
    df = pd.DataFrame(doc_term_matrix,
                    columns = count_vectorizer.get_feature_names_out(),
                    index=['x','y'])
    similaritydf.loc[len(similaritydf)] = [reviewsdf.iloc[i]['product_name'],
                                          text_review,
                                          cosine_similarity(df,df)[0,1]]

In [None]:
similaritydf.to_csv('beer_reviews_similarity.csv', index=False)
similaritydf.head()

| | product_name | product_review | similarity_score |
|---|---|---|---|
| 0 | Kentucky Brunch Brand Stout by Toppling Goliat... | Smell: early morning pancakes and coffee befor... | 0.0 |
| 1 | Kentucky Brunch Brand Stout by Toppling Goliat... | 2019 vintage. Pours a very dark brown color wi... | 0.044151 |
| 2 | Kentucky Brunch Brand Stout by Toppling Goliat... | It's hyped... There is a lot of breweries doin... | 0.0 |
| 3 | Kentucky Brunch Brand Stout by Toppling Goliat... | Reviewing 2019 vintage. This pours thick and c... | 0.093116 |
| 4 | Kentucky Brunch Brand Stout by Toppling Goliat... | 2018 version. Poured dark with a small head. S... | 0.15861 |


## **Task D**

Now, for every review, we perform feature-level sentiment analysis for each of the 3 features. If an attribute does not appear in a review, its feature-level sentiment is left blank. Otherwise, we tokenize the review, drop stop words and non-alphabetic tokens, take up to 3 words to the left and right of each occurrence of the feature, and pass the resulting phrase to the VADER sentiment analyzer. If the feature appears multiple times, we build one phrase per occurrence and join them before calculating the sentiment. Finally, we average the available feature-level sentiments for each review. If none of the features appears in a review, the average is left blank.
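
The windowing step on its own can be sketched as below. `context_phrase` is a hypothetical helper name, and the token list is a made-up example of a review that has already been tokenized and filtered.

```python
def context_phrase(words, attr, window=3):
    """Join up to `window` words on each side of every occurrence of attr."""
    phrase = []
    for i, word in enumerate(words):
        if word == attr:
            # max() keeps the slice from wrapping around at the start of the list;
            # slicing past the end of the list is already safe in Python.
            phrase += words[max(i - window, 0): i + window + 1]
    return " ".join(phrase)


# Made-up tokens, assumed to be already stripped of stop words and punctuation.
tokens = ["pours", "thick", "dark", "brown", "small", "head", "smooth", "finish"]
print(context_phrase(tokens, "dark"))    # pours thick dark brown small head
print(context_phrase(tokens, "smooth"))  # brown small head smooth finish
```

The returned string is what would then be scored with VADER's `polarity_scores(...)['compound']`. An attribute that never occurs yields an empty string, which is why the notebook leaves the sentiment blank in that case rather than scoring it.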

In [None]:
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
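# Build the analyzer once; constructing it on every call reloads the lexicon
sia = SentimentIntensityAnalyzer()
def get_sentiment_score(text):
    # The compound score is VADER's normalized overall polarity, between -1 and 1
    return sia.polarity_scores(text)['compound']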

In [None]:
sim_sent_df = pd.DataFrame(columns = 
                           ['product_name','product_review','similarity_score',
                            'attr1_sent', 'attr2_sent', 'attr3_sent', 'avg_feature_sent'])
for i in range(len(similaritydf)):
    beer_review = similaritydf.iloc[i]['product_review']
    attr1_sent, attr2_sent, attr3_sent = np.nan, np.nan, np.nan
    for ix, attr in enumerate(important_attr):
        sentiment = np.nan
        phrase = []
        if attr in beer_review:
            #calculate sentiment only when the attribute appears; otherwise it stays NaN
            words = word_tokenize(beer_review)
            words = [word for word in words if word not in stop_words and word.isalpha()]
            for w_ix, word in enumerate(words):
                if word == attr:
                    #get three words to the left and three to the right if possible
                    #and collect them for the sentiment analyzer
                    phrase += words[max(w_ix-3,0):min(w_ix+4, len(words))]
            sentiment = get_sentiment_score(' '.join(phrase))
        if ix==0:
            attr1_sent = sentiment
        elif ix==1:
            attr2_sent = sentiment
        elif ix==2:
            attr3_sent = sentiment
    avg_feature_sent = np.nanmean([attr1_sent, attr2_sent, attr3_sent])
    sim_sent_df.loc[len(sim_sent_df)] = [similaritydf.iloc[i]['product_name'],
                                       similaritydf.iloc[i]['product_review'],
                                       similaritydf.iloc[i]['similarity_score'],
                                       attr1_sent, attr2_sent, attr3_sent, avg_feature_sent]

In [None]:
sim_sent_df.head()

| | product_name | product_review | similarity_score | attr1_sent | attr2_sent | attr3_sent | avg_feature_sent |
|---|---|---|---|---|---|---|---|
| 0 | Kentucky Brunch Brand Stout by Toppling Goliat... | Smell: early morning pancakes and coffee befor... | 0.0 | | | | |
| 1 | Kentucky Brunch Brand Stout by Toppling Goliat... | 2019 vintage. Pours a very dark brown color wi... | 0.044151 | 0.0 | | | 0.0 |
| 2 | Kentucky Brunch Brand Stout by Toppling Goliat... | It's hyped... There is a lot of breweries doin... | 0.0 | | | | |
| 3 | Kentucky Brunch Brand Stout by Toppling Goliat... | Reviewing 2019 vintage. This pours thick and c... | 0.093116 | | 0.0 | 0.6249 | 0.31245 |
| 4 | Kentucky Brunch Brand Stout by Toppling Goliat... | 2018 version. Poured dark with a small head. S... | 0.15861 | 0.0 | | 0.5859 | 0.29295 |


## **Task E**

We assume that the evaluation score for each product is simply the sum of its average similarity score and its average feature sentiment score, both averaged over that product's reviews. We then recommend the 3 products with the highest evaluation scores.
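
The scoring rule can be sketched without pandas as follows. The function names and the toy rows are hypothetical; the NaN handling is meant to mirror the pandas defaults, where `mean` skips missing values and `sum(axis=1)` treats them as zero.

```python
import math
from collections import defaultdict


def mean_ignoring_nan(values):
    """Average the non-NaN values; return NaN if there are none."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


def evaluation_scores(rows):
    """rows: iterable of (product_name, similarity_score, avg_feature_sent)."""
    sims, sents = defaultdict(list), defaultdict(list)
    for name, sim, sent in rows:
        sims[name].append(sim)
        sents[name].append(sent)
    scores = {}
    for name in sims:
        parts = [mean_ignoring_nan(sims[name]), mean_ignoring_nan(sents[name])]
        # A component that is still NaN contributes nothing to the sum.
        scores[name] = sum(p for p in parts if not math.isnan(p))
    return scores


# Made-up review rows, for illustration only.
toy_rows = [
    ("Beer A", 0.10, 0.50),
    ("Beer A", 0.20, math.nan),  # no attribute found in this review
    ("Beer B", 0.05, 0.20),
]
scores = evaluation_scores(toy_rows)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Because the two components are added unweighted, whichever one has the larger spread across products will drive the ranking, which is worth keeping in mind when reading the results.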

In [None]:
eval_score_df = sim_sent_df.groupby(['product_name'])[['similarity_score','avg_feature_sent']].mean()
eval_score_df['eval_score'] = eval_score_df.sum(axis=1)
eval_score_df.sort_values(by="eval_score", ascending=False).head()

| product_name | similarity_score | avg_feature_sent | eval_score |
|---|---|---|---|
| Flora Plum by Hill Farmstead Brewery | 0.006035 | 0.75195 | 0.757985 |
| Hefeweissbier by Bayerische Staatsbrauerei Weihenstephan | 0.034543 | 0.472982 | 0.507525 |
| Space Trace by Bottle Logic Brewing | 0.076611 | 0.425495 | 0.502106 |
| Double Sunshine by Lawson's Finest Liquids | 0.00593 | 0.465267 | 0.471197 |
| Bodhi by Columbus Brewing Company | 0.016268 | 0.4545 | 0.470768 |


We will recommend Flora Plum by Hill Farmstead Brewery, Hefeweissbier by Bayerische Staatsbrauerei Weihenstephan, and Space Trace by Bottle Logic Brewing to the customer.

## **Task F**

We now want to see whether the recommendations would change if, instead of the bag-of-words cosine similarity from Task C, we computed similarity from word vectors using the spaCy package.

In [None]:
import spacy
!python -m spacy download en_core_web_md

In [None]:
import en_core_web_md
nlp = en_core_web_md.load()
sim_sent_df2 = sim_sent_df.copy()

In [None]:
# Parse the attribute string once instead of on every call
attr_doc = nlp(' '.join(important_attr))
def get_spacy_sim(prod_review):
    # Doc.similarity compares the word vectors of the two documents
    return attr_doc.similarity(nlp(prod_review))
sim_sent_df2['similarity_score'] = sim_sent_df2['product_review'].map(get_spacy_sim)

In [None]:
eval_score_df2 = sim_sent_df2.groupby(['product_name'])[['similarity_score','avg_feature_sent']].mean()
eval_score_df2['eval_score'] = eval_score_df2.sum(axis=1)
eval_score_df2.sort_values(by="eval_score", ascending=False).head()

| product_name | similarity_score | avg_feature_sent | eval_score |
|---|---|---|---|
| Flora Plum by Hill Farmstead Brewery | 0.672477 | 0.75195 | 1.424427 |
| Bodhi by Columbus Brewing Company | 0.671971 | 0.4545 | 1.126471 |
| Hefeweissbier by Bayerische Staatsbrauerei Weihenstephan | 0.652207 | 0.472982 | 1.125189 |
| Double Sunshine by Lawson's Finest Liquids | 0.647936 | 0.465267 | 1.113203 |
| Space Trace by Bottle Logic Brewing | 0.680827 | 0.425495 | 1.106322 |


We can see that the similarity scores have all greatly increased because spaCy's word vectors account for words that have similar meanings rather than requiring exact matches. However, the recommended products do not change much: the top 5 from bag-of-words cosine similarity and the top 5 here contain the same beers, just in a different order. In this case, the 3 products we would recommend are Flora Plum by Hill Farmstead Brewery, Bodhi by Columbus Brewing Company, and Hefeweissbier by Bayerische Staatsbrauerei Weihenstephan.
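
The overlap between the two rankings can be checked directly. The lists below are copied from the Task E and Task F result tables; the variable names are ours.

```python
# Top-5 product names, in rank order, copied from the two result tables.
top5_bow = [
    "Flora Plum by Hill Farmstead Brewery",
    "Hefeweissbier by Bayerische Staatsbrauerei Weihenstephan",
    "Space Trace by Bottle Logic Brewing",
    "Double Sunshine by Lawson's Finest Liquids",
    "Bodhi by Columbus Brewing Company",
]
top5_spacy = [
    "Flora Plum by Hill Farmstead Brewery",
    "Bodhi by Columbus Brewing Company",
    "Hefeweissbier by Bayerische Staatsbrauerei Weihenstephan",
    "Double Sunshine by Lawson's Finest Liquids",
    "Space Trace by Bottle Logic Brewing",
]

# True when both methods select the same five beers, regardless of order.
same_set = set(top5_bow) == set(top5_spacy)
# Negative values mean the beer moved up under the spaCy-based score.
rank_shift = {name: top5_spacy.index(name) - top5_bow.index(name) for name in top5_bow}
print(same_set)
print(rank_shift)
```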

One reason we may not see vast differences is that dark, thick, and smooth are fairly common attributes for describing beer. If we chose more obscure attributes, the two measures of similarity might differ greatly. In that scenario, spaCy would likely overestimate the attribute similarity. For example, if one of the chosen attributes were 'hoppy', spaCy would likely assign a high similarity score to a review that mentions 'malty', because the two words are similar in that they both describe beer. However, to an experienced beer drinker, these attributes are very different. In a case like this, spaCy might lead to less accurate results. With fairly general attributes like ours (dark, thick, smooth), this does not appear to be a problem for our customer.
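
The 'hoppy' versus 'malty' concern can be illustrated with a toy model of vector-based similarity. This sketch assumes, as we understand spaCy's default behavior for models with static vectors, that a document's vector is the average of its token vectors and that similarity is the cosine between document vectors. The 3-dimensional vectors are invented for illustration and are not real spaCy values.

```python
import math

# Hypothetical 3-dimensional "word vectors", invented purely for illustration.
# Real spaCy vectors have hundreds of dimensions and different values.
TOY_VECTORS = {
    "hoppy": [0.9, 0.1, 0.3],
    "malty": [0.8, 0.3, 0.2],
    "smooth": [0.2, 0.9, 0.4],
}


def mean_vector(words):
    """Average the vectors of the words that have one (zero vector if none)."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vector_similarity(text_a, text_b):
    return cosine(mean_vector(text_a.split()), mean_vector(text_b.split()))


def exact_match_overlap(text_a, text_b):
    """Number of distinct words the two texts share."""
    return len(set(text_a.split()) & set(text_b.split()))


print(exact_match_overlap("hoppy", "malty"))  # 0
print(vector_similarity("hoppy", "malty"))
```

With these made-up vectors, the two words share no exact match, yet their vector similarity is close to 1, which is the kind of overestimate described above.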

## **Task G**

We now examine how the recommendations would differ if we simply chose the 3 highest user-rated products overall and whether these products would meet the needs of the user. We return to the BeerAdvocate website and find the top 3 rated beers:
* Kentucky Brunch Brand Stout by Toppling Goliath Brewing Company
* Marshmallow Handjee by 3 Floyds Brewing Co.
* Barrel-Aged Abraxas by Perennial Artisan Ales

In [None]:
top3_rated = ['Kentucky Brunch Brand Stout by Toppling Goliath Brewing Company',
              'Marshmallow Handjee by 3 Floyds Brewing Co.', 
              'Barrel-Aged Abraxas by Perennial Artisan Ales']
eval_score_df.loc[top3_rated,:]

| product_name | similarity_score | avg_feature_sent | eval_score |
|---|---|---|---|
| Kentucky Brunch Brand Stout by Toppling Goliath Brewing Company | 0.078574 | 0.286133 | 0.364707 |
| Marshmallow Handjee by 3 Floyds Brewing Co. | 0.064799 | 0.191546 | 0.256346 |
| Barrel-Aged Abraxas by Perennial Artisan Ales | 0.040149 | 0.108729 | 0.148878 |


In [None]:
eval_score_df2.loc[top3_rated,:]

| product_name | similarity_score | avg_feature_sent | eval_score |
|---|---|---|---|
| Kentucky Brunch Brand Stout by Toppling Goliath Brewing Company | 0.659288 | 0.286133 | 0.945421 |
| Marshmallow Handjee by 3 Floyds Brewing Co. | 0.650372 | 0.191546 | 0.841918 |
| Barrel-Aged Abraxas by Perennial Artisan Ales | 0.659161 | 0.108729 | 0.76789 |


If the customer simply chose the 3 top-rated beers, it's possible that they would enjoy them. However, based on the feature-level sentiment scores, it's likely that they would enjoy the beers we recommend more. The similarity scores of the top-rated beers are close to those of the beers we recommend; this is likely because dark, thick, and smooth are fairly ubiquitous beer attributes that appear in many reviews. The average feature sentiment, however, is much lower for the top 3 rated beers (0.11 to 0.29) than for the 3 beers we recommend in Task E (0.43 to 0.75). This means that reviewers of our recommended beers are much more positive about the three features relevant to the customer. Hopefully, this means that the customer will enjoy the three beers we recommend more than the overall top-rated beers.
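
The sentiment gap can be summarized with a quick calculation. The values are copied from the `avg_feature_sent` column of the Task E and Task G tables; the variable names are ours.

```python
from statistics import mean

# avg_feature_sent for the 3 beers recommended in Task E
recommended = {
    "Flora Plum by Hill Farmstead Brewery": 0.75195,
    "Hefeweissbier by Bayerische Staatsbrauerei Weihenstephan": 0.472982,
    "Space Trace by Bottle Logic Brewing": 0.425495,
}
# avg_feature_sent for the 3 top-rated beers from Task G
top_rated = {
    "Kentucky Brunch Brand Stout by Toppling Goliath Brewing Company": 0.286133,
    "Marshmallow Handjee by 3 Floyds Brewing Co.": 0.191546,
    "Barrel-Aged Abraxas by Perennial Artisan Ales": 0.108729,
}

# Difference between the two group means of average feature sentiment.
gap = mean(list(recommended.values())) - mean(list(top_rated.values()))
print(round(gap, 3))
```

A positive gap indicates that reviews of the recommended beers speak more favorably about dark, thick, and smooth than reviews of the top-rated beers do.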