#### This notebook was created as a submission for Assignment 02, Analytics for Unstructured Data
Submitted by: Ruchi Sharma 

#### Building a Crowdsourced Recommender System

High level description: The objective of this assignment is to create the building blocks of a 
crowdsourced recommender system. It should accept user inputs in the form of desired attributes of a 
product and come up with 3 recommendations. 

Obtain reviews of craft beers from beeradvocate.com.

In [1]:
'''

import sys
sys.executable

!/opt/anaconda3/bin/python -m pip install spacy
!/opt/anaconda3/bin/python -m pip install nltk
!/opt/anaconda3/bin/python -m pip install vaderSentiment
!/opt/anaconda3/bin/python -m spacy download en_core_web_lg

'''


In [2]:
import requests
import pandas as pd
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import spacy 
import numpy as np
import string
import vaderSentiment

from bs4 import BeautifulSoup
from lxml import html
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set( stopwords.words('english'))

%matplotlib inline

##### Task A: Loading the scraped dataset

In [3]:
import pandas as pd
data = pd.read_csv("beer_reviews.csv")
data.head()

Unnamed: 0,product_name,product_review,user_rating
0,Kentucky Brunch Brand Stout,I didnt think i was going to give it a perfect...,5.0
1,Kentucky Brunch Brand Stout,So I just read a review that called the legend...,3.79
2,Kentucky Brunch Brand Stout,"2021 vintage, bottle #79\n\nHoly. Fucking. Shi...",4.64
3,Kentucky Brunch Brand Stout,"Celebrating my buddy @Rug with his 1,000th bee...",4.27
4,Kentucky Brunch Brand Stout,"Thick and syrupy pour, mocha head. Aroma is bo...",4.79


In [4]:
data.groupby(['product_name']).count()

product_name,product_review,user_rating
18 Hours From Brooklyn,13,13
25th Anniversary Ale,5,5
26th Anniversary Imperial IPA,20,20
31 Pumpkin Spiced Lager,18,18
312 Urban Wheat,18,18
...,...,...
Yellow,6,6
Yellow Bus,7,7
Zenne Y Frontera,6,6
Zombie Dust,21,21


In [5]:
# text preprocessing

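# drop incomplete rows; copy so the new columns are added to an independent frame
data = data.dropna().copy()
# strip punctuation, then lowercase
data['cleaned_review'] = data['product_review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
data['cleaned_review'] = data['cleaned_review'].apply(lambda x: x.lower())

# tokenize and keep each word once per review, so later counts are per-review counts
data['cleaned_review'] = data['cleaned_review'].apply(word_tokenize).apply(set).apply(list)

def remove_stopwords(s):
    """Drop English stopwords from a list of tokens"""
    return [w for w in s if w not in stop_words]

data['cleaned_review'] = data['cleaned_review'].apply(remove_stopwords)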

In [6]:
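from nltk import FreqDist

# Iterate over the column itself: after dropna() the index labels can have gaps,
# so label-based lookups such as data['cleaned_review'][i] may raise a KeyError.
all_words = []
for words in data['cleaned_review']:
    all_words += words
word_freq = FreqDist(all_words)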

In [7]:
word_freq.most_common()

[('head', 2882),
 ('taste', 2206),
 ('beer', 2089),
 ('pours', 1534),
 ('carbonation', 1457),
 ('good', 1400),
 ('sweet', 1383),
 ('lacing', 1337),
 ('overall', 1301),
 ('finish', 1294),
 ('nose', 1292),
 ('one', 1257),
 ('body', 1257),
 ('aroma', 1251),
 ('like', 1248),
 ('light', 1216),
 ('dark', 1187),
 ('nice', 1172),
 ('medium', 1151),
 ('mouthfeel', 1151),
 ('white', 1123),
 ('notes', 1100),
 ('color', 1072),
 ('chocolate', 1071),
 ('well', 1052),
 ('glass', 1044),
 ('flavor', 998),
 ('malt', 995),
 ('bit', 949),
 ('black', 943),
 ('feel', 928),
 ('smooth', 925),
 ('poured', 920),
 ('bottle', 920),
 ('vanilla', 906),
 ('thick', 884),
 ('really', 827),
 ('bitterness', 819),
 ('little', 811),
 ('smell', 810),
 ('brown', 807),
 ('orange', 805),
 ('caramel', 788),
 ('great', 787),
 ('creamy', 782),
 ('bourbon', 777),
 ('fruit', 768),
 ('flavors', 726),
 ('full', 713),
 ('hops', 690),
 ('barrel', 686),
 ('citrus', 685),
 ('’', 684),
 ('coffee', 658),
 ('much', 653),
 ('dry', 652),
 ...]

From the word frequency distribution above, we can pick out the words that describe attributes or features of a beer, as opposed to generic review vocabulary such as "good", "one" or "like".

The following words associated with beer stand out: carbonation, sweet, aroma, light, dark, medium, mouthfeel,
color, chocolate, malt, vanilla, thick, smell, brown, orange,
caramel, creamy, bourbon, fruit, hops, barrel, bitter, hop,
tropical
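
The selection step above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: `toy_reviews` and `candidate_attributes` are hypothetical stand-ins. Because the preprocessing cell applies `set()` to every review, each word is counted at most once per review, so the frequencies are really "number of reviews mentioning the word".

```python
from collections import Counter

# Toy reviews standing in for the real `cleaned_review` column (hypothetical data).
toy_reviews = [
    ["dark", "creamy", "head", "dark"],
    ["carbonation", "light", "head"],
    ["dark", "chocolate", "carbonation"],
]

# Hypothetical shortlist of words we consider to be beer attributes.
candidate_attributes = {"dark", "creamy", "carbonation", "light", "chocolate"}

# Count each word once per review, mirroring the set() step in the preprocessing cell.
doc_freq = Counter(word for review in toy_reviews for word in set(review))

# Keep only the candidate attributes, most frequent first (ties broken alphabetically).
attribute_counts = sorted(
    ((word, count) for word, count in doc_freq.items() if word in candidate_attributes),
    key=lambda pair: (-pair[1], pair[0]),
)
print(attribute_counts)
```

Generic words such as "head" are counted but filtered out, which is the same manual judgment applied to the list above.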

##### Task B: Assume that a customer, who will be using this recommender system, has specified 3 attributes in a product. 

E.g., one website describes multiple attributes of beer:
https://www.dummies.com/food-drink/drinks/beer/beer-for-dummies-cheat-sheet/

1. Aggressive (Boldly assertive aroma and/or taste) 
2. Balanced: Malt and hops in similar proportions; equal representation of malt sweetness and 
hop bitterness in the flavor — especially at the finish
3. Complex: Multidimensional; many flavors and sensations on the palate
4. Crisp: Highly carbonated; effervescent
5. Fruity: Flavors reminiscent of various fruits or Hoppy: Herbal, earthy, spicy, or citric aromas and flavors of hops or Malty: Grainy, caramel-like; can be sweet or dry
6. Robust: Rich and full-bodied

Use the above attributes as examples only; a word frequency analysis of the beer reviews is a better way to find the attributes that matter in the actual data.

Assume that a customer has specified three attributes of the product as being important to him or her. 

###### For this experiment, let's consider dark, carbonation and creamy as the three attributes entered by the user. 

##### Task C: Perform a similarity analysis using the 3 attributes specified by the customer and the reviews. 
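
To make the similarity score easier to interpret: as far as I understand, spaCy's `Doc.similarity` defaults to the cosine similarity between the two document vectors, and with `en_core_web_lg` a document vector is the average of its token vectors. The sketch below reproduces that idea in plain Python. The 3-dimensional `toy_vectors` are invented for illustration (the real model uses 300 dimensions), and returning 0.0 for an empty text is an assumption of this sketch.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
toy_vectors = {
    "dark": [0.9, 0.1, 0.0],
    "creamy": [0.1, 0.9, 0.1],
    "carbonation": [0.0, 0.2, 0.9],
    "stout": [0.8, 0.3, 0.1],
    "citrus": [0.0, 0.1, 0.2],
}

def mean_vector(words):
    """Average the vectors of the words that have a vector; zero vector if none do."""
    known = [toy_vectors[w] for w in words if w in toy_vectors]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 here when either vector has zero length."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

# The three attributes chosen for this experiment form the query.
query = mean_vector(["dark", "carbonation", "creamy"])

score_stout = cosine(mean_vector(["dark", "creamy", "stout"]), query)
score_citrus = cosine(mean_vector(["citrus"]), query)
print(score_stout > score_citrus)
```

Because every word in a review contributes to the average, long reviews that mention many unrelated things are pulled away from the query even when they contain all three attributes.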

In [8]:
import spacy
import en_core_web_lg
nlp = en_core_web_lg.load()

In [9]:
def join_words(comment):   
    """Joins the tokenized words to a sentence"""
    return " ".join(comment) 

data['joined_review'] = data['cleaned_review'].map(join_words)

In [10]:
# Calculate similarity with pre-processing functions

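def calculate_similarity(comment):
    """Compute the spaCy similarity between a review and the requested attributes.

    Reads the module-level `input_attributes` string, which must be defined
    before this function is called.
    """
    base = nlp(comment)
    compare = nlp(input_attributes)
    return base.similarity(compare)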

In [11]:
input_list = ['dark', 'carbonation', 'creamy']
input_attributes =  " ".join(input_list)
data['similarity'] = data['joined_review'].map(calculate_similarity)



In [12]:
data.head(10)

Unnamed: 0,product_name,product_review,user_rating,cleaned_review,joined_review,similarity
0,Kentucky Brunch Brand Stout,I didnt think i was going to give it a perfect...,5.0,"[real, pancakes, maple, grabs, familiar, taste...",real pancakes maple grabs familiar taste give ...,0.581676
1,Kentucky Brunch Brand Stout,So I just read a review that called the legend...,3.79,"[’, pretentious, heavy, ”, breaking, nose, jud...",’ pretentious heavy ” breaking nose judge beer...,0.606326
2,Kentucky Brunch Brand Stout,"2021 vintage, bottle #79\n\nHoly. Fucking. Shi...",4.64,"[end, morbid, 1000, whole, signature, unless, ...",end morbid 1000 whole signature unless im nose...,0.670074
3,Kentucky Brunch Brand Stout,"Celebrating my buddy @Rug with his 1,000th bee...",4.27,"[rug, someone, tasted, flavor, feels, chocolat...",rug someone tasted flavor feels chocolate ’ bu...,0.659777
4,Kentucky Brunch Brand Stout,"Thick and syrupy pour, mocha head. Aroma is bo...",4.79,"[chocolate, heavy, expectations, carbonation, ...",chocolate heavy expectations carbonation lived...,0.796936
5,Kentucky Brunch Brand Stout,Had a big share and this was by far the best s...,5.0,"[far, big, share, best, stout]",far big share best stout,0.32328
6,Kentucky Brunch Brand Stout,Look - fantastic black and thick with a great ...,4.73,"[fantastic, black, look, creamy, maple, great,...",fantastic black look creamy maple great head s...,0.747402
7,Kentucky Brunch Brand Stout,Initial nose is beautiful maple! The taste is ...,4.96,"[chocolate, knock, nose, high, beer, viscosity...",chocolate knock nose high beer viscosity sligh...,0.697868
8,Kentucky Brunch Brand Stout,A: Nightfall in appearance that presents a cla...,4.23,"[aids, placed, ’, lingers, peripheral, set, ce...",aids placed ’ lingers peripheral set centering...,0.721594
9,Kentucky Brunch Brand Stout,Thanks for the pour Azuelke!,4.64,"[pour, thanks, azuelke]",pour thanks azuelke,0.241991


In [13]:
top300_reviews = data.sort_values(by='similarity', ascending=False)[0:300]
top300_reviews

Unnamed: 0,product_name,product_review,user_rating,cleaned_review,joined_review,similarity
2386,KBS - Hazelnut,The deepest of dark brown liquid swirls into y...,4.93,"[luscious, flavor, liquid, chocolate, variant,...",luscious flavor liquid chocolate variant evolu...,0.851299
3092,Barrel Aged Exhausted Parent Stout,A - Pours a very dark brown with a thin tan he...,4.30,"[chocolate, solid, black, carbonation, heat, c...",chocolate solid black carbonation heat creamy ...,0.851100
3093,Barrel Aged Exhausted Parent Stout,Pours a dark brown color with a minimal head t...,4.43,"[color, minimal, black, creamy, dark, sweet, h...",color minimal black creamy dark sweet head smo...,0.850226
1891,Damon (Bourbon Barrel Aged),Bottle: Poured a pitch-black color stout with ...,4.50,"[chocolate, large, color, barrelaged, solid, n...",chocolate large color barrelaged solid notes b...,0.849690
3320,Rocky Road Stout,"Rocky Road Stout has a thin, fizzing-away beig...",4.46,"[flavor, milk, chocolate, road, high, sweetnes...",flavor milk chocolate road high sweetness silk...,0.849404
...,...,...,...,...,...,...
941,Black Tuesday - Reserve,"Very good head production, considering the sky...",4.56,"[double, molasses, nose, oak, foam, sweet, che...",double molasses nose oak foam sweet chest quic...,0.800929
4482,Old Rasputin,Black as used motor oil. Thick consistency. St...,4.72,"[flavor, chocolate, notes, smoky, black, beers...",flavor chocolate notes smoky black beers dark ...,0.800866
3442,Electric Skies,Canned on 5/28/22; consumed on 6/21/22\n\nPour...,4.29,"[end, textures, consumed, frothy, blanc, repre...",end textures consumed frothy blanc representat...,0.800863
2369,Project Find The Limit #10,Canned on 06-09-22.\n\nA - Opaque pineapple or...,4.41,"[fantasic, lots, color, citrus, spicy, silky, ...",fantasic lots color citrus spicy silky 060922 ...,0.800842


##### Task D: For every review, perform a sentiment analysis (using VADER).
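
VADER's `compound` score lies between -1 (most negative) and +1 (most positive). The small helper below is a sketch of the commonly used cut-offs of plus or minus 0.05 for labelling a score; `label_compound` is a hypothetical name and is not part of the `vaderSentiment` package. The example values 0.9975 and -0.4939 are the highest and lowest sentiment scores in the top-300 output further down.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

for score in (0.9975, 0.0, -0.4939):
    print(score, label_compound(score))
```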

In [14]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [15]:
def sentiment_analyzer_scores(sentence):
    """Generate the VADER compound sentiment score for a review"""
    scores = analyser.polarity_scores(sentence)
    return scores['compound']

top300_reviews['sentiment'] = top300_reviews['product_review'].map(sentiment_analyzer_scores)

In [16]:
top300_reviews_sentiment = top300_reviews.sort_values(by='sentiment', ascending=False)

In [17]:
top300_reviews_sentiment

Unnamed: 0,product_name,product_review,user_rating,cleaned_review,joined_review,similarity,sentiment
3800,Speedway Stout - Madagascar Vanilla & Ceylon A...,Poured from the can into a teku style glass.\n...,4.44,"[kick, grandeur, swirling, looking, nose, cake...",kick grandeur swirling looking nose cake finge...,0.828255,0.9975
3710,Banana Colada Melted Sno Cone,A: Pours an opaque murky thick sludgy milky pa...,4.26,"[reduces, two, minimal, beer, spring, evokes, ...",reduces two minimal beer spring evokes lacing ...,0.814653,0.9959
3342,Big Bad Baptist Naked,From the 22 oz bottle in a snifter. This excep...,4.37,"[wafting, two, heavy, raspberry, nose, beer, t...",wafting two heavy raspberry nose beer turn lac...,0.804232,0.9957
2842,Even More Nobility,A: Pours an opaque yet still clear thick visco...,4.22,"[reduces, two, beer, toasted, pretty, lacing, ...",reduces two beer toasted pretty lacing lingeri...,0.818179,0.9949
3361,Barrel Aged Fully Loaded French Toast,A: Pours an opaque yet still clear thick visco...,4.36,"[reduces, two, minimal, beer, toasted, lacing,...",reduces two minimal beer toasted lacing oak co...,0.817935,0.9935
...,...,...,...,...,...,...,...
4537,Blue Hen Pilsner,"Blue Hen Pilsner has a thick, off-white head, ...",4.50,"[hop, flavor, bubbly, pilsner, cracker, minima...",hop flavor bubbly pilsner cracker minimal hen ...,0.814212,-0.4019
3406,LAX To JFK : In The Clouds,Poured from a can dated 10/27/21\nThick cloudy...,4.17,"[tart, overly, dissipating, ’, heavy, thicker,...",tart overly dissipating ’ heavy thicker stone ...,0.815495,-0.4168
3040,Blender King,Murky dark orange with two fingers of yellowed...,4.08,"[easter, two, tasty, mango, nose, fingers, car...",easter two tasty mango nose fingers carbonatio...,0.828534,-0.4767
3894,Double Fractal Penrose Tile,Pours an opaque grapefruit with thin lacing an...,4.09,"[hop, flavor, nose, resin, carbonation, lacing...",hop flavor nose resin carbonation lacing tropi...,0.815912,-0.4939


##### Task E: Create an evaluation score for each beer that uses both similarity and sentiment scores. 
E.g. total score  = average of (similarity score + sentiment score) or a multiplicative model.  
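
The two scoring options can be compared on toy numbers. This is a sketch under stated assumptions: the `beers` values are hypothetical, and rescaling sentiment from [-1, 1] to [0, 1] before multiplying is one possible choice, made so that a negative sentiment cannot flip the sign of the product.

```python
def additive_score(similarity, sentiment):
    """Additive model: simple sum of the two scores."""
    return similarity + sentiment

def multiplicative_score(similarity, sentiment):
    """Multiplicative model: rescale sentiment from [-1, 1] to [0, 1], then multiply."""
    return similarity * (sentiment + 1) / 2

# Hypothetical per-beer averages as (similarity, sentiment).
beers = {
    "Beer A": (0.84, 0.60),
    "Beer B": (0.80, 0.99),
    "Beer C": (0.82, -0.40),
    "Beer D": (0.81, 0.90),
}

def top_n(scores, score_fn, n=3):
    """Return the n beer names with the highest score under score_fn."""
    ranked = sorted(scores, key=lambda name: score_fn(*scores[name]), reverse=True)
    return ranked[:n]

top_additive = top_n(beers, additive_score)
top_multiplicative = top_n(beers, multiplicative_score)
print(top_additive, top_multiplicative)
```

One thing to keep in mind with the additive model: among the top 300 reviews the similarity only varies between roughly 0.80 and 0.85, while sentiment ranges from about -0.49 to 1.0, so the ranking is driven almost entirely by sentiment.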

##### Now recommend 3 products to the customer. 

In [18]:
top300_reviews_combined = top300_reviews.groupby('product_name')[['similarity','sentiment']].mean()
top300_reviews_combined['recommend'] = top300_reviews_combined['similarity']+top300_reviews_combined['sentiment']

In [19]:
top300_reviews_combined.sort_values(by='recommend', ascending=False)[0:3]

product_name,similarity,sentiment,recommend
Mexican Brunch,0.825498,0.9911,1.816598
Plead The 5th - Bourbon Barrel-Aged,0.827077,0.9884,1.815477
Endless,0.837786,0.9757,1.813486


In [20]:
data['sentiment'] = data['product_review'].map(sentiment_analyzer_scores)

In [21]:
reviews_df_combined = data.groupby('product_name')[['similarity','sentiment', 'user_rating']].mean()
reviews_df_combined['recommend'] = reviews_df_combined['similarity']+reviews_df_combined['sentiment']

In [22]:
reviews_df_combined.sort_values(by = 'user_rating', ascending = False)[0:3]

product_name,similarity,sentiment,user_rating,recommend
The Adjunct Trail - Bourbon Barrel-Aged,0.662782,0.8126,4.93,1.475382
Triple Shot - House Blend,0.803807,0.9878,4.84,1.791607
Twice the Daily Serving: Raspberry,0.639894,0.75505,4.833333,1.394944


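##### Thus, on the basis of similarity combined with sentiment, the final recommendations for the three selected attributes are "Mexican Brunch", "Plead The 5th - Bourbon Barrel-Aged" and "Endless". For comparison, ranking by average user rating alone gives "The Adjunct Trail - Bourbon Barrel-Aged", "Triple Shot - House Blend" and "Twice the Daily Serving: Raspberry".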