Building a Recommendation System in Python
============================
> This tutorial shows how to build a recommendation system with pandas, scikit-learn, and NumPy. We use a dataset of beer reviews to build a product recommender, but the same approach should carry over to other products that have user ratings.

In [1]:
import pandas as pd
import numpy as np
import pylab as pl
import os



In [2]:
import sys
print("The Python version is %s.%s.%s" % sys.version_info[:3])

The Python version is 3.6.4


<h2><a href="https://s3.amazonaws.com/demo-datasets/beer_reviews.tar.gz">Download the data</a></h2>
<p>Grab the dataset from our data demos bucket on S3, then decompress it. It will create a directory called ~/Downloads/beer_reviews.</p>

In [3]:
%pwd

'C:\\Users\\JM025575\\Predictive Models Class\\Week 7\\code'

In [4]:
cd /Users/JM025575/Predictive Models Class/data

C:\Users\JM025575\Predictive Models Class\data


In [6]:
# adjust the path if beer_reviews.csv is not in your current working directory
df = pd.read_csv("beer_reviews.csv", encoding='iso-8859-1')
df.head()

Unnamed: 0,brewery_id,brewery_name,review_time,review_overall,review_aroma,review_appearance,review_profilename,beer_style,review_palate,review_taste,beer_name,beer_abv,beer_beerid
0,10325,Vecchio Birraio,1234817823,1.5,2.0,2.5,stcules,Hefeweizen,1.5,1.5,Sausa Weizen,5.0,47986
1,10325,Vecchio Birraio,1235915097,3.0,2.5,3.0,stcules,English Strong Ale,3.0,3.0,Red Moon,6.2,48213
2,10325,Vecchio Birraio,1235916604,3.0,2.5,3.0,stcules,Foreign / Export Stout,3.0,3.0,Black Horse Black Beer,6.5,48215
3,10325,Vecchio Birraio,1234725145,3.0,3.0,3.5,stcules,German Pilsener,2.5,3.0,Sausa Pils,5.0,47969
4,1075,Caldera Brewing Company,1293735206,4.0,4.5,4.0,johnmichaelsen,American Double / Imperial IPA,4.0,4.5,Cauldron DIPA,7.7,64883


In [7]:
df.shape

(1586614, 13)

In [8]:
import warnings
warnings.filterwarnings("ignore")

## Finding People Who Have Reviewed 2 Beers

In [9]:
beer_1, beer_2 = "Dale's Pale Ale", "Fat Tire Amber Ale"

beer_1_reviewers = df[df.beer_name==beer_1].review_profilename.unique()
beer_2_reviewers = df[df.beer_name==beer_2].review_profilename.unique()
common_reviewers = set(beer_1_reviewers).intersection(beer_2_reviewers)
print("Users who reviewed both beers:", len(common_reviewers))
list(common_reviewers)[:10]

Users who reviewed both beers: 499


['smakawhat',
 'MrMcGibblets',
 'jctribe25',
 'u2carew',
 'AdamBear',
 'haddon',
 'peabody',
 'drpimento',
 'GClarkage',
 'bump8628']
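To see the intersection step in isolation, here is a minimal sketch on a made-up frame. The `toy` data and the helper name `common_reviewers_for` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy reviews; names and scores are invented for illustration.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "C"],
    "review_profilename": ["ann", "bob", "cat", "bob", "cat", "ann"],
    "review_overall": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

def common_reviewers_for(frame, beer_a, beer_b):
    """Return the set of reviewers who rated both beers."""
    a = set(frame.loc[frame.beer_name == beer_a, "review_profilename"])
    b = set(frame.loc[frame.beer_name == beer_b, "review_profilename"])
    return a & b

toy_common = common_reviewers_for(toy, "A", "B")
print(sorted(toy_common))
```

Beers with no shared reviewers return an empty set, which is worth checking for before computing any distance.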

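Some rows have missing values, so we need to clean them up before we implement our functions. (The execution counts suggest the check below was re-run after the cleanup cell, which would explain why it already reports no missing values.)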

In [68]:
df.isnull().sum()

brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64

In [50]:
df.dropna(inplace=True)
df.isnull().sum()

brewery_id            0
brewery_name          0
review_time           0
review_overall        0
review_aroma          0
review_appearance     0
review_profilename    0
beer_style            0
review_palate         0
review_taste          0
beer_name             0
beer_abv              0
beer_beerid           0
dtype: int64
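Dropping every row that has any missing value can discard reviews whose gaps sit in columns the recommender never uses. As a hedged alternative, the sketch below drops rows only when the columns we rely on are missing. The `toy` frame and the `needed` list are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two different columns.
toy = pd.DataFrame({
    "review_profilename": ["ann", None, "cat"],
    "beer_abv": [5.0, 6.2, np.nan],
    "review_overall": [4.0, 3.5, 5.0],
})

# Count missing values per column before cleaning.
missing_before = toy.isnull().sum()

# Only require the columns the recommender actually reads.
needed = ["review_profilename", "review_overall"]
cleaned = toy.dropna(subset=needed)

print(missing_before)
print(cleaned)
```

Here the row with a missing `beer_abv` survives because that column is not in `needed`, while the row with no reviewer name is dropped.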

## Extracting Reviews

In [69]:
def get_beer_reviews(beer, common_users):
    """Return one review per common reviewer for the given beer, sorted by reviewer name."""
    mask = (df.review_profilename.isin(common_users)) & (df.beer_name==beer)
    reviews = df[mask].sort_values('review_profilename')
    # keep only one review per reviewer so both beers line up row for row
    reviews = reviews[~reviews.review_profilename.duplicated()]
    return reviews

beer_1_reviews = get_beer_reviews(beer_1, common_reviewers)
beer_2_reviews = get_beer_reviews(beer_2, common_reviewers)

cols = ['beer_name', 'review_profilename', 'review_overall', 'review_aroma', 'review_palate', 'review_taste']
beer_2_reviews[cols].head()




Unnamed: 0,beer_name,review_profilename,review_overall,review_aroma,review_palate,review_taste
202456,Fat Tire Amber Ale,ATPete,4.5,4.0,4.0,4.5
201458,Fat Tire Amber Ale,AdamBear,3.5,2.5,4.5,3.5
201886,Fat Tire Amber Ale,AlCaponeJunior,2.0,3.0,3.5,3.0
202481,Fat Tire Amber Ale,AltBock,4.0,3.0,3.0,3.0
201803,Fat Tire Amber Ale,Andreji,4.0,4.5,4.0,4.0
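The function above relies on sorting and de-duplicating so that the two result frames line up row for row. A merge on the reviewer name makes that pairing explicit. The following is a sketch on invented data; `first_review_per_user` is a hypothetical helper, and unlike the original it keeps each reviewer's first review in the frame's existing order.

```python
import pandas as pd

# Hypothetical toy reviews; "ann" reviewed beer A twice.
toy = pd.DataFrame({
    "beer_name": ["A", "A", "A", "B", "B", "B"],
    "review_profilename": ["bob", "ann", "ann", "ann", "bob", "cat"],
    "review_overall": [3.5, 4.0, 2.0, 4.5, 3.0, 5.0],
})

def first_review_per_user(frame, beer):
    """Keep the first review each user wrote for one beer."""
    rows = frame[frame.beer_name == beer]
    return rows.drop_duplicates("review_profilename")

# An inner merge keeps only reviewers present on both sides.
paired = (
    first_review_per_user(toy, "A")
    .merge(first_review_per_user(toy, "B"),
           on="review_profilename", suffixes=("_1", "_2"))
    .sort_values("review_profilename")
    .reset_index(drop=True)
)
print(paired[["review_profilename", "review_overall_1", "review_overall_2"]])
```

Each row of `paired` holds one reviewer's score for both beers, so the two rating columns can be compared directly.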


## Calculating Distance

In [70]:
# pick a distance or similarity metric; this tutorial uses Euclidean distance

from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics.pairwise import manhattan_distances
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

warnings.filterwarnings("ignore")

ALL_FEATURES = ['review_overall', 'review_aroma', 'review_palate', 'review_taste']
def calculate_similarity(beer1, beer2):
    """Return the Euclidean distance between two beers for each feature in ALL_FEATURES.

    Despite the name, smaller values mean the beers were rated more alike.
    """
    # find common reviewers
    beer_1_reviewers = df[df.beer_name==beer1].review_profilename.unique()
    beer_2_reviewers = df[df.beer_name==beer2].review_profilename.unique()
    common_reviewers = set(beer_1_reviewers).intersection(beer_2_reviewers)

    # get reviews
    beer_1_reviews = get_beer_reviews(beer1, common_reviewers)
    beer_2_reviews = get_beer_reviews(beer2, common_reviewers)
    dists = []
    for f in ALL_FEATURES:
        # each beer's ratings form one vector, ordered by reviewer name
        dists.append(euclidean_distances([beer_1_reviews[f]], [beer_2_reviews[f]])[0][0])
    return dists

calculate_similarity(beer_1, beer_2)

[17.4928556845359, 17.284386017443605, 16.545392107774298, 17.592612085759182]
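The cell above imports several metrics but uses only Euclidean distance. As a rough sketch of how the alternatives behave, the snippet below computes four of them with plain NumPy on two made-up rating vectors (`ratings_1` and `ratings_2` are assumptions, not values from the dataset). It also shows a root-mean-square variant, since raw Euclidean distance tends to grow with the number of shared reviewers.

```python
import numpy as np

# Hypothetical ratings from four shared reviewers for two beers.
ratings_1 = np.array([4.0, 3.5, 5.0, 2.0])
ratings_2 = np.array([4.5, 3.0, 4.0, 2.5])

diff = ratings_1 - ratings_2

# Distances: smaller means more alike.
euclidean = np.sqrt(np.sum(diff ** 2))
manhattan = np.sum(np.abs(diff))

# Euclidean distance divided by sqrt(n) is the RMS difference per reviewer.
rms = euclidean / np.sqrt(len(diff))

# Similarities: larger means more alike.
cosine = ratings_1 @ ratings_2 / (
    np.linalg.norm(ratings_1) * np.linalg.norm(ratings_2)
)
pearson = np.corrcoef(ratings_1, ratings_2)[0, 1]

print(euclidean, manhattan, rms, cosine, pearson)
```

Note that distances and similarities point in opposite directions, so swapping one for the other also means flipping the sort order when ranking beers.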

## Calculate the Similarity for a Set of Beers

In [71]:
# calculate only a subset for the demo
warnings.filterwarnings("ignore")
beers = ["Dale's Pale Ale", "Sierra Nevada Pale Ale", "Michelob Ultra",
        "Natural Light", "Bud Light", "Fat Tire Amber Ale", "Coors Light",
         "Blue Moon Belgian White", "60 Minute IPA", "Guinness Draught", "Old Rasputin Russian Imperial Stout",
         "90 Minute IPA","Sierra Nevada Celebration Ale","Two Hearted Ale","Arrogant Bastard Ale","Pliny The Elder",
         "Sierra Nevada Bigfoot Barleywine Style Ale","La Fin Du Monde","Trappistes Rochefort 10","Ayinger Celebrator Doppelbock",
         "St. Bernardus Abt 12","Imperial Stout", "Samuel Adams Boston Lager","Duvel","Dead Guy Ale","Orval Trappist Ale",
         "Weihenstephaner Hefeweissbier", "Budweiser","Samuel Smith's Oatmeal Stout","Samuel Adams Octoberfest"]

# in production, compute distances for every beer (this takes much longer)
# beers = df.beer_name.unique()

simple_distances = []
for beer1 in beers:
    print("starting", beer1)
    for beer2 in beers:
        if beer1 != beer2:
            row = [beer1, beer2] + calculate_similarity(beer1, beer2)
            simple_distances.append(row)

starting Dale's Pale Ale
starting Sierra Nevada Pale Ale
starting Michelob Ultra
starting Natural Light
starting Bud Light
starting Fat Tire Amber Ale
starting Coors Light
starting Blue Moon Belgian White
starting 60 Minute IPA


KeyboardInterrupt: 
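The nested loop computes every ordered pair, so each distance is calculated twice. Because Euclidean distance is symmetric, one possible shortcut is to visit each unordered pair once and store it in both directions. The sketch below uses a hypothetical stand-in, `toy_distance`, in place of `calculate_similarity` so that it runs without the dataset.

```python
from itertools import combinations

def toy_distance(beer_a, beer_b):
    # Hypothetical stand-in for calculate_similarity: any symmetric
    # function that returns a list of per-feature distances works here.
    return [abs(len(beer_a) - len(beer_b))]

toy_beers = ["Duvel", "Budweiser", "Bud Light"]

rows = []
for a, b in combinations(toy_beers, 2):
    d = toy_distance(a, b)
    # Store both orderings so lookups by (beer1, beer2) still work.
    rows.append([a, b] + d)
    rows.append([b, a] + d)

print(len(rows))
```

This yields the same n * (n - 1) rows as the nested loop while calling the distance function half as often.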

## Inspect the Results

In [72]:
cols = ["beer1", "beer2", "overall_dist", "aroma_dist", "palate_dist", "taste_dist"]
simple_distances = pd.DataFrame(simple_distances, columns=cols)
simple_distances.tail(28)

Unnamed: 0,beer1,beer2,overall_dist,aroma_dist,palate_dist,taste_dist
222,Blue Moon Belgian White,St. Bernardus Abt 12,26.958301,29.491524,31.164884,32.241278
223,Blue Moon Belgian White,Imperial Stout,20.161845,21.897488,21.926012,23.355941
224,Blue Moon Belgian White,Samuel Adams Boston Lager,25.573424,22.231734,23.005434,23.569047
225,Blue Moon Belgian White,Duvel,29.364094,28.06243,29.635283,32.015621
226,Blue Moon Belgian White,Dead Guy Ale,24.849547,21.943108,23.113849,25.342652
227,Blue Moon Belgian White,Orval Trappist Ale,24.984995,26.99537,24.642443,28.026773
228,Blue Moon Belgian White,Weihenstephaner Hefeweissbier,29.154759,29.711109,28.05352,31.116716
229,Blue Moon Belgian White,Budweiser,29.870554,33.275366,29.060282,30.643107
230,Blue Moon Belgian White,Samuel Smith's Oatmeal Stout,26.641134,25.258662,29.176189,29.385371
231,Blue Moon Belgian White,Samuel Adams Octoberfest,21.324868,18.967077,19.493589,20.2793


## Allow the User to Customize the Weights

In [73]:
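def calc_distance(dists, beer1, beer2, weights):
    """Return the weighted sum of the per-feature distances for one beer pair."""
    mask = (dists.beer1==beer1) & (dists.beer2==beer2)
    row = dists[mask]
    row = row[['overall_dist', 'aroma_dist', 'palate_dist', 'taste_dist']]
    # weights are applied column-wise, in the order listed above
    dist = weights * row
    # raises IndexError if the pair is not present in dists
    return dist.sum(axis=1).tolist()[0]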

weights = [2, 1, 2, 1]
#fixed from PY 2.7
print(calc_distance(simple_distances, 'Fat Tire Amber Ale', "Dale's Pale Ale", weights))
print(calc_distance(simple_distances, "Fat Tire Amber Ale", "Michelob Ultra", weights))

102.95349368782318
182.5481551979252
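The weighted distance is simply each feature distance multiplied by its weight and then summed. As a sketch of the same arithmetic applied to every pair at once, the snippet below uses a made-up `toy_dists` frame and labels the weights by column name so that their order cannot silently drift.

```python
import pandas as pd

dist_cols = ["overall_dist", "aroma_dist", "palate_dist", "taste_dist"]

# Hypothetical distances for two beer pairs.
toy_dists = pd.DataFrame(
    [["A", "B", 1.0, 2.0, 3.0, 4.0],
     ["A", "C", 2.0, 2.0, 2.0, 2.0]],
    columns=["beer1", "beer2"] + dist_cols,
)

# Index the weights by column name so alignment is explicit.
toy_weights = pd.Series([2, 1, 2, 1], index=dist_cols)

# Multiply column-wise, then sum across each row.
toy_dists["weighted"] = (
    toy_dists[dist_cols].mul(toy_weights, axis=1).sum(axis=1)
)
print(toy_dists[["beer1", "beer2", "weighted"]])
```

For the first pair this gives 2*1 + 1*2 + 2*3 + 1*4 = 14, and for the second 2*2 + 1*2 + 2*2 + 1*2 = 12.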


## Find Similar Beers 


In [74]:
my_beer = "Samuel Smith's Oatmeal Stout"
results = []
for b in beers:
    if my_beer!=b:
        results.append((my_beer, b, calc_distance(simple_distances, my_beer, b, weights)))
sorted(results, key=lambda x: x[2])[0:4]

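IndexError: list index out of range

The lookup fails because the distance loop above was interrupted before it reached Samuel Smith's Oatmeal Stout, so `simple_distances` has no rows with that beer as `beer1` and `calc_distance` indexes into an empty list. Letting the loop finish, or choosing a beer that was fully processed (for example "Blue Moon Belgian White"), avoids the error.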
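A lookup that tolerates missing pairs can make this step more robust. The sketch below is one possible approach, not the original method: `most_similar` is a hypothetical helper that filters the distance table once, returns an empty list when the beer has no rows, and ranks the rest by weighted distance. The `toy_dists` values are invented.

```python
import pandas as pd

dist_cols = ["overall_dist", "aroma_dist", "palate_dist", "taste_dist"]

# Hypothetical distance table; only beer "A" has been processed.
toy_dists = pd.DataFrame(
    [["A", "B", 1.0, 2.0, 3.0, 4.0],
     ["A", "C", 2.0, 2.0, 2.0, 2.0],
     ["A", "D", 5.0, 5.0, 5.0, 5.0]],
    columns=["beer1", "beer2"] + dist_cols,
)

def most_similar(dists, beer, weights, n=3):
    """Return up to n (beer2, weighted distance) pairs, closest first."""
    rows = dists[dists.beer1 == beer]
    if rows.empty:
        # No distances were computed for this beer.
        return []
    w = pd.Series(weights, index=dist_cols)
    score = rows[dist_cols].mul(w, axis=1).sum(axis=1)
    ranked = rows.assign(score=score).nsmallest(n, "score")
    return [(b, float(s)) for b, s in zip(ranked.beer2, ranked.score)]

print(most_similar(toy_dists, "A", [2, 1, 2, 1], n=2))
print(most_similar(toy_dists, "Z", [2, 1, 2, 1]))
```

Returning an empty list for an unknown beer lets the caller decide whether to skip it or report it, rather than failing partway through.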

## See It in Production

http://beers.yhathq.com/

## Similar Program in R

http://blog.yhat.com/posts/recommender-system-in-r.html