# Introduction

In this notebook, we study the two datasets stored in `beerAdv_beer_brewery.tsv` and `rateBeer_beer_brewery.tsv`. Both datasets have the following columns:

                    beer_id ¦ beer_name ¦ brewery_id ¦ brewery_name
            
The goal of this notebook is to build a similarity measure that identifies the same beers across both datasets. Once we have this measure, we want to find a threshold $X$ such that:
- All pairs with a similarity value $X' > X$ are considered the same beer.
- All pairs with a similarity value $X'' < X$ are discarded.
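
As a rough illustration of this thresholding rule, here is a minimal sketch. The pairs, their scores, and the helper name `split_by_threshold` are hypothetical and not taken from the datasets. The text does not say what happens when a score equals $X$ exactly, so this sketch assumes such pairs are rejected.

```python
# Hypothetical scored pairs: (name in dataset 1, name in dataset 2, similarity in [0, 1]).
pairs = [
    ("Beer A", "Beer A (bottle)", 0.92),
    ("Beer B", "Beer C", 0.41),
    ("Beer D", "Beer D", 1.00),
]

def split_by_threshold(scored_pairs, threshold):
    """Split pairs into (accepted, rejected) for a given threshold X.

    Assumption: a score exactly equal to the threshold is rejected.
    """
    accepted = [p for p in scored_pairs if p[2] > threshold]
    rejected = [p for p in scored_pairs if p[2] <= threshold]
    return accepted, rejected

accepted, rejected = split_by_threshold(pairs, threshold=0.8)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

The value 0.8 is only a placeholder; finding a good threshold is the subject of the last part of the notebook.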

To achieve this, the notebook has three parts: a simple data analysis, the creation of the candidate pairs with their similarity values, and the search for the best threshold $X$.

In [1]:
# Useful imports
import pandas as pd
import numpy as np
import codecs
from IPython.display import HTML

from fuzzywuzzy import fuzz
import multiprocessing
from joblib import Parallel, delayed
import pickle

# For the Python notebook
%matplotlib inline
%reload_ext autoreload
%autoreload 2

## Data Analysis

First, we need to analyze the data. The analysis is simple since we only have two useful features: `beer_name` and `brewery_name`.

In [59]:
# Datasets
beerAdvocate_dataset = './data/beerAdv_beer_brewery.tsv'
rateBeer_dataset = './data/rateBeer_beer_brewery.tsv'

In [60]:
columns = ['beer_id', 'beer_name', 'brewery_id', 'brewery_name']

In [254]:
# Load the dataset with pandas
beerAdvocate = pd.read_table(beerAdvocate_dataset, header=None)
beerAdvocate.columns = columns
rateBeer = pd.read_table(rateBeer_dataset, header=None, encoding='utf-8')
rateBeer.columns = columns

In [298]:
rateBeer.loc[9997]

beer_id                                   720
beer_name             Gambrinus Premium ?erné
brewery_id                                115
brewery_name    Plzensky Prazdroj (SABMiller)
Name: 9997, dtype: object

Print the two datasets

In [256]:
beerAdvocate.head()

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name
0,14348,Eisbrau Czech,1,"Plzensky Prazdroj, a. s."
1,19099,Primus,1,"Plzensky Prazdroj, a. s."
2,19123,Gambrinus Pale,1,"Plzensky Prazdroj, a. s."
3,19274,Urutislav,1,"Plzensky Prazdroj, a. s."
4,41294,Pilsner Urquell 3.5%,1,"Plzensky Prazdroj, a. s."


In [257]:
rateBeer.head()

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name
0,4,Abita Bock,1,Abita Brewing Company
1,10731,Abita Louisiana Red Ale,1,Abita Brewing Company
2,114065,Abita Select Pecan Brown Ale,1,Abita Brewing Company
3,114981,Abita Select Amber Ale,1,Abita Brewing Company
4,117017,Abita American Wheat,1,Abita Brewing Company


Let's pause here for a moment: the *beerAdvocate* file has an encoding problem.

In [258]:
name = beerAdvocate.loc[12].beer_name
print(name)

Master PolotmavÃ½ 13Â°


The name looks like UTF-8 bytes that were decoded as **latin_1** at some point. Therefore, we can recover the correct string by encoding it back to latin_1 and decoding the result as UTF-8:

In [259]:
new_name = bytes(name,'latin_1').decode('utf-8')
print(new_name)

Master Polotmavý 13°
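
To see why this repair works, the following sketch reproduces the garbling on purpose and then reverses it. It only uses the beer name shown above; the variable names are illustrative.

```python
original = "Master Polotmavý 13°"

# UTF-8 bytes wrongly decoded as latin_1: each non-ASCII character
# becomes two garbled characters.
garbled = original.encode("utf-8").decode("latin_1")

# Reversing the wrong step (encode as latin_1, decode as UTF-8) restores the text.
repaired = garbled.encode("latin_1").decode("utf-8")

print(garbled)
print(repaired)
```

The round trip only succeeds when every character of the garbled string exists in latin_1 and the resulting bytes are valid UTF-8, which is why a fallback is needed for names that were never garbled.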


Let's do it everywhere for *beerAdvocate*!

In [260]:
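def decode_from_latin_1(string):
    """Undo a wrong latin_1 decoding; return the input unchanged if that fails."""
    try:
        return bytes(string, 'latin_1').decode('utf-8')
    except (UnicodeError, TypeError):
        # Not representable in latin_1, not valid UTF-8, or not a string (e.g. NaN)
        return string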

In [261]:
# Repair the encoding of the beer and brewery names
beerAdvocate['beer_name'] = beerAdvocate['beer_name'].apply(decode_from_latin_1)
beerAdvocate['brewery_name'] = beerAdvocate['brewery_name'].apply(decode_from_latin_1)


In [262]:
beerAdvocate.loc[12].beer_name

'Master Polotmavý 13°'

Let's print the number of entries in each dataset.

In [263]:
print("Number of rows in beerAdvocate: %i"%(len(beerAdvocate)))
print("Number of rows in rateBeer: %i"%(len(rateBeer)))

Number of rows in beerAdvocate: 66056
Number of rows in rateBeer: 110359


Let's check the number of **unique** beers to see if it matches the number of rows in the dataset.

In [264]:
unique_beers_beerAdvocate = beerAdvocate.beer_name.unique()
unique_beers_rateBeer = rateBeer.beer_name.unique()

print("Number of unique beers in beerAdvocate: %i"%(len(unique_beers_beerAdvocate)))
print("Number of unique beers in rateBeer: %i"%(len(unique_beers_rateBeer)))

Number of unique beers in beerAdvocate: 56855
Number of unique beers in rateBeer: 110302


It's already interesting to see that, within each dataset, some beer names appear more than once. Let's look at these duplicated names and check whether the entries also share the same `brewery_name`.

In [265]:
duplicated_beers_beerAdvocate = beerAdvocate[beerAdvocate.beer_name.duplicated()].beer_name.unique()
duplicated_beers_rateBeer = rateBeer[rateBeer.beer_name.duplicated()].beer_name.unique()

print("Number of duplicated beer names in beerAdvocate: %i"%(len(duplicated_beers_beerAdvocate)))
print("Number of duplicated beer names in rateBeer: %i"%(len(duplicated_beers_rateBeer)))

Number of duplicated beer names in beerAdvocate: 2707
Number of duplicated beer names in rateBeer: 55


In [266]:
dup_beer_dup_brewery = []

# Create list of tuples. 
#   First entry is the name of the duplicated beer
#   Second entry is the list of Brewery that is duplicated
#   Third entry is the list of indices to remove them easily.
for dup_beer in duplicated_beers_beerAdvocate:
    subdf = beerAdvocate[beerAdvocate.beer_name == dup_beer]
    if any(subdf.brewery_name.duplicated()):
        dup_beer_dup_brewery.append((dup_beer, list(subdf[subdf.brewery_name.duplicated()]["brewery_name"]), list(subdf[subdf.brewery_name.duplicated()].index)))

In [267]:
print("Number of beers with duplicated brewery in beerAdvocate: %i"%(len(dup_beer_dup_brewery)))

Number of beers with duplicated brewery in beerAdvocate: 286


Let's check how the function `duplicated` works. It does not flag the first occurrence, so if a brewery is listed twice in the tuple, the same beer and brewery pair appears three times in the dataset.
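
A small toy example makes this behaviour concrete. The series below is made up for illustration; `keep='first'` is the default of `Series.duplicated`, and `keep=False` is shown only for comparison.

```python
import pandas as pd

# Toy column: the same brewery name appears three times.
names = pd.Series(["Brewery X", "Brewery Y", "Brewery X", "Brewery X"])

# Default keep='first': the first occurrence is not flagged.
print(names.duplicated().tolist())            # [False, False, True, True]

# keep=False flags every member of a duplicated group.
print(names.duplicated(keep=False).tolist())  # [True, False, True, True]
```

With the default setting, dropping the flagged rows keeps the first row of each duplicated group.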

In [268]:
dup_beer_dup_brewery[3]

('Saison',
 ['Triumph Brewing Company', 'Triumph Brewing Company'],
 [22563, 43875])

In [269]:
beerAdvocate[(beerAdvocate.beer_name == "Saison") & (beerAdvocate.brewery_name == "Triumph Brewing Company")]

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name
17881,49387,Saison,1317,Triumph Brewing Company
22563,37383,Saison,15341,Triumph Brewing Company
43875,24385,Saison,4832,Triumph Brewing Company


At this point, we might assume that the three entries are the same beer. However, the `brewery_id` differs, and checking the breweries on the BeerAdvocate website shows that they are three different breweries. Therefore, we cannot say that these three beers are the same. To get a better similarity measure, we need a bit more information about the breweries and the beers. Here are the features we will scrape from the websites:
- For the brewery:
    - The address
- For the beer:
    - The ABV (Alcohol by Volume)
    - The Style

At this point, we will neither scrape the websites nor use other data; this will be done later in the project. Therefore, we simply remove the rows flagged as duplicates.

In [273]:
indices_to_remove = []
for i in range(len(dup_beer_dup_brewery)):
    indices_to_remove.extend(dup_beer_dup_brewery[i][2])

In [274]:
# Remove the indices in the dataset
beerAdvocate = beerAdvocate.drop(indices_to_remove, axis=0)
beerAdvocate.index = range(len(beerAdvocate))
beerAdvocate.tail()

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name
65668,76042,Foster's Gold,575,Foster's Group Limited
65669,917,Foster's Special Bitter,575,Foster's Group Limited
65670,918,Foster's Lager,575,Foster's Group Limited
65671,52642,Hefe Weissbier,5751,Bayerische Löwenbrauerei
65672,58058,Hefeweissbier Dunkel,5751,Bayerische Löwenbrauerei


Now, we need to do the same for the rateBeer dataset.

In [275]:
dup_beer_dup_brewery = []

# Create list of tuples. 
#   First entry is the name of the duplicated beer
#   Second entry is the list of Brewery that is duplicated
#   Third entry is the list of indices to remove them easily.
for dup_beer in duplicated_beers_rateBeer:
    subdf = rateBeer[rateBeer.beer_name == dup_beer]
    if any(subdf.brewery_name.duplicated()):
        dup_beer_dup_brewery.append((dup_beer, list(subdf[subdf.brewery_name.duplicated()]["brewery_name"]), list(subdf[subdf.brewery_name.duplicated()].index)))

In [276]:
print("Number of beers with duplicated brewery in rateBeer: %i"%(len(dup_beer_dup_brewery)))

Number of beers with duplicated brewery in rateBeer: 4


In [277]:
dup_beer_dup_brewery

[('Traugott Simon Export', ['Udo Täubrich Betreuungs'], [9819]),
 ('Big Horn Saison',
  ['Big Horn Brewing Company (Ram International)'],
  [19577]),
 ('Yukon Lead Dog Ale', ['Yukon Brewing Company'], [37365]),
 ('Prison Brews Winter Ale', ['Prison Brews'], [77632])]

In [278]:
indices_to_remove = []
for i in range(len(dup_beer_dup_brewery)):
    indices_to_remove.extend(dup_beer_dup_brewery[i][2])

In [279]:
# Remove the indices in the dataset
rateBeer = rateBeer.drop(indices_to_remove, axis=0)
rateBeer.index = range(len(rateBeer))
rateBeer.tail()

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name
110350,77397,Jelling Bryghus Frode Fredegod,8804,Jelling Bryghus
110351,77398,Jelling Bryghus Poppo,8804,Jelling Bryghus
110352,77399,Jelling Bryghus Jelling Jól,8804,Jelling Bryghus
110353,89887,Jelling Bryghus Jalunki,8804,Jelling Bryghus
110354,97191,Jelling Bryghus Sildeglimt,8804,Jelling Bryghus


### Save the cleaned data set

Now that the datasets are cleaned, we can save them in a CSV format.

In [280]:
beerAdvocate.to_csv('./data/beerAdvocate_cleaned.csv', index=False, encoding='utf-8')
rateBeer.to_csv('./data/rateBeer_cleaned.csv', index=False, encoding='utf-8')

# Similarity

**You can start directly from here. There is no need to redo the cleaning of the datasets!**

Now, we want to match beers between the two datasets. Let's load them first. Conceptually, comparing every pair gives a similarity matrix of size $N\times M$, $N$ being the size of the *RateBeer* dataset and $M$ the size of the *BeerAdvocate* dataset. Since $M < N$, we want to end up with a table of size $M\times 5$, with one row per *BeerAdvocate* beer. (Five columns because we aggregate the text columns and the id columns of each dataset, plus the similarity value.)
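
To give a sense of scale, the sketch below counts the comparisons a full matrix would need. The row counts are read off the `tail()` outputs of the cleaning section (last index plus one), so they are an assumption about the cleaned files rather than a value checked here.

```python
# Row counts after cleaning, read off the tail() outputs (last index + 1).
M = 65672 + 1    # BeerAdvocate
N = 110354 + 1   # RateBeer

full_pairs = N * M   # every RateBeer beer against every BeerAdvocate beer
kept_rows = M        # one row per BeerAdvocate beer in the reduced result

print(f"{full_pairs:,} pairwise comparisons for the full matrix")
print(f"{kept_rows:,} rows in the reduced result")
```

Several billion string comparisons would be expensive, which motivates narrowing the candidates before scoring the beer names.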

In [2]:
beerAdvocate = pd.read_csv('./data/beerAdvocate_cleaned.csv', dtype=str)
rateBeer = pd.read_csv('./data/rateBeer_cleaned.csv', dtype=str)

In [3]:
beerAdvocate.head()

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name
0,14348,Eisbrau Czech,1,"Plzensky Prazdroj, a. s."
1,19099,Primus,1,"Plzensky Prazdroj, a. s."
2,19123,Gambrinus Pale,1,"Plzensky Prazdroj, a. s."
3,19274,Urutislav,1,"Plzensky Prazdroj, a. s."
4,41294,Pilsner Urquell 3.5%,1,"Plzensky Prazdroj, a. s."


In [4]:
rateBeer.head()

Unnamed: 0,beer_id,beer_name,brewery_id,brewery_name
0,4,Abita Bock,1,Abita Brewing Company
1,10731,Abita Louisiana Red Ale,1,Abita Brewing Company
2,114065,Abita Select Pecan Brown Ale,1,Abita Brewing Company
3,114981,Abita Select Amber Ale,1,Abita Brewing Company
4,117017,Abita American Wheat,1,Abita Brewing Company


In [5]:
unique_rateBeer_breweries = rateBeer.brewery_name.unique()

We can now try to see the similarity between two strings.
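
Before looking at the actual functions, here is a simplified standard-library sketch of a token-set comparison. It is not fuzzywuzzy's implementation: it swaps in `difflib.SequenceMatcher` and skips fuzzywuzzy's string preprocessing, so the scores will not match `fuzz.token_set_ratio` exactly. The helper names and the toy beer and brewery names are hypothetical.

```python
from difflib import SequenceMatcher

def tokens(text):
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().split())

def simple_token_set_ratio(a, b):
    """Approximate token-set similarity in [0, 1] (simplified sketch)."""
    ta, tb = tokens(a), tokens(b)
    common = " ".join(sorted(ta & tb))
    combined_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    combined_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    candidates = [(common, combined_a), (common, combined_b), (combined_a, combined_b)]
    # Keep the most favourable of the three comparisons.
    return max(SequenceMatcher(None, x, y).ratio() for x, y in candidates)

def pair_score(beer_a, brewery_a, beer_b, brewery_b):
    """Combine beer and brewery similarity by multiplying them."""
    return simple_token_set_ratio(beer_a, beer_b) * simple_token_set_ratio(brewery_a, brewery_b)

# Word order does not matter.
print(simple_token_set_ratio("Pale Ale UFO", "UFO Pale Ale"))                    # 1.0
# A name whose words are all contained in the other also scores 1.0.
print(pair_score("Acme Pale Ale", "Acme Brewing", "Pale Ale", "Acme Brewing Company"))  # 1.0
print(pair_score("Acme Stout", "Acme Brewing", "Acme Pale Ale", "Acme Brewing"))
```

The second call shows a weakness of this kind of measure: a shorter name that is a subset of a longer one gets a perfect score, so ties between candidates need extra handling.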

In [6]:
def similarity_lines(line_1, line_2): 
    beer_val = fuzz.token_set_ratio(line_1.beer_name, line_2.beer_name)/100
    brewery_val = fuzz.token_set_ratio(line_1.brewery_name, line_2.brewery_name)/100
    return (beer_val*brewery_val)

def similarity(test):
    brewery = test.brewery_name.replace("'", "")
    beer = test.beer_name.replace("'", "")
    
    brewery_sim = np.zeros(len(unique_rateBeer_breweries))
    for i in range(len(unique_rateBeer_breweries)):
        brewery_sim[i] = fuzz.token_set_ratio(brewery, unique_rateBeer_breweries[i])/100.0
        
    idx_best_match_brewery = np.argmax(brewery_sim)
    best_match_brewery = np.max(brewery_sim)
    
    beer_names = list(rateBeer[rateBeer.brewery_name == unique_rateBeer_breweries[idx_best_match_brewery]].beer_name)
        
    N = len(beer_names)

    test_beers = np.zeros(N)
    for i in range(N):
        test_beers[i] = fuzz.token_set_ratio(beer, beer_names[i])/100.0
        
    best_idx = np.argsort(test_beers)[::-1]
    
    nbr_match = 5
    if N < 5:
        nbr_match = N
    
    matches = []
    for i in range(nbr_match):
        matches.append((beer_names[best_idx[i]], 
                        unique_rateBeer_breweries[idx_best_match_brewery], 
                        test_beers[best_idx[i]]*best_match_brewery))
        
    ## Double check 
    # If the elements have the same best values, but the string is longer,
    # we have to put it first. (Problem with this algo)
    
    best_sim = matches[0][2]
    idx = [0]
    diff_length_beer_name = [abs(len(matches[0][0])-len(beer))]
    for i in range(1,nbr_match):
        if matches[i][2] == best_sim:
            idx.append(i)
            diff_length_beer_name.append(abs(len(matches[i][0])-len(beer)))
            
    best_idx = np.argsort(diff_length_beer_name)
    idx = np.asarray(idx)
    
    matches = np.asarray(matches)
    
    matches[idx] = matches[best_idx]
            
    return matches
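
The "double check" step at the end of `similarity` can be sketched in isolation as follows. This is a simplified restatement under assumptions, not the function above: it works on plain `(name, score)` tuples, and the helper name and toy beer names are hypothetical.

```python
def break_ties_by_length(query, matches):
    """Reorder the top-scoring matches by closeness in length to the query.

    matches: list of (name, score) tuples sorted by score, highest first.
    Only the candidates sharing the best score are reordered.
    """
    best_score = matches[0][1]
    tied = [m for m in matches if m[1] == best_score]
    others = [m for m in matches if m[1] != best_score]
    # Smallest absolute length difference first (the sort is stable).
    tied.sort(key=lambda m: abs(len(m[0]) - len(query)))
    return tied + others

toy_matches = [
    ("Foo Pale Ale Special Reserve", 1.0),
    ("Foo Pale Ale", 1.0),
    ("Foo Stout", 0.5),
]
ranked = break_ties_by_length("Foo Pale Ale", toy_matches)
print(ranked[0][0])
```

Keeping names and scores as tuples also avoids the conversion of the scores to strings that happens when mixed data is turned into a NumPy array.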

In [7]:
nbr_jobs = multiprocessing.cpu_count()
nbr_jobs

8

In [12]:
%%time
M = len(beerAdvocate)
N = len(rateBeer)

sim = Parallel(n_jobs=nbr_jobs)(delayed(similarity)(beerAdvocate.iloc[idx]) for idx in range(M))

CPU times: user 48.8 s, sys: 1.6 s, total: 50.4 s
Wall time: 43min 48s


In [13]:
with open('./data/similarity.pickle', "wb") as f:
    pickle.dump(sim, f)

In [14]:
with open('./data/similarity.pickle', "rb") as f:
    sim = pickle.load(f)

In [15]:
sim[340:400]

[array([['Harpoon Barleywine (Bourbon Barrel Aged)', 'Harpoon Brewery',
         '0.79'],
        ['Harpoon 100 Barrel Series #04 - Barleywine', 'Harpoon Brewery',
         '0.67'],
        ['Harpoon Barrel Aged Triticus ', 'Harpoon Brewery', '0.61'],
        ['Harpoon Leviathan Barleywine Style Ale', 'Harpoon Brewery', '0.58'],
        ['Harpoon Barrel Aged Munich Dark', 'Harpoon Brewery', '0.55']], 
       dtype='<U42'),
 array([['Harpoon UFO Pale Ale', 'Harpoon Brewery', '1.0'],
        ['Harpoon Sour Belgian Pale Ale', 'Harpoon Brewery', '0.67'],
        ['Harpoon KPA (Kriek Pale Ale)', 'Harpoon Brewery', '0.67'],
        ['Harpoon Belgian Pale Ale', 'Harpoon Brewery', '0.67'],
        ['Harpoon UFO Pumpkin', 'Harpoon Brewery', '0.55']], 
       dtype='<U29'),
 array([['Harpoon 100 Barrel Series #29 - Ginger Wheat', 'Harpoon Brewery',
         '1.0'],
        ['Harpoon 100 Barrel Series #26 - Catamount Maple Wheat',
         'Harpoon Brewery', '0.86'],
        ['Harpoon 100 Barrel 