# Pivo Recommender

Looking for a new way to compare and recommend beer.  

This notebook scrapes beer reviews from [BeerAdvocate](https://www.beeradvocate.com/) then performs natural language processing on these reviews.  Once a profile of a beer has been created, a similar, semi-similar, or completely different beer can be recommended.

To run the Scrapy scraper (and generate all the data required for this analysis), follow these steps:
 1. move into the beerAdvocateScraper directory (`cd beerAdvocateScraper`)
 2. execute the scraper (`scrapy crawl reviewScraper`)

In [None]:
# each `!` command runs in its own subshell, so chain both steps on one line
!cd beerAdvocateScraper && scrapy crawl reviewScraper

Import and load the necessary libraries.

In [1]:
import itertools as it
import os
import pandas as pd
import spacy

In [2]:
nlp = spacy.load('en')

## Phrase Modeling

Phrase modeling is an approach to learning combinations of tokens that together represent meaningful multi-word concepts.  These phrase models are developed by looping over the words in the corpus and finding words that appear together more often than they would by random chance.  The formula used to determine whether two tokens $A$ and $B$ constitute a phrase is:

$$
\frac {count(A B) - count_{min}}
{count(A) * count(B)}
*N > threshold
$$

where:
* $count(A)$ is the number of times $A$ appears in the corpus
* $count(B)$ is the number of times $B$ appears in the corpus
* $count(AB)$ is the number of times $A$ is immediately followed by $B$ in the corpus
* $N$ is the total size of the corpus vocabulary
* $count_{min}$ is a user-defined parameter to ensure that the phrase appears a minimum number of times
* $threshold$ is a user-defined parameter to control how strong the relationship must be before the two tokens are considered a single concept
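As a small worked example, the scoring formula above can be written as a plain Python function. The counts below are made up purely for illustration, and the `min_count` and `threshold` values are arbitrary choices rather than values taken from this corpus.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate two-token phrase with the formula above."""
    return (count_ab - min_count) / (count_a * count_b) * vocab_size

# Hypothetical counts, chosen only to illustrate the arithmetic
score = phrase_score(count_a=120, count_b=80, count_ab=60,
                     vocab_size=9000, min_count=5)
threshold = 10.0

# The pair is treated as a phrase only if the score exceeds the threshold
is_phrase = score > threshold
print(f'score = {score:.2f}, phrase = {is_phrase}')
```

Raising `min_count` or `threshold` makes the model more conservative, so fewer token pairs are merged.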

Once we have trained the phrase model we can apply it to the reviews in our corpus.  It joins each detected multi-word phrase into a single token.

The gensim library will help us with phrase modeling, specifically the Phrases class.
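To make the idea of applying a phrase model concrete, here is a minimal pure-Python sketch that joins adjacent tokens when the pair is in a set of learned phrases. It is a simplified stand-in and not gensim's implementation; `merge_phrases` and the `learned` set are hypothetical names introduced for illustration.

```python
def merge_phrases(tokens, phrases, delimiter='_'):
    """Greedily join adjacent token pairs that appear in `phrases`."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            # treat the two tokens as one concept and skip past both
            merged.append(delimiter.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical phrases a trained model might have learned
learned = {('barrel', 'aged'), ('at', 'least')}
tokens = 'favorite barrel aged coffee stout'.split()
merged_tokens = merge_phrases(tokens, learned)
print(merged_tokens)
```

Running a second pass over the merged output, with phrases learned from already-merged text, is what allows three-word phrases to form.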

In [8]:
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.models.word2vec import LineSentence

In [None]:
%mv beerAdvocateScraper/BeerAdvocateReviews.csv .

In [None]:
# creating slimReviews.csv for testing purposes

!head -20 BeerAdvocateReviews.csv >> slimReviews.csv
!tail -20 BeerAdvocateReviews.csv >> slimReviews.csv

In [None]:
%mkdir ./intermediate
%mv BeerAdvocateReviews.csv ./intermediate/
%mv slimReviews.csv ./intermediate/

In [3]:
import re
# raw string so that \d is read as a regex digit class, not a string escape
pattern = re.compile(r'^[\d,"]+')

def punct_space(token):
    """
    helper function to eliminate tokens
    that are pure punctuation or whitespace
    """
    
    return token.is_punct or token.is_space

def line_review(filename):
    """
    generator function to read in reviews from the file
    and un-escape the original line breaks in the text
    """
    count = 0
    with open(filename, encoding='utf_8') as f:
        for review in f:
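            try:
                text = re.split(pattern, review)[1].replace('\\n', '\n')
            except IndexError:
                # line does not start with the numeric ID prefix; skip it
                text = None
            # yield outside the try block so that closing the generator
            # early is not swallowed by the exception handler
            if text is not None:
                yield text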
            if count % 1000 == 0:
                print(f'on review {count}')
            count += 1
            
def lemmatized_sentence_corpus(filename):
    """
    generator function to use spaCy to parse reviews,
    lemmatize the text, and yield sentences
    """
    
    for parsed_review in nlp.pipe(line_review(filename),
                                  batch_size=1000, n_threads=4):
        
        for sent in parsed_review.sents:
            yield u' '.join([token.lemma_ for token in sent
                             if not punct_space(token)])

In [66]:
intermediate_directory = os.path.join('.', 'intermediate')
review_txt_filepath = os.path.join(intermediate_directory,'BeerAdvocateReviews.csv')
# review_txt_filepath = os.path.join(intermediate_directory,'slimReviews.csv')

%mkdir './intermediate/ngram_all'
ngram_all = os.path.join(intermediate_directory, 'ngram_all')
unigram_sentences_filepath = os.path.join(ngram_all,
                                          'unigram_sentences_all.txt')

mkdir: ./intermediate/ngram_all: File exists


We will use the `lemmatized_sentence_corpus` generator to loop over the original review text, segmenting the reviews into individual sentences and normalizing the text.  We will write this data back out to a new file (`unigram_sentences_all.txt`), with one normalized sentence per line.  We will use this data for learning our phrase models.
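Before the text reaches spaCy, each line of the CSV has its leading ID stripped and its escaped line breaks restored. The sketch below shows that step on a single line. The sample line is invented and only assumes the layout the regexes in this notebook imply (a numeric beer ID, a comma, then quoted review text); the second pattern is the ID-capturing variant used further down in `get_beerID`.

```python
import re

# Hypothetical line: numeric beer ID, comma, quoted review text with an
# escaped line break (a literal backslash followed by "n")
sample = '78820,"Pours black.\\nSmells of coffee."'

id_prefix = re.compile(r'^[\d,"]+')    # strips the ID prefix (as in line_review)
id_capture = re.compile(r'(^[\d"]+)')  # captures the ID (as in get_beerID)

# Everything after the prefix is the review; restore real line breaks
text = re.split(id_prefix, sample)[1].replace('\\n', '\n')

# With a capturing group, re.split keeps the matched ID at index 1
beer_id = re.split(id_capture, sample)[1]

print(beer_id)
print(text)
```

A line that does not start with digits produces a one-element list from `re.split`, so indexing `[1]` raises `IndexError`; that is the case the generators skip.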

In [67]:
with open(unigram_sentences_filepath, 'w', encoding='utf_8') as f:
    for sentence in lemmatized_sentence_corpus(review_txt_filepath):
#         print(sentence)
        f.write(sentence + '\n')

on review 0
on review 1000


Our data in the `unigram_sentences_all.txt` file is now organized as a large text file with one sentence per line.  This format allows us to use gensim's LineSentence class, a convenient iterator for working with gensim's other components.  It *streams* the documents/sentences from disk, so you never have to hold the entire corpus in RAM at once.  This allows you to scale the modeling to very large corpora.
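The streaming idea can be illustrated with a few lines of plain Python. This is a simplified stand-in written for this explanation, not gensim's `LineSentence`; the class name `SentenceStream` and the throwaway corpus are assumptions made for the sketch.

```python
import os
import tempfile

class SentenceStream:
    """Iterate over a one-sentence-per-line file, one line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Re-open the file on every pass so the stream can be reused
        with open(self.path, encoding='utf_8') as f:
            for line in f:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens

# Write a tiny throwaway corpus so the sketch is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf_8') as tmp:
    tmp.write('the aroma be intense\n\nsilky smooth on the palate\n')
    tmp_path = tmp.name

stream = SentenceStream(tmp_path)
first_pass = list(stream)
second_pass = list(stream)  # a second pass works because __iter__ re-opens
os.remove(tmp_path)
print(first_pass)
```

Using a class with `__iter__` rather than a bare generator matters here: training code typically needs to pass over the corpus more than once, and a generator would be exhausted after the first pass.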

In [68]:
unigram_sentences = LineSentence(unigram_sentences_filepath)

In [69]:
for unigram_sentence in it.islice(unigram_sentences, 5,7):
    print(u' '.join(unigram_sentence))
    print(u' ')

-PRON- pour be as thick as paste slick and black like crude oil and without hardly any head to -PRON- near still in appearance shiny almost gelatinous look
 
the aroma and flavor be intense but unfortunately to -PRON- familiar i want this to be strikingly different at least from a quality standpoint and as good as -PRON- all be -PRON- not head and shoulder above all else -PRON- may not even be -PRON- favorite barrel aged coffee stout
 


In [70]:
bigram_model_file = os.path.join(ngram_all, 'bigram_model')

In [71]:
if 1==1:
    phrases = Phrases(unigram_sentences)
    bigram_model = Phraser(phrases)
    bigram_model.save(bigram_model_file)
    
# the object saved above is a Phraser, so load it with the matching class
bigram_model = Phraser.load(bigram_model_file)

In [72]:
bigram_sentences_filepath = os.path.join(ngram_all,'bigram_sentences_all.txt')

In [73]:
with open(bigram_sentences_filepath, 'w', encoding='utf_8') as f:

    for unigram_sentence in unigram_sentences:

        bigram_sentence = u' '.join(bigram_model[unigram_sentence])

        f.write(bigram_sentence + '\n')

In [74]:
bigram_sentences = LineSentence(bigram_sentences_filepath)

In [75]:
for bigram_sentence in it.islice(bigram_sentences, 5, 7):
    print(u' '.join(bigram_sentence))
    print(u'')

-PRON- pour be as thick as paste slick and black like crude oil and without hardly_any head to -PRON- near still in appearance shiny almost gelatinous look

the aroma and flavor be intense but unfortunately to -PRON- familiar i want this to be strikingly different at_least from a quality standpoint and as good as -PRON- all be -PRON- not head and shoulder_above all else -PRON- may not even be -PRON- favorite barrel_aged coffee stout



In [76]:
trigram_model_filepath = os.path.join(ngram_all, 'trigram_model')

In [77]:
if 1 == 1:

    phrases_trigram = Phrases(bigram_sentences)
    trigram_model = Phraser(phrases_trigram)
    trigram_model.save(trigram_model_filepath)
    
# load the finished model from disk (it was saved as a Phraser)
trigram_model = Phraser.load(trigram_model_filepath)

In [79]:
trigram_sentences_filepath = os.path.join(ngram_all, 'trigram_sentences_all.txt')

In [80]:
with open(trigram_sentences_filepath, 'w', encoding='utf_8') as f:

    for bigram_sentence in bigram_sentences:

        trigram_sentence = u' '.join(trigram_model[bigram_sentence])

        f.write(trigram_sentence + '\n')

In [81]:
trigram_sentences = LineSentence(trigram_sentences_filepath)

In [82]:
for trigram_sentence in it.islice(trigram_sentences, 3, 5):
    print(u' '.join(trigram_sentence))
    print(u'')

i_remember when i be heady and westy_12 before that time have change

after a couple of narrowly miss opportunity_to drink this one over the year -PRON- man drlovemd87 come_through in the clutch with this rare of the rare bottle



In [83]:
trigram_reviews_filepath = os.path.join(intermediate_directory,
                                        'trigram_transformed_reviews_all.txt')

In [84]:
from spacy.lang.en.stop_words import STOP_WORDS

In [85]:

with open(trigram_reviews_filepath, 'w', encoding='utf_8') as f:

    for parsed_review in nlp.pipe(line_review(review_txt_filepath),
                                  batch_size=10000, n_threads=4):

        # lemmatize the text, removing punctuation and whitespace
        unigram_review = [token.lemma_ for token in parsed_review
                          if not punct_space(token)]

        # apply the first-order and second-order phrase models
        bigram_review = bigram_model[unigram_review]
        trigram_review = trigram_model[bigram_review]

        # remove any remaining stopwords
        trigram_review = [term for term in trigram_review
                          if term not in STOP_WORDS]
        
        # remove pronouns
        trigram_review = [term for term in trigram_review
                          if term !='-PRON-']
        

        # write the transformed review as a line in the new file
        trigram_review = u' '.join(trigram_review)
        f.write(trigram_review + '\n')

on review 0
on review 1000


In [86]:
print(u'Original:' + u'\n')

for review in it.islice(line_review(review_txt_filepath), 4, 5):
    print(review)

print(u'----' + u'\n')
print(u'Transformed:' + u'\n')

with open(trigram_reviews_filepath, encoding='utf_8') as f:
    for review in it.islice(f, 4, 5):
        print(review)

Original:

on review 0
Served alongside Mornin' Delight and Assassin. The beer is near black with a filmy beige head. All hype and bias aside, this aroma is silly good. Oak, roast, coffee, vanilla. World class from the first sip. The definition of grace and balance. Silky smooth on the palate. Absolutely amazing."

----

Transformed:

serve_alongside mornin_delight assassin beer near_black filmy beige_head hype bias aside aroma silly good oak roast coffee vanilla world_class first_sip definition grace balance silky_smooth palate absolutely_amazing




In [28]:
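def get_beerID(filename):
    """
    generator function to read in reviews from the file
    and yield the beer ID found at the start of each line
    """
    with open(filename, encoding='utf_8') as f:
        for review in f:
            try:
                beer_id = re.split(r'(^[\d"]+)', review)[1]
            except IndexError:
                # line does not start with an ID; skip it
                continue
            # yield outside the try block so that closing the generator
            # early is not swallowed by the exception handler
            yield beer_id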

In [29]:
from collections import Counter

In [88]:
# for testing, use next cell to run full dict

beerDict = {}

for i in range(2):
    with open(trigram_reviews_filepath, encoding='utf_8') as f:
        for review in it.islice(f, i, i+1):
            beerID = list(it.islice(get_beerID(review_txt_filepath), i, i+1))[0]
            wordDict = Counter()
            for word in review.split():
                wordDict[word] = wordDict.get(word,0) + 1
#             print(wordDict)
#             print('---------------')
            beerDict[beerID] = beerDict.get(beerID, Counter()) + wordDict
print(beerDict)

{'78820': Counter({'beer': 168, 'coffee': 160, 'bourbon': 140, 'maple_syrup': 108, 'thick': 96, 'good': 88, 'like': 84, 'maple': 76, 'vanilla': 72, 'taste': 64, 'smell': 64, 'pour': 60, 'flavor': 60, 'oak': 56, 'sweet': 56, 'chocolate': 48, 'hype': 48, 'bottle': 44, 'black': 44, 'finish': 44, 'drink': 40, 'head': 40, 'big': 40, 'dark': 36, 'linger': 36, 'nose': 36, 'dark_chocolate': 36, 'sweetness': 36, 'morning_delight': 36, 'aroma': 32, 'whiskey': 32, 'perfect': 32, 'mouthfeel': 32, 'stout': 28, 'carbonation': 28, 'glass': 28, 'topple_goliath': 28, 'lot_of': 28, 'caramel': 28, 'rich': 24, 'full_bodied': 24, 'mornin_delight': 24, 'palate': 24, 'viscous': 24, 'molass': 24, 'complex': 24, 'nice': 24, 'milk_chocolate': 24, 'creamy': 24, 'do_not': 20, 'live_up_to': 20, 'roast': 20, 'lace': 20, 'smooth': 20, 'close': 20, 'sticky': 20, 'note': 20, 'release': 20, 'touch': 20, 'hint_of': 20, 'world': 16, 'rare': 16, 'look': 16, 'barrel': 16, 'great': 16, 'ton_of': 16, 'pour_into': 16, 'snifte

In [89]:
beerDict = {}

# walk the review file and the transformed reviews in step, one pass over each,
# pairing every beer ID with the transformed text of the same review
with open(trigram_reviews_filepath, encoding='utf_8') as f:
    for beerID, review in zip(get_beerID(review_txt_filepath), f):
        beerDict.setdefault(beerID, Counter()).update(review.split())
# print(beerDict)

In [32]:
# remove any items in beerDict with fewer than n occurrences

In [90]:
beerDict.keys()

dict_keys(['78820', '16814', '87246', '42349', '110635', '87846', '21690', '172669', '57747', '76421', '146770', '47658', '86237', '7971', '64545', '207976', '122114', '5281', '162502', '115317', '78660', '259249', '123286', '58299', '77563', '41815', '113674', '51116', '3659', '120372', '19960', '10672', '188570', '1545', '69522', '64228', '104649'])

In [91]:
wordThreshold = 5
beerDictFiltered = {}
for key, value in beerDict.items():
    beerDictFiltered[key] = {k:v for k, v in value.items() if v > wordThreshold} 

In [92]:
# beerDictFiltered

## Word2Vec

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense numerical representations of each term in a corpus vocabulary.  If the model is successful, the vectors it learns should encode some information about the *meaning* or *concept* the term represents, and the relationship between it and other terms in the vocabulary.  Word vector models are fully unsupervised: they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.
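To give a feel for what "encoding relationships" means in practice, the sketch below compares a few made-up word vectors with cosine similarity. The three-dimensional vectors are invented for illustration (a trained model would learn vectors with many more dimensions), and `most_similar` here is a hypothetical helper, not a gensim call.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: related terms point in similar directions
toy_vectors = {
    'coffee':   [0.9, 0.1, 0.0],
    'espresso': [0.8, 0.2, 0.1],
    'citrus':   [0.0, 0.2, 0.9],
}

def most_similar(term, vectors):
    """Return the other term whose vector is closest in direction."""
    others = (t for t in vectors if t != term)
    return max(others,
               key=lambda t: cosine_similarity(vectors[term], vectors[t]))

nearest = most_similar('coffee', toy_vectors)
print(nearest)
```

In a well-trained model, terms used in similar contexts end up with vectors that score highly under this kind of comparison.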

In [34]:
from gensim.models import Word2Vec

# run with or without stop words removed and word lemmatized
trigram_sentences = LineSentence(trigram_reviews_filepath)
# trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')

In [93]:
if 1==1:
    
    # initialize the model; passing the corpus to the constructor also runs
    # the first round of training (gensim's default number of epochs, not one)
    beer2vec = Word2Vec(trigram_sentences, size=100, window=5,
                       min_count=1, sg=1) #workers=?
    beer2vec.save(word2vec_filepath)
    
    # perform four more train() calls, each running two epochs over the corpus
    for i in range(1, 5):
        beer2vec.train(trigram_sentences, total_examples=beer2vec.corpus_count, epochs=2)
        beer2vec.save(word2vec_filepath)
        
        
#load the finished model from disk
beer2vec = Word2Vec.load(word2vec_filepath)
beer2vec.init_sims()

# note: train_count is the number of train() calls, not the number of epochs
print(f'Trained model for {beer2vec.train_count} epochs.')

Trained model for 5 epochs.


In [94]:

# build a list of the terms, integer indices,
# and term counts from the beer2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in beer2vec.wv.vocab.items()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda tup: -1*tup[-1])

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the beer2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(beer2vec.wv.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors.head(10)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-0.018351,0.113972,-0.192778,-0.054929,0.005518,-0.083566,-0.065739,0.103216,0.076961,-0.182536,...,-0.215077,-0.03273,0.009318,0.131968,0.033135,-0.120947,-0.013589,-0.146412,-0.079318,-0.02039
-PRON-,-0.211005,0.028624,-0.059772,-0.139637,0.028131,-0.091531,0.009782,0.130188,0.05234,-0.034864,...,-0.259531,-0.16693,0.208226,0.034254,0.12462,-0.135096,-0.046581,-0.111316,-0.076569,-0.204735
a,-0.050716,0.293615,0.011403,-0.053512,-0.113627,0.047881,-0.168817,0.115116,0.030559,-0.063464,...,-0.162549,-0.077588,0.164781,-0.035664,0.143922,-0.018855,0.168128,-0.050203,0.067544,-0.12326
be,-0.260863,0.209165,-0.017781,-0.189163,-0.067562,-0.093673,0.061458,0.144727,0.025251,0.103492,...,-0.0637,-0.094654,0.160783,-0.077925,0.027673,-0.184514,-0.043152,-0.143526,0.052082,-0.160826
and,-0.084709,0.081097,-0.078662,-0.101901,-0.050747,-0.124017,0.122031,0.237389,0.094549,-0.148606,...,-0.090887,-0.010418,-0.011989,0.019586,0.202351,-0.042855,0.100566,-0.056468,0.114493,-0.072946
of,-0.052598,0.178534,0.018845,-0.235192,-0.003208,-0.03176,0.01756,0.19844,0.065128,-0.114216,...,0.00106,0.025789,-0.03383,0.095644,0.027014,0.073375,0.151863,-0.027904,0.049595,0.016763
with,-0.080563,0.264559,-0.003863,-0.095007,-0.011471,-0.058137,0.013501,0.225476,0.025236,-0.043376,...,-0.181996,-0.01839,0.006752,0.086323,0.190026,-0.019431,0.036792,0.045002,-0.062925,-0.009047
this,-0.105641,-0.066573,-0.139427,0.005025,0.083186,-0.174347,0.028543,0.107517,0.037499,0.055169,...,-0.137252,-0.21063,0.131423,0.051925,0.079088,-0.255634,-0.058475,0.035154,-0.001208,-0.066169
to,-0.226462,0.246415,-0.047947,0.050168,0.078659,-0.140712,-0.141988,0.081105,-0.073131,-0.035501,...,-0.051204,-0.097441,-0.066573,0.023948,0.101628,-0.066689,0.06444,-0.086273,0.104738,-0.133303
in,0.011131,0.140776,-0.067797,-0.153594,-0.128552,-0.003055,-0.081086,0.009775,0.179716,-0.022355,...,-0.215259,0.074747,0.104564,-0.03607,-0.020554,0.020677,-0.123438,0.01245,0.00268,-0.051838


In [95]:
word_vectors.shape

(9265, 100)

## Dimension Reduction Using t-SNE

t-Distributed Stochastic Neighbor Embedding, or *t-SNE*, is a dimensionality reduction technique to assist with visualizing high-dimensional datasets.  It attempts to map high-dimensional data onto a low-dimensional (2 or 3 dimensions) representation such that the relative distances between points are preserved as closely as possible when moving from the high-dimensional to the low-dimensional space.

In [96]:
from sklearn.manifold import TSNE

In [99]:
# from spacy.lang.en.stop_words import STOP_WORDS

tsneInput = word_vectors.drop(STOP_WORDS, errors=u'ignore')
# tsneInput = tsneInput.head(400)

In [100]:
tsneInput.shape

(9029, 100)

In [101]:
tsne_filepath = os.path.join(intermediate_directory,
                             u'tsne_model')

tsne_vectors_filepath = os.path.join(intermediate_directory,
                                     u'tsne_vectors.npy')

In [102]:
import pickle

In [103]:
tsneInput.values.shape

(9029, 100)

In [104]:
if 1 == 1:
    
    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsneInput.values)
    
    with open(tsne_filepath, 'wb') as f:
        pickle.dump(tsne, f)

    pd.np.save(tsne_vectors_filepath, tsne_vectors)
    
with open(tsne_filepath, 'rb') as f:
    tsne = pickle.load(f)
    
tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsneInput.index),
                            columns=[u'x_coord', u'y_coord'])

In [None]:
tsne_vectors[u'word'] = tsne_vectors.index

In [None]:
tsne_vectors.head()

In [105]:
ids = list(beerDictFiltered.keys())
print(ids)

['78820', '16814', '87246', '42349', '110635', '87846', '21690', '172669', '57747', '76421', '146770', '47658', '86237', '7971', '64545', '207976', '122114', '5281', '162502', '115317', '78660', '259249', '123286', '58299', '77563', '41815', '113674', '51116', '3659', '120372', '19960', '10672', '188570', '1545', '69522', '64228', '104649']


In [106]:
descriptorListRaw = []
for key, value in beerDictFiltered.items():
    for k,v in value.items():
        descriptorListRaw.append(k)

        
descriptorList = list(set(descriptorListRaw))

In [44]:
descriptorList

['maple_syrup',
 'note',
 'pepper',
 'smooth',
 'bottle',
 'morning_delight',
 'fudge',
 'light',
 'batch',
 'burn',
 'sweetness',
 'awesome',
 'big',
 'coffee',
 'stout',
 'nose',
 'finish',
 'creamy',
 'review',
 'alcohol',
 'chocolate',
 'linger',
 'cinnamon_toast_crunch',
 'thick',
 's',
 'head',
 'world',
 'regular',
 'nice',
 'mocha',
 'bourbon',
 'spicy',
 'little',
 'abraxas',
 'mornin_delight',
 'dark_chocolate',
 'body',
 'drink',
 'choc',
 'barrel',
 'heat',
 'medium',
 'rich',
 'hype',
 'great',
 'come',
 'malt',
 'booze',
 'cocoa',
 'bit_of',
 'maple',
 'cinnamon',
 'm',
 'chili',
 'flavour',
 'spice',
 'glass',
 'pour',
 'dark',
 'like',
 'age',
 'oak',
 'vanilla',
 'sweet',
 'black',
 'coconut',
 'lot_of',
 'whiskey',
 'perfect',
 'roasted',
 'mouthfeel',
 'slight',
 't',
 'aroma',
 'good',
 'smell',
 'snifter',
 'taste',
 'flavor',
 'beer',
 'molass',
 'topple_goliath',
 'carbonation']

In [107]:
list1 = []
for i in descriptorList:
    list1.append(i + '_x')
    list1.append(i + '_y')

In [108]:
masterDF = pd.DataFrame(columns=list1, index=ids)
masterDF.head()

Unnamed: 0,huge_x,huge_y,smooth_x,smooth_y,people_x,people_y,bottle_x,bottle_y,morning_delight_x,morning_delight_y,...,pit_x,pit_y,green_apple_x,green_apple_y,fluffy_x,fluffy_y,mind_x,mind_y,bottle_into_x,bottle_into_y
78820,,,,,,,,,,,...,,,,,,,,,,
16814,,,,,,,,,,,...,,,,,,,,,,
87246,,,,,,,,,,,...,,,,,,,,,,
42349,,,,,,,,,,,...,,,,,,,,,,
110635,,,,,,,,,,,...,,,,,,,,,,


In [None]:
masterDF.shape

In [109]:
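for key, value in beerDict.items():
    for k2, v2 in value.items():
        columnLookupX = k2 + '_x'
        columnLookupY = k2 + '_y'
        # skip terms with no t-SNE coordinates or no column in masterDF
        if k2 not in tsne_vectors.index or columnLookupX not in masterDF.columns:
            continue
        # weight the term's t-SNE coordinates by its count for this beer;
        # a single .loc[row, column] call avoids chained assignment
        masterDF.loc[key, columnLookupX] = tsne_vectors.loc[k2, 'x_coord'] * v2
        masterDF.loc[key, columnLookupY] = tsne_vectors.loc[k2, 'y_coord'] * v2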

In [110]:
masterDF

Unnamed: 0,huge_x,huge_y,smooth_x,smooth_y,people_x,people_y,bottle_x,bottle_y,morning_delight_x,morning_delight_y,...,pit_x,pit_y,green_apple_x,green_apple_y,fluffy_x,fluffy_y,mind_x,mind_y,bottle_into_x,bottle_into_y
78820,-450.84,668.006,-1184.14,28.8658,248.009,-20.4841,2382.35,-1694.74,2218.75,714.877,...,,,,,,,,,-104.624,-523.008
16814,-826.541,1224.68,-1539.38,37.5256,496.018,-40.9683,1082.88,-770.338,,,...,,,,,-120.068,-2.12422,134.833,194.553,,
87246,-300.56,445.337,-828.897,20.2061,,,433.154,-308.135,,,...,,,,,,,44.9442,64.8511,-26.1561,-130.752
42349,-375.7,556.672,-1243.35,30.3091,,,433.154,-308.135,,,...,-46.5235,43.3778,,,,,,,,
110635,,,-355.242,8.65974,124.005,-10.2421,379.01,-269.618,,,...,,,-41.5789,42.7785,,,22.4721,32.4256,,
87846,-112.71,167.001,-236.828,5.77316,,,54.1442,-38.5169,,,...,-46.5235,43.3778,,,-420.239,-7.43476,,,,
21690,-488.411,723.673,-1184.14,28.8658,248.009,-20.4841,216.577,-154.068,,,...,,,,,-120.068,-2.12422,67.4163,97.2767,,
172669,-75.1401,111.334,-296.035,7.21645,124.005,-10.2421,487.298,-346.652,,,...,,,,,,,,,,
57747,-375.7,556.672,-1776.21,43.2987,62.0023,-5.12103,1624.33,-1155.51,,,...,,,,,-60.0341,-1.06211,,,-39.2341,-196.128
76421,-150.28,222.669,-473.655,11.5463,248.009,-20.4841,379.01,-269.618,61.632,19.8577,...,,,,,-240.136,-4.24843,22.4721,32.4256,,


In [111]:
masterDF.fillna(0, inplace=True)

In [62]:
masterDF

Unnamed: 0,maple_syrup_x,maple_syrup_y,note_x,note_y,pepper_x,pepper_y,smooth_x,smooth_y,bottle_x,bottle_y,...,flavor_x,flavor_y,beer_x,beer_y,molass_x,molass_y,topple_goliath_x,topple_goliath_y,carbonation_x,carbonation_y
78820,348.072292,338.940989,110.128792,119.568909,0.0,0.0,30.173761,7.311276,-208.661396,-226.340263,...,158.553978,202.45768,-260.607433,-465.176334,85.393158,56.89444,-66.696585,-71.815933,-30.524355,99.098869
87246,16.574871,16.140047,178.959287,194.299477,280.730519,416.992285,45.260642,10.966914,-144.45789,-156.697105,...,184.979641,236.200626,-260.607433,-465.176334,28.464386,18.964813,0.0,0.0,-19.077722,61.936793


In [112]:
from sklearn.metrics.pairwise import cosine_similarity
from scipy import spatial
from itertools import combinations

In [113]:
list(masterDF.index)

['78820',
 '16814',
 '87246',
 '42349',
 '110635',
 '87846',
 '21690',
 '172669',
 '57747',
 '76421',
 '146770',
 '47658',
 '86237',
 '7971',
 '64545',
 '207976',
 '122114',
 '5281',
 '162502',
 '115317',
 '78660',
 '259249',
 '123286',
 '58299',
 '77563',
 '41815',
 '113674',
 '51116',
 '3659',
 '120372',
 '19960',
 '10672',
 '188570',
 '1545',
 '69522',
 '64228',
 '104649']

In [114]:
for pair in combinations(list(masterDF.index), 2):
    # cosine distance expects 1-D numeric vectors
    vec_a = masterDF.loc[pair[0]].values.astype(float)
    vec_b = masterDF.loc[pair[1]].values.astype(float)
    cs = 1 - spatial.distance.cosine(vec_a, vec_b)
    print(f'The similarity between {pair[0]} and {pair[1]} is {cs}')

The similarity between 78820 and 16814 is 0.4410527579747747
The similarity between 78820 and 87246 is 0.6107970340519152
The similarity between 78820 and 42349 is 0.7144780305656186
The similarity between 78820 and 110635 is 0.5759228405067454
The similarity between 78820 and 87846 is 0.3853968496422101
The similarity between 78820 and 21690 is 0.49319259562667406
The similarity between 78820 and 172669 is 0.6926374906287212
The similarity between 78820 and 57747 is 0.8192041911779465
The similarity between 78820 and 76421 is 0.8187412775409447
The similarity between 78820 and 146770 is 0.4154617106033829
The similarity between 78820 and 47658 is 0.8561294731265551
The similarity between 78820 and 86237 is 0.413002097716215
The similarity between 78820 and 7971 is 0.4467707252587072
The similarity between 78820 and 64545 is 0.44646167738992504
The similarity between 78820 and 207976 is 0.43531267166759036
The similarity between 78820 and 122114 is 0.3801001351168044
The similarity bet

In [None]:
# find the similarity between every beer pair
# create a dictionary where the key is the beer and value is the list of beers from most to least similar
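One way the plan in the comments above might be carried out is sketched below in plain Python. This is only an illustration under stated assumptions: `rank_similar` is a hypothetical helper, and the toy feature rows stand in for rows of `masterDF`; nothing here is taken from the notebook's own results.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_similar(features):
    """Map each ID to the other IDs, ordered from most to least similar."""
    scores = {beer: {} for beer in features}
    # each unordered pair is scored once and stored in both directions
    for a, b in combinations(features, 2):
        cs = cosine_similarity(features[a], features[b])
        scores[a][b] = cs
        scores[b][a] = cs
    return {beer: sorted(others, key=others.get, reverse=True)
            for beer, others in scores.items()}

# Made-up feature rows standing in for rows of masterDF
toy_features = {
    'beer_a': [1.0, 2.0, 0.0],
    'beer_b': [2.0, 4.0, 0.1],
    'beer_c': [0.0, 0.5, 3.0],
}

ranking = rank_similar(toy_features)
print(ranking['beer_a'])
```

With such a dictionary, the first entry of a beer's list would be a "similar" recommendation and the last entry a "completely different" one.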