## How do find similar beer drinkers by using written reviews only?   

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
df = pd.read_csv('data/BeerDataScienceProject.csv', engine='python')
df.shape

(528870, 13)

As we only need to use 'review_text' and based on writtern reviews find out the most similar beer drinkers. 

In [3]:
df = df[['review_text', 'review_profileName']]
df.head(2)

Unnamed: 0,review_text,review_profileName
0,A lot of foam. But a lot. In the smell some ba...,stcules
1,"Dark red color, light beige foam, average. In ...",stcules


Drop null rows

In [15]:
df.dropna(inplace=True)

Drop all the duplicate samples

In [16]:
df.drop_duplicates(keep='first', inplace=True)

Convert review_text to numeric ones. 
- Remove all the stop words from the written reviews
- Keep only those terms which are more than 5 in number in the document and ignore terms that have a document frequency strictly higher than the 90% threshold.
- It only aids to remove corpus specific stop words.

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english', min_df=5, max_df=0.90)
review_text_matrix = vectorizer.fit_transform(df['review_text'])
review_text_matrix

<461352x41704 sparse matrix of type '<class 'numpy.float64'>'
	with 25821175 stored elements in Compressed Sparse Row format>

Below function returns the top 3 similar beer drinkers based on given review. 
- It calcuates the cosine similarities between test review and list of given written review.
- Then, sort the similarity scores in descending order and keep the top 3 reviews.
- Based on indexes, it keeps their corespondence user.

In [18]:
# Get top 3 similar beer drinkers based on test review written

def get_similar_beer_drinkers(review_text_matrix, test_review_matrix):
    cosine_sim = cosine_similarity(review_text_matrix, review_text_matrix[0])
    cosine_sim = np.array([x[0] for x in cosine_sim])
    
    idxs = np.argsort(cosine_sim)[::-1][:3]
    # print(idxs)
    return df.iloc[idxs, 1].tolist()

In [19]:
test_review = '22 oz bottle from "Lifesource" Salem. $3.95 Nice golden clear beer body with a nice sized frothy/creamy white head. Ok aromas..mainlly a bit of ginger speice and some bready malt..simple nice Taste very nice indeed..nice spicy ginger backed with slightly caramel maltiness..simple again but i like . Liked the mouthfeel of this one..very forward carbonation which helps the ginger effect and a lingering ginger in the after taste. Overall a simple ginger brew .I liked it'
test_review_matrix = vectorizer.transform([test_review])

similar_drinkers = get_similar_beer_drinkers(review_text_matrix, test_review_matrix)
print('Similar beer drinkers: ')
similar_drinkers


Similar beer drinkers: 


['stcules', 'jctribe25', 'HopHead84']

## ['stcules', 'jctribe25', 'HopHead84'] are the top three similar beer drinker for given test review.