<a href="https://colab.research.google.com/github/lurbisci/InsightDataScienceProject/blob/master/Laura_Urbisci_InsightProjCode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Insight Data Science Project
## The Ron Burgundy, the scotch whisky recommender

__Date:__ January 2019   
__Author:__ Laura Urbisci

___

## Project overview:
The Insight program takes data scientists with PhDs in various quantitative fields and helps them further develop their skills by tackling a real-world data science project in just 3 weeks. 
These projects demonstrate Fellows’ familiarity with industry-standard tools as well as their ability to build a project from scratch in a short amount of time. Projects come in two types: individual projects and consulting projects. For an individual project, the Fellow develops their own idea; for a consulting project, the Fellow works with a startup to solve its data problems. I did an individual project, which I called the Ron Burgundy, the scotch whisky recommender. The app uses Reddit review data to recommend scotch so that whisky lovers can find new products to try. I used collaborative filtering for the model and deployed the app on AWS (the ronburgundy.com).



# Data acquisition

This section of the notebook contains the packages needed to run all of the code in this notebook in addition to the code used for data extraction.  

<br>

I obtained data from the following sources:
1. The majority of the data can be found [here](https://docs.google.com/spreadsheets/d/1X1HTxkI6SqsdpNSkSSivMzpxNT-oeTbjFFDdEkXD30o/edit#gid=695409533). The link contains multiple structured tables, pre-gathered from Reddit, covering Reddit usernames, whiskies, and whisky reviews. I used the Review Archive tab for the training and test data and the Best by User tab for the validation data set.
2. I gathered whisky data from the [Meta-Critic Whisky Database](https://whiskyanalysis.com/index.php/interesting-correlations/how-to-read-the-database/). This data was used to supplement the Reddit data and to provide descriptions of the whiskies for the final product. 

<br>

I first installed and imported all the necessary packages for the analysis.

In [0]:
# Import all of the necessary Python packages
from google.colab import drive
import os
import json

import pandas as pd
import numpy as np
from copy import deepcopy

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from pylab import savefig

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error

from keras.models import Sequential, Model, load_model
from keras.utils.vis_utils import model_to_dot
from keras.optimizers import Adam
from keras.layers import Embedding, Reshape, Activation, Input, Dense, Flatten, Dropout, dot
from keras.constraints import non_neg
from IPython.display import SVG

In [0]:
# Set up directory for data
import os

# Load the Drive helper and mount
from google.colab import drive
drive.mount('/content/drive')

data_dir = '/content/drive/My Drive/Insight'

os.chdir(data_dir)
!ls

In [0]:
whisky_reddit = pd.read_csv("Reddit Whisky Network Review Archive - Review Archive.csv")
whisky_reddit.info()

In [0]:
whisky_origin = pd.read_csv('Meta-Critic Whisky Database – Selfbuilts Whisky Analysis.csv')
whisky_origin.info()

In [0]:
whisky_val = pd.read_csv("Reddit Whisky Network Review Archive - Best by User.csv")
whisky_val.info()

# Data cleaning

The Reddit data needed to be cleaned before analysis. Features that were not useful for the modeling approach were dropped. One column (Cost) had to be converted from a string of dollar signs into a number. Leading and trailing whitespace was stripped from the region labels. Several columns also have missing values, and I decided to drop the rows with missing data. The Reddit data set was then merged with the descriptive data set. Finally, I used label encoding to turn the categorical variables into numeric variables.
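
The cleaning steps described above can be sketched on a toy table. The rows below are invented for illustration and are not taken from the Reddit archive; only the column names `Cost`, `Region_or_Style`, and `Reddit_Review` follow this notebook.

```python
import pandas as pd

# Toy rows, invented for illustration only.
toy = pd.DataFrame({
    "Whisky": ["A", "B", "C", "D"],
    "Cost": ["$$", "$$$$", "$", "$$$"],
    "Region_or_Style": [" Islay", "Speyside ", "Highland", " Islay "],
    "Reddit_Review": ["85", "90", "n/a", "78"],
})

# Encode the price tier as the number of "$" characters.
toy["Cost_num"] = toy["Cost"].str.len()

# Remove leading and trailing whitespace from the region labels.
toy["Region_cleaned"] = toy["Region_or_Style"].str.strip()

# Non-numeric ratings become NaN, and those rows are then dropped.
toy["Reddit_Review"] = pd.to_numeric(toy["Reddit_Review"], errors="coerce")
toy_clean = toy.dropna(subset=["Reddit_Review"]).reset_index(drop=True)

print(toy_clean[["Whisky", "Cost_num", "Region_cleaned", "Reddit_Review"]])
```

Using the vectorized `.str` accessors avoids a row-by-row loop and leaves missing values as NaN instead of raising an error.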



In [0]:
whisky_reddit.describe()

In [0]:
whisky_origin.describe()

In [0]:
colsToDrop = ['Link To Reddit Review', 'Date of Review', 'Timestamp', 
'Month', 'Day', 'Time', 'am pm'] # columns not needed for modeling
reddit_v2 = whisky_reddit.drop(colsToDrop, axis=1)

reddit_v2.head()

In [0]:
colsToDrop3 = ['Super Cluster'] # column not needed for modeling
descrip_v2 = whisky_origin.drop(colsToDrop3, axis=1)

descrip_v2.head()

In [0]:
# merge the descriptive data with the Reddit reviews on whisky name
whisky_df = pd.merge(descrip_v2,reddit_v2,left_on=["Whisky"], right_on=["Whisky Name"],how="inner")

whisky_df.head() 

In [0]:
# keep the original Cost column and store the numeric version in a new column:
# the price tier becomes the number of "$" characters in the Cost string
whisky_df['Cost_num'] = whisky_df['Cost'].str.len()

In [0]:
whisky_df = whisky_df.rename(columns={'Meta Critic': 'Meta_Critic',
                                     '#': 'Number_of_MCReviews',
                                     'Whisky Name': 'Whisky_Name',
                                     "Reviewer's Reddit Username": 'Reddit_Username',
                                     'Reviewer Rating': 'Reddit_Review',
                                     'Whisky Region or Style': 'Region_or_Style',
                                     'Full Bottle Price Paid': 'Price_Paid'})

whisky_df.head()

In [0]:
# strip leading and trailing whitespace from the region strings,
# storing the result in a new column
whisky_df['Region_cleaned'] = whisky_df['Region_or_Style'].str.strip()


In [0]:
# drop columns that are no longer needed
colsToDrop4 = ['Cost', 'STDEV', 'Price_Paid', 
  'Region_or_Style', 'Whisky_Name', 'Year', 'Country']

whisky_df.drop(colsToDrop4, axis=1, inplace=True)


whisky_df.shape 
# 4,506 rows and 10 columns - lost a lot of data with the inner join

In [0]:
whisky_df['Reddit_Review']= pd.to_numeric(whisky_df['Reddit_Review'], errors='coerce')

In [0]:
whisky_clean=whisky_df.dropna(subset=['Reddit_Review'])

whisky_clean.shape # down to 4,443 rows

In [0]:
# for validation data 
# keep only the necessary columns
whisky_val = whisky_val.loc[:,"Reviewer's Reddit Username":'tied2']

# rename columns
whisky_val = whisky_val.rename(columns={"Reviewer's Reddit Username" : 'Reddit_Username',
                                      "avg Reviewer Rating" : 'Reddit_Review',
                                      "count Reviewer Rating" : 'Number_of_Reviews',
                                      "1st highest rating" : 'Rec_1',
                                      "tied2" : 'Rec_2' })
# remove an unnecessary column
whisky_val = whisky_val.drop(["max Reviewer Rating"], axis=1)
    
# remove rows with NAs
whisky_val = whisky_val.dropna()
whisky_val.shape # down to 116 rows

In [0]:
# transform first then merge
whisky_val_melt = pd.melt(whisky_val, id_vars=['Reddit_Username', 'Reddit_Review'], value_vars=['Rec_1', 'Rec_2'],
                         var_name='Recommendation', value_name="Whisky")

# merge with description data
whisky_validat = pd.merge(descrip_v2,whisky_val_melt,left_on=["Whisky"],
                          right_on=["Whisky"],how="inner")

# rename columns
whisky_validat = whisky_validat.rename(columns={'Meta Critic': 'Meta_Critic',
                                     '#': 'Number_of_MCReviews'})
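
The wide-to-long reshape performed by `pd.melt` in the cell above can be seen on a small example. This is a sketch with invented usernames and whisky names; only the column names mirror the validation table.

```python
import pandas as pd

# Toy wide table (invented): one row per user, two top picks per row.
wide_df = pd.DataFrame({
    "Reddit_Username": ["user_a", "user_b"],
    "Reddit_Review": [84.5, 79.0],
    "Rec_1": ["Whisky X", "Whisky Y"],
    "Rec_2": ["Whisky Z", "Whisky X"],
})

# melt stacks Rec_1 and Rec_2 into a single "Whisky" column,
# giving one row per (user, recommended whisky) pair.
long_df = pd.melt(
    wide_df,
    id_vars=["Reddit_Username", "Reddit_Review"],
    value_vars=["Rec_1", "Rec_2"],
    var_name="Recommendation",
    value_name="Whisky",
)
print(long_df)
```

The long form has one whisky per row, which is what the merge on `Whisky` with the description table needs.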


In [0]:
# keep the original Cost column and store the numeric version in a new column:
# the price tier becomes the number of "$" characters in the Cost string
whisky_validat['Cost_num'] = whisky_validat['Cost'].str.len()

In [0]:
colsToDrop4 = ['Cost', 'STDEV', 'Country', 'Recommendation'] # columns no longer needed

whisky_validat.drop(colsToDrop4, axis=1, inplace=True)

whisky_validat.head()

In [0]:
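whisky_le = deepcopy(whisky_clean)
# These two columns are not created anywhere in this notebook,
# so skip them if they are absent instead of raising a KeyError.
whisky_le = whisky_le.drop(columns=['Rating_category',
                                        'Rating_category_quantile'], errors='ignore')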

In [0]:
# Encode each categorical column as integers between 0 and n_classes-1
le = preprocessing.LabelEncoder()
cat_cols = ['Whisky', 'Class', 'Cluster', 'Type', 'Reddit_Username', 'Region_cleaned']

for col in cat_cols:
    whisky_le[col] = le.fit_transform(whisky_le[col].astype(str))

# Caveat: the encoder is refit on the validation table, so its integer codes
# are not guaranteed to match the codes assigned to whisky_le above.
for col in cat_cols:
    if col in whisky_validat.columns:  # Region_cleaned is never added to whisky_validat
        whisky_validat[col] = le.fit_transform(whisky_validat[col].astype(str))

whisky_le.head()
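
Because the cell above refits the encoder on the validation table, the same whisky or username can receive different integer codes in `whisky_le` and `whisky_validat`. One possible way to keep the codes aligned, sketched here with plain pandas rather than `LabelEncoder`, is to build the mapping once from the training labels and reuse it. The labels below are invented, and dropping unseen labels is an assumption about how one might handle them, not something this notebook does.

```python
import pandas as pd

# Invented labels for illustration.
train_names = pd.Series(["Whisky A", "Whisky B", "Whisky A", "Whisky C"])
val_names = pd.Series(["Whisky C", "Whisky A", "Whisky Unseen"])

# Build the name -> code mapping once, from the training labels only.
classes = sorted(train_names.unique())
name_to_code = {name: code for code, name in enumerate(classes)}

train_codes = train_names.map(name_to_code)

# Reuse the same mapping for validation; names never seen in training
# map to NaN and are dropped here.
val_codes = val_names.map(name_to_code)
val_known = val_codes.dropna().astype(int)

print(name_to_code)
print(val_known.tolist())
```

With a shared mapping, an embedding row learned for a whisky during training is the same row looked up for that whisky at validation time.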

# Exploratory data analysis

I looked at the distribution of the Reddit reviews to check whether the ratings were skewed. I also explored the pairwise correlation between the whisky critics' score (Meta Critic) and the Reddit whisky rating.
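
Both checks can also be expressed as single numbers. The sketch below computes a sample skewness and a Pearson correlation with NumPy; the two arrays are invented for illustration and are not values from the data sets, and the helper `sample_skewness` is a hypothetical name introduced here.

```python
import numpy as np

# Invented ratings for illustration; not values from the Reddit archive.
reddit = np.array([88.0, 92.0, 75.0, 85.0, 90.0, 60.0])
critic = np.array([8.7, 9.0, 8.1, 8.5, 8.9, 7.8])

def sample_skewness(x):
    """Third standardized moment: negative means a longer left tail."""
    centered = x - x.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Pearson correlation is the off-diagonal entry of the 2x2 matrix from np.corrcoef.
r = np.corrcoef(reddit, critic)[0, 1]

print(f"skewness: {sample_skewness(reddit):.2f}")
print(f"Pearson r: {r:.2f}")
```

A skewness near zero suggests a roughly symmetric distribution, while a clearly negative value indicates a tail of low ratings.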

In [0]:
sns.set(font_scale=1.5)
sns.set_style("white")

ax = sns.distplot(whisky_le['Reddit_Review'], color='#800020')
ax.set(xlabel='Reddit User Reviews of Whisky', ylabel='Density')

figure = ax.get_figure()    
figure.savefig('hist_reddit_reviews.png')

In [0]:
ax = sns.pairplot(whisky_le, vars=['Reddit_Review','Meta_Critic'], 
                  kind='scatter', palette='#800020')

ax.fig.set_size_inches(10,10)
#figure = ax.get_figure()    
#figure.savefig('corr_reviews.png')


In [0]:
np.corrcoef(whisky_le['Reddit_Review'],whisky_le['Meta_Critic'])

# Model building

Three features (Reddit username, Reddit whisky rating, and whisky) drove my choice of final model. I originally tried other approaches, such as decision trees and k-means clustering. Given the output my app needed, the metrics of those models, and the three key features, I decided to use collaborative filtering with a neural net, a common approach for recommender systems. There are a few ways to build this kind of model; in this project I decided to focus on the product, i.e. the whisky (item-based collaborative filtering).
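
The core idea, predicting a rating as the dot product of a user vector and a whisky vector, can be sketched without Keras. The example below is a minimal NumPy illustration, not the model trained in this notebook: the ratings are invented, plain stochastic gradient descent stands in for Adam, and clipping at zero stands in for the non-negativity constraint.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented (user, whisky, rating) triples for illustration.
ratings = np.array([
    [0, 0, 9.0], [0, 1, 7.0], [1, 0, 8.0],
    [1, 2, 5.0], [2, 1, 6.0], [2, 2, 4.0],
])
users = ratings[:, 0].astype(int)
items = ratings[:, 1].astype(int)
y = ratings[:, 2]

n_users, n_items, n_factors = 3, 3, 2
P = rng.uniform(0.5, 1.5, size=(n_users, n_factors))  # user latent factors
Q = rng.uniform(0.5, 1.5, size=(n_items, n_factors))  # whisky latent factors

def predict(P, Q, users, items):
    """Predicted rating = dot product of the user and whisky vectors."""
    return np.sum(P[users] * Q[items], axis=1)

def mse(P, Q):
    return np.mean((predict(P, Q, users, items) - y) ** 2)

initial_loss = mse(P, Q)
lr = 0.01
for _ in range(2000):
    for u, i, r in zip(users, items, y):
        err = P[u] @ Q[i] - r
        grad_p, grad_q = err * Q[i], err * P[u]
        # Gradient step, then clip at zero to keep the factors non-negative.
        P[u] = np.maximum(P[u] - lr * grad_p, 0.0)
        Q[i] = np.maximum(Q[i] - lr * grad_q, 0.0)

final_loss = mse(P, Q)
print(f"MSE before: {initial_loss:.2f}, after: {final_loss:.4f}")
```

After training, the rows of `Q` play the same role as the whisky embedding layer: whiskies rated similarly by the same users end up with similar vectors.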


In [0]:
train, test = train_test_split(whisky_le, test_size=0.2)
val_x = whisky_validat.drop(columns=['Reddit_Review'], axis=1)
val_y = whisky_validat["Reddit_Review"]

In [0]:
# how many dimensions can I think of that differentiate scotch - flavor, cost

n_users, n_whisky = len(whisky_le.Reddit_Username.unique()), len(whisky_le.Whisky.unique())
n_latent_factors = 8 # why 8 - 3 main regions, 5 different cost ranges

In [0]:
# Whisky branch: integer ID -> non-negative embedding -> flat vector
whisky_input = Input(shape=[1], name='Item')
whisky_embedding = Embedding(n_whisky, n_latent_factors, name='NonNegWhisky-Embedding',
                             embeddings_constraint=non_neg())(whisky_input)
whisky_vec = Flatten(name='FlattenWhisky')(whisky_embedding)

# User branch, built the same way
user_input = Input(shape=[1], name='User')
user_embedding = Embedding(n_users, n_latent_factors, name='NonNegUser-Embedding',
                           embeddings_constraint=non_neg())(user_input)
user_vec = Flatten(name='FlattenUsers')(user_embedding)

# Predicted rating is the (unnormalized) dot product of the two vectors
prod = dot([whisky_vec, user_vec], axes=1, normalize=False)
model = Model([user_input, whisky_input], prod)
model.compile('adam', 'mean_squared_error')

In [0]:
# rankdir must be 'TB' (top to bottom) or 'LR' (left to right)
SVG(model_to_dot(model, show_shapes=True, show_layer_names=True, 
                 rankdir='TB').create(prog='dot', format='svg'))

In [0]:
# 8,000 ish parameters to learn
model.summary()

In [0]:
# train for 100 epochs, tracking the loss on the validation set after each epoch
history_nonneg = model.fit([train.Reddit_Username, train.Whisky], 
                           train.Reddit_Review, 
                           validation_data=([val_x.Reddit_Username, val_x.Whisky], val_y),
                           epochs=100, verbose=0)


In [0]:
y_hat = np.round(model.predict([test.Reddit_Username, test.Whisky]),0)
y_true = test.Reddit_Review
mean_absolute_error(y_true, y_hat)

In [0]:
#model.save('whisky_cf_nonneg.h5')

In [0]:
whisky_embedding_learnt = model.get_layer(name='NonNegWhisky-Embedding').get_weights()[0]
pd.DataFrame(whisky_embedding_learnt).describe()

In [0]:
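# Reload the whisky dictionary and the trained model from disk.
# Both files must already exist: dictionary_whisky.json is written further down,
# and whisky_cf_nonneg.h5 comes from the model.save call that is commented out above.
with open("dictionary_whisky.json") as f:
    whisky_dict = json.load(f)

model = load_model('whisky_cf_nonneg.h5')

# Scale each whisky embedding to unit length so that dot products are cosine similarities
whi = model.get_layer("NonNegWhisky-Embedding")
whisky_weights = whi.get_weights()[0]
lens = np.linalg.norm(whisky_weights, axis=1)
normalized = (whisky_weights.T / lens).T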

In [0]:
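def similarity_output(search_index):
  """Return the cosine similarity between one whisky and every whisky.

  `normalized` holds unit-length embedding rows, so each dot product
  is a cosine similarity.
  """
  dists = np.dot(normalized, normalized[search_index])
  return dists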

In [0]:
# create a dictionary mapping column 0 of each row to column 12
# (it is reversed below to map an integer label back to a whisky name)

whisky_dict = {}

for index, series in whisky_clean.iterrows():
    whisky_dict[series.iloc[0]] = series.iloc[12]
print(whisky_dict)


In [0]:
with open("dictionary_whisky.json", "w") as f:
  json.dump(whisky_dict, f)

whisky_dict

In [0]:
# collapse duplicate whiskies into one row each by averaging their columns
whisky_group = whisky_le.groupby(['Whisky'], as_index=False).mean()

In [0]:
whisky_dict_reverse = {v: k for (k, v) in whisky_dict.items()}

In [0]:
whisky_group['Whisky_name'] = whisky_group['Whisky'].apply(lambda x: whisky_dict_reverse[x])
whisky_group.head()

# Results

Which whiskies are closest together and furthest apart in the learned whisky embeddings? As an informal way of "validating the model", I picked a whisky and inspected its similarity to every other whisky.
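
The nearest-neighbor lookup behind this check can be sketched with NumPy. The four 2-D vectors below are invented stand-ins for rows of the whisky embedding layer, and `most_similar` is a hypothetical helper introduced for illustration.

```python
import numpy as np

# Invented 2-D embeddings for four items; real rows would come from the embedding layer.
embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.5, 0.5],
])

# Scale each row to unit length so that a dot product equals cosine similarity.
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def most_similar(index, top_n=2):
    """Return (item index, cosine similarity) pairs, excluding the query item."""
    sims = unit @ unit[index]
    order = np.argsort(sims)[::-1]
    return [(int(j), float(sims[j])) for j in order if j != index][:top_n]

print(most_similar(0))
```

Sorting the similarities in descending order puts the closest items first, and reading the same list from the end gives the items that are furthest apart.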

In [0]:
A = similarity_output(121)
whisky_group['Dot'] = A
whisky_group.head()
whisky_group.sort_values(by=['Dot'], inplace=True, ascending=False)

whisky_group

In [0]:
whisky_group.loc[whisky_group['Whisky'] == 121]

# Conclusion

I built a neural net model that has two inputs (whisky name and Reddit username) and a single output (rating). I focused on the whisky embedding layer to build my app. While this model is a good start, it is messy. A big challenge I encountered during the Insight project was that I had to pivot twice in the beginning, which meant that instead of 3 weeks I had a week and a half to do the project. In addition, I decided to learn a new modeling technique (deep learning) in a week. There are therefore many ways I would build upon this and iterate in the future. I would try focusing on the Reddit users to see which Reddit user a given person most closely resembles. I would also build a new model that uses the descriptive portion of the data in the neural net. 
