### Notebook 1: Cleaning & EDA for Scotch Recommender

Here are the necessary imports:

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Load the Scotch whisky dataset, which is available on Kaggle: https://www.kaggle.com/koki25ando/scotch-whisky-dataset

In [16]:
scotch_df = pd.read_csv('data/whisky.csv')
scotch_df

Unnamed: 0,RowID,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral,Postcode,Latitude,Longitude
0,1,Aberfeldy,2,2,2,0,0,2,1,2,2,2,2,2,\tPH15 2EB,286580,749680
1,2,Aberlour,3,3,1,0,0,4,3,2,2,3,3,2,\tAB38 9PJ,326340,842570
2,3,AnCnoc,1,3,2,0,0,2,0,0,2,2,3,2,\tAB5 5LI,352960,839320
3,4,Ardbeg,4,1,4,4,0,0,2,0,1,2,1,0,\tPA42 7EB,141560,646220
4,5,Ardmore,2,2,2,0,0,1,1,1,2,3,1,1,\tAB54 4NH,355350,829140
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81,82,Tobermory,1,1,1,0,0,1,0,0,1,2,2,2,PA75 6NR,150450,755070
82,83,Tomatin,2,3,2,0,0,2,2,1,1,2,0,1,IV13 7YT,279120,829630
83,84,Tomintoul,0,3,1,0,0,2,2,1,1,2,1,2,AB37 9AQ,315100,825560
84,85,Tormore,2,2,1,0,0,1,0,1,2,1,0,0,PH26 3LR,315180,834960


In [17]:
scotch_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86 entries, 0 to 85
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   RowID       86 non-null     int64 
 1   Distillery  86 non-null     object
 2   Body        86 non-null     int64 
 3   Sweetness   86 non-null     int64 
 4   Smoky       86 non-null     int64 
 5   Medicinal   86 non-null     int64 
 6   Tobacco     86 non-null     int64 
 7   Honey       86 non-null     int64 
 8   Spicy       86 non-null     int64 
 9   Winey       86 non-null     int64 
 10  Nutty       86 non-null     int64 
 11  Malty       86 non-null     int64 
 12  Fruity      86 non-null     int64 
 13  Floral      86 non-null     int64 
 14  Postcode    86 non-null     object
 15  Latitude    86 non-null     int64 
 16  Longitude   86 non-null     int64 
dtypes: int64(15), object(2)
memory usage: 11.5+ KB


Right away I can see that not much cleaning is needed here: every column has 86 non-null values. For now, I'll drop the RowID and Postcode columns and move on. I'm not sure yet how I'll use the Latitude/Longitude data, so I'll keep those columns for now.

In [18]:
scotch_df = scotch_df.drop(columns=['RowID', 'Postcode'])
scotch_df.head(2)

Unnamed: 0,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral,Latitude,Longitude
0,Aberfeldy,2,2,2,0,0,2,1,2,2,2,2,2,286580,749680
1,Aberlour,3,3,1,0,0,4,3,2,2,3,3,2,326340,842570
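The "no nulls" observation from `info()` can also be checked directly, along with a quick look for repeated distillery names. This is a minimal sketch on a small sample copied from the first rows shown above; the `sample` frame and the variable names are illustrative, and in the notebook the same calls would run on the full `scotch_df`.

```python
import pandas as pd

# Small sample copied from the first rows of the output above.
# In the notebook, scotch_df would be used in place of `sample`.
sample = pd.DataFrame(
    {
        "Distillery": ["Aberfeldy", "Aberlour", "AnCnoc", "Ardbeg", "Ardmore"],
        "Body": [2, 3, 1, 4, 2],
        "Smoky": [2, 1, 2, 4, 2],
        "Medicinal": [0, 0, 0, 4, 0],
    }
)

# Count missing values per column.
null_counts = sample.isna().sum()

# Count distillery names that appear more than once.
n_duplicates = int(sample["Distillery"].duplicated().sum())

print(null_counts)
print("duplicate distilleries:", n_duplicates)
```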


Now on to the EDA. I'd like to explore the different scotch attributes. My personal favorites are the really peaty, smoky scotches, so I'll start with the ones that score highest on Body, Smoky, and Medicinal.

In [19]:
scotch_df.describe()

Unnamed: 0,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral,Latitude,Longitude
count,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0,86.0
mean,2.069767,2.290698,1.534884,0.546512,0.116279,1.244186,1.383721,0.976744,1.465116,1.802326,1.802326,1.697674,287247.162791,802659.7
std,0.93041,0.717287,0.863613,0.990032,0.322439,0.853175,0.784686,0.93276,0.82173,0.629094,0.779438,0.855017,67889.046814,88024.22
min,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,126680.0,554260.0
25%,2.0,2.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,265672.5,755697.5
50%,2.0,2.0,1.0,0.0,0.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,319515.0,839885.0
75%,2.0,3.0,2.0,1.0,0.0,2.0,2.0,1.0,2.0,2.0,2.0,2.0,328630.0,850770.0
max,4.0,4.0,4.0,4.0,1.0,4.0,3.0,4.0,4.0,3.0,3.0,4.0,381020.0,1009260.0
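The min and max rows of the summary can be pulled out on their own to see which flavor columns actually reach the top score. The sketch below assumes a small sample copied from the first five rows of the dataset and only a subset of the flavor columns; `flavor_cols`, `ranges`, and `below_max` are illustrative names, and the sample's ranges will not match the full dataset's.

```python
import pandas as pd

# First five rows of the dataset, restricted to a few flavor columns.
sample = pd.DataFrame(
    {
        "Distillery": ["Aberfeldy", "Aberlour", "AnCnoc", "Ardbeg", "Ardmore"],
        "Body": [2, 3, 1, 4, 2],
        "Sweetness": [2, 3, 3, 1, 2],
        "Smoky": [2, 1, 2, 4, 2],
        "Medicinal": [0, 0, 0, 4, 0],
        "Tobacco": [0, 0, 0, 0, 0],
    }
)

flavor_cols = ["Body", "Sweetness", "Smoky", "Medicinal", "Tobacco"]

# One row per flavor column, with its observed minimum and maximum.
ranges = sample[flavor_cols].agg(["min", "max"]).T

# Flavor columns that never reach a score of 4 in this sample.
below_max = ranges[ranges["max"] < 4]

print(ranges)
print(below_max.index.tolist())
```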


Each flavor category is scored on a 0-4 scale, although not every column uses the full range (Tobacco, for example, tops out at 1). To see the smokiest scotches, I'll look for those with Smoky = 3 or 4.

In [20]:
# What are the smokiest scotches? Keep rows with a Smoky score of 3 or 4.
smokiest = scotch_df[scotch_df['Smoky'] > 2]
smokiest

Unnamed: 0,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral,Latitude,Longitude
3,Ardbeg,4,1,4,4,0,0,2,0,1,2,1,0,141560,646220
18,Bowmore,2,2,3,1,0,2,2,1,1,1,1,2,131330,659720
21,Caol Ila,3,1,4,2,1,0,2,0,2,1,1,1,142920,670040
23,Clynelish,3,2,3,3,1,0,2,0,1,1,2,0,290250,904230
34,GlenGarioch,2,1,3,0,0,0,3,1,0,2,2,2,381020,827590
53,Highland Park,2,2,3,1,0,2,1,1,1,2,1,1,345340,1009260
57,Lagavulin,4,1,4,4,1,0,1,2,1,1,1,0,140430,645730
58,Laphroig,4,2,4,4,1,0,0,1,1,1,0,0,138680,645160
77,Talisker,4,2,3,3,0,1,3,0,1,2,2,0,137950,831770


This result makes sense: two of my favorite scotches, Ardbeg and Laphroaig (spelled "Laphroig" in the dataset), are on the list. I'm already seeing some scotches I need to try!
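Since the goal is to find scotches that are high in Body, Smoky, and Medicinal together, one option is to rank them by a combined score rather than filter one column at a time. The sketch below sums the three columns into a hypothetical `peat_score` column (my own name, not part of the dataset) using the nine smoky rows shown above; an unweighted sum is only one possible choice.

```python
import pandas as pd

# The nine rows from the smokiest output above (Body, Smoky, Medicinal only).
smoky_sample = pd.DataFrame(
    {
        "Distillery": [
            "Ardbeg", "Bowmore", "Caol Ila", "Clynelish", "GlenGarioch",
            "Highland Park", "Lagavulin", "Laphroig", "Talisker",
        ],
        "Body": [4, 2, 3, 3, 2, 2, 4, 4, 4],
        "Smoky": [4, 3, 4, 3, 3, 3, 4, 4, 3],
        "Medicinal": [4, 1, 2, 3, 0, 1, 4, 4, 3],
    }
)

# Hypothetical combined score: unweighted sum of the three attributes.
peaty = smoky_sample.assign(
    peat_score=smoky_sample[["Body", "Smoky", "Medicinal"]].sum(axis=1)
)

# Three highest-scoring scotches; ties keep their original row order.
top = peaty.nlargest(3, "peat_score")

print(top[["Distillery", "peat_score"]])
```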

In [21]:
# What are the most medicinal scotches?
medicinal = scotch_df[scotch_df['Medicinal'] > 2]
medicinal

Unnamed: 0,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral,Latitude,Longitude
3,Ardbeg,4,1,4,4,0,0,2,0,1,2,1,0,141560,646220
23,Clynelish,3,2,3,3,1,0,2,0,1,1,2,0,290250,904230
57,Lagavulin,4,1,4,4,1,0,1,2,1,1,1,0,140430,645730
58,Laphroig,4,2,4,4,1,0,0,1,1,1,0,0,138680,645160
77,Talisker,4,2,3,3,0,1,3,0,1,2,2,0,137950,831770


Only 5 scotches score this high on Medicinal, and all 5 also appear in the smokiest group.
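The overlap between the two groups can be confirmed with a set comparison instead of by eye. This is a sketch on a sample of rows copied from the outputs above, plus Aberfeldy as a non-smoky row; in the notebook the same comparison would use the full `scotch_df`, and the variable names here are illustrative.

```python
import pandas as pd

# Rows copied from the outputs above; Aberfeldy is included as a
# scotch that falls in neither group.
sample = pd.DataFrame(
    {
        "Distillery": [
            "Aberfeldy", "Ardbeg", "Bowmore", "Caol Ila",
            "Clynelish", "Lagavulin", "Laphroig", "Talisker",
        ],
        "Smoky": [2, 4, 3, 4, 3, 4, 4, 3],
        "Medicinal": [0, 4, 1, 2, 3, 4, 4, 3],
    }
)

smoky_names = set(sample.loc[sample["Smoky"] > 2, "Distillery"])
medicinal_names = set(sample.loc[sample["Medicinal"] > 2, "Distillery"])

# True when every highly medicinal scotch is also highly smoky.
is_subset = medicinal_names <= smoky_names

# Scotches that are smoky but not highly medicinal.
only_smoky = sorted(smoky_names - medicinal_names)

print(is_subset)
print(only_smoky)
```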

In [22]:
#What are the most Fruity scotches? 
fruity = scotch_df[scotch_df['Fruity'] > 2]
fruity

Unnamed: 0,Distillery,Body,Sweetness,Smoky,Medicinal,Tobacco,Honey,Spicy,Winey,Nutty,Malty,Fruity,Floral,Latitude,Longitude
1,Aberlour,3,3,1,0,0,4,3,2,2,3,3,2,326340,842570
2,AnCnoc,1,3,2,0,0,2,0,0,2,2,3,2,352960,839320
6,Auchentoshan,0,2,0,0,0,1,1,0,2,2,3,3,247670,672610
13,Benriach,2,2,1,0,0,2,2,0,0,2,3,2,323450,858380
27,Dalmore,3,2,2,1,0,1,2,2,1,2,3,1,266610,868730
43,Glendullan,3,2,1,0,0,2,1,2,1,2,3,2,333000,840300
46,Glengoyne,1,2,0,0,0,1,1,1,2,2,3,2,252810,682750
59,Linkwood,2,3,1,0,0,1,1,2,0,1,3,2,322640,861040
62,Macallan,4,3,1,0,0,2,1,4,2,2,3,1,327710,844480
69,RoyalBrackla,2,3,2,1,1,1,2,1,0,2,3,2,286040,851320


The Smoky and Medicinal flavors appear to be inversely related to the Sweetness and Fruity flavors, which makes sense.
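A correlation matrix is one way to check this impression. The sketch below uses eight rows copied from the smoky and fruity outputs above, so it deliberately samples the two extremes and will overstate the relationship; running `corr()` on the full `scotch_df` would give the real picture. Pearson correlation is pandas' default and is an assumption here, since the scores are ordinal.

```python
import pandas as pd

# Four rows from the smoky output and four from the fruity output above.
sample = pd.DataFrame(
    {
        "Distillery": [
            "Ardbeg", "Lagavulin", "Laphroig", "Talisker",
            "Aberlour", "AnCnoc", "Auchentoshan", "Macallan",
        ],
        "Sweetness": [1, 1, 2, 2, 3, 3, 2, 3],
        "Smoky": [4, 4, 4, 3, 1, 2, 0, 1],
        "Medicinal": [4, 4, 4, 3, 0, 0, 0, 0],
        "Fruity": [1, 1, 0, 2, 3, 3, 3, 3],
    }
)

# Pairwise Pearson correlations between the four flavor columns.
corr = sample[["Sweetness", "Smoky", "Medicinal", "Fruity"]].corr()

print(corr.round(2))
```

Negative values in the Smoky and Medicinal rows under the Sweetness and Fruity columns would support the inverse relationship.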

Next, I'd like to try clustering to see whether these scotches fall into natural groups.