Introduction:

Whisky (or whiskey) is a type of distilled alcoholic beverage made from just three natural ingedients - barley, water and yeast. Whisky is distilled throughout the world, most popularly in Scotland, Ireland, the United States, and Japan. It comes in many styles and some countries have strict regulations regarding its production. For instance, a whisky can be classified as 'Scotch' if it is distilled and matured in Scotland for at least three years and bottled at a minimum alcoholic strength of 40% ABV (Source).

Whether it's Scotch, Irish, Japanese or bourbon, whisky is the most popular drink in the word and can be enjoyed on its own or be used in cocktails.

In [30]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np
import re

In [4]:
df = pd.read_csv(r'C:\Users\tusha\dataV2-labs\module-2\Project Whisky review\data\scotch_review.csv',index_col = 0)
df.head()

Unnamed: 0,name,category,review.point,price,currency,description
1,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,$,"Magnificently powerful and intense. Caramels, ..."
2,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,$,What impresses me most is how this whisky evol...
3,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,$,There have been some legendary Bowmores from t...
4,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,$,With a name inspired by a 1926 Buster Keaton m...
5,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,$,"Captivating, enticing, and wonderfully charmin..."


In [5]:
df.index = df.index - 1   #remove 1 so that the index starts from 0

In [6]:
df.head()

Unnamed: 0,name,category,review.point,price,currency,description
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,$,"Magnificently powerful and intense. Caramels, ..."
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,$,What impresses me most is how this whisky evol...
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,$,There have been some legendary Bowmores from t...
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,$,With a name inspired by a 1926 Buster Keaton m...
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,$,"Captivating, enticing, and wonderfully charmin..."


In [7]:
df.shape

(2247, 6)

name: Name on the bottle

category : (e.g., Single Malt/Blended)

review.point : Points out of 100 given by each reviewer

price : price of the bottle

currency : Unit of price e.g. $(dollars)

description : Descriptions of reviews (Adjectives defining the ingredients etc.)



In [8]:
#Renaming review.point

df.rename(columns = {'review.point': 'points'}, inplace = True)
df.columns

Index(['name', 'category', 'points', 'price', 'currency', 'description'], dtype='object')

In [9]:
#gathering more information about the dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2247 entries, 0 to 2246
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         2247 non-null   object
 1   category     2247 non-null   object
 2   points       2247 non-null   int64 
 3   price        2247 non-null   object
 4   currency     2247 non-null   object
 5   description  2247 non-null   object
dtypes: int64(1), object(5)
memory usage: 122.9+ KB


We can see that price is defined as an object and it should be an integer, so we need to convert it 
Non null values are zero as per the above result.



In [10]:
df.price.unique()

array(['225', '4500.00', '13500.00', '325', '160', '85.00', '6250.00',
       '11000.00', '1500.00', '3360', '750.00', '3108', '105.00', '120',
       '3500.00', '70.00', '20000.00', '$15,000 or $60,000/set', '26650',
       '400.00', '200.00', '455.00', '750', '460.00', '2525.00',
       '1250.00', '280.00', '500.00', '215.00', '300.00', '2000.00',
       '4000', '225.00', '60.00', '180', '3500', '181.00', '800.00',
       '250.00', '6000.00', '30,000', '645.00', '11824', '1250', '550.00',
       '700.00', '140.00', '387', '5730.00', '100', '325.00', '350.00',
       '6088.00', '112.00', '109.00', '130', '100.00', '120.00',
       '1900.00', '1400', '84.00', '3000', '164', '50.00', '150', '140',
       '175', '1100', '157000.00', '850', '34.00', '600.00', '55.00',
       '60', '60,000/set', '90.00', '191.00', '1925', '2200', '1,700',
       '430.00', '1,100', '150.00', '135', '3000.00', '60,000', '95.00',
       '262', '599.00', '9420.00', '45', '2850.00', '1000.00', '3657.00',
      

In [None]:
# from the above result it seems price consists of mutiple values other than numeric e.g.  $60,000/set'

In [11]:
#reference: https://stackoverflow.com/questions/40095712/when-to-applypd-to-numeric-and-when-to-astypenp-float64-in-python

symbol_idx = pd.to_numeric(df['price'], errors = 'coerce').isnull() # errors = 'coerce' results in NaNs for non-numeric values
df[symbol_idx][['name','price']].head()

Unnamed: 0,name,price
19,"Balvenie 1973 43 year old, 46.6%","$15,000 or $60,000/set"
49,"Bowmore 1966 50 year old, 41.5%",30000
95,"Balvenie 1961 55 year old, 41.7%","60,000/set"
100,Brora 34 year old (Diageo Special Releases 201...,1700
102,"Bruichladdich 1984, 43.7%",1100


In [20]:
#df[(df.values  == "banana")|(df.values  == "apple" ) ]

df[(df.values == '60,000/set') | (df.values == '15,000𝑜𝑟 60,000/set')]

Unnamed: 0,name,category,points,price,currency,description
95,"Balvenie 1961 55 year old, 41.7%",Single Malt Scotch,93,"60,000/set",$,Aged in a European oak oloroso sherry hogshead...
410,"Balvenie 1981 35 year old, 43.8%",Single Malt Scotch,90,"60,000/set",$,A refill American oak hogshead matured this wh...
1000,"Balvenie 1993 23 year old, 51.9%",Single Malt Scotch,87,"60,000/set",$,This was aged in a refill American oak hogshea...
1215,"Balvenie 2004 13 year old, 58.2%",Single Malt Scotch,86,"60,000/set",$,This expression was aged in a European oak olo...


In [None]:
# I can also see 19th row also has the same value
#Referencec for future use
# https://stackoverflow.com/questions/37216485/pandas-at-versus-loc

# df.at can only access a single value at a time.

# df.loc can select multiple rows and/or columns.

# Note that there is also df.get_value, which may be even quicker at accessing single values:

In [21]:
df.at[[19, 95, 410, 1000, 1215], 'price'] = 15000   # instances with '60,000/set' which equals 15000 dollars as per row 19

In [22]:
df['price'].replace('/liter', '', inplace = True, regex = True) # this bottle was actually 1 lt, so we don't need the price per litre
df['price'].replace(',', '', inplace = True, regex = True)
#if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. 
#This doesn’t matter much for value since there are only a few possible substitution regexes you can use.
#str, regex and numeric rules apply as above.
df['price'] = df['price'].astype('float')

In [23]:
df.price.unique()

array([2.2500e+02, 4.5000e+03, 1.3500e+04, 3.2500e+02, 1.6000e+02,
       8.5000e+01, 6.2500e+03, 1.1000e+04, 1.5000e+03, 3.3600e+03,
       7.5000e+02, 3.1080e+03, 1.0500e+02, 1.2000e+02, 3.5000e+03,
       7.0000e+01, 2.0000e+04, 1.5000e+04, 2.6650e+04, 4.0000e+02,
       2.0000e+02, 4.5500e+02, 4.6000e+02, 2.5250e+03, 1.2500e+03,
       2.8000e+02, 5.0000e+02, 2.1500e+02, 3.0000e+02, 2.0000e+03,
       4.0000e+03, 6.0000e+01, 1.8000e+02, 1.8100e+02, 8.0000e+02,
       2.5000e+02, 6.0000e+03, 3.0000e+04, 6.4500e+02, 1.1824e+04,
       5.5000e+02, 7.0000e+02, 1.4000e+02, 3.8700e+02, 5.7300e+03,
       1.0000e+02, 3.5000e+02, 6.0880e+03, 1.1200e+02, 1.0900e+02,
       1.3000e+02, 1.9000e+03, 1.4000e+03, 8.4000e+01, 3.0000e+03,
       1.6400e+02, 5.0000e+01, 1.5000e+02, 1.7500e+02, 1.1000e+03,
       1.5700e+05, 8.5000e+02, 3.4000e+01, 6.0000e+02, 5.5000e+01,
       9.0000e+01, 1.9100e+02, 1.9250e+03, 2.2000e+03, 1.7000e+03,
       4.3000e+02, 1.3500e+02, 6.0000e+04, 9.5000e+01, 2.6200e

CURRENCY: Price column looks clean and I will now check the currency column

In [24]:
df['currency'].value_counts()

$    2247
Name: currency, dtype: int64

Looks like there is only 1 currency. Dropping the column

In [25]:
df.drop('currency', axis = 1, inplace = True)

In [26]:
df.head()

Unnamed: 0,name,category,points,price,description
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ..."
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin..."


In [None]:
#Finding a relationship between price and points given to it, if any?

In [27]:
#price to point ratio
df['price_p_points'] = df['price']/df['points']
df.head()

Unnamed: 0,name,category,points,price,description,price_p_points
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ...",2.319588
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...,46.391753
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...,139.175258
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...,3.385417
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin...",1.666667


Checking the name column to extract information like age, alcohol percentage etc. 

In [28]:
#result = sr.str.extract(pat = '([aeiou].)')
df['age'] = df['name'].str.extract(r'(\d+) year')[0].astype(float) # extract age and convert to float

In [29]:
df.head()

Unnamed: 0,name,category,points,price,description,price_p_points,age
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ...",2.319588,
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...,46.391753,42.0
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...,139.175258,46.0
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...,3.385417,
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin...",1.666667,


In [36]:
df['name'].unique()[0: 50]

array(['Johnnie Walker Blue Label, 40%',
       'Black Bowmore, 1964 vintage, 42 year old, 40.5%',
       'Bowmore 46 year old (distilled 1964), 42.9%',
       'Compass Box The General, 53.4%', 'Chivas Regal Ultis, 40%',
       'Ardbeg Corryvreckan, 57.1%', 'Gold Bowmore, 1964 vintage, 42.4% ',
       'Bowmore, 40 year old, 44.8%', 'The Dalmore, 50 year old, 52.8%',
       'Glenfarclas Family Casks 1954 Cask #1260, 47.2%',
       'The Glenlivet Cellar Collection, 1969 vintage, 50.8%',
       'Macallan 1976 Vintage, 29 year old, cask #11354, 45.4%',
       'The Last Drop (distilled at Lochside) 1972 (cask 346), 44%',
       'Compass Box Flaming Heart (10th Anniversary bottling), 48.9%',
       'Compass Box The Peat Monster 10th Anniversary Special Cask Strength Bottling, 54.7%',
       'Johnnie Walker Blue Anniversary, 60%', 'Chivas, 18 year old, 40%',
       'Ardbeg, 1974 Vintage, Cask #3145, 49.9%',
       'Ardbeg Uigeadail, 54.2%', 'Balvenie 1973 43 year old, 46.6%',
       'Bowmore 

In [37]:
df['name'] = df['name'].str.replace(' ABV ', '') #ABV: alcohol by Volume

In [38]:
df['alcohol%'] = df['name'].str.extract(r"([\(\,\,\'\"\’\”\$] ? ?\d+(\.\d+)?%)")[0]
df['alcohol%'] = df['alcohol%'].str.replace("[^\d\.]", "").astype(float) # keep only numerics and convert to float

In [39]:
# https://www.geeksforgeeks.org/python-pandas-dataframe-sample/
df[['name', 'age', 'alcohol%']].sample(10, random_state = 42)

Unnamed: 0,name,age,alcohol%
1637,"Benromach Origins 12 year old, Batch 2, 50%",12.0,50.0
482,"Highland Park Thor 16 year old, 52.1%",16.0,52.1
674,Murray McDavid 'Leapfrog' (distilled at Laphro...,12.0,46.0
247,"Talisker, 25 year old (2009 Release), 54.8%",25.0,54.8
1655,Wemyss Malts Spiced Chocolate Cup (distilled a...,,46.0
867,"Lagavulin, 30 year old, 52.6%",30.0,52.6
1763,"Highland Queen 12 year old, 40%",12.0,40.0
1769,Cadenhead Authentic Collection 28 year old (di...,28.0,48.3
1309,"Old Pulteney Dunnet Head, 46%",,46.0
1437,Balvenie DCS Compendium 1st Chapter 1985 30 ye...,30.0,54.1


In [40]:
#checking the number of missing values for our new columns
df[['age', 'alcohol%']].isnull().sum()

age         1033
alcohol%      17
dtype: int64

In [41]:
df.head()

Unnamed: 0,name,category,points,price,description,price_p_points,age,alcohol%
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ...",2.319588,,40.0
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...,46.391753,42.0,40.5
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...,139.175258,46.0,42.9
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...,3.385417,,53.4
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin...",1.666667,,40.0


In [44]:
#Saving the cleaned data to a new .csv file called df_clean.csv in the data folder.

#df.to_csv('../data/wnba_clean.csv')

df.to_csv(r'C:\Users\tusha\dataV2-labs\module-2\Project Whisky review\data\df_clean.csv')
