Importing Libraries




In [1]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import pandas as pd
import numpy as np

Exploring the dataset

In [2]:
scotchreview = pd.read_csv(r'C:\Users\tusha\dataV2-labs\module-2\Project Whisky review\data\scotch_review.csv')
scotchreview.head()

Unnamed: 0.1,Unnamed: 0,name,category,review.point,price,currency,description
0,1,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,$,"Magnificently powerful and intense. Caramels, ..."
1,2,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,$,What impresses me most is how this whisky evol...
2,3,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,$,There have been some legendary Bowmores from t...
3,4,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,$,With a name inspired by a 1926 Buster Keaton m...
4,5,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,$,"Captivating, enticing, and wonderfully charmin..."


In [3]:
scotchreview.columns

Index(['Unnamed: 0', 'name', 'category', 'review.point', 'price', 'currency',
       'description'],
      dtype='object')

Description of columns:

Unnamed: 0 (ID) : Unique number assigned to each whisky

name : Name on the bottle, often followed by the age and alcohol percentage

category : Type of whisky (e.g., Single Malt Scotch, Blended Scotch Whisky)

review.point : Score out of 100 given by the reviewer

price : Bottle price

currency : Currency of the bottle price, e.g. $ (dollars)

description : Free-text tasting notes describing the whisky's aromas, flavours and finish (e.g. Magnificently powerful and intense. Caramels, dried peats, elegant cigar smoke, seeds scraped from vanilla beans, brand new pencils, peppercorn, coriander seeds, and star anise make for a deeply satisfying nosing experience. Silky caramels, bountiful fruits of ripe peach, stewed apple, orange pith, and pervasive smoke with elements of burnt tobacco. An abiding finish of smoke, dry spices, and banoffee pie sweetness. Close to perfection)

In [4]:
#Checking for null values
scotchreview.isnull().sum()



Unnamed: 0      0
name            0
category        0
review.point    0
price           0
currency        0
description     0
dtype: int64

In [5]:
#isnull() reports no null values in any column, although that does not rule out unusable entries stored as text
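
A zero count from isnull() only covers values that pandas already stores as missing (NaN or None). Entries that are present but unusable, such as text in a column meant to be numeric, are not counted. The sketch below illustrates this on a small made-up Series (the values are hypothetical, not rows from the dataset) by coercing to numbers with pd.to_numeric and counting what fails to parse.

```python
import pandas as pd

# Hypothetical price-like values; not rows from scotch_review.csv
prices = pd.Series(["225", "4500", "60,000/set", "n/a"])

# isnull() sees four ordinary strings, so nothing is counted as missing
missing_before = int(prices.isnull().sum())

# Coercing to numeric turns anything that cannot be parsed into NaN
as_numbers = pd.to_numeric(prices, errors="coerce")
missing_after = int(as_numbers.isnull().sum())

print(missing_before, missing_after)
```

Comparing the two counts gives a rough idea of how many entries would be lost if the column were converted as it stands.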

Exploring the whisky characteristics starting with the name column
    
    

In [6]:
#Splitting the name on commas to separate the bottle name, age and concentration
dfname = scotchreview['name'].str.split(',', expand=True)


In [7]:
dfname.head(5)

Unnamed: 0,0,1,2,3,4
0,Johnnie Walker Blue Label,40%,,,
1,Black Bowmore,1964 vintage,42 year old,40.5%,
2,Bowmore 46 year old (distilled 1964),42.9%,,,
3,Compass Box The General,53.4%,,,
4,Chivas Regal Ultis,40%,,,


In [8]:
#renaming columns
dfname.columns=['BottleName','a','b','c','d']
dfname.head()

Unnamed: 0,BottleName,a,b,c,d
0,Johnnie Walker Blue Label,40%,,,
1,Black Bowmore,1964 vintage,42 year old,40.5%,
2,Bowmore 46 year old (distilled 1964),42.9%,,,
3,Compass Box The General,53.4%,,,
4,Chivas Regal Ultis,40%,,,


In [9]:
#extracting concentration percentage from the columns
#keep a value only if it contains '%', otherwise set it to NaN
for col in ['a', 'b', 'c', 'd']:
    dfname[col] = dfname[col].apply(lambda x: x if '%' in str(x) else np.nan)
dfname['concentration'] = [dfname.a[i] if dfname.a[i] is not np.nan else dfname.b[i] if dfname.b[i] is not np.nan else dfname.c[i] if dfname.c[i] else dfname.d[i] if dfname.d[i] is not np.nan else dfname.a[i] for i in range(dfname.shape[0])]

In [10]:
dfname.reindex().head()

Unnamed: 0,BottleName,a,b,c,d,concentration
0,Johnnie Walker Blue Label,40%,,,,40%
1,Black Bowmore,,,40.5%,,40.5%
2,Bowmore 46 year old (distilled 1964),42.9%,,,,42.9%
3,Compass Box The General,53.4%,,,,53.4%
4,Chivas Regal Ultis,40%,,,,40%


In [11]:
dfname

Unnamed: 0,BottleName,a,b,c,d,concentration
0,Johnnie Walker Blue Label,40%,,,,40%
1,Black Bowmore,,,40.5%,,40.5%
2,Bowmore 46 year old (distilled 1964),42.9%,,,,42.9%
3,Compass Box The General,53.4%,,,,53.4%
4,Chivas Regal Ultis,40%,,,,40%
...,...,...,...,...,...,...
2242,Duncan Taylor (distilled at Cameronbridge),,,54.4%,,54.4%
2243,Distillery Select 'Craiglodge' (distilled at L...,,,45%,,45%
2244,Edradour Barolo Finish,,57.1%,,,57.1%
2245,Highland Park,,,,55%,
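
Row 2245 in the output above has 55% in column d but an empty concentration. A likely cause is the chained conditional in cell 9: the test on column c is written as `if dfname.c[i]` rather than `if dfname.c[i] is not np.nan`, and NaN is truthy in Python, so the expression returns the NaN from c without ever reaching d. One possible alternative, sketched here on a small hand-built frame that mimics columns a to d (the frame `parts` is illustrative, not the notebook's data), is to back-fill across each row and keep the first column.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the a, b, c, d columns after the '%' filter
parts = pd.DataFrame({
    "a": ["40%", np.nan, np.nan, np.nan, np.nan],
    "b": [np.nan, np.nan, "57.1%", np.nan, np.nan],
    "c": [np.nan, "40.5%", np.nan, np.nan, np.nan],
    "d": [np.nan, np.nan, np.nan, "55%", np.nan],
})

# Back-fill along each row so the first non-null value moves into column a,
# then take that first column as the concentration
concentration = parts[["a", "b", "c", "d"]].bfill(axis=1).iloc[:, 0]

print(concentration.tolist())
```

Rows with no percentage in any column stay NaN, so the remaining null count would reflect names that genuinely lack a concentration.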


In [23]:
dfname.BottleName.unique()

array(['Johnnie Walker Blue Label', 'Black Bowmore',
       'Bowmore 46 year old (distilled 1964)', ...,
       'Duncan Taylor (distilled at Cameronbridge)',
       'Edradour Barolo Finish',
       "Distillery Select 'Inchmoan' (distilled at Loch Lomond)"],
      dtype=object)

Reindexing conforms a DataFrame to a given set of labels along a particular axis, changing its row labels, its column labels, or both.

Reindexing can be used to:

Reorder the existing data to match a new set of labels.

Insert missing value (NA) markers in label locations where no data for the label existed.
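
The two uses listed above can be sketched on a tiny frame. The frame `small` and its labels are made up for illustration; only DataFrame.reindex itself comes from pandas.

```python
import pandas as pd

# Illustrative three-row frame with the default integer index
small = pd.DataFrame({"concentration": ["40%", "40.5%", "42.9%"]})

# Reorder the existing rows to match a new label order
reordered = small.reindex([2, 0, 1])

# Ask for a label that does not exist: the new row is filled with NaN
extended = small.reindex([0, 1, 2, 3])

print(reordered)
print(extended)
```

In both cases reindex returns a new object and leaves `small` unchanged.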

In [68]:
#Creating a new table with name and concentration
dfnc = pd.concat([dfname.BottleName, dfname.concentration], axis=1)
dfnc.head()

Unnamed: 0,BottleName,concentration
0,Johnnie Walker Blue Label,40%
1,Black Bowmore,40.5%
2,Bowmore 46 year old (distilled 1964),42.9%
3,Compass Box The General,53.4%
4,Chivas Regal Ultis,40%


In [69]:
dfnc.shape

(2247, 2)

In [70]:
dfnc.isnull().sum()

BottleName        0
concentration    86
dtype: int64
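
The 86 missing concentrations can be cross-checked against the original name strings. As a sketch, assuming the alcohol content always appears as a number followed by a percent sign, Series.str.extract can pull it out in one pass without splitting on commas. The names below are copied from the first rows shown earlier; the variable names `names` and `abv` are illustrative.

```python
import pandas as pd

# First five names as displayed by scotchreview.head()
names = pd.Series([
    "Johnnie Walker Blue Label, 40%",
    "Black Bowmore, 1964 vintage, 42 year old, 40.5%",
    "Bowmore 46 year old (distilled 1964), 42.9%",
    "Compass Box The General, 53.4%",
    "Chivas Regal Ultis, 40%",
])

# Capture the first number that is directly followed by '%'
# (ages and vintage years are skipped because no '%' follows them)
abv = names.str.extract(r"(\d+(?:\.\d+)?)\s*%", expand=False).astype(float)

print(abv.tolist())
```

Names without any percentage would come back as NaN, which makes them easy to count with isnull().sum().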

Checking the price column

In [None]:
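#Replacing currency symbols, lowercase letters and punctuation with spaces
df_p = scotchreview.price.str.replace(r"[({',$qwertyuioplkjhgfdsazxcvbnm%:]", " ", regex=True)
#Converting to numbers; values that still cannot be parsed become NaN
df_price = pd.to_numeric(df_p, errors='coerce')
df_price.sample()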