Data Preparation

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Read the CSV file directly from a remote repo without saving it locally

reviews = pd.read_csv('https://raw.githubusercontent.com/Manoj-A-Thomas/data/data/winemag-data_first150k.csv',
                      index_col = 0)

In [3]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150930 entries, 0 to 150929
Data columns (total 10 columns):
country        150925 non-null object
description    150930 non-null object
designation    105195 non-null object
points         150930 non-null int64
price          137235 non-null float64
province       150925 non-null object
region_1       125870 non-null object
region_2       60953 non-null object
variety        150930 non-null object
winery         150930 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 12.7+ MB


The analysis uses only four columns: country, province, points, and price.
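
As a minimal sketch of working with just those four columns, a list of labels selects them in the given order. The `toy` frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews table (values invented)
toy = pd.DataFrame({
    "country": ["US", "Italy"],
    "description": ["dry", "sweet"],
    "points": [90, 87],
    "price": [20.0, None],
    "province": ["California", "Veneto"],
})

cols = ["country", "province", "points", "price"]
subset = toy[cols]   # a list of labels returns those columns, in that order
print(subset.dtypes)
```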

In [4]:
# Convert the column's dtype. astype() returns a new Series and has no inplace
# option, so reviews is unchanged unless the result is assigned back
reviews['points'].astype('float64')

0         96.0
1         96.0
2         96.0
3         96.0
4         95.0
5         95.0
6         95.0
7         95.0
8         95.0
9         95.0
10        95.0
11        95.0
12        95.0
13        95.0
14        95.0
15        95.0
16        95.0
17        95.0
18        95.0
19        95.0
20        95.0
21        95.0
22        95.0
23        95.0
24        95.0
25        94.0
26        94.0
27        94.0
28        94.0
29        94.0
          ... 
150900    81.0
150901    81.0
150902    81.0
150903    81.0
150904    81.0
150905    80.0
150906    93.0
150907    92.0
150908    90.0
150909    89.0
150910    89.0
150911    87.0
150912    87.0
150913    94.0
150914    94.0
150915    93.0
150916    93.0
150917    92.0
150918    92.0
150919    91.0
150920    91.0
150921    91.0
150922    91.0
150923    91.0
150924    91.0
150925    91.0
150926    91.0
150927    91.0
150928    90.0
150929    90.0
Name: points, Length: 150930, dtype: float64
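
The cell above only displays the converted values; `reviews.info()` later still reports `points` as int64. A small sketch of the difference between displaying and keeping the conversion, using an invented `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({"points": [96, 95, 94]})   # invented values

converted = toy["points"].astype("float64")    # returns a new Series
print(toy["points"].dtype)                      # original column is still an integer dtype
print(converted.dtype)                          # float64

# To keep the conversion, assign the result back to the column
toy["points"] = toy["points"].astype("float64")
print(toy["points"].dtype)                      # float64
```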

In [5]:
# Count missing values in each of the four columns
reviews[['country','province','points','price']].isnull().sum()

country         5
province        5
points          0
price       13695
dtype: int64
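
Raw counts are easier to judge as a share of the rows: 13695 missing prices out of 150930 rows is roughly 9%, while 5 missing countries is negligible. A sketch of computing both on an invented `toy` frame (the mean of a boolean mask is the fraction of True values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["US", None, "Italy", "US"],   # invented values
    "price": [10.0, np.nan, np.nan, 25.0],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column
print(missing_count)
print(missing_share)
```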

In [6]:
# Show every row where country or province is missing
reviews[(reviews.country.isnull())|(reviews.province.isnull())]

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
1133,,Delicate white flowers and a spin of lemon pee...,Askitikos,90,17.0,,,,Assyrtiko,Tsililis
1440,,"A blend of 60% Syrah, 30% Cabernet Sauvignon a...",Shah,90,30.0,,,,Red Blend,Büyülübağ
68226,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas
113016,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas
135696,,"From first sniff to last, the nose never makes...",Piedra Feliz,81,15.0,,,,Pinot Noir,Chilcas


In [7]:
# Drop the handful of rows with a missing country or province (5 of 150930)
# inplace=True modifies reviews directly instead of returning a new DataFrame

reviews.dropna(subset=['country','province'], inplace=True)
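
The `subset` argument matters here: without it, `dropna` would also discard every row with a missing price or region. A sketch on an invented `toy` frame showing that a row with a missing price survives:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["US", None, "Italy"],            # invented values
    "province": ["California", None, "Veneto"],
    "price": [np.nan, 17.0, 12.0],
})

# Only country and province are checked; the NaN price in row 0 is ignored
cleaned = toy.dropna(subset=["country", "province"])
print(cleaned)
```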

In [8]:
reviews.sample(10)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
3751,Austria,While the nose of this Viennese field blend is...,,91,18.0,Wiener Gemischter Satz,,,White Blend,Wieninger
33356,US,Extraordinarily sweet and attractive in raspbe...,Bien Nacido Vineyard,94,55.0,California,Santa Maria Valley,Central Coast,Pinot Noir,Gary Farrell
5841,US,This dry wine smells and tastes vividly like b...,Estate,88,55.0,California,Livermore Valley,Central Coast,Nebbiolo,Las Positas
37245,US,"Lightly lemony, slightly waxy in the nose, wit...",,87,20.0,Oregon,Rogue Valley,Southern Oregon,Pinot Blanc,Torii Mor
25269,Italy,Grillo is a popular white variety planted in m...,,87,20.0,Sicily & Sardinia,Sicilia,,Grillo,D'Alessandro
49356,US,A distinctive bottle with notes of fresh herbs...,,85,16.0,Oregon,Oregon,Oregon Other,Pinot Noir,Primarius
127115,Australia,Melon and pineapple fruit flavors offer someth...,,87,14.0,South Australia,Adelaide,,Viognier,Shoofly
123609,Italy,Here is a deeply saturated Chardonnay with a g...,,90,40.0,Sicily & Sardinia,Sicilia,,Chardonnay,Planeta
50091,US,This Bordeaux blend is soft and somewhat sweet...,Mirepoix,85,30.0,California,Santa Cruz Mountains,Central Coast,Bordeaux-style Red Blend,Fernwood
4327,France,With 45% Mourvèdre and 35% Cinsault this has t...,Cuvée G,91,21.0,Provence,Bandol,,Rosé,Les Vignobles Gueissard


In [9]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 150925 entries, 0 to 150929
Data columns (total 10 columns):
country        150925 non-null object
description    150925 non-null object
designation    105190 non-null object
points         150925 non-null int64
price          137230 non-null float64
province       150925 non-null object
region_1       125870 non-null object
region_2       60953 non-null object
variety        150925 non-null object
winery         150925 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 12.7+ MB


In [10]:
#find missing values for just region_1
reviews['region_1'].isnull().sum()

25055

In [11]:
# Fill the missing region_1 values with 'unknown' and assign the result back
# (calling fillna(inplace=True) on a selected column is unreliable in newer pandas)
reviews['region_1'] = reviews['region_1'].fillna('unknown')
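
A sketch of two ways to fill missing values that do not rely on modifying a selected column in place: assigning the filled column back, or passing a dict of column-to-fill-value to `DataFrame.fillna`. The `toy` frame is invented for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "region_1": ["Napa Valley", np.nan],   # invented values
    "region_2": [np.nan, "Sonoma"],
})

# Option 1: fill one column and assign the result back
toy["region_1"] = toy["region_1"].fillna("unknown")

# Option 2: a dict fills the named columns and returns a new DataFrame
filled = toy.fillna({"region_2": "unknown"})
print(filled)
```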

In [12]:
reviews.sample(10)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
62463,France,"Mature already, displaying leather, earth and ...",,85,22.0,Bordeaux,Margaux,,Bordeaux-style Red Blend,Château Tricot d'Arsac
136114,Uruguay,"Jumbled and earthy on the nose, with a strong ...",Reserve Oak Barrel,83,12.0,Uruguay,unknown,,Tannat,Don Adelio Ariano
41377,Italy,Sweet jasmine and floral aromas make this Soav...,Capitel al Pigno,87,12.0,Veneto,Soave Classico Superiore,,Garganega,Bixio
71570,Germany,"Smoky and leesy upfront, with powerful diesel ...",Halbtrocken,86,18.0,Nahe,unknown,,Riesling,Schäfer-Fröhlich
63061,US,"Rich in flavor, although a little too soft and...",Collusion,84,28.0,California,Paso Robles,Central Coast,Red Blend,Clavo Cellars
55589,Italy,Here's a blend of Montepulciano (75%) and Agli...,Gironia,89,26.0,Southern Italy,Biferno Rosso,,Red Blend,Borgo di Colloredo
43752,US,"Clean, crisp and stony with aromas of rain on ...",Happy Canyon Vineyard Blanc,90,36.0,California,Happy Canyon of Santa Barbara,Central Coast,Bordeaux-style White Blend,Barrack
124868,Australia,"Starts off dark and a bit brooding, with aroma...",The Laughing Magpie,90,29.0,South Australia,McLaren Vale,,Shiraz-Viognier,D'Arenberg
92032,US,"Rustic, with sweet-and-sour flavors of cherrie...",,83,12.0,California,Napa Valley,Napa,Red Blend,Tractor Shed Red
73954,Portugal,"Ripe and full-bodied, with balanced acidity an...",Special Reserve Ruby,87,,Port,unknown,,Port,Andresen


In [13]:
# Replace the missing region_2 values with 'unknown' using replace(),
# an alternative to fillna(), and assign the result back
reviews['region_2'] = reviews['region_2'].replace(np.nan, 'unknown')

In [14]:
reviews.sample(10)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
92795,Italy,This ripasso delivers an interesting combinati...,,88,25.0,Veneto,Valpolicella Superiore Ripasso,unknown,"Corvina, Rondinella, Molinara",Novaia
106574,Portugal,"A simple fruity red wine, with spice, red-berr...",Monte Alentejano Trincadeira and Aragonez,85,9.0,Alentejano,unknown,unknown,Portuguese Red,DFJ Vinhos
80542,US,Merry Edwards has figured out the magic trick ...,Flax Vineyard Méthode a L'Ancienne,96,54.0,California,Russian River Valley,Sonoma,Pinot Noir,Merry Edwards
131210,US,Geyser Peak's Reserve Chards have been pretty ...,Reserve,88,23.0,California,Alexander Valley,Sonoma,Chardonnay,Geyser Peak
107441,US,"Dry enough, but the fruit has a baked, raisiny...",White Hawk,84,48.0,California,Santa Barbara County,Central Coast,Syrah,Damian Rae
44873,US,"Rich and creamy, this has a smooth, buttery mo...",Lucia Highlands Vineyard,91,24.0,California,Santa Lucia Highlands,Central Coast,Chardonnay,Pessagno
10698,US,There's a fascinating presence of herb-roasted...,Brosseau Vineyard,91,45.0,California,Chalone,Central Coast,Chardonnay,Testarossa
60580,Italy,Here's a sparkling wine with all its many face...,Cuvée Brut,92,34.0,Lombardy,Franciacorta,unknown,Sparkling Blend,Bellavista
72848,US,"Nicely crisp in acidity, this has a minerally ...",Vintner's Selection,86,18.0,California,Santa Barbara County,Central Coast,Sauvignon Blanc,Rock Hollow
126592,Italy,This cheerful 60-40 blend of Chardonnay and Sa...,Trappoline,86,15.0,Tuscany,Toscana,unknown,White Blend,Badia a Coltibuono


In [15]:
# Renumber the index from 0 after dropping rows; drop=True discards the old labels
reviews.reset_index(drop=True, inplace=True)
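
Dropping rows leaves gaps in the index labels, which is why the earlier `info()` output reported 150925 entries labelled 0 to 150929. A sketch of what `reset_index` does with and without `drop=True`, on an invented `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({"winery": ["A", "B", "C", "D"]})   # invented values

kept = toy.drop(index=[1])
print(list(kept.index))                  # [0, 2, 3]

renumbered = kept.reset_index(drop=True)  # old labels are discarded
with_old = kept.reset_index()             # old labels are kept in a column named 'index'
print(list(renumbered.index))            # [0, 1, 2]
print(list(with_old.columns))            # ['index', 'winery']
```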

In [16]:
reviews.sample(10)

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,variety,winery
40477,France,"There is a meaty feel to this wine, under the ...",,88,,Provence,Côtes de Provence,unknown,Rosé,Château Riotor
48896,Italy,Here's an elegant and powerful wine from the d...,,90,,Southern Italy,Aglianico del Vulture,unknown,Aglianico,Macarico
82838,US,"Gets the job done with this dry, crisp everyda...",,84,10.0,California,Central Coast,Central Coast,Pinot Grigio,Tamás Estates
36988,Argentina,"Rusty in color, with a mildly mulchy, leafy no...",Alto Las Tacas,84,10.0,Mendoza Province,Mendoza,unknown,Malbec,Bodega Vistandes
107310,US,"Steely, crisp and very dry, this Chard has a s...",,89,22.0,California,Mendocino County,Mendocino/Lake Counties,Chardonnay,Zina Hyde Cunningham
84404,US,This is a rich Pinot Gris. Tastes like it had ...,Trenton Station,88,20.0,California,Russian River Valley,Sonoma,Pinot Gris,Joseph Swan Vineyards
135883,Greece,Fresh lemon and lime and a spray of minerals a...,Estate,84,32.0,Santorini,unknown,unknown,Assyrtico,Argyros
104312,US,Fans of this variety will find something reall...,,91,22.0,California,El Dorado,Sierra Foothills,Petite Sirah,Ursa
137703,US,"Tastes riper than Peachy's other '04 Zins, wit...",Mustang Springs,92,30.0,California,Paso Robles,Central Coast,Zinfandel,Peachy Canyon
150229,US,"Aromas of tobacco, dried flowers and earth seg...",,88,25.0,New York,"The Hamptons, Long Island",Long Island,Cabernet Franc,Wölffer


In [17]:
# Inspect the price distribution before filtering
# reviews['price'].plot(kind='hist', bins=45)
reviews['price'].describe()

count    137230.000000
mean         33.132019
std          36.323072
min           4.000000
25%          16.000000
50%          24.000000
75%          40.000000
max        2300.000000
Name: price, dtype: float64

In [18]:
# reviews['price'].value_counts()

In [22]:
# Count the wines priced above $40 (wines at exactly $40 are not included)
len(reviews[reviews['price'] > 40])

31317

In [23]:
# Count the wines priced below $40
len(reviews[reviews['price'] < 40])

101895
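
The two counts do not add up to the 137230 priced wines reported by `describe()`: 137230 - 31317 - 101895 = 4018, which implies 4018 wines priced at exactly $40. Rows with a missing price match neither comparison, because any comparison with NaN is False. A sketch of the full partition on an invented `price` Series:

```python
import numpy as np
import pandas as pd

price = pd.Series([10.0, 40.0, 55.0, np.nan, 39.99, 40.0])   # invented values

above = (price > 40).sum()      # strictly above 40
below = (price < 40).sum()      # strictly below 40
exactly = (price == 40).sum()   # matched by neither strict comparison
missing = price.isnull().sum()  # NaN compares False with everything

print(above, below, exactly, missing)
print(above + below + exactly + missing == len(price))
```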

In [21]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150925 entries, 0 to 150924
Data columns (total 10 columns):
country        150925 non-null object
description    150925 non-null object
designation    105190 non-null object
points         150925 non-null int64
price          137230 non-null float64
province       150925 non-null object
region_1       150925 non-null object
region_2       150925 non-null object
variety        150925 non-null object
winery         150925 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 11.5+ MB


In [24]:
# Keep only the wines priced under $40 in a new DataFrame
# (price < 40 also excludes wines at exactly $40 and rows with a missing price)

reviews_price = reviews.loc[reviews['price'] < 40].copy()

In [25]:
reviews_price.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101895 entries, 20 to 150924
Data columns (total 10 columns):
country        101895 non-null object
description    101895 non-null object
designation    66059 non-null object
points         101895 non-null int64
price          101895 non-null float64
province       101895 non-null object
region_1       101895 non-null object
region_2       101895 non-null object
variety        101895 non-null object
winery         101895 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 8.6+ MB


In [29]:
# Reset the index so it runs from 0 to n-1
# (after filtering, the labels still run from 20 to 150924 with gaps)
reviews_price.reset_index(drop=True,inplace=True)

In [30]:
reviews_price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101895 entries, 0 to 101894
Data columns (total 10 columns):
country        101895 non-null object
description    101895 non-null object
designation    66059 non-null object
points         101895 non-null int64
price          101895 non-null float64
province       101895 non-null object
region_1       101895 non-null object
region_2       101895 non-null object
variety        101895 non-null object
winery         101895 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 7.8+ MB


In [36]:
# Check whether any province value repeats (this does not test for duplicate rows)
reviews_price.province.is_unique

False
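
`is_unique` on a single column only says whether any value in that column repeats, which is expected for provinces. To see how many whole rows are repeated, `duplicated()` can be summed; a sketch on an invented `toy` frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "province": ["California", "California", "Veneto", "California"],  # invented values
    "winery": ["A", "B", "C", "A"],
    "price": [20.0, 25.0, 12.0, 20.0],
})

print(toy["province"].is_unique)     # False: provinces repeat, as expected

# duplicated() marks rows identical to an earlier row across all columns
n_dupes = toy.duplicated().sum()
deduped = toy.drop_duplicates()       # keeps the first occurrence of each row
print(n_dupes, len(deduped))
```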

In [37]:
# Drop rows that are exact duplicates across all columns, keeping the first occurrence
reviews_noDup = reviews_price.drop_duplicates()

In [38]:
reviews_noDup.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 65503 entries, 0 to 100751
Data columns (total 10 columns):
country        65503 non-null object
description    65503 non-null object
designation    42228 non-null object
points         65503 non-null int64
price          65503 non-null float64
province       65503 non-null object
region_1       65503 non-null object
region_2       65503 non-null object
variety        65503 non-null object
winery         65503 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 5.5+ MB


In [39]:
reviews_noDup.reset_index(drop=True, inplace=True)

In [40]:
reviews_noDup.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65503 entries, 0 to 65502
Data columns (total 10 columns):
country        65503 non-null object
description    65503 non-null object
designation    42228 non-null object
points         65503 non-null int64
price          65503 non-null float64
province       65503 non-null object
region_1       65503 non-null object
region_2       65503 non-null object
variety        65503 non-null object
winery         65503 non-null object
dtypes: float64(1), int64(1), object(8)
memory usage: 5.0+ MB


In [45]:
# Log-transform price (natural log) to compress its range
# reviews_noDup.var()  # var() computes the variance of each numeric column

reviews_noDup.loc[:,'price_log'] = np.log(reviews_noDup['price']) 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
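
The warning above is probably raised because pandas still tracks `reviews_noDup` as derived from `reviews_price`; the column is added anyway, as the next cell shows. A sketch of two ways that should avoid the warning, assuming an explicit `.copy()` or `assign()` is acceptable here (the `toy` frame and the names `no_dup` and `with_log` are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": [10.0, 10.0, 20.0], "points": [85, 85, 90]})  # invented values

# Option 1: take an explicit copy before adding columns
no_dup = toy.drop_duplicates().reset_index(drop=True).copy()
no_dup["price_log"] = np.log(no_dup["price"])

# Option 2: assign() returns a new DataFrame with the extra column
with_log = toy.drop_duplicates().assign(price_log=lambda d: np.log(d["price"]))
print(no_dup)
```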


In [46]:
reviews_noDup[['points','price_log']].var()

points       7.620576
price_log    0.173753
dtype: float64
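
The small variance of `price_log` follows from the logarithm compressing large values far more than small ones. A sketch comparing the variance before and after the transform on an invented `price` Series (pandas `var()` uses the sample variance, ddof=1, by default):

```python
import numpy as np
import pandas as pd

price = pd.Series([4.0, 16.0, 24.0, 39.0])   # invented values within the under-$40 range

raw_var = price.var()            # sample variance of the raw prices
log_var = np.log(price).var()    # sample variance after the natural log
print(raw_var, log_var)
```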