In [34]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [35]:
wine = pd.read_json('../input/winemag-data-130k-v2.json')

In [36]:
wine.shape

In [37]:
wine.describe()

Across the nearly 130K wines reviewed, points span a range of only 20. Price varies a great deal more: the cheapest wine is 4 dollars and the most expensive is 3300 dollars.

In [38]:
wine.head()

# Data Cleaning

It is clear that there is missing data just by looking at the head. In this section, I will review how complete the dataset is and modify it for better usage. This includes looking for data that is missing or repetitive. First, let's look into data that is missing.

In [39]:
wine.isnull().head()

In [40]:
sns.heatmap(wine.isnull(),yticklabels=False,cbar=False,cmap='viridis')

There are massive amounts of missing data in the designation, region_1, region_2, taster_name and taster_twitter_handle columns. How much data is missing from the price column?

In [41]:
wine['price'].isnull().value_counts()
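To put a number on the gaps the heatmap shows, the share of missing values per column can be computed directly. This is a minimal sketch on a made-up dataframe; the `toy` name and its values are assumptions for illustration, not the real wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine dataframe (values are made up).
toy = pd.DataFrame({
    'country': ['US', 'France', None, 'Italy'],
    'price': [20.0, np.nan, 15.0, np.nan],
    'region_2': [None, None, None, 'Sonoma'],
})

# isnull() gives True/False per cell; the column mean is the missing share.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

Running the same one-liner on `wine` would list every column's missing share in one table, which is easier to compare than separate `value_counts()` calls.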

In [42]:
wine = wine.drop(['region_1','region_2','taster_twitter_handle','designation','taster_name'],axis=1)

The columns with heavy missing data are dropped outright instead of calling dropna() on the full dataframe, because dropna() would drop too many rows.
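Here is a quick sketch of why the order matters, using a small made-up dataframe (`toy` and its values are hypothetical). Calling dropna() while a mostly empty column is still present throws away almost every row, while dropping that column first keeps most of them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: region_2 is mostly empty, price has one gap.
toy = pd.DataFrame({
    'points': [87, 90, 92, 85],
    'price': [15.0, np.nan, 30.0, 12.0],
    'region_2': [np.nan, np.nan, 'Napa', np.nan],
})

# dropna() on everything keeps only rows that are complete in every column.
rows_dropna_only = len(toy.dropna())

# Dropping the sparse column first leaves far more complete rows.
rows_drop_column_first = len(toy.drop('region_2', axis=1).dropna())

print(rows_dropna_only, rows_drop_column_first)
```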



In [43]:
sns.heatmap(wine.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [44]:
wine.head()

In [45]:
wine = wine.dropna()

In [46]:
wine.shape

In [47]:
wine.nunique()

Hmm. There are 120K rows, but only 111K unique descriptions and 110K unique titles. Let's see whether any of these are duplicates.

In [48]:
wine[wine.duplicated('title',keep=False)].sort_values('title').head(10)

In [49]:
wine[wine.duplicated('description',keep=False)].sort_values('description').head(10)

In [50]:
len(wine[wine.duplicated('description',keep=False)].sort_values('description'))

Looks like both title and description have duplicates. Duplicate titles seem more plausible than duplicate descriptions, so the duplicated descriptions are the rows to drop.

In [51]:
wine = wine.drop_duplicates('description')
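For comparison, here is a small sketch of how the choice of key changes what gets dropped. The `toy` dataframe is made up, and de-duplicating on both columns is only an alternative worth knowing about, not what this notebook does.

```python
import pandas as pd

# Hypothetical reviews: rows 0 and 1 are identical, rows 2 and 3 share
# a description but belong to different titles.
toy = pd.DataFrame({
    'title': ['Wine A 2010', 'Wine A 2010', 'Wine B 2012', 'Wine C 2014'],
    'description': ['Ripe and fruity.', 'Ripe and fruity.',
                    'Crisp and dry.', 'Crisp and dry.'],
})

# keep=False flags every member of a duplicate group, not just the repeats.
n_flagged = toy.duplicated('description', keep=False).sum()

# Keying on description alone keeps the first row of each description.
by_description = toy.drop_duplicates('description')

# Keying on both columns only drops rows that match in title and description.
by_both = toy.drop_duplicates(['title', 'description'])

print(n_flagged, len(by_description), len(by_both))
```

Keying on description alone is the stricter cleanup: it also removes a row whose title differs but whose text was reused.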

In [52]:
wine.shape

In [53]:
wine.nunique()

In [54]:
wine['desc_length'] = wine['description'].apply(len)

The wine dataframe is now 19k rows slimmer and has lost five columns (a net of four, since desc_length was added). IMO, the real loss is the taster names. We still have variety, country and province, which make up for the loss of region_1, region_2 and designation. Now that our data is all nice and clean, let's start doing some exploratory data analysis!

# Univariate Analysis and Visualization

- Since we no longer have wine taster information, let's mostly focus our analysis on points, price, location, wine variety and description. 

In [55]:
plt.figure(figsize=(10,6))
wine['country'].value_counts().head(10).plot.bar()
# value_counts() counts reviewed wines (rows) per country, not distinct varieties
plt.title('Top Ten Countries by Number of Wines Reviewed')
plt.show()
print('Top Ten Countries by Number of Wines Reviewed\n')
print(wine['country'].value_counts().head(10))

- The U.S. has by far the most reviewed wines in this dataset. Argentina is the only Latin American producer on the top ten list.

In [56]:
plt.figure(figsize=(10,6))
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(wine['points'],kde=False,color='orange')
plt.title('Frequency of Points Given')
plt.show()
print('Frequency of Points Given\n')
print(wine.groupby('points')['points'].count())

This plot looks a lot like a normal distribution. The most frequently given score also happens to be the average score.
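One way to back up the "looks normal" impression with numbers is to compare the mean, median and mode and check the skewness. This sketch uses a made-up, perfectly symmetric set of scores (`points` here is hypothetical), so it only shows what the check looks like, not what the real data returns.

```python
import pandas as pd

# Hypothetical, symmetric scores centred on 88.
points = pd.Series([84, 86, 87, 88, 88, 88, 89, 90, 92])

mean_points = points.mean()
median_points = points.median()
mode_points = points.mode().iloc[0]   # mode() returns a Series; take the first value
skewness = points.skew()              # close to 0 for a symmetric distribution

print(mean_points, median_points, mode_points, skewness)
```

On `wine['points']`, a mean, median and mode that sit close together with skewness near zero would support the normal-looking shape of the histogram.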

In [57]:
plt.figure(figsize=(10,6))
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(wine[wine['price']<=200]['price'],kde=False,bins=50,color='purple')
plt.title('Price Distribution for Wines at or under $200')
plt.show()

A huge portion of wines cost less than 50 dollars. Also notice that the visual deliberately leaves out wines priced over 200 dollars: the expensive outliers would otherwise stretch the axis and squash the rest of the distribution.
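The 200 dollar cutoff is a hand-picked number. As a rough sketch of two related checks, the share of wines under a price can be computed directly, and a cutoff can be taken from a quantile instead of being typed in. The `price` values and the `cap_quantile` setting below are made up for illustration; on the real data a much higher quantile would be the natural choice.

```python
import pandas as pd

# Hypothetical prices with one extreme outlier.
price = pd.Series([10.0, 15.0, 20.0, 25.0, 40.0, 60.0, 150.0, 3300.0])

# Share of wines priced under 50 dollars.
share_under_50 = (price < 50).mean()

# Data-driven cutoff: keep everything at or below the chosen quantile.
cap_quantile = 0.75   # toy value, chosen only because the sample is tiny
cap = price.quantile(cap_quantile)
trimmed = price[price <= cap]

print(share_under_50, cap, len(trimmed))
```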

In [58]:
plt.figure(figsize=(10,6))
wine['province'].value_counts().head(10).plot.bar()
# value_counts() counts reviewed wines (rows) per province, not distinct varieties
plt.title('Top Ten Provinces by Number of Wines Reviewed')
plt.show()
print('Top Ten Provinces by Number of Wines Reviewed\n')
print(wine['province'].value_counts().head(10))

California alone accounts for over twice as many reviewed wines as the next largest producing country, France!

In [59]:
print('Share of U.S. wines from California: ' + str(len(wine[wine['province']=='California']) / len(wine[wine['country']=='US'])))
print('Share of all wines from California: ' + str(len(wine[wine['province']=='California']) / len(wine)))
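The same two shares can also be read off `value_counts(normalize=True)`, which returns fractions instead of counts. A minimal sketch on a made-up dataframe (`toy` and its rows are hypothetical):

```python
import pandas as pd

# Hypothetical reviews: three from the US, two of them from California.
toy = pd.DataFrame({
    'country': ['US', 'US', 'US', 'France', 'Italy'],
    'province': ['California', 'California', 'Oregon', 'Bordeaux', 'Tuscany'],
})

# Fraction of all rows per province.
world_share = toy['province'].value_counts(normalize=True)

# Fraction of US rows per province.
us_share = toy.loc[toy['country'] == 'US', 'province'].value_counts(normalize=True)

print(world_share['California'], us_share['California'])
```

This gives the share for every province at once, which is handy if more than one province is of interest.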

In [60]:
plt.figure(figsize=(10,6))
wine['variety'].value_counts().head(10).plot.bar()
plt.title('Top Ten Wines by Variety')
plt.show()
print('Top Ten Wines by Variety\n')
print(wine['variety'].value_counts().head(10))

The length of the taster's wine description also appears to be roughly normally distributed.

In [61]:
plt.figure(figsize=(10,6))
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(wine['desc_length'],bins=100,kde=False)
plt.title('Number of Characters per Taster Description')
plt.xlabel('Characters per Desc.')
plt.ylabel('# of Desc.')
plt.show()

# Bivariate Analysis and Visualization

In [62]:
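# Restrict to numeric columns; recent pandas versions raise an error if corr() sees text columns
sns.heatmap(wine.select_dtypes('number').corr(),annot=True,cmap='plasma')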

- There is some correlation between points and price, and between points and description length. Nothing strong, though.
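The default correlation here is Pearson, which measures linear association and is sensitive to extreme prices. A rank-based (Spearman) correlation is a common companion check; it is simply the Pearson correlation of the ranks. The sketch below uses a made-up dataframe (`toy` and its values are hypothetical) where price rises with points but far from linearly.

```python
import pandas as pd

# Hypothetical data: price always rises with points, with one huge outlier.
toy = pd.DataFrame({
    'points': [80, 85, 88, 90, 95],
    'price': [10.0, 15.0, 30.0, 60.0, 500.0],
    'country': ['US', 'France', 'Italy', 'US', 'France'],
})

# Keep only numeric columns before computing correlations.
numeric = toy.select_dtypes('number')

# Pearson: linear association, pulled around by the outlier.
pearson = numeric['points'].corr(numeric['price'])

# Spearman: Pearson correlation computed on the ranks.
spearman = numeric['points'].rank().corr(numeric['price'].rank())

print(round(pearson, 2), round(spearman, 2))
```

If the two numbers differ a lot on the real data, the relationship is probably monotonic but not linear, which would fit a price column with a long expensive tail.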

In [63]:
# pairplot creates its own figure, so a separate plt.figure() call would only add an empty one
sns.pairplot(wine)
plt.show()

In [64]:
sns.jointplot(x='points',y='price',data=wine)

In [65]:
wine[wine['price']>=3000]

- The most expensive wine is a French Bordeaux, and it got a middling rating of 88 points.

In [66]:
sns.jointplot(x='points',y='price',data=wine, kind='kde', cmap='plasma')

Hmm. These two plots are misleading because a handful of very expensive wines stretch the price axis. Let's get more precise by restricting the price to 200 dollars or less.

In [67]:
sns.lmplot(x='points',y='price',data=wine[wine['price']<=200])

In [68]:
sns.jointplot(x='points',y='price',data=wine[wine['price']<=200], kind='kde',cmap='plasma')

Looks like most wines don't cost much more than 50 dollars and have between 82 and 93 points.

In [69]:
sns.jointplot(x='points',y='desc_length',data=wine)

In [70]:
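# Look up the longest review by its length instead of hard-coding its index label
print(wine.loc[wine['desc_length'].idxmax(), 'description'])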

The longest review seems to be redundant. 

In [71]:
plt.figure(figsize=(10,6))
sns.boxplot(x='points', y='desc_length', data=wine)
plt.title('Length of Review by Points Given')
plt.show()

The box plot shows a correlation between the points given and the number of characters in a review.
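The same relationship can be summarized as a table by taking the median review length per score. A minimal sketch with made-up reviews (`toy` and its text are hypothetical):

```python
import pandas as pd

# Hypothetical reviews at three score levels.
toy = pd.DataFrame({
    'points': [82, 82, 88, 88, 95, 95],
    'description': [
        'Thin.',
        'Simple wine.',
        'Ripe cherry and spice.',
        'Bright citrus, crisp finish.',
        'Layered dark fruit with polished tannins and a long finish.',
        'Complex, concentrated and built to age for decades.',
    ],
})

# Character count per review, then the median length for each score.
toy['desc_length'] = toy['description'].str.len()
median_length = toy.groupby('points')['desc_length'].median()

print(median_length)
```

A table like this makes it easy to see whether review length rises steadily with the score or only jumps at the top end.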

In [72]:
wine[wine['price']>=1000]['variety'].value_counts().plot.bar()
plt.title('Varieties of Wine over $1000')
plt.show()
print('Varieties of Wine over $1000.\n')
print(wine[wine['price']>=1000]['variety'].value_counts())

In [73]:
wine[wine['price']>=1000]['country'].value_counts().plot.bar()
plt.title('Number of Wines over $1000 by Country')
plt.show()
print('Number of Wines over $1000 by Country.\n')
print(wine[wine['price']>=1000]['country'].value_counts())

In [74]:
plt.figure(figsize=(8,6))
wine[wine['price']<=5]['variety'].value_counts().head(10).plot.bar()
plt.title('Varieties of Wine under $6')
plt.show()
print('Varieties of Wine under $6.\n')
print(wine[wine['price']<=5]['variety'].value_counts().head(10))

In [75]:
plt.figure(figsize=(8,6))
wine[wine['price']<=5]['country'].value_counts().head(10).plot.bar()
plt.title('Top Countries with Wines under $6')
plt.show()
print('Top Countries with Wines under $6.\n')
print(wine[wine['price']<=5]['country'].value_counts().head(10))

In [76]:
plt.figure(figsize=(8,6))
wine[wine['points']>=98]['variety'].value_counts().head(10).plot.bar()
plt.title('Varieties of Wines that scored 98 points or more.')
plt.show()
print('Varieties of Wines that scored 98 points or more.\n')
print(wine[wine['points']>=98]['variety'].value_counts().head(10))

In [77]:
plt.figure(figsize=(8,6))
wine[wine['points']<=82]['variety'].value_counts().head(10).plot.bar()
plt.title('Varieties of Wines that scored 82 points or less')
plt.show()
print('Varieties of Wines that scored 82 points or less.\n')
print(wine[wine['points']<=82]['variety'].value_counts().head(10))

In [78]:
wine_country = wine.groupby('country')

In [79]:
plt.figure(figsize=(14,6))
sns.boxplot(x='country',y='points',data=wine)
plt.title('Points Given by Country')
plt.xticks(rotation = 90)
plt.show()
print('Average points given by Country\n')
print(wine_country['points'].mean().sort_values(ascending=False).head(15))

Hmm. Looking at this boxplot, it seems that the number of wines reviewed per country is affecting the average score per country. Let's filter out any country that doesn't have more than 100 reviewed wines.

In [80]:
big_wine = wine.groupby('country').filter(lambda x: len(x) > 100)
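To see exactly which countries a size filter removes, the group sizes can be inspected next to the filtered result. This sketch uses a made-up dataframe and a tiny threshold (`toy`, `min_reviews` and the values are hypothetical). It also shows an equivalent boolean mask built with `transform('size')`, which is one alternative to `filter` with a lambda.

```python
import pandas as pd

# Hypothetical reviews: India has a single, very high score.
toy = pd.DataFrame({
    'country': ['US'] * 4 + ['France'] * 3 + ['India'],
    'points': [88, 90, 91, 87, 89, 92, 90, 93],
})
min_reviews = 2   # toy threshold; the notebook uses 100

# Reviews per country, and the countries at or below the threshold.
counts = toy['country'].value_counts()
dropped = counts[counts <= min_reviews].index.tolist()

# Same filter as in the notebook, with the toy threshold.
kept = toy.groupby('country').filter(lambda x: len(x) > min_reviews)

# Equivalent mask: attach each row's group size, then compare.
group_size = toy.groupby('country')['points'].transform('size')
kept_mask = toy[group_size > min_reviews]

print(dropped)
print(kept.groupby('country')['points'].mean())
```

In the toy data the single Indian wine would have topped the ranking by average score, which is the small-sample effect the filter is meant to remove.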

In [81]:
plt.figure(figsize=(14,6))
sns.boxplot(x='country',y='points',data=big_wine)
plt.title('Points Given by Country (more than 100 wines reviewed)')
plt.xticks(rotation = 90)
plt.show()
print('Average points given by Country:\n')
print(big_wine.groupby('country')['points'].mean().sort_values(ascending=False).head(15))

Very interesting. Where did England, India, Hungary, China and Luxembourg go? Each of them has 100 or fewer reviewed wines in the cleaned dataset, so the filter removed them.

# Economy Wines

Bargain wines will be classified as wines in the top 25% by points and the bottom 25% by price. Let's see where these wines are made and learn more about them!

In [82]:
wine.describe()

The top 25% of wines by score have 91 points or more. The bottom 25% of wines by price are priced at 17 dollars or less. Let's create a new dataframe that meets both of those criteria.

In [83]:
economyWine = wine[(wine['points'] >= 91) & (wine['price'] <= 17)] 
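The 91 and 17 above are read off the describe() table by hand. As a sketch, the same cutoffs can be computed with `quantile`, so they stay correct if the cleaning steps change. The `toy` dataframe and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical wines.
toy = pd.DataFrame({
    'points': [84, 86, 88, 90, 91, 93, 94, 96],
    'price': [8.0, 12.0, 15.0, 17.0, 25.0, 11.0, 60.0, 10.0],
})

# 75th percentile of points and 25th percentile of price,
# the same statistics describe() reports as 75% and 25%.
points_cut = toy['points'].quantile(0.75)
price_cut = toy['price'].quantile(0.25)

# High score and low price at the same time.
economy = toy[(toy['points'] >= points_cut) & (toy['price'] <= price_cut)]

print(points_cut, price_cut)
print(economy)
```

Running the same two `quantile` calls on `economyWine` would give the cutoffs for the Super Economy section without reading them from a table.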

In [84]:
plt.figure(figsize=(8,6))
economyWine['country'].value_counts().head(10).plot.bar()
plt.title('Economy Wines by Country')
plt.show()
print('There are ' + str(len(economyWine)) + ' economy wines.')
print('Economy Wines by Country:\n')
print(economyWine['country'].value_counts().head(10))

In [85]:
plt.figure(figsize=(8,6))
economyWine['variety'].value_counts().head(10).plot.bar()
plt.title('Economy Wines by Variety')
plt.show()
print('Economy Wines by Variety:\n')
print(economyWine['variety'].value_counts().head(10))

No surprise that the country that produces the most economy wines is the United States. The top economy wine by variety is Portuguese Red!! Didn't see that coming.

# Super Economy Wines!

Super Economy Wines are top 25% by score and lowest 25% by price of the economy wine dataset. Let's create that dataset and see what it looks like!!

In [86]:
economyWine.describe()

In [87]:
superEcon = economyWine[(economyWine['points'] >= 92) & (economyWine['price'] <= 14)]

In [88]:
plt.figure(figsize=(8,6))
superEcon['country'].value_counts().plot.bar()
plt.title('Super Economy Wines by Country')
plt.show()
print('There are ' + str(len(superEcon)) + ' super economy wines.')
print('Super Economy Wines by Country:\n')
print(superEcon['country'].value_counts())

In [89]:
plt.figure(figsize=(8,6))
superEcon['variety'].value_counts().plot.bar()
plt.title('Super Economy Wines by Variety')
plt.show()
print('Super Economy Wines by Variety:\n')
print(superEcon['variety'].value_counts())

Once again Portuguese wine comes out on top as the best bang for the buck. Portuguese Red is top dog by variety again. 

# Feedback is massively appreciated! Thank you.