# Worldwide winery EDA
### This notebook is an introductory exploratory data analysis (EDA) for beginners. A few interesting insights can be drawn from it.

### Outline:
### 1. ggplot
### 2. Seaborn
### 3. Train-test split

In [None]:
import pandas as pd
import numpy as np
from ggplot import *
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
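# The first column of this CSV is the saved row index, so load it as the index
# rather than as an extra "Unnamed: 0" feature column.
df_wine = pd.read_csv("../input/winemag-data-130k-v2.csv", index_col=0)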

In [None]:
# Keep only reviews that have a price, a country and a variety
df_wine = df_wine.dropna(subset=['price', 'country', 'variety'])

In [None]:
df_wine.columns

In [None]:
df_wine.info()

In [None]:
df_wine.isnull().sum()

## 1. ggplot 

### Using size, shape and color as well as facets.
### In practice, plotting too many variables in a single chart is hard to read and often leads to redundancy.

In [None]:
# Keep the 9 most-reviewed varieties, then the 9 most frequent countries among them
df = df_wine[df_wine.variety.isin(df_wine.variety.value_counts().head(9).index)]
df = df[df.country.isin(df.country.value_counts().head(9).index)]
df.head()

In [None]:
p = ggplot(df, aes(x="points", y="price", shape="variety", size="price", color="country")) + geom_point()
p + facet_wrap('variety', scales="free_y") + xlab("points") + ylab("price") + ggtitle("winery review: country to price")

### TAKEAWAY:
### In the plots above, France has the most expensive wines, in both Bordeaux-style Red Blend and Pinot Noir.
### For wines around 20 dollars, a fairly affordable price, choosing American Cabernet Sauvignon, French Chardonnay or Italian Riesling could give you a better drinking experience.
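
The "around 20 dollars" takeaway is read off the scatter plots by eye. A rough way to check it numerically is to filter a price band and rank each (country, variety) pair by its mean score. The sketch below is only an illustration: the helper name `rank_by_points`, the 15 to 25 dollar band and the `min_reviews` cutoff are assumptions rather than part of the original notebook, and the `toy` rows are made up solely to show the call.

```python
import pandas as pd

def rank_by_points(df, low=15, high=25, min_reviews=1):
    """Rank (country, variety) pairs by mean points within a price band.

    `low`, `high` and `min_reviews` are illustrative defaults, not values
    taken from the notebook.
    """
    # Keep reviews whose price falls inside the band (both ends inclusive)
    band = df[df["price"].between(low, high)]
    # Mean score and number of reviews per (country, variety) pair
    summary = (
        band.groupby(["country", "variety"])["points"]
        .agg(mean_points="mean", n_reviews="count")
        .reset_index()
    )
    # Drop pairs backed by too few reviews to be meaningful
    summary = summary[summary["n_reviews"] >= min_reviews]
    return summary.sort_values("mean_points", ascending=False).reset_index(drop=True)

# Made-up toy rows, only to demonstrate the call; NOT real review data.
toy = pd.DataFrame({
    "country": ["US", "US", "France", "France", "Italy"],
    "variety": ["Cabernet Sauvignon", "Cabernet Sauvignon",
                "Chardonnay", "Chardonnay", "Riesling"],
    "points": [90, 92, 88, 89, 87],
    "price": [20.0, 22.0, 18.0, 60.0, 19.0],
})

ranked = rank_by_points(toy)
print(ranked)
```

On the notebook's data the call would be something like `rank_by_points(df, min_reviews=30)`, so that a pair with only a handful of reviews does not top the ranking by chance.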

## 2. A correlation heatmap in Seaborn

In [None]:
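# Create one figure and one axes, then draw the heatmap on that axes
fig, ax = plt.subplots(figsize=(20, 15))
ax.set_title("winery review")
# Correlate only the numeric review columns; text columns cannot be correlated
corr = df[["points", "price"]].corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            ax=ax)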

### TAKEAWAY: 
### The correlation between price and points is 0.4.
### Since the dataset has only two numerical variables, the correlation heatmap conveys little information.
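
With only two numeric columns, the same number can be computed directly without a heatmap. The sketch below also reports a rank-based (Spearman) correlation and a Pearson correlation on log price, which can be worth a look if price turns out to be heavily skewed. The helper name `price_points_correlations` is hypothetical, and the `toy` values are made up purely to show the call.

```python
import numpy as np
import pandas as pd

def price_points_correlations(df):
    """Return a few correlation measures between price and points."""
    price, points = df["price"], df["points"]
    return {
        # Plain Pearson correlation, the same measure DataFrame.corr() uses by default
        "pearson": price.corr(points),
        # Spearman correlation, computed as Pearson correlation of the ranks
        "spearman": price.rank().corr(points.rank()),
        # Pearson correlation after log-transforming price
        "pearson_log_price": np.log(price).corr(points),
    }

# Made-up toy values, only to demonstrate the call; NOT real review data.
toy = pd.DataFrame({
    "price": [10.0, 20.0, 40.0, 80.0, 160.0],
    "points": [82, 85, 88, 91, 94],
})

result = price_points_correlations(toy)
print(result)
```

In the toy frame the price doubles for every three extra points, so the rank-based and log-price correlations are perfect while the plain Pearson value is lower. Whether the real data behaves similarly is something to check, not something this sketch establishes.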


## 3. Split training and test sets

In [None]:
from sklearn.model_selection import train_test_split

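# Assumption: "country" is the label to predict; the notebook does not name a target.
# Selecting it by name avoids picking up the CSV's unnamed row-index column,
# whose all-unique values cannot be stratified.
y = df["country"].values
X = df.drop(columns=["country"]).values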

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.3,
                                                    random_state=0,
                                                    stratify=y)
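
Passing `stratify=y` asks `train_test_split` to keep the class proportions of `y` roughly the same in the training and test sets. A minimal sketch of how that could be checked is shown below. The helper `class_shares` is hypothetical, and the labels are made up for illustration instead of being taken from the wine data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def class_shares(labels):
    """Return the share of each class in an array of labels."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values.tolist(), (counts / counts.sum()).tolist()))

# Made-up labels (60% / 30% / 10%), only to demonstrate the check.
y_toy = np.array(["US"] * 60 + ["France"] * 30 + ["Italy"] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same split settings as in the notebook cell above
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

train_shares = class_shares(y_tr)
test_shares = class_shares(y_te)
print("train:", train_shares)
print("test: ", test_shares)
```

On the notebook's own split, comparing `class_shares(y_train)` with `class_shares(y_test)` should show near-identical proportions. Stratification also requires every class to have at least two members, so a label column with all-unique values cannot be used.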
