# Explore

- `random_state=123`

Plan
- [x] Split data
- [x] Get Univariate Insights
- [ ] Get Bivariate Insights to target (quality)
    - scatterplots with quality on the y
    - correlations + heatmap
    - barplot, swarmplot, and/or boxplot with quality on y and color on x

In [7]:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter
from matplotlib import patches

%config InlineBackend.figure_format = 'retina'
import seaborn as sns

import sys
import os
home_directory_path = os.path.expanduser('~')
sys.path.append(home_directory_path +'/utils')

from wrangle import split_data

from sklearn.preprocessing import MinMaxScaler
from scipy.stats import pearsonr

from sklearn.cluster import KMeans

from itertools import combinations
import wrangle as w
import explore as e  # assumption: the module behind the `e.` helpers used below

In [4]:
import warnings
warnings.filterwarnings("ignore")

Acquire and split data

In [8]:
df = pd.read_csv('wine_data.csv') 

train, validate, test = split_data(df, validate_size=.15, test_size=.15, random_state=123)
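
For readers without the `wrangle` module, here is a minimal sketch of what a three-way split like the one above might look like. `split_data_sketch` is a hypothetical stand-in, not the actual `wrangle.split_data`; it assumes both sizes are fractions of the full dataset and uses scikit-learn's `train_test_split` twice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, validate_size=0.15, test_size=0.15, random_state=123):
    """Hypothetical stand-in for wrangle.split_data: train/validate/test split."""
    # Carve off the test set first, as a fraction of the full data.
    train_validate, test = train_test_split(
        df, test_size=test_size, random_state=random_state
    )
    # validate_size is a fraction of the full data, so rescale it
    # relative to what remains after removing the test set.
    train, validate = train_test_split(
        train_validate,
        test_size=validate_size / (1 - test_size),
        random_state=random_state,
    )
    return train, validate, test


# Made-up demo frame; the real notebook reads wine_data.csv instead.
demo = pd.DataFrame({"alcohol": range(100), "quality": [5, 6] * 50})
tr, va, te = split_data_sketch(demo)
print(len(tr), len(va), len(te))
```

Splitting before any exploration keeps the validate and test rows out of every insight drawn below.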

**Univariate Analysis**

Notes:
- `color`: roughly 75% white wine, 25% red wine
- `quality`: approximately normally distributed. Scores of 3 and 9 are rare; 92% of wines score 5-7, and 44% score 6

In [None]:
e.explore_univariate_categorical_cols(train)

In [None]:
e.explore_univariate_continuous_cols(train)

**Bivariate Analysis**

- Wine color appears to be independent of quality
- Features that drive quality: $|r| > .2$
    - `['alcohol', 'density', 'volatile acidity', 'chlorides']`
- Features that don't drive quality: $|r| < .1$
    - `['color', 'fixed acidity', 'citric acid', 'residual sugar', 'free sulfur dioxide', 'total sulfur dioxide', 'pH', 'sulphates']`
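
The grouping above can be reproduced by ranking each feature's Pearson correlation with the target by magnitude. The sketch below runs on a small made-up frame (`toy`), not the wine data, and the 0.2 and 0.1 cutoffs are simply the thresholds used in these notes.

```python
import pandas as pd

# Toy numeric frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "quality": [5, 6, 7, 5, 6, 8, 4, 6],
    "alcohol": [9.5, 10.5, 12.0, 9.0, 11.0, 13.0, 9.2, 10.8],
    "density": [0.998, 0.996, 0.992, 0.999, 0.995, 0.990, 0.999, 0.996],
    "pH": [3.2, 3.1, 3.3, 3.2, 3.1, 3.3, 3.2, 3.1],
})

# Pearson r of each feature with the target, ranked by magnitude
# so that strong negative correlations are not overlooked.
r = toy.corr()["quality"].drop("quality")
ranked = r.reindex(r.abs().sort_values(ascending=False).index)

drivers = ranked[ranked.abs() > 0.2].index.tolist()
weak = ranked[ranked.abs() < 0.1].index.tolist()

print(ranked.round(2))
print("drivers:", drivers)
print("weak:", weak)
```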

In [None]:
e.explore_bivariate_cat_to_cont_target(train, target='quality',
                                       cat_cols=['red'])

In [None]:
# e.explore_bivariate_cont_to_cont_target(train, target='quality',
#                                         cont_cols=e.get_cat_and_cont_cols(train)[1] + ['quality'])

In [None]:
def plot_key_features():
    df = train
    target='quality'

    plt.figure(figsize=(1.5,5))
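    # Correlation of each feature with the target, strongest (by magnitude) first
    corr_to_target = df.corr()[target].drop(target)
    corr_to_target = corr_to_target.reindex(corr_to_target.abs().sort_values(ascending=False).index)
    ax = sns.heatmap(corr_to_target.to_frame(), annot=True, cmap='RdYlGn', vmin=-1, vmax=1)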
    cbar = ax.collections[0].colorbar
    cbar.ax.tick_params(right=False, labelsize=8) 
    cbar.set_ticks([-1, -.5, 0, .5, 1])
    plt.tick_params(axis='both', left=False, bottom=False)

    rectangle = patches.Rectangle((0, 0), 1, 4, linewidth=1.5, edgecolor='#C40000', facecolor='none')
    ax.add_patch(rectangle)

    plt.title('4 Strongest Drivers of Quality')
    plt.show()

In [None]:
plot_key_features()

In [None]:
def plot_alcohol_by_quality():
    fig, axes = plt.subplots(2, 1, figsize=(6,6))
    sns.barplot(data=train, x='quality', y='alcohol', color='green',
                errorbar=None, ax=axes[0])

    for p in axes[0].patches:
        axes[0].annotate(f'{p.get_height():.1f}%',
                    (p.get_x() + p.get_width() / 2, p.get_height()),
                    ha='center', va='bottom', fontsize=8)

    axes[0].set_xlabel('')

    sns.stripplot(data=train, x='quality', y='alcohol', size=1, 
                  color='green', jitter=.2, ax=axes[1])

    axes[1].set_xlabel('Quality', fontsize=10, labelpad=5)

    plt.suptitle('Higher Quality Wines Have More Alcohol')

    for ax in axes:
        ax.set_ylabel('Alcohol', rotation=0, fontsize=10, labelpad=20)
        ax.tick_params(axis='both', left=False, bottom=False, labelsize=8)
        ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{round(x)}%'))

    sns.despine()
    plt.tight_layout()
    plt.show()

In [None]:
plot_alcohol_by_quality()

In [14]:
for col in ['alcohol', 'density', 'volatile acidity', 'chlorides']:
    r_stat, p_val = pearsonr(train['quality'], train[col])
    print(r_stat, p_val)

0.4837133455918712 1.0089565593623517e-217
-0.353586651013573 4.192528826581278e-110
-0.262858663423861 6.736031363518946e-60
-0.20710153205137027 2.327569955302168e-37


**Clustering**

Find clusters/groups from strong features

In [None]:
scaler = MinMaxScaler()

Scale Training data

In [None]:
pd.DataFrame(data=scaler.fit_transform(train.drop(columns=['quality'])),
             columns=train.drop(columns=['quality']).columns)

In [None]:
train_sc = pd.concat([pd.DataFrame(data=scaler.fit_transform(train.drop(columns=['quality'])),
                                   columns=train.drop(columns=['quality']).columns),
                      train['quality'].reset_index(drop=True)],
                     axis=1)

In [None]:
train_sc.head()

In [None]:
validate_sc = pd.concat([pd.DataFrame(data=scaler.transform(validate.drop(columns=['quality'])),
                                      columns=validate.drop(columns=['quality']).columns),
                         validate['quality'].reset_index(drop=True)],
                        axis=1)

In [None]:
validate_sc.head()

In [None]:
test_sc = pd.concat([pd.DataFrame(data=scaler.transform(test.drop(columns=['quality'])),
                                  columns=test.drop(columns=['quality']).columns),
                     test['quality'].reset_index(drop=True)],
                    axis=1)

In [None]:
test_sc.head()
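
The three scaling cells above repeat one pattern: fit the scaler on train only, then apply it to every split. A possible helper is sketched below. `scale_splits` is a hypothetical name, and unlike the cells above it keeps each split's original index instead of resetting it. The toy frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_splits(train, validate, test, target="quality"):
    """Fit a MinMaxScaler on train only, then transform every split."""
    feature_cols = train.columns.drop(target)
    scaler = MinMaxScaler().fit(train[feature_cols])

    def _scale(split):
        scaled = pd.DataFrame(
            scaler.transform(split[feature_cols]),
            columns=feature_cols,
            index=split.index,  # keep the original row labels
        )
        scaled[target] = split[target]  # the target is left unscaled
        return scaled

    return _scale(train), _scale(validate), _scale(test), scaler


# Made-up splits with non-sequential indexes, as a real split would produce.
toy_train = pd.DataFrame(
    {"alcohol": [9.0, 11.0, 13.0], "pH": [3.0, 3.2, 3.4], "quality": [5, 6, 7]},
    index=[10, 4, 7],
)
toy_val = pd.DataFrame({"alcohol": [10.0], "pH": [3.5], "quality": [6]}, index=[2])
toy_test = pd.DataFrame({"alcohol": [14.0], "pH": [3.1], "quality": [5]}, index=[5])

tr_sc, va_sc, te_sc, fitted = scale_splits(toy_train, toy_val, toy_test)
print(te_sc)
```

Because the scaler only sees the training range, validate and test values outside that range land outside [0, 1], which is expected.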

In [None]:
# plt.figure(figsize=(40,20))

# sns.pairplot(data=train_sc, corner=True,
#              hue='quality', plot_kws={'s': 3, 'alpha': .1})

**Cluster**

- `['alcohol', 'volatile acidity', 'chlorides']`
- `['alcohol', 'density', 'citric acid']`
- `['alcohol', 'residual sugar', 'pH']`
- `['total sulfur dioxide', 'density']`

Cluster on combination of 2 features

In [None]:
len(list(combinations(train_sc.columns, 2)))

How well can we cluster off 2 features?
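- lower inertia (the within-cluster sum of squared distances) means denser clusters; it generally falls as k grows, so look for the k where the curve flattens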
- **Note:** this doesn't tell us how useful these clusters will be at predicting quality.
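
To make the inertia check concrete, here is a small sketch on synthetic data: three made-up blobs in the unit square stand in for two scaled features. The names `X`, `inertias`, and `drops` are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)

# Synthetic stand-in for two scaled features: three blobs in the unit square.
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
X = np.vstack([c + rng.normal(scale=0.05, size=(100, 2)) for c in centers])

# Inertia for each candidate k.
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=123).fit(X).inertia_
    for k in range(1, 8)
}

# Reduction in inertia gained by each extra cluster;
# the elbow is the k after which these reductions become small.
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 8)}

for k, value in inertias.items():
    print(k, round(value, 3))
```

On data like this the reduction from k=2 to k=3 is large and later reductions are small, which is the elbow pattern to look for in the plots.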

In [None]:
# for combo in [list(tup) for tup in list(combinations(train_sc.columns, 2))]:
#     print('-'*20)
#     print(combo)
#     plt.figure(figsize=(4, 3))
#     pd.Series({k: KMeans(k).fit(train_sc[combo]).inertia_ for k in range(2, 12)}).plot(marker='x')
#     plt.xticks(range(2, 12))
#     plt.xlabel('k')
#     plt.ylabel('inertia')
#     plt.title('Change in inertia as k increases')
#     plt.show()

Cluster on combination of 3 features

In [None]:
len(list(combinations(train_sc.columns, 3)))

How well can we cluster off 3 features?
- lower inertia means we have denser clusters
- **Note:** this doesn't tell us how useful these clusters will be at predicting quality.

In [None]:
# for combo in [list(tup) for tup in list(combinations(train_sc.columns, 3))]:
#     print('-'*20)
#     print(combo)
#     plt.figure(figsize=(4, 3))
#     pd.Series({k: KMeans(k).fit(train_sc[combo]).inertia_ for k in range(2, 12)}).plot(marker='x')
#     plt.xticks(range(2, 12))
#     plt.xlabel('k')
#     plt.ylabel('inertia')
#     plt.title('Change in inertia as k increases')
#     plt.show()

Cluster group 1:

- 4 clusters off `['fixed acidity', 'chlorides', 'alcohol']`

In [None]:
feats1 = ['fixed acidity', 'chlorides', 'alcohol']

kmeans1 = KMeans(n_clusters=4, random_state=123).fit(train_sc[feats1])

train['clusters_1'] = kmeans1.predict(train_sc[feats1])
validate['clusters_1'] = kmeans1.predict(validate_sc[feats1])
test['clusters_1'] = kmeans1.predict(test_sc[feats1])

Cluster group 2:

- 4 clusters off `['fixed acidity', 'alcohol']`

In [None]:
feats2 = ['fixed acidity', 'alcohol']

kmeans2 = KMeans(n_clusters=4, random_state=123).fit(train_sc[feats2])

train['clusters_2'] = kmeans2.predict(train_sc[feats2])
validate['clusters_2'] = kmeans2.predict(validate_sc[feats2])
test['clusters_2'] = kmeans2.predict(test_sc[feats2])

Cluster group 3:

- 4 clusters off `['free sulfur dioxide', 'residual sugar', 'alcohol']`

In [None]:
feats3 = ['free sulfur dioxide', 'residual sugar', 'alcohol']

kmeans3 = KMeans(n_clusters=4, random_state=123).fit(train_sc[feats3])

train['clusters_3'] = kmeans3.predict(train_sc[feats3])
validate['clusters_3'] = kmeans3.predict(validate_sc[feats3])
test['clusters_3'] = kmeans3.predict(test_sc[feats3])

Explore Clusters

In the first group of clusters, cluster 3 has a much higher mean quality than the others.

In [None]:
train.groupby('clusters_1')['quality'].mean()

In the second group of clusters, cluster 0 has a much higher mean quality than the others.

In [None]:
train.groupby('clusters_2')['quality'].mean()

In the third group of clusters, cluster 2 has a much higher mean quality than the others.

In [None]:
train.groupby('clusters_3')['quality'].mean()

In [None]:
train['quality'].mean()
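
A cluster mean is easier to judge next to the cluster's size and the overall mean. The sketch below shows one way to tabulate that; the frame `toy` and the `lift` column are made up for illustration and are not results from the wine data.

```python
import pandas as pd

# Toy stand-in for `train` with a cluster label column; values are made up.
toy = pd.DataFrame({
    "clusters_1": [0, 0, 1, 1, 2, 2, 3, 3, 3],
    "quality":    [5, 5, 6, 5, 6, 6, 7, 8, 7],
})

overall_mean = toy["quality"].mean()

# Size and mean quality of each cluster.
profile = toy.groupby("clusters_1")["quality"].agg(["count", "mean"])

# Positive lift: the cluster's mean quality is above the overall mean.
profile["lift"] = profile["mean"] - overall_mean

print(profile.sort_values("lift", ascending=False))
```

A large lift in a very small cluster deserves less weight than a moderate lift in a large one, so the `count` column is worth reading alongside the mean.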

Save new cluster features onto the original data

In [None]:
df = pd.concat([train, validate, test]).sort_index()
df.head()

In [None]:
df.to_csv('wine_data_model.csv', index=False)