# Variable Distribution
In this step we analyze the data to understand how each variable is distributed.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image

In [2]:
%matplotlib inline

plt.rcParams['figure.figsize'] = (6,6)

### Load the pickle file

In [3]:
wine = pd.read_pickle("data/wine.3.grouped.pkl")

In [4]:
wine.dtypes

id                         int64
volatile_acidity         float64
citric_acid              float64
residual_sugar           float64
chlorides                float64
total_sulfur_dioxide     float64
density                  float64
pH                       float64
sulfates                 float64
alcohol                  float64
quality                    int64
quality_rating          category
pH_level                category
alcohol_level           category
density_level           category
citric_acid_level       category
sugar_level             category
chloride_level          category
total_sulfur_level      category
dtype: object

### Numerical Variable Distribution
We can analyze the distribution of each numerical variable with a KDE (Kernel Density Estimate) plot, a smoothed, continuous counterpart of a histogram.

In [5]:
def numerical_variable_distribution(df, col):
    df[col].plot.kde()
    plt.xlabel('Variable "{}" Distribution'.format(col))
    plt.show()
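
To make the idea of a KDE as a smoothed histogram more concrete, here is a minimal sketch on synthetic data (not the wine dataset). It assumes SciPy is installed and uses `scipy.stats.gaussian_kde`; the helper name `kde_curve`, the two-peaked sample, and the grid size are illustrative choices, not part of the original analysis.

```python
import numpy as np
from scipy import stats


def kde_curve(values, num_points=512):
    """Evaluate a Gaussian KDE of `values` on an evenly spaced grid."""
    values = np.asarray(values, dtype=float)
    kde = stats.gaussian_kde(values)  # default bandwidth: Scott's rule
    pad = 3 * values.std()            # extend the grid to cover the tails
    grid = np.linspace(values.min() - pad, values.max() + pad, num_points)
    return grid, kde(grid)


# Synthetic two-peaked sample, used only for illustration.
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 0.5, 500)])

grid, density = kde_curve(sample)

# Like a normalized histogram, the estimated density integrates to about 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print("Area under the KDE curve: {:.3f}".format(area))
```

Plotting `density` against `grid` should give a curve comparable to what `df[col].plot.kde()` draws, with both peaks of the sample visible.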

In [6]:
numerical_columns = wine.select_dtypes(include=[np.number]).columns.tolist()

In [7]:
from ipywidgets import interact, fixed

In [8]:
interact(numerical_variable_distribution, 
         col=numerical_columns, df=fixed(wine));


The KDE plots show that `quality` and `citric_acid` in particular are clearly not normally distributed.

In [9]:
wine["quality"].value_counts(normalize=True)

5    0.425891
6    0.398999
7    0.124453
4    0.033146
8    0.011257
3    0.006254
Name: quality, dtype: float64

In [10]:
wine["citric_acid"].value_counts(normalize=True)

0.00    0.082552
0.49    0.042527
0.24    0.031895
0.02    0.031270
0.26    0.023765
0.10    0.021889
0.08    0.020638
0.01    0.020638
0.21    0.020638
0.32    0.020013
0.03    0.018762
0.09    0.018762
0.31    0.018762
0.30    0.018762
0.42    0.018136
0.40    0.018136
0.04    0.018136
0.39    0.017511
0.22    0.016886
0.12    0.016886
0.25    0.016886
0.33    0.015635
0.20    0.015635
0.23    0.015635
0.06    0.015009
0.34    0.015009
0.48    0.014384
0.44    0.014384
0.45    0.013759
0.07    0.013759
          ...   
0.38    0.008755
0.53    0.008755
0.51    0.008130
0.54    0.008130
0.35    0.008130
0.55    0.007505
0.68    0.006879
0.63    0.006254
0.57    0.005629
0.64    0.005629
0.16    0.005629
0.58    0.005629
0.60    0.005629
0.59    0.005003
0.56    0.005003
0.65    0.004378
0.74    0.002502
0.69    0.002502
0.76    0.001876
0.73    0.001876
0.67    0.001251
0.61    0.001251
0.70    0.001251
0.71    0.000625
0.79    0.000625
0.75    0.000625
0.78    0.000625
1.00    0.0006

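To check whether any of the variables are normally distributed, we create a normal probability plot for each one: the closer the points lie to the straight reference line, the closer the variable is to a normal distribution.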

In [11]:
from scipy import stats

In [12]:
def numerical_variable_normality(col):
    stats.probplot(wine[col], plot=plt)
    # Put the caption in the title so the "Theoretical quantiles" x-axis label is kept
    plt.title('Probability plot for variable {}'.format(col))
    plt.show()
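
As a numeric companion to the probability plot, the sketch below reads the correlation coefficient `r` that `stats.probplot` returns alongside the fitted line; values close to 1 mean the points hug the reference line. It runs on synthetic samples rather than the wine data, and the helper name `probplot_r` is hypothetical.

```python
import numpy as np
from scipy import stats


def probplot_r(values):
    """Return the correlation r between ordered values and normal quantiles."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing values
    # Without `plot=`, probplot only computes the quantiles and the fitted line.
    _, (slope, intercept, r) = stats.probplot(values, dist="norm")
    return r


# Synthetic samples, used only for illustration.
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=10.0, scale=2.0, size=1000)
skewed_sample = rng.exponential(scale=2.0, size=1000)

r_normal = probplot_r(normal_sample)
r_skewed = probplot_r(skewed_sample)
print("r for the normal sample: {:.4f}".format(r_normal))
print("r for the skewed sample: {:.4f}".format(r_skewed))
```

The same helper could be applied to each entry of `numerical_columns`, for example `probplot_r(wine[col])`, to rank the variables by how far they depart from the reference line.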

In [13]:
interact(numerical_variable_normality, col=numerical_columns);


In [14]:
for num_col in numerical_columns:
    # D'Agostino-Pearson test; the null hypothesis is that the data are normal
    _, pval = stats.normaltest(wine[num_col].dropna())
    if pval < 0.05:
        print("Column {} doesn't follow a normal distribution".format(num_col))

Column id doesn't follow a normal distribution
Column volatile_acidity doesn't follow a normal distribution
Column citric_acid doesn't follow a normal distribution
Column residual_sugar doesn't follow a normal distribution
Column chlorides doesn't follow a normal distribution
Column total_sulfur_dioxide doesn't follow a normal distribution
Column density doesn't follow a normal distribution
Column pH doesn't follow a normal distribution
Column sulfates doesn't follow a normal distribution
Column alcohol doesn't follow a normal distribution
Column quality doesn't follow a normal distribution


At the 5% significance level, the test rejects normality for every numerical variable.
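
A rejection alone does not say how far a variable is from normal. Since `stats.normaltest` combines skewness and kurtosis, one option is to tabulate those two quantities next to the p-value. The sketch below does this on synthetic data; the function name `normality_summary`, its `alpha` parameter, and the demo columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats


def normality_summary(df, columns, alpha=0.05):
    """Tabulate skewness, excess kurtosis and the normaltest p-value per column."""
    rows = []
    for col in columns:
        values = df[col].dropna()
        _, pval = stats.normaltest(values)
        rows.append({
            "column": col,
            "skewness": stats.skew(values),
            "excess_kurtosis": stats.kurtosis(values),  # 0 for a normal distribution
            "p_value": pval,
            "rejects_normality": bool(pval < alpha),
        })
    return pd.DataFrame(rows).set_index("column")


# Synthetic columns, used only for illustration.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "roughly_normal": rng.normal(0.0, 1.0, 2000),
    "right_skewed": rng.exponential(1.0, 2000),
})
summary = normality_summary(demo, demo.columns)
print(summary)
```

On the notebook's data the call would be `normality_summary(wine, numerical_columns)`.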

### Categorical Variable Distribution

To understand the distribution of the categorical variables, we use the `.value_counts()` method.

In [15]:
def categorical_variable_distribution(col):
    # With ascending=True, tail(20) keeps the 20 most frequent levels
    wine[col].value_counts(ascending=True, normalize=True).tail(20).plot.barh()
    plt.show()
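
When a static table is more convenient than a plot, the same `.value_counts(normalize=True)` call can be collected for several columns at once. The following is a minimal sketch on a toy frame: the column names mirror the notebook, but the level labels, the data, and the helper name `categorical_shares` are made up for illustration.

```python
import pandas as pd


def categorical_shares(df, columns):
    """Return one row per (variable, level) with the share of rows at that level."""
    parts = []
    for col in columns:
        shares = df[col].value_counts(normalize=True)
        parts.append(pd.DataFrame({
            "variable": col,
            "level": [str(level) for level in shares.index],
            "share": shares.to_numpy(),
        }))
    return pd.concat(parts, ignore_index=True)


# Toy data, used only for illustration.
demo = pd.DataFrame({
    "sugar_level": pd.Categorical(["low", "low", "medium", "high"]),
    "pH_level": pd.Categorical(["low", "high", "high", "high"]),
})
table = categorical_shares(demo, ["sugar_level", "pH_level"])
print(table)
```

Within each variable the shares sum to 1, so the table can be read directly as percentages of wines per level.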

In [16]:
categorical_columns = wine.select_dtypes(
    ['object', 'category']).columns.tolist()

In [17]:
categorical_columns

['quality_rating',
 'pH_level',
 'alcohol_level',
 'density_level',
 'citric_acid_level',
 'sugar_level',
 'chloride_level',
 'total_sulfur_level']

In [18]:
interact(categorical_variable_distribution, col=categorical_columns);


### Conclusions
- No numerical variable follows a normal distribution.
- The variable `quality` has an imbalanced discrete distribution: the values 5, 6 and 7 account for about 95% of the wines. It has been grouped into 'bad', 'okay', 'good', and 'excellent'.
- The variable `citric_acid` may also have an imbalanced distribution: about 8% of the wines have a citric acid value of 0.0. This may sound small, but the variable takes many distinct values, and the next most frequent one (0.49) covers only about 4% of the wines.
- 28% of wines have very low alcohol content.
- 23% of wines have very low fixed acidity.
- About a quarter of wines have low sugar levels.
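
The 95% figure for `quality` can be checked with a little arithmetic on the shares printed earlier by `wine["quality"].value_counts(normalize=True)`. The sketch below copies those printed values into a Series instead of reloading the pickle file, so it only restates numbers already shown in this notebook.

```python
import pandas as pd

# Shares copied from the value_counts(normalize=True) output for `quality`.
quality_shares = pd.Series({
    5: 0.425891,
    6: 0.398999,
    7: 0.124453,
    4: 0.033146,
    8: 0.011257,
    3: 0.006254,
})

# Combined share of the three most common ratings.
middle_share = quality_shares.loc[[5, 6, 7]].sum()
print("Share of wines rated 5, 6 or 7: {:.1%}".format(middle_share))
```

The sum is 0.949343, which rounds to the 95% quoted in the conclusions.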