<h1> Red wine data analysis </h1>

<i>Joël Boafo, Sjoerd Beetsma, Maarten de Jeu
Class V2A - Group 5</i>

<h2> Data Understanding </h2>

The datasets describe chemical properties of red and white wines and were acquired from https://archive.ics.uci.edu/ml/datasets/wine+quality

The business tells us the variables in the datasets are:<br />
1 - fixed acidity <br />
2 - volatile acidity <br />
3 - citric acid <br />
4 - residual sugar <br />
5 - chlorides <br />
6 - free sulfur dioxide <br />
7 - total sulfur dioxide <br />
8 - density <br />
9 - pH <br />
10 - sulphates <br />
11 - alcohol <br />
12 - quality (score between 0 and 10, based on sensory data) <br />

The business also let us know that they don't know whether all variables are relevant in deciding the quality score of a wine.

We import some libraries and the dataset to examine the data through code.

In [None]:
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn import preprocessing

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load in both red and white wine datasets

In [None]:
dataset_red = pd.read_csv("datasets/winequality-red.csv", sep=";")
dataset_white = pd.read_csv("datasets/winequality-white.csv", sep=";")

In [None]:
print('red-wine dataset \n', dataset_red.dtypes)
print('\nwhite-wine dataset \n', dataset_white.dtypes)

Both the red and white wine datasets seem to have the exact same columns and dtypes.
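The comparison of the printed dtypes can also be done programmatically. The following is a minimal sketch: `same_schema` is a hypothetical helper that is not part of the original notebook, and `toy_red` / `toy_white` are small made-up frames standing in for the real datasets.

```python
import pandas as pd

def same_schema(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True when both frames have the same column names, order and dtypes."""
    return left.columns.equals(right.columns) and left.dtypes.equals(right.dtypes)

# Toy stand-ins for dataset_red and dataset_white (values are made up).
toy_red = pd.DataFrame({"pH": [3.5, 3.2], "quality": [5, 6]})
toy_white = pd.DataFrame({"pH": [3.0, 3.3], "quality": [6, 7]})

print(same_schema(toy_red, toy_white))
```

On the real data this would be called as `same_schema(dataset_red, dataset_white)`.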

Let's see if there are any NA values.

In [None]:
print(f'The red-wine dataset has {dataset_red.isna().sum().sum()} NA values; the white-wine dataset has {dataset_white.isna().sum().sum()}.')

Let's take a look at the head of one of the datasets.

In [None]:
dataset_red.head()

Each row seems to correspond to an individual wine, with eleven columns describing its chemical properties and one column holding its quality score.

Let's also take a look at the size of the raw datasets.

In [None]:
red_rows, red_columns = dataset_red.shape
white_rows, white_columns = dataset_white.shape

print(f'There are {red_rows} rows and {red_columns} columns in the red dataset')
print(f'There are {white_rows} rows and {white_columns} columns in the white dataset')

Let's replace the white space in the column names with underscores to make life easier.

In [None]:
dataset_red.columns = dataset_red.columns.str.replace(' ','_')
dataset_white.columns = dataset_white.columns.str.replace(' ','_')

dataset_red.head(0)

<h3>Target and feature variables</h3>

All the columns describing chemical properties will be considered feature variables, and the column quality represents the target variable, the variable we want to predict.
Let's save them in variables for later use.

In [None]:
feature_vars = ['fixed_acidity', 'volatile_acidity', 'citric_acid', 'residual_sugar', 'chlorides', 'free_sulfur_dioxide', 'total_sulfur_dioxide', 'density', 'pH', 'sulphates', 'alcohol'] 
target_var = 'quality'

<h3> Scales of measurements </h3>

To choose an appropriate model for our research questions and the available data, it's necessary to understand the scales of measurement of the target and feature variables.

In [None]:
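nomi, disc, ordi, cont = 'Nominal', 'Discrete', 'Ordinal', 'Continuous'

# The 11 chemical properties are continuous; the quality score is discrete.
# Both datasets share the same columns, so the red-wine columns are used as the index.
pd.DataFrame(index=dataset_red.columns, data=[cont for i in range(11)] + [disc], columns=['Scale_of_measurement'])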

As can be seen, all the chemical properties (feature variables) have a continuous scale of measurement and the target variable, quality, has a discrete scale of measurement.

<h3>Central tendencies and dispersion measures</h3>

The central tendencies and dispersion measures give us some useful statistics about the target and feature variables.

In [None]:
dataset_red.describe().round(2)

From the output of describe we can tell that quite a few columns have a big difference between their maximum and minimum values compared to the spread of their quartiles, which may indicate outliers.

The columns with big differences between max and min values are:
residual_sugar, chlorides, free_sulfur_dioxide and total_sulfur_dioxide.
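One way to make "a big difference between max and min" more concrete is to compare the full range of a column with its interquartile range (IQR). The sketch below is an illustration only: `range_to_iqr` is a hypothetical helper and the `toy` frame contains made-up values, not rows from the wine datasets.

```python
import pandas as pd

def range_to_iqr(dataset: pd.DataFrame) -> pd.Series:
    """Ratio of the full range (max - min) to the interquartile range, per column."""
    value_range = dataset.max() - dataset.min()
    iqr = dataset.quantile(0.75) - dataset.quantile(0.25)
    # A large ratio means the extremes lie far outside the middle 50% of the data.
    return (value_range / iqr).sort_values(ascending=False)

# Made-up values: one column with a single extreme value, one evenly spread column.
toy = pd.DataFrame({
    "residual_sugar": [1.9, 2.0, 2.2, 2.4, 15.5],
    "pH": [3.1, 3.2, 3.3, 3.4, 3.5],
})

ratios = range_to_iqr(toy)
print(ratios.round(2))
```

Columns at the top of this ranking are the first candidates to inspect for outliers. A column whose IQR is zero would produce an infinite or undefined ratio and needs to be looked at separately.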

In [None]:
dataset_white.describe().round(2)

Just like the red-wine dataset, the white-wine dataset has similar differences between maximum and minimum values.

Let's take a more visual look at the distribution of the data through a histogram for each of the feature and target attributes.

Starting off with the red wine dataset:

In [None]:
dataset_red.hist(figsize=(15,15))
plt.show()

Moving on to the white wine dataset:

In [None]:
dataset_white.hist(figsize=(15,15))
plt.show()

As can be seen in the histograms above, the quality scores for the red wines range between 3 and 8, with a score of 5 being the most common.
The quality scores for the white wines range between 3 and 9, with a score of 6 being the most common.
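Reading exact ranges and modes off a histogram can be imprecise, so the same facts can be checked by counting the scores directly. A minimal sketch, using a made-up `toy_quality` series in place of the real `quality` column:

```python
import pandas as pd

# Made-up quality scores standing in for dataset_red['quality'].
toy_quality = pd.Series([5, 5, 6, 5, 7, 6, 3, 8], name="quality")

# Frequency of each score, ordered by score rather than by frequency.
counts = toy_quality.value_counts().sort_index()

print(counts)
print("range:", toy_quality.min(), "to", toy_quality.max())
print("most common score:", counts.idxmax())
```

Applied to the real data, `dataset_red['quality']` and `dataset_white['quality']` would take the place of `toy_quality`.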

<h3>Outliers</h3>

To get a view of the outliers we create a boxplot of all the attributes in the dataset.

In [None]:
def boxplotter(dataset, y_axes, x_axis):
    # Draw one boxplot per column in y_axes, grouped by the values of the x_axis column
    for col in y_axes:
        sns.boxplot(x=dataset[x_axis], y=dataset[col])
        plt.show()

First, boxplot all the feature variables against the target variable of the red-wine dataset.

In [None]:
boxplotter(dataset=dataset_red, y_axes=feature_vars, x_axis=target_var)

Now do the same for the white-wine dataset.

In [None]:
boxplotter(dataset=dataset_white, y_axes=feature_vars, x_axis=target_var)

As can be seen in the plots above, all of our attributes have some outliers, ranging from mild to extreme. This requires attention during the data preparation phase.
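The distinction between mild and extreme outliers can be quantified with Tukey's fences: values beyond 1.5 IQR from the quartiles fall outside the inner fence, and values beyond 3 IQR fall outside the outer fence. The sketch below assumes that convention; `count_outliers` is a hypothetical helper and the `toy` frame holds made-up values.

```python
import pandas as pd

def count_outliers(dataset: pd.DataFrame, fence: float = 1.5) -> pd.Series:
    """Count, per column, the values lying more than fence * IQR outside Q1 or Q3."""
    q1 = dataset.quantile(0.25)
    q3 = dataset.quantile(0.75)
    iqr = q3 - q1
    outside = (dataset < q1 - fence * iqr) | (dataset > q3 + fence * iqr)
    return outside.sum()

# Made-up values: chlorides has one mild (0.16) and one extreme (0.61) outlier.
toy = pd.DataFrame({
    "chlorides": [0.07, 0.08, 0.08, 0.09, 0.09, 0.10, 0.10, 0.11, 0.16, 0.61],
    "pH": [3.0, 3.1, 3.2, 3.2, 3.3, 3.3, 3.4, 3.4, 3.5, 3.6],
})

beyond_inner = count_outliers(toy, fence=1.5)  # mild and extreme outliers together
beyond_outer = count_outliers(toy, fence=3)    # extreme outliers only
print(pd.DataFrame({"beyond_inner_fence": beyond_inner, "beyond_outer_fence": beyond_outer}))
```

The difference between the two counts gives the number of mild outliers per column.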

<h3>Correlations</h3>

To help find correlations between variables, and to distinguish independent from dependent attributes, we can make use of a correlation matrix.

In [None]:
def corr_matrix_plotter(dataset, title=''):
    corr = dataset.corr()
    plt.figure(figsize=(10,7.5))
    cmap = sns.diverging_palette(200, 0, as_cmap=True) # color palette as cmap
    mask = np.logical_not(np.tril(np.ones_like(corr))) # triangle mask
    sns.heatmap(corr, annot=True, mask=mask, cmap = cmap, vmin=-1, vmax=1).set_title(title) # correlation heatmap
    plt.show()

corr_matrix_plotter(dataset_red, 'red-wine')
corr_matrix_plotter(dataset_white, 'white-wine')

The correlation matrices above show which attributes correlate with one another. Starting with our target variable 'quality': its strongest correlation is with alcohol, followed by a few weaker ones with volatile acidity, sulphates and citric acid. Because quality is our target variable, it is the dependent attribute in these correlations.

Besides that, there are some correlations among the chemical properties:
Fixed acidity has a strong correlation with pH, but we still treat it as an independent attribute. pH, however, is a dependent attribute: it depends on fixed acidity. Volatile acidity, residual sugar, sulphates, chlorides and density are all independent attributes. Total sulfur dioxide is dependent on free sulfur dioxide, but free sulfur dioxide is independent.
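Picking the strongest correlations out of a heatmap by eye is error-prone, so the correlations with the target can also be ranked numerically. A minimal sketch: `rank_by_target_corr` is a hypothetical helper, and the `toy` frame contains made-up values rather than real wine measurements.

```python
import pandas as pd

def rank_by_target_corr(dataset: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every feature with the target, strongest first."""
    corr = dataset.corr()[target].drop(target)  # drop the target's correlation with itself
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)  # keep the sign, order by absolute strength

# Made-up values standing in for the wine data.
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0, 12.5],
    "volatile_acidity": [0.70, 0.66, 0.60, 0.65, 0.40, 0.45],
    "quality": [4, 5, 5, 6, 7, 7],
})

ranked = rank_by_target_corr(toy, "quality")
print(ranked.round(2))
```

Sorting by absolute value keeps strong negative correlations near the top of the list, where they would otherwise be pushed to the bottom.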

<h2> Data Preparation </h2>

<h3>Data Cleaning</h3>

In the Data Understanding phase we concluded that the datasets don't contain any NA values and that all columns have the correct datatype. Thus the data doesn't have to be cleaned of NA values or incorrect types / scales of measurement.

Let's start by removing all the extreme outliers with an outer fence, leaving the mild ones in the dataset:

In [None]:
def remove_outliers(dataset, fence=3):
    q1 = dataset.quantile(.25)
    q3 = dataset.quantile(.75)
    iqr = q3 - q1
    # Values outside the fences become NaN; dropna() then removes every row that contains one
    in_range = (dataset >= q1 - (fence * iqr)) & (dataset <= q3 + (fence * iqr))
    return dataset[in_range].dropna()

Before removing the outliers, the red-wine dataset contained 1599 rows and the white-wine dataset 4898. Let's remove the outliers and check how many rows are left.

In [None]:
dataset_red = remove_outliers(dataset_red)
dataset_white = remove_outliers(dataset_white)

In [None]:
dataset_red.shape, dataset_white.shape

Of the original 1599 red-wine rows, 1435 are left (roughly 90%), which means about 10% of the rows contained at least one extreme outlier, defined as a value more than 3 IQR above Q3 or more than 3 IQR below Q1. Of the white-wine data, roughly 96% is left. That should be enough data to construct a model.
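The share of rows that survives the outlier removal can be computed instead of worked out by hand. The sketch below repeats the outer-fence filter so that it runs on its own; `retained_fraction` is a hypothetical helper and the `toy` frame holds made-up values.

```python
import pandas as pd

def remove_outliers(dataset: pd.DataFrame, fence: float = 3) -> pd.DataFrame:
    """Drop every row with at least one value more than fence * IQR outside Q1 or Q3."""
    q1 = dataset.quantile(0.25)
    q3 = dataset.quantile(0.75)
    iqr = q3 - q1
    in_range = (dataset >= q1 - fence * iqr) & (dataset <= q3 + fence * iqr)
    return dataset[in_range].dropna()

def retained_fraction(before: pd.DataFrame, after: pd.DataFrame) -> float:
    """Fraction of the original rows that is still present after cleaning."""
    return len(after) / len(before)

# Made-up values: only the last chlorides value lies beyond the outer fence.
toy = pd.DataFrame({
    "chlorides": [0.07, 0.08, 0.08, 0.09, 0.09, 0.10, 0.10, 0.11, 0.16, 0.61],
    "pH": [3.0, 3.1, 3.2, 3.2, 3.3, 3.3, 3.4, 3.4, 3.5, 3.6],
})

cleaned = remove_outliers(toy)
retained = retained_fraction(toy, cleaned)
print(f"{len(cleaned)} of {len(toy)} rows kept ({retained:.0%})")
```

Note that the row counts must be taken before the original variable is overwritten with the cleaned frame, otherwise the comparison is no longer possible.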

<h3>Normalizing data</h3>

Many algorithms used for building a prediction model work more efficiently with normalized data. We normalize the whole dataset into a new dataframe so that the normalized data can be accessed later.

In [None]:
def normalizer(dataset):
    scaler = sk.preprocessing.StandardScaler().fit(dataset)
    return pd.DataFrame((scaler.transform(dataset)), columns=dataset.columns)


In [None]:
normalized_dataset_red = normalizer(dataset_red)
normalized_dataset_white = normalizer(dataset_white)

In [None]:
normalized_dataset_red.head()
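`StandardScaler` standardizes each column to zero mean and unit variance, z = (x - mean) / std, using the population standard deviation. The sketch below reproduces that transformation with pandas alone so the result can be sanity-checked; `standardize` is a hypothetical helper and the `toy` frame contains made-up values.

```python
import numpy as np
import pandas as pd

def standardize(dataset: pd.DataFrame) -> pd.DataFrame:
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    return (dataset - dataset.mean()) / dataset.std(ddof=0)

# Made-up values standing in for the cleaned wine data.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0],
    "pH": [3.1, 3.3, 3.2, 3.6],
})

z = standardize(toy)

# After standardization every column should have mean 0 and standard deviation 1.
print(np.allclose(z.mean(), 0.0))
print(np.allclose(z.std(ddof=0), 1.0))
```

The same two checks can be run on the normalized dataframes to confirm the scaling worked. Because the whole dataset is passed to the scaler, the quality column is standardized as well.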

<h3>Data cleaned</h3>

The data was cleaned by removing all extreme outliers from the datasets and creating a normalized copy of each dataset.