<h1> Red wine data analysis </h1>

<i>Joël Boafo, Sjoerd Beetsma, Maarten de Jeu
Class V2A - Group 5</i>

<h2> Data Understanding </h2>

The dataset was acquired from https://archive.ics.uci.edu/ml/datasets/wine

The business tells us the variables in the dataset are:<br />
1 - fixed acidity <br />
2 - volatile acidity <br />
3 - citric acid <br />
4 - residual sugar <br />
5 - chlorides <br />
6 - free sulfur dioxide <br />
7 - total sulfur dioxide <br />
8 - density <br />
9 - pH <br />
10 - sulphates <br />
11 - alcohol <br />
12 - quality (score between 0 and 10, based on sensory data) <br />

The business also let us know that they don't know whether all variables are relevant in deciding the quality score of a wine.

We import some libraries and the dataset to examine the data through code.

In [None]:
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn import preprocessing

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
dataset = pd.read_csv("datasets/winequality-red.csv", sep=";")

In [None]:
dataset.dtypes

A first look at the dataset through its first few rows:

In [None]:
dataset.head()

Each row seems to correspond to an individual wine, with eleven columns describing its chemical properties and one column holding a quality score.

Let's also take a look at the size of the raw dataset.

In [None]:
rows, columns = dataset.shape
print(f'There are {rows} rows and {columns} columns in the dataset')

Let's replace the whitespace in the column names with underscores to make life easier.

In [None]:
dataset.columns = dataset.columns.str.replace(' ','_')
dataset.head(0)

<h3>Target and feature variables</h3>

All the columns describing chemical properties will be considered feature variables. The column quality is the target variable, the variable we want to predict.

<h3> Scales of measurements </h3>

To choose an appropriate model for our research questions and the available data, it's necessary to understand the scale of measurement of the target and each feature variable.

In [None]:
nomi, disc, ordi, cont = 'Nominal', 'Discrete', 'Ordinal', 'Continuous'
# eleven continuous feature columns followed by the discrete target column
pd.DataFrame(index=dataset.columns, data=[cont] * 11 + [disc], columns=['Scale_of_measurement'])

As can be seen, all the chemical properties (feature variables) have a continuous scale of measurement, and the target variable, quality, has a discrete scale of measurement.
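One rough way to sanity-check this classification is to count the distinct values per column: a discrete score takes only a handful of values, while a continuous measurement usually takes many. The sketch below is not part of the original notebook. It uses a small made-up frame (`toy`) and a hypothetical helper `distinct_counts`; the same helper could be applied to `dataset`.

```python
import pandas as pd

def distinct_counts(df: pd.DataFrame) -> pd.Series:
    """Return the number of distinct values in each column."""
    return df.nunique()

# Made-up illustration data, not taken from the wine dataset.
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.1, 11.2, 12.5, 9.5],
    "quality": [5, 5, 6, 6, 7, 5],
})

counts = distinct_counts(toy)
print(counts)
```

A low count is only a hint, not proof, that a column is discrete; rounding in the measurements can also reduce the number of distinct values.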

<h3>Central tendencies and dispersion measures</h3>

The central tendencies and dispersion measures give us some useful statistics about the target and feature variables.

In [None]:
dataset.describe().round(2)

Because all attributes are numerical, let's take a closer look at the distribution of the data through a histogram for each feature and the target attribute.

In [None]:
dataset.hist(figsize=(15,15))
plt.show()

As can be seen above, the quality scores range from 3 to 8, with scores of 5 and 6 being the most common.

Let's check the dataset for any NA values.

In [None]:
dataset.isna().sum()

The dataset contains no NA values.

<h3>Outliers</h3>

To get a view of the outliers, we create a boxplot for each attribute in the dataset.

In [None]:
for col in dataset.columns:
    sns.boxplot(x=dataset[col])
    plt.show()

As can be seen in the plots above, all of our attributes have some outliers, ranging from mild to extreme. These need to be looked at during the data preparation phase.
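To put numbers on what the boxplots show, the outliers could be counted per column. The sketch below is an illustration rather than part of the original analysis. It assumes the conventional definition in which values beyond 1.5 times the IQR from the quartiles (inner fences) are mild outliers and values beyond 3 times the IQR (outer fences) are extreme outliers. The helper `count_outliers` and the frame `toy` are hypothetical.

```python
import pandas as pd

def count_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """Count mild and extreme outliers per column using IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Inner fences lie 1.5 * IQR from the quartiles, outer fences 3 * IQR.
    outside_inner = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
    outside_outer = (df < q1 - 3 * iqr) | (df > q3 + 3 * iqr)
    return pd.DataFrame({
        "mild": (outside_inner & ~outside_outer).sum(),  # between the fences
        "extreme": outside_outer.sum(),                  # beyond the outer fence
    })

# Made-up values: 17 falls between the fences, 40 beyond the outer fence.
toy = pd.DataFrame({"chlorides": [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 40]})
outlier_counts = count_outliers(toy)
print(outlier_counts)
```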

<h3>Correlations</h3>

To help find correlations between variables, and to identify independent and dependent attributes, we can make use of a correlation matrix.

In [None]:
corr = dataset.corr()
plt.figure(figsize=(10,7.5))
cmap = sns.diverging_palette(200, 0, as_cmap=True) # color palette as cmap
mask = np.logical_not(np.tril(np.ones_like(corr))) # triangle mask
sns.heatmap(corr,annot=True, mask=mask, cmap = cmap, vmin=-1, vmax=1) # correlation heatmap

In the correlation matrix above you can see which attributes correlate with other attributes. Starting with our target variable 'quality', we can see that quality has a few correlations, the strongest being with alcohol, followed by weaker ones with volatile acidity, sulphates and citric acid. Because quality is our target variable, it is the dependent attribute in these correlations.

In addition, there are some correlations among the chemical properties:
Fixed acidity has a strong correlation with pH, but we still treat it as an independent attribute. pH, however, is a dependent attribute: it depends on fixed acidity. Volatile acidity, residual sugar, sulphates, chlorides, and density are all independent attributes. Total sulfur dioxide depends on free sulfur dioxide, while free sulfur dioxide itself is independent.
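The correlations with quality can also be read off as a ranked list instead of from the heatmap. The following is a minimal sketch on made-up data: the helper `rank_by_correlation` and the frame `toy` are hypothetical and not part of the original notebook. It sorts the features by the absolute value of their Pearson correlation with the target, so strong negative correlations rank as high as strong positive ones.

```python
import pandas as pd

def rank_by_correlation(df: pd.DataFrame, target: str) -> pd.Series:
    """Sort features by absolute Pearson correlation with the target."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target.sort_values(key=abs, ascending=False)

# Made-up illustration data, not taken from the wine dataset.
toy = pd.DataFrame({
    "alcohol": [9.0, 10.0, 11.0, 12.0, 13.0],
    "volatile_acidity": [0.9, 0.4, 0.7, 0.5, 0.3],
    "quality": [4, 5, 6, 7, 8],
})

ranking = rank_by_correlation(toy, "quality")
print(ranking)
```

Applied to the wine data, the call would be `rank_by_correlation(dataset, 'quality')`.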

<h2> Data Preparation </h2>

<h3>Data Cleaning</h3>

In the Data Understanding phase we concluded that the dataset doesn't contain any NA values and that all columns have the correct datatype. Thus the data doesn't have to be cleaned of NA values or incorrect types / scales of measurement.

Let's start off by removing all the extreme outliers using an outer fence, leaving the mild ones in the dataset:

In [None]:
q1 = dataset.quantile(.25)
q3 = dataset.quantile(.75)
iqr = q3 - q1
dataset = dataset[(dataset >= q1 - (3 * iqr)) & (dataset <= q3 + (3 * iqr))] # turn extreme outliers into NaN values
dataset = dataset.dropna() # drop all rows containing one or more extreme outliers

The dataset contained 1599 rows before removing the outliers. Let's check how many are left.

In [None]:
dataset.shape

There are still 1410 rows left, which means about 12% of the rows contained one or more extreme outliers. With 88% still left, there should be enough data to construct a model.
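The percentages follow directly from the two row counts reported above; a short worked calculation:

```python
rows_before = 1599  # row count reported before removing extreme outliers
rows_after = 1410   # row count reported after removal

removed = rows_before - rows_after
removed_share = removed / rows_before
kept_share = rows_after / rows_before

print(f"removed {removed} rows ({removed_share:.1%}), kept {kept_share:.1%}")
# prints "removed 189 rows (11.8%), kept 88.2%"
```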

<h3>Normalizing data</h3>

Many algorithms used for building a prediction model work more efficiently with normalized data. We normalize the whole dataset into a new dataframe from which the normalized data can be accessed. The StandardScaler used here rescales each column to zero mean and unit variance.

In [None]:
scaler = preprocessing.StandardScaler().fit(dataset)
normalized_dataset = pd.DataFrame(scaler.transform(dataset), columns=dataset.columns)

In [None]:
normalized_dataset.head()
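With its default settings, StandardScaler subtracts the column mean and divides by the population standard deviation (`ddof=0`). The sketch below reproduces that calculation in plain pandas on made-up data, so the effect can be checked by hand. The helper `standardize` and the frame `toy` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale each column to zero mean and unit (population) variance."""
    return (df - df.mean()) / df.std(ddof=0)

# Made-up illustration data, not taken from the wine dataset.
toy = pd.DataFrame({
    "pH": [3.0, 3.2, 3.4, 3.6],
    "alcohol": [9.0, 10.0, 11.0, 14.0],
})

scaled = standardize(toy)
print(scaled.mean().round(6))       # close to 0 for every column
print(scaled.std(ddof=0).round(6))  # close to 1 for every column
```

The same two checks could be run on `normalized_dataset` to confirm that the scaling behaved as expected.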

<h3>Data cleaned</h3>

The data was cleaned by removing all rows containing extreme outliers, and a normalized copy of the dataset was created.