# Training and testing of a classification model: wine quality dataset
## Outreachy application startup task

In order to get started with Mozilla PRESC, Outreachy applicants are asked to [train a classification model on a chosen dataset, then evaluate the quality of the model](https://github.com/mozilla/PRESC/issues/2) using scikit-learn.  
I chose to work on the [wine quality dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

#### My background with data science
I have experience with tools such as pandas, numpy and matplotlib. I am familiar with data science methodology.
More importantly, I used to be in charge of the modelling tools of my former flood forecasting team, so I am used to assessing the strengths and flaws of a model. I hope these skills prove useful.  
However, this is my first time working with scikit-learn. I am excited at this opportunity to explore.

#### Why I chose this dataset
Given the choice, I prefer to work with data I can make sense of and learn something interesting from. Wine quality and EEG eye state measurements both piqued my curiosity in this regard. I discarded the latter because I lack the background to understand its input variables.  
The wine quality dataset has variables I can make sense of with basic chemistry knowledge. It promises to be fun to work with.

## 1. Model construction
In this first of two parts, I will train a classification model to analyze the data.
To make things interesting, I wish to treat this part as closely to a proper data analysis as is reasonably doable.

I will follow this methodology:
* Ask a question both interesting and relevant to the data. The following work will aim at answering this question,
* Observe and understand the data,
* Choose and train a fitting model to answer the question,
* Answer the question as best I can using model results.

I would usually consider trying another model if I am not satisfied with the results. Since the end goal of this work is to evaluate the model, I will do no such iterative work but focus on highlighting the model's limits. A flawed model will certainly be interesting to evaluate in the second part of my work.

### The question I want to answer
The dataset describes eleven physicochemical properties of white wines and their quality as perceived by wine experts.  
Given the data, the following question arises:  
**_What can I learn from the data about the influence of physicochemical properties on wine quality?_**

### Diving into the data
#### Overview
The [wine quality dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) I chose to work with describes the quality and physicochemical properties of approximately 5000 samples of Portuguese white Vinho Verde wine.  
The **input variables** consist of 11 physicochemical properties of each wine:
 * fixed acidity,
 * volatile acidity,
 * citric acid,
 * residual sugar,
 * chlorides,
 * free sulfur dioxide,
 * total sulfur dioxide,
 * density,
 * pH,
 * sulphates,
 * alcohol.
 
The **output variables** are:
 * quality of wine, assessed by wine experts on a scale of 0 to 10,
 * a boolean variable added for the purpose of this exercise stating if a wine is recommended or not. Recommended wines are the ones with a quality >= 7.

*insert view of head of table*

*insert spreading of quality grades*

*insert checking for missing data*
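The three checks above (first rows, spread of quality grades, missing data) could be sketched as follows. This is only a minimal sketch under assumptions: the helper `load_wine` and the column name `recommended` are hypothetical names chosen for illustration, and the file is assumed to be the semicolon-separated `winequality-white.csv` from the UCI repository. A tiny inline sample with placeholder values stands in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd


def load_wine(source):
    """Read the semicolon-separated wine data and add the boolean target.

    `source` can be a file path or any file-like object.
    """
    wine = pd.read_csv(source, sep=";")
    # Hypothetical column name: a wine is "recommended" if its quality is >= 7.
    wine["recommended"] = wine["quality"] >= 7
    return wine


# Placeholder rows for illustration only; they are NOT taken from the dataset.
sample = io.StringIO(
    "fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;"
    "free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality\n"
    "7.0;0.30;0.30;10.0;0.045;40.0;150.0;0.9980;3.10;0.45;9.0;6\n"
    "6.5;0.25;0.35;2.0;0.040;30.0;110.0;0.9920;3.20;0.50;11.5;7\n"
    "8.0;0.40;0.20;5.0;0.050;20.0;120.0;0.9950;3.00;0.40;10.0;5\n"
)
wine = load_wine(sample)  # with the real data: load_wine("winequality-white.csv")

print(wine.head())                                   # view of the head of the table
print(wine["quality"].value_counts().sort_index())   # spread of quality grades
print(wine.isna().sum())                             # missing values per column
```

With the real file, the same three `print` calls would show the first five wines, the number of wines per quality grade, and whether any column has missing values.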

### Understanding input variables
I believe meaningful data analysis requires understanding what the data represents.  
In this case, most input variables are technical winemaking terms that I am not familiar enough with to know why they are measured. In addition, the dataset comes without units.  
These details can certainly be found in the study the dataset originates from ([Cortez et al., 2009](https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub)). I chose not to read it so that it would not influence my own work, and found answers through various other sources instead:

| Variable              | Unit in dataset| Description |
| :-------------------- | :------------- | :---------- |
| fixed acidity         |   g/l          | Concentration of all nonvolatile or fixed acids, as opposed to volatile acids (see below). The three most important fixed acids are tartaric, malic and citric. They are present in the grapes.|
| volatile acidity      |   g/l          | Concentration of all volatile acids. Contrary to fixed acids, volatile acids are the result of the winemaking process.|
| citric acid           |   g/l          | Concentration of citric acid. Usually between 0 and 0.5 g/l.|
| residual sugar        |   g/l          | Concentration of sugars remaining from the grapes after winemaking. It can be under 2 g/l for dry wines and over 45 g/l for sweet wines.|
| chlorides             |   g/l          | Concentration of chloride ions, mainly from salt (NaCl). Usually under 0.05 g/l, but it can exceed 1 g/l for wines produced close to the sea.|
| free sulfur dioxide   |   mg/l         | Concentration of free sulfur dioxide (sulphite, SO2). SO2 is a common additive with many uses; it mostly helps control fermentation. Most of the added SO2 reacts with sugar and various wine components; this fraction is called bound SO2. Free SO2 is the remaining concentration and the one that is actually useful. In wines made by winemakers who master this treatment, bound SO2 doesn't exceed twice the free SO2.|
| total sulfur dioxide  |   mg/l         | Total concentration of free and bound sulfur dioxide. It is usually between 10 and 150 mg/l. To prevent toxic effects, Europe has limits for the maximal concentration of total SO2: 200 mg/l for white wines, or 250 mg/l for sweeter white wines (above 5 g/l residual sugar) and up to 400 mg/l for some sweet wines.|
| density               |    /           | Ratio of the mass of 1 l of wine to the mass of 1 l of water. |
| pH                    |    /           | A measure of overall acidity. The lower the pH, the more acidic the wine. Most wines have a pH between 2.9 and 3.9. Acidity is a fundamental property of wine. |
| sulphates             |   g/l          | Concentration of sulphate ions. Wine naturally contains between 0.1 and 0.4 g/l from grapes. Sulphate concentration rises gradually over time due to reactions with air and SO2 from treatment.|
| alcohol               | % of volume    | Can vary from 7 to 16%.|

Sources:
 * https://waterhouse.ucdavis.edu/whats-in-wine
 * https://fr.wikipedia.org/wiki/Acides_du_vin
 * https://fr.wikipedia.org/wiki/Acidit%C3%A9_volatile
 * https://fr.wikipedia.org/wiki/Dioxyde_de_soufre_en_%C5%93nologie
 * https://www.oenologie.fr
 * https://dico-du-vin.com
 * https://www.futura-sciences.com/sciences/dossiers/chimie-chimie-vin-381/  
(Since this is merely an exercise, I allow myself to refer to Wikipedia. Articles about wine tend to be good on the French Wikipedia.) (I am suddenly craving a baguette.)

*insert table of descriptive statistics*
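A table of descriptive statistics like the one mentioned above can be obtained with `DataFrame.describe()`. The snippet below is a minimal sketch on a small placeholder frame (the values are invented for illustration and are not rows from the dataset); it also shows how a count such as "wines above a given chloride concentration" could be derived.

```python
import pandas as pd

# Placeholder values for illustration only; they are NOT rows from the dataset.
wine = pd.DataFrame({
    "residual sugar": [1.5, 8.0, 20.0, 4.0],
    "chlorides": [0.04, 0.05, 0.12, 0.03],
    "alcohol": [12.0, 10.5, 9.0, 11.0],
})

# One row per variable: count, mean, std, min, quartiles and max.
stats = wine.describe().T
print(stats[["min", "mean", "max"]])

# Number of wines above a threshold, here chlorides > 0.1 g/l.
high_chlorides = int((wine["chlorides"] > 0.1).sum())
print(high_chlorides)
```

Transposing with `.T` puts the variables in rows, which is easier to read when there are eleven input variables.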

I have the following observations at this point:
 * 4 out of 11 variables are about acidity. 3 are about sulfur.
 * The wide distribution of residual sugar in the data indicates that these wines are of different types despite all being white Vinho Verde. Residual sugar might be the right variable for a first approach to classification.
 * Sugar is meant to balance acidity and alcohol in the wine. I expect a complex but meaningful correlation between the three.
 * Sulfite levels are indicative of treatments to prevent or curb undesirable effects during winemaking. Poorer wine quality could be associated with anomalies in sulfite levels. Sulfite treatment depends on sugar concentration, so sugar should once again be significant.
 * Chlorides should not be significant, aside from a small fraction of wines with a high chloride concentration. 110 wines measure above 0.1 g/l.
 * From my humble experience with wine, I expect density to be highly correlated with residual sugar and thus not a significant variable.
 * From my research, sulphate concentration didn't appear to be a very relevant variable. I couldn't find much information about it. There may or may not be subtlety to it, such as sulphate not being relevant in itself but being a good indicator of something else.
 * A few numbers already appear strange: max total SO2 of 440.0 mg/l, max free SO2 of 289.0 mg/l, max residual sugar of 65 g/l. I want to take a closer look at them:

In [None]:
print(wine.loc[wine["total sulfur dioxide"]>400]) #shows all wines with total SO2 > 400 mg/l
print(wine.loc[wine["free sulfur dioxide"]>150])  #shows all wines with free  SO2 > 150 mg/l

Wine #4745 has the maximal values for both total and free SO2. It is even the only wine with total SO2 over 400 mg/l and free SO2 over 150 mg/l. I am not surprised that it is one of only 20 wines that got the lowest quality grade of 3/10. Something went wrong with this wine.

In [None]:
print(wine.loc[wine["residual sugar"]>40])

Wine #2781 is the only one with residual sugar above 40 g/l. Its quality is average. Maybe it is just a normal, very sweet wine.  
It is also the wine with the highest density. This seems to be in line with my conjecture about sugar determining density.
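One possible way to test this conjecture numerically is to check whether the sweetest wine is also the densest one, and to compute the Pearson correlation between the two columns. The frame below holds placeholder values only (not rows from the dataset), so the printed numbers say nothing about the real data.

```python
import pandas as pd

# Placeholder values for illustration only; they are NOT rows from the dataset.
wine = pd.DataFrame({
    "residual sugar": [1.5, 8.0, 20.0, 4.0],
    "density": [0.991, 0.996, 1.002, 0.993],
})

# Is the wine with the most residual sugar also the densest one?
same_wine = wine["residual sugar"].idxmax() == wine["density"].idxmax()

# Pearson correlation coefficient between residual sugar and density.
r = wine["residual sugar"].corr(wine["density"])

print(same_wine, round(r, 3))
```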

This analysis gives me a solid first idea about the relevance of the various variables and the dynamics between them. These dynamics appear complex enough to justify using machine learning.

In [None]:
#insert scatter matrix
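A scatter matrix colored by recommendation could be drawn with `pandas.plotting.scatter_matrix`. The snippet below is only a sketch under assumptions: the boolean column is assumed to be named `recommended`, the color choice is arbitrary, and random placeholder values stand in for the real data so that the snippet runs on its own.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Random placeholder values for illustration only; NOT the wine dataset.
rng = np.random.default_rng(0)
n = 50
wine = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, n),
    "pH": rng.uniform(2.9, 3.9, n),
    "residual sugar": rng.uniform(1, 20, n),
})
wine["recommended"] = rng.random(n) > 0.7  # hypothetical column name

# One color per wine: red if recommended, blue otherwise.
colors = wine["recommended"].map({True: "tab:red", False: "tab:blue"}).tolist()

# Plot every input variable against every other one, histograms on the diagonal.
axes = scatter_matrix(
    wine.drop(columns="recommended"),
    c=colors,
    alpha=0.5,
    figsize=(8, 8),
    diagonal="hist",
)
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so that the figure is displayed inline.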

Scatter matrix comments:
 * Some variables are well spread (alcohol, pH) but others have a few extreme values that stick out. Residual sugar, density and free sulfur dioxide even have a single extreme value that completely flattens their scatter plots. These must be wines #4745 and #2781, which I noticed previously. Let's drop them to make the plots more readable:

In [None]:
#insert scatter matrix dropping these two extreme points
#also drop points with extreme density + explain dropping high chloride
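Dropping the two extreme wines by their index labels could look like the sketch below. It assumes that the labels 2781 and 4745 quoted above are the DataFrame index labels. Apart from the two maxima quoted earlier in the text (65 g/l residual sugar, 289 mg/l free SO2), the values are placeholders for illustration.

```python
import pandas as pd

# Placeholder frame: only the two extreme values (65.0 and 289.0) come from the text.
wine = pd.DataFrame(
    {
        "residual sugar": [5.0, 65.0, 12.0, 8.0],
        "free sulfur dioxide": [30.0, 45.0, 35.0, 289.0],
    },
    index=[10, 2781, 3000, 4745],
)

outliers = [2781, 4745]  # index labels of the two extreme wines

# drop() returns a new frame; the original `wine` is left untouched.
trimmed = wine.drop(index=outliers)

print(trimmed.max())
```

The trimmed frame can then be passed to the same plotting call as before, while the full frame stays available for training.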

This is much more readable already!
 * Density has a strong linear correlation with both residual sugar and alcohol percentage. But no such direct correlation is visible at this point between sugar and alcohol.
 * Other pairs of variables are visibly correlated:
   * pH and fixed acidity (this is comforting),
   * fixed acidity and citric acid (this seems logical as well),
   * total sulfur dioxide and residual sugar (as anticipated),
   * total sulfur dioxide and free sulfur dioxide,
 * Some tendencies already appear regarding quality. Recommended wines tend to have:
   * higher alcohol percentage,
   * lower density,
   * no high chloride concentration,
   * medium citric acid concentration,
   * lower ratio of total/free sulfur dioxide.
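The visual impressions listed above can be cross-checked with a correlation matrix. The helper below, `strongest_pairs`, is a hypothetical function written for illustration: it lists the pairs of columns whose absolute Pearson correlation reaches a chosen threshold. The demo frame holds placeholder values, not rows from the dataset.

```python
import pandas as pd


def strongest_pairs(frame, threshold=0.5):
    """Return (column_a, column_b, r) for pairs with |r| >= threshold, strongest first."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 2)))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Placeholder values for illustration only; they are NOT rows from the dataset.
demo = pd.DataFrame({
    "residual sugar": [1.0, 5.0, 10.0, 15.0, 20.0],
    "density": [0.990, 0.992, 0.995, 0.997, 1.000],
    "pH": [3.2, 3.0, 3.3, 3.1, 3.2],
})

pairs = strongest_pairs(demo)
print(pairs)
```

Note that Pearson correlation only captures linear relations, so a pair that looks related in the scatter matrix can still fall below the threshold.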
   
 *insert scatter plot with colormap as quality instead of recommendation, just for free SO2.*
 The poorer wines mostly have low free SO2; wine #4745, with its very high free SO2, is the exception. Insufficient sulphite treatment presumably lets fermentation run too far, which results in a poor wine.

## 2. Model evaluation