# Training and testing of a classification model: wine quality dataset
## Outreachy application startup task

In order to get started with Mozilla PRESC, Outreachy applicants are asked to [train a classification model on a chosen dataset, then evaluate the quality of the model](https://github.com/mozilla/PRESC/issues/2) using scikit-learn.  
I chose to work on the [wine quality dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality).

### My background with data science
I have experience with tools such as pandas, numpy and matplotlib, and I am familiar with data science methodology.
More importantly, I used to be in charge of the modelling tools of my former flood forecasting team, so I am used to assessing the strengths and flaws of a model. I hope these skills prove useful.  
However, this is my first time working with scikit-learn. I am excited about this opportunity to explore.

## 1. Model construction
In this first of two parts, I will train a classification model to analyze the data.
I will use the following methodology:
* Ask a question that is both interesting and relevant to the data. The rest of the work aims at answering this question.
* Observe and understand the data.
* Choose and train a suitable model to answer the question.
* Answer the question as best I can using the model results.

I would usually consider trying another model if I were not satisfied with the results. Since the end goal of this work is to evaluate the model, I will not iterate in this way and will instead focus on highlighting the model's limits. A flawed model will certainly be interesting to evaluate in the second part of my work.
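
As a rough illustration of the "choose and train" step above, here is a minimal sketch of a train-then-score loop with scikit-learn. The classifier (`DecisionTreeClassifier`), its `max_depth`, the 25% test split and the helper name `train_and_score` are all assumptions made for the example, not choices made in this work. The data is synthetic stand-in data, not the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_score(X, y, seed=0):
    """Fit a classifier on a training split and score it on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    # Assumed model choice, for illustration only.
    model = DecisionTreeClassifier(max_depth=5, random_state=seed)
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data with 11 features (NOT the wine dataset).
X, y = make_classification(n_samples=200, n_features=11, random_state=0)
model, accuracy = train_and_score(X, y)
print(f"held-out accuracy: {accuracy:.2f}")
```

Keeping a held-out split from the start matters here, since the second part of the work is about evaluating the model rather than improving it.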

### The question I want to answer
The dataset describes eleven physicochemical properties of white wines and their quality as perceived by wine experts.  
Given the data, a natural question is:  
**_What can I learn from the data about the influence of physicochemical properties on wine quality?_**

### Diving into the data
#### Overview
The [wine quality dataset](https://archive.ics.uci.edu/ml/datasets/Wine+Quality) I chose to work with describes the quality and physicochemical properties of approximately 5000 instances of the Portuguese white wine Vinho Verde.  
The **input variables** consist of 11 physicochemical properties of each wine:
 * fixed acidity
 * volatile acidity
 * citric acid
 * residual sugar
 * chlorides
 * free sulfur dioxide
 * total sulfur dioxide
 * density
 * pH
 * sulphates
 * alcohol
 
The **output variable** is the quality of the wine, assessed by wine experts on a scale of 0 to 10.
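
The split between input and output variables can be sketched with pandas as follows. The column names are the ones listed above. The local file name `winequality-white.csv`, the `;` separator and the helper names `load_white_wines` and `split_inputs_output` are assumptions for this example.

```python
import pandas as pd

# Input columns as listed above; "quality" is the output variable.
FEATURES = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]
TARGET = "quality"


def load_white_wines(path="winequality-white.csv"):
    """Read the white wine file (assumed local path, assumed ';' separator)."""
    return pd.read_csv(path, sep=";")


def split_inputs_output(df):
    """Return (X, y): the 11 input columns and the quality column."""
    missing = [c for c in FEATURES + [TARGET] if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    return df[FEATURES], df[TARGET]
```

Checking for missing columns up front gives a clearer error than a failure later during training, for instance if the file is read with the wrong separator.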

#### Understanding input variables
Meaningful data analysis requires understanding what the data represents.  
In this case, most input variables are technical wine terms that I am not familiar enough with to know why they are measured. In addition, the dataset comes without units.  
These details can certainly be found in the study the dataset originates from ([Cortez et al., 2009](https://www.sciencedirect.com/science/article/pii/S0167923609001377?via%3Dihub)). I chose not to read it so as not to influence my own work, and found answers through various other sources instead:

| Variable              | Unit in dataset      | Description |
| :-------------------- | :--------- | :---------- |
| fixed acidity         |   g/l      | Concentration of all nonvolatile (fixed) acids, as opposed to volatile acids (see below). The three most important fixed acids are tartaric, malic and citric. They are present in the grapes.|
| volatile acidity      |   g/l      | Concentration of all volatile acids. Unlike fixed acids, volatile acids result from the winemaking process.|
| citric acid           |   g/l      | Concentration of citric acid. Usually between 0 and 0.5 g/l.|
| residual sugar        |   g/l      | Concentration of sugars remaining from the grapes after winemaking. It can be under 2 g/l for dry wines and up to over 45 g/l for sweet wines.|
| chlorides             |   g/l      | Concentration of chloride ions, mainly from NaCl, i.e. salt. Usually under 0.05 g/l, but it can exceed 1 g/l for wines produced close to the sea.|
| free sulfur dioxide   |                 |             |
| total sulfur dioxide  |                 |             |
| density               |    /       |             |
| pH                    |    /       |   A measure of overall acidity. The lower the pH, the more acidic the wine. Most wines have a pH between 2.9 and 3.9. Acidity is a fundamental property of wine. |
| sulphates             |                 |             |
| alcohol               | % of volume| Can vary from 7 to 16%.|

Sources:
* https://waterhouse.ucdavis.edu/whats-in-wine
* https://fr.wikipedia.org/wiki/Acides_du_vin
* https://fr.wikipedia.org/wiki/Acidit%C3%A9_volatile
* https://www.oenologie.fr
* https://dico-du-vin.com
* https://www.futura-sciences.com/sciences/dossiers/chimie-chimie-vin-381/
  (Since this is merely an exercise, I allow myself to refer to Wikipedia. Articles about wine tend to be good on French Wikipedia.) (I am suddenly craving a baguette.)

*insert table of descriptive statistics*
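
The table of descriptive statistics mentioned above could be produced along these lines. This is a minimal sketch using pandas `describe()`. The helper name `describe_inputs` is hypothetical, and the small frame below contains made-up values that only show the shape of the output. They are not taken from the dataset.

```python
import pandas as pd


def describe_inputs(df):
    """Summary statistics, one row per variable."""
    stats = df.describe().T  # count, mean, std, min, quartiles, max
    stats["missing"] = df.isna().sum()  # number of missing values per column
    return stats


# Toy frame with made-up values, for illustration only.
toy = pd.DataFrame({
    "residual sugar": [1.0, 2.0, 6.0],
    "alcohol": [9.0, 10.0, 11.0],
})
summary = describe_inputs(toy)
print(summary[["mean", "min", "max", "missing"]])
```

Transposing the result puts one variable per row, which matches the layout of the variable table above.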

Observations at this point:
 * 4 out of 11 variables are about acidity. 3 are about sulfur.
 * The wide distribution of residual sugar in the data indicates that these wines are of different types despite all being white Vinho Verde. Residual sugar might be the right variable for a first approach to classification.
 * From my humble knowledge of wine, I expect density to be highly correlated with residual sugar and thus not to be a significant variable.
 * Sugar is meant to balance acidity and alcohol in the wine. I expect a complex but meaningful correlation between the three.
 * Sulfur levels are indicative of treatments to prevent or curb undesirable effects during winemaking. Poorer wine quality could be associated with anomalies in sulfur levels.
 * Chlorides should not be significant, aside from a small fraction of wines with a high chloride concentration. 110 wines measure above 0.1 g/l.
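
Some of the expectations above can be checked with a few lines of pandas. The sketch below assumes a DataFrame with the column names listed earlier. The helper name `check_observations` and the use of pH as a stand-in for acidity are assumptions for the example.

```python
import pandas as pd


def check_observations(df, chloride_limit=0.1):
    """Return quick numbers to test the expectations listed above."""
    return {
        # Pearson correlation between density and residual sugar
        "density_vs_sugar": df["density"].corr(df["residual sugar"]),
        # Number of wines above the chloride threshold (g/l)
        "high_chlorides": int((df["chlorides"] > chloride_limit).sum()),
        # Pairwise correlations between sugar, pH (assumed acidity proxy) and alcohol
        "sugar_acid_alcohol": df[["residual sugar", "pH", "alcohol"]].corr(),
    }
```

Pearson correlation only captures linear relationships, so a low value in the sugar, acidity and alcohol matrix would not rule out the more complex link expected above.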

## 2. Model evaluation