# Individual assignment: more wine!

In this assignment you will work with two datasets describing the red and white variants of the Portuguese "Vinho Verde" wine.

* Paper: https://www.semanticscholar.org/paper/Modeling-wine-preferences-by-data-mining-from-Cortez-Cerdeira/bf15a0ccc14ac1deb5cea570c870389c16be019c


The columns in the datasets are as follows:

1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2. volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3. citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines
4. residual sugar: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
5. chlorides: the amount of salt in the wine
6. free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7. total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8. density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11. alcohol: the percent alcohol content of the wine

There are two datasets, one for red wine and one for white wine. The goal is to model wine quality.

## Grading

The assignment is graded up to 8 points.

There can be a maximum of 2 points of extra credit, according to the `improvement` percentage in the last question:
* If your hyperparameter tuning improves the model by 5% or more, you get **1 point of extra credit.**
* If it improves the model by 10% or more, you get **2 points of extra credit.**
* If your hyperparameter tuning does not improve the model, you get **0 points of extra credit.**

**The maximum grade is 10 (8 + 2 extra credit if applicable).**

## Question 1 (0.5 points)

Load each dataset as a dataframe and create a new column called `type` with the wine type (red or white; the type is included in the name of each dataset).

Concatenate the two dataframes into a single dataframe called `wine` and display a sample of 5 rows.
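
As a rough illustration of this step, here is a minimal sketch using pandas. The file names and separator of the real datasets are not stated here, so two tiny made-up frames stand in for what `pd.read_csv(...)` would return; the variable names `red` and `white` are assumptions.

```python
import pandas as pd

# Stand-ins for the two loaded datasets (values are made up for illustration).
# With the real files you would build these with pd.read_csv(...) instead.
red = pd.DataFrame({"alcohol": [9.4, 9.8, 10.0], "quality": [5, 5, 6]})
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 7]})

# Tag each frame with its wine type before combining them.
red["type"] = "red"
white["type"] = "white"

# ignore_index=True rebuilds a clean 0..n-1 index for the combined frame.
wine = pd.concat([red, white], ignore_index=True)

# random_state only makes the sample repeatable.
print(wine.sample(5, random_state=0))
```

Without `ignore_index=True`, the combined frame would keep duplicate index labels from the two sources, which can cause surprises in later label-based selections.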

## Question 2 (0.5 points)

I just realized that the column `type` should not be a string, but a number that represents a category. Change the column `type` to a numerical column and display the first 5 rows.
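
One possible way to do this is an explicit mapping, sketched below on a toy frame. The encoding (red as 0, white as 1) is an arbitrary assumption; any consistent numeric coding would serve.

```python
import pandas as pd

# Toy frame standing in for the concatenated `wine` dataframe.
wine = pd.DataFrame({"type": ["red", "white", "white", "red"]})

# Assumed encoding: red -> 0, white -> 1.
type_codes = {"red": 0, "white": 1}
wine["type"] = wine["type"].map(type_codes)

print(wine.head(5))
```

An explicit dictionary keeps the meaning of each code visible in the notebook, which is harder to guarantee with automatic encoders.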

## Question 3 (1 point)

We need to convert this problem into a classification problem.

Before that, we need a categorical target, and for that we will use the `quality` column. But first, analyze the `quality` column and decide how to convert it into a categorical target.

1. Print the unique values of the `quality` column.
2. Plot a histogram of the `quality` column for each type of wine.
3. Based on the analysis, decide and justify which value will be the threshold to convert the `quality` column into a binary target.
    * For example, if you decide that the threshold is 6, then the target will be 1 if the quality is greater than 6 and 0 otherwise.
    * Don't overthink this, just decide a threshold and justify it.
4. Create a new column called `target` with the binary target, considering the threshold you decided.
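
Steps 1 and 4 might look like the sketch below, which reuses the example threshold of 6 from the text. The quality scores are made up, and the threshold you pick should come from your own analysis of the real distribution.

```python
import pandas as pd

# Made-up quality scores standing in for the real column.
wine = pd.DataFrame({"quality": [3, 5, 6, 6, 7, 8]})

# Step 1: inspect which scores actually occur.
print(sorted(wine["quality"].unique()))

# Step 4: binary target, 1 if quality is strictly greater than the threshold.
threshold = 6  # example value from the text, not a recommendation
wine["target"] = (wine["quality"] > threshold).astype(int)

# Class balance is one thing worth checking when justifying the threshold.
print(wine["target"].value_counts(normalize=True))
```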


## Question 4 (1 point total, 0.125 each new feature)

Let's create some new features.

You have to create 8 new columns in the dataframe `wine`. It doesn't matter whether they turn out to be useful; just create them.

Remember, you can use the following operations to create new columns:
* Basic arithmetic operations between columns
* Label encoding of categorical columns
* Binning of numerical columns (convert a continuous variable column into a categorical column)
* Apply a function to a column
...
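
The operations listed above can be sketched as follows. The toy values are made up, the source column names are assumed to match the column list at the top of this document, and the new column names and bin edges are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame; column names assumed to match the dataset description.
wine = pd.DataFrame({
    "free sulfur dioxide": [11.0, 25.0, 15.0],
    "total sulfur dioxide": [34.0, 67.0, 54.0],
    "alcohol": [9.4, 10.8, 12.5],
    "residual sugar": [1.9, 2.6, 20.7],
})

# Basic arithmetic between columns.
wine["free_so2_ratio"] = wine["free sulfur dioxide"] / wine["total sulfur dioxide"]

# Binning a continuous column into categories (bin edges are arbitrary here).
wine["alcohol_bin"] = pd.cut(
    wine["alcohol"], bins=[0, 10, 12, 100], labels=["low", "medium", "high"]
)

# Label encoding of the categorical column created above.
wine["alcohol_bin_code"] = wine["alcohol_bin"].cat.codes

# Applying a function to a column.
wine["log_residual_sugar"] = np.log1p(wine["residual sugar"])

print(wine.head())
```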

## Question 5 (1 point)

Now that we have a target and some new features, we can create a classification model.

But first, we need to remove the `quality` column and split the data into features and target.

1. Remove the `quality` column from the dataframe.
2. Split the data into features and target. Name the features dataframe `x` and the target series `y`.
3. Split the data into training and test sets.
    * Keep in mind the nature of your target, and use stratification if necessary.
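
A minimal sketch of these three steps with scikit-learn's `train_test_split`, on a made-up imbalanced frame. The 80/20 split and the `random_state` value are assumptions, not requirements of the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with an imbalanced binary target (15 zeros, 5 ones).
wine = pd.DataFrame({
    "alcohol": [9.0 + 0.2 * i for i in range(20)],
    "quality": [5] * 15 + [7] * 5,
    "target": [0] * 15 + [1] * 5,
})

# 1. Drop `quality`, since the target was derived from it.
wine = wine.drop(columns="quality")

# 2. Separate features and target.
x = wine.drop(columns="target")
y = wine["target"]

# 3. stratify=y keeps the class proportions equal in both splits.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

Leaving `quality` in the features would leak the target into the model, which is why it is removed before splitting.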

## Question 6 (0.5 points)

Time for scaling the data.

Choose a scaler and scale the features properly: fit the scaler on the training set only, then use it to transform both the training and test sets.
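
As one possible illustration, the sketch below uses `StandardScaler` on tiny made-up arrays. The choice of scaler is an assumption; the point is that the scaler's statistics come from the training data alone, so no information from the test set leaks into preprocessing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices standing in for the real splits.
x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# Fit on the training set only, then reuse the same statistics on the test set.
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled)
print(x_test_scaled)
```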

## Question 7 (0.5 points)

Choose a classification algorithm and initialize it.

## Question 8 (1 point)

Test the model's performance on the test set without any hyperparameter tuning, just to see how it performs.

From now on, the metric used to evaluate the model will be the [F1 score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

Print the F1 score of the model evaluated on the test set.
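
A baseline evaluation could look like the following sketch. The data is synthetic (generated with `make_classification`), and the decision tree is an arbitrary stand-in for whichever classifier you chose in Question 7.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced data standing in for the wine features.
x, y = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=0
)

# Default hyperparameters; random_state only makes the run repeatable.
model = DecisionTreeClassifier(random_state=0)
model.fit(x_train, y_train)

# f1_score expects the true labels first, then the predictions.
f1_baseline = f1_score(y_test, model.predict(x_test))
print(f"Baseline F1: {f1_baseline:.3f}")
```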

## Question 9 (1 point)

Now that we have a baseline to beat, let's try to improve the model's performance and also make it more robust by using `GridSearchCV`.

Create a grid with hyperparameters to be tested and use GridSearchCV to find the best hyperparameters.

Use as many hyperparameters as you want, but test at least 3 different values for each hyperparameter. 

Use no less than 3 folds in GridSearchCV.

Be aware of the time it takes to run GridSearchCV: don't use too many hyperparameters or too many values for each hyperparameter.

Use the grid search fitting time to review your choices above: if fitting takes too long, reduce the grid.
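
The constraints above (at least 3 values per hyperparameter, at least 3 folds, F1 as the metric) can be sketched like this. The decision tree, the grid values, and the synthetic data are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data standing in for the scaled wine features.
x_train, y_train = make_classification(n_samples=300, n_features=8, random_state=0)

# Two hyperparameters with three values each: 9 combinations in total.
param_grid = {
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 5, 10],
}

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",  # same metric as the baseline
    cv=3,          # 9 combinations x 3 folds = 27 fits
)
grid.fit(x_train, y_train)

print(grid.best_params_)
# Mean fit time per combination, in seconds, useful for sizing the grid.
print(grid.cv_results_["mean_fit_time"])
```

The number of fits grows multiplicatively with each added hyperparameter, so inspecting `mean_fit_time` on a small grid first gives a rough estimate before widening it.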


## Question 10 (1 point)

Given the best hyperparameters, train your model with them, and test its performance on the test set.
* Print the best hyperparameters found by GridSearchCV.
* Print the F1 score of the freshly trained model evaluated on the test set.
* Print the improvement in the F1 score compared to the baseline model by using the following relation:

$$ \text{improvement} = \frac{F1_{\text{new}} - F1_{\text{baseline}}}{F1_{\text{baseline}}} $$

* Save the predictions in a CSV file called `predictions.csv`.
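
The improvement relation and the CSV export can be sketched as below. The F1 values and predictions are made up, the helper name `relative_improvement` is hypothetical, and the `prediction` column name is an assumption since the required CSV layout is not specified.

```python
import pandas as pd


def relative_improvement(f1_new: float, f1_baseline: float) -> float:
    """Relative change of the new F1 score with respect to the baseline."""
    return (f1_new - f1_baseline) / f1_baseline


# Made-up scores, only to show the calculation.
f1_baseline = 0.50
f1_new = 0.55
improvement = relative_improvement(f1_new, f1_baseline)
print(f"Improvement: {improvement:.1%}")

# Made-up predictions standing in for best_model.predict(x_test).
predictions = [0, 1, 0, 0, 1]

# index=False avoids writing the row index as an extra column.
pd.DataFrame({"prediction": predictions}).to_csv("predictions.csv", index=False)
```

Because the relation is relative to the baseline, the same absolute gain in F1 counts for more when the baseline score is low.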