# Laboratory practice 2.3: Logistic Regression, LDA and GridSearchCV


For this practice, you will need the following datasets:

- **winequality-red.csv**: This dataset is related to red variants of the Portuguese "Vinho Verde" wine.

Columns:
* fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
* volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant vinegar taste.
* citric acid: found in small quantities, citric acid can add 'freshness' and flavor to wines.
* residual sugar: the amount of sugar remaining after fermentation stops.
* chlorides: the amount of salt in the wine.
* free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion.
* total sulfur dioxide: amount of free and bound forms of SO2.
* density: the density of wine is close to that of water, depending on the percent alcohol and sugar content.
* pH: describes how acidic or basic a wine is on a scale.
* sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels.
* quality: 0-6.5 BAD (0); 6.5-10 GOOD(1).

The main package for machine learning in Python is **scikit-learn**.

Further reading:
- [scikit-learn](https://scikit-learn.org) (Machine Learning libraries)

In addition, we will be using the following libraries:
- Data management
    - [numpy](https://numpy.org/) (linear algebra)
    - [pandas](https://pandas.pydata.org/) (data processing, CSV file)

- Plotting
    - [seaborn](https://seaborn.pydata.org/)
    - [matplotlib](https://matplotlib.org/)

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             roc_auc_score, roc_curve,
                             ConfusionMatrixDisplay, classification_report,
                             RocCurveDisplay)
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from feature_engine.selection import DropCorrelatedFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer


import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Inline plotting in the notebook
%matplotlib inline
%config InlineBackend.figure_format = 'svg' # 'png', 'retina', 'jpeg', 'svg', 'pdf'

#### Exercise 1: Data manipulation and preprocessing

1. Encode the quality variable as described in the variable description.
- Name the new variable "y".
- Drop the original variable.

2. Perform an exploratory analysis of the data, highlighting what you consider important.

3. Check for missing values and treat them if necessary

4. Apply any additional manipulations you consider necessary (create new variables, delete constant columns, ...). If you do not consider any to be necessary, skip this step.
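As an illustration of steps 1 and 3, here is a minimal sketch on a small invented DataFrame standing in for `winequality-red.csv` (the values are made up, and only two feature columns are shown). It assumes that quality scores are integers, so the 6.5 threshold separates 6 and below (BAD, 0) from 7 and above (GOOD, 1).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for winequality-red.csv (values invented).
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, np.nan, 11.2, 7.9],
    "sulphates": [0.56, 0.68, 0.65, 0.58, 0.47],
    "quality": [5, 7, 6, 8, 3],
})

# Step 1: encode quality as a binary target, then drop the original column.
df["y"] = (df["quality"] > 6.5).astype(int)
df = df.drop(columns="quality")

# Step 3: count missing values per column.
missing = df.isna().sum()
print(missing)

# Class balance of the new target, useful for the exploratory analysis.
print(df["y"].value_counts(normalize=True))
```

If missing values do appear in the real data, one option is to handle them inside the preprocessing pipeline (for example with an imputer), so that the treatment is learned from the training set only.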

#### Exercise 2: Training the model with logistic regression

1. Separate training and test sets


2. Using the Logistic Regression example practice as a reference:

    * 2.1 Create one list with the continuous variables and another with the categorical variables
    * 2.2 Create a numerical transformer as a Pipeline object that applies the transformations you consider appropriate. More information: https://feature-engine.trainindata.com/en/latest/ or https://scikit-learn.org/stable/data_transforms.html
    (In this case there are no categorical variables, so only a numeric transformer is needed)
    * 2.3 Encapsulate your numeric transformer inside the ColumnTransformer object
    * 2.4 Create a Pipeline object where you join the preprocessing and your classifier
    * 2.5 Create a hyperparameter search dictionary. Idea: you can search inside the C parameter or the penalty parameter.
    * 2.6 Instantiate GridSearchCV with cv = 5, return_train_score = True and scoring = "roc_auc"
    * 2.7 Train the model and interpret the cross-validation results

3. About our model:

* 3.1. What is the best model? What parameters does it have? What is its mean_train_score and its mean_test_score?
    
* 3.2. Do you think the model is overfitted? Can you show a graph justifying it? (hint: you can get the data from cv_results)

4. What are the coefficients of the model? Are they related to the importance of variables?

5. Predictions:
    * 5.1. Store the test-set predictions in two variables: one with the predicted classes and one with the predicted probabilities.
    * 5.2. Is it a good model? Justify it using the confusion matrix and metrics such as recall or ROC.
    * 5.3. Plot the distributions of the predicted probabilities, taking the prediction for class 1 as reference.
    * 5.4 What do you think of the distribution? Do you think the model could be improved? 
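The workflow of this exercise can be sketched as follows. This is only an outline under stated assumptions: the data is synthetic (generated with `make_classification` in place of the wine dataset), the feature names `x0`...`x5`, the step names `preprocessor`, `num` and `clf`, and the grid of `C` values are all illustrative choices, not part of the assignment.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: replace with the real X and y).
X_arr, y = make_classification(n_samples=400, n_features=6, n_informative=6,
                               n_redundant=0, random_state=0)
numeric_features = [f"x{i}" for i in range(X_arr.shape[1])]
X = pd.DataFrame(X_arr, columns=numeric_features)

# 1. Train/test split, stratified to keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2.2-2.4 Numeric transformer -> ColumnTransformer -> full pipeline.
numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
preprocessor = ColumnTransformer(
    transformers=[("num", numeric_transformer, numeric_features)])
pipe = Pipeline(steps=[("preprocessor", preprocessor),
                       ("clf", LogisticRegression(max_iter=1000))])

# 2.5-2.7 Grid search over the regularization strength C.
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, cv=5,
                      return_train_score=True, scoring="roc_auc")
search.fit(X_train, y_train)

# 3. Train vs. validation scores per candidate (input for an overfitting plot).
results = pd.DataFrame(search.cv_results_)[
    ["param_clf__C", "mean_train_score", "mean_test_score"]]
print(search.best_params_)
print(results)

# 4. Coefficients of the best model, on standardized features.
coefs = pd.Series(search.best_estimator_.named_steps["clf"].coef_[0],
                  index=numeric_features)
print(coefs.sort_values())

# 5. Class and probability predictions, plus a few metrics.
y_pred = search.predict(X_test)
y_proba = search.predict_proba(X_test)[:, 1]
cm = confusion_matrix(y_test, y_pred)
recall = recall_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)
print(cm, recall, auc)
```

Plotting `mean_train_score` against `mean_test_score` for each value of `C` is one way to approach question 3.2: a large gap between the two curves suggests overfitting. Because the features are standardized before fitting, the coefficient magnitudes are at least comparable with each other.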

#### Exercise 3: Training a model with LDA
Using the LDA example practice as a reference:

1. Perform the same task as with logistic regression, with the following differences:
    
* 1.1. Our classifier will now be an LDA with n_components = 1
    
* 1.2 Create a hyperparameter search over different options for the solver used by the model.
    
* 1.3 Scatter plot as follows:
     * to_plot_lda = pd.DataFrame(search.transform(X_test), columns=["component1"])
     * to_plot_lda["y_pred"] = list(y_pred)
     * sns.scatterplot(x="component1", y="y_pred", hue="y_pred", data=to_plot_lda).set_title("Components")
     
     Can you interpret it? Why or why not?
    
2. Which model do you consider best? What did you use to compare them?
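A possible outline of the LDA variant, again on synthetic data rather than the wine dataset, is sketched below. The step names `scaler` and `clf` are illustrative. The solver grid is restricted to `"svd"` and `"eigen"` as an assumption of this sketch: scikit-learn's `"lsqr"` solver does not implement `transform`, so the projection onto the component would fail if it were selected as the best model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine data (assumption: replace with the real X and y).
X, y = make_classification(n_samples=400, n_features=5, n_informative=5,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 1.1 Pipeline with an LDA classifier keeping a single discriminant component.
pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("clf", LinearDiscriminantAnalysis(n_components=1))])

# 1.2 Search over solvers that support transform().
param_grid = {"clf__solver": ["svd", "eigen"]}
search = GridSearchCV(pipe, param_grid, cv=5,
                      return_train_score=True, scoring="roc_auc")
search.fit(X_train, y_train)

# 1.3 Project the test set onto the discriminant component.
y_pred = search.predict(X_test)
to_plot_lda = pd.DataFrame(search.transform(X_test), columns=["component1"])
to_plot_lda["y_pred"] = list(y_pred)

# 2. Same metric as used for logistic regression, for a like-for-like comparison.
auc_lda = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, auc_lda)
```

With two classes, LDA yields at most one discriminant component, so the scatter plot is essentially one-dimensional: the predicted class changes at a single threshold along `component1`. Comparing the test ROC AUC of both models, computed on the same split, is one reasonable basis for question 2.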