# Data Science Task 2
Data preprocessing and modeling

## Imports

In [16]:
import sys
import os
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, MinMaxScaler
from sklearn.model_selection import GridSearchCV, train_test_split
import warnings

# load environment variables
load_dotenv()

#add working directory to sys path to execute utils/dataset.py
working_dir = os.environ.get("WORKING_DIRECTORY")
sys.path.insert(0, working_dir)

from utils.dataset import get_data 
from utils.pipeline_moduls import fs_colinearity, fs_vif, outlier_label, outlier_num, dim_reduction
warnings.filterwarnings('ignore')

df = get_data()


df.head(10)


Loading data from wines: 0it [00:00, ?it/s]

Loading data from wines: 8000it [00:00, 19259.53it/s]


Unnamed: 0,wine type,fixed acidity,volatile acidity,citric acid,residual sugar,magnesium,flavanoids,minerals,calcium,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,Pinot noir,5.8,0.15,0.49,1.1,76.729301,894.94,186.639301,109.91,0.048,21.0,98.0,0.9929,3.19,0.48,9.2,5
1,Merlot,6.6,0.25,0.32,5.6,4.795712,1160.95,251.875712,247.08,0.039,15.0,68.0,0.99163,2.96,0.52,11.1,6
2,Chardonnay,6.7,0.21,0.34,1.5,85.19371,789.82,304.70371,219.51,0.035,45.0,123.0,0.98949,3.24,0.36,12.6,7
3,Merlot,8.3,0.28,0.27,17.5,11.976525,777.86,237.586525,225.61,0.045,48.0,253.0,1.00014,3.02,0.56,9.1,6
4,Merlot,7.5,0.42,0.19,6.9,5.599673,785.72,95.399673,89.8,0.041,62.0,150.0,0.99508,3.23,0.37,10.0,6
5,Merlot,7.3,0.34,0.3,1.3,22.403749,1044.95,289.523749,267.12,0.057,25.0,173.0,0.9948,3.26,0.51,9.1,6
6,Merlot,7.6,0.21,0.49,2.5,23.875866,888.61,133.545866,109.67,0.047,20.0,130.0,0.99178,3.15,0.48,11.1,5
7,Chardonnay,6.0,0.25,0.4,5.7,23.309699,1381.79,266.529699,243.22,0.052,56.0,152.0,0.99398,3.16,0.88,10.5,6
8,Cabernet Sauvignon,6.7,0.18,0.19,4.7,49.165745,1456.41,269.915745,220.75,0.046,57.0,161.0,0.9946,3.32,0.66,10.5,6
9,Gamay,7.7,0.28,0.39,8.9,54.450579,929.44,377.690579,323.24,0.036,8.0,117.0,0.9935,3.06,0.38,12.0,2


## Preprocessing Pipeline Setup
Pipeline for missing value imputation, outlier detection and imputation, feature scaling, feature encoding, feature selection and label outlier removal.

### Transform custom functions into pipeline steps
We define custom functions to handle the data and wrap them as pipeline steps with `FunctionTransformer`.

#### Label Outlier Detection
Function that removes rows whose label is an outlier (quality > 10).

In [17]:
outlier_detection_label = FunctionTransformer(outlier_label).set_output(transform="pandas")
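
A minimal sketch of what a label filter like `outlier_label` might look like. The real implementation lives in `utils/pipeline_moduls.py` and is not shown in this notebook, so the function body and the name `drop_label_outliers` are assumptions. The column name `label__quality` is the one the combined pipeline produces further down.

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def drop_label_outliers(df: pd.DataFrame, label_col: str = "label__quality",
                        max_quality: int = 10) -> pd.DataFrame:
    """Hypothetical stand-in for outlier_label: drop rows whose label exceeds max_quality."""
    return df[df[label_col] <= max_quality].reset_index(drop=True)


# Toy frame with one impossible quality value (42)
toy = pd.DataFrame({"num__alcohol": [9.2, 11.1, 12.6], "label__quality": [5, 42, 7]})

label_filter = FunctionTransformer(drop_label_outliers)
filtered = label_filter.fit_transform(toy)
print(filtered)
```

Because features and label travel in the same DataFrame, dropping a row here removes the label and its features together.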

#### Feature Outlier Detection

We use the IQR to detect outliers in the features and impute them with the best-suited method (mean or median):
- median for normally distributed features, mean for skewed and uniformly distributed features
- we compared different approaches to imputing the outliers and found that their impact on the prediction results is negligible (tested for decision tree and linear regression)

In [4]:
outlier_detection = FunctionTransformer(outlier_num).set_output(transform="pandas")
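
The IQR rule described above can be sketched as follows. This is not the code behind `outlier_num` (which is imported from `utils` and not shown); the helper name `impute_iqr_outliers`, the factor `k=1.5` and the toy values are assumptions for illustration.

```python
import pandas as pd


def impute_iqr_outliers(values: pd.Series, strategy: str = "median", k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with the mean or median of the inliers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    inliers = values[~is_outlier]
    fill = inliers.median() if strategy == "median" else inliers.mean()
    # Keep inliers as they are, overwrite outliers with the fill value
    return values.where(~is_outlier, fill)


toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
cleaned = impute_iqr_outliers(toy, strategy="median")
print(cleaned.tolist())
```

Computing the fill value from the inliers only keeps the extreme value from distorting its own replacement, which matters most when the mean is used.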

#### Scaling the numeric features


Scaling is useful and necessary for the following models:
- KNN
- ANN
- SVM

Scaling is not useful and does not improve results significantly for the following models:
- Random Forest
- Gradient Boosting
- Linear Regression


#### Results without scaling vs with best scaling method:
| Model | MSE | R² |
| ----------- | ----------- | ----------- |
| Random Forest | -1.46% | +1.1% |
| Gradient Boosting | +4.8% | -2.9% |
| Linear Regression | -0.5% | +1.5% |
| KNN* | -∞ | +∞ |
| ANN* | -∞ | +∞ |
| SVM* | -∞ | +∞ |


*Without scaling, all three had an error > 1 and R² < 0.

The different scaling methods have no significant impact on the results for most models.

For KNN, however, MinMaxScaler outperformed the other scalers.

#### MSE:
```
MinMaxScaler:   0%
StandardScaler: +74.5%
RobustScaler:   +62.1%
```
#### R²:
```
MinMaxScaler:   0%
StandardScaler: -68.4%
RobustScaler:   -52.6%
```
**MinMaxScaler** is reliable and fast, so it is the best choice for scaling.

SVMs are not usable without scaling the data, because training takes too long.
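
To illustrate why distance-based models such as KNN depend on scaling, here is a small sketch using `flavanoids` and `chlorides` values from the first rows of the dataset. Min-max scaling maps each column to [0, 1] via `x' = (x - min) / (max - min)`; the manual computation is only there to show the formula.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales: column 0 = flavanoids, column 1 = chlorides
X = np.array([[894.94, 0.048],
              [1160.95, 0.039],
              [789.82, 0.035]])

# Manual min-max scaling, computed per column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

scaled = MinMaxScaler().fit_transform(X)

# Before scaling, the Euclidean distance is dominated by the flavanoids column
dist_raw = np.linalg.norm(X[0] - X[1])
dist_scaled = np.linalg.norm(scaled[0] - scaled[1])
print(dist_raw, dist_scaled)
```

After scaling, both columns contribute to the distance on a comparable footing, so a neighbour search no longer ignores the small-valued feature.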

#### Polynomial Feature Transformation
Polynomial feature transformation did not improve the results.
Deviation from the best result:\
`+1.16% MSE (Random Forest)`\
`-1.24% R² (Random Forest)`

#### Dimensionality Reduction (PCA) 
PCA did not improve the results.
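
A sketch of how PCA reduces dimensionality on synthetic data. The configuration used in the experiments is not given here, so the variance target of 0.99 and the toy data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# The third column is almost a linear combination of the first two,
# so it carries very little additional variance
X = np.column_stack([base, base[:, 0] + base[:, 1] + 0.01 * rng.normal(size=200)])

# A float n_components keeps the smallest number of components
# whose cumulative explained variance reaches that share
pca = PCA(n_components=0.99)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```

The resulting components are linear mixtures of the original columns, so the reduced features lose their direct interpretation.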

In [5]:
scaler_minmax = MinMaxScaler()

#### Feature Selection

- Identify collinear features (`fs_colinearity`) based on a threshold for the collinearity (`colinearity_threshold`).
- Identify features with high multicollinearity (`fs_vif`) based on a threshold for the variance inflation factor (VIF) (`vif_threshold`).
- Add the features identified by the two functions above to `dropped_features` if they fall below the correlation threshold (`correlation_threshold`).

```python
def fs_colinearity(df, colinearity_threshold=0.5, correlation_threshold=0.1):
    ...
    return dropped_features


def fs_vif(df, correlation_threshold=0.1, vif_threshold=5):
    ...
    return dropped_features
 
```
- Save the list of dropped features to a JSON file named "dropped_features.json".

In [6]:
import json

def feature_selection(df, colinearity_threshold=0.5, correlation_threshold=0.1, vif_threshold=5):
    # Collect dropped features in a set so that features flagged by both functions appear only once
    dropped_features_set = set()

    # Add elements from fs_colinearity to dropped_features_set
    dropped_features_set.update(fs_colinearity(df, colinearity_threshold, correlation_threshold))

    # Add elements from fs_vif to dropped_features_set
    dropped_features_set.update(fs_vif(df, correlation_threshold, vif_threshold))

    # Convert dropped_features_set back to a list
    dropped_features = list(dropped_features_set)
    print("Dropping Features: ", dropped_features)
    # Drop the features in dropped_features from the DataFrame
    df = df.drop(columns=dropped_features)

    # Save the dropped features list to a JSON file (the 7000-sample run gets its own location)
    if df.shape[0] > 7000:
        with open('dropped_features.json', 'w') as f:
            json.dump(dropped_features, f)
    else:
        with open('./models/7000_samples/dropped_features.json', 'w') as f:
            json.dump(dropped_features, f)

    return df
#wrap the function as a pipeline step (this rebinds the name feature_selection to the transformer)
feature_selection = FunctionTransformer(feature_selection).set_output(transform="pandas")
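
The bodies of `fs_colinearity` and `fs_vif` are elided above, so the following is only a sketch of how a VIF check can be computed, not the project's implementation. The helper name `vif_scores` and the synthetic columns are assumptions. The formula is the same one that appears in the training output as `vif = 1. / (1. - r_squared_i)`. In the rows displayed by `df.head(10)`, `minerals` equals `magnesium + calcium`, and an exact linear dependence like that drives R² to 1 and the VIF towards infinity.

```python
import numpy as np
import pandas as pd


def vif_scores(features: pd.DataFrame) -> pd.Series:
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing feature i on all other features."""
    scores = {}
    for col in features.columns:
        y = features[col].to_numpy(dtype=float)
        others = features.drop(columns=col).to_numpy(dtype=float)
        design = np.column_stack([np.ones(len(y)), others])  # intercept column plus other features
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        r_squared = 1.0 - residual.var() / y.var()
        scores[col] = 1.0 / (1.0 - r_squared)
    return pd.Series(scores)


rng = np.random.default_rng(0)
toy = pd.DataFrame({"calcium": rng.normal(size=500), "alcohol": rng.normal(size=500)})
# "minerals" is nearly a linear function of "calcium", mimicking a multicollinear pair
toy["minerals"] = 2.0 * toy["calcium"] + 0.1 * rng.normal(size=500)

vif = vif_scores(toy)
print(vif[vif > 5].index.tolist())
```

The threshold of 5 mirrors the default `vif_threshold=5` in the signatures above. Which member of a collinear pair gets dropped is a separate decision that this sketch does not make.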

## Pipeline Building

- The pipeline is built from so-called sub-pipelines to cleanly separate numerical features, categorical features and the label.
- Missing values are always imputed, using the mean (for numerical features) or the most frequent value (for categorical features).

In [7]:
cleaning_pipeline = Pipeline(steps=[
])

cleaning_pipeline_scaled = Pipeline(steps=[
])

### Sub-Pipeline: Categorical Features
This sub-pipeline handles the categorical feature "wine type".

We use one-hot encoding to transform the categorical feature into binary features.

In [8]:
categorical_imputer = SimpleImputer(strategy="most_frequent").set_output(transform="pandas")

#pipeline for categorical features
categorical_pipeline = Pipeline(steps=[])
categorical_pipeline.steps.append(('imputer', categorical_imputer))
categorical_pipeline.steps.append(('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False).set_output(transform="pandas")))


### Sub-Pipeline: Numerical Features
This sub-pipeline handles the numerical features such as "fixed acidity", "volatile acidity" or "citric acid".

In [9]:
#pipeline for numerical features
numeric_pipeline = Pipeline(steps=[])
numerical_imputer = SimpleImputer(strategy="mean").set_output(transform="pandas")

numeric_pipeline.steps.append(('imputer', numerical_imputer))
numeric_pipeline.steps.append(('outlier_detection', outlier_detection))

#pipeline_scaled for numerical features
numeric_pipeline_scaled = Pipeline(steps=[])

numeric_pipeline_scaled.steps.append(('imputer', numerical_imputer))
numeric_pipeline_scaled.steps.append(('outlier_detection', outlier_detection))
numeric_pipeline_scaled.steps.append(('scaler', scaler_minmax))

### Sub-Pipeline: Label
This pipeline handles the label "quality" by just passing it through (label outlier detection happens later).

In [10]:
#pipeline for label
label_pipeline = Pipeline(steps=[])
#generate pass through function
pass_through = FunctionTransformer().set_output(transform="pandas")
label_pipeline.steps.append(('do_nothing', pass_through))


## Building the god pipeline (combining all steps)
This step builds the column transformers (pipeline branches) and combines them into one pipeline.

After that, it adds the label outlier detection step and the feature selection step.

Advantages of our approach:
- Clean separation between numeric, categorical and label columns.
- This allows effective data cleaning and preprocessing to handle missing values, perform scaling, or encode categorical features.
- Label outlier detection helps identify and, if necessary, remove outliers in the target variable to improve model performance.

### Split Features and Label

In [11]:
categorical_features = df.select_dtypes(include=['object']).columns
numerical_features = df.select_dtypes(include=[np.number]).columns
#drop 'quality' from the numerical features (it is the label, not a feature)
numerical_features = numerical_features.drop('quality')
label = pd.Series('quality')

### Combining all steps

#### Pipeline Setup (non-scaled)

In [18]:
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features),
        ('label', label_pipeline, label)
    ]).set_output(transform="pandas")
cleaning_pipeline.steps.append(('preprocessor', preprocessor))
cleaning_pipeline.steps.append(("outlier_detection_label", outlier_detection_label))
cleaning_pipeline.steps.append(('feature_selection', feature_selection))

<img src="https://github.com/kevin-eberhardt/data-science/blob/4b5a1c8d57100ab2db31598532192e3ea6502756/figures/cleaning_pipeline.png?raw=true" alt="drawing" width="100%"/>

#### Pipeline Setup (scaled)

In [13]:
preprocessor_scaled = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline_scaled, numerical_features),
        ('cat', categorical_pipeline, categorical_features),
        ('label', label_pipeline, label)
    ]).set_output(transform="pandas")
cleaning_pipeline_scaled.steps.append(('preprocessor', preprocessor_scaled))
cleaning_pipeline_scaled.steps.append(("outlier_detection_label", outlier_detection_label))
cleaning_pipeline_scaled.steps.append(('feature_selection', feature_selection))

<img src="https://github.com/kevin-eberhardt/data-science/blob/4b5a1c8d57100ab2db31598532192e3ea6502756/figures/cleaning_pipeline_scaled.png?raw=true" alt="drawing" width="100%"/>

---

## Training Pipeline

#### Hyperparameters

- Hyperparameters are tuned to adapt the models to the specific requirements of the data.
- The goal is to optimize model performance.
- In addition, choosing the right hyperparameters should avoid overfitting or underfitting by achieving an appropriate balance between variance and bias.
- This is done with the grid search approach.
- We also use cross-validation to avoid overfitting (`GridSearchCV`).

In [14]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor

models = [
    {
        "name": "LinearRegression",
        "estimator": LinearRegression(),
        "hyperparameters":
            {
                "fit_intercept": [True, False],
                "copy_X": [True, False]
            },
        "scalable": 0
    },
    {
        "name": "DecisionTreeRegressor",
        "estimator": DecisionTreeRegressor(),
        "hyperparameters":
            {
                "criterion": ["squared_error", "friedman_mse"],
                "splitter": ["best", "random"],
                "max_depth": [None, 2, 5, 10],
                "min_samples_split": [2, 5, 10],
                "min_samples_leaf": [1, 5, 10]
            },
        "scalable": 0
    },
    {
        "name": "RandomForestRegressor",
        "estimator": RandomForestRegressor(),
        "hyperparameters":
            {
                "n_estimators": [100, 200],
                "criterion": ["squared_error", "friedman_mse"],
                "max_depth": [None, 2, 5, 10],
                "min_samples_split": [2, 5, 10],
                "min_samples_leaf": [1, 5, 10]
            },
        "scalable": 0
    },
    {
        "name": "Gradient Boosting Regressor",
        "estimator": GradientBoostingRegressor(),
        "hyperparameters":
        {       
                "n_estimators": [100, 200, 500],
                "max_depth": [None, 3, 5, 10],
                "min_samples_split": [2, 5, 10],
                "learning_rate": [0.1, 0.05, 0.001],
                "loss": ['squared_error', 'absolute_error', 'huber'],
        },
        "scalable": 0
    }, 
     {
        "name": "Support Vector Machine",
        "estimator": SVR(),
        "hyperparameters": {
            "C": [1, 10, 100],
            "kernel": ["rbf", "linear", "poly"]
        },
        "best_score": 0.5567442927702857,
        "scalable": 1
    },
    {
        "name": "ANN",
        "estimator": MLPRegressor(),
        "hyperparameters": {
            'activation': ['identity', 'logistic', 'tanh', 'relu'],
            'alpha': [0.0001, 0.001, 0.01],
            'hidden_layer_sizes': [(100, ),(100, 50), (100, 50, 25), (100, 75, 50, 25)],
            'learning_rate': ['constant', 'invscaling', 'adaptive'],
            'solver': ['adam', 'lbfgs']
        },
        "best_score": 0.5567442927702857,
        "scalable": 1
    }, 
    {
        "name": "KNN",
        "estimator": KNeighborsRegressor(),
        "hyperparameters": {
            'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
            'weights' : ['uniform', 'distance']
        },
        "best_score": 0.5567442927702857,
        "scalable": 1
    }
]
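
To get a feeling for the cost of these grids, the number of candidates can be counted with `ParameterGrid` before any training starts. The sketch below copies three of the grids from the `models` list; `cv_folds = 10` matches the `cv=10` used in the training cells.

```python
from sklearn.model_selection import ParameterGrid

# Grids copied from the `models` list above
grids = {
    "DecisionTreeRegressor": {
        "criterion": ["squared_error", "friedman_mse"],
        "splitter": ["best", "random"],
        "max_depth": [None, 2, 5, 10],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 5, 10],
    },
    "Gradient Boosting Regressor": {
        "n_estimators": [100, 200, 500],
        "max_depth": [None, 3, 5, 10],
        "min_samples_split": [2, 5, 10],
        "learning_rate": [0.1, 0.05, 0.001],
        "loss": ["squared_error", "absolute_error", "huber"],
    },
    "ANN": {
        "activation": ["identity", "logistic", "tanh", "relu"],
        "alpha": [0.0001, 0.001, 0.01],
        "hidden_layer_sizes": [(100,), (100, 50), (100, 50, 25), (100, 75, 50, 25)],
        "learning_rate": ["constant", "invscaling", "adaptive"],
        "solver": ["adam", "lbfgs"],
    },
}

cv_folds = 10
counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}

for name, n_candidates in counts.items():
    print(f"{name}: {n_candidates} candidates, {n_candidates * cv_folds} fits with {cv_folds}-fold CV")
```

Every candidate is fitted once per fold, plus one final refit of the best candidate, so large grids on slow estimators dominate the total runtime.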

## Model Training with Grid Search CV

- A cleaned version of the input dataset is created by applying `cleaning_pipeline` or `cleaning_pipeline_scaled`, depending on the model's `scalable` flag
- The cleaned dataset is split into training and test data (`train_test_split`) with a test size of 20%
- A grid search is performed to find the best hyperparameters for the model by trying different combinations of the hyperparameters
- The best parameters (`grid.best_params_`), best score (`grid.best_score_`), and test score (`grid.score(X_test, y_test)`) are printed
- The best models are written to a JSON file (best_models.json), and each best estimator is saved with pickle
- The goal here is to automate model training, hyperparameter optimization and scoring
- Grid search makes it possible to find the best hyperparameters for each model and to get an objective evaluation of the model performance
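
A note on how to read the scores: no `scoring` argument is passed to `GridSearchCV`, so it falls back to the estimator's own `score` method, which is R² for scikit-learn regressors. `best_score_` is the mean cross-validated score of the best candidate. The sketch below checks both statements on synthetic data; the dataset and the small KNN grid are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]}, cv=5)
grid.fit(X_train, y_train)

# R² of the refitted best estimator on the held-out test split
test_r2 = r2_score(y_test, grid.predict(X_test))
# Mean validation score of the winning candidate across the folds
best_cv_mean = grid.cv_results_["mean_test_score"][grid.best_index_]

print(grid.best_params_, grid.best_score_, grid.score(X_test, y_test))
```

The "Best Score" and "Test Score" values printed by the training cells are therefore both R² values, the first averaged over the validation folds and the second measured on the 20% test split.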

In [15]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import pickle


best_models = []

def god_function(dirty_df):
    for model in models:
        print(model["name"])
        print("-"*len(model["name"]))
        if model["scalable"] is not None:
            if model["scalable"] == 0:
                clean_df = pd.DataFrame(cleaning_pipeline.fit_transform(dirty_df))
            if model["scalable"] == 1:
                clean_df = pd.DataFrame(cleaning_pipeline_scaled.fit_transform(dirty_df))
        X = clean_df.drop('label__quality', axis=1)
        y = clean_df['label__quality']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
        grid = GridSearchCV(model["estimator"], model["hyperparameters"], cv=10, n_jobs=-1)
        grid = grid.fit(X_train, y_train)
        print("Best Parameters:")
        print(grid.best_params_)
        print("")
        print("Best Score:", grid.best_score_, "\t", "Test Score:", grid.score(X_test, y_test))
        print("Fit Time:", grid.refit_time_)
        print("")
        m = {
            "name": model["name"],
            "best_params": grid.best_params_,
            "best_score": grid.best_score_,
            "fit_time": grid.refit_time_,
            "test_score":  grid.score(X_test, y_test)
        }
        best_models.append(m)
        
        #save best models to json
        with open("./models/best_models.json", "w") as f:
            json.dump(best_models, f, indent=4)
            
        #save best estimator from grid with pickle
        with open("./models/" + model["name"] + '.pkl', 'wb') as f:
            pickle.dump(grid.best_estimator_, f)
        
god_function(df)

LinearRegression
----------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__calcium', 'num__free sulfur dioxide', 'num__minerals', 'num__residual sugar', 'cat__wine type_Pinot noir']
Best Parameters:
{'copy_X': True, 'fit_intercept': True}

Best Score: 0.5343677291345037 	 Test Score: 0.5349169536244307
Fit Time: 0.008506059646606445

DecisionTreeRegressor
---------------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__calcium', 'num__free sulfur dioxide', 'num__minerals', 'num__residual sugar', 'cat__wine type_Pinot noir']


KeyboardInterrupt: 

### Training with 7000 samples

In [16]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import pickle


#hold out 1000 rows for later testing
sample_testing = df.sample(1000, random_state=42)

#keep the remaining 7000 rows for training
df = df.drop(sample_testing.index)


best_models = []

def god_function_7000(dirty_df):
    for model in models:
        print(model["name"])
        print("-"*len(model["name"]))
        if model["scalable"] is not None:
            if model["scalable"] == 0:
                clean_df = pd.DataFrame(cleaning_pipeline.fit_transform(dirty_df))
            if model["scalable"] == 1:
                clean_df = pd.DataFrame(cleaning_pipeline_scaled.fit_transform(dirty_df))
        X = clean_df.drop('label__quality', axis=1)
        y = clean_df['label__quality']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
        grid = GridSearchCV(model["estimator"], model["hyperparameters"], cv=10, n_jobs=-1)
        grid = grid.fit(X_train, y_train)
        print("Best Parameters:")
        print(grid.best_params_)
        print("")
        print("Best Score:", grid.best_score_, "\t", "Test Score:", grid.score(X_test, y_test))
        print("Fit Time:", grid.refit_time_)
        print("")
        m = {
            "name": model["name"],
            "best_params": grid.best_params_,
            "best_score": grid.best_score_,
            "fit_time": grid.refit_time_,
            "test_score":  grid.score(X_test, y_test)
        }
        best_models.append(m)

        #save best models to json
        with open("./models/best_models_7000_samples.json", "w") as f:
            json.dump(best_models, f, indent=4)
            
        #save best estimator from grid with pickle
        with open("./models/7000_samples/" + model["name"] + '.pkl', 'wb') as f:
            pickle.dump(grid.best_estimator_, f)
        

god_function_7000(df)

LinearRegression
----------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']
Best Parameters:
{'copy_X': True, 'fit_intercept': True}

Best Score: 0.5361524630669312 	 Test Score: 0.5296181819203758
Fit Time: 0.011995553970336914

DecisionTreeRegressor
---------------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']
Best Parameters:
{'criterion': 'squared_error', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}

Best Score: 0.7486894809833757 	 Test Score: 0.7842947045380559
Fit Time: 0.06297540664672852

RandomForestRegressor
---------------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']
Best Parameters:
{'criterion': 'friedman_mse', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

Best Score: 0.8367215196210633 	 Test Score: 0.8613304165569273
Fit Time: 13.91865348815918

Gradient Boosting Regressor
---------------------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']
Best Parameters:
{'learning_rate': 0.05, 'loss': 'absolute_error', 'max_depth': None, 'min_samples_split': 2, 'n_estimators': 500}

Best Score: 0.870339692351329 	 Test Score: 0.882921285695199
Fit Time: 242.45490026474

Support Vector Machine
----------------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']
Best Parameters:
{'C': 100, 'kernel': 'rbf'}

Best Score: 0.6319361251091907 	 Test Score: 0.6514476148866453
Fit Time: 42.292750120162964

ANN
---


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']




Best Parameters:
{'activation': 'relu', 'alpha': 0.01, 'hidden_layer_sizes': (100, 75, 50, 25), 'learning_rate': 'constant', 'solver': 'adam'}

Best Score: 0.6160400854722539 	 Test Score: 0.6257465486672892
Fit Time: 44.91675162315369

KNN
---


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']
Best Parameters:
{'n_neighbors': 15, 'weights': 'distance'}

Best Score: 0.8258640465632032 	 Test Score: 0.8530946418730501
Fit Time: 0.03298830986022949



## Evaluating best features based on p-values


Approach to identify the best features based on p-values computed from the fitted model.

Tested for Linear Regression and Decision Tree

In [17]:
from scipy import stats
def calculate_significant_features(X, y, model):
    coefficients = model.coef_
    intercept = model.intercept_


    residuals = y - model.predict(X)

    n = len(y)
    p = X.shape[1]
    df = n - p - 1

    mse = np.sum(residuals ** 2) / df
    variance_covariance_matrix = mse * np.linalg.inv(np.dot(X.T, X))
    standard_errors = np.sqrt(np.diagonal(variance_covariance_matrix))


    t_values = coefficients / standard_errors
    p_values = 2 * (1 - stats.t.cdf(np.abs(t_values), df))

    headers = ['Feature', 'Coefficient', 'Standard Error', 't-value', 'p-value']

    prediction_metrics = pd.DataFrame(columns=headers)
    for i in range(len(coefficients)):
        prediction_metrics.loc[i] = [X.columns.values[i], coefficients[i], standard_errors[i], t_values[i], p_values[i]]

    #remove rows with p-value > 0.05 (a p-value of exactly 0.05 is kept)
    features_to_remove = prediction_metrics[prediction_metrics['p-value'] > 0.05]['Feature'].values
    print("Removing features: ", features_to_remove)
    prediction_metrics = prediction_metrics[prediction_metrics['p-value'] <= 0.05]
    return prediction_metrics
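
The function above builds the covariance matrix from `X.T @ X` without an intercept column. When the model is fitted with `fit_intercept=True`, the textbook OLS standard errors use a design matrix that also contains a column of ones. A sketch of that variant on synthetic data follows; the helper name `ols_p_values` and the toy data are assumptions, and this is not meant as a replacement for the code used to produce the results in this notebook.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression


def ols_p_values(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Two-sided t-test p-values for the slope coefficients of an OLS fit with intercept."""
    model = LinearRegression().fit(X, y)
    n, p = X.shape
    dof = n - p - 1
    # The design matrix includes a column of ones because the model fits an intercept
    design = np.column_stack([np.ones(n), X])
    residuals = y - model.predict(X)
    sigma2 = residuals @ residuals / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_err = np.sqrt(np.diag(cov))[1:]  # skip the intercept entry
    t_values = model.coef_ / std_err
    # sf is the survival function (1 - cdf), which is more precise for tiny p-values
    return 2 * stats.t.sf(np.abs(t_values), dof)


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# Only the first feature influences the target; the second is pure noise
y = 3.0 * X[:, 0] + rng.normal(size=500)

p_values = ols_p_values(X, y)
print(p_values)
```

This only applies to models that expose `coef_`. Tree-based models have no coefficients, so they need a different importance measure.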

In [18]:
test_models = [{
        "name": "LinearRegression",
        "estimator": LinearRegression(),
        "hyperparameters":
            {
                "fit_intercept": [True, False],
                "copy_X": [True, False],
                "n_jobs": [-1]
            }
    }
    ]
dirty_df = df.copy(deep=True)
for model in test_models:
    print(model["name"])
    print("-"*len(model["name"]))
    pipeline = cleaning_pipeline
    #pipeline.steps.pop(2)
    clean_df = pd.DataFrame(pipeline.fit_transform(dirty_df))
    X = clean_df.drop('label__quality', axis=1)
    y = clean_df['label__quality']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
    grid = GridSearchCV(model["estimator"], model["hyperparameters"], cv=5, n_jobs=-1)
    grid = grid.fit(X_train, y_train)
    print("Best Parameters:")
    print(grid.best_params_)
    print("")
    print("Best Score:", grid.best_score_, "\t", "Test Score:", grid.score(X_test, y_test))
    print("Fit Time:", grid.refit_time_)
    print("")
    best_model = grid.best_estimator_
    significant_features = calculate_significant_features(X_train, y_train, best_model)
    #keep columns of X only if they are present in significant_features
    X_train = X_train[significant_features['Feature'].values]
    X_test = X_test[significant_features['Feature'].values]
    grid = GridSearchCV(model["estimator"], model["hyperparameters"], cv=5, n_jobs=-1)
    grid = grid.fit(X_train, y_train)
    print("Best Parameters:")
    print(grid.best_params_)
    print("")
    print("Best Score:", grid.best_score_, "\t", "Test Score:", grid.score(X_test, y_test))
    print("Fit Time:", grid.refit_time_)
    print("")


LinearRegression
----------------


  vif = 1. / (1. - r_squared_i)


Dropping Features:  ['num__free sulfur dioxide', 'num__residual sugar', 'cat__wine type_Pinot noir', 'num__calcium', 'num__minerals']
Best Parameters:
{'copy_X': True, 'fit_intercept': True, 'n_jobs': -1}

Best Score: 0.537981808702483 	 Test Score: 0.5296181819203758
Fit Time: 0.010994672775268555

Removing features:  ['num__citric acid' 'num__flavanoids' 'num__sulphates'
 'cat__wine type_Cabernet Sauvignon' 'cat__wine type_Chardonnay']
Best Parameters:
{'copy_X': True, 'fit_intercept': True, 'n_jobs': -1}

Best Score: 0.5382367523551554 	 Test Score: 0.5293169775836969
Fit Time: 0.00799560546875

