# <a id="0">Wine Data Exercises (Part 2) with Pipeline/Column Transformer and Cross Validation</a>

In this notebook, we will review the basic steps of exploratory data analysis, following the example in the EDA-PIPELINE.ipynb notebook. We will work with the wine data set __winequality-white.csv__ provided in the data folder.

__Dataset schema:__ 
   - fixed acidity
   - volatile acidity
   - citric acid
   - residual sugar
   - chlorides
   - free sulfur dioxide
   - total sulfur dioxide
   - density
   - pH
   - sulphates
   - alcohol

   Output variable (based on sensory data): 
   - quality (score between 0 and 10)

In [None]:
import pandas as pd

import warnings
warnings.filterwarnings("ignore")
  
df = pd.read_csv('../data/winequality-white.csv', sep=';')

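We will look at the number of rows and columns using [df.info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) and at some simple statistics of the dataset using [df.describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).

In [None]:
df.info()      # row count, column names, dtypes and non-null counts
df.describe()  # count, mean, std, min, quartiles and max per numerical column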

Create a categorical feature column using `pd.qcut()`, which bins `fixed acidity` into five quantile-based groups.

In [8]:
fixed_acidity_bin_labels_5 = ['poor', 'average', 'ok', 'good', 'best']
df['fixed acidity group'] = pd.qcut(df['fixed acidity'], q=[0, .2, .4, .6, .8, 1], labels=fixed_acidity_bin_labels_5)
df['fixed acidity group'].value_counts()

fixed acidity group
poor       1107
ok         1017
average     984
good        903
best        887
Name: count, dtype: int64

## Data Processing with Pipeline
 
__Part 1.__ Build a pipeline that has two pre-processors

- One handles the numerical features: impute the missing values with the mean using sklearn's SimpleImputer, then scale the features to similar orders of magnitude by bringing them into the 0-1 range with sklearn's MinMaxScaler.

- One encodes the categorical feature. One-hot encoding is a common default, but note that the feature `fixed acidity group` is ordinal, so make sure you choose the most appropriate encoder.

Then add a decision tree estimator to form the full pipeline, and visualize the pipeline.
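
As a starting point, the structure described in Part 1 could be sketched as follows. This is only one possible layout, not a reference solution. It runs on a small synthetic stand-in for the wine data, and the step names (`num_imputer`, `data_processing`, and so on), the choice of `OrdinalEncoder`, and the use of `DecisionTreeClassifier` rather than a regressor are assumptions. The estimator step is named `dt` so that it matches the `dt__` prefix used in the Part 3 grid.

```python
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption: two numerical columns only).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "fixed acidity": rng.uniform(4, 14, n),
    "alcohol": rng.uniform(8, 14, n),
    "quality": rng.integers(3, 9, n),
})
labels = ["poor", "average", "ok", "good", "best"]
# Same quintile binning as in the notebook; cast to plain strings for the encoder.
demo["fixed acidity group"] = pd.qcut(
    demo["fixed acidity"], q=[0, .2, .4, .6, .8, 1], labels=labels
).astype(str)
# Introduce a few missing values so the imputer has something to do.
demo.loc[::25, "alcohol"] = np.nan

numerical_features = ["fixed acidity", "alcohol"]
categorical_features = ["fixed acidity group"]

# Pre-processor 1: mean imputation, then scaling into the 0-1 range.
numerical_processor = Pipeline([
    ("num_imputer", SimpleImputer(strategy="mean")),
    ("num_scaler", MinMaxScaler()),
])

# Pre-processor 2: ordinal encoding that respects the label order.
categorical_processor = Pipeline([
    ("cat_encoder", OrdinalEncoder(categories=[labels])),
])

# Route each group of columns to its own pre-processor.
data_processor = ColumnTransformer([
    ("numerical_processing", numerical_processor, numerical_features),
    ("categorical_processing", categorical_processor, categorical_features),
])

# Full pipeline: pre-processing followed by the decision tree.
pipeline = Pipeline([
    ("data_processing", data_processor),
    ("dt", DecisionTreeClassifier(random_state=0)),
])

X = demo[numerical_features + categorical_features]
y = demo["quality"]
pipeline.fit(X, y)

# In a notebook, this setting renders the pipeline as an interactive diagram
# when `pipeline` is the last expression of a cell.
set_config(display="diagram")
pipeline
```

With the real data, `numerical_features` would list all eleven numerical columns from the schema above, and `demo` would be replaced by `df`.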

__Part 2.__ Test the pipeline on the training data, then on the test data.
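
A minimal sketch of Part 2, under the same assumptions as before: synthetic data instead of the wine file, a hypothetical binary `quality` target derived from `alcohol`, an 80/20 split, and accuracy as the metric. None of these choices come from the exercise itself. The pipeline is fitted on the training split only and then scored on both splits.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption, not the real file).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "fixed acidity": rng.uniform(4, 14, n),
    "alcohol": rng.uniform(8, 14, n),
})
# Hypothetical target: a simple rule on alcohol, so there is a pattern to learn.
demo["quality"] = np.where(demo["alcohol"] > 11, 7, 5)
labels = ["poor", "average", "ok", "good", "best"]
demo["fixed acidity group"] = pd.qcut(
    demo["fixed acidity"], q=5, labels=labels
).astype(str)

numerical_features = ["fixed acidity", "alcohol"]
categorical_features = ["fixed acidity group"]

pipeline = Pipeline([
    ("data_processing", ColumnTransformer([
        ("numerical_processing", Pipeline([
            ("num_imputer", SimpleImputer(strategy="mean")),
            ("num_scaler", MinMaxScaler()),
        ]), numerical_features),
        ("categorical_processing",
         OrdinalEncoder(categories=[labels]), categorical_features),
    ])),
    ("dt", DecisionTreeClassifier(random_state=0)),
])

X = demo[numerical_features + categorical_features]
y = demo["quality"]

# Hold out 20% of the rows; the pipeline never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fitting on the training split also fits the imputer, scaler and encoder there,
# so no statistics from the test split leak into pre-processing.
pipeline.fit(X_train, y_train)

train_acc = accuracy_score(y_train, pipeline.predict(X_train))
test_acc = accuracy_score(y_test, pipeline.predict(X_test))
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
```

A large gap between the two scores is the usual sign that an unconstrained tree is overfitting, which is what the tuning in Part 3 is meant to address.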

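__Part 3.__ Use Grid Search with 5-fold cross validation to tune the hyperparameters of the decision tree estimator. The `dt__` prefix refers to the name of the decision tree step in the pipeline. You may use a grid like this:

        param_grid = {
            'dt__max_depth': [100, 200, 300],        # other values to try: 50, 75, 100, 125, 150, 200, 250
            'dt__min_samples_leaf': [5, 10, 15],     # other values to try: 25, 30
            'dt__min_samples_split': [2, 5, 15]      # other values to try: 25, 30, 45, 50
        }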
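
One way to run this search is with sklearn's `GridSearchCV` and `cv=5`, sketched below. As in the earlier sketches, the data is a synthetic stand-in, the binary `quality` target is hypothetical, and accuracy is an assumed scoring choice. The grid itself is the one given above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine data (assumption, not the real file).
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "fixed acidity": rng.uniform(4, 14, n),
    "alcohol": rng.uniform(8, 14, n),
})
demo["quality"] = np.where(demo["alcohol"] > 11, 7, 5)  # hypothetical target
labels = ["poor", "average", "ok", "good", "best"]
demo["fixed acidity group"] = pd.qcut(
    demo["fixed acidity"], q=5, labels=labels
).astype(str)

numerical_features = ["fixed acidity", "alcohol"]
categorical_features = ["fixed acidity group"]

pipeline = Pipeline([
    ("data_processing", ColumnTransformer([
        ("numerical_processing", Pipeline([
            ("num_imputer", SimpleImputer(strategy="mean")),
            ("num_scaler", MinMaxScaler()),
        ]), numerical_features),
        ("categorical_processing",
         OrdinalEncoder(categories=[labels]), categorical_features),
    ])),
    ("dt", DecisionTreeClassifier(random_state=0)),
])

X = demo[numerical_features + categorical_features]
y = demo["quality"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Keys are "<step name>__<parameter name>", so "dt__" targets the tree step.
param_grid = {
    "dt__max_depth": [100, 200, 300],
    "dt__min_samples_leaf": [5, 10, 15],
    "dt__min_samples_split": [2, 5, 15],
}

# cv=5 evaluates each of the 27 combinations with 5-fold cross validation.
search = GridSearchCV(pipeline, param_grid=param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)

# By default the best combination is refit on the whole training split,
# so the search object can be scored directly on the held-out data.
print("Test accuracy:", search.score(X_test, y_test))
```

Because the whole pipeline is passed to the search, the imputer, scaler and encoder are refitted inside every fold, which keeps the validation folds out of the pre-processing statistics.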
