# Outline

**Session 1**

- Review of the basic ML workflow
- Encoding categorical variables
- Using ColumnTransformer and Pipeline
- Encoding Text Data

**Session 2**

- Handling missing values
- Switching to the full dataset
- Recap
- Evaluating and tuning a pipeline

Starter Code (copy from here: http://bit.ly/first-ml-lesson)

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [2]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']

In [3]:
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']

In [4]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]

In [5]:
ohe = OneHotEncoder()
vect = CountVectorizer()

In [6]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')

In [7]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

In [8]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

# Part 5 - Handling Missing Values

In [9]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [46]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

In [11]:
X = df[cols]
X

Unnamed: 0,Parch,Fare,Embarked,Sex,Name,Age
0,0,7.25,S,male,"Braund, Mr. Owen Harris",22.0
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,0,7.925,S,female,"Heikkinen, Miss. Laina",26.0
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,0,8.05,S,male,"Allen, Mr. William Henry",35.0
5,0,8.4583,Q,male,"Moran, Mr. James",
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J",54.0
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard",2.0
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)",14.0


- The pipe.fit() step fails with an error because the *Age* column contains NaNs.
- Most scikit-learn models don't accept data with missing values. A notable exception is **Histogram-Based Gradient Boosting Trees** (*HistGradientBoostingClassifier* and *HistGradientBoostingRegressor*), available for both classification and regression.
- It explicitly handles missing values without imputing them. So for this kind of model you can pass in data with missing values and it will handle them for you.

In [12]:
#pipe.fit(X,y)

- **X.dropna()** drops all the rows that have missing values.
- This might be a good option under two conditions:
  - If you know that you will be losing only a very small amount of data by doing this.
  - If the values are missing completely at random.
- But in the vast majority of cases this is not an ideal solution.
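One way to check the first condition is to measure how much data would be lost before committing to the drop. The sketch below uses a handful of made-up rows (only the column names follow the Titanic data) and drops incomplete rows before splitting into features and target, so the two stay aligned.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration (column names follow the Titanic data)
df_demo = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
    'Survived': [0, 1, 1, 1, 0],
})

# Drop incomplete rows on the full frame, before splitting into X and y,
# so the features and the target keep the same rows
df_complete = df_demo.dropna(subset=['Age', 'Fare'])
fraction_lost = 1 - len(df_complete) / len(df_demo)

X_demo = df_complete[['Age', 'Fare']]
y_demo = df_complete['Survived']
print(f'{fraction_lost:.0%} of rows dropped')
```

If the printed share is large, dropping rows is probably the wrong choice and imputation is worth considering instead.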

In [13]:
# Do this before splitting into X and y, so that the matching rows of the target column are dropped as well.
X.dropna()

Unnamed: 0,Parch,Fare,Embarked,Sex,Name,Age
0,0,7.25,S,male,"Braund, Mr. Owen Harris",22.0
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,0,7.925,S,female,"Heikkinen, Miss. Laina",26.0
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,0,8.05,S,male,"Allen, Mr. William Henry",35.0
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J",54.0
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard",2.0
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)",14.0


In [14]:
# To completely drop every column that has missing values
X.dropna(axis='columns')

Unnamed: 0,Parch,Fare,Embarked,Sex,Name
0,0,7.25,S,male,"Braund, Mr. Owen Harris"
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,0,7.925,S,female,"Heikkinen, Miss. Laina"
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,0,8.05,S,male,"Allen, Mr. William Henry"
5,0,8.4583,Q,male,"Moran, Mr. James"
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J"
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard"
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)"


**Imputation** : Imputing missing values is filling in data from known parts of the data. When you are imputing missing values, you are making up data.

**Caveats** : Imputation could be a course or a book on its own, but we are not going to see that in depth here.

**Imputing Missing Values Using Scikit-Learn**

In [15]:
from sklearn.impute import SimpleImputer

By default, *SimpleImputer* uses the **mean** of each column to impute that column's missing values.

In [16]:
imp = SimpleImputer()

In [17]:
imp.fit_transform(X[['Age']]) # 2D input

array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [28.11111111],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ]])

In [18]:
# Check the value the imputer learned and uses to fill in missing values.
imp.statistics_

array([28.11111111])
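The mean is only the default. As a small sketch, the same ten *Age* values can be imputed with the median instead by passing *strategy='median'*; the names *ages* and *imp_median* are introduced here just for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The ten Age values from the sample above, including the one NaN
ages = np.array([22, 38, 26, 35, 35, np.nan, 54, 2, 27, 14]).reshape(-1, 1)

# Same imputer class, but learning the median instead of the mean
imp_median = SimpleImputer(strategy='median')
filled = imp_median.fit_transform(ages)

# statistics_ holds the value learned during fit
print(imp_median.statistics_)
```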

In [19]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')

In [20]:
ct.fit_transform(X)

<10x48 sparse matrix of type '<class 'numpy.float64'>'
	with 88 stored elements in Compressed Sparse Row format>

In [21]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.named_steps

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex']),
                                 ('countvectorizer',
                                  CountVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=...
                                                  input='content',
                                                  lowercase=True, max_df=1.0,
                                             

- If X_new does not have any missing values, then nothing gets imputed in the new data.
- If X_new has missing values, then they get imputed with the mean (or median, etc.) of X, not with the mean of X_new. That is 28.11 in our case.

**YOU CAN ONLY LEARN FROM THE TRAINING DATA AND NOT FROM THE TESTING OR OUT OF SAMPLE DATA**
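A small sketch of that rule, using made-up numbers rather than the Titanic data: the imputer is fit on the training values only, and the new data is filled with what was learned there.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up training and new data (assumptions for illustration)
age_train = np.array([[20.0], [30.0], [np.nan], [40.0]])
age_new = np.array([[60.0], [np.nan], [80.0]])

imp_demo = SimpleImputer()      # default strategy='mean'
imp_demo.fit(age_train)         # learns the mean of the training data: 30.0
age_new_filled = imp_demo.transform(age_new)

# The NaN in the new data is filled with 30.0 (the training mean),
# not 70.0 (the mean of the new data)
print(age_new_filled.ravel())
```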

In [22]:
X_new = df_new[cols]
pipe.predict(X_new)

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

**Tip** : 
You can also create a new feature out of the missing values: an extra column of 0s and 1s that indicates which values were missing. This can be useful when values are missing, but not at random. In such cases, the fact that a value is missing can itself be a good predictor.

In [23]:
# add_indicator=True appends a column of 0s and 1s that flags which values were imputed
imp_indicator = SimpleImputer(add_indicator=True)

In [24]:
imp_indicator.fit_transform(X[['Age']])

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

**Random Tip** : You can write some custom code to try out multiple models at once. You could also use *StackingClassifier* or *VotingClassifier*, but these produce a combined output of multiple models and don't really give you the option to compare the output of every model.
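The "custom code" route can be as simple as a loop over candidate models. The sketch below is one possible version, not a recommendation: the data is synthetic and the two candidate models are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# Arbitrary candidate models, keyed by a readable name
candidates = {
    'logreg': LogisticRegression(solver='liblinear', random_state=1),
    'tree': DecisionTreeClassifier(max_depth=3, random_state=1),
}

# One cross-validated accuracy per model, so they can be compared side by side
scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```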

# Part 6 - Switching To Full Dataset

In [25]:
df = pd.read_csv('http://bit.ly/kaggletrain')
df.shape

(891, 12)

In [45]:
df_new = pd.read_csv('http://bit.ly/kaggletest')
df_new.shape

(418, 11)

**We have a problem with missing values**
- The training set has missing values in the *Embarked* column, but the testing set doesn't.
- The training set does not have missing values in the *Fare* column, but the testing set does.

In [29]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [30]:
df_new.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [31]:
# Split Data Into X and y
X = df[cols]
y = df['Survived']

In [32]:
# Old Column Transformer
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')

You will get an error saying there are NaNs.

In [34]:
# ct.fit_transform(X)

**Now we have to impute missing values**. Our solution is to impute the missing values in a column before one-hot encoding it.

- When imputing a string column like *Embarked*, you can't use the **mean** or **median** option.
- We have two options: 1) impute the **mode**, also called the **most frequent** value, or 2) impute a **constant** value of your choice.
- We are going to try out option 2. This way, the missing values become another category among the existing categories.
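The two options can be compared side by side on a tiny made-up column. In this sketch the values and the names *imp_mode* and *imp_const* are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up Embarked values with one missing entry (for illustration)
embarked = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']}, dtype=object)

# Option 1: fill with the most frequent value ('S' here)
imp_mode = SimpleImputer(strategy='most_frequent')
filled_mode = imp_mode.fit_transform(embarked).ravel().tolist()

# Option 2: fill with a constant, which becomes its own category
imp_const = SimpleImputer(strategy='constant', fill_value='missing')
filled_const = imp_const.fit_transform(embarked).ravel().tolist()

print(filled_mode)
print(filled_const)
```

With option 1 the missing row becomes indistinguishable from a real 'S'. With option 2 it stays visible as 'missing'.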

**Addressing first problem, where we have missing values in Train but not Test (Embarked).**

In [35]:
imp_constant = SimpleImputer(strategy='constant',fill_value='missing')

- In the line below, we are building a **transformer-only** pipeline, just as an example.
- In this pipeline, we first treat the missing values and then one-hot encode the result. We will try it on the *Embarked* column.

In [36]:
imp_ohe = make_pipeline(imp_constant, ohe)

What if you added **imp_constant** and **ohe** separately and directly into the **Column Transformer**, instead of making a pipeline and then passing it into the **Column Transformer**?

The problem is that a **Column Transformer** applies its transformers side by side to the original columns, not one after the other. The **imputation** step would output two columns (imputed, but still strings), the **ohe** would separately output multiple columns built from the un-imputed data, and you would end up with columns which are not suitable for the final step. So it's much clearer and safer to do it with a pipeline.

In [37]:
imp_ohe.fit_transform(X[['Embarked']])

<891x4 sparse matrix of type '<class 'numpy.float64'>'
	with 891 stored elements in Compressed Sparse Row format>

**Rules For Pipeline**
- All pipeline steps except for the final step **must** be a **Transformer**.
- The final step can be a **Transformer** or a **Model**.
- When we use a pipeline which ends in a **model** we use the methods **fit()** to fit the pipeline and **predict()** to predict on the new data.
- When we use a pipeline which ends in a **transformer** we use the methods **fit_transform()** on training data and **transform()** on test data.
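The last two rules can be sketched with the same building blocks used in this lesson. The rows and labels below are made up for illustration, and the names *transformer_pipe* and *model_pipe* are not part of the lesson code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data (an assumption for illustration)
train = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'Q', 'S', 'C']}, dtype=object)
test = pd.DataFrame({'Embarked': ['C', np.nan]}, dtype=object)
y_train = [0, 1, 1, 0, 0, 1]

# Pipeline ending in a transformer: fit_transform() on train, transform() on test
transformer_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder())
train_encoded = transformer_pipe.fit_transform(train)
test_encoded = transformer_pipe.transform(test)

# Pipeline ending in a model: fit() on train, predict() on test
model_pipe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(),
    LogisticRegression(solver='liblinear', random_state=1))
model_pipe.fit(train, y_train)
preds = model_pipe.predict(test)

print(train_encoded.shape, test_encoded.shape, preds.shape)
```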

**Making changes to the old Column Transformer in the cell below**
- We have used **imp_ohe** here. Remember that **imp_ohe** is the pipeline of transformers we built, which imputes missing values and then one-hot encodes. We pass that pipeline to *make_column_transformer* for the columns that need both imputation and one-hot encoding.
- In the example below, *Embarked* has missing values but *Sex* does not. It does no harm to apply the imputing pipeline to both: nothing happens to the *Sex* column, since it has no missing values.

In [38]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough')

In [39]:
ct.fit_transform(X)

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

**Addressing second problem, where we have missing values in Test but not Train (Fare).**

- We take the column transformer and add *Fare* next to *Age* for the *imp* instance we created. Note that *Fare* is a numerical column.
- During *fit()*, the *imputer* learns the *mean* of *Fare* from the training data. If it then sees any missing values in the test data, it imputes them with the *mean* it learned from the *training* data.

In [40]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age','Fare']),
    remainder='passthrough')

In [42]:
ct.fit_transform(X)

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

In [43]:
# Update our pipeline

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y);

In [48]:
X_new = df_new[cols]
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [58]:
# You can attach the predictions to the index by doing something like this
pd.Series(pipe.predict(X_new),index=df_new.index)

0      0
1      1
2      0
3      0
4      1
      ..
413    0
414    1
415    0
416    0
417    1
Length: 418, dtype: int64

In [54]:
# Get the learned statistics out of the numeric imputer: the training means of Age and Fare
ct.named_transformers_.simpleimputer.statistics_

array([29.69911765, 32.20420797])

In [53]:
# Same for the imputer inside the imp_ohe pipeline: the constant fill values for Embarked and Sex
ct.named_transformers_.pipeline.named_steps.simpleimputer.statistics_

array(['missing', 'missing'], dtype=object)

# Recap at http://bit.ly/complex-pipeline

In [59]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

In [60]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

In [61]:
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

In [62]:
df_new = pd.read_csv('http://bit.ly/kaggletest')
X_new = df_new[cols]

In [63]:
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()

In [64]:
imp_ohe = make_pipeline(imp_constant, ohe)
vect = CountVectorizer()
imp = SimpleImputer()

In [65]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

In [66]:
logreg = LogisticRegression(solver='liblinear', random_state=1)

In [67]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

**Drawbacks of doing your pre-processing in Pandas**

1) You can't make use of *CountVectorizer()*, which helps you work with *text* data.

2) You can do your one-hot encoding in Pandas using the *get_dummies* function, but the encoded columns then end up in the dataframe you are working with. With Scikit-Learn, the encoding happens inside the pipeline, so the original dataframe is untouched and clean to explore. Scikit-Learn also saves memory, since *OneHotEncoder* outputs a sparse matrix by default.

3) If you do your missing value imputation using Pandas on the whole dataset, it leads to **Data Leakage**.

**Data Leakage**

A *Model Evaluation* procedure such as *train/test split* or *cross-validation* is supposed to simulate the future, so that you can estimate right now how your model will perform in the future. That is what a *Model Evaluation* procedure is for. For example, you use a *Model Evaluation* procedure to check whether you are *overfitting* your training data.

If you do your missing value imputation in *Pandas* on the full dataset and then pass it to your model, then the model learns something it is not supposed to learn: the imputed values were computed using rows that later act as testing data. The scores become unreliable. Your *Model Evaluation* will not be a simulation of reality.

So some transformations (not all), such as *missing value imputation* and *feature encoding* like *one-hot*, shouldn't be done using Pandas.

**Summary Of Data Leakage**

- Data Leakage is learning something from the testing data that you are not allowed to know.
- Why should you avoid Data Leakage? Because the scores become unreliable, and you end up making bad decisions about hyperparameter tuning and about how well the model is performing.
- How does Scikit-Learn prevent Data Leakage? Its transformers have separate *fit* and *transform* steps: what is learned from the training data in *fit* is applied to new data in *transform*.

**You can avoid Data Leakage using Pandas Transformations if your first step is Train-Test-Split**, but it is a huge pain in complex cases.
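A rough sketch of what the Pandas-only route looks like when done safely, using a made-up *Age* column: split first, learn the fill value from the training rows, and reuse it on the test rows. Every learned value (each mean, each set of categories) has to be tracked by hand like this, which is where the pain comes from.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up Age column (an assumption for illustration)
df_demo = pd.DataFrame({'Age': [10.0, 20.0, np.nan, 40.0, 50.0, 60.0, np.nan, 80.0]})

# Leaky: the fill value is computed from all rows, including future test rows
leaky_fill = df_demo['Age'].mean()

# Leak-free: split first, then learn the fill value from the training rows only
train, test = train_test_split(df_demo, test_size=0.25, random_state=1)
train_fill = train['Age'].mean()
train_filled = train.fillna({'Age': train_fill})
test_filled = test.fillna({'Age': train_fill})  # reuse the training mean

print(leaky_fill, train_fill)
```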

# Part 7 - Evaluating and Tuning a Pipeline

In [69]:
from sklearn.model_selection import cross_val_score

**Applying Cross-Validation**

In [70]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8114619295712762

What is happening behind the scenes?

- CV splits the data into 5 sets, and each set gets a turn to act as the test data.
- For the other 4 portions of the data, the *Column Transformer* is fit and applied and the *model* is fit. Predictions are then made on the remaining portion.
- This is repeated for all the sets.
- So won't Data Leakage happen here?
- No. CV splits the data first and then applies the transformations. It does not apply the transformations to the entire data and then split. This way we won't have Data Leakage.
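The steps above can be written out as a manual loop. This is a sketch on synthetic data with a smaller stand-in pipeline (*pipe_demo*), not the lesson's pipeline; it relies on the fact that an integer *cv* with a classifier uses stratified folds without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data with some NaNs (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=1)
X_demo[::10, 0] = np.nan

pipe_demo = make_pipeline(
    SimpleImputer(),
    LogisticRegression(solver='liblinear', random_state=1))

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_pipe = clone(pipe_demo)
    # The imputer mean and the model are learned from the 4 training folds only
    fold_pipe.fit(X_demo[train_idx], y_demo[train_idx])
    # The held-out fold is transformed with what was learned, then scored
    manual_scores.append(fold_pipe.score(X_demo[test_idx], y_demo[test_idx]))

auto_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5, scoring='accuracy')
print(manual_scores)
print(auto_scores)
```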

**Hyper-parameter Tuning**

Step 1: We are going to be tuning a pipeline, so we should know the step names inside the pipeline.

In [71]:
pipe.named_steps.keys()

dict_keys(['columntransformer', 'logisticregression'])

Step 2: The parameters go in a dictionary, so we are going to build a dictionary containing the parameters and the values to be tried out while tuning.

Each dictionary key is the *step* name, followed by two underscores, followed by the *parameter* name. For example, **logisticregression** is the step name and **penalty** is the parameter.

In [79]:
params = {}
params['logisticregression__penalty'] = ['l1','l2']
params['logisticregression__C'] = [0.1, 1, 10]
params

{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10]}
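If you are unsure whether a key is spelled correctly, one option is to look it up in the pipeline's *get_params()* output, which lists every parameter under its step-name prefix. The sketch below uses a small stand-in pipeline (*pipe_demo*) rather than the lesson's pipeline.

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A small stand-in pipeline (an assumption: simpler than the one in this lesson)
pipe_demo = make_pipeline(SimpleImputer(), LogisticRegression(solver='liblinear'))

# Every tunable parameter, named as <step>__<parameter>
valid_keys = list(pipe_demo.get_params().keys())

# A key is usable in the params dictionary only if it appears in this list
print('logisticregression__C' in valid_keys)
print('simpleimputer__strategy' in valid_keys)
```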

In [80]:
from sklearn.model_selection import GridSearchCV

In [82]:
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X,y);

In [83]:
results = pd.DataFrame(grid.cv_results_)
results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.030163,0.020577,0.007181,0.000746,0.1,l1,"{'logisticregression__C': 0.1, 'logisticregres...",0.787709,0.803371,0.769663,0.758427,0.797753,0.783385,0.016946,6
1,0.017154,0.002475,0.006782,0.001324,0.1,l2,"{'logisticregression__C': 0.1, 'logisticregres...",0.798883,0.803371,0.764045,0.775281,0.803371,0.78899,0.016258,5
2,0.019148,0.001466,0.006783,0.000399,1.0,l1,"{'logisticregression__C': 1, 'logisticregressi...",0.815642,0.820225,0.797753,0.792135,0.848315,0.814814,0.019787,2
3,0.016959,0.000631,0.006386,0.000482,1.0,l2,"{'logisticregression__C': 1, 'logisticregressi...",0.798883,0.825843,0.803371,0.786517,0.842697,0.811462,0.020141,3
4,0.025138,0.003444,0.006593,0.000476,10.0,l1,"{'logisticregression__C': 10, 'logisticregress...",0.832402,0.814607,0.820225,0.786517,0.853933,0.821537,0.022107,1
5,0.018344,0.000477,0.005994,1.5e-05,10.0,l2,"{'logisticregression__C': 10, 'logisticregress...",0.782123,0.803371,0.808989,0.797753,0.853933,0.809234,0.02408,4


In [84]:
results.sort_values('rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
4,0.025138,0.003444,0.006593,0.000476,10.0,l1,"{'logisticregression__C': 10, 'logisticregress...",0.832402,0.814607,0.820225,0.786517,0.853933,0.821537,0.022107,1
2,0.019148,0.001466,0.006783,0.000399,1.0,l1,"{'logisticregression__C': 1, 'logisticregressi...",0.815642,0.820225,0.797753,0.792135,0.848315,0.814814,0.019787,2
3,0.016959,0.000631,0.006386,0.000482,1.0,l2,"{'logisticregression__C': 1, 'logisticregressi...",0.798883,0.825843,0.803371,0.786517,0.842697,0.811462,0.020141,3
5,0.018344,0.000477,0.005994,1.5e-05,10.0,l2,"{'logisticregression__C': 10, 'logisticregress...",0.782123,0.803371,0.808989,0.797753,0.853933,0.809234,0.02408,4
1,0.017154,0.002475,0.006782,0.001324,0.1,l2,"{'logisticregression__C': 0.1, 'logisticregres...",0.798883,0.803371,0.764045,0.775281,0.803371,0.78899,0.016258,5
0,0.030163,0.020577,0.007181,0.000746,0.1,l1,"{'logisticregression__C': 0.1, 'logisticregres...",0.787709,0.803371,0.769663,0.758427,0.797753,0.783385,0.016946,6


In [85]:
# Inspect the transformers inside the column transformer to find the names used in parameter keys
pipe.named_steps.columntransformer.named_transformers_

{'pipeline': Pipeline(memory=None,
          steps=[('simpleimputer',
                  SimpleImputer(add_indicator=False, copy=True,
                                fill_value='missing', missing_values=nan,
                                strategy='constant', verbose=0)),
                 ('onehotencoder',
                  OneHotEncoder(categories='auto', drop=None,
                                dtype=<class 'numpy.float64'>,
                                handle_unknown='error', sparse=True))],
          verbose=False),
 'countvectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=1.0, max_features=None, min_df=1,
                 ngram_range=(1, 1), preprocessor=None, stop_words=None,
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, vocabulary=None),
 'simpleimputer': SimpleImp

In [86]:
# Transformer parameters can be tuned too, if the transformations are done in Scikit-Learn
params['columntransformer__pipeline__onehotencoder__drop'] = [None,'first']

In [87]:
# Tuning CountVectorizer
params['columntransformer__countvectorizer__ngram_range'] = [(1,1),(1,2)]

In [88]:
# Tuning Imputer
params['columntransformer__simpleimputer__add_indicator'] = [False, True]

In [89]:
params

{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10],
 'columntransformer__pipeline__onehotencoder__drop': [None, 'first'],
 'columntransformer__countvectorizer__ngram_range': [(1, 1), (1, 2)],
 'columntransformer__simpleimputer__add_indicator': [False, True]}

In [90]:
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X,y);

In [93]:
results = pd.DataFrame(grid.cv_results_)
results.sort_values('rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__countvectorizer__ngram_range,param_columntransformer__pipeline__onehotencoder__drop,param_columntransformer__simpleimputer__add_indicator,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
28,0.035727,0.001691,0.007772,0.00097,"(1, 2)",,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.860335,0.825843,0.825843,0.780899,0.859551,0.830494,0.029113,1
40,0.04868,0.003254,0.008577,0.00049,"(1, 2)",first,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.849162,0.825843,0.814607,0.786517,0.853933,0.826012,0.024517,2
34,0.038306,0.002715,0.00739,0.000481,"(1, 2)",,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.854749,0.820225,0.820225,0.780899,0.853933,0.826006,0.027231,3
46,0.046155,0.005934,0.007992,0.000884,"(1, 2)",first,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.843575,0.831461,0.814607,0.780899,0.853933,0.824895,0.0256,4
4,0.026372,0.002409,0.00678,0.000398,"(1, 1)",,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.832402,0.814607,0.820225,0.786517,0.853933,0.821537,0.022107,5
22,0.032313,0.00507,0.006392,0.0005,"(1, 1)",first,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.821229,0.820225,0.814607,0.792135,0.853933,0.820426,0.019787,6
16,0.031535,0.003418,0.006384,0.000487,"(1, 1)",first,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.826816,0.820225,0.814607,0.780899,0.853933,0.819296,0.023467,7
10,0.027539,0.003587,0.007766,0.001701,"(1, 1)",,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.821229,0.820225,0.808989,0.780899,0.853933,0.817055,0.023494,8
20,0.01947,0.000448,0.006982,0.000631,"(1, 1)",first,True,1.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.810056,0.820225,0.797753,0.792135,0.853933,0.81482,0.021852,9
44,0.025202,0.001083,0.007246,0.000395,"(1, 2)",first,True,1.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.810056,0.820225,0.797753,0.792135,0.853933,0.81482,0.021852,9


In [94]:
grid.best_score_

0.8304940053982801

In [95]:
grid.best_params_

{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': False,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}

*GridSearchCV* actually re-fits the pipeline on all of X and y using the best parameters, so you can go ahead and predict on new data with the *grid* object.

In [96]:
grid.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,
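Because of that re-fit, predicting with the grid object is the same as predicting with its *best_estimator_*. A small sketch on synthetic data, where the stand-in pipeline and the one-parameter grid are assumptions chosen to keep the search short:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic data and a small stand-in pipeline (assumptions for illustration)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=1)
pipe_demo = make_pipeline(
    SimpleImputer(),
    LogisticRegression(solver='liblinear', random_state=1))

params_demo = {'logisticregression__C': [0.1, 1, 10]}
grid_demo = GridSearchCV(pipe_demo, params_demo, cv=5, scoring='accuracy')
grid_demo.fit(X_demo, y_demo)

# best_estimator_ is the pipeline re-fit on all of X_demo and y_demo with best_params_
same = np.array_equal(
    grid_demo.predict(X_demo),
    grid_demo.best_estimator_.predict(X_demo))
print(grid_demo.best_params_, same)
```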