## Hyperparameter Tuning

- In contrast to __model parameters__, which are learned during training, __model hyperparameters__ are set by the data scientist before training and control how the model is built and trained.
- For example, the __weights learned during training__ of a linear regression model are __parameters__, while the __number of trees in a random forest__ is a __hyperparameter__ because the data scientist sets it.
- __Hyperparameters__ can be thought of as __model settings__. These settings need to be tuned for each problem because the best hyperparameters for one dataset will not necessarily be the best for another.
- Hyperparameter tuning (also called __hyperparameter optimization__) means finding the combination of hyperparameter values that gives the best performance for a given problem, as measured on a validation dataset.
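The distinction between parameters and hyperparameters can be seen directly in scikit-learn. The sketch below is only an illustration on made-up data (the line `y = 2*x + 1` and the value `n_estimators=25` are arbitrary choices, not part of this notebook's dataset): the regression weights appear only after `fit()`, while the number of trees is fixed when the estimator is created.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2*x + 1 exactly (made up for illustration).
X_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 2 * X_toy.ravel() + 1

# Parameters: the slope and intercept are learned from the data by fit().
lin = LinearRegression().fit(X_toy, y_toy)
print("learned slope:", lin.coef_[0])
print("learned intercept:", lin.intercept_)

# Hyperparameter: n_estimators is chosen by us before any training happens.
forest = RandomForestClassifier(n_estimators=25, random_state=42)
print("number of trees (set by us):", forest.get_params()["n_estimators"])
```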

In [None]:
##! pip freeze

In [None]:
##! pip install -U scikit-learn

### Hyperparameter Tuning using RandomForest Classifier

In [None]:
# Data manipulation libraries
import pandas as pd
import numpy as np

# Scikit-learn modules needed for the random forest pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree 
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelEncoder,MinMaxScaler , StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Plotting libraries
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes = True)
%matplotlib inline

### Load training set for building the model

In [None]:
df = pd.read_csv("train.csv")
df.head()

#### Validate blank cells

In [None]:
print(df.describe())
df.isna().sum()

### Several columns contain blank values, as seen above
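Counts of missing values are easier to judge as a share of the rows. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not `train.csv`) to show one way of computing that share per column; the same expression should work on `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame standing in for the training data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# isna() gives a boolean frame; the column mean is the fraction of missing cells.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```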

### Manage blank values with imputation and build ML pipeline

I have created a sample pipeline with a few variables; however, I suggest you explore further and build your own pipelines.

In [None]:
# We create the preprocessing pipelines for both numeric and categorical data.
numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_features = ['Embarked', 'Sex', 'Pclass','SibSp']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier(random_state= 42))])
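Every setting inside a nested pipeline can be addressed by joining the step names with double underscores, which is the naming a parameter grid relies on. The sketch below rebuilds a reduced version of the pipeline (the `demo_` names are hypothetical, chosen so they do not overwrite `clf`) and lists a few of the names returned by `get_params()`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Reduced copy of the pipeline above: numeric branch only.
demo_numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())])
demo_pre = ColumnTransformer(transformers=[("num", demo_numeric, ["Age", "Fare"])])
demo_clf = Pipeline(steps=[("preprocessor", demo_pre),
                           ("classifier", RandomForestClassifier(random_state=42))])

# get_params() exposes every nested setting as <step>__<substep>__<parameter>.
demo_names = sorted(demo_clf.get_params().keys())
for name in demo_names:
    if name.endswith("imputer__strategy") or name.startswith("classifier__max_"):
        print(name)
```

Running `clf.get_params().keys()` on the full pipeline lists the names that are valid keys for a grid search.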

In [None]:
df.columns

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[['Fare','Pclass', 'Name', 'Sex', 'Age','Embarked','SibSp']],
                                                    df["Survived"], test_size=0.2,random_state = 42)

In [None]:
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
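For a classifier, `score` reports mean accuracy, which hides what kind of mistakes are made. A confusion matrix breaks the result down by class. The labels below are made up purely to show how the matrix is read; on the notebook's data the call would presumably be `confusion_matrix(y_test, clf.predict(X_test))`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels (0 = did not survive, 1 = survived).
y_true_demo = [0, 0, 0, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
acc_demo = accuracy_score(y_true_demo, y_pred_demo)

print(cm_demo)
print("accuracy: %.3f" % acc_demo)
```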

Reference on Grid Search
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV

In [None]:
RandomForestClassifier()

In [None]:
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__criterion': ["gini","entropy"],
    #'classifier__max_features': ["sqrt","log2"],
    'classifier__max_depth':[10,50,100],
    'classifier__n_estimators':[10,50,150,200]
}

grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

print(("best Random Forest from grid search: %.3f"
       % grid_search.score(X_test, y_test)))
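Grid search trains one model per parameter combination per fold, so the cost grows quickly. The sketch below copies the grid into a separate name (`demo_grid`, a hypothetical variable) and uses `ParameterGrid` to count the combinations before committing to a long run.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, copied so the count can be checked on its own.
demo_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__criterion": ["gini", "entropy"],
    "classifier__max_depth": [10, 50, 100],
    "classifier__n_estimators": [10, 50, 150, 200],
}

n_candidates = len(ParameterGrid(demo_grid))  # 2 * 2 * 3 * 4 combinations
n_folds = 10                                  # matches cv=10
total_fits = n_candidates * n_folds           # one final refit on all data is extra

print(n_candidates, "candidates,", total_fits, "cross-validation fits")
```

Adding one more value to any list multiplies the total, so it can be worth starting with a coarse grid and refining around the best result.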

In [None]:
grid_search.best_params_
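Besides `best_params_`, a fitted search also keeps the cross-validated score of every candidate. The sketch below runs a deliberately tiny search on synthetic data (`make_classification` and the two-by-two grid are stand-ins, not the notebook's setup) to show where those results live; the same attributes should be available on `grid_search`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data so the example runs without train.csv.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [5, 20], "max_depth": [2, 4]},
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate; rank 1 is the best mean validation score.
summary = (pd.DataFrame(demo_search.cv_results_)
           [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))

print("best cross-validated accuracy: %.3f" % demo_search.best_score_)
print(summary)
```

Note that `best_score_` is the mean score over the validation folds, which is a different number from the score on a held-out test set.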

### Loading Test data
- Predict on the test set.
- The columns used for training must also be present in the test data; otherwise prediction raises an error.
- Create a submission file and save it to disk.
- The submission file is used for the Kaggle competition.
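A quick check before predicting can catch a missing column early. The helper below, `missing_columns`, is a hypothetical function written for this illustration, and `toy_test` is a made-up frame; the required list mirrors the numeric and categorical features used by the pipeline.

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are absent from frame."""
    return [col for col in required if col not in frame.columns]

# Features the preprocessing pipeline was configured with.
required_features = ["Age", "Fare", "Embarked", "Sex", "Pclass", "SibSp"]

# Made-up test frame that forgets one of them.
toy_test = pd.DataFrame({
    "Pclass": [3], "Sex": ["male"], "Age": [34.5],
    "Fare": [7.83], "Embarked": ["Q"],
})

print(missing_columns(toy_test, required_features))
```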

In [None]:
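test_data = pd.read_csv("test.csv")
# Use the same columns as in training; the pipeline needs SibSp as well.
X_test = test_data[['Fare','Pclass', 'Name', 'Sex', 'Age','Embarked','SibSp']]
y_predicted = grid_search.predict(X_test)
# Assumes test.csv has a PassengerId column, which the submission needs next to each prediction.
y_predicted = pd.DataFrame({"PassengerId": test_data["PassengerId"], "Survived": y_predicted})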

### Save the file in the correct format for uploading to the Kaggle challenge

In [None]:
y_predicted.to_csv("My_submission_rf.csv", index=False)
# Saves the file in the format required by the competition