
# Classification - Ensemble Methods and Trees 
<br>

This notebook is intended for beginners. It offers a guide to what **Ensemble Methods** are and how they are used to improve on **Decision Trees**.

The tutorial reaches 80.86% accuracy with a simple Random Forest.

**Level** : Beginner 

**Task** : To predict whether a passenger survived the sinking of the Titanic, using the [Titanic](https://www.kaggle.com/c/titanic/data) dataset.



Importing required libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt


import os
import warnings
from sklearn import preprocessing

%matplotlib inline
warnings.filterwarnings('ignore')

Content
1.  [Exploring and Preparing Data](#explore)
    *  [Inspecting each feature Individually](#inspect)
    *  [Data Preprocessing](#preprocess)
    *  [Data Correlation](#correlate)
2. [Ensemble Methods and Trees](#ensemble)
    * [Slides](#slides)    
    * Decision Trees and Random Forests
        * [Evaluating Models Using Default Params](#default)
        * [Parameter Selection](#param)
        * [Cross Validation Scoring](#cv)
3.  [Submission](#submit)

  ## <a id="explore">1. Exploring and Preparing Data</a>
<br>
Reading data from "train.csv", which will later be divided into a training split (to fit the model) and a held-out split (to check accuracy).  

In [None]:
print(os.listdir("../input"))
train= pd.read_csv('../input/train.csv')
train_init= pd.read_csv('../input/train.csv')
test= pd.read_csv('../input/test.csv')
test_init= pd.read_csv('../input/test.csv')
train.head()

In [None]:
train.set_index('PassengerId', inplace=True)
test.set_index('PassengerId', inplace=True)
train.head()

In [None]:
plt.figure(figsize=[20,5])
y=train.isna().sum()/len(train)*100
x=train.isna().sum().index.values
plt.bar(x,y);
plt.title("Missing Values in Training Data");
plt.ylabel('Percentage of Missing Values');


In [None]:
plt.figure(figsize=[20,5])
y=test.isna().sum()/len(test)*100  # divide by the size of the test set, not the training set
x=test.isna().sum().index.values
plt.bar(x,y);
plt.title("Missing Values in Test Data");
plt.ylabel('Percentage of Missing Values');

We will look at the features with missing values individually under the next heading. <br>
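As a compact alternative to the bar charts, the share of missing values per column can also be read off as a table with `isna().mean()`. The sketch below runs on a small made-up frame (`toy` is a hypothetical name and its values are not the Titanic data), so it only illustrates the call:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not the real Titanic values)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, None, None, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# isna() gives a boolean frame; the column mean of booleans is the missing fraction
missing_pct = toy.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))
```

The same expression applied to `train` or `test` always divides by the length of the frame it is called on.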

## <a id="inspect">Inspecting Each Feature Individually</a>
<br>



Now, let's look into each feature separately.

* [**Survived**](#survived): 0 = No, 1 = Yes
* [**pclass**](#pclass): Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* [**Sex**](#sex): Male or Female
* [**Age**](#age): Age in years
* [**sibsp**](#family): Number of siblings / spouses aboard the Titanic
* [**parch**](#family): Number of parents / children aboard the Titanic
* [**ticket**](#ticket): Ticket number
* [**fare**](#fare): Passenger fare
* [**cabin**](#cabin): Cabin number
* [**embarked**](#embark): Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)


Let's explore these features one by one.

### <a id="survived">Survived</a>

In [None]:
col=train.Survived
print("Unique Values: ",col.unique())
ind=col.value_counts().index.values
plt.bar(ind,col.value_counts());
plt.xticks(ind);
plt.title('Survived');
plt.ylabel('No. of Passengers');

It is the target variable. Roughly 60% of the passengers are negative (did not survive) and 40% are positive (survived).
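This class balance gives a useful reference point: a model that always predicts "did not survive" is already right for the majority class. A minimal sketch, assuming the commonly reported counts for the Kaggle training file (549 died, 342 survived; check them with `train.Survived.value_counts()`):

```python
# Assumed class counts for train.csv; verify with train.Survived.value_counts()
n_died = 549
n_survived = 342

n_total = n_died + n_survived
# Accuracy of a "model" that predicts the majority class for everyone
baseline_accuracy = max(n_died, n_survived) / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Any classifier trained later in the notebook should be judged against this baseline rather than against 50%.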

### <a id="pclass">Pclass</a>

In [None]:
print("Missing Values: ",train.Pclass.isna().sum())
col=train.Pclass
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
print("Unique Values: ",col.unique())

y1=col_0.value_counts().sort_index()
p1 = plt.bar(y1.index.values, y1)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,bottom=y1)

plt.ylabel('No. of Passengers')
plt.title('Passenger Class')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

**Observations:**
* Class 3 has the largest number of passengers, and almost 75% of them died. (*Deadly Class :(* )
* In contrast, more than half of the passengers in Class 1 survived. (*Safe Class :)* )

This feature seems to carry useful information.

### <a id="sex">Sex</a>

In [None]:
print("Missing Values: ",train.Sex.isna().sum())
col=train.Sex
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
print("Unique Values: ",col.unique())

y1=col_0.value_counts().sort_index()
p1 = plt.bar(y1.index.values, y1)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,bottom=y1)

plt.ylabel('No. of Passengers')
plt.title('Sex of Passengers')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

Let's compare the sex of passengers against Pclass to get an idea of the survival ratios of men and women in each class.

In [None]:
print("Missing Values: ",train.Pclass.isna().sum())
col=train.Pclass
col_0=col[train.Sex=="male"]
col_1=col[train.Sex=="female"]
print("Unique Values: ",col.unique())
plt.figure(figsize=[20,10])
y1=col_0.value_counts().sort_index()
s1=y1
p1 = plt.bar(y1.index.values, y1)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,bottom=y1)

col_died=col_0
col_survived=col_1

col=col_died
col_0=col[train.Survived==0]
col_1=col[train.Survived==1]


width=0.2
w=0.25
y1=col_0.value_counts().sort_index()
p3 = plt.bar(y1.index.values+w, y1,width)
y2=col_1.value_counts().sort_index()
p4 = plt.bar(y2.index.values+w, y2,width,bottom=y1)


col=col_survived
col_0=col[train.Survived==0]
col_1=col[train.Survived==1]

width=0.2
w=0.25
y1=col_0.value_counts().sort_index()
p5 = plt.bar(y1.index.values+w, y1,width,bottom=s1,color='g')
y2=col_1.value_counts().sort_index()
p6 = plt.bar(y2.index.values+w, y2,width,bottom=s1+y1,color='r')

plt.ylabel('No. of Passengers')
plt.title('PClass Vs Survival Vs Sex of Passengers')
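# p1/p2 are the male/female totals; p3..p6 split each group into died/survived
plt.legend((p1[0], p2[0], p3[0], p4[0], p5[0], p6[0]),
           ('Male', 'Female', 'Male died', 'Male survived', 'Female died', 'Female survived'))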
plt.xticks(y1.index.values);



**Observation:**
* In class 3, half of the passengers who survived are men, whereas in the other two classes almost all the passengers who survived were women.
* So we can assume that if a passenger is a woman in Class 2, there is a greater chance that she survived.

### <a id="name">Name</a>

In [None]:
print("Missing Values: ",train.Name.isna().sum())
train.Name.head(5)

Let's extract the title from each name to simplify the feature.

In [None]:
train['title'] = train.Name.map( lambda x: x.split(',')[1].split( '.' )[0].strip())
train['title'] = train['title'].replace('Mlle', 'Miss')
train['title'] = train['title'].replace(['Mme','Lady','Ms','the Countess'], 'Mrs')
# Group every remaining title under 'Others' (.loc on the DataFrame avoids chained assignment)
rare_titles = ~train.title.isin(['Master', 'Mr', 'Miss', 'Mrs'])
train.loc[rare_titles, 'title'] = 'Others'
print("For train data, title count is: \n",train['title'].value_counts())


test['title'] = test.Name.map( lambda x: x.split(',')[1].split( '.' )[0].strip())
test['title'] = test['title'].replace('Mlle', 'Miss')
test['title'] = test['title'].replace(['Mme','Lady','Ms'], 'Mrs')
test.loc[~test.title.isin(['Master', 'Mr', 'Miss', 'Mrs']), 'title'] = 'Others'

test=test.drop(['Name'], axis=1)
train=train.drop(['Name'], axis=1)
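To make the string handling easier to follow, here is the same extraction applied to single names in plain Python. The helper names (`extract_title`, `simplify_title`, `TITLE_MAP`, `COMMON_TITLES`) are hypothetical, and the sample strings only follow the "Surname, Title. Given names" pattern of the Name column:

```python
# Same mapping as the notebook cell: merge spelling variants, then group rare titles
TITLE_MAP = {"Mlle": "Miss", "Mme": "Mrs", "Lady": "Mrs", "Ms": "Mrs", "the Countess": "Mrs"}
COMMON_TITLES = {"Mr", "Mrs", "Miss", "Master"}

def extract_title(name):
    """Return the text between the first comma and the following period."""
    return name.split(",")[1].split(".")[0].strip()

def simplify_title(name):
    """Map a full name to one of Mr, Mrs, Miss, Master or Others."""
    title = extract_title(name)
    title = TITLE_MAP.get(title, title)
    return title if title in COMMON_TITLES else "Others"

samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Uruchurtu, Don. Manuel E",
]
titles = [simplify_title(s) for s in samples]
print(titles)
```

The split on "," followed by the split on "." relies on every name containing both characters in that order; a name without a comma would raise an `IndexError`.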


In [None]:
col=train.title
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
print("Unique Values: ",col.unique())
width=0.5
y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width,bottom=y1[y2.index.values])

plt.ylabel('No. of Passengers')
plt.title('Name Titles of Passengers')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

In [None]:
print("Female: " ,(train.Sex=="female").sum())
print("Male: " ,(train.Sex=="male").sum())

print("Mr+Master+Others: ",(train.title=="Mr").sum()+(train.title=="Master").sum()+(train.title=="Others").sum())
print("Mrs+Miss: ",(train.title=="Mrs").sum()+(train.title=="Miss").sum())


**Observation:**
* As the last two features show, most of the passengers who died were men.

Cross-checking the title counts against the Sex counts shows that Others contains only 1 female passenger; the rest are men.

### <a id="family">SibSp and Parch</a>

In [None]:
train['family'] = train.SibSp + train.Parch
test['family'] = test.SibSp + test.Parch
test=test.drop(['SibSp','Parch'], axis=1)
train=train.drop(['SibSp','Parch'], axis=1)

We summed *SibSp* and *Parch* to get the total number of family members aboard for each passenger.

In [None]:
col=train.family
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
print("Unique Values: ",col.unique())
plt.figure(figsize=[10,5])

y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,bottom=y1[y2.index.values])

plt.ylabel('No. of Passengers')
plt.title('No. of Family Members Aboard')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

### <a id="ticket"> Ticket </a>

In [None]:
print("Ticket Category: \n",train.Ticket.unique()[:20])
train.Ticket=train.Ticket.map(lambda x: x[0])
test.Ticket=test.Ticket.map(lambda x: x[0])
print("\nAfter Mapping to Simplify Feature: \n",train.Ticket.unique()[:20])

In [None]:
col=train.Ticket
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
print("Unique Values: ",col.unique())

y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,bottom=y1[y2.index.values])

plt.ylabel('No. of Passengers')
plt.title('Ticket Category')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

### <a id="fare"> Fare </a>

In [None]:
print("Fare Values (first 20)\n",train.Fare.unique()[:20])
train['fare_value']=round(train.Fare/10)*10
test['fare_value']=round(test.Fare/10)*10
col=train['fare_value']
print("\nFare values after rounding off for Simplification\n",col.unique())
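One detail of this rounding is worth knowing: Python's built-in `round` rounds exact halves to the nearest even number, and NumPy (which pandas relies on) does the same, so a fare of 25 lands in bucket 20 while 35 lands in bucket 40. A small plain-Python sketch, with `fare_bucket` as a hypothetical helper name:

```python
def fare_bucket(fare):
    """Round a fare to the nearest multiple of 10 (exact halves go to the even neighbour)."""
    return round(fare / 10) * 10

example_fares = [4.0, 7.25, 25.0, 35.0, 71.2833]
buckets = [fare_bucket(f) for f in example_fares]
print(dict(zip(example_fares, buckets)))
```

If consistent "round half up" behaviour is preferred, an explicit binning function such as `pd.cut` with hand-picked edges is one alternative.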


In [None]:
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
print("Unique Values: ",col.unique())
plt.figure(figsize=[10,5])
width=4
y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width,bottom=y1[y2.index.values])

plt.ylabel('No. of Passengers')
plt.title('Fare Value')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

In [None]:
col_0=col[train.Pclass==1]
col_1=col[train.Pclass==2]
col_2=col[train.Pclass==3]
print("Unique Values: ",col.unique())
plt.figure(figsize=[20,5])
width=4
y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width,bottom=y1[y2.index.values])
y3=col_2.value_counts().sort_index()
p3 = plt.bar(y3.index.values, y3,width,bottom=y1[y2.index.values]+y2[y3.index.values])


plt.ylabel('No. of Passengers')
plt.title('Fare Value Vs PClass')
plt.legend((p1[0], p2[0],p3[0]), ('Class 1', 'Class 2','Class 3'))
plt.xticks(y1.index.values);

We saw in the beginning that the Fare feature in the test data has a NaN value. Let's fix it.

In [None]:
test[test.Fare.isna()]

Let's look at what entries with similar features have in common.

In [None]:
# Combine the conditions into one boolean mask instead of chaining filters
test[(test.Ticket=="3") & (test.title=="Mr") & (test.Embarked=="S") & (test.Pclass==3)]

In [None]:
col=test.Age[test.Ticket=="3"][test.Pclass==3]
col_0=col[test.fare_value==10]
col_1=col[test.fare_value==20]
col_2=col[test.fare_value>=30]
print("Unique Values: ",col.unique())
width=0.5
y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width)
y3=col_2.value_counts().sort_index()
p3 = plt.bar(y3.index.values, y3,width)
plt.ylabel('No. of Passengers')
plt.title('Age by Fare Value (Pclass 3, Ticket 3)')
plt.legend((p1[0], p2[0], p3[0]), ('10', '20','30'))

For passengers with Pclass=3 and Ticket=3, most of those older than 40 have a fare value of 10. The passenger with the missing fare is about 60 years old, so we assume a fare value of 10.

In [None]:
test['fare_value']=test['fare_value'].replace(np.nan,10)
test.fare_value.isna().sum()
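The manual inspection above is one way to choose a fill value. A more automatic variant, shown here only as a sketch and not as the method used in this notebook, fills a missing fare with the median fare of the passenger's class. The frame `toy` and its values are made up:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not the real Titanic values)
toy = pd.DataFrame({
    "Pclass": [3, 3, 3, 1, 1],
    "Fare": [7.0, 9.0, np.nan, 80.0, 90.0],
})

# Median fare of each passenger's own class, aligned row by row (NaN is ignored)
class_median = toy.groupby("Pclass")["Fare"].transform("median")
toy["Fare_filled"] = toy["Fare"].fillna(class_median)
print(toy)
```

Grouping by more columns (for example class and port) narrows the comparison group, at the cost of fewer rows per group.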

### <a id="cabin"> Cabin </a>

Checking for missing values

In [None]:
print("Percentage of missing values: ",train.Cabin.isna().sum()/len(train.Cabin)*100)

Dropping the feature due to 77% missing values.

In [None]:
test=test.drop(['Cabin'], axis=1)
train=train.drop(['Cabin'], axis=1)

### <a id="embark"> Embarked </a>

Checking missing Values

In [None]:
print("Percentage of missing values: ",train.Embarked.isna().sum()/len(train.Embarked)*100)

In [None]:
print("Unique Values:\n", train.Embarked.value_counts())

In [None]:
train[train.Embarked.isna()]

Let's check the general trend for Ticket value 1 and Fare around 80.

In [None]:
train[ (train.fare_value==80) & (train.Ticket=="1")]

In [None]:
col=train[ (train.fare_value==80) & (train.Ticket=="1")].Fare
col_0=col[train.Embarked=="C"]
col_1=col[train.Embarked=="S"]
print("Unique Values: ",col.unique())
width=0.25
y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width)
plt.ylabel('No. of Passengers')
plt.title('Embarked')
plt.legend((p1[0], p2[0]), ('C', 'S'))

Most of the passengers who paid a fare close to 80 embarked at "S", so we set the missing value to "S".

In [None]:
train.Embarked=train.Embarked.replace(np.nan,"S")

In [None]:
col=train.Embarked
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
print("Unique Values: ",col.unique())
width=0.5
y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width,bottom=y1[y2.index.values])

plt.ylabel('No. of Passengers')
plt.title('Embarked')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

### <a id="age">Age</a>

Checking missing Values

In [None]:
print("Percentage of missing values: ",train.Age.isna().sum()/len(train.Age)*100)

In [None]:
train['age_round']=round(train.Age/10)*10
train.age_round=train.age_round.replace(0,10)
train.age_round.unique()

In [None]:
train['age_round'] = train['age_round'].replace([0,10], 'child')
train['age_round'] = train['age_round'].replace([20,30], 'young')
train['age_round'] = train['age_round'].replace([40,50], 'adult')
train['age_round'] = train['age_round'].replace([60,70,80], 'old')
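The same four groups can be produced in one step with `pd.cut`. The bin edges below (15, 35, 55) are an assumption chosen to match where the rounding above switches groups; the sample ages are made up:

```python
import numpy as np
import pandas as pd

# Made-up ages that sit on and around the group boundaries
ages = pd.Series([4.0, 14.5, 15.0, 34.0, 35.0, 54.0, 55.0, 80.0])

# right=False makes each bin closed on the left: [0, 15), [15, 35), [35, 55), [55, inf)
age_group = pd.cut(
    ages,
    bins=[0, 15, 35, 55, np.inf],
    right=False,
    labels=["child", "young", "adult", "old"],
)
print(age_group.tolist())
```

Missing ages stay missing with `pd.cut`, so they still need separate handling.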

In [None]:
train[train.age_round.isna()].head(10)

In [None]:
col=train.age_round
col_0=col[train.Pclass==1]
col_1=col[train.Pclass==2]
col_2=col[train.Pclass==3]
width=0.5
print("Unique Values: ",col.unique())
y1=col_0.value_counts().sort_index()
p1 =plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width,bottom=y1[y2.index.values])
y3=col_2.value_counts().sort_index()
p3 = plt.bar(y3.index.values, y3,width,bottom=y1[y2.index.values]+y2)

plt.ylabel('No. of Passengers')
plt.title('Age (and PClass)')
plt.legend((p1[0], p2[0], p3[0]), ('1', '2','3'))
plt.xticks(y1.index.values);

In [None]:
col=train.age_round
col_1=col[train.Survived==1]
col_0=col[train.Survived==0]
width=0.5
y1=col_0.value_counts().sort_index()
p1 = plt.bar(y1.index.values, y1,width)
y2=col_1.value_counts().sort_index()
p2 = plt.bar(y2.index.values, y2,width,bottom=y1)

plt.ylabel('No. of Passengers')
plt.title('Age of Passengers')
plt.legend((p1[0], p2[0]), ('Died', 'Survived'))
plt.xticks(y1.index.values);

We don't find any prominent relation right now, so let's drop it for now and deal with it later.

In [None]:
train=train.drop('Age',axis=1)
train=train.drop('age_round',axis=1)
test=test.drop('Age',axis=1)

#train=train.drop('family',axis=1)
#test=test.drop('family',axis=1)

train=train.drop('Fare',axis=1)
test=test.drop('Fare',axis=1)

Removing target variable from data and saving it as label

In [None]:
label = train.Survived
train_corr=train
train=train.drop('Survived',axis=1)
print("Train Shape: ",train.shape)
print("Test Shape: ",test.shape)
print("Label Shape: ",label.shape)

## <a id="preprocess" > Data Preprocessing</a>

Perform data conversion from Categorical to Numeric, and Data Scaling.

In [None]:
def scale(data):
    print("\nScaling Data:\n")
    min_max_scaler = preprocessing.MinMaxScaler()
    data_scale = min_max_scaler.fit_transform(data)
    data_scale=pd.DataFrame(data_scale, columns=data.columns.values, index=data.index.values)
    print(data_scale.head())
    return(data_scale)

from pandas.api.types import is_string_dtype
def cat_to_num(data):
    print("\nConverting Categorical Data To Numerical:\n")
    obj_columns=[]
    nonobj_columns=[]
    for col in data.columns.values:
        if data[col].dtype=='object':
            obj_columns.append(col)
        else:
            nonobj_columns.append(col)
    print(len(obj_columns)," Object Columns are \n",obj_columns,'\n')
    print(len(nonobj_columns),"Non-object columns are \n",nonobj_columns)
    data_obj=data[obj_columns].copy()  # copy so the assignments below do not write into a slice of data
    data_nonobj=data[nonobj_columns]
    for col in data_obj.columns.values:
        data_obj[col]=data_obj[col].astype('category').cat.codes
    data_merge=pd.concat([data_nonobj,data_obj],axis=1)
    print("\nData after conversion:\n",data_merge.head())
    return data_merge

def data_preprocess(data):
    data=cat_to_num(data)
    data=scale(data)
    return(data)
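One caveat with `cat.codes`: the codes depend on which categories are present, so encoding the training and test frames separately can assign different numbers to the same value if one frame lacks a category. The sketch below uses made-up port codes to show the effect and one possible remedy (reusing the training categories); it is not the approach taken in the function above:

```python
import pandas as pd

# Made-up columns: the test column happens to contain no "C"
train_col = pd.Series(["S", "C", "Q", "S"])
test_col = pd.Series(["S", "Q"])

# Encoded on its own, the test column only knows the categories Q and S
separate_codes = test_col.astype("category").cat.codes.tolist()

# Reusing the training categories keeps the codes consistent across both sets
categories = sorted(train_col.unique())
train_codes = pd.Categorical(train_col, categories=categories).codes.tolist()
shared_codes = pd.Categorical(test_col, categories=categories).codes.tolist()

print("train codes:   ", train_codes)
print("test, separate:", separate_codes)
print("test, shared:  ", shared_codes)
```

The same reasoning applies to scaling: a scaler fitted on the training data and then applied to the test data keeps both sets on one scale.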

In [None]:
X=data_preprocess(train)
test_data=data_preprocess(test)
y=label

## <a id="correlate"> Correlation wrt Target Variable</a>
<br>
Adding Processed Numeric Data and Label to find Correlation wrt target 'Survived'

In [None]:
temp=pd.concat([X,label],axis=1)

In [None]:
import seaborn as sns
plt.figure(figsize=(13,10))
train_corr=temp
train_corred=train_corr.corr()
sns.heatmap(abs(train_corred), vmax=0.8)
price_corr_values=train_corred['Survived'].sort_values(ascending=False)
print(abs(price_corr_values).sort_values(ascending=False).head(20))


**Observations:**

* The features Sex, Pclass, Fare (here `fare_value`) and Embarked are highly correlated with the target variable.

# <a id="ensemble">Ensemble Methods and Trees</a>
<br>

<a id="slides"></a> Please use **Previous** and **Next** Button to change the slide. 

[How to code a Image Slider in Kaggle Notebook](https://www.kaggle.com/sabasiddiqi/image-slider-using-ipython-display)

In [None]:
from IPython.display import display, HTML  # display lives in IPython.display; IPython.core.display is deprecated for this
   
html=   """
        <style>
        .mySlides {display:none;}
        </style>
        <img class="mySlides" src="https://github.com/SabaSiddiqi/Backup/blob/master/trees/Slide1.PNG?raw=true">
        <img class="mySlides" src="https://github.com/SabaSiddiqi/Backup/blob/master/trees/Slide2.PNG?raw=true">
        <img class="mySlides" src="https://github.com/SabaSiddiqi/Backup/blob/master/trees/Slide3.PNG?raw=true">
        <img class="mySlides" src="https://github.com/SabaSiddiqi/Backup/blob/master/trees/Slide4.PNG?raw=true">
        <img class="mySlides" src="https://github.com/SabaSiddiqi/Backup/blob/master/trees/Slide5.PNG?raw=true">
        <img class="mySlides" src="https://github.com/SabaSiddiqi/Backup/blob/master/trees/Slide6.PNG?raw=true">
        <button class="w3-button w3-display-left" onclick="plusDivs(-1)">&#10094; Previous</button>
        <button class="w3-button w3-display-right" onclick="plusDivs(+1)">Next &#10095;</button>
        <script>
                var slideIndex = 1;
                showDivs(slideIndex);

                function plusDivs(n) {
                showDivs(slideIndex += n);
                }

                function showDivs(n) {
                    var i;
                    var x = document.getElementsByClassName("mySlides");
                    if (n > x.length) {slideIndex = 1} 
                    if (n < 1) {slideIndex = x.length} ;
                    for (i = 0; i < x.length; i++) {
                        x[i].style.display = "none"; 
                    }
                    x[slideIndex-1].style.display = "block"; 
                }
        </script>

        """

display(HTML(html))

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
train, test,train_labels, test_labels = train_test_split(X, y, train_size=0.8, random_state=42)

## <a id="default"> Decision Tree Classifier</a> 
<br>
Using sklearn [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

In [None]:
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier(random_state=42) 
clf_dt = clf_dt.fit(train, train_labels)
y_pred=clf_dt.predict(test)
accuracy_score(test_labels, y_pred)
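For intuition about what the tree optimizes: `DecisionTreeClassifier` uses the Gini criterion by default, and a split is good when it lowers the weighted impurity of the two child nodes. A minimal plain-Python sketch on made-up labels (the function names are hypothetical, and this is not scikit-learn's implementation):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# Made-up node with 4 survivors (1) and 4 non-survivors (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left, right = [1, 1, 1, 0], [0, 0, 0, 1]

parent_gini = gini_impurity(parent)
after_split = split_impurity(left, right)
print(parent_gini, after_split)
```

The tree grows by repeatedly picking the feature and threshold whose split gives the lowest weighted impurity.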

## <a id="default">Random Forest Classifier</a>
<br>
Using sklearn [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(
    random_state=42,
) 
clf_rf = clf_rf.fit(train, train_labels)
y_pred=clf_rf.predict(test)
accuracy_score(test_labels, y_pred)
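A random forest combines many trees by letting them vote. The sketch below isolates that voting step with three hand-written prediction lists standing in for trees (all values are made up, and real trees would be trained on bootstrap samples). Each "tree" is wrong on two different passengers, so the majority is right every time:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model prediction lists by taking the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
tree_a = [0, 1, 1, 1, 0, 1, 0, 0]  # wrong on samples 0 and 1
tree_b = [1, 0, 0, 0, 0, 1, 0, 0]  # wrong on samples 2 and 3
tree_c = [1, 0, 1, 1, 1, 0, 0, 0]  # wrong on samples 4 and 5

ensemble = majority_vote([tree_a, tree_b, tree_c])
individual_scores = [accuracy(y_true, t) for t in (tree_a, tree_b, tree_c)]
ensemble_score = accuracy(y_true, ensemble)
print(individual_scores, ensemble_score)
```

The gain depends on the trees making different mistakes; if all trees were wrong on the same passengers, voting would not help.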

In [None]:
from sklearn.model_selection import cross_val_score

cv_value=5
scores=cross_val_score(clf_dt, X, y, cv=cv_value)
print("Decision Trees Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores=cross_val_score(clf_rf, X, y, cv=cv_value)  
print("Random Forest Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

With default parameters and cross-validation, Decision Trees and Random Forests perform almost the same. Let's tune the Random Forest to improve the result.

## <a id="param"> Parameter Selection </a>

In [None]:
clf=clf_rf 
print("Default Parameters of estimator are: \n",clf.get_params())

To find the combination of parameters that maximizes accuracy, we use **GridSearchCV** from the **sklearn** library. [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) does an exhaustive search over specified parameter values for an estimator. <br>
The candidate values are stored in **parameters**, the number of cross-validation folds is **3**, and the Random Forest classifier is passed as the estimator.  

In [None]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

parameters = {'n_estimators': [10, 100 , 200 , 400, 500],#default=100 since scikit-learn 0.22 (10 in earlier versions)
              #'max_features':['auto','log2'],
              'max_depth': [10, 20, 40, None],#default=none If None, 
              #then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
              #'oob_score':[True,False], #default=False
              'warm_start':[True,False], #default=False 
              #'min_samples_split': [2, 5, 10],#default=2
              'min_samples_leaf': [1, 2, 4], #default=1
              #'class_weight' : ['balanced', 'balanced_subsample',None],#default=None
             } 


p = GridSearchCV(clf , param_grid=parameters, cv=3)
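To get a feel for the cost of an exhaustive search, the grid can be enumerated by hand. The sketch below copies the active (uncommented) entries of `parameters` and counts the combinations with `itertools.product`; the variable names are hypothetical:

```python
from itertools import product

# Active entries of the grid defined above
param_grid = {
    "n_estimators": [10, 100, 200, 400, 500],
    "max_depth": [10, 20, 40, None],
    "warm_start": [True, False],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 3

# Every combination is one candidate; each candidate is fitted once per fold
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "fits during the search")
print("first candidate:", candidates[0])
```

Each extra parameter multiplies the number of candidates, which is why `RandomizedSearchCV` (imported above but not used) is a common choice for larger grids.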

In [None]:
import time 
start_time = time.time()
p.fit(X,y);
elapsed_time = time.time() - start_time
print("Time consumed to fit model: ",time.strftime("%H:%M:%S", time.gmtime(elapsed_time)))

In [None]:
print("Scores for all Parameter Combination: \n",p.cv_results_['mean_test_score'])
print("\nOptimal Parameter Combination: ",p.best_params_)
print("\nMaximum Accuracy achieved on Left-Out Data: ",p.best_score_)

To verify, let's pass the optimal parameters to the classifier and check the score.

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf_rf = RandomForestClassifier(
    random_state=42,
    n_estimators=p.best_params_['n_estimators'],
    warm_start=p.best_params_['warm_start'],
    #oob_score=p.best_params_['oob_score'],
    max_depth=p.best_params_['max_depth'],
    #min_samples_split=p.best_params_['min_samples_split'],
    min_samples_leaf=p.best_params_['min_samples_leaf'],
) 


In [None]:
clf_rf = clf_rf.fit(train, train_labels)
y_pred=clf_rf.predict(test)
accuracy_score(test_labels, y_pred)

## <a id="cv" >Cross Validation Scores with Optimal Parameters </a>

In [None]:
from sklearn.model_selection import cross_val_score
cv_value=5
scores=cross_val_score(clf_dt, X, y, cv=cv_value)
print("Decision Trees Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
scores=cross_val_score(clf_rf, X, y, cv=cv_value)  
print("Random Forest Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Compared with a single train/test split, cross-validation gives a better idea of how the model will perform on unseen data, because the model is evaluated on several train/validation splits instead of one.
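The splitting itself can be sketched in a few lines of plain Python. This is a simplified, unshuffled and unstratified version with a hypothetical function name; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which additionally keep the class ratio similar in every fold:

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for consecutive, unshuffled folds."""
    # The first n_samples % n_folds folds get one extra sample
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop

folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Every sample is used for validation exactly once, and the reported score is the mean over the folds.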

# <a id="submit">Submission</a>

In [None]:
clf_rf = clf_rf.fit(X, y)

In [None]:
y_pred=clf_rf.predict(test_data)

In [None]:
my_submission = pd.DataFrame({'PassengerId': test_data.index.values, 'Survived': y_pred})
my_submission.to_csv('submission.csv', index=False)
my_submission

### References
* [Figure-1](https://www.google.com/url?sa=i&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwi8m-Tu4ZvfAhWHoYMKHWKUC2kQjRx6BAgBEAU&url=https%3A%2F%2Fwww.datasciencecentral.com%2Fprofiles%2Fblogs%2Fwant-to-win-at-kaggle-pay-attention-to-your-ensembles&psig=AOvVaw0OrODMqk8EMXDOy9XCuDjx&ust=1544754667905270)
* [Figure-2](https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/)
* [Figure-3](https://www.google.com/url?sa=i&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwjIzfSj95vfAhVk1oMKHeU0AT8QjRx6BAgBEAU&url=https%3A%2F%2Fblog.bigml.com%2F2017%2F03%2F14%2Fintroduction-to-boosted-trees%2F&psig=AOvVaw1hk50OmA-NupNTQcNJVn6X&ust=1544760435140043)
* [Figure-4](https://en.wikipedia.org/wiki/Decision_tree_learning)
* [Figure-5](https://www.google.com/url?sa=i&source=images&cd=&cad=rja&uact=8&ved=2ahUKEwjX2MqqkZvfAhUk6oMKHZOTBZEQjRx6BAgBEAU&url=https%3A%2F%2Fwww.xoriant.com%2Fblog%2Fproduct-engineering%2Fdecision-trees-machine-learning-algorithm.html&psig=AOvVaw2qwG53Cu_Sjs3B4I7ONTCj&ust=1544733051689721)
* [Figure-6](http://www.google.com/url?sa=i&source=images&cd=&ved=2ahUKEwjcuMrx65vfAhVtooMKHSTyAVMQjRx6BAgBEAU&url=https%3A%2F%2Fmedium.com%2F%40williamkoehrsen%2Frandom-forest-simple-explanation-377895a60d2d&psig=AOvVaw12EyrSQ5rZRLicFr37YQzJ&ust=1544757335620058)

