# Titanic: Machine Learning from Disaster

### Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

#### Practice Skills
###### Binary classification 
---

### Overview
The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
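Feature engineering here means deriving new columns from the raw ones. The sketch below is a minimal illustration, assuming pandas and two derived features: a title pulled out of `Name`, and a family size computed from `SibSp` and `Parch`. The three rows are invented in the dataset's format, and the regular expression is only one possible way to isolate the title.

```python
import pandas as pd

# Hypothetical passengers written in the same format as the Name column
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Roe, Miss. Jane", "Poe, Mrs. Ann (Mary Smith)"],
    "SibSp": [1, 0, 1],
    "Parch": [0, 0, 2],
})

# Title: the word that follows a space and ends with a period
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Family size: siblings/spouses + parents/children + the passenger
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

print(toy[["Title", "FamilySize"]])
```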

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
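As a rough illustration of that baseline, the sketch below builds a gender-based prediction table from a tiny hand-made frame. The three rows are made up; only the `PassengerId`, `Sex` and `Survived` column names come from the competition files.

```python
import pandas as pd

# Toy stand-in for test.csv (hypothetical rows, real column names)
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Baseline rule: predict survival for female passengers only
baseline = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})

# Two-column CSV text in the layout expected for a submission
baseline_csv = baseline.to_csv(index=False)
print(baseline_csv)
```

Any model in this notebook has to beat this rule on the leaderboard to be worth its extra complexity.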

----

### Data Dictionary

<table>
<tbody>
<tr><th><b>Variable</b></th><th><b>Definition</b></th><th><b>Key</b></th></tr>
<tr>
<td>survival</td>
<td>Survival</td>
<td>0 = No, 1 = Yes</td>
</tr>
<tr>
<td>pclass</td>
<td>Ticket class</td>
<td>1 = 1st, 2 = 2nd, 3 = 3rd</td>
</tr>
<tr>
<td>sex</td>
<td>Sex</td>
<td></td>
</tr>
<tr>
<td>age</td>
<td>Age in years</td>
<td></td>
</tr>
<tr>
<td>sibsp</td>
<td># of siblings / spouses aboard the Titanic</td>
<td></td>
</tr>
<tr>
<td>parch</td>
<td># of parents / children aboard the Titanic</td>
<td></td>
</tr>
<tr>
<td>ticket</td>
<td>Ticket number</td>
<td></td>
</tr>
<tr>
<td>fare</td>
<td>Passenger fare</td>
<td></td>
</tr>
<tr>
<td>cabin</td>
<td>Cabin number</td>
<td></td>
</tr>
<tr>
<td>embarked</td>
<td>Port of Embarkation</td>
<td>C = Cherbourg, Q = Queenstown, S = Southampton</td>
</tr>
</tbody>
</table>

 


---
### Variable Notes
<p><b>pclass</b>: A proxy for socio-economic status (SES)<br> 1st = Upper<br> 2nd = Middle<br> 3rd = Lower<br><br> <b>age</b>: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5<br><br> <b>sibsp</b>: The dataset defines family relations in this way...<br> Sibling = brother, sister, stepbrother, stepsister<br> Spouse = husband, wife (mistresses and fiancés were ignored)<br><br> <b>parch</b>: The dataset defines family relations in this way...<br> Parent = mother, father<br> Child = daughter, son, stepdaughter, stepson<br> Some children travelled only with a nanny, therefore parch=0 for them.</p>
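The age convention in the notes can be checked directly. This is a small sketch on made-up ages, assuming that "estimated" means an age of at least 1 whose fractional part is exactly .5; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical ages, including an infant, two estimates and a missing value
ages = pd.Series([0.42, 22.0, 28.5, 40.5, 35.0, None], name="Age")

# Infants: fractional ages below 1
is_infant = ages < 1

# Estimated ages: at least 1 and ending in .5 (missing ages compare as False)
is_estimated = (ages >= 1) & ((ages % 1) == 0.5)

print(pd.DataFrame({"Age": ages, "infant": is_infant, "estimated": is_estimated}))
```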

---
### 1. Prepare Problem
#### a) Load libraries
#### b) Load dataset
##### For this problem, we load the training and test sets from the two files provided by Kaggle

In [None]:
import os
import math
import datetime

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer

from mlsettings.settings import load_app_config, get_datafolder_path
from mltools.mlcommon import load_data, print_dataset_info, split_dataset, auto_scatter_simple

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
 
%matplotlib inline
from numpy import set_printoptions
set_printoptions(precision=4)
sns.set_palette('husl')

In [None]:
load_app_config()
DIRECTORY="kaggle_titanic"
TRAIN_FILE ='train.csv'
TEST_FILE = 'test.csv'
RESPONSE = 'Survived'
input_path = get_datafolder_path()

In [None]:
def load_dataset(filename=TRAIN_FILE,response=RESPONSE):
    input_file = os.path.join(input_path, DIRECTORY, filename)
    input_dataset = load_data(input_file)
    print("Input file {0} loaded.".format(input_file))
    #print(input_dataset.head())
    
    try:
        continuous_vars = input_dataset.describe().columns.values.tolist()
        print("Continuous variables")
        print(continuous_vars)
    except ValueError:
        print("No continuous variables")
        continuous_vars = None
    
    try:
        categorical_vars = input_dataset.describe(include=["object"]).columns.values.tolist()
        print("Categorical variables")
        print(categorical_vars)
    except ValueError:
        print("No categorical variables")
        categorical_vars = None
    
    # columns whose name contains the response label form the target; the rest are features
    response_column =  [col for col in input_dataset.columns if response in col]
    feature_columns =  [col for col in input_dataset.columns if response not in col]
      
    return  input_dataset,feature_columns,response_column,continuous_vars,categorical_vars


In [None]:
train_dataset,feature_columns,response_column,continuous_vars,categorical_vars = load_dataset(filename=TRAIN_FILE,response=RESPONSE)
train_X = train_dataset[feature_columns]
train_y = train_dataset[response_column]


In [None]:
test_dataset,tfeature_columns,tresponse_column,tcontinuous_vars,tcategorical_vars  = load_dataset(filename=TEST_FILE,response=RESPONSE)
test_X =[]
test_y=[]
if tfeature_columns:
    test_X = test_dataset[tfeature_columns]

# test.csv has no 'Survived' column, so tresponse_column is empty and test_y stays []
if tresponse_column:
    test_y = test_dataset[tresponse_column]
   

In [None]:
'''
from collections import Counter
def detect_outliers(dataset,noutliers,columns):
    outlier_indices = []
    for column in columns:
        # 1st quartile (25%)
        Q1 = np.percentile(dataset[column], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(dataset[column],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = dataset[(dataset[column] < Q1 - outlier_step) | (dataset[column] > Q3 + outlier_step )].index
        outlier_indices.extend(outlier_list_col)
         
    outlier_indices = Counter(outlier_indices)
     
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > noutliers )
    return multiple_outliers 
        
Outliers_to_drop = detect_outliers(train_dataset,2,["Age","SibSp","Parch","Fare"])
print(train_dataset.loc[Outliers_to_drop])
train_dataset = train_dataset.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)
train_X = train_dataset[feature_columns]
train_y = train_dataset[response_column]
'''

In [None]:
print(train_X.info())
print(test_X.info())

### 2. Summarize Data
#### a) Descriptive statistics
#### b) Data visualizations


In [None]:
def display_data_descriptives(input_dataset,X,feature_columns,y,response_column):
    print("<{0} {1} {0}>".format("="*40,"info"))
    print(input_dataset.info())
    print("<{0} {1} {0}>".format("="*40,"feature columns"))
    print(feature_columns)
    print("<{0} {1} {0}>".format("="*40,"data header"))
    print(X.head().to_string())
    print("<{0} {1} {0}>".format("="*40,"response"))
    print(response_column)
    print("<{0} {1} {0}>".format("="*40,"Descriptive Statistics -X"))
    print(X.describe())
    print("<{0} {1} {0}>".format("="*40,"Descriptive Statistics -y"))
    print(y.describe())
    print("<{0} {1} {0}>".format("="*40,"value_count -y"))
 
    print(y.groupby(response_column)[response_column].count())
    ##print("<{0} {1} {0}>".format("="*40,"Correlation"))
    ##print(input_dataset.corr(method='pearson'))

In [None]:
pd.set_option('display.width', 120)
pd.set_option('display.precision', 4)
display_data_descriptives(train_dataset,train_X,feature_columns,train_y,response_column)
#display_data_descriptives(test_dataset,tfeature_columns,tresponse_column,tcontinuous_vars,tcategorical_vars)
print(test_dataset.info())

In [None]:
categorical = ['Sex', 'Embarked','SibSp','Parch','Pclass']
def bar_plots(train_dataset,categorical):
    fig = plt.figure(figsize=(16,12))
    size =len(categorical)
     
    for i in range(size):
        #counts=train_dataset.groupby(categorical[i])['Survived'].value_counts()
        #print("Dataset group by {0} ".format(categorical[i]))
        #print(counts)
        ax = fig.add_subplot(3, 2, i+1)
        sns.barplot(x=categorical[i], y="Survived", data=train_dataset,ax=ax,errwidth =0)
        sns.despine()
    plt.tight_layout()

In [None]:
sns.set(style="white", color_codes=True)

flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
pkmn_type_colors = ['#78C850',  # Grass
                    '#F08030',  # Fire
                    '#6890F0',  # Water
                    '#A8B820',  # Bug
                    '#A8A878',  # Normal
                    '#A040A0',  # Poison
                    '#F8D030',  # Electric
                    '#E0C068',  # Ground
                    '#EE99AC',  # Fairy
                    '#C03028',  # Fighting
                    '#F85888',  # Psychic
                    '#B8A038',  # Rock
                    '#705898',  # Ghost
                    '#98D8D8',  # Ice
                    '#7038F8',  # Dragon
                   ]

bar_plots(train_dataset,categorical)

### Inferences from bar plots
##### Females have a higher survival rate than males in all passenger categories
##### The survival rate is lowest for passenger class 3
##### Passengers who embarked at C (Cherbourg) have a higher survival rate than those who embarked elsewhere
##### Passengers with siblings or spouses aboard survived at a higher rate than those with none
##### Passengers travelling alone had a lower chance of survival than those travelling with family
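The bar heights behind these inferences are group means of the 0/1 `Survived` column. A minimal sketch on a made-up six-row frame (the rows are hypothetical; the column names match the dataset) shows how the same rates can be read as numbers rather than from a plot:

```python
import pandas as pd

# Hypothetical passengers, real column names
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 2],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the survival rate of each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same two `groupby` calls on `train_dataset` gives the exact rates that the bar plots display.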



In [None]:
#g  = sns.factorplot(x="Pclass", hue="Sex", col="Survived",data=train_dataset, kind="count",size=5, aspect=.7,palette=flatui);
#g1 = sns.factorplot(x="Embarked", hue="Sex", col="Survived",data=train_dataset, kind="count",size=5, aspect=.7,palette=flatui);   
#g2 = sns.factorplot(x="SibSp", col="Survived",data=train_dataset, kind="count",size=5, aspect=.7,palette=sns.color_palette("husl",2)); 
#g3 = sns.factorplot(x="Parch", col="Survived",data=train_dataset, kind="count",size=5, aspect=.7,palette=sns.color_palette("husl",2)); 

In [None]:
g = sns.factorplot(x='Sex',y='Survived',hue='Pclass',size=4, aspect=1,palette=flatui ,data =train_dataset)

##### Women from 1st and 2nd class have a survival rate above 90%
##### Men from 2nd and 3rd class have only around a 15% chance of survival

In [None]:
g1 =sns.factorplot(x='Pclass', y='Survived', hue='Sex', col='Embarked', data=train_dataset)

##### Males from Pclass 1 only have slightly higher survival chance than Pclass 2 and 3

---  
### 3. Prepare Data
#### a) Data Cleaning
#### b) Feature Selection
#### c) Data Transforms

In [None]:
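# Imputer is not used in this notebook (missing values are filled with pandas below).
# It was removed in scikit-learn 0.22; sklearn.impute.SimpleImputer is its replacement.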


In [None]:
full_dataset = [train_dataset,test_dataset]
 
## identity the null data sets 
for dataset in full_dataset:
    print("<{0} {1} {0}>".format("="*40,"Columns having null values"))
    check_null = dataset.isnull().sum()[dataset.isnull().sum()>0] 
    print(check_null)
     

In [None]:
first_char = lambda x : x[0]
transform_cabin = lambda x : 1 if x!='X' else 0
for dataset in full_dataset:
    dataset['Cabin'] = dataset['Cabin'].fillna('X')
    dataset['Cabin']= dataset['Cabin'].map(first_char)
    #dataset['Cabin']= dataset['Cabin'].map(transform_cabin)

g = sns.factorplot("Survived", col="Cabin" ,col_wrap=4 ,data=full_dataset[0],kind="count", size=3.5, aspect=.8)


In [None]:
for dataset in full_dataset:
    dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.',expand=False)
    
print(train_dataset.groupby('Title')['Survived'].value_counts())
print(test_dataset.groupby('Title')['Name'].count()) 
for dataset in full_dataset:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
full_dataset[0][['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

In [None]:
#ax  =sns.violinplot(x="Embarked", y="Age", hue="Survived", data=train_dataset, split=True)
fig = plt.figure(figsize=(8,4)) 
ax = sns.boxplot(y="Age",x='Survived', data=train_dataset, palette="Set2")
ax.set_xticklabels(ax.get_xticklabels())
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.tight_layout()
plt.show()

fig = plt.figure(figsize=(6,6)) 
ax = fig.add_subplot(1, 1, 1)
sns.barplot(x='Survived' , y='Age' ,hue ='Title',data=train_dataset,ax=ax,errwidth =0)
ax.set_xticklabels(ax.get_xticklabels(),ha='right')
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.tight_layout()
plt.show()



In [None]:
train_mean_age =full_dataset[0][['Title', 'Age']].groupby(['Title'], as_index=False).mean().set_index('Title') 
 

In [None]:
 
test_mean_age =full_dataset[1][['Title', 'Age']].groupby(['Title'], as_index=False).mean().set_index('Title') 
test_mean_age

In [None]:
full_dataset[0][full_dataset[0]["Age"].isnull()].groupby(['Title'], as_index=False)['Age'].count()

In [None]:
full_dataset[1][full_dataset[1]["Age"].isnull()].groupby(['Title'], as_index=False)['Age'].count()

#### Transform the Sex label into a numerical categorical value and assign the mean age per title to missing ages
##### Fill in missing Embarked values
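The title-based age imputation named above can also be written without an explicit loop. The sketch below is one possible vectorized form on made-up rows, assuming the same rule as the notebook: a missing age takes the mean age of its title, except for the grouped `Other` titles, which get the sentinel -1. The `AgeFilled` column name is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one gap per title, plus one known 'Other' age
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Other", "Other"],
    "Age": [30.0, 40.0, np.nan, 20.0, np.nan, 50.0, np.nan],
})

# Mean age per title, aligned row by row with the original frame
title_mean = toy.groupby("Title")["Age"].transform("mean")

# Fill value: the title mean, or -1 for rows whose title is 'Other'
fill_value = title_mean.where(toy["Title"] != "Other", -1)

# Keep known ages and use the fill value only where the age is missing
toy["AgeFilled"] = toy["Age"].where(toy["Age"].notnull(), fill_value)

print(toy)
```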

In [None]:
age_null_index = list(full_dataset[0].index[full_dataset[0]["Age"].isnull()])
 
for each_index in age_null_index:
    title = full_dataset[0].loc[each_index, 'Title']
    if title =='Other':
        # rare titles grouped as 'Other' get the sentinel value -1
        full_dataset[0].loc[each_index, 'Age'] = -1
    else:
        # assign through .loc on the DataFrame itself to avoid chained assignment
        full_dataset[0].loc[each_index, 'Age'] = train_mean_age.loc[title, 'Age']
     

In [None]:
full_dataset[0][full_dataset[0]["Age"].isnull()]

In [None]:
tage_null_index = list(full_dataset[1].index[full_dataset[1]["Age"].isnull()])
 
for each_index in tage_null_index:
    title = full_dataset[1].loc[each_index, 'Title']
    if title =='Other':
        # rare titles grouped as 'Other' get the sentinel value -1
        full_dataset[1].loc[each_index, 'Age'] = -1
    else:
        # assign through .loc on the DataFrame itself to avoid chained assignment
        full_dataset[1].loc[each_index, 'Age'] = test_mean_age.loc[title, 'Age']

In [None]:
full_dataset[1][full_dataset[1]["Age"].isnull()]

In [None]:
#sex_mapping= {'male':0,'female':1}
for dataset in full_dataset:
    #dataset['Sex'] =dataset['Sex'].map(sex_mapping)
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    
    '''
    median_age = math.ceil(dataset["Age"].median())
    #dataset['Age'].fillna(median_age, inplace=True)
    
    age_null_index =list(dataset["Age"][dataset["Age"].isnull()].index)
    print(len(age_null_index))
    for each_index in age_null_index:
        median_age = math.ceil(dataset["Age"].median())
        pred_age = dataset["Age"][((dataset['SibSp'] == dataset.iloc[each_index]["SibSp"]) &
                                   (dataset['Parch'] == dataset.iloc[each_index]["Parch"]) &
                                   (dataset['Pclass'] == dataset.iloc[each_index]["Pclass"]))].median()
        if not np.isnan(pred_age) :
            dataset['Age'].iloc[each_index] = pred_age
        else :
            dataset['Age'].iloc[each_index] = pred_age
    '''
    
for dataset in full_dataset:
    print("<{0} {1} {0}>".format("="*40,"Columns having null values"))
    check_null = dataset.isnull().sum()[dataset.isnull().sum()>0] 
    print(check_null)

#test_dataset[test_dataset["Age"].isnull()]

####  Transform Fare 
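Fare is transformed further down by cutting it into quartile bands and replacing each value with its band number. A minimal sketch of that idea on made-up fares, assuming `pd.qcut` with `labels=False` to get the integer codes directly:

```python
import pandas as pd

# Hypothetical fares, sorted for readability
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 14.5, 26.0, 31.0, 71.3])

# Four equal-frequency bands; labels=False returns the band index 0..3
fare_code = pd.qcut(fares, 4, labels=False)

print(pd.DataFrame({"Fare": fares, "FareCode": fare_code}))
```

When the bands are learned on the training set, the same edges should be reused for the test set, which is why the notebook hard-codes the thresholds after inspecting `FareBand`.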

In [None]:
#test_dataset['Age'].fillna(median_age, inplace=True)
#full_dataset = [train_dataset,test_dataset]
print(full_dataset[1][full_dataset[1]["Fare"].isnull()])

print(full_dataset[1][  (full_dataset[1]['Pclass'] ==3  ) & 
                  (full_dataset[1]['Sex'] == 'male'  ) &
                  (full_dataset[1]['Age'] >= 50  )
               ])
# assign the fare of a comparable passenger to the row with the missing fare
full_dataset[1].loc[full_dataset[1]['Fare'].isnull(), 'Fare'] = 14.5
print(full_dataset[1][  (full_dataset[1]['Pclass'] ==3  ) & 
                  (full_dataset[1]['Sex'] == 'male'  ) &
                  (full_dataset[1]['Age'] >= 50  )
               ])

In [None]:
from sklearn import feature_extraction
def one_hot_dataframe(data,columns,replace=False):
    fe_vec= feature_extraction.DictVectorizer()
    make_dict = lambda row :dict((column,row[column]) for column in  columns)
    vector_data=pd.DataFrame(fe_vec.fit_transform( data[columns].apply(make_dict, axis=1)).toarray())
    vector_data.columns = fe_vec.get_feature_names()
    vector_data.index= data.index
    if replace:
        data = data.drop(columns, axis=1)
        data = data.join(vector_data)
    return data,vector_data


 

In [None]:
train_dataset,train_dataset_n = one_hot_dataframe(train_dataset, ['Pclass','Embarked', 'Sex','Title','Cabin'], replace=True)
test_dataset,test_dataset_n = one_hot_dataframe(test_dataset, ['Pclass','Embarked', 'Sex','Title','Cabin'], replace=True)

In [None]:
full_dataset = [train_dataset,test_dataset]
train_dataset['AgeBand'] = pd.cut(train_dataset[train_dataset['Age']>-1]['Age']  ,5)
train_dataset['AgeBand'] 

In [None]:
for dataset in full_dataset:
    dataset.loc[ dataset['Age'] < 0, 'Age'] = -1
    dataset.loc[ (dataset['Age'] > 0 ) & (dataset['Age'] <= 16.336), 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16.336) & (dataset['Age'] <= 32.252), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32.252) & (dataset['Age'] <= 48.168), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48.168) & (dataset['Age'] <= 64.084), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64.084, 'Age'] = 4

In [None]:
train_dataset['FareBand'] = pd.qcut(train_dataset['Fare'], 4)
print (train_dataset[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean())

for dataset in full_dataset:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

In [None]:
for dataset in full_dataset:
    dataset['FamilySize'] = dataset['SibSp'] +  dataset['Parch'] + 1
    #dataset['IsAlone'] = 0
    #dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
    dataset['Single'] = dataset['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    dataset['SmallF'] = dataset['FamilySize'].map(lambda s: 1 if  s == 2  else 0)
    dataset['MedF'] = dataset['FamilySize'].map(lambda s: 1 if 3 <= s <= 4 else 0)
    dataset['LargeF'] = dataset['FamilySize'].map(lambda s: 1 if s >= 5 else 0)
    


#### Feature Selection

In [None]:
features_drop = ['Name', 'SibSp', 'Parch','FamilySize','Ticket']
train_dataset = train_dataset.drop(features_drop, axis=1)
test_dataset = test_dataset.drop(features_drop, axis=1)
train_dataset = train_dataset.drop(['PassengerId', 'AgeBand', 'FareBand'], axis=1)




In [None]:
### we will drop Cabin T
X_train = train_dataset.drop(['Survived'], axis=1)
y_train = train_dataset['Survived']
X_test = test_dataset.drop("PassengerId", axis=1).copy()

In [None]:
print(X_train.columns.values)
print(X_test.columns.values)
# keep the columns present in both sets, in the training-set column order
all_features = [col for col in X_train.columns.values if col in set(X_test.columns.values)]
print(all_features)

X_train.shape, y_train.shape, X_test.shape
 

In [None]:
X_train =X_train[all_features]
X_test = X_test[all_features]
print(X_train.shape, y_train.shape, X_test.shape)
print(all_features)

### 4. Evaluate Algorithms
####  a) Split-out validation dataset
####  b) Test options and evaluation metric
####  c) Spot Check Algorithms
####  d) Compare Algorithms
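The four steps listed above can be sketched end to end on synthetic data. This is an illustrative outline only: the data comes from `make_classification` rather than the Titanic files, the two candidate models are arbitrary picks, and `StratifiedKFold` is swapped in for plain `KFold` so that each fold keeps the class balance.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for the prepared features)
X, y = make_classification(n_samples=300, n_features=8, random_state=7)

# a) split out a validation set that is not touched during model comparison
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X, y, test_size=0.33, random_state=7, stratify=y)

# b) test options: 5 shuffled, stratified folds scored by accuracy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# c) spot check a few candidate algorithms
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=7),
}

# d) compare mean and spread of the cross-validated accuracy
cv_scores = {}
for name, model in candidates.items():
    cv_scores[name] = cross_val_score(model, X_fit, y_fit, cv=cv, scoring="accuracy")
    print(name, cv_scores[name].mean(), cv_scores[name].std())
```

The held-out split (`X_hold`, `y_hold`) is reserved for a final check of whichever model wins the comparison.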

In [None]:
X_train.head()

In [None]:
from sklearn.model_selection  import  train_test_split
from sklearn.linear_model  import LogisticRegression
test_size = 0.33
seed = 7

X_trainmodel, X_val, y_trainmodel, y_val = train_test_split(X_train, y_train, test_size=test_size,random_state=seed)

In [None]:
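logrmodel = LogisticRegression()
logrmodel.fit(X_trainmodel, y_trainmodel.values.ravel())
# score() returns mean accuracy; it is measured here on the training split, not on held-out data
result = logrmodel.score(X_trainmodel, y_trainmodel.values)
print ("Training accuracy (%): {0:.3f}".format(result*100.0))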

In [None]:
from sklearn.svm import LinearSVC
lin_svc = LinearSVC()
lin_svc.fit(X_trainmodel, y_trainmodel.values.ravel())
#y_pred_linear_svc = lin_svc.predict(X_test)
acc_linear_svc = round(lin_svc.score(X_trainmodel, y_trainmodel) * 100, 2)
print (acc_linear_svc)

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
def train_and_evaluate(model, X_train, y_train, t_splits =10,seed=7):
    model.fit(X_train, y_train)
    # for classifiers, score() returns mean accuracy (not the coefficient of determination)
    print ("Accuracy on training set:",model.score(X_train, y_train))
    # create a shuffled k-fold cross-validation iterator with t_splits folds
    cv = KFold(n_splits= t_splits,shuffle=True, random_state=seed)
    scores = cross_val_score(model, X_train, y_train, cv=cv)
    print(scores)
    print ("Average accuracy using {0}-fold cross-validation: {1}".format(t_splits,np.mean(scores)))

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

In [None]:
models = []
models.append(('LR', LogisticRegression()))
###models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('DT',DecisionTreeClassifier()))
models.append(('KNN',KNeighborsClassifier()))
models.append(('NB',GaussianNB()))
models.append(('SVC',SVC()))
models.append(('RFC',RandomForestClassifier(n_estimators=300,random_state=0,criterion='gini')))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    # random_state only takes effect when shuffle=True
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X_trainmodel, y_trainmodel.values.ravel(), cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "Accuracy of {0} is {1} with standard deviation {2}".format(name, cv_results.mean(), cv_results.std())
    print(msg)
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
from sklearn import metrics
def measure_performance(X, y, clf, show_accuracy=True,show_classification_report=True,
                        show_confusion_matrix=True, show_r2_score=False):
    y_pred = clf.predict(X) 
    if show_accuracy:
        print ("Accuracy:{0:.3f}".format( metrics.accuracy_score(y, y_pred)) )
    if show_classification_report:
        print ("Classification report")
        print (metrics.classification_report(y, y_pred))
    if show_confusion_matrix:
        print("Confusion matrix") 
        print(metrics.confusion_matrix(y, y_pred),)
    if show_r2_score:
        print ("Coefficient of determination:{0:.3f}"
               .format( metrics.r2_score(y, y_pred)))
    return y_pred

In [None]:
rfc =RandomForestClassifier(n_estimators=300,random_state=0 )
print(rfc)
rfc.fit(X_trainmodel, y_trainmodel.values.ravel())
y_pred=measure_performance(X_val,y_val,rfc, show_accuracy=False, 
                    show_classification_report=True,
                    show_confusion_matrix=True, show_r2_score=False)

In [None]:
'''
svc=SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=True, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
svc.fit(X_trscaled, y_trainmodel.values.ravel())
y_pred=measure_performance(X_valscaled,y_val,svc, show_accuracy=False, 
                    show_classification_report=True,
                    show_confusion_matrix=True, show_r2_score=False)
                    '''

In [None]:
# rfc= RandomForestClassifier()

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
## Search grid for optimal parameters
'''
rf_param_grid = {"max_depth": [5,10],
              "max_features": [3, 5, 10],
              "min_samples_split": [10,20, 40],
              "min_samples_leaf": [1, 3, 5],
              "bootstrap": [False],
              "n_estimators" :[100,500],
              "criterion": ["gini"]}
'''

rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,500],
              "criterion": ["gini"]}

grid_search = GridSearchCV(rfc,param_grid = rf_param_grid, cv=10, scoring="accuracy", verbose = 1,n_jobs =-1)

grid_search.fit(X_trainmodel,y_trainmodel.values.ravel())

rfc_best = grid_search.best_estimator_

 #Best score
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))


In [None]:
rfc_best_params =grid_search.best_params_
#rfc =RandomForestClassifier(**rfc_best_params)
# These settings are fixed by hand rather than taken from rfc_best_params.
# min_impurity_split is omitted because scikit-learn 1.0 and later no longer accept it;
# arguments not listed keep their defaults.
rfc =RandomForestClassifier(bootstrap=False, criterion='gini',
            max_depth=None, max_features=None,
            min_samples_leaf=3, min_samples_split=2,
            min_weight_fraction_leaf=0.00001,
            n_estimators=2000, n_jobs=1, random_state=0)
print(rfc)
rfc.fit(X_trainmodel, y_trainmodel.values.ravel())
y_pred=measure_performance(X_val,y_val,rfc, show_accuracy=False, 
                    show_classification_report=True,
                    show_confusion_matrix=True, show_r2_score=False)

#print(X_trainmodel.info())
#print(X_test.info())
#X_test.to_csv('tranform_test.csv', index=False)
y_pred_result=rfc.predict(X_test)

In [None]:
def plot_feature_importances(model,X_trainmodel):
    features = pd.DataFrame()
    features['feature'] = X_trainmodel.columns.values
    features['importance'] = model.feature_importances_
    features.sort_values(by=['importance'], ascending=True, inplace=True)
    features.set_index('feature', inplace=True)
    fig = plt.figure(figsize=(8,6)) 
    ax = fig.add_subplot(1, 1, 1)
    features.plot(kind='barh',ax=ax)
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)
    plt.tight_layout()
    plt.show()
    print(features['importance'].nlargest(18).index)
plot_feature_importances(rfc,X_trainmodel)

In [None]:
best_features =['Sex=female', 'Title=Mr', 'Sex=male', 'Title=Miss', 'Cabin=X', 'Pclass', 'Title=Mrs', 'Fare', 'MedF', 'LargeF',
 'Age', 'Embarked=S', 'Single', 'Title=Master', 'Embarked=C', 'Cabin=E', 'Cabin=B', 'Title=Other']
 

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit the scaler on the training split only, then apply the same scaling to the validation split
X_trscaled = scaler.fit_transform(X_trainmodel)
X_valscaled = scaler.transform(X_val)
svc=SVC()
svc.fit(X_trscaled, y_trainmodel.values.ravel())
y_pred=measure_performance(X_valscaled,y_val,svc, show_accuracy=False, 
                    show_classification_report=True,
                    show_confusion_matrix=True, show_r2_score=False)

In [None]:


# 'poly', 'rbf', 'sigmoid'
param_grid = [{"kernel" : ['rbf'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100],
               'gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
              {'kernel': ['linear'],
               'C': [0.001, 0.01, 0.1, 1, 10, 100]}]
print("List of grids:\n{}".format(param_grid))
grid_search = GridSearchCV(SVC(probability=True), param_grid, cv=5,n_jobs =-1)
grid_search.fit(X_trscaled,y_trainmodel.values.ravel())
print("Validation set score: {:.2f}".format(grid_search.score(X_valscaled, y_val)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))


In [None]:
svc_best=grid_search.best_params_
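# scale the test set with statistics learned from the training split, not from the test set itself
X_testscaled = scaler.fit(X_trainmodel).transform(X_test)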
svc=SVC(**svc_best)
svc.fit(X_trscaled, y_trainmodel.values.ravel())
y_pred=measure_performance(X_valscaled,y_val,svc, show_accuracy=False, 
                    show_classification_report=True,
                    show_confusion_matrix=True, show_r2_score=False)
y_pred_result=svc.predict(X_testscaled)

In [None]:
from sklearn.ensemble import VotingClassifier
eclf = VotingClassifier(estimators=[('svc', svc), ('rf', rfc)], voting='hard')
eclf = eclf.fit(X_trscaled, y_trainmodel.values.ravel())
y_pred=measure_performance(X_valscaled,y_val,eclf, show_accuracy=False, 
                    show_classification_report=True,
                    show_confusion_matrix=True, show_r2_score=False)
y_pred_result=eclf.predict(X_testscaled)

In [None]:
submission = pd.DataFrame({
        "PassengerId": test_X["PassengerId"],
        "Survived": y_pred_result
    })
submission.to_csv('submission_new_0209.csv', index=False)