# Kaggle Competition

# Titanic: Machine Learning from Disaster

# The challenge

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc).

# Import basic libraries

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
import time
import warnings

from collections import Counter


%matplotlib inline
random.seed(0)
np.random.seed(0) #np.random is the generator used by the imputation functions below
warnings.filterwarnings('ignore')


# Import data

In [None]:
try: 
    df_train = pd.read_csv('D:\\Kaggle\\Titanic\\train.csv')
    print('File 1 loading - Success!')
    df_test = pd.read_csv('D:\\Kaggle\\Titanic\\test.csv')
    print('File 2 loading - Success!')
except FileNotFoundError as err: #catch only the expected failure instead of using a bare except
    print('File loading - Failed!', err)
    

# Exploratory data analysis

In [None]:
df_train.head()


In [None]:
df_test.head()


In [None]:
df_train.info()


In [None]:
df_test.info()


## Features

* PassengerId: Unique passenger identification.
* Survived: Whether a passenger survived or not; 1 if survived and 0 if not.
* Pclass: Ticket class; 1 = 1st, 2 = 2nd, 3 = 3rd.
* Name: Passenger name.
* Sex: Passenger gender. 
* Age: Passenger age in years.
* SibSp: Number of siblings/spouses aboard the Titanic.
* Parch: Number of children/parents aboard the Titanic.
* Ticket: Ticket number.
* Fare: Passenger fare.
* Cabin: Cabin number. 
* Embarked: Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton.



In [None]:
df_train.describe()


In [None]:
df_test.describe()


Initially, we can see that:

* Age feature has missing values;
* Some features show signs of non-normality.


In [None]:
df_train.groupby("Survived").count()


Comments:

* Two features have many missing values ('Age' and 'Cabin'). The 'Cabin' feature itself will be discarded because of its large number of NaNs, but the deck letter it contains can still be useful, so we will extract it first. The 'Age' NaNs are studied in the section "Missing values".

* The features 'PassengerId' and 'Name' will be discarded because they only identify individuals. Before discarding 'Name', however, we will create a new feature from each passenger's title (Mr, Mrs, Miss, ...), which can carry important information.

* NaNs are also present in the test data; the treatment applied to the training data will also be applied to the test data.


## A look at our features

In [None]:
# histograms of the features, split by the variable of interest ('Survived').

for i in ['Pclass','Sex','Age','SibSp','Parch','Fare','Embarked']: 
    plot = sns.FacetGrid(df_train, col='Survived')
    plot.map(plt.hist, i, bins=20)
    

Comments:

* Some variables are imbalanced: 'Sex' has more men than women, and most passengers embarked at Southampton ('S').

* We can think of the first questions, such as:
    1. Does gender matter? Does being a woman increase my chances of surviving?
    2. Does ticket class matter?
    3. Does boarding at Southampton increase my chances of survival? ('Embarked' = 'S')
    
    
In the end, the goal is to know who survives or who doesn't.
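A quick way to probe questions like these is to compare the survival rate across the levels of a feature. The sketch below uses a small made-up table in place of `df_train`; the helper name `survival_rate_by` is only illustrative and is not part of the notebook.

```python
import pandas as pd

# Toy data standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Pclass': [3, 1, 3, 2, 3, 1],
    'Survived': [0, 1, 1, 0, 0, 1],
})

def survival_rate_by(data, column):
    """Return the mean of 'Survived' for each level of `column`."""
    return data.groupby(column)['Survived'].mean()

rate_by_sex = survival_rate_by(toy, 'Sex')
rate_by_class = survival_rate_by(toy, 'Pclass')

print(rate_by_sex)
print(rate_by_class)
```

On the real data the same call would be `survival_rate_by(df_train, 'Sex')`, and so on for 'Pclass' and 'Embarked'.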


# Data pre-processing

Step by step:

1. Combine features;
2. Missing values;
3. Transform features;
4. Drop features;
5. Scaling numerical data;
6. Dummies.
  

## Import data pre-processing functions

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split


## Combine Features 

Comments:

Two features, 'SibSp' and 'Parch', carry related information and can be combined into two new features, 'Family_size' and 'Travelled_alone'.

* 'Family_size': the sum of 'SibSp' and 'Parch' plus 1 (the passenger), i.e. the size of the family group aboard.

* 'Travelled_alone': whether the individual travelled alone (1) or not (0), i.e. whether both 'SibSp' and 'Parch' are 0.

Given the new feature 'Family_size', we can combine it with 'Fare' to obtain the fare per member of the individual's family group.

* 'Fare_per_family': 'Fare' divided by 'Family_size'.
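As a minimal sketch of the three combined features, here they are on a made-up three-row table (the values are illustrative, not taken from the competition data):

```python
import pandas as pd

# Toy passengers: one couple, one solo traveller, one member of a family of six
toy = pd.DataFrame({
    'SibSp': [1, 0, 3],
    'Parch': [0, 0, 2],
    'Fare': [20.0, 8.0, 60.0],
})

# Family size counts the passenger plus siblings/spouses and parents/children
toy['Family_size'] = toy['SibSp'] + toy['Parch'] + 1

# A passenger travelled alone when the family group has a single member
toy['Travelled_alone'] = (toy['Family_size'] == 1).astype(int)

# Fare divided by the number of people in the family group
toy['Fare_per_family'] = toy['Fare'] / toy['Family_size']

print(toy)
```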



### Family Size

In [None]:
df_train['Family_size'] = df_train['SibSp'] + df_train['Parch'] + 1
df_test['Family_size'] = df_test['SibSp'] + df_test['Parch'] + 1


In [None]:
df_train


In [None]:
df_test


### Fare per family

In [None]:
df_train['Fare_per_family'] = df_train['Fare'] / df_train['Family_size']
df_test['Fare_per_family'] = df_test['Fare'] / df_test['Family_size']


In [None]:
df_train.info()

### Travelled alone    

In [None]:
# Function to create 'Travelled_alone' feature

def feature_travelled_alone(data):
    
    alone = (data['SibSp'] == 0) & (data['Parch'] == 0) #condition to have travelled alone
    data['Travelled_alone'] = alone.astype(int) #1 = alone, 0 = with family; an integer dtype keeps the feature numeric
    
    return data


In [None]:
df_train = feature_travelled_alone(df_train)
df_test = feature_travelled_alone(df_test)


In [None]:
df_train


In [None]:
df_test


## Missing values

In [None]:
df_train.isna().sum()


In [None]:
df_test.isna().sum()


Comments:

* As stated earlier, we will investigate the 'Age' and 'Cabin' NaNs.

Regarding 'Cabin': not every passenger had one, so a NaN possibly just means the individual had no cabin. We can therefore create a feature indicating whether the passenger has a cabin and/or a feature for the corresponding deck (the letter in the cabin identifier, such as 'C' or 'A').

The first question about 'Age':

* Do the 'Age' NaNs follow a pattern or are they random?


In [None]:
#pd.set_option('display.max_rows', 200)


In [None]:
df_train[ (df_train['Age'].isnull() == True) ] #data with NaNs in 'Age', looking for some pattern.


In [None]:
df_train[ (df_train['Age'].isnull() == True) ].describe()


In [None]:
df_train.describe()


In [None]:
for i in ['Pclass','Sex','Embarked','Travelled_alone']: 
    plot = sns.FacetGrid(df_train[ (df_train['Age'].isnull() == True) ], col='Survived')
    plot.map(plt.hist, i, bins=20)
    

Comments:

* The outputs above suggest that the 'Age' NaNs are not random, so we cannot simply drop the rows with a missing 'Age'.

* Possible group: 'Pclass' = 3, 'Embarked' = 'Q' and 'Travelled_alone' = 1.


In [None]:
df_groupy_NaN = df_train[ (df_train['Pclass'] == 3) ]
df_groupy_NaN.describe()


In [None]:
df_groupy_NaN = df_train[ (df_train['Embarked'] == 'Q') ]
df_groupy_NaN.describe()


In [None]:
df_groupy_NaN = df_train[ (df_train['Travelled_alone'] == 1) ]
df_groupy_NaN.describe()


In [None]:
df_groupy_NaN = df_train[ (df_train['Pclass'] == 3) ]
df_groupy_NaN = df_groupy_NaN[ (df_groupy_NaN['Travelled_alone'] == 1) ]
#df_groupy_NaN = df_groupy_NaN[ (df_groupy_NaN['Embarked'] == 'Q') ]
df_groupy_NaN.describe()


In [None]:
df_groupy_NaN = df_train[ (df_train['Pclass'] == 3) ]
#df_groupy_NaN = df_groupy_NaN[ (df_groupy_NaN['Travelled_alone'] == 1) ]
df_groupy_NaN = df_groupy_NaN[ (df_groupy_NaN['Embarked'] == 'Q') ]
df_groupy_NaN.describe()


In [None]:
df_groupy_NaN = df_train[ (df_train['Pclass'] == 3) ]
df_groupy_NaN = df_groupy_NaN[ (df_groupy_NaN['Travelled_alone'] == 1) ]
df_groupy_NaN = df_groupy_NaN[ (df_groupy_NaN['Embarked'] == 'Q') ]
df_groupy_NaN.describe()


Comments:

* The group defined by all three conditions accounts for 20% of the 'Age' NaNs in the training data.

* Filtering only by 'Pclass' and 'Travelled_alone' covers 56% of the 'Age' NaNs in the training data. This will be our group.

* NaNs of passengers in this group are imputed from the group's statistics. All other NaNs are imputed from the complete sample.
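Shares like the ones quoted above can be computed directly instead of being read off `describe()` outputs. The sketch below does so on a made-up table; the helper `nan_share_in_group` is a hypothetical name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, np.nan, 35.0, np.nan, np.nan],
    'Pclass': [3, 3, 3, 1, 1, 3],
    'Travelled_alone': [1, 1, 1, 0, 1, 0],
})

def nan_share_in_group(data, mask, column='Age'):
    """Fraction of the NaNs in `column` that fall on the rows selected by `mask`."""
    missing = data[column].isnull()
    return (missing & mask).sum() / missing.sum()

# Group under study: third class passengers travelling alone
group = (toy['Pclass'] == 3) & (toy['Travelled_alone'] == 1)
share = nan_share_in_group(toy, group)

print(share)
```

With the real data, the mask would be built from `df_train` in the same way.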


In [None]:
# Function to impute data in 'Age' considering the specific group 


# We will impute random values based on the mean and standard deviation of the data (total or group)



def imput_data_by_group(data_1, data_2):
    
    data = pd.concat([data_1, data_2], sort = False) #merge training and test data (DataFrame.append was removed in pandas 2.0)
    
    data_group = data[ (data['Pclass'] == 3) & (data['Travelled_alone'] == 1) ] #group filter
    
    mean_base = data['Age'].mean() #average of full data 
    std_base = data['Age'].std() #standard deviation of full data
    
    mean_group = data_group['Age'].mean() #average of group on full data
    std_group = data_group['Age'].std() #standard deviation of group on full data
    
    
    for df in (data_1, data_2): #treating NaNs from training data, then from test data
        for i in df.index[df['Age'].isnull()]:
            if (df.loc[i,"Pclass"] == 3) and (df.loc[i,"Travelled_alone"] == 1): #condition to be part of the group
                low, high = mean_group - std_group, mean_group + std_group
            else:
                low, high = mean_base - std_base, mean_base + std_base
            df.loc[i,'Age'] = np.random.randint(int(low), int(high)) #imputing a random integer in [low, high)
    
    return (data_1, data_2)
        

In [None]:
np.random.seed(0) #the imputation draws from np.random, which random.seed does not affect


df_train, df_test = imput_data_by_group(df_train, df_test)


In [None]:
df_train.describe()


In [None]:
df_test.describe()


In [None]:
df_train.isna().sum()


In [None]:
df_test.isna().sum()


In [None]:
df_train = df_train[ df_train['Embarked'].isnull() == False]


Comments:

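* The training rows with a missing 'Embarked' were dropped above. For the missing 'Fare' values in the test data, we cannot drop any observation from the test set, so we impute them from the mean and standard deviation, in the same way as we did for 'Age'.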


In [None]:
# Function to impute data in 'Fare' and 'Fare_per_family'


def imput_data_fare(data_1, data_2, columns):
    
    data = pd.concat([data_1, data_2], sort = False) #merge training and test data (DataFrame.append was removed in pandas 2.0)
    
    mean_base = data[columns].mean() #average of full data 
    std_base = data[columns].std() #standard deviation of full data
    
    low = max(int(mean_base - std_base), 0) #a fare cannot be negative
    high = int(mean_base + std_base)
                
    for i in data_2.index[data_2[columns].isnull()]: #treating NaNs from test data 
        data_2.loc[i,columns] = np.random.randint(low, high) #imputing a random integer in [low, high)
    
    return (data_2)



In [None]:
np.random.seed(0) #the imputation draws from np.random, which random.seed does not affect

df_test = imput_data_fare(df_train, df_test,'Fare')
df_test = imput_data_fare(df_train, df_test,'Fare_per_family')


## Transform features


'Age' - We can transform the numeric variable 'Age' into a new categorical feature ('Age_group') with the age groups: child; teen; young adult; adult; senior; and retired.

'Fare' - We can transform the numeric variable 'Fare' into a new categorical feature ('Fare_rate') with the fare groups: very low; low; base; high; and very high.

'Name' - We can extract each individual's title (Mr, Mrs, Miss, ...) from the 'Name' feature and store it in a new categorical feature called 'Title'.

'Cabin' - As stated earlier, we will create a new feature called 'Deck' from the information contained in the 'Cabin' feature. Individuals without this information are assigned the string 'U' for 'Unknown'.

'Pclass' - We will change the feature's type to string to make it easier to turn into dummies.
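As a compact illustration of the first and third transformations, the sketch below bins ages with `pd.cut` and extracts titles with a regular expression. It is an alternative formulation, not the functions defined in the cells that follow, and the names in the toy table are made up.

```python
import numpy as np
import pandas as pd

# Toy passengers with made-up names
toy = pd.DataFrame({
    'Name': ['Doe, Mr. John', 'Roe, Mlle. Jane', 'Poe, Dr. Alex'],
    'Age': [22.0, 12.0, 60.0],
})

# Right-closed intervals: (-inf, 12], (12, 18], (18, 27], (27, 40], (40, 59], (59, inf)
age_bins = [-np.inf, 12, 18, 27, 40, 59, np.inf]
age_labels = ['child', 'teen', 'young_adult', 'adult', 'senior', 'retired']
toy['Age_group'] = pd.cut(toy['Age'], bins=age_bins, labels=age_labels)

# The title is the word followed by a period, e.g. 'Mr' in 'Doe, Mr. John'
toy['Title'] = toy['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
toy['Title'] = toy['Title'].replace({'Ms': 'Miss', 'Mlle': 'Miss', 'Mme': 'Mrs'})

# Every remaining title is grouped under 'Distinct'
common_titles = ['Mr', 'Mrs', 'Miss', 'Master']
toy.loc[~toy['Title'].isin(common_titles), 'Title'] = 'Distinct'

print(toy)
```

`pd.cut` returns a categorical column, so the group labels keep their order, which can be convenient when plotting.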


### Age group

In [None]:
# Function to create a feature 'Age_group' (categorical feature) 


def convert_age_to_group(data):
    
    data['Age_group'] = ''
    
    data.loc[ data['Age'] <= 12, 'Age_group'] = 'child'
    data.loc[(data['Age'] > 12) & (data['Age'] <= 18), 'Age_group'] = 'teen'
    data.loc[(data['Age'] > 18) & (data['Age'] <= 27), 'Age_group'] = 'young_adult'
    data.loc[(data['Age'] > 27) & (data['Age'] <= 40), 'Age_group'] = 'adult'
    data.loc[(data['Age'] > 40) & (data['Age'] <= 59), 'Age_group'] = 'senior'
    data.loc[(data['Age'] > 59), 'Age_group'] = 'retired'

    return data


In [None]:
df_train = convert_age_to_group(df_train)
df_test = convert_age_to_group(df_test)


In [None]:
df_train['Age_group'].value_counts()

### Fare Group

In [None]:
# Function to create a feature 'Fare_rate' (categorical feature) 

def convert_fare_to_group(data):
    
    data['Fare_rate'] = ''
    
    data.loc[ data['Fare'] <= 8, 'Fare_rate'] = 'very_low'
    data.loc[(data['Fare'] > 8) & (data['Fare'] <= 16), 'Fare_rate'] = 'low'
    data.loc[(data['Fare'] > 16) & (data['Fare'] <= 32), 'Fare_rate'] = 'base' #starts at 16 so that it does not overlap with 'low'
    data.loc[(data['Fare'] > 32) & (data['Fare'] <= 64), 'Fare_rate'] = 'high'
    data.loc[(data['Fare'] > 64), 'Fare_rate'] = 'very_high'

    return data


In [None]:
df_train = convert_fare_to_group(df_train)
df_test = convert_fare_to_group(df_test)


In [None]:
df_train['Fare_rate'].value_counts()

### Title

In [None]:
# Function to create a feature 'Title' (categorical feature) 


def convert_title_to_group(data):
    
    data['Title'] = data.Name.str.extract(r' ([A-Za-z]+)\.', expand=False) #raw string avoids the invalid escape warning
    
    data.loc[ (data['Title'] == 'Ms'), 'Title'] = 'Miss'
    data.loc[ (data['Title'] == 'Mlle'), 'Title'] = 'Miss' 
    data.loc[ (data['Title'] == 'Mme'), 'Title'] = 'Mrs'
    
    data.loc[ ~data['Title'].isin(['Mr', 'Mrs', 'Miss', 'Master']), 'Title'] = 'Distinct' #every other title

    return data


In [None]:
df_train = convert_title_to_group(df_train)
df_test = convert_title_to_group(df_test)


### Pclass to categorical data

In [None]:
df_train['Pclass'] = df_train['Pclass'].astype(str)
df_test['Pclass'] = df_test['Pclass'].astype(str) #same treatment for the test data

In [None]:
df_train.info()


In [None]:
df_train.head()


In [None]:
df_test.head()


### Deck

In [None]:
# Function to create a feature 'Deck' (categorical feature) 

def convert_cabin_to_deck(data):
    
    data['Deck'] = data['Cabin'].fillna('U0')
    data['Deck'] = [x[0] for x in data['Deck'].values]

    return data


In [None]:
df_train = convert_cabin_to_deck(df_train)
df_test = convert_cabin_to_deck(df_test)


In [None]:
df_train['Deck'].value_counts()


Comments:

Because of the low variability of the data, we will not use the features for decks A, F, G and T after creating the dummies.



In [None]:
# Function to transform 'Cabin' into a new feature

def convert_cabin_to_havecabin(data):
    
    data['Have_cabin'] = data['Cabin'].notna().astype(int) #dummy feature: 1 = has a cabin, 0 = no cabin recorded

    return data




In [None]:
df_train = convert_cabin_to_havecabin(df_train)
df_test = convert_cabin_to_havecabin(df_test)


In [None]:
df_train.head()
    

In [None]:
df_test.head()


## Drop features

Comments:

Discarding the features previously indicated.


In [None]:
df_train.columns


In [None]:
df_train = df_train.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])


## Scaling numerical variables

Comments:
    
Since the other features will be dummy variables (0 or 1), I chose MinMaxScaler, which rescales the numerical values to the range between 0 and 1.
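The sketch below shows min-max scaling on two made-up tables, under the assumption that the scaler is fitted on the training data only and then reused on the test data. With that choice, test values outside the training range fall outside [0, 1].

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numerical features (values are made up for illustration)
train = pd.DataFrame({'Age': [10.0, 30.0, 50.0], 'Fare': [0.0, 50.0, 100.0]})
test = pd.DataFrame({'Age': [20.0, 60.0], 'Fare': [25.0, 100.0]})

scaler = MinMaxScaler()

# Learn the minimum and maximum from the training data and rescale it
train_scaled = pd.DataFrame(scaler.fit_transform(train), columns=train.columns)

# Reuse the training minimum and maximum on the test data
test_scaled = pd.DataFrame(scaler.transform(test), columns=test.columns)

print(train_scaled)
print(test_scaled)
```

Here an age of 60 maps to 1.25 because the largest training age is 50.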


In [None]:
num_features = ['Age', 'SibSp', 'Parch', 'Fare', 'Family_size', 'Fare_per_family'] #numerical features to scale, listed explicitly

#ss_scaler = StandardScaler()

ss_scaler = MinMaxScaler()

df_train[num_features] = ss_scaler.fit_transform(df_train[num_features]) #fit the scaler on the training data only

df_test[num_features] = ss_scaler.transform(df_test[num_features]) #reuse the training minimum and maximum for the test data



In [None]:
num_features


## Dummies

Comments:

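Remember the dummy variable rule (n - 1): for a categorical feature with n levels, only n - 1 dummies should enter the model, otherwise the dummies are perfectly collinear. We will apply this rule later; for now we keep all the dummy features created.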
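A small illustration of the rule on a made-up 'Embarked' column: with all n dummies every row sums to 1, which is the source of the collinearity, and `drop_first=True` keeps n - 1 of them. Note that `drop_first` removes the first level in sorted order ('C' here), whereas this notebook chooses by hand which level to leave out.

```python
import pandas as pd

# Toy categorical feature with three levels
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# All n dummies: one column per level
all_dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)

# n - 1 dummies: the first level becomes the reference category
reduced = pd.get_dummies(toy['Embarked'], prefix='Embarked', drop_first=True, dtype=int)

print(all_dummies)
print(reduced)
```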


In [None]:
cat_features = list(df_train.select_dtypes(include=['object']).columns) #categorical features to change into dummies


#transforming categorical data into dummy features (the original columns are dropped and the column name is used as prefix)

df_train = pd.get_dummies(df_train, columns=cat_features, dtype=int) #for training data

df_test = pd.get_dummies(df_test, columns=cat_features, dtype=int) #for test data
    


In [None]:
cat_features


In [None]:
df_test.columns


## Features

In [None]:
df_train.head()


In [None]:
df_test.isna().sum()

## Data snapshot after pre-processing

In [None]:
df_train


# Correlogram

In [None]:
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(16, 9))
sns.heatmap(corrmat, vmax=.8, square=True)


Comments:

We can observe a moderate correlation between 'Survived' and 'Sex_female', which is a good insight into who may have survived. The other features show the expected correlations.
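Besides the heatmap, the correlations with the target can be listed and sorted, which makes them easier to read. A minimal sketch on a made-up table follows (the real call would start from `df_train.corr()`):

```python
import pandas as pd

# Toy dummy features (values are made up for illustration)
toy = pd.DataFrame({
    'Survived':   [1, 1, 0, 0, 1, 0],
    'Sex_female': [1, 1, 0, 0, 1, 0],
    'Pclass_3':   [0, 0, 1, 1, 1, 0],
})

# Pearson correlation of every feature with the target, strongest positive first
corr_with_target = (
    toy.corr()['Survived']
    .drop('Survived')
    .sort_values(ascending=False)
)

print(corr_with_target)
```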


In [None]:
# Adapting the features given the dummy rule


features = ['Survived', 'Parch', 'Fare', 'Family_size', 'Travelled_alone', 'Fare_per_family', 'Pclass_1', 'Pclass_2', 
            'Sex_female', 'Embarked_C', 'Embarked_Q', 'Age_group_child', 'Age_group_teen', 'Age_group_adult', 'Age_group_young_adult', 
            'Age_group_senior', 'Fare_rate_very_high','Fare_rate_high', 'Fare_rate_base','Fare_rate_low', 'Title_Master', 'Title_Miss',
            'Title_Mr', 'Title_Mrs','Deck_A','Deck_B','Deck_C','Deck_D','Deck_E','Deck_U']



#drop: 'PassengerId', 'Name', 'Age', 'SibSp', 'Ticket', 'Cabin', 'Have_cabin', 'Sex_male', 'Embarked_S', 'Age_group_retired', 'Title_Distinct', 'Fare_rate_very_low',
#   'Pclass_3', 'Deck_F', 'Deck_G' and 'Deck_T'. 



features_dummy = ['Survived', 'Travelled_alone', 'Have_cabin', 'Pclass_1', 'Pclass_2', 
            'Pclass_3', 'Sex_female','Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Age_group_child', 'Age_group_teen', 
            'Age_group_adult', 'Age_group_young_adult', 'Age_group_senior', 'Fare_rate_very_high','Fare_rate_high', 'Fare_rate_base',
            'Fare_rate_low', 'Title_Master', 'Title_Miss','Title_Mr', 'Title_Mrs','Deck_A','Deck_B','Deck_C','Deck_D','Deck_E','Deck_U']



features_all = ['Survived', 'Age', 'SibSp','Parch', 'Fare', 'Family_size', 'Travelled_alone', 'Fare_per_family', 'Pclass_1', 'Pclass_2', 
            'Pclass_3', 'Sex_female','Sex_male', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Age_group_child', 'Age_group_teen', 'Have_cabin',
            'Age_group_adult', 'Age_group_young_adult', 'Age_group_senior', 'Fare_rate_very_high','Fare_rate_high', 'Fare_rate_base',
            'Fare_rate_low', 'Title_Master', 'Title_Miss','Title_Mr', 'Title_Mrs','Deck_A','Deck_B','Deck_C','Deck_D','Deck_E','Deck_U']

#drop: 'PassengerId', 'Name', 'Ticket', 'Cabin'.


Drop 'Deck': poor variable.


# Methods and Metrics

## Methods

Which method to use?

Basically, we will use all the classification methods available in the sklearn library that are compatible with cross-validation and with my data (e.g. Gaussian Naive Bayes is left out because it assumes continuous, normally distributed features, while most of our features are dummies). Grouped by family, they are:


* Discriminant Analysis:
    1. Linear Discriminant Analysis;
    2. Quadratic Discriminant Analysis.


* Ensemble:
    1. Ada Boost Classifier;
    2. Bagging Classifier;
    3. Extra Trees Classifier;
    4. Gradient Boosting Classifier;
    5. Random Forest Classifier.


* GLM:
    1. Logistic Regression;
    2. Passive Aggressive Classifier;
    3. Perceptron;
    4. Ridge Classifier;
    5. SGD Classifier.


* Naive Bayes:
    1. Bernoulli Naive Bayes. 


* Nearest Neighbors:
    1. K Neighbors Classifier.


* NN:
    1. Multi-layer Perceptron Classifier.


* SVM:
    1. SVC;
    2. Nu-SVC;
    3. Linear SVC.


* Decision Trees:
    1. Decision Tree Classifier;
    2. Extra Tree Classifier.

See: https://scikit-learn.org/stable/modules/classes.html# 


Additionally, we will also use the XGBoost method:
   
   * Extreme Gradient Boosting. 

See: https://xgboost.readthedocs.io/en/latest/tutorials/model.html



Why use all of these classification methods?

Because the purpose of the challenge and the nature of the data allow me to do that.


Regarding the first statement, the challenge only asks us to predict who survived and who did not, without analysing the effect of each input feature. If we needed to investigate the effects and significance of these input features, we would be restricted to parametric methods, whereas most of the methods used below are non-parametric. 

Regarding the second statement, as noted in the correlogram earlier, if we obey the dummy variable rule and avoid using highly collinear variables, no method is ruled out.



## Metrics
    
    
As we will use cross validation, we will use average accuracy.


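## Crossvalidation

We use sklearn's ShuffleSplit with 10 random splits, each training on 70% of the training data and validating on the remaining 30%.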





See: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit    
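To make the splitting scheme and the metric concrete, here is a minimal sketch that runs a ShuffleSplit cross-validation on synthetic data and averages the accuracy over the splits. The data and the choice of LogisticRegression are assumptions made only for this illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic binary classification problem (stand-in for X_train, Y_train)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# 10 random splits, each with 70% of the rows for training and 30% for validation
cv_split = ShuffleSplit(n_splits=10, test_size=0.3, train_size=0.7, random_state=0)

# The default score of a classifier is accuracy
cv_results = cross_validate(LogisticRegression(), X, y, cv=cv_split)

mean_accuracy = cv_results['test_score'].mean()
std_accuracy = cv_results['test_score'].std()
split_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv_split.split(X)]

print(mean_accuracy, std_accuracy)
```

Unlike k-fold, the validation sets of different splits may overlap, because each split is drawn independently.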



## Import metrics functions

In [None]:
from sklearn import metrics


## Import methods

In [None]:
from sklearn import svm, tree, neighbors, naive_bayes, ensemble, linear_model, discriminant_analysis, gaussian_process
from sklearn import model_selection, metrics
from xgboost import XGBClassifier


# Model Selection

The selected model will be based on the best average accuracy observed.


## X_train, Y_train and X_test

In [None]:
X_train = df_train[features[1:]]
Y_train = df_train[features[0]]


X_test = df_test[features[1:]]


## Baseline

In [None]:
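random.seed(0)
np.random.seed(0) #scikit-learn estimators without a random_state draw from NumPy's global generator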


methods = [
    
    #Ensemble Methods
    ensemble.AdaBoostClassifier(),
    ensemble.BaggingClassifier(),
    ensemble.ExtraTreesClassifier(),
    ensemble.GradientBoostingClassifier(),
    ensemble.RandomForestClassifier(), 
    #https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble
    
    #GLM
    linear_model.LogisticRegressionCV(),
    linear_model.PassiveAggressiveClassifier(),
    linear_model.RidgeClassifierCV(),
    linear_model.SGDClassifier(),
    linear_model.Perceptron(),
    #https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
    
    #Naive Bayes
    naive_bayes.BernoulliNB(),
    #https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(),
    #https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors
    
    #SVM
    svm.SVC(probability=True),
    svm.NuSVC(probability=True),
    svm.LinearSVC(),
    #https://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm
    
    #Trees    
    tree.DecisionTreeClassifier(),
    tree.ExtraTreeClassifier(),
    #https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree
    
    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(),
    discriminant_analysis.QuadraticDiscriminantAnalysis(),
    #https://scikit-learn.org/stable/modules/classes.html#module-sklearn.discriminant_analysis
    
    #xgboost
    XGBClassifier()    
    #https://xgboost.readthedocs.io/en/latest/tutorials/index.html
    
]



#split dataset in cross-validation with this splitter class
#note: this is an alternative to train_test_split

cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .7, random_state = 0) 


#create table to compare parameters and metrics
methods_columns = ['Method', 'Parameters', 'Test_Accuracy_Mean','Test_Accuracy_Std','Time']
methods_compare = pd.DataFrame(columns = methods_columns)


#create table to compare
method_predict = pd.DataFrame(df_test['PassengerId'])


#index through methods and save performance to table
row_index = 0

for method in methods:

    #set name and parameters
    method_name = method.__class__.__name__
    methods_compare.loc[row_index, 'Method'] = method_name
    methods_compare.loc[row_index, 'Parameters'] = str(method.get_params())
    
    #score model with cross validation: 
    cv_results = model_selection.cross_validate(method, X_train, Y_train, cv  = cv_split)

    methods_compare.loc[row_index, 'Time'] = cv_results['fit_time'].mean()
    methods_compare.loc[row_index, 'Test_Accuracy_Mean'] = cv_results['test_score'].mean()
    methods_compare.loc[row_index, 'Test_Accuracy_Std'] = cv_results['test_score'].std()  

    method.fit(X_train, Y_train)
    method_predict[method_name] = method.predict(X_test)
    
    row_index += 1
    

In [None]:
methods_compare.sort_values(by = ['Test_Accuracy_Mean'], ascending = False, inplace = True)
methods_compare


In [None]:
methods_compare.to_csv('methods_compare_1.csv', index = False, encoding='utf-8')                  


In [None]:
#result_gb_cv = pd.concat([pd.DataFrame(df_test['PassengerId'])], axis=1)
#result_gb_cv['Survived'] = method_predict['GradientBoostingClassifier']

#result_logit_cv = pd.concat([pd.DataFrame(df_test['PassengerId'])], axis=1)
#result_logit_cv['Survived'] = method_predict['LogisticRegressionCV']                         

In [None]:
#result_logit_cv.to_csv('result_logit_cv.csv', index = False, encoding='utf-8')

#result_gb_cv.to_csv('result_gb_cv.csv', index = False, encoding='utf-8')


In [None]:
#result_lda_cv = pd.concat([pd.DataFrame(df_test['PassengerId'])], axis=1)
#result_lda_cv['Survived'] = method_predict['LinearDiscriminantAnalysis']

#result_lda_cv.to_csv('result_lda_cv.csv', index = False, encoding='utf-8')

In [None]:
#result_lsvc_cv = pd.concat([pd.DataFrame(df_test['PassengerId'])], axis=1)
#result_lsvc_cv['Survived'] = method_predict['LinearSVC']

#result_lsvc_cv.to_csv('result_lsvc_cv.csv', index = False, encoding='utf-8')

## Initial Results



# Can we improve performance?

We can use 2 approaches to try to improve the performance of the models: 'feature selection' and 'parameter tuning'.

In 'feature selection' we use methods to choose which features enter the model instead of using all those kept in X_train and X_test. The aim is to keep only the features with the greatest predictive potential. 

'Parameter tuning' searches over candidate values of each method's hyperparameters and keeps the combination that obtains the best result on a specific metric.
  

Here we will only use 'parameter tuning' to try to improve performance.

# Tuning parameters

We will use GridSearchCV to find the best hyperparameters among the candidate values passed for each method.


See: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
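As a rough sketch of how such a search works, the example below tunes a random forest on synthetic data with a deliberately small grid. The data, the grid values and the variable names are assumptions for illustration only; they are not the grids used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic binary classification problem (stand-in for X_train, Y_train)
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Candidate values: 2 x 3 x 1 = 6 combinations are evaluated
param_grid = {
    'n_estimators': [10, 50],
    'max_depth': [2, 4, None],
    'random_state': [0],
}

cv_split = ShuffleSplit(n_splits=5, test_size=0.3, train_size=0.7, random_state=0)

# Every combination is scored by its mean accuracy over the splits
search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid,
                      scoring='accuracy', cv=cv_split)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_

print(best_params, best_score)
```

After fitting, `search.best_estimator_` holds the model refitted on all the data with the best combination.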





In [None]:
vote = [
    
    #Ensemble Methods:
    ('ada', ensemble.AdaBoostClassifier()),
    ('bc', ensemble.BaggingClassifier()),
    ('etc', ensemble.ExtraTreesClassifier()),
    ('gbc', ensemble.GradientBoostingClassifier()),
    ('rfc', ensemble.RandomForestClassifier()),

    #Gaussian Processes:
    #('gpc', gaussian_process.GaussianProcessClassifier()),
    
    #GLM: 
    ('lr', linear_model.LogisticRegressionCV()),
    #('rr', linear_model.RidgeClassifierCV()),
    
    #Naive Bayes: 
    #('bnb', naive_bayes.BernoulliNB()),
    #('gnb', naive_bayes.GaussianNB()),
    
    #Nearest Neighbor: 
    #('knn', neighbors.KNeighborsClassifier()),
    
    #SVM: 
    ('lsvc', svm.LinearSVC()),
    ('svc', svm.SVC(probability=True)),
    
    #Discriminant Analysis
    ('lda', discriminant_analysis.LinearDiscriminantAnalysis()),
    
    #xgboost:
   ('xgb', XGBClassifier())

]



In [None]:
#Grid hyperparameter

grid_n_estimator = [10, 50, 100, 300]
grid_ratio = [.1, .25, .5, .75, 1.0]
grid_learn = [.01, .03, .05, .1, .25]
grid_max_depth = [2, 4, 6, 8, 10, None]
grid_min_samples = [5, 10, .03, .05, .10]
grid_criterion = ['gini', 'entropy']
grid_bool = [True, False]
grid_seed = [0]


grid_param = [
    
            [{
            #AdaBoostClassifier
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
            'n_estimators': grid_n_estimator, #default=50
            'learning_rate': grid_learn, #default=1
            'random_state': grid_seed
            #'algorithm': ['SAMME', 'SAMME.R'], #default=’SAMME.R
            }],
       
    
            [{
            #BaggingClassifier
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html#sklearn.ensemble.BaggingClassifier
            'n_estimators': grid_n_estimator, #default=10
            'max_samples': grid_ratio, #default=1.0
            'bootstrap' : grid_bool, #default=True
            'random_state': grid_seed
            #'bootstrap_features' : boolean, optional (default=False)
            #'oob_score' : bool, optional (default=False)
            #'n_jobs' : int or None, optional (default=None)
             }],

    
            [{
            #ExtraTreesClassifier
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html#sklearn.ensemble.ExtraTreesClassifier
            'n_estimators': grid_n_estimator, #default=100 (10 before scikit-learn 0.22)
            'criterion': grid_criterion, #default=”gini”
            'max_depth': grid_max_depth, #default=None
            'bootstrap': grid_bool, #default=False
            'random_state': grid_seed
            #
             }],


            [{
            #GradientBoostingClassifier
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier
            'loss': ['deviance', 'exponential'], #default='deviance' (named 'log_loss' from scikit-learn 1.1 onwards)
            'learning_rate': grid_learn, #default=0.1 
            'n_estimators': grid_n_estimator, #default=100 
            #'criterion': ['friedman_mse', 'mse', 'mae'], #default=”friedman_mse”
            #'subsample' : , float, optional (default=1.0)    
            'max_depth': grid_max_depth, #default=3   
            'random_state': grid_seed
             }],

    
            [{
            #RandomForestClassifier
            #http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
            'n_estimators': grid_n_estimator, #default=100 (10 before scikit-learn 0.22)
            'criterion': grid_criterion, #default=”gini”
            'max_depth': grid_max_depth, #default=None
            #'bootstrap': grid_bool, #default=True
            'oob_score': [True,False], #default=False 
            'random_state': grid_seed
             }],
        
    
            [{
            #LogisticRegressionCV
            #http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
            #'C' : [5,10,20], #default: 10
            'fit_intercept': grid_bool, #default: True
            'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], #default: lbfgs
            'random_state': grid_seed
            #'penalty': ['l1','l2'],
             }],
    
            
            #[{
            #RidgeClassifier
            #https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeClassifier.html#sklearn.linear_model.RidgeClassifier
            #'alpha': [.1,.25,.5,.75,1], #default : 1
            #'solver': ['auto','svd','cholesky','lsqr','sparse_cg','sag','saga'], #default=auto
            #'random_state': grid_seed
             #}],


            #[{
            #KNeighborsClassifier
            #http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier
            #'n_neighbors': [1,2,3,4,5,6,7,8,9,10], #default: 5
            #'weights': ['uniform', 'distance'], #default = 'uniform'
            #'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']
            #}],
    
    
             [{
            #LinearSVC
            #https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
            #'loss': ['hinge', 'squared_hinge'], #default='squared_hinge'
            'fit_intercept' : grid_bool, #default=True   
            'random_state': grid_seed
             }],
    
    
            [{
            #SVC
            #http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
            #http://blog.hackerearth.com/simple-tutorial-svm-parameter-tuning-python-r
            #'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
            'C': [1,2,3,4,5], #default=1.0
            'gamma': grid_ratio, #default: 'scale' ('auto' before scikit-learn 0.22)
            'decision_function_shape': ['ovo', 'ovr'], #default:ovr
            'probability': [True],
            'random_state': grid_seed
             }],
    
    
            [{
            #DiscriminantAnalysis
            #https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
            'solver' : ['svd','lsqr','eigen'],  #default: svd
            #'shrinkage' : ['auto', None], #default: none
             }],

    
            [{
            #XGBClassifier
            #http://xgboost.readthedocs.io/en/latest/parameter.html
            'learning_rate': grid_learn, #default: .3
            'max_depth': [1,2,4,6,8,10], #default: 6
            'n_estimators': grid_n_estimator, 
            'seed': grid_seed  
             }]   
        ]
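
The tuning loop in the next cell pairs each estimator with its grid by position, so a grid that is commented out here while its estimator stays in `vote` would shift every later pairing. The sketch below is one possible sanity check, assuming `vote` is a list of `(name, estimator)` tuples as the loop's `clf[1]` suggests. The lists `vote_demo` and `grid_demo` and the helper `check_grids` are hypothetical stand-ins, not the notebook's own objects.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins for the notebook's `vote` and `grid_param` lists.
vote_demo = [
    ('rfc', RandomForestClassifier()),
    ('lr', LogisticRegression()),
]
grid_demo = [
    [{'n_estimators': [10, 50], 'max_depth': [2, 4]}],
    [{'fit_intercept': [True, False]}],
]


def check_grids(estimators, grids):
    """Return a list of problems found when pairing estimators with grids by position."""
    problems = []
    if len(estimators) != len(grids):
        problems.append('length mismatch: {} estimators vs {} grids'.format(
            len(estimators), len(grids)))
    for (name, est), grid in zip(estimators, grids):
        valid = est.get_params().keys()
        for block in grid:
            for key in block:
                if key not in valid:
                    problems.append('{}: unknown parameter {!r}'.format(name, key))
    return problems


problems = check_grids(vote_demo, grid_demo)
print(problems)  # an empty list means every grid key is accepted by its estimator
```

Running the same check on `vote` and `grid_param` before the grid search would flag a misaligned grid before any time is spent fitting.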



In [None]:
#Hyperparameter tuning: grid-search each estimator in vote with its matching grid in grid_param


random.seed(0)


start_total = time.perf_counter() #https://docs.python.org/3/library/time.html#time.perf_counter
for clf, param in zip(vote, grid_param): #pairs by position, so grid_param must follow the order of vote; https://docs.python.org/3/library/functions.html#zip
    
    start = time.perf_counter()        
    best_search = model_selection.GridSearchCV(estimator = clf[1], param_grid = param, cv = cv_split, scoring = 'roc_auc')
    best_search.fit(X_train, Y_train)
    run = time.perf_counter() - start

    best_param = best_search.best_params_
    print('The best parameter for {} is {} with a runtime of {:.2f} seconds.'.format(clf[1].__class__.__name__, best_param, run))
    clf[1].set_params(**best_param) 
    print()
    print()
    print('-'*100)
    print()


run_total = time.perf_counter() - start_total
print('Total optimization time was {:.2f} minutes.'.format(run_total/60))
print()
print('-'*100)
print('-'*100)
print('-'*100)
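
The cell above relies on two behaviours of `GridSearchCV`: it fits clones of the estimator, and `best_params_` contains only the keys that were searched. The sketch below illustrates both on synthetic data. `X_demo`, `y_demo`, `cv_demo` and `clf_demo` are assumptions standing in for `X_train`, `Y_train`, `cv_split` and one entry of `vote`.

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for X_train / Y_train (an assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

cv_demo = model_selection.ShuffleSplit(n_splits=3, test_size=.3, train_size=.7, random_state=0)
clf_demo = GradientBoostingClassifier(random_state=0)
grid = {'n_estimators': [10, 50], 'max_depth': [1, 2]}

search = model_selection.GridSearchCV(estimator=clf_demo, param_grid=grid,
                                      cv=cv_demo, scoring='roc_auc')
search.fit(X_demo, y_demo)

# GridSearchCV fits clones, so clf_demo itself is still unfitted here.
# set_params copies the winning values onto it; it must be fitted afterwards.
clf_demo.set_params(**search.best_params_)
print(search.best_params_, round(search.best_score_, 3))
```

With the default `refit=True`, `search.best_estimator_` is already fitted on the full training data, so it can be used directly instead of calling `set_params` and fitting again.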



In [None]:
random.seed(0)


methods_otim = [
    
    #Ensemble Methods
    ensemble.AdaBoostClassifier(n_estimators = 300, learning_rate = .03, random_state = 0),
    ensemble.BaggingClassifier(n_estimators = 300, bootstrap = True, max_samples = .25, random_state = 0),
    ensemble.ExtraTreesClassifier(n_estimators = 10, bootstrap = False, criterion = 'entropy', max_depth = 8, random_state = 0),
    ensemble.GradientBoostingClassifier(n_estimators = 50, loss = 'deviance', learning_rate = 0.25, max_depth = 2, random_state = 0),
    ensemble.RandomForestClassifier(n_estimators = 100, criterion = 'gini', oob_score = True, max_depth = 10, 
                                    random_state = 0), 
    
    #GLM
    linear_model.LogisticRegressionCV(fit_intercept = False, solver = 'saga', random_state = 0),
    linear_model.RidgeClassifierCV(),
    
    #Naive Bayes
    #naive_bayes.BernoulliNB(alpha = 0.25),
    
    
    #Nearest Neighbor
    neighbors.KNeighborsClassifier(n_neighbors = 10, weights = 'uniform', algorithm = 'ball_tree'),
    
    
    #SVM
    svm.LinearSVC(fit_intercept = False, random_state = 0),
    svm.SVC(C = 2, gamma = .25, decision_function_shape = 'ovo', probability = True, random_state = 0),

    #Discriminant Analysis
    discriminant_analysis.LinearDiscriminantAnalysis(solver = 'eigen'),
    
    
    #xgboost
    XGBClassifier(n_estimators = 50, learning_rate = .03, max_depth = 4, random_state = 0)    

    
]



#split dataset in cross-validation with this splitter class
#http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html#sklearn.model_selection.ShuffleSplit
#note: this is an alternative to train_test_split

#cv_split = model_selection.ShuffleSplit(n_splits = 10, test_size = .3, train_size = .7, random_state = 0) 


#create table to compare parameters and metrics
methods_columns = ['Method', 'Parameters', 'Test_Accuracy_Mean','Test_Accuracy_Std','Time']
methods_compare = pd.DataFrame(columns = methods_columns)


#create table to hold each method's predictions on the test set
method_predict = pd.DataFrame(df_test['PassengerId'])


#index through methods and save performance to table
row_index = 0

for method in methods_otim:

    #set name and parameters
    method_name = method.__class__.__name__
    methods_compare.loc[row_index, 'Method'] = method_name
    methods_compare.loc[row_index, 'Parameters'] = str(method.get_params())
    
    #score model with cross validation (no scoring argument, so classifiers are scored by accuracy)
    cv_results = model_selection.cross_validate(method, X_train, Y_train, cv = cv_split)

    methods_compare.loc[row_index, 'Time'] = cv_results['fit_time'].mean()
    methods_compare.loc[row_index, 'Test_Accuracy_Mean'] = cv_results['test_score'].mean()
    methods_compare.loc[row_index, 'Test_Accuracy_Std'] = cv_results['test_score'].std()  

    method.fit(X_train, Y_train)
    method_predict[method_name] = method.predict(X_test)
    
    row_index += 1

    
methods_compare.sort_values(by = ['Test_Accuracy_Mean'], ascending = False, inplace = True)
methods_compare
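
The grid search above was scored with `roc_auc`, while the comparison table uses the default accuracy score. The sketch below shows one way to collect both metrics in a single `cross_validate` call. It runs on synthetic data, and `X_demo`, `y_demo`, `cv_demo`, `compare_demo` and the two example estimators are assumptions, not the notebook's objects.

```python
import pandas as pd
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (an assumption, not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
cv_demo = model_selection.ShuffleSplit(n_splits=5, test_size=.3, train_size=.7, random_state=0)

rows = []
for method in [LogisticRegression(max_iter=1000),
               DecisionTreeClassifier(max_depth=3, random_state=0)]:
    # With a list of scorers, the result keys are 'test_<scorer name>'.
    cv_results = model_selection.cross_validate(
        method, X_demo, y_demo, cv=cv_demo, scoring=['accuracy', 'roc_auc'])
    rows.append({
        'Method': method.__class__.__name__,
        'Test_Accuracy_Mean': cv_results['test_accuracy'].mean(),
        'Test_ROC_AUC_Mean': cv_results['test_roc_auc'].mean(),
        'Time': cv_results['fit_time'].mean(),
    })

# Building the frame once from a list of dicts avoids growing it row by row.
compare_demo = pd.DataFrame(rows).sort_values(by='Test_Accuracy_Mean', ascending=False)
print(compare_demo)
```

Reporting both columns makes it easier to see whether the ranking by accuracy agrees with the metric the hyperparameters were tuned for.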


In [None]:
#methods_compare.to_csv('methods_compare_w_tuning.csv', index = False, encoding='utf-8') 

In [None]:
#build one submission frame (PassengerId, Survived) per selected method
result_gb = df_test[['PassengerId']].copy()
result_gb['Survived'] = method_predict['GradientBoostingClassifier']


result_ext = df_test[['PassengerId']].copy()
result_ext['Survived'] = method_predict['ExtraTreesClassifier']


result_svc = df_test[['PassengerId']].copy()
result_svc['Survived'] = method_predict['SVC']


result_rf = df_test[['PassengerId']].copy()
result_rf['Survived'] = method_predict['RandomForestClassifier']


In [None]:
result_gb.to_csv('result_gb_cv_wt.csv', index = False, encoding='utf-8')
result_ext.to_csv('result_ext_cv_wt.csv', index = False, encoding='utf-8')
result_svc.to_csv('result_svc_cv_wt.csv', index = False, encoding='utf-8')
result_rf.to_csv('result_rf_cv_wt.csv', index = False, encoding='utf-8')


In [None]:
#count the test passengers for which the random forest and gradient boosting predictions differ
result_rf[ result_rf['Survived'] != result_gb['Survived'] ].count()
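
The comparison above returns one count per column. A more compact variant, sketched here on made-up predictions, sums the boolean mask directly. `pred_a` and `pred_b` are hypothetical stand-ins for two `Survived` columns.

```python
import pandas as pd

# Hypothetical predictions for five passengers from two models.
pred_a = pd.Series([0, 1, 1, 0, 1], name='Survived')
pred_b = pd.Series([0, 1, 0, 0, 0], name='Survived')

disagree = pred_a != pred_b        # boolean mask, True where the models differ
n_disagree = int(disagree.sum())   # number of differing predictions
share = disagree.mean()            # fraction of rows that differ
print(n_disagree, share)
```

The same two lines applied to `result_rf['Survived']` and `result_gb['Survived']` would give a single number and a disagreement rate instead of a per-column count.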

# Final Results