# Exploring the Titanic dataset

## Introduction

Since I don't have a lot of ideas, let's see what other people have done with this dataset. The following are key points from other people's kernels.

---

### [Megan Risdal](https://www.kaggle.com/mrisdal/exploring-survival-on-the-titanic)

*Feature engineering*
 - Break families into 3 groups (plot family size vs survival - barplot)
 - Separate the passengers by deck (taken from the Cabin variable)
 - Child and mother bins (plot age histogram + survival)
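To make these three ideas concrete for myself, here is a minimal pandas sketch on a made-up toy frame. The column names match the Titanic data, but the rows, the group cut points (1, 2-4, 5+), the age threshold of 18 and the simplified mother rule are all my assumptions, not necessarily what the kernel does.

```python
import pandas as pd

# Toy rows with Titanic column names; the values are invented for illustration.
toy = pd.DataFrame({
    "SibSp": [1, 0, 4, 0],
    "Parch": [0, 0, 2, 1],
    "Cabin": ["C85", None, "E46", "B28"],
    "Age":   [38.0, 26.0, 9.0, 30.0],
    "Sex":   ["female", "female", "male", "female"],
})

# Family size = the passenger + siblings/spouses + parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Three family groups; the cut points are an assumption.
toy["FamilyGroup"] = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, 20],
    labels=["single", "small", "large"],
)

# The deck is the first letter of the cabin code; missing cabins stay missing.
toy["Deck"] = toy["Cabin"].str[0]

# Child and mother flags (thresholds and rule are assumptions).
toy["Child"] = toy["Age"] < 18
toy["Mother"] = (toy["Sex"] == "female") & (toy["Parch"] > 0) & (toy["Age"] > 18)

print(toy[["FamilySize", "FamilyGroup", "Deck", "Child", "Mother"]])
```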

*Imputing missing data*
 - Given the small data set, do not delete rows
 - Impute missing Embarked data based on passenger class and fare (replace the NA values with 'C')
 - Impute the missing Fare value (maybe use the median)
 - Impute age using a recursive partitioning regression model (look up mice imputation)
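A rough sketch of the two simple imputations above, on an invented toy frame. Filling `Embarked` with 'C' follows the note; filling `Fare` with the median of passengers in the same class and port is my own guess at a sensible median, not something taken from the kernel.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are invented for illustration.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3],
    "Fare":     [80.0, 76.7, 7.25, np.nan, 8.05],
    "Embarked": [np.nan, "C", "S", "S", "S"],
})

# Fill the missing port first, so the group keys below contain no NaN.
toy["Embarked"] = toy["Embarked"].fillna("C")

# Median fare among passengers with the same class and port (an assumed grouping).
group_median = toy.groupby(["Pclass", "Embarked"])["Fare"].transform("median")
toy["Fare"] = toy["Fare"].fillna(group_median)

print(toy)
```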
 
*Model*
 - Random forest
 - Show model error (plot)
 - Plot variable importance
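For reference, a small scikit-learn sketch of the same idea: fit a random forest, read off the out-of-bag error and rank the variables by importance. The data is synthetic (one informative column and one noise column, both hypothetical), and using the out-of-bag score as the "model error" is my assumption about what the plot in the kernel shows.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target depends only on the "informative" column.
rng = np.random.RandomState(0)
n = 200
X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["informative"] > 0).astype(int)

# oob_score=True keeps an out-of-bag accuracy estimate from the bootstrap samples.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

oob_error = 1 - forest.oob_score_
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print("OOB error:", oob_error)
print(importances)
```

`importances.plot.barh()` would turn the ranking into the variable importance plot.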
 
*Notes*
 - Nice format (structure) of the kernel
 - She has an index, which is nice


### [swamysm](https://www.kaggle.com/swamysm/beginners-titanic)

 - Interesting conclusion:
 ```When I submit the predicted survival data from various models that built in the course to Kaggle competion, i have got approximately the same score. Now I realize that why data scientist used to spend most of their time into feature engineering and exploratory analysis compare to actual model building. Model that we are using is definitely important, however more than that understanding our data and feature engineering is crucial.```


### [Anisotropic](https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python)

```
Method of ensembling (combining) base learning models, in particular the variant of ensembling known as Stacking
```

*Model*

 - RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, SVC, KFold
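The kernel builds its stack by hand from out-of-fold predictions with `KFold`. As a shorter stand-in, here is a sketch with scikit-learn's `StackingClassifier`, which does the same kind of cross-validated stacking internally. The synthetic data, the choice of three base models and the logistic regression meta-model are my assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Base learners produce cross-validated predictions (cv=5);
# the final estimator is trained on those predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
        ("svc", SVC(random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X, y)

train_accuracy = stack.score(X, y)
print("train accuracy:", train_accuracy)
```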


### General notes
 - A lot of people use random forests, even though they can still overfit a data set this small
 
 ---
 
## Goals
 - Construct a pipeline that does the preprocessing and learning
 - Rank higher than `2793/8677` on the Kaggle leaderboard
 
 ---
 
## Plan
 - Apply regression to fill the missing age values
 - Impute the median value for the other missing feature values
 - One-hot encode Gender and Embarked
 - Apply binning to age
 - Visualize different features against survival rate
 - Visualize confusion matrix
 - Construct pipeline for feature mapping
 - Try random forest, SVM and logistic regression
 - Visualize model error and variable importance
 - Grid search for the best hyper-parameters
 - Apply model on test data and submit
 - Profit
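Two of the plan items, age binning and one-hot encoding, can be tried out on toy data before touching the real frame. This is only a sketch: the bin edges and labels are assumptions, and `pd.get_dummies` is one possible way to encode, not necessarily what the final pipeline will use.

```python
import pandas as pd

# Toy rows with Titanic column names; the values are invented for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [4.0, 30.0, 67.0],
})

# Bin age into named groups; the edges are assumptions.
toy["AgeBin"] = pd.cut(
    toy["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# One indicator column per category value.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked", "AgeBin"])

print(encoded.columns.tolist())
```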

### Loading the necessary libraries and the data

In [7]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.pipeline import FeatureUnion

from IPython.display import display

%matplotlib inline
pd.options.mode.chained_assignment = None

In [2]:
# Loading the data
train = pd.read_csv('data/titanic-train.csv', index_col='PassengerId')
test = pd.read_csv('data/titanic-test.csv', index_col='PassengerId')

# Let's have a look at the data
train.head(5)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Getting to know the data

In [3]:
train.describe()

| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Some of the **Age** values are missing (only 714 of the 891 rows have one), so we would need to **impute** some data there.

In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


Let's impute the mean age into the rows where it is missing.

In [8]:
data = train.copy()
# Fill the missing ages with the mean age (fillna avoids chained assignment)
data['Age'] = data['Age'].fillna(data['Age'].mean())

In [9]:
data.describe()

| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 0.486592 | 0.836071 | 13.002015 | 1.102743 | 0.806057 | 49.693429 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 22.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 0.0 | 3.0 | 29.699118 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 35.0 | 1.0 | 0.0 | 31.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


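Much better: `Age` now has 891 values. Note that mean imputation keeps the mean at 29.70 but shrinks the standard deviation from 14.53 to 13.00.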
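The effect of mean imputation on the summary statistics is easy to check on a tiny made-up series (the ages below are invented): the mean stays the same, while the spread gets smaller because every filled value sits exactly on the mean.

```python
import numpy as np
import pandas as pd

# Invented ages with two missing entries.
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0])
missing_before = int(ages.isnull().sum())

# Mean imputation: every gap gets the mean of the observed values.
filled = ages.fillna(ages.mean())

print("missing before:", missing_before)
print("mean before/after:", ages.mean(), filled.mean())
print("std before/after: ", ages.std(), filled.std())
```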

## Baseline model

In [14]:
# Take both the features and the target from the copy with imputed ages
X = data.drop(['Cabin', 'Survived'], axis=1)
y = data['Survived']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8, stratify=y)

In [16]:
model = LogisticRegression().fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))
print("test score: ", model.score(X_test, y_test))

ValueError: could not convert string to float: 'S'

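The baseline fails: `X` still contains string columns (`Name`, `Sex`, `Ticket`, `Embarked`), and `LogisticRegression` needs numeric input. We need to encode or drop them before we can get a score.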
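One quick way around the `ValueError` is to fit the baseline on the numeric columns only and leave the string columns for a later encoding step. A sketch on invented toy rows (the values and the `max_iter` setting are my assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy rows with Titanic column names; the values are invented for illustration.
toy = pd.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 1, 3],
    "Sex":      ["male", "female", "female", "female", "male", "male", "female", "male"],
    "Fare":     [7.25, 71.28, 7.93, 53.10, 8.05, 13.00, 80.00, 7.75],
    "Embarked": ["S", "C", "S", "S", "S", "S", "C", "Q"],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Keep only numeric feature columns; strings would raise the same ValueError.
X_numeric = toy.drop("Survived", axis=1).select_dtypes(include="number")
y_toy = toy["Survived"]

baseline = LogisticRegression(max_iter=1000).fit(X_numeric, y_toy)
print("columns used:", X_numeric.columns.tolist())
print("train score:", baseline.score(X_numeric, y_toy))
```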

### Let's one-hot encode the gender

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
# Imputer was removed in scikit-learn 0.22; SimpleImputer is its replacement
from sklearn.impute import SimpleImputer as Imputer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


class ItemSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        return data_dict[[self.key]]
    
# Fills missing values with the most frequent value of each column
class StringImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, *_):
        self.modes = X.mode().iloc[0]
        return self
        
    def transform(self, X, y=None):
        return X.fillna(self.modes)

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    def fit(self, X, y=None):
        super().fit(X)
        return self  # scikit-learn expects fit to return the estimator

    def transform(self, X, y=None):
        return super().transform(X)

    def fit_transform(self, X, y=None):
        return super().fit(X).transform(X)

model = Pipeline([
    ('union', FeatureUnion([
        ('age', Pipeline([
            ('select', ItemSelector('Age')),
            ('imputer', Imputer(strategy='mean')),
            ('scaler', StandardScaler()),
        ])),
        ('gender', Pipeline([
            ('select', ItemSelector('Sex')),
            ('imputer', StringImputer()),
            ('encoder', LabelBinarizerPipelineFriendly()),
        ])),
        ('embarked', Pipeline([
            ('select', ItemSelector('Embarked')),
            ('imputer', StringImputer()),
            ('encoder', LabelBinarizerPipelineFriendly()),
        ])),
        ('sibsp', Pipeline([
            ('select', ItemSelector('SibSp')),
            ('scaler', StandardScaler()),
        ])),
        ('parch', Pipeline([
            ('select', ItemSelector('Parch')),
            ('scaler', StandardScaler()),
        ])),
    ])),
    ('svc', SVC())
])

scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores)
print(scores.mean())
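Newer scikit-learn versions can express the same per-column preprocessing without the custom selector and binarizer classes, using `ColumnTransformer`, `SimpleImputer` and `OneHotEncoder`. The sketch below is an alternative under that assumption, run on invented toy rows rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Toy rows with Titanic column names; the values are invented for illustration.
toy = pd.DataFrame({
    "Age":      [22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0, 2.0],
    "SibSp":    [1, 1, 0, 1, 0, 0, 0, 3],
    "Parch":    [0, 0, 0, 0, 0, 0, 0, 1],
    "Sex":      ["male", "female", "female", "female", "male", "male", "male", "male"],
    "Embarked": ["S", "C", "S", "S", np.nan, "Q", "S", "S"],
})
y_toy = [0, 1, 1, 1, 0, 0, 0, 0]

# Numeric columns: mean imputation, then scaling.
numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# String columns: most frequent value, then one indicator column per category.
categorical = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "SibSp", "Parch"]),
    ("cat", categorical, ["Sex", "Embarked"]),
])

toy_model = Pipeline([("preprocess", preprocess), ("svc", SVC())])
toy_model.fit(toy, y_toy)

print(toy_model.predict(toy))
```

Because the imputers live inside the pipeline, `cross_val_score` refits them on each training fold, so nothing leaks from the validation fold.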

### Let's try to learn the missing age values

In [None]:
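# original_data is not defined earlier; assuming it is the raw training set
original_data = train.copy()
data = original_data[~original_data.Age.isnull()]
data.describe()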

In [None]:
# Note: 'Survived' is not available in the test set, so this age model only works on training data
X = data[['Pclass', 'Survived', 'SibSp', 'Parch', 'Fare']]
y = data['Age']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)

pipeline = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2), LinearRegression())
model = pipeline.fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))
print("test score: ", model.score(X_test, y_test))

In [None]:
data = original_data.copy()
missing_age_data = original_data[original_data.Age.isnull()][['Pclass', 'Survived', 'SibSp', 'Parch', 'Fare']]
predicted_age = pipeline.predict(missing_age_data)

# Use .loc instead of chained assignment to write the predictions in place
data.loc[data.Age.isnull(), 'Age'] = predicted_age

In [None]:
data.describe()

In [None]:
X = data[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']]
y = data['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=8)

pipeline = make_pipeline(StandardScaler(), PolynomialFeatures(degree=5), LogisticRegression())
model = pipeline.fit(X_train, y_train)
print("train score:", model.score(X_train, y_train))
print("test score: ", model.score(X_test, y_test))
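The polynomial degree and the regularization strength are picked by hand above. The plan calls for a grid search, so here is a sketch of how that could look with `GridSearchCV` over the same kind of pipeline. The synthetic data, the candidate values and `max_iter=1000` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in with five features, like the five columns used above.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=8)

demo_pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(),
    LogisticRegression(max_iter=1000),
)

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "polynomialfeatures__degree": [1, 2, 3],
    "logisticregression__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(demo_pipeline, param_grid, cv=5)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print("best parameters:", best_params)
print("best CV score:  ", search.best_score_)
```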