# Kaggle Titanic

## dataset description

Variable | Definition | Key
---------|------------|-----
survival | Survival | 0 = No, 1 = Yes
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd
sex | Sex |
age | Age in years |
sibsp | # of siblings / spouses aboard the Titanic |
parch | # of parents / children aboard the Titanic |
ticket | Ticket number |
fare | Passenger fare |
cabin | Cabin number |
embarked | Port of embarkation | C = Cherbourg, Q = Queenstown, S = Southampton

**Variable Notes**

**pclass:** A proxy for socio-economic status (SES)

- 1st = Upper
- 2nd = Middle
- 3rd = Lower

**age:** Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp:** The dataset defines family relations in this way:

- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

**parch:** The dataset defines family relations in this way:

- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.

# imports

In [254]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
from statistics import mode

from sklearn.model_selection import train_test_split
from sklearn import preprocessing


%matplotlib inline

# init data

In [255]:
import memory_usage

`memory_usage` is a local helper module for reducing the memory usage of a DataFrame, based on:
https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
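The idea behind such a helper can be sketched with plain pandas. This is a minimal illustration on a made-up frame, not the code from the linked kernel; `downcast_numeric` and the toy values are assumptions introduced here.

```python
import numpy as np
import pandas as pd

def downcast_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns stored in the smallest dtype that fits."""
    out = frame.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "pclass": np.array([1, 2, 3, 3], dtype="int64"),
    "fare": np.array([71.25, 13.0, 7.25, 8.5], dtype="float64"),
})
small = downcast_numeric(toy)

before = toy.memory_usage(deep=True).sum()
after = small.memory_usage(deep=True).sum()
print(small.dtypes)
print(f"{before} -> {after} bytes")
```

On a file as small as this one the saving is negligible, which matches the output of the loading cell below.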

In [256]:
link = r'D:\STUDY\practice\titanic'  # raw string, so backslashes are not treated as escapes
train = 'train.csv'
test = 'test.csv'

In [257]:
df = pd.read_csv(link+'\\'+train)
#df = memory_usage.import_data(link+'\\'+train)
test_df = memory_usage.import_data(link+'\\'+test)

df = df.rename(columns={c:str.lower(c) for c in df.columns})
test_df = test_df.rename(columns={c:str.lower(c) for c in test_df.columns})

Memory usage of dataframe is 0.04 MB
Memory usage after optimization is: 0.04 MB
Decreased by -27.7%


# exploring data

## basic info

In [258]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [259]:
df.shape

(891, 12)

In [260]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
passengerid    891 non-null int64
survived       891 non-null int64
pclass         891 non-null int64
name           891 non-null object
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
ticket         891 non-null object
fare           891 non-null float64
cabin          204 non-null object
embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [261]:
df.isna().sum()

passengerid      0
survived         0
pclass           0
name             0
sex              0
age            177
sibsp            0
parch            0
ticket           0
fare             0
cabin          687
embarked         2
dtype: int64

In [262]:
df.describe()

Unnamed: 0,passengerid,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [263]:
df.median()

passengerid    446.0000
survived         0.0000
pclass           3.0000
age             28.0000
sibsp            0.0000
parch            0.0000
fare            14.4542
dtype: float64

## missing values

In [264]:
df.isna().sum()

passengerid      0
survived         0
pclass           0
name             0
sex              0
age            177
sibsp            0
parch            0
ticket           0
fare             0
cabin          687
embarked         2
dtype: int64

In [265]:
# Fill the two missing ports with the most frequent one (pandas mode() ignores NaN)
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

### age

In [266]:
df.groupby(['pclass', 'sex'])[['age']].median()

pclass,sex,age
1,female,35.0
1,male,40.0
2,female,28.0
2,male,30.0
3,female,21.5
3,male,25.0


In [267]:
df.groupby(['embarked', 'sex'])[['age']].median()

embarked,sex,age
C,female,24.0
C,male,30.0
Q,female,21.5
Q,male,30.0
S,female,27.5
S,male,28.0


In [268]:
age = df.groupby(['pclass', 'embarked', 'sex'])[['age']].median()

In [269]:
age

pclass,embarked,sex,age
1,C,female,37.0
1,C,male,36.5
1,Q,female,33.0
1,Q,male,44.0
1,S,female,34.0
1,S,male,42.0
2,C,female,22.0
2,C,male,29.5
2,Q,female,30.0
2,Q,male,57.0


Replace missing ages with medians by class, port of embarkation, and sex

In [270]:
# Known ages of 3rd-class boys titled "Master" give the range used for the missing "Master" ages
master_3rd = (df['sex'] == 'male') & df['name'].str.contains('Master') & (df['pclass'] == 3)
q25, q75 = df.loc[master_3rd, 'age'].quantile([.25, .75])

# The generated array must have exactly one value per "Master" row with a missing age
missing_master = df['age'].isna() & (df['sex'] == 'male') & df['name'].str.contains('Master')
df.loc[missing_master, 'age'] = np.array(range(int(q25), int(q75 + 1), round(int(q75 - q25) / 4)))
# Remaining gaps: median age of the same class, sex, and port of embarkation
df.loc[df['age'].isnull(), 'age'] = df.groupby(['pclass', 'sex', 'embarked'])['age'].transform('median')

In [271]:
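# Look up training-set medians by group key; transform() output is indexed by training rows and would not line up with test_df
age_medians = df.groupby(['pclass', 'sex', 'embarked'])['age'].median()
test_keys = zip(test_df['pclass'], test_df['sex'], test_df['embarked'])
test_df['age'] = test_df['age'].fillna(pd.Series([age_medians.get(k, np.nan) for k in test_keys], index=test_df.index))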
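Filling test-set ages from training-set group medians needs a lookup by group key rather than by row position. A minimal sketch on made-up rows, assuming only two grouping keys for brevity; `train_toy`, `test_toy`, and `age_median` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

keys = ["pclass", "sex"]

# Made-up rows standing in for train.csv and test.csv
train_toy = pd.DataFrame({
    "pclass": [1, 1, 3, 3, 3],
    "sex": ["male", "male", "female", "female", "female"],
    "age": [40.0, 50.0, 20.0, 30.0, np.nan],
})
test_toy = pd.DataFrame({
    "pclass": [3, 1, 3],
    "sex": ["female", "male", "female"],
    "age": [np.nan, np.nan, 18.0],
})

# One median per group, learned from the training rows only
medians = train_toy.groupby(keys)["age"].median().rename("age_median").reset_index()

# A left merge attaches the matching median to every test row, in test row order
lookup = test_toy[keys].merge(medians, on=keys, how="left")["age_median"]
test_toy["age"] = test_toy["age"].fillna(pd.Series(lookup.to_numpy(), index=test_toy.index))
print(test_toy)
```

Known ages are left untouched; only the gaps receive the median of their group.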

# features

## selection

A minimal set of features

In [272]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [273]:
", ".join(df.columns.tolist())

'passengerid, survived, pclass, name, sex, age, sibsp, parch, ticket, fare, cabin, embarked'

In [274]:
columns_to_fit = 'survived, pclass, sex, age, sibsp, parch'.replace(' ','').split(',')
df = df[df.columns.intersection(columns_to_fit)].copy()  # copy, so new columns are not assigned to a slice

In [275]:
df['is_female'] = df['sex'].apply(lambda x: (x=='female')*1)
df.drop(['sex'], axis=1, inplace=True)

c_variables = ['pclass', 'sibsp', 'parch']
for c in c_variables:
    dummies = pd.get_dummies(df[c], prefix=c.lower())
    df = pd.concat([df, dummies], axis=1)
    df.drop([c], axis=1, inplace=True)
    
del dummies, c_variables

In [276]:
df.head()

Unnamed: 0,survived,age,is_female,pclass_1,pclass_2,pclass_3,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
0,0,22.0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0
1,1,38.0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2,1,26.0,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0
3,1,35.0,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
4,0,35.0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0


## scaling

In [277]:
min_max_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
columns_to_scale = ['age']
for c in columns_to_scale:
    df[[c]] = min_max_scaler.fit_transform(df[[c]])

In [278]:
df.head(3)

Unnamed: 0,survived,age,is_female,pclass_1,pclass_2,pclass_3,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
0,0,0.271174,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0
1,1,0.472229,1,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0
2,1,0.321438,1,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0


# modeling

In [279]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier


In [280]:
target = 'survived'
X = df.loc[:, df.columns != target]
y = df[target]

In [305]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [306]:
def fit_n_predict(model, X_train, y_train, cv):
    # Accuracy on the rows the model was fitted on (optimistic)
    model.fit(X_train, y_train)
    acc = round(model.score(X_train, y_train)*100, 2)
    # Out-of-fold predictions give a less biased accuracy estimate
    train_pred = cross_val_predict(model, X_train, y_train, cv=cv, n_jobs=-1)
    acc_cv = round(accuracy_score(y_train, train_pred)*100, 2)
    
    return train_pred, acc, acc_cv

## Random Forest

In [307]:
model = RandomForestClassifier()

In [308]:
train_pred, acc, acc_cv = fit_n_predict(model, X_train, y_train, 10)
acc, acc_cv


(93.26, 81.89)
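The gap between the two numbers suggests the forest fits the training rows much better than unseen ones. The `X_test` and `y_test` parts produced by `train_test_split` can serve as a further check. A minimal sketch on synthetic data (generated with `make_classification`, not the Titanic features; all `*_toy` names and hyperparameters are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the Titanic features; every value here is generated
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Estimate from folds of the training part
cv_acc = cross_val_score(forest, X_tr, y_tr, cv=5).mean()

forest.fit(X_tr, y_tr)
train_acc = forest.score(X_tr, y_tr)    # optimistic: the model has seen these rows
holdout_acc = forest.score(X_te, y_te)  # rows set aside by train_test_split

print(f"train={train_acc:.3f} cv={cv_acc:.3f} holdout={holdout_acc:.3f}")
```

Fixing `random_state` in both the split and the model also makes the reported accuracies reproducible between runs.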

## test data prepare

In [285]:
X_train.head()

Unnamed: 0,age,is_female,pclass_1,pclass_2,pclass_3,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6
391,0.258608,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0
596,0.359135,1,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0
18,0.384267,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0
751,0.070118,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0
12,0.246042,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0


In [298]:
test_df.head()

Unnamed: 0,age,is_female,pclass_1,pclass_2,pclass_3,sibsp_0,sibsp_1,sibsp_2,sibsp_3,sibsp_4,sibsp_5,sibsp_8,parch_0,parch_1,parch_2,parch_3,parch_4,parch_5,parch_6,parch_9
0,0.452881,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,0.617676,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0.81543,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
3,0.354004,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0
4,0.288086,1,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0


In [295]:
test_df.isna().sum()

age          0
is_female    0
pclass_1     0
pclass_2     0
pclass_3     0
sibsp_0      0
sibsp_1      0
sibsp_2      0
sibsp_3      0
sibsp_4      0
sibsp_5      0
sibsp_8      0
parch_0      0
parch_1      0
parch_2      0
parch_3      0
parch_4      0
parch_5      0
parch_6      0
parch_9      0
dtype: int64

In [288]:
test_df = test_df[test_df.columns.intersection(columns_to_fit)].copy()  # copy, so new columns are not assigned to a slice

In [292]:
test_df['is_female'] = test_df['sex'].apply(lambda x: (x=='female')*1)

In [293]:
test_df.drop(['sex'], axis=1, inplace=True)

In [294]:
c_variables = ['pclass', 'sibsp', 'parch']
for c in c_variables:
    dummies = pd.get_dummies(test_df[c], prefix=c.lower())
    test_df = pd.concat([test_df, dummies], axis=1)
    test_df.drop([c], axis=1, inplace=True)



In [297]:
for c in columns_to_scale:
    test_df[[c]] = min_max_scaler.fit_transform(test_df[[c]])
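Note that `fit_transform` re-learns the minimum and maximum from the test ages, so train and test ages can end up on slightly different scales. A common alternative, sketched here on made-up ages, is to fit once on the training column and only call `transform` on the test column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_age = np.array([[0.0], [20.0], [80.0]])  # made-up training ages
test_age = np.array([[10.0], [40.0]])          # made-up test ages

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train_age)                     # learn min and max from the training ages only
test_scaled = scaler.transform(test_age)  # reuse them: (x - 0) / (80 - 0)

# For comparison: refitting on the test ages maps the same values differently
test_refit = MinMaxScaler().fit_transform(test_age)
print(test_scaled.ravel(), test_refit.ravel())
```

With the shared scaler an age of 40 maps to 0.5 in both sets, whereas after refitting it becomes 1.0 simply because it is the largest test value.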

In [303]:
test_df = test_df.drop('parch_9', axis=1)
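Dropping `parch_9` by hand works for this particular test file. A more general option, sketched below on toy frames (the `*_toy` names and values are made up), is to align the test dummies to the training columns with `reindex`, which also adds any category that the test set lacks.

```python
import pandas as pd

train_toy = pd.DataFrame({"parch": [0, 1, 2, 0]})
test_toy = pd.DataFrame({"parch": [0, 9, 1]})  # 9 never occurs in the toy training rows

train_dummies = pd.get_dummies(train_toy["parch"], prefix="parch")
test_dummies = pd.get_dummies(test_toy["parch"], prefix="parch")

# Keep exactly the training columns, in the training order:
# extra test columns (parch_9) are dropped, missing ones (parch_2) are added as zeros
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_aligned.columns.tolist())
```

In the notebook the same call would take `X_train.columns` as the target column list, so the model always receives the features it was fitted on.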

## output

In [310]:
predictions = model.predict(test_df)

In [311]:
submission = pd.DataFrame()
submission['PassengerId'] = pd.read_csv(link+'\\'+test)['PassengerId']
submission['Survived'] = predictions
submission.head()

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,1
3,895,0
4,896,0


In [312]:
submission.to_csv('submission.csv', index=False)

# next target

Feature engineering

- age binned into groups with pd.cut
- sibsp and parch split into groups
- port of embarkation
- cabin
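The first two items could look roughly like this. It is a sketch on made-up rows; the bin edges, the labels, and the `family_size` feature are illustrative assumptions rather than tuned choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [0.42, 9.0, 22.0, 38.0, 80.0],  # made-up sample of ages
    "sibsp": [1, 0, 1, 4, 0],
    "parch": [2, 0, 0, 2, 0],
})

# Age bands; intervals are right-closed, e.g. (0, 12] is "child"
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Collapse sibsp and parch into one family-size feature, then into coarse groups
toy["family_size"] = toy["sibsp"] + toy["parch"] + 1  # +1 counts the passenger
toy["family_group"] = pd.cut(
    toy["family_size"],
    bins=[0, 1, 4, 20],
    labels=["alone", "small", "large"],
)
print(toy)
```

Grouping this way would also remove the sparse one-hot columns such as `sibsp_8` and `parch_6`, which have very few rows behind them.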

Model tuning
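One possible starting point for tuning is a small grid search over the forest's hyperparameters. The sketch below runs on synthetic data, and the grid values are arbitrary assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared Titanic features
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)

# Every combination in the grid is scored with 3-fold cross-validation
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a forest refitted on all the supplied rows with the best parameters found.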

Ensemble