# Kaggle Titanic Competition Attempt 1

First attempt at modelling a Kaggle dataset

Outline:

- Import Data
- Explore Data
- Data Manipulation
- Feature Selection
- Model Selection and Implementation
- Iteratively improve model



# Import Data

In [44]:
import pandas as pd

titanic_train_location = './data/train.csv'
titanic_test_location = './data/test.csv'

train_df = pd.read_csv(titanic_train_location)
test_df = pd.read_csv(titanic_test_location)

# Explore Training Data

In [45]:
train_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We can see that the data has 12 columns. Intuitively, 'PassengerId' and 'Name' are unlikely to have an impact on survival.

- 'Ticket', 'Embarked', 'Name', 'Sex', 'Survived', 'Pclass' and 'Cabin' are all categorical variables
- 'Survived' and 'Pclass' are already stored as numbers; 'Sex' and 'Embarked' could also be represented by numbers for analysis
- 'Age', 'SibSp', 'Parch' and 'Fare' are all quantitative variables

'Cabin' has missing values
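These observations can be checked in code rather than read off the preview. A minimal sketch, assuming a small hand-built frame (`sample_df`, a hypothetical name) that holds the five previewed rows in place of the full `train_df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df: the five rows printed by head()
# (Name and Ticket are omitted for brevity).
sample_df = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4, 5],
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Parch': [0, 0, 0, 0, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Cabin': [np.nan, 'C85', np.nan, 'C123', np.nan],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

# Columns pandas stores as numbers versus as text
numeric_cols = sample_df.select_dtypes(include='number').columns.tolist()
text_cols = sample_df.select_dtypes(exclude='number').columns.tolist()
print('numeric:', numeric_cols)
print('text:', text_cols)

# Missing values per column, showing only columns that have any
missing_counts = sample_df.isnull().sum()
print(missing_counts[missing_counts > 0])
```

Note that 'Survived' and 'Pclass' land in the numeric group because of how they are stored, even though they are categorical in meaning.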

In [46]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


The mean of 'Survived' shows that 38.38% of passengers in the training set survived.

'Age' has missing values: its count is 714, against 891 rows in total.
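Both readings follow from the `describe()` figures by simple arithmetic. A short sketch with the numbers copied by hand from that output (the variable names are illustrative only):

```python
# Figures copied from the describe() output
n_rows = 891              # count for a column with no gaps, e.g. Survived
age_count = 714           # non-missing Age values
survived_mean = 0.383838  # mean of the 0/1 Survived column

# The mean of a 0/1 column is the share of ones
survivors = round(survived_mean * n_rows)
missing_age = n_rows - age_count

print(f"Survival rate: {survived_mean:.2%} ({survivors} of {n_rows})")
print(f"Missing Age values: {missing_age} ({missing_age / n_rows:.1%} of rows)")
```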

# Data Manipulation

In [47]:
# Dummy-code the categorical features for the model: Sex, Embarked and Cabin
sex_dummies = pd.get_dummies(train_df.Sex, drop_first=True, prefix='Sex')
embarked_dummies = pd.get_dummies(train_df.Embarked, drop_first=True, prefix='Embarked')
# Cabin dummies are computed here but not used below
cabin_dummies = pd.get_dummies(train_df.Cabin, drop_first=True, prefix='Cabin')

# Add the encoded columns to the dataframe (Cabin is left out as it is too sparse)
train_df = pd.concat([train_df, sex_dummies, embarked_dummies], axis=1)

# Count missing values per column, then compare the shape before and after
# dropping the rows that have no Age
print(train_df.isnull().sum())
print(train_df.shape)
print(train_df.dropna(subset=['Age'], how='any').shape)

# Drop the rows with a missing Age
train_df.dropna(subset=['Age'], how='any', inplace=True)

#print(train_df.head())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
Sex_male         0
Embarked_Q       0
Embarked_S       0
dtype: int64
(891, 15)
(714, 15)
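One detail in this output is worth spelling out: 'Embarked' has 2 missing values, yet 'Embarked_Q' and 'Embarked_S' report none. `pd.get_dummies` gives a missing value zeros in every dummy column, and with `drop_first=True` an all-zero row is also how the dropped first category ('C' for Embarked) is encoded. A small sketch on made-up values (`toy_embarked` is a hypothetical series, not the real column):

```python
import pandas as pd

# Made-up values: one of each port plus one missing entry
toy_embarked = pd.Series(['S', 'C', None, 'Q'], name='Embarked')

# Same call as in the notebook; astype(int) only makes the printout uniform
# across pandas versions (newer versions return booleans)
dummies = pd.get_dummies(toy_embarked, drop_first=True, prefix='Embarked').astype(int)
print(dummies)

# Row 1 ('C', the dropped baseline) and row 2 (missing) are both all zeros
print(dummies.loc[1].tolist(), dummies.loc[2].tolist())
```

So the two passengers with no recorded port are silently treated as if they embarked at 'C'.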


# Feature Selection

In [48]:
y = train_df.Survived

# Haven't done feature selection yet
features = ['Pclass','Sex_male','Age','SibSp','Parch','Fare','Embarked_Q', 'Embarked_S']
X = train_df[features]



# Model Selection and Implementation

In [49]:
from sklearn.tree import DecisionTreeClassifier



DT_model = DecisionTreeClassifier(random_state = 1)

DT_model.fit(X, y)



DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=1, splitter='best')
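The model here is fitted on every available training row, so the notebook has no local estimate of how well it generalises. One common way to get one is to hold back part of the training data. The sketch below is an assumption about how that could look, not part of the original run: it uses synthetic stand-in data (`demo_X`, `demo_y`) and a made-up survival rule so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y; in the notebook the real X and y
# from the feature selection cell would be used instead.
rng = np.random.default_rng(1)
demo_X = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=200),
    'Sex_male': rng.integers(0, 2, size=200),
    'Age': rng.uniform(1, 80, size=200),
})
# Made-up labelling rule, purely so the example has something to learn
demo_y = ((demo_X['Sex_male'] == 0) | (demo_X['Pclass'] == 1)).astype(int)

# Hold back 25% of the rows for validation
train_X, val_X, train_y, val_y = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=1
)

demo_model = DecisionTreeClassifier(random_state=1)
demo_model.fit(train_X, train_y)

train_acc = accuracy_score(train_y, demo_model.predict(train_X))
val_acc = accuracy_score(val_y, demo_model.predict(val_X))
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

An unconstrained decision tree usually scores close to perfectly on the rows it was trained on, so the validation figure is the one to watch when iterating on the model.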

In [50]:
test_df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [51]:
# Get test data ready to feed into Model

# Dummy-code the categorical features used by the model: Sex and Embarked
test_sex_dummies = pd.get_dummies(test_df.Sex, drop_first=True, prefix='Sex')
test_embarked_dummies = pd.get_dummies(test_df.Embarked, drop_first=True, prefix='Embarked')

# Create new dataframe with encoded values (leave out cabins as this is too sparse)
test_df = pd.concat([test_df, test_sex_dummies, test_embarked_dummies], axis=1)

test_df.head()


Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Sex_male,Embarked_Q,Embarked_S
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q,1,1,0
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S,0,0,1
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q,1,1,0
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S,1,0,1
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S,0,0,1
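Encoding the train and test sets separately works here because both contain the same categories, so the same dummy columns come out. A hedged sketch of guarding against a mismatch, using made-up frames (`toy_train` and `toy_test` are hypothetical) and `DataFrame.reindex` to force the test columns to match the training ones:

```python
import pandas as pd

toy_train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
toy_test = pd.DataFrame({'Embarked': ['S', 'C', 'S']})  # no 'Q' in this toy test set

train_dummies = pd.get_dummies(toy_train.Embarked, drop_first=True, prefix='Embarked')
test_dummies = pd.get_dummies(toy_test.Embarked, drop_first=True, prefix='Embarked')
print(list(test_dummies.columns))  # Embarked_Q is absent

# Add any column seen in training but missing here, filled with 0,
# and put the columns in the training order
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Without the alignment step, selecting the model's feature columns from the toy test frame would fail on the missing 'Embarked_Q'.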


In [71]:
# Select the model features; .copy() gives an independent frame that is safe to modify
test_X = test_df[features].copy()

print(test_X.isnull().sum())

# Fill missing values with the training-set means reported by train_df.describe()
age_mean = 29.699118
fare_mean = 32.204208
test_X['Age'] = test_X['Age'].fillna(age_mean)
test_X['Fare'] = test_X['Fare'].fillna(fare_mean)
print(test_X.isnull().sum())

predictions = DT_model.predict(test_X)

# Pair each prediction with its PassengerId and write the submission file
submission = pd.DataFrame({'PassengerId': test_df.PassengerId, 'Survived': predictions})
submission.to_csv('prediction.csv', index=False)

Pclass         0
Sex_male       0
Age           86
SibSp          0
Parch          0
Fare           1
Embarked_Q     0
Embarked_S     0
dtype: int64
Pclass        0
Sex_male      0
Age           0
SibSp         0
Parch         0
Fare          0
Embarked_Q    0
Embarked_S    0
dtype: int64
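The fill values in the last cell were typed in by hand from the `train_df.describe()` output. A sketch of deriving them in code instead, so the numbers cannot drift from the data. The frames `toy_train` and `toy_test` are made up for illustration, and filling with the training mean simply mirrors the choice made in this notebook:

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    'Age': [20.0, 30.0, np.nan, 40.0],
    'Fare': [10.0, 20.0, 30.0, 40.0],
})
toy_test = pd.DataFrame({
    'Age': [25.0, np.nan],
    'Fare': [np.nan, 15.0],
})

# Column means from the training data only (NaNs are skipped by default)
fill_values = toy_train[['Age', 'Fare']].mean()

# Passing a Series to fillna fills each column with its matching value
# and returns a new frame, leaving toy_test untouched
toy_test_filled = toy_test.fillna(fill_values)
print(toy_test_filled)
```

In this notebook the means would have to be taken before the rows with a missing Age are dropped in order to reproduce the Fare figure of 32.204208, since that value was computed over all 891 training rows.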
