## Titanic: Machine Learning from Disaster
#### Start here! Predict survival on the Titanic and get familiar with ML basics

This is the legendary Titanic ML competition – the best, first challenge for you to dive into ML competitions and familiarize yourself with how the Kaggle platform works.

The competition is simple: use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
    
##### The Challenge
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e. name, age, gender, socio-economic class, etc.).

Access the challenge: https://www.kaggle.com/c/titanic

Access my page on Kaggle: https://www.kaggle.com/marinaramalhete

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/titanic/gender_submission.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/train.csv


In [2]:
# Import libraries

from sklearn.tree import DecisionTreeClassifier

In [3]:
# Read data

gender = pd.read_csv('/kaggle/input/titanic/gender_submission.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
train = pd.read_csv('/kaggle/input/titanic/train.csv')

In [4]:
# View the data in gender

gender.head()

|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [5]:
# Drop the columns that will not be used

test.drop(['Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)
train.drop(['Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)

In [6]:
# View the data in test

test.head()

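|   | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 | 7.0 | S |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 896 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |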


In [7]:
# View the data in train

train.head()

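|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |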
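
The challenge asks which groups of passengers were more likely to survive. One way to start looking at that question is a grouped mean of `Survived`. The sketch below is only an illustration: it rebuilds the five rows shown in the `train.head()` output above rather than loading the full file, so the rates it prints describe those five rows and nothing more. The names `sample` and `rate_by_sex` are introduced here for the example.

```python
import pandas as pd

# Five rows copied from the train.head() output (subset of columns).
# This is NOT the full training set; it only illustrates the technique.
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Mean of a 0/1 column within each group is the survival rate of that group.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

In the notebook itself, the same call on the full `train` frame (for example `train.groupby('Sex')['Survived'].mean()`) would give rates over all training passengers.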


In [8]:
# Use pd.get_dummies to convert categorical variables into dummy/indicator variables
# In this dataset, the categorical columns are Sex and Embarked

new_train = pd.get_dummies(train)
new_test = pd.get_dummies(test)
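
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a few rows copied from the `train.head()` output. Two points are assumptions on my part rather than something the notebook does: `dtype=int` is passed so the result shows 0/1 (recent pandas versions return boolean columns by default), and `reindex` is used to add a column for a category that happens to be missing from the sample. The names `sample`, `expected_columns` and `aligned` are introduced only for this example.

```python
import pandas as pd

# Rows copied from the train.head() output (subset of columns).
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Each text column becomes one 0/1 column per distinct value.
encoded = pd.get_dummies(sample, dtype=int)
print(encoded.columns.tolist())

# The sample has no passenger who embarked at Q, so no Embarked_Q column
# is created. Reindexing against a fixed column list fills the gap with 0.
expected_columns = [
    "Pclass", "Sex_female", "Sex_male",
    "Embarked_C", "Embarked_Q", "Embarked_S",
]
aligned = encoded.reindex(columns=expected_columns, fill_value=0)
print(aligned.head())
```

In this notebook, train and test both contain every value of `Sex` and `Embarked`, so the encoded frames already line up. The `reindex` step only matters when one of the files lacks a category.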

In [9]:
# View the data in new train

new_train.head()

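|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 0 | 1 | 0 | 0 | 1 |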


In [10]:
# View the data in new test

new_test.head()

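|   | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | 34.5 | 0 | 0 | 7.8292 | 0 | 1 | 0 | 1 | 0 |
| 1 | 893 | 3 | 47.0 | 1 | 0 | 7.0 | 1 | 0 | 0 | 0 | 1 |
| 2 | 894 | 2 | 62.0 | 0 | 0 | 9.6875 | 0 | 1 | 0 | 1 | 0 |
| 3 | 895 | 3 | 27.0 | 0 | 0 | 8.6625 | 0 | 1 | 0 | 0 | 1 |
| 4 | 896 | 3 | 22.0 | 1 | 1 | 12.2875 | 1 | 0 | 0 | 0 | 1 |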


In [11]:
# Replace the null data in age and fare with the average of the data

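# Assign the result back instead of using inplace=True on a selected column,
# which newer pandas versions warn about and may not apply to the DataFrame
new_test['Age'] = new_test['Age'].fillna(new_test['Age'].mean())
new_test['Fare'] = new_test['Fare'].fillna(new_test['Fare'].mean())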

In [12]:
# Do the same for age in train
new_train['Age'] = new_train['Age'].fillna(new_train['Age'].mean())
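
The cells above fill each file with its own mean. A common variation, shown here only as a sketch and not as what this notebook does, is to compute the mean once on the training data and reuse it for the test data, so that both are filled with the same value. The frames `train_part` and `test_part` and their ages are small made-up examples, not rows from the competition files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with a few missing ages (illustrative values only).
train_part = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0]})
test_part = pd.DataFrame({"Age": [34.5, np.nan, 62.0]})

# Compute the fill value from the training data only (NaN is skipped by mean).
age_mean = train_part["Age"].mean()

# Assign the filled column back; both frames use the same training mean.
train_part["Age"] = train_part["Age"].fillna(age_mean)
test_part["Age"] = test_part["Age"].fillna(age_mean)

print(train_part["Age"].isnull().sum(), test_part["Age"].isnull().sum())
```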

In [13]:
# Check that there is no more null data

new_test.isnull().sum().sort_values(ascending = False).head(10)

Embarked_S    0
Embarked_Q    0
Embarked_C    0
Sex_male      0
Sex_female    0
Fare          0
Parch         0
SibSp         0
Age           0
Pclass        0
dtype: int64

In [14]:
new_train.isnull().sum().sort_values(ascending = False).head(10)

Embarked_S    0
Embarked_Q    0
Embarked_C    0
Sex_male      0
Sex_female    0
Fare          0
Parch         0
SibSp         0
Age           0
Pclass        0
dtype: int64

In [15]:
# Separate the data -> features (input) and target to create the model

X = new_train.drop('Survived', axis = 1)
y = new_train['Survived']

In [16]:
# Decision Tree model

tree = DecisionTreeClassifier(max_depth = 3, random_state = 0)
tree.fit(X, y)
tree.score(X, y)  # accuracy on the training data itself, not on held-out data

0.8271604938271605
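
The score above is measured on the same rows the tree was fitted on, so it tends to be optimistic. A minimal sketch of one way to get a held-out estimate is k-fold cross-validation with scikit-learn's `cross_val_score`. The data below is synthetic and generated only so the example runs on its own; `X_demo`, `y_demo` and the rule used to build the target are assumptions for illustration, not Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features (NOT the Titanic data).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),      # values 1, 2 or 3
    "Sex_female": rng.integers(0, 2, size=n),  # values 0 or 1
    "Age": rng.uniform(1, 70, size=n),
})
# Made-up target: 1 when the row is female or first class, else 0.
y_demo = ((X_demo["Sex_female"] == 1) | (X_demo["Pclass"] == 1)).astype(int)

# Same model settings as the notebook cell above.
model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Five folds: each fold is scored on rows the tree did not see while fitting.
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(scores, scores.mean())
```

In the notebook, passing `X` and `y` in place of `X_demo` and `y_demo` would give a cross-validated accuracy to compare with the training score.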

In [19]:
# Build the submission data: PassengerId plus the predicted Survived value

submission = pd.DataFrame()
submission['PassengerId'] = new_test['PassengerId']
submission['Survived'] = tree.predict(new_test)

In [20]:
submission.to_csv('submission.csv', index = False)

In [21]:
# View the data in submission

submission.head()

|   | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
