# MACHINE LEARNING MODEL: TITANIC MACHINE LEARNING FROM DISASTER

This notebook builds a predictive model that answers the question “What sorts of
people were more likely to survive?” using passenger data
(e.g. name, age, gender, socio-economic class).

## Collect Datasets
Load the training dataset

In [70]:
import numpy as np

In [71]:
import pandas as pd

In [72]:
import re

In [73]:
fname = 'train.csv'
data = pd.read_csv(fname)

### Dataset Info


In [74]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### View Original Dataset 

In [75]:
data.head(100)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
95,96,0,3,"Shorney, Mr. Charles Joseph",male,,0,0,374910,8.0500,,S
96,97,0,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C
97,98,1,1,"Greenfield, Mr. William Bertram",male,23.0,0,1,PC 17759,63.3583,D10 D12,C
98,99,1,2,"Doling, Mrs. John T (Ada Julia Bone)",female,34.0,0,1,231919,23.0000,,S


### Data Column names

In [76]:
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

## Cleaning the dataset

### Choosing the columns required to generate the model

#### Columns to be ignored (initial)
Columns ignored on the assumption that they have no effect on passenger survival:
* PassengerId
* Name

#### Input parameters
* Pclass (Proxy for Socio-economic Status)
    1 = 1st (Upper)
    2 = 2nd (Middle)
    3 = 3rd (Lower)
* Sex 
* Age (In Years)
* SibSp (# of siblings / spouses aboard the Titanic) 
* Parch (# of parents / children aboard the Titanic) 
* Ticket (Ticket number)
* Fare (Passenger Fare)
* Cabin (Cabin number)
* Embarked (Port of Embarkation)
    C = Cherbourg, 
    Q = Queenstown, 
    S = Southampton

#### Output Parameters
* Survived 
    0 = No
    1 = Yes



### Unknown/Missing/Error Values in Dataset

**Note:** Unknown or missing values are treated as NaN and encoded as -1.
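
As a small illustration of this convention, the sketch below applies it to a made-up three-row frame rather than the notebook's data. The name `toy` and its values are assumptions for demonstration only. It shows that pandas category codes already use -1 for missing entries, while a numeric column can be given the same sentinel with `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0],
    'Embarked': ['S', np.nan, 'C'],
})

# Numeric column: replace NaN with the sentinel -1.
toy['Age'] = toy['Age'].fillna(-1)

# Categorical column: .cat.codes assigns -1 to missing values automatically.
toy['Embarked'] = toy['Embarked'].astype('category').cat.codes

print(toy)
```

A sentinel such as -1 is convenient for tree-based models, but it does shift summary statistics (for example the mean of 'Age'), so it is worth keeping in mind when reading `.describe()` output.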

### New Dataset

In [77]:
data_size = len(data)
data_new = pd.DataFrame(data).drop(['PassengerId','Name'],axis=1)

### Category-to-numeric Conversion
The columns 'Sex', 'Cabin' and 'Embarked' need to be converted to numerical data.

#### Data Column: Sex
Category-mapping of the column "Sex":

In [78]:
data_new['Sex'] = pd.Categorical(data_new['Sex'],ordered=True,dtype='category')
cat_sex = dict(enumerate(data_new['Sex'].cat.categories))
print(cat_sex)

{0: 'female', 1: 'male'}


The numeric category is then applied to the column 'Sex':

In [79]:
data_new['Sex'] = data_new.Sex.cat.codes

#### Data Column: Cabin
Category-mapping of the column "Cabin":

In [80]:
data_new['Cabin'] = pd.Categorical(data_new['Cabin'],ordered=True,dtype='category')
cat_cabin = dict(enumerate(data_new['Cabin'].cat.categories))
# cat_cabin = dict(zip(data_new['Cabin'].cat.codes, data_new['Cabin']))
print(cat_cabin)


{0: 'A10', 1: 'A14', 2: 'A16', 3: 'A19', 4: 'A20', 5: 'A23', 6: 'A24', 7: 'A26', 8: 'A31', 9: 'A32', 10: 'A34', 11: 'A36', 12: 'A5', 13: 'A6', 14: 'A7', 15: 'B101', 16: 'B102', 17: 'B18', 18: 'B19', 19: 'B20', 20: 'B22', 21: 'B28', 22: 'B3', 23: 'B30', 24: 'B35', 25: 'B37', 26: 'B38', 27: 'B39', 28: 'B4', 29: 'B41', 30: 'B42', 31: 'B49', 32: 'B5', 33: 'B50', 34: 'B51 B53 B55', 35: 'B57 B59 B63 B66', 36: 'B58 B60', 37: 'B69', 38: 'B71', 39: 'B73', 40: 'B77', 41: 'B78', 42: 'B79', 43: 'B80', 44: 'B82 B84', 45: 'B86', 46: 'B94', 47: 'B96 B98', 48: 'C101', 49: 'C103', 50: 'C104', 51: 'C106', 52: 'C110', 53: 'C111', 54: 'C118', 55: 'C123', 56: 'C124', 57: 'C125', 58: 'C126', 59: 'C128', 60: 'C148', 61: 'C2', 62: 'C22 C26', 63: 'C23 C25 C27', 64: 'C30', 65: 'C32', 66: 'C45', 67: 'C46', 68: 'C47', 69: 'C49', 70: 'C50', 71: 'C52', 72: 'C54', 73: 'C62 C64', 74: 'C65', 75: 'C68', 76: 'C7', 77: 'C70', 78: 'C78', 79: 'C82', 80: 'C83', 81: 'C85', 82: 'C86', 83: 'C87', 84: 'C90', 85: 'C91', 86: 'C92

In [81]:
data_new['Cabin'] = data_new.Cabin.cat.codes

#### Data Column: Embarked
Category-mapping of the column "Embarked":

In [82]:
data_new['Embarked'] = pd.Categorical(data_new['Embarked'],ordered=True, dtype='category')
cat_embarked = dict(enumerate(data_new['Embarked'].cat.categories))
# print(cat_embarked)

In [83]:
data_new['Embarked'] = data_new.Embarked.cat.codes
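
The three conversions above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; `encode_categories` and the toy frame are hypothetical names introduced here, not part of the notebook. The helper also returns the code-to-label maps so that codes can be translated back later.

```python
import numpy as np
import pandas as pd

def encode_categories(df, columns):
    """Replace each listed column with integer codes; return the code-to-label maps."""
    mappings = {}
    for col in columns:
        as_cat = df[col].astype('category')
        mappings[col] = dict(enumerate(as_cat.cat.categories))
        df[col] = as_cat.cat.codes  # missing values become -1
    return mappings

# Hypothetical toy frame using the value sets described in this notebook.
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Embarked': ['S', 'C', 'Q', np.nan],
})
maps = encode_categories(toy, ['Sex', 'Embarked'])
print(maps)

# Translate codes back to labels; the sentinel -1 has no label and maps to NaN.
decoded = toy['Embarked'].map(maps['Embarked'])
print(decoded.tolist())
```

Keeping the maps around matters mostly for interpretation: the integer codes follow the sorted category order and carry no meaning of their own.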

#### Data Column: Age

In [87]:
# Replace missing ages with the sentinel -1 (.loc avoids chained assignment)
age_nan = data_new['Age'].isna()
data_new.loc[age_nan, 'Age'] = -1


#### Extract the numeric data from column: Ticket

Use a regular expression to extract the numeric part of each ticket.

In [88]:
data_new['Ticket']

0             A/5 21171
1              PC 17599
2      STON/O2. 3101282
3                113803
4                373450
             ...       
886              211536
887              112053
888          W./C. 6607
889              111369
890              370376
Name: Ticket, Length: 891, dtype: object

In [89]:
# lst = re.findall(r'[0-9.]+', data_new['Ticket'])

for line in data_new['Ticket']:
    line = line.rstrip()
    print(line)
#     tik_num = re.findall(r'\s*[0-9]+', line)
#     tik_num = re.findall(r'(\s([0-9]+)|[0-9]+)', line)
#     tik_num = re.findall(r'(\s([0-9]+))', line)
    # Match digits preceded by whitespace, or digits at the start of the string
    tik_num = re.findall(r'(\s[0-9]+)|(^[0-9]+)', line)
    print(tik_num)

A/5 21171
[(' 21171', '')]
PC 17599
[(' 17599', '')]
STON/O2. 3101282
[(' 3101282', '')]
113803
[('', '113803')]
373450
[('', '373450')]
330877
[('', '330877')]
17463
[('', '17463')]
349909
[('', '349909')]
347742
[('', '347742')]
237736
[('', '237736')]
PP 9549
[(' 9549', '')]
113783
[('', '113783')]
A/5. 2151
[(' 2151', '')]
347082
[('', '347082')]
350406
[('', '350406')]
248706
[('', '248706')]
382652
[('', '382652')]
244373
[('', '244373')]
345763
[('', '345763')]
2649
[('', '2649')]
239865
[('', '239865')]
248698
[('', '248698')]
330923
[('', '330923')]
113788
[('', '113788')]
349909
[('', '349909')]
347077
[('', '347077')]
2631
[('', '2631')]
19950
[('', '19950')]
330959
[('', '330959')]
349216
[('', '349216')]
PC 17601
[(' 17601', '')]
PC 17569
[(' 17569', '')]
335677
[('', '335677')]
C.A. 24579
[(' 24579', '')]
PC 17604
[(' 17604', '')]
113789
[('', '113789')]
2677
[('', '2677')]
A./5. 2152
[(' 2152', '')]
345764
[('', '345764')]
2651
[('', '2651')]
7546
[('', '7546')]
11668
[(

2625
[('', '2625')]
347089
[('', '347089')]
347063
[('', '347063')]
112050
[('', '112050')]
347087
[('', '347087')]
248723
[('', '248723')]
113806
[('', '113806')]
3474
[('', '3474')]
A/4 48871
[(' 48871', '')]
28206
[('', '28206')]
347082
[('', '347082')]
364499
[('', '364499')]
112058
[('', '112058')]
STON/O2. 3101290
[(' 3101290', '')]
S.C./PARIS 2079
[(' 2079', '')]
C 7075
[(' 7075', '')]
347088
[('', '347088')]
12749
[('', '12749')]
315098
[('', '315098')]
19972
[('', '19972')]
392096
[('', '392096')]
3101295
[('', '3101295')]
368323
[('', '368323')]
1601
[('', '1601')]
S.C./PARIS 2079
[(' 2079', '')]
367228
[('', '367228')]
113572
[('', '113572')]
2659
[('', '2659')]
29106
[('', '29106')]
2671
[('', '2671')]
347468
[('', '347468')]
2223
[('', '2223')]
PC 17756
[(' 17756', '')]
315097
[('', '315097')]
392092
[('', '392092')]
1601
[('', '1601')]
11774
[('', '11774')]
SOTON/O2 3101287
[(' 3101287', '')]
S.O./P.P. 3
[(' 3', '')]
113798
[('', '113798')]
2683
[('', '2683')]
315090
[(''
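
The loop above only prints the matches. One possible way to turn them into a numeric column is sketched below, assuming the ticket number is the trailing run of digits. The sample strings are taken from the output above, except `'NO NUMBER'`, which is a hypothetical entry added to show the -1 fallback; `ticket_num` is likewise a name introduced here.

```python
import pandas as pd

# A few ticket strings from the output above, plus one hypothetical
# entry without digits ('NO NUMBER') to show the fallback.
tickets = pd.Series(['A/5 21171', 'PC 17599', 'STON/O2. 3101282',
                     '113803', 'S.O./P.P. 3', 'NO NUMBER'])

# Capture the trailing run of digits; rows without one yield NaN.
digits = tickets.str.extract(r'(\d+)\s*$', expand=False)

# Convert to integers, using -1 for tickets with no numeric part.
ticket_num = pd.to_numeric(digits, errors='coerce').fillna(-1).astype(int)
print(ticket_num.tolist())
```

With the notebook's frame, the same expression could be applied to `data_new['Ticket']` and the result assigned back to that column.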

### Numeric Converted Dataset - Preview

In [90]:
data_new.head(100)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,0,3,1,22.0,1,0,A/5 21171,7.2500,-1,2
1,1,1,0,38.0,1,0,PC 17599,71.2833,81,0
2,1,3,0,26.0,0,0,STON/O2. 3101282,7.9250,-1,2
3,1,1,0,35.0,1,0,113803,53.1000,55,2
4,0,3,1,35.0,0,0,373450,8.0500,-1,2
...,...,...,...,...,...,...,...,...,...,...
95,0,3,1,-1.0,0,0,374910,8.0500,-1,2
96,0,1,1,71.0,0,0,PC 17754,34.6542,12,0
97,1,1,1,23.0,0,1,PC 17759,63.3583,91,0
98,1,2,0,34.0,0,1,231919,23.0000,-1,2


### Prediction Target

In [91]:
y = data_new.Survived
print(y)

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64


### Features/Inputs 

In [93]:
features = ['Pclass','Sex','Age','SibSp','Parch','Fare','Cabin', 'Embarked']

The feature data will be called 'X'.

In [94]:
X = data_new[features]

Review the data using .describe() and .head()


In [95]:
X.describe()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
count,891.0,891.0,891.0,891.0,891.0,891.0,891.0,891.0
mean,2.308642,0.647587,23.60064,0.523008,0.381594,32.204208,16.62963,1.529742
std,0.836071,0.47799,17.867496,1.102743,0.806057,49.693429,38.140335,0.800254
min,1.0,0.0,-1.0,0.0,0.0,0.0,-1.0,-1.0
25%,2.0,0.0,6.0,0.0,0.0,7.9104,-1.0,1.0
50%,3.0,1.0,24.0,0.0,0.0,14.4542,-1.0,2.0
75%,3.0,1.0,35.0,1.0,0.0,31.0,-1.0,2.0
max,3.0,1.0,80.0,8.0,6.0,512.3292,146.0,2.0


In [96]:
X.head(100)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,3,1,22.0,1,0,7.2500,-1,2
1,1,0,38.0,1,0,71.2833,81,0
2,3,0,26.0,0,0,7.9250,-1,2
3,1,0,35.0,1,0,53.1000,55,2
4,3,1,35.0,0,0,8.0500,-1,2
...,...,...,...,...,...,...,...,...
95,3,1,-1.0,0,0,8.0500,-1,2
96,1,1,71.0,0,0,34.6542,12,0
97,1,1,23.0,0,1,63.3583,91,0
98,2,0,34.0,0,1,23.0000,-1,2


## Build the model (random forest model)

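A random forest is an ensemble of decision trees, each trained on a random sample of the data, whose predictions are combined to classify each passenger.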

In [97]:
from sklearn.ensemble import RandomForestClassifier

## Train the model
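
A minimal sketch of this step, assuming scikit-learn's `RandomForestClassifier` with illustrative (untuned) settings. To stay self-contained it fits on nine encoded rows copied from the preview earlier in this notebook; `X_small` and `y_small` are names introduced here and stand in for the full `X` and `y`.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']

# Nine encoded rows copied from the preview above (indices 0-4 and 95-98).
rows = [
    [3, 1, 22.0, 1, 0, 7.2500, -1, 2],
    [1, 0, 38.0, 1, 0, 71.2833, 81, 0],
    [3, 0, 26.0, 0, 0, 7.9250, -1, 2],
    [1, 0, 35.0, 1, 0, 53.1000, 55, 2],
    [3, 1, 35.0, 0, 0, 8.0500, -1, 2],
    [3, 1, -1.0, 0, 0, 8.0500, -1, 2],
    [1, 1, 71.0, 0, 0, 34.6542, 12, 0],
    [1, 1, 23.0, 0, 1, 63.3583, 91, 0],
    [2, 0, 34.0, 0, 1, 23.0000, -1, 2],
]
X_small = pd.DataFrame(rows, columns=features)
y_small = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1], name='Survived')

# n_estimators and random_state are illustrative choices, not tuned values.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_small, y_small)

print(model.classes_)
```

In the notebook itself, `model.fit(X, y)` with the full 891-row frame would take the place of the nine-row stand-in.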

## Evaluate the model
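
One common way to evaluate a classifier is to hold out part of the data and measure accuracy on it. The sketch below assumes that approach, using `train_test_split` and `accuracy_score` from scikit-learn on nine encoded rows copied from the preview earlier in this notebook. The split size and seeds are illustrative assumptions, and a score computed on so few rows says nothing about real performance.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']

# Nine encoded rows copied from the preview above (indices 0-4 and 95-98).
rows = [
    [3, 1, 22.0, 1, 0, 7.2500, -1, 2],
    [1, 0, 38.0, 1, 0, 71.2833, 81, 0],
    [3, 0, 26.0, 0, 0, 7.9250, -1, 2],
    [1, 0, 35.0, 1, 0, 53.1000, 55, 2],
    [3, 1, 35.0, 0, 0, 8.0500, -1, 2],
    [3, 1, -1.0, 0, 0, 8.0500, -1, 2],
    [1, 1, 71.0, 0, 0, 34.6542, 12, 0],
    [1, 1, 23.0, 0, 1, 63.3583, 91, 0],
    [2, 0, 34.0, 0, 1, 23.0000, -1, 2],
]
X_small = pd.DataFrame(rows, columns=features)
y_small = pd.Series([0, 1, 1, 1, 0, 0, 0, 1, 1], name='Survived')

# Hold out three rows for validation; the rest are used for fitting.
X_train, X_val, y_train, y_val = train_test_split(
    X_small, y_small, test_size=3, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Fraction of held-out rows whose predicted label matches the true label.
accuracy = accuracy_score(y_val, model.predict(X_val))
print(f'Validation accuracy: {accuracy:.2f}')
```

With the full `X` and `y`, a fractional `test_size` (for example 0.2) would be the more usual choice.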

## Predict
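
A minimal sketch of prediction, assuming a fitted `RandomForestClassifier`. It fits on the first five encoded rows from the preview earlier in this notebook and predicts the remaining four (indices 95-98), which stand in for unseen passengers. The names `X_fit`, `y_fit` and `X_new` are introduced here for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked']

# Encoded rows copied from the preview above: indices 0-4 for fitting.
X_fit = pd.DataFrame([
    [3, 1, 22.0, 1, 0, 7.2500, -1, 2],
    [1, 0, 38.0, 1, 0, 71.2833, 81, 0],
    [3, 0, 26.0, 0, 0, 7.9250, -1, 2],
    [1, 0, 35.0, 1, 0, 53.1000, 55, 2],
    [3, 1, 35.0, 0, 0, 8.0500, -1, 2],
], columns=features)
y_fit = pd.Series([0, 1, 1, 1, 0], name='Survived')

# Indices 95-98 play the role of passengers the model has not seen.
X_new = pd.DataFrame([
    [3, 1, -1.0, 0, 0, 8.0500, -1, 2],
    [1, 1, 71.0, 0, 0, 34.6542, 12, 0],
    [1, 1, 23.0, 0, 1, 63.3583, 91, 0],
    [2, 0, 34.0, 0, 1, 23.0000, -1, 2],
], columns=features)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_fit, y_fit)

# Hard labels (0 = did not survive, 1 = survived) and class probabilities.
predictions = model.predict(X_new)
proba = model.predict_proba(X_new)
print(predictions)
print(proba)
```

Any new data must go through the same cleaning and encoding steps as the training frame, with the same category-to-code mapping, before it is passed to `predict`.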