### Using machine learning to create a model that predicts which passengers survived the Titanic shipwreck.



#### Loading Dataset

In [1]:
import pandas as pd
dataset = pd.read_csv("./data/dataset.csv")

#### First, we need to check whether the dataset has NaN values and fill them with a value that suits each column.
Below, we can see that some columns contain NaN values.

In [2]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
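The non-null counts above imply 177 missing values in Age, 687 in Cabin and 2 in Embarked (out of 891 rows). A more direct way to get those numbers is to count the NaN entries per column. The snippet below is only a sketch on a made-up toy frame (`toy` and its values are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing entries per column.
missing_counts = toy.isna().sum()
# Fraction of rows that are missing, per column.
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```

On the real data, the same two calls on `dataset` should make it easier to judge which columns are worth imputing and which are mostly empty.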


Drop the Cabin column, because most of its values are NaN (only 204 of 891 are non-null) and it probably isn't significant data.

In [3]:
dataset.drop(columns=["Cabin"], inplace=True)

Fill the NaN values in the Age column with the column mean. Age is a continuous numeric value, so the mean should be a reasonable choice.

In [4]:
# mean() skips NaN by default, so dropna() is not needed; assign the result back to the column
dataset["Age"] = dataset["Age"].fillna(dataset["Age"].mean())
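To make the mean imputation concrete, here is a small sketch on a made-up series (`ages` is an assumption, not the real column). It shows that `mean()` already ignores NaN, and it assigns the filled result back rather than relying on `inplace=True`, since in recent pandas versions an in-place fill on a selected column may emit a warning or leave the original frame unchanged.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0, np.nan])  # made-up values

# Series.mean() skips NaN by default, so no dropna() is needed first.
mean_age = ages.mean()

# fillna returns a new Series; keep the result instead of mutating in place.
filled = ages.fillna(mean_age)

print(mean_age)         # 30.0
print(filled.tolist())  # [20.0, 30.0, 40.0, 30.0]
```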

Fill the NaN values in the Embarked column with the mode, because it is categorical data and only 2 values are missing.

In [5]:
# mode() returns a Series (ties are possible), so take its first entry
dataset["Embarked"] = dataset["Embarked"].fillna(dataset["Embarked"].mode()[0])
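The `[0]` after `mode()` can look odd at first. A minimal sketch on made-up values (`ports` is an assumption) shows why it is there:

```python
import pandas as pd

ports = pd.Series(["S", "C", "S", None, "Q", None])  # made-up values

# mode() returns a Series because several values can tie for most frequent;
# the result is sorted, and [0] picks the first entry.
most_common = ports.mode()[0]

filled_ports = ports.fillna(most_common)

print(most_common)            # S
print(filled_ports.tolist())  # ['S', 'C', 'S', 'S', 'Q', 'S']
```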

Now we can see that the dataset has no missing (NaN) values.

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Embarked       891 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB


#### Now, let's split the data into training and test sets

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.drop(columns=["Survived"]), dataset["Survived"], test_size=0.3, random_state=42)

In [8]:
X_train.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 445 | 446 | 1 | Dodge, Master. Washington | male | 4.0 | 0 | 2 | 33638 | 81.8583 | S |
| 650 | 651 | 3 | Mitkoff, Mr. Mito | male | 29.699118 | 0 | 0 | 349221 | 7.8958 | S |
| 172 | 173 | 3 | Johnson, Miss. Eleanor Ileen | female | 1.0 | 1 | 1 | 347742 | 11.1333 | S |
| 450 | 451 | 2 | West, Mr. Edwy Arthur | male | 36.0 | 1 | 2 | C.A. 34651 | 27.75 | S |
| 314 | 315 | 2 | Hart, Mr. Benjamin | male | 43.0 | 1 | 1 | F.C.C. 13529 | 26.25 | S |
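With `test_size=0.3`, the 891 rows should split into 623 training rows and 268 test rows. The split in this notebook is purely random; one possible refinement, which the notebook does not use, is the `stratify` parameter, which keeps the share of survivors roughly equal in both parts. A sketch on made-up data (the arrays `X` and `y` are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 30% positive class.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 30 + [0] * 70)

# stratify=y keeps the class ratio (roughly) equal in both splits;
# random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```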


#### Then, let's encode the categorical columns

In [9]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
X_train_encoded = pd.DataFrame(encoder.fit_transform(X_train[["Pclass", "Sex", "Embarked"]]).toarray())

#### Let's also normalize the numerical columns

In [10]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_norm = pd.DataFrame(scaler.fit_transform(X_train[["Age", "SibSp", "Parch", "Fare"]]))


In [11]:
X_train_norm.head()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 0.044986 | 0.0 | 0.333333 | 0.159777 |
| 1 | 0.367921 | 0.0 | 0.0 | 0.015412 |
| 2 | 0.007288 | 0.125 | 0.166667 | 0.021731 |
| 3 | 0.447097 | 0.125 | 0.333333 | 0.054164 |
| 4 | 0.535059 | 0.125 | 0.166667 | 0.051237 |
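`MinMaxScaler` rescales each column with (x - min) / (max - min), where min and max come from the data it was fitted on. For example, the first Age value above (0.044986 for an age of 4.0) is consistent with a training minimum of 0.42 and a maximum of 80. The sketch below checks the formula by hand on made-up values (`ages` is an assumption):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[2.0], [20.0], [38.0], [80.0]])  # made-up values

scaler = MinMaxScaler()
scaled = scaler.fit_transform(ages)

# Same result by hand: (x - min) / (max - min), computed per column.
manual = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# A value outside the fitted range is not clipped by default,
# so it lands outside [0, 1].
outside = scaler.transform(np.array([[100.0]]))

print(scaled.ravel())
print(outside.ravel())
```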


Now we only need to concatenate the categorical DataFrame and the numerical DataFrame.

In [12]:
X_train_processed = pd.concat([X_train_encoded, X_train_norm], axis=1)

In [13]:
X_train_processed.head()

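| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.044986 | 0.0 | 0.333333 | 0.159777 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.367921 | 0.0 | 0.0 | 0.015412 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.007288 | 0.125 | 0.166667 | 0.021731 |
| 3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.447097 | 0.125 | 0.333333 | 0.054164 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.535059 | 0.125 | 0.166667 | 0.051237 |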
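The concatenation works here because both frames were built from plain arrays and so share the same fresh 0..n-1 index. `pd.concat(axis=1)` aligns rows by index label, not by position, so mixing a frame that still carries the original row labels with a freshly indexed one would silently produce NaN rows. A small sketch with made-up frames (`left` and `right` are assumptions):

```python
import pandas as pd

# Two frames with the same number of rows but different index labels.
left = pd.DataFrame({"a": [1.0, 2.0]}, index=[445, 650])  # original row labels
right = pd.DataFrame({"b": [0.1, 0.2]})                   # fresh 0..n-1 index

# concat(axis=1) aligns on index labels, so mismatched labels give NaN rows.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first lines the rows up by position.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```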


So, the data is ready to be used in a machine learning algorithm. I chose the Random Forest algorithm.

In [14]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train_processed, y_train)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
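The output above shows `random_state=None` and `n_estimators=10`, so each run trains a different small forest and the accuracy can vary from run to run (newer scikit-learn versions default to 100 trees). As a sketch on made-up data, fixing `random_state` makes the result repeatable, and `feature_importances_` gives a rough idea of which inputs the forest relies on:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = (X[:, 0] > 0.5).astype(int)

# random_state fixes the bootstrap samples and feature choices,
# so repeated runs give the same forest.
rf_a = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

same_predictions = np.array_equal(rf_a.predict(X), rf_b.predict(X))

print(same_predictions)
print(rf_a.feature_importances_)  # importances sum to 1
```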

We will use the test dataset to evaluate the classifier. First, we need to process it too, reusing the scaler and encoder that were fitted on the training data.

In [15]:
X_test_norm = pd.DataFrame(scaler.transform(X_test[["Age", "SibSp", "Parch", "Fare"]]))
X_test_encoded = pd.DataFrame(encoder.transform(X_test[["Pclass", "Sex", "Embarked"]]).toarray())
X_test_processed = pd.concat([X_test_encoded, X_test_norm], axis=1)

In [16]:
X_test_processed.head()

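| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.367921 | 0.125 | 0.166667 | 0.029758 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.384267 | 0.0 | 0.0 | 0.020495 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.246042 | 0.0 | 0.0 | 0.015469 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.070118 | 0.0 | 0.166667 | 0.064412 |
| 4 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.170646 | 0.125 | 0.0 | 0.021942 |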
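Repeating every preprocessing step by hand for the test data is easy to get wrong (for example by calling `fit_transform` again instead of `transform`). One alternative, not used in this notebook, is to bundle the encoder, the scaler and the classifier into a single scikit-learn `Pipeline`. The sketch below uses made-up rows and only a subset of the columns; all names in it (`train`, `test`, `labels`, `model`) are assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows with a subset of the Titanic columns.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
labels = [0, 1, 1, 0]
test = pd.DataFrame(
    {"Sex": ["female"], "Embarked": ["S"], "Age": [30.0], "Fare": [10.0]}
)

# Route categorical columns to the encoder and numeric ones to the scaler.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex", "Embarked"]),
    ("num", MinMaxScaler(), ["Age", "Fare"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

model.fit(train, labels)          # fits encoder and scaler on training rows only
prediction = model.predict(test)  # reuses the fitted encoder and scaler

print(prediction)
```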


#### So, let's predict using the classifier

In [17]:
y_pred = rf.predict(X_test_processed)

#### To measure the accuracy of the model, we will use accuracy_score from sklearn

In [18]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8022388059701493

#### This means that the classifier correctly predicted about 80% of the test dataset 😀
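An accuracy of 0.8022 on a 268-row test set is consistent with 215 correct predictions. Accuracy alone does not say which kind of mistake the model makes, though. A confusion matrix separates passengers wrongly predicted to survive from those wrongly predicted to die. The sketch below uses made-up labels (`y_true` and `y_hat` are assumptions, not the notebook's results):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_hat)

print(acc)          # 0.8
print(cm.tolist())  # [[4, 1], [1, 4]]
```

The diagonal of the matrix holds the correct predictions, so its sum divided by the total equals the accuracy.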