# Titanic Example

Here's an example of training a classifier on Kaggle's Titanic dataset, adapted from this [tutorial](https://www.kaggle.com/code/hitesh1724/titanic-1-fastai-beginner-tutorial). It uses fastai (which is built on PyTorch) to preprocess the data and a scikit-learn random forest as the model.

In [2]:
!pip install pandas fastai

import pandas as pd
import os

import fastcore
import fastai

from fastai.tabular.all import *



In [3]:
# Added code to visualize changes in DVCLive

!pip install dvclive
from dvclive import Live

live = Live("../dvclive_logs")  



### Prep The Data

First, import the train and test data obtained from Kaggle.

In [4]:
!pip install dvc
!pip install 'dvc[s3]'

!dvc pull ../../raw_data/data_set_1/data.dvc

df_test = pd.read_csv('../../raw_data/data_set_1/data/test.csv')
df_train = pd.read_csv('../../raw_data/data_set_1/data/train.csv')
df_train.head()


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


We can look at summary statistics for the numeric columns.

In [5]:
df_train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Next, we compute the fraction of missing values in each column. This is worth knowing early, since it gives some intuition about the data.

In [6]:
df_train.isnull().sum().sort_index()/len(df_train)

Age            0.198653
Cabin          0.771044
Embarked       0.002245
Fare           0.000000
Name           0.000000
Parch          0.000000
PassengerId    0.000000
Pclass         0.000000
Sex            0.000000
SibSp          0.000000
Survived       0.000000
Ticket         0.000000
dtype: float64
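
The expression above counts the nulls in each column and divides by the number of rows, so every value is a fraction between 0 and 1 (roughly 77% of `Cabin` is missing, for example). Here is a minimal sketch of the same calculation in plain Python. The `rows` data and the `missing_fraction` helper are made up for illustration and are not part of the notebook.

```python
# Toy rows standing in for a few passengers; None marks a missing value.
rows = [
    {"Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"Age": 38.0, "Cabin": "C85", "Embarked": "C"},
    {"Age": None, "Cabin": None, "Embarked": "Q"},
    {"Age": 35.0, "Cabin": "C123", "Embarked": "S"},
]

def missing_fraction(rows):
    """Return {column: share of rows where the value is None}, sorted by column name."""
    columns = sorted(rows[0])
    return {col: sum(r[col] is None for r in rows) / len(rows) for col in columns}

fractions = missing_fraction(rows)
print(fractions)  # {'Age': 0.25, 'Cabin': 0.5, 'Embarked': 0.0}
```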

Let's look at the datatypes of our columns to better understand which are continuous (floats), discrete (integers), or categorical (objects).

In [7]:
# Group the column names by their dtype
g_train = df_train.columns.to_series().groupby(df_train.dtypes).groups
g_train

{int64: ['PassengerId', 'Survived', 'Pclass', 'SibSp', 'Parch'], float64: ['Age', 'Fare'], object: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']}

Based on this, we group the column names into categorical and continuous features (`Survived` is left out because it is the target):

In [8]:
cat_names  = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
cont_names = ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']

### Preprocessing

Now we dive into fastai.

We use fastai's `TabularPandas` class, which handles the preprocessing for us. Before that, we split off a validation set so we can check that the model is not overfitting. `valid_pct=0.2` holds out 20% of the rows for validation.


In [9]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df_train))

to = TabularPandas(df_train, procs=[Categorify, FillMissing, Normalize],
                   cat_names = cat_names,
                   cont_names = cont_names,
                   y_names = 'Survived',
                   splits=splits)
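
To make the split concrete, here is a rough sketch of a random 80/20 index split in plain Python. It illustrates the idea only and is not fastai's implementation; the `random_split` helper and its `seed` argument are hypothetical names chosen for this example.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    """Shuffle the indices 0..n_items-1 and hold out valid_pct of them for validation."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_valid = int(n_items * valid_pct)
    # The first chunk becomes the validation set, the rest the training set
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = random_split(891, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))  # 713 178
```

Fixing the seed makes the split repeatable between runs, which helps when comparing accuracy numbers across experiments.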

Notice the new `Age_na` column in the output below. `FillMissing` created it because the `Age` column had missing values.

In [10]:
g_train =to.train.xs.columns.to_series().groupby(to.train.xs.dtypes).groups
g_train

{int8: ['Sex', 'Embarked', 'Age_na'], int16: ['Name', 'Ticket', 'Cabin'], float64: ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']}

In [11]:
to.train.xs.Age_na.head()

437    1
418    1
757    1
84     1
245    1
Name: Age_na, dtype: int8
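
As a rough illustration of what filling missing values involves, the sketch below replaces missing entries with the median of the observed ones and records which entries were filled. It assumes median filling and uses made-up ages; `fill_missing` is a hypothetical helper, not fastai's code.

```python
from statistics import median

ages = [22.0, None, 26.0, 35.0, None, 54.0]

def fill_missing(values):
    """Replace None with the median of the observed values and flag which entries were filled."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    filled = [fill_value if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

ages_filled, age_na = fill_missing(ages)
print(ages_filled)  # [22.0, 30.5, 26.0, 35.0, 30.5, 54.0]
print(age_na)       # [False, True, False, False, True, False]
```

Judging from the output in this notebook, fastai stores the flag as category codes rather than booleans, which would explain why `Age_na` shows 1 for rows whose age was present.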

In [12]:
to.train.xs

| | Name | Sex | Ticket | Cabin | Embarked | Age_na | PassengerId | Pclass | SibSp | Parch | Age | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 437 | 689 | 1 | 238 | 0 | 3 | 1 | -0.052811 | -0.378354 | 1.355471 | 3.108708 | -0.429218 | -0.269634 |
| 418 | 514 | 2 | 226 | 0 | 3 | 1 | -0.126240 | -0.378354 | -0.469743 | -0.472082 | 0.024901 | -0.388434 |
| 757 | 59 | 2 | 239 | 0 | 3 | 1 | 1.183901 | -0.378354 | -0.469743 | -0.472082 | -0.883337 | -0.419426 |
| 84 | 387 | 1 | 638 | 0 | 3 | 1 | -1.417058 | -0.378354 | -0.469743 | -0.472082 | -0.959023 | -0.440087 |
| 245 | 540 | 2 | 93 | 79 | 2 | 1 | -0.794837 | -1.577315 | 1.355471 | -0.472082 | 1.084512 | 1.202456 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 621 | 440 | 2 | 57 | 96 | 3 | 1 | 0.658298 | -1.577315 | 0.442864 | -0.472082 | 0.933139 | 0.428792 |
| 664 | 485 | 2 | 661 | 0 | 3 | 1 | 0.824481 | 0.820607 | 0.442864 | -0.472082 | -0.731964 | -0.493288 |
| 875 | 572 | 1 | 196 | 0 | 1 | 1 | 1.639938 | 0.820607 | -0.469743 | -0.472082 | -1.110396 | -0.507751 |
| 560 | 559 | 2 | 472 | 0 | 2 | 2 | 0.422550 | 0.820607 | -0.469743 | -0.472082 | -0.088629 | -0.496904 |
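
In the table above, the categorical columns have become integer codes and the continuous columns are centered around zero. The sketch below shows, on toy values, what those two steps amount to conceptually. It is a simplified illustration and not fastai's implementation: `categorify` and `normalize` are hypothetical helpers, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def categorify(values):
    """Map each distinct string to an integer code; 0 is reserved for missing values."""
    distinct = sorted({v for v in values if v is not None})
    vocab = {v: i for i, v in enumerate(distinct, start=1)}
    return [vocab.get(v, 0) for v in values], vocab

def normalize(values):
    """Rescale numbers to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

codes, vocab = categorify(["male", "female", "female", None, "male"])
print(vocab)  # {'female': 1, 'male': 2}
print(codes)  # [2, 1, 1, 0, 2]

# Fares of the first five training rows, rescaled
scaled = normalize([7.25, 71.2833, 7.925, 53.1, 8.05])
print([round(v, 3) for v in scaled])
```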


### Training

Now that our data is preprocessed, we can use scikit-learn's `RandomForestClassifier` to solve this problem.

In [13]:
from sklearn.ensemble import RandomForestClassifier

X_train = to.train.xs
X_valid = to.valid.xs

y_train = to.train.ys.values.ravel()
y_valid = to.valid.ys.values.ravel()

We now have a fully numeric table without writing any preprocessing code by hand. All we did was run the data through fastai's `TabularPandas`.

In [14]:
X_train.head()

| | Name | Sex | Ticket | Cabin | Embarked | Age_na | PassengerId | Pclass | SibSp | Parch | Age | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 437 | 689 | 1 | 238 | 0 | 3 | 1 | -0.052811 | -0.378354 | 1.355471 | 3.108708 | -0.429218 | -0.269634 |
| 418 | 514 | 2 | 226 | 0 | 3 | 1 | -0.12624 | -0.378354 | -0.469743 | -0.472082 | 0.024901 | -0.388434 |
| 757 | 59 | 2 | 239 | 0 | 3 | 1 | 1.183901 | -0.378354 | -0.469743 | -0.472082 | -0.883337 | -0.419426 |
| 84 | 387 | 1 | 638 | 0 | 3 | 1 | -1.417058 | -0.378354 | -0.469743 | -0.472082 | -0.959023 | -0.440087 |
| 245 | 540 | 2 | 93 | 79 | 2 | 1 | -0.794837 | -1.577315 | 1.355471 | -0.472082 | 1.084512 | 1.202456 |


In [15]:
rnf_classifier= RandomForestClassifier(n_estimators=100, n_jobs=-1)
rnf_classifier.fit(X_train,y_train)

We have now trained the random forest classifier. Next, we measure its accuracy on the validation set.

In [16]:
from sklearn.metrics import accuracy_score

y_pred = rnf_classifier.predict(X_valid)
# accuracy_score expects the true labels first, then the predictions
acc = accuracy_score(y_valid, y_pred)
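
Accuracy is simply the share of predictions that match the true labels. A minimal hand-rolled version, with made-up labels, looks like this (the `accuracy` helper and the demo lists are illustrative and are not used elsewhere in the notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]
print(accuracy(y_true_demo, y_pred_demo))  # 0.8
```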

In [17]:
# Make sure to log the accuracy in DVCLive

live.log_metric('accuracy', acc)
live.next_step()

### Test Dataset

In [18]:
df_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


Doing the same preprocessing as before:

In [19]:
# Group the test column names by their dtype
g_test = df_test.columns.to_series().groupby(df_test.dtypes).groups
g_test

{int64: ['PassengerId', 'Pclass', 'SibSp', 'Parch'], float64: ['Age', 'Fare'], object: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']}

In [20]:
cat_names  = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
cont_names = ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']

In [21]:
test = TabularPandas(df_test, procs=[Categorify, FillMissing, Normalize],
                     cat_names=cat_names,
                     cont_names=cont_names)

In [22]:
X_test= test.train.xs


In [23]:
X_test.head()

| | Name | Sex | Ticket | Cabin | Embarked | Age_na | Fare_na | PassengerId | Pclass | SibSp | Parch | Age | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 207 | 2 | 153 | 0 | 2 | 1 | 1 | -1.727912 | 0.873482 | -0.49947 | -0.400248 | 0.386231 | -0.497413 |
| 1 | 404 | 1 | 222 | 0 | 3 | 1 | 1 | -1.719625 | 0.873482 | 0.616992 | -0.400248 | 1.37137 | -0.512278 |
| 2 | 270 | 2 | 74 | 0 | 2 | 1 | 1 | -1.711337 | -0.315819 | -0.49947 | -0.400248 | 2.553537 | -0.4641 |
| 3 | 409 | 2 | 148 | 0 | 3 | 1 | 1 | -1.70305 | 0.873482 | -0.49947 | -0.400248 | -0.204852 | -0.482475 |
| 4 | 179 | 1 | 139 | 0 | 3 | 1 | 1 | -1.694763 | 0.873482 | 0.616992 | 0.619896 | -0.598908 | -0.417491 |


In [24]:
# Group the preprocessed test column names by their dtype
g_test = X_test.columns.to_series().groupby(X_test.dtypes).groups
g_test


{int8: ['Sex', 'Cabin', 'Embarked', 'Age_na', 'Fare_na'], int16: ['Name', 'Ticket'], float64: ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']}

In [25]:
# The training data had no missing fares, so the classifier was not trained on a Fare_na column
X_test = X_test.drop('Fare_na', axis=1)
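
One caveat worth keeping in mind: building a second `TabularPandas` on the test set refits the category codes and normalization statistics on the test data, so the encoded values may not line up with what the classifier saw during training. The extra `Fare_na` column dropped above is one visible symptom. The sketch below illustrates the concern on toy data. `fit_vocab` and `encode` are hypothetical helpers, not fastai functions.

```python
def fit_vocab(values):
    """Build a category -> code mapping from the given values (0 is kept for unseen or missing)."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

def encode(values, vocab):
    """Encode values with an existing mapping; anything not in the mapping becomes 0."""
    return [vocab.get(v, 0) for v in values]

train_embarked = ["S", "C", "Q", "S"]
test_embarked = ["Q", "S", "Q"]

train_vocab = fit_vocab(train_embarked)  # {'C': 1, 'Q': 2, 'S': 3}
refit_vocab = fit_vocab(test_embarked)   # {'Q': 1, 'S': 2}

print(encode(test_embarked, train_vocab))  # [2, 3, 2]  codes the model was trained on
print(encode(test_embarked, refit_vocab))  # [1, 2, 1]  same values, different codes
```

Reusing the mapping fitted on the training data keeps the codes consistent. fastai's `TabularPandas.new(...)` followed by `.process()` looks like the intended route for reusing fitted transforms, but treat that as an assumption to check against the fastai docs.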

In [26]:
y_pred=rnf_classifier.predict(X_test)

In [27]:
y_pred= y_pred.astype(int)

In [28]:
output= pd.DataFrame({'PassengerId':df_test.PassengerId, 'Survived': y_pred})
output.to_csv('my_submission_titanic.csv', index=False)
output.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
