# Titanic Example

Here's an example of training a model on Kaggle's Titanic dataset. It uses fastai (built on PyTorch) for preprocessing and a scikit-learn random forest as the classifier, and is adapted from this [tutorial](https://www.kaggle.com/code/hitesh1724/titanic-1-fastai-beginner-tutorial).

In [1]:
import pandas as pd
import os

import fastcore
import fastai

from fastai.tabular.all import *

In [2]:
# Added code to visualize changes in DVCLive

from dvclive import Live

live = Live("../dvclive_logs")  

### Prep The Data

First, import the train and test data obtained from Kaggle.

In [6]:
df_test = pd.read_csv('../data/test.csv')
df_train = pd.read_csv('../data/train.csv')
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


We can take a statistical look at the data.

In [7]:
df_train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Next we calculate the fraction of null values in each column. This is worth knowing up front, because it gives a first intuition about the data.

In [8]:
df_train.isnull().sum().sort_index()/len(df_train)

Age            0.198653
Cabin          0.771044
Embarked       0.002245
Fare           0.000000
Name           0.000000
Parch          0.000000
PassengerId    0.000000
Pclass         0.000000
Sex            0.000000
SibSp          0.000000
Survived       0.000000
Ticket         0.000000
dtype: float64
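The same fractions can also be computed with isna().mean(), because the mean of a boolean mask is the share of True values. Here is a minimal sketch on a small made-up frame (the toy values are for illustration only and are not the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fraction of missing values per column, largest first.
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Same numbers as the sum/len form used in the cell above.
same_as_before = toy.isnull().sum() / len(toy)
```

Sorting by value rather than by index puts the most incomplete columns at the top, which is usually what we want to see first.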

Let's look at the datatypes of our columns to better understand which are continuous (floats), discrete (integers), or categorical (objects).

In [9]:
# Group the column names by their dtype
g_train = df_train.columns.to_series().groupby(df_train.dtypes).groups
g_train

{int64: ['PassengerId', 'Survived', 'Pclass', 'SibSp', 'Parch'], float64: ['Age', 'Fare'], object: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']}

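Based on this, we group the column names into categorical and continuous. The object columns become categorical, and the numeric columns (except the target, Survived) are treated as continuous: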

In [10]:
cat_names  = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
cont_names = ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']
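The two lists can also be derived from the dtypes instead of being typed by hand. The following is a sketch on a toy frame, assuming the same rule as above (non-numeric columns are categorical, numeric columns other than the target are continuous); the names toy, target, cat_cols and cont_cols are made up for the example:

```python
import pandas as pd

# Toy frame with a few Titanic-like columns; the values are made up.
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

target = "Survived"

# Non-numeric columns are treated as categorical.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Numeric columns, minus the target, are treated as continuous.
cont_cols = toy.select_dtypes(include="number").columns.drop(target).tolist()

print(cat_cols)   # ['Sex', 'Embarked']
print(cont_cols)  # ['Pclass', 'Age']
```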

### Preprocessing

Now we dive into fastai.

Here we use fastai's TabularPandas class, which handles the preprocessing for us. Before that, we split off a validation set so that we can check we are not overfitting. valid_pct=0.2 means that 20% of the rows are held out for validation.


In [11]:
splits = RandomSplitter(valid_pct=0.2)(range_of(df_train))

to = TabularPandas(df_train, procs=[Categorify, FillMissing, Normalize],
                   cat_names = cat_names,
                   cont_names = cont_names,
                   y_names = 'Survived',
                   splits=splits)
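Conceptually, a random split shuffles the row indices and holds out the requested fraction. The sketch below illustrates that idea in plain Python; it is not fastai's implementation, and random_split and its seed parameter are hypothetical names used only here.

```python
import random

def random_split(n_rows, valid_pct=0.2, seed=None):
    """Shuffle range(n_rows) and return (train_idx, valid_idx)."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)
    cut = int(valid_pct * n_rows)  # number of validation rows
    return idx[cut:], idx[:cut]

# 891 is the number of rows in the training CSV (see describe() above).
train_idx, valid_idx = random_split(891, valid_pct=0.2, seed=42)
print(len(train_idx), len(valid_idx))  # 713 178
```

Fixing a seed makes the split repeatable between runs, which helps when comparing metrics across experiments.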

Here you can see a new column, 'Age_na'. FillMissing created it because the continuous column Age had missing values.

In [12]:
g_train = to.train.xs.columns.to_series().groupby(to.train.xs.dtypes).groups
g_train

{int8: ['Sex', 'Embarked', 'Age_na'], int16: ['Name', 'Ticket', 'Cabin'], float64: ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']}

In [13]:
to.train.xs.Age_na.head()

53     1
434    1
32     2
577    1
232    1
Name: Age_na, dtype: int8
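To make the role of Age_na concrete, here is a rough pandas equivalent of a fill-missing step for one continuous column: remember which rows were missing, then fill the gaps with the median. This is an illustrative sketch with made-up ages, not fastai's code. In the real pipeline the boolean flag is then integer-encoded by Categorify, which is why the output above shows 1 and 2 rather than False and True.

```python
import numpy as np
import pandas as pd

# Made-up ages with two missing entries.
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan], name="Age")

age_na = ages.isna()            # True where the value was missing
median_age = ages.median()      # NaN values are skipped by default
ages_filled = ages.fillna(median_age)

print(median_age)               # 26.0
print(ages_filled.tolist())     # [22.0, 26.0, 26.0, 35.0, 26.0]
print(age_na.tolist())          # [False, True, False, False, True]
```

Keeping the flag lets the model distinguish a genuine median-aged passenger from one whose age was simply unknown.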

In [14]:
to.train.xs

| | Name | Sex | Ticket | Cabin | Embarked | Age_na | PassengerId | Pclass | SibSp | Parch | Age | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | 247 | 1 | 240 | 0 | 3 | 1 | -1.517679 | -0.386096 | 0.432222 | -0.486631 | -0.013031 | -0.108152 |
| 434 | 745 | 2 | 73 | 129 | 3 | 1 | -0.038630 | -1.593492 | 0.432222 | -0.486631 | 1.626563 | 0.564498 |
| 32 | 290 | 1 | 289 | 0 | 2 | 2 | -1.599201 | 0.821300 | -0.474173 | -0.486631 | -0.091107 | -0.518715 |
| 577 | 746 | 1 | 73 | 129 | 3 | 1 | 0.516499 | -1.593492 | 0.432222 | -0.486631 | 0.767728 | 0.564498 |
| 232 | 755 | 2 | 129 | 0 | 3 | 1 | -0.822797 | -0.386096 | -0.474173 | -0.486631 | 2.329245 | -0.389360 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | 102 | 2 | 436 | 0 | 2 | 1 | -0.993606 | 0.821300 | 0.432222 | 0.798435 | 0.845804 | -0.344366 |
| 671 | 203 | 2 | 570 | 39 | 3 | 1 | 0.881409 | -1.593492 | 0.432222 | -0.486631 | 0.143121 | 0.476761 |
| 154 | 613 | 2 | 574 | 0 | 3 | 2 | -1.125595 | 0.821300 | -0.474173 | -0.486631 | -0.091107 | -0.528558 |
| 273 | 578 | 2 | 595 | 55 | 1 | 1 | -0.663635 | -1.593492 | -0.474173 | 0.798435 | 0.611576 | -0.024914 |
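The integer codes and the standardized floats in this table come from the Categorify and Normalize steps. The following is a rough pandas sketch of both ideas on made-up values. It is not fastai's implementation, and the use of the population standard deviation (ddof=0) is an assumption made for the example.

```python
import pandas as pd

# Made-up training values for one categorical and one continuous column.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Categorify-like step: map each category to an integer code.
# pandas uses -1 for missing values, so adding 1 reserves 0 for "missing".
sex_codes = pd.Categorical(train["Sex"]).codes + 1
print(sex_codes.tolist())  # [2, 1, 1, 2]

# Normalize-like step: subtract the mean, divide by the standard deviation.
fare_mean = train["Fare"].mean()
fare_std = train["Fare"].std(ddof=0)
fare_norm = (train["Fare"] - fare_mean) / fare_std
```

After this step fare_norm has mean 0 and standard deviation 1, which is why the continuous columns in the table are small values centered around zero.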


### Training

Now that our data is preprocessed, we can use scikit-learn's RandomForestClassifier to solve this problem.

In [15]:
from sklearn.ensemble import RandomForestClassifier

X_train = to.train.xs
X_valid = to.valid.xs

y_train = to.train.ys.values.ravel()
y_valid = to.valid.ys.values.ravel()

We now have a model-ready table without writing any preprocessing by hand. All we did was use fastai's tabular tools to get it.

In [16]:
X_train.head()

| | Name | Sex | Ticket | Cabin | Embarked | Age_na | PassengerId | Pclass | SibSp | Parch | Age | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 53 | 247 | 1 | 240 | 0 | 3 | 1 | -1.517679 | -0.386096 | 0.432222 | -0.486631 | -0.013031 | -0.108152 |
| 434 | 745 | 2 | 73 | 129 | 3 | 1 | -0.03863 | -1.593492 | 0.432222 | -0.486631 | 1.626563 | 0.564498 |
| 32 | 290 | 1 | 289 | 0 | 2 | 2 | -1.599201 | 0.8213 | -0.474173 | -0.486631 | -0.091107 | -0.518715 |
| 577 | 746 | 1 | 73 | 129 | 3 | 1 | 0.516499 | -1.593492 | 0.432222 | -0.486631 | 0.767728 | 0.564498 |
| 232 | 755 | 2 | 129 | 0 | 3 | 1 | -0.822797 | -0.386096 | -0.474173 | -0.486631 | 2.329245 | -0.38936 |


In [17]:
# 100 trees, trained in parallel on all available cores
rnf_classifier = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rnf_classifier.fit(X_train, y_train)

We just trained a random forest classifier. Now we predict on the validation set and measure the accuracy.

In [18]:
from sklearn.metrics import accuracy_score

y_pred = rnf_classifier.predict(X_valid)
# accuracy_score expects the true labels first, then the predictions
acc = accuracy_score(y_valid, y_pred)
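Accuracy is simply the fraction of predictions that match the true labels. Here is a plain-Python sketch of the same calculation on made-up labels (the accuracy helper below is written for this example and is not part of scikit-learn):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels: four of the five predictions are right.
print(accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # 0.8
```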

In [19]:
# Make sure to log the accuracy in DVCLive

live.log_metric('accuracy', acc)
live.next_step()

### Test Dataset

In [20]:
df_test.head()

| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


We apply the same preprocessing steps as before:

In [21]:
# Group the test column names by their dtype
g_test = df_test.columns.to_series().groupby(df_test.dtypes).groups
g_test

{int64: ['PassengerId', 'Pclass', 'SibSp', 'Parch'], float64: ['Age', 'Fare'], object: ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']}

In [22]:
cat_names  = ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
cont_names = ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']

In [23]:
# No y_names or splits here: the test set has no labels and is not split
test = TabularPandas(df_test, procs=[Categorify, FillMissing, Normalize],
                     cat_names=cat_names,
                     cont_names=cont_names)
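One thing to keep in mind: a separate TabularPandas built on the test set learns its own category codes and normalization statistics, so an integer code is only guaranteed to mean the same thing in both tables when the two category lists happen to match. A common alternative is to learn the mapping on the training data and reuse it. The sketch below shows that idea with plain pandas on made-up values; code_of is a hypothetical name.

```python
import pandas as pd

train_emb = pd.Series(["S", "C", "S", "Q"])
test_emb = pd.Series(["Q", "S", "X"])  # "X" never appears in training

# Learn the category -> code mapping from the training data only.
# Codes start at 1 so that 0 can stand for "missing or unseen".
code_of = {cat: i + 1 for i, cat in enumerate(sorted(train_emb.unique()))}

# Encode both sets with the same mapping.
train_codes = train_emb.map(code_of).fillna(0).astype(int)
test_codes = test_emb.map(code_of).fillna(0).astype(int)

print(train_codes.tolist())  # [3, 1, 3, 2]
print(test_codes.tolist())   # [2, 3, 0]
```

fastai's tabular documentation describes a similar pattern, to.new(df_test) followed by process(), which reuses the processors fitted on the training data; this notebook does not use it.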

In [24]:
# Without splits, all rows end up in the .train subset
X_test = test.train.xs


In [25]:
X_test.head()

| | Name | Sex | Ticket | Cabin | Embarked | Age_na | Fare_na | PassengerId | Pclass | SibSp | Parch | Age | Fare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 207 | 2 | 153 | 0 | 2 | 1 | 1 | -1.727912 | 0.873482 | -0.49947 | -0.400248 | 0.386231 | -0.497413 |
| 1 | 404 | 1 | 222 | 0 | 3 | 1 | 1 | -1.719625 | 0.873482 | 0.616992 | -0.400248 | 1.37137 | -0.512278 |
| 2 | 270 | 2 | 74 | 0 | 2 | 1 | 1 | -1.711337 | -0.315819 | -0.49947 | -0.400248 | 2.553537 | -0.4641 |
| 3 | 409 | 2 | 148 | 0 | 3 | 1 | 1 | -1.70305 | 0.873482 | -0.49947 | -0.400248 | -0.204852 | -0.482475 |
| 4 | 179 | 1 | 139 | 0 | 3 | 1 | 1 | -1.694763 | 0.873482 | 0.616992 | 0.619896 | -0.598908 | -0.417491 |


In [26]:
# Group the preprocessed test column names by their dtype
g_test = X_test.columns.to_series().groupby(X_test.dtypes).groups
g_test


{int8: ['Sex', 'Cabin', 'Embarked', 'Age_na', 'Fare_na'], int16: ['Name', 'Ticket'], float64: ['PassengerId', 'Pclass', 'SibSp', 'Parch', 'Age', 'Fare']}

In [27]:
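# Fare has a missing value in the test set but not in the training set,
# so FillMissing added Fare_na here. Drop it to match the training columns.
X_test = X_test.drop('Fare_na', axis=1)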
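Dropping Fare_na works because it is the only column the test table has that the training table lacks. A more general way to line the two tables up is to reindex the test columns to the training columns. The sketch below uses tiny made-up frames (X_train_toy and X_test_toy are hypothetical names):

```python
import pandas as pd

# Tiny stand-ins for the preprocessed train and test tables.
X_train_toy = pd.DataFrame({"Sex": [1, 2], "Age_na": [1, 2], "Fare": [0.1, -0.3]})
X_test_toy = pd.DataFrame({
    "Sex": [2, 1],
    "Age_na": [1, 1],
    "Fare_na": [1, 2],   # extra column not present in training
    "Fare": [0.5, 0.0],
})

# Keep only the training columns, in the training order.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns)

print(list(X_test_aligned.columns))  # ['Sex', 'Age_na', 'Fare']
```

Note that reindex fills any training column that is absent from the test table with NaN, so it is worth checking for missing values afterwards.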

In [28]:
y_pred = rnf_classifier.predict(X_test)

In [29]:
y_pred = y_pred.astype(int)

In [30]:
output = pd.DataFrame({'PassengerId': df_test.PassengerId, 'Survived': y_pred})
output.to_csv('my_submission_titanic.csv', index=False)
output.head()

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
