# Titanic 3 - Impute age

This is the third attempt at the Titanic classification exercise. We use the same approach as before, but handle the nulls in Age by imputing them instead of dropping the affected rows.


In [None]:
from dasi_library import *

## Load the data from the CSV file

In [None]:
titanic = readCsv('../datasets/titanic/titanic.csv')

## Inspect the data

In [None]:
titanic.dtypes

In [None]:
titanic.shape

## Prepare the data

In the previous attempt (2) we dropped every row that contained a null. This time we impute instead, that is, we replace nulls with plausible values.



In [None]:
checkForNulls(titanic)

Fill the nulls in Age with the mean age:

In [None]:
titanic = imputeNullWithMean(titanic, "Age")
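
The source of `dasi_library` is not shown in this notebook, so the following is only a plain-pandas sketch of what mean imputation might look like. The toy frame and its values are made up for illustration.

```python
import pandas as pd

# Toy frame standing in for the Titanic data; the ages are invented.
df = pd.DataFrame({"Age": [22.0, None, 38.0, None, 30.0]})

# Series.mean() skips nulls by default, so only 22, 38 and 30 contribute.
mean_age = df["Age"].mean()

# Replace each null with that mean; non-null ages are left unchanged.
df["Age"] = df["Age"].fillna(mean_age)
print(df["Age"].tolist())
```

Mean imputation keeps the column mean unchanged but reduces its variance, because every filled value sits exactly at the centre of the distribution.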

About 77% of the values in the Cabin column are null. Imputing would mean inventing 77% of that column, which makes little sense, so we delete the column instead.

In [None]:
titanic = removeCol(titanic, "Cabin")
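
A null percentage like the one quoted for Cabin can be computed directly in pandas. The sketch below assumes nothing about how `checkForNulls` or `removeCol` are implemented. The toy data and the 0.5 cutoff are arbitrary choices for illustration.

```python
import pandas as pd

# Invented sample: Cabin is null in 3 of 4 rows, Age in 1 of 4.
df = pd.DataFrame({
    "Age": [22.0, None, 38.0, 35.0],
    "Cabin": [None, "C85", None, None],
})

# isnull() gives a boolean frame; its column-wise mean is the null fraction.
null_fraction = df.isnull().mean()
print(null_fraction)

# Drop columns that are mostly null (the 0.5 cutoff is an assumption).
mostly_null = null_fraction[null_fraction > 0.5].index
df = df.drop(columns=mostly_null)
print(df.columns.tolist())
```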

In [None]:
checkForNulls(titanic)

We still have some nulls in Embarked.  Let's remove those:

In [None]:
titanic = dropNullRows(titanic)
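
Assuming `dropNullRows` removes every row that has at least one null, the pandas equivalent would be `DataFrame.dropna()`. A minimal sketch on invented data:

```python
import pandas as pd

# Invented sample with a single null in Embarked.
df = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.05],
    "Embarked": ["S", None, "Q"],
})

rows_before = len(df)
# With default arguments, dropna() drops any row containing a null.
df = df.dropna()
print(rows_before - len(df), "row(s) dropped")
```

Under that assumption the order of the steps matters: had Cabin still been present at this point, every row with a missing Cabin would have been dropped as well.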

In [None]:
checkForNulls(titanic)

What are we left with?

In [None]:
titanic.shape

We have now kept 889 of the original 891 rows. The previous attempt kept only 183.

### Handle categorical features

Let's take a look at our categorical columns:

In [None]:
selectCols(titanic, ["PassengerId", "Name", "Sex", "Ticket", "Embarked"]).head()

Remove the categorical columns we don't want to process:

In [None]:
titanic = removeCols(titanic, ["PassengerId", "Name", "Ticket"])

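Sklearn can't deal with Sex and Embarked directly because they are categorical, so we need to convert them to numerical features. We can achieve this with one-hot encoding, which replaces each categorical column with one 0/1 column per category: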

In [None]:
titanic = oneHotEncode(titanic, ["Sex","Embarked"])
titanic.head()
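
To show what one-hot encoding produces, here is a sketch using `pandas.get_dummies`. Whether `oneHotEncode` wraps this function is an assumption, and the sample rows are invented.

```python
import pandas as pd

# Invented sample with two categorical columns and one numeric column.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.92],
})

# Each listed column is replaced by one indicator column per category,
# named <column>_<category>; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```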

### Check if we have useful features
Check how well our features separate the classes:

In [None]:
classComparePlot(titanic, 'Survived', 'hist')

## Split out the target feature

In [None]:
X,Y = splitXY(titanic, 'Survived')

## Split into training and test sets

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)
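
The argument names above match scikit-learn's `train_test_split`, so `trainTestSplit` may well be a thin wrapper around it, though that is an assumption. The sketch below uses synthetic arrays to show how `test_size` and `random_state` behave.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 100 samples with 2 features and a binary target.
X = np.arange(200).reshape(100, 2)
Y = np.arange(100) % 2

# test_size is the fraction of rows held out for testing; random_state
# fixes the shuffle so the same split is produced on every run.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same seed reproduces the split exactly.
X_train_again, _, _, _ = train_test_split(X, Y, test_size=0.33, random_state=7)
print(np.array_equal(X_train, X_train_again))
```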

In [None]:
algorithms = []
algorithms.append(LogisticRegression)
algorithms.append(LinearDiscriminantAnalysis)
algorithms.append(KNeighborsClassifier)
algorithms.append(DecisionTreeClassifier)
algorithms.append(GaussianNB)
algorithms.append(SVC)
evaluateAlgorithmsClassification(X_train, Y_train, algorithms, seed)
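
How `evaluateAlgorithmsClassification` scores each algorithm is not shown here. One common approach, assumed for this sketch, is k-fold cross-validation on the training set. The fold count of 10, the accuracy metric, and the synthetic data are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X_train, Y_train.
X, Y = make_classification(n_samples=200, n_features=5, random_state=7)

# Shuffled 10-fold split; the seed keeps the folds identical across models.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

results = {}
for algorithm in [LogisticRegression, DecisionTreeClassifier]:
    model = algorithm()  # default hyperparameters
    scores = cross_val_score(model, X, Y, cv=kfold, scoring="accuracy")
    results[algorithm.__name__] = scores.mean()
    print(f"{algorithm.__name__}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing mean cross-validated accuracy on the training data alone keeps the test set untouched until a single model has been chosen.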

In [None]:
model = modelFit(X_train, Y_train, LogisticRegression)

## Test the model

In [None]:
predictions = predict(model, X_test)

In [None]:
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))
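
To help read the output above, the sketch below recomputes the main numbers by hand on a small invented example. For binary labels, scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, giving the layout `[[TN, FP], [FN, TP]]`.

```python
# Invented labels: 1 = survived, 0 = did not survive.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives

accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted survivors who survived
recall = tp / (tp + fn)             # share of actual survivors who were found

print([[tn, fp], [fn, tp]])
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```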