# Titanic 4 - Do something with ticket and cabin

This is the fourth attempt at the Titanic classification exercise.  We use the same approach as before, but see if we can do something with the ticket and cabin features.


In [1]:
from dasi_library import *

ModuleNotFoundError: No module named 'dasi_library'

## Load the data from the CSV file

In [None]:
titanic = readCsv('../datasets/titanic/titanic.csv')

## Prepare the data

### Process cabin
In the previous attempt (3) we threw away Cabin because 77% of the values were null.  We don't want to impute, because we don't want to make up 77% of the cabins!  But we could possibly get some additional predictive power by knowing which cabins the other 23% were in.

Let's take a look at the unique list of cabins we have in our passenger data:

In [None]:
listUnique(titanic, "Cabin")

And let's see the class distribution, the number of people in each cabin:

In [None]:
classDistribution(titanic, "Cabin")
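
If `dasi_library` is not to hand, roughly the same two views can be had with plain pandas. This is only a sketch: it assumes `listUnique` and `classDistribution` are thin wrappers around the equivalent pandas calls (not confirmed here), and the `sample` frame is made-up data standing in for the real Cabin column.

```python
import pandas as pd

# Made-up stand-in for the Cabin column (not real Titanic rows).
sample = pd.DataFrame({"Cabin": ["C85", None, "C85", "B82 B84", "F G73", "D", None]})

# Unique cabin values, ignoring nulls.
unique_cabins = sorted(sample["Cabin"].dropna().unique())

# Number of passengers per cabin value; dropna=False also counts the nulls.
cabin_counts = sample["Cabin"].value_counts(dropna=False)

print(unique_cabins)
print(cabin_counts)
```

Passing `dropna=False` is a deliberate choice here: with so many missing cabins, it is useful to see the null count alongside the real values.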

We might reason that the cabin number is not that useful, as the data is spread very sparsely across cabins, but the cabin letter could be of some use.  A quick Google search for the cabin layout on the Titanic reveals a few images of the deck layout that could be helpful, e.g. https://ssmaritime.com/Titanic-3.htm.  It looks like the letters correspond to the different levels on the ship.  That could well be significant!  But at the moment we are throwing all of that away!

Now, we have some anomalies in the data:

'B82 B84',
'F G73',
'D',

We could come up with a strategy for dealing with those, but for simplicity I will just extract the first letter and call that the Cabin letter.



In [None]:
titanic = splitFeatureOnPosition(titanic, "Cabin", 1, ["Cabin_letter", "Cabin_remainder"])
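
For readers without `dasi_library`, the first-letter extraction can be sketched in plain pandas. This assumes `splitFeatureOnPosition` cuts the string at the given character position and replaces the original column with the two new ones; that behaviour is a guess, and `sample` is made-up data.

```python
import pandas as pd

# Made-up stand-in for the Cabin column, including the anomalies listed above.
sample = pd.DataFrame({"Cabin": ["C85", None, "B82 B84", "F G73", "D"]})

# The first character becomes the letter; nulls stay null.
sample["Cabin_letter"] = sample["Cabin"].str[:1]
# Everything after the first character is the remainder ("" for a bare "D").
sample["Cabin_remainder"] = sample["Cabin"].str[1:]

# Assumption: the original column is dropped once it has been split.
sample = sample.drop(columns=["Cabin"])
print(sample)
```

Note what this simple rule does to the anomalies: 'B82 B84' becomes B, 'F G73' becomes F (the G is lost), and a bare 'D' is kept as D.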

In [None]:
titanic

### Handle nulls

In [None]:
checkForNulls(titanic)

Fill age with the mean age:

In [None]:
titanic = imputeNullWithMean(titanic, "Age")
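
In plain pandas, mean imputation looks roughly like this. It assumes `imputeNullWithMean` fills nulls with the mean of the non-null values in the column (not confirmed); the ages in `sample` are made up.

```python
import pandas as pd

# Made-up ages with one missing value.
sample = pd.DataFrame({"Age": [22.0, None, 38.0, 30.0]})

# Series.mean() skips nulls by default, so this is the mean of the known ages.
mean_age = sample["Age"].mean()

# Replace every null with that mean.
sample["Age"] = sample["Age"].fillna(mean_age)
print(sample)
```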

In [None]:
checkForNulls(titanic)

Remove the Cabin_remainder column, as we are just interested in the letter:

In [None]:
titanic = removeCol(titanic, "Cabin_remainder")

In [None]:
checkForNulls(titanic)

We will leave Embarked and Cabin_letter as they are.  Although they contain nulls, we will one-hot encode them, which will create a separate column for the nulls.

In [None]:
titanic.shape

We've preserved all 891 rows!

### Handle categorical features

Let's take a look at our categorical columns:

In [None]:
selectCols(titanic, ["PassengerId", "Name", "Sex", "Ticket", "Embarked"]).head()

Remove the categorical columns we don't want to process:

In [None]:
titanic = removeCols(titanic, ["PassengerId", "Name", "Ticket"])

One-hot encode the categorical features we want to keep.  Sex doesn't contain nulls, but for Embarked and Cabin_letter we ask for a null column to be created:

In [None]:
titanic.head()

In [None]:
titanic = oneHotEncode(titanic, ["Sex"])
titanic = oneHotEncode(titanic, ["Embarked", "Cabin_letter"], dummy_na=True)
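
To see what the null column looks like, here is a small sketch using `pd.get_dummies`. It assumes `oneHotEncode` passes `dummy_na` through to pandas, which the matching parameter name suggests but which is not confirmed; `sample` is made-up data.

```python
import pandas as pd

# Made-up Embarked values with one null.
sample = pd.DataFrame({"Embarked": ["S", "C", None, "Q"]})

# dummy_na=True adds an extra indicator column that is set for null rows.
encoded = pd.get_dummies(sample, columns=["Embarked"], dummy_na=True)

# One column per port, plus one for the nulls.
print(encoded.columns.tolist())
print(encoded)
```

The row with the missing port gets zeros in every port column and a one in the null column, so no row has to be dropped or imputed.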

In [None]:
titanic.head()

### Check if we have useful features
Check how well our features separate the classes:

In [None]:
classComparePlot(titanic, 'Survived', 'hist')

## Split out the target feature

In [None]:
X, Y = splitXY(titanic, 'Survived')

## Split into training and test sets

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)
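
The `test_size` and `random_state` arguments match scikit-learn's `train_test_split`, so `trainTestSplit` is presumably a wrapper around it (an assumption). A minimal sketch on made-up arrays shows what the two parameters do:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, alternating 0/1 labels.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100) % 2

# Hold out roughly a third of the rows; the seed makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7
)
print(X_tr.shape, X_te.shape)

# Same seed, same split.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7
)
print(np.array_equal(X_te, X_te2))
```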

## Evaluate algorithms and fit a model

In [None]:
algorithms = []
algorithms.append(LogisticRegression)
algorithms.append(LinearDiscriminantAnalysis)
algorithms.append(KNeighborsClassifier)
algorithms.append(DecisionTreeClassifier)
algorithms.append(GaussianNB)
algorithms.append(SVC)
evaluateAlgorithmsClassification(X_train, Y_train, algorithms, seed)
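
What `evaluateAlgorithmsClassification` does internally is not shown here. A common way to compare classifiers on the training set is k-fold cross-validation, and the sketch below assumes something along those lines. The 10 folds, the accuracy metric, the three models and the synthetic data are all illustrative choices, not taken from the library.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=7)

seed = 7
candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=seed),
    "GaussianNB": GaussianNB(),
}

# Use the same folds for every model so the scores are comparable.
folds = KFold(n_splits=10, shuffle=True, random_state=seed)

mean_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=folds, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Scoring only on the training data keeps the test set untouched until the final check of the chosen model.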

In [None]:
model = modelFit(X_train, Y_train, LogisticRegression)

## Test the model

In [None]:
predictions = predict(model, X_test)

In [None]:
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))