# Titanic 2 - Keep sex and embarked

This is the second attempt at the Titanic classification exercise. We use the same approach as before, but preserve the Sex and Embarked categorical features.

In [None]:
from dasi_library import *

## Load the data from the CSV file

In [None]:
titanic = readCsv('../datasets/titanic/titanic.csv')

## Inspect the data

In [None]:
titanic.dtypes

In [None]:
titanic.shape

## Prepare the data

In the first attempt we threw away a lot of potentially useful data. This time, let's try to preserve some of it.


### Handle nulls

In [None]:
checkForNulls(titanic)

Just drop any rows containing nulls:

In [None]:
titanic = dropNullRows(titanic)

Confirm this worked:

In [None]:
checkForNulls(titanic)

Check how much this reduces the size of the data set:

In [None]:
titanic.shape
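
The helpers `checkForNulls` and `dropNullRows` come from `dasi_library`, whose source is not shown here. As a rough sketch of the same idea in plain pandas (assuming the helpers behave like `isnull().sum()` and `dropna()`, which is a guess, and using a made-up toy frame rather than the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": ["C85", None, None, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Drop every row that has at least one missing value.
cleaned = df.dropna()

# Compare the shapes to see how many rows were lost.
print(df.shape, "->", cleaned.shape)
```

Dropping rows is simple, but it can discard a large share of the data when one sparsely populated column is responsible for most of the nulls, which is why it is worth comparing the shape before and after.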

### Handle categorical features

Let's take a look at our categorical columns:

In [None]:
selectCols(titanic, ["PassengerId", "Name", "Sex", "Ticket", "Cabin", "Embarked"]).head()

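What can we do with these? PassengerId is just a row identifier, and Name, Ticket and Cabin look too complex to encode easily, so let's leave those out. Sex and Embarked each take only a few distinct values, so they can be converted to dummy variables.

Let's remove the others: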

In [None]:
titanic = removeCols(titanic, ["PassengerId", "Name", "Ticket", "Cabin"])

In [None]:
selectCols(titanic, ["Sex", "Embarked"]).head()

Scikit-learn can't deal with these columns directly because they are categorical, so we need to convert them to numerical values. We can achieve this by one-hot encoding:

In [None]:
titanic = oneHotEncode(titanic, ["Sex", "Embarked"])

In [None]:
titanic.head()
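
To make the one-hot encoding step concrete, here is a minimal sketch using `pandas.get_dummies`. Whether `oneHotEncode` from `dasi_library` works this way is an assumption, and the toy frame below is made up for illustration:

```python
import pandas as pd

# Toy frame with the two categorical columns plus one numeric column
# (values are made up).
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.92],
})

# Replace each listed column with one indicator column per category.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"])

print(list(encoded.columns))
print(encoded.head())
```

Each category becomes its own indicator column (for example `Sex_female` and `Sex_male`), so every row has exactly one "on" value per original feature.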

### Check if we have useful features

Check how well our features separate the classes:

In [None]:
classComparePlot(titanic, 'Survived', 'hist')

## Split out the target feature

In [None]:
X, Y = splitXY(titanic, 'Survived')

## Split into training and test sets

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)
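
The two splitting steps above can be sketched with pandas and scikit-learn directly. This assumes `splitXY` simply separates the target column from the rest and that `trainTestSplit` wraps scikit-learn's `train_test_split`; neither is confirmed by the notebook. The toy data is made up:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared Titanic data (values are made up).
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    "Fare": [7.3, 71.3, 7.9, 53.1, 51.9, 21.1, 11.1, 30.1, 16.7, 26.6],
    "Survived": [0, 1, 1, 1, 0, 0, 1, 1, 1, 1],
})

# Separate the features (X) from the target column (Y).
X = df.drop(columns=["Survived"])
Y = df["Survived"]

# Hold out roughly a third of the rows for testing; fixing random_state
# makes the shuffle reproducible.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7
)

print(X_train.shape, X_test.shape)
```

Keeping the test rows aside until the very end means the final accuracy is measured on data the model has never seen.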

## Compare candidate algorithms

In [None]:
# Candidate classifiers (classes, not instances) to compare
algorithms = [
    LogisticRegression,
    LinearDiscriminantAnalysis,
    KNeighborsClassifier,
    DecisionTreeClassifier,
    GaussianNB,
    SVC,
]
evaluateAlgorithmsClassification(X_train, Y_train, algorithms, seed)
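
The internals of `evaluateAlgorithmsClassification` are not shown in this notebook. One common way to compare classifiers on the training set is k-fold cross-validation, sketched below with scikit-learn. The 10 folds, the accuracy metric, the shortened candidate list, and the synthetic data are all assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X_train, Y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=7)

# A few candidate models, keyed by a short label.
candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "CART": DecisionTreeClassifier(random_state=7),
    "NB": GaussianNB(),
}

# Use the same shuffled folds for every model so the scores are comparable.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=kfold, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

Looking at the spread across folds as well as the mean helps avoid picking a model whose advantage is within the noise.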

## Fit the chosen model

In [None]:
model = modelFit(X_train, Y_train, LogisticRegression)

## Test the model

In [None]:
predictions = predict(model, X_test)

In [None]:
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))
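
To help read the metrics printed above, here is a small worked example on hand-made labels (not Titanic results). It uses the same scikit-learn functions, `accuracy_score` and `confusion_matrix`:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels: 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Fraction of predictions that match the true labels: 5 of 8.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(acc)
print(cm)
print(f"true negatives={tn}, false positives={fp}, "
      f"false negatives={fn}, true positives={tp}")
```

In the confusion matrix, the diagonal holds the correct predictions and the off-diagonal cells show which class the model confuses with which.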