# Titanic 1 - Naive approach

This worksheet works through the Titanic classification task. The aim is to build a model that predicts whether a passenger survived, based on the given features.

The data and description of the challenge can be found here:

https://www.kaggle.com/c/titanic

The following is a description of the data:


|Variable|Definition|Key|
|--------|----------|---|
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex| |
|Age|Age in years| |
|sibsp|# of siblings / spouses aboard the Titanic| |
|parch|# of parents / children aboard the Titanic| |
|ticket|Ticket number| |
|fare|Passenger fare| |
|cabin|Cabin number| |
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

This first sheet takes a basic, naive approach, with no informed data cleansing or feature engineering.

In [None]:
from dasi_library import *

## Load the data from the CSV file

In [None]:
titanic = readCsv('../datasets/titanic/titanic.csv')
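
The `readCsv` helper comes from `dasi_library`, whose implementation is not shown in this worksheet. Assuming it is a thin wrapper around pandas, a rough equivalent would be `pandas.read_csv`. The sketch below reads a tiny in-memory CSV with made-up rows (not taken from the real dataset) so that it runs without the data file:

```python
import io

import pandas as pd

# Made-up rows for illustration only; not real passenger records.
csv_text = """PassengerId,Survived,Pclass,Age,Fare
1,0,3,22.0,7.25
2,1,1,38.0,71.28
3,1,3,,7.92
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # inferred type of each column
```

The empty `Age` field in the last row is read as a missing value (`NaN`), which is what the null checks later in the sheet look for.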

## Inspect the data

In [None]:
titanic.head()

In [None]:
titanic.dtypes

In [None]:
titanic.shape

In [None]:
titanic.describe()

In [None]:
boxPlotAll(titanic)

In [None]:
histPlotAll(titanic)

## Prepare the data
### Remove nulls

In [None]:
checkForNulls(titanic)

Just drop any rows containing nulls:

In [None]:
titanic = dropNullRows(titanic)
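
`checkForNulls` and `dropNullRows` are `dasi_library` helpers whose internals are not shown here. A minimal sketch of the same idea in plain pandas, assuming the helpers count nulls per column and drop every row that has at least one null, might look like this (the toy values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; two of the four rows contain a null.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": ["C85", None, None, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count the nulls in each column.
null_counts = toy.isnull().sum()

# Drop every row that has at least one null value.
cleaned = toy.dropna(axis=0, how="any")

print(null_counts)
print(len(toy), "rows before,", len(cleaned), "rows after")
```

Dropping rows is the simplest option, but a single column with many missing values can remove a large share of the data, which is why the shape is checked again afterwards.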

Confirm this worked:

In [None]:
checkForNulls(titanic)

Check how much this reduces the size of the data set:

In [None]:
titanic.shape

### Handle categorical features

Scikit-learn estimators expect numeric input, and this naive approach does no encoding, so drop the non-numeric columns along with the `PassengerId` identifier:

In [None]:
titanic = removeCols(titanic, ["PassengerId", "Name", "Sex", "Ticket", "Cabin", "Embarked"])
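
If `removeCols` simply drops the named columns (an assumption, since the helper's source is not shown), the pandas equivalent is `DataFrame.drop`. A small sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with made-up values, mixing text and numeric columns.
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "Survived": [0, 1],
})

# Columns to discard: text columns plus the row identifier.
to_drop = ["PassengerId", "Name", "Sex"]

# drop returns a new frame; the original is left unchanged.
numeric_only = toy.drop(columns=to_drop)

print(list(numeric_only.columns))
```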

### Check if we have useful features
Check how well our features separate the classes:

In [None]:
classComparePlot(titanic, 'Survived', 'hist')

## Split out the target feature

In [None]:
X,Y = splitXY(titanic, 'Survived')
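
Separating the target from the features can be sketched in plain pandas as follows. This assumes `splitXY` returns the feature columns and the target column separately, and the toy values are made up:

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Survived": [0, 1, 1],
})

target = "Survived"

# X holds every column except the target; Y holds only the target.
X = toy.drop(columns=[target])
Y = toy[target]

print(list(X.columns))
print(Y.tolist())
```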

## Split into training and test sets

In [None]:
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = trainTestSplit(X, Y, test_size=test_size, random_state=seed)
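
The `trainTestSplit` call takes the same keyword arguments as scikit-learn's `train_test_split`, so it is presumably a wrapper around it (an assumption). The sketch below uses synthetic arrays to show what `test_size` and `random_state` control:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features and a binary label.
X = np.arange(200).reshape(100, 2)
Y = np.array([0, 1] * 50)

# Hold out roughly a third of the rows for testing; fixing
# random_state makes the shuffle, and so the split, repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7
)

# Calling again with the same seed gives the same test rows.
_, X_test_again, _, _ = train_test_split(X, Y, test_size=0.33, random_state=7)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_again))
```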

## Evaluate candidate algorithms

In [None]:
algorithms = []
algorithms.append(LogisticRegression)
algorithms.append(LinearDiscriminantAnalysis)
algorithms.append(KNeighborsClassifier)
algorithms.append(DecisionTreeClassifier)
algorithms.append(GaussianNB)
algorithms.append(SVC)
evaluateAlgorithmsClassification(X_train, Y_train, algorithms, seed)
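
What `evaluateAlgorithmsClassification` does internally is not shown. One common way to compare classifiers on the training set, assumed here, is k-fold cross-validation. The sketch below uses synthetic data from `make_classification`, and the choice of 10 folds is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=5, random_state=7)

# Shuffled 10-fold split with a fixed seed for repeatability.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

results = {}
for cls in [LogisticRegression, LinearDiscriminantAnalysis, DecisionTreeClassifier]:
    # Train and score a fresh model on each fold, then average the accuracy.
    scores = cross_val_score(cls(), X, y, cv=kfold, scoring="accuracy")
    results[cls.__name__] = scores.mean()
    print(f"{cls.__name__}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Because every model is scored on the same folds, the mean accuracies can be compared directly when choosing which algorithm to fit.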

## Fit the chosen model

In [None]:
model = modelFit(X_train, Y_train, LinearDiscriminantAnalysis)

## Test the model

In [None]:
predictions = predict(model, X_test)
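
Assuming `modelFit` instantiates the given class and calls `fit`, and `predict` calls the model's `predict` method, the two steps can be sketched directly with scikit-learn on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the prepared Titanic features.
X, y = make_classification(n_samples=200, n_features=5, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7
)

# Fit on the training rows only.
model = LinearDiscriminantAnalysis()
model.fit(X_train, y_train)

# Predict a class label for each held-out row.
predictions = model.predict(X_test)

print(predictions[:10])
```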

In [None]:
print(accuracy_score(Y_test, predictions))
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))
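
To make the metrics concrete, here is a small hand-checkable sketch with made-up labels. For binary labels, scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, so `ravel()` yields the counts in the order TN, FP, FN, TP:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 8 passengers, 6 predicted correctly.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Fraction of predictions that match the true label.
acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(acc)  # 0.75
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

Accuracy equals (TN + TP) divided by the total number of predictions, here (3 + 3) / 8.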