In [1]:
import numpy as np
from scipy.stats import mode
import pandas as pd

In [2]:
data = pd.read_csv('titanic_train.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We convert "Sex" and "Embarked" into numeric variables using one-hot encoding:

In [3]:
# Sexmale is 1 if passenger is male, 0 if female
data['Sexmale'] = (data['Sex'] == 'male').astype(np.int32)
data['Sexfemale'] = (data['Sex'] == 'female').astype(np.int32)
# one indicator column per port of embarkation
data['EmbarkedS'] = (data['Embarked'] == 'S').astype(np.int32)
data['EmbarkedC'] = (data['Embarked'] == 'C').astype(np.int32)
data['EmbarkedQ'] = (data['Embarked'] == 'Q').astype(np.int32)
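As an aside, the same indicator columns could probably be produced in one call with `pd.get_dummies`. The sketch below runs on a small made-up frame (`toy` is a hypothetical stand-in, not the Titanic file) and assumes `prefix_sep=''` is acceptable so that the generated names match the hand-built ones above.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# One indicator column per category; prefix_sep='' yields names such as 'Sexmale'
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'],
                         prefix_sep='', dtype='int32')

print(encoded)
```

The manual comparisons used in the notebook make the encoding explicit, while `get_dummies` scales better when a column has many categories.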

Some columns must be dropped before running any algorithm.  
The "Cabin" column contains missing values and is the only column that does, so we delete it.
The "PassengerId" column is simply each entry's index and should not influence whether a passenger survived.
The "Name" and "Ticket" columns are free-form strings that are not amenable to conversion to numbers.
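One way to confirm which columns contain missing values is to count them per column. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); on the real data the same two lines would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; NaN marks a missing entry
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
})

# Number of missing entries in each column
missing_counts = toy.isna().sum()

# Names of the columns that have at least one missing entry
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(cols_with_missing)
```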

In [4]:
# create the training data
X = data[['Pclass','Sexmale','Sexfemale','Age','SibSp','Parch','Fare','EmbarkedS','EmbarkedC','EmbarkedQ']]
y = data['Survived']

In [5]:
X.head()

Unnamed: 0,Pclass,Sexmale,Sexfemale,Age,SibSp,Parch,Fare,EmbarkedS,EmbarkedC,EmbarkedQ
0,3,1,0,22.0,1,0,7.25,1,0,0
1,1,0,1,38.0,1,0,71.2833,0,1,0
2,3,0,1,26.0,0,0,7.925,1,0,0
3,1,0,1,35.0,1,0,53.1,1,0,0
4,3,1,0,35.0,0,0,8.05,1,0,0


Before using raw accuracy to judge the model, we need a baseline to compare it against.
We take the accuracy of always predicting the modal value of "Survived".

In [8]:
mode(y)

ModeResult(mode=array([0]), count=array([549]))

In [12]:
# mode is 0
ymode = np.zeros(y.size)
# get accuracy
np.sum(ymode == y) / y.size

0.6161616161616161

We can get an accuracy of 0.616 by just predicting the most common value.  
At minimum, we want our model to beat this value.
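The same baseline can be sketched without hard-coding the mode, using `value_counts`. The label vector below is a stand-in rebuilt from the numbers reported above (549 zeros, and 342 ones implied by the 0.616 accuracy); `y_demo` is a hypothetical name, not a variable from the notebook.

```python
import pandas as pd

# Stand-in labels: 549 non-survivors and 342 survivors (reconstructed counts)
y_demo = pd.Series([0] * 549 + [1] * 342)

# Share of each class; the largest share is the majority-class accuracy
class_shares = y_demo.value_counts(normalize=True)
baseline_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print(majority_class, baseline_accuracy)
```

Computing the baseline this way keeps working if the majority class changes, for example after filtering rows.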

For the model itself, use scikit-learn's LogisticRegression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
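A minimal sketch of how the class linked above might be used, shown on synthetic data rather than the Titanic features. The data generation, the 25% hold-out split, `max_iter=1000`, and the random seeds are all assumptions made for illustration; on the real data, `X` and `y` from the earlier cell would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

# Hold out a quarter of the rows to estimate accuracy on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Fit the model; a larger max_iter helps the solver converge on unscaled features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out rows
test_accuracy = clf.score(X_test, y_test)

# Majority-class baseline on the same held-out rows, for comparison
positive_share = np.mean(y_test)
baseline = max(positive_share, 1 - positive_share)

print(test_accuracy, baseline)
```

Comparing `test_accuracy` with `baseline` on the same rows is the check the notebook sets up: the model is only useful if it beats the majority-class accuracy.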