In [239]:
import pandas
import numpy as np

In this notebook, we'll build a predictive model for survival on the Titanic based on training data provided by Kaggle. This is part of the Warmup Project for Data Science 2016. 

#### A. import the training data.

In [240]:
titanic = pandas.read_csv("./data/train.csv")

# Uncomment print statements below to take a look at the 
# first 5 rows of the dataframe and the describing output.
# print(titanic.head(5))
# print(titanic.describe())

#### B. clean up the missing data. 

Occasionally a dataset contains missing values (null, not a number, NA, etc.) and we want to prevent these missing values from affecting our computations in unintended ways. In particular, this training data set has missing values for `Age`, so let's clean that up by filling them in with the median age!

In [241]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
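
To see what the median fill does, here is a small sketch on a toy column. The ages below are made up for illustration and are not taken from `train.csv`.

```python
import numpy as np
import pandas

# Toy stand-in for the Age column (made-up values, not the real data).
ages = pandas.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0])

missing_before = int(ages.isnull().sum())  # how many entries are missing
median_age = ages.median()                 # median() skips NaN by default
filled = ages.fillna(median_age)           # replace every NaN with the median
missing_after = int(filled.isnull().sum())

print("missing before: %d, after: %d" % (missing_before, missing_after))
```

Running the same `isnull().sum()` check on `titanic` before and after the fill is a quick way to confirm that no gaps remain in a column.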

#### C. convert non-numeric (categorical) variables into usable numbers!

In particular, `Sex` and `Embarked` should be converted into usable numbers. We'll find all the unique values for these non-numeric data points and replace them with numbers that can be used by the predictive model in a later step.

In [242]:
# Find all the unique genders 
print "unique genders are", titanic["Sex"].unique()

# From genders to numbers
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

unique genders are ['male' 'female']


In [243]:
# Find all the unique embarked values
print "unique embarked values are", titanic["Embarked"].unique()

# From embarked letters to numbers
titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

unique embarked values are ['S' 'C' 'Q' nan]
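
The `.loc` assignments above work fine, but the same encoding can also be written with a lookup dictionary and `Series.map`. This is just an alternative sketch on a made-up toy frame; the dictionary names `sex_codes` and `embarked_codes` are ours, not part of the original code.

```python
import numpy as np
import pandas

# Made-up rows that mimic the two categorical columns.
toy = pandas.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "Q"],
})

# Same codes as in the cells above.
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

toy["Sex"] = toy["Sex"].map(sex_codes)
# Fill the missing port with "S" first, then translate letters to numbers.
toy["Embarked"] = toy["Embarked"].fillna("S").map(embarked_codes)

print(toy)
```

One thing to watch with `map`: any value missing from the dictionary turns into NaN, so it is worth checking `unique()` first, as we did above.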


#### D. cross validation, linear regression, first stab at predictions 

We want to make sure that we don't train our model on the same data that we'll make predictions on, so we're going to split the data into several folds. In each trial, one fold will be set aside for predictions, and the remaining folds will be used for training. Thus there's no overlap between the folds/partitions that were used for training and the one fold used for predictions. We'll run several trials with these fold combinations and eventually get predictions for the entire dataset.
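
To make the fold idea concrete, here is a minimal sketch that builds the train/test row indices by hand with numpy. The helper name `kfold_indices` is hypothetical, and this only illustrates the scheme; it is not how scikit-learn implements its `KFold` helper.

```python
import numpy as np

def kfold_indices(n_rows, n_folds):
    """Yield (train, test) index arrays for consecutive, unshuffled folds."""
    # array_split copes with n_rows not being a multiple of n_folds.
    folds = np.array_split(np.arange(n_rows), n_folds)
    for i, test in enumerate(folds):
        # Everything that is not in the held-out fold is used for training.
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

splits = list(kfold_indices(10, 3))
for train, test in splits:
    print("train: %s  test: %s" % (train.tolist(), test.tolist()))
```

Each row lands in exactly one test fold, which is why we end up with one prediction for every row of the dataset.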

In [244]:
# Code from dataquest mission 74, part 9.

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It returns the row indices corresponding to train and test.
# Without shuffle=True the folds are consecutive blocks of rows, so we get the same splits every time we run this
# (random_state only takes effect when shuffling).
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)

In [245]:
print predictions

[array([0.08998778098664095, 0.9607562062264069, 0.5926762776633059,
       0.9311387278910165, 0.05293430709559788, 0.17027568492773304,
       0.36994359030196483, 0.10347484742579771, 0.5215979058146003,
       0.8744910503206457, 0.6488836111621734, 0.8297427688347156,
       0.1347971983986338, -0.1611268436425003, 0.6581413066296763,
       0.6398197484682686, 0.15173387493789559, 0.29543271790803804,
       0.5353779589276406, 0.6210076833082576, 0.2618725916383594,
       0.2626875613868237, 0.7317391597365706, 0.5059958971692413,
       0.5613985666552621, 0.33503973416336075, 0.13033880755757876,
       0.4687657665282141, 0.660737752649522, 0.09108192184311348,
       0.47722392003543795, 1.0422002619036157, 0.6606916127819125,
       0.08715392731116645, 0.5285507322778327, 0.40187433784208143,
       0.13034030746039582, 0.1293396723117648, 0.5727171285933548,
       0.6652388218334924, 0.4832157785469805, 0.7608074080287128,
       0.13057836346464047, 0.8718671208885529,

#### D. continued: accuracy!

How did this first stab at predictions go? Survival is binary, so the only possible outcomes are 1 and 0, but linear regression outputs continuous values (some of them even fall below 0 or above 1, as the printout above shows). Thus we have to map our predictions to outcomes: anything above 0.5 becomes 1 and everything else becomes 0. We'll then compute the accuracy of these results by comparing our predictions to the `Survived` column of the training data. 

In [246]:
# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

# Take a look
# print(predictions.shape)
# print(titanic["Survived"].shape)

num_accurate_predictions = 0 # counter

# Check whether the predictions are correct
for i in range(predictions.shape[0]):
    if predictions[i] == titanic["Survived"][i]:
        num_accurate_predictions +=1

accuracy = float(num_accurate_predictions) / predictions.shape[0]


The accuracy of this linear regression model is `0.783389450056` -- definitely a lot of room for improvement! Perhaps using a different model or some feature engineering could help. :)
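
The thresholding and the counting loop above can also be written without an explicit loop. Here is a small sketch on made-up numbers (the `raw` and `actual` arrays are illustrative, not real model output):

```python
import numpy as np

# Made-up regression outputs, including values outside [0, 1].
raw = np.array([0.09, 0.96, 0.59, -0.16, 1.04, 0.50])
# Made-up true outcomes for the same six rows.
actual = np.array([0, 1, 1, 0, 1, 1])

# Strictly greater than 0.5 maps to 1, everything else to 0.
labels = (raw > 0.5).astype(int)

# Comparing elementwise gives booleans; their mean is the fraction correct.
accuracy = (labels == actual).mean()

print("accuracy: %.3f" % accuracy)
```

Note that a prediction of exactly 0.5 is mapped to 0, matching the `<=.5` rule in the cell above.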

#### E. second stab: logistic regression

In [247]:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.792368125701


The accuracy of the logistic regression model is `0.792368125701` -- slightly better, but still far from perfect. Let's go through making a submission to Kaggle before continuing to tweak the model.
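
A note for anyone rerunning this on a recent scikit-learn: the `sklearn.cross_validation` module was removed in version 0.20, and the same helper now lives in `sklearn.model_selection`. Here is a sketch of the equivalent scoring call. It uses synthetic data so that it runs on its own; the arrays `X` and `y` are invented stand-ins, not the Titanic columns.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 90 rows, 3 numeric features.
rng = np.random.RandomState(1)
X = rng.normal(size=(90, 3))
# The label depends mostly on the first feature, plus some noise.
y = (X[:, 0] + 0.5 * rng.normal(size=90) > 0).astype(int)

alg = LogisticRegression(random_state=1)
# One accuracy score per fold, same call shape as in the cell above.
scores = cross_val_score(alg, X, y, cv=3)

print("mean accuracy: %.3f" % scores.mean())
```

With the real data, `X` and `y` would be `titanic[predictors]` and `titanic["Survived"]`.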

#### F. preparing a submission to Kaggle; running the model on the test data

In [248]:
titanic_test = pandas.read_csv("./data/test.csv")

# Age column
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())

# Sex column
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

# Embarked column
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

# Fare column
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic["Fare"].median())
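
The test set gets the same cleanup as the training set, with one important detail: the fill values come from the training data. One way to avoid repeating the steps is a small helper. The function `clean_titanic` below is hypothetical (it is not part of the original notebook) and is shown on made-up toy frames.

```python
import numpy as np
import pandas

def clean_titanic(frame, age_median, fare_median):
    """Return a cleaned copy of frame using the encodings from this notebook."""
    cleaned = frame.copy()  # leave the caller's dataframe untouched
    cleaned["Age"] = cleaned["Age"].fillna(age_median)
    cleaned["Fare"] = cleaned["Fare"].fillna(fare_median)
    cleaned["Sex"] = cleaned["Sex"].map({"male": 0, "female": 1})
    cleaned["Embarked"] = cleaned["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return cleaned

# Made-up rows standing in for train.csv and test.csv.
toy_train = pandas.DataFrame({
    "Age": [20.0, 40.0, np.nan],
    "Fare": [7.0, 30.0, 8.0],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", np.nan, "Q"],
})
toy_test = pandas.DataFrame({
    "Age": [np.nan, 50.0],
    "Fare": [np.nan, 12.0],
    "Sex": ["female", "male"],
    "Embarked": ["C", "S"],
})

# Medians are computed once, on the training data only.
age_median = toy_train["Age"].median()
fare_median = toy_train["Fare"].median()

clean_train = clean_titanic(toy_train, age_median, fare_median)
clean_test = clean_titanic(toy_test, age_median, fare_median)
print(clean_test)
```

Passing the medians in explicitly keeps it obvious that the test set never contributes to its own fill values.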

In [249]:
# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

In [250]:
# generate a submission file
# commented out to prevent unintentional file overwrite/creation
# submission.to_csv("dataquest_logistic_regression.csv", index=False)

#### G. improving the dataquest code

Brain dump of ideas:
* Not using every feature in the model (relevant to the curse of dimensionality) -- see if using the same logistic regression with fewer features is helpful. Perhaps things like ticket number and fare are not as useful as sex and age. 
* Try different models
* Combine features together: perhaps combining sex and age into one feature somehow (encoding it with one digit for sex and one digit for age)
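
As a rough sketch of the last idea, one possible encoding puts the sex code in the tens digit and the age decade in the ones digit. The column name `SexAge` and the decade bucketing are our own assumptions about what "one digit for age" could mean, and the rows are made up.

```python
import pandas

# Made-up rows; Sex is already encoded as 0 = male, 1 = female.
toy = pandas.DataFrame({
    "Sex": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 4.0, 71.0],
})

# Age digit: decade of life, capped at 9 so it always fits in one digit.
age_digit = (toy["Age"] // 10).clip(upper=9).astype(int)

# Tens digit carries sex, ones digit carries the age decade.
toy["SexAge"] = toy["Sex"] * 10 + age_digit

print(toy)
```

A caveat: a linear model would treat `SexAge` as an ordered number, so this encoding may or may not help. It would need to be checked with cross validation like any other feature.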

In [251]:
# Helper functions: Use logistic regression, try using different features

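def make_titanic_test_predictions(predictors):
    """Cross-validate logistic regression on the training data with the given predictors.

    Despite the name, this does not touch the test set: it prints and returns
    the mean accuracy over 3 cross validation folds of the training data.
    """
    # Initialize our algorithm
    alg = LogisticRegression(random_state=1)
    # Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
    scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
    # Take the mean of the scores (because we have one for each fold)
    mean_accuracy = scores.mean()
    print "accuracy", mean_accuracy
    # Return the value so callers get a number back instead of None
    return mean_accuracy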

In our first attempt, predictors included `['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']`.    
Let's see what happens when we do something super bare bones with just `Sex` and `Age`. I expect that this will be less accurate because while these features do seem important, there is probably more to the relationship between people and survival than `Sex` and `Age`.

In [252]:
predictors2 = ['Sex', 'Age'] 
print predictors2
predictions2 = make_titanic_test_predictions(predictors2)

['Sex', 'Age']
accuracy 0.786756453423


Was it better or worse than expected? Slightly worse, as expected: `0.786756453423` versus `0.792368125701` with all seven predictors, though the gap is small.

In [253]:
predictors3 = ['Pclass', 'Sex', 'Age']
print predictors3
predictions3 = make_titanic_test_predictions(predictors3)

['Pclass', 'Sex', 'Age']
accuracy 0.793490460157


(You might notice that this is a bit of a manual, iterative way to play with features... more quantitative measures like correlation coefficients might automate this sort of process.)
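
As a sketch of what that could look like, pandas can compute the correlation of every numeric column with `Survived` in one call. The rows below are made up for illustration, so the numbers say nothing about the real dataset.

```python
import pandas

# Made-up rows with already-encoded features.
toy = pandas.DataFrame({
    "Pclass":   [3, 1, 3, 1, 3, 2, 1, 3],
    "Sex":      [0, 1, 1, 1, 0, 0, 1, 1],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, 28.0, 54.0, 27.0],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
})

# Pearson correlation of each feature with the target.
correlations = toy.corr()["Survived"].drop("Survived")

# Rank by absolute value: a strong negative correlation is just as useful.
ranked = correlations.abs().sort_values(ascending=False)

print(ranked)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a starting point for picking predictors, not a replacement for cross validation.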

In [255]:
predictors4 = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
print predictors4
predictions4 = make_titanic_test_predictions(predictors4)

['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
accuracy 0.792368125701
