In [1]:
import pandas
import numpy as np

In this notebook, we'll build a predictive model for survival on the Titanic based on training data provided by Kaggle. This is part of the Warmup Project for Data Science 2016.

#### A. import the training data.

In [2]:
titanic = pandas.read_csv("./data/train.csv")

# Uncomment print statements below to take a look at the 
# first 5 rows of the dataframe and the describing output.
# print(titanic.head(5))
# print(titanic.describe())

#### B. clean up the missing data. 

Occasionally a dataset contains missing values (null, not a number, NA, etc.) and we want to prevent these missing values from affecting our computations in unintended ways. In particular, this training data set has missing values for `Age`, so let's clean that up!

In [3]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
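As a quick sanity check on the median fill above, here is a minimal sketch on a made-up five-row frame (the `toy` data is an assumption for illustration, not the Kaggle file). It counts the missing ages before and after the fill.

```python
import numpy as np
import pandas

# Toy stand-in for the Kaggle data (values are invented for illustration).
toy = pandas.DataFrame({"Age": [22.0, np.nan, 38.0, 26.0, np.nan]})

# Count missing ages before cleaning.
missing_before = toy["Age"].isnull().sum()

# Series.median() skips NaN by default, so the fill value
# is computed from the observed ages only.
age_median = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(age_median)

# Count missing ages after cleaning.
missing_after = toy["Age"].isnull().sum()
print(missing_before, age_median, missing_after)
```

Running the same `isnull().sum()` check on the real `titanic` dataframe would show which other columns still need attention.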

#### C. convert non-numeric (categorical) variables into usable numbers!

In particular, `Sex` and `Embarked` should be converted into usable numbers. We'll find all the unique values for these non-numeric data points and replace them with numbers that can be used by the predictive model in a later step.

In [4]:
# Find all the unique genders 
print "unique genders are", titanic["Sex"].unique()

# From genders to numbers
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

unique genders are ['male' 'female']


In [5]:
# Find all the unique embarked values
print "unique embarked values are", titanic["Embarked"].unique()

# From embarked letters to numbers
titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

unique embarked values are ['S' 'C' 'Q' nan]
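The same encoding can also be written with `Series.map` and a dictionary per column, which keeps each mapping in one place. This is a sketch on invented rows; the names `toy`, `sex_codes` and `embarked_codes` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas

# Invented rows, including one missing Embarked value.
toy = pandas.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", np.nan, "Q"],
})

# One dictionary per categorical column (same codes as the cells above).
sex_codes = {"male": 0, "female": 1}
embarked_codes = {"S": 0, "C": 1, "Q": 2}

toy["Sex"] = toy["Sex"].map(sex_codes)
# Fill missing ports with "S" first, then translate letters to numbers.
toy["Embarked"] = toy["Embarked"].fillna("S").map(embarked_codes)

print(toy)
```

One possible side benefit is that `map` produces numeric columns directly, whereas repeated `.loc` assignments may leave the column with `object` dtype.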


#### D. cross validation, linear regression, first stab at predictions 

We want to make sure that we don't train our model on the same data that we'll make predictions on, so we're going to split the data into several folds. In each trial, one fold will be set aside for predictions, and the remaining folds will be used for training. Thus there's no overlap between the folds/partitions that were used for training and the one fold used for predictions. We'll run several trials with these fold combinations and eventually get predictions for the entire dataset.

In [6]:
# Code from dataquest mission 74, part 9.

# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It returns the row indices corresponding to train and test.
# Without shuffle=True the folds are contiguous blocks of rows, so the splits are the same on every run;
# random_state only has an effect when shuffling is enabled.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
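To make the fold logic concrete, here is a small numpy-only sketch of what an unshuffled k-fold split produces. The helper `contiguous_folds` is hypothetical and written for illustration; it is not scikit-learn's implementation.

```python
import numpy as np

def contiguous_folds(n_rows, n_folds):
    """Yield (train_idx, test_idx) pairs; every row lands in exactly one test fold."""
    indices = np.arange(n_rows)
    # Split the row indices into n_folds contiguous blocks.
    for test_idx in np.array_split(indices, n_folds):
        # Everything outside the current test block is used for training.
        train_idx = np.setdiff1d(indices, test_idx)
        yield train_idx, test_idx

folds = list(contiguous_folds(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Because each row appears in a test fold exactly once, concatenating the per-fold predictions in order yields one prediction per row of the dataset.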

In [7]:
print predictions

[array([  8.99877810e-02,   9.60756206e-01,   5.92676278e-01,
         9.31138728e-01,   5.29343071e-02,   1.70275685e-01,
         3.69943590e-01,   1.03474847e-01,   5.21597906e-01,
         8.74491050e-01,   6.48883611e-01,   8.29742769e-01,
         1.34797198e-01,  -1.61126844e-01,   6.58141307e-01,
         6.39819748e-01,   1.51733875e-01,   2.95432718e-01,
         5.35377959e-01,   6.21007683e-01,   2.61872592e-01,
         2.62687561e-01,   7.31739160e-01,   5.05995897e-01,
         5.61398567e-01,   3.35039734e-01,   1.30338808e-01,
         4.68765767e-01,   6.60737753e-01,   9.10819218e-02,
         4.77223920e-01,   1.04220026e+00,   6.60691613e-01,
         8.71539273e-02,   5.28550732e-01,   4.01874338e-01,
         1.30340307e-01,   1.29339672e-01,   5.72717129e-01,
         6.65238822e-01,   4.83215779e-01,   7.60807408e-01,
         1.30578363e-01,   8.71867121e-01,   7.09855487e-01,
         9.11369897e-02,   1.39181745e-01,   6.60691613e-01,
         6.82833485e-02

#### D. continued: accuracy!

How did this first stab of predictions go? The possible outcomes are 1 and 0 (survival is a binary thing), but the linear regression model output doesn't match this binary format. Thus we have to map our predictions to outcomes. We'll also compute the accuracy of these results by comparing our predictions to the `Survived` column of the training data. 

In [8]:
# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

# Take a look
# print(predictions.shape)
# print(titanic["Survived"].shape)

num_accurate_predictions = 0 # counter

# Check whether the predictions are correct
for i in range(predictions.shape[0]):
    if predictions[i] == titanic["Survived"][i]:
        num_accurate_predictions +=1

accuracy = float(num_accurate_predictions) / predictions.shape[0]
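The thresholding and counting loop above can also be expressed with vectorized numpy operations. This is a sketch on six invented values; `raw_predictions` and `actual` are hypothetical stand-ins for the concatenated model output and the `Survived` column.

```python
import numpy as np

# Invented regression outputs and true labels for illustration.
raw_predictions = np.array([0.09, 0.96, 0.59, -0.16, 1.04, 0.50])
actual = np.array([0, 1, 1, 0, 1, 1])

# Values strictly above 0.5 become 1; everything else becomes 0.
binary_predictions = (raw_predictions > 0.5).astype(int)

# The mean of a boolean array is the fraction of True entries,
# i.e. the share of predictions that match the true labels.
accuracy = np.mean(binary_predictions == actual)
print(accuracy)
```

Note that a raw value of exactly 0.5 maps to 0 here, matching the `> .5` / `<= .5` rule in the cell above.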


The accuracy of this linear regression model is `0.783389450056` -- definitely a lot of room for improvement! Perhaps using a different model or some feature engineering could help. :)

#### E. second stab: logistic regression

In [9]:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.787878787879


The accuracy of the logistic regression model is `0.787878787879` -- better, but not perfect. Let's go through making a submission to Kaggle before continuing to tweak the model.
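The `sklearn.cross_validation` module used above was removed in newer scikit-learn releases; the equivalent helper lives in `sklearn.model_selection`. Below is a sketch of the same three-fold scoring on synthetic data, assuming a recent scikit-learn is installed. `X` and `y` are randomly generated stand-ins, not the Titanic features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-feature dataset (an assumption for illustration only).
rng = np.random.RandomState(1)
X = rng.normal(size=(90, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=90) > 0).astype(int)

alg = LogisticRegression(random_state=1)

# One accuracy score per fold; cv=3 requests three folds.
scores = cross_val_score(alg, X, y, cv=3)
print(scores.mean())
```

With the real data, `X` would be `titanic[predictors]` and `y` would be `titanic["Survived"]`, as in the cell above.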

#### F. preparing a submission to kaggle; running the model on the test data

In [10]:
titanic_test = pandas.read_csv("./data/test.csv")

# Age column
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())

# Sex column
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

# Embarked column
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

# Fare column
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic["Fare"].median())
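Since the test data goes through the same steps as the training data, the cleaning could be wrapped in one function so the two cannot drift apart. This is a sketch under assumptions: the helper `clean` and the tiny `train` and `test` frames are hypothetical, and the medians are passed in so the test set reuses training statistics, as the cell above does.

```python
import numpy as np
import pandas

def clean(df, age_median, fare_median):
    """Return a cleaned copy of df, filling gaps with the supplied (training) medians."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Invented rows standing in for train.csv and test.csv.
train = pandas.DataFrame({
    "Age": [20.0, 30.0, np.nan], "Fare": [7.0, 8.0, 70.0],
    "Sex": ["male", "female", "male"], "Embarked": ["S", "C", np.nan],
})
test = pandas.DataFrame({
    "Age": [np.nan, 40.0], "Fare": [np.nan, 10.0],
    "Sex": ["female", "male"], "Embarked": ["Q", np.nan],
})

# Compute the medians once, from the training data only.
age_median = train["Age"].median()
fare_median = train["Fare"].median()

train_clean = clean(train, age_median, fare_median)
test_clean = clean(test, age_median, fare_median)
print(test_clean)
```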

In [11]:
# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

In [12]:
# generate a submission file
# commented out to prevent unintentional file overwrite/creation
# submission.to_csv("dataquest_logistic_regression.csv", index=False)

I uploaded the submission file to Kaggle; it received a score of 0.75120 (rank 3393). The model did roughly 3 to 4 percentage points worse on the test dataset than in cross-validation on the training dataset. That does "feel" like a big difference, but it doesn't seem like overfitting was the only issue. It seems more likely to me that there are nuanced differences in the passenger data that the current model did not capture.

#### G. improving the dataquest code

Brain dump of ideas:
* Not using every feature in the model (relevant to the curse of dimensionality): see whether the same logistic regression with fewer features does better. Perhaps things like ticket number and fare are not as useful as sex and age.
* Try different models
* Combine features together: perhaps combining sex and age into one feature somehow (encoding it with one digit for sex and one digit for age)

In [13]:
# Helper functions: Use logistic regression, try using different features

def make_titanic_test_predictions(predictors):
    # Initialize our algorithm
    alg = LogisticRegression(random_state=1)
    
    # Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
    scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
    
    # Take the mean of the scores (because we have one for each fold)
    print "accuracy", scores.mean()
    return scores.mean()

def prepare_submission_file_different_predictors(predictors, filename):
    # Initialize the algorithm class
    alg = LogisticRegression(random_state=1)

    # Train the algorithm using all the training data
    alg.fit(titanic[predictors], titanic["Survived"])

    # Make predictions using the test set.
    predictions = alg.predict(titanic_test[predictors])
    
    # Create a new dataframe with only the columns Kaggle wants from the dataset.
    submission = pandas.DataFrame({
            "PassengerId": titanic_test["PassengerId"],
            "Survived": predictions
        }) 
    
    # Save it
    submission.to_csv(filename, index=False)

In our first attempt, predictors included all of the provided features from the kaggle dataset:      
`['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']`.    

Let's see what happens when we do something super bare bones with just `Sex` and `Age`. I expect that this will be less accurate because, while these features do seem important, there is probably more to the relationship between people and survival than `Sex` and `Age`.

The code in the next few cells somewhat resembles one of the data mining approaches in the reading (I believe the reading mentioned computing the correlation coefficient for each of the variables). We'll see which variables work well for predictions, and then proceed based on which variables seem to be helping the accuracy score.
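For reference, the per-variable correlation idea mentioned above could look roughly like this in pandas. The six rows in `toy` are invented, so the numbers say nothing about the real dataset. Correlation only captures linear association, so treat it as a rough screening step.

```python
import pandas

# Invented, already-numeric rows for illustration.
toy = pandas.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Survived": [0, 1, 1, 1, 0, 0],
})

features = ["Pclass", "Sex", "Age"]

# Pearson correlation of each feature column with the target.
correlations = toy[features].corrwith(toy["Survived"])

# Rank features by the strength of the association, ignoring its sign.
ranked = correlations.abs().sort_values(ascending=False)
print(ranked)
```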

In [14]:
predictors2 = ['Sex', 'Age'] 
print predictors2
predictions2 = make_titanic_test_predictions(predictors2)

['Sex', 'Age']
accuracy 0.786756453423


It turns out that using just `Sex` and `Age` gives us a score comparable to using all of the features! This definitely makes me think that some of the features in the dataset are not helpful in this logistic regression model. This is not a surprise, because we know that more variables are not necessarily better with a fixed amount of data (the curse of dimensionality).

Based on contextual knowledge about the Titanic story (DataQuest mission 74 also mentions this), we know that passenger class was relevant because the first class cabins were closer to the deck of the ship. A distance advantage to safety almost certainly would impact survival rate, so let's try including `Pclass` in addition to the bare-bones model based on just `Sex` and `Age`.

In [15]:
# prepare_submission_file_different_predictors(predictors2, "logistic_regression_SA.csv")

This bare bones two-feature model also did better on the test set -- it received a score of 0.76555 (now at rank 3098; improvement compared to first submission score was 0.01435). 

In [16]:
predictors3 = ['Pclass', 'Sex', 'Age']
print predictors3
predictions3 = make_titanic_test_predictions(predictors3)

['Pclass', 'Sex', 'Age']
accuracy 0.789001122334


In [17]:
# prepare_submission_file_different_predictors(predictors3, "logistic_regression_PSA.csv")

This three-feature model did very slightly better than the two-feature model in cross-validation on the training dataset (an improvement of about 0.002, probably not "significant"), and it had the same performance as the two-feature model on the test dataset: it received a score of 0.76555 (same place on the Kaggle leaderboard).

#### H. Other things to try (for model_iteration_2.ipynb) !

Due to time constraints I didn't have a bunch of time to implement more ideas -- but these are some things I will explore more in future iterations and perhaps discuss in class soon:

* Take another look at the data, see what the unique values themselves look like. For example, is there some pattern in the names of the passengers?
* Combine variables:
    * In the brain dump cell earlier I mentioned combining `sex` and `age` somehow. Consider "female child, male child, female adult, male adult, female senior, male senior", and put these categories in one variable. Maybe this would help the curse of dimensionality problem? Or maybe it would prevent the model from learning nuances that need `sex` and `age` to be provided separately? 
* Consider the tradeoff between doing a bunch of feature engineering myself and letting the model figure out the trends on its own. There must be a sweet spot between the data processing I do and what happens automatically in logistic regression.
* Revisit exploration.ipynb for more bottom-up data inspiration!
* Different models provided by scikit-learn (Random Forest?)
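As a starting point for the combined `sex` and `age` idea in the list above, here is a rough sketch. The age cutoffs (under 16 is a child, 60 and over is a senior) and the column name `SexAge` are assumptions chosen for illustration, not values from the dataset or the reading.

```python
import pandas

# Invented rows; Sex is already encoded as 0 = male, 1 = female.
toy = pandas.DataFrame({
    "Sex": [0, 1, 1, 0, 1],
    "Age": [8.0, 30.0, 70.0, 45.0, 12.0],
})

# Bucket ages into [0, 16), [16, 60), [60, 120); the cutoffs are assumptions.
age_group = pandas.cut(
    toy["Age"], bins=[0, 16, 60, 120],
    labels=["child", "adult", "senior"], right=False,
)

# Build one categorical feature such as "female child" or "male adult".
sex_label = toy["Sex"].map({0: "male", 1: "female"})
toy["SexAge"] = sex_label + " " + age_group.astype(str)

# Logistic regression needs numbers, so one-hot encode the combined category.
dummies = pandas.get_dummies(toy["SexAge"])
print(dummies)
```

Whether this helps or hurts compared to passing `Sex` and `Age` separately is exactly the open question raised above; cross-validation with both variants would be one way to compare.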