## Model 1 Iteration for Kaggle Titanic Dataset

I start by importing all of the libraries that I need! I then read in the training dataset and look at what's inside.

In [240]:
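import pandas
import numpy as np
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation.
# Note: sklearn.cross_validation is the pre-0.20 module; newer releases provide sklearn.model_selection instead.
from sklearn.cross_validation import KFold
from sklearn import cross_validation
# linear_model provides the LogisticRegression class used further down
from sklearn import linear_model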

# We can use the pandas library in python to read in the csv file.
# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pandas.read_csv("train.csv")

# Print the first 5 rows of the dataframe.
print(titanic.head(5))
print(titanic.describe())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex  Age  SibSp  \
0                            Braund, Mr. Owen Harris    male   22      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1   
2                             Heikkinen, Miss. Laina  female   26      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   35      1   
4                           Allen, Mr. William Henry    male   35      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  

Not all ages were filled in, so I filled the missing ones with the median of all ages. I then changed the values of Sex from 'male' and 'female' to numbers that I could actually use in my calculations: 0 and 1.

In [241]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())

# Replace all the occurrences of male with the number 0 and of female with the number 1.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

['male' 'female']
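
As a side note, the same two steps can be tried on a tiny frame to see exactly what they do. This is only a sketch: the `toy` frame and its values are made up, and it uses `Series.map` with a dictionary as an alternative to the `.loc` assignments in the cell above.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Sex": ["male", "female", "male"],
})

# The median of the known ages (22 and 38) is 30, so the gap is filled with 30.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Map each label to its code in one step; any label missing from the
# dictionary would become NaN, which makes unexpected values easy to spot.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

print(toy)
```

One possible advantage of `map` is that the whole encoding lives in a single dictionary, so it can be reused for the test data later.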


Then I converted the three values of the port of embarkation into numbers as well: S, C and Q became 0, 1 and 2. Missing values were filled in with S first.

In [242]:
# Find all the unique values for "Embarked".
print(titanic["Embarked"].unique())

titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == 'S', "Embarked"] = 0
titanic.loc[titanic["Embarked"] == 'C', "Embarked"] = 1
titanic.loc[titanic["Embarked"] == 'Q', "Embarked"] = 2



['S' 'C' 'Q' nan]


I then used the LinearRegression class from sklearn on the training data, which I split into 3 folds so that I could train on two of them and test on the remaining one, for each fold in turn.

In [243]:
# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It returns the row indices corresponding to train and test.
# random_state only has an effect when shuffle=True; without shuffling, the folds are contiguous blocks of rows, so we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
    
print(predictions)
    

[array([  8.99877810e-02,   9.60756206e-01,   5.92676278e-01,
         9.31138728e-01,   5.29343071e-02,   1.70275685e-01,
         3.69943590e-01,   1.03474847e-01,   5.21597906e-01,
         8.74491050e-01,   6.48883611e-01,   8.29742769e-01,
         1.34797198e-01,  -1.61126844e-01,   6.58141307e-01,
         6.39819748e-01,   1.51733875e-01,   2.95432718e-01,
         5.35377959e-01,   6.21007683e-01,   2.61872592e-01,
         2.62687561e-01,   7.31739160e-01,   5.05995897e-01,
         5.61398567e-01,   3.35039734e-01,   1.30338808e-01,
         4.68765767e-01,   6.60737753e-01,   9.10819218e-02,
         4.77223920e-01,   1.04220026e+00,   6.60691613e-01,
         8.71539273e-02,   5.28550732e-01,   4.01874338e-01,
         1.30340307e-01,   1.29339672e-01,   5.72717129e-01,
         6.65238822e-01,   4.83215779e-01,   7.60807408e-01,
         1.30578363e-01,   8.71867121e-01,   7.09855487e-01,
         9.11369897e-02,   1.39181745e-01,   6.60691613e-01,
         6.82833485e-02
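
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how an unshuffled 3-fold split hands out row indices. The helper name `kfold_indices` is hypothetical, and the row count of 891 is an assumption about train.csv. It imitates the behaviour of an unshuffled k-fold split but is not sklearn's own code.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train, test) index lists for unshuffled k-fold splitting."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        # The test fold is one contiguous block of rows...
        test = list(range(start, start + size))
        # ...and the training rows are everything before and after it.
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Assumption: 891 rows in the training file.
folds = list(kfold_indices(891, 3))
for train, test in folds:
    print(len(train), len(test), test[0], test[-1])
```

Every row lands in exactly one test fold, which is why the three prediction arrays can later be joined back into one array that lines up with the original rows.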

I then put the three arrays back into one and made the predictions binary again, either 0 or 1 for survival, so I could compare them against the actual survival outcomes. I counted all the correct ones and divided by the total number to get an accuracy of ~78 percent, which isn't great.

In [244]:
# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
# print predictions
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

# print predictions, titanic["Survived"]
correct = 0.0

for i in range(0, len(predictions)):
    if predictions[i] == titanic["Survived"][i]:
        correct += 1
        
accuracy = correct / len(predictions)
print(accuracy)

0.783389450056
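
The thresholding and counting can also be written without a loop. The following is a small sketch on made-up numbers (the `raw` and `actual` arrays are illustrative, not taken from the dataset). It also shows that linear regression outputs can fall below 0 or above 1, which is why they need to be cut at 0.5 before being compared with the labels.

```python
import numpy as np

# Made-up regression outputs and true labels, for illustration only.
raw = np.array([0.09, 0.96, 0.59, -0.16, 1.04, 0.50])
actual = np.array([0, 1, 1, 0, 1, 1])

# Threshold at 0.5: values above it become 1, everything else becomes 0.
predicted = (raw > 0.5).astype(int)

# The mean of a boolean array is the fraction of True entries,
# which here is the fraction of correct predictions.
accuracy = np.mean(predicted == actual)
print(predicted, accuracy)
```

Note that a value of exactly 0.5 is mapped to 0, matching the `<= .5` rule in the cell above.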


I then used the LogisticRegression class from sklearn, computed the accuracy with 3-fold cross-validation again, and averaged the results, giving me a similar accuracy of ~78.8 percent.

In [245]:
# Initialize our algorithm
alg = linear_model.LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.787878787879
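
For readers on a newer scikit-learn release, here is a hedged sketch of the same idea using `sklearn.model_selection`, the module that newer versions provide for cross-validation. The data is synthetic and stands in for the Titanic predictors, so the scores it prints say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 90 rows, 2 features; the label depends
# mostly on the first feature plus some noise.
rng = np.random.RandomState(1)
X = rng.normal(size=(90, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=90) > 0).astype(int)

alg = LogisticRegression(random_state=1)
# With an integer cv and a classifier, the folds are stratified by class,
# and the default score for a classifier is accuracy.
scores = cross_val_score(alg, X, y, cv=3)
print(scores, scores.mean())
```

Because the folds are stratified here, they are not necessarily the same rows as in a plain unshuffled 3-fold split, so small differences between the two evaluation approaches are to be expected.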


I then cleaned the test data with the same conversions to numeric categories as before.

In [246]:
titanic_test = pandas.read_csv("test.csv")

print(titanic_test["Sex"].unique())

titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())

titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")

titanic_test.loc[titanic_test["Sex"] == 'male', "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == 'female', "Sex"] = 1

titanic_test.loc[titanic_test["Embarked"] == 'S', "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == 'C', "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == 'Q', "Embarked"] = 2

titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic_test["Fare"].median())

print(titanic_test)



['male' 'female']
     PassengerId  Pclass                                               Name  \
0            892       3                                   Kelly, Mr. James   
1            893       3                   Wilkes, Mrs. James (Ellen Needs)   
2            894       2                          Myles, Mr. Thomas Francis   
3            895       3                                   Wirz, Mr. Albert   
4            896       3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)   
5            897       3                         Svensson, Mr. Johan Cervin   
6            898       3                               Connolly, Miss. Kate   
7            899       2                       Caldwell, Mr. Albert Francis   
8            900       3          Abrahim, Mrs. Joseph (Sophie Halaut Easu)   
9            901       3                            Davies, Mr. John Samuel   
10           902       3                                   Ilieff, Mr. Ylio   
11           903       1          
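
Since the training and test data go through the same steps, one option is to wrap them in a helper. This is a sketch under assumptions: the function name `clean` and the tiny frames are made up, and both fill values are taken from the training data here, whereas the cell above fills Fare with the test set's own median.

```python
import pandas as pd

def clean(frame, age_median, fare_median):
    """Return a cleaned copy; the fill values are passed in explicitly."""
    out = frame.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    return out

# Tiny made-up frames standing in for train.csv and test.csv.
train = pd.DataFrame({"Age": [20.0, 40.0], "Fare": [7.25, 71.28],
                      "Embarked": ["S", None], "Sex": ["male", "female"]})
test = pd.DataFrame({"Age": [float("nan")], "Fare": [float("nan")],
                     "Embarked": ["Q"], "Sex": ["female"]})

# Assumption: both fill values come from the training data.
age_med, fare_med = train["Age"].median(), train["Fare"].median()
train_clean = clean(train, age_med, fare_med)
test_clean = clean(test, age_med, fare_med)
print(test_clean)
```

Passing the medians in as arguments keeps it visible which dataset each fill value came from.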

I then trained the logistic regression on all of the training data and made predictions on the test set. Finally, I created a CSV submission file for Kaggle! Yay!

In [247]:
# Initialize the algorithm class
alg = linear_model.LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
    
submission.to_csv("kaggle.csv", index=False)
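
Before uploading, it can be worth checking that the file has the expected shape. The sketch below uses only the standard library; `check_submission` is a hypothetical helper, and the two-row `sample` string is made up rather than read from kaggle.csv.

```python
import csv
import io

def check_submission(text):
    """Return the number of data rows if the CSV text looks well-formed."""
    reader = csv.DictReader(io.StringIO(text))
    # The header should contain exactly these two columns, in this order.
    if reader.fieldnames != ["PassengerId", "Survived"]:
        raise ValueError("unexpected columns: %r" % (reader.fieldnames,))
    count = 0
    for row in reader:
        int(row["PassengerId"])  # raises ValueError if not an integer
        if row["Survived"] not in ("0", "1"):
            raise ValueError("Survived must be 0 or 1, got %r" % row["Survived"])
        count += 1
    return count

# Made-up two-row sample in the same layout as the submission file.
sample = "PassengerId,Survived\n892,0\n893,1\n"
print(check_submission(sample))
```

To run it against the real file, the contents of kaggle.csv could be read with `open("kaggle.csv").read()` and passed in as `text`.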

## Part One Complete! Now on to making the model better
_____

I want to improve my model to get a better accuracy score than ~78 percent. 