# Importing all the things!

In [24]:
import pandas as pd

# DataQuest Tutorial
These next few cells are me following the DataQuest tutorial for the Titanic dataset.

In [25]:
# We can use the pandas library in python to read in the csv file.
# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pd.read_csv("./data/train.csv")

# Print summary statistics for the numeric columns of the dataframe.
print(titanic.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   20.125000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   38.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


Now, we're filling in missing data by just replacing all the missing values with the median. Is this really something that you can do? Is this okay? It feels wrong to me, because you'd be concentrating the data even more heavily around the median. If we have to fill in the data, wouldn't it be better to fill it in with something that contains noise?

In [30]:
# Fill in missing ages with the median age.
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
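
The worry about median filling can be checked on a toy column. Below is a minimal sketch, assuming a small made-up `age` series (not the Titanic data), that compares the median fill with one noisier alternative: drawing each missing value at random from the observed ones. The names `median_filled` and `sampled_filled` are illustrative, and random sampling is just one possible choice, not something the tutorial recommends.

```python
import numpy as np
import pandas as pd

# Made-up ages with gaps; NOT the Titanic data.
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan, 14.0, 58.0])

# Option 1: fill every gap with the median of the observed values.
median_filled = age.fillna(age.median())

# Option 2 (an assumption, not from the tutorial): fill each gap with a
# value drawn at random from the observed ages, which keeps some spread.
rng = np.random.RandomState(1)
observed = age.dropna().values
draws = rng.choice(observed, size=int(age.isnull().sum()))
sampled_filled = age.copy()
sampled_filled[age.isnull()] = draws

print("std of observed values:", age.std())
print("std after median fill: ", median_filled.std())
print("std after random fill: ", sampled_filled.std())
```

On this toy column the median fill shrinks the standard deviation, which is the concentration effect described above. Whether that matters for the model is a separate question that only a comparison of scores can answer.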

Now, we want to recode the sex column so that it's a number, and not a string. 

In [31]:
# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())

# Replace all the occurrences of male with the number 0, and female with the number 1.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

[0 1]
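
The same recoding can also be written as a single lookup. This is a minimal sketch on a made-up four-row series (not the real column), using `Series.map` with a dictionary. The name `sex_codes` is illustrative.

```python
import pandas as pd

# Tiny stand-in for the "Sex" column; NOT the real data.
sex = pd.Series(["male", "female", "female", "male"])

# A dict passed to Series.map recodes every label in one step; any label
# missing from the dict would become NaN, which makes typos easy to spot.
sex_codes = sex.map({"male": 0, "female": 1})

print(sex_codes.tolist())  # [0, 1, 1, 0]
```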


Doing the same thing with the Embarked column. Again, I'm very unsure that it's "okay" to just replace missing values with the most common value. Couldn't that confuse our model?

In [32]:
# Find all the unique values for "Embarked".
print(titanic["Embarked"].unique())

titanic["Embarked"] = titanic["Embarked"].fillna("S")
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2

[0 1 2]


  result = getattr(x, name)(y)


TypeError: invalid type comparison

Now that we've cleaned up our data, let's move on to machine learning!

In [33]:
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.cross_validation import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross validation folds for the titanic dataset.  It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
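
The fold loop above can be pictured with plain index arithmetic. This is a minimal sketch, assuming the folds are consecutive blocks of rows (my understanding of how `KFold` behaves when shuffling is off) and using `np.array_split` as a stand-in rather than scikit-learn itself. The row count comes from the `describe()` output earlier.

```python
import numpy as np

n_rows = 891   # number of rows in train.csv, per the describe() output
n_folds = 3

# Split the row indices into consecutive, (nearly) equal-sized blocks.
indices = np.arange(n_rows)
folds = np.array_split(indices, n_folds)

for i, test_idx in enumerate(folds):
    # Everything outside the current block is used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("fold", i, "train rows:", len(train_idx), "test rows:", len(test_idx))
```

Each row lands in exactly one test block, which is why concatenating the per-fold predictions later gives one prediction per passenger in the original row order.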

Now that we've made some predictions, we want to evaluate them!

In [34]:
import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0

accuracy = sum([predictions[i] == titanic["Survived"].tolist()[i] for i in range(len(predictions))])/float(len(predictions))
print("accuracy: " + str(accuracy))

accuracy: 0.783389450056
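
The thresholding and accuracy steps can also be done with array operations instead of a Python loop. This is a minimal sketch on made-up numbers (not the notebook's predictions), where `raw` and `truth` are illustrative names. Note that two of the made-up values fall outside the 0 to 1 range, which a linear regression is free to do.

```python
import numpy as np

# Made-up regression outputs and true labels; NOT the notebook's values.
raw = np.array([0.91, 0.12, 0.50, 0.73, -0.05, 1.20])
truth = np.array([1, 0, 1, 1, 0, 0])

# Anything above 0.5 counts as "survived"; 0.5 itself and below do not.
labels = (raw > 0.5).astype(int)

# Comparing two arrays gives booleans; their mean is the fraction matched.
accuracy = (labels == truth).mean()

print(labels.tolist())
print(accuracy)
```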


Okay, so now, as per the tutorial, we're trying logistic regression. I believe this is a better fit for classification: it squeezes its output into a probability between 0 and 1, whereas linear regression can predict values outside that range.

In [35]:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_validation.cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.787878787879
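
To make the "probability between 0 and 1" idea concrete, here is a minimal sketch of the logistic (sigmoid) function applied to some hypothetical linear scores. This only illustrates the squashing step; it is not how scikit-learn fits its weights, and the `scores` values are made up.

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued score into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (weights times features, plus an intercept).
scores = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

probabilities = sigmoid(scores)
# A score of 0 maps to probability 0.5, so thresholding the probability
# at 0.5 is the same as checking the sign of the score.
labels = (probabilities > 0.5).astype(int)

print(np.round(probabilities, 3))
print(labels)
```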


Now, we actually want to make predictions on the test data provided to us by Kaggle!

First, let's load in and clean up the data.

In [37]:
titanic_test = pd.read_csv("./data/test.csv")

#Set age values to the median age in the training set
titanic_test["Age"] = titanic_test["Age"].fillna(titanic["Age"].median())

#Recode the sex
titanic_test.loc[titanic_test["Sex"] == "male", "Sex"] = 0
titanic_test.loc[titanic_test["Sex"] == "female", "Sex"] = 1

#Fill in the missing values and recode the point of embarkation
titanic_test["Embarked"] = titanic_test["Embarked"].fillna("S")
titanic_test.loc[titanic_test["Embarked"] == "S", "Embarked"] = 0
titanic_test.loc[titanic_test["Embarked"] == "C", "Embarked"] = 1
titanic_test.loc[titanic_test["Embarked"] == "Q", "Embarked"] = 2

#Fill in missing fare values
titanic_test["Fare"] = titanic_test["Fare"].fillna(titanic["Fare"].median())
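
Since the training and test frames go through the same steps, the cleaning could be wrapped in one function. This is a minimal sketch under stated assumptions: `clean` is a hypothetical helper, the two tiny frames are made up, and the medians are passed in so that the test set is filled with training-set statistics, as in the cell above.

```python
import numpy as np
import pandas as pd

def clean(df, age_median, fare_median):
    """Return a recoded copy of df; the medians are passed in so that the
    test set can be filled with statistics from the training set."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].fillna("S").map({"S": 0, "C": 1, "Q": 2})
    return out

# Tiny made-up frames standing in for train.csv and test.csv.
train = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Fare": [7.25, 71.28, 8.05],
                      "Sex": ["male", "female", "male"], "Embarked": ["S", "C", np.nan]})
test = pd.DataFrame({"Age": [np.nan, 40.0], "Fare": [np.nan, 10.0],
                     "Sex": ["female", "male"], "Embarked": ["Q", "S"]})

age_median, fare_median = train["Age"].median(), train["Fare"].median()
train_clean = clean(train, age_median, fare_median)
test_clean = clean(test, age_median, fare_median)
print(test_clean)
```

Keeping the recoding in one place makes it harder for the two frames to drift apart, for example by filling a column in one frame but not the other.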


Now, we are making a prediction! This will have an accuracy of about 0.75, but we'll work on this in future iterations!

In [38]:
# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle.csv", index=False)

Indeed, my model had a score of 0.75120 when I uploaded it to Kaggle.


# Model Iteration 1
For this iteration, I'll switch away from encoding the port of embarkation as increasing numbers (0 for S, 1 for C and 2 for Q). The reason is that the model finds a weight for each column. If we encode the port as a number that increases, we might falsely give the model the impression that the ports are ordered, so that embarking at port 2 means someone embarked twice as much (whatever that means...) as someone who embarked at port 1. To avoid this, I will transition to one-hot encoding. This means that I will add the columns "EmbarkedS", "EmbarkedC", and "EmbarkedQ". Each column will have a 1 in it if a person embarked at that port and a 0 otherwise.

(I tried two different variations of this model, one where I filled in the NaN values for Embarked and one where I didn't)

In [39]:
titanic_it1 = pd.read_csv("./data/train.csv")

# Note: this prints the earlier, already-cleaned titanic frame rather than titanic_it1.
print(titanic.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  891.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.361582    0.523008   
std     257.353842    0.486592    0.836071   13.019697    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000   22.000000    0.000000   
50%     446.000000    0.000000    3.000000   28.000000    0.000000   
75%     668.500000    1.000000    3.000000   35.000000    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare    Embarked  
count  891.000000  891.000000  891.000000  
mean     0.381594   32.204208    0.361392  
std      0.806057   49.693429    0.635673  
min      0.000000    0.000000    0.000000  
25%      0.000000    7.910400    0.000000  
50%      0.000000   14.454200    0.000000  
75%      0.000000   31.000000    1.000000  
max      6.000000

Here, I'm going to recode the Sex and Embarked columns.

In [40]:
# Fill in missing ages with the median age.
titanic_it1["Age"] = titanic_it1["Age"].fillna(titanic_it1["Age"].median())

#Recode the sex
titanic_it1.loc[titanic_it1["Sex"] == "male", "Sex"] = 0
titanic_it1.loc[titanic_it1["Sex"] == "female", "Sex"] = 1

#Recode the Port where people Embarked
titanic_it1["Embarked"] = titanic_it1["Embarked"].fillna("S")
titanic_it1['EmbarkedS'] = titanic_it1['Embarked'].apply(lambda x: int(x == 'S'))
titanic_it1['EmbarkedC'] = titanic_it1['Embarked'].apply(lambda x: int(x == 'C'))
titanic_it1['EmbarkedQ'] = titanic_it1['Embarked'].apply(lambda x: int(x == 'Q'))
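
pandas also has a built-in helper for this kind of recoding. Here is a minimal sketch on a made-up four-row series (not the real column), using `pd.get_dummies`. The `prefix_sep=""` argument is chosen here only so that the column names match the ones created by hand above.

```python
import pandas as pd

# Tiny stand-in for the "Embarked" column; NOT the real data.
embarked = pd.Series(["S", "C", "Q", "S"], name="Embarked")

# get_dummies builds one indicator column per distinct label.
# astype(int) gives plain 0/1 values regardless of the pandas version.
one_hot = pd.get_dummies(embarked, prefix="Embarked", prefix_sep="").astype(int)

print(one_hot)
```

Each row has exactly one 1 across the three columns, so no port looks "bigger" than another to the model.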

Here, I'm just checking to see that my recoding worked by displaying the dataframe.

In [41]:
titanic_it1

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,EmbarkedS,EmbarkedC,EmbarkedQ
0,1,0,3,"Braund, Mr. Owen Harris",0,22,1,0,A/5 21171,7.2500,,S,1,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38,1,0,PC 17599,71.2833,C85,C,0,1,0
2,3,1,3,"Heikkinen, Miss. Laina",1,26,0,0,STON/O2. 3101282,7.9250,,S,1,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35,1,0,113803,53.1000,C123,S,1,0,0
4,5,0,3,"Allen, Mr. William Henry",0,35,0,0,373450,8.0500,,S,1,0,0
5,6,0,3,"Moran, Mr. James",0,28,0,0,330877,8.4583,,Q,0,0,1
6,7,0,1,"McCarthy, Mr. Timothy J",0,54,0,0,17463,51.8625,E46,S,1,0,0
7,8,0,3,"Palsson, Master. Gosta Leonard",0,2,3,1,349909,21.0750,,S,1,0,0
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",1,27,0,2,347742,11.1333,,S,1,0,0
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",1,14,1,0,237736,30.0708,,C,0,1,0


In [42]:
# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "EmbarkedS", "EmbarkedC", "EmbarkedQ"]

# Initialize our algorithm class
alg = LogisticRegression()
# Generate cross validation folds for the titanic dataset.  It returns the row indices corresponding to train and test.
# We set random_state to ensure we get the same splits every time we run this.
kf = KFold(titanic_it1.shape[0], n_folds=3, random_state=1)

predictions = []
for train, test in kf:
    # The predictors we're using to train the algorithm.  Note how we only take the rows in the train folds.
    train_predictors = (titanic_it1[predictors].iloc[train,:])
    # The target we're using to train the algorithm.
    train_target = titanic_it1["Survived"].iloc[train]
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic_it1[predictors].iloc[test,:])
    predictions.append(test_predictions)
    
# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)
    
accuracy = sum([predictions[i] == titanic_it1["Survived"].tolist()[i] for i in range(len(predictions))])/float(len(predictions))
print("accuracy: " + str(accuracy))

accuracy: 0.786756453423


Okay, so my accuracy here is a little better than the 78.3 percent from the linear regression loop -- now I get around 78.7 percent across the cross-validation folds of the training data, so I'm going to try another submission to Kaggle.

Below, I'll generate my submission file for Kaggle

First, I need to recode the data in the same way. 

In [43]:
titanic_test_it1 = pd.read_csv("./data/test.csv")

#Set age values to the median age in the training set
titanic_test_it1["Age"] = titanic_test_it1["Age"].fillna(titanic_it1["Age"].median())

#Recode the sex
titanic_test_it1.loc[titanic_test_it1["Sex"] == "male", "Sex"] = 0
titanic_test_it1.loc[titanic_test_it1["Sex"] == "female", "Sex"] = 1

#Recode the point of embarkation as one-hot columns (missing values are not filled in here)
titanic_test_it1['EmbarkedS'] = titanic_test_it1['Embarked'].apply(lambda x: int(x == 'S'))
titanic_test_it1['EmbarkedC'] = titanic_test_it1['Embarked'].apply(lambda x: int(x == 'C'))
titanic_test_it1['EmbarkedQ'] = titanic_test_it1['Embarked'].apply(lambda x: int(x == 'Q'))

#Fill in missing fare values
titanic_test_it1["Fare"] = titanic_test_it1["Fare"].fillna(titanic_it1["Fare"].median())
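
The test-set cell above builds the indicator columns without first filling in missing ports. This is a minimal sketch, on a made-up three-row series, of what the same `apply` lambdas do when a value is missing. The frame name `flags` is illustrative.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for "Embarked" with one missing value; NOT the real data.
embarked = pd.Series(["S", np.nan, "Q"])

# NaN is not equal to any string, so each comparison below is False for
# the missing row and every indicator ends up 0.
flags = pd.DataFrame({
    "EmbarkedS": embarked.apply(lambda x: int(x == "S")),
    "EmbarkedC": embarked.apply(lambda x: int(x == "C")),
    "EmbarkedQ": embarked.apply(lambda x: int(x == "Q")),
})

print(flags)
```

So a row with a missing port is not an error under this scheme; it simply gets zeros in all three columns, which acts like a fourth, implicit "unknown" category.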


In [44]:
# Initialize the algorithm class
alg = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
alg.fit(titanic_it1[predictors], titanic_it1["Survived"])

# Make predictions using the test set.
predictions = alg.predict(titanic_test_it1[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pd.DataFrame({
        "PassengerId": titanic_test_it1["PassengerId"],
        "Survived": predictions
    })

submission.to_csv("kaggle_it1_5.csv", index=False)

When I submitted this model, I got a score of 0.74641 both times (with and without filling in the NaN values for Embarked), interestingly enough, which is worse than I did with the DataQuest model. Maybe this has something to do with the fact that I pulled out more columns for something that isn't all that important, and the model got confused.
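
It may help to translate the two scores back into counts before reading too much into the gap. The sketch below is rough arithmetic only, and it rests on assumptions that are not from this notebook: that the score is plain accuracy, and that 418 rows are scored (my understanding of the size of test.csv; Kaggle may score only part of the file, which would shrink the counts proportionally).

```python
# Assumption: the score is plain accuracy over the scored rows. The row
# count below is an assumption about test.csv, not taken from the notebook,
# and Kaggle may score only part of the file.
n_rows = 418

dataquest_score = 0.75120
one_hot_score = 0.74641

# Convert each accuracy back into an approximate count of correct rows.
dataquest_correct = round(dataquest_score * n_rows)
one_hot_correct = round(one_hot_score * n_rows)

print(dataquest_correct, one_hot_correct, dataquest_correct - one_hot_correct)
```

Under these assumptions the two submissions differ by only a couple of passengers, so the drop may not say much about one-hot encoding either way.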