## Model Iteration 1

### Data Cleaning

In [64]:
import pandas as pd
df = pd.read_csv('data/train.csv')
df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


Let's first check for columns with missing values.

In [65]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
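
Raw counts are easier to judge as a share of the rows. Here is a minimal sketch of that idea on a made-up toy frame (the `toy` data and the `missing_share` name are illustrative assumptions, not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Cabin": [None, "C85", None, None],
})

# isnull() gives a boolean frame; the column mean is the fraction missing.
missing_share = toy.isnull().mean()
# Keep only the columns that actually have gaps, worst first.
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```

Applied to `df`, the same two lines would rank **Cabin**, **Age** and **Embarked** by how much of each column is missing.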

Since the **Age** column is missing 177 of its 891 values, let's impute them with the median age.

In [66]:
df['Age'] = df.Age.fillna(df.Age.median())
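
To see what median imputation does, here is a small sketch on made-up ages (the `ages` series is an assumption for illustration only). The median is less sensitive to a few very old passengers than the mean would be.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 30.0, np.nan, 40.0, 80.0])  # made-up ages

# median() skips NaN by default, so it uses only the observed values.
median_age = ages.median()
mean_age = ages.mean()

# Replace the missing entry with the median of the observed ages.
filled = ages.fillna(median_age)
print(median_age, mean_age)
print(filled.tolist())
```

In this toy series the single value of 80 pulls the mean above the median, which is one reason the median is a common default for skewed columns.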

The next significant feature is **Sex**.

In [67]:
df.Sex.unique()

array(['male', 'female'], dtype=object)

Let's convert Sex to a binary representation (1 for female, 0 for male).

In [68]:
df['Sex'] = df.Sex.apply(lambda sex: int(sex == 'female'))
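
The same encoding can also be written as a vectorised comparison. This is only a sketch of an equivalent approach on a made-up series, not a change to the cell above:

```python
import pandas as pd

sex = pd.Series(["male", "female", "female", "male"])  # made-up sample

# Elementwise comparison gives booleans; astype(int) turns them into 0/1.
encoded = (sex == "female").astype(int)
print(encoded.tolist())
```

Either way, the conversion should run only once: comparing the already-encoded integers with `'female'` a second time would set every row to 0.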

The next feature is **Embarked**.

In [69]:
df.Embarked.unique()

array(['S', 'C', 'Q', nan], dtype=object)

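**Embarked** has two missing values. Let's fill them with the most common port, then map each port to an integer code.

In [78]: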
df['Embarked'] = df.Embarked.fillna(df.Embarked.mode().values[0])

mapping = {'S': 1, 'C': 2, 'Q': 3}
df['Embarked'] = df.Embarked.apply(lambda port: mapping[port])

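
Here is the fill-then-map step on a made-up series, as a rough sketch (the `ports` data is an assumption). It uses `Series.map` with the dictionary directly instead of `apply` with a lambda.

```python
import numpy as np
import pandas as pd

ports = pd.Series(["S", "C", "S", np.nan, "Q"])  # made-up sample

# mode() ignores NaN and returns a Series; take its first entry.
most_common = ports.mode()[0]
filled = ports.fillna(most_common)

# Map each port letter to its integer code.
mapping = {"S": 1, "C": 2, "Q": 3}
codes = filled.map(mapping)
print(codes.tolist())
```

One difference to be aware of: `map` with a dictionary yields NaN for a value that is not in the mapping, while the lambda with `mapping[port]` raises a `KeyError`.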


### Cross Validation

In [79]:
# Import the linear regression class
from sklearn.linear_model import LinearRegression
# scikit-learn also has a helper that makes it easy to do cross-validation
from sklearn.model_selection import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

# Initialize our algorithm class
alg = LinearRegression()
# Generate cross-validation folds for the Titanic dataset. split() yields the row indices for train and test.
# Without shuffling, the folds are contiguous blocks of rows, so we get the same splits every time we run this.
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(df):
    # The predictors we're using to train the algorithm. Note how we only take the rows in the train folds.
    train_predictors = df[predictors].iloc[train, :]
    # The target we're using to train the algorithm.
    train_target = df["Survived"].iloc[train]
    # Train the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    # We can now make predictions on the test fold.
    test_predictions = alg.predict(df[predictors].iloc[test, :])
    predictions.append(test_predictions)
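
To make the fold mechanics easier to follow, here is a self-contained sketch of the same loop on synthetic data. The arrays `X` and `y` are made up, and the sketch assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data: 9 rows, 2 features, binary target (all made up).
rng = np.random.RandomState(0)
X = rng.rand(9, 2)
y = (X[:, 0] > 0.5).astype(int)

kf = KFold(n_splits=3)  # no shuffling, so each test fold is a contiguous block
fold_predictions = []
for train_idx, test_idx in kf.split(X):
    # Fit on the two training folds, predict on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_predictions.append(model.predict(X[test_idx]))

# Joining the folds in order gives one prediction per row, in row order.
all_predictions = np.concatenate(fold_predictions)
print([len(p) for p in fold_predictions], all_predictions.shape)
```

Because the folds are not shuffled, the concatenated predictions line up with the original row order, so they can be compared against the target column directly.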