### 1. The Competition
We'll be learning how to generate a submission for a Kaggle competition. Kaggle is a site where you create algorithms, and compete against machine learning practitioners around the world. Your algorithm wins if it's the most accurate on a given dataset. Kaggle is a fun way to practice your machine learning skills.

Kaggle has several different competitions on their site. One of them is about predicting which passengers survived the sinking of the Titanic. In this and the next mission, we'll be learning how to make our first submission to the competition.

Our data is in .csv format. You can get started with the competition and download the data here.

Each row represents a passenger on the Titanic, and some information about them. Let's take a look at the columns:

- PassengerId -- A numerical id assigned to each passenger.
- Survived -- Whether the passenger survived (1), or didn't (0). We'll be making predictions for this column.
- Pclass -- The class the passenger was in -- first class (1), second class (2), or third class (3).
- Name -- The name of the passenger.
- Sex -- The gender of the passenger -- male or female.
- Age -- The age of the passenger. Fractional.
- SibSp -- The number of siblings and spouses the passenger had on board.
- Parch -- The number of parents and children the passenger had on board.
- Ticket -- The ticket number of the passenger.
- Fare -- How much the passenger paid for the ticket.
- Cabin -- Which cabin the passenger was in.
- Embarked -- Where the passenger boarded the Titanic.

A good first step is to think logically about the columns and what we're trying to predict. What variables might logically affect the Survived outcome? (Reading more about the Titanic might help here.)

We know that women and children were more likely to survive. Thus, Age and Sex are probably good predictors. It's also logical to think that passenger class might affect the outcome, as first class cabins were closer to the deck of the ship. Fare is tied to passenger class, and will probably be highly correlated with it, but might add some additional information. Number of siblings and parents/children will probably be correlated with survival one way or the other, as either there are more people to help you, or more people to think about and try to save.
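One way to check intuitions like these is to compare survival rates across groups. The sketch below uses a tiny made-up table (not the real passenger data), so the numbers are purely illustrative; the same `groupby` call could be applied to the real training data once it is loaded.

```python
import pandas

# Toy data invented for illustration only; these are not real passenger records.
toy = pandas.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Survived is 0 or 1, so its mean within a group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

A large gap between groups suggests the column is worth keeping as a predictor.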

There's a less clear link between survival and columns like Embarked (maybe there is some information about how close to the top of the ship people's cabins were here), Ticket, and Name.

This step is generally known as acquiring domain knowledge, and it is fairly important to most machine learning tasks. We're looking to engineer the features so that we maximize the information we have about what we're trying to predict.

### 2. Looking at the data
We'll be using Python 3, the pandas library, and scikit-learn to analyze our data and create a submission.

In [33]:
import pandas
# We can use the pandas library in python to read in the csv file.
# This creates a pandas dataframe and assigns it to the titanic variable.
titanic = pandas.read_csv("../../data/train.csv")

# Print summary statistics for the numeric columns of the dataframe.
print(titanic.describe())

       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    0.836071   14.526497    1.102743   
min       1.000000    0.000000    1.000000    0.420000    0.000000   
25%     223.500000    0.000000    2.000000         NaN    0.000000   
50%     446.000000    0.000000    3.000000         NaN    0.000000   
75%     668.500000    1.000000    3.000000         NaN    1.000000   
max     891.000000    1.000000    3.000000   80.000000    8.000000   

            Parch        Fare  
count  891.000000  891.000000  
mean     0.381594   32.204208  
std      0.806057   49.693429  
min      0.000000    0.000000  
25%      0.000000    7.910400  
50%      0.000000   14.454200  
75%      0.000000   31.000000  
max      6.000000  512.329200  


### 3. Missing data
When you used .describe() on the titanic dataframe in the last screen, you might have noticed that the Age column has a count of 714 when all the other columns have a count of 891. This indicates that there are missing values in the Age column, because the count only includes values that are not missing (null, NA, or not a number).

This means that the data isn't perfectly clean, and we're going to have to clean it ourselves. We don't want to have to remove the rows with missing values, because more data helps us train a better algorithm. We also don't want to get rid of the whole column, as age is probably fairly important to our analysis.

There are many strategies for cleaning up missing data, but a simple one is to just fill in all the missing values with the median of all the values in the column.

We can then use the .fillna method on the series to replace any missing values. .fillna takes one argument, the value to replace the missing values with.

In [34]:
# Fill the missing ages with the median age of the column.
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
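To see what this does on a small scale, here is a minimal sketch with a toy series (values invented for illustration). It relies on the fact that `median()` skips missing values by default.

```python
import pandas

# A toy Age column with one missing value (values invented for illustration).
ages = pandas.Series([22.0, 38.0, None, 35.0])

# median() ignores the missing entry: the median of 22, 35 and 38 is 35.
filled = ages.fillna(ages.median())
print(filled.tolist())  # [22.0, 38.0, 35.0, 35.0]
```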

### 4. Non-numeric columns

We have to either exclude our non-numeric columns when we train our algorithm (Name, Sex, Cabin, Embarked, and Ticket), or find a way to convert them to numeric columns.

We'll ignore the Ticket, Cabin, and Name columns. There isn't much information we can extract from them. Most of the values in the Cabin column are missing (only 204 of the 891 rows have a value), and it likely isn't a particularly informative column in the first place. The Ticket and Name columns are unlikely to tell us much without some domain knowledge about what the ticket numbers mean, and about which names correlate with characteristics like large or rich families.

### 5. Converting the Sex Column

The Sex column is non-numeric, but we want to keep it around -- it could be very informative. We can convert it to a numeric column by replacing each gender with a numeric code. A machine learning algorithm will then be able to use these categories to make predictions.

To do this, we first have to find all the unique genders in the column (we know male and female are there, but did whoever recorded the dataset use another code for missing values?). We'll then assign a code of 0 to male, and a code of 1 to female.

In [35]:
# Find all the unique genders -- the column appears to contain only male and female.
print(titanic["Sex"].unique())

# Replace all the occurrences of male with the number 0.
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
# Replace all the occurrences of female with the number 1.
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

['male' 'female']
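As an alternative sketch (not the approach used in the cell above), the same recoding can be written with `Series.map` and a dictionary. The toy column here is invented for illustration.

```python
import pandas

# Toy column invented for illustration.
sex = pandas.Series(["male", "female", "female", "male"])

# map() looks each value up in the dictionary. Values that are not keys
# become NaN, which makes unexpected labels easy to spot.
sex_codes = sex.map({"male": 0, "female": 1})
print(sex_codes.tolist())  # [0, 1, 1, 0]
```

One practical difference is that `map` returns a new integer series, whereas assigning through `.loc` modifies the existing column in place.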


### 6. Converting The Embarked Column

We can now convert the Embarked column to numeric form as well. The unique values in Embarked are S, C, Q, and missing (nan). Each letter is an abbreviation of an embarkation port name. S is the most common port, so we'll fill the missing values with it.

Rather than assigning a single code to each port, we can dummy code the Embarked column, which gives us one numeric 0/1 column for each label.

In [36]:
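# Find all the unique values for "Embarked".
print(titanic["Embarked"].unique())
# Fill the missing values with "S".
titanic["Embarked"] = titanic["Embarked"].fillna('S')

# Create one indicator column per label (Embarked_C, Embarked_Q, Embarked_S, Embarked_nan)
# and append them to the dataframe.
embarked_dummy = pandas.get_dummies(titanic["Embarked"], dummy_na=True, prefix="Embarked")
titanic = pandas.concat([titanic, embarked_dummy], axis=1)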

['S' 'C' 'Q' nan]
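To make the result of dummy coding concrete, here is a small sketch on a toy column (values invented for illustration). It only inspects the column names that `get_dummies` produces.

```python
import pandas

# Toy column invented for illustration.
ports = pandas.Series(["S", "C", "Q", "S"])

# Each distinct label becomes its own indicator column; dummy_na=True adds
# one more column that flags missing values.
dummies = pandas.get_dummies(ports, dummy_na=True, prefix="Embarked")
print(list(dummies.columns))
```

Each row has exactly one indicator set, so any one of the columns can be inferred from the others.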


### 7. Machine Learning

We can use the excellent scikit-learn library to make predictions. We'll use a helper from sklearn to split the data up into cross validation folds, and then train an algorithm for each fold, and make predictions. At the end, we'll have a list of predictions, with each list item containing predictions for the corresponding fold.

In [41]:
# Import the linear regression class
from sklearn.linear_model import LinearRegression

# Sklearn also has a helper that makes it easy to do cross validation
from sklearn.model_selection import KFold

# The columns we'll use to predict the target
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked_Q", 
              "Embarked_S", "Embarked_nan"]

# Initialize our algorithm class
alg = LinearRegression()

# Generate cross validation folds for the titanic dataset.  split() returns the row indices corresponding to train and test.
# Without shuffling, the folds are consecutive blocks of rows, so we get the same splits every time we run this.
kf = KFold(n_splits=3)

predictions = []
for train, test in kf.split(titanic):
    # The predictors we're using to train the algorithm.
    # Note how we only take the rows in the train folds.
    train_predictors = (titanic[predictors].iloc[train,:])
    
    # The target we're using to train the algorithm.
    train_target = titanic["Survived"].iloc[train]
    
    # Training the algorithm using the predictors and target.
    alg.fit(train_predictors, train_target)
    
    # We can now make predictions on the test fold
    test_predictions = alg.predict(titanic[predictors].iloc[test,:])
    predictions.append(test_predictions)
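To make the fold mechanics concrete, here is a minimal sketch in plain NumPy of how row indices can be divided into three consecutive folds. It illustrates the idea on six rows and is not scikit-learn's implementation.

```python
import numpy as np

n_rows = 6    # toy size chosen for illustration
n_folds = 3

# Split the row indices into consecutive, roughly equal chunks (no shuffling).
folds = np.array_split(np.arange(n_rows), n_folds)

splits = []
for i, test_idx in enumerate(folds):
    # Every fold except the current one is used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Every row lands in exactly one test fold, so concatenating the per-fold predictions yields one prediction per row, each made by a model that never saw that row.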

### 8. Evaluating Error
Now that we have predictions, we can evaluate our error.

We'll first need to define an error metric, so we can figure out how accurate our model is. From the Kaggle competition description, the error metric is the percentage of correct predictions. We'll use this same metric to evaluate our performance locally.

The metric will basically involve finding the number of values in predictions that are the exact same as their counterparts in titanic["Survived"], and then dividing by the total number of passengers.

Before we can do this, we need to combine the 3 sets of predictions into one column. Since each set of predictions is a NumPy array (NumPy is a Python scientific computing library), we can use a NumPy function to concatenate them into one.

In [42]:
import numpy as np

# The predictions are in three separate numpy arrays.  Concatenate them into one.  
# We concatenate them on axis 0, as they only have one axis.
predictions = np.concatenate(predictions, axis=0)

# Map predictions to outcomes (only possible outcomes are 1 and 0)
predictions[predictions > .5] = 1
predictions[predictions <=.5] = 0
accuracy = sum(predictions == titanic["Survived"]) / (1.0 * len(predictions))
print(accuracy)

0.787878787879
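The thresholding and accuracy steps can be sketched on a handful of toy values (invented for illustration). This version uses a boolean comparison and its mean, which is equivalent to counting matches and dividing by the number of rows.

```python
import numpy as np

# Toy values invented for illustration.
raw_predictions = np.array([0.8, 0.3, 0.55, 0.1])
actual = np.array([1, 0, 0, 0])

# Threshold at 0.5 to turn continuous outputs into 0/1 labels.
labels = (raw_predictions > 0.5).astype(int)

# Comparing gives True/False per row; the mean of booleans is the fraction correct.
accuracy = (labels == actual).mean()
print(accuracy)  # 0.75
```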


### 9. Logistic Regression

One good way to think of logistic regression is that it takes the output of a linear regression, and maps it to a probability value between 0 and 1. The mapping is done using the logistic (sigmoid) function, which is the inverse of the logit function. Passing any value through the logistic function will map it to a value between 0 and 1 by "squeezing" the extreme values. This is perfect for us, because we only care about two outcomes.
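The squeezing function is commonly written as 1 / (1 + e^(-z)). The short sketch below evaluates it at a few inputs to show how large negative and positive values are pushed towards 0 and 1; the function name `logistic` is just an illustrative choice.

```python
import math

def logistic(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# An input of 0 maps to exactly 0.5; inputs far from 0 approach 0 or 1.
for z in (-5, 0, 5):
    print(z, round(logistic(z), 4))
```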

Sklearn has a class for logistic regression that we can use. We'll also make things easier by using an sklearn helper function to do all of our cross validation and evaluation for us.


In [45]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Initialize our algorithm
alg = LogisticRegression(random_state=1)
# Compute the accuracy score for all the cross validation folds.  (much simpler than what we did before!)
scores = cross_val_score(
    alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (because we have one for each fold)
print(scores.mean())

0.789001122334


### 10. Processing the test set
Our accuracy is decent, but not great. We can still try a few things to make it better, which we'll talk about in the next mission.

First, though, we need to make a submission to the competition. To do this, we need to take the exact same steps on the test data that we took on the training data. If we don't do the exact same operations, then we won't be able to make valid predictions on it.

These operations are all the changes we made to the columns before. The test set also has missing values in the Fare column, which we'll fill with the median.

In [47]:
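titanic_test = pandas.read_csv("../../data/test.csv")

# Fill missing ages with the median age from the training set.
titanic_test['Age'] = titanic_test['Age'].fillna(titanic['Age'].median())
# Fill missing fares with the median fare of the test set.
titanic_test['Fare'] = titanic_test['Fare'].fillna(titanic_test['Fare'].median())

# Dummy code Embarked so the test set has the same Embarked_* columns as the training set.
embarked_test_dummy = pandas.get_dummies(titanic_test["Embarked"], dummy_na=True, prefix="Embarked")
titanic_test = pandas.concat([titanic_test, embarked_test_dummy], axis=1)

# Replace male with 0 and female with 1, as we did for the training set.
titanic_test.loc[titanic_test['Sex'] == 'male', 'Sex'] = 0
titanic_test.loc[titanic_test['Sex'] == 'female', 'Sex'] = 1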
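One way to guarantee that both datasets go through identical steps is to put the steps in a single function. The sketch below is a simplified illustration on toy frames: `preprocess` is a hypothetical helper, not part of the original walkthrough, and it leaves out the Fare and missing-Embarked handling for brevity.

```python
import pandas

def preprocess(df, age_median, dummy_columns):
    """Hypothetical helper: apply the same cleaning steps to any split of the data."""
    df = df.copy()
    df["Age"] = df["Age"].fillna(age_median)
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    dummies = pandas.get_dummies(df["Embarked"], prefix="Embarked")
    # reindex guarantees the same indicator columns in the same order,
    # filling with 0 any port that does not appear in this split.
    dummies = dummies.reindex(columns=dummy_columns, fill_value=0)
    return pandas.concat([df, dummies], axis=1)

# Toy frames invented for illustration.
train = pandas.DataFrame({"Age": [20.0, None, 40.0],
                          "Sex": ["male", "female", "male"],
                          "Embarked": ["S", "C", "Q"]})
test = pandas.DataFrame({"Age": [None, 50.0],
                         "Sex": ["female", "male"],
                         "Embarked": ["S", "S"]})

age_median = train["Age"].median()  # computed on the training data only
columns = ["Embarked_C", "Embarked_Q", "Embarked_S"]
train_clean = preprocess(train, age_median, columns)
test_clean = preprocess(test, age_median, columns)
print(test_clean)
```

Passing the expected column list explicitly means the model sees the same predictors in both datasets, even when a category is absent from one of them.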

### 11. Generating A Submission File
Now we have everything we need to generate a submission for the competition!

First, we have to train an algorithm on the training data. Then, we make predictions on the test set. Finally, we'll generate a csv file with the predictions and passenger ids.

In [49]:
# Initialize the algorithm class
log_regressor = LogisticRegression(random_state=1)

# Train the algorithm using all the training data
log_regressor.fit(titanic[predictors], titanic["Survived"])

# Make predictions using the test set.
predictions = log_regressor.predict(titanic_test[predictors])

# Create a new dataframe with only the columns Kaggle wants from the dataset.
submission = pandas.DataFrame({
        "PassengerId": titanic_test["PassengerId"],
        "Survived": predictions
    })
submission.to_csv("../../submissions/others/dataquest1.csv", index=False)
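Before uploading, it can be worth a quick sanity check that the file contains exactly the two expected columns and one row per passenger. The sketch below does this on a toy submission (ids and predictions invented for illustration) without writing anything to disk.

```python
import io
import pandas

# Toy ids and predictions invented for illustration.
toy_submission = pandas.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = toy_submission.to_csv(index=False)
print(csv_text)

# Read the text back to confirm the header and row count survive the round trip.
check = pandas.read_csv(io.StringIO(csv_text))
print(list(check.columns), len(check))
```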