# Obtaining ground truth
To complete the code in `1_titanic.ipynb` and assess the algorithms with performance metrics, ground truth survival values are needed for the passengers in the `test.csv` file.

The full Titanic data set was obtained from the <a href="https://hbiostat.org/data">Vanderbilt Biostatistics Datasets</a> webpage. The `titanic3.csv` file was downloaded (it is read below as `full_titanic_data.csv`) and the following code was used to match its survival labels to the passengers in the `test.csv` file from Kaggle.

In [1]:
import numpy as np
import pandas as pd

Read in the data sets.

In [2]:
full_titanic = pd.read_csv('full_titanic_data.csv')
test_titanic = pd.read_csv("test.csv")

Standardize the column names to lowercase.

In [5]:
test_titanic.rename(str.lower, axis='columns', inplace=True)

Take a look at the data.

In [338]:
test_titanic.head(2)

Unnamed: 0,passengerid,survived,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S


In [339]:
full_titanic.head(2)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"


In [340]:
print(full_titanic.shape, test_titanic.shape)

(1309, 14) (418, 12)


Set the indexes to merge on.

In [341]:
test_titanic.set_index(keys='name', inplace=True)
full_titanic.set_index(keys='name', inplace=True)
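
Because the merge key is a passenger name, it can help to check whether any name occurs more than once before merging. The following is a minimal sketch on a made-up toy frame (the rows, ages, and the variable names `toy`, `repeated`, and `repeated_names` are illustrative assumptions, not part of the original analysis):

```python
import pandas as pd

# Toy stand-in for a passenger table; the rows are made up for illustration.
toy = pd.DataFrame(
    {
        "name": ["Kelly, Mr. James", "Doe, Miss. Jane", "Kelly, Mr. James"],
        "age": [34.5, 22.0, 44.0],
    }
).set_index("name")

# Boolean mask: True for every index label that occurs more than once.
repeated = toy.index.duplicated(keep=False)

# Unique names that are repeated in the index.
repeated_names = sorted(set(toy.index[repeated]))
print(repeated_names)
```

Any name listed here would match more than one row in a merge on the index.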

## Merge the data
The following merge matches rows on the index (`name`) of the two data sets. The right join keeps every row of the right data set (i.e., `test_titanic`) and preserves its row order; rows of the left data set (i.e., `full_titanic`) with no match in `test_titanic` are dropped.

In [343]:
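# Join on the `name` index of both frames; without left_index/right_index,
# pd.merge would join on the shared columns instead of the index.
test_ground_truth = pd.merge(full_titanic, test_titanic, how='right',
                             left_index=True, right_index=True)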

Drop the duplicate `passengerid` rows generated by the merge, which arise when a passenger in `test_titanic` matches more than one row in `full_titanic`.

In [344]:
test_ground_truth.drop_duplicates(subset='passengerid', inplace=True)
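
The merge and de-duplication steps above can be sketched on toy data. This is only an illustration, assuming the join is made on the `name` index as described; the frames `full` and `test` and all of their values are made up.

```python
import pandas as pd

# Made-up "full" table: the name "A" appears twice.
full = pd.DataFrame(
    {"name": ["A", "A", "B", "C"], "survived": [0, 1, 1, 0]}
).set_index("name")

# Made-up "test" table: one row per passengerid.
test = pd.DataFrame(
    {"name": ["B", "A"], "passengerid": [892, 893]}
).set_index("name")

# Right join on the index: every row of `test` is kept, "C" is dropped,
# and "A" yields two rows because it matches twice in `full`.
merged = pd.merge(full, test, how="right", left_index=True, right_index=True)

# Keep the first row for each passengerid.
deduped = merged.drop_duplicates(subset="passengerid")

print(merged.shape, deduped.shape)
```

Note that `drop_duplicates` keeps the first of the repeated rows, which is not necessarily the correct match when two different passengers share a name.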

The `test_ground_truth` data was then posted as a submission to Kaggle to confirm that these were the correct labels. A score of 1 was achieved, indicating that the merge correctly identified the ground truth for the test data set. This makes it much easier and faster to calculate performance metrics in the `1_titanic.ipynb` analysis.
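
As a rough sketch of how such labels could be used, the snippet below computes accuracy by aligning predictions with the ground truth on `passengerid`. The frames `truth` and `preds` are hypothetical, and their values are made up for illustration; they are not results from the analysis.

```python
import pandas as pd

# Hypothetical ground truth and model predictions, keyed by passengerid.
truth = pd.DataFrame(
    {"passengerid": [892, 893, 894, 895], "survived": [0, 1, 0, 0]}
)
preds = pd.DataFrame(
    {"passengerid": [892, 893, 894, 895], "survived": [0, 1, 1, 0]}
)

# Align the two tables on passengerid; validate= raises an error if either
# side contains a repeated passengerid.
scored = truth.merge(
    preds,
    on="passengerid",
    suffixes=("_true", "_pred"),
    validate="one_to_one",
)

# Accuracy is the fraction of passengers whose predicted label matches.
accuracy = (scored["survived_true"] == scored["survived_pred"]).mean()
print(f"accuracy = {accuracy:.2f}")
```

Merging on `passengerid` rather than comparing columns positionally avoids errors when the two tables are ordered differently.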