# Kaggle Machine Learning Project
## Project: Titanic - Machine Learning from Disaster 

The task is to predict which passengers survived the sinking of the Titanic. The goal is to use passenger characteristics and location to determine whether a passenger is likely to die or survive. Based on the Kaggle prompt and the expected type of output, this is a supervised classification problem. One potential approach is a neural network classifier. For the sake of exploration, I also hope to use other machine learning techniques, such as random forests and support vector machines, to classify the passengers aboard the Titanic. The benchmark model will be a simple linear regression. The solution is scored by accuracy: the percentage of passengers whose outcome is predicted correctly.
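
As a rough illustration of the accuracy score described above, the sketch below compares predicted labels against true labels. The helper name `accuracy` and the five example labels are hypothetical and are not taken from the Titanic data.

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))


# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_true, y_pred)
print("Accuracy: {:.2f}".format(score))
```

Four of the five hypothetical predictions match, so the score here is 0.80.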

>**Note:** Code and Markdown cells can be executed using the **Shift + Enter** keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd

# Render matplotlib plots inline in the notebook
%matplotlib inline

# Import the training data
train_df = pd.read_csv('Data/train.csv')

# Import the test data
test_df = pd.read_csv('Data/test.csv')

# Success check: report the shape of each dataset
print("Titanic Disaster training dataset has {} rows of data with {} variables each.".format(*train_df.shape))
print("Titanic Disaster testing dataset has {} rows of data with {} variables each.".format(*test_df.shape))

Titanic Disaster training dataset has 891 rows of data with 12 variables each.
Titanic Disaster testing dataset has 418 rows of data with 11 variables each.


In [2]:
# Features: every column except the 'Survived' target
train_x = train_df.drop('Survived', axis = 1)

# Target: keep only the 'Survived' column by dropping every feature column
features = list(train_x)
train_y = train_df.drop(features, axis = 1)

print(train_x)
print(train_y)

     PassengerId  Pclass                                               Name  \
0              1       3                            Braund, Mr. Owen Harris   
1              2       1  Cumings, Mrs. John Bradley (Florence Briggs Th...   
2              3       3                             Heikkinen, Miss. Laina   
3              4       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)   
4              5       3                           Allen, Mr. William Henry   
5              6       3                                   Moran, Mr. James   
6              7       1                            McCarthy, Mr. Timothy J   
7              8       3                     Palsson, Master. Gosta Leonard   
8              9       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   
9             10       2                Nasser, Mrs. Nicholas (Adele Achem)   
10            11       3                    Sandstrom, Miss. Marguerite Rut   
11            12       1                           B

## Data Exploration

The data can be described as follows:

1. survival = Survival
1. pclass = Ticket class
1. sex = Sex
1. Age = Age in years
1. sibsp = Number of siblings / spouses aboard the Titanic
1. parch = Number of parents / children aboard the Titanic
1. ticket = Ticket number
1. fare = Passenger fare
1. cabin = Cabin number
1. embarked = Port of Embarkation

I used the pandas `describe()` function to explore the numeric data. By default, this function skips the columns that contain strings: 'Name', 'Sex', 'Ticket', 'Cabin', and 'Embarked'. 'Sex' looks like a good candidate for encoding 'male' and 'female' as binary values. PassengerId carries little meaning in these statistics, as it is just a unique identifier for each individual. Pclass shows that most people were in class 2 or 3, since the 25th percentile is 2 and the median is 3. Age shows a mean of approximately 30 years, and the minimum (0.42) and maximum (80) indicate that both young children and elderly passengers were on board. The count for Age is lower than for the other features, which indicates missing data that has to be dealt with. SibSp and Parch both have a median of 0, so the majority of passengers traveled without family, either alone or possibly with friends, depending on what the data represents. Fare varies widely: the 25th, 50th, and 75th percentiles (about 7.91, 14.45, and 31.00) are quite different, which may reflect the three ticket classes. It is interesting to note that the mean fare (32.20) lies above the 75th percentile, probably because of a few very large fares, as the maximum of about $512 suggests. Cabin will most likely be removed as a feature because too much of it is missing (687 of 891 values), and it would be hard to draw conclusions when most people's information is unknown.
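
The binary encoding of 'Sex' mentioned above could look roughly like the following. This is only a sketch: the four-row `sample` frame, the `Sex_binary` column name, and the choice of male = 0 and female = 1 are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical mini-sample using the lowercase labels found in the dataset
sample = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# Assumed encoding: male -> 0, female -> 1
sex_codes = {'male': 0, 'female': 1}
sample['Sex_binary'] = sample['Sex'].map(sex_codes)

print(sample)
```

Any label that is not a key in `sex_codes` would become NaN under `map`, which makes unexpected values easy to spot.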

In [3]:
# List the feature names, count missing values per column, and summarize the numeric columns
print(features)
print(train_x.isnull().sum())
train_x.describe()

['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
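
The summary above covers only the numeric columns. One way to also summarize string columns is to pass `include='all'` to `describe()`, sketched below on a small hypothetical frame rather than on the real `train_x`.

```python
import pandas as pd

# Hypothetical mini-sample with one numeric and two string columns
sample = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Sex': ['male', 'female', 'female', 'female'],
    'Embarked': ['S', 'C', 'S', 'S'],
})

# include='all' adds count, unique, top, and freq rows for non-numeric columns
summary = sample.describe(include='all')
print(summary)

# Frequency table for a single categorical column
embarked_counts = sample['Embarked'].value_counts()
print(embarked_counts)
```

For string columns, the `unique`, `top`, and `freq` rows give the number of distinct values, the most common value, and how often it occurs.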


## Preprocessed Data

In [1]:
# Show the dimensions of the Age column
print(train_x['Age'].shape)
# Remove Cabin
#train_x = train_x.drop('Cabin', axis = 1)
# Drop every row that still has a missing value in any column
train_x = train_x.dropna()
print(train_x)

NameError: name 'train_x' is not defined
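
Because Cabin is missing for 687 of the 891 training rows, calling `dropna()` while Cabin is still present would discard most passengers. The sketch below shows one possible alternative on a small hypothetical frame: drop Cabin first, fill missing ages with the median, and drop only rows that lack Embarked. The median imputation is an assumption made for illustration, not a method chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample that mimics the missing-value pattern
sample = pd.DataFrame({
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', 'Q', np.nan],
})

# dropna() on the full frame keeps only rows with no missing value at all
rows_after_plain_dropna = len(sample.dropna())
print(rows_after_plain_dropna)

# Alternative (assumed strategy): remove the sparse column first
cleaned = sample.drop('Cabin', axis=1)

# Fill missing ages with the median age of the known values
cleaned['Age'] = cleaned['Age'].fillna(cleaned['Age'].median())

# Drop only the rows that are missing the port of embarkation
cleaned = cleaned.dropna(subset=['Embarked'])
print(cleaned)
```

On this sample the plain `dropna()` keeps a single row, while the alternative keeps three of the four rows.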

## Benchmark Model