# "Titanic" Kaggle Exercise

https://www.kaggle.com/c/titanic

## Overview
Predicting survival for Titanic passengers, based on age, social class, gender, etc.

### Method

In this challenge, I am using the Random Forest Model.

* define X (Features: "Pclass", "Sex", "SibSp", "Parch") & y (Target: "Survived")
* simply load the RandomForestClassifier model
* fit model using X & y
* predict y_pred, by passing in X_test to the model


### Random Forests

Random Forests (or Random Decision Forests) are an ensemble ML technique (one that combines multiple models to increase performance) that is mainly used for classification or regression tasks; here we will use it for classification. Random Forests keep much of the simplicity that comes with decision trees, but are usually more accurate and reduce the overfitting that single decision trees are prone to.

The algorithm constructs a multitude of decision trees and outputs either the most popular class across all the trees (classification) or the mean of their predictions (regression). In our Titanic example, each tree outputs a class prediction for Survived (1 or 0), and the most popular prediction across the trees becomes the model's prediction.
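As a rough illustration of how the per-tree outputs are combined, here is a minimal sketch in plain Python. The function names and the list of votes are made up for the example; they are not part of any library.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Classification: return the most common class among the trees' predictions.

    If two classes are tied, Counter returns the one it encountered first.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


def mean_prediction(tree_predictions):
    """Regression: return the average of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical predictions from five trees for one passenger
# (1 = survived, 0 = did not survive)
votes = [1, 0, 1, 1, 0]
predicted_class = majority_vote(votes)
print(predicted_class)
```

Note that scikit-learn's RandomForestClassifier combines trees slightly differently: it averages the class probabilities predicted by each tree rather than counting hard votes.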

For Random Forests to work well, the features need to have real predictive power (not just random garbage!) and the predictions made by individual trees need to have a low correlation with each other. This wisdom of the crowd is what makes Random Forests more effective than a single decision tree.

Bagging and Feature Randomness are used by the Random Forests Algorithm to build individual trees that have low correlation.

#### Decision Trees

The logic behind a decision tree is to look for a feature in the dataset that we can use to split the dataset into sub-branches. We repeat the split on each branch until a decision can be made about the class of the data. At each split in the tree (a node), we aim to maximise the difference between the data in different branches and the similarity between the data in the same branch.

To classify an observation (a test data input), we pass it down the tree: each node asks the question that split the training data the "best", and the answer decides which branch to follow. We do this until we reach a leaf, which gives the class of the input data. Individually, Decision Trees are simple, but often not very accurate at classifying new data.
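To make the idea of the "best" split more concrete, here is a small sketch that scores candidate splits with Gini impurity. These notes do not say which measure is used; Gini impurity is assumed here because it is the default criterion in scikit-learn's tree classifiers. The toy rows, the feature names `is_female` and `first_class`, and the helper functions are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: the size-weighted average of both branches."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


def best_split(rows, labels, feature_names):
    """Return (feature, score) for the binary feature with the lowest impurity."""
    best = None
    for name in feature_names:
        left = [y for row, y in zip(rows, labels) if row[name] == 0]
        right = [y for row, y in zip(rows, labels) if row[name] == 1]
        if not left or not right:
            continue  # this feature does not actually split the data
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (name, score)
    return best


# Hypothetical toy data: two binary features and a Survived label
rows = [
    {"is_female": 1, "first_class": 1},
    {"is_female": 1, "first_class": 0},
    {"is_female": 0, "first_class": 1},
    {"is_female": 0, "first_class": 0},
    {"is_female": 1, "first_class": 0},
    {"is_female": 0, "first_class": 1},
]
labels = [1, 1, 0, 0, 1, 1]

feature, score = best_split(rows, labels, ["is_female", "first_class"])
print(feature, round(score, 3))
```

In this toy data, splitting on `is_female` leaves one branch perfectly pure, so it scores lower (better) than splitting on `first_class`.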

#### Random Forests Algorithm

* We create a decision tree by using a bootstrapped version of our dataset (rows sampled with replacement) and selecting a random subset of the features at each node (from this random subset we choose the "best" [see above] feature to split on).
* From the root node, we continue this random selection process until it is possible to classify an input using this tree.
* We continue to create many more trees using this method of bootstrapping the data and choosing each node's feature from a random subset of the columns/features.
* Predictions are made by passing the input through all of the trees and taking the most popular class across all of the trees as the model's prediction.
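The steps above can be sketched roughly as follows. This is a simplified illustration, not how scikit-learn implements RandomForestClassifier internally: it bootstraps the rows by hand, lets scikit-learn's DecisionTreeClassifier handle the random feature subset at each node via `max_features="sqrt"`, and takes a majority vote for 0/1 labels. The helper names `fit_forest` and `predict_forest` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees decision trees, each on its own bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bagging: draw n_samples row indices with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        # Feature randomness: consider only sqrt(n_features) features per split
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Majority vote across the trees (assumes binary labels 0 and 1)."""
    all_preds = np.array([tree.predict(X) for tree in trees])  # (n_trees, n_samples)
    return (all_preds.mean(axis=0) >= 0.5).astype(int)


# Synthetic data: four binary features, label equal to the first feature
data_rng = np.random.default_rng(42)
X_toy = data_rng.integers(0, 2, size=(200, 4))
y_toy = X_toy[:, 0]

forest = fit_forest(X_toy, y_toy)
preds = predict_forest(forest, X_toy)
print("training accuracy:", (preds == y_toy).mean())
```

An odd number of trees is used so that a binary vote can never tie.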



### pandas.get_dummies

Dummy Variables are a way to turn categorical data into numerical data using One-Hot Encoding. This lets you use categorical fields, which could not previously be compared numerically, alongside numerical data.

#### Ex.

Color:\
Red\
Green\
Blue

Becomes

R: G: B:\
1 0 0\
0 1 0\
0 0 1

Where R, G, B are the 'dummy' variables

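get_dummies(df['']) is used to create a new dataframe that contains only the dummy columns\
get_dummies(df, columns=['']) is used to merge the dummies automatically with the remaining columns of the original df (columns must be list-like, not a single string)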


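Be aware of the Dummy Variable Trap:
The dummy columns for one category always sum to 1, so any one of them can be derived from the remaining (n-1). In models with an intercept, such as linear or logistic regression, this perfect multicollinearity means the coefficients cannot be determined uniquely. Dropping one dummy var from the df (for example with get_dummies(..., drop_first=True)) avoids this.\
Tree-based models such as Random Forests are not affected by it, so both Sex columns can be kept here.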
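The two call forms, plus dropping one dummy column, can be tried on the Color example. This is a small sketch: the `Id` column is a made-up extra column, added only to show what happens to non-categorical fields. Recent pandas versions return boolean dummy columns, so the result is cast to int for printing.

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2, 3], "Color": ["Red", "Green", "Blue"]})

# Form 1: pass a single column, get back a new DataFrame of dummies only
dummies = pd.get_dummies(df["Color"])

# Form 2: pass the whole df and name the columns to encode;
# the other columns are kept and the dummies are prefixed with "Color_"
merged = pd.get_dummies(df, columns=["Color"])

# drop_first=True drops the first level (alphabetically "Blue" here),
# leaving n-1 dummy columns
reduced = pd.get_dummies(df, columns=["Color"], drop_first=True)

print(dummies.astype(int))
print(list(merged.columns))
print(list(reduced.columns))
```

The dummy columns come out in alphabetical order of the category names (Blue, Green, Red), not in the order the values appear in the data.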




### Load the data

In [2]:
import pandas as pd
train_set = pd.read_csv("~/Documents/Kaggle/titanic/train.csv")
test_set = pd.read_csv("~/Documents/Kaggle/titanic/test.csv")

print("\n\nTrain Set")
print(train_set.head())
print(train_set.columns)
print("\n\n\n\nTest Set")
print(test_set.head())
print(test_set.columns)


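# Do we need to normalize the data?
# (Not for tree-based models such as Random Forests: splits depend on the
# ordering of feature values, not on their scale.)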



Train Set
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN  

In [16]:
# What is the survival rate for men/women?
# (The values printed are fractions between 0 and 1, not percentages.)

women = train_set.loc[train_set.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

men = train_set.loc[train_set.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

% of women who survived: 0.7420382165605095
% of men who survived: 0.18890814558058924


### Create Model using RandomForestClassifier

In [31]:
from sklearn.ensemble import RandomForestClassifier

y = train_set["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]

X = pd.get_dummies(train_set[features])
X_test = pd.get_dummies(test_set[features])

print(X) # Categorical data has been One-Hot Encoded

     Pclass  SibSp  Parch  Sex_female  Sex_male
0         3      1      0           0         1
1         1      1      0           1         0
2         3      0      0           1         0
3         1      1      0           1         0
4         3      0      0           0         1
..      ...    ...    ...         ...       ...
886       2      0      0           0         1
887       1      0      0           1         0
888       3      1      2           1         0
889       1      0      0           0         1
890       3      0      0           0         1

[891 rows x 5 columns]
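Encoding the train and test sets separately works here because both contain the same Sex categories. If a category were missing from one set, the two frames would end up with different columns and the model would reject the test data. One possible guard, sketched below on made-up data, is to reindex the test frame to the training columns. The toy frames are hypothetical, and `Embarked` is used only for illustration since it is not one of the features chosen above.

```python
import pandas as pd

# Hypothetical toy frames: the test set only contains one Embarked value
train = pd.DataFrame({"Pclass": [3, 1, 2], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Pclass": [1, 3], "Embarked": ["S", "S"]})

X_small = pd.get_dummies(train)
X_test_small = pd.get_dummies(test)  # lacks Embarked_C and Embarked_Q

# Give the test frame exactly the training columns, in the same order,
# filling any dummy column it was missing with 0
X_test_small = X_test_small.reindex(columns=X_small.columns, fill_value=0)

print(list(X_test_small.columns))
```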


In [28]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
y_pred = model.predict(X_test)

In [29]:
output = pd.DataFrame({'PassengerId': test_set.PassengerId, 'Survived': y_pred})
print(output)

     PassengerId  Survived
0            892         0
1            893         1
2            894         0
3            895         0
4            896         1
..           ...       ...
413         1305         0
414         1306         1
415         1307         0
416         1308         0
417         1309         0

[418 rows x 2 columns]
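To submit these predictions to Kaggle, the DataFrame can be written to a CSV file with a PassengerId and a Survived column and no index column. The sketch below uses the first three rows of the output above as a stand-in for the full frame; the file name `submission.csv` and the temp directory are arbitrary choices for the example.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the full `output` DataFrame built above
output_small = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# index=False keeps the row index out of the file
path = os.path.join(tempfile.gettempdir(), "submission.csv")
output_small.to_csv(path, index=False)

# Read the file back to check what was written
check = pd.read_csv(path)
print(check)
```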


In [26]:
# Done