# Introduction
This is my first Kaggle competition, in which we analyze data provided by Kaggle to determine which individuals on the Titanic were more likely to have survived based on a set of the most significant features.

My process for analyzing the data is as follows: 
1. Data Preprocessing
    1. Importing the libraries
    2. Importing the dataset
    3. Taking care of missing data
        1. Age
        2. Cabin Number
        3. Location from which Individual Embarked
    4. Encoding Categorical Data
    5. Splitting the training set into 80% training data, 20% testing
    6. Feature Scaling
2. Individual Data Relationship Analysis
3. Logistic Regression
4. Random Forest
5. Model Assessment 
6. Conclusion

Many thanks to Kirill Eremenko and Hadelin de Ponteves for their amazing course Machine Learning A-Z™: Hands-On Python & R In Data Science, available on udemy.com. I mostly used what I learned there to complete my first Kaggle kernel!
Thanks also to SaraG on Kaggle for her awesome kernel, which really helped me figure out how to do this for the first time. :) Check it out here: https://www.kaggle.com/sgus1318/titanic-analysis-learning-to-swim-with-python.

# 1. Data Pre-Processing

## 1.A. Importing the libraries

In [None]:
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re

import seaborn as sns
sns.set_style('whitegrid')
%matplotlib inline


## 1.B. Importing the data

In [None]:
# importing the training dataset
titanic_train_df = pd.read_csv('../input/train.csv')

#X = titanic_train_df.iloc[:, 2:12].values
#y = titanic_train_df.iloc[:, 1].values

# importing the test dataset (test-data for the challenge)
titanic_test_df = pd.read_csv('../input/test.csv')


In [None]:
# snapshot of top 5 rows of training dataframe
titanic_train_df.head()

In [None]:
# snapshot of top 5 rows of testing dataframe
titanic_test_df.head()

# Note: no survival data because this is what we are predicting for the challenge.

## 1.C. Taking care of missing data

#### Age, Cabin, and Embarked (the port of embarkation) all have missing values, which we need to handle before moving on to the variable analysis. For each of these variables, we first find how many values are missing out of the total number of passengers in the training set, and use that to decide how to fill them in. The usual options are to ignore the missing values, remove them, fill them in (using the mean, median, mode, back fill, or forward fill), or replace them with a static value.
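
To make these options concrete, here is a minimal sketch of each one on a small made-up column. The values and variable names are illustrative only and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy column with gaps (made-up values, not Titanic data)
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0], name="Age")

dropped = ages.dropna()                      # remove the missing entries
mean_filled = ages.fillna(ages.mean())       # fill with the mean of the known values
median_filled = ages.fillna(ages.median())   # fill with the median of the known values
forward_filled = ages.ffill()                # carry the previous value forward
backward_filled = ages.bfill()               # pull the next value backward
static_filled = ages.fillna(-1)              # replace with a fixed placeholder value

print(pd.DataFrame({
    "original": ages,
    "mean": mean_filled,
    "median": median_filled,
    "ffill": forward_filled,
    "bfill": backward_filled,
    "static": static_filled,
}))
```

Which option is appropriate depends on how much of the column is missing and on what the variable means, which is why we check the missing counts first.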

In [None]:
# see how many rows (passengers) are in the training dataset
len(titanic_train_df['PassengerId'])

In [None]:
# see how many values are missing in training dataset
titanic_train_df.isnull().sum()

### 1.C.a Age

In [None]:
# Number of null values in Age column
sum(pd.isnull(titanic_train_df['Age']))

In [None]:
# Proportion of the Age column that is missing
sum(pd.isnull(titanic_train_df['Age'])) / len(titanic_train_df)  # 891 individuals in the training set, as found above

#### Only about 20% of the Age values are null; therefore, we keep Age.
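
As a side note, the missing proportion can be computed for every column at once instead of one column at a time. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in, not the real training data) and relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() marks each cell True/False; averaging gives the missing fraction per column
missing_fraction = toy_df.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```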

In [None]:
# Attempt to use Imputer from sklearn, but couldn't get it to work this time..
#from sklearn.preprocessing import Imputer
#imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 1)
#q = imputer.fit_transform(titanic_train_df(["Age"]).T
#titanic_train_df["Age"] = q

In [None]:
# Fill in missing Age values with the mean age
# (assigning the result back avoids relying on inplace=True on a column selection)
titanic_train_df["Age"] = titanic_train_df["Age"].fillna(titanic_train_df["Age"].mean())

# check out the first 100 data points
titanic_train_df.head(100)

In [None]:
# Do the same for the Age column in the test set, reusing the training-set mean
titanic_test_df["Age"] = titanic_test_df["Age"].fillna(titanic_train_df["Age"].mean())

In [None]:
#def findWholeWord(w):
#   return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

#name_string = str(titanic_train_df['Name'])
#while findWholeWord('Miss')(name_string) == True:
#    titanic_train_df['Title'] = 'Miss'
#titanic_train_df.head()

In [None]:
#What is each person's title? 
#titanic_train_df['Title'] = titanic_train_df['Name'].map(lambda x: re.compile(", (.*?).").findall(x)[0])
 
# Group low-occurring, related titles together
#titanic_train_df['Title'][titanic_train_df.Title == 'Jonkheer'] = 'Master'
#titanic_train_df['Title'][titanic_train_df.Title.isin(['Ms','Mlle'])] = 'Miss'
#titanic_train_df['Title'][titanic_train_df.Title == 'Mme'] = 'Mrs'
#titanic_train_df['Title'][titanic_train_df.Title.isin(['Capt', 'Don', 'Major', 'Col', 'Sir'])] = 'Sir'
#titanic_train_df['Title'][titanic_train_df.Title.isin(['Dona', 'Lady', 'the Countess'])] = 'Lady'

#titanic_train_df['Title'] = 'Master'

# Build binary features
#titanic_train_df = pd.concat([titanic_train_df, pd.get_dummies(titanic_train_df['Title']).rename(columns=lambda x: 'Title_' + str(x))], axis=1)
#master_number = 0
#if titanic_train_df['Title'] == 'Master':
        #master_number += 1
        #titanic_train_df['Age'] 
#titanic_train_df.head(10)
#print(master_number)

### 1.C.b Cabin
For the variable Cabin, we start by counting the missing values.

In [None]:
sum(pd.isnull(titanic_train_df['Cabin']))

In [None]:
sum(pd.isnull(titanic_train_df['Cabin']))/891

77% of the values for Cabin are missing. Therefore, we will not consider this variable.
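
The same decision can be expressed as a rule: drop any column whose missing fraction exceeds a cutoff. The sketch below is only an illustration; the 0.75 cutoff, the `MAX_MISSING` name, and the toy frames are assumptions of mine, not part of the analysis above.

```python
import numpy as np
import pandas as pd

MAX_MISSING = 0.75  # hypothetical cutoff, chosen to mirror the reasoning above

# Toy train/test frames (made-up values)
toy_train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})
toy_test = pd.DataFrame({
    "Age": [34.5, 47.0],
    "Cabin": [np.nan, np.nan],
    "Fare": [7.83, 7.00],
})

# Decide which columns to drop using the training data only
missing_fraction = toy_train.isnull().mean()
too_sparse = missing_fraction[missing_fraction > MAX_MISSING].index

# Apply the same decision to both sets so their columns stay aligned
reduced_train = toy_train.drop(columns=too_sparse)
reduced_test = toy_test.drop(columns=too_sparse)
print(list(reduced_train.columns), list(reduced_test.columns))
```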

In [None]:
# Drop Cabin column from the data as over 75% of data is missing from training set
titanic_train_df.drop("Cabin",axis=1,inplace=True)

#Do the same with test set
titanic_test_df.drop("Cabin",axis=1,inplace=True)

# Verify both have been removed (print both, since a cell only displays its last expression)
print(titanic_train_df.head())
print(titanic_test_df.head())

### 1.C.c Embarkment Location

Like the other variables, we first find the number of missing embarkment location values.

In [None]:
sum(pd.isnull(titanic_train_df['Embarked']))

In [None]:
# Find proportion of missing Embarked values
sum(pd.isnull(titanic_train_df['Embarked']))/891

The proportion of missing embarked values is less than 1%, so this variable is worth keeping. Let's look further at the feature itself.

In [None]:
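embark_order = ['S', 'C', 'Q']  # fix the bar order so the tick labels below line up with the bars
sns.countplot(x='Embarked', data=titanic_train_df, order=embark_order, palette='GnBu_d')
plt.xlabel('Embarkment Location')
plt.ylabel('Number of People')
plt.xticks(np.arange(3),
           ('Southampton (S)', 'Cherbourg (C)', 'Queenstown (Q)'))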
plt.show()
titanic_train_df['Embarked'].value_counts()
#titanic_train_df.Embarked.hist(alpha=.75,bins=5, color='mediumturquoise')

From our analysis, the overwhelming majority of people on board the Titanic embarked at Southampton. Because only two Embarkment location values are missing, we will replace them with the majority location in both the training and test sets.
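
Rather than hard-coding the majority value, it could also be looked up from the data. Here is a minimal sketch on a made-up column (the values are illustrative, not the real Embarked counts):

```python
import numpy as np
import pandas as pd

# Toy Embarked column with two gaps (made-up values)
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan], name="Embarked")

# mode() ignores missing values and returns a Series; take its first entry
most_common = embarked.mode().iloc[0]

# Fill the gaps with the most frequent category
embarked_filled = embarked.fillna(most_common)
print(most_common)
print(embarked_filled.value_counts())
```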

In [None]:
# Replace NaN values with 'S'
titanic_train_df['Embarked'] = titanic_train_df['Embarked'].fillna('S')
titanic_test_df['Embarked'] = titanic_test_df['Embarked'].fillna('S')

The last thing we do in the Missing Data section is check both the training and test sets to make sure no null values remain.
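
One possible way to make this check easier to read is to report only the columns that still contain nulls. The helper `columns_with_nulls` and the toy frames below are hypothetical and shown only as a sketch.

```python
import numpy as np
import pandas as pd

def columns_with_nulls(df):
    """Return null counts, restricted to the columns that still contain nulls."""
    null_counts = df.isnull().sum()
    return null_counts[null_counts > 0]

# Toy frames (made-up values); the test frame has one gap left in Fare
toy_train = pd.DataFrame({"Age": [22.0, 30.0], "Embarked": ["S", "C"]})
toy_test = pd.DataFrame({"Age": [41.0, 19.0], "Fare": [7.9, np.nan]})

for name, frame in [("train", toy_train), ("test", toy_test)]:
    remaining = columns_with_nulls(frame)
    print(name, "->", "no nulls" if remaining.empty else remaining.to_dict())
```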

In [None]:
# print both results, since a cell only displays its last expression
print(titanic_train_df.isnull().sum())
print("-------------------------")
print(titanic_test_df.isnull().sum())

There is one missing Fare value in our test data. Since it is only one, we will replace it with the median of the other Fare values.

In [None]:
titanic_test_df["Fare"] = titanic_test_df["Fare"].fillna(titanic_test_df["Fare"].median())

In [None]:
# print both results, since a cell only displays its last expression
print(titanic_train_df.isnull().sum())
print("------------------------")
print(titanic_test_df.isnull().sum())

Now we are ready to encode the categorical data, and get analyzing!

## 1.D. Encoding Categorical Data
