**TITANIC Dataset Exploration - KAGGLE**

Hi, let's explore the Titanic dataset and predict passenger survival using Machine Learning algorithms, comparing the models by their accuracy.

**Datasets Used:**

* train.csv <- Training Dataset
* test.csv  <- Testing Dataset

In [1]:
# Importing required libraries
# Data Analysis
import numpy as np
import pandas as pd

# Machine Learning
from sklearn.feature_selection import SelectKBest, f_classif
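# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; both names now live in sklearn.model_selection.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV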
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import linear_model as lm
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Visualization
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline



In [2]:
# Loading the train and test data
train_df = pd.read_csv('Data/train.csv')
test_df = pd.read_csv('Data/test.csv')

# Analyzing the Dataframe Shape
print("Train dataframe shape: ", train_df.shape)
print("Test dataframe shape: ", test_df.shape)

Train dataframe shape:  (891, 12)
Test dataframe shape:  (418, 11)


In [3]:
# Inspecting the column names
print("Train Column: ", train_df.columns.values)
print("\nTest Column: ", test_df.columns.values)

Train Column:  ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']

Test Column:  ['PassengerId' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare'
 'Cabin' 'Embarked']


In [4]:
# Obtaining the Summary and Statistics of the Train Data
train_df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [5]:
# Obtaining information about the Dataset - train_df
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [6]:
# Obtaining information about the Dataset - test_df
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
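
The `info()` output reports non-null counts, so the number of missing values has to be worked out by subtraction. The counts can also be listed directly. A minimal sketch of that pattern, using a small made-up frame (`toy_df` is hypothetical and is not the Kaggle data):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for train_df / test_df.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# Count missing entries per column, largest first.
missing_counts = toy_df.isnull().sum().sort_values(ascending=False)

# Express the same counts as a fraction of all rows.
missing_share = missing_counts / len(toy_df)

print(missing_counts)
print(missing_share)
```

Applied to `train_df`, the same two lines should show `Cabin` missing in 687 of 891 rows (891 - 204) and `Age` missing in 177 rows (891 - 714), matching the `info()` output.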


In [7]:
# Dropping the 'Cabin' and 'Ticket' columns: 'Cabin' is mostly null (only 204 of 891
# train values are present); 'Ticket' has no nulls but is not used as a feature below.
train_df.drop(['Ticket', 'Cabin'], axis=1, inplace=True)
test_df.drop(['Ticket', 'Cabin'], axis=1, inplace=True)

In [8]:
# Analysing the shape of the dataset after dropping the column
print(train_df.shape)
print(test_df.shape)

(891, 10)
(418, 9)


In [9]:
# Inspecting the column names after the drop
print("Train Column: ", train_df.columns.values)
print("\nTest Column: ", test_df.columns.values)

Train Column:  ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Fare' 'Embarked']

Test Column:  ['PassengerId' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Fare'
 'Embarked']


**Machine Learning Implementation:**

The goal of this project is to measure how accurately different models predict survival, using a validation sample held out from the training data. Many Machine Learning algorithms could be applied to this dataset; I have decided to test six of them, namely:

1. Logistic Regression
2. Gaussian Naive Bayes
3. Decision Tree
4. Random Forest
5. AdaBoost
6. Support Vector Machines

In [10]:
# Selecting the feature columns and the target label
X_train = train_df[["Pclass", "SibSp", "Parch", "Fare"]]
Y_train = train_df["Survived"]
X_test = test_df[["Pclass", "SibSp", "Parch", "Fare"]]
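
Note that `test_df.info()` reports 417 non-null `Fare` values out of 418, so `X_test` carries one missing value, and estimators such as `LogisticRegression` and `SVC` raise an error on NaN input. One possible way to handle this, sketched on hypothetical toy frames, is to fill the gap with the training median. Median imputation is an assumption made here for illustration; the notebook itself does not impute anything.

```python
import numpy as np
import pandas as pd

features = ["Pclass", "SibSp", "Parch", "Fare"]

# Hypothetical stand-ins for train_df and test_df.
toy_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 13.00],
})
toy_test = pd.DataFrame({
    "Pclass": [3, 2],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Fare": [np.nan, 26.00],
})

# Learn the fill value from the training data only, so that no
# information from the test set leaks into preprocessing.
fare_median = toy_train["Fare"].median()

# Fill on a copy, leaving the original frame untouched.
X_test_filled = toy_test[features].copy()
X_test_filled["Fare"] = X_test_filled["Fare"].fillna(fare_median)

print(X_test_filled)
```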

In [11]:
# Splitting the labelled training data into two samples (a single hold-out split,
# not k-fold cross-validation).
# dev_X, dev_y <- Development sample (features, labels) used to fit the models
# val_X, val_y <- Validation sample (features, labels) used to score the models
dev_X, val_X, dev_y, val_y = train_test_split(X_train, Y_train, test_size=0.33, random_state=42)
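
With 891 training rows and `test_size=0.33`, the sizes of the two samples follow from simple arithmetic. The sketch below assumes that scikit-learn rounds the validation share up to a whole number of rows and gives the remaining rows to the development sample; `split_sizes` is a hypothetical helper, not a library function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_dev, n_val) for a fractional test_size.

    Assumes the validation count is rounded up and the remaining
    rows form the development sample.
    """
    n_val = math.ceil(n_samples * test_size)
    n_dev = n_samples - n_val
    return n_dev, n_val

n_dev, n_val = split_sizes(891, 0.33)
print(n_dev, n_val)  # 596 295
```

Under that assumption the validation sample has 295 rows, so each additional correct prediction moves the accuracy by about 0.34 percentage points (1/295). The scores reported in this notebook are consistent with that size: for example, 209/295 is approximately 0.7085.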

In [13]:
### Supervised Algorithm - Machine Learning
# Logistic Regression
clf = lm.LogisticRegression()
clf.fit(dev_X, dev_y)
val_pred = clf.predict(val_X)
print("Logistic Regression")
print("Accuracy Score: ", accuracy_score(val_y, val_pred))

Logistic Regression
Accuracy Score:  0.708474576271
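
For reference, the accuracy reported here is the fraction of validation rows whose predicted label equals the true label. A minimal pure-Python sketch of that definition (the `accuracy` helper and the sample labels are made up for illustration):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))  # 0.75
```

In the training data the mean of `Survived` is 0.383838, so a model that always predicts 0 would be right about 62% of the time on the full training set. The scores in this notebook are best read against that baseline.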


In [14]:
# Gaussian Naive Bayes
g_clf = GaussianNB()
g_clf.fit(dev_X, dev_y)
val_pred = g_clf.predict(val_X)
print("Gaussian Naive Bayes")
print("Accuracy Score: ", accuracy_score(val_y, val_pred))

Gaussian Naive Bayes
Accuracy Score:  0.694915254237


In [15]:
# Decision Tree
d_clf = tree.DecisionTreeClassifier(criterion = 'gini', max_depth = 2, max_leaf_nodes = 5,
                                    min_samples_leaf = 10, min_samples_split = 2)
d_clf.fit(dev_X, dev_y)
val_pred = d_clf.predict(val_X)
print("Decision Tree Classifier")
print("Accuracy Score: ", accuracy_score(val_y, val_pred))

Decision Tree Classifier
Accuracy Score:  0.715254237288
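
The decision tree's hyperparameters are fixed by hand in the cell above. `GridSearchCV`, which is imported at the top of the notebook but never used, could search over them instead. A hedged sketch on synthetic data from `make_classification` (the data and the grid values are arbitrary choices for illustration, not tuned recommendations for the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for (dev_X, dev_y): 300 rows, 4 features.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {
    "max_depth": [2, 3, 4],
    "min_samples_leaf": [5, 10, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    param_grid,
    scoring="accuracy",
    cv=5,  # 5-fold cross-validation on the supplied data
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

The search only sees the data passed to `fit`, so on the real dataset the validation sample could still be kept aside for a final check of the selected tree.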


In [16]:
# Random Forest Classifier Algorithm
r_clf = RandomForestClassifier(n_estimators = 10, n_jobs = 1, random_state = 50)
r_clf.fit(dev_X, dev_y)
val_pred = r_clf.predict(val_X)
print("Random Forest Classifier")
print("Accuracy Score: ", accuracy_score(val_y, val_pred))

Random Forest Classifier
Accuracy Score:  0.654237288136


In [17]:
# AdaBoost Classifier
a_clf = AdaBoostClassifier(n_estimators=50, algorithm = 'SAMME', learning_rate = 0.4)
a_clf.fit(dev_X, dev_y)
val_pred = a_clf.predict(val_X)
print("AdaBoost Classifier")
print("Accuracy Score: ", accuracy_score(val_y, val_pred))

AdaBoost Classifier
Accuracy Score:  0.674576271186


In [18]:
# Support Vector Machines
svc = SVC()
svc.fit(dev_X, dev_y)
val_pred = svc.predict(val_X)
print("Support Vector Machine")
print("Accuracy Score: ", accuracy_score(val_y, val_pred))

Support Vector Machine
Accuracy Score:  0.732203389831


**Observation:**

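Of the six classifiers, the Support Vector Machine had the highest validation accuracy, about 73% (0.732), followed by the Decision Tree (0.715) and Logistic Regression (0.708). The Random Forest scored lowest (0.654).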
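
These figures come from a single hold-out split, and the gaps between the top scores are small. Cross-validation averages over several splits and can make such comparisons less dependent on one particular partition. A minimal sketch on synthetic data (not the Titanic features), assuming 5 folds and mostly default hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for (X_train, Y_train).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
}

results = {}
for name, model in models.items():
    # One accuracy value per fold; report the mean and the spread.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print("%s: %.3f +/- %.3f" % (name, scores.mean(), scores.std()))
```

If the spread across folds is comparable to the gap between two models, the ranking between them should be treated as tentative.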