## Titanic Survival Prediction (Beginner)

I have only just started learning data science, and this is my first time taking part in a Kaggle practice competition.

This notebook was created purely for my own practice, so it may contain unrelated content and notes in different languages.

This practice was inspired by https://www.kaggle.com/nadintamer/titanic-survival-predictions-beginner

Thanks to Nadin Tamer

## Contents
    1. Import library
    2. Read in and explore data
    3. Data analysis
    4. Data visualisation
    5. Cleaning data
    6. Choosing the best model
    7. Creating submission file

## 1. Import library

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

The output shows that the input folder currently contains three CSV files: 'train.csv', 'gender_submission.csv' and 'test.csv'.

The libraries loaded so far are NumPy and pandas.

matplotlib and seaborn are also needed later for visualisation.

In [None]:
#visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


## 2. Read in and explore data

In [None]:
#read csv files
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

#description of 'train'
train.describe(include='all')

In [None]:
#description of 'test'
test.describe(include='all')

## 3. Data analysis


In [None]:
#list the columns so we can pick the ones we need to analyse
print(train.columns)

In [None]:
#look at a random sample of rows to learn the type and characteristics of each column
train.sample(10)

**Features type**
- Numerical: Age (Continuous), Fare (Continuous), SibSp (Discrete), Parch (Discrete)
- Categorical:  Survived, Sex, Embarked, Pclass
- Alphanumeric: Ticket, Cabin

**Data type**
- Survived: int
- Pclass: int
- Name: string
- Sex: string
- Age: float
- SibSp: int
- Parch: int
- Ticket: string
- Fare: float
- Cabin: string
- Embarked: string

In [None]:
#check missing values
print(pd.isnull(train).sum())

177 Age values are missing, which is 19.9% of the rows. Age is likely to be useful for prediction, so the gaps will be filled in. <br> 687 Cabin values are missing, which is 77.1% of the rows. Cabin is not considered necessary for prediction, so the column will be removed. <br> 2 Embarked values are missing, which is 0.2% of the rows. Embarked may be useful, so the gaps will be filled in or ignored.
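
Each percentage above is just the missing count divided by the number of rows. A quick check, assuming the training set has 891 rows (that total is my assumption; it is not printed anywhere above):

```python
# Missing-value counts quoted in the text above.
missing_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}

# ASSUMPTION: total number of rows in train.csv.
TOTAL_ROWS = 891

# Share of rows with a missing value, as a percentage rounded to one decimal.
missing_pct = {
    column: round(count / TOTAL_ROWS * 100, 1)
    for column, count in missing_counts.items()
}
print(missing_pct)
```

On the real DataFrame the same figures should come from `train.isnull().mean() * 100`.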

## Prediction
**Sex**: Women are more likely to survive   
**SibSp/Parch**: People travelling alone are more likely to survive   
**Age**: Children are more likely to survive  
**Pclass**: Passengers of higher socioeconomic class are more likely to survive  

## 4. Data visualisation

### Sex

In [None]:
#draw a bar plot of survival by sex
sns.barplot(x="Sex", y="Survived", data=train)

#print percentages of females vs. males that survive
print("Percentage of females who survived:", train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True)[1]*100)
print("Percentage of males who survived:", train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True)[1]*100)

As predicted, women have a higher chance of survival than men. Therefore, Sex is an important variable for the prediction.

### Pclass

In [None]:
#draw a bar plot of survival by Pclass
sns.barplot(x="Pclass", y="Survived", data=train)

#print percentage of people by Pclass that survived
print("Percentage of Pclass = 1 who survived:", train["Survived"][train["Pclass"] == 1].value_counts(normalize = True)[1]*100)
print("Percentage of Pclass = 2 who survived:", train["Survived"][train["Pclass"] == 2].value_counts(normalize = True)[1]*100)
print("Percentage of Pclass = 3 who survived:", train["Survived"][train["Pclass"] == 3].value_counts(normalize = True)[1]*100)

As predicted, survival differs by Pclass: passengers of higher socioeconomic class have a higher chance of survival.  
Therefore, Pclass is an important variable for the prediction.

### SibSp

In [None]:
#draw a bar plot for SibSp vs. survival
sns.barplot(x="SibSp", y="Survived", data=train)

#print survival percentages for the first few SibSp values only, not for all of them
print("Percentage of SibSp = 0 who survived:", train["Survived"][train["SibSp"] == 0].value_counts(normalize = True)[1]*100)
print("Percentage of SibSp = 1 who survived:", train["Survived"][train["SibSp"] == 1].value_counts(normalize = True)[1]*100)
print("Percentage of SibSp = 2 who survived:", train["Survived"][train["SibSp"] == 2].value_counts(normalize = True)[1]*100)

The plot shows that people with fewer siblings or spouses aboard are more likely to survive but, unlike the prediction,   
those with none have a lower survival rate than those with one or two.
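
Indexing `value_counts(normalize=True)[1]` raises a `KeyError` for any group that has no survivors at all, which is one reason to print only a few values by hand. A minimal sketch of an alternative, run on invented toy rows rather than the real data: because `Survived` is 0 or 1, the mean of each group is its survival rate.

```python
import pandas as pd

# Invented toy rows, NOT the Titanic data; the SibSp = 5 group has no survivors.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 2, 2, 5],
    "Survived": [0, 0, 1, 0, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 column per group = fraction of survivors in that group.
survival_pct = toy.groupby("SibSp")["Survived"].mean() * 100
print(survival_pct.round(1))
```

The same pattern should work for any of the categorical columns in this section, for example `train.groupby("Pclass")["Survived"].mean()`.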

### Parch

In [None]:
#draw a bar plot for Parch vs. survival
sns.barplot(x="Parch", y="Survived", data=train)
plt.show()

People with fewer than four parents or children aboard are more likely to survive.  
As with the SibSp variable, people with no parents or children aboard are less likely to survive than those with one or two.

### Age

In [None]:
#sort the ages into logical categories
train["Age"] = train["Age"].fillna(-0.5)
test["Age"] = test["Age"].fillna(-0.5)
bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
train['AgeGroup'] = pd.cut(train["Age"], bins, labels = labels)
test['AgeGroup'] = pd.cut(test["Age"], bins, labels = labels)

#draw a bar plot of Age vs. survival
sns.barplot(x="AgeGroup", y="Survived", data=train)
plt.show()

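#fillna(value) replaces NaN with value; here unknown ages become -0.5, which falls in the (-1, 0] 'Unknown' bin

Babies (aged 0 to 5) are more likely to survive than the other age groups.  
  
However, because children are smaller and lighter than adults, this may be partly explained by lifeboat capacity during the evacuation.  
It may also be explained by adults bringing children with them, so that the children ended up in a lifeboat or some other means of escape together with them.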
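
To see how the binning treats boundary ages, here is a small sketch that reuses the `bins` and `labels` from the cell above on a few hand-picked ages (the sample ages are my own, not taken from the data). `pd.cut` uses right-closed intervals by default, so an age of exactly 5 is still a 'Baby'.

```python
import numpy as np
import pandas as pd

bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student',
          'Young Adult', 'Adult', 'Senior']

# Hand-picked sample ages; NaN stands for a missing age and becomes -0.5.
sample_ages = pd.Series([np.nan, 0.42, 5, 5.5, 18, 24, 30, 60, 80]).fillna(-0.5)

# Each interval is (left, right], so boundary values fall into the lower group.
age_groups = pd.cut(sample_ages, bins, labels=labels)
print(pd.DataFrame({"Age": sample_ages, "AgeGroup": age_groups}))
```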

### Cabin

In [None]:
train["CabinBool"] = (train["Cabin"].notnull().astype('int'))
test["CabinBool"] = (test["Cabin"].notnull().astype('int'))

#calculate percentages of CabinBool vs. survived
print("Percentage of CabinBool = 1 who survived:", train["Survived"][train["CabinBool"] == 1].value_counts(normalize = True)[1]*100)
print("Percentage of CabinBool = 0 who survived:", train["Survived"][train["CabinBool"] == 0].value_counts(normalize = True)[1]*100)

#draw a bar plot of CabinBool vs. survival
sns.barplot(x="CabinBool", y="Survived", data=train)
plt.show()

People with a recorded cabin are more likely to survive.  
However, it is possible that cabins were recorded mainly for passengers who survived, or    
that passengers of higher socioeconomic class were the ones with a recorded cabin,  
and most of them survived because of their class.
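
One way to probe the second explanation is to check how cabin records are spread across classes, and then compare survival within each class. A minimal sketch on invented toy rows (not the real data, so the numbers mean nothing in themselves):

```python
import pandas as pd

# Invented toy rows, NOT the Titanic data.
toy = pd.DataFrame({
    "Pclass":    [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "CabinBool": [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
    "Survived":  [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Share of passengers with a recorded cabin in each class.
cabin_share = toy.groupby("Pclass")["CabinBool"].mean()

# Survival rate by cabin record *within* each class, which separates the
# effect of the cabin record from the effect of class.
survival = toy.groupby(["Pclass", "CabinBool"])["Survived"].mean()

print(cabin_share)
print(survival)
```

If the survival gap between `CabinBool` = 1 and 0 shrinks inside each class on the real data, class would explain much of the cabin effect.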

## 5. Cleaning data

### Test data

In [None]:
test.describe(include='all')

In [None]:
#check missing values
print(pd.isnull(test).sum())

1 value for the Fare variable is missing, so we need to fill it in.  
327 values for the Cabin variable are missing, but Cabin is not considered critical for the prediction,  
so I am going to remove it.


### Cabin

In [None]:
#drop Cabin, assuming it is not a critical factor for predicting survival
train = train.drop(['Cabin'], axis = 1)
test = test.drop(['Cabin'], axis = 1)

### Ticket

In [None]:
#the same goes for Ticket
train = train.drop(['Ticket'], axis = 1)
test = test.drop(['Ticket'], axis = 1)

In [None]:
#to fill in the missing Embarked values, check which port most people embarked from
print("Number of people embarking in Southampton (S):")
southampton = train[train["Embarked"] == "S"].shape[0]
print(southampton)

print("Number of people embarking in Cherbourg (C):")
cherbourg = train[train["Embarked"] == "C"].shape[0]
print(cherbourg)

print("Number of people embarking in Queenstown (Q):")
queenstown = train[train["Embarked"] == "Q"].shape[0]
print(queenstown)

Since most passengers embarked at Southampton, we are going to fill the missing values with Southampton (S).  
**However**, if the missing values were filled *in proportion to each embarkation port* instead of all being set to Southampton, the result might be different.
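
The proportional idea could be sketched as follows: sample each missing port from the distribution of the observed ones. This is only an illustration on an invented toy column, and the fixed seed is my own choice so the draw is repeatable.

```python
import numpy as np
import pandas as pd

# Invented toy column, NOT the Titanic data; None marks a missing port.
embarked = pd.Series(["S", "S", "C", None, "Q", "S", None, "C"])

# Observed share of each port among the non-missing values.
proportions = embarked.dropna().value_counts(normalize=True)

# Draw one port per missing row, weighted by the observed shares.
rng = np.random.default_rng(0)  # fixed seed (my choice) for repeatability
missing = embarked.isnull()
draws = rng.choice(
    proportions.index.to_numpy(),
    size=int(missing.sum()),
    p=proportions.to_numpy(),
)

filled = embarked.copy()
filled[missing] = draws
print(filled.tolist())
```

With only 2 missing values out of the whole training set, either approach is unlikely to change the model much.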

In [None]:
#replacing the missing values in the Embarked feature with S
train = train.fillna({"Embarked": "S"})

### Age

For the missing Age values, it would not make sense to fill them all in with the same age,  
so we have to estimate what each passenger's age is likely to be.

In [None]:
#create a combined group of both datasets
combine = [train, test]

#extract a title for each Name in the train and test datasets
for dataset in combine:
    #raw string so that \. is passed to the regex engine unchanged
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train['Title'], train['Sex'])

As you can see above, there are many different titles.  
So in our case, I am going to consolidate them into 6 groups.  

* Rare: Lady, Capt, Col, Don, Dr, Major, Rev, Jonkheer, Dona
* Royal: Countess, Sir (Lady falls under Rare, because that replacement runs first)
* Miss: Miss, Mlle, Ms
* Mrs: Mrs, Mme
* Mr
* Master
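
As a standalone illustration of the extraction and grouping steps, here is a sketch using only the standard `re` module. The helper names `extract_title` and `group_title` are hypothetical, and the sample names are just examples in the dataset's "Surname, Title. Given names" format.

```python
import re

# A space, then letters, then a literal dot: the title inside a passenger name.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

# Same grouping as the list above; titles not listed here are kept as they are.
TITLE_GROUPS = {
    **dict.fromkeys(["Lady", "Capt", "Col", "Don", "Dr", "Major",
                     "Rev", "Jonkheer", "Dona"], "Rare"),
    **dict.fromkeys(["Countess", "Sir"], "Royal"),
    "Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs",
}

def extract_title(name):
    """Return the first word followed by a dot, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

def group_title(title):
    """Map an extracted title to its consolidated group."""
    return TITLE_GROUPS.get(title, title)

for name in ["Braund, Mr. Owen Harris",
             "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)"]:
    print(name, "->", group_title(extract_title(name)))
```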


In [None]:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Capt', 'Col',
    'Don', 'Dr', 'Major', 'Rev', 'Jonkheer', 'Dona'], 'Rare')
    #'Lady' is already grouped under Rare above, so it is not repeated here
    dataset['Title'] = dataset['Title'].replace(['Countess', 'Sir'], 'Royal')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

In [None]:
#map each of the title groups to a numerical value
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Royal": 5, "Rare": 6}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train.head()

By using the most common age group for each title, we can estimate the missing age values.

In [None]:
# fill missing age with mode age group for each title
mr_age = train[train["Title"] == 1]["AgeGroup"].mode() #Young Adult
miss_age = train[train["Title"] == 2]["AgeGroup"].mode() #Student
mrs_age = train[train["Title"] == 3]["AgeGroup"].mode() #Adult
master_age = train[train["Title"] == 4]["AgeGroup"].mode() #Baby
royal_age = train[train["Title"] == 5]["AgeGroup"].mode() #Adult
rare_age = train[train["Title"] == 6]["AgeGroup"].mode() #Adult

age_title_mapping = {1: "Young Adult", 2: "Student", 3: "Adult", 4: "Baby", 5: "Adult", 6: "Adult"}

#replace each 'Unknown' age group with the typical age group for that passenger's title
#(.loc is used instead of train["AgeGroup"][x] = ..., a chained assignment that pandas may not write back)
for dataset in (train, test):
    unknown = dataset["AgeGroup"] == "Unknown"
    dataset.loc[unknown, "AgeGroup"] = dataset.loc[unknown, "Title"].map(age_title_mapping)
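
The mapping above is typed in by hand from the mode of each title. A minimal sketch of deriving it from the data instead, on invented toy rows (not the real data), leaving 'Unknown' out of the mode so it can never be chosen as the fill value:

```python
import pandas as pd

# Invented toy rows, NOT the Titanic data.
toy = pd.DataFrame({
    "Title":    ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss", "Master", "Master"],
    "AgeGroup": ["Young Adult", "Young Adult", "Unknown", "Student",
                 "Unknown", "Student", "Baby", "Unknown"],
})

# Most common known age group per title.
known = toy[toy["AgeGroup"] != "Unknown"]
mode_by_title = known.groupby("Title")["AgeGroup"].agg(lambda s: s.mode().iloc[0])

# Fill only the unknown rows, looking up each row's title in the mapping.
unknown = toy["AgeGroup"] == "Unknown"
toy.loc[unknown, "AgeGroup"] = toy.loc[unknown, "Title"].map(mode_by_title)
print(toy)
```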

In [None]:
#map each Age value to a numerical value
age_mapping = {'Baby': 1, 'Child': 2, 'Teenager': 3, 'Student': 4, 'Young Adult': 5, 'Adult': 6, 'Senior': 7}
train['AgeGroup'] = train['AgeGroup'].map(age_mapping)
test['AgeGroup'] = test['AgeGroup'].map(age_mapping)

#dropping the Age feature for now, might change
train = train.drop(['Age'], axis = 1)
test = test.drop(['Age'], axis = 1)

#a bare expression is only displayed when it is the last line of the cell
train.head()

### Name

In [None]:
#drop the name feature since it contains no more useful information.
train = train.drop(['Name'], axis = 1)
test = test.drop(['Name'], axis = 1)

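Personally, I wonder whether the surnames could be used to work out family relationships and then analyse who survived within each family.  
Building on the children's survival rate mentioned earlier, we could ask whether a child was rescued together with their family,  
or whether the child was the only one who survived.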
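
A rough sketch of that surname idea, on invented names (not real passengers): the surname is everything before the first comma, and grouping on it gives family size and survivor counts. Sharing a surname does not guarantee that people are related, so this is only a first approximation, and it has to run before the Name column is dropped.

```python
import pandas as pd

# Invented names, NOT real passengers.
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Jane",
             "Smith, Master. Tom", "Doe, Miss. Ann"],
    "Survived": [0, 1, 1, 1],
})

# Surname = text before the first comma.
toy["Surname"] = toy["Name"].str.split(",").str[0].str.strip()

# Per surname: how many people share it, and how many of them survived.
family = toy.groupby("Surname")["Survived"].agg(members="size", survivors="sum")
print(family)
```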


### Sex

In [None]:
#map each Sex value to a numerical value
sex_mapping = {"male": 0, "female": 1}
train['Sex'] = train['Sex'].map(sex_mapping)
test['Sex'] = test['Sex'].map(sex_mapping)

train.head()

### Embarked

In [None]:
#map each Embarked value to a numerical value
embarked_mapping = {"S": 1, "C": 2, "Q": 3}
train['Embarked'] = train['Embarked'].map(embarked_mapping)
test['Embarked'] = test['Embarked'].map(embarked_mapping)

train.head()

### Fare
Fare needs to be grouped into sensible bands, and the one missing value needs to be filled in.

In [None]:
#fill in missing Fare value in test set based on mean fare for that Pclass
#(.loc is used instead of test["Fare"][x] = ..., a chained assignment that pandas may not write back)
class_mean_fare = train.groupby("Pclass")["Fare"].mean().round(4)
missing_fare = test["Fare"].isnull()
test.loc[missing_fare, "Fare"] = test.loc[missing_fare, "Pclass"].map(class_mean_fare) #Pclass = 3
        
#map Fare values into groups of numerical values
train['FareBand'] = pd.qcut(train['Fare'], 4, labels = [1, 2, 3, 4])
test['FareBand'] = pd.qcut(test['Fare'], 4, labels = [1, 2, 3, 4])

#drop Fare values
train = train.drop(['Fare'], axis = 1)
test = test.drop(['Fare'], axis = 1)
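
One caveat with the cell above: calling `pd.qcut` separately on train and test computes different quartile edges for each, so the same fare can land in different bands. A hedged sketch of one alternative, on invented fares (not the real data): learn the edges from the training fares with `retbins=True`, then apply them to the test fares with `pd.cut`.

```python
import numpy as np
import pandas as pd

# Invented fares, NOT the Titanic data.
train_fare = pd.Series([7.25, 7.9, 8.05, 13.0, 15.5, 26.0, 31.0, 71.3])
test_fare = pd.Series([7.0, 14.0, 30.0, 512.0])

# Quartile bands on the training fares; retbins=True also returns the edges.
train_band, edges = pd.qcut(train_fare, 4, labels=[1, 2, 3, 4], retbins=True)

# Open the outer edges so test fares outside the training range still get a band.
edges[0], edges[-1] = -np.inf, np.inf

# Apply the *training* edges to the test fares.
test_band = pd.cut(test_fare, bins=edges, labels=[1, 2, 3, 4])
print(train_band.tolist(), test_band.tolist())
```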

In [None]:
#check train data
train.head()

In [None]:
#check test data
test.head()

## 6. Choosing the best model

### Splitting the training data (holding out 30% for validation)

In [None]:
from sklearn.model_selection import train_test_split

predictors = train.drop(['Survived', 'PassengerId'], axis=1)
target = train["Survived"]
x_train, x_val, y_train, y_val = train_test_split(predictors, target, test_size = 0.30, random_state = 0)

### Testing Different Models  
I will be testing the following models with my training data (got the list from here):   

* Gaussian Naive Bayes  
* Logistic Regression  
* Support Vector Machines  
* Perceptron  
* Decision Tree Classifier  
* Random Forest Classifier  
* KNN or k-Nearest Neighbors  
* Stochastic Gradient Descent  
* Gradient Boosting Classifier  
  
For each model, we set the model, fit it with 70% of our training data, predict for the remaining 30% and check the accuracy.
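
With `test_size = 0.30` the split is roughly 70/30. A small arithmetic sketch of the resulting row counts, assuming 891 training rows and assuming, as I understand scikit-learn's behaviour, that a fractional validation count is rounded up:

```python
import math

TOTAL_ROWS = 891   # ASSUMPTION: number of rows in train.csv
TEST_SIZE = 0.30   # value passed to train_test_split above

# Validation rows are rounded up; the rest are used for fitting.
n_val = math.ceil(TEST_SIZE * TOTAL_ROWS)
n_train = TOTAL_ROWS - n_val
print(n_train, n_val)
```

On the real split, `len(x_train)` and `len(x_val)` give the actual counts.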

In [None]:
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
y_pred = gaussian.predict(x_val)
acc_gaussian = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gaussian)

In [None]:
# Logistic Regression
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_val)
acc_logreg = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_logreg)

In [None]:
# Support Vector Machines
from sklearn.svm import SVC

svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_val)
acc_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_svc)

In [None]:
# Linear SVC
from sklearn.svm import LinearSVC

linear_svc = LinearSVC()
linear_svc.fit(x_train, y_train)
y_pred = linear_svc.predict(x_val)
acc_linear_svc = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_linear_svc)

In [None]:
# Perceptron
from sklearn.linear_model import Perceptron

perceptron = Perceptron()
perceptron.fit(x_train, y_train)
y_pred = perceptron.predict(x_val)
acc_perceptron = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_perceptron)

In [None]:
#Decision Tree
from sklearn.tree import DecisionTreeClassifier

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train, y_train)
y_pred = decisiontree.predict(x_val)
acc_decisiontree = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_decisiontree)

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier

randomforest = RandomForestClassifier()
randomforest.fit(x_train, y_train)
y_pred = randomforest.predict(x_val)
acc_randomforest = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_randomforest)

In [None]:
# KNN or k-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)
y_pred = knn.predict(x_val)
acc_knn = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_knn)

In [None]:
# Stochastic Gradient Descent
from sklearn.linear_model import SGDClassifier

sgd = SGDClassifier()
sgd.fit(x_train, y_train)
y_pred = sgd.predict(x_val)
acc_sgd = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_sgd)

In [None]:
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier

gbk = GradientBoostingClassifier()
gbk.fit(x_train, y_train)
y_pred = gbk.predict(x_val)
acc_gbk = round(accuracy_score(y_pred, y_val) * 100, 2)
print(acc_gbk)

In [None]:
#compare all of the results with their accuracies
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 'Linear SVC', 
              'Decision Tree', 'Stochastic Gradient Descent', 'Gradient Boosting Classifier'],
    'Score': [acc_svc, acc_knn, acc_logreg, 
              acc_randomforest, acc_gaussian, acc_perceptron,acc_linear_svc, acc_decisiontree,
              acc_sgd, acc_gbk]})
models.sort_values(by='Score', ascending=False)

The KNN model has the best score in my run, followed by Support Vector Machines.
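
The ten cells above all repeat the same fit, predict and score steps. A compact sketch of the same comparison as a loop, using synthetic data from `make_classification` in place of the Titanic features and only three of the models; the `toy_` names and the `max_iter` value are my own choices, not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, NOT the Titanic features.
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=0)
toy_X_train, toy_X_val, toy_y_train, toy_y_val = train_test_split(
    toy_X, toy_y, test_size=0.30, random_state=0)

candidates = {
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

# Fit each model on the training part and score it on the held-out part.
scores = {}
for name, model in candidates.items():
    model.fit(toy_X_train, toy_y_train)
    accuracy = accuracy_score(toy_y_val, model.predict(toy_X_val))
    scores[name] = round(accuracy * 100, 2)

# Print the models from best to worst validation accuracy.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score}")
```

Since the ranking comes from a single split, small gaps between models may be noise rather than a real difference.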

## 7. Creating Submission file

In [None]:
#set ids as PassengerId and predict survival with the best-scoring model above (KNN)
ids = test['PassengerId']
predictions = knn.predict(test.drop('PassengerId', axis=1))

#set the output as a dataframe and convert to csv file named submission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('submission.csv', index=False)

### Closing
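The set of models was the same, but the scores came out differently from the reference case, which used a 22% setting (this notebook holds out 30%).  
By studying the machine learning models in scikit-learn, I should be able to find out why the two produced different results.  
Also, a deeper analysis of why infants had the highest survival rate  
might reveal a link to socioeconomic level or, just as survival was higher for families with one or two children,  
whether family members survived more often than people travelling alone because they were trying to save a child.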