# Classification

Consider [this](https://www.kaggle.com/c/titanic) scenario:

> The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

We can treat this as a classification problem: given what we know about a passenger, predict whether they survived.

In this session, we compare three classifiers: k-nearest neighbors (kNN), a decision tree, and a random forest.

## Prepare and explore the data

In [None]:
# Package imports

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.head()

Since PassengerId and Name only identify each row, we can use the PassengerId column as the index (row names) and drop the Name column.

In [None]:
df = df.set_index('PassengerId') \
    .drop('Name', axis = 1)

Let's do some data exploration.

In [None]:
# Information about datatype

df.info()

The data contains 891 rows and 10 columns; each row represents a passenger. The columns are:
* `Survived` - Whether the passenger survived (0 = No, 1 = Yes)
* `Pclass` - The passenger's ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
* `Sex` - The passenger's sex (male/female)
* `Age` - The passenger's age in years
* `SibSp` - Number of siblings/spouses aboard the Titanic
* `Parch` - Number of parents/children aboard the Titanic
* `Ticket` - The passenger's ticket number
* `Fare` - The fare the passenger paid for the ticket
* `Cabin` - The passenger's cabin number
* `Embarked` - The passenger's embarkation port (C = Cherbourg, Q = Queenstown, S = Southampton)

In [None]:
df.groupby(['Survived']).size()

In [None]:
# Check missing values

df.isnull().sum() # divide by len(df) to get the share of missing values instead

In [None]:
df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].describe()

## Data preprocessing

Before running the models, we have to preprocess the data so that the models can take it as input.

In [None]:
# Drop some unnecessary columns

df = df.drop(['Ticket', 'Cabin', 'Embarked'], axis = 1)

The Ticket column is just an identifier, so we can remove it. The Cabin column has a very large number of missing values. Besides, the Pclass and Fare columns should be enough to infer the passenger's status and position on the Titanic, so we can drop the Cabin column as well. We also drop the Embarked column.
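One way to back up the claim about Cabin is to look at the share of missing values per column rather than the raw counts. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not the Titanic data) and relies on `isnull().mean()`, which returns the fraction of missing entries in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for two of the Titanic columns
toy = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing entries per column
missing_share = toy.isnull().mean()
print(missing_share)
```

A column where most entries are missing, like `Cabin` in this toy frame, is a natural candidate for dropping.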

In [None]:
# Handle the missing value

df.loc[df['Age'].isnull(), 'Age'] = df['Age'].median()

We fill the missing values in the Age column with the column's median.
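As a minimal sketch of the same idea on made-up ages (the `toy` frame is an assumption for illustration), `fillna` gives an equivalent result to the `.loc` assignment above. Note that `median()` skips missing values when it is computed.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, 26.0, np.nan]})

# median() ignores NaN, so this is the median of 22, 26 and 38
age_median = toy['Age'].median()

# Replace every missing age with that median
toy['Age'] = toy['Age'].fillna(age_median)
print(toy)
```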

In [None]:
# Recode the non-numeric variable

# Without drop_first, this would create both Sex_female and Sex_male:
# pd.get_dummies(df, columns=['Sex'])

df = pd.get_dummies(df, columns=['Sex'], drop_first=True)

df.head()

Since our models only accept numeric values, we have to recode the non-numeric column.
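To see what `drop_first=True` does, here is a small sketch on a made-up `Sex` column (the `toy` frame is illustrative only). With two categories, one dummy column carries all the information, so the first category in sorted order (`female`) is dropped and only `Sex_male` remains. Depending on the pandas version, the dummy column may hold booleans or 0/1 integers, so the sketch casts it to `int` for display.

```python
import pandas as pd

# Hypothetical column with the same two categories as the Titanic data
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# drop_first=True drops the first category (female) and keeps Sex_male
encoded = pd.get_dummies(toy, columns=['Sex'], drop_first=True)

# Cast to int so the output looks the same across pandas versions
encoded['Sex_male'] = encoded['Sex_male'].astype(int)
print(encoded)
```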

In [None]:
# Separate the features (X) from the target (y)

X = df[['Pclass', 'Age', 'SibSp', 'Parch', 'Sex_male', 'Fare']]
y = df['Survived']

## Split and scale the data

In [None]:
# Split the data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

The training data is used to fit the models, while the test data is held out to measure how well they perform on data they have not seen.
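The following sketch shows what `test_size = 0.2` means in practice, using a made-up array of 10 rows (`X_toy` and `y_toy` are assumptions for illustration). Two rows go to the test set, eight stay in the training set, and no row appears in both. Fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, alternating labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)

print(X_tr.shape, X_te.shape)  # 8 training rows, 2 test rows

# Same random_state, same split
X_tr2, X_te2, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=123
)
print(np.array_equal(X_te, X_te2))
```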

In [None]:
# Scale the data

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train) # Learn each column's min and max from the training data, then transform it
X_test = scaler.transform(X_test) # Transform the test data with the min and max learned from the training data

X_train = pd.DataFrame(X_train, columns=X.columns)
X_test = pd.DataFrame(X_test, columns=X.columns)

In [None]:
X_train.head()
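`MinMaxScaler` with default settings rescales each column as `(x - min) / (max - min)`, where `min` and `max` come from the data passed to `fit`. The sketch below checks this by hand on made-up numbers (`train` and `test` are illustrative arrays). Because the scaler is fitted on the training data only, a test value outside the training range ends up outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# Fit on the training data only, then transform the test data
scaler = MinMaxScaler().fit(train)
scaled = scaler.transform(test)

# The same computation by hand, using the training min and max
train_min = train.min(axis=0)
train_max = train.max(axis=0)
manual = (test - train_min) / (train_max - train_min)

print(scaled.ravel())
print(manual.ravel())
```

Here 40 is larger than the training maximum of 30, so its scaled value is greater than 1.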

## Modelling

In [None]:
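# Create a function to evaluate our models
# aktual = actual labels, prediksi = predicted labels, name = title printed above the results
def classification_eval(aktual, prediksi, name):
    # confusion_matrix puts actual classes in rows and predicted classes in columns
    cm = confusion_matrix(aktual, prediksi)
    tp = cm[1][1]  # true positives
    tn = cm[0][0]  # true negatives
    fp = cm[0][1]  # false positives
    fn = cm[1][0]  # false negatives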

    accuracy = round((tp+tn) / (tp+tn+fp+fn) * 100, 2)
    precision = round((tp) / (tp+fp) * 100, 2)
    recall = round((tp) / (tp+fn) * 100, 2)
    specificity = round((tn) / (tn+fp) * 100, 2)

    print('Evaluation Model:', name)
    print(cm)
    print('Accuracy   :', accuracy, '%')
    print('Precision  :', precision, '%')
    print('Recall     :', recall, '%')
    print('Specificity:', specificity, '%')
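As a sanity check on the formulas above, this sketch computes the same four metrics by hand on made-up labels and compares them with scikit-learn's own scoring functions. The label lists are assumptions for illustration. Specificity has no dedicated function in `sklearn.metrics`, but it equals the recall of the negative class, so `recall_score` with `pos_label=0` is used for it.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical actual and predicted labels
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
print(specificity, recall_score(y_true, y_pred, pos_label=0))
```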

### kNN

In [None]:
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)

In [None]:
y_train_pred = knn.predict(X_train)
y_test_pred = knn.predict(X_test)

In [None]:
y_train_pred

In [None]:
# Training Performance
classification_eval(y_train, y_train_pred, 'KNN Training Perf.')

In [None]:
# Testing Performance
classification_eval(y_test, y_test_pred, 'KNN Testing Perf.')
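With default settings, `KNeighborsClassifier` predicts by taking a majority vote among the nearest training points, measured by Euclidean distance. This is also why scaling the features matters for kNN. The sketch below reproduces one prediction by hand on made-up points (`X_toy`, `y_toy`, `query` and `k = 3` are assumptions for illustration; the model above uses the default of 5 neighbors).

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training points: three of class 0, two of class 1
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y_toy = np.array([0, 0, 0, 1, 1])
query = np.array([0.8, 0.9])
k = 3

# Euclidean distance from the query to every training point
distances = np.linalg.norm(X_toy - query, axis=1)

# Indices of the k closest training points
nearest = np.argsort(distances)[:k]

# Majority vote among their labels
manual_pred = np.bincount(y_toy[nearest]).argmax()

# The same prediction from scikit-learn
model = KNeighborsClassifier(n_neighbors=k).fit(X_toy, y_toy)
sk_pred = model.predict(query.reshape(1, -1))[0]

print(manual_pred, sk_pred)
```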

### Decision Tree

In [None]:
dt = DecisionTreeClassifier(random_state = 123) # Fix the seed so the tree is reproducible, as for the random forest below
dt.fit(X_train, y_train)

In [None]:
y_train_pred = dt.predict(X_train)
y_test_pred = dt.predict(X_test)

In [None]:
# Training Performance
classification_eval(y_train, y_train_pred, 'Dectree Training Perf.')

In [None]:
# Testing Performance
classification_eval(y_test, y_test_pred, 'Dectree Testing Perf.')

In [None]:
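# Plot Tree
fig = plt.figure(figsize=(15, 10))
# Pass the feature names as a plain list; some scikit-learn versions reject a pandas Index here
_ = plot_tree(dt, feature_names = list(X_train.columns), class_names = ['No', 'Yes'])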

plt.show()

In [None]:
pd.DataFrame(list(zip(X_train.columns, dt.feature_importances_)),
             columns = ['Feature', 'Importance']) \
    .sort_values('Importance', ascending = False)
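When reading the importance table, it helps to know that `feature_importances_` is normalized to sum to 1, and that a feature the tree never splits on gets an importance of 0. The sketch below illustrates this on synthetic data where the label depends only on the first feature (`X_toy`, `y_toy` and the seed are assumptions for illustration).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on feature 0
rng = np.random.default_rng(0)
X_toy = rng.random((200, 2))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
importances = tree.feature_importances_

print(importances)        # feature 0 should dominate
print(importances.sum())  # importances are normalized to sum to 1
```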

### Random Forest

In [None]:
rf = RandomForestClassifier(random_state = 123)
rf.fit(X_train, y_train)

In [None]:
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

In [None]:
# Training Performance
classification_eval(y_train, y_train_pred, 'Ranfor Training Perf.')

In [None]:
# Testing Performance
classification_eval(y_test, y_test_pred, 'Ranfor Testing Perf.')

In [None]:
pd.DataFrame(list(zip(X_train.columns, rf.feature_importances_)),
             columns = ['Feature', 'Importance']) \
    .sort_values('Importance', ascending = False)

In [None]:
# The individual decision trees that make up the forest
rf.estimators_
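Each entry in `estimators_` is a fitted decision tree. In scikit-learn, the forest's predicted class probabilities are the mean of the probabilities predicted by these trees. The sketch below checks this on synthetic data with a small forest (`X_toy`, `y_toy`, `n_estimators=10` and the seed are assumptions for illustration).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with three features; the label depends on the first two
rng = np.random.default_rng(0)
X_toy = rng.random((100, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Class probabilities from every individual tree, stacked along a new axis
per_tree = np.stack([t.predict_proba(X_toy) for t in forest.estimators_])

# Averaging over the trees should reproduce the forest's own probabilities
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X_toy)))
```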