# Live Session 13 (07/10/2022) - Classification II
#### By Ika Nurfitriani (PYTN-KS10-008)

This session covers part 2 of classification. The topics are data splitting, data scaling, modelling with SVM and random forest implementations, and evaluation.

# Naive Bayes Classifier
Naive Bayes is a statistical classification technique based on Bayes' theorem. It is one of the simplest supervised learning algorithms, and it is called "naive" because it assumes the features are conditionally independent given the class. Naive Bayes classifiers are fast to train and to predict with, and they often reach good accuracy even on large datasets.
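
As a small illustration of the idea above, here is a minimal sketch using scikit-learn's `GaussianNB` on made-up data. The toy clusters and the names `X_toy`, `y_toy`, and `nb` are assumptions for this example only and are not part of the session's notebook.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two well-separated 2-D clusters, one per class.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))
X1 = rng.normal(loc=2.0, scale=1.0, size=(50, 2))
X_toy = np.vstack([X0, X1])
y_toy = np.array([0] * 50 + [1] * 50)

# GaussianNB models each feature per class as an independent Gaussian.
nb = GaussianNB()
nb.fit(X_toy, y_toy)

# predict_proba returns the posterior P(class | features) for each row.
queries = [[-2.0, -2.0], [2.0, 2.0]]
proba = nb.predict_proba(queries)
pred = nb.predict(queries)
print(pred)
print(proba.round(3))
```

Each row of `proba` sums to 1, and the predicted class is simply the one with the larger posterior.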

# Decision Tree Classifier
A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents an outcome. The topmost node of a decision tree is known as the root node. The tree learns to partition the data based on attribute values: starting at the root, it splits the data recursively, a process known as recursive partitioning. This flowchart-like structure helps with decision making. Because its visualization resembles a flowchart that mimics human-level reasoning, a decision tree is easy to understand and interpret.
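
To make the node, branch, and leaf vocabulary concrete, the sketch below fits a tiny tree and prints it as text rules with scikit-learn's `export_text`. The six rows of data and the feature names `age` and `is_male` are invented for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, is_male] -> survived (1) or not (0).
X_toy = [[5, 1], [30, 1], [40, 1], [8, 0], [25, 0], [50, 0]]
y_toy = [1, 0, 0, 1, 1, 1]

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_toy, y_toy)

# Each indented line is a decision rule on a branch;
# lines ending in "class: ..." are leaf nodes.
rules = export_text(tree, feature_names=["age", "is_male"])
print(rules)
```

The first rule printed corresponds to the split at the root node; deeper indentation means deeper levels of the recursive partitioning.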

# Random Forest
Technically, a random forest is an ensemble method (based on a divide-and-conquer approach) of decision trees that are generated on randomly drawn subsets of the dataset. This collection of decision tree classifiers is also known as a forest.

## Random Forests vs Decision Trees
- A random forest is a collection of multiple decision trees.
- Deep decision trees may overfit, whereas random forests reduce overfitting by building each tree on a random subset.
- Decision trees are computationally faster.
- Random forests are hard to interpret, whereas a decision tree is easy to interpret and can be converted into rules.
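
The comparison above can be explored with a rough sketch like the following, which trains a fully grown tree and a forest on synthetic data from `make_classification`. The dataset and all variable names are assumptions for this example, and the gap between training and test accuracy will differ from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical), not the dataset used in this session.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# A tree with no depth limit can memorize the training set.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# A forest averages many trees, each grown on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, clf in [("decision tree", deep_tree), ("random forest", forest)]:
    train_acc = clf.score(X_tr, y_tr)
    test_acc = clf.score(X_te, y_te)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

A large drop from training accuracy to test accuracy is the usual symptom of overfitting to look for in the printed numbers.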

# SVM
Support Vector Machines (SVM) are generally regarded as a classification approach, but they can be used for both classification and regression problems. SVM can handle multiple continuous and categorical variables, provided the categorical ones are encoded numerically. SVM constructs a hyperplane in a multidimensional space to separate the different classes. It searches for the optimal hyperplane iteratively, minimizing the classification error. The core idea of SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.
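
The maximum-margin idea can be sketched on a handful of linearly separable points. Note that this example uses `SVC(kernel="linear")` rather than the `LinearSVC` used later in this session, because `SVC` exposes the support vectors directly; the toy points and the very large `C` (to approximate a hard margin) are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy points.
X_toy = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],   # class 0
    [3.0, 3.0], [3.0, 4.0], [4.0, 3.0],   # class 1
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C penalizes margin violations heavily (close to a hard margin).
svm = SVC(kernel="linear", C=1e6)
svm.fit(X_toy, y_toy)

# The separating hyperplane is w . x + b = 0.
w = svm.coef_[0]
b = svm.intercept_[0]
# For a linear SVM the width of the margin is 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)

print("support vectors:\n", svm.support_vectors_)
print("margin width:", margin)
```

Only the support vectors, the points closest to the hyperplane, determine where it lies; moving any other point slightly would leave the solution unchanged.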

**Import the required libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

import warnings
warnings.filterwarnings("ignore")

**Open data**

In [2]:
df = pd.read_csv("train.csv", index_col=0)
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


**Inspect the dataset information**

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


**Drop the columns that are not needed**

In [4]:
df.drop(['Name', 'Ticket', 'Embarked', 'Cabin'], axis=1, inplace=True)

**Inspect the dataset information**

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(4), object(1)
memory usage: 55.7+ KB


**Count the values in the Sex column**

In [6]:
df.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

**Encode the categories in the Sex column as numbers**

In [7]:
df.Sex = df.Sex.map({'female':0,'male':1})
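
One detail of `Series.map` with a dictionary is worth a quick sketch: any value that is not a key of the dictionary becomes `NaN`. The toy series below, including the `"unknown"` entry, is made up for illustration.

```python
import pandas as pd

# Hypothetical toy values; "unknown" is not a key in the mapping below.
sex = pd.Series(["male", "female", "unknown"])
mapped = sex.map({"female": 0, "male": 1})

# Unmapped categories turn into NaN, which also makes the dtype float.
print(mapped)
print("missing after mapping:", mapped.isna().sum())
```

Checking `isna().sum()` after a mapping like this is a cheap way to catch unexpected categories.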

**Inspect the dataset information**

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(5)
memory usage: 55.7 KB


**Imputation**

In [9]:
# Fill missing ages with the median; assign back instead of using a chained inplace call
median = df.Age.median()
df['Age'] = df['Age'].fillna(median)
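
As an alternative to calling `fillna` by hand, median imputation can be sketched with scikit-learn's `SimpleImputer`, which stores the learned median so it can be reused on other data. The small `ages` frame is a made-up example, not the session's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing values.
ages = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

imputer = SimpleImputer(strategy="median")
# fit learns the median of the observed values; transform fills the gaps.
filled = imputer.fit_transform(ages)

print("learned median:", imputer.statistics_[0])
print(filled[:, 0])
```

Because the median lives in `imputer.statistics_`, calling `imputer.transform` on new data fills its gaps with the same value that was learned here.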

**Inspect the dataset information**

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    int64  
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(5)
memory usage: 55.7 KB


**Improvement: remove SibSp & Parch by combining their information**

In [11]:
# 1 if the passenger has no siblings/spouse and no parents/children aboard, else 0
df['alone'] = ((df.SibSp == 0) & (df.Parch == 0)).astype(int)

In [12]:
df.drop(['Parch', 'SibSp'], axis=1, inplace=True)
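
The effect of this feature-engineering step can be checked on a tiny made-up frame. The `toy` frame below is an assumption for illustration; only the column names `SibSp`, `Parch`, and `alone` come from the notebook.

```python
import pandas as pd

# Hypothetical passengers: only the second one travels with no relatives.
toy = pd.DataFrame({"SibSp": [1, 0, 0, 2], "Parch": [0, 0, 1, 1]})

# Combine both columns into one flag: 1 when both counts are zero.
toy["alone"] = ((toy["SibSp"] == 0) & (toy["Parch"] == 0)).astype(int)
# The two original columns are then dropped, as in the cells above.
toy = toy.drop(columns=["SibSp", "Parch"])

print(toy)
```

Writing the flag as a single boolean expression avoids chained indexing such as `toy['alone'][mask] = 1`, which pandas may apply to a temporary copy.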

**Modelling: split the data into training and test sets**

In [13]:
X = df.drop('Survived', axis=1)
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=46)
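
A possible variation on this split, not used in the notebook, is to pass `stratify` so that both parts keep the same class proportions. The sketch below uses made-up arrays with a 60/40 class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 60 of class 0 and 40 of class 1.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy keeps the 60/40 ratio in both the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=46, stratify=y_toy
)

print("test size:", len(X_te))
print("class 1 share in train:", y_tr.mean())
print("class 1 share in test:", y_te.mean())
```

Stratifying matters most when one class is rare, because a purely random split could otherwise leave very few of its examples in the test set.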

**Scaling**

In [14]:
scaler = StandardScaler()
scaler.fit(X_train)

StandardScaler()

**Transform X**

In [15]:
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
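
What `fit` and `transform` do here can be sketched on a few made-up numbers: the scaler learns the mean and standard deviation from the training rows only, then applies those same statistics to the test rows. The arrays `train` and `test` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with two features on very different scales.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0]])

# fit learns per-column mean and standard deviation from the training rows.
sc = StandardScaler().fit(train)
train_s = sc.transform(train)
# The test row is scaled with the training statistics, not its own.
test_s = sc.transform(test)

print("learned means:", sc.mean_)
print("scaled train mean:", train_s.mean(axis=0))
print("scaled train std:", train_s.std(axis=0))
print("scaled test row:", test_s)
```

Fitting the scaler on the training set alone keeps information about the test set from leaking into preprocessing.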

**Modelling SVM**

In [16]:
model = LinearSVC()
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)

**Evaluation**

In [17]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.89      0.86       104
           1       0.84      0.75      0.79        75

    accuracy                           0.83       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.83      0.83      0.83       179
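
To show where the precision, recall, and f1-score columns of such a report come from, the sketch below recomputes them for class 1 from a confusion matrix. The labels `y_true` and `y_hat` are made up and unrelated to the results above.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and predictions.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of everything predicted as 1, how much is truly 1
recall = tp / (tp + fn)      # of everything truly 1, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall, "f1:", f1)
```

In a classification report, `support` is the number of true samples of each class, and the `macro avg` row is the unweighted mean of the per-class scores.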



**Modelling Random Forest**

In [18]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

**Evaluation**

In [19]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87       104
           1       0.85      0.76      0.80        75

    accuracy                           0.84       179
   macro avg       0.85      0.83      0.84       179
weighted avg       0.84      0.84      0.84       179



**Predict on the test data**

In [20]:
df_test = pd.read_csv("test.csv", index_col=0)
df_test.head()

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


In [21]:
df_test.drop(['Name', 'Ticket', 'Embarked', 'Cabin'], axis=1, inplace=True)

In [22]:
df_test.Sex = df_test.Sex.map({'female':0,'male':1})

**Imputation**

In [23]:
# Reuse the medians from the training data so the test set is imputed consistently
median = df.Age.median()
df_test['Age'] = df_test['Age'].fillna(median)

median = df.Fare.median()
df_test['Fare'] = df_test['Fare'].fillna(median)

**Inspect the dataset information**

In [24]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Pclass  418 non-null    int64  
 1   Sex     418 non-null    int64  
 2   Age     418 non-null    float64
 3   SibSp   418 non-null    int64  
 4   Parch   418 non-null    int64  
 5   Fare    418 non-null    float64
dtypes: float64(2), int64(4)
memory usage: 22.9 KB


**Improvement: remove SibSp & Parch by combining their information**

In [25]:
# 1 if the passenger has no siblings/spouse and no parents/children aboard, else 0
df_test['alone'] = ((df_test.SibSp == 0) & (df_test.Parch == 0)).astype(int)

df_test.drop(['Parch', 'SibSp'], axis=1, inplace=True)
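
Since the same cleaning steps are applied by hand to both the training and the test frame, one option is to collect them in a single function. The sketch below is a hypothetical helper, `preprocess`, that mirrors those steps; the toy frame and the median values `30.0` and `10.0` are arbitrary placeholders, not values from the dataset.

```python
import numpy as np
import pandas as pd

def preprocess(frame, age_median, fare_median):
    """Hypothetical helper mirroring the manual cleaning steps of this notebook."""
    out = frame.drop(columns=["Name", "Ticket", "Embarked", "Cabin"])
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    # The medians are passed in so the test set reuses the training values.
    out["Age"] = out["Age"].fillna(age_median)
    out["Fare"] = out["Fare"].fillna(fare_median)
    out["alone"] = ((out["SibSp"] == 0) & (out["Parch"] == 0)).astype(int)
    return out.drop(columns=["SibSp", "Parch"])

# Made-up rows with the same columns as the raw data.
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Sex": ["male", "female"],
    "Age": [np.nan, 38.0],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Ticket": ["t1", "t2"],
    "Fare": [7.25, np.nan],
    "Cabin": [None, "C85"],
    "Embarked": ["S", "C"],
})

clean = preprocess(toy, age_median=30.0, fare_median=10.0)
print(clean)
```

Running both frames through one function guarantees that they end up with the same columns in the same order, which is what the fitted model expects at prediction time.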

**Predict with the random forest (it was fitted on unscaled features, so `scaler.transform(df_test)` is not applied)**

In [26]:
test_pred = rf.predict(df_test)

In [27]:
df_test['Survived'] = test_pred
df_res = df_test[['Survived']]
df_res.head()

| PassengerId | Survived |
|---|---|
| 892 | 0 |
| 893 | 0 |
| 894 | 1 |
| 895 | 1 |
| 896 | 1 |
