# Preliminary Assignment
The preliminary assignment uses the Titanic dataset, which can be downloaded from [this link](https://drive.google.com/file/d/16j_9FEHLjh_Y_3CdUtp9M13VwImyT89T/view?usp=sharing). Predict whether a passenger survived (column **survived**): the value is 0 if the passenger did not survive and 1 if the passenger survived.

<br>
The assignment is done in groups, with each group consisting of 2 students. The working period runs from 28 March 2022 to 3 April 2022 at 23:59.

# 0. Loading Data and Library

In [130]:
# Put your library here
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing, metrics, svm
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.ensemble import VotingClassifier, StackingClassifier
from sklearn.pipeline import make_pipeline

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

In [3]:
# Read data here

# titanic_dataset
titanic = pd.read_csv('titanic_dataset.csv')
titanic

Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3.0,1.0,"Abelseth, Miss. Karen Marie",female,16.0,0.0,0.0,348125,7.6500,,S
1,1,3.0,0.0,"Burns, Miss. Mary Delia",female,18.0,0.0,0.0,330963,7.8792,,Q
2,2,1.0,1.0,"Fortune, Miss. Alice Elizabeth",female,24.0,3.0,2.0,19950,263.0000,C23 C25 C27,S
3,3,3.0,1.0,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",female,36.0,1.0,0.0,345572,17.4000,,S
4,4,3.0,0.0,"Jonsson, Mr. Nils Hilding",male,27.0,0.0,0.0,350408,7.8542,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
1304,1304,3.0,1.0,"Dahl, Mr. Karl Edwart",male,45.0,0.0,0.0,7598,8.0500,,S
1305,1305,1.0,0.0,"Penasco y Castellana, Mr. Victor de Satode",male,18.0,1.0,0.0,PC 17758,108.9000,C65,C
1306,1306,2.0,1.0,"Becker, Miss. Ruth Elizabeth",female,12.0,2.0,1.0,230136,39.0000,F4,S
1307,1307,3.0,1.0,"Murphy, Miss. Katherine ""Kate""",female,,1.0,0.0,367230,15.5000,,Q


# I. Data Understanding
The goal of this section is for participants to understand the quality of the given data. This covers:
1. Data size
2. Statistics of each feature
3. Outliers
4. Correlation
5. Distribution

## I.1
Find:
1. The size of the data (instances and features)
2. The type of each feature
3. The number of unique values of each categorical feature
4. The minimum, maximum, mean, median, and standard deviation of each non-categorical feature
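
The four items above can be sketched on a small stand-in frame. This is only an illustration: the `toy` DataFrame and its values are made up, and the choice of which columns count as categorical is an assumption. It reports unique values per feature with `nunique()`, which skips missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "pclass": [1, 3, 3, 2],
    "sex": ["female", "male", "male", "female"],
    "embarked": ["S", "C", np.nan, "S"],
    "age": [29.0, np.nan, 40.0, 18.0],
    "fare": [211.3, 7.9, 8.05, 13.0],
})
categorical_cols = ["pclass", "sex", "embarked"]
toy = toy.astype({col: "category" for col in categorical_cols})

# 1. Size of the data
n_instances, n_features = toy.shape

# 2. Type of each feature
feature_types = toy.dtypes

# 3. Unique values per categorical feature (NaN is not counted by nunique())
unique_per_feature = toy[categorical_cols].nunique()

# 4. The five requested statistics for the non-categorical features
numeric_stats = toy.select_dtypes(include="number").agg(
    ["min", "max", "mean", "median", "std"]
)
```

Reporting one count per feature, rather than a single total, makes it easier to spot a feature with unexpectedly many categories.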

In [11]:
# I.1 Put your code here

sizeOfInstances = titanic["index"].size
sizeOfFeatures = len(titanic.columns)

# Treat the label and the discrete features as categorical before inspecting dtypes
titanic["survived"] = pd.Categorical(titanic["survived"])
titanic["pclass"] = pd.Categorical(titanic["pclass"])
titanic["sex"] = pd.Categorical(titanic["sex"])
titanic["embarked"] = pd.Categorical(titanic["embarked"])

featuresType = titanic.dtypes

categoricalColumns = titanic.select_dtypes(include='category').columns
# Total over all categorical columns; unique() also counts NaN as a value
# (embarked has 2 missing entries), unlike nunique()
sumOfCategoricalUniqueValue = 0
for col in categoricalColumns :
    sumOfCategoricalUniqueValue += titanic[col].unique().size

titanicExCategory = titanic.select_dtypes(exclude=['category'])
titanicEx = titanicExCategory.loc[:, ~titanicExCategory.columns.isin(['index', 'name', 'ticket', 'cabin'])]
titanicDesc = titanicEx.describe()

print("Size of instances : " + str(sizeOfInstances))
print("Size of features  : " + str(sizeOfFeatures))
print("==========================================================================")
print("\nType of each feature\n")
print(featuresType)
print("==========================================================================")
print("\nTotal number of unique values across the categorical features : " + str(sumOfCategoricalUniqueValue) + "\n")
print("==========================================================================")
print("Minimum, maximum, mean, median, and standard deviation of the non-categorical features :")
print("*nb: 50% --> median")
titanicDesc


Size of instances : 1309
Size of features  : 12

Type of each feature

index          int64
pclass      category
survived    category
name          object
sex         category
age          float64
sibsp        float64
parch        float64
ticket        object
fare         float64
cabin         object
embarked    category
dtype: object

Total number of unique values across the categorical features : 11

Minimum, maximum, mean, median, and standard deviation of the non-categorical features :
*nb: 50% --> median


Unnamed: 0,age,sibsp,parch,fare
count,1046.0,1309.0,1309.0,1308.0
mean,29.881135,0.498854,0.385027,33.295479
std,14.4135,1.041658,0.86556,51.758668
min,0.1667,0.0,0.0,0.0
25%,21.0,0.0,0.0,7.8958
50%,28.0,0.0,0.0,14.4542
75%,39.0,1.0,0.0,31.275
max,80.0,8.0,9.0,512.3292


## I.2
Find:
1. The missing values of each feature
2. The outliers of each feature (use any method you know)
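
One common choice for item 2 is the interquartile range (IQR) rule: a value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch, assuming a made-up `fares` series (the numbers are illustrative, not taken from the dataset as a whole):

```python
import pandas as pd

# Hypothetical fares; 263.0 is deliberately far from the rest.
fares = pd.Series([7.65, 7.88, 263.0, 17.4, 7.85, 8.05, 15.5], name="fare")

q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# True where the value falls outside the [lower, upper] fence
is_outlier = (fares < lower) | (fares > upper)
outliers = fares[is_outlier]
```

The rule only makes sense for numeric features; identifiers and free-text columns such as names or ticket numbers are best left out.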

In [12]:
# I.2 Put your code here

missingValues = titanic.isna().sum()

print("Missing values from each feature\n")
print(missingValues)
print("==========================================================================")

import numpy as np
# titanicExCategory[(np.abs(titanicExCategory-titanicExCategory.mean()) <= (3*titanicExCategory.std()))]

# IQR rule on the numeric features only: a value is an outlier if it lies
# more than 1.5 * IQR below Q1 or above Q3
cols = titanicEx.columns

Q1 = titanic[cols].quantile(0.25)
Q3 = titanic[cols].quantile(0.75)
IQR = Q3 - Q1

outlierMask = ((titanic[cols] < (Q1 - 1.5 * IQR)) | (titanic[cols] > (Q3 + 1.5 * IQR))).any(axis=1)
titanicOutliers = titanic[outlierMask]       # rows with at least one outlying value
titanicNoOutliers = titanic[~outlierMask]    # remaining rows, displayed as the cell output
titanicNoOutliers


Missing values from each feature

index          0
pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64




Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,0,3.0,1.0,"Abelseth, Miss. Karen Marie",female,16.0,0.0,0.0,348125,7.6500,,S
1,1,3.0,0.0,"Burns, Miss. Mary Delia",female,18.0,0.0,0.0,330963,7.8792,,Q
3,3,3.0,1.0,"de Messemaeker, Mrs. Guillaume Joseph (Emma)",female,36.0,1.0,0.0,345572,17.4000,,S
4,4,3.0,0.0,"Jonsson, Mr. Nils Hilding",male,27.0,0.0,0.0,350408,7.8542,,S
5,5,1.0,1.0,"Chambers, Mr. Norman Campbell",male,27.0,1.0,0.0,113806,53.1000,E8,S
...,...,...,...,...,...,...,...,...,...,...,...,...
1300,1300,2.0,0.0,"Harris, Mr. Walter",male,30.0,0.0,0.0,W/C 14208,10.5000,,S
1301,1301,3.0,0.0,"Gronnestad, Mr. Daniel Danielsen",male,32.0,0.0,0.0,8471,8.3625,,S
1302,1302,3.0,1.0,"Madsen, Mr. Fridtjof Arne",male,24.0,0.0,0.0,C 17369,7.1417,,S
1304,1304,3.0,1.0,"Dahl, Mr. Karl Edwart",male,45.0,0.0,0.0,7598,8.0500,,S


## I.3
Find:
1. The correlation between features
2. Visualize the distribution of each feature (categorical and continuous)
3. Visualize the distribution of each feature, with the data split by each unique value of the survived feature
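
Item 3 asks for each distribution to be split by the value of `survived`. A minimal sketch of the underlying tables, assuming a made-up `toy` frame: a cross-tabulation for a categorical feature and a grouped summary for a continuous one.

```python
import pandas as pd

# Hypothetical toy frame (values are made up for illustration).
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0, 0, 1],
    "sex": ["female", "male", "female", "male", "male", "male"],
    "age": [16.0, 27.0, 24.0, 45.0, 18.0, 30.0],
})

# Categorical feature: count each category separately per class of `survived`
sex_by_survival = pd.crosstab(toy["sex"], toy["survived"])

# Continuous feature: summarise the distribution within each class
age_by_survival = toy.groupby("survived")["age"].describe()

# With matplotlib available, the same split can be drawn directly, e.g.
# sex_by_survival.plot(kind="bar") or toy["age"].hist(by=toy["survived"])
```

Plotting from these tables keeps the two classes side by side, which makes differences between survivors and non-survivors easier to read than two separate figures.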

In [None]:
# I.3 Put your code here
# Pearson correlation is only defined for numeric columns
correlation = titanic.select_dtypes(include='number').corr(method='pearson')

print("Correlation between features\n")
print(correlation)
print("==========================================================================")

titanicCategory = titanic.select_dtypes(include=['category'])
print("\nDistribution of categorical features\n")
for col in titanicCategory:
    print("Distribution of " + col)
    titanic[col].value_counts().plot(kind='bar')
    plt.show()
print("==========================================================================")

print("\nDistribution of continuous features\n")
for col in titanicEx.columns:
    print("Distribution of " + col)
    titanic.hist(bins=25, column=col)
    plt.show()


## I.4
Analyze the data further if needed, then do the following:
1. Add features where possible
2. Drop features that you consider unnecessary
3. Handle missing values
4. Transform categorical data into numerical data (encoding), using any method you like
5. Apply scaling with MinMaxScaler
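
Steps 3 to 5 can also be chained in a single scikit-learn preprocessing object. The sketch below is one possible arrangement, not the method used in this notebook: the `toy` frame is made up, and median imputation, most-frequent imputation, and one-hot encoding are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame with one missing value per kind of column.
toy = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "embarked": ["S", "C", np.nan, "S"],
    "age": [16.0, np.nan, 40.0, 28.0],
    "fare": [7.65, 108.9, 8.05, 39.0],
})

# Numeric columns: fill gaps with the median, then scale to [0, 1]
numeric = make_pipeline(SimpleImputer(strategy="median"), MinMaxScaler())
# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "fare"]),
        ("cat", categorical, ["sex", "embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = preprocess.fit_transform(toy)
```

Because the transformer is fitted like any other estimator, it can be fitted on the training split alone and then applied to the validation split, so that imputation values and scaling ranges are not influenced by validation rows.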

In [34]:
# I.4 Put your code here

# Add feature if needed
# newColData = ['data1', 'data2', 'data3', 'data4', '.....']
# df["newCol"] = newColData

# Remove feature
# inplace=True --> column will be deleted from original data
# df.drop('columnToBeRemoved', inplace=True, axis=1)

# Drop column "cabin": it has too many missing values (1014 of 1309)
titanicDropped = titanic.drop('cabin', inplace=False, axis=1)

# Handle missing values: backward fill first, then forward fill whatever is
# still missing at the end of the frame
titanicDropped = titanicDropped.bfill().ffill()
# to check missing values: titanicDropped.isna().sum()

# Encode categorical data as numbers.
# Note: apply() runs LabelEncoder on every column, so numeric columns such as
# age and fare are replaced by the rank of their distinct values as well
encode_titanic= preprocessing.LabelEncoder()
titanicDropped = titanicDropped.apply(encode_titanic.fit_transform)
titanicDropped

# Scale with MinMaxScaler
scaler = MinMaxScaler()
titanicDropped[titanicDropped.columns] = scaler.fit_transform(titanicDropped[titanicDropped.columns])
titanicDropped


Unnamed: 0,index,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked
0,0.000000,1.0,1.0,0.003063,0.0,0.247423,0.000000,0.000000,0.520474,0.103571,1.0
1,0.000765,1.0,0.0,0.131700,0.0,0.268041,0.000000,0.000000,0.415948,0.167857,0.5
2,0.001529,0.0,1.0,0.307044,0.0,0.371134,0.500000,0.285714,0.132543,0.996429,1.0
3,0.002294,1.0,1.0,0.993874,0.0,0.556701,0.166667,0.000000,0.448276,0.464286,1.0
4,0.003058,1.0,0.0,0.469372,1.0,0.422680,0.000000,0.000000,0.612069,0.160714,1.0
...,...,...,...,...,...,...,...,...,...,...,...
1304,0.996942,1.0,1.0,0.221286,1.0,0.680412,0.000000,0.000000,0.753233,0.185714,1.0
1305,0.997706,0.0,0.0,0.723583,1.0,0.268041,0.166667,0.000000,0.898707,0.932143,0.0
1306,0.998471,0.5,1.0,0.082695,0.0,0.195876,0.333333,0.142857,0.157328,0.707143,1.0
1307,0.999235,1.0,1.0,0.637825,0.0,0.195876,0.166667,0.000000,0.659483,0.428571,0.5


# II. Experiments Design
The goal of this section is for participants to understand how to run experiments properly when searching for the best method. This covers:
1. Building the model
2. The validation process
3. Hyperparameter tuning

## II.1
Decide which metrics will be used in this experiment (more than one metric is allowed)

Put your answer for section II.1 here
### metrics.accuracy_score

## II.2 
Split the data with a ratio of 0.8 for training data and 0.2 for validation data

In [81]:
# II.2 Put your code here
feature_names = ['index', 'pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'embarked']
X_train, X_test, y_train, y_test = train_test_split(
    titanicDropped[feature_names], 
    titanicDropped['survived'], 
    test_size=0.2,
    random_state=0
)
X_train

Unnamed: 0,index,pclass,name,sex,age,sibsp,parch,ticket,fare,embarked
1118,0.854740,1.0,0.562021,1.0,0.432990,0.0,0.000000,0.829741,0.089286,1.0
44,0.033639,1.0,0.986217,1.0,0.371134,0.0,0.000000,0.259698,0.064286,0.0
1072,0.819572,1.0,0.974732,1.0,0.443299,0.0,0.000000,0.730603,0.457143,1.0
1130,0.863914,0.5,0.739663,1.0,0.659794,0.0,0.142857,0.907328,0.507143,1.0
574,0.438838,1.0,0.294028,1.0,0.268041,0.0,0.000000,0.593750,0.142857,1.0
...,...,...,...,...,...,...,...,...,...,...
763,0.583333,0.0,0.524502,0.0,0.731959,0.0,0.000000,0.119612,0.564286,1.0
835,0.638379,0.0,0.139357,1.0,0.690722,0.0,0.000000,0.066810,0.642857,1.0
1216,0.929664,1.0,0.707504,0.0,0.639175,0.0,0.714286,0.360991,0.717857,1.0
559,0.427370,1.0,0.173047,1.0,0.319588,0.0,0.000000,0.788793,0.117857,0.5


## II.3
Do the following:
1. Make predictions using a Logistic Regression model as the *baseline*
2. Show the evaluation of the model using the metrics you chose in II.1
3. Show the confusion matrix

In [76]:
# II.3 Put your code here
# Baseline: logistic regression trained on the training split
clf_lr = LogisticRegression(random_state=0, max_iter=5000).fit(X_train, y_train)

# Evaluate on the validation split
r_predict = clf_lr.predict(X_test)
acc = accuracy_score(y_test, r_predict)
print("Accuracy Score for Titanic Dataset: ", acc)
print("==========================================================================")

print("Confusion Matrix for Titanic Dataset")
print(metrics.confusion_matrix(y_test, r_predict))

Accuracy Score for Titanic Dataset:  0.7900763358778626
Confusion Matrix for Titanic Dataset
[[144  24]
 [ 31  63]]
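
The confusion matrix above carries more information than the accuracy alone. As a worked example, the arithmetic below derives accuracy, precision, recall, and F1 for the positive class (survived = 1) from the four reported counts, assuming scikit-learn's layout of true classes in rows and predicted classes in columns.

```python
# Counts taken from the confusion matrix reported above
# (rows: true class, columns: predicted class).
tn, fp = 144, 24
fn, tp = 31, 63

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted survivors, how many survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.7901 precision=0.7241 recall=0.6702 f1=0.6961
```

The recall of about 0.67 shows that roughly a third of the actual survivors are missed, which the accuracy of about 0.79 does not reveal on its own.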


## II.4 
Do the following:
1. Train another model
2. Tune the hyperparameters of the model you use with Grid Search (pay attention to the random factor in some model algorithms)
3. Validate using cross validation
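
A minimal sketch of how the three steps can fit together, assuming synthetic data from `make_classification` in place of the Titanic features. The grid search runs its cross validation on the training split only, every random factor is pinned with `random_state`, and the validation split is used once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption for illustration only)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fixed fold assignment so repeated runs give the same scores
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    SVC(random_state=0),
    param_grid={"kernel": ["linear", "rbf"], "C": [1, 10]},
    scoring="accuracy",
    cv=cv,
)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_           # mean cross-validated accuracy of the best setting
val_accuracy = search.score(X_val, y_val)  # accuracy on the untouched validation split
```

Comparing `cv_accuracy` with `val_accuracy` gives a quick check on whether the tuned model generalizes beyond the folds it was selected on.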


In [77]:
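# II.4 Put your code here

# Grid search over the SVM kernel and the regularisation strength C
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf_svm = GridSearchCV(svc, parameters)

clf_svm.fit(X_train, y_train)

r_predict = clf_svm.predict(X_test)
acc = accuracy_score(y_test, r_predict)
print("Accuracy Score for Titanic Dataset: ", acc)

# 10-fold cross validation of the whole grid search.
# Note: this runs on the validation split (X_test, y_test), which is what the
# recorded score reflects; cross-validating on X_train, y_train instead
# would keep the validation split untouched for the final check
cv = KFold(n_splits=10, random_state=1, shuffle=True)
scores = cross_val_score(clf_svm, X_test, y_test, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy for Grid Search Cross Validation (std): %.3f (%.3f)' % (np.mean(scores), np.std(scores)))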


Accuracy Score for Titanic Dataset:  0.7938931297709924
Accuracy for Grid Search Cross Validation (std): 0.778 (0.081)


# III. Improvement
There are several methods for improving performance, for example:
1. Oversampling / undersampling the data
2. Combining several models

In this section, you are expected to:
1. Train on oversampled / undersampled data and validate correctly
2. Understand several methods for combining models

## III.1
Do the following:
1. Oversample the minority class in the training data, train the *baseline* model (II.3) on it, and validate on the validation data. The training and validation data are the ones you split in section II.2
2. Undersample the majority class in the training data, train the *baseline* model (II.3) on it, and validate on the validation data. The training and validation data are the ones you split in section II.2
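
To show what random oversampling does to the training split, here is a minimal sketch that uses `sklearn.utils.resample` as a stand-in for imbalanced-learn's `RandomOverSampler`. The arrays are made up, and the names `X_tr` and `y_tr` are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced training split: 7 samples of class 0, 3 of class 1
X_tr = np.arange(20).reshape(10, 2)
y_tr = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

minority = y_tr == 1
n_majority = int((~minority).sum())

# Draw minority rows with replacement until both classes have the same size
X_min_up, y_min_up = resample(
    X_tr[minority], y_tr[minority],
    replace=True, n_samples=n_majority, random_state=0,
)
X_balanced = np.vstack([X_tr[~minority], X_min_up])
y_balanced = np.concatenate([y_tr[~minority], y_min_up])
```

Only the training rows are resampled; the validation split keeps its original class ratio so that the reported score reflects the real distribution. Fixing `random_state` (which `RandomOverSampler` and `RandomUnderSampler` also accept) makes the result repeatable.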

In [93]:
# III.1 Put your code here
# To use RandomOverSampler, install imbalanced-learn with the command below
# pip install imbalanced-learn

# Oversample the minority class of the training split only
# (no random_state is set, so the resampled rows differ between runs)
oversample = RandomOverSampler(sampling_strategy='minority')
X_res, y_res = oversample.fit_resample(X_train, y_train)

clf_lr = LogisticRegression(random_state=0, max_iter=5000).fit(X_res, y_res)

r_predict = clf_lr.predict(X_test)
acc = accuracy_score(y_test, r_predict)
print("Accuracy Score for Titanic Dataset (oversampling) : ", acc)

# Undersample the majority class of the training split only
undersample = RandomUnderSampler(sampling_strategy='majority')
X_res, y_res = undersample.fit_resample(X_train, y_train)

clf_lr = LogisticRegression(random_state=0, max_iter=5000).fit(X_res, y_res)

r_predict = clf_lr.predict(X_test)
acc = accuracy_score(y_test, r_predict)
print("Accuracy Score for Titanic Dataset (undersampling) : ", acc)



Accuracy Score for Titanic Dataset (oversampling) :  0.7824427480916031
Accuracy Score for Titanic Dataset (undersampling) :  0.7786259541984732


## III.2
Do the following:
1. Explore soft voting, hard voting, and stacking
2. Build Logistic Regression and SVM models (you may use models with several different parameters)
3. Apply soft voting to the models you built in point 2
4. Apply hard voting to the models you built in point 2
5. Apply stacking, with Logistic Regression as the final classifier, to the models you built in point 2
6. Validate points 3, 4, and 5 with the metrics you chose
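
Soft and hard voting can disagree on the same inputs. The arithmetic sketch below uses three hypothetical classifiers and made-up probabilities for a single passenger: two models lean weakly towards class 0 while one is very confident in class 1.

```python
import numpy as np

# Hypothetical P(survived = 1) from three classifiers for a single passenger
proba_class1 = np.array([0.45, 0.45, 0.95])

# Hard voting: each model casts one vote for its own predicted label
votes = (proba_class1 >= 0.5).astype(int)
hard_vote = int(np.bincount(votes, minlength=2).argmax())

# Soft voting: average the probabilities, then threshold the average
soft_vote = int(proba_class1.mean() >= 0.5)
```

Here hard voting returns 0 (two votes against one), while soft voting returns 1 because the average probability is about 0.62. Stacking goes a step further: instead of a fixed rule, a final estimator is trained on the base models' predictions to learn how to combine them.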

Put your answer for section III.2 point 1 here

In [136]:
# III.2 Put your code here

models = [
    ('lr', LogisticRegression(random_state=0, max_iter=5000)),
    ('svm', make_pipeline(StandardScaler(), svm.SVC(probability=True))),
]

ensemble = VotingClassifier(estimators=models, voting='soft')
ensemble.fit(X_train, y_train)

r_predict = ensemble.predict(X_test)
acc = accuracy_score(y_test, r_predict)
print("Accuracy Score for Titanic Dataset (Soft Voting) : ", acc)


ensemble = VotingClassifier(estimators=models, voting='hard')
ensemble.fit(X_train, y_train)


r_predict = ensemble.predict(X_test)
acc = accuracy_score(y_test, r_predict)
print("Accuracy Score for Titanic Dataset (Hard Voting) : ", acc)



clf = StackingClassifier(
    estimators=models, final_estimator=LogisticRegression(random_state=0, max_iter=5000)
)

acc = clf.fit(X_train, y_train).score(X_test, y_test)
print("Accuracy Score for Titanic Dataset (Stacking)    : ", acc)

Accuracy Score for Titanic Dataset (Soft Voting) :  0.7900763358778626
Accuracy Score for Titanic Dataset (Hard Voting) :  0.7977099236641222
Accuracy Score for Titanic Dataset (Stacking)    :  0.7900763358778626


# IV. Analysis
Compare the results of:
1. The baseline model (II.3)
2. The other model (II.4)
3. Undersampling
4. Oversampling
5. Soft voting
6. Hard voting
7. Stacking

Put your answer for section IV here