![alt text](https://www.dropbox.com/s/5hbbkuxz6xp3aov/logo_rshpc.jpeg?raw=1 "Logo LIPI")

# Workshop HPC Tools for Machine Learning
---

## Python for Data Analytics

<div class="alert alert-block alert-success">
The dataset used is the __German Credit Risk Data Set__ (https://www.openml.org/d/31). The format has been converted from ARFF to CSV, and some values were deliberately removed so that the workshop can practice __pre-processing__.
</div>

Author: Dr. Hans Hofmann. Source: [UCI](https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)), 1994

German Credit data 

This dataset classifies people described by a set of attributes as good or bad credit risks. 

This dataset comes with a cost matrix: 
```
      Good  Bad  (predicted)
Good   0     1   (actual)
Bad    5     0
```

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1). 
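
The cost matrix can be applied to a confusion matrix to get a total misclassification cost. The sketch below is only an illustration: the counts are made up (they are not results from this workshop), and the names `cost_matrix` and `total_cost` are hypothetical.

```python
import numpy as np

# Rows are the actual class, columns the predicted class, in the order [good, bad].
cost_matrix = np.array([[0, 1],
                        [5, 0]])

def total_cost(confusion):
    """Sum the per-cell counts weighted by the cost of each kind of error."""
    return int((np.asarray(confusion) * cost_matrix).sum())

# Illustrative counts only: 10 good customers predicted bad, 4 bad customers predicted good.
example_confusion = [[130, 10],
                     [4, 56]]
print(total_cost(example_confusion))  # 10 * 1 + 4 * 5 = 30
```

Under this weighting, a model with slightly lower accuracy can still be preferable if it makes fewer of the expensive "bad predicted as good" errors.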

### Attribute description 

1. Status of existing checking account, in Deutsche Mark. 
2. Duration in months 
3. Credit history (credits taken, paid back duly, delays, critical accounts) 
4. Purpose of the credit (car, television,...) 
5. Credit amount 
6. Status of savings account/bonds, in Deutsche Mark. 
7. Present employment, in number of years. 
8. Installment rate in percentage of disposable income 
9. Personal status (married, single,...) and sex 
10. Other debtors / guarantors 
11. Present residence since X years 
12. Property (e.g. real estate) 
13. Age in years 
14. Other installment plans (banks, stores) 
15. Housing (rent, own,...) 
16. Number of existing credits at this bank 
17. Job 
18. Number of people being liable to provide maintenance for 
19. Telephone (yes,no) 
20. Foreign worker (yes,no)

In [1]:
import os.path
import urllib.request
import numpy as np
import pandas as pd

df = pd.read_csv('dataset.csv')

In [2]:
# Show the first 5 rows
df.head()

Unnamed: 0,checking_status,duration,credit_history,purpose,credit_amount,savings_status,employment,installment_commitment,personal_status,other_parties,...,property_magnitude,age,other_payment_plans,housing,existing_credits,job,num_dependents,own_telephone,foreign_worker,class
0,<0,6,'critical/other existing credit',radio/tv,1169.0,'no known savings',>=7,4,'male single',none,...,'real estate',67,none,own,2,skilled,1,yes,yes,good
1,0<=X<200,48,'existing paid',radio/tv,5951.0,<100,1<=X<4,2,'female div/dep/mar',none,...,'real estate',22,none,own,1,skilled,1,none,yes,bad
2,'no checking',12,'critical/other existing credit',education,2096.0,<100,4<=X<7,2,'male single',none,...,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
3,<0,42,'existing paid',furniture/equipment,7882.0,<100,4<=X<7,2,'male single',guarantor,...,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
4,<0,24,'delayed previously','new car',4870.0,<100,1<=X<4,3,'male single',none,...,'no known property',53,none,'for free',2,skilled,2,none,yes,bad


In [3]:
# Show a summary of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
checking_status           1000 non-null object
duration                  1000 non-null int64
credit_history            978 non-null object
purpose                   1000 non-null object
credit_amount             976 non-null float64
savings_status            1000 non-null object
employment                1000 non-null object
installment_commitment    1000 non-null int64
personal_status           1000 non-null object
other_parties             1000 non-null object
residence_since           1000 non-null int64
property_magnitude        981 non-null object
age                       1000 non-null int64
other_payment_plans       1000 non-null object
housing                   1000 non-null object
existing_credits          1000 non-null int64
job                       1000 non-null object
num_dependents            1000 non-null int64
own_telephone             966 non-null object
foreign_wo

The output above shows that several columns have fewer than 1,000 non-null values (credit_history, credit_amount, property_magnitude, and own_telephone), which means they contain missing data.
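
Missing values can also be counted directly instead of being read off the `df.info()` listing. A minimal sketch on a made-up frame (the frame `toy` and its values are assumptions for illustration, not the workshop data):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the workshop dataset.
toy = pd.DataFrame({
    "credit_history": ["'existing paid'", np.nan, "'all paid'"],
    "credit_amount": [1169.0, np.nan, np.nan],
    "age": [67, 22, 49],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()

# Show only the columns that actually have gaps.
print(missing_per_column[missing_per_column > 0])
```

The same two calls applied to `df` would list the affected columns with their number of missing entries.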

In [4]:
df['class'].value_counts()

good    700
bad     300
Name: class, dtype: int64

##### dataframe.describe() 
---
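`df.describe()` generates summary statistics for the data; by default it covers only the numeric columns.

`df.describe(include=['object'])` covers the object (string) columns instead. The alias `np.object` was removed in NumPy 1.24, so the string `'object'` is used here.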

[resource](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe)


In [5]:
df.describe()

Unnamed: 0,duration,credit_amount,installment_commitment,residence_since,age,existing_credits,num_dependents
count,1000.0,976.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,20.903,3255.97541,2.973,2.845,35.546,1.407,1.155
std,12.058814,2807.759457,1.118715,1.103718,11.375469,0.577654,0.362086
min,4.0,250.0,1.0,1.0,19.0,1.0,1.0
25%,12.0,1364.0,2.0,2.0,27.0,1.0,1.0
50%,18.0,2319.5,3.0,3.0,33.0,1.0,1.0
75%,24.0,3965.25,4.0,4.0,42.0,2.0,1.0
max,72.0,18424.0,4.0,4.0,75.0,4.0,2.0


In [6]:
df.describe(include=['object'])

Unnamed: 0,checking_status,credit_history,purpose,savings_status,employment,personal_status,other_parties,property_magnitude,other_payment_plans,housing,job,own_telephone,foreign_worker,class
count,1000,978,1000,1000,1000,1000,1000,981,1000,1000,1000,966,1000,1000
unique,4,5,10,5,5,4,3,4,3,3,4,2,2,2
top,'no checking','existing paid',radio/tv,<100,1<=X<4,'male single',none,car,none,own,skilled,none,yes,good
freq,394,523,280,603,339,548,907,322,814,713,630,574,963,700


In [7]:
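# Correlation between the numeric variables
# (select the numeric columns explicitly; recent pandas versions do not skip object columns automatically)
df.select_dtypes(include='number').corr()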

Unnamed: 0,duration,credit_amount,installment_commitment,residence_since,age,existing_credits,num_dependents
duration,1.0,0.624826,0.074749,0.034067,-0.036136,-0.011284,-0.023834
credit_amount,0.624826,1.0,-0.271588,0.025174,0.027152,0.023505,0.014752
installment_commitment,0.074749,-0.271588,1.0,0.049302,0.058266,0.021669,-0.071207
residence_since,0.034067,0.025174,0.049302,1.0,0.266419,0.089625,0.042643
age,-0.036136,0.027152,0.058266,0.266419,1.0,0.149254,0.118201
existing_credits,-0.011284,0.023505,0.021669,0.089625,0.149254,1.0,0.109667
num_dependents,-0.023834,0.014752,-0.071207,0.042643,0.118201,0.109667,1.0


## 1. Handling Missing Data

In [8]:
# Check the unique values of credit_history
df['credit_history'].unique()

array(["'critical/other existing credit'", "'existing paid'",
       "'delayed previously'", "'no credits/all paid'", "'all paid'", nan], dtype=object)

In [9]:
# One way to handle missing data is to fill the gaps with the most frequent value
df['credit_history'].value_counts()

'existing paid'                     523
'critical/other existing credit'    280
'delayed previously'                 86
'all paid'                           49
'no credits/all paid'                40
Name: credit_history, dtype: int64

In [10]:
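# Work on the data under the name df_missing.
# Note: plain assignment does not copy; df_missing and df refer to the same DataFrame
# (use df.copy() if an independent copy is needed).
df_missing = df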
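
Plain assignment does not copy a DataFrame; it only binds another name to the same object. A minimal sketch on a made-up frame shows the difference from `copy()` (the names `original`, `alias` and `independent` are illustrative):

```python
import pandas as pd

original = pd.DataFrame({"x": [1.0, None]})

alias = original               # same object, two names
independent = original.copy()  # separate object with its own data

# Filling the gap in the copy leaves the original untouched.
independent["x"] = independent["x"].fillna(0.0)

print(alias is original)            # True
print(original["x"].isna().sum())   # 1: the original still has its missing value
```

Whether the original should stay untouched is a design choice; keeping an unmodified copy makes it easier to compare the data before and after imputation.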

In [11]:
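# About 52% of the rows in credit_history are 'existing paid', the most frequent value,
# so the missing values are filled with the string "existing_paid".
# Note: this string is not identical to the existing label "'existing paid'",
# so it appears as a separate category in the unique values shown in In [20].
df_missing["credit_history"] = df_missing["credit_history"].fillna("existing_paid")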

In [12]:
# For credit_amount, fill the missing values with the column median.
# The result is assigned back to the column instead of relying on inplace=True on a selected column.
df_missing["credit_amount"] = df_missing["credit_amount"].fillna(df_missing["credit_amount"].median())

In [13]:
# For property_magnitude, use forward-fill to propagate the previous value into each gap
df_missing['property_magnitude'] = df_missing['property_magnitude'].ffill()

In [14]:
# For own_telephone, use backward-fill to propagate the next value into each gap
df_missing['own_telephone'] = df_missing['own_telephone'].bfill()
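
Forward-fill and backward-fill differ only in which neighbour supplies the value. A small sketch on a made-up Series (the values are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(["car", np.nan, "real estate", np.nan])

forward = s.ffill()   # each gap takes the value before it
backward = s.bfill()  # each gap takes the value after it

print(forward.tolist())   # ['car', 'car', 'real estate', 'real estate']
print(backward.tolist())  # the last gap stays missing: nothing follows it
```

A gap at the very start (for forward-fill) or at the very end (for backward-fill) is left unfilled, so it is worth re-checking the missing counts afterwards.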

In [15]:
df_missing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
checking_status           1000 non-null object
duration                  1000 non-null int64
credit_history            1000 non-null object
purpose                   1000 non-null object
credit_amount             1000 non-null float64
savings_status            1000 non-null object
employment                1000 non-null object
installment_commitment    1000 non-null int64
personal_status           1000 non-null object
other_parties             1000 non-null object
residence_since           1000 non-null int64
property_magnitude        1000 non-null object
age                       1000 non-null int64
other_payment_plans       1000 non-null object
housing                   1000 non-null object
existing_credits          1000 non-null int64
job                       1000 non-null object
num_dependents            1000 non-null int64
own_telephone             1000 non-null object
foreig

## 2. Transforming Data to Numeric

To convert object columns to numeric, the options are to turn them into categorical data or into dummy variables. Here we use dummy variables. For an explanation, see [this discussion](https://stats.stackexchange.com/questions/115049/why-do-we-need-to-dummy-code-categorical-variables)

In [16]:
df_transform = df_missing

## 2.a Dummy Variable

In [17]:
checking_status_dummy  = pd.get_dummies(df_transform['checking_status'], prefix='checking_status')

In [18]:
checking_status_dummy.head()

Unnamed: 0,checking_status_'no checking',checking_status_0<=X<200,checking_status_<0,checking_status_>=200
0,0,0,1,0
1,0,1,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0


In [19]:
# One of the dummy variables can be dropped: when all the remaining dummies are 0,
# the dropped category is the one that applies
checking_status_dummy.drop(['checking_status_>=200'], axis=1, inplace=True)
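
As an alternative to dropping a column by hand, `pd.get_dummies` accepts `drop_first=True`, which removes the first category in sorted order. The sketch below uses a made-up Series; note that the dropped reference category is then chosen by sort order, not by the user.

```python
import pandas as pd

status = pd.Series(["<0", "0<=X<200", ">=200", "<0"], name="checking_status")

# drop_first=True removes the first category (in sorted order) as the reference level.
dummies = pd.get_dummies(status, prefix="checking_status", drop_first=True)

print(list(dummies.columns))
# A row whose remaining dummies are all 0 belongs to the dropped category.
print(int(dummies.iloc[1].sum()))  # 0: row 1 is '0<=X<200', the dropped level
```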

<div class="alert alert-block alert-info">
Apply the same steps to credit_history, purpose, savings_status, employment, personal_status, other_parties, property_magnitude, other_payment_plans, housing, job, own_telephone, foreign_worker</div>

In [20]:
#credit_history
df_transform['credit_history'].unique()

array(["'critical/other existing credit'", "'existing paid'",
       "'delayed previously'", "'no credits/all paid'", "'all paid'",
       'existing_paid'], dtype=object)

In [21]:
credit_history_dummy  = pd.get_dummies(df_transform['credit_history'], prefix='credit_history')
credit_history_dummy.drop(['credit_history_existing_paid'], axis=1, inplace=True)

In [22]:
# purpose
df_transform['purpose'].unique()

array(['radio/tv', 'education', 'furniture/equipment', "'new car'",
       "'used car'", 'business', "'domestic appliance'", 'repairs',
       'other', 'retraining'], dtype=object)

In [23]:
purpose_dummy  = pd.get_dummies(df_transform['purpose'], prefix='purpose')
purpose_dummy.drop(['purpose_retraining'], axis=1, inplace=True)

In [24]:
#savings_status
df_transform['savings_status'].unique()

array(["'no known savings'", '<100', '500<=X<1000', '>=1000', '100<=X<500'], dtype=object)

In [25]:
savings_status_dummy  = pd.get_dummies(df_transform['savings_status'], prefix='savings_status')
savings_status_dummy.drop(['savings_status_100<=X<500'], axis=1, inplace=True)

In [26]:
# employment
df_transform['employment'].unique()

array(['>=7', '1<=X<4', '4<=X<7', 'unemployed', '<1'], dtype=object)

In [27]:
employment_dummy  = pd.get_dummies(df_transform['employment'], prefix='employment')
employment_dummy.drop(['employment_<1'], axis=1, inplace=True)

In [28]:
#personal_status
df_transform['personal_status'].unique()

array(["'male single'", "'female div/dep/mar'", "'male div/sep'",
       "'male mar/wid'"], dtype=object)

In [29]:
personal_status_dummy  = pd.get_dummies(df_transform['personal_status'], prefix='personal_status')
personal_status_dummy.drop(["personal_status_'male mar/wid'"], axis=1, inplace=True)

In [30]:
#other_parties
df_transform['other_parties'].unique()

array(['none', 'guarantor', "'co applicant'"], dtype=object)

In [31]:
other_parties_dummy  = pd.get_dummies(df_transform['other_parties'], prefix='other_parties')
other_parties_dummy.drop(["other_parties_none"], axis=1, inplace=True)

In [32]:
#property_magnitude
df_transform['property_magnitude'].unique()

array(["'real estate'", "'life insurance'", "'no known property'", 'car'], dtype=object)

In [33]:
property_magnitude_dummy  = pd.get_dummies(df_transform['property_magnitude'], prefix='property_magnitude')
property_magnitude_dummy.drop(["property_magnitude_car"], axis=1, inplace=True)

In [34]:
#other_payment_plans
df_transform['other_payment_plans'].unique()

array(['none', 'bank', 'stores'], dtype=object)

In [35]:
other_payment_plans_dummy  = pd.get_dummies(df_transform['other_payment_plans'], prefix='other_payment_plans')
other_payment_plans_dummy.drop(["other_payment_plans_none"], axis=1, inplace=True)

In [36]:
#housing
df_transform['housing'].unique()

array(['own', "'for free'", 'rent'], dtype=object)

In [37]:
housing_dummy  = pd.get_dummies(df_transform['housing'], prefix='housing')
housing_dummy.drop(["housing_rent"], axis=1, inplace=True)

In [38]:
#job, own_telephone, foreign_worker
df_transform['job'].unique()

array(['skilled', "'unskilled resident'", "'high qualif/self emp/mgmt'",
       "'unemp/unskilled non res'"], dtype=object)

In [39]:
job_dummy  = pd.get_dummies(df_transform['job'], prefix='job')
job_dummy.drop(["job_skilled"], axis=1, inplace=True)

## 2.b Mapping

In [40]:
#own_telephone
df_transform['own_telephone'].unique()

array(['yes', 'none'], dtype=object)

The output above shows that the variable has only two values: _yes_ and _none_. In this case we can use either a dummy variable or a mapping. Both produce the same output, but mapping takes less code.

In [41]:
phone_map = {'yes' : 1, 'none' : 0}
df_transform['own_telephone'] = df_transform['own_telephone'].map(phone_map)
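
One thing to watch with `map`: any value that is not a key in the dictionary becomes NaN rather than raising an error. A minimal sketch, where the value `'unknown'` is a made-up example that does not occur in the dataset:

```python
import pandas as pd

phone = pd.Series(["yes", "none", "unknown"])
phone_map = {"yes": 1, "none": 0}

mapped = phone.map(phone_map)

# Values missing from the dictionary are silently turned into NaN.
print(mapped.isna().sum())  # 1: 'unknown' is not in the dictionary
```

Checking `unique()` before mapping, as done above, is what guarantees that the dictionary covers every value.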

In [42]:
# foreign_worker
df_transform['foreign_worker'].unique()

array(['yes', 'no'], dtype=object)

In [43]:
df_transform['foreign_worker'].value_counts()

yes    963
no      37
Name: foreign_worker, dtype: int64

In [44]:
worker_map = {'yes' : 1, 'no' : 0}
df_transform['foreign_worker'] = df_transform['foreign_worker'].map(worker_map)

## 2.c Join dummy variable 

Add the dummy variables to the dataframe and drop their parent columns.

In [45]:
#df_transform.info()

In [46]:
dummy_list = list(filter(lambda x: '_dummy' in x, dir()))

print(dummy_list)

for dummy in dummy_list:
    df_dummy = eval(dummy)
    #parent_dummy = eval(dummy.replace('_dummy',''))
    df_transform = df_transform.join(df_dummy)
    df_transform = df_transform.drop([dummy.replace('_dummy','')], axis=1)

['checking_status_dummy', 'credit_history_dummy', 'employment_dummy', 'housing_dummy', 'job_dummy', 'other_parties_dummy', 'other_payment_plans_dummy', 'personal_status_dummy', 'property_magnitude_dummy', 'purpose_dummy', 'savings_status_dummy']
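
The loop above relies on `dir()` and `eval` to find the dummy frames by name. A possible alternative, sketched here on a made-up frame, is to pass the categorical column names to `pd.get_dummies` through its `columns` parameter, which encodes them and removes the parent columns in one call. The frame `toy` and the list `categorical_columns` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "duration": [6, 48, 12],
    "housing": ["own", "rent", "own"],
    "job": ["skilled", "skilled", "unskilled"],
})

categorical_columns = ["housing", "job"]  # illustrative subset

# The listed columns are replaced by their dummy columns; other columns are kept as they are.
encoded = pd.get_dummies(toy, columns=categorical_columns)
print(list(encoded.columns))
# ['duration', 'housing_own', 'housing_rent', 'job_skilled', 'job_unskilled']
```

Unlike the manual approach, this keeps every category unless `drop_first=True` is also passed.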


<div class="alert alert-block alert-warning">
  **Class** is the label of this dataset (the output to be predicted). It also needs to be mapped to a numeric value.
</div>

In [47]:
df_transform['class'].unique()

array(['good', 'bad'], dtype=object)

In [48]:
class_map = {'good' : 1, 'bad' : 0}
df_transform['class'] = df_transform['class'].map(class_map)

In [49]:
df_transform.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 50 columns):
duration                                           1000 non-null int64
credit_amount                                      1000 non-null float64
installment_commitment                             1000 non-null int64
residence_since                                    1000 non-null int64
age                                                1000 non-null int64
existing_credits                                   1000 non-null int64
num_dependents                                     1000 non-null int64
own_telephone                                      1000 non-null int64
foreign_worker                                     1000 non-null int64
class                                              1000 non-null int64
checking_status_'no checking'                      1000 non-null uint8
checking_status_0<=X<200                           1000 non-null uint8
checking_status_<0                      

### Additional

- Some fields may be considered unnecessary for the analysis; these fields can be dropped

In [50]:
# Select the columns to drop from df_transform itself; df has no 'purpose_' dummy columns
df_transform.drop(list(df_transform.filter(regex='purpose_')), axis=1, inplace=True)

- If needed, the data can be rescaled to a common range. The min-max scaling used below maps values to [0, 1]; it does not make the data normally distributed

In [51]:
df['credit_amount'].describe()

count     1000.000000
mean      3233.500000
std       2777.531737
min        250.000000
25%       1375.500000
50%       2319.500000
75%       3919.000000
max      18424.000000
Name: credit_amount, dtype: float64

In [52]:
df_learn = df_transform


from sklearn import preprocessing
#scaler = preprocessing.StandardScaler()
min_max_scaler = preprocessing.MinMaxScaler()

#df_learn[['credit_amount']] = scaler.fit_transform(df_learn[['credit_amount']])
df_learn[['credit_amount']] = min_max_scaler.fit_transform(df_learn[['credit_amount']])
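
With its default range, `MinMaxScaler` computes `(x - min) / (max - min)` per column. The sketch below reproduces that formula by hand on three values taken from the `describe()` output above (the minimum, the median and the maximum of credit_amount); the array name `amounts` is illustrative.

```python
import numpy as np

# Minimum, median and maximum of credit_amount from the describe() output.
amounts = np.array([250.0, 2319.5, 18424.0])

# Min-max scaling maps the minimum to 0 and the maximum to 1.
scaled = (amounts - amounts.min()) / (amounts.max() - amounts.min())
print(scaled.round(4))
```

Because the transformation is linear, the median ends up close to 0.11, which shows that the skew of the original distribution is preserved.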


In [53]:
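# df still holds the unscaled values: df_transform was rebound to a new DataFrame by join() in In [46],
# so the scaled column lives in df_learn['credit_amount']
df['credit_amount'].describe()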

count     1000.000000
mean      3233.500000
std       2777.531737
min        250.000000
25%       1375.500000
50%       2319.500000
75%       3919.000000
max      18424.000000
Name: credit_amount, dtype: float64

## 3. Machine Learning

In [54]:
X_Variables = df_learn.drop(['class'], axis=1)
Y_Variable = df_learn['class']

In [55]:
from sklearn.model_selection import train_test_split

X_Train, X_Test, Y_Train, Y_Test = train_test_split(X_Variables, Y_Variable, test_size=0.2)
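
The split above is random, so the accuracies printed later change from run to run. A hedged sketch of two optional `train_test_split` parameters, on made-up data with the same 70/30 class balance: `random_state` makes the split reproducible and `stratify` keeps the class proportions in both parts. The arrays `X` and `y` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with the same 70/30 class balance as the dataset.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 70 + [0] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The 20 test rows contain 14 of class 1 and 6 of class 0.
print(len(y_te), int(y_te.sum()))
```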

In [56]:
from sklearn.metrics import accuracy_score

## 3.1 Logistic Regression

In [57]:
from sklearn.linear_model import LogisticRegression

# Logistic Regression
logreg = LogisticRegression()

#Learn The Model
logreg.fit(X_Train, Y_Train)

# Check training accuracy
print(logreg.score(X_Train, Y_Train))

# Check testing accuracy
Y_pred_log = logreg.predict(X_Test)
print(accuracy_score(Y_Test, Y_pred_log))

0.7925


0.71499999999999997

## 3.2 SVM

In [58]:
from sklearn.svm import SVC, LinearSVC

# Support Vector Machines
svc = SVC()
svc.fit(X_Train, Y_Train)
svc.score(X_Train, Y_Train)
print(svc.score(X_Train, Y_Train))

Y_pred_svc = svc.predict(X_Test)
print(accuracy_score(Y_Test,svc.predict(X_Test)))

0.75875


0.71999999999999997

## 3.3  Random Forest

In [59]:
from sklearn.ensemble import RandomForestClassifier

# Random Forests
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_Train, Y_Train)
random_forest.score(X_Train, Y_Train)
print(random_forest.score(X_Train, Y_Train))

Y_pred_rf = random_forest.predict(X_Test)
print(accuracy_score(Y_Test,random_forest.predict(X_Test)))

1.0


0.745

## 3.4 kNN

In [61]:
from sklearn.neighbors import KNeighborsClassifier

# Fitting k-NN on our scaled data set
knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(X_Train, Y_Train)
knn.score(X_Train, Y_Train)
print(knn.score(X_Train, Y_Train))

Y_pred_knn = knn.predict(X_Test)
print(accuracy_score(Y_Test,knn.predict(X_Test)))

0.7725


0.73499999999999999

## 3.5 Gaussian Naive Bayes

In [62]:
from sklearn.naive_bayes import GaussianNB

# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_Train, Y_Train)
gaussian.score(X_Train, Y_Train)
print(gaussian.score(X_Train, Y_Train))

Y_pred_gaussian = gaussian.predict(X_Test)
print(accuracy_score(Y_Test,gaussian.predict(X_Test)))

0.73625
0.69
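
When reading the test accuracies above, it can help to compare them with a trivial baseline: a classifier that always predicts the most frequent class. The sketch below uses the class counts from `df['class'].value_counts()`; since the split is random, the balance in a particular test set may differ slightly.

```python
# Class counts from df['class'].value_counts()
class_counts = {"good": 700, "bad": 300}

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = max(class_counts.values()) / sum(class_counts.values())
print(baseline_accuracy)  # 0.7
```

Test accuracies in the range 0.69 to 0.745 are therefore only slightly above, or even below, what this baseline would achieve on the full dataset.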
