# **Example Application**


1. The Iris flower classification data serves as a simple case study.
2. Data link: https://archive.ics.uci.edu/ml/datasets/iris
3. Source paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
4. The classification task is to predict the Iris species from the flower's measurements (e.g. the length and width of the sepals and petals).

In [None]:
# Import the Python modules used in this notebook
from google.colab import drive
drive.mount('/content/drive')  # only needed when running on Google Colab
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns

# load the iris data
df = sns.load_dataset("iris")
g = sns.pairplot(df, hue="species")

In [None]:
df.sample(10)

In [None]:
df.describe()

In [None]:
# This dataset is not a pure binary classification problem (it has three species)
# We will take a subset of it to turn it into a binary classification problem
set(df['species'].values)

In [None]:
# Build the binary dataset using the filtering technique from Module 03: EDA
# The result is stored in a new variable, "df_bin"
df_bin = df[df["species"].isin(['setosa','versicolor']) ]
set(df_bin['species'].values)

In [None]:
df_bin.sample(7)

# **Split into Training and Test Data**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_bin[['sepal_length', 'sepal_width','petal_length','petal_width']], 
                                                    df_bin['species'], test_size=0.5)
print(X_train.shape, X_test.shape)
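
Conceptually, a train/test split shuffles the rows and cuts them at a chosen fraction. The following is a minimal standard-library sketch of that idea, not scikit-learn's implementation; the helper name `simple_split` and the toy row indices are assumptions made for illustration. It also shows why a fixed seed matters: in `train_test_split`, the corresponding argument is `random_state`, and without it every run produces a different split.

```python
import random

def simple_split(rows, test_size=0.5, seed=0):
    """Shuffle a copy of the rows and cut it into (train, test) parts."""
    rng = random.Random(seed)        # a fixed seed makes the split reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))              # stand-in for 100 row indices
train, test = simple_split(rows, test_size=0.5, seed=99)
print(len(train), len(test))
```

Every row ends up in exactly one of the two parts, and calling the helper again with the same seed returns the same split.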

**Logistic Regression Modeling in Python (scikit-learn module)**

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train, y_train)

In [None]:
clf

**How "good" are these predictions? = Accuracy/Model Evaluation**

In [None]:
y_reglog = clf.predict(X_test)
y_reglog

In [None]:
# First, we use a common metric: accuracy, the fraction of correct predictions
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_reglog)

# **Another Example Dataset: Cancer Classification Data**


1. The data can be downloaded from this link: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
2. scikit-learn link for the data: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
type(data), dir(data)

In [None]:
X = data.data
print(type(X), X.shape)
X[:3]

In [None]:
Y = data.target
print(type(Y), Y.shape)
print(data.target_names)
Y[-10:]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.5, random_state=99)
print(X_train.shape, X_test.shape)

In [None]:
clf = LogisticRegression().fit(X_train, y_train)
y_reglog = clf.predict(X_test)
accuracy_score(y_test, y_reglog)
# Still an "easy" problem, but a better test case than the previous dataset

In [None]:
dir(clf)

In [None]:
# The fitted equation? (there are 30 variables, so 30 coefficients)
clf.coef_
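
The coefficients define the model's equation: a weighted sum of the variables plus an intercept, passed through the sigmoid function to give a probability. The sketch below illustrates this with a two-variable model. The coefficient values, the intercept, and the helper name `predict_proba` are hypothetical, chosen only for illustration; in a fitted scikit-learn model the actual values live in `clf.coef_` and `clf.intercept_`.

```python
import math

def predict_proba(x, coef, intercept):
    """Probability of the positive class for one sample under a logistic model."""
    z = intercept + sum(w * xi for w, xi in zip(coef, x))  # linear score
    return 1.0 / (1.0 + math.exp(-z))                       # sigmoid maps it to (0, 1)

# Hypothetical coefficients for a 2-variable model (not fitted values)
coef = [0.8, -1.2]
intercept = 0.1
sample = [2.0, 1.0]

p = predict_proba(sample, coef, intercept)
label = 1 if p >= 0.5 else 0          # threshold the probability at 0.5
print(round(p, 3), label)
```

A positive coefficient pushes the probability up as its variable grows, and a negative one pushes it down.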

# **Confusion Matrix**

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

print('Precision = ', precision_score(y_test, y_reglog))
print('Recall = ', recall_score(y_test, y_reglog))
print('f1_score = ', f1_score(y_test, y_reglog))
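
All three metrics derive from the counts of true positives, false positives, and false negatives. The sketch below computes them by hand on a small set of made-up labels; the helper name `binary_metrics` and the toy labels are assumptions for illustration. Note that scikit-learn treats label 1 as the positive class by default, which in `load_breast_cancer` corresponds to "benign".

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the chosen positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0   # how many predicted positives are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many actual positives are found
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 3 true positives, 2 false positives, 1 false negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, round(f1, 4))
```

The F1 score is the harmonic mean of precision and recall, so it stays low whenever either of the two is low.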

**Alternative (1)**

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_reglog))
print(classification_report(y_test, y_reglog))

**Alternative (2)**

In [None]:
# Cross-validation
# Mind the variables: we now use the whole dataset, not just the training split
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
from sklearn.model_selection import cross_val_score
import time

mulai = time.time()
scores_regLog = cross_val_score(clf, X, Y, cv=10) # note that we now use the whole dataset
waktu = time.time() - mulai
# Rough 95% interval for the accuracy: mean +/- 2 standard deviations of the fold scores
print("Logistic Regression accuracy: %0.2f (+/- %0.2f), time = %0.3f seconds" % (scores_regLog.mean(), scores_regLog.std() * 2, waktu))
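
Cross-validation partitions the data into k folds, holds out each fold once as the test set, and then summarises the k scores. The sketch below shows plain, unshuffled k-fold indexing and the "mean +/- 2 standard deviations" summary. The helper name `kfold_indices` and the per-fold scores are hypothetical; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds, which this simplified version does not reproduce.

```python
import statistics

def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=23, k=5))
print([len(test_idx) for _, test_idx in folds])

# Hypothetical per-fold accuracies, summarised the same way as in the cell above
scores = [0.90, 0.95, 1.00, 0.95, 0.90]
mean = statistics.mean(scores)
spread = 2 * statistics.pstdev(scores)   # NumPy's .std() is also the population std
print("%0.2f (+/- %0.2f)" % (mean, spread))
```

Each sample lands in exactly one test fold, so every row is used for both training and evaluation across the k rounds.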

In [None]:
# We can also draw a box plot of the fold scores to get a more complete picture
%matplotlib inline
import matplotlib.pyplot as plt; plt.style.use('classic')
import seaborn as sns; sns.set()

# Use a separate name so the Iris DataFrame "df" is not overwritten
df_scores = pd.DataFrame({'Logistic Regression': scores_regLog})
sns.boxplot(data=df_scores)
plt.show()

# **Logistic Regression for Multiclass Classification?**

In [None]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
set(y) # 3 categories

In [None]:
X.shape # 4 variables, 150 rows

In [None]:
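# 'ovr' = one-vs-rest: one binary model per class
# Note: recent scikit-learn releases deprecate the multi_class argument
# and suggest OneVsRestClassifier(LogisticRegression()) instead
clf = LogisticRegression(multi_class='ovr').fit(X, y)
clf.coef_
# Note that there are 3 equations, one per class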
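
With one-vs-rest, each class gets its own binary logistic model, and the predicted class is the one whose model gives the highest score. The sketch below illustrates this for three classes and two variables. The coefficient rows, intercepts, and the helper name `ovr_predict` are hypothetical values chosen for illustration, not the fitted Iris coefficients.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ovr_predict(x, coefs, intercepts):
    """Score x with one binary logistic model per class; pick the best class."""
    scores = [
        sigmoid(b + sum(w * xi for w, xi in zip(w_row, x)))
        for w_row, b in zip(coefs, intercepts)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical parameters: 3 classes (rows) x 2 variables (columns)
coefs = [[1.0, -1.0], [-0.5, 0.5], [0.2, 0.9]]
intercepts = [0.0, 0.1, -1.0]
x = [1.0, 2.0]

best, scores = ovr_predict(x, coefs, intercepts)
print(best, [round(s, 3) for s in scores])
```

Each row of the coefficient matrix corresponds to one of the "equations" seen in `clf.coef_`.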

# **k-Nearest Neighbour**

In [None]:
# k-NN: http://scikit-learn.org/stable/modules/neighbors.html
from sklearn import neighbors

n_neighbors = 3
weights = 'distance'
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
# Note: X_train and y_train still hold the breast cancer split created earlier
kNN.fit(X_train, y_train)
print('Done!')
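
With `weights = 'distance'`, closer neighbours have more influence on the vote than distant ones. The sketch below is a simplified standard-library version of a distance-weighted k-NN vote, not scikit-learn's implementation; the helper name `knn_predict` and the toy points are assumptions for illustration.

```python
import math
from collections import defaultdict

def knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted k-NN vote: closer neighbours count more."""
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = defaultdict(float)
    for d, label in dists[:k]:
        if d == 0:
            return label           # simplification: an exact match decides the label
        votes[label] += 1.0 / d    # inverse-distance weight
    return max(votes, key=votes.get)

# Toy 2-D points in two groups
train_X = [(0, 0), (0, 1), (3, 3), (3, 4), (4, 3)]
train_y = ['a', 'a', 'b', 'b', 'b']

print(knn_predict(train_X, train_y, (1, 1)))
print(knn_predict(train_X, train_y, (3, 3)))
```

For the query `(1, 1)`, two of the three nearest points belong to group `'a'` and they are also the closest ones, so their combined weight wins the vote.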

In [None]:
# Prediction with k-NN
y_kNN = kNN.predict(X_test)
y_kNN[-10:]

In [None]:
# Accuracy
accuracy_score(y_test, y_kNN)

In [None]:
# Cross-validation
del kNN
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)

mulai = time.time()
# X and y are now the full Iris data loaded with load_iris above
scores_kNN = cross_val_score(kNN, X, y, cv=10) # note that we now use the whole dataset
waktu = time.time() - mulai
# Rough 95% interval for the accuracy: mean +/- 2 standard deviations of the fold scores
print("k-NN accuracy: %0.2f (+/- %0.2f), time = %0.3f seconds" % (scores_kNN.mean(), scores_kNN.std() * 2, waktu))

# **Decision Tree**

In [None]:
# Decision Tree: http://scikit-learn.org/stable/modules/tree.html
from sklearn import tree

DT = tree.DecisionTreeClassifier()
DT = DT.fit(X_train, y_train)
y_DT = DT.predict(X_test)
print(accuracy_score(y_test, y_DT))
print(confusion_matrix(y_test, y_DT))
print(classification_report(y_test, y_DT))

In [None]:
# Reload the data
df = sns.load_dataset("iris")
df.sample(7)

In [None]:
# Separate Data
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df['species']
seed = 9
validation_size = 0.3
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
print(X_train.shape, len(Y_test))

In [None]:
# Build the model and Evaluate
dt_model = tree.DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=seed) # Default Gini
dt = dt_model.fit(X_train, Y_train)
dt_prediction = dt.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, dt_prediction))
print(confusion_matrix(Y_test, dt_prediction))
print(classification_report(Y_test, dt_prediction))
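
The `criterion='entropy'` setting means that splits are chosen by how much they reduce the entropy of the class labels (the information gain). The sketch below computes both quantities by hand on a small set of made-up labels; the helper names `entropy` and `information_gain` and the label lists are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

parent = ['setosa'] * 4 + ['versicolor'] * 4

# A perfect split separates the two classes completely
perfect = information_gain(parent, ['setosa'] * 4, ['versicolor'] * 4)

# A poor split leaves both children mixed
poor = information_gain(
    parent,
    ['setosa', 'setosa', 'setosa', 'versicolor'],
    ['setosa', 'versicolor', 'versicolor', 'versicolor'],
)
print(entropy(parent), perfect, round(poor, 4))
```

The tree prefers the split with the larger gain, which is why a clean separation is chosen over a mixed one.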

In [None]:
# Variable importance: one of the advantages of a Decision Tree
dt.feature_importances_

In [None]:
# Another advantage of a Decision Tree that most other models lack:
# the fitted model can be drawn as a readable diagram

# "WARNING"
# 1. may not run on Google Colab
# 2. requires the Graphviz software plus a system (PATH) variable setting
# instructions are here: https://stackoverflow.com/questions/49471867/installing-graphviz-for-use-with-python-3-on-windows-10
# the installer is in the "UIN Bandung" folder that was copied from the flash drive at the start

import graphviz

dot_data = tree.export_graphviz(dt, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("iris") 
var_names = ['sepal_length','sepal_width','petal_length','petal_width']
categories = ['Setosa', 'Versicolor', 'Virginica']  # same order as the sorted class labels
dot_data = tree.export_graphviz(dt, out_file=None, 
                         feature_names = var_names,  
                         class_names=categories,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

# **Naive Bayes**

In [None]:
# Naive Bayes: http://scikit-learn.org/stable/modules/naive_bayes.html
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
nbc = gnb.fit(X_train, Y_train)
nbc_prediction = nbc.predict(X_test)

print('Accuracy = ', accuracy_score(Y_test, nbc_prediction))
print(confusion_matrix(Y_test, nbc_prediction))
print(classification_report(Y_test, nbc_prediction))
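
Gaussian Naive Bayes models each variable within each class as a normal distribution and, treating the variables as independent, picks the class with the highest posterior. The sketch below is a simplified standard-library version, not scikit-learn's implementation. The helper names `fit_gaussian_nb` and `predict_gaussian_nb`, the small constant added to the variances, and the six toy measurements are all assumptions for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Per class: prior, plus the mean and variance of every variable."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # small constant keeps the variance away from zero
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(label):
        prior, means, variances = model[label]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * var) - (xi - m) ** 2 / (2 * var)
            for xi, m, var in zip(x, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)

# Toy (petal_length, petal_width) values, made up for illustration
X_toy = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
y_toy = ['setosa'] * 3 + ['versicolor'] * 3

model = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(model, (1.6, 0.25)))
print(predict_gaussian_nb(model, (4.4, 1.3)))
```

Working with log probabilities avoids numerical underflow when many small likelihoods are multiplied together.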

In [None]:
# Model Comparisons using Cross Validation
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df['species']

Models = [('Logistic Regression',clf), ('k-NN',kNN),('Naive Bayes',gnb), ('Decision Tree',DT)]
Scores = {}
for model_name, model in Models:
    if model_name=='Naive Bayes':
        Scores[model_name] = cross_val_score(model, X.values, Y, cv=10,scoring='accuracy')
    else:
        Scores[model_name] = cross_val_score(model, X, Y, cv=10,scoring='accuracy')
        
# Use a separate name so the fitted decision tree "dt" is not overwritten
df_scores = pd.DataFrame.from_dict(Scores)
ax = sns.boxplot(data=df_scores)