# Introduction to Machine Learning
### Learning Outcomes:
* _Students are able to use the Python programming language for simple machine learning analysis using PCA and K-Means clustering_
* _Students are able to explain the structure of data (samples, features, and labels)_
* _Students are able to explain the strengths and weaknesses of dimensionality reduction and clustering for analyzing biological data_

### Module Description
In this module, we will learn to use simple machine learning methods and apply them to species classification. We will explore the Iris flower dataset, which records morphological measurements (sepals and petals), to classify the genus into three species: _Iris setosa_, _Iris versicolor_, and _Iris virginica_.

<img src="https://storage.googleapis.com/kaggle-datasets-images/19/19/default-backgrounds/dataset-card.jpg" alt="drawing" width="200"/>

### Outline
- [ ] Dataset Cleaning & Exploratory Data Analysis (EDA)
- [ ] Dimensionality Reduction: Principal Component Analysis
- [ ] K-means Clustering

### Reference:
*This module is adapted from*: https://www.kaggle.com/bburns/iris-exploration-pca-k-means-and-gmm-clustering


In [None]:
# Load Library
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

## Dataset Cleaning & EDA

In [None]:
iris = datasets.load_iris()
print(iris.DESCR)

In [None]:
# load the measurements into a pandas DataFrame; the feature columns stay numeric
data = pd.DataFrame(iris['data'], columns=iris['feature_names'])

# add the species name of each sample as a string label
data['species'] = ['iris ' + iris.target_names[i] for i in iris.target]

# show the first 5 rows
data.head()

In [None]:
# use seaborn to make scatter plot showing species for each sample
sns.set(style="ticks", color_codes=True)

sns.pairplot(data, hue="species", diag_kind="hist")

plt.show()

# the pairplot shows how similar versicolor and virginica are, at least with the given features.
# there could be features that were not measured that would separate the species more clearly.
# the same holds for any unsupervised learning method: it needs the right features
# to separate the groups well.

## Discussion
* How many dimensions does this dataset have? How many samples and features?
* Based on the pairplot, which features/characters are good at distinguishing the species?
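
The first question can be explored directly in code. The snippet below is a minimal sketch, assuming the same `datasets.load_iris()` object used in this notebook; the variable names `n_samples` and `n_features` are chosen here for illustration only.

```python
from sklearn import datasets

iris = datasets.load_iris()

# rows are samples (individual flowers), columns are features (measurements)
n_samples, n_features = iris.data.shape
print("samples:", n_samples)
print("features:", n_features, list(iris.feature_names))

# labels: one integer species code per sample, plus the matching species names
print("labels:", iris.target.shape, list(iris.target_names))
```

The label array is kept separate from the feature matrix, which is why the notebook later splits the data into `X` and `Y`.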

## Dimensionality Reduction: PCA

In [None]:
# split data into features (X) and labels (Y)
X = iris.data
Y = iris.target

In [None]:
# the features are on different scales (e.g. sepal length is several times larger than petal width),
# so we standardize them; otherwise the features with larger values dominate PCA and clustering.
scaler = preprocessing.StandardScaler()

scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array, columns = iris['feature_names'])

X_scaled.sample(5)
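
To make the scaling step less of a black box, the sketch below recomputes the z-scores by hand and compares them with `StandardScaler`. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses; `X_manual` and `X_sklearn` are illustrative names that do not appear elsewhere in this notebook.

```python
import numpy as np
from sklearn import datasets, preprocessing

X = datasets.load_iris().data

# z-score by hand: subtract each column's mean, divide by its standard deviation
# (np.std defaults to ddof=0, the population standard deviation)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# the same transformation done by scikit-learn
X_sklearn = preprocessing.StandardScaler().fit_transform(X)

# True when both approaches agree up to floating-point error
print(np.allclose(X_manual, X_sklearn))

# after scaling, every feature has mean ~0 and standard deviation ~1
print(X_sklearn.mean(axis=0).round(6))
print(X_sklearn.std(axis=0).round(6))
```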

In [None]:
# reduce the data to 2 dimensions
# with many features, feature reduction helps to avoid the curse of dimensionality
# (i.e. needing exponentially more data for accurate predictions
# as the number of features grows).

# Principal Component Analysis (PCA) does this by remapping the data
# to a new (smaller) coordinate system whose axes capture as much
# of the variance as possible.

# PCA is *also* useful for visualization: reduce the features
# to 2 dimensions and make a scatterplot.
# the projection discards some information, but here it only
# goes from 4d to 2d, so most of the variance is kept.
seed = 0
ndimensions = 2

pca = PCA(n_components=ndimensions, random_state=seed)
pca.fit(X_scaled)
X_pca_array = pca.transform(X_scaled)
X_pca = pd.DataFrame(X_pca_array, columns=['PC1','PC2']) # PC=principal component
X_pca.sample(5)
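
How much information the 2D projection keeps can be checked rather than assumed. The sketch below reads `explained_variance_ratio_` from a fitted `PCA` object; it refits the scaler and PCA so that it runs on its own, and `retained` is an illustrative name.

```python
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

X = datasets.load_iris().data
X_scaled = preprocessing.StandardScaler().fit_transform(X)

pca = PCA(n_components=2, random_state=0)
pca.fit(X_scaled)

# fraction of the total variance captured by each principal component
for name, ratio in zip(["PC1", "PC2"], pca.explained_variance_ratio_):
    print(f"{name}: {ratio:.1%}")

# fraction of the total variance kept after dropping the other two dimensions
retained = pca.explained_variance_ratio_.sum()
print(f"retained by 2 components: {retained:.1%}")
```

A value close to 100% means the scatterplot of PC1 against PC2 is a faithful summary of the four original features; a low value would be a warning that the 2D picture hides structure.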

## K-Means Clustering

In [None]:
nclusters = 3 # this is the k in k-means

# n_init is set explicitly because its default differs between scikit-learn versions
km = KMeans(n_clusters=nclusters, n_init=10, random_state=seed)
km.fit(X_scaled)

# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
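
The raw cluster ids are hard to judge by eye. One possible way to compare them with the known species, sketched below, is a contingency table plus the adjusted Rand index. This comparison is an addition for illustration and is not part of the original module; `adjusted_rand_score` comes from `sklearn.metrics`, and `table` and `ari` are illustrative names.

```python
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
X_scaled = preprocessing.StandardScaler().fit_transform(iris.data)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(X_scaled)

# rows: actual species, columns: k-means cluster id (the ids themselves are arbitrary)
species = pd.Series(iris.target_names[iris.target], name="species")
table = pd.crosstab(species, pd.Series(clusters, name="cluster"))
print(table)

# 1.0 means the clusters match the species exactly, values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, clusters)
print(f"adjusted Rand index: {ari:.2f}")
```

The adjusted Rand index ignores how clusters are numbered, so it stays the same if k-means happens to call a species' cluster 0 in one run and 2 in another.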

In [None]:
# the species labels in Y are already integer codes (0, 1, 2), so they can be used directly
y_id_array = Y

df_plot = X_pca.copy()
df_plot['ClusterKmeans'] = y_cluster_kmeans
df_plot['SpeciesId'] = y_id_array # also add actual labels so we can use it in later plots
df_plot.sample(5)

In [None]:
# so now we can make a 2d scatterplot of the clusters
# first define a plot fn

def plotData(df, groupby):
    "make a scatterplot of the first two principal components of the data, colored by the groupby field"
    
    # make a figure with just one subplot.
    # you can specify multiple subplots in a figure, 
    # in which case ax would be an array of axes,
    # but in this case it'll just be a single axis object.
    fig, ax = plt.subplots(figsize = (7,7))

    # color map
    cmap = plt.get_cmap('prism')

    # we can use pandas to plot each cluster on the same graph.
    # see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html
    for i, cluster in df.groupby(groupby):
        cluster.plot(ax = ax, # need to pass this so all scatterplots are on same graph
                     kind = 'scatter', 
                     x = 'PC1', y = 'PC2',
                     color = cmap(i/(nclusters-1)), # cmap maps a number to a color
                     label = "%s %i" % (groupby, i), 
                     s=30) # dot size
    ax.grid()
    ax.axhline(0, color='black')
    ax.axvline(0, color='black')
    ax.set_title("Principal Components Analysis (PCA) of Iris Dataset");

In [None]:
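# plot the k-means cluster of each data point, then the actual species for comparison
# (cluster ids are arbitrary, so the colors of matching groups may differ between the plots)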
plotData(df_plot, 'ClusterKmeans')
plotData(df_plot, 'SpeciesId')

## Discussion
* Why do the data need to be standardized/normalized before running PCA?
* The Iris data has 4 features (dimensions). Can we visualize 4 dimensions?
* How is the Iris data, which has 4 features, reduced to two dimensions? What is meant by PC1 & PC2? Can the PCA result also be projected into three dimensions?
* In your opinion, which pairplot panel most resembles the PCA result?
* What are the risks of using PCA?
* Based on the information we provided, can K-means clustering classify the species well?
* Can the analysis methods above be applied to other types of biological data?
* Are there better dimensionality reduction and clustering methods?
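
As a starting point for the question about PC1 and PC2, the sketch below shows that each principal component is a weighted combination of the four original features. It reads the weights from `pca.components_` and reproduces `pca.transform` by hand; `loadings` and `X_manual` are illustrative names, and the scaler and PCA are refit so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, preprocessing
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X_scaled = preprocessing.StandardScaler().fit_transform(iris.data)

pca = PCA(n_components=2, random_state=0).fit(X_scaled)

# each row of components_ holds the feature weights of one principal component;
# transposing gives one row per feature and one column per component
loadings = pd.DataFrame(pca.components_.T,
                        index=iris.feature_names,
                        columns=["PC1", "PC2"])
print(loadings.round(2))

# projecting by hand: center the data, then take dot products with the weights
X_manual = (X_scaled - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual, pca.transform(X_scaled)))
```

Features with large absolute weights contribute most to a component, which can help in deciding which pairplot panel resembles the PCA scatterplot.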