# SciKit Learn Week 3 - Machine Learning Workflow & Data Preprocessing

***

# Dataset Preparation

## Load Sample Dataset: Iris Dataset

In [1]:
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data
y = iris.target

## Splitting Dataset: Training & Testing Set

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state=1)

When calling `train_test_split()`, four arguments are passed:
1. The feature values (`X`).
2. The target values (`y`).
3. The size of the testing set, `test_size`. (Here, 0.4 means 40% for testing and 60% for training.)
4. The random seed, `random_state`, which makes the split reproducible.
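
To make the effect of `test_size` concrete, the following minimal sketch repeats the split and inspects the resulting shapes. It assumes the standard 150-sample Iris dataset bundled with scikit-learn and uses `load_iris(return_X_y=True)` only as a shortcut for obtaining `X` and `y`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Shortcut: return features and targets directly as two arrays.
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# 150 samples in total: 40% go to testing, the remaining 60% to training.
print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
print(y_train.shape, y_test.shape)  # (90,) (60,)
```

Because `random_state` is fixed, running the split again should yield the same partition.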

# Training Model

* In Scikit Learn, a machine learning model is built from a class known as an **estimator**.
* Every predictive estimator (classifier or regressor) implements two main methods: `fit()` and `predict()`.
* The `fit()` method trains the model.
* The `predict()` method produces estimates / predictions using the trained model.
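
One consequence of this shared interface is that different classifiers can be trained and queried with identical code. The sketch below is illustrative only: `DecisionTreeClassifier` is not used elsewhere in this notebook and is swapped in here purely to show that the same `fit()` / `predict()` calls apply.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# Two unrelated algorithms, driven through the same two methods.
estimators = [
    KNeighborsClassifier(n_neighbors=3),
    DecisionTreeClassifier(random_state=1),  # illustrative second estimator
]

predictions = {}
for estimator in estimators:
    estimator.fit(X_train, y_train)            # training step
    name = type(estimator).__name__
    predictions[name] = estimator.predict(X_test)  # prediction step

for name, y_pred in predictions.items():
    print(name, y_pred[:5])
```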

In [7]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=3)

# Model Evaluation

In [8]:
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Accuracy: {acc}')

Accuracy: 0.9833333333333333
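
Accuracy is the fraction of test samples whose predicted label matches the true label. As a rough sketch, assuming the same split and model settings as above, it can be recomputed by hand and compared with `accuracy_score()`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Count the matching labels and divide by the number of test samples.
n_correct = int(np.sum(y_pred == y_test))
manual_acc = n_correct / len(y_test)

# The library function should report the same value.
library_acc = accuracy_score(y_test, y_pred)
print(n_correct, len(y_test), manual_acc, library_acc)
```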


# Using the Trained Model

In [9]:
data_baru = [[5, 5, 3, 2], [2, 4, 3, 5]]

preds = model.predict(data_baru)
preds

array([1, 2])

In [10]:
pred_species = [iris.target_names[p] for p in preds]
print(f'Prediction result: {pred_species}')

Prediction result: ['versicolor', 'virginica']


The features (5, 5, 3, 2) are predicted to belong to the species versicolor (1).

The features (2, 4, 3, 5) are predicted to belong to the species virginica (2).

# Dump & Load Trained Model

## Dumping the Machine Learning Model to a `joblib` File

In [11]:
import joblib
joblib.dump(model, 'iris_classifier_knn.joblib')

['iris_classifier_knn.joblib']

## Loading the Machine Learning Model from a `joblib` File

In [12]:
production_model = joblib.load('iris_classifier_knn.joblib')
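
A quick sanity check after loading is to confirm that the restored model predicts the same labels as the original. The sketch below is self-contained and makes two simplifying assumptions: the model is fitted on the full Iris data for brevity, and the file is written to a temporary directory rather than the working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Write the model to a throwaway directory, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'iris_classifier_knn.joblib')
    joblib.dump(model, path)
    production_model = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X), production_model.predict(X))
print(same_predictions)
```

Note that `joblib` files are pickle-based, so they should only be loaded from trusted sources.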

***

# Data Preprocessing with SciKit Learn

# Sample Data

In [13]:
import numpy as np
from sklearn import preprocessing

sample_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

sample_data

array([[ 2.1, -1.9,  5.5],
       [-1.5,  2.4,  3.5],
       [ 0.5, -7.9,  5.6],
       [ 5.9,  2.3, -5.8]])

In [14]:
sample_data.shape

(4, 3)

# Binarisation

The main goal of this technique is to produce data consisting of only two numeric values: 0 and 1.

In [15]:
sample_data

array([[ 2.1, -1.9,  5.5],
       [-1.5,  2.4,  3.5],
       [ 0.5, -7.9,  5.6],
       [ 5.9,  2.3, -5.8]])

In [16]:
preprocessor = preprocessing.Binarizer(threshold=0.5)
binarised_data = preprocessor.transform(sample_data)
binarised_data

array([[1., 0., 1.],
       [0., 1., 1.],
       [0., 0., 1.],
       [1., 1., 0.]])

Every value greater than `threshold` is converted to 1. Values less than or equal to `threshold`, including a value exactly equal to it, become 0.
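
The rule can be written as a single NumPy comparison. This is only a sketch of the thresholding logic for comparison with `Binarizer`, not the library's internal implementation:

```python
import numpy as np
from sklearn import preprocessing

sample_data = np.array([[2.1, -1.9, 5.5],
                        [-1.5, 2.4, 3.5],
                        [0.5, -7.9, 5.6],
                        [5.9, 2.3, -5.8]])

threshold = 0.5

# Strictly greater than the threshold -> 1.0, otherwise -> 0.0.
manual_binarised = (sample_data > threshold).astype(float)

sklearn_binarised = preprocessing.Binarizer(threshold=threshold).transform(sample_data)

print(np.array_equal(manual_binarised, sklearn_binarised))
# The value 0.5 in the third row equals the threshold, so it maps to 0.
print(manual_binarised[2, 0])
```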

# Scaling

The goal is to produce numeric data that lies within a given scale / range.

In [17]:
sample_data

array([[ 2.1, -1.9,  5.5],
       [-1.5,  2.4,  3.5],
       [ 0.5, -7.9,  5.6],
       [ 5.9,  2.3, -5.8]])

In [18]:
preprocessor = preprocessing.MinMaxScaler(feature_range=(0, 1))
preprocessor.fit(sample_data)
scaled_data = preprocessor.transform(sample_data)
scaled_data

array([[0.48648649, 0.58252427, 0.99122807],
       [0.        , 1.        , 0.81578947],
       [0.27027027, 0.        , 1.        ],
       [1.        , 0.99029126, 0.        ]])

In [19]:
scaled_data = preprocessor.fit_transform(sample_data)
scaled_data

array([[0.48648649, 0.58252427, 0.99122807],
       [0.        , 1.        , 0.81578947],
       [0.27027027, 0.        , 1.        ],
       [1.        , 0.99029126, 0.        ]])
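
With `feature_range=(0, 1)`, min-max scaling maps each column so that its minimum becomes 0 and its maximum becomes 1. A minimal sketch of that arithmetic, compared against `MinMaxScaler`:

```python
import numpy as np
from sklearn import preprocessing

sample_data = np.array([[2.1, -1.9, 5.5],
                        [-1.5, 2.4, 3.5],
                        [0.5, -7.9, 5.6],
                        [5.9, 2.3, -5.8]])

# Per-column (per-feature) minimum and maximum.
col_min = sample_data.min(axis=0)
col_max = sample_data.max(axis=0)

# (x - min) / (max - min), applied column by column via broadcasting.
manual_scaled = (sample_data - col_min) / (col_max - col_min)

sklearn_scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(sample_data)

print(np.allclose(manual_scaled, sklearn_scaled))
```

Scaling is done per column, so each feature ends up in the range 0 to 1 regardless of its original units.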

# L1 Normalisation: Least Absolute Deviations

Reference: [https://en.wikipedia.org/wiki/Least_absolute_deviations]

In [20]:
sample_data

array([[ 2.1, -1.9,  5.5],
       [-1.5,  2.4,  3.5],
       [ 0.5, -7.9,  5.6],
       [ 5.9,  2.3, -5.8]])

In [22]:
l1_normalised_data = preprocessing.normalize(sample_data, norm='l1')
l1_normalised_data

array([[ 0.22105263, -0.2       ,  0.57894737],
       [-0.2027027 ,  0.32432432,  0.47297297],
       [ 0.03571429, -0.56428571,  0.4       ],
       [ 0.42142857,  0.16428571, -0.41428571]])

The `normalize()` function takes two arguments here:
1. The sample data.
2. The norm to use (`norm='l1'` here); each row is scaled independently.
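
With `norm='l1'`, each row is divided by the sum of the absolute values of its entries, so the absolute values in every row add up to 1. A minimal sketch of that computation, checked against `normalize()`:

```python
import numpy as np
from sklearn import preprocessing

sample_data = np.array([[2.1, -1.9, 5.5],
                        [-1.5, 2.4, 3.5],
                        [0.5, -7.9, 5.6],
                        [5.9, 2.3, -5.8]])

# L1 norm of each row: sum of absolute values, kept as a column for broadcasting.
row_l1 = np.abs(sample_data).sum(axis=1, keepdims=True)
manual_l1 = sample_data / row_l1

sklearn_l1 = preprocessing.normalize(sample_data, norm='l1')

print(np.allclose(manual_l1, sklearn_l1))
# After normalisation, the absolute values in each row sum to 1.
print(np.abs(manual_l1).sum(axis=1))
```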

# L2 Normalisation: Least Squares

Reference: [https://en.wikipedia.org/wiki/Least_squares]

In [23]:
sample_data

array([[ 2.1, -1.9,  5.5],
       [-1.5,  2.4,  3.5],
       [ 0.5, -7.9,  5.6],
       [ 5.9,  2.3, -5.8]])

In [24]:
l2_normalised_data = preprocessing.normalize(sample_data, norm='l2')
l2_normalised_data

array([[ 0.33946114, -0.30713151,  0.88906489],
       [-0.33325106,  0.53320169,  0.7775858 ],
       [ 0.05156558, -0.81473612,  0.57753446],
       [ 0.68706914,  0.26784051, -0.6754239 ]])
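
With `norm='l2'`, each row is divided by its Euclidean length (the square root of the sum of squared entries), so every row ends up with length 1. A minimal sketch of that computation, checked against `normalize()`:

```python
import numpy as np
from sklearn import preprocessing

sample_data = np.array([[2.1, -1.9, 5.5],
                        [-1.5, 2.4, 3.5],
                        [0.5, -7.9, 5.6],
                        [5.9, 2.3, -5.8]])

# L2 norm of each row: Euclidean length, kept as a column for broadcasting.
row_l2 = np.linalg.norm(sample_data, axis=1, keepdims=True)
manual_l2 = sample_data / row_l2

sklearn_l2 = preprocessing.normalize(sample_data, norm='l2')

print(np.allclose(manual_l2, sklearn_l2))
# After normalisation, each row has unit Euclidean length.
print(np.linalg.norm(manual_l2, axis=1))
```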