## **Difference Between Normalized and Unnormalized Data**

Data normalization is a process that keeps the records in a dataset consistent. It transforms the original data into a format that can be processed efficiently. In this notebook, that means rescaling every feature with `MinMaxScaler` and then comparing the accuracy of Gaussian Naive Bayes with and without that step.
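
As a minimal sketch of what min-max normalization does: with its default settings, scikit-learn's `MinMaxScaler` rescales each column to the range [0, 1] using `(x - min) / (max - min)`. The toy array below is made up for illustration and is not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): two features on very different scales
X = np.array([[22.0, 0.10],
              [62.0, 0.14],
              [44.0, 0.12]])

# Manual min-max normalization, computed column by column
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with scikit-learn
X_scaled = MinMaxScaler().fit_transform(X)

print(X_scaled)
```

After scaling, both columns span the same range, so neither dominates purely because of its units.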

In [1]:
### Data Wrangling 
import pandas as pd
import numpy as np
from scipy.io import arff
from collections import OrderedDict

### Modelling 
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

### Ensemble stacking
from sklearn.ensemble import StackingClassifier

### Remove unnecessary warnings
import warnings
warnings.filterwarnings('ignore')

## Import the dataset in .arff format

In [2]:
data = arff.loadarff('/content/drive/MyDrive/datamining/tugas/messidor_features.arff')
df = pd.DataFrame(data[0])
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,Class
0,1.0,1.0,22.0,22.0,22.0,19.0,18.0,14.0,49.895756,17.775994,5.270920,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1.0,b'0'
1,1.0,1.0,24.0,24.0,22.0,18.0,16.0,13.0,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0.0,b'0'
2,1.0,1.0,62.0,60.0,59.0,54.0,47.0,33.0,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0.0,b'1'
3,1.0,1.0,55.0,53.0,53.0,50.0,43.0,31.0,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.114790,0.0,b'0'
4,1.0,1.0,44.0,44.0,44.0,41.0,39.0,27.0,18.026254,8.570709,0.410381,0.000000,0.000000,0.000000,0.000000,0.000000,0.475935,0.123572,0.0,b'1'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1146,1.0,1.0,34.0,34.0,34.0,33.0,31.0,24.0,6.071765,0.937472,0.031145,0.003115,0.000000,0.000000,0.000000,0.000000,0.537470,0.116795,0.0,b'0'
1147,1.0,1.0,49.0,49.0,49.0,49.0,45.0,37.0,63.197145,27.377668,8.067688,0.979548,0.001552,0.000000,0.000000,0.000000,0.516733,0.124190,0.0,b'0'
1148,1.0,0.0,49.0,48.0,48.0,45.0,43.0,33.0,30.461898,13.966980,1.763305,0.137858,0.011221,0.000000,0.000000,0.000000,0.560632,0.129843,0.0,b'0'
1149,1.0,1.0,39.0,36.0,29.0,23.0,13.0,7.0,40.525739,12.604947,4.740919,1.077570,0.563518,0.326860,0.239568,0.174584,0.485972,0.106690,1.0,b'1'
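
The `Class` column in the output above holds byte strings (`b'0'`, `b'1'`). The notebook later handles this with `LabelEncoder`. As an optional alternative, the bytes could be decoded directly; the sketch below uses a small stand-in frame named `demo` (hypothetical, not the real dataset).

```python
import pandas as pd

# Stand-in for the loaded frame: the Class column contains byte strings,
# as in the output shown above
demo = pd.DataFrame({"Class": [b"0", b"0", b"1"]})

# Decode bytes to str, then cast to int
demo["Class"] = demo["Class"].str.decode("utf-8").astype(int)

print(demo["Class"].tolist())
```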


In [3]:
col_names = []
for i in range(20):
    if i == 0:
        col_names.append('quality')
    if i == 1:
        col_names.append('prescreen')
    if i >= 2 and i <= 7:
        col_names.append('ma' + str(i))
    if i >= 8 and i <= 15:
        col_names.append('exudate' + str(i))
    if i == 16:
        col_names.append('euDist')
    if i == 17:
        col_names.append('diameter')
    if i == 18:
        col_names.append('amfm_class')
    if i == 19:
        col_names.append('label')

In [4]:
# Assign the flat list of names; wrapping it in another list would create a MultiIndex
df.columns = col_names

## Normalize the data

In [5]:
# Fraction of the dataset held out as test data (0.2 = 20%)
percent_amount_of_test_data = 0.2

In [6]:
unnormalized_data = df.drop(columns = ["label"])

In [7]:
y = df["label"].values

In [8]:
scaler = MinMaxScaler()
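# Learn each feature's min and max from the full dataset (before the train/test split).
# fit() returns the scaler itself, so `scala` refers to the same object as `scaler`.
scala = scaler.fit(unnormalized_data)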

In [9]:
normalized_dataset = scaler.transform(unnormalized_data)
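
The scaler above is fit on the full dataset, so the minimum and maximum it learns also reflect rows that later end up in the test set. A common alternative is to fit the scaler on the training rows only and reuse it for the test rows. The sketch below assumes synthetic stand-in data; the names `X_demo`, `y_demo`, and `demo_scaler` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical; not the Messidor features)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, keeping the original row order as the notebook does
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# Learn min and max from the training rows only
demo_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)

# Test rows reuse the training min/max, so some values may fall outside [0, 1]
X_te_scaled = demo_scaler.transform(X_te)
```

Whether this changes the reported accuracy for this dataset is not something the notebook measures.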

## Preprocess the data

In [10]:
# Encode the class labels (b'0', b'1') as the integers 0 and 1
le = preprocessing.LabelEncoder()
le.fit(y)
y = le.transform(y)

## Split the data without normalization

In [11]:
# Split the unnormalized data (shuffle=False keeps the original row order)
X_train, X_test, y_train, y_test = train_test_split(unnormalized_data, y, test_size = percent_amount_of_test_data, random_state=42, shuffle=False)

## Split the data with normalization

In [12]:
# Split the normalized data the same way, so both versions use identical rows
X_train_norm, X_test_norm, y_train_norm, y_test_norm = train_test_split(normalized_dataset, y, test_size = percent_amount_of_test_data, random_state=42, shuffle=False)
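
Both splits above pass `shuffle=False`, in which case `train_test_split` takes the first rows as the training set and the last rows as the test set, and `random_state` plays no role. The sketch below illustrates this on a ten-row toy array (hypothetical data) and, for comparison, shows a shuffled, stratified split as one possible alternative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): ten rows with alternating class labels
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0, 1] * 5)

# shuffle=False: the first 80% of rows go to train, the last 20% to test
_, X_te_ordered, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# Alternative: shuffle the rows and keep the class ratio equal in both parts
_, X_te_shuffled, _, y_te_shuffled = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_te_ordered.ravel())
```

An ordered split is only representative if the rows in the file are not sorted in a way that correlates with the label.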

## Implementation with the Naive Bayes algorithm

### Accuracy with normalized data

In [13]:
clf_norm_nb = GaussianNB()
clf_norm_nb.fit(X_train_norm, y_train_norm)

GaussianNB()

In [14]:
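# Class probabilities for each test sample; column 1 is the probability of class 1
post_norm_nb = clf_norm_nb.predict_proba(X_test_norm)
probas_norm_nb = post_norm_nb[:,1]
# Round the probability to 0 or 1 to obtain the predicted label
probas_norm_nb = np.round(probas_norm_nb)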
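
Rounding the class-1 probability works here because the labels are encoded as 0 and 1. For a binary problem like this one, calling `predict` directly is expected to give the same labels. The sketch below checks this on synthetic data (hypothetical; the names `X_demo`, `y_demo`, and `clf` are not from the notebook).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (hypothetical; for illustration only)
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(2.0, 1.0, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

clf = GaussianNB().fit(X_demo, y_demo)

# Route used in the notebook: round the probability of class 1
pred_rounded = np.round(clf.predict_proba(X_demo)[:, 1])

# Direct route: let the classifier pick the most probable class
pred_direct = clf.predict(X_demo)

print(np.array_equal(pred_rounded, pred_direct))
```

The direct route also generalizes to more than two classes, where rounding a single probability column would not.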

In [15]:
pred_norm_nb = probas_norm_nb
accuracy_norm_nb = accuracy_score(y_test_norm, pred_norm_nb)

In [16]:
print(f"Gaussian Naive Bayes accuracy with normalized data: {accuracy_norm_nb}")

Gaussian Naive Bayes accuracy with normalized data: 0.6363636363636364


### Accuracy without normalized data

In [17]:
clf_nb = GaussianNB()
clf_nb.fit(X_train, y_train)

GaussianNB()

In [18]:
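# Class probabilities for each test sample; column 1 is the probability of class 1
post_nb = clf_nb.predict_proba(X_test)
probas_nb = post_nb[:,1]
# Round the probability to 0 or 1 to obtain the predicted label
probas_nb = np.round(probas_nb)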

In [19]:
pred_nb = probas_nb
accuracy_nb = accuracy_score(y_test, pred_nb)

In [20]:
print(f"Gaussian Naive Bayes accuracy with unnormalized data: {accuracy_nb}")

Gaussian Naive Bayes accuracy with unnormalized data: 0.645021645021645
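
The two scores differ by less than one percentage point (about 0.636 versus 0.645). One plausible explanation, offered as an assumption that this notebook does not verify: Gaussian Naive Bayes fits a separate Gaussian to each feature within each class, so a linear rescaling of a feature does not by itself change which class is most probable. Any remaining difference may come from the `var_smoothing` term, which adds a small value tied to the largest feature variance and therefore depends on scale. The sketch below explores this on synthetic data; the helper `agreement` and all data are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-class data with features on very different scales (hypothetical)
rng = np.random.default_rng(0)
n = 200
scales = np.array([1.0, 100.0, 0.01])
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(n, 3)) * scales,
                    rng.normal(1.0, 1.0, size=(n, 3)) * scales])
y_demo = np.array([0] * n + [1] * n)

X_demo_scaled = MinMaxScaler().fit_transform(X_demo)

def agreement(var_smoothing):
    """Fraction of samples given the same label with raw and scaled features."""
    raw = GaussianNB(var_smoothing=var_smoothing).fit(X_demo, y_demo).predict(X_demo)
    scaled = (GaussianNB(var_smoothing=var_smoothing)
              .fit(X_demo_scaled, y_demo)
              .predict(X_demo_scaled))
    return float(np.mean(raw == scaled))

# With smoothing switched off, the predictions are expected to coincide
agreement_no_smoothing = agreement(0.0)
# With the library default (1e-9), small differences are possible
agreement_default = agreement(1e-9)

print(agreement_no_smoothing, agreement_default)
```

If this reasoning holds for the Messidor features as well, normalization would be expected to matter more for distance-based models such as k-nearest neighbors than for Gaussian Naive Bayes.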
