<a href="https://colab.research.google.com/github/nasy-sr/Project-Tugas-Kelompok-Neural-Networks/blob/notebooks/NN%26Prediksi_OnlineRetailII_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practicum

Applying a Neural Network with TensorFlow/Keras
Supervised Learning – Classification

Dataset Description and Experiment Objective

The dataset used in this experiment is Online Retail II, a sales transaction dataset containing information about products, purchase quantities, prices, the country of each transaction, and invoice details.

The objective of this experiment is to build a TensorFlow/Keras neural network that classifies transactions, specifically predicting whether a transaction is a cancellation (cancelled transaction) or not.
The model is intended to support transaction behavior analysis and data-driven decision making.

# Import Libraries and Set Up the Environment

The first step imports all libraries needed for data manipulation, preprocessing, building the neural network, and evaluating model performance.
A random seed is also fixed for both NumPy and TensorFlow so that the experiment results can be reproduced.

In [None]:
# =========================
# 0) Setup
# =========================
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.metrics import (
    classification_report, confusion_matrix,
    mean_absolute_error, mean_squared_error, r2_score
)

import tensorflow as tf

SEED = 42
tf.random.set_seed(SEED)
np.random.seed(SEED)


# Mount Google Drive and Load the Dataset

The dataset is stored on Google Drive, so the drive must be mounted before the file can be accessed from Google Colab.
The dataset is then loaded into a DataFrame using Pandas.

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# =========================
# 1) Load dataset
# =========================
file_path = "/content/drive/MyDrive/Machine Learning/TaskKlompok_DSet1/online_retail_II.xlsx"

df = pd.read_excel(file_path)
display(df.head())
print(df.shape)
print(df.dtypes)


| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.1 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |


(525461, 8)
Invoice                object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
Price                 float64
Customer ID           float64
Country                object
dtype: object


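# Initial Dataset Information

This stage performs an initial exploration to determine the number of rows, the data type of each column, and the overall structure of the dataset. The output above shows 525,461 rows and 8 columns: four object columns, one integer column, one datetime column, and two float columns. This information guides the choice of preprocessing strategy.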

# Defining the Target and Cleaning the Data

The target variable is derived from the invoice number: an invoice whose code starts with "C" is treated as a cancellation and labeled 1, and every other invoice is labeled 0.
Columns treated as irrelevant for this model, namely the transaction ID, product code, long description text, and raw date, are dropped.

In [None]:
# =========================
# 2) Define the target & clean columns
# =========================

# Check if 'Invoice' column exists before creating 'is_cancel'
if "Invoice" in df.columns:
    df["is_cancel"] = df["Invoice"].astype(str).str.startswith("C").astype(int)

TARGET = "is_cancel"

drop_cols = [
    "Invoice",
    "StockCode",
    "Description",
    "InvoiceDate"
]

# Only drop columns that actually exist in the DataFrame
existing_drop_cols = [col for col in drop_cols if col in df.columns]
df = df.drop(columns=existing_drop_cols, errors="ignore")
df = df.dropna(subset=[TARGET])

X = df.drop(columns=[TARGET])
y = df[TARGET]
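
As a minimal illustration of the labeling rule above, the sketch below applies the same `startswith("C")` test to a small hand-made DataFrame. The invoice values are made-up examples for illustration, not rows taken from the dataset.

```python
import pandas as pd

# Hypothetical invoice codes for illustration only; a leading "C" marks a cancellation.
toy = pd.DataFrame({"Invoice": [489434, "C489449", 489435, "C489459"]})

# Same rule as in the cell above: cast to string, then test for the "C" prefix.
toy["is_cancel"] = toy["Invoice"].astype(str).str.startswith("C").astype(int)

labels = toy["is_cancel"].tolist()
cancel_rate = toy["is_cancel"].mean()  # share of rows labeled 1
print(labels, cancel_rate)
```

Running `df["is_cancel"].mean()` on the real data in the same way gives the share of cancelled rows, which is worth checking before training because it shows how imbalanced the two classes are.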


# Identifying Numeric and Categorical Features

The features are grouped into numeric and categorical features. This separation matters because each group needs its own preprocessing technique.

In [None]:
# =========================
# 3) Identify numeric vs. categorical features
# =========================
num_cols = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
cat_cols = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

print("Numeric:", num_cols)
print("Categorical:", cat_cols)


Numeric: ['Quantity', 'Price', 'Customer ID']
Categorical: ['Country']


# Data Preprocessing

Preprocessing is done with ColumnTransformer.
Numeric features are imputed with the median and standardized, while categorical features are imputed with the most frequent value and one-hot encoded.

In [None]:
# =========================
# 4) Preprocess pipeline
# =========================
numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_cols),
        ("cat", categorical_pipe, cat_cols)
    ]
)
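
To show what this transformer produces, the sketch below runs the same two pipelines on a four-row toy frame. The rows are made up for illustration, and `sparse_threshold=0` is an addition not used in the notebook; it only forces a dense array so the result is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration; NaN marks a missing value.
toy = pd.DataFrame({
    "Quantity": [12.0, 48.0, np.nan, 24.0],
    "Price": [6.95, 2.10, 1.25, 1.25],
    "Country": pd.Series(
        ["United Kingdom", "France", np.nan, "United Kingdom"], dtype=object
    ),
})

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),   # NaN -> column median
    ("scaler", StandardScaler()),                    # zero mean, unit variance
])
categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),  # NaN -> most common value
    ("onehot", OneHotEncoder(handle_unknown="ignore")),    # one column per category
])

toy_preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, ["Quantity", "Price"]),
        ("cat", categorical_pipe, ["Country"]),
    ],
    sparse_threshold=0,  # assumption for this demo: always return a dense array
)

toy_out = toy_preprocess.fit_transform(toy)
print(toy_out.shape)  # 2 scaled numeric columns + 2 one-hot country columns
```

The output width is the number of numeric columns plus the number of distinct categories seen during fitting, which is why the real data expands to more columns than the four input features.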


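# Splitting the Data into Training and Test Sets

The dataset is split into training and test data with a 70:30 ratio.
Because this is a classification task, a stratified split is used so that the class proportions stay the same in both subsets. The preprocessing pipeline is fitted on the training data only and then applied to the test data.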

In [None]:
# =========================
# 5) Split 70:30
# =========================
is_classification = True

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,
    random_state=SEED,
    stratify=y
)

X_train_p = preprocess.fit_transform(X_train)
X_test_p  = preprocess.transform(X_test)

n_features = X_train_p.shape[1]
print("n_features:", n_features)


n_features: 43
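
The effect of `stratify` can be seen on synthetic labels. The sketch below uses made-up data with 10% positives, not the retail dataset, and checks that both subsets keep that share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels for illustration: 100 positives out of 1000 rows (10%).
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.30,
    random_state=42,
    stratify=y_toy,  # keep the class ratio identical in both subsets
)

print(len(y_tr), len(y_te))      # sizes of the 70:30 split
print(y_tr.mean(), y_te.mean())  # positive share in each subset
```

Without `stratify`, the positive share in each subset would vary with the random draw, which matters most when one class is rare.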


# Building the Neural Network Model

The neural network consists of two dense hidden layers (64 and 32 units) with ReLU activation, a dropout layer with rate 0.2 after the first hidden layer, and a sigmoid output layer for binary classification.

In [None]:
# =========================
# 6) Build model
# =========================
def build_model_classification(n_features):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid")
    ])
    model.compile(
        optimizer="adam",
        loss="binary_crossentropy",
        metrics=["accuracy", tf.keras.metrics.AUC(name="auc")]
    )
    return model

model = build_model_classification(n_features)
model.summary()
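
To make the layer shapes concrete, the sketch below counts the trainable parameters of this architecture and runs one forward pass in plain NumPy. The weights are random placeholders rather than trained values, the input rows are made up, and dropout is left out because it is inactive at inference time.

```python
import numpy as np

n_features = 43  # value printed by the split cell above
layer_sizes = [n_features, 64, 32, 1]

# Each Dense layer holds a weight matrix (n_in x n_out) plus one bias per unit.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)  # [2816, 2080, 33] 4929

rng = np.random.default_rng(42)
# Placeholder weights for illustration only; a trained model has learned values.
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def forward(x):
    """Dense(64, relu) -> Dense(32, relu) -> Dense(1, sigmoid)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)       # ReLU hidden layers
    logit = h @ weights[-1] + biases[-1]
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid maps to (0, 1)


x_batch = rng.normal(size=(5, n_features))   # 5 made-up preprocessed rows
probs = forward(x_batch).ravel()
print(probs.shape)
```

The sigmoid output is read as the predicted probability of class 1 (cancellation), which is what the 0.5 threshold in the evaluation step is applied to.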


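# Training the Model

The model is trained on the training data with Early Stopping to prevent overfitting: training stops once the validation loss has not improved for 5 consecutive epochs, and the weights from the best epoch are restored. A 20% share of the training data is held out as validation data.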
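
The patience rule can be sketched in a few lines of plain Python. This is a simplified model of the behavior, not the Keras implementation (it ignores options such as `min_delta`), and `simulate_early_stopping` is a hypothetical helper name. In the example list, the first five values are the validation losses from this notebook's training log; the remaining values are invented to show the stopping point.

```python
def simulate_early_stopping(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best val_loss)."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop; weights from best_epoch are kept
    return len(val_losses), best_epoch    # ran through all epochs without stopping


# Epochs 1-5: from the training log. Epochs 6-11: invented for illustration.
val_losses = [0.0064, 0.0033, 0.0025, 0.0024, 0.0020,
              0.0021, 0.0022, 0.0023, 0.0025, 0.0026, 0.0030]

stop_epoch, best_epoch = simulate_early_stopping(val_losses, patience=5)
print(stop_epoch, best_epoch)  # 10 5
```

With `restore_best_weights=True`, the model returned after stopping corresponds to the best epoch rather than the last one, so the five extra epochs cost time but not quality.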

In [None]:
# =========================
# 7) Train
# =========================
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor="val_loss",
        patience=5,
        restore_best_weights=True
    )
]

history = model.fit(
    X_train_p, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    callbacks=callbacks,
    verbose=1
)


Epoch 1/50
9196/9196 ━━━━━━━━━━━━━━━━━━━━ 41s 4ms/step - accuracy: 0.9832 - auc: 0.8631 - loss: 0.0737 - val_accuracy: 0.9973 - val_auc: 0.9984 - val_loss: 0.0064
Epoch 2/50
9196/9196 ━━━━━━━━━━━━━━━━━━━━ 39s 4ms/step - accuracy: 0.9973 - auc: 0.9980 - loss: 0.0076 - val_accuracy: 0.9984 - val_auc: 0.9996 - val_loss: 0.0033
Epoch 3/50
9196/9196 ━━━━━━━━━━━━━━━━━━━━ 36s 4ms/step - accuracy: 0.9981 - auc: 0.9982 - loss: 0.0053 - val_accuracy: 0.9991 - val_auc: 0.9996 - val_loss: 0.0025
Epoch 4/50
9196/9196 ━━━━━━━━━━━━━━━━━━━━ 36s 4ms/step - accuracy: 0.9984 - auc: 0.9979 - loss: 0.0048 - val_accuracy: 0.9990 - val_auc: 0.9996 - val_loss: 0.0024
Epoch 5/50
9196/9196 ━━━━━━━━━━━━━━━━━━━━ 34s 4ms/step - accuracy: 0.9986 - auc: 0.9987 - loss: 0.0039 - val_accuracy: 0.9993 - val_auc: 0.9996 - val_loss: 0.0020
Epoch 6/50

# Model Evaluation

Evaluation uses a confusion matrix and a classification report to assess how well the model classifies the test data.

In [None]:
# =========================
# 8) Evaluation
# =========================
y_prob = model.predict(X_test_p).ravel()
y_pred = (y_prob >= 0.5).astype(int)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


4927/4927 ━━━━━━━━━━━━━━━━━━━━ 12s 2ms/step
Confusion Matrix:
 [[154559     18]
 [    20   3042]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    154577
           1       0.99      0.99      0.99      3062

    accuracy                           1.00    157639
   macro avg       1.00      1.00      1.00    157639
weighted avg       1.00      1.00      1.00    157639
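
The report rounds every score to two decimals, so it can help to recompute the class 1 metrics directly from the confusion matrix. The sketch below uses only the four counts printed above and also computes the accuracy of a trivial baseline that always predicts class 0.

```python
# Counts copied from the confusion matrix above
# (rows = true class, columns = predicted class).
tn, fp = 154559, 18
fn, tp = 20, 3042

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of all predicted cancellations, how many are real
recall = tp / (tp + fn)      # of all real cancellations, how many are found
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total  # accuracy of always predicting class 0

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"majority-class baseline accuracy={majority_baseline:.4f}")
```

Only about 1.9% of the test rows are cancellations, so always predicting "not cancelled" would already reach roughly 98% accuracy. Precision and recall on class 1 are therefore the more informative numbers here.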



# Saving the Model

The trained model is saved to Google Drive so it can be reused later.

In [None]:
# =========================
# 9) Save model
# =========================
model.save("/content/drive/MyDrive/Machine Learning/model_dataset1.keras")
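
The saved Keras file holds only the network, so reusing it on new data also requires the fitted preprocessing step. One possible approach, which is an assumption and not part of the original notebook, is to persist the fitted transformer with joblib. The sketch below uses a small `StandardScaler`, made-up data, and a temporary directory as stand-ins for `preprocess` and the Drive path.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted `preprocess` object; the data is made up.
X_demo = np.array([[12.0, 6.95], [48.0, 2.10], [24.0, 1.25]])
scaler = StandardScaler().fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "preprocess_demo.joblib")  # hypothetical file name
    joblib.dump(scaler, path)      # write the fitted transformer to disk
    restored = joblib.load(path)   # read it back, as a later session would

# The restored transformer should reproduce the original transformation.
same_output = np.allclose(scaler.transform(X_demo), restored.transform(X_demo))
print(same_output)
```

In the notebook itself, the same two calls could be applied to `preprocess` with a path next to the saved model, so that new transactions are scaled and encoded exactly as the training data was.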
