## UTS Machine Learning - Clustering

**Name:** Agatha Kinanthi Pramdriswara Truly Amorta

**Class:** TK-46-04

**NIM:** 1103223212


- *This notebook is part of the midterm assignment for the Machine Learning course.*

- *The objective is to build an end-to-end clustering pipeline to segment credit card customers based on their usage and payment behavior.*


**1. Imports**

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and Clustering
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

**2. Mount Google Drive and Load Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

datasets = '/content/drive/MyDrive/Machine-Learning-Midterm-Datasets/'  # path to the dataset folder

df = pd.read_csv(datasets + 'clusteringmidterm.csv', nrows=50000)  # nrows caps the number of rows read
print("Shape:", df.shape)
print("First 5 rows of the dataset: \n")
df.head()

**3. Exploratory Data Analysis (EDA)**

In [None]:
df.info()

In [None]:
df.describe().T

In [None]:
# Missing values
print("Missing values per column:\n", df.isnull().sum().sort_values(ascending=False).head(10))

In [None]:
# Distribution (BALANCE)

plt.figure(figsize=(5,3))
sns.histplot(df['BALANCE'], bins=30, kde=True)
plt.title("BALANCE distribution")
plt.show()

In [None]:
# Distribution (PURCHASES)

plt.figure(figsize=(5,3))
sns.histplot(df['PURCHASES'], bins=30, kde=True)
plt.title("PURCHASES distribution")
plt.show()

In [None]:
# Distribution (CREDIT_LIMIT)

plt.figure(figsize=(5,3))
sns.histplot(df['CREDIT_LIMIT'], bins=30, kde=True)
plt.title("CREDIT_LIMIT distribution")
plt.show()

In [None]:
# The same three distributions, combined into a single figure:

fig, axes = plt.subplots(1, 3, figsize=(17,5))
sns.histplot(df['BALANCE'], bins=30, ax=axes[0], kde=True)
sns.histplot(df['PURCHASES'], bins=30, ax=axes[1], kde=True)
sns.histplot(df['CREDIT_LIMIT'], bins=30, ax=axes[2], kde=True)

axes[0].set_title("BALANCE distribution")
axes[1].set_title("PURCHASES distribution")
axes[2].set_title("CREDIT_LIMIT distribution")
plt.show()

**4. Preprocessing**

In [None]:
df_clust = df.drop(columns=['CUST_ID']) #remove CUST_ID
df_clust = df_clust.fillna(df_clust.median()) #handle missing values

# Standardize dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_clust)

# Dataset shape
print("Shape after preprocessing:", X_scaled.shape)

In [None]:
# Scaling output
print("First 5 rows (scaled):\n", X_scaled[:5])

In [None]:
# Check for missing values
print("Missing values after processing:\n", df_clust.isnull().sum().sum())

**5. Elbow and Silhouette Method**

In [None]:
inertia = []  # within-cluster sum of squared distances to the nearest centroid (Elbow Method)
sil = []  # mean silhouette score, a measure of cluster separation (Silhouette Score)
K = range(2, 8)

for k in K:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X_scaled)
    inertia.append(km.inertia_)
    sil.append(silhouette_score(X_scaled, labels))

plt.figure(figsize=(12,5))
plt.subplot(1,2,1)
plt.plot(K, inertia, '-o'); plt.title('Elbow Method')
plt.subplot(1,2,2)
plt.plot(K, sil, '-o'); plt.title('Silhouette Score')
plt.show()

**Interpretation:**

Elbow Method (using inertia):
- Inertia keeps decreasing as k increases, so the value to look for is the point where the decrease begins to slow down → *“elbow point”*.
---
Silhouette Score:
- A value close to 1 → clusters are compact and well separated.
- A value close to 0 → clusters overlap.
- A negative value → many points are likely assigned to the wrong cluster.
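As a rough illustration of how the two criteria behave, the sketch below repeats the same loop on synthetic data where the true number of groups is known. The blob centers, sample size, and the `_demo` variable names are assumptions made for this example only; they are not taken from the credit card dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption): four well-separated 2-D blobs.
blob_centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
X_demo, _ = make_blobs(n_samples=400, centers=blob_centers,
                       cluster_std=1.0, random_state=42)

k_values = list(range(2, 8))
inertia_demo, sil_demo = [], []
for k in k_values:
    km_demo = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_demo = km_demo.fit_predict(X_demo)
    inertia_demo.append(km_demo.inertia_)  # within-cluster sum of squared distances
    sil_demo.append(silhouette_score(X_demo, labels_demo))  # mean silhouette over all points

# Pick the k with the highest mean silhouette score.
best_k = k_values[int(np.argmax(sil_demo))]
print("Best k by silhouette:", best_k)
```

On real customer data the peak is usually much less sharp than on clean blobs, so the silhouette maximum is best read together with the elbow plot rather than on its own.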


**6. KMeans Clustering**

In [None]:
#k=4 based on elbow/silhouette analysis
k_final = 4
km = KMeans(n_clusters=k_final, random_state=42, n_init=10)
labels = km.fit_predict(X_scaled)

df['Cluster'] = labels
print("Silhouette final:", silhouette_score(X_scaled, labels))
# Closer to 1 = better-defined clusters; around 0 = overlapping clusters; negative = poor clustering

# Cluster centers (scaled back to original space)
centers = scaler.inverse_transform(km.cluster_centers_)
centers_df = pd.DataFrame(centers, columns=df_clust.columns)
centers_df
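To make the "scaled back to original space" step concrete, here is a minimal sketch on a toy matrix. The values and the names `X_toy`, `toy_scaler`, and `center_scaled` are hypothetical. With the default `StandardScaler` settings, `inverse_transform` should amount to multiplying by `scale_` and adding `mean_`, so a center at the origin of the scaled space maps back to the column means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values, for illustration only).
X_toy = np.array([[100.0, 1000.0],
                  [300.0, 3000.0],
                  [500.0, 8000.0]])

toy_scaler = StandardScaler()
X_toy_scaled = toy_scaler.fit_transform(X_toy)

# A hypothetical "center" at the origin of the scaled space.
center_scaled = np.zeros((1, 2))
center_original = toy_scaler.inverse_transform(center_scaled)

# Manual equivalent for the default settings: x * scale_ + mean_.
manual = center_scaled * toy_scaler.scale_ + toy_scaler.mean_
print(center_original)
```

Reading the centers in original units is what allows the segments to be described in terms such as balance or credit limit instead of z-scores.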

**7. Visualization**

In [None]:
plt.figure(figsize=(7,5))
sns.scatterplot(x=df['BALANCE'], y=df['PURCHASES'], hue=df['Cluster'], palette='Set2')
plt.title("Clusters by BALANCE vs PURCHASES")
plt.show()

**Interpretation of Results:**

- KMeans was fit on the standardized features of the credit card customer dataset, after dropping `CUST_ID` and filling missing values with column medians.
- The elbow and silhouette plots over *k = 2* to *k = 7* pointed to an optimal number of clusters of around *k = 4*.
- The final silhouette score printed in Section 6 was taken to indicate reasonably good cluster quality.
---
**Segment interpretation:**
- Cluster 0: customers with low balances and few transactions.
- Cluster 1: customers with high balances and many purchases.
- Cluster 2: customers who frequently make cash advances.
- Cluster 3: customers with high credit limits who make full payments more often.
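Segment descriptions like these are typically derived from per-cluster averages. The sketch below shows one way to build such a profile table; the `toy` DataFrame and its values are invented for illustration, and only the column names `BALANCE`, `PURCHASES`, `CREDIT_LIMIT`, and `Cluster` come from this notebook.

```python
import pandas as pd

# Made-up rows standing in for the labeled customer data (illustration only).
toy = pd.DataFrame({
    "BALANCE": [100.0, 200.0, 5000.0, 7000.0],
    "PURCHASES": [50.0, 150.0, 3000.0, 1000.0],
    "CREDIT_LIMIT": [1000.0, 1000.0, 9000.0, 11000.0],
    "Cluster": [0, 0, 1, 1],
})

# Mean of each feature per cluster: the basis for describing a segment.
profile = toy.groupby("Cluster").mean()

# Number of customers per cluster, to see how large each segment is.
sizes = toy["Cluster"].value_counts().sort_index()

print(profile)
print(sizes)
```

Applied to the real data, the same two calls on `df` would give a table that can be compared against `centers_df` to check that each description matches its cluster number.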

### **Conclusion**

Clustering helps to understand credit card customer behavior, which is useful for marketing segmentation and risk management.