# CASE STUDY: PCA AND CLUSTERING

# GOAL

- We want to cluster the bank's client data.
- The hope is to **understand the characteristics of the bank's clients** with respect to the **campaigns** that were run.


---
# Dataset Information
- The data relates to direct marketing campaigns of a Portuguese banking institution.
- The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not.
- The dataset `bank.csv` is ordered by date (from May 2008 to November 2010).
- The **exercise goal** is to discover interesting things about the data.

**Variables**

<u>Numeric</u>
- `age`
- `balance`: average yearly balance, in euros
- `duration`: last contact duration, in seconds
- `campaign`: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- `previous`: number of contacts performed before this campaign and for this client

<u>Categorical</u>
- `job` : type of job (categorical) 
- `marital` : marital status (categorical)
- `education` (categorical)
- `default`: has credit in default? (binary: "yes","no")
- `housing`: has housing loan? (binary: "yes","no")
- `loan`: has personal loan? (binary: "yes","no")
- `contact`: contact communication type (categorical) 
- `day`: last contact day of the month 
- `month`: last contact month of year (categorical)
- `poutcome`: outcome of the previous marketing campaign (categorical)


Source :  S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. <br>
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

---
# Import Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
def importData(path, col_to_drop):
    # Read the data
    data = pd.read_csv(path)
    print(f"Initial data               : {data.shape}, (#observations, #features)")

    # Drop unused columns
    data = data.drop(columns = col_to_drop)
    print(f"Data after dropping columns: {data.shape}, (#observations, #features)")

    # Drop duplicate rows
    print(f"There are {data.duplicated().sum()} duplicate rows")
    data = data.drop_duplicates()
    print(f"Data after dropping dupes  : {data.shape}, (#observations, #features)")

    return data


In [3]:
filepath = "bank.csv"
col_to_drop = "Unnamed: 0"

data = importData(path = filepath,
                  col_to_drop = col_to_drop)

FileNotFoundError: [Errno 2] No such file or directory: 'bank.csv'
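The `FileNotFoundError` above means `bank.csv` was not in the working directory when the cell ran, which is why the remaining cells show no output. A minimal sketch of a guarded load is shown below. `load_csv_or_none` is a hypothetical helper that is not part of the original notebook, and the temporary file only stands in for the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_csv_or_none(path):
    """Hypothetical helper: return a DataFrame, or None when the file is missing."""
    path = Path(path)
    if not path.exists():
        print(f"File not found: {path.resolve()}")
        return None
    return pd.read_csv(path)


# Demonstration with a throwaway file standing in for bank.csv
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    pd.DataFrame({"age": [30, 41], "balance": [100, -5]}).to_csv(demo_path, index=False)

    demo_loaded = load_csv_or_none(demo_path)                       # file exists
    demo_missing = load_csv_or_none(Path(tmp_dir) / "missing.csv")  # file does not exist
```

Checking the path first gives a readable message and lets the notebook stop early instead of failing inside `pd.read_csv`.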

In [None]:
data.head()

---
# Data Preprocessing
## Train-Test Split

- We do not separate input and output, because we are going to analyze the structure of the data.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
data_train, data_test = train_test_split(data,
                                         test_size = 0.25,
                                         random_state = 123)

In [None]:
print(data_train.shape)
print(data_test.shape)
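Since `bank.csv` was not available when this notebook ran, here is a small sketch of the same split on a made-up frame. The `toy` DataFrame and its values are assumptions for illustration only. It shows that `train_test_split` accepts a single DataFrame and returns two row-wise disjoint pieces.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, two numeric columns
toy = pd.DataFrame({"age": range(100), "balance": range(100, 200)})

# Same arguments as in the notebook: 25% of the rows go to the test set
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=123)

print(toy_train.shape, toy_test.shape)
```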

In [None]:
data_train.head()

In [None]:
data_test.head()

## Numerical & Categorical Split
- Check the number of unique values in each column

In [None]:
for col in data_train.columns:
    print(f"col: {col}, #unique: {len(data_train[col].unique())}")

- We treat `day` and `month` as numeric in this exercise, because they have a large number of unique values.

In [None]:
num_col = ["age", "day", "month", "balance",
           "duration", "campaign", "pdays", "previous"]
# Keep the original column order (a set difference would return an arbitrary order)
cat_col = [col for col in data_train.columns if col not in num_col]

print(num_col)
print(cat_col)

In [None]:
def splitNumCat(data, num_col, cat_col):
    data_num = data[num_col]
    data_cat = data[cat_col]

    return data_num, data_cat


In [None]:
data_train_num, data_train_cat = splitNumCat(data = data_train,
                                             num_col = num_col,
                                             cat_col = cat_col)

In [None]:
data_train_num.head()

The `month` feature needs to be transformed into numbers.

In [None]:
data_train_cat.head()


## Handling Data - Impute & Standardize
**Transform** - fitur `month`

In [None]:
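def transformMonth(data):
    # Work on a copy so the caller's DataFrame (a slice of the raw data) is not modified
    data = data.copy()

    # Map month abbreviations to month numbers: "jan" -> 1, ..., "dec" -> 12
    month_list = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]
    month_map = {month: i + 1 for i, month in enumerate(month_list)}

    data["month"] = data["month"].replace(month_map)

    return data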


In [None]:
data_train_num = transformMonth(data = data_train_num)

In [None]:
data_train_num.head()
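The month conversion can be checked on a tiny made-up Series. This is a sketch only: `toy_month` is an assumption, and it uses `Series.map` instead of `replace`. With `map`, any label missing from the dictionary becomes `NaN` rather than being left as a string, which may or may not be what you want.

```python
import pandas as pd

month_list = ["jan", "feb", "mar", "apr", "may", "jun",
              "jul", "aug", "sep", "oct", "nov", "dec"]

# "jan" -> 1, "feb" -> 2, ..., "dec" -> 12
month_map = {name: number for number, name in enumerate(month_list, start=1)}

toy_month = pd.Series(["may", "jan", "dec"])
toy_month_num = toy_month.map(month_map)
```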

Missing Values - Numerical

In [None]:
# Check for missing values
data_train_num.isna().any()

In [None]:
# Build an imputer, in case the test data needs it
from sklearn.impute import SimpleImputer

def imputerNumeric(data, imputer = None):
    if imputer is None:
        # Fit a new imputer
        imputer = SimpleImputer(missing_values = np.nan,
                                strategy = "median")
        imputer.fit(data)

    # Transform data
    data_imputed = imputer.transform(data)
    data_imputed = pd.DataFrame(data = data_imputed,
                                columns = data.columns,
                                index = data.index)

    return data_imputed, imputer


In [None]:
data_train_num_imputed, num_imputer = imputerNumeric(data = data_train_num)

In [None]:
data_train_num_imputed.head()
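The reason `imputerNumeric` returns the fitted imputer is that the median learned on the training data should be reused on the test data. A minimal sketch with made-up numbers follows; `toy_train` and `toy_test` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_train = pd.DataFrame({"balance": [1.0, 3.0, 5.0, np.nan]})
toy_test = pd.DataFrame({"balance": [np.nan, 10.0]})

# Fit on train only: the median of [1, 3, 5] is 3
toy_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
toy_imputer.fit(toy_train)

# The missing test value is filled with the train median, not the test median
toy_test_imputed = toy_imputer.transform(toy_test)
```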

Standardizing - Numerical

In [None]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler
def fitStandardize(data):
    scaler = StandardScaler()
    scaler.fit(data)

    return scaler

# Transform data with a fitted scaler
def transformStandardize(data, scaler):
    data_scaled = scaler.transform(data)
    data_scaled = pd.DataFrame(data = data_scaled,
                               columns = data.columns,
                               index = data.index)
    
    return data_scaled


In [None]:
# Fit the scaler
num_scaler = fitStandardize(data = data_train_num_imputed)

# Transform data
data_train_num_clean = transformStandardize(data = data_train_num_imputed,
                                            scaler = num_scaler)

In [None]:
data_train_num_clean.head()
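As a quick sanity check, `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`). The sketch below verifies this on a made-up frame; `toy` is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({"age": [20.0, 30.0, 40.0],
                    "balance": [0.0, 100.0, 200.0]})

toy_scaler = StandardScaler().fit(toy)
toy_scaled = toy_scaler.transform(toy)

# Manual z-score with the population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)
```

After scaling, every column has mean 0 and standard deviation 1, so features measured in euros and in seconds contribute on a comparable scale to PCA and k-means.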


## Handling Data 2 - PCA (Dimensionality Reduction)

**Goal**: represent data in fewer dimensions

In [None]:
# Import package PCA - Sklearn
from sklearn.decomposition import PCA

In [None]:
# Define PCA with random state
pca_obj = PCA(random_state = 123)

In [None]:
# Fit to data_train_num_clean
pca_obj.fit(data_train_num_clean)

Get the principal components

In [None]:
# Show PCA Component
pca_component = pca_obj.components_

# Turn to dataframe
pca_component = pd.DataFrame(data = pca_component,
                             columns = data_train_num_clean.columns)
pca_component

Get the explained variance

In [None]:
# Explained variance
pca_obj.explained_variance_
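To see where these numbers come from, the sketch below compares `explained_variance_` with the eigenvalues of the sample covariance matrix on made-up data. `X_toy` is an assumption and not the bank data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated columns

toy_pca = PCA(random_state=123).fit(X_toy)

# Eigenvalues of the sample covariance matrix (ddof=1), sorted in descending order
cov = np.cov(X_toy, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Each eigenvalue divided by the total variance gives the explained variance ratio
ratio = eigenvalues / eigenvalues.sum()
```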

In [None]:
# Explained variance ratio
pca_obj.explained_variance_ratio_


As we can see,
- PC 1 is the first row of the `pca_component` dataframe
- PC 1 explains 19.3% of the variation in the data

*Transform the data with the principal components*

In [None]:
# Transform data
data_train_num_pca = pca_obj.transform(data_train_num_clean)

# Store the result as a dataframe
col_names = [f"PC_{i+1}" for i in range(data_train_num_pca.shape[1])]
data_train_num_pca = pd.DataFrame(data = data_train_num_pca,
                                  columns = col_names,
                                  index = data_train_num_clean.index)

data_train_num_pca.head()
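What `transform` does can be sketched by hand: it centers the data with the mean learned during `fit`, then projects it onto the component vectors. The example uses made-up data; `X_toy` is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
X_toy = rng.normal(size=(100, 4))

toy_pca = PCA(random_state=123).fit(X_toy)

# Center with the fitted mean, then project onto the components (rows of components_)
manual_scores = (X_toy - toy_pca.mean_) @ toy_pca.components_.T
sklearn_scores = toy_pca.transform(X_toy)
```

Because the notebook's data is already standardized, its column means are approximately zero, so multiplying the data directly by the transposed components gives essentially the same scores.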

*How many principal components?*

- Choose enough components to retain a given percentage of the variance in the data

In [None]:
# If all components are used, the explained variance is
sum(pca_obj.explained_variance_ratio_)

In [None]:
# If n components are chosen, the explained variance is
for i in range(1, len(pca_obj.explained_variance_ratio_) + 1):
    sum_of_variance_n = sum(pca_obj.explained_variance_ratio_[:i]) * 100
    print(f"n_component: {i}, %variance explained: {sum_of_variance_n:.2f} %")

- To retain 90% of the variance, choose 7 components.
- The number of components can be treated as part of the experimentation.
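The rule above can be sketched in code: take the cumulative sum of the explained variance ratios and pick the first count that reaches the target. scikit-learn can also do this directly when `n_components` is a float between 0 and 1 (with `svd_solver="full"`). The data below is made up, so the resulting count is not the notebook's 7.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(123)
X_toy = rng.normal(size=(300, 8)) * np.arange(1, 9)  # columns with increasing spread

# Manual rule: first n whose cumulative explained variance ratio reaches 90%
full_pca = PCA(random_state=123).fit(X_toy)
cum_ratio = np.cumsum(full_pca.explained_variance_ratio_)
n_manual = int(np.argmax(cum_ratio >= 0.90) + 1)

# Built-in alternative: pass the target fraction as n_components
auto_pca = PCA(n_components=0.90, svd_solver="full", random_state=123).fit(X_toy)

print(n_manual, auto_pca.n_components_)
```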


*Create a user-defined function for PCA*

In [None]:
def fitPCA(data):
    # Create the PCA object
    pca_obj = PCA(random_state = 123)

    # Fit PCA on the data
    pca_obj.fit(data)

    # Show the cumulative explained variance
    print("Explained variance using n_components:")
    for i in range(1, len(pca_obj.explained_variance_ratio_) + 1):
        sum_of_variance_n = sum(pca_obj.explained_variance_ratio_[:i]) * 100
        print(f"n_component: {i}, %variance explained: {sum_of_variance_n:.2f} %")

    print()

    # Ask the user for n_components
    n_comp = int(input("n_components : "))

    # Refit PCA with the chosen number of components
    pca_obj = PCA(n_components = n_comp,
                  random_state = 123)
    pca_obj.fit(data)

    # Extract the components
    pca_component = pca_obj.components_[:n_comp]

    # Turn to dataframe
    pca_component = pd.DataFrame(data = pca_component,
                                 columns = data.columns)

    return pca_component, pca_obj


In [None]:
pca_component, pca_obj = fitPCA(data = data_train_num_clean)

In [None]:
pca_component

In [None]:
# Create a function to transform the data
def transformPCA(data, pca_obj):
    # Transform data
    data_pca = pca_obj.transform(data)

    cols = [f"PC_{i+1}" for i in range(data_pca.shape[1])]
    data_pca = pd.DataFrame(data = data_pca,
                            columns = cols,
                            index = data.index)
    
    return data_pca


In [None]:
data_train_num_pca = transformPCA(data = data_train_num_clean,
                                  pca_obj = pca_obj)

In [None]:
# Check the PCA-transformed data
data_train_num_pca.head()

In [None]:
# Check the components
pca_component

Create a biplot

In [None]:
import matplotlib.pyplot as plt

In [None]:
transformed_data = data_train_num_clean @ pca_component[:2].T
transformed_data.head()

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (10, 7))

ax.scatter(transformed_data[0][data_train_cat["poutcome"]=="success"], 
           transformed_data[1][data_train_cat["poutcome"]=="success"], 
           marker=".", 
           c="red", #s=10,
           alpha=.2,
           label = "SUCCESS")

# Clients whose previous campaign outcome was "failure" (matches the FAILED label)
ax.scatter(transformed_data[0][data_train_cat["poutcome"]=="failure"],
           transformed_data[1][data_train_cat["poutcome"]=="failure"],
           marker=".",
           c="blue", #s=10,
           alpha=.2,
           label = "FAILED")

for col in pca_component.columns:
    # Loadings of this feature on PC 1 and PC 2, scaled by 5 for visibility
    loading = np.array(pca_component[col].loc[0:1])*5.
    x_points = [0, loading[0]]
    y_points = [0, loading[1]]

    # Draw a line from the origin to (PC 1 loading, PC 2 loading)
    ax.plot(x_points, y_points, marker="o", label=col)

ax.set_ylabel("Second Principal Component")
ax.set_xlabel("First Principal Component")
ax.set_xlim([-2.0, 5])
ax.set_ylim([-2.0, 5])
plt.grid()
plt.legend()
plt.show()

How should this be interpreted?
- PC_1 puts large weights on `pdays` and `previous`, while the weights on `duration`, `balance`, and `age` are small.
- This means `pdays` and `previous` are correlated with each other.
- The larger `previous` is, the larger `pdays` tends to be.
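The link between loadings and correlation can be illustrated on synthetic data. In the sketch below, `a` and `b` are built to be strongly correlated and `c` is independent; all three are made-up assumptions. The first component then loads heavily on `a` and `b` with the same sign, and barely on `c`. Keep in mind that the overall sign of a component is arbitrary, so only the relative signs of the loadings carry meaning.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
a = rng.normal(size=500)
b = a + 0.3 * rng.normal(size=500)   # strongly correlated with a
c = rng.normal(size=500)             # independent of a and b

X_toy = StandardScaler().fit_transform(np.column_stack([a, b, c]))
toy_pca = PCA(random_state=123).fit(X_toy)

pc1 = toy_pca.components_[0]         # loadings of a, b, c on the first component
corr_ab = np.corrcoef(a, b)[0, 1]    # direct check of the correlation

print("PC1 loadings:", np.round(pc1, 2))
print("corr(a, b):  ", round(corr_ab, 2))
```

For the bank data, the same direct check would be the correlation between the `pdays` and `previous` columns.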


---
# Modeling Clustering - Data Full
- **Goal**: form separate groups of observations with similar characteristics, and assign each observation to a cluster
- **CLUSTERING IS A SUBJECTIVE TASK**

In [None]:
from sklearn.cluster import KMeans

Create the clustering object

In [None]:
# Create the k-means object
kmeans_obj = KMeans(n_clusters = 3,
                    random_state = 123)

In [None]:
# Fit the k-means object
kmeans_obj.fit(data_train_num_clean)

Predict the clusters

In [None]:
# Predict Cluster
kmeans_obj.predict(data_train_num_clean)

In [None]:
# Reshape predicted cluster to dataframe
cluster_result = kmeans_obj.predict(data_train_num_clean)
cluster_result = pd.DataFrame(data = cluster_result,
                              columns = ["cluster"],
                              index = data_train_num_clean.index)

In [None]:
cluster_result.head()
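The same fit-then-predict pattern can be tried on synthetic data where the true grouping is known. The three well-separated blobs below are made up for illustration, and `n_init=10` is set explicitly as an assumption so the behaviour does not depend on the scikit-learn version.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])

# 50 points around each center, stacked in order
X_toy = np.vstack([center + rng.normal(size=(50, 2)) for center in centers])
X_toy = pd.DataFrame(X_toy, columns=["x1", "x2"])

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(X_toy)

# Keep the original index so the labels can be joined back to the data
toy_labels = pd.Series(toy_kmeans.predict(X_toy), index=X_toy.index, name="cluster")
toy_share = toy_labels.value_counts(normalize=True)
```

The numeric labels (0, 1, 2) are arbitrary: which blob gets which number depends on the initialization, so only the grouping itself is meaningful.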

Check the cluster proportions

In [None]:
cluster_result["cluster"].value_counts(normalize = True)

- Two of the clusters have a share above 43% of the data

Check the centroids as the representation of each cluster

In [None]:
# Check centroid
kmeans_obj.cluster_centers_

In [None]:
# Turn into a dataframe
centroids = kmeans_obj.cluster_centers_
centroids = pd.DataFrame(data = centroids,
                         columns = data_train_num_clean.columns)

centroids

- The values above cannot be interpreted directly,
- because they are in standardized form.
- We have to convert them back to the original scale, before standardization.

Inverse transform with the standardizer

In [None]:
centroid_real = num_scaler.inverse_transform(centroids)
centroid_real = pd.DataFrame(data = centroid_real,
                             columns = data_train_num_clean.columns)

centroid_real
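What `inverse_transform` computes is simply `value * scale_ + mean_` per column. A small sketch with made-up numbers: `X_raw` and the standardized centroids are assumptions, not results from the bank data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up raw data: columns could be read as age and balance
X_raw = np.array([[20.0, 100.0],
                  [30.0, 300.0],
                  [40.0, 500.0]])
toy_scaler = StandardScaler().fit(X_raw)

# Made-up centroids in standardized units
centroid_scaled = np.array([[0.0, 0.0],
                            [1.0, -1.0]])

centroid_raw = toy_scaler.inverse_transform(centroid_scaled)
manual = centroid_scaled * toy_scaler.scale_ + toy_scaler.mean_
```

A standardized centroid of all zeros maps back to the column means, here 30 and 300.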


*So what does this mean?* You have to translate it yourself.
- Cluster 1 (label 0) is the **group** that
    - was contacted at the beginning of the month
    - has been contacted twice **during** this campaign
    - was never contacted **before** this campaign
    
*BEST K?*

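Score: the negative within-cluster sum of squares

$$
\text{score} = - \sum_{i=1}^{n} \lVert x_{i} - \mu_{c(i)} \rVert^{2}
$$

where $\mu_{c(i)}$ is the centroid of the cluster to which $x_{i}$ is assigned.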

In [None]:
# Show the score (sign flipped so it is a positive sum of squares)
-kmeans_obj.score(data_train_num_clean)
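The within-cluster sum of squares can also be computed by hand, which makes the sign convention of `score` explicit. This is a sketch on made-up data; `X_toy` is an assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
X_toy = rng.normal(size=(200, 3))

toy_kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(X_toy)

# For every point, look up the centroid of its assigned cluster
labels = toy_kmeans.predict(X_toy)
assigned_centroids = toy_kmeans.cluster_centers_[labels]

# Sum of squared distances between each point and its centroid
wcss_manual = np.sum((X_toy - assigned_centroids) ** 2)

# score() returns the negative of that sum, hence the minus sign
wcss_sklearn = -toy_kmeans.score(X_toy)
```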

Try several numbers of clusters

In [None]:
score_list = []
k_list = np.arange(2, 11, 1)

for k in k_list:
    # Create the object
    kmeans_obj_k = KMeans(n_clusters = k,
                          max_iter = 50,
                          random_state = 123)
    
    # Fit data
    kmeans_obj_k.fit(data_train_num_clean)

    # Store the score
    score_k = -kmeans_obj_k.score(data_train_num_clean)
    score_list.append(score_k)


In [None]:
score_list

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 7))

ax.plot(k_list, score_list, "r", marker="o")

ax.set_xlabel("number of clusters")
ax.set_ylabel("within-cluster sum of squares")
plt.show()

- The more clusters, the lower the score.
- But the more clusters, the more complex the interpretation.
- We take 9 as the best number of clusters, because the change in error at 10 clusters becomes small.
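The "change in error becomes small" reasoning can be made explicit by looking at how much the within-cluster sum of squares drops each time one more cluster is added. The sketch below uses three made-up blobs, so the elbow is expected at k = 3. It illustrates the heuristic and does not reproduce the notebook's result.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(123)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_toy = np.vstack([center + rng.normal(size=(100, 2)) for center in centers])

k_values = np.arange(1, 7)
wcss = np.array([
    -KMeans(n_clusters=k, n_init=10, random_state=123).fit(X_toy).score(X_toy)
    for k in k_values
])

# drops[i] is the improvement when going from k_values[i] to k_values[i + 1]
drops = -np.diff(wcss)
for k, drop in zip(k_values[1:], drops):
    print(f"k = {k}: WCSS drops by {drop:.1f}")
```

A reasonable choice is the last k for which the drop is still large. Past that point, extra clusters mostly split groups that already belong together.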

In [None]:
# Create the object
kmeans_obj_best = KMeans(n_clusters = 9,
                         random_state = 123)

# Fit object
kmeans_obj_best.fit(data_train_num_clean)

Show the centroids

In [None]:
# Turn the centroids into a dataframe
centroids_best = kmeans_obj_best.cluster_centers_
centroids_best = pd.DataFrame(data = centroids_best,
                              columns = data_train_num_clean.columns)

# Inverse transform
centroid_real_best = num_scaler.inverse_transform(centroids_best)
centroid_real_best = pd.DataFrame(data = centroid_real_best,
                                  columns = data_train_num_clean.columns)

centroid_real_best


*Try to interpret the centroids above.*


**Predict Cluster**

In [None]:
cluster_best = kmeans_obj_best.predict(data_train_num_clean)

cluster_best = pd.DataFrame(data = cluster_best,
                            columns = ["cluster"],
                            index= data_train_num_clean.index)
cluster_best.head()


---
# Modeling Clustering - Data PCA


*Try several numbers of clusters*

In [None]:
score_list = []
k_list = np.arange(2, 11, 1)

for k in k_list:
    # Create the object
    kmeans_obj_k = KMeans(n_clusters = k,
                          max_iter = 50,
                          random_state = 123)
    
    # Fit data
    kmeans_obj_k.fit(data_train_num_pca)

    # Store the score
    score_k = -kmeans_obj_k.score(data_train_num_pca)
    score_list.append(score_k)


In [None]:
score_list

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 7))

ax.plot(k_list, score_list, "r", marker="o")

ax.set_xlabel("number of clusters")
ax.set_ylabel("within-cluster sum of squares")
plt.show()

- The more clusters, the lower the score.
- But the more clusters, the more complex the interpretation.
- We take 7 as the best number of clusters, because the change in error for larger numbers of clusters becomes small.

In [None]:
# Create the object
kmeans_obj_pca_best = KMeans(n_clusters = 7,
                             random_state = 123)

# Fit object
kmeans_obj_pca_best.fit(data_train_num_pca)

Predict Cluster

In [None]:
cluster_pca_best = kmeans_obj_pca_best.predict(data_train_num_pca)

cluster_pca_best = pd.DataFrame(data = cluster_pca_best,
                                columns = ["cluster"],
                                index = data_train_num_pca.index)
cluster_pca_best.head()

Centroid PCA

In [None]:
# Get the centroids
centroid_pca_best = kmeans_obj_pca_best.cluster_centers_
centroid_pca_best = pd.DataFrame(data = centroid_pca_best,
                                 columns = data_train_num_pca.columns)
centroid_pca_best

In [None]:
# Inverse transform the PCA centroids back to the standardized feature space
# so that they can be interpreted
centroid_pca_best_inv = pca_obj.inverse_transform(centroid_pca_best)
centroid_pca_best_inv = pd.DataFrame(centroid_pca_best_inv,
                                     columns = data_train_num_clean.columns)
centroid_pca_best_inv

In [None]:
# Inverse transform the standardization back to the original units
# so that the centroids can be interpreted
centroid_pca_best_real = num_scaler.inverse_transform(centroid_pca_best_inv)
centroid_pca_best_real = pd.DataFrame(centroid_pca_best_real,
                                      columns = data_train_num_clean.columns)

centroid_pca_best_real

Now the centroids can be interpreted.
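One caveat is worth keeping in mind: going back from PCA space is exact only when all components are kept. With fewer components, the values in original units are an approximation. The sketch below applies the two inverse transforms in the same order as above on made-up data; `X_raw` and `round_trip` are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(200, 5))

toy_scaler = StandardScaler().fit(X_raw)
X_scaled = toy_scaler.transform(X_raw)


def round_trip(n_components):
    """Project to PCA space and map back to the original units."""
    toy_pca = PCA(n_components=n_components, random_state=123).fit(X_scaled)
    scores = toy_pca.transform(X_scaled)
    back_scaled = toy_pca.inverse_transform(scores)    # PCA space -> standardized space
    return toy_scaler.inverse_transform(back_scaled)   # standardized -> original units


error_full = np.abs(round_trip(5) - X_raw).max()      # all components kept
error_reduced = np.abs(round_trip(3) - X_raw).max()   # two components dropped

print(f"max error, 5 components: {error_full:.2e}")
print(f"max error, 3 components: {error_reduced:.2e}")
```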

---
# Clustering Test Data

## Preprocessing Test Data

In [None]:
def transformTestData(data, num_col, cat_col, num_imputer, num_scaler):
    # 1. Split num-cat data
    data_num, _ = splitNumCat(data = data,
                              num_col = num_col,
                              cat_col = cat_col)

    # 2. Handling Data
    # 2.1 transform month
    data_num = transformMonth(data = data_num)

    # 2.2 impute data (use the data passed in, with the imputer fitted on train)
    data_num_imputed, _ = imputerNumeric(data = data_num,
                                         imputer = num_imputer)

    # 2.3 Standardization
    data_num_scaled = transformStandardize(data = data_num_imputed,
                                           scaler = num_scaler)

    return data_num_scaled
    

In [None]:
data_test_clean = transformTestData(data = data_test,
                                    num_col = num_col,
                                    cat_col = cat_col,
                                    num_imputer = num_imputer,
                                    num_scaler = num_scaler)

In [None]:
data_test_clean.head()

Transform PCA

In [None]:
data_test_clean_pca = transformPCA(data = data_test_clean,
                                   pca_obj = pca_obj)

In [None]:
data_test_clean_pca.head()

## Predict Test Data

Predict the test data with the full-feature model, then with the PCA model

In [None]:
kmeans_obj_best.predict(data_test_clean)

In [None]:
kmeans_obj_pca_best.predict(data_test_clean_pca)
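As an alternative to passing the fitted imputer, scaler, PCA, and k-means objects around by hand, the same chain can be bundled in a scikit-learn `Pipeline`. This is a sketch of a different packaging and not the notebook's method. The data is made up, the month conversion step is left out, and the step names and the choices `n_components=4` and `n_clusters=3` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
toy = pd.DataFrame(rng.normal(size=(200, 6)),
                   columns=[f"f{i}" for i in range(6)])
toy.iloc[::25, 0] = np.nan   # inject a few missing values

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=123)

# Every step is fitted on the training data only
toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=4, random_state=123)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=123)),
])
toy_pipeline.fit(toy_train)

# The test data goes through the same fitted steps before cluster assignment
toy_test_cluster = pd.Series(toy_pipeline.predict(toy_test),
                             index=toy_test.index,
                             name="cluster")
```

Bundling the steps makes it harder to pass the wrong data or the wrong fitted object to one of the steps when processing the test set.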