## Hierarchical Clustering

In [None]:
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [None]:
# Load data
credit_score=pd.read_csv("data/credit_score.csv")

# Select features
df_features=credit_score.drop(['CUST_ID', 'CAT_GAMBLING', 'CAT_DEBT', 
                               'CAT_CREDIT_CARD', 'CAT_MORTGAGE',
                               'CAT_SAVINGS_ACCOUNT', 'CAT_DEPENDENTS',
                               'CREDIT_SCORE', 'DEFAULT'], axis=1)

# Standardize features
scaler=StandardScaler()
df_scaled=scaler.fit_transform(df_features)
df_scaled=pd.DataFrame(df_scaled, columns=df_features.columns)

In [None]:
# Transpose so that each row is a feature
df_transposed=df_scaled.transpose()
print(np.shape(df_transposed))

In [None]:
# Perform hierarchical clustering on features
linked=linkage(df_transposed, method='ward', metric='euclidean')
print(np.shape(linked))

In [None]:
df_linked=pd.DataFrame(linked,
                       columns=['c1', 'c2', 'distance', 'size'])
df_linked[['c1', 'c2', 'size']]=df_linked[['c1', 'c2', 'size']].astype('int')

df_linked.head(10)

# c1, c2 - indices of 2 clusters included in the new cluster
# distance - distance between the 2 clusters
# size - number of features in the resulting cluster
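
To make the column meanings concrete, here is a minimal sketch on four made-up 1-D points (the toy array `points` is an assumption for illustration, not part of the credit data). SciPy numbers the original observations `0..n-1` and gives the cluster formed in row `i` the id `n + i`, so ids of 4 and above in `c1`/`c2` refer to clusters created by earlier merges.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: four observations with one coordinate each
points = np.array([[0.0], [1.0], [10.0], [12.0]])

Z = linkage(points, method='ward', metric='euclidean')

print(Z.shape)  # (n - 1, 4) -> (3, 4)

# Each row: the two merged cluster ids, the merge distance, the new cluster size
for c1, c2, dist, size in Z:
    print(int(c1), int(c2), round(float(dist), 3), int(size))
```

The last row merges ids 4 and 5, i.e. the two pairs formed in the first two rows, into one cluster containing all four points.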

### Visualizing the clusters

In [None]:
# Create a dendrogram to visualize the feature clustering
plt.figure(figsize=(10, 5))

dendrogram(
    linked,
    # Root cluster at top, individual features at bottom
    orientation='top',
    # Pass a plain list of feature names as leaf labels
    labels=df_transposed.index.tolist(),
    distance_sort='descending',
    show_leaf_counts=True
)

plt.xlabel('Features')
plt.ylabel('Ward distances')

### Group features into clusters

* `fcluster` (from `scipy.cluster.hierarchy`) converts a **hierarchical clustering** (the dendrogram encoded by `linked`) into **flat cluster labels**, one per clustered observation. Because `linkage` was run on the transposed data, each observation here is a **feature**.

* `linked` should be the **linkage matrix** you got from `linkage(X, method=...)`. It’s an `(n-1) x 4` array describing the sequence of merges: which clusters merged, at what distance, and the size of the new cluster.

* `criterion='maxclust'` tells `fcluster` to **cut the dendrogram** so that the number of resulting clusters is **no more than** `t`. In practice this yields **exactly `t` clusters** unless there are ties/degeneracies.

* `t=num_clusters` sets `t` to 10 here, so you (typically) get **10 clusters**.

* The return value `labels` is a 1D integer array with one entry per clustered observation (here, one per feature), with values `1, 2, ..., k` indicating which cluster each feature belongs to.
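
As a rough illustration of the `maxclust` cut, the sketch below runs `fcluster` on a tiny made-up dataset (the array `points` and the name `toy_labels` are assumptions for illustration only). With two well-separated pairs and `t=2`, each pair should end up in its own flat cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight pairs that are far from each other
points = np.array([[0.0], [1.0], [10.0], [12.0]])
Z = linkage(points, method='ward')

# Cut the tree so that at most 2 flat clusters remain
toy_labels = fcluster(Z, t=2, criterion='maxclust')

# One integer label per observation, numbered from 1
print(toy_labels)
```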

In [None]:
num_clusters=10

labels=fcluster(linked, t=num_clusters, criterion='maxclust')

#### 1. What `.corr()` does

* `pandas.Series.corr()` computes the **Pearson correlation coefficient** between two Series (columns) by default.
* Formula:

$$
r = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}
$$

Where:

* $\text{Cov}(X, Y)$ = covariance between features `X` and `Y`
* $\sigma_X, \sigma_Y$ = standard deviations of `X` and `Y`

---

#### 2. Step-by-step calculation

Suppose we’re correlating `CREDIT_SCORE` (Y) with some feature column (X):

1. **Mean-center both variables**

   $$
   X' = X - \bar{X}, \quad Y' = Y - \bar{Y}
   $$

2. **Compute covariance**

   $$
   \text{Cov}(X,Y) = \frac{1}{n-1} \sum_{i=1}^n X'_i \, Y'_i
   $$

3. **Normalize by the standard deviations**

   $$
   r = \frac{\text{Cov}(X,Y)}{\sqrt{\frac{1}{n-1}\sum (X'_i)^2} \; \sqrt{\frac{1}{n-1}\sum (Y'_i)^2}}
   $$
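
The three steps can be checked numerically. This is a minimal sketch on made-up values (the Series `x` and `y` are hypothetical, not columns from the credit data) that follows the formulas above and compares the result with `Series.corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, chosen so the arithmetic is easy to follow
x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = pd.Series([2, 1, 4, 3, 5], dtype=float)
n = len(x)

# 1. Mean-center both variables
x_c = x - x.mean()
y_c = y - y.mean()

# 2. Sample covariance (n - 1 in the denominator)
cov_xy = (x_c * y_c).sum() / (n - 1)

# 3. Divide by the two sample standard deviations
std_x = np.sqrt((x_c ** 2).sum() / (n - 1))
std_y = np.sqrt((y_c ** 2).sum() / (n - 1))
r_manual = cov_xy / (std_x * std_y)

print(round(float(r_manual), 3), round(float(x.corr(y)), 3))  # 0.8 0.8
```

Here the covariance is 8/4 = 2 and both sample variances are 2.5, so r = 2/2.5 = 0.8.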

In [None]:
# Find correlation between features and credit score
correlations=[]

for col in df_features.columns:
    corr=credit_score["CREDIT_SCORE"].corr(credit_score[col])
    corr=round(corr, 3)
    correlations.append(corr)

* `df_features.columns` → an Index of feature names.
* `labels` → the cluster label for each feature (from your clustering step).
* `correlations` → Pearson r between each feature and CREDIT\_SCORE.
* `zip(...)` pairs items position-wise, e.g. `(feature_i, label_i, corr_i)`.
* `list(...)` materializes that iterator into a list of tuples for DataFrame construction.
* `pd.DataFrame(..., columns=...)` builds a DataFrame with explicit column names:

  * **feature**: column name from `df_features`
  * **cluster**: integer cluster id (1..k)
  * **correlation**: signed Pearson correlation with the target

```python
df_clusters['abs_corr'] = df_clusters['correlation'].abs()
```

* Creates a new column **abs\_corr** as the absolute value of the signed correlation.
* Purpose: rank features by **strength** of association regardless of sign.

```python
df_clusters.sort_values(by=['cluster', 'abs_corr'], ascending=[True, False], inplace=True)
```

* Sorts rows by two keys:

  1. **cluster** ascending → groups features cluster-by-cluster.
  2. **abs\_corr** descending → within each cluster, strongest correlations first.
* `inplace=True` mutates `df_clusters` instead of returning a new DataFrame.

```python
df_clusters.reset_index(drop=True, inplace=True)
```

* After sorting, the old row order (index) is meaningless; this resets to `0..n-1`.

* `drop=True` discards the old index rather than moving it into a column.

* `inplace=True` modifies the DataFrame directly.
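
The effect of the two-key sort is easiest to see on a small table. The sketch below uses a made-up four-row frame (`toy` is a hypothetical stand-in for `df_clusters`) and an equivalent chained form that returns a new DataFrame instead of using `inplace=True`.

```python
import pandas as pd

# Hypothetical toy table standing in for df_clusters
toy = pd.DataFrame({
    'feature': ['a', 'b', 'c', 'd'],
    'cluster': [2, 1, 1, 2],
    'correlation': [0.1, -0.7, 0.3, -0.5],
})

toy_sorted = (
    toy.assign(abs_corr=toy['correlation'].abs())
       # cluster ascending, then strongest |correlation| first within a cluster
       .sort_values(by=['cluster', 'abs_corr'], ascending=[True, False])
       .reset_index(drop=True)
)

print(toy_sorted['feature'].tolist())  # ['b', 'c', 'd', 'a']
```

Within cluster 1, `b` (|r| = 0.7) comes before `c` (|r| = 0.3) even though its signed correlation is negative.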

In [None]:
df_clusters=pd.DataFrame(
    list(zip(df_features.columns, labels, correlations)),
    columns=['feature', 'cluster', 'correlation']
)

df_clusters['abs_corr']=df_clusters['correlation'].abs()

df_clusters.sort_values(by=['cluster', 'abs_corr'], ascending=[True, False], inplace=True)
df_clusters.reset_index(drop=True, inplace=True)

df_clusters.head(10)

### Sense check the clusters

In [None]:
c2_features=df_clusters[df_clusters['cluster']==2]['feature'].tolist()
c3_features=df_clusters[df_clusters['cluster']==3]['feature'].tolist()

print(c2_features)
print(c3_features)

```python
corr = credit_score[c2_features].corr()
```

* `c2_features` → the list of column names belonging to cluster 2 (built in the previous cell).
* `credit_score[c2_features]` → selects those columns from the DataFrame.
* `.corr()` → computes the **pairwise Pearson correlation matrix** among the selected features:

  * Result is a square `k × k` DataFrame, where `k` is the number of features in cluster 2.
  * Values range between **-1 and +1**.
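
A quick sketch of these properties on made-up columns (the frame `toy` is hypothetical): the matrix is square and symmetric, every feature correlates perfectly with itself, and a column that is an exact linear function of another gets a correlation of 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns; 'b' is an exact linear function of 'a'
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [4.0, 1.0, 3.0, 2.0],
})

toy_corr = toy.corr()
values = toy_corr.to_numpy()

print(toy_corr.shape)                        # (3, 3)
print(np.allclose(values, values.T))         # True: symmetric
print(np.allclose(np.diag(values), 1.0))     # True: ones on the diagonal
```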

---

```python
plt.figure(figsize=(5, 4))
```

* Creates a new matplotlib figure with width=5 inches, height=4 inches.
* Controls the plot’s overall size.

---

```python
sns.heatmap(
    corr,
    annot=True,
    cmap='coolwarm',
    linewidths=0.5,
    fmt='.1f',
    annot_kws={"size": 7},
    vmin=-1,
    vmax=1
)
```

This draws the heatmap using **seaborn**:

* **`corr`** → the correlation matrix is the input (a 2D numeric table).

* **`annot=True`** → show the actual correlation values inside each cell.

* **`cmap='coolwarm'`** → diverging color map: blue for negative values, red for positive values, a pale neutral gray near zero.

* **`linewidths=0.5`** → thin white gridlines between cells.

* **`fmt='.1f'`** → format annotations with 1 decimal place.

* **`annot_kws={"size": 7}`** → annotation font size.

* **`vmin=-1, vmax=1`** → fixes the color scale so -1 = full blue, 0 = the neutral midpoint, +1 = full red.
  This keeps the colors comparable across all heatmaps.

---

```python
plt.title("Cluster 2")
```

* Adds a title above the heatmap.

---

```python
plt.xticks(size=7)
plt.yticks(size=7)
```

* Reduces the tick-label font size on the x and y axes to 7, keeping labels readable in small plots.

---

In [None]:
# Plot correlations for features in cluster 2
corr=credit_score[c2_features].corr()

plt.figure(figsize=(5, 4))
sns.heatmap(
    corr,
    annot=True,
    cmap='coolwarm',
    linewidths=0.5,
    fmt='.1f',
    annot_kws={"size": 7},
    vmin=-1,
    vmax=1
)

plt.title("Cluster 2")
plt.xticks(size=7)
plt.yticks(size=7)

In [None]:
# Plot correlations for features in cluster 3
corr=credit_score[c3_features].corr()

plt.figure(figsize=(5, 4))
sns.heatmap(
    corr,
    annot=True,
    cmap='coolwarm',
    linewidths=0.5,
    fmt='.1f',
    annot_kws={"size": 7},
    vmin=-1,
    vmax=1
)

plt.title("Cluster 3")
plt.xticks(size=7)
plt.yticks(size=7)

```python
cbar = plt.gcf().axes[-1]
```

* `plt.gcf()` — “get current figure”. It returns the `Figure` object that matplotlib is currently working with.

* `.axes` — a list of all `Axes` objects in that `Figure` (main plot axes, maybe subplots, and any colorbar axes).

* `[-1]` — picks the **last** `Axes` in that list.

* **Typical effect:** `sns.heatmap` draws its colorbar on a new `Axes` that is added to the figure after the heatmap's own `Axes`, so `plt.gcf().axes[-1]` commonly returns the **colorbar's Axes**. `cbar` then holds that Axes object.

* **Caveat:** this assumes the colorbar is indeed the last axes. If you have multiple subplots or other axes, the last axes might not be the colorbar.

```python
cbar.tick_params(labelsize=7)
```

* `tick_params()` is an `Axes` method that controls tick/label appearance for that axes.

* `labelsize=7` sets the **font size (in points)** of the tick labels on that axes — here, the colorbar ticks.

* `tick_params()` accepts many other options (e.g., `labelrotation`, `length`, `width`, `direction='in'`) and can be restricted to one axis with `axis='x'` or `axis='y'`, but for a vertical colorbar `labelsize` is the usual parameter to shrink the numeric labels.

```python
plt.xticks(size=7)
plt.yticks(size=7)
```

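* `plt.xticks()` and `plt.yticks()` are `matplotlib.pyplot` functions that get or set the tick locations and labels of the **current axes** (the axes that matplotlib considers active). Called with only a text property such as `size=7`, they leave the ticks where they are and just restyle the labels.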

* `plt.xticks(size=7)` sets the x-axis tick label font size to 7 points. Same for `plt.yticks(size=7)` on the y-axis.

* **Important:** `plt.xticks()` and `plt.yticks()` affect whichever axes is current — if you want to be explicit and safe, it’s better to use the axes object directly (`ax.tick_params(axis='x', labelsize=7)` / `ax.tick_params(axis='y', labelsize=7)`) so you don’t accidentally change the wrong axes.
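
One way to avoid relying on the "current axes" and on `axes[-1]` is to keep explicit references. The sketch below is an assumption-laden illustration, not the notebook's method: the random frame `toy` is hypothetical, and reaching the colorbar through `ax.collections[0].colorbar` relies on the heatmap being the first collection drawn on `ax`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data standing in for a cluster's feature columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['f1', 'f2', 'f3'])

# Create the figure and axes explicitly and hand the axes to seaborn
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(toy.corr(), annot=True, cmap='coolwarm', fmt='.1f',
            vmin=-1, vmax=1, ax=ax)

# Shrink the heatmap's own tick labels on both axes
ax.tick_params(axis='both', labelsize=7)

# The mesh drawn by heatmap keeps a reference to its colorbar
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=7)
```

With a single heatmap, `cbar.ax` is the same object as `plt.gcf().axes[-1]`, but the explicit reference keeps working if more axes are added to the figure later.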

In [None]:
# Plot correlations for features of cluster 2 and 3
# Concatenate the two lists of feature names before selecting columns
corr=df_features[c2_features + c3_features].corr()

plt.figure(figsize=(8, 7))
sns.heatmap(
    corr,
    annot=True,
    cmap='coolwarm',
    linewidths=0.5,
    fmt='.1f',
    annot_kws={"size": 7},
    vmin=-1,
    vmax=1
)

plt.title("Cluster 2 and 3")
plt.xticks(size=7)
plt.yticks(size=7)

# Change size of colorbar labels
cbar=plt.gcf().axes[-1]
cbar.tick_params(labelsize=7)