# Hierarchical Clustering
- Linkage methods, Dendrograms, Cophenetic distance
- Real examples: Gene expression analysis, Document clustering

In [1]:
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, cophenet
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt
print('Hierarchical clustering module loaded')

Hierarchical clustering module loaded


## Hierarchical Clustering Basics
**Approach**: Build a tree (dendrogram) of nested clusters

**Two types**:
- **Agglomerative** (bottom-up): Start with each point as its own cluster, then repeatedly merge the closest pair
- **Divisive** (top-down): Start with all points in one cluster, then repeatedly split

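**Linkage methods** (how the distance between two clusters is measured):
- **Single**: Minimum distance between any two points, one from each cluster
- **Complete**: Maximum distance between any two points, one from each cluster
- **Average**: Mean distance over all cross-cluster point pairs
- **Ward**: Merge the pair that gives the smallest increase in within-cluster variance (a common default for compact clusters; assumes Euclidean distance)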
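The single, complete, and average definitions above can be checked by hand on a tiny example. The sketch below uses two made-up clusters (the coordinates are illustrative, not from this notebook) and compares the hand-computed values with the height of the final merge reported by `linkage`.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import cdist

# Two tiny hand-made clusters (hypothetical values chosen for easy arithmetic)
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All Euclidean distances between a point in A and a point in B
pairwise = cdist(cluster_a, cluster_b)

single_dist = pairwise.min()     # closest cross-cluster pair
complete_dist = pairwise.max()   # farthest cross-cluster pair
average_dist = pairwise.mean()   # mean over all cross-cluster pairs
print(single_dist, complete_dist, average_dist)

# The last row of the linkage matrix holds the final merge (A with B),
# so its height should match the hand-computed cluster distance
points = np.vstack([cluster_a, cluster_b])
top_heights = {
    method: linkage(points, method=method)[-1, 2]
    for method in ("single", "complete", "average")
}
print(top_heights)
```

Ward is left out here because its merge height is derived from the increase in within-cluster variance rather than from a simple summary of the pairwise distances.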

In [2]:
# Sample data
np.random.seed(42)
data = np.random.randn(20, 2)
data[:7] += [3, 3]
data[7:14] += [-2, 2]
data[14:] += [1, -3]

print(f'Data: {len(data)} points in 2D\n')

# Hierarchical clustering with different linkages
methods = ['single', 'complete', 'average', 'ward']

for method in methods:
    Z = linkage(data, method=method)
    print(f'{method.capitalize()} linkage:')
    print(f'  Linkage matrix shape: {Z.shape}')
    print(f'  (n-1 merges for n points)\n')

Data: 20 points in 2D

Single linkage:
  Linkage matrix shape: (19, 4)
  (n-1 merges for n points)

Complete linkage:
  Linkage matrix shape: (19, 4)
  (n-1 merges for n points)

Average linkage:
  Linkage matrix shape: (19, 4)
  (n-1 merges for n points)

Ward linkage:
  Linkage matrix shape: (19, 4)
  (n-1 merges for n points)
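Each of the n-1 rows in the linkage matrix describes one merge, and the four columns are the two cluster indices, the merge height, and the size of the new cluster. A minimal sketch on a small illustrative dataset (`points` and `Z_demo` are names introduced here, not the notebook's variables):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)          # illustrative data, not the notebook's
points = rng.normal(size=(6, 2))
n = len(points)

Z_demo = linkage(points, method="ward")

for step, (left, right, height, size) in enumerate(Z_demo):
    # Indices below n are original points; index n + i is the cluster formed at step i
    print(f"step {step}: merge {int(left)} + {int(right)} "
          f"at height {height:.3f} -> cluster {n + step} ({int(size)} points)")
```

The final row always has size n, because the last merge joins everything into one cluster.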



## Dendrogram Visualization
A dendrogram is a tree showing the cluster hierarchy. The height of each join is the distance at which the two clusters merge.
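As a rough sketch of what `dendrogram` produces, the example below asks for the layout only (`no_plot=True`) and reads back the leaf order and the merge heights. The data and the names `points`, `Z_demo`, and `layout` are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)          # illustrative data
points = rng.normal(size=(8, 2))
Z_demo = linkage(points, method="ward")

# no_plot=True returns the layout dictionary without drawing anything
layout = dendrogram(Z_demo, no_plot=True)

# Left-to-right order of the original points along the horizontal axis
leaf_order = layout["leaves"]

# Each link is drawn as a U shape; the top of the U is the merge height
merge_heights = sorted(max(ys) for ys in layout["dcoord"])

print("Leaf order:", leaf_order)
print("Merge heights:", np.round(merge_heights, 3))
```

Dropping `no_plot=True` draws the same tree on the current matplotlib axes.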

In [3]:
# Compute linkage
Z = linkage(data, method='ward')

print('Dendrogram interpretation:')
print('  - Horizontal axis: Data points')
print('  - Vertical axis: Distance/dissimilarity')
print('  - Height of merge = cluster distance')
print('  - Cut at height → clusters\n')

# Cophenetic correlation (quality metric)
c, coph_dists = cophenet(Z, pdist(data))
print(f'Cophenetic correlation: {c:.4f}')
print('  Close to 1.0 = dendrogram preserves distances well')

Dendrogram interpretation:
  - Horizontal axis: Data points
  - Vertical axis: Distance/dissimilarity
  - Height of merge = cluster distance
  - Cut at height → clusters

Cophenetic correlation: 0.8676
  Close to 1.0 = dendrogram preserves distances well
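The cophenetic distance between two points is the height of the merge that first puts them in the same cluster, and the cophenetic correlation is the Pearson correlation between those heights and the original pairwise distances. A small sketch on illustrative data (names such as `Z_demo` and `manual_c` are introduced here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)          # illustrative data
points = rng.normal(size=(10, 2))

original = pdist(points)                 # condensed pairwise distances
Z_demo = linkage(points, method="average")
c, coph = cophenet(Z_demo, original)     # coph is also in condensed form

# Recompute the coefficient directly as a Pearson correlation
manual_c = np.corrcoef(coph, original)[0, 1]
print(f"cophenet: {c:.4f}, manual: {manual_c:.4f}")

# The largest cophenetic distance is the height of the final merge,
# since that merge is the first to join the two top-level branches
print(f"max cophenetic: {coph.max():.4f}, final merge height: {Z_demo[-1, 2]:.4f}")
```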


## Cutting Dendrogram: Forming Clusters
Two ways:
1. **By height**: `fcluster(Z, t, criterion='distance')`
2. **By count**: `fcluster(Z, k, criterion='maxclust')`
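The two criteria are closely related: cutting at any height between two consecutive merges gives the same partition as asking for the matching number of clusters. A minimal sketch, assuming illustrative data and distinct merge heights:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)          # illustrative data
points = rng.normal(size=(12, 2))
Z_demo = linkage(points, method="ward")

k = 3
# Rows are sorted by height, so the last k-1 rows are the merges to undo
lower = Z_demo[-k, 2]         # highest merge that is kept
upper = Z_demo[-(k - 1), 2]   # lowest merge that is undone
t = (lower + upper) / 2       # any height strictly between them works

by_count = fcluster(Z_demo, k, criterion="maxclust")
by_height = fcluster(Z_demo, t, criterion="distance")

# Compare partitions via co-membership so label numbering does not matter
same_partition = np.array_equal(
    by_count[:, None] == by_count[None, :],
    by_height[:, None] == by_height[None, :],
)
print(f"t = {t:.3f}, same partition: {same_partition}")
```

Cutting by height is useful when a distance threshold has a domain meaning; cutting by count is simpler when the number of groups is fixed in advance.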

In [4]:
# Cut to get k clusters
k = 3
clusters = fcluster(Z, k, criterion='maxclust')

print(f'Cut dendrogram to get {k} clusters:')
for i in range(1, k+1):
    count = np.sum(clusters == i)
    print(f'  Cluster {i}: {count} points')

print(f'\nCluster assignments: {clusters}')

Cut dendrogram to get 3 clusters:
  Cluster 1: 7 points
  Cluster 2: 7 points
  Cluster 3: 6 points

Cluster assignments: [1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3]


## Real Example: Gene Expression Clustering
Group genes with similar expression patterns to identify candidate co-regulated genes. The data below is simulated.

In [5]:
# Simulate gene expression data
# 50 genes × 10 conditions
np.random.seed(42)
n_genes = 50
n_conditions = 10

# Create gene groups with similar patterns
group1 = np.random.randn(15, n_conditions) + np.linspace(0, 3, n_conditions)  # Upregulated
group2 = np.random.randn(20, n_conditions) + np.linspace(3, 0, n_conditions)  # Downregulated
group3 = np.random.randn(15, n_conditions) + 1  # Stable

gene_expr = np.vstack([group1, group2, group3])
gene_names = [f'Gene_{i:02d}' for i in range(n_genes)]

print('Gene Expression Clustering')
print(f'  Genes: {n_genes}')
print(f'  Conditions: {n_conditions}')
print(f'  Data shape: {gene_expr.shape}\n')

# Hierarchical clustering
Z_genes = linkage(gene_expr, method='average')

# Cut to get gene groups
k_groups = 3
gene_clusters = fcluster(Z_genes, k_groups, criterion='maxclust')

print(f'Identified {k_groups} gene clusters:')
for i in range(1, k_groups+1):
    cluster_genes = np.where(gene_clusters == i)[0]
    print(f'\n  Cluster {i}: {len(cluster_genes)} genes')
    print(f'    Genes: {[gene_names[j] for j in cluster_genes[:5]]}...')
    
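    # Average expression pattern
    avg_pattern = gene_expr[cluster_genes].mean(axis=0)
    # Note: with float data the two endpoints are almost never exactly equal,
    # so the 'Stable' branch is effectively unreachable without a tolerance
    trend = 'Upregulated' if avg_pattern[-1] > avg_pattern[0] else 'Downregulated' if avg_pattern[-1] < avg_pattern[0] else 'Stable'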
    print(f'    Pattern: {trend}')

Gene Expression Clustering
  Genes: 50
  Conditions: 10
  Data shape: (50, 10)

Identified 3 gene clusters:

  Cluster 1: 29 genes
    Genes: ['Gene_15', 'Gene_16', 'Gene_17', 'Gene_18', 'Gene_19']...
    Pattern: Downregulated

  Cluster 2: 20 genes
    Genes: ['Gene_00', 'Gene_01', 'Gene_02', 'Gene_03', 'Gene_04']...
    Pattern: Upregulated

  Cluster 3: 1 genes
    Genes: ['Gene_43']...
    Pattern: Upregulated
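The average-linkage output above splits the genes 29/20/1, which does not match the 15/20/15 groups used to simulate the data. Because the true groups are known here, one way to inspect the mismatch is a contingency table of true group against assigned cluster. The sketch below rebuilds the same simulated matrix and tabulates two linkage methods; `true_groups` and `contingency` are helper names introduced for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rebuild the simulated expression matrix from the cell above
np.random.seed(42)
n_conditions = 10
group1 = np.random.randn(15, n_conditions) + np.linspace(0, 3, n_conditions)
group2 = np.random.randn(20, n_conditions) + np.linspace(3, 0, n_conditions)
group3 = np.random.randn(15, n_conditions) + 1
gene_expr = np.vstack([group1, group2, group3])

# Known group of each simulated gene: 0 = rising, 1 = falling, 2 = flat
true_groups = np.repeat([0, 1, 2], [15, 20, 15])

def contingency(labels, truth, k=3):
    """Count genes per (true group, cluster) pair; fcluster labels start at 1."""
    table = np.zeros((k, k), dtype=int)
    np.add.at(table, (truth, labels - 1), 1)
    return table

tables = {}
for method in ("average", "ward"):
    labels = fcluster(linkage(gene_expr, method=method), 3, criterion="maxclust")
    tables[method] = contingency(labels, true_groups)
    print(f"{method} linkage (rows = true group, columns = cluster):")
    print(tables[method], "\n")
```

A table with one dominant entry per row and per column indicates that the clusters line up with the simulated groups. On real expression data no ground truth is available, so this check only applies to simulations like this one.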


## Real Example: Document Clustering
Group documents with similar term usage. Applications: topic modeling, organizing search results. The document-term matrix below is simulated.

In [6]:
# Simulate document-term matrix (TF-IDF-like)
np.random.seed(42)
n_docs = 30
n_terms = 100

# Topic 1: Tech documents (high weight on tech terms)
topic1 = np.zeros((10, n_terms))
topic1[:, :20] = np.random.rand(10, 20) * 5  # Tech terms
topic1[:, 20:] = np.random.rand(10, 80) * 0.5

# Topic 2: Sports documents
topic2 = np.zeros((10, n_terms))
topic2[:, 20:40] = np.random.rand(10, 20) * 5  # Sports terms
# Background noise on terms 0-2 and 40-99 only; terms 3-19 stay at zero
topic2[:, [0,1,2] + list(range(40,100))] = np.random.rand(10, 63) * 0.5

# Topic 3: Health documents
topic3 = np.zeros((10, n_terms))
topic3[:, 40:60] = np.random.rand(10, 20) * 5  # Health terms
topic3[:, list(range(0,40)) + list(range(60,100))] = np.random.rand(10, 80) * 0.5

docs = np.vstack([topic1, topic2, topic3])

print('Document Clustering')
print(f'  Documents: {n_docs}')
print(f'  Terms: {n_terms}\n')

# Hierarchical clustering (cosine distance is common for text);
# pdist is already imported in the first cell
cosine_dist = pdist(docs, metric='cosine')
Z_docs = linkage(cosine_dist, method='average')

# Form clusters
k_topics = 3
doc_clusters = fcluster(Z_docs, k_topics, criterion='maxclust')

print(f'Discovered {k_topics} document clusters:')
for i in range(1, k_topics+1):
    cluster_docs = np.where(doc_clusters == i)[0]
    print(f'  Cluster {i}: {len(cluster_docs)} documents')
    print(f'    Doc IDs: {cluster_docs.tolist()}')

Document Clustering
  Documents: 30
  Terms: 100

Discovered 3 document clusters:
  Cluster 1: 10 documents
    Doc IDs: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
  Cluster 2: 10 documents
    Doc IDs: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
  Cluster 3: 10 documents
    Doc IDs: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
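Cosine distance, used in the cell above, compares the direction of two term vectors and ignores their length, which is why it suits documents of different sizes. A minimal sketch with three made-up vectors (`docs_demo` and `cosine_distance` are illustrative names) that checks a hand-written formula against `pdist`:

```python
import numpy as np
from scipy.spatial.distance import pdist

# Three toy term-weight vectors (hypothetical values)
docs_demo = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],   # same direction as the first, twice as long
    [0.0, 3.0, 0.0],   # shares no terms with the other two
])

def cosine_distance(u, v):
    """1 - cos(angle between u and v); independent of vector length."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

manual = [
    cosine_distance(docs_demo[0], docs_demo[1]),
    cosine_distance(docs_demo[0], docs_demo[2]),
    cosine_distance(docs_demo[1], docs_demo[2]),
]
from_scipy = pdist(docs_demo, metric="cosine")  # pair order: (0,1), (0,2), (1,2)

print(np.round(manual, 6))
print(np.round(from_scipy, 6))
```

The first pair has distance 0 (up to floating-point error) despite the different lengths, while vectors with no shared terms are at distance 1.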


## Summary

### Hierarchical Clustering Functions:
```python
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster, cophenet
from scipy.spatial.distance import pdist

# 1. Compute linkage matrix
Z = linkage(data, method='ward')

# 2. Visualize
dendrogram(Z)

# 3. Form clusters
clusters = fcluster(Z, k, criterion='maxclust')

# 4. Quality metric
c, _ = cophenet(Z, pdist(data))
```

### Linkage Methods:
- **Ward**: Minimize variance (best for compact clusters)
- **Average**: UPGMA, balanced
- **Complete**: Max distance, avoids chaining
- **Single**: Min distance, can chain

### Advantages over K-means:
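✓ No need to specify K beforehand (choose it afterwards by cutting the dendrogram)  
✓ Dendrogram shows hierarchy  
✓ Works with any distance metric (except Ward, centroid, and median linkage, which assume Euclidean distance)  
✓ Deterministic (no random initialization)  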
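The first two advantages can be sketched together: the linkage is computed once, and the same tree is then cut at several values of K, with the resulting partitions nested inside one another. The data and names below (`points`, `Z_demo`, `cuts`) are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)          # illustrative data
points = rng.normal(size=(30, 2))

Z_demo = linkage(points, method="ward")   # computed once

# Cut the same tree at several values of k without re-clustering
cuts = {k: fcluster(Z_demo, k, criterion="maxclust") for k in (2, 3, 4)}
for k, labels in cuts.items():
    print(f"k={k}: cluster sizes {np.bincount(labels)[1:]}")

# The cuts are nested: every cluster at k=4 sits inside a single cluster at k=2
nested = all(
    len(np.unique(cuts[2][cuts[4] == c])) == 1
    for c in np.unique(cuts[4])
)
print("nested:", nested)
```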

### Disadvantages:
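✗ Slower: O(n²) time at best and O(n³) for naive implementations, vs. roughly O(n·k·d) per iteration for K-means (k clusters, d dimensions)  
✗ Memory: Needs all O(n²) pairwise distances  
✗ Greedy: an early merge can't be undone later  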
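To make the memory cost concrete, the sketch below counts the pairwise distances for a few dataset sizes and estimates the storage, assuming 8-byte float64 entries in condensed form. `condensed_size` is a helper introduced here for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

def condensed_size(n):
    """Number of pairwise distances among n points: n choose 2."""
    return n * (n - 1) // 2

# Check the formula against pdist on a small illustrative dataset
rng = np.random.default_rng(5)
small = rng.normal(size=(50, 3))
n_pairs = len(pdist(small))
print(n_pairs, condensed_size(50))

# Rough memory for the condensed distances, assuming 8 bytes per entry
for n in (1_000, 10_000, 100_000):
    gib = condensed_size(n) * 8 / 2**30
    print(f"n = {n:>7,}: {condensed_size(n):>13,} distances, about {gib:.2f} GiB")
```

The count grows quadratically, so a tenfold increase in points costs roughly a hundredfold increase in memory.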

### Applications:
✓ **Biology**: Gene expression, phylogenetic trees  
✓ **Text**: Document clustering, topic discovery  
✓ **Social**: Community detection  
✓ **Market**: Product categorization  