# PCA (Principal Component Analysis) – Step-by-Step
PCA is an **unsupervised** technique that reduces the number of features by projecting the data onto a smaller set of uncorrelated directions (the principal components), chosen to keep as much **variance (information)** as possible.

In this notebook you will:
1) Create multi-feature data
2) Scale the data
3) Reduce dimensions using PCA
4) Inspect explained variance
5) Visualize the 2D PCA projection
6) Decide how many components to keep
7) (Bonus) Use PCA before K-Means


In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

## 1) Create multi-feature data
We'll create 120 points in 3D, drawn from two Gaussian groups of 60 points each, so the structure is easy to see after projection.

In [None]:
rng = np.random.RandomState(42)

# Two clusters in 3D
A = rng.normal(loc=[0, 0, 0], scale=[1.0, 0.7, 0.5], size=(60, 3))
B = rng.normal(loc=[4, 3, 2], scale=[1.0, 0.7, 0.5], size=(60, 3))
X = np.vstack([A, B])

print('Original shape:', X.shape)
X[:5]

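## 2) Scale the data (important for PCA)
PCA picks the directions of largest variance, so a feature measured on a larger scale would dominate the components. `StandardScaler` rescales each feature to zero mean and unit variance.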

In [None]:
scaler = StandardScaler()
# Learn each feature's mean and standard deviation, then standardize
X_scaled = scaler.fit_transform(X)

print('First row before:', X[0])
print('First row after :', X_scaled[0])
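
To see why scaling matters, here is a minimal sketch on made-up data (the `*_demo` names and the scales 1 and 1000 are assumptions for illustration, not part of the notebook's dataset). Two independent features are generated on very different scales, and PCA is fitted with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng_demo = np.random.RandomState(0)

# Hypothetical data: two independent features on very different scales
X_demo = np.column_stack([
    rng_demo.normal(scale=1.0, size=200),     # small-scale feature
    rng_demo.normal(scale=1000.0, size=200),  # large-scale feature
])

# PCA on the raw data: the large-scale feature dominates the first component
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_

# PCA after standardization: both features contribute on equal footing
X_demo_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA(n_components=2).fit(X_demo_scaled).explained_variance_ratio_

print('Unscaled:', ratio_raw)
print('Scaled  :', ratio_scaled)
```

Without scaling, nearly all of the variance lands on the first component only because of the units. After scaling, the two components share the variance much more evenly.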

## 3) Apply PCA (reduce from 3D → 2D)

In [None]:
# Keep the two directions with the largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print('New shape:', X_pca.shape)
X_pca[:5]
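
For intuition about what `fit_transform` computes, the projection can be reproduced by hand with a singular value decomposition. This is a sketch, not scikit-learn's internal implementation: it regenerates the same data under new `*_svd` names (chosen here so nothing in the notebook is overwritten) and compares absolute values, since the sign of each component is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Recreate the notebook's data under separate names
rng_svd = np.random.RandomState(42)
A_svd = rng_svd.normal(loc=[0, 0, 0], scale=[1.0, 0.7, 0.5], size=(60, 3))
B_svd = rng_svd.normal(loc=[4, 3, 2], scale=[1.0, 0.7, 0.5], size=(60, 3))
X_svd = StandardScaler().fit_transform(np.vstack([A_svd, B_svd]))

# Center the data (StandardScaler already did, but PCA always centers first)
Xc = X_svd - X_svd.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two directions
manual = Xc @ Vt[:2].T

sk = PCA(n_components=2).fit_transform(X_svd)

# Component signs may differ, so compare absolute values
print('Match up to sign:', np.allclose(np.abs(manual), np.abs(sk), atol=1e-6))
```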

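## 4) Explained variance (how much information is kept?)
`explained_variance_ratio_` gives the fraction of the total variance captured by each kept component; their sum is the fraction retained after the reduction.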

In [None]:
print('Explained variance ratio:', pca.explained_variance_ratio_)
print('Total variance kept:', pca.explained_variance_ratio_.sum())
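
The explained variance ratios correspond to the eigenvalues of the data's covariance matrix, each divided by their sum. A minimal sketch on hypothetical correlated data (the `*_ev` names and the mixing matrix are assumptions made for this illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng_ev = np.random.RandomState(1)

# Hypothetical correlated 3-feature data: mix independent noise with a matrix
mix_ev = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X_ev = rng_ev.normal(size=(200, 3)) @ mix_ev

# Covariance matrix of the features (columns)
cov_ev = np.cov(X_ev, rowvar=False)

# eigvalsh returns eigenvalues in ascending order; reverse to descending
eigvals = np.linalg.eigvalsh(cov_ev)[::-1]
manual_ratio = eigvals / eigvals.sum()

sk_ratio = PCA().fit(X_ev).explained_variance_ratio_

print('Manual :', manual_ratio)
print('sklearn:', sk_ratio)
```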

## 5) Visualize the 2D PCA projection

In [None]:
plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA Projection (3D → 2D)')
plt.show()

## 6) How many components should we keep?
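Fit PCA with all components and plot the cumulative explained variance. A common rule of thumb is to keep the smallest number of components that reaches a chosen threshold (for example, 90 to 95% of the variance).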

In [None]:
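# With no n_components given, PCA keeps every component
pca_full = PCA().fit(X_scaled)
# Running total of the variance fraction explained by the first k components
cum = np.cumsum(pca_full.explained_variance_ratio_)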

plt.figure()
plt.plot(range(1, len(cum) + 1), cum, marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('How many PCA components to keep?')
plt.ylim(0, 1.05)
plt.show()
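
Instead of reading the number off the plot, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and then keeps just enough components to reach that fraction of variance. The sketch below uses hypothetical 10-feature data built from 3 latent factors plus noise (the `*_thr` names, the 0.95 threshold, and the data are assumptions for illustration) and checks the result against the cumulative-sum approach.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng_thr = np.random.RandomState(0)

# Hypothetical data: 10 observed features driven by 3 latent factors + noise
latent = rng_thr.normal(size=(300, 3))
mixing = rng_thr.normal(size=(3, 10))
X_thr = latent @ mixing + 0.1 * rng_thr.normal(size=(300, 10))
X_thr_scaled = StandardScaler().fit_transform(X_thr)

# Ask PCA to keep enough components to explain at least 95% of the variance
pca_thr = PCA(n_components=0.95, svd_solver='full').fit(X_thr_scaled)
kept = pca_thr.n_components_
kept_variance = pca_thr.explained_variance_ratio_.sum()

# Same decision by hand: first k whose cumulative ratio reaches the threshold
cum_thr = np.cumsum(PCA().fit(X_thr_scaled).explained_variance_ratio_)
k_manual = int(np.searchsorted(cum_thr, 0.95) + 1)

print('Components kept by PCA :', kept)
print('Components kept by hand:', k_manual)
print('Variance kept          :', kept_variance)
```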

## 7) Bonus: PCA before K-Means
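A common workflow is to reduce dimensions first and then cluster in the lower-dimensional space, which drops low-variance directions and makes the clusters easy to plot.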

In [None]:
# Cluster in the 2D PCA space; n_init=10 restarts K-Means from 10 initializations
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_pca)

centers = kmeans.cluster_centers_

plt.figure()
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=labels)
plt.scatter(centers[:, 0], centers[:, 1], marker='X', s=250)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('K-Means Clustering on PCA Output')
plt.show()
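
One way to check that the reduction did not hurt the clustering is to compare the labels against the known groups and against K-Means run on all three scaled features. The sketch below is one possible check, not part of the original workflow: it chains the steps with `make_pipeline`, regenerates the data under `*_km` names, and uses the adjusted Rand index (1.0 means identical groupings). The true labels are available only because the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Recreate the notebook's data under separate names
rng_km = np.random.RandomState(42)
A_km = rng_km.normal(loc=[0, 0, 0], scale=[1.0, 0.7, 0.5], size=(60, 3))
B_km = rng_km.normal(loc=[4, 3, 2], scale=[1.0, 0.7, 0.5], size=(60, 3))
X_km = np.vstack([A_km, B_km])
y_true = np.repeat([0, 1], 60)  # first 60 rows are group A, last 60 are group B

# Scale -> PCA -> K-Means as a single estimator
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=2, random_state=42, n_init=10),
)
labels_pca = pipe.fit_predict(X_km)

# Baseline: K-Means on all three scaled features, without PCA
X_km_scaled = StandardScaler().fit_transform(X_km)
labels_full = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X_km_scaled)

ari_truth = adjusted_rand_score(y_true, labels_pca)
ari_agree = adjusted_rand_score(labels_full, labels_pca)

print('ARI vs. true groups     :', ari_truth)
print('ARI vs. K-Means w/o PCA :', ari_agree)
```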

## Notes
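- PCA is **linear**: each component is a linear combination of the original features, so it works best when the structure in the data is roughly linear.
- PCA helps with visualization and can speed up downstream models, but the variance carried by the discarded components is lost.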
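
The information loss mentioned above can be measured by mapping the reduced data back to the original space with `inverse_transform` and comparing it with the input. A minimal sketch, regenerating the notebook's data under `*_rec` names (an assumption made so the block runs on its own):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Recreate the notebook's data under separate names
rng_rec = np.random.RandomState(42)
A_rec = rng_rec.normal(loc=[0, 0, 0], scale=[1.0, 0.7, 0.5], size=(60, 3))
B_rec = rng_rec.normal(loc=[4, 3, 2], scale=[1.0, 0.7, 0.5], size=(60, 3))
X_rec_scaled = StandardScaler().fit_transform(np.vstack([A_rec, B_rec]))

# Reduce to 2D, then map back to 3D
pca_rec = PCA(n_components=2).fit(X_rec_scaled)
X_back = pca_rec.inverse_transform(pca_rec.transform(X_rec_scaled))

# Squared reconstruction error as a fraction of the total squared deviation
sq_error = ((X_rec_scaled - X_back) ** 2).sum()
total = ((X_rec_scaled - X_rec_scaled.mean(axis=0)) ** 2).sum()
lost_fraction = sq_error / total

print('Fraction of variance lost  :', lost_fraction)
print('1 - explained variance sum :', 1 - pca_rec.explained_variance_ratio_.sum())
```

The two printed numbers agree: the variance not explained by the kept components is exactly what the reconstruction cannot recover.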
