# A GUIDE TO DIMENSIONALITY REDUCTION  
High-dimensional data poses significant challenges in machine learning, collectively known as the **curse of dimensionality**. This notebook explores **PCA**, **t-SNE**, and **UMAP** as techniques for reducing dimensions, improving efficiency, and uncovering meaningful patterns in data.

**Table of Contents**  

1. **Introduction**  
   - 1.1 The Curse of Dimensionality  
   - 1.2 Import Needed Dependencies
   - 1.3 Generate Synthetic Dataset

2. **Principal Component Analysis (PCA)**  
   - 2.1 What is PCA?  
   - 2.2 Implementing PCA from Scratch  
   - 2.3 PCA using Scikit-learn

3. **t-SNE (t-Distributed Stochastic Neighbor Embedding)**  
   - 3.1 What is t-SNE?  
   - 3.2 Implementing t-SNE  
   - 3.3 t-SNE using Scikit-learn

4. **UMAP (Uniform Manifold Approximation and Projection)**  
   - 4.1 What is UMAP?  
   - 4.2 Implementing UMAP  
   - 4.3 UMAP using Scikit-learn

5. **Comparing PCA, t-SNE, and UMAP**  


---
# <a id="1"><div style="text-align: center;">INTRODUCTION</div></a> 

---


## 1.1 Understanding the Curse of Dimensionality 

The **curse of dimensionality** refers to the challenges that arise when working with high-dimensional data. As the number of features (dimensions) increases, the data becomes increasingly sparse, making it harder to extract meaningful patterns. This phenomenon impacts both computational efficiency and model performance.  

1. **Data Sparsity**  
   ![Curse of Dimensionality](https://images.deepai.org/glossary-terms/f99300ef736b4ddba8c5506066903a3d/curse-dimensionality-2.png)  
   In the image above, as the number of dimensions grows, the data points **spread out**, leaving large empty regions. In high dimensions, distances between points become less informative because the nearest and farthest neighbors end up almost equally far away, so distance-based algorithms struggle to identify relationships.  

2. **Performance Degradation**  
   ![Dimensionality vs Performance](https://www.visiondummy.com/wp-content/uploads/2014/04/dimensionality_vs_performance.png)  
   Model performance depends on dimensionality. Adding features improves performance at first, but beyond a certain point the curse of dimensionality takes over: with a fixed number of samples the data becomes too sparse, which leads to overfitting, higher computational cost, and lower accuracy.  
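The loss of distance contrast described above can be checked numerically. The following is a minimal sketch, assuming uniformly random points in the unit hypercube; the helper name `distance_contrast` and the chosen dimensions are illustrative and not part of the original notebook.

```python
import numpy as np

def distance_contrast(n_dims, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point to one query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform samples in the unit hypercube
    query = rng.random(n_dims)                # one extra random query point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# A large contrast means "near" and "far" are clearly different;
# a contrast close to 0 means all points are roughly equally far away.
for d in (2, 10, 100, 1000):
    print(f"{d:>4} dims: contrast = {distance_contrast(d):.2f}")
```

The contrast should shrink steadily as the number of dimensions grows, which is the effect that hurts nearest-neighbor style methods.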


## 1.2 Import Needed Dependencies

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Print NumPy arrays with 3 decimal places
np.set_printoptions(precision=3)

## 1.3 Generate Synthetic Dataset

In [None]:
# Generate a synthetic dataset with built-in correlations
def generate_data():
    np.random.seed(42)
    X = np.random.rand(100, 6)   # 100 samples, 6 features, uniform on [0, 1)
    X[:, 2] = X[:, 0] + X[:, 1]  # Feature3 = Feature1 + (original) Feature2
    X[:, 1] = X[:, 3] + X[:, 5]  # Feature2 = Feature4 + Feature6 (overwrites Feature2)
    return pd.DataFrame(X, columns=['Feature1', 'Feature2', 'Feature3', 'Feature4', 'Feature5', 'Feature6'])

**Load Dataset**

In [None]:
data = generate_data()
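Before reducing dimensions, it can help to confirm that the generated features really are redundant. The sketch below rebuilds the same array as `generate_data()` so that it runs on its own, then inspects the correlation matrix and the rank of the centered data. The variable names `corr` and `rank` are illustrative.

```python
import numpy as np
import pandas as pd

# Same construction as generate_data(), repeated here so the cell is self-contained
np.random.seed(42)
X = np.random.rand(100, 6)
X[:, 2] = X[:, 0] + X[:, 1]
X[:, 1] = X[:, 3] + X[:, 5]
data = pd.DataFrame(X, columns=[f"Feature{i + 1}" for i in range(6)])

# Pairwise Pearson correlations between features
corr = data.corr()
print(corr.round(2))

# Rank of the centered data: one exact linear dependency
# (Feature2 = Feature4 + Feature6) removes one degree of freedom
rank = np.linalg.matrix_rank((data - data.mean()).to_numpy())
print("Rank of centered data:", rank)
```

Because Feature2 is an exact sum of two other columns, the rank should come out as 5 rather than 6, which is why one eigenvalue in Section 2.2 is numerically zero.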

---
# <a id="2"><div style="text-align: center;">PCA</div></a> 

---


## 2.1 What is PCA?

PCA is a linear dimensionality reduction technique that rotates a dataset into a new orthogonal coordinate system in which the first principal component (PC) captures the greatest variance, the second PC captures the greatest remaining variance, and so on.

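#### **PCA Steps**
1. **Standardize the Data**: Rescale every feature to a mean of 0 and a standard deviation of 1 so that no feature dominates because of its units.
2. **Compute the Covariance Matrix**: Measure how each pair of features varies together.
3. **Calculate Eigenvalues and Eigenvectors**: The eigenvectors of the covariance matrix are the principal components; each eigenvalue is the variance captured along its eigenvector.
4. **Select the Principal Components**: Sort the eigenvectors by descending eigenvalue and keep the top few.
5. **Project the Data**: Transform the dataset to the new coordinate system.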
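The steps above can be written compactly as follows. The notation is an assumption introduced here for illustration: $n$ samples, $d$ features, $m$ retained components, $\mu_j$ and $\sigma_j$ the mean and standard deviation of feature $j$, and $Z$ the standardized data matrix.

```latex
\begin{aligned}
z_{ij} &= \frac{x_{ij} - \mu_j}{\sigma_j}
  && \text{(standardize)} \\
C &= \frac{1}{n-1}\, Z^{\top} Z
  && \text{(covariance matrix)} \\
C\, v_k &= \lambda_k v_k, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d
  && \text{(eigendecomposition, sorted)} \\
Y &= Z\, W_m, \quad W_m = [\, v_1 \;\cdots\; v_m \,]
  && \text{(projection onto the top } m \text{ components)} \\
r_k &= \frac{\lambda_k}{\sum_{j=1}^{d} \lambda_j}
  && \text{(explained variance ratio of component } k\text{)}
\end{aligned}
```

The $\tfrac{1}{n-1}$ factor matches the default of `np.cov`, which is the function used in the implementation below.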


----
## 2.2 Implement PCA from Scratch using NumPy

### [ Step 1 ]: Standardize Dataset

In [56]:
data_mean = np.mean(data, axis=0)
data_mean

| | 0 |
|---|---|
| Feature1 | 0.506084 |
| Feature2 | 0.965149 |
| Feature3 | 1.070996 |
| Feature4 | 0.50538 |
| Feature5 | 0.476977 |
| Feature6 | 0.459768 |


In [57]:
# Population standard deviation (ddof=0) of each feature
data_std = np.std(data, axis=0)
data_std

| | 0 |
|---|---|
| Feature1 | 0.29708 |
| Feature2 | 0.431274 |
| Feature3 | 0.422078 |
| Feature4 | 0.284933 |
| Feature5 | 0.305046 |
| Feature6 | 0.296602 |


In [58]:
data_standardized = (data - data_mean) / data_std
data_standardized

| | Feature1 | Feature2 | Feature3 | Feature4 | Feature5 | Feature6 |
|---|---|---|---|---|---|---|
| 0 | -0.442790 | -0.488078 | 0.602397 | 0.327370 | -1.052167 | -1.024182 |
| 1 | -1.508014 | 1.652856 | -0.347652 | 0.711370 | -1.496147 | 1.719955 |
| 2 | 1.098556 | -0.595880 | -0.062108 | -1.130006 | -0.566260 | 0.219109 |
| 3 | -0.249559 | -1.064967 | -0.824070 | -1.284115 | -0.605919 | -0.314923 |
| 4 | -0.168352 | -0.937834 | 0.403362 | 0.031075 | 0.378426 | -1.393513 |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.026611 | 1.406203 | 0.536978 | 0.409594 | 1.263373 | 1.651213 |
| 96 | -1.208465 | 0.633306 | 0.006314 | -0.867348 | -0.058489 | 1.754084 |
| 97 | -0.045328 | -1.382237 | -0.591422 | -0.930867 | -1.314932 | -1.115600 |
| 98 | -1.272514 | 0.049605 | -1.874175 | 0.475532 | -0.967388 | -0.384695 |


### [ Step 2 ]: Compute Covariance Matrix

In [59]:
# rowvar=False: each column is a feature, each row is an observation
cov_matrix = np.cov(data_standardized, rowvar=False)
print(cov_matrix)

[[ 1.01   0.05   0.741 -0.01  -0.165  0.083]
 [ 0.05   1.01  -0.029  0.737  0.106  0.761]
 [ 0.741 -0.029  1.01  -0.077 -0.155  0.032]
 [-0.01   0.737 -0.077  1.01   0.114  0.101]
 [-0.165  0.106 -0.155  0.114  1.01   0.045]
 [ 0.083  0.761  0.032  0.101  0.045  1.01 ]]
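The diagonal entries above are 1.01 rather than exactly 1. A likely explanation is a mismatch in degrees of freedom: the data was standardized with the population standard deviation (`ddof=0`), while `np.cov` divides by `n - 1` by default, giving `n / (n - 1) = 100 / 99 ≈ 1.01`. The sketch below illustrates this on a small random array; the array and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                       # 100 samples, 3 features

# Standardize with the population standard deviation (np.std default: ddof=0)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# np.cov divides by n - 1 by default, so each variance becomes n / (n - 1)
diag_default = np.diag(np.cov(Z, rowvar=False))

# Using the same ddof as the standardization gives variances of exactly 1
diag_matched = np.diag(np.cov(Z, rowvar=False, ddof=0))

print(diag_default)
print(diag_matched)
```

The mismatch only rescales every eigenvalue by the same constant, so the principal directions and the explained variance ratios are unaffected.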


### [ Step 3 ]: Calculate Eigenvalues and Eigenvectors


In [60]:
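# Calculate eigenvalues and eigenvectors of the covariance matrix.
# np.linalg.eig returns the eigenvalues unsorted; column i of `eigenvectors`
# is the eigenvector that belongs to eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print("Eigenvalues: \n")
print(eigenvalues)
print("Eigenvectors: \n")
print(eigenvectors)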

Eigenvalues: 

[ 1.827e+00  2.145e+00  2.648e-01  9.377e-01  8.859e-01 -6.802e-16]
Eigenvectors: 

[[-6.787e-01  9.285e-03  7.081e-01  1.946e-01 -6.053e-03 -5.038e-16]
 [-4.566e-02  6.824e-01 -3.120e-02 -7.942e-02 -3.535e-02 -7.237e-01]
 [-6.733e-01 -5.042e-02 -7.043e-01  2.175e-01  2.942e-02  4.004e-16]
 [ 5.292e-02  5.050e-01 -3.362e-02  2.622e-01 -6.661e-01  4.781e-01]
 [ 2.595e-01  1.396e-01  2.018e-02  8.396e-01  4.559e-01 -1.067e-17]
 [-1.172e-01  5.071e-01 -1.307e-02 -3.673e-01  5.885e-01  4.977e-01]]
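Two details of the output above are worth checking. The last eigenvalue is numerically zero, which is consistent with Feature2 being an exact sum of Feature4 and Feature6. Also, a covariance matrix is symmetric, so `np.linalg.eigh` is an alternative that guarantees real eigenvalues and orthonormal eigenvectors. The sketch below rebuilds the covariance matrix so that it runs on its own and uses `eigh` as a swapped-in alternative to the notebook's `eig` call.

```python
import numpy as np

# Rebuild the standardized data and covariance matrix from Section 1.3
np.random.seed(42)
X = np.random.rand(100, 6)
X[:, 2] = X[:, 0] + X[:, 1]
X[:, 1] = X[:, 3] + X[:, 5]
Z = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(Z, rowvar=False)

# eigh is designed for symmetric matrices and returns eigenvalues in ascending order
vals, vecs = np.linalg.eigh(C)

# Each column v of vecs should satisfy C v = lambda v
satisfies_definition = np.allclose(C @ vecs, vecs * vals)

# The eigenvectors should form an orthonormal basis
is_orthonormal = np.allclose(vecs.T @ vecs, np.eye(6))

# The smallest eigenvalue should be zero up to floating-point noise
smallest_is_zero = abs(vals[0]) < 1e-10

print(satisfies_definition, is_orthonormal, smallest_is_zero)
```

Note that eigenvectors are only defined up to sign, so `eigh` may return some columns with the opposite sign from the `eig` output shown above.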


### [ Step 4 ]: Select Principal Components (2 PCs)

In [62]:
# np.argsort sorts in ascending order, so reverse the indices to order
# the eigenvalues (and their eigenvectors) from largest to smallest
sorted_indices = np.argsort(eigenvalues)[::-1]

# Apply the sort
eigenvalues = eigenvalues[sorted_indices]
eigenvectors = eigenvectors[:, sorted_indices]

print("Eigenvalues: \n")
print(eigenvalues)
print("Eigenvectors: \n")
print(eigenvectors)

Eigenvalues: 

[ 2.145e+00  1.827e+00  9.377e-01  8.859e-01  2.648e-01 -6.802e-16]
Eigenvectors: 

[[ 9.285e-03 -6.787e-01  1.946e-01 -6.053e-03  7.081e-01 -5.038e-16]
 [ 6.824e-01 -4.566e-02 -7.942e-02 -3.535e-02 -3.120e-02 -7.237e-01]
 [-5.042e-02 -6.733e-01  2.175e-01  2.942e-02 -7.043e-01  4.004e-16]
 [ 5.050e-01  5.292e-02  2.622e-01 -6.661e-01 -3.362e-02  4.781e-01]
 [ 1.396e-01  2.595e-01  8.396e-01  4.559e-01  2.018e-02 -1.067e-17]
 [ 5.071e-01 -1.172e-01 -3.673e-01  5.885e-01 -1.307e-02  4.977e-01]]


In [63]:
n_components = 2
eigenvectors_reduced = eigenvectors[:, :n_components]
eigenvectors_reduced

array([[ 0.009, -0.679],
       [ 0.682, -0.046],
       [-0.05 , -0.673],
       [ 0.505,  0.053],
       [ 0.14 ,  0.26 ],
       [ 0.507, -0.117]])

In [72]:
# Share of the total variance captured by each retained component
explained_variance_ratio = eigenvalues[:n_components] / np.sum(eigenvalues)
explained_variance_ratio

array([0.354, 0.301])
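The two retained components explain roughly 65% of the variance. One common way to choose `n_components` is to look at the cumulative explained variance and keep the smallest number of components that reaches a chosen threshold. The sketch below uses the sorted eigenvalues printed above; the 90% threshold is an illustrative assumption, not a value from the original notebook.

```python
import numpy as np

# Sorted eigenvalues copied from the output above; the last one is
# floating-point noise around zero, so it is clipped to 0 first
eigenvalues = np.array([2.145, 1.827, 0.9377, 0.8859, 0.2648, -6.802e-16])
eigenvalues = np.clip(eigenvalues, 0.0, None)

ratios = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # illustrative choice
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed:", n_needed)
```

With these eigenvalues, three components cover about 81% of the variance and four cover about 96%, so a 90% threshold selects four components.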

### [ Step 5 ]: Project Data

In [65]:
# Project the standardized data onto the top eigenvectors: (100, 6) @ (6, 2) -> (100, 2)
projected_data = np.dot(data_standardized, eigenvectors_reduced)

In [68]:
pca_df = pd.DataFrame(projected_data, columns=[f'PC{i+1}' for i in range(n_components)])
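A useful sanity check on the projection is that the principal component scores are uncorrelated and that the variance of each score column equals its eigenvalue. The following is a self-contained sketch that repeats the pipeline with `np.linalg.eigh` (a swapped-in alternative to `eig`); the variable names are illustrative.

```python
import numpy as np

# Rebuild the standardized data from Section 1.3
np.random.seed(42)
X = np.random.rand(100, 6)
X[:, 2] = X[:, 0] + X[:, 1]
X[:, 1] = X[:, 3] + X[:, 5]
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigendecomposition of the covariance matrix, keeping the top 2 components
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(vals)[::-1][:2]
scores = Z @ vecs[:, order]

# Covariance of the scores: diagonal = eigenvalues, off-diagonal = 0
score_cov = np.cov(scores, rowvar=False)
print(score_cov.round(3))
```

The diagonal of `score_cov` should match the two largest eigenvalues (about 2.145 and 1.827 in the output above), and the off-diagonal entries should vanish up to floating-point error.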

----
## 2.3 Apply PCA using Scikit-learn

In [78]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

### [ Step 1 ]: Load and Standardize Data

In [79]:
data = generate_data()

In [80]:
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

### [ Step 2 ]: Apply PCA

In [81]:
pca = PCA(n_components=2)  # Reduce to 2 principal components
principal_components = pca.fit_transform(data_scaled)
pca_df = pd.DataFrame(principal_components, columns=['PC1', 'PC2'])
pca_df

In [82]:
# Fraction of the total variance explained by each of the 2 components
explained_variance = pca.explained_variance_ratio_
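The from-scratch result and the Scikit-learn result should agree, apart from the sign of each component, which is arbitrary. The sketch below compares the two on the same synthetic data; it uses `np.linalg.eigh` for the NumPy route, and the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the synthetic data from Section 1.3
np.random.seed(42)
X = np.random.rand(100, 6)
X[:, 2] = X[:, 0] + X[:, 1]
X[:, 1] = X[:, 3] + X[:, 5]

# Scikit-learn route
Z = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
scores_sk = pca.fit_transform(Z)

# NumPy route: eigendecomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(vals)[::-1][:2]
scores_np = Z @ vecs[:, order]
ratio_np = vals[order] / vals.sum()

# Compare absolute values because each component may be flipped in sign
same_scores = np.allclose(np.abs(scores_np), np.abs(scores_sk))
same_ratio = np.allclose(ratio_np, pca.explained_variance_ratio_)
print(same_scores, same_ratio)
```

If both checks pass, the two implementations describe the same projection and the same explained variance ratios.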