### This notebook implements UMAP from scratch

**UMAP (Uniform Manifold Approximation and Projection)** is a dimensionality reduction technique. It is used to reduce the number of features in a dataset while preserving the underlying structure of the data. UMAP is particularly well-suited for visualizing high-dimensional datasets and is often used in machine learning workflows to explore or preprocess data.

---

### **Key Characteristics**
1. **Manifold Learning**:
   - UMAP assumes that the data lies on a low-dimensional manifold embedded in a higher-dimensional space.
   - It approximates this manifold to reduce the dimensions of the dataset.

2. **Nonlinear Reduction**:
   - Unlike linear techniques like PCA, UMAP captures complex, nonlinear relationships between features.

3. **Preservation of Local Structure**:
   - UMAP is designed to retain the local relationships (neighborhood structure) between data points.

4. **Highly Configurable**:
   - The number of neighbors (`n_neighbors`) sets the balance between local and global structure, while the minimum distance (`min_dist`) controls how tightly points are packed in the embedding.

---

### **Applications**
1. **Data Visualization**:
   - Reduce dimensions to 2D or 3D for exploratory data analysis.
2. **Preprocessing for Machine Learning**:
   - Reduce noise and redundancy in features.
3. **Clustering**:
   - UMAP embeddings are often passed to clustering algorithms such as DBSCAN or KMeans, which can separate groups more easily in the reduced space.
4. **Handling High-Dimensional Data**:
   - Commonly applied to text embeddings, gene expression data, and image embeddings.

---

### **How UMAP Works**
UMAP uses graph theory to construct a weighted graph of data points based on their similarity:
1. **Graph Construction**:
   - A k-nearest neighbor graph is built to capture the local structure of the data.
   - Each edge is weighted by a fuzzy membership strength between 0 and 1 that decays exponentially with distance, scaled to each point's local neighborhood.
2. **Graph Optimization**:
   - A low-dimensional layout is initialized and then optimized with stochastic gradient descent so that its graph matches the high-dimensional graph as closely as possible, as measured by a cross-entropy loss.
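
The graph construction step can be sketched in a few lines of NumPy. This is a simplified illustration, not the notebook's final implementation: it uses a brute-force neighbor search instead of `NNDescent`, excludes each point from its own neighbor list, and the function names (`knn_brute_force`, `smooth_knn`, `fuzzy_graph`) are hypothetical. It assumes the usual UMAP formulation, where each point's scale `sigma` is found by binary search so that its edge weights sum to `log2(k)`.

```python
import numpy as np


def knn_brute_force(X, k):
    """Return (indices, distances) of the k nearest neighbors of each row, excluding itself."""
    diff = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)


def smooth_knn(dists, n_iter=64, tol=1e-5):
    """Binary-search a per-point sigma so that the edge weights sum to log2(k)."""
    n, k = dists.shape
    target = np.log2(k)
    rho = dists[:, 0]  # distance to the nearest neighbor
    sigma = np.ones(n)
    for i in range(n):
        lo, hi, mid = 0.0, np.inf, 1.0
        for _ in range(n_iter):
            total = np.exp(-np.maximum(dists[i] - rho[i], 0.0) / mid).sum()
            if abs(total - target) < tol:
                break
            if total > target:  # weights too large, so shrink sigma
                hi = mid
                mid = (lo + hi) / 2.0
            else:  # weights too small, so grow sigma
                lo = mid
                mid = mid * 2.0 if np.isinf(hi) else (lo + hi) / 2.0
        sigma[i] = mid
    return rho, sigma


def fuzzy_graph(X, k):
    """Build a dense, symmetric matrix of membership strengths in [0, 1]."""
    idx, dists = knn_brute_force(X, k)
    rho, sigma = smooth_knn(dists)
    n = X.shape[0]
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    weights = np.exp(-np.maximum(dists - rho[:, None], 0.0) / sigma[:, None])
    W[rows, idx.ravel()] = weights.ravel()
    # Symmetrize with the probabilistic union: a + b - a * b
    return W + W.T - W * W.T
```

The dense `n x n` matrix keeps the sketch short. For larger datasets a sparse matrix (for example from `scipy.sparse`) would be the more practical choice.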

---

### **UMAP Parameters**
- **`n_neighbors`**:
  - Controls the size of the local neighborhood.
  - Smaller values emphasize local structure, while larger values capture global structure.
- **`min_dist`**:
  - Determines the minimum distance between points in the embedding.
  - Lower values result in tighter clusters.
- **`n_components`**:
  - The dimensionality of the reduced representation (e.g., 2D or 3D).
- **`metric`**:
  - The distance metric used to calculate similarity (default: "euclidean").
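
To make the role of `min_dist` concrete, the sketch below fits the two parameters `a` and `b` of the low-dimensional similarity curve `1 / (1 + a * d^(2b))` to a target that equals 1 up to `min_dist` and decays exponentially beyond it. It follows the approach commonly used in UMAP implementations and relies on `scipy.optimize.curve_fit`. The helper name `fit_ab` and the `spread` argument are assumptions introduced for illustration; they are not defined elsewhere in this notebook.

```python
import numpy as np
from scipy.optimize import curve_fit


def fit_ab(min_dist, spread=1.0):
    """Fit a and b so that 1 / (1 + a * d**(2b)) approximates the target curve."""

    def curve(d, a, b):
        return 1.0 / (1.0 + a * d ** (2 * b))

    d = np.linspace(0, spread * 3, 300)
    # Target similarity: 1 within min_dist, exponential decay beyond it.
    target = np.where(d < min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    (a, b), _ = curve_fit(curve, d, target)
    return a, b


a, b = fit_ab(min_dist=0.1)
print(f"a = {a:.3f}, b = {b:.3f}")
```

A smaller `min_dist` produces a sharper curve (larger `a`), which lets neighboring points sit closer together in the embedding.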
  
---

### **Advantages**
- Handles nonlinear relationships effectively.
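- Computationally efficient and scales to large datasets, in part because nearest neighbors can be found approximately (e.g., with nearest-neighbor descent).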
- Flexible and customizable for various datasets.

### **Disadvantages**
- Sensitive to hyperparameters, requiring tuning for optimal results.
- May struggle with preserving global structures in some datasets.

---

### **UMAP vs. Other Techniques**
| **Aspect**           | **UMAP**              | **PCA**                 | **t-SNE**                   |
|----------------------|-----------------------|-------------------------|-----------------------------|
| **Type**             | Nonlinear             | Linear                  | Nonlinear                   |
| **Local Structure**  | Preserves well        | Poorly preserved        | Preserves well              |
| **Global Structure** | Depends on parameters | Preserves well          | Poorly preserved            |
| **Scalability**      | Efficient             | Very efficient          | Slower than UMAP            |
| **Dimensionality**   | 2D, 3D, or more       | Works in all dimensions | Typically used for 2D or 3D |
| **Interpretability** | Moderate              | High                    | Low                         |

---

### **Typical Use Cases**
1. **Text Data**:
   - Represent text as embeddings (e.g., word2vec, BERT) and reduce dimensions with UMAP.
2. **Image Data**:
   - Use convolutional features or embeddings, then apply UMAP for visualization.
3. **Clustering**:
   - Combine UMAP with clustering to enhance cluster separation.

UMAP is a versatile and efficient tool, making it a favorite for tasks involving high-dimensional data.

In [1]:
# Import statements
import random 

import numpy as np 
import pandas as pd 

import seaborn as sns 
import matplotlib.pyplot as plt 

from sklearn.datasets import fetch_openml 

import scipy 
import scipy.sparse 
from scipy.optimize import curve_fit 
#import numba 

from pynndescent import NNDescent 

sns.set_theme()
random.seed(2)


ModuleNotFoundError: No module named 'pynndescent'

In [2]:
import numba

ModuleNotFoundError: No module named 'numba'
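
The tracebacks above show that `pynndescent` and `numba` are not installed in this environment. As a small sketch, the standard library can report which of the notebook's dependencies are missing before the import cell is run. The list `REQUIRED_MODULES` and the helper `missing_modules` are hypothetical names introduced here.

```python
import importlib.util

# Top-level modules that the notebook's import cells rely on.
REQUIRED_MODULES = [
    "numpy", "pandas", "seaborn", "matplotlib",
    "sklearn", "scipy", "pynndescent", "numba",
]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Missing packages can then be installed into the kernel's environment, for example with `%pip install pynndescent numba` in a notebook cell.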