#### 1. **Introduction to t-SNE**

**t-SNE (t-Distributed Stochastic Neighbor Embedding)** is a dimensionality reduction technique primarily used for visualizing high-dimensional data in two or three dimensions. Unlike PCA, which finds linear projections that capture the most variance, t-SNE focuses on preserving local neighborhood relationships in the data, which makes it particularly useful for revealing cluster structure during exploratory data analysis.

**Applications of t-SNE**:
- Visualizing clusters in high-dimensional data (e.g., MNIST digits, gene expression data).
- Reducing the complexity of data for interpretation.
- Identifying patterns or outliers in datasets.

---

#### 2. **How t-SNE Works**

t-SNE embeds high-dimensional data in a two- or three-dimensional map, aiming to keep each point's nearest neighbors (its "neighborhood") close together in the reduced space. Distances between far-apart points are not faithfully preserved.

**Key steps in t-SNE**:

1. **Pairwise Similarities**: In high-dimensional space, t-SNE calculates pairwise similarities between data points based on Gaussian distributions. Each point's neighborhood is modeled as a probability distribution.
2. **Low-Dimensional Mapping**: t-SNE tries to replicate the high-dimensional relationships in the low-dimensional space using t-distributions for pairwise similarities.
3. **Cost Function Optimization**: t-SNE minimizes the Kullback-Leibler (KL) divergence between the two probability distributions (in high and low dimensions). The goal is to maintain the local structure of the data.

---

#### 3. **Mathematical Explanation**

1. **High-Dimensional Similarities**:
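   For each pair of points $ x_i $ and $ x_j $, t-SNE computes the conditional probability $ p_{j|i} $, representing how similar point $ x_j $ is to point $ x_i $, based on a Gaussian distribution centered at $ x_i $. The bandwidth $ \sigma_i $ is set separately for each point so that the distribution matches the user-specified perplexity: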
   
   $$
   p_{j|i} = \frac{\exp(- \| x_i - x_j \|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(- \| x_i - x_k \|^2 / 2\sigma_i^2)}
   $$

2. **Low-Dimensional Similarities**:
   Similarly, in the low-dimensional space (e.g., 2D), t-SNE computes the pairwise similarities $ q_{ij} $ between points $ y_i $ and $ y_j $, using a Student's t-distribution with one degree of freedom, whose heavy tails allow dissimilar points to sit far apart in the map:
   
   $$
   q_{ij} = \frac{(1 + \| y_i - y_j \|^2)^{-1}}{\sum_{k \neq l} (1 + \| y_k - y_l \|^2)^{-1}}
   $$

3. **Cost Function (KL Divergence)**:
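   t-SNE aims to minimize the Kullback-Leibler divergence between the joint distributions $ P $ and $ Q $, where $ p_{ij} = (p_{j|i} + p_{i|j}) / 2N $ is the symmetrized form of the conditional probabilities from step 1 and $ N $ is the number of points: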
   
   $$
   KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
   $$

The algorithm adjusts the positions of points in the low-dimensional space, typically by gradient descent, to minimize this divergence, so that points that were close in the high-dimensional space tend to remain close in the low-dimensional space.
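
The three quantities above can be evaluated directly with NumPy. The following is a minimal sketch rather than scikit-learn's implementation: it assumes one shared bandwidth `sigma` for every point (t-SNE itself tunes each $ \sigma_i $ to match the perplexity), it only evaluates the cost at a random starting map without optimizing it, and the function names are illustrative.

```python
import numpy as np

def squared_distances(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Joint probabilities p_ij built from Gaussian conditionals p_{j|i}.

    Assumption: a single shared sigma instead of a per-point sigma_i.
    """
    n = X.shape[0]
    logits = -squared_distances(X) / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)            # exclude k == i from the sum
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)      # row i holds p_{j|i}
    return (cond + cond.T) / (2.0 * n)           # symmetrize to p_ij

def low_dim_affinities(Y):
    """Joint probabilities q_ij from a Student's t kernel (1 degree of freedom)."""
    kernel = 1.0 / (1.0 + squared_distances(Y))
    np.fill_diagonal(kernel, 0.0)                # exclude k == l from the sum
    return kernel / kernel.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over the pairs with p_ij > 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Small illustrative input: six points in four dimensions.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [1.5, 1.8, 2.5, 3.5],
              [8.0, 7.8, 8.5, 9.0],
              [1.1, 2.1, 3.1, 4.1],
              [9.0, 10.0, 11.0, 12.0]])

rng = np.random.default_rng(0)
Y = rng.normal(scale=1e-2, size=(X.shape[0], 2))  # random initial 2D map

P = high_dim_affinities(X, sigma=1.0)
Q = low_dim_affinities(Y)
print("sum of P:", P.sum(), "| sum of Q:", Q.sum())
print("KL(P || Q) at the random start:", kl_divergence(P, Q))
```

Both matrices sum to 1 over all pairs, and the divergence is zero only when the two distributions coincide; an optimizer would move the rows of `Y` to push it down.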

---

#### 4. **Step-by-Step Example**

Consider a small dataset of six four-dimensional points that we want to visualize in 2D using t-SNE.

| X1  | X2  | X3  | X4  |
|-----|-----|-----|-----|
| 1.0 | 2.0 | 3.0 | 4.0 |
| 5.0 | 6.0 | 7.0 | 8.0 |
| 1.5 | 1.8 | 2.5 | 3.5 |
| 8.0 | 7.8 | 8.5 | 9.0 |
| 1.1 | 2.1 | 3.1 | 4.1 |
| 9.0 | 10.0| 11.0| 12.0|

We will reduce this dataset from 4 dimensions to 2 dimensions using t-SNE to visualize it.
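
Before running t-SNE, it can help to inspect the pairwise distances in the original space, since these determine which points are treated as neighbors. The sketch below uses only the standard library; the variable names are illustrative and rows are indexed from 0 in table order.

```python
import math

# The six rows of the table above.
points = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [1.5, 1.8, 2.5, 3.5],
    [8.0, 7.8, 8.5, 9.0],
    [1.1, 2.1, 3.1, 4.1],
    [9.0, 10.0, 11.0, 12.0],
]

# Euclidean distance between every pair of rows.
distances = [[math.dist(p, q) for q in points] for p in points]

# Index of the nearest other row for each row.
nearest = [
    min((j for j in range(len(points)) if j != i), key=lambda j: distances[i][j])
    for i in range(len(points))
]

for i, j in enumerate(nearest):
    print(f"row {i}: nearest row {j}, distance {distances[i][j]:.2f}")
```

Rows 0, 2, and 4 lie within one unit of each other, while rows 1, 3, and 5 are several units from everything else, so a 2D map would be expected to keep the first three together.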

---

#### 5. **Python Code Example**

Here’s how to implement t-SNE using Python’s `scikit-learn` library:

```python
# Import necessary libraries
import pandas as pd
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Step 1: Create the dataset (6 samples, 4 features)
data = {'X1': [1.0, 5.0, 1.5, 8.0, 1.1, 9.0],
        'X2': [2.0, 6.0, 1.8, 7.8, 2.1, 10.0],
        'X3': [3.0, 7.0, 2.5, 8.5, 3.1, 11.0],
        'X4': [4.0, 8.0, 3.5, 9.0, 4.1, 12.0]}

df = pd.DataFrame(data)

# Step 2: Apply t-SNE to reduce from 4 dimensions to 2 dimensions.
# perplexity must be less than the number of samples (6 here); the default
# of 30 raises "ValueError: perplexity must be less than n_samples".
tsne = TSNE(n_components=2, perplexity=2, random_state=42)
df_tsne = tsne.fit_transform(df)

# Step 3: Plot the 2D results
plt.scatter(df_tsne[:, 0], df_tsne[:, 1])
plt.title('t-SNE Projection to 2D')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()

# Step 4: Display the transformed data
print("Original Data:\n", df)
print("\nTransformed Data (2 Dimensions):\n", df_tsne)
```

**Explanation**:
- **Step 1**: We create a 4-dimensional dataset.
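- **Step 2**: We apply t-SNE to reduce the dataset from 4 dimensions to 2 dimensions. `perplexity` must be smaller than the number of samples (6 here), so scikit-learn's default of 30 has to be lowered for this dataset.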
- **Step 3**: We visualize the transformed data in a 2D scatter plot.
- **Step 4**: We display the transformed data in the 2D space.

---

#### 6. **Choosing Parameters for t-SNE**

t-SNE has a few important parameters that can affect the quality of the results:

1. **Perplexity**: Roughly the effective number of nearest neighbors considered for each point. A common range is between 5 and 50, and the value must be smaller than the number of samples. Smaller datasets benefit from lower perplexity values, while larger datasets may need higher values.
   
   Example:
   ```python
   tsne = TSNE(n_components=2, perplexity=30, random_state=42)
   ```

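2. **Learning Rate**: Controls how far the points move at each optimization step. If the learning rate is too high, the embedding may look like a "ball" in which every point is roughly equidistant from its nearest neighbors; if it is too low, most points may end up compressed in a dense cloud with few outliers. Typical values range from 10 to 1000.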

   Example:
   ```python
   tsne = TSNE(n_components=2, learning_rate=200, random_state=42)
   ```

3. **Number of Iterations**: Determines how many optimization steps the algorithm takes to minimize the cost function. More iterations can lead to better results but will also increase computation time. In scikit-learn this is the `max_iter` argument (named `n_iter` in older releases).
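
To make the link between perplexity and neighborhood size concrete: the perplexity of a point's conditional distribution is $ 2^{H} $, where $ H $ is its Shannon entropy in bits, and t-SNE searches for the bandwidth $ \sigma_i $ that reaches the requested value. The sketch below illustrates this with a simple bisection in NumPy. It is not scikit-learn's internal routine; the helper names and the distance values are made up for illustration.

```python
import numpy as np

def conditional_probs(sq_dists, sigma):
    """p_{j|i} for one point, given squared distances to all other points."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    logits -= logits.max()                 # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def perplexity(p):
    """Perplexity 2**H(p), with the entropy H measured in bits."""
    p = p[p > 0]
    return float(2.0 ** (-np.sum(p * np.log2(p))))

def sigma_for_perplexity(sq_dists, target, lo=1e-3, hi=1e3, steps=100):
    """Bisection on sigma in log space; perplexity grows with sigma."""
    for _ in range(steps):
        mid = np.sqrt(lo * hi)
        if perplexity(conditional_probs(sq_dists, mid)) < target:
            lo = mid
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

# Squared distances from one point to its 5 neighbors (made-up values).
sq_dists = np.array([0.04, 0.79, 15.0, 21.0, 64.0])

for target in (1.5, 2.0, 4.0):
    sigma = sigma_for_perplexity(sq_dists, target)
    achieved = perplexity(conditional_probs(sq_dists, sigma))
    print(f"target perplexity {target}: sigma = {sigma:.3f}, achieved = {achieved:.3f}")
```

With five neighbors the perplexity can only range between 1 and 5, which is why the requested value has to stay below the number of samples. Larger targets need a wider Gaussian, so more neighbors receive non-negligible weight.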

---

#### 7. **Advantages and Disadvantages of t-SNE**

**Advantages**:
- **Non-linear**: Unlike PCA, t-SNE captures non-linear relationships and preserves the local structure of the data.
- **Effective for Visualization**: t-SNE is particularly useful for visualizing complex, high-dimensional data in a 2D or 3D space.

**Disadvantages**:
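- **Computationally Expensive**: t-SNE is slower than other dimensionality reduction techniques like PCA, especially for large datasets. The exact algorithm scales quadratically with the number of points; the Barnes-Hut approximation reduces this to roughly O(N log N).
- **Not a Clustering Algorithm**: t-SNE is often mistaken for a clustering algorithm, but it is a visualization tool. Clusters seen in t-SNE plots are not always reliable: apparent cluster sizes and the distances between clusters generally carry little meaning.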
- **Sensitive to Parameter Choices**: Results can vary significantly depending on the choice of parameters like perplexity and learning rate.

---

#### 8. **Conclusion**

t-SNE is a powerful and widely used technique for visualizing high-dimensional data in a reduced space. While it is computationally intensive and sensitive to parameters, it excels at preserving the local structure of data, making it well suited to inspecting cluster structure and to exploratory analysis.

**Homework**:  
Apply t-SNE to a high-dimensional dataset (e.g., the MNIST dataset), experiment with different perplexity values, and visualize the results in 2D. Compare the performance and results of t-SNE with PCA.