# t-Distributed Stochastic Neighbor Embedding (t-SNE) Step-by-Step Example

This notebook is a step-by-step guide to applying t-Distributed Stochastic Neighbor Embedding (t-SNE) to a dataset. Each step comes with an explanation, code, and, where useful, a visualization.

## Step 1: What is t-SNE?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction technique used primarily to visualize high-dimensional datasets. Unlike PCA or LDA, which are linear projections, t-SNE focuses on preserving the local structure of the data: points that are close neighbors in the original space tend to stay close in the embedding. This makes it effective for revealing clusters or patterns in complex datasets.

t-SNE converts the pairwise similarities between data points into joint probabilities, once for the high-dimensional data and once for the low-dimensional embedding. It then adjusts the embedding to minimize the Kullback-Leibler divergence between these two joint probability distributions.
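
To make this objective concrete, here is a minimal NumPy sketch of the two probability distributions and the KL divergence between them. It is a simplification, not scikit-learn's implementation: it assumes one fixed Gaussian bandwidth `sigma` for every point, whereas t-SNE tunes a per-point bandwidth to match the chosen perplexity. All function names are illustrative.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows in Z."""
    sq = np.sum(Z ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T
    np.fill_diagonal(D, 0.0)
    return np.maximum(D, 0.0)  # clip tiny negatives caused by round-off

def high_dim_affinities(X, sigma=1.0):
    """Joint probabilities P from Gaussian similarities (fixed sigma: an assumption)."""
    P_cond = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P_cond, 0.0)                   # a point is not its own neighbor
    P_cond /= P_cond.sum(axis=1, keepdims=True)     # conditional p(j | i)
    return (P_cond + P_cond.T) / (2.0 * X.shape[0])  # symmetrize to joint p_ij

def low_dim_affinities(Y):
    """Joint probabilities Q from a Student-t kernel with one degree of freedom."""
    W = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over pairs where P is non-zero."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

# Toy data: 20 points in 5 dimensions and a random 2D layout for them
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
Y_demo = rng.normal(size=(20, 2))

P = high_dim_affinities(X_demo)
Q = low_dim_affinities(Y_demo)
print("KL(P || Q) for a random layout:", kl_divergence(P, Q))
```

A layout whose `Q` matched `P` exactly would give a divergence of zero. t-SNE moves the low-dimensional points by gradient descent to push this value down.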

## Step 2: Importing Required Libraries

We'll start by importing the libraries needed for this analysis: NumPy, pandas, Matplotlib, and scikit-learn.


In [None]:
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt


## Step 3: Loading and Understanding the Dataset

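For this demonstration, we'll use the Iris dataset, which contains 150 samples of iris flowers from three species (50 samples each). Each sample has 4 features: sepal length, sepal width, petal length, and petal width, all in centimeters. The goal is to visualize the flowers in a 2D space to identify any inherent clusters.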


In [None]:
# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
df = pd.DataFrame(X, columns=iris.feature_names)
# Store species names rather than integer codes so the table is easier to read
df['species'] = iris.target_names[y]

df.head()
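
Before reducing dimensions, it can help to confirm the dataset's shape and class balance. The sketch below is one way to do that. It reloads the data so that it runs on its own, and the choice of summary statistics is only a suggestion.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Rows x columns: 4 feature columns plus the species label
print(df.shape)

# Number of samples per species
print(df['species'].value_counts())

# Per-feature mean and standard deviation, to compare feature scales
print(df.describe().loc[['mean', 'std']])
```

The per-feature means and standard deviations show that the features are on different scales, which motivates the standardization in the next step.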


## Step 4: Data Standardization

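t-SNE builds its similarities from pairwise distances (Euclidean by default in scikit-learn), so features measured on larger scales would dominate the result. As with other dimensionality reduction techniques, we therefore standardize each feature to zero mean and unit variance before applying t-SNE, so that every feature contributes equally.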


In [None]:
# Rescale each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
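
A quick sanity check, sketched below as a standalone snippet, confirms that standardization did what we expect: each column of the scaled data should have a mean of approximately 0 and a standard deviation of approximately 1.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Before scaling, the features have noticeably different spreads
print("Raw std per feature:    ", X.std(axis=0).round(2))

# After scaling, every feature should have mean ~0 and std ~1
print("Scaled mean per feature:", X_scaled.mean(axis=0).round(6))
print("Scaled std per feature: ", X_scaled.std(axis=0).round(6))
```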


## Step 5: Applying t-SNE

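We'll now apply t-SNE to reduce the data from 4 dimensions to 2. The most important hyperparameters are `perplexity` (roughly, the effective number of neighbors considered for each point; the default is 30), `learning_rate`, and the maximum number of optimization iterations (`n_iter` in older scikit-learn releases, renamed `max_iter` in version 1.5). We'll keep the defaults for this demonstration and fix `random_state` so that the result is reproducible.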


In [None]:
# Embed the standardized data in 2 dimensions; random_state fixes the seed for reproducibility
tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X_scaled)
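
Because the embedding can change noticeably with `perplexity`, it is often worth comparing a few values. The sketch below fits t-SNE for three perplexities and prints the final KL divergence that scikit-learn exposes as `kl_divergence_`. The values 5, 30, and 50 are arbitrary choices for illustration, not recommendations. Perplexity must be smaller than the number of samples.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_scaled)
    print(f"perplexity={perplexity}: final KL divergence = {tsne.kl_divergence_:.3f}")
```

The KL values are not directly comparable across perplexities, since changing the perplexity also changes the high-dimensional distribution being matched. Plotting each entry of `embeddings` side by side is usually a more informative comparison.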


## Step 6: Visualizing the t-SNE Results

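We can now plot the data in the 2D space produced by t-SNE, coloring each point by species. When reading the plot, keep in mind that the axes have no intrinsic meaning and that t-SNE mainly preserves neighborhoods: which points end up close together is informative, while the size of a cluster and the distance between clusters are much less reliable.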


In [None]:
plt.figure(figsize=(8, 6))
# Plot each species separately so it gets its own color and legend entry
for species in np.unique(y):
    mask = y == species
    plt.scatter(X_tsne[mask, 0], X_tsne[mask, 1], label=iris.target_names[species])
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.title('t-SNE: Iris Dataset')
plt.legend(loc='best')
plt.grid(True)
plt.show()
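
A plot gives a visual impression, but the claim that t-SNE preserves local structure can also be checked numerically. One option, sketched below, is scikit-learn's `trustworthiness` score, which lies between 0 and 1 and penalizes points that are neighbors in the embedding but not in the original space. The neighborhood size `n_neighbors=5` is an arbitrary choice here.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X_scaled)

# Values close to 1 mean local neighborhoods are well preserved
score = trustworthiness(X_scaled, X_tsne, n_neighbors=5)
print(f"Trustworthiness (k=5): {score:.3f}")
```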
