# t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique that is particularly well-suited for the visualization of high-dimensional datasets.

It is useful for exploring structure in data at multiple scales and for identifying clusters.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Load dataset
data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# Preview data
print(data.head())

# Handle missing values: fill Age with the median and Embarked with the most frequent value
data.loc[:, 'Age'] = data['Age'].fillna(data['Age'].median())
data.loc[:, 'Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# Drop identifier and free-text columns that are not useful for this analysis
data = data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

# Encoding categorical data
encoder = OneHotEncoder()
categorical_features = ['Pclass', 'Sex', 'Embarked']
encoded_features = encoder.fit_transform(data[categorical_features]).toarray()
encoded_feature_names = encoder.get_feature_names_out(categorical_features)

# Create a DataFrame from encoded features and concatenate with the original dataset
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)
data_encoded = pd.concat([data.drop(categorical_features, axis=1), encoded_df], axis=1)

# Scale data before applying t-SNE
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_encoded.drop('Survived', axis=1))

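# Fit t-SNE; random_state fixes the seed so the plot is reproducible.
# The iteration limit is left at its default of 1000; the keyword for changing it
# is max_iter in scikit-learn >= 1.5 and n_iter in older releases.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)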
tsne_results = tsne.fit_transform(data_scaled)

plt.figure(figsize=(10, 8))
sns.scatterplot(x=tsne_results[:, 0], y=tsne_results[:, 1], hue=data['Survived'], style=data['Survived'], palette='viridis')
plt.title('t-SNE Visualization of Titanic Dataset')
plt.show()

**Perplexity:** This parameter can strongly affect your results. It loosely corresponds to the number of nearest neighbors that t-SNE considers when placing each point, and it must be smaller than the number of samples. Common values are between 5 and 50; trying several can help if your data looks too bunched up or too dispersed.
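
One way to see the effect of perplexity is to fit the same data several times and compare the embeddings. The sketch below is illustrative only: it uses synthetic blobs from `make_blobs` as a stand-in for the Titanic features, and the sample size, seed, and the three perplexity values are assumptions chosen for speed.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in data (an assumption, not the Titanic features):
# 150 points in 10 dimensions drawn around 3 centers
X, y = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

embeddings = {}
for perplexity in (5, 30, 50):
    # perplexity must stay below the number of samples (150 here)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(f"perplexity={perplexity}: embedding shape {embeddings[perplexity].shape}")
```

Each array in `embeddings` can be passed to `sns.scatterplot` in the same way as `tsne_results` above, which makes it easy to compare the layouts side by side.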

**Iterations:** A higher number of iterations gives t-SNE more time to optimize the arrangement of points. With too few iterations the optimization can stop before it has converged, which might produce an 'unoptimized' plot: the global structure might look okay, but local details could be misleading.
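
A rough way to check whether the iteration budget matters is to fit with a small and a large limit and compare the final KL divergence that scikit-learn reports. This is a sketch under assumptions: the data is synthetic, the budgets of 300 and 1000 are arbitrary, and the keyword lookup exists only because the parameter is named `max_iter` in newer scikit-learn releases and `n_iter` in older ones.

```python
import inspect

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic stand-in data (an assumption, not the Titanic features)
X, _ = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

# Pick whichever iteration keyword this scikit-learn version accepts
params = inspect.signature(TSNE.__init__).parameters
iter_kw = "max_iter" if "max_iter" in params else "n_iter"

results = {}
for budget in (300, 1000):
    tsne = TSNE(n_components=2, perplexity=30, random_state=0, **{iter_kw: budget})
    tsne.fit(X)
    # n_iter_ is the number of iterations actually run; kl_divergence_ is the final cost
    results[budget] = {"n_iter": tsne.n_iter_, "kl": float(tsne.kl_divergence_)}
    print(f"budget={budget}: ran {tsne.n_iter_} iterations, KL divergence {tsne.kl_divergence_:.3f}")
```

If the KL divergence is still dropping noticeably between the two budgets, the smaller one was probably not enough for this data.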

**Scaling:** Always scale your data (typically by standardizing each feature to zero mean and unit variance) before applying t-SNE, so that one feature’s variance doesn’t dominate the others in the distance computation.
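
To illustrate why scaling matters, the sketch below builds two hypothetical features on very different scales and shows their variances before and after `StandardScaler`. The feature values are made up for illustration and are not taken from the Titanic dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: one feature in the tens, one in the tens of thousands
X = np.column_stack([
    rng.normal(30, 10, 200),          # age-like feature
    rng.normal(50_000, 15_000, 200),  # income-like feature
])
print("variance before scaling:", X.var(axis=0))

# Standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
print("variance after scaling: ", X_scaled.var(axis=0))
```

Before scaling, distances between points are driven almost entirely by the second feature; after scaling, both features contribute on a comparable footing.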