# Section 6. Dimensionality Reduction

#### Instructor: Pierre Biscaye 

The content of this notebook draws from UC Berkeley D-Lab's Python Machine Learning [course](https://github.com/dlab-berkeley/Python-Machine-Learning).

What if there was a way you could reduce the number of dimensions in your data and still retain a significant portion of the critical information in your data, its 'identity'?

That is **dimensionality reduction** in a nutshell. It is a statistical technique that reduces a dataset of `m` dimensions down to `k` while minimizing the loss of information. This technique is useful for generalizing, visualizing, and compressing data.

### Sections

1. Data and correlations between features
2. Principal Component Analysis
3. Interpreting PCA: understanding components, explaining variance
4. PCA and supervised ML
5. t-SNE and interactive plots

### Required packages
* pandas
* numpy
* matplotlib
* seaborn
* scikit-learn
* time (Python standard library)
* plotly (for the interactive plots)

### Required data
* world_happiness.csv
* diamonds.csv

In [None]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

## 1. Introduction

Part of the reason why dimensionality reduction works is the redundancy that exists in datasets. If you have a dataset of `N` features that are all highly correlated with one another, then you don't really have `N` features' worth of signal.

Think about the results of a survey asking subjects about their political opinions. The survey asks respondents to rate how strongly they support or oppose positions on topics such as taxes, abortion, the environment, etc. Opinions on gun control probably correlate with opinions on abortion, and opinions on healthcare probably correlate with opinions on military spending.

A dimensionality reduction technique could compress the results of this survey down to a single dimension that effectively represents a left-right political spectrum. And because of the multicollinearity in the data, the share of information lost would be much smaller than the share of dimensions removed.
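
The survey intuition can be sketched with simulated data. Everything here is made up for illustration: a hypothetical latent `ideology` score, five noisy survey items derived from it, and the noise level of 0.5. The point is only to show how much variance a single component can absorb when the items are highly correlated.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical latent left-right position for 500 respondents
ideology = rng.normal(size=500)

# Five made-up survey items: the latent score plus independent noise
n_items = 5
answers = ideology[:, None] + 0.5 * rng.normal(size=(500, n_items))

# Standardize the items, then fit PCA keeping all components
scaled_answers = StandardScaler().fit_transform(answers)
survey_pca = PCA().fit(scaled_answers)

# Share of total variance captured by the first component alone
first_share = survey_pca.explained_variance_ratio_[0]
print(f"Share of variance in the first component: {first_share:.2f}")
```

Going from five dimensions to one removes 80% of the columns, but with items this correlated the first component should keep most of the variance.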

### Data: World Happiness Report

The data for this notebook originates from the [2022 World Happiness report](https://worldhappiness.report/ed/2022/) and was downloaded from this [kaggle repo](https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2022). You can read the full report [here](https://happiness-report.s3.amazonaws.com/2022/WHR+22.pdf). The following data dictionary explains what the variables mean. 
* **happiness_score:** The national average response to the question of life evaluation. Question asks “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”
* **gdp:** GDP per capita (variable name gdp) in purchasing power parity (PPP) at constant 2017 international dollar prices.
* **life_expectancy:** Healthy life expectancies at birth are based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository 
* **social_support:** The national average of the binary responses (either 0 or 1) to the GWP (Gallup World Poll) question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
* **freedom:** Freedom to make life choices is the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
* **generosity:** Generosity is the residual of regressing national average of response to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
* **corruption:** The measure is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses. The corruption perception at the national level is just the average response of the overall perception at the individual level.
* **country** and **continent** identify the location of the country-level observations.

In [None]:
happiness = pd.read_csv("Data/world_happiness.csv")
happiness.head()

In [None]:
happiness.info()

Dimensionality reduction is most effective when there is significant redundancy in the data. By redundancy, we mean a high level of multicollinearity in the data — frequent high correlations between variables in the data.

Let's look at the correlation table for the world happiness data to see if that is the case. We won't include rank because that is perfectly correlated with happiness score.

In [None]:
corr = happiness.select_dtypes("number").drop(columns=['rank']).corr()
corr 

In [None]:
# Correlation heat map

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(8,8))

# Plot a heatmap using seaborn
# Include the mask and correct aspect ratio, and a diverging colormap
sns.heatmap(corr, mask=mask, cmap='RdBu', vmax=.8, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()

**Question**: Do you observe high multicollinearity?
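
A heat map can be complemented by ranking feature pairs by the absolute value of their correlation. The sketch below uses a small made-up DataFrame (`toy_df`, with columns `a`, `b`, `c`) rather than the happiness data, and the same idea could be applied to `corr` above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
base = rng.normal(size=200)

# Toy data: b is built from a, c is unrelated noise
toy_df = pd.DataFrame({
    "a": base,
    "b": base + 0.3 * rng.normal(size=200),
    "c": rng.normal(size=200),
})
toy_corr = toy_df.corr()

# Indices of the strict upper triangle, so each pair appears exactly once
rows, cols = np.triu_indices(len(toy_corr), k=1)
pairs = pd.Series(
    toy_corr.to_numpy()[rows, cols],
    index=pd.MultiIndex.from_arrays(
        [toy_corr.index[rows], toy_corr.columns[cols]]
    ),
)

# Strongest relationships (positive or negative) first
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs)
```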

## 2. Principal Component Analysis 

Principal component analysis (PCA) is the most commonly used dimensionality reduction algorithm. It works by transforming a high-dimensional dataset into a smaller one while still retaining a significant amount of the information. The principal components are the output variables, each a linear combination of the original variables. The components are uncorrelated with one another by construction, with the most valuable information compressed into the first few components.

![](https://miro.medium.com/max/600/1*e_kBZQz2hsa7de6TxpgJqg.gif)
Source: Towards Data Science

PCA first measures how each variable deviates from its mean and how those deviations move together across variables. To do this, PCA calculates the covariance matrix, which has dimensions K x K for K features. Here's the formula to calculate the covariance for a sample of size N:

![](https://www.gstatic.com/education/formulas2/443397389/en/covariance_formula.svg)

Next, the algorithm calculates the eigenvectors and eigenvalues of the covariance matrix. These are used to construct the principal components.

The eigenvectors define the directions of the axes of the principal components, while each eigenvalue measures how much of the data's variance lies along its eigenvector. PCA orders the eigenvalues from greatest to least, which is why the first component carries the most signal.
![](https://miro.medium.com/max/600/1*BpwgqgR-dVZSmIPKTaM4JQ.gif)

Finally PCA transforms the original data by taking the dot product of the transposed eigenvectors and the original dataset.

![](https://devopedia.org/images/article/139/4543.1548137789.jpg)

Source: Devopedia
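
The steps described above (center, covariance, eigen-decomposition, sort, project) can be sketched by hand with NumPy and compared against scikit-learn. This is a minimal illustration on made-up data (`toy`), not how scikit-learn implements PCA internally (it uses a singular value decomposition). The sign of each component is arbitrary, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Toy data: 200 rows, 4 correlated features (made up for illustration)
toy = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# 1. Center each feature on its mean
centered = toy - toy.mean(axis=0)

# 2. Covariance matrix, shape (features, features)
cov = np.cov(centered, rowvar=False)

# 3. Eigen-decomposition; eigh handles symmetric matrices, ascending order
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Reorder from largest to smallest eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Project the centered data onto the eigenvectors
scores = centered @ eigvecs

# Compare with scikit-learn's result
sk_pca = PCA(svd_solver="full").fit(toy)
eigvals_match = np.allclose(eigvals, sk_pca.explained_variance_)
scores_match = np.allclose(np.abs(scores), np.abs(sk_pca.transform(toy)))
print(eigvals_match, scores_match)
```

If the manual steps agree with scikit-learn, both checks print `True`.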

### PCA with our data

Let's select the numerical variables and initialize the PCA algorithm.

In [None]:
X = happiness.select_dtypes("number").drop("rank", axis = 1)

We need to scale the data because PCA is sensitive to variables with higher variances/ranges. If one variable (like GDP per capita) has a range of 0–80,000 and another (like Freedom) is 0–1, PCA will assume the high-range variable is more "important" simply because the numbers are bigger.

We therefore want the algorithm to analyze features with similar variances to avoid that kind of bias. To do that, we'll standardize the data so that each feature has mean 0 and variance 1.
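
To see why scaling matters, here is a small sketch with two independent, made-up features on very different scales (the ranges 0 to 80,000 and 0 to 1 echo the example above; the data itself is random).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Two independent random features with very different ranges
big = rng.uniform(0, 80_000, size=300)
small = rng.uniform(0, 1, size=300)
raw = np.column_stack([big, small])

# PCA on the raw data versus PCA on standardized data
raw_ratio = PCA().fit(raw).explained_variance_ratio_
scaled_ratio = PCA().fit(
    StandardScaler().fit_transform(raw)
).explained_variance_ratio_

print("Unscaled:", raw_ratio.round(4))
print("Scaled:  ", scaled_ratio.round(4))
```

Without scaling, the first component should be almost entirely the large-range feature. After scaling, the two unrelated features should share the variance roughly evenly.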

In [None]:
scale = StandardScaler()
Xs = scale.fit_transform(X)
Xs = pd.DataFrame(Xs, columns=X.columns)
Xs.head()

In [None]:
#Initialize PCA model and set n_components = 2
pca = PCA(n_components=2)

#Fit and transform
Xp = pca.fit_transform(Xs)
Xp[:5, :]

Picking a value for `n_components` is not required. We can apply PCA to our data without setting `n_components`, and it will return a dataset with the same number of columns as the input.

In [None]:
pca = PCA()
Xp = pca.fit_transform(Xs)
Xp[:5, :]

In [None]:
#Shape
Xp.shape

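To get the first two components, select the first two columns. Notice that they are the same as when we specified `n_components=2`. That is because, for a small dataset like this one, scikit-learn computes the full decomposition either way and `n_components` only controls how many components are kept.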
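
This equivalence can be checked directly. The sketch below uses random data (`demo`) and sets `svd_solver="full"` explicitly so that both fits use the exact decomposition; approximate solvers such as `"randomized"` may differ slightly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
demo = rng.normal(size=(100, 5))

# Keep every component, then keep only two
full_scores = PCA(svd_solver="full").fit_transform(demo)
two_scores = PCA(n_components=2, svd_solver="full").fit_transform(demo)

# The first two columns of the full result match the truncated result
same = np.allclose(full_scores[:, :2], two_scores)
print(same)
```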

In [None]:
#Overwrite Xp with just its first two columns.
Xp = Xp[:, :2]
Xp[:5, :]

## 3. Interpreting PCA

It is not always obvious what the identified principal components represent. We can look at how they correlate with particular variables to get a sense. We can also plot them to see. 

A great thing about dimensionality reduction is that we can visualize highly dimensional data in 2D or 3D.
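
As a sketch of the "correlate components with variables" idea, the snippet below builds a made-up dataset (`demo_df`, where `x1` and `x2` share a common factor and `x3` is noise) and computes the correlation of every feature with each component.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
factor = rng.normal(size=200)

# x1 and x2 share a common factor; x3 is independent noise
demo_df = pd.DataFrame({
    "x1": factor + 0.3 * rng.normal(size=200),
    "x2": factor + 0.3 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})

# Standardize, project onto two components
z = StandardScaler().fit_transform(demo_df)
pcs = pd.DataFrame(
    PCA(n_components=2).fit_transform(z), columns=["comp1", "comp2"]
)

# Correlation of each original feature with each component
feature_pc_corr = (
    pd.concat([demo_df, pcs], axis=1)
    .corr()
    .loc[demo_df.columns, pcs.columns]
)
print(feature_pc_corr.round(2))
```

Features with a large absolute correlation with a component are the ones that component is mostly summarizing.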

In [None]:
#Put pca data into dataframe
Xp = pd.DataFrame(Xp, columns=["comp1", "comp2"])

In [None]:
#plot the pca data with a 2D scatter plot
plt.figure(figsize=(11, 8))
plt.scatter(x=Xp["comp1"], y = Xp["comp2"], s=80)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

Let's add continent label to the chart to see if that helps with interpretation.

In [None]:
#Add continent to the pca dataset
Xp["continent"] = happiness["continent"].tolist()

plt.figure(figsize=(11, 8))
for continent in Xp.continent.unique():
    data = Xp[Xp.continent == continent]
    plt.scatter(x=data["comp1"], y = data["comp2"], label = continent, s=80)
    plt.xlabel("Component 1")
    plt.ylabel("Component 2")
    plt.legend(fontsize = "large")
plt.show()


**Question**: Based on these patterns and your expectations about likely differences in well-being across continents, are higher or lower values of these components likely to be associated with higher national well-being?

### Explained variance

Explained variance tells us how much of the original dataset's signal/identity each component carries. The `explained_variance_ratio_` attribute is a normalized version: each component's variance divided by the total variance, so the ratios sum to 1 when all components are kept.
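
The relationship between `explained_variance_` and `explained_variance_ratio_` can be verified on a small random dataset (`sample` is made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Four independent features with different spreads
sample = rng.normal(size=(150, 4)) * np.array([3.0, 2.0, 1.0, 0.5])

fitted = PCA().fit(sample)

# Normalize the per-component variances by their total
manual_ratio = fitted.explained_variance_ / fitted.explained_variance_.sum()

ratios_match = np.allclose(manual_ratio, fitted.explained_variance_ratio_)
total_ratio = fitted.explained_variance_ratio_.sum()
print(ratios_match, total_ratio)
```

This holds when all components are kept. If `n_components` is smaller than the number of features, the ratios are still relative to the total variance of the data and so sum to less than 1.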

In [None]:
#Explained variance ratio of each component
exp_var_ratio = pca.explained_variance_ratio_.round(3)
exp_var_ratio

In [None]:
#The cumulative sum of the explained variance ratio of each component
exp_var_ratio_cs = exp_var_ratio.cumsum()
exp_var_ratio_cs

The first two components (two of the seven original dimensions) account for almost three-quarters of the variance in the data.

Cutting the number of dimensions by more than half leaves us with 83.2% of the signal.

In [None]:
#Plot
plt.figure(figsize=(11, 8))
plt.bar(x=range(1, exp_var_ratio.shape[0] +1), height=exp_var_ratio, color = "b")
plt.plot(range(1, exp_var_ratio.shape[0] +1), exp_var_ratio_cs, c = "red", marker = "*")
plt.xlabel("N Components", fontsize = 16)
plt.ylabel("Explained Variance Ratio",fontsize = 16)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)
plt.show()

### Eigenvectors/PCA components

The `components_` attribute of a fitted PCA object stores the eigenvectors, one per principal component. It is a matrix with one row per component and one column per feature. In this case it will be a 7x7 matrix.

These eigenvectors tell us how much each feature contributes to each PCA component.

In [None]:
pca.components_.shape

In [None]:
pca.components_.round(3)

In the following chart we plot, for each feature, its weight in the first eigenvector against its weight in the second.

In [None]:
plt.figure(figsize = (8,8))
plt.grid(True)
cols = X.columns
for i in range(len(cols)):
    x = pca.components_[0, i]
    y = pca.components_[1, i]
    plt.arrow(0, 0, x, y, color = "blue", width = 0.005, alpha = .3)
    plt.annotate(cols[i], xy = (x, y), fontsize = 14)
plt.show()

How do we interpret these blue eigenvector arrows? They tell us about **how each feature influences the two components**.
* Length: Longer arrows mean that feature has a stronger influence on the two components.
* Direction: Arrows pointing in the same direction are positively correlated. Arrows pointing in opposite directions are negatively correlated.
* Alignment: An arrow pointing straight right (or left) is a major driver of comp1 only; an arrow pointing straight up (or down) is a major driver of comp2 only. Angled arrows contribute to both.

We can also look at the component loadings (eigenvectors) numerically, to try and identify what each component is capturing. This can be clearer than the plot.


In [None]:
# Create a DataFrame of the loading eigenvectors
loadings = pd.DataFrame(
    pca.components_[:2].T, 
    columns=['PC1', 'PC2'], 
    index=X.columns
)
print(loadings.sort_values(by='PC1', ascending=False))

What does this tell us? Principal component 1 seems like a measure of "development" or "affluence". Happiness, GDP per capita, life expectancy, social support, and freedom all correlate strongly with it. The graph of eigenvectors tells us that we could also call PC1 "happiness".

Principal component 2 is a bit harder to identify, but seems like it may reflect something like "governance" or "norms", as generosity, freedom, and corruption are the three strongest drivers. Since generosity points straight up, this feature varies independently of the PC1 features.

A great way to check this is to color the scatter plot with the original data, to see if spatial distance in the PCA plot matches our intuition.

In [None]:
plt.figure(figsize=(10, 8))
sc = plt.scatter(x=Xp["comp1"], y=Xp["comp2"], c=X['generosity'], cmap='viridis', s=80)
plt.colorbar(sc, label='Generosity')
plt.xlabel("Component 1 (The Development Factor)")
plt.ylabel("Component 2 (The Governance/Norms Factor)")
plt.show()

Let's now plot a **biplot** which combines the scatterplot and arrows into a single visualization. 

Component loadings lie between -1 and 1, while the PCA scores can be much larger. So we will scale the eigenvector loadings to make them visible on the same axes.

In [None]:
# 1. Create the base scatter plot
plt.figure(figsize=(12, 8))
plt.scatter(Xp["comp1"], Xp["comp2"], alpha=0.5, color='gray', s=50)

# 2. Define the scaling factor for arrows (to make them visible against the dots)
# We'll start with 5 based on the above plot
arrow_scale = 5

# 3. Plotting the eigenvectors
for i, col_name in enumerate(X.columns):
    x_loading = pca.components_[0, i] * arrow_scale
    y_loading = pca.components_[1, i] * arrow_scale
    plt.arrow(0, 0, x_loading, y_loading, color='blue', 
              width=0.05, head_width=0.1, alpha=0.8)
    # Annotate with feature name
    plt.text(x_loading * 1.15, y_loading * 1.1, col_name, 
             color='blue', fontsize=12, ha='center', va='center')

plt.xlabel(f"PC1 ({exp_var_ratio[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({exp_var_ratio[1]*100:.1f}%)")
plt.title("PCA Biplot: World Happiness Data")
plt.grid(True, linestyle='--', alpha=0.6)
plt.axhline(0, color='black', linewidth=1)
plt.axvline(0, color='black', linewidth=1)
plt.show()

How do we interpret this plot? Each dot is a country. Countries on the far right have more of the characteristics of the arrows pointing right. Countries on the far left have less. The same is true for countries at the far top or bottom.

**Question**: What are the characteristics of countries in the bottom left of the plot?

Let's label the countries at the extremes of both principal components, to dig a little deeper.

In [None]:
# 1. Identify the 4 extreme points
# We use .idxmax() and .idxmin() to find the index of the extremes
top_pc1_idx = Xp['comp1'].idxmax()
bot_pc1_idx = Xp['comp1'].idxmin()
top_pc2_idx = Xp['comp2'].idxmax()
bot_pc2_idx = Xp['comp2'].idxmin()

# Store them in a list to loop through easily
extreme_indices = [top_pc1_idx, bot_pc1_idx, top_pc2_idx, bot_pc2_idx]

# 2. Re-plot the Biplot
plt.figure(figsize=(12, 8))
plt.scatter(Xp["comp1"], Xp["comp2"], alpha=0.4, color='gray', s=50)

# 3. Add the 4 specific labels
for idx in extreme_indices:
    # Get country name from the original dataframe using the index
    name = happiness.loc[idx, 'country'] 
    plt.annotate(name, 
                 (Xp.loc[idx, 'comp1'], Xp.loc[idx, 'comp2']),
                 textcoords="offset points", 
                 xytext=(0,10), 
                 ha='center', 
                 fontweight='bold',
                 fontsize=12,
                 bbox=dict(boxstyle='round,pad=0.3', fc='yellow', alpha=0.3))

# 4. Add the arrows
arrow_scale = 5
for i, col_name in enumerate(X.columns):
    x_loading = pca.components_[0, i] * arrow_scale
    y_loading = pca.components_[1, i] * arrow_scale
    plt.arrow(0, 0, x_loading, y_loading, color='blue', width=0.04, head_width=0.1, alpha=0.6)
    plt.text(x_loading * 1.1, y_loading * 1.1, col_name, color='blue', fontsize=11)

plt.axhline(0, color='black', lw=1, alpha=0.5)
plt.axvline(0, color='black', lw=1, alpha=0.5)
plt.title("PCA Biplot with Extreme Country Labels")
plt.xlabel(f"PC1 ({exp_var_ratio[0]*100:.1f}%)")
plt.ylabel(f"PC2 ({exp_var_ratio[1]*100:.1f}%)")
plt.show()

It makes sense that Finland is at the far right, and that Afghanistan is at the far left.

The top and bottom are a bit less clear. Recall that generosity was measured as the residual from a regression of charitable giving on GDP. It's not clear what exactly this represents. 

## 4. ML with PCA

One thing that also makes dimensionality reduction useful is that we can train machine learning models on dimensionality-reduced data and achieve a similar performance.

Let's analyze the relationship between principal components and the performance of a machine learning model. 

The plan:
1. Train a machine learning model on untransformed data to serve as our baseline. Observe its score. Because this is a regression problem, the `score` method reports R² rather than classification accuracy, although we loosely call it "accuracy" below.
2. Train a machine learning model on PCA-transformed data for every number of components between 1 and the number of features.
3. Plot the number of components used to train a model versus performance, in terms of score and time elapsed in training the model, alongside the baseline from the untransformed data.
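
The plan above can also be sketched as a scikit-learn pipeline. This is a variation for illustration, not the flow used in the cells below: it runs on synthetic data from `make_regression`, uses plain `Ridge` with an arbitrary `alpha=1.0`, and fits the scaler and PCA on the training split only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for a real dataset
X_demo, y_demo = make_regression(
    n_samples=400, n_features=6, noise=10.0, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scores_by_k = {}
for k in range(1, X_demo.shape[1] + 1):
    # Scale -> reduce to k components -> ridge regression
    model = make_pipeline(
        StandardScaler(), PCA(n_components=k), Ridge(alpha=1.0)
    )
    model.fit(Xtr, ytr)
    # For regressors, score() is R^2 on the held-out data
    scores_by_k[k] = model.score(Xte, yte)

for k, r2 in scores_by_k.items():
    print(k, round(r2, 3))
```

Wrapping the steps in a pipeline means the test data never influences the scaling or the components, which keeps the held-out score honest.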

#### Diamond Data Dictionary

We will work with a dataset on characteristics of diamonds. The objective will be to predict the price of a diamond based on its characteristics, using ridge regression.
* **carat:** weight of the diamond (0.2--5.01)
* **cut:** quality of the cut (Fair, Good, Very Good, Premium, Ideal)
* **color:** diamond color, from J (worst) to D (best)
* **clarity:** a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
* **depth**: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
* **table**: width of top of diamond relative to widest point (43--95)
* **price:** price in US dollars (\$326--\$18,823)
* **x**: length in mm (0--10.74)
* **y**: width in mm (0--58.9)
* **z**: depth in mm (0--31.8)


In [None]:
#Load in diamonds data. 
#We will take a random sample of 15k rows to save time.
diamonds = (
    pd.read_csv("data/diamonds.csv", index_col=[0])
    .sample(n=15000, random_state=212)
    .reset_index(drop=True)
)
diamonds.head()

In [None]:
diamonds.info()

In [None]:
#Imports
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

#grab X and y
X = diamonds.drop("price", axis = 1)
y = diamonds["price"]

In [None]:
#To keep things simple let's use just numerical features
# That means we drop color, clarity, and cut
X = X.select_dtypes("number")
X.head()

These are the six dimensions we will start from.

In [None]:
#Scale the data
diamonds_scaler = StandardScaler()
Xs = diamonds_scaler.fit_transform(X)

Now let's establish the baseline performance when including these original features. We'll fit a ridge regression, using cross-validation on the training set to pick the penalty, then record its score on the test set (R², labeled "accuracy" in the output) and the time taken to train the model. For convenience we'll search over just a few possible penalty values.
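
As a quick sketch of what `RidgeCV` reports, the snippet below fits it on synthetic data (from `make_regression`, not the diamonds data), shows the penalty it selected via the `alpha_` attribute, and checks that `score` matches `r2_score`. The alpha grid mirrors the one used in this notebook, written as floats.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score

# Synthetic data for illustration only
X_r, y_r = make_regression(
    n_samples=300, n_features=4, noise=5.0, random_state=1
)

candidate_alphas = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
ridge_demo = RidgeCV(alphas=candidate_alphas, cv=5).fit(X_r, y_r)

# The penalty chosen by cross-validation
print("Chosen alpha:", ridge_demo.alpha_)

# score() for a regressor is the coefficient of determination, R^2
score_value = ridge_demo.score(X_r, y_r)
r2_value = r2_score(y_r, ridge_demo.predict(X_r))
print(np.isclose(score_value, r2_value))
```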

In [None]:
# Split the sample
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=212)

In [None]:
import time

start=time.time()

# Create ridge model, with CV
ridge_cv = RidgeCV(
    # Which alpha values to test for?
    alphas=np.arange(2,14,2),
    # Number of folds
    cv=5)
# Fit model
ridge_cv.fit(X_train, y_train)

end=time.time()
baseline_time=round(end-start,3)
# Evaluate model on the test set (score is R^2 for a regression)
baseline_score=ridge_cv.score(X_test, y_test)
print("Baseline accuracy: ",baseline_score)
print("Baseline time: ",baseline_time," seconds")

Now we apply PCA to our data and train a model for each number of components between 1 and the number of features in our data. We collect the test-set scores and training times and then plot them.

In [None]:
#Initialize PCA model
pca = PCA()

# Fit PCA on the training data and transform
Xpca = pca.fit_transform(X_train)
# Apply the same PCA transform to the test data
Xtest_pca=pca.transform(X_test)

acc_scores = []

times = []

components_range = np.arange(1, X_train.shape[1] + 1)

for comp in components_range:
    #Slice the columns of the Xpca matrix using comp
    pca_features = Xpca[:, :comp]
    pcatest_features = Xtest_pca[:, :comp]
    
    #cross-validate
    start = time.time()
    ridge_cv = RidgeCV(
        # Which alpha values to test for?
        alphas=np.arange(2,14,2),
        # Number of folds
        cv=5)
    
    # Fit model
    ridge_cv.fit(pca_features, y_train)    
    end = time.time()
    elapsed = round(end - start,3)
    
    acc_scores.append(ridge_cv.score(pcatest_features, y_test))
    times.append(elapsed)

In [None]:
#Plotting
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize = (13, 6))

fig.tight_layout(pad = 3)
ax1.bar(components_range, acc_scores)
ax1.set_title("PCA Accuracy Scores", fontsize= 20)
ax1.set_xlabel("N Components", fontsize = 15)
ax1.set_ylabel("CV Accuracy", fontsize = 15)
ax1.hlines(y = baseline_score, xmin = components_range.min(), xmax = components_range.max(),
          colors = "black", linestyles = "dashed")
ax1.annotate(text = 'Baseline Accuracy', xy = (1, baseline_score*1.05),size = 14)
ax1.set_ylim(bottom=0, top = baseline_score*1.2)
ax1.grid(False)

ax2.set_title("PCA Times", fontsize= 20)
ax2.bar(components_range, times, color = "#1D8A99")
ax2.set_xlabel("N Components", fontsize = 15)
ax2.set_ylabel("Seconds Elapsed", fontsize = 15)
ax2.hlines(y = baseline_time, xmin = components_range.min(), xmax = components_range.max(),
          colors = "black", linestyles = "dashed")
ax2.annotate(text = 'Baseline Time Elapsed', xy = (1, baseline_time*1.05),size = 14)
ax2.set_ylim(bottom=0, top = baseline_time*1.2)
ax2.grid(False)

plt.show()

**Question:** What do these two charts tell us about the timing-vs-performance tradeoff when we train a model on PCA components? What might be the implications if we had a dataset with millions of observations and potentially hundreds of features?

## 5. t-SNE

![](https://tse2.mm.bing.net/th?id=OIP.OvotzpNWbWE8wZht7Pw3_QHaGF&w=690&c=7&pid=Api&p=0)

t-SNE (t-distributed stochastic neighbor embedding) is another dimensionality reduction algorithm, especially popular for visualizing high-dimensional data in a 2D or 3D space. Its advantage over PCA is that it is better suited to non-linear data.

t-SNE works by producing a joint probability distribution that measures the similarity between every pair of data points. The basis for this distribution is the Euclidean distance between each pair: the smaller the distance, the higher the probability that two points are similar.

Next, t-SNE initializes the output dimensions with random data and, through a process similar to gradient descent, continuously adjusts them so that their joint probability distribution is as similar as possible to that of the original data.
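
The first step, turning pairwise distances into a joint probability distribution, can be sketched in a few lines of NumPy. This is a simplification under a stated assumption: it uses one shared Gaussian bandwidth (`sigma = 1.0`), whereas t-SNE tunes a separate bandwidth for each point based on the perplexity. The six 3D points are random.

```python
import numpy as np

rng = np.random.default_rng(6)
pts = rng.normal(size=(6, 3))  # six made-up points in 3D

# Squared Euclidean distance between every pair of points
diff = pts[:, None, :] - pts[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# Gaussian similarity with a single shared bandwidth (simplifying assumption)
sigma = 1.0
affinity = np.exp(-sq_dist / (2 * sigma ** 2))
np.fill_diagonal(affinity, 0.0)  # a point is not its own neighbor

# Normalize so all pairwise similarities sum to 1
P = affinity / affinity.sum()

# The most probable pair is also the closest pair
i, j = np.unravel_index(np.argmax(P), P.shape)
print("Most similar pair:", (i, j), "squared distance:", sq_dist[i, j])
```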

### Coding t-SNE

The most important parameter in t-SNE is *perplexity*, which loosely sets the number of neighbors used in calculating the joint distributions. The following image shows four different instances of t-SNE applied to the same dataset but with varying values for its perplexity parameter.

![](https://tse2.mm.bing.net/th?id=OIP.C_e2LzgeM_TC7LcC15U_QQHaGj&w=690&c=7&pid=Api&p=0)

Another important parameter is the number of components. We'll stick with 2 to facilitate 2D visualizations.

For more on how t-SNE works and how its parameters affect the result, check out [this excellent tutorial](https://distill.pub/2016/misread-tsne/).
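
To experiment with perplexity yourself, a loop like the following can help. It is a sketch on synthetic clustered data from `make_blobs`, and the perplexity values 5, 30, and 50 are arbitrary choices. Perplexity must be smaller than the number of samples.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic data: 150 points in 5 dimensions, drawn around 3 centers
blobs, _ = make_blobs(n_samples=150, n_features=5, centers=3, random_state=0)

embeddings = {}
for perp in (5, 30, 50):
    model = TSNE(n_components=2, perplexity=perp, random_state=0)
    embeddings[perp] = model.fit_transform(blobs)
    # kl_divergence_ is the final value of the objective t-SNE minimizes
    print(perp, embeddings[perp].shape, round(float(model.kl_divergence_), 3))
```

Plotting each entry of `embeddings` side by side is a simple way to see how the layout changes with perplexity.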

Now let's apply t-SNE to the world happiness data.

In [None]:
happiness.head()

In [None]:
#Redefine X to drop rank, country, and continent
X = happiness.iloc[:, 2:-1]
# Scale the data
scale = StandardScaler()
Xs = scale.fit_transform(X)
Xs = pd.DataFrame(Xs, columns=X.columns)
Xs.head()

In [None]:
# We imported TSNE at the beginning of the notebook
tsne = TSNE(n_components=2, perplexity=40, random_state=212, learning_rate=1)
Xt = tsne.fit_transform(Xs)
Xt = pd.DataFrame(Xt, columns=["tsne1", "tsne2"])
Xt.head()

We'll now reproduce the earlier plot where we visualize the country dots color-encoded by continent.

In [None]:
#Add continent to the t-SNE dataset
Xt["continent"] = happiness["continent"].tolist()

plt.figure(figsize=(8,6))
for continent in Xt.continent.unique():
    data = Xt[Xt.continent == continent]
    plt.scatter(x=data["tsne1"], y = data["tsne2"], label = continent, s=80)
    plt.xlabel("TSNE 1")
    plt.ylabel("TSNE 2")
    plt.legend(fontsize = "large")
plt.show()

What is t-SNE doing? t-SNE looks at a point and asks, "Who are my 40 closest neighbors?" (we set `perplexity=40`). It then tries to arrange the points in 2D so that those neighbors stay close together.

Unlike PCA, which is a mathematical "rotation" of the data, t-SNE is a simulation. It starts with dots in random positions and "wiggles" them around until the 2D distances match the high-dimensional neighbors as closely as possible. Points that are close in high-dimensional space should also appear close in the 2D space.

In [None]:
#Compare to the PCA plot
pca = PCA(n_components=2)
#Fit and transform
Xp = pca.fit_transform(Xs)
Xp = pd.DataFrame(Xp, columns=["comp1", "comp2"])
Xp["continent"] = happiness["continent"].tolist()

plt.figure(figsize=(8, 6))
for continent in Xp.continent.unique():
    data_p = Xp[Xp.continent == continent]    
    plt.scatter(x=data_p["comp1"], y = data_p["comp2"], label = continent, s=80)
    plt.xlabel("PCA Component 1")
    plt.ylabel("PCA Component 2")
    plt.legend(fontsize = "large")
plt.show()

**Question:** How do the figures appear to relate to each other? Are they capturing similar dimensions?

t-SNE often produces results that look similar to PCA at a glance, but the philosophy behind how those dots got there is fundamentally different. While PCA tries to preserve the global structure (the big distances/variance), t-SNE is obsessed with local structure (nearest neighbors). 

How to interpret the results? With PCA, the axes actually mean something - the analysis is global. If a country moves from the left to the right, you can say its development/happiness is increasing. The distance between the "Europe" cluster and the "Africa" cluster is a real representation of their mathematical difference.

With t-SNE the analysis is local, so the axes (tsne1, tsne2) are essentially meaningless. You cannot say "moving right means more happiness." What matters is proximity. If two countries are in the same little cluster, they are very similar across all features. The size of clusters and the distance between clusters can be misleading. t-SNE tends to "expand" dense clusters and "shrink" sparse ones to make the map look nice.

If the t-SNE and PCA plots look almost identical, it usually means the data has a very strong, linear structure (like the wealth-happiness correlation) that both algorithms are picking up easily. t-SNE is non-linear and can "bend" and "twist" the data. If the happiness data had a curved relationship (e.g., happiness increases with wealth only up to a point, then plateaus), t-SNE would capture that difference better than PCA.
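
One way to quantify how well local neighborhoods are preserved is scikit-learn's `trustworthiness` score, which ranges from 0 to 1. The sketch below compares PCA and t-SNE on a deliberately curved synthetic dataset (`make_swiss_roll`), not on the happiness data. The choice of 5 neighbors and a perplexity of 30 is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

# A 3D dataset lying on a rolled-up 2D sheet
roll, _ = make_swiss_roll(n_samples=300, random_state=0)

# Two-dimensional embeddings from each method
pca_2d = PCA(n_components=2).fit_transform(roll)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(roll)

# Share of each point's embedded neighbors that were true neighbors originally
t_pca = trustworthiness(roll, pca_2d, n_neighbors=5)
t_tsne = trustworthiness(roll, tsne_2d, n_neighbors=5)
print(f"PCA trustworthiness:   {t_pca:.3f}")
print(f"t-SNE trustworthiness: {t_tsne:.3f}")
```

A higher score means that points that are neighbors in the 2D plot were also neighbors in the original space.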

### Interactive plots

These charts are nice, but it would be even better to hover the mouse over the dots and reveal more information about each country.

We can use the [plotly-express](https://plotly.com/python/plotly-express/) plotting package to create interactive visualizations.

Let's create an interactive 2D plot using plotly-express.

In [None]:
#import plotly-express
import plotly.express as px

#Add the country name into Xt
Xt["country"] = happiness["country"].tolist()

#initialize plotting function
fig = px.scatter(Xt, x="tsne1", y = "tsne2", color="continent", hover_data=["country"])
#generate plot
fig.show()

Hover your mouse over the dots to see which countries they represent.

Let's improve the plot by making the dots larger and showing the original data when you hover over them.

In [None]:
#Add size variable to the dataframe
Xt["size"] = .3

#Add happiness score and gdp to Xt
Xt["happiness"] = happiness.happiness_score.tolist()
Xt["gdp"] = happiness.gdp.tolist()

#This dictionary allows us to turn on and turn off our chosen variables
hover_data={"country":True, 
            "continent":True, 
            "happiness":True,
            "gdp":True,
            "tsne1":False,
            "tsne2":False,
            "size":False}

#initialize plotting function
fig = px.scatter(Xt, x="tsne1", y = "tsne2", color="continent", 
                 hover_data=hover_data, size="size", opacity=.6)
#generate plot
fig.show()

How does it look now?