# 10 Dimension reduction

Part of ["Introduction to Data Science" course](https://github.com/kupav/data-sc-intro) by Pavel Kuptsov, [kupav@mail.ru](mailto:kupav@mail.ru)

Recommended reading for this section:

1. Grus, J. (2019). Data Science from Scratch: First Principles with Python (2nd ed.). Sebastopol, CA: O’Reilly Media.
1. Müller, A. and Guido, S. (2017). Introduction to Machine Learning with Python. O’Reilly Media.
1. Raj, J. T. A beginner’s guide to dimensionality reduction in Machine Learning. https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

The following Python modules will be required. Make sure that you have them installed.
- `matplotlib`
- `requests`
- `numpy`
- `sklearn` (installed as the `scikit-learn` package)

## Lesson 1

### Principal components

Before we start, we define a function that downloads a CSV file from the repository.

In [None]:
import csv
import numpy as np
import requests

def load_csv_dataset(file_name, dtype=float):
    """Downloads csv numeric dataset from repo to numpy array."""
    base_url = "https://raw.githubusercontent.com/kupav/data-sc-intro/main/data/"
    web_data = requests.get(base_url + file_name)
    assert web_data.status_code == 200
    
    reader = csv.reader(web_data.text.splitlines(), delimiter=',')
    data = []
    for row in reader:
        try:
            # Try to parse as a row of floats
            float_row = [dtype(x) for x in row]
            data.append(float_row)
        except ValueError:
            # If parsing as floats failed - this is header
            print(row)
            
    return np.array(data)

Sometimes multidimensional data contain redundant information. 

Consider a two dimensional dataset.

In [None]:
data = load_csv_dataset('pca1.csv')
print(data.shape)

This file contains two columns. Let us first plot their separate histograms.

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

axs[0].hist(data[:, 0], bins=50)
axs[1].hist(data[:, 1], bins=50);

We see random normally distributed data.

But their scatter plot looks like this:

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1])
ax.grid()

It means that the two columns depend on each other. It is easy to see that they obey the equation
$$
y = -2x
$$
We can check it:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1])

vx = np.linspace(-4, 4, 100)
vy = -2 * vx
ax.plot(vx, vy, color='red')
ax.grid()

Given a dataset like this we do not need to analyze both of its columns. 

Only one contains the essential information.

More often such a strong dependence is absent. 

But we can still notice that the data columns are not totally independent:

In [None]:
import matplotlib.pyplot as plt

data = load_csv_dataset('pca2.csv')

fig, ax = plt.subplots()
ax.scatter(data[:, 0], data[:, 1])
ax.grid()

We observe a cloud of points highly stretched along a certain direction. 

Clearly, variations along this direction are more essential than those perpendicular to it.

Probably the true underlying process that generated these data had a strong dependence between 
columns
$$
y=kx
$$
and deviations from it appeared due to noise.

The direction of the strongest variation is called the principal component. 

The idea behind finding it is as follows.

![pca_idea.svg](fig/pca_idea.svg)

We imagine a scatter plot of the data and draw a line through the cloud of points. Then we compute the distances $d_i$ between the line and the data points as shown in the figure. 

We need to rotate the line to make the sum of the squared distances $d_i^2$ as small as possible.

This line, or a unit vector along it, is called the first principal component. 

The vector perpendicular to the first principal component is called the second principal component. 
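
The rotation procedure can be tried directly. Below is a minimal sketch on made-up data: the point cloud, the random seed, and the helper `sum_sq_dist` are assumptions for illustration and are not part of the course datasets. It scans many orientations of a line through the origin, keeps the one with the smallest sum of squared distances, and compares it with the direction obtained from the covariance matrix, the approach described later in this lesson.

```python
import numpy as np

# Synthetic 2D cloud stretched along y = 0.5 x (hypothetical data)
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + 0.2 * rng.normal(size=500)
pts = np.column_stack([x, y])
pts -= pts.mean(axis=0)  # center the cloud at the origin

def sum_sq_dist(angle):
    """Sum of squared perpendicular distances from the points to a line
    through the origin that makes the given angle with the x axis."""
    normal = np.array([-np.sin(angle), np.cos(angle)])  # unit vector perpendicular to the line
    return np.sum((pts @ normal) ** 2)

# Try many orientations and keep the best one
angles = np.linspace(0, np.pi, 1801)
best = angles[np.argmin([sum_sq_dist(a) for a in angles])]
line_dir = np.array([np.cos(best), np.sin(best)])

# Eigenvector of the covariance matrix with the largest eigenvalue
lam, vec = np.linalg.eigh(np.cov(pts, rowvar=False))
pc1 = vec[:, -1]

# Close to 1: the two directions agree up to sign and grid resolution
print(abs(line_dir @ pc1))
```

A brute-force scan like this works only in two dimensions. It is shown here just to connect the geometric picture with the computation that follows.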

In the examples above we considered two dimensional data so that the cloud of points could 
be stretched along a single direction.
 
If data are multidimensional, i.e., there are more than two columns, the number of principal 
components equals the number of columns.

The first one is the direction along which the cloud is the most stretched.

The second one is the most stretched direction among all those perpendicular to the first one.

The third is the most stretched direction that is perpendicular to the first two. And so on.

Data analysis via the principal components is called PCA, Principal Component Analysis.

### Computation of the principal components

Mathematically, the computation of principal components involves the following steps.

First we must compute mean values along each data column and subtract them so that 
the data cloud becomes centered near the origin.
$$
\tilde x_i = x_i - \mu
$$

This step is absolutely essential. Omitting it leads to incorrect results. 

It is also often recommended to rescale the data columns by dividing them by their standard deviations. Together with shifting the means to the origin, this is called data standardization.
$$
z_i = \frac{x_i - \mu}{\sigma}
$$

As we have already discussed, preliminary standardization is very important when different 
columns have different units.

If the units are the same (e.g., all columns are in meters) it may be reasonable to leave the data unscaled. 

However, the shift to the origin must be done in any case.

The next step is computing variances and covariances.

Let us recall that the variance is the mean squared deviation from the mean value:
$$
\overline x = \frac{1}{N}\sum_{i=1}^N x_i.
$$
$$
\text{Var}=\frac{1}{N-1}\sum_{i=1}^N (x_i - \overline x)^2.
$$

Since we have shifted the data to the origin the mean is zero, so that
$$
\text{Var}=\frac{1}{N-1}\sum_{i=1}^N z_i^2.
$$

Let us denote our multidimensional standardized data as $z_{i,j}$. 

Here $i$ is the row number. Usually we have many rows. The number of rows is the size of the dataset. 

Index $j$ is the column number. The number of columns is the dimension of the dataset. Given two columns we have a two dimensional dataset.

Covariance is computed like this (provided that the data have zero mean values along $i$):
$$
\text{Cov}(j,k)=\frac{1}{N-1}\sum_{i=1}^N z_{i,j} z_{i,k}.
$$
Here the summation runs along rows for two columns $j$ and $k$.

If $j=k$ we simply obtain the variance of column $j$. Covariance is also symmetric: $\text{Cov}(j,k)=\text{Cov}(k,j)$.
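
As a quick check of the covariance formula, the sketch below evaluates it with plain matrix operations and compares the result with `np.cov`. The random dataset and the variable names are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.normal(size=(200, 3))   # hypothetical data: 200 rows, 3 columns
z = raw - raw.mean(axis=0)        # shift every column to zero mean
n_rows = z.shape[0]

# Cov(j, k) = 1/(N-1) * sum_i z[i, j] * z[i, k], computed for all pairs at once
cov_manual = z.T @ z / (n_rows - 1)

# np.cov centers the data itself; rowvar=False means that columns are variables
cov_numpy = np.cov(raw, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))      # both computations agree
print(np.allclose(cov_manual, cov_manual.T))   # Cov(j, k) = Cov(k, j)
```

The diagonal of `cov_manual` holds the variances of the columns, which corresponds to the case $j=k$.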

The matrix collecting covariances for all pairs of the columns is called covariance matrix:
$$
C = 
\begin{pmatrix}
\text{Cov}(0,0)  & \text{Cov}(0,1) & \text{Cov}(0,2) & \ldots \\
\text{Cov}(1,0)  & \text{Cov}(1,1) & \text{Cov}(1,2) & \ldots \\
\ldots & \ldots & \ldots & \ldots 
\end{pmatrix}
$$


The next step is to compute eigenvalues and eigenvectors of the covariance matrix $C$.

Let us remember: we can multiply a matrix by a vector to obtain a new vector:
$$
A v_1 = v_2
$$
In the general case $v_1$ and $v_2$ are different. They have different lengths and directions.

A square matrix $N\times N$ typically has $N$ special vectors (a symmetric matrix always does) such that 
$$
A u = \lambda u
$$
Here $\lambda$ is a scalar (i.e., just a number). It means that when we multiply the matrix $A$ by the vector $u$ 
we obtain a vector that points along the same line as $u$ but is stretched or shrunk by the factor $\lambda$.

Scalars $\lambda$ are called eigenvalues of the matrix $A$ and $u$ are its eigenvectors.
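
The defining relation $A u = \lambda u$ is easy to verify numerically. The small symmetric matrix below is an arbitrary example chosen for illustration; `np.linalg.eigh` is the `numpy` routine for symmetric matrices and returns the eigenvalues in ascending order.

```python
import numpy as np

# An arbitrary symmetric 2x2 matrix (illustrative example)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

lam, vec = np.linalg.eigh(A)   # columns of vec are the eigenvectors

for k in range(len(lam)):
    u = vec[:, k]
    # Multiplying by A only rescales u by the factor lam[k]
    print(lam[k], np.allclose(A @ u, lam[k] * u))

# Eigenvectors of a symmetric matrix are perpendicular to each other
print(np.isclose(vec[:, 0] @ vec[:, 1], 0.0))
```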

Eigenvectors of the covariance matrix $C$ are the principal components of our dataset, and the corresponding eigenvalues $\lambda$ are the variances of the data along these components.

For the covariance matrix $C$ the eigenvalues are always real non-negative numbers (real because $C$ is symmetric, non-negative because variances cannot be negative).

The first principal component corresponds to the largest eigenvalue. The second one corresponds to the second largest $\lambda$ and so on.

The eigenvalues indicate how the cloud of points is stretched along the corresponding principal component. 

For example if the cloud almost exactly fits a line the first $\lambda_1$ is the largest and all others are close to zero.

If the cloud of multidimensional data is spread along a plane two first eigenvalues $\lambda_1$ and $\lambda_2$ will be large, while all others small. And so on.

Here are the functions that implement the steps above. We compute covariances and eigenvalues using functions from `numpy`:

In [None]:
def standardize(data):
    """Standardize data: zero mean and unit standard deviation in each column"""
    return (data - np.mean(data, axis=0)) / np.std(data, axis=0)    

def prin_comp(data):
    """Computes principal components for multidimensional data.
    Returns eigenvalues lam and eigenvectors vec of the covariance matrix.
    Columns of the matrix vec are the principal components. The corresponding
    lam indicate their importance. Lambdas are always returned in 
    ascending order.
    """
    # Covariance matrix
    cov = np.cov(data, rowvar=False)
    # Eigenvalues and eigenvectors of a symmetric matrix
    lam, vec = np.linalg.eigh(cov)
    return lam, vec

We load the dataset and standardize it right away.

In [None]:
data = standardize(load_csv_dataset("pca3.csv"))
print(data.shape)

Before computing the principal components let us visualize the data using histograms and pairwise scatter 
plots.

In [None]:
import matplotlib.pyplot as plt

N = data.shape[1]
fig, axs = plt.subplots(nrows=N, ncols=N, figsize=(10, 10))
for i in range(N):
    for j in range(N):
        if i == j:
            axs[i, i].hist(data[:, i], bins=300, color='C1')
        else:
            axs[i, j].scatter(data[:, i], data[:, j], s=1)

# Required to avoid overlapping of the subplots
fig.tight_layout()

Visual inspection reveals that columns 1, 2, and 4 are correlated. (Observe the stretched clouds in panels 1-2 and 1-4.)

Columns 3 and 5 are also correlated. (Observe the stretched cloud in panel 3-5.)

These two sets of columns are not correlated with each other at all. (Observe the circular clouds, e.g., in panels 1-3 and 1-5.)

These two groups of correlated columns indicate that in the full five-dimensional space there are two 
main independent directions, i.e., the cloud of points is highly stretched along a plane while variations along 
the three other dimensions are small.

Now we compute the principal components.

In [None]:
with np.printoptions(precision=2):
    pc_lam, pc_vec = prin_comp(data)
    print("pc_lam=\n", pc_lam)
    print("pc_vec=\n", pc_vec)

### Essential and non-essential principal components

We indeed observe that two principal components are the most essential. 

It means that our five-dimensional dataset is essentially two dimensional and 
three dimensions can be dropped.

This is called dimension reduction. 

Since PCA can detect only linear dependencies, this is also called linear dimension reduction.

All criteria for choosing the essential and non essential principal components are based on values of the eigenvalues 
$\lambda_i$.

Sometimes it is obvious, like in our example, which components can be removed.

But if the $\lambda_i$ are not so different, various approaches are used. All of them are heuristic, i.e., based on intuition. 

- Keep components whose eigenvalues are greater than 1.
- Plot $\lambda_i$ against $i$ and see if the points can be visually 
  separated into two clusters of large and small values.
- Compute the explained variances: $\tilde\lambda_i = \lambda_i / \sum_{k=1}^N \lambda_k$ (each eigenvalue is divided by the sum of all of them; here $N$ is the number of eigenvalues). Then keep the largest components that together explain 95% of the variance.
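
The first and the third criteria can be sketched in a few lines. The eigenvalues below are made up for illustration, and the helper `count_by_explained_variance` is a hypothetical name introduced only for this example.

```python
import numpy as np

def count_by_explained_variance(lam, threshold=0.95):
    """Smallest number of leading components whose explained variances
    add up to at least `threshold` (hypothetical helper)."""
    lam_desc = np.sort(lam)[::-1]                 # largest eigenvalue first
    cum = np.cumsum(lam_desc / lam_desc.sum())    # cumulative explained variance
    n = int(np.searchsorted(cum, threshold)) + 1  # first position reaching the threshold
    return min(n, len(lam_desc))

# Made-up eigenvalues in ascending order, as np.linalg.eigh returns them
lam_example = np.array([0.1, 0.4, 2.0, 2.5])

n_kaiser = int(np.sum(lam_example > 1))           # criterion "eigenvalue greater than 1"
n_95 = count_by_explained_variance(lam_example)   # criterion "95% of variance"
print(n_kaiser, n_95)  # prints: 2 3
```

For these made-up values the two criteria disagree, which illustrates why they are heuristics and the final choice needs judgment.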

Let us apply these approaches to our data. 

The two most essential components are indeed greater than 1 while the others are less.

The visualization confirms that we have to keep only two components: 

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
nums = list(range(len(pc_lam)))
ax.scatter(nums, pc_lam);

The explained variances are:

In [None]:
# Each eigenvalue is divided by their sum and np.flip reverses the order
expl_var = np.flip(pc_lam / np.sum(pc_lam))
print(expl_var)

In [None]:
# Now we apply the cumulative sum to see what components explain 95% of variance
np.cumsum(expl_var)

We see that the criterion of 95% explained variance is not fulfilled exactly by two components. 

Reaching 95% of the explained variance requires three components. 

But since the first two together give 94% while the next one brings only 1%, it again looks reasonable 
to keep only two components.

### Linear dimension reduction using principal components

Once we have the matrix whose columns are the principal components and have decided which of them to keep, we compose 
a matrix of these components.

Let us denote the result as $W$. This matrix is also called a projection matrix because we will project 
the original data onto its columns. 

The columns of this matrix are the essential principal components, i.e., the eigenvectors of
the covariance matrix that correspond to the essential eigenvalues. 

The number of essential components will be the reduced dimension of our dataset. 

In our example it will be 2, since we have decided to keep only two components. 

To perform the reduction we take the rows of the initial dataset (of course, the one obtained after standardization)
and multiply them by the matrix $W$: 
$$
Z_{\text{red}} = Z W
$$
Here $Z$ is the standardized dataset whose rows are multiplied one by one by $W$.

The result is $Z_{\text{red}}$, the reduced dataset.
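
The projection can be illustrated on a small synthetic dataset before applying it to the course data. In the sketch below the data, the mixing matrix `mix`, and the noise level are assumptions chosen for illustration: three columns are generated from two hidden variables, so one dimension is redundant.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 3 columns driven by 2 hidden variables plus small noise
hidden = rng.normal(size=(500, 2))
mix = np.array([[1.0, 0.5, -1.0],
                [0.0, 1.0, 1.0]])
raw = hidden @ mix + 0.05 * rng.normal(size=(500, 3))

# Standardize, then find the principal components
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
lam, vec = np.linalg.eigh(np.cov(Z, rowvar=False))

W = vec[:, -2:]   # the two eigenvectors with the largest eigenvalues
Z_red = Z @ W     # every row of Z is projected onto the columns of W

print(Z_red.shape)                   # (500, 2)
print(lam)                           # the smallest eigenvalue is close to zero
print(np.cov(Z_red, rowvar=False))   # diagonal: reduced columns are uncorrelated
```

The covariance matrix of the reduced data is diagonal and its diagonal entries are the two kept eigenvalues, which is one way to check that the projection was done with the right columns.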

In [None]:
print(pc_vec)
print()
proj_w = pc_vec[:,-2:]
print(proj_w)

The reduced dataset is:

In [None]:
red_data = data @ proj_w
print(red_data.shape)

Let us see its scatter plot:

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(red_data[:, 0], red_data[:, 1]);

The circular cloud indicates that there are no more dependencies in our reduced data.

It means that each column contains essential information.
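
The same reduction is also available in `sklearn` as the class `PCA`. The sketch below is a cross-check on made-up data (the dataset and the seed are assumptions for illustration): the explained variances reported by `PCA` should coincide with the largest eigenvalues of the covariance matrix, and its components should match the eigenvectors up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Hypothetical correlated data with 4 columns
raw = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Manual computation via the covariance matrix
lam, vec = np.linalg.eigh(np.cov(Z, rowvar=False))

# The same with sklearn, keeping two components
pca = PCA(n_components=2)
Z_red = pca.fit_transform(Z)

print(pca.explained_variance_)   # variances along the kept components
print(lam[::-1][:2])             # two largest eigenvalues, largest first
```

Note that `PCA` orders the components from the largest variance to the smallest, while `np.linalg.eigh` returns the eigenvalues in ascending order.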

### Non-Negative Matrix Factorization

Dimension reduction is required to extract essential features from the dataset. 

The problem with PCA is that the extracted features are hard to interpret. 

We cannot say what exactly the features extracted by PCA mean or what they tell us about the original data.

Another approach widely used for feature extraction is Non-Negative Matrix Factorization (NMF). 

NMF can extract easily interpretable features.

For example, in the case of facial images, these can be features such as eyes, noses, moustaches, and lips. 

The important requirement is that the processed data must be non-negative.

Assume there is a data matrix $X$ whose entries are non-negative. 

Its columns are features, and the number of columns is $N$.

NMF represents this matrix as a product of two non-negative matrices
$$
X \approx W H
$$

The reduced data are in $W$. Its columns are new features. Their number is $R\leq N$ and the number of rows in $W$ is the same as in $X$. 

The matrix $H$ gives weights of the reduced features in the original features. The most essential reduced features have the largest weights.

Unlike PCA, this decomposition is approximate and not unique.

The algorithm of NMF is rather complicated and we will use its implementation from the library `sklearn`.
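
To get a feeling for what such an algorithm does, here is a minimal sketch of one classic approach, the multiplicative update rules of Lee and Seung. This is an illustration under stated assumptions and not the solver that `sklearn` uses by default; the function name `nmf_mu`, the number of iterations, and the test matrix are chosen for this example only.

```python
import numpy as np

def nmf_mu(X, rank, n_iter=2000, seed=0, eps=1e-9):
    """Approximate a non-negative matrix X as W @ H using multiplicative
    updates (illustrative sketch, not the sklearn implementation)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(X.shape[0], rank))   # random non-negative start
    H = rng.uniform(size=(rank, X.shape[1]))
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay non-negative
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical test matrix: non-negative and exactly of rank 2
rng = np.random.default_rng(4)
X = rng.uniform(size=(50, 2)) @ rng.uniform(size=(2, 8))

W, H = nmf_mu(X, rank=2)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(W.shape, H.shape, rel_err)
```

Here `W` has as many rows as `X` and `rank` columns, and `H` has `rank` rows and as many columns as `X`, in line with the shapes described above. A different `seed` generally gives different `W` and `H`, which reflects that the decomposition is not unique.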

Consider how NMF works: we take two signals and mix them together and add noise. Then apply NMF to extract the original signals.

First we create the signals:

In [None]:
import csv
import numpy as np
rng = np.random.default_rng()

# Dataset size
size = 1000

# Two original signals
tt = np.linspace(0, 9*np.pi, size)
s0 = 2 - np.fmod(-tt, np.pi) + np.fmod(2*tt, 3*np.pi)
s1 = np.abs(np.cos(tt*2)) + np.sin(0.5*tt)**2

s0 /= s0.max()
s1 /= s1.max()

import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(10, 4), sharex=True)

axs[0].plot(tt, s0)
axs[1].plot(tt, s1);

Now we create the dataset: we mix the two signals with random weights and add noise. This is repeated $N$ times, once for each column.

In [None]:
# Number of features in the dataset that will be processed
N = 100

# Empty array
data = np.zeros((size, N))

# One weight per column for each signal, in the range [1, 4)
wts0 = 1 + 3 * rng.uniform(size=(N,))
wts1 = 1 + 3 * rng.uniform(size=(N,))

# Mix signals
for i in range(N):
    data[:, i] = wts0[i] * s0 + wts1[i] * s1 + 2 * rng.uniform(size=(size,))

Let us see what we have.

In [None]:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(nrows=5, ncols=1, figsize=(10, 10), sharex=True)
for i in range(5):
    axs[i].plot(tt, data[:, i])

Now we apply NMF. 

We import the class `NMF` from `sklearn` and specify that we want to get 2 components. 

In [None]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=2, init='random', random_state=0, max_iter=10000)
W = nmf.fit_transform(data)

This is the plot of the extracted features. 

Note that we rescale each of them by its maximum.

In [None]:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2, 1, figsize=(10, 4), sharex=True)
axs[0].plot(tt, W[:, 0] / np.max(W[:, 0]))
axs[0].plot(tt, s1)  # swap s1 and s0 if the order of the recovered signals is reversed
axs[1].plot(tt, W[:, 1] / np.max(W[:, 1]))
axs[1].plot(tt, s0); # swap s0 and s1 if the order of the recovered signals is reversed

Sometimes the order of the recovered signals is reversed. It can be adjusted manually.

We created the dataset of 100 columns where only two features were essential. 

NMF has successfully extracted them.

### Nonlinear dimension reduction

PCA and NMF can reveal only linear dependencies between data columns, i.e.,
$$
y = k x
$$

If columns depend on each other nonlinearly, for example like this
$$
y = x^2
$$
these methods fail to extract a lower dimensional set of features.
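
This failure is easy to see for PCA. In the sketch below (made-up data; the helper `pca_eigenvalues` is a hypothetical name for this example) two datasets are built from the same random $x$: one with the linear dependence $y=-2x$ and one with the nonlinear dependence $y=x^2$. In both cases the second column is fully determined by the first one, but only in the linear case does an eigenvalue drop to zero.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=2000)

lin = np.column_stack([x, -2 * x])    # linear dependence y = -2x
quad = np.column_stack([x, x ** 2])   # nonlinear dependence y = x^2

def pca_eigenvalues(data):
    """Eigenvalues (ascending) of the covariance matrix of standardized data."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    return np.linalg.eigvalsh(np.cov(z, rowvar=False))

lam_lin = pca_eigenvalues(lin)
lam_quad = pca_eigenvalues(quad)

print(lam_lin)    # one eigenvalue is essentially zero: one dimension is redundant
print(lam_quad)   # both eigenvalues are close to 1: PCA sees no redundancy
```

For $x$ distributed symmetrically around zero the covariance between $x$ and $x^2$ vanishes, so the covariance matrix carries no trace of the dependence.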

To extract nonlinear dependencies one can use various special methods.

The algorithms are rather complicated and we will use their implementations from the `sklearn` library.

To test these methods we create a three dimensional dataset whose points lie on a smooth curved surface. 

Such a surface is called a manifold.

This dataset will be processed using different methods that try to find lower dimensional manifolds in high dimensional data.

Descriptions of the methods are taken from the article "A beginner’s guide to dimensionality reduction in Machine Learning" by 
Judy T Raj https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

First we define several useful functions.

In [None]:
def standardize(data):
    """Standardize data: zero mean and unit standard deviation in each column"""
    return (data - np.mean(data, axis=0)) / np.std(data, axis=0)    

def cos_sin(deg):
    """Given an angle in degrees computes its cos and sin"""
    rad = deg * np.pi / 180
    return np.cos(rad), np.sin(rad)

def rotate(v, ax, ay, az):
    """Rotate a vector v around the axes x, y, and z by the angles ax, ay, az given in degrees"""
    cx, sx = cos_sin(ax)
    cy, sy = cos_sin(ay)
    cz, sz = cos_sin(az)
    Mx = np.array([[1,0,0], [0, cx, -sx], [0, sx, cx]])
    My = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Mz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Mz @ (My @ (Mx @ v))

def colorize(row):
    """Given a data row takes its first and second elements and assigns a color code"""
    x, y = row[:2]
    if x > 0 and y > 0:
        return 0
    if x > 0 and y < 0:
        return 1
    if x < 0 and y > 0:
        return 2
    if x < 0 and y < 0:
        return 3

Now we create a three dimensional dataset.

Its points lie on the surface  
$$
z=x^3 - y^2
$$
We add noise to it and rotate it to complicate the task.

After that we have to standardize the data. 

Manifold learning methods are based on a nearest-neighbor search. 

It means that values from different columns are combined when distances between points are computed. 

For this to work properly the data must have the same units and, moreover, the same scales.
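
A tiny example shows why units matter for a nearest-neighbor search. The three points and the helper `nearest_neighbor` below are assumptions made for illustration: one column is recorded in meters and the other in millimeters.

```python
import numpy as np

def nearest_neighbor(points, idx):
    """Index of the point closest to points[idx] in Euclidean distance."""
    d = np.linalg.norm(points - points[idx], axis=1)
    d[idx] = np.inf   # exclude the point itself
    return int(np.argmin(d))

# Hypothetical measurements: column 0 in meters, column 1 in millimeters
pts = np.array([[0.0,   0.0],
                [1.0,   0.0],     # 1 m away along column 0
                [0.0, 500.0]])    # 0.5 m away along column 1, recorded in mm

nn_raw = nearest_neighbor(pts, 0)     # mixed units: point 1 looks closest

pts_m = pts * np.array([1.0, 1e-3])   # convert column 1 to meters
nn_m = nearest_neighbor(pts_m, 0)     # same units: point 2 is closest

print(nn_raw, nn_m)  # prints: 1 2
```

The answer to "which point is the nearest" changes with the units, so the neighborhoods used by these methods are meaningful only after the columns are brought to comparable scales.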

In [None]:
import csv
import numpy as np
rng = np.random.default_rng()

# Size of the dataset
size = 1000

# There will be three columns
N = 3

# Empty storage
data = np.zeros((size, N))

# Uniform random numbers between -1 and 1
data[:, 0] = 2*rng.uniform(size=(size,))-1
data[:, 1] = 2*rng.uniform(size=(size,))-1

# Function x^3 - y^2 plus noise
data[:, 2] = data[:, 0]**3 - data[:,1]**2 + 0.25 * (2*rng.uniform(size=(size,))-1)

# Different colors in different quadrants of the (x, y) plane
clrs = [colorize(row) for row in data]

# Angles of rotation around axes
ax, ay, az = 10, 15, 34

# Perform rotation
for i in range(size):
    data[i] = rotate(data[i], ax, ay, az)

data = standardize(data)

In [None]:
# Uncomment the next line to plot figure in a separate interactive window
#%matplotlib qt 
import matplotlib.pyplot as plt
fig, ax = plt.subplots(subplot_kw={"projection": "3d"}, figsize=(8,8))
im = ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=clrs)

Now we apply different nonlinear methods to represent our three dimensional data on the plane.

"t-distributed Stochastic Neighbor Embedding (t-SNE): Computes the probability that pairs of data points in the high-dimensional space are related and then chooses a low-dimensional embedding which produce a similar distribution." https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

This method is probabilistic, so different runs produce different results. 

To reproduce a certain plot we can specify the parameter `random_state` that seeds the random number generator.

In [None]:
%matplotlib inline 
from sklearn.manifold import TSNE
proj = TSNE()  # add here random_state=0 to keep the plot unchanged 
W = proj.fit_transform(data)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1], c=clrs);

"Multi-dimensional scaling (MDS): A technique used for analyzing similarity or dissimilarity of data as distances in a geometric spaces. Projects data to a lower dimension such that data points that are close to each other (in terms if Euclidean distance) in the higher dimension are close in the lower dimension as well." https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

In [None]:
from sklearn.manifold import MDS
proj = MDS()
W = proj.fit_transform(data)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1], c=clrs);

"Isometric Feature Mapping (Isomap): Projects data to a lower dimension while preserving the geodesic distance (rather than Euclidean distance as in MDS). Geodesic distance is the shortest distance between two points on a curve." https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

In [None]:
from sklearn.manifold import Isomap
proj = Isomap()
W = proj.fit_transform(data)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1], c=clrs);

"Locally Linear Embedding (LLE): Recovers global non-linear structure from linear fits. Each local patch of the manifold can be written as a linear, weighted sum of its neighbours given enough data." https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

In [None]:
from sklearn.manifold import LocallyLinearEmbedding
proj = LocallyLinearEmbedding()
W = proj.fit_transform(data)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1], c=clrs);

"Hessian Eigenmapping (HLLE): Projects data to a lower dimension while preserving the local neighbourhood like LLE but uses the Hessian operator to better achieve this result and hence the name." https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

In [None]:
from sklearn.manifold import LocallyLinearEmbedding
proj = LocallyLinearEmbedding(method='hessian', n_neighbors=11, eigen_solver='dense')
W = proj.fit_transform(data)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1], c=clrs);

Modified LLE is another improved version of Locally Linear Embedding.

In [None]:
from sklearn.manifold import LocallyLinearEmbedding
proj = LocallyLinearEmbedding(method='modified')
W = proj.fit_transform(data)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1], c=clrs);

"Spectral Embedding (Laplacian Eigenmaps): Uses spectral techniques to perform dimensionality reduction by mapping nearby inputs to nearby outputs. It preserves locality rather than local linearity" https://towardsdatascience.com/dimensionality-reduction-for-machine-learning-80a46c2ebb7e

In [None]:
from sklearn.manifold import SpectralEmbedding
proj = SpectralEmbedding()
W = proj.fit_transform(data)

import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(W[:, 0], W[:, 1], c=clrs);