# Dimensionality reduction
In the **Clustering** section, we touched on basic approaches for clustering. You might have noticed that the examples we used were all in **2-dimensional space**, i.e., the samples were described by only two variables. In reality, samples are often described by tens of thousands of variables; in other words, they lie in a high-dimensional space. Can we still use the same good old clustering approaches?

The short answer is yes. Clustering can still give us ideas about the data. But the distance estimates between samples are not as useful as they are in 2D space, especially for the commonly used Euclidean or Manhattan distance metrics (see [Curse of dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality) for detailed explanations). In addition, high-dimensional data might carry redundant variables and require more computing time. These would be less of an issue if we could reduce the dimensions first. Two major types of approaches for dimensionality reduction are **feature selection** and **feature extraction**.
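
To make the distance problem concrete, here is a minimal sketch on uniformly random points (a hypothetical toy setup, not our RNA-seq data). It measures how different the farthest and the nearest neighbor of one point are, relative to the nearest distance. The helper name `relative_contrast` and the chosen dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=200):
    """(max - min) / min of the Euclidean distances from one point to all others."""
    points = rng.random((n_points, n_dims))
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

# As the number of dimensions grows, nearest and farthest neighbors
# end up at almost the same distance
contrast_by_dim = {d: relative_contrast(d) for d in (2, 100, 10000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>6} dimensions: relative contrast = {c:.2f}")
```

The contrast shrinks as dimensions are added, which is why "near" and "far" become hard to tell apart for distance-based clustering.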

<figure>
  <img src="https://qph.fs.quoracdn.net/main-qimg-80196c6c2359bc032615dc9435377687" width="500"  alt="comparing feature selection and extraction"/>
  <figcaption>
    Comparing feature selection and feature extraction (taking PCA as an example). From an answer in the <a href="https://www.quora.com/Should-I-apply-PCA-before-or-after-feature-selection">Quora forum</a>
    </figcaption>
</figure>

## Data loading

To see how dimensionality reduction works, we will play with a single cell RNA-seq data set generated from brain tissue of E18 mouse by 10x Genomics ([raw data source](https://support.10xgenomics.com/single-cell-gene-expression/datasets/2.1.0/neurons_900)). It is used here only to exemplify the dimensionality reduction approaches. If you are actually working with single cell data, using an all-inclusive package like [Scanpy](https://scanpy.readthedocs.io/en/stable/) in Python, or [Seurat](https://satijalab.org/seurat/) or [Monocle](http://cole-trapnell-lab.github.io/monocle-release/) in R, would be easier.

For simplicity, I have processed the original long-format matrix into a wide-format matrix ([difference between wide and long data](https://sejdemyr.github.io/r-tutorials/basics/wide-and-long/)). Let's run the code block below and get this data first.

In [0]:
# shell commands can be used in python notebooks by 
# prefixing the command with "!". Other tricks for this type can be found here
# https://jakevdp.github.io/PythonDataScienceHandbook/01.05-ipython-and-shell-commands.html
!wget https://github.com/k326xh/SingleCellRNA-seq/raw/master/toy_data/scRNA_widemtx.csv
# use shell command ls and see what we have in the file system right now
!ls

In [0]:
# scRNA_widemtx.csv is what we want. let's read this file in
import pandas as pd
data = pd.read_csv("scRNA_widemtx.csv", compression='gzip', index_col = 0) # my input data was zipped, so included compression options in this. Not required when raw data is not zipped.
print(data.shape)
data.head()

This data set contains 931 cells that are described by 16152 genes (931 samples by 16152 variables). Now, let's see which tools we can use to reduce the number of genes, i.e. dimensions, to a computationally affordable level. All the tools covered below are described in [The Ultimate Guide to 12 Dimensionality Reduction Techniques](https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/).

# Feature selection

## 1. Missing value ratio
A variable with too many missing values does not convey enough information. We can choose to drop variables whose percentage of missing values is higher than a threshold we set. For our data, this means dropping genes that are not expressed in enough cells.

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [0]:
# checking the percentage of zeroes in each variable
ratios = (data == 0).sum()/len(data)*100
# Note: to check for missing values (nulls) instead of zeroes, use the code below
# ratios = data.isnull().sum()/len(data)*100

# Plot zero/missing value distribution
f, axes = plt.subplots(1, 2, figsize=(12,5))
sns.distplot(ratios, bins=20, ax=axes[0]).set_title('Missing value ratio distribution')
sns.distplot(ratios, bins=200, ax=axes[1]).set_title('Zoomed in to 98% to 100%')
plt.xlim(98, 100)
plt.show()

Most of the variables here suffer from a high missing value ratio, which is in the nature of single cell RNA data. Based on the zoomed-in view on the right, we can simply remove the variables with a ratio over 99.5%.

In [0]:
# "ratios <= 99.5" gives a boolean vector that is used to extract
# gene names from data.columns, which are in turn used to extract
# columns from data
data_1 = data[data.columns[ratios <= 99.5]] # setting the threshold as 99.5%

print("Total number of feature included has dropped from %d to %d" % (len(data.columns), len(data_1.columns)))

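Why do the y-values of the density plots above not seem to add up to one? Because `seaborn distplot` normalizes the **area** under the curve to one, not the sum of the bar heights. When the bins are narrow, the heights can be much larger than one.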
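
The area-versus-height point can be checked numerically. The sketch below uses NumPy's density-normalized histogram on made-up values (an assumed stand-in for the zero ratios, not the real `ratios` vector), so the numbers are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the zero ratios: percentages squeezed close to 100
values = 100 - rng.exponential(scale=0.5, size=5000)

# density=True rescales the bar heights so that the total area is one
heights, edges = np.histogram(values, bins=20, density=True)
bin_widths = np.diff(edges)

height_sum = heights.sum()
area = (heights * bin_widths).sum()
print("sum of bar heights:", height_sum)
print("area under the bars:", area)
```

Only the second number is constrained to equal one; the first depends on how wide the bins are.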

## 2. Low variance filter
If all the samples share the same value for a variable, then that variable cannot help us understand the differences between the samples. This is of course an extreme case, but in general, variables with low variance are unlikely to be useful in defining the structure of the data and can be removed. For our data, it means we would remove genes with low variance across the cells or, the other way around, keep only the genes that are highly variable.

In [0]:
# Calculate variance for all variables
variances = data_1.var()
# Note that I am using data past missing value ratio filtering, which is not required
# For your own data, you can build your process pipeline with a combination of 
# approaches in and out of this notebook

# Plot variance distribution
f, axes = plt.subplots(1, 2, figsize=(12,5))
sns.distplot(variances, bins=20, ax=axes[0]).set_title('Variance distribution')
sns.distplot(variances, bins=20000, ax=axes[1]).set_title('Zoomed in to 0-200')
plt.xlim(0,200)
plt.show()

We could do a really simple filtering like what we did for missing value ratios:

In [0]:
# "variances > 25" gives a boolean vector that is used to extract
# gene names from data_1.columns, which are in turn used to extract
# columns from data_1
data_2 = data_1[data_1.columns[variances > 25]] # setting the threshold as 25

print("Total number of feature included has dropped from %d to %d" % (len(data_1.columns), len(data_2.columns)))

But the variance of a variable can depend on its mean: highly expressed genes tend to have larger variances.

In [0]:
# Calculate the mean of each variable and plot variance against mean
means = data_1.mean()

f, axes = plt.subplots(1, 2, figsize=(12,5))
# x and y are passed by keyword; recent seaborn versions no longer accept them positionally
sns.scatterplot(x=means, y=variances, ax=axes[0]).set_title('Variance over mean')
sns.scatterplot(x=means, y=variances, ax=axes[1]).set_title('Variance over mean. Limit y-axis to 0-8000')
plt.xlim(right = 100)
plt.ylim(bottom = -1000, top = 8000)

So if we set one threshold for all, we could bias our selection due to the variance-mean dependency. Taking that into consideration, we can use coefficient of variation instead. 

In [0]:
stds = data_1.std()
CV = stds/means

# Plot CV distribution and CV over mean
f, axes = plt.subplots(1, 2, figsize=(12,5))
sns.distplot(CV, ax=axes[0]).set_title('CV distribution')
sns.scatterplot(x=means, y=np.log10(CV**2), ax=axes[1]).set_title('log10 squared CV by means')
plt.xscale('log')
plt.show()

Looking at the right plot, if we group genes (features) into bins by their mean and calculate the median of the log10 squared CV in each bin, we can use that median as a cut-off and keep only the genes above it in each bin (details for implementing this can be found [here](https://github.com/k326xh/SingleCellRNA-seq/blob/master/Notebook/GeneSubsets.ipynb)). By picking only the highly variable genes, we are left with an even smaller number of dimensions.
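
A minimal sketch of this binning idea is given below. It is not the implementation from the linked notebook: the toy count matrix, the choice of 10 equally populated bins, and the rule "keep genes above their bin's median" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical toy counts (cells x genes) standing in for data_1
n_cells, n_genes = 200, 500
gene_rates = rng.gamma(shape=2.0, scale=2.0, size=n_genes)
counts = pd.DataFrame(
    rng.poisson(gene_rates, size=(n_cells, n_genes)),
    columns=[f"gene_{i}" for i in range(n_genes)],
)

# Keep genes with a positive mean so that the CV is defined
means = counts.mean()
expressed = means > 0
means = means[expressed]
stds = counts.std()[expressed]
log_cv2 = np.log10((stds / means) ** 2)

# Group genes into 10 equally populated bins by mean expression
bins = pd.qcut(means, q=10, labels=False, duplicates="drop")
# Median log10(CV^2) of each gene's bin, aligned back to the genes
bin_median = log_cv2.groupby(bins).transform("median")

# Keep genes that are more variable than the median of their own bin
highly_variable = log_cv2[log_cv2 > bin_median].index
selected = counts[highly_variable]
print(selected.shape)
```

Because the cut-off is computed per bin, genes are compared only with genes of similar mean expression, which avoids the bias of a single global threshold.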

## 3. High correlation filter
High correlation between two variables means they have similar trends and are likely to carry similar information. Within a group of highly correlated variables, we can therefore keep one representative and drop the rest.

In [0]:
corr = data_2.corr()
plt.figure(figsize=(10, 10))

# show pairwise correlation between variables
ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)

A lot of the genes are actually highly correlated with one another. Again, this is part of the nature of RNA-seq data sets. For a smaller data set with a handful of supposedly independent variables, a correlation of 0.6 would already be alarming and could be used as a threshold for dropping variables.
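
One common way to apply such a threshold is sketched below on a small made-up table. The column names, the 0.9 threshold, and the rule "drop the later column of each correlated pair" are assumptions for illustration; other rules for choosing which variable to keep are equally valid.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# Hypothetical toy table: "a_copy" and "b_copy" are noisy duplicates
toy = pd.DataFrame({
    "a": base[:, 0],
    "a_copy": base[:, 0] + 0.05 * rng.normal(size=300),
    "b": base[:, 1],
    "b_copy": base[:, 1] + 0.05 * rng.normal(size=300),
    "c": base[:, 2],
})

threshold = 0.9
abs_corr = toy.corr().abs()
# Keep only the upper triangle so that each pair is inspected once
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))
# Drop a column if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)
print("dropped:", to_drop)
```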

## 4. Random forest
Suppose we can find one variable that separates the samples into two categories and, within each category, we can again find a variable that splits the samples into smaller categories. The process continues until the categories cannot be split any further. What we are doing here is building a decision tree. As you can imagine, there are numerous possible ways of building such trees (how to split the categories, which feature to use for each split, etc.). Random forest builds many such trees, each on a random subset of the samples and features, and averages their predictions. Features that are picked for splits more often, and that improve the splits more, are more likely to be important for describing the data. Normally, random forest requires a target attribute as a class label. For our toy data, we don't actually have a label for the cells, e.g., their cell types. But we can pick one of the neuronal markers, RBFOX3, use its expression level as the target attribute, and see which set of genes could potentially determine the differences in this level across the data and are thus good features to keep.

<figure>
  <img src="https://miro.medium.com/max/1480/1*i0o8mjFfCn-uD79-F1Cqkw.png" width="500"  alt="Random Forest model"/>
  <figcaption>
    How does Random Forest work. From <a href="https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d">Random Forest Simple Explanation</a>
    </figcaption>
</figure>


In [0]:
from sklearn.ensemble import RandomForestRegressor
# Set up the regressor
model = RandomForestRegressor(random_state=1, max_depth=10)

# the ENSEMBLE gene id for RBFOX3 is ENSMUSG00000025576
## We will use this as the target attribute, so we
## remove this variable from the matrix (by .drop())
model.fit(data_1.drop(["ENSMUSG00000025576"], axis=1),data_1["ENSMUSG00000025576"])

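# Obtain features that often split the trees
# (the target gene was dropped before fitting, so drop it here too to keep indices aligned)
features = data_1.drop(["ENSMUSG00000025576"], axis=1).columns
importances = model.feature_importances_
indices = np.argsort(importances)[-10:]  # top 10 features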

# plot the features ranked top by their relative importances
plt.title('Feature Importances')
## horizontal bar plot
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

These genes have often been selected as the best split in the constructed trees, so they are most likely informative features to retain. We can either save these top features manually, or use `SelectFromModel(model)` to get the selected features.
 

In [0]:
from sklearn.feature_selection import SelectFromModel
# Obtain features picked by random forest
feature = SelectFromModel(model)
# Exclude the target gene from the input so it cannot select itself
X_rf = data_1.drop(["ENSMUSG00000025576"], axis=1)
# Extract the matrix containing the selected features
Fit = feature.fit_transform(X_rf, data_1["ENSMUSG00000025576"])

print("The number of dimensions is reduced to %d" % Fit.shape[1])

## 5. Backward feature elimination

This approach starts by building a model with all n features, then builds models with (n - 1) features to see which feature can be removed with the smallest change in performance. That feature is dropped, and the process is repeated with one fewer feature each time until no more variables can be dropped.

<figure>
  <img src="https://slideplayer.com/slide/4646748/15/images/11/Backward+Elimination+%28wrapper%29.jpg" width="500"  alt="process of backward elimination"/>
  <figcaption>
    Process of backward elimination. <a href="https://slideplayer.com/slide/4646748/">Lecture 4: Embedded methods</a>
    </figcaption>
</figure>

This method can be used when building Linear Regression or Logistic Regression models. Due to the time limit, we will not run the code chunk below. If you are interested in trying it, I would recommend replacing the toy data below and testing with a smaller data set.


In [0]:
# # Uncomment this chunk to run backward feature elimination for your data
# from sklearn.linear_model import LinearRegression
# from sklearn.feature_selection import RFE
# from sklearn import datasets
# lreg = LinearRegression()
# rfe = RFE(lreg, n_features_to_select=10)
# # Replace the feature matrix with your data set and the target with
# # an array labeling the samples
# rfe = rfe.fit(data_1.drop(["ENSMUSG00000025576"], axis=1), data_1["ENSMUSG00000025576"])
# # Check the ranking of the variables (1 means selected)
# rfe.ranking_

## 6. Forward feature selection
This is the opposite of backward feature elimination: we start by training models with only one feature each, keep the feature that gives the best performance, then add one feature at a time. The process is repeated until no significant improvement is seen.

<figure>
  <img src="https://slideplayer.com/slide/4646748/15/images/7/Forward+Selection+%28wrapper%29.jpg" width="500"  alt="process of forward Selection"/>
  <figcaption>
    Process of forward selection. <a href="https://slideplayer.com/slide/4646748/">Lecture 4: Embedded methods</a>
    </figcaption>
</figure>

This is not very useful for single cell RNA-seq data, but we can again take the RBFOX3 gene as the target attribute and see which genes are informative in determining the RBFOX3 expression pattern across cells.


In [0]:
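from sklearn.feature_selection import f_regression
# Note: f_regression scores each gene on its own with a univariate F-test,
# which only approximates the first step of a forward selection
X_ffs = data_1.drop(["ENSMUSG00000025576"], axis=1)  # exclude the target gene itself
ffs = f_regression(X_ffs, data_1["ENSMUSG00000025576"])
# ffs is a tuple of two arrays: the F-values and the corresponding p-values
featureToKeep = X_ffs.columns[ffs[0] >= 10]
print("The number of dimensions is reduced to %d" % len(featureToKeep))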
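
The cell above ranks each gene independently, so it is a filter rather than a true stepwise search. As a hedged sketch of the stepwise procedure described in this section, scikit-learn's `SequentialFeatureSelector` can add one feature at a time. The synthetic data, the linear model, and the choice of two features are assumptions for illustration; on 16152 genes this search would be very slow.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Hypothetical toy data: only features 1 and 4 drive the target
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)

# Start from zero features and greedily add the one that improves
# the cross-validated score the most, until two are selected
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=5
)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
print("selected feature indices:", selected)
```

Setting `direction="backward"` runs the elimination variant from the previous section with the same interface.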

# Feature extraction
The approaches above work fine when the data set is reasonably small. For larger data sets, including our toy data, feature extraction approaches would perform better.

## 7. Factor analysis (FA)
Variables can be grouped by their intercorrelations. All variables within a particular group will have a high correlation among themselves, but a low correlation with variables from other group(s). Each group here is a factor. For more details on FA, please see Day 4 Ex-6.

In [0]:
from sklearn.decomposition import FactorAnalysis
# We are again using the post-missing value ratio filtering data
# Factor analysis requires input as a numpy ndarray. 
# "data_1.values" extracts values from dataframe data_1 and reformat it into ndarray
# n_components decides the number of factors in the transformed data
FA = FactorAnalysis(n_components = 3).fit_transform(data_1.values)

In [0]:
# plot the transformed data by each of the three factor pairs.
plt.figure(figsize=(12,8))
plt.title('Factor Analysis Components')
plt.scatter(FA[:,0], FA[:,1])
plt.scatter(FA[:,1], FA[:,2])
plt.scatter(FA[:,2],FA[:,0])

## 8. Principal component analysis (PCA)

A principal component is a **linear combination** of the original variables. Principal components are extracted in such a way that the first principal component explains the maximum variance in the data set. The second principal component tries to explain the remaining variance and is uncorrelated with (orthogonal to) the first principal component. The third principal component tries to explain the variance that is not explained by the first two principal components, and so on.

In [0]:
from sklearn.decomposition import PCA
# Set the modeler. We ask it to calculate the first 10 components
pca = PCA(n_components=10)
# PCA also require ndarray as input format. so use "data_1.values"
pca_result = pca.fit_transform(data_1.values)

For PCA, we often check how much **variance** the principal components (PCs) can **explain**. Since each successive PC explains less variance than the one before, we can pick the first N PCs that explain most of the variance in the data and discard the rest.

In [0]:
# Plot component-wise and cumulative explained variance
plt.plot(range(1,11), pca.explained_variance_ratio_)
plt.plot(range(1,11), np.cumsum(pca.explained_variance_ratio_))
plt.title("Component-wise and Cumulative Explained Variance")

The blue curve shows the variance explained by each PC, while the orange curve shows the cumulative explained variance. For our sample, the first four components explain over 95% of the variance. So decomposing the data into four components still keeps at least 95% of the variance in the original data.
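
Instead of reading N off the plot, it can also be computed from the cumulative curve. The sketch below uses a made-up matrix with three strong latent directions (an assumption for illustration, not our expression data) and the 95% threshold mentioned above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical toy matrix: 3 strong latent directions plus weak noise
latent = rng.normal(size=(300, 3)) * np.array([10.0, 5.0, 3.0])
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + 0.1 * rng.normal(size=(300, 20))

# Option 1: fit all components, then find the first N reaching 95%
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.95)) + 1

# Option 2: pass the fraction directly and let scikit-learn pick N
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X)
print(n_keep, pca_95.n_components_)
```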

One thing that stands out from the plot above is that our first component alone explains most of the variance. This is a red flag in the data processing. It could be that some variables vary a lot more than others simply because of their respective scales (e.g., cm vs. km). To avoid this type of bias, we need to scale and center the data before PCA.
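
The cm vs. km effect is easy to reproduce. In this sketch, two independent and equally informative variables are generated on very different scales (the variable names and scales are made up for illustration), and PCA is run with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 500
# Two independent variables recorded in different units
length_km = rng.normal(loc=0.0, scale=1.0, size=n_samples)        # spread of about 1
width_cm = rng.normal(loc=0.0, scale=1.0, size=n_samples) * 1e5   # spread of about 100000
X = np.column_stack([length_km, width_cm])

# Without scaling, the large-scale variable dominates the first PC
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# After centering and scaling, both variables contribute about equally
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw   :", raw_ratio)
print("scaled:", scaled_ratio)
```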

I provide one potential workflow for incorporating this into the analysis. Please note that this might not be directly generalizable.





In [0]:
# log2 transform the data (this generally works well for rna-seq data)
log_data = np.log2(data_1 + 1) # 1 pseudocount added to avoid log2(0)

from sklearn.preprocessing import StandardScaler # data scaling
scaler = StandardScaler()
scale_data = scaler.fit_transform(log_data)

pca = PCA(n_components=50)
pca_result_txf = pca.fit_transform(scale_data) 

# Plot component-wise and cumulative explained variance
plt.plot(range(1,51), pca.explained_variance_ratio_)
plt.plot(range(1,51), np.cumsum(pca.explained_variance_ratio_))
plt.title("Component-wise and Cumulative Explained Variance")


There are a variety of extensions and variants of PCA that help to improve certain aspects of this method. Feel free to check them out [here](https://scikit-learn.org/stable/modules/decomposition.html#pca).

## 9. Independent component analysis (ICA)

Independent component analysis attempts to decompose a multivariate signal into **independent** non-Gaussian signals. A simple application of ICA is the "cocktail party problem", where the underlying speech signals are separated from a sample data consisting of people talking simultaneously in a room. 

<figure>
  <img src="http://iiis.tsinghua.edu.cn/~jianli/courses/ATCS2016spring/ATCS-21.jpg" width=500 alt="cocktail party">
  <figcaption>
    Cocktail party problem. Voices from speakers and the crowd are captured by both microphones. ICA's goal is to recover and separate the source signals from the speakers. (From <a href="http://iiis.tsinghua.edu.cn/~jianli/courses/ATCS2016spring/ATCS-2016s.htm">Selected Topics in Learning, Prediction, and Optimization</a>)
  </figcaption>
</figure>

As in PCA, we need to provide the number of components we would like to calculate. What's different here is that in PCA, the top PCs do not change no matter how many components you ask the algorithm to calculate: it always starts from the one that explains the most variance. There is no such ranking for the components in ICA. You can think of this number as the number of **signal sources** contributing to the final signals captured. Thus, a different number of components can generate totally different results.
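
A small cocktail party can be simulated to see this in action. The two source signals, the mixing matrix, and the noise level below are assumptions for illustration. Note that ICA recovers the sources only up to order, sign and scale, so the check uses absolute correlations.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
# Two hypothetical "speakers": a sinusoid and a square wave
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2] + 0.05 * rng.normal(size=(2000, 2))

# Each "microphone" records a different mixture of the two speakers
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S @ A.T

# Ask for as many components as there are signal sources
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(X)

# Absolute correlation between each true source and each recovered component
corr = np.abs(np.corrcoef(S.T, recovered.T)[:2, 2:])
print(np.round(corr, 2))
```

Each row of `corr` should contain one value close to 1, meaning every source is matched by one of the recovered components.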

We can also apply this approach to our data, and consider independent components as gene pathways that could give rise to certain types of gene expression patterns in cells.

In [0]:
from sklearn.decomposition import FastICA 
# build the ICA modeler
ICA = FastICA(n_components=3, random_state=12) 
ICA_result = ICA.fit_transform(data_1.values)

In [0]:
# plot the transformed data by each of the three factor pairs.
plt.figure(figsize=(8,8))
plt.title('ICA Components')
plt.scatter(ICA_result[:,0], ICA_result[:,1])
# plt.scatter(ICA_result[:,1], ICA_result[:,2])
# plt.scatter(ICA_result[:,2], ICA_result[:,0])

Factor analysis, PCA, and ICA all belong to **linear** dimensionality reduction. For the rest of this notebook, we will go through some of the ways of doing **non-linear** dimensionality reduction.

## 10. Isometric Feature Mapping (ISOMAP)
ISOMAP is a projection-based approach for dimensionality reduction. Before going into details about this specific approach, let's talk a little bit about this type of approach in general.

A simple illustration of projection is shown below, where **a1** is the projection of **a** on **b**. By projecting one vector onto the other, dimensionality can be reduced. These projections can be done onto interesting directions or onto manifolds. 

![example of vector projection](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2018/08/Screenshot-from-2018-08-07-15-33-15.png)

What is a **manifold**? A manifold is a topological space that resembles Euclidean space locally near every point, but may not globally. The surface of the Earth lies in 3D space, but could be considered as a 2D manifold. Although we cannot plot the entire surface of the Earth onto the Euclidean space, we can project the surface onto two (or even more) 2D maps. 

<p><a href="https://commons.wikimedia.org/wiki/File:Polar_stereographic_projections.jpg#/media/File:Polar_stereographic_projections.jpg"><img src="https://upload.wikimedia.org/wikipedia/commons/f/f0/Polar_stereographic_projections.jpg" alt="Polar stereographic projections.jpg"></a><br>By <a href="//commons.wikimedia.org/wiki/User:RokerHRO" title="User:RokerHRO">User:RokerHRO</a> - Combination of <a href="//commons.wikimedia.org/wiki/File:Stereographic_Projection_Water_Hemisphere.jpg" title="File:Stereographic Projection Water Hemisphere.jpg">Stereographic Projection Water Hemisphere</a> and <a href="//commons.wikimedia.org/wiki/File:Stereographic_Projection_Polar_Extreme.jpg" title="File:Stereographic Projection Polar Extreme.jpg">Stereographic Projection Polar Extreme</a> into a single image, <a href="https://creativecommons.org/licenses/by-sa/3.0" title="Creative Commons Attribution-Share Alike 3.0">CC BY-SA 3.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=19990550">Link</a></p>

The high-dimensional data resides in a high dimensional space (like the surface of the Earth is in a 3D space), but might be projected to a lower dimension manifold (the surface of the Earth could be charted in 2D space), and thus fulfill our goal of dimensionality reduction.

Step by step, projection-based approaches reduce dimensions as follows:
1. Look for a manifold that is close to the data
2. Project the data onto that manifold
3. Unfold the manifold for representation

Back to ISOMAP, it estimates the intrinsic geometry of a data manifold based on a rough estimate of each data point’s neighbors on the manifold. 

<figure>
  <img src="http://benalexkeen.com/wp-content/uploads/2017/05/isomap.png" width=800>
  <figcaption>
    From <a href="http://benalexkeen.com/isomap-for-dimensionality-reduction-in-python/">Isomap for Dimensionality Reduction in Python</a>. Simplified description for Fig. 3 from the original <a href="https://science.sciencemag.org/content/290/5500/2319.full">ISOMAP paper</a>
  </figcaption>
  <figcaption>
    A. Two points that are close together in Euclidean Space in this “Swiss roll” dataset may not reflect the intrinsic similarity between these two points. 
    </figcaption>
    <figcaption>
    B. A graph is constructed by connecting each point to its K nearest neighbours (K=7 here). The shortest geodesic distance is then calculated by a path finding algorithm such as Dijkstra’s Shortest Path.
     </figcaption>
     <figcaption>
    C. The 2D graph is recovered from applying classical MDS (Multidimensional scaling) to the matrix of graph distances. A straight line has been applied to represent a simpler and cleaner approximation to the true geodesic path shown in A.
    </figcaption>
</figure>
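
The three panels can be traced in code on a synthetic Swiss roll. This is a simplified sketch of the idea, not scikit-learn's `Isomap` implementation: the sample size, the choice of 8 neighbours, and the hand-written classical MDS step are assumptions for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

# t is the position of each point along the unrolled sheet
X, t = make_swiss_roll(n_samples=1500, random_state=0)

# Panel B, part 1: nearest-neighbour graph weighted by Euclidean distance
knn = kneighbors_graph(X, n_neighbors=8, mode="distance")

# Panel B, part 2: geodesic distance = shortest path through the graph (Dijkstra)
geo = shortest_path(knn, method="D", directed=False)

# Panel C: classical MDS on the geodesic distances
n = geo.shape[0]
J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
B = -0.5 * J @ (geo ** 2) @ J                # double-centered squared distances
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:2]          # two largest eigenvalues
embedding = eigvecs[:, top] * np.sqrt(eigvals[top])

# The first coordinate should follow the position along the roll
corr_with_t = abs(np.corrcoef(embedding[:, 0], t)[0, 1])
print(embedding.shape, round(corr_with_t, 2))
```

If the neighbourhood is too small, the graph can fall apart into disconnected pieces; if it is too large, edges can jump between layers of the roll. Both cases distort the recovered map.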

Let's see how to implement this for our toy data. Here, `n_neighbors` decides the number of neighbors for each point and 
`n_components` decides the number of coordinates for manifold.

In [0]:
from sklearn import manifold 
# Set up the modeler. We consider 10 points as neighbor points for each point
# and would like to project the data onto three dimensions
isomap = manifold.Isomap(n_neighbors=10, n_components=3)
# Fit the data
trans_data = isomap.fit_transform(data_1.values)

In [0]:
# visualize transformed data by each of the three factor pairs.
plt.figure(figsize=(12,8))
plt.title('Decomposition using ISOMAP')
plt.scatter(trans_data[:,0], trans_data[:,1])
plt.scatter(trans_data[:,1], trans_data[:,2])
plt.scatter(trans_data[:,2], trans_data[:,0])

## 11. t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE has become widely popular in machine learning because of its ability to reveal hidden structure where other approaches fail. It preserves local structure well, and some of the global structure, when transforming high-dimensional data into compelling 2D or 3D maps that are more human-friendly for visualization.

Under the hood, t-SNE builds on SNE (Stochastic Neighbor Embedding), but uses a Student-t distribution to measure similarities in the low-dimensional map and a symmetrized probability distribution in the high-dimensional space.
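
For reference, the two similarity measures can be written out as follows, using the notation of the original t-SNE paper (the symbols $x_i$ for the $n$ high-dimensional points, $y_i$ for their low-dimensional counterparts, and the per-point bandwidth $\sigma_i$ are not defined elsewhere in this notebook).

```latex
% High-dimensional similarities: Gaussian kernel, then symmetrized
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities: Student-t kernel with one degree of freedom
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% The embedding is found by minimizing the Kullback-Leibler divergence
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Each bandwidth $\sigma_i$ is tuned so that the distribution around point $i$ matches a user-chosen perplexity, which is how that parameter enters the algorithm.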

Tuning a couple of parameters, especially **perplexity**, can drastically change the result of t-SNE. Perplexity is related to the **number of nearest neighbors** used in other manifold learning algorithms. A higher perplexity tends to produce fewer visual clusters on the resulting map ([Dimension Reduction - t-SNE](https://blog.paperspace.com/dimension-reduction-with-t-sne/)).
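
To get a feel for this parameter without waiting on the full data set, the same embedding can be computed at several perplexities on a small synthetic data set. The blob data and the two perplexity values below are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical toy data: 150 samples in 20 dimensions, drawn from 3 groups
X, labels = make_blobs(n_samples=150, centers=3, n_features=20, random_state=0)

embeddings = {}
for perplexity in (5, 30):
    # perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print("perplexity", perplexity, "->", embeddings[perplexity].shape)
```

Plotting each entry of `embeddings` side by side (for example with `plt.scatter`, colored by `labels`) shows how the layout changes with perplexity.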


In [0]:
# Load required library
from sklearn.manifold import TSNE 
 
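# 1000 iterations is the default; the argument is omitted because its name
# differs between scikit-learn versions (n_iter in older, max_iter in newer releases)
tsne = TSNE(n_components=2, perplexity=30).fit_transform(data_1.values)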

In [0]:
plt.figure(figsize=(12,8))
plt.title('t-SNE components')
plt.scatter(tsne[:,0], tsne[:,1])

Playing with the parameters during this session could take too much time. But if you are interested in how much some of the factors can affect the final visualization, you can check out this great post: [How to Use t-SNE Effectively.](https://distill.pub/2016/misread-tsne/)

One final side note on using t-SNE: to reduce the time spent running the calculations in high-dimensional space, it is recommended to run another dimensionality reduction method prior to t-SNE. For example, we can do PCA first, and use the transformed data as input.


In [0]:
# PCA transformed data from the first PCA code chunk is saved in pca_result
# and is in ndarray format, so we can use it as input directly
# (1000 iterations is the default, so the iteration argument is omitted)
tsne_pca = TSNE(n_components=2, perplexity=30).fit_transform(pca_result)
plt.figure(figsize=(12,8))
plt.title('t-SNE components')
plt.scatter(tsne_pca[:,0], tsne_pca[:,1])

## 12. Uniform Manifold Approximation and Projection (UMAP)

Compared to t-SNE, UMAP preserves as much of the local structure and more of the global structure of the data, with a shorter runtime. Interestingly, it also largely reflects the transitions in the data. This has made it attractive to biologists for building trajectories from single cell data.

The key parameter affecting the result of UMAP is again the number of nearest neighbors (`n_neighbors`). Similar to perplexity in t-SNE, larger values preserve more of the global structure at the expense of detailed local structure, producing fewer visual clusters on the resulting map.

In [0]:
import umap
# min_dist is more of an aesthetic parameter, controlling desired separation 
# between close points in the embedding space
umap_data = umap.UMAP(n_neighbors=20, min_dist=0.3, n_components=2).fit_transform(data_1.values)

In [0]:
plt.figure(figsize=(12,8))
plt.title('Decomposition using UMAP')
plt.scatter(umap_data[:,0], umap_data[:,1])

# Take-home messages
Dimensionality reduction can be done by feature selection and feature extraction. While feature selection works for all data sets, it is not as effective as feature extraction for data sets with higher dimensions.
 
Feature extraction can be either linear or non-linear, and the choice should be made according to the true distribution of the data. For how to determine the non-linearity of data, check out the discussion [here](https://stats.stackexchange.com/questions/304262/how-to-know-when-to-use-linear-dimensionality-reduction-vs-non-linear-dimensiona).
 
Approaches listed here could be combined into a set of approaches and used together for your data (e.g., missing value ratio filtering -> low variance filtering -> PCA -> UMAP). 
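
As a rough sketch of such a combination, the first three steps can be chained with scikit-learn on a made-up count matrix. The thresholds, the toy data, and the use of `make_pipeline` are assumptions for illustration. The final UMAP step is left out here to avoid the extra dependency; it would take `embedding` as its input.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical count matrix (200 cells x 300 genes)
rates = rng.uniform(0.2, 2.0, size=300)
counts = pd.DataFrame(
    rng.poisson(rates, size=(200, 300)),
    columns=[f"gene_{i}" for i in range(300)],
)
counts.iloc[:, :50] = 0  # pretend the first 50 genes are never detected

# Step 1: missing value ratio filter (zeroes play the role of missing values)
zero_ratio = (counts == 0).mean() * 100
filtered = counts.loc[:, zero_ratio <= 99.5]

# Steps 2 and 3: low variance filter, then scaling and PCA
reducer = make_pipeline(
    VarianceThreshold(threshold=0.5),
    StandardScaler(),
    PCA(n_components=10, random_state=0),
)
embedding = reducer.fit_transform(filtered)
print(filtered.shape, "->", embedding.shape)
```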