# Problem setting

In this machine learning series we're going to talk about dimensionality reduction. First we'll do some theory and then, as usual, try some practice. Today we'll cover a bunch of methods; they are listed on the slide.

And let's start.

As you might know, there are 3 main pillars of machine learning:
- classification, when we predict the class of an object
- regression, when we predict some numeric value
- and clustering, when we don't have a training set, so we infer the inner structure of the data


Apart from those 3 major tasks, there are quite a few others. One of them is dimensionality reduction, which is what we're going to talk about today.

In the AI realm we work with signals. A signal could be simple, like a record in a relational database, or it could be extremely complex, like a 5-minute video stream.

Data tends to get more and more complex. For example, if we want to classify an RGB image at 480p resolution (say 640 × 480 pixels), our model has to deal with an input of roughly 1M dimensions.
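To see where the "roughly 1M" figure comes from, here is a quick back-of-the-envelope calculation. The 640 × 480 frame size, the 30 frames per second and the variable names are all assumptions made for illustration; they are not fixed by anything above.

```python
# Assumed frame size for a "480" image: 640 x 480 pixels, 3 color channels.
width, height, channels = 640, 480, 3
image_dims = width * height * channels
print(image_dims)  # 921600, i.e. roughly 1M input dimensions

# The same arithmetic for a 5-minute video, assuming 30 frames per second.
fps, seconds = 30, 5 * 60
video_dims = image_dims * fps * seconds
print(video_dims)  # 8294400000 raw values
```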
 
So we need to compress it somehow:
- we need to reduce its size
- and we need to do it in a smart way: while removing insignificant noise, we should retain as much valuable information as possible.

The first is usually achieved by simple **feature selection** (we talk about feature selection methods in a separate video).
The second is achieved by constructing new features. This is the basis of deep learning models, where a raw image is converted into a compressed semantic representation that numerically encodes the content of the image.
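The difference between selecting features and constructing new ones can be sketched in a few lines of NumPy. This is only an illustration on synthetic data: the variance-based selection rule and the use of PCA (via SVD) as the "construction" step are assumptions chosen for simplicity, not the only options.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, k = 200, 10, 3

# Synthetic data: 10 features with different scales (an assumption for the demo).
X = rng.normal(size=(n_samples, n_features)) * np.linspace(0.1, 3.0, n_features)

# Feature selection: keep k of the ORIGINAL columns (here, the highest-variance ones).
variances = X.var(axis=0)
selected_idx = np.sort(np.argsort(variances)[-k:])
X_selected = X[:, selected_idx]

# Feature construction: build k NEW features as linear combinations of all columns.
# This is PCA computed through the SVD of the centered data.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_constructed = X_centered @ Vt[:k].T

print(X_selected.shape, X_constructed.shape)
print(X_selected.var(axis=0).sum(), X_constructed.var(axis=0).sum())
```

Both results have the same size, but the constructed features keep at least as much of the total variance as any k original columns can, because each new feature mixes information from all of the inputs.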

There is a closely related concept of manifold learning, where we assume that the data points are distributed along some structured manifold. By learning the shape of this manifold we can represent the data in a much more compressed way.
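A toy example may help here. Below, 2-dimensional points lie exactly on a spiral, so a single number per point is enough to describe them. Note the big assumption: the form of the manifold is known in advance in this sketch, whereas manifold learning methods have to estimate it from the data.

```python
import numpy as np

# Points on the spiral r = t: two coordinates, but only one degree of freedom.
t = np.linspace(1.0, 4 * np.pi, 500)
X = np.column_stack([t * np.cos(t), t * np.sin(t)])  # shape (500, 2)

# Knowing the manifold, we can compress each point to one number:
# its position along the spiral, which here equals its distance from the origin.
t_recovered = np.linalg.norm(X, axis=1)

# ...and rebuild the original 2-D points from that single number.
X_rebuilt = np.column_stack(
    [t_recovered * np.cos(t_recovered), t_recovered * np.sin(t_recovered)]
)

print(np.allclose(X, X_rebuilt))
```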

So, what is the value of dimensionality reduction?
1. We can process more data, because there is less strain on the CPU and memory.
2. In some ML tasks the compressed signal can be the result itself. For example, in topic modeling we represent each document as a blend of topics.
3. We are able to plot the data and analyze it visually. We just need to project the data onto a 2-dimensional space and plot it.

There are two large groups of methods:
- methods from the first group work with the global structure of the data
- methods from the second group utilize the local structure of the data

Timeline:
- PCA (1900)
- MDS (1960)
- NMF (1999)
- LLE (2000)
- Isomap (2000)
- Random Projection (2001)
- LTSA (2002)
- Eigenmaps (2003)
- LDA, Andrew Ng (2003)
- t-SNE (2008)
- Word2Vec (2013)
- GloVe (2014)
- LargeVis (2016)
- UMAP (2018)












### Evaluating

Most dimensionality reduction algorithms are unsupervised. So the question is: how can we assess their performance?
The answer is that we can kind of move the problem into a semi-supervised setting.

Let's take a look at 2 datasets that are frequently used to assess unsupervised methods:
- the first one is MNIST, a dataset of handwritten digits from 0 to 9,
- and the second one is Fashion-MNIST, a collection of small images of different kinds of clothes.

We mask the target labels and treat our data as if it were unlabeled.
Then we apply dimensionality reduction (for example, PCA) and see how the actual labels are distributed in the low-dimensional space.

The easiest way is to use a 2-dimensional target space, so we can actually see the mapping of labels on a plane. Let's try it with the digits dataset. We assign a separate color to each digit and look at how these colors are distributed in the low-dimensional space. Ideally, we want the points to form well-defined, well-separated clusters of the same color.

Here is our result. As we can see, the PCA projection is far from ideal: the labels overlap a lot and the cluster edges are not well defined. We'll see later that modern approaches tend to do the job much better.
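The experiment can be reproduced roughly as follows. This is a sketch under a few assumptions: it uses scikit-learn's small built-in 8 × 8 digits dataset as a stand-in for MNIST, and it adds a k-nearest-neighbors accuracy score (my choice, not part of the procedure above) as a numeric proxy for "how well the colors separate". The `plot_embedding` helper is hypothetical and needs matplotlib.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Small 8x8 digits dataset shipped with scikit-learn (stand-in for MNIST).
digits = load_digits()
X, y = digits.data, digits.target

# The labels are "masked": PCA never sees y.
embedding = PCA(n_components=2, random_state=0).fit_transform(X)

# Only now do we bring the labels back to check how well they separate in 2-D.
knn = KNeighborsClassifier(n_neighbors=10)
knn_accuracy = cross_val_score(knn, embedding, y, cv=5).mean()
print(embedding.shape, round(knn_accuracy, 3))


def plot_embedding(points, labels):
    """Scatter plot of the 2-D embedding, one color per digit."""
    import matplotlib.pyplot as plt

    plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=8)
    plt.colorbar(label="digit")
    plt.show()
```

Calling `plot_embedding(embedding, y)` draws the colored map. The accuracy score makes it easy to compare PCA against other methods on the same data without judging plots by eye.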

There are a couple of enhancements to the standard approach. Here they are:
Sparse PCA is PCA performed with an added constraint on number of 

# TBD

## LargeVis (2016)
References
- https://arxiv.org/pdf/1602.00370.pdf
- https://habr.com/ru/post/341208/

LargeVis is an extended version of t-SNE. The major differences:
- In the first step, when finding nearest neighbors, it uses more efficient random projection trees instead of the vantage-point trees used by t-SNE
- The cost function is defined probabilistically
- It is optimized using maximum likelihood
- It uses negative sampling to make the optimization more efficient

## ePCA (2017)

## NMF (2001)



## GloVe (2014)

It's an alternative word embedding model that came out of Stanford a year after Word2Vec. Unlike Word2Vec, it is trained on a global word-word co-occurrence matrix, which is easy to compute from a corpus.

It uses 2 sets of parameters instead of one, because the model trains better that way. Practice shows that the resulting embedding is best defined as X + Y.

## LTSA (2002)
Local Tangent Space Alignment.

## LDA
One of the central tasks in NLP is called "topic modeling". There is a plethora of algorithms that solve it. The modern ones are often quite involved, but the idea is very simple: we need to describe each document as a mixture of topics. In some sense it's very close to clustering, but unlike clustering it allows multiple topics to be present in one document.

Instead of using a raw description of a text (for example, a bag of encoded words), we can describe it just as a mixture of several topics. This is enough to compare documents to each other, plot them on a single map and so on.

You don't even have to describe the topics themselves. They can act as uninterpretable numeric embeddings.

As of today, probably the most popular method is Latent Dirichlet Allocation (LDA).

It was first used to track the representation of different species in population analysis, but later it became the central method for extracting topics from texts. I'm not going to spend much time on it here. I'll make another video.
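To make the "mixture of topics" idea concrete, here is a minimal sketch using scikit-learn's `LatentDirichletAllocation`. The six toy documents and the choice of 2 topics are made up for illustration; on a corpus this small the learned topics are not meaningful, the point is only the shape of the output.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (invented for this example).
docs = [
    "the cat sat on the mat with another cat",
    "dogs and cats are popular pets",
    "my dog chased the cat around the garden",
    "the stock market fell as investors sold shares",
    "investors buy shares when the market rises",
    "the bank raised interest rates and the market reacted",
]

# Raw description: a bag-of-words count vector per document.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Compressed description: each document becomes a mixture of 2 topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

print(counts.shape)      # (number of documents, vocabulary size)
print(doc_topics.shape)  # (number of documents, number of topics)
print(doc_topics.round(2))
```

Each row of `doc_topics` is a probability distribution over topics, so a document can belong to several topics at once, and the rows can be compared or plotted directly.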









# Dimensionality Reduction and Deep Learning
As a rule, the input is some multidimensional signal and the output is a very simple result (unless it is a generative or seq2seq model). So dimensionality reduction happens in any case, sometimes at the preprocessing stage and sometimes at the end of applying the model.

Before deep learning became widespread, working with a complex signal relied heavily on manual feature engineering. When working with images, the author of an algorithm would first compute a set of aggregates designed to convey the information in the signal as accurately as possible, and only then build a predictive model.

In the late 80s Yann LeCun asked why all of this should be done by hand when the process could be parameterized and made part of the algorithm. This is how the first convolutional neural networks appeared. Architecturally, they consisted of a long convolution stage, which is essentially our dimensionality reduction, and a small modeling part responsible for learning the predictive model. The convolution parameters were added to the network's parameters and were tuned while training the model.
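A small NumPy sketch shows how a stack of convolution and pooling steps shrinks the signal. Everything here is an assumption for illustration: the 28 × 28 random "image", the 3 × 3 kernel with random weights (in a real network they are learned), and the two-stage layout. As in most deep learning libraries, the "convolution" is implemented as cross-correlation.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def max_pool(feature_map, size=2):
    """Keep the maximum of each size x size block, dropping leftover edges."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


rng = np.random.default_rng(0)
image = rng.random((28, 28))        # stand-in for a grayscale image
kernel = rng.normal(size=(3, 3))    # random here; learned in a real CNN

x = image
shapes = [x.shape]
for _ in range(2):
    x = np.maximum(conv2d_valid(x, kernel), 0.0)  # convolution + ReLU
    x = max_pool(x)                               # downsampling
    shapes.append(x.shape)

print(shapes)  # [(28, 28), (13, 13), (5, 5)]
```

After two stages the 784 input values are reduced to 25, which is the kind of compression that the small modeling part of the network then works with.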

Then, in the late 90s, recurrent networks appeared as an analog of convolutional ones, but for sequences, that is mainly texts and various time series. The processing mechanism is a bit different, but the principles stayed the same. Before being fed into the network, each word was converted into its embedding, a semantic vector in a lower-dimensional space. The word embedding also became part of the network and was tuned during training.

Until roughly 2010 deep networks were not widely used, mainly because of the high cost of the hardware needed for full-scale computation. After 2010 their popularity grew explosively and a huge number of new algorithms appeared.

Autoencoders are a type of neural network that embeds a signal into a lower-dimensional space. They are used on their own to obtain a compressed description, but more often as part of a larger network architecture.
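The idea can be shown with a deliberately tiny autoencoder written in plain NumPy. This is a sketch under strong assumptions: the encoder and decoder are single linear layers with no nonlinearity, the data is synthetic (10 features generated from 2 hidden factors), and training is plain gradient descent. Real autoencoders are usually deeper, nonlinear and built with a deep learning framework.

```python
import numpy as np


def train_linear_autoencoder(X, code_dim=2, lr=0.01, epochs=2000, seed=0):
    """Fit encoder/decoder matrices by gradient descent on reconstruction MSE."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, code_dim))
    W_dec = rng.normal(scale=0.1, size=(code_dim, n_features))
    losses = []
    for _ in range(epochs):
        codes = X @ W_enc          # encode: n_features -> code_dim
        X_hat = codes @ W_dec      # decode: code_dim -> n_features
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        grad_out = 2.0 * err / err.size        # d(MSE) / d(X_hat)
        grad_dec = codes.T @ grad_out
        grad_enc = X.T @ (grad_out @ W_dec.T)
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec, losses


# Synthetic data: 10 observed features driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))
X = X - X.mean(axis=0)

W_enc, W_dec, losses = train_linear_autoencoder(X)
codes = X @ W_enc                  # the compressed 2-D description
print(codes.shape)
print(losses[0], losses[-1])       # reconstruction error before and after training
```

The `codes` matrix is the reduced representation; the decoder exists only to force those codes to retain enough information to rebuild the input.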


