## Approximate Nearest Neighbors for search

* Author
* some terminology
* kNN
* Applications
    - information retrieval
    - useful in more complex tasks (pattern matching)
* Why ANN
* Approaches
    - LSH
    - Dimensionality-reduction based methods
    - Graph-based
    - Product Quantization
* Libraries
    - annoy
    - NMSLib
    - Rii

## Author

#### Jakub Bartczuk

- Machine learning engineer, previously developer
- Currently: Data scientist @Semantive
- Math background (Theoretical Math BSc)

For the past year I have been heavily into deep learning. In fact, I started using ANN tools in a DL-based project!

## kNN Terminology

<img style="float: right;" src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/279px-KnnClassification.svg.png">

Let's fix some notation:

$X$ - dataset, $X \subset \mathbb{R}^d$

$y$ - target in supervised learning

$\hat{y}$ - estimate (output of model) in supervised learning


$\| x - y \|_{p} = \sqrt[p]{\sum_{i<d} |x_i - y_i|^p}$ - distance induced by the $p$-norm, for example $\| x - y \|_{2}$ - Euclidean distance

$q$ - query vector

$N_k(q, X)$ - $k$ nearest neighbors of $q$ in $X$
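
The notation above can be made concrete with a minimal brute-force sketch in NumPy. The function names `lp_distance` and `nearest_neighbors` and the toy data are made up for illustration; they are not part of any library mentioned in this talk.

```python
import numpy as np


def lp_distance(X, q, p=2.0):
    """Distance ||x - q||_p from every row x of X to the query q."""
    return np.sum(np.abs(X - q) ** p, axis=1) ** (1.0 / p)


def nearest_neighbors(q, X, k, p=2.0):
    """Indices of N_k(q, X): the k rows of X closest to q, nearest first."""
    distances = lp_distance(X, q, p)
    # Brute force: every distance is computed and sorted.
    # A stable sort keeps ties in index order.
    return np.argsort(distances, kind="stable")[:k]


# Toy dataset X (4 points in R^2) and a query vector q.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
q = np.array([0.9, 0.1])
print(nearest_neighbors(q, X, k=2))
```

Here the query is closest to the point at index 1, then to the point at index 0.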


## kNN

Probably one of the simplest ML methods.

Pros
- makes no assumptions about the data distribution in supervised learning (nonparametric)
- predictions are easy to interpret: just look at the neighbors!

Cons
- prone to overfitting (especially for small $k$)
- suffers from 'curse of dimensionality'
- **costly inference**
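
To illustrate both the interpretability and the inference cost, here is a minimal sketch of a kNN classifier using majority voting over Euclidean neighbors. The function `knn_predict` and the toy labels are hypothetical, and majority voting is just one common way to turn neighbors into $\hat{y}$.

```python
from collections import Counter

import numpy as np


def knn_predict(q, X, y, k=3):
    """Predict a label for q by majority vote among its k nearest neighbors.

    Returns the predicted label and the neighbor indices, so that the
    prediction can be explained by inspecting the neighbors themselves.
    """
    # Costly part: one distance per training point, for every single query.
    distances = np.linalg.norm(X - q, axis=1)
    neighbor_idx = np.argsort(distances, kind="stable")[:k]
    votes = Counter(y[int(i)] for i in neighbor_idx)
    return votes.most_common(1)[0][0], neighbor_idx


# Toy training set: three "red" points near the origin, two "blue" far away.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
y_train = ["red", "red", "red", "blue", "blue"]

label, neighbors = knn_predict(np.array([5.0, 5.4]), X_train, y_train, k=3)
print(label, neighbors)
```

Note that there is no training step: all the work happens at query time, which is why inference is the expensive part.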

## Applications

### Information retrieval

Traditionally, IR has mostly dealt with text data, which can be searched efficiently by other means (an inverted index).

kNN enables IR of any data that we can vectorize.

###### Examples:
- music (there are many tools for extracting features from sound)
- images (just run your favourite CNN and extract activations from some modestly sized layer)
- text again - extract features using Doc2Vec or some RNN of your choice
- actually anything for which you have a DL model that captures relevant structure

A concrete example of music search can be found in the [findkit](https://github.com/lambdaofgod/findkit) library.

### Useful in other ML tasks

- kNN is used as an intermediate step in manifold learning algorithms (picture from megaman documentation) <img src="http://mmp2.github.io/megaman/_images/spectra_Halpha.png" style="float: right;" width="350"/>

- Texture synthesis - for example see [Style-Transfer via Texture-Synthesis](https://arxiv.org/pdf/1609.03057.pdf) (that's a non-CNN based Style Transfer method)

## kNN for search

### suffers from 'curse of dimensionality'

Why care? Most DL model layers have outputs of at most a few hundred dimensions.

### costly inference

That's actually the biggest problem. 

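Brute-force kNN compares the query with every one of the $n$ points in $X$, so a single query costs $O(nd)$ time, which is linear in the size of the dataset.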

## Approximate kNN

Idea: don't do *exact* kNN. Try to retrieve neighbors *with high probability*.
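
One common way to quantify "with high probability" is recall: the fraction of the true $k$ nearest neighbors that the approximate search actually returns. The sketch below assumes this metric; the name `recall_at_k` and the example index lists are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact k nearest neighbors found by the approximate search."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Hypothetical results for one query with k = 4:
exact_ids = [7, 4, 1, 3]   # from exact (brute-force) kNN
approx_ids = [4, 7, 1, 9]  # from some approximate index
print(recall_at_k(approx_ids, exact_ids))
```

In this example the approximate search found three of the four true neighbors. Averaging this value over many queries gives a single number that can be traded off against query time.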

### Approaches
- LSH
- Dimensionality-reduction based methods
- Graph-based
- Product Quantization