# Hash table

**A hash table** is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values.

We want to

- add an element

- delete an element

- check whether an element is present

faster than $O(\log n)$.

A hash table allows you to access an element by its key, much as an array gives access by index. Unlike binary search trees, a hash table neither requires nor maintains an ordering of the keys.

**A hash function** converts a **key** (array of $n$ elements) to an index of $m$ bits. Two or more different keys can map to the same index, so collisions must be resolved.

$0 \leq H < 2^{m}$

A good hash function should be fast to compute and produce few collisions. The resulting indices should be evenly distributed between 0 and $2^{m} - 1$.

## Simple hash function examples:

- $H(k) = k \bmod M$, where $M$ is prime

   A bad choice is $M = 2^p$: the hash then equals the $p$ low-order bits of $k$ and does not depend on its high-order bits.
   
   Decimal analogue: $k \bmod 1000$ keeps only the last three digits of $k$.

- $H(k) = \lfloor M \cdot \{A \cdot k\} \rfloor$, where $\{\cdot\}$ is the fractional part and $\lfloor \cdot \rfloor$ is rounding down

    $0 < A < 1$ is a real constant.

    For example, $A = 0.6180339887...$ (the fractional part of the golden ratio). For good results, $\{A \cdot k\}$ should look highly random.

    This hash function should not be computed with floating-point variables because of the rounding error they introduce during multiplication. Integer arithmetic can be used instead:

    if $M = 2^p$, where $p \leq 32$,

    then take

    $A = \frac{s}{2^{32}} = \frac{2654435769}{2^{32}}$

    It can be shown that

    $H(k) = ((k \cdot s) \bmod 2^{32}) \gg (32-p)$
     
- For a string $s = \{s_0, s_1, \ldots, s_{n-1}\}$:

    $H(s) = (s_0 + s_1 a + s_2 a^2 + \ldots + s_{n-1}a^{n-1}) \bmod M$

    or

    $H(s) = (s_{n-1} + s_{n-2}a + s_{n-3} a^2 + \ldots + s_{0}a^{n-1}) \bmod M$

    All terms $(s_i a^{i}) \bmod M$ should be different.

    It can be shown that $a$ and $M$ should be coprime.

    Horner's method can be used to evaluate the second form efficiently:

    $H(s) = ((\ldots((s_0 a + s_1)a + s_2)a + \ldots + s_{n-2})a + s_{n-1}) \bmod M$
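The three hash functions above can be sketched in Python as follows. This is a minimal illustration, not a production implementation. The function names are made up for this example, and the defaults `a = 31` and `m = 1_000_000_007` in `string_hash` are assumed values (they are coprime), not values taken from these notes.

```python
S = 2654435769  # s from the text: A = s / 2**32 ≈ 0.6180339887


def division_hash(k: int, m: int) -> int:
    """H(k) = k mod M; M is ideally a prime."""
    return k % m


def multiplicative_hash(k: int, p: int) -> int:
    """H(k) = ((k * s) mod 2**32) >> (32 - p) for a table of size M = 2**p."""
    if not 1 <= p <= 32:
        raise ValueError("p must be between 1 and 32")
    return ((k * S) & 0xFFFFFFFF) >> (32 - p)


def string_hash(s: str, a: int = 31, m: int = 1_000_000_007) -> int:
    """Polynomial hash s_0*a^(n-1) + ... + s_(n-1), evaluated by Horner's method."""
    h = 0
    for ch in s:
        # One multiplication and one addition per character,
        # reduced mod m at every step to keep the numbers small.
        h = (h * a + ord(ch)) % m
    return h
```

For instance, `division_hash(1234567, 1000)` keeps only the last three decimal digits, which illustrates why a modulus that is a power of the number base is a poor choice.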
    
## Popular hash functions

Cyclic redundancy check (CRC)

MD2, MD4, MD5, MD6

SHA-1, SHA-2


## Load factor

For a hash table of size $m$ (the number of cells) that contains $n$ elements, the load factor is

$\alpha = \frac{n}{m}$

## Separate chaining

![Title](img/Separate_chaining.png)

Worst-case time (all $n$ keys fall into one chain):

Add element - $O(1)$, or $O(n)$ if we check for duplicates

Delete element - $O(n)$

Get element - $O(n)$

Average time (expected value) for a good hash function:

$T =  O(1 + \frac{n}{m}) = O(1 + \alpha)$
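A minimal sketch of separate chaining, assuming Python's built-in `hash` as the hash function and plain lists as chains. The class and method names are hypothetical, and resizing is left out to keep the example short.

```python
class ChainedHashTable:
    """Hash table with separate chaining: each bucket holds a list of (key, value)."""

    def __init__(self, num_buckets: int = 8):
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        # Checking for duplicates costs time proportional to the chain length.
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def delete(self, key) -> bool:
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return True
        return False

    def load_factor(self) -> float:
        """alpha = n / m."""
        return self.size / len(self.buckets)
```

With a good hash function the chains have expected length $\alpha$, so each loop above runs $O(1 + \alpha)$ steps on average.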

## Open addressing

If cell $A[H(k)]$ is not empty, check other cells until we find an empty one. The order in which the cells are checked (the probe sequence) should depend on the key $k$:

$p(k, 1)$, $p(k, 2)$, ..., $p(k, m-1)$ - cell indices to check

It is important that the probe sequence visits every cell of a hash table of size $m$, so that an empty cell is found whenever one exists.

![Title](img/open_addressing_simple.png)

How to compute the next $p(k, i)$?

- Linear probing - clusters of occupied cells are formed

$p(k, i) = (H(k) + i) \bmod m$


- Quadratic probing

$p(k, i) = (H(k) + c_1 \cdot i + c_2 \cdot i^2) \bmod m$

for example, $c_1 = c_2 = 0.5$

Keys with the same $H(k)$ still follow the same sequence of cells while searching for an empty cell.

- Double hashing

$p(k, i) = (H_1(k) + i \cdot H_2(k)) \bmod m$

Generally, $H_{1}$ and $H_{2}$ are selected from a set of universal hash functions; $H_{1}$ is selected to have the range $\{0, \ldots, m-1\}$ and $H_{2}$ the range $\{1, \ldots, m-1\}$.

$H_2(k)$ and $m$ should be coprime: if $m = 2^t$, then $H_2(k)$ must be odd; if $m$ is prime, any $1 \leq H_2(k) < m$ works.

Example:

$H_1(k) = k \bmod m_1$ and $H_2(k) = 1 + (k \bmod m_2)$
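The three probing schemes can be compared with a short sketch. The function names are made up for this example; the hash values are passed in as plain integers, so no particular $H$, $H_1$ or $H_2$ is assumed.

```python
def linear_probe(h: int, i: int, m: int) -> int:
    """p(k, i) = (H(k) + i) mod m."""
    return (h + i) % m


def quadratic_probe(h: int, i: int, m: int) -> int:
    """p(k, i) = (H(k) + i/2 + i^2/2) mod m, i.e. c1 = c2 = 0.5.

    i/2 + i^2/2 = i*(i+1)/2 is always an integer, so integer division is exact.
    """
    return (h + i * (i + 1) // 2) % m


def double_hash_probe(h1: int, h2: int, i: int, m: int) -> int:
    """p(k, i) = (H1(k) + i * H2(k)) mod m."""
    return (h1 + i * h2) % m


def probe_sequence(probe, m: int):
    """Collect the first m cells visited by a one-argument probe function."""
    return [probe(i) for i in range(m)]
```

For $m = 8$, `probe_sequence(lambda i: quadratic_probe(0, i, 8), 8)` visits the cells 0, 1, 3, 6, 2, 7, 5, 4, so every cell is reached. With double hashing and $m = 8$, an odd step such as $H_2(k) = 3$ also reaches all eight cells, while an even step such as $H_2(k) = 2$ reaches only four of them, which is why $H_2(k)$ and $m$ should be coprime.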

## Interface

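Add element - $O(n)$ in the worst case

Delete element - mark the cell as deleted (not as empty!), so that searches for keys stored further along the probe sequence do not stop early

Get element - $O(n)$ in the worst case

$O(\frac{1}{1-\alpha})$ - on average (expected number of probes for an insertion or an unsuccessful search, assuming uniform hashing)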
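A minimal sketch of this interface, assuming linear probing, Python's built-in `hash`, and a fixed capacity (no resizing). The class name and the `_EMPTY` / `_DELETED` sentinels are hypothetical names chosen for the example.

```python
_EMPTY = object()    # cell was never used
_DELETED = object()  # cell held a key that was later removed


class OpenAddressingTable:
    def __init__(self, capacity: int = 8):
        self.keys = [_EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        m = len(self.keys)
        h = hash(key) % m
        for i in range(m):
            yield (h + i) % m  # linear probing

    def put(self, key, value):
        first_free = None
        for idx in self._probe(key):
            slot = self.keys[idx]
            if slot is _EMPTY:
                if first_free is None:
                    first_free = idx
                break  # the key cannot be stored beyond an empty cell
            if slot is _DELETED:
                if first_free is None:
                    first_free = idx  # remember it, but keep looking for the key
            elif slot == key:
                self.values[idx] = value
                return
        if first_free is None:
            raise RuntimeError("table is full")
        self.keys[first_free] = key
        self.values[first_free] = value

    def get(self, key, default=None):
        for idx in self._probe(key):
            slot = self.keys[idx]
            if slot is _EMPTY:
                return default
            if slot is not _DELETED and slot == key:
                return self.values[idx]
        return default

    def delete(self, key) -> bool:
        for idx in self._probe(key):
            slot = self.keys[idx]
            if slot is _EMPTY:
                return False
            if slot is not _DELETED and slot == key:
                self.keys[idx] = _DELETED  # mark as deleted, not as empty
                self.values[idx] = None
                return True
        return False
```

If `delete` reset the cell to `_EMPTY`, a later `get` for a key stored further along the same probe sequence would stop at that cell and wrongly report the key as missing.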

## Example: k-mer index

![Title](img/kmer_index.png)
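A small sketch of a k-mer index, assuming the common construction in which a hash table maps every substring of length $k$ of a text to the list of positions where it occurs. The function names are hypothetical, and the verification step in `find_occurrences` is one possible way to use the index, not necessarily the one shown in the figure.

```python
from collections import defaultdict


def build_kmer_index(text: str, k: int):
    """Map each k-mer of text to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def find_occurrences(index, text: str, pattern: str, k: int):
    """Look up the first k-mer of pattern, then verify each candidate position.

    Assumes len(pattern) >= k.
    """
    hits = []
    for pos in index.get(pattern[:k], []):
        if text[pos:pos + len(pattern)] == pattern:
            hits.append(pos)
    return hits
```

Usage: with `text = "ACGTACGTGA"` and `k = 3`, the index maps `"ACG"` to positions 0 and 4, and searching for `"ACGTG"` checks only those two candidates instead of scanning the whole text.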

# Bloom filter

A Bloom filter is a probabilistic data structure that uses less memory than a hash table. With a Bloom filter we can check whether an element $a$ is in a set $S$.

Possible answers:
- No
- Probably yes.

False positives are possible; false negatives are not.

In bioinformatics, Bloom filters are primarily used to test for the presence of a k-mer in a sequence or a set of sequences.

![Title](img/bloom.png)

We can combine all $k$ filters with sizes $n_1, n_2, \ldots, n_k$ into one array of $n = n_1 + n_2 + \ldots + n_k$ bits.
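A minimal sketch of a Bloom filter stored as a single array of $n$ bits with $k$ hash functions. Deriving the $k$ bit positions from one SHA-256 digest is an assumption made to keep the example short; the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits      # n in the text
        self.num_hashes = num_hashes  # k in the text
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Assumed scheme: two 64-bit values from one digest,
        # combined as h1 + i * h2 to simulate k hash functions.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a nonzero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # "Probably yes" only if every one of the k bits is set.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

After `bf.add("ACGT")`, the test `"ACGT" in bf` always returns `True`. For an item that was never added, the answer is usually `False` but may be `True` when all of its bits happen to be set by other items.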


https://www.youtube.com/watch?v=9J8pRQ_EInA

The probability of a false positive for $k$ Bloom filters combined into one array of $n$ bits, with $m$ items added, is $p^k$, where $p$ is the fraction of bits that are set to 1: $p = 1-(1-\frac{1}{n})^{mk}$

$(1-\frac{1}{n})^{mk} = e^{\ln((1-\frac{1}{n})^{mk})} = e^{mk\ln(1-\frac{1}{n})}$

Using the Taylor expansion $\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \ldots$

$\ln(1-\frac{1}{n}) \approx -\frac{1}{n}$ for large $n$

$(1-\frac{1}{n})^{mk} \approx e^{-\frac{mk}{n}}$

Minimizing the function $f(k) = (1 - e^{-\frac{mk}{n}})^k$ over $k$ gives the optimal number of hash functions: $k = \frac{n}{m}\ln 2$

When the number of functions is chosen optimally, about 50% of the bits are set and 50% are unset ($p = 0.5$).
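The minimum of $f(k)$ can be checked numerically. The sketch below assumes $n$ is the number of bits and $m$ the number of items, as in the formulas above, and that minimizing $f(k)$ gives $k = \frac{n}{m}\ln 2$. The function names and the values $n = 10000$, $m = 1000$ are chosen only for illustration.

```python
import math


def false_positive_rate(n: int, m: int, k: float) -> float:
    """f(k) = (1 - exp(-m*k/n))**k for n bits, m items, k hash functions."""
    return (1 - math.exp(-m * k / n)) ** k


def optimal_num_hashes(n: int, m: int) -> float:
    """k = (n / m) * ln 2, the minimizer of f(k)."""
    return n / m * math.log(2)


n, m = 10_000, 1_000
k_opt = optimal_num_hashes(n, m)  # about 6.93
# Best integer k found by brute force over a small range.
best_k = min(range(1, 21), key=lambda k: false_positive_rate(n, m, k))
# Fraction of set bits at the optimum.
p_at_opt = 1 - math.exp(-m * k_opt / n)
```

Here `best_k` is 7, the integer closest to `k_opt`, and `p_at_opt` is 0.5 up to floating-point rounding, in agreement with the 50% statement.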




# Sequence Bloom Trees

Task: given $n$ databases $d_1, d_2, \ldots, d_n$, find the set of $d_i$ in which the sequence $q$ is present.

0) Build the tree: insert each dataset $d_i$, represented by its Bloom filter $B(d_i)$, as follows.

Walk from the root to the leaves:

- If the current node $u$ has a single child, insert $d_i$ as $u$'s second child.

- If $u$ has two children, $B(d_i)$ is compared against the Bloom filters $B(left(u))$ and $B(right(u))$ of the left and right children of $u$. The child whose filter is more similar to $B(d_i)$ under the Hamming distance becomes the current node.

- If $u$ has no children, a new union node $v$ is created as a child of $u$'s parent. This new node has two children: $u$ and a new node representing $d_i$.

1) Query: split $q$ into its set of k-mers $K_q$.

2) Pass the k-mers through the SBT starting from the root. At each node $u$, the Bloom filter $B(u)$ is queried for each of the k-mers in $K_q$. If more than $\theta|K_q|$ k-mers are reported to be present in $B(u)$, the search proceeds to all of the children of $u$; otherwise the subtree below $u$ is skipped. Here $\theta$ is a cutoff between 0 and 1 governing the stringency required of the match.
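The build and query steps can be sketched as follows. This is a toy illustration under several assumptions, not the published implementation: Bloom filters are stored as Python integers used as bitsets, bit positions come from SHA-256, the filters of inner nodes on the insertion path are unioned with $B(d_i)$ so that queries keep working, and the cutoff is applied as "at least $\theta|K_q|$" so that $\theta = 1$ means all k-mers. All names (`Node`, `insert`, `query`, `NUM_BITS`, `NUM_HASHES`) are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

NUM_BITS = 256   # assumed filter length, tiny for illustration
NUM_HASHES = 2   # assumed number of hash functions


def kmers(seq: str, k: int):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bit_positions(kmer: str):
    digest = hashlib.sha256(kmer.encode()).digest()
    return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % NUM_BITS
            for i in range(NUM_HASHES)]


def make_filter(kmer_set) -> int:
    bits = 0
    for km in kmer_set:
        for pos in bit_positions(km):
            bits |= 1 << pos
    return bits


def contains(bits: int, kmer: str) -> bool:
    return all((bits >> pos) & 1 for pos in bit_positions(kmer))


@dataclass
class Node:
    bits: int
    name: Optional[str] = None       # set only for leaves (datasets)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def insert(root: Optional[Node], name: str, bits: int) -> Node:
    """Insert dataset `name` with filter `bits`; return the (possibly new) root."""
    leaf = Node(bits, name)
    if root is None:
        return leaf  # a lone dataset is stored as a leaf, so single-child nodes never arise
    parent, u = None, root
    while u.left is not None and u.right is not None:
        u.bits |= bits  # keep the union filter up to date along the path
        parent = u
        closer_left = hamming(bits, u.left.bits) <= hamming(bits, u.right.bits)
        u = u.left if closer_left else u.right
    union = Node(u.bits | bits, None, u, leaf)  # u is a leaf: pair it with the new one
    if parent is None:
        return union
    if parent.left is u:
        parent.left = union
    else:
        parent.right = union
    return root


def query(root: Optional[Node], seq: str, k: int, theta: float):
    """Return names of datasets whose filters report >= theta * |K_q| k-mers."""
    kq = kmers(seq, k)
    hits, stack = [], [root] if root is not None and kq else []
    while stack:
        u = stack.pop()
        present = sum(contains(u.bits, km) for km in kq)
        if present < theta * len(kq):
            continue  # prune the whole subtree below u
        if u.name is not None:
            hits.append(u.name)
        else:
            stack.extend([u.left, u.right])
    return hits
```

Because an inner node stores the union of the filters below it, a dataset that really contains the k-mers of $q$ is never pruned; false positives in the filters can only add extra hits.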

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4804353/

![Title](img/sbt.png)


# Bit-sliced Genomic Signature index

http://dx.doi.org/10.1038/s41587-018-0010-1

![Title](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41587-018-0010-1/MediaObjects/41587_2018_10_Fig2_HTML.png?as=webp)


## REINDEER 

REINDEER indexes sequences and records their abundances across a collection of datasets.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7355249/
    
