# Bioinformatics

This file contains answers to assorted questions about the bioinformatics part of the NST 1B Math Bio course. Where possible I will, over time, give both a short and a long answer to each question.

## How are identical k-mers found and why is this so quick?

_tl;dr: hashing each k-mer lets us store k-mers in a hash table and find exact matches by direct lookup. This replaces a linear scan (O(n) per query) or a binary search over sorted k-mers (O(log n) per query) with a lookup that takes constant time on average._

A k-mer is a substring of length k (where k is a positive integer) taken from a longer nucleic acid or peptide sequence. K-mers are used in bioinformatics and sequence analysis to identify patterns and motifs within a sequence.
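To make the definition concrete, here is a minimal sketch that lists every overlapping k-mer of a sequence. The function name `kmers` and the example sequence are illustrative choices, not part of the course material.

```python
def kmers(sequence: str, k: int) -> list[str]:
    """Return every overlapping substring of length k, left to right."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 k-mers (none if n < k).
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("GATTACA", 3))  # ['GAT', 'ATT', 'TTA', 'TAC', 'ACA']
```

A sequence of length 7 yields 7 - 3 + 1 = 5 overlapping 3-mers.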

Hashing helps to quickly locate data that is stored in a hash table or hash map. A hash function turns a key (in this case, the k-mer) into a fixed-size number, the hash value. The same key always gives the same hash value, although two different keys can occasionally share one (a collision). The hash value is used to jump straight to the place in the table where the k-mer is stored, which is much faster than scanning through every stored k-mer one by one.

A hash map, also known as a hash table, is a data structure that stores key-value pairs. Each key is associated with exactly one value, and the key-value pairs are stored in an array. The key is used to generate a hash value, which is then used to quickly locate the associated value. Hash maps provide a fast way to look up values by key.
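As a rough illustration of how this applies to k-mers, the sketch below uses Python's built-in `dict` (a hash map) to index the k-mers of one sequence and then looks up each k-mer of a second sequence. The function names `build_kmer_index` and `shared_kmers` and the example sequences are assumptions made for this example.

```python
from collections import defaultdict


def build_kmer_index(sequence: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return dict(index)


def shared_kmers(query: str, target: str, k: int) -> list[tuple[int, int]]:
    """Return (query_pos, target_pos) for every identical k-mer pair."""
    index = build_kmer_index(target, k)
    matches = []
    for q in range(len(query) - k + 1):
        # One hash lookup per query k-mer; no scan over the target.
        for t in index.get(query[q:q + k], []):
            matches.append((q, t))
    return matches


print(shared_kmers("TTACG", "GATTACA", 3))  # [(0, 2), (1, 3)]
```

Each query k-mer costs one dictionary lookup, so the target sequence is only traversed once, when the index is built.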

When searching a hash table, the hash of the key tells us the location (the hash bucket) in which the key-value pair is stored. The hash table can be pictured as an array of buckets indexed by hash value: the hash acts as a set of directions to the right bucket. This means we don't have to compare the key string against every stored key to find the correct one. Instead we follow the directions encoded by the hash and only compare against the few keys that share that bucket.
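The bucket idea can be sketched with a toy hash table. This is not how Python's `dict` is implemented; the class `ToyHashTable`, the base-4 `kmer_hash` function, and the choice of 8 buckets are all assumptions made to keep the example small and deterministic.

```python
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_hash(kmer: str) -> int:
    """Read the k-mer as a base-4 number (a toy, deterministic hash)."""
    value = 0
    for base in kmer:
        value = value * 4 + ENCODING[base]
    return value


class ToyHashTable:
    def __init__(self, n_buckets: int = 8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key: str) -> list:
        # The hash gives the bucket index directly; no scan over all keys.
        return self.buckets[kmer_hash(key) % len(self.buckets)]

    def put(self, key: str, value) -> None:
        bucket = self._bucket(key)
        for i, (existing, _) in enumerate(bucket):
            if existing == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key: str):
        # Only keys that landed in the same bucket are compared.
        for existing, value in self._bucket(key):
            if existing == key:
                return value
        raise KeyError(key)


table = ToyHashTable()
for position, kmer in enumerate(["GAT", "ATT", "TTA", "TAC", "ACA"]):
    table.put(kmer, position)

print(kmer_hash("TAC") % 8)  # 1: the bucket that holds TAC
print(table.get("TAC"))      # 3: the stored position
```

With these settings `TTA` and `ACA` both fall into bucket 4, so looking up either one compares at most two keys rather than all five.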

A better and more in-depth description of hash maps and why they make searching quick is provided by [this Medium post](https://medium.com/basecs/taking-hash-tables-off-the-shelf-139cbf4752f0).

ToDo: provide visual explanation based on k-mers and hashing.



## How to calculate time complexity of a function?

_One can calculate the time complexity of a function by following 3 steps._
1. _Identify the operations performed in the function._
2. _Assign a time complexity for each of the operations._
3. _Calculate the total time complexity of the function by adding the time complexities of the individual operations._

Time complexity is a measure of the amount of time it takes for a program or algorithm to complete its execution. It is typically expressed using Big O notation, which measures the time complexity of an algorithm relative to the size of the input.

One can assign a time complexity to an operation by analysing the operation and its associated data structure to determine the worst-case time complexity. The complexity of each operation depends on the type of data structure (this is why hash tables are important) and the number of elements it contains. For example, searching an unsorted array of size n has time complexity O(n). If there are multiple operations, combine their time complexities to obtain the total time complexity of the function; the largest term dominates.
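The three steps can be tried out on a small example that counts basic operations instead of timing them. The functions below and their names are illustrative assumptions; the hash-based count treats each set insertion and lookup as one step, which is the average-case behaviour.

```python
def contains_by_scan(items: list[str], target: str) -> tuple[bool, int]:
    """Linear search: return (found, number of comparisons made)."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons


def shared_by_scan(query: list[str], target: list[str]) -> tuple[list[str], int]:
    """One O(m) scan per query k-mer, so O(n * m) in total."""
    total, shared = 0, []
    for kmer in query:
        found, steps = contains_by_scan(target, kmer)
        total += steps
        if found:
            shared.append(kmer)
    return shared, total


def shared_by_hash(query: list[str], target: list[str]) -> tuple[list[str], int]:
    """m insertions to build the set plus n lookups: O(n + m) on average."""
    lookup = set(target)
    shared = [kmer for kmer in query if kmer in lookup]
    return shared, len(target) + len(query)


target = ["GAT", "ATT", "TTA", "TAC", "ACA"]
query = ["TTA", "TAC", "ACG"]
print(shared_by_scan(query, target))  # (['TTA', 'TAC'], 12)
print(shared_by_hash(query, target))  # (['TTA', 'TAC'], 8)
```

Both functions return the same k-mers, but the scan needs 3 + 4 + 5 = 12 comparisons while the hash-based version needs 5 + 3 = 8 steps, and the gap widens as the lists grow.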

ToDo: provide an example using CLUSTAL.

### Assessing time complexity of CLUSTAL Omega
The CLUSTAL Omega algorithm (from [Wikipedia](https://en.wikipedia.org/wiki/Clustal)) consists of the following steps:
1. Pairwise alignment using k-tuple method.
2. Sequence clustering using mBed method.
3. Sequence clustering using k-means.
4. Guide tree using UPGMA.
5. Progressive alignment using HHAlign.

#### Pairwise alignment using k-tuple method.
BLAST or FASTA can be used to do this. If we consider BLAST, then following this [Stack Overflow answer](https://stackoverflow.com/questions/59454358/what-is-the-time-complexity-of-the-given-blast) we can break the BLAST algorithm into these steps:
- O(n): Creating a list of words of length W from the query sequence.
- O(n): Searching for the words of length W in the database.
- O(n^2): Elongation of the hit sequences, i.e. those found, and assignment of a score. This is O(n*m), where m is the number of letters in the query word, but as an upper bound we take O(n^2).
- O(n): The resulting sequences are reported as local alignments.

Here n is the number of elements in the sequence. The O(n^2) term dominates, so the algorithm as a whole can be approximated as O(n^2). If we repeat this analysis for all of the steps in the CLUSTAL Omega algorithm, we find that no step exceeds O(n^2). Thus the time complexity of CLUSTAL Omega can be approximated as O(n^2).
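To see where the individual costs come from, here is a heavily simplified seed-and-extend sketch in the spirit of the steps listed above. It is not BLAST's actual implementation: it uses exact word matches only, extends without gaps while letters agree, and scores a hit by its length. All function names and the example sequences are assumptions.

```python
from collections import defaultdict


def word_index(sequence: str, w: int) -> dict[str, list[int]]:
    """Hash every word of length w in the subject: one pass over it."""
    index = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        index[sequence[i:i + w]].append(i)
    return index


def extend_hit(query: str, subject: str, q: int, s: int, w: int) -> tuple[int, int, int]:
    """Extend an exact word match left and right while letters agree."""
    left = 0
    while (q - left > 0 and s - left > 0
           and query[q - left - 1] == subject[s - left - 1]):
        left += 1
    right = w
    while (q + right < len(query) and s + right < len(subject)
           and query[q + right] == subject[s + right]):
        right += 1
    # (start in query, start in subject, length of the matching segment)
    return q - left, s - left, left + right


def seed_and_extend(query: str, subject: str, w: int = 3) -> list[tuple[int, int, int]]:
    index = word_index(subject, w)
    hits = set()
    for q in range(len(query) - w + 1):          # one word per query position
        for s in index.get(query[q:q + w], []):  # hash lookup per word
            # Each extension can walk across most of the sequence, which is
            # where the quadratic upper bound comes from.
            hits.add(extend_hit(query, subject, q, s, w))
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))


print(seed_and_extend("CATTACG", "GATTACA"))  # [(1, 1, 5)]
```

The three seed words `ATT`, `TTA`, and `TAC` all extend to the same segment `ATTAC`, so a single hit of length 5 is reported.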



