# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt, nltk

np.set_printoptions(linewidth=10000, precision=4, edgeitems=20, suppress=True) 
pd.set_option('display.max_columns', 100, 'display.max_colwidth', 100, 'display.precision', 2, 'display.max_rows', 8)

## Hamming Distance

[Hamming distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html) can be considered the opposite of similarity. For two sequences of equal length, it counts the positions at which the corresponding elements differ. Like Jaccard similarity, Hamming distance works on elements of any data type. Like correlation, it requires the sequences to be the same length. If the lengths differ, the `HammingDist()` function defined below returns infinity, [`np.inf`](https://numpy.org/devdocs/reference/constants.html#numpy.inf), which is a special floating-point value.

The code below creates a `HammingDist()` function that returns the count of element-wise mismatches, or infinity if the arguments differ in length. The function `HammingDemo()` computes the Hamming distance and prints the formatted result.

In [None]:
# Hamming distance returns count of element-wise mismatches for two equal-length strings
HammingDist = lambda s1='ab', s2='ad': sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2)) if len(s1)==len(s2) else np.inf
HammingDemo = lambda s1='ab', s2='ad': print(f'HammingDist({s1}, {s2}) = {HammingDist(s1, s2)}')
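
The SciPy function linked above reports Hamming distance as the fraction of mismatched positions rather than the raw count. The sketch below shows how the two quantities relate. The names `hamming_count` and `hamming_fraction` are hypothetical helpers, not part of this notebook or of SciPy, and returning `math.inf` for unequal lengths mirrors this notebook's convention only.

```python
import math

def hamming_count(s1, s2):
    """Count positions where two equal-length sequences differ (inf if lengths differ)."""
    if len(s1) != len(s2):
        return math.inf
    return sum(a != b for a, b in zip(s1, s2))

def hamming_fraction(s1, s2):
    """Share of mismatched positions: the count divided by the sequence length."""
    if len(s1) != len(s2):
        return math.inf
    return hamming_count(s1, s2) / len(s1) if len(s1) else 0.0

print(hamming_count('ACGT', 'ACCT'))     # 1 (only G vs C differs)
print(hamming_fraction('ACGT', 'ACCT'))  # 0.25 (1 mismatch out of 4 positions)
```

The fraction is convenient when comparing results across sequence pairs of different lengths, while the count is easier to read for a single pair.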

One of the examples below computes the Hamming distance between the short nucleotide strings `'ACGT'` and `'ACCT'`, which differ only at the third position. Such strings model genetic sequences, since DNA is built from four nucleotides: A, C, G, and T.

You can use Hamming distance on DNA sequences that are short or millions of nucleotides long. Computational speed makes Hamming distance a popular metric in genetics and related fields.

In [None]:
HammingDemo('cat', 'dog')
HammingDemo('cat', 'dogs')
HammingDemo('ACGT', 'ACCT')
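
To see where each distance comes from, it can help to list the mismatched positions themselves. This is a minimal sketch, assuming a hypothetical helper `mismatch_positions` that is not part of this notebook. Unlike `HammingDist()`, it raises an error for unequal lengths instead of returning infinity.

```python
def mismatch_positions(s1, s2):
    """Return the 0-based indices at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError('sequences must have equal length')
    return [i for i, (a, b) in enumerate(zip(s1, s2)) if a != b]

print(mismatch_positions('cat', 'dog'))    # [0, 1, 2]: every position differs
print(mismatch_positions('ACGT', 'ACCT'))  # [2]: only G vs C differs
```

The length of the returned list equals the Hamming distance of the pair.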

Next, create a function `GenSeq()`, which builds a DNA sample sequence from randomly drawn nucleotides. This DNA is unlikely to relate to any living creature, but works well for our examples below. The `seed` ensures reproducibility of the sequence.

In [None]:
def GenSeq(nLen=5, seed=0, LsElements=list('ACGT')):
    if isinstance(seed, int):        # seed only when an integer is provided
        np.random.seed(abs(seed))    # NumPy seeds must be non-negative, so the sign is dropped
    return ''.join(np.random.choice(LsElements, nLen, replace=True))

GenDNA = lambda nLen=5, seed=0: GenSeq(nLen, seed, list('ACGT'))
GenDNA(500)
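
The effect of the seed can be checked directly. The sketch below uses Python's standard `random` module instead of `np.random`, so its sequences will not match those produced by `GenSeq()`. The name `gen_seq` is a hypothetical stand-in used only for illustration.

```python
import random

def gen_seq(n_len=5, seed=0, elements='ACGT'):
    """Build a random sequence with a private generator, leaving global state untouched."""
    rng = random.Random(seed)                       # same seed -> same sequence
    return ''.join(rng.choices(elements, k=n_len))  # draw with replacement

a = gen_seq(30, seed=0)
b = gen_seq(30, seed=0)
c = gen_seq(30, seed=1)

print(a == b)   # True: identical seeds reproduce the sequence
print(a == c)   # a different seed will typically give a different sequence
```

Using a private generator object rather than seeding a global one is a design choice that keeps one function's seeding from affecting random draws elsewhere.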

Run the cell below to see Hamming distance in action. Given two sequences, it prints the number of mismatches.

In [None]:
sDNA_X = GenDNA(30, seed=0)     # query DNA, which we need to identify
sDNA_1 = GenDNA(30, seed=1)     # some known DNA from a Bank
HammingDemo(sDNA_X, sDNA_1) 

Now, consider a database of 10,000 viral DNA samples, expressed as sequences of the nucleotides A, C, G, and T. The goal is to find the viral DNA sample that is most similar to a query sample `sDNA_X`, assuming all DNA subsamples are extracted from the same coordinates in their (much longer) DNA sequences.

The cell below applies Hamming distance by comparing each DNA sample with `sDNA_X`. After identifying the most similar virus (in this case, ID 4142), you can then decide whether the query DNA and viral DNA are sufficiently similar to be considered a match. Note that minor mutations sometimes occur in DNA sequences.

In [None]:
df = pd.DataFrame([GenDNA(30, seed=i+1) for i in range(10000)], columns=['Viral_DNA'])
HammingX = lambda sDNA: HammingDist(sDNA, sDNA_X) 

df['D2X'] = df['Viral_DNA'].apply(HammingX)   # Hamming distance from each sample to the query sDNA_X
df.sort_values('D2X')
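
The final decision step, whether the closest sample is close enough to count as a match, can be sketched in plain Python. The toy bank, the helper names `hamming` and `closest_match`, and the `max_dist` threshold are all assumptions made for illustration. The notebook does not prescribe a threshold.

```python
def hamming(s1, s2):
    """Count element-wise mismatches; infinity if the lengths differ."""
    return sum(a != b for a, b in zip(s1, s2)) if len(s1) == len(s2) else float('inf')

def closest_match(query, bank, max_dist):
    """Return (index, distance, is_match) for the bank entry nearest to the query."""
    dists = [hamming(seq, query) for seq in bank]
    best = min(range(len(bank)), key=dists.__getitem__)   # index of the smallest distance
    return best, dists[best], dists[best] <= max_dist

bank = ['TTTTTT', 'ACGGTT', 'ACGTAC']   # toy bank, made up for illustration
query = 'ACGTAT'

print(closest_match(query, bank, max_dist=1))   # (2, 1, True)
print(closest_match(query, bank, max_dist=0))   # (2, 1, False)
```

A threshold above zero tolerates a few point mutations, at the cost of a higher chance of false matches.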

Another popular application of Hamming distance is binary code comparison, where sequences of the digits 0 and 1 represent software or a document. A computer virus is likewise a sequence of 0s and 1s. Many antivirus programs scan binary code for similarity to known computer virus sequences. If a matching subsequence is found, it can be treated: either disabled by manipulating its bits or cut out.
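
When binary codes are held as integers rather than strings, the same mismatch count can be obtained with a bitwise XOR. This is a minimal sketch, assuming a hypothetical helper `hamming_bits` and assuming both codes have the same bit width.

```python
def hamming_bits(x, y):
    """Hamming distance between two non-negative integers viewed as bit strings."""
    return bin(x ^ y).count('1')    # XOR leaves a 1 bit wherever x and y differ

a = int('10110', 2)
b = int('10011', 2)
print(hamming_bits(a, b))   # 2 (the third and fifth bits differ)
```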

The code below creates a function `GenBinaryCode()` that generates a random sequence of 0s and 1s.

In [None]:
GenBinaryCode = lambda nLen=5, seed=0: GenSeq(nLen, seed, ['0', '1'])
GenBinaryCode(500)
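
Scanning a longer binary code for a known signature can be sketched as a sliding window, with the Hamming distance computed at each offset. The function name `scan_for_signature`, the sample strings, and the `max_dist` tolerance are assumptions for illustration. Real antivirus scanners are considerably more involved.

```python
def scan_for_signature(code, signature, max_dist=0):
    """Yield (offset, distance) for every window of `code` within max_dist of `signature`."""
    n = len(signature)
    for start in range(len(code) - n + 1):
        window = code[start:start + n]
        dist = sum(a != b for a, b in zip(window, signature))
        if dist <= max_dist:
            yield start, dist

code = '0010110100'     # made-up binary code
signature = '1011'      # made-up virus signature

print(list(scan_for_signature(code, signature)))               # [(2, 0)]: exact match at offset 2
print(list(scan_for_signature(code, signature, max_dist=1)))   # [(2, 0), (5, 1)]
```

Setting `max_dist` above zero also reports near matches, which corresponds to looking for slightly altered copies of the signature.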

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now you will practice some of these techniques. You will apply the Hamming distance discussed above to the `dfBin` dataframe, which represents a bank of 1000 simulated viral binary code samples.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.


In [None]:
dfBin = pd.DataFrame([GenBinaryCode(100, seed=i) for i in range(1000)], columns=['virus'])  # database of virus signatures
dfBin

## Task 1

Compute the Hamming distance from each virus code in `dfBin` to the virus code in row index 0 (i.e., the query virus). Sort the virus codes in ascending order of their distance to the query virus.

<b>Hint:</b> Use the code developed above with slight modifications. 

In [None]:
# check solution here


<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre>
dfBin['Dist'] = dfBin['virus'].apply(lambda seq: HammingDist(seq, dfBin['virus'][0]))   # Hamming distance to the query virus in row 0
dfBin.sort_values('Dist')
</pre>
</details> 
</font>
<hr>