# AMAT 502: Modern Computing for Mathematicians
## Lecture 6 - More Data Types and the Edit Distance
### University at Albany SUNY



# Table of Contents

- Collection Data Types (*Skim this and refer later!*)
    - Lists
    - Tuples
    - Sets
    - Dictionaries
- **Edit Distances between Strings**

# Collection Data Types

- **List** is a collection which is ordered and changeable (**mutable**) and allows duplicate members.
- **Tuple** is a collection which is ordered and unchangeable. Allows duplicate members.
- **Set** is a collection which is unordered and unindexed. No duplicate members.
- **Dictionary** is a collection of key-value pairs which is changeable and indexed by key. No duplicate keys. Since Python 3.7 it also preserves insertion order.

## Lists

In Python, lists are defined using square brackets `[ ]`

Last time we introduced lists, which come with additional operations such as...

* append( ): adds an item to the end of the list
* insert( ): adds an item at a given index
* reverse( ): Reverses the order of the list
* sort( ): Sorts the list

In [1]:
a = ["statistics","algebra", "topology", "statistics"]
a.append("analysis")
print(a)

a.insert(1,"complex analysis")
print(a)

['statistics', 'algebra', 'topology', 'statistics', 'analysis']
['statistics', 'complex analysis', 'algebra', 'topology', 'statistics', 'analysis']
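
The cell above only exercises `append` and `insert`. Here is a minimal sketch of `reverse` and `sort` as well, using a separate, hypothetical list called `courses` so that the list `a` is left alone. Both methods change the list in place and return `None`, whereas the built-in `sorted` builds a new list.

```python
# `courses` is a made-up list used only for this illustration
courses = ["statistics", "algebra", "topology", "analysis"]

courses.reverse()   # reverses in place
print(courses)      # ['analysis', 'topology', 'algebra', 'statistics']

courses.sort()      # sorts in place (alphabetical order for strings)
print(courses)      # ['algebra', 'analysis', 'statistics', 'topology']

# sorted() returns a new list and leaves `courses` unchanged
by_length = sorted(courses, key=len)
print(by_length)    # ['algebra', 'analysis', 'topology', 'statistics']
```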


### More List Operations

* remove( ): removes an item of a specified value
* pop( ): removes (and returns) the item at a given index, or the last item if no index is given

In [2]:
print(a)

a.remove("statistics")
print(a)
#NOTICE: remove only removes the first instance of the element you want to remove
a.pop(2)
print(a)

['statistics', 'complex analysis', 'algebra', 'topology', 'statistics', 'analysis']
['complex analysis', 'algebra', 'topology', 'statistics', 'analysis']
['complex analysis', 'algebra', 'statistics', 'analysis']


## Lists of Lists = Multidimensional Lists

We can also consider lists whose entries are lists. This gives us extra dimensions to navigate through.

In [1]:
a = [[1,2],[3,4],[5,6]]

print(a[0])

print(a[0][1])

a[0][1] = 3

print(a)

[1, 2]
2
[[1, 3], [3, 4], [5, 6]]


## Tuples

Tuples are defined using round brackets `( )`. 

Tuples are 
- ordered, and
- immutable.

So for example, the method `.append( )` would not apply to a tuple. 

You can only access elements in a tuple as follows:

In [3]:
t = (2,3,4)
print(t[2])
#t[2]=5 # this will generate an error.
#t.append(2) # as will this
print(t)

4
(2, 3, 4)
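
To see the immutability concretely, here is a small sketch that catches the error instead of commenting the line out. The name `u` is introduced only for this illustration. "Changing" a tuple really means building a new tuple out of slices of the old one.

```python
t = (2, 3, 4)

try:
    t[2] = 5                      # tuples do not support item assignment
except TypeError as err:
    print("TypeError:", err)

# Build a new tuple instead of modifying the old one
u = t[:2] + (5,)
print(u)                          # (2, 3, 5)
print(t)                          # (2, 3, 4), the original is untouched
```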


In [9]:
#s=(1,) # Need the trailing comma to make a one-element tuple!
s=(1) # Without the comma this is just the int 1, so the concatenation below fails.
r= [1]
print(s)
print(r)
#t = t[0:2] + s + t[2:] # TypeError when s is an int; works when s=(1,)
#t
type(s)

1
[1]


int

## Sets

In Python sets are written with curly brackets `{ }`. 

A set is a collection which is... 
- unordered, 
- unindexed, and 
- changeable/mutable. 

Sets offer roughly the opposite functionality of a tuple: we can `add` and `remove` elements from a set using
- `s.add()`
- `s.remove()`

However we cannot gain access to any given element since there is no indexing of the elements of the set.

We also have
- `s.pop()` which removes (and returns) an arbitrary element, since a set has no true "first" entry

In [10]:
s = {"algebra", "topology", "analysis"}
print(s) # Explain why this prints differently than the order above!

s.add("statistics") # Can you explain the behavior of the add operation? Why does it print in the order shown below?
print(s)

s.pop()
print(s)

# try print(s[0])
#s[0]

{'topology', 'algebra', 'analysis'}
{'topology', 'algebra', 'statistics', 'analysis'}
{'algebra', 'statistics', 'analysis'}


### Sets in `for` Loops

If we want to include a set in a loop, we need to access the elements not by their index, but just by the elements themselves. For example:

In [11]:
s = {"algebra", "topology", "analysis"}
for x in s:
    print(x)

topology
algebra
analysis
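
The order in which a set is traversed is not something we control. If a predictable order matters, one option (a sketch, not the only approach) is to loop over `sorted(s)`, which returns a list of the elements in sorted order.

```python
s = {"algebra", "topology", "analysis"}

# sorted() returns a list, so the loop order is alphabetical every time
for x in sorted(s):
    print(x)
```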


### More Set Operations

* difference( ): returns a set containing the difference between two or more sets

* intersection( ): returns a set, that is the intersection of two other sets

* isdisjoint( ): returns `True` if two sets have no elements in common, and `False` otherwise

* issubset( ): returns whether another set contains this set or not

* issuperset( ): returns whether this set contains another set or not

* symmetric_difference( ): returns a set with the symmetric differences of two sets

* union( ): returns a set containing the union of sets
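
The operations listed above can be tried out on two small sets. The sets `A` and `B` below are hypothetical examples chosen for illustration.

```python
A = {1, 2, 3, 4}
B = {3, 4, 5}

print(A.difference(B))             # elements of A that are not in B
print(A.intersection(B))           # elements in both A and B
print(A.union(B))                  # elements in A or B
print(A.symmetric_difference(B))   # elements in exactly one of A and B

print(A.isdisjoint(B))             # False, because they share 3 and 4
print({3, 4}.issubset(A))          # True
print(A.issuperset({3, 4}))        # True
```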


## Dictionaries

In Python dictionaries are defined with curly brackets and colons, i.e. `d= {keyA: valueA, keyB: valueB,... }`.

Dictionaries are very similar to sets, especially in how they are defined and how we navigate them.
A dictionary is a collection which is...
- insertion-ordered (since Python 3.7; unordered in older versions),
- changeable, and
- indexed by key.

The way indexing works for dictionaries is that you provide the `key` as an index, i.e. `d["keyA"]` would return `valueA`.

In [13]:
myCar = {
  "make": "Ford",
  "model": "Mustang",
  "year": 1964
}
print(myCar)
print(myCar["make"])

{'make': 'Ford', 'model': 'Mustang', 'year': 1964}
Ford


### Dictionaries: Keys and Values

Navigating a dictionary is much like navigating a list, except that we use a *key* rather than a position to access a *value*. The following methods access the keys and values of a dictionary.

* `.keys( )`: returns the keys of the dictionary

* `.values( )`: returns values of dictionary

In [14]:
print(myCar.keys())
print(myCar.values())

dict_keys(['make', 'model', 'year'])
dict_values(['Ford', 'Mustang', 1964])
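
Here is a short sketch of a few other common dictionary operations. It uses a separate dictionary named `car`, and the `"color"` and `"owner"` keys are made up for illustration.

```python
car = {"make": "Ford", "model": "Mustang", "year": 1964}

car["color"] = "red"   # assigning to a new key adds a key-value pair
car["year"] = 1965     # assigning to an existing key overwrites its value

# .items() yields (key, value) pairs, in insertion order
for key, value in car.items():
    print(key, "->", value)

print("make" in car)     # True: `in` tests membership among the keys
print(car.get("owner"))  # None: .get avoids a KeyError for a missing key
```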


# And Now for Something Different...

![Bats](bat-crop.png)

# Edit Distance (Warmup Problem)

How different are the strings `s1` and `s2` for... 

- `s1='cats'` and `s2='dogs'`?
- `s3='spam'` and `s4='spa'`?
- `s5='aloud'` and `s6='loud'`?
- `s7='alien'` and `s8='sales'`?

# Hamming Distance

One way of quantifying the difference between strings is to compare them character by character and add up the number of times they disagree. This is called the **Hamming distance**. In our example we would have that
$$d_{\text{Ham}}(s_1,s_2)=3$$

In [6]:
def Hamming_Distance(s1,s2):
    """Assumes s1 and s2 are strings of the same length and returns their Hamming Distance"""
    counter = 0
    for i in range(len(s1)):
        if s1[i]!=s2[i]:
            counter = counter +1
    return counter

s1='cats'
s2='dogs'
s3='spam'
s4='spa'
s5='aloud'
s6='loud'
s7='alien'
s8='sales'

#Hamming_Distance(s1,s2)
#Hamming_Distance(s3,s4) #Why does this throw an error? How would you fix it?
Hamming_Distance(s7,s8)

4
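
One possible answer to the question in the comment above: `Hamming_Distance('spam','spa')` fails because the loop asks for `s2[3]`, which does not exist. The sketch below is one way to guard against this, not the only one. The name `hamming_distance_checked` is introduced here for illustration; it refuses strings of unequal length up front and uses `zip` to pair up characters.

```python
def hamming_distance_checked(s1, s2):
    """Return the Hamming distance; raise ValueError if the lengths differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    # zip pairs characters by position; each True counts as 1 in the sum
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance_checked('cats', 'dogs'))    # 3
print(hamming_distance_checked('alien', 'sales'))  # 4

try:
    hamming_distance_checked('spam', 'spa')
except ValueError as err:
    print("ValueError:", err)
```

Note that the original function would silently return an answer for `Hamming_Distance('spa','spam')`, since it only loops over the length of the first string, so an explicit length check catches both orders.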

# Why Should We Care?

Traditionally the Hamming distance is formulated for strings consisting of 0's and 1's, so-called **binary strings**. This is because Richard Hamming, for whom the distance is named, was interested in the theory of [error correcting codes](https://en.wikipedia.org/wiki/Error_correction_code), which is a huge part of our digital existence. These are methods to minimize distortions of data that undergo transmission via a noisy channel (see Claude Shannon's *The Mathematical Theory of Communication* for more).

For example the following image of the Mona Lisa was corrupted by transmission through the Earth's atmosphere and was subsequently corrected using the Reed-Solomon method, which is commonly used in CDs and DVDs.

![Mona Lisa Corrupted](mona-lisa-reed-solomon.jpg)
*Figure Caption: To clean up transmission errors introduced by Earth’s atmosphere (left), Goddard scientists applied Reed–Solomon error correction (right), which is commonly used in CDs and DVDs. Typical errors include missing pixels (white) and false signals (black). The white stripe indicates a brief period when transmission was paused.*

## Coronaviruses

Of course, since 2020, we know about **SARS-CoV-2**, which is the virus that causes COVID-19 (CoronaVirusDisease from 2019), but there are plenty of other coronaviruses.

Remarkably, the RNA transcriptase of coronaviruses has its own form of error correcting code!

![Coronavirus Article](coronavirus.png)
*From [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3127101/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3127101/)*

## Cancer

The kinds of errors in transcription that happen when a virus replicates itself are also illustrative of the main cause of cancer, which is **mutation**. Things that cause mutations are called *mutagens*.

![Types of DNA Mutations](DNA-mutations.jpg)
*Taken from a Frontiers For Young Minds Article: [Ways You Can Protect Your Genes From Mutations With a Healthy Lifestyle](https://kids.frontiersin.org/article/10.3389/frym.2019.00046)*

## The Origin of Life

Mutations that occur during replication of DNA are also what drive evolution. One might characterize evolution of novel types of coronaviruses as an example of [microevolution](https://en.wikipedia.org/wiki/Microevolution), but it is also well accepted that mutations are the driver of [macroevolution](https://en.wikipedia.org/wiki/Macroevolution), which is what accounts for the origins of species.

![Haeckel's Tree of Life](tree-of-life-512.png)
*From [https://commons.wikimedia.org/wiki/File:Haeckel_arbol_bn.png](https://commons.wikimedia.org/wiki/File:Haeckel_arbol_bn.png)*

## The Origin of Vertebrates

![Haeckel's "Age of Man"](age-of-man.jpg)

*From [https://en.wikipedia.org/wiki/File:Age-of-Man-wiki.jpg](https://en.wikipedia.org/wiki/File:Age-of-Man-wiki.jpg)*

## The Origin of Hominids

![Haeckel's Pedigree of Man](pedigree-of-man.png)
*Cropped from [https://commons.wikimedia.org/wiki/File:Tree_of_life_by_Haeckel.jpg](https://commons.wikimedia.org/wiki/File:Tree_of_life_by_Haeckel.jpg)*

## Taxonomic Justifications

There were historically lots of reasons to believe gorillas, chimps and humans were related by a common ancestor, but when comparing their differences it was hard to determine which species diverged when.

![Gorilla Chimp and Human Comparisons](gorilla-chimp-human.png)
*Slide taken from [https://www.slideshare.net/mpallen/bio263-who-is-our-closest-relative](https://www.slideshare.net/mpallen/bio263-who-is-our-closest-relative)*

## Cladistics Says Differently

In the middle of the 20th century people were unlocking other methods for determining when humans, chimps and gorillas separated in the tree of life. Instead of using phenotypic expressions to compare organisms, people like [Vincent Sarich](https://en.wikipedia.org/wiki/Vincent_Sarich) and Allan Wilson used a method known as [molecular clocking](https://en.wikipedia.org/wiki/Molecular_clock) to determine evolutionary relationships. What we now know, due to efforts by Sarich and others, is that:
- Humans and chimps likely shared a common ancestor **4-6 million years ago**
- Gorillas separated from the human-chimp group **2 million years before that**

![Sequence Based Comparisons](sequence-tree.png)
*Slide taken from [https://www.slideshare.net/mpallen/bio263-who-is-our-closest-relative](https://www.slideshare.net/mpallen/bio263-who-is-our-closest-relative)*

# A More Flexible Distance

We already saw how there are different kinds of mutations, so how can we reliably compare strings using something other than the Hamming Distance, which just measures the number of substitutions for going between strings?

In particular, we want to define a distance between strings that uses the three operations:
- **Substitution:** Just change one character in a string.
    - Easiest way to go from `'maps'` to `'mops'`
- **Insertion:** Insert a character in a given location.
    - Easiest way to go from `'loud'` to `'aloud'`
- **Deletion:** Just delete a character at a given location.
    - Easiest way to go from `'spam'` to `'spa'`
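
Since Python strings are immutable, each of these operations produces a new string. A minimal sketch using slicing follows; the helper names `substitute`, `insert` and `delete` are hypothetical and introduced only for this illustration.

```python
def substitute(s, i, c):
    """Return s with the character at index i replaced by c."""
    return s[:i] + c + s[i+1:]

def insert(s, i, c):
    """Return s with the character c inserted before index i."""
    return s[:i] + c + s[i:]

def delete(s, i):
    """Return s with the character at index i removed."""
    return s[:i] + s[i+1:]

print(substitute('maps', 1, 'o'))  # mops
print(insert('loud', 0, 'a'))      # aloud
print(delete('spam', 3))           # spa
```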

## An Example Revisited: `'alien'` versus `'sales'`

The Hamming Distance between `s7='alien'` and `s8='sales'` is...

In [18]:
Hamming_Distance(s7,s8)

4

### Counting Mutations between `'alien'` and `'sales'`

We could...
1. INSERT an `'s'` in the beginning of `'alien'` to obtain...
    - `'salien'`
2. DELETE an `'i'` from `'salien'` to obtain...
    - `'salen'`
3. SUBSTITUTE an `'s'` for the final `'n'` in `'salen'` to obtain...
    - `'sales'`
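
The three steps above can be replayed with string slicing. This is only a sketch that confirms three edits suffice; it does not show that fewer edits are impossible. The variable names `word` and `steps` are introduced for illustration.

```python
word = 'alien'
steps = [word]

word = 's' + word              # 1. insert 's' at the beginning
steps.append(word)

word = word[:3] + word[4:]     # 2. delete the 'i' (index 3 of 'salien')
steps.append(word)

word = word[:-1] + 's'         # 3. replace the final 'n' with 's'
steps.append(word)

print(steps)                   # ['alien', 'salien', 'salen', 'sales']
```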

# Edit Distance(s)

The distance on strings that uses the above three types of mutations is called the [**Levenshtein Distance**](https://en.wikipedia.org/wiki/Levenshtein_distance) and it is one of several types of [**edit distances**](https://en.wikipedia.org/wiki/Edit_distance), which include the Hamming Distance. However, in practice I've mostly heard people refer to the Levenshtein distance as **the edit distance**.

Take a look at the following definition for computing the minimum edit distance between strings
- $a$ of length $m$
- $b$ of length $n$

where $\text{lev}_{a,b}(i,j)$ denotes the edit distance between the first $i$ characters of $a$ and the first $j$ characters of $b$.

You'll see straight away that recursion is involved!

![Levenshtein Table](leven-table.png)
*From [https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0](https://medium.com/@ethannam/understanding-the-levenshtein-distance-equation-for-beginners-c4285a5604f0)*
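
In case the image is unavailable, here is the recurrence as it is usually written. This is a transcription of the standard definition and is assumed to match the figure. The symbol $1_{(a_i \neq b_j)}$ is an indicator that equals $1$ when the $i$-th character of $a$ differs from the $j$-th character of $b$, and $0$ otherwise.

$$
\text{lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\
\min
\begin{cases}
\text{lev}_{a,b}(i-1,j)+1\\
\text{lev}_{a,b}(i,j-1)+1\\
\text{lev}_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)}
\end{cases}
& \text{otherwise.}
\end{cases}
$$

The edit distance between the full strings is then $\text{lev}_{a,b}(m,n)$.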


## Step 1: Thinking About Base Cases

If we think about the line
$$\max(i,j) \qquad \text{if } \min(i,j)=0$$
What this is saying is that if one of the strings is empty, then return the length of the other string.

Said in Python we can start our definition of the edit distance using the following code block

```python
def edit_distance1(s,t):
    if len(s)==0 or len(t) ==0:
        return max(len(s),len(t))
```

## Step 2: Easy Recursive Case

Additionally, if two strings `s` and `t` have the same starting character, i.e. `s[0]==t[0]` then we can compute the edit distance on the rest of the string, i.e.

```python
def edit_distance2(s,t):
    if len(s)==0 or len(t) ==0:
        return max(len(s),len(t))
    if s[0]==t[0]:
        return edit_distance2(s[1:],t[1:])
```

## Step 3: Recursing Via 3 Strategies

We now emulate the "Use it" or "Lose it" strategy of the Subset-Sum problem that we considered last time.

If `s` and `t` **DON'T** have the same starting character, then we have three options. We can use a substitution to make the first characters match and then recurse on the remaining strings:
- `substitution = 1 + edit_distance(s[1:],t[1:])`

Or we can try deleting the first character of the first string and recurse on the shorter first string and the full second string:
- `deletion = 1 + edit_distance(s[1:],t)`

Or we can try inserting a character in the front of `s` so that it matches the first character of `t` and then recurse on the edit distance between the original (before insertion) string `s` and the rest of `t`. Alternatively you can think of this as deleting the first character of the second string:
- `insertion = 1 + edit_distance(s,t[1:])`

In [8]:
def edit_distance(s,t):
    """Given two strings s and t it returns the edit distance between them"""
    if len(s)==0 or len(t) ==0:
        return max(len(s),len(t))
    if s[0]==t[0]:
        return edit_distance(s[1:],t[1:])
    else:
        substitution = 1 + edit_distance(s[1:], t[1:])
        deletion = 1 + edit_distance(s[1:], t)
        insertion = 1 + edit_distance(s, t[1:])
        return min(substitution, deletion, insertion)
        #return substitution
#edit_distance('aloud','loud')
edit_distance('alien','sales')

3
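
Because the mismatch case makes three recursive calls, the function above can revisit the same pair of suffixes many times, and its running time can grow exponentially with the length of the strings. One common remedy, sketched here as an optional extension rather than as part of the lecture's code, is to cache results with `functools.lru_cache`. The name `edit_distance_memo` is introduced for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def edit_distance_memo(s, t):
    """Same recursion as edit_distance, but each (s, t) pair is solved only once."""
    if len(s) == 0 or len(t) == 0:
        return max(len(s), len(t))
    if s[0] == t[0]:
        return edit_distance_memo(s[1:], t[1:])
    substitution = 1 + edit_distance_memo(s[1:], t[1:])
    deletion = 1 + edit_distance_memo(s[1:], t)
    insertion = 1 + edit_distance_memo(s, t[1:])
    return min(substitution, deletion, insertion)

print(edit_distance_memo('alien', 'sales'))  # 3
print(edit_distance_memo('aloud', 'loud'))   # 1
# Longer inputs like this one are slow for the uncached version
print(edit_distance_memo('extraterrestrial', 'extraordinarily'))
```

Every argument is a suffix of one of the original strings, so there are at most `(len(s)+1)*(len(t)+1)` distinct pairs to solve.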

## Reflection Questions
- Is `edit_distance` symmetric? I.e. is `edit_distance(s,t)==edit_distance(t,s)` `True`?
- Is it non-negative?
- Does it satisfy the triangle inequality?
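
Before attempting a proof, it can help to gather evidence by brute force. The sketch below repeats the definition of `edit_distance` so that it runs on its own, then checks symmetry and the triangle inequality on the eight warmup strings. A finite check like this is evidence, not a proof.

```python
from itertools import product

def edit_distance(s, t):
    """Given two strings s and t it returns the edit distance between them"""
    if len(s) == 0 or len(t) == 0:
        return max(len(s), len(t))
    if s[0] == t[0]:
        return edit_distance(s[1:], t[1:])
    substitution = 1 + edit_distance(s[1:], t[1:])
    deletion = 1 + edit_distance(s[1:], t)
    insertion = 1 + edit_distance(s, t[1:])
    return min(substitution, deletion, insertion)

words = ['cats', 'dogs', 'spam', 'spa', 'aloud', 'loud', 'alien', 'sales']

# Compute each pairwise distance once and store it in a dictionary
d = {(s, t): edit_distance(s, t) for s, t in product(words, repeat=2)}

symmetric = all(d[s, t] == d[t, s] for s, t in product(words, repeat=2))
non_negative = all(value >= 0 for value in d.values())
triangle = all(d[s, u] <= d[s, t] + d[t, u]
               for s, t, u in product(words, repeat=3))

print(symmetric, non_negative, triangle)
```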