## The Circle Game

        We're captive on the carousel of time
        We can't return we can only look
        Behind from where we came
        And go round and round and round
        In the circle game
        -- Joni Mitchell, `The Circle Game`
 

Suppose you want to compare a string, character-by-character, to itself and see how many matches there are.


In [72]:
def matches(seq: str) -> int:
    """Count matches of seq with itself."""
    _matches = 0
    for chr in seq:
        if chr == chr:
            _matches += 1
    return _matches

You know the answer already: they all match, so it's just the number of characters in the string.

In [73]:
seq = "hello, world"
matches(seq) == len(seq)

True

If you think `"hello, world"` is special, change `seq` above to something else and run the cell again.

But what if you shift the string a little, like this?

        hello, world
         hello, world

Now, almost nothing matches -- there's only the second `l` in the original `hello`, which matches the first `l` in the shifted version, below it.  

You also have to special-case the characters at the beginning of the original and end of the shifted version, which don't have anything to compare to. You can get rid of that nuisance by treating the string as a circle, moving the letters shifted off the end around to the beginning.

        hello, world
        dhello, worl

Let's rewrite `matches()` to do just that.

In [74]:
def matches(seq: str, shift: int) -> int:
    """Count matches of seq to a circular permutation, shifted by shift."""
    n_matches = 0  # named so it doesn't shadow the function itself
    length = len(seq)
    for pos, chr in enumerate(seq):
        # use modular arithmetic to wrap around to the start of the string
        shifted_chr = seq[(pos+shift) % length]
        if chr == shifted_chr:
            n_matches += 1
    return n_matches

A shift of 0 should give the length of the string,

In [75]:
matches(seq, 0) == len(seq)

True

and a shift of 1 should just leave a single, matching `l`.

In [76]:
matches(seq, 1)

1

The rotated version of the string is `seq[shift:] + seq[:shift]`,
so we can see the matches for each rotation like this:


In [77]:
for shift in range(len(seq)):
    n_match = matches(seq, shift)
    print(f"{seq[shift:] + seq[:shift]}: {n_match}")

hello, world: 12
ello, worldh: 1
llo, worldhe: 0
lo, worldhel: 0
o, worldhell: 2
, worldhello: 1
 worldhello,: 0
worldhello, : 1
orldhello, w: 2
rldhello, wo: 0
ldhello, wor: 0
dhello, worl: 1
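
As a cross-check, the same counts should fall out of building each rotated string explicitly and comparing it to the original, column by column. This is only a sketch: `matches_by_rotation` is a name made up here, and it assumes `0 <= shift <= len(seq)`.

```python
def matches_by_rotation(seq: str, shift: int) -> int:
    """Count positions where seq agrees with its rotation by shift.

    Hypothetical alternative to matches(); assumes 0 <= shift <= len(seq).
    """
    rotated = seq[shift:] + seq[:shift]
    # zip pairs each character with the one in the same column of the rotation
    return sum(a == b for a, b in zip(seq, rotated))


demo = "hello, world"
rotation_counts = [matches_by_rotation(demo, shift) for shift in range(len(demo))]
print(rotation_counts)
```

The printed list should agree, shift for shift, with the counts in the output above.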


After it rotates halfway around, we start getting the same counts in reverse.
Of course! It's a circle, so the number of matches for a shift of one to the right

    hello, world
    dhello, worl

will be the same as the number of matches for a shift of one to the left
 
    hello, world
    ello, worldh
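
That symmetry is easy to check directly. The sketch below uses a compact stand-in, `circular_matches` (a name invented here so the block runs on its own), and leans on the fact that Python's `%` gives a non-negative result for a positive modulus, so a negative `shift` simply wraps around the circle the other way.

```python
def circular_matches(seq: str, shift: int) -> int:
    """Count matches of seq with its circular shift (compact stand-in)."""
    length = len(seq)
    return sum(seq[pos] == seq[(pos + shift) % length] for pos in range(length))


demo = "hello, world"
n = len(demo)
# a shift of k one way should score the same as a shift of k the other way,
# and shifting by -k lands on the same rotation as shifting by n - k
symmetric = all(
    circular_matches(demo, k) == circular_matches(demo, -k) == circular_matches(demo, n - k)
    for k in range(1, n)
)
print(symmetric)
```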

Shifts 1 through `len(seq) // 2` cover every distinct mis-alignment. Even with shift 0, the original sequence aligned with itself, thrown in, stopping there cuts the time to run the previous example nearly in half:

In [78]:
for shift in range(len(seq)//2 + 1):
    n_match = matches(seq, shift)
    print(f"{seq[shift:] + seq[:shift]} : {n_match=}")

hello, world : n_match=12
ello, worldh : n_match=1
llo, worldhe : n_match=0
lo, worldhel : n_match=0
o, worldhell : n_match=2
, worldhello : n_match=1
 worldhello, : n_match=0


Surprisingly, the fourth mis-alignment has a pair of matches, more than the first! Its second `o`, from `world`, matches the original `o` in `hello`, and its penultimate letter, `l`, matches the `l` in the original `world`. 

        hello, world
        o, worldhell

A longer shift gives a better match.

It's worth wrapping that into a function.

In [79]:
def shift_matches(seq: str) -> list[int]:
    """Return number of matches of sequence with each possible shift of itself."""
    _shift_matches = []
    for shift in range(len(seq)//2 + 1):
        _shift_matches.append(matches(seq, shift))
    return _shift_matches

In [80]:
shift_matches(seq)

[12, 1, 0, 0, 2, 1, 0]

Processing and rearranging that list a little makes it easy to show the shifts with the best matches.

In [81]:
from collections import defaultdict

def best_matches(seq: str) -> list[tuple[int, list[int]]]:
    """Return list of pairs: (# of matches, [shift(s) with that many matches]).
    
    Sort the list, reporting positions with most matches first.
    """
    _positions = defaultdict(list)
    for pos, matches in enumerate(shift_matches(seq)):
        _positions[matches].append(pos)
    return sorted(_positions.items(), reverse=True)

In [82]:
best_matches(seq)

[(12, [0]), (2, [4]), (1, [1, 5]), (0, [2, 3, 6])]

As we've seen already, the most matches are with the unshifted string itself (no surprise), and the second most are at a shift of 4.
Shifts of 1 and 5 both find only one match, and the rest have none. 

## Circular DNA

Well, okay, but so what? In the first place, who wants to compare strings with their shifted versions,
and in the second, who ever has circular strings?

Let's start with the latter.
The DNA in your chromosomes is a linear string of the letters 'A', 'C', 'G', and 'T'.
Of course, these letters are chemicals -- Adenine, Cytosine, Guanine, and Thymine -- but the letters in this notebook
are really bit patterns, and even if you think of bits as ones and zeros, they're implemented as electronics.

So. DNA is a string. Modern DNA-sequencing techniques supply a surfeit of such strings.

But not all DNA in your cells is in the nucleus. There's also DNA in your mitochondria. 
And while the DNA in chromosomes is linear strings in a four-letter alphabet, 
the DNA in the mitochondria of those cells is circular.

It's also a lot smaller. The biggest human chromosome has almost 250 million letters; the smallest, only about 45 million.
Human mitochondrial DNA? 16,569. Smaller than a 20KB file.  An 80x24 terminal window shows 1920 characters, so it would fill fewer than nine of those.

The genome of *Escherichia coli*, the workhorse of bacterial genetics, is bigger but still only about 5MB. And circular.
Bacteriophage *lambda*, a common molecular biology tool, infects *E. coli*, and is only about 50KB, roughly a hundredth the size.
(A "bacteriophage", or "phage", is a virus that infects bacteria.)
Another *E. coli* predator, *phiX174*, is only about 5KB.  All three are circular.

Circular genomes are the rule in the bacterial world, not the exception.
We have tons of strings available to us that are circular. They're all DNA.




## A Look at PhiX DNA

Let's try our code on [the phiX174 genome](https://www.ncbi.nlm.nih.gov/nuccore/9626372). You can download it to a file by clicking on **FASTA**, then **Send to:**.


The FASTA file, `sequence.fasta`, is just a text file -- a single-line header followed by the DNA sequence, broken into 70-column lines.

In [83]:
with open("sequence.fasta") as fasta:
    lines = fasta.readlines()
print(lines[:2])

['>NC_001422.1 Escherichia phage phiX174, complete genome\n', 'GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT\n']


You could parse it yourself and concatenate all the DNA fragments into a single string,
but BioPython will do that work for you.

In [84]:
import Bio.SeqIO  # the BioPython module, for handling DNA sequences.

record = Bio.SeqIO.read("sequence.fasta", "fasta")
record.description


'NC_001422.1 Escherichia phage phiX174, complete genome'

In [85]:
len(record.seq)


5386

PhiX has 5386 bases (letters).
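
For comparison, parsing by hand is not much code for a single-record file like this one. The sketch below is only an illustration: `parse_fasta_lines` is a hypothetical helper, it assumes exactly one record (one `>` header), and it does none of the format checking a real parser would.

```python
from typing import Iterable


def parse_fasta_lines(lines: Iterable[str]) -> tuple[str, str]:
    """Return (description, sequence) from the lines of a one-record FASTA file.

    Hypothetical helper: assumes a single '>' header followed by sequence lines.
    """
    description = ""
    fragments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            description = line[1:]        # drop the leading '>'
        else:
            fragments.append(line)        # one chunk of the sequence
    return description, "".join(fragments)


# the first two lines of sequence.fasta, as printed earlier
sample = [
    ">NC_001422.1 Escherichia phage phiX174, complete genome\n",
    "GAGTTTTATCGCTTCCATGACGCAGAAGTTAACACTTTCGGATATTTCTGATGAGTCGAAAAATTATCTT\n",
]
description, sequence = parse_fasta_lines(sample)
print(description, len(sequence))
```

Handing it the full `lines` list read earlier, instead of `sample`, should give back the whole genome as one string.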

Let's wrap getting the sequence into a function.

In [86]:
import Bio.SeqIO  # the BioPython module, for handling DNA sequences.

def get_dna_sequence(filename: str) -> str:
    """Return sequence in FASTA file as a single string."""
    # .seq is a Bio.Seq.Seq object; convert it so we really return a str
    return str(Bio.SeqIO.read(filename, "fasta").seq)


### Compare phiX to Itself, Shifted

All right-y, then! What happens when we look at shift-matches for the PhiX circular genome?
Once again, the most matches will be for the un-shifted genome, but what comes next?

In [87]:
seq = get_dna_sequence("sequence.fasta")
sorted_matches = best_matches(seq)
for shifts_producing_nmatches in sorted_matches[:10]:  # shift distances of 10 match-iest
    print(shifts_producing_nmatches)

(5386, [0])
(1609, [66])
(1608, [78])
(1599, [9])
(1594, [12])
(1590, [57])
(1587, [21])
(1585, [96])
(1584, [33, 48])
(1579, [30])


The most matches, 5386, are the whole sequence. When you don't shift at all, everything matches itself.
The second-most matches, 1609, are at a shift of 66. Is it a lot? We'd have to do math.

How many do we expect to see at random? Four letters, so on average, a quarter of the letters would end up randomly matching: 
5386/4 ≈ 1347

Right?

Wrong.

The letters may not each occur 1/4 of the time.  If, for example, you had a circular sequence that was all 'A' (even though this one isn't), every shift would still match at every letter. Figuring out how many matches to expect at random requires first finding the letter distribution, or *base composition*, and then doing some math.

Seems like work, so let's come back to that later.
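
If you can't wait, here is a rough sketch of the idea. It assumes each position is an independent draw from the sequence's own base composition, which a real genome need not obey, and `expected_random_matches` is a name invented for this sketch. Under that assumption, two positions match with a probability equal to the sum of the squared letter frequencies, so the expected count is the length times that sum.

```python
from collections import Counter


def expected_random_matches(seq: str) -> float:
    """Expected matches between seq and an unrelated shift of itself.

    Assumes positions are independent draws from seq's letter frequencies.
    """
    length = len(seq)
    if length == 0:
        return 0.0
    counts = Counter(seq)
    # chance that two independent draws give the same letter
    p_match = sum((count / length) ** 2 for count in counts.values())
    return length * p_match


print(expected_random_matches("ACGT"))  # uniform composition: a quarter match
print(expected_random_matches("AAAA"))  # all one letter: everything matches
```

With four equally common letters this reduces to the length divided by 4; any skew in the composition pushes the expectation higher.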

For now, go back and look again at the positions: 66, 78, 9, ..., 30. Every one is divisible by 3!

How far down our list of positions do we have to go to find one that isn't?


In [88]:
def not_divisible_by_3(positions: list[int]) -> list[int]:
    """Return the positions that are not multiples of 3."""
    return [position for position in positions if position % 3 != 0]


peculiar_locations = []

In [89]:
# collect all the (n_matches, positions) pairs
# that have a position not divisible by 3
for rank, (n_matches, positions) in enumerate(sorted_matches):
    if not_divisible_by_3(positions):
        peculiar_locations.append((rank, n_matches, positions))
print(peculiar_locations[:10])      # ranks of first 10 shifts that aren't divisible by 3

[(44, 1502, [1828]), (51, 1492, [81, 270, 2074]), (55, 1486, [348, 672, 2047, 2104]), (56, 1485, [336, 1954, 2520]), (58, 1483, [246, 570, 1852]), (59, 1480, [231, 720, 2515]), (61, 1478, [135, 168, 626]), (62, 1477, [63, 1510]), (64, 1475, [243, 357, 1007]), (65, 1474, [273, 1810])]


That's impressive. The shifts behind the 44 highest match counts (43 if you don't count the "shift" that did nothing) are all divisible by 3.
Since there is often more than one shift with the same number of matches, what's the median number of matches?


In [95]:
n_matches = shift_matches(seq)
print(f"mean number of matches = {sum(n_matches)/len(n_matches)}")
n_matches_sorted = sorted(n_matches, reverse=True)
print(f"median number of matches = {n_matches_sorted[len(n_matches)//2]}")

mean number of matches = 1377.2268002969563
median number of matches = 1374
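
The standard library's `statistics` module can do the same bookkeeping. Here is a small sketch on the `"hello, world"` counts from earlier. `statistics.median_low` picks the lower of the two middle values when the list has an even length, which is also what indexing a descending sort at `len // 2` does.

```python
import statistics

counts = [12, 1, 0, 0, 2, 1, 0]  # shift_matches("hello, world") from earlier

# the hand-rolled version: middle element of a descending sort
by_hand = sorted(counts, reverse=True)[len(counts) // 2]

mean_matches = statistics.fmean(counts)
median_matches = statistics.median_low(counts)
print(f"mean = {mean_matches:.3f}, median = {median_matches}, by hand = {by_hand}")
```

As with the phiX numbers, the mean sits above the median here because the unshifted count is so much larger than the rest.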
