## Asymmetry of Replication

  In this second part, let's look at other approaches to finding ori and get a better understanding of the replication process.
  Replication begins at ori, as we saw in the previous section. The reverse half-strands (thick lines in the figure) are replicated first, and the forward half-strands are replicated afterwards.
  
  ![Asymmetry](data/replication.png)
  
  
  On a forward half-strand, the DNA polymerase must first wait until the replication fork has opened some space (about 2,000 nucleotides), at which point a new primer (shown in red) forms at the end of the replication fork. The DNA polymerase then replicates a small chunk of DNA, starting from this primer and moving backward in the direction of ori. When the two DNA polymerases on the forward half-strands reach ori, we have the situation shown below.

![Asymmetry2](data/replication2.png)


The replication on a forward half-strand requires occasional stopping and restarting, which results in the synthesis of short Okazaki fragments from multiple primers that are complementary to intervals on the forward half-strands.

![Asymmetry3](data/replication3.png)

When the replication fork reaches ter (Replication terminus), the replication process is almost complete, as all DNA has been synthesized. However, gaps still remain between the disconnected Okazaki fragments.

![Asymmetry4](data/replication4.png)

The consecutive Okazaki fragments are sewn together by an enzyme called DNA ligase, resulting in two intact daughter chromosomes, each consisting of one parent strand and one newly synthesized daughter strand.

![Asymmetry5](data/replication5.png)

## Deamination

  The asymmetry of the replication process leaves a statistical trace in the genome through a mutation process called deamination. Replication on the reverse half-strands proceeds much more quickly than on the forward half-strands. As a consequence, the forward half-strands spend most of their time in a single-stranded configuration and are more prone to mutations in one of the nucleotides. We should therefore observe a shortage of this nucleotide on the forward half-strand.
  Let's compare the nucleotide counts of the reverse and forward half-strands of the Thermotoga petrophila genome.
  
  ![Asymmetry6](data/count_des.png)
  
  As we can see, cytosine (C) is more frequent on the reverse half-strand, because C tends to turn into thymine (T) on the forward half-strand through deamination. Over time, some C-G base pairs are therefore replaced by T-A base pairs. This also explains the decrease in guanine (G) on the reverse half-strand (since a forward parent half-strand serves as the template for a reverse daughter half-strand, and vice versa). We can use this asymmetry to find the ori region.
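
The comparison above comes down to counting nucleotides in two stretches of DNA and looking at the difference between G and C. Here is a minimal sketch of that bookkeeping, assuming the two half-strands are already available as separate strings. The function names `nucleotide_counts` and `g_minus_c` and the two toy strings are illustrative only and are not real genome data.

```python
from collections import Counter

def nucleotide_counts(segment):
    """Return the number of A, C, G and T in a DNA string."""
    counts = Counter(segment.upper())
    # Report all four bases, even those that never occur in the segment.
    return {base: counts.get(base, 0) for base in "ACGT"}

def g_minus_c(segment):
    """Return #G - #C, the quantity that deamination pushes up or down."""
    counts = nucleotide_counts(segment)
    return counts["G"] - counts["C"]

# Toy strings (assumptions for illustration, not taken from any genome).
g_rich_segment = "GGTGACGGTAGG"
c_rich_segment = "CCATCCGTCACC"

print(nucleotide_counts(g_rich_segment), g_minus_c(g_rich_segment))
print(nucleotide_counts(c_rich_segment), g_minus_c(c_rich_segment))
```

A positive difference indicates a segment with a shortage of C, a negative one a segment with a shortage of G.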
  
  ![Asymmetry7](data/count_des2.png)
  
## The skew diagram

 Since the genome is circular, let's first linearize it. Select an arbitrary position i and treat it as the beginning of the sequence, reading around the circle until we arrive back at i. This gives us a linear sequence like the ones we analyzed before.
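
Linearizing a circular genome can be sketched as a simple string rotation. The helper names `linearize` and `to_circular_position` are hypothetical, and the sketch assumes the circular genome is stored as an ordinary Python string whose last character is followed by its first.

```python
def linearize(circular_genome, start):
    """Cut a circular genome at index `start` and return the linear string."""
    if not circular_genome:
        return ""
    start %= len(circular_genome)  # allow indices beyond one full turn
    return circular_genome[start:] + circular_genome[:start]

def to_circular_position(linear_position, start, genome_length):
    """Map an index in the linearized string back to the original string."""
    return (start + linear_position) % genome_length

genome = "ACGTTG"
linear = linearize(genome, 2)
print(linear)                                     # GTTGAC
print(to_circular_position(4, 2, len(genome)))    # 0
```

The second helper is useful later on: any position found in the linearized sequence can be translated back to the coordinates of the original genome.
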
 The skew at position i is the difference between the number of G's and the number of C's among the first i nucleotides of the genome, so $Skew_{0}(Genome) = 0$. To compute it along the genome, we proceed as follows: if the nucleotide at position i is G, then $Skew_{i+1}(Genome) = Skew_{i}(Genome) + 1$; if this nucleotide is C, then $Skew_{i+1}(Genome) = Skew_{i}(Genome) - 1$; otherwise, $Skew_{i+1}(Genome) = Skew_{i}(Genome)$.
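
To make the update rule concrete, here is a minimal sketch that records the skew at every position of a short string. The function name `skew_array` and the example string are illustrative choices, and the sketch assumes the skew starts at 0 before any nucleotide is read.

```python
def skew_array(genome):
    """Return [Skew_0, Skew_1, ..., Skew_n] for a genome of length n."""
    skew = [0]  # assumed starting value before any nucleotide is read
    for nucleotide in genome:
        if nucleotide == "G":
            step = 1
        elif nucleotide == "C":
            step = -1
        else:
            step = 0  # A and T leave the skew unchanged
        skew.append(skew[-1] + step)
    return skew

print(skew_array("CATGGGCATCGGCCATACGCC"))
# [0, -1, -1, -1, 0, 1, 2, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 0, -1, 0, -1, -2]
```

The list has one more entry than the string has nucleotides, because it includes the starting value.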
 
 ![Skew](data/skew.png)
 
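 As explained above, deamination leaves the reverse half-strand with a shortage of G and the forward half-strand with a shortage of C, so the skew decreases along the reverse half-strand and increases along the forward half-strand. We can therefore suppose that ori is located where the skew attains its minimum, because this is where the reverse half-strand ends and the forward half-strand begins.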

Now we show the skew diagram for a linearized E. coli genome.

 ![Skew2](data/skew2.png)

In [1]:
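def MinimumSkew(text):
    """Minimum Skew Problem: return every position in a genome where the
    skew diagram attains its minimum."""
    count = 0
    output = [count]  # the skew before any nucleotide is read is 0
    for nucleotide in text:
        if nucleotide == 'C':
            count -= 1
        elif nucleotide == 'G':
            count += 1
        output.append(count)  # skew after reading this nucleotide
    minimum = min(output)
    # Collect every index at which the minimum skew is attained
    return [i for i, value in enumerate(output) if value == minimum]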

In [2]:
MinimumSkew('TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT')

[11, 24]

## Hamming distance and mismatches

  The algorithm above gives us a clue as to where the ori region is located. However, DnaA can bind not only to "perfect" DnaA boxes but also to slight variations of them. This leads us to the concepts of mismatches and Hamming distance.
  
  We say that position i in k-mers $p_{1} \ldots p_{k}$ and $q_{1} \ldots q_{k}$ is a mismatch if $p_{i} \neq q_{i}$. The number of mismatches between strings p and q is called the Hamming distance between these strings and is denoted HammingDistance(p, q).

In [9]:
def HammingDistance(text1, text2):
    count = 0
    for x, y in zip(text1, text2):
        if x != y:
            count += 1
    return count

In [10]:
HammingDistance('GGGCCGTTGGT', 'GGACCGTTGAC')

3

We say that a k-mer Pattern appears as a substring of Text with at most d mismatches if there is some k-mer substring Pattern' of Text having d or fewer mismatches with Pattern, i.e., HammingDistance(Pattern, Pattern') ≤ d.

In [11]:
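def ApproximatePatternMatching(pattern, text, d):
    """Approximate Pattern Matching Problem: return all starting positions
    where pattern appears in text with at most d mismatches."""
    positions = []
    # + 1 so that the last k-mer of text is examined as well
    for i in range(len(text) - len(pattern) + 1):
        temp = text[i : i+len(pattern)]
        count_mismatches = HammingDistance(pattern, temp)
        if count_mismatches <= d:
            positions.append(i)
    return positions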

In [12]:
ApproximatePatternMatching('ATTCTGGA', 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT', 3)

[6, 7, 26, 27]
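
Building on the definition above, the number of approximate occurrences of a pattern is simply the number of matching positions. The sketch below is self-contained and uses the hypothetical names `hamming_distance` and `approximate_pattern_count`; it assumes that overlapping occurrences are counted separately.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(p, q) if x != y)

def approximate_pattern_count(pattern, text, d):
    """Count the k-mers of text within Hamming distance d of pattern."""
    k = len(pattern)
    count = 0
    # Slide a window of length k over every valid starting position.
    for i in range(len(text) - k + 1):
        if hamming_distance(pattern, text[i:i + k]) <= d:
            count += 1
    return count

text = 'CGCCCGAATCCAGAACGCATTCCCATATTTCGGGACCACTGGCCTCCACGGTACGGACGTCAATCAAAT'
print(approximate_pattern_count('ATTCTGGA', text, 3))  # 4
```

The result agrees with the four positions listed in the output above.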