# Bioinformatics Stronghold - Tree

## Counting DNA Nucleotides

**Problem**

A string is simply an ordered collection of symbols selected from some alphabet and formed into a word; the length of a string is the number of symbols that it contains.

An example of a length-21 DNA string (whose alphabet contains the symbols 'A', 'C', 'G', and 'T') is "ATGCTTCAGAAAGGTCTTACG".

Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

Sample Dataset: rosalind_dna.txt

*Reference: http://rosalind.info/problems/dna/*

In [19]:
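# Read the whole DNA string; the with-block closes the file afterwards.
with open('rosalind_dna.txt') as f:
    s = f.read()
# Report the counts in the order A, C, G, T.
print(s.count('A'), s.count('C'), s.count('G'), s.count('T'))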

214 267 271 248
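
As a quick check on a string small enough to count by hand, the sketch below applies the same idea to the length-21 example from the problem statement. The function name `count_nucleotides` is an illustrative choice, not part of the original solution, and `collections.Counter` is just one way to tally the symbols in a single pass.

```python
from collections import Counter

def count_nucleotides(dna):
    """Return the counts of 'A', 'C', 'G' and 'T' in dna, in that order."""
    counts = Counter(dna)
    # A Counter returns 0 for symbols that never occur.
    return tuple(counts[base] for base in "ACGT")

# The length-21 example string from the problem statement.
example = "ATGCTTCAGAAAGGTCTTACG"
print(*count_nucleotides(example))  # 6 4 5 6
```

The four counts sum to 21, the length of the example string.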


## Transcribing DNA into RNA

**Problem**

An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

Sample Dataset: rosalind_rna.txt

*Reference: http://rosalind.info/problems/rna/*

In [2]:
with open('rosalind_rna.txt') as f:
    t = f.read()
# Transcription replaces every 'T' of the coding strand with 'U'.
print(t.replace('T', 'U'))

GACGAUCGCUUUACGUACAUCCCAGGCGAAACACACAAACAAUCUCAAUGGACACUGCCGAUGACUAGCUCUCGAAGCACCUCGUUGGAAGGUCGCCGUGAAUACUGGCCCCAGAGGUUUUUUUCACGCCAAGCUUGUGAUAUGAGAAGACUUCGUUAUAUAACUCGGGACACGCGUAUCAGGGGGCGUGCGACCUCCUGGUGCCAAGAGUGUAGUCUCCCAGUGUGUGACGUGUAAGCAUAUCACUAUAUUUGAGUAUCAAAUGGCAUGAGAUGGCACAGCCUGGAGGCUCGCUUCAUUUGCAUAAAUACUUCAUCUUCUGCCGGGGAGGGUGAGCAUGCAUGACGUGAAGCUAUAGAGGAUUCAGGGACAUUCCGCUUCAUCGAACAGGCCCGGUGGUAGGGACCUAGCCUCACGAACGGGAACGCCCGCAUCGUAAGAUACCUGUGGUGCCACACUCAGUGUUUGUGGCUUUACUUGCGUAUACAGCGAAGACUCAUCCGUAGGUACCCUGAGUUAAACUACAUGCUCUUGCGGGCCUGGGGCGCGUUUCUAGCGGAGAACUGCACGAGAACCGCCCUUGCCAUGUGGAAGAACCGGGCGAUGCGACUGAAUAGUGAGCCGUAAUUUAACGGACUGAGUCGUAUGCUCUCAAGGAGAGUCUCGUGAUGUCUCCCUACUGAUGUUGAGGCUAUCAGUACUUGAAGGGACUGCAGAGCCCCUCAUAUGUAGUUGCCUCCGUCCAAAAUUCUUUUAGCAAUUAUUGAGUGGUUUAGUGAAGCGUUGCCACCUCGGCGAGAUUCUGGAGCCUAACGCCUAACUCUGUUCAACUCUGUAUGGCACGUUCGUUUGACUGCCACGCACAGAGAUGCGCCCUGAGCAUCUUUAGCCCGCACCGUCAAUGCAUAGACGCGCAGGUUGCCCCAUCGAUCUUGAAGGUCCUAGUUAAUGG



## Complementing a Strand of DNA

**Problem**

In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string $s$ is the string $s^c$ formed by reversing the symbols of $s$, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string $s$ of length at most 1000 bp.

Return: The reverse complement $s^c$ of $s$.

Sample Dataset: rosalind_revc.txt

*Reference: http://rosalind.info/problems/revc/*


In [20]:
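# Map each base to its complement.
trans = str.maketrans('ATGC', 'TACG')
with open('rosalind_revc.txt') as f:
    # Strip the trailing newline so it does not end up at the front
    # of the reversed string.
    comp = f.read().strip().translate(trans)
print(comp[::-1])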



CCTCGGGTAGGAGACGCTCCAACGAGGCACCCGGTTGGTGACCACGCGTTACTAATCCTATGCGATTATGCGACGCAGCACCGACGTGATGAAGGAAAGACAAATTCTGCAACCCGGATTGATGCCCTGGGGGAATCCCACATCACGGATGACAGAACAATCCCAACATCGAAGCGCGTCGTCGGCGATTTCTAAGTACTAGCCAACAGCCAAAGCCTTACGGCGACTGTCCTCGGCCGCCTGGGTCACCACGGATTACTGCGACCAGGAGAACAGATACGACGTTGTGATTGCATCACACGTCATCCCCAGGCGTGTGCCCTCTCAGAAAGACCTCTACCGGCGCTTCTACCCGCGGTTCTTGGGTTCAGCCGATAAACATCTGTCGTTGGGCGGGAGCGTAAATTATCACGATCAACTTGGCCGGTGGCTGAACAAGACGCCATGGCCGCCCCTGCTCCCAGGGTGTAAGCTCGAAGGACGGTTCACGTAATCCAGATATAGATTGGAACAAGGACGAATCGGTCCGCCGAAGCGGTCAAATCCGACGGAGCTTCTCGGCTAGTGAATACATATTTATAGAGGGACATAACCTGTTATGTCGTACAGCAGTCTCTCCTTTGATCCGCCGGCCTACGTGCGACTATCCCTCCTCACTCAGAGGGTACACTAAGGCTTTCCAGGAGCTGCGGAAAAGTTGGACGCTACTGGTCATACCCGTAGTTTGGCCCCTACATACTTTCCATGACGCGCCTGATCTCGGCTATGTCTGGGCCTTTACCCAAGCGCGCACCTATCCTCTGTTCCACTAGAGTGGAGACATTGCGTATGTTCCGGAGGGCACTTTCGGTGTCATATACTGCAGGGTCGTTTTCGGTGAGGCCTCTGCCGCGAAAGAGCTATGATGCATAGTTAGA
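
The "GTCA" example from the problem statement makes a convenient sanity check. The following is a minimal sketch; the names `COMPLEMENT` and `reverse_complement` are illustrative and not part of the original cell.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then reverse the result."""
    # strip() drops surrounding whitespace such as a trailing newline.
    return dna.strip().translate(COMPLEMENT)[::-1]

print(reverse_complement("GTCA"))  # TGAC
```

Complementing and reversing commute, so it does not matter which of the two steps is applied first.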


## Rabbits and Recurrence Relations

**Problem**

A sequence is an ordered collection of objects (usually numbers), which are allowed to repeat. Sequences can be finite or infinite. Two examples are the finite sequence $(\pi, -\sqrt{2}, 0, \pi)$ and the infinite sequence of odd numbers $(1, 3, 5, 7, 9, \ldots)$. We use the notation $a_n$ to represent the $n$-th term of a sequence.

A recurrence relation is a way of defining the terms of a sequence with respect to the values of previous terms. In the case of Fibonacci's rabbits from the introduction, any given month will contain the rabbits that were alive the previous month, plus any new offspring. A key observation is that the number of offspring in any month is equal to the number of rabbits that were alive two months prior. As a result, if $F_n$ represents the number of rabbit pairs alive after the $n$-th month, then we obtain the Fibonacci sequence having terms $F_n$ that are defined by the recurrence relation $F_n = F_{n-1} + F_{n-2}$ (with $F_1 = F_2 = 1$ to initiate the sequence). Although the sequence bears Fibonacci's name, it was known to Indian mathematicians over two millennia ago.

When finding the n-th term of a sequence defined by a recurrence relation, we can simply use the recurrence relation to generate terms for progressively larger values of n. This problem introduces us to the computational technique of dynamic programming, which successively builds up solutions by using the answers to smaller cases.

Given: Positive integers n≤40 and k≤5.

Return: The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

Sample Dataset: rosalind_fib.txt

*Reference: http://rosalind.info/problems/fib/*

In [45]:
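with open('rosalind_fib.txt') as f:
    n, k = map(int, f.read().split())

def rabbits(n, k):
    # Base cases: no pairs in month 0, one pair in month 1.
    if n == 0:
        return 0
    if n == 1:
        return 1
    # Pairs alive last month, plus k new pairs for every pair
    # that was alive two months ago.
    return rabbits(n-1, k) + k*rabbits(n-2, k)

print(rabbits(n, k))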

20444528200
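
The recurrence can also be evaluated bottom-up, which is the dynamic-programming approach the problem statement describes: each term is built from the two before it, so only two values need to be kept. The sketch below is one possible way to write this; the name `rabbit_pairs` is illustrative and not part of the original cell.

```python
def rabbit_pairs(n, k):
    """Number of rabbit pairs after n months when each mature pair
    produces k new pairs per month (F_1 = F_2 = 1)."""
    previous, current = 0, 1  # F_0 and F_1
    for _ in range(n - 1):
        # F_i = F_{i-1} + k * F_{i-2}
        previous, current = current, current + k * previous
    return current

# Worked by hand for n = 5, k = 3: the terms are 1, 1, 4, 7, 19.
print(rabbit_pairs(5, 3))  # 19
```

This takes a number of steps proportional to n, whereas plain recursion without caching recomputes the same subproblems many times over.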


## Counting Point Mutations

**Problem**

Given two strings $s$ and $t$ of equal length, the Hamming distance between $s$ and $t$, denoted $d_H(s, t)$, is the number of corresponding symbols that differ in $s$ and $t$.

Given: Two DNA strings $s$ and $t$ of equal length (not exceeding 1 kbp).

Return: The Hamming distance $d_H(s, t)$.

Sample Dataset: rosalind_hamm.txt

*Reference: http://rosalind.info/problems/hamm/*

In [22]:
with open('rosalind_hamm.txt') as f:
    s, t = f.read().split()

# Count the positions at which the two strings differ.
count = 0
for i in range(len(s)):
    if s[i] != t[i]:
        count += 1

print(count)

475
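
The definition of the Hamming distance translates almost directly into a pairwise comparison. Below is a compact sketch using `zip`; the function name `hamming_distance` and the two short strings are illustrative assumptions, not part of the dataset above.

```python
def hamming_distance(s, t):
    """Number of positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # zip pairs up corresponding symbols; each mismatch counts as 1.
    return sum(a != b for a, b in zip(s, t))

# Two illustrative 17-symbol strings that differ at 7 positions.
print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
```

The explicit length check matters because `zip` silently stops at the end of the shorter string.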


## Translating RNA into Protein

**Problem**

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by s.

Sample Dataset: rosalind_prot.txt

*Reference: http://rosalind.info/problems/prot/*

In [7]:
# Generate Codon Table (RNA)
bases = ['U', 'C', 'A', 'G']
codons = [a+b+c for a in bases for b in bases for c in bases]
amino_acids = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
codon_table = dict(zip(codons, amino_acids))

# Translate function
def translate(rna):
    rna = rna.replace('\n', '').replace(' ', '')
    peptide = ''
    
    for i in range(0, len(rna), 3):
        codon = rna[i: i+3]
        amino_acid = codon_table.get(codon, '*')
        if amino_acid != '*':
            peptide += amino_acid
        else:
            break
                 
    return peptide

with open('rosalind_prot.txt') as f:
    rna = f.read()

print(translate(rna))


MSYRQGAKDLNTLSARKLSAETNEAFLVYTGVLLLVTHRTRVPNDTGLNIYANRKRFQRDVIKSFAAPGRTLPQTPPMERVTAAQRSNPALVNRDMEIFRENSVVVRRRRAYRAIPRKVVASHILRFVIYTHHGGIVYQAASDSPTECDPGCCLMSERPDQCLPELLTTLSATSSCFLVGSFEKSLSPHLFIGANGGWCPLLTHSYDFSTRRNGRGAVQAVCTFVQSIKRFLGLNRYDEAARRLRWRAGKGEISPSSYYRINDSGNSNDPDTPTWRCDLTNGVGEWRSCYLHDCTIRLVVSRIGGTHFHVRSLHADCKGVARRLRACVHTIRGIFTRGLSLNANKALIIYRRRPTRGGVPAVVAVASYASNSDKNVSRRIQNEPQTIISGTTPVQKGGAPITTVTHHHRRQEQTAPLRKTPLERVARFARLGFMATTERTSERSDERLSYPQASPGQPLMYAARTGACGIGTPCGTSLCILTAHRYANRCSRANLASFTNSPLRLRPVRSPPRRTRPYAGRTSDPIWRLGPAPETDMSVHRNQGRRYDASRPYSQTVKEADDVPQITLERLAAAGFAGSLSFGLISFFGWLNTVAAQFETKERSNGNVLANITCRQPCIVSNVPRIKSIDLICFGRLILVSVLHRTFADSETGTSHPRWTRPAVMYPELHGKPCRQREQVQSPFDRNYQAAPFNHLVHLIPGPHSFPPQNIPTRKVNGTVYDWRIITTSMLGLHSASTTCHSSGAPVGLLYYSTARSLVVAVQVVSNTTFFLVFAQPEYAGFARPRREPRVPECGSTLRRVDHPRAKLGRSLVLFLAGRLDYMDAYVRPDPGLRSRSVGRAMRLGEIIREILECVSTQRLPSYRTQGPTCAISPPIGLYWSPAPVVSQYNPLRRSRAVSRTLQCDDSLAGIARLQPRKLRGYDAWLPDLRPYGRNRRIIGMTTPMTNELKTTHSVARRSVVPSKVRIDALCITNTGGLDHQALVTSCRWAHRSLQTSVTP
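
To make the ordering of the codon table concrete, the sketch below rebuilds the same table with `itertools.product` and checks a few familiar codons. The names `CODON_TABLE` and `translate_short`, and the 12-nt input string, are illustrative assumptions rather than part of the original cell.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest, matching the order of AMINO_ACIDS.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_short(rna):
    """Translate rna codon by codon, stopping at the first stop codon."""
    peptide = []
    # The range ignores an incomplete codon at the end of the string.
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

print(CODON_TABLE["AUG"])               # M (start codon, methionine)
print(translate_short("AUGUUUUGGUAA"))  # MFW
```

Here AUG, UUU and UGG encode M, F and W, and UAA is one of the three stop codons marked `*` in the table.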

## Finding a Motif in DNA

**Problem**

Given two strings s and t, t is a substring of s if t is contained as a contiguous collection of symbols in s (as a result, t must be no longer than s).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of 'U' in "AUGCUUCAGAAAGGUCUUACG" are 2, 5, 6, 15, 17, and 18). The symbol at position i of s is denoted by s[i].

A substring of s can be represented as s[j:k], where j and k represent the starting and ending positions of the substring in s; for example, if s = "AUGCUUCAGAAAGGUCUUACG", then s[2:5] = "UGCU".

The location of a substring s[j:k] is its beginning position j; note that t will have multiple locations in s if it occurs more than once as a substring of s (see the Sample below).

Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s.

*Reference: http://rosalind.info/problems/subs/*

In [22]:
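with open('rosalind_subs.txt') as f:
    s, t = f.read().split()

locations = []
for i in range(len(s)):
    # Compare t with the window of s starting at i; windows are allowed
    # to overlap, so every start index is tried.
    if t == s[i:i+len(t)]:
        locations.append(i+1)  # report 1-based positions

for i in locations:
    print(i, end=" ")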

53 135 179 205 212 219 252 273 280 320 327 369 388 453 509 524 566 574 595 610 631 641 672 736 743 750 809 816 823 830 860 877 903 910 917 962 981 
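
The problem statement counts positions from 1 and treats the end position of `s[j:k]` as inclusive, whereas Python slices count from 0 and exclude the end index. The sketch below illustrates the conversion on the example string from the problem statement; the function name `find_locations` is illustrative, and `str.find` is just one alternative to comparing slices.

```python
def find_locations(s, t):
    """Return the 1-based start positions of every occurrence of t in s,
    including overlapping occurrences."""
    locations = []
    start = s.find(t)
    while start != -1:
        locations.append(start + 1)   # convert 0-based index to 1-based
        start = s.find(t, start + 1)  # step by one to allow overlaps
    return locations

s = "AUGCUUCAGAAAGGUCUUACG"
print(find_locations(s, "U"))  # [2, 5, 6, 15, 17, 18]
# The problem's s[2:5] (1-based, inclusive) is Python's s[1:5].
print(s[1:5])                  # UGCU
```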

## Mendel's First Law

**Problem**

Probability is the mathematical study of randomly occurring phenomena. We will model such a phenomenon with a random variable, which is simply a variable that can take a number of different distinct outcomes depending on the result of an underlying random process.

For example, say that we have a bag containing 3 red balls and 2 blue balls. If we let $X$ represent the random variable corresponding to the color of a drawn ball, then the probability of each of the two outcomes is given by $\Pr(X = \text{red}) = \frac{3}{5}$ and $\Pr(X = \text{blue}) = \frac{2}{5}$.

Random variables can be combined to yield new random variables. Returning to the ball example, let Y model the color of a second ball drawn from the bag (without replacing the first ball). The probability of Y being red depends on whether the first ball was red or blue. To represent all outcomes of X and Y, we therefore use a probability tree diagram. This branching diagram represents all possible individual probabilities for X and Y, with outcomes at the endpoints ("leaves") of the tree. The probability of any outcome is given by the product of probabilities along the path from the beginning of the tree; see Figure 2 for an illustrative example.

An event is simply a collection of outcomes. Because outcomes are distinct, the probability of an event can be written as the sum of the probabilities of its constituent outcomes. For our colored ball example, let $A$ be the event "$Y$ is blue." $\Pr(A)$ is equal to the sum of the probabilities of two different outcomes: $\Pr(X = \text{blue and } Y = \text{blue}) + \Pr(X = \text{red and } Y = \text{blue})$, or $\frac{1}{10} + \frac{3}{10} = \frac{2}{5}$ (see Figure 2 above).
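
The two-draw ball example can be reproduced with exact arithmetic by multiplying along each branch of the probability tree and summing the branches that end in a blue second ball. This is a small sketch using `fractions.Fraction`; the variable names are illustrative.

```python
from fractions import Fraction

red, blue = 3, 2
total = red + blue

# First draw.
p_x_red = Fraction(red, total)    # 3/5
p_x_blue = Fraction(blue, total)  # 2/5

# Second draw without replacement: one ball is already gone.
p_y_blue_given_x_red = Fraction(blue, total - 1)       # 2/4 = 1/2
p_y_blue_given_x_blue = Fraction(blue - 1, total - 1)  # 1/4

# Multiply along each path of the tree, then add the paths in the event.
p_red_then_blue = p_x_red * p_y_blue_given_x_red     # 3/10
p_blue_then_blue = p_x_blue * p_y_blue_given_x_blue  # 1/10
p_y_blue = p_red_then_blue + p_blue_then_blue

print(p_y_blue)  # 2/5
```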

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

*Reference: http://rosalind.info/problems/iprb/*

In [23]:
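with open('rosalind_iprb.txt') as f:
    # Convert to integers; the raw tokens are strings, and adding
    # strings would concatenate them instead of summing.
    k, m, n = map(int, f.read().split())
total = k + m + n
# Subtract the probability of a homozygous recessive offspring, which
# arises from aa x aa, Aa x aa (either order) or Aa x Aa (1 time in 4).
print (1 - (n/total)*((n-1)/(total-1)) - (n/total)*(m/(total-1)) - (m/total)*((m-1)/(total-1))*0.25)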

0.7833333333333333
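
The same calculation can be laid out term by term: it is easier to compute the probability of a homozygous recessive offspring and subtract it from 1. The sketch below uses exact fractions so each term can be checked by hand; the function name `dominant_probability` and the input values are illustrative assumptions.

```python
from fractions import Fraction

def dominant_probability(k, m, n):
    """Probability that the offspring of two randomly chosen organisms
    carries a dominant allele (k = AA, m = Aa, n = aa)."""
    total = k + m + n
    pairs = total * (total - 1)  # ordered pairs of distinct organisms
    recessive = (
        Fraction(n * (n - 1), pairs)                       # aa x aa: always aa
        + Fraction(2 * m * n, pairs) * Fraction(1, 2)      # Aa x aa, either order
        + Fraction(m * (m - 1), pairs) * Fraction(1, 4)    # Aa x Aa
    )
    return 1 - recessive

# With two organisms of each genotype the result is 47/60.
p = dominant_probability(2, 2, 2)
print(p, float(p))
```

For k = m = n = 2 the three recessive terms are 2/30, 4/30 and 1/60, which sum to 13/60.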
