### Rosalind Problems

Given: A string s of length at most 200 letters and four integers a, b, c and d

Return: The slice of this string from indices a through b and c through d (with a space in between), inclusively. In other words, we should include elements s[b] and s[d] in our slice.
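
As a quick illustration of the inclusive requirement, here is a minimal sketch. The function name `inclusive_slices` and the example string are assumptions made for illustration; they are not part of the Rosalind dataset.

```python
def inclusive_slices(s, a, b, c, d):
    """Return s[a..b] and s[c..d], both ends included, joined by a space."""
    # Python slices exclude the stop index, so add 1 to include s[b] and s[d].
    return s[a:b + 1] + " " + s[c:d + 1]


# hypothetical example string, not taken from the dataset
print(inclusive_slices("HumptyDumpty", 0, 5, 6, 11))
```

Wrapping the `+ 1` adjustment in one place makes the off-by-one reasoning explicit instead of scattering it across the index variables.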

In [None]:
# opening files using with/as automatically handles opening/closing

with open("rosalind_ini3.txt","r") as f:

    # store lines 
    line1 = next(f)
    line2 = next(f)

    # store index values; add 1 to b and d because slice stops are exclusive
    ind = line2.split()
    a = int(ind[0])
    b = int(ind[1]) + 1
    c = int(ind[2])
    d = int(ind[3]) + 1
    
    # return output 
    print(line1[a:b] + " " + line1[c:d])


Given: Two positive integers a and b (a<b<10000).

Return: The sum of all odd integers from a through b, inclusively.
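
One possible alternative to testing every integer is to step through the odd numbers directly. This is only a sketch; the name `sum_odd` is hypothetical and the inputs below are arbitrary test values.

```python
def sum_odd(a, b):
    """Sum the odd integers in the closed interval [a, b]."""
    # Start at a if it is odd, otherwise at the next integer.
    first_odd = a if a % 2 == 1 else a + 1
    # A step of 2 visits only odd values; b + 1 makes the upper bound inclusive.
    return sum(range(first_odd, b + 1, 2))


print(sum_odd(100, 200))
```

The built-in `sum` is used here as a function, which is one reason to avoid reusing `sum` as a variable name.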

In [None]:
with open("rosalind_ini4.txt","r") as f:

    # extract the integers
    line1 = next(f)
    a,b = map(int,line1.split())

    # create a range of integers from a to b and sum odd integers
    r = range(a,b+1)
    total = 0   # avoid naming this `sum`, which would shadow the built-in

    for i in r: 
        if i % 2 != 0:
            total = total + i

total

Given: A file containing at most 1000 lines.

Return: A file containing all the even-numbered lines from the original file. Assume 1-based numbering of lines.

Note that `enumerate` yields a tuple `(index, line)` for each `line` in file `f`. Numbering starts at 0 by default, so `start=1` is passed to get 1-based line numbers.
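
The effect of `start` can be checked without a file. The sketch below uses a small in-memory list as a stand-in for the lines of a file; the list contents are made up for illustration.

```python
# stand-in for the lines of a file (hypothetical content)
lines = ["first", "second", "third", "fourth"]

# default numbering starts at 0
print(list(enumerate(lines))[:2])

# with start=1 the numbers match 1-based line numbering,
# so "even-numbered lines" are those with number % 2 == 0
even_lines = [line for number, line in enumerate(lines, start=1) if number % 2 == 0]
print(even_lines)
```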

In [None]:
with open("rosalind_ini5.txt","r") as f: 
    for index, line in enumerate(f,start =1):
        if index % 2 == 0:
            print(line.strip())


# alternately 
with open('rosalind_ini5.txt','r') as f: 
    print(''.join(f.readlines()[1::2]))



Given: A string *s* of length at most 10000 letters.

Return: The number of occurrences of each word in s, where words are separated by spaces. Words are case-sensitive, and the lines in the output can be in any order.
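
The standard library's `collections.Counter` is designed for this kind of tally. The sketch below is one way to use it; the function name `word_counts` and the example sentence are assumptions for illustration.

```python
from collections import Counter


def word_counts(s):
    """Count case-sensitive, whitespace-separated words in s."""
    return Counter(s.split())


# hypothetical input string
counts = word_counts("We tried list and we tried dicts also we tried Zen")
for word, count in counts.items():
    print(word, count)
```

`Counter` makes a single pass over the words, and a missing key simply reads as 0.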

In [None]:
wordcount = {}

with open("rosalind_ini6.txt","r") as f: 
    for line in f:
        words = line.split()

        for word in words:
            if word in wordcount:
                wordcount[word] = wordcount[word] + 1
            else: 
                wordcount[word] = 1

for entry in wordcount: 
    print(entry + " " + str(wordcount[entry]))

In [None]:
with open("rosalind_ini6.txt","r") as f: 
    words = f.readline().split()

for word in set(words):
    print(word,words.count(word))

### A Rapid Introduction to Molecular Biology 

Cell nuclei are filled with chromatin, which contains nucleic acids. Nucleic acids are polymers, and the monomer unit is a **nucleotide**. 

**Nucleotides** are made up of a sugar, negatively charged phosphate ion, and *nucleobase* or *base* for short. Nucleotides of a specific nucleic acid will always have the same sugar and phosphate ion, differing only in the base. 

Polymerization is achieved by the bonding of the sugar and phosphate ion, forming the sugar-phosphate backbone of nucleic acid strands. Nucleic acid strands are differentiated based on the order of their bases. 

For DNA (Deoxyribonucleic Acid), deoxyribose is the sugar, and the four choices of bases are A, C, G, T (Adenine, Cytosine, Guanine, Thymine).


### Problem 
Given: A DNA string s of length at most 1000 nt.

Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.
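
A single pass over the string is enough to count all four symbols. The sketch below assumes the input contains only uppercase symbols, and silently skips anything outside "ACGT"; the function name `count_nucleotides` and the example string are illustrative.

```python
def count_nucleotides(dna):
    """Count 'A', 'C', 'G', 'T' in one pass over dna."""
    # Pre-filling the keys guarantees every base is present, even with count 0.
    counts = dict.fromkeys("ACGT", 0)
    for base in dna:
        if base in counts:
            counts[base] += 1
    return counts


# short illustrative string
counts = count_nucleotides("AGCTTTTCATTCTGAC")
print(*(counts[base] for base in "ACGT"))
```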

In [None]:
# can't use a set alone since it's unordered
# using a dictionary to count occurrences 

bases = {}

with open("rosalind_dna.txt","r") as f:
    string = f.readline().strip()

for base in set(string):
    bases[base] = string.count(base)

# .get() returns 0 for a base that never appears, avoiding a KeyError
print(bases.get("A", 0), bases.get("C", 0), bases.get("G", 0), bases.get("T", 0))

# storing variables in a dictionary allows for extension and maintenance

In [None]:

## alternatively, using a generator expression & unpacking

with open("rosalind_dna.txt","r") as fh: 
    dna = fh.read()

print(*(dna.count(nuc) for nuc in "ACGT"))


# although more compact, this version can be less readable 
# it is also less efficient for larger strings because we loop through the 
# whole string once per nucleotide (four passes in total)


### The Second Nucleic Acid 

RNA (Ribonucleic Acid), which has a different sugar (ribose), is also present in chromatin. It has the base *uracil* instead of *thymine*. 

DNA serves as a template for mRNA, which is created during transcription and can travel to the far reaches of the cell. In contrast, DNA stays in the nucleus. 

### Problem 
An RNA string is a string formed from the alphabet containing 'A', 'C', 'G', and 'U'.

Given a DNA string t corresponding to a coding strand, its transcribed RNA string u is formed by replacing all occurrences of 'T' in t with 'U' in u.

Given: A DNA string t having length at most 1000 nt.

Return: The transcribed RNA string of t.

In [None]:
with open("rosalind_rna.txt","r") as t: 
    t = t.read().strip()
    u = ""

    for bp in t: 
        if bp == "T":
            u = u + "U"
        else: 
            u = u + bp 

print(u)

# this approach works but is inefficient 
# strings are immutable - a new string is created in each iteration 
# results in O(n^2) complexity as string length grows  

In [None]:
with open('rosalind_rna.txt','r') as t: 
    t = t.read().strip()

u_l = []

for bp in t: 
    if bp == "T":
        u_l.append('U')
    else: 
        u_l.append(bp)


u = ''.join(u_l)
u

# this is a more efficient way, with O(n) complexity (linear)

In [None]:
## another alternative 
# efficient and most Pythonic, using replace() as designed

with open('rosalind_rna.txt','r') as t: 
    t = t.read().strip()
    u = t.replace("T","U")

print(u)

### The Secondary and Tertiary Structures of DNA 

The DNA molecule is a double helix made up of two strands running in opposite directions. Each base bonds to its complementary base on the opposite strand (A with T, G with C); each such bonded pair is called a **base pair**.

The tertiary structure refers to the 3D shape of the molecule, while the secondary structure is comprised of the two opposite strands and their base pairs. 

### Problem 
In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'.

The reverse complement of a DNA string s is the string s<sup>c</sup> formed by reversing the symbols of s, then taking the complement of each symbol (e.g., the reverse complement of "GTCA" is "TGAC").

Given: A DNA string s of length at most 1000 bp.

Return: The reverse complement s<sup>c</sup> of s.



In [None]:
# Note the *reverse* complement is due to the two strands running in 
# *opposite* directions 

# must reverse first and then replace! 
with open('rosalind_revc.txt','r') as s:
    s = s.read().strip()

sc_list = list(s) 
sc_list.reverse()


# list comprehension
sc_list = [
    'A' if nt == 'T' else 
    'T' if nt == 'A' else
    'G' if nt == 'C' else
    'C' if nt == 'G' else nt

    for nt in sc_list
]

sc = ''.join(sc_list)
sc



In [None]:
## more efficient method using translation table

with open('rosalind_revc.txt', 'r') as s:
    s = s.read().strip()

sc = s[::-1].translate(str.maketrans("ATGC", "TACG"))
sc


In [16]:
## Dynamic Programming 

# 1) identify the mathematical structure 
# F(n) = F(n-1) + F(n-2) for n >= 2 (n is a zero-based month index)

# 2) solve recursively: 
# a) initialize base cases: at months 1 and 2 there is 1 pair of rabbits,
#    which with zero-based indexing is F(0) = 1; F(1) = 1
# b) call the function to solve a smaller version of the problem

# 3) memoization: store previously computed values in a dictionary

# define base cases and dictionary

def fib(n, memo): 

    # create the memo dictionary if none (or an empty one) was passed in
    if not memo:
        memo = {0:1, 1:1}

    # search for previously computed solutions
    if n in memo:
        return memo[n]

    else: 
        memo[n] = fib(n-1, memo) + fib(n-2, memo)
        return memo[n]
    

# initialize base cases & call function 
memo = {0:1, 1:1} 


In [33]:
fib(1000,memo)

70330367711422815821835254877183549770181269836358732742604905087154537118196933579742249494562611733487750449241765991088186363265450223647106012053374121273867339111198139373125598767690091902245245323403501

In [14]:
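# same scenario as above, except each mature rabbit pair births k rabbit pairs
# (k = 3 in the example call below)

def fib2(n, k, memo=None): 

    # start from the base cases when no memo is supplied;
    # with an empty memo the recursion would never reach a stored value
    if memo is None: 
        memo = {0:1, 1:1}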
    
    # search for previously computed solutions
    if n in memo:
        return memo[n]

    else: 
        memo[n] = fib2(n-1,k,memo) + (k * fib2(n-2,k,memo))
        return memo[n]
    

# initialize base cases & call function 
memo = {0:1, 1:1} 

fib2(4,3,memo)

19

In [17]:

# open file 
with open("rosalind_fib.txt") as f:
    l = f.read().split()

    n = int(l[0]) - 1 # zero-based indexing 
    k = int(l[1])

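# pass a fresh memo: cached values depend on k, so reusing a memo
# filled for a different k would return wrong results
fib2(n,k,{0:1, 1:1})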

1323839213083
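
The same recurrence can also be evaluated bottom-up, which avoids both the memo dictionary and Python's recursion limit. This is a sketch under the zero-based convention used above (F(0) = F(1) = 1); the name `rabbit_pairs` is hypothetical.

```python
def rabbit_pairs(n, k):
    """Bottom-up evaluation of F(n) = F(n-1) + k * F(n-2), with F(0) = F(1) = 1."""
    previous, current = 1, 1
    # Each iteration advances one month; months 0 and 1 need no iterations.
    for _ in range(n - 1):
        previous, current = current, current + k * previous
    return current


print(rabbit_pairs(4, 3))
```

Only the two most recent values are kept, so memory use stays constant regardless of n.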

### Computing GC Content 

#### Identifying Unknown DNA Quickly
Languages can be identified using software by analyzing the frequency of each letter. Each language has its own letter frequency, and the same can be said for genomes (e.g. human vs. animal)

Although two members of the same species have different genomes, they share 99.9% of the 3.2 billion base pairs in a human genome (excluding those with major genetic defects). An average case genome such as this can be assembled for any species. 

In a double-stranded molecule, cytosine and guanine always appear in equal amounts because they pair with each other. GC content (the percentage of bases that are either cytosine or guanine) can be used to differentiate many prokaryotes and eukaryotes using small DNA samples. 


#### Problem 
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
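
The example figure and the reverse-complement remark can both be checked with a few lines. This is a minimal sketch that assumes a non-empty, uppercase DNA string; the function names are chosen here for illustration.

```python
def gc_content(dna):
    """Percentage of symbols in dna that are 'G' or 'C' (dna must be non-empty)."""
    return 100 * (dna.count("G") + dna.count("C")) / len(dna)


def reverse_complement(dna):
    """Reverse dna, then swap A<->T and C<->G."""
    return dna[::-1].translate(str.maketrans("ATGC", "TACG"))


s = "AGCTATAG"
print(gc_content(s))                      # 3 of 8 symbols are G or C
print(gc_content(reverse_complement(s)))  # complementing swaps G and C, so the total is unchanged
```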

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
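
A line-by-line parser is one way to read this format without relying on the label having a fixed length. The sketch below is an assumption-laden illustration: `parse_fasta` is a hypothetical helper, and the two records in `sample` are made up rather than taken from a Rosalind dataset.

```python
def parse_fasta(text):
    """Map each FASTA label to its sequence, joining multi-line sequences."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            label = line[1:]        # a new record begins
            records[label] = []
        elif label is not None:
            records[label].append(line)
    return {label: "".join(parts) for label, parts in records.items()}


# made-up records for illustration only
sample = """>Rosalind_0001
ACGT
GGCC
>Rosalind_0002
TTAA
"""
records = parse_fasta(sample)
print(records)
```

Keying on the '>' marker rather than on a character offset keeps the parser working if the label carries extra information after the ID.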

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below. 

Sample  Dataset: 
>\>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG

>\>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC 

>\>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output: 
>Rosalind_0808

>60.919540


In [2]:
# initialize lists 

ros_id = []
gc_content= []

# open file & split FASTA strings
with open("rosalind_gc.txt") as f:
    fasta = f.read().replace("\n","").lstrip(">").split(">")

for val in fasta:
    if val:
        # save Rosalind ID
        ros_id.append(val[:13])

        # cleanup 'Rosalind_xxxx' from strings
        string = val[13:]
        
        # calculate gc content 
        num_gc = string.count("G") + string.count("C")
        num_bases = len(string)

        gc = num_gc/num_bases * 100 
        gc_content.append(gc)

# find index of FASTA string with highest GC content
max_id = gc_content.index(max(gc_content))

# print results 
print(ros_id[max_id])
print(round(gc_content[max_id],6))
    

Rosalind_7004
51.672862


In [1]:
# using pandas 
import pandas as pd 

with open("rosalind_gc.txt") as f:
    fasta_list = f.read().replace("\n","").lstrip(">").split(">")

# populate df
df_fasta = pd.DataFrame(fasta_list,columns = ['raw'])
df_fasta['ros_id'] = df_fasta['raw'].str[:13]
df_fasta['string'] = df_fasta['raw'].str[13:]

# calculate gc content
count_gc = df_fasta['string'].str.count("G") + df_fasta['string'].str.count("C")
count_bases = df_fasta['string'].apply(len)
df_fasta['gc_content'] = (count_gc / count_bases) * 100  

# find index of max gc_content 
maxgc_id = df_fasta['gc_content'].idxmax()

# print max gc content and Rosalind ID 
print(df_fasta['ros_id'][maxgc_id])
print(f"{df_fasta['gc_content'][maxgc_id]:.6f}")

Rosalind_7004
51.672862


#### Evolution as a Sequence of Mistakes
A mutation is a mistake that occurs during the creation or copying of a nucleic acid, particularly DNA. Its effect can be harmful or beneficial. The macro effects of evolution are the accumulated result of beneficial mutations over many generations. 

Point mutation, the simplest and most common type, replaces one base with another at a single nucleotide. In the case of DNA, a point mutation changes the complementary base as well. 

Two DNA strands from different organisms are homologous if they share a recent ancestor. Counting the number of bases at which homologous strands differ gives the minimum number of point mutations that could have occurred on the evolutionary path between the two strands. 

Problem: Given two strings s and t of equal length, the Hamming distance between s and t, denoted dH(s,t), is the number of corresponding symbols that differ in s and t. 

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).

Return: The Hamming distance dH(s,t)
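
The definition translates almost directly into code. The following sketch wraps it in a function that rejects unequal lengths up front; the name `hamming_distance` is an assumption, and the two strings are the sample dataset given for this problem.

```python
def hamming_distance(s, t):
    """Number of positions at which equal-length strings s and t differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # zip pairs corresponding symbols; True counts as 1 when summed
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))
```

Checking the lengths first matters because `zip` silently stops at the shorter input.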

Sample Dataset: 
>GAGCCTACTAACGGGAT
>
>CATCGTAATGACGGCCT

Sample Output: 
>7

In [23]:
strings = []

with open("rosalind_hamm.txt") as f:
    fh = f.read().split()
    for line in fh: 
        strings.append(line)

str1 = strings[0]
str2 = strings[1]
hamm_dist = 0

# stop early on a mismatch; otherwise str_length would be undefined below
if len(str1) != len(str2):
    raise ValueError("String Length Mismatch")

str_length = len(str1)

for base in range(str_length):
    if str1[base] != str2[base]:
        hamm_dist = hamm_dist + 1
    
hamm_dist

472

In [3]:
# alternative method 

with open('rosalind_hamm.txt') as f: 
    dna1 = f.readline().strip()
    dna2 = f.readline().strip()

# zip pairs up corresponding characters; summing the booleans counts the mismatches
hamm_d = sum(a != b for a,b in zip(dna1,dna2))
hamm_d

472

#### Mendel's First Law 

For any factor, an organism randomly passes one of two alleles to each offspring, so that an individual receives one allele from each parent. 

Alleles are either dominant or recessive. The dominant phenotype is displayed by homozygous dominant and heterozygous individuals; the recessive phenotype is displayed only by homozygous recessive individuals. 

#### Probability 
The probability of an event (a collection of outcomes) can be written as the sum of the probabilities of its constituent outcomes. For dependent outcomes, the probability of an outcome is the product of the probabilities along its path from the root of the probability tree.


Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.
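
One way to organise the calculation is through the complement: the child lacks a dominant allele only when it is homozygous recessive, which can happen for heterozygous-heterozygous pairs (probability 1/4), heterozygous-recessive pairs (1/2), and recessive-recessive pairs (1). The sketch below uses exact fractions; the name `prob_dominant` is hypothetical, and it assumes a population of at least two organisms.

```python
from fractions import Fraction


def prob_dominant(k, m, n):
    """Probability that a random pair's child carries a dominant allele."""
    total = k + m + n
    total_pairs = Fraction(total * (total - 1), 2)   # C(total, 2)
    # pair counts weighted by the chance of a homozygous recessive child
    recessive = (
        Fraction(m * (m - 1), 2) * Fraction(1, 4)    # Aa x Aa
        + Fraction(m * n) * Fraction(1, 2)           # Aa x aa
        + Fraction(n * (n - 1), 2)                   # aa x aa
    )
    return 1 - recessive / total_pairs


# population with k = m = n = 2
print(float(prob_dominant(2, 2, 2)))
```

For k = m = n = 2 the exact value is 47/60, which rounds to 0.7833.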

Sample Dataset
> 2 2 2 

Sample Output 
> 0.7833

In [193]:
# this approach counts pairs using binomial coefficients:
# 1) all possible ways to choose a pair from the total population: C(k+m+n, 2)
# 2) the favorable pairings, each pair count weighted by the probability
#    that the child carries at least one dominant allele

ints = [19,25,16]

k = ints[0]
m = ints[1]
n = ints[2]

total_population = k + m + n

P_kk = (k*(k-1))/2 
P_km = k * m 
P_kn = k * n
P_mm = (m*(m-1))/2 * 0.75 
P_mn = (m * n) * 0.5
#P_nn = 0

possible = (total_population*(total_population-1))/2
favorable = P_kk + P_km + P_kn + P_mm + P_mn

P_dominant = favorable/possible

print(P_dominant)


0.7768361581920904


In [None]:
import numpy as np

# Monte Carlo Simulation approach 

# determine child genotype given 2 parent genotypes
def child(parent1,parent2):
    if parent1 == 'AA' and parent2 == 'AA':
        return 'AA'
    elif (parent1 == 'AA' and parent2 == 'Aa') or (parent1 == 'Aa' and parent2 == 'AA'): 
        return np.random.choice(['AA','Aa']) 
    elif (parent1 == 'AA' and parent2 == 'aa') or (parent1 == 'aa' and parent2 == 'AA'):
        return 'Aa'
    elif parent1 == 'Aa' and parent2 == 'Aa':
        return np.random.choice(['AA','Aa','aa'], p = [0.25,0.5,0.25])
    elif (parent1 == 'Aa' and parent2 == 'aa') or (parent1 == 'aa' and parent2 == 'Aa'):
        return np.random.choice(['Aa','aa']) 
    elif parent1 == 'aa' and parent2 == 'aa':
        return 'aa'
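    else:
        # fail loudly instead of returning None, which would break the tally
        raise ValueError(f"unknown genotypes: {parent1}, {parent2}")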

# simulate selection of 2 individuals from the total population and return probability of the child displaying the phenotype
def simulate_1A(k,m,n,trials):

    # initialize dictionary of child genotypes
    genotype_counts = {'AA':0,'Aa':0,'aa':0}

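    # build the population; sampling uniformly from this list selects
    # each genotype in proportion to k, m, and n
    population = ['AA'] * k + ['Aa'] * m + ['aa'] * n

        
    for i in range(trials):
        # draw two distinct parents (sampling without replacement)
        parents = np.random.choice(population, 2, replace=False)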
        child_genotype = child(parents[0],parents[1])
       
        genotype_counts[child_genotype] += 1


    atleast1A = genotype_counts['AA'] + genotype_counts['Aa']

    return atleast1A/trials

            
#body 
with open("rosalind_iprb.txt") as fh: 
    ints = fh.read().split()

print('k,m,n :',ints)
k,m,n = int(ints[0]), int(ints[1]), int(ints[2])

probability_1k = simulate_1A(k,m,n,1000)
probability_10k = simulate_1A(k,m,n,10000)
probability_100k = simulate_1A(k,m,n,100000)
probability_1m = simulate_1A(k,m,n,1000000)

print('Probability of at least one dominant allele')
print('Calculated from binomial coefficient (sample dataset 2 2 2): 0.78333')
print('For 1,000 trials:',probability_1k)
print('For 10,000 trials:',probability_10k)
print('For 100,000 trials:',probability_100k)
print('For 1,000,000 trials:',probability_1m)

# note: for the dataset [k,m,n] = ['19','25','16'], Rosalind accepted the simulated answer at 1m trials, 0.777247
# calculated using the binomial coefficient, the solution was 0.776836, a relative difference of 0.053% 

k,m,n : ['19', '25', '16']
Probability of at least one dominant allele
Calculated from binomial coefficient (sample dataset 2 2 2): 0.78333
For 1,000 trials: 0.77
For 10,000 trials: 0.7819
For 100,000 trials: 0.77533
For 1,000,000 trials: 0.777247


In [184]:
#testing np.random.choice() when not given p array 

countings = {'A':0,'B':0,'C':0}

array = ['A']*5 + ['B']*3 + ['C']*2
trials = 10000

for i in range (trials): 
    selection = np.random.choice(array,1,False)
    countings[selection[0]] += 1 

print('A:',countings['A']/trials)
print('B:',countings['B']/trials)
print('C:',countings['C']/trials)

A: 0.5016
B: 0.2993
C: 0.1991


In [204]:
with open("datasets/rosalind_fib.txt") as fh: 
    f = fh.read().split()
    print(f)

['35', '3']
