# Code-review (Aprocton)

## Notebook 3.3: File objects and requests

### Download the iris data set and write it to a file

In [4]:
import requests

In [5]:
with open("./iris-data-dirty.csv", 'w') as outfile:
    outfile.write(requests.get("http://eaton-lab.org/data/iris-data-dirty.csv").text)

Nesting the `requests.get` call inside the `outfile.write` call was a nice touch. It saved you a few lines of code.

### read in the iris data set from its filepath and store the data as a string

In [7]:
irisfile = "./iris-data-dirty.csv"
with open(irisfile, 'r') as infile:
    datastr = infile.read()
print(datastr[:100])

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,


### replace "setsa" with "setosa" and "colour" with "color" in the string data

In [8]:
datastr = datastr.replace("setsa", "setosa")
datastr = datastr.replace("colour", "color")

I used the same method (`str.replace`) here.

### split the string to convert it into a list of lines from the file

In [10]:
data = datastr.split("\n")
data

['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa',
 '5.4,3.7,1.5,0.2,Iris-setosa',
 '4.8,3.4,1.6,0.2,Iris-setosa',
 '4.8,3.0,1.4,0.1,Iris-setosa',
 '4.3,3.0,1.1,0.1,Iris-setosa',
 '5.8,4.0,1.2,0.2,Iris-setosa',
 '5.7,4.4,1.5,0.4,Iris-setosa',
 '5.4,3.9,1.3,0.4,Iris-setosa',
 '5.1,3.5,1.4,0.3,Iris-setosa',
 '5.7,3.8,1.7,0.3,Iris-setosa',
 '5.1,3.8,1.5,0.3,Iris-setosa',
 '5.4,3.4,1.7,0.2,Iris-setosa',
 '5.1,3.7,1.5,0.4,Iris-setosa',
 '4.6,3.6,1.0,0.2,Iris-setosa',
 '5.1,3.3,1.7,0.5,Iris-setosa',
 '4.8,3.4,1.9,0.2,Iris-setosa',
 '5.0,3.0,1.6,0.2,Iris-setosa',
 '5.0,3.4,1.6,0.4,Iris-setosa',
 '5.2,3.5,1.5,0.2,Iris-setosa',
 '5.2,3.4,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.6,0.2,Iris-setosa',
 '4.8,3.1,1.6,0.2,Iris-setosa',
 '5.4,3.

### strip the newline character from the end of each list element

In [18]:
for line in data:
    line.strip()

I used `.rstrip()` here, which removes trailing characters only (whitespace, including newlines, by default). I'm not sure `.strip()` is ideal here, because by default it removes whitespace from both ends of the string rather than just the trailing newline, which is not what you want to do here.
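To make the difference concrete, here is a minimal sketch using made-up sample lines (the `sample` list is hypothetical, not the real file contents). It also shows that calling `.strip()` in a loop without storing the result leaves the list unchanged, because strings are immutable; a list comprehension is one way to keep the stripped values.

```python
# Hypothetical sample lines, standing in for the notebook's `data` list
sample = ["5.1,3.5,1.4,0.2,Iris-setosa\n", "  4.9,3.0,1.4,0.2,Iris-setosa \n"]

# Strings are immutable: strip() returns a new string, so this loop changes nothing
for line in sample:
    line.strip()
unchanged = list(sample)

# rstrip("\n") removes only trailing newline characters
no_newlines = [line.rstrip("\n") for line in sample]

# strip() with no argument removes whitespace from both ends
no_whitespace = [line.strip() for line in sample]

print(no_newlines)
print(no_whitespace)
```

Since `data` was produced by `split("\n")`, its elements should no longer contain `\n` in the first place, so this step may matter less than the exercise suggests.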

### remove any lines that are empty or have "NA" in them.

In [14]:
clean = []

for line in data:
    if ('NA' not in line) and ('Iris' in line):
        clean.append(line)

It was very smart to use `and ('Iris' in line)` here. I ended up having another `if` statement inside the loop.
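The same combined condition also fits in a list comprehension. A small sketch, using hypothetical sample lines rather than the real file:

```python
# Hypothetical sample lines: one valid, one empty, one with "NA", one valid
sample = [
    "5.1,3.5,1.4,0.2,Iris-setosa",
    "",
    "NA,3.0,1.4,0.2,Iris-setosa",
    "6.3,3.3,6.0,2.5,Iris-virginica",
]

# Keep a line only if it has no "NA" and contains "Iris" (empty lines fail the second test)
clean = [line for line in sample if "NA" not in line and "Iris" in line]

print(clean)
```

Note that `'NA' not in line` is a substring test, so it would also drop a line where "NA" appears inside a longer word; that seems fine for this data set, but it is an assumption about the file contents.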

### concatenate the list back into a string with newline characters between lines

In [15]:
cleanstring = '\n'.join(clean)

### write the string to a new file called "iris-data-clean.csv"

In [19]:
with open("./iris-data-clean.csv", 'w') as outfile:
    outfile.write(cleanstring)

## Notebook 3.4: Getting started with functions

### A. Write a function that will generate and return a random sequence of bases of length N. 

In [20]:
import random

def randseq(N):
    "A function to return a random sequence of DNA bases of length N (N is an integer)"
   
    ## initialize a list of bases
    bases = "ACGT"

    ## initialize a list to hold the sequence
    seq  = []
    
    ## iterate and append a random base each time
    for i in range(N):
        seq.append(random.choice(bases))
    
    return(seq)

randseq(9)

['G', 'G', 'A', 'T', 'C', 'G', 'C', 'T', 'T']

Nice code. I used a different approach, though. Instead of returning a list of bases, I wanted the function to return a string. I used `random.choices(bases, k=N)`, which returns a list of `k` characters sampled with replacement from `bases`; the list can then be joined into a string with `"".join()`.

I also appreciate the comments throughout the code; they make it easier to follow.
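For reference, the string-returning approach could look roughly like this. The function name `randseq_str` and the `bases` default argument are my own choices for illustration, not part of the notebook.

```python
import random

def randseq_str(n, bases="ACGT"):
    """Return a random DNA sequence of length n as a string."""
    # random.choices samples with replacement and returns a list of characters
    picks = random.choices(bases, k=n)
    # join the list into a single string
    return "".join(picks)

seq = randseq_str(9)
print(seq)
```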

### B. Write a function to calculate and return the frequency of As, Cs, Ts and Gs in a sequence.

In [21]:
def basefreq(seq):
    "A function to calculate the frequency of nucleobases in a genetic sequence (seq)"
    
    ## initialize base variables
    A = 0
    C = 0
    G = 0
    T = 0
    other = 0
    
    ## iterate through the sequence and count each base
    for base in seq:
        if base == 'A':
            A += 1
        elif base == 'C':
            C += 1
        elif base == 'G':
            G += 1
        elif base == 'T':
            T += 1
        else:
            other += 1
    
    ## calculate frequencies
    freqA = A/len(seq)
    freqC = C/len(seq)
    freqG = G/len(seq)
    freqT = T/len(seq)
    freq_other = other/len(seq)
    
    ## print frequencies
    freqstr = "Frequency of Bases:\nA: " + str(freqA) + "\nC: " + str(freqC) + "\nG: " + str(freqG) + "\nT: " + str(freqT) + "\nOther: " + str(freq_other)
    
    print(freqstr)
    
    ## return frequencies
    return [A/len(seq), C/len(seq), G/len(seq), T/len(seq), other/len(seq)]

basefreq("ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTA")

Frequency of Bases:
A: 0.30612244897959184
C: 0.22448979591836735
G: 0.2653061224489796
T: 0.20408163265306123
Other: 0.0


[0.30612244897959184,
 0.22448979591836735,
 0.2653061224489796,
 0.20408163265306123,
 0.0]

I did something very similar, but used the `str.count()` method, which saved me many lines of code. It's not clear to me why you included the 'other' category, though.
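A rough sketch of the `str.count()` version is below. The name `basefreq_count` is hypothetical, and I kept an "other" bucket on the assumption that it is meant to catch characters outside the four bases (for example `N` or lowercase letters); that is a guess at the intent, not something the notebook states.

```python
def basefreq_count(seq):
    """Return the frequency of A, C, G, T and any other characters in seq."""
    total = len(seq)
    # str.count() tallies each base in a single call
    counts = {base: seq.count(base) for base in "ACGT"}
    # whatever is left over is not one of the four bases
    counts["other"] = total - sum(counts.values())
    return {key: value / total for key, value in counts.items()}

freqs = basefreq_count("AACGTN")
print(freqs)
```

Like the original, this sketch divides by `len(seq)`, so an empty sequence would raise a `ZeroDivisionError`.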

### C. Write a function to concatenate (join end-to-end) two sequences and return it

In [22]:
def catseq(seq1, seq2):
    "A function to concatenate two sequences"
    
    if (type(seq1) != str) or (type(seq2) != str):
        print("Error: both sequences must be strings.")
    else:
        return(seq1 + seq2)

print(catseq("ACAGAGTCGCCAGGAGATGACAG", "AAAGGTCTGGGTTACAACTCTCTCTA"))
catseq("ACAGAGTCGCCAGGAGATGACAG", 7)

ACAGAGTCGCCAGGAGATGACAGAAAGGTCTGGGTTACAACTCTCTCTA
Error: both sequences must be strings.


Nice job adding the error-check step.
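One possible variation on the error check: when the check fails, the function above prints a message and then implicitly returns `None`, which a caller could mistake for a result. Raising an exception is an alternative. A minimal sketch, with `catseq_strict` as a hypothetical name:

```python
def catseq_strict(seq1, seq2):
    """Concatenate two sequences, raising TypeError if either is not a string."""
    # isinstance also accepts subclasses of str, unlike a type() comparison
    if not isinstance(seq1, str) or not isinstance(seq2, str):
        raise TypeError("both sequences must be strings")
    return seq1 + seq2

print(catseq_strict("ACAGAGTC", "AAAGG"))

# The caller decides how to handle bad input
try:
    catseq_strict("ACAGAGTC", 7)
except TypeError as err:
    message = str(err)
    print("Error:", message)
```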

### D. Write a function to take two sequences of different lengths and return both trimmed down to be the same length.

In [24]:
def trimseq(seq1, seq2):
    "A function to take two sequences and return both of them trimmed down to the length of the shorter sequence"
    
    ## depending which sequence is longer, trim one sequence
    if (len(seq1) > len(seq2)):
        seq1 = seq1[:len(seq2)]
        return [seq1, seq2]
    elif (len(seq1) < len(seq2)):
        seq2 = seq2[:len(seq1)]
        return [seq1, seq2]
    else:                      # if sequences are the same length
        return [seq1, seq2]

print(trimseq("ACAGAGTC", "GCCAGGAGATGACAG"))
print(trimseq("ACAGAGTCGCCAGG", "AGATGACAG"))
print(trimseq("ACAGAGTCGCC", "AGGAGATGACA"))    # equal length sequences

['ACAGAGTC', 'GCCAGGAG']
['ACAGAGTCG', 'AGATGACAG']
['ACAGAGTCGCC', 'AGGAGATGACA']


Nice. I used Deren's example: `slen = min([len(i) for i in (seq1, seq2)])` to find the minimum length and then return only the first `slen` characters of each sequence.
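Written out as a full function, that approach might look like the sketch below (`trimseq_min` is a hypothetical name). It needs no branching, because slicing a sequence to its own length leaves it unchanged.

```python
def trimseq_min(seq1, seq2):
    """Return both sequences trimmed to the length of the shorter one."""
    # find the shorter of the two lengths
    slen = min([len(i) for i in (seq1, seq2)])
    # slicing past the end is harmless, so both sequences get the same treatment
    return [seq1[:slen], seq2[:slen]]

print(trimseq_min("ACAGAGTC", "GCCAGGAGATGACAG"))
print(trimseq_min("ACAGAGTCGCCAGG", "AGATGACAG"))
print(trimseq_min("ACAGAGTCGCC", "AGGAGATGACA"))
```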

### E. Write a function to return the proportion of bases across the shared length between two sequences that are the same.

In [25]:
def sharebase(seq1, seq2):
    "A function to take two sequences and return the proportion of identical bases across a shared length"
    
    ## use trimseq to trim the sequences to equal length
    seqs = trimseq(seq1, seq2)
    
    ## initialize count of identical bases
    ident = 0
    
    ## iterate through both sequences and count identical bases
    for i in range(len(seqs[0])):
        if (seqs[0][i] == seqs[1][i]):
            ident += 1
    
    return ident/len(seqs[1])

sharebase("ACAGAGTCGCCAGGAGATGACAG", "AAAGGTCTGGGTTACAACTCTCTCTA")

0.21739130434782608

Looks good. However, you could have reused the `basefreq(seq)` function you created earlier.