### Code Review for 3.3 Challenges
Filter and count the Iris data set. Operate on strings and lists, read and write files. 

In [2]:
import requests

In [4]:
## Download the iris data set and write it to a file
url2 = "http://eaton-lab.org/data/iris-data-dirty.csv"
file = open("./iris-data-dirty.csv", 'w') #first open up an empty doc
file.write(requests.get(url2).text) #Then access the data from online and write it to the newly created file.
file.close() #Close the file

##### Code Review
This is a good way of writing the data to a file. By comparison, my code did not fulfill the requirement of "writing it to a file"; I simply downloaded the data into a `response` object.


In [9]:
## read in the iris data set from its filepath and store the data as a string
iris = open("./iris-data-dirty.csv", 'r')
iris_data = iris.read() #reads data into a long string
iris_data

##### Code Review
This is similar to what I did, but the code never closes the file after reading it. It needs one more line, `iris.close()`, after the call to `read()`.
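One way to avoid forgetting the `close()` call is a `with` block, which closes the file automatically when the block ends. The sketch below is only an illustration: `sample_text` and `sample_path` are hypothetical stand-ins for the downloaded data and its file path, so that the example runs without a network connection.

```python
import os
import tempfile

# Hypothetical stand-in for the downloaded CSV text (not the real data set).
sample_text = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
sample_path = os.path.join(tempfile.mkdtemp(), "iris-sample.csv")

# Write the text; the file is closed as soon as the block ends.
with open(sample_path, "w") as outfile:
    outfile.write(sample_text)

# Read it back into one long string; again no explicit close() is needed.
with open(sample_path, "r") as infile:
    iris_data = infile.read()

print(infile.closed)  # True
```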


In [10]:
## replace "setsa" with "setosa" and "colour" with "color" in the string data
new_iris1 = iris_data.replace('setsa', 'setosa')
new_iris2 = new_iris1.replace('colour', 'color')

##### Code Review
I also used the string `replace` method, which I believe is the simplest way to do this task.



In [11]:
## split the string to convert it into a list of lines from the file
iris_list = new_iris2.split()

##### Code Review
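This is the same as what I did. Note that `split()` with no arguments splits on any whitespace, not only on newlines, so it yields one element per line only when no line contains a space or tab. `split('\n')` or `splitlines()` would state the intent more directly.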
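To illustrate how the default `split()` behaves, here is a small sketch on a made-up two-line string (the sample text is hypothetical and contains a space on purpose; it is not taken from the iris file):

```python
# Hypothetical sample: the first line contains a space inside the species name.
text = "5.1,3.5,Iris setosa\n4.9,3.0,Iris-setosa\n"

by_whitespace = text.split()   # splits on spaces AND newlines
by_line = text.splitlines()    # splits on line boundaries only

print(by_whitespace)  # ['5.1,3.5,Iris', 'setosa', '4.9,3.0,Iris-setosa']
print(by_line)        # ['5.1,3.5,Iris setosa', '4.9,3.0,Iris-setosa']
```

Neither result keeps a trailing `\n` on its elements.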


In [None]:
## strip the newline character from the end of each list element
#Ans: There is no newline character from the end of each list element to remove. See below: 
print(iris_list)

##### Code Review
I also noticed that no element in the output ends with a `\n` character. I still did the following to make sure there was no hidden `\n` that was not displayed:
```
data_list_edit = []
for i in range(len(data_list)):
    data_list_edit.append(data_list[i].strip('\n'))
```
The code above iterates over the entire data list and strips `\n` from each element, if there is any. I then used `append` to add each stripped element, in order, to a new list, `data_list_edit`.
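The same loop could probably be written more compactly as a list comprehension. This is a sketch only; the contents of `data_list` below are a hypothetical sample, not the real data.

```python
# Hypothetical sample list; only the first element ends with a newline.
data_list = ["5.1,3.5,1.4,0.2,Iris-setosa\n", "4.9,3.0,1.4,0.2,Iris-setosa", ""]

# Strip a trailing or leading "\n" from every element, keeping the order.
data_list_edit = [line.strip("\n") for line in data_list]

print(data_list_edit)
# ['5.1,3.5,1.4,0.2,Iris-setosa', '4.9,3.0,1.4,0.2,Iris-setosa', '']
```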

In [None]:
## remove any lines that are empty or have "NA" in them.
for i in range(len(iris_list)): 
    if 'NA' in iris_list[i]: 
        iris_list.remove(iris_list[i])
print(iris_list)

##### Code Review
I noticed that the list already has no empty lines, so it is understandable that the code does nothing to remove them. I still used `filter` to drop any empty elements from the list.

For removing lines with `NA`, this code is simpler than mine. But I am a bit worried: removing an element **shifts the index of every element after it**, while the loop index `i` does not reflect the change. So when two consecutive lines both contain `NA`, the second one is skipped. In addition, `range(len(iris_list))` is computed once from the original length, so after any removal `i` eventually runs past the end of the shortened list and raises an `IndexError`.

I think there are two solutions. First, loop over the indices in reverse and delete with `del iris_list[i]`, so that a removal only shifts elements that have already been checked. (Adding `i = i - 1` inside the `if` would not help, because the `for` loop reassigns `i` on the next iteration.) Second, and this is what I did: create an empty list and append every line without `NA` to it.
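A minimal sketch of the new-list approach is below. The sample `iris_list` is hypothetical and deliberately has two consecutive `NA` lines plus one empty line; `clean_list` is a name chosen for illustration.

```python
# Hypothetical sample with two consecutive "NA" lines and one empty line.
iris_list = [
    "5.1,3.5,setosa",
    "NA,3.0,setosa",
    "4.7,NA,setosa",
    "",
    "6.3,3.3,virginica",
]

# Build a new list instead of removing from the list being looped over.
clean_list = []
for line in iris_list:
    # Keep the line only if it is non-empty and does not contain "NA".
    if line and "NA" not in line:
        clean_list.append(line)

print(clean_list)  # ['5.1,3.5,setosa', '6.3,3.3,virginica']
```

Because the original list is never modified during the loop, no element is skipped and no index goes out of range.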

In [15]:
## concatenate the list back into a string with newline characters between lines
iris_string = '\n'.join(iris_list)
print (iris_string)

##### Code Review
This is the same as what I did.

In [14]:
## write the string to a new file called "iris-data-clean.csv"
clean_file = open("./iris-data-clean.csv", "w")
clean_file.write(iris_string)
clean_file.close()

##### Code Review
This is a better way of doing it. I did not know that `open` in `"w"` mode creates the file if it does not already exist, so I first built the path with `os.path.join` before creating the new file. Everything afterwards is the same.

### Code Review for 3.4 Challenges
Write proper functions that include a documentation string and comments.

**A. Generate and return a random sequence of bases of length N**

In [18]:
#Write a function that will generate and return a random sequence of bases of length N.
import random
def random_seq(N):
    ##Create a new sequence that randomly chooses N bases with replacement from A, T, C, or G. 
    new_seq = random.choices('ATCG', k=N)
    ## Turn this new sequence from a list to a string with the join() function. 
    final_seq = ''.join(new_seq)
    return final_seq

random_seq(5)

'TTGTA'

##### Code Review
The function itself and the comments are both very good, but there is no documentation string, such as `'''Return a random sequence of bases of length N.'''`, as the first statement of the function body.

For the function, I took a more complicated route: I created a list of the base names and called `random.choice` N times in a `for` loop.
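For illustration, the loop-based variant with a documentation string might look roughly like the sketch below. It is a reconstruction from the description above, not the actual submitted code, and the name `random_seq_loop` is hypothetical.

```python
import random

def random_seq_loop(N):
    """Return a random sequence of bases of length N."""
    bases = ["A", "T", "C", "G"]
    seq = ""
    # Draw one base at a time, N times, with replacement.
    for _ in range(N):
        seq += random.choice(bases)
    return seq

print(random_seq_loop(5))
print(random_seq_loop.__doc__)  # Return a random sequence of bases of length N.
```

The docstring is what `help(random_seq_loop)` displays, which is why it belongs inside the function rather than in a comment above it.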

**B. Write a function to calculate and return the frequency of As, Cs, Ts and Gs in a sequence.**

In [22]:
def freq_base(seq):
    ##Creating empty counts for each base. 
    A=0
    C=0
    T=0
    G=0
    ##Iterating over the sequence and adding to the count each time a base matches. 
    for i in seq: 
        if i == 'A': 
            A += 1
        if i == 'C':
            C += 1
        if i == 'T':
            T += 1
        if i == 'G':
            G += 1
    print ("There are {} As, {} Cs, {} Ts, and {} Gs in this sequence.".format(A,C,T,G))
    
freq_base("AAATTTGGGCCC")

There are 3 As, 3 Cs, 3 Ts, and 3 Gs in this sequence.


##### Code Review
As with the last one, there needs to be a documentation string.

Also, I think the question asks for the frequency, so there may need to be an extra step that divides each count by `len(seq)`, the total number of bases.

The `print` works fine for display, but the task says to "calculate and return" the frequencies. A function that only prints returns `None`, so its result cannot be reused by other code. Using `return` would be more in keeping with the task.
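As a sketch of these suggestions combined (a docstring, frequencies instead of counts, and `return` instead of `print`), something like the following could work. The name `base_frequencies` and the choice of a dictionary as the return type are my assumptions; the task does not specify either.

```python
def base_frequencies(seq):
    """Return a dict mapping each base (A, C, T, G) to its frequency in seq."""
    total = len(seq)
    # Guard against dividing by zero for an empty sequence.
    if total == 0:
        return {base: 0.0 for base in "ACTG"}
    # str.count gives the number of occurrences; dividing gives the frequency.
    return {base: seq.count(base) / total for base in "ACTG"}

freqs = base_frequencies("AAATTTGGGCCC")
print(freqs)  # {'A': 0.25, 'C': 0.25, 'T': 0.25, 'G': 0.25}
```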

**C. Write a function to concatenate (join end-to-end) two sequences and return it.**

In [23]:
def join_seq(seq1, seq2):
    ##Join two strings 
    joined = seq1 + seq2
    ##Return one string
    return (joined)
    
join_seq("AAAATTTCCC", "GGGGGGCCA")

'AAAATTTCCCGGGGGGCCA'

##### Code Review
I did exactly the same thing. It just needs a documentation string.

**D. Write a function to take two sequences of different lengths and return both trimmed down to be the same length.**

In [25]:
def trim_join(seq1, seq2): 
    ## Get the shortest input sequence length. slen is an integer.
    slen = min([len(i) for i in (seq1, seq2)])
    ## Create new sequences with the minimum length (so you don't have to figure out which one is the short one)
    new_seq1 = seq1[:slen]
    new_seq2 = seq2[:slen]
    return [new_seq1,new_seq2]

trim_join("AAAAATTTT", "AAAAATTCTCTGGGGGGG")

['AAAAATTTT', 'AAAAATTCT']

##### Code Review
This is a good way to do the trim, and different from mine. It first computes the shorter of the two sequence lengths and then slices both sequences to that length. One of the two is left unchanged, and the other is trimmed shorter.

My way is slightly different: I used an `if ... else ...` statement to handle the two possibilities. I think both methods are fine.
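For comparison, an `if ... else ...` version might look roughly like this. It is a sketch reconstructed from the description above rather than the actual submitted code, and `trim_if_else` is a hypothetical name.

```python
def trim_if_else(seq1, seq2):
    """Return seq1 and seq2, both trimmed to the length of the shorter one."""
    if len(seq1) > len(seq2):
        # seq1 is longer, so cut it down to the length of seq2.
        seq1 = seq1[:len(seq2)]
    else:
        # seq2 is longer (or equal), so cut it down to the length of seq1.
        seq2 = seq2[:len(seq1)]
    return [seq1, seq2]

print(trim_if_else("AAAAATTTT", "AAAAATTCTCTGGGGGGG"))
# ['AAAAATTTT', 'AAAAATTCT']
```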

**E. Write a function to return the proportion of bases across the shared length between two sequences that are the same.**

In [30]:
def prop_base(seq1, seq2):
    #Uses my trim_join function to trim the sequences to the same length as the shortest seq and reassigns the elements in the output list to independently named strings.
    trim1, trim2 = trim_join(seq1, seq2)
    ## A counter to store the number of similarities.
    count = 0
    for i in range(len(trim1)):
        ##check how many bases are similar between the two sequences.
        if trim1[i] == trim2[i]: 
            count += 1
    ##Returns the proportion of similarity: the number of base matches divided by the total number. 
    return count/len(trim1)

prop_base("AACCCCCCCCCC", "AAGGCT")

0.5

##### Code Review
This function is nicely designed. I did it in a similar way, but I return something different. I am actually not sure what is supposed to be returned. Here the proportion of identical bases is returned as a number; my function returns a string stating how many bases out of the shared length are the same.
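One possible middle ground is to return the number and let the caller format it into a sentence if needed. The sketch below assumes that reading of the task. `prop_same` is a hypothetical name, and it uses `zip`, which stops at the end of the shorter sequence, instead of calling `trim_join`.

```python
def prop_same(seq1, seq2):
    """Return the proportion of matching bases over the shared length."""
    shared = min(len(seq1), len(seq2))
    # Guard against dividing by zero when either sequence is empty.
    if shared == 0:
        return 0.0
    # zip pairs bases position by position and stops at the shorter sequence.
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / shared

result = prop_same("AACCCCCCCCCC", "AAGGCT")
print(result)  # 0.5
print("{:.0%} of the shared bases match".format(result))  # 50% of the shared bases match
```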