# 3-Python-basics Code Review
### 2.18.2018
#### Reviewer: Alexander M. Procton (aprocton)
#### Code by: Nicholas Locatelli (mistergroot)
## Notebook 3.3

In [3]:
import os
import gzip
import requests

### Download the iris data set and write it to a file

In [4]:
iris = "http://eaton-lab.org/data/iris-data-dirty.csv"
ffile = open("./iris-data-dirty.csv", 'w')
ffile.write(requests.get(iris).text)
ffile.close()

__aprocton:__ This (and the file operations in the other challenges) could have been done using a `with ... as` block, which closes the file automatically, for example:

```python
with open("./iris-data-dirty.csv", 'w') as outfile:
    outfile.write(requests.get("http://eaton-lab.org/data/iris-data-dirty.csv").text)
```
### read in the iris data set from its filepath and store the data as a string

In [5]:
irisstring = open("./iris-data-dirty.csv", 'r')
irisread = irisstring.read()

__aprocton:__ In this cell, you should call `irisstring.close()` after reading to close the file handle for `iris-data-dirty.csv`.
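
A minimal sketch of the same read done with a context manager, so that no explicit `close()` call is needed. The helper name `read_text` is hypothetical and does not appear in the notebook.

```python
def read_text(path):
    """Return the full contents of a text file as one string."""
    # The with-block closes the file handle on exit, even if read() raises.
    with open(path, "r") as infile:
        return infile.read()
```

With this helper, the cell above would reduce to `irisread = read_text("./iris-data-dirty.csv")`.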
### replace "setsa" with "setosa" and "colour" with "color" in the string data

In [6]:
iris1=irisread.replace("setsa", "setosa")
iris2=iris1.replace("colour", "color")

__aprocton:__ The result of each `replace()` call could have been assigned back to `irisread`, so that no new variables would need to be created.
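
One way to avoid the intermediate names is to chain the two calls. This is only a sketch; the function name `fix_typos` is hypothetical.

```python
def fix_typos(text):
    """Correct the two known misspellings in the raw CSV text."""
    # str.replace returns a new string, so the calls can be chained
    # and the result bound to a single name.
    return text.replace("setsa", "setosa").replace("colour", "color")
```

In the notebook this would be used as `irisread = fix_typos(irisread)`.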
### split the string to convert it into a list of lines from the file
### strip the newline character from the end of each list element

In [7]:
irisstrip=iris2.strip().split('\n')

### remove any lines that are empty or have "NA" in them.

In [18]:
clean=open("irisNA", "w")
for line in irisstrip:
    if "NA" not in line:
        clean.write(line + "\n")
clean.close()

__aprocton:__ The `if` statement in this loop only filters out "NA"; it should also skip blank lines, either with a second `if` or with a combined condition. It was also not necessary to write `clean` to a file instead of storing the lines in a list. My code, which uses a combined condition and a list:

```python
clean = []

for line in data:
    if ('NA' not in line) and ('Iris' in line):
        clean.append(line)
```
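
As an alternative sketch, the blank-line check can be written explicitly instead of relying on a species name being present. The function name `clean_lines` is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
def clean_lines(lines):
    """Drop lines that are blank or contain the substring 'NA'."""
    # line.strip() is an empty string (falsy) for blank or whitespace-only lines.
    # Like the original loop, this matches 'NA' anywhere in the line.
    return [line for line in lines if line.strip() and "NA" not in line]
```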
### concatenate the list back into a string with newline characters between lines

In [21]:
clean2=open("irisNA","r")
final=clean2.read()
clean2.close()

__aprocton:__ You essentially already did this by writing `line + "\n"` to `clean` in the last cell.
If the data were in a list of lines, you could use `"\n".join(lines)` to create a single string.
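
The two approaches differ slightly in their output, which the short sketch below illustrates. The sample rows are made up for illustration and are not taken from the iris file.

```python
lines = ["5.1,3.5,Iris-setosa", "4.9,3.0,Iris-setosa"]  # made-up sample rows

# Writing line + "\n" for every line leaves a trailing newline...
written = "".join(line + "\n" for line in lines)
# ...whereas join only puts newlines between elements.
joined = "\n".join(lines)
```

Here `written` is equal to `joined + "\n"`, so the file-based version ends with one extra newline.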
### write the string to a new file called "iris-data-clean.csv"

In [23]:
curdir = os.path.abspath('.')
clean2 = os.path.join(curdir, "iris-data-clean.csv")
ffile = open("iris-data-clean.csv", 'w')
ffile.write(final)
ffile.close()

__aprocton:__ I'm not sure that 
```python
curdir = os.path.abspath('.')
clean2 = os.path.join(curdir, "iris-data-clean.csv")
```
was necessary, since the notebook should already be running in the directory it was downloaded to. Note also that the path stored in `clean2` is never passed to `open()`, so it goes unused. I'm not sure, but I think you can just use `"./iris-data-clean.csv"` as your output path.
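
A small sketch of why the explicit join is probably redundant: a relative path is resolved against the current working directory anyway. The variable names here are hypothetical.

```python
import os

relative = "iris-data-clean.csv"
# Joining abspath('.') with the file name spells out by hand
# what happens to a relative path when it is resolved.
explicit = os.path.join(os.path.abspath("."), relative)
resolved = os.path.abspath(relative)
```

Both names end up holding the same absolute path, so `open(relative, 'w')` and `open(explicit, 'w')` would write to the same file.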

## Notebook 3.4
### A. Write a function that will generate and return a random sequence of bases of length N.

In [24]:
import random
def randseq(x):
    "Return x number of random bases"
    bases=["A","T","G","C"]
    return random.choices(bases,k=x)

# testing
randseq(10)

['G', 'T', 'T', 'T', 'T', 'G', 'A', 'A', 'G', 'C']

__aprocton:__ I used `random.choice()` rather than `random.choices()`; using `random.choices()` as you did would have made my code much simpler.
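
Note that `random.choices()` returns a list, as the output above shows. If a string is preferred, the picks can be joined, as in this sketch (the name `random_seq` is hypothetical):

```python
import random

def random_seq(n):
    """Return a random DNA sequence of length n as a string."""
    # random.choices samples with replacement and returns a list,
    # so join the picks to get a single string.
    return "".join(random.choices("ATGC", k=n))
```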

### B. Write a function to calculate and return the frequency of As, Cs, Ts and Gs in a sequence.

In [25]:
def freqATGC(x):
    "Count number of occurrences of each nucleotide (A,T,G,C, in that order) and divide by the total length of the sequence to get the frequency"
    countA=x.count("A")
    countT=x.count("T")
    countG=x.count("G")
    countC=x.count("C")
    bases=["A","T","G","C"]
    return (countA/(len(x)),countT/(len(x)),countG/(len(x)),countC/(len(x)))

__aprocton:__ Interesting that you decided to return the frequencies as a tuple. My code included an "other" option in case there are errors in the sequence, but did not use `count()`.
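
The two ideas can be combined: `count()` for the four bases plus an "other" bucket for anything else. This is a sketch rather than either author's code; the name `base_freqs`, the dict return type, and raising on empty input are all assumptions.

```python
def base_freqs(seq):
    """Return a dict of frequencies for A, T, G, C and 'other'."""
    if len(seq) == 0:
        # Assumption: treat an empty sequence as an error rather than divide by zero.
        raise ValueError("sequence is empty")
    counts = {base: seq.count(base) for base in "ATGC"}
    # Anything that is not A, T, G or C is lumped into 'other'.
    counts["other"] = len(seq) - sum(counts.values())
    return {base: n / len(seq) for base, n in counts.items()}
```

Returning a dict keeps each frequency attached to its base, so the caller does not have to remember the A, T, G, C ordering of a tuple.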
### C. Write a function to concatenate (join end-to-end) two sequences and return it

In [26]:
def comboseq(seq1, seq2):
    "Concatenate two sequences into a single sequence"
    return seq1 + seq2

__aprocton:__ I chose to use an `if` statement to check if both sequences were strings and return an error if not.
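
A minimal sketch of such a check. Raising `TypeError` is an assumption about how the error is reported, and the name `concat_seqs` is hypothetical.

```python
def concat_seqs(seq1, seq2):
    """Join two sequences end-to-end, requiring both to be strings."""
    if not (isinstance(seq1, str) and isinstance(seq2, str)):
        # Assumption: signal bad input with an exception.
        raise TypeError("both sequences must be strings")
    return seq1 + seq2
```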

### D. Write a function to take two sequences of different lengths and return both trimmed down to be the same length.

In [27]:
def trim(seq1, seq2):
    "Returns seq1 and seq2 trimmed to the length of the shortest sequence"
    ## finding the length of the shortest sequence
    slen = min([len(i) for i in (seq1, seq2)])
    return ((seq1[0:slen]),(seq2[0:slen]))

__aprocton:__ I did not use `min()`, which necessitated the use of an `if-elif` ladder.

### E. Write a function to return the proportion of bases across the shared length between two sequences that are the same.

In [30]:
def prop(seq1, seq2):
    "return the proportion of similarity between two trimmed sequences"
    ## a counter to store the number of similarities
    count = 0    
    ## finding the length of the shortest sequence
    slen = min([len(i) for i in (seq1, seq2)])
    ## trimming each sequence to the length of the shortest
    seq1=str(seq1[0:slen])
    seq2=str(seq2[0:slen])
    
    ## modified from example; count shared bases from trimmed sequences and then return the 
    ## count divided by the total number of bases
    for idx in range(slen):
        if seq1[idx] == seq2[idx]:
            count += 1
    return (count)/(len(seq2))

__aprocton:__ You could have called the `trim()` function you wrote in the previous cell to trim both sequences, instead of recomputing `slen` and slicing again here.
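
A sketch of that reuse, with `trim()` restated so the block runs on its own. The name `prop_shared` and the error raised for empty input are assumptions. It also drops the `str()` conversion, which would turn a list such as `['A', 'T']` into its printed representation rather than into `"AT"`.

```python
def trim(seq1, seq2):
    """Return both sequences cut to the length of the shorter one."""
    n = min(len(seq1), len(seq2))
    return seq1[:n], seq2[:n]

def prop_shared(seq1, seq2):
    """Return the proportion of matching positions over the shared length."""
    seq1, seq2 = trim(seq1, seq2)
    if len(seq1) == 0:
        # Assumption: report an empty comparison explicitly.
        raise ValueError("no shared positions to compare")
    # True counts as 1, so summing the comparisons counts the matches.
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return matches / len(seq1)
```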