# Practice: Importing libraries, reading files from URLs, comprehensions, map(), and writing custom functions

**12/3/2019<br>
BIOS 274: Introductory Python Programming for Genomics**<br>

## Table of Contents
1. [Importing libraries](#importing)<br>
>- [math](#math)
>- [gzip](#gzip)
>- [os](#os)
>- [random](#random)
2. [Different ways to navigate through a file](#different)
3. [Reading files from URLs](#reading)
4. [Comprehensions and map(): Simplifying for loops](#comprehensions)
5. [Writing custom functions](#writing)
6. [Let's practice!](#practice)

<a id="importing"></a>
## Importing libraries

**Some of the libraries we've learned about so far:**<br>

<code>math</code><br>
<code>gzip</code><br>
<code>os</code><br>
<code>random</code><br>

**There are different ways to import libraries:**

In [1]:
# import the math library
# find the sqrt of 9
import math
print(math.sqrt(9))

# import the math library, but call it m
# find the sqrt of 9
import math as m
print(m.sqrt(9))

# import ONLY the sqrt function from math
# find the sqrt of 9
from math import sqrt
print(sqrt(9))

# import ONLY the sqrt function from math, but call it s
# find the sqrt of 9
from math import sqrt as s
print(s(9))

3.0
3.0
3.0
3.0


<a id="math"></a>
**math library**

In [2]:
import math

print(math.log(7))
print(math.log(7, 2))

print(math.sqrt(9))

print(math.factorial(4))

1.9459101490553132
2.807354922057604
3.0
24
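
`math.log(x)` with one argument is the natural logarithm, and the optional second argument sets the base. A brief sketch of how the two calls above relate, using `math.log2()` and `math.isclose()` (both part of the standard `math` module) for comparison:

```python
import math

natural = math.log(7)        # natural log (base e)
base2 = math.log(7, 2)       # log base 2, via the optional second argument

# Change of base: log2(7) equals ln(7) / ln(2)
print(math.isclose(base2, natural / math.log(2)))

# math.log2() is a dedicated base-2 function
print(math.isclose(base2, math.log2(7)))
```

Both comparisons use `math.isclose()` rather than `==` because floating-point results can differ in the last digits.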


In [4]:
# There are so many more methods in the math library! 
# Put your cursor after the 'math.' and press 'tab' to see them all!
math.

SyntaxError: invalid syntax (<ipython-input-4-f801f4f818d1>, line 3)

<a id="gzip"></a>
**gzip library**

In [5]:
import gzip

# The most common use is to read gzipped files

gzippedFileName = 'sample.txt.gz'

with gzip.open(gzippedFileName, 'rt') as inFile: # don't forget to use 'rt' not just 'r'
    for line in inFile:
        line = line.strip()
        print(line)

1	2	3	4	5
6	7	8	9	10
11	12	13	14	15
16	17	18	19	20
21	22	23	24	25
26	27	28	29	30


<a id="os"></a>
**os library**

In [6]:
import os

# Returns your current directory
print(os.getcwd())

# Returns a list containing the names of the files in the specified directory
print(os.listdir())

# Create a new directory
os.makedirs('newDirectory')

# Remove a directory
os.removedirs('newDirectory')

open('newFile.txt', 'w').close() # this creates a new empty file and closes it right away

# Remove a file
os.remove('newFile.txt')

D:\my github\shen\content\en\post\2019-12-02-python-for-genomics-class_day06
['.ipynb_checkpoints', 'featured.png', 'index.ipynb', 'index.md', 'Practice06-Veronica.ipynb', 'sample.txt', 'sample.txt.gz', 'Untitled.ipynb']


In [7]:
# THIS CELL WON'T ACTUALLY EXECUTE
# THESE ARE JUST EXAMPLES

# Change the current working directory to the specified path.
os.chdir('PATH')

# Test whether a path exists.
os.path.exists('PATH')

FileNotFoundError: [WinError 2] The system cannot find the file specified: 'PATH'

In [8]:
# There are so many more methods in the os library! 
# Put your cursor after the 'os.' and press 'tab' to see them all!
os.

SyntaxError: invalid syntax (<ipython-input-8-0c81c9a48509>, line 3)
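
The `os` calls above can be combined into a guarded pattern: check that a path exists before changing into it. The sketch below is only an illustration. It assumes a scratch directory created with the standard `tempfile` module (not part of the original lesson) so that it runs anywhere without touching your own files.

```python
import os
import tempfile

start_dir = os.getcwd()      # remember where we started
inside_new_dir = False

# tempfile.TemporaryDirectory() is an assumption made for this sketch:
# it provides a scratch directory that is deleted automatically.
with tempfile.TemporaryDirectory() as scratch:
    new_dir = os.path.join(scratch, 'newDirectory')   # build the path portably
    os.makedirs(new_dir)

    if os.path.exists(new_dir):       # only change directory if the path exists
        os.chdir(new_dir)
        inside_new_dir = os.path.basename(os.getcwd()) == 'newDirectory'

    os.chdir(start_dir)               # go back before the scratch directory is removed

# The scratch directory is gone now, so the path no longer exists
still_exists = os.path.exists(new_dir)

print(inside_new_dir)
print(still_exists)
```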

<a id="random"></a>
**random library**

In [9]:
import random

# Returns random number from the interval [0, 1)
print(random.random())

# Returns a random integer between the minimum (inclusive) and maximum (not inclusive) provided
print(random.randrange(1, 101))

0.15603282355533332
29
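
Two more functions from `random` that are handy in genomics scripts are `random.choice()` and `random.seed()`. A minimal sketch, assuming we want a reproducible random DNA sequence (the helper name `random_dna` and the seed value are made up for this example):

```python
import random

def random_dna(length):
    '''
    Input: the desired sequence length (e.g. 10)
    Output: a random DNA sequence of that length, as a string
    '''
    return ''.join([random.choice('ACGT') for _ in range(length)])

random.seed(274)            # fixing the seed makes the "random" results repeatable
first = random_dna(10)

random.seed(274)            # same seed again ...
second = random_dna(10)     # ... so the same sequence comes out again

print(first)
print(first == second)
```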


In [10]:
# There are so many more methods in the random library! 
# Put your cursor after the 'random.' and press 'tab' to see them all!
random.

SyntaxError: invalid syntax (<ipython-input-10-73006f699103>, line 3)

<a id="different"></a>
## Different ways to navigate through a file

**REVIEW: Store the entire file as a string:  <code>inFile.read()</code>**

In [11]:
inFileName = 'sample.txt'

with open(inFileName, 'r') as inFile:
    inFile_string = inFile.read()

print(inFile_string)

1	2	3	4	5
6	7	8	9	10
11	12	13	14	15
16	17	18	19	20
21	22	23	24	25
26	27	28	29	30



**REVIEW: Go through each line of the file**

In [12]:
inFileName = 'sample.txt'

with open(inFileName, 'r') as inFile:
    for line in inFile:
        line = line.strip()   # .rstrip() and .lstrip() remove from just the right or left sides of the line
        print(line)

1	2	3	4	5
6	7	8	9	10
11	12	13	14	15
16	17	18	19	20
21	22	23	24	25
26	27	28	29	30


**NEW: Read just a single line of the file: <code>next(inFile)</code> or <code>inFile.readline()</code>**

In [13]:
# This is often useful for skipping the header of files!

inFileName = 'sample.txt'

with open(inFileName, 'r') as inFile:
    firstLine = next(inFile)
    print(firstLine) # first line will print

    secondLine = inFile.readline()
    print(secondLine) # second line will print

    thirdLine = inFile.readline()
    print(thirdLine) # third line will print

1	2	3	4	5

6	7	8	9	10

11	12	13	14	15
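
Two things are worth noting about the output above. The blank lines appear because `next()` and `readline()` keep the trailing newline and `print()` adds another one; calling `.strip()` removes it. And skipping a header is simply one `next()` call before the loop. The sketch below assumes a small made-up table held in an `io.StringIO` object, which behaves like an open text file, so it runs without any file on disk:

```python
import io

# Made-up file contents for illustration: one header line, then two data lines
fake_file = io.StringIO('geneA\tgeneB\tgeneC\n1\t2\t3\n4\t5\t6\n')

header = next(fake_file).strip().split('\t')   # read (and thereby skip) the header

rows = []
for line in fake_file:                         # the loop continues from the second line
    rows.append(line.strip().split('\t'))

print(header)
print(rows)
```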



<a id="reading"></a>
## Reading files from URLs

**Reading a gzipped file from a URL**

In [14]:
import urllib.request
import gzip

URL = 'https://www.encodeproject.org/files/ENCFF215GBK/@@download/ENCFF215GBK.fastq.gz'
#URL = 'https://www.encodeproject.org/files/gencode.v24.primary_assembly.annotation/@@download/gencode.v24.primary_assembly.annotation.gtf.gz'

response = urllib.request.urlopen(URL)

with gzip.open(response, 'rt') as inFile:  # don't forget to use 'rt' not just 'r'
    i = 0
    for line in inFile:
        line = line.strip()
        print(line)
        i += 1
        
        #** Parsing of file goes here **

        if i >= 8:
            break

@HWI-ST1309F:188:C684MANXX:6:1101:1070:2027 1:Y:0:GCGAAAC
ATNTTTTTTTATTATTATTATTTTAANACNCCTGGCATCTGTTGCTATTT
+
BB#<<//BFFFBF/<FFF<F/<<FBB#<<#<<<</<////<FBFBFFF##
@HWI-ST1309F:188:C684MANXX:6:1101:1169:2086 1:N:0:GCGAAAC
AGCTAAGGGTTTACCCTGTTTGCTTTCATATTCCAGTAGCCATTTAAAAT
+
BBBBBFBFFBBF<FFFFFBB<FBFFF</FF<FFBFFFFFFFF</FBFFF<
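
`gzip.open()` accepts either a file name or an already-open binary file object, which is why passing `response` works above. The sketch below shows the same pattern without needing an internet connection: an `io.BytesIO` object holding gzip-compressed bytes stands in for the URL response (the FASTQ record is made up for illustration).

```python
import gzip
import io

# Made-up FASTQ record, compressed in memory to imitate a downloaded .gz file
raw_text = '@read1\nACGTACGT\n+\nFFFFFFFF\n'
fake_response = io.BytesIO(gzip.compress(raw_text.encode('utf-8')))

lines = []
with gzip.open(fake_response, 'rt') as inFile:   # 'rt' = read as text, not bytes
    for line in inFile:
        lines.append(line.strip())

print(lines)
```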


**Reading a non-gzipped file from a URL**

In [15]:
URL = 'https://hgdownload.soe.ucsc.edu/gbdb/hg38/snp/snp142.fa'

response = urllib.request.urlopen(URL)

i = 0
for line in response:
    # lines read from a URL response are bytes, so decode them to text before stripping
    line = line.decode('utf-8').strip()
    print(line)
    i += 1
    
    #** Parsing of file goes here **

    if i >= 8:
        break

>rs171
tcattgatgg acatttgggt tggttccagg tctttgctat tgcgagtagt gccacaataa atatacgtgt gcatgtgtct
tgatagtagc atgatttata atcctttggg tatataccca ctaatgggat ggctgggtca aatggtattt ctagttctag
atccttgagg aatcaccaca ctgtcttcca catggttgaa ctaatttaca gtcccaacaa cagtgtaaaa gtgttcctat
ttctccatat cctctccagc acctgttgtt tcctgacttt ttaatgatca ccattctaac tgttgcgaga tggtatctca
ttgtggtttt gatttgcatt tctctgatgg gcagtgatga tgagcatttt ttcatgtgtc tgttggctgc ataaatgtct
tcttttggga agtgtctgtt catatccatt gcctagtttt gatggggttg tttgatttat ttcttgtaaa tttgtttacg
ttcgttgtag attctggat


<a id="comprehensions"></a>
## Comprehensions and map(): Simplifying for loops

### **EXAMPLE:  Three ways to convert a list of strings to a list of integers**

**1. Using a for loop**

In [16]:
myList = ['1', '2', '3', '4']

myList_int = []

for num in myList:
    num = int(num)
    myList_int.append(num)
    
print(myList_int)

[1, 2, 3, 4]


**2. Using a comprehension**

<b>[</b> do this expression <b>for</b> item <b>in</b> list <b>if</b> conditional <b>]</b>

is shorthand for the following loop, with the result of each expression collected into a new list:

<code>for item in list:
    if conditional:
        expression</code>

In [17]:
myList = ['1', '2', '3', '4']

myList = [int(item) for item in myList]
print(myList)

[1, 2, 3, 4]


In [18]:
# Here's how you would convert to an integer AND multiply by 3 for each item in the list
#    only if the original number is greater than 2!

myList = ['1', '2', '3', '4']

myNewList = [int(x)*3 for x in myList if int(x) > 2]
print(myNewList)

[9, 12]
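
To make the correspondence with the template explicit, here is a sketch of the same filter-and-transform written as an ordinary for loop and checked against the comprehension. The variable name `myNewList_loop` is introduced only for this comparison.

```python
myList = ['1', '2', '3', '4']

# Comprehension: the expression comes first, then the loop, then the condition
myNewList = [int(x) * 3 for x in myList if int(x) > 2]

# Equivalent for loop: the loop comes first, then the condition, then the expression
myNewList_loop = []
for x in myList:
    if int(x) > 2:
        myNewList_loop.append(int(x) * 3)

print(myNewList == myNewList_loop)
```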


In [19]:
# Cool example of how to make two lists into a dictionary using a comprehension!

myList1 = ['one', 'two', 'three', 'four']
myList2 = [1, 2, 3, 4]

myDict = {key:value for key, value in zip(myList1, myList2)}
print(myDict)

{'one': 1, 'two': 2, 'three': 3, 'four': 4}


**3. Using the <code>map()</code> function**

**map(**do this function**,** to each item in this list**)**

In [20]:
myList = ['1', '2', '3', '4']

myList = list(map(int, myList))
print(myList)

[1, 2, 3, 4]
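
The `list()` call matters: in Python 3, `map()` returns a lazy iterator rather than a list, and that iterator can only be consumed once. A short sketch of that behaviour (the variable names are for illustration only):

```python
myList = ['1', '2', '3', '4']

mapped = map(int, myList)      # nothing has been converted yet
print(type(mapped).__name__)   # the object is a 'map', not a 'list'

first_pass = list(mapped)      # consuming the iterator does the conversions
second_pass = list(mapped)     # the iterator is now exhausted, so this is empty

print(first_pass)
print(second_pass)
```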


<a id="writing"></a>
## Writing custom functions

**Basic format of a function**

In [21]:
# THIS CELL WON'T ACTUALLY EXECUTE
# THEY'RE JUST EXAMPLES

def functionName(variable1):
    # Code goes here
    return variable2

def functionName(variable1, variable2, ...):
    # Code goes here
    return variable3, variable4, ...

SyntaxError: invalid syntax (<ipython-input-21-7dde220fb8b0>, line 8)
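
As a runnable counterpart to the template above, here is a small sketch of a function that takes two arguments and returns two values. The function name `gc_and_length` and its `ndigits` parameter are invented for this example. Returning several comma-separated values actually returns a single tuple, which can be unpacked on assignment.

```python
def gc_and_length(seq, ndigits):
    '''
    Input: a DNA sequence (e.g. 'ATGC') and the number of decimal places to keep
    Output: the GC fraction of the sequence and its length
    '''
    gc_count = seq.count('G') + seq.count('C')
    gc_fraction = round(gc_count / len(seq), ndigits)
    return gc_fraction, len(seq)

result = gc_and_length('ATGC', 2)       # the two values arrive packed in a tuple
gc, length = gc_and_length('ATGC', 2)   # ... or can be unpacked into two names

print(result)
print(gc, length)
```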

**This function, which is called <code>sum_list_of_strings()</code>,<br> takes a list of strings (e.g. <code>['1', '2', '3']</code>)<br> and returns the sum of all the numbers in that list (e.g. <code>6</code>).**

In [22]:
def sum_list_of_strings(myList):
    '''
    Input: A list of strings (e.g. ['1', '2', '3'])
    Output: The sum of the numbers in that list (e.g. 6)
    '''
    SUM = sum([int(x) for x in myList])
    return SUM

In [23]:
help(sum_list_of_strings)

Help on function sum_list_of_strings in module __main__:

sum_list_of_strings(myList)
    Input: A list of strings (e.g. ['1', '2', '3'])
    Output: The sum of the numbers in that list (e.g. 6)



**Now let's use the function we just wrote!**

In [24]:
x = ['1', '2', '3']
y = ['5', '6', '7', '8', '9', '10']
z = ['34', '56', '23', '21', '67', '65', '89', '100', '34', '23']

print(sum_list_of_strings(x))
print(sum_list_of_strings(y))
print(sum_list_of_strings(z))

6
45
512


<a id="practice"></a>
## Let's practice!

### Exercise 1

Use the <code>sum_list_of_strings()</code> function to print out the sum of the numbers in each line of <code>'sample.txt.gz'</code>

Your output should be:<br>
<code>15
40
65
90
115
140</code>

In [25]:
def sum_list_of_strings(myList):
    '''
    Input: A list of strings (e.g. ['1', '2', '3'])
    Output: The sum of the numbers in that list (e.g. 6)
    '''
    SUM = sum([int(x) for x in myList])
    return SUM

In [26]:
# ADD YOUR OWN CODE TO THIS CELL

import gzip

gzippedFileName = 'sample.txt.gz'

with gzip.open(gzippedFileName, 'rt') as inFile: # don't forget to use 'rt' not just 'r'
    for line in inFile:
        line = line.strip().split('\t') # SOLUTION: add .split('\t')
        SUM_to_print = sum_list_of_strings(line) # SOLUTION: add this line
        print(SUM_to_print)

15
40
65
90
115
140


### Exercise 2

Find the square root of each item in <code>myList</code> using:
>**a.** a for loop<br>
>**b.** a comprehension<br>
>**c.** the map() function<br>

Your output should be a list of the square roots.

In [27]:
# YOUR SOLUTION: Using a for loop

import math

myList = [23, 64, 53, 3, 5, 9, 27, 100, 43]

myList_sqrt = [] # new list to append square rooted values to

for item in myList:
    myList_sqrt.append(math.sqrt(item)) # take square root and append to list
    
print(myList_sqrt)

[4.795831523312719, 8.0, 7.280109889280518, 1.7320508075688772, 2.23606797749979, 3.0, 5.196152422706632, 10.0, 6.557438524302]


In [28]:
# YOUR SOLUTION: Using a comprehension

myList = [23, 64, 53, 3, 5, 9, 27, 100, 43]

myList = [math.sqrt(x) for x in myList]
print(myList)

[4.795831523312719, 8.0, 7.280109889280518, 1.7320508075688772, 2.23606797749979, 3.0, 5.196152422706632, 10.0, 6.557438524302]


In [29]:
# YOUR SOLUTION: Using map()

myList = [23, 64, 53, 3, 5, 9, 27, 100, 43]

myList = list(map(math.sqrt, myList))
print(myList)

[4.795831523312719, 8.0, 7.280109889280518, 1.7320508075688772, 2.23606797749979, 3.0, 5.196152422706632, 10.0, 6.557438524302]


### Exercise 3

**3.a** Find the complement of <code>seq</code> using:
>**A.** a for loop<br>
>**B.** a comprehension<br>
>**C.** the map() function<br>

Your output should be a string.

In [30]:
# YOUR SOLUTION: Using a for loop

seq = 'ATCGATCAGCATCTATATTCGGTC'

compDict = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}

seq_comp = ''

for base in seq:
    base_comp = compDict[base] # look up the complement of each base in the dictionary
    seq_comp += base_comp # concatenate each complemented base to the existing sequence string
    
print(seq_comp)

TAGCTAGTCGTAGATATAAGCCAG


In [31]:
# YOUR SOLUTION: Using a comprehension

seq = 'ATCGATCAGCATCTATATTCGGTC'

seq_comp = ''.join([compDict[base] for base in seq])
print(seq_comp)

TAGCTAGTCGTAGATATAAGCCAG


In [32]:
# YOUR SOLUTION: Using map()

seq = 'ATCGATCAGCATCTATATTCGGTC'

seq_comp = ''.join(list(map(compDict.get, seq)))
print(seq_comp)

TAGCTAGTCGTAGATATAAGCCAG
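
One difference between the comprehension and `map()` versions shows up when the sequence contains a character that is not in `compDict`, such as `N`: `compDict[base]` raises a `KeyError`, whereas `compDict.get(base)` returns `None`, which then makes `''.join()` fail with a `TypeError`. Below is a sketch of one way to handle this, assuming unknown bases should be passed through unchanged (that policy is an assumption, not part of the exercise):

```python
compDict = {'A':'T', 'T':'A', 'G':'C', 'C':'G'}

seq_with_N = 'ATNGC'   # made-up sequence containing an ambiguous base

# .get(base, base) falls back to the base itself when it is not in the dictionary
seq_comp_N = ''.join([compDict.get(base, base) for base in seq_with_N])
print(seq_comp_N)
```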


**3.b.** Write a function called <code>comp()</code> that will find the complement of any DNA sequence!

Use your function to find the complement of <code>seq1</code> and <code>seq2</code>

In [33]:
seq1 = 'TCCAATCGGGCTAATTCGGCTAGAGTCGGCTAGAGAGGGAGA'
seq2 = 'CTATTAAATCTGGGCTAGATATAGCTAATATCGCGCGGGCTATACGCAAATCGCGGGGACTAC'

In [34]:
# YOUR SOLUTION HERE

def comp(seq):
    seq_comp = ''.join([compDict[base] for base in seq])
    return seq_comp

print(comp(seq1))
print(comp(seq2))

AGGTTAGCCCGATTAAGCCGATCTCAGCCGATCTCTCCCTCT
GATAATTTAGACCCGATCTATATCGATTATAGCGCGCCCGATATGCGTTTAGCGCCCCTGATG


**3.c.** Write a function called <code>rev_comp()</code> that will find the reverse complement of any DNA sequence!<br><br>

Try to use <code>comp()</code> within <code>rev_comp()</code>!<br><br>
Use your function to find the reverse complement of <code>seq1</code> and <code>seq2</code>

In [35]:
seq1 = 'TCCAATCGGGCTAATTCGGCTAGAGTCGGCTAGAGAGGGAGA'
seq2 = 'CTATTAAATCTGGGCTAGATATAGCTAATATCGCGCGGGCTATACGCAAATCGCGGGGACTAC'

In [36]:
# YOUR SOLUTION HERE

def rev_comp(seq):
    revComp = comp(seq)[::-1]
    return revComp

print(rev_comp(seq1))
print(rev_comp(seq2))

TCTCCCTCTCTAGCCGACTCTAGCCGAATTAGCCCGATTGGA
GTAGTCCCCGCGATTTGCGTATAGCCCGCGCGATATTAGCTATATCTAGCCCAGATTTAATAG


**3.d.** Write a function called <code>rev_comp_count()</code> that takes a DNA sequence as a string and returns:
>**A.** the reverse complement (as a string)<br>
   AND<br>
>**B.** the counts of each nucleotide in the reverse complemented sequence (as a dictionary)<br>

Use your function on <code>seq1</code> and <code>seq2</code>

In [37]:
seq1 = 'TCCAATCGGGCTAATTCGGCTAGAGTCGGCTAGAGAGGGAGA'
seq2 = 'CTATTAAATCTGGGCTAGATATAGCTAATATCGCGCGGGCTATACGCAAATCGCGGGGACTAC'

In [38]:
# YOUR SOLUTION HERE

def rev_comp_count(seq):
    countDict = {}

    # Reverse complement
    seq_revComp = rev_comp(seq)

    # Count number of bases
    for base in ['A', 'C', 'G', 'T']:
        count = seq_revComp.count(base)
        countDict[base] = count
        
    return seq_revComp, countDict

s, c = rev_comp_count(seq1)
print(s)
print(c)

TCTCCCTCTCTAGCCGACTCTAGCCGAATTAGCCCGATTGGA
{'A': 8, 'C': 15, 'G': 8, 'T': 11}


### Exercise 4

Write a function called <code>find_score()</code> that scores how similar two DNA sequences are to each other.
>- matches are worth +3<br>
>- mismatches are worth -2

Your function should return (as a tuple): 
>1. the score
>2. the number of matches
>3. the number of mismatches

For example, these two DNA sequences would have a score of 10, <br>
with 4 matches (each worth +3) and 1 mismatch (worth -2)<br><br>
<code>ACATC</code><br>
<code>|||x|</code><br>
<code>ACACC</code>

Use your function on the following pairs:
>1. <code>seq1</code> and <code>seq2</code>
>2. <code>seq2</code> and <code>seq3</code>
>3. <code>seq1</code> and <code>seq3</code>

In [39]:
seq1 = 'TCCAATCGGGCTAATTCGGCTAGAGTCGGCTAGAGAGGGAGA'
seq2 = 'TCCAGTCGGGCTTATACGGCTACTGTCCGCTAGTGACCCAGA'
seq3 = 'TGCAATGGGGCTAATACGGCAAGAGTCCCCTAGAGAGGGAGT'

In [40]:
# YOUR SOLUTION HERE

### WILL UPDATE WITH SOLUTION LATER
### MIGHT DO THIS TOMORROW
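
One possible solution sketch for `find_score()` follows. It assumes the two sequences have the same length, since the exercise does not say how to treat sequences of different lengths, and it compares them position by position with `zip()`. The parameter names `seqA` and `seqB` are chosen for this sketch.

```python
def find_score(seqA, seqB):
    '''
    Input: two DNA sequences of equal length (e.g. 'ACATC' and 'ACACC')
    Output: a tuple of (score, number of matches, number of mismatches)
    '''
    # Compare the sequences position by position and count identical bases
    matches = sum([1 for a, b in zip(seqA, seqB) if a == b])
    mismatches = len(seqA) - matches      # assumes len(seqA) == len(seqB)
    score = matches * 3 - mismatches * 2  # +3 per match, -2 per mismatch
    return score, matches, mismatches

# The worked example from the exercise text: 4 matches and 1 mismatch
print(find_score('ACATC', 'ACACC'))   # → (10, 4, 1)
```

Calling it on the pairs listed in the exercise works the same way, for example `find_score(seq1, seq2)`.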