## Assignment 2 - genome composition

Today we're going to do two things:
- Calculate basic genome information for a chromosome. We'll look at its length, its GC content, etc. (4 points)
- Basic string searches (4 points) 


### Part 1: Genome composition 

Let's take a look at chromosome 21 (chr21), the smallest chromosome in the human genome. You can download the sequence from here:
    
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/
    
Or the direct link: [Chromosome 21 gzipped](https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr21.fa.gz)
    
This file is compressed with gzip. You can uncompress it if you'd like to take a peek, but the rest of the assignment assumes you've kept it compressed. We'd like to know a bit about this chromosome: its length, its GC content, and eventually where specific sequences we're interested in reside.

### Opening a compressed file in python

We'll open chr21 as a compressed file and save its contents to a variable we can play around with. The chr21 FASTA file starts with a header line, ">chr21", followed by lines of sequence.
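If you'd like to peek at the start of the file without uncompressing it on disk, something like the sketch below should work. The helper name `peek_lines` is made up for this example, and the commented-out usage assumes `chr21.fa.gz` sits in your working directory.

```python
import gzip
from itertools import islice


def peek_lines(path, n=5):
    """Return the first n lines of a gzip-compressed text file (newlines removed)."""
    # 'rt' opens the compressed file in text mode, so we get str lines, not bytes
    with gzip.open(path, "rt") as handle:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip("\n") for line in islice(handle, n)]


# Example usage (assumes the file is in the current directory):
# for line in peek_lines("chr21.fa.gz"):
#     print(line)
```

Because only the first few lines are read, this returns almost immediately even though the full chromosome is tens of millions of bases long.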

#### Actually opening the file

To open a compressed file we'll import the gzip library, which is part of the standard Python distribution. We'll then read off the header line (which we know is ">chr21"), and then build a variable that concatenates all of the sequence lines.

In [3]:
import gzip # use the gzip library in our project

chromosome_21_sequence = "" # this variable stores our chromosome 21 sequence

with gzip.open('chr21.fa.gz', 'rt') as chromosome_file: # adjust this path if you've saved chromosome 21 somewhere else
    
    # read the first line, which will be our header ">chr21"
    header = chromosome_file.readline() 
    
    for line in chromosome_file:
        # here we'll add each line to our growing chromosome_21_sequence variable
        # we use the strip() function to remove the trailing newline, which tells the computer
        # to make a new line of text in the file (we don't want it)
        chromosome_21_sequence += line.strip() 
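Repeatedly adding to a string with `+=` works fine here, but a common alternative is to collect the lines in a list and join them once at the end. The sketch below wraps that in a function; the name `read_fasta_sequence` is hypothetical, and it assumes the file holds a single FASTA record, as chr21.fa.gz does.

```python
import gzip


def read_fasta_sequence(path):
    """Read a gzipped FASTA file with one record and return (header, sequence)."""
    chunks = []
    with gzip.open(path, "rt") as fasta_file:
        # the first line is the header, e.g. ">chr21"
        header = fasta_file.readline().strip()
        for line in fasta_file:
            # strip() removes the trailing newline from each sequence line
            chunks.append(line.strip())
    # join all the pieces in a single pass
    return header, "".join(chunks)


# Example usage (assumes the file is in the current directory):
# header, chromosome_21_sequence = read_fasta_sequence("chr21.fa.gz")
```

Either approach gives the same string; the list-and-join version simply avoids building many intermediate strings.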

#### Genome length

We now have chromosome 21 as a variable in our notebook. The first thing we want is a function to figure out its length. This is a bit contrived, but
you'll make a very basic function to return the length of the sequence we read. I've provided the basic scaffold (don't change the name), but you'll
need to fill in the very minimal details:

In [4]:
def genome_length(genome_as_a_string):
    '''
    This is a multiline comment (specified with the three single quotes above and below). This function takes one
    parameter, the genome as a string, and returns the length of the genome as an integer value. Please fix it!
    '''
    return(len(genome_as_a_string)) # FIX ME



In [5]:
# you can check that your function returns the right length for chromosome 21
assert genome_length(chromosome_21_sequence) == 46709983, "Incorrect length returned!"

### GC content 

Now we need a function that takes a genome nucleotide string and returns the GC content. Help fill in this function
with a for loop that counts the total number of Gs and Cs, and returns that count divided by the total length.

In [14]:
def gc_content(genome_as_a_string):
    '''
    This function iterates over the genome string, counting Gs and Cs, and returns the GC ratio of a string of nucleotides (from 0.0 to 1.0)
    '''
    if len(genome_as_a_string) > 0:
        # we're going to assume (which is wrong) that every base is either a G/C or not. For instance, an N could be either, but we'll ignore that.
        # Depending on how you handle Ns, or how you count (counting GCs vs. counting ATs), the result of this function could change a lot.
        # Regardless, I accepted any reasonable approach here.
        return(sum([1 if x == 'C' or x == 'G' else 0 for x in genome_as_a_string.upper()])/len(genome_as_a_string))
    else:
        return 0


In [10]:
# a quick sanity check for your code before running the chr21
gc_content("ACGT")

0.5

In [13]:
# this is what I'm interested in:
gc_content(chromosome_21_sequence)

0.3513515515516244
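To see how much the treatment of Ns can matter, here is one alternative sketch that divides by the number of called bases (A, C, G, T) instead of the full length. The function name `gc_content_excluding_n` is made up for this example, and this is just one reasonable convention, not the required answer.

```python
def gc_content_excluding_n(genome_as_a_string):
    """GC ratio computed over A/C/G/T only; Ns and other symbols are skipped."""
    gc_count = 0
    at_count = 0
    for base in genome_as_a_string.upper():
        if base == "G" or base == "C":
            gc_count += 1
        elif base == "A" or base == "T":
            at_count += 1
        # anything else (for example N) is not counted at all
    called_bases = gc_count + at_count
    if called_bases == 0:
        return 0.0  # avoid dividing by zero for empty or all-N input
    return gc_count / called_bases


print(gc_content_excluding_n("ACGTNNNN"))  # 0.5
```

For the toy string "ACGTNNNN", dividing by the full length of 8 would give 0.25, while ignoring the Ns gives 0.5, so the two conventions can disagree substantially on sequences with long runs of Ns.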

### Brute-force string search

Here we're going to implement our own brute-force genome search. Like in the slides from class, we'll iterate over
every position in the genome, and compare each base in the search string to the genome. You can try the 'exit
early' optimization when you discover a mismatch, though it's not required.

If you find matches, return the start position of each match in a Python list (positions are zero-based, so zero is
the first position in the genome). If you don't find a match, return a one-element list containing the integer -1,
which is currently the default for the function.

Don't use built-in Python search methods like index(), especially since we're looking for all matches.

In [19]:
def find_matches(genome,sequence_to_search_for):
    found_positions = []
    for i in range(0,len(genome) - (len(sequence_to_search_for) - 1)):
        if genome[i:i+len(sequence_to_search_for)] == sequence_to_search_for:
            found_positions.append(i)
    if len(found_positions) == 0:
        return([-1]) # we found no matches
    else:
        return(found_positions)
    

In [23]:
# a quick sanity check for your find_matches code, though try on subsequences within chr21!
print(find_matches("ACGTACGTACGT","ACGT"))
print(find_matches("AAAAAAAAAAAA","ACGT"))
print(find_matches("AAAAAAAAACGT","ACGT"))

[0, 4, 8]
[-1]
[8]
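The version above compares whole slices. For comparison, here is a sketch of the base-by-base variant with the 'exit early' optimization: the inner loop stops at the first mismatch instead of checking the rest of the search string. The function name `find_matches_exit_early` is made up for this example.

```python
def find_matches_exit_early(genome, sequence_to_search_for):
    """Brute-force search that compares one base at a time and stops at the first mismatch."""
    pattern_length = len(sequence_to_search_for)
    found_positions = []
    # only start where the whole search string still fits inside the genome
    for start in range(len(genome) - pattern_length + 1):
        matched = True
        for offset in range(pattern_length):
            if genome[start + offset] != sequence_to_search_for[offset]:
                matched = False
                break  # exit early: no need to check the remaining bases
        if matched:
            found_positions.append(start)
    # keep the same convention as before: [-1] means no match was found
    return found_positions if found_positions else [-1]


print(find_matches_exit_early("ACGTACGTACGT", "ACGT"))  # [0, 4, 8]
print(find_matches_exit_early("AAAAAAAAAAAA", "ACGT"))  # [-1]
print(find_matches_exit_early("AAAAAAAAACGT", "ACGT"))  # [8]
```

Both versions do the same work in the worst case, but when mismatches usually show up in the first base or two, the early exit skips most of the comparisons at each position.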


## We'll cover the more advanced methods in the following homework. Good luck!