## Welcome to Chang Lab Bioinformatics!

We're going to start with an introduction to Python, just to get a handle on basics and how to deal with data.  In this experiment, we're imagining that you're trying to create a clonal line with a mutation at a given locus.  You've picked lots of clones, extracted DNA and amplified the target region by PCR, and are now trying to analyze your results to quickly tell which wells are WT or mutant (or heterozygous).

The first section will be an introduction, and then we'll go through dealing with one well, and then go through dealing with all 96 wells.

### Part 0: Your name

Kevin Parker

### Part 1: Introduction

Make sure that you are familiar with the difference between a string, integer, and float; as well as lists and dictionaries; comparison operators (equal to, greater than, less than, etc.), which return booleans; and the basics of defining a function.

We're going to use a particular kind of dictionary called a <b>counter</b>.  Normally, a dictionary stores keys (what you look up) and values (what is returned), and these can be anything.  A counter is a dictionary that associates a count (an integer) with the keys, and is a really convenient way to keep track of how often you are seeing something (such as a FASTQ read).  There's also a couple of other nice features, like being able to return the most frequently seen items within the Counter.
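
To make the difference concrete, here is a minimal sketch comparing a plain dictionary with a Counter.  The fruit names are made up for illustration and are not part of the exercise data.

```python
from collections import Counter

fruits = ['apple', 'pear', 'apple']

# with a plain dictionary, we have to handle the first sighting of each key ourselves
plain_counts = {}
for fruit in fruits:
    if fruit not in plain_counts:
        plain_counts[fruit] = 0
    plain_counts[fruit] += 1

# with a Counter, a key we haven't seen yet simply has a count of 0
fruit_counts = Counter()
for fruit in fruits:
    fruit_counts[fruit] += 1

print(plain_counts)
print(fruit_counts)

# 'banana' was never seen, so the count is 0 rather than an error
print(fruit_counts['banana'])
```

Both approaches end up with the same counts; the Counter just needs less bookkeeping.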

We're going to practice with an example 'dataset' before moving on to actual (simulated) FASTQ reads.

The first step is to import the Counter object from the collections package, which is a collection of specialized datatypes that are alternatives to the Python defaults (lists, sets, dictionaries, and tuples).  We're going to work with other packages in the future, and you'll get used to seeing import statements.

Note that once you import it in a session, there's no need to re-import it each time you want to use it.

In [2]:
# note the syntax here:
# we are importing the Counter object from the collections package

from collections import Counter

# mock dataset, don't change

test_data = ['cat',
            'dog',
            'cat',
            'mouse',
            'mouse',
            'cat',
            'cat',
            'dog',
            'rat',
            'dog',
            'rabbit',
            'mouse',
            'cat',
            'cat',
            'dog',
            'elephant']

How many elements are in test_data? (use the len() function)

#### There are two main ways to use counters:

(and you can look up the documentation here https://docs.python.org/3.8/library/collections.html#collections.Counter)

In [None]:
# if you already have a list with a bunch of items:
# this will create a new counter called test_count_1 out of the data in test_data

test_count_1 = Counter(test_data)

In [None]:
# now, we can find the most common elements in this counter:

print(test_count_1.most_common())

What kind of datatype is the result of most_common()?  What kind of datatype is each element of the result?

You can also build a counter by iterating through a list one item at a time (which is what we'll do later when reading through FASTQ reads!)

In [None]:
# first we are going to initialize a new counter
test_count_2 = Counter()

# we are going to iterate through each item in test_data
for i in test_data:
    
    # and increment the count for that item by 1 in test_count_2
    test_count_2[i] += 1

#### Try playing around with adding a print statement where you print the counter over each iteration so you can see what is going on!  Also, how can you convince yourself that the two counters are equal?

Note that the most_common() function returns a list of elements, sorted by their frequency (most frequent first).  Each item in the list is a tuple, with the element and its count.
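
As a small sketch of that structure, here is most_common() on a made-up Counter of letters (not the exercise data).  It also shows that you can pass a number to most_common() to get only that many of the top items.

```python
from collections import Counter

letters = Counter(['a', 'a', 'b', 'b', 'b', 'c'])

ranked = letters.most_common()
print(ranked)
print(type(ranked))     # the whole result is a list
print(type(ranked[0]))  # each item in it is a tuple of (element, count)

# passing a number returns only that many of the most common items
top_one = letters.most_common(1)
print(top_one)
```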

If you want to store the result of most_common(), you'll need to do that explicitly by assigning it to the same or new variable.

In [None]:
print(test_count_2)

count_results = test_count_2.most_common()

print(count_results)

print(test_count_2)

#### Let's say we want to calculate the percent of counts assigned to the third most common result.  How would we do this?

First, we need to get the total number of counts:

In [None]:
# one nice thing that we can do is get just the count values
# which makes it easy to determine the total number of counts in the counter:

vals = test_count_2.values()
print(vals)

vals_sum = sum(vals)
print(vals_sum)

Second, we need to get the data for the third-most common item. There's two ways that I'm going to show you how to do this.

In [None]:
count_results = test_count_2.most_common()

# remember that python is 0-indexed!
# the first element is 0, then 1, then 2
v3 = count_results[2]
print(v3)

# now we can get the animal and count
# remember that v3 is a tuple of length 2
# and so we can access the first (0th) and second (1st) elements
v3_animal = v3[0]
v3_count = v3[1]

print(v3_animal)
print(v3_count)

In [None]:
count_results = test_count_2.most_common()

# this does the same thing as above!
# since count_results[2] is of length 2, we can just assign it to two variables at the same time
v3_animal, v3_count = count_results[2]

print(v3_animal)
print(v3_count)

If the second way didn't make too much sense, don't worry about it too much for now.  Ultimately, both of these do the same thing, it's just slightly cleaner/fewer lines.

Note that what we did above - saying  

<i> v3_animal, v3_count = count_results[2] </i>

isn't just restricted to tuples.  You can do this with a list, and it can be of any length - say you had a list of length 4; you could say: 

<i> my_list = [1,2,3,4]  
    a, b, c, d = my_list </i>
    
and then you would have assigned the values 1 to a, 2 to b, and so on.
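
Here is that example as a runnable sketch.  The second half is an addition for illustration: it shows what happens when the number of variable names doesn't match the number of items.

```python
my_list = [1, 2, 3, 4]

# four names on the left, four items on the right
a, b, c, d = my_list
print(a, b, c, d)

# if the numbers don't match, Python raises a ValueError
try:
    x, y = my_list
except ValueError as err:
    print('unpacking failed:', err)
```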

#### Now let's calculate the percentage assigned to the third element.

In [None]:
count_results = test_count_2.most_common()
total_counts = sum(test_count_2.values())

# using the second method from above
v3_animal, v3_count = count_results[2]

# now calculate the percent
v3_percent = 100 * v3_count / total_counts
print(v3_animal)
print(v3_percent)

#### Now we've calculated that 18.75% of the animals in our original list were mice!

Just for fun, let's use a set of <b>if</b> statements to print one of three possible outcomes:

<i> if the percent is greater or equal to 50%, print "apple"  
if the percent is not greater or equal to 50%, but greater or equal to 20%, print "pear"  
otherwise, print "alligator" </i>

Note that for the second condition (at least 20% but less than 50%), as long as it comes after the first <b>if</b> statement (greater than or equal to 50%), we can say "else if" or <b>elif</b>, meaning "if the conditions in the previous if statements are not met, but this other condition is met, then do the following".

In [None]:
test_percent = 18.75

if test_percent >= 50:
    print('apple')
elif test_percent >= 20:
    print('pear')
else:
    print('alligator')
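
The order of the checks matters.  As a rough sketch, here are the same three outcomes wrapped in two hypothetical functions (the function names are made up for this example): one with the checks in the order above, and one with the broader condition first.

```python
def fruit_for(percent):
    # checks run from top to bottom, and the first one that is True wins
    if percent >= 50:
        return 'apple'
    elif percent >= 20:
        return 'pear'
    else:
        return 'alligator'


def fruit_wrong_order(percent):
    # same checks, but with the broader condition first
    if percent >= 20:
        return 'pear'
    elif percent >= 50:
        # never reached: anything >= 50 already matched the check above
        return 'apple'
    else:
        return 'alligator'


print(fruit_for(75))
print(fruit_wrong_order(75))
```

With the broader condition first, a value of 75 is caught by the ">= 20" check and never gets to the ">= 50" check.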

#### Try playing around with the percent value stored in 'test_percent' as well as the limits (50, 20), if/elif/else statements (adding/removing conditions), and outcomes (what is printed)!

For example, add a new condition such that if the value is between 30-50, it will print the value of test_percent.

### Part 2: Writing a function to determine whether a list of 'reads' is homozygous WT, heterozygous, or mutant (on both alleles).

Here, we're going to apply the concepts above to three test datasets.  Each of these datasets is going to be a counter.  However, to make things simpler, instead of reads, we're going to use animals; and we're just going to say that 'cats' are wild-type and anything else is 'mutant'.

In [None]:
c1 = Counter(['cat','cat','cat','dog','cat','cat','cat','rat','cat','cat'])
c2 = Counter(['cat','cat','cat','dog','dog','dog','cat','dog','rat','cat'])
c3 = Counter(['dog','rat','dog','rat','dog','rat','dog','rat','dog','rat'])

print(c1)
print(c2)
print(c3)

Just by looking at this, we can assign each of these as a particular status: c1 is WT, c2 is a het, and c3 is homozygous mutant.  But let's write a function to do this for us!

#### First, we need to think of the criteria that we mentally apply when deciding which status c1, c2, and c3 have.

Let's just set forth the following rules for each condition:

+/+ (WT): at least 80% of the 'reads' are the WT read  
+/- : at least 40% of the 'reads' are WT, and at least 40% of the reads are for another non-WT allele  
-/- : the WT reads are fewer than 20% of the total number of reads.  Note that there are actually two possible cases here: it could be homozygous (two of the same mutant alleles) or heterozygous (two different mutant alleles).  In the first situation (+/+), we said that the WT allele needed to represent at least 80% of the reads.  So it seems reasonable to say that if at least 80% of the reads are for a single allele, then we will call it homozygous mutant, and if there's two alleles with at least 40% of the reads for each allele, we'll call it heterozygous mutant. 

Note that there's also a fourth situation, which is deciding that we have bad data.  For example, there's just a lot of random stuff and it doesn't look like good/real data.

In [None]:
c4 = Counter(['cat','dog','rat','cat','dog','rat','cat','dog','rat','alligator'])
print(c4)
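
One way to see why c4 looks 'bad' is to print the percent for each animal, using the same steps as in Part 1.  This is just a sketch for inspecting the data; it is not part of the function you're about to write.

```python
from collections import Counter

c4 = Counter(['cat','dog','rat','cat','dog','rat','cat','dog','rat','alligator'])
total = sum(c4.values())

# store the percent of 'reads' for each animal in a plain dictionary
c4_percents = {}
for animal, count in c4.most_common():
    c4_percents[animal] = 100 * count / total
    print(animal, c4_percents[animal])
```

No animal reaches the 40% cutoff from the rules above, so this doesn't fit any of the genotypes.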

#### What is our function going to do?

Our function will have two inputs: the Counter and the wild-type reference.

It will return as output one of five lists: ["WT","WT"], ["WT", "allele2"], ["allele1","WT"], ["allele1", "allele2"], or ["bad","bad"].  Here, alleles 1 and 2 are the non-WT alleles, and ["bad","bad"] is for the situation talked about above, where the data looks bad.

#### What are the steps we are going to take?

1. Figure out how many total counts there are  
2. Get the most common elements in the counter
3. Look at the first most common element  
3.1 Determine if this element is WT or mutant  
3.2 If it has at least 80% of the reads, then we are dealing with a homozygous situation and <b>return</b> early (since there's no need to look at the second allele)  
3.3 On the other hand, if it has at least 40% of the reads, then we are dealing with a heterozygous situation.  
3.4 If it doesn't then <b>return</b> 'bad' early (there's no need to look at the second allele if the most common one is under 40%, because the second most common one will also be under 40%)  
4. Look at the second most common element  
4.1 Determine if the element is WT or mutant  
4.2 Check if it has at least 40% of the reads: if not (meaning that the first most common read was at least 40%, but the second most common read was less than 40%) <b>return</b> 'bad'
5. <b>Return</b> the status

Note that a function can only return once: once your function hits a return statement, it will not run anything else below.
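
As a minimal sketch of returning early, here is a small made-up function (unrelated to genotyping) with three return statements.  Only one of them runs on any given call.

```python
def describe(n):
    if n < 0:
        # the function stops here for negative numbers
        return 'negative'

    # this line is only reached when n is 0 or greater
    if n == 0:
        return 'zero'

    # this line is only reached when neither return above was hit
    return 'positive'


print(describe(-5))
print(describe(0))
print(describe(7))
```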

#### I've laid out certain components of the function, but you're going to have to use the skills you learned above to fill in the blanks!

In [None]:
# first, we are going to create our new function, genotype()
def genotype(c, wt_reference):

    # 1. figure out how many counts there are
    total_count = sum(c.values())
    
    # 2. get the most common elements with the most_common() function
    most_common_values = c.most_common()
    
    # 3. look at the first most common element
    v1_allele, v1_count = most_common_values[0]
    
    # 3.1 determine if this element is WT
    if v1_allele == wt_reference:
        v1_allele = 'WT'
    
    # 3.2 determine if it has at least 80% of the reads
    if v1_count / total_count >= .8:
        return [v1_allele, v1_allele]
    
    # 3.3 & 3.4 check if it has fewer than 40% of the reads and return early
    elif v1_count / total_count < .4:
        return ['bad','bad']
    
    # 4. look at the second most common element - you're going to have to do this on your own!
    
    
    # 4.1. determine if this element is WT and if so, set the value equal to 'WT'
    
    
    # 4.2. check if it has at least 40% of the reads and if not, return early as in 3.3/3.4
    
    
    # 5. return the two alleles
    return [v1_allele, v2_allele]    

#### Now let's try your function on the four test cases to see if we get the expected results!

In [None]:
genotype(c1, 'cat')

In [None]:
genotype(c2, 'cat')

In [None]:
genotype(c3, 'cat')

In [None]:
genotype(c4, 'cat')

#### Does everything look good? Congratulations for finishing this!! You've now learned the basics of writing a function, performing boolean operations, using if statements, and Counters!

### Part 3: Applying this to our FASTQ data.

We're going to do this in two parts.  First, we're going to learn to deal with a single FASTQ file.  Then, we're going to deal with an entire folder of FASTQ files.

We're going to use a new package, <b>os</b>, that can help us find what files are in a given directory.

We're also going to learn how to import a text file: at heart, a FASTQ file is just a text file, where each line has a different piece of information.  Each FASTQ read comprises four lines:
(see https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html for more information)

1. Read ID: information on machine, cluster location, etc.  For our purposes, not important.
2. The actual read.  Important!
3. Separator (a + sign).  Not important.
4. Base quality scores.  Often important, but we're going to ignore it for now and just assume that all of the reads are good enough.

So we can think of a FASTQ file as having a periodicity of 4, where the 2nd, 6th, 10th, etc. lines are the reads.  Which means that when we are reading in a FASTQ file, we only want to pay attention to the 2nd, 6th, 10th, etc. lines.
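
To make the four-line structure concrete, here is a sketch using two made-up FASTQ records stored in a list.  The read IDs, sequences, and quality strings are invented for illustration; real files will have longer lines.

```python
# two made-up FASTQ records, four lines each
mock_lines = [
    '@read_1', 'ACGTACGT', '+', 'IIIIIIII',
    '@read_2', 'ACGTTCGT', '+', 'IIIIIIII',
]

mock_reads = []

# step through the lines four at a time
for start in range(0, len(mock_lines), 4):
    read_id = mock_lines[start]
    sequence = mock_lines[start + 1]
    separator = mock_lines[start + 2]
    qualities = mock_lines[start + 3]

    print(read_id, sequence, separator, qualities)

    # the sequence is the only part we care about for now
    mock_reads.append(sequence)

print(mock_reads)
```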

#### The first thing we need to do is create a new variable, called <i>path</i>, that is the path to the folder (directory) that has our files.  
You can find this in two ways: 1) in terminal, navigate to the directory with the FASTQ files (crispr_96), and type <i>pwd</i>. 2) in Finder, right click on a FASTQ file in that folder, click "get info" and in the "general" tab, look at "where" and that will be the path: it should be something like /Users/kevin/etc.

In [None]:
# importing the package os
import os

# creating our variable path which has the location to our files
# note that this is a string, and so should be enclosed in quotes
# also, make sure that it ends with a / ! This will be important in a second
path = '/Users/kevin/changlab/covid19/1_CRISPR/crispr_96/'

Now, let's find all of the files that are in that folder using the os.listdir() function.

In [None]:
# get a list of files
files = os.listdir(path)

# just to make our lives easier, let's sort this
files = sorted(files)

print(files)

Check how long the list files is.  It should be 96.
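
If you get more than 96, one possible cause is a hidden file in the folder (on a Mac, Finder often creates one called .DS_Store).  Here is a sketch of one way to keep only the FASTQ files.  It builds a temporary demo folder with made-up file names so that it runs anywhere; for your own data you would apply the same filter to the list from os.listdir(path).

```python
import os
import tempfile

# build a temporary demo folder with two mock FASTQ files and one hidden file
demo_dir = tempfile.mkdtemp()
for name in ['crispr_well_1.fastq.gz', 'crispr_well_0.fastq.gz', '.DS_Store']:
    open(os.path.join(demo_dir, name), 'w').close()

all_files = sorted(os.listdir(demo_dir))
print(all_files)

# keep only the file names that end with .fastq.gz
fastq_files = []
for fn in all_files:
    if fn.endswith('.fastq.gz'):
        fastq_files.append(fn)

print(fastq_files)
```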

Let's just start with the first file.

In [None]:
fn = files[0]
print(fn)

Important note! <b>fn</b> is now a string that is the name of a single file in the directory.  The full path to the <i>file</i> is:

In [None]:
path_to_file = path + fn
print(path_to_file)

Above, we've added two strings together.  This is why making sure that our variable <b>path</b> ended with a / was important - if it didn't, then we would be looking for a file called "crispr_96crispr_well_0.fastq.gz" in the "1_CRISPR" directory, rather than the file "crispr_well_0.fastq.gz" in the "crispr_96" directory.
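
An alternative that some people prefer is os.path.join(), which adds the separator for you if it is missing.  This is a sketch with a hypothetical folder name (the files don't need to exist for this to run), not a change to the approach used in this notebook.

```python
import os

# hypothetical folder name, written with and without the trailing /
folder_with_slash = '/Users/yourname/crispr_96/'
folder_without_slash = '/Users/yourname/crispr_96'
file_name = 'crispr_well_0.fastq.gz'

# plain string addition: the / between folder and file is missing
added = folder_without_slash + file_name
print(added)

# os.path.join() gives the same path either way (on Mac/Linux)
joined_1 = os.path.join(folder_with_slash, file_name)
joined_2 = os.path.join(folder_without_slash, file_name)
print(joined_1)
print(joined_2)
```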

#### Now we're going to learn to open the file.  Importantly, this file is <i>gzipped</i>. 

For an uncompressed file, we would say:

    with open(path_to_file, 'r') as f:  
        for line in f:  
            do_something()  

Since our files are gzipped, we're going to use the gzip package to open this file, and say:

    with gzip.open(path_to_file, 'rt') as f:
        for line in f:
            do_something()
            
Note the 'r' that we're including: this means we're opening the file in 'read' mode.  The possible modes are:
* 'r' for read (this won't change the file)
* 'w' for write (this will overwrite an existing file)
* 'a' for append (add lines to the end - will add lines to an existing file rather than overwrite it)

For the gzip open() function, we also need to include 't', to say that we are in text mode (as opposed to 'b', the default, which is binary mode). You can try importing with 'rb' (or just 'r', which defaults to 'rb') and see what it looks like!
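
If you'd like to see the difference without touching your real data, here is a sketch that writes a tiny gzipped file to a temporary folder and reads it back in both modes.  The file name and its contents are made up for this example.

```python
import gzip
import os
import tempfile

# make a tiny gzipped text file in a temporary folder
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt.gz')
with gzip.open(demo_path, 'wt') as f:
    f.write('ACGT\n')

# text mode: each line comes back as a string
with gzip.open(demo_path, 'rt') as f:
    text_line = f.readline()

# binary mode: each line comes back as bytes
with gzip.open(demo_path, 'rb') as f:
    bytes_line = f.readline()

# repr() shows the newline character instead of printing it
print(repr(text_line))
print(repr(bytes_line))
print(type(text_line), type(bytes_line))
```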

#### Now let's put it all together and print the first twelve lines of the file, corresponding to the first three reads.

We're going to keep track of the number of lines we have looked at with a new variable, <i>line_number</i>, which starts at 0, and increment it by 1 for each line we process.  Once line_number reaches 12 (meaning we have already printed twelve lines), we <b>break</b> out of the for loop.

In [None]:
import gzip

line_number = 0

with gzip.open(path_to_file, 'rt') as f:
    for line in f:
                        
        if line_number >= 12:
            break
        
        line_number += 1
        print(line)

Notice how there are extra blank lines in between each line? If you try importing in 'rb' mode, you'll see that each line ends with a "\n", which indicates a newline; print() then adds a second newline of its own.  Often, lines will have extra whitespace - spaces, tabs, newline characters - at the start or end of a line.  We can get rid of this with the .strip() function.
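
Here is a quick sketch of what .strip() does, using a made-up string with whitespace on both ends.  Using repr() when printing makes the whitespace visible.

```python
raw_line = '  ACGT\t\n'

stripped_line = raw_line.strip()

print(repr(raw_line))       # spaces, tab, and newline are all still there
print(repr(stripped_line))  # whitespace at both ends is gone

# .strip() returns a new string; the original is unchanged
print(len(raw_line), len(stripped_line))
```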

In [None]:
line_number = 0

with gzip.open(path_to_file, 'rt') as f:
    for line in f:
                        
        if line_number >= 12:
            break
        
        line_number += 1
        print(line.strip())

Now, let's modify this a little bit to just print the reads.  We're going to use the modulo operation, <b>%</b> (https://en.wikipedia.org/wiki/Modulo_operation), which basically is the remainder after division.

FASTQ reads have a period of 4.  This means that we are dealing with a read when line_number = 2, 6, 10...; in other words, when the remainder of dividing line_number by 4 is equal to 2, which we write as line_number % 4 equals 2.
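
As a small sketch of the arithmetic on its own (no file involved), we can print the remainder for the first twelve line numbers and collect the ones where it equals 2.

```python
read_line_numbers = []

for line_number in range(1, 13):
    remainder = line_number % 4
    print(line_number, remainder)

    # a remainder of 2 marks the second line of each group of four
    if remainder == 2:
        read_line_numbers.append(line_number)

print(read_line_numbers)
```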

In [None]:
line_number = 0

with gzip.open(path_to_file, 'rt') as f:
    for line in f:
                        
        if line_number >= 12:
            break
        
        line_number += 1
        
        if line_number % 4 == 2:
            print(line.strip())

#### Now we've got a way to deal with the FASTQ files, which are gzipped, import each line of the file, keep track of how many lines we are reading in, and then print just the reads!

Note that if we have read in 12 lines, then it means that we have read in 3 (12/4) reads.  Put the other way, if we want to read in 100 reads, then we need to read in 400 lines.

#### Now, let's combine everything where we read in a single file, and return a Counter of the number of times we see each unique read.

To start, let's just read in 10 reads to get a sense of what things look like, before we eventually read in the entire file. Pay attention to what we have changed from above to make this work.

In [None]:
line_number = 0
read_counter = Counter()

with gzip.open(path_to_file, 'rt') as f:
    for line in f:
                        
        if line_number >= 10 * 4:
            break
        
        line_number += 1
        
        if line_number % 4 == 2:
            read_counter[line.strip()] += 1
            
print(read_counter)

#### Now write a function that will do all of this for us.

It will take as input the path to a file.  
It will return as an output a Counter of the read frequencies for that file.

In [None]:
def process_file(path_to_file):
    read_counter = Counter()
    
    # add your code here
    
    return read_counter

#### And let's put it together with the genotype function that we wrote above!

1. using process_file(), get a Counter for a file.
2. using genotype(), get the results for that file.

Note that in this case, crispr_well_0.fastq.gz is WT, meaning that the most common read in this file (which you just found) is the wt_reference.

In [None]:
# replace empty string with correct wt_reference sequence
wt_reference = ''

file_counter = process_file(path_to_file)
file_results = genotype(file_counter, wt_reference)

print(file_results)

#### When you run this with crispr_well_0, you should get the result ['WT', 'WT'].

### Part 4: Putting it all together and processing an entire folder of files.

Now, we're going to process the data for all of the files in our folder.

All we need to do is loop through all of the files, and then save the results.

In [None]:
# again, you'll need to change this for yourself
path = '/Users/kevin/changlab/covid19/crispr_96/'

# get a list of files
files = os.listdir(path)

# just to make our lives easier, let's sort this
files = sorted(files)

print(files)

In [None]:
wt_reference = ''

for fn in files:
    print(fn)
    path_to_file = path + fn
    
    file_counter = process_file(path_to_file)
    file_results = genotype(file_counter, wt_reference)
    print(file_results)

#### You should have now printed the results for each file!

Now, let's save the results in a new text file.  I'm going to provide a template where it just writes the same result for everything, but you'll need to modify it to process the files and write the actual results.

#### For the last part, outputting the results, it would be nice to know not just whether it is WT or mutant, but also some other information:

* How many reads total did each well get? (as an integer - no decimal point)
* What % of reads were for the first allele? (rounded to two decimal places)
* What % of reads were for the second allele? (also rounded to two decimal places)
* <i> In the case of a homozygous well (WT or mutant), only report a single allele and single percentage </i>
* <i> In the case of a bad well, still report the number of reads and the percent for each of the top two alleles </i>
    
You'll need to create a new function, genotype2(), to output not just the genotyping results (e.g., ['WT','sequenceofmutantallele']) but also the above information.  As an example, this could be [10000, 'WT', 45.55, 'sequenceofmutantallele', 43.28].

There's a couple of things to note.  First, that your list will now contain both numbers and strings. However, in order to write the output, your entire line needs to be a string.  There's a few ways you could do this - you could modify the above line to say [str(variable_with_read_count), 'var_with_allele1', str(var_with_percent1), 'var_with_allele2', str(var_with_percent2)].  Or, you could do what is written in the first line of the template to write the results section, which uses a <b> list comprehension</b> (don't worry about this too much yet!), to convert everything in the list file_result to a string.  List comprehensions are a useful tool in Python that let you do simple operations to each item in a list (or any other iterable - something you can iterate over), like convert every object to a string.
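
As a sketch of what the list comprehension is doing, here it is next to the equivalent for loop, using a short made-up list.

```python
mixed = [10000, 'WT', 45.55]

# list comprehension: apply str() to every item and collect the results
as_strings = [str(i) for i in mixed]

# the same thing written as a for loop
as_strings_loop = []
for i in mixed:
    as_strings_loop.append(str(i))

print(as_strings)
print(as_strings_loop)
```

Both versions give the same list of strings; the comprehension is just more compact.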

Second, we want to round the percentages to two decimal places.  Python has a built in round() function, which you'll need to look up how to use (https://docs.python.org/3/library/functions.html#round) - it's important to know how to look things up that you don't know how to use, and learn how to read the documentation for something.
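
To give you a starting point before you read the documentation, here is a minimal sketch of round() with made-up numbers.  The second argument is the number of decimal places to keep.

```python
# one third, as a percent, has a lot of decimal places
percent = 100 * 1 / 3
print(percent)

# keep two decimal places
percent_rounded = round(percent, 2)
print(percent_rounded)

# with no second argument, round() returns an integer
percent_whole = round(percent)
print(percent_whole)
```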

I'd recommend first just trying to get the existing genotype() function working here - just output the allele results and make sure you can do that.  Then, make genotype2() (and just copy in the code for genotype()) and modify it to add in each piece of information, one by one.  In other words, try to do things step-by-step, adding things in one-by-one, rather than doing everything at once - this will make it easier to troubleshoot because you're changing fewer things at a time.

<b>Here is what we are doing with the last three lines:</b> first, we're making a new list of everything we want to write in our output file by adding the file name (which needs to be in a list - you can't add a string to a list, only a list to a list) to the output from genotype (where we're using a list comprehension to convert everything to a string).  Then, we're joining the strings in this list with a tab, which is written as '\t'.  Finally, we're writing the result for each file, adding a newline ('\n') at the end of each line.

Feel free to play around with different things.  Take the last three lines of code and try running them separately, and printing the result after you do each line, just to see what is happening.  What happens if you don't include the newline at the end?  What if you want to make the end file comma delimited (',') as opposed to tab delimited ('\t')?

#### Also, since we're outputting in a tab delimited text format (the two main formats are either tab separated (usually .txt or .tsv) or comma separated (.csv)), you should be able to open your resulting file in Excel and look at it there (or in any other text editor).
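
If you want to check a tab delimited file from Python rather than Excel, here is a sketch that writes two made-up result rows to a temporary file and reads them back.  The file name, well names, and values are invented for this example.

```python
import os
import tempfile

# two made-up result rows: one heterozygous well and one homozygous well
mock_results = [
    ['well_a.fastq.gz', 10000, 'WT', 45.55, 'ACGT', 43.28],
    ['well_b.fastq.gz', 8000, 'WT', 97.5],
]

demo_file = os.path.join(tempfile.mkdtemp(), 'demo_results.txt')

# write one tab delimited line per row
with open(demo_file, 'w') as f:
    for row in mock_results:
        f.write('\t'.join([str(i) for i in row]) + '\n')

# read the file back and split each line on the tab character
rows_read_back = []
with open(demo_file, 'r') as f:
    for line in f:
        rows_read_back.append(line.strip().split('\t'))

print(rows_read_back)
```

Note that everything comes back as a string, so numbers would need to be converted with int() or float() before doing any math with them.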

In [1]:
def genotype2():
    # add your code here (you can start by copying in the code for genotype())
    pass

In [None]:
# change the start of this to match your own computer
output_file = '/Users/kevin/changlab/covid19/crispr_96_results.txt'

with open(output_file, 'w') as f:
    for fn in files:
        wt_reference = ''
        
        # add code processing files here
        
        # replace this with the actual results
        file_result = [10000, 'WT', 50.00, 'WT', 41.28]
        
        # template to write the results
        line_to_write = [fn] + [str(i) for i in file_result]
        line_to_write = '\t'.join(line_to_write)
        f.write(line_to_write + '\n')

### Congratulations for making it to the end!!!

Comments: Feedback, suggestions, complaints...