# Welcome to Chang Lab Bioinformatics (R version)!
We're going to start with an introduction to R, just to get a handle on basics and how to deal with data. In this experiment, we're imagining that you're trying to create a clonal line with a mutation at a given locus. You've picked lots of clones, extracted DNA and amplified the target region by PCR, and are now trying to analyze your results to quickly tell which wells are WT or mutant (or heterozygous).

The first section will be an introduction, and then we'll go through dealing with one well, and then go through dealing with all 96 wells.

### Part 0: Your name

Replace this with your name.

### Part 1: Introduction
Make sure that you are familiar with the difference between a string, integer, and double; as well as data frames and vectors; boolean operators (equals to, greater than, less than, etc.); and the basics of defining a function.

We're going to use the **table** function frequently. To see how it works, type **?table**. 

We're going to practice with an example 'dataset' before moving on to actual (simulated) FASTQ reads.

**table** is a built-in function, so we don't need to load any R packages right now, but typically loading packages would be the first thing to do in a new session.

In [1]:
# mock dataset, don't change

test_data = c('cat',
            'dog',
            'cat',
            'mouse',
            'mouse',
            'cat',
            'cat',
            'dog',
            'rat',
            'dog',
            'rabbit',
            'mouse',
            'cat',
            'cat',
            'dog',
            'elephant')

How many elements are in test_data? (Hint: use the **length()** function)

In [2]:
# We can use the table function to print the frequencies of each item in the test_data vector

table(test_data)

test_data
     cat      dog elephant    mouse   rabbit      rat 
       6        4        1        3        1        1 

In [3]:
# Now we can use the sort function to find the most common elements of test_data 
# (by default sort orders in increasing order, so we use the decreasing = TRUE argument)

sort(table(test_data), decreasing = TRUE)

test_data
     cat      dog    mouse elephant   rabbit      rat 
       6        4        3        1        1        1 

What if we only wanted to see the top 3 most common elements? (Hint: look at the **head** function using **?head**)
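
One possible approach, sketched on a small made-up vector (`animals` is a hypothetical stand-in for test_data, and `n = 2` is chosen only to keep the example short): **head** keeps the first `n` entries of the sorted table.

```r
# Hypothetical toy vector, not the tutorial's test_data
animals <- c("cat", "dog", "cat", "mouse", "cat", "dog")

# Sort the counts from most to least common
counts <- sort(table(animals), decreasing = TRUE)

# Keep only the two most common entries
top2 <- head(counts, n = 2)
names(top2)   # "cat" "dog"
```

The same call with `n = 3` should answer the question for test_data.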

If you want to store the result of the sorted table, you'll need to do that explicitly by assigning it to the same or new variable.

In [4]:
test_table <- table(test_data)
test_table_sort <- sort(test_table, decreasing = TRUE)

test_table
test_table_sort

test_data
     cat      dog elephant    mouse   rabbit      rat 
       6        4        1        3        1        1 

test_data
     cat      dog    mouse elephant   rabbit      rat 
       6        4        3        1        1        1 

#### Let's say we want to calculate the percent of elements accounted for by the third most common result.  How would we do this?

First, let's make a data.frame of the table results: 

In [5]:
test_df <- data.frame(test_table_sort)
test_df

test_data,Freq
<fct>,<int>
cat,6
dog,4
mouse,3
elephant,1
rabbit,1
rat,1


In [6]:
# Now we can divide each value in the Freq column by the total (using the sum function) to get the percent.
# Note that the result is stored as a fraction between 0 and 1, so 0.375 means 37.5%.
# We use the $ operator to name a new column of the dataframe:

test_df$percent <- test_df$Freq / sum(test_df$Freq)
test_df

test_data,Freq,percent
<fct>,<int>,<dbl>
cat,6,0.375
dog,4,0.25
mouse,3,0.1875
elephant,1,0.0625
rabbit,1,0.0625
rat,1,0.0625
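
As an aside, base R also provides **prop.table**, which converts a table of counts into fractions in one step. A minimal sketch on a made-up vector (`animals` is hypothetical, not the tutorial's data):

```r
# Hypothetical toy vector
animals <- c("cat", "cat", "dog", "mouse")

# prop.table divides each count by the total
fractions <- prop.table(table(animals))

# Look up a single entry by name
cat_fraction <- as.numeric(fractions["cat"])
cat_fraction   # 0.5
```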


Second, we need to get the data for the third-most common item. This can be done by indexing the dataframe. Unlike Python, R is 1-indexed, so an index of 1 corresponds to the first row of the dataframe, etc. Dataframes are indexed with the row index first, followed by the column index, so **dat[1,]** corresponds to the first row of a dataframe and **dat[,1]** to the first column.
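
A minimal sketch of the row-then-column convention, using a small hypothetical dataframe (`dat` and its column names are made up for illustration):

```r
# Hypothetical toy dataframe
dat <- data.frame(animal = c("cat", "dog", "mouse"),
                  Freq = c(6L, 4L, 3L),
                  stringsAsFactors = FALSE)

first_row <- dat[1, ]        # row index before the comma: a one-row dataframe
first_col <- dat[, 1]        # column index after the comma: a plain vector
one_cell  <- dat[2, "Freq"]  # rows and columns can be mixed: number and name

first_col   # "cat" "dog" "mouse"
one_cell    # 4
```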

In [7]:
test_df[3,]

,test_data,Freq,percent
,<fct>,<int>,<dbl>
3,mouse,3,0.1875


In [8]:
# If we just want the name of the 3rd most common element, we can use row and column indexing.
# Rows and columns can be indexed either by number or by name:

test_df[3, 1]
test_df[3, "test_data"]

Notice that the output here is a **factor**, which has a set of **levels**. Factors differ from strings because they are restricted to a fixed set of values (the levels), and those levels have a defined order. This may be annoying when you are starting out, but factors are very useful! If you don't want something to be a factor, you can convert it back to a regular string using the **as.character** function.
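
To make the difference concrete, here is a small sketch with a hypothetical factor `f`. Under the hood, a factor stores integer codes that point into its levels, which is why it behaves differently from a plain character vector.

```r
# Hypothetical toy factor
f <- factor(c("dog", "cat", "dog"))

f_levels <- levels(f)        # the allowed values, sorted alphabetically by default
f_codes  <- as.integer(f)    # the integer codes stored inside the factor
f_chars  <- as.character(f)  # back to ordinary strings

f_levels   # "cat" "dog"
f_codes    # 2 1 2
f_chars    # "dog" "cat" "dog"
```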

In [9]:
test_df$test_data <- as.character(test_df$test_data)
test_df[3, 1]
test_df[3, "test_data"]

One very useful thing to be able to do is to subset dataframes using logical vectors. Let's use a logical vector to return all animals with over 10% frequency in our dataset.

In [10]:
# We can make a logical vector based on the value of the percent column:

test_df$percent > 0.10

In [11]:
# Now we can use this vector to subset our dataframe (should this go before or after the comma? Why?):

test_df[test_df$percent > 0.10,]

,test_data,Freq,percent
,<chr>,<int>,<dbl>
1,cat,6,0.375
2,dog,4,0.25
3,mouse,3,0.1875


In [12]:
# If we just want the names of animals, we can just subset the test_data column.
# Can you think of another way to do this using row column indices?

test_df$test_data[test_df$percent > 0.10]

In [13]:
# If we want to return only the animals within a range, we can combine two conditions with the & (and) operator.
# Let's find the names of the animals that are between 10-30% frequency:

upper <- 0.30
lower <- 0.10

test_df$test_data[test_df$percent > lower & test_df$percent < upper]

#### More things to try!

Try playing around with the limits (0.30, 0.10), and try using an **or** statement (the **|** operator) to return animals with either more than 30% or less than 10% frequency. Also try changing what value is returned. For example, write a statement that prints the value of the percent column for rows where it is between 0.30 and 0.50 (30-50%).
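
One way these exercises might look, sketched on a hypothetical dataframe `df` that mirrors the fractions computed earlier (the names `df`, `animal`, `extremes`, and `mid_values` are made up for this example):

```r
# Hypothetical dataframe with the same fractions as the animal example
df <- data.frame(animal = c("cat", "dog", "mouse", "elephant", "rabbit", "rat"),
                 percent = c(0.375, 0.25, 0.1875, 0.0625, 0.0625, 0.0625),
                 stringsAsFactors = FALSE)

# | (or): keep rows where either condition is TRUE
extremes <- df$animal[df$percent > 0.30 | df$percent < 0.10]

# & (and): return the percent values themselves instead of the names
mid_values <- df$percent[df$percent > 0.30 & df$percent < 0.50]

extremes     # "cat" "elephant" "rabbit" "rat"
mid_values   # 0.375
```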

### Part 2: Writing a function to determine whether a list of 'reads' is homozygous WT, heterozygous, or mutant (on both alleles).

Here, we're going to apply the concepts above to three test datasets.  Each of these datasets is going to be a table of counts, like the ones we made above.  However, to make things simpler, instead of reads, we're going to use animals; and we're just going to say that 'cats' are wild-type and anything else is 'mutant'.

In [14]:
c1 <- table(c('cat','cat','cat','dog','cat','cat','cat','rat','cat','cat'))
c2 <- table(c('cat','cat','cat','dog','dog','dog','cat','dog','rat','cat'))
c3 <- table(c('dog','rat','dog','rat','dog','rat','dog','rat','dog','rat'))

c1
c2
c3


cat dog rat 
  8   1   1 


cat dog rat 
  5   4   1 


dog rat 
  5   5 

Just by looking at this, we can assign each of these as a particular status: c1 is WT, c2 is a het, and c3 is homozygous mutant.  But let's write a function to do this for us!

#### First, we need to think of the criteria that we mentally apply when deciding which status c1, c2, and c3 have.

Let's set forth the following rules for each condition:

+/+ (WT): at least 80% of the 'reads' are the WT read  
+/- : at least 40% of the 'reads' are WT, and at least 40% of the reads are for another non-WT allele  
-/- : the WT reads are fewer than 20% of the total number of reads.  Note that there are actually two possible cases here: it could be homozygous (two of the same mutant alleles) or heterozygous (two different mutant alleles).  In the first situation (+/+), we said that the WT allele needed to represent at least 80% of the reads.  So it seems reasonable to say that if at least 80% of the reads are for a single allele, then we will call it homozygous mutant, and if there are two alleles with at least 40% of the reads each, we'll call it heterozygous mutant. 

Note that there's also a fourth situation, which is deciding that we have bad data.  For example, there's just a lot of random stuff and it doesn't look like good/real data.

In [15]:
c4 <- table(c('cat','dog','rat','cat','dog','rat','cat','dog','rat','alligator'))
c4


alligator       cat       dog       rat 
        1         3         3         3 

#### What is our function going to do?

Our function will have two inputs: the table of read counts and the wild-type reference.

It will return as output one of five vectors: c("WT", "WT"), c("WT", "allele2"), c("allele1", "WT"), c("allele1", "allele2"), where alleles 1 and 2 are the non-WT alleles, or c("bad", "bad") in the situation talked about above, where the data looks bad.

#### What are the steps we are going to take?

1. Get the percent frequency for each element in the vector.
2. Look at the first most common element  
2.1 Determine if this element is WT or mutant  
2.2 If it has at least 80% of the reads, then we are dealing with a homozygous situation and <b>return</b> early (since there's no need to look at the second allele). On the other hand, if it has at least 40% of the reads, then we are dealing with a heterozygous situation.  
2.3 If it doesn't then <b>return</b> 'bad' early (there's no need to look at the second allele if the most common one is under 40%, because the second most common one will also be under 40%)  
3. Look at the second most common element  
3.1 Check if it has at least 40% of the reads: if not (meaning that the first most common read was at least 40%, but the second most common read was less than 40%) <b>return</b> 'bad'  
3.2 Determine if the element is WT or mutant  
4. <b>Return</b> the status

Note that a function can only return once: once your function hits a return statement, it will not run anything else below.
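
A tiny sketch of early return, using a hypothetical helper `classify_fraction` (not part of the exercise) that takes a single fraction. Once one `return()` runs, none of the later lines are evaluated.

```r
# Hypothetical helper: classify the fraction of reads held by the top allele
classify_fraction <- function(frac) {
    if (frac >= 0.8) {
        return("homozygous")    # stops here when frac is at least 0.8
    }
    if (frac >= 0.4) {
        return("heterozygous")  # only reached when frac is below 0.8
    }
    # only reached when frac is below 0.4
    return("bad")
}

classify_fraction(0.9)   # "homozygous"
classify_fraction(0.5)   # "heterozygous"
classify_fraction(0.3)   # "bad"
```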

#### I've laid out certain components of the function, but you're going to have to use the skills you learned above to fill in the blanks!

In [16]:
# Let's first explore the data without creating a function:
# You can use this cell to test lines of code before putting them into the genotype function. 
# The first part is done for you:

c <- c1
wt_reference <- "cat"

dat <- data.frame(sort(c, decreasing = TRUE))
dat

Var1,Freq
<fct>,<int>
cat,8
dog,1
rat,1


In [17]:
# Now, we are going to create our new function, genotype()

genotype <- function(c, wt_reference) {
    
    # 1. Sort the results and store in a dataframe
    dat <- data.frame(sort(c, decreasing = TRUE))
    
    # 1.1. Convert to character (factors will cause problems here - you can remove this line to see what happens!)
    dat[,1] <- as.character(dat[,1])
    
    # 1.2. Calculate the percent for each sequence
    dat$percent <- dat$Freq / sum(dat$Freq)
    
    # 2.1. Determine if the most common element is the wild-type allele and if it has at least 80% of the reads
    if (dat[1,1] == wt_reference && dat$percent[1] >= 0.8) {
        # 2.2. if so, return the WT vector early
        return(c("WT", "WT"))
        
    # 2.3. Check if the most common element has less than 40% of the reads
    } else if (dat$percent[1] < 0.4) {
        # if so, return the bad vector early
        return(c("bad", "bad"))
        
    # 3.1. Check if the second most common element has less than 40% of the reads - you have to do this on your own!
    # if so, return the bad vector early
    } else if (dat$percent[2] < 0.4) {
        return(c("bad", "bad"))
        
    # 3.2. Determine which elements are wild-type - you have to do this on your own!
    } else {
        dat[,1][dat[,1] == wt_reference] <- "WT"
        # 4. Return the two alleles.
        return(c(dat[1,1], dat[2,1]))
    }
}

In [18]:
genotype(c1, "cat")
genotype(c2, "cat")
genotype(c3, "cat")
genotype(c4, "cat")
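
The rules above also describe a homozygous mutant case (a single non-WT allele with at least 80% of the reads), whereas genotype() returns early only when the most common allele is WT. As a possible extension, here is a hedged sketch of a variant that applies the 80% rule to any allele. The name `genotype_hom` and the argument name `counts` are hypothetical, and it works on the sorted table directly rather than on a dataframe. It is one way to encode the rules, not the tutorial's reference solution.

```r
# Hypothetical variant of genotype(); `counts` is a table of read counts
genotype_hom <- function(counts, wt_reference) {
    # Sort so that the most common allele comes first
    counts <- sort(counts, decreasing = TRUE)
    alleles <- names(counts)
    frac <- as.numeric(counts) / sum(counts)

    # Relabel the wild-type sequence so it is reported as "WT"
    alleles[alleles == wt_reference] <- "WT"

    # Homozygous (WT or mutant): one allele has at least 80% of the reads
    if (frac[1] >= 0.8) {
        return(c(alleles[1], alleles[1]))
    }
    # Bad data: the top allele, or the second allele, is under 40%
    if (frac[1] < 0.4 || length(frac) < 2 || frac[2] < 0.4) {
        return(c("bad", "bad"))
    }
    # Heterozygous: report the two most common alleles
    return(c(alleles[1], alleles[2]))
}

# Hypothetical example: nine 'dog' reads and one 'cat' read
hom_mut <- table(c(rep("dog", 9), "cat"))
genotype_hom(hom_mut, "cat")   # "dog" "dog"
```

With the original genotype(), the same input would fall through to the second-allele check and come back as "bad" "bad", so wells reported as bad may be worth a second look.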

#### Does everything look good? Congratulations on finishing this!! You've now learned the basics of writing a function, performing boolean operations, using if statements, and working with tables!

### Part 3: Applying this to our FASTQ data.

We're going to do this in two parts.  First, we're going to learn to deal with a single FASTQ file.  Then, we're going to deal with an entire folder of FASTQ files.

We're also going to learn how to import a text file: at heart, a FASTQ file is just a text file, where each line has a different piece of information.  Each FASTQ read comprises four lines:
(see https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html for more information)

1. Read ID: information on machine, cluster location, etc.  For our purposes, not important.
2. The actual read.  Important!
3. Separator (a + sign).  Not important.
4. Base quality scores.  Often important, but we're going to ignore it for now and just assume that all of the reads are good enough.

So we can think of a FASTQ file as having a periodicity of 4, where the 2nd, 6th, 10th, etc. lines are the reads.  Which means that when we are reading in a FASTQ file, we only want to pay attention to the 2nd, 6th, 10th, etc. lines.
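
To illustrate the periodicity, here is a minimal sketch on a made-up FASTQ stored as a character vector (the read IDs, sequences, and quality strings in `fastq_lines` are invented for this example):

```r
# Hypothetical three-read FASTQ, one element per line of the file
fastq_lines <- c("@read1", "ACGT", "+", "IIII",
                 "@read2", "TTGA", "+", "IIII",
                 "@read3", "ACGT", "+", "IIII")

# Every 4th line starting from line 2 holds a read
read_idx <- seq(2, length(fastq_lines), by = 4)
reads <- fastq_lines[read_idx]

read_idx   # 2 6 10
reads      # "ACGT" "TTGA" "ACGT"
```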

#### The first thing we need to do is create a new variable, called <i>path</i>, that is the path to the folder (directory) that has our files.  
You can find this in two ways: 1) in terminal, navigate to the directory with the FASTQ files (crispr_96_data), and type <i>pwd</i>. 2) in Finder, right click on a FASTQ file in that folder, click "get info" and in the "general" tab, look at "where" and that will be the path: it should be something like /Users/kevin/etc.

In [19]:
# creating our variable path which has the location to our files
# note that this is a string, and so should be enclosed in quotes
# also, make sure that it ends with a / ! This will be important in a second
path  <- 'crispr_96_data/'

In [20]:
# get a list of files
files <- list.files(path)

# just to make our lives easier, let's sort this
files <- sort(files)

files

Check how long the vector <b>files</b> is.  It should be 96.

Let's just start with the first file.

In [21]:
# Remember that R is 1-indexed!

fn <- files[1]
fn

Important note! <b>fn</b> is now a string that is the name of a single file in the directory.  The full path to the <i>file</i> is:

In [22]:
# Unlike Python, R does not let you concatenate strings with +. Instead, we use the paste function.
# paste0(...) is a shortcut for paste(..., sep = "")

path_to_file <- paste0(path, fn)
path_to_file

Above, we've joined two strings together.  This is why making sure that our variable <b>path</b> ended with a / was important: if it didn't, then we would be looking for a file called "crispr_96_datacrispr_well_0.fastq.gz", rather than the file "crispr_well_0.fastq.gz" in the "crispr_96_data" directory.
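
As an alternative sketch, base R's **file.path** inserts the separator for you, so the directory name does not need a trailing slash. The directory and file names below are simply the ones used in this tutorial; nothing is read from disk.

```r
# paste0 needs the trailing slash to be part of the path string
with_paste <- paste0("crispr_96_data/", "crispr_well_0.fastq.gz")

# file.path adds the "/" between its arguments
with_file_path <- file.path("crispr_96_data", "crispr_well_0.fastq.gz")

with_paste       # "crispr_96_data/crispr_well_0.fastq.gz"
with_file_path   # "crispr_96_data/crispr_well_0.fastq.gz"
```

Another option is `list.files(path, full.names = TRUE)`, which returns the paths with the directory already prepended.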

#### Now we're going to learn to open the file.  Importantly, this file is <i>gzipped</i>. 

For an uncompressed file, we would say:

    readLines(file)

Since our files are gzipped, we're going to use the gzfile function to open this file, and say:

    readLines(gzfile(file))
    
Since these files are small, we will just read all lines into memory. If the files were large we would want to read in line by line or in chunks to avoid overwhelming the memory.
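
For illustration, here is a hedged sketch of reading a gzipped file in chunks of four lines (one read at a time) from an open connection. To stay self-contained it first writes a tiny made-up FASTQ to a temporary file; the contents and the variable names (`tmp`, `con`, `reads`) are hypothetical.

```r
# Write a hypothetical two-read FASTQ to a temporary gzipped file
tmp <- tempfile(fileext = ".fastq.gz")
con <- gzfile(tmp, open = "w")
writeLines(c("@read1", "ACGT", "+", "IIII",
             "@read2", "TTGA", "+", "IIII"), con)
close(con)

# Read it back four lines at a time, keeping only the read (2nd line of each chunk)
con <- gzfile(tmp, open = "r")
reads <- character(0)
repeat {
    chunk <- readLines(con, n = 4)
    if (length(chunk) == 0) break   # end of file reached
    reads <- c(reads, chunk[2])
}
close(con)
unlink(tmp)   # remove the temporary file

reads   # "ACGT" "TTGA"
```

Because the connection is opened once and stays open, each call to readLines picks up where the previous one stopped. Growing `reads` with `c()` is fine for a sketch but would be slow for very large files.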

#### Now let's put it all together and print the first twelve lines of the file, corresponding to the first three reads.

In [23]:
head(readLines(gzfile(path_to_file)), n = 12)

Now, let's modify this a little bit to just print the reads.  We're going to use the seq function, which creates a vector of numbers with a given interval between them. Use **?seq** for more information.

Basically, FASTQ reads have a period of 4.  This means that we want a vector that starts at 2 and ends at the last read, counting up by 4. 

In [24]:
# Let's save the fastq file text as a variable (What data type is the output saved as?):
fLines <- readLines(gzfile(path_to_file))

# Here's my indexing vector created by the seq function:
head(seq(2,length(fLines),4))

# Now we can use this to index the fLines variable to get just the read lines:
head(fLines[seq(2,length(fLines),4)], n=3)

#### Now we've got a way to deal with the FASTQ files, which are gzipped, import each line of the file, and then print just the reads!

#### Now, let's combine everything where we read in a single file, and return a table of the number of times we see each unique read.

To start, let's just read in 10 reads to get a sense of what things look like, before we eventually read in the entire file. Pay attention to what we have changed from above to make this work.

In [25]:
# Let's overwrite our fLines variable with just the read lines - this will save memory:
fLines <- readLines(gzfile(path_to_file))
fLines <- fLines[seq(2,length(fLines),4)]
table(fLines[1:10])


GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCA 
                                                                                                                                                    10 

#### Now write a function that will do all of this for us.

It will take as inputs the path to a file.  
It will return as an output a table of the read frequencies for that file.

In [26]:
process_file <- function(path_to_file){
    fLines <- readLines(gzfile(path_to_file))
    fLines <- fLines[seq(2,length(fLines),4)]
    return(table(fLines))
}

#### And let's put it together with the genotype function that we wrote above!

1. using process_file(), get a table for a file.
2. using genotype(), get the results for that file.

Note that in this case, crispr_well_0.fastq.gz is WT, meaning that the most common read in this file (which you just found) is the wt_reference.
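
A minimal sketch of pulling the most common read out of a table programmatically, rather than copying it by hand. The `reads` vector here is made up; with real data you would start from the table returned by process_file().

```r
# Hypothetical reads standing in for one well
reads <- c("ACGT", "ACGT", "ACGA", "ACGT")

# Sort the counts, then take the name of the first (most common) entry
read_table <- sort(table(reads), decreasing = TRUE)
most_common <- names(read_table)[1]

most_common   # "ACGT"
```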

In [27]:
# wt_reference is the most common read in crispr_well_0 (the read found in the table above)
wt_reference <- 'GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCA'

file_table <- process_file(path_to_file)
file_results <- genotype(file_table, wt_reference)

file_results

#### When you run this with crispr_well_0, you should get the result "WT" "WT".

### Part 4: Putting it all together and processing an entire folder of files.

Now, we're going to process the data for all of the files in our folder.

All we need to do is loop through all of the files, and then save the results.
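
The loop in this part prints each result. To actually keep the results, one option is to collect them in a list and turn that into a dataframe afterwards. The sketch below uses a hypothetical, hand-written `results` list in place of real genotype() output, and the mutant allele is a made-up placeholder string; in a real loop you might fill the list with `results[[fn]] <- file_results`.

```r
# Hypothetical results, keyed by file name (stand-ins for genotype() output)
results <- list(
    "crispr_well_0.fastq.gz"  = c("WT", "WT"),
    "crispr_well_1.fastq.gz"  = c("ACGTTT", "WT"),
    "crispr_well_11.fastq.gz" = c("bad", "bad")
)

# One row per file, one column per allele
summary_df <- data.frame(
    file    = names(results),
    allele1 = vapply(results, function(x) x[1], character(1)),
    allele2 = vapply(results, function(x) x[2], character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
)

# Count how many wells were called homozygous WT
n_wt <- sum(summary_df$allele1 == "WT" & summary_df$allele2 == "WT")
n_wt   # 1
```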

In [28]:
# again, you'll need to change this for yourself
path <- 'crispr_96_data/'

# get a list of files
files <- list.files(path)

# just to make our lives easier, let's sort this
files <- sort(files)

files
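
Note that **sort** orders file names as text, so "crispr_well_10" comes before "crispr_well_2". If you would rather process the wells in numeric order, one possible sketch is to extract the well number and order by it. The regular expression assumes the file names follow the `crispr_well_<number>.fastq.gz` pattern seen here, and `example_files` is a shortened, hypothetical list.

```r
# Hypothetical subset of file names, in text-sorted order
example_files <- c("crispr_well_0.fastq.gz", "crispr_well_1.fastq.gz",
                   "crispr_well_10.fastq.gz", "crispr_well_2.fastq.gz")

# Pull out the digits between "crispr_well_" and ".fastq.gz"
well_number <- as.integer(sub("^crispr_well_([0-9]+)\\.fastq\\.gz$", "\\1",
                              example_files))

# Reorder the file names by well number
files_by_well <- example_files[order(well_number)]

well_number   # 0 1 10 2
```

After this, `files_by_well` lists wells 0, 1, 2, 10 in that order.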

In [29]:
wt_reference = 'GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCA'

for (fn in files) {
    message(fn)
    path_to_file <- paste0(path, fn)
    
    file_table <- process_file(path_to_file)
    file_results <- genotype(file_table, wt_reference)
    
    print(file_results)
}

crispr_well_0.fastq.gz


[1] "WT" "WT"


crispr_well_1.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTCGTGTCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACG"
[2] "WT"                                                                                                                                                    


crispr_well_10.fastq.gz


[1] "WT" "WT"


crispr_well_11.fastq.gz


[1] "bad" "bad"


crispr_well_12.fastq.gz


[1] "WT" "WT"


crispr_well_13.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCGTGACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGC"


crispr_well_14.fastq.gz


[1] "WT" "WT"


crispr_well_15.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGATGTTCGTCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGA"


crispr_well_16.fastq.gz


[1] "WT" "WT"


crispr_well_17.fastq.gz


[1] "WT" "WT"


crispr_well_18.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCAGGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCAT"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATC"


crispr_well_19.fastq.gz


[1] "WT" "WT"


crispr_well_2.fastq.gz


[1] "WT" "WT"


crispr_well_20.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCAT"


crispr_well_21.fastq.gz


[1] "bad" "bad"


crispr_well_22.fastq.gz


[1] "WT" "WT"


crispr_well_23.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGGT"


crispr_well_24.fastq.gz


[1] "WT" "WT"


crispr_well_25.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGT"


crispr_well_26.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGG"
[2] "WT"                                                                                                                                                    


crispr_well_27.fastq.gz


[1] "WT" "WT"


crispr_well_28.fastq.gz


[1] "WT" "WT"


crispr_well_29.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATC"
[2] "WT"                                                                                                                                                    


crispr_well_3.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGAAACAAAGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGA"


crispr_well_30.fastq.gz


[1] "WT" "WT"


crispr_well_31.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCATCACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCA"


crispr_well_32.fastq.gz


[1] "WT" "WT"


crispr_well_33.fastq.gz


[1] "WT" "WT"


crispr_well_34.fastq.gz


[1] "WT" "WT"


crispr_well_35.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGCCCCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCA"


crispr_well_36.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTG"
[2] "WT"                                                                                                                                                    


crispr_well_37.fastq.gz


[1] "WT" "WT"


crispr_well_38.fastq.gz


[1] "WT" "WT"


crispr_well_39.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTG"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGTCCGAGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAAC"


crispr_well_4.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGT"


crispr_well_40.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATT"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATC"


crispr_well_41.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCCGTTCCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACG"


crispr_well_42.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCAGGCGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGC"
[2] "WT"                                                                                                                                                    


crispr_well_43.fastq.gz


[1] "WT" "WT"


crispr_well_44.fastq.gz


[1] "WT" "WT"


crispr_well_45.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCCTAAGGGTGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATG"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCG"


crispr_well_46.fastq.gz


[1] "bad" "bad"


crispr_well_47.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTC"
[2] "WT"                                                                                                                                                    


crispr_well_48.fastq.gz


[1] "WT" "WT"


crispr_well_49.fastq.gz


[1] "bad" "bad"


crispr_well_5.fastq.gz


[1] "bad" "bad"


crispr_well_50.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCATTTGCACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAAC"
[2] "WT"                                                                                                                                                    


crispr_well_51.fastq.gz


[1] "WT" "WT"


crispr_well_52.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGT"


crispr_well_53.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTC"
[2] "WT"                                                                                                                                                    


crispr_well_54.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGGTC"


crispr_well_55.fastq.gz


[1] "bad" "bad"


crispr_well_56.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGG"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTACCTACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACG"


crispr_well_57.fastq.gz


[1] "WT" "WT"


crispr_well_58.fastq.gz


[1] "WT" "WT"


crispr_well_59.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCAGTTCCCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACG"


crispr_well_6.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTAAAAGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACG"


crispr_well_60.fastq.gz


[1] "WT" "WT"


crispr_well_61.fastq.gz


[1] "WT" "WT"


crispr_well_62.fastq.gz


[1] "WT" "WT"


crispr_well_63.fastq.gz


[1] "WT" "WT"


crispr_well_64.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCACTAAGTGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAA"
[2] "WT"                                                                                                                                                    


crispr_well_65.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGGT"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATC"


crispr_well_66.fastq.gz


[1] "WT" "WT"


crispr_well_67.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGGTC"
[2] "WT"                                                                                                                                                    


crispr_well_68.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGATCCCTTACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGA"
[2] "WT"                                                                                                                                                    


crispr_well_69.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTG"
[2] "WT"                                                                                                                                                    


crispr_well_7.fastq.gz


[1] "WT" "WT"


crispr_well_70.fastq.gz


[1] "bad" "bad"


crispr_well_71.fastq.gz


[1] "WT" "WT"


crispr_well_72.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCAT"
[2] "WT"                                                                                                                                                    


crispr_well_73.fastq.gz


[1] "WT" "WT"


crispr_well_74.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGG"
[2] "WT"                                                                                                                                                    


crispr_well_75.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTG"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGGT"


crispr_well_76.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGT"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCAATCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCAT"


crispr_well_77.fastq.gz


[1] "WT" "WT"


crispr_well_78.fastq.gz


[1] "WT" "WT"


crispr_well_79.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGGTGACCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACG"
[2] "WT"                                                                                                                                                    


crispr_well_8.fastq.gz


[1] "WT" "WT"


crispr_well_80.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGT"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGG"


crispr_well_81.fastq.gz


[1] "WT" "WT"


crispr_well_82.fastq.gz


[1] "WT" "WT"


crispr_well_83.fastq.gz


[1] "WT" "WT"


crispr_well_84.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCAGTACCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGC"
[2] "WT"                                                                                                                                                    


crispr_well_85.fastq.gz


[1] "WT" "WT"


crispr_well_86.fastq.gz


[1] "WT" "WT"


crispr_well_87.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTGCACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCA"
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCTTAGCGACTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAAC"


crispr_well_88.fastq.gz


[1] "WT" "WT"


crispr_well_89.fastq.gz


[1] "WT" "WT"


crispr_well_9.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTC"


crispr_well_90.fastq.gz


[1] "WT"                                                                                                                                                    
[2] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTGTGGT"


crispr_well_91.fastq.gz


[1] "WT" "WT"


crispr_well_92.fastq.gz


[1] "WT" "WT"


crispr_well_93.fastq.gz


[1] "WT" "WT"


crispr_well_94.fastq.gz


[1] "GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCATCGTG"
[2] "WT"                                                                                                                                                    


crispr_well_95.fastq.gz


[1] "WT" "WT"


#### You should have now printed the results for each file!

Now, let's save the results in a new text file.  I'm going to provide a template where it just writes the same result for everything, but you'll need to modify it to process the files and write the actual results.

#### For the last part, outputting the results, it would be nice to know not just whether it is WT or mutant, but also some other information:

* How many reads total did each well get? (as an integer - no decimal point)
* What % of reads were for the first allele? (rounded to two decimal places)
* What % of reads were for the second allele? (also rounded to two decimal places)
* <i> In the case of a homozygous well (WT or mutant), only report a single allele and single percentage </i>
* <i> In the case of a bad well, still report the number of reads and the percent for each of the top two alleles </i>
    
You'll need to create a new function, genotype2(), to output not just the genotyping results (e.g., c('WT','sequenceofmutantallele')) but also the above information.  As an example, this could be c(10000, 'WT', 45.55, 'sequenceofmutantallele', 43.28).
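As a rough sketch of where those extra numbers come from, here is a toy example that builds the same kind of vector from a small made-up set of reads. The reads and the variable names (`toy_reads`, `toy_counts`, `summary_vec`) are invented for illustration and are not part of the real dataset.

```r
# Ten made-up "reads": 6 of one sequence, 3 of another, 1 stray
toy_reads <- c(rep("AAAA", 6), rep("AAGA", 3), "TTTT")

# Count each sequence and put the most common first
toy_counts <- sort(table(toy_reads), decreasing = TRUE)

# Total number of reads in the well
total_reads <- sum(toy_counts)

# Percent of reads for each sequence, rounded to two decimal places
percents <- round(as.numeric(toy_counts) / total_reads * 100, 2)

# Combine everything into one vector: total, allele 1, %, allele 2, %
summary_vec <- c(total_reads, names(toy_counts)[1], percents[1],
                 names(toy_counts)[2], percents[2])
print(summary_vec)
```

Note that mixing numbers and strings inside c() turns every element into a string, which is fine here because we only want to write the values to a text file.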

At the end we will merge the data from each file into a new matrix and write the full table to a file. Don't forget to add the name of the file as a column so we know which file the result came from!

We also want to round the percentages to two decimal places.  R has a built-in round() function, which you'll need to look up how to use (**?round** or look at https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/Round). It's important to know how to look up things that you don't know how to use, and to learn how to read the documentation for them.
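If it helps to see round() in action before reading the documentation, here is a small sketch using made-up read counts (the numbers are hypothetical and only chosen so that rounding changes the result).

```r
# Made-up read counts for the top two alleles, and a made-up total
allele_counts <- c(4555, 4328)
total_reads <- 9999

# Unrounded percentages have many decimal places
raw_percents <- allele_counts / total_reads * 100
print(raw_percents)

# The second argument of round() is the number of decimal places to keep
rounded_percents <- round(raw_percents, 2)
print(rounded_percents)

# If the second argument is left out, round() keeps zero decimal places
whole_percents <- round(raw_percents)
print(whole_percents)
```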

I'd recommend first just trying to get the existing genotype() function working here - just output the allele results and make sure you can do that.  Then, make genotype2() (and just copy in the code for genotype()) and modify it to add in each piece of information, one by one.  In other words, try to do things step-by-step, adding things in one-by-one, rather than doing everything at once - this will make it easier to troubleshoot because you're changing fewer things at a time.

<b>Here is what we are doing with the additional lines:</b> We initialize an empty matrix that we can add our output to. Then we use the rbind function to add the new row generated by the genotype2 function (why do we need to use the transpose **t** function here?). Finally, we write the results to our output file.
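One way to explore the transpose question is to check the dimensions at each step. The sketch below uses a small 3-column matrix and made-up result vectors (the file names and values are hypothetical), not the real 6-column output.

```r
# Start with an empty matrix: 0 rows, 3 columns
results <- matrix(nrow = 0, ncol = 3)

# A made-up result vector for one file
one_result <- c("well_1.fastq.gz", "WT", "98.5")

# matrix() on a plain vector makes a single column
col_dims <- dim(matrix(one_result))
print(col_dims)

# t() flips that column into a single row
row_dims <- dim(t(matrix(one_result)))
print(row_dims)

# Now each result can be stacked as a new row
results <- rbind(results, t(matrix(one_result)))
results <- rbind(results, t(matrix(c("well_2.fastq.gz", "bad", "35.1"))))
print(dim(results))
```

Compare the two printed shapes with the shape of the empty matrix and think about which one lines up with its columns.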

Feel free to play around with different things. What if you want to make the end file comma delimited (',') as opposed to tab delimited ('\t')?

#### Also, since we're outputting in a tab delimited text format (the two main formats are either tab separated (usually .txt or .tsv) or comma separated (.csv)), you should be able to open your resulting file in Excel and look at it there (or in any other text editor).
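To compare the two formats side by side, here is a minimal sketch that writes the same small matrix twice, once with each separator. The matrix contents are made up, and the files go to temporary paths from tempfile() so nothing in your working folder is overwritten.

```r
# A small made-up results matrix (2 wells, 4 columns)
demo_mat <- rbind(c("well_1.fastq.gz", "10000", "WT", "98.5"),
                  c("well_2.fastq.gz", "8500", "bad", "35.1"))

# Temporary output paths, one per format
tsv_path <- tempfile(fileext = ".tsv")
csv_path <- tempfile(fileext = ".csv")

# Same call both times; only the sep argument changes
write.table(demo_mat, file = tsv_path, quote = FALSE, sep = "\t",
            row.names = FALSE, col.names = FALSE)
write.table(demo_mat, file = csv_path, quote = FALSE, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Read the files back as plain text to see the difference
tsv_lines <- readLines(tsv_path)
csv_lines <- readLines(csv_path)
print(tsv_lines)
print(csv_lines)
```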

In [30]:
genotype2 <- function(read_counts, wt_reference) {
    # read_counts: table of sequence counts for one well
    # sort so the most common sequence is in row 1
    dat <- data.frame(sort(read_counts, decreasing = TRUE))
    dat[,1] <- as.character(dat[,1])
    totalReads <- sum(dat$Freq)
    dat$percent <- dat$Freq / totalReads
    # percent of reads for the top two alleles, rounded to two decimal places
    allele1Pct <- round(dat$percent[1] * 100, 2)
    allele2Pct <- round(dat$percent[2] * 100, 2)
    if (dat[1,1] == wt_reference && dat$percent[1] >= 0.8) {
        # homozygous WT: report a single percentage
        return(c(totalReads, "WT", allele1Pct, "WT", NA))
    } else if (dat$percent[1] < 0.4) {
        return(c(totalReads, "bad", allele1Pct, "bad", allele2Pct))
    } else if (dat$percent[2] < 0.4) {
        return(c(totalReads, "bad", allele1Pct, "bad", allele2Pct))
    } else {
        # two alleles: label the WT sequence, keep any mutant sequence as-is
        dat[,1][dat[,1] == wt_reference] <- "WT"
        return(c(totalReads, dat[1,1], allele1Pct, dat[2,1], allele2Pct))
    }
}

In [31]:
# change the start of this to match your own computer
output_file <- 'crispr_96_results.txt'
output_mat <- matrix(nrow = 0, ncol = 6)

for (fn in files) {
    wt_reference = 'GTCCAGCTGTGCAAGAGAATATTCCCGCTCTCCGGAGAAGCTCTTCCTTCCTTTGCACTGAAAGCTGTAACTCTAAGTATCAGTGTGAAACGGGAGAAAACAGTAAAGGCAACGTCCAGGATAGAGTGAAGCGACCCATGAACGCATTCA'
    
    # add code processing files here
    path_to_file <- paste0(path, fn)
    file_table <- process_file(path_to_file)
    file_results <- c(fn, genotype2(file_table, wt_reference))
    
    output_mat <- rbind(output_mat, t(matrix(file_results)))
}
write.table(output_mat, file = output_file, quote = FALSE, sep = '\t', row.names = FALSE, col.names = FALSE)

### Congratulations for making it to the end!!!

Comments: Feedback, suggestions, complaints...