# Microbial Genomics: Lab 4
## Topic: Bash, Genome Assembly & Comparison
#### Tools used: Fastp, FastQC, Minimap2, Samtools, SPAdes, Bandage

## Part A: Lab Exercises
### Exercise 1: FastQ Basics
The FastQ format stores raw DNA reads that come directly from a sequencer. These reads must be processed through a method called assembly before they can be compiled into a usable format, such as the `fasta` files we have dealt with so far. Unlike our `fasta` files, `fastq` files carry no context about where the reads came from or what organism they belong to; it's up to us to figure this out via assembly. Fortunately, there are many excellent tools to help us!

The anatomy of a `fastq` file differs from a `fasta` file in several key ways:
* In paired-end sequencing, each DNA fragment is read from both ends (i.e., forward and reverse), producing two sets of reads; these are usually contained in two separate files, but can sometimes be "interleaved" in a single file
* FastQ files are typically much larger than their FastA counterparts; this is because it takes many reads to create a single high-quality sequence
* Each read in a FastQ typically contains four lines; the first begins with an @ symbol and a sequence ID, followed by the sequence itself, a line with a + symbol, and finally, a line describing the quality of the read. For example:

`@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65`
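
The four-line layout described above can be read with a few lines of Python. This is only a minimal sketch: the function names `read_fastq` and `phred_scores` are made up for illustration, and it assumes every read occupies exactly four lines and that qualities use the common Phred+33 encoding.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a four-line-per-read FastQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # reached the end of the stream
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# The example record shown above, held in memory instead of a file.
example = StringIO(
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65\n"
)

records = list(read_fastq(example))
read_id, sequence, quality = records[0]
scores = phred_scores(quality)
print(read_id, len(sequence), min(scores), max(scores))
```

Each character on the quality line maps to one base of the sequence, so the two lines always have the same length.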

**Open the file ec_k12_mg1655_r1.fastq in a text editor and answer the following questions in comments below:**
1. Based on the sequence ID, what can you tell about the sequences in this file?
2. Pick a sequence and look at the quality line. Use [this page](https://en.wikipedia.org/wiki/FASTQ_format) to interpret the quality of the read. Is it generally high-quality?
3. Is this file interleaved, or does it contain only one read from each pair?

In [None]:
# Exercise 1

### Exercise 2: FastQ Coverage and QC
One reason FastQ files are much larger than the genomes they represent is that sequencing runs typically have very high "coverage" of the genome: for each base in the full genome, a FastQ file may contain upwards of 300 reads covering it (i.e., 300x coverage). For *E. coli*, which has a genome size of 4.6 Mb, that amounts to more than 1.3 _billion_ bases.
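
The arithmetic behind that figure is simply coverage = (number of reads × read length) / genome size. The short sketch below works through it; the read length of 150 is a hypothetical value chosen for illustration, not a property of our data.

```python
def estimated_coverage(num_reads, read_length, genome_size):
    """Estimated coverage = total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


genome_size = 4_600_000   # E. coli genome size from the text (4.6 Mb)
target_coverage = 300     # the 300x example from the text

# Total bases needed to reach the target coverage.
total_bases = genome_size * target_coverage
print(f"{total_bases:,} bases")

# Hypothetical read length, used only to show the reverse calculation.
read_length = 150
reads_needed = total_bases // read_length
print(f"{reads_needed:,} reads of length {read_length}")
print(estimated_coverage(reads_needed, read_length, genome_size))
```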

Because even modestly sized genomes require FastQ files that are several GB, we'll be working with plasmid sequences, which are smaller (~2000 bp). In this exercise, we'll be performing QC (quality control) for these sequences.

`fastp` is a general-purpose FastQ processing toolkit that includes QC, trimming, filtering, deduplication, and other useful functionality. For the most part, we will be concerned with QC (removing low-quality reads) and trimming (removing adapter sequences) in our pipelines. `fastqc` is another tool, used to generate readable QC reports.
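
To make the idea of read filtering concrete, here is a toy filter that discards reads with a low mean Phred score or too many `N` bases. It is a sketch of the concept only, not fastp's actual algorithm; the function names and threshold values are hypothetical, and Phred+33 encoding is assumed.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (assumes Phred+33 encoding)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(sequence, quality, min_mean_quality=25, max_n_bases=2):
    """Keep a read only if it is good enough on both criteria (hypothetical thresholds)."""
    if sequence.count("N") > max_n_bases:
        return False  # too many ambiguous bases
    return mean_phred(quality) >= min_mean_quality


# Toy reads as (sequence, quality) pairs: 'I' encodes Phred 40, '!' encodes Phred 0.
reads = [
    ("ACGT", "IIII"),  # high quality, no N bases
    ("ACGT", "!!!!"),  # very low quality
    ("NNNN", "IIII"),  # high quality but all bases are N
]

kept = [read for read in reads if passes_filter(*read)]
print(kept)
```

Real tools apply these checks per read pair, so that a failing read also removes its mate and the two output files stay in sync.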

In [18]:
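# run fastp on our fastq files: filter reads, trim the adapters, and generate a report
# note: `-n 25` sets fastp's N-base limit (reads with more than 25 N bases are discarded); a Phred quality cutoff is set with `-q` instead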
! fastp -n 25 -h lab4/fastp_report.html -i lab4/read_1.fastq.gz -I lab4/read_2.fastq.gz -o lab4/read_1.trimmed.fastq.gz -O lab4/read_2.trimmed.fastq.gz

# run fastQC on our reads and write the results to the lab4 folder
! fastqc -o lab4/ lab4/*.fastq.gz

Read1 before filtering:
total reads: 22720100
total bases: 2272010000
Q20 bases: 2173528270(95.6654%)
Q30 bases: 2069237662(91.0752%)

Read2 before filtering:
total reads: 22720100
total bases: 2317450200
Q20 bases: 2217453003(95.685%)
Q30 bases: 2099886322(90.6119%)

Read1 after filtering:
total reads: 21810827
total bases: 2180618868
Q20 bases: 2119614947(97.2024%)
Q30 bases: 2024715978(92.8505%)

Read2 aftering filtering:
total reads: 21810827
total bases: 2224204532
Q20 bases: 2170542617(97.5874%)
Q30 bases: 2063756635(92.7863%)

Filtering result:
reads passed filter: 43621654
reads failed due to low quality: 1809904
reads failed due to too many N: 8642
reads failed due to too short: 0
reads with adapter trimmed: 36808
bases trimmed due to adapters: 978926

Duplication rate: 3.53444%

Insert size peak (evaluated by paired-end reads): 156

JSON report: fastp.json
HTML report: lab4/fastp_report.html

fastp -n 25 -h lab4/fastp_report.html -i lab4/read_1.fastq.gz -I lab4/read_2.fastq.g

**Using the analysis performed above, answer the following questions:**
1. If we know that the genome that was sequenced has a length of 4.6 Mbp, and we used a sequencer with read length of 100, what is the estimated coverage for this fastq file?
2. Open one of the fastQC HTML files. Based on this [QC tutorial](https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/), are our reads high quality or not? Are there any graphs that are particularly interesting?
3. Open up fastp_report.html. How many reads did we lose by performing QC and trimming? Does this seem like a large number?
4. Are both reads approximately the same in terms of quality and resulting size? If not, what is different?

In [None]:
# Exercise 2

### Exercise 3: Using Command Line Tools and Reference Genomes
You may have noticed that we used a `!` above to run the `fastp` command. When we do this, we're actually running commands on the command line. As we use more tools throughout this course, you'll notice that although many of them are written in Python, they exist as standalone executables that need to be called via Bash. This is because most bioinformatics workflows are built by linking many different programs together, and creating such pipelines is a relatively natural process within a shell environment.

_If you haven't used Bash before, now might be a good time to review tutorials such as [this one](https://towardsdatascience.com/basics-of-bash-for-beginners-92e53a4c117a)._

From Jupyter, you can run any command available in your Conda environment by prefixing it with `!`. `samtools` is a common suite of tools for working with SAM (plain-text) and BAM (binary) files, which store reads aligned to a reference. Below, we'll use `samtools` to start exploring our `fastq` files and calculate coverage programmatically. The steps we'll take are:
1. Align our FastQ reads to a known reference genome
2. Correct any mate-pair issues that happened during alignment
3. Sort and index the resulting SAM file for fast downstream operations
4. Check the coverage of the resulting SAM file

*Note: this is a subset of a very common samtools workflow, which is covered in more depth [here](http://www.htslib.org/workflow/#fastq_to_bam)*.
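
Before running the real tools, it may help to see what "coverage" means once reads are aligned. The sketch below uses made-up alignment intervals on a 10-base toy reference and computes per-position depth, the percentage of positions covered, and the mean depth. The function name and data are hypothetical; `samtools coverage` reports comparable quantities in its `coverage` and `meandepth` columns, but this is not its implementation.

```python
def depth_per_position(reference_length, alignments):
    """Count how many alignments overlap each reference position.

    `alignments` is a list of (start, end) intervals, 0-based and half-open.
    """
    depth = [0] * reference_length
    for start, end in alignments:
        for pos in range(max(start, 0), min(end, reference_length)):
            depth[pos] += 1
    return depth


# Toy data: three reads of length 5 aligned to a reference of length 10.
alignments = [(0, 5), (2, 7), (4, 9)]
depth = depth_per_position(10, alignments)

covered = sum(1 for d in depth if d > 0)
breadth = 100 * covered / len(depth)   # percent of positions with at least one read
mean_depth = sum(depth) / len(depth)   # average number of reads per position

print(depth)
print(breadth, mean_depth)
```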

In [10]:
# run minimap to align our fastq reads to the reference genome
! minimap2 -a -x sr lab4/mg1655_ref.fasta lab4/*_1.fastq.gz lab4/*_2.fastq.gz -o lab4/mg1655.sam

# fix the mate-pairs that may have been affected by alignment
! samtools fixmate -O sam,level=1 lab4/mg1655.sam lab4/mg1655_fixmate.sam

# sort the SAM file into a BAM file; sorting is required for coverage calculation
! samtools sort -l 9 -o lab4/mg1655_sort.bam -T lab4/tmp/mg1655 lab4/mg1655_fixmate.sam

# calculate coverage and depth of the sorted BAM file
! samtools coverage -o lab4/coverage.txt lab4/mg1655_sort.bam

**Use the files generated in the above cells to answer the following questions:**
1. Open one of the generated SAM files. How is it different from the raw fastq files?
2. What's different between the SAM and BAM files?
3. Take a look at the coverage.txt file. What is it telling us? 
4. The `meandepth` column in coverage.txt is the average coverage calculated by samtools. How does it compare to our estimated coverage above?

*Note: you may want to delete some of the files generated above after you're done with these questions if you're running low on disk space!*

In [None]:
# Exercise 3

### Exercise 4: Piping & I/O
Above, we ran several lines, one at a time, in order to produce an output. This worked, but we ended up with several (large) intermediate files. Often, we don't need to use these files, so to get around having to write them out and then read them back in, we can use bash _piping_. Essentially, this just means we take the output from one command and feed it into the next command without ever writing it into a file. This will be useful when we begin building larger pipelines below.

Let's take the commands from Exercise 3 and put them into a single command. We'll use `%%bash` at the top of the cell to run the entire cell as a Bash script, instead of prefixing each line with `!`. We'll also use the `\` character to continue lines, for readability.

_Note: One important operator we're not using explicitly here is `>`, which is used to redirect output. Using `>` at any stage in this pipeline would direct the output of the preceding command to a file; for example, writing `cat lab4/mg1655_ref.fasta > lab4/mg1655_ref2.fasta` would take the contents of `mg1655_ref.fasta` and write them into `mg1655_ref2.fasta`. You may need to use `>` in your homework assignment, so read up on it in the Bash tutorial above if you feel the need._
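
For comparison, the same two ideas (piping with `|` and redirecting with `>`) can be expressed from Python with the standard `subprocess` module. This is an illustrative sketch only: the child processes are tiny Python one-liners standing in for real tools such as an aligner and a sorter, and the file name `sorted.txt` is hypothetical.

```python
import os
import subprocess
import sys
import tempfile

# Stage 1: a child process that writes three lines to stdout.
producer = subprocess.Popen(
    [sys.executable, "-c", "print('read_b'); print('read_a'); print('read_c')"],
    stdout=subprocess.PIPE,
)

# Stage 2: a child process that reads stdin, sorts the lines, and writes them to stdout.
# Connecting stdin to the producer's stdout is the equivalent of the shell's `|`.
sorter = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdout.writelines(sorted(sys.stdin))"],
    stdin=producer.stdout,
    stdout=subprocess.PIPE,
    text=True,
)
producer.stdout.close()  # the sorter now owns the read end of the pipe
sorted_output, _ = sorter.communicate()
producer.wait()
print(sorted_output)

# Equivalent of `>`: send a command's stdout to a file instead of the screen.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "sorted.txt")
    with open(out_path, "w") as out_file:
        subprocess.run([sys.executable, "-c", "print('hello')"], stdout=out_file, check=True)
    with open(out_path) as in_file:
        redirected = in_file.read()
print(redirected)
```

No intermediate file is written between the two stages; the data flows directly from one process to the next, just as in the Bash pipeline.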

In [14]:
%%bash
minimap2 -a -x sr lab4/mg1655_ref.fasta lab4/*_1.fastq.gz lab4/*_2.fastq.gz  | \
samtools fixmate -u -m - - | \
samtools sort -u -T lab4/tmp/mg1655 - | \
samtools coverage -o lab4/coverage_piped.txt -

[M::mm_idx_gen::0.161*1.02] collected minimizers
[M::mm_idx_gen::0.182*1.25] sorted minimizers
[M::main::0.182*1.25] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.182*1.25] mid_occ = 1000
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.191*1.24] distinct minimizers: 757222 (99.27% are singletons); average occurrences: 1.021; average spacing: 6.004; total length: 4639675
[M::worker_pipeline::3.637*3.17] mapped 495050 sequences
[M::worker_pipeline::5.515*3.41] mapped 495050 sequences
[M::worker_pipeline::7.468*3.54] mapped 495050 sequences
[M::worker_pipeline::9.590*3.62] mapped 495050 sequences
[M::worker_pipeline::11.621*3.67] mapped 495050 sequences
[M::worker_pipeline::19.672*3.04] mapped 495050 sequences
[M::worker_pipeline::20.349*3.01] mapped 495050 sequences
[M::worker_pipeline::21.095*3.03] mapped 495050 sequences
[M::worker_pipeline::23.222*3.13] mapped 495050 sequences
[M::worker_pipeline::25.087*3.19] mapped 495050 seq

**Based on the code and results above, answer the following questions:**
1. What does the `-` character do in the code above?
2. What files were created in this workflow, compared to the one in Exercise 3?
3. Are there any potential issues with this type of piped workflow?

In [None]:
# Exercise 4

### Exercise 5: De Novo Assembly
Now that we have some familiarity with Bash, let's assemble our FastQ files! De novo assembly is used if a reference is unknown or nonexistent; it attempts to build contigs from overlapping reads and joins these contigs into scaffolds that approximate the full genome. When the reference genome is known, we can instead perform a different type of assembly, known as mapping assembly or alignment. `minimap2`, which we used above, is one aligner that can perform this operation, but many other alignment programs exist; we'll cover those in next week's lab.

For this laboratory, we'll use `spades` for de novo assembly. Others exist, but SPAdes is well-known and used often for short-read bacterial genome assembly. Note that **although we know the reference genome for our reads, we will pretend we do not for this exercise**; this is the basis of de novo assemblers. Although some include options for "reference-guided" assembly, they operate largely on the basis that we know very little about our input reads.
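
To give a feel for how a de novo assembler works without a reference, the toy example below breaks reads into k-mers, links them in a De Bruijn graph, and walks the graph to rebuild a sequence. It is a deliberately simplified sketch, not how SPAdes is implemented: the reads are made up, error-free, and repeat-free, and the function names are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the (k-1)-mer suffixes that follow it."""
    graph = defaultdict(list)
    seen_kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen_kmers:  # each distinct k-mer becomes one edge
                seen_kmers.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def walk_unbranched(graph):
    """Follow the single unbranched path through the graph.

    Only valid for tiny, repeat-free examples like the one below.
    """
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting node")
    node = starts[0]
    contig = node
    visited = {node}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:  # guard against cycles
            break
        visited.add(node)
        contig += node[-1]  # each step adds one new base
    return contig


# Made-up overlapping reads, given in arbitrary order.
reads = ["CGTGCAA", "ATGGCGT", "TGCAATC", "GGCGTGC"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph)
print(contig)
```

Real data contain sequencing errors and repeats, which create branches and bubbles in the graph; much of an assembler's work goes into simplifying those, as the "Simplification" stages in the SPAdes log suggest.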

In [20]:
! spades.py -1 lab4/read_1.fastq.gz -2 lab4/read_2.fastq.gz -o lab4/spades_output

Command line: /Users/gregory.wood/anaconda3/bin/spades.py	-1	/Users/gregory.wood/Downloads/jupyter_notebooks/lab4/read_1.fastq.gz	-2	/Users/gregory.wood/Downloads/jupyter_notebooks/lab4/read_2.fastq.gz	-o	/Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output	

System information:
  SPAdes version: 3.12.0
  Python version: 3.8.8
  OS: macOS-10.15.7-x86_64-i386-64bit

Output dir: /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output
Mode: read error correction and assembling
Debug mode is turned OFF

Dataset parameters:
  Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/Users/gregory.wood/Downloads/jupyter_notebooks/lab4/read_1.fastq.gz']
      right reads: ['/Users/gregory.wood/Downloads/jupyter_notebooks/lab4/read_2.fastq.gz']
      interlaced reads: not spe

  0:21:51.554     3G / 7G    INFO    General                 (main.cpp                  : 197)   Starting solid k-mers expansion in 12 threads.
  0:27:40.069     3G / 7G    INFO    General                 (main.cpp                  : 218)   Solid k-mers iteration 0 produced 687948 new k-mers.
  0:32:57.375     3G / 7G    INFO    General                 (main.cpp                  : 218)   Solid k-mers iteration 1 produced 25417 new k-mers.
  0:38:03.865     3G / 7G    INFO    General                 (main.cpp                  : 218)   Solid k-mers iteration 2 produced 361 new k-mers.
  0:43:15.719     3G / 7G    INFO    General                 (main.cpp                  : 218)   Solid k-mers iteration 3 produced 0 new k-mers.
  0:43:15.719     3G / 7G    INFO    General                 (main.cpp                  : 222)   Solid k-mers finalized
  0:43:15.719     3G / 7G    INFO    General                 (hammer_tools.cpp          : 220)   Starting read correction in 12 threads.
  0:43:1

  0:51:06.469    48M / 7G    INFO    General                 (main.cpp                  : 255)   Saving corrected dataset description to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output/corrected/corrected.yaml
  0:51:06.475    48M / 7G    INFO    General                 (main.cpp                  : 262)   All done. Exiting.

== Compressing corrected reads (with gzip)

== Dataset description file was created: /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output/corrected/corrected.yaml


===== Read error correction finished. 


===== Assembling started.


== Running assembler: K21

  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  74)   Loaded config from /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output/K21/configs/config.info
  0:00:00.000     4M / 4M    INFO    General                 (memory_limit.cpp          :  49)   Memory limit set to 250 Gb
  0:00:00.000     4M / 4M    INFO    General    

  0:06:10.547    48M / 7G    INFO    General                 (kmer_index_builder.hpp    : 127)   K-mer counting done. There are 16977717 kmers in total.
  0:06:10.547    48M / 7G    INFO    General                 (kmer_index_builder.hpp    : 133)   Merging temporary buckets.
  0:06:10.683    48M / 7G    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 314)   Building perfect hash indices
  0:06:11.129    48M / 7G    INFO    General                 (kmer_index_builder.hpp    : 150)   Merging final buckets.
  0:06:11.227    48M / 7G    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 336)   Index built. Total 7880952 bytes occupied (3.71355 bits per kmer).
  0:06:11.251    68M / 7G    INFO   DeBruijnExtensionIndexBu (kmer_extension_index_build:  99)   Building k-mer extensions from k+1-mers
  0:06:12.286    68M / 7G    INFO   DeBruijnExtensionIndexBu (kmer_extension_index_build: 103)   Building k-mer extensions from k+1-mers finished.
  0:06:12.286    68M / 7G   

  0:08:08.893    92M / 7G    INFO   Simplification           (parallel_processing.hpp   : 167)   Bulge remover triggered 1045 times
  0:08:08.893    92M / 7G    INFO   Simplification           (parallel_processing.hpp   : 165)   Running Low coverage edge remover
  0:08:08.920    72M / 7G    INFO   Simplification           (parallel_processing.hpp   : 167)   Low coverage edge remover triggered 815 times
  0:08:08.920    72M / 7G    INFO    General                 (simplification.cpp        : 362)   PROCEDURE == Simplification cycle, iteration 3
  0:08:08.920    72M / 7G    INFO   Simplification           (parallel_processing.hpp   : 165)   Running Tip clipper
  0:08:08.921    72M / 7G    INFO   Simplification           (parallel_processing.hpp   : 167)   Tip clipper triggered 8 times
  0:08:08.921    72M / 7G    INFO   Simplification           (parallel_processing.hpp   : 165)   Running Bulge remover
  0:08:08.948    60M / 7G    INFO   Simplification           (parallel_processing.hpp  

  0:08:09.047    56M / 7G    INFO    General                 (contig_output.hpp         :  22)   Outputting contigs to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K21/before_rr.fasta
  0:08:09.127    56M / 7G    INFO    General                 (contig_output_stage.cpp   :  51)   Outputting FastG graph to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K21/assembly_graph.fastg
  0:08:09.369    56M / 7G    INFO    General                 (contig_output.hpp         :  22)   Outputting contigs to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K21/simplified_contigs.fasta
  0:08:09.452    56M / 7G    INFO    General                 (contig_output.hpp         :  22)   Outputting contigs to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K21/final_contigs.fasta
  0:08:09.536    56M / 7G    INFO   StageManager             (stage.cpp                 : 132)   STAGE == Contig Output
  0:08:09.536    56M / 7G    I

  0:01:30.171    48M / 7G    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 336)   Index built. Total 8200928 bytes occupied (3.71334 bits per kmer).
  0:01:30.196    68M / 7G    INFO   DeBruijnExtensionIndexBu (kmer_extension_index_build:  99)   Building k-mer extensions from k+1-mers
  0:01:31.377    68M / 7G    INFO   DeBruijnExtensionIndexBu (kmer_extension_index_build: 103)   Building k-mer extensions from k+1-mers finished.
  0:01:31.377    68M / 7G    INFO    General                 (stage.cpp                 : 101)   PROCEDURE == Early tip clipping
  0:01:31.377    68M / 7G    INFO    General                 (construction.cpp          : 253)   Early tip clipper length bound set as (RL - K)
  0:01:31.378    68M / 7G    INFO   Early tip clipping       (early_simplification.hpp  : 181)   Early tip clipping
  0:01:34.881    68M / 7G    INFO   Early tip clipping       (early_simplification.hpp  : 184)   11684834 34-mers were removed by early tip clipper
  0:01:34.881  

  0:03:22.116    56M / 7G    INFO   Simplification           (parallel_processing.hpp   : 167)   Bulge remover triggered 153 times
  0:03:22.116    56M / 7G    INFO   Simplification           (parallel_processing.hpp   : 165)   Running Low coverage edge remover
  0:03:22.120    56M / 7G    INFO   Simplification           (parallel_processing.hpp   : 167)   Low coverage edge remover triggered 56 times
  0:03:22.120    56M / 7G    INFO    General                 (simplification.cpp        : 362)   PROCEDURE == Simplification cycle, iteration 4
  0:03:22.120    56M / 7G    INFO   Simplification           (parallel_processing.hpp   : 165)   Running Tip clipper
  0:03:22.121    56M / 7G    INFO   Simplification           (parallel_processing.hpp   : 167)   Tip clipper triggered 0 times
  0:03:22.121    56M / 7G    INFO   Simplification           (parallel_processing.hpp   : 165)   Running Bulge remover
  0:03:22.124    56M / 7G    INFO   Simplification           (parallel_processing.

  0:03:22.180    52M / 7G    INFO    General                 (contig_output.hpp         :  22)   Outputting contigs to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K33/before_rr.fasta
  0:03:22.252    52M / 7G    INFO    General                 (contig_output_stage.cpp   :  51)   Outputting FastG graph to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K33/assembly_graph.fastg
  0:03:22.447    56M / 7G    INFO    General                 (contig_output.hpp         :  22)   Outputting contigs to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K33/simplified_contigs.fasta
  0:03:22.516    52M / 7G    INFO    General                 (contig_output.hpp         :  22)   Outputting contigs to /Users/gregory.wood/Downloads/jupyter_notebooks/lab4/spades_output//K33/final_contigs.fasta
  0:03:22.588    52M / 7G    INFO   StageManager             (stage.cpp                 : 132)   STAGE == Contig Output
  0:03:22.588    52M / 7G    I

  0:01:09.082    68M / 7G    INFO   DeBruijnExtensionIndexBu (kmer_extension_index_build: 103)   Building k-mer extensions from k+1-mers finished.
  0:01:09.082    68M / 7G    INFO    General                 (stage.cpp                 : 101)   PROCEDURE == Condensing graph
  0:01:09.329    68M / 7G    INFO   UnbranchingPathExtractor (debruijn_graph_constructor: 355)   Extracting unbranching paths
  0:01:13.148   908M / 7G    INFO   UnbranchingPathExtractor (debruijn_graph_constructor: 374)   Extracting unbranching paths finished. 13546988 sequences extracted
  0:01:15.345   908M / 7G    INFO   UnbranchingPathExtractor (debruijn_graph_constructor: 310)   Collecting perfect loops
  0:01:15.936   908M / 7G    INFO   UnbranchingPathExtractor (debruijn_graph_constructor: 343)   Collecting perfect loops finished. 0 loops collected
  0:01:22.696     4G / 7G    INFO    General                 (stage.cpp                 : 101)   PROCEDURE == Filling coverage indices (PHM)
  0:01:22.696     4G /

  0:09:46.380   892M / 10G   INFO   Simplification           (parallel_processing.hpp   : 167)   Initial ec remover triggered 1562 times
  0:09:46.380   892M / 10G   INFO   Simplification           (parallel_processing.hpp   : 165)   Running Initial isolated edge remover
  0:09:46.923   420M / 10G   INFO   Simplification           (parallel_processing.hpp   : 167)   Initial isolated edge remover triggered 21137 times
  0:09:47.005   344M / 10G   INFO   StageManager             (stage.cpp                 : 132)   STAGE == Simplification
  0:09:47.005   344M / 10G   INFO    General                 (simplification.cpp        : 357)   Graph simplification started
  0:09:47.005   344M / 10G   INFO    General                 (graph_simplification.hpp  : 634)   Creating parallel br instance
  0:09:47.005   344M / 10G   INFO    General                 (simplification.cpp        : 362)   PROCEDURE == Simplification cycle, iteration 1
  0:09:47.006   344M / 10G   INFO   Simplification           

  0:09:48.889     6G / 10G   INFO    General                 (edge_index_builders.hpp   :  82)   Used 1012 sequences.
  0:09:49.059   132M / 10G   INFO    General                 (kmer_index_builder.hpp    : 120)   Starting k-mer counting.
  0:09:49.311   128M / 10G   INFO    General                 (kmer_index_builder.hpp    : 127)   K-mer counting done. There are 4557054 kmers in total.
  0:09:49.311   128M / 10G   INFO    General                 (kmer_index_builder.hpp    : 133)   Merging temporary buckets.
  0:09:49.467   128M / 10G   INFO   K-mer Index Building     (kmer_index_builder.hpp    : 314)   Building perfect hash indices
  0:09:49.640   132M / 10G   INFO    General                 (kmer_index_builder.hpp    : 150)   Merging final buckets.
  0:09:49.711   132M / 10G   INFO   K-mer Index Building     (kmer_index_builder.hpp    : 336)   Index built. Total 2121304 bytes occupied (3.72399 bits per kmer).
  0:09:49.771   240M / 10G   INFO    General                 (edge_index_

  0:11:25.732   444M / 10G   INFO    General                 (sequence_mapper_notifier.h:  80)   Processed 16800000 reads
  0:11:31.701   444M / 10G   INFO    General                 (sequence_mapper_notifier.h:  98)   Total 22443574 reads processed
  0:11:31.747   444M / 10G   INFO    General                 (pair_info_count.cpp       : 209)   Edge pairs: 67108864 (rough upper limit)
  0:11:31.747   444M / 10G   INFO    General                 (pair_info_count.cpp       : 213)   10765739 paired reads (47.968% of all) aligned to long edges
  0:11:31.760   236M / 10G   INFO    General                 (pair_info_count.cpp       : 354)     Insert size = 505.196, deviation = 27.4939, left quantile = 472, right quantile = 544, read length = 102
  0:11:31.859   428M / 10G   INFO    General                 (pair_info_count.cpp       : 371)   Filtering data for library #0
  0:11:31.861   428M / 10G   INFO    General                 (pair_info_count.cpp       :  39)   Selecting usual mapper
  0

  0:12:21.723   240M / 10G   INFO    General                 (path_extender.hpp         : 883)   Processed 128 paths from 458 (27%)
  0:12:21.736   244M / 10G   INFO    General                 (path_extender.hpp         : 885)   Processed 138 paths from 458 (30%)
  0:12:21.739   244M / 10G   INFO    General                 (path_extender.hpp         : 885)   Processed 184 paths from 458 (40%)
  0:12:21.740   244M / 10G   INFO    General                 (path_extender.hpp         : 885)   Processed 230 paths from 458 (50%)
  0:12:21.740   244M / 10G   INFO    General                 (path_extender.hpp         : 883)   Processed 256 paths from 458 (55%)
  0:12:21.741   244M / 10G   INFO    General                 (path_extender.hpp         : 885)   Processed 276 paths from 458 (60%)
  0:12:21.744   244M / 10G   INFO    General                 (path_extender.hpp         : 885)   Processed 322 paths from 458 (70%)
  0:12:21.744   244M / 10G   INFO    General                 (path_extender.

That's all it takes! As you can see, SPAdes has a fairly simple command line interface. 

**Use the output from the above command to answer the following questions:**
1. Take a look at the [SPAdes documentation](https://github.com/ablab/spades); do you see any program options that would let us include a reference genome, if we suspected it might be related to our reads?
2. Open the log file found at `lab4/spades_output/spades.log`. This log file contains all the program output from our assembly. What is the k-mer size that was used by SPAdes to create the final assembly?
3. Our final assembly is contained in `scaffolds.fasta`. What size is this file, compared to `reference.fasta`, the actual reference genome? Based on the coverage we calculated earlier, is this about right? 
4. Align `scaffolds.fasta` to `reference.fasta` and view it in Jalview. Do you see any obvious places where our assembly is wrong compared to the reference?

In [None]:
# Exercise 5

## Part B: Homework

#### Question 1: The main objective of this homework will be to build your own basic assembly pipeline using the tools we've discussed in this lab. You'll be assembling some reads from an unknown organism using de novo assembly, and then answering some questions about the organism based on the result. Your program can be written in an external shell script, or in Jupyter, but must complete the following steps:
1. Perform QC and trimming on the raw FastQ reads, discarding any reads that fail to meet a quality threshold of 20
2. Calculate the coverage of your resulting trimmed reads. If the coverage is less than 40x, return an error
3. Assemble the trimmed reads using SPAdes
4. Clean up any temporary files, but make sure to keep the SPAdes scaffold (`scaffolds.fasta`), the fastp/fastqc quality report, and the assembly graph (`assembly_graph.fastg`)

If you choose to write your script externally, you should still run it in the cell below. Make sure all results are placed into the lab4/results/ folder at the end of your code.

In [None]:
# Question 1

#### Question 2: Pipelines are often iterative, meaning that we may need to perform multiple runs after changing some parameters. Open the FastP or FastQC report that you generated above. Are the reads that you used for steps 2-4 high-quality? Comment on any steps you would consider taking to refine the reads further in the cell below.

In [None]:
# Question 2

#### Question 3: Run BLAST on either the command line or using the Web UI to search for the closest match to your assembled `scaffolds.fasta` file. Use the appropriate NCBI databases and download the top hit. Answer the following questions using this data:
* What types of hits showed up in your BLAST search? Were all of the organisms from the same species?
* How good was the top hit that you downloaded? If you think the match was good, describe why. If not, describe what you think went wrong- does a better hit exist in a different database? Was the assembly bad?
* Perform a multiple alignment of your scaffold file to the downloaded sequence. Describe the results- does the alignment match up with your answer to the previous question?

In [None]:
# Question 3

#### Question 4: As we discussed earlier, de novo assembly is generally difficult to evaluate. One way to judge how "good" our assembly is, is to inspect the De Bruijn graph. If our assembly is good, it will have a single contig, and the graph will be a fully connected "circle". If not, there will be (potentially many) disconnected or discontiguous loops. (*Note: a bad assembly can also have a "good" graph, but a good assembly will never have a bad graph!*)

We'll use the `bandage` tool to view our graph. Go to the [Bandage GitHub page](https://rrwick.github.io/Bandage/) and download/install the tool. Once it is installed, run it and open up the `assembly_graph.fastg` file. Describe the graph you see below; is it contiguous? Are there extra "pieces" of the genome outside the main graph? Based on your knowledge, is our assembly of good quality?
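
The notion of a "contiguous" graph can also be checked programmatically by counting connected components. The sketch below does this for a made-up list of links between contigs; it does not parse `assembly_graph.fastg`, and the function name and contig labels are hypothetical.

```python
from collections import defaultdict, deque


def connected_components(edges, nodes=()):
    """Group nodes into connected components using breadth-first search."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    for node in nodes:
        adjacency[node]  # make sure isolated nodes are included

    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        queue = deque([start])
        seen.add(start)
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components


# A single closed loop of three contigs: one component.
circular = [("c1", "c2"), ("c2", "c3"), ("c3", "c1")]
print(len(connected_components(circular)))

# The same loop plus a separate pair and an unconnected contig: three components.
fragmented = circular + [("c4", "c5")]
print(len(connected_components(fragmented, nodes=["c6"])))
```

A single component is only a necessary sign of a contiguous assembly, not proof of a correct one, which matches the note in the question above.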

In [None]:
# Question 4