## Methods:

**Source of Data**

Bacterial sequencing samples were obtained from the paths specified in `params.bac_samples` in the Nextflow configuration file. Each sample contained both long-read (Nanopore) and short-read (Illumina) sequencing data.

**Quality Control**

Initial quality control of the short reads was performed using FastQC to identify potential issues in base quality, adapter content, or sequence duplication.

**Long-read Filtering and Assembly**

Long-read sequences were filtered for length and quality using Filtlong. Filtered reads were assembled into draft genomes using Flye, producing unpolished assemblies.
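To make the idea of filtering on length and quality concrete, here is a minimal sketch of a threshold-based read filter. It is not the scoring scheme the filtering tool itself uses; the function names, the toy reads, and the thresholds (`min_length=8`, `min_mean_q=10`) are hypothetical and chosen only so the example is easy to follow.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a FASTQ quality string."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(sequence: str, quality: str, min_length: int, min_mean_q: float) -> bool:
    """Keep a read only if it passes both the length and the quality threshold."""
    return len(sequence) >= min_length and mean_phred(quality) >= min_mean_q


# Toy reads (hypothetical): (name, sequence, quality string)
reads = [
    ("read_a", "ACGTACGTAC", "IIIIIIIIII"),  # 10 bases, 'I' encodes Phred 40
    ("read_b", "ACGT", "IIII"),              # good quality but too short
    ("read_c", "ACGTACGTAC", "##########"),  # long enough, '#' encodes Phred 2
]

kept = [
    name
    for name, seq, qual in reads
    if keep_read(seq, qual, min_length=8, min_mean_q=10)
]
print(kept)
```

Only `read_a` passes both thresholds here. Real long-read data would call for far larger length cutoffs than this toy example.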

**Short-read Alignment and Polishing**

Short reads were aligned to the draft assemblies using Bowtie2. The alignments were sorted with SAMtools and then used by Pilon to polish the draft assemblies, producing the polished genomes.
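The order of the alignment and polishing steps can be sketched as a list of commands. This is an illustration under stated assumptions, not the pipeline's actual Nextflow processes: it assumes paired-end reads, a `pilon` wrapper on the `PATH`, and hypothetical file names. The indexing step is included because Pilon expects a sorted, indexed BAM file. The commands are only built and printed, not executed.

```python
import shlex


def polishing_commands(assembly: str, reads_1: str, reads_2: str, prefix: str) -> list[list[str]]:
    """Build the align -> sort -> index -> polish commands for one sample."""
    index = f"{prefix}_index"      # Bowtie2 index basename (hypothetical)
    sam = f"{prefix}.sam"          # unsorted alignments
    bam = f"{prefix}.sorted.bam"   # coordinate-sorted alignments
    return [
        ["bowtie2-build", assembly, index],
        ["bowtie2", "-x", index, "-1", reads_1, "-2", reads_2, "-S", sam],
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        ["pilon", "--genome", assembly, "--frags", bam, "--output", f"{prefix}_polished"],
    ]


# Hypothetical input names, used only for illustration
commands = polishing_commands(
    "draft.fasta", "reads_1.fastq.gz", "reads_2.fastq.gz", "sample1"
)
for command in commands:
    print(shlex.join(command))
```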

**Genome Annotation and Assessment**

Polished assemblies were annotated with Prokka to identify coding sequences, tRNAs, rRNAs, and other genomic features. Genome completeness was assessed using BUSCO, and BUSCO_PLOT was used to visualize the completeness results. The canonical reference genome was downloaded from NCBI using the NCBI Datasets CLI. The polished assemblies were compared to the reference using QUAST, and QUAST was also run on the unpolished assemblies to evaluate the improvement from polishing.



## Results:

**Circos Plot Visualization**

A Circos plot was generated from the GFF file produced by Prokka using the Python package PyCirclize, run in a conda environment with VSCode and Jupyter Notebooks. The plot displays genomic features along the circular bacterial genome: forward coding sequences (CDS) in red, reverse CDS in blue, rRNAs in green, and tRNAs in magenta. The outer axis carries major and minor tick marks with positions labeled in kilobases, and a legend identifies the feature types. Although the plot is dense because of the high number of features, it confirms that the Prokka annotation can be visualized and checked with the installed software. The Circos plot is shown below.

<img src="../results/circos_plot.png" width="700">
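The four tracks in the plot correspond to groups of features in the Prokka GFF file. The sketch below shows one way such features could be grouped by type and strand using only the standard library; it is not the plotting code used for the figure. The function name and the toy GFF records are hypothetical, while the color mapping follows the description above.

```python
from collections import Counter

# Track colors as described for the plot
TRACK_COLORS = {
    "forward CDS": "red",
    "reverse CDS": "blue",
    "rRNA": "green",
    "tRNA": "magenta",
}


def count_features(gff_text: str) -> Counter:
    """Count CDS (split by strand), rRNA, and tRNA records in GFF3 text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more feature rows
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) != 9:
            continue  # not a well-formed GFF3 feature row
        feature_type, strand = fields[2], fields[6]
        if feature_type == "CDS":
            counts["forward CDS" if strand == "+" else "reverse CDS"] += 1
        elif feature_type in ("rRNA", "tRNA"):
            counts[feature_type] += 1
    return counts


# Toy records (hypothetical), standing in for the real annotation file
example_gff = "\n".join([
    "##gff-version 3",
    "contig_1\ttoy\tCDS\t100\t400\t.\t+\t0\tID=toy_00001",
    "contig_1\ttoy\tCDS\t500\t900\t.\t-\t0\tID=toy_00002",
    "contig_1\ttoy\ttRNA\t1000\t1075\t.\t+\t.\tID=toy_00003",
    "contig_1\ttoy\trRNA\t1200\t2700\t.\t-\t.\tID=toy_00004",
    "##FASTA",
    ">contig_1",
    "ACGT",
])

feature_counts = count_features(example_gff)
for track, color in TRACK_COLORS.items():
    print(f"{track}: {feature_counts[track]} feature(s), drawn in {color}")
```

Counting features per track before plotting is a quick way to check that the annotation was parsed as expected, which is useful when the plot itself is too dense to read individual features.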

**Assembly Metrics from QUAST**

The polished assembly was evaluated with QUAST and produced the following metrics: a genome fraction of 98.429%, a duplication ratio of 1.001, 19 misassemblies, 18.18 mismatches per 100 kbp, a total assembly length of 4,627,920 bp, and a GC content of 44.70%. The unpolished assembly showed nearly identical metrics, with a slightly higher mismatch rate of 18.33 per 100 kbp, indicating that polishing had a minor effect on overall assembly statistics but may have improved local base accuracy.
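To put the change in mismatch rate into perspective, the rates can be converted into rough absolute counts. This is a back-of-the-envelope sketch: QUAST normalizes mismatches by aligned bases, so using the total assembly length for both assemblies is an assumption made here for illustration, and the function name is hypothetical.

```python
def mismatches_from_rate(rate_per_100kbp: float, length_bp: int) -> float:
    """Convert a per-100-kbp mismatch rate into an approximate absolute count."""
    return rate_per_100kbp * length_bp / 100_000


ASSEMBLY_LENGTH_BP = 4_627_920  # total length reported for the polished assembly

polished = mismatches_from_rate(18.18, ASSEMBLY_LENGTH_BP)
unpolished = mismatches_from_rate(18.33, ASSEMBLY_LENGTH_BP)
difference = unpolished - polished
relative_drop = (18.33 - 18.18) / 18.33

print(f"polished:   ~{polished:.0f} mismatches")
print(f"unpolished: ~{unpolished:.0f} mismatches")
print(f"difference: ~{difference:.1f} mismatches ({relative_drop:.1%} lower rate)")
```

Under these assumptions the difference amounts to only a handful of mismatches across the whole genome, which is consistent with polishing having a minor effect on the summary statistics.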

**Genome Completeness from BUSCO**

BUSCO analysis of the polished assembly returned the following summary: C:99.6% [S:99.5%, D:0.1%], F:0.0%, M:0.4%, n:1828. BUSCO evaluates genome completeness by searching for highly conserved single-copy orthologs expected in the organism's lineage. In this string, C is the percentage of complete BUSCOs, S the percentage that are complete and single-copy, D the percentage that are complete and duplicated, F the percentage that are fragmented, M the percentage that are missing, and n the total number of BUSCOs searched. These results indicate that the assembly is nearly complete, with almost all expected genes present and very few missing or duplicated. The BUSCO summary plot generated from these results is shown below.

<img src="../results/busco_plot.png" width="700">
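The BUSCO summary string can also be read programmatically. The sketch below parses the string reported above with a regular expression and checks that the categories add up; the function name is hypothetical, and the tolerance allows for the one-decimal rounding in the reported percentages.

```python
import re


def parse_busco_summary(summary: str) -> dict:
    """Parse a BUSCO one-line summary into percentages (floats) and n (int)."""
    pattern = re.compile(
        r"C:(?P<C>[\d.]+)%\s*\[S:(?P<S>[\d.]+)%,\s*D:(?P<D>[\d.]+)%\],\s*"
        r"F:(?P<F>[\d.]+)%,\s*M:(?P<M>[\d.]+)%,\s*n:(?P<n>\d+)"
    )
    match = pattern.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    result = {key: float(match.group(key)) for key in "CSDFM"}
    result["n"] = int(match.group("n"))
    return result


summary = "C:99.6% [S:99.5%, D:0.1%], F:0.0%, M:0.4%, n:1828"
scores = parse_busco_summary(summary)

# Consistency checks: S + D should equal C, and C + F + M should equal 100
assert abs(scores["S"] + scores["D"] - scores["C"]) < 0.11
assert abs(scores["C"] + scores["F"] + scores["M"] - 100.0) < 0.11

print(scores)
```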

**Assembly Quality Assessment**

Overall, the assembly workflow produced a high-quality bacterial genome. The polished assembly covers 98.429% of the reference with a duplication ratio of 1.001, and its BUSCO completeness of 99.6% indicates that nearly all of the expected gene content is present. The number of misassemblies and the mismatches per 100 kbp remained similar before and after polishing, so the effect of Pilon polishing on these summary statistics was small. Taken together, these metrics indicate that combining Flye assembly with Pilon polishing and Prokka annotation yielded an assembly that closely matches the reference genome.