# MAG Analysis

Data from the [Rothman study](https://doi.org/10.1128/AEM.01448-21) corresponding to (unenriched) samples from the [HTP site](https://en.wikipedia.org/wiki/Hyperion_sewage_treatment_plant) were downloaded from the [ENA](https://www.ebi.ac.uk/ena/browser/view/prjna729801) and processed using the [nf-core/mag](https://github.com/PhilPalmer/mag/tree/genomad) pipeline to generate assembled and binned metagenomes (see the workflow image below). Here, we analyse the results generated by the pipeline. First, we load and aggregate the pipeline results, explaining the steps along the way. Then, we use the aggregated results to answer some key questions about the data, such as which organisms are present in the samples and how the taxonomic classifications generated from the reads compare with those generated from the assembled genomes.

<img src="https://raw.githubusercontent.com/nf-core/mag/master/docs/images/mag_workflow.png" width="1000">


In [None]:
import matplotlib.pyplot as plt
import os
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import shutil
import yaml

from Bio import SeqIO
from matplotlib import rcParams
from matplotlib.colors import LogNorm
# from matplotlib import colors as mcolors

In [None]:
########
# Change
########
# TODO: Upload directory to AWS S3 and add code to download the data if needed
data_dir = '../mag/results_rothman_htp'
ribodetector_csv = f'{data_dir}/QC_shortreads/RiboDetector/summary.csv'
samples_metadata = 'https://github.com/naobservatory/kmer-egd/raw/main/rothman.unenriched.simple' 
# samples_metadata = 'https://github.com/naobservatory/illumina_pilot/raw/main/metadata/microbio_wetlab_label_match.csv'
rothman = False
if samples_metadata and 'rothman' in samples_metadata.lower():
    rothman = True

##############
# Don't change
##############

# QC
fastqc_yaml = f'{data_dir}/multiqc/multiqc_data/multiqc_fastqc.yaml'

# Assembly data
bam_dir = f'{data_dir}/Assembly/SPAdes/QC/group-HTP'

# Genome Binning data
binner = 'MaxBin2'
busco_tsv = f'{data_dir}/GenomeBinning/QC/busco_summary.tsv'
quast_tsv = f'{data_dir}/GenomeBinning/QC/quast_summary.tsv'
bin_tsv = f'{data_dir}/GenomeBinning/bin_summary.tsv'
bin_dir = f'{data_dir}/GenomeBinning/{binner}'

# Taxonomy data
kraken_dir = f'{data_dir}/Taxonomy/kraken2'
genomad_dir = f'{data_dir}/Taxonomy/geNomad'
genomad_class = f'{genomad_dir}/group-HTP/SPAdes-group-HTP_scaffolds_aggregated_classification/SPAdes-group-HTP_scaffolds_aggregated_classification.tsv'
genomad_tax = f'{genomad_dir}/group-HTP/SPAdes-group-HTP_scaffolds_annotate/SPAdes-group-HTP_scaffolds_taxonomy.tsv'
gtdbtk_tsv = f'{data_dir}/Taxonomy/GTDB-Tk/gtdbtk_summary.tsv'

In [None]:
# Define helper functions
def decompress_gz(gz_file):
    """
    Decompress a gzipped file.
    """
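    # Strip only the trailing '.gz' (str.replace would also remove any '.gz' elsewhere in the path)
    file = gz_file[:-len('.gz')] if gz_file.endswith('.gz') else gz_file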
    if os.path.exists(gz_file) and not os.path.exists(file):
        !gzip -dk {gz_file}
    return file

## Load the data

We will load the data from the key steps of the pipeline in the rough order they ran.

We will load each contig and associate it with any relevant information such as the length, bin/MAG and coverage etc. As we have information at different levels (sample, assembly, MAG and taxonomic levels) we will use IDs to link information in different dataframes.
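
As a small illustration of how a contig ID can act as the linking key, the sketch below parses a SPAdes-style contig header of the form `NODE_<id>_length_<len>_cov_<cov>`, which is the pattern the loading code later in this notebook relies on. The function name `parse_spades_header` and the example header are hypothetical and only meant to show the idea.

```python
import re

# Assumed SPAdes-style contig header: NODE_<id>_length_<len>_cov_<cov>
HEADER_RE = re.compile(r'^NODE_(\d+)_length_(\d+)_cov_([\d.]+)')

def parse_spades_header(header):
    """Return (contig_id, length, coverage) parsed from a SPAdes-style header."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f'Unexpected contig header: {header!r}')
    contig_id, length, coverage = match.groups()
    return int(contig_id), int(length), float(coverage)

# Hypothetical header, for illustration only
print(parse_spades_header('NODE_12_length_4310_cov_7.25'))
```

The integer contig ID is what allows rows from the count matrix, the bins and the taxonomy tables to be joined.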


### Quality Control (QC)
The first step of the pipeline was to perform QC on the raw reads. Here we will load the data from [FastQC](https://github.com/s-andrews/FastQC)/[MultiQC](https://github.com/ewels/MultiQC) to obtain the sample names and the total number of raw reads per sample.
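
A minimal sketch of the pairing step, assuming the per-file keys end in `_1` and `_2` for the forward and reverse reads. The helper name `pair_read_counts` and the counts are made up for illustration; the notebook cell below does the equivalent with dictionary comprehensions.

```python
def pair_read_counts(per_file_counts, suffixes=('_1', '_2')):
    """Sum per-file read counts into per-sample totals.

    Assumes every file key ends with one of `suffixes`, e.g. 'sampleA_1'.
    Keys without a recognised suffix are ignored.
    """
    totals = {}
    for name, count in per_file_counts.items():
        for suffix in suffixes:
            if name.endswith(suffix):
                sample = name[:-len(suffix)]
                totals[sample] = totals.get(sample, 0) + int(count)
                break
    return totals

# Made-up counts, for illustration only
example = {'sampleA_1': 1000.0, 'sampleA_2': 1000.0,
           'sampleB_1': 250.0, 'sampleB_2': 250.0}
print(pair_read_counts(example))
```

Matching on the end of the key (rather than anywhere in it) avoids surprises if a sample name itself happens to contain `_1`.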

In [None]:
# Read the QC data
with open(fastqc_yaml, 'r') as f:
    fastqc_data = yaml.load(f, Loader=yaml.FullLoader)

# Get the sample IDs and the total number of raw reads
sample_read_counts = {k: fastqc_data[k]['Total Sequences'] for k in fastqc_data.keys()}
sample_read_counts = {k.replace('_1',''): int(sample_read_counts[k] + sample_read_counts[k.replace('_1', '_2')]) for k in sample_read_counts.keys() if '_1' in k}
samples = list(sample_read_counts.keys())
print(samples)

In [None]:
if samples_metadata and not rothman:
    samples_df = pd.read_csv(samples_metadata, index_col=0)
    samples_df.index = samples_df.index.str.split('-').str[-1]
    display(samples_df)

In [None]:
if ribodetector_csv:
    ribo_df = pd.read_csv(ribodetector_csv, index_col=0)
    ribo_df['n_rrna_reads'] = ribo_df['n_total_reads'] - ribo_df['n_nonrrna_reads']
    ribo_df.loc['total'] = ribo_df.sum()
    ribo_df['perc_rrna_reads'] = ribo_df['n_rrna_reads'] / ribo_df['n_total_reads'] * 100
    display(ribo_df.style.format({'n_total_reads': '{:,.0f}'.format,'n_rrna_reads': '{:,.0f}'.format,'n_nonrrna_reads': '{:,.0f}'.format,'perc_rrna_reads': '{:.2f}%'.format})\
        .bar(subset=ribo_df.columns[:3], color='#d65f5f')\
        .bar(subset=['perc_rrna_reads'], color='#5fba7d', vmax=100))

### Assemblies

[metaSPAdes](http://cab.spbu.ru/software/spades/) was used for the assembly, i.e. merging overlapping reads into longer contiguous sequences (contigs).

We will skip loading the contigs here; instead, we will load the MAGs in the next step, which include the same contig information plus additional information such as the bin.

#### Getting read counts for the contigs and samples

Here we will generate a count matrix, i.e. a table of mapped read counts with contig IDs as rows and sample IDs as columns.

The nf-core/mag pipeline already maps the reads back to the contigs using [bowtie2](https://github.com/BenLangmead/bowtie2) (see [code](https://github.com/nf-core/mag/blob/master/modules/local/bowtie2_assembly_align.nf#L23-L30)). Therefore, we can use [samtools](http://www.htslib.org/) to get the read counts for each contig and sample (see [example](https://github.com/edamame-course/Metagenome/blob/master/2018-06-29-counting-abundance-with-mapped-reads.md#option-1-count-each-contigs)). The third column in the tab-separated output from [`samtools idxstats`](http://www.htslib.org/doc/samtools-idxstats.html) is the number of reads mapped to the contig, which we'll use to generate the count matrix.
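
To make the idea concrete, here is a dependency-free sketch that turns `samtools idxstats` text (contig name, contig length, mapped reads, unmapped reads) into a small count matrix. The sample names, contig names and counts are invented; the cells below do the real work with pandas.

```python
def parse_idxstats(text):
    """Map contig name -> number of mapped reads from `samtools idxstats` output."""
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, _unmapped = line.split('\t')
        if name == '*':  # final row: reads not assigned to any contig
            continue
        counts[name] = int(mapped)
    return counts

# Made-up idxstats output for two hypothetical samples
idxstats_by_sample = {
    'sampleA': 'NODE_1_length_500_cov_3.0\t500\t40\t0\n'
               'NODE_2_length_300_cov_1.5\t300\t7\t0\n'
               '*\t0\t0\t12\n',
    'sampleB': 'NODE_1_length_500_cov_3.0\t500\t10\t0\n'
               'NODE_2_length_300_cov_1.5\t300\t0\t0\n'
               '*\t0\t0\t5\n',
}

# Count matrix as a nested dict: contig -> {sample: mapped reads}
count_matrix = {}
for sample, text in idxstats_by_sample.items():
    for contig, mapped in parse_idxstats(text).items():
        count_matrix.setdefault(contig, {})[sample] = mapped

print(count_matrix)
```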

##### How does bowtie2 align reads to contigs?
Bowtie2 supports different modes of alignment. The default mode (used by the nf-core/mag pipeline) is "end-to-end" alignment, which uses all characters in the read for alignment, e.g.:
```
Alignment:
  Read:      GACTGGGCGATCTCGACTTCG
             |||||  |||||||||| |||
  Reference: GACTG--CGATCTCGACATCG
```

An alignment score is used to quantify how similar the read is to the aligned contig sequence. The higher the score, the more similar the read is to the contig. A mismatched base at a high-quality position in the read receives a penalty of -6 by default. A length-2 read gap receives a penalty of -11 by default (-5 for the gap open, -3 for the first extension, -3 for the second extension). Thus, in end-to-end alignment mode, if the read is 150 bp long and it matches the reference exactly except for one mismatch at a high-quality position and one length-2 read gap, then the overall score is -(6 + 11) = -17. The best possible alignment score in end-to-end mode is 0, which happens when there are no differences between the read and the reference.

A read is considered a "match" if its alignment score is no less than the minimum score threshold. In end-to-end alignment mode, the default minimum score threshold is `-0.6 + -0.6 * L`, where `L` is the read length. For example, using a read length of 150 bp, the minimum score threshold is `-0.6 + -0.6 * 150 = -90.6`. Therefore, if the alignment score is no less than -90.6 (e.g. there are at most 15 high-quality mismatches and no gaps, giving a score of -90), then the read is considered a "match".
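
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the default end-to-end scoring scheme as described in this section (mismatch penalty 6, read gap open 5, gap extension 3, threshold `-0.6 + -0.6 * L`). The function names are made up, and any integer rounding Bowtie2 may apply internally is ignored.

```python
# Default end-to-end penalties as described in the text above
MISMATCH_PENALTY = 6    # mismatch at a high-quality position
GAP_OPEN_PENALTY = 5    # opening a read gap
GAP_EXTEND_PENALTY = 3  # each gapped position

def min_score(read_length, intercept=-0.6, slope=-0.6):
    """Minimum alignment score for a read of the given length."""
    return intercept + slope * read_length

def alignment_score(n_mismatches=0, gap_lengths=()):
    """Score of an alignment with the given mismatches and read gaps."""
    score = -MISMATCH_PENALTY * n_mismatches
    for gap_length in gap_lengths:
        score -= GAP_OPEN_PENALTY + GAP_EXTEND_PENALTY * gap_length
    return score

def is_match(read_length, n_mismatches=0, gap_lengths=()):
    """True if the alignment score reaches the minimum score threshold."""
    return alignment_score(n_mismatches, gap_lengths) >= min_score(read_length)

print(alignment_score(n_mismatches=1, gap_lengths=[2]))  # one mismatch + one length-2 gap
print(min_score(150))
print(is_match(150, n_mismatches=15), is_match(150, n_mismatches=16))
```

With these defaults, a 150 bp read with 15 high-quality mismatches (score -90) still passes the threshold, while 16 mismatches (score -96) does not.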


In [None]:
# Get absolute path for bam_dir and generate a list of all of the bams
bam_dir = os.path.abspath(bam_dir)
bams = [f'{bam_dir}/{bam}' for bam in os.listdir(bam_dir) if bam.endswith('.bam')]

# Check if samtools is installed locally, if not, use the docker container from the nf-core/mag pipeline to run samtools
if shutil.which('samtools'):
    samtools_cmd = 'samtools'
else:
    samtools_cmd = f'docker run -v {bam_dir}:{bam_dir} -w {bam_dir} quay.io/biocontainers/mulled-v2-ac74a7f02cebcfcc07d8e8d1d750af9c83b4d45a:577a697be67b5ae9b16f637fd723b8263a3898b3-0 samtools'

# Run samtools idxstats on all of the bams
for bam in bams:
    sample = bam.split('/')[-1].split('.')[0].split('-')[-1]
    if not os.path.exists(f'{bam_dir}/{sample}_idxstats.txt'):
        # TODO: Use a method that assigns each read to the single best matching contig only and doesn't count reads multiple times
        !{samtools_cmd} idxstats {bam} > {bam_dir}/{sample}_idxstats.txt

In [None]:
# Load and concatenate the idxstats files
idxstat_cols = ['contig', 'length', 'num_mapped_reads', 'num_unmapped_reads']
idxstat_dfs = []

# Get a list of all of the idxstats files
idxstats = [f'{bam_dir}/{idxstat}' for idxstat in os.listdir(bam_dir) if idxstat.endswith('_idxstats.txt')]

# Read in the idxstats files
for idxstat in idxstats:
    sample = idxstat.split('/')[-1].split('_')[0]
    idxstat_df = pd.read_csv(idxstat, sep='\t', header=None, names=idxstat_cols)
    idxstat_df['sample_id'] = sample
    idxstat_dfs.append(idxstat_df)

# Combine the idxstats dataframes into a single dataframe
counts_df = pd.concat(idxstat_dfs, axis=0)
counts_df.contig = counts_df.contig.str.split('_').str[1]

In [None]:
# Reformat the dataframe so that the contigs are the index and the samples are the columns
counts_df = counts_df.pivot(index='contig', columns='sample_id', values='num_mapped_reads')

# Sort the dataframe
counts_df = counts_df[~counts_df.index.isnull()]
counts_df.index = counts_df.index.astype(int)
counts_df = counts_df.sort_index()

counts_df

### Genome binning

Here we will load the bins/MAGs.

- Two tools were used to perform metagenome binning to generate metagenome-assembled genomes (MAGs): [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/) and [MaxBin2](https://sourceforge.net/projects/maxbin2/)

- Two tools were used for quality control (QC) of the genome bins: [BUSCO](https://busco.ezlab.org/) and [QUAST](http://quast.sourceforge.net/quast)


Let's look at the summary information for the bins/MAGs.


In [None]:
pd.set_option('display.max_columns', None)

bin_summary_df = pd.read_csv(bin_tsv, sep='\t', index_col=0)
bin_summary_df.head()

The results from the tools show that MaxBin2 generated more genome bins and a higher % completeness than MetaBAT2.


In [None]:
busco_df = pd.read_csv(busco_tsv, sep='\t')
busco_df.head()

In [None]:
quast_df = pd.read_csv(quast_tsv, sep='\t')
quast_df.head()

We will therefore use the MaxBin2 results for the rest of the analysis.


In [None]:
# Get the bin files
bin_files = []

for dirpath, dirnames, filenames in os.walk(bin_dir):
    bin_files.extend([os.path.join(dirpath, file) for file in filenames if file.endswith('.gz')])

bin_files

In [None]:
# Load the MAGs (i.e. contigs and bins) from the bin files into dataframes
bin_dfs = []

for bin_file_gz in bin_files:

    # Decompress (if they aren't already), then load the bin files as a dict {'node_id': ['length', 'cov', 'seq']} and convert to a dataframe
    bin_file = decompress_gz(bin_file_gz)
    bin_dict = {int(record.id.split('_')[1]): [int(record.id.split('_')[3]), float(record.id.split('_')[5]), str(record.seq)] for record in SeqIO.parse(bin_file, 'fasta')}
    bin_df = pd.DataFrame.from_dict(bin_dict, orient='index', columns=['length', 'coverage', 'seq'])

    # Add the bin ID to the dataframe
    bin_df['bin_id'] = bin_file.split('/')[-1].split('.')[1]

    # Add the dataframe to the list
    bin_dfs.append(bin_df)

# Concatenate all of the dataframes
bin_df = pd.concat(bin_dfs).sort_index()
bin_df = bin_df[~bin_df.index.duplicated(keep='first')]
bin_df.head()

### Taxonomic classification


#### Taxonomic classification of trimmed reads

The [kraken2](https://github.com/DerrickWood/kraken2/wiki/Manual) tool was used to classify trimmed reads using the [prebuilt 8GB minikraken DB](https://zenodo.org/record/4024003#.Y4-9PdLMK0o) as provided by the Center for Computational Biology of the Johns Hopkins University (from 2020-03). (**Note:** Using a larger kraken database would likely improve the results by decreasing the number of unclassified reads). The outputs were tab-separated files with the following columns:

1. `percentage` - Percentage of fragments covered by the clade rooted at this taxon
2. `num_fragments` - Number of fragments covered by the clade rooted at this taxon
3. `num_assigned` - Number of fragments assigned directly to this taxon
4. `rank_code` - A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
5. `tax_id` - NCBI taxonomic ID number
6. `scientific_name` - Indented scientific name
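
To show how these columns fit together, the sketch below parses a tiny, made-up report using only the standard library. It assumes the name column is indented by two spaces per taxonomic level; the function name `parse_kraken_report` and the extra `depth` field are illustrative and not part of kraken2 itself.

```python
COLUMNS = ['percentage', 'num_fragments', 'num_assigned',
           'rank_code', 'tax_id', 'scientific_name']

def parse_kraken_report(text):
    """Parse kraken2 report text into a list of dicts, one per taxon."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        row = dict(zip(COLUMNS, line.split('\t')))
        name = row['scientific_name']
        # Assumption: two spaces of indentation per level in the taxonomy tree
        row['depth'] = (len(name) - len(name.lstrip(' '))) // 2
        row['scientific_name'] = name.strip()
        row['percentage'] = float(row['percentage'])
        for col in ('num_fragments', 'num_assigned', 'tax_id'):
            row[col] = int(row[col])
        rows.append(row)
    return rows

# Made-up, truncated report for illustration only
report = (
    ' 10.00\t100\t100\tU\t0\tunclassified\n'
    ' 90.00\t900\t20\tR\t1\troot\n'
    ' 88.00\t880\t30\tR1\t131567\t  cellular organisms\n'
    ' 70.00\t700\t700\tD\t2\t    Bacteria\n'
)
rows = parse_kraken_report(report)

# The clade counts of the 'U' and 'R' rows together cover every processed read
total_reads = sum(r['num_fragments'] for r in rows if r['rank_code'] in ('U', 'R'))
print(total_reads)
```

Because `num_fragments` is a clade-level count, summing it over nested taxa would count the same reads more than once; only rows at the same level of the tree should be added together.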

In [None]:
kraken_dfs = []
names = ['percentage', 'num_fragments', 'num_assigned', 'rank_code', 'tax_id', 'scientific_name']

for sample in samples:
    kraken_file = f'{kraken_dir}/{sample}/kraken2_report.txt'
    kraken_df = pd.read_csv(kraken_file, sep='\t', header=None, names=names, index_col=0)
    kraken_df['sample_id'] = sample
    kraken_dfs.append(kraken_df)

kraken_df = pd.concat(kraken_dfs)

# Remove whitespace from the scientific name
kraken_df.scientific_name = kraken_df.scientific_name.str.strip()
kraken_df = kraken_df.reset_index()
kraken_df.head()

In [None]:
def kraken_cols():
    return ['total_raw_reads','total_processed_reads','classified_reads','unclassified_reads','bacterial_reads','viral_reads','eukaryotic_reads','human_reads']

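def mk_summary_df(kraken_df, samples_metadata, rothman, display_as_percent=False, vmax=None, cols=None):
    # Copy the column list on every call: it is modified below when percentages are requested,
    # so a shared (mutable) default would leak the '(%)' names into later calls
    cols = kraken_cols() if cols is None else list(cols)
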
    # Generate the individual dataframes
    unclass_df = kraken_df[kraken_df['rank_code'] == 'U'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'unclassified_reads'})
    class_df = kraken_df[kraken_df['rank_code'] == 'R'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'classified_reads'})
    # Match the domain names exactly so that other taxa whose names merely contain them are not counted twice
    bact_df = kraken_df[kraken_df['scientific_name'] == 'Bacteria'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'bacterial_reads'})
    viral_df = kraken_df[kraken_df['scientific_name'] == 'Viruses'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'viral_reads'})
    eukaryot_df = kraken_df[kraken_df['scientific_name'] == 'Eukaryota'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'eukaryotic_reads'})
    human_df = kraken_df[kraken_df['scientific_name'] == 'Homo sapiens'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'human_reads'})

    # Combine the dataframes and then process
    summary_df = pd.concat([unclass_df, class_df, bact_df, viral_df, eukaryot_df, human_df], axis=1)
    summary_df['total_raw_reads'] = [sample_read_counts[sample] for sample in summary_df.index]
    summary_df['total_processed_reads'] = summary_df['unclassified_reads'] + summary_df['classified_reads']
    summary_df['total_raw_reads'] = summary_df['total_raw_reads'].astype(int)
    summary_df['total_processed_reads'] = summary_df['total_processed_reads'].astype(int)

    # Use a more informative sample ID if possible
    if samples_metadata and not rothman:
        summary_df.index = [samples_df.loc[sample]['Experiment Type'] for sample in summary_df.index]

    # Add a row for the totals
    summary_df.loc['total'] = summary_df.sum()

    # Reorder the dataframe
    summary_df = summary_df[cols]
    summary_df = summary_df.sort_index()

    # Remove rows with no reads
    summary_df = summary_df[summary_df['total_raw_reads'] != 0]

    # Calculate percentages
    if display_as_percent:
        vmax = 100
        for col in cols[2:]:
            cols[cols.index(col)] = f'{col} (%)'
            summary_df = summary_df.rename(columns={col: f'{col} (%)'})
            col = f'{col} (%)'
            summary_df[col] = summary_df[col] / summary_df['total_processed_reads'] * 100

    return summary_df, cols, vmax

summary_df, cols, vmax = mk_summary_df(kraken_df, samples_metadata, rothman, display_as_percent=True)

# Display the summary dataframe with bars
summary_df.style\
    .bar(subset=cols[:2], color='#d65f5f')\
    .bar(subset=cols[2:], color='#5fba7d', vmax=vmax)\
    .format('{:,.0f}', subset=cols[:2])\
    .format('{:.1f}', subset=cols[2:])

In [None]:
# # Display the top 50 most abundant taxa
# kraken_df = kraken_df[kraken_df.rank_code != 'U']
# kraken_df = kraken_df[kraken_df.rank_code != 'R']
# kraken_df = kraken_df.sort_values('num_fragments', ascending=False)
# kraken_df.head(50)

# Generate a Krona chart for all of samples - this is a bit hacky
# !docker run -v $PWD:$PWD -w $PWD quay.io/biocontainers/krona:2.7.1--pl526_5 ktUpdateTaxonomy.sh taxonomy && !ktImportTaxonomy SRR14530762/kraken2_report.txt SRR14530763/kraken2_report.txt SRR14530764/kraken2_report.txt SRR14530765/kraken2_report.txt SRR14530766/kraken2_report.txt SRR14530767/kraken2_report.txt SRR14530769/kraken2_report.txt SRR14530770/kraken2_report.txt SRR14530771/kraken2_report.txt SRR14530772/kraken2_report.txt SRR14530880/kraken2_report.txt SRR14530881/kraken2_report.txt SRR14530882/kraken2_report.txt SRR14530884/kraken2_report.txt SRR14530885/kraken2_report.txt SRR14530886/kraken2_report.txt SRR14530887/kraken2_report.txt SRR14530888/kraken2_report.txt SRR14530889/kraken2_report.txt SRR14530890/kraken2_report.txt SRR14530891/kraken2_report.txt -tax taxonomy

#### Taxonomic classification of assemblies

##### Virus classification of contigs

Let's load the virus classifications predicted using [geNomad](https://github.com/apcamargo/genomad) and combine them with our existing data for the contigs.

In [None]:
# Load the virus classification data
genomad_class_df = pd.read_csv(genomad_class, sep='\t')
genomad_tax_df = pd.read_csv(genomad_tax, sep='\t')
vir_df = pd.merge(genomad_class_df, genomad_tax_df, on='seq_name', how='left')
vir_df.index = vir_df.seq_name.str.split('_').str[1].astype(int)
vir_df = vir_df.drop('seq_name', axis=1)
vir_df.head()

In [None]:
# Merge the binning and taxonomy dataframes
df = pd.merge(bin_df, vir_df, left_index=True, right_index=True)

##### Bacterial classification of MAGs

Load the GTDB-Tk summary table (see [column descriptions](https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html)) and combine it with the existing information for the contigs.
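
The `classification` column holds a semicolon-separated lineage with rank prefixes (`d__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`). As a hedged sketch, a string like this could be split into named ranks as follows; the helper `parse_gtdb_classification` is hypothetical and the example lineage is a generic one, not taken from the pipeline output.

```python
RANK_PREFIXES = {
    'd': 'domain', 'p': 'phylum', 'c': 'class', 'o': 'order',
    'f': 'family', 'g': 'genus', 's': 'species',
}

def parse_gtdb_classification(classification):
    """Split a GTDB-style lineage string into a {rank: name} dict.

    Ranks with an empty name (e.g. a trailing 's__') are returned as None.
    """
    lineage = {}
    for part in classification.split(';'):
        prefix, _, name = part.strip().partition('__')
        rank = RANK_PREFIXES.get(prefix)
        if rank is not None:
            lineage[rank] = name or None
    return lineage

# Generic example lineage, for illustration only
example = ('d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;'
           'o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;'
           's__Escherichia coli')
print(parse_gtdb_classification(example)['genus'])
```

Splitting the lineage this way makes it easy to group bins at a chosen rank, for example by genus.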

In [None]:
# Load the GTDB-Tk classification data
gtdbtk_df = pd.read_csv(gtdbtk_tsv, sep='\t')

# Filter the GTDB-Tk dataframe
gtdbtk_df = gtdbtk_df[gtdbtk_df.user_genome.str.contains(binner)]
cols = ['user_genome', 'classification', 'classification_method', 'other_related_references(genome_id,species_name,radius,ANI,AF)', 'msa_percent', 'red_value', 'warnings']
gtdbtk_df = gtdbtk_df[cols]
gtdbtk_df.head()

In [None]:
# Add the GTDB-Tk classification data to the main dataframe
gtdbtk_df['bin_id'] = gtdbtk_df.user_genome.str.split('.').str[1]
gtdbtk_df = gtdbtk_df.drop('user_genome', axis=1)

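# Merge the GTDB-Tk dataframe with the main dataframe.
# Merging on a column resets the index, so save the contig IDs first and restore them afterwards
# (this assumes each bin_id appears only once in gtdbtk_df, so the left merge keeps one row per contig)
contig_ids = df.index
df = pd.merge(df, gtdbtk_df, on='bin_id', how='left')
df.index = contig_ids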

### Putting it all together

This is what the final dataframe containing the information for each contig looks like:

In [None]:
df
# df.lineage.value_counts()

## Explore/plot the data

### What is the number of reads for different domains for the different samples (or sampling dates)?

In [None]:
# Define vars
display_as_percent = False
summary_df, cols, vmax = mk_summary_df(kraken_df, samples_metadata, rothman, display_as_percent, cols=kraken_cols())
summary_df = summary_df.drop(index='total')  # Remove the totals row
xlabel = 'Sample ID'
unclass, bac, vir, euk = cols[3], cols[4], cols[5], cols[6]

# Get sampling date for the Rothman data
if rothman:
    rothman_df = pd.read_csv(samples_metadata, sep='\t', index_col=0, header=None, names=['sample_id', 'sampling_date', 'sampling_site'])
    summary_df['sampling_date'] = [rothman_df.loc[sample]['sampling_date'] for sample in summary_df.index]
    summary_df['sampling_date'] = pd.to_datetime(summary_df['sampling_date'], format='%Y-%m-%d')
    summary_df = summary_df.sort_values('sampling_date')
    summary_df['sampling_date'] = summary_df['sampling_date'].dt.strftime('%Y-%m-%d').str.replace('-','.')
    xlabel = 'Sampling date'

In [None]:
def format_label(label):
    return label.split('_')[0].capitalize()

fig, ax = plt.subplots(figsize=(15, 5))
sns.barplot(data=summary_df, x=summary_df.sampling_date, y=unclass, label=format_label(unclass), color='#808080')
sns.barplot(data=summary_df, x=summary_df.sampling_date, y=bac, label=format_label(bac), color='#d65f5f', bottom=summary_df[unclass])
sns.barplot(data=summary_df, x=summary_df.sampling_date, y=vir, label=format_label(vir), color='#5fba7d', bottom=summary_df[unclass] + summary_df[bac])
sns.barplot(data=summary_df, x=summary_df.sampling_date, y=euk, label=format_label(euk), color='#5f81ba', bottom=summary_df[unclass] + summary_df[bac] + summary_df[vir])
handles, labels = ax.get_legend_handles_labels()
ax.legend(handles[::-1], labels[::-1], loc='upper left', ncol=1, frameon=False, bbox_to_anchor=(1, 0.5))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
plt.xticks(rotation=90)
plt.xlabel(xlabel)
plt.ylabel('Number of passed reads')
if display_as_percent:
    plt.ylabel('Percentage of passed reads')
plt.tight_layout()
plt.show()
# plt.savefig('kraken_barplot.png', dpi=300)

### What is the distribution of contig lengths?

The contig IDs are ordered by length, so we can plot the contig IDs against their length to see the distribution of contig lengths.

For contigs that are at least Xbp long, as their length increases, the number of contigs decreases superexponentially (note: the y-axis is shown as a log scale). 

The majority of contigs are less than 1000bp long and the longest contig is ~430kbp long, almost double the length of the next longest contig.

In [None]:
# Filter to keep contigs that are at least X bp long
contig_df = df[df['length'] >= 250].copy()
contig_df['index'] = contig_df.index

# Generate an interactive plot of the contig length distribution
# Plot the contig ID (x-axis) against the contig length (y-axis)
# Use a log scale for the y-axis so that the contigs are more easily visible

px.defaults.template = 'plotly_white'
fig = px.scatter(contig_df, x='index', y='length', log_y=True, hover_data={'length': ':,.0f', 'index': ':,.0f'}, width=1000, height=600)
fig.update_xaxes(title_text='Contig ID')
fig.update_yaxes(title_text='Contig length (bp)')
fig.show()

### What is the distribution of total base pairs for each of the contigs?

For each contig, we have the average coverage ($\overline{cov}$) which is equal to:

$ \overline{cov} = \frac{read\_len \times n\_reads}{contig\_len} = \frac{n\_base\_pairs}{contig\_len} $

Where:
- $read\_len$ = the average length of the reads
- $n\_reads$ = the number of reads (sequence fragments) that exactly align to the contig
- $contig\_len$ = the length of the contig
- $n\_base\_pairs$ = the number of base pairs from reads that exactly align to the contig

<br >

We will be estimating the number of base pairs from the reads which map to the contig (which can be thought of as the contig weight) using the following equation:

$ n\_base\_pairs = \overline{cov} \times contig\_len$

<br >

We will then plot the length of the contigs against the number of base pairs from the reads that map to the contig. In doing so we can see that there is a positive correlation between the length of the contig and the number of base pairs from the reads that map to the contig. This is expected because the longer the contig, the more reads will map to it. However, we can also see that there is a wide range of number of base pairs for any given contig length. This is because the coverage varies across the contigs.
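
As a quick worked example of the two equations above, the sketch below converts a coverage value back into an estimated number of base pairs and reads. The numbers and function names are made up, and the contig's `cov` value is treated as the average coverage defined above.

```python
def estimate_base_pairs(coverage, contig_length):
    """n_base_pairs = average coverage x contig length."""
    return coverage * contig_length

def estimate_read_count(coverage, contig_length, read_length):
    """n_reads = n_base_pairs / average read length."""
    return estimate_base_pairs(coverage, contig_length) / read_length

# Hypothetical contig: 3,000 bp long with 12x average coverage, 150 bp reads
n_base_pairs = estimate_base_pairs(coverage=12.0, contig_length=3000)
n_reads = estimate_read_count(coverage=12.0, contig_length=3000, read_length=150)
print(n_base_pairs, n_reads)
```

Two contigs with the same coverage therefore differ in weight in proportion to their length, which is why the plot uses log scales on both axes.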

The contigs are also coloured by the bin/MAG they belong to. We can see that:
1. All of the longest contigs belong to the same bin/MAG (002). All of the almost 1000 contigs in this bin were classified as Proteobacteria from the _Azonexus_ genus by GTDB-Tk.
2. The contigs with the greatest coverage/total number of base pairs belong to the bin/MAG (001). Most of the contigs in this bin were predicted to be viral by geNomad. Most sequences were predicted to be from the _Caudovirales_ order, i.e. bacteriophages. However, the contigs with the highest coverage were predicted to be from the _Martellivirales_ order, which mainly consists of plant viruses including the tomato brown rugose fruit virus ([ToBRFV](https://en.wikipedia.org/wiki/Tomato_brown_rugose_fruit_virus)).

In [None]:
contig_df['n_base_pairs'] = contig_df['length'] * contig_df['coverage']

# Use a log colour scale for the contig length
# fig = px.scatter(contig_df, x='index', y='n_base_pairs', log_x=False, log_y=True, hover_data={'index': ':,.0f', 'length': ':,.0f',  'coverage': ':,.0f', 'n_base_pairs': ':,.0f'}, width=1000, height=600, color='length')
fig = px.scatter(contig_df, x='length', y='n_base_pairs', log_x=True, log_y=True, hover_data={'index': ':,.0f', 'length': ':,.0f',  'coverage': ':,.0f', 'n_base_pairs': ':,.0f'}, width=1000, height=600, color='bin_id')
fig.update_xaxes(title_text='Length')
fig.update_yaxes(title_text='Total number of base pairs that align to the contig (bp)')
fig.show()

### For each contig, how many reads are from each sample?

Visualising all of the contigs > Xbp by number of reads from each sample

In [None]:
plt.figure(figsize=(15, 15))
ax = sns.heatmap(counts_df, cmap='viridis', cbar_kws={'label': 'Number of mapped reads'}, xticklabels=True, norm=LogNorm())
ax.set_xlabel('Sample ID')
ax.set_ylabel('Contig ID')
plt.tight_layout()

Visualising the top 500 contigs by number of reads from each sample

In [None]:
top_contigs_df = counts_df.head(500).copy()
top_contigs_df['bin_id'] = top_contigs_df.index.map(df['bin_id'])
top_contigs_df['index'] = top_contigs_df.index
# Sort by the bin ID and then by the contig ID
top_contigs_df = top_contigs_df.sort_values(['bin_id', 'index'])

In [None]:
top_contigs_df = counts_df.head(300).copy()

top_contigs_df['bin_id'] = top_contigs_df.index.map(df['bin_id'])
top_contigs_df['index'] = top_contigs_df.index
# Sort by the bin ID and then by the contig ID
top_contigs_df = top_contigs_df.sort_values(['bin_id', 'index'])
del top_contigs_df['bin_id']
del top_contigs_df['index']

x = top_contigs_df.index.tolist()
y = top_contigs_df.values.tolist()
y = list(map(list, zip(*y)))
colors = top_contigs_df.columns.tolist()
hover_data = []
for i in range(len(x)):
    hover_data.append(f'Contig ID: {x[i]}<br>Bin: {df.loc[x[i], "bin_id"]}<br>Classification: {df.loc[x[i], "classification"]}<br>Viral classification: {df.loc[x[i], "lineage"]}')


fig = go.Figure(data=[
    go.Bar(name=colors[i], x=x, y=y[i]) for i in range(len(colors))
])
# Change the bar mode
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'}, yaxis={'type': 'log'}, width=1000, height=800)
fig.update_traces(hovertemplate=hover_data)
fig.update_xaxes(title_text='Contig ID')
fig.update_yaxes(title_text='Number of mapped reads')

fig.show()

### What percentage of the reads were used in the assembly?
Using the total number of processed reads for each sample and the counts matrix, we can calculate the percentage of reads used in the assembly for each sample.

The results are promising, because for most of the samples there is a high % use of reads in the assembly. However, as noted in the [samtools documentation](http://www.htslib.org/doc/samtools-idxstats.html) the method we used to calculate the number of reads "may count reads multiple times if they are mapped more than once or in multiple fragments." (Note: this is also why the % use of reads in the assembly is over 100% for some samples). This means the results are likely an overestimate and we cannot be sure if it's the same reads that are found in multiple contigs or if they're unique reads that are found in multiple contigs.
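
The calculation, and the way multiply counted reads can push the result above 100%, can be illustrated with a toy count matrix. All names and numbers below are invented; the notebook cell that follows performs the same calculation with pandas.

```python
def percent_reads_mapped(count_matrix, processed_reads):
    """Percentage of each sample's processed reads that mapped to any contig.

    count_matrix: {contig_id: {sample_id: mapped read count}}
    processed_reads: {sample_id: total processed reads}
    """
    totals = {}
    for per_sample in count_matrix.values():
        for sample, mapped in per_sample.items():
            totals[sample] = totals.get(sample, 0) + mapped
    return {sample: 100 * total / processed_reads[sample]
            for sample, total in totals.items()}

# Toy data: sample 'A' has reads counted against more than one contig
toy_counts = {1: {'A': 600, 'B': 90}, 2: {'A': 500, 'B': 10}}
toy_processed = {'A': 1000, 'B': 400}
print(percent_reads_mapped(toy_counts, toy_processed))
```

In this toy case sample `A` reaches 110% because its per-contig counts sum to more than its number of processed reads, which mirrors the caveat from the samtools documentation quoted above.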

In [None]:
# Get the total number of processed reads for each sample
sample_processed_read_counts = summary_df['total_processed_reads'].to_dict()

# Convert the number of reads mapped to each contig to a percentage of the total number of processed reads for each sample
counts_perc_df = counts_df.copy()
for sample in counts_perc_df.columns:
    counts_perc_df[sample] = counts_perc_df[sample] / sample_processed_read_counts[sample] * 100

# Sum the percentage of reads mapped to each contig for each sample
counts_perc_df = pd.DataFrame(counts_perc_df.sum(axis=0), columns=['% reads mapped to contigs'])

# Plot a bar chart of the total percentage of reads mapped to contigs for each sample using seaborn
ax = sns.barplot(x=counts_perc_df.index, y='% reads mapped to contigs', data=counts_perc_df, color='steelblue')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_xlabel('Sample ID')
ax.set_ylabel('Total reads that mapped to contigs (%)')
plt.tight_layout()