# MAG Analysis

Analysing assembled and binned metagenomes generated using the [nf-core/mag](https://github.com/PhilPalmer/mag/tree/genomad) pipeline on the [Rothman dataset](https://doi.org/10.1128/AEM.01448-21) with samples from the [HTP site](https://en.wikipedia.org/wiki/Hyperion_sewage_treatment_plant)


In [None]:
import os
import pandas as pd
import yaml

from Bio import SeqIO

In [None]:
########
# Change
########
# TODO: Upload directory to AWS S3 and add code to download the data if needed
data_dir = '../mag/results_rothman_htp'

##############
# Don't change
##############

# QC
fastqc_yaml = f'{data_dir}/multiqc/multiqc_data/multiqc_fastqc.yaml'

# Assembly data
# spades_dir = f'{data_dir}/Assembly/SPAdes'
# spades_fa_gz = f'{spades_dir}/SPAdes-group-HTP_scaffolds.fasta.gz'

# Genome Binning data
binner = 'MaxBin2'
busco_tsv = f'{data_dir}/GenomeBinning/QC/busco_summary.tsv'
quast_tsv = f'{data_dir}/GenomeBinning/QC/quast_summary.tsv'
bin_tsv = f'{data_dir}/GenomeBinning/bin_summary.tsv'
bin_dir = f'{data_dir}/GenomeBinning/{binner}'
contig_depths_gz = f'{data_dir}/GenomeBinning/depths/contigs/SPAdes-group-HTP-depth.txt.gz'

# Taxonomy data
genomad_dir = f'{data_dir}/Taxonomy/geNomad'
genomad_tsv = f'{genomad_dir}/group-HTP/SPAdes-group-HTP_scaffolds_aggregated_classification/SPAdes-group-HTP_scaffolds_aggregated_classification.tsv'
gtdbtk_tsv = f'{data_dir}/Taxonomy/GTDB-Tk/gtdbtk_summary.tsv'
kraken_dir = f'{data_dir}/Taxonomy/kraken2'


In [None]:
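# Define helper functions
def decompress_gz(gz_file):
    """
    Decompress a gzipped file (keeping the original) and return the decompressed path.
    """
    # Strip only the trailing '.gz' so a '.gz' elsewhere in the path is left intact
    file = gz_file[:-len('.gz')] if gz_file.endswith('.gz') else gz_file
    if os.path.exists(gz_file) and not os.path.exists(file):
        !gzip -dk {gz_file}
    return file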

## Load the data

Load each contig with any relevant information, e.g. the bin/MAG and coverage.

Link the information at the sample, assembly, MAG and taxonomic levels.


### Quality Control (QC)
Print the sample names and get the total number of raw reads per sample
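
The per-sample total is the sum of the forward (`_1`) and reverse (`_2`) read counts. The sketch below illustrates this on a small made-up dictionary; the sample names, the counts and the helper `paired_read_counts` are hypothetical and only mimic the structure assumed for `multiqc_fastqc.yaml`.

```python
# Hypothetical FastQC-style entries; real keys and counts come from multiqc_fastqc.yaml
fastqc_example = {
    'sampleA_1': {'Total Sequences': 1000.0},
    'sampleA_2': {'Total Sequences': 1000.0},
    'sampleB_1': {'Total Sequences': 250.0},
    'sampleB_2': {'Total Sequences': 250.0},
}


def paired_read_counts(fastqc):
    """Sum forward (_1) and reverse (_2) read counts for each sample."""
    counts = {}
    for key, stats in fastqc.items():
        # Split on the last underscore only, so underscores inside a sample name are kept
        sample, _, mate = key.rpartition('_')
        if mate in ('1', '2'):
            counts[sample] = counts.get(sample, 0) + int(stats['Total Sequences'])
    return counts


example_counts = paired_read_counts(fastqc_example)
print(example_counts)
```

Matching on the suffix rather than on any occurrence of `_1` avoids miscounting if a sample name happens to contain `_1` elsewhere.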

In [None]:
# Create a dictionary containing the raw number of reads for each sample

# Read the QC data
with open(fastqc_yaml, 'r') as f:
    fastqc_data = yaml.safe_load(f)

# Get the total number of raw reads per sample by summing the forward (_1) and reverse (_2) reads
sample_read_counts = {
    k[:-len('_1')]: int(fastqc_data[k]['Total Sequences'] + fastqc_data[k[:-len('_1')] + '_2']['Total Sequences'])
    for k in fastqc_data if k.endswith('_1')
}
sample_read_counts


In [None]:
with open(fastqc_yaml, 'r') as f:
    fastqc_data = yaml.safe_load(f)

samples = [sample.split('_')[0] for sample in list(fastqc_data.keys())]
samples = sorted(list(set(samples)))
print(samples)

n_reads = []
for sample in samples:
    n_reads.append(fastqc_data[f'{sample}_1']['Total Sequences'] + fastqc_data[f'{sample}_2']['Total Sequences'])

### Assemblies

[metaSPAdes](http://cab.spbu.ru/software/spades/) was used for the assembly. It outputs contigs (contiguous sequences assembled from overlapping reads) and scaffolds (contigs that have been joined together).
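
The loading code in this notebook relies on SPAdes-style sequence headers of the form `NODE_<id>_length_<length>_cov_<coverage>`. A minimal sketch of that parsing step follows; the helper `parse_spades_header` and the example header are hypothetical.

```python
def parse_spades_header(header):
    """Split a SPAdes-style header into node ID, length and coverage."""
    fields = header.split('_')
    # Expected layout: ['NODE', id, 'length', length, 'cov', coverage]
    return {
        'node_id': int(fields[1]),
        'length': int(fields[3]),
        'coverage': float(fields[5]),
    }


parsed = parse_spades_header('NODE_12_length_3456_cov_7.25')
print(parsed)
```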

The commented-out code below loads the contigs. We skip this step because the same information is loaded from the MAGs instead.


In [None]:
# # Decompress the FASTA file (if it hasn't already been decompressed)
# spades_fa = decompress_gz(spades_fa_gz)

# # Load the FASTA seqs and save them as a dict in the following format {'node_id': ['length', 'cov', 'seq']}
# seqs_dict = {int(record.id.split('_')[1]): [int(record.id.split('_')[3]), float(record.id.split('_')[5]), str(record.seq)] for record in SeqIO.parse(spades_fa, 'fasta')}

# # Create dataframe containing contigs info
# seqs_df = pd.DataFrame.from_dict(seqs_dict, orient='index', columns=['length', 'coverage', 'seq'])

# seqs_df.head()

### Load the bins/MAGs

- Two tools were used to perform metagenome binning to generate metagenome-assembled genomes (MAGs): [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/) and [MaxBin2](https://sourceforge.net/projects/maxbin2/)

- Two tools were used for quality control (QC) of the genome bins: [BUSCO](https://busco.ezlab.org/) and [QUAST](http://quast.sourceforge.net/quast)


Let's look at the summary information of the bins/MAGs


In [None]:
bin_df = pd.read_csv(bin_tsv, sep='\t')
bin_df.head()

The results from the tools show that MaxBin2 generated more genome bins and a higher % completeness than MetaBAT2


In [None]:
busco_df = pd.read_csv(busco_tsv, sep='\t')
busco_df.head()

In [None]:
quast_df = pd.read_csv(quast_tsv, sep='\t')
quast_df.head()

We will therefore use the MaxBin2 results for the rest of the analysis


In [None]:
# Get the bin files
bin_files = []

for dirpath, dirnames, filenames in os.walk(bin_dir):
    bin_files.extend([os.path.join(dirpath, file) for file in filenames if file.endswith('.gz')])

bin_files

In [None]:
bin_dfs = []

for bin_file_gz in bin_files:

    # Load the bin file and save as a dataframe
    bin_file = decompress_gz(bin_file_gz)
    bin_dict = {int(record.id.split('_')[1]): [int(record.id.split('_')[3]), float(record.id.split('_')[5]), str(record.seq)] for record in SeqIO.parse(bin_file, 'fasta')}
    bin_df = pd.DataFrame.from_dict(bin_dict, orient='index', columns=['length', 'coverage', 'seq'])

    # Add the bin ID to the dataframe
    bin_df['bin_id'] = bin_file.split('/')[-1].split('.')[1]

    # Add the dataframe to the list
    bin_dfs.append(bin_df)

# Concatenate all of the dataframes
bin_df = pd.concat(bin_dfs).sort_index()
bin_df = bin_df[~bin_df.index.duplicated(keep='first')]
bin_df.head()

## Taxonomic classification


### Virus classification

Let's load the virus classifications predicted using [geNomad](https://github.com/apcamargo/genomad) and combine this with our existing data for the contigs
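
Because both tables are indexed by the contig node ID, they can be joined on the index. The sketch below uses two tiny made-up tables (the values and the `virus_score` column are illustrative) to show that the default inner join keeps only contigs present in both tables, whereas `how='left'` keeps every binned contig.

```python
import pandas as pd

# Hypothetical binned contigs, indexed by node ID
bins = pd.DataFrame({'bin_id': ['001', '001', '002']}, index=[1, 2, 5])

# Hypothetical virus classifications for a subset of the contigs
virus = pd.DataFrame({'virus_score': [0.9, 0.1]}, index=[1, 5])

# Default (inner) join: contig 2 is dropped because it has no classification
inner = pd.merge(bins, virus, left_index=True, right_index=True)

# Left join: contig 2 is kept and its virus_score is NaN
left = pd.merge(bins, virus, left_index=True, right_index=True, how='left')

print(inner)
print(left)
```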


In [None]:
# Load the virus classification data
vir_df = pd.read_csv(genomad_tsv, sep='\t')
vir_df.index = vir_df.seq_name.str.split('_').str[1].astype(int)
vir_df = vir_df.drop('seq_name', axis=1)
vir_df.head()

In [None]:
# Merge the binning and taxonomy dataframes
df = pd.merge(bin_df, vir_df, left_index=True, right_index=True)
df.head()

### Taxonomic classification of binned genomes

Load the GTDB-Tk summary table (see [column descriptions](https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html)) and combine it with the existing information for the contigs
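
The `classification` column holds a semicolon-separated lineage with rank prefixes (`d__`, `p__`, ..., `s__`). If per-rank columns are useful later, it could be split along these lines; the two lineage strings and the `RANKS` list are illustrative and not taken from the results.

```python
import pandas as pd

RANKS = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']

# Illustrative lineage strings in the GTDB style (not real results)
example = pd.DataFrame({
    'bin_id': ['001', '002'],
    'classification': [
        'd__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;'
        'f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli',
        'd__Bacteria;p__Firmicutes;c__Bacilli;o__;f__;g__;s__',
    ],
})

# One column per rank, then strip the rank prefix (e.g. 'g__') from each value
ranks_df = example['classification'].str.split(';', expand=True)
ranks_df.columns = RANKS
ranks_df = ranks_df.apply(lambda col: col.str.replace(r'^[a-z]__', '', regex=True))

print(ranks_df)
```

Ranks that GTDB-Tk could not resolve appear as empty strings after the prefix is removed.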


In [None]:
# Load the GTDB-Tk classification data
gtdbtk_df = pd.read_csv(gtdbtk_tsv, sep='\t')

# Filter the GTDB-Tk dataframe
gtdbtk_df = gtdbtk_df[gtdbtk_df.user_genome.str.contains(binner)]
cols = ['user_genome', 'classification', 'classification_method', 'other_related_references(genome_id,species_name,radius,ANI,AF)', 'msa_percent', 'red_value', 'warnings']
# Copy the selection so that adding columns later does not trigger a SettingWithCopyWarning
gtdbtk_df = gtdbtk_df[cols].copy()

In [None]:
# Add the GTDB-Tk classification data to the main dataframe
gtdbtk_df['bin_id'] = gtdbtk_df.user_genome.str.split('.').str[1]
gtdbtk_df = gtdbtk_df.drop('user_genome', axis=1)

# Merge the GTDB-Tk dataframe with the main dataframe.
# Merging on a column resets the index, so keep the contig node IDs in a column during the merge
df = df.rename_axis('node_id').reset_index().merge(gtdbtk_df, on='bin_id', how='left').set_index('node_id')

df.head()

### Taxonomic classification of trimmed reads

The [kraken2](https://github.com/DerrickWood/kraken2/wiki/Manual) tool was used to classify trimmed reads using the [prebuilt 8GB minikraken DB](https://zenodo.org/record/4024003#.Y4-9PdLMK0o) provided by the Center for Computational Biology of the Johns Hopkins University (from 2020-03). (**Note:** Using a larger kraken database would likely improve the results by decreasing the number of unclassified reads.) The outputs were tab-separated files with the following columns:

1. `percentage` - Percentage of fragments covered by the clade rooted at this taxon
2. `num_fragments` - Number of fragments covered by the clade rooted at this taxon
3. `num_assigned` - Number of fragments assigned directly to this taxon
4. `rank_code` - A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
5. `tax_id` - NCBI taxonomic ID number
6. `scientific_name` - Indented scientific name
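
As a small illustration of this format, the sketch below parses a made-up four-line report (the counts are invented) and computes the share of fragments in the Bacteria clade. The unclassified (`U`) and root (`R`) rows together account for all processed fragments.

```python
import io

import pandas as pd

names = ['percentage', 'num_fragments', 'num_assigned', 'rank_code', 'tax_id', 'scientific_name']

# Made-up report lines in the kraken2 report layout (tab-separated, indented names)
report = (
    "40.00\t400\t400\tU\t0\tunclassified\n"
    "60.00\t600\t10\tR\t1\troot\n"
    "50.00\t500\t20\tD\t2\t    Bacteria\n"
    "5.00\t50\t5\tD\t10239\t  Viruses\n"
)

example_df = pd.read_csv(io.StringIO(report), sep='\t', header=None, names=names)

# The indentation encodes the tree depth; strip it before matching on names
example_df['scientific_name'] = example_df['scientific_name'].str.strip()

# Unclassified + root clade counts cover every processed fragment
total = example_df.loc[example_df['rank_code'].isin(['U', 'R']), 'num_fragments'].sum()

# Clade counts already include descendants, so a single exact-name row is enough
bacteria = example_df.loc[example_df['scientific_name'] == 'Bacteria', 'num_fragments'].sum()
bacterial_pct = bacteria / total * 100

print(total, bacterial_pct)
```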


In [None]:
kraken_dfs = []
names = ['percentage', 'num_fragments', 'num_assigned', 'rank_code', 'tax_id', 'scientific_name']

for sample in samples:
    kraken_file = f'{kraken_dir}/{sample}/kraken2_report.txt'
    kraken_df = pd.read_csv(kraken_file, sep='\t', header=None, names=names)
    kraken_df['sample_id'] = sample
    kraken_dfs.append(kraken_df)

kraken_df = pd.concat(kraken_dfs)

# Remove whitespace from the scientific name
kraken_df.scientific_name = kraken_df.scientific_name.str.strip()
kraken_df = kraken_df.reset_index()
kraken_df.head()

In [None]:
# Define vars
cols = ['total_raw_reads', 'total_processed_reads', 'classified_reads', 'unclassified_reads', 'bacterial_reads', 'viral_reads']
display_as_percent = True
vmax = None

# Generate the individual dataframes
unclass_df = kraken_df[kraken_df['rank_code'] == 'U'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'unclassified_reads'})
class_df = kraken_df[kraken_df['rank_code'] == 'R'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'classified_reads'})
# Match the clade names exactly: num_fragments already includes all descendants, so substring matches
# (e.g. 'unclassified Bacteria') would be counted twice
bact_df = kraken_df[kraken_df['scientific_name'] == 'Bacteria'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'bacterial_reads'})
viral_df = kraken_df[kraken_df['scientific_name'] == 'Viruses'].groupby('sample_id').agg({'num_fragments': 'sum'}).rename(columns={'num_fragments': 'viral_reads'})

# Combine the dataframes and then process
summary_df = pd.concat([unclass_df, class_df, bact_df, viral_df], axis=1)
summary_df['total_raw_reads'] = [sample_read_counts[sample] for sample in summary_df.index]
summary_df['total_processed_reads'] = summary_df['unclassified_reads'] + summary_df['classified_reads']
summary_df['total_raw_reads'] = summary_df['total_raw_reads'].astype(int)
summary_df['total_processed_reads'] = summary_df['total_processed_reads'].astype(int)
summary_df.loc['total'] = summary_df.sum()
summary_df = summary_df[cols]

# Calculate percentages
if display_as_percent:
    vmax = 100
    for col in cols[2:]:
        cols[cols.index(col)] = f'{col} (%)'
        summary_df = summary_df.rename(columns={col: f'{col} (%)'})
        col = f'{col} (%)'
        summary_df[col] = summary_df[col] / summary_df['total_processed_reads'] * 100

# Display the summary dataframe with bars
summary_df.style\
    .bar(subset=cols[:2], color='#d65f5f')\
    .bar(subset=cols[2:], color='#5fba7d', vmax=vmax)\
    .format('{:.2f}', subset=cols[2:])\
    .format('{:,.0f}', subset=cols[:2])

In [None]:
# # Display the top 50 most abundant taxa
# kraken_df = kraken_df[kraken_df.rank_code != 'U']
# kraken_df = kraken_df[kraken_df.rank_code != 'R']
# kraken_df = kraken_df.sort_values('num_fragments', ascending=False)
# kraken_df.head(50)

In [None]:
# Generate a Krona chart for all of the samples - this is a bit hacky
# !docker run -v $PWD:$PWD -w $PWD quay.io/biocontainers/krona:2.7.1--pl526_5 ktUpdateTaxonomy.sh taxonomy && !ktImportTaxonomy SRR14530762/kraken2_report.txt SRR14530763/kraken2_report.txt SRR14530764/kraken2_report.txt SRR14530765/kraken2_report.txt SRR14530766/kraken2_report.txt SRR14530767/kraken2_report.txt SRR14530769/kraken2_report.txt SRR14530770/kraken2_report.txt SRR14530771/kraken2_report.txt SRR14530772/kraken2_report.txt SRR14530880/kraken2_report.txt SRR14530881/kraken2_report.txt SRR14530882/kraken2_report.txt SRR14530884/kraken2_report.txt SRR14530885/kraken2_report.txt SRR14530886/kraken2_report.txt SRR14530887/kraken2_report.txt SRR14530888/kraken2_report.txt SRR14530889/kraken2_report.txt SRR14530890/kraken2_report.txt SRR14530891/kraken2_report.txt -tax taxonomy

## Get read counts for the MAGs

### Get the sequencing depths per contig

Load the sequencing depths per contig and sample. This file is roughly equivalent to a count matrix, i.e. a table of contig IDs (rows) by sample SRA IDs (columns) holding the number of reads mapping to each contig in each sample. However, instead of the number of reads, we have the sequencing depth, i.e. the number of aligned bases divided by the contig length.

This is calculated using MetaBAT2's `jgi_summarize_bam_contig_depths --outputDepth`. The values correspond to `(sum of exactly aligned bases) / ((contig length)-2*75)`. For example, for two reads aligned exactly with `10` and `9` bases on a 1000 bp long contig the depth is calculated by `(10+9)/(1000-2*75)` (1000bp length of contig minus 75bp from each end, which is excluded)
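
The arithmetic in this example can be written out as a short function. This is only a sketch of the formula quoted above, not MetaBAT2's implementation; the function name `contig_depth` and the `edge` parameter are hypothetical.

```python
def contig_depth(aligned_bases, contig_length, edge=75):
    """Depth = (sum of exactly aligned bases) / (contig length - 2 * edge)."""
    effective_length = contig_length - 2 * edge
    if effective_length <= 0:
        raise ValueError('contig is too short once both ends are excluded')
    return sum(aligned_bases) / effective_length


# Two reads with 10 and 9 exactly aligned bases on a 1000 bp contig: (10 + 9) / (1000 - 2 * 75)
depth = contig_depth([10, 9], 1000)
print(depth)
```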


In [None]:
# Load the contig depths
contig_depths = decompress_gz(contig_depths_gz)
contig_depths_df = pd.read_csv(contig_depths, sep='\t')

# Preprocess the contig depths dataframe
contig_depths_df.index = contig_depths_df.contigName.str.split('_').str[1].astype(int)
contig_depths_df = contig_depths_df.drop(['contigName', 'contigLen'], axis=1)

# Drop all columns containing 'var' (i.e. the columns containing the variance)
contig_depths_df = contig_depths_df.drop([col for col in contig_depths_df.columns if 'var' in col], axis=1)

# Rename the columns e.g. from `SPAdes-group-HTP-SRR14530771.bam` to `SRR14530771`
contig_depths_df.columns = [col.split('-')[-1].split('.')[0] for col in contig_depths_df.columns]


In [None]:
# Merge the contig depths with the main dataframe
df = pd.merge(df, contig_depths_df, left_index=True, right_index=True, how='left')
df