# MAG Analysis

Analysing assembled and binned metagenomes generated with the [nf-core/mag](https://github.com/PhilPalmer/mag/tree/genomad) pipeline from the [Rothman dataset](https://doi.org/10.1128/AEM.01448-21), using samples from the [HTP site](https://en.wikipedia.org/wiki/Hyperion_sewage_treatment_plant).


In [None]:
import os
import pandas as pd

from Bio import SeqIO

In [None]:
########
# Change
########
# TODO: Upload directory to AWS S3 and add code to download the data if needed
data_dir = '../mag/results_rothman_htp'

##############
# Don't change
##############

# Assembly data
# spades_dir = f'{data_dir}/Assembly/SPAdes'
# spades_fa_gz = f'{spades_dir}/SPAdes-group-HTP_scaffolds.fasta.gz'

# Genome Binning data
binner = 'MaxBin2'
busco_tsv = f'{data_dir}/GenomeBinning/QC/busco_summary.tsv'
quast_tsv = f'{data_dir}/GenomeBinning/QC/quast_summary.tsv'
bin_tsv = f'{data_dir}/GenomeBinning/bin_summary.tsv'
bin_dir = f'{data_dir}/GenomeBinning/{binner}'

# Taxonomy data
genomad_dir = f'{data_dir}/Taxonomy/geNomad'
genomad_tsv = f'{genomad_dir}/group-HTP/SPAdes-group-HTP_scaffolds_aggregated_classification/SPAdes-group-HTP_scaffolds_aggregated_classification.tsv'
gtdbtk_tsv = f'{data_dir}/Taxonomy/GTDB-Tk/gtdbtk_summary.tsv'


In [None]:
# Define helper functions
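def decompress_gz(gz_file):
    """
    Decompress a gzipped file (keeping the original) and return the decompressed path.
    """
    # Strip only a trailing '.gz' so a '.gz' elsewhere in the path is left untouched
    file = gz_file[:-3] if gz_file.endswith('.gz') else gz_file
    if os.path.exists(gz_file) and not os.path.exists(file):
        !gzip -dk {gz_file}
    return file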
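
The `!gzip` line in `decompress_gz` is an IPython shell escape, so the helper only runs inside a notebook and needs `gzip` on the path. A minimal pure-Python sketch of the same behaviour is shown below; the name `decompress_gz_py` is hypothetical and not part of the original notebook.

```python
import gzip
import os
import shutil


def decompress_gz_py(gz_file):
    """Decompress a .gz file next to the original and return the new path."""
    # Paths that do not end in '.gz' are returned unchanged
    if not gz_file.endswith('.gz'):
        return gz_file
    out_file = gz_file[:-len('.gz')]
    # Skip the work if the decompressed copy already exists
    if os.path.exists(gz_file) and not os.path.exists(out_file):
        with gzip.open(gz_file, 'rb') as src, open(out_file, 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return out_file
```

Like `gzip -dk`, this keeps the compressed file in place, so it could be swapped in for the shell call if the notebook is ever converted to a script.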

## Load the data

Load each contig with any relevant information, e.g. its bin/MAG and coverage.

Link the information at the sample, assembly, MAG and taxonomic levels.


### Assemblies

[metaSPAdes](http://cab.spbu.ru/software/spades/) was used for the assembly. It outputs contigs, i.e. contiguous sequences assembled from overlapping reads, as well as scaffolds, i.e. contigs that have been ordered and joined together.

The commented-out code below loads the scaffolds. We skip this step because the same information is loaded from the MAGs instead.


In [None]:
# # Decompress the FASTA file (if it hasn't already been decompressed)
# spades_fa = decompress_gz(spades_fa_gz)

# # Load the FASTA seqs and save them as a dict in the following format {'node_id': ['length', 'cov', 'seq']}
# seqs_dict = {int(record.id.split('_')[1]): [int(record.id.split('_')[3]), float(record.id.split('_')[5]), str(record.seq)] for record in SeqIO.parse(spades_fa, 'fasta')}

# # Create dataframe containing contigs info
# seqs_df = pd.DataFrame.from_dict(seqs_dict, orient='index', columns=['length', 'coverage', 'seq'])

# seqs_df.head()
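
The header parsing used above relies on SPAdes-style sequence IDs of the form `NODE_<id>_length_<length>_cov_<coverage>`. As a rough sketch, the repeated `split('_')` calls could be wrapped in a small helper that also fails loudly on unexpected headers. The function name `parse_spades_header` and the example header are hypothetical.

```python
def parse_spades_header(header):
    """Return (node_id, length, coverage) from a SPAdes-style sequence ID."""
    fields = header.split('_')
    # Expected layout: NODE_<id>_length_<length>_cov_<coverage>
    if len(fields) < 6 or fields[0] != 'NODE' or fields[2] != 'length' or fields[4] != 'cov':
        raise ValueError(f'Unexpected SPAdes header: {header}')
    return int(fields[1]), int(fields[3]), float(fields[5])


# Made-up header for illustration only
node_id, length, coverage = parse_spades_header('NODE_1_length_1000_cov_12.5')
print(node_id, length, coverage)
```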

### Load the bins/MAGs

- Two tools were used for metagenome binning to generate metagenome-assembled genomes (MAGs): [MetaBAT2](https://bitbucket.org/berkeleylab/metabat/src/master/) and [MaxBin2](https://sourceforge.net/projects/maxbin2/)

- Two tools were used for quality control (QC) of the genome bins: [BUSCO](https://busco.ezlab.org/) and [QUAST](http://quast.sourceforge.net/quast)


Let's look at the summary information for the bins/MAGs.


In [None]:
bin_df = pd.read_csv(bin_tsv, sep='\t')
bin_df.head()

In [None]:
# Get a list of all of the sample IDs
samples = [col.split(' ')[1] for col in bin_df.columns if 'Depth ' in col]
print(samples)

The QC results show that MaxBin2 generated more genome bins, with a higher % completeness, than MetaBAT2.


In [None]:
busco_df = pd.read_csv(busco_tsv, sep='\t')
busco_df.head()

In [None]:
quast_df = pd.read_csv(quast_tsv, sep='\t')
quast_df.head()

We will therefore use the MaxBin2 results for the rest of the analysis
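
The comparison behind this choice can be sketched roughly as a per-binner summary of bin count and mean completeness. The example below uses a toy table, not the real `busco_df`: the column names `bin` and `pct_complete`, the bin file names and the values are all assumptions made for illustration, so they would need to be mapped onto the actual BUSCO summary columns.

```python
import pandas as pd

# Toy stand-in for the BUSCO summary (hypothetical column names and values)
toy_busco = pd.DataFrame({
    'bin': ['MaxBin2.001.fa', 'MaxBin2.002.fa', 'MetaBAT2.1.fa'],
    'pct_complete': [80.0, 60.0, 50.0],
})

# Pull the binner name out of each bin file name
binners = ['MaxBin2', 'MetaBAT2']
pattern = '(' + '|'.join(binners) + ')'
toy_busco['binner'] = toy_busco['bin'].str.extract(pattern, expand=False)

# Number of bins and mean completeness per binner
summary = toy_busco.groupby('binner')['pct_complete'].agg(
    n_bins='count', mean_pct_complete='mean'
)
print(summary)
```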


In [None]:
# Get the bin files
bin_files = []

for dirpath, dirnames, filenames in os.walk(bin_dir):
    bin_files.extend([os.path.join(dirpath, file) for file in filenames])

bin_files

In [None]:
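bin_dfs = []

for bin_file_gz in bin_files:

    # Decompress the bin file (if needed) and parse the SPAdes-style headers,
    # e.g. NODE_1_length_1000_cov_12.3 -> node ID 1, length 1000, coverage 12.3
    bin_file = decompress_gz(bin_file_gz)
    contig_dict = {}
    for record in SeqIO.parse(bin_file, 'fasta'):
        fields = record.id.split('_')
        contig_dict[int(fields[1])] = [int(fields[3]), float(fields[5]), str(record.seq)]

    # Use a separate name inside the loop so the result is not confused with the final bin_df
    contig_df = pd.DataFrame.from_dict(contig_dict, orient='index', columns=['length', 'coverage', 'seq'])

    # Add the bin ID to the dataframe
    contig_df['bin_id'] = os.path.basename(bin_file).split('.')[1]

    # Add the dataframe to the list
    bin_dfs.append(contig_df)

# Concatenate all of the dataframes and keep the first entry for each contig
# (a contig can be loaded twice if both the .gz and the decompressed file are present)
bin_df = pd.concat(bin_dfs).sort_index()
bin_df = bin_df[~bin_df.index.duplicated(keep='first')]
bin_df.head()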

## Taxonomic classification


### Virus classification

Let's load the virus classifications predicted using [geNomad](https://github.com/apcamargo/genomad) and combine this with our existing data for the contigs


In [None]:
# Load the virus classification data
vir_df = pd.read_csv(genomad_tsv, sep='\t')
vir_df.index = vir_df.seq_name.str.split('_').str[1].astype(int)
vir_df = vir_df.drop('seq_name', axis=1)
vir_df.head()

In [None]:
# Merge the binning and taxonomy dataframes
df = pd.merge(bin_df, vir_df, left_index=True, right_index=True)
df.head()
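
Note that `pd.merge` defaults to an inner join, so any contig missing from either table is dropped. A minimal sketch of how to check this on toy data is shown below, using `how='left'` with `indicator=True`; the toy frames and the `virus_score` column are placeholders rather than the real geNomad output.

```python
import pandas as pd

# Toy stand-ins for bin_df and vir_df, indexed by contig node ID
toy_bins = pd.DataFrame({'bin_id': ['001', '001', '002']}, index=[1, 2, 3])
toy_vir = pd.DataFrame({'virus_score': [0.9, 0.1]}, index=[1, 3])

# Inner join (the default): contig 2 has no classification and is dropped
inner = pd.merge(toy_bins, toy_vir, left_index=True, right_index=True)

# Left join with an indicator column: all binned contigs are kept
left = pd.merge(toy_bins, toy_vir, left_index=True, right_index=True,
                how='left', indicator=True)

print(len(inner), len(left))
print(left['_merge'].value_counts())
```

Comparing the row counts of the two joins is a quick way to see how many binned contigs lack a classification before deciding which join to keep.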

### Taxonomic classification of binned genomes
Load the GTDB-Tk summary table (see [column descriptions](https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html)) and combine it with our existing information for the contigs.


In [None]:
# Load the GTDB-Tk classification data
gtdbtk_df = pd.read_csv(gtdbtk_tsv, sep='\t')

# Filter the GTDB-Tk dataframe
gtdbtk_df = gtdbtk_df[gtdbtk_df.user_genome.str.contains(binner)]
cols = ['user_genome', 'classification', 'classification_method', 'other_related_references(genome_id,species_name,radius,ANI,AF)', 'msa_percent', 'red_value', 'warnings']
gtdbtk_df = gtdbtk_df[cols].copy()  # copy to avoid pandas' SettingWithCopyWarning when columns are added later

In [None]:
# Add the GTDB-Tk classification data to the main dataframe
gtdbtk_df['bin_id'] = gtdbtk_df.user_genome.str.split('.').str[1]
gtdbtk_df = gtdbtk_df.drop('user_genome', axis=1)
df = pd.merge(df, gtdbtk_df, on='bin_id', how='left')

In [None]:
df.head()
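
The GTDB-Tk `classification` column holds a semicolon-separated lineage with rank prefixes (`d__`, `p__`, ..., `s__`). As a sketch, it can be split into one column per rank for easier grouping. The example runs on a toy frame with an illustrative lineage string; the `RANKS` names are a choice made here, and rows that are unclassified or missing would need separate handling on real data.

```python
import pandas as pd

RANKS = ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']

# Toy stand-in for df: one full lineage and one bin without a classification
toy = pd.DataFrame({'classification': [
    'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;'
    'f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli',
    None,
]})

# One column per rank; missing classifications stay missing
ranks_df = toy['classification'].str.split(';', expand=True)
ranks_df.columns = RANKS[:ranks_df.shape[1]]

# Remove the rank prefixes such as 'd__' and 'g__'
ranks_df = ranks_df.apply(lambda col: col.str.replace(r'^[a-z]__', '', regex=True))

toy = toy.join(ranks_df)
print(toy[['genus', 'species']])
```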

### Taxonomic classification of trimmed reads

TODO


## Get read counts for the MAGs

TODO
