# dRep output processing

An example of running dRep, which dereplicates a set of genomes (such as MAGs), and of reformatting its output files to map each original genome to its representative genome.

By: Molly Chen

Last updated: June 12 2023

---

## Running dRep:

Documentation for running dRep can be found in the wiki: https://github.com/MrOlm/drep

Genome dereplication uses the following command (the output directory does not need to exist beforehand):

```
dRep dereplicate output_directory -g path/to/genomes/*.fasta
```
**NOTE**: dRep runs CheckM internally and, by default, filters out genomes that fall below **75% completeness** or exceed **25% contamination** (adjustable with `-comp` and `-con`). To skip this step (e.g., because CheckM was already run with different thresholds), the quality information has to be provided separately: https://drep.readthedocs.io/en/latest/advanced_use.html
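When quality estimates already exist, dRep accepts them as a CSV passed through `--genomeInfo`, expected to contain `genome`, `completeness`, and `contamination` columns. The following is a minimal sketch of building such a table with pandas. The genome names and values are made-up placeholders, `build_genome_info` is a hypothetical helper, and the exact column requirements should be checked against the documentation linked above for the dRep version in use.

```python
import pandas as pd

# Hypothetical quality estimates (placeholders, not real results).
# 'genome' is assumed to be the FASTA file name, including its extension.
quality = [
    {"genome": "bin_001.fasta", "completeness": 96.4, "contamination": 1.2},
    {"genome": "bin_002.fasta", "completeness": 81.0, "contamination": 4.7},
    {"genome": "bin_003.fasta", "completeness": 62.5, "contamination": 0.9},
]

def build_genome_info(records):
    """Return a table with the three columns assumed for --genomeInfo."""
    columns = ["genome", "completeness", "contamination"]
    info = pd.DataFrame(records, columns=columns)
    # Refuse to write a table with missing entries
    if info.isna().any().any():
        raise ValueError("every genome needs a completeness and a contamination value")
    return info

genome_info = build_genome_info(quality)
genome_info.to_csv("genome_info.csv", index=False)
```

The resulting file could then be supplied as `--genomeInfo genome_info.csv` on the `dRep dereplicate` command line.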

---

## Post-processing:

The dRep output files that record which genomes were grouped together and dereplicated are hard to read directly, so the following script reformats them to extract the relevant information. It uses two files found under data_tables/ in the dRep output folder: Cdb.csv provides the cluster assignment of each genome, and Wdb.csv lists the winning genome of each cluster.

'dRep_genome_map.csv' lists every genome in the first column and the 'winning genome' it maps to in the second column. If a genome is part of the dereplicated set, the two values are identical; otherwise, the second column shows which genome in the dereplicated set it maps to.
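As an illustration of how this map might be used downstream, the sketch below builds a small in-memory stand-in for `dRep_genome_map.csv` (the genome names are made up) and turns it into a dictionary for looking up the representative of any genome.

```python
import io
import pandas as pd

# Toy stand-in for dRep_genome_map.csv; genome names are hypothetical
toy_map_csv = io.StringIO(
    "genome,winning_genome\n"
    "bin_001,bin_001\n"
    "bin_002,bin_001\n"
    "bin_003,bin_003\n"
)
genome_map_df = pd.read_csv(toy_map_csv)

# Dictionary from every genome to the representative it maps to
representative_of = dict(zip(genome_map_df["genome"], genome_map_df["winning_genome"]))

# Genomes in the dereplicated set map to themselves
dereplicated = sorted(g for g, w in representative_of.items() if g == w)

print(representative_of["bin_002"])  # bin_001
print(dereplicated)                  # ['bin_001', 'bin_003']
```

With the real file, `io.StringIO(...)` would be replaced by the path to `dRep_genome_map.csv`.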

'dRep_summarize_clusters.csv' lists all the genomes grouped together in each cluster (i.e., redundant genomes) and identifies which one was the 'winning' genome.
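To show how the summary table can be read back, here is a sketch that parses a toy stand-in for `dRep_summarize_clusters.csv` and counts the genomes in each cluster. The cluster IDs and genome names are placeholders, and the comma-plus-space separator is assumed to match the one used when the file was written.

```python
import io
import pandas as pd

# Toy stand-in for dRep_summarize_clusters.csv; all values are hypothetical
toy_summary_csv = io.StringIO(
    'secondary_cluster,winning_genome,all_genomes\n'
    '1_1,bin_001,"bin_001, bin_002"\n'
    '2_1,bin_003,bin_003\n'
)
# Read cluster IDs as text rather than numbers
summary_df = pd.read_csv(toy_summary_csv, dtype={"secondary_cluster": str})

# Split the comma-separated member list back into Python lists
summary_df["members"] = summary_df["all_genomes"].str.split(", ")
summary_df["n_genomes"] = summary_df["members"].str.len()

# Clusters with more than one member contain redundant genomes
redundant = summary_df[summary_df["n_genomes"] > 1]
print(redundant[["secondary_cluster", "winning_genome", "n_genomes"]])
```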

```python
import pandas as pd

# Import the dRep tables, reading cluster IDs as text
cdb_df = pd.read_csv('Cdb.csv', dtype={'secondary_cluster': str})
wdb_df = pd.read_csv('Wdb.csv', dtype={'cluster': str}).drop('score', axis=1)

# Strip the file extension (e.g. '.fa' or '.fasta') from the genome IDs
cdb_df['genome'] = cdb_df['genome'].str.rsplit('.', n=1).str[0]
wdb_df['winning_genome'] = wdb_df['genome'].str.rsplit('.', n=1).str[0]

# Keep the cluster ID as text: converting it to a float would merge
# distinct clusters such as '1_1' and '1_10' (both become 1.1)
cdb_df = cdb_df[['secondary_cluster', 'genome']]
wdb_df = wdb_df.rename(columns={'cluster': 'secondary_cluster'})
wdb_df = wdb_df[['secondary_cluster', 'winning_genome']]

# Create and save a map of each individual genome to its 'winning genome'
genome_map_df = cdb_df.join(wdb_df.set_index('secondary_cluster'), on='secondary_cluster')
genome_map_df = genome_map_df[['genome', 'winning_genome']]
genome_map_df.to_csv('dRep_genome_map.csv', index=False)

# Group genomes by secondary cluster and concatenate their IDs
clusters_df = cdb_df.groupby('secondary_cluster')['genome'].agg(', '.join).reset_index()
clusters_df = clusters_df.rename(columns={'genome': 'all_genomes'})

# Create and save a summary of winning genomes plus a column of all genomes in each cluster
summary_df = wdb_df.join(clusters_df.set_index('secondary_cluster'), on='secondary_cluster')
summary_df = summary_df[['secondary_cluster', 'winning_genome', 'all_genomes']]
summary_df.to_csv('dRep_summarize_clusters.csv', index=False)
```
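A few consistency checks on the genome map can catch problems such as genome IDs that no longer match between the two tables. The following is a sketch and not part of dRep: `check_genome_map` is a hypothetical helper, and the toy data frame stands in for `genome_map_df` from the script above.

```python
import pandas as pd

def check_genome_map(genome_map_df):
    """Return a list of problems found in a genome -> winning_genome map."""
    problems = []
    if genome_map_df["winning_genome"].isna().any():
        problems.append("some genomes have no winning genome")
    if genome_map_df["genome"].duplicated().any():
        problems.append("some genomes are listed more than once")
    # Every winning genome should itself appear in the 'genome' column
    winners = set(genome_map_df["winning_genome"].dropna())
    missing = winners - set(genome_map_df["genome"])
    if missing:
        problems.append(f"winning genomes absent from the genome column: {sorted(missing)}")
    return problems

# Toy map with one deliberately unmatched genome (hypothetical names)
toy_map = pd.DataFrame({
    "genome": ["bin_001", "bin_002", "bin_003"],
    "winning_genome": ["bin_001", "bin_001", None],
})
print(check_genome_map(toy_map))
```

An empty list means none of these particular problems were found; it does not validate the clustering itself.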
