__author__ = Melany Calderón-Osorno

__version__ = 0.1

__date__ = 2025-07-09

__credits__ = Franck Lejzerowicz

# **Abricate Postprocessing**

This notebook offers a step-by-step guide for post-processing the results produced by the Abricate and pysam pipeline.

# **Set up the notebook environment**

First, we clone the repository containing the data generated by the Abricate and pysam pipeline.

In [1]:
!git clone https://github.com/mecalderon/Tutorial_Summer_Retreat.git

Cloning into 'Tutorial_Summer_Retreat'...
remote: Enumerating objects: 2393, done.
remote: Counting objects: 100% (66/66), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 2393 (delta 17), reused 56 (delta 9), pack-reused 2327 (from 1)
Receiving objects: 100% (2393/2393), 26.08 MiB | 11.50 MiB/s, done.
Resolving deltas: 100% (913/913), done.
Updating files: 100% (2386/2386), done.


The following code imports the libraries used for data post-processing.

In [2]:
import os
import csv
import glob
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

**Input/Output paths**


**Inputs**

The following code navigates to the Tutorial_Summer_Retreat/data directory, where the Abricate and pysam output is stored.

In [3]:
cd Tutorial_Summer_Retreat/data

/content/Tutorial_Summer_Retreat/data


The variable **abricate_dir** stores the path to the Abricate data directory, and **pysam_dir** stores the path to the pysam data directory.

In [4]:
abricate_dir = 'Abricate'
pysam_dir = 'Pysam_spades'

**Outputs**

The variable **output_dir** stores the path to the output directory, Abricate_processing, which is created if it does not already exist.

In [5]:
output_dir = 'Abricate_processing'
os.makedirs(output_dir, exist_ok=True)

**Collect the Abricate output files**

The following code collects the output files generated by the Abricate tool.

In [6]:
fps = glob.glob('%s/*/*txt' % abricate_dir)
fps[:3] + fps[-3:]

['Abricate/SRR3967690/SRR3967690_ncbi.txt',
 'Abricate/SRR3967690/SRR3967690_megares.txt',
 'Abricate/SRR3967690/SRR3967690_resfinder.txt',
 'Abricate/ERR599131/ERR599131_ncbi.txt',
 'Abricate/ERR599131/ERR599131_argannot.txt',
 'Abricate/ERR599131/ERR599131_card.txt']
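
The paths listed above appear to follow an `<abricate_dir>/<sample>/<sample>_<database>.txt` layout. As a minimal sketch, assuming that layout holds for every file, the sample and database names could be recovered from a path as follows. The helper name `parse_abricate_path` is hypothetical and not part of the original notebook.

```python
import os

def parse_abricate_path(file_path):
    """Split '<dir>/<sample>/<sample>_<database>.txt' into (sample, database)."""
    stem = os.path.splitext(os.path.basename(file_path))[0]
    # Assumption: the sample accession contains no underscore, so the
    # first underscore separates the sample from the database name.
    sample, database = stem.split('_', 1)
    return sample, database

example_paths = [
    'Abricate/SRR3967690/SRR3967690_ncbi.txt',
    'Abricate/ERR599131/ERR599131_card.txt',
]
parsed = [parse_abricate_path(p) for p in example_paths]
print(parsed)
```

Splitting on the first underscore only keeps database names that themselves contain an underscore in one piece.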

**Collect the pysam output files**

The following code collects the output files generated by pysam.

In [7]:
fpps = glob.glob('%s/*/reads.txt' % pysam_dir)
fpps[:3] + fpps[-3:]

['Pysam_spades/SRR3967690/reads.txt',
 'Pysam_spades/SRR3963804/reads.txt',
 'Pysam_spades/SRR3966130/reads.txt',
 'Pysam_spades/SRR3963982/reads.txt',
 'Pysam_spades/ERR599067/reads.txt',
 'Pysam_spades/ERR599131/reads.txt']
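
Because both tools write one directory per sample, it may be worth checking that the two file lists cover the same samples before merging. The sketch below uses made-up paths (the sample names `S1`, `S2`, and `S3` are placeholders) and assumes the sample name is always the parent directory of the file.

```python
# Made-up paths standing in for the fps and fpps lists.
abricate_paths = [
    'Abricate/S1/S1_ncbi.txt',
    'Abricate/S1/S1_card.txt',
    'Abricate/S2/S2_ncbi.txt',
]
pysam_paths = [
    'Pysam_spades/S1/reads.txt',
    'Pysam_spades/S3/reads.txt',
]

# The sample name is the directory that contains the file.
abricate_samples = {p.split('/')[-2] for p in abricate_paths}
pysam_samples = {p.split('/')[-2] for p in pysam_paths}

# Samples present in one list but not in the other.
missing_in_pysam = sorted(abricate_samples - pysam_samples)
missing_in_abricate = sorted(pysam_samples - abricate_samples)
print(missing_in_pysam, missing_in_abricate)
```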

**Merge the Abricate results**

The following code reads multiple tab-separated files listed in fps, extracts the sample name from each filename, adds it as a new column, selects relevant columns (Sample, SEQUENCE, and GENE), and combines all the data into a single DataFrame called abricate_combined_data.

In [8]:
frames = []

for file_path in fps:
    # File names look like '<sample>_<database>.txt'; keep the sample part.
    sample_name = file_path.split('/')[-1].split('_')[0]
    data = pd.read_csv(file_path, sep='\t')
    data['Sample'] = sample_name
    frames.append(data)

# Concatenate once, then keep only the columns used downstream.
abricate_combined_data = pd.concat(frames)
abricate_combined_data = abricate_combined_data[['Sample','SEQUENCE', 'GENE']]
abricate_combined_data.head()

,Sample,SEQUENCE,GENE
0,ERR599115,NODE_29865_length_3044_cov_0.742954,(Bla)blaTEM-116
0,ERR599115,NODE_29865_length_3044_cov_0.742954,blaTEM-116_1
0,ERR599115,NODE_29865_length_3044_cov_0.742954,blaTEM-116
0,ERR599115,NODE_29865_length_3044_cov_0.742954,TEM
0,ERR599115,NODE_29865_length_3044_cov_0.742954,TEM-116
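
The preview above lists the same contig under five gene names, which suggests that each database reports the hit under its own label. A rough sketch of how the database of origin could be kept alongside each row is shown below. The in-memory tables are made up, and the `Database` column is a hypothetical addition that the original notebook does not create.

```python
import io
import pandas as pd

# Made-up stand-ins for two Abricate reports of the same sample;
# the real files have more columns than the two used here.
fake_reports = {
    'Abricate/S1/S1_ncbi.txt': 'SEQUENCE\tGENE\nNODE_1\tblaTEM-116\n',
    'Abricate/S1/S1_card.txt': 'SEQUENCE\tGENE\nNODE_1\tTEM-116\n',
}

frames = []
for file_path, text in fake_reports.items():
    stem = file_path.split('/')[-1].rsplit('.', 1)[0]
    sample, database = stem.split('_', 1)
    data = pd.read_csv(io.StringIO(text), sep='\t')
    data['Sample'] = sample
    data['Database'] = database  # hypothetical extra column
    frames.append(data)

# Concatenate once and give the combined table a fresh 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)
combined = combined[['Sample', 'Database', 'SEQUENCE', 'GENE']]
print(combined)
```

With a column like this, redundant names for one hit can be traced back to the database that produced them.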


**Merge the pysam results**

The following code reads the tab-separated files listed in fpps and combines them into a single DataFrame. Each file already contains a sample column, so the sample name does not need to be parsed from the directory name. The columns sample, contig, and reads are then selected, and the result is stored in pysam_combined_data.

In [9]:
frames = []

for file_path in fpps:
    data = pd.read_csv(file_path, sep='\t')
    frames.append(data)

# Concatenate once instead of growing the DataFrame inside the loop.
pysam_combined_data = pd.concat(frames)

pysam_combined_data = pysam_combined_data[['sample','contig', 'reads']]
pysam_combined_data.head()

,sample,contig,reads
0,SRR3967690,NODE_65_length_18699_cov_3.347204,1322
1,SRR3967690,NODE_77_length_17031_cov_1.972301,700
2,SRR3967690,NODE_115_length_14612_cov_2.008682,612
3,SRR3967690,NODE_250_length_10514_cov_2.522612,548
4,SRR3967690,NODE_411_length_8794_cov_2.774123,501


**Write outputs**

The following code defines output file paths and saves the abricate_combined_data DataFrame as both a tab-separated .tsv file and an Excel .xlsx file in the specified output_dir.

In [10]:
file_txt = os.path.join(output_dir, 'abricate_combined_results.tsv')
file_xlsx = os.path.join(output_dir, 'abricate_combined_results.xlsx')


abricate_combined_data.to_csv(file_txt, sep='\t', index=False)
abricate_combined_data.to_excel(file_xlsx, index=False)

The code defines the output file paths and saves the pysam_combined_data DataFrame as both a tab-separated .tsv file and an Excel .xlsx file in the specified output_dir.

In [11]:
file_txt = os.path.join(output_dir, 'pysam_combined_results.tsv')
file_xlsx = os.path.join(output_dir, 'pysam_combined_results.xlsx')


pysam_combined_data.to_csv(file_txt, sep='\t', index=False)
pysam_combined_data.to_excel(file_xlsx, index=False)

**Combine the outputs from Abricate and pysam**

The code defines the file paths for the Abricate and pysam result files, then reads both tab-separated .tsv files into the DataFrames abricate_data and pysam_data.

In [12]:
abricate_path = os.path.join(output_dir, 'abricate_combined_results.tsv')
abricate_data = pd.read_csv(abricate_path, sep='\t')
pysam_path = os.path.join(output_dir, 'pysam_combined_results.tsv')
pysam_data = pd.read_csv(pysam_path, sep='\t')

The following code merges the Abricate and pysam tables, matching rows on the contig name only (SEQUENCE in the Abricate table, contig in the pysam table). An outer merge is used to find the contigs that carry no Abricate hit, and their read counts are summed per sample. Each sum is appended to the inner-merge result as a row whose GENE is '-', and only the GENE, sample, and reads columns are kept in the final output.

In [13]:
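# Outer merge: keeps pysam contigs without an Abricate hit (GENE is NaN).
merged_df = pd.merge(abricate_data, pysam_data, left_on='SEQUENCE', right_on='contig', how='outer')
# Inner merge: keeps only contigs that carry an Abricate hit.
final_df = pd.merge(abricate_data, pysam_data, left_on='SEQUENCE', right_on='contig')

# Total reads per sample on contigs without any gene hit.
reads_sum_per_sample = merged_df[merged_df['GENE'].isna()].groupby('sample')['reads'].sum()

# One placeholder row per sample, with '-' standing for "no gene".
new_rows = []
for sample, reads_sum in reads_sum_per_sample.items():
    new_row = {'Sample': sample, 'sample': sample, 'contig': '-', 'SEQUENCE': '-', 'GENE': '-', 'reads': reads_sum}
    new_rows.append(new_row)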

new_rows_df = pd.DataFrame(new_rows)
merged_df = pd.concat([final_df, new_rows_df], ignore_index=True)

merged_df = merged_df[['GENE', 'sample', 'reads']]
merged_df.head()

,GENE,sample,reads
0,(Bla)blaTEM-116,ERR599115,1828
1,blaTEM-116_1,ERR599115,1828
2,blaTEM-116,ERR599115,1828
3,TEM,ERR599115,1828
4,TEM-116,ERR599115,1828
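
SPAdes-style contig names (NODE_<n>_length_<l>_cov_<c>) are generated separately for each assembly, so the same name could in principle occur in two samples. The sketch below uses made-up toy tables to show how a merge on the contig name alone can pair a hit with a contig from another sample, and how a merge on both the sample and the contig avoids this. It is an illustration under those assumptions, not the notebook's own method. The `indicator=True` option adds a `_merge` column that marks where each row came from.

```python
import pandas as pd

# Toy tables: two samples that happen to share contig names.
abricate_toy = pd.DataFrame({
    'Sample': ['A', 'B'],
    'SEQUENCE': ['NODE_1', 'NODE_2'],
    'GENE': ['geneX', 'geneY'],
})
pysam_toy = pd.DataFrame({
    'sample': ['A', 'A', 'B', 'B'],
    'contig': ['NODE_1', 'NODE_2', 'NODE_1', 'NODE_2'],
    'reads': [10, 20, 30, 40],
})

# Contig name only: NODE_1 of sample A also matches NODE_1 of sample B.
by_contig = pd.merge(abricate_toy, pysam_toy,
                     left_on='SEQUENCE', right_on='contig')

# Sample and contig together: each hit matches exactly one pysam row.
by_both = pd.merge(abricate_toy, pysam_toy,
                   left_on=['Sample', 'SEQUENCE'],
                   right_on=['sample', 'contig'],
                   how='outer', indicator=True)

# Rows found only in the pysam table are the contigs without a gene hit.
unannotated = by_both[by_both['_merge'] == 'right_only']
unannotated_reads = unannotated.groupby('sample')['reads'].sum()
print(len(by_contig), (by_both['_merge'] == 'both').sum())
print(unannotated_reads)
```

In this toy case the contig-only merge returns four gene rows for two actual hits, while the two-key merge returns two.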


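The following code pivots merged_df into a table with genes as rows and samples as columns, using the read counts as values and filling missing gene/sample combinations with 0. Note that pivot_table aggregates with the mean by default, so if a gene occurs on several contigs of the same sample, the table holds the mean of their read counts rather than the sum. The result is stored in pivot_df.

In [14]:
pivot_df = merged_df.pivot_table(index='GENE', columns='sample', values='reads', fill_value=0)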
pivot_df.head()

GENE,ERR598944,ERR598947,ERR598958,ERR598960,ERR598964,ERR598971,ERR598980,ERR598985,ERR598999,ERR599000,...,SRR3965758,SRR3965873,SRR3965874,SRR3966130,SRR3967319,SRR3967690,SRR3967700,SRR3968061,SRR3968062,SRR3968777
(AGly)aadA27,0.0,547.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
(AGly)aadC,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
(Bla)blaTEM-116,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,131.0,0.0,0.0,0.0,0.0,0.0
-,779469.0,55463.0,357411.0,783852.0,635530.0,523138.0,71632.0,755728.0,137161.0,776703.0,...,20527.0,19722.0,4761.0,29695.0,9698.0,8205.0,5532.0,1044.0,78214.0,5031.0
ANT3-DPRIME,0.0,547.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
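
The choice of aggregation function matters whenever a gene/sample pair occurs more than once. The toy example below (made-up genes and counts) contrasts the default mean with an explicit `aggfunc='sum'`.

```python
import pandas as pd

# geneX appears twice in sample A, for example on two different contigs.
toy = pd.DataFrame({
    'GENE': ['geneX', 'geneX', 'geneY'],
    'sample': ['A', 'A', 'B'],
    'reads': [10, 30, 5],
})

# Default aggregation (mean): the two geneX rows of sample A are averaged.
mean_table = toy.pivot_table(index='GENE', columns='sample',
                             values='reads', fill_value=0)

# Explicit sum: the two geneX rows of sample A are added together.
sum_table = toy.pivot_table(index='GENE', columns='sample',
                            values='reads', aggfunc='sum', fill_value=0)

print(mean_table)
print(sum_table)
```

Which of the two is appropriate depends on whether repeated rows represent independent contigs (sum) or repeated reports of the same contig (mean).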


The following code drops a fixed list of entries from the pivot table (including the '-' row of unannotated reads), converts the remaining values to integers, and keeps only the samples (columns) whose total read count across the remaining genes is greater than zero. The '-' row is then added back, and the result is stored in final_df.

In [15]:
filtered_df = pivot_df.drop(index=['-','(AGly)aadA27', '(Bla)blaTEM-116', 'Col440I_1',
                                   'ColRNAI_1', 'TEM', 'acpXL', 'blaTEM-116', 'blaTEM-116_1',
                                   'rep10_3_pNE131p1(pNE131)'], errors='ignore')
filtered_df = filtered_df.astype(int)
# Select the unannotated-reads row by label rather than by position.
row = pivot_df.loc['-']

column_sums = filtered_df.sum(axis=0)
columns_to_keep = column_sums[column_sums > 0].index

# Copy so that adding a row does not write into a view of filtered_df.
final_df = filtered_df[columns_to_keep].copy()
final_df.loc['-'] = row[columns_to_keep]
final_df

GENE,ERR598947,ERR599072,ERR599112,ERR599115,SRR3967319
(AGly)aadC,0.0,0.0,74.0,0.0,0.0
ANT3-DPRIME,547.0,0.0,0.0,0.0,0.0
QACC,0.0,0.0,63.0,0.0,0.0
TEM-116,0.0,696.0,0.0,1828.0,131.0
aadA27,547.0,0.0,0.0,0.0,0.0
-,55463.0,1379769.0,4271828.0,380571.0,9698.0


The code sums the read counts of three aminoglycoside resistance genes to create a combined Aminoglycoside class, removes these individual genes from the DataFrame, renames some gene entries to represent broader resistance classes, adds the new Aminoglycoside row, and stores the result in class_df.

In [16]:
aminoglycoside_genes = ['(AGly)aadC', 'ANT3-DPRIME', 'aadA27']
aminoglycoside_class = final_df.loc[aminoglycoside_genes].sum()

class_df = final_df.drop(index=aminoglycoside_genes)
class_df = class_df.rename(index={'QACC': 'Drug and biocide resistance',
                                  'TEM-116': 'Beta-lactamase'})
class_df.loc['Aminoglycoside'] = aminoglycoside_class
class_df

GENE,ERR598947,ERR599072,ERR599112,ERR599115,SRR3967319
Drug and biocide resistance,0.0,0.0,63.0,0.0,0.0
Beta-lactamase,0.0,696.0,0.0,1828.0,131.0
-,55463.0,1379769.0,4271828.0,380571.0,9698.0
Aminoglycoside,1094.0,0.0,74.0,0.0,0.0
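
When more genes or classes are involved, the same grouping can be driven by a single lookup table instead of one statement per class. The sketch below uses made-up counts. The `gene_to_class` dictionary is a hypothetical helper that simply restates the gene-to-class assignments used above.

```python
import pandas as pd

# Made-up counts for two placeholder samples.
toy_counts = pd.DataFrame(
    {'S1': [547, 0, 547, 0], 'S2': [0, 74, 0, 63]},
    index=pd.Index(['ANT3-DPRIME', '(AGly)aadC', 'aadA27', 'QACC'], name='GENE'),
)

# Gene-to-class lookup restating the assignments of the renaming step.
gene_to_class = {
    '(AGly)aadC': 'Aminoglycoside',
    'ANT3-DPRIME': 'Aminoglycoside',
    'aadA27': 'Aminoglycoside',
    'QACC': 'Drug and biocide resistance',
    'TEM-116': 'Beta-lactamase',
}

# Relabel each row with its class (labels missing from the lookup are
# left unchanged), then sum the rows that now share a label.
class_counts = toy_counts.rename(index=gene_to_class).groupby(level=0).sum()
print(class_counts)
```

A row such as '-' that is absent from the lookup would pass through under its own label.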


**Write outputs**

The code defines output file paths and saves the final_df DataFrame as both a tab-separated .tsv file and an Excel .xlsx file in the specified output_dir, including row indices.

In [17]:
file_txt = os.path.join(output_dir, 'final_results.tsv')
file_xlsx = os.path.join(output_dir, 'final_results.xlsx')


final_df.to_csv(file_txt, sep='\t', index=True)
final_df.to_excel(file_xlsx, index=True)

This code defines output file paths and saves the class_df DataFrame as both a tab-separated .tsv file and an Excel .xlsx file in the specified output_dir, including row indices.

In [18]:
file_txt = os.path.join(output_dir, 'final_class.tsv')
file_xlsx = os.path.join(output_dir, 'final_class.xlsx')


class_df.to_csv(file_txt, sep='\t', index=True)
class_df.to_excel(file_xlsx, index=True)
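
Because these tables are written with their row index, reading them back requires telling pandas which column holds the index. The sketch below is a small round trip on a made-up table in a temporary directory. The file name `toy_results.tsv` is a placeholder.

```python
import os
import tempfile
import pandas as pd

# Made-up table with gene labels as the row index.
toy = pd.DataFrame({'S1': [1, 2], 'S2': [3, 4]},
                   index=pd.Index(['geneX', '-'], name='GENE'))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_results.tsv')
    toy.to_csv(path, sep='\t', index=True)
    # index_col=0 restores the GENE labels as the row index.
    reloaded = pd.read_csv(path, sep='\t', index_col=0)

print(reloaded)
```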