__author__ = Melany Calderón-Osorno

__version__ = 0.1

__date__ = 2025-07-10

__credits__ = Franck Lejzerowicz

#**Mobsuite Postprocessing**

This notebook provides a step-by-step guide to post-processing the results generated by the Mobsuite and pysam pipeline.

#**Setup notebook environment**

First, we clone the repository containing the data generated by the Mobsuite and pysam pipeline.

In [1]:
!git clone https://github.com/mecalderon/Tutorial_Summer_Retreat.git

Cloning into 'Tutorial_Summer_Retreat'...
remote: Enumerating objects: 2399, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (56/56), done.
remote: Total 2399 (delta 19), reused 60 (delta 9), pack-reused 2327 (from 1)
Receiving objects: 100% (2399/2399), 26.23 MiB | 9.14 MiB/s, done.
Resolving deltas: 100% (915/915), done.
Updating files: 100% (2388/2388), done.


The following code imports the libraries used for data post-processing and suppresses warning messages.

In [2]:
import os
import csv
import glob
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

**Input/Output paths**


**Inputs**

The following code navigates to the Tutorial_Summer_Retreat/data directory, where the Mobsuite and pysam output is stored.

In [3]:
cd Tutorial_Summer_Retreat/data

/content/Tutorial_Summer_Retreat/data


We define the variable **mob_dir** to store the path to the Mobsuite data directory, and **pysam_dir** to store the path to the pysam data directory.

In [4]:
mob_dir  = 'Mobsuite'
pysam_dir = 'Pysam_spades'

**Outputs**

We define the variable **output_dir** to store the path to the output directory, Mobsuite_processing, and create that directory.

In [5]:
output_dir = 'Mobsuite_processing'
os.makedirs(output_dir, exist_ok=True)  # do not fail if the directory already exists

**Collect the Mobsuite output files**

The code uses glob to find all files named mobtyper_results.txt in subdirectories of mob_dir and displays the first three and last three file paths from the list.

In [6]:
# use glob with a wildcard pattern to capture all the `mobtyper_results.txt` files
fps = glob.glob('%s/*/mobtyper_results.txt' % mob_dir)
# show a few file paths (first 3 and last 3)
fps[:3] + fps[-3:]

['Mobsuite/SRR3967690/mobtyper_results.txt',
 'Mobsuite/SRR3966130/mobtyper_results.txt',
 'Mobsuite/SRR3965592/mobtyper_results.txt',
 'Mobsuite/SRR3963458/mobtyper_results.txt',
 'Mobsuite/SRR3963982/mobtyper_results.txt',
 'Mobsuite/ERR599067/mobtyper_results.txt']

The code uses glob to find all files named contig_report.txt in subdirectories of mob_dir and displays the first three and last three file paths from the list.

In [7]:
# use glob with a wildcard pattern to capture all the `contig_report.txt` files
cps = glob.glob('%s/*/contig_report.txt' % mob_dir)
# show a few file paths (first 3 and last 3)
cps[:3] + cps[-3:]

['Mobsuite/SRR3967690/contig_report.txt',
 'Mobsuite/SRR3963804/contig_report.txt',
 'Mobsuite/SRR3966130/contig_report.txt',
 'Mobsuite/SRR3963982/contig_report.txt',
 'Mobsuite/ERR599067/contig_report.txt',
 'Mobsuite/ERR599131/contig_report.txt']
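
Because glob returns paths in no guaranteed order, and the two lists may not cover the same samples, it can help to compare the sample names behind each list. The following is a minimal sketch under stated assumptions: `fps_demo`, `cps_demo`, and `sample_names` are hypothetical names, and the short lists stand in for `fps` and `cps`.

```python
import os

# Hypothetical path lists standing in for `fps` and `cps`
fps_demo = [
    'Mobsuite/SRR3967690/mobtyper_results.txt',
    'Mobsuite/SRR3966130/mobtyper_results.txt',
]
cps_demo = [
    'Mobsuite/SRR3967690/contig_report.txt',
    'Mobsuite/SRR3963804/contig_report.txt',
    'Mobsuite/SRR3966130/contig_report.txt',
]


def sample_names(paths):
    """Return the set of parent-directory names (sample IDs) for a list of paths."""
    return {os.path.basename(os.path.dirname(p)) for p in paths}


# Samples that have a contig report but no mobtyper result in the demo lists
only_contig_report = sorted(sample_names(cps_demo) - sample_names(fps_demo))
print(only_contig_report)
```

In the notebook, the same function could be called on `fps` and `cps` directly.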

**Merge the Mobsuite results (mobtyper_results)**

The code reads multiple tab-separated files from the list fps, extracts the sample name from each file’s parent directory, adds it as a new column, reorders the columns to put Sample first, concatenates all data into a single DataFrame called combined_data, and displays it.


In [8]:
combined_data = pd.DataFrame()

for file_path in fps:
    sample_name = file_path.split('/')[-2]

    data = pd.read_csv(file_path, sep='\t')
    data['Sample'] = sample_name

    data = data[['Sample'] + [col for col in data.columns if col != 'Sample']]

    combined_data = pd.concat([combined_data, data])

display(combined_data)

Unnamed: 0,Sample,sample_id,num_contigs,size,gc,md5,rep_type(s),rep_type_accession(s),relaxase_type(s),relaxase_type_accession(s),...,mash_neighbor_identification,primary_cluster_id,secondary_cluster_id,predicted_host_range_overall_rank,predicted_host_range_overall_name,observed_host_range_ncbi_rank,observed_host_range_ncbi_name,reported_host_range_lit_rank,reported_host_range_lit_name,associated_pmid(s)
0,SRR3967690,filtered_contigs.plasmid:AD731,1,3008,0.364694,55c093c6352fd7c392701747c7dbba9a,rep_cluster_1223,000520__CP010352_00002,-,-,...,Acinetobacter baumannii,AD731,-,genus,Acinetobacter,genus,Acinetobacter,-,-,-
1,SRR3967690,filtered_contigs.plasmid:AD642,1,1520,0.333553,d1eb38016a09f0ac0a3ca2bdf5e4e9bc,-,-,-,-,...,Acinetobacter pittii,AD642,-,genus,Acinetobacter,genus,Acinetobacter,-,-,-
0,SRR3966130,filtered_contigs.plasmid:AF615,5,34324,0.507808,579edc324d76dfac83a068e88bf468f5,rep_cluster_255,001534__NC_008739_00056,MOBH,NC_008739_00059,...,Marinobacter sp. NP-4(2019),AF615,-,genus,Marinobacter,genus,Marinobacter,-,-,-
1,SRR3966130,filtered_contigs.plasmid:novel_0c89ce550d73afb...,8,30267,0.562263,0c89ce550d73afb9ada73c86ceef226d,-,-,MOBP,CP021432_00049,...,Yoonia vestfoldensis,novel_0c89ce550d73afb9ada73c86ceef226d,-,order,Rhodobacterales,order,Rhodobacterales,-,-,-
2,SRR3966130,filtered_contigs.plasmid:AE470,7,20415,0.577468,5be2ea5990f1de6cf2be93eaf357c878,rep_cluster_579,001983__CP021432_00100,-,-,...,Sulfitobacter sp. THAF37,AE470,-,class,Alphaproteobacteria,class,Alphaproteobacteria,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2,SRR3963982,filtered_contigs.plasmid:AD653,10,65282,0.596933,38edcf5734f4036aeea825ddd04b79c6,rep_cluster_558,001954__CP015042_00026,-,-,...,Sulfitobacter sp. AM1-D1,AD653,-,order,Rhodobacterales,order,Rhodobacterales,-,-,-
3,SRR3963982,filtered_contigs.plasmid:AE470,3,38863,0.573219,f8861086460a02d8c2780d8da0556249,rep_cluster_579,001983__CP021432_00100,MOBP,CP025812_00016,...,Sulfitobacter sp. THAF37,AE470,-,class,Alphaproteobacteria,class,Alphaproteobacteria,-,-,-
4,SRR3963982,filtered_contigs.plasmid:novel_dd6a0914c4d253a...,3,31449,0.570129,dd6a0914c4d253a5892041cbafc7d9a8,-,-,MOBP,CP021432_00049,...,Yoonia vestfoldensis,novel_dd6a0914c4d253a5892041cbafc7d9a8,-,order,Rhodobacterales,order,Rhodobacterales,-,-,-
5,SRR3963982,filtered_contigs.plasmid:AF003,33,104064,0.404318,8b182454a10b6476294bb34c11555703,rep_cluster_1371,000701__NC_021718_00105,-,-,...,Alteromonas mediterranea UM7,AF003,-,genus,Alteromonas,genus,Alteromonas,-,-,-
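
The row labels in the output above restart at 0 for every sample because each file keeps its own index when concatenated. As an alternative, the sketch below collects the per-sample tables in a list and concatenates them once with a fresh index. It is an illustration under assumptions, not the notebook's method: the in-memory `tables` dictionary is hypothetical and replaces reading the real files.

```python
import pandas as pd

# Hypothetical per-sample tables standing in for the mobtyper_results.txt files
tables = {
    'Mobsuite/S1/mobtyper_results.txt': pd.DataFrame(
        {'sample_id': ['p1', 'p2'], 'size': [3008, 1520]}),
    'Mobsuite/S2/mobtyper_results.txt': pd.DataFrame(
        {'sample_id': ['p3'], 'size': [34324]}),
}

frames = []
for file_path, data in tables.items():
    sample_name = file_path.split('/')[-2]
    # insert() places the new column at position 0, so no reordering is needed
    data.insert(0, 'Sample', sample_name)
    frames.append(data)

# A single concat avoids re-copying the growing table on every iteration;
# ignore_index=True gives a unique 0..n-1 row index
combined_demo = pd.concat(frames, ignore_index=True)
print(combined_demo)
```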


**Write outputs**

This code defines output file paths and saves the combined_data DataFrame as both a tab-separated .tsv file and an Excel .xlsx file in the specified output_dir.

In [9]:
file_txt = os.path.join(output_dir, 'mobsuite_combined_results.tsv')
file_xlsx = os.path.join(output_dir, 'mobsuite_combined_results.xlsx')

combined_data.to_csv(file_txt, sep='\t', index=False)
combined_data.to_excel(file_xlsx, index=False)
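
A quick way to check a written table is to read it back. The sketch below does this for a small hypothetical DataFrame (`demo`) in a temporary directory. Only the TSV round trip is shown, since writing .xlsx files additionally requires an Excel writer engine such as openpyxl to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical table standing in for `combined_data`
demo = pd.DataFrame({'Sample': ['S1', 'S2'], 'size': [3008, 34324]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.tsv')
    # index=False keeps the row labels out of the file
    demo.to_csv(path, sep='\t', index=False)
    reloaded = pd.read_csv(path, sep='\t')

print(reloaded)
```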

**Merge the Mobsuite results (contig_reports)**

The code reads multiple tab-separated contig_report.txt files from cps, extracts the sample name from each file’s parent directory, adds it as a new column, concatenates all data into a single DataFrame, filters to keep only rows where molecule_type is "plasmid", and displays the resulting DataFrame contig_combined_data.

In [10]:
contig_combined_data = pd.DataFrame()

for file_path in cps:
    # the sample name is the parent directory of each contig_report.txt
    sample_name = file_path.split('/')[-2]

    data = pd.read_csv(file_path, sep='\t')
    data['Sample'] = sample_name

    # put the Sample column first
    data = data[['Sample'] + [col for col in data.columns if col != 'Sample']]

    contig_combined_data = pd.concat([contig_combined_data, data])

# keep only the contigs classified as plasmid (filter once, after the loop)
contig_combined_data = contig_combined_data[contig_combined_data['molecule_type'] == 'plasmid']

display(contig_combined_data)

Unnamed: 0,Sample,sample_id,molecule_type,primary_cluster_id,secondary_cluster_id,contig_id,size,gc,md5,circularity_status,...,mpf_type_accession(s),orit_type(s),orit_accession(s),predicted_mobility,mash_nearest_neighbor,mash_neighbor_distance,mash_neighbor_identification,repetitive_dna_id,repetitive_dna_type,filtering_reason
14,SRR3967690,filtered_contigs.plasmid,plasmid,AD731,-,NODE_3999_length_3008_cov_1.950155,3008,0.364694,55c093c6352fd7c392701747c7dbba9a,not tested,...,-,-,-,-,CP042564,0.0586096,Acinetobacter baumannii,-,-,-
28,SRR3967690,filtered_contigs.plasmid,plasmid,AD642,-,NODE_14365_length_1520_cov_1.175932,1520,0.333553,d1eb38016a09f0ac0a3ca2bdf5e4e9bc,not tested,...,-,-,-,-,CP042368,0.0463343,Acinetobacter pittii,-,-,-
0,SRR3966130,filtered_contigs.plasmid,plasmid,AF615,-,NODE_84_length_17462_cov_3.606808,17462,0.516092,d60b52d63ade1f41b381115f066c4c2a,not tested,...,-,-,-,-,CP034143,0.047688,Marinobacter sp. NP-4(2019),-,-,-
2,SRR3966130,filtered_contigs.plasmid,plasmid,novel_0c89ce550d73afb9ada73c86ceef226d,-,NODE_243_length_11467_cov_2.277270,11467,0.546350,8d6aa4bd33c3249950684897816f2845,not tested,...,-,-,-,-,CP021432,0.0909306,Yoonia vestfoldensis,-,-,-
4,SRR3966130,filtered_contigs.plasmid,plasmid,AE470,-,NODE_362_length_9502_cov_2.159630,9502,0.565355,3c228f9a2a2d14329c226979d7cd9255,not tested,...,-,-,-,-,CP045378,0.0496454,Sulfitobacter sp. THAF37,-,-,-
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
145,SRR3963982,filtered_contigs.plasmid,plasmid,AF003,-,NODE_28486_length_1081_cov_0.994908,1081,0.402405,1bb347eb04fd3f5b207f7265c0b28d91,not tested,...,-,-,-,-,CP004854,0.0489202,Alteromonas mediterranea UM7,-,-,-
2,ERR599067,filtered_contigs.plasmid,plasmid,AD061,-,NODE_37_length_38092_cov_0.195983,38092,0.605586,5b948538835a536b07bdd023d3321fdd,not tested,...,-,-,-,-,CP015441,0.0378117,Erythrobacter atlanticus,-,-,-
4,ERR599067,filtered_contigs.plasmid,plasmid,AD061,-,NODE_68_length_27879_cov_0.192081,27879,0.608845,4de4be663cc0a52ca546a8a548906ad7,not tested,...,-,-,-,-,CP015441,0.0378117,Erythrobacter atlanticus,-,-,-
7,ERR599067,filtered_contigs.plasmid,plasmid,AD061,-,NODE_285_length_12737_cov_0.227805,12737,0.603517,2871336c5a8e899d42145c354ecfef19,not tested,...,-,-,-,-,CP015441,0.0378117,Erythrobacter atlanticus,-,-,-


**Write outputs**

The code sets output file paths and saves the contig_combined_data DataFrame as both a tab-separated .tsv file and an Excel .xlsx file in the specified output_dir.

In [11]:
file_txt = os.path.join(output_dir, 'contig_mobsuite_combined_results.tsv')
file_xlsx = os.path.join(output_dir, 'contig_mobsuite_combined_results.xlsx')


contig_combined_data.to_csv(file_txt, sep='\t', index=False)
contig_combined_data.to_excel(file_xlsx, index=False)

**Merge the mobtyper results with the contig reports**

The code merges combined_data and contig_combined_data on the Sample and relaxase_type_accession(s) columns, filters out rows where relaxase_type_accession(s) equals '-', selects specific columns for the final DataFrame, and displays the result. Because both tables contain a relaxase_type(s) column, pandas suffixes the copy from combined_data with _x.

In [12]:
merged_df = pd.merge(combined_data, contig_combined_data, on=['Sample', 'relaxase_type_accession(s)'])
merged_df = merged_df[merged_df['relaxase_type_accession(s)'] != '-']


merged_df = merged_df[['Sample', 'contig_id', 'relaxase_type_accession(s)', 'relaxase_type(s)_x']]
display(merged_df)

Unnamed: 0,Sample,contig_id,relaxase_type_accession(s),relaxase_type(s)_x
4,SRR3966130,NODE_84_length_17462_cov_3.606808,NC_008739_00059,MOBH
5,SRR3966130,NODE_243_length_11467_cov_2.277270,CP021432_00049,MOBP
691,SRR3965592,NODE_1753_length_3579_cov_1.960057,NC_010401_00007,MOBQ
852,ERR599086,NODE_13_length_84275_cov_0.226323,CP009357_00140,MOBH
866,ERR599037,NODE_5016_length_7163_cov_0.137316,CP032092_00035,MOBH
...,...,...,...,...
17135,SRR3962293,NODE_139_length_17551_cov_2.820766,CP025812_00016,MOBP
18521,SRR3963458,NODE_25_length_31182_cov_3.499083,CP021432_00049,MOBP
18871,SRR3963982,NODE_276_length_14866_cov_2.690662,CP025812_00016,MOBP
18872,SRR3963982,NODE_296_length_14235_cov_2.945671,CP021432_00049,MOBP
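
The large row labels in the output above (up to 18872) suggest that the merge first produced many rows that were then discarded: every '-' placeholder on one side matches every '-' placeholder on the other side within the same sample. The toy example below illustrates this effect and shows that filtering before merging yields the same matches. The tables `typer` and `contigs` and the accession `ACC_1` are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables; '-' marks a missing relaxase accession
typer = pd.DataFrame({
    'Sample': ['S1', 'S1', 'S1'],
    'relaxase_type_accession(s)': ['-', '-', 'ACC_1'],
    'relaxase_type(s)': ['-', '-', 'MOBP'],
})
contigs = pd.DataFrame({
    'Sample': ['S1', 'S1', 'S1'],
    'contig_id': ['c1', 'c2', 'c3'],
    'relaxase_type_accession(s)': ['-', '-', 'ACC_1'],
})
keys = ['Sample', 'relaxase_type_accession(s)']

# Merge first, filter afterwards: the '-' rows pair up 2 x 2 times
merged_then_filtered = pd.merge(typer, contigs, on=keys)
n_before_filter = len(merged_then_filtered)
merged_then_filtered = merged_then_filtered[
    merged_then_filtered['relaxase_type_accession(s)'] != '-']

# Filter first, merge afterwards: no placeholder pairs are ever created
typer_f = typer[typer['relaxase_type_accession(s)'] != '-']
contigs_f = contigs[contigs['relaxase_type_accession(s)'] != '-']
filtered_then_merged = pd.merge(typer_f, contigs_f, on=keys)

print(n_before_filter, len(merged_then_filtered), len(filtered_then_merged))
```

Both orders keep the same annotated contig. Filtering first only changes the intermediate table size and the row labels of the result.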


The code defines output file paths and saves the merged DataFrame merged_df as both a tab-separated .tsv file and an Excel .xlsx file.

In [13]:
file_txt = os.path.join(output_dir, 'merge_mobsuite.tsv')
file_xlsx = os.path.join(output_dir, 'merge_mobsuite.xlsx')



# Save the merged DataFrame to new files
merged_df.to_csv(file_txt, sep='\t', index=False)
merged_df.to_excel(file_xlsx, index=False)

**Merge the Mobsuite results with pysam**

This code reads the tab-separated file 'Mobsuite/pysam_combined_results.tsv' into a DataFrame called pysam and displays its contents.

In [14]:
pysam =  pd.read_csv('Mobsuite/pysam_combined_results.tsv', sep='\t')
display(pysam)

Unnamed: 0,sample,contig,reads
0,ERR599021,NODE_4_length_52489_cov_0.142546,7449
1,ERR599021,NODE_52_length_22264_cov_0.136070,2935
2,ERR599021,NODE_54_length_22201_cov_0.152656,3176
3,ERR599021,NODE_156_length_15308_cov_0.198895,3161
4,ERR599021,NODE_231_length_13214_cov_0.134274,1112
...,...,...,...
10176,ERR598999,NODE_61216_length_1106_cov_0.085402,59
10177,ERR598999,NODE_62109_length_1097_cov_0.119238,74
10178,ERR598999,NODE_63971_length_1079_cov_0.076531,50
10179,ERR598999,NODE_68052_length_1042_cov_0.050901,33


The code performs two merges between merged_df and pysam on the contig identifiers: an outer join stored in final_merged_df and an inner join stored in final_df, then displays both resulting DataFrames.

In [15]:
final_merged_df = pd.merge(merged_df, pysam, left_on='contig_id', right_on='contig', how='outer')
display(final_merged_df)

final_df = pd.merge(merged_df, pysam, left_on='contig_id', right_on='contig')
display(final_df)

Unnamed: 0,Sample,contig_id,relaxase_type_accession(s),relaxase_type(s)_x,sample,contig,reads
0,,,,,ERR599072,NODE_100019_length_1576_cov_0.093433,138
1,,,,,ERR598985,NODE_100094_length_1448_cov_0.087472,88
2,,,,,ERR599112,NODE_1000_length_20556_cov_0.780564,21623
3,,,,,ERR598985,NODE_10013_length_4786_cov_0.325368,1569
4,,,,,ERR599005,NODE_100148_length_1084_cov_0.109645,65
...,...,...,...,...,...,...,...
10176,,,,,SRR3963982,NODE_999_length_7367_cov_2.895432,447
10177,,,,,ERR599159,NODE_99_length_40422_cov_1.347940,80873
10178,,,,,ERR598971,NODE_9_length_142113_cov_0.475904,45832
10179,,,,,SRR3968062,NODE_9_length_58353_cov_7.694150,9446


Unnamed: 0,Sample,contig_id,relaxase_type_accession(s),relaxase_type(s)_x,sample,contig,reads
0,SRR3966130,NODE_84_length_17462_cov_3.606808,NC_008739_00059,MOBH,SRR3966130,NODE_84_length_17462_cov_3.606808,1366
1,SRR3966130,NODE_243_length_11467_cov_2.277270,CP021432_00049,MOBP,SRR3966130,NODE_243_length_11467_cov_2.277270,563
2,SRR3965592,NODE_1753_length_3579_cov_1.960057,NC_010401_00007,MOBQ,SRR3965592,NODE_1753_length_3579_cov_1.960057,144
3,ERR599086,NODE_13_length_84275_cov_0.226323,CP009357_00140,MOBH,ERR599086,NODE_13_length_84275_cov_0.226323,13631
4,ERR599037,NODE_5016_length_7163_cov_0.137316,CP032092_00035,MOBH,ERR599037,NODE_5016_length_7163_cov_0.137316,563
...,...,...,...,...,...,...,...
99,SRR3962293,NODE_139_length_17551_cov_2.820766,CP025812_00016,MOBP,SRR3962293,NODE_139_length_17551_cov_2.820766,1080
100,SRR3963458,NODE_25_length_31182_cov_3.499083,CP021432_00049,MOBP,SRR3963458,NODE_25_length_31182_cov_3.499083,2386
101,SRR3963982,NODE_276_length_14866_cov_2.690662,CP025812_00016,MOBP,SRR3963982,NODE_276_length_14866_cov_2.690662,860
102,SRR3963982,NODE_296_length_14235_cov_2.945671,CP021432_00049,MOBP,SRR3963982,NODE_296_length_14235_cov_2.945671,911
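
The merges above match on the contig name only. Assembler contig names usually embed length and coverage, so collisions between samples are unlikely, but they are not impossible. The toy sketch below shows what would happen if two samples shared a contig name, how matching on sample and contig together avoids it, and how `indicator=True` labels the origin of each row in an outer join. The tables `annotated` and `counts` are hypothetical.

```python
import pandas as pd

# Hypothetical toy tables standing in for `merged_df` and `pysam`
annotated = pd.DataFrame({
    'Sample': ['S1'],
    'contig_id': ['NODE_1'],
    'relaxase_type(s)_x': ['MOBP'],
})
counts = pd.DataFrame({
    'sample': ['S1', 'S1', 'S2'],
    'contig': ['NODE_1', 'NODE_2', 'NODE_1'],
    'reads': [100, 50, 70],
})

# Matching on the contig name alone also pairs S1's annotation with S2's NODE_1
by_contig = pd.merge(annotated, counts, left_on='contig_id', right_on='contig')

# Matching on sample and contig keeps only the row from the same sample
by_sample_and_contig = pd.merge(
    annotated, counts,
    left_on=['Sample', 'contig_id'], right_on=['sample', 'contig'],
)

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
outer = pd.merge(
    annotated, counts,
    left_on=['Sample', 'contig_id'], right_on=['sample', 'contig'],
    how='outer', indicator=True,
)
print(outer[['sample', 'contig', 'reads', '_merge']])
```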


The code sums, per sample, the reads of contigs that lack a relaxase type annotation, creates one new row per sample to represent these unannotated reads, appends the rows to the annotated data in final_df, selects key columns, and displays the updated final_merged_df DataFrame.

In [16]:
reads_sum_per_sample = final_merged_df[final_merged_df['relaxase_type(s)_x'].isna()].groupby('sample')['reads'].sum()

new_rows = []

for sample, reads_sum in reads_sum_per_sample.items():
    new_row = {'Sample': sample, 'sample': sample, 'contig': '-', 'relaxase_type_accession(s)': '-', 'relaxase_type(s)_x': '-', 'reads': reads_sum}
    new_rows.append(new_row)

new_rows_df = pd.DataFrame(new_rows)
final_merged_df = pd.concat([final_df, new_rows_df], ignore_index=True)

final_merged_df = final_merged_df[['relaxase_type(s)_x', 'sample', 'reads']]
display(final_merged_df)

Unnamed: 0,relaxase_type(s)_x,sample,reads
0,MOBH,SRR3966130,1366
1,MOBP,SRR3966130,563
2,MOBQ,SRR3965592,144
3,MOBH,ERR599086,13631
4,MOBH,ERR599037,563
...,...,...,...
180,-,SRR3967690,8205
181,-,SRR3967700,3026
182,-,SRR3968061,1044
183,-,SRR3968062,78214
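
The per-sample summation of unannotated reads can also be written without an explicit loop. The sketch below is one possible equivalent on toy data, not the notebook's code: `outer_demo` is a hypothetical stand-in for the outer-join result, in which contigs without a relaxase annotation carry NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical outer-join result: NaN marks contigs without a relaxase annotation
outer_demo = pd.DataFrame({
    'relaxase_type(s)_x': ['MOBP', np.nan, np.nan, np.nan],
    'sample': ['S1', 'S1', 'S1', 'S2'],
    'reads': [100, 50, 25, 70],
})

unannotated = (
    outer_demo[outer_demo['relaxase_type(s)_x'].isna()]
    .groupby('sample', as_index=False)['reads'].sum()   # one row per sample
    .assign(**{'relaxase_type(s)_x': '-'})              # label the summed rows
)
print(unannotated)
```

The `**{...}` form is used with `assign` because the column name contains parentheses and cannot be written as a keyword argument.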


The code creates a pivot table from final_merged_df with relaxase types as rows and samples as columns, showing the mean read count per combination and filling missing combinations with 0, then displays the resulting table.

In [17]:
pivot_df = final_merged_df.pivot_table(index='relaxase_type(s)_x', columns='sample', values='reads', fill_value=0, aggfunc='mean')

display(pivot_df)

relaxase_type(s)_x,ERR598944,ERR598947,ERR598958,ERR598960,ERR598964,ERR598971,ERR598980,ERR598985,ERR598999,ERR599000,...,SRR3965758,SRR3965873,SRR3965874,SRR3966130,SRR3967319,SRR3967690,SRR3967700,SRR3968061,SRR3968062,SRR3968777
-,772716.0,55284.0,350813.0,776317.0,623094.0,470454.0,68403.0,741881.0,137161.0,759153.0,...,20351.0,19580.0,4761.0,27766.0,9829.0,8205.0,3026.0,1044.0,78214.0,5031.0
MOBB,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MOBF,3628.0,0.0,6440.0,1432.0,9523.0,52684.0,0.0,0.0,0.0,6125.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
MOBH,0.0,0.0,0.0,0.0,0.0,0.0,3229.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1366.0,0.0,0.0,1116.0,0.0,0.0,0.0
MOBP,674.0,547.0,158.0,592.5,989.0,0.0,0.0,3322.0,0.0,2649.5,...,0.0,142.0,0.0,563.0,0.0,0.0,0.0,0.0,0.0,0.0
MOBQ,4630.0,179.0,0.0,5062.0,281.0,0.0,0.0,3881.0,0.0,0.0,...,88.0,0.0,0.0,0.0,0.0,0.0,517.0,0.0,0.0,0.0
MOBV,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,356.0,0.0,0.0,0.0
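
When a sample has several contigs with the same relaxase type, `aggfunc='mean'` averages their read counts, which is how a fractional value such as 6125.5 can appear in the table above. The '-' row, in contrast, already holds per-sample sums. The toy sketch below compares `'mean'` with `'sum'` on a hypothetical long-format table (`long_demo`). Which aggregation is appropriate depends on the question being asked.

```python
import pandas as pd

# Hypothetical long-format table: two MOBF contigs in sample S1
long_demo = pd.DataFrame({
    'relaxase_type(s)_x': ['MOBF', 'MOBF', 'MOBP', '-'],
    'sample': ['S1', 'S1', 'S2', 'S1'],
    'reads': [6000, 6251, 142, 1000],
})

pivot_args = dict(index='relaxase_type(s)_x', columns='sample',
                  values='reads', fill_value=0)

# Average reads per contig for each relaxase type and sample
mean_table = long_demo.pivot_table(aggfunc='mean', **pivot_args)
# Total reads for each relaxase type and sample
sum_table = long_demo.pivot_table(aggfunc='sum', **pivot_args)

print(mean_table)
print(sum_table)
```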


**Write outputs**

The code defines output file paths and saves the pivot table pivot_df as both a tab-separated .tsv file and an Excel .xlsx file.

In [18]:
file_txt = os.path.join(output_dir, 'final_mobsuite.tsv')
file_xlsx = os.path.join(output_dir, 'final_mobsuite.xlsx')



pivot_df.to_csv(file_txt, sep='\t', index=True)
pivot_df.to_excel(file_xlsx, index=True)