__author__ = Melany Calderón-Osorno

__version__ = 0.1

__date__ = 2025-07-08

__credits__ = Franck Lejzerowicz

#**QUAST Postprocessing**

This notebook provides a step-by-step guide for post-processing the results generated by the QUAST tool.

#**Setup notebook environment**

First, we will clone the repository containing the data generated by the QUAST tool.

In [1]:
!git clone https://github.com/mecalderon/Tutorial_Summer_Retreat.git

Cloning into 'Tutorial_Summer_Retreat'...
remote: Enumerating objects: 2382, done.
remote: Counting objects: 100% (55/55), done.
remote: Compressing objects: 100% (41/41), done.
remote: Total 2382 (delta 14), reused 47 (delta 8), pack-reused 2327 (from 1)
Receiving objects: 100% (2382/2382), 26.02 MiB | 4.79 MiB/s, done.
Resolving deltas: 100% (910/910), done.
Updating files: 100% (2384/2384), done.


The following code imports the libraries used for data post-processing.

In [2]:
import os
import glob
import pandas as pd
import shutil

**Input/Output paths**


**Inputs**

The following code navigates to the Tutorial_Summer_Retreat/data directory, where the QUAST output is stored.

In [3]:
cd Tutorial_Summer_Retreat/data

/content/Tutorial_Summer_Retreat/data


We store the path to the QUAST data directory, **Quast_plasmid**, in a variable named **quast_dir**.

In [4]:
quast_dir = 'Quast_plasmid'

**Outputs**

We store the path to the output directory, **Quast_processing**, in a variable named **output_dir** and create the directory.

In [5]:
output_dir = 'Quast_processing'
os.makedirs(output_dir, exist_ok=True)  # do not fail if the cell is re-run

**Collect the QUAST output files**

The following code collects the path to each sample's transposed_report.tsv file generated by the QUAST tool and displays the first three and last three paths.

In [6]:
fds = glob.glob('%s/*/transposed_report.tsv' % quast_dir)
fds[:3] + fds[-3:]

['Quast_plasmid/SRR3960579/transposed_report.tsv',
 'Quast_plasmid/ERR599164/transposed_report.tsv',
 'Quast_plasmid/ERR599085/transposed_report.tsv',
 'Quast_plasmid/ERR598999/transposed_report.tsv',
 'Quast_plasmid/ERR599127/transposed_report.tsv',
 'Quast_plasmid/ERR598944/transposed_report.tsv']
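`glob.glob` returns paths in whatever order the file system lists them, so the order shown above can differ from one machine to another. The sketch below shows one way to make the order reproducible by sorting the paths. `collect_reports` is a hypothetical helper name, and the sample names are placeholders written to a temporary directory, not real accessions.

```python
import glob
import os
import tempfile


def collect_reports(quast_dir):
    """Return the transposed_report.tsv paths under quast_dir, sorted by path."""
    pattern = os.path.join(quast_dir, '*', 'transposed_report.tsv')
    return sorted(glob.glob(pattern))


# Build a small fake QUAST layout to demonstrate (sample names are placeholders).
demo_root = tempfile.mkdtemp()
for sample in ['sampleB', 'sampleA', 'sampleC']:
    sample_dir = os.path.join(demo_root, sample)
    os.makedirs(sample_dir)
    with open(os.path.join(sample_dir, 'transposed_report.tsv'), 'w') as fh:
        fh.write('Assembly\tTotal length\n%s\t1000\n' % sample)

demo_fds = collect_reports(demo_root)
# Recover the sample name from the parent directory of each report.
demo_samples = [os.path.basename(os.path.dirname(p)) for p in demo_fds]
print(demo_samples)
```

In this notebook, the equivalent call would be `collect_reports(quast_dir)` in place of the bare `glob.glob` call.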

**Extract Total length values**

The following code reads multiple TSV files, extracts the 'Assembly' and 'Total length' columns, combines the data into a single DataFrame, and saves the result as an Excel file named **TotalLength_combined.xlsx**.

In [7]:
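file_xlsx = os.path.join(output_dir, 'TotalLength_combined.xlsx')

# Collect one small DataFrame per QUAST report
dataframes = []

for archivo in fds:
    # Keep only the sample name and its total assembly length
    df = pd.read_csv(archivo, sep='\t', usecols=['Assembly', 'Total length'])
    dataframes.append(df)

# Stack all samples into one table and save it as an Excel file
df_final = pd.concat(dataframes, ignore_index=True)
df_final.to_excel(file_xlsx, index=False)
df_final.head()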

,Assembly,Total length
0,SRR3960579,36390
1,SRR3961047,22730
2,SRR3961906,44927
3,SRR3961935,118941
4,SRR3962293,884702
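Any column of transposed_report.tsv can be pulled out with the same read-and-concatenate loop, so the pattern can be wrapped in a function. The sketch below is one possible way to do that. `extract_metric` is a hypothetical helper name, and the two in-memory reports use invented sample names and values rather than real QUAST output.

```python
import io

import pandas as pd


def extract_metric(report_sources, column):
    """Read 'Assembly' plus one metric column from each report and stack the rows."""
    frames = [
        pd.read_csv(src, sep='\t', usecols=['Assembly', column])
        for src in report_sources
    ]
    return pd.concat(frames, ignore_index=True)


# Two in-memory stand-ins for transposed_report.tsv files (values are invented).
demo_reports = [
    'Assembly\t# contigs\tTotal length\tN50\nsampleA\t3\t600\t300\n',
    'Assembly\t# contigs\tTotal length\tN50\nsampleB\t5\t900\t200\n',
]

demo_n50 = extract_metric([io.StringIO(text) for text in demo_reports], 'N50')
print(demo_n50)
```

Because `pd.read_csv` also accepts file paths, the same function could be called with the notebook's list of report paths, for example `extract_metric(fds, 'Total length')`.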


**Extract N50 values**

The following code reads multiple TSV files, extracts the 'Assembly' and 'N50' columns, combines the data into a single DataFrame, and saves the result as an Excel file named **N50_combined.xlsx**.

In [8]:
file_xlsx = os.path.join(output_dir, 'N50_combined.xlsx')
dataframes = []

for archivo in fds:
    df = pd.read_csv(archivo, sep='\t', usecols=['Assembly', 'N50'])
    dataframes.append(df)
df_final = pd.concat(dataframes, ignore_index=True)

df_final.to_excel(file_xlsx, index=False)
df_final.head()

,Assembly,N50
0,SRR3960579,1790
1,SRR3961047,1322
2,SRR3961906,1819
3,SRR3961935,1745
4,SRR3962293,6501


**Extract Number of contigs**

The following code reads multiple TSV files, extracts the 'Assembly' and '# contigs' (number of contigs) columns, combines the data into a single DataFrame, and saves the result as an Excel file named **Ncontigs_combined.xlsx**.

In [9]:
file_xlsx = os.path.join(output_dir, 'Ncontigs_combined.xlsx')
dataframes = []

for archivo in fds:
    df = pd.read_csv(archivo, sep='\t', usecols=['Assembly', '# contigs'])
    dataframes.append(df)
df_final = pd.concat(dataframes, ignore_index=True)

df_final.to_excel(file_xlsx, index=False)
df_final.head()

,Assembly,# contigs
0,SRR3960579,21
1,SRR3961047,17
2,SRR3961906,25
3,SRR3961935,74
4,SRR3962293,222


**Extract N90 values**

The following code reads multiple TSV files, extracts the 'Assembly' and 'N90' columns, combines the data into a single DataFrame, and saves the result as an Excel file named **N90_combined.xlsx**.

In [10]:
file_xlsx = os.path.join(output_dir, 'N90_combined.xlsx')
dataframes = []

for archivo in fds:
    df = pd.read_csv(archivo, sep='\t', usecols=['Assembly', 'N90'])
    dataframes.append(df)
df_final = pd.concat(dataframes, ignore_index=True)

df_final.to_excel(file_xlsx, index=False)
df_final.head()

,Assembly,N90
0,SRR3960579,1173
1,SRR3961047,1048
2,SRR3961906,1189
3,SRR3961935,1072
4,SRR3962293,1685
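For context on what the N50 and N90 columns measure: as commonly defined, Nx is the contig length for which all contigs of that length or longer together cover at least x% of the total assembly length. The sketch below illustrates that definition on invented contig lengths. `nx` is a hypothetical helper written for this illustration and is not QUAST's own implementation.

```python
def nx(contig_lengths, percent):
    """Return the length L such that contigs of length >= L cover at least
    `percent` % of the summed contig length (0 for an empty input)."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return 0


# Invented contig lengths that sum to 300.
demo_lengths = [100, 80, 50, 40, 30]
demo_n50 = nx(demo_lengths, 50)  # 100 + 80 = 180 reaches half of 300
demo_n90 = nx(demo_lengths, 90)  # 100 + 80 + 50 + 40 = 270 reaches 90% of 300
print(demo_n50, demo_n90)
```

Under this definition N90 can never exceed N50, which matches the values in the tables above.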


**Extract largest contig values**

The following code reads multiple TSV files, extracts the 'Assembly' and 'Largest contig' columns, combines the data into a single DataFrame, and saves the result as an Excel file named **Largestcontig_combined.xlsx**.

In [11]:
file_xlsx = os.path.join(output_dir, 'Largestcontig_combined.xlsx')
dataframes = []

for archivo in fds:
    df = pd.read_csv(archivo, sep='\t', usecols=['Assembly', 'Largest contig'])
    dataframes.append(df)
df_final = pd.concat(dataframes, ignore_index=True)

df_final.to_excel(file_xlsx, index=False)
df_final.head()

,Assembly,Largest contig
0,SRR3960579,4061
1,SRR3961047,2681
2,SRR3961906,5228
3,SRR3961935,3244
4,SRR3962293,32623
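The five extractions above each read every report once. As an alternative, the sketch below reads each report a single time and keeps all five metrics in one table, sorted by sample name. `summarize_reports` and `METRICS` are hypothetical names, and the in-memory reports contain invented values.

```python
import io

import pandas as pd

# Columns extracted in the sections above.
METRICS = ['Total length', 'N50', '# contigs', 'N90', 'Largest contig']


def summarize_reports(report_sources, metrics=METRICS):
    """Build one row per assembly with all requested metrics as columns."""
    wanted = ['Assembly'] + list(metrics)
    frames = [
        pd.read_csv(src, sep='\t', usecols=wanted) for src in report_sources
    ]
    summary = pd.concat(frames, ignore_index=True)
    # Reorder the columns as requested and sort rows for a reproducible table.
    return summary[wanted].sort_values('Assembly').reset_index(drop=True)


# In-memory stand-ins for transposed_report.tsv files (values are invented).
header = 'Assembly\t# contigs\tLargest contig\tTotal length\tN50\tN90\n'
demo_reports = [
    header + 'sampleB\t5\t400\t900\t200\t120\n',
    header + 'sampleA\t3\t350\t600\t300\t150\n',
]

demo_summary = summarize_reports([io.StringIO(t) for t in demo_reports])
print(demo_summary)
```

With the notebook's variables, `summarize_reports(fds)` would return the combined table, which could then be saved with `to_excel` in the same way as the per-metric files.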
