StandEnA: A customizable workflow to create a standardized presence - absence matrix of annotated proteins

Introduction

This workflow was created to predict and annotate proteins from genomes and to build a standardized presence/absence matrix of the annotated proteins. The outcome of this workflow can be used for many purposes, for example, to infer the genetic potential of a given organism to perform a pathway of interest.


Workflow of StandEnA: Starting with enzyme identifiers for the pathways of interest, StandEnA has four steps: Step 1 compiles enzyme synonyms and identifiers for these pathways from various databases. Step 2 creates a custom database from these enzyme protein sequences and annotates genomes. Step 3.1 creates a reference file with cross-database identifiers for each enzyme synonym used in the annotation and step 3.2 lists all of the enzymes of interest within the annotated genomes. Step 4 generates a standardized presence absence matrix for each enzyme within the desired pathway.

Input and Output File Structure of StandEnA: Refer to this figure for input and output files used at each step. The column structure of the figure is as follows: 1. input file name and description for specific step, 2. output file name and description for specific step, 3. description of the output's significance for the pipeline. Color-filled boxes indicate files generated by StandEnA whereas the white background boxes indicate the files manually inputted by the user. Each input - output file pair is ordered according to its sequence of appearance in the pipeline and is seen under the corresponding main step number in line with the above figure.

The different inputs and outputs for each step are summarized in the figure below:


In step 1, the user inputs the uniq_ec.tsv file containing enzyme information separated by tab characters in this order: unique enzyme ID, pathway name, pathway step ID, enzyme name, and enzyme commission (EC) number. The unique enzyme ID and pathway step ID depend on the naming convention preferred by the user; these IDs must follow a consistent alphanumeric naming convention with no whitespace characters within the names. The pathway step ID is named according to the pathway number and the step at which the enzyme acts (e.g., pathway 1 step 1 is 1.1). This file is used to generate the 01_customdb/id_synonyms_per_line.tsv file, which contains the same information as uniq_ec.tsv plus the available synonyms for each enzyme standard name from the KEGG database and a unique synonym ID for each retrieved synonym. The unique synonym ID is a variation of the unique enzyme ID used to differentiate synonyms of the same enzyme. Next, this information is used to retrieve protein sequences from the NCBI database via the Edirect application programming interface (API); these are stored in the 01_customdb/edirect_fasta/ directory. Alternatively, the user can input EC numbers or KEGG Orthology (KO) identifiers within the ecs.txt or kos.txt files, respectively. This method retrieves protein sequences from the KEGG database via OrtSuite. Furthermore, users have the option to download desired protein sequences from other databases by any means; these are combined with the protein sequences downloaded by OrtSuite and stored within the 01_customdb/manual_download_fasta/ directory.

In step 2, all protein sequences compiled in step 1 (within 01_customdb/manual_download_fasta/ and 01_customdb/edirect_fasta/) are used to generate the custom database (02_annotation/custom_db.faa). This database, together with Prokka's default database, is used for genome annotation by Prokka to generate the output 02_annotation/prokka_all.tsv.

In step 3.1, the files from step 1 (01_customdb/id_synonyms_per_line.tsv, ecs.txt, kos.txt) are used to generate a reference file containing standard database identifiers for each enzyme, including the EC number, the standard enzyme name from KEGG, the KO identifier, and the enzyme synonyms used for the search with their EC numbers. Since some enzymes are defined by multiple EC numbers in KEGG, the last field sometimes contains additional EC numbers for the same enzyme. One reference file is generated for each pathway and separated into different text files according to the retrieval method: 03_standardization/pw_N/pw_N_kegg_info.txt, 03_standardization/pw_N/ortsuite_pw_N_kegg_info.txt, and/or 03_standardization/pw_N/manual_pw_N_kegg_info.txt for Edirect-retrieved, OrtSuite-retrieved, and manually downloaded files, respectively. Moreover, the same files from step 1 are used to generate query files within the 03_standardization/pw_N/queries/ directory, which list the possible synonym names for each enzyme.
In step 3.2, 03_standardization/pw_N/queries/ files are used to select the matching annotations in 02_annotation/prokka_all.tsv. The results are dumped to 03_standardization/standardized/results_pw_N.txt files.

In step 4, 03_standardization/standardized/results_pw_N.txt is used to generate a standardized presence - absence matrix for all inputted genomes for pathway N. Alternatively, annotated enzymes from multiple pathways can be used to generate a single standardized presence - absence matrix. For this purpose, the 03_standardization/pw_N/standardized/ results for all pathways are merged into 04_presabs/std_results_all.txt. 02_annotation/prokka_all.tsv is processed to remove any problematic punctuation such as brackets and parentheses. This processed output file, 04_presabs/prokka_all_updated.tsv, is used along with a user inputted file, 04_presabs/ids_to_names.tsv, containing user-defined unique protein IDs for each enzyme and corresponding standard protein names to be used in the standardized presence - absence matrix generation. The standardized presence - absence matrix output, 04_presabs/presence_absence.csv, is generated.

Summary of the different databases incorporated into the StandEnA workflow: On the left side, the databases accessed by step 1 and step 2 of the automated StandEnA pipeline are shown. The custom database is produced by an automated custom search using the user-supplied enzyme information, while the Prokka default database is used to encompass a general representative set of sequences. A manual custom search is added to expand the custom database by compiling specified sequences from other databases such as UniProt. Finally, the expanded custom database and the Prokka default database are used to annotate the genomes, producing a presence-absence matrix in step 4.


GitHub Contents

Installation instructions

Clone this repository

git clone https://github.com/mdsufz/StandEnA.git

Install Miniconda3 and add channels

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Create and activate conda environment

conda create -n std_enzymes python=3.6.13 perl-lwp-simple perl-lwp-protocol-https prokka blast==2.9.0

Set Perl 5.22.0 default path for libraries

conda env config vars set PERL5LIB=$CONDA_PREFIX/lib/perl5/site_perl/5.22.0/ -n std_enzymes

Activate environment

conda activate std_enzymes

Install required packages inside the environment

conda install -c bioconda perl-lwp-simple prokka blast==2.9.0

Dependencies

To manually download additional proteins using KEGG database identifiers, OrtSuite is required. Follow the installation steps for OrtSuite here.

System requirements and usage

A typical desktop (Linux) computer is capable of performing this workflow. Disk space can be the most limiting resource for the annotation step, as each annotated genome produces ~2 GB of data. Therefore, it is recommended to have a fair amount of free space depending on the number of genomes to be annotated.
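
For example, you can check the free space on the file system you plan to write to before starting the annotation:

# Check available disk space on the file system of the current directory
df -h .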

Reference genomes analyzed using this pipeline

Please find below the links to the reference genomes analyzed using StandEnA:

Output annotation files for these genomes will be available shortly.

Workflow steps

StandEnA is divided into 4 steps:

Step 1 - Compiling Protein sequences for the custom database from NCBI, KEGG and other databases

Step 2 - Generating a custom database and annotating genomes using Prokka with this custom database

Step 3 - Generating the Reference File for enzymes used in the annotation and standardizing protein names in Prokka results

              Step 3.1 - Generating reference files and query files for each enzyme standard name

              Step 3.2 - Performing queries and standardizing annotation results

Step 4 - Generating matrix of standardized presence absence

Step 1 Compiling Protein sequences for the custom database from NCBI, KEGG and other databases

This step is necessary to extend the scope of proteins that Prokka uses by default to annotate genomes. For that, the steps below prepare a list of synonyms for all enzymes of interest to be downloaded from NCBI. This database will be called the custom database (custom_db.faa) in later steps.

Step 1.1 Using KEGG API to retrieve enzyme synonym names

Make the new working directory and switch to this directory:

mkdir 01_customdb
cd 01_customdb

Note that the directory structure in StandEnA is very important for the execution of later steps. As indicated in the examples directory, each directory should be created within the parent StandEnA directory. For flawless execution of next steps, users should not create these directories elsewhere or in another daughter directory within StandEnA.

Step 1.1.1 Preparing a tab separated values file following the example file

Compiling information from literature and KEGG database for the pathways/enzymes of interest to be used in uniq_ec.tsv file

Users are advised to identify their pathways of interest from the literature and to use the KEGG database to obtain information about the component enzymes. For example, the example pathway used in this repository is benzene/toluene degradation (KEGG Module M00547). From this module page, the enzyme EC numbers were compiled by clicking on each KO number (K identifier) listed under "Definition", such as K03268, and obtaining the EC numbers of the enzyme orthologs from the KO number webpage. For more information about EC numbers and KO identifiers, please refer to this page. The EC numbers are listed under the "Name" section of each KO number page. To check the exact identity of each EC number, refer to their separate KEGG information pages (accessed by clicking on the corresponding EC number on the KO number page), such as the page for 1.14.12.3. Depending on the aim of your analysis, you can choose to include all or a particular subset of the EC numbers listed on a single KO number page in the uniq_ec.tsv file. In this case, a different EC number should be used depending on your starting compound (for benzene 1.14.12.3, for toluene 1.14.12.11, for chlorobenzene 1.14.12.26). If all three starting compounds are to be analyzed, then all three of these EC numbers should be included in the uniq_ec.tsv file. Please note that each row of the uniq_ec.tsv file should include only one EC number, meaning there can be multiple rows for the same enzyme standard name but with different EC numbers.
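
If you prefer the command line to the KEGG web pages, the same entries can be inspected through the KEGG REST API that is also used elsewhere in this workflow; for example (the module and KO identifiers below are the ones mentioned above):

# Inspect the example module and one of its KO entries from the command line
curl -s https://rest.kegg.jp/get/M00547 | less
# The NAME line of a KO entry lists the enzyme name and its EC number(s)
curl -s https://rest.kegg.jp/get/K03268 | grep NAME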

If a KEGG module is not available for your desired pathway, the KEGG database can still be used to compile each step of the reaction from the separate enzyme information pages, provided the names of these enzymes are known.


uniq_ec.tsv is the example file. A tab-separated values file is a text format similar to a comma-separated values file where, instead of a comma, a tab character is used to separate the fields. Please find more information on this file structure here.

Columns in the uniq_ec.tsv file should contain, in order (i.e., from left to right): unique enzyme ID, pathway, pathway step ID, enzyme name, and EC number. This column arrangement is extremely important because some of these columns will be used to organize the download of protein sequence files in step 1.2. Although a single enzyme might be given multiple EC numbers by databases, each row of uniq_ec.tsv must contain only one EC number. Therefore, users should enter each EC number as a separate row, repeating the corresponding unique enzyme ID, pathway, pathway step ID, and enzyme name columns. Please read the above heading about compiling information to be used in the uniq_ec.tsv file carefully before proceeding with the next steps.

Note that the unique enzyme ID and pathway step ID are provided by the user for their pathway of interest. In the example file, the unique enzyme ID follows the convention E01, E02, etc., while the pathway step ID is named according to the pathway number and the step at which the enzyme acts (e.g., pathway 1 step 1 is 1.1). Depending on the user's preferences, other naming conventions can be used instead, provided that the column order does not change. However, each ID needs to be unique and must follow a consistent alphanumeric naming convention with no whitespace characters within the names.
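
For illustration only, a uniq_ec.tsv following this convention could contain tab-separated rows such as the ones below (the IDs and pathway label are hypothetical placeholders, not taken from the example file):

E01	benzene_toluene_degradation	1.1	benzene 1,2-dioxygenase	1.14.12.3
E02	benzene_toluene_degradation	1.1	toluene dioxygenase	1.14.12.11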

Please be aware that carriage return characters might exist at the ends of lines depending on how you generated the file (e.g., on Windows). If these carriage returns are not removed, they cause problems in the steps that query the KEGG and NCBI APIs (e.g., step 1.4).

To check for carriage returns, run:

cat -v uniq_ec.tsv

If there are any, the symbol ^M will appear at the end of the lines. To remove the carriage returns:

mv uniq_ec.tsv temp_file.tsv
sed -e "s/\r//g" temp_file.tsv > uniq_ec.tsv

Now, the file is free of carriage returns.

Step 1.1.2 Retrieving the synonyms for each enzyme from KEGG API

After preparing this file, retrieve the enzyme/protein synonyms for each name in the uniq_ec.tsv file from KEGG via its API using this code:

# Retrieve synonyms from KEGG API
cut -f5 uniq_ec.tsv | while read line; do out=$(curl -s https://rest.kegg.jp/list/ec:$line); echo $line $out; done > ec_synonyms.txt

# Combine tables
paste uniq_ec.tsv <(cut -f3- -d' ' ec_synonyms.txt) > synonyms_table.tsv

Please note that any typo or extra character (e.g. space) in the EC number field of the uniq_ec.tsv file may cause no synonyms to be returned from KEGG API.

Step 1.2 Preparing the list of synonyms for NCBI Edirect

Parsing the enzyme synonyms to write down one synonym per line and removing "gene name", "incorrect" and "misleading" flags retrieved with the synonyms:

# Parsing the synonyms and removing output that can't be used 

perl -ne 'chomp; @fields=split("\t",$_); @syn=split(";",$fields[4]); unless(scalar(@syn)==0){foreach(@syn){print join("\t",@fields[0..3]),"\t$_\n"}}else{print "$_\t$fields[2]\n"};' <(cut -f1,3- synonyms_table.tsv) | sed -e 's/\t /\t/g' | grep -v "incorrect\|gene name\|misleading" > synonyms_per_line.tsv

For enzymes that did not return synonyms for any reason, manually insert the enzyme names you know into column 5 of synonyms_per_line.tsv. Enzymes with ambiguous EC numbers are expected not to return any synonyms (e.g., E-phenylitaconyl-CoA hydratase with EC number 4.2.1.-, where the hyphen indicates that no single EC number is assigned). Users should manually insert the known enzyme names in this case.
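
To quickly list rows that still lack a synonym, the following sketch can be used (it assumes the five-column layout described above, with the synonym in column 5):

# Print the enzyme ID and enzyme name of rows whose synonym field is missing or empty
awk -F'\t' 'NF < 5 || $5 == "" {print $1 "\t" $3}' synonyms_per_line.tsv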

Now, the following command creates a new first column with a unique ID for each synonym; these IDs will be used to name the FASTA files when using Edirect:

cat synonyms_per_line.tsv | perl -ne '$line=sprintf("%03d",$.); @fields=split("\t",$_); $synid="S$line-$fields[0]-$fields[3]"; if($fields[3] eq "NA"){print "$synid\t",join("\t",@fields[0..3]),"\t$fields[2]\n"}else{print "$synid\t$_"}' > id_synonyms_per_line.tsv

Step 1.3 Manually curating the list

Look through id_synonyms_per_line.tsv and check for potential problems (e.g., names that are too short and can cause ambiguity).

The following command removes synonyms with fewer than 6 characters to avoid ambiguity when querying NCBI.

# This perl code writes the changes in the same file
perl -ne 'chomp; @fields=split("\t",$_); $fields[5] =~ tr/ //d; unless(scalar(split("",$fields[5]))<=5){print "$_\n"};' id_synonyms_per_line.tsv > tmp; mv tmp id_synonyms_per_line.tsv 

Note that KEGG synonyms containing ";" in their names will be separated as different synonyms. The user is advised to check for such instances and manually curate the id_synonyms_per_line.tsv file to remove ";" characters.

Step 1.4 Using a custom perl script to download proteins from NCBI Edirect API

This custom script is based on the following example application: Sample Applications of the E-utilities - Entrez Programming Utilities Help - NCBI Bookshelf

The following code will run the bulk_edirect_custom.pl script.

If you are on a high-performance computing cluster that uses an environment module system (e.g., with SLURM as the workload manager), do not forget to load the required modules. Otherwise, skip the command below.

# Loading module
module load foss/2019b Perl/5.30.0

To run the script on your high-performance computer cluster or local computer, execute this code:

# Create a folder to store downloaded fasta files
mkdir edirect_fasta

# Running the script
echo "START: $(date)"; cat id_synonyms_per_line.tsv | while read line; do id=$(echo "$line" | cut -f1); reac=$(echo "$line" | cut -f6 ); perl ../scripts/bulk_edirect_custom.pl "$reac" protein $id edirect_fasta/ >> log.txt 2>> err.txt; done; echo "  END  : $(date)";

Since it can take hours or even days depending on the size of your list, we recommend running this with the help of another tool (e.g. "screen"). To get more information on the screen tool, visit this website.
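
A minimal example of using screen for this purpose (the session name is arbitrary):

# Start a named screen session and run the download loop from above inside it
screen -S edirect_download
# Detach from the session with Ctrl-a d; reattach later to check progress
screen -r edirect_download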

This code outputs the protein sequence files (compiled within the 01_customdb/edirect_fasta/ directory) along with log.txt and err.txt, which contain the log of the actions performed and any errors encountered during the process, respectively.

Do not try to run many instances in parallel (e.g., multiple id_synonyms_per_line.tsv files used to access the Edirect API at the same time). This may cause NCBI to blacklist your IP address, in which case the log.txt file from this step may contain "RESULTS: ERROR" output for your queries. If this is the case, stop the execution and retry at a later time. If the error persists, you may need to contact NCBI Edirect services.

Note that in each .faa file in the edirect_fasta/ directory, sequences are separated by an empty line, as seen here. Although this is not the usual FASTA format, the remaining code works with or without these empty lines. Therefore, when assembling manual_download_fasta/, the same convention of separating sequences with an empty line may or may not be used, depending on the preference of the user.

Step 1.5 Adding missing proteins to custom database through OrtSuite-mediated searching in KEGG or manual downloading from other databases

This step is necessary if there are enzymes that were not collected by Edirect in the previous step or if additional proteins should be downloaded from other databases. Otherwise, you can skip this step and continue with the Edirect-downloaded proteins to perform the later steps.

For example, you can retrieve proteins that were not downloaded by Edirect in the step above directly from KEGG Orthology using their EC numbers or KO identifiers. For this purpose, the workflow uses OrtSuite.

Create a folder to store manually downloaded fasta files

mkdir manual_download_fasta/
Step 1.5.1 Downloading proteins using OrtSuite

For this step to be executed, OrtSuite must be installed as described here. For further information on how to use the download_kos command used in this step, we recommend checking the OrtSuite GitHub.

Note that you must activate the OrtSuite environment in conda as shown in the OrtSuite GitHub if you followed the conda installation here. Example files for ecs.txt and kos.txt can be found in OrtSuite GitHub and in the ecs.txt and kos.txt files for the example case used in this pipeline.

# Downloading protein sequences based on a list of EC numbers
download_kos -o manual_download_fasta/ -e ecs.txt > log_ecs.txt 2> err_ecs.txt

# Downloading protein sequences based on a list of KO numbers
download_kos -o manual_download_fasta/ -k kos.txt > log_kos.txt 2> err_kos.txt

Note that if the same list of proteins is downloaded using both EC and KO numbers, the initially downloaded files will be overwritten, because the file naming convention for this step uses the KO identifiers for both methods. It is therefore very important to save the second set of files to a new directory (e.g., manual_download_fasta_new/) to avoid losing previously downloaded proteins.
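
One way to follow this advice is to write the KO-based download into its own directory (the directory and log file names below are only suggestions):

# Download the KO-based sequences into a separate directory so nothing is overwritten
mkdir manual_download_fasta_new/
download_kos -o manual_download_fasta_new/ -k kos.txt > log_kos_new.txt 2> err_kos_new.txt
# Remember to include this extra directory when building custom_db.faa in step 2.2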

Step 1.5.2 Manually downloading proteins from other sources

We also demonstrate the user customizability and flexibility of StandEnA's custom database creation by manually downloading a few protein sequence files from various databases. This step is important when the automated retrieval in the previous steps does not work for some of the desired proteins or when there are other specific databases from which you want to retrieve protein sequences.

Manually search for proteins in various databases (e.g., UniProt, NCBI, and KEGG) and save them into FASTA files. Note that protein sequences are found within files with the .faa extension, and this file type is recommended for this step. However, in some cases, databases may only provide the option of downloading protein sequence files in the generic .fasta or .fa format. In this case, make sure that the contents of these files are protein sequences rather than nucleic acid sequences. For more information on different FASTA formats and file extensions, visit this page.
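
As a rough sanity check (the file name below is hypothetical), protein sequences normally contain amino-acid letters that never occur in nucleotide sequences, so their absence suggests a nucleotide file:

# If none of these amino-acid-only letters occur, the file may contain nucleotide sequences
grep -v '^>' my_protein.faa | grep -qi '[EFILPQZ]' && echo "looks like protein" || echo "check: may be nucleotide"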

# Example
 01_ana_benzene_carboxylase.faa
 02_benzoyl-coa_bamD.faa
 02_benzoyl-coa_bamE.faa
 02_benzoyl-coa_bamF.faa
 02_benzoyl-coa_bamG.faa
 02_benzoyl-coa_bamH.faa
 02_benzoyl-coa+reductase+bami.faa
 04_uniprot-Nitric+oxide+dismutase+(putative).faa

Note that these sequences must be saved to the manual_download_fasta/ directory to be used in later steps.

Output files generated in step 1: 01_customdb/id_synonyms_per_line.tsv, files in the 01_customdb/edirect_fasta/ directory, and files in the 01_customdb/manual_download_fasta/ directory

Step 2 Generating a custom database and annotating genomes using Prokka with this custom database

In this part, we are going to annotate our genomes using Prokka with the additional custom database to be created from the downloaded proteins in step 1.

Step 2.1 Creating the work directory

Create the directory where Prokka annotation results will be saved to.

cd ..
mkdir 02_annotation
cd 02_annotation

Note that this directory should not be created within the 01_customdb/ directory but beside it, under the same parent (StandEnA) directory. The directory organization is exemplified here. The same directory organization must be followed throughout the pipeline.

Step 2.2 Creating the custom database

In this step, we combine all of the downloaded proteins into a custom database.

# Example (make sure to check the extensions of the files so you can use the wildcard)
cat ../01_customdb/edirect_fasta/*.faa ../01_customdb/manual_download_fasta/*.faa > custom_db.faa

Note that if some of the files downloaded in the previous step have different file extensions (e.g., the download_kos function downloads files with the .fa extension), they will not be added to custom_db.faa by this code. Change the above code as follows if there are files with the .fa extension:

# Example if there are .faa and .fa files in the manual_download_fasta/ directory
cat ../01_customdb/edirect_fasta/*.faa ../01_customdb/manual_download_fasta/*.faa ../01_customdb/manual_download_fasta/*.fa > custom_db.faa

For more information on different permitted file extensions, refer to step 1.5.

Step 2.3 Testing the custom database

Please make sure that the std_enzymes conda environment is setup and activated before running this step. The std_enzymes conda environment must be activated during all steps of this workflow (i.e., Step 1 through Step 4).

Run this command to check your conda environments:

conda env list

std_enzymes must be listed in the output. If not, please refer back to the Installation Instructions steps.

Run this code to activate the environment:

conda activate std_enzymes

Then, to test if the newly created database is "flawless", simply run "makeblastdb" on it.

makeblastdb -dbtype prot -in custom_db.faa -out custom_db

If you get an error like "BLAST Database creation error", something might be wrong with your database. In our case, we identified some problematic lines and manually removed those.

In this custom_db.faa database, there will be many protein sequences, where each sequence starts with a ">" character and different protein sequences are separated by newline characters. An example problem might be a missing newline character between two protein sequences. This situation can be identified and fixed using this code:

# Looking for "problematic" lines
sed -e  's/>/\n>/g' custom_db.faa | grep -P "\w>"
# Fixing those lines
sed -i  's/>/\n>/g' custom_db.faa

Please be aware that this may not fix your specific problem. If this is the case, refer to this page to see the desired custom_db.faa format and manually make changes on your file, accordingly.

Step 2.4 Shortening contig names

Prokka does not work when FASTA sequences have long headers (> 20 characters long). Therefore, it is necessary to rename the headers of the FASTA files which will be used as input for annotation.

# Create a directory to store the renamed genomes
# Here we call it "short"
mkdir short

# Move to the folder where your genomes are
cd /path/to/genomes

# This awk command does the trick of renaming the fasta headers
# awk takes each fasta as input and outputs it to the directory short
for i in *; do awk '/^>/{print ">contig" ++i; next}{print}' < $i > path/to/02_annotation/short/"short_"$i; done

Step 2.5 Running Prokka on genomes using the custom database

There are 2 alternative methods to run Prokka on genomes depending on your computational resources:

2.5.1 Running Prokka on your local machine

Changing working directory to the directory containing the shortened genomes (from previous step):

cd /path/to/short

Note that this step can only be done after Prokka is installed within the std_enzymes conda environment as described under the Installation instructions. Please make sure that the std_enzymes conda environment is activated as stated in the instructions to proceed with this step.

Running Prokka for genomes:

for k in *.fa; do prokka $k --outdir prokka_out/"$k".prokka.output --prefix PROKKA_${k##*/} --proteins "custom_db.faa" --norrna --notrna --cpus 4 ; echo $k; done

Note that, depending on your machine's resources, you can change the number of CPUs used by Prokka by substituting another number in place of 4 next to the --cpus option (e.g., --cpus 8 to use 8 CPUs). See the Prokka help page for detailed information on the flags used in the above code. This step puts all genome annotation files into 02_annotation/short/prokka_out/.

2.5.2 Running Prokka on a high-performance computing cluster

The following for loop will submit a job, for each genome to be annotated, on a high-performance computing cluster using SLURM as a workload manager.

# Define working directory
workdir=/path/to/02_annotation
# Define prokka submission script
subscript_prokka=/path/to/prokka_on_bioindicators_v2.sh

for i in *.fa; do qsub -N $i $subscript_prokka $workdir/rep_set/short/$i prokka_out/"out_$i" $i; done

The submission script (which has the commands used) is available here.

Please note that this submission script contains path/to/ fillers in place of personal data and paths in the server. All of that must be replaced by your own information before running the script.

Step 2.6 Compiling all Prokka annotation results into a single file

The most important output for our further analysis is the ".tsv" file (read about Prokka output files here). A tab-separated values file is a text format similar to comma-separated values files where, instead of a comma, a tab character is used to separate different fields. Please find more information on this file structure here.

Concatenate all your ".tsv" outputs together.

# Concatenate files while printing filenames to first column
awk '{print FILENAME "\t" $0}' /path/to/prokka_out/*/*.tsv > prokka_all.tsv

Note that "/path/to/prokka_out/ * / * .tsv" is "02_annotation/short/prokka_out/ * / * .tsv" if the code has been run on local machine.

Optional but recommended step: formatting genome names in the results.

sed -ri -e 's/^out_short_//' -e 's/\/PROKKA_[0-9]+\.tsv//' prokka_all.tsv
sed -ri '1 s/\S+/bin_id/' prokka_all.tsv

Output files generated in step 2: the custom protein database 02_annotation/custom_db.faa and the compiled annotation file 02_annotation/prokka_all.tsv (an example is provided as example_prokka_all_results.tsv)

Step 3 Generating the Reference File for enzymes used in the annotation and standardizing protein names in Prokka results

If you have many different pathways, we suggest doing this part separately for each pathway. That way, you will handle fewer proteins at a time, which makes the process faster. Additionally, since some steps are manual, handling too many proteins at once may be confusing for users and may lead to errors.

Step 3.1 Generating reference files and query files for each enzyme standard name

Step 3.1.1 Subsetting data to work with one pathway at a time

Creating directory for this part:

mkdir ../../03_standardization
cd ../../03_standardization

Note that the working directory in step 2.6 is 02_annotation/short/.

The 03_standardization/ directory should not be created within the 02_annotation/ directory but beside 01_customdb/ and 02_annotation/, under the same parent directory. The directory organization is exemplified here. The same directory organization must be followed throughout the pipeline.

When working with pathway of interest "1", start the names of the working directory and other files with "pw_1" as the naming convention.

Creating directory for pathway 1:

mkdir pw_1
cd pw_1

Note that for each pathway, a new directory should be created within 03_standardization/ to get a file structure of: 03_standardization/pw_N. Hence, steps after this point should be executed within the corresponding pw_N directory.

Saving unique enzyme standard names and their synonyms for this pathway (which were already compiled into the 01_customdb/id_synonyms_per_line.tsv in step 1.2):

grep -P "\t1\.\d+\t" ../../01_customdb/id_synonyms_per_line.tsv | cut -f4,5,6 | cut -f1,3 | sort | uniq > pw_1.txt

For this example, enzymes within pathway 1 have a pathway step ID starting with "1.". Hence, to retrieve these, the 01_customdb/id_synonyms_per_line.tsv file is searched for the pattern "\t1.\d+\t" by this code. For example, if pathway 2 is to be retrieved, the above line should be updated to search for "\t2.\d+\t" (i.e., enter "\t2.\d+\t" instead of "\t1.\d+\t") and the output file should be named accordingly (pw_2.txt).
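
For instance, assuming the same step-ID convention, the corresponding command for pathway 2 would be:

# Same extraction as above, but matching pathway step IDs that start with "2."
grep -P "\t2\.\d+\t" ../../01_customdb/id_synonyms_per_line.tsv | cut -f4,5,6 | cut -f1,3 | sort | uniq > pw_2.txt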

Note that the enzyme list for reference file and query file formation is derived from the 01_customdb/id_synonyms_per_line.tsv file that was used to download protein sequences using Edirect API in step 1.4.

Note that query formation in later steps from pw_1.txt will give an error if the file contains names with "/". These must be replaced by another character (e.g., replace "/" with "_"). Execute this code to replace all "/" with "_":

#Replace all "/" with "_"
mv pw_1.txt temp_file.txt
sed -r "s/[/]+/_/g" temp_file.txt > pw_1.txt

#Remove intermediate files
rm temp_file.txt

IMPORTANT NOTE

Since the OrtSuite-mediated KEGG API download and manual download steps are performed after the Edirect download and are optional, if these download steps are used to retrieve sequences that are not included within id_synonyms_per_line.tsv, additional steps should be performed to account for these protein names. In the steps below, there are additional commands to be executed to add OrtSuite-mediated KEGG API downloaded proteins (from EC numbers) to the required files and directories. These steps can be modified by the user if other download methods were used (e.g., OrtSuite-mediated KEGG API downloaded proteins from KO identifiers).

The indicated steps (lines/sections involving OrtSuite-retrieved files between steps 3.1.2 and 3.1.3) can be skipped altogether if:

a) there are no OrtSuite-mediated KEGG API downloaded proteins (from EC numbers) or proteins manually downloaded from other databases

OR

b) all enzyme information (including the later manually downloaded ones) was inputted to the pipeline via the initial uniq_ec.tsv file in step 1.1 which is used to generate the id_synonyms_per_line.tsv file in step 1.2

Step 3.1.2 Dividing pathways into separate files for each enzyme/protein and collecting them in the queries directory

Each pathway will have a "queries" directory of files containing the enzyme synonyms to be searched for in the Prokka annotation in step 3.2.

Creating directory for storing queries:

mkdir queries

Printing each collection of synonyms (i.e., each standard name will have one query file with many synonyms describing the same standard name) to a different file:

cat pw_1.txt | while read -r l; do line=$l; col2=$(echo "$l" | cut -f2); name=$(echo "$l" | cut -f1 | tr ' ' '_'); echo $col2 | tr -d '[]()*'| tr '[:upper:]' '[:lower:]' >> queries/$name.txt ; done

Note that this step removes all brackets and changes the queries to lowercase letters for uniformity.

Removing duplicated synonyms:

for i in queries/*; do sort $i | uniq > queries/tmp; mv queries/tmp $i; done
Example query file formation for OrtSuite-mediated KEGG API downloaded proteins (from EC numbers)

To be able to form query files, the EC numbers in ecs.txt must be used to retrieve enzyme names and synonyms. The code below is a variation of step 1.1 demonstrating the customizability and flexibility of the pipeline depending on specific decisions made by the user about protein sequence downloading. Please refer back to step 1 for detailed explanations.

From the ecs.txt file, the enzyme synonyms and standard names are retrieved so that they can be added to files in the queries folder:

cat path/to/ecs.txt | while read line; do out=$(curl -s https://rest.kegg.jp/list/ec:$line); echo $line $out; done > ortsuite_ec_synonyms.txt

For later steps to be executed smoothly, the user should manually curate the ortsuite_ec_synonyms.txt file to remove any EC numbers or synonyms that were retrieved from KEGG but are irrelevant to the analysis (this decision must be made by the user). After the relevant synonyms for each standard name have been collected into ortsuite_ec_synonyms.txt, we advise you to manually create a file similar to uniq_ec.tsv containing the unique enzyme ID, pathway, pathway step ID, enzyme name, and EC numbers (refer to step 1.1). Note that the enzyme name is the first name retrieved in each line of the ortsuite_ec_synonyms.txt file, and the unique enzyme ID, pathway, and pathway step ID fields should be entered following the same naming convention as the uniq_ec.tsv file. The example final file can be found here: ortsuite_uniq_ec.tsv.

Then, the steps for the creation of ortsuite_id_synonyms_per_line.tsv are followed from step 1.2:

paste ortsuite_uniq_ec.tsv <(cut -f3- -d' ' ortsuite_ec_synonyms.txt) > ortsuite_synonyms_table.tsv
perl -ne 'chomp; @fields=split("\t",$_); @syn=split(";",$fields[4]); unless(scalar(@syn)==0){foreach(@syn){print join("\t",@fields[0..3]),"\t$_\n"}}else{print "$_\t$fields[2]\n"};' <(cut -f1,3- ortsuite_synonyms_table.tsv) | sed -e 's/\t /\t/g' | grep -v "incorrect\|gene name\|misleading" > ortsuite_synonyms_per_line.tsv
cat ortsuite_synonyms_per_line.tsv | perl -ne '$line=sprintf("%03d",$.); @fields=split("\t",$_); $synid="S$line-$fields[0]-$fields[3]"; if($fields[3] eq "NA"){print "$synid\t",join("\t",@fields[0..3]),"\t$fields[2]\n"}else{print "$synid\t$_"}' > ortsuite_id_synonyms_per_line.tsv
perl -ne 'chomp; @fields=split("\t",$_); $fields[5] =~ tr/ //d; unless(scalar(split("",$fields[5]))<=5){print "$_\n"};' ortsuite_id_synonyms_per_line.tsv > tmp; mv tmp ortsuite_id_synonyms_per_line.tsv

From this file, queries can be added to the queries directory following the same steps used for the id_synonyms_per_line.tsv file above (starting from step 3.1.1), changing the input file to ortsuite_id_synonyms_per_line.tsv. Note that the file names for the OrtSuite-mediated KEGG API downloaded proteins should be distinguished from the Edirect-downloaded proteins in the query directories by using a specific naming convention (e.g., queries/ortsuite_$name.txt so that the file names include "ortsuite"). The same naming convention is advised for the intermediate files (e.g., ortsuite_pw_1.txt) to prevent overwriting the files created for the Edirect-downloaded proteins.

Step 3.1.3 Collecting standard database identifiers about the enzyme names used during annotation from KEGG to generate a reference file

The goal of this step is to generate the file "kegg_info.txt" for the given pathway. This file can be used as a reference while manually curating the protein names during the presence - absence matrix generation in step 4.2.

Gathering unique EC numbers for the pathway:

grep -P "\t1\.\d+\t" ../../01_customdb/id_synonyms_per_line.tsv | cut -f4,5 | sort | uniq > unique_pw_ec.tsv

Note that if pathway 2 is to be retrieved, the above line should be updated to search for "\t2.\d+\t".

Note that for some enzyme names in id_synonyms_per_line.tsv there might be no EC number. In this case, the user should use the commands below to remove any "NA" or "-" identifier:

grep -v "\.-" unique_pw_ec.tsv | grep -v "NA$" > temp.tsv
cat temp.tsv > unique_pw_ec.tsv

Collecting their corresponding EC numbers and KO identifiers from KEGG API:

cat unique_pw_ec.tsv | cut -f2 | while read l; do curl -s https://rest.kegg.jp/link/ko/ec:$l; done | sort -k1,2 | uniq | grep -v "^$" > pw_ec_kos.txt

Collecting KO definitions (protein names):

cut -f2 pw_ec_kos.txt | while read l; do def=$(curl -s https://rest.kegg.jp/get/$l | grep NAME | cut -f3- -d " "); paste <(echo $l) <(echo $def); done > kos_def.txt

Collecting main EC name (enzyme name):

cut -f1 pw_ec_kos.txt | while read l; do def=$(curl -s https://rest.kegg.jp/list/$l | cut -f2 | cut -f1 -d ";"); paste <(echo $l) <(echo $def); done > ec_name.txt

Combining all in a single file:

paste ec_name.txt kos_def.txt > pw_1_kegg_info.txt

Cleaning intermediate files (optional):

rm pw_ec_kos.txt kos_def.txt ec_name.txt
Example reference file formation for OrtSuite-mediated KEGG API downloaded proteins (from EC numbers)

Collecting EC numbers and KO identifiers from KEGG API:

cat ecs.txt | while read l; do curl -s https://rest.kegg.jp/link/ko/ec:$l; done | sort -k1,2 | uniq | grep -v "^$" > ortsuite_pw1_ec_kos.txt

After this step, the ortsuite_pw1_ec_kos.txt file can be used in place of pw_ec_kos.txt in step 3.1.3. Note that the output and input file names for each of the above steps must be changed to prevent overwriting the reference files generated for the Edirect-downloaded proteins listed in id_synonyms_per_line.tsv. The suggested naming convention for these files is: ortsuite_kos_def.txt, ortsuite_ec_name.txt, ortsuite_pw_1_kegg_info.txt.

Output files generated in step 3.1: reference files with standard database identifiers, 03_standardization/pw_N/ortsuite_pw_N_kegg_info.txt and 03_standardization/pw_N/pw_N_kegg_info.txt, along with the files in the 03_standardization/pw_N/queries/ directory

Step 3.2 Performing queries and standardizing annotation results

Step 3.2.1 Performing queries of the Prokka annotation using files in queries directory and dumping results into files

Creating a results directory and changing to the queries directory:

mkdir results
cd queries

Note: manually remove brackets "[]" and parentheses "()" from the queries (if there are any). To test for them:

grep -lFe [ -lFe ] -lFe "(" -lFe ")" *.txt

If there are file contents with brackets, the user is advised to use this code to remove the brackets but keep the original files for later reference. Note that this step can be directly skipped if there are no brackets in the queries.

#This will put the files in a subdirectory called new and use these files for the rest of the steps
mkdir new
for i in *.txt; do cat $i | tr -d '[]()' > new/$i; done
#Change to new/ to execute below steps
cd new/

For each query file, search the complete Prokka results for its terms. This step connects the standard names generated in the previous steps with the protein annotation files generated by Prokka in step 2.

for i in *.txt; do grep -i -f $i path/to/02_annotation/short/prokka_all.tsv > path/to/results/result_$i; done

Moving results to a folder with complete results:

cd path/to/results
mkdir complete
mv *.txt complete

Summarizing result files by removing duplicates (unique):

mkdir unique
cd unique
for i in ../complete/*.txt; do cut -f8 $i | sort | uniq > $i.uniq; done; mv ../complete/*.uniq .

Now, the files for standardization should be ready.

Open each one of the ".uniq" files and include a new first column. Note that .uniq files use tab as the delimiter. A tab-separated values file is a text format similar to comma-separated values files where, instead of a comma, a tab character is used to separate different fields. Please find more information on this file structure here.

Add a first column to each line and manually annotate the standard name corresponding to the protein name in the second column. If the second-column cell does not relate to your protein, add REMOVE (without quotation marks) to the first column.

Example .uniq file:

Example image
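
As a plain-text illustration (the annotation strings below are hypothetical; the two columns are separated by a tab):

benzene 1,2-dioxygenase	benzene 1,2-dioxygenase subunit alpha
benzene 1,2-dioxygenase	benzene 1,2-dioxygenase ferredoxin component
REMOVE	hypothetical protein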

Note that, depending on the specific protein headers present in the custom database generated in step 2, Prokka annotation step can produce relatively "clean" outputs. This means that the first column that you add might contain the same information outputted by Prokka on the second column. An example "clean" .uniq file image is shown below:

Example image


Grouped Presence-Absence Matrix Option: If the intention is to produce a final presence-absence matrix that groups all subunits/components of the same enzyme together (as opposed to displaying them as separate entities), then the grouped standard name can be used instead of the standard name provided in KEGG. In this case, step 3.2.1 can be repeated partly by copying the initial unique files into a grouped_unique directory (within the 03_standardization/pw_N/results/ directory) and manually changing the standard name column as desired.

# Go back to the results directory and create a directory for the grouped files
cd ..
mkdir grouped_unique
# Copy all files from the unique/ directory into this new directory
cp unique/*.uniq grouped_unique

According to your preference, group the enzymes in grouped_unique by adding the same standard name for their first column as indicated above. For example, if there are multiple files formed for different nitrate reductase subunits and you want to group them (nitrate reductase alpha subunit, nitrate reductase beta subunit, nitrate reductase gamma subunit, nitrate reductase ambiguous) under the standard name "nitrate reductase":

Example grouped unique 1

Example grouped unique 2

Example grouped unique 3

Example grouped unique 4

Note that these enzymes are listed with their proposed synonyms in different .uniq files initially and that this grouping decision is made by the user by inserting the same standard group name in the first column of these files.


Regardless of the user's choice of a grouped or individual presence-absence matrix, for the flawless execution of later steps, all .uniq files must have 2 columns separated by a tab character. Since the above steps are manual, users should check the column number in their files. Below is an example command that can be used for this purpose.

#Changing to the unique directory
cd unique
# Each line of the output should be 2; any number different from 2 indicates a problem
for i in *.uniq; do awk -F'\t' '{print NF}' $i; done > ../columns.txt

If the above code outputs numbers different from 2, then the .uniq files should be checked again to make sure that there are 2 columns which are tab separated.

Step 3.2.2 Combining standard names to Prokka annotation results (standardization of Prokka annotation)

Moving to results directory (going back to 03_standardization/pw_1/results/):

cd ../

Creating directory to store standardized results:

mkdir standardized

Adding standardized results to last column for each enzyme:

cd complete
for i in *; do python3 ../../../../../scripts/add_standard_names.py "../unique/$i.uniq" "$i" >> ../standardized/results_pw_1.txt; done

Note: The add_standard_names.py script is available here.

Output files generated in step 3.2: the 03_standardization/pw_N/results/standardized/results_pw_N.txt file, which is the standardized version (i.e., containing standard enzyme names) of the Prokka annotation file from step 2.6

Step 4 Generating matrix of standardized presence absence

Step 4.1 Combining all standardized Prokka results

Creating directory for this part and changing working directory to this:

mkdir 04_presabs
cd 04_presabs

The 04_presabs/ directory should not be created within the 03_standardization/ directory but beside 01_customdb/, 02_annotation/, and 03_standardization/, under the same parent directory. The directory organization is exemplified here. The same directory organization must be followed throughout the pipeline.

Note that at the end of step 3.2.2 the working directory is 03_standardization/pw_1/results/complete/ (at the same depth as standardized/). The relative path from this point to the 04_presabs directory is:

mkdir ../../../../04_presabs

Combining all standardized results

cat ../03_standardization/pw_*/results/standardized/* > std_results_all.txt

Step 4.2 Manually preparing file of protein/enzyme names to be used for generating the presence - absence matrix

Prepare a tab separated file with two columns which contains the protein ID and its corresponding protein standard name. These IDs will appear in the presence - absence matrix output at the end of this step and may be identical to or different from the enzyme ID convention used in step 1.1. A tab-separated values file is a text format similar to comma-separated values files where, instead of a comma, a tab character is used to separate different fields. Please find more information on this file structure here.

Below is the example file (ids_to_names.tsv):

CP1001	benzene dioxygenase, alpha subunit
CP1002	benzene dioxygenase, beta subunit
CP1003	benzene dioxygenase, ferredoxin component
CP1004	benzene dioxygenase, ferredoxin reductase component

Note that the protein IDs depend on the preference of the user. Here we suggest an ID convention of CP1001, CP1002, etc. Protein names to be used for the presence absence matrix must match the names used in the standardization of the Prokka annotations in step 3.2.1. As a guideline, the first column (enzyme standard names) of the files in the results/unique/ directory from step 3.2.1 can be copied to this file. For additional information, please check your kegg_info.txt file for Edirect-downloaded proteins (an example file can be found under the name pw_6_C_kegg_info.txt) and ortsuite_pw_1_kegg_info.txt for proteins downloaded from KEGG using OrtSuite, both generated in step 3.1.3.

Please note that the standard names listed in the ids_to_names.tsv file should not contain parentheses or brackets ("(", ")", "[", "]") and should contain lowercase characters only.


Grouped Presence-Absence Matrix Option: If the intention is to produce a final presence-absence matrix that groups all subunits/components of the same enzyme together (as opposed to displaying them as separate entities as seen in the example (ids_to_names.tsv)), then the grouped standard name can be used instead of the standard name provided in KEGG during the standardization step of this pipeline. In this case, step 3.2.1 can be repeated partly by copying the initial unique files into a grouped_unique directory and manually changing the standard name column as desired. For the example (ids_to_names.tsv), CP1001 to CP1004 can be standardized as "benzene dioxygenase" so these will be grouped together in the final output.

If not done so already, please repeat all the steps starting from the manual curation of .uniq files in step 3.2.1 up until this step to produce the grouped presence-absence matrix. In this case, the grouped_ids_to_names.tsv file might contain:

CP1001	benzene dioxygenase
CP1002	cis-1,2-dihydrobenzene-1,2-diol dehydrogenase
CP1003	phenol 2-monooxygenase
CP1004	2-hydroxymuconic semialdehyde hydrolase

Note that grouped_ids_to_names.tsv file must exactly match with the grouped standard names used in step 3.2.1 while forming the grouped_unique files.


Step 4.3 Running script to generate standardized presence - absence matrix

To prevent mismatches due to case sensitivity or brackets in protein names when comparing the Prokka annotation with the standardized results, it is recommended to convert the Prokka annotation to lowercase and to remove the brackets:

cat ../02_annotation/short/prokka_all.tsv |tr -d '[]()'| tr '[:upper:]' '[:lower:]' > prokka_all_updated.tsv

Using the ids_to_names.tsv file, run the script to generate the presence - absence matrix:

python3 ../../scripts/make_pres_abs.py std_results_all.txt ids_to_names.tsv > presence_absence.csv

Please note that, if a bin/genome/MAG has no annotation for any of the standardized enzymes in the ids_to_names.tsv list, it will not appear in the presence-absence matrix. Hence, if a bin/genome/MAG is missing from the presence-absence matrix output, its name can be added manually as a row of zeros.

Note: The make_pres_abs.py script is available here.

A couple of errors may occur at this step due to problems with the manually prepared input files.

1 - If there is a split function error stating "expected 2 got 1" or "expected 2 got more", this is likely due to a missing or extra tab character within the .uniq files manually curated in step 3.2.1, which is propagated to the std_results_all.txt file. In this case, go back to the files in the 03_standardization/pw_N/results/unique/ directory and manually fix these mistakes. It is advised that the user runs this code to quickly look at the number of columns in each file:

cd ../03_standardization/pw_1/results/unique
for i in *.uniq; do awk -F'\t' '{print NF}' $i; echo $i; done

This code checks the number of tab-separated columns in the .uniq files for pathway 1. To check the rest of the pathway files, change the working directory to the corresponding 03_standardization/pw_N/results/unique/ directory. The output of the code is, for each file searched, the number of columns found in each line followed by the file name. The count must be exactly 2 for every line to avoid errors. Look for lines where the count is less than or more than 2 and go to the specific file where this occurrence was seen to fix the problem. Additionally, the line number within the output corresponds to the line number in the .uniq file (i.e., if the output is 3 for line number 5 in a certain file, you can open that file and look at line 5 to see the problem).

Then, re-do step 4.1 and step 4.3 (except for the directory creation step as the 04_presabs/ is already created).

2 - Another possible error is a "KeyError", which occurs if the ids_to_names.tsv file generated in step 4.2 does not match the standard enzyme names used in the standardization (step 3.2.1). In this case, change the names in the ids_to_names.tsv file to match the standard names in the .uniq files.

Then, re-do step 4.1 and step 4.3 (except for the directory creation step as the 04_presabs/ is already created).

3 - Please note that this pipeline is intended to be run on the same list of enzyme/protein names during the standardization and presence-absence matrix generation steps. If only a subset of the list of standard enzyme names is provided as the ids_to_names.tsv file, there may be problems (specifically a KeyError) in this script. Hence, it is advised to use the same set of enzyme/protein names throughout the workflow.

Output files generated in step 4: 04_presabs/presence_absence.csv file which is the standardized presence absence matrix file

Contributions

Authors of pipeline: Fatma Chafra, Felipe Borim Corrêa, and Ulisses Nunes da Rocha

Principal Investigator: Ulisses Nunes da Rocha

Institution: Microbial Data Sciences group, Helmholtz Center for Environmental Research, Department of Environmental Microbiology, Leipzig, Germany

All feedback is welcome. For errors and bugs, please open a new Issue thread on this GitHub page, and we will try to address them as soon as possible. For general feedback you can contact us at mds@ufz.de.
