This tutorial section covers how to perform searches using multiple types of search terms. In the previous tutorial examples each search used only one type of supported search term. While these searches can be useful, several of the examples showed that relying on a single type of search term can produce false positive or false negative results. A benefit of the ProkFunFind search approach is the ability to combine multiple types of queries in one search. For example, one gene may be best represented by a profile HMM, while another may be part of a well-defined KO or COG group.
In ProkFunFind mixed searches you can incorporate any of the supported search term types in any combination. The only limitation is which files you have available for each genome. For example, to perform searches based on protein sequences and COG identifiers you will need the protein FASTA files and eggNOG-mapper results for each genome being searched.
The two most important files for configuring a mixed search in ProkFunFind are the configuration file and the search-terms file. The configuration file is formatted as in previous examples, but it must include a configuration section for each search term type being used. An example can be seen in the ./queries/mixed-search/config.yaml file:
    ---
    main:
      cluster_tool: DBSCAN
      system_file: systems.json
      search_terms: search-terms.tsv
      faa_suffix: .faa
      gff_suffix: .gff
      fna_suffix: .fna
    DBSCAN:
      cluster_eps: 4
      cluster_min_samples: 2
    hmmer:
      hmmer_query: query.hmm
      hmmer_exec: hmmscan
      hmmer_threads: 1
      evalue: 1e-3
    blast:
      blast_query: query.fa
      blast_exec: blastp
      blast_threads: 1
      evalue: 1e-3
    kofamscan:
      annot_suffix: .kofam.tsv
      threshold: 0.5
    emapper:
      annot_suffix: .emapper.annotations
    interproscan:
      annot_suffix: _InterProScan.tsv
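Because a mixed search needs one input file per method for every genome, it can help to check for missing files before starting. The following is a minimal sketch, not part of ProkFunFind. It assumes that each input file is named by appending the configured suffix to the genome ID inside a single genome directory; the function name `missing_inputs` and the method-to-suffix mapping are hypothetical and only reuse the suffix values from the configuration above.

```python
from pathlib import Path

# Suffix of the input file each search method is assumed to need.
# The suffix values come from the tutorial config; the mapping itself
# is an assumption made for this sketch.
REQUIRED_SUFFIXES = {
    "hmmer": ".faa",
    "blast": ".faa",
    "kofamscan": ".kofam.tsv",
    "emapper": ".emapper.annotations",
    "interproscan": "_InterProScan.tsv",
}


def missing_inputs(genome_dir, genome_id, methods):
    """Return the expected input files that do not exist for one genome.

    Assumes files are named <genome_id><suffix> inside genome_dir.
    """
    missing = []
    for method in methods:
        path = Path(genome_dir) / f"{genome_id}{REQUIRED_SUFFIXES[method]}"
        # hmmer and blast share the protein FASTA, so avoid duplicates.
        if not path.is_file() and path not in missing:
            missing.append(path)
    return missing


if __name__ == "__main__":
    # Genome ID taken from the tutorial; the directory layout is assumed.
    for path in missing_inputs("./genomes", "GTDB18040", ["hmmer", "blast", "emapper"]):
        print(f"missing: {path}")
```

An empty result means every file expected under this naming assumption is present for that genome.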
The other important file is the search-terms file. It associates each query ID with its individual search terms and the search method used for each term. Multiple search terms can be associated with the same query, allowing a query to be identified through several approaches simultaneously. The search terms for this tutorial search can be seen in the config.yaml file:
    name: Equol Gene Cluster
    components:
    - name: Equol Production Pathway
      presence: essential
      components:
      - geneID: DZNR
        description: Daidzein reductase
        presence: essential
        terms:
        - id: DZNR
          method: hmmer
      - geneID: DHDR
        description: Dihydrodaidzein reductase
        presence: essential
        terms:
        - id: GCF_000422625.1_00043
          method: blast
      - geneID: THDR
        description: Tetrahydrodaidzein reductase
        presence: essential
        terms:
        - id: COG1053
          method: emapper
      - geneID: DDRC
        description: Dihydrodaidzein racemase
        presence: essential
        terms:
        - id: DDRC
          method: hmmer
The excerpt above shows only part of the definition. The full search uses a mix of 3 profile HMMs, 1 protein sequence, 2 KOs, 4 domain signatures, and 4 COGs.
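To see which methods a definition relies on, the search terms can be tallied by walking the nested components. The sketch below is illustrative only: the excerpt is transcribed by hand into a Python dictionary (reading the YAML directly would need a third-party parser such as PyYAML), and `count_terms_by_method` is a hypothetical helper, not a ProkFunFind function. The dictionary holds only the four genes shown in the excerpt, so the tally is smaller than the full set of terms listed above.

```python
from collections import Counter

# Hand-transcribed copy of the excerpt of the function definition.
definition = {
    "name": "Equol Gene Cluster",
    "components": [
        {
            "name": "Equol Production Pathway",
            "presence": "essential",
            "components": [
                {"geneID": "DZNR", "presence": "essential",
                 "terms": [{"id": "DZNR", "method": "hmmer"}]},
                {"geneID": "DHDR", "presence": "essential",
                 "terms": [{"id": "GCF_000422625.1_00043", "method": "blast"}]},
                {"geneID": "THDR", "presence": "essential",
                 "terms": [{"id": "COG1053", "method": "emapper"}]},
                {"geneID": "DDRC", "presence": "essential",
                 "terms": [{"id": "DDRC", "method": "hmmer"}]},
            ],
        }
    ],
}


def count_terms_by_method(node):
    """Recursively count search terms per method in a definition node."""
    counts = Counter()
    # Terms attached directly to this node (gene-level entries).
    for term in node.get("terms", []):
        counts[term["method"]] += 1
    # Descend into nested components, if any.
    for child in node.get("components", []):
        counts.update(count_terms_by_method(child))
    return counts


if __name__ == "__main__":
    for method, n in sorted(count_terms_by_method(definition).items()):
        print(f"{method}\t{n}")
```

Every method that appears in the tally needs a matching section in the configuration file and the corresponding input files for each genome.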
To perform the search from the root directory of the tutorial repository you can run the following command:
prokfunfind -f queries/mixed-search/config.yaml --gtab ./genome-list.tsv --outputprefix ./out/mixed-search/mixed
This command will return an initial summary of the component presence and absence:
    INFO:root:Checking configuration files
    INFO:root:Searching for function
    INFO:root:Identifying gene clusters
    INFO:root:Summarizing function presence and genes
    Detected function: Equol Gene Cluster in genome ./genomes//GTDB18040
    1 out of 1 essential components present
    3 out of 3 nonessential components present
    INFO:root:Searching for function
    INFO:root:Identifying gene clusters
    INFO:root:Summarizing function presence and genes
    Failed to detect function: Equol Gene Cluster in genome ./genomes//GTDB26128
    0 out of 1 essential components present
    3 out of 3 nonessential components present
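When many genomes are searched, this summary can be condensed into one record per genome. The following is a rough sketch that assumes the summary text has already been captured as a list of lines in exactly the format shown above; `parse_summary` and both regular expressions are hypothetical and written only against this sample output.

```python
import re

# Patterns written against the sample summary lines shown in the tutorial.
DETECT_RE = re.compile(
    r"^(Detected|Failed to detect) function: (.+) in genome (\S+)$"
)
COUNT_RE = re.compile(
    r"^(\d+) out of (\d+) (essential|nonessential) components present$"
)


def parse_summary(lines):
    """Collect one record per genome from the presence/absence summary."""
    results = []
    current = None
    for raw in lines:
        line = raw.strip()
        match = DETECT_RE.match(line)
        if match:
            current = {
                "detected": match.group(1) == "Detected",
                "function": match.group(2),
                "genome": match.group(3),
            }
            results.append(current)
            continue
        match = COUNT_RE.match(line)
        if match and current is not None:
            # Store (present, total) under "essential" or "nonessential".
            current[match.group(3)] = (int(match.group(1)), int(match.group(2)))
    return results


if __name__ == "__main__":
    sample = [
        "Detected function: Equol Gene Cluster in genome ./genomes//GTDB18040",
        "1 out of 1 essential components present",
        "3 out of 3 nonessential components present",
    ]
    for record in parse_summary(sample):
        print(record["genome"], record["detected"])
```

Lines that match neither pattern, such as the INFO log messages, are skipped.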
The output is the same as what is produced by other searches. Because the search uses multiple search terms, it can also be useful to check which search terms produce hits to particular genes. This information can be found in the GFF output of the search. For this search the output can be seen in the ./out/mixed-search/mixed.GTDB18040.annot.gff file:
    GCF_000478885.1_1 ProkFunFind CDS 7382 8305 . - . ID=GCF_000478885.1_00007;Name=HYDE;ClusterID=Cl_NA;Target=PF04055;evalue=3.2e-19
    GCF_000478885.1_1 ProkFunFind CDS 28201 29646 . + . ID=GCF_000478885.1_00024;Name=HYDE;ClusterID=Cl_NA;Target=PF04055;evalue=2e-15
    GCF_000478885.1_1 ProkFunFind CDS 261233 262642 . + . ID=GCF_000478885.1_00150;Name=HYDE;ClusterID=Cl_NA;Target=PF04055;evalue=7.7e-22
    GCF_000478885.1_1 ProkFunFind CDS 288712 290358 . - . ID=GCF_000478885.1_00174;Name=DEVR;ClusterID=Cl_NA;Target=DEVR;evalue=1.2e-07
In this output the Target attribute in column 9 gives the specific search term ID that produced the hit to that gene.
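The Target values can be pulled out of the GFF file with a few lines of code. This is a minimal sketch, assuming the file is tab-separated as GFF normally is and that column 9 holds `key=value` pairs separated by semicolons, as in the records above; `parse_attributes` and `targets_by_query` are hypothetical helper names.

```python
from collections import defaultdict


def parse_attributes(field):
    """Split a GFF column 9 string into a dict of key=value attributes."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def targets_by_query(gff_lines):
    """Map each query name to the (gene ID, search term ID) pairs that hit it."""
    hits = defaultdict(list)
    for line in gff_lines:
        # Skip blank lines and GFF comment or directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue
        attrs = parse_attributes(columns[8])
        hits[attrs.get("Name")].append((attrs.get("ID"), attrs.get("Target")))
    return dict(hits)


if __name__ == "__main__":
    # Path taken from the tutorial command's output prefix.
    with open("./out/mixed-search/mixed.GTDB18040.annot.gff") as handle:
        for name, pairs in targets_by_query(handle).items():
            for gene_id, target in pairs:
                print(name, gene_id, target, sep="\t")
```

Grouping by query name makes it easy to see when a single query, such as HYDE above, is hit on several genes by the same search term.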