Scripts and examples for GSearch paper

Scripts for analysizing results from the GSearch software: https://github.com/jean-pierreBoth/gsearch

Note: you can directly go to step 3 and 4 and skip step 1 and 2 for testing the recall of GSearch using several testing genomes. Step 1 and step 2 is to produce ground truth for your query genomes based on blastn-ANI/blastp-AAI. Step 1 and 2 is very expensive and take days to run on modern clusters.

1. ANI and AAI calculation

ANI and AAI was calcualted using the ani.rb/aai.rb scripts from the Kostas's lab, which can be found in the scripts directory. Several dependencies must be installed to run the scripts:

Ruby (>v2.7)
Blast+ (>v2.14.0, ANI computation is faster using new parallelism model in blastn and blastp)
perl
GNU parallel, for fasta file processing
python

We provide a bash script to run search of query genomes against database genomes based on aai.rb/ani.rb script

### this is a very expensive step and often takes more than several weeks even on a decent computer cluster for running one genome against all GTDB v207 (65,703 genomes). Thus we also provide the top 20 truth from the output of this step (truth_test.txt) in the example directory for testing purposes.

### first of all, prepare everything needed
git clone https://github.com/jianshu93/gsearch_analysis.git
chmod a+x ./gsearch_analysis/scripts/*
### copy to you user path, use system path as an example
cp ./gsearch_analysis/scripts/* /usr/local/bin/


### get GTDB v207 nt genomes and search/compare use ani.rb
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/genomic_files_reps/gtdb_genomes_reps_r207.tar.gz
tar xzvf ./gtdb_genomes_reps_r207.tar.gz

### or download all NCBI/RefSeq proakryotic genomes viar the download software

### unzip all genomes, database genomes and query genomes
find ./gtdb_genomes_reps_r207 -name "*.fna.gz" | parallel -j 10 "gunzip {}"
gunzip ./example/test_data/query_dir_nt/*.gz

find ./gtdb_genomes_reps_r207 -name "*.fna" > gtdb_name.txt
./scripts/ANI.multiple.comparison.many.pl -i gtdb_name.txt -n ./example/test_data/query_dir_nt/test01.fasta -m ani -o test01_ANI_search.txt
tail -n +2 test01_ANI_search.txt | sort -k 2 -g -r > test01_ANI_truth.txt


### get GTDB v207 aa genomes and search use aai.rb
wget https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0/genomic_files_reps/gtdb_proteins_aa_reps_r207.tar.gz
tar xzvf gtdb_proteins_aa_reps_r207.tar.gz
find ./gtdb_genomes_aa_reps_r207 -name "*.faa.gz" > gtdb_name_aa.txt
./scripts/ANI.multiple.comparison.many.pl -i gtdb_name_aa.txt -n ./example/test_data/query_dir_aa/test01.faa.gz -m aai -o test01_AAI_search.txt
tail -n +2 test01_AAI_search.txt | sort -k 2 -g -r > test01_AAI_truth.txt

2. Run GSearch to get gsearch.answers.txt for query genomes against pre-built database

### get the binary for linux (make sure you have recent Linux installed with GCC, e.g., Ubuntu 18.0.4 or above)
wget https://github.com/jean-pierreBoth/gsearch/releases/download/0.1.1/gsearch-linux-x86-64.zip --no-check-certificate
unzip gsearch-linux-x86-64.zip
chmod a+x ./gsearch

### get database and prepare
wget http://enve-omics.ce.gatech.edu/data/public_gsearch/GTDBv207_v2023.tar.gz
tar xzvf ./GTDBv207_v2023.tar.gz
cd GTDB/nucl/
tar xzvf ./k16_s12000_n128_ef1600.prob.tar.gz
cd ../../

### get gsearch.answers.txt output, it can also be found in example directory
### output will be in the current directory
./gsearch -b ./gsearch_analysis/example/test_data/query_dir_nt -r GTDB/nucl/k16_s12000_n128_ef1600_canonical -n 50

3. Transform results from GSearch output distance to ANI

According to equation: $$ANI=1+\frac{1}{k}log\frac{2*J}{1+J}$$

where J is Jaccard-like index (e.g. Jp from ProbMinHash or J from SetSketch) and k is k-mer size used in gsearch (default 16 for nt and 7 for amino acid). The 5th column of gsearch.answers.txt can be transformed into ANI by using this following script:

### reformat
cd ./gsearch_analysis
grep -E "*query_id*" ./example/gsearch.answers.txt > ./example/new.txt

### the $5 column is distance (1 - Jaccard index), transformation is the above mentioned equation. Output is query name, subject name and ANI
awk 'BEGIN{FS=OFS="\t"}{print $3,$7,log((1-$5)*2/(1-$5+1))/16+1}' ./example/new.txt > ./example/ani.txt

4. Calculate recall based on gsearch.answers.txt and true ANI hits found by ani.rb or aai.rb

### prepare files for each query
### test genome 01 top 10 by gsearch
grep -E "*test01.fasta.gz*" ./example/gsearch.answers.txt | grep -E "*query_id*" | awk 'BEGIN{FS=OFS="\t"}{print $7}' | awk 'BEGIN{FS="/"}{print $3}' | head -n 10 > test01.answers.top10.txt
### test genome 02 top 10 by gsearch
grep -E "*test02.fasta.gz*" ./example/gsearch.answers.txt | grep -E "*query_id*" | awk 'BEGIN{FS=OFS="\t"}{print $7}' | awk 'BEGIN{FS="/"}{print $3}' | head -n 10 > test02.answers.top10.txt

### count intersection between neighbors found by GSearch and ground truth for the same query genome, where file_1 and file_2 are best top K target names by GSearch and ground truth respectively.
awk 'NR==FNR{a[$1]=1; next} {if($1 in a)print $0}' ./example/test01.truth.txt ./example/test01.answers.top10.txt | wc -l | awk '{print $1/10}'

References

Jianshu Zhao, Jean Pierre Both, Luis M Rodriguez-R, Konstantinos T Konstantinidis, GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs, Nucleic Acids Research, 2024;, gkae609, https://doi.org/10.1093/nar/gkae609.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
example		example
scripts		scripts
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scripts and examples for GSearch paper

1. ANI and AAI calculation

2. Run GSearch to get gsearch.answers.txt for query genomes against pre-built database

3. Transform results from GSearch output distance to ANI

4. Calculate recall based on gsearch.answers.txt and true ANI hits found by ani.rb or aai.rb

References

About

Releases

Packages

Languages

jianshu93/gsearch_analysis

Folders and files

Latest commit

History

Repository files navigation

Scripts and examples for GSearch paper

1. ANI and AAI calculation

2. Run GSearch to get gsearch.answers.txt for query genomes against pre-built database

3. Transform results from GSearch output distance to ANI

4. Calculate recall based on gsearch.answers.txt and true ANI hits found by ani.rb or aai.rb

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages