UniVec_Core build failed - skipping entries without valid taxonomic nodes #277

ksavhughes · 2024-01-11T19:45:10Z

Hi, I'm trying to build a UniVec_Core database, but I keep running into the "Unable to match taxonomy to targets" error. I also tried assigning taxID 28384 to all the UniVec_Core sequences, just to see if that would work (same issue).

I downloaded the FASTA file and created an input file, then ran the build command below:
grep -o '^>[^ ]*' Univec_Core.fasta | sed 's/^>//' | awk '{print "Univec_Core.fasta\t"$1"\t81077"}' > Univec_Core_ganon_input_file.tsv

ganon build-custom -t 20 -n $db/refs/"${cat}"/"${cat}"_ganon_input_file.tsv -d $db/"${cat}"_k19 --level species -k 19 -w 31 -s 4 -v hibf -p 0.001

Error file:
Downloading and parsing ncbi taxonomy

done in 18.76s.

Parsing --input-file /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/Univec_Core/Univec_Core_ganon_input_file.tsv

3157 unique entries
done in 0.05s.

Validating taxonomy

3157 entries without valid taxonomic nodes skipped
done in 0.03s.

ERROR: Unable to match taxonomy to targets
Total elapsed time: 19.32 seconds.

ksavhughes · 2024-01-11T20:48:05Z

I've also been having some issues building the RefSeq mitochondrion, plasmid, and plastid databases.

mitochondrion:

1 valid file(s) [--input-extension fna.gz, --input-recursive] found in /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/mitochondrion
Total valid files: 1

Downloading and parsing ncbi taxonomy

done in 11.35s.

Parsing sequences from --input (1 files)

17109 unique entries
done in 3.13s.

Retrieving sequence information from NCBI e-utils

done in 148.67s.

Validating taxonomy

done in 0.18s.

Downloading and parsing auxiliary files for genome size estimation

done in 0.30s.

Estimating genome sizes

done in 16.59s.

Building index (raptor)
raptor prepare
============= Timings =============
Wall clock time [s]: 7044.54
Peak memory usage [GiB]: 90.9
Compute minimiser [s]: 5924.81
Write minimiser files [s]: 206.09
Write header files [s]: 1.41

done in 7049.49s.

raptor layout

done in 15570.53s.

raptor build
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
The following command failed to run:
/home/karsav1511/.conda/envs/ganon/bin/raptor build --output '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/mitochondrion_k19.hibf' --threads 80 --input '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/mitochondrion_k19_files/build/raptor_layout.binning.out'

Error code: -6

Plasmid and plastid:

10 valid file(s) [--input-extension fna.gz, --input-recursive] found in /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/plasmid
Total valid files: 10

Downloading and parsing ncbi taxonomy

done in 13.40s.

Parsing --input (10 files)

10 unique entries
done in 0.00s.

Downloading assembly_summary files

done in 104.48s.

Parsing assembly_summary files

0 entries found in the assembly_summary_refseq.txt file
0 entries found in the assembly_summary_genbank.txt file
done in 5.74s.

Validating taxonomy

10 entries without valid taxonomic nodes skipped
done in 0.00s.

ERROR: Unable to match taxonomy to targets
Total elapsed time: 123.78 seconds.

I made an input tsv file for the plasmid and plastid files to fix the taxonomy problem, but now it comes back with the same error as the mitochondrion build (and I use --restart in all the build commands now).

Downloading and parsing ncbi taxonomy

done in 16.14s.

Parsing --input-file /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/plastid/plastid_ganon_input_file.tsv

1 duplicated targets skipped
13770 unique entries
done in 0.13s.

Validating taxonomy

done in 0.17s.

Downloading and parsing auxiliary files for genome size estimation

done in 0.56s.

Estimating genome sizes

done in 11.32s.

Building index (raptor)
raptor prepare
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
what(): filesystem error: cannot get file size: No such file or directory [/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/5854.fna]
The following command failed to run:
/home/karsav1511/.conda/envs/ganon/bin/raptor prepare --input '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/hibf.txt' --output '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/' --kmer 19 --window 31 --threads 20

Error code: -6

pirovc · 2024-01-11T21:43:10Z

Regarding the UniVec database, the examples in the documentation are indeed outdated. Use the following commands to build it:

echo -e "UniVec_Core.fasta\tUniVec_Core\t81077" > UniVec_Core_ganon_input_file.tsv
ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves

--level species won't work because the taxid 81077 does not have species in the lineage.
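
To check beforehand whether a taxid can be used with a given --level, one option is to walk its lineage in the NCBI taxonomy dump. The sketch below is only an illustration and not how ganon validates taxonomy: the function name has_rank_in_lineage is made up, and it assumes a nodes.dmp in the usual NCBI taxdump layout (fields separated by tab, pipe, tab, with taxid, parent taxid and rank as the first three fields).

```shell
# has_rank_in_lineage NODES_DMP TAXID RANK   (hypothetical helper, not part of ganon)
# Walks from TAXID towards the root of an NCBI-style nodes.dmp and returns 0
# if any node on the way (including TAXID itself) has the requested rank.
has_rank_in_lineage() {
  awk -F '\t[|]\t?' -v start="$2" -v want="$3" '
    { parent[$1] = $2; rank[$1] = $3 }
    END {
      node = start
      # Stop at an unknown taxid; the step counter guards against cycles.
      while ((node in parent) && steps++ < 100000) {
        if (rank[node] == want) { found = 1; break }
        if (parent[node] == node) break      # the root is its own parent
        node = parent[node]
      }
      exit (found ? 0 : 1)
    }' "$1"
}
```

For example, `has_rank_in_lineage nodes.dmp 81077 species || echo "no species in lineage"` would flag the case described above, assuming nodes.dmp was extracted from the NCBI taxdump archive.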

pirovc · 2024-01-12T13:43:35Z

The same goes for the plasmid, plastid and mitochondrion. Use the following after downloading the files:

mkdir sequences

zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ ">" {accver=(substr($1,2)); print accver}{print $0 > "sequences/"accver".fna"}' | ganon-get-seq-info.sh -e -i - | awk '{print "sequences/"$1".fna\t"$1"\t"$3}' > ppm.tsv

ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 20

rm -rf sequences

You could also build them separately: just change the zcat command to select the files you need.
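
The splitting step in the pipeline above keeps one output file open per sequence, which some awk implementations may not cope with for tens of thousands of records. A possible variant, as a sketch only, closes each file before opening the next. The function name split_fasta is hypothetical; it assumes uncompressed FASTA on stdin or in the given files, and accessions that are valid file names.

```shell
# split_fasta OUTDIR [FASTA...]   (hypothetical helper, not part of ganon)
# Writes each FASTA record to OUTDIR/<accession>.fna and prints the accession
# of every record to stdout, in input order. Reads stdin if no file is given.
split_fasta() (
  outdir=$1
  shift
  mkdir -p "$outdir"
  awk -v dir="$outdir" '
    /^>/ {
      if (out != "") close(out)   # keep only one output file open at a time
      acc = substr($1, 2)         # header up to the first whitespace, minus ">"
      out = dir "/" acc ".fna"
      print acc
    }
    out != "" { print > out }     # header and sequence lines go to the record file
  ' "$@"
)
```

It can stand in for the first awk call, for example `zcat plasmid.* | split_fasta sequences | ganon-get-seq-info.sh -e -i - | ...`, with the rest of the pipeline unchanged. A repeated accession would overwrite the earlier file of the same name.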

The general issue is that --input-target sequence does not work yet with --filter-type hibf. This means the input has to be split into one file per target to be used in the build process, not provided in bulk like the plasmid sequences. For now I will update the documentation with the new commands, but I hope to implement parsing by sequence with hibf soon.
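
A quick way to spot this situation in an existing input file is to look for files that are listed with more than one target. The following is a rough sketch under assumptions: the helper name shared_files is made up, and the column order (file, target, taxid, tab-separated) is taken from the commands in this thread.

```shell
# shared_files TSV   (hypothetical helper, not part of ganon)
# Prints "<number of targets><TAB><file>" for every file in column 1 that is
# listed with more than one distinct target in column 2.
shared_files() {
  awk -F '\t' '
    !seen[$1 SUBSEP $2]++ { targets[$1]++ }   # count each file/target pair once
    END {
      for (f in targets)
        if (targets[f] > 1) printf "%d\t%s\n", targets[f], f
    }
  ' "$1"
}
```

An input file with one row per sequence header, all pointing at the same FASTA, would be reported here, while a file with one row per FASTA produces no output. This is only a heuristic for the per-file requirement described above.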

Fixed in v2.1.0 (#285). This works now:

# Download sequence files
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/"

ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence

pirovc · 2024-01-12T22:24:01Z

Documentation updated in v2.0.1 #281

pirovc added the bug label Jan 11, 2024

pirovc closed this as completed Jan 12, 2024

pirovc mentioned this issue Jan 26, 2024 in "build-ganon unable to read taxonomy file(s)" #282 (closed)
