UniVec_Core build failed - skipping entries without valid taxonomic nodes #277

Closed
ksavhughes opened this issue Jan 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

ksavhughes commented Jan 11, 2024

Hi, I'm trying to build a UniVec_Core database, but I keep running into the "Unable to match taxonomy to targets" error. I also tried assigning taxID 28384 to all the UniVec_Core sequences, just to see if that would work (same issue).

I downloaded the fasta file and then made an input file before running the below build command:
grep -o '^>[^ ]*' Univec_Core.fasta | sed 's/^>//' | awk '{print "Univec_Core.fasta\t"$1"\t81077"}' > Univec_Core_ganon_input_file.tsv

ganon build-custom -t 20 -n $db/refs/"${cat}"/"${cat}"_ganon_input_file.tsv -d $db/"${cat}"_k19 --level species -k 19 -w 31 -s 4 -v hibf -p 0.001

Error file:
Downloading and parsing ncbi taxonomy

  • done in 18.76s.

Parsing --input-file /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/Univec_Core/Univec_Core_ganon_input_file.tsv

  • 3157 unique entries
  • done in 0.05s.

Validating taxonomy

  • 3157 entries without valid taxonomic nodes skipped
  • done in 0.03s.

ERROR: Unable to match taxonomy to targets
Total elapsed time: 19.32 seconds.

Author

ksavhughes commented Jan 11, 2024

I've also been having some issues building the RefSeq mitochondrion, plasmid, and plastid databases.

mitochondrion:

1 valid file(s) [--input-extension fna.gz, --input-recursive] found in /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/mitochondrion
Total valid files: 1

Downloading and parsing ncbi taxonomy

  • done in 11.35s.

Parsing sequences from --input (1 files)

  • 17109 unique entries
  • done in 3.13s.

Retrieving sequence information from NCBI e-utils

  • done in 148.67s.

Validating taxonomy

  • done in 0.18s.

Downloading and parsing auxiliary files for genome size estimation

  • done in 0.30s.

Estimating genome sizes

  • done in 16.59s.

Building index (raptor)
raptor prepare
============= Timings =============
Wall clock time [s]: 7044.54
Peak memory usage [GiB]: 90.9
Compute minimiser [s]: 5924.81
Write minimiser files [s]: 206.09
Write header files [s]: 1.41

  • done in 7049.49s.

raptor layout

  • done in 15570.53s.

raptor build
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
The following command failed to run:
/home/karsav1511/.conda/envs/ganon/bin/raptor build --output '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/mitochondrion_k19.hibf' --threads 80 --input '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/mitochondrion_k19_files/build/raptor_layout.binning.out'

Error code: -6

Plasmid and plastid:

10 valid file(s) [--input-extension fna.gz, --input-recursive] found in /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/plasmid
Total valid files: 10

Downloading and parsing ncbi taxonomy

  • done in 13.40s.

Parsing --input (10 files)

  • 10 unique entries
  • done in 0.00s.

Downloading assembly_summary files

  • done in 104.48s.

Parsing assembly_summary files

  • 0 entries found in the assembly_summary_refseq.txt file
  • 0 entries found in the assembly_summary_genbank.txt file
  • done in 5.74s.

Validating taxonomy

  • 10 entries without valid taxonomic nodes skipped
  • done in 0.00s.

ERROR: Unable to match taxonomy to targets
Total elapsed time: 123.78 seconds.


I made an input TSV file for the plasmid and plastid files to fix the taxonomy problem, but the build now fails in raptor with error code -6, as the mitochondrion build does (I now use --restart in all the build commands).

Downloading and parsing ncbi taxonomy

  • done in 16.14s.

Parsing --input-file /ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/refs/plastid/plastid_ganon_input_file.tsv

  • 1 duplicated targets skipped
  • 13770 unique entries
  • done in 0.13s.

Validating taxonomy

  • done in 0.17s.

Downloading and parsing auxiliary files for genome size estimation

  • done in 0.56s.

Estimating genome sizes

  • done in 11.32s.

Building index (raptor)
raptor prepare
terminate called after throwing an instance of 'std::filesystem::__cxx11::filesystem_error'
what(): filesystem error: cannot get file size: No such file or directory [/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/5854.fna]
The following command failed to run:
/home/karsav1511/.conda/envs/ganon/bin/raptor prepare --input '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/hibf.txt' --output '/ourdisk/hpc/hofmanlmamr/karsav1511/auto_archive_notyet/tape_2copies/RefSeq/ganonRefSeqLim3/plastid_k19_files/build/' --kmer 19 --window 31 --threads 20

Error code: -6

Owner

pirovc commented Jan 11, 2024

Regarding the UniVec database, the examples in the documentation are indeed outdated. Use the following commands to build it:

echo -e "UniVec_Core.fasta\tUniVec_Core\t81077" > UniVec_Core_ganon_input_file.tsv
ganon build-custom --input-file UniVec_Core_ganon_input_file.tsv --db-prefix UniVec_Core --level leaves

--level species won't work because taxid 81077 has no species rank in its lineage.
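
To illustrate the lineage point above, here is a minimal sketch that walks parent links in a nodes.dmp-style table and reports whether a given rank occurs in a taxid's lineage. The helper name has_rank is hypothetical, and the toy rows (parents and ranks) are assumptions written for illustration; check them against a real NCBI taxdump before relying on them.

```shell
#!/usr/bin/env bash
# Toy excerpt in nodes.dmp layout (tax_id | parent tax_id | rank |).
# These rows are illustrative assumptions, not copied from a real taxdump.
nodes=$(mktemp)
printf '1\t|\t1\t|\tno rank\t|\n'         >  "$nodes"
printf '28384\t|\t1\t|\tno rank\t|\n'     >> "$nodes"
printf '81077\t|\t28384\t|\tno rank\t|\n' >> "$nodes"
printf '561\t|\t1\t|\tgenus\t|\n'         >> "$nodes"
printf '562\t|\t561\t|\tspecies\t|\n'     >> "$nodes"

# has_rank TAXID RANK NODES_FILE (hypothetical helper):
# prints "yes" if RANK occurs in the lineage of TAXID, otherwise "no".
has_rank() {
  awk -F'\t[|]\t?' -v id="$1" -v want="$2" '
    { parent[$1] = $2; rank[$1] = $3 }
    END {
      found = "no"
      # Walk upwards; stop at the root (its own parent) or an unknown taxid.
      while (id in parent) {
        if (rank[id] == want) { found = "yes"; break }
        if (parent[id] == id) break
        id = parent[id]
      }
      print found
    }' "$3"
}

has_rank 81077 species "$nodes"   # prints "no" for the toy table
has_rank 562 species "$nodes"     # prints "yes" for the toy table
```

With a real taxdump, the same function could be pointed at the extracted nodes.dmp to see which --level values a taxid can support.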

pirovc added the bug label Jan 11, 2024
Owner

pirovc commented Jan 12, 2024

The same goes for the plasmid, plastid and mitochondrion. Use the following after downloading the files:

mkdir sequences

zcat plasmid.* plastid.* mitochondrion.* | awk '$0 ~ ">" {accver=(substr($1,2)); print accver}{print $0 > "sequences/"accver".fna"}' | ganon-get-seq-info.sh -e -i - | awk '{print "sequences/"$1".fna\t"$1"\t"$3}' > ppm.tsv

ganon build-custom --input-file ppm.tsv --db-prefix ppm --level species --threads 20

rm -rf sequences

You could also build them separately: just change the zcat command to select the files you need.

The general issue is that --input-target sequence does not yet work with --filter-type hibf. This means the input has to be split into separate files to be used in the build process; it cannot be provided in bulk, as the plasmid sequences are. For now I will update the documentation with the new commands, but I hope to implement parsing by sequence with hibf soon.
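
As a small illustration of the per-file layout described above, the sketch below splits a toy multi-FASTA into one file per record and writes a three-column TSV (file, target, taxid). The accessions, taxids and the acc2taxid.tsv table are made up; the table merely stands in for whatever supplies real taxids (for example the ganon-get-seq-info.sh step in the command above).

```shell
#!/usr/bin/env bash
# Toy multi-FASTA with two records; accessions and taxids are made up.
work=$(mktemp -d)
printf '>ACC_0001.1 toy plasmid\nACGTACGT\n>ACC_0002.1 toy plastid\nGGCCGGCC\nTTAA\n' > "$work/bulk.fna"
# Hypothetical accession-to-taxid table (placeholder for real sequence info).
printf 'ACC_0001.1\t1111\nACC_0002.1\t2222\n' > "$work/acc2taxid.tsv"

mkdir -p "$work/sequences"
# Write each record to its own file, named after the accession.version.
# Closing the previous file keeps the number of open files small.
awk -v dir="$work/sequences" '
  /^>/ { if (out) close(out); out = dir "/" substr($1, 2) ".fna" }
  { print > out }
' "$work/bulk.fna"

# One row per file: <file> <tab> <target> <tab> <taxid>
awk -F'\t' -v dir="$work/sequences" '
  { print dir "/" $1 ".fna\t" $1 "\t" $2 }
' "$work/acc2taxid.tsv" > "$work/ppm.tsv"

cat "$work/ppm.tsv"
```

The resulting TSV has the same shape as the ppm.tsv passed to --input-file above, only with placeholder values.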

Fixed in v2.1.0 (#285); this now works:

# Download sequence files
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plasmid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/plastid/"
wget -A genomic.fna.gz -m -nd --quiet --show-progress "ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/mitochondrion/"

ganon build-custom --input plasmid.* plastid.* mitochondrion.* --db-prefix ppm --level species --threads 8 --input-target sequence
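
Since the right workflow depends on the installed version, a script might pick between the two approaches in this thread with a check like the sketch below. The helper version_ge and the installed variable are hypothetical; the comparison assumes a sort that supports -V (GNU coreutils), and how you obtain the installed version string is left to your environment.

```shell
#!/usr/bin/env bash
# version_ge A B (hypothetical helper): succeeds if dotted version A >= B.
# Assumes sort supports -V (version sort), as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Placeholder value; fill it in from your own installation.
installed="2.0.1"

if version_ge "$installed" "2.1.0"; then
  workflow="sequence-target"   # --input-target sequence works with hibf
else
  workflow="split-files"       # one file per sequence plus an --input-file TSV
fi
echo "$workflow"
```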

Owner

pirovc commented Jan 12, 2024

Documentation updated in v2.0.1 #281
