Loading data using builder.py fails #52

Closed
rbf22 opened this issue Feb 14, 2021 · 5 comments

rbf22 commented Feb 14, 2021

Describe the bug
Loading data into the database fails when trying to load the latest version of DrugBank. It seems that the file names, and potentially the format, of the input file have changed.

2021-02-14 18:18:26,308 - database_controller - ERROR - Database DrugBank: (<class 'lxml.etree.XMLSyntaxError'>, XMLSyntaxError('Document is empty, line 1, column 1'), <traceback object at 0x19d61a3c0>), file: databases_controller.py,line: 205

To Reproduce
Steps to reproduce the behavior:
Go to builder.py and run it with the standard command for a minimal or full build.

Expected behavior
No errors in the log.

Author

rbf22 commented Feb 14, 2021

Similar errors that occur while building the database:

2021-02-14 18:17:31,592 - importer - ERROR - Writing Stats object full_stats_1_0 in file:/CKG/src/graphdb_builder/../../data/imports/stats/stats.hdf > Trying to store a string with len [9] in [date] column but
this column has a limit of [8]!
Consider using min_itemsize to preset the sizes on these columns.

2021-02-14 18:17:31,296 - ontologies_controller - ERROR - Error: Tag-value pair parsing failed for:
A000 Cholera due to Vibrio cholerae 01, biovar cholerae
. Ontology ICD-10: (<class 'ValueError'>, ValueError('Tag-value pair parsing failed for:\nA000 Cholera due to Vibrio cholerae 01, biovar cholerae\n'), <traceback object at 0x18fc147d0>), file: ontologies_controller.py,line: 134

2021-02-14 18:40:09,476 - database_controller - ERROR - Database UniProt: (<class 'Exception'>, Exception('Something went wrong. Exception raised when an error code signifying a permanent error. 550 Failed to open file..\nURL:ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640_9606.fasta.gz.\nURL:ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640_9606.fasta.gz'), <traceback object at 0x10a961780>), file: databases_controller.py,line: 66

-1 / unknown
Done Parsing database GWASCatalog

2021-02-15 04:28:00,160 - database_controller - ERROR - Database DGIdb: (<class 'Exception'>, Exception("mapping - No mapping file ../../../data/databases/DrugBank/complete_mapping.tsv for entity Drug. Error: [Errno 2] No such file or directory: '../../../data/databases/DrugBank/complete_mapping.tsv'"), <traceback object at 0x1927e3640>), file: databases_controller.py,line: 143

Author

rbf22 commented Feb 16, 2021

The first error: 2021-02-14 18:18:26,308 - database_controller - ERROR - Database DrugBank: (<class 'lxml.etree.XMLSyntaxError'>, XMLSyntaxError('Document is empty, line 1, column 1'), <traceback object at 0x19d61a3c0>), file: databases_controller.py,line: 205

This was caused by how the DrugBank download was repackaged after decompressing the file. Compressing it with the OS X archive utility created a __MACOSX directory in the archive, which caused the parsing failure.

It can be fixed after the fact with zip -d filename.zip __MACOSX/* as detailed here:

https://stackoverflow.com/questions/10924236/mac-zip-compress-without-macosx-folder
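For anyone who prefers to do the same cleanup without the zip command line tool, here is a minimal Python sketch of the idea. It is not part of CKG; the function name strip_macosx_entries and the file names in the usage note are hypothetical. It copies every archive entry except those under __MACOSX/ into a new zip file.

```python
import zipfile


def strip_macosx_entries(src_zip, dst_zip):
    """Copy src_zip to dst_zip, skipping every entry under __MACOSX/.

    Returns the list of entry names that were left out.
    """
    removed = []
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename.startswith("__MACOSX/"):
                # Resource-fork metadata added by the OS X archiver.
                removed.append(info.filename)
                continue
            # Copy the entry unchanged, keeping its original metadata.
            dst.writestr(info, src.read(info.filename))
    return removed
```

Calling strip_macosx_entries("filename.zip", "filename_clean.zip") writes a cleaned copy and leaves the original archive untouched, so the result can be inspected before it replaces the download.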

Author

rbf22 commented Feb 16, 2021

UniProt error:

2021-02-14 18:40:09,476 - database_controller - ERROR - Database UniProt: (<class 'Exception'>, Exception('Something went wrong. Exception raised when an error code signifying a permanent error. 550 Failed to open file..\nURL:ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640_9606.fasta.gz.\nURL:ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640_9606.fasta.gz'), <traceback object at 0x10a961780>), file: databases_controller.py,line: 66

I updated this line (line 9) in ./src/graphdb_builder/databases/config/uniprotConfig.yml to fix the file path:

uniprot_fasta_file: 'ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.fasta.gz'
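The only difference between the failing and the working URL is the extra per-proteome directory (UP000005640/) before the file name. A small sketch that makes the layout explicit, assuming the pattern seen in the working URL holds; the helper reference_proteome_url is hypothetical and not part of CKG:

```python
def reference_proteome_url(proteome_id, taxid, kingdom="Eukaryota"):
    """Build a reference proteome FASTA URL with the per-proteome directory.

    Assumes the layout <kingdom>/<proteome_id>/<proteome_id>_<taxid>.fasta.gz,
    as in the working URL from this thread.
    """
    base = ("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/"
            "knowledgebase/reference_proteomes")
    return f"{base}/{kingdom}/{proteome_id}/{proteome_id}_{taxid}.fasta.gz"


print(reference_proteome_url("UP000005640", 9606))
```

The printed value can be pasted into uniprot_fasta_file in the config as-is.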

Author

rbf22 commented Feb 16, 2021

The ICD-10 code import fails because the input file seems incompatible with the parser. I am not sure what the correct file should be. The parser (ontologies/parsers/icdParser.py) seems to expect a tab-separated file with at least 6 columns, but the downloaded file has just two columns and does not contain any tabs:

ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Publications/ICD10CM/2020/icd10cm_codes_2020.txt
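To illustrate the mismatch, here is a rough sketch that checks a line against both layouts. The helper names are hypothetical and this is not the logic in icdParser.py; it only assumes the downloaded file holds a code followed by whitespace and a free-text description, as in the line quoted in the error log.

```python
def has_min_tab_columns(line, min_columns=6):
    """True if the line splits into at least min_columns tab-separated fields."""
    return len(line.rstrip("\n").split("\t")) >= min_columns


def split_code_and_description(line):
    """Split a 'CODE description...' line at the first run of whitespace."""
    parts = line.strip().split(None, 1)
    if len(parts) != 2:
        raise ValueError(f"Expected a code and a description, got: {line!r}")
    return parts[0], parts[1]


sample = "A000 Cholera due to Vibrio cholerae 01, biovar cholerae\n"
print(has_min_tab_columns(sample))
print(split_code_and_description(sample))
```

A check like has_min_tab_columns on the first line of the download would flag the format problem up front instead of failing inside the parser.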

@albsantosdel
Collaborator

Hi, apologies for the late response.

ICD10 codes are not included in this version of CKG. The parser we committed last year was in development and was not finalized. Closing until there is a parser supporting this node type.
