Add validation step for pandas parsing #33

Closed
caufieldjh opened this issue Apr 6, 2022 · 0 comments · Fixed by #29

Comments

@caufieldjh
Collaborator

A small but non-zero number of the ontology transforms can't be parsed properly by pandas. This is probably caught by one or another of the existing validations, but when such a transform reaches the kg-bioportal merge step it causes a failure like the following:

15:44:08  Traceback (most recent call last):
15:44:08    File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
15:44:08      result = (True, func(*args, **kwds))
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/cli/cli_utils.py", line 809, in parse_source
15:44:08      transformer.transform(input_args)
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 303, in transform
15:44:08      self.process(source_generator, sink)
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/transformer.py", line 343, in process
15:44:08      for rec in source:
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/kgx/source/tsv_source.py", line 184, in parse
15:44:08      for chunk in file_iter:
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1187, in __next__
15:44:08      return self.get_chunk()
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1284, in get_chunk
15:44:08      return self.read(nrows=size)
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
15:44:08      index, columns, col_dict = self._engine.read(nrows)
15:44:08    File "/var/lib/jenkins/workspace/NCBO/kg-bioportal/gitrepo/venv/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 230, in read
15:44:08      data = self._reader.read(nrows)
15:44:08    File "pandas/_libs/parsers.pyx", line 787, in pandas._libs.parsers.TextReader.read
15:44:08    File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
15:44:08    File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
15:44:08    File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
15:44:08  pandas.errors.ParserError: Error tokenizing data. C error: Expected 8 fields in line 6, saw 9

Most of these errors are due to #32, so the solution is to re-transform, but the pandas error does not say which ontology caused it. A pre-screening step in this repo would be helpful: just load each graph file into pandas and warn loudly if it doesn't parse.
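
A minimal sketch of what that pre-screening could look like. The function name `find_unparseable_tsvs`, the tab separator, the `dtype=str` setting, and the chunk size are all assumptions made for illustration; they are not taken from this repo or from kgx, and the read options would need to match whatever the merge step actually uses.

```python
import warnings

import pandas as pd


def find_unparseable_tsvs(paths, chunksize=10000):
    """Try to read each TSV with pandas; return (path, error) for each failure.

    Hypothetical helper: name and defaults are illustrative only.
    """
    failures = []
    for path in paths:
        try:
            # Read in chunks so large graph files are never held in memory
            # whole; the chunks themselves are discarded.
            with pd.read_csv(
                path, sep="\t", dtype=str, chunksize=chunksize
            ) as reader:
                for _ in reader:
                    pass
        except (
            pd.errors.ParserError,
            pd.errors.EmptyDataError,
            UnicodeDecodeError,
        ) as err:
            # Name the offending file, which the bare pandas error omits.
            warnings.warn(f"{path} could not be parsed by pandas: {err}")
            failures.append((str(path), str(err)))
    return failures
```

Run over every node and edge file produced by the transforms, this would name each file that needs re-transforming before the merge is attempted. Returning the failures as well as warning lets the caller decide whether to skip those files or abort.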

@caufieldjh caufieldjh linked a pull request Apr 7, 2022 that will close this issue