Add functionality for PubMed Central text retrieval #156

Merged Aug 2, 2023 (17 commits)

Conversation

caufieldjh
Member

@caufieldjh caufieldjh commented Jul 18, 2023

Run as, for example: ontogpt pubmed-extract -t core.TextWithTriples --get-pmc 25833107

By default, this will break each PMC entry's body text up into multiple chunks to fit into the available context size.
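As a rough illustration of the chunking idea (not the code in this PR), here is a minimal sketch that greedily packs words into chunks under a token budget. The name `chunk_text`, the `max_tokens` parameter, and the whitespace-based default token counter are all assumptions made for the example; a real counter such as a tiktoken encoding could be passed in via `count_tokens` instead.

```python
from typing import Callable, List


def chunk_text(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    Hypothetical helper: each chunk holds as many words as fit within
    max_tokens according to count_tokens. A single word that alone
    exceeds the budget still becomes its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Splitting on whitespace discards the original line breaks, so this sketch suits cases where layout does not matter to the downstream prompt.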

@caufieldjh caufieldjh linked an issue Jul 18, 2023 that may be closed by this pull request
@caufieldjh
Member Author

This currently works reliably only with the 16k model, even for text inputs that should fit into a smaller context window. Possibly something odd going on with tokenization?

@caufieldjh
Member Author

Also have an issue with the result type:

Traceback (most recent call last):
  File "/home/harry/ontogpt/.venv/bin/ontogpt", line 6, in <module>
    sys.exit(main())
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/ontogpt/src/ontogpt/cli.py", line 328, in pubmed_extract
    results = ke.extract_from_text(text)
  File "/home/harry/ontogpt/src/ontogpt/engines/spires_engine.py", line 87, in extract_from_text
    return ExtractionResult(
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for ExtractionResult
input_text
  str type expected (type=type_error.str)
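The error says `input_text` received something other than a `str`. One possible reading, which the traceback alone does not confirm, is that the chunked PMC body arrives as a list of strings. Under that assumption, a small normalizing step (the name `as_text_list` is hypothetical) would let the caller run extraction once per chunk:

```python
from typing import List, Union


def as_text_list(text: Union[str, List[str]]) -> List[str]:
    """Return the input as a list of plain strings.

    Hypothetical helper, assuming the input is either one string or a
    list of chunks: a single string is wrapped in a list, and each
    element of a list is coerced to str.
    """
    if isinstance(text, str):
        return [text]
    return [str(chunk) for chunk in text]
```

A caller could then loop over `as_text_list(text)` and pass each element to the extraction call separately, so that every `input_text` is a single `str`.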

@caufieldjh
Member Author

Try this for a quick demo:

ontogpt pubmed-annotate -t phenotype "Takotsubo Cardiomyopathy: A Brief Review" --get-pmc --model gpt-3.5-turbo-16k --limit 3

@caufieldjh caufieldjh marked this pull request as ready for review July 19, 2023 19:35
@caufieldjh
Member Author

One factor inflating the token count of PMC text input is tables, where every cell sits on its own line. Example:

>>> import tiktoken
>>> encoding = tiktoken.get_encoding("cl100k_base")
>>> fulltable = """  81
...   M
...   Yes
...   Generalized
...   Dyspnea
...   Ventricular tachycardia
...   Yes
...   -
...   Unknown
...   Recovered
...   [47]"""
>>> flattable = "  81 M Yes Generalized Dyspnea Ventricular tachycardia Yes - Unknown Recovered [47]"
>>> len(encoding.encode(flattable))
25
>>> len(encoding.encode(fulltable))
44

There's no obvious reason to keep all the newlines, and we aren't parsing tables separately, so I'll replace the newlines with spaces.
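A minimal sketch of that replacement, assuming a regular expression is acceptable here (the function name `flatten_newlines` is made up for illustration and is not necessarily what the PR uses):

```python
import re


def flatten_newlines(text: str) -> str:
    """Collapse each newline, plus any whitespace around it, into one space.

    Leading whitespace before the first line is left alone because it
    is not adjacent to a newline.
    """
    return re.sub(r"\s*\n\s*", " ", text)


fulltable = """  81
  M
  Yes
  Generalized
  Dyspnea
  Ventricular tachycardia
  Yes
  -
  Unknown
  Recovered
  [47]"""

flat = flatten_newlines(fulltable)
```

Applied to the table row above, this produces the same string as `flattable`, including its two leading spaces.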

@caufieldjh
Member Author

Waiting to see if this works with what @AgranyaGitHub is doing in #149

@caufieldjh caufieldjh merged commit fcc882b into main Aug 2, 2023
2 checks passed
@caufieldjh caufieldjh deleted the get_pmc_text branch August 2, 2023 21:43
Development

Successfully merging this pull request may close these issues.

Add functionality to pubmed-extract to retrieve PMC XML
2 participants