Add functionality for PubMed Central text retrieval #156

caufieldjh · 2023-07-18T17:37:59Z

Run as, for example: ontogpt pubmed-extract -t core.TextWithTriples --get-pmc 25833107

By default, this splits each PMC entry's body text into multiple chunks so that each one fits into the available context size.
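
As a rough illustration of the chunking idea (not the code in this PR), here is a minimal sketch. The name `chunk_text`, the `max_tokens` parameter, and the whitespace-based default counter are all assumptions made for the example; a real tokenizer would count differently.

```python
from typing import Callable, List


def chunk_text(
    text: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    Hypothetical helper: each chunk stays at or under max_tokens according
    to count_tokens, except that a single word larger than the budget
    becomes its own (oversized) chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = current + [word]
        # Close the current chunk once adding the next word would exceed the budget.
        if current and count_tokens(" ".join(candidate)) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current = candidate
    if current:
        chunks.append(" ".join(current))
    return chunks
```

To budget against actual model tokens, a tokenizer-backed counter such as `lambda s: len(encoding.encode(s))` with a tiktoken encoding could be passed as `count_tokens`. Re-counting the whole candidate on every word is simple but quadratic, which is acceptable for a sketch.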

caufieldjh · 2023-07-19T17:52:54Z

This currently works reliably only with the 16k model, even for text inputs that should fit into the context. Is something odd going on with tokenization?

caufieldjh · 2023-07-19T17:56:23Z

There is also an issue with the result type:

Traceback (most recent call last):
  File "/home/harry/ontogpt/.venv/bin/ontogpt", line 6, in <module>
    sys.exit(main())
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/harry/ontogpt/.venv/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/harry/ontogpt/src/ontogpt/cli.py", line 328, in pubmed_extract
    results = ke.extract_from_text(text)
  File "/home/harry/ontogpt/src/ontogpt/engines/spires_engine.py", line 87, in extract_from_text
    return ExtractionResult(
  File "pydantic/main.py", line 342, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for ExtractionResult
input_text
  str type expected (type=type_error.str)
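
One plausible reading of this error, stated as an assumption rather than a confirmed diagnosis, is that `text` has become a list of chunks while `ExtractionResult.input_text` expects a single `str`. Under that assumption, a small normalizing helper (the name `as_text_chunks` is hypothetical) lets the caller handle both cases and pass one string at a time:

```python
from typing import List, Union


def as_text_chunks(text: Union[str, List[str]]) -> List[str]:
    """Return a list of strings whether given one string or a list of chunks.

    Hypothetical helper; it assumes the validation error comes from a list
    being passed where a single str is expected.
    """
    if isinstance(text, str):
        return [text]
    return [str(chunk) for chunk in text]


# Assumed usage in the CLI, calling the engine once per chunk:
#     results = [ke.extract_from_text(chunk) for chunk in as_text_chunks(text)]
```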

caufieldjh · 2023-07-19T19:28:26Z

Try this for a quick demo:

ontogpt pubmed-annotate -t phenotype "Takotsubo Cardiomyopathy: A Brief Review" --get-pmc --model gpt-3.5-turbo-16k --limit 3

caufieldjh · 2023-07-19T19:44:06Z

Tables are one of the factors inflating the token count of PMC text input. Example:

>>> import tiktoken
>>> encoding = tiktoken.get_encoding("cl100k_base")
>>> fulltable = """  81
...   M
...   Yes
...   Generalized
...   Dyspnea
...   Ventricular tachycardia
...   Yes
...   -
...   Unknown
...   Recovered
...   [47]"""
>>> flattable = "  81 M Yes Generalized Dyspnea Ventricular tachycardia Yes - Unknown Recovered [47]"
>>> len(encoding.encode(flattable))
25
>>> len(encoding.encode(fulltable))
44

There is no obvious reason to keep all the newlines, and we aren't parsing tables separately, so I'll replace the newlines with spaces.
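
The newline replacement could look like the sketch below. The function name `flatten_newlines` is hypothetical and this is not necessarily what the commit does. It also drops the indentation around each newline, which is what makes the result match `flattable` from the example above.

```python
import re


def flatten_newlines(text: str) -> str:
    """Replace each newline, plus any whitespace touching it, with one space.

    Hypothetical helper for illustration only.
    """
    return re.sub(r"\s*\n\s*", " ", text)
```

Applied to `fulltable` from the session above, this yields exactly the `flattable` string, so the token count would drop from 44 to 25 for that row.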

caufieldjh · 2023-07-23T07:49:19Z

Waiting to see whether this works with what @AgranyaGitHub is doing in #149.

Add cli option for PMC text retrieval

9e2500c

caufieldjh linked an issue Jul 18, 2023 that may be closed by this pull request

Add functionality to pubmed-extract to retrieve PMC XML #155

Closed

caufieldjh added 9 commits July 18, 2023 13:41

Pre-linting

bea1e23

Begin expand pubmed client, modify output from CLI

c855eb5

Parse PMC ID from the PM XML

33bf24b

Move parse_pmxml into the client class

134d564

Retrieve body text from PMC entry

0db4c68

Linty lint lint lint

a3f3f7e

Chunk full texts - context size isn't quite right yet

b5a856c

Add 16K model to models list

018521e

Logging n' linting

924acea

caufieldjh added 3 commits July 19, 2023 14:35

Minor log output change

8d92785

Update CLI

ef7da43

Change annotation limit to option, --limit

6987ba8

caufieldjh added 2 commits July 19, 2023 15:30

Comments in CLI

5edcad1

Lyent

fc01dec

caufieldjh marked this pull request as ready for review July 19, 2023 19:35

Replace newlines in PMC input

45ac219

Merge branch 'main' into get_pmc_text

0e8aaea

caufieldjh merged commit fcc882b into main Aug 2, 2023
2 checks passed

caufieldjh deleted the get_pmc_text branch August 2, 2023 21:43

caufieldjh mentioned this pull request Aug 24, 2023

Add parser for PubMedCentral full texts #154

Closed
