
Entrez search result limit #16

Closed
noamiz5060 opened this issue Feb 6, 2023 · 20 comments · Fixed by #18

Comments

@noamiz5060

Hello!
I've come to this project since the BioPython Entrez search fails me. It used to return more than 9999 results, but now there's this cursed limit. So, several questions:

  1. Is the default search the same as the one in BioPython?
  2. Are the articles returned by relevance? In BioPython they are, and the first articles' PMIDs here and there are different.
  3. And the most important one: how can I get more than 9999 results? I've tried in_batches_of with the entrez_api.search function but I still get only 9999 results.

I need the most simple use of these functions: I want to put in a term ('T cell' for example) and get a list of the PMIDs of the 100k most relevant articles. That's the only thing standing in my project's way.

Cheers

@noamiz5060
Author

I really need your help, my project is stuck and I'm really desperate.

@krassowski
Owner

Can you post a reproducible code example using in_batches_of that you tried and that did not work, so I could take a look?

@noamiz5060
Author

> Can you post a reproducible code example using in_batches_of that you tried and that did not work, so I could take a look?

Thank you very much for the reply!
Instead of posting my fruitless tries (probably due to my own lack of understanding), it will be more efficient to ask how to perform it in the simplest way: for example, if I wanted to get 100,000 IDs for 't cell' from PubMed/NCBI.

@noamiz5060
Author

Actually, I'll put an example where I partly succeeded. I've managed to get the result, but the limit is still 9999:

a = entrez_api.in_batches_of(1_000).search("t cell", max_results=100_000, database='pubmed')

When I try to use in_batches_of(1_000).fetch I get this error:

ValueError: Received str but a list-like container of identifiers was expected
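
The ValueError suggests that fetch expects a list of identifiers rather than the query string, i.e. search first and then pass the resulting ID list to fetch. A minimal sketch of that two-step flow, assuming the fetch usage shown in the easy-entrez README and the JSON result layout quoted later in this thread (tool name and email are placeholders):

from easy_entrez import EntrezAPI

entrez_api = EntrezAPI('my-tool-name', 'my@email.com', return_type='json')

# Step 1: search returns PMIDs (for PubMed still capped at 9999, see below).
result = entrez_api.search('t cell', max_results=9_999, database='pubmed')
pmids = result.data['esearchresult']['idlist']

# Step 2: fetch takes the list of identifiers, not the query string;
# in_batches_of splits the ID list across separate requests.
articles = entrez_api.in_batches_of(1_000).fetch(pmids, max_results=1_000, database='pubmed')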

@jonasfreimuth
Contributor

jonasfreimuth commented Oct 3, 2023

I am currently attempting to work with GEO series in the gds database. It would be very useful to me if I could just ask for all of something during EntrezAPI.search(), i.e. all the series released within the last 3 months. While in most cases setting max_results to 100_000 should be fine, there are currently slightly under 210'000 series in the database, so there is a given time interval for which this limit is too small. There should either be an option to also specify a retmax so that I can manually construct batches for search, or the batching system should be extended to EntrezAPI.search(). I notice there is already the batching-improvements branch, but there appear to be no commits there yet. Is there already some work on adding the batching functionality, or would it hypothetically be worthwhile for me to invest some time to come up with something myself?

So instead of

from easy_entrez import EntrezAPI
EntrezAPI("", "").in_batches_of(10**4 - 1).search({"entrytype": "GSE"}, max_results=100_000, database="gds")

I'd like to just say

from easy_entrez import EntrezAPI
EntrezAPI("", "").in_batches_of(10**4 - 1).search({"entrytype": "GSE"}, max_results=None, database="gds")

and get useful results (which will then be summarized and have their accession field extracted).
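
For that follow-up step, a rough sketch against the E-utilities ESummary endpoint directly (not the easy-entrez API); the lowercase 'accession' field name is an assumption about the gds JSON summary format:

import requests

ESUMMARY_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'

def fetch_accessions(uids):
    """Look up the accession (e.g. GSE identifier) for each gds UID."""
    accessions = []
    # Request summaries in small batches to keep the URLs short.
    for start in range(0, len(uids), 200):
        batch = uids[start:start + 200]
        response = requests.get(ESUMMARY_URL, params={
            'db': 'gds',
            'id': ','.join(batch),
            'retmode': 'json',
        })
        result = response.json()['result']
        for uid in result['uids']:
            accessions.append(result[uid]['accession'])
    return accessions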

@krassowski
Owner

batching-improvements was merged in #15 which is why you see no commits. I removed that branch now.

@jonasfreimuth
Contributor

Ok, and there is no work so far on adding batching to search? I don't understand the system very well yet; is there a special reason why search can't be decorated with supports_batches?

@krassowski
Owner

Thanks for your interest! To implement batching for search, a different approach than for the other methods is needed. Essentially, one would need to send subsequent requests with increasing retstart (see ESearch and the Perl example); this is of course a poor API on the side of Entrez, because if the database gets updated between queries you may miss some records or retrieve some records twice, which is why I was not keen on implementing it in the first place. However, I am happy to accept a pull request if it comes with reasonable documentation/warning explaining this potential pitfall.
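
For reference, a minimal sketch of that retstart pagination against the raw ESearch endpoint (not the easy-entrez API), with the caveat above applying, i.e. records can be missed or duplicated if the database changes between requests:

import requests

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def search_all(term: str, database: str, batch_size: int = 9_999) -> list:
    """Collect all UIDs for a query by incrementing retstart between requests."""
    ids = []
    retstart = 0
    while True:
        response = requests.get(ESEARCH_URL, params={
            'db': database,
            'term': term,
            'retstart': retstart,
            'retmax': batch_size,
            'retmode': 'json',
        })
        result = response.json()['esearchresult']
        batch = result['idlist']
        ids.extend(batch)
        retstart += len(batch)
        # PubMed will not page past the first 10,000 records regardless.
        if not batch or retstart >= int(result['count']):
            break
    return ids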

@jonasfreimuth
Contributor

jonasfreimuth commented Oct 3, 2023

Alright, I just understood why the batching doesn't work with search. Ok, I will see if I am able to wrangle up something that fulfills these criteria while doing what it's supposed to. Thanks for laying out what you expect 👍

@jonasfreimuth
Contributor

jonasfreimuth commented Oct 6, 2023

Actually, why is there this limit of 100'000 records for (e)search? I experimentally removed it, at least for the gds database, and everything seems to work fine.

resp = EntrezAPI("", "").search({"entrytype": "GSE"}, max_results = 300_000,  database="gds")
# The number of unique ids corresponds to the number of total results as reported by entrez.
len(set(resp.data["esearchresult"]["idlist"])) == int(resp.data["esearchresult"]["count"])

The documentation talks about a 10'000 UID limit, but for gds at least, that seems to be just as non-binding as the 100'000 record limit.

Some further experimentation revealed that the actual limit after which Entrez refuses to send anything is 2'147'483'647, a.k.a. the maximum value of a 32-bit signed integer, so if this works for all databases, retrieving all UIDs for a query would just entail setting max_results to that.

# Max limit test for convenience
resp = EntrezAPI("", "").search({"entrytype": "GSE"}, max_results = 2 ** 31 - 1,  database="gds")
len(set(resp.data["esearchresult"]["idlist"])) == int(resp.data["esearchresult"]["count"])

@krassowski
Owner

Well, the limit on search() is there exactly because the documentation of ESearch states that there is such a limit:

> retmax
>
> Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. [...] Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
>
> To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect, which contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.

It is good to hear that it works for you with a specific database but I suspect it might not work for all databases.

> Some further experimentation revealed that the actual limit after which Entrez refuses to send anything is 2'147'483'647, a.k.a. the maximum value of a 32-bit signed integer

Are there really as many records in the gds database? Or is len(set(resp.data["esearchresult"]["idlist"])) actually a lower number?

@jonasfreimuth
Contributor

> Well, the limit on search() is there exactly because the documentation of ESearch states that there is such a limit.

But that would be 10'000, not 100'000, or is there just a typo somewhere?

> It is good to hear that it works for you with a specific database but I suspect it might not work for all databases.

I am currently checking what I get back for the other ones, as defined in easy_entrez/data/entrez_databases.tsv.

> Are there really as many records in the gds database? Or is len(set(resp.data["esearchresult"]["idlist"])) actually a lower number?

In gds, for that query, there are ~210'000 records. In total though, gds comprises 6'961'960 records. I just wanted to find out the actual maximum number possible, because Entrez always returns the minimum of retmax and the actual number of results. I'd put Infinity there if it'd work; the idea is just to get everything by default. Having any number other than 'everything' is just arbitrary, no?

@krassowski
Owner

> But that would be 10'000, not 100'000, or is there just a typo somewhere?

Yes, it appears that the current limit is too lax and should have been 10k, not 100k.

@jonasfreimuth
Contributor

Should this limit then be dynamic, depending on the database? I can see if I can determine individual database limits by experimentation...

@krassowski
Owner

Well, these limits can change. I would be more inclined to have a separate argument force_override_max_results_i_know_what_i_am_doing: Optional[int] (ok, maybe the name could be shortened).
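
A hypothetical sketch of what such an escape hatch could look like (the names, the 10k cap, and the signature are illustrative, not the actual easy-entrez API):

from typing import Optional
from warnings import warn

MAX_RESULTS_LIMIT = 10_000  # assumed cap, per the ESearch documentation

def resolve_retmax(max_results: int = 20,
                   force_max_results: Optional[int] = None) -> int:
    """Validate max_results, unless the caller explicitly overrides the cap."""
    if force_max_results is not None:
        # The caller opts out of the documented limit at their own risk.
        warn('Exceeding the documented ESearch limit; Entrez may truncate results.')
        return force_max_results
    if max_results > MAX_RESULTS_LIMIT:
        raise ValueError(f'max_results cannot exceed {MAX_RESULTS_LIMIT}')
    return max_results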

@jonasfreimuth
Contributor

This seems reasonable 👍

@jonasfreimuth
Contributor

Could you look over #18 please, @krassowski? This should solve my immediate problem, and at least logically I don't see why the actual request limit for ESearch should ever lie below the number of search results, which would necessitate batched search. ESearch returns nothing (much) besides the IDs...

@krassowski changed the title from "Entrez result limit" to "Entrez search result limit" on Oct 12, 2023
@krassowski
Owner

It turns out I had an implementation of pagination locally: #21. I hope it is no longer needed with #18, but if it turns out that the Entrez API becomes more restrictive we can always get back to #21.

@krassowski
Owner

v0.3.7 is now released and available on PyPI: https://pypi.org/project/easy-entrez/0.3.7/

@jonasfreimuth
Contributor

Great, thank you very much!
