
Entrez search result limit #16

Closed
noamiz5060 opened this issue Feb 6, 2023 · 20 comments · Fixed by #18

Comments

@noamiz5060

Hello!
I've come to this project since the BioPython Entrez search fails me. It used to return more than 9999 results, but now there's this cursed limit. So, several questions:

  1. Is the default search the same as the one in BioPython?
  2. Are the articles returned by relevance? In BioPython they are, and the first articles' PMIDs here and there are different.
  3. And the most important one: how can I get more than 9999 results? I've tried in_batches_of with the entrez_api.search function but I still get only 9999 results.

I need the most simple use of these functions: I want to put in a term ('T cell' for example) and get a list of the PMIDs of the 100k most relevant articles. That's the only thing standing in my project's way.

Cheers

@noamiz5060
Author

I really need your help, my project is stuck and I'm really desperate.

@krassowski
Owner

Can you post a reproducible code example using in_batches_of that you tried and that did not work, so I could take a look?

@noamiz5060
Author

> Can you post a reproducible code example using in_batches_of that you tried and that did not work, so I could take a look?

Thank you very much for the reply!
Instead of posting my fruitless tries (probably due to my own lack of understanding), it will be more efficient to ask how to perform it in the simplest way: for example, if I wanted to get 100,000 IDs for 't cell' from PubMed/NCBI.

@noamiz5060
Author

Actually, I'll put an example where I partly succeeded. I've managed to get the result, but the limit is still 9999:

a = entrez_api.in_batches_of(1_000).search("t cell", max_results=100_000, database='pubmed')

When I try to use in_batches_of(1_000).fetch I get this error:

ValueError: Received str but a list-like container of identifiers was expected
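
The ValueError suggests that fetch expects a list of identifiers rather than the query string, i.e. search first and then pass the resulting ID list to fetch. A minimal sketch of that two-step flow, assuming the fetch usage shown in the easy-entrez README and the JSON result layout quoted later in this thread (tool name and email are placeholders):

from easy_entrez import EntrezAPI

entrez_api = EntrezAPI('my-tool-name', 'my@email.com', return_type='json')

# Step 1: search returns PMIDs (for PubMed still capped at 9999, see below).
result = entrez_api.search('t cell', max_results=9_999, database='pubmed')
pmids = result.data['esearchresult']['idlist']

# Step 2: fetch takes the list of identifiers, not the query string;
# in_batches_of splits the ID list across separate requests.
articles = entrez_api.in_batches_of(1_000).fetch(pmids, max_results=1_000, database='pubmed')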

@jonasfreimuth
Contributor

jonasfreimuth commented Oct 3, 2023

I am currently attempting to work with GEO series in the gds database. It would be very useful to me if I could just ask for all of something during EntrezAPI.search(), i.e. all the series released within the last 3 months. While in most cases setting max_results to 100_000 should be fine, there are currently slightly under 210'000 series in the database, so there is a given time interval for which this limit is too small. There should either be an option to also specify a retmax so that I can manually construct batches for search, or the batching system should be extended to EntrezAPI.search(). I notice there is already the batching-improvements branch, but there appear to be no commits there yet. Is there already some work on adding the batching functionality, or would it hypothetically be worthwhile for me to invest some time to come up with something myself?

So instead of

from easy_entrez import EntrezAPI
EntrezAPI("", "").in_batches_of(10**4 - 1).search({"entrytype": "GSE"}, max_results=100_000, database="gds")

I'd like to just say

from easy_entrez import EntrezAPI
EntrezAPI("", "").in_batches_of(10**4 - 1).search({"entrytype": "GSE"}, max_results=None, database="gds")

and get useful results (which will then be summarized and have their accession field extracted).
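
For that follow-up step, a rough sketch against the E-utilities ESummary endpoint directly (not the easy-entrez API); the lowercase 'accession' field name is an assumption about the gds JSON summary format:

import requests

ESUMMARY_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi'

def fetch_accessions(uids):
    """Look up the accession (e.g. GSE identifier) for each gds UID."""
    accessions = []
    # Request summaries in small batches to keep the URLs short.
    for start in range(0, len(uids), 200):
        batch = uids[start:start + 200]
        response = requests.get(ESUMMARY_URL, params={
            'db': 'gds',
            'id': ','.join(batch),
            'retmode': 'json',
        })
        result = response.json()['result']
        for uid in result['uids']:
            accessions.append(result[uid]['accession'])
    return accessions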

@krassowski
Owner

batching-improvements was merged in #15 which is why you see no commits. I removed that branch now.

@jonasfreimuth
Contributor

Ok, and there is no work so far on adding batching to search? I don't understand the system very well yet; is there a special reason why search can't be decorated with supports_batches?

@krassowski
Owner

Thanks for your interest! To implement batching for search, a different approach than for the other methods is needed. Essentially, one would need to send subsequent requests with increasing retstart (see ESearch and the Perl example); this is of course a poor API on the side of Entrez, because if the database gets updated between queries you may miss some records or retrieve some records twice, which is why I was not keen on implementing it in the first place. However, I am happy to accept a pull request if it comes with reasonable documentation/warning explaining this potential pitfall.
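
For reference, a minimal sketch of that retstart pagination against the raw ESearch endpoint (not the easy-entrez API), with the caveat above applying, i.e. records can be missed or duplicated if the database changes between requests:

import requests

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def search_all(term: str, database: str, batch_size: int = 9_999) -> list:
    """Collect all UIDs for a query by incrementing retstart between requests."""
    ids = []
    retstart = 0
    while True:
        response = requests.get(ESEARCH_URL, params={
            'db': database,
            'term': term,
            'retstart': retstart,
            'retmax': batch_size,
            'retmode': 'json',
        })
        result = response.json()['esearchresult']
        batch = result['idlist']
        ids.extend(batch)
        retstart += len(batch)
        # PubMed will not page past the first 10,000 records regardless.
        if not batch or retstart >= int(result['count']):
            break
    return ids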

@jonasfreimuth
Contributor

jonasfreimuth commented Oct 3, 2023

Alright, I just understood why the batching doesn't work with search. Ok, I will see if I am able to wrangle up something that fulfills these criteria while doing what it's supposed to. Thanks for laying out what you expect 👍

@jonasfreimuth
Contributor

jonasfreimuth commented Oct 6, 2023

Actually, why is there this limit of 100'000 records for (e)search? I experimentally removed it, at least for the gds database, and everything seems to work fine.

resp = EntrezAPI("", "").search({"entrytype": "GSE"}, max_results = 300_000,  database="gds")
# The number of unique ids corresponds to the number of total results as reported by entrez.
len(set(resp.data["esearchresult"]["idlist"])) == int(resp.data["esearchresult"]["count"])

The documentation talks about a 10'000 UID limit, but for gds at least, that seems to be just as non-binding as the 100'000 record limit.

Some further experimentation revealed that the actual limit after which Entrez refuses to send anything is 2'147'483'647, a.k.a. the maximum value of a 32-bit signed integer, so if this works for all databases, retrieving all UIDs for a query would just entail setting max_results to that.

# Max limit test for convenience
resp = EntrezAPI("", "").search({"entrytype": "GSE"}, max_results = 2 ** 31 - 1,  database="gds")
len(set(resp.data["esearchresult"]["idlist"])) == int(resp.data["esearchresult"]["count"])

@krassowski
Owner

Well, the limit on search() is there exactly because the documentation of ESearch states that there is such a limit:

> retmax
>
> Total number of UIDs from the retrieved set to be shown in the XML output (default=20). By default, ESearch only includes the first 20 UIDs retrieved in the XML output. [...] Increasing retmax allows more of the retrieved UIDs to be included in the XML output, up to a maximum of 10,000 records.
>
> To retrieve more than 10,000 UIDs from databases other than PubMed, submit multiple esearch requests while incrementing the value of retstart (see Application 3). For PubMed, ESearch can only retrieve the first 10,000 records matching the query. To obtain more than 10,000 PubMed records, consider using EDirect, which contains additional logic to batch PubMed search results automatically so that an arbitrary number can be retrieved.

It is good to hear that it works for you with a specific database but I suspect it might not work for all databases.

> Some further experimentation revealed that the actual limit after which Entrez refuses to send anything is 2'147'483'647, a.k.a. the maximum value of a 32-bit signed integer

Are there really as many records in the gds database? Or is len(set(resp.data["esearchresult"]["idlist"])) actually a lower number?

@jonasfreimuth
Contributor

> Well, the limit on search() is there exactly because the documentation of ESearch states that there is such a limit.

But that would be 10'000, not 100'000, or is there just a typo somewhere?

> It is good to hear that it works for you with a specific database but I suspect it might not work for all databases.

I am currently checking what I get back for the other ones, as defined in easy_entrez/data/entrez_databases.tsv.

> Are there really as many records in the gds database? Or is len(set(resp.data["esearchresult"]["idlist"])) actually a lower number?

In gds, for that query, there are ~210'000 records. In total though, gds comprises 6'961'960 records. I just wanted to find out the actual maximum number possible, because Entrez always returns the minimum of retmax and the actual number of results. I'd put Infinity there if it'd work; the idea is just to get everything by default. Having any number other than 'everything' is just arbitrary, no?

@krassowski
Owner

> But that would be 10'000, not 100'000, or is there just a typo somewhere?

Yes, it appears that the current limit is too lax and should have been 10k, not 100k.

@jonasfreimuth
Contributor

Should this limit then be dynamic, depending on the database? I can see if I can determine individual database limits by experimentation...

@krassowski
Owner

Well, these limits can change. I would be more inclined to have a separate argument force_override_max_results_i_know_what_i_am_doing: Optional[int] (ok, maybe the name could be shortened).
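
A hypothetical sketch of what such an escape hatch could look like (the names, the 10k cap, and the signature are illustrative, not the actual easy-entrez API):

from typing import Optional
from warnings import warn

MAX_RESULTS_LIMIT = 10_000  # assumed cap, per the ESearch documentation

def resolve_retmax(max_results: int = 20,
                   force_max_results: Optional[int] = None) -> int:
    """Validate max_results, unless the caller explicitly overrides the cap."""
    if force_max_results is not None:
        # The caller opts out of the documented limit at their own risk.
        warn('Exceeding the documented ESearch limit; Entrez may truncate results.')
        return force_max_results
    if max_results > MAX_RESULTS_LIMIT:
        raise ValueError(f'max_results cannot exceed {MAX_RESULTS_LIMIT}')
    return max_results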

@jonasfreimuth
Contributor

This seems reasonable 👍

@jonasfreimuth
Contributor

Could you look over #18 please, @krassowski? This should solve my immediate problem, and at least logically I don't see why the actual request limit for ESearch should ever lie below the number of search results, which would necessitate batched search. ESearch returns nothing (much) besides the IDs...

@krassowski changed the title from "Entrez result limit" to "Entrez search result limit" on Oct 12, 2023
@krassowski
Owner

It turns out I had an implementation of pagination locally: #21. I hope it is no longer needed with #18, but if it turns out that the Entrez API becomes more restrictive we can always get back to #21.

@krassowski
Owner

v0.3.7 is now released and available on PyPI: https://pypi.org/project/easy-entrez/0.3.7/

@jonasfreimuth
Contributor

Great, thank you very much!
