duplicate responses when paging query results with api.inaturalist.org/v1/projects/autocomplete #218

Closed
PietrH opened this issue Oct 20, 2020 · 4 comments

Comments

@PietrH

PietrH commented Oct 20, 2020

We are trying to build a list of all projects with "Bioblitz" in the description, using the https://api.inaturalist.org/v1/projects/autocomplete?q= GET request and collating the JSON outputs in Python.

I'm looping over a request with page size 100, and waiting at least a second between requests (fewer than 60 requests a minute, as per the guidelines).

I'm getting a lot of duplicate responses, and also not fetching all results. I'm expecting around 1600 pages, but only getting around 500 total results.

This is the code I'm running:

import json
from time import sleep
import pandas
import requests

# set the number of objects we want to capture per query, max 100 according to the documentation
page_size = 100

# define function to query api
def get_inat_projects(query, page):
    url = "https://api.inaturalist.org/v1/projects/autocomplete"
    # pass the search term through params so requests URL-encodes it
    q = {
        "q": query,
        "type": "collection",
        "per_page": page_size,
        "page": page,
    }

    response = requests.get(url, params=q)

    # Please note that we throttle API usage to a max of 100 requests per minute,
    # though we ask that you try to keep it to 60 requests per minute or lower,
    # and to keep under 10,000 requests per day.

    sleep(1)
    return response.json()
    

# init dataframe
output=pandas.DataFrame()

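# get total number of pages, rounding up so a partial last page is included
total_results = get_inat_projects("Bioblitz", 1)["total_results"]
total_pages = -(-total_results // page_size)
print("this query will take at least " + str(round((total_pages + 1) / 60, 2)) + "m to complete")

# loop query over all pages
for i in range(1, total_pages + 1):
    page = pandas.DataFrame(get_inat_projects("Bioblitz", i)["results"])
    # DataFrame.append was removed in pandas 2.0; concat does the same job
    output = pandas.concat([output, page])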

Am I addressing the API in the wrong way? Is there a way I can get a larger query response and page through that result? Or is there a bug in the API resulting in duplicate responses?

@PietrH
Author

PietrH commented Oct 20, 2020

This is a forum thread discussing this issue: https://forum.inaturalist.org/t/unable-to-page-through-projects-from-the-api/17328/11

@kueda
Member

kueda commented Oct 20, 2020

The autocomplete endpoint is... for autocomplete interfaces, not for scraping. It shouldn't have duplicates so I'll leave this open, but you might want to try https://api.inaturalist.org/v1/docs/#!/Projects/get_projects instead.

@PietrH
Author

PietrH commented Oct 21, 2020

If I'm looking for all projects containing Bioblitz, emulating this search: https://www.inaturalist.org/projects/search?utf8=%E2%9C%93&q=bioblitz

I would use https://api.inaturalist.org/v1/search?q=Bioblitz&sources=projects instead. That way I get the same number of results anyway.

Thanks for the help!
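
For anyone landing here later, a minimal sketch of paging through that search endpoint might look like the following. Only the `q` and `sources` parameters come from the URL above. The `per_page` and `page` parameters and the `total_results` / `results` response keys are assumed to behave as they do for the autocomplete endpoint, and the function names are hypothetical. It uses only the standard library, and the fetch function is passed in so the paging logic can be tried without touching the network.

```python
import json
import time
import urllib.parse
import urllib.request

SEARCH_URL = "https://api.inaturalist.org/v1/search"


def fetch_search_page(query, page, per_page=100):
    """Fetch one page of project search results as parsed JSON.

    Assumption: the endpoint accepts `per_page` and `page` in addition
    to the `q` and `sources` parameters mentioned in this thread.
    """
    params = urllib.parse.urlencode(
        {"q": query, "sources": "projects", "per_page": per_page, "page": page}
    )
    with urllib.request.urlopen(f"{SEARCH_URL}?{params}", timeout=30) as response:
        return json.load(response)


def iter_search_results(query, fetch=fetch_search_page, per_page=100, delay=1.0):
    """Yield results page by page until the reported total is covered."""
    page = 1
    while True:
        payload = fetch(query, page, per_page)
        results = payload.get("results", [])
        if not results:
            break  # an empty page means there is nothing left to read
        yield from results
        # Assumption: `total_results` is the count across all pages.
        if page * per_page >= payload.get("total_results", 0):
            break
        page += 1
        time.sleep(delay)  # stay under roughly 60 requests per minute
```

Calling `list(iter_search_results("Bioblitz"))` would then collect every page into one list, pausing a second between requests.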

@pleary
Member

pleary commented Jul 14, 2021

The source of the problem is discussed in a similar ticket, #227 (comment). I'm going to close this, as there is little we can do on our end to prevent duplicates across pages when not using a unique ordering param as long as we're using this version of Elasticsearch. We could add a statement to the API documentation describing the problem and stating that unless you're sorting by a reasonably unique order parameter like ID or date, we cannot guarantee complete and non-duplicated results across multiple pages of requests.
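
On the client side, one way to cope with this is to drop repeats while collecting pages and to compare the number of unique records with the reported total. The sketch below is not something proposed in this thread. It assumes each record carries a unique `id` field, and the helper names are hypothetical.

```python
from typing import Any, Dict, Iterable, List


def collect_unique(
    pages: Iterable[List[Dict[str, Any]]], key: str = "id"
) -> List[Dict[str, Any]]:
    """Merge pages of results, keeping the first record seen for each key."""
    seen = set()
    unique = []
    for page in pages:
        for record in page:
            record_key = record[key]
            if record_key in seen:
                continue  # the same record was served again on a later page
            seen.add(record_key)
            unique.append(record)
    return unique


def missing_count(total_results: int, unique: List[Dict[str, Any]]) -> int:
    """Number of records the API reported that paging never returned."""
    return max(total_results - len(unique), 0)
```

A non-zero `missing_count` after a full pass suggests that some records were skipped by the unstable ordering, so deduplication alone does not make the result set complete.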

pleary closed this as completed Jul 14, 2021