
Pagination bug in GET /observations/observers #235

Closed
willkuhn opened this issue Dec 29, 2020 · 1 comment
@willkuhn
GET /observations/observers is giving some funky responses and seems to have an affinity for the number 500.

Breakdown with examples

Pages after the ~500th record contain duplicates

Here's some Python code (tweaked from #227) to test this behavior:

import requests
from collections import Counter
from time import sleep

BASE_URL = 'https://api.inaturalist.org/v1/observations/observers'

def get_paginated_results(per_page, n_pages):
    ids = []

    for page in range(1, n_pages + 1):
        response = requests.get(BASE_URL, params={'place_id': 72645, 'per_page': per_page, 'page': page})
        ids.extend([p['user_id'] for p in response.json()['results']])
        sleep(1)  # be polite to the API

    unique_ids = set(ids)
    duplicated_ids = {id for id, count in Counter(ids).items() if count > 1}

    # Find the position of the first repeat (i.e. the second occurrence of a duplicated ID)
    already_seen_once = set()
    for i, id in enumerate(ids):
        if id in duplicated_ids and id in already_seen_once:
            dup_id, dup_pos = id, i + 1
            break
        already_seen_once.add(id)

    print(f"Expected total:  \t{response.json()['total_results']}")
    print(f"Total IDs received: \t{len(ids)}")
    print(f"Unique IDs received: \t{len(unique_ids)}")
    print(f"Duplicated IDs: \t{len(duplicated_ids)}")
    if duplicated_ids:
        print(f"First duplicate ID: \t`{dup_id}` at position {dup_pos}")

Results from a few different page sizes:

>>> get_paginated_results(per_page=500, n_pages=1)
Expected total:         3931
Total IDs received:     500
Unique IDs received:    500
Duplicated IDs:         0

>>> get_paginated_results(500,2)
Expected total:         3931
Total IDs received:     1000
Unique IDs received:    549
Duplicated IDs:         451
First duplicate ID:     `3569438` at position 503

>>> get_paginated_results(100,6)
Expected total:         3931
Total IDs received:     1000
Unique IDs received:    549
Duplicated IDs:         451
First duplicate ID:     `3569438` at position 503

>>> get_paginated_results(31,25)
Expected total:         3931
Total IDs received:     4500
Unique IDs received:    549
Duplicated IDs:         500
First duplicate ID:     `3569438` at position 503

Pages after the 500th record are oversized

After around the 500th record, subsequent pages contain more than per_page records. Here are 2 examples for per_page=30 and per_page=20, respectively:

>>> get_paginated_results(30,17)
Expected total:         3931
Total IDs received:     500
Unique IDs received:    500
Duplicated IDs:         0

>>> get_paginated_results(30,18)
Expected total:         3931
Total IDs received:     1000
Unique IDs received:    549
Duplicated IDs:         451
First duplicate ID:     `3569438` at position 503

>>> get_paginated_results(30,19)
Expected total:         3931
Total IDs received:     1500
Unique IDs received:    549
Duplicated IDs:         500
First duplicate ID:     `3569438` at position 503


>>> get_paginated_results(20,25)
Expected total:         3931
Total IDs received:     500
Unique IDs received:    500
Duplicated IDs:         0

>>> get_paginated_results(20,26)
Expected total:         3931
Total IDs received:     1000
Unique IDs received:    549
Duplicated IDs:         451
First duplicate ID:     `3569438` at position 503

>>> get_paginated_results(20,27)
Expected total:         3931
Total IDs received:     1500
Unique IDs received:    549
Duplicated IDs:         500
First duplicate ID:     `3569438` at position 503

Pages past the 500th record contain 500 records regardless of per_page. Only 49 non-duplicate records were returned on the first post-500 page, and none at all on the page after that. Record 503 is the first duplicate.
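For comparison, a correctly paginated endpoint should return min(per_page, remaining) records on each page. A minimal sketch of the expected page sizes (hypothetical helper, not part of the API):

```python
def expected_page_size(total_results, per_page, page):
    """Number of records a correctly paginated endpoint returns on a given page."""
    remaining = total_results - (page - 1) * per_page
    return max(0, min(per_page, remaining))

# With total_results=3931 and per_page=30, page 17 should still be full...
print(expected_page_size(3931, 30, 17))   # 30
# ...and the last partial page is page 132, with 1 record:
print(expected_page_size(3931, 30, 132))  # 1
```

Instead, the endpoint returns a full 500 records on every page past that point.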

Duplicated records are not exact duplicates

The duplicated observers returned after record ~500 aren't exact duplicates: their user info is the same, but their stats differ. Here are two examples:

Stats from first record of user 3569438 on page: https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=10

{"user_id":3569438,"observation_count":30,"species_count":16,"user":{...}

...And their stats on the first post-500th-record page:
https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=18

{"user_id":3569438,"species_count":16,"observation_count":0,"user":{...}

Same for user 15723:
Initial stats: https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=1

{"user_id":15723,"observation_count":2346,"species_count":604,"user":{...}

Post-500th-record stats: https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=18

{"user_id":15723,"species_count":604,"observation_count":0,"user":{...}

Curiously, in both of these examples the post-500 record lists species_count before observation_count and zeroes out observation_count, while species_count itself is unchanged. Maybe that's the key to finding this bug.
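To make the difference concrete, here's a diff of the two quoted records for user 3569438 (stat values copied from the responses above; the full user object is omitted):

```python
# Record from page 10 (before the 500-record boundary)
first = {"user_id": 3569438, "observation_count": 30, "species_count": 16}
# Record from page 18 (after the boundary)
duplicate = {"user_id": 3569438, "species_count": 16, "observation_count": 0}

# Fields whose values changed between the two records
changed = {k: (first[k], duplicate[k]) for k in first if first[k] != duplicate[k]}
print(changed)  # {'observation_count': (30, 0)}
```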

Requests with order_by=species_count max out at 500 records

With per_page=30 we get 16 pages (= 480 records) of expected results:
https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=1&order_by=species_count
...
https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=16&order_by=species_count

...then page 17 only returns 20 records (making 500 total), instead of the expected 30:
https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=17&order_by=species_count

Page 18 (record 501+) returns {"error":"Error","status":500}:
https://api.inaturalist.org/v1/observations/observers?place_id=72645&per_page=30&page=18&order_by=species_count
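Until this is fixed server-side, a client can guard against the duplicate region by deduplicating on user_id and stopping once a page contributes nothing new. A sketch, where fetch_page is a hypothetical stand-in for the HTTP call so the logic can be exercised with canned pages:

```python
def collect_observers(fetch_page, max_pages=100):
    """Accumulate observer records keyed by user_id, stopping when a
    page is empty or contributes no unseen users (the duplicate region)."""
    seen = {}
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        new = [r for r in results if r['user_id'] not in seen]
        if not results or not new:
            break
        for r in new:
            seen[r['user_id']] = r
    return list(seen.values())

# Simulated buggy endpoint: page 3 just repeats page 1's users.
pages = {
    1: [{'user_id': 1}, {'user_id': 2}],
    2: [{'user_id': 3}],
    3: [{'user_id': 1}, {'user_id': 2}],
}
records = collect_observers(lambda p: pages.get(p, []))
print([r['user_id'] for r in records])  # [1, 2, 3]
```

Note this only stops the duplicates; records past the 500th still can't be retrieved until the server is fixed.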

I just spent like 2 hours writing this. I hope it's helpful and not toooooo looooooong!

@pleary (Member)

pleary commented Jan 22, 2021

This should be resolved along with #236

@pleary pleary closed this as completed Jan 22, 2021