Document or Increase the maximum amount of results using facets #69

Open
Tracked by #76
linogaliana opened this issue Apr 9, 2021 · 9 comments

Comments

@linogaliana

linogaliana commented Apr 9, 2021

I have the impression that the maximum number of results returned when querying facets is 10,000. I don't see that mentioned in the README or in the docs. Is it possible to get more results? (Maybe I am doing something wrong.)

Otherwise, mentioning that in the README would be useful.

import openfoodfacts
import pandas as pd

brands = openfoodfacts.facets.get_brands()
brands = pd.json_normalize(brands)
brands.shape
# (10000, 5)
packagings = openfoodfacts.facets.get_packaging()
packagings = pd.json_normalize(packagings)
packagings.shape
# (10000, 6)
@teolemon changed the title from "Maximum amount of results" to "Document or Increase the maximum amount of results using facets" on Jan 28, 2022

@MahmoudHamdy02

With the original link (for example https://world.openfoodfacts.org/packaging.json), if I save the data as a .json file and open it manually, it also shows 10000 rows, so I'm guessing this issue is related to the website API and not the Python module?
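A minimal sketch of that check, so the count can be reproduced without the Python module. It assumes the facet response is a JSON object whose records sit under a "tags" key; the helper names are hypothetical and not part of the openfoodfacts package.

```python
import json
from urllib.request import urlopen


def fetch_json(url: str) -> dict:
    """Download a URL and decode its body as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def count_facet_tags(payload: dict) -> int:
    """Count the records under the "tags" key (assumed response layout)."""
    return len(payload.get("tags", []))


# Example usage (performs a network request, so it is left commented out):
# payload = fetch_json("https://world.openfoodfacts.org/packaging.json")
# print(count_facet_tags(payload))
```

If the printed count matches the 10000 rows seen through the module, the cap is applied before the Python code ever sees the data.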

@Ansh-Sarkar

Ansh-Sarkar commented May 24, 2022

Hi! I was just getting started with openfoodfacts when I came across this issue.

The actual problem here is (quoting the thread linked below):

"The 10000 limit is being caused by omission of the parameter sysparm_limit, whose default value is 10000. If you specify a higher value in the URL then you can get the desired amount of records."
https://community.servicenow.com/community?id=community_question&sys_id=ee160f61db1cdbc01dcaf3231f961911

Even though we could certainly raise the limit on the number of records returned, doing so would almost certainly degrade performance and increase waiting times.

One way to solve this would be to add a set of new functions that handle pagination. We could have two kinds of functions: get_all_<facet_name>() and get_page_<facet_name>().

The get_all_<facet_name>() function would internally call get_page_<facet_name>() repeatedly until every page has been fetched. Since this data can be large, we could create a FacetContainer that stores all of the fetched data and also provides easy, efficient access to helper functions for manipulating and moving the data around.
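The loop described above could be sketched roughly as follows. This is only an illustration: the page fetcher is passed in as a callable, and its signature (facet name plus 1-based page number, returning an empty list past the last page) is an assumption rather than an existing function of the module.

```python
from typing import Callable, Dict, List

# Hypothetical signature: (facet_name, page_number) -> records on that page.
# The fetcher is assumed to return an empty list once the pages run out.
PageFetcher = Callable[[str, int], List[Dict]]


def get_all_facet(
    facet_name: str,
    get_page: PageFetcher,
    max_pages: int = 1000,
) -> List[Dict]:
    """Collect every record of a facet by requesting pages one by one."""
    records: List[Dict] = []
    for page in range(1, max_pages + 1):
        batch = get_page(facet_name, page)
        if not batch:
            # An empty page is treated as the end of the data.
            break
        records.extend(batch)
    return records
```

The max_pages argument is a safety cap so that a misbehaving fetcher cannot cause an endless loop; callers who only need one page can keep calling the page fetcher directly.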

Combined, these two suggestions, if implemented, should help resolve the following issues:

  • Document or Increase the maximum amount of results using facets #69: by dividing the available data into pages and giving control over the number of records returned per page.
  • how to get categories by page #56: the second part of this feature uses the FacetContainer class to provide functions for data manipulation and movement. The class could also apply more precise filters to the data it stores, making it a powerful tool for working with records.
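As a rough illustration of what such a FacetContainer might look like, here is a small sketch. The class does not exist in the package; its method names (filter, page) and the record fields used in the usage note are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, Iterator, List


class FacetContainer:
    """Hypothetical in-memory holder for fetched facet records."""

    def __init__(self, records: Iterable[Dict]) -> None:
        self._records: List[Dict] = list(records)

    def __len__(self) -> int:
        return len(self._records)

    def __iter__(self) -> Iterator[Dict]:
        return iter(self._records)

    def filter(self, predicate: Callable[[Dict], bool]) -> "FacetContainer":
        """Return a new container holding only the matching records."""
        return FacetContainer(r for r in self._records if predicate(r))

    def page(self, number: int, size: int) -> List[Dict]:
        """Return one 1-based page of locally stored records."""
        if number < 1 or size < 1:
            raise ValueError("number and size must be positive")
        start = (number - 1) * size
        return self._records[start:start + size]
```

For example, `FacetContainer(records).filter(lambda r: r.get("products", 0) > 10)` would keep only records whose (assumed) "products" field exceeds 10, and `.page(2, 100)` would return the second block of 100 records.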

I would love to work on and implement these features if they would help address the issues mentioned above.

@Ansh-Sarkar

@linogaliana @MahmoudHamdy02 @Anubhav-Bhargava please let me know what you think of this approach.

@linogaliana
Author

Hi, thanks for the suggestion. Yes, I think it's a great idea!

This would not penalize people who do not need to retrieve a large number of items, but it could help others retrieve more data.

@Ansh-Sarkar

Awesome. Thank you for the review. I'll start working on this feature right away and open a PR once it's done.

@Ansh-Sarkar

@linogaliana have you tried using this: https://world.openfoodfacts.org/?json=packaging&page=300

This seems to work and paginates the data properly. The problem seems to come from the get_<facet_name>() function calling the .json endpoint directly rather than passing the facet as a query argument, like ?json=packaging. The page argument can also be used to request a specific page. Although I believe this still calls for implementing the FacetContainer class, it looks like paging is already implemented by openfoodfacts-server and works properly.
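A small sketch of how such a paginated URL could be built, using only the json and page query parameters shown in the link above. The function name is hypothetical, and whether the server accepts other parameters (for example a page size) is not covered here.

```python
from urllib.parse import urlencode

BASE_URL = "https://world.openfoodfacts.org/"


def build_facet_page_url(facet_name: str, page: int = 1) -> str:
    """Build a URL of the form ?json=<facet>&page=<n> for one facet page."""
    if page < 1:
        raise ValueError("page must be 1 or greater")
    query = urlencode({"json": facet_name, "page": page})
    return f"{BASE_URL}?{query}"
```

Calling `build_facet_page_url("packaging", 300)` reproduces the link given above.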

Please let me know if the above solution works for you. Thanks!

@alexgarel @teolemon I would be grateful if you could review this issue and the solution above. I'm open to suggestions regarding the implementation of the FacetContainer class. Would it be a good addition to the codebase? Thanks in advance!

@alexgarel
Member

Hi @Ansh-Sarkar, thank you for trying to contribute, and really sorry for the lag (I let this notification slip away…).
Do not hesitate to come and ping us on Slack in the #python channel.

I'll comment on your ticket, but yes, I'm 100% in favour of a class.

Also, you are encouraged to migrate as much as possible to the current API: https://openfoodfacts.github.io/api-documentation/

Warning: the search v2 documentation is here (for now): https://wiki.openfoodfacts.org/Open_Food_Facts_Search_API_Version_2

@Ansh-Sarkar

@alexgarel not an issue at all. I had been waiting for a reply before starting work on this. I will join the Slack channel and start implementing this feature. Glad to contribute.

Also, yes, I'll check out the API documentation as well. Thanks!
