Investigate how to deal when large amount of data is sent by provider through OPTIMADE #12

Closed
pedrobcst opened this issue Mar 16, 2022 · 4 comments

Comments

@pedrobcst
Owner

pedrobcst commented Mar 16, 2022

Currently, when a provider sends back a large amount of data from an OPTIMADE structure query, the request fails with a 503/504 error and the querier stops. An example is querying COD for the ['Si', 'O'] system.

For example this query URL will return 504

Example of extremely large file:
COD ID 1552091

Investigate how to handle this.
Ideas:

  1. If a query returns a 503/504 error, retry it with a smaller page_limit? (The current default is 10.)
  2. Skip the current query and move on to the next page (add page_limit to page_offset?)
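
Idea 1 could be sketched roughly as follows. This is only an illustration, not what Xerus implements: `fetch_page` is a hypothetical callable standing in for whatever performs the HTTP request, and both the halving strategy and the `min_limit` parameter are assumptions.

```python
from typing import Callable, Tuple

RETRYABLE = {503, 504}  # the gateway errors reported in this issue


def fetch_with_shrinking_limit(
    fetch_page: Callable[[int, int], Tuple[int, list]],
    page_offset: int,
    page_limit: int = 10,
    min_limit: int = 1,
) -> Tuple[list, int]:
    """Retry one page, halving page_limit after each 503/504.

    fetch_page(page_offset, page_limit) is a hypothetical helper that
    returns (status_code, entries). Returns (entries, limit_used).
    """
    limit = page_limit
    while True:
        status, entries = fetch_page(page_offset, limit)
        if status == 200:
            return entries, limit
        # Give up on non-retryable errors, or once the page cannot shrink further.
        if status not in RETRYABLE or limit <= min_limit:
            raise RuntimeError(
                f"query failed with HTTP {status} "
                f"at offset={page_offset}, limit={limit}"
            )
        limit = max(min_limit, limit // 2)
```

The caller would then advance page_offset by the number of entries actually returned rather than by the original page_limit, otherwise entries would be skipped after a shrunken page.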
@ml-evs
Collaborator

ml-evs commented Mar 16, 2022

I think this is related to the cartesian_site_positions issue with COD, i.e. they do not store the positions, so these must be reconstructed for every query. I raised an issue with the COD team (https://projects.ibt.lt/repositories/issues/1173) about the problem we faced when not restricting the page limit (a bug in the internal logic was recalculating more structures than necessary for the response).

If you remove the response_fields from that query it works instantly, and you can see the other data about that particular structure here: https://www.crystallography.net/cod/optimade/v1.1.0/structures/1552091. The problem seems to be that it is a cell of roughly 8000 atoms (the chemical_formula_descriptive field indicates O5280Si2640, i.e. 7920 atoms), for which the positions would be very expensive to recompute.

I would suggest an additional filter by nsites<100 or something reasonable, but unfortunately for COD they also do not know the nsites ahead of time until the positions have been computed. A possible workaround could be to give a reasonable upper bound on unit cell volume _cod_vol<10000 (the structure in question here has a volume of 167011 Å^3). Note, this is not the same as _cod_volume, which is for bibliographic references.
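
As a rough sketch of that workaround, the volume bound can be appended to the element filter when building the query URL. The base URL is the COD endpoint linked above; the `build_cod_query` helper, its parameter names, and the exact element filter are assumptions made for illustration, not the query Xerus actually sends.

```python
from urllib.parse import urlencode

# COD OPTIMADE structures endpoint, as linked earlier in this thread.
COD_STRUCTURES = "https://www.crystallography.net/cod/optimade/v1.1.0/structures"


def build_cod_query(elements, max_volume=10000, page_limit=10):
    """Build a structures query URL that excludes very large cells.

    Hypothetical helper: the filter combines an element match with the
    provider-specific _cod_vol upper bound suggested above.
    """
    quoted = ", ".join(f'"{el}"' for el in elements)
    filter_str = f"elements HAS ALL {quoted} AND _cod_vol<{max_volume}"
    params = {"filter": filter_str, "page_limit": page_limit}
    # urlencode percent-escapes the quotes, commas and '<' in the filter.
    return f"{COD_STRUCTURES}?{urlencode(params)}"


url = build_cod_query(["Si", "O"])
```

Because _cod_vol is a COD-specific field, the same filter should not be assumed to work against other providers.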

Skipping the page would be annoying as you will lose the other structures from that page (unless you step through to find the offending structure).
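
Stepping through a failing page could look something like the sketch below, which re-requests the page one entry at a time so that only the offending structure is dropped. It assumes offset-based pagination (page_offset), and `fetch_page` is again a hypothetical callable returning `(status_code, entries)`.

```python
from typing import Callable, List, Tuple


def salvage_page(
    fetch_page: Callable[[int, int], Tuple[int, list]],
    page_offset: int,
    page_limit: int,
) -> Tuple[list, List[int]]:
    """Re-request a failing page with page_limit=1 per entry.

    Returns (entries, skipped_offsets) so the caller can log which
    structures could not be retrieved.
    """
    entries: list = []
    skipped: List[int] = []
    for offset in range(page_offset, page_offset + page_limit):
        status, result = fetch_page(offset, 1)
        if status == 200:
            entries.extend(result)
        else:
            # Only this single entry is lost, not the whole page.
            skipped.append(offset)
    return entries, skipped
```

The cost is one request per entry of the failing page, so this is only worth doing as a fallback after the normal page request has failed.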

@ml-evs
Collaborator

ml-evs commented Mar 16, 2022

(Having more general handling for 504 might be useful though, as you suggest)

@pedrobcst
Owner Author

pedrobcst commented Mar 17, 2022

Thanks for the suggestions!

> I would suggest an additional filter by nsites<100 or something reasonable, but unfortunately for COD they also do not know the nsites ahead of time until the positions have been computed. A possible workaround could be to give a reasonable upper bound on unit cell volume _cod_vol<10000 (the structure in question here has a volume of 167011 Å^3). Note, this is not the same as _cod_volume, which is for bibliographic references.

I think it would indeed be much better to restrict the query to smaller cell volumes. Not only do these extremely large structures break the query, they will also (most of the time) break GSASII, and the only way to fix that is to manually delete those structures and upload to the database.
I will run some tests, and if we can manage to filter out those structures through OPTIMADE, it might be better to replace the current COD querier, which just downloads the CIFs directly from their URLs.

> (Having more general handling for 504 might be useful though, as you suggest)

I am wondering what the best way to handle that for all cases would be. For now I will just raise an error and limit the COD volume. If a 504 still happens due to a different issue, I will come back and think about how to handle it.

Edit:
I have tested _cod_vol<10000 and it seems to work quite nicely. We seem to lose around 100 structures (for Si-O), although for the purposes of Xerus they might just be noise that slows it down.

What remains is to compare the query speed of each interface.

@pedrobcst
Owner Author

This seems to be fixed by restricting the volume of COD structures, so I will close it for now.
