Export of full result sets #27

Open · simongray opened this issue May 15, 2023 · 1 comment

simongray (Member) commented:

Currently, the export functionality is limited to a single page of results. Ideally, it should be possible to download entire result sets, e.g. for "og", which currently yields ~300k results on the Clarin Korp instance.

Test of current query API endpoint

Querying for "og" in all available corpora, with the paging set to 500K hits, results in a very sluggish download:

https://alf.hum.ku.dk/korp/backend/query?default_context=1%20sentence&show=sentence,pos,msd,lemma,ref,prefix,suffix&show_struct=text_title&start=0&end=500000&corpus=LSPCONSTRUCTIONEB1,LSPCONSTRUCTIONEB2,LSPCONSTRUCTIONMURO,LSPCONSTRUCTIONSBI,LSPAGRICULTUREJORDBRUGSFORSKNING,LSPCLIMATEAKTUELNATURVIDENSKAB,LSPCLIMATEDMU,LSPCLIMATEHOVEDLAND,LSPCLIMATEOEKRAAD,LSPHEALTH1AKTUELNATURVIDENSKAB,LSPHEALTH1LIBRISSUNDHED,LSPHEALTH1NETPATIENT,LSPHEALTH1REGIONH,LSPHEALTH1SOEFARTSSTYRELSEN,LSPHEALTH1SST,LSPHEALTH2SUNDHEDDK1,LSPHEALTH2SUNDHEDDK2,LSPHEALTH2SUNDHEDDK3,LSPHEALTH2SUNDHEDDK5,LSPNANONANO1,LSPNANONANO2,LSPNANONANO3,LSPNANONANO4,LSPNANOAKTUELNATURVIDENSKAB&cqp=[word%20=%20%22og%22]&query_data=&context=&incremental=true&default_within=sentence&within=

For any query, the results are cached under the key query_data, so in theory a second attempt at downloading the same result set should be fast. We have tried with somewhat smaller queries (10K), albeit on a slow connection, and caching does appear to take effect.
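
A quick way to test this behaviour might look like the following (a minimal sketch in Python; the query_data round-tripping is an assumption based on the caching we observed, and only one corpus and the 10K page size are used for brevity):

```python
import requests

# Hypothetical test of the caching behaviour described above.
KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

params = {
    "default_context": "1 sentence",
    "show": "sentence,pos,msd,lemma,ref,prefix,suffix",
    "show_struct": "text_title",
    "start": 0,
    "end": 10000,                    # the smaller 10K test size
    "corpus": "LSPCONSTRUCTIONEB1",  # one corpus for brevity
    "cqp": '[word = "og"]',
    "default_within": "sentence",
}

first = requests.get(KORP_QUERY, params=params).json()

# Assumption: the response echoes a query_data key that can be passed
# back so the backend serves the cached result set on the second call.
params["query_data"] = first.get("query_data", "")
second = requests.get(KORP_QUERY, params=params).json()
print(len(second.get("kwic", [])), "hits in cached response")
```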

Hypothetical solution

We want to create a download(s) endpoint that proxies the query endpoint in order to produce output in different formats/configurations. In our case we only need a CSV encoding, but the endpoint should be designed so that other formats can be added too.
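
A first sketch of such a proxy, as a standalone Flask service (the /download route name, the CSV columns, and the exact shape of the KWIC JSON are all assumptions here):

```python
import csv
import io

import requests
from flask import Flask, Response, request

app = Flask(__name__)
KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

@app.route("/download")
def download():
    # Forward the incoming query string unchanged, so the API interface
    # of the download endpoint mirrors that of the query endpoint.
    resp = requests.get(KORP_QUERY, params=request.args.to_dict(flat=False))
    kwic = resp.json().get("kwic", [])

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["corpus", "match"])  # simplified column set
    for hit in kwic:
        words = " ".join(tok.get("word", "") for tok in hit.get("tokens", []))
        writer.writerow([hit.get("corpus", ""), words])

    return Response(
        buf.getvalue(),
        mimetype="text/csv",
        headers={"Content-Disposition": "attachment; filename=results.csv"},
    )
```

Swapping the CSV writer for another serializer is where other output formats would plug in.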

Chunking or entire query downloaded at once?

One consideration is whether we should keep a buffer of the query results and create the download by combining chunks. In the ideal case, we would keep such a buffer and perhaps associate the long URLs of our partial results with files on disk.

However, we could start out by simply serving a single file (CSV) and see how far that takes us. The CSV output should be a fraction of the size of the KWIC JSON representation used by the frontend.
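
If buffering the whole result set per request proves too heavy, the same proxy could instead page through the backend in fixed-size chunks and stream CSV rows as they arrive; a sketch under the same assumptions as above:

```python
import requests
from flask import Response, request

KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

def stream_csv(params, chunk_size=10_000):
    """Page through the query endpoint, yielding CSV lines per chunk.

    Assumes the start/end paging parameters shown in the URL above and
    that passing query_data back keeps repeat requests cheap.
    """
    yield "corpus,match\n"
    start, query_data = 0, ""
    while True:
        page = dict(params, start=start, end=start + chunk_size - 1,
                    query_data=query_data)
        data = requests.get(KORP_QUERY, params=page).json()
        query_data = data.get("query_data", query_data)
        kwic = data.get("kwic", [])
        if not kwic:
            break
        for hit in kwic:
            words = " ".join(t.get("word", "") for t in hit.get("tokens", []))
            # Naive quoting for the sketch; real code should use the csv module.
            yield '{},"{}"\n'.format(hit.get("corpus", ""), words)
        if len(kwic) < chunk_size:
            break
        start += chunk_size

# In the Flask route this becomes a streaming response:
# return Response(stream_csv(request.args.to_dict()), mimetype="text/csv")
```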

Plan of action

The first proof of concept should be a simple Flask service that runs locally and queries the (unprotected) backend endpoint across the network.
The second step should be putting this service onto our Clarin server, perhaps even into the Docker configuration (or running it locally).
The third step would be forking the korp-backend project and integrating our solution. This forked backend would then have to be the one used in our Docker setups from then on.
The fourth step is probably merging our solution into the upstream korp-backend repository; however, this requires significant coordination with Språkbanken, not to mention updating our version of Korp to match the one used by Språkbanken ahead of any kind of PR.

Other comments

The URL/path of the download endpoint should be 1:1 compatible with the regular Korp search page URL, i.e. we should be able to generate the URL in the frontend using simple string concatenation, keeping the Korp frontend changes minimal (see the sketch below).
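
As an illustration of that 1:1 mapping, producing the download link would reduce to swapping the endpoint path while keeping the query string untouched (the /download path is a placeholder):

```python
# Hypothetical: the frontend already has the full query URL, so the
# download URL is the same string with only the endpoint path swapped.
query_url = ("https://alf.hum.ku.dk/korp/backend/query"
             "?cqp=[word%20=%20%22og%22]&start=0&end=500000")
download_url = query_url.replace("/korp/backend/query?",
                                 "/korp/backend/download?", 1)
```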

simongray (Member, Author) commented:

Philip wants the entire breadth of search options to be available, which in effect means that the call to the backend query endpoint must be replicated in its entirety, i.e. the API interface must be identical.

Unfortunately, this means that we can't simply copy the URL from the address bar to construct download links. Instead, the frontend code that calls the query endpoint must be tracked down, and similar code must be duplicated to construct the path of the download endpoint for that specific search result page.
