Export of full result sets #27

Open · simongray opened this issue May 15, 2023 · 1 comment

simongray (Member) commented:

Currently, the export functionality is limited to a single page of results. Ideally, it should be possible to download entire result sets, e.g. for "og", which currently yields ~300k results on the Clarin Korp instance.

Test of current query API endpoint

Querying for "og" in all available corpora, with the paging set to 500K hits, results in a very sluggish download:

https://alf.hum.ku.dk/korp/backend/query?default_context=1%20sentence&show=sentence,pos,msd,lemma,ref,prefix,suffix&show_struct=text_title&start=0&end=500000&corpus=LSPCONSTRUCTIONEB1,LSPCONSTRUCTIONEB2,LSPCONSTRUCTIONMURO,LSPCONSTRUCTIONSBI,LSPAGRICULTUREJORDBRUGSFORSKNING,LSPCLIMATEAKTUELNATURVIDENSKAB,LSPCLIMATEDMU,LSPCLIMATEHOVEDLAND,LSPCLIMATEOEKRAAD,LSPHEALTH1AKTUELNATURVIDENSKAB,LSPHEALTH1LIBRISSUNDHED,LSPHEALTH1NETPATIENT,LSPHEALTH1REGIONH,LSPHEALTH1SOEFARTSSTYRELSEN,LSPHEALTH1SST,LSPHEALTH2SUNDHEDDK1,LSPHEALTH2SUNDHEDDK2,LSPHEALTH2SUNDHEDDK3,LSPHEALTH2SUNDHEDDK5,LSPNANONANO1,LSPNANONANO2,LSPNANONANO3,LSPNANONANO4,LSPNANOAKTUELNATURVIDENSKAB&cqp=[word%20=%20%22og%22]&query_data=&context=&incremental=true&default_within=sentence&within=

For any query, the results are cached under the key query_data, so in theory a second attempt at downloading the same result set should be fast. We have tried with somewhat smaller queries (10K), albeit on a slow connection, and caching does appear to take effect.
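
A quick way to test this behaviour might look like the following (a minimal sketch in Python; the query_data round-tripping is an assumption based on the caching we observed, and only one corpus and the 10K page size are used for brevity):

```python
import requests

# Hypothetical test of the caching behaviour described above.
KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

params = {
    "default_context": "1 sentence",
    "show": "sentence,pos,msd,lemma,ref,prefix,suffix",
    "show_struct": "text_title",
    "start": 0,
    "end": 10000,                    # the smaller 10K test size
    "corpus": "LSPCONSTRUCTIONEB1",  # one corpus for brevity
    "cqp": '[word = "og"]',
    "default_within": "sentence",
}

first = requests.get(KORP_QUERY, params=params).json()

# Assumption: the response echoes a query_data key that can be passed
# back so the backend serves the cached result set on the second call.
params["query_data"] = first.get("query_data", "")
second = requests.get(KORP_QUERY, params=params).json()
print(len(second.get("kwic", [])), "hits in cached response")
```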

Hypothetical solution

We want to create a download(s) endpoint that proxies the query endpoint in order to produce output in different formats/configurations. In our case we only need a CSV encoding, but the endpoint should be designed so that other formats can be added too.
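
A first sketch of such a proxy, as a standalone Flask service (the /download route name, the CSV columns, and the exact shape of the KWIC JSON are all assumptions here):

```python
import csv
import io

import requests
from flask import Flask, Response, request

app = Flask(__name__)
KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

@app.route("/download")
def download():
    # Forward the incoming query string unchanged, so the API interface
    # of the download endpoint mirrors that of the query endpoint.
    resp = requests.get(KORP_QUERY, params=request.args.to_dict(flat=False))
    kwic = resp.json().get("kwic", [])

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["corpus", "match"])  # simplified column set
    for hit in kwic:
        words = " ".join(tok.get("word", "") for tok in hit.get("tokens", []))
        writer.writerow([hit.get("corpus", ""), words])

    return Response(
        buf.getvalue(),
        mimetype="text/csv",
        headers={"Content-Disposition": "attachment; filename=results.csv"},
    )
```

Swapping the CSV writer for another serializer is where other output formats would plug in.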

Chunking or entire query downloaded at once?

One consideration is whether we should keep a buffer of the query results and create the download by combining chunks. In the ideal case, we would keep such a buffer and perhaps associate the long URLs of our partial results with files on disk.

However, we could start out by simply serving a single file (CSV) and see how far that takes us. The CSV output should be a fraction of the size of the KWIC JSON representation used by the frontend.
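
If buffering the whole result set per request proves too heavy, the same proxy could instead page through the backend in fixed-size chunks and stream CSV rows as they arrive; a sketch under the same assumptions as above:

```python
import requests
from flask import Response, request

KORP_QUERY = "https://alf.hum.ku.dk/korp/backend/query"

def stream_csv(params, chunk_size=10_000):
    """Page through the query endpoint, yielding CSV lines per chunk.

    Assumes the start/end paging parameters shown in the URL above and
    that passing query_data back keeps repeat requests cheap.
    """
    yield "corpus,match\n"
    start, query_data = 0, ""
    while True:
        page = dict(params, start=start, end=start + chunk_size - 1,
                    query_data=query_data)
        data = requests.get(KORP_QUERY, params=page).json()
        query_data = data.get("query_data", query_data)
        kwic = data.get("kwic", [])
        if not kwic:
            break
        for hit in kwic:
            words = " ".join(t.get("word", "") for t in hit.get("tokens", []))
            # Naive quoting for the sketch; real code should use the csv module.
            yield '{},"{}"\n'.format(hit.get("corpus", ""), words)
        if len(kwic) < chunk_size:
            break
        start += chunk_size

# In the Flask route this becomes a streaming response:
# return Response(stream_csv(request.args.to_dict()), mimetype="text/csv")
```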

Plan of action

The first proof of concept should be a simple Flask service that runs locally and queries the (unprotected) backend endpoint across the network.
The second step should be putting this service onto our Clarin server, perhaps even into the Docker configuration (or running it locally).
The third step would be forking the korp-backend project and integrating our solution. This forked backend would then have to be the one used in our Docker setups from then on.
The fourth step is probably merging our solution into the upstream korp-backend repository; however, this requires significant coordination with Språkbanken, not to mention updating our version of Korp to match the one used by Språkbanken ahead of any kind of PR.

Other comments

The URL/path of the download endpoint should be 1:1 compatible with the regular Korp search page URL, i.e. we should be able to generate the URL in the frontend using simple string concatenation, keeping the Korp frontend changes minimal (see the sketch below).
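
As an illustration of that 1:1 mapping, producing the download link would reduce to swapping the endpoint path while keeping the query string untouched (the /download path is a placeholder):

```python
# Hypothetical: the frontend already has the full query URL, so the
# download URL is the same string with only the endpoint path swapped.
query_url = ("https://alf.hum.ku.dk/korp/backend/query"
             "?cqp=[word%20=%20%22og%22]&start=0&end=500000")
download_url = query_url.replace("/korp/backend/query?",
                                 "/korp/backend/download?", 1)
```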

simongray (Member, Author) commented:

Philip wants the entire breadth of search options to be available, which in effect means that the call to the backend query endpoint must be replicated in its entirety, i.e. the API interface must be identical.

Unfortunately, this means that we can't simply copy the URL from the address bar to construct download links. Instead, the frontend code that calls the query endpoint must be tracked down, and similar code must be duplicated to construct the path of the download endpoint for that specific search result page.
