
Update BODS Data Review Tool to handle larger data analysis #69

Closed
StephenAbbott opened this issue Aug 17, 2021 · 23 comments

@StephenAbbott
Member

During the process of checking Latvia's initial release of beneficial ownership data from January 2021, we attempted to use the DRT to analyse how closely their data complies with BODS. But the file could not be analysed as it is too big.

No file size limit is explicitly given in the notes on this page, so the documentation should be updated to include an explanation of the limit.

@StephenAbbott StephenAbbott changed the title Update Data Review Tool user guidance to include explanation about max size of files which can be analysed Update BODS Data Review Tool to handle larger data analysis Oct 5, 2021
@StephenAbbott
Member Author

Following discussion with @kd-ods and @ScatteredInk, I'm updating this card so that it covers not just better error messaging around the maximum file size, but also documenting the maximum data file size allowed and seeing how we can increase it, to allow analysis and review of the larger BODS datasets which are starting to be released.

Kadie/Jack: you referenced similar work done recently to expand the capacity of IATI's validator. Is this the version 2 work referenced at https://github.com/IATI/IATI-Validator-Web/releases and https://github.com/IATI/js-validator-api/releases? I'd appreciate it if you could add any useful repo links so we can flesh out this issue.

@StephenAbbott
Member Author

@odscjames Flagging this issue

@odscjames
Collaborator

@StephenAbbott Can you be exact about what you tried when you got the error?

@StephenAbbott
Member Author

@odscjames

  1. Downloaded the Latvian BODS zipped dataset from data.gov.lv. It's 31.4MB zipped
  2. Unzipped the file and ended up with a 316.2MB BODS JSON file
  3. Went to the Data Review Tool and chose that file. Clicked Submit
  4. Then, after hanging on the Loading... screen for a long time, I got the following error message, which seems to be caused by the file size being too large (although the error messaging doesn't identify this as the reason for the 503 error)

[Screenshot of the 503 error page shown by the Data Review Tool]

@odscjames
Collaborator

I ran this on my laptop (if you download https://pypi.org/project/libcovebods/ there is a CLI tool included). It took 18 minutes. I have the results file as JSON.

So my first thought is that even if we get this down, getting it to the 20 seconds or so that a web request can really take before people start to get impatient is going to be very challenging.

A first take on this is maybe to put in a message queue, a background worker and a holding screen saying "please wait, your file is being processed - don't leave this page".

18 minutes is still a long time to stare at a web page - ideally we would then work to get it down further, or at least provide some kind of progress count. But at least it's better than a timeout and a crash.

In terms of documenting a limit, it's hard to be exact about what the limit should be. If we create a holding page for large files, we can put messages there, and then people will see them only when relevant.
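
For illustration, here's a framework-neutral sketch of the message queue / background worker / holding screen pattern described above. The real DRT would use its own web framework and a proper queue rather than in-process threads, and run_all_bods_checks is a placeholder, not the tool's real API:

```python
# Sketch only: accept the upload, hand it to a background worker, and let a
# "please wait" holding page poll for status.
import queue
import threading
import uuid

jobs = {}                  # job_id -> {"status": ..., "result": ...}
work_queue = queue.Queue()

def run_all_bods_checks(path):
    # Placeholder standing in for the existing libcovebods processing
    return {"checked": path}

def worker():
    while True:
        job_id, path = work_queue.get()
        jobs[job_id]["status"] = "processing"
        jobs[job_id] = {"status": "done", "result": run_all_bods_checks(path)}
        work_queue.task_done()

def submit(path):
    """Called by the upload view: enqueue the file and return an ID for the holding page."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    work_queue.put((job_id, path))
    return job_id

def status(job_id):
    """Called by the 'please wait' holding page via polling."""
    return jobs[job_id]["status"]

threading.Thread(target=worker, daemon=True).start()
```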

@odscjames
Collaborator

odscjames commented Oct 14, 2021

I just ran /usr/bin/time -v and got Maximum resident set size (kbytes): 3234076 - about 3GB.

Currently the entire JSON file is loaded in memory in one go. Because of the streaming design of BODS data we should be able to bring in https://pypi.org/project/ijson/ and make a big dent in that.

To attempt to estimate how much, I opened a Python prompt and loaded the JSON file into RAM only. ps aux reports VSZ=1681184 and RSS=1652408, i.e. ~1.6GB.

That seems like a large gap, but we have to keep a lot of state as we process the file, so that's maybe fair enough (e.g. we keep a list of IDs so we can work out where there are missing statements or statements out of order). If that gets to be a problem, we could start storing state in SQLite or something.
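
As a rough illustration of the ijson idea - a sketch, assuming the input is a standard BODS JSON file (a single top-level array of statements) and an illustrative filename:

```python
# Iterate statements one at a time instead of loading the whole array with json.load().
# Per-statement state (like the set of seen statement IDs) still lives in memory.
import ijson

seen_statement_ids = set()

with open("latvia-bods.json", "rb") as f:        # hypothetical filename
    for statement in ijson.items(f, "item"):     # "item" = each element of the top-level array
        statement_id = statement.get("statementID")
        if statement_id is not None:
            seen_statement_ids.add(statement_id)
        # ...run per-statement checks here, then let `statement` be garbage collected
```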

@ScatteredInk
Collaborator

For planning purposes, it may be useful to think about potential dataset sizes here (based on these stats, which are odd-looking but OK for estimation purposes):

  • Latvia is on the smaller end in terms of number of companies (173k).
  • The larger economies in western Europe are an order of magnitude bigger.

So my first thought is that even if we get this down, getting it to the 20 seconds or so that a web request can really take before people start to get impatient is going to be very challenging.

Thanks @odscjames - yes, I think our main objective here is not pushing users with big files away from the validator. I'd be interested in thinking about a "big file" strategy that explicitly tries to avoid validating the whole thing until users are ready. That is, does ijson give us the capability to stream a sample of the data (say, "keep going until you hit 10 ownership-or-control statements, or it seems pretty likely that no ownership-or-control statements are going to turn up") and then validate that sample (or warn the user that it is a bad sample)? This is basically what I used to do (minus any clever logic) when looking at bulk data from the OO register and Companies House using jq.
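
A minimal sketch of that sampling idea, assuming ijson and the standard BODS statementType values (the filename and thresholds are illustrative):

```python
# Stream statements and stop once the sample contains enough ownership-or-control
# statements, or give up after a cap and warn that the sample may be unrepresentative.
import ijson

TARGET_OOC = 10
MAX_STATEMENTS = 10_000

sample, ooc_count = [], 0
with open("bods.json", "rb") as f:               # hypothetical filename
    for statement in ijson.items(f, "item"):
        sample.append(statement)
        if statement.get("statementType") == "ownershipOrControlStatement":
            ooc_count += 1
            if ooc_count >= TARGET_OOC:
                break
        if len(sample) >= MAX_STATEMENTS:
            break

if ooc_count == 0:
    print("Warning: no ownership-or-control statements found - this may be a bad sample")
# `sample` can now be handed to the normal validation checks
```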

@robredpath

@ScatteredInk there are a lot of interesting things here, and this is an issue that's come up in several standards that we work with.

The first speed challenge is around getting the data to the server: uploads from home connections are usually quite slow (it would take me ~3 minutes to upload a 300MB file from home), and even across a 100Mbps connection between servers it would take 24 seconds. Moving compressed files is much quicker, of course, but not all compression algorithms allow for streaming. It's also not entirely straightforward to do operations on files that are still being uploaded - it's possible (although I'm not sure about the DRT specifically!), it just adds some extra complexity.

If the data is updated relatively rarely (e.g. daily, or weekly), one option could be to routinely download the data, as we do for the 360Giving data store, and run a service whereby stats are available on that data, rather than "live". For 360, we get the latest data every night, run the Data Quality Tool (i.e. their DRT), and store the results for internal use. We have a forthcoming tool to allow publishers to see their own data quality stats as well. Obviously, with an overnight run, we can (and do) run computationally expensive processes without worrying about time - which avoids the problem described here.

Self-service is a bit trickier, but with some appropriate controls and/or manual intervention, this could be offered as a service to implementers: register an endpoint with the helpdesk to get regular monitoring? Or, trusted users can register for an account to log in and set their own endpoints?

Depending on the users and their relative technical expertise, providing advice or tooling on "preparing a sample" might be an appropriate thing to do. I'm not a huge fan of moving the burden to the user like that, but it might be a better experience overall. I think that with BODS you should just be able to do a head -n1000 or something and still get valid data, but that may well be a step too far.

Leaving all that aside - there are always possibilities around doing a "first pass" of the data. I think that in the current DRT implementation that would probably slow down the overall response time, although we have done some work around that in our Standards Lab project that we might be able to draw from. Such a "first pass" could do things like:

  • regex match to look for things that have to appear in the file in order for it to have any chance of being valid
  • sample 1% of lines and check that they're actually JSON
  • check that the first ownership-or-control statement (as found by regex matching) is preceded by...everything that it would need to be preceded by

...and then "fail fast" if those conditions aren't met. But that requires a very different approach from the one currently taken in the DRT. Could well be worth it, mind!
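
A rough sketch of what such a fail-fast first pass could look like (illustrative only: the marker strings, byte limit and filename are assumptions, and this is not how the DRT currently works):

```python
# Scan the start of the raw file for strings that must appear somewhere in any
# plausible BODS file before committing to a full validation run.
import re

REQUIRED_MARKERS = [
    rb'"statementID"',
    rb'"statementType"',
    rb'"ownershipOrControlStatement"',
]

def quick_first_pass(path, max_bytes=50_000_000):
    """Return the markers that were NOT found in the first max_bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(max_bytes)
    return [m.decode() for m in REQUIRED_MARKERS if not re.search(m, head)]

missing = quick_first_pass("bods.json")          # hypothetical filename
if missing:
    print("Failing fast - expected markers not found:", missing)
else:
    print("Looks plausibly like BODS - continue to full validation")
```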

@odscjames
Collaborator

So, in this morning's call I said I'd post some things here:

Rob raises a lot of interesting ideas, but I think some of them may cross over into ground the data developer library may cover, and I would suggest some thinking about where functionality belongs would be good.

We should take a piecemeal approach to this - there are many options here and we should break them down and look at them a piece at a time, and see what we could do. With that in mind I said I'd try and estimate 2 things:

Please wait holding screen & work done in background - the current model, where the work is done in the same request, really limits the file size to something that can be processed in the 20 seconds or so before people get impatient. Adding a holding screen and background worker has two advantages:

  • pushes the size of the file up to something that can be processed in a few minutes
  • provides a place to put messaging about large file sizes and other options where people will actually read it when it's relevant to them. (If we just put that message on the home page of the tool, probably a lot of people won't read it.)

Sample mode - as a first step, add a sample mode with a flag on the home page of the tool, so that when you submit data you indicate whether it is full data or a sample. In sample mode, certain tests are not run - there wouldn't be any tests that need a full file to check properly and thus can't be checked against a sample file (e.g. a person or entity statement not used by an ownership/control statement). We can then provide simple instructions for people to make a sample file locally and upload it.
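
Those instructions could be as simple as a short script like this sketch (the filenames and the cut-off of 1,000 statements are illustrative, and it assumes a standard top-level JSON array):

```python
# Write the first 1,000 statements of a large BODS JSON array out to a smaller sample file.
import json
from itertools import islice

import ijson

N = 1000
with open("full-bods.json", "rb") as src, open("sample-bods.json", "w") as dst:  # hypothetical filenames
    sample = list(islice(ijson.items(src, "item"), N))
    # default=str handles the Decimal values ijson uses for JSON numbers
    json.dump(sample, dst, indent=2, default=str)
```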

Jack talks about the idea of a sample mode that keeps working until it sees a certain number of each type of statement - once we have ijson sorted out we could add this to the sample mode as a next step, but until we do I don't think this would gain much. Large files would still have high memory use.

I'll work out estimates for those two.

Later, I would guess that ijson would definitely help memory use and maybe help speed performance. I would guess that SQLite would definitely help memory use but might actually hurt speed performance. We can come back to those later.

@StephenAbbott
Member Author

Given the low level of usage of the BODS Data Review Tool at the moment, and the fact that we can learn a lot by working more closely with implementers, I'm leaning towards @robredpath's suggestions above about providing some clear advice/actions for implementers who try to use the DRT to analyse a large file.

It would be great for us to estimate the time required to make the following changes, @odscjames:

  1. Update the documentation on datareview.openownership.org to recommend using the DRT only for .json files smaller than 100MB. Prompt users to email support@openownership.org if they want help to analyse larger files. Exact language TBC. Would we need to provide different guidance on maximum recommended file sizes across the four accepted formats?

  2. If a user proceeds to upload a .json file larger than 100MB, we update the Loading screen to do one of the following (the decision on which to use will depend on the estimates of the technical work required for each):

  • Time out after a set period of time with messaging to prompt users to email support@openownership.org if they want help to analyse larger files
  • Time out after a set period of time with messaging suggesting that the user analyses a sample of the file rather than the full file. Following discussions earlier today on this 'sample mode', we would need more input from @odscjames about how to implement that

Then, once we have put these changes in place, we can have more of a discussion about long-term aims for BODS DRT development.

@StephenAbbott
Member Author

StephenAbbott commented Nov 9, 2021

Ha! You beat me to posting your comment here by two minutes, @odscjames 😀

@robredpath

@odscjames I just had an idea. Words to strike fear into the heart of any developer.

What about if we made a desktop app (Electron?) which just helps users prepare samples for upload to the DRT? And, maybe, does quick checks to make sure that it's worth their time (the file is actually JSON, contains at least one of each of the statements that it needs to have, etc)?

I don't know anything about BODS users but my observation when delivering training for other data standards has been that the move to the command line is a Big Deal for the majority of users. Something that they can run on the desktop with a graphical interface - even if it doesn't have any of the power of the full app - could be a useful thing to have as a first step.

@kd-ods
Collaborator

kd-ods commented Nov 9, 2021

@StephenAbbott - ok if @odscjames goes ahead and works on #80? While also coming up with an estimate of how long #81 would take?

@StephenAbbott
Member Author

@kd-ods Sure. Will check in with @odscjames over on #80 with next steps

@odscjames
Collaborator

Words to strike fear into the heart of any developer. .... What about if we made a desktop app

It's alright, I can also have ideas :-) If we think that people have large local files, that uploading them is an issue (bandwidth and/or privacy worries) and that use of the CLI is an issue, we could just take libcovebods and make a GUI for it so that they can run all the checks locally. There are various Python GUI libraries, and also libraries that let you package your Python app into a nice installer: https://pypi.org/project/pyinstaller/
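
For illustration, a very rough sketch of that kind of local GUI wrapper using tkinter from the standard library. The `libcovebods` console command and its output format are assumptions here - the real entry point should be checked against the package docs:

```python
# Sketch: a minimal desktop window that lets a user pick a BODS file and runs the
# checks locally by shelling out to an assumed `libcovebods` command.
import subprocess
import tkinter as tk
from tkinter import filedialog, scrolledtext

def run_checks():
    path = filedialog.askopenfilename(filetypes=[("BODS JSON", "*.json"), ("All files", "*.*")])
    if not path:
        return
    # Assumption: the package installs a `libcovebods` command that prints results to stdout
    result = subprocess.run(["libcovebods", path], capture_output=True, text=True)
    output.delete("1.0", tk.END)
    output.insert(tk.END, result.stdout or result.stderr)

root = tk.Tk()
root.title("BODS local checker (sketch)")
tk.Button(root, text="Choose a BODS file and run checks", command=run_checks).pack(padx=10, pady=10)
output = scrolledtext.ScrolledText(root, width=100, height=30)
output.pack(padx=10, pady=10)
root.mainloop()
```

Something like this could then be bundled with pyinstaller into a single installer per platform.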

StephenAbbott added a commit to StephenAbbott/cove-bods that referenced this issue Nov 11, 2021
@StephenAbbott StephenAbbott linked a pull request Nov 11, 2021 that will close this issue
odscjames pushed a commit that referenced this issue Nov 16, 2021
See #69 and #80 for background
odscjames added a commit that referenced this issue Nov 16, 2021
@kd-ods
Collaborator

kd-ods commented May 5, 2023

@odscjames - can I just check: the tool doesn't handle JSON Lines, does it? We say in the docs that 'JSON Lines MAY also be used when creating large files'.

@odscjames
Collaborator

doesn't handle JSON Lines, does it

No, not at the moment. Adding that in itself should not be too much work, but it implies people can upload really big files, and that will take more work - both to process them without using too many server resources and to make sure the UI is still readable/understandable when it's showing 17,000 errors!
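
For reference, a sketch of what JSON Lines ingestion could look like (not the tool's current behaviour - JSON Lines isn't supported yet, and the filename is illustrative):

```python
# One statement per line, so the file never has to be held in memory as a single document.
import json

def iter_jsonl_statements(path):
    """Yield one parsed statement per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # In the real tool this would feed into the error report rather than raise
                raise ValueError(f"Line {line_number} is not valid JSON: {err}") from err

for statement in iter_jsonl_statements("statements.jsonl"):  # hypothetical filename
    pass  # run per-statement checks here
```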

@StephenAbbott
Member Author

Thanks @odscjames. Will be good to check in on this with you and @kd-ods next week.

All the files published via https://bods-data.openownership.org for UK/DK/SK and the Register use JSON Lines, so I would like to understand how much more work it would be to make these changes and keep the UI understandable.

@kd-ods
Collaborator

kd-ods commented May 31, 2023

@odscjames and I have just talked through the final work on #81 and #79. @StephenAbbott - the suggestion is that we finalise and test that work then release it in the next day or two.

Beyond that, the next two tasks to take on (timing of work to be discussed) are:

  • Enable ingesting of JSON lines files (about 2 days' work)
  • Improve test processing and reporting for very large files

We haven't yet added tickets for those two tasks.

@StephenAbbott
Member Author

Thanks @kd-ods - and welcome back @odscjames! Happy for you to release this work if you're happy with it, and it will be good to discuss when we can book in the JSON Lines / large files work once tickets are raised for those.

@odscjames
Collaborator

Improve test processing and reporting for very large files

To be clear, that's for both JSON Lines and normal JSON - the work I have in mind will cover both. I can comment more on the ticket.

@kd-ods
Collaborator

kd-ods commented May 31, 2023

See links above ^ @odscjames. New tickets added.

@StephenAbbott
Member Author

Closing this issue following completion of work listed in #97
