
Update BODS Data Review Tool to handle larger data analysis #69

Closed
StephenAbbott opened this issue Aug 17, 2021 · 23 comments

@StephenAbbott
Member

During the process of checking Latvia's initial release of beneficial ownership data from January 2021, we attempted to use the DRT to analyse how closely their data complies with BODS. But the file could not be analysed as it is too big.

No file size limit is explicitly given in the notes on this page, so the documentation should be updated to include an explanation of the limit.

@StephenAbbott StephenAbbott changed the title Update Data Review Tool user guidance to include explanation about max size of files which can be analysed Update BODS Data Review Tool to handle larger data analysis Oct 5, 2021
@StephenAbbott
Member Author

Following discussion with @kd-ods and @ScatteredInk, I'm updating this card so that it covers not just better error messaging around the maximum file size, but also documenting the maximum data file size allowed and seeing how we can increase it, to allow analysis and review of the larger BODS datasets which are starting to be released.

Kadie/Jack: you referenced similar work done recently to expand the capacity of IATI's validator. Is this the version 2 work referenced at https://github.com/IATI/IATI-Validator-Web/releases and https://github.com/IATI/js-validator-api/releases? I'd appreciate it if you could add any useful repo links so we can flesh out this issue.

@StephenAbbott
Member Author

@odscjames Flagging this issue

@odscjames
Collaborator

@StephenAbbott Can you be exact about what you tried when you got the error?

@StephenAbbott
Member Author

@odscjames

  1. Downloaded the Latvian BODS zipped dataset from data.gov.lv. It's 31.4MB zipped
  2. Unzipped the file and ended up with a 316.2MB BODS JSON file
  3. Went to the Data Review Tool and chose that file. Clicked Submit
  4. Then, after hanging on the Loading... screen for a long time, I got the following error message, which seems to be caused by the file size being too large (although the error messaging doesn't identify this as the reason for the 503 error)

[Screenshot of the 503 error page shown by the Data Review Tool]

@odscjames
Collaborator

I ran this on my laptop (if you download https://pypi.org/project/libcovebods/ there is a CLI tool included). It took 18 minutes. I have the results file as JSON.

So my first thought is that even if we get this down, getting it to the 20 seconds or so that a web request can really take before people start to get impatient is going to be very challenging.

A first take on this is maybe to put in a message queue, a background worker and a holding screen saying "please wait, your file is being processed - don't leave this page".

18 minutes is still a long time to stare at a web page - ideally we would then work to get it down further, or at least provide some kind of progress count. But at least it's better than a timeout and a crash.

In terms of documenting a limit, it's hard to be exact about what the limit should be. If we create a holding page for large files, we can put messages there, and then people will see them only when relevant.
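
For illustration, here's a framework-neutral sketch of the message queue / background worker / holding screen pattern described above. The real DRT would use its own web framework and a proper queue rather than in-process threads, and run_all_bods_checks is a placeholder, not the tool's real API:

```python
# Sketch only: accept the upload, hand it to a background worker, and let a
# "please wait" holding page poll for status.
import queue
import threading
import uuid

jobs = {}                  # job_id -> {"status": ..., "result": ...}
work_queue = queue.Queue()

def run_all_bods_checks(path):
    # Placeholder standing in for the existing libcovebods processing
    return {"checked": path}

def worker():
    while True:
        job_id, path = work_queue.get()
        jobs[job_id]["status"] = "processing"
        jobs[job_id] = {"status": "done", "result": run_all_bods_checks(path)}
        work_queue.task_done()

def submit(path):
    """Called by the upload view: enqueue the file and return an ID for the holding page."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    work_queue.put((job_id, path))
    return job_id

def status(job_id):
    """Called by the 'please wait' holding page via polling."""
    return jobs[job_id]["status"]

threading.Thread(target=worker, daemon=True).start()
```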

@odscjames
Collaborator

odscjames commented Oct 14, 2021

I just ran /usr/bin/time -v and got Maximum resident set size (kbytes): 3234076 - about 3GB.

Currently the entire JSON file is loaded in memory in one go. Because of the streaming design of BODS data we should be able to bring in https://pypi.org/project/ijson/ and make a big dent in that.

To attempt to estimate how much, I opened a Python prompt and loaded the JSON file into RAM only. ps aux reports VSZ=1681184 and RSS=1652408, i.e. ~1.6GB.

That seems like a large gap, but we have to keep a lot of state as we process the file, so that's maybe fair enough (e.g. we keep a list of IDs so we can work out where there are missing statements or statements out of order). If that gets to be a problem, we could start storing state in SQLite or something.
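
As a rough illustration of the ijson idea - a sketch, assuming the input is a standard BODS JSON file (a single top-level array of statements) and an illustrative filename:

```python
# Iterate statements one at a time instead of loading the whole array with json.load().
# Per-statement state (like the set of seen statement IDs) still lives in memory.
import ijson

seen_statement_ids = set()

with open("latvia-bods.json", "rb") as f:        # hypothetical filename
    for statement in ijson.items(f, "item"):     # "item" = each element of the top-level array
        statement_id = statement.get("statementID")
        if statement_id is not None:
            seen_statement_ids.add(statement_id)
        # ...run per-statement checks here, then let `statement` be garbage collected
```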

@ScatteredInk
Collaborator

For planning purposes, it may be useful to think about potential dataset sizes here (based on these stats, which are odd-looking but OK for estimation purposes):

  • Latvia is on the smaller end in terms of number of companies (173k).
  • The larger economies in western Europe are an order of magnitude bigger.

So my first thought is that even if we get this down, getting it to the 20 seconds or so that a web request can really take before people start to get impatient is going to be very challenging.

Thanks @odscjames - yes, I think our main objective here is not pushing users with big files away from the validator. I'd be interested in thinking about a "big file" strategy that explicitly tries to avoid validating the whole thing until users are ready. That is, does ijson give us the capability to stream a sample of the data (say, "keep going until you hit 10 ownership-or-control statements, or it seems pretty likely that no ownership-or-control statements are going to turn up") and then validate that sample (or warn the user that it is a bad sample)? This is basically what I used to do (minus any clever logic) when looking at bulk data from the OO register and Companies House using jq.
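
A minimal sketch of that sampling idea, assuming ijson and the standard BODS statementType values (the filename and thresholds are illustrative):

```python
# Stream statements and stop once the sample contains enough ownership-or-control
# statements, or give up after a cap and warn that the sample may be unrepresentative.
import ijson

TARGET_OOC = 10
MAX_STATEMENTS = 10_000

sample, ooc_count = [], 0
with open("bods.json", "rb") as f:               # hypothetical filename
    for statement in ijson.items(f, "item"):
        sample.append(statement)
        if statement.get("statementType") == "ownershipOrControlStatement":
            ooc_count += 1
            if ooc_count >= TARGET_OOC:
                break
        if len(sample) >= MAX_STATEMENTS:
            break

if ooc_count == 0:
    print("Warning: no ownership-or-control statements found - this may be a bad sample")
# `sample` can now be handed to the normal validation checks
```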

@robredpath

@ScatteredInk there are a lot of interesting things here, and this is an issue that's come up in several standards that we work with.

The first speed challenge is around getting the data to the server: uploads from home connections are usually quite slow (it would take me ~3 minutes to upload a 300MB file from home), and even across a 100Mbps connection between servers it would take 24 seconds. Moving compressed files is much quicker, of course, but not all compression algorithms allow for streaming. It's also not entirely straightforward to do operations on files that are still being uploaded - it's possible (although I'm not sure about the DRT specifically!), it just adds some extra complexity.

If the data is updated relatively rarely (e.g. daily, or weekly), one option could be to routinely download the data, as we do for the 360Giving data store, and run a service whereby stats are available on that data, rather than "live". For 360, we get the latest data every night, run the Data Quality Tool (i.e. their DRT), and store the results for internal use. We have a forthcoming tool to allow publishers to see their own data quality stats as well. Obviously, with an overnight run, we can (and do) run computationally expensive processes without worrying about time - which avoids the problem described here.

Self-service is a bit trickier, but with some appropriate controls and/or manual intervention, this could be offered as a service to implementers: register an endpoint with the helpdesk to get regular monitoring? Or, trusted users can register for an account to log in and set their own endpoints?

Depending on the users and their relative technical expertise, providing advice or tooling on "preparing a sample" might be an appropriate thing to do. I'm not a huge fan of moving the burden to the user like that, but it might be a better experience overall. I think that with BODS you should just be able to do a head -n1000 or something and still get valid data, but that may well be a step too far.

Leaving all that aside - there are always possibilities around doing a "first pass" of the data. I think that in the current DRT implementation that would probably slow down the overall response time, although we have done some work around that in our Standards Lab project that we might be able to draw from. Such a "first pass" could do things like:

  • regex match to look for things that have to appear in the file in order for it to have any chance of being valid
  • sample 1% of lines and check that they're actually JSON
  • check that the first ownership-or-control statement (as found by regex matching) is preceded by...everything that it would need to be preceded by

...and then "fail fast" if those conditions aren't met. But that requires a very different approach from the one currently taken in the DRT. Could well be worth it, mind!
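
A rough sketch of what such a fail-fast first pass could look like (illustrative only: the marker strings, byte limit and filename are assumptions, and this is not how the DRT currently works):

```python
# Scan the start of the raw file for strings that must appear somewhere in any
# plausible BODS file before committing to a full validation run.
import re

REQUIRED_MARKERS = [
    rb'"statementID"',
    rb'"statementType"',
    rb'"ownershipOrControlStatement"',
]

def quick_first_pass(path, max_bytes=50_000_000):
    """Return the markers that were NOT found in the first max_bytes of the file."""
    with open(path, "rb") as f:
        head = f.read(max_bytes)
    return [m.decode() for m in REQUIRED_MARKERS if not re.search(m, head)]

missing = quick_first_pass("bods.json")          # hypothetical filename
if missing:
    print("Failing fast - expected markers not found:", missing)
else:
    print("Looks plausibly like BODS - continue to full validation")
```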

@odscjames
Collaborator

So, in this morning's call I said I'd post some things here:

Rob raises a lot of interesting ideas, but I think some of them may cross over into ground the data developer library may cover, and I would suggest some thinking about where functionality belongs would be good.

We should take a piecemeal approach to this - there are many options here and we should break them down and look at them a piece at a time, and see what we could do. With that in mind I said I'd try and estimate 2 things:

Please wait holding screen & work done in background - the current model, where the work is done in the same request, really limits the file size to something that can be processed in the 20 seconds or so before people get impatient. Adding a holding screen and background worker has two advantages:

  • pushes the size of the file up to something that can be processed in a few minutes
  • provides a place to put messaging about large file sizes and other options where people will actually read it when it's relevant to them. (If we just put that message on the home page of the tool, probably a lot of people won't read it.)

Sample mode - as a first step, add a sample mode with a flag on the home page of the tool, so that when you submit data you indicate whether it is full data or a sample. In sample mode, certain tests are not run - there wouldn't be any tests that need a full file to check properly and thus can't be checked against a sample file (e.g. a person or entity statement not used by an ownership/control statement). We can then provide simple instructions for people to make a sample file locally and upload it.
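
Those instructions could be as simple as a short script like this sketch (the filenames and the cut-off of 1,000 statements are illustrative, and it assumes a standard top-level JSON array):

```python
# Write the first 1,000 statements of a large BODS JSON array out to a smaller sample file.
import json
from itertools import islice

import ijson

N = 1000
with open("full-bods.json", "rb") as src, open("sample-bods.json", "w") as dst:  # hypothetical filenames
    sample = list(islice(ijson.items(src, "item"), N))
    # default=str handles the Decimal values ijson uses for JSON numbers
    json.dump(sample, dst, indent=2, default=str)
```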

Jack talks about the idea of a sample mode that keeps working until it sees a certain number of each type of statement - once we have ijson sorted out we could add this to the sample mode as a next step, but until we do I don't think this would gain much. Large files would still have high memory use.

I'll work out estimates for those two.

Later, I would guess that ijson would definitely help memory use and maybe help speed performance. I would guess that SQLite would definitely help memory use but might actually hurt speed performance. We can come back to those later.

@StephenAbbott
Member Author

Given the low level of usage of the BODS Data Review Tool at the moment, and the fact that we can learn a lot by working more closely with implementers, I'm leaning towards @robredpath's suggestions above about providing some clear advice/actions for implementers who try to use the DRT to analyse a large file.

It would be great for us to estimate the time required to make the following changes, @odscjames:

  1. Update the documentation on datareview.openownership.org to recommend using the DRT only for .json files smaller than 100MB. Prompt users to email support@openownership.org if they want help to analyse larger files. Exact language TBC. Would we need to provide different guidance on maximum recommended file sizes across the four accepted formats?

  2. If a user proceeds to upload a .json file larger than 100MB, we update the Loading screen to do one of the following (the decision on which to use will depend on the estimates of the technical work required for each):

  • Time out after a set period of time with messaging to prompt users to email support@openownership.org if they want help to analyse larger files
  • Time out after a set period of time with messaging suggesting that the user analyses a sample of the file rather than the full file. Following discussions earlier today on this 'sample mode', we would need more input from @odscjames about how to implement that

Then, once we have put these changes in place, we can have more of a discussion about long-term aims for BODS DRT development.

@StephenAbbott
Member Author

StephenAbbott commented Nov 9, 2021

Ha! You beat me to posting your comment here by two minutes, @odscjames 😀

@robredpath

@odscjames I just had an idea. Words to strike fear into the heart of any developer.

What about if we made a desktop app (Electron?) which just helps users prepare samples for upload to the DRT? And, maybe, does quick checks to make sure that it's worth their time (the file is actually JSON, contains at least one of each of the statements that it needs to have, etc)?

I don't know anything about BODS users but my observation when delivering training for other data standards has been that the move to the command line is a Big Deal for the majority of users. Something that they can run on the desktop with a graphical interface - even if it doesn't have any of the power of the full app - could be a useful thing to have as a first step.

@kd-ods
Collaborator

kd-ods commented Nov 9, 2021

@StephenAbbott - ok if @odscjames goes ahead and works on #80? While also coming up with an estimate of how long #81 would take?

@StephenAbbott
Member Author

@kd-ods Sure. Will check in with @odscjames over on #80 with next steps

@odscjames
Collaborator

Words to strike fear into the heart of any developer. .... What about if we made a desktop app

It's alright, I can also have ideas :-) If we think that people have large local files, that uploading them is an issue (bandwidth and/or privacy worries) and that use of the CLI is an issue, we could just take libcovebods and make a GUI for it so that they can run all the checks locally. There are various Python GUI libraries, and also libraries that let you package your Python app into a nice installer: https://pypi.org/project/pyinstaller/
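
For illustration, a very rough sketch of that kind of local GUI wrapper using tkinter from the standard library. The `libcovebods` console command and its output format are assumptions here - the real entry point should be checked against the package docs:

```python
# Sketch: a minimal desktop window that lets a user pick a BODS file and runs the
# checks locally by shelling out to an assumed `libcovebods` command.
import subprocess
import tkinter as tk
from tkinter import filedialog, scrolledtext

def run_checks():
    path = filedialog.askopenfilename(filetypes=[("BODS JSON", "*.json"), ("All files", "*.*")])
    if not path:
        return
    # Assumption: the package installs a `libcovebods` command that prints results to stdout
    result = subprocess.run(["libcovebods", path], capture_output=True, text=True)
    output.delete("1.0", tk.END)
    output.insert(tk.END, result.stdout or result.stderr)

root = tk.Tk()
root.title("BODS local checker (sketch)")
tk.Button(root, text="Choose a BODS file and run checks", command=run_checks).pack(padx=10, pady=10)
output = scrolledtext.ScrolledText(root, width=100, height=30)
output.pack(padx=10, pady=10)
root.mainloop()
```

Something like this could then be bundled with pyinstaller into a single installer per platform.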

StephenAbbott added a commit to StephenAbbott/cove-bods that referenced this issue Nov 11, 2021
@StephenAbbott StephenAbbott linked a pull request Nov 11, 2021 that will close this issue
odscjames pushed a commit that referenced this issue Nov 16, 2021
See #69 and #80 for background
odscjames added a commit that referenced this issue Nov 16, 2021
@kd-ods
Collaborator

kd-ods commented May 5, 2023

@odscjames - can I just check: the tool doesn't handle JSON Lines, does it? We say in the docs that 'JSON Lines MAY also be used when creating large files'.

@odscjames
Collaborator

doesn't handle JSON Lines, does it

No, not at the moment. Adding that in itself should not be too much work, but it implies people can upload really big files, and that will take more work - both to process them without using too many server resources and to make sure the UI is still readable/understandable when it's showing 17,000 errors!
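
For reference, a sketch of what JSON Lines ingestion could look like (not the tool's current behaviour - JSON Lines isn't supported yet, and the filename is illustrative):

```python
# One statement per line, so the file never has to be held in memory as a single document.
import json

def iter_jsonl_statements(path):
    """Yield one parsed statement per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError as err:
                # In the real tool this would feed into the error report rather than raise
                raise ValueError(f"Line {line_number} is not valid JSON: {err}") from err

for statement in iter_jsonl_statements("statements.jsonl"):  # hypothetical filename
    pass  # run per-statement checks here
```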

@StephenAbbott
Member Author

Thanks @odscjames. Will be good to check in on this with you and @kd-ods next week.

All the files published via https://bods-data.openownership.org for UK/DK/SK and the Register use JSON Lines, so I would like to understand how much more work it would be to make these changes and keep the UI understandable.

@kd-ods
Collaborator

kd-ods commented May 31, 2023

@odscjames and I have just talked through the final work on #81 and #79. @StephenAbbott - the suggestion is that we finalise and test that work then release it in the next day or two.

Beyond that, the next two tasks to take on (timing of work to be discussed) are:

  • Enable ingesting of JSON lines files (about 2 days' work)
  • Improve test processing and reporting for very large files

We haven't yet added tickets for those two tasks.

@StephenAbbott
Member Author

Thanks @kd-ods - and welcome back @odscjames! Happy for you to release this work if you're happy with it, and it will be good to discuss when we can book in the JSON Lines / large files work once tickets are raised for those.

@odscjames
Collaborator

Improve test processing and reporting for very large files

To be clear, that's for both JSON Lines and normal JSON - the work I have in mind will cover both. I can comment more on the ticket.

@kd-ods
Collaborator

kd-ods commented May 31, 2023

See links above ^ @odscjames. New tickets added.

@StephenAbbott
Member Author

Closing this issue following completion of work listed in #97
