
reprocess.py generates huge sqlite3 temp files #1

Closed
jarvisms opened this Issue Oct 9, 2018 · 1 comment

Owner

jarvisms commented Oct 9, 2018

When reprocess.py is run on the sqlite3 data store, very large temporary "WAL" files are created if the dataset is large enough (mine is around 8 years), particularly while calib data is being processed. This appears to be because the underlying process.py effectively does a bulk read on RawStore while writing one record at a time to CalibStore. Many small write transactions create a far larger write-ahead log than a few big ones, and those writes cannot be consolidated and fully committed to the main file (and checkpointed out of the WAL file) while a read lock is open. This is not an issue on the filedata data store as there is a dict-like cache in the middle.
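
For illustration, a minimal standalone sketch of the failure mode. The file name and table schema here are toy assumptions, not the real RawStore/CalibStore layout:

```python
import sqlite3

DB = "weather.db"  # hypothetical file, not the real data store

# Toy schema; the real store layout differs.
setup = sqlite3.connect(DB)
setup.execute("PRAGMA journal_mode=WAL")
setup.execute("CREATE TABLE IF NOT EXISTS calib (idx TEXT, temp REAL)")
setup.close()

# A reader holding a read transaction open, like the bulk read on
# RawStore. While this snapshot exists the WAL file cannot be reset.
reader = sqlite3.connect(DB, isolation_level=None)
reader.execute("BEGIN")
reader.execute("SELECT COUNT(*) FROM calib").fetchone()

# A writer committing one record at a time: every tiny transaction
# appends frames to weather.db-wal, and none of them can be
# checkpointed away, so the WAL file grows with every row.
writer = sqlite3.connect(DB)
for i in range(100000):
    writer.execute("INSERT INTO calib VALUES (?, ?)", (str(i), 20.0))
    writer.commit()
```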

Options may be:

  1. Split the data store into separate raw, calib, hourly, daily and monthly files. One objective of this project was to consolidate everything into a single file with multiple tables, so I would rather not do this.
  2. Consolidate connections to be module-wide, which would necessitate thread locks and may give rise to other issues.
  3. Introduce a cache similar to the filedata system's (in addition to sqlite3's built-in caching), which would break the database's ACID properties (which I'd rather not break).
  4. Refactor process.py to use the update() method (see the sketch after this list).
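
A minimal sketch of what such an update() might look like on the sqlite3 store. The connection attribute and table schema are assumptions, not the actual sqlite3data.py API:

```python
def update(self, records):
    """Write an iterable of records in one transaction instead of
    committing once per record."""
    # The connection context manager wraps everything in a single
    # transaction: one commit, one consolidated write-ahead-log burst.
    with self._connection:
        self._connection.executemany(
            "INSERT OR REPLACE INTO calib VALUES (?, ?, ?)",
            records,
        )
```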

Comments welcome.

@jarvisms jarvisms self-assigned this Oct 9, 2018

@jarvisms jarvisms added enhancement and removed help wanted labels Oct 9, 2018

jarvisms added a commit that referenced this issue Oct 13, 2018

Modification to process.py to resolve Issue #1
Wrapped the main processing loops into nested generator functions, allowing the use of .update() so data writes are bulk committed with less overhead and a smaller write-ahead-log payload on sqlite3.

Signed-off-by: Mark Jarvis <jarvism@thisaddressdoesnotexist.co.uk>

Owner

jarvisms commented Oct 13, 2018

I didn't like the idea of options 1 or 3, so I didn't pursue them.

I experimented with option 2 and it made no difference. It transpires that shared cache mode is effectively the same thing as consolidating all connections into one, since sqlite3 serializes everything internally anyway in this mode (and will always do so faster at the C level than at the Python level). Shared cache mode is a little slower than the normal "one cache per connection" mode in truly multithreaded environments, but since CPython's GIL doesn't give us real parallelism (even with threading), any performance boost would come only from the sheer size of the shared cache; that case is uncommon, Python-level locks get in the way, and the implementation is complicated. In summary, it didn't fix the issue, and the other benefits simply weren't worth it.
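
For reference, this is the experiment in miniature: Python's sqlite3 module enables shared cache mode via a URI filename (the file name here is hypothetical):

```python
import sqlite3

# Every connection opened with cache=shared uses one common page
# cache, serialized internally by SQLite at the C level.
conn_a = sqlite3.connect("file:weather.db?cache=shared", uri=True)
conn_b = sqlite3.connect("file:weather.db?cache=shared", uri=True)
```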

In the end I went with option 4 and refactored process.py by wrapping the main processing loops to convert them into nested generator functions (so as much of the original code is preserved as possible), which are fed to the new .update() method I introduced into sqlite3data.py and also backported to filedata.py (so it works on the old system as well). This allows one transaction of multiple records (via sqlite3's executemany() method) to be committed once, instead of incurring the processing overhead and cumulative logging of many individual record-by-record transactions. This eliminated the massive file problem (runaway cumulative logging before locks are released) and made reprocessing from scratch several orders of magnitude faster (no per-record transaction overhead). It also works just fine on Python 2.7 with the old file-based storage via reprocess.py, and my live weather station on Python 3.7 has been running it for a while as a test without issue.
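
In outline, the refactor looks something like this. The function and method names other than update() are illustrative, and the real processing loops do far more per record:

```python
def calibrate_all(raw_store, calib_store, calibrator):
    """Reprocess every raw record into the calib store in one bulk write."""
    def gen():
        # The original record-by-record loop survives intact; it just
        # yields each calibrated record instead of writing it directly.
        for raw in raw_store:
            yield calibrator.calib(raw)

    # update() drains the generator inside a single transaction,
    # via executemany() on the sqlite3 store.
    calib_store.update(gen())
```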

So I believe this issue is resolved by commit 215e5f3.

@jarvisms jarvisms closed this Oct 13, 2018
