
"btree" module for simple persistent database #2188

Open
pfalcon opened this issue Jun 15, 2016 · 11 comments

Comments

@pfalcon
Contributor

commented Jun 15, 2016

Initial implementation landed in 8072162 .

This ticket is for discussing aspects of it which are worth discussing.

@pfalcon

Contributor Author

commented Jun 15, 2016

This is based on Berkeley DB 1.85's btree implementation. Note that I personally don't plan to bring the other backends which Berkeley DB supports (e.g. the hash backend) into MicroPython. Instead, the idea is to have a single, small (20K of code on x86_64), well-known (== approximately bug-free, though it's already clear that bdb 1.xx code quality is rather mediocre by today's standards), and most featureful backend. "Most featureful" here means sorted storage with efficient range scans. These features are enough to implement more advanced, SQL-like functionality on top of this module.

@pfalcon

Contributor Author

commented Jun 15, 2016

In the current master, the module can be built for unix with `make MICROPY_PY_BTREE=1`.

@pfalcon

Contributor Author

commented Jun 16, 2016

Range iteration landed in master. I.e., besides the dict protocol's .items(), it's possible to call btree.items("bar", "foo", flags), which will (efficiently, in sorted order) enumerate all items whose keys are in the range from "bar" to "foo".

And here's something to bikeshed: how to specify enumeration in reverse order via the flags param above. Python's natural way would be to call it "reverse", e.g. abbreviated as btree.REV (a module-level constant). But if we speak of databases, SQL's DESC comes to mind. Currently, I'm going to use btree.DESC.
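The range-scan semantics above can be sketched in plain CPython. This is a toy stand-in, not the real btree module: the real module keeps keys sorted on disk, so items(start, end) is an efficient seek plus a sequential walk rather than the full sort done here, and the DESC flag value and the inclusive end bound are assumptions made for illustration.

```python
class ToyBtree:
    """Toy in-memory stand-in for the sorted-range-scan behaviour."""

    DESC = 1  # hypothetical flag constant, named per the comment above

    def __init__(self):
        self._d = {}

    def __setitem__(self, key, value):
        self._d[key] = value

    def items(self, start=None, end=None, flags=0):
        # Enumerate items with keys in [start, end], in key order;
        # reverse the order when the DESC flag is set.
        keys = sorted(self._d)
        if start is not None:
            keys = [k for k in keys if k >= start]
        if end is not None:
            keys = [k for k in keys if k <= end]
        if flags & self.DESC:
            keys = list(reversed(keys))
        return [(k, self._d[k]) for k in keys]

db = ToyBtree()
for k in (b"apple", b"bar", b"baz", b"foo", b"zoo"):
    db[k] = b"v-" + k

print([k for k, _ in db.items(b"bar", b"foo")])                # [b'bar', b'baz', b'foo']
print([k for k, _ in db.items(b"bar", b"foo", ToyBtree.DESC)])  # [b'foo', b'baz', b'bar']
```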

@dpgeorge

Member

commented Jun 25, 2016

I can build the unix port (x86-64) and the test passes. But it doesn't build for stmhal. The issue seems to be that the Berkeley library includes <sys/queue.h> (e.g. from mpool.h), a file which is provided by the library (in PORT/include) but which instead gets fetched from the local system directory (e.g. /usr/include/sys/queue.h). It just so happens that for the unix port this system file provides exactly the correct definitions (I guess!), but for the stmhal port it does not.
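One hedged way to sketch a fix for this kind of include-shadowing problem is to put the library's bundled headers ahead of the system search path. The variable names and paths below are illustrative assumptions, not the actual MicroPython Makefile contents:

```make
# Illustrative sketch only: search the library's bundled PORT/include
# before the system include path, so <sys/queue.h> resolves to the copy
# that ships with Berkeley DB rather than /usr/include/sys/queue.h.
# BTREE_DIR is a hypothetical variable pointing at the berkeley-db tree.
CFLAGS += -I$(BTREE_DIR)/PORT/include
```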

@dpgeorge

Member

commented Jun 25, 2016

I would test the build on esp8266, but for some reason make MICROPY_PY_BTREE=1 in the esp8266 dir does not work; it's as though the command-line option is ignored...

@pfalcon

Contributor Author

commented Jun 26, 2016

I didn't even try building it for a bare-metal port, and it will certainly require quite a bit of porting effort (BerkeleyDB assumes POSIX I/O; we'll need to change that to use the stream protocol, etc.).

@pfalcon

Contributor Author

commented Jul 3, 2016

btree is now enabled by default in the unix port (the API status is provisional and some methods, e.g. open(), will change).

@pfalcon

Contributor Author

commented Aug 1, 2016

The berkeleydb fork we use was further refactored to enable interfacing with 3rd-party file object implementations via a vtable. MicroPython was switched to it, and the test was switched to work against io.BytesIO to show its capabilities. The "btree" module is also enabled for esp8266 and passes this test (against io.BytesIO). So, basically, it's ready for use, or more exactly, for heavier testing. The interface remains provisional too.
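The kind of stream interface the vtable abstracts over can be sketched in CPython: page-granular reads and writes against any seekable file-like object, with io.BytesIO as the in-memory backend as in the test. The class and method names here are illustrative assumptions; the actual C vtable entries may differ.

```python
import io

class PageStore:
    """Illustrative page-granular I/O over any seekable stream."""

    def __init__(self, stream):
        self.stream = stream  # any seekable file-like object

    def read_page(self, pageno, pagesize):
        # Seek to the page boundary and read one full page.
        self.stream.seek(pageno * pagesize)
        return self.stream.read(pagesize)

    def write_page(self, pageno, pagesize, data):
        # Pad short data to a full page so pages stay aligned.
        assert len(data) <= pagesize
        self.stream.seek(pageno * pagesize)
        self.stream.write(data.ljust(pagesize, b"\x00"))
        self.stream.flush()

store = PageStore(io.BytesIO())   # in-memory backend, like the test
store.write_page(0, 16, b"hello")
print(store.read_page(0, 16))     # b"hello" padded with NULs to 16 bytes
```

The point of the design is that nothing in the page layer knows whether the stream is a POSIX file, a BytesIO buffer, or a flash-backed filesystem object.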

@pfalcon

Contributor Author

commented Aug 4, 2016

Well, with "almost everything" done, I proceeded with actual, more or less heavy integration testing (like inserting 1K+ binary fuzzed records), and well, it faults. Venerable software of the 1980s/90s, it faults as easily and as often as Netscape Navigator, which was once its host. It faults in bt_split, which is the most complex operation. Defining some debug options (clearing memory to a known value) makes it fail less, but it still fails on a 100K+ entries test. The problem is definitely related to overflow pages (i.e. storing data items which are too long). Not storing such long data seems to be OK, so apparently a check for such items is missing somewhere. I played with it for 2 nights, hence the findings above, but going beyond that would take much more effort. So, I guess that's it: we can ship it, warning users that there are limitations/known issues, and then investigate them as reports come in.

@dpgeorge

Member

commented Aug 10, 2016

I managed to insert 1 million records, each with a key of around 6 bytes and data of around 10 bytes. It seemed to work (the data read back correctly). (I was on unix, using a real file, not BytesIO.)

@pfalcon

Contributor Author

commented Aug 10, 2016

Yup, that's not a problem. Problems start when you insert data longer than half of a DB page. That was alleviated somewhat by setting the default page size to 4096 (which both makes sense and is good for esp8266, though I plan to re-implement FS block size autodetection in the future).
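The half-page rule of thumb from the comment can be made concrete: an item whose data doesn't fit in half a page has to go to overflow pages, which is the code path where the crashes were seen. The threshold below is an illustrative assumption taken from the comment, not the exact BerkeleyDB formula (which also accounts for the key and per-item headers).

```python
PAGESIZE = 4096  # default page size mentioned above

def uses_overflow_pages(data, pagesize=PAGESIZE):
    """Rough predicate: does this data item need overflow pages?

    Illustrative half-page threshold only; the real check in the
    BerkeleyDB code also involves key length and page header overhead.
    """
    return len(data) > pagesize // 2

print(uses_overflow_pages(b"x" * 10))    # False: ordinary in-page item
print(uses_overflow_pages(b"x" * 3000))  # True: hits the overflow-page path
```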

@pfalcon pfalcon added the triaged label Aug 25, 2016
