
fix file descriptor leak #34
Merged: 4 commits, Apr 19, 2021

Conversation

lucaong (Owner) commented Apr 18, 2021

When cleaning up old files, the corresponding Store should also be
stopped, releasing its file descriptor. This step was missing, leading
to a file descriptor leak that would manifest after many compactions.

Now the CubDB process keeps a list of old Btrees, and stops them
(closing the corresponding Store) upon cleanup, after ensuring that those
Btrees and their files are not in use anymore by any reader.

Resolves #30
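As a sketch of the mechanism described above (the function and field names here are illustrative assumptions, not the actual CubDB internals):

```elixir
# Illustrative sketch only: field and function names are assumptions.
# The db process keeps the Btrees of compacted-away files and stops them
# during cleanup, once no reader references them, releasing their fds.
defp clean_up_old_btrees(%{old_btrees: old_btrees, readers: readers} = state) do
  {in_use, unused} =
    Enum.split_with(old_btrees, fn btree ->
      Enum.any?(readers, fn reader -> reader.btree == btree end)
    end)

  for btree <- unused do
    # Stopping the Btree also stops its Store, closing the file descriptor
    :ok = Btree.stop(btree)
    :ok = File.rm(btree.file_path)
  end

  %{state | old_btrees: in_use}
end
```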

meyercm (Contributor) commented Apr 18, 2021

I'm on mobile for now, but I'll try this branch as soon as I get home. I saw you added a test that checks your new code, but it's probably worth adding a failing test based on the old code. A workflow I really like:

  1. Find a new bug
  2. Write a test that fails in the current codebase that I think is tied to the bug
  3. Write code to make the test pass
  4. Check to see if the bug is resolved

Side note: I'd also recommend adding at least one stress test to the suite, with lots of simultaneous readers and writers, manual and automatic compactions, etc.

lucaong (Owner, Author) commented Apr 18, 2021

I definitely agree that a test that reproduces the issue is necessary, but the few unit tests that I wrote in this PR should act as such, preventing regressions:

  • The test added in cubed_test.exs is the real regression test that reproduces the issue: it tests that the Btrees that were "leaked" before are now stopped. It does so using the new Btree.alive?/1 function though, so that needs to be tested too.
  • The test added in btree_test.exs in turn tests that Btree.alive?/1 delegates to Store.open?/1, checking if the store was closed or not
  • The tests added to file_test.exs and test_store_test.exs verify that the respective implementations of Store.open?/1 are correct

What I like about this setup is that the details of how a Store is decommissioned are not leaked to the CubDB process. The new tests also keep the test coverage at a respectable 99%.

That said, after your comment I did add an assertion that directly uses Process.alive? (not relying on new code) to check whether the internal Agent of the Store is leaked.

Regarding the load tests, I absolutely agree. Some already exist in the form of property-based tests (you can run them with mix test --include property_based). I also occasionally run load tests on a physical device (a Raspberry Pi) to verify that a sudden loss of power does not corrupt the database or break atomicity, which is one of the main project goals. What was missing, and what I will set up in the future, is a load test exercising CubDB in a long-lived application with frequent compaction. I do have long-lived production applications using CubDB, but since their databases contain many entries, compactions are bigger and less frequent, so the file descriptor limit is hit much more rarely, usually after months, with the supervision tree merely restarting the process and cleaning up. That is why it went unnoticed until you filed the issue.

Again, thanks a lot for your help on this issue!

meyercm (Contributor) commented Apr 18, 2021

Really happy to help, and really glad to have this database to work with. Thanks for sharing your work.

Unfortunately, it looks like there is still a problem on this branch:
after running these commands:

iex(1)> CubDB.start_link(name: :db, data_dir: "/tmp/testtest")
{:ok, #PID<0.326.0>}
iex(2)> CubDB.compact(:db)
:ok
iex(3)> :os.getpid
'5001'
iex(4)> CubDB.compact(:db)
:ok

output from ls -l /proc/5001/fd:

...skipping...
lrwx------ 1 user user 64 Apr 18 15:24 19 -> /tmp/testtest/2.cub
lrwx------ 1 user user 64 Apr 18 15:23 2 -> /dev/pts/0
lrwx------ 1 user user 64 Apr 18 15:23 20 -> '/tmp/testtest/1.cub (deleted)'
lrwx------ 1 user user 64 Apr 18 15:24 22 -> /tmp/testtest/2.cub
...skipping...

Not only is there still a file descriptor for a deleted db version, but there are 2 file descriptors open for the same version. I have a really hard time following where all the file work is being done, but I think the issue is the file not being closed before the File.rename in finalize_compaction and subsequent re-opening.
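If that's the cause, the fix would presumably look something like this (a hypothetical sketch; apart from File.rename/2, the names and the file suffix here are assumptions about the internals):

```elixir
# Hypothetical sketch: close the Store before renaming the compacted file,
# so the original file descriptor is released instead of lingering on the
# renamed-over (deleted) inode, then reopen under the final name.
defp finalize_compaction(%Store.File{file_path: path} = store) do
  :ok = Store.close(store)               # releases the fd held by the Store
  new_path = String.replace_suffix(path, ".compact", ".cub")
  :ok = File.rename(path, new_path)
  Store.File.new(new_path)               # reopen with a fresh fd
end
```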

Without making a test that is OS-specific to Linux, the best I could do was test that the number of linked processes stays constant through many compactions:

test "compaction doesn't spawn tons of linked processes (i.e. lost file descriptors)", %{tmp_dir: tmp_dir} do
  {:ok, db} = CubDB.start_link(tmp_dir, auto_compact: false)
  CubDB.subscribe(db)
  # do it once to identify the steady-state baseline
  CubDB.put(db, 0, 0)
  CubDB.compact(db)
  assert_receive :compaction_completed, 5000
  initial_links = Process.info(db)[:links] |> length()
  # now run many times to confirm we return to the baseline
  for i <- 1..100 do
    CubDB.put(db, i, i)
    CubDB.compact(db)
    assert_receive :compaction_completed, 5000
  end
  final_links = Process.info(db)[:links] |> length()
  assert final_links == initial_links
end

lucaong (Owner, Author) commented Apr 18, 2021

Thanks for the test, I will have a look at it tomorrow with a fresh mind to get to the bottom of it!

When renaming the file after a successful compaction, the old Store must
be closed to avoid leaking the process, and consequently also a file
descriptor.
lucaong (Owner, Author) commented Apr 18, 2021

Actually, brilliant work and intuition about the renaming. I added a generic test for leaked links during compaction, based on yours, and fixed the remaining leak. There were indeed two places leaking a Store process (and therefore also its file descriptor): one, fixed earlier, during cleanup of the old database files; the other during the renaming of the compacted file, as you noted.

I also realize that your PR indeed addressed both places; I should have given it more attention.

Additionally, I am now also explicitly closing the file when the Store is closed. This shouldn't be strictly necessary, as garbage collection should take care of it, but it's better to clean up as soon as possible.
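For instance, assuming the Store holds the io device in an Agent (a sketch; the state shape and struct fields are guesses, not the actual code):

```elixir
# Sketch with an assumed Agent state of {file, position}: close the io
# device explicitly before stopping the Agent, rather than waiting for
# garbage collection to reclaim it.
def close(%Store.File{pid: pid}) do
  Agent.update(pid, fn {file, pos} ->
    :ok = :file.close(file)
    {nil, pos}
  end)
  Agent.stop(pid)
end
```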

Could you confirm that the issue you experienced is now fixed?

meyercm (Contributor) commented Apr 19, 2021

Looks like this commit fixes it. Great!

I only found the second leak because I wrote the test I shared first. I actually did a lot of head scratching trying to figure out how to write the test without making tacit assumptions about the OS, since it's an OS-level problem.

I do really like CubDB; the API is perfect for a bunch of use cases, like SQLite but without the SQL headaches. It's a shame that some of the file handling has leaked out of the Store.File module. I'd love to see a Store.Memory option at some point, with the ability to build up the data structure solely in memory, then later flush it to disk and work with it like the current implementation.

lucaong merged commit 432f07d into master on Apr 19, 2021
lucaong (Owner, Author) commented Apr 19, 2021

I merged the change and will now make a new release. Thanks again for the major help on this!

Regarding the Store abstraction, I do have some future plans to support more options, for example to allow compression or caching.

lucaong deleted the fix_file_descriptor_leak branch on April 19, 2021 at 08:33
Development

Successfully merging this pull request may close these issues.

CubDB crash with :error, :emfile