
fix file descriptor leak #34
Merged: 4 commits, Apr 19, 2021

Conversation

lucaong (Owner) commented Apr 18, 2021

When cleaning up old files, the corresponding Store should also be
stopped, releasing its file descriptor. This step was missing, leading
to a file descriptor leak that would manifest after many compactions.

Now the CubDB process keeps a list of old Btrees, and stops them
(closing the corresponding Store) upon cleanup, after ensuring that those
Btrees and their files are not in use anymore by any reader.

Resolves #30
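As a sketch of the mechanism described above (the function and field names here are illustrative assumptions, not the actual CubDB internals):

```elixir
# Illustrative sketch only: field and function names are assumptions.
# The db process keeps the Btrees of compacted-away files and stops them
# during cleanup, once no reader references them, releasing their fds.
defp clean_up_old_btrees(%{old_btrees: old_btrees, readers: readers} = state) do
  {in_use, unused} =
    Enum.split_with(old_btrees, fn btree ->
      Enum.any?(readers, fn reader -> reader.btree == btree end)
    end)

  for btree <- unused do
    # Stopping the Btree also stops its Store, closing the file descriptor
    :ok = Btree.stop(btree)
    :ok = File.rm(btree.file_path)
  end

  %{state | old_btrees: in_use}
end
```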

meyercm (Contributor) commented Apr 18, 2021

I'm on mobile for now, but I'll try this branch as soon as I get home. I saw you added a test that checks your new code, but it's probably worth adding a failing test based on the old code. A workflow I really like:

  1. Find a new bug
  2. Write a test that fails in the current codebase that I think is tied to the bug
  3. Write code to make the test pass
  4. Check to see if the bug is resolved

Side note: I'd also recommend adding at least one stress test to the suite, with lots of simultaneous readers and writers, manual and automatic compactions, etc.

lucaong (Owner, Author) commented Apr 18, 2021

I definitely agree that a test that reproduces the issue is necessary, but the few unit tests that I wrote in this PR should act as such, preventing regressions:

  • The test added in cubed_test.exs is the real regression test that reproduces the issue: it tests that the Btrees that were "leaked" before are now stopped. It does so using the new Btree.alive?/1 function though, so that needs to be tested too.
  • The test added in btree_test.exs in turn tests that Btree.alive?/1 delegates to Store.open?/1, checking if the store was closed or not
  • The tests added to file_test.exs and test_store_test.exs verify that the respective implementations of Store.open?/1 are correct

What I like about this setup is that the details of how a Store is decommissioned are not leaked to the CubDB process. The new tests also keep the test coverage at a respectable 99%.

That said, after your comment I did add an assertion that directly uses Process.alive? (not relying on new code) to check whether the internal Agent of the Store is leaked.

Regarding the load tests, I absolutely agree. Some already exist in the form of property-based tests (you can run them with mix test --include property_based). I also occasionally run load tests on a physical device (a Raspberry Pi) to verify that a sudden loss of power does not corrupt the database or break atomicity, which is one of the main project goals. What was missing, and what I will set up in the future, is a load test exercising CubDB in a long-lived application with frequent compaction. I do have long-lived production applications using CubDB, but since their databases contain many entries, compactions are bigger and less frequent, so the file descriptor limit is hit much more rarely, usually after months, with the supervision tree merely restarting the process and cleaning up. That is why it went unnoticed until you filed the issue.

Again, thanks a lot for your help on this issue!

meyercm (Contributor) commented Apr 18, 2021

Really happy to help, and really glad to have this database to work with. Thanks for sharing your work.

Unfortunately, it looks like there is still a problem on this branch:
after running these commands:

iex(1)> CubDB.start_link(name: :db, data_dir: "/tmp/testtest")
{:ok, #PID<0.326.0>}
iex(2)> CubDB.compact(:db)
:ok
iex(3)> :os.getpid
'5001'
iex(4)> CubDB.compact(:db)
:ok

output from ls -l /proc/5001/fd:

...skipping...
lrwx------ 1 user user 64 Apr 18 15:24 19 -> /tmp/testtest/2.cub
lrwx------ 1 user user 64 Apr 18 15:23 2 -> /dev/pts/0
lrwx------ 1 user user 64 Apr 18 15:23 20 -> '/tmp/testtest/1.cub (deleted)'
lrwx------ 1 user user 64 Apr 18 15:24 22 -> /tmp/testtest/2.cub
...skipping...

Not only is there still a file descriptor for a deleted db version, but there are 2 file descriptors open for the same version. I have a really hard time following where all the file work is being done, but I think the issue is the file not being closed before the File.rename in finalize_compaction and subsequent re-opening.
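If that's the cause, the fix would presumably look something like this (a hypothetical sketch; apart from File.rename/2, the names and the file suffix here are assumptions about the internals):

```elixir
# Hypothetical sketch: close the Store before renaming the compacted file,
# so the original file descriptor is released instead of lingering on the
# renamed-over (deleted) inode, then reopen under the final name.
defp finalize_compaction(%Store.File{file_path: path} = store) do
  :ok = Store.close(store)               # releases the fd held by the Store
  new_path = String.replace_suffix(path, ".compact", ".cub")
  :ok = File.rename(path, new_path)
  Store.File.new(new_path)               # reopen with a fresh fd
end
```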

Without making a test that is OS-specific to Linux, the best I could do was test that the number of linked processes stays constant through many compactions:

test "compaction doesn't spawn tons of linked processes (i.e. lost file descriptors)", %{tmp_dir: tmp_dir} do
  {:ok, db} = CubDB.start_link(tmp_dir, auto_compact: false)
  CubDB.subscribe(db)
  # do it once to identify the steady-state baseline
  CubDB.put(db, 0, 0)
  CubDB.compact(db)
  assert_receive :compaction_completed, 5000
  initial_links = Process.info(db)[:links] |> length()
  # now run many times to confirm we return to the baseline
  for i <- 1..100 do
    CubDB.put(db, i, i)
    CubDB.compact(db)
    assert_receive :compaction_completed, 5000
  end
  final_links = Process.info(db)[:links] |> length()
  assert final_links == initial_links
end

lucaong (Owner, Author) commented Apr 18, 2021

Thanks for the test, I will have a look at it tomorrow with a fresh mind to get to the bottom of it!

When renaming the file after a successful compaction, the old Store must
be closed to avoid leaking the process, and consequently also a file
descriptor.
lucaong (Owner, Author) commented Apr 18, 2021

Actually, brilliant work and intuition about the renaming. I added a generic test for leaked links during compaction, based on yours, and fixed the remaining leak. There were indeed two places leaking a Store process (and therefore also its file descriptor): one, fixed earlier, during cleanup of the old database files; the other during the renaming of the compacted file, as you noted.

I also realize that your PR indeed addressed both places; I should have given it more attention.

Additionally, I am now also explicitly closing the file when the Store is closed. This shouldn't be strictly necessary, as garbage collection should take care of it, but it's better to clean up as soon as possible.
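For instance, assuming the Store holds the io device in an Agent (a sketch; the state shape and struct fields are guesses, not the actual code):

```elixir
# Sketch with an assumed Agent state of {file, position}: close the io
# device explicitly before stopping the Agent, rather than waiting for
# garbage collection to reclaim it.
def close(%Store.File{pid: pid}) do
  Agent.update(pid, fn {file, pos} ->
    :ok = :file.close(file)
    {nil, pos}
  end)
  Agent.stop(pid)
end
```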

Could you confirm that the issue you experienced is now fixed?

meyercm (Contributor) commented Apr 19, 2021

Looks like this commit fixes it. Great!

I only found the second leak because I wrote the test I shared first. I actually did a lot of head scratching trying to figure out how to write the test without making tacit assumptions about the OS, since it's an OS-level problem.

I do really like CubDB; the API is perfect for a bunch of use cases, like SQLite but without the SQL headaches. It's a shame that some of the file handling has leaked out of the Store.File module. I'd love to see a Store.Memory option at some point, with the ability to build up the data structure solely in memory, then later flush it to disk and work with it like the current implementation.

lucaong merged commit 432f07d into master on Apr 19, 2021
lucaong (Owner, Author) commented Apr 19, 2021

I merged the change and will now make a new release. Thanks again for the major help on this!

Regarding the Store abstraction, I do have some future plans to support more options, for example to allow compression or caching.

lucaong deleted the fix_file_descriptor_leak branch on April 19, 2021 at 08:33
Development

Successfully merging this pull request may close these issues.

CubDB crash with :error, :emfile