Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.x] Add option to use handles to RID #54907

Merged
merged 1 commit into from
Dec 6, 2021

Conversation

lawnjelly
Copy link
Member

@lawnjelly lawnjelly commented Nov 12, 2021

Adds an option to compile an alternative implementation for RIDs, which allows checks for erroneous usage patterns as well as providing leak tests.

Note : This is compiled out during a regular build so there should be no differences to operation. Handles are only compiled in (replacing the regular RIDs) when passing an argument to scons.

Introduction

We've had a number of very difficult issues over the years concerned with the lifetimes of objects. There are two lifetimes of major interest - scene side Nodes (and objects derived from Reference, Object), and server objects, which are referenced using RIDs. It is this latter objects that are addressed here.

An example of the difficult to investigate problems caused by RID lifetimes is #53374. This and similar bugs can be debugged and fixed using this PR.

The implementation of RID in 3.x is basically a glorified pointer to the object (stored in RID_Data, the actual server objects are derived from this). This means that if you copy a RID, all the RIDs point to the same object.

This is great and quite efficient, in that you can basically just look up the pointer to get the object, the problem comes when you come to delete objects, you often get dangling RIDs (pointers) that are pointing to an object that has been deleted, or worse, another object that has replaced the old object.

One widely used solution to this problem is handles containing an indirect lookup for the object rather than a direct link. That way when you delete an object, a further erroneous lookup of the handle can detect this situation, flag an error and return a NULL. There are a number of other advantages of handles - the object is relocatable, and you can detect use of an invalid handle ID using revisions, etc etc.

https://en.wikipedia.org/wiki/Handle_(computing)

The downside in this case is that the lookup requires an extra indirection, which may potentially cause a cache miss. There is a little extra housekeeping versus a straight pointer, but not a lot, and certainly nothing compared to the cache implications.

Godot 4.x has moved to using handles for the safety aspects.

This PR

What I've done here is add a simple alternative implementation for RIDs that uses handles, and is complete with error checking, and also can pinpoint the original files / lines of leaks etc (to a limited extent). This ability to do tracking hasn't been added to 4.x, it's somewhat difficult logistically but I've managed to get it working via a small macro when creating RIDs (you wrap them in a RID_PRIME()), which compiles out when tracking is switched off.

This is a small extra hassle when creating RIDs, but it's no big deal if you forget to add one, it just means you won't get tracking for that object.

Handle system

The handle system itself is pretty simple, it doesn't attempt some of the optimizations in 4.x, and it reuses the pooled_list templates which have already been created for handles in the BVH and portals, which makes the new code quite easy to understand.

Building with handles

Nothing has changed in the regular builds of Godot, in order to build with handles you need to add an scons argument:

rids=pointers [default]
rids=handles
rids=tracked_handles

The tracking version does a little more than the straight handles implementation because it has to record files and line numbers. But even the tracking version uses less memory than the standard build, around 20 megs less in the small tests I used so far, so the old system is quite memory hungry presumably with red black trees etc.

Implementation Details

The RID_Database is just a global currently. It needs to be readily accessible for the RID_PRIME systems to work efficiently. It could be a singleton, but there's no real need for this and it will probably make it slower. It could also potentially be tagged onto another existing object, but the lifetime issues are quite important here, it needs to be available until other systems have been destroyed, so having it some form of global is a safe bet in this regard.

Performance

I haven't tested yet to any extent, I get the impression the handles version may be very slightly slower in some situations (and possibly faster in others), and it is not highly optimized, indeed many of the calls are in cpp files rather than headers to avoid the unfortunate side effect of recompiling the engine when making changes, as RID is quite deep in the dependency tree.

The main aim to start with however is to provide a debugging build that can be used to investigate and test for RID lifetime issues. We may decide at a later date to swap completely for handles, but there's no need to make this decision at this time.

Example Leak Output

ERROR: RID_Database leaked 10 objects at exit.
   at: shutdown (core/rid_handle.cpp:140)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES38MaterialE (drivers/gles3/rasterizer_scene_gles3.cpp:4953)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES36ShaderE (drivers/gles3/rasterizer_scene_gles3.cpp:4924)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES38MaterialE (drivers/gles3/rasterizer_scene_gles3.cpp:4926)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES36ShaderE (drivers/gles3/rasterizer_scene_gles3.cpp:4929)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES38MaterialE (drivers/gles3/rasterizer_scene_gles3.cpp:4930)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES36ShaderE (drivers/gles3/rasterizer_scene_gles3.cpp:4936)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES38MaterialE (drivers/gles3/rasterizer_scene_gles3.cpp:4938)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES36ShaderE (drivers/gles3/rasterizer_scene_gles3.cpp:4941)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES38MaterialE (drivers/gles3/rasterizer_scene_gles3.cpp:4942)
ERROR: 1 RID objects leaked
   at: N22RasterizerStorageGLES36ShaderE (drivers/gles3/rasterizer_scene_gles3.cpp:4950)

This is actually the standard situation running the engine, these are some default materials / shaders that are not cleaned up in the normal way. They are not problematic, but could be fixed.

In use there may be other leaks in particular projects, which indicate problems that need to be investigated.

@lawnjelly lawnjelly requested review from a team as code owners November 12, 2021 10:32
@Calinou Calinou added this to the 3.5 milestone Nov 12, 2021
@lawnjelly lawnjelly force-pushed the rid_handles branch 2 times, most recently from 570cb8a to 1163297 Compare November 12, 2021 12:06
@lawnjelly lawnjelly changed the title Add handles to RID Add option to use handles to RID Nov 12, 2021
@RandomShaper RandomShaper changed the title Add option to use handles to RID [3.x] Add option to use handles to RID Nov 12, 2021
@lawnjelly lawnjelly force-pushed the rid_handles branch 2 times, most recently from 9852498 to 4544fab Compare November 13, 2021 07:11
@mhilbrunner
Copy link
Member

PR meeting: reduz said this is good, needs review by @RandomShaper

@TokisanGames
Copy link
Contributor

Since it's an optional build (which I don't think it should be), the CI/CD checks are meaningless.

I tried to build this with rids=handles. The build failed on win10/64 msvc2019.

1>core\rid_handle.cpp(103): error C2039: 'owner_name_id': is not a member of 'RID_Database::PoolElement'
1>Compiling ==> core\string_builder.cpp
1>D:\GD\godot\core\rid_handle.h(123): note: see declaration of 'RID_Database::PoolElement'
1>core\rid_handle.cpp(108): error C2039: 'owner_name_id': is not a member of 'RID_Database::PoolElement'
1>D:\GD\godot\core\rid_handle.h(123): note: see declaration of 'RID_Database::PoolElement'
1>core\rid_handle.cpp(145): error C2039: 'line_number': is not a member of 'RID_Database::PoolElement'
1>D:\GD\godot\core\rid_handle.h(123): note: see declaration of 'RID_Database::PoolElement'
1>core\rid_handle.cpp(145): error C2039: 'owner_name_id': is not a member of 'RID_Database::PoolElement'
1>string_builder.cpp
1>D:\GD\godot\core\rid_handle.h(123): note: see declaration of 'RID_Database::PoolElement'
1>core\rid_handle.cpp(145): error C2039: 'filename': is not a member of 'RID_Database::PoolElement'
1>D:\GD\godot\core\rid_handle.h(123): note: see declaration of 'RID_Database::PoolElement'Compiling ==> core\string_name.cpp
1>
1>core\rid_handle.cpp(145): error C2660: 'RID_Database::register_leak': function does not take 1 arguments
1>core\rid_handle.cpp(116): note: see declaration of 'RID_Database::register_leak'
1>string_name.cpp
1>Compiling ==> core\translation.cpp
1>scons: *** [core\rid_handle.windows.opt.tools.64.obj] Error 2

Building with rids=tracked worked fine. It seems there are several RIDs that leak in the stock engine, without any project. Just quitting the project manager causes it to report those 10 leaks you quoted. I think they should be cleaned up in this PR. Otherwise, they'll be reported as issues and confusing for users who think they are problems in their game instead of the engine.

I'll start using this build for my project.

@lawnjelly
Copy link
Member Author

lawnjelly commented Dec 1, 2021

Ah good spot I'll fix that. I was mostly checking the tracked build and it takes a while to recompile.

EDIT: Fixed, just needed some #ifdefs to check it was the tracking build for printing the leaks etc.

You are correct about CI builds, but in practice it's not hugely important as it is just for special tracking builds at this stage, and once up and running it is unlikely many things would break it (except work on the RIDs themselves). We really want to keep the CI builds as streamlined as possible and limiting the permutations is already done.

If you want to use the handles build in production that is probably fine if you desire (It's preferable to the tracked build as that will bloat the executable, and may be a little slower). It could be optimized a bit for release use but it's not that bad.

We can definitely consider swapping the standard builds to handles in the future, but I'm not sure it's a good idea to jump the gun on this, mainly because this approach and any implementation needs thorough testing, as RIDs are central to the engine. There's also a trade off - handles are safer, but pointer RIDs may be a bit faster in some situations, as they avoid an indirection (although whether that is actually a problem in practice remains to be seen, we can do plenty of profiling / performance testing etc).

And don't forget the political aspect. It's much easier to get this in as a debugging feature, then have people say "hey why aren't we using this in production", than to get it merged in the first place as production code. 😉

On the fixing the existing leaks, they are actually false flags, because they are deleted in a non-standard way. They can be fixed but better in a separate PR. We try and split PRs when they do more than one divisible thing. This makes it easier to track the changes, and e.g. revert one without the other.

I actually did fix some order of destruction bugs in here (I think the owners were deleted before the RIDs) but that was more pressing.

rasterizer_gles3.cpp line 505:

	// memdelete(storage); // moved lower
	memdelete(canvas);
	memdelete(scene);

	// storage must be deleted last,
	// because it contains RID_owners that are used by scene and canvas destructors
	memdelete(storage);

@lawnjelly
Copy link
Member Author

lawnjelly commented Dec 1, 2021

As a compromise I've added a second commit which fixes the cleanup in GLES3 and GLES2. Rather amusingly, there was not even an attempt to cleanup RasterizerSceneGLES2 (there was no destructor!). 😀

I still have a RID leak in Environment in GLES2 but don't want to get things out of hand in this PR. And there might be more that should be ideally deleted in the GLES2 scene destructor but at least the RID leak list is a bit more manageable now.

In retrospect it would probably have been better to put the GLES3 ordering fix I quoted above in the 2nd commit, but I don't know easy this is to do, whether you can edit a previous commit, not being a git expert. I could alternatively back these leak fixes up and put in a separate PR, depends what the mergers think.

Copy link
Member

@RandomShaper RandomShaper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have missed something, since a lot happens here, but as far as my review can reach, this is fine.

SConstruct Outdated Show resolved Hide resolved
core/pooled_list.h Outdated Show resolved Hide resolved
@TokisanGames
Copy link
Contributor

We can definitely consider swapping the standard builds to handles in the future, but I'm not sure it's a good idea to jump the gun on this,

I understand about testing. But the whole purpose of fixing the RIDs ability to determine if they are valid or not was to fix leakage and spatialeditor destroying memory that doesn't belong to it. If this isn't set as the default, then it doesn't meet the primary motivation for writing it, and I need to fix spatialeditor and visual server, and maybe physics server by other means, and assuming the RIDs are still unreliable. Not a deal killer, but worth discussing.

@lawnjelly
Copy link
Member Author

lawnjelly commented Dec 3, 2021

I understand about testing. But the whole purpose of fixing the RIDs ability to determine if they are valid or not was to fix leakage and spatialeditor destroying memory that doesn't belong to it. If this isn't set as the default, then it doesn't meet the primary motivation for writing it, and I need to fix spatialeditor and visual server, and maybe physics server by other means, and assuming the RIDs are still unreliable. Not a deal killer, but worth discussing.

Sorry this may be a misunderstanding. The PR isn't intended directly to fix the RID bugs, it is intended as a debug tool to make it much easier to find, track down and fix RID bugs in client code. To explain the reasoning a little more:

The existing behaviour of RIDs follows the new / delete idiom. The client code is responsible for keeping track of the lifetime and calling delete once, but only once. Doing it more than more, or using the object after calling delete is a bug. Normally that would crash, but with the handles we have the opportunity to more gracefully handle this situation, by not crashing and outputting an error message.

Now the handles are written in such a way as it can detect such use and recover from it, but this pattern of double deletes etc is imo not expected behaviour by the client and should not be relied on, or used as a replacement for a proper bug fix. This behaviour is the same in the handles in 4.x afaik.

So currently using double deletes with handles might work, but it will flag errors, and would be considered a hack, rather than a proper fix. The proper fix is to keep proper track of lifetimes in the client code. There is no reason why you can't use e.g. reference counting for this or other methods if they are warranted in a situation.

Altering the "contract" in the way you suggest is quite a major change and would need to go through a proposal etc.

@lawnjelly
Copy link
Member Author

lawnjelly commented Dec 6, 2021

Yes on balance I think I will split the fixes into another PR and shrink this one to just the tracking. Hopefully won't take too long. 👍 😀

Ok is now separated hopefully.

EDIT: Turns out the rasterizer order of destruction bugs do have to be fixed at the same time, because the RasterizerScene uses the RID_Owners in RasterizerStorage in their cleanup code. So that has to be included, but I'll try and have the minimum fixes in this PR.

@lawnjelly lawnjelly force-pushed the rid_handles branch 3 times, most recently from 22fe8b4 to 8725ca7 Compare December 6, 2021 13:18
memdelete(storage->shader_owner.getptr(default_worldcoord_shader_twosided));

memdelete(storage->material_owner.getptr(default_overdraw_material));
memdelete(storage->shader_owner.getptr(default_overdraw_shader));

Copy link
Member Author

@lawnjelly lawnjelly Dec 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that these are changed. They are essential fixes for this PR, because the old form of calling get_data() directly on the RID is effectively no longer supported (i.e. won't compile with tracked_handles).

It was only used in a couple of places anyway, constituted a bug here, and also wasn't compatible with handles / allocation tracking, so better to change these (the other place where it was used in this way was getting lights, and was unnecessary).

It may actually be worth removing the get_data() in RID just to prevent people accidentally using it in future instead of the long form, by making e.g. private and only accessible by making the RID_Owner a friend class. Doesn't have to be in this PR though.

@@ -1500,7 +1500,7 @@ void RasterizerCanvasGLES3::render_joined_item(const BItemJoined &p_bij, RenderI
}
}

glBindBufferBase(GL_UNIFORM_BUFFER, 1, static_cast<LightInternal *>(light->light_internal.get_data())->ubo);
glBindBufferBase(GL_UNIFORM_BUFFER, 1, static_cast<LightInternal *>(light_internal_owner.get(light->light_internal))->ubo);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the other occasion when get_data() had to be changed as it is no longer supported directly on the RID as it won't compile with tracked_handles. This form compiles down to the same form as previous anyway when rids=pointers (i.e. default) is set.

@lawnjelly lawnjelly force-pushed the rid_handles branch 3 times, most recently from a07fcde to 859d915 Compare December 6, 2021 14:36
Adds an option to compile an alternative implementation for RIDs, which allows checks for erroneous usage patterns as well as providing leak tests.
@akien-mga akien-mga merged commit efadd46 into godotengine:3.x Dec 6, 2021
@akien-mga
Copy link
Member

Thanks!

@lawnjelly
Copy link
Member Author

Just to note here, that at the moment, you won't see the leak list at the end when running a game from the Godot IDE. Instead you need to run a game (or the Godot Editor) directly from a console, or c++ debugger for the leak list to show up.

The reason for this is that the machinery for print_line, ERR_PRINT etc to transfer these from the game to the Godot IDE, is shutdown before the leak checks can be done (i.e. after the Rasterizers etc are cleaned up). This is actually a surprisingly common problem for leak checks. 😀

This is a bit annoying as it would be nicer to be able to see the leaks directly in the Godot IDE. I'm just seeing if we can fix this by simply moving the call site around for the rid_database shutdown (which lists the leaks), but no joy so far. I'm not sure exactly which parts of the engine are responsible for debug output. So at the moment we either get no output in the Godot IDE, or we get 'fake leaks'.

An alternative could be to move the RID cleanup code for the rasterizers to their finalize() method rather than the destructor (this is just a stub in some cases, but it is obviously intended to help deal with this type of order of destruction problem), and call this before the engine debug output system is shut down.

We could also potentially try moving the cleanup of the debug output to later in the cleanup sequence, but I'm loathe to touch this if possible, as this order has obviously been tested on all the platforms and works.

Having said that the errors for misuse at runtime should show up fine in the Godot IDE, it's just a gotcha to watch out for.

@TokisanGames
Copy link
Contributor

TokisanGames commented Dec 8, 2021

@lawnjelly How do we free the handles? When opening my project in the editor, not running it, then closing it, the console reports 21 leaks in CanvasItem and Texture:

ERROR: RID_Database leaked 21 objects at exit.
   at: RID_Database::shutdown (core\rid_handle.cpp:142)
ERROR: 18 RID objects leaked
   at: struct VisualServerCanvas::Item (scene\2d\canvas_item.cpp:1268)
ERROR: 3 RID objects leaked
   at: struct RasterizerStorageGLES3::Texture (scene\resources\texture.cpp:828)

Looking at texture.cpp:828, it creates an RID for a StreamTexture. However that is freed in the destructor. Since the RID is freed in the destructor, is this falsely reporting a leak?

StreamTexture::StreamTexture() {
        ...
	texture = RID_PRIME(VS::get_singleton()->texture_create());
}

StreamTexture::~StreamTexture() {
	VS::get_singleton()->free(texture);
}

Both go to RasterizerStorageGLES3, where the texture_owner.free(p_rid) is ran.

CanvasItem shows the same thing:

CanvasItem::CanvasItem() :
		xform_change(this) {
	canvas_item = RID_PRIME(VisualServer::get_singleton()->canvas_item_create());
...
}

CanvasItem::~CanvasItem() {
	VisualServer::get_singleton()->free(canvas_item);
}

Both go to VisualServerCanvas where canvas_item_owner.free(p_rid); is ran.

@lawnjelly
Copy link
Member Author

lawnjelly commented Dec 8, 2021

That looks correct in theory, but you need to debug it (or post a min reproduction project and I can take a look). The things of interest are the RID _id, and the _revision.

False leaks are in theory possible, but unlikely (most likely a valid bug), because the RID_Database should be shutdown last. There could also be a bug in the tracking (as it is quite new), or most obvious this could be a genuine error that needs fixing.

Incidentally best to post these in individual new issues then we can investigate and fix.

RID bugs are to be expected because none of this has been monitored before, and it would be surprising it a few hadn't built up over the years.

Incidentally in the GLES2 environments was a very similar situation - on first glance it looked like it was freeing them correctly and I suspected a bug in the allocation tracking. But on further investigation the leak was genuine and there was a bug in the free routine.

print_verbose("RID revision incorrect : " + _rid_to_string(p_rid, pe));
ERR_FAIL_COND_V_MSG(pe.revision != p_rid._revision, nullptr, "RID_Database get_or_null, revision is incorrect, object possibly freed before use.");
}

Copy link
Member Author

@lawnjelly lawnjelly Apr 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to myself: In retrospect, I probably should add a lock here if we do decide to roll out further.

I suspect there is a potential race condition from getting PoolElement and reading at the exact moment a pool resize is happening from another thread (this should be incredibly rare, and may not even happen in practice). This is kind of a pain, as it would be nice to have this performance sensitive function operate without a lock, especially for such an unlikely situation. Maybe there is a way around this. Anyway not a significant problem unless we do decide to make regular builds with handles.

EDIT: Fixed in #60427 .

@TokisanGames
Copy link
Contributor

How about making this the default build option in the next beta release for editor builds @akien-mga?

Suggested here: #60406 (comment)

The bottom lines are:

  • The RID system and the way they are free()'d in the servers is inherently flawed and will continue to produce bugs until it is redesigned.
  • 99% of users cannot debug RID issues, and will need dev help.
  • The RID handle implementation is a useful tool for devs, but it won't be enabled by most users unless special builds are made available.
  • Users and devs can save time if users can provide the error database on the first go-round. However, now the console has been hidden from them. :/

@lawnjelly said this:

This is perhaps something we could consider now some time has passed.

As far as I remember it from the time:

  • The handles builds needed testing - I've now been using the rids=tracked_handles builds for 5 months or so now as a daily with no problems at all, for debug and release_debug editor builds. This is a sample size of one, on one platform, but on the flip side the type of problems likely to occur will probably occur on all platforms.
  • The tracked_handles may bloat the executable size a bit and be a little slower. This could be tested. (On the plus side, memory use was actually lower when I tested this.)

The other question is whether we might use tracked_handles for the editor, and pointers for release templates (as ultimately these should be fastest, in theory at least. We need to profile of course though. In practice, the difference might be negligible, in which case robustness would win.). Given that these two are quite different there is the opportunity for bugs if the implementation of either diverges. An alternative is to make release templates with rids=handles, which has all the same robustness but can't provide such detailed error messages. This is the approach that Godot 4 takes, if I remember correctly.

On the other hand there is always the potential to introduce new nasty hard to find bugs if we change this in the release templates, particularly threading bugs, which the handles is more susceptible to. So on balance I am probably still slightly in favour of keeping the current release template to use pointers (if it ain't broke etc).

I'm personally fairly agnostic on this really, and we should also weigh in how often RID errors happen in practice (not very often it seems now that we have fixed the major existing bugs), and we do have the tools to investigate them now when they do crop up.

Another option is to build with rids=tracked_handles for DEV_ENABLED builds, so that more eyes are likely to see RID bugs earlier. Or we could publish a debug build of the editor with all debugging enabled and rids=tracked_handles with each release, for windows only might be okay as linux users could run it under wine. I think there was some interest in having debug builds available so maybe this could be a "kill two birds with one stone" thing.

@akien-mga
Copy link
Member

Next build should RC1, so it's not the time for experiments anymore for 3.5. But we can consider it for 3.6 beta 1.

@Calinou
Copy link
Member

Calinou commented May 5, 2022

I'll remember to use rids=tracked_handles in my debug builds once 3.5 is released.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants