Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OfxOscReceiver: from detach() to join() #7949

Merged
merged 9 commits into from
May 15, 2024

Conversation

artificiel
Copy link
Contributor

No description provided.

@artificiel
Copy link
Contributor Author

another test, just plop this in setup() -- immediate freeze here (macOS14.4)

	ofxOscMessage m { "/test" };
	m.add("something", "else", 20.3f);

	ofxOscSender sender { "127.0.0.1", 2002 };	
	std::shared_ptr<ofxOscReceiver> receiver;

	receiver = std::make_shared<ofxOscReceiver>(2002);
	sender.send(m);
	receiver = std::make_shared<ofxOscReceiver>(2003);

I think as soon as something is brewing in the listenerThread, things have to be concluded with the listener prior to reassignation.

@artificiel
Copy link
Contributor Author

I have integrated your #define USE_EXPLICIT_FUNCTION as default, in order to distinguish problems (it's a touchy territory and we don't want to introduce something that breaks something else).

As for the eventual crash after w while: [warning] ofxOscReceiver: select failed is that similar to the problems you had in actual use? if not, can you provoke a crash similar to the one you encountered with another #ifdef?

(i have a fix for the select failed, but I'd like to confront your issue too if it's different)

@NickHardeman
Copy link
Contributor

I don't remember ever receiving the [warning] ofxOscReceiver: select failed message.

But the crash report is pointing to the same areas in the code we were seeing in practice and appears to be the same.

@artificiel
Copy link
Contributor Author

can you try with the above commit? also if you can try windows it would be great (oscpack implementation is different).

this does not touch oscpack, but what happens between AsynchronousBreak() and join() is in the low level socket stuff.

@artificiel
Copy link
Contributor Author

added a #define OSC_NO_DETACH to simplify toggling off the changes to confirm the crashes. by default the fix is on.

@artificiel
Copy link
Contributor Author

making a temporary instance of ofxOscReceiver does not work with the fix (but works with previous .detach) because the internal listener thread does not time to start, and when it does: in .detach() it spins up and winds down out of scope "in the background" while in .join(): it locks in Run because UdpSocket resets the break_ pseudo-barrier within the thread itself.

this can be fixed in oscpack by initializing break_ to false, and setting it back to false at the end of Run() (in case the UdpSocket gets reused, which does not seem to happen in the ofxOscReceiver pattern.

@NickHardeman
Copy link
Contributor

NickHardeman commented May 15, 2024

Hi @artificiel,
Thank you for making these tests.
When I comment out the INSTANTLY_OUT_OF_SCOPE define, it runs as it should, otherwise it hangs, as you mentioned.
I tried it out on windows 10 in visual studio and ran each mode for at least 10 minutes. Mode 3 reported 100% loss, not sure why.
Also tested on macOS and it ran as it should.

@artificiel
Copy link
Contributor Author

OK! i have a fix for the INSTANTLY_OUT_OF_SCOPE but I'd prefer handling it in another PR. will look into #3 @ 100% (note that the stress test also stresses the OS in rapidly going through all available UDP ports...). do you have the possibility of testing your previously-crashing deployed situation?

the rationale for this fix: the comments in ofxOscSender stipulates that using .detach() is OK because the custom deleter "does the right thing", which allows overlapping instances, which is probably OK in the case of start/stopping a receiver, because the ofxOscReceiver itself is not deallocated and the thread can terminate properly, but not when being overwrittten (which is destruction of the thread — so the deleter will be deallocating before it has time to complete). and I ask the question: what kind of case requires such an optimisation? using .detach() correctly for non-trivial things is known to be difficult, and in the case of ofxOscReceiver, for a end-user to call stop() I don't see the goal of such an optimisation: if it takes a couple of cycles to wind oscpack down, I don't see the problem — maybe there are cases where stopping the listener would take noticeable time? (perhaps with a huge unattended queue?? [another idea: an optional size limit to the thread channel, 'cause right now it can build infinitely]).

(NOTE: instead of taking out .detach() i first had an approach with a shared_mutex which could cohabit the "instant" requirement of overwriting by allocation with the "lazy" feature of the detached deleter by synchronizing the close() with the internal state of the udpSocket, but it requires some changes inside oscpack and a dynamic data structure for the listener threads (even if detached you need to keep a hold on the one in the process of being deleted while the new one spins up). it's not sooo complicated but certainly more complicated. I prefer this approach which only affects ofxOscReceiver itself)

of course, as many sets of eyes as possible on this would be great -- even though the stress test makes things comfortable, there are many possibilities out there... (but maybe not so many users dynamically (re)allocating ofxOscReceiver...)

if we don't see a problem to have the deleter synchronized with the join(), I don't think we should try to optimize that away. so this can merge so the other details can be fixed.

@artificiel artificiel changed the title test app to crash dynamic OfxOscReceiver OfxOscReceiver: from detach() to join() May 15, 2024
@artificiel artificiel marked this pull request as ready for review May 15, 2024 02:32
@artificiel
Copy link
Contributor Author

closes #7938

@ofTheo
Copy link
Member

ofTheo commented May 15, 2024

cool! I'll defer to @NickHardeman to approve - but this looks good to me!

@NickHardeman
Copy link
Contributor

"do you have the possibility of testing your previously-crashing deployed situation?"
I don't have a good way to test. It was one of those crashes that took a long time to happen and felt inconsistent and was tricky to reproduce.
This PR addresses the crashes on allocation / deallocation which seems to be the same as the one we were getting :)
So I think it's good to go.

@ofTheo ofTheo merged commit 1f62174 into openframeworks:master May 15, 2024
11 checks passed
@artificiel artificiel deleted the oscthread branch May 15, 2024 20:04
danoli3 added a commit to danoli3/openFrameworks that referenced this pull request May 20, 2024
…leedingmacOS

* commit '7f37e70f65e9e022ba8868fb555570ce2c78a6ba': (37 commits)
  Allows Retina hi res enabled via App or Project.xcconfig (openframeworks#7971)
  actions changes (openframeworks#7968)
  Changing exr to hdr files for compatability with windows (openframeworks#7964)
  ofMesh - newfaces push_back to insert a list (openframeworks#7772)
  restore default-copy-constructibility of ofEvent (openframeworks#7969)
  [actions] ccache update (openframeworks#7967)
  Core small changes (openframeworks#7952)
  config.emscripten.default.mk for Emscripten >= 3.1.52 (openframeworks#7909)
  Fix edge case in findDelimiter (openframeworks#7911)
  oscpack / udpSocket: invert the "break_" semaphore (openframeworks#7963)
  ofxOscMessage: extra implicit adds [fixes something noted through openframeworks#7938 debugging] (openframeworks#7953) #changelog #ofxOsc
  ofThreadChannel::clear() to clear the channel (openframeworks#7921) #changelog #threadChannel
  OfxOscReceiver: from detach() to join() (openframeworks#7949)
  Update ofMathConstants.h (openframeworks#7958)
  [actions] update ubuntu 24.04 (openframeworks#7955)
  ofScopedMatrix (openframeworks#7946)
  [actions] - testing one action with multiple jobs for tests (openframeworks#7860)
  adding of.entitlements and vscode files to gitignore (openframeworks#7031)
  Make - use relative paths (openframeworks#7519)
  FPS timing with chrono (openframeworks#7867)
  ...

# Conflicts:
#	libs/openFrameworks/sound/ofAVEngineSoundPlayer.mm
@armadillu
Copy link
Contributor

Hey, just a quick note; I am getting crashes on a very mature project after updating OF. I am testing on OSX. These crashes are happening when calling setup() on ofxReceiver; this particular project creates a few of them at the same time (all created from the main thread though). The crash always seems to happen in

void SocketReceiveMultiplexer::Run()
{
	impl_->Run();
}

When defining the macro "OSC_NO_DETACH" there are no crashes. I suspect this is related to #5262. Threads that are very short lived cause some sort of race condition when you create and destroy a few of them in a quick succession. This happens when you create an object (like ofxReceiver, that starts a thread on its own when allocated), and then it is destroyed very quickly after being created... and the thread is still mid-creation or on a weird state.

I think this is particularly happening in my case because of how ofxReceiver setups() itself on allocation, and it does it with a default port. If you setup() it manually just after allocation (which seems quite a logical thing to do if you are using a non-standard port), you are shooting yourself in the foot, as internally a thread was already spawned as soon as you created the object. I know this is not the real solution, but it seems weird to me that ofxReceiver sets up itself.

I am seeing this in ofxTuio; every instance of ofxTuio holds an instance of an ofxReceiver... So just by allocating an ofxTuio you already have and ofxReceiver up and running before you have setup anything.

@armadillu
Copy link
Contributor

I can confirm that with this small change to my project:

		tuioClient = new ofxTuioClient();
		ofSleepMillis(1); //Sleep a bit so ofxOscReceiver has time to spawn a thread cleanly
		tuioClient->start(port); 

The crashes are gone (with OSC_NO_DETACHnot defined). Just adding a ofSleepMillis(1); between the two start() calls on ofxOscReceiver removes the issue.

@NickHardeman
Copy link
Contributor

Hi @armadillu,
Thank you for reporting this.
If you make some changes to ofxOscReceiver.h so that it does not start via the constructor and default port, does that avoid the crash?

ofxOscReceiver(int port = 7970) {

Something like the following:

ofxOscReceiver() {};

ofxOscReceiver(int port) {
	setup(port);
};

@artificiel
Copy link
Contributor Author

artificiel commented May 20, 2024

@armadillu what you describe is very similar to the INSTANTLY OUT OF SCOPE case described above; to be sure can you confirm your current OF include #7963 (5 days ago)? which should make the "quickie thread" correctly re-join (without resorting to an ofSleepMillis which unsynchronizes things enough for the detached thread to finalize).

nevertheless, allocating a default port is ill-advised considering a dynamic thread allocation; it should be removed — but #7963 should still allow it without crashing.

and to synthesize the reason for these changes: when re-assigning a new ofxOscReceiver, the previous one gets destructed, and the detached thread sometimes (we don't have a clear statistic, perhaps 1 in 1000) does not run correctly it's deleter which ends in segfault; switching to the .join model forces things to happen before the end of the destructor. (it could also be done with a form of mutex but it's a more complex change).

and thanks for the simple tuio example I will be able to make tests against that later today if you confirm you are indeed running with #7963.

@artificiel
Copy link
Contributor Author

artificiel commented May 21, 2024

2 things:

about the default constructor that setups the thread, I agree it should be removed, but at the moment it is an excellent test as it should not hang/segfault but correctly delete/destroy upon re-assignation. (this new crashing behaviour is a side-effect of fixing the dynamic reallocation case)

[EDIT re-submitted a MR for the #ifdef stuff]

@armadillu
Copy link
Contributor

@artificiel @NickHardeman Yes I was testing against the latest master (I pulled yesterday).
I'll be able to get back to this in a few hours, and I'll report back on your queries.

@armadillu
Copy link
Contributor

@NickHardeman I can confirm that creating a default constructor () fixes the issue. This was also expected, as by doing so you are avoiding the "race condition".

@artificiel
Copy link
Contributor Author

@armadillu OK and are you POSIX or Win (oscpack implementation is not the same).

@armadillu
Copy link
Contributor

@artificiel testing on OSX, as mentioned in my first post. Reading through your posts now

@artificiel
Copy link
Contributor Author

OK and to be sure, there was a mistake in the PR as the #define OSC_NO_DETACH should be defined (but it was in the ofxStressTest we were using to confirm things were working). the MR above fixes that be removing altogether the #defines.

@ofTheo @NickHardeman if possible to look and merge #7981 #7979 and also #7925 which includes an ofxTest for OSC that tests amongst other thing an immediately-out-of-scope ofxOscReceiver — it will make it easier to track behaviour across platforms.

@armadillu
Copy link
Contributor

@artificiel I am happy to test on your private branches if you tell me which one to test against, maybe you can have a unified one? I will try to make a minimal example that makes it crash with the current OF master, so you can test against it.

@armadillu
Copy link
Contributor

Ok here's a minimal example that triggers the crash for me:

#pragma once
#include "ofMain.h"
#include "ofxTuioClient.h"

class ofApp : public ofBaseApp{
public:
	void setup() override{
		tuio = new ofxTuioClient();
		tuio->start(6666);
	}
	void update() override{};
	void draw() override{};

	ofxTuioClient * tuio = nullptr;
};

I am using my fork of ofxTuio, but it should trigger the same crash with the original.

@artificiel
Copy link
Contributor Author

OK i confirm the Tuio crash against master, and correct behaviour with https://github.com/artificiel/openFrameworks/tree/osc-scope2 — it's really just the #define that was inverted between the separated commits...

BTW I just read through #5262 which for some reason did not pop up when starting the work vs the detach() usage within ofxOscReceiver. I jotted down my thoughts as I went along above and in #7938 and in essence the problem is reassigning a detached thread that has not finished it's custom deleter (which requires completed breaking from the UDP socket listener) and/or re-assigning a thread before the detached UDP socket listener actually starts Run(). both problems are inconstant as they depend on how threads end up being processed without sync. I've read your last comment and I don't know if the calling thread makes a difference (probably that it does in how the compiler sequences instructions, hence different behaviours, but maybe not in the sense of "threads" themselves).

if you want to see the type of behaviour that was under test you can run apps/devapps/StressofxOscReceiver — with the old detach() it would randomly crash once in a while (after a few minutes on a Mac mini M1, exacerbated by dragging windows or launching other apps); now it runs for days.

the only drawback there seems to be is that destruction/deletion of the listener thread is synchronized by the join(), which means a bit of latency on the thread that calls close() on ofxOscReceiver (and that is almost no latency, because every time it was not crashing before it was by coincidence (or perhaps compiler "common sense") executed in the order as if synchronized).

@armadillu
Copy link
Contributor

armadillu commented May 21, 2024

@artificiel I can confirm that compiling against your private branch osc-scope2 both the "ofTuio based" crashing example and my larger project (that uses multiple OSC objects) work OK.

I think the tradeoff you propose is completely desirable. Just for curiosity, I timed the "latency" one gets at stopping and starting (ie the ofTuio example), and it comes out at about 0.22ms in my setup on a Release build. I think this is completely acceptable, and any users advanced enough to notice or care about it, can probably find a way to work around it. And it is of course much better than dealing with 1-in-a-1000 type crashes.

I think ideally we should remove the "automatic setup" of the ofxOscReceiver default constructor, but doing that may break some older projects; so it may not be worth it.

@artificiel
Copy link
Contributor Author

@armadillu ah great to time it, yes 0.22ms is quite small, and in the case that it would cause a hiccup and drop a frame on the openGL thread, it is still possible to handle it within another thread.

for the default constructor, it's a recent addition so I don't think there will be breaking (there are more constructors getting implemented) and this one was a step too far. so the constructor can be maintained with auto-setup/start if int port is provided, but left unstated when no port. #7986

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants