Server drops inbound messages and receives corrupted input. #25

juj · 2012-05-24T10:56:34Z

I am observing an issue with the following application:

http://dl.dropbox.com/u/40949268/Bugs/Alchemy/Program.cs

This server program accepts all connections and prints all messages it receives to console.

I am using the following html application to test:

http://dl.dropbox.com/u/40949268/Bugs/Alchemy/client.html (links to http://dl.dropbox.com/u/40949268/Bugs/Alchemy/sendtest.js )

The application connects to localhost, and when the user presses a button, it sends the messages "Test=0", "Test=1", ... , "Test=999" to the server in one for loop.
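
For readers who cannot open the linked files, the send loop can be sketched roughly as follows. This is a hypothetical stand-in, not the contents of sendtest.js: the function name sendBurst and its socket parameter are assumptions, and the WebSocket URL in the comment is a placeholder.

```javascript
// Hypothetical stand-in for the loop in sendtest.js (the real file is only linked above).
// `socket` is anything with a send(string) method, e.g. an already-open browser WebSocket.
function sendBurst(socket, count) {
  for (let i = 0; i < count; i++) {
    // No pause between sends, so the frames are likely to be coalesced on the wire.
    socket.send("Test=" + i);
  }
}

// In a browser (the port is a placeholder, not taken from the report):
// const ws = new WebSocket("ws://localhost:PORT/");
// ws.onopen = () => sendBurst(ws, 1000);
```

Because the loop never yields, the server is likely to see several frames per socket read, which is the situation the rest of this report is about.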

The result I am observing with this test is that the server application does not receive all the messages, and in some cases, it receives complete garbage and malformed input.

The TCP data stream was investigated with Wireshark, and it seems to be intact and contains all the messages.

This problem seems to be exaggerated or caused by a race condition inside Alchemy, since the time that it takes for the server to process a message in OnServerReceive affects how much data is corrupted and/or dropped.

To demonstrate the problem, the Program.cs server program was run with both Thread.Sleep enabled and disabled (the commented line 46 in the file). For the client, the browsers Firefox 12, Opera 11.64 and Chrome 19.0.1084.52 m were tested.

Here are the resulting print logs from each test run:

Firefox + Sleep enabled:
Client: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/FirefoxClient_Sleep.txt
The result is as expected.
Server: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/FirefoxServer_Sleep.txt
Broken, server has dropped several inbound messages.

Opera + Sleep enabled:
Client: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/OperaClient_Sleep.txt
The result is as expected.
Server: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/OperaServer_Sleep.txt
Broken, the server has dropped several inbound messages, and calls the message handler with null data frames.

Chrome + Sleep enabled:
Client: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/ChromeClient_Sleep.txt
Unexpected: The client receives a message "A server must not mask any frames that it sends to the client. :1"
This seems to relate to bug #21.
Server: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/ChromeServer_Sleep.txt
Broken, the server has dropped several messages, and calls the message handler with corrupted data frames.

The Thread.Sleep(10) was present in the above tests to measure the effects of timing on this problem. The results without it:

Firefox + No Sleep:
Client: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/FirefoxClient_NoSleep.txt
The result is as expected.
Server: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/FirefoxServer_NoSleep.txt
At first glance, the result looks good, but the server has still dropped some messages (for example, "Test=2").

Opera + No sleep:
Client: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/OperaClient_NoSleep.txt
The result is as expected.
Server: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/OperaServer_NoSleep.txt
Broken, the server has dropped some messages, and calls the message handler with null data frames.

Chrome + No Sleep:
Client: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/ChromeClient_NoSleep.txt
The result is as expected. Note that the same client error that was present when Thread.Sleep was used on the server does not occur here.
Server: http://dl.dropbox.com/u/40949268/Bugs/Alchemy/ChromeServer_NoSleep.txt
Broken, the server has dropped some messages.

It was also observed that sleeping in the JavaScript client code between data sends reduces the number of dropped messages, so this issue may be related to how multiple frames are decoded and assembled from the TCP stream when the number of bytes available to read from the socket varies on the server.

ajacksified · 2012-05-31T05:49:08Z

Very curious - thanks for the detailed info. We'll investigate.

juj · 2012-05-31T12:44:04Z

I am interested in hearing if this is just my system, or can others independently reproduce the issues with their systems on the provided test cases?

ajacksified · 2012-05-31T15:40:26Z

I'm going to run these tests this evening; I'll see if @Kythera can as well.

bridgettegraham · 2012-07-19T10:56:54Z

I am curious - has there been any progress on fixing this? I am really enjoying the ease and speed of this library, but I might have to try something else because of the corrupted data...? Just wondering :)

juj · 2012-07-19T16:48:59Z

I tried to resolve the problem for a while, but eventually migrated to using the Fleck library, https://github.com/statianzo/Fleck/ , which worked better for me without problems.

ajacksified · 2012-07-19T16:58:02Z

Yeah, I completely failed to check this out; will be doing so this weekend.

bridgettegraham · 2012-07-20T04:24:46Z

I like this library. For now I have increased the buffer size which seems to resolve the issue. It only seems to pop up when the data must be sent in packages. Maybe this helps?

preli · 2012-09-14T07:07:17Z

I think this is the best websocket server in C#, but I have the same issue with the library as the creator of this thread. It works for small data, but when I try to send larger amounts of data at once, the server will only receive "garbage".
Any chance this could be fixed?

sztupy · 2012-11-20T17:35:14Z

It happens to me too. I am sending larger JSON packets between an Alchemy server and an Alchemy client, and sometimes they arrive garbled. I used a workaround: when I encounter an unparseable JSON object, I reset the dataframe (context.DataFrame.Reset()) and send a request to the server to send the data again. Usually it solves the problem.
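
The retry half of this workaround can be sketched as below, in JavaScript rather than C#. The names parseOrRequestResend and requestResend are hypothetical, and the Alchemy-specific context.DataFrame.Reset() call is deliberately left out because it has no equivalent here.

```javascript
// Try to parse an incoming message; if it is not valid JSON, ask the sender to
// transmit it again. `requestResend` is a caller-supplied callback (hypothetical).
function parseOrRequestResend(text, requestResend) {
  try {
    return JSON.parse(text);
  } catch (err) {
    if (!(err instanceof SyntaxError)) throw err; // only swallow parse failures
    requestResend();
    return undefined; // JSON.parse never returns undefined, so this is unambiguous
  }
}
```

Note that this only catches corruption that breaks the JSON syntax; a garbled string value inside otherwise valid JSON would pass through unnoticed.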

vbguyny · 2013-02-10T01:26:19Z

I am able to duplicate this issue. It appears that when the server receives 32768 bytes or more of data, it starts not reporting it correctly to the server. It is pretty amazing that this issue is still outstanding.

vbguyny · 2013-02-10T04:25:53Z

Additional news: I tested this out with SuperWebSocket and it doesn't have this issue, so I don't think it is browser based.

vbguyny · 2013-02-10T17:36:01Z

Another update: After testing with Safari 5.1.7 and Opera 12.14, there is no issue at all. However, Chrome 24 drops messages after 32768 bytes and Firefox 18 starts dropping the first 32-64 bytes if the message is greater than 1024 bytes. The major difference between the browsers that work (Safari and Opera) and the ones that do not (Chrome and Firefox) is the protocol used for the websockets. There appears to be a bug in the implementation for RFC 6455, which Chrome and Firefox use. I personally don't know enough about the protocol to debug it myself, so I would ask that someone please assist. Thanks.

joreg · 2013-07-03T19:11:17Z

we're experiencing the same problem with our application in connection with every recent browser. any news?

Dinin7 · 2013-07-15T05:13:07Z

Same problem here. Has anyone solved this already ?

joreg · 2013-07-15T09:57:34Z

our solution was using SuperWebSocket instead.

zyo2012 · 2013-08-06T02:08:13Z

I have no choice but to drop Alchemy too, because I'm losing a lot of packets when they are sent very quickly from another websocket client. I confirmed this with both WebSocket4net (nuget package) and Websocketsharp.

I wrote a small program that connects to the server and, once the connection is open, sends 100 messages in a for loop. The server prints them, and over half are missing. After switching to SuperWebSocket, none were missing, and I haven't touched the client code. I preferred Alchemy (fewer dependencies and cleaner to run), but the missing packets are a showstopper.

slothbag · 2013-10-08T23:32:03Z

Thinking about using Alchemy for a websocket library.. any progress on this bug? Anyone done any investigations? Easy to fix, hard to fix?

swieser · 2014-03-25T16:00:32Z

Problem still exists - and it's not even hard to reproduce. I have an app that just sends two messages upon connecting. 90% of the time, the second one is lost.

With that, Alchemy is completely worthless for me, all the code gone to waste because no one fixes this highly critical bug for two years.

Switched to Fleck, which - thanks to a similar architecture - was done within 20 minutes. No problems anymore.

Cannot recommend Alchemy at all.

steforster · 2014-03-27T20:26:43Z

@swieser, did you try my fork (https://github.com/steforster/Alchemy-Websockets) also ?

Myndale · 2014-06-20T01:31:19Z

@steforster: you're a life-saver, thanks! I spent all morning trying to figure out why I was seeing this. Google led me here, I replaced the NuGet package with your fork and now everything is working perfectly. Cheers!

swieser · 2014-06-21T10:27:05Z

@steforster I didn't try the fork, because Fleck worked flawlessly and I had no need to spend extra time porting my code back to Alchemy. Seeing Myndale's comment, however, I'm glad that someone actually steps up and fixes that (highly critical) bug, and applaud you for that.

Gabriel-RABHI · 2014-07-09T17:16:30Z

I have read the Alchemy code in order to include websocket support in my own network classes, and experienced the same problems. When sending big messages, they arrive corrupted, and when I send two or more messages in a short time, I receive only the first one!

So I started debugging the code, and re-read the Alchemy processing path step by step...

I have found that when a websocket sends some small messages in a loop, they arrive as aggregated, ordered packets. I found that the Alchemy class does not process the remaining bytes when a packet contains the end of one message and the beginning of the next! Hum... 👎
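
To illustrate the remaining-bytes problem, here is a minimal sketch of a decoder that keeps unconsumed bytes between socket reads, so a read containing the end of one frame and the start of the next loses nothing. It is written in JavaScript for Node.js rather than C#, it is not Alchemy's code, and the name FrameDecoder is hypothetical. It assumes RFC 6455 framing and, to stay short, ignores fragmentation and control frames.

```javascript
// Sketch of a stream decoder for RFC 6455 frames (not Alchemy's implementation).
class FrameDecoder {
  constructor() {
    this.pending = Buffer.alloc(0); // bytes received but not yet consumed
  }

  // Feed one socket read; returns every complete payload it now contains.
  push(chunk) {
    this.pending = Buffer.concat([this.pending, chunk]);
    const payloads = [];
    for (;;) {
      const payload = this.tryReadFrame();
      if (payload === null) break; // incomplete frame: keep the remainder for later
      payloads.push(payload);
    }
    return payloads;
  }

  // Returns one unmasked payload, or null if the frame is not complete yet.
  tryReadFrame() {
    const buf = this.pending;
    if (buf.length < 2) return null;
    const masked = (buf[1] & 0x80) !== 0;
    let length = buf[1] & 0x7f;
    let offset = 2;
    if (length === 126) {
      if (buf.length < 4) return null;
      length = buf.readUInt16BE(2); // 16-bit extended length
      offset = 4;
    } else if (length === 127) {
      if (buf.length < 10) return null;
      length = Number(buf.readBigUInt64BE(2)); // 64-bit extended length
      offset = 10;
    }
    const maskOffset = offset;
    if (masked) offset += 4;
    if (buf.length < offset + length) return null;
    const payload = Buffer.from(buf.subarray(offset, offset + length)); // copy
    if (masked) {
      for (let i = 0; i < length; i++) payload[i] ^= buf[maskOffset + (i % 4)];
    }
    // The important step: whatever follows this frame stays queued for the next pass.
    this.pending = buf.subarray(offset + length);
    return payload;
  }
}
```

The loop in push is what handles aggregated packets: it keeps extracting frames until the buffered bytes no longer hold a complete one.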

I have also found that Alchemy is not unmasking all payload segments correctly. Seemingly at random, some payload segments are not correctly unmasked. The index used to read the mask seems to be wrong sometimes.
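
One way such an error can arise, offered as an assumption rather than a confirmed diagnosis of Alchemy, is restarting the mask index at zero for each received segment. RFC 6455 defines the mask index relative to the start of the payload (payload byte i is XORed with mask byte i mod 4), so a payload split at a position that is not a multiple of four needs the running offset carried over. A sketch in JavaScript (Node.js), with the hypothetical helper name unmaskSegment:

```javascript
// Unmask one segment of a frame payload.
// `payloadOffset` is the position of segment[0] within the whole payload,
// so the mask index continues where the previous segment stopped.
function unmaskSegment(segment, mask, payloadOffset) {
  const out = Buffer.alloc(segment.length);
  for (let i = 0; i < segment.length; i++) {
    out[i] = segment[i] ^ mask[(payloadOffset + i) % 4];
  }
  return out;
}
```

With 4-byte masks, a segment boundary that happens to fall on a multiple of four would hide the bug, which could explain why the corruption looks random.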

So, I will implement the processing of the remaining bytes and give you feedback. It should fix the case of multiple fast messages being sent and not received, and I hope it will also remove the big message problem...

steforster · 2014-07-09T21:22:09Z

As you see above, I fixed this in https://github.com/steforster/Alchemy-Websockets and also posted a pull request.

zyo2012 · 2014-07-09T21:38:12Z

@steforster Thanks for your time. I've been lazy and just picked another library, but Alchemy's footprint was much cleaner.

Question: did you run these 2 simple tests?

Send 1000 messages and check that all 1000 messages arrive at the server in order.
Send big messages (over 1 MB of data, for instance) and check that they are not getting cut.
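
The first check can be automated on the receiving side. Below is a small sketch in JavaScript (Node.js) that, given the list of messages the server logged, reports which expected messages are missing, which entries are unexpected (null or garbage), and whether the valid ones arrived in order. The function name checkSequence and the "Test=N" message format (borrowed from the original report) are assumptions.

```javascript
// Compare what the server received against the expected "Test=0" .. "Test=<count-1>".
function checkSequence(received, count) {
  const expected = [];
  for (let i = 0; i < count; i++) expected.push("Test=" + i);
  const expectedSet = new Set(expected);
  const seen = new Set(received);

  const missing = expected.filter((m) => !seen.has(m));
  const unexpected = received.filter((m) => !expectedSet.has(m));

  // Valid messages must arrive with strictly increasing numbers;
  // a duplicate therefore also counts as out of order.
  let inOrder = true;
  let last = -1;
  for (const m of received) {
    if (!expectedSet.has(m)) continue;
    const n = Number(m.slice("Test=".length));
    if (n <= last) inOrder = false;
    last = n;
  }
  return { missing, unexpected, inOrder };
}
```

A run passes only if missing and unexpected are both empty and inOrder is true.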

I'm kinda tired of changing my code and using yet another websocket library that fails basic testing.

Once you tell me both cases are working, I will try your fork in a live project.

steforster · 2014-07-10T05:59:07Z

I did not send large messages but checked message order of fast appended messages.
Please feel free to improve the unit tests.
There are not many but they are on a client/server level.

Gabriel-RABHI · 2014-07-10T09:35:46Z

@steforster: Thank you for your submission, it's cool. I've fixed the fast small message case in my own code, and that is OK. Now I'm experiencing some strange behaviour with big messages sent from the browser: messages of 100 KB or more are divided into segments of 8048 bytes each (the standard buffer size), and some payload segments are still randomly corrupted.

BUT:

I've noticed that if I slow down the socket by adding Thread.Sleep(2) before restarting Socket.Receive, then all is OK. So it seems that the strange behaviour is linked to a race condition or to the IO Completion Ports used for high-performance sockets and low GC pressure. But it is possible I've left a bug in my own code, because I'm using only some small parts of the Alchemy lib (header, dataframe).

I'm starting to debug that now.

Gabriel-RABHI · 2014-07-10T10:37:23Z

First result of my debugging: if you call ReceiveAsync immediately, then corruption appears randomly. This is the case when you do almost nothing between the receive callback and the call to ReceiveAsync. Read this article - http://www.themissingdocs.net/wordpress/?p=615

This is my case: I simply queue the received buffer to be processed by another thread, so the processing takes almost no time... This is possibly also the case in Alchemy, which only adds the received buffer to the segment array, which is too fast to avoid the race bug.

So I'm trying to find a simple but fast solution to get around the problem without GC pressure (many buffers, many structures):

use many ASyncEventArgs in a circular manner
integrate a non-blocking delay before restarting the receive...

steforster · 2014-07-10T11:39:24Z

I would be surprised to see this error when using my fork. There I fixed several bugs concerning the async behaviour on server side.

Gabriel-RABHI · 2014-07-10T13:03:21Z

Yes, but you'd better run the test of sending a 1 MB buffer 100 times and check that all received buffers are good. You cannot say "there is no problem" if you never test it!

What is true in Alchemy is that the packet is fully processed BEFORE the call to ReceiveAsync, so the time needed for this processing keeps the bug from appearing!

In my rewrite I do the packet processing in another thread and restart ReceiveAsync immediately. I've tested various solutions, with 100 EventArgs, with a single buffer, etc... nothing works. I still have the problem. The delay needed between the callback and ReceiveAsync to avoid the bug is really small, but it causes a severe penalty in the case of a high-performance server.

I will start new tests, with unsafe pinned buffers this time.

juj · 2014-07-10T13:31:19Z

Nice to see some activity on this issue, great work. However, let me unsubscribe myself from this, since I am no longer using this library. Hope it gets resolved - good luck!

Myndale · 2014-07-10T22:28:08Z

Just to throw my 2c in the mix…I tested your fork on a facility containing a centralised server and about a dozen different clients all connected simultaneously and I saw heaps of corruption issues. Unfortunately I was under a tight deadline so I didn’t get time to track them down properly but my tests seemed to indicate it was due to Alchemy not being correctly thread-safe.

Switching to WebSocket4Net resolved all problems immediately and the system has been stable ever since, so I’m reasonably confident it’s not an issue with our code…unless of course Alchemy isn’t actually meant to be thread-safe?

Mark


zyo2012 · 2014-07-10T23:30:27Z

I agree with @Myndale this library is not ready for prime time. Basic testing and multi-thread safety is a must.

Gabriel-RABHI · 2014-07-11T13:42:09Z

So, in my latest research I've tested the async .NET API separately and there is no problem with it. High perf, low garbage pressure. It is all strange, because I can reproduce the behaviour of the code presented in the article I posted yesterday. I will start to analyze the potential threading issues... The error can appear on an 8-core PC and not on a 2-core one, because the thread pool managing the IO Completion Port can be larger, and an unfinished message processing operation can be corrupted by the next one started by another thread.

So, I think that to test it, everyone needs to:

Use a multicore machine
Process multi-megabyte messages

For the Alchemy implementation, I cannot say what would be the result.

Gabriel-RABHI · 2014-07-11T14:07:08Z

OK, I've found my error: it's exactly a problem of a badly protected critical section and overlapping processing. I had to add a monitor to prevent that, and now it's all OK. For Alchemy, I hope that the work done by steforster solves all the threading problems, but I haven't seen any more problems in the code I extracted from the lib. Steforster seems to be really confident. Thank you all!

zyo2012 · 2014-07-23T11:03:18Z

The big message drop is not a bug, it's a hard limit in the code.
The line public UInt64 MaxFrameSize = 102400; //100kb in context.cs sets the limit, so you can increase it.

The fork from steforster ( https://github.com/steforster/Alchemy-Websockets ) does not have any bugs as far as I've tested. I think the author should do the merge.

ghost assigned DorianGray Jul 11, 2012

sztupy mentioned this issue Nov 20, 2012

JSON messages truncated #49

Open

mattkentz mentioned this issue Mar 14, 2013

Using Alchemy as a client to connect to server of a different implementation #69

Open

steforster mentioned this issue Feb 16, 2014

Client- and serverside bug fixes #94

Open
