Known needs a queue #691

Closed
benwerd opened this Issue Jan 8, 2015 · 15 comments

@benwerd

Member

commented Jan 8, 2015

A queue system will improve processing of:

  • syndication
  • outgoing webmentions
  • incoming webmentions

This will make the platform faster and more resilient overall.

However, most self-hosted users won't, in practical terms, be able to run a queue server. Therefore, I'm tempted to create an interface that lets you use your own local queue server, but that can also use our infrastructure in case you're on a host that doesn't support this.


@mapkyca

Member

commented Jan 8, 2015

Aye... just define the interface, have an implementation which implements it synchronously, but have dispatchers for things like Amazon's messaging service (SQS), etc....
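
For illustration, a dispatcher backed by Amazon SQS might look roughly like this. The DispatcherInterface name is an assumption for the sketch, not settled API; the SQS calls are from aws-sdk-php:

use Aws\Sqs\SqsClient;

// Hypothetical dispatcher contract; the synchronous default would
// implement the same interface and just run the job in-process.
interface DispatcherInterface {
    public function dispatch(string $jobClass, array $args);
}

class SqsDispatcher implements DispatcherInterface {
    private $sqs;
    private $queueUrl;

    public function __construct(string $queueUrl, array $awsConfig) {
        $this->sqs = new SqsClient($awsConfig); // e.g. ['region' => ..., 'version' => 'latest']
        $this->queueUrl = $queueUrl;
    }

    public function dispatch(string $jobClass, array $args) {
        // Hand the job to SQS; a worker elsewhere consumes and performs it.
        $this->sqs->sendMessage([
            'QueueUrl'    => $this->queueUrl,
            'MessageBody' => json_encode(['class' => $jobClass, 'args' => $args]),
        ]);
    }
}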

@Phyks

Contributor

commented Jan 9, 2015

Hmm, what would be the behaviour of such a queue system? As I understand it so far, it could be a local queue, in the local database, of outgoing requests that could not be processed immediately, so they can be retried in the future.

But your remark about self-hosting seems to imply that it should run on another server, a bit like a CDN model?

@mapkyca

Member

commented Feb 15, 2015

So, for syndication and outgoing webmentions, I think it might be useful (especially if withknown.com provided a hosted service, as has previously been discussed) to bake in some techniques to mitigate mass interception (which, as I have mentioned before, is now IETF best practice).

We have previously discussed, as part of #203, that it might be advantageous to use some techniques to obfuscate a user's social graph, since this is often more valuable to an attacker than message content and is harder to protect (it should be noted that the graph is still possible to extract; it's just not as easy). We could, for example, route mentions over tor, etc.

Trouble is, if you're watching that network, you can perform statistical analysis based on the traffic being sent - size of packet, time it arrived, etc. This is called a confirmation attack.

This kind of attack is really super easy if you've got a centralised node, e.g. the proposed withknown.com message server. If I wanted to see if Alice and Bob are friends, I could do so easily by watching that server, even if all the traffic was encrypted. All I'd do is watch for a packet from Alice hitting that server, and then a packet of similar size leaving the server towards Bob. If I wanted to get clever about it, I could confirm this by watching to see if Bob performed something that looks like a GET request, shortly followed by Alice sending a page and Bob receiving something of a similar size. This would effectively establish that Alice and Bob are talking, regardless of whether the traffic was encrypted (and, in some situations, sent over tor). Yes, this could probably be done anyway, but a centralised server makes it trivial to automate.

So, here are some thoughts, in no particular order:

  • Queues should be asynchronous, and messages from the queue should not be sent as fast as possible. I would suggest that messages are sent at a random time somewhere between 0 and n, where n is the maximum number of minutes a message is permitted to sit in the queue before it must be sent (a rough sketch of these first three ideas follows this list).
  • Queues should not be FIFO; I'd suggest randomly shuffling on insertion.
  • For mentions, explore the possibility of obfuscating the message size - for example, it should be safe to pad the content with nonsense. If the message is ?source=https://foo&target=https://bar, could we not add &buffer=_randomamountofrandomcrap_? Unless their mention server is very badly written, it should be ignored. Padding assumes the message is sent over TLS so you can't just pull the raw form data, but if you're not using TLS for everything at this point, you're an idiot, and there's no technical fix for that.
  • Parsing incoming mentions should be similarly queued, so when Bob receives Alice's mention ping, the retrieval of that page is also queued and performed some time later.
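
A rough sketch of those first three ideas; the function names and the $maxMinutes knob are illustrative, not a proposed API:

// 1. Random per-message delay: pick a send time between now and now + n minutes.
function randomSendTime(int $maxMinutes): int {
    return time() + random_int(0, $maxMinutes * 60);
}

// 2. Non-FIFO: shuffle the pending batch before dispatching.
function shuffledBatch(array $pending): array {
    shuffle($pending);
    return $pending;
}

// 3. Pad the webmention form body with random junk to mask its size.
//    Receivers should ignore the unknown 'buffer' key.
function paddedMentionBody(string $source, string $target): string {
    $pad = bin2hex(random_bytes(random_int(64, 512)));
    return http_build_query([
        'source' => $source,
        'target' => $target,
        'buffer' => $pad,
    ]);
}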

For a message queue service, there are additional operational security considerations (this applies to any hosted messaging service; I'm using withknown.com's potential offering as a reference, but it's applicable to any implementation, which is why I'm mentioning it as part of the open source discussion):

  • Since zero knowledge is probably not possible, for vanilla webmention at least (if we could guarantee known->known we could extend the protocol), it should nonetheless be the goal. So, for example, I'd suggest not retaining a log of transactions.
  • I'd also suggest never writing the queue to disk - data loss is probably not a problem, since queued messages are largely fire-and-forget anyway.

Nothing is perfect, and these are some random thoughts on a rainy Sunday afternoon while I procrastinate doing my accounts. I'll also add that I am by no means an expert, and professional attackers can probably get around this sort of thing. Thoughts?

@benwerd

Member Author

commented Feb 17, 2015

I think all of this is great.

My thought is that this is probably not going to be the default behavior for Known: for many people, real-time communication is becoming a more important part of the framework, particularly for community sites, and this would mess with that. But it makes sense for this to exist if you want it, and it makes sense for it to be on by default for individual sites.

I also wonder if something akin to Tribler might make sense: in effect an anonymizing, Tor-like proxy.

I strongly agree with not retaining a log of webmentions as a service.

@mapkyca

Member

commented Feb 17, 2015

Indeed, although if you're going to use a "tor-like service", why not just use tor? ;) Obviously not for everyone, but hosted services could easily set this up.

As for instant comms, I agree, and there's a balance between instant delivery and pissing on an attacker's stats. The busier a network, the shorter the delivery delay can afford to be.

Additionally, the messaging server could periodically ping hosts - if the whole exchange is encrypted and the server incorporates the padding protocol etc., it should be harder to conclusively identify an individual exchange (although, aggregated over time, statistical analysis is of course still possible).

@kylewm

Collaborator

commented Mar 26, 2016

We discussed this a bit today in IRC. Here's a proposed interface:

namespace Idno\Core;

abstract class Queue extends \Idno\Common\Component {
    // optionally returns a uuid for the job
    abstract function enqueue(string $queueName, string $jobClass, array $args);
}

abstract class Job extends \Idno\Common\Component {
    abstract function perform(array $args);
}

so to send webmentions, you'd call

// (passing a queue name as the first argument, per the signature above)
Idno::site()->queue()->enqueue('default', '\Idno\Core\PingMentionsJob', [
    'page' => $pageURL,
    'text' => $text,
]);

the default implementation of Queue could just immediately, synchronously perform the job.
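
A minimal sketch of that synchronous default (the SynchronousQueue name is an assumption; it just runs the job in the current request):

namespace Idno\Core;

class SynchronousQueue extends Queue {
    // Perform the job immediately, in-process; nothing is deferred.
    function enqueue(string $queueName, string $jobClass, array $args) {
        $job = new $jobClass();   // a \Idno\Core\Job subclass
        $job->perform($args);
        return null;              // no uuid: the job has already run
    }
}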

@kylewm

Collaborator

commented Mar 26, 2016

Based on (my understanding of) a suggestion from @bear, we could process the queue asynchronously without a separate process/resque/redis/SQS/etc.:

  1. enqueue stores an entry for this Job in the database.
  2. enqueue sends a POST back to itself on localhost at the endpoint /jobs/[jobuid]
  3. the handler on the jobs endpoint grabs the Job out of the database
  4. the handler does the session_write_close trick to close out the request but continue processing in the background.
  5. the handler finally performs the job from the database, marks it as complete, and stores its result in the database.

With a little Apache or nginx configuration, we give handlers on the /jobs/ endpoints a much longer timeout than the default 30 seconds.

The problem I'm still working through: it depends on being able to run all those jobs in parallel. When you save a post, it might POSSE to Twitter, POSSE to Facebook, send a PuSH ping, and send webmentions - four long-running jobs at once, maybe starving out other requests?
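
A rough sketch of steps 1-2 of that scheme (the jobs table, the store_job helper, and the fire-and-forget curl trick are all assumptions for illustration):

namespace Idno\Core;

class DatabaseQueue extends Queue {
    function enqueue(string $queueName, string $jobClass, array $args) {
        $uuid = uniqid('job-', true);

        // 1. Store the job (hypothetical storage helper over a `jobs` table).
        store_job([
            'uuid'   => $uuid,
            'queue'  => $queueName,
            'class'  => $jobClass,
            'args'   => json_encode($args),
            'status' => 'pending',
        ]);

        // 2. POST back to ourselves to trigger processing; the very short
        //    timeout means we don't wait around for the worker to finish.
        $ch = curl_init('http://localhost/jobs/' . $uuid);
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT_MS, 100);
        curl_exec($ch);
        curl_close($ch);

        return $uuid;
    }
}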

@kylewm

Collaborator

commented Mar 26, 2016

I added a $queueName parameter to the enqueue method above; it's not needed yet, but it's something we may want later, and we don't want to have to change the API.

Also, after sleeping on it, I don't think we should do number 4 above (session_write_close) -- the POST call should just take however long the task takes.

@benwerd

Member Author

commented Mar 26, 2016

Some thoughts:

  • The timeout can be set in PHP in most configurations. No need to fiddle with Apache settings.
  • /jobs/ probably shouldn't be reachable from the outside world, but you could mitigate the risk by setting an internal token and using it to verify each call (see the sketch after this list).
  • Parallel tasks in PHP are awful no matter what we do. Atomic transactions will help, but we'll have to be super-careful not to clobber metadata.
  • So maybe a maximum of one job queue instance can run at any one time? If it gets launched again, it simply exits. But whenever it runs, it sequentially works down the list of tasks until there are no more tasks to handle.
  • But this could be brittle - particularly if the queue handler dies halfway through - so maybe not.
  • A fun thing could be allowing tasks to be post-dated so they only become runnable in the future.
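
A minimal sketch of that internal-token check; the header name, config key, and helper are illustrative:

// Hypothetical guard at the top of the /jobs/ handler.
function verifyJobToken(): bool {
    $expected = Idno::site()->config()->job_token ?? '';      // assumed config key
    $provided = $_SERVER['HTTP_X_IDNO_JOB_TOKEN'] ?? '';      // assumed header
    return $expected !== '' && hash_equals($expected, $provided); // constant-time compare
}

if (!verifyJobToken()) {
    http_response_code(403);
    exit;
}
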
@kylewm

Collaborator

commented Mar 26, 2016

> The timeout can be set in PHP in most configurations. No need to fiddle with Apache settings.

Ah ok, this might be a difference between Apache and nginx. nginx hangs up and kills the php-fpm thread after 30 seconds regardless of what php's timeout is (re #1296).

@benwerd

Member Author

commented Mar 26, 2016

Ooh. That's actually smart and gives more control to the server administrator, but at the same time, how annoying.

@kylewm

Collaborator

commented Mar 26, 2016

Instead of Jobs, maybe those should just be events that get triggered: so instead of enqueue('\Idno\Core\SyndicateJob', ...), we would enqueue('post/note/twitter', ...).
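
Sketched, under the assumption that the worker would re-dispatch the dequeued entry through Known's event system (the trigger helper shown is illustrative, not confirmed API):

// Enqueue by event name rather than job class:
Idno::site()->queue()->enqueue('default', 'post/note/twitter', [
    'page' => $pageURL,
]);

// ...and on the worker side, dispatch the dequeued entry as an event,
// e.g. via a hypothetical trigger helper:
// Idno::site()->events()->trigger('post/note/twitter', $args);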

@bear

commented Mar 26, 2016

I mentioned Apache and nginx only in that you want to make sure the proxy forward timeout is at least a bit longer than the PHP (or php-fpm pool) timeout, so that your app is in control of the behaviour.
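
For example (the values are illustrative; the point is only that the proxy timeout exceeds PHP's):

# nginx: give the PHP upstream slightly longer than PHP's own limit,
# so PHP hits max_execution_time before nginx hangs up on the request.
location ~ \.php$ {
    fastcgi_pass unix:/run/php/php-fpm.sock;
    fastcgi_read_timeout 330;    # seconds; a bit more than PHP's limit below
}

# php.ini (or the php-fpm pool config):
#   max_execution_time = 300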

@mapkyca

Member

commented May 29, 2017

The way I've done this in the past has been to ping a custom endpoint on a cron event.

You need to package a user context with the event so that events are triggered as the logged-in user, or write your event handlers not to care that they'll be logged out.

Another possibility is to have the event dispatcher use php-cli on cron, instead of php-fpm or whatever. That means it won't fall foul of any timeouts.

We could even use the existing Known CLI interface as part of this... so the async queue writes to a database table, and a dispatcher running on a one-minute cron pops the head message and dispatches it.
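
A rough sketch of that dispatcher; the table, the fetch/mark helpers, and the crontab line are assumptions:

#!/usr/bin/env php
<?php
// queue_dispatch.php - run from cron, e.g.:
//   * * * * * php /var/www/known/queue_dispatch.php
// Pops pending jobs from a hypothetical `jobs` table and performs them
// under php-cli, so web-server timeouts don't apply.

require_once __DIR__ . '/vendor/autoload.php';

foreach (fetch_pending_jobs() as $row) {      // assumed storage helper
    $class = $row['class'];
    $job   = new $class();                    // a \Idno\Core\Job subclass
    $job->perform(json_decode($row['args'], true));
    mark_job_complete($row['uuid']);          // assumed storage helper
}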

@mapkyca

Member

commented Jul 24, 2017

Closed by #1826

@mapkyca mapkyca closed this Jul 24, 2017
