This repository has been archived by the owner on Oct 7, 2022. It is now read-only.

Using imapfw to analyze attachements #11

Open
Rafiot opened this issue May 19, 2016 · 25 comments

@Rafiot
Contributor

Rafiot commented May 19, 2016

My goal is to bundle an IMAP proxy usable by end users to sanitize email attachments transparently.

I am the maintainer of PyCIRCLean and I'm going to write a Python script that takes an email as input, looks up the attachment(s) with a script similar to this one, reattaches the payload and returns a "sanitized" email.

My question is the following: is there an easy way to hook a module to imapfw? I didn't find it in the documentation, sorry if I missed it.

@nicolas33
Member

imapfw is not ready to download emails.

I'd be happy to know more about your use case to get the bigger picture. What do you mean by "hook a module"? What module(s) would you hook?

@Rafiot
Contributor Author

Rafiot commented May 20, 2016

My idea is the following:

  • The users install and configure imapfw to connect to their IMAP server
  • The users configure their email client (Thunderbird, Outlook, ...) to connect to imapfw
  • imapfw acts as a transparent proxy for all messages except the ones with attachments. In that case, it gives the whole source of the message to my script, which checks the sanity of the attachment and returns a message (after having changed the attachment if needed)
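The routing decision in the last bullet could be sketched roughly as follows. This is only an illustration using Python's standard `email` package, not imapfw code: `has_attachment` and `route_message` are hypothetical names, and `sanitize` stands in for whatever script does the actual cleaning.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Callable


def has_attachment(msg: EmailMessage) -> bool:
    """Return True if any MIME part is declared as an attachment or carries a filename."""
    for part in msg.walk():
        if part.get_content_disposition() == "attachment" or part.get_filename():
            return True
    return False


def route_message(raw: bytes, sanitize: Callable[[bytes], bytes]) -> bytes:
    """Pass messages without attachments through untouched; hand the rest to `sanitize`.

    `sanitize` is a hypothetical hook: it receives the full message source
    and returns the (possibly modified) message source.
    """
    msg = message_from_bytes(raw, policy=policy.default)
    if not has_attachment(msg):
        return raw  # transparent path: bytes are relayed unchanged
    return sanitize(raw)
```

Relaying the original bytes on the transparent path, rather than re-serializing the parsed message, avoids accidental changes to messages that need no processing.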

The other thing I'm not sure about is how an IMAP server would handle a message modified by the client. Do you have an idea?

@nicolas33
Member

nicolas33 commented May 20, 2016

Thanks.

> The other thing I'm not sure about is how an IMAP server would handle a message modified by the client. Do you have an idea?

Emails are identified by their UID. Changing an email invalidates its UID, so the changed email must be removed from the server and re-uploaded as a new email.
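A minimal sketch of that remove-and-re-upload step, assuming a connection object that behaves like an authenticated `imaplib.IMAP4` with the mailbox already selected read-write. `replace_message` is a hypothetical helper, not part of imapfw, and it does not preserve the original flags or internal date.

```python
import imaplib
import time


def replace_message(conn, mailbox: str, uid: str, new_raw: bytes) -> None:
    """Upload `new_raw` as a new message, then delete the message with `uid`.

    `conn` is assumed to behave like imaplib.IMAP4 (append/uid/expunge).
    """
    # Upload first, so a failure never leaves the mailbox without any copy.
    typ, data = conn.append(
        mailbox, None, imaplib.Time2Internaldate(time.time()), new_raw
    )
    if typ != "OK":
        raise RuntimeError(f"APPEND failed: {data!r}")

    # Flag the old message as deleted, addressing it by UID.
    typ, data = conn.uid("STORE", uid, "+FLAGS", r"(\Deleted)")
    if typ != "OK":
        raise RuntimeError(f"UID STORE failed: {data!r}")

    # Plain EXPUNGE removes every message flagged \Deleted in the mailbox,
    # not only this one.
    conn.expunge()
```

The server assigns the re-uploaded message a new UID, so any client that cached the old one has to rediscover it.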

@nicolas33
Member

Forgot to say: our (WIP) documentation is available on the website.

@Rafiot
Contributor Author

Rafiot commented May 20, 2016

Thanks, I'll look at the doc and the code.

Do you think having such a module is doable with imapfw in the future? I'm comfortable working on a project under heavy development, but I want to make sure it is possible before I start allocating time to it.

@nicolas33
Member

I think it is possible in two ways:

  • if you need the "real" IMAP server to take changes into account, the proxy has to propagate the changes (download, make the change, delete old on server, create new on server).
  • if there's no need for the IMAP server to be aware of the changes, the proxy could be an IMAP server like Dovecot or internal (must be implemented). In this case, imapfw would sync both IMAP servers (and possibly make changes on the emails). Since Dovecot works on Maildir, imapfw could also sync the "real" IMAP server to the Dovecot database (Maildir).
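The second option (syncing into a Maildir that a local server such as Dovecot can expose) might look roughly like this with Python's standard `mailbox` module. It is a sketch under assumptions: `sync_to_maildir` and the `sanitize` hook are hypothetical names, and fetching the messages from the "real" server is left out.

```python
import mailbox
from typing import Callable, Iterable, List


def sync_to_maildir(
    messages: Iterable[bytes],
    maildir_path: str,
    sanitize: Callable[[bytes], bytes],
) -> List[str]:
    """Store a sanitized copy of each raw message in a local Maildir.

    Returns the Maildir keys of the stored messages. `maildir_path` should
    either not exist yet or already be a valid Maildir.
    """
    box = mailbox.Maildir(maildir_path, create=True)
    keys = []
    try:
        for raw in messages:
            # The change is applied locally; the remote server is not touched.
            keys.append(box.add(sanitize(raw)))
        box.flush()
    finally:
        box.close()
    return keys
```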

@nicolas33
Member

There is another possible way: write a real proxy. This would be the hard path but a very interesting one, BTW.
In this case, imapfw should create a socket and allow triggers on IMAP requests.
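To give an idea of what "triggers on IMAP requests" could mean, here is a small sketch that splits a client command line into tag, command and arguments and calls registered callbacks. It is not imapfw's API: `parse_command` and `TriggerRegistry` are hypothetical, the socket handling is omitted, and IMAP literals and continuation requests are not handled.

```python
from typing import Callable, Dict, List, Tuple

# A trigger receives (tag, command, arguments).
Trigger = Callable[[str, str, str], None]


def parse_command(line: bytes) -> Tuple[str, str, str]:
    """Split one client line into (tag, COMMAND, arguments)."""
    text = line.decode("ascii", errors="replace").rstrip("\r\n")
    tag, _, rest = text.partition(" ")
    command, _, args = rest.partition(" ")
    command = command.upper()
    if command == "UID":
        # "UID FETCH ..." and friends are two-word commands.
        sub, _, args = args.partition(" ")
        command = f"UID {sub.upper()}"
    return tag, command, args


class TriggerRegistry:
    """Map IMAP command names to callbacks fired before the line is relayed."""

    def __init__(self) -> None:
        self._triggers: Dict[str, List[Trigger]] = {}

    def on(self, command: str, trigger: Trigger) -> None:
        self._triggers.setdefault(command.upper(), []).append(trigger)

    def dispatch(self, line: bytes) -> Tuple[str, str, str]:
        tag, command, args = parse_command(line)
        for trigger in self._triggers.get(command, []):
            trigger(tag, command, args)
        return tag, command, args
```

A proxy loop would read a line from the client socket, call `dispatch`, and then relay the line to the server.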

@Rafiot
Contributor Author

Rafiot commented May 20, 2016

Solution 1 seems to be the best one in my case (I don't really understand the difference with solution 3).

The goal is to be able to propose such a solution for occasional users of webmails too, so the changes need to be somehow synchronized back to the "real" IMAP server.

@nicolas33
Member

There are different levels at which a proxy could act:

  • high-level (solution 1): the proxy exposes/tunes emails "on demand" when they are first downloaded and propagates changes back to "real". This could mean long response times for the client, since it has to wait for a new UID (re-upload) on updates. In this case, discovering new emails is triggered by the client. The proxy implements almost a full IMAP server, but the emails aren't stored locally: the database is the "real" server. The UIDs exposed to the client could therefore differ from those on "real".
  • high-level (solution 2): imapfw syncs emails to a local maildir and then exposes them via IMAP, which requires local disk space to store the emails. Discovering and updating emails is done by the proxy even when no client is connected.
  • low-level (solution 3): the proxy works on IMAP requests and commands. It doesn't interpret the IMAP requests from the client; they are blindly relayed to the server. The exposed UIDs are the "real" UIDs. However, downloaded emails can be updated and propagated back to "real" before they are exposed to the client. This would mean long delays for plenty of IMAP commands, because every newly discovered UID requires the email to be checked first in case it needs changes. This likely requires a local database of known (already checked) UIDs and regexps on IMAP commands from the client.

There could be other ways of working for your proxy. For example, a monitor could regularly request "real" for new emails so they are checked and updated if needed. The proxy would only expose pre-validated UIDs.
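The "only expose pre-validated UIDs" idea boils down to filtering the UID list coming from "real" against the set the monitor has already checked. A minimal sketch, with hypothetical function names and plain integers standing in for UIDs:

```python
from typing import Iterable, List, Set


def expose_validated(server_uids: Iterable[int], validated: Set[int]) -> List[int]:
    """UIDs the proxy may show to the client, in server order."""
    return [uid for uid in server_uids if uid in validated]


def pending(server_uids: Iterable[int], validated: Set[int]) -> List[int]:
    """UIDs the monitor still has to check before they can be exposed."""
    return [uid for uid in server_uids if uid not in validated]
```

For example, with server UIDs `[1, 2, 3, 4]` and validated set `{1, 2, 4}`, the client would see `[1, 2, 4]` while UID 3 waits for the monitor.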

@nicolas33
Member

nicolas33 commented May 20, 2016

BTW, I wonder whether updating the emails would be best done on the server at delivery time (MDA). I think most IMAP servers allow this kind of thing. ,-)

@Rafiot
Contributor Author

Rafiot commented May 20, 2016

Definitely, a Postfix script will also be written, and it is the cleanest way, but the goal is to support users with no specific knowledge, infrastructure or support team at hand (webmail users working in small organisations who receive all kinds of ransomware).

Just to make sure I got it right:

(every time I say mail client, I mean Thunderbird/Outlook/...)

Solution 3

The mail client uses imapfw as an actual proxy and connects to it to get the emails, imapfw is the only one connecting to the IMAP server.
Every email passing through is sent to the sanitizing module. If it has an attachment, it is sanitized (optionally: the original email is sent to quarantine), the sanitized email is tagged as sanitized, pushed back to the server and passed to the email client.

Solution 1

imapfw acts as a mail client and modifies the emails on demand

Downside: the mail client still connects to the remote IMAP server so it will still receive unprocessed malicious attachments if imapfw didn't have time to update the email.

Solution 2

imapfw does the same as solution 1 but with a local storage.

Downside: If the email client connects directly to imapfw, it will only see the emails in the local storage and not the ones on the server.

Solution 3 is most definitely the best one, because even if it is a bit slower at fetching the emails, the mail client will still do all the caching it was doing before (let's say the last 30 days and all the subjects) so the extra hop isn't critical.

@nicolas33
Member

> Solution 3
>
> The mail client uses imapfw as an actual proxy and connects to it to get the emails, imapfw is the only one connecting to the IMAP server.

Same goes for 1 & 2.

> Every email passing through is sent to the sanitizing module. If it has an attachment, it is sanitized (optionally: the original email is sent to quarantine), the sanitized email is tagged as sanitized, pushed back to the server and passed to the email client.

Same goes for 1 & 2.

> Solution 1
>
> imapfw acts as a mail client and modifies the emails on demand

imapfw is an IMAP client in all the alternatives.

> Downside: the mail client still connects to the remote IMAP server

No, the mail client connects to the proxy.

> so it will still receive unprocessed malicious attachments if imapfw didn't have time to update the email.

Same goes for 2 & 3.

> Solution 2
>
> imapfw does the same as solution 1 but with a local storage.

No, it doesn't do the same as solution 1. Solution 1 is about using IMAP as a language for the remote database. Solution 2 is about syncing both server and proxy.

> Downside: If the email client connects directly to imapfw, it will only see the emails in the local storage and not the ones on the server.

True, the latest emails on the server must be processed at regular intervals to update the local and remote databases.

> Solution 3 is most definitely the best one, because even if it is a bit slower at fetching the emails, the mail client will still do all the caching it was doing before (let's say the last 30 days and all the subjects) so the extra hop isn't critical.

Caching can be done with solution 1, too.

However, I don't think solution 3 will be "a bit" slower. I think it can become a lot slower. For example, if the mail client only requests the list of UIDs, the proxy must first download ALL the unknown emails to process them and then return the correct list of numbers.

I don't know which option is the best. I'd say it depends on what users expect. Solutions 1 and 3 are hard because IMAP is client side while the purpose is to apply changes on the server. Each solution has downsides. Proxying IMAP is "easy" as long as no modifications are made on the emails.

@Rafiot
Contributor Author

Rafiot commented May 21, 2016

Ok, I understand now.

What the users expect is to receive their messages in their email client without changing their habits. They also have multiple devices (PCs, phones, ...) and use webmail from time to time, so having a local cache isn't the goal, and we need to sync the changes back to the server (so the other clients also get the sanitized version).

@Rafiot
Contributor Author

Rafiot commented May 21, 2016

Now a very practical question: is it something you think will be doable with imapfw in the near-ish future? Or should I look at another library?

I'd be very happy to participate in the development, but right now I don't really understand where I should look in the code, as the framework seems very extensive.

@nicolas33
Member

I can't tell how much time this would require since it depends on contributions (mine included). It also depends on your own knowledge of Python and how "production ready" you expect it to be.

For now, imapfw is still at an early stage, so you should expect to write quite some code. OTOH, this means you have more freedom to implement what you want.

I think imapfw has the best extensibility compared to any other library, due to its design and Python's metaprogramming capabilities.

I'd say imapfw can be a good long-term solution if you have enough time to spend on the code.

For a start, I'd first look at the screencast. Next, you should look at the code and ping me on Gitter. You can ask any question, as many as you want, so you can get a better picture of the current state and a better overview of what you could do with imapfw.

@nicolas33
Member

> What the users expect is to receive their messages in their email client without changing their habits. They also have multiple devices (PCs, phones, ...) and use webmail from time to time, so having a local cache isn't the goal, and we need to sync the changes back to the server (so the other clients also get the sanitized version).

Pushing back improves safety, but most email clients will need 2 different accounts (one for the real remote and another for the proxy), so the clients' local caching won't be useful when switching between the two.

Whatever the solution, accessing the real server directly would expose users to unchecked emails.

@Rafiot
Contributor Author

Rafiot commented May 22, 2016

Great, I'll dig into imapfw more and look for a way to implement it. I have decent skills in Python so I'll definitely contribute as much as I can.

FYI, I wrote a quick&dirty script that takes a mail as input and returns a sanitized version: https://github.com/Rafiot/PyCIRCLean/blob/mail/bin/mail.py

It is far from being production ready, but this is the idea.

Regarding the 2 different accounts, I still don't get it :/

To me, it should work this way:
[Figure: imapfw]

Of course, if the users use any other device at the same time as the proxy, they will get the unsanitized version, but as soon as the sanitizing is done, the only version that stays on the server is the sanitized one.
And on the machine where the proxy is running, they only get to see the mail once the sanitizing is done.

@nicolas33
Member

[Figure: sanitize]

@nicolas33
Member

Things are worse because the clients can connect more than once, and this should not trigger the same checks twice.
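One way to avoid duplicate checks when several client connections ask for the same UID at once is to let the first caller run the check while the others wait for its result. The sketch below assumes a threaded proxy; `CheckOnce` and its `check` callable are hypothetical, and results are kept in memory only.

```python
import threading
from typing import Callable, Dict


class CheckOnce:
    """Run `check(uid)` at most once per UID, even with concurrent callers."""

    def __init__(self, check: Callable[[int], bool]) -> None:
        self._check = check
        self._lock = threading.Lock()
        self._results: Dict[int, bool] = {}
        self._in_progress: Dict[int, threading.Event] = {}

    def result(self, uid: int) -> bool:
        with self._lock:
            if uid in self._results:
                return self._results[uid]
            event = self._in_progress.get(uid)
            owner = event is None
            if owner:
                # First caller for this UID: it will run the check.
                event = threading.Event()
                self._in_progress[uid] = event
        if owner:
            try:
                outcome = self._check(uid)
                with self._lock:
                    self._results[uid] = outcome
            finally:
                with self._lock:
                    del self._in_progress[uid]
                event.set()  # wake up everyone waiting on this UID
            return outcome
        # Another caller is already checking this UID: wait, then re-read.
        event.wait()
        return self.result(uid)
```

If the check raises, the waiting callers retry it themselves, so a transient failure is not cached.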

@nicolas33
Member

The more I think about this, the more I'm convinced you should use both a proxy and a monitor. The proxy would only hide unchecked UIDs while the monitor (IDLE mode?) would sanitize the emails.

[Figure: sanitize-proxy-monitor]

@Rafiot
Contributor Author

Rafiot commented May 23, 2016

Okay, my idea was to have no database at the proxy's level and just look at the content of each email passing through, but your approach is probably more efficient.

I would still prefer to look at the original email in a quarantine folder, or at least have the possibility to, but that's a detail at this point.

@nicolas33
Member

> Okay, my idea was to have no database at the proxy's level and just look at the content of each email passing through, but your approach is probably more efficient.

But you have to know which emails are already checked to avoid scanning them more than once.
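Such a record of already-checked emails could be as small as one SQLite table. This is a sketch, not imapfw code: `CheckedStore` is a hypothetical name, and keying on the mailbox's UIDVALIDITY as well as the UID is an assumption added here, based on IMAP only guaranteeing that UIDs are stable while UIDVALIDITY is unchanged.

```python
import sqlite3


class CheckedStore:
    """Persistent set of (mailbox, uidvalidity, uid) triples already scanned."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS checked ("
            " mailbox TEXT NOT NULL,"
            " uidvalidity INTEGER NOT NULL,"
            " uid INTEGER NOT NULL,"
            " PRIMARY KEY (mailbox, uidvalidity, uid))"
        )
        self._db.commit()

    def mark_checked(self, mailbox: str, uidvalidity: int, uid: int) -> None:
        # INSERT OR IGNORE makes marking the same email twice harmless.
        with self._db:
            self._db.execute(
                "INSERT OR IGNORE INTO checked VALUES (?, ?, ?)",
                (mailbox, uidvalidity, uid),
            )

    def is_checked(self, mailbox: str, uidvalidity: int, uid: int) -> bool:
        row = self._db.execute(
            "SELECT 1 FROM checked WHERE mailbox = ? AND uidvalidity = ? AND uid = ?",
            (mailbox, uidvalidity, uid),
        ).fetchone()
        return row is not None
```

After a sanitized email is re-uploaded, it is the new UID that would be marked as checked, since the old one no longer exists on the server.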

> I would still prefer to look at the original email in a quarantine folder, or at least have the possibility to, but that's a detail at this point.

This is something I'd suggest at some point. Blindly trusting a sanitizer is crazy. ,-)

@Rafiot
Contributor Author

Rafiot commented May 26, 2016

Sounds great, I now have a beta version of the mail parsing script: https://github.com/CIRCL/PyCIRCLean/blob/mail/bin/mail.py (it needs some refactoring)
I tested it on junk mails (~50k) and it works properly. Now we need to get the proxy together :)

Can you tell me what imapfw can do and can't do right now based on your last graph? This will help me to prepare my roadmap.

@nicolas33
Member

You should look at the code. For IMAP sessions, see https://github.com/OfflineIMAP/imapfw/blob/master/imapfw/imap/imap.py#L108

@Rafiot
Contributor Author

Rafiot commented Jul 7, 2016

A very simple script to process a directory of emails: https://github.com/Rafiot/imapfw/blob/msghook/rascals/dev.messagehook.rascal
