
Improving Meek With Adversarial Techniques (FOCI 19) #13

Open
wkrp opened this issue Sep 30, 2019 · 1 comment
Labels
reading group summaries and discussions of research papers and other publications

wkrp commented Sep 30, 2019

Improving Meek With Adversarial Techniques
Steven R. Sheffey, Ferrol Aderholdt
https://censorbib.nymity.ch/#Sheffey2019a
https://www.usenix.org/conference/foci19/presentation/sheffey

This paper is concerned with meek's susceptibility to classification based on traffic flow analysis, i.e., packet sizes and packet timing. The authors collect their own traffic traces of browsing home pages with and without meek-with-Tor. They identify feature differences and demonstrate three classifiers that can distinguish ordinary HTTPS from meek HTTPS. They then show how minimal perturbation of the meek-derived feature vectors can hinder the classifiers.

To build a corpus of training and test data, they built a parallel data collection framework using Docker containers and a centralized work queue. They browsed 10,000 home pages, both with a headless Firefox and with Tor Browser configured to use its meek-azure bridge. They performed the test from three different networks—residential, university, and datacenter—yielding a total of 60,000 traffic traces. From these, they extract binned features: TCP payload length, and interarrival times tagged with direction (upstream or downstream). Their packet length distribution differs from the one reported in the 2015 domain fronting paper; the authors speculate that this could be due to differences in source data, or to changes in meek that have happened in the meantime.
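The binned feature extraction could be sketched roughly as follows. This is an illustrative stand-in, not the authors' code: the bin edges, tuple format, and function names are all assumptions.

```python
from bisect import bisect_right

# Illustrative bin edges (bytes and seconds); the paper's actual bins may differ.
LENGTH_BINS = [0, 64, 128, 256, 512, 1024, 1460]
TIME_BINS = [0.0, 0.001, 0.01, 0.1, 1.0]

def extract_features(packets):
    """packets: list of (timestamp, payload_len, direction) tuples, where
    direction is 'up' or 'down'. Returns a dict mapping a
    (feature_kind, direction, bin_index) key to a count."""
    counts = {}
    prev_ts = None
    for ts, length, direction in packets:
        # Bin the TCP payload length, tagged with direction.
        lbin = bisect_right(LENGTH_BINS, length) - 1
        key = ('len', direction, lbin)
        counts[key] = counts.get(key, 0) + 1
        # Bin the interarrival time relative to the previous packet.
        if prev_ts is not None:
            tbin = bisect_right(TIME_BINS, ts - prev_ts) - 1
            key = ('iat', direction, tbin)
            counts[key] = counts.get(key, 0) + 1
        prev_ts = ts
    return counts
```

A classifier would then consume these counts (normalized into a fixed-length vector) as its input features.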

They then use a GAN (generative adversarial network), specifically the StarGAN implementation, to iteratively transform a meek feature vector so that it looks more like an ordinary HTTPS feature vector. The transformation process tries to minimize the size of the changes required, by including a perturbation loss term that increases as more changes are made. Minimizing perturbation makes it easier to implement the resulting distribution, while still fooling the classifiers.
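The role of the perturbation loss term can be illustrated with a simplified sketch. This is plain Python standing in for the idea, not the StarGAN implementation the authors used; the L1 distance and the `weight` hyperparameter are assumptions.

```python
def perturbation_loss(original, perturbed):
    """L1 distance between feature vectors; grows as more change is required."""
    return sum(abs(a - b) for a, b in zip(original, perturbed))

def total_loss(adv_loss, original, perturbed, weight=0.1):
    """Combined objective: the generator wants to fool the classifier
    (low adv_loss) while keeping the perturbation of the meek feature
    vector small. `weight` is an illustrative trade-off hyperparameter."""
    return adv_loss + weight * perturbation_loss(original, perturbed)
```

Gradient descent on such a combined objective pushes the generator toward the smallest feature-vector change that still crosses the classifier's decision boundary.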

The data collection framework and analysis scripts are published at
https://github.com/starfys/packet_captor_sakura.

wkrp commented Sep 30, 2019

One of the research doors left open by this work is how to concretely implement the perturbed traffic flow feature distribution. Supposing you have a target feature vector that you want the adversary to observe, what do you do, in the code, to achieve that?

With meek it's not so easy, because of its additional protocol layers and the overhead they add. If your feature vector calls for sending a packet of 400 bytes, you cannot simply send 400 bytes of application-layer payload, because those bytes are going to be prefixed by an HTTP header, and then the whole thing encapsulated in a TLS application data record. You would need to somehow reverse-engineer (perhaps using some simple optimization algorithm) what number of bytes of HTTP payload you need to send to get 400 bytes on the wire.
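One simple optimization along these lines is a binary search over payload sizes against an opaque encapsulation function. The `wire_size` model below is a toy (fixed HTTP header plus TLS record overhead); a real transport would measure its actual encapsulated output instead.

```python
def wire_size(payload_len, header_len=120, tls_overhead=29):
    """Toy model of encapsulation overhead: a fixed-size HTTP header plus
    TLS record framing. Real overhead varies (chunked encoding, record
    splitting, cookies), so these constants are purely illustrative."""
    return payload_len + header_len + tls_overhead

def payload_for_target(target, max_payload=65536):
    """Binary search for the smallest application payload whose
    encapsulated size is at least `target` bytes on the wire."""
    lo, hi = 0, max_payload
    while lo < hi:
        mid = (lo + hi) // 2
        if wire_size(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Binary search suffices here because encapsulated size is monotonic in payload size; if the overhead model were non-monotonic (e.g., chunked encoding thresholds), a more general search would be needed.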

Alternatively, you could collect traffic traces as these authors have done, but do it at the payload layer. (I.e., using some in-browser logging, not tcpdump.) That would give you the sizes and timings of typical HTTP request and response bodies, which, if transferred to the pluggable transport, would give you the right traffic flow signature on the wire, assuming that the HTTP and TLS layers perform similarly. (The last assumption is questionable, because, for example, a normal browser accumulates cookies as it runs, which change the size of HTTP headers; but then again, normal browsers already start with a stocked cookie jar and don't start from a clean configuration. There are a lot of assumptions that could be tested in this kind of work.)
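The replay idea could be sketched as sampling from an empirical distribution of recorded (body size, delay) pairs. Everything here is hypothetical: the trace format and class name are assumptions, and a real transport would need to preserve request/response ordering, not just marginal sizes.

```python
import random

class EmpiricalSampler:
    """Samples (body_size_bytes, delay_seconds) pairs from a payload-layer
    trace recorded in-browser, for the pluggable transport to imitate."""

    def __init__(self, trace, seed=0):
        self.trace = list(trace)          # [(size_bytes, delay_s), ...]
        self.rng = random.Random(seed)    # seeded for reproducibility

    def next_send(self):
        # Draw one recorded (size, delay) pair to schedule the next send.
        return self.rng.choice(self.trace)
```

Sampling pairs jointly (rather than sizes and delays independently) preserves whatever correlation between body size and timing exists in the recorded trace.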
