This paper is concerned with meek's susceptibility to classification based on traffic flow analysis; i.e., packet sizes and packet timing. The authors collect their own traffic traces of browsing home pages with and without meek-with-Tor. They identify feature differences and demonstrate three classifiers that can distinguish ordinary HTTPS from meek HTTPS. They then show how minimal perturbation of the meek-derived feature vectors can hinder the classifiers.
To build a corpus of training and test data, they built a parallel data collection framework using Docker containers and a centralized work queue. They browsed 10,000 home pages both with a headless Firefox and with Tor Browser configured to use its meek-azure bridge. They repeated the collection from three different networks—residential, university, and datacenter—yielding a total of 60,000 traffic traces. From these, they extract binned features: TCP payload lengths and interarrival times, tagged with direction (upstream or downstream). Their packet length distribution differs from the one reported in the 2015 domain fronting paper; the authors speculate that this could be due to differences in source data, or to changes to meek that have happened in the meantime.
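The feature extraction step can be sketched roughly as follows. This is only an illustration of the general idea of direction-tagged, binned size and interarrival features; the bin widths and the trace representation are my own guesses, not the paper's parameters.

```python
from collections import Counter

def binned_features(packets, size_bin=64, time_bin=0.01):
    """Bin a packet trace into histogram features.

    `packets` is a list of (timestamp_seconds, direction, tcp_payload_len)
    tuples, where direction is "up" or "down". Returns two Counters:
    one over (direction, size bin), one over (direction, interarrival bin).
    """
    sizes = Counter()
    gaps = Counter()
    last_ts = None
    for ts, direction, length in packets:
        sizes[(direction, length // size_bin)] += 1
        if last_ts is not None:
            gaps[(direction, int((ts - last_ts) / time_bin))] += 1
        last_ts = ts
    return sizes, gaps

# Toy trace: three packets, alternating direction.
trace = [(0.000, "up", 120), (0.015, "down", 1400), (0.018, "up", 130)]
sizes, gaps = binned_features(trace)
```

A classifier would then be trained on these histograms (normalized into vectors), one from each of the 60,000 traces.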
They then use a GAN (generative adversarial network), specifically the StarGAN implementation, to iteratively transform a meek feature vector so that it looks more like an ordinary HTTPS feature vector. The transformation process tries to minimize the magnitude of the changes, by including a perturbation loss term that grows as more changes are required. Minimizing perturbation is meant to make the resulting distribution easier to implement, while still fooling the classifiers.
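The shape of such a perturbation-penalized objective can be sketched in a few lines. This is a toy scalar version, not the paper's actual loss: the L1 penalty, the weight `lam`, and the function name are all illustrative choices.

```python
import math

def generator_loss(original, perturbed, prob_meek, lam=0.1):
    """Toy perturbation-penalized adversarial objective.

    `prob_meek` is the classifier's probability that `perturbed` is meek
    traffic; the generator wants it near 0. `lam` trades off fooling the
    classifier against the size of the change to the feature vector.
    """
    adv = -math.log(max(1e-9, 1.0 - prob_meek))        # low when classifier is fooled
    pert = sum(abs(a - b) for a, b in zip(original, perturbed))  # L1 size of change
    return adv + lam * pert
```

The generator is pushed toward feature vectors that the classifier labels as ordinary HTTPS, while the `lam * pert` term keeps them close to the original meek vector.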
One of the research doors left open by this work is how to concretely implement the perturbed traffic flow feature distribution. Supposing you have a target feature vector that you want the adversary to observe, what do you do, in code, to achieve that?
With meek it's not so easy, because of its additional protocol layers and the overhead they add. If your feature vector calls for sending a packet of 400 bytes, you cannot simply send 400 bytes of application-layer payload, because those bytes are going to be prefixed by an HTTP header, and then the whole thing encapsulated in a TLS application data record. You would need to somehow reverse-engineer (perhaps using some simple optimization algorithm) what number of bytes of HTTP payload you need to send, to get 400 bytes on the wire.
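If the payload-to-wire-size mapping is available as an oracle (a model or a measurement harness) and is monotonically nondecreasing, the reverse-engineering step could be as simple as a binary search. The overhead model below is purely illustrative; real HTTP header sizes vary and TLS record framing is not a fixed constant.

```python
def payload_for_wire_size(target, wire_size, lo=0, hi=65536):
    """Find the largest application payload length whose on-the-wire size
    (per the `wire_size` oracle) does not exceed `target` bytes.

    Assumes wire_size is monotonically nondecreasing in payload length.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if wire_size(mid) <= target:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Hypothetical overhead model: a fixed 150-byte HTTP header plus
# 29 bytes of TLS record overhead per packet.
model = lambda n: n + 150 + 29
payload_len = payload_for_wire_size(400, model)
```

In practice the mapping is lumpier than this (records can be split or coalesced, headers change size), so the oracle would likely have to come from measurement rather than a closed-form model.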
Alternatively, you could collect traffic traces as these authors have done, but do it at the payload layer. (I.e., using some in-browser logging, not tcpdump.) That would give you the sizes and timings of typical HTTP request and response bodies, which, if transferred to the pluggable transport, would produce the right traffic flow signature on the wire, assuming that the HTTP and TLS layers perform similarly. (The last assumption is questionable, because, for example, a normal browser accumulates cookies as it runs, which change the size of HTTP headers; but then again, normal browsers already start with a stocked cookie jar and don't start from a clean configuration. There are a lot of assumptions that could be tested in this kind of work.)
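Replaying such a payload-layer trace through a transport might look like the following sketch. The `send` hook is hypothetical; a real transport would pad or chunk actual user data to fit each recorded body length, and timing fidelity is limited by the sleep granularity.

```python
import time

def replay_schedule(schedule, send):
    """Replay a recorded payload-layer trace.

    `schedule` is a list of (delay_seconds, body_len) pairs captured by
    in-browser logging; `send` is a transport hook that transmits a body
    of the given length. Here we send zero-filled dummy bodies.
    """
    for delay, body_len in schedule:
        time.sleep(delay)
        send(b"\x00" * body_len)

sent = []
replay_schedule([(0.0, 400), (0.01, 1200)], lambda body: sent.append(len(body)))
```

Whether the wire-level feature distribution that results actually matches the recorded one is exactly the kind of assumption that would need testing.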
Improving Meek With Adversarial Techniques
Steven R. Sheffey, Ferrol Aderholdt
https://censorbib.nymity.ch/#Sheffey2019a
https://www.usenix.org/conference/foci19/presentation/sheffey
The data collection framework and analysis scripts are published at
https://github.com/starfys/packet_captor_sakura.