<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In [None]:
%matplotlib inline
from IPython.display import HTML
from IPython.core.interactiveshell import InteractiveShell

# Notebook config
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# button to toggle code
HTML('''<script>code_show=true; function code_toggle() {if (code_show){$('div.input').hide();} else {$('div.input').show();} code_show = !code_show} $( document ).ready(code_toggle);</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

# Introduction

The objective of this analysis is to study the distribution of network jitter across Tor entry guards. Some guards might have significantly larger jitter than the average. We believe that, for those guards, website fingerprinting may be less effective than for guards with low jitter. The rationale is that, even though the TCP connection to the entry guard is independent of the website, the interaction between the page structure (HTTP request/response pattern) and the jitter may make the fingerprint less reliable than on low-jitter guards.

In [None]:
import re
from os import listdir
from os.path import join, dirname, realpath, isdir, getmtime, splitext
from glob import glob

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')

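# directories
# "__file__" is deliberately a string: notebooks do not define __file__,
# so this resolves to the current working directory.
BASE_DIR = dirname(realpath("__file__"))
RESULTS_DIR = join(BASE_DIR, 'results')
# most recently modified entry in the results directory
LATEST_FILE = max([join(RESULTS_DIR, d) for d in listdir(RESULTS_DIR)], key=getmtime)
# CSV file that shares its base name with the latest entry
LATEST_DATA = '%s.csv' % splitext(LATEST_FILE)[0]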

In [None]:
# URLs
ATLAS = '<a href="https://atlas.torproject.org/#search/{fp}" target="_blank">{fp}</a>'

def fp2url(df):
    """Convert guard fingerprints to links to Tor Atlas."""
    fp_re = re.compile(r"([A-F0-9]{40})", re.MULTILINE | re.UNICODE)
    def repl_fp(match):
        match = match.group()
        return ATLAS.format(fp=match)
    return fp_re.sub(repl_fp, df.to_html())

# Collected data


For our data collection, we make a TCP connection to the guard's OR port and record all the traffic that is generated. In total, we have collected:

In [None]:
# load data
data = pd.read_csv(LATEST_DATA)

num_samples = len(data)
print("- Total num samples: %d" % num_samples)

# number of samples per guard fingerprint
guards = data.groupby(['guard_fp'])['guard_fp'].count()
num_guards = len(guards)
print("- Found data for %d entry guards" % num_guards)

avg_num_samples_per_guard = guards.mean()
print("- An average of %d samples for each entry guard." % int(avg_num_samples_per_guard))

This is what the dataset looks like:

In [None]:
# preview the first rows of the dataframe
data.head()

# Latency

From the traffic traces collected for the TCP connections to the guards, we extract the first SYN+ACK packet (if any) and its corresponding SYN packet. Next, we subtract the SYN timestamp from the SYN+ACK timestamp to obtain a measurement of the latency to a guard.
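The extraction step can be sketched as follows. This is only an illustration under stated assumptions, not the code used to build the dataset: the packet representation (a list of `(timestamp, tcp_flags)` tuples in capture order) and the helper name `handshake_latency` are hypothetical, and the "corresponding" SYN is assumed to be the most recent SYN seen before the first SYN+ACK.

```python
# TCP flag bits as defined in the TCP header.
SYN, ACK = 0x02, 0x10


def handshake_latency(packets):
    """Return (SYN+ACK time - SYN time) in seconds, or None if no handshake is seen.

    `packets` is a hypothetical list of (timestamp_seconds, tcp_flags) tuples
    in capture order; it is not the format used by the original traces.
    """
    syn_ts = None
    for ts, flags in packets:
        is_syn = bool(flags & SYN)
        is_ack = bool(flags & ACK)
        if is_syn and not is_ack:
            # Assumption: a retransmitted SYN replaces the earlier one.
            syn_ts = ts
        elif is_syn and is_ack:
            # First SYN+ACK: stop here, with or without a matching SYN.
            return None if syn_ts is None else ts - syn_ts
    return None


# Made-up trace: SYN, SYN+ACK 42 ms later, then the final ACK.
trace = [(10.000, SYN), (10.042, SYN | ACK), (10.043, ACK)]
latency = handshake_latency(trace)
```

A trace that never receives a SYN+ACK yields `None`, which matches the "(if any)" case described above.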

We extract some basic statistics about the latencies:

In [None]:
data['latency'].describe()

We plot the histogram of the mean latency per guard, on a linear scale (left) and after taking the natural logarithm (right):

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
mean_latencies = data.groupby(['guard_fp']).latency.mean().reset_index(name='mean_latency')
mean_latencies['mean_latency'].plot(kind='hist', bins=100, xlim=(0, 0.25), ax=ax1, figsize=(10, 3), title="Entry mean latency");
mean_latencies['mean_latency'].apply(np.log).plot(kind='hist', bins=20, ax=ax2, figsize=(10, 3), title="Entry mean latency (log scale)");

Top entries by average latency:

In [None]:
top_lat = mean_latencies.sort_values(['mean_latency'], ascending=False).head()
HTML(fp2url(top_lat))

# Jitter

We measure the jitter of a node as the standard deviation of the node's latency, that is, the square root of its variance. We can estimate it because we take several samples for each node.
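To make the per-guard computation concrete, here is a small sketch on toy data. The fingerprints and latency values are made up for illustration. It computes both the sample standard deviation and the sample variance per guard, since the two differ only by a square root; the notebook cell that follows calls `.std()`.

```python
import numpy as np
import pandas as pd

# Toy samples; fingerprints and latencies are invented for this example.
toy = pd.DataFrame({
    'guard_fp': ['GUARD_A', 'GUARD_A', 'GUARD_A', 'GUARD_B', 'GUARD_B'],
    'latency':  [0.10, 0.10, 0.10, 0.10, 0.30],
})

grouped = toy.groupby('guard_fp')['latency']
spread = pd.DataFrame({
    'std': grouped.std(),  # sample standard deviation (ddof=1), in seconds
    'var': grouped.var(),  # sample variance (ddof=1), in seconds squared
})
```

`GUARD_A` has identical samples, so its spread is zero, while `GUARD_B` shows a nonzero spread. With pandas' default `ddof=1`, a guard that has a single sample gets `NaN` for both columns, so such guards would need to be filtered out or handled explicitly.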

In [None]:
# compute jitter
jitters = data.groupby(['guard_fp']).latency.std().reset_index(name='jitter')
entry_stats = pd.merge(mean_latencies, jitters, on='guard_fp')

# plot histograms
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
entry_stats['jitter'].plot(kind='hist', bins=250, xlim=(0, 0.05), ax=ax1, figsize=(10, 3), title="Entry jitter");
entry_stats['jitter'].apply(np.log).plot(kind='hist', bins=20, ax=ax2, figsize=(10, 3), title="Entry jitter (log scale)");

Top entries by jitter:

In [None]:
top_jitter = entry_stats.sort_values(['jitter'], ascending=False).head()
HTML(fp2url(top_jitter))