In [None]:
%%javascript
$.getScript('http://homes.esat.kuleuven.be/~mjuarezm/ipy_toc.js')

<h1 id="tocheading">Table of Contents</h1>
<div id="toc"></div>

In [None]:
# ipy imports
%matplotlib inline
%load_ext rpy2.ipython
from IPython.display import HTML
from IPython.core.interactiveshell import InteractiveShell

# Notebook config
InteractiveShell.ast_node_interactivity = "all"

In [None]:
%%R
# R dependencies
suppressPackageStartupMessages({
    library("ggplot2")
    library("gridExtra")
})

In [None]:
# button to toggle code
HTML('''<form action="javascript:code_toggle()">
            <input type="submit" value="Toggle cells">
        </form>
     ''')

# Introduction

The objective of this analysis is to study the distribution of network jitter in Tor nodes. Some nodes might have significantly larger jitter than the average. We believe that in those cases, website fingerprinting may be less effective than on nodes with low jitter. The rationale is that even if the Tor path is independent of the website, the interaction between the page structure (HTTP request/response pattern) and the jitter may make the fingerprint less reliable than on low-jitter paths.

In [None]:
import re
from os import listdir
from os.path import join, dirname, realpath, isdir, getmtime, splitext
from glob import glob

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')

from ggplot import *

# directories
BASE_DIR = dirname(realpath("__file__"))
RESULTS_DIR = join(BASE_DIR, 'results')
LATEST_FILE = max([join(RESULTS_DIR, d)
                   for d in listdir(RESULTS_DIR)], key=getmtime)
LATEST_DATA = '%s.csv' % splitext(LATEST_FILE)[0]

In [None]:
# URLs
ATLAS = '<a href="https://atlas.torproject.org/#details/{fp}" target="_blank">{fp}</a>'

def fp2url(df):
    """Convert node fingerprints to links to Tor Atlas."""
    fp_re = re.compile(r"([A-F0-9]{40})", re.MULTILINE | re.UNICODE)
    def repl_fp(match):
        match = match.group()
        return ATLAS.format(fp=match)
    return fp_re.sub(repl_fp, df.to_html())

# Data collection


To collect a latency sample, we make a TCP connection to the node's OR port and record the SYN and SYN+ACK packets. We collected the latency dataset in batches: we iterate over the whole list of nodes multiple times and take several samples of each node in each iteration. This gives a more reliable estimate of the latency at a fixed time, as well as estimates of the latency at different times.
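The batch scheme described above can be sketched as three nested loops. This is only an illustration, not the collection script used for this dataset: `collect_batches`, the `measure` callback and the default counts are hypothetical, and only the column names `batch_id`, `fp`, `sample_id` and `latency` are taken from the dataset used in this notebook.

```python
def collect_batches(nodes, measure, num_batches=3, samples_per_node=5):
    """Collect latency samples in batches.

    nodes   -- iterable of node fingerprints
    measure -- hypothetical callable: fingerprint -> latency in seconds,
               or None if no SYN+ACK was observed
    """
    nodes = list(nodes)
    records = []
    for batch_id in range(num_batches):             # one pass over all nodes
        for fp in nodes:
            for sample_id in range(samples_per_node):  # several samples per node
                latency = measure(fp)
                if latency is None:
                    continue                        # assumption: failed probes are dropped
                records.append({'batch_id': batch_id, 'fp': fp,
                                'sample_id': sample_id, 'latency': latency})
    return records
```

Samples taken close together within one pass share a `batch_id`, so their spread can be studied separately from the variation between passes.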

In [None]:
# load data
data = pd.read_csv(LATEST_DATA)

In total, we have collected:

In [None]:
# get stats for each sample batch
batches = data.groupby(['batch_id', 'fp'])

# get median latency per batch
batch_med_lat = batches['latency'].median().reset_index()

# get standard deviation of latency per batch
batch_std_lat = batches['latency'].std().reset_index()

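# get median latency per node (median over batches of the per-batch medians)
node_med_lat = batch_med_lat.groupby('fp')['latency'].median().reset_index()

# get jitter per node: median over batches of the within-batch standard deviation
# alternative (unused): standard deviation of the per-batch medians across batches
#jitter = batch_med_lat.groupby('fp').std().reset_index()
node_med_jitter = batch_std_lat.groupby('fp')['latency'].median().reset_index()
node_med_jitter.rename(columns={'latency': 'jitter'}, inplace=True)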

In [None]:
print("- Total num samples: %d" % len(data))
print("- Found data for %d nodes" % len(node_med_lat))
print("- An average of %d samples per node in each batch." % int(batches.count()['sample_id'].mean()))
print("\n")

This is what the *raw* dataset looks like:

In [None]:
# show head of dataset
data.head()

This is what the *per-node* dataset looks like:

In [None]:
# put into one single dataset
header = ['fp', 'flags']
nodestat = data[header].drop_duplicates()

# merge stats
nodestat = node_med_lat.merge(nodestat, on='fp')
nodestat = node_med_jitter.merge(nodestat, on='fp')

# check if a node is exit or guard
nodestat['is_guard'] = nodestat.flags.str.contains('Guard')
nodestat['is_exit'] = nodestat.flags.str.contains('Exit')

# show head of node stats
nodestat.head()

In [None]:
# push the dataset to R
%Rpush nodestat

# Latency

From the traffic traces collected for the TCP connections to the nodes, we extract the first SYN+ACK packet (if any) and its corresponding SYN packet. Next, we subtract the SYN timestamp from the SYN+ACK timestamp to obtain a measurement of the latency to a node.
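The extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser used to build the dataset: the function name `syn_synack_latency` and the `(timestamp, flags, seq, ack)` tuple layout are hypothetical, and we assume that a retransmitted SYN is timed from its first transmission. The SYN+ACK is matched to its SYN through the TCP rule that it acknowledges the SYN's sequence number plus one.

```python
SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit

def syn_synack_latency(packets):
    """Return SYN+ACK time minus SYN time for the first SYN+ACK, or None.

    packets -- iterable of (timestamp, tcp_flags, seq, ack) tuples in
               capture order (hypothetical layout)
    """
    syn_times = {}  # SYN sequence number -> timestamp of its first transmission
    for ts, flags, seq, ack in packets:
        is_syn = bool(flags & SYN)
        is_ack = bool(flags & ACK)
        if is_syn and not is_ack:
            # assumption: keep the first transmission if the SYN is retransmitted
            syn_times.setdefault(seq, ts)
        elif is_syn and is_ack:
            # a SYN+ACK acknowledges the SYN's sequence number + 1 (mod 2**32)
            syn_ts = syn_times.get((ack - 1) % 2**32)
            if syn_ts is not None:
                return ts - syn_ts
    return None  # no SYN+ACK matched a recorded SYN
```

Returning `None` when no SYN+ACK is seen corresponds to the "(if any)" case above: such a connection attempt yields no latency sample.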

Some basic statistics about the per-node median latencies:

In [None]:
node_med_lat.describe()

Nodes with the highest median latency:

In [None]:
top_lat = node_med_lat.sort_values(['latency'], ascending=False).head()
HTML(fp2url(top_lat[['fp', 'latency']]))

In [None]:
%%R

# boxplots of latency: guard vs. rest and exit vs. rest
guard <- ggplot(data=nodestat, aes(y=latency, x=is_guard)) + geom_boxplot() +  scale_y_log10()
exit <- ggplot(data=nodestat, aes(y=latency, x=is_exit)) + geom_boxplot() +  scale_y_log10()

grid.arrange(guard, exit, ncol=2, nrow =1)

# Jitter

We measure the jitter of a node as the standard deviation of the node's latency samples within a batch. The per-node value reported below is the median of these per-batch values.
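For reference, the computation done by the `groupby` cells above (`std` within each batch, then `median` across batches) can be restated in plain Python for a single node. This is a sketch, assuming Python 3's `statistics` module. The helper name `node_jitter` is hypothetical, and batches with fewer than two samples are skipped because the sample standard deviation is undefined for them.

```python
from statistics import median, stdev

def node_jitter(samples):
    """Median over batches of the within-batch sample standard deviation.

    samples -- iterable of (batch_id, latency) pairs for a single node
    """
    by_batch = {}
    for batch_id, latency in samples:
        by_batch.setdefault(batch_id, []).append(latency)
    # sample standard deviation (n - 1 denominator) needs at least two points
    batch_stds = [stdev(vals) for vals in by_batch.values() if len(vals) >= 2]
    return median(batch_stds) if batch_stds else None
```

Taking the median across batches keeps one unusually noisy batch from dominating a node's jitter value.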

Some statistics for jitter values:

In [None]:
node_med_jitter.describe()

Nodes with the highest jitter:

In [None]:
top_jitter = node_med_jitter.sort_values(['jitter'], ascending=False).head()
HTML(fp2url(top_jitter[['fp', 'jitter']]))

In [None]:
%%R

# boxplots of jitter: guard vs. rest and exit vs. rest
guard <- ggplot(data=nodestat, aes(y=jitter, x=is_guard)) + geom_boxplot() +  scale_y_log10()
exit <- ggplot(data=nodestat, aes(y=jitter, x=is_exit)) + geom_boxplot() +  scale_y_log10()

grid.arrange(guard, exit, ncol=2, nrow =1)