# Table of Contents
 <p><div class="lev1"><a href="#Introduction"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></div><div class="lev2"><a href="#Collected-data"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Collected data</a></div><div class="lev2"><a href="#Latency"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Latency</a></div><div class="lev2"><a href="#Jitter"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Jitter</a></div>

# Introduction

The objective of this analysis is to study the distribution of network jitter across Tor entry guards. Some guards may have significantly larger jitter than average. We hypothesize that website fingerprinting is less effective on those guards than on guards with low jitter. The rationale is that, even though the TCP connection to the entry guard is independent of the website, the interaction between the page structure (the HTTP request/response pattern) and the jitter may make the fingerprint less reliable than on low-jitter guards.

In [1]:
%matplotlib inline
from os import listdir
from os.path import join, dirname, realpath, isdir, getmtime
from glob import glob

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use('ggplot')

from IPython.display import HTML
from IPython.core.interactiveshell import InteractiveShell

# Notebook config
InteractiveShell.ast_node_interactivity = "all"

In [2]:
# button to toggle code

HTML('''<script>
code_show=true; 
function code_toggle() {
 document.title = 'Variance analysis';
 code_show = !code_show
 var divs = document.getElementsByClassName('input');
 if (code_show){
   for (var i in divs) {
     if (typeof divs[i] != 'undefined') {
       divs[i].style.display = 'block';
     } 
   }
 } else {
   for (var i in divs) {
     if (typeof divs[i] != 'undefined') {
       divs[i].style.display = 'none';
     } 
   }
 }
} 

document.addEventListener("DOMContentLoaded", function(event) { 
  code_toggle()
});
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [3]:
# directories
BASE_DIR = dirname(realpath("__file__"))
RESULTS_DIR = join(BASE_DIR, 'results')
LATEST_DIR = max([join(RESULTS_DIR, d) for d in listdir(RESULTS_DIR)
                  if isdir(join(RESULTS_DIR, d))], key=getmtime)
print(LATEST_DIR)

# globals
CSV_NAMES = ["ip_proto", "epoch",
             "ip_src", "ip_dst", "port_src", "port_dst",
             "ip_len", "ip_hdr_len", "tcp_hdr_len", "data_len",
             "tcp_flags", "tcp_seq", "tcp_ack", "next_seq", "acks_frame",
             "tcp_window_size_value", "ws_message",
             "tcp_options_tsval", "tcp_options_tsecr"]

/home/mjuarezm/git/entrystats/results/170201_112253


In [5]:
# load data
dfs = []
for fname in listdir(LATEST_DIR):
    entry, sample = fname.split('_')
    df = pd.read_csv(join(LATEST_DIR, fname), names=CSV_NAMES)
    df['tcp_flags'] = df['tcp_flags'].apply(lambda s: int(s, 0))
    df['entry'] = entry
    df['sample'] = sample
    dfs.append(df)
data = pd.concat(dfs)

# number of distinct samples collected for each entry guard
entries = data.groupby('entry')['sample'].nunique()
num_entries = len(entries)
avg_samples = entries.mean()

## Collected data


For our data collection, we make a TCP connection to the guard's OR port and record all the traffic that is generated. In total, we have collected:

In [None]:
print("- Data for {} entry guards".format(num_entries))
print("- An average of {} samples for each entry guard.".format(int(avg_samples)))

This is what the dataset looks like:

In [None]:
# show the last rows of the dataframe loaded above,
# with 'entry' and 'sample' as the leading columns
lead_cols = ['entry', 'sample']
data[lead_cols + [c for c in data.columns if c not in lead_cols]].tail()

## Latency

From the traffic traces collected for the TCP connections to the guards, we extract the first SYN+ACK packet (if any) and its corresponding SYN packet. Next, we subtract the SYN timestamp from the SYN+ACK timestamp to obtain a measurement of the latency to a guard.
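The matching step can be illustrated on a toy trace. The sketch below is a minimal example, not the notebook's actual data: the three-packet handshake, its timestamps and sequence numbers, and the hexadecimal flag strings are all made up for illustration, and the `data_len` check used in the real code is left out. It relies on the SYN flag bit being `0x02` and the ACK bit `0x10`, so a SYN+ACK packet has flags `0x12` (18), and on the SYN+ACK acknowledging the SYN's sequence number plus one.

```python
import pandas as pd

SYN = 0x02            # TCP SYN flag bit
ACK = 0x10            # TCP ACK flag bit
SYN_ACK = SYN | ACK   # 0x12 == 18

# Hypothetical three-packet handshake: SYN out, SYN+ACK back, final ACK out.
toy = pd.DataFrame({
    'epoch':     [100.000, 100.085, 100.086],
    'tcp_flags': ['0x0002', '0x0012', '0x0010'],  # assumed hex-string format
    'tcp_seq':   [1000, 5000, 1001],
    'tcp_ack':   [0, 1001, 5001],
})

# int(s, 0) infers the base from the prefix, so '0x0012' becomes 18.
toy['tcp_flags'] = toy['tcp_flags'].apply(lambda s: int(s, 0))

# First SYN+ACK, then the SYN whose sequence number it acknowledges (ack - 1).
syn_ack = toy[toy['tcp_flags'] == SYN_ACK].iloc[0]
syn = toy[(toy['tcp_flags'] == SYN) &
          (toy['tcp_seq'] == syn_ack['tcp_ack'] - 1)].iloc[0]

toy_latency = syn_ack['epoch'] - syn['epoch']
print(round(toy_latency, 3))
```

For this toy trace the difference is about 85 ms, the gap between the two made-up timestamps.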

This is the dataset of latencies:

In [None]:
def latency(d):
    """Return difference between first syn+ack and its corresponding syn."""
    syn_acks = d[(np.isnan(d['data_len'])) & (d['tcp_flags'] == 18)]
    if len(syn_acks) == 0:
        return np.nan
    syn_ack = syn_acks.head(1).iloc[0]
    syn_seq = syn_ack['tcp_ack'] - 1
    syns = d[(d['tcp_flags'] == 2) & (d['tcp_seq'] == syn_seq)]
    if len(syns) == 0:
        return np.nan
    syn = syns.head(1).iloc[0]
    return syn_ack['epoch'] - syn['epoch']
    
def latencies(data):
    """Compute latencies for all samples in the dataframe."""
    return data.groupby(['entry', 'sample']).apply(latency).reset_index(name='latency')


def jitter(latencies):
    """Compute jitter from latency data."""
    return latencies.groupby(['entry'])['latency'].std()

In [None]:
lats = latencies(data).dropna()
lats.head()

We extract some basic statistics about the latencies:

In [None]:
lats.describe()

We plot the histogram of latencies:

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
lats['latency'].plot(kind='hist', bins=20, ax=ax1, figsize=(10, 3), title="Entry latency")
lats['latency'].apply(np.log).plot(kind='hist', bins=20, ax=ax2, figsize=(10, 3), title="Entry latency (log scale)")

## Jitter

We measure the jitter of a node as the standard deviation of the node's latency (the square root of its variance). We can estimate it because we take several samples for each node.

In [None]:
jitter(lats)
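
To make the per-guard computation concrete, here is a small sketch on made-up numbers. The entries `A` and `B` and their latencies are hypothetical, not taken from the collected data. It mirrors the `groupby(...)['latency'].std()` call in `jitter` and shows its relation to the variance. Note that pandas computes the sample standard deviation (`ddof=1`) by default, so a guard with a single sample yields `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical latencies (seconds): guard A is stable, guard B is not.
toy_lats = pd.DataFrame({
    'entry':   ['A', 'A', 'A', 'B', 'B', 'B'],
    'sample':  [0, 1, 2, 0, 1, 2],
    'latency': [0.050, 0.052, 0.048, 0.100, 0.200, 0.300],
})

grouped = toy_lats.groupby('entry')['latency']
toy_jitter = grouped.std()   # sample standard deviation (ddof=1), as in jitter()
toy_var = grouped.var()      # sample variance (ddof=1)

# The variance is the square of the standard deviation.
print(np.allclose(toy_var, toy_jitter ** 2))
print(toy_jitter)
```

With these numbers, guard `A` has a jitter of about 2 ms and guard `B` about 100 ms, even though both have three samples.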