## MONROE Data Preparation for Latency Privacy Study

This notebook prepares data dumped from MONROE ping measurements to look at the correlation between distance and RTT.

Unlike the Atlas case, we have only a single "anchor", and our probes might move. We therefore take the minimum RTT and the mean location per time bin (the median would be an alternative), and treat each time bin of each node as a separate probe. As a result, this notebook does more pre-analysis than the Atlas dataprep notebook.

The MONROE data used here is not currently publicly available; please contact the authors for this data set.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import dateutil.parser as dp
import requests
import json
import csv
import os.path

from collections import namedtuple

In [2]:
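def csv2nt(filename):
    """Yield each CSV row as a namedtuple whose fields come from the header row."""
    # newline="" is what the csv module recommends; undecodable bytes are replaced
    with open(filename, newline="", errors="replace") as file:
        reader = csv.reader(file)
        # The first row supplies the field names, so they must be valid identifiers
        Data = namedtuple("Data", next(reader))
        for row in map(Data._make, reader):
            yield row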

In [3]:
ping_df = pd.DataFrame(csv2nt("monroe_data/2017-09-01_1504224000_monroe_exp_ping.csv")).dropna().loc[:,['nodeid','timestamp','operator','rtt']]
ping_df["nodeid"] = ping_df["nodeid"].astype("int")
ping_df["timestamp"] = pd.to_datetime(ping_df["timestamp"].astype("float"), unit="s")
ping_df["rtt"] = pd.to_numeric(ping_df["rtt"], errors="coerce")
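
The conversions in the cell above can be checked on a couple of made-up values. This is a minimal sketch, assuming the CSV fields arrive as strings (as `csv.reader` returns them); the sample values and the names `raw`, `ts`, and `rtt` are hypothetical. Note that in the cell above `dropna()` runs before the conversion, so values coerced to NaN by `errors="coerce"` are still present afterwards.

```python
import pandas as pd

# Hypothetical raw values, held as strings the way csv.reader returns them.
raw = pd.DataFrame({
    "timestamp": ["1504224000.0", "1504224300.5"],
    "rtt": ["23.4", "timeout"],
})

# Epoch seconds become naive datetime64 values (UTC wall-clock time).
ts = pd.to_datetime(raw["timestamp"].astype("float"), unit="s")

# Strings that cannot be parsed as numbers become NaN instead of raising.
rtt = pd.to_numeric(raw["rtt"], errors="coerce")

print(ts.iloc[0])
print(rtt.isna().tolist())
```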

In [4]:
operator_by_nodeid = ping_df.groupby('nodeid').first()['operator']

In [5]:
gps_df = pd.DataFrame(csv2nt("monroe_data/2017-09-01_1504224000_monroe_meta_device_gps.csv")).dropna().loc[:,['nodeid','timestamp','latitude','longitude']]
gps_df["nodeid"] = gps_df["nodeid"].astype("int")
gps_df["timestamp"] = pd.to_datetime(gps_df["timestamp"].astype("float"), unit="s")
gps_df['latitude'] = pd.to_numeric(gps_df['latitude'], errors="coerce")
gps_df['longitude'] = pd.to_numeric(gps_df['longitude'], errors="coerce")

In [6]:
with pd.HDFStore('monroe.hdf5') as store:
    store['ping_df'] = ping_df
    store['gps_df'] = gps_df
    store['operator_by_nodeid'] = operator_by_nodeid

## Preaggregation

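Take the minimum RTT, the RTT standard deviation, and the mean latitude/longitude per node in five-minute bins, and join them into a single dataframe. Bins containing only one ping sample have an undefined standard deviation and are removed by the final `dropna()`.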

In [16]:
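# Group keys must be a list; recent pandas versions treat a tuple as one single key
time_ping_df = ping_df.groupby(['nodeid', pd.Grouper(key="timestamp", freq="5min")])[['rtt']].min()
# Select rtt before aggregating so that std() never sees the string-valued operator column
time_pingstd_df = ping_df.groupby(['nodeid', pd.Grouper(key="timestamp", freq="5min")])[['rtt']].std()
time_pingstd_df.columns = ['rttstd']
time_gps_df = gps_df.groupby(['nodeid', pd.Grouper(key="timestamp", freq="5min")])[['latitude', 'longitude']].mean()
# Keep only bins that have both a position and an RTT; bins with a NaN std (a single ping) are dropped
monroe_rtt_df = time_gps_df.join(time_ping_df, how='inner').join(time_pingstd_df).dropna()
monroe_rtt_df.columns = ['plat', 'plon', 'minrtt', 'rttstd']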
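
To make the binning concrete, here is a minimal sketch of the same idea on synthetic data: one hypothetical node (id 1) with four pings and four GPS fixes spread over two five-minute bins. All values and the names `ping`, `gps`, `bin_keys`, and `binned` are made up for illustration and are not part of the MONROE data.

```python
import pandas as pd

stamps = pd.to_datetime([
    "2017-09-01 00:00:10", "2017-09-01 00:04:50",  # first 5-minute bin
    "2017-09-01 00:05:10", "2017-09-01 00:09:50",  # second 5-minute bin
])

# Synthetic ping and GPS samples for one hypothetical node.
ping = pd.DataFrame({"nodeid": [1, 1, 1, 1], "timestamp": stamps,
                     "rtt": [40.0, 30.0, 55.0, 50.0]})
gps = pd.DataFrame({"nodeid": [1, 1, 1, 1], "timestamp": stamps,
                    "latitude": [59.0, 59.2, 59.4, 59.6],
                    "longitude": [18.0, 18.0, 18.2, 18.2]})

def bin_keys():
    # Build a fresh Grouper for every groupby call.
    return ["nodeid", pd.Grouper(key="timestamp", freq="5min")]

# Minimum RTT and mean position per (node, bin); each row acts as one "probe".
min_rtt = ping.groupby(bin_keys())["rtt"].min().rename("minrtt")
mean_pos = gps.groupby(bin_keys())[["latitude", "longitude"]].mean()
binned = mean_pos.join(min_rtt, how="inner")

print(binned)
```

Each row of `binned` is indexed by node and bin start time, which is what allows a moving node to contribute several distinct positions.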

In [17]:
with pd.HDFStore('monroe.hdf5') as store:
    store['monroe_rtt_df'] = monroe_rtt_df
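
Since the goal is to relate RTT to distance, a later step will need the distance between each binned position and the anchor. The following is a sketch only, assuming a spherical Earth and the haversine formula; the anchor location is not given in this notebook, so `ANCHOR_LAT` and `ANCHOR_LON` are placeholders, and `haversine_km` is a hypothetical helper rather than part of the original analysis.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Placeholder anchor coordinates (assumption; replace with the real anchor).
ANCHOR_LAT, ANCHOR_LON = 0.0, 0.0

# Two test points on the equator: the anchor itself and a point 90 degrees east.
d = haversine_km(np.array([0.0, 0.0]), np.array([0.0, 90.0]),
                 ANCHOR_LAT, ANCHOR_LON)
print(d)
```

Applied to the stored frame, this could look like `haversine_km(monroe_rtt_df["plat"], monroe_rtt_df["plon"], ANCHOR_LAT, ANCHOR_LON)`, since the function works element-wise on pandas Series as well as NumPy arrays.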