# Data from the Copenhagen Networks Study

The data here comes from the Copenhagen Networks Study. 

- [Data download](https://figshare.com/articles/dataset/The_Copenhagen_Networks_Study_interaction_data/7267433/1)
- [Data overview](https://www.nature.com/articles/s41597-019-0325-x.pdf)

In [None]:
import pandas as pd
import networkx as nx
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from dcss.plotting import custom_seaborn
custom_seaborn()
from collections import Counter

In [None]:
from dcss.paths import copenhagen_networks_path
bluetooth_contact = pd.read_csv(copenhagen_networks_path / "bt_symmetric.csv", sep=',')
phone_calls = pd.read_csv(copenhagen_networks_path / "calls.csv", sep=',') 
sms_messages = pd.read_csv(copenhagen_networks_path / "sms.csv", sep=',') 
facebook_friendship = pd.read_csv(copenhagen_networks_path / "fb_friends.csv", sep=',') 

## Bluetooth Contact Networks

The Bluetooth contact network data contains four variables (column names as they appear in the code below):

- `# timestamp`: when the scan was recorded
- `user_a`: user A
- `user_b`: user B
- `rssi`: received signal strength (RSSI)

Key information about data collection:

- Each participant's phone scanned every five minutes for Bluetooth devices within roughly 10 m (30 ft).
- Each participant's phone was always discoverable on Bluetooth.
- RSSI is a proxy for physical distance: higher values indicate that the phones were close together, and lower values indicate that they were farther apart.
- "Empty scans are marked with user B = -1 and RSSI = 0"
- "Scans of devices outside of the experiment are marked with user B = -2. All non-experiment devices are given the same ID."
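
The sentinel values and the RSSI convention described above can be handled with a couple of boolean masks. The sketch below uses a small made-up frame: the rows and the `-80` cutoff are illustrative assumptions, not values from the study. The column names match those used in the cells that follow.

```python
import pandas as pd

# Toy rows that mimic the layout of the Bluetooth data; all values are invented.
toy_scans = pd.DataFrame({
    "# timestamp": [0, 0, 300, 300, 600],
    "user_a": [1, 2, 1, 3, 2],
    "user_b": [-1, 5, -2, 4, 7],
    "rssi": [0, -55, -70, -90, -62],
})

# user_b == -1 marks an empty scan; user_b == -2 marks a device outside the study.
between_participants = toy_scans[toy_scans["user_b"] >= 0]

# RSSI values here are negative; values closer to 0 suggest closer proximity.
# The -80 threshold is an arbitrary choice for illustration.
close_contacts = between_participants[between_participants["rssi"] > -80]

print(close_contacts)
```

Filtering on `user_b >= 0` removes both kinds of sentinel rows in one step, which is the same screen `get_timeframe()` applies later.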

In [None]:
bluetooth_contact.head()

In [None]:
bluetooth_contact.sample(5)

The timestamps are a bit odd. Disregard all rows with a timestamp of 0. After that, the values jump to 300, and the increment between consecutive timestamps seems to change at some points. This may be a side effect of privacy measures.
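
One way to check the spacing is to difference the sorted unique timestamps and tabulate the gaps. This is a minimal sketch on invented values: `toy_timestamps` is hypothetical, and in this notebook the input would be `bluetooth_contact['# timestamp']`.

```python
import pandas as pd

# Invented timestamps with one irregular gap (600 -> 1200).
toy_timestamps = pd.Series([0, 0, 300, 300, 600, 1200, 1500, 1500])

# Unique values in ascending order, then the difference between neighbours.
unique_sorted = pd.Series(sorted(toy_timestamps.unique()))
gaps = unique_sorted.diff().dropna()

# How often each gap size occurs.
gap_counts = gaps.value_counts()
print(gap_counts)
```

If almost every gap is 300, the data is regular five-minute scans and the remaining gap sizes point to where timestamps are missing.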

In [None]:
def get_sequential_timestamps(dataframe, timestamp, n_timestamps, timestamp_var='# timestamp'):
    """
    Helper function for `get_timeframe()`
    
    Takes a dataframe and gets the unique timestamps in the order they
    appear in the data. Starting at the specified timestamp, collects
    `n_timestamps` consecutive unique timestamps (including the starting
    one) and returns them as a list that `get_timeframe()` uses to
    subset the dataframe.
    """
    ordered_timestamps = list(dataframe[timestamp_var].unique())
    start = ordered_timestamps.index(timestamp)
    end = start + n_timestamps
    selected = ordered_timestamps[start:end]
    return selected

def get_timeframe(dataframe, timestamp, n_timestamps=1, timestamp_var='# timestamp'):
    """
    Takes the bluetooth dataframe as the first argument and a specific timestamp.
    If that's all that's provided, returns all the edges for that timestamp,
    with empty scans and scans of non-study devices (negative `user_b`) screened out.
    
    If `n_timestamps` is provided, returns the data for that many sequential
    unique timestamps, starting at `timestamp`.
    """
    timeframe = get_sequential_timestamps(dataframe, timestamp, n_timestamps, timestamp_var=timestamp_var)
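    # use timestamp_var rather than a hardcoded column name
    in_window = dataframe[timestamp_var].isin(timeframe)
    # screen out empty scans (-1) and non-study devices (-2)
    df = dataframe[in_window & (dataframe['user_b'] >= 0)]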
    return df

In [None]:
bluetooth_contact['# timestamp'].sample(10)

In [None]:
max(bluetooth_contact['rssi'])

In [None]:
# Start at timestamp 1217400 and take 4032 sequential unique timestamps.
# At one scan every five minutes, 2016 steps is roughly a week, so 4032 is roughly two weeks.
filtered = get_timeframe(bluetooth_contact, 1217400, 4032)

# Earlier options: start at 300 with 288 steps (the first pass, which worked well),
# or start at 1429500 with 144 steps. 1429500 was picked arbitrarily, simply because
# it is not the exact start of the study (at least as released).
filtered.head()
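
As a quick check on the window sizes mentioned in the comments, the arithmetic below assumes one scan every five minutes (300 seconds) with no missing scans. The constant name is made up for this sketch.

```python
# Assumption: one scan every 300 seconds, with no gaps in the record.
SCAN_INTERVAL_SECONDS = 300

scans_per_day = 24 * 60 * 60 // SCAN_INTERVAL_SECONDS
scans_per_week = 7 * scans_per_day
days_covered_by_4032 = 4032 / scans_per_day

print(scans_per_day, scans_per_week, days_covered_by_4032)
```

Because `get_sequential_timestamps()` counts unique timestamps that are present in the data, the window covers more calendar time than this whenever scans are missing.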

In [None]:
filtered = filtered[filtered['rssi'] > -60]

In [None]:
filtered.to_csv('cns_bluetooth_filtered.csv', index=False)

In [None]:
g_bluetooth_contact = nx.from_pandas_edgelist(filtered, 'user_a', 'user_b', create_using=nx.Graph())
g_bluetooth_contact.name = 'CNS Bluetooth Contact'
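# nx.info() was removed in NetworkX 3.0, so print the summary directly
print(f'{g_bluetooth_contact.name}: {g_bluetooth_contact.number_of_nodes()} nodes, '
      f'{g_bluetooth_contact.number_of_edges()} edges')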

In [None]:
layout = nx.nx_pydot.graphviz_layout(g_bluetooth_contact)

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
nx.draw(g_bluetooth_contact,
        pos=layout,
        node_color='gray',
        edge_color='lightgray',
        node_size=10,
        width=.5)
plt.show()

## Phone Calls

In [None]:
phone_calls.head()

## SMS Messages

In [None]:
sms_messages.head()

## Communication Networks

- Phone calls and SMS messages combined

In [None]:
calls = phone_calls[['caller', 'callee', 'timestamp']]
messages = sms_messages[['sender', 'recipient', 'timestamp']]

In [None]:
messages.columns = ['i', 'j', 'timestamp']
calls.columns = ['i', 'j', 'timestamp']

In [None]:
communication = pd.concat([messages,calls])

In [None]:
communication.head()

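The code below sorts each `(i, j)` pair, groups identical pairs, and counts them to get an edge weight. Because each pair is sorted first, the order of `i` and `j` doesn't matter: a call from A to B and a message from B to A both count toward the same undirected edge, so the weights are correct for an undirected network.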
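
A toy illustration of the idea, using invented pairs (`toy_pairs` is hypothetical and not drawn from the study data):

```python
from collections import Counter

# Four directed interactions; three of them involve the same two people.
toy_pairs = [(1, 2), (2, 1), (1, 2), (3, 1)]

# Sorting each pair makes (2, 1) and (1, 2) identical before counting.
toy_weights = Counter(tuple(sorted(pair)) for pair in toy_pairs)

print(toy_weights)
```

The trade-off is that direction is discarded: the counts no longer say who initiated each call or message.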

In [None]:
edges = list(communication[['i', 'j']].to_records(index=False))
weighted_edges = Counter(tuple(sorted(tup)) for tup in edges)
len(weighted_edges)

In [None]:
# most_common() with no argument returns every (pair, count) entry, most frequent first
counter_as_tuples = weighted_edges.most_common()

In [None]:
G = nx.Graph()

In [None]:
for_networkx = [(we[0][0], we[0][1], we[1]) for we in counter_as_tuples]

In [None]:
with open('cns_weighted_communication_network.csv', 'w') as f:
    f.write('i,j,weight\n')
    for edge in for_networkx:
        f.write(f'{edge[0]},{edge[1]},{edge[2]}\n')

In [None]:
G.add_weighted_edges_from(for_networkx)

In [None]:
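# nx.info() was removed in NetworkX 3.0, so print the summary directly
print(f'{G.number_of_nodes()} nodes, {G.number_of_edges()} edges')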

In [None]:
layout = nx.nx_pydot.graphviz_layout(G)

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))
nx.draw(G,
        pos=layout,
        node_color='gray',
        edge_color='lightgray',
        node_size=10,
        width=.5)
plt.show()

In [None]:
[e for e in G.edges(data=True)]

## Facebook Friends

In [None]:
facebook_friendship.head()