# 1. Data Preprocessing

In this notebook we preprocess the graph data from the GEXF files together with the centrality metrics from the CSV files, so that both can be used as input to our Graph Neural Networks (GNNs) for node classification. We use the same column and feature configuration as the t.ex graph classifier. The preprocessing consists of the following steps:
1. Loading the graph data with the centrality metrics
2. Labeling the data
3. Removing attributes according to the configuration
4. Feature encoding
5. Transforming the data into PyTorch Geometric (PyG) objects

### Dataset Configuration

In [1]:
data_name = 'chrome-run-01'

### Set Up


First, we run ```%load_ext autoreload``` and ```%autoreload 2``` so that our modules are reloaded automatically whenever they change. We also silence ```FutureWarning``` messages:

In [2]:
%load_ext autoreload
%autoreload 2

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Now we import the necessary libraries:

In [3]:
import networkx as nx
import os

In [4]:
import sys
# Build the paths to our own module directories
references_path = os.path.join('..', 'references')
visualization_path = os.path.join('..', 'src', 'visualization')
data_path = os.path.join('..', 'src', 'data')
features_path = os.path.join('..', 'src', 'features')
# Append paths to sys.path
sys.path.append(references_path)
sys.path.append(visualization_path)
sys.path.append(data_path)
sys.path.append(features_path)
# Import own modules
import visualize
import preprocess
import build_features

### Step 1: Loading the graph data with the centrality metrics

We read the data from the ```raw``` data folder: the graph comes from the GEXF file, and the centrality metrics from the CSV file are added to it as node attributes:

In [5]:
data_path = os.path.join('..', 'data', 'raw')

# Load the graph
G = preprocess.load_graph(data_name, data_path)

# Show the graph info
print('\nGraph info: ')
visualize.graph_info(G)
print()
visualize.node_info(G, 5)


Loading graph from ..\data\raw\chrome-run-01.gexf...
Successfully loaded graph from ..\data\raw\chrome-run-01.gexf
Loading centrality metrics from ..\data\raw\chrome-run-01.csv...
Successfully loaded centrality metrics from ..\data\raw\chrome-run-01.csv
Adding new attributes to the graph...
Successfully added new attributes to the graph

Graph info: 
Directed Graph? True | Nodes: 25338 | Edges: 131846 | Node attributes: 59 | Edge attributes: 38
Node attributes: xmlhttprequest, image, font, script, stylesheet, ping, sub_frame, other, main_frame, csp_report, object, media, websocket, GET, POST, OPTIONS, HEAD, PUT, DELETE, SEARCH, PATCH, count, tracking, firstPartyDisclosed, cookiesSet, thirdPartyCookie, avgUrlLength, avgReqPerNeighbor, avgQpPerReq, avgQpPerNeighbor, avgRhPerNeighbor, avgRespHPerRq, avgRespHPerNeighbor, avgCookieFieldsPerRq, avgCookieFieldsPerNeighbor, maxSubdomainDepth, avgPathLength, hostUrlLength, frameIdGtZero, label, indegree, outdegree, degree, weighted indegree, w
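
The merge of the CSV metrics into the node attributes can be sketched as follows. This is a minimal illustration, not the code in ```preprocess.load_graph```: the function name ```merge_centrality_metrics```, the ```Id``` column name, and the tiny in-memory data are assumptions, and the node data is modelled as a plain dictionary of the kind returned by ```dict(G.nodes(data=True))```.

```python
import csv
import io


def merge_centrality_metrics(node_attrs, csv_file, id_column="Id"):
    """Add every CSV column (except the id column) to the matching node.

    node_attrs: dict mapping node id -> dict of attributes.
    csv_file:   open text file with one row per node.
    id_column:  name of the column holding the node id (assumed to be "Id").
    """
    reader = csv.DictReader(csv_file)
    for row in reader:
        node_id = row.pop(id_column)
        if node_id not in node_attrs:
            continue  # skip rows for nodes that are not in the graph
        for name, value in row.items():
            # Assumption: every metric column is numeric.
            node_attrs[node_id][name] = float(value)
    return node_attrs


# Tiny in-memory stand-ins for the GEXF node data and the CSV file.
nodes = {"a.com": {"count": 3}, "b.com": {"count": 1}}
metrics_csv = io.StringIO("Id,indegree,pageranks\na.com,2,0.7\nb.com,0,0.3\n")

merged = merge_centrality_metrics(nodes, metrics_csv)
print(merged["a.com"])
```

With NetworkX, the graph itself would typically come from ```nx.read_gexf```, and a dictionary of per-node dictionaries like the one built here can be written back to the graph with ```nx.set_node_attributes```.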

### Step 2: Labeling the data

We use the ```tracking``` node attribute (the tracking ratio) to label the nodes. If the ratio is greater than or equal to 0.5, the node is labeled as a tracker; otherwise it is labeled as a non-tracker:

In [6]:
# Iterate over the nodes and set the label
labels = preprocess.generate_labels(G)

num_labels = 50
print(f"\nExample of {num_labels} labels:")
print(labels[:num_labels])


Generating binary labels for the nodes in the graph...
Successfully generated the binary labels for the nodes in the graph

Example of 50 labels:
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
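
The threshold rule described above can be sketched in a few lines. This is a hypothetical stand-in for ```preprocess.generate_labels```, whose implementation is not shown here. It assumes that 1 encodes a tracker, that a missing ```tracking``` attribute counts as a ratio of 0.0, and that the node data is a plain dictionary such as ```dict(G.nodes(data=True))```.

```python
def generate_binary_labels(node_attrs, threshold=0.5):
    """Return one label per node: 1 if tracking ratio >= threshold, else 0."""
    labels = []
    for attrs in node_attrs.values():
        # Assumption: nodes without a tracking ratio are treated as 0.0.
        ratio = float(attrs.get("tracking", 0.0))
        labels.append(1 if ratio >= threshold else 0)
    return labels


# Hypothetical example nodes; the ratio 0.5 sits exactly on the threshold.
nodes = {
    "a.com": {"tracking": 0.5},
    "b.com": {"tracking": 0.49},
    "c.com": {"tracking": 1.0},
    "d.com": {},
}
example_labels = generate_binary_labels(nodes)
print(example_labels)
```

Because the comparison is inclusive, a node with a ratio of exactly 0.5 ends up in the tracker class.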


Before we move on, we save this NetworkX graph object to our ```interim``` data folder so that it can be used for visualizations:

In [7]:
save_file_path = os.path.join('..','data', 'interim', f"{data_name}-centrality-graph.gexf")
nx.write_gexf(G, save_file_path)

### Step 3: Removing attributes according to the configuration

We use the t.ex configuration, stored in ```tex_config.py```, to decide which attributes are removed:

In [8]:
config_path = os.path.join('..', 'references', 'tex_config.py')

# Remove the unwanted node attributes/features
G = preprocess.remove_attributes_from_gexf(G, config_path)

print('\nGraph info: ')
visualize.graph_info(G)
print()
visualize.node_info(G, 5)


Importing configuration from ..\references\tex_config.py...
Successfully imported configuration from ..\references\tex_config.py
Removing unwanted attributes from the graph with the configuration in ..\references\tex_config.py...
Successfully removed unwanted attributes from the graph with the configuration in ..\references\tex_config.py

Graph info: 
Directed Graph? True | Nodes: 25338 | Edges: 131846 | Node attributes: 49 | Edge attributes: 0
Node attributes: xmlhttprequest, image, font, script, stylesheet, sub_frame, other, main_frame, csp_report, object, media, GET, POST, OPTIONS, HEAD, PUT, count, firstPartyDisclosed, cookiesSet, thirdPartyCookie, avgUrlLength, avgReqPerNeighbor, avgQpPerReq, avgQpPerNeighbor, avgRhPerNeighbor, avgRespHPerRq, avgRespHPerNeighbor, avgCookieFieldsPerRq, avgCookieFieldsPerNeighbor, maxSubdomainDepth, avgPathLength, hostUrlLength, frameIdGtZero, indegree, outdegree, degree, eccentricity, closnesscentrality, harmonicclosnesscentrality, betweenesscentr
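
Comparing the two attribute lists printed so far shows that, among others, ```ping```, ```websocket```, ```DELETE```, ```SEARCH```, ```PATCH```, ```tracking``` and ```label``` are dropped in this step. The sketch below illustrates the idea on a plain dictionary of node attributes. The function name ```remove_attributes``` and the representation of the configuration as a set of unwanted names are assumptions; the actual layout of ```tex_config.py``` is not shown here.

```python
def remove_attributes(node_attrs, unwanted):
    """Delete every attribute listed in `unwanted` from every node, in place."""
    unwanted = set(unwanted)
    for attrs in node_attrs.values():
        # intersection() builds a new set, so deleting while looping is safe.
        for name in unwanted.intersection(attrs):
            del attrs[name]
    return node_attrs


# Hypothetical node with a mix of wanted and unwanted attributes.
nodes = {"a.com": {"script": 4, "ping": 1, "tracking": 0.8, "label": "a.com"}}
cleaned = remove_attributes(nodes, unwanted={"ping", "tracking", "label"})
print(cleaned)
```

Note that the labels were generated before this step, so removing ```tracking``` here keeps the label source out of the feature set.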

### Step 4: Feature encoding the node attributes

We use the same feature encoding as the t.ex classifier: all node attributes are simply converted into numerical values and used as features. We also build two separate feature tensors, one with the centrality metrics and one without, for further analysis:

In [9]:
feature_tensor, feature_names = build_features.feature_encoding(G)
feature_tensor_excluding_centrality, feature_names_excluding_centrality = build_features.feature_encoding(G, exclude_centrality=True)


Feature encoding the node attributes...


Encoding node attributes: 100%|██████████| 25338/25338 [00:00<00:00, 43214.00it/s]


Successfully feature encoded the node attributes

Feature encoding the node attributes...


Encoding node attributes: 100%|██████████| 25338/25338 [00:01<00:00, 21607.42it/s]


Successfully feature encoded the node attributes


In [10]:
print(feature_names)
print(feature_names_excluding_centrality)

['xmlhttprequest', 'image', 'font', 'script', 'stylesheet', 'sub_frame', 'other', 'main_frame', 'csp_report', 'object', 'media', 'GET', 'POST', 'OPTIONS', 'HEAD', 'PUT', 'count', 'firstPartyDisclosed', 'cookiesSet', 'thirdPartyCookie', 'avgUrlLength', 'avgReqPerNeighbor', 'avgQpPerReq', 'avgQpPerNeighbor', 'avgRhPerNeighbor', 'avgRespHPerRq', 'avgRespHPerNeighbor', 'avgCookieFieldsPerRq', 'avgCookieFieldsPerNeighbor', 'maxSubdomainDepth', 'avgPathLength', 'hostUrlLength', 'frameIdGtZero', 'indegree', 'outdegree', 'degree', 'eccentricity', 'closnesscentrality', 'harmonicclosnesscentrality', 'betweenesscentrality', 'authority', 'hub', 'pageranks', 'componentnumber', 'strongcompnum', 'modularity_class', 'stat_inf_class', 'clustering', 'eigencentrality']
['xmlhttprequest', 'image', 'font', 'script', 'stylesheet', 'sub_frame', 'other', 'main_frame', 'csp_report', 'object', 'media', 'GET', 'POST', 'OPTIONS', 'HEAD', 'PUT', 'count', 'firstPartyDisclosed', 'cookiesSet', 'thirdPartyCookie', 'av
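
The encoding can be pictured as building one row of floats per node and one column per attribute. The following sketch is not the code in ```build_features.feature_encoding```; the function name ```encode_features```, the small ```CENTRALITY_NAMES``` set, and the example nodes are assumptions. It also assumes that every node carries the same attributes and that each value can be passed to ```float``` (booleans become 1.0 or 0.0).

```python
# Assumed subset of the centrality attribute names printed above.
CENTRALITY_NAMES = {"indegree", "outdegree", "degree", "pageranks"}


def encode_features(node_attrs, exclude_centrality=False):
    """Return (rows, names): one row of floats per node, one column per attribute."""
    first = next(iter(node_attrs.values()))
    # Column order follows the attribute order of the first node.
    names = [
        name for name in first
        if not (exclude_centrality and name in CENTRALITY_NAMES)
    ]
    rows = [[float(attrs[name]) for name in names] for attrs in node_attrs.values()]
    return rows, names


nodes = {
    "a.com": {"script": 4, "frameIdGtZero": True, "indegree": 2, "pageranks": 0.7},
    "b.com": {"script": 0, "frameIdGtZero": False, "indegree": 0, "pageranks": 0.3},
}
rows, names = encode_features(nodes)
rows_small, names_small = encode_features(nodes, exclude_centrality=True)
print(names, rows)
print(names_small, rows_small)
```

A nested list like ```rows``` can be turned into a feature tensor with ```torch.tensor(rows, dtype=torch.float)```. In the full list printed above, the first 33 names (up to ```frameIdGtZero```) are the non-centrality features and the remaining 16 (from ```indegree``` onwards) are the centrality metrics.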

### Step 5: Transforming the data into a PyTorch Geometric (PyG) object

Finally, we transform the graph data into PyG data objects so that they can be used as input to our GNN models. We assign initial training, validation, and test masks (80%/10%/10%) and save the two preprocessed PyG data objects to our ```processed``` data folder:

In [11]:
# Create the PyG data objects
data, node_names = preprocess.networkx_to_pyg(G, feature_tensor, labels)
data_excluding_centrality, _ = preprocess.networkx_to_pyg(G, feature_tensor_excluding_centrality, labels)

# Assign the masks
data = preprocess.assign_masks(data)
data_excluding_centrality = preprocess.assign_masks(data_excluding_centrality)

# Save the data objects to the specified file paths
save_data = os.path.join('..','data', 'processed', f"{data_name}-with-centrality-metrics.pt")
save_metadata = os.path.join('..','data', 'processed', f"metadata-{data_name}.json")
save_data_excluding_centrality = os.path.join('..','data', 'processed', f"{data_name}-without-centrality-metrics.pt")

preprocess.save_pyg(data_excluding_centrality, feature_names_excluding_centrality, node_names, save_data_excluding_centrality, save_metadata)
preprocess.save_pyg(data, feature_names, node_names, save_data, save_metadata)


Converting NetworkX graph to PyG Data object...
Successfully converted NetworkX graph to PyG Data object

Converting NetworkX graph to PyG Data object...
Successfully converted NetworkX graph to PyG Data object

Assigning masks to the PyG Data object...
Successfully assigned masks to the PyG Data object

Assigning masks to the PyG Data object...
Successfully assigned masks to the PyG Data object

Saving PyG Data object to ..\data\processed\chrome-run-01-without-centrality-metrics.pt...
Successfully saved PyG Data object to ..\data\processed\chrome-run-01-without-centrality-metrics.pt: Data(x=[25338, 33], edge_index=[2, 131846], num_nodes=25338, y=[25338], train_mask=[25338], val_mask=[25338], test_mask=[25338])
Saving PyG Data metadata to ..\data\processed\metadata-chrome-run-01.json...
Successfully saved metadata to ..\data\processed\metadata-chrome-run-01.json

Saving PyG Data object to ..\data\processed\chrome-run-01-with-centrality-metrics.pt...
Successfully saved PyG Data object 
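
The 80%/10%/10% split mentioned above can be sketched with boolean masks over the node indices. This is an illustration under stated assumptions rather than the code in ```preprocess.assign_masks```: the function name ```assign_split_masks```, the random shuffling, the fixed seed, and the rounding (the test set receives the remainder) are all choices made for this example.

```python
import random


def assign_split_masks(num_nodes, train=0.8, val=0.1, seed=0):
    """Return (train_mask, val_mask, test_mask) as lists of booleans.

    Every node index is placed in exactly one of the three masks.
    """
    indices = list(range(num_nodes))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = int(train * num_nodes)
    n_val = int(val * num_nodes)

    train_mask = [False] * num_nodes
    val_mask = [False] * num_nodes
    test_mask = [False] * num_nodes
    for i in indices[:n_train]:
        train_mask[i] = True
    for i in indices[n_train:n_train + n_val]:
        val_mask[i] = True
    for i in indices[n_train + n_val:]:
        test_mask[i] = True  # the remainder goes to the test set
    return train_mask, val_mask, test_mask


# Same number of nodes as in the graph above.
train_mask, val_mask, test_mask = assign_split_masks(25338)
print(sum(train_mask), sum(val_mask), sum(test_mask))
```

In PyG, masks of this kind are usually stored as boolean tensors on the data object, for example ```data.train_mask = torch.tensor(train_mask)```, which matches the ```train_mask=[25338]``` entries in the output above.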