## Detection of Malicious Accounts on Azure-AD Sign-on Using a Relational Graph Convolutional Network (RGCN)

### Content
1. Introduction
2. Dataset Loading and Processing
3. Graph Construction
4. Model Training
5. Evaluation
6. Conclusion

### 1. Introduction

Azure Active Directory (Azure-AD) is an identity and access management service that helps users access internal and external resources such as Office365 and other SaaS applications. The Azure-AD sign-in logs record who the user is, which application was used for the access, and which target resource the identity accessed [1]. At a given time $t$, a user $u$ requests a service $s$ from a device $d$ using an authentication mechanism $a$, and the request is either allowed or blocked.

This workflow shows an end-to-end pipeline for detecting malicious Azure-AD sign-ins using a relational graph convolutional network (RGCN).

In [43]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

import torch
import torch.nn as nn
import pandas as pd
import torch.nn.functional as F
import dgl.nn.pytorch as dglnn
from utils import *
from sklearn.metrics import roc_auc_score, accuracy_score
from model import HeteroRGCN
from tqdm import trange
from data_processing import build_azure_graph, synthetic_azure

from utils import get_metrics
from model_training import init_loaders, train, evaluate, save_model, load_model





### 2. Dataset Loading and Processing
Load the dataset, then set the true fraud label (`fraud_label`) used for testing and the status flag (`status_flag`) used for training. Define the set of meta columns to exclude from the training features.

In [28]:
status_label = 'status_flag'
result_dir = 'azure_result'


meta_cols = ['day','appId', 'userId', 'ipAddress',
            'fraud_label','appId_id','userId_id','ipAddress_id', 'auth_id', 'status_flag']

train_data, test_data, train_idx, test_idx, labels, df = synthetic_azure(
    '../dataset/azure_synthetic/azure_ad_logs_sample_with_anomaly_train.json')

fraud_labels = df['fraud_label'].values
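
The role of `meta_cols` can be illustrated with a small, self-contained sketch. Only the names inside `meta_cols` come from this notebook. The three `feat_*` columns and the helper `select_feature_columns` are hypothetical, and the actual filtering happens inside `build_azure_graph`, whose implementation is not shown here.

```python
# Identifier and label columns, copied from the cell above.
meta_cols = ['day', 'appId', 'userId', 'ipAddress',
             'fraud_label', 'appId_id', 'userId_id', 'ipAddress_id',
             'auth_id', 'status_flag']

# Assumed column layout: the feat_* names are invented for illustration.
all_columns = meta_cols + ['feat_browser_chrome', 'feat_os_windows', 'feat_login_count']


def select_feature_columns(columns, excluded):
    """Return the columns that are not identifiers or labels, preserving order."""
    excluded = set(excluded)
    return [c for c in columns if c not in excluded]


feature_cols = select_feature_columns(all_columns, meta_cols)
print(feature_cols)
```

Keeping labels such as `fraud_label` and `status_flag` out of the feature matrix matters because leaving them in would leak the target into the model input.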


In [29]:
print('The distribution of status flag for the train data is:\n', train_data['status_flag'].value_counts())
print('The distribution of fraud for the test data is:\n', test_data['fraud_label'].value_counts())

The distribution of status flag for the train data is:
 0    1245
1     333
Name: status_flag, dtype: int64
The distribution of fraud for the test data is:
 0.0    414
Name: fraud_label, dtype: int64


In [16]:
print(f"Days in test set {test_data.day.unique().tolist()}")

Days in test set [239, 237, 236, 240, 241, 238]


In [17]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### 3. Graph Construction

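Construct the training and inference graphs. The training graph `g` is built from `train_data`, and the inference graph `g_test` is built from the full dataset `df` (which includes the test rows), with the graph schema defined in the `build_azure_graph` method. Then create the feature tensors and compute class weights from the label proportions in the training data.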

![Azure-graph](graph.png)

In [18]:
g, _ = build_azure_graph(train_data, meta_cols)
g_test, feature_tensors = build_azure_graph(df, meta_cols)
n_nodes = sum([g.number_of_nodes(n_type) for n_type in g.ntypes])
n_edges = sum([g.number_of_edges(e_type) for e_type in g.etypes])
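
As a rough illustration of what a heterogeneous sign-in graph looks like, the sketch below builds per-relation edge lists in plain Python. The schema here is an assumption: the node types `user`, `service`, and `device`, the relation names, and the helper `build_edge_lists` are hypothetical, and the real schema lives in `build_azure_graph`. Only the column names (`auth_id`, `userId_id`, `appId_id`, `ipAddress_id`) and the target node type `authentication` are taken from this notebook.

```python
from collections import defaultdict

# Hypothetical sign-in records; the values are made up.
records = [
    {"auth_id": 0, "userId_id": 0, "appId_id": 0, "ipAddress_id": 0},
    {"auth_id": 1, "userId_id": 0, "appId_id": 1, "ipAddress_id": 0},
    {"auth_id": 2, "userId_id": 1, "appId_id": 1, "ipAddress_id": 1},
]

# Assumed mapping from node type to the column holding its integer ID.
entity_columns = {"user": "userId_id", "service": "appId_id", "device": "ipAddress_id"}


def build_edge_lists(records, entity_columns, target="authentication"):
    """Connect every entity to the authentication it took part in, in both directions."""
    edges = defaultdict(list)
    for rec in records:
        auth = rec["auth_id"]
        for ntype, col in entity_columns.items():
            edges[(ntype, f"{ntype}_to_{target}", target)].append((rec[col], auth))
            edges[(target, f"{target}_to_{ntype}", ntype)].append((auth, rec[col]))
    return dict(edges)


edges = build_edge_lists(records, entity_columns)
n_relations = len(edges)
n_edges_total = sum(len(pairs) for pairs in edges.values())
print(n_relations, "relations,", n_edges_total, "edges")
```

A dictionary keyed by (source type, relation, destination type) is also the general shape of input that `dgl.heterograph` works with, so edge lists like these could be converted to tensors and passed on. Adding reverse relations is a common choice so that messages can flow both into and out of the authentication nodes.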

In [35]:
# Hyperparameters
in_size, hidden_size, out_size, n_layers, embedding_size = feature_tensors.shape[1], 16, 2, 2, 8
target_node = "authentication"
hyperparameters = {"in_size": in_size, "hidden_size": hidden_size, "out_size": out_size,
                   "n_layers": n_layers, "embedding_size": embedding_size, "target_node": target_node}
labels = torch.LongTensor(labels).to(device)
# Class weights for the loss: class 0 gets the positive fraction and class 1 the
# negative fraction, so the rarer positive class is weighted more heavily.
scale_pos_weight = train_data[status_label].sum() / train_data.shape[0]
scale_pos_weight = torch.tensor(
    [scale_pos_weight, 1-scale_pos_weight]).to(device)
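
The class-weight computation can be checked by hand with the `status_flag` counts printed earlier (1245 negatives, 333 positives). The sketch below repeats the same arithmetic in plain Python. The variable names are illustrative and not part of the notebook.

```python
# Counts taken from the status_flag distribution printed for the training data.
n_negative, n_positive = 1245, 333
n_total = n_negative + n_positive

# Same quantity as train_data[status_label].sum() / train_data.shape[0].
positive_fraction = n_positive / n_total

# [weight for class 0, weight for class 1]
class_weights = [positive_fraction, 1 - positive_fraction]
relative_weight = class_weights[1] / class_weights[0]

print([round(w, 4) for w in class_weights])  # [0.211, 0.789]
print(round(relative_weight, 2))             # 3.74
```

Because each class is weighted by the frequency of the other class, a misclassified positive sample contributes about 3.74 times as much to the weighted cross-entropy loss as a misclassified negative one, which is exactly the negative-to-positive ratio 1245 / 333.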



### 4. Model Training

#### RGCN Model

At a high level, the RGCN model takes a graph $G$ with node features $X$ and learns an embedding of $G$ through a function $f: V \rightarrow \mathbb{R}^d$ that maps each node $v \in V$ to a $d$-dimensional vector. The model is trained in a semi-supervised setting to classify each authentication as a potential success or failure, using historical authentication logs. The embedding vector of each “authentication” node is then used to score whether the authentication is benign or malicious. We compare the detection score from the RGCN softmax layer with the score obtained by feeding the learned embeddings to an unsupervised anomaly detection algorithm, isolation forest.
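
For reference, the layer-wise propagation rule of a relational GCN, as formulated by Schlichtkrull et al. [4], is given below. The `HeteroRGCN` class imported from `model.py` is not listed in this notebook, so this is the general formulation and not necessarily its exact implementation.

$$
h_i^{(l+1)} = \sigma\left( W_0^{(l)} h_i^{(l)} + \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} \right)
$$

Here $h_i^{(l)}$ is the hidden state of node $i$ at layer $l$, $\mathcal{R}$ is the set of relation types, $\mathcal{N}_i^r$ is the set of neighbors of node $i$ under relation $r$, $W_r^{(l)}$ is a relation-specific weight matrix, $W_0^{(l)}$ is the self-connection weight, $c_{i,r}$ is a normalization constant such as $|\mathcal{N}_i^r|$, and $\sigma$ is a nonlinearity. In this notebook's setting, the hidden state of an “authentication” node after the last layer plays the role of the embedding $f(v)$.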

In [20]:
print(
    """ ---- Data statistics ------
            # Nodes: {}
            # Edges: {}
            # Features Shape: {}
            # Labeled Train samples: {}
            # Unlabeled Test samples: {}""".
    format(
        n_nodes, n_edges, feature_tensors.shape[0],
        train_data.shape[0],
        test_data.shape[0]))

 ---- Data statistics ------
            # Nodes: 2131
            # Edges: 9468
            # Features Shape: 1992
            # Labeled Train samples: 1578
            # Unlabeled Test samples: 414


#### Training

The model is trained on three months of synthetic Azure log data for selected users and tested on the following 14 days. Selected features are aggregated per day as one-hot encodings (OHE) for individual users, requested apps, device activity logs, and authentication behaviors. The `statusFailure` feature, which takes the binary values "success" or "failure", provides the labels for semi-supervised training of the model.
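
The per-day one-hot aggregation can be pictured with a minimal sketch. The events, the `browser`-style categorical values, and the helper `daily_one_hot` are hypothetical. The notebook's actual feature engineering is done in `data_processing` and may differ.

```python
from collections import defaultdict

# Hypothetical raw events: (user, day, categorical value such as a browser name).
events = [
    ("alice", 236, "chrome"),
    ("alice", 236, "edge"),
    ("alice", 237, "chrome"),
    ("bob", 236, "firefox"),
]


def daily_one_hot(events):
    """Aggregate a categorical field into one indicator vector per (user, day)."""
    vocabulary = sorted({value for _, _, value in events})
    index = {value: i for i, value in enumerate(vocabulary)}
    rows = defaultdict(lambda: [0] * len(vocabulary))
    for user, day, value in events:
        rows[(user, day)][index[value]] = 1  # mark the value as seen on that day
    return vocabulary, dict(rows)


vocabulary, rows = daily_one_hot(events)
print(vocabulary)
for key in sorted(rows):
    print(key, rows[key])
```

Each (user, day) pair ends up with a fixed-length indicator vector, which is the kind of row that can serve as a node feature regardless of how many raw events occurred on that day.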

Set up the dataloaders and define the model, optimizer, and loss function.

In [21]:
epochs = 30
train_loader, val_loader, test_loader = init_loaders(
    g, train_idx, test_idx=test_idx, val_idx=train_idx, g_test=g_test,
    target_node=target_node)

model = HeteroRGCN(g, in_size, hidden_size, out_size,
                   n_layers, embedding_size, device=device, target=target_node).to(device)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=0.001, weight_decay=5e-4)
loss_func = nn.CrossEntropyLoss(weight=scale_pos_weight.float())
best_model = model

In [22]:
best_auc = 0
for epoch in trange(epochs):
    train_acc, loss = train(
        model, loss_func, train_loader, labels, optimizer, feature_tensors,
        target_node=target_node, device=device)
    print("Epoch {:03d}/{:03d} | Train Accuracy: {:.4f} | Train Loss: {:.4f}".format(
        epoch, epochs, train_acc, loss))

    val_logits, val_seed, _ = evaluate(
        model, val_loader, feature_tensors, target_node, device=device)
    val_accuracy = accuracy_score(
        val_logits.argmax(1),
        labels.long()[val_seed].cpu()).item()
    val_auc = roc_auc_score(labels.long()[val_seed].cpu().numpy(),
                            val_logits[:, 1].numpy(),)
    print(
        "Validation Accuracy: {:.4f} auc {:.4f}".format(
            val_accuracy, val_auc))
        
best_model = model

Epoch 000/030 | Train Accuracy: 0.5986 | Train Loss: 1.3291
Validation Accuracy: 0.6888 auc 0.7919
Epoch 001/030 | Train Accuracy: 0.6869 | Train Loss: 1.3142
Validation Accuracy: 0.7199 auc 0.7999
Epoch 002/030 | Train Accuracy: 0.7336 | Train Loss: 1.3011
Validation Accuracy: 0.7902 auc 0.8085
Epoch 003/030 | Train Accuracy: 0.7924 | Train Loss: 1.2884
Validation Accuracy: 0.8099 auc 0.8162
Epoch 004/030 | Train Accuracy: 0.8183 | Train Loss: 1.2758
Validation Accuracy: 0.8346 auc 0.8251
Epoch 005/030 | Train Accuracy: 0.8287 | Train Loss: 1.2634
Validation Accuracy: 0.8397 auc 0.8313
Epoch 006/030 | Train Accuracy: 0.8339 | Train Loss: 1.2510
Validation Accuracy: 0.8549 auc 0.8383
Epoch 007/030 | Train Accuracy: 0.8426 | Train Loss: 1.2386
Validation Accuracy: 0.8657 auc 0.8446
Epoch 008/030 | Train Accuracy: 0.8426 | Train Loss: 1.2262
Validation Accuracy: 0.8669 auc 0.8481
Epoch 009/030 | Train Accuracy: 0.8633 | Train Loss: 1.2138
Validation Accuracy: 0.8904 auc 0.8504
Epoch 010/030 | Train Accuracy: 0.8806 | Train Loss: 1.2013
Validation Accuracy: 0.8904 auc 0.8541
Epoch 011/030 | Train Accuracy: 0.8806 | Train Loss: 1.1887
Validation Accuracy: 0.8904 auc 0.8564
Epoch 012/030 | Train Accuracy: 0.8806 | Train Loss: 1.1760
Validation Accuracy: 0.8916 auc 0.8573
Epoch 013/030 | Train Accuracy: 0.8806 | Train Loss: 1.1635
Validation Accuracy: 0.8916 auc 0.8582
Epoch 014/030 | Train Accuracy: 0.8806 | Train Loss: 1.1508
Validation Accuracy: 0.8885 auc 0.8574
Epoch 015/030 | Train Accuracy: 0.8806 | Train Loss: 1.1381
Validation Accuracy: 0.8853 auc 0.8586
Epoch 016/030 | Train Accuracy: 0.8633 | Train Loss: 1.1253
Validation Accuracy: 0.8853 auc 0.8592
Epoch 017/030 | Train Accuracy: 0.8443 | Train Loss: 1.1125
Validation Accuracy: 0.8783 auc 0.8601
Epoch 018/030 | Train Accuracy: 0.8443 | Train Loss: 1.0997
Validation Accuracy: 0.8783 auc 0.8614
Epoch 019/030 | Train Accuracy: 0.8443 | Train Loss: 1.0870
Validation Accuracy: 0.8771 auc 0.8626
Epoch 020/030 | Train Accuracy: 0.8443 | Train Loss: 1.0744
Validation Accuracy: 0.8771 auc 0.8636
Epoch 021/030 | Train Accuracy: 0.8443 | Train Loss: 1.0619
Validation Accuracy: 0.8771 auc 0.8652
Epoch 022/030 | Train Accuracy: 0.8443 | Train Loss: 1.0496
Validation Accuracy: 0.8771 auc 0.8665
Epoch 023/030 | Train Accuracy: 0.8443 | Train Loss: 1.0375
Validation Accuracy: 0.8771 auc 0.8675
Epoch 024/030 | Train Accuracy: 0.8408 | Train Loss: 1.0256
Validation Accuracy: 0.8745 auc 0.8689
Epoch 025/030 | Train Accuracy: 0.8408 | Train Loss: 1.0140
Validation Accuracy: 0.8752 auc 0.8698
Epoch 026/030 | Train Accuracy: 0.8374 | Train Loss: 1.0027
Validation Accuracy: 0.8701 auc 0.8708
Epoch 027/030 | Train Accuracy: 0.8374 | Train Loss: 0.9918
Validation Accuracy: 0.8701 auc 0.8725
Epoch 028/030 | Train Accuracy: 0.8374 | Train Loss: 0.9812
Validation Accuracy: 0.8682 auc 0.8743
Epoch 029/030 | Train Accuracy: 0.8356 | Train Loss: 0.9711
Validation Accuracy: 0.8676 auc 0.8764
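
Note that the training cell initializes `best_auc` but never updates it, so `best_model` is simply the model after the last epoch. If checkpoint selection by validation metric is wanted, one possible approach is sketched below. The helper `select_best` is hypothetical, the dictionaries stand in for model weights, and the metric values are copied from epochs 12, 13, and 29 of the log above.

```python
import copy


def select_best(history):
    """Return (epoch, metric, snapshot) for the highest validation metric.

    On ties the earliest epoch wins, because only a strict improvement replaces it.
    """
    best_epoch, best_metric, best_state = None, float("-inf"), None
    for epoch, metric, state in history:
        if metric > best_metric:
            best_epoch, best_metric = epoch, metric
            # With PyTorch this would be copy.deepcopy(model.state_dict()).
            best_state = copy.deepcopy(state)
    return best_epoch, best_metric, best_state


# (epoch, validation metric, placeholder for model weights)
val_auc = [(12, 0.8573, {"w": 12}), (13, 0.8582, {"w": 13}), (29, 0.8764, {"w": 29})]
val_acc = [(12, 0.8916, {"w": 12}), (13, 0.8916, {"w": 13}), (29, 0.8676, {"w": 29})]

best_by_auc = select_best(val_auc)
best_by_acc = select_best(val_acc)
print("best by AUC:", best_by_auc[:2])
print("best by accuracy:", best_by_acc[:2])
```

In this run the two criteria disagree: validation accuracy peaks at 0.8916 around epochs 12 and 13, while validation AUC keeps rising until the final epoch (0.8764). Selecting by AUC therefore happens to coincide with keeping the last model. Also note that the validation loader in this notebook is built from `train_idx`, so these metrics are measured on the training nodes.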





Save model, hyperparameters and graph structure.

In [44]:
# Save model
save_model(g, model, hyperparameters, "../modeldir/")

In [45]:
# Load the model for inference
model_new, g_ = load_model("../modeldir/")

### 5. Evaluation

To evaluate and compare models, we first create the "authentication" embeddings for the training and test datasets.

In [24]:
## Create training & test embedding
train_logits, train_seeds, train_embedding = evaluate(
    best_model, train_loader, feature_tensors, target_node)

In [25]:
test_logits, test_seeds, test_embedding = evaluate(
    best_model, test_loader, feature_tensors, target_node)

The `train_logits` and `test_logits` are the outputs of the RGCN classification layer. We can evaluate the model using `test_logits` together with the evaluation labels. The `train_embedding` and `test_embedding` dataframes contain the embeddings of the authentication nodes. These embeddings can be used to train and evaluate other baseline models, such as XGBoost or an unsupervised model (Isolation Forest).
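
To make the scoring step concrete, the sketch below converts two-class logits into a probability for the malicious class and computes ROC AUC from first principles. The logits and labels are invented for illustration, and the helpers `softmax` and `roc_auc` are written here in plain Python instead of using the `torch` and `sklearn` calls from the notebook.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical logits in the order [benign, malicious], with matching labels.
logits = [[2.0, -1.0], [0.3, 0.9], [-0.2, 1.4], [1.0, 1.2]]
labels = [0, 0, 1, 1]

malicious_prob = [softmax(row)[1] for row in logits]
auc = roc_auc(labels, malicious_prob)
print(auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, which gives an AUC of 0.75. The training loop in this notebook passes the raw class-1 logit (`val_logits[:, 1]`) to `roc_auc_score`. That can rank samples differently from the softmax probability, since the probability depends on the difference between the two logits.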

Using these embeddings, we compare the performance of the RGCN-based model against baseline algorithms trained and tested on the authentication embeddings, using an internal dataset of more than 45k events for training and 11k events for testing. The results are shown in the following figure.

![model comparison](model_comp.png)
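
A baseline of the kind compared above could be set up roughly as follows. This is a sketch under stated assumptions, not the code behind the figure: the arrays are synthetic stand-ins for `train_embedding` and `test_embedding`, the dimension 8 is chosen to match `embedding_size`, and the Isolation Forest settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-ins for the authentication embeddings (values are not real data).
train_emb = rng.normal(0.0, 1.0, size=(200, 8))
test_emb = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 8)),  # embeddings resembling the training data
    np.full((1, 8), 6.0),                # last row: an obvious outlier
])

forest = IsolationForest(n_estimators=100, random_state=0).fit(train_emb)

# score_samples returns higher values for normal points, so negate it
# to obtain a score where larger means more anomalous.
anomaly_score = -forest.score_samples(test_emb)
print(anomaly_score.round(3))
```

The resulting `anomaly_score` can be evaluated against the fraud labels in the same way as the RGCN scores, for example with `roc_auc_score`, which makes the two approaches directly comparable.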

### 6. Conclusion

In this workflow, we showed an end-to-end pipeline for malicious Azure sign-in detection using RGCN.
We explored methods of heterogeneous graph embedding for malicious sign-on detection in Azure logs. This work makes two main contributions. First, modeling authentication logs as a graph for a GNN allows us to learn a richer embedding of each authentication, capturing both the graph structure and the individual entities involved, without much hand-crafted feature engineering. Second, by modeling every “authentication” as a target node, the model avoids the challenge of depending on temporal modeling of historical user login information. The experimental results on internal data show that RGCN prediction scores on authentication nodes give promising overall detection performance and are better than the baseline isolation forest and XGBoost algorithms applied to the learned embedding vectors of the authentication nodes.


### References
1. https://docs.microsoft.com/en-us/azure/active-directory/reports-monitoring/concept-sign-ins
2. Liu, Ziqi, et al. “Heterogeneous Graph Neural Networks for Malicious Account Detection.” arXiv [cs.LG], 27 Feb. 2020, https://doi.org/10.1145/3269206.3272010.
3. Lv, Mingqi, et al. “A Heterogeneous Graph Learning Model for Cyber-Attack Detection.” arXiv [cs.CR], 16 Dec. 2021, http://arxiv.org/abs/2112.08986.
4. Schlichtkrull, Michael, et al. “Modeling Relational Data with Graph Convolutional Networks.” European Semantic Web Conference. Springer, Cham, 2018.
5. Rao, Susie Xi, et al. “xFraud: Explainable Fraud Transaction Detection.” Proceedings of the VLDB Endowment 3 (2021).
6. Powell, Brian A. “Detecting Malicious Logins as Graph Anomalies.” Journal of Information Security and Applications 54 (2020): 102557.