# Malware Classification with Graph Embeddings

This notebook builds an end-to-end workflow for detecting malware from function-call graphs. We use the MalNet Tiny dataset, convert each graph into a vector representation, explore the structure of these vectors, and finally train both classical ML models and a GNN for classification. The dataset is described in the [PyTorch Geometric documentation](https://pytorch-geometric.readthedocs.io/en/2.4.0/generated/torch_geometric.datasets.MalNetTiny.html); its `MalNetTiny` helper downloads the graphs automatically, so no manual download is required.


In [None]:
# Run only if needed, then restart the Jupyter kernel
!uv pip install --no-cache-dir --force-reinstall joblib==1.3.2

In [None]:
from pathlib import Path
from collections import Counter, defaultdict
from tqdm import tqdm

from karateclub import Graph2Vec
from umap.umap_ import UMAP
import networkx as nx
import plotly.express as px
from sklearn.model_selection import train_test_split

import pandas as pd
from pycaret.classification import (
    setup, compare_models, finalize_model,
    tune_model, evaluate_model, save_model, load_model)

## 1. Load MalNet Tiny Graphs
Read the `.edgelist` files stored under `PATH_GRAPHS`, one subdirectory per class (the `MalNetTiny` dataset helper can download the graphs automatically), and keep a balanced subset of at most 200 graphs per class for faster experimentation.


In [None]:
PATH_GRAPHS = '../../data/malnet-graphs-tiny'

In [None]:
%%time
CLASSES = ['addisplay', 'adware', 'benign', 'downloader', 'trojan']
MAX_GRAPHS_BY_CLASSE = 200

targets = []
graphs = []

for classe in CLASSES:
    files = Path(PATH_GRAPHS + '/' + classe).glob('*.edgelist')
    for i, file in tqdm(enumerate(files)):
        if i >= MAX_GRAPHS_BY_CLASSE:
            break
        targets.append(classe)
        G = nx.read_edgelist(file)
        G = nx.convert_node_labels_to_integers(G, label_attribute='old_label')
        graphs.append(G)

f"{len(graphs)} graphs loaded ({dict(Counter(targets))})"
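
`Path.glob` yields files in an order that depends on the file system, so the 200 graphs kept per class may differ from one machine to another. A minimal sketch of a reproducible variant, assuming the same one-directory-per-class layout as above; the helper name `list_capped_files` is hypothetical and not part of the notebook:

```python
from itertools import islice
from pathlib import Path
import tempfile


def list_capped_files(root, classes, max_per_class, pattern="*.edgelist"):
    """Return {class_name: [paths]} with at most max_per_class sorted paths each."""
    selected = {}
    for name in classes:
        # Sorting makes the selection independent of file-system order
        files = sorted(Path(root, name).glob(pattern))
        selected[name] = list(islice(files, max_per_class))
    return selected


# Tiny demonstration on a throwaway directory (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("benign", "trojan"):
        Path(tmp, name).mkdir()
        for i in range(5):
            Path(tmp, name, f"g{i}.edgelist").write_text("0 1\n")
    demo = list_capped_files(tmp, ["benign", "trojan"], max_per_class=3)
    demo_names = {k: [p.name for p in v] for k, v in demo.items()}

print(demo_names)
```

In the loading loop, the returned paths could replace the `enumerate(files)` / `break` pair.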

## 2. Graph Embedding
Learn Graph2Vec representations that turn each graph into a dense vector suitable for downstream visualization and classification tasks.


In [None]:
%%time
N_DIMENSIONS = 2

graph2vec = Graph2Vec(dimensions=N_DIMENSIONS)
graph2vec.fit(graphs)
embeddings = graph2vec.get_embedding()
print(embeddings.shape)
embeddings
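
Graph2Vec treats each graph as a document whose words are Weisfeiler-Lehman (WL) subtree labels, and then learns a document embedding over those words. The sketch below illustrates only the WL relabeling step in plain Python, assuming node degree as the initial label; `wl_features` is a hypothetical helper and not karateclub's implementation.

```python
import hashlib


def wl_features(adjacency, iterations=2):
    """Return the sorted list of WL labels collected over all iterations.

    adjacency maps each node to the set of its neighbours (undirected graph).
    """
    # Initial label: node degree, as a string
    labels = {node: str(len(neigh)) for node, neigh in adjacency.items()}
    features = list(labels.values())
    for _ in range(iterations):
        new_labels = {}
        for node, neigh in adjacency.items():
            # Combine the node's label with the sorted labels of its neighbours
            signature = labels[node] + "|" + ",".join(sorted(labels[n] for n in neigh))
            new_labels[node] = hashlib.md5(signature.encode()).hexdigest()
        labels = new_labels
        features.extend(labels.values())
    return sorted(features)


# Two relabelled copies of the same path graph, and a triangle
path_a = {0: {1}, 1: {0, 2}, 2: {1}}
path_b = {"x": {"y"}, "y": {"x", "z"}, "z": {"y"}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}

print(wl_features(path_a) == wl_features(path_b))    # same structure
print(wl_features(path_a) == wl_features(triangle))  # different structure
```

Graphs with the same structure produce the same bag of labels regardless of node naming, which is why the embedding depends on call-graph shape rather than on node identifiers.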

**Plot the embeddings**

The interactive scatterplot helps verify whether Graph2Vec separates the malware families.


In [None]:
fig = px.scatter(x=embeddings[:, 0], y=embeddings[:, 1], color=targets)
fig.show()
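
The visual check can be complemented by a number. The sketch below computes a simple nearest-neighbour label agreement in plain Python: the fraction of points whose closest other point carries the same label. The function name `nn_label_agreement` and the toy points are illustrative assumptions, and the loop is quadratic in the number of points, which should be acceptable for about a thousand graphs.

```python
import math


def nn_label_agreement(points, labels):
    """Fraction of points whose nearest other point has the same label."""
    agree = 0
    for i, p in enumerate(points):
        best_j, best_d = None, math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if d < best_d:
                best_j, best_d = j, d
        agree += labels[i] == labels[best_j]
    return agree / len(points)


# Toy data: two well-separated groups, then the same points with mixed labels
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
separated = nn_label_agreement(pts, ["benign", "benign", "trojan", "trojan"])
mixed = nn_label_agreement(pts, ["benign", "trojan", "benign", "trojan"])
print(separated, mixed)
```

With the notebook's variables this would be called as `nn_label_agreement(embeddings.tolist(), targets)`. For five balanced classes, a value close to 0.2 would be roughly chance level.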

## 3. Dimensionality Reduction
Retrain Graph2Vec with 512 dimensions, then use UMAP to reduce these high-dimensional embeddings to 2D and 3D views that make cluster structures easier to inspect.


In [None]:
%%time
N_DIMENSIONS = 512

graph2vec = Graph2Vec(dimensions=N_DIMENSIONS, seed=42)
graph2vec.fit(graphs)
embeddings = graph2vec.get_embedding()

In [None]:
df = pd.DataFrame(embeddings)
df['target'] = targets
df.to_csv('../../data/malware_emb.csv', index=None)
df

**Project embeddings down to 2 dimensions**

In [None]:
%%time
proj_2d = UMAP(n_components=2, init='random', random_state=0).fit_transform(embeddings)
proj_2d

**Plot the 2D embedding**

In [None]:
fig_2d = px.scatter(
    proj_2d, x=0, y=1,
    color=targets
)
fig_2d.show()

**Project embeddings down to 3 dimensions**

In [None]:
proj_3d = UMAP(n_components=3, init='random', random_state=0).fit_transform(embeddings)

**Plot the 3D embedding**

In [None]:
fig_3d = px.scatter_3d(
    proj_3d, x=0, y=1, z=2,
    color=targets,
    height=700
)
fig_3d.update_traces(marker_size=5)
fig_3d.show()

#### Versus the initial approach: Graph2Vec trained directly in 3 dimensions

In [None]:
%%time
N_DIMENSIONS = 3

graph2vec = Graph2Vec(dimensions=N_DIMENSIONS)
graph2vec.fit(graphs)
embeddings = graph2vec.get_embedding()
print(embeddings.shape)

fig_3d = px.scatter_3d(
    embeddings, x=0, y=1, z=2,
    color=targets,
    height=700
)
fig_3d.update_traces(marker_size=5)
fig_3d.show()

## 4. Classical Classification
Each graph is annotated with a malware family (or the benign label), so we can train a supervised classifier that predicts one of the five categories automatically. We rely on PyCaret to quickly compare algorithms, tune the best one, and persist the winning model for later inference.


**Load the saved embeddings**

Read the CSV file that stores the Graph2Vec representations alongside their labels.

In [None]:
df = pd.read_csv('../../data/malware_emb.csv')

**Initialize PyCaret**

In [None]:
setup(df, target="target", fold=3, html=True) # 0.8658

**Compare models**
PyCaret benchmarks a wide range of classifiers so we can pick the one that offers the best accuracy for the selected embedding dimensionality.

In [None]:
!uv add scikit-learn==1.4.2

In [None]:
best_model = compare_models(exclude=['lightgbm'])

**Tune the best model**

In [None]:
best_model_tuned = tune_model(best_model, search_library='optuna')

**Evaluation**

In [None]:
evaluate_model(best_model)
# addisplay: 0, adware: 1, benign: 2, downloader: 3, trojan: 4
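
The integer codes in the comment above match an alphabetical encoding of the class names, which is how scikit-learn's `LabelEncoder` orders classes. A minimal sketch that rebuilds the mapping in plain Python (the variable names are illustrative):

```python
class_names = ["trojan", "benign", "adware", "downloader", "addisplay"]

# Sort the unique labels, then number them from 0
label_to_id = {name: idx for idx, name in enumerate(sorted(set(class_names)))}
id_to_label = {idx: name for name, idx in label_to_id.items()}

print(label_to_id)
```

The inverse mapping `id_to_label` is handy for turning numeric axis labels in a confusion matrix back into family names.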

**Save final model**

In [None]:
final_model = finalize_model(best_model)
save_model(final_model, 'ml_model')

In [None]:
best_model = load_model('ml_model')
best_model

### AutoGluon vs. PyCaret

In [None]:
# install autogluon
!uv sync --extra autogluon

In [None]:
from autogluon.tabular import TabularPredictor, TabularDataset

**Split into train and test sets**

In [None]:
# df already contains the 'target' column, so splitting the DataFrame alone is enough
X_train, X_test = train_test_split(df, test_size=0.33, random_state=42)
X_train.to_csv('train.csv', index=None)
X_test.to_csv('test.csv', index=None)

**Training**

In [None]:
predictor = TabularPredictor(label="target").fit("train.csv")

**Prediction on the test set**

In [None]:
test_data = TabularDataset('test.csv')

# Predict without the label column, then preview the first predictions
y_pred = predictor.predict(test_data.drop(columns=['target']))
y_pred[:10]

**Evaluation on the test set**

In [None]:
predictor.evaluate(test_data)

In [None]:
predictor.leaderboard(test_data)

## 5. GNN-Based Classification
Explore an end-to-end neural approach by training a Graph Convolutional Network (GCN) on the raw graphs instead of relying on precomputed embeddings.


In [None]:
# install torch and torch_geometric
!uv sync --extra deep-learning

In [None]:
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import torch
import torch.nn.functional as F
from torch.nn import Dropout
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch_geometric.loader import DataLoader  # torch_geometric.data.DataLoader is deprecated
from torch_geometric.nn import BatchNorm, GCNConv, global_mean_pool
from torch_geometric.utils import to_networkx, from_networkx

# hardware selection
if torch.cuda.is_available():
    device = torch.device("cuda")
    print(torch.version.cuda)
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(torch.__version__)
print(device)

**Convert NetworkX graphs into PyG Data objects**

In [None]:
# Convert NetworkX graphs into PyG Data objects, adding placeholder node features and labels.
def convert_to_pyg(graphs, targets):
    data_list = []
    for i, graph in enumerate(graphs):
        for node in graph.nodes():
            graph.nodes[node]['x'] = [1.0]  # Constant node feature placeholder

        data = from_networkx(graph)
        data.y = torch.tensor([targets[i]], dtype=torch.long)
        data_list.append(data)
    return data_list

class_mapping = {'addisplay': 0, 'adware': 1, 'benign': 2, 'downloader': 3, 'trojan': 4}
encoded_targets = [class_mapping[label] for label in targets]

data_list = convert_to_pyg(graphs, encoded_targets)

**Create train/test splits**

In [None]:
# Split the dataset into train and test partitions
train_data, test_data = train_test_split(
    data_list,
    test_size=int(len(data_list) * 0.3),
    stratify=encoded_targets,
    random_state=42,
)

# Build PyG dataloaders
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
test_loader = DataLoader(test_data, batch_size=32, shuffle=False)
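
The `stratify` argument keeps the class proportions identical in both partitions. A minimal sketch of the idea in plain Python, assuming integer labels and the 5 x 200 balanced subset loaded earlier; `stratified_split` is a hypothetical helper and not scikit-learn's implementation:

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# 5 classes x 200 graphs
toy_labels = [c for c in range(5) for _ in range(200)]
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes 60 of the 300 test graphs here, so test accuracy is not skewed by an over-represented family.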

**Define the GNN architecture**

In [None]:
# Deeper GNN classifier with normalization and dropout regularization
class ImprovedGNNClassifier(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, num_layers=3, dropout=0.5):
        super().__init__()
        self.convs = torch.nn.ModuleList()
        self.norms = torch.nn.ModuleList()

        self.convs.append(GCNConv(in_channels, hidden_channels))
        self.norms.append(BatchNorm(hidden_channels))
        for _ in range(num_layers - 1):
            self.convs.append(GCNConv(hidden_channels, hidden_channels))
            self.norms.append(BatchNorm(hidden_channels))

        self.dropout = Dropout(dropout)
        self.fc = torch.nn.Linear(hidden_channels, out_channels)

    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch

        for conv, norm in zip(self.convs, self.norms):
            x = conv(x, edge_index)
            x = norm(x)
            x = F.relu(x)
            x = self.dropout(x)

        x = global_mean_pool(x, batch)
        return self.fc(x)
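
Because every node starts with the same constant feature, the first graph convolution can only pick up structural information such as node degree. The plain-Python sketch below mimics one GCN propagation step (self-loops plus symmetric degree normalization) without learned weights, followed by mean pooling; `gcn_propagate` and `mean_pool` are illustrative helpers, not PyG functions.

```python
import math


def gcn_propagate(adjacency, x):
    """One GCN propagation step without learned weights.

    Each node sums its own and its neighbours' features, scaled by
    1 / sqrt(deg_i * deg_j), where degrees include the added self-loop.
    """
    deg = {node: len(neigh) + 1 for node, neigh in adjacency.items()}
    out = {}
    for i, neigh in adjacency.items():
        total = x[i] / deg[i]  # self-loop term
        for j in neigh:
            total += x[j] / math.sqrt(deg[i] * deg[j])
        out[i] = total
    return out


def mean_pool(node_values):
    """Average node values into one graph-level value."""
    return sum(node_values.values()) / len(node_values)


# Star graph: node 0 is connected to three leaves; every node starts at 1.0
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
ones = {node: 1.0 for node in star}
hidden = gcn_propagate(star, ones)
print(hidden, mean_pool(hidden))
```

The hub and the leaves end up with different values even though their inputs were identical, which is the signal the stacked layers build on.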

**Load or create a new model**

In [None]:
load_previous_model = False

if load_previous_model:
    # Restore the checkpoint written by the "Save the model" cell
    checkpoint = torch.load("improved_gnn_checkpoint.pth", map_location=device)
    in_channels = checkpoint["in_channels"]
    hidden_channels = checkpoint["hidden_channels"]
    out_channels = checkpoint["out_channels"]
    num_layers = checkpoint["num_layers"]
    dropout = checkpoint["dropout"]
    start_epoch = checkpoint["epoch"] + 1
else:
    in_channels = 1
    hidden_channels = 256  # Increased hidden size for better capacity
    out_channels = len(class_mapping)
    num_layers = 8  # Stack more GCN layers
    dropout = 0.5
    start_epoch = 1

model = ImprovedGNNClassifier(in_channels, hidden_channels, out_channels, num_layers, dropout).to(device)
# Same learning rate as in the initial training run
optimizer = torch.optim.AdamW(model.parameters(), lr=0.005, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

if load_previous_model:
    # Resume from the saved weights and optimizer state
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

In [None]:
def train():
    model.train()
    total_loss = 0.0
    for batch in train_loader:
        batch = batch.to(device)
        optimizer.zero_grad()
        out = model(batch)
        loss = criterion(out, batch.y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / max(1, len(train_loader))

@torch.no_grad()
def test(loader):
    model.eval()
    correct = 0
    preds, labels = [], []
    for batch in loader:
        batch = batch.to(device)
        out = model(batch)
        predictions = out.argmax(dim=1)
        correct += (predictions == batch.y).sum().item()
        preds.extend(predictions.cpu().tolist())
        labels.extend(batch.y.cpu().tolist())
    accuracy = correct / len(loader.dataset)
    return accuracy, preds, labels

**Train the model**

In [None]:
# if needed, update your torch version
# !uv pip uninstall torch
# !uv pip install torch==2.10.*

In [None]:
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)

num_epochs = 120
for epoch in range(start_epoch, num_epochs + 1):
    loss = train()
    train_acc, _, _ = test(train_loader)
    test_acc, _, _ = test(test_loader)
    print(f"Epoch {epoch:02d}, Loss: {loss:.4f}, Train Acc: {train_acc:.4f}, Test Acc: {test_acc:.4f}")

    # Step the scheduler based on the latest loss
    scheduler.step(loss)
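
With `factor=0.5` and `patience=5`, the learning rate is halved once the monitored loss has failed to improve for more than five consecutive epochs. The class below is a simplified plain-Python sketch of that rule, assuming a relative improvement threshold of `1e-4` and no cooldown; `PlateauLR` is a hypothetical name and not the PyTorch class.

```python
class PlateauLR:
    """Simplified re-implementation of the reduce-on-plateau rule (mode='min')."""

    def __init__(self, lr, factor=0.5, patience=5, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        # An epoch counts as an improvement only if it beats the best loss
        # by more than the relative threshold
        if loss < self.best * (1 - self.threshold):
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce once the number of bad epochs exceeds the patience
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


tracker = PlateauLR(lr=0.005)
# One initial loss followed by six epochs without improvement
history = [tracker.step(loss) for loss in [1.0] + [1.0] * 6]
print(history)
```

In this toy run the rate stays at 0.005 for the first six steps and drops to 0.0025 on the seventh, that is, on the sixth epoch without improvement.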

Output of a previous run:
```
Epoch 01, Loss: 1.0301, Train Acc: 0.5174, Test Acc: 0.5153
Epoch 02, Loss: 0.8389, Train Acc: 0.6200, Test Acc: 0.6033
Epoch 03, Loss: 0.8088, Train Acc: 0.5354, Test Acc: 0.5287
Epoch 04, Loss: 0.7750, Train Acc: 0.5603, Test Acc: 0.5600
Epoch 05, Loss: 0.7607, Train Acc: 0.6637, Test Acc: 0.6480
Epoch 06, Loss: 0.7577, Train Acc: 0.6720, Test Acc: 0.6607
Epoch 07, Loss: 0.7170, Train Acc: 0.6157, Test Acc: 0.5973
Epoch 08, Loss: 0.6969, Train Acc: 0.6829, Test Acc: 0.6687
Epoch 09, Loss: 0.6886, Train Acc: 0.6023, Test Acc: 0.5793
Epoch 10, Loss: 0.6733, Train Acc: 0.6937, Test Acc: 0.6733
Epoch 11, Loss: 0.6586, Train Acc: 0.7749, Test Acc: 0.7540
Epoch 12, Loss: 0.6440, Train Acc: 0.7511, Test Acc: 0.7487
Epoch 13, Loss: 0.6406, Train Acc: 0.5489, Test Acc: 0.5467
Epoch 14, Loss: 0.6336, Train Acc: 0.7774, Test Acc: 0.7660
Epoch 15, Loss: 0.6112, Train Acc: 0.6420, Test Acc: 0.6320
Epoch 16, Loss: 0.6018, Train Acc: 0.6291, Test Acc: 0.6140
Epoch 17, Loss: 0.6030, Train Acc: 0.7763, Test Acc: 0.7747
Epoch 18, Loss: 0.5758, Train Acc: 0.7857, Test Acc: 0.7813
Epoch 19, Loss: 0.5853, Train Acc: 0.7374, Test Acc: 0.7353
Epoch 20, Loss: 0.5772, Train Acc: 0.5374, Test Acc: 0.5240
Epoch 21, Loss: 0.5930, Train Acc: 0.7754, Test Acc: 0.7607
Epoch 22, Loss: 0.5513, Train Acc: 0.7809, Test Acc: 0.7760
Epoch 23, Loss: 0.5405, Train Acc: 0.8166, Test Acc: 0.8053
Epoch 24, Loss: 0.5582, Train Acc: 0.7626, Test Acc: 0.7413
Epoch 25, Loss: 0.5531, Train Acc: 0.8231, Test Acc: 0.8080
Epoch 26, Loss: 0.5529, Train Acc: 0.7766, Test Acc: 0.7593
Epoch 27, Loss: 0.5257, Train Acc: 0.7771, Test Acc: 0.7467
Epoch 28, Loss: 0.5465, Train Acc: 0.7631, Test Acc: 0.7467
Epoch 29, Loss: 0.5231, Train Acc: 0.8194, Test Acc: 0.7927
Epoch 30, Loss: 0.5167, Train Acc: 0.8326, Test Acc: 0.8127
Epoch 31, Loss: 0.5167, Train Acc: 0.7423, Test Acc: 0.7253
Epoch 32, Loss: 0.5202, Train Acc: 0.8291, Test Acc: 0.8113
Epoch 33, Loss: 0.5230, Train Acc: 0.7911, Test Acc: 0.7700
Epoch 34, Loss: 0.4981, Train Acc: 0.8377, Test Acc: 0.8200
Epoch 35, Loss: 0.5222, Train Acc: 0.7386, Test Acc: 0.7267
Epoch 36, Loss: 0.5015, Train Acc: 0.7903, Test Acc: 0.7680
Epoch 37, Loss: 0.4928, Train Acc: 0.8089, Test Acc: 0.7913
Epoch 38, Loss: 0.5122, Train Acc: 0.6820, Test Acc: 0.6653
Epoch 39, Loss: 0.4862, Train Acc: 0.8469, Test Acc: 0.8327
Epoch 40, Loss: 0.4738, Train Acc: 0.8520, Test Acc: 0.8373
Epoch 41, Loss: 0.4997, Train Acc: 0.8377, Test Acc: 0.8227
Epoch 42, Loss: 0.5066, Train Acc: 0.7291, Test Acc: 0.7140
Epoch 43, Loss: 0.4735, Train Acc: 0.7591, Test Acc: 0.7507
Epoch 44, Loss: 0.4872, Train Acc: 0.8377, Test Acc: 0.8220
Epoch 45, Loss: 0.4716, Train Acc: 0.8629, Test Acc: 0.8360
Epoch 46, Loss: 0.4819, Train Acc: 0.8617, Test Acc: 0.8280
Epoch 47, Loss: 0.4629, Train Acc: 0.8363, Test Acc: 0.8140
Epoch 48, Loss: 0.4457, Train Acc: 0.8720, Test Acc: 0.8433
Epoch 49, Loss: 0.4503, Train Acc: 0.7811, Test Acc: 0.7653
Epoch 50, Loss: 0.4520, Train Acc: 0.8346, Test Acc: 0.8087
Epoch 51, Loss: 0.4787, Train Acc: 0.8520, Test Acc: 0.8240
Epoch 52, Loss: 0.4457, Train Acc: 0.8523, Test Acc: 0.8273
Epoch 53, Loss: 0.4662, Train Acc: 0.8520, Test Acc: 0.8187
Epoch 54, Loss: 0.4388, Train Acc: 0.8666, Test Acc: 0.8487
Epoch 55, Loss: 0.4587, Train Acc: 0.8554, Test Acc: 0.8260
Epoch 56, Loss: 0.4575, Train Acc: 0.8603, Test Acc: 0.8387
Epoch 57, Loss: 0.4202, Train Acc: 0.8709, Test Acc: 0.8400
Epoch 58, Loss: 0.4205, Train Acc: 0.8751, Test Acc: 0.8367
Epoch 59, Loss: 0.4349, Train Acc: 0.8697, Test Acc: 0.8520
Epoch 60, Loss: 0.4291, Train Acc: 0.8434, Test Acc: 0.8240
Epoch 61, Loss: 0.4300, Train Acc: 0.7560, Test Acc: 0.7327
Epoch 62, Loss: 0.4213, Train Acc: 0.8620, Test Acc: 0.8460
Epoch 63, Loss: 0.4200, Train Acc: 0.8737, Test Acc: 0.8480
Epoch 64, Loss: 0.4226, Train Acc: 0.8894, Test Acc: 0.8633
Epoch 65, Loss: 0.4125, Train Acc: 0.8634, Test Acc: 0.8373
Epoch 66, Loss: 0.4110, Train Acc: 0.8686, Test Acc: 0.8380
Epoch 67, Loss: 0.4154, Train Acc: 0.8434, Test Acc: 0.8127
Epoch 68, Loss: 0.3913, Train Acc: 0.8809, Test Acc: 0.8540
Epoch 69, Loss: 0.3951, Train Acc: 0.8783, Test Acc: 0.8500
Epoch 70, Loss: 0.3896, Train Acc: 0.8771, Test Acc: 0.8493
Epoch 71, Loss: 0.3807, Train Acc: 0.8671, Test Acc: 0.8500
Epoch 72, Loss: 0.3976, Train Acc: 0.8529, Test Acc: 0.8360
Epoch 73, Loss: 0.3960, Train Acc: 0.8763, Test Acc: 0.8400
Epoch 74, Loss: 0.4049, Train Acc: 0.8714, Test Acc: 0.8320
Epoch 75, Loss: 0.3970, Train Acc: 0.8809, Test Acc: 0.8573
Epoch 76, Loss: 0.4003, Train Acc: 0.8629, Test Acc: 0.8327
Epoch 77, Loss: 0.3927, Train Acc: 0.8874, Test Acc: 0.8667
Epoch 78, Loss: 0.3541, Train Acc: 0.9040, Test Acc: 0.8813
Epoch 79, Loss: 0.3515, Train Acc: 0.8814, Test Acc: 0.8373
Epoch 80, Loss: 0.3496, Train Acc: 0.8977, Test Acc: 0.8693
Epoch 81, Loss: 0.3393, Train Acc: 0.9031, Test Acc: 0.8740
Epoch 82, Loss: 0.3398, Train Acc: 0.8734, Test Acc: 0.8327
Epoch 83, Loss: 0.3437, Train Acc: 0.9126, Test Acc: 0.8833
Epoch 84, Loss: 0.3306, Train Acc: 0.8657, Test Acc: 0.8313
Epoch 85, Loss: 0.3125, Train Acc: 0.8997, Test Acc: 0.8640
Epoch 86, Loss: 0.3107, Train Acc: 0.9089, Test Acc: 0.8627
Epoch 87, Loss: 0.3271, Train Acc: 0.9103, Test Acc: 0.8780
Epoch 88, Loss: 0.3325, Train Acc: 0.8997, Test Acc: 0.8673
Epoch 89, Loss: 0.3325, Train Acc: 0.9040, Test Acc: 0.8860
Epoch 90, Loss: 0.3404, Train Acc: 0.9174, Test Acc: 0.8840
Epoch 91, Loss: 0.3225, Train Acc: 0.9149, Test Acc: 0.8767
Epoch 92, Loss: 0.3150, Train Acc: 0.9103, Test Acc: 0.8640
Epoch 93, Loss: 0.3097, Train Acc: 0.9189, Test Acc: 0.8753
Epoch 94, Loss: 0.3011, Train Acc: 0.9186, Test Acc: 0.8827
Epoch 95, Loss: 0.2842, Train Acc: 0.9206, Test Acc: 0.8773
Epoch 96, Loss: 0.3029, Train Acc: 0.9254, Test Acc: 0.8900
Epoch 97, Loss: 0.2822, Train Acc: 0.9226, Test Acc: 0.8867
Epoch 98, Loss: 0.2841, Train Acc: 0.9197, Test Acc: 0.8787
Epoch 99, Loss: 0.2867, Train Acc: 0.9209, Test Acc: 0.8833
Epoch 100, Loss: 0.2917, Train Acc: 0.9266, Test Acc: 0.8840
```
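
Test accuracy fluctuates noticeably from epoch to epoch in this log (for example 0.8813 at epoch 78, then 0.8373 at epoch 79), and the highest value, 0.8900, is reached at epoch 96 rather than at the last epoch. A minimal sketch for extracting the best epoch from such printed output; `parse_log` is a hypothetical helper, and the regular expression assumes the exact print format used in the training loop:

```python
import re

LOG_PATTERN = re.compile(
    r"Epoch (\d+), Loss: ([\d.]+), Train Acc: ([\d.]+), Test Acc: ([\d.]+)"
)


def parse_log(text):
    """Turn the printed training log into a list of dicts, one per epoch."""
    rows = []
    for match in LOG_PATTERN.finditer(text):
        epoch, loss, train_acc, test_acc = match.groups()
        rows.append({
            "epoch": int(epoch),
            "loss": float(loss),
            "train_acc": float(train_acc),
            "test_acc": float(test_acc),
        })
    return rows


# Three lines copied from the log above
sample = """
Epoch 95, Loss: 0.2842, Train Acc: 0.9206, Test Acc: 0.8773
Epoch 96, Loss: 0.3029, Train Acc: 0.9254, Test Acc: 0.8900
Epoch 97, Loss: 0.2822, Train Acc: 0.9226, Test Acc: 0.8867
"""
rows = parse_log(sample)
best = max(rows, key=lambda row: row["test_acc"])
print(best["epoch"], best["test_acc"])
```

Tracking the best epoch this way is one option for deciding which weights to keep when the final epoch is not the strongest one.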

**Display results**

In [None]:
# Final report
_, all_preds, all_labels = test(test_loader)
print("\nClassification Report:")
print(classification_report(all_labels, all_preds, target_names=list(class_mapping.keys())))

**Save the model**

In [None]:
checkpoint = {
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    # model hyperparameters
    "in_channels": in_channels,
    "hidden_channels": hidden_channels,
    "out_channels": out_channels,
    "num_layers": num_layers,
    "dropout": dropout,
}

torch.save(checkpoint, "improved_gnn_checkpoint.pth")