In this nPrint OS Detection project, we apply machine learning to fingerprint operating systems from their network traffic patterns. The dataset contains traffic from 13 classes of operating systems, and our first task is to extract the labels from the pcapng file. The labels are stored in the packet comments, and I used tshark to read them. Update the path below to match your local installation.

In [1]:
TSHARK = r"C:\Program Files\Wireshark\tshark.exe"
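
If tshark is already on the PATH, its location can be discovered instead of hard-coded. A minimal sketch, assuming the executable is named tshark; find_tshark is a hypothetical helper and not part of the pipeline below:

```python
import os
import shutil

def find_tshark(default=r"C:\Program Files\Wireshark\tshark.exe"):
    """Return a usable tshark path, or None if none is found (hypothetical helper)."""
    found = shutil.which("tshark")  # searches the directories listed in PATH
    if found:
        return found
    # Fall back to the hard-coded default only if it actually exists.
    return default if os.path.exists(default) else None

print(find_tshark())
```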

As we will see later, we need a mechanism to match the extracted OS labels to the corresponding packets. I used a dictionary mapping timestamps to OS labels, which assumes that each packet's timestamp is unique.
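
The uniqueness assumption matters because a dictionary keeps only the last label written for a repeated key. A small sketch of how the assumption could be checked on a list of packet timestamps; the helper name and the sample values are illustrative and not taken from the dataset:

```python
from collections import Counter

def duplicate_timestamps(timestamps):
    """Return the timestamps that occur more than once, with their counts."""
    counts = Counter(timestamps)
    return {ts: n for ts, n in counts.items() if n > 1}

# Made-up timestamps for illustration; the second value appears twice.
sample = ["1600000000.1", "1600000000.2", "1600000000.2", "1600000000.3"]
print(duplicate_timestamps(sample))
```

An empty result means every timestamp maps to exactly one packet, so no label is silently overwritten.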

In [2]:
import subprocess
import os
import re
from collections import OrderedDict

def extract_os_labels(pcap_file):
    """
    Extract OS labels from pcapng packet comments using tshark.
    Returns a dictionary mapping normalized timestamps of packets to OS labels.
    """
    if not os.path.exists(TSHARK):
        print("tshark not found")
        return {}

    ts_to_label = OrderedDict()
    pending_label = None

    cmd = [TSHARK, "-r", pcap_file, "-V"]
    # Discard stderr: an unread stderr pipe can fill up and block tshark on large captures.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, text=True)

    for line in proc.stdout:
        if "Packet comments" in line:
            raw = next(proc.stdout, "").strip()
            # Keep the text after the first "_" that follows the first ",".
            # partition() never raises, even when the "_" is missing after the comma.
            _, _, pair = raw.partition(",")
            _, _, hard = pair.partition("_")
            pending_label = hard.strip() or None
            continue

        if pending_label is not None:
            m_ts = re.search(r"Epoch Arrival Time:\s*([0-9]+\.[0-9]+)", line)
            if m_ts:
                ts = m_ts.group(1).rstrip('0').rstrip('.')
                ts_to_label[ts] = pending_label
                pending_label = None

    proc.wait()
    return ts_to_label
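
The timestamp is normalized because the same instant can be printed with a different number of trailing zeros depending on the tool. A minimal sketch of that normalization, assuming tshark prints more decimal digits than the flow parser; the sample values are made up, and the guard for integer strings is an addition that the function above does not need, since its regex always requires a decimal point:

```python
def normalize_ts(ts):
    """Strip trailing zeros after the decimal point; leave integer strings untouched."""
    if "." not in ts:
        return ts
    return ts.rstrip("0").rstrip(".")

# Made-up values: the same instant printed with 9 and with 6 decimal digits.
tshark_style = "1600000000.123456000"
parser_style = "1600000000.123456"

print(normalize_ts(tshark_style))
print(normalize_ts(parser_style))
```

Both calls produce the same string, which is what lets the value serve as a dictionary key on both sides of the lookup.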

Now onto data processing. We will use netml to convert the packets into flows and extract two feature sets from each flow: IAT (packet inter-arrival times) and STATS (summary statistics such as flow duration, packet and byte rates, and packet-size statistics). Operating systems differ in how their TCP/IP stacks pace traffic, partly because of variations in congestion control algorithms, so IAT serves as a temporal fingerprint. STATS captures structural protocol differences, offering features complementary to IAT.
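
To make the two feature families concrete, here is a simplified sketch of what they measure for a single flow. It is not netml's implementation, which is more elaborate; the function names and the sample flow are made up for illustration:

```python
import statistics

def inter_arrival_times(timestamps):
    """Gaps between consecutive packet timestamps, in seconds."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]

def size_summary(packet_sizes):
    """A few summary statistics over packet sizes, in bytes."""
    return {
        "mean": statistics.mean(packet_sizes),
        "min": min(packet_sizes),
        "max": max(packet_sizes),
    }

# Made-up flow of three packets: arrival times (seconds) and sizes (bytes).
times = [10.0, 10.5, 12.0]
sizes = [60, 1500, 60]

print(inter_arrival_times(times))
print(size_summary(sizes))
```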

In [3]:
from netml.pparser.parser import PCAP
import numpy as np

def extract_features(pcap_file):
    """
    Extract IAT and STATS features from a pcap file using netml.
    Returns combined feature array and list of flows.
    """
    pcap = PCAP(pcap_file)
    pcap.pcap2flows()

    pcap.flow2features('IAT', fft=False, header=False)
    iat_features = pcap.features.copy()

    pcap.flow2features('STATS', fft=False, header=False)
    stats_features = pcap.features

    combined = np.hstack([iat_features, stats_features])
    return combined, pcap.flows



The flows above carry no information about the operating system labels that we extracted. To label a flow, we take its first packet and look up that packet's timestamp in the labels dictionary. The packets of each flow are guaranteed to come from the same source IP. However, the same IP may map to multiple operating systems in this dataset, so assigning the first packet's operating system to the entire flow costs some label accuracy. Because the dataset contains millions of packets, we cannot afford to train on individual packets, and we accept this loss of accuracy as the price of working with flows.

In [4]:
import pandas as pd

def build_dataset(features, flows, ts_to_label):
    """
    Build labeled dataset by matching flow timestamps to OS labels.
    Returns feature matrix X and label array y.
    """
    df = pd.DataFrame(features)

    labels = []
    for flow_key, packets in flows:
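        # Same normalization as extract_os_labels; strip zeros only after a decimal
        # point so a whole-second timestamp does not lose its integer digits.
        ts = str(packets[0].time)
        if '.' in ts:
            ts = ts.rstrip('0').rstrip('.')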
        labels.append(ts_to_label.get(ts))

    df['label'] = labels
    df = df.dropna(subset=['label'])

    X = df.drop(columns=['label']).values
    y = df['label'].values
    return X, y
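
The accuracy lost by labeling a whole flow from its first packet can be estimated by counting the flows whose packets carry more than one label. A rough sketch, assuming flows has the same (key, packets) shape used above and every packet exposes a time attribute; the helper names, the toy packets, and the labels os_a and os_b are all made up:

```python
from types import SimpleNamespace

def normalize_ts(value):
    """Same normalization idea as in build_dataset."""
    text = str(value)
    return text.rstrip("0").rstrip(".") if "." in text else text

def mixed_label_fraction(flows, ts_to_label):
    """Fraction of labeled flows whose packets map to more than one OS label."""
    labeled = 0
    mixed = 0
    for _key, packets in flows:
        seen = {ts_to_label.get(normalize_ts(p.time)) for p in packets}
        seen.discard(None)  # ignore packets without a label
        if not seen:
            continue
        labeled += 1
        if len(seen) > 1:
            mixed += 1
    return mixed / labeled if labeled else 0.0

# Toy data: flow "a" is consistent, flow "b" mixes two labels.
pkt = lambda t: SimpleNamespace(time=t)
toy_flows = [("a", [pkt(1.5), pkt(1.75)]), ("b", [pkt(2.5), pkt(2.75)])]
toy_labels = {"1.5": "os_a", "1.75": "os_a", "2.5": "os_a", "2.75": "os_b"}

print(mixed_label_fraction(toy_flows, toy_labels))
```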

Now onto training and evaluation. We will test four models in this project. Ridge and logistic regression are linear models that provide simple, fast baselines. The neural network is a naive attempt to capture non-linear relationships. Lastly, the random forest splits on interpretable feature thresholds, allowing us to understand which specific protocol characteristics distinguish operating systems, unlike neural networks, which operate as black boxes.

In [5]:
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier, LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

def train_and_evaluate(X, y):
    """
    Train and evaluate multiple classifiers using balanced accuracy.
    """
    le = LabelEncoder()
    y = le.fit_transform(y)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    models = {
        "Ridge": (RidgeClassifier(alpha=1.0, random_state=42), True),
        "Logistic": (LogisticRegression(max_iter=500, random_state=42), True),
        "NeuralNet": (MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=42), True),
        "RandomForest": (RandomForestClassifier(n_estimators=100, random_state=42), False),
    }

    for name, (model, needs_scaling) in models.items():
        if needs_scaling:
            model.fit(X_train_scaled, y_train)
            preds = model.predict(X_test_scaled)
        else:
            model.fit(X_train, y_train)
            preds = model.predict(X_test)
        acc = balanced_accuracy_score(y_test, preds)
        print(f"{name:12} Balanced Accuracy: {acc:.4f}")
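
Balanced accuracy, the metric used in train_and_evaluate, is the mean of the per-class recall, so a rare class counts as much as a common one. The sketch below re-implements it in plain Python for illustration only (the pipeline keeps using scikit-learn's version); the toy labels are made up:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Imbalanced toy labels, with the majority class predicted everywhere.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain)                              # plain accuracy rewards the majority guess
print(balanced_accuracy(y_true, y_pred))  # balanced accuracy does not

# With 13 classes, always predicting a single class scores 1/13.
floor_13 = balanced_accuracy(list(range(13)), [0] * 13)
print(floor_13)
```

For the 13 classes in this dataset, a model that always predicts one class scores 1/13, or about 0.077, which is the floor to keep in mind when reading balanced accuracy values.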

We are now ready to feed raw data to our pipeline. Note that processing the complete os-100-packets.pcapng file took 15 minutes on my machine. The results are discussed in the write-up. The dataset can be found at https://nprint.github.io/benchmarks/os_detection/nprint_os_detection.html

In [6]:
pcap_file = "os-100-packets.pcapng"

ts_to_label = extract_os_labels(pcap_file)

features, flows = extract_features(pcap_file)

X, y = build_dataset(features, flows, ts_to_label)
train_and_evaluate(X, y)

Ridge        Balanced Accuracy: 0.2380
Logistic     Balanced Accuracy: 0.2875
NeuralNet    Balanced Accuracy: 0.4448
RandomForest Balanced Accuracy: 0.6557
