# Notebook 01: Fetching & Preparing the CVE Data

This notebook:
1. Initializes paths and the environment
2. Optionally reads `.env` for the API key and contact address
3. Runs the incremental CVE fetch (JSONL append)
4. Exports CSV files (CVSS v4.0 / v3.1 / v3.0 / v2)
5. Shows samples of the CSVs

---


## 1. Initialization & Path Setup
Define the base directories and script paths.


In [1]:
from pathlib import Path
import os
import json
import datetime as dt

REPO_DIR = Path("..").resolve()
DATA_DIR = REPO_DIR / "data"
RAW_DIR = DATA_DIR / "raw"
SCRIPTS_DIR = REPO_DIR / "scripts"

# Scripts
FETCH_SCRIPT = SCRIPTS_DIR / "nvd_cve_fetcher" / "nvd_cve_fetcher.py"
CSV_SCRIPT = SCRIPTS_DIR / "cves_json_to_csv.py"

# CSV export target (can be overridden via the CSV_OUT_DIR environment variable; default: RAW_DIR)
CSV_OUT_DIR = os.getenv("CSV_OUT_DIR", str(RAW_DIR))

print("REPO_DIR:", REPO_DIR)
print("RAW_DIR:", RAW_DIR)
print("CSV_OUT_DIR:", CSV_OUT_DIR)
print("FETCH_SCRIPT exists:", FETCH_SCRIPT.exists())
print("CSV_SCRIPT exists:", CSV_SCRIPT.exists())

REPO_DIR: /home/konrad/projects/master/cve_severity_classifier
RAW_DIR: /home/konrad/projects/master/cve_severity_classifier/data/raw
CSV_OUT_DIR: /home/konrad/projects/master/cve_severity_classifier/data/raw
FETCH_SCRIPT exists: True
CSV_SCRIPT exists: True


## 2. Optional Environment Variables (.env)
Detect the API key and the contact e-mail (if present); an API key grants a higher rate limit.


In [2]:
from dotenv import dotenv_values

ENV_PATH = REPO_DIR / ".env"
if ENV_PATH.exists():
    env_vals = dotenv_values(str(ENV_PATH))
    print(".env found, keys:")
    for k in ("NVD_API_KEY", "CONTACT_EMAIL"):
        val = env_vals.get(k)
        # Only report whether a value is present; never print the secret itself.
        print(f"  {k}: {'set' if val else 'not set'}")
else:
    print(".env not found; running without an API key (lower rate limit).")

.env found, keys:
  NVD_API_KEY: set
  CONTACT_EMAIL: set
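
How the two values might end up in an HTTP request can be sketched as follows. This is a minimal illustration, not the fetcher's actual code: the helper name `build_nvd_headers` is hypothetical, and the User-Agent format is copied from the fetcher log shown in section 3. NVD API 2.0 accepts the key in the `apiKey` request header; the log in section 3 reports a limit of 50 requests per 30 s when a key is set.

```python
from typing import Dict, Optional


def build_nvd_headers(api_key: Optional[str], contact_email: Optional[str]) -> Dict[str, str]:
    """Build request headers for the NVD API (hypothetical helper)."""
    # User-Agent format mirrors the one printed in the fetcher log.
    agent = "nvd_cve_fetcher/1.0"
    if contact_email:
        agent += f" (+mailto:{contact_email})"
    headers = {"User-Agent": agent}
    if api_key:
        # NVD API 2.0 reads the key from the "apiKey" header.
        headers["apiKey"] = api_key
    return headers
```

In the notebook this could be called as `build_nvd_headers(env_vals.get("NVD_API_KEY"), env_vals.get("CONTACT_EMAIL"))`; without a key, only the User-Agent header is sent.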


## 3. Running the CVE Fetch
Start the incremental fetcher (including update logic and interval control).
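
The fetcher's internals are not shown in this notebook, so the following is only a rough sketch of what an incremental JSONL append could look like. It assumes that every line of the file is one JSON object with a top-level `id` field; the function names are hypothetical. It skips records that are already stored and does not cover updating CVEs that were modified upstream.

```python
import json
from pathlib import Path
from typing import Iterable, Set


def load_seen_ids(jsonl_path: Path) -> Set[str]:
    """Collect the CVE ids already stored in the JSONL file."""
    seen: Set[str] = set()
    if not jsonl_path.exists():
        return seen
    with jsonl_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                # Assumption: each record carries its id at the top level.
                seen.add(json.loads(line)["id"])
    return seen


def append_new(jsonl_path: Path, records: Iterable[dict]) -> int:
    """Append records whose id is not yet in the file; return the number written."""
    seen = load_seen_ids(jsonl_path)
    written = 0
    with jsonl_path.open("a", encoding="utf-8") as fh:
        for rec in records:
            if rec["id"] in seen:
                continue
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
            seen.add(rec["id"])
            written += 1
    return written
```

Appending line by line keeps an interrupted run recoverable: everything written before the interruption stays valid JSONL, and the next run only adds what is missing.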


In [3]:
# Run the fetcher
import subprocess, sys

print("Starting fetcher … this may take a while depending on network and rate limit.")
ret = subprocess.run([sys.executable, str(FETCH_SCRIPT)], cwd=str(REPO_DIR))
print("Fetcher Exit-Code:", ret.returncode)
# Exit code 130 (interrupted via Ctrl+C / SIGINT) is tolerated alongside 0.
assert ret.returncode in (0, 130), "Fetcher failed"

Starting fetcher … this may take a while depending on network and rate limit.


[2025-09-13 20:26:14] [INFO] NVD CVE Fetcher for CVE Severity Classification (Prototyp)
[2025-09-13 20:26:14] [INFO] Zeitraum: 1999-01-01 -> 2025-09-13
[2025-09-13 20:26:14] [INFO] API Key: JA
[2025-09-13 20:26:14] [INFO] Kontakt: konrad.eckhardt@mni.thm.de
[2025-09-13 20:26:14] [INFO] User-Agent: nvd_cve_fetcher/1.0 (+mailto:konrad.eckhardt@mni.thm.de)
[2025-09-13 20:26:14] [INFO] Limit: 50/30s
[2025-09-13 20:26:14] [INFO] Basis-Schlaf: 1.00s (Jitter bis 0.30s)
[2025-09-13 20:26:14] [INFO] Fenster: erlaubt 120 | genutzt 120
[2025-09-13 20:26:14] [INFO] Results/Page: erlaubt 2000 | genutzt 2000
[2025-09-13 20:26:14] [INFO] Output: data/raw/cves.jsonl
[2025-09-13 20:26:14] [INFO] Start in 2s
Traceback (most recent call last):
  File "/home/konrad/projects/master/cve_severity_classifier/scripts/nvd_cve_fetcher/nvd_cve_fetcher.py", line 961, in <module>
    sys.exit(run())
             ^^^^^
  File "/home/konrad/projects/master/cve_severity_classifier/scripts/nvd_cve_fetcher/nvd_cve_fetch

KeyboardInterrupt: 
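
The log above reports a window of 120 (days) and a base sleep of 1.00 s with up to 0.30 s of jitter. A minimal sketch of both ideas follows, assuming the fetcher walks the date range in consecutive windows of at most 120 days and pauses between requests. The names `date_windows` and `pause_seconds` are hypothetical and not taken from the fetcher script.

```python
import datetime as dt
import random
from typing import Iterator, Tuple


def date_windows(
    start: dt.date, end: dt.date, max_days: int = 120
) -> Iterator[Tuple[dt.date, dt.date]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end]."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    cursor = start
    while cursor <= end:
        # Each window spans at most max_days calendar days, both ends inclusive.
        window_end = min(cursor + dt.timedelta(days=max_days - 1), end)
        yield cursor, window_end
        cursor = window_end + dt.timedelta(days=1)


def pause_seconds(base: float = 1.0, max_jitter: float = 0.3) -> float:
    """Base sleep plus a random jitter, so requests are not sent in lockstep."""
    return base + random.uniform(0.0, max_jitter)


for win_start, win_end in date_windows(dt.date(2025, 1, 1), dt.date(2025, 9, 13)):
    print(win_start, "->", win_end)
```

Between two requests, a loop would then call `time.sleep(pause_seconds())`.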

## 4. Generating the CSV Export
Convert the line-based JSONL collection into version-specific CSV files.


In [4]:
# Preprocessing / CSV export
import subprocess, sys

print("Starting CSV export …")
# The script uses its internal config (INPUT/OUT_DIR). To write to a different directory, adjust the config in the script.
ret_csv = subprocess.run([sys.executable, str(CSV_SCRIPT)], cwd=str(REPO_DIR))
print("CSV Exit-Code:", ret_csv.returncode)
assert ret_csv.returncode == 0, "CSV export failed"

Starting CSV export …


[2025-09-13 20:26:20] [INFO] Start Konvertierung: input=data/raw/cves.jsonl out_dir=data/raw
[2025-09-13 20:26:21] [INFO] Progress: 10000 CVEs verarbeitet
[2025-09-13 20:26:21] [INFO] Progress: 20000 CVEs verarbeitet
[2025-09-13 20:26:21] [INFO] Progress: 30000 CVEs verarbeitet
[2025-09-13 20:26:21] [INFO] Progress: 40000 CVEs verarbeitet
[2025-09-13 20:26:22] [INFO] Progress: 50000 CVEs verarbeitet
[2025-09-13 20:26:22] [INFO] Progress: 60000 CVEs verarbeitet
[2025-09-13 20:26:22] [INFO] Progress: 70000 CVEs verarbeitet
[2025-09-13 20:26:23] [INFO] Progress: 80000 CVEs verarbeitet
[2025-09-13 20:26:23] [INFO] Progress: 90000 CVEs verarbeitet
[2025-09-13 20:26:23] [INFO] Progress: 100000 CVEs verarbeitet
[2025-09-13 20:26:24] [INFO] Progress: 110000 CVEs verarbeitet
[2025-09-13 20:26:25] [INFO] Progress: 120000 CVEs verarbeitet
[2025-09-13 20:26:25] [INFO] Progress: 130000 CVEs verarbeitet
[2025-09-13 20:26:26] [INFO] Progress: 140000 CVEs verarbeitet
[2025-09-13 20:26:27] [INFO] Progr

CSV Exit-Code: 0


[2025-09-13 20:26:34] [INFO] v40: 11712 Zeilen
[2025-09-13 20:26:34] [INFO] v31: 172908 Zeilen
[2025-09-13 20:26:34] [INFO] v30: 53856 Zeilen
[2025-09-13 20:26:34] [INFO] v2: 188645 Zeilen
[2025-09-13 20:26:34] [INFO] Fertig: total=308736
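
The conversion script itself is not part of this notebook. The sketch below only illustrates how one record could be split by CVSS version, under these assumptions: each JSONL line is a CVE object in the NVD API 2.0 layout (`id`, `descriptions`, `metrics`), the first metric entry per version is used, and the English description is taken. The helper names are hypothetical. Under these assumptions a single CVE can yield rows in several files, which would be consistent with the per-version row counts above adding up to more than the reported total.

```python
from typing import Dict, Tuple

# Metric keys of the NVD API 2.0 layout, mapped to short version labels.
METRIC_KEYS = {
    "cvssMetricV40": "v40",
    "cvssMetricV31": "v31",
    "cvssMetricV30": "v30",
    "cvssMetricV2": "v2",
}


def english_description(cve: dict) -> str:
    """Return the first English description, or an empty string."""
    for entry in cve.get("descriptions", []):
        if entry.get("lang") == "en":
            return entry.get("value", "")
    return ""


def severity_rows(cve: dict) -> Dict[str, Tuple[str, str, str]]:
    """Return {version_label: (cve_id, severity, description)} per CVSS version present."""
    rows: Dict[str, Tuple[str, str, str]] = {}
    description = english_description(cve)
    for key, label in METRIC_KEYS.items():
        entries = cve.get("metrics", {}).get(key) or []
        if not entries:
            continue
        first = entries[0]  # assumption: take the first listed metric entry
        # v3.x/v4.0 keep baseSeverity inside cvssData; v2 keeps it on the metric entry.
        severity = first.get("cvssData", {}).get("baseSeverity") or first.get("baseSeverity")
        if severity:
            rows[label] = (cve["id"], severity, description)
    return rows
```

Each returned tuple matches the `cve_id, severity, description` columns of the exported CSVs and could be passed to a `csv.writer` for the corresponding version.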


## 5. Sample Preview of the CSVs
A quick look at the first rows of each generated CSV for validation.


In [5]:
# Show a sample of each dataset
import pandas as pd
from pathlib import Path

out_dir = Path(CSV_OUT_DIR)
for name in ["cves_v40.csv", "cves_v31.csv", "cves_v30.csv", "cves_v2.csv"]:
    p = out_dir / name
    if p.exists():
        print("Preview:", name)
        display(pd.read_csv(p, nrows=5))
    else:
        print("Not found:", p)

Preview: cves_v40.csv


Unnamed: 0,cve_id,severity,description
0,CVE-2017-2680,HIGH,Specially crafted PROFINET DCP broadcast packe...
1,CVE-2017-2681,HIGH,Specially crafted PROFINET DCP packets sent on...
2,CVE-2017-12741,HIGH,Specially crafted packets sent to port 161/udp...
3,CVE-2019-13939,HIGH,A vulnerability has been identified in APOGEE ...
4,CVE-2020-8899,CRITICAL,There is a buffer overwrite vulnerability in t...


Preview: cves_v31.csv


Unnamed: 0,cve_id,severity,description
0,CVE-1999-1568,HIGH,Off-by-one error in NcFTPd FTP server before 2...
1,CVE-1999-0426,CRITICAL,The default permissions of /dev/kmem in Linux ...
2,CVE-1999-1549,HIGH,Lynx 2.x does not properly distinguish between...
3,CVE-1999-1127,HIGH,Windows NT 4.0 does not properly shut down inv...
4,CVE-1999-1324,CRITICAL,VAXstations running Open VMS 5.3 through 5.5-2...


Preview: cves_v30.csv


Unnamed: 0,cve_id,severity,description
0,CVE-2000-0258,HIGH,IIS 4.0 and 5.0 allows remote attackers to cau...
1,CVE-2004-0847,CRITICAL,The Microsoft .NET forms authentication capabi...
2,CVE-2005-0109,MEDIUM,"Hyper-Threading technology, as used in FreeBSD..."
3,CVE-2006-1364,HIGH,Microsoft w3wp (aka w3wp.exe) does not properl...
4,CVE-2006-5847,MEDIUM,Cross-site scripting (XSS) vulnerability in in...


Preview: cves_v2.csv


Unnamed: 0,cve_id,severity,description
0,CVE-1999-0197,HIGH,finger 0@host on some systems may print inform...
1,CVE-1999-0198,HIGH,finger .@host on some systems may print inform...
2,CVE-1999-0200,HIGH,Windows NT FTP server (WFTP) with the guest ac...
3,CVE-1999-0205,MEDIUM,Denial of service in Sendmail 8.6.11 and 8.6.12.
4,CVE-1999-0220,HIGH,Attackers can do a denial of service of IRC by...
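
Beyond the first rows, a quick count of the severity labels can serve as a further sanity check. The sketch below uses only the standard library and assumes the column layout seen in the previews (an unnamed index column followed by `cve_id`, `severity`, `description`). The function name `severity_counts` and the sample rows are made up for illustration and are not taken from the exported files.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def severity_counts(handle: TextIO) -> Counter:
    """Count how often each severity label occurs in a CSV with a 'severity' column."""
    reader = csv.DictReader(handle)
    return Counter(row["severity"] for row in reader)


# Synthetic sample in the same layout as the exported CSVs (not real data).
sample = io.StringIO(
    ",cve_id,severity,description\n"
    "0,CVE-0000-0001,HIGH,first synthetic example\n"
    "1,CVE-0000-0002,MEDIUM,second synthetic example\n"
    "2,CVE-0000-0003,HIGH,third synthetic example\n"
)
print(severity_counts(sample))
```

For a real file, the handle would come from something like `open(out_dir / "cves_v31.csv", newline="", encoding="utf-8")`. The `Unnamed: 0` column in the previews typically appears when a CSV was written with an index column that has no header name; passing `index_col=0` to `pd.read_csv` would treat it as the index instead.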


## 6. Next Steps
- Continue with Notebook 02 (Preprocessing)
- Narrow the data down to the description → severity mapping
