# Assignment 1: Bronze Layer Example - CVE 2024 Data Ingestion
**DIC 587 - Data Intensive Computing - Fall 2025**

This notebook implements the Bronze layer of our medallion architecture:
- Downloads CVEProject/cvelistV5 repository 
- Filters to 2024 vulnerabilities only (~40,000 records)
- Stores raw JSON as Delta tables with ACID guarantees
- Demonstrates streaming data ingestion patterns

**Learning Objectives:**
- Understand Bronze layer concepts (raw data preservation)
- Practice JSON schema-on-read with PySpark
- Learn Delta Lake table registration
- Handle large-scale data downloads programmatically

## Step 1: Environment Setup and Data Download

In [0]:
import os
import json
import pandas as pd
import shutil
import zipfile
import urllib.request
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *

# Configuration
USE_OFFLINE = False  # Set True if you manually uploaded /FileStore/cvelistV5.zip
year = 2024   # Focus on 2024 CVEs only

# Paths - using /tmp and /FileStore to avoid Community Edition I/O errors
base_dir = "/tmp/cve/"
CATALOG = "workspace"
SCHEMA = "default"
DELTA_BRONZE_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/assignment1/cve/bronze"
os.makedirs(base_dir, exist_ok=True)
os.makedirs(DELTA_BRONZE_PATH, exist_ok=True)

# for registering sql table
schema_name = "cve_bronze"
table_name = "records"

print(f"Target: {year} CVE data (~40,000 records)")
print(f"Bronze location: {DELTA_BRONZE_PATH}")

Target: 2024 CVE data (~40,000 records)
Bronze location: /Volumes/workspace/default/assignment1/cve/bronze


## Step 2: Download CVE Repository

This uploads the entire CVE catalog into a databricks catalog.  You need to figure out how to filter it down to the 2024 records.  You also need to do the other medallion letters!

In [0]:
# Download CVE repository
print("Downloading CVE repository...")

# Configuration
volume_root = "/Volumes/workspace/default/assignment1"

# Your existing download/extract pattern
zip_dest = f"{base_dir}cvelistV5.zip"
extract_dir = f"{base_dir}cvelistV5-main"

# Create local temp directory
os.makedirs(base_dir, exist_ok=True)

# Clean previous extracts
if os.path.exists(extract_dir):
    shutil.rmtree(extract_dir, ignore_errors=True)

# Download the repository
zip_url = "https://github.com/CVEProject/cvelistV5/archive/refs/heads/main.zip"
print(f"Downloading from: {zip_url}")

with urllib.request.urlopen(zip_url) as response:
    data = response.read()
    
with open(zip_dest, "wb") as f:
    f.write(data)

print(f"Downloaded {len(data):,} bytes")

# Extract the ZIP
print("Extracting ZIP archive...")
with zipfile.ZipFile(zip_dest) as z:
    z.extractall(base_dir)

print("Extraction complete")

Downloading CVE repository...
Downloading from: https://github.com/CVEProject/cvelistV5/archive/refs/heads/main.zip
Downloaded 524,599,814 bytes
Extracting ZIP archive...
Extraction complete


In [0]:
# Process 2024 CVEs
print("Processing {year} CVEs...")

def process_year_cves(year, max_files=100000):
    """Process CVEs for a specific year, limiting to max_files"""
    
    cve_year_dir = f"{extract_dir}/cves/{year}"
    json_files = []
    
    print(f"Scanning directory: {cve_year_dir}")
    
    # read JSON files into list, filtering by target year
    if os.path.exists(cve_year_dir):
        file_count = 0
        for root, dirs, files in os.walk(cve_year_dir):
            for file in files:
                if file.endswith('.json') and f'CVE-{year}-' in file and file_count < max_files:
                    file_path = os.path.join(root, file)
                    try:
                        # Read and validate JSON
                        with open(file_path, 'r', encoding='utf-8') as f:
                            content = f.read()
                            cve_data = json.loads(content)
                            json_files.append(cve_data)
                            
                        file_count += 1
                        
                        if file_count % 1000 == 0:
                            print(f"Processed {file_count} CVE-{year} files...")
                            
                    except Exception as e:
                        print(f"Skipped {file}: {e}")
                        continue
                        
        print(f"Collected {len(json_files)} CVEs from {year}")
        return json_files
    else:
        print(f"Directory not found: {cve_year_dir}")
        return []

# Process target year
cves_year = process_year_cves(year, 100000)

print(f"Total {year} CVEs: {len(cves_year)}")

# function for handling complex nested structure of containers column
def to_json_safe(v):
    if v is None:
        return None
    if isinstance(v, (str, int, float, bool)):
        return v
    try:
        return json.dumps(v, ensure_ascii=False)
    except Exception:
        return str(v)

# Serverless approach: Use pandas + createDataFrame instead of sparkContext
    
# Convert list of dicts to pandas DataFrame first
print(f"Converting {len(cves_year)} CVEs to DataFrame...")

pdf = pd.DataFrame(cves_year)
pdf['containers'] = pdf['containers'].apply(to_json_safe)

# Convert pandas to Spark DataFrame (serverless compatible)
df_raw = spark.createDataFrame(pdf)

print(f"Total {year} CVE records found: {df_raw.count():,}")


Processing {year} CVEs...
Scanning directory: /tmp/cve/cvelistV5-main/cves/2024
Processed 1000 CVE-2024 files...
Processed 2000 CVE-2024 files...
Processed 3000 CVE-2024 files...
Processed 4000 CVE-2024 files...
Processed 5000 CVE-2024 files...
Processed 6000 CVE-2024 files...
Processed 7000 CVE-2024 files...
Processed 8000 CVE-2024 files...
Processed 9000 CVE-2024 files...
Processed 10000 CVE-2024 files...
Processed 11000 CVE-2024 files...
Processed 12000 CVE-2024 files...
Processed 13000 CVE-2024 files...
Processed 14000 CVE-2024 files...
Processed 15000 CVE-2024 files...
Processed 16000 CVE-2024 files...
Processed 17000 CVE-2024 files...
Processed 18000 CVE-2024 files...
Processed 19000 CVE-2024 files...
Processed 20000 CVE-2024 files...
Processed 21000 CVE-2024 files...
Processed 22000 CVE-2024 files...
Processed 23000 CVE-2024 files...
Processed 24000 CVE-2024 files...
Processed 25000 CVE-2024 files...
Processed 26000 CVE-2024 files...
Processed 27000 CVE-2024 files...
Processed 2

In [0]:
# Step 3: Complete Bronze Layer Implementation
# YOUR TASK: Add the missing Bronze layer components below

# Optimize for Community Edition
spark.conf.set("spark.sql.shuffle.partitions", "8")

# function for handling complex nested structure of containers column
def to_json_safe(v):
    if v is None:
        return None
    if isinstance(v, (str, int, float, bool)):
        return v
    try:
        return json.dumps(v, ensure_ascii=False)
    except Exception:
        return str(v)


def save_cves_to_delta_serverless(cves_list, year, delta_path):
    """Save CVEs to Delta Lake bronze layer - SERVERLESS COMPATIBLE"""
    
    if not cves_list:
        print(f"No CVEs to save for {year}")
        return
    
    # Serverless approach: Use pandas + createDataFrame instead of sparkContext

    # 1) Read raw JSON recursively and add lineage
    # Convert list of dicts to pandas DataFrame first
    print(f"Converting {len(cves_list)} CVEs to DataFrame...")

    pdf = pd.DataFrame(cves_list)
    pdf['containers'] = pdf['containers'].apply(to_json_safe)

    # Convert pandas to Spark DataFrame (serverless compatible)
    df_raw = spark.createDataFrame(pdf)

    # Add metadata columns
    df_bronze = (df_raw
                 .withColumn("_ingestion_timestamp", current_timestamp())
                 .withColumn("_ingestion_date", current_date())
                 .withColumn("_year", lit(year))
                 .withColumn("_record_id", monotonically_increasing_id()))
    
    print(f"{year} CVE records loaded: {df_bronze.count():,}")

    # 2) Add 2024 filtering logic
    #Completed above while reading JSON files

    # 3) Add data quality checks

    # Data Quality Checks
    print("\nPerforming data quality checks...")
    record_count = df_bronze.count()
    null_cve_ids = df_bronze.filter(col("_record_id").isNull()).count()
    unique_cve_ids = df_bronze.select("_record_id").distinct().count()

    assert record_count > 30000, f"Warning: {record_count:,} (<30,000) CVE records from {year}"
    print(f"Verified {record_count:,} (>30,000) CVE records from {year}")
        
    assert null_cve_ids == 0, f"Warning: {null_cve_ids} CVE records with null CVE id"
    print(f"Verified no null CVE ids from {year}")

    assert unique_cve_ids == record_count, f"Warning: {record_count - unique_cve_ids} CVE records with duplicate CVE ids"
    print(f"Verified {unique_cve_ids} unique CVE records from {year}")

    # 4) Write Delta table
    # Write to Delta Lake
    (df_bronze.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .option("overwriteSchema", "true")
    .option("delta.columnMapping.mode", "name")
    .save(delta_path))
    
    print(f"\n{year} Bronze layer created: {delta_path}")
    return df_bronze

# Save target year with serverless approach
df_2024 = save_cves_to_delta_serverless(cves_year, year, DELTA_BRONZE_PATH)

# Create schema if it does not exist
spark.sql("CREATE SCHEMA IF NOT EXISTS workspace.cve_bronze")

# 5) Register table for SQL access
(df_2024.write
    .format("delta")
    .mode("overwrite")
    .option("mergeSchema", "true")
    .option("overwriteSchema", "true")
    .option("delta.columnMapping.mode", "name")
    .saveAsTable("cve_bronze.records"))



Converting 38753 CVEs to DataFrame...
2024 CVE records loaded: 38,753

Performing data quality checks...
Verified 38,753 (>30,000) CVE records from 2024
Verified no null CVE ids from 2024
Verified 38753 unique CVE records from 2024

2024 Bronze layer created: /Volumes/workspace/default/assignment1/cve/bronze


In [0]:
# 6) Verification and screenshots

# bronze record count
print(f"{year} Bronze Layer Record Count: {df_2024.count()}\n") 

# bronze schema
spark.read.format("delta").load(DELTA_BRONZE_PATH).printSchema()
print()

# bronze describe detail
spark.sql(f"DESCRIBE DETAIL {schema_name}.{table_name}").show(truncate=False)

print()
print("📸 REQUIRED SCREENSHOTS:")
print("   • df_2024.count() showing ~40,000 records")
print("   • DESCRIBE DETAIL cve_bronze.records output")
print("   • Data quality assertion results")
print("   • Delta files visible in path")

2024 Bronze Layer Record Count: 38753

root
 |-- dataType: string (nullable = true)
 |-- dataVersion: string (nullable = true)
 |-- cveMetadata: struct (nullable = true)
 |    |-- assignerOrgId: string (nullable = true)
 |    |-- assignerShortName: string (nullable = true)
 |    |-- cveId: string (nullable = true)
 |    |-- datePublished: string (nullable = true)
 |    |-- dateRejected: string (nullable = true)
 |    |-- dateReserved: string (nullable = true)
 |    |-- dateUpdated: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- containers: string (nullable = true)
 |-- _ingestion_timestamp: timestamp (nullable = true)
 |-- _ingestion_date: date (nullable = true)
 |-- _year: integer (nullable = true)
 |-- _record_id: long (nullable = true)


+------+------------------------------------+----------------------------+-----------+--------+-----------------------+-------------------+----------------+-----------------+--------+-----------+-----------------------------

In [0]:
# Cleanup temporary files
print("Cleaning temporary files...")

import shutil
try:
    shutil.rmtree("/tmp/cve_graph_demo")
    print("Temporary files cleaned")
except:
    print("Temporary files will be cleaned up automatically")

Cleaning temporary files...
Temporary files will be cleaned up automatically
