# Bronze → Silver: Enron Email Corpus

## Purpose
Convert the raw Enron email CSV in the Bronze layer into a cleaned Silver Delta table for downstream NLP and analytics.

## Dataset
- **Source:** Enron email corpus (public dataset)
- **File:** emails.csv (1.36 GB)
- **Strategy:** 10% sample (`seed=42`) for development, full load for production

## Transformations
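- Parse email structure (listed as a goal; not implemented in the cells below)
- Extract sender/recipient (listed as a goal; not implemented in the cells below)
- Drop rows with a null `message`
- Clean text content (collapse whitespace runs to a single space)
- Add text length metrics (`message_length`, plus a 200-character `message_preview`)
- Sample for performance (10% of rows)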

**Author:** Kevin  
**Date:** Feb 9, 2026
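
The Transformations list names header parsing and sender/recipient extraction, but the code cells in this notebook treat `message` as one opaque string. A minimal sketch of that step, using only the Python standard library, might look like the following. The function name `parse_enron_message` and the sample text are hypothetical (the sample is made up, not taken from the corpus), and wrapping the function in a Spark UDF is left out. Parsing would need to run on the raw `message` before whitespace is collapsed, because the blank line that separates headers from body is lost once newlines become spaces.

```python
from email import message_from_string
from email.utils import getaddresses


def parse_enron_message(raw: str) -> dict:
    """Split one raw RFC 822-style message into header fields and body."""
    msg = message_from_string(raw)
    # getaddresses splits comma-separated recipient lists into (name, address) pairs
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "message_id": msg.get("Message-ID"),
        "date": msg.get("Date"),
        "sender": msg.get("From"),
        "recipients": recipients,
        "subject": msg.get("Subject"),
        "body": msg.get_payload(),
    }


# Made-up message in the same layout as the `message` column (not real data)
sample = (
    "Message-ID: <1.example@thyme>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: alice@example.com\n"
    "To: bob@example.com, carol@example.com\n"
    "Subject: Forecast\n"
    "\n"
    "Here is the forecast.\n"
)
parsed = parse_enron_message(sample)
print(parsed["sender"], parsed["recipients"])
```

Returning a plain dict keeps the sketch independent of Spark; in a pipeline the same fields would more likely become separate Silver columns.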


In [0]:
from pyspark.sql.functions import (
    col, length, trim, lower, split, regexp_replace,
    current_timestamp, count, when, substring
)

storage_account_name = "stgolistmigration"
account_key = ""  # left blank in this export; supply at runtime and never commit the real key

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

def get_bronze_path(filename):
    return f"abfss://bronze@{storage_account_name}.dfs.core.windows.net/{filename}"

def get_silver_path(table):
    return f"abfss://silver@{storage_account_name}.dfs.core.windows.net/{table}/"

print("✅ Config loaded")


✅ Config loaded
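
The `account_key` value is blank in the cell above. One way to avoid hard-coding it is to read it from the environment and fail early when it is missing. This is only a sketch: the helper name `load_account_key` and the variable name `AZURE_STORAGE_ACCOUNT_KEY` are assumptions, not something the notebook defines. On Databricks, a secret scope read with `dbutils.secrets.get(scope=..., key=...)` would be the more usual choice, with workspace-specific scope and key names.

```python
import os


def load_account_key(env_var: str = "AZURE_STORAGE_ACCOUNT_KEY") -> str:
    """Return the storage key from the environment, failing fast if unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Raise before any Spark call so the error points at the real cause
        raise RuntimeError(f"Set {env_var} before running this notebook")
    return key
```

With this in place, the config cell could use `account_key = load_account_key()` instead of a literal string.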


In [0]:
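# Read the full CSV, then keep a 10% sample for faster downstream processing
bronze_path = get_bronze_path("emails.csv")

print("📖 Reading Enron emails (sampling 10% for speed)...")
print(f"   Path: {bronze_path}")

# Note: .sample() is applied after the CSV scan, so Spark still reads the whole
# 1.36 GB file (and inferSchema adds an extra pass over it); only later stages
# work on less data. multiLine and escape='"' are needed because message bodies
# contain newlines and doubled quotes.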
df_emails_bronze = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("multiLine", "true") \
    .option("escape", '"') \
    .csv(bronze_path) \
    .sample(fraction=0.1, seed=42)  # 10% sample

print(f"✅ Loaded (10% sample): {df_emails_bronze.count():,} emails")
print(f"   Columns: {len(df_emails_bronze.columns)}")


📖 Reading Enron emails (sampling 10% for speed)...
   Path: abfss://bronze@stgolistmigration.dfs.core.windows.net/emails.csv
✅ Loaded (10% sample): 51,685 emails
   Columns: 2
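
The sample holds 51,685 rows, which suggests a full file of roughly 517,000 rows, but the count is not an exact tenth of anything: `sample(fraction=...)` does not promise an exact fraction. The pure-Python stand-in below illustrates why, assuming each row is kept independently with probability `fraction`. It is not Spark's implementation, and `bernoulli_sample` is a hypothetical name.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))  # stand-in for the rows of emails.csv
sampled = bernoulli_sample(rows, fraction=0.1, seed=42)
print(len(sampled))  # close to 10,000, but generally not exactly 10,000
```

Fixing the seed makes the selection repeatable for the same input, which is what keeps the dev sample stable between runs.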


In [0]:
print("🔍 Schema & Sample")
print("=" * 80)

# Show schema
print("\n1️⃣ COLUMNS:")
df_emails_bronze.printSchema()

# Show sample
print("\n2️⃣ SAMPLE EMAILS:")
df_emails_bronze.limit(2).show(truncate=80, vertical=True)


🔍 Schema & Sample

1️⃣ COLUMNS:
root
 |-- file: string (nullable = true)
 |-- message: string (nullable = true)


2️⃣ SAMPLE EMAILS:
-RECORD 0-----------------------------------------------------------------------------------
 file    | allen-p/_sent_mail/1.                                                            
 message | Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>\nDate: Mon, 14 May ... 
-RECORD 1-----------------------------------------------------------------------------------
 file    | allen-p/_sent_mail/114.                                                          
 message | Message-ID: <26575732.1075855687756.JavaMail.evans@thyme>\nDate: Mon, 2 Oct 2... 



In [0]:
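print("🔄 Transforming emails...")

# Drop rows with no message, collapse whitespace runs (including newlines) to a
# single space, then keep only the file path, a 200-character preview, the
# cleaned length, and a load timestamp. The full cleaned text (message_clean)
# is not written to Silver.
df_emails_silver = df_emails_bronze \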
    .filter(col("message").isNotNull()) \
    .withColumn("message_clean", 
        regexp_replace(trim(col("message")), r'\s+', ' ')
    ) \
    .withColumn("message_length", length(col("message_clean"))) \
    .withColumn("message_preview", substring(col("message_clean"), 1, 200)) \
    .withColumn("ingestion_timestamp", current_timestamp()) \
    .select(
        "file",
        "message_preview",
        "message_length",
        "ingestion_timestamp"
    )

silver_count = df_emails_silver.count()

print(f"✅ Transformation complete")
print(f"   Silver rows: {silver_count:,} emails")


🔄 Transforming emails...
✅ Transformation complete
   Silver rows: 51,522 emails
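
The cleaning step trims first and collapses whitespace second. If Spark's `trim` strips only space characters, as its documentation describes, a message that begins or ends with a newline would keep one leading or trailing space after the replacement, and `message_length` would count it. The pure-Python sketch below shows the reverse order (collapse, then strip); `clean_message` and `preview` are hypothetical helper names and the input string is made up. The Spark equivalent would presumably be `trim(regexp_replace(col("message"), r'\s+', ' '))`.

```python
import re

_WS = re.compile(r"\s+")


def clean_message(raw: str) -> str:
    """Collapse whitespace runs to one space, then strip both ends."""
    return _WS.sub(" ", raw).strip()


def preview(text: str, n: int = 200) -> str:
    """First n characters, mirroring substring(col, 1, n)."""
    return text[:n]


raw = "\nSubject: Forecast\n\n  Here is   the forecast.\n"  # illustrative only
cleaned = clean_message(raw)
print(repr(cleaned), len(cleaned))
```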


In [0]:
print("📊 Email Statistics")
print("=" * 80)

# Message length stats
print("\n1️⃣ MESSAGE LENGTH DISTRIBUTION:")
df_emails_silver.select("message_length") \
    .describe() \
    .show(truncate=False)

# Sample messages
print("\n2️⃣ SAMPLE PREVIEWS:")
df_emails_silver.select("file", "message_preview", "message_length") \
    .limit(3) \
    .show(truncate=100, vertical=True)


📊 Email Statistics

1️⃣ MESSAGE LENGTH DISTRIBUTION:
+-------+-----------------+
|summary|message_length   |
+-------+-----------------+
|count  |51522            |
|mean   |2625.302744458678|
|stddev |7670.803571494996|
|min    |383              |
|max    |1192777          |
+-------+-----------------+


2️⃣ SAMPLE PREVIEWS:
-RECORD 0---------------------------------------------------------------------------------------------------------------
 file            | allen-p/_sent_mail/1.                                                                                
 message_preview | Message-ID: <18782981.1075855378110.JavaMail.evans@thyme> Date: Mon, 14 May 2001 16:39:00 -0700 (... 
 message_length  | 486                                                                                                  
-RECORD 1---------------------------------------------------------------------------------------------------------------
 file            | allen-p/_sent_mail/114.                         
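
The `describe()` output above shows a mean of about 2,625 characters, a standard deviation of about 7,671, and a maximum of 1,192,777, so the distribution is strongly right-skewed and the mean says little about a typical message. Percentiles are more informative in that case. The sketch below uses a hand-written nearest-rank percentile on made-up lengths; `percentile` is a hypothetical helper, and in Spark `DataFrame.approxQuantile` would be the natural counterpart.

```python
def percentile(sorted_values, p):
    """Nearest-rank percentile (integer p in 0..100) of an ascending list."""
    n = len(sorted_values)
    rank = max(1, (p * n + 99) // 100)  # integer ceiling of p * n / 100
    return sorted_values[rank - 1]


# Hypothetical lengths: mostly short messages plus one very large outlier
lengths = sorted([400, 550, 700, 900, 1200, 1500, 2000, 3000, 5000, 1_000_000])
mean = sum(lengths) / len(lengths)
print(f"mean={mean:.0f}  p50={percentile(lengths, 50)}  p90={percentile(lengths, 90)}")
```

In this toy data a single outlier pulls the mean far above the 90th percentile, which is the same pattern the real statistics hint at.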

In [0]:
output_path = get_silver_path("enron_emails_sample")

print(f"💾 Writing to: {output_path}")
print("   Note: This is a 10% sample for dev/testing")

df_emails_silver.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(output_path)

print("✅ Enron Silver (sample) complete!")


💾 Writing to: abfss://silver@stgolistmigration.dfs.core.windows.net/enron_emails_sample/
   Note: This is a 10% sample for dev/testing
✅ Enron Silver (sample) complete!


In [0]:
print("🔍 Verifying...")

df_verify = spark.read.format("delta").load(output_path)

print(f"✅ Verified: {df_verify.count():,} emails")
print(f"   Avg message length: {df_verify.agg({'message_length': 'avg'}).collect()[0][0]:.0f} chars")

print("\nMessage length distribution:")
df_verify.groupBy(
    when(col("message_length") < 100, "< 100 chars")
    .when(col("message_length") < 500, "100-500 chars")
    .when(col("message_length") < 1000, "500-1000 chars")
    .otherwise("> 1000 chars").alias("length_bucket")
).count().orderBy("length_bucket").show(truncate=False)

print("=" * 80)
print("🎉 Enron Bronze → Silver complete!")
print("\n💡 NOTE: This is a 10% sample. For production:")
print("   - Remove `.sample(fraction=0.1)` from Cell 3")
print("   - Increase cluster size for full 1.36GB load")


🔍 Verifying...
✅ Verified: 51,522 emails
   Avg message length: 2625 chars

Message length distribution:
+--------------+-----+
|length_bucket |count|
+--------------+-----+
|100-500 chars |1749 |
|500-1000 chars|14286|
|> 1000 chars  |35487|
+--------------+-----+

🎉 Enron Bronze → Silver complete!

💡 NOTE: This is a 10% sample. For production:
   - Remove `.sample(fraction=0.1)` from Cell 3
   - Increase cluster size for full 1.36GB load
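
A note on the length buckets in the verification cell: `orderBy("length_bucket")` sorts the labels as strings, so a `< 100 chars` row would appear after `500-1000 chars`. The table above looks ordered only because no message in the sample is shorter than 100 characters. The labels also leave the boundaries implicit (a length of exactly 1000 lands in `> 1000 chars`). The sketch below carries a numeric sort key next to each label and names the boundaries explicitly; the names `BUCKETS`, `LAST_LABEL`, and `length_bucket`, the adjusted labels, and the sample lengths are all illustrative assumptions.

```python
from collections import Counter

# (exclusive upper bound, label); list order is the display order
BUCKETS = [
    (100, "< 100 chars"),
    (500, "100-499 chars"),
    (1000, "500-999 chars"),
]
LAST_LABEL = ">= 1000 chars"


def length_bucket(n: int) -> tuple:
    """Return (sort_key, label) so buckets can be ordered numerically."""
    for i, (upper, label) in enumerate(BUCKETS):
        if n < upper:
            return i, label
    return len(BUCKETS), LAST_LABEL


lengths = [42, 383, 486, 750, 1000, 2625]  # illustrative values only
counts = Counter(length_bucket(n) for n in lengths)
for (_, label), c in sorted(counts.items()):
    print(f"{label:<14} {c}")
```

In Spark the same idea would amount to adding a numeric bucket column alongside the label and ordering by that column instead.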
