# Silver → Gold: Enron Email Analytics

## Purpose
Build an analytics-ready, email-level fact table from the Silver sample of the Enron email data.

## Source
- Silver: `enron_emails_sample` (10% sample)

## Transformations
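- Message length categorization (`message_category`, five buckets from "Very Short" to "Very Long")
- Long-email flag (`is_long_email`, true when `message_length` > 1000)
- Ingestion timestamp (`gold_ingestion_timestamp`)
- Text statistics on `message_length` (computed in the verification step)
- Narrow output schema, ready for NLP pipelines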
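
The length categorization can be sketched in plain Python to sanity-check the bucket boundaries without a Spark session. The thresholds mirror the `when` chain in the transformation cell further down. The function names `categorize_length` and `is_long` are hypothetical helpers for illustration and are not part of the pipeline.

```python
def categorize_length(message_length: int) -> str:
    """Mirror of the Spark `when` chain: half-open buckets on message_length."""
    if message_length < 100:
        return "Very Short"
    if message_length < 500:
        return "Short"
    if message_length < 1000:
        return "Medium"
    if message_length < 2000:
        return "Long"
    return "Very Long"


def is_long(message_length: int) -> bool:
    """Mirror of the is_long_email flag, which uses a strict > 1000."""
    return message_length > 1000


# Each threshold value itself lands in the higher bucket because every comparison is strict.
for length in (99, 100, 499, 500, 999, 1000, 1999, 2000):
    print(length, categorize_length(length), is_long(length))
```

A length of exactly 1000 is categorized as "Long" while `is_long` returns `False`, so the flag and the category disagree at that single value.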

## Output
- Gold: `fact_enron_emails`
- Grain: One row per email
- Use Case: Email analytics, NLP, sentiment analysis

**Author:** Kevin  
**Date:** Feb 9, 2026


In [0]:
from pyspark.sql.functions import col, when, current_timestamp

storage_account_name = "stgolistmigration"
# Left blank in this notebook; supply the account key before running and never commit it.
account_key = ""

spark.conf.set(
    f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
    account_key
)

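def get_silver_path(table):
    # Root folder of a Delta table in the "silver" container
    return f"abfss://silver@{storage_account_name}.dfs.core.windows.net/{table}/"

def get_gold_path(table):
    # Root folder of a Delta table in the "gold" container
    return f"abfss://gold@{storage_account_name}.dfs.core.windows.net/{table}/"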

print("✅ Config loaded")


✅ Config loaded
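
One possible way to avoid a hard-coded key is to read it from the environment and fail fast when it is missing. The sketch below is plain Python and makes two assumptions that are not in the notebook: the environment variable name `STORAGE_ACCOUNT_KEY` and the helper names `load_account_key` and `abfss_path`. The path helper only reproduces the URI shape returned by `get_silver_path` and `get_gold_path`.

```python
import os


def load_account_key(env_var: str = "STORAGE_ACCOUNT_KEY") -> str:
    """Return the storage account key from the environment.

    The variable name is an assumption made for this sketch.
    Raises RuntimeError if the variable is unset or empty.
    """
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty")
    return key


def abfss_path(container: str, account: str, table: str) -> str:
    """Build the same abfss:// URI shape as get_silver_path / get_gold_path."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{table}/"


print(abfss_path("gold", "stgolistmigration", "fact_enron_emails"))
```

With something like this in place, `account_key = load_account_key()` would replace the empty string, and a missing key would surface as an error in the config cell instead of at the first read.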


In [0]:
print("📖 Loading Enron Silver data...")

enron_path = get_silver_path("enron_emails_sample")
df_enron = spark.read.format("delta").load(enron_path)

print(f"✅ Loaded: {df_enron.count():,} emails")


📖 Loading Enron Silver data...
✅ Loaded: 51,522 emails


In [0]:
print("🔄 Creating email fact table...")

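# Buckets are half-open on message_length: <100, 100-499, 500-999, 1000-1999, >=2000.
# is_long_email uses a strict > 1000, so a message_length of exactly 1000 is
# categorized "Long" but flagged false.
df_gold_emails = df_enron \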
    .withColumn("email_id", col("file")) \
    .withColumn("message_category",
        when(col("message_length") < 100, "Very Short")
        .when(col("message_length") < 500, "Short")
        .when(col("message_length") < 1000, "Medium")
        .when(col("message_length") < 2000, "Long")
        .otherwise("Very Long")
    ) \
    .withColumn("is_long_email", col("message_length") > 1000) \
    .withColumn("gold_ingestion_timestamp", current_timestamp()) \
    .select(
        "email_id",
        "message_preview",
        "message_length",
        "message_category",
        "is_long_email",
        "gold_ingestion_timestamp"
    )

print(f"✅ Created: {df_gold_emails.count():,} email records")

# Show sample
df_gold_emails.limit(2).show(truncate=100, vertical=True)


🔄 Creating email fact table...
✅ Created: 51,522 email records
-RECORD 0------------------------------------------------------------------------------------------------------------------------
 email_id                 | allen-p/_sent_mail/106.                                                                              
 message_preview          | Message-ID: <2707340.1075855687584.JavaMail.evans@thyme> Date: Mon, 9 Oct 2000 07:00:00 -0700 (PD... 
 message_length           | 6061                                                                                                 
 message_category         | Very Long                                                                                            
 is_long_email            | true                                                                                                 
 gold_ingestion_timestamp | 2026-02-09 13:26:46.809086                                                                           
-RECORD 1------------------

In [0]:
output_path = get_gold_path("fact_enron_emails")

print(f"💾 Writing to: {output_path}")

df_gold_emails.write \
    .format("delta") \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .save(output_path)

print("✅ Enron Gold complete!")


💾 Writing to: abfss://gold@stgolistmigration.dfs.core.windows.net/fact_enron_emails/
✅ Enron Gold complete!


In [0]:
print("🔍 Verifying...")

df_verify = spark.read.format("delta").load(output_path)

print(f"✅ Verified: {df_verify.count():,} emails")

print("\nEmail distribution by category:")
df_verify.groupBy("message_category").count().orderBy("message_category").show()

print("\nLong vs Short emails:")
df_verify.groupBy("is_long_email").count().show()

print("\nMessage length stats:")
df_verify.select("message_length").describe().show()

print("🎉 Enron → Gold complete!")


🔍 Verifying...
✅ Verified: 51,522 emails

Email distribution by category:
+----------------+-----+
|message_category|count|
+----------------+-----+
|            Long|17314|
|          Medium|14286|
|           Short| 1749|
|       Very Long|18173|
+----------------+-----+


Long vs Short emails:
+-------------+-----+
|is_long_email|count|
+-------------+-----+
|         true|35470|
|        false|16052|
+-------------+-----+


Message length stats:
+-------+------------------+
|summary|    message_length|
+-------+------------------+
|  count|             51522|
|   mean| 2625.302744458678|
| stddev|7670.8035714949565|
|    min|               383|
|    max|           1192777|
+-------+------------------+

🎉 Enron → Gold complete!
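
The two breakdowns in the verification output can be reconciled with a little arithmetic. The sketch below copies the counts from the output above and checks that each breakdown accounts for every row. It then derives how many emails sit exactly on the 1000 boundary, where `message_category` (a `< 1000` cut-off) and `is_long_email` (a `> 1000` cut-off) disagree. The variable names are illustrative only.

```python
# Counts copied from the verification output.
category_counts = {"Long": 17314, "Medium": 14286, "Short": 1749, "Very Long": 18173}
flag_counts = {True: 35470, False: 16052}
total_rows = 51522

# Both breakdowns should account for every row.
assert sum(category_counts.values()) == total_rows
assert sum(flag_counts.values()) == total_rows

# "Long" and "Very Long" together cover message_length >= 1000,
# while is_long_email covers message_length > 1000.
at_least_1000 = category_counts["Long"] + category_counts["Very Long"]
exactly_1000 = at_least_1000 - flag_counts[True]

print(f"length >= 1000: {at_least_1000:,}")
print(f"length == 1000 (implied): {exactly_1000}")
```

The gap works out to 17 emails, which is why the `is_long_email = true` count (35,470) is slightly lower than the combined "Long" and "Very Long" count (35,487).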
