# Preprocessing the Enron dataset

In this notebook we'll be preprocessing the Enron dataset. The goal is to augment the data with the information we'll need to find the emails that we want to study (and ignore the ones we don't).

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
! ls ../data/maildir

allen-p      fischer-m	     kitchen-l	      phanis-s	     smith-m
arnold-j     forney-j	     kuykendall-t     pimenov-v      solberg-g
arora-h      fossum-d	     lavorato-j       platter-p      south-s
badeer-r     gang-l	     lay-k	      presto-k	     staab-t
bailey-s     gay-r	     lenhart-m	      quenet-j	     stclair-c
bass-e	     geaccone-t      lewis-a	      quigley-d      steffes-j
baughman-d   germany-c	     linder-e	      rapp-b	     stepenovitch-j
beck-s	     gilbertsmith-d  lokay-m	      reitmeyer-j    stokley-c
benson-r     giron-d	     lokey-t	      richey-c	     storey-g
blair-l      griffith-j      love-p	      ring-a	     sturm-f
brawner-s    grigsby-m	     lucci-p	      ring-r	     swerzbin-m
buy-r	     guzman-m	     maggi-m	      rodrique-r     symes-k
campbell-l   haedicke-m      mann-k	      rogers-b	     taylor-m
carson-m     hain-m	     martin-t	      ruscitti-k     tholt-j
cash-m	     harris-s	     may-l	      sager-e	     thomas-p
causholli-m  hayslett-r      mc

We want to convert the dataset to Parquet format and save it. We can then replace the dataset in the DVC cache with the Parquet version, which should be much easier to cache and track.

In [3]:
! tree ../data/maildir | head 

../data/maildir
├── allen-p
│   ├── all_documents
│   │   ├── 1. -> /network/scratch/c/caleb.moses/group-project/dvc/files/md5/08/f89c6e8b9dfb55ce5d96e49e8be465
│   │   ├── 10. -> /network/scratch/c/caleb.moses/group-project/dvc/files/md5/7e/8270c667aeecf249ad15fac5e4aacc
│   │   ├── 100. -> /network/scratch/c/caleb.moses/group-project/dvc/files/md5/01/46e8d854f36b331d7c844029d44800
│   │   ├── 101. -> /network/scratch/c/caleb.moses/group-project/dvc/files/md5/f9/2b88674aaea14988e17f82e7e2f87d
│   │   ├── 102. -> /network/scratch/c/caleb.moses/group-project/dvc/files/md5/ea/92953635b60e6b874f991d508c5f4b
│   │   ├── 103. -> /network/scratch/c/caleb.moses/group-project/dvc/files/md5/5f/72d38fe7f7d2d4d9a6b0e1b59e7c06
│   │   ├── 104. -> /network/scratch/c/caleb.moses/group-project/dvc/files/md5/a3/165664647f9fd9eca8a8398eb0ab64


In [16]:
%%time
paths = []
for root, dirs, files in os.walk('../data/maildir', followlinks=True):
    for f in files:
        paths.append(os.path.join(root, f))

KeyboardInterrupt: 
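
The walk above was interrupted before it finished, so `paths` appears to hold only the files visited up to that point. A minimal sketch of one way to make a partial listing deliberate: wrap the walk in a generator and take a fixed number of paths with `itertools.islice`. The names `iter_paths` and `sample_paths` and the `limit` argument are illustrative assumptions, not part of the original notebook.

```python
import os
from itertools import islice


def iter_paths(root):
    """Lazily yield file paths under root, following symlinked directories."""
    for dirpath, _dirs, files in os.walk(root, followlinks=True):
        for name in files:
            yield os.path.join(dirpath, name)


def sample_paths(root, limit):
    """Return at most `limit` file paths, stopping the walk once enough are found."""
    return list(islice(iter_paths(root), limit))
```

Calling `sample_paths('../data/maildir', 10_000)` would stop after 10,000 files instead of relying on a manual interrupt. Note that the sample follows directory order, so it is not a random sample of mailboxes.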

In [17]:
! head ../data/maildir/shively-h/1.

Message-ID: <27083999.1075840298877.JavaMail.evans@thyme>
Date: Mon, 24 Sep 2001 08:17:17 -0700 (PDT)
From: tammie.schoppe@enron.com
To: d..hogan@enron.com, kimberly.bates@enron.com, jessica.presas@enron.com, 
	alexandra.villarreal@enron.com, michael.salinas@enron.com, 
	becky.young@enron.com, ina.rangel@enron.com
Subject: Dinner next Wednesday(10/3) with Louise and John
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit


In [18]:
enron_data = pd.DataFrame({'path': paths})
enron_data

Unnamed: 0,path
0,../data/maildir/benson-r/inbox/115.
1,../data/maildir/benson-r/inbox/128.
2,../data/maildir/benson-r/inbox/254.
3,../data/maildir/benson-r/inbox/91.
4,../data/maildir/benson-r/inbox/290.
...,...
10480,../data/maildir/ermis-f/inbox/37.
10481,../data/maildir/ermis-f/inbox/43.
10482,../data/maildir/ermis-f/inbox/584.
10483,../data/maildir/ermis-f/inbox/151.


In [19]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm.notebook import tqdm
import chardet

def decode_email(fp):
    """Read an email file and decode it to text, dropping carriage returns."""
    with open(fp, 'rb') as f:
        raw_data = f.read()

    # Detect the encoding; fall back to us-ascii if chardet cannot decide
    detected_encoding = chardet.detect(raw_data)['encoding'] or 'us-ascii'

    try:
        text = raw_data.decode(detected_encoding)
    except (UnicodeDecodeError, LookupError):
        # Bad bytes or an unknown codec name: replace undecodable bytes instead of failing
        text = raw_data.decode('us-ascii', errors='replace')

    return text.replace('\r', '')

def read_email(fp):
    """Parse one email file into a dict of selected header fields plus the body."""
    text = decode_email(fp)

    # Headers end at the first blank line; partition tolerates a missing body
    header, _, content = text.partition('\n\n')

    # Define the fields we are interested in
    fields = ['Message-ID', 'Date', 'From', 'Subject', 'X-FileName', 'X-Origin',
              'X-Folder', 'X-bcc', 'X-cc', 'X-To', 'X-From', 'Content-Transfer-Encoding',
              'Content-Type', 'Mime-Version', 'To', 'Cc', 'Bcc', 'Content']

    # Initialize the dictionary with an empty string for every field
    email_dict = {field: '' for field in fields}

    # Key of the header currently being read, used for multi-line (folded) values
    current_key = None

    # Split the header into lines and iterate through each line
    for line in header.strip().split('\n'):
        # Folded continuation lines start with whitespace
        if line[:1] in (' ', '\t'):
            if current_key:
                email_dict[current_key] += ' ' + line.strip()
            continue

        key, sep, value = line.partition(':')
        key = key.strip()
        if sep and key in email_dict:
            email_dict[key] = value.strip()
            current_key = key
        else:
            # Not a field we track: skip it and any continuation lines
            current_key = None

    # Set the content last so a stray 'Content:' header cannot overwrite it
    email_dict['Content'] = content

    return email_dict
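
As a cross-check on the hand-written header loop above, the standard library's `email` package can parse the same text and handles folded header lines itself. This is a minimal sketch rather than the notebook's method: `parse_headers` is a hypothetical helper, the sample message is made up for illustration, and it keeps every header rather than the fixed field list.

```python
from email import message_from_string


def parse_headers(text):
    """Parse a raw email string into a dict of headers plus a 'Content' entry."""
    msg = message_from_string(text)
    # Folded headers keep their line breaks; collapse all whitespace runs to one space.
    # If a header repeats, the last occurrence wins.
    record = {key: ' '.join(value.split()) for key, value in msg.items()}
    # For a non-multipart message the payload is the body as a string
    record['Content'] = msg.get_payload()
    return record


# Made-up sample with a folded 'To' header
sample = (
    "From: a@example.com\n"
    "To: b@example.com, \n"
    "\tc@example.com\n"
    "Subject: Dinner\n"
    "\n"
    "See you there.\n"
)
parsed = parse_headers(sample)
```

Comparing `parse_headers(decode_email(fp))` against `read_email(fp)` on a handful of files is a cheap way to spot headers that the manual parser mishandles.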

In [None]:
enron_data['email'] = [read_email(fp) for fp in tqdm(enron_data.path)]

  0%|          | 0/10485 [00:00<?, ?it/s]
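
The list comprehension above reads the files one at a time. Because the files appear to live on network storage, a thread pool is one plausible way to overlap the reads. The following is a minimal sketch under that assumption: `parallel_map` and `max_workers=8` are illustrative choices, not part of the original notebook, and the gain depends on how much time goes to file I/O versus encoding detection.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(fn, items, max_workers=8):
    """Apply fn to each item in a thread pool, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of the inputs
        return list(pool.map(fn, items))


# Small stand-in demo; in the notebook the call would look like
# enron_data['email'] = parallel_map(read_email, enron_data.path)
demo = parallel_map(len, ['a', 'bb', 'ccc'])
```

Since results come back in input order, they can be assigned straight to a DataFrame column. An exception raised by any single call propagates when the results are collected.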

In [None]:
%%time
fields = pd.json_normalize(enron_data.email)
enron_df = pd.concat([enron_data.loc[:, ['path']], fields], axis=1)

In [None]:
enron_df.to_parquet('../data/enron_emails.parquet')