Converting the Enron email collection to mbox format

lintool/Enron2mbox

Converting the Enron Email Dataset to mbox Format

The Enron Email Dataset is distributed in maildir format, which means that each message is stored in a separate file. With more than half a million messages, this is unwieldy to work with. Here's how to convert the maildir layout into mbox, where all messages in a folder are stored in a single mbox file.

Go fetch the dataset and then unpack:

$ tar xvfz enron_mail_20150507.tgz

The dataset should unpack into a directory called maildir. Use the script count_messages.sh to gather an inventory of the messages in each folder:

$ ./count_messages.sh

Verify the total number of messages in the dataset:

$ ./count_messages.sh | cut -d' ' -f1 | awk '{s+=$1} END {print s}'
517401
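
For a cross-check that does not depend on the shell pipeline, the same tally can be sketched in Python. This is not count_messages.sh (its contents are not shown here). It assumes every regular file under maildir is one message, and the function names `count_messages` and `total_messages` are made up for illustration.

```python
import os


def count_messages(root):
    """Map each folder (relative to root) to the number of files directly inside it.

    Assumption: every regular file is exactly one message.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            counts[os.path.relpath(dirpath, root)] = len(filenames)
    return counts


def total_messages(root):
    """Sum the per-folder counts, like the awk step in the pipeline above."""
    return sum(count_messages(root).values())
```

If the one-file-per-message assumption holds, `total_messages("maildir")` run before the conversion should agree with the total printed by the pipeline.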

Now run the conversion script:

$ ./convert_enron_to_mbox.py

It might take a bit, so go grab a cup of coffee...

Note that the script is destructive: it alters the original directory structure of the dataset. This is necessary to put everything in the proper maildir layout so that it can be processed by Python tools (in particular, the script creates cur/ and new/ directories, which are part of the expected layout). Keep the tarball around if you need the original layout again.
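
To make the restructuring concrete, here is a minimal sketch of how a single folder could be converted with Python's standard `mailbox` module. It is not the code in convert_enron_to_mbox.py, which is not reproduced here. The function name `folder_to_mbox` is hypothetical, and creating tmp/ alongside cur/ and new/ is an assumption based on the usual maildir layout.

```python
import mailbox
import shutil
from pathlib import Path


def folder_to_mbox(folder, mbox_path):
    """Convert one folder of loose message files into a single mbox file.

    Destructive: the message files are moved into folder/cur/.
    Returns the number of message files that were moved.
    """
    folder = Path(folder)
    # Message files that sit directly in the folder.
    loose = [p for p in folder.iterdir() if p.is_file()]

    # mailbox.Maildir expects these subdirectories to exist.
    for sub in ("cur", "new", "tmp"):
        (folder / sub).mkdir(exist_ok=True)
    for p in loose:
        shutil.move(str(p), str(folder / "cur" / p.name))

    maildir = mailbox.Maildir(str(folder), create=False)
    mbox = mailbox.mbox(str(mbox_path))
    mbox.lock()
    try:
        for message in maildir:
            mbox.add(message)
    finally:
        mbox.close()  # flushes pending writes and releases the lock
    return len(loose)
```

Moving the files rather than copying them keeps disk usage flat, but it is also what makes this kind of conversion destructive.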

After the script completes, the resulting mbox files are stored in the enron/ directory:

$ ls enron | wc
    3311    3311   93804

The repo includes ReadMbox.java, a very simple Java program that uses the JavaMail API to read the mbox files. The jars it depends on are checked into the repo for convenience, so you can compile directly:

$ javac -cp lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox.java

You can now examine a particular mbox file:

$ java -cp .:lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox enron/enron.allen-p._sent_mail

The program prints out the subject line of each email.
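
If you would rather stay in Python, a rough equivalent can be sketched with the standard `mailbox` module. This is an illustrative stand-in, not a port of ReadMbox.java. The helper name `subjects` is hypothetical, and returning an empty string for a missing Subject header is an arbitrary choice.

```python
import mailbox


def subjects(mbox_path):
    """Return the subject line of every message in an mbox file, in order."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        result = []
        for message in box:
            value = message["Subject"]
            # A missing header comes back as None; normalise it to "".
            result.append("" if value is None else str(value))
        return result
    finally:
        box.close()
```

For example, `for s in subjects("enron/enron.allen-p._sent_mail"): print(s)` would list the subjects from the same file used above.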

To verify the integrity of the entire dataset in mbox format, run:

$ ./verify_mbox.sh > mbox.log &

Once the background job finishes, confirm that the number of messages matches the original count:

$ cut -d' ' -f3 mbox.log | awk '{s+=$1} END {print s}'
517401
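
The same check can be approximated in Python by counting the messages in every mbox file and summing the results. This is a sketch, not what verify_mbox.sh does (its contents are not shown here). The function name `count_mbox_messages` is hypothetical, and it assumes every regular file in the directory is an mbox file.

```python
import mailbox
from pathlib import Path


def count_mbox_messages(mbox_dir):
    """Map each mbox file name in mbox_dir to the number of messages it holds."""
    counts = {}
    for path in sorted(Path(mbox_dir).iterdir()):
        if not path.is_file():
            continue
        box = mailbox.mbox(str(path), create=False)
        try:
            counts[path.name] = len(box)
        finally:
            box.close()
    return counts
```

`sum(count_mbox_messages("enron").values())` can then be compared against the total reported for the original maildir.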
