Converting the Enron email collection to mbox format

Converting the Enron Email Dataset to mbox Format

The Enron Email Dataset is distributed in maildir format, which means that each message is stored in a separate file. This is unwieldy to work with. Here's how you can convert maildir into mbox, where all messages in a folder are stored in a single mbox file.

Go fetch the dataset and then unpack:

$ tar xvfz enron_mail_20150507.tgz

The dataset should unpack into a directory called maildir. Use the script to gather an inventory of the messages in each folder:

$ ./

Verify the total number of messages in the dataset:

$ ./ | cut -d' ' -f1 | awk '{s+=$1} END {print s}'
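The same per-folder inventory can be sketched in Python as a cross-check. This is not the repo's script: the function name `count_messages` is hypothetical, and the sketch assumes that every regular file under the unpacked directory is one message.

```python
import os


def count_messages(root):
    """Map each folder under `root` to the number of files it directly contains.

    Assumption: every file is one message, as in the unpacked dataset.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            counts[dirpath] = len(filenames)
    return counts
```

Calling `sum(count_messages("maildir").values())` should then give a grand total comparable to the one printed by the awk pipeline above.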

Now run the conversion script:

$ ./

It might take a bit, so go grab a cup of coffee...

Note that the script is destructive: it alters the original directory structure of the dataset. This is necessary to get every folder into the proper maildir layout so that it can be processed by Python tools (in particular, the script creates cur/ and new/ directories, which are part of the expected layout).
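For illustration, here is a minimal sketch of how one folder could be converted with Python's standard `mailbox` module once it has the cur/ and new/ layout. It is not the repo's conversion script; the function name `maildir_to_mbox` and its arguments are hypothetical, and file locking is left out on the assumption that a single process runs the conversion.

```python
import mailbox


def maildir_to_mbox(maildir_path, mbox_path):
    """Copy every message from a maildir folder into a single mbox file.

    Assumes `maildir_path` already contains cur/ and new/ subdirectories.
    Returns the number of messages copied.
    """
    src = mailbox.Maildir(maildir_path, factory=None, create=False)
    dst = mailbox.mbox(mbox_path)  # created if it does not exist yet
    count = 0
    try:
        for message in src:
            dst.add(message)
            count += 1
        dst.flush()  # write pending changes to disk
    finally:
        dst.close()
        src.close()
    return count
```

The returned count can be compared against the folder's entry in the inventory to catch messages that failed to parse.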

After the script completes, the resulting mbox files are stored in the enron/ directory:

$ ls enron | wc
    3311    3311   93804

The repo includes a very simple Java program (the ReadMbox class run below) that uses the JavaMail API to read the mbox files. The dependent jars are checked into the repo for convenience, so you can compile directly:

$ javac -cp lib/javax.mail-1.5.6.jar:lib/mbox.jar

You can now examine a particular mbox file:

$ java -cp .:lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox enron/enron.allen-p._sent_mail

The program prints out the subject line of each email.
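A rough Python counterpart, for readers who prefer not to compile the Java class, can be sketched with the standard `mailbox` module. The function name `list_subjects` is hypothetical, and this is not a port of the repo's program.

```python
import mailbox


def list_subjects(mbox_path):
    """Return the Subject header of every message in an mbox file, in file order."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [message["Subject"] for message in box]
    finally:
        box.close()
```

For example, `for s in list_subjects("enron/enron.allen-p._sent_mail"): print(s)` prints one subject per line for the file examined above.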

To verify the integrity of the entire dataset in mbox format, run:

$ ./ > mbox.log &

Confirm that the total number of messages matches the count obtained from the maildir inventory earlier:

$ cut -d' ' -f3 mbox.log | awk '{s+=$1} END {print s}'
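As an independent check, the total can also be computed directly from the mbox files in Python. This is a sketch under the assumption that every regular file in the output directory is an mbox file; the function name `total_mbox_messages` is hypothetical.

```python
import mailbox
import os


def total_mbox_messages(directory):
    """Sum the message counts of all mbox files directly inside `directory`."""
    total = 0
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        box = mailbox.mbox(path, create=False)
        try:
            total += len(box)
        finally:
            box.close()
    return total
```

`total_mbox_messages("enron")` should equal the total reported for the original maildir; any difference points to messages lost or split during conversion.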