Converting the Enron Email Dataset to mbox Format

The Enron Email Dataset is distributed in maildir format, which means that each message is stored in a separate file. This is unwieldy to work with. Here's how you can convert maildir into mbox, where all messages in a folder are stored in a single mbox file.

Go fetch the dataset and then unpack:

$ tar xvfz enron_mail_20150507.tgz

The dataset should unpack into a directory called maildir. Use the script count_messages.sh to gather an inventory of the messages in each folder:

$ ./count_messages.sh

Verify the total number of messages in the dataset:

$ ./count_messages.sh | cut -d' ' -f1 | awk '{s+=$1} END {print s}'
517401
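
The same inventory can be cross-checked from Python. The following is a minimal sketch, not the logic of count_messages.sh: it assumes that every regular file under maildir is exactly one message, and the function names `count_messages` and `total_messages` are hypothetical.

```python
import os


def count_messages(root):
    """Map each folder under `root` that directly contains files to its file count.

    Assumption: one regular file == one message, as in the unpacked dataset.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        if filenames:
            counts[dirpath] = len(filenames)
    return counts


def total_messages(root):
    """Sum the per-folder counts to get the size of the whole collection."""
    return sum(count_messages(root).values())
```

Calling `total_messages("maildir")` on the freshly unpacked dataset should give the same total as the shell pipeline above, provided the one-file-per-message assumption holds.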

Now run the conversion script:

$ ./convert_enron_to_mbox.py

It might take a bit, so go grab a cup of coffee...

Note that the script is destructive: it alters the original directory structure of the dataset. This is necessary to put everything into a proper maildir layout that Python tools can process (in particular, the script creates the cur/ and new/ directories, which are part of the expected layout).
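
To illustrate why the restructuring is needed, here is a minimal sketch of converting one folder with Python's standard `mailbox` module. It is not the code of convert_enron_to_mbox.py. The function names `folder_to_maildir` and `maildir_to_mbox` are hypothetical, and the sketch assumes the folder contains only loose message files. It also creates tmp/, which `mailbox.Maildir` treats as part of the layout.

```python
import mailbox
import os
import shutil


def folder_to_maildir(folder):
    """Destructively turn a folder of loose message files into a maildir.

    Creates cur/, new/ and tmp/ inside `folder`, then moves every regular
    file that sat directly in `folder` into cur/.
    """
    for sub in ("cur", "new", "tmp"):
        os.makedirs(os.path.join(folder, sub), exist_ok=True)
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        if os.path.isfile(path):
            shutil.move(path, os.path.join(folder, "cur", name))


def maildir_to_mbox(folder, mbox_path):
    """Append every message in the maildir `folder` to the mbox file at `mbox_path`.

    Returns the number of messages written.
    """
    src = mailbox.Maildir(folder, create=False)
    dst = mailbox.mbox(mbox_path)  # created if it does not exist yet
    written = 0
    try:
        for message in src:
            dst.add(message)
            written += 1
    finally:
        dst.close()  # flushes pending writes to disk
    return written
```

The first function is what makes the process destructive: once the files have been moved into cur/, the folder no longer matches the layout of the original tarball.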

After the script completes, the resulting mbox files are stored in the enron/ directory:

$ ls enron | wc
    3311    3311   93804

The repo includes ReadMbox.java, a very simple Java program that uses the JavaMail API to read the mbox files. The dependent jars are checked into the repo for convenience, so you can compile directly:

$ javac -cp lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox.java

You can now examine a particular mbox file:

$ java -cp .:lib/javax.mail-1.5.6.jar:lib/mbox.jar ReadMbox enron/enron.allen-p._sent_mail

The program prints out the subject line of each email.
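
For readers who prefer to stay in Python, a rough equivalent can be sketched with the standard `mailbox` module. This is an illustration only, not a port of ReadMbox.java: the function name `list_subjects` is hypothetical, and returning an empty string for a message without a Subject header is an assumption of this sketch.

```python
import mailbox


def list_subjects(mbox_path):
    """Return the subject line of each message in the mbox file, in file order."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        subjects = []
        for message in box:
            subject = message["subject"]
            # A missing header comes back as None; str() also normalizes
            # Header objects produced for non-ASCII subjects.
            subjects.append("" if subject is None else str(subject))
        return subjects
    finally:
        box.close()
```

For example, `print("\n".join(list_subjects("enron/enron.allen-p._sent_mail")))` lists the subjects of the mbox file examined above.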

To verify the integrity of the entire dataset in mbox format, run:

$ ./verify_mbox.sh > mbox.log &

Once the background job has finished, confirm that the total number of messages matches the count from the original dataset:

$ cut -d' ' -f3 mbox.log | awk '{s+=$1} END {print s}'
517401
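
The total can also be recomputed directly from the mbox files. The sketch below is not what verify_mbox.sh does. It assumes that every regular file in the output directory is an mbox file, and the function name `count_mbox_messages` is hypothetical.

```python
import mailbox
import os


def count_mbox_messages(directory):
    """Map each mbox file name in `directory` to the number of messages it holds."""
    counts = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        box = mailbox.mbox(path, create=False)
        try:
            counts[name] = len(box)  # parses the file and counts messages
        finally:
            box.close()
    return counts
```

`sum(count_mbox_messages("enron").values())` should then agree with the total reported for the original maildir layout.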