This project is deprecated and not maintained. It exists for historical reasons only.
- Technology: Python, Java, Bash, SQL; PHP, JavaScript, HTML/CSS; MySQL; libpst, Apache Tika
- Developed: 2011-2012
This is the processing pipeline that was used to generate the Avocado Research Email Collection.
The Avocado Research Email Collection is a corpus of emails and attachments (2 million items), distributed by the Linguistic Data Consortium for use in research and development in e-discovery, social network analysis, and related fields. The emails and attachments come from 282 accounts of a defunct information technology company referred to as "Avocado". The collection consists of the processed personal folders (PST files) of these accounts, with metadata describing folder structure, email characteristics, contacts, and other properties.
This code handled the following data processing tasks:
- PST file extraction (initial data set: 282 PST source files, 66GB)
- File deduplication (see the sketch after this list)
- Email thread reconstruction
- MIME-type identification
- Archive extraction
- Text extraction
- Partial redaction of sensitive data
- Generation of the distribution data files
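For illustration, the deduplication step comes down to flagging files with identical content. Below is a minimal sketch, assuming comparison by SHA-1 digest; it shows the general technique only and is not the pipeline's actual implementation (which lives in the python/ code):

# Minimal sketch: flag files whose contents hash to the same SHA-1 digest.
import hashlib
import os

def sha1_of(path, chunk_size=1 << 20):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root_dir):
    seen = {}        # digest -> first path seen with that content
    duplicates = []  # (duplicate path, original path) pairs
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = sha1_of(path)
            if digest in seen:
                duplicates.append((path, seen[digest]))
            else:
                seen[digest] = path
    return duplicates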
The code includes a web-based system to visually explore the collection's data and its relationships, create labeled subsets of the data, and generate custom distribution sets based on selected criteria (attachment MIME types, file extensions, PST item types).
Why Pluto? Because, similar to Pluto, the system "presided over" the afterlife of email.
The repository also includes:
- Documentation describing the collection and the format
- Details on the system's database structure
- Screenshots of the web-based UI
- Sample extracted data (The data used for this demo is a subset of the EDRM Enron Email Data Set)
Citation:
- Douglas Oard, William Webber, David Kirsch, and Sergey Golitsynskiy. 2015. Avocado Research Email Collection. LDC2015T03. DVD. Philadelphia: Linguistic Data Consortium.
To extract data:
- The system requires Python 2.x, Java and MySQL (no specific version requirements)
- You must have libpst installed (see http://www.five-ten-sg.com/libpst); libpst requires boost and libgsf
Make sure to edit your PYTHONPATH:
PYTHONPATH=$PYTHONPATH:/usr/local/lib/python2.6/site-packages
export PYTHONPATH
- Create a file config.properties, preferably outside your source code directory (see config-SAMPLE.properties as an example; the file must be formatted with no spaces around '=', as illustrated by the sketch at the end of this section).
- Create a symbolic link to config.properties in the root and python/ directories.
- Run the util/run_full_index.sh script from the root directory:
$ ./util/run_full_index.sh
If you are processing a large data set on a remote server, it is a good idea to run the process in the background with nohup, so the HUP (hangup) signal does not kill the process when you log out:
$ nohup ./util/run_full_index.sh &
- If you need to recompile the Java files, you may need to include the -sourcepath and -classpath options. For example:
$ javac -d pluto/bin -sourcepath pluto/ -cp $CLASSPATH:/<path-to-tika-app.jar> pluto/TikaExtractor.java
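Why the "no spaces around '='" rule for config.properties matters: a simple line-based reader splits each line at the first '=' and keeps everything else verbatim, so stray whitespace would end up inside the key or value. The sketch below only illustrates that behavior and is not the project's actual parser (which lives in the python/ code):

# Minimal sketch of a naive properties reader (hypothetical; the real parser may differ).
def read_properties(path):
    props = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue  # skip blank lines and comments
            key, _, value = line.partition('=')
            props[key] = value  # 'db_host = local' would yield key 'db_host ' and value ' local'
    return props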
To generate distribution files:
- Edit the file main.py to specify the output directory and the distro builder (builder1 or builder2); a hypothetical illustration of this edit follows the command below.
- Run python/main.py:
$ python main.py
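The edit to main.py described above might look something like the following; the variable and import names are only illustrative guesses, and the actual names in python/main.py may differ:

# Hypothetical illustration only; check python/main.py for the real names.
from distro import builder2 as builder   # hypothetical: swap in builder1 for the alternative format
OUTPUT_DIR = '/data/avocado/distro'      # hypothetical: where the distribution files are written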
To set up the web-based system:
- Create a file config.ini, preferably outside your source code directory (see php/config-SAMPLE.ini as an example).
- Create a symbolic link to config.ini in the php/ directory.
- Point your web server to the webapp directory, which is your root web directory.
Disclaimer:
- This code's only purpose was to generate the Avocado Research Email Collection. It has not been systematically tested on other PST data sets.
- The code may have been slightly modified before generating the final published version of the data. For example, the metadata format generated by python/distro/builder1.py was not used for the final distribution (it did not combine custodian items in a single XML file and instead used xi:include directives that referenced individual item files). See python/distro/builder2.py for the alternative format.
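To make the difference concrete, here is a minimal sketch of the two layouts: a builder1-style custodian file that only references per-item XML files through xi:include, and a builder2-style file that inlines the items. Element, attribute, and file names are made up for illustration and are not the actual distribution schema:

# Sketch of the two metadata layouts; element and attribute names are hypothetical.
import xml.etree.ElementTree as ET

XI_NS = 'http://www.w3.org/2001/XInclude'

def custodian_with_includes(item_files):
    # builder1-style: reference each per-item XML file with an xi:include directive
    root = ET.Element('custodian')
    for path in item_files:
        ET.SubElement(root, '{%s}include' % XI_NS, href=path)
    return ET.tostring(root)

def custodian_combined(items):
    # builder2-style: inline every item's metadata in a single custodian file
    root = ET.Element('custodian')
    for item_id, subject in items:
        item = ET.SubElement(root, 'item', id=item_id)
        item.set('subject', subject)
    return ET.tostring(root)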
Directory structure:
- demo: sample PST source file + extracted data (data is a subset of the EDRM Enron Email Data Set)
- java: Java classes (source + compiled) used to interface with Apache Tika
- lib: mysql connector JAR file + shell script to obtain tika JAR file (see readme)
- php: web-based system for exploring the data and generating custom distributions
- python: the core data processing code
- sql: SQL scripts
- util: shell script for running all data extraction and subsequent processing