Processing pipeline for generating the Avocado Research Email Collection (PST extractor+)

jdavcs/pluto

Pluto

This project is deprecated and not maintained. It exists for historical reasons only.

  • Technology: Python, Java, Bash, SQL; PHP, JavaScript, HTML/CSS; MySQL; libpst, Apache Tika
  • Developed: 2011-2012

This is the processing pipeline that was used in generating the Avocado Research Email Collection.

The Avocado Research Email Collection is a corpus of emails and attachments (2 million items), distributed by the Linguistic Data Consortium for use in research and development in e-discovery, social network analysis, and related fields. The emails and attachments come from 282 accounts of a defunct information technology company referred to as "Avocado". The collection consists of the processed personal folders (PST files) of these accounts, with metadata describing folder structure, email characteristics, and contacts, among others.

This code handled the following data processing tasks:

  1. PST file extraction (initial data set: 282 PST source files, 66GB)
  2. File deduplication
  3. Email thread reconstruction
  4. MIME-type identification
  5. Archive extraction
  6. Text extraction
  7. Partial redaction of sensitive data
  8. Generation of the distribution data files
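
As an illustration of step 2, file deduplication is commonly done by grouping items on a hash of their content. The following is a minimal sketch of that general technique, not the code used in this repository: the function names and sample data are hypothetical, and it is written for Python 3 rather than the Python 2.x the pipeline targeted.

```python
import hashlib
from collections import defaultdict


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def group_duplicates(files: dict) -> dict:
    """Map each digest to the sorted list of names sharing that content.

    `files` maps a (hypothetical) item name to its raw bytes.
    """
    groups = defaultdict(list)
    for name, data in files.items():
        groups[content_digest(data)].append(name)
    return {digest: sorted(names) for digest, names in groups.items()}


def unique_representatives(files: dict) -> list:
    """Keep one name per distinct content: the first in sorted order."""
    return sorted(names[0] for names in group_duplicates(files).values())


# Hypothetical sample: two accounts hold byte-identical copies of one file.
sample = {
    "alice/report.doc": b"quarterly numbers",
    "bob/report-copy.doc": b"quarterly numbers",
    "carol/notes.txt": b"meeting notes",
}
kept = unique_representatives(sample)
print(kept)
```

Hashing the content rather than comparing names means that copies stored under different file names in different accounts still collapse to a single representative.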

The code includes a web-based system to visually explore the collection's data and its relationships, create labeled subsets of the data, and generate custom distribution sets based on selected criteria (attachment MIME types, file extensions, PST item types).
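
The selection criteria mentioned above (attachment MIME types, file extensions, PST item types) amount to a conjunctive filter over item metadata. A minimal sketch of such a filter follows; the `Item` record, its field names, and the sample values are assumptions made for illustration and do not reflect the web system's actual data model (which is PHP and MySQL based). It is written for Python 3.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Item:
    """Hypothetical record for one extracted item."""
    name: str
    mime_type: str
    pst_item_type: str

    @property
    def extension(self) -> str:
        # Lower-cased text after the last dot, or "" when there is no dot.
        return self.name.rsplit(".", 1)[1].lower() if "." in self.name else ""


def select_items(items: Iterable[Item],
                 mime_types: Optional[Set[str]] = None,
                 extensions: Optional[Set[str]] = None,
                 pst_item_types: Optional[Set[str]] = None) -> List[Item]:
    """Keep items that satisfy every criterion supplied; None means 'any'."""
    selected = []
    for item in items:
        if mime_types is not None and item.mime_type not in mime_types:
            continue
        if extensions is not None and item.extension not in extensions:
            continue
        if pst_item_types is not None and item.pst_item_type not in pst_item_types:
            continue
        selected.append(item)
    return selected


# Hypothetical sample items.
items = [
    Item("budget.XLS", "application/vnd.ms-excel", "attachment"),
    Item("photo.jpg", "image/jpeg", "attachment"),
    Item("re-lunch", "message/rfc822", "email"),
]
subset = select_items(items, extensions={"xls", "jpg"},
                      pst_item_types={"attachment"})
print([item.name for item in subset])
```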

Why Pluto? Because, similar to Pluto, the system "presided over" the afterlife of email.

More information

Published dataset

  • Douglas Oard, William Webber, David Kirsch, and Sergey Golitsynskiy. 2015. Avocado Research Email Collection. LDC2015T03. DVD. Philadelphia: Linguistic Data Consortium.

Usage

To extract data:

  • The system requires Python 2.x, Java and MySQL (no specific version requirements)
  • You must have libpst installed (see http://www.five-ten-sg.com/libpst); libpst requires boost and libgsf
    Make sure to edit your PYTHONPATH:
PYTHONPATH=$PYTHONPATH:/usr/local/lib/python2.6/site-packages
export PYTHONPATH
  • Create a file config.properties, preferably outside your source code directory (see config-SAMPLE.properties as an example; the file must be formatted with no spaces around '=').
  • Create a symbolic link to config.properties in the root and python/ directories
  • Run util/run_full_index.sh from the root directory:
$ ./util/run_full_index.sh

If you are processing a large data set on a remote server, it is a good idea to run the process in the background and have it ignore the HUP (hangup) signal, so that it does not die when you log out:

$ nohup ./util/run_full_index.sh &
  • If you need to recompile the Java files, you may need to include the -sourcepath and -classpath options. For example:
$ javac -d pluto/bin -sourcepath pluto/ -cp $CLASSPATH:/<path-to-tika-app.jar> pluto/TikaExtractor.java
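
Because config.properties must be written with no spaces around '=', it can be worth checking the file before starting a long run. The sketch below shows one way to do that check. It is not part of this repository: the key names are hypothetical (config-SAMPLE.properties is the authoritative list), treating lines that start with '#' as comments is an assumption, and the code is written for Python 3.

```python
def parse_properties(text: str) -> dict:
    """Parse key=value lines, rejecting any whitespace around '='."""
    props = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines are skipped.
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            raise ValueError("line %d: expected key=value" % lineno)
        key, value = line.split("=", 1)
        if not key:
            raise ValueError("line %d: empty key" % lineno)
        if key != key.rstrip() or value != value.lstrip():
            raise ValueError("line %d: spaces around '='" % lineno)
        props[key] = value
    return props


# Hypothetical keys, for illustration only.
sample = "# database settings\ndb.host=localhost\ndb.name=pluto\n"
props = parse_properties(sample)
print(props)
```

A line such as `db.user = pluto` raises a ValueError that names the offending line number.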

To generate distribution files:

  • Edit the file main.py to specify the output directory and the distro builder (builder1 or builder2)
  • Run main.py from the python/ directory:
$ python main.py

To set up the web-based system:

  • Create a file config.ini, preferably outside your source code directory (see php/config-SAMPLE.ini as an example)
  • Create a symbolic link to config.ini in the php/ directory
  • Point your web server to the webapp directory, which is your root web directory

Disclaimer:

  1. This code's only purpose was to generate the Avocado Research Email Collection. It has not been systematically tested on other PST data sets.
  2. The code may have been slightly modified before generating the final published version of the data. For example, the metadata format generated by python/distro/builder1.py was not used for the final distribution (it did not combine custodian items in a single XML file and instead used xi:include directives that referenced individual item files). See python/distro/builder2.py for the alternative format.
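
The difference between the two metadata layouts described above can be illustrated with the standard library. This sketch is not taken from builder1.py or builder2.py: the element names, attributes, and file paths are hypothetical, and only the general idea (a custodian file that references item files through xi:include versus one file with the items inlined) comes from the text. It is written for Python 3.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

XI_NS = "http://www.w3.org/2001/XInclude"
ET.register_namespace("xi", XI_NS)  # serialize the namespace as the xi: prefix

# Hypothetical per-item metadata files, keyed by relative path.
ITEM_FILES = {
    "items/001.xml": '<item id="001" type="email"/>',
    "items/002.xml": '<item id="002" type="attachment"/>',
}


def build_with_includes(custodian_id, hrefs):
    """Custodian element that references each item file via xi:include."""
    root = ET.Element("custodian", {"id": custodian_id})
    for href in hrefs:
        ET.SubElement(root, "{%s}include" % XI_NS, {"href": href})
    return root


def load_item(href, parse, encoding=None):
    """Loader for ElementInclude: serve item files from the dict above."""
    if parse != "xml":
        raise ValueError("only XML includes are handled in this sketch")
    return ET.fromstring(ITEM_FILES[href])


def build_combined(custodian_id, hrefs):
    """Single custodian element with every item inlined."""
    root = build_with_includes(custodian_id, hrefs)
    ElementInclude.include(root, loader=load_item)  # expands in place
    return root


referencing = build_with_includes("c01", sorted(ITEM_FILES))
combined = build_combined("c01", sorted(ITEM_FILES))
print(ET.tostring(referencing, encoding="unicode"))
print(ET.tostring(combined, encoding="unicode"))
```

The referencing layout keeps each item in its own file and relies on the consumer to resolve the includes, while the combined layout is self-contained.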

What's inside

  • demo: sample PST source file + extracted data (data is a subset of the EDRM Enron Email Data Set)
  • java: Java classes (source + compiled) used to interface with Apache Tika
  • lib: mysql connector JAR file + shell script to obtain tika JAR file (see readme)
  • php: web-based system for exploring the data and generating custom distributions
  • python: the core data processing code
  • sql: SQL scripts
  • util: shell script for running all data extraction and subsequent processing

License

Public domain
