marco-schmidt/messy

messy

A tool suite for electronic messages.

Features

Input Formats

Messy recursively reads archive and container formats and parses several types of messages. Typically, one or more messages are stored in a file using a container format. One or more of those container files are then stored within an archive.
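The recursive reading described above can be illustrated with a minimal sketch that peels nested gzip layers using only the Java standard library. The class and method names (RecursiveUnwrapSketch, unwrap, gzip, readAll) are hypothetical and not taken from the messy code base, which handles more formats than gzip.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper, not part of messy: unwraps nested gzip layers. */
final class RecursiveUnwrapSketch {

  /** Returns a stream over the innermost payload of possibly nested gzip data. */
  static InputStream unwrap(InputStream in) throws IOException {
    BufferedInputStream buffered = new BufferedInputStream(in);
    // Peek at the first two bytes without consuming them.
    buffered.mark(2);
    int b0 = buffered.read();
    int b1 = buffered.read();
    buffered.reset();
    // Gzip streams start with the bytes 0x1f 0x8b.
    if (b0 == 0x1f && b1 == 0x8b) {
      return unwrap(new GZIPInputStream(buffered));
    }
    return buffered;
  }

  /** Compresses data with gzip; only used to build sample input. */
  static byte[] gzip(byte[] data) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      out.write(data);
    }
    return bytes.toByteArray();
  }

  /** Reads a stream to its end (Java 8 compatible). */
  static byte[] readAll(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buffer = new byte[4096];
    int n;
    while ((n = in.read(buffer)) != -1) {
      out.write(buffer, 0, n);
    }
    return out.toByteArray();
  }
}
```

Calling unwrap on data that was gzipped twice yields the original bytes; data without a gzip signature is passed through unchanged.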

Archive Formats

These are general-purpose archive formats, not specific to messages.

  • Single compressed files
    • Gzip (.gz)
    • Bzip2 (.bz2)
    • Compress (.Z)
  • Multiple files stored without compression
    • Tar (.tar)
  • Multiple files stored with compression
    • Zip (.zip)
    • 7-Zip (.7z)

Container Formats

  • Mbox files.
    • Supports various subtypes.
  • Newline-delimited JSON (ndjson) files.
  • Hamster data.dat files.
  • Single-message files.
    • File extensions .eml and .msg.
    • Newsspool messages, no file extension, name is an integer number.
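
As an illustration of what a container format involves, the following sketch splits the text of a basic mbox file into individual messages. It assumes the simplest convention, where a line starting with "From " at the start of the file or after a blank line begins a new message. It does not handle the subtypes mentioned above (such as quoted ">From " lines), and the class name MboxSplitSketch is hypothetical, not part of messy.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of messy: splits basic mbox text into messages. */
final class MboxSplitSketch {

  /** Returns each message without its "From " separator line. */
  static List<String> split(String mbox) {
    List<String> messages = new ArrayList<>();
    StringBuilder current = null;
    boolean previousBlank = true; // the start of the file counts as a boundary
    for (String line : mbox.split("\n", -1)) {
      if (previousBlank && line.startsWith("From ")) {
        // Separator line: close the message collected so far, start a new one.
        if (current != null) {
          messages.add(current.toString());
        }
        current = new StringBuilder();
      } else if (current != null) {
        current.append(line).append('\n');
      }
      previousBlank = line.isEmpty();
    }
    if (current != null) {
      messages.add(current.toString());
    }
    return messages;
  }
}
```

The blank line that precedes each separator is kept at the end of the preceding message; a real parser would decide per mbox subtype whether to strip it.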

Message Formats

  • Internet Message Format (IMF) used with email and Usenet messages.
  • A News messages, a 1980s format for Usenet messages.
  • JSON tweets distributed as a directory tree of files, each compressed with bzip2, with the directory tree then packed into a single tar archive file.
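
To make the Internet Message Format more concrete, here is a minimal sketch of header parsing: header fields are "Name: value" lines, continuation lines start with a space or tab, and the first empty line ends the header section. The class name ImfHeaderSketch is hypothetical and this is not messy's parser. As simplifications, repeated fields overwrite each other and folded lines are joined with a single space.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not part of messy: parses the header section of a message. */
final class ImfHeaderSketch {

  /** Returns header fields keyed by lower-case field name, in order of appearance. */
  static Map<String, String> parseHeaders(String message) {
    Map<String, String> headers = new LinkedHashMap<>();
    String lastName = null;
    for (String line : message.split("\r?\n")) {
      if (line.isEmpty()) {
        break; // an empty line separates the headers from the body
      }
      char first = line.charAt(0);
      if ((first == ' ' || first == '\t') && lastName != null) {
        // Folded continuation line: append to the previous field's value.
        headers.put(lastName, headers.get(lastName) + " " + line.trim());
        continue;
      }
      int colon = line.indexOf(':');
      if (colon <= 0) {
        continue; // skip malformed lines in this sketch
      }
      lastName = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      headers.put(lastName, line.substring(colon + 1).trim());
    }
    return headers;
  }
}
```

For example, a message whose headers are "Subject: Hello" followed by the folded line " world" produces the value "Hello world" under the key "subject".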

Storage

Status

Created November 8th, 2020. As of 2021, a one-person hobby project. Command-line application msgcli can be used to explore message archives, converting messages to JSON and printing them to standard output.

Goals

Human Goals

  • Help users sort through, triage, clean up and consolidate their messages as a basis for discovery, backup and archival.
  • Enable digital preservation of public messages as a part of computing history.
  • Simplify bulk exchange of messages between interested parties.

Technological Goals

  • Parse electronic messages of various types.
  • Support different file formats.
  • Read messages from servers with different protocols.
  • Handle extraction of attachments and references to external information.
  • Create a message database with full text search and reporting.
  • Analyze messages to allow more fine-grained search, separate public from private ones.

Command-Line Application

The command-line application msgcli reads messages from standard input or files, converts them, and either prints a summary of each message to standard output or uploads it to Elastic.

Clone the git repository and install msgcli locally:

$ javac -version
# ... should print version 1.8 or higher
$ cd ~
$ git clone https://github.com/marco-schmidt/messy.git
...
$ cd messy
$ ./gradlew :msgcli:install
...
$ alias m='/path/to/homedir/messy/msgcli/build/install/msgcli/bin/msgcli'
$ m ../test.mbox
...

The application can now be used with m.

This makes msgcli upload the content of a Twitter stream tar file to an Elasticsearch instance running locally and listening on port 9200:

$ export MESSY_OUTPUT_FORMAT=ELASTIC
$ m /path/to/twitter-stream-2017-07-01.tar
{"@timestamp":"2021-12-04T16:49:56.631+01:00","message":"Connected to Elastic server 'localhost:9200'.","logger_name":"messy.msgsearch.elastic.ElasticOutputProcessor","thread_name":"main","level":"INFO","level_value":20000,"server_type":"Elastic","host":"localhost","port":9200,"app_name":"msgcli"}
{"@timestamp":"2021-12-04T16:49:56.655+01:00","message":"Opening file '/path/to/twitter-stream-2017-07-01.tar' (35864390 bytes).","logger_name":"messy.msgcli.app.InputProcessor","thread_name":"main","level":"INFO","level_value":20000,"file_name":"/path/to/twitter-stream-2017-07-01.tar","file_size":35864390,"app_name":"msgcli"}
...

This uses the Unix tool find to create a list of mbox files and pipes it to msgcli, which prints two properties as tab-separated values to standard output:

$ export MESSY_OUTPUT_FORMAT=TSV
$ export MESSY_OUTPUT_ITEMS=AUTHOR_ID,AUTHOR_NAME
$ find /mnt/hdd2/archive/usenet -type f -name '*.mbox'|m -@
...
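
For readers unfamiliar with tab-separated output, the following sketch shows one way a row of values could be written. It is not taken from msgcli: the class name TsvRowSketch and the choice to replace tabs and line breaks inside values with a space are assumptions for illustration only.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of messy: formats one row of tab-separated values. */
final class TsvRowSketch {

  /** Joins values with tabs; tabs and line breaks inside a value become a space. */
  static String row(List<String> values) {
    return values.stream()
        .map(v -> v == null ? "" : v.replaceAll("[\\t\\r\\n]+", " "))
        .collect(Collectors.joining("\t"));
  }
}
```

With this sketch, the two values "user@example.org" and "Jane Doe" become a single line with one tab between them, so each message occupies exactly one output line.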

Known Limitations

  • 7-Zip archives can only be opened as standalone files, not when they are nested inside other archives.
  • Hamster message data files have no magic bytes file signature to properly identify them. Their file name data.dat is therefore used to detect them.
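
The second limitation can be sketched as follows: formats with a well-known signature are recognized from their first bytes, and only if none matches does the file name decide. The signatures used are the commonly documented ones for gzip, bzip2, compress, zip (local file header) and 7-Zip. The class FormatDetectSketch, its enum and the exact matching rules are hypothetical and do not reflect how messy implements detection.

```java
/** Hypothetical helper, not part of messy: detects a format from signature or name. */
final class FormatDetectSketch {

  enum Format { GZIP, BZIP2, COMPRESS, ZIP, SEVEN_ZIP, HAMSTER, UNKNOWN }

  /**
   * @param fileName base name of the file, without directories
   * @param head the first bytes of the file
   */
  static Format detect(String fileName, byte[] head) {
    if (startsWith(head, 0x1f, 0x8b)) {
      return Format.GZIP;
    }
    if (startsWith(head, 'B', 'Z', 'h')) {
      return Format.BZIP2;
    }
    if (startsWith(head, 0x1f, 0x9d)) {
      return Format.COMPRESS;
    }
    if (startsWith(head, 'P', 'K', 0x03, 0x04)) {
      return Format.ZIP;
    }
    if (startsWith(head, '7', 'z', 0xbc, 0xaf, 0x27, 0x1c)) {
      return Format.SEVEN_ZIP;
    }
    // No signature exists for Hamster data files, so fall back to the file name.
    if ("data.dat".equals(fileName)) {
      return Format.HAMSTER;
    }
    return Format.UNKNOWN;
  }

  /** Compares the leading bytes of data with the given unsigned byte values. */
  private static boolean startsWith(byte[] data, int... signature) {
    if (data.length < signature.length) {
      return false;
    }
    for (int i = 0; i < signature.length; i++) {
      if ((data[i] & 0xff) != signature[i]) {
        return false;
      }
    }
    return true;
  }
}
```

A consequence of the name-based fallback is that a Hamster file that has been renamed would be reported as UNKNOWN by this sketch.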

Technology Stack

  • Written in Java 8, using Adoptium (but any JDK version 8 or higher should do).
  • Build tool gradle, as a multi-project build with the gradle wrapper.
  • Hosted in a public git repository at GitHub.
  • Continuous integration with GitHub Workflow Java CI.
  • Dependencies:
  • Static code analysis with
  • Project comes with an Eclipse configuration file and gradle is configured to generate a workspace for Eclipse. Any other Java IDE will probably also work.
  • Code formatting and license headers with the gradle spotless plugin. Code is also formatted automatically when saving in Eclipse (if the provided configuration file is used; see below for gradle Eclipse workspace setup).
  • Vulnerability analysis:
    • Gradle plugin dependencyCheck. It compares direct and transitive dependencies to CVE entries in the National Vulnerability Database (NVD).
    • GitHub workflow service CodeQL.
  • API documentation with javadoc.
  • Code coverage reporting with jacoco and codecov.io.
  • Check for new versions of dependencies with gradle plugin versions.
  • Create reports of dependencies and their licenses, and check licenses against a positive list.

Development Setup

  • Install JDK 8 or higher on the system.
  • Set the environment variable JAVA_HOME to the JDK installation path and include its bin subdirectory in the PATH variable. Run javac -version, and possibly which java, to make sure that the right Java compiler and virtual machine are available.
  • Clone the messy git repository.
  • Navigate to the cloned working copy and run ./gradlew check as an initial toolchain check.
  • Install Eclipse IDE, run ./gradlew eclipse in the cloned working copy, open Eclipse and import projects msg*.