Skip to content


Repository files navigation

Gransk - Document processing for investigations

A tool for when you have a bunch of documents to figure out of. Introduction to Gransk (YouTube)

Build Status Documentation Status Coverage Status

Gransk is an open source tool that aims to be a Swiss army knife of document processing and analysis. Its primary objective is to quikly provide users with insight to their documents during investigations. It includes a processing engine written in Python and a web interface. Under the hood it uses Apache Tika for content extraction, Elasticsearch for data indexing, and dfVFS to unpack disk images.


Using VirtualBox:
  1. Download Gransk VM:
  2. Open VirtualBox and click "File" -> "Import appliance". Choose downloaded VM.
  3. Double click on the imported machine. (Hold shift to run in background)
  4. After a couple of seconds. open a web browser and go to http://localhost:8084
Using Docker on Linux/Mac:
curl -o -X GET
sh ./
Using Docker on Windows:

Type the following command in to powershell

Invoke-WebRequest -Outfile docker-quickstart.ps1
powershell -ExecutionPolicy ByPass -File docker-quickstart.ps1


  • Unpack disk images with dfVFS and archives with 7zip
  • Extract metadata and text from documents with Apache Tika
  • Named entity recognition with Polyglot (NER) and Namefinder
  • Entity extraction with regular expressions
  • Simple data statistics
  • Search and explore data with Elasticsearch
  • +++

Processing tested on Python 2.7 and 3.4. The web interface requires a modern web browser.

Processing overview



Subscribers are registered in config.yml.

import gransk.core.abstract_subscriber as abstract_subscriber
import gransk.core.helper as helper

class Subscriber(abstract_subscriber.Subscriber):

  def consume(self, doc, text):
    doc.meta['num_chars'] = len(text)

Programmatically adding files

import io

import gransk.api as api
import gransk.core.document as document

gransk = api.API(u'config.yml')

doc = document.get_document(u'filename-or-path.txt')
doc.tag = u'demo'

content = io.BytesIO(b'Data buffer')

gransk.add_file(doc, content)


Conventions, code quality and documentation


autopep8 --indent-size 2 --max-line-length 80 --in-place --recursive --aggressive gransk
py.test --cov-report html --cov gransk gransk
pylint --rcfile=.pylintrc gransk

Web interface

cd gransk/web/tests && npm install && cd ../../../
gransk/web/tests/node_modules/.bin/karma start gransk/web/tests/cover.conf.js
gransk/web/tests/node_modules/.bin/karma start gransk/web/tests/watch.conf.js
jshint gransk/web/static/modules/* gransk/web/tests/spec/modules/*

Continuous integration

Test build:

docker build -t gransk-prebuilt -f utils/local-travis/Dockerfile .
docker run -v $PWD:/app --entrypoint=python -it gransk-prebuilt utils/local-travis/


Generate docs locally:

pip install sphinx sphinx_rtd_theme
sphinx-build -c docs -b html docs/ local_data/build


git clone && cd gransk
virtualenv pyenv
source pyenv/bin/activate
pip install -r utils/dfvfs-requirements.txt
pip install -r requirements.txt
python install
python download

Processing from command line

python -m /path/to/data
python -m --help

Using Docker:

docker run -v /path/to/data:/data --entrypoint=python -i -t pcbje/gransk -m --workers=4 /data

Starting web UI

python -m gransk.boot.web

Using Docker:

docker run -p 8084:8084 --entrypoint=python -i -t pcbje/gransk -m gransk.boot.ui --host=


See es-auto-query.


  • dfVFS: Apache License Version 2.0
  • Apache Tika: Apache License Version 2.0
  • 7zip: GNU LGPL
  • Elasticsearch: Apache License Version 2.0
  • Flask: BSD

Uh, "gransk"?

"Gransk" is imperative form of "investigate" in Norwegian.