Mark Granroth-Wilding Added multicore LDA to LDA module
Can now select whether to use Gensim's basic single-core implementation or multicore.
Latest commit 7dce17b Jul 13, 2018


The Pimlico Processing Toolkit

The Pimlico Processing Toolkit (PIpelined Modular LInguistic COrpus processing) is a toolkit for building pipelines made up of linguistic processing tasks to run on large datasets (corpora).

It provides wrappers around many existing, widely used Natural Language Processing (NLP) tools, making it easy to write potentially complex pipelines and apply them to large datasets.

Pimlico aims:

  • to provide clear documentation of what has been done;
  • to make it easy to run standard NLP tasks on your data;
  • to make it easy to implement your own non-standard tasks, specific to a pipeline;
  • to support simple distribution of code for reproduction, for example, on other datasets.

Full documentation, including a guide to getting started with Pimlico, is available at

Pimlico is hosted on GitHub