The Pimlico Processing Toolkit (PIpelined Modular LInguistic COrpus processing) is a toolkit for building pipelines made up of linguistic processing tasks to run on large datasets (corpora).
It provides wrappers around many existing, widely used Natural Language Processing (NLP) tools. It makes it easy to write potentially complex pipelines and apply them to large datasets.
Pimlico aims:
- to provide clear documentation of what has been done;
- to make it easy to run standard NLP tasks on your data;
- to make it easy to implement your own non-standard tasks, specific to a pipeline;
- to support simple distribution of code for reproduction, for example, on other datasets.
Full documentation, including a guide on getting started with Pimlico, is available at http://pimlico.readthedocs.io.
Pimlico is hosted on GitHub.