Added code blocks to setup guide
Mark Granroth-Wilding committed Mar 29, 2016
1 parent 56351c1 commit a054c1c
Showing 1 changed file with 19 additions and 0 deletions: docs/setup_guide.rst

You'll want to use the latest release of Pimlico. Check the website and download it.
Create a new directory to put your project in and extract the codebase into
a directory `pimlico` within the project directory. Let's say we're using `~/myproject/`.

.. code-block:: bash

    mkdir ~/myproject
    cd ~/myproject
    mv /path/to/downloaded/tarball.tar.gz .
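
To end up with the codebase in a directory `pimlico` inside the project directory, the extraction step might look something like the following. This is only a rough sketch: it assumes a gzipped tarball and that the archive unpacks into a versioned directory whose name starts with `pimlico-`, which may not match your download.

.. code-block:: bash

    # Extract the downloaded archive (assumes a gzipped tarball)
    tar -xzf tarball.tar.gz
    # Rename the unpacked directory to `pimlico`; the actual name depends on the release
    mv pimlico-* pimlico
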
Depending on what you want to do with Pimlico, you'll
also need to fetch dependencies. Let's start by getting the basic dependencies that will be needed regardless of what
module types you use.

.. code-block:: bash

    cd pimlico/lib/python
    make core

... backed up, so you don't lose your valuable output.

Create a file `~/.pimlico` that looks like this:

.. code-block:: ini

    long_term_store=/path/to/long-term/store
    short_term_store=/path/to/short-term/store

... a simple one as an example.
We're going to create the file `~/myproject/pipeline.conf`. Start by writing a `pipeline` section to give the
basic pipeline setup.

.. code-block:: ini

    [pipeline]
    name=myproject
    release=0.1

... documents. This is how the Gigaword corpus is stored, so if you have Gigaword, just ...

**TODO: add an example that everyone can run**

.. code-block:: ini

    [input-text]
    type=pimlico.datatypes.XmlDocumentIterator
    path=/path/to/data/dir

Perhaps your corpus is very large and you'd rather try out your pipeline on a small subset. In that case, add the
following option:

.. code-block:: ini

    truncate=1000

.. note::

    ...

We can do the grouping on the fly as we read data from the input corpus. The `tar_filter` module groups
documents together and subsequent modules will all use the same grouping to store their output, making it easy to
align the datasets they produce.

.. code-block:: ini

    [tar-grouper]
    type=pimlico.modules.corpora.tar_filter
    input=input-text

... things at once, calling OpenNLP tools.
Notice that the output from the previous module feeds into the input for this one, which we specify simply by naming
the module.

.. code-block:: ini

    [tokenize]
    type=pimlico.modules.opennlp.tokenize
    input=tar-grouper

Doing something more interesting: POS tagging
---------------------------------------------
Many NLP tools rely on part-of-speech (POS) tagging. Again, we use OpenNLP, and a standard Pimlico module
wraps the OpenNLP tool.

.. code-block:: ini

    [pos-tag]
    type=pimlico.modules.opennlp.pos
    input=tokenize

Fetching dependencies
---------------------
All the standard modules provide easy ways to get hold of their dependencies via makefiles for GNU Make. Let's get
Beautiful Soup.

.. code-block:: bash

    cd ~/myproject/pimlico/lib/python
    make bs4

Simple as that.

OpenNLP is a little trickier. To make things simple, we just get all the OpenNLP tools and libraries required to
run the OpenNLP wrappers at once. The `opennlp` make target gets all of these at once.

.. code-block:: bash

    cd ~/myproject/pimlico/lib/java
    make opennlp

At the moment, it's also necessary to build the Java wrappers around OpenNLP that are provided as part of Pimlico. For
this, you'll need a Java compiler installed on your system.

.. code-block:: bash

    cd ~/myproject/pimlico
    ant opennlp

There's one more thing to do: the tools we're using
require statistical models. We can simply download the pre-trained English models from the OpenNLP website.

.. code-block:: bash

    cd ~/myproject/pimlico/models
    make opennlp

Checking everything's dandy
---------------------------
We now run some checks over the pipeline to make sure that our config file is valid and we've got Pimlico basically
ready to run.

.. code-block:: bash

    cd ~/myproject/
    ./pimlico/bin/pimlico pipeline.conf check

... each module. This is intentional: in some setups, we might run different modules on different machines,
such that in no one of them do all modules have all of their dependencies. For us, however, this isn't the case, so
we can run further checks on the *runtime* dependencies of all our modules.

.. code-block:: bash

    ./pimlico/bin/pimlico pipeline.conf check --runtime

If that works as well, we're able to start running modules.

What modules to run?
--------------------
Pimlico can now suggest an order in which to run your modules. In our case, this is pretty obvious, seeing as our
pipeline is entirely linear – it's clear which ones need to be run before others.

.. code-block:: bash

    ./pimlico/bin/pimlico pipeline.conf schedule

The output also tells you the current status of each module. At the moment, all the modules are `UNSTARTED`.

Running the modules
-------------------
The modules can be run using the `run` command and specifying the module by name. We do this manually for each module.

.. code-block:: bash

    ./pimlico/bin/pimlico.sh pipeline.conf run input-text
    ./pimlico/bin/pimlico.sh pipeline.conf run tokenize
    ./pimlico/bin/pimlico.sh pipeline.conf run pos-tag

... people can replicate what you did.

First, let's create a directory where our custom source code will live.

.. code-block:: bash

    cd ~/myproject
    mkdir -p src/python

... the config file, so it's easy to distribute the two together.

Add this option to the `[pipeline]` section in the config file:

.. code-block:: ini

    python_path=src/python

Now you can create Python modules or packages in `src/python`, following the same conventions as the built-in modules.
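
As a minimal sketch of what that could look like, you might lay out a package for your custom code as below. The names `myprojectmodules` and `mymodule` are purely hypothetical placeholders, not anything Pimlico prescribes.

.. code-block:: bash

    # Create a hypothetical Python package for custom module code under src/python
    mkdir -p src/python/myprojectmodules/mymodule
    touch src/python/myprojectmodules/__init__.py
    touch src/python/myprojectmodules/mymodule/__init__.py
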