Concraft-pl

This package provides a morphosyntactic tagger for the Polish language. The tool combines the following components into a pipeline:

  • Maca, a morphosyntactic segmentation and analysis tool,
  • Concraft, a morphosyntactic disambiguation library.

For now, the tagger doesn't provide any lemmatisation capabilities. As a result, it may output multiple interpretations (all with the same morphosyntactic tag, but with different lemmas) for some known words, while for out-of-vocabulary words it simply outputs the orthographic forms as lemmas.

See the homepage if you wish to download a pre-trained model for the Polish language.

Installation

You will need the Glasgow Haskell Compiler (GHC) and the Cabal tool to build Concraft-pl. The easiest way to get both GHC and Cabal is to install the latest Haskell Platform.

Unless you plan to use a custom preprocessing pipeline or run Maca on a different machine (see section Tagging analysed data), you will also need the Maca tool. A detailed installation guide can be found on the Maca homepage.

To install Concraft-pl from the official Hackage repository just run:

cabal install concraft-pl

The concraft-pl tool will be installed in the ~/.cabal/bin directory by default.

If you want to upgrade Concraft-pl to a newer version, update the package list first:

cabal update 
cabal install concraft-pl

To install the latest development version from GitHub, run

cabal install

from the concraft-pl top-level directory.

Data format

The current version of Concraft-pl works on a simple plain text format supported by the Corpus2 tools. You will have to install these tools when you install Maca anyway, so you can use them to convert the output generated by Concraft-pl to one of the other formats supported by Corpus2.

Training

If you have training material with disambiguation annotations (stored in the plain text format), you can train the Concraft-pl model yourself:

concraft-pl train train.plain -e eval.plain -o model.gz

Concraft-pl uses the NKJP morphosyntactic tagset definition by default. It will also reanalyse the input data before the actual training. If you want to change this behaviour, use the --tagset and --noana command-line options.

Consider using GHC runtime system (RTS) options. The -N option speeds up processing by making use of multiple cores. The -s option prints runtime statistics, such as the time spent in the garbage collector. If the program spends too much time collecting garbage, try increasing the allocation area size with the -A option. If your dataset is too big to fit in memory, use the --disk flag. For example, to train the model using four threads and a 256M allocation area, run:

concraft-pl train train.plain -e eval.plain -o model.gz +RTS -N4 -A256M -s

Run concraft-pl train --help to learn more about the program arguments and possible training options.
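If you launch training from a script, the command above can be assembled programmatically. The following is only a minimal sketch: the helper name train_command and its parameters are hypothetical, and it uses only the flags shown in this README (-e, -o, +RTS, -N, -A, -s).

```python
import shlex


def train_command(train_path, eval_path, model_path,
                  threads=None, alloc_area=None, stats=False):
    """Build the argv list for `concraft-pl train` (hypothetical helper).

    Only flags documented in this README are used. RTS options are
    appended after `+RTS`, as in the example command above.
    """
    argv = ["concraft-pl", "train", train_path, "-e", eval_path, "-o", model_path]
    rts = []
    if threads is not None:
        rts.append(f"-N{threads}")       # number of cores to use
    if alloc_area is not None:
        rts.append(f"-A{alloc_area}")    # allocation area size, e.g. "256M"
    if stats:
        rts.append("-s")                 # print runtime statistics
    if rts:
        argv += ["+RTS"] + rts
    return argv


cmd = train_command("train.plain", "eval.plain", "model.gz",
                    threads=4, alloc_area="256M", stats=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where concraft-pl is installed.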

Finally, you may consider pruning the resulting model to reduce its size. Features with values close to 0 (in the log domain) have little effect on the modeled probability, so it should be safe to discard them:

concraft-pl prune -t 0.05 input-model.gz pruned-model.gz
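The idea behind pruning can be illustrated with a toy example. This is a sketch under stated assumptions, not the actual Concraft-pl implementation: the model is represented here as a plain dictionary from hypothetical feature names to log-domain values, and the threshold is assumed to be compared against the absolute value of each feature.

```python
import math


def prune(weights, threshold):
    """Drop features whose log-domain value is closer to 0 than `threshold`.

    Assumption: the cutoff applies to the absolute value of the weight.
    """
    return {feat: w for feat, w in weights.items() if abs(w) >= threshold}


# Hypothetical feature values, for illustration only.
weights = {"f1": 1.2, "f2": -0.01, "f3": 0.04, "f4": -0.8}
pruned = prune(weights, 0.05)

for feat in sorted(weights):
    # A log-domain value w corresponds to a multiplicative factor exp(w);
    # values near 0 give factors near 1, which barely change the result.
    status = "kept" if feat in pruned else "dropped"
    print(f"{feat}: factor {math.exp(weights[feat]):.3f} ({status})")
```

In this toy model, the dropped features f2 and f3 correspond to factors very close to 1, which is why removing them changes the modeled probability only slightly.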

Tagging

Once you have a Concraft-pl model, you can use the following command to tag the input.txt file:

concraft-pl tag model.gz < input.txt > output.plain

The input file is first divided into paragraphs (the tool interprets empty lines as paragraph ending markers). After that, Maca is used to segment and analyse each paragraph. Finally, the Concraft module is used to disambiguate each sentence in the Maca output.
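The first step, splitting on empty lines, can be sketched as follows. This is an illustration of the convention rather than the tool's own code; the function name split_paragraphs is hypothetical, and treating whitespace-only lines as empty is an assumption.

```python
def split_paragraphs(text):
    """Split text into paragraphs, using empty lines as paragraph endings.

    Assumption: lines containing only whitespace also count as empty.
    Consecutive empty lines do not produce empty paragraphs.
    """
    paragraphs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            # An empty line closes the paragraph collected so far.
            paragraphs.append("\n".join(current))
            current = []
    if current:
        # The last paragraph may not be followed by an empty line.
        paragraphs.append("\n".join(current))
    return paragraphs


sample = "Ala ma kota.\nKot ma Alę.\n\n\nDrugi akapit.\n"
for i, par in enumerate(split_paragraphs(sample), start=1):
    print(i, repr(par))
```

Preparing input files with this convention in mind keeps the text that belongs together in a single paragraph when it is passed to Maca.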

With the --marginals option enabled, Concraft-pl will output marginal probabilities corresponding to individual tags (determined on the basis of the disambiguation model) instead of disamb markers.

Run concraft-pl tag --help to learn more about possible tagging options.

Server

Concraft-pl also provides a client/server mode. It is handy when, for example, you need to tag a large collection of small files. Loading a Concraft-pl model from disk takes a considerable amount of time, which makes the tagging method described above very slow in such a setting.

To start the Concraft-pl server, run:

concraft-pl server --inmodel model.gz

You can supply a custom port number using the --port option. For example, to run the server on port 10101, use the following command:

concraft-pl server --inmodel model.gz --port 10101

To use the server in a multi-threaded environment, you need to specify the -N RTS option. A set of options which usually yields good server performance is presented in the following example:

concraft-pl server --inmodel model.gz +RTS -N -A4M -qg1 -I0

Run concraft-pl server --help to learn more about possible server-mode options.

The client mode works just like the tagging mode. The only difference is that, instead of supplying your client with a model, you need to specify the port number (if you used a custom one when starting the server; otherwise, the default port number will be used):

concraft-pl client --port 10101 < input.txt > output.plain

Run concraft-pl client --help to learn more about possible client-mode options.
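For the use case mentioned at the start of this section, tagging a large collection of small files, the client can be called in a loop while the server keeps the model loaded. Below is a minimal sketch; the helper names client_command and tag_files, the .plain output naming, and the injectable run parameter are assumptions made for illustration. It relies only on the client invocation shown above.

```python
import subprocess
from pathlib import Path


def client_command(port=None):
    """Build the argv list for `concraft-pl client` (hypothetical helper)."""
    argv = ["concraft-pl", "client"]
    if port is not None:
        argv += ["--port", str(port)]
    return argv


def tag_files(paths, port=None, run=subprocess.run):
    """Send each file to a running Concraft-pl server, one client call per file.

    Output is written next to the input with a `.plain` suffix (an
    assumption; inputs must not already end in `.plain`). The `run`
    parameter exists so the function can be tried out without the tool.
    """
    outputs = []
    for path in map(Path, paths):
        out_path = path.with_suffix(".plain")
        with path.open("rb") as src, out_path.open("wb") as dst:
            # Equivalent to: concraft-pl client --port PORT < path > out_path
            run(client_command(port), stdin=src, stdout=dst, check=True)
        outputs.append(out_path)
    return outputs
```

With a server already started on port 10101, calling tag_files(Path("docs").glob("*.txt"), port=10101) would tag every .txt file in a (hypothetical) docs directory.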

Tagging analysed data

In some situations you might want to feed Concraft-pl with previously analysed data. Perhaps your Maca instance is installed on a different machine, or maybe you want to use Concraft-pl with a custom preprocessing pipeline.

If you want to use a preprocessing pipeline significantly different from the standard one (Maca), you should first train your own Concraft model. To train the model on analysed data, use the --noana training flag.

Use the same --noana flag when you want to tag analysed data. The input format should be the same as the output format. This option is currently not supported in the client/server mode.

Remember to use the same preprocessing pipeline (segmentation + analysis) for both training and disambiguation. Inconsistencies between training material and input data may severely harm the quality of disambiguation.
