Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Clojure library for using the Oslo-Bergen-Tagger

branch: master

Fetching latest commit…

Octocat-spinner-32-eaf2f5

Cannot retrieve the latest commit at this time

Octocat-spinner-32 src
Octocat-spinner-32 .gitignore
Octocat-spinner-32 README.md
Octocat-spinner-32 project.clj
README.md

clj-obt

The Oslo-Bergen-Tagger is a GPL licensed Norwegian text tagger. This is an interface for accessing it with Clojure.

clj-obt supports Linux only, and you must acquire the tagger from http://tekstlab.uio.no/obt-ny/ or clone it here on GitHub: https://github.com/noklesta/The-Oslo-Bergen-Tagger

Installation

Simply add the library with Leiningen: [clj-obt "0.5.1"] and require clj-obt.core

Usage

Before tagging any text, you must set the path to the Oslo-Bergen-Tagger by calling:

(set-obt-path! "path/to/obt")

Or you can supply it when using the tagger function:

(obt-tag "tag this" "path/to/obt")

You only need to set the path once per session. Currently, only disambiguated bokmål is supported.

The resulting output from the tagger is parsed to simple maps. There are helper functions in clj-obt.tools to manipulate and filter the tagged words.

Startup time of OBT

There is a startup cost in calling the tagger. If you call obt-tag on a number of texts sequentially, there might be a about a seconds worth of overhead for each invocation. This is due to the startup time of OBT. We can get around this by giving obt-tag all the texts in a vector like this:

(obt-tag ["many" "texts" "to" "be" "tagged"])

The tagger function will then concatenate all the texts, tag them in one OBT invocation, split them up again, and return them to you - separately in a vector. Doing this, we save about n-1 of OBT startup time cost.

License

Copyright (C) 2011-2012 Aleksander Skjæveland Larsen

Distributed under the Eclipse Public License, the same as Clojure.

Something went wrong with that request. Please try again.