UimaScala

About

uimaScala is a toolkit for developing natural language applications in Scala. It is based mainly on uimaFIT, which is itself based on Apache UIMA. To develop a natural language processing (NLP) application in Apache UIMA you need to work with lots of XML files: nearly every Java class needs an XML file, and whenever the Java class changes you also need to change its XML file. uimaFIT addresses this problem with reflection and removes nearly all XML files.

This project started as a wrapper for uimaFIT. Scala's collection library and functional programming features make it much easier to develop NLP applications. A type-safe configuration system and a nicer DSL were also added.

This readme provides a short introduction. More documentation will be added later.

Set up a project

To use this project, add the following configuration to your build.sbt file. uimaScala requires Scala version 2.11:

scalaVersion := "2.11.1"

resolvers ++= Seq(
  "Sonatype OSS Releases"  at "http://oss.sonatype.org/content/repositories/releases/",
  "Sonatype OSS Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"
)

libraryDependencies += "com.github.jenshaase.uimascala" %% "uimascala-core" % "0.5.0-SNAPSHOT"

addCompilerPlugin("org.scalamacros" % "paradise" % "2.0.1" cross CrossVersion.full)

Next you need to tell UIMA where to find the description files. To do so, add the file types.txt to the folder src/main/resources/META-INF/org.apache.uima.fit with the following content:

classpath*:desc/types/**/*.xml

A simple annotator

Annotators in UIMA process a document. Most of the time they use annotations from previous annotators and combine them into new annotations. The following annotator is a tokenizer. It looks at the text and identifies single words, also called tokens. We can use Java's BreakIterator to tokenize the text. The toolkit also contains this class, with some additional processing:

package com.github.jenshaase.test

import com.github.jenshaase.uimascala.core._
import com.github.jenshaase.uimascala.core.configuration._
import java.util.Locale
import org.apache.uima.jcas.JCas
import java.text.BreakIterator

class BreakIteratorTokenizer extends SCasAnnotator_ImplBase {

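  // Configuration parameter: the locale used to find word boundaries.
  object locale extends Parameter[Locale](Locale.getDefault)

  def process(jcas: JCas) = {
    val text = jcas.getDocumentText
    val bi = BreakIterator.getWordInstance(locale.is)
    bi.setText(text)

    // Walk over consecutive boundaries; each pair (last, cur) delimits one segment.
    var last = bi.first
    var cur = bi.next
    while (cur != BreakIterator.DONE) {
      // Skip whitespace-only segments and annotate everything else as a Token.
      if (text.substring(last, cur).trim.nonEmpty) {
        jcas.annotate[Token](last, cur)
      }

      last = cur
      cur = bi.next
    }
  }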
}

An annotator in uimaScala extends the SCasAnnotator_ImplBase class. To implement this class you need to implement the process method. Here we use Java's BreakIterator to process the document. For each token we add a new annotation of type Token (the next part explains how to create such a type). You can also see the locale configuration parameter. It has a name (locale), a type (Locale) and a default value (Locale.getDefault). This parameter can be changed when the component is used in a UIMA pipeline.
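To see what the boundary loop produces without setting up UIMA, the same iteration can be tried in plain Scala. The following is a minimal sketch that uses only java.text.BreakIterator; the object TokenSpans and the method tokenSpans are hypothetical names chosen for this illustration and are not part of uimaScala.

```scala
import java.text.BreakIterator
import java.util.Locale

// Hypothetical helper, not part of uimaScala: returns the (begin, end)
// offsets that the annotator above would annotate as tokens.
object TokenSpans {
  def tokenSpans(text: String, locale: Locale = Locale.getDefault): List[(Int, Int)] = {
    val bi = BreakIterator.getWordInstance(locale)
    bi.setText(text)

    val spans = List.newBuilder[(Int, Int)]
    var last = bi.first()
    var cur = bi.next()
    while (cur != BreakIterator.DONE) {
      // Keep only segments that contain something other than whitespace.
      if (text.substring(last, cur).trim.nonEmpty) spans += ((last, cur))
      last = cur
      cur = bi.next()
    }
    spans.result()
  }

  def main(args: Array[String]): Unit = {
    val text = "this is a text"
    // Map the offsets back to the covered text of each token.
    println(tokenSpans(text, Locale.US).map { case (b, e) => text.substring(b, e) })
  }
}
```

Working with offsets rather than substrings mirrors how the annotator calls jcas.annotate[Token](last, cur): an annotation only records where a token begins and ends, and the covered text is looked up from the document when needed.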

Adding your own type system description

The goal of an annotator is to add new annotations to the text. With UIMA you create custom annotations in XML files and then generate the Java classes. uimaScala uses a Scala macro and a custom DSL to provide this feature. To create your type system, define an object in your Scala code:

package com.github.jenshaase.test

import com.github.jenshaase.uimascala.core.description._ 

@TypeSystemDescription
object TypeSystem {

  val Token = Annotation {
    val pos = Feature[String]
    val lemma = Feature[String]
    val stem = Feature[String]
  }

  val Sentence = Annotation {}
}

After running compile you can see the following output on your sbt console:

Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION:  >>JCasGen Creating: 'com.github.jenshaase.test.Token'.
Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION:  >>JCasGen Creating: 'com.github.jenshaase.test.Token_Type'.
Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION:  >>JCasGen Creating: 'com.github.jenshaase.test.Sentence'.
Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION:  >>JCasGen Creating: 'com.github.jenshaase.test.Sentence_Type'

Now the necessary Java files are created. You need to run compile again to compile the generated Java sources.

Running a pipeline

To run a pipeline, uimaScala uses the scalaz-stream library. We need to convert documents to a CAS and process the CAS with our annotators:

package com.github.jenshaase.test

import com.github.jenshaase.uimascala.core._
import com.github.jenshaase.uimascala.core.stream._
import scalaz._, Scalaz._
import scalaz.stream._
import java.util.Locale

object Main extends App {

  val p = Process("this is a text", "and another text") |>
    casFromText |>
    annotate(new BreakIteratorTokenizer().config(_.locale := Locale.US)) |>
    extractCas { cas =>
      cas.select[Token].map(_.getCoveredText).toList
    }

  println(p.toList)

  // Expected result: one list of tokens per input document.
  p.toList == List(
    List("this", "is", "a", "text"),
    List("and", "another", "text")
  )
}
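The three stages of the pipeline above (create a document container, annotate it, extract a result) can also be pictured as ordinary function composition. The following sketch is only an analogy in plain Scala: it uses neither UIMA nor scalaz-stream, and Doc, fromText, tokenize, coveredText and run are hypothetical names invented for this illustration.

```scala
import java.text.BreakIterator
import java.util.Locale

object PipelineSketch {
  // Hypothetical stand-in for a CAS: the document text plus token offsets.
  final case class Doc(text: String, tokens: Vector[(Int, Int)] = Vector.empty)

  // Stage 1: wrap raw text in a document container (the role of casFromText).
  val fromText: String => Doc = text => Doc(text)

  // Stage 2: an "annotator" that adds token offsets (the role of annotate).
  def tokenize(locale: Locale): Doc => Doc = doc => {
    val bi = BreakIterator.getWordInstance(locale)
    bi.setText(doc.text)
    val start = bi.first()
    val rest = Iterator.continually(bi.next()).takeWhile(_ != BreakIterator.DONE).toVector
    val boundaries = start +: rest
    // Pair each boundary with its successor and drop whitespace-only segments.
    val spans = boundaries.zip(boundaries.tail).filter {
      case (b, e) => doc.text.substring(b, e).trim.nonEmpty
    }
    doc.copy(tokens = spans)
  }

  // Stage 3: read the result back out (the role of extractCas).
  val coveredText: Doc => List[String] =
    doc => doc.tokens.map { case (b, e) => doc.text.substring(b, e) }.toList

  // The whole pipeline is the composition of the three stages.
  val pipeline: String => List[String] =
    fromText andThen tokenize(Locale.US) andThen coveredText

  def run(docs: List[String]): List[List[String]] = docs.map(pipeline)

  def main(args: Array[String]): Unit =
    println(run(List("this is a text", "and another text")))
}
```

Each stage only depends on the output type of the previous one, so further annotators could be added with another andThen as long as they map a Doc to a Doc.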

TODO

Add more documentation
