uimaScala is a toolkit for developing natural language applications in Scala. It is based mainly on uimaFIT, which is itself based on Apache UIMA. To develop natural language processing (NLP) applications in Apache UIMA you need to work with lots of XML files. For nearly every Java class you need an XML file, and if your Java class changes you also need to change your XML file. uimaFIT tries to solve this problem with reflection and removes nearly all XML files.
This project started as a wrapper for uimaFIT. With Scala's collection library and its functional programming features it is a lot easier to develop NLP applications. A type-safe configuration system and a nicer DSL were also added.
This readme provides a short introduction. More documentation will be added later.
To use this project, add the following configuration to your build.sbt file. uimaScala requires Scala version 2.11:
scalaVersion := "2.11.1"
resolvers ++= Seq(
  "Sonatype OSS Releases" at "http://oss.sonatype.org/content/repositories/releases/",
  "Sonatype OSS Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"
)
libraryDependencies += "com.github.jenshaase.uimascala" %% "uimascala-core" % "0.5.0-SNAPSHOT"
addCompilerPlugin("org.scalamacros" % "paradise" % "2.0.1" cross CrossVersion.full)
Next you need to tell UIMA where to find the description files. To do so, add the file types.txt to the folder src/main/resources/META-INF/org.apache.uima.fit with the following content:
classpath*:desc/types/**/*.xml
Annotators in UIMA process a document. Most of the time they use annotations from previous annotators and combine them into new annotations. The following annotator is a tokenizer. It looks at the text and identifies single words, also called tokens. We can use Java's BreakIterator to tokenize the text. You will also find this class in the toolkit, with some additional processing:
package com.github.jenshaase.test

import com.github.jenshaase.uimascala.core._
import com.github.jenshaase.uimascala.core.configuration._
import java.util.Locale
import org.apache.uima.jcas.JCas
import java.text.BreakIterator

class BreakIteratorTokenizer extends SCasAnnotator_ImplBase {

  // Configuration parameter "locale" of type Locale with a default value
  object locale extends Parameter[Locale](Locale.getDefault)

  def process(jcas: JCas) = {
    val bi = BreakIterator.getWordInstance(locale.is)
    bi.setText(jcas.getDocumentText)

    // Walk over consecutive boundaries; [last, cur) is one candidate token
    var last = bi.first
    var cur = bi.next
    while (cur != BreakIterator.DONE) {
      // Skip segments that contain only whitespace
      if (jcas.getDocumentText().substring(last, cur).trim != "") {
        jcas.annotate[Token](last, cur)
      }

      last = cur
      cur = bi.next
    }
  }
}
An annotator in uimaScala extends the SCasAnnotator_ImplBase class. To implement this class you need to implement the process method. Here we use Java's BreakIterator to process the document. For each token we add a new Token annotation (the next part explains how to create such a type). You can also see the locale configuration parameter. It has a name (locale), a type (Locale), and a default value (Locale.getDefault). This parameter can be changed when using the component in a UIMA pipeline.
The goal of an annotator is to add new annotations to the text. With UIMA you create custom annotations with XML files and then generate the Java classes. uimaScala uses a Scala macro and a custom DSL to provide this feature. To create your type system, define an object in your Scala code:
package com.github.jenshaase.test

import com.github.jenshaase.uimascala.core.description._

@TypeSystemDescription
object TypeSystem {

  // Token annotation with three String features
  val Token = Annotation {
    val pos = Feature[String]
    val lemma = Feature[String]
    val stem = Feature[String]
  }

  // Sentence annotation without features
  val Sentence = Annotation {}
}
After running compile you can see the following output on your sbt console:
Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION: >>JCasGen Creating: 'com.github.jenshaase.test.Token'.
Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION: >>JCasGen Creating: 'com.github.jenshaase.test.Token_Type'.
Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION: >>JCasGen Creating: 'com.github.jenshaase.test.Sentence'.
Jul 03, 2014 8:28:37 AM org.apache.uima.tools.jcasgen.UimaLoggerProgressMonitor subTask(35)
INFORMATION: >>JCasGen Creating: 'com.github.jenshaase.test.Sentence_Type'
Now the necessary Java files are created. You need to run compile
again to compile the generated Java sources.
To run a pipeline, uimaScala uses the scalaz-stream library. We need to convert documents to a CAS and process the CAS with our annotators:
package com.github.jenshaase.test

import com.github.jenshaase.uimascala.core._
import com.github.jenshaase.uimascala.core.stream._
import scalaz._, Scalaz._
import scalaz.stream._
import java.util.Locale

object Main extends App {

  // text -> CAS -> tokenizer -> covered text of every Token
  val p = Process("this is a text", "and another text") |>
    casFromText |>
    annotate(new BreakIteratorTokenizer().config(_.locale := Locale.US)) |>
    extractCas { cas =>
      cas.select[Token].map(_.getCoveredText).toList
    }

  println(p.toList)

  // p.toList is equal to the following value:
  p.toList == List(
    List("this", "is", "a", "text"),
    List("and", "another", "text")
  )
}
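The shape of such a pipeline can be modelled with plain Scala functions. This is only an illustrative sketch and uses neither uimaScala nor scalaz-stream: Doc is a hypothetical stand-in for a CAS, and fromText, tokenize and extract are made-up names that mirror the roles of casFromText, annotate and extractCas. Tokenization here is a simple whitespace split, not the BreakIterator annotator.

```scala
// Hypothetical model of a text -> CAS -> annotate -> extract pipeline.
object PipelineSketch {

  // Stand-in for a CAS: document text plus (begin, end) token offsets
  final case class Doc(text: String, tokens: List[(Int, Int)] = Nil)

  // Stage 1: wrap raw text in a document (role of casFromText)
  val fromText: String => Doc = text => Doc(text)

  // Stage 2: add token offsets (role of annotate); whitespace-based for brevity
  val tokenize: Doc => Doc = doc =>
    doc.copy(tokens = "\\S+".r.findAllMatchIn(doc.text).map(m => (m.start, m.end)).toList)

  // Stage 3: read the results back out of the document (role of extractCas)
  val extract: Doc => List[String] = doc =>
    doc.tokens.map { case (begin, end) => doc.text.substring(begin, end) }

  // Stages compose left to right and are applied to one document at a time
  val pipeline: String => List[String] = fromText andThen tokenize andThen extract

  def run(docs: List[String]): List[List[String]] = docs.map(pipeline)
}
```

Each stage only depends on the output of the previous one, which is why annotators can be added or swapped without touching the rest of the pipeline.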
- Add more documentation