
Add new language models to MaltParser

roberto zanoli edited this page Feb 12, 2015 · 1 revision

(This memo outlines the process used to add a new language model to the MaltParser pipeline. It shows the general idea of how to add a new language, that is, a newly trained model, for DKPro-based UIMA tools when no model exists for your language.)

Summary: When you add a model for a DKPro-based UIMA AE (Analysis Engine), use the scripts provided with DKPro. You can generate a model for your new language relatively easily by copying and adapting the existing "model-packing" script examples.

Steps for adding a new model (DKPro-based MaltParser example)

  • Prepare DKPro source for MaltParser AE

    Adding a model for UIMA modules in the platform is relatively easy, especially when the underlying UIMA module is provided by DKPro.

    DKPro model generation is automated, and DKPro always provides a build script. You can extend this build script by copying an existing entry and modifying it to suit your needs. To do this, you need the DKPro source code.

    For now (EOP 1.1.3 / 1.1.4) we are using DKPro 1.4.0, so let's check out a copy of that version of the DKPro source.

    In our case, it was done like this (checking out for 1.4.0):

      svn checkout http://dkpro-core-asl.googlecode.com/svn/de.tudarmstadt.ukp.dkpro.core-asl/trunk/ dkpro-core-asl-read-only

  • Edit build script

    First, locate the model build script. Move to the directory /dkpro-core-asl-1.4.0/de.tudarmstadt.ukp.dkpro.core.maltparser-asl/src/scripts

    It contains one Ant build script (.xml), which knows how to handle a model. Take a brief look at it, and check some existing entries such as the French or English ones.

    Let's add one more entry for our case. Here we add an Italian entry. Note that we need a URL for a pre-built model of the underlying tool (here, a MaltParser model).

      <target name="it">
        <mkdir dir="target/download"/>
        <!-- Italian MaltParser model, 2014-10-02: the model file is available at the following URL -->
        <get
          src="http://hlt-services4.fbk.eu:8080/artifactory/simple/tmp/italian-malt.2014.oct.mco"
          dest="target/download/it-malt-linear.mco"
          skipexisting="true"/>

        <install-model-file groupId="de.tudarmstadt.ukp.dkpro.core" artifactIdBase="de.tudarmstadt.ukp.dkpro.core.maltparser"
          file="target/download/it-malt-linear.mco"
          md5="0f9c01777d1534f5716ee17254cbecfd"
          tool="parser" language="it" variant="linear" extension="mco" version="20141002.0"/>
        <delete dir="target/model-staging"/>
      </target>
    

    The example above tells Ant to download the model and build a model file that the DKPro MaltParser UIMA AE can access. It sets the language code to "it" and the model variant (relevant when there is more than one model) to "linear". Note that for MaltParser the default model variant is "linear". You can provide more than one model, but for now let's just use the default variant (as copied from the other language entries).
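The md5 attribute in the entry above has to be filled in for each new model file. Here is a minimal Java sketch for computing that hex string. The class name ModelChecksum and its method names are made up for illustration and are not part of DKPro, and the default path is simply the dest value from the entry above. How the DKPro macro uses the md5 value is not covered in this memo, so treat this only as one convenient way to obtain the checksum.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of DKPro): prints the MD5 of a model file
// so that it can be pasted into the md5 attribute of the Ant entry.
public class ModelChecksum {

    // MD5 is a mandatory algorithm on every Java platform, so a missing
    // provider is treated as a programming error rather than a checked case.
    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Returns the MD5 digest of the given bytes as a lowercase hex string. */
    public static String md5Hex(byte[] data) {
        return toHex(newMd5().digest(data));
    }

    /** Streams a file through MD5, so a large .mco file need not fit in memory. */
    public static String md5HexOfFile(Path file) throws IOException {
        MessageDigest md = newMd5();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
        }
        return toHex(md.digest());
    }

    private static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Default path is the dest value of the <get> task in the Italian entry;
        // pass another path as the first argument for a different model.
        Path model = Paths.get(args.length > 0 ? args[0] : "target/download/it-malt-linear.mco");
        System.out.println(md5HexOfFile(model));
    }
}
```

Run it against the downloaded file and compare the printed value with the md5 attribute before running the build.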

  • Run build script

    Now that the Ant build script is ready, run it with "ant [target]".

    In this case: > ant it

    If successful, Ant will report that it is done, and the newly built JAR file will be ready in the ../../target directory.
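The name of the resulting artifact appears to follow from the attributes given to install-model-file. The sketch below is inferred only from the example on this page (the attributes of the Italian entry, and the artifact name used later as the Maven dependency), not from the DKPro macro source. ModelCoordinates is a hypothetical helper name.

```java
// Hypothetical helper (not part of DKPro): shows how the attributes of the
// Ant entry seem to combine into the Maven artifactId of the model.
public class ModelCoordinates {

    /**
     * Joins artifactIdBase, tool, language and variant with the "-model-"
     * infix. ASSUMPTION: the pattern is inferred from this page's Italian
     * example only, not taken from the DKPro build macros.
     */
    public static String artifactId(String artifactIdBase, String tool,
                                    String language, String variant) {
        return artifactIdBase + "-model-" + tool + "-" + language + "-" + variant;
    }

    public static void main(String[] args) {
        // Values copied from the Italian entry in the build script.
        System.out.println(artifactId(
                "de.tudarmstadt.ukp.dkpro.core.maltparser", "parser", "it", "linear"));
        // prints de.tudarmstadt.ukp.dkpro.core.maltparser-model-parser-it-linear
    }
}
```

If you change the language or variant in the build script, expect the artifact name in the dependency to change accordingly.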

  • Now the model file is ready.

    Deploy it (on Artifactory), or install it locally. (For this Italian Malt model, I deployed it on the project Artifactory in the private-internal repository, which despite its name is actually a public repository: http://hlt-services4.fbk.eu:8080/artifactory/simple/private-internal/de/tudarmstadt/ukp/dkpro/core/ Note that all our other internally trained models are there too.)

    Then add the dependency to the project that will use this model (here, LAP):

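      <dependency>
        <groupId>de.tudarmstadt.ukp.dkpro.core</groupId>
        <artifactId>de.tudarmstadt.ukp.dkpro.core.maltparser-model-parser-it-linear</artifactId>
        <version>20141002.0</version>
        <type>pom</type>
      </dependency>

  • Finally, the model is ready for the project. Test it.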

    Testing requires making a new LAP pipeline and running it.

    For this case, I have prepared a pipeline called MaltParserIT. It is a normal LAPAccess instance. https://github.com/hltfbk/Excitement-Open-Platform/blob/master/lap/src/main/java/eu/excitementproject/eop/lap/dkpro/MaltParserIT.java

    Write a simple test and inspect its output to check that the dependency output is as expected. For this case, I have added a JUnit test case called MaltParserItTest. https://github.com/hltfbk/Excitement-Open-Platform/blob/master/lap/src/test/java/eu/excitementproject/eop/lap/dkpro/MaltParserItTest.java This is a simple test without any assertions; however, the result is written to the log, where it can be inspected. Also, if anything (e.g. the classpath of the model) is wrong, the test fails. (Note that this test is simply skipped if the TreeTagger binaries and models are not included in the LAP POM.)

    If something is wrong, an exception or error will be raised (for example, the model cannot be found). In that case, check which rules (conditions, names, classpaths) the underlying UIMA/DKPro AE uses to look for the model files.

    The parse output also needs to be checked to make sure everything works as expected (e.g. no POS tag mismatch, correct model, etc.).
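When the model cannot be found, one quick check is to ask the class loader directly whether the model file is visible. The sketch below does that. Note that the resource layout it builds (package directory, then lib/, then tool-language-variant.extension) is an assumption about how the DKPro MaltParser AE looks up its model; confirm the real lookup rule in the AE source for your DKPro version. ModelOnClasspath and its methods are hypothetical names.

```java
// Hypothetical diagnostic (not part of DKPro or EOP): checks whether a
// model resource can be seen on the current classpath.
public class ModelOnClasspath {

    /**
     * Builds a candidate resource path for a model file.
     * ASSUMPTION: the "<packageDir>/lib/<tool>-<language>-<variant>.<extension>"
     * layout is a guess at the AE's lookup rule, to be verified in its source.
     */
    public static String resourcePath(String packageDir, String tool, String language,
                                      String variant, String extension) {
        return packageDir + "/lib/" + tool + "-" + language + "-" + variant + "." + extension;
    }

    /** Returns true if the class loader can locate the given resource. */
    public static boolean isOnClasspath(String resourcePath) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        // ClassLoader.getResource expects a path without a leading slash.
        return cl.getResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        String path = resourcePath("de/tudarmstadt/ukp/dkpro/core/maltparser",
                "parser", "it", "linear", "mco");
        System.out.println(path + " visible on classpath: " + isOnClasspath(path));
    }
}
```

Run it with the same classpath as the failing test (for example from within the LAP module); a "false" result would suggest that the model JAR is missing from the dependencies or that the path rule differs from the one assumed here.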
