Step by Step, Code Examples

Roberto Zanoli edited this page Dec 22, 2014 · 1 revision

Running the experiments in this section requires that TreeTagger has already been installed as described in the Step by Step Tutorial: the experiments use resources such as WordNet and VerbOcean, which need the lemmas of the words in the T/H pairs as produced by TreeTagger. These Java code examples follow the material used in the Fall School class on Textual Entailment in Heidelberg; see its web site for further information and code updates: http://fallschool2013.cl.uni-heidelberg.de/. A [basic Maven project](http://hlt-services4.fbk.eu:8080/artifactory/simple/private-internal/eu/excitementproject/eop-resources/java_examples/myProject-EOPv1.1.1.tar.gz) containing the Hello World example as well as the examples in Appendix B is also provided.

  1. [Example: Preprocessing data sets](#Example Preprocessing data sets)
  2. [Example: Training new models](#Example Training new models)
  3. [Example: Annotating by using pre-trained models](#Example Annotating by using pre-trained models)

1. Example: Preprocessing data sets

Below is a Java class consisting of four code fragments (ex1_1(), ex1_2(), ex1_3() and ex1_4()). Work through the fragments one at a time: uncomment the call to a fragment in main(), run it, and follow the code comments to understand what happens. As usual, we will use Eclipse to write and run the code.

  1. Create a new Java class with Eclipse and name it Ex1.
  2. Copy the following code into the new class.
  3. Make sure that the path of the English data set (i.e. English_dev.xml) in the code matches the location of that data set on your file system, e.g.
File f = new File("/home/user_name/programs/eop-resources-1.2.0/data-set/English_dev.xml");
  4. The class uses some functionality of the class CASAccessUtilities provided in Appendix C. Create a new class named CASAccessUtilities in your project and copy the content of the class in Appendix C into it.
  5. Navigate to Ex1.java > Run As > Java Application to run the code.
import java.io.File;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.uima.cas.CASException;
import org.apache.uima.jcas.JCas;
import org.uimafit.util.JCasUtil;

//import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Lemma;
import de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token;
import de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.Dependency;

import eu.excitementproject.eop.lap.LAPAccess;
import eu.excitementproject.eop.lap.LAPException;
import eu.excitementproject.eop.lap.PlatformCASProber;
import eu.excitementproject.eop.lap.dkpro.MaltParserEN;
import eu.excitementproject.eop.lap.dkpro.TreeTaggerEN;

/**
* This heavily commented code introduces the Linguistic Analysis Pipeline
* (LAP) of EXCITEMENT open platform. Check EX1 exercise sheet first, and proceed
* with this example code.
*/
public class Ex1 {

        public static void main(String[] args) {
                
            // init logs
            BasicConfigurator.resetConfiguration();
            BasicConfigurator.configure();
            Logger.getRootLogger().setLevel(Level.WARN);

            // Uncomment the following method calls one by one;
            // run each one and read through its code.
            // ex1_1();
            // ex1_2();
            // ex1_3();
            // ex1_4();
        }
        

        /**
         * ex1_1();
         * This code introduces LAPAccess.generateSingleTHPairCAS
         * [URL]
         */
        public static void ex1_1() {
                // Each and every LAP in EXCITEMENT Open Platform (EOP)
                // implements the interface LAPAccess.
                // Here, let's use the TreeTagger based LAP.
                LAPAccess aLap = null;
                try {
                        aLap = new TreeTaggerEN();
                } catch (LAPException e)
                {
                        System.out.println("Unable to initialize TreeTagger LAP: " + e.getMessage());
                        return; // nothing below can run without a LAP
                }
                
                // LAPs (all implement LAPAccess) basically support 3 types of common methods.
        
                // First interface: generateSingleTHPair
                // LAPs can generate a specific data format that can be
                // accepted by EOP Entailment Decision Algorithms (EDAs). This is supported
                // by the LAPAccess.generateSingleTHPairCAS.
                
                JCas aJCas = null;
                try {
                        aJCas = aLap.generateSingleTHPairCAS("This is the Text part.", "The Hypothesis comes here.");
                } catch (LAPException e)
                {
                        System.out.println("Unable to run TreeTagger LAP: " + e.getMessage());
                        return; // there is no CAS to inspect
                }
                
                // All LAP output is stored in a data type called CAS.
                // This data type is borrowed from Apache UIMA: for the moment, just think
                // of it as a data type that can hold any annotation data. One way to see
                // it is as a "smarter" version of the CoNLL format; just much more flexible, and
                // unlike CoNLL, this is an in-memory format.
                
                // Take a look at a CAS figure; to see how it stores data of a T-H pair.
                // Figure URL: http://hltfbk.github.io/Excitement-Open-Platform/specification/spec-1.1.3.html#CAS_example

                // Here, let's briefly check what is stored in this actual aJCas.
                // Say, how it is annotated?
                try {
                        // This command checks CAS data, and checks if it is compatible for the EDAs
                        PlatformCASProber.probeCas(aJCas, System.out);
                        // the following command dumps all annotations to text file.
                        CASAccessUtilities.dumpJCasToTextFile(aJCas, "test_dump1.txt");
                        System.out.println("test_dump1.txt file dumped.");
                } catch (LAPException e)
                {
                        System.out.println("Failed to dump CAS data: " + e.getMessage());                                                 
                }
                // TODO Task1_1 check out this file, in Excitement-Open-Platform/fallschool/test_dump1.txt         
                
                System.out.println("method ex1_1() finished");
        }
        
        /**
         * ex1_2()
         * This code introduces LAPAccess.processRawInputFormat
         */
        public static void ex1_2()
        {
                // LAPs also support file based mass pre-processing.
                // As an example let's process RTE3 English data with TreeTagger LAP.

                // Initialize an LAP, here it's TreeTagger
                LAPAccess ttLap = null;
                try {
                        ttLap = new TreeTaggerEN();
                } catch (LAPException e)
                {
                        System.out.println("Unable to initialize TreeTagger LAP: " + e.getMessage());
                        return; // nothing below can run without a LAP
                }
                
                // Prepare input file, and output directory.
                File f = new File("/home/user_name/programs/eop-resources-1.2.0/data-set/English_dev.xml");
                File outputDir = new File("/tmp/EN/dev/");
                
                // Call LAP method for file processing.
                // This takes some time. RTE data has 800 cases in it.
                // Each case is first annotated as a CAS and then
                // serialized into one XMI file.
                try {
                        ttLap.processRawInputFormat(f, outputDir);
                } catch (LAPException e)
                {
                        System.out.println("Failed to process EOP RTE data format: " + e.getMessage());                                                 
                }
        
                // TODO Task1_2: now all RTE3 training data is annotated and stored in
                // output dir ( /tmp/EN/dev/ )
                // a. Check the files are really there.
                // b. Open one XMI file to get an impression of how the CAS content is
                // stored in an XML-based file.
                System.out.println("method ex1_2() finished");
        }
        
        /**
         * ex1_3()
         * This code introduces LAPAccess.addAnnotationOn
         */
        public static void ex1_3()
        {
                // The previous two methods generate "Pair data stored in CAS" (or XMI files),
                // including the Entailment Pair annotation, and so on.
                
                // But what if you simply want to annotate a sentence or something
                // like that? E.g. no Entailment pair, just a single text document annotation.
                
                // Every LAP has an addAnnotationOn() method that gives you this capability.
                // The following code shows how to use it.
                
                // first, prepare Malt parser based LAP
                LAPAccess malt = null;
                try {
                        malt = new MaltParserEN();
                } catch (LAPException e)
                {
                        System.out.println("Unable to initialize MaltParser (with TreeTagger) LAP: " + e.getMessage());
                        return; // nothing below can run without a LAP
                }
                
                // and let's annotate something.
                try {
                        // get one empty CAS.
                        JCas aJCas = CASAccessUtilities.createNewJCas();
                        
                        // Before asking LAP to process, you have to set at least two things.
                        // One is language, and the other is document itself.
                        aJCas.setDocumentLanguage("EN"); // ISO 639-1 language code.
                        String doc = "This is a document. You can pass an arbitrary document to CAS and let LAP work on it.";
                        aJCas.setDocumentText(doc);
                        malt.addAnnotationOn(aJCas);
                } catch (LAPException e)
                {
                        System.out.println("Failed to annotate the document: " + e.getMessage());
                }
                
                // Malt parser annotates the given aJCas document text.
                // But here, there is no Pair, no TEXTVIEW, or HYPOTHESISVIEW.
                
                // TODO Task1_3 Dump the Malt parser result to a text file.
                // Check how the CAS stores the dependency parser result.
                // (use CASAccessUtilities.dumpJCasToTextFile())
        }
        
        /**
         * ex1_4()
         * This code introduces how you can iterate over added annotations
         * within a JCas.
         */
        public static void ex1_4()
        {
                // So far, so good. But how can we access annotation results
                // stored in a JCas? You can iterate over them, as follows.

                // First, prepare LAP and process a T-H pair.
                LAPAccess malt = null;
                JCas aJCas = null;
                try {
                        malt = new MaltParserEN();
                        aJCas = malt.generateSingleTHPairCAS("We thought that there were many cats in this garden.", "But there was only one cat, among all the gardens in the city.");
                } catch (LAPException e)
                {
                        System.out.println("Unable to initialize or run MaltParser (with TreeTagger) LAP: " + e.getMessage());
                        return; // there is no CAS to iterate over
                }
                
                // aJCas now holds a T-H pair.
                // Here, let's iterate over the Tokens on Text side.
                try {
                        JCas textView = aJCas.getView("TextView");
                        System.out.println("Listing tokens of TextView.");
                        for (Token tok : JCasUtil.select(textView, Token.class))
                        {
                                String s = tok.getCoveredText(); // getCoveredText() returns the span of document text that this annotation covers.
                                int begin = tok.getBegin();
                                int end = tok.getEnd();
                                System.out.println(begin + "-" + end + " " + s);                 
                        }
                } catch (CASException e)
                {
                        System.out.println("Exception while accessing TextView of CAS: " + e.getMessage());
                }

                // And here, let's iterate over the dependency edges on the Hypothesis side.
                try {
                        JCas hypothesisView = aJCas.getView("HypothesisView");
                        for (Dependency dep : JCasUtil.select(hypothesisView, Dependency.class)) {

                                // One Dependency annotation holds the information for a dependency edge.
                                // Basically, 3 things;
                                // It holds "Governor (points to a Token)", "Dependent (also to a Token)",
                                // and relationship between them (as a string)
                                Token dependent = dep.getDependent();
                                Token governor = dep.getGovernor();
                                String dTypeStr = dep.getDependencyType();

                                // let's print them with full token information (lemma, pos, loc)
                                // info for the dependent ...
                                int dBegin = dependent.getBegin();
                                int dEnd = dependent.getEnd();
                                String dTokenStr = dependent.getCoveredText();
                                String dLemmaStr = dependent.getLemma().getValue();
                                String dPosStr = dependent.getPos().getPosValue();

                                // info for the governor ...
                                int gBegin = governor.getBegin();
                                int gEnd = governor.getEnd();
                                String gTokenStr = governor.getCoveredText();
                                String gLemmaStr = governor.getLemma().getValue();
                                String gPosStr = governor.getPos().getPosValue();

                                // and finally print the edge with full info
                                System.out.println(dBegin + "-" + dEnd + " " + dTokenStr + "/" + dLemmaStr + "/" + dPosStr);
                                System.out.println("\t ---"+ dTypeStr + " --> ");
                                System.out.println("\t " + gBegin + "-" + gEnd + " " + gTokenStr + "/" + gLemmaStr + "/" + gPosStr);
                                }
                } catch (CASException e)
                {
                        System.out.println("Exception while accessing HypothesisView of CAS: " + e.getMessage());
                }                

                // TODO [Optional Task] Task 1_4
                // ( This is an optional task: you can skip without affecting later exercise)
                //
                // Try to print out the above T-H pair as two bags of lemmas.
                //
                // You can iterate over Lemma type (you will need to import Lemma class),
                // or, you can iterate over Tokens, and use Token.getLemma() to fetch Lemmas.         
                // Then, you can access Lemma value, by calling Lemma.getValue();
                
                System.out.println("ex1_4() method finished");
                
        }
}
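
For the optional Task 1_4, the counting step can be sketched in plain Java, independently of UIMA. This is only a minimal sketch: the class name LemmaBagSketch and the method toBag are made up for illustration, and the lemma lists in main() are hand-written stand-ins, not actual TreeTagger output. In Ex1 the strings would instead be collected from tok.getLemma().getValue() while iterating over the tokens of each view.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class LemmaBagSketch {

    // Builds a bag (multiset) of lemmas: lemma -> number of occurrences.
    // Lemmas are lower-cased so that "Cat" and "cat" count as the same entry.
    static Map<String, Integer> toBag(List<String> lemmas) {
        Map<String, Integer> bag = new TreeMap<>(); // sorted keys give a stable print order
        for (String lemma : lemmas) {
            bag.merge(lemma.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return bag;
    }

    public static void main(String[] args) {
        // Hypothetical lemma values, written by hand for illustration only;
        // in Ex1 they would come from tok.getLemma().getValue() on each view.
        List<String> textLemmas = Arrays.asList(
                "we", "think", "that", "there", "be", "many", "cat", "in", "this", "garden");
        List<String> hypothesisLemmas = Arrays.asList(
                "but", "there", "be", "only", "one", "cat", "among", "all", "the", "garden",
                "in", "the", "city");

        System.out.println("Text bag:       " + toBag(textLemmas));
        System.out.println("Hypothesis bag: " + toBag(hypothesisLemmas));
    }
}
```

A map from lemma to count keeps repeated lemmas visible (for example "the" in the hypothesis), which a plain set would lose.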

2. Example: Training new models

This example shows how to train a new model on the RTE-3 data set.

  1. Create a new Java class with Eclipse and name it Ex2.
  2. Copy the following code into the newly created class.
  3. Make sure that the path of the English data set (i.e. English_dev.xml) in the code matches the location of that data set on your file system, e.g.
File f = new File("/home/user_name/programs/eop-resources-1.2.0/data-set/English_dev.xml");
  4. Make sure that the path of the EDA configuration file (i.e. MaxEntClassificationEDA_Base+WN+VO_EN.xml) in the code matches the location of that configuration file on your file system, e.g.
File configFile = new File("/home/user_name/programs/eop-resources-1.2.0/configuration-files/MaxEntClassificationEDA_Base+WN+VO_EN.xml");
  5. The data set is processed by the selected LAP and the output is written to this directory:
File outputDir = new File("/tmp/EN/dev/"); // as written in configuration!
     For the selected EDA to read the pre-processed data set, the EDA configuration file must give that same directory as the location of the training data.
  6. Navigate to Ex2.java > Run As > Java Application to run the code.

import java.io.File;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
//import org.apache.uima.jcas.JCas;

//import eu.excitementproject.eop.common.DecisionLabel;
import eu.excitementproject.eop.common.EDABasic;
import eu.excitementproject.eop.common.EDAException;
//import eu.excitementproject.eop.common.TEDecision;
import eu.excitementproject.eop.common.configuration.CommonConfig;
import eu.excitementproject.eop.common.exception.ComponentException;
import eu.excitementproject.eop.common.exception.ConfigurationException;
import eu.excitementproject.eop.common.utilities.configuration.ImplCommonConfig;
import eu.excitementproject.eop.core.MaxEntClassificationEDA;
import eu.excitementproject.eop.lap.LAPAccess;
import eu.excitementproject.eop.lap.LAPException;
//import eu.excitementproject.eop.lap.dkpro.MaltParserEN;
import eu.excitementproject.eop.lap.dkpro.TreeTaggerEN;

/**
* This example code shows how you can train a new model on a new data set. We
* will use external resources like WordNet and VerbOcean.
*/

public class Ex2 {

        public static void main(String[] args) {

            // init logs
            BasicConfigurator.resetConfiguration();
            BasicConfigurator.configure();
            Logger.getRootLogger().setLevel(Level.INFO);

            ex2_1(); // start_training() of EDA
        }

        /**
         * This method shows how to train an EDA with a given configuration and training data.
         * It could take several minutes.
         */
        public static void ex2_1()
        {
                // The other mode of the EDA is training mode. Let's check how this is done
                // with one training example.

                // Training also requires the configuration file.
                // We will load a configuration file first.                 
                CommonConfig config = null;
                try {
                        // The configuration uses WordNet and VerbOcean as external resources.
                        // Make sure that the path of the model in the following configuration files
                        // refers to the directory: /home/user_name/programs/eop-resources-1.2.0/model/
                        File configFile = new File("/home/user_name/programs/eop-resources-1.2.0/configuration-files/MaxEntClassificationEDA_Base+WN+VO_EN.xml");
                        config = new ImplCommonConfig(configFile);
                }
                catch (ConfigurationException e)
                {
                        System.out.println("Failed to read configuration file: "+ e.getMessage());
                        System.exit(1);
                }

                // TODO task ex2_1_a
                // Check the above configuration XML file by opening and reading it.
                // Check the following values under the section
                // "eu.excitementproject.eop.core.MaxEntClassificationEDA" (last section).
                // modelFile: the new model will be generated here.
                // trainDir: the configuration expects here pre-processed RTE training data as a set of XMI Files.
                // Where will the new model be generated? Where does the configuration
                // expect to read the pre-processed training data from?
                // Also check the first section:
                // Which LAP does it require? (top section, "activatedLAP")

                // WARNING: each EDA has a different training procedure.
                // Other EDAs, like BIUTEE, might expect different parameters
                // for training. Consult the EDA-specific documentation
                // to check this.

                // Before calling start_training() we have to provide
                // pre-processed training data. This EDA will train itself with
                // the provided data that is pointed by trainDir.

                try {
                        LAPAccess ttLap = new TreeTaggerEN();
                        // Prepare input file, and output directory.
                        File f = new File("/home/user_name/programs/eop-resources-1.2.0/data-set/English_dev.xml");
                        File outputDir = new File("/tmp/EN/dev/"); // as written in configuration!
                        if (!outputDir.exists())
                        {
                                outputDir.mkdirs();
                        }
                        ttLap.processRawInputFormat(f, outputDir);
                } catch (LAPException e)
                {
                        System.out.println("Training data annotation failed: " + e.getMessage());                         
                        System.exit(1);
                }

                // Okay, now RTE3 data are all tagged and stored in the
                // trainDir. Let's ask EDA to train itself.
                try {
                        @SuppressWarnings("rawtypes")
                        EDABasic eda = null;
                        eda = new MaxEntClassificationEDA();
                        eda.startTraining(config); // This may take some time.
                }
                catch (EDAException e)
                {
                        System.out.println("Failed to do the training: "+ e.getMessage());
                        System.exit(1);
                }   
                catch (ConfigurationException e)
                {
                        System.out.println("Failed to do the training: "+ e.getMessage());
                        System.exit(1);
                }   
                catch (ComponentException e)
                {
                        System.out.println("Failed to do the training: "+ e.getMessage());
                        System.exit(1);
                }   

                System.out.print("Training completed.");
        }                
}
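
Since training reads whatever the LAP left in the output directory, it can help to confirm that the pre-processing step actually produced files there before calling startTraining() (this is also part (a) of Task1_2). The following is a minimal sketch using only the Java standard library. The class name XmiOutputCheck and the method countXmiFiles are hypothetical, and the ".xmi" file extension is an assumption about how the serialized CAS files are named; adjust the pattern if your output files are named differently.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class XmiOutputCheck {

    // Counts regular files whose name ends in ".xmi" directly inside dir.
    // Returns 0 if dir does not exist or is not a directory.
    static int countXmiFiles(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return 0;
        }
        int count = 0;
        // The glob "*.xmi" is an assumption about the output file naming.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.xmi")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the output directory used in Ex1 and Ex2.
        Path dir = Paths.get(args.length > 0 ? args[0] : "/tmp/EN/dev/");
        System.out.println(countXmiFiles(dir) + " XMI file(s) found in " + dir);
    }
}
```

If the count is zero, the directory named in the configuration file and the outputDir used in the code most likely point to different places.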

3. Example: Annotating by using pre-trained models

  1. Create a new Java class with Eclipse and name it Ex3.
  2. Copy the following code into the newly created class and navigate to Ex3.java > Run As > Java Application to run the code.
import java.io.File;

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.uima.jcas.JCas;

import eu.excitementproject.eop.common.DecisionLabel;
import eu.excitementproject.eop.common.EDABasic;
import eu.excitementproject.eop.common.EDAException;
import eu.excitementproject.eop.common.TEDecision;
import eu.excitementproject.eop.common.configuration.CommonConfig;
import eu.excitementproject.eop.common.exception.ComponentException;
import eu.excitementproject.eop.common.exception.ConfigurationException;
import eu.excitementproject.eop.common.utilities.configuration.ImplCommonConfig;
import eu.excitementproject.eop.core.MaxEntClassificationEDA;
import eu.excitementproject.eop.lap.LAPAccess;
import eu.excitementproject.eop.lap.LAPException;
//import eu.excitementproject.eop.lap.dkpro.MaltParserEN;
import eu.excitementproject.eop.lap.dkpro.TreeTaggerEN;

/**
* This example code shows how you can initiate and use an EDA to annotate entailment
* relations by using the model created in the section above.
*/

public class Ex3 {

        public static void main(String[] args) {

            // init logs
            BasicConfigurator.resetConfiguration();
            BasicConfigurator.configure();
            Logger.getRootLogger().setLevel(Level.INFO);

            // Read and run each of the code sections, one by one.
            
            ex3_1(); // initialize() and process() of EDA
        
        }
        
        /**
         * This method shows initializing an EDA with one existing (already trained) model.
         */
        public static void ex3_1() {

                // All EDAs implement the EDABasic interface. Here, we will visit
                // the "process mode" of an EDA.
            
                ///////
                /// Step #1: initialize an EDA
                ///////
                // First we need an instance of an EDA. We will use a TIE instance.
                // (MaxEntClassificationEDA)
                @SuppressWarnings("rawtypes") // why this? will be explained later.
                EDABasic eda = null;
                try {
                        eda = new MaxEntClassificationEDA();
                        // To start "process mode" we need to initialize the EDA.
                        // We have some TIE configurations in configuration-files
                        // let's use "lexical one": MaxEntClassificationEDA_Base+WN+VO_EN.xml
                        // it uses external resources like WordNet and VerbOcean
                        File configFile = new File("/home/user_name/programs/eop-resources-1.2.0/configuration-files/MaxEntClassificationEDA_Base+WN+VO_EN.xml");

                        CommonConfig config = new ImplCommonConfig(configFile);
                        eda.initialize(config);
                }
                catch (EDAException e)
                {
                        System.out.println("Failed to init the EDA: "+ e.getMessage());
                        System.exit(1);
                }
                catch (ConfigurationException e)
                {
                        System.out.println("Failed to init the EDA: "+ e.getMessage());
                        System.exit(1); 
                }
                catch (ComponentException e)
                {
                        System.out.println("Failed to init the EDA: "+ e.getMessage());
                        System.exit(1);
                }    
                
                // Okay the EDA is ready. Let's prepare one T-H pair and use it.
                // simple Text and Hypothesis.
                // Note that (as written in the configuration file), current configuration
                // needs TreeTaggerEN Annotations
                String text = "The sale was made to pay Yukos' US$ 27.5 billion tax bill, Yuganskneftegaz was originally sold for US$ 9.4 billion to a little known company Baikalfinansgroup which was later bought by the Russian state-owned oil company Rosneft.";
               String hypothesis = "Baikalfinansgroup was sold to Rosneft.";
        
               JCas thPair = null;
               try {
                        LAPAccess lap = new TreeTaggerEN();
                        thPair = lap.generateSingleTHPairCAS(text, hypothesis); // ask it to process this T-H.
               } catch (LAPException e)
               {
                        System.err.println("LAP annotation failed: " + e.getMessage());
                        System.exit(1);
               }
        
               // Now the pair is ready in the CAS. call process() method to get
               // Entailment decision.
        
               // Entailment decisions are represented with "TEDecision" class.
               TEDecision decision = null;
               try {
                        decision = eda.process(thPair);
               } catch (EDAException e)
               {
                        System.err.println("EDA reported exception: " + e.getMessage());
                        System.exit(1);
               } catch (ComponentException e)
               {
                        System.err.println("EDA reported exception: " + e.getMessage());
                        System.exit(1);
               }
        
               // And let's look at the result.
               DecisionLabel r = decision.getDecision();
               System.out.println("The result is: " + r.toString());
        
               // and you can call process() as many times as you like.
               // ...
               // once all is done, we can call this.
               eda.shutdown();
        }
}
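
Ex2 and Ex3 both hard-code absolute paths, and a wrong path only surfaces as an exception after the EDA or LAP has started to initialize. A small pre-flight check, sketched below with the standard library only, can report missing paths up front. The class name PathPreflight and the method missing are hypothetical helpers, not part of EOP; the two paths in main() are the example paths used on this page and must be adapted to your installation.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class PathPreflight {

    // Returns the subset of the given paths that do not exist on this machine.
    static List<String> missing(String... paths) {
        List<String> result = new ArrayList<>();
        for (String p : paths) {
            if (!new File(p).exists()) {
                result.add(p);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Example paths from this page; replace user_name with your own account.
        List<String> notFound = missing(
                "/home/user_name/programs/eop-resources-1.2.0/configuration-files/MaxEntClassificationEDA_Base+WN+VO_EN.xml",
                "/home/user_name/programs/eop-resources-1.2.0/data-set/English_dev.xml");

        if (notFound.isEmpty()) {
            System.out.println("All paths found.");
        } else {
            for (String p : notFound) {
                System.out.println("Missing: " + p);
            }
        }
    }
}
```

Calling such a check at the top of main() in Ex2 or Ex3 would let the program stop with a clear message before any model or resource is loaded.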