Training and evaluating all modules using the example corpus

Assaf Urieli edited this page Jan 23, 2018 · 5 revisions

For now, this page simply lists the commands for training and evaluating all modules using the example corpus.

Train a sentence detector

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-no-lex.conf -jar talismane-core-X.X.X.jar --train --sessionId=fr --module=sentenceDetector --inFile="examples/french/corpus/frWikiDisc_v1.1-sentence-train.txt" --logConfigFile=examples/conf/logback.xml --sentenceModel="output/models/sentenceTest1.zip"

Note: this command uses a sentence file with one sentence per line. If no such file is available, sentences can be reconstructed from a CoNLL file or equivalent by adding the following settings to the configuration file:

    sentence-detector {
      train {
        corpus-reader = com.joliciel.talismane.tokeniser.TokenRegexBasedCorpusReader
        input-pattern = ${input-pattern}
        ...
      }
    }
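The choice between the two inputs can be sketched in shell as follows. This is only an illustration: the helper name `sentence_train_input` is hypothetical and not part of Talismane, and the fallback to the CoNLL file assumes the configuration already contains the `sentence-detector.train` settings shown above.

```shell
# Hypothetical helper (not part of Talismane): choose the training input
# for the sentence detector. Prints the sentence-per-line file when it
# exists, otherwise the CoNLL file.
sentence_train_input() {
  sentence_file="$1"
  conll_file="$2"
  if [ -f "$sentence_file" ]; then
    printf '%s\n' "$sentence_file"
  elif [ -f "$conll_file" ]; then
    # Falling back to CoNLL only works if the configuration file contains
    # the sentence-detector.train settings.
    printf '%s\n' "$conll_file"
  else
    echo "no training input found" >&2
    return 1
  fi
}

# Usage with the file names from this page:
INPUT=$(sentence_train_input \
  "examples/french/corpus/frWikiDisc_v1.1-sentence-train.txt" \
  "examples/french/corpus/frWikiDisc_v1.1-train.conll") \
  || echo "check the corpus paths" >&2
```

The resulting `$INPUT` value could then be passed to `--inFile` in the training command.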

Evaluate the sentence detector

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-no-lex.conf -jar talismane-core-X.X.X.jar --evaluate --sessionId=fr --module=sentenceDetector --sentenceModel="output/models/sentenceTest1.zip" --inFile=examples/french/corpus/frWikiDisc_v1.1-sentence-test.txt --encoding=UTF8 --logConfigFile=examples/conf/logback.xml --outDir=output/eval/sentence

Train a tokeniser

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-no-lex.conf -jar talismane-core-X.X.X.jar --train --sessionId=fr --module=tokeniser --inFile="examples/french/corpus/frWikiDisc_v1.1-train.conll" --logConfigFile=examples/conf/logback.xml --tokeniserModel="output/models/tokeniserTest1.zip"

Evaluate the tokeniser

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-no-lex.conf -jar talismane-core-X.X.X.jar --evaluate --sessionId=fr --module=tokeniser --tokeniserModel="output/models/tokeniserTest1.zip" --inFile=examples/french/corpus/frWikiDisc_v1.1-test.conll --encoding=UTF8 --logConfigFile=examples/conf/logback.xml --outDir=output/eval/tokeniser

Train a pos-tagger without a lexicon

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-no-lex.conf -jar talismane-core-X.X.X.jar --train --sessionId=fr --module=posTagger --posTaggerModel=output/models/frPosTagger1.zip --inFile=examples/french/corpus/frWikiDisc_v1.1-train.conll --encoding=UTF8 --logConfigFile=examples/conf/logback.xml

Serialize a lexicon

java -Xmx1G -Dconfig.file=examples/french/conf/fr-serialize-lexicon.conf -jar talismane-core-X.X.X.jar --serializeLexicon --sessionId=fr --lexiconProps=examples/french/lexicons/lexicons_fr.txt --outFile=output/lexicons/lexicons_fr.zip

Test the lexicon

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-with-lex.conf -jar talismane-core-X.X.X.jar --testLexicon --sessionId=fr --words=à,dommage,drainer,dites,que

Note that the configuration file fr-train-eval-with-lex.conf looks for the lexicon at output/lexicons/lexicons_fr.zip, as indicated by the following key:

    lexicons = [
      "output/lexicons/lexicons_fr.zip"
    ]

If you serialized the lexicon to a different location, you need to change this configuration value. If your configuration file has no lexicons key, or if you want to override the location given in the configuration file, you can use the command-line option --lexicon, as follows:

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-with-lex.conf -jar talismane-core-X.X.X.jar --testLexicon --sessionId=fr --lexicon=other-location/lexicons_fr.zip --words=à,dommage,drainer,dites,que
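When scripting these commands, the override can be made optional. The sketch below is a minimal illustration; `lexicon_option` is a hypothetical helper name, not a Talismane feature. It prints a `--lexicon=...` option only when an override path is given and the file exists, and prints nothing otherwise, on the assumption that Talismane then uses the `lexicons` key from the configuration file.

```shell
# Hypothetical helper (not part of Talismane): build the --lexicon option.
# With no argument it prints nothing, so the location from the `lexicons`
# key in the configuration file is assumed to apply.
lexicon_option() {
  override="$1"
  if [ -z "$override" ]; then
    return 0
  fi
  if [ ! -f "$override" ]; then
    echo "lexicon not found: $override" >&2
    return 1
  fi
  printf '%s\n' "--lexicon=$override"
}
```

For example, `lexicon_option other-location/lexicons_fr.zip` fails early with a message on standard error if the serialized lexicon is missing, instead of leaving the problem to surface inside the Java process.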

Train a pos-tagger with the lexicon

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-with-lex.conf -jar talismane-core-X.X.X.jar --train --sessionId=fr --module=posTagger --posTaggerModel=output/models/frPosTaggerLex1.zip --inFile=examples/french/corpus/frWikiDisc_v1.1-train.conll --encoding=UTF8 --lexicon=output/lexicons/lexicons_fr.zip --logConfigFile=examples/conf/logback.xml

Evaluate the pos-tagger with the lexicon

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-with-lex.conf -jar talismane-core-X.X.X.jar --evaluate --sessionId=fr --module=posTagger --posTaggerModel=output/models/frPosTaggerLex1.zip --inFile=examples/french/corpus/frWikiDisc_v1.1-test.conll --encoding=UTF8 --lexicon=output/lexicons/lexicons_fr.zip --logConfigFile=examples/conf/logback.xml --outDir=output/eval/posTagger --suffix=_lex1

Train a parser with the lexicon

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-with-lex.conf -jar talismane-core-X.X.X.jar --train --sessionId=fr --module=parser --parserModel=output/models/frParserLex1.zip --inFile=examples/french/corpus/frWikiDisc_v1.1-train.conll --encoding=UTF8 --lexicon=output/lexicons/lexicons_fr.zip --logConfigFile=examples/conf/logback.xml

Evaluate the parser with the lexicon

java -Xmx1G -Dconfig.file=examples/french/conf/fr-train-eval-with-lex.conf -jar talismane-core-X.X.X.jar --evaluate --sessionId=fr --module=parser --parserModel=output/models/frParserLex1.zip --inFile=examples/french/corpus/frWikiDisc_v1.1-test.conll --encoding=UTF8 --logConfigFile=examples/conf/logback.xml --outDir=output/eval/parser --suffix=_lex1 --lexicon=output/lexicons/lexicons_fr.zip
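The training and evaluation commands on this page share one pattern: each module has its own model option (`--sentenceModel`, `--tokeniserModel`, `--posTaggerModel`, `--parserModel`), and the remaining options are largely the same. The following shell sketch captures that pattern. The function names and the `CONF` and `JAR` variables are hypothetical, the jar version is left as `X.X.X` as elsewhere on this page, and the helper only prints the command rather than running it.

```shell
CONF="examples/french/conf/fr-train-eval-with-lex.conf"
JAR="talismane-core-X.X.X.jar"   # version left elided, as on this page

# Map a module name to the model option used in the commands on this page.
model_option() {
  case "$1" in
    sentenceDetector) echo "--sentenceModel" ;;
    tokeniser)        echo "--tokeniserModel" ;;
    posTagger)        echo "--posTaggerModel" ;;
    parser)           echo "--parserModel" ;;
    *) echo "unknown module: $1" >&2; return 1 ;;
  esac
}

# Print (do not run) a command for the given mode (--train or --evaluate).
# Arguments after the fourth are appended unchanged,
# e.g. --encoding=UTF8, --lexicon=... or --outDir=...
talismane_command() {
  if [ "$#" -lt 4 ]; then
    echo "usage: talismane_command MODE MODULE MODEL CORPUS [OPTIONS...]" >&2
    return 1
  fi
  mode="$1"; module="$2"; model="$3"; corpus="$4"
  shift 4
  opt=$(model_option "$module") || return 1
  printf '%s' "java -Xmx1G -Dconfig.file=$CONF -jar $JAR $mode --sessionId=fr --module=$module $opt=$model --inFile=$corpus --logConfigFile=examples/conf/logback.xml"
  for extra in "$@"; do
    printf ' %s' "$extra"
  done
  printf '\n'
}

# Example: print the parser training command from this page.
talismane_command --train parser output/models/frParserLex1.zip \
  examples/french/corpus/frWikiDisc_v1.1-train.conll \
  --encoding=UTF8 --lexicon=output/lexicons/lexicons_fr.zip
```

Printing the command first makes it easy to compare against the listings above before executing anything.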