NemexClassification

The Nemex classification Entailment Decision Algorithm (EDA) makes use of explicit phrase-level alignments to decide entailment relations between pairs of English texts. These phrase-level alignments are generated using pre-trained word embeddings and a tool called NemexA, which calculates the similarity between pairs of multi-word terms through an n-gram vector match.
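
To illustrate the kind of n-gram vector match used for approximate lookup, the sketch below builds character trigram count vectors for two multi-word terms and compares them with cosine similarity. The class and method names are purely illustrative; this is not the NemexA API.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: approximate matching of two multi-word terms by
// comparing their character 3-gram count vectors with cosine similarity.
public class NGramCosineSketch {

    // Count character n-grams of a term (lower-cased).
    static Map<String, Integer> nGramCounts(String term, int n) {
        Map<String, Integer> counts = new HashMap<>();
        String s = term.toLowerCase();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two sparse count vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double sim = cosine(nGramCounts("heart attack", 3), nGramCounts("heart attacks", 3));
        // An alignment link would only be added if the similarity exceeds the
        // configured threshold (e.g. simThresholdAlignLookup = 0.8 in Appendix A).
        System.out.println("similarity = " + sim + ", aligned = " + (sim >= 0.8));
    }
}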

Different levels of alignment are performed, including approximate and semantic matching. Words, lemmas and chunks of the text and hypothesis are aligned if the similarity between them is greater than a certain threshold. This similarity is calculated through an API that supports cosine, Dice and Jaccard similarity metrics, and an overlap score is then calculated based on the fraction of aligned segments. Similarly, semantic similarity between chunks is calculated using composite word embeddings: if the cosine similarity between the vectors of two chunks is greater than a specified threshold, an alignment link is added between them. However, if any pair of words among these chunks holds an antonym relation (identified via WordNet), or if a word in the T chunk is the opposite of, or greater in strength than, a word in the H chunk (identified via VerbOcean), the chunks are marked as negative alignments with respect to the entailment decision task. A score is calculated to indicate the fraction of positively and negatively aligned chunks. Further, a score is calculated based on the frequency of negation words in the text and hypothesis. Some scores are also added depending on the task a (T, H) pair has been derived from.
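
As a simplified illustration of this chunk-level logic, the sketch below aligns precomputed chunk vectors with a cosine threshold and flips an alignment to negative when a word pair across the two chunks is found in a plain antonym list. The class names, the antonym lookup and the final scoring formula are assumptions made for the example; they stand in for the actual WordNet/VerbOcean queries and the EOP scoring code.

import java.util.List;
import java.util.Set;

// Simplified sketch of chunk alignment and scoring. Chunk vectors are assumed
// to be precomputed from word embeddings; antonymPairs ("word1#word2") stands
// in for the WordNet/VerbOcean lookups. Not the actual EOP implementation.
public class ChunkAlignmentSketch {

    // A chunk with its composite embedding and its surface words.
    record Chunk(double[] vector, Set<String> words) {}

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // One possible reading of the "positively and negatively aligned chunks"
    // score: fraction of H chunks with a positive alignment minus the fraction
    // with a negative (antonym/opposite) alignment.
    static double score(List<Chunk> tChunks, List<Chunk> hChunks,
                        double threshold, Set<String> antonymPairs) {
        int positive = 0, negative = 0;
        for (Chunk h : hChunks) {
            for (Chunk t : tChunks) {
                if (cosine(t.vector(), h.vector()) < threshold) continue;
                boolean negated = t.words().stream().anyMatch(
                        tw -> h.words().stream().anyMatch(
                                hw -> antonymPairs.contains(tw + "#" + hw)));
                if (negated) negative++; else positive++;
                break; // count each H chunk at most once
            }
        }
        return (positive - negative) / (double) hChunks.size();
    }
}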

Available Components

Different scoring components are available as part of the EDA. The user can decide which components to use for classification. These scoring components query the respective aligners to generate a relevant score.

  • NemexBagOfWordsScoring
  • NemexBagOfLemmasScoring
  • NemexBagOfChunksScoring
  • BagOfChunkVectorScoring
  • NegationScoring

User guide for BagOfChunkVectorScoring

To use the component BagOfChunkVectorScoring, the vectors for chunks need to be calculated separately before running the rest of the pipeline.

If you are using the RTE3, RTE6 or SNLI data sets, these vectors have already been calculated and saved in the directory chunkVectors within the eop-resources directory. Otherwise, the word embeddings pre-trained on the Google News corpus using word2vec need to be downloaded first. The downloaded file needs to be extracted to the directory vectorModel within the eop-resources directory that comes with the software distribution.
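
The chunk vectors themselves are composite word embeddings. One common way to compose such a vector, shown as a hypothetical sketch below, is to average the pre-trained word vectors of the chunk's tokens; the actual ChunkVectorizer may compose vectors differently.

import java.util.List;
import java.util.Map;

// Hypothetical sketch: build a composite chunk vector by averaging the
// pre-trained vectors of the chunk's tokens, skipping out-of-vocabulary
// tokens. The actual ChunkVectorizer implementation may differ.
public class ChunkVectorSketch {

    static double[] chunkVector(List<String> tokens, Map<String, double[]> wordVectors, int dim) {
        double[] sum = new double[dim];
        int found = 0;
        for (String token : tokens) {
            double[] v = wordVectors.get(token.toLowerCase());
            if (v == null) continue; // token not in the embedding model
            for (int i = 0; i < dim; i++) sum[i] += v[i];
            found++;
        }
        if (found > 0) {
            for (int i = 0; i < dim; i++) sum[i] /= found;
        }
        return sum;
    }
}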

Next, the pre-trained opennlp model for chunking English text, en-chunker.bin, needs to be downloaded from the web address: http://opennlp.sourceforge.net/models-1.5/ . Then the following command needs to be executed (replace <path-to-chunker-model> with the path to the downloaded model, for example ./eop-resources-1.2.4/en-chunker.bin):

java -Djava.ext.dirs=../EOP-1.2.4-SNAPSHOT/ eu.excitementproject.eop.util.vectorizer.chunk.ChunkVectorizer ./eop-resources-1.2.4/data-set/data.xml /tmp/EN/dev/ ./eop-resources-1.2.4/chunkVectors/data.txt <path-to-chunker-model> google ./eop-resources-1.2.4/vectorModel/GoogleNews-vectors-negative300.bin ./eop-resources-1.2.4/external-data/ignorePosTags.txt

where

  • data.xml is the data set for which chunk vectors need to be generated,
  • /tmp/EN/dev/ is the folder where the XMI files for the pairs in this data set will be written (this directory must be created before executing the command), and
  • data.txt is the name of the resulting file that stores the vectors for all chunks in this data set.

The version numbers of the EOP installation and the eop-resources directory need to be updated to match the current EOP distribution version.

Configuration File

Configuration files are used to create and use a particular instance of an EDA. They make it possible, for example, to specify which resources the EDA should use, the data set to be annotated, and so on. The standard configuration file provided contains all configurable parameter values, which can be modified according to the user's requirements. The fields present in the configuration file are explained in the comments in the file.

A sample configuration file is given in Appendix A. One important update that needs to be made to this configuration file before use: download the files en-chunker.bin and en-ner-person.bin and add their paths to the fields chunkerModelPath and personNameModelPath respectively, if those components are being used.

Appendix A: NemexWekaClassificationEDA_SNLI_AllFeats_EN.xml


<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE configuration [
<!ENTITY myVar "Some common #PCDATA that can be reused... ">
]>

<configuration>

	<!-- This section specifies the EDA to use, the language and the required Linguistic Analysis Pipeline. Currently, this EDA is only supported for the English language. This EDA uses lemma annotations during processing, which are added by the MaltParser pipeline. For this pipeline to work, TreeTagger needs to be installed beforehand. -->
	<section name="PlatformConfiguration">
		<property name="activatedEDA">eu.excitementproject.eop.core.NemexWekaClassificationEDA</property>
		<property name="language">EN</property>
		<property name="activatedLAP">eu.excitementproject.eop.lap.dkpro.MaltParserEN</property>
	</section>
	
	<!-- Configurable parameters for using the component NemexBagOfWordsScoring -->
	<section name="NemexBagOfWordsScoring">

		<!-- If stopwords should be removed before alignment -->
		<property name="removeStopWords">false</property>
		<!-- path to the stop words file in the eop-resources directory -->
		<property name="stopWordPath">eop-resources-1.2.4/external-data/stopwords_EN.txt</property>

		<!-- Option to perform additional external lookup using Nemex dictionaries for specific domain information, like Named Entities -->
		<property name="numOfExtDicts">0</property>
		<property name="extDicts">eop-resources-1.2.4/gazetteer/all-dblp-mwl-plain.txt,eop-resources-1.2.4/gazetteer/MedicalTerms-mwl-plain.txt</property>
		
		<!-- configurable NemexA values. Comma separated for each supported external dictionary. Please do not change the values for any parameter except the similarity measures and thresholds. -->
		<property name="delimExtLookup">#,#</property>
		<property name="delimSwitchOffExtLookup">true,true</property>
		<property name="nGramSizeExtLookup">3,3</property>
		<property name="ignoreDuplicateNGramsExtLookup">false,false</property>
		<property name="simMeasureExtLookup">COSINE_SIMILARITY_MEASURE,COSINE_SIMILARITY_MEASURE</property>
		<property name="simThresholdExtLookup">0.8,0.8</property>

		<!-- configurable NemexA values for performing lookup for alignment. Please do not change parameter values except for similarity measure and threshold. -->
		<property name="gazetteerAlignLookup">eop-resources-1.2.4/gazetteer/nemexBOWAligner.txt</property>
		<property name="delimiterAlignLookup">#</property>
		<property name="delimiterSwitchOffAlignLookup">true</property>
		<property name="nGramSizeAlignLookup">3</property>
		<property name="ignoreDuplicateNGramsAlignLookup">false</property>
		<property name="simMeasureAlignLookup">COSINE_SIMILARITY_MEASURE</property>
		<property name="simThresholdAlignLookup">0.8</property>

		<!-- direction of processing for alignment. TtoH refers to Nemex dictionary generated from text, and entries in H looked up to add alignment links between T and H. Vice-versa for HtoT. -->
		
		<property name="direction">TtoH</property>
	</section>

	<!-- same basic config as NemexBagOfWordsScoring -->
	<section name="NemexBagOfLemmasScoring">

		<property name="removeStopWords">false</property>
		<property name="stopWordPath">eop-resources-1.2.4/external-data/stopwords_EN.txt</property>

		<property name="numOfExtDicts">0</property>
		<property name="extDicts">eop-resources-1.2.4/gazetteer/all-dblp-mwl-plain.txt,eop-resources-1.2.4/gazetteer/MedicalTerms-mwl-plain.txt</property>
		<property name="delimExtLookup">#,#</property>
		<property name="delimSwitchOffExtLookup">true,true</property>
		<property name="nGramSizeExtLookup">3,3</property>
		<property name="ignoreDuplicateNGramsExtLookup">false,false</property>
		<property name="simMeasureExtLookup">COSINE_SIMILARITY_MEASURE,COSINE_SIMILARITY_MEASURE</property>
		<property name="simThresholdExtLookup">0.8,0.8</property>

		<property name="gazetteerAlignLookup">eop-resources-1.2.4/gazetteer/nemexBOLAligner.txt</property>
		<property name="delimiterAlignLookup">#</property>
		<property name="delimiterSwitchOffAlignLookup">true</property>
		<property name="nGramSizeAlignLookup">3</property>
		<property name="ignoreDuplicateNGramsAlignLookup">false</property>
		<property name="simMeasureAlignLookup">DICE_SIMILARITY_MEASURE</property>
		<property name="simThresholdAlignLookup">0.75</property>

		<property name="direction">TtoH</property>

		<!-- Additional option for WordNet lookup. -->
		<property name="isWN">true</property>
		<property name="wnPath">eop-resources-1.2.4/ontologies/EnglishWordNet-dict/</property>
		<property name="WNRelations">HYPERNYM,SYNONYM,PART_HOLONYM</property>
		<property name="isWNCollapsed">true</property>
		<property name="useFirstSenseOnlyLeft">true</property>
		<property name="useFirstSenseOnlyRight">true</property>


	</section>
	
	<!-- same basic config as NemexBagOfWordsScoring -->
	<section name="NemexBagOfChunksScoring">

		<property name="removeStopWords">false</property>
		<property name="stopWordPath">eop-resources-1.2.4/external-data/stopwords_EN.txt</property>

		<property name="numOfExtDicts">0</property>
		<property name="extDicts">eop-resources-1.2.4/gazetteer/all-dblp-mwl-plain.txt,eop-resources-1.2.4/gazetteer/MedicalTerms-mwl-plain.txt</property>
		<property name="delimExtLookup">#,#</property>
		<property name="delimSwitchOffExtLookup">true,true</property>
		<property name="nGramSizeExtLookup">3,3</property>
		<property name="ignoreDuplicateNGramsExtLookup">false,false</property>
		<property name="simMeasureExtLookup">COSINE_SIMILARITY_MEASURE,COSINE_SIMILARITY_MEASURE</property>
		<property name="simThresholdExtLookup">0.8,0.8</property>

		<property name="gazetteerAlignLookup">eop-resources-1.2.4/gazetteer/nemexBOChunksAligner.txt</property>
		<property name="delimiterAlignLookup">#</property>
		<property name="delimiterSwitchOffAlignLookup">true</property>
		<property name="nGramSizeAlignLookup">3</property>
		<property name="ignoreDuplicateNGramsAlignLookup">false</property>
		<property name="simMeasureAlignLookup">COSINE_SIMILARITY_MEASURE</property>
		<property name="simThresholdAlignLookup">0.65</property>

		<property name="direction">HtoT</property>

		<!-- Additional option to lookup words in chunks for matching entries in WordNet, to expand phrase list before alignment. -->
		<property name="isWN">false</property>
		<property name="wnPath">eop-resources-1.2.4/ontologies/EnglishWordNet-dict/</property>
		<property name="WNRelations">HYPERNYM,SYNONYM,PART_HOLONYM</property>
		<property name="isWNCollapsed">true</property>
		<property name="useFirstSenseOnlyLeft">true</property>
		<property name="useFirstSenseOnlyRight">true</property>

		<!-- To additionally calculate scores dependent on the fraction of overlap of terms under alignment. -->
		<property name="useCoverageFeats">true</property>
		<property name="coverageFeats">word,content,verb,properNoun</property>

		<!-- Path to trained model for opennlp chunker. This file, 'en-chunker.bin' can be downloaded for research purposes from the website: http://opennlp.sourceforge.net/models-1.5/ . The path to the downloaded file needs to be entered here. -->
		<property name="chunkerModelPath">eop-resources-1.2.4/en-chunker.bin</property>

	</section>

	<!-- same basic config as NemexBagOfWordsScoring -->
	<section name="NemexPersonNameScoring">

		<!-- External dictionary consisting of person names -->
		<property name="numOfExtDicts">1</property>
		<property name="extDicts">eop-resources-1.2.4/gazetteer/personNames_all.txt</property>
		<property name="delimExtLookup">#</property>
		<property name="delimSwitchOffExtLookup">true</property>
		<property name="nGramSizeExtLookup">3</property>
		<property name="ignoreDuplicateNGramsExtLookup">false</property>
		<property name="simMeasureExtLookup">COSINE_SIMILARITY_MEASURE</property>
		<property name="simThresholdExtLookup">0.8</property>

		<property name="gazetteerAlignLookup">eop-resources-1.2.4/gazetteer/nemexPersonNameDict.txt</property>
		<property name="delimiterAlignLookup">#</property>
		<property name="delimiterSwitchOffAlignLookup">true</property>
		<property name="nGramSizeAlignLookup">3</property>
		<property name="ignoreDuplicateNGramsAlignLookup">false</property>
		<property name="simMeasureAlignLookup">COSINE_SIMILARITY_MEASURE</property>
		<property name="simThresholdAlignLookup">0.8</property>

		<property name="direction">TtoH</property>

		<!-- Path to model file for identifying names of people using opennlp Named Entity Recognizer. This file, 'en-ner-person.bin' can be downloaded for research purposes from the website: http://opennlp.sourceforge.net/models-1.5/ . The path to the downloaded file needs to be entered here.-->
		<property name="personNameModelPath">eop-resources-1.2.4/en-ner-person.bin</property>

	</section>

	<!-- Configuration for performing alignment based on embedded vector similarity. -->
	<section name="BagOfWordVectorScoring">
	
		<!-- type of vector model. Defaults to google. -->
		<property name="modelType">google</property>
		<!-- path to the word vector file -->
		<property name="vecModel">eop-resources-1.2.4/vectorModel/GoogleNews-vectors-negative300.bin</property>
		<!-- similarity threshold for alignment based on word vectors -->
		<property name="threshold">0.75</property>
		
		<!-- if stopword removal should be performed before alignment -->
		<property name="removeStopWords">false</property>
		<!-- path to stopwords file -->
		<property name="stopWordPath">eop-resources-1.2.4/external-data/stopwords_EN.txt</property>
		<!-- list of POS tags of words to ignore during alignment process -->
		<property name="ignorePosPath">eop-resources-1.2.4/external-data/ignorePosTags.txt</property>

	</section>

	<!-- Configuration for aligning chunks using vector similarity -->
	<section name="BagOfChunkVectorScoring">
		<!-- path to pre-trained vectors for chunks for the given data set. Important: this value differs between the training and test sets, depending on which task is being performed. -->
		<property name="chunkVecModel">eop-resources-1.2.4/chunkVectors/SNLITrain.txt</property>

		<!-- Path to trained model for opennlp chunker. This file, 'en-chunker.bin', can be downloaded for research purposes from the website: http://opennlp.sourceforge.net/models-1.5/ . The path to the downloaded file needs to be entered here. -->
		<property name="chunkerModelPath">eop-resources-1.2.4/en-chunker.bin</property>
		<property name="threshold">0.6</property>

		<property name="removeStopWords">false</property>
		<property name="ignorePosPath">eop-resources-1.2.4/external-data/ignorePosTags.txt</property>
		
		<!-- option to use WordNet for identifying negatively aligned chunks based on Antonymy relations -->
		<property name="isWN">true</property>
		<property name="wordNetFilesPath">eop-resources-1.2.4/ontologies/EnglishWordNet-dict/</property>
		<property name="useFirstSenseOnlyLeft">false</property>
		<property name="useFirstSenseOnlyRight">false</property>
		
		<!-- option to use VerbOcean for identifying negatively aligned chunks -->
		<property name="isVO">true</property>
		<property name="verbOceanFilesPath">eop-resources-1.2.4/VerbOcean/verbocean.unrefined.2004-05-20.txt</property>
		<property name="verbOceanThreshold">1.0</property>

		<!-- if scores should be calculated based on the fraction of content covered under alignment of chunks -->
		<property name="useCoverageFeats">true</property>
		<property name="coverageFeats">word,content,verb,properNoun</property>

	</section>

	<!-- Configuration for calculating a score based on relative number of negation terms in T and H. -->
	<section name="NegationScoring">
		<property name="negWordPath">eop-resources-1.2.4/external-data/negationWordsEN.txt</property>
	</section>


	<section name="eu.excitementproject.eop.core.NemexWekaClassificationEDA">
		<property name="trainDir">/tmp/EN/dev/</property>
		<property name="dataSplit">false</property>
		<property name="testDir">/tmp/EN/test/</property>
		<property name="modelFile">eop-resources-1.2.4/model/NemexWekaClassificationEDASnliModel_AllFeats_EN</property>
		<property name="numOfModelFiles">1</property>
		<property name="wekaArffFile">eop-resources-1.2.4/wekaArff/NemexWekaArffSNLITrainAllFeats</property>
		<property name="classifier">liblinear</property>
		
		<!-- true for RTE6 data - highly unbalanced -->
		<property name="costSensitive">false</property>
		
		<!-- tuned parameter cost ratio for RTE6 data -->
		<property name="cost_0-1">3.5</property>
		<property name="cost_1-0">1.0</property>
		<property name="minimizeExpectedCost">false</property>
		
		<!-- which scorers to use along with the EDA -->
		<property name="Components">NemexBagOfWordsScoring,NemexBagOfLemmasScoring,NemexBagOfChunksScoring,BagOfChunkVectorScoring,NegationScoring</property>

	</section>

</configuration>


