Fetching latest commit…
Cannot retrieve the latest commit at this time.
|Type||Name||Latest commit message||Commit time|
|Failed to load latest commit information.|
Word Sense Disambiguation ------------------------- Working Notes Linas Vepstas April 2008 This document contains working notes for performing word-sense disambiguation within the framework of OpenCog. The first section reviews the mapping of WordNet ideas onto OpenCog structures. The second section discusses how various different ideas of "word-sense" are associated with a parsed sentence, and the algorithms that can be used to perform the actual disambiguation. The wordnet-import directory contains code to import the wordnet dataset into OpenCog, based on the structures outlined below. The code in this directory is primarily focused on the implementation of the Rada Mihalcea word-sense disambiguation algorithm, together with extensions/enhancements/modifications thereof. The Mihalcea algorithm assigns senses to the words in a sentence, and constructs links between these different senses. The links are weighted by a word-sense similarity metric. The result is a graph whose vertices are word-senses, and whose edges are these links. The graph is essentially just a Markov chain, and is solved as such, looking for a stationary vector of probabilities for the word-sense vertices. The vertices with the highest scores are then the most likely meaning for a word. Preliminary comments -------------------- Per current conversations with Ben Goertzel, new links introduced in the text below should be understood to be a kind of predicate. That is, a link of the form BlahLink Node "AAA" Node "BBB" is to be understood to be equivalent to EvaluationLink PredicateNode "Blah" ListLink Node "AAA" Node "BBB" Although OpenCog does not currently handle this equivalence in a transparent fashion, this is the long-term intention. Thus, the text below will mix these two different representation forms. Word senses ----------- Associated with every word is a list of possible senses. Consider the word "bark", thus: WordNode "bark" -- a word. Wordnet entries are not necessarily single words; more properly, the should be called "lexical units": they can be single words, collocations, idioms, and so on. A "word sense" is a collection of information about a single semantic sense of a word. This bag includes information about the part-of-speech (noun, verb, etc.), example usages, pointers to synonym sets (synsets), hyponyms, hypernyms, etc. A specific "word sense" will be tagged with a unique identifier, which is then used to reference that sense. In the OpenCog type hierarchy, (type.script) WORD_SENSE_LINK <- ASSOCIATIVE_LINK A WordSenseLink couples the tag that will stand for the word sense, to the word itself: WordSenseLink WordInstanceNode "bark" WordSenseNode "bark_sense_23" The tag value "bark_sense_23" has no particular meaning in itself, it is just some unique string used to identify the sense. Word senses can be crudely categorized according to part-of-speech. Thus, for example: PartOfSpeechLink WordSenseNode "bark_sense_23" PartOfSpeechNode "noun" The above introduces a new node type "PartOfSpeechNode", and a new link type: "PartOfSpeechLink". Other possible properties might include gender, entity tags, and so on. Each of these nodes and links are assigned a truth value of 1 and a confidence of 1. Non-unit truth-values & confidence will be used when relating word senses to the use of actual words in sentences. Here, however, the goal is to state "this word does have this as one possible meaning, no matter how rarely used this meaning might be." [[Some word senses are more common than others. For example, archaic word senses tend to be rarely encountered ... except when one is reading an archaic text. This is addressed in a section below.]] In WordNet, word senses are commonly established by giving "sister terms" (a WordNet concept, equivalent to "synonyms for word senses") as well as hypernyms, hyponyms, part holonyms, etc. A sister term, hypernym, hyponym, etc. are thus links. Since some sister terms are closer than others, these links will have non-unit truth values. Hypernyms and hyponyms will be mapped to inheritance links. Details of the actual mapping used are given in greater detail in the wordnet-import/README file. Statistical corpus analysis --------------------------- Word nodes can be assigned a truth value that indicates the frequency of occurence of the word in a text. This is discussed in greater detail in the similarity/README file. Sentences to meanings --------------------- Consider two sentences: "The loud bark woke me up." or "The rough bark cut my finger." The goal of word-sense disambiguation is two-fold: associate to all possible meanings of the word "bark", and then, rank the possible meanings from most likely to least. There are six related ideas of a word, and its sense, that need to be distinguished. These are: 1) The "word instance", as it occurs in the sentence. Word instances may be mis-spelled, and thus it may be ambiguous as to which actual dictionary word was intended. 2) The "dictionary word", the word as it would be in the standard lexicon of a language: a word in its proper spelling, and with morphological inflections stripped away: the lemmatized or root form of the word. Dictionary-words are represented by WordNode. 3) The "word sense", a particular sense of a word, as it might appear in a dictionary, or in WordNet. For example, "bark" in the sense of "tree bark". A word-sense exists independently of a particular sentence or text. 4) The "word parse", a linkage of a word-instance to a particular sentence parse. In the course of parsing, syntactical information is assigned to the word-parse: the word parse is identified by a a part-of-speech, a tense, a number, as a subject or object. The assignment of these properties may vary from parse to parse. For any one given sentence, there may be as many different word-parses as there are sentence parses. Typically, most sentence parses will assign the same word properties to each word-parse; these can then be collapsed into one, with the realization that the identification of the word-parse is unambiguous across all of the different parses. Note that a word-parse is also an assignment of a word-instance to a dictionary-word. Although rare, there are words in the English language which have multiple different base forms; for example, "axes" is the plural form of both "axe" and "axis". Two different parses might then assign different base-forms to the word-instance. This may also occur if a word was mis-spelled: for instance, using "there" instead of "their". An incorrect parse would link the mis-spelled word-instance to the spelled-the-same but-wrong dictionary-word; the correct parse would link the word-instance to its correct spelling. 5) The "word concept", the linkage of a word sense to a word parse. The "word concept" is distinguished from the "word sense", as being a particular word sense within context of the sentence in which it appears. Word-concepts may be uncertain or ambiguous: the meaning of any particular word-instance, as it appears in a sentence, may not be entirely clear. Initially, a link is created between a word-parse and every possible appropriate word-sense (based on the part-of-speech assignment, etc.). Each link is assigned equal weight. The goal of word-sense disambiguation is to (strongly) prefer one particular word-concept over another. 6) Finally, there is a "text-sense", which is the disembodied concept that a word instance is referring to, across the entire text. For example, in the text "Mary looked at the glass. She picked it up." the word-instances "she" and "Mary" both refer to the same text-sense, even though they are represented by completely different surface words. The relation of text-senses to a text is also probabilistic: there is a non-zero probability that "she" does not actually refer to Mary. At the basic level, a text-sense is a link between a node "Mary the abstract concept" and various word-concepts. Thus, for example, the word "Mary" has multiple possible word-senses: "Mary" may be a human being, but also, "Mary" may be a dog or other animal, or "Mary" may be a statue of the the Virgin Mary. The act of word-sense disambiguation will assign a probability to the word-concepts binding "Mary" to each of these possible word-senses. Then, the likelihood that "she" refers to "Mary" depends on the likelihood that "Mary" refers to a statue of the Virgin Mary, as statues cannot pick things up. (Incidentally, this gives some mathematical insight into humour, poetry, and "magical" and fantasy literature: in the normal world, a statue cannot pick things up. But a well-directed joke can force one to conclude that it must have: an impossibility, and therefore funny. A "well-directed" joke does this by making sure that there is no ambiguity in either the parse, or in the associated word-senses, so that there is no question, upon arriving at the punch line, of what it implies, however absurd that may be.) To illustrate the above: Consider the example text: "We stood in the yard. As she walked up to me, I laid my hand on the old oak. Its rough bark cut my finger." This sentence contains the word-instance "bark". The word-parse of "bark" is unambiguous: it is a noun. From WordNet, we know that "bark-the-noun" has several different word-senses: it may be tree-bark, or it may be a type of watercraft. The word-concept would be a linkage from the word-parse "bark-the-noun" to the sense of "tree-bark"; the goal of word-sense disambiguation is to give this particular word-concept a high probability and a high confidence, as compared to "bark-the-watercraft". Lastly, there is the text-sense. In this text, the text-concept for "bark" can be deduced to be "the bark of an oak tree in a particular yard in which there are two people". Network structure ----------------- The goal of the OpenCog network structure given below is to properly capture the above-described relationships, inserting probabilities in all of the right places, and allowing an algorithm to walk the network of connections and assign the highest probabilities to the most likely interpretations. The basic algorithmic components are as follows: A) Import WordNet data to establish "dictionary-words" and "word-senses", as defined above. B) Use the Link-Grammar+RelEx infrastructure to identify "word-instances", and link them to "word-parses", and to assign initial probabilities ranking alternative parses. C) Use Rada Mihalcea's word-sense-disambiguation algorithms to compute initial probability assignments for "word-concepts". D) Use anaphora resolution algorithms to assign initial probabilities to text-senses. E) Use "common-sense reasoning" to refine probabilities for text-senses. A given text-sense probability can act as feedback for the likelihood of a given parse being correct, and so the initial probabilities assigned in steps B,C,D are then iterated upon, in an effort to find convergence. Word-parse meaning ------------------ This section elaborates on point 4) above. Word-parses assign syntactic meaning to a word, within the context of a parse. Thus, for example, the goal is to express "in parse 2, the word 'bark' has part-of-speech 'noun'". Such associations are required because, in parse 3, the word 'bark' might be a verb. Mathematically, this is a quadruple: (parse2, 'bark', part-of-speech, 'noun'). Most PLN structures are oriented around triples, and so this section explores how to map this quadruple into a set of predicate triples. This suggests the structure below. Each parse of a sentence is assigned an initial parse ranking, and so: ParseNode "parse_1" strength=0.9 confidence=0.6 ParseNode "parse_2" strength=0.8 confidence=0.5 ParseNode "parse_3" strength=0.6 confidence=0.3 Each word in each parse of a sentence gets a unique concept node assignment. This is in order to allow the same word-instance be associated to different values in different parses. Suppose, for example, that, in two different parses of the same sentence, the word "bark" is identified as a noun, or as a verb. Thus, one has distinct nodes: <!-- Identify bark_144 as being parse instance 2 --> EvaluationLink PredicateNode "ParseInstance" ListLink WordInstanceNode "bark_144" ParseNode "parse_2" <!-- Identify bark_169 as being parse instance 3 --> EvaluationLink PredicateNode "ParseInstance" ListLink WordInstanceNode "bark_169" ParseNode "parse_3" As promised by the example, one of these identifies "bark" as a noun, the other as verb: <!-- In parse 2, which has bark_144, it is a noun --> EvaluationLink PredicateNode "PartOfSpeech" ListLink WordInstanceNode "bark_144" PartOfSpeechNode "noun" <!-- In parse 3, which has bark_169, it is a verb --> EvaluationLink PredicateNode "PartOfSpeech" ListLink WordInstanceNode "bark_169" PartOfSpeechNode "verb" It should be made clear that both concepts "bark_144" and "bark_169" refer to the same word in the same sentence. <!-- in parse 2, the word "bark" is associated with bark_144 --> ReferenceLink WordInstanceNode "bark_144" WordNode "bark" <!-- in parse 3, the word "bark" is associated with bark_169 --> ReferenceLink WordInstanceNode "bark_169" WordNode "bark" In the above, the associated WordNode is meant to be the non-lemmatized form of the word, since, in principle (though extremely rare in practice), the lemmatization might be different for different parses. (By "lemmatized form", we mean the infinitive form of a verb, the singular of a plural noun, etc.) [[In the above, a collection of triples were used to specify the quadruple relation (parse 2, 'bark', part-of-speech, 'noun'), which can be read as "in the context of parse 2, the word 'bark' is a noun". This suggests that perhaps ContextLink could be used?]] If all parses agree on the properties of a given word, it would be advantageous to collapse the proliferation of word-instances down to one or two, as early in the processing as possible. [[----- A note about the structure of PLN/OpenCog. The current working assumption, per discussions with Ben, is to treat thw following two forms as entirely equivalent: EvaluationLink PredicateNode "PartOfSpeech" ListLink WordInstance "bark_144" PartOfSpeechNode "noun" That is, the above is (ignoring truth values) *equivalent* to the below: PartOfSpeechLink WordInstance "bark_144" PartOfSpeechNode "noun" The below uses fewer atoms to represent the information. The above is more convenient for reasoning, as it is in predicate form. For reasons of compactness, the form below shall be used in practice, at this time. ]] Sentence identification ----------------------- Multiple alternate parses of a sentence should be associated with a single sentence. This may be accomplished by declaring a sentence: SENTENCE_NODE <- CONCEPT_NODE SentenceNode "sentence_22" strength=0.9 confidence=0.6 and associating it with each parse: ParseLink ParseNode "parse_1" strength=0.9 confidence=0.6 SentenceNode "sentence_22" ParseLink ParseNode "parse_2" strength=0.8 confidence=0.5 SentenceNode "sentence_22" and so on. The special node types "SentenceNode" and ParseNode are used so as to easily locate new sentences added to the system. After word processing has been performed, the sentence node is marked to indicate that processing is complete, as so: InheritanceLink SentenceNode "sentence_22" AnchorNode "#WSD_completed" The AnchorNode is an ad-hoc node type, functioning essentially as a special-purpose tag with can be used to tag things like a SentenceNode. Word concepts ------------- This section elaborates on point 5 above. The goal of word-sense disambiguation is to associate a word instance to each possible word-sense, and then to determine the most-likely pairing. The Mihalcea algorithm performs this by creating weighted links between (word-instance, word-sense) pairs. (These are referred to as "admissible labels" in ). The goal of this section is to describe the OpenCog representation of these pairs-of-pairs. Crudely speaking, these relations form an adjacency matrix for a graph resembling Markov chain; and so the representation problem is that of representing a square matrix in OpenCog. Several representations are reviewed. A naive way of representing word-concept links would be to create links: InheritanceLink strength=0.9 confidence=0.1 WordInstanceNode "bark_144" WordSenseNode "bark_sense_23" The likelihood "bark_sense_23" being the correct sense for the word "bark" is encoded in the strength of the Inheritance Link. The pairs-of-pairs relations can then be provided as below. Here, the example sentence contains the words "tree" and "bark", which can then mutually reinforce the most likely meaning of each word. <!-- the word "tree" occurred in the sentence --> CosenseLink strength=0.49 confidence=0.3 InheritanceLink strength=0.9 confidence=0.6 WordInstanceNode "tree_99" WordSenseNode "tree_sense_12" InheritanceLink strength=0.9 confidence=0.1 WordInstanceNode "bark_144" WordSenseNode "bark_sense_23" Note that the set link is unordered: there is no preferred direction in the relation between "tree" and "bark". [[An alternate representation for the above would create a single node for every (word-instance, word-sense) pair, like so: ConceptNode "bark_pair_1" strength=0.9 confidence=0.1 This node is essentially the graph-dual to the InheritanceLink above; here, "graph-dual" is to be understood in the graph-theory sense. The likelihood "bark_sense_23" being the correct sense for the word "bark" is encoded in the strength of this node. This node is associated with the (word-instance, word-sense) pair as follows: AssociativeLink ConceptNode "bark_pair_1" strength=0.9 confidence=0.1 WordSenseNode "bark_sense_23" AssociativeLink WordInstanceNode "bark_144" ConceptNode "bark_pair_1" strength=0.9 confidence=0.1 The associative links are not weighted. The pairs-of-pairs are then represented as follows: SentenceCooccurrenceLink strength=0.49 confidence=0.3 ConceptNode "tree_pair_2" strength=0.9 confidence=0.6 ConceptNode "bark_pair_1" strength=0.9 confidence=0.1 Its not clear which representation is "better".]] The Mihalcea algorithm ---------------------- This section describes how the Mihalcea algorithm is actually implemented in OpenCog. -- "Input/output" -- Input is received from the OpenCog server, in the form of RelEx-generated OpenCog networks. These are directly obtained by class WordSenseProcessor, which interfaces with the OpenCog server (and isolates the rest of the code from the OpenCog server details). It then passes input to class Mihalcea, which provides the top-level class for the rest of the algorithm. -- "Admissible labels" -- in the text above, and in the code, these are referred to as (word-instance, word-sense) pairs. These are added to the raw relax input by class MihalceaLabel. They are added to all words in all parses. -- "Build a graph of label dependencies". In the text above, and in the code, these are referred to as "pairs of pairs". These are added to the by the class MihalceaEdge. There is an slight difference: the original Mihalcea algorithm directs that these be added word-by-word in the sentence. Here, instead, these are added between words participating between RelEx relations. In particular, because RelEx will omit determiners ("a", "the") and other words from relations, there won't be edges between these. RelEx will also sometimes collapse prepositional phrases into collocations, as well as entities, this will also cause a subtle difference in the graph structure. -- "Label dependencies". Currently, only the Leacock-Chodorow word-sense similarity measure is implemented. It only provides a measure that relates nouns to nouns, and verbs to verbs. It says nothing about adjectives or adverbs. It is implemented in class SenseSimilarity. -- "Score vertices". Scoring is performed using the the PageRank algorithm. The graph is walked in a weighted random fashion, with the likelihood of taking an edge proportional to the edge's weight. The score of each vertex is then adjusted using a weighted page-rank sum. The algo is implemented in class SenseRank. -- "Label assignment". The highest-scoring word-senses are reported by class ReportRank. Currently, it just prints word senses to stdout. Markov Finer Points ------------------- Note that the eigenvector that is solved for has *all* of the various word senses listed as its components. If the eigenvector is normalized to unit length, then the sum of the probabilities for any one given word will *not* total to one! An alternative to propagating on the graph might instead be to just link to lapack, and use on of the eigenvector routines. This might be faster. Edges connecting senses use SimpleTruthValues; the mean/strength holds holds the strength of the connection. SenseLinks (between word-instances and wordnet-senses) use CountTruthValues. The intital strength is set to 1.0 (true) and confidence to 0.0 (inconfident). The algo is run, manipulating the count. XXXX-- the following para describes something that should be, but is not currently done: XXX After the algo has been run, the results are "renormalized" with confidence value on the sense-link being changed to reflect the result. The resulting confidence can be undrstood to be a probability: by summing over all possible sense assignments for a word-instance, the total confidence will sum to one. Preliminary Results ------------------- I've only done some basic testing so far, the results are "OK". Mihalcea herself claims about 50% accuracy, which is said to be more-or-less state-of-the-art. However, based on even this small bit of testing, the reason for both the successes and the failures of the Mihalcea algorithm are evident: For example, consider the sentence "I cut my finger on the tree bark". The words in the noun-modifier phrase "tree bark" are strongly reinforcing of one-another, and thus rank the correct sense of both "tree" and "bark" quite highly. On the other hand, "cut" and "finger" and "me" pretty much have nothing to do with tree-bark, and serve only to inject noise and confusion into the sense ranking. General Thoughts ---------------- The Rada Mihalcea algorithm is more-or-less just a Markov model, where the states of the system are word-senses, and the word-sense similarity metrics are the state transition probabilities (the conditional probabilities). Because wordnet is explicitly used, the set of possible word senses are overt, visibile. If wordnet was not being used, and word-senses were a-priori unknown, the algorithm would have to be replaced by a hidden Markov model. The above analogy suggests the correct way to generalize the algorithm for anaphora resolution: create a hidden Markov model that consolidates different word meanings in different senses, as referring to the same object. Correlations and Probabilities ------------------------------ The correct way to understand sense-similarity measures is to interpret them as correlations between pairs of senses. Given a correlation function, the underlying probability distribution can be computed, and then, from this, the conditional probability of one word sense, given another word sense, can be computed. Properly speaking, the edge weights used in the Mihalcea algorithm should be these conditional probabilities, and not the rescaled word-sense similarity corrrelations. However, for now, rescaling is approximately correct, and should be enough. Algorithm Modifications ----------------------- This section describes how the core Mihalcea algorithm has been modified or extended. The goal of the modifications is top improve the results. Obvious "raw material" for this is the incorporation of the sentence structure, obtained from parsing, into the algorithm. Noun-modifier Strengthening --------------------------- The parser identifies noun-modifier phrases, such as "tree bark" or "aquatic mammal". The senses chosen for both nouns in the noun-modifier phrase should be in strong agreement with each other. There are several ways in which this might be accomplished. -- Strengthen all similarity links between all senses of these two words. This should have the effect of shielding these two words from outside influences. -- Weaken all links to the modifying noun, or strengthen all links to the modified noun. The first approach is implemented in class NNAdjust. Preliminary, seat-of-the-pants experiments seem to show that this makes little or no difference. Word-Sense Distance Metrics --------------------------- A major bottleneck for the algorithm appears to be the computation of word-sense commonality measures. Finding the least common subsumer (lcs) in a single hierarchy is straightforward; thus, the lcs of the hypernym (is-a) hierarchy is easily found. However, mixing together multiple hierarchies, e.g. is-a and has-a (meronym) hierarchies leads to an exponential explosion of search possibilities, since both the is-a and the has-a paths need to be explored at every stage. The current implementation of the Leacock-Chodorow(lch) similarity metric explores the is-a hierarchy fully, while taking a limited number of has-a steps to explore closely related heirarchy. Good results seem to be obtained by setting n=1. When n is increased by 1, the cpu time usage appears to increase by 2^n (right!?) so that the total cpu usage is 2^(n!) where n! is n-factorial. At least, for small n. Wow. 2:07.67 - 1:50.95 = 17 seconds for the n=0 simple case. 2:20.30 - 1:48.40 = 32 seconds for the n=1 case. 4:02.93 - 1:48.39 = 134 seconds for the n=2 case. 3:26.40 - 1:49.56 = 96 seconds for the n=3 case. (The above is implemented in SenseSimilarityLCH.h/.cc More convenient is SenseSimilaritySQL.cc, which pulls precomputed senses out of an SQL datbase. Much faster that way.) Caching Word-Sense Distance Measures ------------------------------------ To avoid the cost of multiple computations of the measure for the same sense pair, the results of the computation are cached. The cached value is stored as a similarity link, in the atom space, between the two word senses. The caching is performed by class SenseCache. For sentences with multiple parses, the use of caching provides a major performance boost. clean start: 3294 edges for 36 word pairs (rate=160.967140) 494 edges for 45 word pairs (rate=560.458623) 4179 edges for 90 word pairs (rate=1023.431412) 227 edges for 60 word pairs (rate=172.097597) 24 edges for 36 word pairs (rate=44.365162) 37 edges for 54 word pairs (rate=47.401926) 603 edges for 78 word pairs (rate=55.337474) 182 edges for 117 word pairs (rate=62.594903) 32 edges for 21 word pairs (rate=66.643619) 407 edges for 91 word pairs (rate=102.196214) 35 edges for 56 word pairs (rate=43.196438) 38 edges for 28 word pairs (rate=106.432441) 311 edges for 72 word pairs (rate=351.375108) 2139 edges for 171 word pairs (rate=110.845516) 551 edges for 152 word pairs (rate=108.684289) 1872 edges for 105 word pairs (rate=106.707945) 5009 edges for 285 word pairs (rate=229.150653) 2437 edges for 90 word pairs (rate=584.059233) 2290 edges for 90 word pairs (rate=190.564907) 1820 edges for 190 word pairs (rate=141.599178) 4078 edges for 300 word pairs (rate=279.677327) 5366 edges for 180 word pairs (rate=255.627497) 3777 edges for 135 word pairs (rate=386.990560) hour later: 7983 edges for 143 word pairs (rate=177.777639) 3566 edges for 253 word pairs (rate=119.823203) 3494 edges for 78 word pairs (rate=150.161062) 4288 edges for 156 word pairs (rate=301.568366) 2244 edges for 78 word pairs (rate=221.805891) 394 edges for 56 word pairs (rate=648.082810) two hours later: 1300 edges for 95 word pairs (rate=41.524402) 5217 edges for 171 word pairs (rate=64.100957) 587 edges for 78 word pairs (rate=77.836852) 4772 edges for 168 word pairs (rate=261.459252) 1513 edges for 66 word pairs (rate=143.758023) 2764 edges for 126 word pairs (rate=90.000027) 2983 edges for 99 word pairs (rate=103.497465) 1238 edges for 66 word pairs (rate=104.637787) 476 edges for 91 word pairs (rate=37.023434) 6492 edges for 253 word pairs (rate=94.560790) 1453 edges for 135 word pairs (rate=196.339518) 2119 edges for 231 word pairs (rate=57.912276) 1824 edges for 144 word pairs (rate=109.158918) 3072 edges for 156 word pairs (rate=62.409254) 2607 edges for 78 word pairs (rate=65.946862) ten hours later: 377 edges for 28 word pairs (rate=76.804571) 197 edges for 21 word pairs (rate=82.990628) 691 edges for 28 word pairs (rate=58.645770) 1669 edges for 264 word pairs (rate=53.167052) 1779 edges for 171 word pairs (rate=84.579619) 1851 edges for 117 word pairs (rate=111.031172) 1711 edges for 78 word pairs (rate=64.673862) 1159 edges for 120 word pairs (rate=67.343033) 1774 edges for 45 word pairs (rate=105.078328) 7269 edges for 136 word pairs (rate=91.874337) 3206 edges for 182 word pairs (rate=79.418685) 2645 edges for 140 word pairs (rate=88.427971) 1506 edges for 45 word pairs (rate=110.503189) These number suggest that the current "caching" mechanism is a loosing strategy. Looking at the code, its not hard to see why. But wait -- shows not over: after 12 hours: 1104 edges for 99 word pairs (rate=157.804099) 1128 edges for 55 word pairs (rate=79.791207) 2578 edges for 117 word pairs (rate=451.047206) 2312 edges for 195 word pairs (rate=185.579256) 1683 edges for 78 word pairs (rate=202.813602) 2263 edges for 195 word pairs (rate=114.505580) 8368 edges for 312 word pairs (rate=242.270232) 3498 edges for 216 word pairs (rate=726.549312) 7389 edges for 276 word pairs (rate=136.556129) 1734 edges for 99 word pairs (rate=2574.969632) 1600 edges for 77 word pairs (rate=940.313595) 729 edges for 55 word pairs (rate=378.761704) 300 edges for 21 word pairs (rate=340.650119) wtf ... these rates are higher than ever.. and are a fantastic speedup! after 24 hours: 258 edges for 56 word pairs (rate=79.132317) 1405 edges for 260 word pairs (rate=58.772474) 2362 edges for 190 word pairs (rate=66.941334) 818 edges for 182 word pairs (rate=225.214808) 8480 edges for 325 word pairs (rate=91.553765) 922 edges for 72 word pairs (rate=929.476711) 465 edges for 66 word pairs (rate=268.935519) Ideas/TODO ---------- -- Do not create sense links to punctuation (as well as determiners, etc.) -- Remove small graphs (e.g. two senses attached to each other, but nothing else). -- When solving network, do not follow links to a sense which has only one edge. -- Build test framework -- Need to fix main dispatcher; right now, WSD is called 100 times a second, no matter what. -- Processed sentences need to be discarded when they are old enough. Can't afford to clutter RAM with lots of sentences. Unfortunately, delete performance is terrible. -- use framenet for sense-similarity scoring!? -- do some common-sense reasoning from the wordnet ontology. -- The role of relex is utterly unclear: how should the dependency relations spit out by relex be used to guide WSD ?? Perhaps there should only be relation edges between those words that are involved in relex relations? Yes, .. this is probably what should be done, but the precise argument as to why is unclear ... probably need prove this experimentally. Another possibility is to use relex relations to create direct eges between word senese, and add much much weaker links between non-relations. The experiment could be done by using floating point adjustments. -- Per ben: Use the following similarity measure: "Probability of _obj(sense 1 of cut, sense 1 of finger)" "Probability of _obj(sense 1 of cut, sense 2 of finger)" based on a large corpus. -- Use word frequency scores to identify the "important" words in a sentence. These "important words" should get greater weight for WSD, as opposed to lower-importance words. This is similar to the idea of distinguishing between "function words" and "content words" -- function words having only a gramatical function, whereas the "content words" hold most of the semantic content. The idea here is that certain "content words" hold "more content" than others, based on thier information entropy (information content). -- Use mutual-information scores to tie together word pairs more tightly. ... !? The tightest word-pairs are likely to be idioms, and so these should have senses of thier own. -- where does parse-ranking fit into the scheme of things? Should total up the high scores for different parses. This assumes that tdifferent parses lead to diffferent graphs. -- Use the wordnet sense-occurence counts to add a preference to the most likely meanings? Can these counts be used to implement one of the information-content measures, e.g. jcn? The problem here is that the wordnet sense-occurence counts are small numbers, and are zero for many senses. -- stumped: how to use opencyc data, since there's no concordance with wordnet data? -- Ben says: Google "dekang lin path minipar" for the papers on his "syntactic path dependency" paper. He makes an argument that by parsing and then analyzing paths in the parse trees, one can handle a variety of tasks varying from semantic disambiguation to question-answering. (Seem to be two papers: DIRT, and the question answering one ...) DIRT seems to be aimed at the automatic discovery of word senses, rather than in sense-disambiguation per-se. -- integrate anaphora resolution. More generally, work on reference resolution in general. -- The choice of the word-similarity metric generated by lch seems somewhat "arbitrary" -- its a measure, but its not at all clear that its any good for the Markov-chain-like Mihalcea algorithm. For example, perhaps it gives too-strong a weight for word senses that are only distantly related. it would be better if one generated a random function, taking as input, the distances to the least-common subsumer, and the distance of the least-common subsumer from the top of the hierarchy. One might then try to use some optimization algo to find the best-possible function of these inputs, that gives the highest word-sense score. Domains of discourse (aka word-sense frequency counts) ------------------------------------------------------ Some word senses are more common than others. For example, archaic word senses tend to be rarely encountered ... except when one is reading an archaic text. Similarly, scientific and technical texts tend to employ certain common word with certain very uncommon meanings. Cool, hip, edgy writing will also use words in uncommon senses, or in unusual constructions. An anecdote to illustrate the idea: "When I was young and learning French, I found it easy to read and understand "Le Monde", and was utterly perplexed by "Paris Match", even thought Paris Match sentences were considerably shorter, and seemed to use a simpler vocabulary -- alas, for me, the Paris Match lingo was heavily overladen with French pop culture references and current events, which I was completely unaware of. I would sit there, with Paris Match, and look up every word in the dictionary, even though I already knew the word, because I was searching for some meaning... which, of course was not to be found in the dictionary. By contrast, Le Monde was not trying to be hip, edgy or entertaining, it was trying to be precise and explicit, and in this, it succeeds. The moral of the story is that word-sense frequency counts for Le Monde and Paris Match would be radically different -- and so, when I read a "typical" French text, say, Le Canard Enchainee (a satirical political paper), or Alexander Dumas (The Three Musketeers-- eminently readable, but clearly 18th century French), which should I use?" Wordnet does have the notion of discourse domains. XXX ToBeResolved: How to employ context-sensitive word-frequency counts is unclear. References ----------  Rada Mihalcea, "Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling", 2005  Dekang Lin, Patrick Pantel, "DIRT: Discovering Inference Rules from Text". In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2001.