                      Word Sense Disambiguation
                      -------------------------
                           Working Notes
                      Linas Vepstas April 2008


This document contains working notes for performing word-sense
disambiguation within the framework of OpenCog. The first section
reviews the mapping of WordNet ideas onto OpenCog structures. The
second section discusses how various notions of "word-sense"
are associated with a parsed sentence, and the algorithms that can
be used to perform the actual disambiguation.

The wordnet-import directory contains code to import the wordnet dataset
into OpenCog, based on the structures outlined below.

The code in this directory is primarily focused on the implementation
of the Rada Mihalcea word-sense disambiguation algorithm, together with
extensions/enhancements/modifications thereof.

The Mihalcea algorithm assigns senses to the words in a sentence, and
constructs links between these different senses. The links are weighted
by a word-sense similarity metric. The result is a graph whose vertices
are word-senses, and whose edges are these links.  The graph is
essentially just a Markov chain, and is solved as such, looking for a
stationary vector of probabilities for the word-sense vertices. The
vertices with the highest scores are then the most likely meanings of
the corresponding words.
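
For reference, the ranking used in [1] is the weighted PageRank of
Mihalcea and Tarau. In LaTeX notation, the score of a sense vertex V_i
is

   WP(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
                 \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WP(V_j)

where d is the damping factor (typically 0.85), and w_{ji} is the
weight of the edge between senses V_j and V_i. In(V) and Out(V) are the
edges into and out of V; for the undirected sense graph, these coincide.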

Preliminary comments
--------------------
Per current conversations with Ben Goertzel, new links introduced in
the text below should be understood to be a kind of predicate. That is,
a link of the form

   BlahLink
      Node "AAA"
      Node "BBB"

is to be understood to be equivalent to

   EvaluationLink
      PredicateNode "Blah"
      ListLink
         Node "AAA"
         Node "BBB"

Although OpenCog does not currently handle this equivalence in a
transparent fashion, this is the long-term intention.  Thus, the text
below will mix these two different representation forms.

Word senses
-----------
Associated with every word is a list of possible senses.
Consider the word "bark", thus:

    WordNode "bark"   -- a word.

Wordnet entries are not necessarily single words; more properly, they
should be called "lexical units": they can be single words,
collocations, idioms, and so on.

A "word sense" is a collection of information about a single semantic
sense of a word. This bag includes information about the part-of-speech
(noun, verb, etc.), example usages, pointers to synonym sets (synsets),
hyponyms, hypernyms, etc. A specific "word sense" will be tagged with a
unique identifier, which is then used to reference that sense.

In the OpenCog type hierarchy (type.script):
   WORD_SENSE_LINK <- ASSOCIATIVE_LINK

A WordSenseLink couples the tag that will stand for the word sense,
to the word itself:

   WordSenseLink
      WordNode "bark"
      WordSenseNode "bark_sense_23"

The tag value "bark_sense_23" has no particular meaning in itself,
it is just some unique string used to identify the sense.

Word senses can be crudely categorized according to part-of-speech.
Thus, for example:

   PartOfSpeechLink
      WordSenseNode "bark_sense_23"
      PartOfSpeechNode "noun"

The above introduces a new node type "PartOfSpeechNode", and a new link
type: "PartOfSpeechLink".  Other possible properties might include gender,
entity tags, and so on.

Each of these nodes and links is assigned a truth value of 1 and a
confidence of 1. Non-unit truth-values & confidences will be used when
relating word senses to the use of actual words in sentences. Here,
however, the goal is to state "this word does have this as one possible
meaning, no matter how rarely used this meaning might be."

[[Some word senses are more common than others. For example, archaic
word senses tend to be rarely encountered ... except when one is
reading an archaic text. This is addressed in a section below.]]

In WordNet, word senses are commonly established by giving "sister
terms" (a WordNet concept, equivalent to "synonyms for word senses")
as well as hypernyms, hyponyms, part holonyms, etc. Sister terms,
hypernyms, hyponyms, etc. are thus links. Since some sister terms are
closer than others, these links will have non-unit truth values.

Hypernyms and hyponyms will be mapped to inheritance links.

Details of the actual mapping used are given in greater detail
in the wordnet-import/README file.

Statistical corpus analysis
---------------------------
Word nodes can be assigned a truth value that indicates the frequency of
occurrence of the word in a text. This is discussed in greater detail in
the similarity/README file.

Sentences to meanings
---------------------
Consider two sentences: "The loud bark woke me up." and "The rough bark
cut my finger." The goal of word-sense disambiguation is two-fold:
associate the word "bark" with all of its possible meanings, and then
rank those meanings from most likely to least.

There are six related ideas of a word, and its sense, that need to
be distinguished. These are:

1) The "word instance", as it occurs in the sentence. Word instances
   may be mis-spelled, and thus it may be ambiguous as to which
   actual dictionary word was intended.

2) The "dictionary word", the word as it would be in the standard
   lexicon of a language: a word in its proper spelling, and with
   morphological inflections stripped away: the lemmatized or root form
   of the word.

   Dictionary-words are represented by WordNode.

3) The "word sense", a particular sense of a word, as it might appear
   in a dictionary, or in WordNet. For example, "bark" in the sense of
   "tree bark". A word-sense exists independently of a particular
   sentence or text.

4) The "word parse", a linkage of a word-instance to a particular
   sentence parse. In the course of parsing, syntactical information
   is assigned to the word-parse: the word parse is identified by a
   part-of-speech, a tense, a number, as a subject or object. The
   assignment of these properties may vary from parse to parse.
   For any one given sentence, there may be as many different
   word-parses as there are sentence parses. Typically, most sentence
   parses will assign the same word properties to each word-parse;
   these can then be collapsed into one, with the realization that the
   identification of the word-parse is unambiguous across all of the
   different parses.

   Note that a word-parse is also an assignment of a word-instance to
   a dictionary-word. Although rare, there are words in the English
   language which have multiple different base forms; for example,
   "axes" is the plural form of both "axe" and "axis". Two different
   parses might then assign different base-forms to the word-instance.
   This may also occur if a word was mis-spelled: for instance, using
   "there" instead of "their". An incorrect parse would link the
   mis-spelled word-instance to the spelled-the-same but-wrong
   dictionary-word; the correct parse would link the word-instance
   to its correct spelling.

5) The "word concept", the linkage of a word sense to a word parse.
   The "word concept" is distinguished from the "word sense", as being
   a particular word sense within context of the sentence in which it
   appears. Word-concepts may be uncertain or ambiguous: the meaning of
   any particular word-instance, as it appears in a sentence, may not be
   entirely clear. Initially, a link is created between a word-parse and
   every possible appropriate word-sense (based on the part-of-speech
   assignment, etc.). Each link is assigned equal weight. The goal of
   word-sense disambiguation is to (strongly) prefer one particular
   word-concept over another.

6) Finally, there is a "text-sense", which is the disembodied concept
   that a word instance is referring to, across the entire text. For
   example, in the text "Mary looked at the glass. She picked it up."
   the word-instances "she" and "Mary" both refer to the same
   text-sense, even though they are represented by completely different
   surface words. The relation of text-senses to a text is also
   probabilistic: there is a non-zero probability that "she" does
   not actually refer to Mary.  At the basic level, a text-sense
   is a link between a node "Mary the abstract concept" and various
   word-concepts.

   Thus, for example, the word "Mary" has multiple possible word-senses:
   "Mary" may be a human being, but also, "Mary" may be a dog or other
   animal, or "Mary" may be a statue of the Virgin Mary.  The act of
   word-sense disambiguation will assign a probability to the
   word-concepts binding "Mary" to each of these possible word-senses.
   Then, the likelihood that "she" refers to "Mary" depends on the
   likelihood that "Mary" refers to a statue of the Virgin Mary, as
   statues cannot pick things up.

   (Incidentally, this gives some mathematical insight into humour,
   poetry, and "magical" and fantasy literature: in the normal world,
   a statue cannot pick things up. But a well-directed joke can force
   one to conclude that it must have: an impossibility, and therefore
   funny. A "well-directed" joke does this by making sure that there
   is no ambiguity in either the parse, or in the associated
   word-senses, so that there is no question, upon arriving at the punch
   line, of what it implies, however absurd that may be.)

To illustrate the above:

Consider the example text: "We stood in the yard.  As she walked up to
me, I laid my hand on the old oak.  Its rough bark cut my finger."
This text contains the word-instance "bark". The word-parse of
"bark" is unambiguous: it is a noun.  From WordNet, we know that
"bark-the-noun" has several different word-senses: it may be tree-bark,
or it may be a type of watercraft.  The word-concept would be
a linkage from the word-parse "bark-the-noun" to the sense of "tree-bark";
the goal of word-sense disambiguation is to give this particular
word-concept a high probability and a high confidence, as compared to
"bark-the-watercraft".  Lastly, there is the text-sense. In this text,
the text-sense for "bark" can be deduced to be "the bark of an oak
tree in a particular yard in which there are two people".

Network structure
-----------------
The goal of the OpenCog network structure given below is to properly
capture the above-described relationships, inserting probabilities
in all of the right places, and allowing an algorithm to walk the
network of connections and assign the highest probabilities to the most
likely interpretations.

The basic algorithmic components are as follows:
A) Import WordNet data to establish "dictionary-words" and "word-senses",
   as defined above.
B) Use the Link-Grammar+RelEx infrastructure to identify
   "word-instances", and link them to "word-parses", and to assign
   initial probabilities ranking alternative parses.
C) Use Rada Mihalcea's word-sense-disambiguation algorithms to compute
   initial probability assignments for "word-concepts".
D) Use anaphora resolution algorithms to assign initial probabilities
   to text-senses.
E) Use "common-sense reasoning" to refine probabilities for text-senses.
   A given text-sense probability can act as feedback for the
   likelihood of a given parse being correct, and so the initial
   probabilities assigned in steps B,C,D are then iterated upon,
   in an effort to find convergence.


Word-parse meaning
------------------
This section elaborates on point 4) above.

Word-parses assign syntactic meaning to a word, within the context of a
parse. Thus, for example, the goal is to express "in parse 2, the word
'bark' has part-of-speech 'noun'". Such associations are required
because, in parse 3, the word 'bark' might be a verb.

Mathematically, this is a quadruple: (parse2, 'bark', part-of-speech,
'noun'). Most PLN structures are oriented around triples, and so this
section explores how to map this quadruple into a set of predicate
triples.

This suggests the structure below. Each parse of a sentence is assigned
an initial parse ranking, and so:

   ParseNode "parse_1" strength=0.9 confidence=0.6
   ParseNode "parse_2" strength=0.8 confidence=0.5
   ParseNode "parse_3" strength=0.6 confidence=0.3

Each word in each parse of a sentence gets a unique concept node
assignment. This is in order to allow the same word-instance to be
associated with different values in different parses. Suppose, for
example, that, in two different parses of the same sentence, the word
"bark" is identified as a noun, or as a verb. Thus, one has distinct
nodes:

   <!-- Identify bark_144 as being parse instance 2 -->
   EvaluationLink
      PredicateNode "ParseInstance"
      ListLink
         WordInstanceNode "bark_144"
         ParseNode "parse_2"

   <!-- Identify bark_169 as being parse instance 3 -->
   EvaluationLink
      PredicateNode "ParseInstance"
      ListLink
         WordInstanceNode "bark_169"
         ParseNode "parse_3"

As promised by the example, one of these identifies "bark" as a
noun, and the other as a verb:

   <!-- In parse 2, which has bark_144, it is a noun -->
   EvaluationLink
      PredicateNode "PartOfSpeech"
      ListLink
         WordInstanceNode "bark_144"
         PartOfSpeechNode "noun"

   <!-- In parse 3, which has bark_169, it is a verb -->
   EvaluationLink
      PredicateNode "PartOfSpeech"
      ListLink
         WordInstanceNode "bark_169"
         PartOfSpeechNode "verb"

It should be made clear that both word-instances "bark_144" and
"bark_169" refer to the same word in the same sentence.

   <!-- in parse 2, the word "bark" is associated with bark_144 -->
   ReferenceLink
      WordInstanceNode "bark_144"
      WordNode "bark"

   <!-- in parse 3, the word "bark" is associated with bark_169 -->
   ReferenceLink
      WordInstanceNode "bark_169"
      WordNode "bark"

In the above, the associated WordNode is meant to be the non-lemmatized
form of the word, since, in principle (though extremely rare in
practice), the lemmatization might be different for different parses.
(By "lemmatized form", we mean the infinitive form of a verb, the
singular of a plural noun, etc.)

[[In the above, a collection of triples was used to specify the
quadruple relation (parse 2, 'bark', part-of-speech, 'noun'), which can
be read as "in the context of parse 2, the word 'bark' is a noun". This
suggests that perhaps ContextLink could be used?]]
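
[[A sketch of what the ContextLink form might look like (hypothetical;
the code does not currently generate this):

   ContextLink
      ParseNode "parse_2"
      EvaluationLink
         PredicateNode "PartOfSpeech"
         ListLink
            WordNode "bark"
            PartOfSpeechNode "noun"

Here, the parse supplies the context within which the part-of-speech
assertion holds.]]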

If all parses agree on the properties of a given word, it would
be advantageous to collapse the proliferation of word-instances
down to one or two, as early in the processing as possible.

[[-----
A note about the structure of PLN/OpenCog. The current working
assumption, per discussions with Ben, is to treat the following two
forms as entirely equivalent:

   EvaluationLink
      PredicateNode "PartOfSpeech"
      ListLink
         WordInstance "bark_144"
         PartOfSpeechNode "noun"

That is, the above is (ignoring truth values) *equivalent* to the below:

   PartOfSpeechLink
      WordInstance "bark_144"
      PartOfSpeechNode "noun"

The below uses fewer atoms to represent the information. The above is
more convenient for reasoning, as it is in predicate form.  For reasons
of compactness, the form below shall be used in practice, at this time.
]]


Sentence identification
-----------------------
Multiple alternate parses of a sentence should be associated with
a single sentence. This may be accomplished by declaring a sentence:

   SENTENCE_NODE <- CONCEPT_NODE

   SentenceNode "sentence_22" strength=0.9 confidence=0.6

and associating it with each parse:

   ParseLink
      ParseNode "parse_1" strength=0.9 confidence=0.6
      SentenceNode "sentence_22"

   ParseLink
      ParseNode "parse_2" strength=0.8 confidence=0.5
      SentenceNode "sentence_22" 

and so on. The special node types SentenceNode and ParseNode are used
so as to easily locate new sentences added to the system. After word
processing has been performed, the sentence node is marked to indicate
that processing is complete, like so:

   InheritanceLink
      SentenceNode "sentence_22"
      AnchorNode "#WSD_completed"

The AnchorNode is an ad-hoc node type, functioning essentially
as a special-purpose tag which can be used to tag things like a
SentenceNode.


Word concepts
-------------
This section elaborates on point 5 above.

The goal of word-sense disambiguation is to associate a word instance
with each possible word-sense, and then to determine the most-likely
pairing. The Mihalcea algorithm[1] performs this by creating weighted
links between (word-instance, word-sense) pairs. (These are referred to
as "admissible labels" in [1].)  The goal of this section is
to describe the OpenCog representation of these pairs-of-pairs.
Crudely speaking, these relations form an adjacency matrix for a graph
resembling a Markov chain; and so the representation problem is that of
representing a square matrix in OpenCog.  Several representations are
reviewed.

A naive way of representing word-concept links would be to create links:

   InheritanceLink strength=0.9 confidence=0.1
      WordInstanceNode "bark_144"
      WordSenseNode "bark_sense_23"

The likelihood "bark_sense_23" being the correct sense for the word
"bark" is encoded in the strength of the Inheritance Link.

The pairs-of-pairs relations can then be provided as below. Here, the
example sentence contains the words "tree" and "bark", which can then
mutually reinforce the most likely meaning of each word.

   <!-- the word "tree" occurred in the sentence -->
   CosenseLink strength=0.49 confidence=0.3
      InheritanceLink strength=0.9 confidence=0.6
         WordInstanceNode "tree_99"
         WordSenseNode "tree_sense_12"
     
      InheritanceLink strength=0.9 confidence=0.1
         WordInstanceNode "bark_144"
         WordSenseNode "bark_sense_23"

Note that the CosenseLink is unordered: there is no preferred direction
in the relation between "tree" and "bark".

[[An alternate representation for the above would create a single node
for every (word-instance, word-sense) pair, like so:

   ConceptNode "bark_pair_1"  strength=0.9 confidence=0.1

This node is essentially the graph-dual to the InheritanceLink above;
here, "graph-dual" is to be understood in the graph-theory sense.

The likelihood "bark_sense_23" being the correct sense for the word
"bark" is encoded in the strength of this node. This node is 
associated with the (word-instance, word-sense) pair as follows:

   AssociativeLink
      ConceptNode "bark_pair_1"  strength=0.9 confidence=0.1
      WordSenseNode "bark_sense_23"

   AssociativeLink
      WordInstanceNode "bark_144"
      ConceptNode "bark_pair_1"  strength=0.9 confidence=0.1

The associative links are not weighted. The pairs-of-pairs are then
represented as follows:

   SentenceCooccurrenceLink strength=0.49 confidence=0.3
      ConceptNode "tree_pair_2"  strength=0.9 confidence=0.6
      ConceptNode "bark_pair_1"  strength=0.9 confidence=0.1

It's not clear which representation is "better".]]

The Mihalcea algorithm
----------------------
This section describes how the Mihalcea algorithm[1] is actually
implemented in OpenCog.

-- "Input/output" -- Input is received from the OpenCog server, in
   the form of RelEx-generated OpenCog networks. These are directly
   obtained by class WordSenseProcessor, which interfaces with the
   OpenCog server (and isolates the rest of the code from the OpenCog
   server details). It then passes input to class Mihalcea, which
   provides the top-level class for the rest of the algorithm.

-- "Admissible labels" -- in the text above, and in the code, these are
   referred to as (word-instance, word-sense) pairs. These are added to
   the raw relax input by class MihalceaLabel. They are added to all
   words in all parses.

-- "Build a graph of label dependencies". In the text above, and in the
   code, these are referred to as "pairs of pairs". These are added to
   the by the class MihalceaEdge.  There is an slight difference:
   the original Mihalcea algorithm directs that these be added
   word-by-word in the sentence. Here, instead, these are added between
   words participating between RelEx relations. In particular, because
   RelEx will omit determiners ("a", "the") and other words from
   relations, there won't be edges between these. RelEx will also
   sometimes collapse prepositional phrases into collocations, as well
   as entities, this will also cause a subtle difference in the graph
   structure.

-- "Label dependencies". Currently, only the Leacock-Chodorow word-sense
   similarity measure is implemented. It only provides a measure that
   relates nouns to nouns, and verbs to verbs. It says nothing about
   adjectives or adverbs. It is implemented in class SenseSimilarity.

-- "Score vertices". Scoring is performed using the the PageRank 
   algorithm. The graph is walked in a weighted random fashion, 
   with the likelihood of taking an edge proportional to the edge's
   weight. The score of each vertex is then adjusted using a weighted
   page-rank sum. The algo is implemented in class SenseRank.

-- "Label assignment". The highest-scoring word-senses are reported
   by class ReportRank. Currently, it just prints word senses to stdout.
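
To make the edge-building and ranking steps concrete, here is a
minimal, self-contained C++ sketch. It makes simplifying assumptions:
senses are plain integer ids, the similarity function is a stub, and
none of the atomspace machinery appears. The real logic lives in
MihalceaEdge.cc and SenseRank.cc; all names below are illustrative only.

   #include <cstddef>
   #include <cstdio>
   #include <vector>

   // Hypothetical sketch: build weighted edges between the senses of
   // word pairs coupled by a RelEx relation, then score all senses
   // with weighted PageRank.  Senses are plain integer ids here.
   struct Edge { int to; double weight; };

   struct Graph {
       std::vector<std::vector<Edge> > adj;  // adjacency lists, by sense id
       std::vector<double> out_weight;       // total edge weight per sense
       explicit Graph(size_t n) : adj(n), out_weight(n, 0.0) {}

       void add_edge(int a, int b, double w) {
           // The co-sense relation is unordered, so add both directions.
           adj[a].push_back({b, w});
           adj[b].push_back({a, w});
           out_weight[a] += w;
           out_weight[b] += w;
       }
   };

   // Stand-in for a real word-sense similarity measure (e.g. LCH).
   double similarity(int, int) { return 0.5; }

   // Add an edge between every sense pairing of two related words.
   void connect_words(Graph& g, const std::vector<int>& senses_a,
                      const std::vector<int>& senses_b) {
       for (int sa : senses_a)
           for (int sb : senses_b) {
               double w = similarity(sa, sb);
               if (w > 0.0) g.add_edge(sa, sb, w);
           }
   }

   // Weighted PageRank: iterate towards the stationary score vector.
   std::vector<double> sense_rank(const Graph& g, double damping = 0.85,
                                  int iters = 30) {
       size_t n = g.adj.size();
       std::vector<double> score(n, 1.0), next(n);
       for (int it = 0; it < iters; it++) {
           for (size_t i = 0; i < n; i++) {
               double sum = 0.0;
               // Edges are symmetric, so each neighbour is also a
               // predecessor; normalize by its total outgoing weight.
               for (const Edge& e : g.adj[i])
                   sum += e.weight / g.out_weight[e.to] * score[e.to];
               next[i] = (1.0 - damping) + damping * sum;
           }
           score.swap(next);
       }
       return score;
   }

   int main() {
       Graph g(4);                      // two words, two senses each
       connect_words(g, {0, 1}, {2, 3});
       std::vector<double> s = sense_rank(g);
       for (size_t i = 0; i < s.size(); i++)
           std::printf("sense %zu: score %f\n", i, s[i]);
       return 0;
   }

Because the co-sense links are unordered, a vertex's in-set and out-set
coincide, which is what lets the inner loop treat the adjacency list as
the set of predecessors.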

Markov Finer Points
-------------------
Note that the eigenvector that is solved for has *all* of the various
word senses listed as its components. If the eigenvector is normalized
to unit length, then the sum of the probabilities for any one given word
will *not* total to one! 

An alternative to propagating on the graph might instead be to just link
to LAPACK, and use one of the eigenvector routines. This might be faster.

Edges connecting senses use SimpleTruthValues; the mean/strength
holds the strength of the connection.

SenseLinks (between word-instances and wordnet-senses) use
CountTruthValues.  The initial strength is set to 1.0 (true) and the
confidence to 0.0 (no confidence). The algo is run, manipulating the
count.


XXXX-- the following para describes something that should be, but is
not currently done: XXX
After the algo has been run, the results are "renormalized", with the
confidence value on the sense-link changed to reflect the result.
The resulting confidence can be understood to be a probability: by
summing over all possible sense assignments for a word-instance, the
total confidence will sum to one.
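
Since this renormalization is not yet implemented, the following is
only a minimal standalone sketch of what it might look like; the
SenseLink structure and its field names are made up for illustration.

   #include <string>
   #include <vector>

   // Hypothetical sketch: convert the raw per-sense counts accumulated
   // for one word-instance into probabilities summing to one, storing
   // the result as the sense-link confidence.
   struct SenseLink { std::string sense; double count; double confidence; };

   void renormalize(std::vector<SenseLink>& links_of_one_word_instance) {
       double total = 0.0;
       for (const SenseLink& l : links_of_one_word_instance)
           total += l.count;
       if (total <= 0.0) return;  // no evidence; leave the links untouched
       for (SenseLink& l : links_of_one_word_instance)
           l.confidence = l.count / total;  // now sums to one across senses
   }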


Preliminary Results
-------------------
I've only done some basic testing so far; the results are "OK". Mihalcea
herself claims about 50% accuracy, which is said to be more-or-less
state-of-the-art.  However, based on even this small bit of testing, the
reasons for both the successes and the failures of the Mihalcea algorithm
are evident:  For example, consider the sentence "I cut my finger on the
tree bark".  The words in the noun-modifier phrase "tree bark" strongly
reinforce one another, and thus rank the correct senses of both "tree"
and "bark" quite highly.  On the other hand, "cut" and "finger" and "me"
pretty much have nothing to do with tree-bark, and serve only to inject
noise and confusion into the sense ranking.

General Thoughts
----------------
The Rada Mihalcea algorithm is more-or-less just a Markov model, where
the states of the system are word-senses, and the word-sense similarity
metrics are the state transition probabilities (the conditional
probabilities).  Because wordnet is explicitly used, the set of possible
word senses is overt and visible. If wordnet were not being used, and
word-senses were a priori unknown, the algorithm would have to be
replaced by a hidden Markov model.

The above analogy suggests the correct way to generalize the algorithm
for anaphora resolution: create a hidden Markov model that consolidates
different word meanings in different senses, as referring to the same
object.

Correlations and Probabilities
------------------------------
The correct way to understand sense-similarity measures is to interpret
them as correlations between pairs of senses. Given a correlation function,
the underlying probability distribution can be computed, and then, from
this, the conditional probability of one word sense, given another word 
sense, can be computed. Properly speaking, the edge weights used in
the Mihalcea algorithm should be these conditional probabilities, and not
the rescaled word-sense similarity correlations.  However, for now,
rescaling is approximately correct, and should be enough.
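
To spell out the identity involved (elementary probability, nothing
specific to this code): the conditional probability of sense s_i given
sense s_j is

   P(s_i | s_j) = P(s_i, s_j) / P(s_j)

so recovering conditional probabilities from a similarity score also
requires an estimate of the marginal P(s_j) of each sense; a uniform
rescaling simply ignores this marginal.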


Algorithm Modifications
-----------------------
This section describes how the core Mihalcea algorithm has been
modified or extended. The goal of the modifications is to improve the
results. Obvious "raw material" for this is the incorporation of the
sentence structure, obtained from parsing, into the algorithm.

Noun-modifier Strengthening
---------------------------
The parser identifies noun-modifier phrases, such as "tree bark" or
"aquatic mammal". The senses chosen for both nouns in the noun-modifier
phrase should be in strong agreement with each other. There are several
ways in which this might be accomplished.
-- Strengthen all similarity links between all senses of these two 
   words.  This should have the effect of shielding these two words
   from outside influences.
-- Weaken all links to the modifying noun, or strengthen all links to
   the modified noun.

The first approach is implemented in class NNAdjust. Preliminary, 
seat-of-the-pants experiments seem to show that this makes little or
no difference.
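
As a rough sketch of the first approach (reusing the hypothetical Graph
and Edge types from the ranking sketch above; the real logic is in
NNAdjust.cc, and the strengthening factor here is invented):

   // Scale up the weight of every edge joining a sense of the modifier
   // to a sense of the modified noun.  The factor 1.5 is illustrative.
   void strengthen_pair(Graph& g, const std::vector<int>& modifier_senses,
                        const std::vector<int>& noun_senses,
                        double factor = 1.5) {
       auto scale = [&](int from, int to) {
           for (Edge& e : g.adj[from])
               if (e.to == to) {
                   g.out_weight[from] += (factor - 1.0) * e.weight;
                   e.weight *= factor;
               }
       };
       for (int sm : modifier_senses)
           for (int sn : noun_senses) {
               scale(sm, sn);  // both directions: the edges are symmetric
               scale(sn, sm);
           }
   }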

Word-Sense Distance Metrics
---------------------------
A major bottleneck for the algorithm appears to be the computation of 
word-sense commonality measures. Finding the least common subsumer (lcs)
in a single hierarchy is straightforward; thus, the lcs of the hypernym
(is-a) hierarchy is easily found. However, mixing together multiple
hierarchies, e.g. is-a and has-a (meronym) hierarchies leads to an 
exponential explosion of search possibilities, since both the is-a and 
the has-a paths need to be explored at every stage.

The current implementation of the Leacock-Chodorow (lch) similarity metric
explores the is-a hierarchy fully, while taking a limited number n of
has-a steps to explore closely related hierarchies. Good results seem to
be obtained by setting n=1. When n is increased by 1, the cpu time usage
appears to increase by 2^n (right!?) so that the total cpu usage is 2^(n!)
where n! is n-factorial. At least, for small n. Wow.
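
(For reference, the standard Leacock-Chodorow measure is

   sim_lch(s1, s2) = -log( len(s1, s2) / (2 * D) )

where len(s1, s2) is the length of the shortest path between the two
senses in the is-a hierarchy, and D is the maximum depth of the
taxonomy.)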

2:07.67 - 1:50.95 =  17 seconds for the n=0 simple case.
2:20.30 - 1:48.40 =  32 seconds for the n=1 case.
4:02.93 - 1:48.39 = 134 seconds for the n=2 case.
3:26.40 - 1:49.56 =  96 seconds for the n=3 case.

(The above is implemented in SenseSimilarityLCH.h/.cc.  More convenient is
SenseSimilaritySQL.cc, which pulls precomputed sense-similarities out of
an SQL database. Much faster that way.)

Caching Word-Sense Distance Measures
------------------------------------
To avoid the cost of multiple computations of the measure for the same
sense pair, the results of the computation are cached. The cached value
is stored as a similarity link, in the atom space, between the two word
senses. The caching is performed by class SenseCache. For sentences with
multiple parses, the use of caching provides a major performance boost.
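
A minimal standalone sketch of the caching idea follows. The real
SenseCache stores the value in the atomspace, as a link between the two
senses; the std::map and the stub similarity function here are purely
illustrative.

   #include <map>
   #include <string>
   #include <utility>

   // Memoize the expensive pairwise sense-similarity computation.
   class CachedSimilarity {
       std::map<std::pair<std::string, std::string>, double> cache_;
   public:
       double get(std::string a, std::string b) {
           if (b < a) std::swap(a, b);    // the measure is symmetric
           auto it = cache_.find(std::make_pair(a, b));
           if (it != cache_.end()) return it->second;
           double sim = compute(a, b);    // the expensive hierarchy walk
           cache_[std::make_pair(a, b)] = sim;
           return sim;
       }
   private:
       // Placeholder for the actual Leacock-Chodorow computation.
       static double compute(const std::string&, const std::string&) {
           return 0.0;
       }
   };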

clean start:
3294 edges for 36 word pairs (rate=160.967140)
494 edges for 45 word pairs (rate=560.458623)
4179 edges for 90 word pairs (rate=1023.431412)
227 edges for 60 word pairs (rate=172.097597)
24 edges for 36 word pairs (rate=44.365162)
37 edges for 54 word pairs (rate=47.401926)
603 edges for 78 word pairs (rate=55.337474)
182 edges for 117 word pairs (rate=62.594903)
32 edges for 21 word pairs (rate=66.643619)
407 edges for 91 word pairs (rate=102.196214)
35 edges for 56 word pairs (rate=43.196438)
38 edges for 28 word pairs (rate=106.432441)
311 edges for 72 word pairs (rate=351.375108)
2139 edges for 171 word pairs (rate=110.845516)
551 edges for 152 word pairs (rate=108.684289)
1872 edges for 105 word pairs (rate=106.707945)
5009 edges for 285 word pairs (rate=229.150653)
2437 edges for 90 word pairs (rate=584.059233)
2290 edges for 90 word pairs (rate=190.564907)
1820 edges for 190 word pairs (rate=141.599178)
4078 edges for 300 word pairs (rate=279.677327)
5366 edges for 180 word pairs (rate=255.627497)
3777 edges for 135 word pairs (rate=386.990560)

hour later:
7983 edges for 143 word pairs (rate=177.777639)
3566 edges for 253 word pairs (rate=119.823203)
3494 edges for 78 word pairs (rate=150.161062)
4288 edges for 156 word pairs (rate=301.568366)
2244 edges for 78 word pairs (rate=221.805891)
394 edges for 56 word pairs (rate=648.082810)

two hours later:
1300 edges for 95 word pairs (rate=41.524402)
5217 edges for 171 word pairs (rate=64.100957)
587 edges for 78 word pairs (rate=77.836852)
4772 edges for 168 word pairs (rate=261.459252)
1513 edges for 66 word pairs (rate=143.758023)
2764 edges for 126 word pairs (rate=90.000027)
2983 edges for 99 word pairs (rate=103.497465)
1238 edges for 66 word pairs (rate=104.637787)
476 edges for 91 word pairs (rate=37.023434)
6492 edges for 253 word pairs (rate=94.560790)
1453 edges for 135 word pairs (rate=196.339518)
2119 edges for 231 word pairs (rate=57.912276)
1824 edges for 144 word pairs (rate=109.158918)
3072 edges for 156 word pairs (rate=62.409254)
2607 edges for 78 word pairs (rate=65.946862)


ten hours later:
377 edges for 28 word pairs (rate=76.804571)
197 edges for 21 word pairs (rate=82.990628)
691 edges for 28 word pairs (rate=58.645770)
1669 edges for 264 word pairs (rate=53.167052)
1779 edges for 171 word pairs (rate=84.579619)
1851 edges for 117 word pairs (rate=111.031172)
1711 edges for 78 word pairs (rate=64.673862)
1159 edges for 120 word pairs (rate=67.343033)
1774 edges for 45 word pairs (rate=105.078328)
7269 edges for 136 word pairs (rate=91.874337)
3206 edges for 182 word pairs (rate=79.418685)
2645 edges for 140 word pairs (rate=88.427971)
1506 edges for 45 word pairs (rate=110.503189)

These numbers suggest that the current "caching" mechanism is a losing
strategy. Looking at the code, it's not hard to see why.

But wait -- the show's not over: after 12 hours:
1104 edges for 99 word pairs (rate=157.804099)
1128 edges for 55 word pairs (rate=79.791207)
2578 edges for 117 word pairs (rate=451.047206)
2312 edges for 195 word pairs (rate=185.579256)
1683 edges for 78 word pairs (rate=202.813602)
2263 edges for 195 word pairs (rate=114.505580)
8368 edges for 312 word pairs (rate=242.270232)
3498 edges for 216 word pairs (rate=726.549312)
7389 edges for 276 word pairs (rate=136.556129)
1734 edges for 99 word pairs (rate=2574.969632)
1600 edges for 77 word pairs (rate=940.313595)
729 edges for 55 word pairs (rate=378.761704)
300 edges for 21 word pairs (rate=340.650119)

wtf ... these rates are higher than ever ... a fantastic speedup!

after 24 hours:
258 edges for 56 word pairs (rate=79.132317)
1405 edges for 260 word pairs (rate=58.772474)
2362 edges for 190 word pairs (rate=66.941334)
818 edges for 182 word pairs (rate=225.214808)
8480 edges for 325 word pairs (rate=91.553765)
922 edges for 72 word pairs (rate=929.476711)
465 edges for 66 word pairs (rate=268.935519)


Ideas/TODO
----------
-- Do not create sense links to punctuation (as well as determiners,
   etc.) 
-- Remove small graphs (e.g. two senses attached to each other, but
   nothing else).
-- When solving network, do not follow links to a sense which has only
   one edge.

-- Build test framework 

-- Need to fix main dispatcher; right now, WSD is called 100 times
   a second, no matter what.

-- Processed sentences need to be discarded when they are old enough.
   Can't afford to clutter RAM with lots of sentences. Unfortunately,
   delete performance is terrible.

-- use framenet for sense-similarity scoring!?

-- do some common-sense reasoning from the wordnet ontology.

-- The role of relex is utterly unclear: how should the dependency
   relations spit out by relex be used to guide WSD?

   Perhaps there should only be relation edges between those
   words that are involved in relex relations? Yes ... this
   is probably what should be done, but the precise argument
   as to why is unclear ... probably need to prove this experimentally.

   Another possibility is to use relex relations to create
   direct edges between word senses, and add much, much weaker
   links between non-relations.

   The experiment could be done by using floating-point
   adjustments.

-- Per ben:
   Use the following similarity measure:
   "Probability of _obj(sense 1 of cut, sense 1 of finger)"
   "Probability of _obj(sense 1 of cut, sense 2 of finger)"
   based on a large corpus.

-- Use word frequency scores to identify the "important" words in a 
   sentence. These "important words" should get greater weight for WSD,
   as opposed to lower-importance words. This is similar to the idea of
   distinguishing between "function words" and "content words" --
   function words having only a grammatical function, whereas the
   "content words" hold most of the semantic content. The idea here
   is that certain "content words" hold "more content" than others,
   based on their information entropy (information content).

-- Use mutual-information scores to tie together word pairs more
   tightly ... !? The tightest word-pairs are likely to be idioms,
   and so these should have senses of their own.

-- where does parse-ranking fit into the scheme of things?
   Should total up the high scores for different parses. This
   assumes that different parses lead to different graphs.

-- Use the wordnet sense-occurrence counts to add a preference for
   the most likely meanings?  Can these counts be used to
   implement one of the information-content measures, e.g. jcn?
   The problem here is that the wordnet sense-occurrence counts
   are small numbers, and are zero for many senses.

-- stumped: how to use opencyc data, since there's no concordance
   with wordnet data?

-- Ben says:
   Google "dekang lin path minipar" for the papers on his "syntactic
   path dependency" paper.  He makes an argument that by parsing and
   then analyzing paths in the parse trees, one can handle a variety
   of tasks varying from semantic disambiguation to question-answering.
   (Seem to be two papers: DIRT[2], and the question answering one ...)
	DIRT[2] seems to be aimed at the automatic discovery of word senses,
   rather than in sense-disambiguation per-se.

-- integrate anaphora resolution.

   More generally, work on reference resolution in general.

-- The choice of the word-similarity metric generated by lch seems
   somewhat "arbitrary" -- it's a measure, but it's not at all clear that
   it's any good for the Markov-chain-like Mihalcea algorithm. For
   example, perhaps it gives too strong a weight to word senses that
   are only distantly related. It might be better to generate candidate
   functions taking as input the distances to the least-common
   subsumer, and the distance of the least-common subsumer from the top
   of the hierarchy. One might then use some optimization algo to
   find the best-possible function of these inputs: the one that gives
   the highest word-sense scores.
   


Domains of discourse (aka word-sense frequency counts)
------------------------------------------------------
Some word senses are more common than others. For example, archaic word
senses tend to be rarely encountered ... except when one is reading an
archaic text. Similarly, scientific and technical texts tend to employ
certain common words with certain very uncommon meanings. Cool, hip,
edgy writing will also use words in uncommon senses, or in unusual
constructions.

An anecdote to illustrate the idea:

  "When I was young and learning French, I found it easy to read and
   understand "Le Monde", and was utterly perplexed by "Paris Match",
   even though Paris Match sentences were considerably shorter, and
   seemed to use a simpler vocabulary -- alas, for me, the Paris Match
   lingo was heavily overladen with French pop culture references and
   current events, which I was completely unaware of.  I would sit
   there, with Paris Match, and look up every word in the dictionary,
   even though I already knew the word, because I was searching for
   some meaning... which, of course was not to be found in the
   dictionary. By contrast, Le Monde was not trying to be hip, edgy
   or entertaining, it was trying to be precise and explicit, and in
   this, it succeeds.  The moral of the story is that word-sense
   frequency counts for Le Monde and Paris Match would be radically
   different -- and so, when I read a "typical" French text, say,
   Le Canard Enchaîné (a satirical political paper), or Alexandre
   Dumas (The Three Musketeers -- eminently readable, but clearly 19th
   century French), which should I use?"

Wordnet does have the notion of discourse domains.

XXX ToBeResolved: How to employ context-sensitive word-frequency counts
is unclear.


References
----------
[1] Rada Mihalcea, "Unsupervised Large-Vocabulary Word Sense
    Disambiguation with Graph-based Algorithms for Sequence Data
    Labeling", 2005

[2] Dekang Lin, Patrick Pantel, "DIRT: Discovering Inference Rules from 
    Text". In Proceedings of the ACM SIGKDD Conference on Knowledge 
    Discovery and Data Mining, 2001.