Capitalization is a kind-of pseudo-morphology #690

Open
linas opened this issue Mar 7, 2018 · 47 comments

Comments


linas commented Mar 7, 2018

This is a meta-issue, a design-change request: treat capitalization (and possibly other things) as a kind of pseudo-morphology. See issue #42 for context. The general issue is about reformulating tokenization (and related issues) into a collection of rules (that could be encoded in a file). The example here is capitalization.

For example, capitalization: we have a rule (coded in C) that if a word is at the beginning of a sentence, then we should also search for a lower-case version of it; likewise if the word comes after a semicolon, or after a quote (Lincoln said, "Four-score and seven..."). There is an obvious solution: design a new file, and place semicolons, quotes and LEFT-WALL as markers for capitalization. Maybe this is something like a "new kind of affix rule"? All of the other affix rules state "if there is a certain sequence, then insert whitespace", while this new rule states "if there is a certain sequence, then also look for the downcased form".
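
Something like the following, say, written in the style of the existing affix-class definitions (the class name CAPITALIZATION-TRIGGERS and the token list are a made-up sketch, purely for illustration; nothing like this exists yet):

% Hypothetical sketch: tokens after which the following word should also
% be looked up in its downcased form.
";" "“" LEFT-WALL: CAPITALIZATION-TRIGGERS+;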

In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased or not. The system is blind to uppercasing: it just sees two different UTF8 strings that happen to fall into the same grammatical class.

To "solve" this problem, one can imagine three steps. First, a "morphological" analysis: given a certain grammatical class, compare pairs to strings to see if they have a common substring - for example, if the whole string matches, except for the first letter. This would imply that some words have a "morphology", where the first letter can be either one of two, while the rest of the word is the same.

The second step is to realize that there is a meta-morphology-rule, which states that there are many words, all of which have the property that they can begin with either one of two different initial letters. The correct choice of the initial letter depends on whether the preceding token was a semicolon, a quote, or the left-wall.

The third step is to realize that the meta-morphology-rule can be factored into approximately 26 different classes. That is, in principle, there are 52·51/2 = 1326 possible sets of two (initial) letters. Of these, only 26 are seen: {A, a}, {B, b}, {C, c} ... and one never ever sees {P, Q} or {Z, a}.

As long as we write C code, and know in advance that we are dealing with capital letters, we can use the pre-defined POSIX locales for capitalization. I'm trying to take two or three steps backwards here. One is to treat capitalization as a kind of morphology, just like any other kind of morphology. The second is to create morphology classes - the pseudo-morpheme A is substitutable only by the pseudo-morpheme a. The third is that all of this should be somehow rule-driven and "generic" in some way.

The meta-meta-meta issue is that I want to expand the framework beyond just language written as UTF8 strings, to language more generally: language with associated intonation, affect, facial expressions, or "language" from other domains (biology, spatial relations, etc.)


ampli commented Mar 8, 2018

I once posted a proposal to handle capitalization as a special kind of morpheme:
https://groups.google.com/d/msg/link-grammar/hmK5gjXYWbk/6FC_x_43xpMJ

Note that the current tokenizer already implements that (untested, as the dict part of this idea has never been implemented).

This seems to me a simple and natural way to do that, still using pure link-grammar technique.

In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased or not. The system is blind to uppercasing: it just sees two different UTF8 strings that happen to fall into the same grammatical class.

The system finds that these words are very similar, but with a feature that differs. I propose to encode this differing feature as a morpheme. All the capitalization rules can then (in principle) be inferred automatically as link-grammar rules (and until they are inferred automatically, they can be straightforwardly hand-coded).

compare pairs of strings to see if they have a common substring

What I essentially propose is to regard substrings of words as only a special case of morphemes.
The general case is that every feature of a word may be encoded as a morpheme (i.e., an entity carrying disjuncts). It seems you agree with me, as you mention voice features of a word.
So I have never understood why you seemed to reject my proposal.

BTW, in Hebrew, morphemes are not just substrings of words. Morphemes that are "templates" are common (and they really are called "morphemes"). So I didn't invent anything too new in my proposal to treat capitalization (and the vowel-ness of the start of a word, for a/an) as morphemes.


ampli commented Mar 8, 2018

In the language learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased or not. The system is blind to uppercasing: it just sees two different UTF8 strings that happen to fall into the same grammatical class.

Letter case is a particular case of a more general problem.
For example, in Hebrew some letters change their form when they are found at the end of a word, and hence have a different code point, even though they are essentially the "same letter" from a morphological standpoint.

ampli mentioned this issue Apr 23, 2018

ampli commented Apr 28, 2018

I'm finally trying to implement handling of capitalized words by the dict.
I encountered a problem: how to implement a downcasing rule in a general way.
In English it is simple, and is hard-coded using tolower().
However, in French and maybe (I don't know) other languages it may be more complex.

Possible solution:
A hard-coded tolower(), then a conversion according to affix-file definitions in order to generate alternatives.
E.g.:
E é è ê ë: TOLOWER+;
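
For illustration (a made-up, untested example): with such a TOLOWER line, a capitalized word starting with E would get one downcasing alternative per listed letter, in addition to the plain tolower() one, and each alternative would then be looked up in the dict:

% Input token:            Ecole
% Generated alternatives: ecole  école  ècole  êcole  ëcole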


ampli commented May 3, 2018

I implemented the main part of the pseudo-morphology capitalization.
There were several options for how and where to put the driving definition, and I chose to put it in 4.0.regex, in a way that can be extended to phonology or other uses (of course everything is subject to drastic changes...):

% Treat a word match as morpheme-like feature, by prepending a token (the
% regex name) before a matching word.
%
% The following regex is used for prepending a "capitalization mark" before
% the matching word.
% The 'f' flag signifies prepending the regex name as a token before the word.
% The 'c' flag signifies to always continue to match the rest of the regexes.
% The 'l' flag signifies that the first letter should be converted to lowercase
% using the TOLOWER affix class, with a tolower() fallback.
<CAP-MARK>:     /^[[:upper:]][^'’]*[^[:punct:]]$/fcl

% The rest of this file contains regular expressions that are used to match
% tokens not found in the dictionary. Each regex is given a name which
% ...

I also made the needed changes, mainly in tokenize.c, regex-morph.c and read-regex.c.

The easiest part was adding support for that in the dict.
The new code doesn't check at all for common-entities - instead, this is supported by the dict:

<marker-common-entity>: XXXGIVEN- & (<CAPITALIZED-WORDS> or <PL-CAPITALIZED-WORDS>);
<CAP-MARK>: CW- or XXXGIVEN+;
LEFT-WALL:
  {CW+} & (...);
<colon>: {CW+} & (...);
% etc.

Since the capitalized word is downcased if it is in the dict, it appears downcased even if it is a common entity. This means that the code will need to handle restoring the capitalization (like the old code).

However, the real (very) hard work has still not been done:

  1. Sane-morphism: if the feature-mark (<CAP-MARK> in this case) and/or the word after it are null, they should together be considered a single null word. I still don't know how to do that. It should also work with islands_ok=1.
  2. do_count() null counting: the <CAP-MARK> and the word after it are to be considered one unit for the null count (else the null count may be bogus).
  3. After sane-morphism, there is a need to remove the feature-marks and optionally modify the display for null words.
    For example, if <CAP-MARK> is a null word but the word after it has a linkage, the word display should be converted to [word] and its links omitted.

Since these 3 tasks are really hard to implement, I first thought to combine the disjuncts of the feature token (<CAP-MARK>) with those of the word after it. This way there is no need for tasks 1-3 above.
However, the word may then need 2 links to its preceding word (for example, both Wd- and CW- to LEFT-WALL), which is currently not supported (it may be interesting to add such support).

Current linkage result examples:

                 +------------------QUc------------------+
                 +----------->WV----------->+            |
    +---->WV---->+-------->Wd---------+     +---Ost--+   |
    +->Wd--+--Ss-+-QUd+--CW--+        +-Ss*b+  +Ds**c+   |
    |      |     |    |      |        |     |  |     |   |
LEFT-WALL he said.q-d " <CAP-MARK> this.p is.v a  test.n "


    +------------------Xp------------------+----------------Xp----------------+
    +----------->WV---------->+            +--------->WV-------->+            |
    +-------->Wd--------+     +---Ost--+   +------>Wd------+     +---Osm--+   |
    +----CW----+        +-Ss*b+  +Ds**c+   +--CW--+        +-Ss*b+  +Ds**c+   |
    |          |        |     |  |     |   |      |        |     |  |     |   |
LEFT-WALL <CAP-MARK> this.p is.v a  test.n . <CAP-MARK> this.p is.v a  test.n .


    +--------->WV-------->+
    +------->Wd-------+   +------PP-----+--------MVp--------+----------Js----------+
    +----CW----+      +-Ss+     +---E---+-MVp-+--Js--+      |        +---XXXGIVEN--+
    |          |      |   |     |       |     |      |      |        |             |
LEFT-WALL <CAP-MARK> he 's.v usually gone.v to.r Boston.m for.p <CAP-MARK> thanksgiving.n-u


    +--------->WV--------->+                                      +---------------Bs--------------+
    +------->Wd-------+    +---------IV------->+                  |                +------RS------+           +----Js---+
    +----CW----+      +-Ss-+--TOf-+-Ift-+--PPf-+--------Ost-------+--------R-------+     +----E---+--Pa--+-MVp+   +Ds**v+
    |          |      |    |      |     |      |                  |                |     |        |      |    |   |     |
LEFT-WALL <CAP-MARK> it seems.v to.r have.v been.v Einstein[!<CAPITALIZED-WORDS>] who first.a came.v-d up.a with the idea.n


linas commented May 4, 2018

I like the <CAP-MARK> appearing in the parse tree. That is certainly an interesting idea!

But is it superfluous? In almost all cases, if a word attaches with Wd, then it should have been capitalized...(or already was capitalized).

Linking CAP-MARK to the left-wall seems wrong; it really should be linking to the word that is (should be) capitalized. Or better yet, link to both: to the wall, and also to the word that is being capitalized, with both linkages being mandatory, not optional, thus constraining the parses. That makes it similar to the PH link, which indicates a phonetic constraint.


linas commented May 4, 2018

This seems wrong:

<CAP-MARK>:     /^[[:upper:]][^'’]*[^[:punct:]]$/fcl

because it violates your earlier guess about French: the lower-case version of E could have been any one of the letters é è ê ë, and all of them should be tried during dictionary lookup. Thus, perhaps the regexes should be of the form:

<REGNAME>: s/match-expr/rewrite-expr/flags;

with 'c' being a possible flag. The 'f' and the 'l' flags would not be needed. If the rewrite-expr has whitespace, it should be split on that whitespace ...


linas commented May 4, 2018

Allowing two links between words -- I'm pretty sure this would massively break the existing dicts.

Issues 1 and 2 for the null words - I'm not sure those need to be solved. If <CAP-MARK> ends up being a null word, that just means that someone capitalized a word that they should not have capitalized.


ampli commented May 4, 2018

But is it superfluous? In almost all cases, if a word attaches with Wd, then it should have been capitalized...(or already was capitalized).

I tried to always use the same rule: Provide a CW+ connector on a token after which there is a capitalizable position. But my current implementation is indeed problematic.

Or better yet, link to both: to the wall, and also to the word that is being capitalized, with both linkages being mandatory, not optional, thus constraining the parses.

Of course you are right. I just had a problem figuring out how to do it.

One of the ways to start to solve that is having something like:
<CAP-MARK>: (CW- & LLD+) or XXXGIVEN+;
where a downcased word (only) will have LLD-.
This means that the code will need to add LLD- to a downcased word.
I think such a mechanism is generally fine, and is required when inserting tokens into the sentence; it may have uses in other cases too. But I have no idea yet how to define a simple rule for that.

Why I think such disjunct rewrites do not contradict the current LG theory:
it is as if we had two versions of each word in the dict: one is the regular lowercase word, and the other is the uppercased one, which has exactly the same sentence usage as the lowercase one but can only appear in capitalizable positions, thus having an additional mandatory connector.
Instead of having these two versions, I just proposed a higher-level approach that programmatically adds the needed connector (of course in a general, not hard-coded, way). This may be needed in cases of "sentence rewrite" and maybe in other cases too.
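
A sketch of that "two versions per word" picture, written out by hand just to make the equivalence concrete (the <normal-usage> macro is only a placeholder, not a real dict entry):

% Hypothetical hand-written equivalent of the programmatic rewrite:
word: <normal-usage>;           % the ordinary lowercase entry
Word: CW- & <normal-usage>;     % same usage, plus a mandatory link to a capitalizable position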


ampli commented May 5, 2018

The 'f' and the 'l' flags would not be needed.

I don't understand how you can do it without an indication like 'f'.
This flag indicates that the regex label should be physically inserted as an additional sentence token.
Or would the rewrite part imply that such an insertion is intended? But in general there may be such rules for other pseudo-morphology uses that don't involve a replacement.

because it violates your earlier guess about French: the lower-case version of E could have been any one of the letters é è ê ë, and all of them should be tried during dictionary lookup.

I cannot see how. I wrote there:

% The 'l' flag signifies that the first letter should be converted to lowercase
% using the TOLOWER affix class, with a tolower() fallback.

So the algorithm is to use a TOLOWER definition if one exists, such as
E é è ê ë: TOLOWER+;
and generate all the alternatives (4 in that case), and also apply tolower() and generate an alternative from it.
(In the quoted comment I used the word "fallback", which is misleading since it is always needed.)

<REGNAME>: s/match-expr/rewrite-expr/flags;

I am not sure I understand what this rewrite-expr would be.
Do you mean to have a few dozen such regexes, of a form like:

...
<REGNAME>: s/B(word)/b\1/c
<REGNAME>: s/E(word)/e\1 é\1 è\1 ê\1 ë\1/c
...

(And then there would be no tolower() in the code, and an alternative would be generated at whitespace or at the end of the replacement part.)


ampli commented May 5, 2018

Allowing two links between words -- I'm pretty sure this would massively break the existing dicts.

It will be interesting to see where and how...

I always thought multiple links should be allowed.
Otherwise, what can generally be done (in the current scheme) for a token that sometimes really has the meaning of two words, yet cannot even be morphology-split?
(Particular cases of such tokens can be solved with pseudo-morphology, as I am trying to implement now. I encountered this problem in Hebrew with a few one-letter tokens.)


linas commented May 5, 2018

I'm trying to look at this issue as a special case of, or demo for, issue #700. Thus,

<REGNAME>: s/match-expr/rewrite-expr/flags;

becomes viable for all sorts of tokenization problems, and not just for capitalization. So, yes, there would be a few dozen rules of the form

...
<DOWNCASE>: s/B(word)/b\1/c
<DOWNCASE>: s/E(word)/e\1 é\1 è\1 ê\1 ë\1/c
...

and no tolower() in the code. The problem with E é è ê ë: TOLOWER+; is that it's only good for capitalization, and is not sufficiently generic to solve other tokenization problems (such as splitting units away from numeric expressions). And so, if you consider splitting away units from numbers, this suggests

<UNITS>: s/([0-9]+)inches/\1 inches/c

which does nothing except insert whitespace between the number and the run-on unit. With the power to insert spaces (or other text), the 'f' flag is no longer needed.


ampli commented May 5, 2018

For splitting units, and any other continuous morphology, I proposed a better way: defining (in the dict) token boundaries (which side of a token must have whitespace), either by marks on the tokens (similar to the "infixmarks" of today) or by link types. It is not hard to write a tokenizer that uses only this info.
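
One possible shape for such boundary marks, reusing the existing infix-mark convention purely as an illustration (the exact notation, and the <unit-disjuncts> placeholder, are made up here):

% Hypothetical: a leading "=" says this token may directly abut the end
% of the preceding token, with no whitespace required between them.
=inches.u =feet.u =cm.u: <unit-disjuncts>;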


linas commented May 5, 2018

programmatically add the needed connector

Yes. I'm not sure how. Ideally, one could say something like "if regex <BLAH> ran, then insert A+ on the first token and insert B- on the second token". Since connectors are order-dependent, one should insert the A+ to the left of, or maybe to the right of all the other + connectors.

In essence, I'm thinking that there is some better way of doing what is done by INFIXMARK+, SANEMORPHISM+ and STEMSUBSCR+.

This is a non-trivial problem: the current Russian dictionaries do seem to be fairly direct and well-structured; I don't want to lose the current simplicity there. What we currently do is not bad. Yet somehow, the current tokenization mechanism seems inadequate for more complex morphology. In the back of my mind, I'm even thinking of speech-to-text and other discernment problems.

There are other possibilities: for example, instead of a regex with programmatically-added connectors, imagine a neural net component, which took in sentences and spat out tokens and connectors. Viz, it would do exactly the same thing as a regex+program-connectors component, except that maybe it assigns likelihood values to each possible tokenization. The only reason I'm thinking "neural net" is that it would be trainable -- I do NOT want to hand-write big, complex regex+prog-connector dictionaries. Or perhaps there is a way to auto-convert the NN to a regex+connector dictionary....


linas commented May 5, 2018

For splitting units and any other continuous morphology I proposed a better way

Where is this? One of the other issues? I'm sorry, I'm having trouble reading and responding to everything.


ampli commented May 5, 2018

Yes. I'm not sure how. Ideally, one could say something like "if regex ran, then insert A+ on the first token and insert B- on the second token". Since connectors are order-dependent, one should insert the A+ to the left of, or maybe to the right of all the other + connectors.

It can be done in this way or in several other ways.
It doesn't really matter which way, since we can change it if desired.
But if we would like to explore that direction, we need to start with something.

INFIXMARK+,SANEMORPHISM+ and STEMSUBSCR+

INFIXMARK can be replaced by a link type (or can remain);
SANEMORPHISM is a check for code problems, and can even be extended;
STEMSUBSCR is just a subscript that ends with an INFIXMARK, and serves both purposes (saying that the LHS end should not end with whitespace, and annotating the word with a .= subscript).

This is a non-trivial problem: the current Russian dictionaries do seem to be fairly direct and well-structured; I don't want to loose the current simplicity, there. What we currently do is not bad. Yet somehow, the current tokenization mechanism seems inadequate for more complex morphology. In the back of my mind, I'm even thinking of speech-to-text and other discernment problems.

Because I got the impression that you would not like to introduce changes to the way LG is currently done, including the format of the dictionaries, all my latest proposals try to be compatible with the way things are currently done, and are only additions that encode things in the dictionary instead of hard-coding them as is done now (which has the benefit of better supporting English, and also of supporting more languages).

For handling arbitrary morphology, I proposed a different token-lookup mechanism (still using a dictionary syntax like the current one). It is like input preprocessing, but doing it internally makes it easy to generate alternatives.
The idea was inspired by the current "word" files: logically, a word is looked up in a file to yield a single token (the file name), and this token is looked up in the dictionary (physically this is done by incorporating the words of the file while reading the dict, but that doesn't matter).
Similarly, I proposed looking up a word in a file, but the result will not generally be a single token; in the general case it is a set of arrays of tokens (where each element of the set is an alternative). These output tokens will generally not be substrings of the input token.
Hebrew seems "easy" to handle that way (such tables already exist for the whole language). I wrote about that idea in my early letters to you and mentioned it in the LG group.

There are other possibilities: for example, instead of a regex with programmatically-added connectors, imagine a neural net component, which took in sentences and spat out tokens and connectors. Viz, it would do exactly the same thing as a regex+program-connectors component, except that maybe it assigns likelihood values to each possible tokenization. The only reason I'm thinking "neural net" is that it would be trainable -- I do NOT want to hand-write big, complex regex+prog-connector dictionaries. Or perhaps there is a way to auto-convert the NN to a regex+connector dictionary....

There is indeed much to discuss on automatic language learning, and I have several questions regarding that. I will try to find the correct place to raise them (they are too technical/specific, and the opencog discussion group is not the proper place for them).


ampli commented May 5, 2018

Where is this? one of the other issues? I'm sorry, I'm having trouble reading and responding to everything.

In the LG group (the "zero knowledge tokenizer"), in other issues here and even in the code itself (the "regex tokenizer" demo).

I will try to summarize here.
Say the token word boundaries are marked in the dict for all the tokens (with "infixmarks").
Say we have an input word ABCDEF.

The algo starts with the first letter and looks it up as A=. If it is not in the dict, it looks up AB=, etc.
However, if it is in the dict, it then looks up =B=, and so on, in a backtracking fashion.
Each time it finishes looking up a complete word, it emits the so-obtained tokens as an alternative, and then backtracks and continues. This way it is guaranteed to emit all the possible alternatives for each word.
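
A worked trace of that backtracking, assuming a toy dict that happens to contain the tokens A=, ABC=, =B=, =CDEF and =DEF (a made-up example):

ABCDEF
  A=     found -> =B= found -> =C=, =CD=, =CDE= not found; =CDEF found -> emit: A= =B= =CDEF
         backtrack: =BC=, =BCD=, =BCDE=, =BCDEF not found
  AB=    not found
  ABC=   found -> =D=, =DE= not found; =DEF found -> emit: ABC= =DEF
  ABCD=, ABCDE=, ABCDEF not found -> done (two alternatives emitted)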

The regex-tokenizer demo uses the PCRE engine to do this backtracking (however, I also wrote C code to do it directly). It is able to tokenize Russian, and is even able to tokenize, in this way, sentences without whitespace between some or all of their words, if it is not directed to look for whitespace at word boundaries.

BTW, for the current dicts it is possible to infer the infix marks for all the tokens in LPUNC/MPUNC/RPUNC, so the above-mentioned algo can also do punctuation stripping.
With proper marking, it would also do unit stripping (minimally, the numbers which are linked to units need to be so marked).


ampli commented May 5, 2018

However, if it is in the dict, it then looks up =B=, and so on in a backtracking fashion.

Note this fix in my previous post.


linas commented May 5, 2018

I got the impression you would not like to introduce changes to the way LG is currently done

We can make changes in the dictionary format. I'm mostly being conservative, because I want to avoid ending up with a big confusing mess of non-orthogonal functions that have an incomplete or ugly theoretical underpinning. If something is a good idea, we should do it. I just don't want a horror show at the end of it.

The idea was inspired by

I didn't quite understand the details of what you said there. I think I have the general flavor of it, just not the details.

output tokens will generally not be substrings of input token.

Yes, that was also the intent of the regex rewrite; it was a mechanism flexible enough to do this.

Hebrew seems "easy" to handle that way (there are ready such tables for the whole language).

Yes, this seems quite reasonable.

I wrote about that idea in my early letters to you and mentioned it in the LG group.

Sorry, there was a lot of confusion in my mind back then. Perhaps slightly less, now. There was no clear concept of the word-graph back then; now that there is, the conceptual foundation is stronger; it allows us to design better pre-processors. The regex idea mentioned here is a kind of warm-up to the finite-state transducer morphology analyzers. I'm thinking of the tokenizer as some kind of "transducer" from flat strings into word-graphs. The primary questions are "what is the best syntax for encoding what that transducer does?" and "what is the best way of describing the algorithm that the transducer uses?" Viz, currently, the tokenizer is a semi ad-hoc algorithm, driven by 4.0.affix. Can we give it a more elegant description, and a more powerful algorithm? Hebrew morphology tables and/or FSTs are not a bad place to start, now.

automatic language learning, and I have several questions regarding that.

So do I.


linas commented May 5, 2018

"zero knowledge tokenizer"
I will try to summarize here.

  • There was some concern that it might be too slow, or generate too many possibilities. Is this still a concern?
  • Can it be used to do capitalization/down-casing?
  • Can it be used to programmatically add connectors, when needed?
  • Can it easily make use of the Hebrew token tables? (and can you point me at an example of these token tables, on-line or in some paper, or elsewhere?)


ampli commented May 5, 2018

I didn't quite understtand the details of what you said there. I think I have the general flavor of it, just not the details.

I will try to construct a particular example.

Regarding tokenizing, I didn't understand why an FST is any better than tokenizing by simple table lookup.
Is it to save the possible disk space of an input table? (But the tables driving an FST may take comparable space or even more, as all of this information needs to be supplied to it.)
Anyway, there can be one benefit of an FST over an input table: the ability to guess the morphology of unknown words (but the FST packages I inspected had problems working on unknown strings).


linas commented May 5, 2018

FST

I am using this term very loosely, as a synonym for a regex s/from-string/to-string/ because, for the most part, that is all they are. The various different kinds add various different bells & whistles and improvements, that we may or may not need. You looked at them; some looked nicer than others, some had a bad license, all of them were impossible to integrate into LG.

So, now, I am trying to think of the transformation of an input string into a token flow-graph (I think maybe that is what we should call it, instead of word-graph -- it shows how an input string can be transformed into several different, sequential flows of tokens). So, your word-graph tokenizer is actually a "transducer" from a character string to a flow-graph. Along the way, you insert whitespace, change spelling, downcase, and transform the input in other, perhaps more radical ways.

What is the correct way to specify the operation of this transducer? Currently we use 4.0.affix. I was trying to say that maybe we should allow <REGNAME>: s/match-expr/rewrite-expr/flags; as another way of specifying what it does (but maybe this is a bad idea). I then infelicitously said FST, but perhaps that is also a bad idea: if it does not offer any space savings or performance savings or conceptual advantage, we should not use it. From the language learning side, it is almost certainly going to be easier to generate lookup tables, than it is to automatically learn regexes. I'm very open to suggestions right now.


ampli commented May 5, 2018

There was some concern that it might be too slow, or generate too many possibilities. Is this still a concern?

It will be interesting to actually test it.

Indeed, it seems to me that tokenizing Hebrew only by token-end marks would generate extremely many total-nonsense alternatives, since using only word-end info, the tokenizer doesn't know that some tokens can come only at the start of a word, etc., and only in a certain order. So this is certainly non-optimal.

A tokenizer that follows the links implied by the dict can solve that. The idea is to make a trie-like structure, when reading the dict, according to the connectors, and then follow it during tokenizing. This way only sequences that potentially have dict support will get emitted. But this is good only for continuous morphology (Hebrew has a continuous-morphology-like approach that can be used, but other, non-continuous-morphology languages may not have that).
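
A toy illustration of the pruning this gives, with made-up connector names: suppose the converted dict contains

A=:    PA+;
ABC=:  PB+;
=B=:   PA- & SA+;
=CDEF: SA-;
=DEF:  SB-;

A lookup-only tokenizer would emit both A= =B= =CDEF and ABC= =DEF for the input ABCDEF, but following the connectors shows that nothing can link ABC= (PB+) to =DEF (SB-), so that alternative can be dropped before parsing.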

Can it be used to do capitalization/down-casing?

It (like the current tokenizer - using code for that) can emit a token indicating capitalization and then the lowercased word.
It is a tokenizer so it cannot do more than that.

Can it be used to programatically add connectors, when needed?

Same as above, it can emit special tokens (say, for phonology also).
The code that converts tokens to disjuncts will need to interpret them and do the needed disjunct editing (of the previous and/or next words). So it cannot do anything more in this regard than the current tokenizer.

Can it easily make use of the Hebrew token tables? (and can you point me at an example, on-line or in some paper, or elsewhere?)

It can, but the current tokenizer can too.

can you point me at an example

Say you have a word ABCDEF.
The tokenizer looks it up in a table, and finds it has 2 matching lines:

ABCDEF: A B CDEF
ABCDEF: ABC DEF

Then it emits these 2 alternatives

A B CDEF
ABC DEF

These can be just substrings of the word. But I originally thought to tokenize, in this way, into tokens which represent grammatical features of the looked-up word, like:

ABCDE: root=xyz future single 3p f
XYZW: base=XYZ noun plural f

and build an LG dict based on that (there are more details about such an implementation, but it is not the point here). I only had some initial sketches on paper, so I don't really know if this is possible (e.g. no crossed links) and which changes are needed to make it possible/easier (I guess that an ability to edit connectors/disjuncts would help).
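
An entirely hypothetical sketch of the shape such a feature-token dict could take (the connector names and the <verb-usage> macro are invented here, just to show the idea of feature tokens carrying disjuncts that chain together):

root=xyz: VROOT+;
future:   VROOT- & VTENSE+;
single:   VTENSE- & VNUM+;
3p:       VNUM- & VPERS+;
f:        VPERS- & <verb-usage>;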

I think you can find such tables for many languages. For complex languages they may be very big.
But even a 50MB table shouldn't pose any special problem due to its size.
For Hebrew, a prefix lookup will need to be done separately, or else the input table will be 2-3 orders of magnitude bigger, which may be too much.


ampli commented May 5, 2018

I said:

However, the word may then need 2 links to its preceding word (for example both Wd- and CW- to LEFT-WALL), which is currently not supported (it may be interesting to add such support).

I guess this is a direct response to the above:

But is it superfluous? In almost all cases, if a word attaches with Wd, then it should have been capitalized...(or already was capitalized).

The problem is that the Wd from the LEFT-WALL often does not go to the first word, e.g.:

    +----------------->WV---------------->+
    |            +----------Ss*s----------+-------Osm------+
    +---->Wd-----+-------Mpn------+       |    +---Ds**x---+
    |      +Ds**c+        +--DTn--+       |    +PHc+---A---+
    |      |     |        |       |       |    |   |       |
LEFT-WALL the party.n that.j-r night.r was.v-d a big.a success.s

So I cannot see how to use it to enforce capitalization.

What I am missing are rules for a "disjunct calculus". It would tell how to create new disjuncts and "edit" existing ones in various situations.

Existing examples:

  • An existing example is idiom creation.
  • A practical rule that is used in a dictionary is for "feature propagation" using connector lowercase letters.
  • Currently, the code implements the rule "if a token is a quote, add its LHS connector to the previous token as an RHS connector". (The code actually checks whether the word has ZZZ-, and if so adds ZZZ+ to the previous word.)

In our case here:
Suppose we have an LG definition of a language.
But it is not accurate enough, and we would like to constrain (or extend) it according to a feature found in certain words (it can be a substring, capitalization, phonetics, or something totally different, like the context of the word in previous sentences of the same document).

So when a certain word has feature X, it should have a modified set of disjuncts.
If one of these disjuncts needs to connect to the same word more than once, we need to insert an intermediate token, with its own disjuncts, that will link to the word. A "disjunct calculus" rule is supposed to specify how to do that. Since we have two words that need to be represented as one unit, it seems to be somewhat close to the case of idioms (there, too, we modify a given disjunct by adding a link to connect it to a certain token).

I will try to code something and learn more from it.


ampli commented May 5, 2018

Issues 1 and 2 for the null words - I'm not sure those need to be solved. If <CAP-MARK> ends up being a null word, that just means that someone capitalized a word that they should not have capitalized.

Such an error will then mask possible single-null-word linkages of the sentence (there will be no way to get them), and it seems to me this reduces the usefulness of LG for applications.

One way to solve that, and similar existing problems, is to count null links differently:
for each input word, at most one null will be counted.
For example, say a Russian stem+suffix pair doesn't have any links. Currently this leads to a null count of 2.
But this may be totally misleading if it was a bogus split.
In Hebrew, a totally unlinked word can lead to an even higher null count due to bogus splits (since the splits are just blind guesses, and most of them are bogus). With my proposed null-counting scheme (which could be a parse option), there will be at most one null count per input word, which seems to lead to a more "sane" set of linkages. One problem may be the links that are still connected to these artificial null tokens (removing them may hide interesting info).


ampli commented May 5, 2018

So, now, I am trying to think of the transformation of an input string into a token flow-graph

I intend to test again the idea of tokenizing by LG (i.e. use the dict definitions as a tokenizer by automatically converting 4.0.dict and 4.0.affix to "letter-LG").

My previous test tried to tokenize whole sentences in a single parsing operation. This has the benefit of being able to overcome missing word boundaries (or to allow spaces within words). But it turned out to be extremely slow. I also didn't know what to do with multiple tokenization possibilities per sentence, as there was no wordgraph then (the only option was to parse each tokenization result separately, which would be extra-slow by itself).

But now the library is faster by a large factor, there is an ability to specify LENGTH_LIMIT_1 connectors (which is the length of all the "letter-LG" connectors), I know some more relevant things that I didn't know then, and I want to try to tokenize each word separately rather than all the words at once.
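
A toy sketch of what "letter-LG" could look like for a single dict word, say "cat", with auto-generated length-1 connectors (the names and the <word-boundary> placeholder are invented here):

c.start: CA+;
a.mid:   CA- & AT+;
t.end:   AT- & <word-boundary>;

In a real conversion, each letter would of course carry a large disjunction of such chains, one per dict word and letter position, which is why the LENGTH_LIMIT_1 connectors and the recent speedups matter.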


ampli commented May 5, 2018

I said:

I think you can find such tables for many languages. For complex languages they may be very big.

I suspect that for languages with rich morphology like Turkish, or with compound words like German, a naive table-driven morphology analysis would not work (the table would be too big). (Even for Hebrew, as I mentioned, there is a need to break words into at least two parts and make a lookup in more than one table.) But "letter-LG" may work for all. Stay tuned...


ampli commented May 5, 2018

a token flow-graph (I think maybe that is what we should call it, instead of word-graph

Or is it a token-flow graph?


linas commented May 5, 2018

"the token flow graph" or simply "the flow graph" with-or-without hyphens. So "flow-graph" - its not just some graph (cause there's tons of those in comp-sci/linguistics/machine learning) but indicates time-like ordering. A "token-flow graph" because its tokens that are flowing.


ampli commented May 5, 2018

So what should be done with that new term?
I.e., what to replace and where?


linas commented May 5, 2018

A tokenizer that follows the links implied by the dict can solve that. The idea is to make a trie-like structure when reading the dict according to connectors, then follow it during tokenizing. This way only sequences that potentially have a dict support will get emitted

Yeah, maybe. This risks turning into a full-featured parser, if you're not careful. And if you do it only shallowly, then it might not constrain things sufficiently. But it's worth a try, I suppose. I don't understand the comment about "continuous morphology".


linas commented May 5, 2018

For complex languages they may be very big. But even a 50MB table shouldn't make any special problem due to its size.

Hang on. I do not believe that some collection of scholars hand-created and hand-maintain a 50MB table. That table is being generated by some algorithm. The algorithm generating the table should then be the tokenizer. So, e.g. for French, there is some small number of irregular words, for which a table lookup is sufficient to tokenize, and then some 3 or 5 classes of very regular verbs which can be split algorithmically.


linas commented May 5, 2018

So what should be done with that new term?

Replace all occurrences of "word graph" by "flow graph". This is not urgent. And maybe we should sleep on it a bit longer.


linas commented May 5, 2018

XYZW: base=XYZ noun plural f and build an LG dict based on that

It's a plausible idea, but unlikely to work well. The French wiktionary contains this kind of data; I doubt it's enough. Also, it's not terribly hard to import data from other parsers for other languages; I suspect the quality will be mediocre.


linas commented May 5, 2018

"disjunct calculus"

I want to draw a line here. This is an interesting idea, but just not here, not now. Some of this calculus has to happen "after the parse", as it were. Some before the parse. It's a potentially bottomless pit, and we risk over-engineering and trying to design something before the principles are clear.

By contrast, the sentence-to-token-flow-graph conversion is becoming increasingly clear; we need to make it a bit more robust, a bit more general, but not go crazy on it. It seems like some amount of minor disjunct generation would be useful, but I want to keep it minor, until the dust settles a bit.

Go ahead and think about disjunct calculus, if you wish, but don't create any complex designs, until we have more practical experience with it, and what would be needed from it.


linas commented May 5, 2018

mask possible single null-word linkages

I don't care. If null-word linkages are required, then we know that either (a) it's a spoken sentence, where we know our grammar is bad, or (b) the dict is just handling the sentence incorrectly (viz, we know our grammar is bad). The correct long-term solution to null-word linkages is to have a much, much better, more accurate model of connector costs; if we had that, the null-word linkages would become much clearer, and, in a sense, irrelevant.

To be clear, to me, it seems like trying to solve this problem now is not only not worth the cost, but may also have a negative impact on the system: it adds significant complexity, and, worse, that extra complexity might be theoretically wrong.


ampli commented May 5, 2018

I don't understand the comment about "continuous morphology".

In that case you cannot extract unique substrings that represent morphemes.


ampli commented May 5, 2018

Hang on. I do not believe that some collection of scholars hand-created and hand-maintain a 50MB table. That table is being generated by some algorithm. The algorithm generating the table should then be the tokenizer. So, e.g. for French, there is some small number of irregular words, for which a table lookup is sufficient to tokenize, and then some 3 or 5 classes of very regular verbs which can be split algorithmically.

Of course it is generated by a program. It is a big and slow Java program (GPL'ed) that of course has its own big data table (all GPL'ed). I guess the big table is used when a slow program doesn't fit the task.

BTW, regarding verb forms, in Hebrew there are 10+ families of verb roots, each of which has a potential 7 forms called "binyanim". And each of them is (quoting from Wikipedia) "... conjugated to reflect their tense and mood, as well as to agree with their subjects in gender, number, and person."
Fortunately there are only about 1600+ verb roots (in a table that I inspected), and all of their possible conjugations have been listed (even by hand, long ago) in special grammar books ("Book of Verbs").

But not only verbs have inflections; regular words can have them too. For example (the ORs are different readings of the same written letters, with different pronunciations):
מחשבו = his computer (OR he computes him)
מחשבי = my computer (OR the computers of OR he computes me)
מחשבה = her computer (OR a thought OR he computes her)

Each word may have up to about 1000 prefix combinations (most of them are of course rare in actual use).

Of course in order to analyze such words you need to have the full list of base words in Hebrew (and for each noun have its gender and its plural type - in general these cannot be derived from the words), the full list of rules (a big one by itself), and the full list of exceptions.

The program hspell has a big hand-written table of base words, their features and exceptions that has been gathered for many years, with a generator of all the possible word forms (including all the possible prefixes).
I thought of basing a Hebrew morphology analyzer for LG on it, but it is AGPL3.
Maybe it would be a good idea to create a plug-in tokenizer API, and write such a plug-in using the hspell sources (which are in C and are efficient).


ampli commented May 5, 2018

Its a plausible idea, but unlikely to work well. The French wiktionary contains this kind of data; I doubt its enough. Also, its not terribly hard to import data from other parsers for other languages; I suspect the quality will be mediocre.

The quality of such data that is available for Hebrew is considered rather excellent (but, as I said above, it is under GPL or AGPL, depending on its source).


ampli commented May 5, 2018

if we had that, the null-word linkages become much clearer, and, in a sense, irrelevant.

Do I understand correctly that bad linkages will then be emitted, but marked with a higher cost?
If so, how can we know at which linkage the possibly ungrammatical parses start, as opposed to just less probable parses?

An additional question: how might a per-sentence cost solve problems like the one in issue #404?
(Or maybe the solution to that is not related to a cost system? Or maybe a per-alternative cost is needed?)


ampli commented May 5, 2018

Go ahead and think about disjunct calculus, if you wish, but don't create any complex designs, until we have more practical experience with it, and what would be needed from it.

The idea is indeed to start with something minor enough and get practical experience with it.
There are some possibilities (all mentioned recently) that I will explore.


linas commented May 6, 2018

It is a big and slow Java program

Well, I guess it is a complex program, and not something that would be easily re-written in C/C++. Presumably, it algorithmically encodes many special cases. I guess having a 50MB table is OK, but then we have a packaging/distribution problem.

a plug-in tokenizer API

Yes. Chinese has a fairly simple grammar, but is written without any whitespace at all. Each word is either 1, 2 or 3 hanzi long, and humans can just automatically see the word boundaries. There are word-boundary splitters, but they are not terribly accurate. I was thinking of just treating each separate hanzi glyph as a distinct pseudo-morpheme, and letting the parser figure out what the word segmentation should be, during parsing. Anyway, this is an example of a language where one might want to have not just one but maybe two tokenizers. It's also an example where two different languages, e.g. Mandarin and Cantonese, might want to use the same tokenizer.


ampli commented May 6, 2018

(Shouldn't this discussion be in issue #700 instead of here?)

we have a packaging/distribution problem.

It is a text file, so its size in a tar.gz file will be relatively negligible.
However, it is needed only if the dict is using word features.
For the traditional approach of using word substrings in the dict, I think (and still need to prove it by actual code) that a tokenizer can efficiently use the info in the dict in order to produce a minimal number of bogus alternatives.

a plug-in tokenizer API

Even now the tokenizer includes special code for issuing alternatives for the Hebrew word prefix.
This code can be useful for any language which has a multi-prefix/suffix that can be decoded with similar rules (I just don't have info about the rules of prefixes/suffixes in other languages).

letting the parser figure out what the word segmentation should be

Do you know the approximate number of words in which each glyph can participate (order of magnitude)?


linas commented May 6, 2018

(Shouldn't this discussion be in issue #700 instead of here?)

Probably, but it's too late for that.

how can we know at which linkage the possibly ungrammatical parses start, as opposed to just less probable parses?

We can't. In principle, there is no difference between "ungrammatical" and "less probable". In a sense, there is no such thing as "ungrammatical"; there is only "someone who does not speak a language very well" or "who doesn't understand a language very well". What we call "language" is actually a social construct; what LG tries to do is to capture some "typical" way in which some "typical" person might speak. However, much like a human, it's got quirks and failings, and this will continue to be the case "forever" (as LG will never become a social construct; it will always remain an "individual", just another user of language).


linas commented May 6, 2018

maybe a per-alternative cost is needed?

Maybe; but currently, having costs on connectors seems to be sufficient to solve all ranking problems.


linas commented May 6, 2018

Do you know the approximate number of words in which each glyph can participate (order of magnitude)?

No -- it is almost surely Zipfian, with a slope of 1.01 -- viz. the most common glyph will appear in 10K words, the 10th-most-common in 1K words, the 100th-most-common in 100 words, sliding downhill from there.


linas commented May 6, 2018

packaging/distribution problem

At this time, we could distribute a file containing lists of words, which LG would read. However, I would rather not distribute a file containing word features; instead, we should create scripts to convert that into an ordinary LG dict. This is somewhat analogous to what is currently done for Russian: the data/ru/tools directory contains scripts to mangle foreign files into LG format. Again, this is a packaging problem, with link-grammar.tar.gz aimed at users, while github is aimed at maintainers.


ampli commented May 14, 2018

An additional question: how might a per-sentence cost solve problems like the one in issue #404?
(Or maybe the solution to that is not related to a cost system? Or maybe a per-alternative cost is needed?)

Maybe; but currently, having costs on connectors seems to be sufficient to solve all ranking problems.

Currently, instead of using a ranking system, the code just a-priori ignores possibilities (in a hard-coded way) that tend to give bad results, but that sometimes might give the only good result. Issue #404 refers to such a case, in which spell-guessing is never done on words that match the capitalization regexes. I listed there several other similar situations. I don't have any idea how a sentence-global ranking can solve such problems.
