Handling quotation marks #42 (Open)

ampli opened this issue Dec 17, 2014 · 12 comments

ampli (Member) commented Dec 17, 2014

In the current code, quotation marks are removed from the sentence.
Actually, they are converted to whitespace. This means they serve as a word separator.
For example:
This"is a test"
is converted to:
This is a test
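
For illustration, here is a minimal stand-alone sketch of that behavior (not the library's actual code), assuming only the ASCII double quote needs handling:

```c
/* Hypothetical sketch: replace every quotation mark in the input
 * sentence with a blank, so that it acts as a word separator.
 * Only the ASCII double quote is handled here, for brevity. */
#include <stdio.h>

static void quotes_to_whitespace(char *sentence)
{
    for (char *p = sentence; *p != '\0'; p++)
        if (*p == '"') *p = ' ';
}

int main(void)
{
    char sentence[] = "This\"is a test\"";
    quotes_to_whitespace(sentence);
    printf("%s\n", sentence);   /* both quotes become blanks: "This is a test " */
    return 0;
}
```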

In addition, a word just after a quotation mark is considered to be in a capitalizable position.
This is true even for a closing quotation mark, including when it is a "right mark".

In my new tokenization code, quotes are tokenized. They are defined in RPUNC and LPUNC, and thus they get stripped off words from their LHS and RHS. However, this doesn't preserve the "separator" behavior that exists in the current code.

In English (and I guess some other languages, maybe many) this seems to be desirable behavior,
because if we see qwerty"yuiop" we may guess it is actually qwerty "yuiop".

However, doing this in general would be wrong for Hebrew, as the double quote U+0022 is a de facto replacement for the Hebrew character "gershayim", which can be an integral part of Hebrew words (since gershayim is not found on the Hebrew keyboard) - for example, in the acronym צה"ל the ASCII double quote stands in for gershayim. It is also a de facto replacement for Hebrew quotation marks (for the same reason). So a general tokenization code cannot blindly use it as a word separator.

In order to solve this, I would like to introduce an affix class WORDSEP, which will be a list of characters to be used as word separators, with blank as the default. Characters listed there will still be able to be listed in other affix classes and thus serve as tokens.
Is this solution sensible?
Another option is just not to use it as a word separator, at least in the first version of "quotation mark as token" (this is what my current code does).
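
As a rough illustration (the exact affix-file syntax and quoting may need adjusting), the new class could look something like the existing LPUNC/RPUNC entries:

```
% Hypothetical sketch of a WORDSEP affix class, in the style of the
% existing LPUNC/RPUNC entries.  Characters listed here would act as
% word separators (like a blank), but could still be listed in other
% classes and thus also be issued as tokens.  The ASCII double quote
% would also be listed, subject to whatever quoting the affix-file
% parser supports.
"“" "”": WORDSEP+;
```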

Regarding the capitalizable position after a closing quote: for now I will mostly preserve this behavior in the hard-coded capitalization handling, since we are anyway going to try to implement capitalization using the dict.

linas (Member) commented Dec 17, 2014

Adding a WORDSEP affix class seems reasonable.

Any of the usual white-space marks are reasonable separators, of course. To get fancy, the non-breaking space should be among them, as it's a typesetting thing.

linas (Member) commented Dec 5, 2015

Pull requests #228 and #229 added support for a new QU link type that is able to handle many simple sentence quotations. It cannot handle random quote marks; the ZZZ link is currently used for that.

linas (Member) commented Feb 27, 2018

This was discussed again in issue #632 -- I am copying parts of that discussion here.

linas (Member) commented Feb 27, 2018

There are (at least) five distinct types of quote usage:

  1. direct quotation: "Let's go to the movies", John said. These mostly work already.

  2. the "whom" example, where the word is quoted because the writer knows that it doesn't make grammatical sense, and is trying to keep the reader from getting confused. In this case, the quoted word should be treated as an unknown noun.

  3. Some "people" who "don't" know how to write "clearly" seem to use "quotes" for "emphasis". I'm not sure what to make of this. Quotes used in this way are usually meant to be sarcastic, indicating that the writer knows the statement is false, and wants the reader to pick up on the sarcasm more easily.

  4. For case three, I guess we could have the notion of "sarcasm quotes". Which brings us to case 4: a new term is being introduced and defined, so it is placed in quotes. Which are not sarcastic, but definitional. Although this is exactly the device that sarcastic writers enjoy using.

  5. For example, I can use quotes around the words "as is" to indicate that "as is" is a kind of noun, or object that is being described. In this sentence, "as is" should be treated as if it were a noun phrase. That is, I am describing the properties of "as is" as a linguistic meta-abstraction, the same way I might want to talk about "voir-dire" or "pulchritude" or "form vs. function" as a kind-of object.

Both case 3 and case 4 can probably be handled the same way.

Case 5 suggests that there should be a generic mechanism that treats a quoted phrase as if it were a noun. That is, the internal grammatical structure of the phrase should be ignored. The quotation marks form a wall (like right-wall, left-wall), preventing links from crossing over the wall.

ampli (Member, Author) commented Mar 7, 2018

From issue #632:

To sum up the possibilities for "word" tokenization:

  1. Tokenize it, including the quotes, as UNKNOWN-WORD.
  2. Tokenize it as now (separating the quotes), in case the word is used in a grammatical context, and add an UNKNOWN-WORD alternative for it (including the quotes).
    (I'm for (2), because I think that (1) disregards possible info in "word" that may still be interesting.)

Option 2.

I added it to my TODO list.
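
A minimal sketch of what option (2) would produce for an input token like "word" (made-up helper names, not the actual tokenizer API):

```c
/* Hypothetical sketch of option (2): for the input token "word"
 * (with quotes), issue both the punctuation-stripped split and the
 * whole quoted token as an UNKNOWN-WORD candidate. */
#include <stdio.h>

/* Stand-in for recording one tokenization alternative in the word-graph. */
static void issue_alternative(const char *label, const char *const *tokens, int n)
{
    printf("%s:", label);
    for (int i = 0; i < n; i++) printf(" [%s]", tokens[i]);
    printf("\n");
}

int main(void)
{
    /* Alternative 1: strip the LPUNC/RPUNC quotes, as is done today. */
    const char *stripped[] = { "\"", "word", "\"" };
    issue_alternative("stripped", stripped, 3);

    /* Alternative 2: keep the quotes; the whole token becomes UNKNOWN-WORD. */
    const char *unknown[] = { "\"word\"" };
    issue_alternative("unknown-word", unknown, 1);

    return 0;
}
```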

linas (Member) commented Mar 7, 2018

Thank you.

linas (Member) commented Mar 7, 2018

I have a very ill-defined, vague comment - something to try to think about and understand: is there some (elegant) way of reformulating tokenization (and related issues) into a collection of rules (that could be encoded in a file)?

For example: capitalization. We have a rule (coded in C++) that if a word is at the beginning of a sentence, then we should search for a lower-case version of it. ... or if the word is after a semicolon, then we should search for a lower-case version of it. ... or if the word is after a quote, then we should search for a lower-case version of it. (Lincoln said, "Four-score and seven...") There is an obvious solution: design a new file, and place semicolons, quotes and LEFT-WALL as markers for capitalization. Maybe this is kind-of-like a "new kind of affix rule"?? All of the other affix rules state "if there is a certain sequence, then insert whitespace", while this new rule is "if there is a certain sequence, then look for the downcased form".
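
A rough sketch of what such a file could look like (CAPMARK is a made-up class name; the exact syntax and quoting would have to follow what the affix parser supports):

```
% Hypothetical sketch: tokens after which the following word is in a
% capitalizable position, so the tokenizer should also look up its
% lowercase form.  CAPMARK is not an existing affix class.
";" "“" "LEFT-WALL": CAPMARK+;
```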

In the language-learning code, I don't downcase any data in advance. Instead, the system eventually learns that certain words behave the same way, grammatically, whether they are uppercased or not. The system is blind to uppercasing: it just sees two different UTF-8 strings that happen to fall into the same grammatical class.

To "solve" this problem, one can imagine three steps. First, a "morphological" analysis: given a certain grammatical class, compare pairs to strings to see if they have a common substring - for example, if the whole string matches, except for the first letter. This would imply that some words have a "morphology", where the first letter can be either one of two, while the rest of the word is the same.

The second step is to realize that there is a meta-morphology-rule, which states that there are many words, all of which have the property that they can begin with either one of two different initial letters. The correct choice of the initial letter depends on whether the preceding token was a semicolon, a quote, or the left-wall.

The third step is to realize that the meta-morphology-rule can be factored into approximately 26 different classes. That is, in principle, there are 52·51/2 = 1326 possible sets containing two (initial) letters. Of these, only 26 are seen: {A, a}, {B, b}, {C, c} ... and one never ever sees {P, Q} or {Z, a}.

As long as we write C code, and know in advance that we are dealing with capital letters, we can use pre-defined POSIX locales for capitalization. I'm trying to take two or three steps backwards here. One is to treat capitalization as a kind of morphology, just like any other kind of morphology. The second is to create morphology classes - the pseudo-morpheme A is only substitutable by the pseudo-morpheme a. The third is that all of this should be somehow rule-driven and "generic" in some way.

The meta-meta-meta issue is that I want to expand the framework beyond just language written as UTF8 strings, but more generally, language with associated intonation, affect, facial expressions, or "language" from other domains (biology, spatial relations, etc.)

linas (Member) commented Mar 7, 2018

I created issue #690 to track this capitalization-as-pseudo-morphology idea.

The meta issue is, again:

  1. Tokenize it as now (separating the quotes), in case the word is used in a grammatical context, and add an UNKNOWN-WORD alternative for it (including the quotes).

For the first pass, it's just fine to write this as pure C code that "does the right thing". The meta-issue is to identify these kinds of rules in the tokenizer algorithm, and capture them as "generic rules".

ampli (Member, Author) commented Mar 8, 2018

I have a very ill-defined, vague comment - something to try to think about and understand: is there some (elegant) way of reformulating tokenization (and related issues) into a collection of rules (that could be encoded in a file)?

I have an abandoned project that does just that...
The definition of tokenization for each language was one short line (that does exactly what the tokenizer does today, including generating alternatives, but not including MPUNC, which didn't exist then):

English: "LPUNC* (WORD SUF* | PRE WORD | NUMBER? UNITS+)? RPUNC*"
Russian: "LPUNC* (WORD | WORD.= =WORD)? RPUNC*"
Hebrew:  "LPUNC* (WORD=+ | WORD=* WORD)? RPUNC*"

I sent a very detailed proposal, and even had a prototype implementation. However, I somehow understood that such an approach was too complex and actually not needed, maybe from your response:

But why not just assume that its LPUNC* WORD* RPUNC* for all languages, and be done with it? What do you win by making it more complex? Better performance ??

I then sent a very detailed answer (that you didn't address).

I also started to investigate another project, of a "zero-knowledge" tokenizer. The idea was that the dict would include the information needed for tokenization.

Here is the relevant post (you didn't respond).

https://groups.google.com/d/msg/link-grammar/5u0Xmqg3YEA/e1eEzJPQBgAJ

We can continue those discussions if desired.

For example: capitalization

I'll continue in #690.

linas (Member) commented Mar 13, 2018

that you didn't address

Sorry. The proposal did not seem quite right, and figuring out how to do it correctly takes a lot of time and effort, and I ran out of energy. The generic "problem" still needs to be solved. I recall two problems that bugged me:

  1. the names "prefix", "suffix" - in languages with complex morphology, "prefixes" and "suffixes" can occur in the middle of a word, and there was tension because of that.

  2. tokenization is kind-of the "opposite" of parsing. Tokenization tells you how to split apart; parsing tells you how to re-assemble back together again. So the original tokenization proposal did not keep these concepts distinct: by using the concepts of prefix, suffix, it constrained what could come where, in what order, and why - that constraint was a kind-of mini-parsing, but not a very good one, because it couldn't handle complex word morphology.

As I write this, it occurs to me that perhaps parsing and tokenization really truly are "adjoint functors". However, the term "adjoint functor" is a rather complex and abstract concept, and so I would have to think long and hard about how to demonstrate that adjointness directly, and how it could be used to guide the design of the tokenizer.

ampli (Member, Author) commented Mar 13, 2018

I think the main misunderstanding between us was that I didn't intend at all to use "prefix" and "suffix" according to their linguistic meanings, and I mentioned that repeatedly. But I understand that I failed to clarify this point.

I just defined "prefix" and "suffix" in a way that is effective for tokenizing, disregarding (entirely, as I pointed out) their linguistic role, since for tokenizing purposes this is not important - it is only important to break words into morphemes in all the possible ways. Instead I could call these parts, e.g., ISSP and ESSP for "initial sentence string parts" and "ending sentence string parts", or anything else ("sentence string" and not "word", to further avoid a possible clash with linguistic terms). If you think it is clearer, for the purpose of the tokenization discussion, to use other terms than the specially defined "prefix" and "suffix", then I have no objection to that.

by using the concepts of prefix, suffix, it constrained what could come where

I think that my definitions, which are specially tailored for the purpose of tokenization (and hence different from the linguistic terms), do not constrain tokenization, and words can still be broken into all the possibilities without missing any. But if any constraint is discovered, these definitions can be fine-tuned as needed, because there is no need for them to match any linguistic concept.

But it occurred to me (and I also posted about that in detail) that even these terms are not needed if we just mark the possible internal links somehow, because the tokenizer can then infer everything from the dict. And as usual, for many of the things that I said, including this, I also wrote (before saying so...) a demo program to check them.

linas (Member) commented Mar 13, 2018

Again, sorry for the misunderstanding. A part of that involved the use of the equals sign. I think I understand the English and Russian examples above, but I don't quite understand:

Hebrew: "LPUNC* (WORD=+ | WORD=* WORD)? RPUNC*"

I also don't understand how to write rules for Lithuanian, where the "prefixes" (which can occur in the middle of a word) are drawn from a "closed class" (there are maybe 20 of them; one can list them all, exhaustively), the "infixes" are another closed class (maybe five of them total), the "suffixes" are again closed (maybe 50 or 100 of them) but, again, can occur in the middle of a word. The stems are then open class (thousands) which cannot be exhaustively listed, and have to be drawn from the dictionary (and a word might have two stems inside of it).

So the idea was that if it's a "closed class", viz. a small, finite number of them, completely well-known by all speakers, then it's an affix. If it's not closed class, then it's "open class", because it's impossible to create a complete list; most speakers do not know (and will never know) all of them - they are like words you have never heard of before, never use, don't recognize. It's only because the total number of closed-class affixes is small that it makes sense to list them in one file. It's a good thing that the total number is small, as otherwise morphology would be very difficult, requiring the lookup of huge numbers - a combinatoric explosion - of alternatives in the dictionary.

In English, the closed-class words are pronouns (he, she, ...), determiners (this, that, ...), prepositions (as, in, of, next, by), and all speakers know all of them, and new ones are never invented / created / coined, even in slang (with exceptions: xyr, xe, ... - closed-class words are very difficult to invent and popularize). The closed-class morphemes in Lithuanian are somewhat similar.

Again, sorry for the misunderstandings I caused. I make mistakes, I'm short-tempered and have a large variety of human failings :-)
