Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Surprising multi-connector behavior #1351

Open
linas opened this issue Nov 4, 2022 · 12 comments
Open

Surprising multi-connector behavior #1351

linas opened this issue Nov 4, 2022 · 12 comments
Labels

Comments

@linas
Copy link
Member

linas commented Nov 4, 2022

Multi-connectors are behaving in a fashion I do not fully understand. Consider this dict

LEFT-WALL: @ANY+;

<many>: {[@ANY-]-0.1 or [@ANY+]-0.11};

<UNKNOWN-WORD>:  <many>;
% <UNKNOWN-WORD>:  <many> and <many>;
% <UNKNOWN-WORD>:  <many> and <many> and <many>;
% <UNKNOWN-WORD>:  <many> and <many> and <many> and <many>;

and then parse the sentence `asdf asdf asdf asdf. I get this:

linkparser> asdf asdf asdf asdf
Found 14 linkages (14 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-0.43 LEN=6)

    +---------------ANY--------------+
    |        +----------ANY----------+
    |        |       +------ANY------+
    |        |       |       +--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

and the others are all trees. Not surprising.

When I set <UNKNOWN-WORD>: <many> and <many>; then I get

Found 435 linkages (435 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-0.84 LEN=3)

             +----------ANY----------+
             |       +------ANY------+
    +---ANY--+       +--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

Loops now appear. Not surprising. This is what I wanted. Making the costs negative means that the loopy parses are shown first. But oddly, this is not showing the maximum-possible number of loops!

With <UNKNOWN-WORD>: <many> and <many> and <many>; I get

Found 939 linkages (939 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.15 LEN=3)

             +----------ANY----------+
             |       +------ANY------+
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

Much better! But wait, there's more: with four<many>'s I get

Found 1020 linkages (1020 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.25 LEN=6)

    +---------------ANY--------------+
    |        +----------ANY----------+
    |        +------ANY------+       |
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

I understand the difference between one and two <many>'s, But the change for three and four is surprising. I do not understand that or why it's happening.

Seem that five-word sentences need five <many>'s and so on, to explore al possibilities.

@linas linas added the question label Nov 4, 2022
@linas
Copy link
Member Author

linas commented Nov 4, 2022

It's even crazier. LEFT-WALL: {@ANY+} and {@ANY+} ; gives

Found 1836 linkages (1000 of 1000 random linkages had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.25 LEN=6)

    +---------------ANY--------------+
    |        +----------ANY----------+
    |        |       +------ANY------+
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

while LEFT-WALL: {@ANY+} and {@ANY+} and {@ANY+}; gives

Found 2070 linkages (2070 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.25 LEN=6)

    +---------------ANY--------------+
    |        +----------ANY----------+
    |        +------ANY------+       |
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

@linas
Copy link
Member Author

linas commented Nov 4, 2022

I forgot to mention that both the number of linkages changes, and the cost of the best linkage changes too. The best cost for four-word sentences is achieved with four wall connectors:

<pany>: {[@ANY+]-0.11};
LEFT-WALL: <pany> and <pany> and <pany> and <pany>;

which gives me

Found 2097 linkages (2097 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.47 LEN=5)

    +---------------ANY--------------+
    +-------ANY------+------ANY------+
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

linas added a commit that referenced this issue Nov 4, 2022
An interesting phenomenon shows up, described in issue #1351
@ampli
Copy link
Member

ampli commented Nov 4, 2022

Found 1836 linkages (1000 of 1000 random linkages had no P.P. violations)

Random sampling here. Does it matter?

@linas
Copy link
Member Author

linas commented Nov 4, 2022

Random sampling here. Does it matter?

Yes. The costs are crazy. So, with

<pany>: {[@ANY+]-0.11};
LEFT-WALL: <pany> and <pany>; 

I get:

Found 1836 linkages (1000 of 1000 random linkages had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.37 LEN=6)

    +---------------ANY--------------+
    |        +----------ANY----------+
    |        +------ANY------+       |
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

Notice the cost. Then

limit set to 3000
linkparser> asdf asdf asdf asdf
Found 1836 linkages (1836 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.47 LEN=6)

    +---------------ANY--------------+
    |        +----------ANY----------+
    |        +------ANY------+       |
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

Note the cost. Counting by hand, I see 7 + links and seven - links, so I expect a cost of $-0.11 \times 7 + -0.1 \times 7 = -1.47$ so where is the $-1.37$ coming from?

linkparser> !lim=60
limit set to 60
linkparser> !dis
Display of disjuncts used turned on.
linkparser> asdf asdf asdf asdf
Found 1836 linkages (60 of 60 random linkages had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.27 LEN=6)

    +---------------ANY--------------+
    |        +----------ANY----------+
    |        +------ANY------+       |
    +---ANY--+--ANY--+--ANY--+--ANY--+
    |        |       |       |       |
LEFT-WALL asdf[?] asdf[?] asdf[?] asdf[?]

            LEFT-WALL    -0.220  @ANY+ @ANY+
              asdf[?]    -0.430  @ANY- @ANY+ @ANY+ @ANY+
              asdf[?]    -0.210  @ANY- @ANY+
              asdf[?]    -0.210  @ANY- @ANY+
              asdf[?]    -0.200  @ANY- @ANY-

so that totals to $-1.27$ because ... the multi-connectors. With lim=6000 I get

            LEFT-WALL    -0.220  @ANY+ @ANY+
              asdf[?]    -0.430  @ANY- @ANY+ @ANY+ @ANY+
              asdf[?]    -0.210  @ANY- @ANY+
              asdf[?]    -0.310  @ANY- @ANY- @ANY+
              asdf[?]    -0.300  @ANY- @ANY- @ANY-

So the multi-connectors are now working "individually".

@linas
Copy link
Member Author

linas commented Nov 4, 2022

OK, so I understand the costs being computed differently. The multi-connector isn't incrementing the cost. Fine. But look at the last word ddd

linkparser> aaa bbb ccc ddd
Found 1836 linkages (1836 had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.47 LEN=6)

    +-------------ANY-------------+
    |        +---------ANY--------+
    |        +-----ANY-----+      |
    +---ANY--+--ANY-+--ANY-+--ANY-+
    |        |      |      |      |
LEFT-WALL aaa[?] bbb[?] ccc[?] ddd[?]

            LEFT-WALL    -0.220  @ANY+ @ANY+
               aaa[?]    -0.430  @ANY- @ANY+ @ANY+ @ANY+
               bbb[?]    -0.210  @ANY- @ANY+
               ccc[?]    -0.310  @ANY- @ANY- @ANY+
               ddd[?]    -0.300  @ANY- @ANY- @ANY-

vs.

linkparser> !lim=60
limit set to 60
linkparser> aaa bbb ccc ddd
Found 1836 linkages (60 of 60 random linkages had no P.P. violations)
	Linkage 1, cost vector = (UNUSED=0 DIS=-1.27 LEN=6)

    +-------------ANY-------------+
    |        +---------ANY--------+
    |        +-----ANY-----+      |
    +---ANY--+--ANY-+--ANY-+--ANY-+
    |        |      |      |      |
LEFT-WALL aaa[?] bbb[?] ccc[?] ddd[?]

            LEFT-WALL    -0.220  @ANY+ @ANY+
               aaa[?]    -0.430  @ANY- @ANY+ @ANY+ @ANY+
               bbb[?]    -0.210  @ANY- @ANY+
               ccc[?]    -0.210  @ANY- @ANY+
               ddd[?]    -0.200  @ANY- @ANY-

Notice how this time, the disjunct on ddd only has two multi-connectors Why two? why not one? There are clearly three - links being used ...

@linas
Copy link
Member Author

linas commented Nov 4, 2022

Ah! OK, I think I understand what is going on. The two- case should be enough to enumerate all possible graphs. There seem to be more, because they are duplicated with different disjunct usage. The multi connector cost isn't accounted for when multi is used. So I'm thinking everything works correctly; its just confusing.

I guess the way cost is computed is a bug -- is there an ovious way to "fix" the cost for multi-connectors?

Good night, I'll sleep on this.

@ampli
Copy link
Member

ampli commented Nov 4, 2022

See also lg_compute_disjunct_strings(). It is my rewrite of the original one. It handles multi-connectors in a special way.
This code should be rechecked. I think a debug flag can be added to make it optionally use the chosen disjuncts so we see if the results are the same.

@ampli
Copy link
Member

ampli commented Nov 4, 2022

See also lg_compute_disjunct_strings()

I haven't tried yet to add an option to directly use the chosen disjuncts, but the concept of the current code seems to me solid for now.

With <UNKNOWN-WORD>: <many> and <many> and <many>; ...

linkparser> !!<UNKNOWN-WORD>
Token "<UNKNOWN-WORD>" matches:
    <UNKNOWN-WORD>                    9 disjuncts


Token "<UNKNOWN-WORD>" expressions:
    <UNKNOWN-WORD>             ({[@ANY-]-0.100 or [@ANY+]-0.110}) & ({[@ANY-]-0.100 or [@ANY+]-0.110})

linkparser> !!<UNKNOWN-WORD>//
Token "<UNKNOWN-WORD>" matches:
    <UNKNOWN-WORD>                    9 disjuncts


Token "<UNKNOWN-WORD>" disjuncts:
    <UNKNOWN-WORD>                    5/8 disjuncts

  <UNKNOWN-WORD>: [0]-0.110=  <> @ANY+
  <UNKNOWN-WORD>: [1]-0.100= @ANY- <> 
  <UNKNOWN-WORD>: [2]-0.210= @ANY- <> @ANY+
  <UNKNOWN-WORD>: [3]-0.200= @ANY- @ANY- <> 
  <UNKNOWN-WORD>: [4]-0.220=  <> @ANY+ @ANY+

linkparser> 

Disjunct [0] and disjunct [4] can be considered duplicates (also [1] and [3]) since disjunct [0] can create similar linkages as disjunct [4]. Despite having diagrams that look the same, the total cost depends on the length of the multi-connector sequence which got used (if the multi-connector costs are not 0).

OK, so I understand the costs being computed differently. The multi-connector isn't incrementing the cost. Fine. But look at the last word ddd

The two- case should be enough to enumerate all possible graphs. There seem to be more, because they are duplicated with different disjunct usage.

As you said, this seems fine - on !lim=60 the lower sampled linkage happens to contain:
ddd[?] -0.200 @ANY- @ANY-
The linkage with ddd[?] -0.300 @ANY- @ANY- @ANY- just has not been sampled.

I guess the way cost is computed is a bug -- is there an ovious way to "fix" the cost for multi-connectors?

The current concept of disjunct cost is static (independent of the number of times a multi-connector is used).
We can recompute it after the linkage, but I'm not sure whether this is a good idea. Does multi-match makes a sentence less likely (for cost>0) or more likely (for cost<0)?

Instead, maybe we should discard such "duplicate" disjuncts. But the implementation looks very problematic because it is not clear how to handle the various cases (we can discuss that if needed).

@linas
Copy link
Member Author

linas commented Nov 4, 2022

OK, so after sleeping, I've convinced myself everything is fine, except for the cost computation. The fully correct behavior is given by the dict

LEFT-WALL:  {[@ANY+]-0.11};
<UNKNOWN-WORD>: [@ANY-]-0.1 or [@ANY+]-0.11 or [@ANY- & @ANY+]-0.111;

and the above does generate an exhaustive collection of linkages. However, due to the cost computations, the ones with the maximal number of links are NOT presented first! And that is what confused me, and lead to this long report.

Does multi-match makes a sentence less likely (for cost>0) or more likely (for cost<0)?

Yes, that was the whole idea! In one case, the use is minimized; in the other, it is maximized.

@linas
Copy link
Member Author

linas commented Nov 4, 2022

To fix cost computations, I think it can be done simply & easily in score.c by comparing the number of times a multi-connector appears in the disjuncts, vs the number of times that a link with the same name appears in the link_array.

@linas
Copy link
Member Author

linas commented Nov 4, 2022

I implemented a cost-adjust function, that adjusts the total cost by the number of times that a multi-connector is used. It is in my branch cost-adjust. What is missing is the actual cost of a multi-connector. The natural place to put it would be in condesc_t but it looks like condesc_t is carefully packed to minimize space.

@ampli take a look, tell me what you think. Getting this to work is low priority -- the goal was to maximize the creation of links, when they have negative cost. It seems that maximizing the creation of links is useful, to maximally constrain the grammar. I was envisioning that this would be used only by the Atomese dict, only for UNKNOWN-WORD, to create maximally-linked unknown words. But this is all very hypothetical. It seems like a good idea, it seems like the right thing to do, but the practical impact of it, at this time, is nil.

The branch: https://github.com/linas/link-grammar/tree/cost-adjust

@linas
Copy link
Member Author

linas commented Nov 4, 2022

linas added a commit that referenced this issue Nov 4, 2022
Now that the issue is clearly understood, we can do it correctly.
See #1351 for details.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants