# POC-Turtle-8: Test grammar learning on MST-parses
This notebook continues the proof-of-concept (POC) experiments in unsupervised language learning (ULL), an OpenCog project hosted on [GitHub](https://github.com/opencog/language-learning/tree/master/notebooks).  
It tests grammar learning on MST-parses produced by the OpenCog MST parser on 2018-03-16 ([input_data](http://88.99.210.144/data/clustering_2018/input_data/)). Results are shared at [http://88.99.210.144/data/clustering_2018/POC-Turtle-8-2018-03-16/](http://88.99.210.144/data/clustering_2018/POC-Turtle-8-2018-03-16/).

In [1]:
import os, sys, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.utl.utl import UTC
from src.utl.turtle import html_table
print(UTC(), ':: module_path:', module_path)

2018-03-16 18:26:12 UTC :: module_path: /home/obaskov/language-learning


## 8.1 Settings, parameters, data.

In [2]:
prj_dir = '../output/Turtle-8-2018-03-16/'  # project directory 
prefix = ''     # all project files will start with this prefix
test_data_path = module_path + '/tests/'
verbose = 'max' # printed comments: 'none', 'min', 'max'
log = {'project': 'POC-Turtle-8: Test MST parses'}

if not os.path.exists(prj_dir):
    os.makedirs(prj_dir)
    print('Project directory created:', module_path + prj_dir[2:])
else: print('Project directory', module_path + prj_dir[2:], 'exists')
path = module_path + prj_dir[2:]
tmpath = module_path + '/tmp/'
input_dir = '/home/obaskov/data/clustering_2018/input_data/'

Project directory /home/obaskov/language-learning/output/Turtle-8-2018-03-16/ exists


## 8.2 Test "POC-Turtle" MST-parses.

In [3]:
def check_input_file(input_file, verbose='none'):
    if os.path.isfile(input_file):
        print('Data file:', input_file)
        log.update({'input_file': input_file})
        if verbose == 'max':
            print('- "Turtle" language corpus:\n')
            with open(input_file, 'r') as f: 
                lines = f.read().splitlines()
            for line in lines: print(line)
        elif verbose not in ['none', 'min']: 
            print('Input file:', input_file)
        return True
    else: 
        print('No data file', input_file)
        return False

input_file = input_dir+'poc-turtle-parses-window-distance-fmi.txt'
check_input_file(input_file, verbose='max')

Data file: /home/obaskov/data/clustering_2018/input_data/poc-turtle-parses-window-distance-fmi.txt
- "Turtle" language corpus:

## Parses obtained with window-based pair-counting, which accounts
## for distance.
## Word-pair counts are counted within a window of size K.
## The counts added for a word-pair are equal to K/d, where d is
## the distance between the two words.

## These parses, are the same for K = {2, 6, 10, 30}, but not for K = 1


tuna isa fish .
0 ###LEFT-WALL### 1 tuna
1 tuna 3 fish
2 isa 3 fish
3 fish 4 .

herring isa fish .
0 ###LEFT-WALL### 1 herring
1 herring 3 fish
2 isa 3 fish
3 fish 4 .

tuna has fin .
0 ###LEFT-WALL### 1 tuna
1 tuna 2 has
2 has 3 fin
3 fin 4 .

herring has fin .
0 ###LEFT-WALL### 1 herring
1 herring 2 has
2 has 3 fin
3 fin 4 .

parrot isa bird .
0 ###LEFT-WALL### 1 parrot
1 parrot 3 bird
2 isa 3 bird
3 bird 4 .

eagle isa bird .
0 ###LEFT-WALL### 1 eagle
1 eagle 3 bird
2 isa 3 bird
3 bird 4 .

parrot has wing .
0 ###LEFT-WALL### 1 parrot
1 parr

True
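
The corpus dump above suggests a simple layout: `##` header comments, then for each sentence the tokenized text followed by one link per line in the form `left_index left_word right_index right_word`, with blank lines between parses. Below is a minimal sketch of a reader for this layout, assuming that structure holds for the whole file. The `parse_mst_text` helper and the `Link` tuple are hypothetical names for illustration and are not part of the project's `src` package.

```python
from collections import namedtuple

# Hypothetical container for one MST link (not from the project code)
Link = namedtuple('Link', ['left_idx', 'left_word', 'right_idx', 'right_word'])

def parse_mst_text(text):
    """Split an MST-parse dump into a list of (sentence, links) pairs."""
    parses, sentence, links = [], None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith('#'):      # '##' header comments in the data file
            continue
        if not line:                  # a blank line closes the current parse
            if sentence is not None:
                parses.append((sentence, links))
            sentence, links = None, []
            continue
        fields = line.split()
        # Assumption: a link line has exactly 4 fields, with integer indices
        if len(fields) == 4 and fields[0].isdigit() and fields[2].isdigit():
            links.append(Link(int(fields[0]), fields[1], int(fields[2]), fields[3]))
        else:
            sentence = line           # anything else is the sentence text
    if sentence is not None:          # the file may not end with a blank line
        parses.append((sentence, links))
    return parses

sample = """## header comment

tuna isa fish .
0 ###LEFT-WALL### 1 tuna
1 tuna 3 fish
2 isa 3 fish
3 fish 4 .

tuna has fin .
0 ###LEFT-WALL### 1 tuna
1 tuna 2 has
2 has 3 fin
3 fin 4 .
"""
parsed = parse_mst_text(sample)
```

Each entry of `parsed` pairs the sentence text with its links, which is the information the later steps of the pipeline appear to work from.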

In [4]:
def mst2stalks(input_file, left_wall='', period=False, verbose='none'):
    # Only the helpers actually used in this function are imported
    from src.space.turtle import mst2disjuncts
    from src.link_grammar.turtle import lexical_entries, entries2clusters
    parses = mst2disjuncts(input_file, lw=left_wall, dot=period)
    # Sum counts per (word, disjunct) pair; most frequent pairs first
    disjuncts = parses.groupby(['word','disjunct'], as_index=False).sum() \
        .sort_values(by=['count','word','disjunct'], ascending=[False,True,True]) \
        .reset_index(drop=True)
    dj_number = len(set(disjuncts['disjunct'].tolist()))
    if verbose != 'none':
        print(dj_number, 'unique disjuncts form', len(disjuncts),
              'unique word-disjunct pairs from', len(parses), 'parsed items')
    return entries2clusters(lexical_entries(disjuncts))

stalks = mst2stalks(input_file, left_wall='', period=False, verbose='max')
stalks

33 unique disjuncts form 40 unique word-disjunct pairs from 55 parsed items


| | germs | disjuncts | counts | cluster |
|---|---|---|---|---|
| C01 | [.] | [bird-, extremity-, fin-, fish-, scale-, wing-] | 11 | C01 |
| C02 | [LEFT-WALL] | [eagle+, fin+, herring+, parrot+, tuna+, wing+] | 11 | C02 |
| C03 | [bird] | [eagle- & isa- & .+, isa- & parrot- & .+] | 2 | C03 |
| C04 | [eagle, parrot] | [LEFT-WALL- & bird+, LEFT-WALL- & has+] | 4 | C04 |
| C05 | [extremity] | [fin- & isa- & .+, isa- & wing- & .+] | 2 | C05 |
| C06 | [fin] | [LEFT-WALL- & extremity+, LEFT-WALL- & scale+,... | 4 | C06 |
| C07 | [fish] | [herring- & isa- & .+, isa- & tuna- & .+] | 2 | C07 |
| C08 | [has] | [eagle- & wing+, herring- & fin+, parrot- & wi... | 5 | C08 |
| C09 | [herring, tuna] | [LEFT-WALL- & fish+, LEFT-WALL- & has+] | 4 | C09 |
| C10 | [isa] | [bird+, extremity+, fish+] | 6 | C10 |
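
The table suggests two steps: each word in a parse gets a disjunct built from its links (`word-` for a link to the left, `word+` for a link to the right), and words that share exactly the same set of disjuncts end up as germs of one cluster, as with `herring` and `tuna`. The sketch below illustrates that idea on four of the parses. It is not the implementation of `mst2disjuncts` or `entries2clusters`; the function names are hypothetical, and the alphabetical connector order is an assumption chosen to mirror what the table appears to show.

```python
from collections import defaultdict

def links_to_disjuncts(links):
    """links: (left_idx, left_word, right_idx, right_word) tuples of one parse.
    Returns a sorted list of (word, disjunct) pairs."""
    left, right = defaultdict(list), defaultdict(list)
    for li, lw, ri, rw in links:
        right[(li, lw)].append(rw)   # the left word links rightwards: 'rw+'
        left[(ri, rw)].append(lw)    # the right word links leftwards: 'lw-'
    pairs = []
    for key in set(left) | set(right):
        # Assumption: connectors sorted alphabetically, '-' before '+'
        connectors = [w + '-' for w in sorted(left[key])] \
                   + [w + '+' for w in sorted(right[key])]
        pairs.append((key[1], ' & '.join(connectors)))
    return sorted(pairs)

def cluster_by_disjunct_set(word_disjunct_pairs):
    """Group words whose sets of disjuncts are identical."""
    per_word = defaultdict(set)
    for word, disjunct in word_disjunct_pairs:
        per_word[word].add(disjunct)
    groups = defaultdict(list)
    for word, djs in per_word.items():
        groups[frozenset(djs)].append(word)
    return sorted((sorted(words), sorted(djs)) for djs, words in groups.items())

W = 'LEFT-WALL'
parses = [
    [(0, W, 1, 'tuna'), (1, 'tuna', 3, 'fish'), (2, 'isa', 3, 'fish'), (3, 'fish', 4, '.')],
    [(0, W, 1, 'herring'), (1, 'herring', 3, 'fish'), (2, 'isa', 3, 'fish'), (3, 'fish', 4, '.')],
    [(0, W, 1, 'tuna'), (1, 'tuna', 2, 'has'), (2, 'has', 3, 'fin'), (3, 'fin', 4, '.')],
    [(0, W, 1, 'herring'), (1, 'herring', 2, 'has'), (2, 'has', 3, 'fin'), (3, 'fin', 4, '.')],
]
pairs = [p for links in parses for p in links_to_disjuncts(links)]
clusters = cluster_by_disjunct_set(pairs)
```

On these four parses `herring` and `tuna` fall into one group with the disjuncts `LEFT-WALL- & fish+` and `LEFT-WALL- & has+`, matching row C09 of the table.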


In [5]:
from src.link_grammar.turtle import disjuncts2clusters, entries2rules
rule_list = entries2rules(disjuncts2clusters(stalks))
display(html_table([['Cluster','Germs','','','Disjuncts']] + rule_list))

| Cluster | Germs | | | Disjuncts |
|---|---|---|---|---|
| C01 | ['.'] | [] | [] | ['C03C01-', 'C05C01-', 'C06C01-', 'C07C01-', 'C11C01-', 'C12C01-'] |
| C02 | ['LEFT-WALL'] | [] | [] | ['C02C04+', 'C02C06+', 'C02C09+', 'C02C12+'] |
| C03 | ['bird'] | [] | [] | ['C04C03- & C10C03- & C03C01+', 'C10C03- & C04C03- & C03C01+'] |
| C04 | ['eagle', 'parrot'] | [] | [] | ['C02C04- & C04C03+', 'C02C04- & C04C08+'] |
| C05 | ['extremity'] | [] | [] | ['C06C05- & C10C05- & C05C01+', 'C10C05- & C12C05- & C05C01+'] |
| C06 | ['fin'] | [] | [] | ['C02C06- & C06C05+', 'C02C06- & C06C11+', 'C08C06- & C06C01+'] |
| C07 | ['fish'] | [] | [] | ['C09C07- & C10C07- & C07C01+', 'C10C07- & C09C07- & C07C01+'] |
| C08 | ['has'] | [] | [] | ['C04C08- & C08C12+', 'C08C11+', 'C09C08- & C08C06+'] |
| C09 | ['herring', 'tuna'] | [] | [] | ['C02C09- & C09C07+', 'C02C09- & C09C08+'] |
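
Comparing this table with the stalks above, the connector names appear to be built from cluster IDs: a link between a word in cluster `C09` and a word to its right in cluster `C07` becomes `C09C07+` on the left word and `C09C07-` on the right word. The sketch below shows that renaming under this assumption. The helper name `to_cluster_connectors` and the `cluster_of` mapping are hypothetical and do not come from `disjuncts2clusters` or `entries2rules`.

```python
def to_cluster_connectors(word, disjunct, cluster_of):
    """Rewrite a word-based disjunct such as 'LEFT-WALL- & fish+'
    into cluster-based connectors such as 'C02C09- & C09C07+'."""
    own = cluster_of[word]
    renamed = []
    for connector in disjunct.split(' & '):
        other, direction = connector[:-1], connector[-1]   # 'fish', '+'
        other_id = cluster_of[other]
        # Assumption: the name is always <left cluster><right cluster>
        name = other_id + own if direction == '-' else own + other_id
        renamed.append(name + direction)
    return ' & '.join(renamed)

# Word-to-cluster mapping read off the tables above (partial)
cluster_of = {'.': 'C01', 'LEFT-WALL': 'C02', 'fish': 'C07', 'has': 'C08',
              'herring': 'C09', 'tuna': 'C09', 'isa': 'C10'}

tuna_rule = to_cluster_connectors('tuna', 'LEFT-WALL- & fish+', cluster_of)
fish_rule = to_cluster_connectors('fish', 'isa- & tuna- & .+', cluster_of)
```

With this mapping the two calls reproduce `C02C09- & C09C07+` and `C10C07- & C09C07- & C07C01+`, both of which appear in the rule table.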


In [6]:
from src.link_grammar.turtle import save_link_grammar
lg_file_string = save_link_grammar(rule_list, path)
for line in lg_file_string.splitlines(): print(line)

% POC Turtle Link Grammar v.0.7 2018-03-16 18:26:12 UTC
<dictionary-version-number>: V0v0v7+;
<dictionary-locale>: EN4us+;

% C01
".":
(C03C01-) or (C05C01-) or (C06C01-) or (C07C01-) or (C11C01-) or (C12C01-);

% C02
"LEFT-WALL":
(C02C04+) or (C02C06+) or (C02C09+) or (C02C12+);

% C03
"bird":
(C04C03- & C10C03- & C03C01+) or (C10C03- & C04C03- & C03C01+);

% C04
"eagle" "parrot":
(C02C04- & C04C03+) or (C02C04- & C04C08+);

% C05
"extremity":
(C06C05- & C10C05- & C05C01+) or (C10C05- & C12C05- & C05C01+);

% C06
"fin":
(C02C06- & C06C05+) or (C02C06- & C06C11+) or (C08C06- & C06C01+);

% C07
"fish":
(C09C07- & C10C07- & C07C01+) or (C10C07- & C09C07- & C07C01+);

% C08
"has":
(C04C08- & C08C12+) or (C08C11+) or (C09C08- & C08C06+);

% C09
"herring" "tuna":
(C02C09- & C09C07+) or (C02C09- & C09C08+);

% C10
"isa":
(C10C03+) or (C10C05+) or (C10C07+);

% C11
"scale":
(C06C11- & C08C11- & C11C01+);

% C12
"wing":
(C02C12- & C12C05+) or (C08C12- & C12C01+);

UNKNOWN-WORD: XXX+;

% 12 wor
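
The dictionary entries printed above follow a regular pattern: a `%` comment with the cluster ID, the quoted germs followed by a colon, and the disjuncts wrapped in parentheses, joined with `or`, and closed with a semicolon. A minimal sketch of that formatting step follows. `format_lg_entry` is a hypothetical helper written for illustration and is not the project's `save_link_grammar`.

```python
def format_lg_entry(cluster_id, germs, disjuncts):
    """Render one cluster as a Link Grammar dictionary entry."""
    words = ' '.join('"{}"'.format(w) for w in germs)         # "eagle" "parrot"
    body = ' or '.join('({})'.format(d) for d in disjuncts)   # (..) or (..)
    return '% {}\n{}:\n{};\n'.format(cluster_id, words, body)

entry = format_lg_entry('C04', ['eagle', 'parrot'],
                        ['C02C04- & C04C03+', 'C02C04- & C04C08+'])
print(entry)
```

The printed entry has the same shape as the `% C04` block in the dictionary above.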

### Link Grammar parsing tests with the dictionary learned from MST-parses.
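The learned Link Grammar dictionary was then used with the Link Grammar parser to parse the 12 sentences of the "Turtle" corpus. In the output below, sentences whose words appear in square brackets (for example `[fin] [isa] [extremity.]`) received no links: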
```
tuna isa fish. : Found 1 linkage (1 had no P.P. violations)

            +--C09C07--+       
    +-C02C09+   +C10C07+C07C01+
    |       |   |      |      |
LEFT-WALL tuna isa   fish     . 


herring isa fish. : Found 1 linkage (1 had no P.P. violations)

             +---C09C07---+       
    +-C02C09-+     +C10C07+C07C01+
    |        |     |      |      |
LEFT-WALL herring isa   fish     . 


tuna has fin. : Found 1 linkage (1 had no P.P. violations)

    +-C02C09+C09C08+C08C06+C06C01+
    |       |      |      |      |
LEFT-WALL tuna    has    fin     . 


herring has fin. : Found 1 linkage (1 had no P.P. violations)

    +-C02C09-+C09C08+C08C06+C06C01+
    |        |      |      |      |
LEFT-WALL herring  has    fin     . 


parrot isa bird. : Found 1 linkage (1 had no P.P. violations)

             +---C04C03--+       
    +-C02C04-+    +C10C03+C03C01+
    |        |    |      |      |
LEFT-WALL parrot isa   bird     . 


eagle isa bird. : Found 1 linkage (1 had no P.P. violations)

            +---C04C03--+       
    +-C02C04+    +C10C03+C03C01+
    |       |    |      |      |
LEFT-WALL eagle isa   bird     . 


parrot has wing. : Found 1 linkage (1 had no P.P. violations)

    +-C02C04-+C04C08+C08C12+C12C01+
    |        |      |      |      |
LEFT-WALL parrot   has   wing     . 


eagle has wing. : Found 1 linkage (1 had no P.P. violations)

    +-C02C04+C04C08+C08C12+C12C01+
    |       |      |      |      |
LEFT-WALL eagle   has   wing     . 


fin isa extremity. : Found 1 linkage (1 had no P.P. violations)

LEFT-WALL [fin] [isa] [extremity.] 


wing isa extremity. : Found 1 linkage (1 had no P.P. violations)

            +--C12C05--+       
    +-C02C12+   +C10C05+C05C01+
    |       |   |      |      |
LEFT-WALL wing isa extremity  . 


fin has scale. : Found 1 linkage (1 had no P.P. violations)

LEFT-WALL [fin] [has] [scale.] 


wing has feather. : Found 1 linkage (1 had no P.P. violations)

LEFT-WALL [wing] [has] [feather.] 

```
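
To see at a glance how many sentences were fully linked, the parser output can be scanned for word lines that contain bracketed tokens. This is a minimal sketch that assumes the text layout shown above, where each linkage ends with a line starting with `LEFT-WALL`. The `summarize_linkages` helper is hypothetical and does not use the Link Grammar API.

```python
import re

def summarize_linkages(parser_output):
    """Return (number of sentences, list of bracketed-word lists)."""
    total, unlinked = 0, []
    for line in parser_output.splitlines():
        if line.startswith('LEFT-WALL'):          # the word line of a linkage
            total += 1
            skipped = re.findall(r'\[([^\]]+)\]', line)
            if skipped:                           # bracketed tokens found
                unlinked.append(skipped)
    return total, unlinked

sample_output = """tuna has fin. : Found 1 linkage (1 had no P.P. violations)

    +-C02C09+C09C08+C08C06+C06C01+
    |       |      |      |      |
LEFT-WALL tuna    has    fin     .

fin has scale. : Found 1 linkage (1 had no P.P. violations)

LEFT-WALL [fin] [has] [scale.]
"""
total, unlinked = summarize_linkages(sample_output)
print(total, 'sentences,', len(unlinked), 'with bracketed words')
```

Applied to the full output above, such a count would cover 12 sentences, three of which ("fin isa extremity.", "fin has scale.", "wing has feather.") contain only bracketed words.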

## 8.3 First tests on POC-English corpus

In [7]:
def pipeline(input_file, left_wall='', period=False, verbose='none'):
    from src.space.turtle import mst2disjuncts
    from src.link_grammar.turtle import lexical_entries, entries2clusters, \
         disjuncts2clusters, entries2rules, save_link_grammar
    # import check_input_file, mst2stalks # this notebook
    check_input_file(input_file, verbose=verbose)
    stalks = mst2stalks(input_file, left_wall, period, verbose)
    if verbose == 'max': print('Stalks:\n',stalks[['germs','disjuncts']])
    rule_list = entries2rules(disjuncts2clusters(stalks))
    if verbose != 'none':
        print('\nLink Grammar rules:')
        display(html_table([['Cluster','Germs','','','Disjuncts']] + rule_list))
    lg_file_string = save_link_grammar(rule_list, path)
    if verbose != 'none': # not in ['none', 'min']:
        print('\nLink Grammar dictionary:\n')
        for line in lg_file_string.splitlines(): print(line)
    else: 
        print('\n'.join(x[2:] for x in lg_file_string.splitlines()[-2:]))

input_file = input_dir+'poc-english_noCaps-parses-window30-distance-fmi.txt'
pipeline(input_file, left_wall='', period=False, verbose='min')

Data file: /home/obaskov/data/clustering_2018/input_data/poc-english_noCaps-parses-window30-distance-fmi.txt
177 unique disjuncts form 229 unique word-disjunct pairs from 515 parsed items

Link Grammar rules:


| Cluster | Germs | | | Disjuncts |
|---|---|---|---|---|
| C01 | ['.'] | [] | [] | ['C05C01-', 'C06C01-', 'C07C01-', 'C09C01-', 'C11C01-', 'C15C01-', 'C17C01-', 'C27C01-', 'C29C01-', 'C30C01-', 'C33C01-', 'C35C01-', 'C38C01-', 'C44C01-'] |
| C02 | ['LEFT-WALL'] | [] | [] | ['C02C03+', 'C02C03+ & C02C13+', 'C02C03+ & C02C25+', 'C02C04+', 'C02C09+', 'C02C13+', 'C02C14+', 'C02C16+', 'C02C25+', 'C02C30+', 'C02C34+'] |
| C03 | ['a'] | [] | [] | ['C02C03-', 'C02C03- & C03C11+', 'C02C03- & C03C29+', 'C02C03- & C03C38+', 'C03C14+', 'C03C15+', 'C03C34+', 'C03C38+', 'C04C03-', 'C18C03-', 'C21C03-', 'C21C03- & C03C15+', 'C26C03-', 'C40C03-', 'C40C03- & C03C14+', 'C41C03-'] |
| C04 | ['are'] | [] | [] | ['C02C04- & C06C04- & C04C03+ & C04C38+'] |
| C05 | ['before'] | [] | [] | ['C15C05- & C40C05- & C05C01+', 'C23C05- & C05C01+', 'C29C05- & C05C01+', 'C40C05- & C05C01+'] |
| C06 | ['binoculars'] | [] | [] | ['C06C04+', 'C18C06- & C06C01+', 'C41C06- & C06C01+'] |
| C07 | ['board'] | [] | [] | ['C28C07- & C07C01+', 'C28C07- & C07C36+', 'C37C07- & C07C36+'] |
| C08 | ['by'] | [] | [] | ['C43C08- & C08C10+'] |
| C09 | ['cake', 'sausage'] | [] | [] | ['C02C09- & C09C15+', 'C02C09- & C09C40+', 'C23C09-', 'C24C09- & C09C01+', 'C24C09- & C09C27+'] |



Link Grammar dictionary:

% POC Turtle Link Grammar v.0.7 2018-03-16 18:26:13 UTC
<dictionary-version-number>: V0v0v7+;
<dictionary-locale>: EN4us+;

% C01
".":
(C05C01-) or (C06C01-) or (C07C01-) or (C09C01-) or (C11C01-) or (C15C01-) or (C17C01-) or (C27C01-) or (C29C01-) or (C30C01-) or (C33C01-) or (C35C01-) or (C38C01-) or (C44C01-);

% C02
"LEFT-WALL":
(C02C03+) or (C02C03+ & C02C13+) or (C02C03+ & C02C25+) or (C02C04+) or (C02C09+) or (C02C13+) or (C02C14+) or (C02C16+) or (C02C25+) or (C02C30+) or (C02C34+);

% C03
"a":
(C02C03-) or (C02C03- & C03C11+) or (C02C03- & C03C29+) or (C02C03- & C03C38+) or (C03C14+) or (C03C15+) or (C03C34+) or (C03C38+) or (C04C03-) or (C18C03-) or (C21C03-) or (C21C03- & C03C15+) or (C26C03-) or (C40C03-) or (C40C03- & C03C14+) or (C41C03-);

% C04
"are":
(C02C04- & C06C04- & C04C03+ & C04C38+);

% C05
"before":
(C15C05- & C40C05- & C05C01+) or (C23C05- & C05C01+) or (C29C05- & C05C01+) or (C40C05- & C05C01+);

% C06
"binoculars":
(C06C04+) or (C1

## TL;DR
**Good news**: The language learning pipeline manages to learn Link Grammar rules for a more complicated "POC-English" corpus.

**Bad news**:
- The rules learned by collecting multi-germ-multi-disjunct lexical entries are too detailed: of the 44 clusters, most consist of a single word. Further clustering is needed to obtain generalised word categories and grammar rules.
- The learned rules could not be verified by the Link Grammar parser (to be amended).