# POC-Turtle-6: Tests
This notebook continues the proof-of-concept (POC) experiments in unsupervised language learning (ULL), part of the OpenCog project hosted on [GitHub](https://github.com/opencog/language-learning/tree/master/notebooks).  
It contains tests for the unsupervised language learning pipeline based on learning lexical entries (disjuncts), as described in the previous [POC-Turtle-5-Lexical-Entries notebook](http://88.99.210.144/data/clustering_2018/html/POC-Turtle-5-Lexical-Entries.html).

In [1]:
import os, sys, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.utl.utl import UTC
from src.utl.turtle import html_table
print(UTC(), ':: module_path:', module_path)

2018-03-11 14:52:51 UTC :: module_path: /home/obaskov/language-learning


## 6.1 Settings, parameters, data.

In [2]:
prj_dir = '../output/Turtle-6-2018-03-10/'  # project directory 
prefix = ''     # all project files will start with this prefix
test_data_path = module_path + '/tests/'
verbose = 'max' # printed comments: 'none', 'min', 'max'
log = {'project': 'POC-Turtle-6: Tests'}

if not os.path.exists(prj_dir):
    os.makedirs(prj_dir)
    print('Project directory created:', module_path + prj_dir[2:])
else: print('Project directory', module_path + prj_dir[2:], 'exists')
path = module_path + prj_dir[2:]
tmpath = path  # module_path + '/tmp/'  # path for temporary files

Project directory created: /home/obaskov/language-learning/output/Turtle-6-2018-03-10/


In [3]:
input_file = '../data/poc-turtle-sentences.txt'
if os.path.isfile(input_file):
    print('Data file:', module_path + input_file[2:],)
    log.update({'input_file': module_path + input_file[2:]})
    if verbose == 'max':
        print('- "Turtle" language corpus:')
        with open(input_file, 'r') as f: 
            lines = f.read().splitlines()
        for i,line in enumerate(lines): 
            if len(line) > 0: print(str(i+1)+'. '+line)    
else: print('No data file', module_path + input_file[2:])

Data file: /home/obaskov/language-learning/data/poc-turtle-sentences.txt
- "Turtle" language corpus:
1. tuna isa fish.
2. herring isa fish.
3. tuna has fin.
4. herring has fin.
5. parrot isa bird.
6. eagle isa bird.
7. parrot has wing.
8. eagle has wing.
9. fin isa extremity.
10. wing isa extremity.
11. fin has scale.
12. wing has feather.


## 6.2 Test "correct" disjuncts without punctuation

In [4]:
from src.space.turtle import dumb_disjuncter
from src.link_grammar.turtle import lexical_entries, entries2clusters, \
    disjuncts2clusters, entries2rules, save_link_grammar

def pipeline(input_file, left_wall='', period=False, verbose='none'):
    parses = dumb_disjuncter(input_file, lw=left_wall, dot=period)
    disjuncts = parses.groupby(['word','disjunct'], as_index=False).sum() \
        .sort_values(by=['count','word','disjunct'], ascending=[False,True,True]) \
        .reset_index(drop=True)
    dj_number = len(set(disjuncts['disjunct'].tolist()))
    if verbose != 'none': print(dj_number, 'unique disjuncts form', \
        len(disjuncts),'unique word-disjunct pairs from', len(parses), 'parsed items') 
    dfg = lexical_entries(disjuncts)
    dfc = entries2clusters(dfg)
    rules = disjuncts2clusters(dfc)
    lg_rule_list = entries2rules(rules)
    return lg_rule_list

lg_rule_list = pipeline(input_file, left_wall='', period=False, verbose='max')
display(html_table([['Cluster','Words','','','Disjuncts']] + lg_rule_list))

16 unique disjuncts form 31 unique word-disjunct pairs from 36 parsed items


| Cluster | Words |  |  | Disjuncts |
|---|---|---|---|---|
| C01 | ['bird', 'extremity', 'fish'] | [] | [] | ['C06C01-'] |
| C02 | ['eagle', 'herring', 'parrot', 'tuna'] | [] | [] | ['C02C05+', 'C02C06+'] |
| C03 | ['feather', 'scale'] | [] | [] | ['C05C03-'] |
| C04 | ['fin', 'wing'] | [] | [] | ['C04C05+', 'C04C06+', 'C05C04-'] |
| C05 | ['has'] | [] | [] | ['C02C05- & C05C04+', 'C04C05- & C05C03+'] |
| C06 | ['isa'] | [] | [] | ['C02C06- & C06C01+', 'C04C06- & C06C01+'] |


### Import and check reference grammar

In [5]:
def test_lg_rules(lg_rule_list, test_data_path, display_reference=False):
    from src.utl.turtle_tests import test_turtle_rules
    passed, file, reference = test_turtle_rules(lg_rule_list, test_data_path, 'True')
    if passed: response = 'matches'
    else: response = 'does not match'
    print('Learned grammar rules list', response, 'the "'+file+'" reference list')
    if display_reference:
        display(html_table([['Cluster','Words','','','Disjuncts']] + reference))
test_lg_rules(lg_rule_list, test_data_path, True)

Learned grammar rules list matches the "/home/obaskov/language-learning/tests/turtle_6c_lg_rules.pkl" reference list


| Cluster | Words |  |  | Disjuncts |
|---|---|---|---|---|
| C01 | ['bird', 'extremity', 'fish'] | [] | [] | ['C06C01-'] |
| C02 | ['eagle', 'herring', 'parrot', 'tuna'] | [] | [] | ['C02C05+', 'C02C06+'] |
| C03 | ['feather', 'scale'] | [] | [] | ['C05C03-'] |
| C04 | ['fin', 'wing'] | [] | [] | ['C04C05+', 'C04C06+', 'C05C04-'] |
| C05 | ['has'] | [] | [] | ['C02C05- & C05C04+', 'C04C05- & C05C03+'] |
| C06 | ['isa'] | [] | [] | ['C02C06- & C06C01+', 'C04C06- & C06C01+'] |


### Test Category Learner

In [6]:
def test_categories(lg_rule_list, test_data_path):
    from src.utl.turtle_tests import test_turtle_word_categories
    passed, file, reference = test_turtle_word_categories(lg_rule_list, test_data_path, 'True')
    if passed: response = 'match'
    else: response = 'do not match'
    print('Learned word categories', response, 'the "'+file+'" reference list:')
    display(html_table([['Categories','Germs',]] + reference))
test_categories(lg_rule_list, test_data_path)

Learned word categories match the "/home/obaskov/language-learning/tests/turtle_6c_categories.pkl" reference list:


| Categories | Germs |
|---|---|
| C01 | ['bird', 'extremity', 'fish'] |
| C02 | ['eagle', 'herring', 'parrot', 'tuna'] |
| C03 | ['feather', 'scale'] |
| C04 | ['fin', 'wing'] |
| C05 | ['has'] |
| C06 | ['isa'] |


### Test Grammar Learner

In [7]:
def test_grammar(lg_rule_list, test_data_path):
    from src.utl.turtle_tests import test_turtle_grammar
    passed, file, reference = test_turtle_grammar(lg_rule_list, test_data_path, 'True')
    if passed: response = 'match'
    else: response = 'do not match'
    print('Learned Link Grammar rules', response, 'the "'+file+'" reference rule list:')
    display(html_table([['Categories','Germs', 'Disjuncts']] + reference))
test_grammar(lg_rule_list, test_data_path)

Learned Link Grammar rules match the "/home/obaskov/language-learning/tests/turtle_6c_grammar.pkl" reference rule list:


| Categories | Germs | Disjuncts |
|---|---|---|
| C01 | ['bird', 'extremity', 'fish'] | ['C06C01-'] |
| C02 | ['eagle', 'herring', 'parrot', 'tuna'] | ['C02C05+', 'C02C06+'] |
| C03 | ['feather', 'scale'] | ['C05C03-'] |
| C04 | ['fin', 'wing'] | ['C04C05+', 'C04C06+', 'C05C04-'] |
| C05 | ['has'] | ['C02C05- & C05C04+', 'C04C05- & C05C03+'] |
| C06 | ['isa'] | ['C02C06- & C06C01+', 'C04C06- & C06C01+'] |


### TODO: add Link Grammar dictionary file test.
As of 2018-03-09 the Link Grammar parser test tool is in an early beta, providing results like the following:


```
tuna isa fish

  +C02C06+C06C01+
  |      |      |
tuna    isa   fish 


herring isa fish

   +C02C06+C06C01+
   |      |      |
herring  isa   fish 


tuna has fin

  +C02C05+C05C04+
  |      |      |
tuna    has    fin 


herring has fin

   +C02C05+C05C04+
   |      |      |
herring  has    fin 


parrot isa bird

   +C02C06+C06C01+
   |      |      |
parrot   isa   bird 


eagle isa bird

  +C02C06+C06C01+
  |      |      |
eagle   isa   bird 


parrot has wing

   +C02C05+C05C04+
   |      |      |
parrot   has   wing 


eagle has wing

  +C02C05+C05C04+
  |      |      |
eagle   has   wing 


fin isa extremity

 +C04C06+C06C01+
 |      |      |
fin    isa extremity 
```

The CLI tool runs only on a server with the Link Grammar parser installed. The next steps in developing the Unsupervised Language Learning pipeline and its tests may include further development and integration of Link Grammar parsing tests.
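Until a parser-based test is integrated, a lightweight check that needs no Link Grammar installation could look like the sketch below. It only verifies that every connector used with `+` somewhere in the learned rules is also used with `-` somewhere else, and vice versa. The function name `unmatched_connectors` and the dictionary layout are hypothetical and not part of the project code; the rules are copied from the 6-cluster table above.

```python
# Learned rules from section 6.2: cluster -> list of disjuncts.
GRAMMAR = {
    'C01': ['C06C01-'],
    'C02': ['C02C05+', 'C02C06+'],
    'C03': ['C05C03-'],
    'C04': ['C04C05+', 'C04C06+', 'C05C04-'],
    'C05': ['C02C05- & C05C04+', 'C04C05- & C05C03+'],
    'C06': ['C02C06- & C06C01+', 'C04C06- & C06C01+'],
}

def unmatched_connectors(grammar):
    """Return connector names that occur with only one direction sign."""
    plus, minus = set(), set()
    for disjuncts in grammar.values():
        for disjunct in disjuncts:
            for connector in disjunct.split(' & '):
                name, direction = connector[:-1], connector[-1]
                (plus if direction == '+' else minus).add(name)
    # Symmetric difference: names seen with '+' only or with '-' only.
    return plus ^ minus

print(unmatched_connectors(GRAMMAR))  # set()
```

An empty result does not guarantee that every sentence parses; it only rules out connectors that could never be linked.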

## 6.3 Test "correct" disjuncts with punctuation

In [8]:
lg_rule_list = pipeline(input_file, left_wall='LEFT-WALL', period=True, verbose='max')
display(html_table([['Cluster','Words','','','Disjuncts']] + lg_rule_list))

29 unique disjuncts form 44 unique word-disjunct pairs from 60 parsed items


| Cluster | Words |  |  | Disjuncts |
|---|---|---|---|---|
| C01 | ['.'] | [] | [] | ['C03C01-', 'C05C01-', 'C06C01-'] |
| C02 | ['LEFT-WALL'] | [] | [] | ['C02C04+', 'C02C06+'] |
| C03 | ['bird', 'extremity', 'fish'] | [] | [] | ['C08C03- & C03C01+'] |
| C04 | ['eagle', 'herring', 'parrot', 'tuna'] | [] | [] | ['C02C04- & C04C07+', 'C02C04- & C04C08+'] |
| C05 | ['feather', 'scale'] | [] | [] | ['C07C05- & C05C01+'] |
| C06 | ['fin', 'wing'] | [] | [] | ['C02C06- & C06C07+', 'C02C06- & C06C08+', 'C07C06- & C06C01+'] |
| C07 | ['has'] | [] | [] | ['C04C07- & C07C06+', 'C06C07- & C07C05+'] |
| C08 | ['isa'] | [] | [] | ['C04C08- & C08C03+', 'C06C08- & C08C03+'] |


In [9]:
# test_lg_rules(lg_rule_list, test_data_path) #, True)

### Test Category Learner

In [10]:
test_categories(lg_rule_list, test_data_path)

Learned word categories match the "/home/obaskov/language-learning/tests/turtle_8c_categories.pkl" reference list:


| Categories | Germs |
|---|---|
| C01 | ['.'] |
| C02 | ['LEFT-WALL'] |
| C03 | ['bird', 'extremity', 'fish'] |
| C04 | ['eagle', 'herring', 'parrot', 'tuna'] |
| C05 | ['feather', 'scale'] |
| C06 | ['fin', 'wing'] |
| C07 | ['has'] |
| C08 | ['isa'] |


### Test Grammar Learner

In [11]:
test_grammar(lg_rule_list, test_data_path)

Learned Link Grammar rules match the "/home/obaskov/language-learning/tests/turtle_8c_grammar.pkl" reference rule list:


| Categories | Germs | Disjuncts |
|---|---|---|
| C01 | ['.'] | ['C03C01-', 'C05C01-', 'C06C01-'] |
| C02 | ['LEFT-WALL'] | ['C02C04+', 'C02C06+'] |
| C03 | ['bird', 'extremity', 'fish'] | ['C08C03- & C03C01+'] |
| C04 | ['eagle', 'herring', 'parrot', 'tuna'] | ['C02C04- & C04C07+', 'C02C04- & C04C08+'] |
| C05 | ['feather', 'scale'] | ['C07C05- & C05C01+'] |
| C06 | ['fin', 'wing'] | ['C02C06- & C06C07+', 'C02C06- & C06C08+', 'C07C06- & C06C01+'] |
| C07 | ['has'] | ['C04C07- & C07C06+', 'C06C07- & C07C05+'] |
| C08 | ['isa'] | ['C04C08- & C08C03+', 'C06C08- & C08C03+'] |


### Link Grammar parsing tests with the learned 8-rule dictionary

```
tuna isa fish

    +-C02C04+C04C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL tuna    isa   fish     . 


herring isa fish.

    +-C02C04-+C04C08+C08C03+C03C01+
    |        |      |      |      |
LEFT-WALL herring  isa   fish     . 


tuna has fin.

    +-C02C04+C04C07+C07C06+C06C01+
    |       |      |      |      |
LEFT-WALL tuna    has    fin     . 


herring has fin.

    +-C02C04-+C04C07+C07C06+C06C01+
    |        |      |      |      |
LEFT-WALL herring  has    fin     . 


parrot isa bird.

    +-C02C04-+C04C08+C08C03+C03C01+
    |        |      |      |      |
LEFT-WALL parrot   isa   bird     . 


eagle isa bird.

    +-C02C04+C04C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL eagle   isa   bird     . 


parrot has wing.

    +-C02C04-+C04C07+C07C06+C06C01+
    |        |      |      |      |
LEFT-WALL parrot   has   wing     . 


eagle has wing.

    +-C02C04+C04C07+C07C06+C06C01+
    |       |      |      |      |
LEFT-WALL eagle   has   wing     . 


fin isa extremity.

    +C02C06+C06C08+C08C03+C03C01+
    |      |      |      |      |
LEFT-WALL fin    isa extremity  . 


wing isa extremity.

    +-C02C06+C06C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL wing    isa extremity  . 


fin has scale.

    +C02C06+C06C07+C07C05+C05C01+
    |      |      |      |      |
LEFT-WALL fin    has   scale    . 


wing has feather.

    +-C02C06+C06C07+C07C05+C05C01+
    |       |      |      |      |
LEFT-WALL wing    has  feather   . 
```


## 6.4 Test "original" MST-parses.
All the previous tests were based on "synthetic" disjuncts, built for each word in every sentence from its neighbouring words. The next step is to learn a Link Grammar from parses of the same "Turtle" corpus produced by the OpenCog MST parser.
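The "synthetic" disjuncts used in sections 6.2 and 6.3 can be illustrated with a minimal sketch: each word gets a `-` connector to its left neighbour and a `+` connector to its right neighbour. This is not the project's `dumb_disjuncter`; the helper names `neighbour_disjuncts` and `summarise` are hypothetical, and the corpus is copied from section 6.1. The sketch reproduces the counts printed by the pipeline in both settings.

```python
from collections import Counter

# The 12 "Turtle" sentences listed in section 6.1.
CORPUS = """tuna isa fish.
herring isa fish.
tuna has fin.
herring has fin.
parrot isa bird.
eagle isa bird.
parrot has wing.
eagle has wing.
fin isa extremity.
wing isa extremity.
fin has scale.
wing has feather."""

def neighbour_disjuncts(sentence, left_wall='', period=False):
    """Return (word, disjunct) pairs built from each word's immediate neighbours."""
    tokens = sentence.strip().rstrip('.').split()
    if left_wall:
        tokens = [left_wall] + tokens
    if period:
        tokens = tokens + ['.']
    pairs = []
    for i, word in enumerate(tokens):
        connectors = []
        if i > 0:
            connectors.append(tokens[i - 1] + '-')   # link to the left neighbour
        if i < len(tokens) - 1:
            connectors.append(tokens[i + 1] + '+')   # link to the right neighbour
        pairs.append((word, ' & '.join(connectors)))
    return pairs

def summarise(left_wall='', period=False):
    """Count unique disjuncts, unique word-disjunct pairs and parsed items."""
    counts = Counter()
    for line in CORPUS.splitlines():
        counts.update(neighbour_disjuncts(line, left_wall, period))
    unique_disjuncts = len({disjunct for _, disjunct in counts})
    return unique_disjuncts, len(counts), sum(counts.values())

print(summarise())                   # (16, 31, 36)
print(summarise('LEFT-WALL', True))  # (29, 44, 60)
```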

In [12]:
input_file = '../data/poc-turtle-opencog-mst-parses.txt'
if os.path.isfile(input_file):
    print('Data file:', module_path + input_file[2:],)
    log.update({'input_file': module_path + input_file[2:]})
    if verbose == 'max':
        print('- "Turtle" language corpus:\n')
        with open(input_file, 'r') as f: 
            lines = f.read().splitlines()
        for line in lines: print(line)
else: print('No data file', module_path + input_file[2:])

Data file: /home/obaskov/language-learning/data/poc-turtle-opencog-mst-parses.txt
- "Turtle" language corpus:

tuna has fin .
0 ###LEFT-WALL### 1 tuna
1 tuna 2 has
2 has 3 fin
3 fin 4 .

eagle isa bird .
0 ###LEFT-WALL### 1 eagle
1 eagle 2 isa
2 isa 3 bird
3 bird 4 .

fin isa extremity .
0 ###LEFT-WALL### 1 fin
1 fin 4 .
2 isa 3 extremity
3 extremity 4 .

tuna isa fish .
0 ###LEFT-WALL### 1 tuna
1 tuna 2 isa
2 isa 3 fish
3 fish 4 .

fin has scale .
0 ###LEFT-WALL### 1 fin
1 fin 3 scale
2 has 3 scale
3 scale 4 .

eagle has wing .
0 ###LEFT-WALL### 1 eagle
1 eagle 2 has
2 has 3 wing
3 wing 4 .

wing has feather .
0 ###LEFT-WALL### 1 wing
1 wing 3 feather
2 has 3 feather
3 feather 4 .

wing isa extremity .
0 ###LEFT-WALL### 1 wing
1 wing 4 .
2 isa 3 extremity
3 extremity 4 .

herring isa fish .
0 ###LEFT-WALL### 1 herring
1 herring 2 isa
2 isa 3 fish
3 fish 4 .

herring has fin .
0 ###LEFT-WALL### 1 herring
1 herring 2 has
2 has 3 fin
3 fin 4 .

parrot isa bird .
0 ###LEFT-WALL### 1 parro

Four of the 12 sentences were parsed differently from the straightforward 0-1, 1-2, 2-3, 3-4 pattern:

```
fin isa extremity .
0 ###LEFT-WALL### 1 fin
1 fin 4 .
2 isa 3 extremity
3 extremity 4 .

fin has scale .
0 ###LEFT-WALL### 1 fin
1 fin 3 scale
2 has 3 scale
3 scale 4 .

wing has feather .
0 ###LEFT-WALL### 1 wing
1 wing 3 feather
2 has 3 feather
3 feather 4 .

wing isa extremity .
0 ###LEFT-WALL### 1 wing
1 wing 4 .
2 isa 3 extremity
3 extremity 4 .
```
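The parse format shown above (a sentence line followed by one `index word index word` line per link) can be read and checked with a short sketch. The helpers `read_parses`, `is_sequential` and `disjuncts_from_links` are hypothetical and are not the project's `mst2disjuncts`. The connector order inside a disjunct (left links first, each side ordered by word position) is an assumption, and the sketch assumes that no word is repeated within a sentence.

```python
from collections import defaultdict

# Two parses copied from the MST parser output above.
MST_SAMPLE = """tuna has fin .
0 ###LEFT-WALL### 1 tuna
1 tuna 2 has
2 has 3 fin
3 fin 4 .

fin has scale .
0 ###LEFT-WALL### 1 fin
1 fin 3 scale
2 has 3 scale
3 scale 4 .
"""

def read_parses(text):
    """Split the text into (sentence, links) blocks separated by blank lines."""
    parses = []
    for block in text.strip().split('\n\n'):
        lines = block.strip().splitlines()
        links = []
        for line in lines[1:]:
            i, left, j, right = line.split()
            links.append((int(i), left, int(j), right))
        parses.append((lines[0], links))
    return parses

def is_sequential(links):
    """True if the links follow the plain 0-1, 1-2, 2-3, ... pattern."""
    index_pairs = [(i, j) for i, _, j, _ in links]
    return index_pairs == [(k, k + 1) for k in range(len(links))]

def disjuncts_from_links(links):
    """Map each word to a disjunct built from its links (left ones first)."""
    connectors = defaultdict(list)
    for i, left, j, right in links:
        left, right = left.strip('#'), right.strip('#')
        connectors[left].append((j, right + '+'))   # link to a word on the right
        connectors[right].append((i, left + '-'))   # link to a word on the left
    disjuncts = {}
    for word, items in connectors.items():
        ordered = sorted(items, key=lambda item: (item[1][-1] == '+', item[0]))
        disjuncts[word] = ' & '.join(connector for _, connector in ordered)
    return disjuncts

for sentence, links in read_parses(MST_SAMPLE):
    print(sentence, '| sequential:', is_sequential(links))
    print(disjuncts_from_links(links))
```

For "fin has scale ." the word "scale" receives two left links, which yields the three-connector disjunct `fin- & has- & .+`.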


In [13]:
#def pipeline(input_file, left_wall='', period=False, verbose='none'):
from src.space.turtle import mst2disjuncts
from src.link_grammar.turtle import lexical_entries, entries2clusters, \
     disjuncts2clusters, entries2rules, save_link_grammar
parses = mst2disjuncts(input_file, lw='', dot=False)
disjuncts = parses.groupby(['word','disjunct'], as_index=False).sum() \
    .sort_values(by=['count','word','disjunct'], ascending=[False,True,True]) \
    .reset_index(drop=True)
dj_number = len(set(disjuncts['disjunct'].tolist()))
if verbose != 'none': print(dj_number, 'unique disjuncts form', \
    len(disjuncts),'unique word-disjunct pairs from', len(parses), 'parsed items') 
dfc = entries2clusters(lexical_entries(disjuncts))
dfc

34 unique disjuncts form 44 unique word-disjunct pairs from 60 parsed items


|  | germs | disjuncts | counts | cluster |
|---|---|---|---|---|
| C01 | [.] | [bird-, extremity- & fin-, extremity- & wing-,... | 12 | C01 |
| C02 | [LEFT-WALL] | [eagle+, fin+, herring+, parrot+, tuna+, wing+] | 12 | C02 |
| C03 | [bird, extremity, fish] | [isa- & .+] | 6 | C03 |
| C04 | [eagle, herring, parrot, tuna] | [LEFT-WALL- & has+, LEFT-WALL- & isa+] | 8 | C04 |
| C05 | [feather] | [has- & wing- & .+] | 1 | C05 |
| C06 | [fin] | [LEFT-WALL- & .+, LEFT-WALL- & scale+, has- & .+] | 4 | C06 |
| C07 | [has] | [eagle- & wing+, feather+, herring- & fin+, pa... | 6 | C07 |
| C08 | [isa] | [eagle- & bird+, extremity+, herring- & fish+,... | 6 | C08 |
| C09 | [scale] | [fin- & has- & .+] | 1 | C09 |
| C10 | [wing] | [LEFT-WALL- & .+, LEFT-WALL- & feather+, has- ... | 4 | C10 |


The parsing specifics of the four sentences mentioned above led to the formation of two separate clusters for "fin" and "wing", each with its own set of rules.
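The split can be illustrated with a small sketch that groups words whose disjunct sets are identical. This grouping criterion is an assumption that is consistent with the tables in this notebook; it is not a description of how `entries2clusters` is implemented, and `cluster_by_disjunct_sets` is a hypothetical helper. The disjunct sets are taken from the sentences and parses shown above.

```python
from collections import defaultdict

def cluster_by_disjunct_sets(word_disjuncts):
    """Group words that share exactly the same set of disjuncts."""
    groups = defaultdict(list)
    for word, disjuncts in word_disjuncts.items():
        groups[frozenset(disjuncts)].append(word)
    return sorted(sorted(words) for words in groups.values())

# Neighbour-based disjuncts (section 6.2): both words behave identically.
synthetic = {
    'fin':  {'has+', 'isa+', 'has-'},
    'wing': {'has+', 'isa+', 'has-'},
}

# Disjuncts from the MST parses: one disjunct differs between the two words.
mst = {
    'fin':  {'LEFT-WALL- & .+', 'LEFT-WALL- & scale+', 'has- & .+'},
    'wing': {'LEFT-WALL- & .+', 'LEFT-WALL- & feather+', 'has- & .+'},
}

print(cluster_by_disjunct_sets(synthetic))  # [['fin', 'wing']]
print(cluster_by_disjunct_sets(mst))        # [['fin'], ['wing']]
```

Because "scale" and "feather" also end up with different disjuncts in the MST parses, "fin" links to one and "wing" to the other, and the two words no longer share a cluster.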

In [14]:
rules = disjuncts2clusters(dfc)
rule_list = entries2rules(rules)
#return rule_list
#rule_list = ppline(input_file, left_wall='', period=False, verbose='max')
display(html_table([['Cluster','Words','','','Disjuncts']] + rule_list))

| Cluster | Words |  |  | Disjuncts |
|---|---|---|---|---|
| C01 | ['.'] | [] | [] | ['C03C01-', 'C03C01- & C06C01-', 'C03C01- & C10C01-', 'C05C01-', 'C06C01-', 'C09C01-', 'C10C01-'] |
| C02 | ['LEFT-WALL'] | [] | [] | ['C02C04+', 'C02C06+', 'C02C10+'] |
| C03 | ['bird', 'extremity', 'fish'] | [] | [] | ['C08C03- & C03C01+'] |
| C04 | ['eagle', 'herring', 'parrot', 'tuna'] | [] | [] | ['C02C04- & C04C07+', 'C02C04- & C04C08+'] |
| C05 | ['feather'] | [] | [] | ['C07C05- & C10C05- & C05C01+'] |
| C06 | ['fin'] | [] | [] | ['C02C06- & C06C01+', 'C02C06- & C06C09+', 'C07C06- & C06C01+'] |
| C07 | ['has'] | [] | [] | ['C04C07- & C07C06+', 'C04C07- & C07C10+', 'C07C05+', 'C07C09+'] |
| C08 | ['isa'] | [] | [] | ['C04C08- & C08C03+', 'C08C03+'] |
| C09 | ['scale'] | [] | [] | ['C06C09- & C07C09- & C09C01+'] |


In [15]:
from src.link_grammar.turtle import save_link_grammar
lg_file_string = save_link_grammar(rule_list, path)
for line in lg_file_string.splitlines(): print(line)

% POC Turtle Link Grammar v.0.6 2018-03-11 14:52:52 UTC
<dictionary-version-number>: V0v0v6+;
<dictionary-locale>: EN4us+;

% C01
".":
(C03C01-) or (C03C01- & C06C01-) or (C03C01- & C10C01-) or (C05C01-) or (C06C01-) or (C09C01-) or (C10C01-);

% C02
"LEFT-WALL":
(C02C04+) or (C02C06+) or (C02C10+);

% C03
"bird" "extremity" "fish":
(C08C03- & C03C01+);

% C04
"eagle" "herring" "parrot" "tuna":
(C02C04- & C04C07+) or (C02C04- & C04C08+);

% C05
"feather":
(C07C05- & C10C05- & C05C01+);

% C06
"fin":
(C02C06- & C06C01+) or (C02C06- & C06C09+) or (C07C06- & C06C01+);

% C07
"has":
(C04C07- & C07C06+) or (C04C07- & C07C10+) or (C07C05+) or (C07C09+);

% C08
"isa":
(C04C08- & C08C03+) or (C08C03+);

% C09
"scale":
(C06C09- & C07C09- & C09C01+);

% C10
"wing":
(C02C10- & C10C01+) or (C02C10- & C10C05+) or (C07C10- & C10C01+);

% 10 word clusters, 10 Link Grammar rules.
% Link Grammar file saved to: /home/obaskov/language-learning/output/Turtle-6-2018-03-10/poc-turtle_10C_2018-03-11_0006.4.0
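The layout of the rule entries in the dictionary printed above can be reproduced with a short sketch: a comment line with the cluster name, the quoted words followed by a colon, and the disjuncts joined with `or`. This is not the project's `save_link_grammar`; `render_rules` is a hypothetical helper, it takes simplified `(cluster, words, disjuncts)` triples rather than the five-column rows of `rule_list`, and it omits the header and footer lines.

```python
def render_rules(rules):
    """Render (cluster, words, disjuncts) triples as Link Grammar dictionary entries."""
    lines = []
    for cluster, words, disjuncts in rules:
        lines.append('% ' + cluster)
        lines.append(' '.join('"' + word + '"' for word in words) + ':')
        lines.append(' or '.join('(' + disjunct + ')' for disjunct in disjuncts) + ';')
        lines.append('')  # blank line between entries
    return '\n'.join(lines)

# Two rules copied from the learned grammar above.
sample_rules = [
    ('C03', ['bird', 'extremity', 'fish'], ['C08C03- & C03C01+']),
    ('C08', ['isa'], ['C04C08- & C08C03+', 'C08C03+']),
]

print(render_rules(sample_rules))
```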

### Link Grammar parsing tests with the dictionary learned from MST-parses.
The learned Link Grammar dictionary was used with the Link Grammar parser to parse the 12 sentences of the "Turtle" corpus, with the following results:
```
1. tuna isa fish.

    +-C02C04+C04C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL tuna    isa   fish     . 


2. herring isa fish.

    +-C02C04-+C04C08+C08C03+C03C01+
    |        |      |      |      |
LEFT-WALL herring  isa   fish     . 


3. tuna has fin.

    +-C02C04+C04C07+C07C06+C06C01+
    |       |      |      |      |
LEFT-WALL tuna    has    fin     . 


4. herring has fin.

    +-C02C04-+C04C07+C07C06+C06C01+
    |        |      |      |      |
LEFT-WALL herring  has    fin     . 


5. parrot isa bird.

    +-C02C04-+C04C08+C08C03+C03C01+
    |        |      |      |      |
LEFT-WALL parrot   isa   bird     . 


6. eagle isa bird.

    +-C02C04+C04C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL eagle   isa   bird     . 


7. parrot has wing.

    +-C02C04-+C04C07+C07C10+C10C01+
    |        |      |      |      |
LEFT-WALL parrot   has   wing     . 


8. eagle has wing.

    +-C02C04+C04C07+C07C10+C10C01+
    |       |      |      |      |
LEFT-WALL eagle   has   wing     . 


9. fin isa extremity.

           +------C06C01-----+
    +C02C06+   +C08C03+C03C01+
    |      |   |      |      |
LEFT-WALL fin isa extremity  . 


10. wing isa extremity.

            +------C10C01-----+
    +-C02C10+   +C08C03+C03C01+
    |       |   |      |      |
LEFT-WALL wing isa extremity  . 


11. fin has scale.

    +C02C06+-----C06C01-----+
    |      |                |
LEFT-WALL fin [has] [scale] . 


12. wing has feather.

            +--C10C05--+       
    +-C02C10+   +C07C05+C05C01+
    |       |   |      |      |
LEFT-WALL wing has  feather   . 
```


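The last four sentences are parsed differently because of the non-sequential MST parses of these sentences noted above. In sentence 11, the bracketed words `[has]` and `[scale]` were left unlinked by the parser.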