# POC-Turtle-5: Lexical Entries | updated 2018-03-07 
This notebook continues the proof-of-concept (POC) experiments in unsupervised language learning (ULL), an OpenCog project hosted on [GitHub](https://github.com/opencog/language-learning/tree/master/notebooks). The February 2018 efforts and results are summarized in a [static HTML copy of the POC-Turtle-4-Grammar-Learning.ipynb notebook](http://88.99.210.144/data/clustering_2018/html/POC-Turtle-4-Grammar-Learning.html).

In [1]:
import os, sys, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
from src.utl.utl import UTC
from src.utl.turtle import html_table
print(UTC(), ':: module_path:', module_path)

2018-03-09 09:17:34 UTC :: module_path: /home/obaskov/language-learning


## 5.1 Settings, parameters, data.

In [2]:
prj_dir = '../output/Turtle-5-2018-03-09/'  # project directory 
prefix = ''     # all project files will start with this prefix
verbose = 'max' # printed comments: 'none', 'min', 'max'
log = {'project': 'POC-Turtle-5+: Lexical Entries'}

if not os.path.exists(prj_dir):
    os.makedirs(prj_dir)
    print('Project directory created:', module_path + prj_dir[2:])
else: print('Project directory', module_path + prj_dir[2:], 'exists')
path = module_path + prj_dir[2:]        # path to store vectors
tmpath = path  # module_path + '/tmp/'  # path for temporary files

Project directory created: /home/obaskov/language-learning/output/Turtle-5-2018-03-09/


In [3]:
input_file = '../data/poc-turtle-sentences.txt'
if os.path.isfile(input_file):
    print('Data file:', module_path + input_file[2:],)
    log.update({'input_file': module_path + input_file[2:]})
    if verbose == 'max':
        print('- "Turtle" language corpus:')
        with open(input_file, 'r') as f: 
            lines = f.read().splitlines()
        for i,line in enumerate(lines): 
            if len(line) > 0: print(str(i+1)+'. '+line)    
else: print('No data file', module_path + input_file[2:])

Data file: /home/obaskov/language-learning/data/poc-turtle-sentences.txt
- "Turtle" language corpus:
1. tuna isa fish.
2. herring isa fish.
3. tuna has fin.
4. herring has fin.
5. parrot isa bird.
6. eagle isa bird.
7. parrot has wing.
8. eagle has wing.
9. fin isa extremity.
10. wing isa extremity.
11. fin has scale.
12. wing has feather.


## 5.2 Parse "synthetic" disjuncts from sentences.
We create a pseudo-disjunct for each word in each sentence from its left and right neighbour words.
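
The neighbour-based construction can be sketched in a few lines of plain Python. This is not the code of `dumb_disjuncter`; the function `neighbour_disjuncts` and its arguments are hypothetical names, and the corpus is copied inline so the sketch runs on its own. It assumes that every word links only to its immediate left and right neighbours.

```python
from collections import Counter

def neighbour_disjuncts(sentences, left_wall="LEFT-WALL", dot=True):
    """Count (word, disjunct) pairs built from adjacent words only (hypothetical helper)."""
    pairs = Counter()
    for sentence in sentences:
        words = sentence.strip().rstrip(".").split()
        if not words:
            continue
        # Optionally wrap the sentence in a left wall and a final period token
        tokens = ([left_wall] if left_wall else []) + words + (["."] if dot else [])
        for i, word in enumerate(tokens):
            connectors = []
            if i > 0:
                connectors.append(tokens[i - 1] + "-")  # link to the left neighbour
            if i < len(tokens) - 1:
                connectors.append(tokens[i + 1] + "+")  # link to the right neighbour
            pairs[(word, " & ".join(connectors))] += 1
    return pairs

corpus = [
    "tuna isa fish.", "herring isa fish.", "tuna has fin.", "herring has fin.",
    "parrot isa bird.", "eagle isa bird.", "parrot has wing.", "eagle has wing.",
    "fin isa extremity.", "wing isa extremity.", "fin has scale.", "wing has feather.",
]

pairs = neighbour_disjuncts(corpus)
n_words = len({word for word, _ in pairs})
n_disjuncts = len({disjunct for _, disjunct in pairs})
print(sum(pairs.values()), "disjuncts parsed")
print(n_words, "words and", n_disjuncts, "unique disjuncts form",
      len(pairs), "unique word-disjunct pairs")
```

Passing `left_wall=""` and `dot=False` drops the wall and the period tokens, which corresponds to the punctuation-free variant used later in this notebook.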

In [4]:
from src.space.turtle import dumb_disjuncter
left_wall = 'LEFT-WALL'  # Left wall symbol, 'none' - don't use Left-Wall
parses = dumb_disjuncter(input_file, left_wall, dot=True,)
if verbose != 'none': print(len(parses),'disjuncts parsed from', input_file) 
disjuncts = parses.groupby(['word','disjunct'], as_index=False).sum() \
    .sort_values(by=['count','word','disjunct'], ascending=[False,True,True]) \
    .reset_index(drop=True)
dj_number = len(set(disjuncts['disjunct'].tolist()))
word_number = len(set(disjuncts['word'].tolist()))
if verbose != 'none': print(word_number, 'words and', dj_number, \
    'unique disjuncts form', len(disjuncts),'unique word-disjunct pairs') 
disjuncts

60 disjuncts parsed from ../data/poc-turtle-sentences.txt
15 words and 29 unique disjuncts form 44 unique word-disjunct pairs


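| | word | disjunct | count |
|---|---|---|---|
| 0 | . | bird- | 2 |
| 1 | . | extremity- | 2 |
| 2 | . | fin- | 2 |
| 3 | . | fish- | 2 |
| 4 | . | wing- | 2 |
| 5 | LEFT-WALL | eagle+ | 2 |
| 6 | LEFT-WALL | fin+ | 2 |
| 7 | LEFT-WALL | herring+ | 2 |
| 8 | LEFT-WALL | parrot+ | 2 |
| 9 | LEFT-WALL | tuna+ | 2 |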


## 5.3 Learn multi-germ-multi-disjunct lexical entries
We cluster words and disjuncts into unique multi-germ-multi-disjunct lexical entries, following the Link Grammar rules. In effect, these entries are already low-level Link Grammar rules, except that Link Grammar would not accept word-based disjuncts.
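
Judging by the cluster table in this section, words end up in the same entry when they carry exactly the same set of disjuncts. A minimal sketch of that grouping is given here, assuming this reading is right. The helper `group_germs` is a hypothetical name and is not the implementation of `lexical_entries` or `entries2clusters`; the word-to-disjunct data is typed in by hand from the Turtle corpus.

```python
from collections import defaultdict

def group_germs(word_disjuncts):
    """Merge words (germs) that have identical disjunct sets (hypothetical helper)."""
    groups = defaultdict(list)
    for word, disjuncts in word_disjuncts.items():
        groups[frozenset(disjuncts)].append(word)
    # Sort entries by their germ lists so cluster numbering is reproducible
    entries = sorted((sorted(germs), sorted(djs)) for djs, germs in groups.items())
    return {"C%02d" % (i + 1): entry for i, entry in enumerate(entries)}

word_disjuncts = {
    ".": {"bird-", "extremity-", "feather-", "fin-", "fish-", "scale-", "wing-"},
    "LEFT-WALL": {"eagle+", "fin+", "herring+", "parrot+", "tuna+", "wing+"},
    "has": {"eagle- & wing+", "fin- & scale+", "herring- & fin+",
            "parrot- & wing+", "tuna- & fin+", "wing- & feather+"},
    "isa": {"eagle- & bird+", "fin- & extremity+", "herring- & fish+",
            "parrot- & bird+", "tuna- & fish+", "wing- & extremity+"},
}
for w in ("bird", "extremity", "fish"):
    word_disjuncts[w] = {"isa- & .+"}
for w in ("eagle", "herring", "parrot", "tuna"):
    word_disjuncts[w] = {"LEFT-WALL- & has+", "LEFT-WALL- & isa+"}
for w in ("feather", "scale"):
    word_disjuncts[w] = {"has- & .+"}
for w in ("fin", "wing"):
    word_disjuncts[w] = {"LEFT-WALL- & has+", "LEFT-WALL- & isa+", "has- & .+"}

entries = group_germs(word_disjuncts)
for cluster, (germs, disjuncts) in entries.items():
    print(cluster, germs, disjuncts)
```

Grouping on a `frozenset` key means two words merge only on an exact match of their disjunct sets; any softer notion of similarity would need a separate step.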

In [5]:
from src.link_grammar.turtle import entries2clusters, lexical_entries
dfc = entries2clusters(lexical_entries(disjuncts))
dfc

| | germs | disjuncts | counts | cluster |
|---|---|---|---|---|
| C01 | [.] | [bird-, extremity-, feather-, fin-, fish-, sca... | 12 | C01 |
| C02 | [LEFT-WALL] | [eagle+, fin+, herring+, parrot+, tuna+, wing+] | 12 | C02 |
| C03 | [bird, extremity, fish] | [isa- & .+] | 6 | C03 |
| C04 | [eagle, herring, parrot, tuna] | [LEFT-WALL- & has+, LEFT-WALL- & isa+] | 8 | C04 |
| C05 | [feather, scale] | [has- & .+] | 2 | C05 |
| C06 | [fin, wing] | [LEFT-WALL- & has+, LEFT-WALL- & isa+, has- & .+] | 8 | C06 |
| C07 | [has] | [eagle- & wing+, fin- & scale+, herring- & fin... | 6 | C07 |
| C08 | [isa] | [eagle- & bird+, fin- & extremity+, herring- &... | 6 | C08 |


In [6]:
from src.link_grammar.turtle import entries2rules
display(html_table([['Cluster','Words','','','Disjuncts']] + entries2rules(dfc)))

| Cluster | Words | | | Disjuncts |
|---|---|---|---|---|
| C01 | ['.'] | [] | [] | ['bird-', 'extremity-', 'feather-', 'fin-', 'fish-', 'scale-', 'wing-'] |
| C02 | ['LEFT-WALL'] | [] | [] | ['eagle+', 'fin+', 'herring+', 'parrot+', 'tuna+', 'wing+'] |
| C03 | ['bird', 'extremity', 'fish'] | [] | [] | ['isa- & .+'] |
| C04 | ['eagle', 'herring', 'parrot', 'tuna'] | [] | [] | ['LEFT-WALL- & has+', 'LEFT-WALL- & isa+'] |
| C05 | ['feather', 'scale'] | [] | [] | ['has- & .+'] |
| C06 | ['fin', 'wing'] | [] | [] | ['LEFT-WALL- & has+', 'LEFT-WALL- & isa+', 'has- & .+'] |
| C07 | ['has'] | [] | [] | ['eagle- & wing+', 'fin- & scale+', 'herring- & fin+', 'parrot- & wing+', 'tuna- & fin+', 'wing- & feather+'] |
| C08 | ['isa'] | [] | [] | ['eagle- & bird+', 'fin- & extremity+', 'herring- & fish+', 'parrot- & bird+', 'tuna- & fish+', 'wing- & extremity+'] |


## 5.4 Learn Link Grammar rules from multi-germ-multi-disjunct lexical entries
The next step, from multi-germ-multi-disjunct lexical entries to Link Grammar rules, is to replace the words in disjuncts with cluster identifiers.
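
The substitution can be illustrated with a short sketch. The connector naming used here (left cluster id followed by right cluster id, then the direction sign) is inferred from the rule tables in this section and is an assumption about what `disjuncts2clusters` does. The function `to_cluster_connectors` is a hypothetical name, and the entries are typed in by hand.

```python
def to_cluster_connectors(entries):
    """Rewrite word-based connectors as cluster-pair connectors (hypothetical helper)."""
    word2cluster = {w: cid for cid, (germs, _) in entries.items() for w in germs}
    rules = {}
    for cid, (germs, disjuncts) in entries.items():
        translated = set()  # several word-level disjuncts may collapse into one
        for disjunct in disjuncts:
            connectors = []
            for connector in disjunct.split(" & "):
                neighbour, direction = connector[:-1], connector[-1]
                other = word2cluster[neighbour]
                # Label = cluster on the left of the link + cluster on the right
                label = other + cid if direction == "-" else cid + other
                connectors.append(label + direction)
            translated.add(" & ".join(connectors))
        rules[cid] = (germs, sorted(translated))
    return rules

entries = {
    "C01": (["."], ["bird-", "extremity-", "feather-", "fin-", "fish-", "scale-", "wing-"]),
    "C02": (["LEFT-WALL"], ["eagle+", "fin+", "herring+", "parrot+", "tuna+", "wing+"]),
    "C03": (["bird", "extremity", "fish"], ["isa- & .+"]),
    "C04": (["eagle", "herring", "parrot", "tuna"], ["LEFT-WALL- & has+", "LEFT-WALL- & isa+"]),
    "C05": (["feather", "scale"], ["has- & .+"]),
    "C06": (["fin", "wing"], ["LEFT-WALL- & has+", "LEFT-WALL- & isa+", "has- & .+"]),
    "C07": (["has"], ["eagle- & wing+", "fin- & scale+", "herring- & fin+",
                      "parrot- & wing+", "tuna- & fin+", "wing- & feather+"]),
    "C08": (["isa"], ["eagle- & bird+", "fin- & extremity+", "herring- & fish+",
                      "parrot- & bird+", "tuna- & fish+", "wing- & extremity+"]),
}

cluster_rules = to_cluster_connectors(entries)
for cid, (germs, disjuncts) in cluster_rules.items():
    print(cid, germs, disjuncts)
```

Because both ends of a link derive the same label from the same pair of clusters, a `+` connector on the left word always finds a matching `-` connector on the right word.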

In [7]:
dfc  # lexical_entries # 80307 tmp

| | germs | disjuncts | counts | cluster |
|---|---|---|---|---|
| C01 | [.] | [bird-, extremity-, feather-, fin-, fish-, sca... | 12 | C01 |
| C02 | [LEFT-WALL] | [eagle+, fin+, herring+, parrot+, tuna+, wing+] | 12 | C02 |
| C03 | [bird, extremity, fish] | [isa- & .+] | 6 | C03 |
| C04 | [eagle, herring, parrot, tuna] | [LEFT-WALL- & has+, LEFT-WALL- & isa+] | 8 | C04 |
| C05 | [feather, scale] | [has- & .+] | 2 | C05 |
| C06 | [fin, wing] | [LEFT-WALL- & has+, LEFT-WALL- & isa+, has- & .+] | 8 | C06 |
| C07 | [has] | [eagle- & wing+, fin- & scale+, herring- & fin... | 6 | C07 |
| C08 | [isa] | [eagle- & bird+, fin- & extremity+, herring- &... | 6 | C08 |


In [8]:
from src.link_grammar.turtle import disjuncts2clusters
rules = disjuncts2clusters(dfc)  # DataFrame
rules   # 80307 tmp

| | germs | disjuncts | counts | cluster |
|---|---|---|---|---|
| C01 | [.] | [C03C01-, C05C01-, C06C01-] | 12 | C01 |
| C02 | [LEFT-WALL] | [C02C04+, C02C06+] | 12 | C02 |
| C03 | [bird, extremity, fish] | [C08C03- & C03C01+] | 6 | C03 |
| C04 | [eagle, herring, parrot, tuna] | [C02C04- & C04C07+, C02C04- & C04C08+] | 8 | C04 |
| C05 | [feather, scale] | [C07C05- & C05C01+] | 2 | C05 |
| C06 | [fin, wing] | [C02C06- & C06C07+, C02C06- & C06C08+, C07C06-... | 8 | C06 |
| C07 | [has] | [C04C07- & C07C06+, C06C07- & C07C05+] | 6 | C07 |
| C08 | [isa] | [C04C08- & C08C03+, C06C08- & C08C03+] | 6 | C08 |


In [9]:
lg_rule_list = entries2rules(rules)
display(html_table([['Cluster','Words','','','Disjuncts']] + lg_rule_list))

| Cluster | Words | | | Disjuncts |
|---|---|---|---|---|
| C01 | ['.'] | [] | [] | ['C03C01-', 'C05C01-', 'C06C01-'] |
| C02 | ['LEFT-WALL'] | [] | [] | ['C02C04+', 'C02C06+'] |
| C03 | ['bird', 'extremity', 'fish'] | [] | [] | ['C08C03- & C03C01+'] |
| C04 | ['eagle', 'herring', 'parrot', 'tuna'] | [] | [] | ['C02C04- & C04C07+', 'C02C04- & C04C08+'] |
| C05 | ['feather', 'scale'] | [] | [] | ['C07C05- & C05C01+'] |
| C06 | ['fin', 'wing'] | [] | [] | ['C02C06- & C06C07+', 'C02C06- & C06C08+', 'C07C06- & C06C01+'] |
| C07 | ['has'] | [] | [] | ['C04C07- & C07C06+', 'C06C07- & C07C05+'] |
| C08 | ['isa'] | [] | [] | ['C04C08- & C08C03+', 'C06C08- & C08C03+'] |


The learned rules may sometimes over-generalize, allowing connections not present in the initial corpus. E.g., rule C07 **has: (C04C07- & C07C06+) or (C06C07- & C07C05+)** would treat the sentences "tuna has wing" and "fin has feather" as correct (grammatically they are, and that is true of the real world as well).
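
To see how far rule C07 reaches beyond the corpus, one can enumerate every "has" sentence it licenses and subtract the sentences actually observed. The sketch reads each disjunct of C07 as a (subject cluster, object cluster) pair; it does not call the Link Grammar parser, and the variable names are illustrative.

```python
from itertools import product

# Cluster membership taken from the rule table of this section
clusters = {
    "C04": ["eagle", "herring", "parrot", "tuna"],
    "C05": ["feather", "scale"],
    "C06": ["fin", "wing"],
}

# Rule C07 ("has"): (C04C07- & C07C06+) or (C06C07- & C07C05+),
# read as (cluster on the left, cluster on the right)
has_rule = [("C04", "C06"), ("C06", "C05")]

corpus_has = {
    "tuna has fin", "herring has fin", "parrot has wing",
    "eagle has wing", "fin has scale", "wing has feather",
}

licensed = {
    f"{subject} has {obj}"
    for left, right in has_rule
    for subject, obj in product(clusters[left], clusters[right])
}
over_generated = sorted(licensed - corpus_has)

print(len(licensed), "sentences licensed,", len(corpus_has), "seen in the corpus")
for sentence in over_generated:
    print("not in corpus:", sentence)
```

The same enumeration applies to rule C08 ("isa") by swapping in its two cluster pairs.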

## 5.5 Save Link Grammar dictionary

In [10]:
from src.link_grammar.turtle import save_link_grammar
lg_file_string = save_link_grammar(lg_rule_list, path)
for line in lg_file_string.splitlines(): print(line)

% POC Turtle Link Grammar v.0.6 2018-03-09 09:17:34 UTC
<dictionary-version-number>: V0v0v6+;
<dictionary-locale>: EN4us+;

% C01
".":
(C03C01-) or (C05C01-) or (C06C01-);

% C02
"LEFT-WALL":
(C02C04+) or (C02C06+);

% C03
"bird" "extremity" "fish":
(C08C03- & C03C01+);

% C04
"eagle" "herring" "parrot" "tuna":
(C02C04- & C04C07+) or (C02C04- & C04C08+);

% C05
"feather" "scale":
(C07C05- & C05C01+);

% C06
"fin" "wing":
(C02C06- & C06C07+) or (C02C06- & C06C08+) or (C07C06- & C06C01+);

% C07
"has":
(C04C07- & C07C06+) or (C06C07- & C07C05+);

% C08
"isa":
(C04C08- & C08C03+) or (C06C08- & C08C03+);

% 8 word clusters, 8 Link Grammar rules.
% Link Grammar file saved to: /home/obaskov/language-learning/output/Turtle-5-2018-03-09/poc-turtle_8C_2018-03-09_0006.4.0.dict
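
The layout of the dictionary text printed above is regular enough to reproduce in a few lines. The sketch is not the code of `save_link_grammar`: `format_lg_dict` is a hypothetical helper, it assumes each rule row has the shape `[cluster, words, [], [], disjuncts]` seen in the rule table, and it neither writes a file nor emits the timestamped first comment line.

```python
def format_lg_dict(rule_rows, version="V0v0v6+", locale="EN4us+"):
    """Render rule rows as Link Grammar dictionary text (hypothetical helper)."""
    lines = [
        f"<dictionary-version-number>: {version};",
        f"<dictionary-locale>: {locale};",
        "",
    ]
    for cluster, words, _, _, disjuncts in rule_rows:
        lines.append(f"% {cluster}")
        # Quoted words of the cluster, followed by a colon
        lines.append(" ".join(f'"{w}"' for w in words) + ":")
        # Alternative disjuncts are joined with "or", each one in parentheses
        lines.append(" or ".join(f"({d})" for d in disjuncts) + ";")
        lines.append("")
    n = len(rule_rows)
    lines.append(f"% {n} word clusters, {n} Link Grammar rules.")
    return "\n".join(lines)

rows = [
    ["C01", ["."], [], [], ["C03C01-", "C05C01-", "C06C01-"]],
    ["C02", ["LEFT-WALL"], [], [], ["C02C04+", "C02C06+"]],
    ["C03", ["bird", "extremity", "fish"], [], [], ["C08C03- & C03C01+"]],
    ["C04", ["eagle", "herring", "parrot", "tuna"], [], [],
     ["C02C04- & C04C07+", "C02C04- & C04C08+"]],
    ["C05", ["feather", "scale"], [], [], ["C07C05- & C05C01+"]],
    ["C06", ["fin", "wing"], [], [],
     ["C02C06- & C06C07+", "C02C06- & C06C08+", "C07C06- & C06C01+"]],
    ["C07", ["has"], [], [], ["C04C07- & C07C06+", "C06C07- & C07C05+"]],
    ["C08", ["isa"], [], [], ["C04C08- & C08C03+", "C06C08- & C08C03+"]],
]

text = format_lg_dict(rows)
print(text)
```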


## 5.6 Test learned Link Grammar dictionary with Link Grammar parser
The dictionary was tested with an external CLI Python API to the Link Grammar parser on the server. The code is in early beta. The learned grammar was tested on all 12 "Turtle corpus" sentences, and 100% of the sentences were parsed successfully:
```
tuna isa fish.

    +-C02C04+C04C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL tuna    isa   fish     . 


herring isa fish.

    +-C02C04-+C04C08+C08C03+C03C01+
    |        |      |      |      |
LEFT-WALL herring  isa   fish     . 


tuna has fin.

    +-C02C04+C04C07+C07C06+C06C01+
    |       |      |      |      |
LEFT-WALL tuna    has    fin     . 


herring has fin.

    +-C02C04-+C04C07+C07C06+C06C01+
    |        |      |      |      |
LEFT-WALL herring  has    fin     . 


parrot isa bird.

    +-C02C04-+C04C08+C08C03+C03C01+
    |        |      |      |      |
LEFT-WALL parrot   isa   bird     . 


eagle isa bird.

    +-C02C04+C04C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL eagle   isa   bird     . 


parrot has wing.

    +-C02C04-+C04C07+C07C06+C06C01+
    |        |      |      |      |
LEFT-WALL parrot   has   wing     . 


eagle has wing.

    +-C02C04+C04C07+C07C06+C06C01+
    |       |      |      |      |
LEFT-WALL eagle   has   wing     . 


fin isa extremity.

    +C02C06+C06C08+C08C03+C03C01+
    |      |      |      |      |
LEFT-WALL fin    isa extremity  . 


wing isa extremity.

    +-C02C06+C06C08+C08C03+C03C01+
    |       |      |      |      |
LEFT-WALL wing    isa extremity  . 


fin has scale.

    +C02C06+C06C07+C07C05+C05C01+
    |      |      |      |      |
LEFT-WALL fin    has   scale    . 


wing has feather.

    +-C02C06+C06C07+C07C05+C05C01+
    |       |      |      |      |
LEFT-WALL wing    has  feather   . 
```


## 5.7 Link Grammar for a dataset with removed punctuation
Let's collect the grammar learning pipeline into a single function and test it on the same dataset parsed without ###LEFT-WALL### and without the period.

In [11]:
def pipeline(input_file, left_wall='', period=False, verbose='none'):
    # Parse pseudo-disjuncts; `period` is passed through instead of a hard-coded dot=False
    parses = dumb_disjuncter(input_file, lw=left_wall, dot=period)
    # Sum counts per (word, disjunct) pair, most frequent pairs first
    disjuncts = parses.groupby(['word','disjunct'], as_index=False).sum() \
        .sort_values(by=['count','word','disjunct'], ascending=[False,True,True]) \
        .reset_index(drop=True)
    dj_number = len(set(disjuncts['disjunct'].tolist()))
    if verbose != 'none': print(dj_number, 'unique disjuncts form', \
        len(disjuncts),'unique word-disjunct pairs from', len(parses), 'parsed items') 
    dfg = lexical_entries(disjuncts)     # multi-germ-multi-disjunct lexical entries
    dfc = entries2clusters(dfg)          # entries -> numbered clusters
    rules = disjuncts2clusters(dfc)      # words in disjuncts -> cluster identifiers
    lg_rule_list = entries2rules(rules)  # list of rules for html_table / save_link_grammar
    return lg_rule_list

lg_rule_list = pipeline(input_file, left_wall='', period=False, verbose='max')
display(html_table([['Cluster','Words','','','Disjuncts']] + lg_rule_list))

16 unique disjuncts form 31 unique word-disjunct pairs from 36 parsed items


| Cluster | Words | | | Disjuncts |
|---|---|---|---|---|
| C01 | ['bird', 'extremity', 'fish'] | [] | [] | ['C06C01-'] |
| C02 | ['eagle', 'herring', 'parrot', 'tuna'] | [] | [] | ['C02C05+', 'C02C06+'] |
| C03 | ['feather', 'scale'] | [] | [] | ['C05C03-'] |
| C04 | ['fin', 'wing'] | [] | [] | ['C04C05+', 'C04C06+', 'C05C04-'] |
| C05 | ['has'] | [] | [] | ['C02C05- & C05C04+', 'C04C05- & C05C03+'] |
| C06 | ['isa'] | [] | [] | ['C02C06- & C06C01+', 'C04C06- & C06C01+'] |


In [12]:
lg_file_string = save_link_grammar(lg_rule_list, path)
for line in lg_file_string.splitlines(): print(line)

% POC Turtle Link Grammar v.0.6 2018-03-09 09:17:34 UTC
<dictionary-version-number>: V0v0v6+;
<dictionary-locale>: EN4us+;

% C01
"bird" "extremity" "fish":
(C06C01-);

% C02
"eagle" "herring" "parrot" "tuna":
(C02C05+) or (C02C06+);

% C03
"feather" "scale":
(C05C03-);

% C04
"fin" "wing":
(C04C05+) or (C04C06+) or (C05C04-);

% C05
"has":
(C02C05- & C05C04+) or (C04C05- & C05C03+);

% C06
"isa":
(C02C06- & C06C01+) or (C04C06- & C06C01+);

% 6 word clusters, 6 Link Grammar rules.
% Link Grammar file saved to: /home/obaskov/language-learning/output/Turtle-5-2018-03-09/poc-turtle_6C_2018-03-09_0006.4.0.dict


### Link Grammar parsing tests with the learned 6-rule dictionary

```
tuna isa fish

  +C02C06+C06C01+
  |      |      |
tuna    isa   fish 


herring isa fish

   +C02C06+C06C01+
   |      |      |
herring  isa   fish 


tuna has fin

  +C02C05+C05C04+
  |      |      |
tuna    has    fin 


herring has fin

   +C02C05+C05C04+
   |      |      |
herring  has    fin 


parrot isa bird

   +C02C06+C06C01+
   |      |      |
parrot   isa   bird 


eagle isa bird

  +C02C06+C06C01+
  |      |      |
eagle   isa   bird 


parrot has wing

   +C02C05+C05C04+
   |      |      |
parrot   has   wing 


eagle has wing

  +C02C05+C05C04+
  |      |      |
eagle   has   wing 


fin isa extremity

 +C04C06+C06C01+
 |      |      |
fin    isa extremity 


wing isa extremity

  +C04C06+C06C01+
  |      |      |
wing    isa extremity 


fin has scale

 +C04C05+C05C03+
 |      |      |
fin    has   scale 


wing has feather

  +C04C05+C05C03+
  |      |      |
wing    has  feather 
```

All 12 "Turtle corpus" sentences have been successfully parsed.
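
The 6-rule dictionary can also be sanity-checked without the parser, because every link in the diagrams above joins adjacent words. The sketch is a simplified adjacency check and not the Link Grammar parsing algorithm: it accepts a sentence when each word can pick a disjunct whose left connector matches the right connector chosen by the previous word. The names `RULES`, `LEXICON`, `split_disjunct` and `accepts` are illustrative.

```python
# The learned 6-rule dictionary, typed in from the listing above
RULES = {
    ("bird", "extremity", "fish"): ["C06C01-"],
    ("eagle", "herring", "parrot", "tuna"): ["C02C05+", "C02C06+"],
    ("feather", "scale"): ["C05C03-"],
    ("fin", "wing"): ["C04C05+", "C04C06+", "C05C04-"],
    ("has",): ["C02C05- & C05C04+", "C04C05- & C05C03+"],
    ("isa",): ["C02C06- & C06C01+", "C04C06- & C06C01+"],
}

def split_disjunct(disjunct):
    """Return (left_label, right_label); None where that connector is absent."""
    left = right = None
    for connector in disjunct.split(" & "):
        if connector.endswith("-"):
            left = connector[:-1]
        else:
            right = connector[:-1]
    return left, right

LEXICON = {w: [split_disjunct(d) for d in djs]
           for words, djs in RULES.items() for w in words}

def accepts(sentence):
    """True if adjacent words can be linked into one unbroken chain."""
    words = sentence.split()
    if not words or any(w not in LEXICON for w in words):
        return False
    open_right = set()  # right connectors the previous word may have chosen
    for i, word in enumerate(words):
        if i == 0:  # the first word must not need a left link
            open_right = {r for l, r in LEXICON[word] if l is None}
        else:       # later words must link back to the previous word
            open_right = {r for l, r in LEXICON[word]
                          if l is not None and l in open_right}
        if not open_right:
            return False
    return None in open_right  # the last word must not need a right link

corpus = ["tuna isa fish", "herring isa fish", "tuna has fin", "herring has fin",
          "parrot isa bird", "eagle isa bird", "parrot has wing", "eagle has wing",
          "fin isa extremity", "wing isa extremity", "fin has scale", "wing has feather"]
results = {s: accepts(s) for s in corpus + ["tuna has wing", "fish isa tuna"]}
for sentence, ok in results.items():
    print("accepted" if ok else "rejected", ":", sentence)
```

The two extra sentences probe both directions: "tuna has wing" is outside the corpus yet follows from the cluster-level rules, while "fish isa tuna" has no admissible first word.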

## Summary
This time we managed to create the same Link Grammar rules as in the previous "[POC-Turtle-4-Grammar-Learning](http://88.99.210.144/data/clustering_2018/html/POC-Turtle-4-Grammar-Learning.html)" case in a much easier way, without any complicated vector space manipulations and clustering. However, we still need a technique for evaluating cluster similarity in order to further generalize the learned word categories and grammar rules. In particular, we learned the top-level noun and verb word categories by analyzing cluster similarities in section 4.2.3 of the [previous case](http://88.99.210.144/data/clustering_2018/html/POC-Turtle-4-Grammar-Learning.html), and we would certainly need a similar agglomeration technique here.
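
As one possible starting point for such a similarity measure, the sketch below computes the Jaccard overlap between the word-level disjunct sets of the noun-like clusters from the punctuation-free run. This is an illustration under stated assumptions, not the method used in POC-Turtle-4: the disjunct sets are typed in by hand, and `jaccard` is a generic set-overlap helper.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Word-level disjunct sets of the noun-like clusters (no LEFT-WALL, no period)
cluster_disjuncts = {
    "C01 bird/extremity/fish": {"isa-"},
    "C02 eagle/herring/parrot/tuna": {"has+", "isa+"},
    "C03 feather/scale": {"has-"},
    "C04 fin/wing": {"has+", "isa+", "has-"},
}

similarities = {
    (a, b): jaccard(cluster_disjuncts[a], cluster_disjuncts[b])
    for a, b in combinations(cluster_disjuncts, 2)
}
for (a, b), value in sorted(similarities.items(), key=lambda item: -item[1]):
    print(f"{value:.2f}  {a}  ~  {b}")
```

Exact set overlap gives no credit to clusters whose disjuncts differ only in the neighbouring word, which is one reason a softer measure would likely be needed on a larger corpus.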

13 words and 16 disjuncts are not enough to build a word space for the Turtle corpus. For a bigger corpus we could create two vector spaces for rule (cluster) embeddings, one based on words and the other based on disjuncts, and study grammar patterns by clustering rules in both spaces and analyzing their cross-relations.

The updated and tested Link Grammar dictionaries are shared via [http://88.99.210.144/data/clustering_2018/POC-Turtle-5-2018-03-09/](http://88.99.210.144/data/clustering_2018/POC-Turtle-5+2018-03-09/)