# Grammar Learner 0.5: Hierarchies `2018-06-27`

**New features**:  
- hierarchical internal data representation, unified for word clusters, word categories and link grammar rules,  
- agglomerative generalization of word categories for disjunct-based sparse word space, using the Jaccard index as the similarity measure,  
- agglomerative generalization of link grammar rules for disjunct-based rules, using the Jaccard index as the similarity measure,  
- sequential agglomerative generalization of categories and rules, building a hierarchical category tree,  
- hierarchical categories saved as xx_cat_tree.txt, where xx = number of clusters.
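For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below is illustrative only: `jaccard` and the two disjunct sets are hypothetical and are not taken from the Grammar Learner source.

```python
def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch: two empty sets score 0
    return len(a & b) / len(union)


# Hypothetical disjunct sets for two words (made-up connector names)
disjuncts_x = {"A- & B+", "A- & C+", "D- & E+"}
disjuncts_y = {"A- & B+", "A- & C+"}

similarity = jaccard(disjuncts_x, disjuncts_y)  # 2 shared out of 3 distinct
print(round(similarity, 2))
```

Two words that share most of their disjuncts score close to 1, and words with no disjuncts in common score 0.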

Static HTML of this notebook is shared at  
[langlearn.singularitynet.io ⇒ clustering_2018  ⇒ Grammar-Learner-05-Hierarchies.html](http://langlearn.singularitynet.io/data/clustering_2018/html/Grammar-Learner-05-Hierarchies.html)  
Data: [http://langlearn.singularitynet.io/data/clustering_2018/Grammar-Learner-05-Hierarchies/](http://langlearn.singularitynet.io/data/clustering_2018/Grammar-Learner-05-Hierarchies/)

## Basic settings

In [1]:
import os, sys, time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path: sys.path.append(module_path)
src_path = module_path + '/src'
if os.path.exists(src_path) and src_path not in sys.path: sys.path.append(src_path)
lg_path = '/home/oleg/miniconda3/envs/ull4/lib/python3.6/site-packages/linkgrammar'
if os.path.exists(lg_path) and lg_path not in sys.path: sys.path.append(lg_path)
from src.utl.utl import UTC
from src.utl.read_files import check_dir
from src.utl.widgets import html_table, display_tree
from src.grammar_learner.poc05 import learn_grammar, params, parse_metrics, run_learn_grammar
prefix = '' # unused option
tmpath = module_path + '/tmp/'
if check_dir(tmpath, True, 'none'):
    print(UTC(), ':: module_path =', module_path)

2018-06-27 12:42:03 UTC :: module_path = /home/oleg/language-learning


## Grammar Learner parameters

In [2]:
# GL.0.5 parameters: input_parses, output_categories, output_grammar, **kwargs
kwargs = {
    'parse_mode'    :   'given'     ,   # 'given' (default) / 'explosive' (next)
    'left_wall'     :   'LEFT-WALL' ,   # '','none' - don't use / 'LEFT-WALL' - replace ###LEFT-WALL###
    'period'        :   True        ,   # use period in links learning: True/False
    'context'       :   2           ,   # 1: connectors / 2,3...: disjuncts
    'window'        :   'mst'       ,   # 'mst' / reserved options for «explosive» parsing
    'weighting'     :   'ppmi'      ,   # 'ppmi' / future options
    'group'         :   True        ,   # group items after link parsing
    'distance'      :   False       ,   # reserved options for «explosive» parsing
    'word_space'    :   'discrete'  ,   # 'vectors' / 'discrete' - no dimensionality reduction
    'dim_max'       :   100         ,   # max vector space dimensionality
    'sv_min'        :   0.1         ,   # minimal singular value (fraction of the max value)
    'dim_reduction' :   'none'      ,   # 'svm' / 'none' (discrete word_space, group)
    'clustering'    :   'group'     ,   # 'kmeans' / 'group'~'identical_entries' / future options
    'cluster_range' :   (2,48,1)    ,   # min, max, step
    'cluster_criteria': 'silhouette',   #
    'cluster_level' :   0.9         ,   # level = 0, 1, 0.-0.99..: 0 - max number of clusters
    'categories_generalization': 'off', # 'off' / 'cosine' - cosine similarity, 'jaccard'
    'categories_merge': 0.8         ,   # merge categories with similarity > this 'merge' criteria
    'categories_aggregation': 0.2   ,   # aggregate categories with similarity > this criteria
    'grammar_rules' :   2           ,   # 1: 'connectors' / 2 - 'disjuncts' / 0 - 'words' (TODO?)
    'rules_generalization': 'off'   ,   # 'off' / 'cosine' - cosine similarity, 'jaccard'
    'rules_merge'   :   0.8         ,   # merge rules with similarity > this 'merge' criteria
    'rules_aggregation':   0.2      ,   # aggregate rules similarity > this criteria
    'tmpath'        :   module_path + '/tmp/',
    'verbose': 'min', # display intermediate results: 'none', 'min', 'mid', 'max'
    # Additional (optional) parameters for parse_metrics (_ability & _quality):
    'test_corpus'   :   module_path + '/data/POC-Turtle/poc-turtle-corpus.txt',
    'reference_path':   module_path + '/data/POC-Turtle/poc-turtle-parses-expected.txt',
    'template_path' : 'poc-turtle',
    'linkage_limit' : 1
}
out_dir = module_path + '/output/Grammar-Learner-05-' + str(UTC())[:10]
print(UTC(), '::', out_dir)

2018-06-27 12:42:03 UTC :: /home/oleg/language-learning/output/Grammar-Learner-05-2018-06-27


# Integration test: POC-Turtle 

## Baseline: MST_fixed disjuncts-ILE-disjuncts, no generalization

In [3]:
%%capture
corpus = 'POC-Turtle'
dataset = 'MST_fixed_manually'
kwargs['categories_generalization'] = ''
kwargs['rules_generalization'] = ''
input_parses, output_categories, output_grammar = \
    params(corpus, dataset, module_path, out_dir, **kwargs)
response = learn_grammar(input_parses, output_categories, output_grammar, **kwargs)
pa, pq, lg_parse_path = parse_metrics(response['grammar_file'], **kwargs)

In [4]:
#print('kwargs:')
#display(html_table([[k,v] for k,v in kwargs.items()]))

In [5]:
print('Parse ability (PA), parse quality(PQ), PA*PQ:', \
      str(pa)+'%, '+str(pq)+'%, '+str(int(round(pa*pq/100,0)))+'%;')
print('Category tree "cat_tree.txt" file:')
with open(response['cat_tree_file'],'r') as f: x = f.read().splitlines()
display(html_table([y.split('\t') for y in x]))

Parse ability (PA), parse quality(PQ), PA*PQ: 100%, 100%, 100%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C01,0,1,1,.,1
C02,0,2,1,LEFT-WALL,1
C03,0,3,1,bird extremity fish,1 1 1
C04,0,4,1,eagle herring parrot tuna,1 1 1 1
C05,0,5,1,feather scale,1 1
C06,0,6,1,fin wing,1 1
C07,0,7,1,has,1
C08,0,8,1,isa,1


In [6]:
print('Link Grammar .dict file contents:')
with open(response['grammar_file'],'r') as f: 
    for line in f.read().splitlines(): print(line)

Link Grammar .dict file contents:
% Grammar Learner v.0.5 2018-06-27 12:42:04 UTC
<dictionary-version-number>: V0v0v5+;
<dictionary-locale>: EN4us+;

% C01
".":
(C03C01-) or (C05C01-) or (C06C01-);

% C02
"LEFT-WALL":
(C02C04+) or (C02C06+);

% C03
"bird" "extremity" "fish":
(C08C03- & C03C01+);

% C04
"eagle" "herring" "parrot" "tuna":
(C02C04- & C04C07+) or (C02C04- & C04C08+);

% C05
"feather" "scale":
(C07C05- & C05C01+);

% C06
"fin" "wing":
(C02C06- & C06C07+) or (C02C06- & C06C08+) or (C07C06- & C06C01+);

% C07
"has":
(C04C07- & C07C06+) or (C06C07- & C07C05+);

% C08
"isa":
(C04C08- & C08C03+) or (C06C08- & C08C03+);

UNKNOWN-WORD: XXX+;

% 8 word clusters, 8 Link Grammar rules.
% Link Grammar file saved to: /home/oleg/language-learning/output/Grammar-Learner-05-2018-06-27/POC-Turtle/MST_fixed_manually/disjuncts-ILE-disjuncts/LEFT-WALL_period/no_generalization/poc-turtle_8C_2018-06-27_0005.4.0.dict


In [7]:
#print('learn_grammar response -- project log (dict):')
#display(html_table([[k,v] for k,v in response.items()]))

# Generalization tests with POC-Turtle corpus

## Generalization of word categories

In [8]:
%%capture
kwargs['categories_generalization'] = 'jaccard'
kwargs['rules_generalization'] = ''
# All-in-one test function - /src/grammar_learner/poc05.py:
re31 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [9]:
display_tree(re31)

Parse ability (PA), parse quality(PQ), PA*PQ: 100%, 100%, 100%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.32,eagle feather fin herring parrot scale tuna wing,0 0 0 0 0 0 0 0
C2,0,2,1.0,bird extremity fish,1 1 1
C3,0,3,1.0,.,1
C4,0,4,1.0,LEFT-WALL,1
C5,0,5,1.0,has,1
C6,0,6,1.0,isa,1
,1,7,0.66,eagle fin herring parrot tuna wing,0 0 0 0 0 0
,7,8,1.0,eagle herring parrot tuna,1 1 1 1
,7,9,1.0,fin wing,1 1
,1,10,1.0,feather scale,1 1


_Primary clusters 8 and 9 are aggregated into a new cluster 7, 
then clusters 7 and 10 are aggregated into a new cluster 1 
(clusters are renumbered after agglomeration)._
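The parent and cluster columns of the table above are enough to rebuild the hierarchy. This is a minimal sketch, assuming the columns are label, parent id, cluster id, similarity and words (the last, per-word column is omitted). `children_by_parent` and `leaves_under` are hypothetical helpers, not part of `src.utl.widgets`.

```python
from collections import defaultdict

# Rows copied from the category tree above (assumed column meaning:
# label, parent id, cluster id, similarity, words).
TREE = [
    ("C1", 0, 1, 0.32, "eagle feather fin herring parrot scale tuna wing"),
    ("C2", 0, 2, 1.0, "bird extremity fish"),
    ("C3", 0, 3, 1.0, "."),
    ("C4", 0, 4, 1.0, "LEFT-WALL"),
    ("C5", 0, 5, 1.0, "has"),
    ("C6", 0, 6, 1.0, "isa"),
    ("", 1, 7, 0.66, "eagle fin herring parrot tuna wing"),
    ("", 7, 8, 1.0, "eagle herring parrot tuna"),
    ("", 7, 9, 1.0, "fin wing"),
    ("", 1, 10, 1.0, "feather scale"),
]


def children_by_parent(rows):
    """Map each parent id to the list of its direct child cluster ids."""
    children = defaultdict(list)
    for _label, parent, cluster_id, _similarity, _words in rows:
        children[parent].append(cluster_id)
    return dict(children)


def leaves_under(cluster_id, children):
    """Return the ids of the primary (leaf) clusters below cluster_id."""
    kids = children.get(cluster_id, [])
    if not kids:
        return [cluster_id]
    leaves = []
    for kid in kids:
        leaves.extend(leaves_under(kid, children))
    return leaves


children = children_by_parent(TREE)
print(children[1], children[7], leaves_under(1, children))
```

Under this reading, parent id 0 marks the top-level categories C1..C6, and the rows without a label are the sub-clusters kept below them.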

## Generalization of grammar rules

In [10]:
%%capture
kwargs['categories_generalization'] = ''
kwargs['rules_generalization'] = 'jaccard'
re32 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [11]:
display_tree(re32)

Parse ability (PA), parse quality(PQ), PA*PQ: 0%, 0%, 0%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.32,eagle feather fin herring parrot scale tuna wing,0 0 0 0 0 0 0 0
C2,0,2,1.0,bird extremity fish,1 1 1
C3,0,3,1.0,.,1
C4,0,4,1.0,LEFT-WALL,1
C5,0,5,1.0,has,1
C6,0,6,1.0,isa,1
,1,7,0.66,eagle fin herring parrot tuna wing,0 0 0 0 0 0
,7,8,1.0,eagle herring parrot tuna,1 1 1 1
,7,9,1.0,fin wing,1 1
,1,10,1.0,feather scale,1 1


## 2-step generalization

In [12]:
%%capture
# 1st test: partial generalization of categories
kwargs['categories_aggregation'] = 0.6
kwargs['categories_generalization'] = 'jaccard'
kwargs['rules_generalization'] = ''
re331 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [13]:
display_tree(re331)

Parse ability (PA), parse quality(PQ), PA*PQ: 100%, 100%, 100%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.66,eagle fin herring parrot tuna wing,0 0 0 0 0 0
C2,0,2,1.0,bird extremity fish,1 1 1
C3,0,3,1.0,feather scale,1 1
C4,0,4,1.0,.,1
C5,0,5,1.0,LEFT-WALL,1
C6,0,6,1.0,has,1
C7,0,7,1.0,isa,1
,1,8,1.0,eagle herring parrot tuna,1 1 1 1
,1,9,1.0,fin wing,1 1
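The tree above is consistent with reading `categories_aggregation` as a lower bound on similarity. With the threshold raised to 0.6, the 0.66 node is kept, while the 0.32 node seen with the default 0.2 no longer appears. The sketch below spells out that reading of the `_merge` and `_aggregation` comments in `kwargs`. `generalization_action` is a hypothetical function, and the actual decision logic in `poc05.py` may differ.

```python
def generalization_action(similarity, merge=0.8, aggregation=0.2):
    """Classify a pair of clusters by similarity (assumed semantics).

    similarity > merge        -> 'merge'     (fuse into one cluster)
    similarity > aggregation  -> 'aggregate' (new parent, children kept)
    otherwise                 -> 'keep'      (leave both clusters as they are)
    """
    if similarity > merge:
        return "merge"
    if similarity > aggregation:
        return "aggregate"
    return "keep"


# Similarity values taken from the category trees in this notebook
for sim in (0.66, 0.32):
    print(sim, generalization_action(sim, aggregation=0.6))
```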


In [14]:
%%capture
kwargs['categories_generalization'] = 'jaccard'
kwargs['categories_aggregation'] = 0.6
kwargs['rules_generalization'] = 'jaccard'
kwargs['rules_aggregation'] = 0.3
re332 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [15]:
display_tree(re332)

Parse ability (PA), parse quality(PQ), PA*PQ: 0%, 0%, 0%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.32,eagle feather fin herring parrot scale tuna wing,0 0 0 0 0 0 0 0
C2,0,2,1.0,bird extremity fish,1 1 1
C3,0,3,1.0,.,1
C4,0,4,1.0,LEFT-WALL,1
C5,0,5,1.0,has,1
C6,0,6,1.0,isa,1
,1,7,0.66,eagle fin herring parrot tuna wing,0 0 0 0 0 0
,7,8,1.0,eagle herring parrot tuna,1 1 1 1
,7,9,1.0,fin wing,1 1
,1,10,1.0,feather scale,1 1


***TODO: fix 1st category aggregation threshold!***

In [16]:
#STOP

## Connectors-DRK-disjuncts, generalize rules

In [17]:
%%capture
kwargs['left_wall'] = ''
kwargs['period'] = False
kwargs['context'] = 1
kwargs['word_space'] = 'vectors'
kwargs['dim_reduction'] = 'svm'
kwargs['clustering'] = 'kmeans'
kwargs['categories_generalization'] = 'off'
kwargs['rules_generalization'] = 'jaccard'
kwargs['rules_aggregation'] = 0.2
kwargs['verbose'] = 'mid'
re34 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [18]:
display_tree(re34)

Parse ability (PA), parse quality(PQ), PA*PQ: 0%, 0%, 0%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.66,eagle feather fin herring parrot scale tuna wing,0 0 0 0 0 0 0 0
C2,0,2,0.0,bird extremity fish,0 0 0
C3,0,3,0.0,has isa,0 0
,1,4,0.0,feather fin scale wing,0 0 0 0
,1,5,0.0,eagle herring parrot tuna,0 0 0 0


## Connectors-DRK-connectors, generalize rules

In [19]:
%%capture
kwargs['grammar_rules'] = 1
kwargs['categories_generalization'] = ''
kwargs['rules_generalization'] = 'jaccard'
re35 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [20]:
display_tree(re35)

Parse ability (PA), parse quality(PQ), PA*PQ: 0%, 0%, 0%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.66,eagle feather fin herring parrot scale tuna wing,0 0 0 0 0 0 0 0
C2,0,2,0.0,bird extremity fish,0 0 0
C3,0,3,0.0,has isa,0 0
,1,4,0.0,feather fin scale wing,0 0 0 0
,1,5,0.0,eagle herring parrot tuna,0 0 0 0


In [21]:
print('Link Grammar .dict file contents:')
with open(re35['grammar_file'],'r') as f: 
    for line in f.read().splitlines(): print(line)

Link Grammar .dict file contents:
% Grammar Learner v.0.5 2018-06-27 12:42:05 UTC
<dictionary-version-number>: V0v0v5+;
<dictionary-locale>: EN4us+;

% C1
"eagle" "feather" "fin" "herring" "parrot" "scale" "tuna" "wing":
{C3C1-} & {C1C3+};

% C2
"bird" "extremity" "fish":
(C3C2-);

% C3
"has" "isa":
(C3C1+) or (C3C2+);

UNKNOWN-WORD: XXX+;

% 3 word clusters, 3 Link Grammar rules.
% Link Grammar file saved to: /home/oleg/language-learning/output/Grammar-Learner-05-2018-06-27/POC-Turtle/MST_fixed_manually/connectors-DRK-connectors/no-LEFT-WALL_no-period/generalized_rules/poc-turtle_3C_2018-06-27_0005.4.0.dict


In [22]:
#print('learn_grammar response -- project log (dict):')
#display(html_table([[k,v] for k,v in re35.items()]))

In [23]:
#STOP

# POC-English-NoAmb

In [24]:
corpus = 'POC-English-NoAmb'
kwargs['test_corpus'] = module_path + '/data/POC-English-NoAmb/poc_english_noamb_corpus.txt'
kwargs['reference_path'] = module_path + '/data/POC-English-NoAmb/poc-english_noAmb-parses-gold.txt'
kwargs['left_wall'] = ''
kwargs['period'] = False
kwargs['categories_generalization'] = 'off'
kwargs['rules_generalization'] = 'jaccard'
kwargs['verbose'] = 'mid'

## Connectors-DRK-connectors, no generalization

In [25]:
%%capture
kwargs['context'] = 1
kwargs['word_space'] = 'vectors'
kwargs['dim_reduction'] = 'svm'
kwargs['clustering'] = 'kmeans'
kwargs['grammar_rules'] = 1
kwargs['rules_generalization'] = ''
re41 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [26]:
display_tree(re41)

Parse ability (PA), parse quality(PQ), PA*PQ: 72%, 64%, 46%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C01,0,1,0,a is liked likes was,0 0 0 0 0
C02,0,2,0,child food human now parent,0 0 0 0 0
C03,0,3,0,before not,0 0
C04,0,4,0,daughter son,0 0
C05,0,5,0,cake sausage,0 0
C06,0,6,0,dad mom,0 0
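The third figure in the metrics line is the product of the two percentages, computed as in cell In [5]: `int(round(pa*pq/100, 0))`. A small check of that arithmetic against values reported in this notebook; `pa_pq` is a hypothetical helper name.

```python
def pa_pq(pa, pq):
    """Combine parse ability and parse quality (both in percent) into PA*PQ."""
    return int(round(pa * pq / 100, 0))


# (PA, PQ) pairs reported in this notebook
for pa, pq in [(100, 100), (72, 64), (81, 80), (13, 6)]:
    print(pa, pq, pa_pq(pa, pq))
```

Because the product is rounded to an integer, small non-zero PA and PQ values can show up as 0% or 1%.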


## Connectors-DRK-connectors, generalization

In [27]:
%%capture
kwargs['rules_generalization'] = 'jaccard'
kwargs['rules_aggregation'] = 0.2
re42 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [28]:
display_tree(re42)

Parse ability (PA), parse quality(PQ), PA*PQ: 0%, 0%, 0%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.24,cake child dad daughter food human mom now parent sausage son,0 0 0 0 0 0 0 0 0 0 0
C2,0,2,0.31,a is liked likes was,0 0 0 0 0
C3,0,3,0.0,before not,0 0
,1,4,0.0,child food human now parent,0 0 0 0 0
,1,5,0.0,dad daughter mom son,0 0 0 0
,1,6,0.0,cake sausage,0 0
,2,7,0.0,a is was,0 0 0
,2,8,0.0,liked likes,0 0


***Zero PA & PQ?***

In [29]:
with open(re42['grammar_file'],'r') as f: 
    for line in f.read().splitlines(): print(line)

% Grammar Learner v.0.5 2018-06-27 12:42:06 UTC
<dictionary-version-number>: V0v0v5+;
<dictionary-locale>: EN4us+;

% C1
"cake" "child" "dad" "daughter" "food" "human" "mom" "now" "parent" "sausage" "son":
(C1C2+);

% C2
"a" "is" "liked" "likes" "was":
(C2C1+) or (C2C3+);

% C3
"before" "not":
(C3C2+);

UNKNOWN-WORD: XXX+;

% 3 word clusters, 3 Link Grammar rules.
% Link Grammar file saved to: /home/oleg/language-learning/output/Grammar-Learner-05-2018-06-27/POC-English-NoAmb/MST_fixed_manually/connectors-DRK-connectors/no-LEFT-WALL_no-period/generalized_rules/poc-english_3C_2018-06-27_0005.4.0.dict
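One thing to check in the file above: in Link Grammar, a link forms only when an `X+` connector on one word meets an `X-` connector on a word to its right. The sketch below lists connector names that occur with only one direction. It is a necessary-condition check only, not a parser, and `unmatched_connectors` is a hypothetical helper that assumes the `CnCm` connector naming used in these files.

```python
import re

# Rules copied from the .dict file printed above
GRAMMAR = '''
"cake" "child" "dad" "daughter" "food" "human" "mom" "now" "parent" "sausage" "son":
(C1C2+);
"a" "is" "liked" "likes" "was":
(C2C1+) or (C2C3+);
"before" "not":
(C3C2+);
'''

# Connector names of the form C<n>C<m> followed by a direction sign
CONNECTOR = re.compile(r"(C\d+C\d+)([+-])")


def unmatched_connectors(dict_text):
    """Return (names seen only with '+', names seen only with '-')."""
    plus, minus = set(), set()
    for name, direction in CONNECTOR.findall(dict_text):
        (plus if direction == "+" else minus).add(name)
    return sorted(plus - minus), sorted(minus - plus)


only_plus, only_minus = unmatched_connectors(GRAMMAR)
print("only '+':", only_plus)
print("only '-':", only_minus)
```

For the rules above, every connector appears with `+` only, so no pair of words can be linked. This would be consistent with the zero PA and PQ, though it does not show where in the rule generalization the `-` connectors were lost.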


## Connectors-DRK-disjuncts

In [30]:
%%capture
kwargs['context'] = 1
kwargs['grammar_rules'] = 2
kwargs['rules_aggregation'] = 0.2
re43 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [31]:
display_tree(re43)

Parse ability (PA), parse quality(PQ), PA*PQ: 13%, 6%, 1%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.21,cake dad daughter mom sausage son,0 0 0 0 0 0
C2,0,2,0.0,child food human now parent,0 0 0 0 0
C3,0,3,0.0,a is was,0 0 0
C4,0,4,0.0,liked likes,0 0
C5,0,5,0.0,before not,0 0
,1,6,0.0,dad daughter mom son,0 0 0 0
,1,7,0.0,cake sausage,0 0


In [32]:
with open(re43['grammar_file'],'r') as f: 
    for line in f.read().splitlines(): print(line)

% Grammar Learner v.0.5 2018-06-27 12:42:06 UTC
<dictionary-version-number>: V0v0v5+;
<dictionary-locale>: EN4us+;

% C1
"cake" "dad" "daughter" "mom" "sausage" "son":
(C1C3+) or (C1C4+) or (C3C1- & C1C3+) or (C3C1- & C1C4+) or (C3C1- & C3C1-) or (C4C1-);

% C2
"child" "food" "human" "now" "parent":
(C3C2-) or (C3C2- & C3C2-) or (C4C2-);

% C3
"a" "is" "was":
(C3C1+) or (C3C1+ & C3C1+ & C3C5+) or (C3C1+ & C3C2+) or (C3C1+ & C3C2+ & C3C2+) or (C3C1+ & C3C2+ & C3C5+) or (C3C1+ & C3C2+ & C3C5+ & C3C5+) or (C3C1+ & C3C5+ & C3C2+) or (C3C2+);

% C4
"liked" "likes":
(C4C1+ & C4C1+) or (C4C1+ & C4C1+ & C4C5+) or (C4C1+ & C4C2+ & C4C1+) or (C4C1+ & C4C5+ & C4C1+);

% C5
"before" "not":
(C3C5-) or (C4C5-);

UNKNOWN-WORD: XXX+;

% 5 word clusters, 5 Link Grammar rules.
% Link Grammar file saved to: /home/oleg/language-learning/output/Grammar-Learner-05-2018-06-27/POC-English-NoAmb/MST_fixed_manually/connectors-DRK-disjuncts/no-LEFT-WALL_no-period/generalized_rules/poc-english_5C_2018-06-27_0005.

## Disjuncts-DRK-disjuncts

In [33]:
%%capture
kwargs['context'] = 2
re44 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [34]:
display_tree(re44)

Parse ability (PA), parse quality(PQ), PA*PQ: 0%, 0%, 0%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.21,cake dad daughter mom sausage son,0 0 0 0 0 0
C2,0,2,0.49,child food human parent,0 0 0 0
C3,0,3,0.0,is liked was,0 0 0
C4,0,4,0.0,before likes not,0 0 0
C5,0,5,0.0,a,0
C6,0,6,0.0,now,0
,1,7,0.56,dad daughter mom son,0 0 0 0
,7,8,0.0,dad mom,0 0
,7,9,0.0,daughter son,0 0
,1,10,0.0,cake sausage,0 0


In [35]:
with open(re44['grammar_file'],'r') as f: 
    for line in f.read().splitlines(): print(line)

% Grammar Learner v.0.5 2018-06-27 12:42:07 UTC
<dictionary-version-number>: V0v0v5+;
<dictionary-locale>: EN4us+;

% C1
"cake" "dad" "daughter" "mom" "sausage" "son":
(C1C3+) or (C1C4+) or (C3C1-) or (C3C1- & C5C1-) or (C4C1-) or (C5C1- & C1C3+) or (C5C1- & C1C4+);

% C2
"child" "food" "human" "parent":
(C3C2- & C5C2-) or (C5C2- & C3C2-);

% C3
"is" "liked" "was":
(C3C1+ & C3C1+ & C3C4+) or (C3C1+ & C3C2+) or (C3C1+ & C3C2+ & C3C4+) or (C3C1+ & C3C2+ & C3C4+ & C3C4+) or (C3C1+ & C3C2+ & C3C6+) or (C3C1+ & C3C4+ & C3C1+) or (C3C1+ & C3C4+ & C3C2+) or (C3C1+ & C3C6+ & C3C2+);

% C4
"before" "likes" "not":
(C3C4-) or (C4C1+ & C4C1+) or (C4C1+ & C4C6+ & C4C1+);

% C5
"a":
(C5C1+) or (C5C2+);

% C6
"now":
(C3C6-) or (C4C6-);

UNKNOWN-WORD: XXX+;

% 6 word clusters, 6 Link Grammar rules.
% Link Grammar file saved to: /home/oleg/language-learning/output/Grammar-Learner-05-2018-06-27/POC-English-NoAmb/MST_fixed_manually/disjuncts-DRK-disjuncts/no-LEFT-WALL_no-period/generalized_rules/poc-engl

## Disjuncts-ILE-disjuncts

In [36]:
%%capture
kwargs['word_space'] = 'discrete'
kwargs['dim_reduction'] = 'none'
kwargs['clustering'] = 'group'
kwargs['categories_generalization'] = 'jaccard'
kwargs['categories_aggregation'] = 0.2
kwargs['rules_generalization'] = 'jaccard'
kwargs['rules_aggregation'] = 0.1
re45 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [37]:
display_tree(re45)

Parse ability (PA), parse quality(PQ), PA*PQ: 81%, 80%, 65%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C1,0,1,0.21,cake dad daughter mom sausage son,0 0 0 0 0 0
C2,0,2,0.49,child food human parent,0 0 0 0
C3,0,3,0.49,before not,0 0
C4,0,4,1.0,a,1
C5,0,5,1.0,is,1
C6,0,6,1.0,liked,1
C7,0,7,1.0,likes,1
C8,0,8,1.0,now,1
C9,0,9,1.0,was,1
,1,10,1.0,cake sausage,1 1


In [38]:
with open(re45['grammar_file'],'r') as f: 
    for line in f.read().splitlines(): print(line)

% Grammar Learner v.0.5 2018-06-27 12:42:07 UTC
<dictionary-version-number>: V0v0v5+;
<dictionary-locale>: EN4us+;

% C1
"cake" "dad" "daughter" "mom" "sausage" "son":
(C1C5+) or (C1C6+) or (C1C7+) or (C1C9+) or (C4C1- & C1C5+) or (C4C1- & C1C7+) or (C6C1-) or (C7C1-) or (C9C1- & C4C1-);

% C2
"child" "food" "human" "parent":
(C4C2- & C5C2-) or (C9C2- & C4C2-);

% C3
"before" "not":
(C6C3-) or (C9C3-);

% C4
"a":
(C4C1+) or (C4C2+);

% C5
"is":
(C1C5- & C5C2+) or (C1C5- & C5C2+ & C5C8+) or (C1C5- & C5C8+ & C5C2+);

% C6
"liked":
(C1C6- & C6C1+ & C6C3+) or (C1C6- & C6C3+ & C6C1+);

% C7
"likes":
(C1C7- & C7C1+) or (C1C7- & C7C8+ & C7C1+);

% C8
"now":
(C5C8-) or (C7C8-);

% C9
"was":
(C1C9- & C9C1+ & C9C3+) or (C1C9- & C9C2+ & C9C3+) or (C1C9- & C9C2+ & C9C3+ & C9C3+) or (C1C9- & C9C3+ & C9C2+);

UNKNOWN-WORD: XXX+;

% 9 word clusters, 9 Link Grammar rules.
% Link Grammar file saved to: /home/oleg/language-learning/output/Grammar-Learner-05-2018-06-27/POC-English-NoAmb/MST_fixed_manuall

# POC-English-Amb, DRK + rules generalization

In [39]:
corpus = 'POC-English-Amb'
kwargs['test_corpus'] = module_path + '/data/POC-English-Amb/poc_english.txt'
kwargs['reference_path'] = module_path + '/data/POC-English-Amb/poc-english_ex-parses-gold.txt'
kwargs['left_wall'] = ''
kwargs['period'] = False
kwargs['verbose'] = 'mid'

## connectors-DRK-connectors

In [40]:
%%capture
kwargs['context'] = 1
kwargs['word_space'] = 'vectors'
kwargs['dim_reduction'] = 'svm'
kwargs['clustering'] = 'kmeans'
kwargs['categories_generalization'] = 'off'
kwargs['grammar_rules'] = 1
kwargs['rules_generalization'] = 'jaccard'
kwargs['rules_aggregation'] = 0.2
re51 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [41]:
display_tree(re51)

Parse ability (PA), parse quality(PQ), PA*PQ: 3%, 2%, 0%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C01,0,1,0.27,knocked liked likes saw sawed sees writes,0 0 0 0 0 0 0
C02,0,2,0.66,child food human not parent tool,0 0 0 0 0 0
C03,0,3,0.44,dad daughter mom son,0 0 0 0
C04,0,4,0.25,a is wants was,0 0 0 0
C05,0,5,0.39,binoculars hammer telescope,0 0 0
C06,0,6,0.0,cake sausage,0 0
C07,0,7,0.0,has with,0 0
C08,0,8,0.0,before,0
C09,0,9,0.0,now,0
C10,0,10,0.0,to,0


## disjuncts-DRK-disjuncts

In [45]:
%%capture
kwargs['context'] = 2
kwargs['grammar_rules'] = 2
kwargs['rules_aggregation'] = 0.2
re52 = run_learn_grammar(corpus, dataset, module_path, out_dir, **kwargs)

In [46]:
display_tree(re52)

Parse ability (PA), parse quality(PQ), PA*PQ: 9%, 9%, 1%;
Category tree "cat_tree.txt" file:


0,1,2,3,4,5
C01,0,1,0.0,are be is liked to,0 0 0 0 0
C02,0,2,0.32,child food human parent tool,0 0 0 0 0
C03,0,3,0.0,likes of was writes,0 0 0 0
C04,0,4,0.42,dad daughter mom son,0 0 0 0
C05,0,5,0.0,hammer saw telescope,0 0 0
C06,0,6,0.0,a her his,0 0 0
C07,0,7,0.0,directors sees the,0 0 0
C08,0,8,0.0,knocked sawed,0 0
C09,0,9,0.0,before not,0 0
C10,0,10,0.0,binoculars chalk,0 0


*Parse metrics for the POC-English-Amb corpus look disappointing. 
Further study of parse metrics stability ⇒ [http://langlearn.singularitynet.io/data/clustering_2018/html/POC-English-Amb-2018-05-31+06-27.html](http://langlearn.singularitynet.io/data/clustering_2018/html/POC-English-Amb-2018-05-31+06-27.html)*