# PictoBERT: Transformers for Pictogram Prediction (semantic grammar)

This notebook builds the semantic grammar that is compared against PictoBERT fine-tuned for pictogram prediction based on a grammatical structure.

It replicates the method of [Pereira et al. (2020)](https://dx.doi.org/10.1007/978-3-030-58323-1_28). See Section 5.2.1 of the PictoBERT paper for details.

## Install dependencies

In [1]:
!pip install transformers rdflib

Collecting transformers
  Downloading transformers-4.17.0-py3-none-any.whl (3.8 MB)
Collecting rdflib
  Downloading rdflib-6.1.1-py3-none-any.whl (482 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting tokenizers!=0.11.3,>=0.11.1
  Downloading tokenizers-0.11.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)

In [16]:
!pip install anytree

Collecting anytree
  Downloading anytree-2.8.0-py2.py3-none-any.whl (41 kB)
Installing collected packages: anytree
Successfully installed anytree-2.8.0


## Download files

In [2]:
!wget http://jayr.clubedosgeeks.com.br/pictobert/tokenizer_sem_childes_uk_clean_2.json

--2022-03-24 00:52:47--  http://jayr.clubedosgeeks.com.br/pictobert/tokenizer_sem_childes_uk_clean_2.json
Resolving jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)... 192.185.214.132
Connecting to jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)|192.185.214.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 118904 (116K) [application/json]
Saving to: ‘tokenizer_sem_childes_uk_clean_2.json’


2022-03-24 00:52:48 (205 KB/s) - ‘tokenizer_sem_childes_uk_clean_2.json’ saved [118904/118904]



In [4]:
!wget http://jayr.clubedosgeeks.com.br/pictobert/base.ttl

--2022-03-24 00:53:31--  http://jayr.clubedosgeeks.com.br/pictobert/base.ttl
Resolving jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)... 192.185.214.132
Connecting to jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)|192.185.214.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1574 (1.5K) [text/turtle]
Saving to: ‘base.ttl’


2022-03-24 00:53:32 (204 MB/s) - ‘base.ttl’ saved [1574/1574]



In [6]:
!wget http://jayr.clubedosgeeks.com.br/pictobert/data_sem_childes_clean_2.zip
!unzip ./data_sem_childes_clean_2.zip

--2022-03-24 00:54:41--  http://jayr.clubedosgeeks.com.br/pictobert/data_sem_childes_clean_2.zip
Resolving jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)... 192.185.214.132
Connecting to jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)|192.185.214.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2157229 (2.1M) [application/zip]
Saving to: ‘data_sem_childes_clean_2.zip’


2022-03-24 00:54:43 (1.56 MB/s) - ‘data_sem_childes_clean_2.zip’ saved [2157229/2157229]

Archive:  ./data_sem_childes_clean_2.zip
   creating: data/
  inflating: data/CS_new_test_data.pt  
  inflating: data/CS_new_val_data.pt  
  inflating: data/CS_new_train_data.pt  


In [41]:
!wget http://jayr.clubedosgeeks.com.br/pictobert/semantic_grammar_basis.db

--2022-03-24 01:22:46--  http://jayr.clubedosgeeks.com.br/pictobert/semantic_grammar_basis.db
Resolving jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)... 192.185.214.132
Connecting to jayr.clubedosgeeks.com.br (jayr.clubedosgeeks.com.br)|192.185.214.132|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 679936 (664K)
Saving to: ‘semantic_grammar_basis.db’


2022-03-24 01:22:48 (710 KB/s) - ‘semantic_grammar_basis.db’ saved [679936/679936]



## Load tokenizer

In [3]:
TOKENIZER_PATH = "./tokenizer_sem_childes_uk_clean_2.json" # you can change this path to use your custom tokenizer

from transformers import PreTrainedTokenizerFast

cs_tokenizer = PreTrainedTokenizerFast(tokenizer_file=TOKENIZER_PATH)
cs_tokenizer.pad_token = "[PAD]"
cs_tokenizer.sep_token = "[SEP]"
cs_tokenizer.mask_token = "[MASK]"
cs_tokenizer.cls_token = "[CLS]"
cs_tokenizer.unk_token = "[UNK]"

In [5]:
import nltk
nltk.download("wordnet")

from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


## Load encoded data

In [7]:
import pickle

test_dataset = pickle.load(open("./data/CS_new_test_data.pt",'rb'))
train_dataset = pickle.load(open("./data/CS_new_train_data.pt",'rb'))
val_dataset = pickle.load(open("./data/CS_new_val_data.pt",'rb'))

### Create ontology

In [21]:
def create_concept(name):
    
    concept = URIRef(myns[name])
    g.add((concept, RDF.type, ontolex.LexicalConcept))
    g.add((concept, RDFS.label, Literal(name)))

    # synset = wn.synset(name)
    
    return concept

In [8]:
def create_lexicalized_sense(sense_key, concept):
  lexicalized_sense = URIRef(myns[sense_key])
  g.add((lexicalized_sense, RDF.type, ontolex.LexicalizedSense))
  g.add((lexicalized_sense, RDFS.label, Literal(sense_key)))

  g.add((concept, ontolex.lexicalizedSense, lexicalized_sense))

  return lexicalized_sense

In [9]:
def create_lexical_entry(word, pos, lexicalized_sense, concept):
    pos = {
        "v":'verb',
        "n":"noun",
        "a": "adjective",
        "s": "adjective",
        "r":'adverb',
        'p':'pronoun'
    }[pos]
    lexicalEntry = URIRef(myns[word+"_"+pos+"_lex"])
    g.add((lexicalEntry, RDF.type, ontolex.Word))
    g.add((lexicalEntry, lexinfo.partOfSpeech, lexinfo[pos]))
    g.add((lexicalEntry, ontolex.writtenRep, Literal(word)))

    canonicalForm = URIRef(myns[word+"_"+pos+"_form"])
    g.add((canonicalForm, RDF.type, ontolex.Form))
    g.add((canonicalForm, ontolex.writtenRep, Literal(word)))
    g.add((canonicalForm, lexinfo.partOfSpeech, lexinfo[pos]))

    g.add((lexicalEntry, ontolex.canonicalForm, canonicalForm))

    g.add((lexicalEntry, ontolex.sense, lexicalized_sense))
    g.add((lexicalEntry, ontolex.evokes, concept))
    g.add((concept, ontolex.isEvokedBy, lexicalEntry))

    return lexicalEntry

In [10]:
def get_all_hypernyms(s):
    return set(
        self_synset
        for self_synsets in s._iter_hypernym_lists()
        for self_synset in self_synsets
    ).difference(set([s]))

## Complete missing nodes


In [11]:
vocab = [i for i in cs_tokenizer.get_vocab()]
synsets = [wn.lemma_from_key(k).synset() for k in vocab if '%' in k]
verb_synsets = [s for s in synsets if s.pos() == 'v']
noun_synsets = [s for s in synsets if s.pos() == 'n']
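
The `'%' in k` filter above relies on sense tokens in the vocabulary being WordNet sense keys, which follow the pattern `lemma%ss_type:lex_filenum:lex_id:head_word:head_id`, where `ss_type` is 1 for nouns, 2 for verbs, 3 for adjectives, 4 for adverbs and 5 for adjective satellites. As a rough illustration that needs no WordNet data, the sketch below splits a token into lemma and part of speech. The helper name `parse_sense_key` is hypothetical, and the sample tokens were picked for illustration; they are not claimed to be in this tokenizer's vocabulary.

```python
# Map the ss_type digit of a WordNet sense key to a part-of-speech tag.
SS_TYPE_TO_POS = {"1": "n", "2": "v", "3": "a", "4": "r", "5": "s"}

def parse_sense_key(token):
    """Return (lemma, pos) for a sense key, or None for a plain token."""
    if "%" not in token:
        return None
    lemma, _, rest = token.partition("%")
    ss_type = rest.split(":")[0]
    return lemma, SS_TYPE_TO_POS[ss_type]

print(parse_sense_key("dog%1:05:00::"))  # ('dog', 'n')
print(parse_sense_key("eat%2:34:00::"))  # ('eat', 'v')
print(parse_sense_key("i"))              # None
```

In the notebook itself, `wn.lemma_from_key` does the full lookup and returns the lemma's synset; this sketch only shows what information the key string carries.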

In [12]:
from tqdm import tqdm

class MissingNodes():
    def __init__(self):
        # Cache of hypernym lists, keyed by synset name.
        self.all_hyps = {}

    def get_missing(self, synsets, root='entity.n.01'):
        # `root` is kept in the signature but is not used by the search below.
        # Single pass over all pairs: when two synsets are not in a hypernym
        # relation but are very similar, add their lowest common hypernym.
        new_nodes = []
        for i, s in enumerate(tqdm(synsets)):
            all_hp = self.get_all_hypernyms(s)
            for s2 in synsets[i:]:
                s2_all_hp = self.get_all_hypernyms(s2)
                if s2 not in all_hp and s not in s2_all_hp:
                    subsumer = s.lowest_common_hypernyms(s2)
                    score = s.wup_similarity(s2)
                    if len(subsumer) > 0 and score is not None and score > 0.9:
                        if subsumer[0] not in synsets and subsumer[0] not in new_nodes:
                            new_nodes.append(subsumer[0])
        return synsets + new_nodes

    def get_all_hypernyms(self, s):
        # Return the cached list when this synset was already expanded.
        if s.name() in self.all_hyps:
            return self.all_hyps[s.name()]
        hypernyms = [
            self_synset
            for self_synsets in s._iter_hypernym_lists()
            for self_synset in self_synsets
        ]
        self.all_hyps[s.name()] = hypernyms
        return hypernyms

## Process corpus

In [15]:
from tqdm import tqdm

def to_synset_name(token):
  # Sense keys contain '%'; any other token is kept as-is.
  if '%' in token:
    return wn.lemma_from_key(token).synset().name()
  return token

semantic_model = {}
for example in tqdm(train_dataset['input_ids'] + val_dataset['input_ids']):
  # Each example has nine slots: [CLS], agent, action, theme, location, recipient, manner, time, [SEP].
  cls_id, who_id, what_doing_id, what_id, where_id, to_whom_id, how_id, when_id, sep_id = example
  what_doing = cs_tokenizer.convert_ids_to_tokens(what_doing_id)
  if "%" not in what_doing:
    continue  # skip examples whose action slot is not a WordNet sense key
  synset = wn.lemma_from_key(what_doing).synset().name()
  if synset not in semantic_model:
    semantic_model[synset] = {"hasAgent":[],"hasTheme":[], "hasLocation":[],"hasRecipient":[],'hasManner':[],'hasTime':[]}
  slots = {"hasAgent": who_id, "hasTheme": what_id, "hasLocation": where_id,
           "hasRecipient": to_whom_id, "hasManner": how_id, "hasTime": when_id}
  for role, token_id in slots.items():
    token = cs_tokenizer.convert_ids_to_tokens(token_id)
    # Special tokens mark empty slots and are not recorded.
    if token not in cs_tokenizer.all_special_tokens:
      semantic_model[synset][role].append(to_synset_name(token))

100%|██████████| 78022/78022 [00:49<00:00, 1579.31it/s]
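
To make the shape of `semantic_model` concrete, here is a small sketch that applies the same slot-to-role mapping to two hand-written examples. It works on token strings instead of ids and uses plain words instead of sense keys, so it runs without the tokenizer or WordNet. The function name `add_example`, the toy tokens, and the use of `[PAD]` for empty slots are assumptions made for this illustration.

```python
ROLES = ["hasAgent", "hasTheme", "hasLocation", "hasRecipient", "hasManner", "hasTime"]
SPECIAL = {"[CLS]", "[SEP]", "[PAD]", "[UNK]", "[MASK]"}

def add_example(model, tokens):
    """Record the role fillers of one nine-slot example under its action."""
    _cls, who, action, what, where, to_whom, how, when, _sep = tokens
    entry = model.setdefault(action, {role: [] for role in ROLES})
    for role, filler in zip(ROLES, [who, what, where, to_whom, how, when]):
        if filler not in SPECIAL:  # special tokens mark empty slots
            entry[role].append(filler)

toy_model = {}
add_example(toy_model, ["[CLS]", "i", "eat", "apple", "[PAD]", "[PAD]", "[PAD]", "[PAD]", "[SEP]"])
add_example(toy_model, ["[CLS]", "you", "eat", "apple", "kitchen", "[PAD]", "[PAD]", "[PAD]", "[SEP]"])

print(toy_model["eat"]["hasAgent"])     # ['i', 'you']
print(toy_model["eat"]["hasTheme"])     # ['apple', 'apple']
print(toy_model["eat"]["hasLocation"])  # ['kitchen']
```

Fillers are stored with repetition on purpose: the later steps count how often each filler occurs per role.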


## Importance cut-off

In [14]:
from collections import Counter
from scipy import stats

import math
from scipy.stats import shapiro
import numpy as np
import matplotlib.pyplot as plt


def cut_off(items, alpha, assume_normal = False, change_alpha = True):
    # print(stats.norm.ppf(1-alpha))

    
    counter = Counter(items)
    names = counter.keys()
    counts = list(counter.values())
    a = np.array(counts)
    mean = a.mean()
    std = a.std()
    if len(counts) > 3 and not assume_normal:
        k2, p = shapiro(counts)
    else:
        p = 0
    chosen = []
    not_chosen = []

    if p > 0.05 or assume_normal:
        z = stats.norm.ppf(1-alpha)

        x = (z*std)+mean

    else:
        t = stats.t.ppf(1-alpha, len(items))
        error = (t * float(std)) / math.sqrt(len(items));

        x = error+mean

    x = round(x)
    for name, qt in counter.items():
        if qt >= x:
            chosen.append((name, qt))
        else:
            not_chosen.append((name, qt))
    if len(chosen)==0 and alpha < 0.5 and change_alpha:
        return cut_off(items, alpha+0.05)
    return chosen,not_chosen
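
When `assume_normal=True`, `cut_off` keeps every item whose frequency is at least `round(mean + z * std)`, where `z` is the standard normal quantile at `1 - alpha` and the mean and standard deviation are taken over the per-item counts. The sketch below reproduces only that branch with the standard library, on made-up data, to show how the threshold behaves. The function name `normal_threshold` and the toy items are hypothetical.

```python
from collections import Counter
from statistics import NormalDist, mean, pstdev

def normal_threshold(items, alpha):
    """Frequency threshold for the normal-assumption branch."""
    counts = list(Counter(items).values())
    z = NormalDist().inv_cdf(1 - alpha)
    # pstdev is the population standard deviation, like numpy's default std().
    return round(mean(counts) + z * pstdev(counts))

toy_items = ["a"] * 10 + ["b"] * 3 + ["c"] * 2 + ["d"]
threshold = normal_threshold(toy_items, alpha=0.2)
kept = [(name, n) for name, n in Counter(toy_items).items() if n >= threshold]
print(threshold, kept)  # 7 [('a', 10)]
```

A smaller `alpha` gives a larger `z` and therefore a stricter threshold, so fewer complements survive.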

## Redundancy removing

In [18]:
from rdflib import Graph, Namespace, RDFS, RDF, Literal, URIRef, OWL,BNode
from rdflib.plugins.sparql import prepareQuery
from collections import Counter
from anytree import Node, RenderTree, ZigZagGroupIter,LevelOrderGroupIter
from anytree.exporter import DotExporter

class Redundancy():
    def __init__(self, graph):
        self.g = graph
        self.ontolex = Namespace("http://www.w3.org/ns/lemon/ontolex#")
        self.decomp = Namespace("http://www.w3.org/ns/lemon/decomp#")
        self.lexinfo = Namespace("http://www.lexinfo.net/ontology/2.0/lexinfo#")
        self.synsem = Namespace("http://www.w3.org/ns/lemon/synsem#")
        self.myns = Namespace("http://assistive.cin.ufpe.br/aboard#")
    
    def get_hypernyms(self,synset_name,reference):
        el = URIRef(self.myns[synset_name])
        hypernyms = []
        for s,v,p in self.g.triples((el, self.lexinfo.hypernym, None)):
            if self.g.qname(p) in reference:
                hypernyms.append(self.g.qname(p))
        
        return hypernyms

    def replace(self,_list, old, new):
        return [new if x==old else x for x in _list]
    
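    def by_frequency(self, initial):
        counter = Counter(initial)
        names = counter.keys()
        counts = counter.values()
        coisa = {}    # synset name -> anytree Node ("coisa" = "thing")
        pais = []     # names that currently act as tree roots ("pais" = "parents")
        not_pai = []  # names that have a hypernym among the listed names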
        for name, qt in counter.items():
            hypernyms = self.get_hypernyms(name, names)
            node_name =name
            if len(hypernyms) > 0 and node_name not in not_pai:
                not_pai.append(node_name)
            for hypernym in hypernyms:
                hypernym_node = hypernym;
                if node_name in pais:
                    pais.remove(node_name)
                if hypernym_node not in pais and hypernym_node not in not_pai:
                    pais.append(hypernym_node)
                if hypernym_node not in coisa:
                    coisa[hypernym_node] = Node(hypernym_node)
                if node_name not in coisa:
                    coisa[node_name] = Node(node_name)
                coisa[node_name].parent = coisa[hypernym_node]


        for a in pais:

            asda = [[node.name for node in children] for children in ZigZagGroupIter(coisa[a])]
            asda.reverse()
            for level in asda:
                for item in level:
                    qt = counter[item]
                    if coisa[item].parent != None:
                        qt_parent = counter[coisa[item].parent.name]
                        if qt_parent > qt:
                            initial = self.replace(initial,item, coisa[item].parent.name)
                            counter = Counter(initial)
        return initial

    def get_parent_recursive(self,synset_name, reference):
        el = URIRef(self.myns[synset_name])
        for s,v,p in self.g.triples((el, self.lexinfo.hypernym, None)):
            if self.g.qname(p) in reference:
                return self.g.qname(p)
            _parent = self.get_parent_recursive(self.g.qname(p), reference)
            if _parent is not None:
                return _parent
        return None

    def parent_preference(self, initial, arr = False, keys = False):
        _obj = {}
        if arr:
            counter = Counter(initial)
            _obj = counter
        else:
            for name, qt in initial:
                _obj[name] = qt

            counter = Counter(_obj)
        names = list(counter.keys())
        isChanged = False

        for name in names:
            parent = self.get_parent_recursive(name, names)
            if parent is not None:
                if parent in _obj and name in _obj:
                    _obj[parent] = _obj[parent] + _obj[name]
                del _obj[name]
                isChanged = True
        
        if isChanged:
            if keys:
                return _obj.keys()
            f_return = []
            for a in _obj:
                f_return.append((a,_obj[a]))
            return f_return
        else:
            if arr and not keys:
                f_return = []
                for a in _obj:
                    f_return.append((a,_obj[a]))
                return f_return
            return initial;
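
The idea behind `parent_preference` can be shown without an RDF graph: when an accepted item has an ancestor that was also accepted, the item is dropped and its count is added to that ancestor. The sketch below is a simplified version under two assumptions: each concept has at most one hypernym, and the hierarchy is given as a plain dictionary. The function name `merge_into_ancestors` and the toy data are hypothetical.

```python
def merge_into_ancestors(counts, parent_of):
    """Fold each item's count into its nearest ancestor that is also listed."""
    names = list(counts)
    merged = dict(counts)
    for name in names:
        # Walk up the hierarchy until a listed ancestor is found (or none is left).
        ancestor = parent_of.get(name)
        while ancestor is not None and ancestor not in names:
            ancestor = parent_of.get(ancestor)
        if ancestor is not None:
            if ancestor in merged:
                merged[ancestor] += merged[name]
            del merged[name]
    return merged

parent_of = {"dog.n.01": "canine.n.02", "canine.n.02": "animal.n.01"}
counts = {"animal.n.01": 5, "dog.n.01": 3, "apple.n.01": 2}
print(merge_into_ancestors(counts, parent_of))
# {'animal.n.01': 8, 'apple.n.01': 2}
```

The effect is that the grammar prefers the more general concept, which also covers the more specific ones beneath it.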

## Building ontology

In [19]:
from rdflib import Graph, Namespace, RDFS, RDF, Literal, URIRef, OWL,BNode
from rdflib.plugins.sparql import prepareQuery

g = Graph()
g.parse("./base.ttl")

ontolex = Namespace("http://www.w3.org/ns/lemon/ontolex#")
decomp = Namespace("http://www.w3.org/ns/lemon/decomp#")
lexinfo = Namespace("http://www.lexinfo.net/ontology/2.0/lexinfo#")
synsem = Namespace("http://www.w3.org/ns/lemon/synsem#")
myns = Namespace("http://assistive.cin.ufpe.br/aboard#")
skos = Namespace("http://www.w3.org/2004/02/skos/core#")


g.bind("ontolex",ontolex)
g.bind("decomp",decomp)
g.bind("lexinfo", lexinfo)
g.bind("synsem", synsem)


In [22]:
for token in cs_tokenizer.get_vocab():
  if '%' not in token:
    concept = create_concept(token)
    lexicalized_sense = create_lexicalized_sense(token, concept)
    lexical_entry = create_lexical_entry(token, "p", lexicalized_sense, concept)    

In [23]:
nouns_synsets = []
all_synsets = []
for token in cs_tokenizer.get_vocab():
  if '%' in token:
    try:
      l = wn.lemma_from_key(token)
      concept = create_concept(l.synset().name())
      lexicalized_sense = create_lexicalized_sense(token, concept)
      lexical_entry = create_lexical_entry(l.name(), l.synset().pos(), lexicalized_sense, concept)
      
      if l.synset().pos() == 'n':
        nouns_synsets.append(l.synset())
      all_synsets.append(l.synset())
    except Exception:
      # Report tokens that could not be processed (e.g. sense keys unknown to WordNet).
      print(token)

In [24]:
for s in nouns_synsets:
    # print(s)
    concept = create_concept(s.name())
    all_hyps = list(get_all_hypernyms(s))
    in_list = set(all_hyps).intersection(set(nouns_synsets))

    if len(in_list) > 0:
        depts = [{"synset":s,"depth":s.max_depth()} for s in list(in_list)]
        parent = sorted(depts,key=lambda k:k['depth'],reverse=True)[0]
        hy_concept = create_concept(parent['synset'].name())
        g.add((concept, lexinfo.hypernym, hy_concept))
        g.add((hy_concept, lexinfo.hyponym, concept))
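
The cell above links each noun concept to its deepest hypernym that is also part of the vocabulary, so the resulting taxonomy only contains concepts that have a pictogram. A toy version of that selection, using a hand-written hierarchy instead of WordNet, might look like the sketch below. The function name `deepest_vocab_hypernym`, the single-parent dictionary and the depth values are simplifications invented for this example.

```python
def deepest_vocab_hypernym(synset, hypernym_of, depth, vocab):
    """Return the deepest ancestor of `synset` that is in `vocab`, or None."""
    candidates = []
    current = hypernym_of.get(synset)
    while current is not None:
        if current in vocab:
            candidates.append(current)
        current = hypernym_of.get(current)
    return max(candidates, key=depth.get) if candidates else None

hypernym_of = {
    "dog.n.01": "canine.n.02",
    "canine.n.02": "animal.n.01",
    "animal.n.01": "entity.n.01",
}
depth = {"entity.n.01": 0, "animal.n.01": 1, "canine.n.02": 2, "dog.n.01": 3}
vocab = {"dog.n.01", "animal.n.01", "entity.n.01"}

print(deepest_vocab_hypernym("dog.n.01", hypernym_of, depth, vocab))     # animal.n.01
print(deepest_vocab_hypernym("entity.n.01", hypernym_of, depth, vocab))  # None
```

Here `canine.n.02` is skipped because it is not in the vocabulary, so `dog.n.01` attaches directly to `animal.n.01`.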

## Create Semantic Roles in the graph

In [25]:
semanticRole = URIRef(myns["semanticRole"])
g.add((semanticRole, RDF.type, OWL.ObjectProperty))

frequency = URIRef(myns['frequency'])
g.add((frequency, RDF.type, OWL.AnnotationProperty))

def create_semantic_role(name):
    sm = URIRef(myns[name])
    g.add((sm, RDF.type, OWL.ObjectProperty))
    g.add((sm, RDFS.subPropertyOf, myns.semanticRole))
    return sm

## Transform into database

To make querying easier, we convert each ontology into a relational (SQLite) database.
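
As a sketch of the kind of query the relational form makes easy, the example below builds a tiny in-memory database and looks up the complements of a verb for a given role. The table and column names mirror those used by the insert statements in this notebook, but the `id` columns, the column types and the sample rows are assumptions; the real schema comes from `semantic_grammar_basis.db`. The helper `complements` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: only the non-id column names are taken from the notebook.
conn.executescript("""
    CREATE TABLE concept (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
    CREATE TABLE semantic_relationship (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        semantic_role TEXT,
        source_concept_id INTEGER,
        destination_concept_id INTEGER,
        frequency INTEGER
    );
""")
conn.executemany("INSERT INTO concept (name) VALUES (?)",
                 [("eat.v.01",), ("apple.n.01",), ("kitchen.n.01",)])
conn.executemany(
    "INSERT INTO semantic_relationship "
    "(semantic_role, source_concept_id, destination_concept_id, frequency) "
    "VALUES (?,?,?,?)",
    [("hasTheme", 1, 2, 0), ("hasLocation", 1, 3, 0)],
)

def complements(conn, verb, role):
    """Names of the concepts linked to `verb` through `role`."""
    rows = conn.execute(
        "SELECT d.name FROM semantic_relationship r "
        "JOIN concept s ON s.id = r.source_concept_id "
        "JOIN concept d ON d.id = r.destination_concept_id "
        "WHERE s.name = ? AND r.semantic_role = ? ORDER BY d.name",
        (verb, role),
    )
    return [name for (name,) in rows]

print(complements(conn, "eat.v.01", "hasTheme"))     # ['apple.n.01']
print(complements(conn, "eat.v.01", "hasLocation"))  # ['kitchen.n.01']
```

The same lookup against the RDF graph would need a triple pattern per role, which is why a flat table is convenient at prediction time.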

In [57]:
!mkdir -p db


In [63]:
import sqlite3

pos_map = {'n':'noun','v':'verb','a':'adjective','s':'adjective','r':'adverb'}

def db_create_concept(conn,name):
    c = conn.cursor()
    for row in c.execute("SELECT * FROM concept WHERE name = ?", (name,)):
        return row[0]
    c.execute("INSERT INTO concept (name) VALUES (?)",(name,))
    return c.lastrowid

def db_create_word(conn,name, pos, w_type=1, parent=None):
    c = conn.cursor()
    for row in c.execute("SELECT * FROM pictogram WHERE name = ?  and pos = ?  and type= ?",(name,pos,str(w_type)) ):
        return row[0]
    c.execute("INSERT INTO pictogram (name,pos,type,image_file,parent) VALUES (?,?,?,?,?)",(name,pos,w_type,'...',parent,))
    return c.lastrowid

def db_create_word_concept(conn,concept_id, word_id):
    c = conn.cursor()
    for row in c.execute("SELECT * FROM pictogram_concept WHERE concept_id = ? AND pictogram_id = ?", (concept_id, word_id,)):
        return row[0]
    c.execute("INSERT INTO pictogram_concept (concept_id, pictogram_id) VALUES (?,?)",(concept_id, word_id,))
    return c.lastrowid

def db_create_tax(conn,hypernym_id, hyponym_id):
    c = conn.cursor()
    for row in c.execute("SELECT * FROM taxonomic_relationship WHERE hypernym_id = ? AND hyponym_id = ?", (hypernym_id, hyponym_id,)):
        return row[0]
    c.execute("INSERT INTO taxonomic_relationship (hypernym_id, hyponym_id) VALUES (?,?)",(hypernym_id, hyponym_id,))
    return c.lastrowid

def db_create_semantic_rel(conn,semantic_role, source_concept_id, destination_concept_id, frequency):
    c = conn.cursor()
    for row in c.execute("SELECT * FROM semantic_relationship WHERE semantic_role = ? AND source_concept_id = ? AND destination_concept_id = ?",(semantic_role, source_concept_id, destination_concept_id,)):
        return row[0]
    c.execute("INSERT INTO semantic_relationship (semantic_role, source_concept_id, destination_concept_id, frequency) VALUES (?,?,?,?)",(semantic_role, source_concept_id, destination_concept_id, frequency,))
    return c.lastrowid


def to_database(g,percentil):
  path = "./db/semantic_grammar_{0}.db".format(percentil)
  !cp ./semantic_grammar_basis.db $path

  conn = sqlite3.connect(path)

  delete_cursor = conn.cursor()

  delete_cursor.execute("DELETE FROM concept;")
  delete_cursor.execute("DELETE FROM sqlite_sequence WHERE name = 'concept';")
  # delete_cursor.execute("DELETE FROM pictogram; ")
  # delete_cursor.execute("DELETE FROM sqlite_sequence WHERE name = 'pictogram';")
  delete_cursor.execute("DELETE FROM pictogram_concept; ")
  delete_cursor.execute("DELETE FROM sqlite_sequence WHERE name = 'pictogram_concept';")
  delete_cursor.execute("DELETE FROM semantic_relationship; ")
  delete_cursor.execute("DELETE FROM sqlite_sequence WHERE name = 'semantic_relationship';")
  delete_cursor.execute("DELETE FROM taxonomic_relationship; ")
  delete_cursor.execute("DELETE FROM sqlite_sequence WHERE name = 'taxonomic_relationship';")

  delete_cursor.close()
  for s, p, o in tqdm(g.triples((None, RDF.type, ontolex.LexicalConcept))):
      concept_name = g.label(s)
      concept_id = db_create_concept(conn,concept_name)
  for word,p,o in tqdm(g.triples((None, RDF.type, ontolex.Word))):
    form = g.value(word, ontolex.canonicalForm, None)
    writtenRep = g.value(form, ontolex.writtenRep, None)
    for s,p,synset in g.triples((word, ontolex.evokes, None)):
        if synset is None:
            synset = g.value(word, ontolex.evokes, None)
        pos = g.qname(g.value(word, lexinfo.partOfSpeech, None)).split(':')[1]
        
        collection = g.value(None, skos.member, synset)

        category_name = g.label(collection).split('-')[0]
        parent_id = db_create_word(conn," ".join(category_name.lower().split("_")),'folder',w_type=2)

        word_id = db_create_word(conn,writtenRep.lower(),pos,parent=parent_id)

        concept_name = g.label(synset)
        concept_id = db_create_concept(conn,concept_name)
        word_concept_id = db_create_word_concept(conn,concept_id,word_id)

  frequencyQry = """
      SELECT ?frequency WHERE {
          ?axiom rdf:type owl:Axiom .
          ?axiom owl:annotatedProperty ?property .
          ?axiom owl:annotatedSource ?source .
          ?axiom owl:annotatedTarget ?target .
          ?axiom myns:frequency ?frequency
      }
  """
  preparedQry = prepareQuery(frequencyQry,initNs={"rdf":RDF,"owl":OWL,"myns":myns})
  i = 0
  for s, p, o in tqdm(g.triples((None, RDF.type, ontolex.LexicalConcept))):
      i = i + 1
      concept_name = g.label(s)
      concept_id = db_create_concept(conn,concept_name)

      for hypo, prop, hyper in g.triples((s, lexinfo.hypernym, None)):
          # print(hypo)
          hypernym_id = db_create_concept(conn,g.qname(hyper))
          rel = db_create_tax(conn,hypernym_id, concept_id)

      for semantic_role, prop, o in g.triples((None, RDFS.subPropertyOf, myns.semanticRole)):
          for _s,_p,destination in g.triples((s, semantic_role, None)):
              destination_id = db_create_concept(conn,g.qname(destination))
              frequency = 0
              a = db_create_semantic_rel(conn,g.qname(semantic_role),concept_id,destination_id,frequency)

  conn.commit()
  conn.close()

4514it [00:00, 4956.21it/s]
3511it [00:08, 432.78it/s]
4514it [00:07, 618.04it/s]


## RUN

In [64]:
# Floats are alpha values passed to cut_off; integers keep the top-k most frequent complements.
percentiles = [0.01,0.05,0.1,0.15,0.2,6,12,24,32,40]

In [31]:
!mkdir ontologies

In [None]:
from collections import Counter
import copy

for percentile in tqdm(percentiles):
  # Work on a fresh copy so relations from one setting do not leak into the next.
  my_g = copy.deepcopy(g)
  redundancy = Redundancy(my_g)
  # Integer entries select the top-k complements; float entries are alpha values.
  most_common = percentile if isinstance(percentile, int) else None

  for synset in semantic_model:
    for arg in semantic_model[synset]:
      items = semantic_model[synset][arg]
      if len(items) > 0:
        redundance_removed = redundancy.by_frequency(items)
        if most_common is None:
          accepted, rejected = cut_off(redundance_removed,percentile, assume_normal=True)
        else:
          accepted = Counter(redundance_removed).most_common(most_common)
        real_accepted = redundancy.parent_preference(accepted)
        # Declare the role in the base graph and mirror the declaration in the copy.
        role = create_semantic_role(arg)
        my_g.add((role, RDF.type, OWL.ObjectProperty))
        my_g.add((role, RDFS.subPropertyOf, myns.semanticRole))
        for complement_synset, frequency_qt in real_accepted:
          my_g.add((myns[synset], role, myns[complement_synset]))

          # Attach the observed frequency to the relation as an OWL axiom annotation.
          a = BNode()
          my_g.add((a, RDF.type, OWL.Axiom))
          my_g.add((a, OWL.annotatedSource, myns[synset]))
          my_g.add((a, OWL.annotatedProperty, role))
          my_g.add((a, OWL.annotatedTarget, myns[complement_synset]))
          my_g.add((a, frequency, Literal(str(frequency_qt))))
  my_g.serialize(destination="./ontologies/output_CS_f{0}.ttl".format(percentile), format="turtle")
  to_database(my_g, percentile)
