# Definition analysis

Given a definition, we would like to divide it into its fundamental parts.

We focus on nouns.
Hypothesis: nouns are defined by the following components:
- The bigger class they belong to. The object can be a subtype of that class or a part of it.
- Attributes that define them more specifically within the class.
- Relationships with other things: actions that they can perform and actions that can be performed on them.

Examples:
house > building > structure > object > thing
toe > finger > leg > limb > animal > living organism (an individual animal, plant, or single-celled life form)
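
The three components of the hypothesis, and the class chains in the examples, could be held in a small data structure. The following is only a sketch under assumptions: the name `NounEntry`, its fields, and the hand-written `lexicon` are hypothetical and are not part of `thothsnehet`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NounEntry:
    """Hypothetical container for the three components of a noun definition."""
    word: str
    parent: Optional[str] = None  # bigger class (supertype or whole)
    attributes: List[str] = field(default_factory=list)
    relations: Dict[str, List[str]] = field(default_factory=dict)


def class_chain(word: str, lexicon: Dict[str, NounEntry]) -> List[str]:
    """Follow parent links until a word has no parent or is not in the lexicon."""
    chain = [word]
    seen = {word}
    while word in lexicon and lexicon[word].parent is not None:
        word = lexicon[word].parent
        if word in seen:  # guard against circular definitions
            break
        chain.append(word)
        seen.add(word)
    return chain


# Hand-written toy lexicon mirroring the first example chain.
lexicon = {
    "house": NounEntry("house", parent="building"),
    "building": NounEntry("building", parent="structure"),
    "structure": NounEntry("structure", parent="object"),
    "object": NounEntry("object", parent="thing"),
}

print(" > ".join(class_chain("house", lexicon)))
```

The `seen` set is there because dictionary definitions can be circular, in which case a naive walk would never terminate.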


What are the most basic components?
- Thing, something, anything
- Ones 

Maybe the definition should also cover the sub-elements that the thing is made of. It is also unclear how to handle expressions such as "Each of".

How tailored should this be?
- Extreme: map every expression to its structural component: Each of -> 

How should compound names such as "living form" be handled?


This structure will only be the skeleton of the concepts; much more can then be added on top of it. For example, these definitions do not tell us what happens to an onion when we leave it in the sun; for that we would need to add information from a source such as Wikipedia. How should this work for compound things? When we transform something, does it become another entity? Probably. An object is defined by its potential: once it has been acted on, it is another object.



In human language, endophoric awareness plays a key part in comprehension (decoding) skills, writing (encoding) skills, and general linguistic awareness. Endophora consists of anaphoric, cataphoric, and self-references within a text.
Anaphoric references occur when a word refers back to other ideas in the text for its meaning.
- "David went to the concert. He said it was an amazing experience."
- "He" refers to David.
- "It" refers to the concert.

Cataphoric references occur when a word refers to ideas later in the text.
- "Every time I visit her, my grandma bakes me cookies."
- "Her" refers to my grandma.

Coreference resolution is the NLP (Natural Language Processing) equivalent of endophoric awareness used in information retrieval systems, conversational agents, and virtual assistants like Amazon’s Alexa. It is the task of clustering mentions in text that refer to the same underlying entities.
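
To make "clustering mentions" concrete, the sketch below encodes the David example as hand-annotated clusters of character spans and substitutes each later mention with the first mention of its cluster. No coreference model is involved: the clusters are written by hand, and the helper names `find_span` and `resolve` are hypothetical.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

text = "David went to the concert. He said it was an amazing experience."


def find_span(text: str, mention: str) -> Span:
    """Locate the first whole-word occurrence of a mention."""
    match = re.search(r"\b" + re.escape(mention) + r"\b", text)
    if match is None:
        raise ValueError(f"mention not found: {mention!r}")
    return match.span()


# Hand-annotated clusters: each list holds the mentions of one entity.
# The first mention of a cluster is treated as its representative.
clusters: List[List[Span]] = [
    [find_span(text, "David"), find_span(text, "He")],
    [find_span(text, "the concert"), find_span(text, "it")],
]


def resolve(text: str, clusters: List[List[Span]]) -> str:
    """Replace every non-representative mention with its representative."""
    replacements = []
    for cluster in clusters:
        head_start, head_end = cluster[0]
        head_text = text[head_start:head_end]
        for start, end in cluster[1:]:
            replacements.append((start, end, head_text))
    # Apply from right to left so earlier offsets stay valid.
    for start, end, head_text in sorted(replacements, reverse=True):
        text = text[:start] + head_text + text[end:]
    return text


print(resolve(text, clusters))
```

A real system would have to predict the clusters; the substitution step is the easy part.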

In [25]:
%load_ext autoreload
%autoreload 2

import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
import os
import json
%matplotlib qt 

from thothsnehet.dictionary_crawler import DictionaryCrawler

from thothsnehet.utils.basic import merge_dictionaries, get_unique_words, get_all_text_from_definitions, nltk_pos_tag

from traphing.utils import unwrap
import shutil

import nltk
nltk.download('averaged_perceptron_tagger')

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm


In [3]:
storage_path = "./definitions/"

Remove the folder holding the examples from the previous execution, if it exists.

In [4]:
# ignore_errors=True skips a missing folder without needing a bare except
shutil.rmtree(storage_path, ignore_errors=True)

In [5]:
filename = "words.jl"
dictionary_source = "oxford"
dictionary_crawler = DictionaryCrawler(storage_path, dictionary_source)

Crawl the word and read back its first noun definition.

In [6]:
words = "cat"
output, error, return_code = dictionary_crawler.crawl_definitions(words, filename)
definitions = dictionary_crawler.read_crawled_words(filename)
definition = definitions[words]["noun"][0]

# 1. Entity labelling

This is important for two reasons:
- It tells us which further definitions to look up.
- It helps parse the information into the structure.

In [7]:
definition

'A small domesticated carnivorous mammal with soft fur, a short snout, and retractable claws. It is widely kept as a pet or for catching mice, and many breeds have been developed.'

## 1.1 NLTK

In [10]:
sent = nltk_pos_tag(definition)

In [11]:
sent

[('A', 'DT'),
 ('small', 'JJ'),
 ('domesticated', 'VBN'),
 ('carnivorous', 'JJ'),
 ('mammal', 'NN'),
 ('with', 'IN'),
 ('soft', 'JJ'),
 ('fur', 'NNS'),
 (',', ','),
 ('a', 'DT'),
 ('short', 'JJ'),
 ('snout', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('retractable', 'JJ'),
 ('claws', 'NN'),
 ('.', '.'),
 ('It', 'PRP'),
 ('is', 'VBZ'),
 ('widely', 'RB'),
 ('kept', 'VBN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('pet', 'NN'),
 ('or', 'CC'),
 ('for', 'IN'),
 ('catching', 'VBG'),
 ('mice', 'NN'),
 (',', ','),
 ('and', 'CC'),
 ('many', 'JJ'),
 ('breeds', 'NNS'),
 ('have', 'VBP'),
 ('been', 'VBN'),
 ('developed', 'VBN'),
 ('.', '.')]
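
One way to use these tags for the "attributes" component is to attach each run of adjective-like tokens to the noun that follows it. The sketch below does this with plain Python over the first sentence of the output above. The function `noun_groups` and the choice of modifier tags are assumptions made for illustration, not something `thothsnehet` provides.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]

# First sentence of the tagged definition, copied from the output above.
tagged: Tagged = [
    ("A", "DT"), ("small", "JJ"), ("domesticated", "VBN"),
    ("carnivorous", "JJ"), ("mammal", "NN"), ("with", "IN"),
    ("soft", "JJ"), ("fur", "NNS"), (",", ","), ("a", "DT"),
    ("short", "JJ"), ("snout", "NN"), (",", ","), ("and", "CC"),
    ("retractable", "JJ"), ("claws", "NN"), (".", "."),
]

# Penn Treebank tags treated as modifiers here (an assumption):
# adjectives plus participles used adjectivally.
MODIFIER_TAGS = {"JJ", "JJR", "JJS", "VBN", "VBG"}


def noun_groups(tagged: Tagged) -> List[Tuple[str, List[str]]]:
    """Pair each noun with the modifiers that immediately precede it."""
    groups = []
    modifiers: List[str] = []
    for word, tag in tagged:
        if tag in MODIFIER_TAGS:
            modifiers.append(word)
        elif tag.startswith("NN"):
            groups.append((word, modifiers))
            modifiers = []
        else:
            modifiers = []  # any other token breaks the run
    return groups


for noun, modifiers in noun_groups(tagged):
    print(noun, modifiers)
```

This only looks at adjacency, so it cannot tell that "fur", "snout" and "claws" belong to the mammal; that link needs the dependency parse.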

## 1.2 spaCy

In [12]:
nlp = en_core_web_sm.load()

In [13]:
doc = nlp(definition)

In [14]:
json_doc = doc.to_json()
json_doc["tokens"][0]

{'id': 0,
 'start': 0,
 'end': 1,
 'pos': 'DET',
 'tag': 'DT',
 'dep': 'det',
 'head': 4}

In [15]:
for token in json_doc["tokens"]:
    print(json_doc["text"][token["start"]:token["end"]], token["pos"], token["tag"], token["dep"])

A DET DT det
small ADJ JJ amod
domesticated VERB VBN amod
carnivorous ADJ JJ amod
mammal NOUN NN ROOT
with ADP IN prep
soft ADJ JJ amod
fur NOUN NN pobj
, PUNCT , punct
a DET DT det
short ADJ JJ amod
snout NOUN NN appos
, PUNCT , punct
and CCONJ CC cc
retractable ADJ JJ amod
claws NOUN NNS conj
. PUNCT . punct
It PRON PRP nsubjpass
is AUX VBZ auxpass
widely ADV RB advmod
kept VERB VBN ROOT
as SCONJ IN prep
a DET DT det
pet NOUN NN pobj
or CCONJ CC cc
for ADP IN conj
catching VERB VBG pcomp
mice NOUN NNS dobj
, PUNCT , punct
and CCONJ CC cc
many ADJ JJ amod
breeds NOUN NNS nsubjpass
have AUX VBP aux
been AUX VBN auxpass
developed VERB VBN conj
. PUNCT . punct
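
The dependency labels already separate the pieces the hypothesis asks for: `ROOT` marks the head of each sentence and `amod` marks adjectival modifiers. As a sketch, the rows of the first sentence printed above can be parsed back into records and filtered by label. The helpers `parse_rows` and `words_with_dep` are hypothetical names used only for this illustration.

```python
from typing import Dict, List

# Rows of the first sentence, copied from the printed output above:
# text, coarse POS, fine-grained tag, dependency label.
ROWS = """\
A DET DT det
small ADJ JJ amod
domesticated VERB VBN amod
carnivorous ADJ JJ amod
mammal NOUN NN ROOT
with ADP IN prep
soft ADJ JJ amod
fur NOUN NN pobj
, PUNCT , punct
a DET DT det
short ADJ JJ amod
snout NOUN NN appos
, PUNCT , punct
and CCONJ CC cc
retractable ADJ JJ amod
claws NOUN NNS conj
. PUNCT . punct
"""

FIELDS = ("text", "pos", "tag", "dep")


def parse_rows(rows: str) -> List[Dict[str, str]]:
    """Turn whitespace-separated rows into one dict per token."""
    return [dict(zip(FIELDS, line.split()))
            for line in rows.splitlines() if line.strip()]


def words_with_dep(records: List[Dict[str, str]], dep: str) -> List[str]:
    """Return the token texts that carry the given dependency label."""
    return [record["text"] for record in records if record["dep"] == dep]


records = parse_rows(ROWS)
print(words_with_dep(records, "ROOT"))
print(words_with_dep(records, "amod"))
```

Filtering by label alone does not say which noun each `amod` modifies; that requires the `head` field of each token.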


In [16]:
token

{'id': 35,
 'start': 177,
 'end': 178,
 'pos': 'PUNCT',
 'tag': '.',
 'dep': 'punct',
 'head': 34}

In [17]:
token["start"]

177

The dependency visualizer, dep, shows part-of-speech tags and syntactic dependencies.

In [18]:
displacy.render(nlp(str(definition)), jupyter=True, style='dep')

In [19]:
doc

A small domesticated carnivorous mammal with soft fur, a short snout, and retractable claws. It is widely kept as a pet or for catching mice, and many breeds have been developed.

## Get the first root, which should be the hypernym

In [23]:
from thothsnehet.utils.basic import get_first_root

In [27]:
get_first_root(definition, nlp)

'mammal'
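
The implementation of `get_first_root` is not shown in this notebook. As a rough sketch of the idea, the code below works on a dictionary shaped like the `doc.to_json()` output shown earlier: it returns the first token whose `dep` is `ROOT`, then collects the `amod` tokens attached to it. The function names are hypothetical, and the sample is hand-built: only the `head` of token 0 comes from the output above, the other `head` values are assumed.

```python
from typing import Any, Dict, List, Optional

Token = Dict[str, Any]

# Hand-built stand-in for doc.to_json(), covering the first noun phrase only.
# Heads of tokens 1-4 are assumptions, not copied from real spaCy output.
json_doc = {
    "text": "A small domesticated carnivorous mammal",
    "tokens": [
        {"id": 0, "start": 0, "end": 1, "dep": "det", "head": 4},
        {"id": 1, "start": 2, "end": 7, "dep": "amod", "head": 4},
        {"id": 2, "start": 8, "end": 20, "dep": "amod", "head": 4},
        {"id": 3, "start": 21, "end": 32, "dep": "amod", "head": 4},
        {"id": 4, "start": 33, "end": 39, "dep": "ROOT", "head": 4},
    ],
}


def token_text(doc: Dict[str, Any], token: Token) -> str:
    """Slice the token's surface form out of the document text."""
    return doc["text"][token["start"]:token["end"]]


def first_root(doc: Dict[str, Any]) -> Optional[Token]:
    """Return the first token labelled ROOT, or None if there is none."""
    for token in doc["tokens"]:
        if token["dep"] == "ROOT":
            return token
    return None


def modifiers_of(doc: Dict[str, Any], head: Token, dep: str = "amod") -> List[str]:
    """Texts of the tokens with the given label whose head is `head`."""
    return [token_text(doc, token) for token in doc["tokens"]
            if token["head"] == head["id"] and token["dep"] == dep]


root = first_root(json_doc)
if root is not None:
    print(token_text(json_doc, root), modifiers_of(json_doc, root))
```

Taking the first root relies on the dictionary's habit of opening a noun definition with a noun phrase whose head names the bigger class; definitions written differently would need another rule.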

## Create a blocking server for the image.

In [20]:
displacy.serve(doc, style="dep")

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...



127.0.0.1 - - [26/Jul/2020 10:51:51] "GET / HTTP/1.1" 200 25419
127.0.0.1 - - [26/Jul/2020 10:51:51] "GET /favicon.ico HTTP/1.1" 200 25419


Shutting down server on port 5000.

