## Scrapying

Given a definition, we would like to divide it into its fundamental parts.

We are going to focus on nouns.
Hypothesis: nouns are defined by the following components:
- The bigger class they belong to. The object can be a subtype of that class or a part of it.
- Attributes that define them more specifically within the class.
- Relationships with other things: actions that they can perform and actions that can be performed on them.

Examples:
- house > building > structure > object > thing
- toe > finger > leg > limb > animal > living organism (an individual animal, plant, or single-celled life form)
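
The chains above can be read as a linked structure in which every noun points to its bigger class. The following is only a minimal sketch of that idea: `NounEntry` and `class_chain` are hypothetical names introduced for illustration and are not part of `thothsnehet`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NounEntry:
    """Hypothetical container for the three components of the hypothesis."""
    word: str
    parent: Optional[str] = None  # bigger class (supertype or whole)
    attributes: List[str] = field(default_factory=list)
    relationships: List[str] = field(default_factory=list)


def class_chain(word: str, entries: Dict[str, NounEntry]) -> List[str]:
    """Follow parent links until a word has no parent or is unknown."""
    chain = [word]
    seen = {word}
    while word in entries and entries[word].parent is not None:
        word = entries[word].parent
        if word in seen:  # guard against circular definitions
            break
        chain.append(word)
        seen.add(word)
    return chain


# Toy data reproducing the first example chain.
entries = {
    "house": NounEntry("house", parent="building"),
    "building": NounEntry("building", parent="structure"),
    "structure": NounEntry("structure", parent="object"),
    "object": NounEntry("object", parent="thing"),
}

print(" > ".join(class_chain("house", entries)))
# house > building > structure > object > thing
```

The `seen` set is there because dictionary definitions can be circular, so a naive walk up the chain might never terminate.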



In [1]:
%load_ext autoreload
%autoreload 2

import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
import os
import json
%matplotlib qt 

from thothsnehet.dictionary_analyzer import DictionaryAnalyzer
from thothsnehet.dictionary_crawler import DictionaryCrawler

from thothsnehet.utils.basic import merge_dictionaries, get_unique_words, get_all_text_from_definitions

from traphing.utils import unwrap
import shutil

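import nltk
nltk.download('punkt')  # tokenizer models used by nltk.word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by nltk.pos_tag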

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

In [2]:
storage_path = "./definitions_example/"

Remove the examples folder left over from the previous execution.

In [3]:
# Delete the folder; ignore the error if it does not exist yet.
shutil.rmtree(storage_path, ignore_errors=True)

In [4]:
filename = "words.jl"
dictionary_source = "oxford"
dictionary_crawler = DictionaryCrawler(storage_path, dictionary_source)

Get the definition

In [12]:
words = "leg"
output, error, return_code = dictionary_crawler.crawl_definitions(words, filename)
definitions = dictionary_crawler.read_crawled_words(filename)
definition = definitions["leg"]["noun"][0]

# 1. Entity labelling

This is important for two reasons:
- It tells us which further definitions to look up.
- It helps in parsing the information into the structure.

In [13]:
definition

'Each of the limbs on which a person or animal walks and stands.'

In [15]:
def nltk_pos_tag(sent):
    """Tokenize a sentence and tag each token with its part of speech."""
    tokens = nltk.word_tokenize(sent)
    return nltk.pos_tag(tokens)

In [16]:
sent = nltk_pos_tag(definition)

In [17]:
sent

[('Each', 'DT'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('limbs', 'NNS'),
 ('on', 'IN'),
 ('which', 'WDT'),
 ('a', 'DT'),
 ('person', 'NN'),
 ('or', 'CC'),
 ('animal', 'NN'),
 ('walks', 'NNS'),
 ('and', 'CC'),
 ('stands', 'NNS'),
 ('.', '.')]
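
One possible way to use these tags to decide which further definitions to look up is to keep only the tokens tagged as nouns. The sketch below works on the tagged output shown above, copied in as a literal so it runs without NLTK; `noun_candidates` is a hypothetical helper, not part of the library.

```python
# Tagged tokens copied from the tagger output for the definition of "leg".
tagged = [
    ('Each', 'DT'), ('of', 'IN'), ('the', 'DT'), ('limbs', 'NNS'),
    ('on', 'IN'), ('which', 'WDT'), ('a', 'DT'), ('person', 'NN'),
    ('or', 'CC'), ('animal', 'NN'), ('walks', 'NNS'), ('and', 'CC'),
    ('stands', 'NNS'), ('.', '.'),
]


def noun_candidates(tagged_tokens):
    """Return the tokens whose Penn Treebank tag starts with 'NN'."""
    return [token for token, tag in tagged_tokens if tag.startswith("NN")]


print(noun_candidates(tagged))
# ['limbs', 'person', 'animal', 'walks', 'stands']
```

Note that the tagger labelled "walks" and "stands" as plural nouns (`NNS`) although they act as verbs in this sentence, so a filter based only on these tags also returns them.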

In [20]:

nlp = en_core_web_sm.load()

In [21]:
displacy.render(nlp(str(definition)), jupyter=True, style='ent')


In [26]:
doc = nlp(definition)

In [27]:
doc

Each of the limbs on which a person or animal walks and stands.

In [None]:
displacy.serve(doc, style="dep")

Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

## 1.2 Internal dictionary of words

The dictionary_crawler object can also load all the definition files in the storage folder and store them in its internal variable words_dict.

In [15]:
dictionary_crawler.load_definitions()
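
To illustrate what such a loading step might look like, here is a rough sketch that merges every `.jl` file in a folder into a single dictionary. It is not the actual implementation of `load_definitions`: the function name `load_definitions_sketch` and the record layout (one JSON object per line with the keys `"word"` and `"definitions"`) are assumptions made for this example only.

```python
import json
import tempfile
from pathlib import Path


def load_definitions_sketch(folder):
    """Merge every *.jl file in `folder` into {word: {pos: [definitions]}}.

    Assumed (hypothetical) format: one JSON object per line with the
    keys "word" and "definitions".
    """
    words_dict = {}
    for path in sorted(Path(folder).glob("*.jl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:  # skip empty lines
                    continue
                record = json.loads(line)
                words_dict[record["word"]] = record["definitions"]
    return words_dict


# Usage with a temporary folder containing a single made-up record.
with tempfile.TemporaryDirectory() as tmp:
    sample = {
        "word": "leg",
        "definitions": {
            "noun": ["Each of the limbs on which a person or animal walks and stands."]
        },
    }
    (Path(tmp) / "words.jl").write_text(json.dumps(sample) + "\n", encoding="utf-8")
    loaded = load_definitions_sketch(tmp)

print(loaded["leg"]["noun"][0])
```

The resulting dictionary is indexed the same way as `definitions["leg"]["noun"][0]` earlier in the notebook, which is the only part of the layout taken from the source.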