## Scraping

Since I could not find any reliable English dictionary online, I decided to scrape it from the most reliable sources. For this purpose the class XXXX has been created.

This class uses the scrapers from the GitHub repository:
https://github.com/kiasar/Dictionary_crawler

The closest thing I found to an open dictionary is the one from Project Gutenberg, but it does not seem to be very clean or consistent. Parsing it would therefore take far more effort than using the scraped one, provided I do not get blocked while scraping.

In [5]:
%load_ext autoreload
%autoreload 2

import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import Image
import os
import json
%matplotlib qt 

from thothsnehet.dictionary_crawler import DictionaryCrawler

from thothsnehet.utils.basic import merge_dictionaries, get_unique_words, get_all_text_from_definitions

from traphing.utils import unwrap
import shutil

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
storage_path = "./definitions/"

Remove the examples folder left over from the previous execution:

In [52]:
# Ignore errors, e.g. when the folder does not exist yet.
shutil.rmtree(storage_path, ignore_errors=True)

# 1. Create an instance of the crawler

When we initialize it, we indicate the directory where the already downloaded definitions are stored. Further definitions will also be downloaded (scraped) to this directory.

In [53]:
dictionary_source = "oxford"

In [54]:
dictionary_crawler = DictionaryCrawler(storage_path, dictionary_source)

In [55]:
unwrap(dictionary_crawler)

<DictionaryCrawler>	object has children:
    <str>	storage_path:	/home/montoya/Desktop/VScode/thoths-nehe
    <str>	dictionary_source:	oxford
    <str>	crawlers_path:	/home/montoya/Desktop/VScode/thoths-nehe
    <dict>	words_dict

  <dict>	words_dict has children:




## 1.1 Crawl a list of words

The most basic functionality is to download the definitions of a list of words using a Scrapy crawler. The method crawl_definitions() serves this purpose. In its basic form, it simply calls the crawler process from the command line, returns the results of the process and, if successful, writes the definitions to a file in JSON Lines format.
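
As a rough illustration of the "call the crawler from the command line" step, the sketch below assembles a command of the shape `scrapy crawl <source> -o <output file> -a words=<comma-separated words>` and runs a process with `subprocess`. The helper names `build_crawl_command` and `run_command` are hypothetical and are not taken from `DictionaryCrawler`; the real method may assemble the command and report its results differently.

```python
import subprocess
import sys


def build_crawl_command(words, output_path, source="oxford"):
    """Assemble a `scrapy crawl` command line (hypothetical helper)."""
    if isinstance(words, str):
        words = words.split()  # also accept a plain string of words
    return ["scrapy", "crawl", source, "-o", output_path,
            "-a", "words=" + ",".join(words)]


def run_command(command):
    """Run a command and return (stdout, stderr, return_code)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.stdout, result.stderr, result.returncode


command = build_crawl_command("the house is red", "./definitions/words.jl")
print(" ".join(command))

# Demonstrate run_command with the Python interpreter instead of scrapy,
# so the sketch also runs where scrapy is not installed.
output, error, return_code = run_command([sys.executable, "-c", "print('ok')"])
print(return_code)
```

Passing the command as a list avoids shell quoting issues when a word contains unusual characters.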


In [56]:
words = "Your mom".split(" ")
filename = "words.jl"

In [57]:
output, error, return_code = dictionary_crawler.crawl_definitions(words, filename)

In [58]:
output

"['https://www.lexico.com/en/definition/Your', 'https://www.lexico.com/en/definition/mom']\n"

In [59]:
return_code

'0'

In [60]:
print(error[:3000])

2020-07-26 11:37:42 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: dictionary_crawler)
2020-07-26 11:37:42 [scrapy.utils.log] INFO: Versions: lxml 4.4.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.4 (default, Aug 13 2019, 20:35:49) - [GCC 7.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Linux-5.3.0-40-generic-x86_64-with-debian-buster-sid
2020-07-26 11:37:42 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-07-26 11:37:42 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'dictionary_crawler',
 'CONCURRENT_REQUESTS': 512,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'FEED_FORMAT': 'jl',
 'FEED_URI': '/home/montoya/Desktop/VScode/thoths-nehet/thothsnehet/scrapers/dictionary_crawler/spiders/./definitions/words.jl',
 'NEWSPIDER_MODULE': 'dictionary_crawler.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['dictionary_crawler.spiders']}
2020-07-26 11:37:42 [scrapy.ex

We can then load the scraped definitions file, which is stored in JSON Lines format.

In [61]:
definitions = dictionary_crawler.read_crawled_words(filename)
definitions

{'mom': {'noun': ["One's mother."]},
 'your': {'possessive determiner': ['Belonging to or associated with the person or people that the speaker is addressing.',
   'Belonging to or associated with any person in general.',
   'Used to denote someone or something that is familiar or typical of its kind.',
   'Used when addressing the holder of certain titles.']}}
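
For readers unfamiliar with JSON Lines, the following sketch shows one way such a file could be turned into a single dictionary. It assumes that every line holds one JSON object mapping a word to its definitions; that record layout and the helper name `read_json_lines` are assumptions made for illustration, not the actual behaviour of `read_crawled_words`.

```python
import json
import os
import tempfile


def read_json_lines(path):
    """Merge every JSON object found in a JSON Lines file into one dict."""
    definitions = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumed record layout: {"<word>": {"<part of speech>": [definitions]}}
            definitions.update(json.loads(line))
    return definitions


# Write a tiny example file and read it back.
records = [
    {"mom": {"noun": ["One's mother."]}},
    {"your": {"possessive determiner": ["Belonging to or associated with any person in general."]}},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "words.jl")
    with open(example_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    example_definitions = read_json_lines(example_path)

print(sorted(example_definitions))
```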

Under the hood, the method calls a Scrapy process to download the files. We can print what it is doing by setting the verbose parameter to 1.

The words parameter can also be a single string of space-separated words.

In [62]:
process_results = dictionary_crawler.crawl_definitions("the house is red", filename, verbose = 1)

Processing the words:  ['the', 'house', 'is', 'red']
scrapy crawl oxford -o /home/montoya/Desktop/VScode/thoths-nehet/thothsnehet/scrapers/dictionary_crawler/spiders/./definitions/words.jl -a words=the,house,is,red


In [63]:
definitions = dictionary_crawler.read_crawled_words(filename)
definitions.keys()

dict_keys(['mom', 'your', 'the', 'is', 'red', 'house'])

## 1.2 Internal dictionary of words

The dictionary_crawler object can also load all the definition files in the storage folder and set them as the internal variable words_dict.

In [64]:
dictionary_crawler.load_definitions()

In [65]:
dictionary_crawler.words_dict.keys()

dict_keys(['mom', 'your', 'the', 'is', 'red', 'house'])

The property words contains the list of words in words_dict. It is computed on the fly when accessed.

In [66]:
dictionary_crawler.words

['mom', 'your', 'the', 'is', 'red', 'house']
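
A property that is computed on access can be sketched as follows. `WordStore` is a hypothetical stand-in used only to show the pattern; it is not the actual `DictionaryCrawler` implementation.

```python
class WordStore:
    """Hypothetical stand-in for an object that keeps a words_dict."""

    def __init__(self):
        self.words_dict = {}

    @property
    def words(self):
        # Built from words_dict on every access, so it never goes stale.
        return list(self.words_dict.keys())


store = WordStore()
store.words_dict["mom"] = {"noun": ["One's mother."]}
print(store.words)
store.words_dict["your"] = {}
print(store.words)
```

The trade-off is that the list is rebuilt on each access, which is cheap for small dictionaries but worth caching for very large ones.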

## 1.3 Advanced crawling options

Now that we keep track of the words we already have, we can start doing cool things.

We can download only the words that are not already in the dictionary:

In [67]:
process_results = dictionary_crawler.crawl_definitions("the house is blue", filename, only_if_missing = True, verbose = 1)

Processing the words:  ['blue']
scrapy crawl oxford -o /home/montoya/Desktop/VScode/thoths-nehet/thothsnehet/scrapers/dictionary_crawler/spiders/./definitions/words.jl -a words=blue
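
The filtering step behind an option like only_if_missing can be sketched in a few lines. The function `select_missing_words` is a hypothetical helper written for this illustration; how the crawler performs the check internally is not shown in this notebook.

```python
def select_missing_words(words, known_words):
    """Return the words that are not yet in known_words, keeping their order."""
    if isinstance(words, str):
        words = words.split()
    known = set(known_words)
    missing = []
    for word in words:
        # Skip known words and avoid requesting the same word twice.
        if word not in known and word not in missing:
            missing.append(word)
    return missing


known_words = ["mom", "your", "the", "is", "red", "house"]
print(select_missing_words("the house is blue", known_words))
```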


We can also add the contents of a given file to the crawled words:

In [68]:
dictionary_crawler.add_crawled_words(filename)

In [69]:
dictionary_crawler.words

['mom', 'your', 'the', 'is', 'red', 'house', 'blue']

# 2. Recursive definition crawling

Given an initial set of words, we would like to recursively download the words that appear in their definitions, if we do not already have them, until the dictionary is complete.
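
The idea can be sketched as a level-by-level expansion. In the sketch below, `lookup` is a hypothetical callable that stands in for the actual crawl, the toy dictionary is made-up data, and the simple regular-expression tokenizer is an assumption; the real crawl_definitions_recursively method may work differently.

```python
import re


def crawl_recursively(start_words, lookup, max_depth):
    """Breadth-first expansion of a vocabulary through its definitions.

    lookup(word) is a hypothetical callable that returns a list of definition
    strings for the word, or None when no definition can be found.
    """
    known = {}           # word -> list of definitions
    unavailable = set()  # words we tried and could not resolve
    frontier = list(start_words)

    for _ in range(max_depth):
        next_frontier = []
        for word in frontier:
            if word in known or word in unavailable:
                continue
            definitions = lookup(word)
            if definitions is None:
                unavailable.add(word)
                continue
            known[word] = definitions
            # Queue every word that appears in the new definitions.
            for definition in definitions:
                next_frontier.extend(re.findall(r"[a-z]+", definition.lower()))
        frontier = next_frontier

    # Words seen at the last level that have not been looked up yet.
    still_unknown = sorted(set(frontier) - set(known) - unavailable)
    return known, still_unknown, unavailable


# Toy data for illustration only.
toy_dictionary = {
    "table": ["A piece of furniture with a flat top."],
    "furniture": ["Movable articles in a room."],
}
known, still_unknown, unavailable = crawl_recursively(["table"], toy_dictionary.get, max_depth=2)
print(sorted(known))
print(still_unknown)
```

Limiting the depth keeps the number of requests bounded, since every level can multiply the number of words to fetch.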

In [37]:
word = "table"
recursive_depth = 1

In [38]:
all_still_unknown_words = dictionary_crawler.crawl_definitions_recursively(word, recursive_depth)

------ Recursion level 1/1-------
Source word: table
['table']
--> 104/107 unknown to unique words in definitions
--> 0 words we already know we cannot get the definition to
['purposes', 'legs', ')', 'columns', 'typically']
['bearing', 'slab', 'providing', 'records', 'key']
['writing', 'food', 'seated', 'dummy', 'vertical']
['figures', 'faces', 'two', 'for', 'placed']
['stone', 'or', 'held', 'board', '.']
['especially', 'as', 'furniture', 'consideration', 'place']
['discussion', 'top', 'dispute', 'stored', 'formal']
['formally', 'at', 'of', 'cornice', 'group']
['flat', 'inscription', 'by', 'collection', 'an']
['displayed', '(', 'be', 'used', 'cut']
['set', 'meeting', 'discussions', 'household', 'a']
['with', 'surface', 'restaurant', 'which', 'series']
['hand', 'folding', 'in', 'data', 'horizontal']
['on', 'such', 'gem', 'to', 'objects']
['molding', 'forum', 'piece', 'it', 'defined']
['unique', 'each', 'quarter', 'eating', 'backgammon']
['more', 'playing', 'can', 'issue', 'memory']
['an

In [39]:
len(all_still_unknown_words)

2

In [40]:
all_still_unknown_words

(['purposes',
  'legs',
  ')',
  'columns',
  'typically',
  'bearing',
  'slab',
  'providing',
  'records',
  'is',
  'key',
  'writing',
  'food',
  'seated',
  'dummy',
  'vertical',
  'figures',
  'faces',
  'two',
  'for',
  'placed',
  'stone',
  'or',
  'held',
  'board',
  '.',
  'especially',
  'as',
  'furniture',
  'consideration',
  'place',
  'discussion',
  'top',
  'dispute',
  'stored',
  'formal',
  'formally',
  'at',
  'of',
  'cornice',
  'group',
  'flat',
  'inscription',
  'by',
  'collection',
  'an',
  'displayed',
  '(',
  'be',
  'the',
  'used',
  'cut',
  'set',
  'meeting',
  'discussions',
  'household',
  'a',
  'with',
  'surface',
  'restaurant',
  'which',
  'series',
  'hand',
  'folding',
  'in',
  'data',
  'horizontal',
  'on',
  'such',
  'gem',
  'to',
  'forum',
  'molding',
  'objects',
  'piece',
  'it',
  'defined',
  'unique',
  'each',
  'quarter',
  'eating',
  'backgammon',
  'more',
  'playing',
  'can',
  'issue',
  'memory',
  'and',

In [41]:
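# crawl_definitions_recursively returned a tuple whose first element is the
# list of words, so the set has to be built from that element.
diff = set(all_still_unknown_words[0])

In [None]:
len(list(diff))

In [None]:
list(set(all_still_unknown_words[0]).difference(diff))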

If the words are not in the dictionary after scraping, there are multiple options:
- It is just a word that needs stemming, like legs -> leg. We should not stem every word from the beginning (think of programming, for example), but we should try stemming when the lookup fails.
- It is a symbol or a number.
- It is a noun.

How should we handle these cases?
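
One possible way to triage such words is sketched below. It is only a heuristic under stated assumptions: the suffix list is a naive stand-in for a real stemmer, `classify_unknown_word` is a hypothetical helper, and the word is assumed to have already failed an exact lookup, so that words such as programming are never stemmed first.

```python
import string


def classify_unknown_word(word, known_words):
    """Guess why a word has no definition (illustrative heuristic only)."""
    if all(ch in string.punctuation or ch.isdigit() for ch in word):
        return "symbol_or_number", None
    # Naive suffix stripping, tried only after the exact word failed to resolve.
    for suffix in ("ies", "es", "s", "ing", "ed", "ly"):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            if suffix == "ies":
                stem += "y"
            if stem in known_words:
                return "inflected_form", stem
    return "unresolved", None


known_words = {"leg", "formal", "programming"}
for token in ["legs", ")", "formally", "42", "backgammon"]:
    print(token, classify_unknown_word(token, known_words))
```

Words that end up as "unresolved" would still need a separate decision, for example keeping them in a list of words that cannot be defined.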

In [None]:
unknown_words

# 3. Parsing playground

The basic parsing functionality was not perfect, so let's try to improve it. In the following we will play with different preprocessing options, using an example of the Oxford website layout previously saved to disk.

In [None]:
from thothsnehet.scrapers.dictionary_crawler.spiders.parsers import oxford_parser
from scrapy.http import HtmlResponse
import re

In [None]:
def oxford_parser(response):
    definition_dict = {}

    # Each type of meaning [noun, verb...] is in a <section class="gramb">
    for grammar_word_type_section in response.xpath("//section[@class='gramb']"):
        try:
            # The first <span class="pos">noun</span> contains the type of word
            part_of_speech = grammar_word_type_section.xpath(".//span[@class='pos']/text()").extract()[0]
        except IndexError:
            part_of_speech = False

        if part_of_speech:
            definition_dict[part_of_speech] = dict()

        # One <div class="trg"> per meaning. Only the first layer is visited,
        # since subdefinitions are nested and carry the same class.
        for meaning_div in grammar_word_type_section.xpath("./div[@class='trg']"):
            try:
                # The first <span class="iteration"> contains the index of the meaning.
                # If it is missing, the div can be a subdefinition, among other cases.
                meaning_index = meaning_div.xpath(".//span[@class='iteration']/text()").extract()[0]
            except IndexError:
                meaning_index = False
            print(meaning_index)  # debug output

            def_list = meaning_div.xpath(".//span[@class='ind']").extract()
            print(def_list)  # debug output
            if not def_list:
                # Fall back to the cross reference when there is no definition text
                def_list = meaning_div.xpath(".//div[@class='empty_sense']//div[@class='crossReference']").extract()

            # Strip the HTML tags and drop empty strings
            def_list = [re.sub(r'<.*?>', "", i).strip() for i in def_list]
            def_list = [i for i in def_list if i]

            if def_list and part_of_speech and meaning_index:
                # Keep only the first definition found for this meaning
                def_list = [def_list[0]]
                if meaning_index in definition_dict[part_of_speech]:
                    definition_dict[part_of_speech][meaning_index] += def_list
                else:
                    definition_dict[part_of_speech][meaning_index] = def_list

    return definition_dict


In [33]:
example_oxford_file_path = "../thothsnehet/scrapers/dictionary_crawler/spiders/html_format_oxford.html"
with open(example_oxford_file_path, "r") as f:
    html_doc = f.read().encode('utf-8')
html_doc[:50]

b'<section class="gramb">\n    <h3 class="ps pos"><sp'

In [34]:
response = HtmlResponse(url = "", status=200, headers=None, body=html_doc)

In [35]:
dictionary = oxford_parser(response)

In [36]:
for key in dictionary:
    print(key)
    for key2 in dictionary[key]:
        print(key2)
        # Collapse newlines and repeated whitespace before printing
        for x in dictionary[key][key2]:
            print(" ".join(x.replace("\n", "").split()))

noun
