In [80]:
import os
from collections import defaultdict

In [81]:
import parser as lr

# Terminology

If we look up the French words "pêche" and "péché", both are on the same page and share the address "https://dictionnaire.lerobert.com/definition/peche". The last part of the address, "peche", contains no diacritics. We will call this last part the `word_path`, and we will use it as the `filename` when saving the HTML of a page.

# Discovering Definition Pages

`parser.py` provides the functions needed to discover valid `word_path`s when scraping the dictionary:
1. The dictionary has an "Explorer le dictionnaire" section that lists links to definitions, both valid and invalid. To collect all the links in that section, use the `get_explored_links()` function.
2. You can also find definition pages via the API used by the built-in search. For this purpose, call the `get_suggested_word_paths()` function with a search term as an argument.
3. Once you have saved some definition pages as HTML files, you can go over these files and extract all definition links with the `find_word_paths_html_file()` function.
4. The last option is to build a `word_path` from each entry of a wordlist, request the page, and check its status code. Before doing so, remove diacritics and replace whitespace and `'` with `-`.

All four approaches were used to compile a list of valid definition pages, `definition_word_paths.txt`.
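The normalisation described in option 4 can be sketched with the standard library alone. The helper names `to_word_path()` and `definition_url()` are hypothetical and are not part of `parser.py`; the sketch only applies the two rules stated above (strip diacritics, replace whitespace and `'` with `-`) and assumes nothing else about how the site builds its addresses.

```python
import re
import unicodedata

BASE_URL = "https://dictionnaire.lerobert.com/definition/"


def to_word_path(word: str) -> str:
    """Hypothetical helper: turn a wordlist entry into a candidate word_path."""
    # Decompose accented characters into base letter + combining mark,
    # then drop the combining marks (the diacritics).
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace every whitespace character and apostrophe with a hyphen.
    return re.sub(r"[\s']", "-", stripped)


def definition_url(word_path: str) -> str:
    """Hypothetical helper: build the address of a definition page."""
    return BASE_URL + word_path


print(to_word_path("péché"))
print(definition_url(to_word_path("aujourd'hui")))
```

Ligatures such as "œ" have no Unicode decomposition, so this sketch leaves them unchanged; whether the site expects them rewritten would need to be checked against real addresses.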

# Scraping the Content

Let's use `definition_word_paths.txt` to download definition pages.

In [82]:
with open('definition_word_paths.txt', 'r', encoding='utf-8') as f:
    word_paths = [line.strip() for line in f]

## HTML

In [83]:
html_path = './assets/html'
os.makedirs(html_path, exist_ok=True)

In [84]:
results = lr.execute_async(lr.download_html, word_paths[:10], html_path=html_path)

100% (1 of 1) |##########################| Elapsed Time: 0:02:09 Time:  0:02:09


We saved the first 10 pages from `definition_word_paths.txt`. The `execute_async()` function is built on the `concurrent.futures` module and will come in handy when downloading all 51,000+ pages.
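To give an idea of what a helper built on `concurrent.futures` can look like, here is a minimal sketch. It is not the actual implementation of `execute_async()`: the name `run_concurrently()` and the `max_workers` parameter are assumptions, and only the `processes` switch and the forwarding of keyword arguments mirror the calls used in this notebook.

```python
from concurrent.futures import (
    ProcessPoolExecutor,
    ThreadPoolExecutor,
    as_completed,
)


def run_concurrently(func, items, processes=False, max_workers=8, **kwargs):
    """Hypothetical stand-in: call func(item, **kwargs) for every item concurrently."""
    # Threads suit I/O-bound work such as downloads; processes suit
    # CPU-bound work such as parsing (func must then be picklable).
    executor_cls = ProcessPoolExecutor if processes else ThreadPoolExecutor
    results = []
    with executor_cls(max_workers=max_workers) as executor:
        futures = [executor.submit(func, item, **kwargs) for item in items]
        # Collect results as tasks finish, not in submission order.
        for future in as_completed(futures):
            results.append(future.result())
    return results


def fake_download(word_path, html_path="."):
    # Placeholder task that only builds the target file name.
    return f"{html_path}/{word_path}.html"


print(sorted(run_concurrently(fake_download, ["de", "je"], html_path="./assets/html")))
```

Because results arrive in completion order, a task that needs to be matched to its input should return the input alongside its result, as the `(res_filename, res_parents)` pairs do later in this notebook.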

## Audio (MP3)

In [85]:
audio_path = './assets/mp3'
os.makedirs(audio_path, exist_ok=True)

In [86]:
results = lr.execute_async(lr.download_audio, lr.list_html_files(html_path)[:10], html_path=html_path, audio_path=audio_path)

100% (1 of 1) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00


The `download_audio()` function downloads all `.mp3` files required by the HTML files saved above; `list_html_files()` provides the list of those HTML files.

# Analysing the Structure of HTML

To get definition tags, we can use the `find_definitions()` function as shown below.

In [87]:
html_file = lr.list_html_files(html_path)[0]
definition_tags = lr.find_definitions(lr.read_html_file(html_file, html_path=html_path))
html_file, len(definition_tags)

('de', 6)

Now we can find all strings in a definition tag and index them by their parents.

In [88]:
results = lr.execute_async(lr.index_strings_by_parents, lr.list_html_files(html_path)[:10], processes=True, html_path=html_path)
string_parents = defaultdict(lambda: dict())
for res_filename, res_parents in results:
    for key, val in res_parents.items():
        string_parents[key][res_filename] = val
string_parents = dict(string_parents)

100% (1 of 1) |##########################| Elapsed Time: 0:00:00 Time:  0:00:00


In [89]:
for key in list(string_parents.keys())[:5]:
    print(key)

(('h3', None), ('span', 'notBold'))
(('h3', None),)
(('h3', None), ('span', 'd_sound_cont'), ('sound', 'd_sound'), ('span', 'audioPlayer'), ('span', 'audioPlayer-play-pause-wrapper'), ('span', 'audioPlayer-speaker'))
(('h3', None), ('span', 'd_sound_cont'), ('sound', 'd_sound'), ('span', 'audioPlayer'), ('audio', None), ('span', None))
(('h3', None), ('span', 'd_mtb'))


Each key is a tuple of tags, where each tag is itself a `(name, class)` tuple. After indexing, finding a page and its content with a specific HTML structure becomes simple.
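To make the key format concrete, the following sketch builds the same kind of `(name, class)` parent path for every text string, using only the standard library. It is an illustration under stated assumptions, not the code behind `index_strings_by_parents()`: the class `ParentPathIndexer` and the sample HTML are made up, and the sketch stores the strings themselves rather than the position indices shown in the real output.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag and so must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ParentPathIndexer(HTMLParser):
    """Hypothetical indexer: map each chain of (tag, class) parents to its strings."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open tags, outermost first
        self.index = defaultdict(list)  # parent path -> list of strings

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, dict(attrs).get("class")))

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name and anything nested in it.
        for pos in range(len(self.stack) - 1, -1, -1):
            if self.stack[pos][0] == tag:
                del self.stack[pos:]
                break

    def handle_data(self, data):
        if data.strip() and self.stack:
            self.index[tuple(self.stack)].append(data)


# Made-up snippet shaped like the keys printed above.
indexer = ParentPathIndexer()
indexer.feed('<h3><span class="notBold">Définition de </span>de</h3>')
indexer.close()
for path, strings in indexer.index.items():
    print(path, strings)
```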

In [90]:
string_parents[(('h3', None), ('span', 'notBold'))]

{'de': {0: [(1, 0, 0)],
  1: [(1, 0, 0)],
  2: [(1, 0, 0)],
  3: [(1, 0, 0)],
  4: [(1, 0, 0)],
  5: [(1, 0, 0)]},
 'je': {0: [(1, 0, 0)]},
 'est': {0: [(1, 0, 0)]},
 'pas': {0: [(1, 0, 0)], 1: [(1, 0, 0)]},
 'le': {0: [(1, 0, 0)], 1: [(1, 0, 0)], 2: [(1, 0, 0)]},
 'que': {0: [(1, 0, 0)], 1: [(1, 0, 0)], 2: [(1, 0, 0)]},
 'la': {0: [(1, 0, 0)], 1: [(1, 0, 0)], 2: [(1, 0, 0)]},
 'vous': {0: [(1, 0, 0)]},
 'tu': {0: [(1, 0, 0)], 1: [(1, 0, 0)]},
 'un': {0: [(1, 0, 0)]}}

Let's now retrieve a string using one of the found indices, `(1, 0, 0)`, from the first definition tag of `de`, the file loaded above.

In [91]:
lr.get_content(definition_tags[0], (1, 0, 0))

'Définition de '

The dictionary contains plenty of examples, which it marks up with tags that have `class="d_xpl"`. Let's find every combination of such a tag with its immediate parent and, where present, one child.

In [92]:
tag_classes = set()
for key in string_parents.keys():
    for ind, tag in enumerate(key):
        if tag[1] == 'd_xpl':
            # Clamp the start at 0: a negative start would slice from the end of the tuple.
            tag_neighbors = key[max(ind - 1, 0):ind + 2]
            tag_classes.add(tag_neighbors)
tag_classes

{(('div', 'd_dvl'), ('span', 'd_xpl')),
 (('div', 'd_dvl'), ('span', 'd_xpl'), ('span', 'd_gls')),
 (('div', 'd_dvl'), ('span', 'd_xpl'), ('span', 'd_lca')),
 (('div', 'd_dvl'), ('span', 'd_xpl'), ('span', 'd_mta')),
 (('div', 'd_dvl'), ('span', 'd_xpl'), ('span', 'd_x')),
 (('div', 'd_dvn'), ('span', 'd_xpl')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_g')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_gls')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_lca')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_lct')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_mta')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_mtb')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_rm')),
 (('div', 'd_dvn'), ('span', 'd_xpl'), ('span', 'd_x')),
 (('div', 'd_dvr'), ('span', 'd_xpl')),
 (('div', 'd_dvr'), ('span', 'd_xpl'), ('span', 'd_gls')),
 (('div', 'd_dvr'), ('span', 'd_xpl'), ('span', 'd_lca')),
 (('div', 'd_dvr'), ('span', 'd_xpl'), ('span', 'd_mta')),
 (

The parent of an example tag (`class="d_xpl"`) corresponds to a particular meaning in which the word is used.