# `spell_checker`

This is a simple `.ipynb` file meant to quickly demonstrate the capabilities of this module.

First of all, we must add the current directory to `sys.path` (Python's module search path) in order to import the `spell_checker` module:

In [1]:
from os import getcwd
from sys import path as sys_path
sys_path.append(getcwd())
from spell_checker import *

Then, we must import modules of interest to perform this demonstration properly:

In [2]:
from time import time
from random import random

With that out of the way, we can get down to business:

## Basic testing

### `CharacterTree`

In [3]:
a = CharacterTree("abacate","mamão","maniçoba","queijo")
print("Tree created with the words \"abacate\", \"mamão\", \"maniçoba\" and \"queijo\" in it.")
print("Is maniçoba in it? Answer: {}.".format("maniçoba" in a))
print("Is abacate in it?  Answer: {}.".format("abacate" in a))
print("And aba?           Answer: {}.".format("aba" in a))
print("Then let us add aba.")
a.insert("aba")
print("The word aba was added.")
print("Is aba in the tree? Answer: {}.".format("aba" in a))
print("Perfect.")

Tree created with the words "abacate", "mamão", "maniçoba" and "queijo" in it.
Is maniçoba in it? Answer: True.
Is abacate in it?  Answer: True.
And aba?           Answer: False.
Then let us add aba.
The word aba was added.
Is aba in the tree? Answer: True.
Perfect.
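
To make the behavior above easier to picture, here is a minimal sketch of a character-per-node tree (a trie). It is only an illustration under assumptions: `SketchCharacterTree` is a hypothetical name, and the real `CharacterTree` in `spell_checker` may be implemented differently. It shows why a prefix such as "aba" is not reported as a word until it is inserted explicitly.

```python
class SketchCharacterTree:
    """Hypothetical stand-in for CharacterTree: one node per character."""

    def __init__(self, *words):
        self.children = {}      # maps a character to the next node
        self.is_word = False    # True if a word ends at this node
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self
        for char in word:
            # Create the child node on first use, then descend into it.
            node = node.children.setdefault(char, SketchCharacterTree())
        node.is_word = True

    def __contains__(self, word):
        node = self
        for char in word:
            node = node.children.get(char)
            if node is None:
                return False
        # A shared prefix only counts once it has been marked as a word.
        return node.is_word


tree = SketchCharacterTree("abacate", "mamão", "maniçoba", "queijo")
print("aba" in tree)      # False: "aba" is only a prefix of "abacate"
tree.insert("aba")
print("aba" in tree)      # True
```

The end-of-word flag is the design choice that separates stored words from mere prefixes of stored words.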


### `RadixTree`

In [4]:
b = RadixTree("abacate","mamão","maniçoba","queijo")
print("Tree created with the words \"abacate\", \"mamão\", \"maniçoba\" and \"queijo\" in it.")
print("Is maniçoba in it? Answer: {}.".format("maniçoba" in b))
print("Is abacate in it?  Answer: {}.".format("abacate" in b))
print("And aba?           Answer: {}.".format("aba" in b))
print("Then let us add aba.")
b.insert("aba")
print("The word aba was added.")
print("Is aba in the tree? Answer: {}.".format("aba" in b))
print("Perfect.")

Tree created with the words "abacate", "mamão", "maniçoba" and "queijo" in it.
Is maniçoba in it? Answer: True.
Is abacate in it?  Answer: True.
And aba?           Answer: False.
Then let us add aba.
The word aba was added.
Is aba in the tree? Answer: True.
Perfect.
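
For comparison, a radix tree stores a whole substring on each edge and splits an edge only when a new word diverges partway through it. The sketch below is a rough illustration, not the module's code: `SketchRadixTree` is a hypothetical name and the real `RadixTree` may differ in structure and API.

```python
from os.path import commonprefix  # works character by character on plain strings


class SketchRadixTree:
    """Hypothetical stand-in for RadixTree: each edge holds a substring."""

    def __init__(self, *words):
        self.edges = {}         # maps an edge label (a substring) to the next node
        self.is_word = False    # True if a word ends at this node
        for word in words:
            self.insert(word)

    def insert(self, word):
        if not word:
            self.is_word = True
            return
        for label, child in list(self.edges.items()):
            shared = commonprefix([label, word])
            if not shared:
                continue
            if shared != label:
                # Split the edge, e.g. "abacate" becomes "aba" followed by "cate".
                middle = SketchRadixTree()
                middle.edges[label[len(shared):]] = child
                del self.edges[label]
                self.edges[shared] = middle
                child = middle
            child.insert(word[len(shared):])
            return
        # No edge shares a prefix with the word: add a new leaf edge.
        leaf = SketchRadixTree()
        leaf.is_word = True
        self.edges[word] = leaf

    def __contains__(self, word):
        if not word:
            return self.is_word
        for label, child in self.edges.items():
            if word.startswith(label):
                return word[len(label):] in child
        return False


tree = SketchRadixTree("abacate", "mamão", "maniçoba", "queijo")
print("aba" in tree)        # False
tree.insert("aba")
print("aba" in tree)        # True
print(sorted(tree.edges))   # ['aba', 'ma', 'queijo']
```

The last line shows the initial edge labels: "mamão" and "maniçoba" share the edge "ma", and inserting "aba" split the former "abacate" edge.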


## Spell-checking capabilities

With basic testing done, it's time to move on to the more interesting part: the actual spell-checking capability of this module. Although many improvements can still be made, the basic workings of the module shall remain as displayed below.

Under the `dictionaries` folder, a PT-BR dictionary can be found. It was downloaded from [@pythonprobr/palavras](https://github.com/pythonprobr/palavras), and different versions were made from it. The ones used in this section are `palavras.txt` (the whole dictionary) and `palavras_sample.txt`. The sample file was made by randomly picking out roughly 20% of the lines of the original file through the following function:

In [5]:
def sample_file(path,new_path,percentage=.20):
    """Randomly selects lines from a file and writes them to another file.

    Each line is kept independently with probability `percentage`, so the
    sample holds approximately (not exactly) that fraction of the lines.
    
    Parameters
    ----------
    path : str (path)
        The path of the file to be sampled.
    new_path : str (path)
        The path of the file to be created with the chosen lines.
    percentage : float (default = .20)
        The fraction of lines to be chosen, between 0 and 1.
    """
    with open(path,"r",encoding="utf-8") as file:
        sample = [line for line in file if random() < percentage]
    with open(new_path,"w",encoding="utf-8") as file:
        file.writelines(sample)
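
Because `sample_file` decides on each line independently, the size of the sample varies from run to run. If an exact sample size were ever needed, something along the following lines could be used instead. This is only a sketch: `sample_lines_exact` and its `seed` parameter are hypothetical and not part of this notebook or of `spell_checker`.

```python
import random


def sample_lines_exact(lines, fraction=0.20, seed=None):
    """Hypothetical variant: keep exactly round(fraction * len(lines)) lines, in file order."""
    rng = random.Random(seed)
    count = round(fraction * len(lines))
    # Sample indices rather than lines so the original ordering can be restored.
    chosen = sorted(rng.sample(range(len(lines)), count))
    return [lines[index] for index in chosen]


lines = ["word{}\n".format(i) for i in range(1000)]
sample = sample_lines_exact(lines, fraction=0.20, seed=42)
print(len(sample))  # 200
```

The trade-off is that this version needs the number of lines up front, so the whole file has to be read into memory first.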

### Other functions

In order to proceed with further testing, some functions have been created to ease the process. The function `shuffle_file`, defined below, shuffles the lines of a file so that the insertion order favors neither the `CharacterTree` nor the `RadixTree` objects.

In [6]:
def shuffle_file(path,new_path):
    """Randomly reorders the lines of a file and writes the result into another file.

    Parameters
    ----------
    path : str (path)
        The path to the file whose lines are to be randomized.
    new_path : str (path)
        The path in which to write the file with the already-randomized lines.
    """
    with open(path, "r", encoding="utf-8") as file:
        # Pair each line with a random key; sorting by that key shuffles the lines.
        data = sorted([(random(), line) for line in file])
    with open(new_path, "w", encoding="utf-8") as file:
        for _, line in data: file.write(line)
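
One detail worth keeping in mind when shuffling lines: if the last line of the file has no trailing newline, moving it elsewhere glues it to the word that follows. The sketch below shows one way to guard against that, using `random.Random.shuffle` on an in-memory list. `shuffle_lines` and its `seed` parameter are hypothetical names introduced only for this illustration.

```python
import random


def shuffle_lines(lines, seed=None):
    """Hypothetical helper: return a shuffled copy in which every line ends with a newline."""
    shuffled = [line if line.endswith("\n") else line + "\n" for line in lines]
    # shuffle() reorders the list in place; a seed makes the order reproducible.
    random.Random(seed).shuffle(shuffled)
    return shuffled


lines = ["abacate\n", "mamão\n", "maniçoba\n", "queijo"]  # note: no final newline
shuffled = shuffle_lines(lines, seed=0)
print(all(line.endswith("\n") for line in shuffled))  # True
```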

The following function serves to read a dictionary file (one word per line) and turn it into a list, where each item is a word.

In [7]:
def dict_to_list(path):
    """Converts a dictionary file (one word per line) into a Python list object.
    
    Parameters
    ----------
    path : str (path)
        The path to the dictionary file to be transformed into a Python list object.
    
    Returns
    -------
    list
        A list where each item is a single word.
    """
    with open(path, "r", encoding="utf-8") as file:
        return [line.strip("\n") for line in file.readlines()]

### Loading a dictionary

#### One time (without randomized lines)

Loading each tree takes roughly 25 to 30 seconds on my computer; your mileage may vary.

##### `CharacterTree`

In [8]:
start_time = time()
print("Executing from_dict()...",end=" ")
ct_pt_br = from_dict("./dictionaries/palavras.txt","character")
final_time = time() - start_time
print("Time elapsed (in seconds): {}".format(final_time))
del(start_time,final_time)

Executing from_dict()... Time elapsed (in seconds): 25.253722667694092
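
The start/stop/print pattern used in the timing cells could be wrapped in a small helper. The sketch below is one possibility, not something the notebook relies on: `timed` is a hypothetical name, and it uses `time.perf_counter`, which is meant for measuring elapsed intervals, instead of `time.time`. The `sleep` call merely stands in for a call such as `from_dict(...)`.

```python
from contextlib import contextmanager
from time import perf_counter, sleep


@contextmanager
def timed(label):
    """Hypothetical helper: time the enclosed block and print the result."""
    result = {}
    start = perf_counter()
    try:
        yield result
    finally:
        # Record the duration even if the block raised an exception.
        result["seconds"] = perf_counter() - start
        print("{} Time elapsed (in seconds): {}".format(label, result["seconds"]))


with timed("Executing sleep()...") as timing:
    sleep(0.01)  # stand-in for the operation being measured
```

After the block finishes, `timing["seconds"]` holds the measured duration, so no temporary variables need to be deleted afterwards.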


In [9]:
ct_pt_br

<CharacterTree object>
320140 words loaded.
Available Initial Characters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, ª, µ, º, Á, Â, É, Ê, Í, Ò, Ó, Ô, Ø, Ú, à, á, â, ã, ç, é, ê, í, ó, ô, õ, ø, ú

##### `RadixTree`

In [10]:
start_time = time()
print("Executing from_dict()...",end=" ")
rt_pt_br = from_dict("./dictionaries/palavras.txt","radix")
final_time = time() - start_time
print("Time elapsed (in seconds): {}".format(final_time))
del(start_time,final_time)

Executing from_dict()... Time elapsed (in seconds): 25.149322509765625


In [11]:
rt_pt_br

<RadixTree object>
320140 words loaded.
Available Initial Radices: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, ª, µ, º, Á, Â, É, Ê, Í, Ò, Ó, Ônibus, Ø, Ú, à, á, â, ã, ç, é, ê, í, ó, ô, ões, ø, ú

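#### Multiple times (with randomized lines)

The cells below shuffle the dictionary before every load and report the average over `repetitions` runs. They load the shuffled file through `from_txt` rather than `from_dict`, so the averages may not be directly comparable with the single-run timings above.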

In [12]:
repetitions = 10

##### `CharacterTree`

In [13]:
final_time = 0
for i in range(repetitions):
    print("Repetition {}...".format(i), end="")
    shuffle_file("./dictionaries/palavras.txt","./dictionaries/palavras_randomized.txt")
    start_time = time()
    from_txt("./dictionaries/palavras_randomized.txt","CHARACTER")
    final_time += time() - start_time
    print(" done.")
print("\nTotal repetitions: {} | Average time elapsed per repetition (in seconds): {}".format(repetitions, final_time/repetitions))
del(start_time,final_time)

Repetition 0... done.
Repetition 1... done.
Repetition 2... done.
Repetition 3... done.
Repetition 4... done.
Repetition 5... done.
Repetition 6... done.
Repetition 7... done.
Repetition 8... done.
Repetition 9... done.

Total repetitions: 10 | Average time elapsed per repetition (in seconds): 18.002675580978394


##### `RadixTree`

In [14]:
final_time = 0
for i in range(repetitions):
    print("Repetition {}...".format(i), end="")
    shuffle_file("./dictionaries/palavras.txt","./dictionaries/palavras_randomized.txt")
    start_time = time()
    from_txt("./dictionaries/palavras_randomized.txt","RADIX")
    final_time += time() - start_time
    print(" done.")
print("\nTotal repetitions: {} | Average time elapsed per repetition (in seconds): {}".format(repetitions, final_time/repetitions))
del(start_time,final_time)

Repetition 0... done.
Repetition 1... done.
Repetition 2... done.
Repetition 3... done.
Repetition 4... done.
Repetition 5... done.
Repetition 6... done.
Repetition 7... done.
Repetition 8... done.
Repetition 9... done.

Total repetitions: 10 | Average time elapsed per repetition (in seconds): 22.061322021484376
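
An average alone does not tell how much the individual runs varied, which matters when comparing two averages that are a few seconds apart. As a rough sketch, the repetition loop could collect every timing and report the spread as well. `benchmark` is a hypothetical helper, and the sorting call is only a stand-in workload for `from_txt(...)`.

```python
from statistics import mean, stdev
from time import perf_counter


def benchmark(function, repetitions=10):
    """Hypothetical helper: call `function()` several times and return every timing."""
    timings = []
    for _ in range(repetitions):
        start = perf_counter()
        function()
        timings.append(perf_counter() - start)
    return timings


# Stand-in workload; in the notebook this would shuffle the file and load a tree.
timings = benchmark(lambda: sorted(range(100000, 0, -1)), repetitions=10)
print("Average: {:.6f} s | Standard deviation: {:.6f} s".format(mean(timings), stdev(timings)))
```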


### Loading a text file

Text files can be loaded up as well! Take a look:

#### `CharacterTree`

In [15]:
start_time = time()
print("Executing from_txt()...",end=" ")
ct_constituicao = from_txt("./texts/constituicao.txt","CHARACTER")
final_time = time() - start_time
print("Time elapsed (in seconds): {}".format(final_time))
del(start_time,final_time)

Executing from_txt()... Time elapsed (in seconds): 2.4257304668426514


In [16]:
ct_constituicao

<CharacterTree object>
7142 words loaded.
Available Initial Characters: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, z, º, À, Á, Â, É, Í, à, á, â, é, í, ó, ô, ú

#### `RadixTree`

In [17]:
start_time = time()
print("Executing from_txt()...",end=" ")
rt_constituicao = from_txt("./texts/constituicao.txt","RADIX")
final_time = time() - start_time
print("Time elapsed (in seconds): {}".format(final_time))
del(start_time,final_time)

Executing from_txt()... Time elapsed (in seconds): 4.272570610046387


In [18]:
rt_constituicao

<RadixTree object>
7133 words loaded.
Available Initial Radices: A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Qu, R, S, T, U, V, W, X, Yunes, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, z, º, À, Á, Ângelo, É, Í, à, á, âmbito, é, índi, ó, ônus, ú

### Spell-checking dictionary sample

`CharacterTree.check(path)` (and likewise `RadixTree.check(path)`) returns a list containing only the misspelled words, that is, the words in the file that are not loaded in the tree. Note, however, that although `palavras.txt` stores one entry per line, some entries contain spaces. The individual words of such an entry are not always in the dictionary on their own, only as part of the compound entry, so they may be reported even though they are spelled correctly.

In [19]:
ct_misspellings, rt_misspellings = ct_pt_br.check("./dictionaries/palavras_sample.txt"), rt_pt_br.check("./dictionaries/palavras_sample.txt")
print("{} incorrect words found in the CharacterTree: {}".format(len(ct_misspellings), ct_misspellings))
print("{} incorrect words found in the RadixTree: {}".format(len(rt_misspellings), rt_misspellings))

0 incorrect words found in the CharacterTree: []
0 incorrect words found in the RadixTree: []
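
The caveat about entries that contain spaces can be illustrated with a toy checker. Everything here is an assumption made for the example: `check_tokens` is hypothetical, it splits on whitespace (the real `check` may tokenize differently), and the two dictionary entries are made up rather than taken from `palavras.txt`.

```python
def check_tokens(text, dictionary):
    """Hypothetical checker: split on whitespace and return tokens missing from `dictionary`."""
    return [token for token in text.split() if token not in dictionary]


# Made-up dictionary lines; the second one contains spaces and is stored as a single entry.
dictionary = {"abacate", "pão de queijo"}

print(check_tokens("abacate pão de queijo", dictionary))  # ['pão', 'de', 'queijo']
```

The three reported tokens are spelled correctly; they are flagged only because the dictionary knows them as one compound entry.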


### Checking words in the dictionary

The following code measures how long it takes to check whether a batch of words is in each of the tree objects.

In [20]:
words_to_check = 1000

#### `CharacterTree`

In [21]:
final_time = 0
pt_br = dict_to_list("./dictionaries/palavras_randomized.txt")

for i in range(words_to_check):
    word_to_check = pt_br.pop()
    start_time = time()
    word_to_check in ct_pt_br
    final_time += time() - start_time

print("Number of checked words: {} | Average Check Time (in seconds): {}".format(words_to_check, final_time/words_to_check))

Number of checked words: 1000 | Average Check Time (in seconds): 8.574795722961426e-05


#### `RadixTree`

In [22]:
final_time = 0
pt_br = dict_to_list("./dictionaries/palavras_randomized.txt")

for i in range(words_to_check):
    word_to_check = pt_br.pop()
    start_time = time()
    word_to_check in rt_pt_br
    final_time += time() - start_time

print("Number of checked words: {} | Average Check Time (in seconds): {}".format(words_to_check, final_time/words_to_check))

Number of checked words: 1000 | Average Check Time (in seconds): 5.086326599121094e-05
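
Single lookups finish in tens of microseconds, a scale at which the overhead of calling the clock around every lookup may be noticeable. One alternative, sketched below, is to time a whole batch many times with the standard `timeit` module and divide afterwards. The `set` used here is only a stand-in for `ct_pt_br` or `rt_pt_br`, and `check_all` is a hypothetical helper.

```python
from timeit import timeit

words = ["abacate", "mamão", "maniçoba", "queijo"]
container = set(words)  # stand-in for ct_pt_br or rt_pt_br


def check_all():
    # Run every membership test once; the results are deliberately discarded.
    for word in words:
        word in container


runs = 10000
total = timeit(check_all, number=runs)
print("Average Check Time (in seconds): {}".format(total / (runs * len(words))))
```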


### Removing random words from the dictionary

The following code measures how long it takes to remove a batch of words from each tree.

In [23]:
words_to_remove = 1000

#### `CharacterTree`

In [24]:
final_time = 0
pt_br = dict_to_list("./dictionaries/palavras_randomized.txt")

for i in range(words_to_remove):
    word_to_remove = pt_br.pop()
    start_time = time()
    try:
        ct_pt_br.remove(word_to_remove)
    except Exception:  # ignore removal errors so the loop keeps timing
        pass
    final_time += time() - start_time

print("Number of removed words: {} | Average Removal Time (in seconds): {}".format(words_to_remove, final_time/words_to_remove))

Number of removed words: 1000 | Average Removal Time (in seconds): 6.0839176177978515e-05


#### `RadixTree`

In [25]:
final_time = 0
pt_br = dict_to_list("./dictionaries/palavras_randomized.txt")

for i in range(words_to_remove):
    word_to_remove = pt_br.pop()
    start_time = time()
    try:
        rt_pt_br.remove(word_to_remove)
    except Exception:  # ignore removal errors so the loop keeps timing
        pass
    final_time += time() - start_time

print("Number of removed words: {} | Average Removal Time (in seconds): {}".format(words_to_remove, final_time/words_to_remove))

Number of removed words: 1000 | Average Removal Time (in seconds): 5.984091758728028e-05
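
As a rough picture of what removing a word from a tree of characters can involve, the sketch below deletes a word from a trie built out of nested dictionaries and prunes the nodes that no longer lead to any word. It is an illustration under assumptions: the `insert` and `remove` functions, the `END` marker, and the choice to raise `KeyError` for a missing word are all hypothetical and may not match how `spell_checker` behaves.

```python
END = "$end"  # hypothetical end-of-word marker; longer than one character, so it never clashes


def insert(trie, word):
    node = trie
    for char in word:
        node = node.setdefault(char, {})
    node[END] = True


def remove(trie, word):
    """Remove `word` and prune nodes left empty. Raises KeyError if the word is absent."""
    path = [trie]
    for char in word:
        path.append(path[-1][char])   # KeyError if the word is not in the trie
    del path[-1][END]                 # KeyError if `word` is only a prefix
    # Walk back towards the root, dropping every node the removal left empty.
    for char, parent, child in zip(reversed(word), reversed(path[:-1]), reversed(path[1:])):
        if child:
            break
        del parent[char]


trie = {}
for word in ("aba", "abacate"):
    insert(trie, word)
remove(trie, "abacate")
print(trie)  # {'a': {'b': {'a': {'$end': True}}}}
```

Pruning stops at the first node that still has content, which is why "aba" survives the removal of "abacate".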
