# Dictionary Generation
In this notebook we build a custom dictionary for [PySpellChecker](https://pyspellchecker.readthedocs.io/en/latest/index.html) that we will use when preprocessing PDF files. The steps are:
1. Downloading the frequency list of the Real Academia Española's reference corpus, the [Corpus de Referencia del Español Actual (CREA)](http://corpus.rae.es/lfrecuencias.html).
2. Parsing the RAE frequency list into a JSON file that PySpellChecker can read.
3. Building a second dictionary from the documents we downloaded as HTML.
4. Merging both dictionaries into a PySpellChecker custom dictionary.

In [16]:
from spellchecker import SpellChecker
from collections import Counter 
import re
import json
import requests
import os

## 1. Downloading the CREA frequency list

We begin by downloading the CREA list and saving it to a temporary directory. The list comes in a zip file which we will have to extract later.

In [17]:
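# Create the temporary working directory if it does not exist yet.
os.makedirs('temp', exist_ok=True)

r = requests.get('http://corpus.rae.es/frec/CREA_total.zip')
r.raise_for_status()  # stop here rather than save an error page as a zip
with open('temp/CREA_Dictionary.zip', 'wb') as fp:
    fp.write(r.content)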
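
The download cell above fetches the archive every time the notebook runs. A minimal sketch of a cached download follows. It swaps in the standard-library `urllib.request` for `requests`, and the helper name `download_once` is hypothetical, not part of the notebook.

```python
import os
import shutil
import urllib.request


def download_once(url, dest_path):
    """Download url to dest_path unless the file already exists.

    Returns True if a download happened, False if the cached file was reused.
    """
    if os.path.exists(dest_path):
        return False
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses, so the
    # destination file is only created after the request has succeeded.
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as fp:
        shutil.copyfileobj(response, fp)
    return True


# Usage with the URL from the cell above (needs network access):
# download_once('http://corpus.rae.es/frec/CREA_total.zip', 'temp/CREA_Dictionary.zip')
```

Streaming with `shutil.copyfileobj` avoids holding the whole archive in memory, which is a design choice of this sketch rather than a requirement.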

Now we extract the zip file to obtain a .txt file.

In [18]:
import zipfile
with zipfile.ZipFile('temp/CREA_Dictionary.zip', 'r') as zip_ref:
    zip_ref.extractall('temp/')
    
os.listdir('temp/')

['CREA_Dictionary.zip',
 'dictRae.json',
 'philosophyDict.json',
 'CREA_total.TXT']
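
`extractall` unpacks every member of the archive. If only the frequency list is needed, the member names can be inspected first. A minimal sketch, assuming the list is the only `.txt` member (matched case-insensitively, since the extracted file is named `CREA_total.TXT`); `extract_text_members` is a hypothetical helper.

```python
import os
import zipfile


def extract_text_members(zip_path, out_dir):
    """Extract only the .txt members of an archive and return their names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # namelist() returns the member names without extracting anything.
        names = [n for n in zip_ref.namelist() if n.lower().endswith(".txt")]
        zip_ref.extractall(out_dir, members=names)
    return names


# Usage with the paths from the cells above:
# extract_text_members('temp/CREA_Dictionary.zip', 'temp/')
```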

## 2. Parsing the CREA list into a JSON file

We now have `CREA_total.TXT`, which contains frequencies for Spanish words. We will parse this list into a Python dictionary that maps each word to its frequency count. The file is tab-separated, so we read it with Python's `csv` module using a tab delimiter.

In [19]:
import csv
raeCounts = {}
with open('temp/CREA_total.TXT', encoding = "ISO-8859-1") as fp:
    raw = csv.reader(fp, delimiter = '\t')
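    next(raw)  # skip the header row
    for row in raw:
        # Column 1 holds the word; strip padding and any other non-word characters.
        word = re.sub(r'\W+', '', row[1])
        # Column 2 holds the absolute frequency, written with thousands separators.
        raeCounts[word] = int(row[2].replace(',', ''))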

In [20]:
len(raeCounts)

736375
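
To make the parsing step easier to follow, here is a small self-contained sketch that runs the same logic on an in-memory sample. The sample rows and counts are made up, and the column layout (rank, word, absolute count, normalized frequency) is only inferred from the indices used in the parsing cell above; `parse_frequency_rows` is a hypothetical helper.

```python
import csv
import io
import re

# Made-up rows mimicking the layout the parsing cell assumes:
# rank <TAB> word <TAB> absolute count with thousands commas <TAB> normalized frequency
sample = (
    "Orden\tPalabra\tFrec.absoluta\tFrec.normalizada\n"
    "1.\tde \t9,999\t65.5\n"
    "2.\tla \t6,277\t41.1\n"
    "3.\tfilosofía \t12\t0.08\n"
)


def parse_frequency_rows(lines):
    """Return a {word: count} dict from tab-separated frequency rows."""
    counts = {}
    reader = csv.reader(lines, delimiter="\t")
    next(reader)  # skip the header row
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed lines
        word = re.sub(r"\W+", "", row[1])  # drop padding and punctuation
        if word:
            counts[word] = int(row[2].replace(",", ""))
    return counts


sample_counts = parse_frequency_rows(io.StringIO(sample))
print(sample_counts)
```

Unlike the notebook cell, this sketch skips malformed rows and empty words; whether the real file contains any is not something the notebook states.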

We save this frequency dictionary as a JSON file so that we can reuse it later.

In [21]:
with open('temp/dictRae.json', 'w') as fp:
    json.dump(raeCounts, fp)
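
By default `json.dump` escapes non-ASCII characters as `\uXXXX` sequences, which is valid JSON but hard to read for Spanish words. A minimal sketch of a round trip that keeps accented characters readable, using made-up counts and a throwaway temporary path:

```python
import json
import os
import tempfile

counts = {"filosofía": 12, "acción": 7}  # made-up counts for illustration

path = os.path.join(tempfile.mkdtemp(), "dict_example.json")

# ensure_ascii=False writes accented characters as-is, so the file needs
# an explicit encoding on both the write and the read side.
with open(path, "w", encoding="utf-8") as fp:
    json.dump(counts, fp, ensure_ascii=False)

with open(path, encoding="utf-8") as fp:
    reloaded = json.load(fp)

print(reloaded == counts)
```

Either form loads back to the same dictionary; the choice only affects how the file looks on disk.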

Lastly, we load this JSON file into an empty `SpellChecker` object (created with `language = None`), which we will later export as our custom dictionary.

In [22]:
customDictionary = SpellChecker(language = None)
customDictionary.word_frequency.load_dictionary('temp/dictRae.json')

## 3. Creating a dictionary from HTML documents
Now we load the documents that we obtained by parsing HTML files. They are stored in the `data/parsedHTML` directory.

In [23]:
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, loadCorpusDict, saveCorpus

corpusPath = '../data/parsedHTML'

corpusList = loadCorpusList(corpusPath)

Now we count the words in these files, keeping only those that our dictionary does not recognize yet. We do not add them to the dictionary right away, because we only want the most frequent ones. Restricting ourselves to frequent words keeps the additions relevant and filters out some parsing artifacts.

In [24]:
philWordCount = Counter()
for doc in corpusList:
    docWords = re.findall(r'\w+', doc.cleanText)
    for word in docWords:
        if word not in customDictionary:
            philWordCount[word] += 1
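
The counting step can be tried without the corpus. The sketch below uses a plain Python set as a stand-in for the spell checker's vocabulary and two made-up sentences; the lowercase membership test and the `min_count` threshold are assumptions of the sketch, not values taken from the notebook.

```python
import re
from collections import Counter

known_words = {"la", "de", "es", "una", "virtud"}  # stand-in vocabulary
texts = [
    "La virtud de Korsgaard es una virtud.",
    "Korsgaard y Brandom",
]

unknown_counts = Counter()
for text in texts:
    for token in re.findall(r"\w+", text):
        # Compare in lowercase so that 'La' matches 'la'.
        if token.lower() not in known_words:
            unknown_counts[token] += 1

# Keep only words seen at least min_count times to drop one-off artifacts.
min_count = 2
frequent_unknown = {w: c for w, c in unknown_counts.items() if c >= min_count}
print(frequent_unknown)
```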

In [26]:
philWordCount.most_common(100)

[('Aristotle', 498),
 ('responsibility', 203),
 ('καὶ', 146),
 ('Korsgaard', 140),
 ('Duica', 133),
 ('rhetoric', 132),
 ('Phaedo', 121),
 ('possibility', 116),
 ('Eriúgena', 112),
 ('Brandom', 111),
 ('philosopher', 106),
 ('Rancière', 104),
 ('Zahavi', 104),
 ('Socratic', 97),
 ('Hrsg', 94),
 ('τὸ', 94),
 ('KRV', 90),
 ('Callicles', 90),
 ('necessity', 89),
 ('Pereboom', 85),
 ('Badiou', 83),
 ('autoadscriptivo', 81),
 ('Metaphysics', 78),
 ('latter', 77),
 ('Responsibility', 76),
 ('KrV', 76),
 ('princípios', 75),
 ('Platonic', 74),
 ('metaphysics', 73),
 ('McTaggart', 73),
 ('therefore', 72),
 ('Gesinnung', 72),
 ('justiça', 70),
 ('nihilism', 69),
 ('precisely', 69),
 ('ação', 69),
 ('requires', 68),
 ('suggest', 67),
 ('Phenomenology', 67),
 ('beings', 66),
 ('instance', 65),
 ('punishment', 65),
 ('fatalism', 65),
 ('appears', 64),
 ('Trads', 63),
 ('implies', 63),
 ('Kritik', 63),
 ('ideasyvalores', 63),
 ('Žižek', 63),
 ('Brucker', 63),
 ('intuitions', 62),
 ('consider', 62),


Notice that many of these words are English. We can filter them out using PySpellChecker's built-in English dictionary, and we do the same for Portuguese and German.

In [38]:
englishDictionary = SpellChecker(language = 'en')
portugueseDictionary = SpellChecker(language = 'pt')
germanDictionary = SpellChecker(language = 'de')

In [56]:
philWords = []
for doc in corpusList:
    docWords = re.findall(r'\w+', doc.cleanText)
    # unknown() returns a set of lowercased words, so each word counts once per document.
    for word in customDictionary.unknown(docWords):
        if word in englishDictionary or word in portugueseDictionary or word in germanDictionary:
            continue
        philWords.append(word)
philWordCount = Counter(philWords)
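
Because the filtering cell works on the set of unknown words in each document, the resulting counts behave like document frequencies (in how many documents a word appears) rather than token frequencies. The sketch below reproduces that behavior with plain sets standing in for the four dictionaries; the vocabularies, the documents, and the helper `unknown_in` are all made up for illustration.

```python
import re
from collections import Counter

spanish = {"la", "de", "según"}             # stand-in for the custom dictionary
other_languages = [{"the", "of"}, {"und"}]  # stand-ins for the other dictionaries

docs = [
    "Según Korsgaard, the Kritik de Korsgaard",
    "Korsgaard und Brandom",
]


def unknown_in(doc, known, others):
    """Return the lowercased words of doc that no vocabulary recognizes."""
    tokens = {t.lower() for t in re.findall(r"\w+", doc)}
    return {
        t for t in tokens
        if t not in known and not any(t in vocab for vocab in others)
    }


doc_frequency = Counter()
for doc in docs:
    # Updating with a set adds 1 per word per document.
    doc_frequency.update(unknown_in(doc, spanish, other_languages))

print(doc_frequency.most_common())
```

Here `korsgaard` gets a count of 2 even though it occurs three times in total, since it appears in two documents.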

In [57]:
philWordCount.most_common(100)

[('ideasyvalores', 63),
 ('trads', 57),
 ('hrsg', 40),
 ('krv', 29),
 ('metaphysik', 25),
 ('caimi', 18),
 ('brandom', 16),
 ('synthese', 14),
 ('transzendentale', 14),
 ('jadiaz', 14),
 ('καὶ', 14),
 ('badiou', 13),
 ('phänomenologische', 13),
 ('olms', 13),
 ('diánoia', 12),
 ('kpv', 12),
 ('korsgaard', 12),
 ('pereboom', 11),
 ('τὸ', 11),
 ('urteilskraft', 11),
 ('preussischen', 10),
 ('grundlegung', 10),
 ('rowman', 10),
 ('τὰ', 10),
 ('noûs', 10),
 ('γὰρ', 10),
 ('ἡ', 10),
 ('žižek', 10),
 ('sorabji', 9),
 ('hrsgg', 9),
 ('dianoia', 9),
 ('metaphysica', 9),
 ('τῶν', 9),
 ('τοῦ', 9),
 ('coherentismo', 9),
 ('representacionalista', 9),
 ('rancière', 9),
 ('περὶ', 9),
 ('zahavi', 9),
 ('compatibilism', 8),
 ('normativity', 8),
 ('τῆς', 8),
 ('rodopi', 8),
 ('τε', 8),
 ('ἢ', 8),
 ('δὲ', 8),
 ('frede', 8),
 ('οὐ', 8),
 ('κατὰ', 8),
 ('ἐν', 8),
 ('ashgate', 8),
 ('subpersonal', 8),
 ('duica', 8),
 ('bänden', 8),
 ('théologie', 8),
 ('tación', 8),
 ('posmetafísico', 7),
 ('libertaristas'

That looks slightly better. Some Greek and Portuguese words remain, but most entries are now proper names, which are the words we are interested in correcting. We keep the 100 most frequent of these words as the philosophy dictionary, save it to a file, load its frequencies into our general dictionary, and then export the merged dictionary.

In [59]:
# Keep the 100 most frequent unknown words as a plain {word: count} dict.
philDict = dict(philWordCount.most_common(100))

In [51]:
with open('temp/philosophyDict.json', 'w') as fp:
    json.dump(philDict, fp)

In [54]:
customDictionary.word_frequency.load_dictionary('temp/philosophyDict.json')

In [53]:
customDictionary.export('../notebooks/wordlists/customDictionary.gz', gzipped = True)
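
The merge and export can be mimicked with the standard library alone. The sketch below assumes that loading a second dictionary adds its counts to the existing ones, as `collections.Counter.update` does, and that the exported file is gzip-compressed JSON mapping words to counts. Both are assumptions about PySpellChecker's behavior, not something the notebook confirms, and all counts are made up.

```python
import gzip
import json
import os
import tempfile
from collections import Counter

general = Counter({"de": 9999, "virtud": 40})  # made-up general frequencies
domain = {"korsgaard": 12, "virtud": 3}         # made-up domain frequencies

# Counter.update with a mapping adds counts for shared words
# and inserts words that were missing.
general.update(domain)

path = os.path.join(tempfile.mkdtemp(), "merged.gz")
with gzip.open(path, "wt", encoding="utf-8") as fp:
    json.dump(general, fp, sort_keys=True)

# Read the file back to check that nothing was lost.
with gzip.open(path, "rt", encoding="utf-8") as fp:
    restored = json.load(fp)

print(restored)
```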