# LDA Preprocessing 2
## Stopword and punctuation removal, lemmatization

This notebook uses the lists of stopwords and protected words discussed in the previous notebook to clean the documents and arrive at a bag-of-words representation of each one. In this representation, we keep only the lemmas of the words that are not stopwords.

After running this notebook, every JSON document will contain a new entry `bagOfWords` holding the clean representation, ready for `scikit-learn`'s LDA. We will save these entries using the utility function defined in `utils.loaders`.

## Loading the corpus 

We start this notebook by loading the corpus just as in the first one:

In [1]:
import json
import re
import os
import sys 

# Jupyter Notebooks are not good at handling relative imports.
# Best solution (not great practice) is to add the project's path
# to the module loading paths of sys.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, saveCorpus

corpusPath = '../data/corpus'

corpusList = loadCorpusList(corpusPath)

In [2]:
corpusList = [doc for doc in corpusList if doc.lang == "es"]

In [3]:
len(corpusList)

1495

## Loading the wordlists

We will need the stopwords we defined, the protected words, and the dictionary of our manual lemmas:

In [4]:
with open("wordlists/stopwords.txt") as fp:
    stopwords = fp.read()

with open("wordlists/protectedWords.txt") as fp:
    protectedWords = fp.read()
    
with open("wordlists/manualLemmas.txt") as fp:
    manualLemmas = json.load(fp)

In [5]:
stopwords = set(stopwords.split("\n"))
protectedWords = set(protectedWords.split("\n"))
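
Splitting on `"\n"` leaves an empty string in the set whenever a file ends with a newline or contains blank lines. That is harmless for the checks used later in this notebook, but a slightly more defensive parse is easy to write. The following is a minimal sketch; the helper name `parseWordlist` and the sample text are hypothetical and not part of the project:

```python
def parseWordlist(rawText):
    """Turn the raw contents of a wordlist file into a set of entries."""
    # splitlines() handles both \n and \r\n; strip() drops stray spaces
    entries = (line.strip() for line in rawText.splitlines())
    # Keep only non-empty entries
    return {entry for entry in entries if entry}

# Hypothetical file contents with a blank line and a trailing newline
sampleWords = parseWordlist("de\nla\n\nsin embargo\n")
```

Multi-word entries such as `"sin embargo"` are kept whole, which is what the compound-stopword step below relies on.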

## Importing spaCy and its Spanish model

Since we will be using spaCy, we need to install its `es_core_news_md` Spanish language model.

In [6]:
# https://spacy.io/usage/models:
# Run the next line to install the NLP model from spaCy.
# !python -m spacy download es_core_news_md 

In [7]:
import spacy
import es_core_news_md
nlp = es_core_news_md.load()

## Cleaning the documents

Let's write a generic function that cleans the documents in the `corpusList`.

In [8]:
import string

In [9]:
def cleanArticle(article):
    """
    Takes an Article object (built with the Article class; see utils) and processes
    its cleanText attribute, which holds the text cleaned so far, in these steps:
    
        1. Punctuation removal and lowercasing.
        2. Compound stopword removal: multi-word stopwords are deleted from the text.
        3. Short word removal: tokens of 2 characters or fewer are ignored.
        4. Stopword removal: single-word stopwords are also ignored.
        5. Lemmatization: each remaining word becomes its lemma, except protected
           words, which are kept as-is or mapped through manualLemmas.
        
    The resulting bag of words is stored as a space-separated string in the
    bagOfWords attribute of the Article object.
    
    Input: Article (object)
    Return: the same Article, with bagOfWords set (returned so that Pool.map
            can collect the processed copies)
    """
    
    cleanText = article.cleanText
    table = str.maketrans('', '', string.punctuation + "¡¿")
    cleanText = cleanText.translate(table).lower()
    
    # Cleaning compound stopwords
    for stopword in stopwords:
        if len(stopword.split(" ")) > 1:
            cleanText = cleanText.replace(stopword, "")
    
    # Getting the bag of words representation
    bagOfWords = []
    for token in nlp(cleanText):
        # Ignore short words (2 characters or fewer) and stopwords.
        # NOTE: "Yo" might be an important word. Which other 2-letter words are important?
        # NOTE 2: Eliminating 2-letter words also helps distinguish "es" from "ser".
        if len(token.text) <= 2 or token.text in stopwords:
            continue

        # Protect some words
        if token.text in protectedWords:
            # If the word is in the manual lemmas, we replace.
            # Otherwise, we just add the word.
            if token.text in manualLemmas:
                bagOfWords.append(manualLemmas[token.text])
            else:
                bagOfWords.append(token.text)

        # For the rest, store the lemmas
        else:
            bagOfWords.append(token.lemma_)
    
    # Add the attribute to the article.
    # spaCy emits runs of whitespace as tokens, so drop empty
    # and whitespace-only entries before joining.
    bagOfWords = [w for w in bagOfWords if w.strip()]
    bagOfWords = " ".join(bagOfWords)
    article.bagOfWords = bagOfWords
    
    return article
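
The compound-stopword step in `cleanArticle` uses plain substring replacement, which can also match inside longer words. The following is a minimal sketch of a stricter variant, assuming whole-word matching is the intended behavior. The helper names `stripPunctuation` and `removeCompoundStopwords` and the sample phrases are hypothetical:

```python
import re
import string

def stripPunctuation(text):
    """Remove ASCII punctuation plus the Spanish opening marks, then lowercase."""
    table = str.maketrans('', '', string.punctuation + "¡¿")
    return text.translate(table).lower()

def removeCompoundStopwords(text, stopwords):
    """Remove multi-word stopwords only where they match whole words."""
    compounds = [s for s in stopwords if len(s.split()) > 1]
    # Longest phrases first, so overlapping phrases are handled predictably
    for phrase in sorted(compounds, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, " ", text)
    # Collapse the extra whitespace left behind by the removals
    return " ".join(text.split())

sample = stripPunctuation("¡Sin embargo, el plan sigue!")
result = removeCompoundStopwords(sample, {"sin embargo", "el"})
```

Single-word stopwords such as `"el"` are deliberately left alone here, since `cleanArticle` filters those token by token after spaCy has split the text.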

## Running the process

We run the cleaning process in parallel using Python's `multiprocessing` library. The cell below uses 6 worker processes; this can be changed according to the number of available cores.

TODO: The implementation with `Pool` does not share the global `corpusList` between processes, so edits made to that list inside the workers are not kept. The hotfix is to have `cleanArticle()` return the article object, which we collect in a new list and later save to file. However, this unnecessarily duplicates `corpusList` in memory. This is not serious, but a correct implementation would use `Manager()` instead of `Pool`. With `Manager()` we can use shared resources and do not need to build a new list for the processed results.
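
The reason the in-place edits are lost is that `Pool.map` pickles each argument and sends the copy to a worker process, so the worker never touches the object held by the parent. The sketch below mimics that round trip in a single process, using `copy.deepcopy` as a stand-in for the pickling step. The `Article` class and the toy `cleanArticle` here are hypothetical simplifications, not the project's own code:

```python
import copy

class Article:
    """Hypothetical stand-in for the project's Article class."""
    def __init__(self, cleanText):
        self.cleanText = cleanText
        self.bagOfWords = None

def cleanArticle(article):
    # Toy cleaning step: lowercase and keep words longer than 2 characters
    words = article.cleanText.lower().split()
    article.bagOfWords = " ".join(w for w in words if len(w) > 2)
    return article

original = Article("El Gato come pescado")

# Stand-in for what happens when Pool.map ships the argument to a worker:
# the worker receives a copy, not the parent's object.
workerCopy = copy.deepcopy(original)
processed = cleanArticle(workerCopy)

# The parent's object is untouched; only the returned copy carries the result.
print(original.bagOfWords)
print(processed.bagOfWords)
```

This is why the notebook collects the return values of `pool.map` in `processedArticles` and saves that list rather than `corpusList`.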

In [10]:
from multiprocessing import Pool
import time

start = time.time()
with Pool(6) as pool:
    processedArticles = pool.map(cleanArticle, corpusList)
end = time.time()

print(f"Time elapsed: {end - start:.2f}s")

Time elapsed: 740.46s


## Saving the new corpus

Finally, we save the processed articles using our `saveCorpus` utility function.

In [11]:
saveCorpus('../data/corpus', processedArticles)