## LDA Preprocessing 2
# Stopword and punctuation removal, lemmatization

This notebook uses the lists of stopwords and protected words discussed in the previous notebook to clean the documents and arrive at a bag-of-words representation of each one. In this representation, we keep only the lemmas of the words that are not stopwords.

After running this notebook, every JSON document will contain a new entry `bagOfWords` holding the clean representation, ready for `scikit-learn`'s LDA. We will save these entries using our utility function defined in `util.loaders`.
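
To make the target format concrete, here is a minimal sketch of what a bag-of-words string encodes. The sample text is hypothetical, and `Counter` only stands in for the term counting that a count-based vectorizer would perform later; it is not part of this notebook's pipeline.

```python
from collections import Counter

# Hypothetical cleaned document, in the space-separated format this notebook produces.
bag_of_words = "gobierno anunciar reforma gobierno aprobar reforma fiscal"

# Splitting on whitespace recovers the tokens; counting them gives the
# per-document term frequencies. Word order carries no information.
term_counts = Counter(bag_of_words.split())

print(term_counts)
```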

## Loading the corpus 

We start this notebook by loading the corpus just as in the first one:

In [1]:
from utils.corpus import Corpus

corpus = Corpus(registry_path='utils/article_registry.json')
corpus_list = corpus.get_documents_list()

Loading corpus. Num. of articles: 877


## Loading the wordlists

We will need the stopwords we defined, the protected words, and the dictionary of our manual lemmas:

In [2]:
import json

with open("wordlists/stopwords.txt") as fp:
    stopwords = fp.read()

with open("wordlists/protectedWords.txt") as fp:
    protected_words = fp.read()
    
with open("wordlists/manualLemmas.json") as fp:
    manual_lemmas = json.load(fp)

In [3]:
stopwords = set(stopwords.split("\n"))
protected_words = set(protected_words.split("\n"))
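
One detail worth knowing: splitting on `"\n"` leaves an empty string in the set whenever the file ends with a newline or contains a blank line. That is likely harmless here, since tokens are never empty, but a slightly more defensive variant could look like the sketch below. `parse_wordlist` is a hypothetical helper, not something defined in `utils`.

```python
def parse_wordlist(raw_text):
    """Turn the raw contents of a one-entry-per-line file into a set of entries."""
    # splitlines() handles both \n and \r\n; strip() removes stray spaces.
    entries = (line.strip() for line in raw_text.splitlines())
    # Drop blank lines so the empty string never ends up in the set.
    return {entry for entry in entries if entry}


# Hypothetical file contents with a blank line and a trailing newline.
raw = "de\nla\n\na pesar de\n"
print(sorted(parse_wordlist(raw)))
```

With this helper, the two assignments above would become `stopwords = parse_wordlist(stopwords)` and `protected_words = parse_wordlist(protected_words)`.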

## Importing spaCy and its Spanish model

Since we will be using spaCy, we need to install its `es_core_news_md` Spanish language model.

In [4]:
# https://spacy.io/usage/models:
# Run the next line to install the NLP model from spaCy.
# !python -m spacy download es_core_news_md 

In [5]:
import spacy
nlp = spacy.load("es_core_news_md")

## Cleaning the documents

Let's write a generic function that cleans the documents in the `corpus_list`.

In [14]:
import re

def clean_article(article):
    """
    This function takes in an Article object (as constructed with the Article class (see utils)).
    It processes the clean_text attribute of this object, which holds the text processed so far,
    and implements the following steps:
    
        1. Short word removal: words that are 2 letters long or shorter are ignored.
        2. Stopword removal: stopwords are also ignored.
        3. Lemmatization: transforms each word into its corresponding lemma.
        
    It then saves the resulting bag of words into a bag_of_words attribute inside the Article object.
    
    Input: Article (object)
    Return: Article (the same object, with the bag_of_words attribute set)
    """
    
    clean_text = ' '.join(re.findall(r"\w+", article.clean_text)).lower()
    
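    # Cleaning compound (multi-word) stopwords. The word boundaries keep a
    # phrase from matching inside longer words (e.g. "a la" inside "para la").
    for stopword in stopwords:
        if len(stopword.split(" ")) > 1:
            clean_text = re.sub(r"\b" + re.escape(stopword) + r"\b", " ", clean_text)
    
    clean_text = ' '.join(re.findall(r"\w+", clean_text))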

    # Getting the bag of words representation
    bag_of_words = []
    for token in nlp(clean_text):        
        # Ignore short words (2 letters or fewer) and stopwords.
        # NOTE: "Yo" might be an important word. Which other 2-letter words are important?
        # NOTE 2: Eliminating 2-letter words also helps distinguish "es" from "ser".
        if (len(token.text) <= 2) or (token.text in stopwords):
            continue

        # Protect some words
        if token.text in protected_words:            
            bag_of_words.append(token.text)

        # If the word has an entry in the manual lemmas, use that lemma.
        elif token.text in manual_lemmas:
            bag_of_words.append(manual_lemmas[token.text])

        # For the rest, store the lemma assigned by spaCy
        else:
            bag_of_words.append(token.lemma_)
    
    # Drop empty strings and join the words into a single string.
    bag_of_words = [w for w in bag_of_words if w != ""]
    bag_of_words = " ".join(bag_of_words)
    
    
    # For some strange reason, there are weird blank characters in the bag of words,
    # some of which are not regular spaces nor tabs nor line breaks. We implement an extra
    # step that only includes word characters using regex before a final join.
    bag_of_words = re.findall(r'\w+', bag_of_words)
    bag_of_words = [w for w in bag_of_words if len(w) > 2]
    bag_of_words = " ".join(bag_of_words)

    article.bag_of_words = bag_of_words
    
    return article
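
The decision order applied to each token (skip, protect, manual lemma, automatic lemma) can be tried out without loading the spaCy model. The sketch below is only an illustration under stated assumptions: the word lists are hypothetical, `toy_lemma` is a crude placeholder for `token.lemma_`, and multi-word stopwords are removed as whole phrases using `\b` word boundaries.

```python
import re

# Hypothetical stand-ins for the notebook's word lists.
stopwords = {"los", "las", "para", "sin embargo"}
protected_words = {"datos"}
manual_lemmas = {"gobiernos": "gobierno"}


def toy_lemma(word):
    # Placeholder for token.lemma_: a real lemmatizer is far more involved.
    return word[:-1] if word.endswith("s") else word


def toy_bag_of_words(text):
    # Keep only word characters and lowercase everything.
    text = " ".join(re.findall(r"\w+", text)).lower()
    # Remove multi-word stopwords as whole phrases.
    for stopword in stopwords:
        if " " in stopword:
            text = re.sub(r"\b" + re.escape(stopword) + r"\b", " ", text)
    bag = []
    for word in text.split():
        if len(word) <= 2 or word in stopwords:
            continue  # short words and stopwords are dropped
        if word in protected_words:
            bag.append(word)  # protected words are kept verbatim
        elif word in manual_lemmas:
            bag.append(manual_lemmas[word])  # manual lemma wins over the automatic one
        else:
            bag.append(toy_lemma(word))
    return " ".join(bag)


print(toy_bag_of_words("Sin embargo, los gobiernos publican datos para las escuelas."))
# → gobierno publican datos escuela
```

Note how "datos" survives unchanged because it is protected, whereas the placeholder lemmatizer would otherwise have reduced it to "dato".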

## Running the process

We run the cleaning process in parallel using Python's `multiprocessing` library. By default we use 5 worker processes (`Pool` spawns processes, not threads), but this can be changed according to the number of CPU cores available.

In [15]:
%%time

from multiprocessing import Pool


with Pool(5) as pool:
    processed_articles = pool.map(clean_article, corpus_list)
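
If the notebook may run on machines with fewer cores, the pool size can be derived from what the operating system reports instead of being hard-coded. The following is a minimal sketch; `pick_worker_count` and `map_in_parallel` are hypothetical helpers, not part of `utils`.

```python
import os
from multiprocessing import Pool


def pick_worker_count(requested=5):
    """Cap the requested pool size at the number of CPUs the OS reports."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(requested, available))


def map_in_parallel(func, items, requested_workers=5):
    """Apply func to every item using a process pool of a capped size."""
    # func and the items must be picklable, since they cross process boundaries.
    with Pool(pick_worker_count(requested_workers)) as pool:
        return pool.map(func, items)
```

Under these assumptions, the cell above would reduce to `processed_articles = map_in_parallel(clean_article, corpus_list)`. Because each worker operates on its own copy of an article, the function has to return the modified object for the results to reach the parent process.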

## Saving the new corpus

Finally, we save the processed articles using the `save_documents()` method in the `Corpus` class.

In [None]:
corpus.documents = processed_articles
corpus.save_documents()