## LDA Preprocessing 2
# Stopword and punctuation removal, lemmatization

This notebook uses the lists of stopwords and protected words discussed in the previous notebook to clean the documents and arrive at a bag-of-words representation of each one. In this representation, we keep only the lemmas of the words that aren't stopwords (protected words are kept as they are, or mapped through a manual lemma when one is defined).

After running this notebook, every JSON document will contain a new entry, `bagOfWords`, holding the clean representation, ready for `scikit-learn`'s LDA.
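
To make the target format concrete, here is a minimal sketch of what a `bagOfWords` entry is assumed to look like: a single space-separated string of lemmas that can be turned back into term counts. The sample string and the helper name `term_counts` are made up for illustration and do not come from the corpus.

```python
from collections import Counter

def term_counts(bag_of_words):
    """Count how often each term appears in a space-separated bag of words."""
    return Counter(bag_of_words.split())

# Hypothetical cleaned document: lowercase lemmas, no stopwords or punctuation.
sample_bag = "gobierno anunciar reforma gobierno aprobar reforma"
counts = term_counts(sample_bag)
print(counts.most_common(2))
```

A vectorizer such as `scikit-learn`'s `CountVectorizer` performs the same kind of counting over the whole corpus to build the document-term matrix that LDA consumes.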

## Loading the corpus 

We start this notebook by loading the corpus just as in the first one:

In [1]:
import json
import re
import os
import sys 

# Jupyter Notebooks are not good at handling relative imports.
# Best solution (not great practice) is to add the project's path
# to the module loading paths of sys.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, saveCorpus

corpusPath = '../data/clean_json'

corpusList = loadCorpusList(corpusPath)

We keep only the documents written in Spanish, since the spaCy model used below is a Spanish one:

In [2]:
corpusList = [doc for doc in corpusList if doc.lang == "es"]

## Loading the wordlists

We will need the stopwords we defined, the protected words, and the dictionary of our manual lemmas:

In [3]:
with open("wordlists/stopwords.txt") as fp:
    stopwords = fp.read()

with open("wordlists/protectedWords.txt") as fp:
    protectedWords = fp.read()
    
with open("wordlists/manualLemmas.txt") as fp:
    manualLemmas = json.load(fp)

In [4]:
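# Drop blank lines and stray whitespace so that a trailing newline
# does not add an empty string to the sets.
stopwords = {w.strip() for w in stopwords.split("\n") if w.strip()}
protectedWords = {w.strip() for w in protectedWords.split("\n") if w.strip()}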
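
The cells above assume a particular file layout: one entry per line for the two text files, and a JSON object mapping words to lemmas for the manual lemmas. The sketch below mimics that layout with in-memory strings, dropping blank lines so that a trailing newline does not leave an empty string in the set. All sample entries and the helper `parse_wordlist` are invented for illustration; they are not taken from the actual wordlists.

```python
import json

# Invented stand-ins for the contents of the three wordlist files.
stopwords_raw = "de\nla\nsin embargo\n"
protected_raw = "eeuu\nfarc\n"
manual_lemmas_raw = '{"eeuu": "estados_unidos"}'

def parse_wordlist(raw):
    """Return the non-empty, stripped lines of a wordlist as a set."""
    return {line.strip() for line in raw.splitlines() if line.strip()}

stopwords_demo = parse_wordlist(stopwords_raw)
protected_demo = parse_wordlist(protected_raw)
manual_lemmas_demo = json.loads(manual_lemmas_raw)

# Multi-word entries are the "compound" stopwords, which need to be
# removed from the raw text before it is split into tokens.
compound_demo = {s for s in stopwords_demo if " " in s}
print(compound_demo)
```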

## Importing spaCy and its Spanish model

Since we will be using spaCy, we need to install its `es_core_news_md` Natural Language Processing model.

In [5]:
# https://spacy.io/usage/models:
!python -m spacy download es_core_news_md 

✔ Download and installation successful
You can now load the model via spacy.load('es_core_news_md')


In [6]:
import spacy
import es_core_news_md
nlp = es_core_news_md.load()

## Cleaning the documents

Let's write a generic function that cleans the documents in the `corpusList`.

In [7]:
import string

In [8]:
def clean_article(article):
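    """
    Build the bag-of-words representation of an article.

    Strips punctuation, lowercases the text, removes stopwords,
    keeps protected words (applying manual lemmas where defined),
    lemmatizes every other token, and stores the result as a
    space-separated string in `article.bagOfWords`.
    """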
    
    cleanText = article.cleanText
    table = str.maketrans('', '', string.punctuation + "¡¿")
    cleanText = cleanText.translate(table).lower()
    
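    # Cleaning compound (multi-word) stopwords. Matching on word
    # boundaries avoids deleting the phrase from inside longer words.
    for stopword in stopwords:
        if len(stopword.split(" ")) > 1:
            cleanText = re.sub(r"\b" + re.escape(stopword) + r"\b", " ", cleanText)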
    
    # Getting the bag of words representation
    bagOfWords = []
    for token in nlp(cleanText):
        # Ignore stopwords
        if token.text in stopwords:
            continue

        # Protect some words
        if token.text in protectedWords:
            # If the word is in the manual lemmas, we replace.
            # Otherwise, we just add the word.
            if token.text in manualLemmas:
                bagOfWords.append(manualLemmas[token.text])
            else:
                bagOfWords.append(token.text)

        # For the rest, store the lemmas
        else:
            bagOfWords.append(token.lemma_)
    
    # Add the attribute to the article, dropping empty and
    # whitespace-only entries left over by the tokenizer.
    bagOfWords = [w for w in bagOfWords if w.strip()]
    bagOfWords = " ".join(bagOfWords)
    article.bagOfWords = bagOfWords
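
To see the three branches of `clean_article` in action without downloading a model, the sketch below reimplements roughly the same logic on a plain string. It is an approximation under stated assumptions: whitespace splitting stands in for spaCy's tokenizer, the hypothetical `toy_lemmas` dictionary stands in for `token.lemma_`, and all the wordlists are invented.

```python
import re
import string

# Invented wordlists; the real ones are loaded from the wordlists/ folder.
toy_stopwords = {"el", "la", "los", "de", "sin embargo"}
toy_protected = {"eeuu", "datos"}
toy_manual_lemmas = {"eeuu": "estados_unidos"}
# Hypothetical lemma lookup standing in for spaCy's token.lemma_.
toy_lemmas = {"anunció": "anunciar", "reformas": "reforma"}

def toy_clean(text):
    """Mimic clean_article on a raw string and return the bag of words."""
    table = str.maketrans("", "", string.punctuation + "¡¿")
    text = text.translate(table).lower()

    # Remove multi-word stopwords before tokenizing, matching on word
    # boundaries so the phrase is not cut out of longer words.
    for stopword in toy_stopwords:
        if " " in stopword:
            text = re.sub(r"\b" + re.escape(stopword) + r"\b", " ", text)

    bag = []
    for word in text.split():  # whitespace split instead of spaCy tokens
        if word in toy_stopwords:
            continue
        if word in toy_protected:
            # Protected words bypass lemmatization, but may be renamed.
            bag.append(toy_manual_lemmas.get(word, word))
        else:
            bag.append(toy_lemmas.get(word, word))
    return " ".join(bag)

print(toy_clean("¡Sin embargo, EEUU anunció reformas de los datos!"))
```

Here "datos" is kept untouched because it is protected, "eeuu" is renamed through the manual lemmas, and the remaining words go through the lemma lookup.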

## Running the process

TODO: I could run this in parallel.

In [None]:
for article in corpusList:
    clean_article(article)
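
One way to act on the parallelization note is to compute the bags of words in a worker pool and assign them back in the parent, in order. The sketch below shows only that pattern: `Article` and `fake_clean` are hypothetical stand-ins for the corpus documents and for the real cleaning step, and a thread pool is used for simplicity. Threads are unlikely to speed up CPU-bound spaCy work by much; spaCy's own `nlp.pipe(texts, n_process=...)` is probably the better route for real gains.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Article:
    """Hypothetical stand-in for a corpus document."""
    cleanText: str
    bagOfWords: str = ""

def fake_clean(text):
    """Stand-in for the real cleaning step: lowercase and normalize spaces."""
    return " ".join(text.lower().split())

def clean_in_pool(articles, max_workers=4):
    """Clean all articles in a pool; pool.map preserves the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bags = list(pool.map(fake_clean, (a.cleanText for a in articles)))
    # Assign the results back in the parent, one per article.
    for article, bag in zip(articles, bags):
        article.bagOfWords = bag

demo = [Article("Hola  Mundo"), Article("Otra   NOTA")]
clean_in_pool(demo)
print([a.bagOfWords for a in demo])
```

Returning the cleaned string instead of mutating the article inside the worker keeps the pattern valid for process pools too, where attribute changes made in a child process would not reach the parent.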

## Saving the new corpus

In [None]:
saveCorpus('../data/clean_json', corpusList)