## LDA Preprocessing 1
# Artifact Removal and Stopword Selection
This notebook takes the clean JSON files for each article and does some preprocessing to obtain a text that we can analyze using LDA.

Specifically, we do:
* Artifact removal
* Stopword selection

In the next notebook we will do:
* Punctuation removal
* Lemmatization

We will also use a utility `Corpus` class defined in the `utils/` folder.

The `Corpus` class includes methods to get the documents, which we get as `Article` objects.

In [1]:
from utils.corpus import Corpus

## Loading the corpus

In [2]:
corpus = Corpus(registry_path='utils/article_registry.json')

We will only work with documents that are in Spanish, that have text, and that are indeed articles (and not reviews, for example). The `Corpus` class includes a method that applies these criteria by default.

In [3]:
corpus_list = corpus.get_documents_list()

Loading corpus. Num. of articles: 877


This leaves us with 877 articles, as reported in the output above.

Let's check for duplicates, which can occur if we downloaded an article twice, once as PDF and once as HTML. If the number of unique IDs equals the number of articles, there are none.

In [4]:
len(set([doc.id for doc in corpus_list]))

877
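
If the two counts ever differed, the repeated IDs could be listed with something like the following. This is a minimal sketch: `find_duplicate_ids` is a hypothetical helper, and the toy IDs stand in for the `doc.id` values.

```python
from collections import Counter


def find_duplicate_ids(ids):
    """Return the IDs that occur more than once, in sorted order."""
    counts = Counter(ids)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)


# Toy IDs standing in for `doc.id` (illustrative only)
example_ids = ["a1", "b2", "a1", "c3"]
print(find_duplicate_ids(example_ids))
```

With the real corpus this would be called as `find_duplicate_ids(doc.id for doc in corpus_list)`.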

## Artifact removal

There are some artifacts included in the text that are produced by HTML and PDF processing.

Let's start by replacing digits and newline characters (`\n`) with spaces. We will keep normal punctuation for now, as that might help spaCy when we do lemmatization.

In [5]:
import re

In [6]:
for doc in corpus_list:
    doc.clean_text = re.sub(r'\d|\n', ' ', doc.text)  # raw string avoids invalid-escape warnings

We can detect some of these artifacts by looking for non-alphanumeric characters between alphanumeric characters (e.g. `"ar-gument"`, `"ar\xadgument"`).

In [7]:
artifacts = re.compile(r'\w+[^a-zA-ZáéíóúÁÉÍÓÚñÑüÜ\d\s:]\w+')

In [8]:
[re.findall(artifacts, doc.text) for doc in corpus_list][0][:5]

[]
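
To see what the pattern catches, it can be run on a few made-up strings. The sample sentences below are invented for illustration, and `find_artifacts` is a hypothetical helper; the pattern itself is the one defined above, written as a raw string.

```python
import re

# Word characters, then one character that is not a Spanish letter,
# digit, whitespace or colon, then more word characters.
ARTIFACTS = re.compile(r'\w+[^a-zA-ZáéíóúÁÉÍÓÚñÑüÜ\d\s:]\w+')


def find_artifacts(text):
    """Return all substrings of `text` that match the artifact pattern."""
    return ARTIFACTS.findall(text)


samples = [
    "un ar-gument claro",       # stray hyphen inside a word
    "un ar\xadgument claro",    # soft hyphen inside a word
    "un argumento claro",       # clean text
]
for sample in samples:
    print(repr(sample), "->", find_artifacts(sample))
```

Legitimately hyphenated words match the pattern as well, so the hits need a manual look before anything is removed.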

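One common artifact is the soft hyphen (`\xad`), which marks where a word may be broken across lines. We can remove it easily. We also replace the non-breaking space (`\xa0`) with a regular space so that the words on either side of it are not joined.

In [9]:
for doc in corpus_list:
    doc.clean_text = doc.clean_text.replace('\xad', '')   # drop soft hyphens
    doc.clean_text = doc.clean_text.replace('\xa0', ' ')  # non-breaking space -> regular space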
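
The cleaning steps so far can be collected into a single function and tried on a small string. This is a minimal sketch: `clean_text` is a hypothetical helper, not part of the `Corpus` utilities, and it turns the non-breaking space into a regular space so that neighbouring words stay separate.

```python
import re


def clean_text(text):
    """Apply the artifact-removal steps to one document's text."""
    text = re.sub(r'\d|\n', ' ', text)   # digits and newlines become spaces
    text = text.replace('\xad', '')      # drop soft hyphens
    text = text.replace('\xa0', ' ')     # non-breaking space -> regular space
    return text


# Invented sample containing a soft hyphen, digits, a newline and a
# non-breaking space.
sample = "ar\xadgument 12\nfoo\xa0bar"
print(clean_text(sample).split())
```

Runs of spaces are left in place here; they do not matter for the word-level tokenization used later.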

We can save the corpus for now.

In [10]:
corpus.save_documents()

## Stopword Selection
Stopword removal is perhaps the most difficult part of preprocessing. There are two challenges to meet:
* Some stopword lists such as the one included in NLTK for Spanish are too weak and do not filter many stopwords.
* Other stopword lists are too inclusive and can eliminate words that are meaningful in philosophy (e.g. 'verdadero', true).

It is important to note that stopwords are very context-sensitive. A word in one context may provide little meaning (hence counting as a stopword) while in other contexts it may provide lots of information.

To tackle these challenges, we will first do an initial filtering with NLTK's list. This will leave many stopwords in the text, but will reduce the size of each text considerably. Then we will compare the text with a stronger list of stopwords (from the `stopwords-iso` project) and see which words appear both in the text and in the stronger list. We will inspect these words manually and extract a list of protected words, repeating the process a number of times. Once we have a robust list of protected words, we will concatenate NLTK's stopword list with the stronger one and remove the protected words from the result. This will provide a final (hopefully middle-ground) stopword list with which to continue.
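
The combination step at the end of this process amounts to a set operation. A minimal sketch with toy word lists follows; the function name `build_stopword_list` and the sample words are illustrative assumptions, not the notebook's data.

```python
def build_stopword_list(weak, strong, protected):
    """Union of the weak and strong lists, minus the protected words."""
    return sorted((set(weak) | set(strong)) - set(protected))


# Toy lists (illustrative only)
weak = ["de", "la"]
strong = ["la", "ser", "puede"]
protected = ["ser"]

print(build_stopword_list(weak, strong, protected))
```

Sorting the result is optional; it only makes the list easier to inspect and to compare between runs.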

In [11]:
from nltk.corpus import stopwords as nltk_stopwords

stopwords_weak = nltk_stopwords.words('spanish')

In [12]:
import requests

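r = requests.get('https://raw.githubusercontent.com/stopwords-iso/stopwords-es/master/stopwords-es.txt')
r.raise_for_status()  # fail early if the download did not succeed
# One entry per line; skip blank lines such as the one after a trailing newline
stopwords_strong = [line.strip() for line in r.text.splitlines() if line.strip()]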

In [13]:
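weak_set = set(stopwords_weak)  # set lookup is much faster than scanning a list
corpus_words = []
for doc in corpus_list:
    # Tokenize, drop single-character tokens, and lowercase
    doc_words = [word.lower() for word in re.findall(r'\w+', doc.clean_text) if len(word) > 1]
    corpus_words += [word for word in doc_words if word not in weak_set]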

print(f"Total words: {len(corpus_words):d}")
corpus_words[:10]

Total words: 3903285


['hacia',
 'ontologia',
 'lhalectice',
 'existencia',
 'luis',
 'nieto',
 'arteta',
 'tardío',
 'descubrimiento',
 'esfera']

In [14]:
from collections import Counter

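strong_set = set(stopwords_strong)  # set lookup is much faster than scanning a list
# Count the corpus words that also appear in the strong stopword list
realwords_and_stopwords = Counter(word for word in corpus_words if word in strong_set)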

In [15]:
realwords_and_stopwords.most_common(10)

[('ser', 26827),
 ('si', 21051),
 ('puede', 18024),
 ('sino', 14759),
 ('mismo', 13898),
 ('bien', 10789),
 ('decir', 10316),
 ('así', 9979),
 ('tal', 8979),
 ('modo', 8451)]
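
The tokenize, filter, and count steps above can be condensed into one function and tried on a toy input. This is a sketch only: `overlap_counts`, the sample sentences, and the tiny word lists are invented for illustration.

```python
import re
from collections import Counter


def overlap_counts(texts, weak, strong):
    """Count words that survive the weak list but appear in the strong list."""
    weak, strong = set(weak), set(strong)
    counts = Counter()
    for text in texts:
        # Same tokenization as above: word tokens longer than one character
        words = [w.lower() for w in re.findall(r'\w+', text) if len(w) > 1]
        counts.update(w for w in words if w not in weak and w in strong)
    return counts


texts = ["El ser es el bien", "Ser o no ser"]
weak = ["el", "es", "no"]
strong = ["ser", "bien", "el"]

print(overlap_counts(texts, weak, strong).most_common())
```

Words such as 'el' never reach the counter because the weak list has already removed them, which is why the overlap only shows what the stronger list would remove in addition.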

Already among the 10 most common words that appear both in the documents and in the strong stopword list, we find words that are quite meaningful in philosophy:
* 'ser': being
* 'bien': good
* 'modo': mode

We will start saving those words and eliminating them from the stronger list of stopwords. Then we will repeat the process of selecting the words that are in both lists and see which words are common. By iterating over this process a couple of times, we will get a list of protected words.

In [16]:
from IPython.display import clear_output


with open('wordlists/protectedWords.txt') as fp:
    protected_words = fp.read().split()

# Words already reviewed; start from the protected words saved in earlier runs.
checked_words = set(protected_words)

empty_iterations = 0
max_empty_iterations = 3  # stop after three batches with nothing to protect
while empty_iterations < max_empty_iterations:
    # Show the 20 most frequent words not yet reviewed.
    candidates = Counter({word: count for word, count in realwords_and_stopwords.items() if word not in checked_words})
    candidates = candidates.most_common(20)

    checked_words.update(word for word, count in candidates)

    print(candidates)

    answer = input("New protected words (comma separated) [end with 'None']: ").strip()
    if answer == 'None':
        empty_iterations += 1
    else:
        # Tolerate irregular spacing around the commas and skip empty entries.
        protected_words += [word.strip() for word in answer.split(',') if word.strip()]
    #clear_output()

[('si', 21051), ('puede', 18024), ('sino', 14759), ('así', 9979), ('tal', 8979), ('manera', 8377), ('dos', 8284), ('pues', 8215), ('solo', 7315), ('entonces', 6628), ('hace', 6411), ('vez', 6402), ('según', 6193), ('sólo', 6180), ('embargo', 5648), ('toda', 5612), ('cada', 5507), ('ejemplo', 5399), ('cosas', 5009), ('ahora', 4989)]
[('aquí', 4869), ('respecto', 4796), ('cuanto', 4658), ('siempre', 4627), ('parece', 4555), ('pueden', 4523), ('cómo', 4417), ('menos', 4408), ('general', 4313), ('hacer', 4270), ('mas', 4265), ('trata', 4153), ('primera', 4095), ('partir', 4087), ('aunque', 3934), ('hacia', 3896), ('todas', 3844), ('da', 3723), ('dice', 3588), ('podemos', 3483)]
[('sido', 3479), ('ver', 3432), ('podría', 3417), ('tan', 3400), ('fin', 3324), ('tener', 3281), ('primer', 3252), ('medio', 3128), ('momento', 3095), ('dentro', 3063), ('cuenta', 3051), ('bajo', 3043), ('cierto', 2988), ('segundo', 2889), ('dado', 2886), ('cualquier', 2813), ('dicho', 2782), ('lado', 2778), ('travé

Additionally, we have added words that we observed were incorrectly lemmatized. Later on we will pass the list of protected words to the lemmatizer so that it skips them.

In [17]:
protected_words = [word for word in protected_words if word and word != 'None']
protected_words = sorted(set(protected_words))  # deduplicate; sorted so the saved file is stable across runs

### Removing stopwords in other languages

Given that most of the articles have abstracts in English, some of the usual English stopwords appear frequently in our documents. Thus, we will add NLTK's list of English stopwords. We will also add NLTK's lists for Portuguese, German, and French, which we sometimes find in the documents as well.

In [18]:
other_stopwords = []

for lang in ["english", "portuguese", "german", "french"]:
    other_stopwords += nltk_stopwords.words(lang)

TODO: should we be filtering these just like we filter the Spanish ones?
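
One way to approach this question is to check which foreign-language stopwords coincide with protected words, since those are the entries most likely to remove words we want to keep. A sketch with made-up lists follows; `find_collisions` is a hypothetical helper, and the sample words are illustrative rather than taken from NLTK's actual lists.

```python
def find_collisions(foreign_stopwords, protected):
    """Return the foreign stopwords that are also protected words."""
    return sorted(set(foreign_stopwords) & set(protected))


# Toy lists (illustrative only)
foreign = ["the", "and", "modo", "sur"]
protected = ["ser", "bien", "modo"]

print(find_collisions(foreign, protected))
```

With the real data this would be called as `find_collisions(other_stopwords, protected_words)`, and the result could be reviewed by hand in the same way as the Spanish candidates.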

### Other stopwords custom to our corpus

There are some other stopwords that we would like to include, but that have not been taken into account in the previous processes. These are found in `wordlists/custom_stopwords.txt`.

Once we have a robust set of words we can save both the final stopword list and the protected words list.

In [19]:
with open('wordlists/custom_stopwords.txt') as file:
    custom_stopwords = file.read().split()

stopwords_final = sorted(set(
    stopwords_weak + stopwords_strong + other_stopwords + custom_stopwords
))  # sorted so the saved file is stable across runs
with open('wordlists/stopwords.txt', 'w') as fp:
    fp.write('\n'.join(stopwords_final))

with open('wordlists/protectedWords.txt', 'w') as fp:
    fp.write('\n'.join(protected_words))

In [20]:
corpus.save_documents()

## Final replacements and edits
PDF correction is not perfect, and we still observe some artifacts in the LDA results. A hotfix is to do those replacements manually for now and check whether we can improve on this process in the future.

Note: I saved these and removed the cells which contained this dictionary. We can find it in `wordlists/old_manual_replacements.json`.
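
Such replacements could be applied along the following lines. This is only a sketch: it assumes the JSON file holds a flat mapping from a garbled token to its correction, which the notebook does not confirm, and `apply_replacements` and the sample mapping are hypothetical.

```python
import re


def apply_replacements(text, replacements):
    """Replace whole-word occurrences of each key with its value."""
    if not replacements:
        return text
    # Longest keys first so that longer tokens are tried before shorter ones
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b')
    return pattern.sub(lambda m: replacements[m.group(1)], text)


# Hypothetical mapping; the real one would be read with json.load
sample_replacements = {"filosofia": "filosofía"}
print(apply_replacements("la filosofia de la ciencia", sample_replacements))
```

Matching on word boundaries keeps the replacement from touching longer words that merely contain one of the keys.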