## LDA Preprocessing 1
# Artifact Removal and Stopword Selection
This notebook takes the clean JSON files for each article and preprocesses them to obtain text that we can analyze using LDA.

Specifically, we do:
* Artifact removal
* Stopword selection

In the next notebook we will do:
* Punctuation removal
* Lemmatization

In [1]:
import json
import re

We will also use some utility functions we defined in the `utils/` folder:
* `loadCorpusList(path)`: Loads the corpus as a list of `Article` objects (see `utils/Article.py`). This will allow us to save the clean text per document into the same JSON file with the metadata included.
* `saveCorpus(path, corpusList)`: Saves the articles in JSON format in their current state. Useful when we want to append information to our clean JSON files.

In [2]:
import os
import sys 

# Jupyter Notebooks are not good at handling relative imports.
# Best solution (not great practice) is to add the project's path
# to the module loading paths of sys.

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utils.loaders import loadCorpusList, saveCorpus

## Loading the corpus

In [3]:
corpusPath = '../data/parsedHTML'

htmlList = loadCorpusList(corpusPath)

for file in htmlList:
    file.format = 'HTML'

In [4]:
corpusPath = '../data/parsedPDF'

pdfList = loadCorpusList(corpusPath)

for file in pdfList:
    file.format = 'PDF'

In [5]:
corpusList = htmlList + pdfList

In [6]:
corpusList[0].title

'Panikkar, Raimon. La religión, el mundo y el cuerpo'

We will only work with documents in Spanish, so let's filter `corpusList` down to the articles that are in Spanish. There are also some articles that did not have any recognizable text after parsing, so we will remove those as well. Finally, we will only consider articles, not reviews or other kinds of texts present in the journal.

In [7]:
corpusList = [doc for doc in corpusList if doc.lang == 'es']
corpusList = [doc for doc in corpusList if doc.text] 
corpusList = [doc for doc in corpusList if doc.type == 'ARTÍCULOS']

This leaves us with roughly 800 articles.

In [8]:
len(corpusList) 

819

Let's check for duplicates, which can appear if we downloaded an article twice, once in PDF and once in HTML.

In [9]:
len(set([doc.id for doc in corpusList]))

816
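
The number of unique IDs (816) is lower than the number of documents (819), so three entries share an ID with another document. Below is a minimal sketch of how those could be dropped, keeping the first occurrence of each ID. The helper name `dropDuplicateIds` is hypothetical, and keeping the first copy (which would be the HTML version, since `htmlList` comes first in `corpusList`) is an assumption rather than a decision made in this notebook.

```python
from types import SimpleNamespace

def dropDuplicateIds(docs):
    """Return docs without repeated ids, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for doc in docs:
        if doc.id not in seen:
            seen.add(doc.id)
            unique.append(doc)
    return unique

# Toy stand-ins for Article objects (only the attributes used here).
sample = [
    SimpleNamespace(id='a1', format='HTML'),
    SimpleNamespace(id='a2', format='HTML'),
    SimpleNamespace(id='a1', format='PDF'),
]
deduped = dropDuplicateIds(sample)
```

On the real corpus this would be called as `corpusList = dropDuplicateIds(corpusList)`, after checking by hand that the repeated IDs really are the same article.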

## Artifact removal

The text includes some artifacts produced by HTML processing (and, in the future, possibly by how PDF files store text).

Let's start by removing numbers and some special characters such as newline characters (`\n`). We will keep normal punctuation for now as that might help SpaCy when we do lemmatization.

In [10]:
for doc in corpusList:
    # Raw string so that \d and \n reach the regex engine unchanged.
    doc.cleanText = re.sub(r'\d|\n', ' ', doc.text)

We can detect some of these artifacts by looking for non-alphanumeric characters between alphanumeric characters (e.g. `"ar-gument"`, `"ar\xadgument"`).

In [11]:
artifacts = re.compile(r'\w+[^a-zA-ZáéíóúÁÉÍÓÚñÑüÜ\d\s:]\w+')
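
As a quick check of what this pattern captures, the sketch below runs it on a made-up sentence (the sample text is ours, not taken from the corpus). The pattern matches two runs of word characters joined by a single character that is not a letter from the listed set, a digit, whitespace, or a colon.

```python
import re

artifacts = re.compile(r'\w+[^a-zA-ZáéíóúÁÉÍÓÚñÑüÜ\d\s:]\w+')

# Contains a soft hyphen, a hyphenated compound, and a missing space after a comma.
sample = 'El ar\xadgumento en-sí del mundo,ya citado: nota'
found = artifacts.findall(sample)
```

Here `found` contains the soft-hyphenated word, `en-sí`, and `mundo,ya`. This shows that the pattern flags legitimate compounds as well as real artifacts, so its output needs manual inspection.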

In [12]:
artifacts.findall(corpusList[0].text)

['legamab@unal',
 'edu.co',
 "Abel's",
 'en-sí',
 'mundo,ya',
 'subjetivo-constructivo',
 'for-mas',
 'comprensivo-interpretativo',
 'mundo.1',
 'en-sí',
 'panorama.2',
 'Hans-Georg',
 'Hans-Georg',
 'Abel.3',
 're-identificado',
 'en-sí',
 'hablante-oyente',
 'interpretación-1',
 'espacio-temporal',
 'interpretación-1',
 'interpretación-1',
 'interpretación-2',
 'interpretación-3',
 'interpretaciones-3',
 'socio-culturalmente',
 'interpretación-1',
 'interpretación-3',
 'interpretativo-1',
 'nivel-1',
 'nivel-2',
 'histórico-culturales',
 'nivel-3',
 'interpretación-3',
 'nivel-2',
 'interpretaciones-2',
 'interpretación-2',
 'interpretativos-1',
 'c]uando',
 'categorializantes-1',
 'interpretativa-1',
 'nivel-3',
 'mundo-1',
 'nivel-2',
 'nivel-3',
 'i-3',
 'i-2',
 'i-1',
 'mundo-1',
 'mundos-2',
 'mundos-3',
 'nivel-3',
 '1-2',
 'histórico-culturales',
 'interpretación-3',
 'interpretaciones-2',
 'socio-históricos',
 'nivel-1',
 'nivel-1',
 'mundos-1',
 'interpretaciones-1',
 'mundo

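One common artifact is the soft hyphen (`\xad`), which marks where a word may be broken across lines. We can remove it easily. We also handle the non-breaking space (`\xa0`), replacing it with a regular space so that the words around it are not joined.

In [13]:
for doc in corpusList:
    # Drop soft hyphens, which sit inside words.
    doc.cleanText = doc.cleanText.replace('\xad', '')
    # Turn non-breaking spaces into regular spaces so neighbouring words stay separate.
    doc.cleanText = doc.cleanText.replace('\xa0', ' ')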

We can save the corpus for now.

In [14]:
os.makedirs('../data/corpus', exist_ok=True)
saveCorpus('../data/corpus', corpusList)

## Stopword Removal
Stopword removal is perhaps the most difficult part of preprocessing. There are two challenges to meet:
* Some stopword lists, such as the one included in NLTK for Spanish, are too weak and leave many stopwords unfiltered.
* Other stopword lists are too inclusive and can eliminate words that are meaningful in philosophy (e.g. 'verdadero', true).

It is important to note that stopwords are very context-sensitive. A word may carry little meaning in one context (hence counting as a stopword) while carrying a lot of information in another.

To tackle these challenges, we will first do an initial filtering with NLTK's list. This will leave many stopwords in the text, but will reduce the size of each text considerably. Then we will compare the text with a stronger list of stopwords (the `stopwords-es` list from stopwords-iso). We will see which words are in both the text and the stronger stopword list. We will inspect these words manually and extract a list of protected words, iterating over this process a number of times. Once we have a robust list of protected words, we will concatenate NLTK's stopword list with the stronger one and eliminate the protected words from it. This will provide a final (hopefully middle-ground) stopword list with which to continue.
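
One round of the review described above can be sketched as follows. The function name `overlapCounts` and the toy word lists are illustrative assumptions. In the notebook, the inputs would be the document tokens, the stronger stopword list, and the current list of protected words.

```python
from collections import Counter

def overlapCounts(tokens, strongList, protected, n=10):
    """Count tokens that are in the strong list and not yet protected."""
    candidates = set(strongList) - set(protected)
    counts = Counter(tok for tok in tokens if tok in candidates)
    return counts.most_common(n)

tokens = ['ser', 'puede', 'ser', 'bien', 'puede', 'ser', 'kant']
strongList = ['ser', 'puede', 'bien']

# Round 1: nothing is protected yet, so every overlap is shown for inspection.
round1 = overlapCounts(tokens, strongList, protected=[])
# Round 2: after deciding that 'ser' and 'bien' are meaningful, protect them.
round2 = overlapCounts(tokens, strongList, protected=['ser', 'bien'])
```

Each round surfaces the most frequent remaining overlaps, so the manual inspection always focuses on the words that would have the largest effect if they were removed.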

In [14]:
from nltk.corpus import stopwords as nltk_stopwords

stopwords_weak = nltk_stopwords.words('spanish')

In [15]:
import requests

r = requests.get('https://raw.githubusercontent.com/stopwords-iso/stopwords-es/master/stopwords-es.txt')
r.raise_for_status()  # stop here if the download failed
stopwords_strong = r.text.splitlines()

In [16]:
docWords = []
for doc in corpusList:
    docWords += [word for word in re.findall(r'\w+', doc.cleanText) if word not in stopwords_weak]

In [17]:
docWords[:10]

['INTERPRETACIÓN',
 'Y',
 'RELATIVISMO',
 'OBSERVACIONES',
 'SOBRE',
 'LA',
 'FILOSOFÍA',
 'DE',
 'GÜNTER',
 'ABEL']
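
The sample above still contains `Y`, `SOBRE`, `LA`, and `DE` because the membership test is case-sensitive and NLTK's list is lowercase. A hedged sketch of a case-insensitive variant follows. It also uses a `set`, which makes lookups faster than scanning a list. The short `weak` list stands in for NLTK's list and the function name `filterTokens` is hypothetical. Whether to ignore case at this stage is a design choice that the notebook leaves open.

```python
import re

def filterTokens(text, stopwords):
    """Tokenize on word characters and drop tokens whose lowercase form is a stopword."""
    stopSet = {w.lower() for w in stopwords}
    return [tok for tok in re.findall(r'\w+', text) if tok.lower() not in stopSet]

weak = ['y', 'la', 'de', 'sobre']  # stand-in for NLTK's Spanish list
title = 'INTERPRETACIÓN Y RELATIVISMO OBSERVACIONES SOBRE LA FILOSOFÍA DE GÜNTER ABEL'
kept = filterTokens(title, weak)
```

The tokens keep their original casing. Only the comparison is done in lowercase.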

In [18]:
from collections import Counter

docs_and_stopwords = Counter([word for word in docWords if word in stopwords_strong])

In [19]:
docs_and_stopwords.most_common(10)

[('ser', 20774),
 ('puede', 16814),
 ('u', 15382),
 ('si', 14336),
 ('sino', 11786),
 ('mismo', 11103),
 ('bien', 8738),
 ('decir', 8597),
 ('i', 7546),
 ('modo', 7061)]

Already in the first 10 most common words in both the documents and the strong list of stopwords we find words that in philosophy are quite meaningful:
* 'ser': being
* 'bien': good
* 'modo': mode

We will start saving those words and eliminating them from the stronger list of stopwords. Then we will repeat the process of selecting the words that are in both lists and see which words are common. By iterating over this process a couple of times, we will get a list of protected words.

In [20]:
protectedWords = [
    'ser',
    'bien',
    'modo'
]

In [21]:
stopwords_strong = [word for word in stopwords_strong if word not in protectedWords]
docs_and_stopwords = Counter([word for word in docWords if word in stopwords_strong])

In [22]:
docs_and_stopwords.most_common(100)

[('puede', 16814),
 ('u', 15382),
 ('si', 14336),
 ('sino', 11786),
 ('mismo', 11103),
 ('decir', 8597),
 ('i', 7546),
 ('manera', 6669),
 ('parte', 6622),
 ('tal', 6331),
 ('dos', 6217),
 ('solo', 6040),
 ('pues', 5903),
 ('misma', 5777),
 ('posible', 5597),
 ('hace', 5498),
 ('sólo', 5490),
 ('tiempo', 5458),
 ('lugar', 5342),
 ('debe', 5245),
 ('verdad', 5237),
 ('hecho', 5083),
 ('embargo', 5027),
 ('vez', 5018),
 ('así', 4913),
 ('entonces', 4841),
 ('ejemplo', 4608),
 ('toda', 4579),
 ('siempre', 4066),
 ('cosas', 3969),
 ('parece', 3962),
 ('ello', 3921),
 ('cada', 3920),
 ('aquí', 3822),
 ('poder', 3713),
 ('cuanto', 3688),
 ('respecto', 3656),
 ('saber', 3639),
 ('general', 3618),
 ('según', 3592),
 ('pueden', 3520),
 ('mas', 3517),
 ('hacer', 3466),
 ('da', 3446),
 ('trata', 3402),
 ('menos', 3341),
 ('cómo', 3321),
 ('partir', 3269),
 ('primera', 3148),
 ('propia', 3117),
 ('primer', 3056),
 ('trabajo', 2964),
 ('propio', 2963),
 ('podría', 2961),
 ('cierto', 2896),
 ('dice'

Additionally, we have added words that we observed were incorrectly lemmatized. Later on, we will pass the list of protected words to the lemmatizer so that it skips them.

In [23]:
protectedWords += [
    'parte',
    'posible',
    'lugar',
    'hecho',
    'poder',
    'verdad',
    'cosas',
    'general',
    'fin',
    'trabajo',
    'cierto',
    'uso',
    'dado',
    'diferentes',
    'verdadero',
    'verdadera',
    'existe',
    'valor',
    'realizar',
    'existen',
    'conocer',
    'diferente',
    'idea',
    'caso',
    'consciencia',
    'conciencia',
    'objeto',
    'forma',
    'obra',
    'persona',
    'sujeto',
    'primer',
    'primera',
    'primero',
    'descartes',
    'libre',
    'libres',
    'escoto',
    'falta',
    'regla',
    'signo',
    'liberté',
    'potencia',
    'cosa',
    'nombre',
    'enunciado',
    'profundo',
    'moneda',
    'minuto',
    'madera',
    'indicio',
    'industria',  
    'espejo',
    'escolio',
    'era',
    'prototipo',
    'discurso',
    'escritura',
    'cave',
    'evidencia',
    'principia'
    ]

protectedWords = list(set(protectedWords))

To keep the process manageable, once we are sure about a set of stopwords after a couple of rounds, we can remove them from the list of document words and then repeat the process a couple more times.

In [24]:
stopwordsToRemove = {word for word, _ in docs_and_stopwords.most_common(100)}
docWords = [word for word in docWords if word not in stopwordsToRemove]

### Removing stopwords in English

Given that most of the articles have abstracts in English, some common English stopwords appear frequently in our documents. Thus, we will append NLTK's list of English stopwords. We will also add NLTK's Portuguese list, since Portuguese text sometimes appears as well.

In [25]:
englishStopwords = nltk_stopwords.words("english")
portugueseStopwords = nltk_stopwords.words("portuguese")

TODO: should we be filtering these just like we filter the Spanish ones?

### Other stopwords custom to our corpus

There are some other stopwords that we would like to include but that were not captured by the previous steps. These are:

In [26]:
customStopwords = [
    "cf",
    "cfr",
    "sic",
    "quae",
    "pro",
    "sit",
    "quod",
    "quia",
    "wor",
    "wha",
    "whe",
    "no obstante",
    "sin embargo",
    "por ejemplo",
    "es decir",
    "ak",
    "krv",
    "tha",
    "press",
    "university",
    "est",
    "non",
    "par",
    "per",
    "tod",
    "ell",
    "cua",
    "alg",
    "segú",
    "chic",
    "thi",
    "cad",
    "hac",
    "ca",
    "pue",
    "cambridge",
    "would",
    "ést",
    "hua",
    "httpsdoiorg",
    "ser"  # Adding "ser" for now because it generates too much noise
]
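
Entries such as `"no obstante"` or `"sin embargo"` span two tokens, so a word-by-word membership test will never match them. One possible way to handle them is sketched below, assuming the phrases are removed from the raw text before tokenization. The function name `removePhrases` is hypothetical.

```python
import re

def removePhrases(text, stopwords):
    """Remove multi-word stopword entries from text, ignoring case."""
    phrases = [w for w in stopwords if ' ' in w]
    if not phrases:
        return text
    # Longest first so that overlapping phrases prefer the longer match.
    phrases.sort(key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(p) for p in phrases) + r')\b',
        flags=re.IGNORECASE,
    )
    # Collapse the whitespace left behind by the removed phrases.
    return re.sub(r'\s+', ' ', pattern.sub(' ', text)).strip()

sample = 'Sin embargo el argumento es válido es decir se sostiene'
cleaned = removePhrases(sample, ['cf', 'sin embargo', 'es decir'])
```

Single-word entries such as `cf` are left to the usual token-level filtering.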

Once we have a robust set of words we can save both the final stopword list and the protected words list.

In [27]:
stopwords_final = list(set(
    stopwords_weak + stopwords_strong + englishStopwords + portugueseStopwords + customStopwords
))
# Write as UTF-8 explicitly, since many entries contain accented characters.
with open('wordlists/stopwords.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(stopwords_final))

with open('wordlists/protectedWords.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(protectedWords))
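
The procedure described earlier removes the protected words from the combined list. As the cells read in order, `stopwords_strong` was filtered before `protectedWords` was extended, so later additions may still be present in the union. A minimal sketch of one way to apply that step just before saving is shown below. The function name `buildStopwords` is hypothetical. Letting custom stopwords override protection, so that a deliberately added word such as `ser` stays in the list, is an assumption and not the notebook's confirmed behaviour.

```python
def buildStopwords(generalLists, protected, custom):
    """Union the general lists, drop protected words, then add custom entries."""
    general = set()
    for words in generalLists:
        general.update(w for w in words if w)  # skip empty strings
    # Custom entries are added last, so they stay even if they are protected.
    return sorted((general - set(protected)) | set(custom))

# Toy lists standing in for the NLTK and stopwords-iso lists.
weak = ['de', 'la', 'ser']
strong = ['de', 'puede', 'verdad', 'modo', '']
final = buildStopwords(
    [weak, strong],
    protected=['verdad', 'modo', 'ser'],
    custom=['cf', 'ser'],
)
```

Sorting the result is optional, but it keeps the saved file stable between runs, which makes changes easier to review.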

## Final replacements and edits
PDF correction is not perfect, and we observe some artifacts left in the LDA output. As a hotfix, we do those replacements manually for now and will check whether we can improve on this process in the future.

In [28]:
manualReplacements = {
    'kan': 'kant',
    'Kan': 'Kant',
    'entr': 'entre',
    'otr': 'otro',
    'mism': 'mismo',
    'wit': 'with',
    'tien': 'tiene',
    'maner': 'manera',
    'objet': 'objeto',
    'Hege': 'Hegel',
    'mund': 'mundo',
    'sistem': 'sistema',
    'obr': 'obra',
    'histori': 'historia',
    'pode': 'poder',
    'deci': 'decir',
    'bie': 'bien',
    'entonce': 'entonces',
    'verda': 'verdad',
    'deb': 'deber',
    'tant': 'tanto',
    'mora': 'moral',
    'form': 'forma',
    'Hum': 'Hume',
    'ide': 'idea',
    'mod': 'modo',
    'hech': 'hecho',
    'vid': 'vida',
    'relativ': 'relativo',
    'negativ': 'negativo',
    'mínim': 'mínimo',
    'implícit': 'implícito',
    'gegenealógic': 'genealógico',
    'explícit': 'explícito',
    'disput': 'disputa',
    'liberta': 'libertad',
    'polític': 'política',
    'part': 'parte',
    'punt': 'punto',
    'propi': 'propio',
    'crític': 'crítica',
    
}

In [29]:
for doc in corpusList:
    for word, replacement in manualReplacements.items():
        doc.cleanText = re.sub(r'\b' + word + r'\b', replacement, doc.cleanText)
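
The loop above runs one `re.sub` per dictionary entry for every document. As a sketch of an alternative, all keys can be combined into one compiled pattern and replaced in a single pass. The function name `applyReplacements` is hypothetical. One behavioural difference is that, in a single pass, the output of one replacement is never rewritten by another entry.

```python
import re

def applyReplacements(text, replacements):
    """Replace whole-word occurrences of each key with its value in one pass."""
    # Longest keys first so the alternation prefers the most specific match.
    keys = sorted(replacements, key=len, reverse=True)
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')
    return pattern.sub(lambda m: replacements[m.group(0)], text)

sample = 'Para Kan la verda del mund no es parte del sistem'
fixed = applyReplacements(sample, {
    'Kan': 'Kant',
    'verda': 'verdad',
    'mund': 'mundo',
    'part': 'parte',
    'sistem': 'sistema',
})
```

The word boundaries (`\b`) ensure that complete words such as `parte` are left untouched, as in the original loop.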

In [30]:
os.makedirs('../data/corpus', exist_ok=True)
saveCorpus('../data/corpus', corpusList)