# Topic modeling based on Spanish text corpora from the Hispanic Digital Library

The [Hispanic Digital Library](http://www.bne.es/en/Catalogos/BibliotecaDigitalHispanica/Acercade/) is the digital library of the Biblioteca Nacional de España. It provides access to thousands of digitised documents, including books printed from the 15th to the 20th century, manuscripts, drawings, engravings, pamphlets, posters, photographs, maps, atlases, music scores, historic newspapers and magazines and audio recordings.

This example uses the works of the author Manuel José Quintana, because the library makes them openly available as OCR output text.

Topic models are a type of statistical language model used to discover hidden structure in a collection of texts.

## Downloading the text
The web interface allows the OCR text of the documents to be retrieved. Each item provides a link to a viewer from which the OCR output text can be downloaded. See, for example, the following [link](http://bdh-rd.bne.es/viewer.vm?id=0000131223&page=1).

<img src="images/bdh.png" width="50%">

The text files have been stored in the folder [BNE](./BNE).

## Setting up things

In [None]:
import sys
import requests
import pandas as pd
import re
import gensim
from gensim.utils import simple_preprocess
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
import nltk

## Reading the txt files

The dataset comprises several files and formats. We have prepared the text files in this project to work with them.

Note: the original dataset did not include a CSV file. It was generated from an Excel file.

In [None]:
# Read data into works
works = pd.read_csv('BNE/bne.csv', encoding='utf8')

# Print head
works.head()

## Reading the files and extracting the text

In [None]:
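for index, row in works.iterrows():
    try:
        # Build the path from the 'file' column and read the OCR text
        file = "BNE/" + row['file']
        with open(file, "r") as f:
            text = f.read()

        works.loc[index, 'original_text'] = text

    except Exception:
        # Missing or unreadable file: report it and store an empty text
        print("An exception occurred", sys.exc_info()[0])
        works.loc[index, 'original_text'] = ''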

## Reviewing the content of the files

In [None]:
works.head()

## Remove punctuation/lower casing/stopwords

Next, let’s perform some simple preprocessing to make the content more amenable to analysis and the results more reliable. The steps are to lowercase and tokenize the text, drop very short tokens, and remove stopwords. Because the OCR output contains errors, removing non-Spanish words is a further option.

WordNet (imported above) could be used to check whether a word exists, although the functions below do not apply that check. We have also added some corpus-specific stopwords to improve the results.

The initial_clean function performs an initial clean by lowercasing the text and splitting it into tokens. Punctuation marks come out as separate tokens, most of which are dropped later by the length filter.

In [None]:
def initial_clean(text):
    """
    Lowercase the text and split it into tokens with NLTK.
    """
   
    text = text.lower() # lower case text
    text = nltk.word_tokenize(text)
    return text
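
As an alternative to the tokenizer above, punctuation and digits can be dropped at tokenization time with a regular expression. The following is a minimal sketch, not part of the original notebook: the helper name clean_and_tokenize and the sample sentence are made up for illustration.

```python
import re

# Hypothetical helper, not part of the original notebook.
# [^\W\d_]+ matches runs of Unicode letters only, so accented Spanish
# characters (á, ñ, ü) are kept while digits and punctuation are dropped.
LETTERS = re.compile(r"[^\W\d_]+")

def clean_and_tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return LETTERS.findall(text.lower())

print(clean_and_tokenize("¡La nación, en 1808, se levantó!"))
# ['la', 'nación', 'en', 'se', 'levantó']
```

This variant does not need the NLTK tokenizer data, but it also splits on apostrophes and hyphens, which may or may not be desirable for a given corpus.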

We could use a language detector for Spanish to remove nonexistent words. Because of the OCR quality of the texts in the dataset, many tokens are not real words. Removing them gives a better result, at the cost of slower processing.

Below is an example of how to identify the language of a text.

In [None]:
from googletrans import Translator

translator = Translator()

lang = translator.detect("La casa de Fernando es muy bonita").lang
print(lang, ":", "La casa de Fernando es muy bonita")

We could also stem words to reduce them to their roots.

In [None]:
from nltk import SnowballStemmer
spanishstemmer=SnowballStemmer('spanish')

print(spanishstemmer.stem("habían"))
print(spanishstemmer.stem("campo"))
print(spanishstemmer.stem("casa"))
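
To use stemming in the pipeline, the stemmer would be applied to every token of a document. The sketch below assumes NLTK is installed; stem_tokens is a hypothetical helper name, not a function from the notebook.

```python
from nltk.stem import SnowballStemmer

spanish_stemmer = SnowballStemmer('spanish')

def stem_tokens(tokens):
    """Return the Snowball stem of every token (hypothetical helper)."""
    return [spanish_stemmer.stem(token) for token in tokens]

# Singular and plural forms should collapse to one stem.
print(stem_tokens(["casa", "casas", "campo", "campos"]))
```

Stems are not always readable words, so the topics printed later would show stems rather than the original forms.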

The following function only drops tokens of two characters or fewer. It could be improved with additional filters such as language identification and stemming.

In [None]:
def remove_words(text):
    # Keep only tokens longer than two characters
    return [token for token in text if len(token) > 2]

## Removing stop words

Stop words are words that do not add much meaning to a sentence, for example English words such as the, he and have.

Several Python packages provide stopword lists, and these lists can be customized.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('spanish')
# Add stopwords specific to this corpus
stop_words.extend(['habia', 'quo', 'dió', 'algún','darién', 'dia', 'sing', 'babia', 'habian', 'despues', 'indic', 'ele', 'sólo', 'según', 'jos', 'jucef', 'pers', 'the', 'ra.', '.—núm', 'aben'])

def remove_stop_words(text):
    return [word for word in text if word not in stop_words]
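
The custom list above contains unaccented variants such as 'habia' and 'despues', which suggests that the OCR often loses accents. One possible way to avoid listing every variant by hand is to compare tokens and stopwords after stripping acute accents. This is only a sketch under that assumption: strip_acute, remove_stop_words_normalized and the three-word example list are hypothetical and do not come from the notebook.

```python
import unicodedata

def strip_acute(word):
    """Remove acute accents (á -> a) but keep ñ and ü intact."""
    decomposed = unicodedata.normalize("NFD", word)
    without_acute = decomposed.replace("\u0301", "")
    return unicodedata.normalize("NFC", without_acute)

# Small illustrative stopword list (assumption; not the NLTK list).
example_stop_words = {"había", "después", "según"}
normalized_stop_words = {strip_acute(w) for w in example_stop_words}

def remove_stop_words_normalized(tokens):
    """Drop tokens whose accent-stripped form is a stopword."""
    return [t for t in tokens if strip_acute(t) not in normalized_stop_words]

print(remove_stop_words_normalized(["habia", "había", "españa", "despues", "cortes"]))
# ['españa', 'cortes']
```

Storing the stopwords in a set also makes each membership test faster than scanning a list, which matters for long documents.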

We create a function to perform the whole process.

In [None]:
def apply_all(text):
    """
    Apply all the cleaning functions above in sequence.
    """
    return remove_stop_words(remove_words(initial_clean(text)))

Finally, we process the original text with the pandas apply method.

In [None]:
# clean the works and create the new column "tokenized_text"
import time
t1 = time.time()
works['tokenized_text'] = works['original_text'].apply(apply_all)
t2 = time.time()
print("Time to clean and tokenize", len(works), "works:", (t2-t1)/60, "min")

## Checking the result

In [None]:
works.head()

In [None]:
works['tokenized_text']

## Create Gensim Dictionary and Corpus
Topic modeling with LDA is based on a dictionary and a corpus. This example uses the gensim library to build both.

In [None]:
# LDA
import gensim
from gensim import corpora, models, similarities

In [None]:
tokenized = works['tokenized_text']

# Create the term dictionary of the corpus, where each unique term is assigned an index.
dictionary = corpora.Dictionary(tokenized)
# Keep terms that occur in at least 1 document and in no more than 80% of the documents.
dictionary.filter_extremes(no_below=1, no_above=0.8)
# Convert each tokenized document to a bag-of-words vector
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
#print(corpus[:1])
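
To make the dictionary, the frequency filter and the bag-of-words step more concrete, here is a pure-Python illustration of the idea on three toy documents. It is a sketch of the concept only, not gensim's implementation; the toy documents and the names token2id and to_bow are assumptions made for this example.

```python
from collections import Counter

# Toy tokenized documents (assumption; stands in for works['tokenized_text']).
docs = [
    ["cortes", "franceses", "guerra"],
    ["cortes", "poesia", "guerra", "guerra"],
    ["cortes", "teatro"],
]

# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))

# Keep terms appearing in at least `no_below` documents and in at most
# a fraction `no_above` of all documents (names mirror the notebook).
no_below, no_above = 1, 0.8
kept = sorted(t for t, df in doc_freq.items()
              if df >= no_below and df / len(docs) <= no_above)

# Assign each surviving term an integer id.
token2id = {term: i for i, term in enumerate(kept)}

def to_bow(tokens):
    """Return sorted (term_id, count) pairs, ignoring unknown terms."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

print(token2id)         # {'franceses': 0, 'guerra': 1, 'poesia': 2, 'teatro': 3}
print(to_bow(docs[1]))  # [(1, 2), (2, 1)]
```

Here "cortes" appears in all three documents, which is more than 80%, so it is removed from the vocabulary. With no_below=1 no rare term is removed, since every term occurs in at least one document.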

## Building the Topic Model
In this step, num_topics is the number of topics to be created and passes is the number of times the algorithm iterates through the entire corpus. Running the LDA algorithm yields the topics.

In [None]:
import warnings
warnings.simplefilter("ignore", DeprecationWarning)

#LDA
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = 5, id2word=dictionary, passes=15)
ldamodel.save('model_combined.gensim')
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

This output shows the 5 topics created and the 4 words that best describe each of them. From the output we could guess that each topic and its corresponding words revolve around a common theme (e.g., topic 3 is related to franceses and cortes).