# Building language models based on Spanish text corpora

This example uses a dataset containing the machine-readable text of the two books of the volume dedicated to the province of Ávila in the Monumental Catalogue of Spain, written by Manuel Gómez-Moreno (1900-1901). The dataset is available at [figshare](https://figshare.com/articles/Transcripci_n_del_Cat_logo_Monumental_de_Espa_a_Provincia_de_vila_por_Manuel_G_mez_Moreno_1900-1901_/12006318).

The text was first transcribed automatically with Transkribus and then manually revised. The transcriptions were carried out by Raquel Liceras-Garrido, Alba Comino and Patricia Murrieta-Flores under the project “Goodbye reading glasses: a Machine Learning experiment on handwriting documents”, funded by the Faculty of Arts and Social Sciences and the Digital Humanities Hub of Lancaster University (UK).

The project also produced datasets for other Spanish provinces, including [Soria](https://figshare.com/articles/Transcripci_n_del_Cat_logo_Monumental_de_la_Provincia_de_Soria_por_Juan_Cabr_1916-1917_/12006273) and [Burgos](https://figshare.com/articles/Transcripci_n_del_Cat_logo_Monumental_y_Art_stico_de_la_Provincia_de_Burgos_por_Narciso_Sentenach_1925_/12006327).

## Setting things up

In [None]:
import sys
import os
import os.path
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

from wordcloud import WordCloud
from wordcloud import STOPWORDS
from nltk.corpus import stopwords
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
import random
from pathlib import Path

## Reading the txt files

The dataset comprises several files in different formats. This project includes the text files, already prepared for processing.

In [None]:
filename = Path('CME_Avila/GM_Avila_v1_Text2_Procesado.txt')

text = ''

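if os.path.exists(filename):
    # Assumes the file is UTF-8 encoded, so accented Spanish characters decode correctly
    with open(filename, 'r', encoding='utf-8') as myfile:
        text = myfile.read()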

## Appending the text from the second file

In [None]:
filename = Path('CME_Avila/GM_Avila_v1_Text_Procesado_51-258.txt')

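if os.path.exists(filename):
    # Assumes the file is UTF-8 encoded, so accented Spanish characters decode correctly
    with open(filename, 'r', encoding='utf-8') as myfile:
        text += myfile.read()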
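
The two cells above repeat the same reading logic. As a minimal sketch, the helper below concatenates any number of text files and reports the ones it cannot find. The function name `read_corpus` is hypothetical, and UTF-8 encoding is an assumption about the files, not something stated by the dataset.

```python
from pathlib import Path


def read_corpus(paths, encoding="utf-8"):
    """Concatenate the contents of the files in `paths`, skipping missing ones."""
    parts = []
    for path in map(Path, paths):
        if path.exists():
            # An explicit encoding keeps accented characters (á, ñ, ...) intact
            parts.append(path.read_text(encoding=encoding))
        else:
            print(f"Skipping missing file: {path}")
    # Join with a newline so the last word of one file does not fuse with the next
    return "\n".join(parts)
```

With this helper, the text could be loaded in one call, for example `text = read_corpus(['CME_Avila/GM_Avila_v1_Text2_Procesado.txt', 'CME_Avila/GM_Avila_v1_Text_Procesado_51-258.txt'])`. Unlike the cells above, it inserts a newline between files.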

## Let's see the text

In [None]:
text

## Removing stop words

Stop words are words that do not add much meaning to a sentence, for example English words such as *the*, *he* and *have*.

Several Python packages provide stop-word lists, and these lists can be customized.
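
As a small illustration of what stop-word removal does, the sketch below filters a sentence against a tiny hand-written stop list. The list `toy_stop_words` is an illustrative subset chosen for this example, not the list shipped with NLTK.

```python
# Tiny illustrative stop list; NLTK's Spanish list is much longer
toy_stop_words = {"de", "y", "una", "la", "al", "que", "los"}

sentence = "ojos de buey achaflanados y una ventana de arco agudo"

# Keep only the tokens that are not in the stop list
content_words = [w for w in sentence.split() if w not in toy_stop_words]
print(content_words)
# → ['ojos', 'buey', 'achaflanados', 'ventana', 'arco', 'agudo']
```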

In [None]:
# adding specific stopwords
customized_stop_words = ["que", "es", "un", "una", "do", "toda", "hacia"] + stopwords.words('spanish')


In [None]:
# Create a WordCloud object
wordcloud = WordCloud(stopwords = customized_stop_words, collocations=False, background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(text)

# Visualize the word cloud
wordcloud.to_image()

## Tokenization

Tokenization is the process of breaking a text down into smaller chunks, such as sentences or words.
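
To make the idea concrete before turning to NLTK, here is a deliberately naive sketch based on regular expressions. The example paragraph is made up for illustration, and the splitting rules are a simplification: real tokenizers also handle abbreviations, numbers and similar cases.

```python
import re

# Made-up example paragraph, not taken from the dataset
paragraph = "La iglesia es románica. Tiene tres naves y un ábside."

# Naive sentence split: break after ., ! or ? when followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", paragraph)

# Naive word split: runs of word characters (accented letters match \w in Python 3)
words = [re.findall(r"\w+", s) for s in sentences]

print(sentences)
print(words)
```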

## Sentence Tokenization
A sentence tokenizer breaks a paragraph of text into sentences.

In [None]:
from nltk.tokenize import sent_tokenize
# Use the Spanish Punkt model instead of the English default
tokenized_text = sent_tokenize(text, language='spanish')
print(tokenized_text)

In [None]:
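from nltk.tokenize import word_tokenize
import string

# Tokens to discard: Spanish stop words plus punctuation marks.
# string.punctuation is ASCII-only, so the Spanish marks are added by hand.
stop = set(stopwords.words('spanish')) | set(string.punctuation) | set('¿¡«»')

# Lowercase the text, split it into words, and drop the unwanted tokens
cleaned_text = [i for i in word_tokenize(text.lower(), language='spanish') if i not in stop]

print(cleaned_text)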

In [None]:
from nltk.probability import FreqDist
fdist = FreqDist(cleaned_text)
print(fdist)

In [None]:
fdist.most_common(5)
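
`FreqDist` behaves like Python's `collections.Counter` (in NLTK it is a subclass of it), so the same ranking can be sketched without NLTK. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens standing in for cleaned_text
tokens = ["arco", "ventana", "arco", "torre", "arco", "ventana"]

counts = Counter(tokens)

# The two most frequent tokens with their counts
print(counts.most_common(2))
# → [('arco', 3), ('ventana', 2)]

# Relative frequency of one token
print(counts["arco"] / sum(counts.values()))
# → 0.5
```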

In [None]:
# Frequency Distribution Plot
import matplotlib.pyplot as plt
fdist.plot(30,cumulative=False)
plt.show()

In [None]:
from nltk.corpus import stopwords
stop_words=set(stopwords.words("spanish"))
print(stop_words)

An n-gram is a sequence of n consecutive words in a sentence: a bigram contains two words and a trigram three. Let's see how to generate them from a sentence in Python:

In [None]:
first_sentence = "Ojos de buey achaflanados y una ventana de arco agudo, prestaban la mayor cantidad de luz al interior, puesto que á los costados solo había saeteras"

print(first_sentence)

# Get the bigrams
print(list(bigrams(word_tokenize(first_sentence))))

# Get the trigrams
print(list(trigrams(word_tokenize(first_sentence))))

# Get the padded trigrams
print(list(trigrams(word_tokenize(first_sentence), pad_left=True, pad_right=True)))
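
To show what the padding does, here is a dependency-free sketch of n-gram extraction. The function name `ngrams_with_padding` is hypothetical; it pads with n-1 `None` symbols on each side, which is how the padded trigrams above mark the start and end of a sentence.

```python
def ngrams_with_padding(tokens, n, pad=None):
    """Return the n-grams of `tokens`, padded with n-1 `pad` symbols on each side."""
    padded = [pad] * (n - 1) + list(tokens) + [pad] * (n - 1)
    # Slide a window of size n over the padded sequence
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


tokens = ["Ojos", "de", "buey"]

print(ngrams_with_padding(tokens, 2))
# → [(None, 'Ojos'), ('Ojos', 'de'), ('de', 'buey'), ('buey', None)]

print(ngrams_with_padding(tokens, 3))
# → [(None, None, 'Ojos'), (None, 'Ojos', 'de'), ('Ojos', 'de', 'buey'),
#    ('de', 'buey', None), ('buey', None, None)]
```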

In [None]:
# model[(w1, w2)][w3] counts how often w3 follows the word pair (w1, w2)
model = defaultdict(lambda: defaultdict(lambda: 0))

for sentence in sent_tokenize(text, language='spanish'):
    tokens = word_tokenize(sentence, language='spanish')
    # Padding adds None markers so that sentence starts and ends are counted too
    for w1, w2, w3 in trigrams(tokens, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# How often "de" follows "El Barco" and "retablo" follows "Fragmentos de"
print(model["El", "Barco"]["de"])
print(model["Fragmentos", "de"]["retablo"])

# model[None, None][w] is the number of sentences that start with the word w

## Let's transform the counts to probabilities

In [None]:
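for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    if total_count == 0:
        # Context created by a lookup on the defaultdict but never seen in the text
        continue

    # Divide each count by the total for its context to get P(w3 | w1, w2)
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count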
 
print(model["El", "Barco"]["de"])
print(model["Fragmentos", "de"]["retablo"])
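
In equation form, the normalization above is the maximum-likelihood estimate of the trigram probability. The notation is introduced here for illustration: $C(\cdot)$ denotes a count in the corpus.

```latex
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{\sum_{w} C(w_1, w_2, w)}
```

For each context $(w_1, w_2)$ that occurs in the text, the probabilities of all its continuations sum to 1.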

## Now that we have a trigram language model, let's generate some text

In [None]:
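# Start from the left padding, i.e. the start-of-sentence context
txt = [None, None]

sentence_finished = False

while not sentence_finished:
    candidates = model[tuple(txt[-2:])]
    if not candidates:
        # Unseen context (e.g. the text files were not found): stop instead of looping forever
        break

    # Sample the next word in proportion to its probability
    words = list(candidates.keys())
    weights = list(candidates.values())
    txt.append(random.choices(words, weights=weights)[0])

    # Two consecutive None values are the right padding: the sentence is complete
    if txt[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in txt if t]))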
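
To see the whole pipeline in isolation, here is a self-contained sketch on a made-up three-sentence corpus. It uses only the standard library, the names `toy_sentences` and `generate` are hypothetical, and it samples directly from counts, which is equivalent to sampling from the normalized probabilities.

```python
import random
from collections import defaultdict

# Made-up, pre-tokenized corpus
toy_sentences = [
    ["la", "torre", "es", "alta"],
    ["la", "torre", "es", "antigua"],
    ["la", "iglesia", "es", "antigua"],
]

# Count trigrams with two None symbols of padding on each side
counts = defaultdict(lambda: defaultdict(int))
for tokens in toy_sentences:
    padded = [None, None] + tokens + [None, None]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        counts[(w1, w2)][w3] += 1


def generate(counts, rng):
    """Sample one sentence; stop when the end-of-sentence padding is reached."""
    out = [None, None]
    while True:
        options = counts[tuple(out[-2:])]
        words, weights = zip(*options.items())
        out.append(rng.choices(words, weights=weights)[0])
        if out[-2:] == [None, None]:
            return [w for w in out if w is not None]


print(" ".join(generate(counts, random.Random(0))))
```

Because every two-word context in this toy corpus leads back to one of the training sentences, the generator can only reproduce them. With a larger corpus, contexts are shared between sentences and the model starts to recombine them.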

## References

Liceras-Garrido, Raquel; Comino, Alba; Murrieta-Flores, Patricia (2020): Transcripción del Catálogo Monumental de España: Provincia de Ávila por Manuel Gómez Moreno (1900-1901). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12006318.v1

https://www.datacamp.com/community/tutorials/text-analytics-beginners-nltk

https://nlpforhackers.io/language-models/