# Ranking n-grams based on Spanish text corpora

This example uses a dataset containing the machine-readable text of the two books that make up the Ávila volume of the Monumental Catalogue of Spain, written by Manuel Gómez-Moreno (1900-1901). The dataset is available at [figshare](https://figshare.com/articles/Transcripci_n_del_Cat_logo_Monumental_de_Espa_a_Provincia_de_vila_por_Manuel_G_mez_Moreno_1900-1901_/12006318).

After automatic transcription based on Transkribus, the text was manually revised. The transcriptions were carried out by Raquel Liceras-Garrido, Alba Comino and Patricia Murrieta-Flores under the project “Goodbye reading glasses: a Machine Learning experiment on handwriting documents”, funded by the Faculty of Arts and Social Sciences and the Digital Humanities Hub of Lancaster University (UK).

The project produced several datasets covering other Spanish provinces, including [Soria](https://figshare.com/articles/Transcripci_n_del_Cat_logo_Monumental_de_la_Provincia_de_Soria_por_Juan_Cabr_1916-1917_/12006273) and [Burgos](https://figshare.com/articles/Transcripci_n_del_Cat_logo_Monumental_y_Art_stico_de_la_Provincia_de_Burgos_por_Narciso_Sentenach_1925_/12006327).

## Setting things up

In [None]:
import sys
import os
import os.path
from pathlib import Path
import string

from wordcloud import WordCloud
from nltk.corpus import stopwords
from collections import Counter, defaultdict
import pandas as pd

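import nltk
nltk.download('stopwords')
# Tokenizer models used by sent_tokenize and word_tokenize
# (recent NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize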
from nltk.probability import FreqDist
import gensim 
from gensim.models import Word2Vec 
import re
import unicodedata
import matplotlib.pyplot as plt


## Reading the txt files

The dataset comprises several files in different formats. This project includes the plain-text files, prepared for the analysis that follows.

In [None]:
filename = Path('CME_Avila/GM_Avila_v1_Text2_Procesado.txt')

text = ''

# UTF-8 is assumed here so that accented characters are read correctly
if os.path.exists(filename):
    with open(filename, 'r', encoding='utf-8') as myfile:
        text = myfile.read()

## We get the text from the second file

In [None]:
filename = Path('CME_Avila/GM_Avila_v1_Text_Procesado_51-258.txt')

# Append the second file to the text read so far (UTF-8 assumed)
if os.path.exists(filename):
    with open(filename, 'r', encoding='utf-8') as myfile:
        text += myfile.read()
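
The two cells above leave `text` untouched when a file is missing, so a wrong path goes unnoticed until later steps return empty results. A minimal sketch of a helper that reads several files and reports the ones it could not find is shown below. The name `read_corpus` and the UTF-8 default are assumptions made for illustration; they are not part of the original notebook.

```python
from pathlib import Path


def read_corpus(paths, encoding="utf-8"):
    """Concatenate the contents of the files in `paths`.

    Returns the combined text and the list of paths that were not found.
    The helper name and the UTF-8 default are illustrative assumptions.
    """
    parts, missing = [], []
    for path in map(Path, paths):
        if path.is_file():
            parts.append(path.read_text(encoding=encoding))
        else:
            missing.append(path)
    # Join without a separator, mirroring the `text += ...` pattern above
    return "".join(parts), missing
```

It could be called as `text, missing = read_corpus([first_path, second_path])`, after which a non-empty `missing` list signals that a path needs checking.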

## Let's see the text

In [None]:
text

## Removing stop words

Stop words are words that add little meaning to a sentence. English examples include "the", "he" and "have".

Several Python packages provide stop word lists, and these lists can be customized.
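
As a small illustration of the idea, the sketch below filters a sentence against a tiny hand-written stop word list. Both the list and the sentence are made up for this example; the notebook itself relies on NLTK's Spanish list plus custom additions.

```python
# A tiny hand-written stop word list, used only for this illustration
example_stop_words = {"la", "de", "el", "en", "y", "que", "es"}

sentence = "la iglesia de san vicente es el templo románico que domina la ciudad"

# Keep only the tokens that are not stop words
content_words = [w for w in sentence.split() if w not in example_stop_words]
print(content_words)
```

Using a `set` for the stop words keeps each membership check fast, which matters once the text has many thousands of tokens.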

In [None]:
# Combine corpus-specific stop words with NLTK's Spanish list and punctuation marks
customized_stop_words = ["que", "es", "un", "una", "do", "toda", "hacia", "á", "ii", "et", "ta", "s.", "ms"] + stopwords.words('spanish') + list(string.punctuation)

In [None]:
# Create a WordCloud object
wordcloud = WordCloud(stopwords = customized_stop_words, collocations=False, background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')

# Generate a word cloud
wordcloud.generate(text)

# Visualize the word cloud
wordcloud.to_image()

## Tokenization

Tokenization is the process of breaking a text down into smaller chunks, such as sentences or words.
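
To make the idea concrete, here is a deliberately naive sketch that uses only regular expressions. It is a simplification, not what NLTK does: NLTK's tokenizers handle abbreviations and keep punctuation marks as separate tokens, whereas this version would wrongly split after an abbreviation such as "S." and drops punctuation entirely. The sample text is invented for the example.

```python
import re

sample = "La muralla rodea la ciudad. Fue construida en el siglo XII."

# Naive sentence split: break after '.', '!' or '?' when followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", sample)

# Naive word split: runs of letters, digits and underscores; punctuation is dropped
tokens = re.findall(r"\w+", sample)

print(sentences)
print(tokens)
```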

## Sentence Tokenization
A sentence tokenizer splits a paragraph of text into sentences.

In [None]:
# Rejoin words that were hyphenated across line breaks
text = text.replace("-\n", "")

# Use the Spanish Punkt model; sent_tokenize defaults to English
tokenized_text = sent_tokenize(text, language='spanish')
print(tokenized_text)

In [None]:
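# Split the text into word tokens using the Spanish tokenizer models
tokenized_word = word_tokenize(text, language='spanish')
# A set makes the stop word lookups fast
stop = set(customized_stop_words)
# Lowercase the text, tokenize it, and drop stop words and punctuation tokens
cleaned_text = [i for i in word_tokenize(text.lower(), language='spanish') if i not in stop]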

print(cleaned_text)

## Let's compute frequencies

The FreqDist class is used to encode “frequency distributions”, which count the number of times that each outcome of an experiment occurs.
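
In NLTK, `FreqDist` builds on Python's `collections.Counter`, so the counting it performs can be sketched with the standard library alone. The token list below is invented for the example.

```python
from collections import Counter

tokens = ["iglesia", "torre", "iglesia", "muralla", "torre", "iglesia"]

# Count how many times each token occurs
counts = Counter(tokens)

# The two most frequent tokens with their counts
top_two = counts.most_common(2)

# Relative frequency of one token: its count divided by the total number of tokens
iglesia_share = counts["iglesia"] / len(tokens)

print(top_two)
print(iglesia_share)
```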

In [None]:
fdist = FreqDist(cleaned_text)
print(fdist)

In [None]:
fdist.most_common(5)

In [None]:
# Frequency Distribution Plot
fdist.plot(30,cumulative=False)
plt.show()

## Creating and visualizing n-gram ranking using nltk for natural language processing

An n-gram is a sequence of n consecutive words, where n is a positive integer. For example, the word "car" is a 1-gram, the combination "red car" is a 2-gram, and "nice red car" is a 3-gram.

In n-gram ranking, we rank the n-grams according to how many times they appear in a text, which may be a book or a collection of tweets.
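
The definition can be sketched in a few lines of plain Python. The function `ngrams` below is a hypothetical stand-in written for this illustration; it plays the same role as `nltk.ngrams`, and the token list is invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    # zip stops at the shortest slice, so only complete n-grams are produced
    return zip(*(tokens[i:] for i in range(n)))


tokens = "san vicente de avila y san vicente de paul".split()

# Ranking n-grams amounts to counting them and sorting by count
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts.most_common(3))
print(trigram_counts.most_common(1))
```

A list of k tokens yields k - n + 1 n-grams, so the nine tokens here produce eight bigrams and seven trigrams.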

## Let's start!
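The next function takes a text string as input and returns a cleaner list of words. It normalizes the Unicode text, strips accents through an ASCII encode/decode round trip, lowercases the result, removes punctuation, and filters out stop words. It does not perform lemmatization.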

In [None]:
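def basic_clean(text):
    """Return the lowercase, accent-free words of text, without punctuation or stop words."""
    # NFKD splits accented letters into a base letter plus a combining mark;
    # the ASCII round trip then drops the marks (and any other non-ASCII character)
    text = (unicodedata.normalize('NFKD', text)
      .encode('ascii', 'ignore')
      .decode('utf-8', 'ignore')
      .lower())
    # Remove punctuation, then split on whitespace
    words = re.sub(r'[^\w\s]', '', text).split()
    return [word for word in words if word not in customized_stop_words]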

In [None]:
words = basic_clean(text)
words[:10]
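
One thing to watch for: `basic_clean` strips accents from the text before comparing words with the stop word list, so any accented entry in that list (for instance "también", if present) can no longer match its accent-free form. A possible remedy, sketched below under the assumption that this behavior is unwanted, is to fold the stop words in the same way. The helper `fold_accents` and the three-word list are hypothetical and used only for illustration.

```python
import unicodedata


def fold_accents(word):
    """Strip diacritics the same way basic_clean does (NFKD, then ASCII only)."""
    return (unicodedata.normalize("NFKD", word)
            .encode("ascii", "ignore")
            .decode("ascii"))


# Small stand-in for customized_stop_words, for illustration only
example_stop_words = ["también", "más", "de"]
folded_stop_words = {fold_accents(w) for w in example_stop_words}

words = [fold_accents(w) for w in "más torres de piedra también".split()]

# Accented stop words never match the accent-free tokens
kept_unfolded = [w for w in words if w not in example_stop_words]
# After folding the stop words too, they are filtered as intended
kept_folded = [w for w in words if w not in folded_stop_words]

print(kept_unfolded)
print(kept_folded)
```

Note that this folding also turns "ñ" into "n", which merges some distinct Spanish words; whether that trade-off is acceptable depends on the analysis.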

## Let's generate the n-grams

In [None]:
(pd.Series(nltk.ngrams(words, 2)).value_counts())[:10]

In [None]:
(pd.Series(nltk.ngrams(words, 3)).value_counts())[:10]

In [None]:
bigrams_series = (pd.Series(nltk.ngrams(words, 2)).value_counts())[:15]
trigrams_series = (pd.Series(nltk.ngrams(words, 3)).value_counts())[:15]

In [None]:
bigrams_series.sort_values().plot.barh(color='yellow', width=.9, figsize=(12, 8))
plt.title('15 Most Frequent Bigrams')
plt.ylabel('Bigram')
plt.xlabel('Number of Occurrences')

## References

Liceras-Garrido, Raquel; Comino, Alba; Murrieta-Flores, Patricia (2020): Transcripción del Catálogo Monumental de España: Provincia de Ávila por Manuel Gómez Moreno (1900-1901). figshare. Dataset. https://doi.org/10.6084/m9.figshare.12006318.v1