# Part-of-Speech Tagging for Spanish

<div class="admonition note" name="html-admonition" style="background: lightblue; padding: 10px">
<p class="title">Note</p>
This section, "Working in Languages Beyond English," is authored by <a href="https://dlcl.stanford.edu/people/quinn-dombrowski/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
</div>

In this lesson, we're going to learn about the textual analysis methods *part-of-speech tagging* and *keyword extraction*. These methods will help us computationally parse sentences and better understand words in context.

---

## spaCy and Natural Language Processing (NLP)

To computationally identify parts of speech, we're going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.

To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. If you've used the preprocessing or named entity recognition notebooks for this language, you can skip the steps for installing spaCy and downloading the language model.

## Install spaCy

To use spaCy, we first need to install the library.

In [None]:
!pip install -U spacy

## Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)

We're also going to import the `Counter` class from the `collections` module for counting nouns, verbs, adjectives, etc., and the `pandas` library for organizing and displaying data (we're also raising the pandas default display settings for maximum rows and column width).

## Download Language Model

Next we need to download the Spanish-language model (`es_core_news_md`), which will be processing and making predictions about our texts. This is the model that was trained on the annotated ["AnCora" corpus](http://clic.ub.edu/corpus/). You can download the `es_core_news_md` model by running the cell below:

In [None]:
!python -m spacy download es_core_news_md

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Italian, Dutch, Greek, Norwegian, and Lithuanian. Languages such as Russian, Ukrainian, Thai, Chinese, Japanese, Korean and Vietnamese don't currently have their own NLP models. However, spaCy offers language and tokenization support for many of these languages with external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean or [Jieba](https://github.com/fxsjy/jieba) for Chinese.*

## Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

In [2]:
nlp = spacy.load('es_core_news_md')

## Create a Processed spaCy Document

Whenever we use spaCy, our first step will be to create a processed spaCy `document` with the loaded NLP model `nlp()`. Most of the heavy NLP lifting is done in this line of code. After processing, the `document` object will contain tons of juicy language data (named entities, sentence boundaries, parts of speech), and the rest of our work will be devoted to accessing this information.

In [3]:
filepath = '../texts/other-languages/es.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

## spaCy Part-of-Speech Tagging
The tags that spaCy uses for parts of speech are based on work done by [Universal Dependencies](https://universaldependencies.org/), an effort to create a set of part-of-speech tags that work across many different languages. Texts from various languages are annotated using this common set of tags, and contributed to a common repository that can be used to train models like the ones spaCy provides.

The Universal Dependencies page has information about the annotated corpora available for each language; it's worth looking into the corpora that were annotated for your language.

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝             |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |


Above is a POS chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy's POS tagging in action, we can pass a sentence from our sample `document` to the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent), calling `displacy.render()` with the `style=` parameter set to "dep" (short for dependency parsing).

## Get Part-Of-Speech Tags

To get part-of-speech tags for every word in a document, we have to iterate through all the tokens in the document and pull out the `.pos_` attribute for each token. We'll also pull out the `.lemma_` attribute, which gives us the un-inflected version of the word, and the `.dep_` attribute, which gives each token's syntactic dependency relation.


In [4]:
for token in document:
    print(token.lemma_, token.pos_, token.dep_)

INTRODUCCION PROPN ROOT
. PUNCT punct



 SPACE 
ECONOMÍA NOUN ROOT
POLÍTICA ADV flat
. PUNCT punct


 SPACE 
El DET det
sombrío ADJ amod
Prudhon PROPN nsubj
, PUNCT punct
imbuído ADJ ROOT
, PUNCT punct
sin ADP advmod
dudar INTJ fixed
, PUNCT punct
en ADP case
los DET det
ideo NOUN obl
de ADP case
lo DET det
Santos PROPN nmod

 SPACE 
Padres NOUN flat
de ADP case
lo DET det
Iglesia PROPN nmod
que SCONJ nsubj
predicar AUX acl
el DET det
desden ADP obj
por ADP case
lo DET det
bien NOUN obl

 SPACE 
terrenal ADJ amod
, PUNCT punct
decir AUX conj
que SCONJ mark
lo DET det
pobreza NOUN nsubj
ser AUX cop
uno DET det
ley NOUN ccomp
de ADP case
nuestro DET det
naturaleza NOUN nmod
, PUNCT punct

 SPACE 
ley NOUN appos
bajar ADP case
lo DET det
cual PRON obl
hemo AUX aux
ser VERB aux
constituídos VERB acl
, PUNCT punct
de ADP case
donde PRON obl
se PRON obj
deducir VERB advcl
que SCONJ mark
el DET det

 SPACE 
pauperismo NOUN nsubj
ser AUX cop
mal ADJ ccomp
que SCONJ mark
no ADV advmod


## Practicing with the example text
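When working with languages that have inflection, we typically use `token.lemma_` instead of the `token.text` that you'll find in the English examples. This is important when we're counting, so that differently-inflected forms of a word (e.g. masculine vs. feminine or singular vs. plural) aren't counted as if they were different words. Keep in mind that lemmas are themselves model predictions, so some will be wrong: the noun tally further down, for instance, includes `manir` and `casar`, which look like mis-lemmatized forms.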

In [5]:
filepath = "../texts/other-languages/es.txt"
document = nlp(open(filepath, encoding="utf-8").read())
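To make the point about lemmas concrete, here is a minimal sketch that counts the same handful of words twice, once by surface form and once by lemma. The `(text, lemma)` pairs are hand-written stand-ins for spaCy's `token.text` and `token.lemma_`, not output from the model.

```python
from collections import Counter

# Hand-written (text, lemma) pairs standing in for token.text and token.lemma_
tokens = [
    ("buena", "bueno"),
    ("buenos", "bueno"),
    ("bueno", "bueno"),
    ("pobres", "pobre"),
    ("pobre", "pobre"),
]

# Counting surface forms treats each inflected form as a separate word
text_counts = Counter(text for text, lemma in tokens)

# Counting lemmas groups the inflected forms together
lemma_counts = Counter(lemma for text, lemma in tokens)

print(text_counts.most_common())
print(lemma_counts.most_common())
```

With surface forms, every entry has a count of 1; with lemmas, `bueno` is counted three times and `pobre` twice.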

## Get Adjectives

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |

To extract and count the adjectives in the example text, we will follow the same model as above, except we'll add an `if` statement that will pull out words only if their POS label matches "ADJ".

```{admonition} Python Review!
:class: pythonreview
While we demonstrate how to extract parts of speech in the sections below, we're also going to reinforce some integral Python skills. Notice how we use `for` loops and `if` statements to `.append()` specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.
```

Here we make a list of the adjectives identified in the example text:

In [6]:
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.lemma_)

In [7]:
adjs

['sombrío',
 'imbuído',
 'terrenal',
 'mal',
 'desconsolar',
 'menesteroso',
 'ingrato',
 'humano',
 'trayéndoles',
 'plausible',
 'nuevo',
 'eficaz',
 'innúmero',
 'multiplicar',
 'social',
 'venerable',
 'entendido',
 'grave',
 'político',
 'moralizador',
 'colectivo',
 'precioso',
 'especulativo',
 'individual',
 'concluir',
 'abstracto',
 'ilustrar',
 'elevar',
 'intelectual',
 'americano',
 'creador',
 'lujurioso',
 'invocar',
 'presentar',
 'atrayente',
 'económico',
 'natural',
 'ingénua',
 'social',
 'noble',
 'fuerte',
 'privilegiar',
 'prestigioso',
 'suficiente',
 'soñador',
 'adusto',
 'economista',
 'último',
 'marcar',
 'CHAMUSQUINAS',
 'extremo',
 'armar',
 '\ufeff1',
 'último',
 'último',
 'plegaria',
 'cariñoso',
 'pálido',
 'mojar',
 '\ufeff1',
 'aguardar',
 'aurora',
 'dramático',
 'párias',
 'forzar',
 'diario',
 'encorvar',
 'implacable',
 'terrible',
 'sanar',
 'fuerte',
 'proponer',
 'sentar',
 'malo',
 'querer',
 'mejorar',
 'largo',
 'Uncido',
 '

Then we count the unique adjectives in this list with the `Counter()` class:

In [8]:
adjs_tally = Counter(adjs)

In [9]:
adjs_tally.most_common()

[('bueno', 20),
 ('solo', 15),
 ('último', 11),
 ('bello', 11),
 ('fuerte', 10),
 ('pobre', 10),
 ('primero', 10),
 ('único', 9),
 ('jóven', 9),
 ('gran', 8),
 ('misterioso', 8),
 ('mismo', 8),
 ('nuevo', 7),
 ('dulce', 7),
 ('viejo', 7),
 ('nacional', 7),
 ('francés', 7),
 ('blanco', 7),
 ('\ufeff1', 6),
 ('largo', 6),
 ('largar', 6),
 ('necesario', 6),
 ('verdadero', 6),
 ('desconocer', 6),
 ('negro', 6),
 ('grande', 6),
 ('natural', 5),
 ('sentar', 5),
 ('amable', 5),
 ('mejor', 5),
 ('amar', 5),
 ('humano', 4),
 ('social', 4),
 ('intelectual', 4),
 ('noble', 4),
 ('pálido', 4),
 ('malo', 4),
 ('gozoso', 4),
 ('triste', 4),
 ('alto', 4),
 ('central', 4),
 ('lindo', 4),
 ('lejano', 4),
 ('azul', 4),
 ('inmenso', 4),
 ('profundo', 4),
 ('rico', 4),
 ('antiguo', 4),
 ('benéfico', 4),
 ('cuartar', 4),
 ('sencillo', 4),
 ('dulcísimo', 4),
 ('delicioso', 4),
 ('cerrar', 4),
 ('precioso', 3),
 ('elevar', 3),
 ('presentar', 3),
 ('querer', 3),
 ('pegar', 3),
 ('editorial', 3),
 ('pl√
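The full tally can get long. `Counter.most_common()` also accepts an optional number, in which case it returns only that many of the most frequent items. A minimal sketch, using a small made-up list (`sample_adjs`) in place of the real `adjs` list:

```python
from collections import Counter

# A small made-up list standing in for the `adjs` list built above
sample_adjs = ["bueno", "solo", "bueno", "bello", "solo", "bueno"]
sample_tally = Counter(sample_adjs)

# Passing a number returns only that many of the most frequent items
top_two = sample_tally.most_common(2)
print(top_two)
```

In the notebook, the equivalent call would be something like `adjs_tally.most_common(20)`.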

Then we make a dataframe from this list:

In [10]:
df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]

|    | adj     | count |
|:--:|:-------:|:-----:|
| 0  | bueno   | 20    |
| 1  | solo    | 15    |
| 2  | último  | 11    |
| 3  | bello   | 11    |
| 4  | fuerte  | 10    |
| 5  | pobre   | 10    |
| 6  | primero | 10    |
| 7  | único   | 9     |
| 8  | jóven   | 9     |
| 9  | gran    | 8     |
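Entries such as `'\ufeff1'` in the adjective tally contain U+FEFF, the Unicode byte-order mark, which suggests that the source file has stray marks embedded in it. One possible cleanup, sketched here on a made-up snippet rather than the real file, is to remove those characters before passing the text to `nlp()`. Whether this changes the tags of neighboring tokens is something you would need to check on your own text.

```python
# A made-up snippet with stray byte-order marks, standing in for the file contents
raw_text = "\ufeffINTRODUCCION.\n\n\ufeff1 El pauperismo es un mal."

# Remove every byte-order mark before handing the text to nlp()
clean_text = raw_text.replace("\ufeff", "")

print(repr(clean_text))
```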


## Get Nouns

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |

To extract and count nouns, we can follow the same model as above, except we will change our `if` statement to check for POS labels that match "NOUN".

In [11]:
nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.lemma_)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]

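|    | noun     | count |
|:--:|:--------:|:-----:|
| 0  | señor    | 55    |
| 1  | manir    | 37    |
| 2  | casar    | 30    |
| 3  | hora     | 28    |
| 4  | vida     | 24    |
| 5  | padre    | 23    |
| 6  | hombre   | 22    |
| 7  | hijo     | 22    |
| 8  | voz      | 22    |
| 9  | trabajar | 21    |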


## Get Verbs

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| VERB  | verb                      | run, runs, running, eat, ate, eating          |

To extract and count verbs, we can follow a similar-ish model to the examples above. This time, however, we're going to make our code even more economical and efficient (while still changing our `if` statement to match the POS label "VERB").

```{admonition} Python Review!
:class: pythonreview
We can use a [*list comprehension*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python/More-Lists-Loops.html#List-Comprehensions) to get our list of verbs in a single line of code! Closely examine the first line of code below:
```

In [12]:
verbs = [token.lemma_ for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]

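|    | verb  | count |
|:--:|:-----:|:-----:|
| 0  | ser   | 69    |
| 1  | haber | 52    |
| 2  | hacer | 42    |
| 3  | decir | 29    |
| 4  | estar | 25    |
| 5  | dar   | 18    |
| 6  | tener | 15    |
| 7  | tomar | 15    |
| 8  | dejar | 15    |
| 9  | ir    | 15    |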
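The adjective, noun, and verb sections all repeat the same filter-then-count pattern, so it could be wrapped in one small function. The sketch below is one way to do that; `count_lemmas_by_pos` and `FakeToken` are hypothetical names introduced for illustration, and `FakeToken` only imitates the two spaCy token attributes the function needs.

```python
from collections import Counter, namedtuple

# Stand-in for spaCy tokens: anything with .lemma_ and .pos_ attributes works
FakeToken = namedtuple("FakeToken", ["lemma_", "pos_"])

def count_lemmas_by_pos(tokens, pos_label):
    """Tally the lemmas of all tokens whose coarse POS tag equals pos_label."""
    return Counter(token.lemma_ for token in tokens if token.pos_ == pos_label)

sample_tokens = [
    FakeToken("ser", "VERB"),
    FakeToken("vida", "NOUN"),
    FakeToken("ser", "VERB"),
    FakeToken("bueno", "ADJ"),
    FakeToken("decir", "VERB"),
]

verb_counts = count_lemmas_by_pos(sample_tokens, "VERB")
print(verb_counts.most_common())
```

Because real spaCy tokens expose the same `.lemma_` and `.pos_` attributes, a call such as `count_lemmas_by_pos(document, "NOUN")` should work in the notebook as well.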


# Keyword Extraction

## Get Sentences with Keyword

spaCy can also identify sentences in a document. To access sentences, we can iterate through `document.sents` and pull out the `.text` of each sentence.

We can use spaCy's sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below. Note that the function looks for the keyword exactly as you type it (ignoring capitalization), so it won't find differently-inflected forms (case, number, gender, etc.). As a Spanish example, if you use `bueno` as the keyword, it won't match `buena`. Because the check is a simple substring match, however, it will match `buenos`, including the `Buenos` in `Buenos Aires`.

With the function `find_sentences_with_keyword()`, we will iterate through `document.sents` and pull out any sentence that contains a particular "keyword." Then we will display these sentences with the keywords bolded.

In [15]:
import re
from IPython.display import Markdown, display

In [16]:
def find_sentences_with_keyword(keyword, document):
    
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            
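            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            #(re.escape treats any special regex characters in the keyword literally;
            #each match is replaced by the keyword as typed, so "Buenos" is displayed as "**bueno**s")
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(re.escape(keyword), f"**{keyword}**", sentence, flags=re.IGNORECASE)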
            
            display(Markdown(sentence))

In [17]:
find_sentences_with_keyword(keyword="bueno", document=document)

Á «LA **bueno**S AIRES»

Enigma insoluble es para mí este jóven, tan franco, sin embargo, y tan **bueno**...

Cumplido en él, el divino misterio, de rodillas ante el altar, el niño tiende la mano sobre el Sagrado Libro y jura ser virtuoso y **bueno**.

Suscribíala el escribano D..., uno de los hombres más honorables de **bueno**s Aires.

En el momento que Mauricio preparaba la realizacion de tan lisongero propósito, una carta de **bueno**s Aires, portadora de fatales nuevas, vino á destruir sus proyectos y sus esperanzas.

Además, en **bueno**s Aires que se agita á impulsos de un inmenso progreso, podrá Vd. con el trabajo rehacer su fortuna».

Por los diarios de **bueno**s Aires y su propia correspondencia, éranle conocidas sus desgracias y su noble abnegacion.

hé ahí, por ejemplo, un modelo de valor y de resignacion: esta jóven hija de **bueno**s Aires, vino hace tres años para perfeccionarse en sus estudios musicales.

Su padre, ingeniero en comision, regresó á **bueno**s Aires dejándola en casa de una parienta lejana.

Rendidos al padre los últimos deberes, vuelve á **bueno**s Aires, donde vá á buscar en el trabajo la subsistencia......

De la **bueno**s Aires de sus recuerdos, solo reconocía el nombre: tan grande y bella, la gloriosa metrópoli habíase tornado.

La **bueno**s Aires», poderosa asociacion que cuenta en su seno á los más fuertes capitalistas nacionales y extranjeros.

Una vez instalado, buscó trabajo en uno de los diarios más acreditados de **bueno**s Aires.

--**bueno** ó malo, déjelas Vd. en él.

--El señor Eduardo M. Coll, hijo mio, es Gerente de «La **bueno**s Aires», Compañía de Seguros en la que yo mismo soy accionista ¿qué será ello?

XXX   Antes de aquel término el Gerente de «La **bueno**s Aires» recibía una citacion del Banco Nacional con motivo del aviso que por su órden registraban «La Prensa» y «La Tribuna Nacional».

Acudió el Gerente, y supo que allí se hallaba, depositado por el señor Cárlos Ridel, un paquete cerrado, que en Junio de 1888 debía ser entregado á la Compañía de Seguros «La **bueno**s Aires».

á la órden de Cárlos Ridel y endosada por éste á «La **bueno**s Aires», como la segunda cuota que debía pagar por su póliza.

«Entre estas, las Compañías de Seguros sónme especialmente simpáticas, sobretodo, «La **bueno**s Aires», por su importancia y valiosa organizacion.

En el seno de una de esas asociaciones tutelares, «La **bueno**s Aires», la Providencia guardaba un tesoro que á su hora, hizo surgir para recompensar la abnegacion filial y dar la felicidad á los que, creyendo en ella, esperaban.
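In the results above, most matches are actually `Buenos Aires`, and the bolded text shows the keyword as typed rather than as it appears in the sentence. The sketch below shows one possible variation on the highlighting step: it keeps the sentence's own capitalization and can optionally match whole words only. The helper name `bold_keyword` and its `whole_word` parameter are hypothetical, and the example sentence is made up from words in the output.

```python
import re

def bold_keyword(sentence, keyword, whole_word=False):
    """Wrap each match of keyword in ** **, keeping the sentence's own capitalization."""
    pattern = re.escape(keyword)
    if whole_word:
        # \b restricts matches to complete words, so "bueno" skips "Buenos"
        pattern = rf"\b{pattern}\b"
    # The lambda re-inserts the matched text itself rather than the keyword as typed
    return re.sub(pattern, lambda match: f"**{match.group(0)}**", sentence, flags=re.IGNORECASE)

example = "Vuelve á Buenos Aires, virtuoso y bueno."
print(bold_keyword(example, "bueno"))
print(bold_keyword(example, "bueno", whole_word=True))
```

The first call bolds both `Bueno` (inside `Buenos`) and `bueno`; the second bolds only the standalone `bueno`.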

## Get Keyword in Context

We can also find out about a keyword's more immediate context (its neighboring words to the left and right), and we can fine-tune our search with POS tagging.

To do so, we will first create a list of what are called *ngrams*. An "ngram" is any sequence of *n* tokens in a text. Ngrams are an important concept in computational linguistics and NLP. (Have you ever played with [Google's *Ngram* Viewer](https://books.google.com/ngrams)?)

Below we're going to make a list of *bigrams*, that is, all the two-word combinations from the sample text. We're going to use these bigrams to find the neighboring words that appear alongside particular keywords.

In [18]:
#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

In [19]:
#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams

In [20]:
bigrams = get_bigrams(tokens_and_labels)

Let's take a peek at the bigrams:

In [21]:
bigrams[5:20]

[[('imbuído', 'ADJ'), ('sin', 'ADP')],
 [('sin', 'ADP'), ('duda', 'INTJ')],
 [('duda', 'INTJ'), ('en', 'ADP')],
 [('en', 'ADP'), ('las', 'DET')],
 [('las', 'DET'), ('ideas', 'NOUN')],
 [('ideas', 'NOUN'), ('de', 'ADP')],
 [('de', 'ADP'), ('los', 'DET')],
 [('los', 'DET'), ('Santos', 'PROPN')],
 [('Santos', 'PROPN'), ('Padres', 'NOUN')],
 [('Padres', 'NOUN'), ('de', 'ADP')],
 [('de', 'ADP'), ('la', 'DET')],
 [('la', 'DET'), ('Iglesia', 'PROPN')],
 [('Iglesia', 'PROPN'), ('que', 'SCONJ')],
 [('que', 'SCONJ'), ('predicaban', 'AUX')],
 [('predicaban', 'AUX'), ('el', 'DET')]]
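The same sliding-window idea can also be written more compactly with `zip()`, which pairs each item with the one that follows it. This is only an alternative sketch for the two-word case, not a replacement for `get_bigrams()`; the sample pairs are copied from the output above.

```python
# A few (word, POS) pairs copied from the bigram output above
tagged_words = [("las", "DET"), ("ideas", "NOUN"), ("de", "ADP"), ("los", "DET")]

# Pair each item with the one that follows it; zip stops at the shorter input,
# so a list of 4 items yields 3 bigrams
bigrams_with_zip = [list(pair) for pair in zip(tagged_words, tagged_words[1:])]

for bigram in bigrams_with_zip:
    print(bigram)
```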

Now that we have our list of bigrams, we're going to make a function `get_neighbor_words()`. This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the `pos_label` parameter.

In [22]:
def get_neighbor_words(keyword, bigrams, pos_label = None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If no pos_label was given, or the neighbor word matches the pos_label, append it to the master list
                    if pos_label is None or label == pos_label:
                        neighbor_words.append(word.lower())
    
    return Counter(neighbor_words).most_common()

In [25]:
get_neighbor_words("bien", bigrams)

[('más', 2),
 ('que', 2),
 ('así', 1),
 ('fuerte', 1),
 ('y', 1),
 ('ya', 1),
 ('invisible', 1),
 ('pudiera', 1),
 ('distraido', 1),
 ('lo', 1),
 ('te', 1),
 ('cosa', 1),
 ('duerme', 1),
 ('charlábamos', 1),
 ('vá', 1),
 ('no', 1),
 ('si', 1),
 ('seguro', 1),
 ('mi', 1),
 ('amada', 1),
 ('tan', 1),
 ('esta', 1),
 ('palabra', 1),
 ('estaba', 1)]

In [26]:
get_neighbor_words("bien", bigrams, pos_label='VERB')

[('pudiera', 1), ('duerme', 1)]
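`get_neighbor_words()` pools the words on both sides of the keyword. If you want to know which words come before it and which come after, one option is to tally the two sides separately. The sketch below assumes a list of `(word, POS)` pairs shaped like `tokens_and_labels`; the function name `get_left_and_right_neighbors` and the sample data are made up for illustration.

```python
from collections import Counter

def get_left_and_right_neighbors(keyword, tagged_words):
    """Return separate tallies of the words directly before and after keyword."""
    keyword = keyword.lower()
    words = [word.lower() for word, label in tagged_words]
    left, right = Counter(), Counter()
    for index, word in enumerate(words):
        if word == keyword:
            # Skip the left neighbor at the very start and the right neighbor at the very end
            if index > 0:
                left[words[index - 1]] += 1
            if index < len(words) - 1:
                right[words[index + 1]] += 1
    return left, right

# Made-up (word, POS) pairs in the same shape as tokens_and_labels
sample = [("tan", "ADV"), ("bien", "ADV"), ("que", "SCONJ"),
          ("más", "ADV"), ("bien", "ADV"), ("que", "SCONJ")]
left, right = get_left_and_right_neighbors("bien", sample)
print(left.most_common())
print(right.most_common())
```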

## Your Turn!

Try out `find_sentences_with_keyword()` and `get_neighbor_words()` with your own keywords of interest.

In [None]:
find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)

In [None]:
get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)