# Part-of-Speech Tagging for Portuguese

<div class="admonition note" name="html-admonition" style="background: lightblue; padding: 10px">
<p class="title">Note</p>
This section, "Working in Languages Beyond English," is co-authored with <a href="http://www.quinndombrowski.com/">Quinn Dombrowski</a>, the Academic Technology Specialist at Stanford University and a leading voice in multilingual digital humanities. I'm grateful to Quinn for helping expand this textbook to serve languages beyond English. 
</div>

In this lesson, we're going to learn about the textual analysis methods *part-of-speech tagging* and *keyword extraction* for Portuguese texts. These methods will help us computationally parse sentences and better understand words in context.

---

## spaCy and Natural Language Processing (NLP)

To computationally identify parts of speech, we're going to use the natural language processing library spaCy. For a more extensive introduction to NLP and spaCy, see the previous lesson.

To parse sentences, spaCy relies on machine learning models that were trained on large amounts of labeled text data. If you've used the preprocessing or named entity recognition notebooks for this language, you can skip the steps for installing spaCy and downloading the language model.

## Install spaCy

To use spaCy, we first need to install the library.

In [None]:
!pip install -U spacy

## Import Libraries

Then we're going to import `spacy` and `displacy`, a special spaCy module for visualization.

In [1]:
import spacy
from spacy import displacy
from collections import Counter
import pandas as pd
pd.set_option("display.max_rows", 400)
pd.set_option("display.max_colwidth", 400)

We're also importing the `Counter` class (from Python's `collections` module) for counting nouns, verbs, adjectives, etc., and the `pandas` library for organizing and displaying data. We're also raising the pandas default display limits for the maximum number of rows and the maximum column width.

## Download Language Model

Next we need to download the Portuguese-language model (`pt_core_news_md`), which will be processing and making predictions about our texts. This is the model that was trained on both European (CETEMPúblico) and Brazilian (CETENFolha) variants. You can download the `pt_core_news_md` model by running the cell below:

In [None]:
!python -m spacy download pt_core_news_md

*Note: spaCy offers [models for other languages](https://spacy.io/usage/models#languages) including German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian*.  

*spaCy offers language and tokenization support for other languages via external dependencies, such as [KoNLPy](https://github.com/konlpy/konlpy) for Korean.*

## Load Language Model

Once the model is downloaded, we need to load it with `spacy.load()` and assign it to the variable `nlp`.

In [2]:
nlp = spacy.load('pt_core_news_md')

## Create a Processed spaCy Document

Whenever we use spaCy, our first step will be to create a processed spaCy `document` with the loaded NLP model `nlp()`. Most of the heavy NLP lifting is done in this line of code. After processing, the `document` object will contain tons of juicy language data (named entities, sentence boundaries, parts of speech), and the rest of our work will be devoted to accessing this information.

In [3]:
filepath = '../texts/pt.txt'
text = open(filepath, encoding='utf-8').read()
document = nlp(text)

## spaCy Part-of-Speech Tagging
The tags that spaCy uses for part-of-speech are based on work done by [Universal Dependencies](https://universaldependencies.org/), an effort to create a set of part-of-speech tags that work across many different languages. Texts from various languages are annotated using this common set of tags and contributed to a common repository that can be used to train models like spaCy's.

The Universal Dependencies page has information about the annotated corpora available for each language; it's worth looking into the corpora that were annotated for your language.

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |
| ADP   | adposition                | in, to, during                                |
| ADV   | adverb                    | very, tomorrow, down, where, there            |
| AUX   | auxiliary                 | is, has (done), will (do), should (do)        |
| CONJ  | conjunction               | and, or, but                                  |
| CCONJ | coordinating conjunction  | and, or, but                                  |
| DET   | determiner                | a, an, the                                    |
| INTJ  | interjection              | psst, ouch, bravo, hello                      |
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |
| NUM   | numeral                   | 1, 2017, one, seventy-seven, IV, MMXIV        |
| PART  | particle                  | ’s, not                                       |
| PRON  | pronoun                   | I, you, he, she, myself, themselves, somebody |
| PROPN | proper noun               | Mary, John, London, NATO, HBO                 |
| PUNCT | punctuation               | ., (, ), ?                                    |
| SCONJ | subordinating conjunction | if, while, that                               |
| SYM   | symbol                    | $, %, §, ©, +, −, ×, ÷, =, :), 😝               |
| VERB  | verb                      | run, runs, running, eat, ate, eating          |
| X     | other                     | sfpksdpsxmsa                                  |
| SPACE | space                     |                                               |


Above is a POS chart taken from [spaCy's website](https://spacy.io/api/annotation#named-entities), which shows the different parts of speech that spaCy can identify as well as their corresponding labels. To quickly see spaCy's POS tagging in action, we can use the [spaCy module `displacy`](https://spacy.io/usage/visualizers#ent) on our sample `document` with the `style=` parameter set to "dep" (short for dependency parsing), which draws each word with its POS tag beneath it and arcs for the dependency relations between words.

## Get Part-Of-Speech Tags

To get part-of-speech tags for every word in a document, we iterate through all the tokens in the document and pull out the `.pos_` attribute for each token. We'll also pull out the `.lemma_` attribute, which gives us the uninflected (dictionary) form of the word, and the `.dep_` attribute, which gives us the token's syntactic dependency relation to the other words in its sentence.


In [4]:
for token in document:
    print(token.lemma_, token.pos_, token.dep_)

PRIMEIRA INTJ det
PARTE PUNCT ROOT





 SPACE 
UMA DET nummod
HISTORIA NOUN ROOT
VERDADEIRA ADJ ROOT



 SPACE 
I ADJ ROOT


 SPACE 
Era AUX cop
umar DET det
physionomia NOUN ROOT
incaracteristica ADJ amod
, PUNCT punct
apagar VERB acl
, PUNCT punct
tristissima ADJ amod
. PUNCT punct


 SPACE 
Não ADV advmod
se PRON expl
poder VERB ROOT
dizer VERB xcomp
o DET det
idade NOUN obj
que PRON obj
ter VERB acl:relcl
, PUNCT punct
nem CCONJ cc
mesmo ADV advmod
se PRON nsubj
ter VERB conj
idade NOUN obj
. PUNCT punct


 SPACE 
Tanto ADV advmod
poder VERB ROOT
ter VERB xcomp
trintar NUM obj
ou CCONJ cc
quarentar NUM conj
comer ADP case
setenta NUM nummod
annos NOUN nmod
. PUNCT punct


 SPACE 
Curvado VERB ROOT
pelar DET case
idade NOUN obl:agent
ou CCONJ cc
pelos ADP case
desgosto NOUN conj
? PUNCT punct
Encanecido VERB ROOT
porque SCONJ mark
o DET det
annos PROPN nsubj
ter VERB advcl

 SPACE 
correr VERB advcl
por ADP case
sobrar ADP case
o DET det
cabeça NOUN obl:agent
d'elle ADJ appos
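
The output above includes rows for whitespace (`SPACE`) and punctuation (`PUNCT`), which are rarely useful when studying vocabulary. Below is a minimal sketch of one way to skip them. It relies only on the `.lemma_` and `.pos_` attributes used above; the helper name `content_rows` and the stand-in tokens built with `SimpleNamespace` are assumptions made so that the example runs without a language model. In the notebook, you would pass `document` instead of `sample_tokens`.

```python
from types import SimpleNamespace

def content_rows(tokens, skip=("SPACE", "PUNCT")):
    """Return (lemma, pos) pairs, leaving out tokens whose POS label is in `skip`."""
    return [(token.lemma_, token.pos_) for token in tokens if token.pos_ not in skip]

# Stand-in tokens that mimic the attributes of spaCy tokens (hypothetical data,
# hand-copied from the output above rather than produced by the model here)
sample_tokens = [
    SimpleNamespace(lemma_="Era", pos_="AUX"),
    SimpleNamespace(lemma_="umar", pos_="DET"),
    SimpleNamespace(lemma_="physionomia", pos_="NOUN"),
    SimpleNamespace(lemma_=",", pos_="PUNCT"),
    SimpleNamespace(lemma_="\n\n", pos_="SPACE"),
]

for lemma, pos in content_rows(sample_tokens):
    print(lemma, pos)
```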


## Practicing with the example text
When working with languages that have inflection, we typically use `token.lemma_` instead of `token.text` like you'll find in the English examples. This is important when we're counting, so that differently-inflected forms of a word (e.g. masculine vs. feminine or singular vs. plural) aren't counted as if they were different words.

In [5]:
filepath = "../texts/pt.txt"
document = nlp(open(filepath, encoding="utf-8").read())
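
To see why lemmas matter for counting, here is a small sketch that tallies the same words two ways. The `(text, lemma)` pairs are hand-written for illustration (they are not output from the model, whose lemmas can differ), and the variable names are assumptions.

```python
from collections import Counter

# Hand-written (surface form, lemma) pairs for illustration only
pairs = [
    ("bom", "bom"),
    ("boa", "bom"),
    ("bons", "bom"),
    ("boas", "bom"),
    ("velha", "velho"),
]

# Counting surface forms treats each inflected form as a separate word
text_tally = Counter(text for text, lemma in pairs)

# Counting lemmas groups the inflected forms under one entry
lemma_tally = Counter(lemma for text, lemma in pairs)

print(text_tally.most_common())
print(lemma_tally.most_common())
```

With surface forms, every entry has a count of one; with lemmas, the four forms of "bom" are counted together.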

## Get Adjectives

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| ADJ   | adjective                 | big, old, green, incomprehensible, first      |

To extract and count the adjectives in the example text, we will follow the same model as above, except we'll add an `if` statement that will pull out words only if their POS label matches "ADJ."

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title">Python Review</p>

While we demonstrate how to extract parts of speech in the sections below, we're also going to reinforce some integral Python skills. Notice how we use `for` loops and `if` statements to `.append()` specific words to a list. Then we count the words in the list and make a pandas dataframe from the list.
</div>

Here we make a list of the adjectives identified in the example text:

In [5]:
adjs = []
for token in document:
    if token.pos_ == 'ADJ':
        adjs.append(token.lemma_)

In [6]:
adjs

['VERDADEIRA',
 'I',
 'incaracteristica',
 'tristissima',
 "d'elle",
 'debeis',
 'rachitica',
 'incompleto',
 'azues',
 'gorgeadas',
 'escarlate',
 'voluptuoso',
 'estranho',
 'doloroso',
 'primeiro',
 'fraco',
 'inerme',
 'pobre',
 'dependente',
 'rude',
 'moral',
 'duro',
 'opulento',
 'parasito',
 'voluntarios',
 'inconsciente',
 'propria',
 'mesmo',
 'inquieto',
 'medroso',
 'primeiro',
 'alto',
 'plethorico',
 'grosso',
 'brutaes',
 'farto',
 'submisso',
 'bonito',
 'delgado',
 'flexivel',
 'branco',
 'ideal',
 'inglez',
 'gracioso',
 'delicado',
 'metalicos',
 'finar',
 'esguio',
 'branco',
 'pequeno',
 'fraco',
 'grotesco',
 'informar',
 'inteirar',
 'inteirar',
 'enfermar',
 'intenso',
 'dilacerante',
 'estranhar',
 'brutaes',
 'bravo',
 'altivo',
 'frio',
 'desdenhoso',
 'repugnante',
 'imperturbavel',
 'olympica',
 'glacial',
 'original',
 'ironico',
 'irregular',
 'obscuro',
 'infeliz',
 'velho',
 'contrariar',
 'grotesco',
 'asperas',
 'hostil',
 'extravagante',
 'escossez'

Then we count how many times each unique adjective appears in this list with the `Counter()` class:

In [7]:
adjs_tally = Counter(adjs)

In [8]:
adjs_tally.most_common()

[('grande', 87),
 ('bom', 65),
 ('pequeno', 55),
 ('pobre', 48),
 ('velho', 48),
 ('cheio', 39),
 ('primeiro', 38),
 ('alto', 36),
 ('mesmo', 31),
 ('pequenino', 31),
 ('feliz', 30),
 ('branco', 29),
 ('doce', 29),
 ('rico', 28),
 ('novo', 25),
 ("d'elle", 23),
 ('elegante', 22),
 ('moderno', 21),
 ('longo', 20),
 ('doloroso', 19),
 ('forte', 19),
 ('verdadeiro', 18),
 ('intimar', 18),
 ('_', 18),
 ('querido', 18),
 ('bonito', 17),
 ('triste', 17),
 ('altivo', 16),
 ('brilhante', 16),
 ('fino', 16),
 ('mau', 15),
 ("d'ella", 15),
 ('gracioso', 14),
 ('precisar', 14),
 ('antigo', 14),
 ('superior', 14),
 ('obscuro', 13),
 ('melhor', 13),
 ('humildar', 13),
 ('capaz', 13),
 ('formoso', 13),
 ('simples', 13),
 ('negro', 13),
 ('proprio', 13),
 ('raro', 13),
 ('puro', 13),
 ('luminoso', 12),
 ('adoravel', 12),
 ('nervoso', 12),
 ('querer', 12),
 ('ultimar', 12),
 ('louro', 11),
 ('completar', 11),
 ('largo', 11),
 ('meigo', 11),
 ('irresistivel', 11),
 ('feio', 10),
 ('dignar', 10),
 ('uni

Then we make a dataframe from this list:

In [9]:
df = pd.DataFrame(adjs_tally.most_common(), columns=['adj', 'count'])
df[:100]

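|     | adj       | count |
|:---:|:---------:|:-----:|
| 0   | grande    | 87    |
| 1   | bom       | 65    |
| 2   | pequeno   | 55    |
| 3   | pobre     | 48    |
| 4   | velho     | 48    |
| 5   | cheio     | 39    |
| 6   | primeiro  | 38    |
| 7   | alto      | 36    |
| 8   | mesmo     | 31    |
| 9   | pequenino | 31    |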


## Get Nouns

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| NOUN  | noun                      | girl, cat, tree, air, beauty                  |

To extract and count nouns, we can follow the same model as above, except we will change our `if` statement to check for POS labels that match "NOUN".

In [10]:
nouns = []
for token in document:
    if token.pos_ == 'NOUN':
        nouns.append(token.lemma_)

nouns_tally = Counter(nouns)

df = pd.DataFrame(nouns_tally.most_common(), columns=['noun', 'count'])
df[:100]

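|     | noun   | count |
|:---:|:------:|:-----:|
| 0   | vidar  | 131   |
| 1   | dia    | 130   |
| 2   | casar  | 109   |
| 3   | mulher | 105   |
| 4   | homem  | 97    |
| 5   | filho  | 91    |
| 6   | olho   | 88    |
| 7   | mãe    | 75    |
| 8   | cousa  | 74    |
| 9   | pae    | 68    |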


## Get Verbs

| POS   | Description               | Examples                                      |
|:-----:|:-------------------------:|:---------------------------------------------:|
| VERB  | verb                      | run, runs, running, eat, ate, eating          |

To extract and count verbs, we can follow a similar model to the examples above. This time, however, we're going to make our code even more economical and efficient (while still changing our `if` statement to match the POS label "VERB").

<div class="admonition pythonreview" name="html-admonition" style="background: lightgreen; padding: 10px">
<p class="title">Python Review</p>

We can use a [*list comprehension*](https://melaniewalsh.github.io/Intro-Cultural-Analytics/Python/More-Lists-Loops.html#List-Comprehensions) to get our list of verbs in a single line of code! Closely examine the first line of code below:
</div>

In [11]:
verbs = [token.lemma_ for token in document if token.pos_ == 'VERB']

verbs_tally = Counter(verbs)

df = pd.DataFrame(verbs_tally.most_common(), columns=['verb', 'count'])
df[:100]

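|     | verb  | count |
|:---:|:-----:|:-----:|
| 0   | ter   | 316   |
| 1   | fazer | 166   |
| 2   | dizer | 148   |
| 3   | saber | 144   |
| 4   | haver | 111   |
| 5   | dar   | 104   |
| 6   | ver   | 87    |
| 7   | ser   | 83    |
| 8   | poder | 80    |
| 9   | vir   | 74    |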
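
The adjective, noun, and verb cells all follow one pattern, so it can be wrapped in a small function. This is a sketch, not part of spaCy: `count_lemmas_by_pos` is a hypothetical helper name, and the stand-in tokens exist only so that the example runs on its own. In the notebook, `document` would take their place.

```python
from collections import Counter
from types import SimpleNamespace

def count_lemmas_by_pos(tokens, pos_label):
    """Count the lemmas of all tokens whose .pos_ equals pos_label."""
    return Counter(token.lemma_ for token in tokens if token.pos_ == pos_label)

# Stand-in tokens (hypothetical data) exposing the same attributes as spaCy tokens
sample_tokens = [
    SimpleNamespace(lemma_="ter", pos_="VERB"),
    SimpleNamespace(lemma_="idade", pos_="NOUN"),
    SimpleNamespace(lemma_="ter", pos_="VERB"),
    SimpleNamespace(lemma_="dizer", pos_="VERB"),
    SimpleNamespace(lemma_="triste", pos_="ADJ"),
]

verbs_tally = count_lemmas_by_pos(sample_tokens, "VERB")
print(verbs_tally.most_common())
```

The returned `Counter` can be handed to `pd.DataFrame()` through `.most_common()`, exactly as in the cells above.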


# Keyword Extraction

## Get Sentences with Keyword

spaCy can also identify sentences in a document. To access sentences, we can iterate through `document.sents` and pull out the `.text` of each sentence.

We can use spaCy's sentence-parsing capabilities to extract sentences that contain particular keywords, such as in the function below. Note that the function assumes that the keyword provided will be exactly the same as it appears in the text (e.g. matching all inflection for case, number, gender, etc.). For example, if you use `bom` as the keyword, it won't match `boa` or `bons`.

With the function `find_sentences_with_keyword()`, we will iterate through `document.sents` and pull out any sentence that contains a particular "keyword." Then we will display these sentences with the keywords bolded.

In [12]:
import re
from IPython.display import Markdown, display

In [13]:
def find_sentences_with_keyword(keyword, document):
    
    #Iterate through all the sentences in the document and pull out the text of each sentence
    for sentence in document.sents:
        sentence = sentence.text
        
        #Check to see if the keyword is in the sentence (and ignore capitalization by making both lowercase)
        if keyword.lower() in sentence.lower():
            
            #Use the regex library to replace linebreaks and to make the keyword bolded, again ignoring capitalization
            sentence = re.sub('\n', ' ', sentence)
            sentence = re.sub(f"{keyword}", f"**{keyword}**", sentence, flags=re.IGNORECASE)
            
            display(Markdown(sentence))

In [14]:
find_sentences_with_keyword(keyword="bom", document=document)

Todos o conheciam, todos o repetiam em alto e **bom** som para que elle o não ignorasse, mas _ella_ amava-o;

Tudo que houvera **bom** na sua vida lhe tinha vindo d'ella.

a imaginar quanto seria **bom** ser muito rico, viver na alta roda, n'aquella esphera aristocratica e distincta em que se não trabalha, em que se falla de um modo especial e caracteristico, com termos escolhidos, com inflexões muito mais suaves, com uns certos desdens que d'antes lhe pareciam ridiculos e que lhe estavam agora parecendo superiormente requintados.

vêl-o porque soube que tem sido muito **bom** para Thadeu, excellente mesmo.

Sempre é **bom**.

Vê como elle é **bom**.

Henrique pedira com tão meigas e sentidas palavras a Thadeu para que elle os não deixasse, que depois da viagem de rigor feita pelos noivos á Suissa e á Italia o _**bom** cão fiel_ foi viver junto d'elles.

Henrique sempre **bom**, sério, pensativo, de uma indulgencia de forte, de uma doçura de heroe.

--Obrigado, vae **bom**!

--Não que elle gosta do que é **bom**!

Tornava o **bom** do lavrador, com as lagrimas nos olhos.

Que eram horas d'elle estar a estudar; que não era **bom**

--E olhe que era um **bom** homem!

Filho segundo de uma casa de **bom** nome na provincia do Minho, cursava

d'elle ás suas, mostrar lhe bem claro, que o adorava pelo que elle transmittia a sua vida de elegante e de superior, mas que o considerava um objecto raro adquirido por muito **bom** preço, e do qual dispunha absolutamente.

A casa do visconde das Lagôas tornou-se a _mansão de todos os prazeres_, como o **bom** do homem dizia na praça aos seus amigos titulares e merceeiros.

É que a mulher que gosta de brilhar, não sabe o que é sacrificio e abnegação, é que para mim todos os encantos que se apreciam nas salas, não valem um **bom** e candido coração que saiba amar-me e viver só para mim.

Gosto do **bom** que ha em todas as escolas.

A uma aspiração delicada, a tudo que é bello e **bom**.

Sabes quem são os meus mestres do **bom** e do bello?

Com mil **bom**bas!

O contra-mestre olhava de cima aquelle quadro e murmurava entre alegre e melancolico:  --Parece que é **bom** ter familia e ter uma pequerrucha bonita como a do capitão que nos venha dar um abraço quando vimos de longe...

--A modo que elle não estava **bom**!

**bom** nadador é elle, dizia o contra-mestre, mas se ha tubarões assim!

O capitão apesar de **bom** nadador já estava velho

um **bom** sorriso beatifico e dourado de mocidade que lhe illuminou o semblante.

O presunto vamos com Deus, que tambem me sahiu **bom**.

que **bom** e que intimo foi aquelle jantar!

--Ha quarenta annos que não durmo um somno tão **bom**, minha mãe!

No meio d'isto, despretenciosa e simples, julgando-se a mais ignorante das creaturinhas do **bom** Deus, não sabendo que era artista, que era intelligente, que tinha alma capaz de entender todas as grandes cousas.

emquanto a _commendadora_ meditava o rol d'aquelle dia, digerindo um **bom** jantar, e um ataque de furia contra as suas criadas presentes e futuras, emquanto as meninas debruçadas á janella, trocavam substanciosos commentarios

A MORTE DE BERTHA  (A NALY)   Minha Naly, ás vezes nos teus dias de **bom** humor, e sobre tudo nos raros dias em que estás um pouco menos traquinas, vens sentar-te ao pé de mim, n'um banco pequenino, e pegando n'um livro,--o teu livro de grandes bonecos coloridos--, finges que estás lendo umas cousas que a tua inquieta phantasiasinha de duende te representa, escriptas n'aquellas paginas ainda mudas para os olhos da tua intelligencia.

a alma de Bertha expandia-se naturalmente para tudo que é **bom** e que é bello.

Ellas dão sombra, dão frescura, dão fructos, dão flôr, dão um **bom** cheiro sadio, que reconforta e alegra;

Bertha, ora ennovellada aos pés da mãe, nas felpas avelludadas do tapete, e com os grandes olhos curiosos fitos nos d'ella, ora folheando um grande livro de imagens--como o teu, minha Naly--, ora empoleirada no espaldar da larga poltrona onde o pae estava sentado, e passando-lhe a pequenina mão crestada pela cabelladura revolta e crespa, Bertha era a mais feliz das creaturinhas do **bom** Deus!

Que **bom**!

Como é **bom** ir para o céo!

Os outros não téem o talento d'elle, não téem o alcance funesto ou **bom**, mas em todo o caso poderosissimo da sua obra, não téem a sua paciencia de benedictino, exercida com os processos da nova escola.

E depois são taes os exageros e desmandos da chamada _escola realista_, é tal o amesquinhamento a que ella reduz a humanidade, que é **bom** que um escriptor de tão prestigiosa eloquencia como é _Octavio Feuillet

N'esta era de transformação e de incerta claridade, é **bom** que uma voz se erga e diga bem alto que a paixão só é criminosa quando mal dirigida, que o excesso do sentimento só é ridiculo quando mal applicado, que a abnegação inteira e absoluta tem gozos superiores a todos os gozos da materia, e que as almas boas e as almas grandes descobriram uma linguagem mysteriosa, na qual fallam com Deus.

sympathico, **bom**, com vaidades inoffensivas, e austeros orgulhos, sedento de um affecto _unico_, e de uma _celebridade_ que fosse só d'elle.

porque elle, que soube pintar tão bem os cynicos, os depravados, os terriveis escarnecedores, cujo riso corroia como um caustico, era no intimo **bom**, quasi infantil; depois confidencias, esperanças, sonhos politicos, sonhos financeiros, sonhos industriaes, planos gigantescos de trabalho, phantasias de artista, desejos de mulher garrida e bonita, observações profundas, divagações poeticas, melancolias de alma

Será **bom** que a gente peça a Deus d'aqui por diante nas suas orações mais fervorosas não excitar a dedicada admiração d'aquella illustre, mas indiscreta dama!

É **bom** que tenhamos isto sempre bem presente, para que não sejamos accintosamente inimigos do que foi, nem loucamente vaidosos do que vai ser.

É **bom** que o repitamos:

Nunca se deu ao trabalho de improvisar arrojos de eloquencia, tinha sempre ao serviço das suas convicções umas anecdotas a um tempo cheias de graça e de **bom** senso, umas pequenas historias que deitavam por terra

Não que elle fizesse nenhum d'esses discursos tribunicios que arrastam e enthusiasmam as massas, mas sempre pela força irresistivel do seu **bom**

Marion, de Magdalena impudica e triumphante, levanta-se Magdalena arrependida e piedosa, e Esmeralda n√£o tem a esmola, a caridade de um sorriso **bom** para Quasimodo!  
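
Because `find_sentences_with_keyword()` checks whether the keyword appears anywhere in the sentence, the results above include "Com mil **bom**bas!", where "bom" is only part of a longer word. The replacement step also writes the keyword as it was typed, so a sentence-initial capital letter is lost in the display. One possible refinement, sketched below on plain strings, uses the regex word boundary `\b` and a backreference to keep the original casing. The function name `highlight_whole_word` is an assumption; in the notebook, the sentence strings would come from `document.sents`.

```python
import re

def highlight_whole_word(keyword, sentences):
    """Return the sentences that contain keyword as a whole word, with matches bolded."""
    # \b matches a word boundary; re.escape guards against regex metacharacters
    pattern = re.compile(rf"\b({re.escape(keyword)})\b", flags=re.IGNORECASE)
    highlighted = []
    for sentence in sentences:
        if pattern.search(sentence):
            # \1 re-inserts the matched text, preserving its original capitalization
            highlighted.append(pattern.sub(r"**\1**", sentence))
    return highlighted

sample_sentences = ["Sempre é bom.", "Com mil bombas!", "Bom nadador é elle."]
for sentence in highlight_whole_word("bom", sample_sentences):
    print(sentence)
```

This still matches only the exact form that was typed, so inflected forms such as "boa" are not found.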

## Get Keyword in Context

We can also find out about a keyword's more immediate context (its neighboring words to the left and right), and we can fine-tune our search with POS tagging.

To do so, we will first create a list of what's called *ngrams*. "Ngrams" are any sequence of *n* tokens in a text. They're an important concept in computational linguistics and NLP. (Have you ever played with [Google's *Ngram* Viewer](https://books.google.com/ngrams)?)

Below we're going to make a list of *bigrams*, that is, all the two-word combinations from the sample text. We're going to use these bigrams to find the neighboring words that appear alongside particular keywords.

In [15]:
#Make a list of tokens and POS labels from document if the token is a word 
tokens_and_labels = [(token.text, token.pos_) for token in document if token.is_alpha]

In [16]:
#Make a function to get all two-word combinations
def get_bigrams(word_list, number_consecutive_words=2):
    
    ngrams = []
    adj_length_of_word_list = len(word_list) - (number_consecutive_words - 1)
    
    #Loop through numbers from 0 to the (slightly adjusted) length of your word list
    for word_index in range(adj_length_of_word_list):
        
        #Index the list at each number, grabbing the word at that number index as well as N number of words after it
        ngram = word_list[word_index : word_index + number_consecutive_words]
        
        #Append this word combo to the master list "ngrams"
        ngrams.append(ngram)
        
    return ngrams

In [17]:
bigrams = get_bigrams(tokens_and_labels)

Let's take a peek at the bigrams:

In [18]:
bigrams[5:20]

[[('Era', 'AUX'), ('uma', 'DET')],
 [('uma', 'DET'), ('physionomia', 'NOUN')],
 [('physionomia', 'NOUN'), ('incaracteristica', 'ADJ')],
 [('incaracteristica', 'ADJ'), ('apagada', 'VERB')],
 [('apagada', 'VERB'), ('tristissima', 'ADJ')],
 [('tristissima', 'ADJ'), ('Não', 'ADV')],
 [('Não', 'ADV'), ('se', 'PRON')],
 [('se', 'PRON'), ('podia', 'VERB')],
 [('podia', 'VERB'), ('dizer', 'VERB')],
 [('dizer', 'VERB'), ('a', 'DET')],
 [('a', 'DET'), ('idade', 'NOUN')],
 [('idade', 'NOUN'), ('que', 'PRON')],
 [('que', 'PRON'), ('tinha', 'VERB')],
 [('tinha', 'VERB'), ('nem', 'CCONJ')],
 [('nem', 'CCONJ'), ('mesmo', 'ADV')]]
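
For the two-word case, the same kind of list can be built by pairing each item with its successor using `zip`. This is an alternative sketch, not the function used in this lesson: `bigrams_with_zip` is a hypothetical name, and it returns tuples rather than the lists that `get_bigrams()` returns.

```python
def bigrams_with_zip(word_list):
    """Pair every item in word_list with the item that follows it."""
    # zip stops at the shorter input, so the last item does not start a pair
    return list(zip(word_list, word_list[1:]))

sample = [("Era", "AUX"), ("uma", "DET"), ("physionomia", "NOUN")]
for pair in bigrams_with_zip(sample):
    print(pair)
```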

Now that we have our list of bigrams, we're going to make a function `get_neighbor_words()`. This function will return the most frequent words that appear next to a particular keyword. The function can also be fine-tuned to return neighbor words that match a certain part of speech by changing the `pos_label` parameter.

In [19]:
def get_neighbor_words(keyword, bigrams, pos_label = None):
    
    neighbor_words = []
    keyword = keyword.lower()
    
    for bigram in bigrams:
        
        #Extract just the lowercased words (not the labels) for each bigram
        words = [word.lower() for word, label in bigram]        
        
        #Check to see if keyword is in the bigram
        if keyword in words:
            
            for word, label in bigram:
                
                #Now focus on the neighbor word, not the keyword
                if word.lower() != keyword:
                    #If the neighbor word matches the right pos_label, append it to the master list
                    if pos_label is None or label == pos_label:
                        neighbor_words.append(word.lower())
    
    return Counter(neighbor_words).most_common()

In [20]:
get_neighbor_words("bom", bigrams)

[('é', 10),
 ('que', 8),
 ('e', 7),
 ('do', 6),
 ('um', 6),
 ('de', 4),
 ('o', 3),
 ('na', 2),
 ('muito', 2),
 ('para', 2),
 ('nadador', 2),
 ('sorriso', 2),
 ('deus', 2),
 ('senso', 2),
 ('som', 1),
 ('houvera', 1),
 ('seria', 1),
 ('ser', 1),
 ('agora', 1),
 ('eu', 1),
 ('cão', 1),
 ('sempre', 1),
 ('sério', 1),
 ('vae', 1),
 ('era', 1),
 ('das', 1),
 ('homem', 1),
 ('nome', 1),
 ('preço', 1),
 ('a', 1),
 ('ter', 1),
 ('estava', 1),
 ('disse', 1),
 ('capitão', 1),
 ('sahiu', 1),
 ('vaes', 1),
 ('tão', 1),
 ('minha', 1),
 ('jantar', 1),
 ('humor', 1),
 ('cheiro', 1),
 ('ir', 1),
 ('ou', 1),
 ('mas', 1),
 ('sympathico', 1),
 ('com', 1),
 ('intimo', 1),
 ('quasi', 1),
 ('será', 1),
 ('seu', 1)]

In [21]:
get_neighbor_words("bom", bigrams, pos_label='VERB')

[('houvera', 1), ('ter', 1), ('disse', 1), ('ir', 1)]
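
`get_neighbor_words()` pools the words on both sides of the keyword. If the side matters (for example, to see which words follow "bom"), the bigrams can be split by position. The sketch below assumes bigrams shaped like the ones above, each a pair of `(word, label)` items; the function name `get_left_right_neighbors` and the sample bigrams are hypothetical.

```python
from collections import Counter

def get_left_right_neighbors(keyword, bigrams, pos_label=None):
    """Count words directly before (left) and directly after (right) the keyword."""
    keyword = keyword.lower()
    left, right = Counter(), Counter()
    for (first_word, first_label), (second_word, second_label) in bigrams:
        # Keyword is second, so the first word is its left neighbor
        if second_word.lower() == keyword and pos_label in (None, first_label):
            left[first_word.lower()] += 1
        # Keyword is first, so the second word is its right neighbor
        if first_word.lower() == keyword and pos_label in (None, second_label):
            right[second_word.lower()] += 1
    return left.most_common(), right.most_common()

# Hand-written sample bigrams for illustration
sample_bigrams = [
    [("um", "DET"), ("bom", "ADJ")],
    [("bom", "ADJ"), ("homem", "NOUN")],
    [("é", "AUX"), ("bom", "ADJ")],
    [("bom", "ADJ"), ("nome", "NOUN")],
]

left, right = get_left_right_neighbors("bom", sample_bigrams)
print("left:", left)
print("right:", right)
```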

## Your Turn!

Try out `find_sentences_with_keyword()` and `get_neighbor_words()` with your own keywords of interest.

In [None]:
find_sentences_with_keyword(keyword="YOUR KEY WORD", document=document)

In [None]:
get_neighbor_words(keyword="YOUR KEY WORD", bigrams=bigrams, pos_label=None)