# Tagging and Shallow Parsing

* *30 min* | Last modified: Sept 22, 2020

http://www.nltk.org/book/

Text Analytics with Python

## Part-of-Speech (POS) tagging / Lexical categorization

```
TAG   Description                            Example
------------------------------------------------------------
CC    Coordinating conjunction               and, or
CD    Cardinal number                        one, two, 3
DT    Determiner                             a, the
EX    Existential there                      there were two cars
FW    Foreign word                           hola mundo cruel
IN    Preposition/subordinating conjunction  of, in, on, that
JJ    Adjective                              quick, lazy
JJR   Adjective, comparative                 quicker, lazier
JJS   Adjective, superlative                 quickest, laziest
NN    Noun, singular or mass                 fox, dog
NNS   Noun, plural                           foxes, dogs
NNP   Noun, proper singular                  John, Alice
NNPS  Noun, proper plural                    Vikings, Indians, Germans
...

```

In [3]:
import nltk

##
## To list all available tags, run the following code
##
# nltk.download('tagsets')
# nltk.help.upenn_tagset()

In [22]:
##
## Example --- POS tagging with spaCy
##
import pandas as pd
import spacy

sentence = "US unveils world's most powerful supercomputer, beats China."

# Requires the model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sentence_nlp = nlp(sentence)

# tag_ is the fine-grained (Penn Treebank) tag; pos_ is the coarse-grained tag
spacy_pos_tagged = [(word.text, word.tag_, word.pos_) for word in sentence_nlp]
pd.DataFrame(spacy_pos_tagged, columns=["Word", "POS tag", "Tag type"])

   Word           POS tag  Tag type
0  US             NNP      PROPN
1  unveils        VBZ      VERB
2  world          NN       NOUN
3  's             POS      PART
4  most           RBS      ADV
5  powerful       JJ       ADJ
6  supercomputer  NN       NOUN
7  ,              ,        PUNCT
8  beats          VBZ      VERB
9  China          NNP      PROPN


In [23]:
##
## Example --- POS tagging with NLTK
##
import nltk

# word_tokenize also needs tokenizer data, e.g. nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk_pos_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
pd.DataFrame(nltk_pos_tagged, columns=['Word', 'POS tag'])

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jdvelasq/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


   Word           POS tag
0  US             NNP
1  unveils        JJ
2  world          NN
3  's             POS
4  most           RBS
5  powerful       JJ
6  supercomputer  NN
7  ,              ,
8  beats          VBZ
9  China          NNP
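
The spaCy and NLTK outputs above differ on one token. A minimal sketch for spotting such differences, assuming both taggers produced the same tokenization. The tag lists are copied from the two outputs above, and `tag_disagreements` is a hypothetical helper, not part of spaCy or NLTK:

```python
# Tags copied from the spaCy and NLTK outputs shown above
spacy_tags = [("US", "NNP"), ("unveils", "VBZ"), ("world", "NN"), ("'s", "POS"),
              ("most", "RBS"), ("powerful", "JJ"), ("supercomputer", "NN"),
              (",", ","), ("beats", "VBZ"), ("China", "NNP")]
nltk_tags = [("US", "NNP"), ("unveils", "JJ"), ("world", "NN"), ("'s", "POS"),
             ("most", "RBS"), ("powerful", "JJ"), ("supercomputer", "NN"),
             (",", ","), ("beats", "VBZ"), ("China", "NNP")]

def tag_disagreements(tagged_a, tagged_b):
    """Return (word, tag_a, tag_b) for tokens whose tags differ.

    Assumes both inputs were tokenized identically (hypothetical helper).
    """
    return [(word, tag_a, tag_b)
            for (word, tag_a), (_, tag_b) in zip(tagged_a, tagged_b)
            if tag_a != tag_b]

disagreements = tag_disagreements(spacy_tags, nltk_tags)
print(disagreements)  # [('unveils', 'VBZ', 'JJ')]
```

NLTK labels "unveils" as an adjective (JJ), and that tag carries into the chunking examples that follow.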


## Shallow parsing

* Noun phrase (NP): The noun heads the phrase. An NP acts as the subject or object of a verb.

* Verb phrase (VP): The verb heads the phrase.

* Adjective phrase (ADJP): The adjective is the head. It qualifies nouns and pronouns in the sentence.

* Adverb phrase (ADVP): Phrases that act as adverbs.

* Prepositional phrase (PP): Phrases that begin with a preposition.

In [6]:
##
## Example of a shallow parser's output
##
from nltk.corpus import treebank_chunk

# nltk.download('treebank')

data = treebank_chunk.chunked_sents()

train_data = data[:3500] 
test_data = data[3500:]

print(train_data[7])

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)


In [11]:
##
## Defining a regexp parser
##
from nltk.chunk import RegexpParser

sentence = "US unveils world's most powerful supercomputer, beats China."

tagged_simple_sent = nltk.pos_tag(nltk.word_tokenize(sentence))
print('POS Tags:', tagged_simple_sent)

chunk_grammar = """
NP: {<DT>?<JJ>*<NN.*>}
"""

rc = RegexpParser(chunk_grammar) 
c = rc.parse(tagged_simple_sent)
print()
print(c)

POS Tags: [('US', 'NNP'), ('unveils', 'JJ'), ('world', 'NN'), ("'s", 'POS'), ('most', 'RBS'), ('powerful', 'JJ'), ('supercomputer', 'NN'), (',', ','), ('beats', 'VBZ'), ('China', 'NNP'), ('.', '.')]

(S
  (NP US/NNP)
  (NP unveils/JJ world/NN)
  's/POS
  most/RBS
  (NP powerful/JJ supercomputer/NN)
  ,/,
  beats/VBZ
  (NP China/NNP)
  ./.)
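
To make the tag pattern `<DT>?<JJ>*<NN.*>` less opaque, here is a minimal pure-Python sketch of the idea: encode the POS tags as one string and run an ordinary regular expression over it. This is an illustration under simplifying assumptions, not NLTK's actual implementation; `tag_pattern_to_regex` and `find_chunks` are hypothetical names.

```python
import re

def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT>?<JJ>*<NN.*>' into a compiled regex."""
    def wrap(match):
        # Inside a tag, '.' may match any character except the tag delimiters.
        inner = match.group(1).replace(".", "[^<>]")
        # Group the whole tag so that ?, * and + apply to it as a unit.
        return "(?:<(?:" + inner + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", wrap, tag_pattern))

def find_chunks(tagged_tokens, tag_pattern):
    """Return the word groups whose tag sequence matches tag_pattern."""
    regex = tag_pattern_to_regex(tag_pattern)
    tag_string = "".join("<" + tag + ">" for _, tag in tagged_tokens)
    chunks = []
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue  # ignore empty matches
        # Each '<' before an offset corresponds to one token.
        first = tag_string.count("<", 0, match.start())
        last = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged_tokens[first:last]])
    return chunks

# POS tags copied from the NLTK output above
tagged = [("US", "NNP"), ("unveils", "JJ"), ("world", "NN"), ("'s", "POS"),
          ("most", "RBS"), ("powerful", "JJ"), ("supercomputer", "NN"),
          (",", ","), ("beats", "VBZ"), ("China", "NNP"), (".", ".")]

np_chunks = find_chunks(tagged, "<DT>?<JJ>*<NN.*>")
print(np_chunks)
# [['US'], ['unveils', 'world'], ['powerful', 'supercomputer'], ['China']]
```

The four groups match the NP chunks in the tree printed above, including `unveils/JJ world/NN`, which is chunked as an NP only because the tagger labeled "unveils" as JJ.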


In [12]:
##
## Chink --- tokens left out of a chunk
##
chink_grammar = """
NP:
   {<.*>+}                # Chunk everything as NP
   }<VBZ|VBD|JJ|IN>+{     # Chink sequences of VBZ, VBD, JJ, IN
"""

rc = RegexpParser(chink_grammar)
c = rc.parse(tagged_simple_sent)
print(c)

(S
  (NP US/NNP)
  unveils/JJ
  (NP world/NN 's/POS most/RBS)
  powerful/JJ
  (NP supercomputer/NN ,/,)
  beats/VBZ
  (NP China/NNP ./.))
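
Chinking can be read as "start with one big chunk, then cut out the unwanted tags". A minimal pure-Python sketch of that idea, assuming a flat set of chink tags (the `chink` function is hypothetical and much simpler than NLTK's rule engine):

```python
def chink(tagged_tokens, chink_tags, label="NP"):
    """Chunk everything, then split the chunk wherever a chink tag occurs."""
    result, current = [], []
    for word, tag in tagged_tokens:
        if tag in chink_tags:
            if current:
                result.append((label, current))  # close the open chunk
                current = []
            result.append((word, tag))           # chinked token stays outside
        else:
            current.append((word, tag))
    if current:
        result.append((label, current))
    return result

# POS tags copied from the NLTK output above
tagged = [("US", "NNP"), ("unveils", "JJ"), ("world", "NN"), ("'s", "POS"),
          ("most", "RBS"), ("powerful", "JJ"), ("supercomputer", "NN"),
          (",", ","), ("beats", "VBZ"), ("China", "NNP"), (".", ".")]

chinked = chink(tagged, {"VBZ", "VBD", "JJ", "IN"})
for item in chinked:
    print(item)
```

The split points fall on `unveils`, `powerful`, and `beats`, which reproduces the grouping in the tree above. Punctuation ends up inside NP chunks because nothing in the grammar chinks it.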


In [13]:
##
## Improvements
##
grammar = """
NP:   {<DT>?<JJ>?<NN.*>}
ADJP: {<JJ>}
ADVP: {<RB.*>}
PP:   {<IN>}
VP:   {<MD>?<VB.*>+}
"""

rc = RegexpParser(grammar)
c = rc.parse(tagged_simple_sent)
print(c)

(S
  (NP US/NNP)
  (NP unveils/JJ world/NN)
  's/POS
  (ADVP most/RBS)
  (NP powerful/JJ supercomputer/NN)
  ,/,
  (VP beats/VBZ)
  (NP China/NNP)
  ./.)


In [14]:
##
## Evaluation
##
# Newer NLTK releases deprecate evaluate() in favor of accuracy()
print(rc.evaluate(test_data))

ChunkParse score:
    IOB Accuracy:  46.1%%
    Precision:     19.9%%
    Recall:        43.3%%
    F-Measure:     27.3%%
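
The chunk-level scores can be understood with a small sketch. It assumes chunks are compared as exact `(label, start, end)` spans and that the F-measure is the harmonic mean of precision and recall. The guessed spans below are made up for illustration, and `chunk_scores` is a hypothetical helper rather than NLTK's scorer:

```python
def chunk_scores(gold_chunks, guessed_chunks):
    """Precision, recall and F-measure over exact-match chunk spans."""
    gold, guessed = set(gold_chunks), set(guessed_chunks)
    correct = len(gold & guessed)
    precision = correct / len(guessed) if guessed else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Gold NP spans (token offsets, end exclusive) for
# "A Lorillard spokewoman said , `` This is an old story ."
gold = [("NP", 0, 3), ("NP", 6, 7), ("NP", 8, 11)]
# Hypothetical parser output: one NP boundary is wrong, plus a spurious VP
guessed = [("NP", 1, 3), ("NP", 6, 7), ("NP", 8, 11), ("VP", 3, 4)]

precision, recall, f_measure = chunk_scores(gold, guessed)
print(precision, recall, f_measure)

# The reported scores are consistent with the same formula
reported_f = 2 * 0.199 * 0.433 / (0.199 + 0.433)
print(round(reported_f, 3))  # 0.273
```

Low precision with higher recall, as in the score above, means the grammar proposes many chunks that do not match the gold boundaries exactly.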


In [15]:
##
## B- beginning of a chunk
## I- inside a chunk
## O- outside any chunk
##
from nltk.chunk.util import tree2conlltags, conlltags2tree

train_sent = train_data[7]
print(train_sent)

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)


In [16]:
wtc = tree2conlltags(train_sent)
wtc

[('A', 'DT', 'B-NP'),
 ('Lorillard', 'NNP', 'I-NP'),
 ('spokewoman', 'NN', 'I-NP'),
 ('said', 'VBD', 'O'),
 (',', ',', 'O'),
 ('``', '``', 'O'),
 ('This', 'DT', 'B-NP'),
 ('is', 'VBZ', 'O'),
 ('an', 'DT', 'B-NP'),
 ('old', 'JJ', 'I-NP'),
 ('story', 'NN', 'I-NP'),
 ('.', '.', 'O')]
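
To show how the B-/I-/O tags encode chunk boundaries, here is a minimal pure-Python sketch that groups the triples above back into chunks. `iob_to_chunks` is a hypothetical helper; it treats a stray `I-` tag as the start of a new chunk, which is an assumption of this sketch.

```python
def iob_to_chunks(conll_tags):
    """Group (word, pos, iob) triples into (label, [words]) chunks."""
    chunks, label, words = [], None, []
    for word, _pos, iob in conll_tags:
        prefix, _, chunk_type = iob.partition("-")
        if prefix == "I" and chunk_type == label:
            words.append(word)               # continue the open chunk
            continue
        if words:
            chunks.append((label, words))    # close the open chunk
        if prefix in ("B", "I"):
            label, words = chunk_type, [word]  # open a new chunk
        else:
            label, words = None, []          # 'O': outside any chunk
    if words:
        chunks.append((label, words))
    return chunks

# Triples copied from the tree2conlltags output above
wtc = [("A", "DT", "B-NP"), ("Lorillard", "NNP", "I-NP"),
       ("spokewoman", "NN", "I-NP"), ("said", "VBD", "O"), (",", ",", "O"),
       ("``", "``", "O"), ("This", "DT", "B-NP"), ("is", "VBZ", "O"),
       ("an", "DT", "B-NP"), ("old", "JJ", "I-NP"), ("story", "NN", "I-NP"),
       (".", ".", "O")]

chunks = iob_to_chunks(wtc)
print(chunks)
# [('NP', ['A', 'Lorillard', 'spokewoman']), ('NP', ['This']), ('NP', ['an', 'old', 'story'])]
```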

In [17]:
tree = conlltags2tree(wtc)
print(tree)

(S
  (NP A/DT Lorillard/NNP spokewoman/NN)
  said/VBD
  ,/,
  ``/``
  (NP This/DT)
  is/VBZ
  (NP an/DT old/JJ story/NN)
  ./.)


In [18]:
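def conll_tag_chunks(chunk_sents):
    # Convert each chunk tree to (word, POS tag, chunk tag) triples,
    # then keep only the (POS tag, chunk tag) pairs used to train the taggers
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]

def combined_tagger(train_data, taggers, backoff=None):
    # Train the taggers in order; each one uses the previous one as its backoff
    for tagger in taggers:
        backoff = tagger(train_data, backoff=backoff)
    return backoff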
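
The backoff chain can be illustrated without NLTK. The sketch below is a simplified stand-in, not the library's implementation: a bigram table keyed on (previous chunk tag, POS tag) is consulted first, and a unigram table keyed on the POS tag alone is the fallback. The toy training data and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Most frequent chunk tag for each POS tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def train_bigram(tagged_sents):
    """Most frequent chunk tag for each (previous chunk tag, POS tag) context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # no previous chunk tag at the start of a sentence
        for pos, chunk in sent:
            counts[(prev, pos)][chunk] += 1
            prev = chunk
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_with_backoff(pos_tags, bigram, unigram):
    """Use the bigram table; fall back to the unigram table for unseen contexts."""
    result, prev = [], None
    for pos in pos_tags:
        chunk = bigram.get((prev, pos))
        if chunk is None:
            chunk = unigram.get(pos)  # backoff; stays None if the POS tag is unseen
        result.append((pos, chunk))
        prev = chunk
    return result

# Toy (POS tag, chunk tag) training sentences, in the format conll_tag_chunks returns
train = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "O"), ("DT", "B-NP"),
     ("JJ", "I-NP"), ("NN", "I-NP"), (".", "O")],
    [("NNP", "B-NP"), ("VBZ", "O"), ("DT", "B-NP"), ("NN", "I-NP"), (".", "O")],
]
unigram = train_unigram(train)
bigram = train_bigram(train)

tagged = tag_with_backoff(["DT", "JJ", "NN", "VBZ", "NNP", "."], bigram, unigram)
print(tagged)
```

In this run the contexts for `VBZ`, `NNP`, and `.` never appeared in the bigram table, so their chunk tags come from the unigram fallback. This mirrors the order in which `combined_tagger` builds the chain: the unigram tagger first, then the bigram tagger with it as backoff.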

In [19]:
from nltk.tag import UnigramTagger, BigramTagger
from nltk.chunk import ChunkParserI

class NGramTagChunker(ChunkParserI):
    def __init__(self, train_sentences, tagger_classes=(UnigramTagger, BigramTagger)):
        # Train on (POS tag, chunk tag) pairs: the taggers learn to map POS tags to IOB chunk tags
        train_sent_tags = conll_tag_chunks(train_sentences)
        self.chunk_tagger = combined_tagger(train_sent_tags, tagger_classes)

    def parse(self, tagged_sentence):
        if not tagged_sentence:
            return None

        # Predict an IOB chunk tag for each POS tag
        pos_tags = [tag for word, tag in tagged_sentence]
        chunk_pos_tags = self.chunk_tagger.tag(pos_tags)
        chunk_tags = [chunk_tag for (pos_tag, chunk_tag) in chunk_pos_tags]
        # Reattach the words and rebuild the chunk tree
        wpc_tags = [(word, pos_tag, chunk_tag) for ((word, pos_tag), chunk_tag) in zip(tagged_sentence, chunk_tags)]
        return conlltags2tree(wpc_tags)
    
ntc = NGramTagChunker(train_data)
print(ntc.evaluate(test_data))

ChunkParse score:
    IOB Accuracy:  97.2%%
    Precision:     91.4%%
    Recall:        94.3%%
    F-Measure:     92.8%%


In [25]:
import spacy

# Load the spaCy pipeline here so the cell does not depend on earlier notebook state
nlp = spacy.load("en_core_web_sm")

sentence_nlp = nlp(sentence)
tagged_sentence = [(word.text, word.tag_) for word in sentence_nlp]
tree = ntc.parse(tagged_sentence)
print(tree)

In [26]:
from nltk.corpus import conll2000

wsj_data = conll2000.chunked_sents()
train_wsj_data = wsj_data[:10000]
test_wsj_data = wsj_data[10000:]

print(train_wsj_data[10])

(S
  (NP He/PRP)
  (VP reckons/VBZ)
  (NP the/DT current/JJ account/NN deficit/NN)
  (VP will/MD narrow/VB)
  (PP to/TO)
  (NP only/RB #/# 1.8/CD billion/CD)
  (PP in/IN)
  (NP September/NNP)
  ./.)


In [27]:
tc = NGramTagChunker(train_wsj_data)
print(tc.evaluate(test_wsj_data))

ChunkParse score:
    IOB Accuracy:  89.1%%
    Precision:     80.3%%
    Recall:        86.1%%
    F-Measure:     83.1%%


In [28]:
# Reuses tagged_sentence, built from the spaCy tags in the earlier parsing cell
tree = tc.parse(tagged_sentence)
print(tree)
