# Tagging in NLTK

## Basic pipeline for English

In [2]:
#@title Prerequisites
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk import word_tokenize

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MICHU\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\MICHU\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
#@title One-line tagging
text = word_tokenize('And now here I am enjoying today')
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('here', 'RB'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('enjoying', 'VBG'),
 ('today', 'NN')]

In [4]:
#@title Grammatical category of each tag
nltk.download('tagsets')
for tag in ['CC', 'RB', 'PRP']:
    # upenn_tagset() prints the description itself and returns None
    nltk.help.upenn_tagset(tag)

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\MICHU\AppData\Roaming\nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


In [5]:
#@title Homonyms
text = word_tokenize('They do not permit other people to get residence permit')
nltk.pos_tag(text)

[('They', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('permit', 'VB'),
 ('other', 'JJ'),
 ('people', 'NNS'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('residence', 'NN'),
 ('permit', 'NN')]
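The output above shows the same word form, `permit`, receiving two different tags depending on context. As a small sketch, the snippet below collects every word that was assigned more than one tag. The list is copied from the output above, and the helper name `ambiguous_words` is only illustrative.

```python
from collections import defaultdict

# Tagger output copied from the cell above.
tagged = [('They', 'PRP'), ('do', 'VBP'), ('not', 'RB'), ('permit', 'VB'),
          ('other', 'JJ'), ('people', 'NNS'), ('to', 'TO'), ('get', 'VB'),
          ('residence', 'NN'), ('permit', 'NN')]

def ambiguous_words(tagged_tokens):
    """Map each lowercased word to its tags, keeping only words with 2+ tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(t) for w, t in tags_by_word.items() if len(t) > 1}

print(ambiguous_words(tagged))  # {'permit': ['NN', 'VB']}
```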

# Tagging in Spanish

For English, NLTK ships with a pre-trained tokenizer and tagger by default. For other languages, a tagger has to be trained first.
- We use the `cess_esp` corpus.
- Its grammatical tags follow the convention defined by the EAGLES group.
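EAGLES-style tags are positional: the first character gives the coarse category and the remaining characters encode features such as type, gender and number. The sketch below decodes only that first character. The lookup table is an assumption based on the EAGLES convention as it is commonly documented; it is not something NLTK provides, and the name `coarse_category` is illustrative. The sample tags are ones that appear in this notebook's outputs.

```python
# First character of an EAGLES tag -> coarse category.
# Assumption: illustrative table, not part of NLTK.
EAGLES_CATEGORY = {
    'a': 'adjective',
    'c': 'conjunction',
    'd': 'determiner',
    'f': 'punctuation',
    'i': 'interjection',
    'n': 'noun',
    'p': 'pronoun',
    'r': 'adverb',
    's': 'preposition',
    'v': 'verb',
    'w': 'date',
    'z': 'numeral',
}

def coarse_category(tag):
    """Return the coarse category of an EAGLES tag, or 'unknown'."""
    if not tag:  # a tagger without backoff may return None
        return 'unknown'
    return EAGLES_CATEGORY.get(tag[0].lower(), 'unknown')

# Sample tags taken from the tagger outputs in this notebook.
for tag in ['pp1csn00', 'vsip1s0', 'di0fs0', 'ncfs000', 'rg', None]:
    print(tag, '->', coarse_category(tag))
```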

In [8]:
nltk.download('cess_esp')
from nltk.corpus import cess_esp as cess
from nltk import UnigramTagger as ut 
from nltk import BigramTagger as bt 

[nltk_data] Downloading package cess_esp to
[nltk_data]     C:\Users\MICHU\AppData\Roaming\nltk_data...
[nltk_data]   Package cess_esp is already up-to-date!


In [9]:
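#@title Training the unigram tagger
cess_sents = cess.tagged_sents()
# 90/10 split: train on the first 90% of sentences, evaluate on the rest
fraction = int(len(cess_sents) * 90 / 100)
uni_tagger = ut(cess_sents[:fraction])
# token-level accuracy on the held-out 10% (newer NLTK releases name this accuracy())
uni_tagger.evaluate(cess_sents[fraction:])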

0.8068832283915284
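The score above is a token-level accuracy: the fraction of tokens whose predicted tag equals the gold tag, so a token tagged `None` counts as an error. The following is a minimal from-scratch sketch of that measure, not NLTK's implementation. The helper name `token_accuracy` is illustrative, and the toy gold and predicted sentences are hand-made for the example.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            total += 1
            correct += (gold_tag == pred_tag)  # None never matches a real tag
    return correct / total if total else 0.0

# Hand-made toy data: one gold sentence and a prediction with two gaps.
gold = [[('Yo', 'pp1csn00'), ('soy', 'vsip1s0'),
         ('una', 'di0fs0'), ('persona', 'ncfs000')]]
predicted = [[('Yo', 'pp1csn00'), ('soy', 'vsip1s0'),
              ('una', None), ('persona', None)]]

print(token_accuracy(gold, predicted))  # 0.5
```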

In [11]:
uni_tagger.tag('Yo soy una persona muy amable'.split(' '))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', 'di0fs0'),
 ('persona', 'ncfs000'),
 ('muy', 'rg'),
 ('amable', None)]

In [13]:
#@title Training the bigram tagger
fraction = int(len(cess_sents) * 90 / 100)
bi_tagger = bt(cess_sents[:fraction])
# note: this slice starts at fraction + 1, so it skips one sentence that the unigram evaluation included
bi_tagger.evaluate(cess_sents[fraction + 1:])

0.1095272206303725

In [14]:
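# not recommended on its own: with no backoff tagger, the first unseen context yields None,
# and every later token in the sentence then has an unseen context as well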
bi_tagger.tag('Yo soy una persona muy amable'.split(' '))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', None),
 ('persona', None),
 ('muy', None),
 ('amable', None)]
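The run of `None` tags above is typical of a bigram tagger used alone. A common remedy is to chain taggers through the `backoff` parameter, so that the bigram tagger defers to a unigram tagger, which in turn defers to a default tag. The sketch below shows the wiring on a toy training set built from tags seen in this notebook, so it runs without downloading the corpus. The fallback tag `'unk'` is a hypothetical placeholder, not part of the EAGLES tagset.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Toy training data built from tags shown earlier in this notebook.
# With the real corpus this would be cess_sents[:fraction].
train_sents = [
    [('Yo', 'pp1csn00'), ('soy', 'vsip1s0'),
     ('una', 'di0fs0'), ('persona', 'ncfs000')],
]

# 'unk' is a hypothetical placeholder tag for words never seen in training.
default_tagger = DefaultTagger('unk')
uni_backoff = UnigramTagger(train_sents, backoff=default_tagger)
bi_backoff = BigramTagger(train_sents, backoff=uni_backoff)

# Unseen words fall through to the default tag instead of None.
result = bi_backoff.tag('Yo soy muy amable'.split(' '))
print(result)
```

With this chain no token is left as `None`, so one unseen word no longer blanks out the rest of the sentence. Whether it improves the accuracy measured on `cess_esp` would need to be checked by re-running the evaluation.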

# Improved tagging with Stanza (Stanford)

### What is Stanza?
- Stanford's NLP research group maintained a suite of libraries for various NLP tasks. The suite was unified into a single Java-based service called **CoreNLP**: https://stanfordnlp.github.io/CoreNLP/index.html
- For Python there is StanfordNLP: https://stanfordnlp.github.io/stanfordnlp/index.html
- However, StanfordNLP has been deprecated, and new versions of the NLP suite are maintained under the name Stanza: https://stanfordnlp.github.io/stanza

In [16]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.2-py3-none-any.whl (282 kB)
Collecting torch>=1.3.0
  Downloading torch-1.8.0-cp37-cp37m-win_amd64.whl (190.5 MB)
Collecting typing-extensions
  Using cached typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions, torch, stanza
Successfully installed stanza-1.2 torch-1.8.0 typing-extensions-3.7.4.3
You should consider upgrading via the 'a:\anaconda3\python.exe -m pip install --upgrade pip' command.


In [17]:
import stanza
stanza.download('es')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 16.0MB/s]                    
2021-03-10 12:33:37 INFO: Downloading default packages for language: es (Spanish)...
Downloading http://nlp.stanford.edu/software/stanza/1.2.0/es/default.zip: 100%|██████████| 566M/566M [01:48<00:00, 5.24MB/s]
2021-03-10 12:35:34 INFO: Finished downloading models and saved to C:\Users\MICHU\stanza_resources.


In [18]:
nlp = stanza.Pipeline('es', processors='tokenize, pos')
doc = nlp('Yo soy una persona muy amable')

2021-03-10 13:06:40 INFO: Loading these models for language: es (Spanish):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| mwt       | ancora  |
| pos       | ancora  |

2021-03-10 13:06:40 INFO: Use device: cpu
2021-03-10 13:06:40 INFO: Loading: tokenize
2021-03-10 13:06:40 INFO: Loading: mwt
2021-03-10 13:06:40 INFO: Loading: pos
2021-03-10 13:06:41 INFO: Done loading processors!


In [19]:
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.pos)

Yo PRON
soy AUX
una DET
persona NOUN
muy ADV
amable ADJ
