# NLTK Labelling

## Basic Pipeline for English

In [1]:
#@title Prerequisites

# Natural Language Processing
import nltk
# NLTK resources: tokenizer and POS tagger models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Word Tokenizer
from nltk import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
#@title One Line Labelling
text = word_tokenize("And now here I am enjoying today")
nltk.pos_tag(text)

[('And', 'CC'),
 ('now', 'RB'),
 ('here', 'RB'),
 ('I', 'PRP'),
 ('am', 'VBP'),
 ('enjoying', 'VBG'),
 ('today', 'NN')]

In [3]:
#@title Grammatical Category of each Label 
nltk.download('tagsets')
for tag in ['CC', 'RB', 'PRP', 'VBP', 'VBG', 'NN']:
  # upenn_tagset() prints the description itself and returns None
  nltk.help.upenn_tagset(tag)

CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...
PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us
VBP: verb, present tense, not 3rd person singular
    predominate wrap resort sue twist spill cure lengthen brush terminate
    appear tend stray glisten obtain comprise detest tease attract
    emphasize mold postpone sever return wag ...
VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...
N

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.


In [4]:
#@title Homonymous Words 
text = word_tokenize("They do not permit other people to get residence permit")
nltk.pos_tag(text)

[('They', 'PRP'),
 ('do', 'VBP'),
 ('not', 'RB'),
 ('permit', 'VB'),
 ('other', 'JJ'),
 ('people', 'NNS'),
 ('to', 'TO'),
 ('get', 'VB'),
 ('residence', 'NN'),
 ('permit', 'NN')]
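
The output above tags the two occurrences of "permit" differently (verb, then noun). As a small illustration, the sketch below scans a tagged sentence and reports every word that received more than one tag. The helper name `ambiguous_words` is made up for this example, and the input list is simply the output of the previous cell copied by hand.

```python
from collections import defaultdict

# Output of nltk.pos_tag copied from the cell above
tagged = [('They', 'PRP'), ('do', 'VBP'), ('not', 'RB'), ('permit', 'VB'),
          ('other', 'JJ'), ('people', 'NNS'), ('to', 'TO'), ('get', 'VB'),
          ('residence', 'NN'), ('permit', 'NN')]

def ambiguous_words(tagged_tokens):
    """Return {word: sorted tags} for words that received more than one tag.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    return {word: sorted(tags) for word, tags in tags_by_word.items()
            if len(tags) > 1}

print(ambiguous_words(tagged))  # {'permit': ['NN', 'VB']}
```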

## Spanish Labelling 

NLTK ships a pre-trained tokenizer and tagger for English. For most other languages, including Spanish, a tagger must first be trained on an annotated corpus.

* Let's use the Spanish corpus `cess_esp`: https://mailman.uib.no/public/corpora/2007-October/005448.html

* The corpus uses the tag convention defined by the EAGLES group: https://www.cs.upc.edu/~nlp/tools/parole-sp.html
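
EAGLES tags are positional: the first character encodes the grammatical category and the remaining characters encode features such as type, gender and number. The sketch below decodes only that first character, and only for the letters that appear in the tagger outputs later in this notebook. The dictionary and the helper `coarse_category` are illustrative names, not part of NLTK, and the mapping is deliberately incomplete.

```python
# First character of an EAGLES tag -> coarse grammatical category.
# Assumption: only the letters seen in this notebook's outputs are listed;
# the full tagset has more categories (see the EAGLES link above).
EAGLES_CATEGORY = {
    "d": "determiner",
    "n": "noun",
    "p": "pronoun",
    "r": "adverb",
    "v": "verb",
}

def coarse_category(tag):
    """Map an EAGLES tag to a coarse category; None or unlisted -> 'unknown'."""
    if not tag:
        return "unknown"
    return EAGLES_CATEGORY.get(tag[0].lower(), "unknown")

for tag in ["pp1csn00", "vsip1s0", "di0fs0", "ncfs000", "rg", None]:
    print(tag, "->", coarse_category(tag))
```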

In [5]:
# Getting Spanish Corpus
nltk.download('cess_esp')
# Corpus reader for cess_esp
from nltk.corpus import cess_esp as cess
# Unigram Tagger
from nltk import UnigramTagger as ut
# Bigram Tagger
from nltk import BigramTagger as bt

[nltk_data] Downloading package cess_esp to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_esp.zip.


In [7]:
#@title Tagger Training by Unigrams 

# Load the tagged sentences
cess_sents = cess.tagged_sents()
# Use the first 90% of the sentences for training
fraction = int(len(cess_sents)*90/100)
# Train the unigram tagger
uni_tagger = ut(cess_sents[:fraction])
# Evaluate on the held-out sentences; accuracy() replaces the deprecated evaluate()
# Note: the slice starts at fraction+1, so the sentence at index `fraction` is in neither set
uni_tagger.accuracy(cess_sents[fraction+1:])


0.8069484240687679
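
The score above is a token-level accuracy: the fraction of held-out tokens whose predicted tag equals the gold tag, so tokens tagged `None` count as errors. The following is a minimal from-scratch sketch of the split and the metric, not NLTK's implementation. The function names and the toy data are hypothetical, and `<gold-tag>` is a placeholder rather than a real corpus tag.

```python
def split_train_test(sentences, train_percent=90):
    """Split a list of sentences into (train, test) at the given percentage."""
    cut = len(sentences) * train_percent // 100
    return sentences[:cut], sentences[cut:]

def token_accuracy(predicted_sents, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for predicted, gold in zip(predicted_sents, gold_sents):
        for (_, predicted_tag), (_, gold_tag) in zip(predicted, gold):
            correct += int(predicted_tag == gold_tag)
            total += 1
    return correct / total if total else 0.0

# Toy data: six tokens, one of which the tagger could not label
gold = [[("Yo", "pp1csn00"), ("soy", "vsip1s0"), ("una", "di0fs0")],
        [("persona", "ncfs000"), ("muy", "rg"), ("amable", "<gold-tag>")]]
predicted = [[("Yo", "pp1csn00"), ("soy", "vsip1s0"), ("una", "di0fs0")],
             [("persona", "ncfs000"), ("muy", "rg"), ("amable", None)]]

train, test = split_train_test(list(range(10)))
print(len(train), len(test))            # 9 1
print(token_accuracy(predicted, gold))  # 5 of 6 tokens correct
```

Unlike the cell above, this split starts the test set at the cut index itself, so no sentence is skipped.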

In [9]:
# Tokenize the Spanish sentence "I'm a really nice person" by splitting on spaces
uni_tagger.tag("Yo soy una persona muy amable".split(" "))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', 'di0fs0'),
 ('persona', 'ncfs000'),
 ('muy', 'rg'),
 ('amable', None)]

In [11]:
#@title Tagger Training by Bigrams

# Use the first 90% of the sentences for training
fraction = int(len(cess_sents)*90/100)
# Train the bigram tagger
bi_tagger = bt(cess_sents[:fraction])
# Evaluate on the held-out sentences; accuracy() replaces the deprecated evaluate()
# Note: the slice starts at fraction+1, so the sentence at index `fraction` is in neither set
bi_tagger.accuracy(cess_sents[fraction+1:])


0.1095272206303725

In [12]:
# Tokenize the Spanish sentence "I'm a really nice person" by splitting on spaces
bi_tagger.tag("Yo soy una persona muy amable".split(" "))

[('Yo', 'pp1csn00'),
 ('soy', 'vsip1s0'),
 ('una', None),
 ('persona', None),
 ('muy', None),
 ('amable', None)]
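
A likely explanation for the low bigram score is data sparsity combined with the lack of a fallback. A bigram tagger looks up the pair (previous tag, current word); once a pair is unseen, the token gets `None`, that `None` becomes the next token's context, and the failure tends to cascade through the rest of the sentence, as in the output above. The sketch below reproduces this behaviour from scratch and then adds a backoff chain (bigram, then unigram, then a default tag). It is a simplified illustration under stated assumptions, not NLTK's implementation: the function names, the two-sentence training set and the `"UNK"` default tag are all hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def train_bigram(tagged_sents):
    """Map each (previous tag, word) context to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None  # no previous tag at the start of a sentence
        for word, tag in sent:
            counts[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag_sentence(words, bigram, unigram=None, default=None):
    """Tag with the bigram model, backing off to unigram, then to default."""
    tagged, prev_tag = [], None
    for word in words:
        tag = bigram.get((prev_tag, word))
        if tag is None and unigram is not None:
            tag = unigram.get(word)
        if tag is None:
            tag = default
        tagged.append((word, tag))
        prev_tag = tag
    return tagged

# Toy training set; the tags are copied from the tagger outputs above
train = [
    [("Yo", "pp1csn00"), ("soy", "vsip1s0")],
    [("una", "di0fs0"), ("persona", "ncfs000")],
]
bigram = train_bigram(train)
unigram = train_unigram(train)
words = "Yo soy una persona amable".split()

no_backoff = tag_sentence(words, bigram)
with_backoff = tag_sentence(words, bigram, unigram, default="UNK")
print(no_backoff)    # None from "una" onwards
print(with_backoff)  # only the unseen word "amable" falls through to "UNK"
```

In NLTK the same idea is available through the `backoff` argument of the tagger constructors, for example `bt(cess_sents[:fraction], backoff=uni_tagger)`.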

# Improved Tagger with Stanza (StanfordNLP)

**What is Stanza?**

* The Stanford NLP research group maintained a suite of libraries for various NLP tasks. The suite was unified into a single Java-based service called **CoreNLP**: https://stanfordnlp.github.io/CoreNLP/index.html

* For Python, there is **StanfordNLP**: https://stanfordnlp.github.io/stanfordnlp/index.html

* However, **StanfordNLP** has been deprecated, and new versions of the NLP suite are maintained under the name **Stanza**: https://stanfordnlp.github.io/stanza/

In [13]:
!pip install stanza

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting stanza
  Downloading stanza-1.5.0-py3-none-any.whl (802 kB)
Collecting emoji
  Downloading emoji-2.2.0.tar.gz (240 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.2.0-py3-none-any.whl size=234926 sha256=10ba752a964041d1dd8c0f56e88c6c2477de6ff43463c1e9f228b9b3d9ee500d
  Stored in directory: /root/.cache/pip/wheels/9a/b8/0f/f580817231cbf59f6ade9fd132ff60ada1de9f7dc85521f857
Successfully built emoji
Installing collected packages: emoji, stanza
Successfully installed emoji-2.2.0 stan

In [14]:
# Importing Stanza Library in Spanish
import stanza
stanza.download('es')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Downloading default packages for language: es (Spanish) ...


Downloading https://huggingface.co/stanfordnlp/stanza-es/resolve/v1.5.0/models/default.zip:   0%|          | 0…

INFO:stanza:Finished downloading models and saved to /root/stanza_resources.


In [15]:
# Set up the Stanza pipeline (tokenization and POS tagging)
nlp = stanza.Pipeline('es', processors='tokenize,pos')
# Process the sentence "I'm a really nice person"
doc = nlp('yo soy una persona muy amable')

INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

INFO:stanza:Loading these models for language: es (Spanish):
| Processor | Package |
-----------------------
| tokenize  | ancora  |
| mwt       | ancora  |
| pos       | ancora  |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: pos
INFO:stanza:Done loading processors!


In [16]:
# Check the output

# Iterate over the sentences in the document
for sentence in doc.sentences:
  # Iterate over the words in each sentence
  for word in sentence.words:
    print(word.text, word.pos)

yo PRON
soy AUX
una DET
persona NOUN
muy ADV
amable ADJ


# Additional References:

* POS Tagging with Stanza https://stanfordnlp.github.io/stanza/pos.html#accessing-pos-and-morphological-feature-for-word

* Stanza | Github: https://github.com/stanfordnlp/stanza

* Paper on arXiv: https://arxiv.org/pdf/2003.07082.pdf