# Testing NLP libraries: NLTK and spaCy (part-of-speech tagging)

Some of the following examples and code snippets were taken/adapted from:
(a) NLTK webpage: https://www.nltk.org/
(b) spaCy webpage: https://spacy.io
(c) a notebook by Fernando Batista and Ricardo Ribeiro, my colleagues and dear friends from ISCTE. Thanks! (Any mistakes are mine.)


## 1. Install (if needed)


In [1]:
!pip install -U spacy # not needed in Colab (may be needed on your own computer)



## 2. Import

In [2]:
import re     # regular expressions
import nltk
import spacy
import pandas as pd

## 3. Load

In [3]:
# NLTK
nltk.download('averaged_perceptron_tagger_eng')

# spaCy
nlp_en = spacy.load("en_core_web_sm")

# For Portuguese
!python -m spacy download pt_core_news_sm
nlp_pt = spacy.load("pt_core_news_sm")


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Collecting pt-core-news-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.8.0/pt_core_news_sm-3.8.0-py3-none-any.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 91.3 MB/s eta 0:00:00
Installing collected packages: pt-core-news-sm
Successfully installed pt-core-news-sm-3.8.0
✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


## 4. Text to process


4.1 Pre-processing

In [7]:
# Simple text
simple_text = "Natural Language is my favourite course ever. I just love it."

# External text (do not forget to upload it to Colab -- check the left tab)
file_path = '/content/sample_data/P7_dataset_testeNLP.txt'
# Read as UTF-8 explicitly: the file contains accented Portuguese characters
with open(file_path, 'r', encoding='utf-8') as file:
  external_text = file.read()

# Separate a given list of punctuation marks from the words they follow
def split_punctuation(text):
  """Splits specified punctuation from words in the given text."""
  # Pattern matches any word character (\w) followed by any of the specified punctuation marks
  # and ensures a space is inserted between the word and the punctuation mark.
  punctuation_marks = [r'\.', r'\?', r'!', r',', r';']
  for mark in punctuation_marks:
    text = re.sub(f"(\\w)({mark})", r"\1 \2", text)
  return text

simple_text = split_punctuation(simple_text)
print(simple_text)

external_text = split_punctuation(external_text)
print(external_text)

Natural Language is my favourite course ever . I just love it .
O João come a sopa .
A Maria é muito simpática , mas não gosta de sopa .
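
The loop above runs one substitution per punctuation mark. As a minimal sketch (not part of the original notebook), the same separation can probably be done in a single pass with a character class. The names `split_punctuation_once` and `marks` are hypothetical, introduced here only for illustration.

```python
import re

def split_punctuation_once(text, marks=".?!,;"):
    """Insert a space between a word character and a following punctuation mark."""
    # Build a character class such as [\.\?!,;] from the given marks.
    pattern = re.compile(r"(\w)([" + re.escape(marks) + r"])")
    # \1 is the word character, \2 the punctuation mark.
    return pattern.sub(r"\1 \2", text)

sample = "Natural Language is my favourite course ever. I just love it."
result = split_punctuation_once(sample)
print(result)
```

Note that `\w` also matches digits, so a number such as `3.14` would be split as well. This holds for the loop-based version too, so neither variant is a substitute for a real tokenizer.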


4.2 Part-of-Speech tagging

In [8]:
# NLTK
posNLTK = nltk.pos_tag(simple_text.split())
print("NLTK_en:", posNLTK)

# spaCy
spaCyText_en = nlp_en(simple_text)
posSpaCy_en = [(token.text, token.tag_, token.pos_) for token in spaCyText_en]
print("spaCy_en:", posSpaCy_en)

spaCyText_pt = nlp_pt(external_text)
posSpaCy_pt = [(token.text, token.tag_, token.pos_) for token in spaCyText_pt]
print("spaCy_pt:", posSpaCy_pt)

NLTK_en: [('Natural', 'JJ'), ('Language', 'NNP'), ('is', 'VBZ'), ('my', 'PRP$'), ('favourite', 'JJ'), ('course', 'NN'), ('ever', 'RB'), ('.', '.'), ('I', 'PRP'), ('just', 'RB'), ('love', 'VB'), ('it', 'PRP'), ('.', '.')]
spaCy_en: [('Natural', 'NNP', 'PROPN'), ('Language', 'NNP', 'PROPN'), ('is', 'VBZ', 'AUX'), ('my', 'PRP$', 'PRON'), ('favourite', 'JJ', 'ADJ'), ('course', 'NN', 'NOUN'), ('ever', 'RB', 'ADV'), ('.', '.', 'PUNCT'), ('I', 'PRP', 'PRON'), ('just', 'RB', 'ADV'), ('love', 'VBP', 'VERB'), ('it', 'PRP', 'PRON'), ('.', '.', 'PUNCT')]
spaCy_pt: [('O', 'DET', 'DET'), ('João', 'PROPN', 'PROPN'), ('come', 'VERB', 'VERB'), ('a', 'DET', 'DET'), ('sopa', 'NOUN', 'NOUN'), ('.', 'PUNCT', 'PUNCT'), ('\n', 'SPACE', 'SPACE'), ('A', 'DET', 'DET'), ('Maria', 'PROPN', 'PROPN'), ('é', 'AUX', 'AUX'), ('muito', 'ADV', 'ADV'), ('simpática', 'ADJ', 'ADJ'), (',', 'PUNCT', 'PUNCT'), ('mas', 'CCONJ', 'CCONJ'), ('não', 'ADV', 'ADV'), ('gosta', 'VERB', 'VERB'), ('de', 'ADP', 'ADP'), ('sopa', 'NOUN', '
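
To see where the two English taggers disagree, the fine-grained tags printed above can be compared token by token. The sketch below hard-codes the tags copied from that output, so it runs without downloading any model. The variable names `nltk_tags` and `spacy_tags` are hypothetical. It assumes both lists contain the same tokens in the same order, which holds for this sentence but is not guaranteed in general, since spaCy's tokenizer may split text differently from `str.split()`.

```python
# Fine-grained tags copied from the NLTK output above.
nltk_tags = [('Natural', 'JJ'), ('Language', 'NNP'), ('is', 'VBZ'), ('my', 'PRP$'),
             ('favourite', 'JJ'), ('course', 'NN'), ('ever', 'RB'), ('.', '.'),
             ('I', 'PRP'), ('just', 'RB'), ('love', 'VB'), ('it', 'PRP'), ('.', '.')]

# Fine-grained tags (token.tag_) copied from the spaCy output above.
spacy_tags = [('Natural', 'NNP'), ('Language', 'NNP'), ('is', 'VBZ'), ('my', 'PRP$'),
              ('favourite', 'JJ'), ('course', 'NN'), ('ever', 'RB'), ('.', '.'),
              ('I', 'PRP'), ('just', 'RB'), ('love', 'VBP'), ('it', 'PRP'), ('.', '.')]

# Keep only the tokens on which the two taggers assign different tags.
disagreements = [
    (word, nltk_tag, spacy_tag)
    for (word, nltk_tag), (_, spacy_tag) in zip(nltk_tags, spacy_tags)
    if nltk_tag != spacy_tag
]

for word, nltk_tag, spacy_tag in disagreements:
    print(f"{word}: NLTK={nltk_tag}, spaCy={spacy_tag}")
```

For this sentence the two taggers differ only on `Natural` (JJ vs. NNP) and `love` (VB vs. VBP).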

4.3 Find a specific PoS

In [9]:
from spacy.parts_of_speech import PROPN

def is_proper_noun(token):
    # token.pos holds the integer ID of the coarse-grained tag; PROPN is the matching constant
    return token.pos == PROPN

# For EN
count = 0
for token in spaCyText_en:
  if is_proper_noun(token):
    print(token)
    count += 1
print("#proper nouns in english text = %d" % count)

# For PT
count = 0
for token in spaCyText_pt:
  if is_proper_noun(token):
    print(token)
    count += 1
print("#proper nouns in portuguese text = %d" % count)

Natural
Language
#proper nouns in english text = 2
João
Maria
#proper nouns in portuguese text = 2
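
Counting one tag at a time generalises to a frequency table over all coarse-grained tags. The following is a minimal sketch using `collections.Counter` over the `(text, tag_, pos_)` triples copied from the English spaCy output above, so it runs without loading a model. The name `pos_triples` is hypothetical.

```python
from collections import Counter

# (text, tag_, pos_) triples copied from the spaCy_en output above.
pos_triples = [('Natural', 'NNP', 'PROPN'), ('Language', 'NNP', 'PROPN'),
               ('is', 'VBZ', 'AUX'), ('my', 'PRP$', 'PRON'),
               ('favourite', 'JJ', 'ADJ'), ('course', 'NN', 'NOUN'),
               ('ever', 'RB', 'ADV'), ('.', '.', 'PUNCT'),
               ('I', 'PRP', 'PRON'), ('just', 'RB', 'ADV'),
               ('love', 'VBP', 'VERB'), ('it', 'PRP', 'PRON'),
               ('.', '.', 'PUNCT')]

# Count how often each coarse-grained tag (third element) occurs.
pos_counts = Counter(pos for _, _, pos in pos_triples)

for pos, n in pos_counts.most_common():
    print(f"{pos}: {n}")
```

With a loaded pipeline, the same counter could be built directly from the document, for example `Counter(token.pos_ for token in spaCyText_en)`.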
