# Testing Natural Language Processing libraries: spaCy and Stanza

Some of the following examples and code snippets were taken/adapted from:

(a) spaCy webpage: https://spacy.io

(b) Stanza documentation: https://stanfordnlp.github.io/stanza/installation_usage.html

(c) a notebook from Fernando Batista and Ricardo Ribeiro, my colleagues and dear friends from ISCTE. Thanks! (any mistake is on me)

**1. Install** (if needed)

In [None]:
# !pip install -U pip setuptools wheel
# !pip install -U spacy
!pip install -U stanza # This is probably needed
!python3 -m spacy download en_core_web_sm
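The "if needed" part can be checked before reinstalling anything. A minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and belongs to neither library:

```python
import importlib.util

def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(module_name) is not None

for name in ("spacy", "stanza"):
    status = "already installed" if is_installed(name) else "missing, run the pip command above"
    print(f"{name}: {status}")
```

This only tells you whether the package itself is present; the language models (such as `en_core_web_sm`) still have to be downloaded separately.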

**2. Import**

In [None]:
# Spacy <-- widely used in NLP
import spacy
from spacy import displacy  # for visualization (see below)

# Stanza <-- widely used in NLP
import stanza

**3. Load**

In [None]:
# Spacy
nlp_spacy_en = spacy.load('en_core_web_sm')

# Stanza
stanza.download('en')
nlp_stanza_en = stanza.Pipeline('en')

**4. Text to process**

In [None]:
# We will test with a simple text in English and an external text in Portuguese

# Simple text in English
my_text_en = """Natural Language is my favourite course ever. I just love it."""
print("EN: ", my_text_en)

**4.1 Tokenization**

In [None]:
# Spacy
my_text_en_spacy = my_text_en.replace("\n", " ") # you may need this for other texts
doc_spacy_en = nlp_spacy_en(my_text_en_spacy)

print(doc_spacy_en[:10]) # print first tokens
print(" + ".join([token.text for token in doc_spacy_en])) # show tokens split by +

In [None]:
# Stanza
doc_stanza_en = nlp_stanza_en(my_text_en)

# collect the text of every token, sentence by sentence
def my_print(doc):
    tokens = []
    for sentence in doc.sentences:
        for token in sentence.tokens:
            tokens.append(token.text)
    return tokens

print(my_print(doc_stanza_en))
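To see why a dedicated tokenizer is worth having, it helps to compare against naive splitting. The sketch below uses only the standard library; it is not how spaCy or Stanza tokenize internally, just a baseline to compare their output with:

```python
import re

text = "Natural Language is my favourite course ever. I just love it."

# Naive baseline: split on whitespace only, so punctuation stays glued to words.
whitespace_tokens = text.split()

# Slightly better baseline: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(" + ".join(whitespace_tokens))
print(" + ".join(regex_tokens))
```

The whitespace version keeps `ever.` as one token, while the regex version separates the full stop. Real tokenizers additionally handle cases such as abbreviations and contractions, which a regex this simple would get wrong.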

**4.2. Multi-generator (POS, lemma, etc.)**

a) spaCy

*Text*: The original word text.<br>
*Lemma*: The base form of the word.<br>
*POS*: The simple UPOS part-of-speech tag.<br>
*Tag*: The detailed part-of-speech tag.<br>
*Dep*: Syntactic dependency, i.e. the relation between tokens.<br>
*Shape*: The word shape: capitalization, punctuation, digits.<br>
*is alpha*: Does the token consist only of alphabetic characters?<br>
*is stop*: Is the token part of a stop list, i.e. the most common words of the language?
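To make *Shape*, *is alpha* and *is stop* more concrete, here is a rough pure-Python approximation. Everything in it is an assumption for illustration: `approx_shape` and `TOY_STOP_WORDS` are hypothetical names, the stop list is made up, and the truncation of repeated symbols after four characters reflects my understanding of spaCy's behaviour, so compare it against `token.shape_` before relying on it.

```python
def approx_shape(text: str) -> str:
    """Map letters to X/x and digits to d; keep at most 4 repeats of a symbol."""
    out = []
    last, run = "", 0
    for ch in text:
        if ch.isalpha():
            sym = "X" if ch.isupper() else "x"
        elif ch.isdigit():
            sym = "d"
        else:
            sym = ch  # punctuation and other characters are kept as they are
        run = run + 1 if sym == last else 1
        last = sym
        if run <= 4:
            out.append(sym)
    return "".join(out)

# Hypothetical mini stop list; the real one in spaCy is much larger.
TOY_STOP_WORDS = {"is", "my", "i", "just", "it"}

for word in ["Natural", "is", "ever", ".", "2024"]:
    print(word, approx_shape(word), word.isalpha(), word.lower() in TOY_STOP_WORDS)
```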

b) Stanza

*PoS*
...

In [None]:
# spaCy
for token in doc_spacy_en:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_, token.is_alpha, token.is_stop)

In [None]:
# stanza
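# Tokenization + POS tagging only.
# tokenize_pretokenized=True is left out on purpose: with it, a raw string
# is split on whitespace only, so "ever." would stay a single token.
nlp_pos_en = stanza.Pipeline(lang='en', processors='tokenize,pos')
doc_en = nlp_pos_en(my_text_en)
print("{:C}".format(doc_en))  # CoNLL-style output, one token per line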
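The CoNLL-style output is easier to work with once it is split into fields. The sketch below assumes the printed text follows the ten-column, tab-separated CoNLL-U layout. The `sample` string is written by hand for illustration (it is not actual Stanza output), and `parse_conllu` is a hypothetical helper:

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hand-written sample; "_" marks fields that are not filled in.
sample = "\n".join([
    "# text = I just love it.",
    "1\tI\t_\tPRON\tPRP\t_\t_\t_\t_\t_",
    "2\tjust\t_\tADV\tRB\t_\t_\t_\t_\t_",
    "3\tlove\t_\tVERB\tVBP\t_\t_\t_\t_\t_",
    "4\tit\t_\tPRON\tPRP\t_\t_\t_\t_\t_",
    "5\t.\t_\tPUNCT\t.\t_\t_\t_\t_\t_",
])

def parse_conllu(block: str) -> list:
    """Turn CoNLL-U text into a list of dicts, skipping comments and blank lines."""
    rows = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    return rows

rows = parse_conllu(sample)
for row in rows:
    print(row["form"], row["upos"], row["xpos"])
```

In the notebook, the same function could be tried on `"{:C}".format(doc_en)` instead of `sample`.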

**4.3. Syntactic Parsing**

- dependencies with spaCy
- constituents with stanza
- visualization with spaCy

In [None]:
# spaCy: dependency
# Try the different options
options = {"compact": True, "bg": "#09a3d5", "color": "white", "font": "Source Sans Pro"} # Try!
# options = {}

# displacy.render(doc_spacy_en[:10], style="dep", options=options) # use displacy.serve instead of render if not in a notebook
html = displacy.render(list(doc_spacy_en.sents)[:1], style="dep", options=options, jupyter=False)

# check the syntactic dependencies in the generated deps.html (under Files or local folder)
with open("deps.html", "w", encoding="utf-8") as f:
  f.write(html)

In [None]:
# stanza: constituency
nlp_stanza_en = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency')

doc_stanza_en = nlp_stanza_en(my_text_en)
for sentence in doc_stanza_en.sentences:
    print(sentence.constituency)
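A constituency tree is usually shown as a bracketed string, and it can help to read one back into a nested structure. The sketch below assumes Penn-Treebank-style bracketing. The `sample_tree` string is written by hand (not copied from Stanza), and `parse_brackets` and `leaves` are hypothetical helpers:

```python
def parse_brackets(s: str):
    """Parse '(LABEL child child ...)' into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]                # a leaf, i.e. a word of the sentence
        pos += 1
        return word

    return read()

def leaves(tree) -> list:
    """Collect the words of the tree from left to right."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in leaves(child)]

sample_tree = "(ROOT (S (NP (PRP I)) (ADVP (RB just)) (VP (VBP love) (NP (PRP it))) (. .)))"
tree = parse_brackets(sample_tree)
print(tree[0])        # label of the root node
print(leaves(tree))   # the words of the sentence
```

In the notebook, `str(sentence.constituency)` could be passed to `parse_brackets` in place of `sample_tree`, provided the printed tree uses the same bracketing.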