# Natural Language Processing (spaCy)

## Installation

- spaCy [language models](https://spacy.io/models/en-starters)

```
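# Install the package
## In a terminal (prefix each command with "!" to run it from a notebook cell):
pip install spacy

## Download pipelines for English and Chinese.
## As of spaCy v3.0, shortcuts like "en" are deprecated; use the full package name.
python -m spacy download en_core_web_sm
python -m spacy download zh_core_web_sm
python -m spacy download en_core_web_lg ## large English pipeline with pretrained word vectors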
```

In [1]:
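# Install the package (the leading "!" runs a shell command from the notebook)
!pip install spacy

## Download the English pipeline.
## Note: the 'en' shortcut is deprecated as of spaCy v3.0 and resolves to 'en_core_web_sm'.
!spacy download en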

2023-09-06 21:48:37.441444: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 55.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
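
Before loading a pipeline, it can be handy to confirm from Python that the package is actually installed. The sketch below uses `spacy.util.is_package` for that check. The helper name `load_or_blank` and the fallback to a blank English pipeline are illustrative assumptions, not part of the original notebook.

```python
import spacy
from spacy.util import is_package


def load_or_blank(name: str, lang: str = "en"):
    """Load an installed pipeline package, or fall back to a blank pipeline.

    `load_or_blank` is a hypothetical helper used only for illustration.
    """
    if is_package(name):
        return spacy.load(name)
    # Fallback: tokenizer only, so no tagger, parser, or NER is available.
    return spacy.blank(lang)


nlp = load_or_blank("en_core_web_sm")
print(nlp.pipe_names)  # an empty list means the blank fallback was used
```

Note that `is_package` only checks whether a Python package with that name is installed, so it is best called with a known pipeline name.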


In [2]:
!pip install spacy-transformers

Collecting spacy-transformers
  Downloading spacy_transformers-1.2.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (190 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 190.8/190.8 kB 2.6 MB/s eta 0:00:00
Collecting transformers<4.31.0,>=3.4.0 (from spacy-transformers)
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 16.6 MB/s eta 0:00:00
Collecting spacy-alignments<1.0.0,>=0.7.2 (from spacy-transformers)
  Downloading spacy_alignments-0.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 26.8 MB/s eta 0:00:00
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers<4.31.0,>=3.4.0->spacy-transformers)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
# Download the English transformer pipeline: https://spacy.io/models/en#en_core_web_trf
!python -m spacy download en_core_web_trf

2023-09-06 21:50:34.757603: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Collecting en-core-web-trf==3.6.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.6.1/en_core_web_trf-3.6.1-py3-none-any.whl (460.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 460.3/460.3 MB 1.2 MB/s eta 0:00:00
Installing collected packages: en-core-web-trf
Successfully installed en-core-web-trf-3.6.1
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')


In [7]:
import spacy
from spacy import displacy

# Load the English transformer pipeline.
# Optionally pass disable=["parser"] to skip components that are not needed.
nlp_en = spacy.load("en_core_web_trf")

# Parse text: `doc` holds a misspelled query, `doc_correct` the corrected one.
# doc = nlp_en('This is a sentence')
doc = nlp_en('How many albums does alis in pains have?')
doc_correct = nlp_en('How many albums does alice in chains have?')


In [5]:
%timeit doc_correct = nlp_en('How many albums does alice in chains have?')

115 ms ± 21.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
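
If per-call latency matters, one lever is to skip pipeline components whose annotations are not needed, either with `disable=[...]` at load time or temporarily with `nlp.select_pipes`. How much this saves depends on the pipeline; with a transformer pipeline, most of the time is likely spent in the transformer itself. The sketch below only demonstrates the mechanism. As an assumption made so that it runs without downloading a model, it uses a blank English pipeline with a rule-based `sentencizer` standing in for a trained component.

```python
import spacy

# Stand-in pipeline (assumption for illustration): tokenizer plus a rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "How many albums does alice in chains have? I would like to know."

# Temporarily disable a component; it is restored when the block exits.
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)
    doc_fast = nlp(text)  # tokenized only, no sentence boundaries set

active_after = list(nlp.pipe_names)
doc_full = nlp(text)

print(active_inside)             # components that ran inside the block
print(active_after)              # components active again afterwards
print(len(list(doc_full.sents)))  # number of sentences found by the sentencizer
```

The same pattern should apply to `nlp_en`, for example `nlp_en.select_pipes(disable=["parser"])`, when the dependency parse is not required.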


## Linguistic Features

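- After parsing and tagging a text, we can extract token-level information:
    - Text (`token.text`): the original word text
    - Lemma (`token.lemma_`): the base form of the word
    - POS (`token.pos_`): the simple universal POS tag
    - Tag (`token.tag_`): the detailed POS tag
    - Dep (`token.dep_`): the syntactic dependency relation
    - Shape (`token.shape_`): the word shape (capitalization, punctuation, digits)
    - is alpha (`token.is_alpha`): whether the token consists of alphabetic characters
    - is stop (`token.is_stop`): whether the token is on the stop list of very common words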
    
:::{admonition,dropdown,note}
For more information on POS tags, see the spaCy [POS tag scheme documentation](https://spacy.io/api/annotation#pos-tagging).
:::
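
Some of the attributes listed above (shape, is alpha, is stop) are lexical: they come from the tokenizer and the language defaults rather than from trained components. The following sketch illustrates this with a blank English pipeline, which is an assumption made here so that the example runs without a downloaded model. Attributes such as POS are left empty in that setting.

```python
import spacy

# Tokenizer-only pipeline: no trained tagger, parser, or lemmatizer is loaded.
nlp = spacy.blank("en")
doc = nlp("How many albums does alice in chains have?")

# Lexical attributes are available without any trained component.
rows = [(t.text, t.shape_, t.is_alpha, t.is_stop) for t in doc]
for row in rows:
    print(row)

# POS tags are predicted by trained components, so they are empty here.
print(repr(doc[0].pos_))
```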

In [8]:
# Part-of-speech tagging: print the token-level attributes for each token
for token in doc:
    print((token.text,
           token.lemma_,
           token.pos_,
           token.tag_,
           token.dep_,
           token.shape_,
           token.is_alpha,
           token.is_stop))

('How', 'how', 'SCONJ', 'WRB', 'advmod', 'Xxx', True, True)
('many', 'many', 'ADJ', 'JJ', 'amod', 'xxxx', True, True)
('albums', 'album', 'NOUN', 'NNS', 'dobj', 'xxxx', True, False)
('does', 'do', 'AUX', 'VBZ', 'aux', 'xxxx', True, True)
('alis', 'alis', 'PROPN', 'NNP', 'nsubj', 'xxxx', True, False)
('in', 'in', 'ADP', 'IN', 'prep', 'xx', True, True)
('pains', 'pains', 'PROPN', 'NNP', 'pobj', 'xxxx', True, False)
('have', 'have', 'VERB', 'VB', 'ROOT', 'xxxx', True, True)
('?', '?', 'PUNCT', '.', 'punct', '?', False, False)


In [9]:
## Output in different ways
# One token per line, as text_TAG
for token in doc:
    print(f'{token.text}_{token.tag_}')

# All tokens on one line, as text/TAG (each pair is preceded by a space)
out = ''.join(f' {token.text}/{token.tag_}' for token in doc)
print(out)


How_WRB
many_JJ
albums_NNS
does_VBZ
alis_NNP
in_IN
pains_NNP
have_VB
?_.
 How/WRB many/JJ albums/NNS does/VBZ alis/NNP in/IN pains/NNP have/VB ?/.
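
The `text/TAG` line above is a convenient plain-text format for tagged output. As a small sketch of going the other way, the hypothetical helper `parse_tagged` below splits such a line back into `(text, tag)` pairs using only the standard library. Splitting on the last `/` of each item is an assumption that keeps tokens containing slashes intact.

```python
def parse_tagged(line: str, sep: str = "/") -> list[tuple[str, str]]:
    """Split a 'text/TAG text/TAG ...' line into (text, tag) pairs.

    Hypothetical helper for illustration; not part of spaCy.
    """
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST separator, so 'a/b/NN' gives ('a/b', 'NN').
        text, _, tag = item.rpartition(sep)
        pairs.append((text, tag))
    return pairs


tagged = " How/WRB many/JJ albums/NNS does/VBZ alis/NNP in/IN pains/NNP have/VB ?/."
pairs = parse_tagged(tagged)
print(pairs[:3])
```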


In [11]:
## Check meaning of a POS tag
spacy.explain('NNP')


'noun, proper singular'
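
To look up several labels at once, `spacy.explain` can be called in a loop. The sketch below builds a small glossary for the fine-grained tags that appear in the output earlier in this section; the variable names are illustrative.

```python
import spacy

# Fine-grained tags observed in the example output in this section.
tags = ["WRB", "JJ", "NNS", "VBZ", "NNP", "IN", "VB"]

# Map each tag to its human-readable description.
glossary = {tag: spacy.explain(tag) for tag in tags}
for tag, description in glossary.items():
    print(f"{tag:<5}{description}")
```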

## Visualizing Linguistic Features

In [12]:
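from IPython.display import display, HTML
from spacy import displacy

# Render the dependency parse of `doc` (the processed spaCy document) as a full HTML page.
# jupyter=False makes render() return the markup as a string; inside a notebook it
# would otherwise display the result directly and return None.
html = displacy.render(doc, style="dep", page=True, jupyter=False)

display(HTML(html))

# Save to output.html
with open("output.html", "w", encoding="utf-8") as f:
    f.write(html)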
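
Outside a notebook, the same renderer can produce a standalone SVG file. The sketch below builds a `Doc` by hand so that it runs without a trained pipeline. The words and dependency labels are copied from the token output earlier in this section, while the head indices and the file name `parse.svg` are assumptions filled in for illustration, since the original output does not show them.

```python
from pathlib import Path

from spacy import displacy
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["How", "many", "albums", "does", "alis", "in", "pains", "have", "?"]
# Dependency labels as printed in the token table earlier in this section.
deps = ["advmod", "amod", "dobj", "aux", "nsubj", "prep", "pobj", "ROOT", "punct"]
# ASSUMPTION: hand-written head indices (absolute token positions), for illustration only.
heads = [1, 2, 7, 7, 7, 4, 5, 7, 7]

doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# jupyter=False makes render() return the SVG markup instead of displaying it.
svg = displacy.render(doc, style="dep", jupyter=False)

# Write the markup to a file that can be opened in a browser.
Path("parse.svg").write_text(svg, encoding="utf-8")
```

With a real pipeline, the hand-built `Doc` would simply be replaced by the result of `nlp_en(...)`.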