# Exploring the spaCy NLP library

This notebook is a basic, introductory exploration of the spaCy NLP library.  We will explore how to:

* download stock models
* perform word and sentence tokenization
* extract part-of-speech tags
* chunk noun phrases
* perform named entity recognition

This is mostly for my own enlightenment and is not intended to be a thorough tutorial.

[SpaCy](https://spacy.io/) is an [open-source library](https://github.com/explosion/spaCy) for performing a variety of NLP tasks.  SpaCy claims to be an "industrial strength" library aimed at real-world and production NLP use cases.

Let's begin by importing the spaCy library and any dependencies we might need.

In [20]:
import time

import spacy

## Downloading and installing stock models

SpaCy comes with a number of stock models, in various languages, that are ready for immediate use. After browsing through the [online model catalog](https://spacy.io/models), it sounds like we'll be interested in looking at the `en_core_web_sm` model, which is the stock English model, trained over data fetched from the web and with a relatively small footprint. The model is described by spaCy as an "English multi-task CNN trained on OntoNotes," that "Assigns context-specific token vectors, POS tags, dependency parse and named entities."

Note that each model contains multiple pipeline components, i.e., NLP features, packaged together. The `en_core_web_sm` model, for example, contains tagger, parser and NER components.

SpaCy models can be downloaded directly from a shell prompt.

In [2]:
%%bash
python3 -m spacy download en_core_web_sm

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

Alternatively, the model can be downloaded directly from a Python or IPython prompt.

In [5]:
spacy.cli.download('en_core_web_sm')

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

Once the model has been successfully downloaded, it can be loaded from within Python using the `spacy.load` function.

Note that we use the variable name `nlp` for the object returned by the model load operation. This object can be thought of as an "NLP processing pipeline" that can be called on a text document to apply all of the pipeline components contained in the downloaded model.

In [14]:
nlp = spacy.load('en_core_web_sm')
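
As a quick check of what was actually loaded, the component names are available on `nlp.pipe_names`. The sketch below also guards the load, since `spacy.load` raises `OSError` when the model package is not installed. The `load_pipeline` helper and its fallback to a tokenizer-only `spacy.blank('en')` pipeline are my own illustrative choices, not something spaCy does automatically.

```python
import spacy


def load_pipeline(name='en_core_web_sm'):
    """Load a stock model, falling back to a tokenizer-only English pipeline."""
    try:
        return spacy.load(name)
    except OSError:
        # The model package is not installed; the blank pipeline can still
        # tokenize but has no tagger, parser or NER components.
        return spacy.blank('en')


nlp = load_pipeline()
print(nlp.pipe_names)  # component names, in the order they are applied
```

With the stock model installed, the printed list should include the tagger, parser and NER components mentioned above; with the fallback it is empty.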

## Sample document

Next, let's load some sample data to play with.  The document that we will look at is a plaintext version of my PhD dissertation; however, any English document would do here.

In [15]:
with open('dissertation.txt', mode='r', encoding='utf8') as fh:
    data = fh.read()

Just to make sure it was loaded correctly, we print the first 1,500 characters.

In [16]:
print(data[:1500])

DISSERTATION

CONVOLUTIONAL NEURAL NETWORKS FOR EEG SIGNAL CLASSIFICATION IN
ASYNCHRONOUS BRAIN-COMPUTER INTERFACES

Submitted by
Elliott M. Forney
Department of Computer Science

In partial fulfillment of the requirements
For the Degree of Doctor of Philosophy
Colorado State University
Fort Collins, Colorado
Fall 2019

Doctoral Committee:
Advisor: Charles Anderson
Asa Ben-Hur
Michael Kirby
Donald Rojas

Copyright Elliott M. Forney 2019

This work is licensed under the Creative Commons
Attribution-NonCommercial-NoDerivatives 4.0 International Public License.

You are permitted to share this document without modification for non-commercial purposes.
To view a copy of the full license, please see Appendix ??.

ABSTRACT

CONVOLUTIONAL NEURAL NETWORKS FOR EEG SIGNAL CLASSIFICATION IN
ASYNCHRONOUS BRAIN-COMPUTER INTERFACES

Brain-Computer Interfaces (BCIs) are emerging technologies that enable users to interact
with computerized devices using only voluntary changes in their mental state. BC

## Applying the pipeline

Next, we apply the `nlp` pipeline to our text `data` and store the resulting object in the variable `doc`.

This takes a bit, so we also track the time taken to apply the pipeline.  On my machine, this takes about eight seconds.  Note that spaCy does have a [GPU-enabled package that can be installed](https://spacy.io/usage#gpu), but we'll stick to the standard version for now and explore computational performance more deeply on another day.

In [24]:
start_time = time.time()

doc = nlp(data)

time.time() - start_time

7.440038204193115
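
For one-off measurements like this, `time.perf_counter` is generally a better clock than `time.time`, since it is monotonic and has higher resolution. Below is a minimal sketch of a reusable timer. The `timed` helper is a hypothetical name of my own, and the `sum` call merely stands in for `nlp(data)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print the wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f'{label}: {elapsed:.3f} s')


with timed('pipeline'):
    # Stand-in workload; in the notebook this would be doc = nlp(data).
    result = sum(range(1_000_000))
```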

## Looking at tokens

Note that the `doc` object is an instance of `spacy.tokens.doc.Doc` and behaves like a sequence of word tokens when iterated over directly.

In [25]:
type(doc)

spacy.tokens.doc.Doc

We can see that our document contains about 89,000 tokens (a count that includes punctuation and whitespace tokens, not just words), and we can easily access the string representation of each one by iterating over `doc` and reading the `.text` attribute of the resulting `spacy.tokens.token.Token` objects.

In [27]:
type(doc[0]), len(doc)

(spacy.tokens.token.Token, 89119)

This allows for simple, pythonic expressions to be used to access word tokens.

In [10]:
[token.text for token in doc[:20]]

['DISSERTATION',
 '\n\n',
 'CONVOLUTIONAL',
 'NEURAL',
 'NETWORKS',
 'FOR',
 'EEG',
 'SIGNAL',
 'CLASSIFICATION',
 'IN',
 '\n',
 'ASYNCHRONOUS',
 'BRAIN',
 '-',
 'COMPUTER',
 'INTERFACES',
 '\n\n',
 'Submitted',
 'by',
 '\n']

## Token offsets

The character offset of each token within the original text is stored in the `.idx` attribute of each `Token` object.  The begin and end offsets can then be extracted using the starting offset and the length of each token.

In [38]:
[(token.text, token.idx, token.idx + len(token)) for token in doc[:20]]

[('DISSERTATION', 0, 12),
 ('\n\n', 12, 14),
 ('CONVOLUTIONAL', 14, 27),
 ('NEURAL', 28, 34),
 ('NETWORKS', 35, 43),
 ('FOR', 44, 47),
 ('EEG', 48, 51),
 ('SIGNAL', 52, 58),
 ('CLASSIFICATION', 59, 73),
 ('IN', 74, 76),
 ('\n', 76, 77),
 ('ASYNCHRONOUS', 77, 89),
 ('BRAIN', 90, 95),
 ('-', 95, 96),
 ('COMPUTER', 96, 104),
 ('INTERFACES', 105, 115),
 ('\n\n', 115, 117),
 ('Submitted', 117, 126),
 ('by', 127, 129),
 ('\n', 129, 130)]

This can be confirmed by extracting each token from the original document using only the token offsets.

In [40]:
[data[token.idx:token.idx + len(token)] for token in doc[:20]]

['DISSERTATION',
 '\n\n',
 'CONVOLUTIONAL',
 'NEURAL',
 'NETWORKS',
 'FOR',
 'EEG',
 'SIGNAL',
 'CLASSIFICATION',
 'IN',
 '\n',
 'ASYNCHRONOUS',
 'BRAIN',
 '-',
 'COMPUTER',
 'INTERFACES',
 '\n\n',
 'Submitted',
 'by',
 '\n']
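
The mapping also works in the other direction: `Doc.char_span` turns a pair of character offsets back into a `Span` of tokens. The sketch below uses a tokenizer-only `spacy.blank('en')` pipeline and a short made-up string so that it runs without the downloaded model; the same calls should apply to the `doc` built above.

```python
import spacy

nlp = spacy.blank('en')  # tokenizer only, so no model download is needed
text = 'Fort Collins, Colorado'
doc = nlp(text)

# (text, begin, end) triples, built the same way as above.
offsets = [(token.text, token.idx, token.idx + len(token)) for token in doc]
print(offsets)

# Offsets that line up with token boundaries map back to a Span...
span = doc.char_span(0, 12)
print(span.text, span.start, span.end)

# ...while offsets that cut through a token yield None by default.
print(doc.char_span(0, 10))
```

This is handy when annotations arrive as character offsets, e.g., from an external labeling tool, and need to be aligned with spaCy's tokens.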

## Getting part-of-speech tags

The predicted Part-Of-Speech (POS) tags can be accessed using the `.pos_` attribute on each `Token` object.

Note that the `.pos_` attribute returns the string representation of the POS tag from the [Universal POS Tag Set](https://universaldependencies.org/docs/u/pos/) while the `.pos` attribute is an integer enum-style representation.

In [31]:
[token.pos_ for token in doc[:20]]

['NOUN',
 'SPACE',
 'NOUN',
 'NOUN',
 'NOUN',
 'ADP',
 'PROPN',
 'PROPN',
 'VERB',
 'ADP',
 'SPACE',
 'PROPN',
 'PROPN',
 'PUNCT',
 'NOUN',
 'PROPN',
 'SPACE',
 'VERB',
 'ADP',
 'SPACE']

In [32]:
[token.pos for token in doc[:20]]

[92,
 103,
 92,
 92,
 92,
 85,
 96,
 96,
 100,
 85,
 103,
 96,
 96,
 97,
 92,
 96,
 103,
 100,
 85,
 103]
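
If the tag abbreviations are unfamiliar, `spacy.explain` returns a short description for a label. The sketch below tallies the string tags for the first 20 tokens, copied by hand from the output above so that it runs without the model, and prints a description next to each count.

```python
from collections import Counter

import spacy

# POS tags for the first 20 tokens, copied from the output above.
tags = ['NOUN', 'SPACE', 'NOUN', 'NOUN', 'NOUN', 'ADP', 'PROPN', 'PROPN',
        'VERB', 'ADP', 'SPACE', 'PROPN', 'PROPN', 'PUNCT', 'NOUN', 'PROPN',
        'SPACE', 'VERB', 'ADP', 'SPACE']

counts = Counter(tags)
for tag, n in counts.most_common():
    # spacy.explain returns a short description, or None for unknown labels.
    print(f'{tag:<6} {n:>2}  {spacy.explain(tag)}')
```

With the processed document in hand, `Counter(token.pos_ for token in doc)` would give the same kind of tally over the whole text.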

It is straightforward to get tuples containing the token text alongside the corresponding POS tags.

In [37]:
[(token.text, token.pos_) for token in doc[:20]]

[('DISSERTATION', 'NOUN'),
 ('\n\n', 'SPACE'),
 ('CONVOLUTIONAL', 'NOUN'),
 ('NEURAL', 'NOUN'),
 ('NETWORKS', 'NOUN'),
 ('FOR', 'ADP'),
 ('EEG', 'PROPN'),
 ('SIGNAL', 'PROPN'),
 ('CLASSIFICATION', 'VERB'),
 ('IN', 'ADP'),
 ('\n', 'SPACE'),
 ('ASYNCHRONOUS', 'PROPN'),
 ('BRAIN', 'PROPN'),
 ('-', 'PUNCT'),
 ('COMPUTER', 'NOUN'),
 ('INTERFACES', 'PROPN'),
 ('\n\n', 'SPACE'),
 ('Submitted', 'VERB'),
 ('by', 'ADP'),
 ('\n', 'SPACE')]

## Token "is" and "like" attributes

Each token also has convenient attributes that denote whether or not it looks like a digit, number, email, et cetera.

In [43]:
word_2019 = doc[54]
word_2019

2019

In [44]:
word_2019, word_2019.is_digit, word_2019.like_num, word_2019.like_email

(2019, True, True, False)
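
These flags are lexical, i.e., computed from the token text alone, so they can be tried without a trained model. The sketch below builds a `Doc` by hand from three made-up words (my own examples, not taken from the dissertation) to contrast `is_digit`, `like_num` and `like_email`.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank('en')  # lexical flags do not require a trained model

# Build the Doc from pre-split words to sidestep any tokenizer decisions.
doc = Doc(nlp.vocab, words=['2019', 'ten', 'me@example.com'])

rows = [(t.text, t.is_digit, t.like_num, t.like_email) for t in doc]
for row in rows:
    print(row)
```

Note that `like_num` is broader than `is_digit`: for English it also accepts spelled-out numbers such as "ten".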

## Named Entity Recognition (NER)

Our pipeline also contained a component for performing NER. While iterating over the `doc` variable directly operates over tokens, the results from additional pipeline components are generally exposed as attributes on the `doc` object that hold `Span` objects, each of which represents a series of contiguous tokens.

In this case, we have an attribute called `doc.ents` that is a tuple of `Span` objects representing each entity.

In [45]:
type(doc.ents), type(doc.ents[0])

(tuple, spacy.tokens.span.Span)

Again, pythonic expressions, e.g., list comprehensions, can be used to extract information from this tuple of `Span` objects.

In [13]:
[ent for ent in doc.ents[:20]]

[INTERFACES,
 Elliott M. Forney,
 Department of Computer Science,
 the Degree of Doctor of Philosophy,
 Colorado State University,
 Fort Collins,
 Colorado,
 Fall 2019,
 Doctoral Committee,
 Charles Anderson,
 Asa Ben-Hur,
 Michael Kirby,
 Donald Rojas,
 the Creative Commons,
 NonCommercial-NoDerivatives,
 Appendix,
 EEG,
 EEG,
 the Convolutional Neural Network,
 CNN]

The `.label_` attribute contains a string version of the predicted entity type while the `.label` attribute holds an integer representation, similar to what we saw for POS tags.

Note that entities are not disambiguated, merged or linked.  *It's unclear to me if there is any way to do these things in spaCy?*

In [46]:
[(ent.text, ent.label_, ent.label) for ent in doc.ents[:20]]

[('INTERFACES', 'ORG', 383),
 ('Elliott M. Forney\n', 'PERSON', 380),
 ('Department of Computer Science', 'ORG', 383),
 ('the Degree of Doctor of Philosophy\n', 'WORK_OF_ART', 388),
 ('Colorado State University', 'ORG', 383),
 ('Fort Collins', 'GPE', 384),
 ('Colorado', 'GPE', 384),
 ('Fall 2019', 'DATE', 391),
 ('Doctoral Committee', 'ORG', 383),
 ('Charles Anderson', 'PERSON', 380),
 ('Asa Ben-Hur', 'PERSON', 380),
 ('Michael Kirby', 'PERSON', 380),
 ('Donald Rojas', 'PERSON', 380),
 ('the Creative Commons', 'ORG', 383),
 ('NonCommercial-NoDerivatives', 'ORG', 383),
 ('Appendix', 'GPE', 384),
 ('EEG', 'ORG', 383),
 ('EEG', 'ORG', 383),
 ('the Convolutional Neural Network', 'ORG', 383),
 ('CNN', 'ORG', 383)]
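
The integer and string labels are two views of the same value: the integer is an ID that can be looked up in the vocabulary's string store. The sketch below illustrates this on a tokenizer-only pipeline, with two entity spans set by hand in place of real model predictions; the sentence and the span boundaries are made up for the example.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank('en')  # no NER component, so entities are set by hand below
doc = nlp('Charles Anderson works in Fort Collins')

# Hand-labeled spans (token start, token end) standing in for model output.
doc.ents = [Span(doc, 0, 2, label='PERSON'), Span(doc, 4, 6, label='GPE')]

for ent in doc.ents:
    # Looking the integer label up in the string store recovers the string.
    print(ent.text, ent.label_, ent.label, nlp.vocab.strings[ent.label])
```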

Again, pythonic expressions can be used to do things like extract the text for all entities of a given type.

`PERSON` entities look pretty good...

In [15]:
[ent.text for ent in doc.ents if ent.label_ == 'PERSON'][:20]

['Elliott M. Forney\n',
 'Charles Anderson',
 'Asa Ben-Hur',
 'Michael Kirby',
 'Donald Rojas',
 'Charles Anderson',
 'BCI',
 'Bill Gavin',
 'Marla Roll',
 'Brittany Taylor',
 'Jewel Crasta',
 'Stephanie Scott',
 'Katie Bruegger',
 'Kim Teh',
 'Stephanie Teh',
 'Tomojit Ghosh',
 'Glen Forney',
 'Nancy Forney',
 'Maggie',
 'Parker']
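
The raw entity strings can carry stray whitespace (note the trailing newline on the first name) and repeats. A small, plain-Python sketch of one way to tidy them up, using a handful of strings copied from the output above:

```python
# A few PERSON strings copied from the output above.
people = ['Elliott M. Forney\n', 'Charles Anderson', 'Asa Ben-Hur',
          'Michael Kirby', 'Donald Rojas', 'Charles Anderson']

# Collapse whitespace runs and trim the ends, then drop repeats;
# dict.fromkeys keeps the first-seen order.
cleaned = [' '.join(name.split()) for name in people]
unique_people = list(dict.fromkeys(cleaned))
print(unique_people)
```

This only normalizes surface strings; it does not decide whether two different strings refer to the same person.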

`ORG` has some misses, especially with abbreviations...

In [16]:
[ent.text for ent in doc.ents if ent.label_ == 'ORG'][:20]

['INTERFACES',
 'Department of Computer Science',
 'Colorado State University',
 'Doctoral Committee',
 'the Creative Commons',
 'NonCommercial-NoDerivatives',
 'EEG',
 'EEG',
 'the Convolutional Neural Network',
 'CNN',
 'EEG',
 'Time-Delay Neural',
 'EEG',
 'EEG',
 'Fourier',
 'EEG',
 'EEG',
 'CSU',
 'Patti Davies',
 'EEG']

Same for `GPE` (geopolitical entities)...

In [17]:
[ent.text for ent in doc.ents if ent.label_ == 'GPE'][:20]

['Fort Collins',
 'Colorado',
 'Appendix',
 'Colorado',
 'distinct mental states',
 's2',
 'sT',
 'DFT',
 'Sa\n\n',
 'noisy',
 'CSP-2',
 'MI',
 'al.',
 'al.',
 '−',
 '−',
 'Mn',
 'i.e',
 'Unnikrishnan',
 'Tzanakou']

## Noun-phrase chunking

Noun chunks are also extracted by our pipeline. This time, however, we are given a generator instead of a list, which is certainly reasonable.

In [53]:
type(doc.noun_chunks)

generator

In [52]:
noun_chunks = list(doc.noun_chunks)

type(noun_chunks[0]), len(noun_chunks)

(spacy.tokens.span.Span, 17531)
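
Because a generator is lazy, a preview of the first few items does not require building the full list. A minimal sketch using `itertools.islice`; the `first_n` helper is a hypothetical name, and the stand-in generator below takes the place of `doc.noun_chunks` so that the example runs without the model.

```python
from itertools import islice


def first_n(iterable, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(iterable, n))


# Stand-in generator; with a parsed document, pass doc.noun_chunks instead.
chunks = (f'chunk {i}' for i in range(17531))
preview = first_n(chunks, 3)
print(preview)
```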

Again, noun chunks are easily extracted.

In [18]:
[noun_chunk.text for noun_chunk in doc.noun_chunks]

['DISSERTATION',
 'CONVOLUTIONAL NEURAL NETWORKS',
 'EEG SIGNAL',
 'ASYNCHRONOUS BRAIN-COMPUTER INTERFACES',
 'Elliott M. Forney',
 'Department',
 'Computer Science',
 'partial fulfillment',
 'the requirements',
 'the Degree',
 'Doctor',
 'Philosophy',
 'Colorado State University',
 'Fort Collins',
 'Colorado',
 'Fall',
 'Doctoral Committee',
 'Advisor',
 'Charles Anderson',
 'Asa Ben-Hur',
 'Michael Kirby',
 'Donald Rojas',
 'Copyright Elliott M. Forney',
 'This work',
 'the Creative Commons\nAttribution-NonCommercial-NoDerivatives 4.0 International Public License',
 'You',
 'this document',
 'modification',
 'non-commercial purposes',
 'a copy',
 'the full license',
 'Appendix',
 'ABSTRACT\n\nCONVOLUTIONAL NEURAL NETWORKS',
 'EEG SIGNAL',
 'ASYNCHRONOUS BRAIN-COMPUTER INTERFACES',
 'Brain-Computer Interfaces',
 'BCIs',
 'technologies',
 'users',
 'computerized devices',
 'only voluntary changes',
 'their mental state',
 'BCIs',
 'a\nnumber',
 'important applications',
 'the developme

## Sentence tokenization

Sentences are also extracted and, again, are represented as `Span` objects.

In [54]:
[sent.text for sent in doc.sents][50:60]

['and I would like to\nthank you all.',
 'I would also like to thank my dissertation committee and all of the members of\nthe BCI laboratory.',
 'Bill Gavin, Patti Davies and Marla Roll, in particular, have taught me much\nof what I know about EEG and how to work with participants in a kind and professional way.\n',
 'Brittany Taylor, Jewel Crasta, Stephanie Scott, Katie Bruegger, Kim Teh and Stephanie Teh have\nall been especially instrumental in the success of my research and the BCI project and have all\nbeen great sources of support and friendship.',
 'Tomojit Ghosh and Fereydoon Vafaei have been\ngreat friends and colleagues and were extremely helpful throughout the process of formulating\nmy ideas and designing my experiments.',
 'I would also like to extend a special thank you to\neveryone that participated in our studies and, especially, to those who have graciously allowed\nus to enter their homes.',
 'I truly hope that this research helps to develop next-generation\nassistive

By iterating over the tokens contained in each span, we can easily get a nested list of tokens by sentence.

In [21]:
[[token.text for token in sent] for sent in list(doc.sents)[50:60]]

[['and', 'I', 'would', 'like', 'to', '\n', 'thank', 'you', 'all', '.'],
 ['I',
  'would',
  'also',
  'like',
  'to',
  'thank',
  'my',
  'dissertation',
  'committee',
  'and',
  'all',
  'of',
  'the',
  'members',
  'of',
  '\n',
  'the',
  'BCI',
  'laboratory',
  '.'],
 ['Bill',
  'Gavin',
  ',',
  'Patti',
  'Davies',
  'and',
  'Marla',
  'Roll',
  ',',
  'in',
  'particular',
  ',',
  'have',
  'taught',
  'me',
  'much',
  '\n',
  'of',
  'what',
  'I',
  'know',
  'about',
  'EEG',
  'and',
  'how',
  'to',
  'work',
  'with',
  'participants',
  'in',
  'a',
  'kind',
  'and',
  'professional',
  'way',
  '.',
  '\n'],
 ['Brittany',
  'Taylor',
  ',',
  'Jewel',
  'Crasta',
  ',',
  'Stephanie',
  'Scott',
  ',',
  'Katie',
  'Bruegger',
  ',',
  'Kim',
  'Teh',
  'and',
  'Stephanie',
  'Teh',
  'have',
  '\n',
  'all',
  'been',
  'especially',
  'instrumental',
  'in',
  'the',
  'success',
  'of',
  'my',
  'research',
  'and',
  'the',
  'BCI',
  'project',
  'and',
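
Since each sentence is a `Span`, it also carries its position in the document, both in tokens (`.start`, `.end`) and in characters (`.start_char`, `.end_char`). The sketch below is not the parser-based segmentation used above: it swaps in spaCy's rule-based `sentencizer` on a blank pipeline so that it runs without the model, and it assumes spaCy v3, where built-in components are added by name. The two sentences are shortened, made-up variants of the text above.

```python
import spacy

nlp = spacy.blank('en')
nlp.add_pipe('sentencizer')  # rule-based splitter; assumes the spaCy v3 API

text = 'I would like to thank you all. I would also like to thank my committee.'
doc = nlp(text)

sents = list(doc.sents)
for sent in sents:
    # start/end are token indices; start_char/end_char are character offsets.
    print(sent.start, sent.end, sent.start_char, sent.end_char, sent.text)
```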
  

## Sentiment analysis

Sentiment analysis is contained in some spaCy models but, unfortunately, not in the `en_core_web_sm` model. If it were, however, the sentiment score for the document would be stored in the floating point `.sentiment` attribute.

In [55]:
doc.sentiment

0.0

Overall, spaCy seems pretty easy to get started with.  I do wonder a bit if the `Span` data model abstraction, i.e., iterables of tokens, is really sufficient for a wide variety of NLP tasks, but I'll have to ponder on that a bit.  In future notebooks, I hope to explore performance a bit more and also sentiment and training of custom models.  Perhaps I can also think of an easy practical NLP application to integrate.