# DIGI405 Lab 4.1: Getting started with spaCy

## Introduction

The [spaCy](https://spacy.io/) NLP library offers a range of advanced text processing features. This notebook introduces spaCy and demonstrates basic part-of-speech tagging and dependency parsing.

## Installing spaCy

The spaCy library and small language model are already installed on JupyterHub. However, if you are running the notebook locally, you may need to install them. You can find the instructions in the spaCy documentation here: [https://spacy.io/usage](https://spacy.io/usage)
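If you are unsure whether a local environment is ready, a quick check along these lines may help. This is a minimal sketch using only the standard library; the helper name check_spacy_setup is made up for illustration, and it assumes the model is installed as a regular Python package named en_core_web_sm (the name loaded later in this notebook).

```python
import importlib.util


def check_spacy_setup(model_name="en_core_web_sm"):
    """Report whether spaCy and the given model package can be found (hypothetical helper)."""
    return {
        "spacy": importlib.util.find_spec("spacy") is not None,
        model_name: importlib.util.find_spec(model_name) is not None,
    }


status = check_spacy_setup()
for name, installed in status.items():
    print(name, "installed" if installed else "missing")

# Suggest the usual install commands when something is missing
if not status["spacy"]:
    print("Try: pip install spacy")
elif not status["en_core_web_sm"]:
    print("Try: python -m spacy download en_core_web_sm")
```

The spaCy usage page linked above remains the authoritative source for install options on your platform.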

In [None]:
import spacy
from collections import Counter
from IPython.display import display, HTML

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
text = '''
Sometime between 1250 and 1300, Polynesians settled in the islands that later were named New Zealand and developed a distinctive Māori culture. 
In 1642, Dutch explorer Abel Tasman became the first European to sight New Zealand. 
In 1840, representatives of the United Kingdom and Māori chiefs signed the Treaty of Waitangi, which declared British sovereignty over the islands. 
In 1841, New Zealand became a colony within the British Empire and in 1907 it became a dominion; it gained full statutory independence in 1947 and the British monarch remained the head of state.
'''
# text from https://en.wikipedia.org/wiki/New_Zealand
# you can replace the text variable with any text you want

The following cell tokenises and annotates some text with spaCy and prints out the tokens.

In [None]:
# tokenise and annotate some text
doc = nlp(text) 

# this will output the individual tokens 
for token in doc:
    print(token)

spaCy can segment our text into sentences. The following cell prints each sentence in turn.

In [None]:
for sent in doc.sents:
    print(sent.text)

## Part of Speech tagging with spaCy

spaCy provides two part-of-speech tag sets: the fine-grained Penn Treebank tags and the coarse-grained Universal POS labels. First we will output a count of parts of speech using the Penn Treebank tags.

The [spacy.explain](https://spacy.io/api/top-level#spacy.explain) function can be used to output a user-friendly description for a given POS tag, dependency label or entity type.

In [None]:
tags = [token.tag_ for token in doc]

tag_freq = Counter(tags)
for tag in sorted(tag_freq, key=tag_freq.get, reverse=True):
    print(tag, tag_freq[tag], spacy.explain(tag), sep='\t')

spaCy also provides the simpler set of 'Universal' part of speech labels. These are very useful for general POS patterns.

Note the difference between the cell above and the cell below: above, the Penn Treebank tags are accessed via tag_; below, spaCy's Universal POS tags are accessed via pos_.

In [None]:
tags = [token.pos_ for token in doc]

tag_freq = Counter(tags)
for tag in sorted(tag_freq, key=tag_freq.get, reverse=True):
    print(tag, tag_freq[tag], spacy.explain(tag), sep='\t')
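The frequency listings above sort the Counter by hand. As a possible alternative, Counter.most_common returns the same kind of most-to-least-frequent listing directly. The sketch below uses a small made-up tag list in place of the tags taken from doc, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical tag list standing in for [token.pos_ for token in doc]
tags = ["PROPN", "NOUN", "PROPN", "VERB", "NOUN", "PROPN", "NUM"]

tag_freq = Counter(tags)

# most_common() yields (item, count) pairs from most to least frequent
for tag, count in tag_freq.most_common():
    print(tag, count, sep="\t")

# passing a number keeps only that many of the top entries
top_two = tag_freq.most_common(2)
print(top_two)
```

In the notebook you could add spacy.explain(tag) as a third column, as the cells above do.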

Here is an example of filtering tokens by their part of speech.

In [None]:
filtered_tokens = []

for token in doc:
    if token.pos_ == 'PROPN':
        filtered_tokens.append(token)
        
print(filtered_tokens)

Refer to the tags listed above with their spacy.explain explanations. Replace 'PROPN' with other parts of speech, such as:

- VERB
- NOUN
- ADP (preposition)
- NUM (numbers)

You can also filter on the more extensive Penn Treebank tag set (the fine-grained tags accessed via tag_). This allows you to do things like differentiate between verb tenses (e.g. select past tense verbs, VBD).

In [None]:
filtered_tokens = []

for token in doc:
    if token.tag_ == 'VBD': # similar to above, except instead of pos_ we use tag_ to access the different tag set
        filtered_tokens.append(token)
        
print(filtered_tokens)
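Since the two filtering cells differ only in the attribute (pos_ or tag_) and the label being matched, the pattern can be wrapped in a small function. This is a sketch only: filter_tokens is a made-up helper, and the sample tokens are stand-in objects with hand-assigned labels rather than model output, so the block runs without loading a model. With the notebook's doc you would pass doc in place of sample.

```python
from types import SimpleNamespace


def filter_tokens(tokens, attr, value):
    """Return the tokens whose attribute `attr` ('pos_' or 'tag_') equals `value`."""
    return [token for token in tokens if getattr(token, attr) == value]


# Stand-in tokens with hand-assigned labels (an assumption, not spaCy output)
sample = [
    SimpleNamespace(text="Tasman", pos_="PROPN", tag_="NNP"),
    SimpleNamespace(text="became", pos_="VERB", tag_="VBD"),
    SimpleNamespace(text="the", pos_="DET", tag_="DT"),
    SimpleNamespace(text="first", pos_="ADJ", tag_="JJ"),
    SimpleNamespace(text="European", pos_="PROPN", tag_="NNP"),
]

proper_nouns = [t.text for t in filter_tokens(sample, "pos_", "PROPN")]
past_verbs = [t.text for t in filter_tokens(sample, "tag_", "VBD")]
print(proper_nouns)
print(past_verbs)
```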

Here we are outputting a frequency list for proper nouns. Change the code to output a frequency list for another part of speech.

In [None]:
# select proper noun tokens only
filtered_tokens = [token.text for token in doc if token.pos_ == "PROPN"]

token_freq = Counter(filtered_tokens)
for token in sorted(token_freq, key=token_freq.get, reverse=True):
    print(token, token_freq[token])

We can use this model to remove or normalise particular token types as required. For example, here we are normalising individual numbers to one token "NUMBER". This might be useful if you were interested in collocation patterns related to numbers in general.

In [None]:
no_numbers = []

for token in doc:
    if token.pos_ == 'NUM':
        no_numbers.append('NUMBER') # if we wanted to remove numbers completely change this line to: continue
    else:
        no_numbers.append(token.text) # store the token's text so the list contains only strings

print(no_numbers)
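The replace-or-remove choice described above can also be expressed as a parameter. The following is a minimal sketch, assuming a made-up helper normalise_numbers and stand-in tokens with hand-assigned labels (not model output). It returns plain strings, which tends to be more convenient for later steps such as counting collocations.

```python
from types import SimpleNamespace


def normalise_numbers(tokens, placeholder="NUMBER", drop=False):
    """Replace NUM tokens with `placeholder`, or leave them out when drop=True."""
    result = []
    for token in tokens:
        if token.pos_ == "NUM":
            if not drop:
                result.append(placeholder)
        else:
            result.append(token.text)
    return result


# Stand-in tokens with hand-assigned labels (an assumption, not spaCy output)
sample = [
    SimpleNamespace(text="In", pos_="ADP"),
    SimpleNamespace(text="1642", pos_="NUM"),
    SimpleNamespace(text=",", pos_="PUNCT"),
    SimpleNamespace(text="Tasman", pos_="PROPN"),
]

replaced = normalise_numbers(sample)
dropped = normalise_numbers(sample, drop=True)
print(replaced)
print(dropped)
```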

## Filtering by character types of tokens

You can filter by types of tokens. For example, here we are excluding any tokens with non-alphabetic characters such as numbers or '$'. Look at the list of token attributes here: https://spacy.io/api/token#attributes  
Change is_alpha to another boolean (bool) type to filter in another way (e.g. is_digit).

In [None]:
char_filtered = []

for token in doc:
    # keep only tokens made up entirely of alphabetic characters
    if token.is_alpha:
        char_filtered.append(token)

print(char_filtered)
# the output contains no numbers (e.g. years) or punctuation
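To get a feel for what these boolean attributes accept, it can help to try the corresponding Python string methods. The spaCy token documentation describes is_alpha and is_digit as equivalent to calling isalpha() and isdigit() on the token text. The sketch below applies them to a few hand-picked strings rather than to real tokens.

```python
# Hand-picked strings standing in for token.text values;
# "M\u0101ori" is "Māori" written with an escaped, precomposed macron a
words = ["In", "1642", ",", "M\u0101ori", "$"]

# counterpart of filtering on token.is_alpha
alpha_only = [w for w in words if w.isalpha()]

# counterpart of filtering on token.is_digit
digits_only = [w for w in words if w.isdigit()]

print(alpha_only)
print(digits_only)
```

Note that letters with macrons count as alphabetic here, so words such as Māori are kept.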

## Noun phrases / chunks

Identifying noun chunks is a basic way of identifying entities and objects written about in a text.

In [None]:
for chunk in doc.noun_chunks:
    print(chunk.text)

## Dependency parsing

Dependency parsing analyses sentences based on relationships between words.

You can view the different annotation labels for spaCy's dependency parsing here: https://spacy.io/models/en#en_core_web_sm (expand the "label scheme" and see the PARSER labels).
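To make the idea of word-to-word relationships concrete, here is a small hand-written parse stored as plain Python data. It is an illustration only, not model output: the sentence and its labels are assumptions chosen from the PARSER label scheme, and children_of is a made-up helper. In spaCy itself the corresponding information is available on each token as token.dep_, token.head and token.children.

```python
# Hand-written parse of "Abel Tasman sighted New Zealand" (not model output).
# Each entry is (word, dependency label, index of the head word).
# The root points to itself, mirroring spaCy, where the root's head is the root.
parse = [
    ("Abel", "compound", 1),
    ("Tasman", "nsubj", 2),
    ("sighted", "ROOT", 2),
    ("New", "compound", 4),
    ("Zealand", "dobj", 2),
]


def children_of(parse, head_index):
    """Return the words attached directly to the word at head_index."""
    return [
        word
        for i, (word, dep, head) in enumerate(parse)
        if head == head_index and i != head_index
    ]


# locate the root, then list the words that depend on it
root_index = next(i for i, (word, dep, head) in enumerate(parse) if dep == "ROOT")
root_children = children_of(parse, root_index)
print(parse[root_index][0], "->", root_children)

# print each word with its label and the word it depends on
for word, dep, head in parse:
    print(word, dep, parse[head][0], sep="\t")
```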

spaCy is packaged with a useful dependency visualiser:

In [None]:
from spacy import displacy

# collect the sentences into a list
sentences = list(doc.sents)
# the whole sentences list can be passed to displacy.render, but here we display just the shortest sentence
html = displacy.render(sentences[1], style='dep', jupyter=False)
display(HTML(html))