# DIGI405 - Week 6 Lab Notebook - Information Extraction

## Introduction

For advanced text processing, programming libraries like [Spacy](https://spacy.io/) offer a range of features. This notebook introduces Spacy and shows how to use it for information extraction. It will be the basis of your lab work for Week 6.

## Installing Spacy

The Spacy library is already installed on JupyterHub. However, you may need to install a language model that Spacy will use.

In [None]:
import spacy
import collections
from collections import Counter

In [None]:
nlp = spacy.load('en_core_web_sm')

If you get errors loading the spacy model, try running the cell below to install a language model, then re-run the cell above.

In [None]:
%%bash
python -m spacy download en_core_web_sm

If the model cannot be located, try going to the Kernel menu and selecting 'Restart & Clear Output', then try loading the model again.

If the model still cannot be loaded, try loading it from its full path instead:
```
nlp = spacy.load('/home/#####/.local/lib/python3.8/site-packages/en_core_web_sm/en_core_web_sm-3.0.0')
```
Here `#####` is your student user ID (e.g. abc123).

In [None]:
text = '''
Sometime between 1250 and 1300, Polynesians settled in the islands that later were named New Zealand and developed a distinctive Māori culture. 
In 1642, Dutch explorer Abel Tasman became the first European to sight New Zealand. 
In 1840, representatives of the United Kingdom and Māori chiefs signed the Treaty of Waitangi, which declared British sovereignty over the islands. 
In 1841, New Zealand became a colony within the British Empire and in 1907 it became a dominion; it gained full statutory independence in 1947 and the British monarch remained the head of state.
'''
# text from https://en.wikipedia.org/wiki/New_Zealand
# you can replace the text variable with any text you want

The following cell tokenises and annotates some text with spacy and prints out the tokens.

In [None]:
# tokenise and annotate some text
doc = nlp(text) 

# this will output the individual tokens 
for token in doc:
    print(token)

Spacy can segment our text into sentences. This cell prints each sentence in turn.

In [None]:
for sent in doc.sents:
    print(sent.text)

## Part of Speech tagging with Spacy

Here we count the tokens in our text by their Penn Treebank part-of-speech tag.

The [spacy.explain](https://spacy.io/api/top-level#spacy.explain) function can be used to output a user-friendly description for a given POS tag, dependency label or entity type.

In [None]:
tags = [token.tag_ for token in doc]

tag_freq = Counter(tags)
for tag in sorted(tag_freq, key=tag_freq.get, reverse=True):
    print(tag, tag_freq[tag], spacy.explain(tag), sep='\t')

Spacy's default part of speech tagging uses a simpler set of labels (the Universal POS tag set). Note the difference between the previous tagging cell and the next one: the Penn Treebank tags are accessed via `tag_`, while Spacy's simpler POS tags are available via `pos_`.

In [None]:
tags = [token.pos_ for token in doc]

tag_freq = Counter(tags)
for tag in sorted(tag_freq, key=tag_freq.get, reverse=True):
    print(tag, tag_freq[tag], spacy.explain(tag), sep='\t')
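
The sorting idiom used in the two tagging cells can also be written with `Counter.most_common()`, which returns (item, count) pairs already ordered from most to least frequent. The following is a minimal sketch only: `example_tags` is a short hand-written list standing in for `[token.pos_ for token in doc]` (it is not real model output), so the cell runs without a language model.

```python
from collections import Counter

# Hand-written stand-in for [token.pos_ for token in doc]; not real model output.
example_tags = ['PROPN', 'NOUN', 'PROPN', 'VERB', 'PROPN', 'NOUN']

tag_freq = Counter(example_tags)

# most_common() yields (tag, count) pairs sorted by descending count
for tag, count in tag_freq.most_common():
    print(tag, count, sep='\t')
```

In the notebook itself you could build the `Counter` from the real tags and add `spacy.explain(tag)` to the `print` call, as in the cells above.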

Here is an example of filtering tokens by their part of speech.

In [None]:
filtered_tokens = []

for token in doc:
    if token.pos_ == 'PROPN':
        filtered_tokens.append(token)
        
print(filtered_tokens)

Refer to the tags listed above with their spacy.explain explanations. Replace 'PROPN' with other parts of speech, such as:

- VERB
- NOUN
- ADP (preposition)
- NUM (numbers)

You can also use the more extensive Penn Treebank tag set under the "English" heading. This allows you to do things like differentiate between tenses of verbs (e.g. select verbs in past-tense, VBD). 

In [None]:
filtered_tokens = []

for token in doc:
    if token.tag_ == 'VBD': # similar to above, except instead of pos_ we use tag_ to access the different tag set
        filtered_tokens.append(token)
        
print(filtered_tokens)
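
To keep more than one part of speech at a time, the same filter can test membership in a set of labels. The sketch below is illustrative only: `FakeToken`, `example_tokens` and `filter_by_pos` are hypothetical names introduced here, and the tags are assigned by hand rather than produced by Spacy, so that the cell runs without a language model.

```python
from collections import namedtuple

# Hypothetical stand-in for Spacy tokens (only the attributes used below).
FakeToken = namedtuple('FakeToken', ['text', 'pos_', 'tag_'])

# Hand-tagged tokens for "Tasman sighted the islands in 1642"; not model output.
example_tokens = [
    FakeToken('Tasman', 'PROPN', 'NNP'),
    FakeToken('sighted', 'VERB', 'VBD'),
    FakeToken('the', 'DET', 'DT'),
    FakeToken('islands', 'NOUN', 'NNS'),
    FakeToken('in', 'ADP', 'IN'),
    FakeToken('1642', 'NUM', 'CD'),
]


def filter_by_pos(tokens, wanted):
    """Return the text of every token whose pos_ label is in the wanted set."""
    return [token.text for token in tokens if token.pos_ in wanted]


nouns = filter_by_pos(example_tokens, {'NOUN', 'PROPN'})
print(nouns)
```

Because the function only reads `.text` and `.pos_`, it should also accept the notebook's own `doc`, e.g. `filter_by_pos(doc, {'NOUN', 'PROPN'})`.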

Here we are outputting a frequency list for proper nouns. Change the code to output a frequency list for another part of speech.

In [None]:
# select proper noun tokens only 
filtered_tokens = [token.text for token in doc if token.pos_ == "PROPN"]

token_freq = Counter(filtered_tokens)
for token in sorted(token_freq, key=token_freq.get, reverse=True):
    print(token, token_freq[token])

We can use this model to remove or normalise particular token types as required. For example, here we are normalising individual numbers to one token "NUMBER". This might be useful if you were interested in collocation patterns related to numbers in general.

In [None]:
no_numbers = []

for token in doc:
    if token.pos_ == 'NUM':
        no_numbers.append('NUMBER') # if we wanted to remove numbers completely change this line to: continue
    else:
        no_numbers.append(token.text) # store the token's text so the list contains only strings

print(no_numbers)

## Filtering by character types of tokens

You can also filter tokens by the kinds of characters they contain. For example, here we exclude any token that contains non-alphabetic characters, such as digits or '$'. The full list of token attributes is here: https://spacy.io/api/token#attributes  
Change is_alpha to another boolean (bool) type to filter in another way (e.g. is_digit).

In [None]:
char_filtered = []

for token in doc:
    if token.is_alpha is False:
        continue
    else:
        char_filtered.append(token) 

print(char_filtered)
# no dates or punctuation
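
The Spacy token documentation describes `is_alpha` and `is_digit` as equivalent to calling `isalpha()` and `isdigit()` on the token's text. The sketch below tries those string methods on a few plain strings chosen by hand to resemble tokens from the sample text, so it runs without a language model; `examples`, `alpha_only` and `digit_only` are illustrative names.

```python
# Plain strings standing in for token.text values from the sample text
examples = ['Māori', '1642', ',', 'New', '$']

# Show what each string method reports for each example
for s in examples:
    print(repr(s), s.isalpha(), s.isdigit(), sep='\t')

# The same checks used as filters
alpha_only = [s for s in examples if s.isalpha()]
digit_only = [s for s in examples if s.isdigit()]

print(alpha_only)
print(digit_only)
```

Note that letters with macrons, as in 'Māori', count as alphabetic.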

## Noun phrases / chunks

As we discussed in the lecture on information extraction, identifying noun chunks is a basic way of identifying entities within our text.

In [None]:
for chunk in doc.noun_chunks:
    print(chunk.text)

## Named Entity Recognition

Spacy detects named entities. The labels for the entities for English-language models are documented here: https://spacy.io/models/en#en_core_web_sm. Expand out the "Label Scheme" section to see the various labels. 

You can use spacy.explain to find out more about a specific label. You can run the following cell with different labels for a description.

In [None]:
print(spacy.explain('NORP'))

Below are the entity labels found in our sample text and how often each occurs. Again, we are using spacy.explain here to give a user-friendly description.

In [None]:
entities = [ent.label_ for ent in doc.ents]

entities_freq = Counter(entities)
for entity in sorted(entities_freq, key=entities_freq.get, reverse=True):
    print(entity, entities_freq[entity], spacy.explain(entity), sep='\t')

Here are all the entities listed out with their position within the text. 

Take a look at the results below and make sure you understand all the labels and the types of entities that Spacy can detect. 

How would you modify the named entity recognition code to only list Countries/Places (GPE)?

In [None]:
for ent in doc.ents:
    print(ent.label_, ent.start_char, ent.end_char, ent.text, sep='\t')
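
The `start_char` and `end_char` values are character offsets into the text, with `end_char` pointing just past the entity's last character, so slicing the text with them gives back the entity string. The sketch below illustrates the idea without a language model: the labels and entity strings in `examples` are picked by hand (not real model output) and the offsets are computed with `str.find`.

```python
sample = 'In 1642, Dutch explorer Abel Tasman became the first European to sight New Zealand.'

# Hand-picked entity strings with illustrative labels; not real model output
examples = [('DATE', '1642'), ('PERSON', 'Abel Tasman'), ('GPE', 'New Zealand')]

spans = []
for label, entity_text in examples:
    start_char = sample.find(entity_text)     # offset of the first character
    end_char = start_char + len(entity_text)  # offset just past the last character
    spans.append((label, start_char, end_char))
    # Slicing with the two offsets recovers the entity text
    print(label, start_char, end_char, sample[start_char:end_char], sep='\t')
```

With a real `doc`, the equivalent check would be `doc.text[ent.start_char:ent.end_char] == ent.text`.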

Spacy has a "pretty" way to visualise named entities.

In [None]:
spacy.displacy.render(doc, style="ent")

## Dependency parsing

Dependency parsing analyses sentences based on relationships between words.

You can view the different annotation labels for Spacy's dependency parsing here: https://spacy.io/models/en#en_core_web_sm (expand the "label scheme" and see the PARSER labels). 

Spacy is packaged with a useful dependency visualiser:

In [None]:
# make a list of sentences
sentences = list(doc.sents)
# the whole sentences list can be passed to render, but here we display just the shortest sentence
spacy.displacy.render(sentences[1], style='dep')

## Task

Select a short piece of text from the web (e.g. a news article, a blog post, a short story, a report) containing several names of people, places, organisations or other entities. Copy and paste it into the top of this notebook between the triple quotes to replace the ```text``` variable. Then run each cell again to extract parts of speech, noun chunks and named entities. Spend some time investigating characteristics of the text using parts of speech, noun phrases, named entities, and dependency parsing.