## Introduction to NLP with spaCy

spaCy is a robust and versatile Python library for Natural Language Processing (NLP). Here we demonstrate how to install and use it.

This notebook illustrates how to load a language model, and how to analyse text with it.

See also [spaCy 101](https://spacy.io/usage/spacy-101).

This notebook is part of a hands-on introduction to NLP for Geo-text data mining.

### Topics covered in this notebook

1. Loading a spaCy model 
2. Selecting words by their part-of-speech 
3. Visualising parse results
4. Pattern matching and information extraction
5. Semantic similarity (of words, of sentences)


### Other topics (check the other notebooks or instruction pages)

6. Named Entities (visualisation, explain labels) (notebook)
7. Entity Linking (to geonames) (notebook)
8. Information Extraction with SPIKE (demo, examples)
9. (optional) Information Extraction with spaCy
10. LLMs and ChatGPT for Information Extraction

### 1. Loading spaCy

To load the spaCy library, you first need to [install](https://spacy.io/usage) it (unless you are running this tutorial on Google Colab). In addition, you need to select a **model** for the language that you want to analyse. Models generally come in three sizes, from small to large: smaller models load and run faster, while larger models are more accurate. For English there is also a transformer model, which is the most accurate model for many tasks and datasets.

`spacy.info()` gives information about your installation.



In [None]:
import spacy

# Note 1: loading the model may print some errors from TensorFlow; these can be ignored.
# Note 2: Google Colab only has en_core_web_sm pre-installed; run the command below to install other models.
# !python -m spacy download en_core_web_lg

nlp = spacy.load("en_core_web_lg")

spacy.info()
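
Since not every environment has the large model installed, it can be convenient to fall back to a smaller one. The sketch below is one possible way to do this; `load_first_available` is a hypothetical helper written for this notebook, not part of spaCy. It relies only on the fact that `spacy.load` raises an `OSError` when the requested model package cannot be found.

```python
import spacy


def load_first_available(names):
    """Return the first pipeline from `names` that is installed.

    Hypothetical helper for this notebook, not part of the spaCy API.
    """
    for name in names:
        try:
            return spacy.load(name)
        except OSError:
            # spacy.load raises OSError when the model package is not installed
            print(f"{name} is not installed, trying the next candidate")
    raise RuntimeError(f"None of these models are installed: {names}")
```

A call such as `nlp = load_first_available(["en_core_web_lg", "en_core_web_sm"])` would then prefer the large model and use the small one only if the large one is missing.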


### Visualisation

The easiest way to understand the output of analysis is to visualise the result. This works best for short input strings.


In [None]:
from spacy import displacy

doc = nlp("New York is a large city")
displacy.render(doc, style="dep", minify=True, jupyter=True, options={"distance": 130})



### Exercise 1

* Replace the sentence in the example above with one that contains your name, and display the result.
* Does your example contain words that are labelled with the part-of-speech tag PROPN? What does this tell you?
* Proper names often consist of more than one word (_New York_). Is there a syntactic relation that tells you whether two words that form a single proper name are indeed recognised as one name by the model?


Apart from visualising the output of the model, we can also process it. The result of analysing a text with spaCy is a `Doc` object, which behaves like a list of *token* objects. Each token has attributes such as `.pos_` (note the underscore!) for its part-of-speech tag. By iterating over the tokens, we can select certain tokens and print some of their properties, as illustrated below.



In [None]:
text = '''
Chatbots like ChatGPT are used by hundreds of millions of people for an increasingly wide array of tasks, including email services, online tutors and search engines. 
And they could change the way people interact with information. 
But there is no way of ensuring that these systems produce information that is accurate.
'''

doc = nlp(text)

# Print the text, lemma and part-of-speech tag of every noun
for word in doc:
    if word.pos_ == "NOUN":
        print(word.text, word.lemma_, word.pos_)

### Exercise 3

* Replace the text above with a recent news item. Modify the code to list all the proper names in the text. 

Side note: Long texts can also be visualised by spaCy sentence by sentence. Try it for your example text:

In [None]:
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep", minify=True, jupyter=True, options={"distance": 100})
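
Parses of longer sentences can become too wide to inspect comfortably inside a notebook. One option is to save the rendered markup to a file and open it in a browser. The sketch below assumes a hand-annotated `Doc` (the tags and heads are typed in by hand for illustration, not produced by a model) so that it runs without a downloaded model; in this notebook you would pass the output of `nlp(...)` instead. The file name `parse.svg` is an arbitrary choice.

```python
from pathlib import Path

import spacy
from spacy import displacy
from spacy.tokens import Doc

# Hand-annotated example (an assumption for illustration, not model output)
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["New", "York", "is", "a", "large", "city"],
    pos=["PROPN", "PROPN", "AUX", "DET", "ADJ", "NOUN"],
    heads=[1, 2, 2, 5, 5, 2],  # absolute index of each token's head
    deps=["compound", "nsubj", "ROOT", "det", "amod", "attr"],
)

# With jupyter=False, render returns the markup as a string instead of displaying it
svg = displacy.render(doc, style="dep", jupyter=False, options={"distance": 130})

# Write the SVG to disk so that it can be opened in a browser
Path("parse.svg").write_text(svg, encoding="utf-8")
```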

### Selecting combinations of words

Sometimes we want to find concepts in a text, like names of places or organisations, or the most important keywords (e.g. for visualisation in a word cloud). Such concepts are not always a single word, but often consist of two or more words. The rule-based **Matcher** allows you to identify such word combinations using the properties that spaCy has assigned to the tokens in the input, together with regular-expression-like operators for creating flexible patterns.

The example below finds combinations consisting of an adjective (ADJ) followed by a noun (NOUN), such as 'artificial intelligence'.



In [None]:
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)

pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}]

# greedy="LONGEST" (supported as of version 3) keeps only the longest of any overlapping matches
# see https://spacy.io/api/matcher#add
matcher.add("adj_noun", [pattern], greedy="LONGEST")

text = '''Some old green trees have thick stems. 
Artificial intelligence is a booming business. 
NVIDIA is a computer firm that is at the forefront.'''

doc = nlp(text)

matches = matcher(doc)

# the name of your pattern is hashed. To map from hash-id to text, look it up in vocab.strings
for match_id, start, end in matches:
    span = doc[start:end]
    print(nlp.vocab.strings[match_id], ": ", span.text)
    


### Exercise 4

* Modify the adj_noun pattern so that it also matches nouns preceded by zero or more adjectives.

* Modify it so that it also matches noun-noun compounds like _'computer firm'_.

See [operators](https://spacy.io/usage/rule-based-matching#quantifiers) for examples of adding regular expression operators to your patterns (for matching optionally, or zero or more times).

Test your pattern on an input text that contains examples of such cases. Does it always work as expected?
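
As a neutral illustration of how an operator is written (deliberately not a solution to the exercise), the sketch below matches a noun with an optional determiner in front of it, using `"OP": "?"`. It assumes a hand-annotated `Doc` (the part-of-speech tags are typed in by hand, not predicted), so that it runs without a downloaded model; the pattern name `det_noun` is made up for this example.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Hand-annotated example (an assumption for illustration, not model output)
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["The", "firm", "sells", "chips"],
    pos=["DET", "NOUN", "VERB", "NOUN"],
)

matcher = Matcher(vocab)

# "OP": "?" makes the determiner optional: it may occur zero or one time
pattern = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]
matcher.add("det_noun", [pattern], greedy="LONGEST")

# Sort by start position so that the phrases come out in text order
matches = sorted(matcher(doc), key=lambda m: m[1])
phrases = [doc[start:end].text for _, start, end in matches]
print(phrases)
```

Without `greedy="LONGEST"`, the matcher would also report the shorter overlapping match _firm_ next to _The firm_.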

### Optional: Towards Information Extraction

The matcher can, to some extent, also be used to extract words based on their syntactic relationships in a text. This can be useful for information extraction. Imagine, for instance, that you want to extract all occurrences of organisations in a text and, if they are the object of a verb, also that verb (e.g. from the sentence _Google wants to buy NVIDIA_, it would extract the tuple _(buy, NVIDIA)_).

Tokens carry not only their part-of-speech, but also syntactic information, such as their syntactic function. These functions are the labels on the edges between nodes that you see in the displacy visualisations: 'nsubj' is the dependency label for subjects, and 'dobj' is the dependency label for direct objects.
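
To make these attributes concrete, the sketch below prints the dependency label (`dep_`) and the head of every token, and then collects (verb, object) pairs. It assumes a hand-annotated `Doc` for the sentence _Google wants to buy NVIDIA_ (heads and labels are typed in by hand for illustration, so a real model may analyse the sentence differently); in this notebook you would normally obtain the `Doc` from `nlp(...)`.

```python
import spacy
from spacy.tokens import Doc

# Hand-annotated example (an assumption for illustration, not model output)
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["Google", "wants", "to", "buy", "NVIDIA"],
    pos=["PROPN", "VERB", "PART", "VERB", "PROPN"],
    heads=[1, 1, 3, 1, 3],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "aux", "xcomp", "dobj"],
)

# Every token knows its dependency label and the token it depends on
for token in doc:
    print(f"{token.text:<8} {token.dep_:<6} head={token.head.text}")

# Keep only direct objects, paired with the verb that governs them
objects = [(token.head.text, token.text) for token in doc if token.dep_ == "dobj"]
print(objects)
```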

We can use this information to find the relationship between a matched span and the word of which it is a dependent. The example below matches proper names (sequences with PROPN as part-of-speech; we will soon see better ways to find named entities) and prints their dependency label (`dep_`) as well as the word of which they are a dependent.

One twist here is that a span (i.e. a sequence of tokens) that matches a pattern does not itself have these linguistic attributes. The attribute `root` is used to find the most prominent word within a span. This root token then provides information such as the dependency label and a link to the head.
                                                                                        

In [None]:
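# Use a fresh matcher so that the earlier adj_noun pattern does not fire here
matcher = Matcher(nlp.vocab)

# One or more consecutive proper nouns
object_pattern = [{"POS": "PROPN", "OP": "+"}]

# greedy="LONGEST" keeps only the longest of overlapping matches (e.g. a two-word name rather than its parts)
matcher.add("OBJECT", [object_pattern], greedy="LONGEST")

doc = nlp("Google wants to buy Open AI")

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    # span.root is the token in the span that is closest to the root of the sentence
    print(span.text, span.root.dep_, span.root.head.text)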
    


### Exercise 5 (optional)

* Modify the code above so that it only prints cases where the root of the matched span is a direct object (dobj).
* Modify the code above so that it prints the lemma_ of the head token (the verb) instead of the text. Test this for an example like 'Google buys Open AI'.

Hint: these modifications require modifying the for-loop.