## Introduction to ```spaCy```

There are a number of different NLP frameworks that you're likely to encounter. The most popular and widely-used of these are:

- ```NLTK``` (Natural Language Toolkit, old-school)
- ```UDPipe``` (Neural network based, fast and light, but not super accurate)
- ```CoreNLP``` and ```stanza``` (Created by the team at Stanford; academically robust)
- ```spaCy``` (Production-ready, well-documented, state-of-the-art)

We'll be working with ```spaCy``` in this module, primarily because it's easy and intuitive, and also scales well.

The first thing we need to do is install ```spaCy``` and the language model that we want to use.
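
For reference, installation is typically done from the command line with ```pip install spacy``` followed by ```python -m spacy download en_core_web_md```. The sketch below is one possible way to check from Python that both packages can be imported before going any further. The helper name ```is_installed``` is made up for this example.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be imported."""
    return importlib.util.find_spec(package_name) is not None


# spaCy models are installed as ordinary Python packages,
# so the model name can be checked the same way as the library itself.
for name in ("spacy", "en_core_web_md"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", rerun the corresponding install command before loading the model.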

## Initializing ```spaCy```

Once everything is installed, we import ```spaCy``` __and__ load the language model that we want to use.

Note that if you want to work with a different language, you need to load a language model trained for that language.

In [4]:
# create a spacy NLP object
import spacy
nlp = spacy.load("en_core_web_md")

With the model now loaded, we can begin to do some very simple NLP tasks.

Here, we create a spaCy object and assign it to the variable ```nlp```. This is the NLP pipeline that will do all our heavy lifting, using the trained model we've specified.

Below, you can see what the pipeline does with a bit of sample text. Passing text to the ```nlp``` object gives us access to a range of properties, including tokens (words), parts of speech, named entities, and so on. Here we look at two of them: tokens and entities. These objects, in turn, have certain attributes and methods attached to them. A full outline of what is available can be found in the spaCy docs.

In this case, for all token objects, let's return the token itself (token.text); its part-of-speech tag (token.pos_); and the grammatical dependency relations between the tokens (token.dep_).


In [5]:
doc = nlp("My name is Alina and I was born in Hungary. I currently live in Aarhus.")

In [6]:
type(doc)

spacy.tokens.doc.Doc

In [7]:
print(doc)  # prints like a string, but the data type is a spaCy Doc

My name is Alina and I was born in Hungary. I currently live in Aarhus.


__Tokenize__

In [8]:
for token in doc:
    print(token.text)

My
name
is
Alina
and
I
was
born
in
Hungary
.
I
currently
live
in
Aarhus
.


__Trying some more attributes__

In [10]:
for token in doc:
    print(token.i, token.text, token.lemma_, token.pos_, token.dep_, token.morph)

0 My my PRON poss Number=Sing|Person=1|Poss=Yes|PronType=Prs
1 name name NOUN nsubj Number=Sing
2 is be AUX ROOT Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
3 Alina Alina PROPN attr Number=Sing
4 and and CCONJ cc ConjType=Cmp
5 I I PRON nsubjpass Case=Nom|Number=Sing|Person=1|PronType=Prs
6 was be AUX auxpass Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
7 born bear VERB conj Aspect=Perf|Tense=Past|VerbForm=Part
8 in in ADP prep 
9 Hungary Hungary PROPN pobj Number=Sing
10 . . PUNCT punct PunctType=Peri
11 I I PRON nsubj Case=Nom|Number=Sing|Person=1|PronType=Prs
12 currently currently ADV advmod 
13 live live VERB ROOT Tense=Pres|VerbForm=Fin
14 in in ADP prep 
15 Aarhus Aarhus PROPN pobj Number=Sing
16 . . PUNCT punct PunctType=Peri


token.pos = part-of-speech tag. Without the underscore, we get an integer ID that corresponds to a part of speech. With the underscore, token.pos_, we get the name of the part of speech as a string.

token.morph = more fine-grained linguistic information (morphological features such as number, person, and tense).
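
The morphology column in the output above follows a ```Feature=Value``` pattern, with features separated by ```|```. As a rough illustration of how to read it, the sketch below splits such a string into a dictionary using plain Python. ```parse_morph``` is a hypothetical helper written for this example and is not part of spaCy; in practice, the morph object in spaCy v3 has its own accessors, such as ```token.morph.to_dict()```.

```python
def parse_morph(morph_string: str) -> dict:
    """Split a 'Feature=Value|Feature=Value' string into a dictionary."""
    features = {}
    for pair in morph_string.split("|"):
        if "=" in pair:  # skips the empty string we get for tokens with no features
            name, value = pair.split("=", 1)
            features[name] = value
    return features


# string copied from the output for the token "My" above
print(parse_morph("Number=Sing|Person=1|Poss=Yes|PronType=Prs"))
```

Tokens such as "in" and "currently" have an empty morphology column in the output, and for those the helper simply returns an empty dictionary.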

__NER__

Extracting named entities from a spaCy doc requires an extra step, but nothing too challenging:

In [11]:
# extract named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
# Alina is a person, Hungary and Aarhus are both geopolitical entities

Alina PERSON
Hungary GPE
Aarhus GPE
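
When there are more than a handful of entities, it can help to group them by label. The sketch below shows one way to do that with a ```defaultdict```. To keep it runnable without a loaded model, the pairs are copied by hand from the output above; the variable names are assumptions made for this example.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above;
# with a real Doc these would come from [(ent.text, ent.label_) for ent in doc.ents]
entity_pairs = [("Alina", "PERSON"), ("Hungary", "GPE"), ("Aarhus", "GPE")]

# each label maps to the list of entity texts that carry it
entities_by_label = defaultdict(list)
for text, label in entity_pairs:
    entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(label, texts)
```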


## Count distribution of linguistic features

__Create doc object__

In [12]:
# get file name
import os 
filename = os.path.join("..", "data", "example.txt")

In [14]:
# load text file
with open(filename, "r", encoding="utf-8") as file: 
    text = file.read()

In [15]:
doc = nlp(text)

In [17]:
# count adjectives
adjective_count = 0
for token in doc:
    if token.pos_ == "ADJ":
        adjective_count += 1
print(adjective_count)

46
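
The loop above counts one tag at a time. To get the distribution of every part of speech in a single pass, one option is ```collections.Counter```. This is a minimal sketch that uses the tags from the two-sentence example earlier in this section (copied by hand), not the tags from the text file.

```python
from collections import Counter

# POS tags copied from the two-sentence example earlier in this section;
# with a real Doc, use: pos_tags = [token.pos_ for token in doc]
pos_tags = ["PRON", "NOUN", "AUX", "PROPN", "CCONJ", "PRON", "AUX", "VERB", "ADP",
            "PROPN", "PUNCT", "PRON", "ADV", "VERB", "ADP", "PROPN", "PUNCT"]

pos_counts = Counter(pos_tags)

# most_common() sorts the tags from most to least frequent
for tag, count in pos_counts.most_common():
    print(tag, count)
```

A ```Counter``` returns 0 for tags it has never seen, so ```pos_counts["ADJ"]``` is safe to look up even though the example sentences contain no adjectives.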


In [18]:
# create empty list
entities = []

# get named entities
for ent in doc.ents: 
    entities.append(ent.text)

In [19]:
print(set(entities)) # all unique entities

{'PPE', 'Jorge', 'Israel', '18 October 2020', 'UK', 'Taylor Swift', 'Instagram', 'YouTube', 'Hanan', 'Team Jorge', 'ICO', 'first', 'one', 'Twitter', 'Amazon', 'Sheffield', 'Facebook', 'Gmail', 'Team Jorge’s Aims', 'Canaelan', 'Guardian', '2,000', 'multimillion-pound', 'Advanced Impact Media Solutions', 'Tottenham Hotspur', 'Tal Hanan', 'Twitter two days later', 'LinkedIn', 'Telegram', 'Der Spiegel', 'Airbnb', 'Information Commissioner’s Office', 'Le Monde', 'more than 30,000'}


__Relative frequency__

In [21]:
# find the relative frequency per 1,000 tokens
# if our doc was 1,000 tokens long, how many ADJ would we expect to see?
relative_freq = adjective_count/len(doc)*1000

In [22]:
print(f"This text has a relative frequency of {int(relative_freq)} adjectives per 1000 tokens")

This text has a relative frequency of 72 adjectives per 1000 tokens
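
The same calculation can be wrapped in a small function so that it can be reused for other tags and other scaling factors. This is a sketch only: the function name and the ```per``` parameter are assumptions for this example, and the numbers come from the two-sentence example earlier in this section (3 proper nouns among 17 tokens), not from the text file.

```python
def relative_frequency(count: int, total: int, per: int = 1000) -> float:
    """Return how many occurrences we would expect in a text of `per` tokens."""
    if total == 0:
        return 0.0  # avoid dividing by zero for an empty document
    return count / total * per


# figures from the two-sentence example earlier: 3 PROPN tokens out of 17 tokens
print(round(relative_frequency(3, 17), 2))
```

Scaling to a fixed number of tokens makes counts comparable across texts of different lengths, which raw counts are not.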


## Creating neater outputs using pandas

At the moment, all of our output from ```spaCy``` is either printed to the screen or stored in Python lists. If we want to save these results, it makes sense to use a more transferable format, such as CSV or JSON files.

One very easy way to do this with Python is by using the dataframe library ```pandas```.

In [23]:
import pandas as pd

In [24]:
# create spaCy doc
# a single sentence example
input_string = "My name is Alina and I was born in Hungary."

# create a new Doc object (spaCy Docs are complex objects)
doc = nlp(input_string)

In [25]:
annotations = []

for token in doc: 
    annotations.append((token.text, token.pos_)) # append two things by making it into a tuple

In [26]:
# spaCy doc to pandas dataframe
data = pd.DataFrame(annotations, columns=["tokens", "pos"])

In [27]:
data

,tokens,pos
0,My,PRON
1,name,NOUN
2,is,AUX
3,Alina,PROPN
4,and,CCONJ
5,I,PRON
6,was,AUX
7,born,VERB
8,in,ADP
9,Hungary,PROPN


In [None]:
# save dataframe
outpath = os.path.join("..", "out", "annotations.csv")
os.makedirs(os.path.dirname(outpath), exist_ok=True)  # make sure the output folder exists
data.to_csv(outpath)
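
JSON was mentioned earlier as an alternative to CSV. The sketch below shows one way to produce it with the standard library ```json``` module rather than ```pandas```. The token and tag pairs are copied by hand from the table above so that the example runs on its own, and the key names ```token``` and ```pos``` are choices made for this example.

```python
import json

# (token, pos) pairs copied from the table above;
# in the notebook this is the `annotations` list
annotations = [("My", "PRON"), ("name", "NOUN"), ("is", "AUX"), ("Alina", "PROPN"),
               ("and", "CCONJ"), ("I", "PRON"), ("was", "AUX"), ("born", "VERB"),
               ("in", "ADP"), ("Hungary", "PROPN")]

# one dictionary per token, which maps naturally onto a JSON array of objects
records = [{"token": text, "pos": pos} for text, pos in annotations]

json_string = json.dumps(records, indent=2)
print(json_string)
```

To write the result to disk, the same records can be passed to ```json.dump``` together with an open file handle. If the data is already in a dataframe, ```data.to_json``` with ```orient="records"``` should give a comparable result.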