# Introduction to spaCy

<b>spaCy</b> provides a fairly complete NLP pipeline in which the output of one component feeds into the next: it takes a raw document and performs tokenization, POS tagging, stop word recognition, morphological analysis, lemmatization, sentence splitting, dependency parsing and Named Entity Recognition (NER), among other tasks.

 - <b>Sentence splitting:</b> attribute sents of a Doc (of type spacy.tokens.doc.Doc)
 - <b>Tokenization:</b> Doc contains a sequence of Token objects (of type spacy.tokens.token.Token)
 - <b>Part-of-speech (POS) tagging:</b> attributes <b>pos_</b> and <b>tag_</b> of Token
 - <b>Stop word recognition:</b> attribute <b>is_stop</b> of Token
 - <b>Lemmatization:</b> attribute <b>lemma_</b> of Token (spaCy does not include a stemmer)
 - <b>Dependency parsing:</b> attributes <b>dep_</b> and <b>head</b> of Token (spaCy's built-in parser produces dependency trees, not constituency trees)
 - <b>Named Entity Recognition (NER):</b> attribute <b>ents</b> of Doc, a sequence of entity spans (each of type spacy.tokens.span.Span)

In [4]:
import spacy
nlp = spacy.load("en_core_web_sm") # creating the spaCy object 'nlp'

The <b>'nlp'</b> object processes text through a defined pipeline of components. Store the result in a variable so that you can access it later. The result of processing a text with spaCy is another spaCy object, of type <b>'Doc'</b>.

<b>'Doc'</b> objects are complex and give access to the different analyses that have been applied to the input text. From a Doc object you can access the tokens that make up the text, their lemmas, their POS tags, the sentences, noun chunks, named entities, and much more.

In [5]:
doc = nlp("More than ten million people are already thought to have fled their homes in Ukraine because of the invasion, according to the United Nations.")
type(doc)

spacy.tokens.doc.Doc

doc is now a Python object of the class Doc. It is a container for accessing linguistic annotations and a sequence of Token objects.
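Some token-level annotations are purely lexical and do not need a trained model. The sketch below is a minimal illustration, assuming spaCy v3 and a blank English pipeline (tokenizer only); the names `blank_nlp`, `demo_doc` and `content_words` are made up for this example.

```python
import spacy

# Assumption: a blank English pipeline is enough here, because is_stop,
# is_punct and is_alpha are lexical flags that need no trained model.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("people have fled their homes.")

# Each Token exposes its position (i), its text and several boolean flags.
for token in demo_doc:
    print(token.i, token.text, token.is_stop, token.is_punct, token.is_alpha)

# Keep only the tokens that are neither stop words nor punctuation.
content_words = [t.text for t in demo_doc if not t.is_stop and not t.is_punct]
print(content_words)
```

Filtering on flags like these is a common first step before counting or comparing words.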

## Doc, Token and Span objects
There are three important types of objects to remember:

 - A <b>Doc</b> is a sequence of Token objects.
 - A <b>Token</b> object represents an individual token — i.e. a word, punctuation symbol, whitespace, etc. It has attributes representing linguistic annotations.
 - A <b>Span</b> object is a slice from a Doc object and consists of a sequence of Token objects.
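The relationship between the three types can be sketched as follows. This is a minimal example, assuming spaCy v3 and a blank English pipeline (tokenizer only, no model download); `blank_nlp`, `demo_doc` and `demo_span` are illustrative names.

```python
import spacy

# Assumption: tokenization alone is enough to show Doc, Token and Span.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("More than ten million people have fled their homes.")

# Slicing a Doc returns a Span; indexing a Doc returns a single Token.
demo_span = demo_doc[2:5]
print(type(demo_span).__name__, "->", demo_span.text)
print(type(demo_doc[0]).__name__, "->", demo_doc[0].text)

# A Span remembers its token offsets in the parent Doc.
print(demo_span.start, demo_span.end)

# Tokens inside a Span keep their index (i) in the parent Doc.
print([(token.i, token.text) for token in demo_span])
```

A Span is a view onto the Doc, not a copy: its tokens are the same Token objects the Doc exposes.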

In [8]:
for token in doc:
    print(token)

More
than
ten
million
people
are
already
thought
to
have
fled
their
homes
in
Ukraine
because
of
the
invasion
,
according
to
the
United
Nations
.


In [10]:
print(list(doc))
# list(doc) collects the Token objects of the Doc into an ordinary Python list.
# Iterating over a Doc yields its tokens one at a time; the Doc itself is a container, not a list.

[More, than, ten, million, people, are, already, thought, to, have, fled, their, homes, in, Ukraine, because, of, the, invasion, ,, according, to, the, United, Nations, .]


In [11]:
first_token = doc[0]
print("First token:", first_token)
second_token = doc[1]
print("Second token:", second_token)

First token: More
Second token: than


Even though these tokens look like <b>strings</b>, they are not. `print` only shows the string representation of each Token object.
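When an actual Python string is needed, the `text` attribute of a Token provides it. A small sketch, assuming spaCy v3 and a blank English pipeline; `blank_nlp` and `first` are illustrative names.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer only, no trained model required
first = blank_nlp("More than ten million people")[0]

# A Token is not a str, so string methods are not available on it directly.
print(isinstance(first, str))

# token.text is a plain Python string and supports all str methods.
print(type(first.text), first.text.upper())

# str(token) gives the same text that print(token) displays.
print(str(first) == first.text)
```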

In [13]:
for token in doc:
    print(token,"\t", type(token))

More 	 <class 'spacy.tokens.token.Token'>
than 	 <class 'spacy.tokens.token.Token'>
ten 	 <class 'spacy.tokens.token.Token'>
million 	 <class 'spacy.tokens.token.Token'>
people 	 <class 'spacy.tokens.token.Token'>
are 	 <class 'spacy.tokens.token.Token'>
already 	 <class 'spacy.tokens.token.Token'>
thought 	 <class 'spacy.tokens.token.Token'>
to 	 <class 'spacy.tokens.token.Token'>
have 	 <class 'spacy.tokens.token.Token'>
fled 	 <class 'spacy.tokens.token.Token'>
their 	 <class 'spacy.tokens.token.Token'>
homes 	 <class 'spacy.tokens.token.Token'>
in 	 <class 'spacy.tokens.token.Token'>
Ukraine 	 <class 'spacy.tokens.token.Token'>
because 	 <class 'spacy.tokens.token.Token'>
of 	 <class 'spacy.tokens.token.Token'>
the 	 <class 'spacy.tokens.token.Token'>
invasion 	 <class 'spacy.tokens.token.Token'>
, 	 <class 'spacy.tokens.token.Token'>
according 	 <class 'spacy.tokens.token.Token'>
to 	 <class 'spacy.tokens.token.Token'>
the 	 <class 'spacy.tokens.token.Token'>
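United 	 <class 'spacy.tokens.token.Token'>
Nations 	 <class 'spacy.tokens.token.Token'>
. 	 <class 'spacy.tokens.token.Token'>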

<b>Attributes</b> are accessed without parentheses. In the case of spaCy tokens, attributes typically contain annotations of the token in the text. Many attributes come in pairs, one without and one with the suffix <b>_</b>. The attributes without <b>_</b> hold the integer IDs that spaCy uses internally, whereas the variants with <b>_</b> hold the human-readable Unicode string for the same value. The internal numerical representations store data more efficiently, and the readable strings are looked up only when needed, for example when rendering output.
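The link between the two variants can be made visible through the vocabulary's string store. The sketch below assumes spaCy v3 and uses the attribute pair `orth` / `orth_` (the verbatim token text) because it works without a trained model; `blank_nlp`, `demo_token` and `strings` are illustrative names.

```python
import spacy

blank_nlp = spacy.blank("en")  # tokenizer only
demo_token = blank_nlp("people fled")[0]

orth_id = demo_token.orth      # integer ID used internally
orth_text = demo_token.orth_   # readable string for the same value
print(orth_id, orth_text)

# The StringStore of the vocabulary maps in both directions.
strings = blank_nlp.vocab.strings
print(strings[orth_id])               # integer ID -> string
print(strings[orth_text] == orth_id)  # string -> integer ID
```

The same lookup applies to other string-valued pairs such as `lemma` / `lemma_`.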

In [15]:
for token in doc:
    print(token.lemma, token.lemma_, token.pos, token.pos_)

16799767236440142622 More 84 ADJ
10794458019344880855 than 85 ADP
7970704286052693043 ten 93 NUM
17365054503653917826 million 93 NUM
7593739049417968140 people 92 NOUN
10382539506755952630 be 87 AUX
12647358837591399640 already 86 ADV
16875814820671380748 think 100 VERB
3791531372978436496 to 94 PART
14692702688101715474 have 87 AUX
10512024121063493905 flee 100 VERB
4244585616942201722 their 95 PRON
12006852138382633966 home 92 NOUN
3002984154512732771 in 85 ADP
13009530111827274038 Ukraine 96 PROPN
16950148841647037698 because 98 SCONJ
886050111519832510 of 85 ADP
7425985699627899538 the 90 DET
17776359803327588471 invasion 92 NOUN
2593208677638477497 , 97 PUNCT
701735504652304602 accord 100 VERB
3791531372978436496 to 85 ADP
7425985699627899538 the 90 DET
13226800834791099135 United 96 PROPN
2786571147933758565 Nations 96 PROPN
12646065887601541794 . 97 PUNCT


In [17]:
spacy.explain("SCONJ")

'subordinating conjunction'

# Sentence splitting & tokenization 

<b>spaCy</b> performs sentence splitting and exposes the result through the attribute *sents* of a Doc *(of type spacy.tokens.doc.Doc)*, which yields one Span per sentence. Each Doc contains a sequence of Token objects; this is where the output from the tokenizer is found. The text of a token can be accessed using the attribute *text*. Each Doc also records where sentences start, so that its tokens can be grouped into sentences.
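Sentence boundaries can also be set without a trained model. The sketch below swaps in spaCy's rule-based `sentencizer` component, which splits on sentence-final punctuation. That is a different technique from the one used by the `en_core_web_sm` pipeline in this notebook. It assumes the spaCy v3 `add_pipe` API; `rule_nlp`, `demo_doc` and `demo_sents` are illustrative names.

```python
import spacy

# Assumption: spaCy v3, where built-in components are added by string name.
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")  # rule-based splitting on punctuation

demo_doc = rule_nlp("Senior Conservatives want a rethink. The survey has three months.")

# doc.sents is a generator of Span objects; list() materializes it.
demo_sents = list(demo_doc.sents)
for sent in demo_sents:
    print(sent.start, sent.end, sent.text)

# Every token records whether it opens a sentence.
print([token.text for token in demo_doc if token.is_sent_start])
```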

In [21]:
doc2 = nlp("Business Secretary Kwasi Kwarteng has given the British Geological Survey (BGS) three months to assess any changes to the science around the controversial practice.But senior Conservatives have been calling for a rethink in recent weeks.")

In [25]:
sentences = doc2.sents

for sentence in sentences:
    print("NEXT SENTENCE")
    print(sentence)
    
    for token in sentence:
        print(token.text)

NEXT SENTENCE
Business Secretary Kwasi Kwarteng has given the British Geological Survey (BGS) three months to assess any changes to the science around the controversial practice.
Business
Secretary
Kwasi
Kwarteng
has
given
the
British
Geological
Survey
(
BGS
)
three
months
to
assess
any
changes
to
the
science
around
the
controversial
practice
.
NEXT SENTENCE
But senior Conservatives have been calling for a rethink in recent weeks.
But
senior
Conservatives
have
been
calling
for
a
rethink
in
recent
weeks
.


# Lemmatization

The output from the <b>lemmatizer</b> is stored in the attribute *lemma_* of each <b>`Token`</b> object.

In [26]:
doc3 = nlp("Business Secretary Kwasi has given the British Geological Survey (BGS) three months to assess any changes.")

In [34]:
month_token = doc3[13]
print(month_token.text,month_token.lemma_)

months month


# Part of speech tagging

The output from the part-of-speech tagger is stored in two attributes of each Token object:

- `pos_`: the coarse-grained part-of-speech tag (from the Universal POS tag set)
- `tag_`: the fine-grained part-of-speech tag

In [37]:
print(month_token.text,month_token.pos_,month_token.tag_)

months NOUN NNS
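The two labels printed for *months* can be looked up with `spacy.explain`, the same helper used earlier for `SCONJ`. A small sketch; `tag_descriptions` is an illustrative name.

```python
import spacy

# spacy.explain looks up a short description for a label in spaCy's glossary.
# It needs no trained pipeline and returns None for labels it does not know.
for label in ("NOUN", "NNS"):
    print(label, "->", spacy.explain(label))

# Collect the descriptions in a dict for later use.
tag_descriptions = {label: spacy.explain(label) for label in ("NOUN", "NNS")}
```

The coarse tag says only that the token is a noun, while the fine-grained tag also records that it is plural.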


# Dependency parsing

The output of the dependency parser is spread over several Token attributes, so reading a dependency relation means combining the information from more than one attribute.

In [41]:
doc4 = nlp("UN scientists have unveiled a plan that they believe can limit the root causes of dangerous climate change.A key UN body says in a report that there must be 'rapid, deep and immediate' cuts in carbon dioxide (CO2) emissions.")

In [42]:
from spacy import displacy
displacy.render(doc4)

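Each token has exactly one head, to which it is attached by a dependency relation. For example, in the sentence *Autonomous cars shift insurance liability toward manufacturers* (which is not part of doc4):

 - ***autonomous*** has an ***amod*** relation with ***cars***
 - ***cars*** has an ***nsubj*** relation with the main verb ***shift***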

In [43]:
spacy.explain("amod")

'adjectival modifier'

spaCy uses the terms child and head in its dependency parsing output.

- a relation always points in one direction, from a child to its head, e.g., autonomous is the child of cars
- the head of a phrase can itself be the child of another token, e.g., cars is the child of shift
- the root of a sentence (often the main verb) has no other token as its head; in spaCy its head attribute points to the token itself and its dep_ is ROOT

- **dep_** provides the syntactic relation, e.g., nsubj
- **head** provides the head of a `Token`, e.g., in the case of autonomous the head would be cars
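The child/head terminology can be explored on a small hand-annotated tree. The sketch below does not run the parser: it builds a `Doc` directly and assigns heads and labels by hand, so the annotations are assumptions made for illustration, not model output. It assumes spaCy v3, where `heads` holds the absolute index of each token's head; `tree_doc` and `root` are illustrative names.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab

# Hand-written annotations (assumption): heads[i] is the index of token i's head.
words = ["Autonomous", "cars", "shift", "insurance", "liability"]
heads = [1, 2, 2, 4, 2]
deps = ["amod", "nsubj", "ROOT", "compound", "dobj"]
tree_doc = Doc(vocab, words=words, heads=heads, deps=deps)

# For every token, print its relation, its head and its direct children.
for token in tree_doc:
    children = [child.text for child in token.children]
    print(token.text, token.dep_, token.head.text, children)

# The root is the token that is its own head.
root = [token for token in tree_doc if token.head == token][0]
print("root:", root.text)
```

Following `head` repeatedly from any token ends at the root, and `children` walks one level in the opposite direction.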

In [46]:
scientist_token = doc4[1]
print(scientist_token, scientist_token.head,scientist_token.dep_)

scientists unveiled nsubj


# Named Entity Recognition

The output from the **Named Entity Recognizer** is stored in the attribute **ents** of `Doc`. Each **ent** (of type spacy.tokens.span.Span) carries its named entity type in the attribute **label_**.
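Because each entity is an ordinary Span with a label, the structure can be shown on a hand-annotated example. The sketch below sets the entity manually instead of running the recognizer, so the `PERSON` label is an assumption made for illustration. It assumes spaCy v3; `ner_doc` and `found` are illustrative names.

```python
import spacy
from spacy.tokens import Doc, Span

vocab = spacy.blank("en").vocab
words = ["Tesla", "boss", "Elon", "Musk", "asked", "his", "followers"]
ner_doc = Doc(vocab, words=words)

# Hand-assigned entity (assumption): tokens 2 and 3 form a PERSON span.
ner_doc.ents = [Span(ner_doc, 2, 4, label="PERSON")]

# doc.ents holds Span objects; label_ is the readable entity type.
found = [(ent.text, ent.label_, ent.start, ent.end) for ent in ner_doc.ents]
print(found)

# The same information is available per token: B marks the first token of
# an entity and I marks a token inside it.
for token in ner_doc[2:4]:
    print(token.text, token.ent_iob_, token.ent_type_)
```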

In [47]:
text = "It comes after new board member, Tesla boss Elon Musk, asked his followers in a Twitter poll whether they wanted the feature."

doc = nlp(text)

In [48]:
displacy.render(doc,jupyter=True,style="ent")

In [50]:
for ent in doc.ents:
    print(ent.text,ent.label_)

Elon Musk PERSON
Twitter PRODUCT
