![title](SpaCy_logo.svg)

spaCy is a Python library which provides powerful, state-of-the-art NLP models through a simple, easy-to-understand interface.

Tutorial inspired by:
https://gist.github.com/aparrish/f21f6abbf2367e8eb23438558207e1c3

![title](cc.png)

## Preparing the environment

The following tutorial is written for `python 3.6` and `anaconda`.

Installing Anaconda:

    https://www.anaconda.com/download/

### Create a new environment

The first step is to create a new environment with its own Python interpreter:

    conda create --name anlp python=3.6
    
To activate the environment, type:

    source activate anlp
    
At this point we are ready to install **SpaCy**!!

## Installing spaCy



    pip install spacy
    
(If you're not using a virtual environment, try `sudo pip install spacy`.)

After you've installed spaCy, you'll need to download the models. Running the following on the command line will download the English models:

    python -m spacy download en
    

In [41]:
import spacy

## Basic Usage

Create a new spaCy object using `spacy.load('en')` (assuming you want to work with English; spaCy supports other languages as well).

In [42]:
nlp = spacy.load('en')

It is now possible to create a `Doc` (document) object by calling `nlp` on a text.

In [43]:
doc = nlp('Who is the president of the United States? Donald Trump is the president of the United States.')

## Sentences

A document is organized in sentences, and spaCy provides a simple API to extract them:

In [44]:
for sent in doc.sents:
    print(sent)

Who is the president of the United States?
Donald Trump is the president of the United States.


`doc.sents` is a generator object, therefore you cannot index a specific sentence directly:

In [45]:
doc.sents[1]

TypeError: 'generator' object is not subscriptable

To get a specific sentence from a document, the generator first has to be converted to a list with `list()`:

In [46]:
list(doc.sents)[1]

Donald Trump is the president of the United States.
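When only one sentence is needed, building the whole list can be avoided. The following is a minimal sketch using `itertools.islice` from the standard library. The `fake_sents` generator is only a stand-in for `doc.sents` (an assumption, so that the snippet runs without a model), and `nth` is a hypothetical helper name, not part of spaCy.

```python
from itertools import islice


def fake_sents():
    """Stand-in for doc.sents: yields the sentences lazily, one at a time."""
    yield "Who is the president of the United States?"
    yield "Donald Trump is the president of the United States."


def nth(iterable, n, default=None):
    """Return the n-th item (0-based) of any iterable, or `default` if it is too short."""
    return next(islice(iterable, n, None), default)


second = nth(fake_sents(), 1)
print(second)
```

With a real document the call would presumably be `nth(doc.sents, 1)`; the generator is consumed only up to the requested sentence.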

## Tokens

Iterating over a `Doc` yields the tokens composing the document, each annotated with different attributes:

In [47]:
for tok in doc:
    print(tok.text)
    

Who
is
the
president
of
the
United
States
?
Donald
Trump
is
the
president
of
the
United
States
.


In [48]:
president = list(doc)[3]
is_ = list(doc)[1]
president.text

'president'

In [49]:
is_.lemma_

'be'

In [50]:
president.prefix_

'p'

In [51]:
president.suffix_

'ent'

In [52]:
president.nbor()

of

In [53]:
president.nbor(3)

United

In [54]:
president.nbor(-1)

the

In [55]:
def get_context(token, n):
    return [token.nbor(i) for i in range(-(n//2), n//2+1)]
    
get_context(president, 5)

[is, the, president, of, the]
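Note that `get_context` assumes the token has enough neighbours on both sides: `nbor` raises an `IndexError` when the requested offset falls outside the document, so the function fails for tokens near the start or the end. A possible workaround is to clamp the window to the boundaries. The sketch below shows the idea on a plain list of strings rather than on spaCy tokens; `get_context_clamped` is a hypothetical name.

```python
def get_context_clamped(tokens, index, n):
    """Return up to n items centred on tokens[index], cut off at the sequence boundaries."""
    half = n // 2
    start = max(0, index - half)
    end = min(len(tokens), index + half + 1)
    return tokens[start:end]


words = ["Who", "is", "the", "president", "of", "the", "United", "States", "?"]

print(get_context_clamped(words, 3, 5))  # window fully inside the sequence
print(get_context_clamped(words, 0, 5))  # window cut off on the left
```

The same slicing should carry over to a spaCy document, since `doc[start:end]` returns a span of tokens.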

## Part of speech

Each word has a specific syntactic function in the sentence (e.g. noun, verb, ...).

In spaCy it is possible to access the part-of-speech attributes of a token using `.tag_` (fine-grained tag) or `.pos_` (coarse-grained tag).

In [56]:
president.tag_

'NN'

In [57]:
president.pos_

'NOUN'

To extract all the nouns appearing in the document:

In [58]:
[tok for tok in doc if tok.pos_=='NOUN']

[Who, president, president]

In [59]:
[tok for tok in doc if tok.pos_=='PROPN']

[United, States, Donald, Trump, United, States]

The `tag_` attribute provides more fine-grained categories:

In [60]:
[tok.tag_ for tok in doc if tok.pos_=='NOUN']

['WP', 'NN', 'NN']

### Larger syntactic units

If we need syntactic units larger than a single word (e.g. noun phrases), we can simply use the `.noun_chunks` API:

In [61]:
for chunk in doc.noun_chunks:
    print(chunk)

Who
the president
the United States
Donald Trump
the president
the United States


## Named Entities

Certain noun phrases in a document refer to entities in the real world. In our document, for example, we refer to two named entities (Donald Trump and the United States).

In [62]:
spacy.displacy.render(doc, style='ent', jupyter=True)

Woooo, it is magic! However, these models can be easily fooled, for example by writing the names in lowercase!

In [63]:
doc_ = nlp('Who is the president of the United States? donald trump is the president of the United states.')
spacy.displacy.render(doc_, style='ent', jupyter=True)

## Extracting more complex information

More complex information can be extracted from syntactic trees:

In [64]:
simple_doc = nlp('Donald Trump is the president of the United States')

spacy.displacy.render(simple_doc, style='dep', jupyter=True, options={'compact': True})

From dependency trees one can extract different types of information. For example, to extract all the subjects in a document:

In [65]:
def flatten_subtree(st):
    return ''.join([w.text_with_ws for w in list(st)]).strip()

In [66]:
subjects = []
for word in doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))
subjects

['Donald Trump']
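To make clearer what `word.subtree` collects, here is a small pure-Python sketch of the same idea. The parse below is written by hand for illustration (the heads and labels are assumptions modelled on the tree shown above, not actual parser output), and the function names are hypothetical.

```python
# Hand-written parse (an assumption for illustration, not parser output).
# Each token is (text, index of its head, dependency label); the root is its own head.
tokens = [
    ("Donald", 1, "compound"),
    ("Trump", 2, "nsubj"),
    ("is", 2, "ROOT"),
    ("the", 4, "det"),
    ("president", 2, "attr"),
]


def subtree(tokens, i):
    """Return the indices of token i and all of its descendants, in document order."""
    indices = [i]
    for j, (_, head, _) in enumerate(tokens):
        if head == i and j != i:
            indices.extend(subtree(tokens, j))
    return sorted(indices)


def subjects(tokens):
    """Join the subtree of every token labelled as a (passive) nominal subject."""
    return [
        " ".join(tokens[j][0] for j in subtree(tokens, i))
        for i, (_, _, dep) in enumerate(tokens)
        if dep in ("nsubj", "nsubjpass")
    ]


print(subjects(tokens))
```

The subject token is "Trump", but its subtree also includes the dependent "Donald", which is why the full name is returned.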

Let's try with a bigger corpus:

In [67]:
!wget https://raw.githubusercontent.com/aparrish/rwet-examples/master/genesis.txt

--2018-04-12 13:02:31--  https://raw.githubusercontent.com/aparrish/rwet-examples/master/genesis.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.12.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.12.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4119 (4,0K) [text/plain]
Saving to: ‘genesis.txt.1’


2018-04-12 13:02:31 (35,9 MB/s) - ‘genesis.txt.1’ saved [4119/4119]



In [29]:
# Read the whole file, strip the line breaks and parse it as a single document
with open("genesis.txt") as f:
    big_doc = nlp(f.read().replace('\n', ''))

In [68]:
subjects = []
for word in big_doc:
    if word.dep_ in ('nsubj', 'nsubjpass'):
        subjects.append(flatten_subtree(word.subtree))
subjects

['God',
 'the earth',
 'darkness',
 'the Spirit of God',
 'God',
 'God',
 'it',
 'God',
 'God',
 'he',
 'the evening and the morning',
 'God',
 'it',
 'God',
 'which',
 'which',
 'it',
 'God',
 'the evening and the morning',
 'God',
 'the waters under the heaven',
 'the dry land',
 'it',
 'God',
 'God',
 'it',
 'God',
 'the earth',
 'whose seed',
 'it',
 'the earth',
 'whose seed',
 'God',
 'it',
 'the evening and the morning',
 'God',
 'them',
 'them',
 'it',
 'God',
 'he',
 'God',
 'God',
 'it',
 'the evening and the morning',
 'God',
 'the waters',
 'that',
 'that',
 'God',
 'the waters',
 'God',
 'it',
 'God',
 'the evening and the morning',
 'God',
 'the earth',
 'it',
 'God',
 'that',
 'God',
 'it',
 'God',
 'us',
 'them',
 'that',
 'God',
 'male and female',
 'God',
 'God',
 'that',
 'God',
 'I',
 'which',
 'which',
 'it',
 'that',
 'I',
 'it',
 'God',
 'he',
 'it',
 'the evening and the morning']

In [69]:
spacy.displacy.render(big_doc, style='ent', jupyter=True)

## Dealing with noisy text

spaCy is able to tokenize a document even if it contains emoticons or misspelled words.

In [70]:
tweet_doc = nlp("lol that is rly funny :) This is gr8 i rate it 8/8!!!")

In [71]:
" ".join(word.text for word in tweet_doc)
   

'lol that is rly funny :) This is gr8 i rate it 8/8 ! ! !'

Additionally, spaCy can normalize some words in the text:

In [72]:
norm_doc = nlp("I'm gonna realise")
" ".join(word.norm_ for word in norm_doc)

'i am going to realize'

## Word Embeddings

Rule-based approaches and statistical NLP usually treat words as atomic symbols, e.g. “class” and “lesson”.

In vector space, such words are represented as sparse (one-hot) vectors with a single “1” and a lot of zeroes. The size of these vectors is equal to the size of the vocabulary.
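A minimal sketch of this sparse representation, using a tiny made-up vocabulary (the words and the helper names `one_hot` and `dot` are assumptions for illustration):

```python
vocabulary = ["class", "lesson", "spacy", "today"]


def one_hot(word, vocabulary):
    """Return a vector with a single 1 at the position of `word` in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(u, v):
    """Dot product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


class_vec = one_hot("class", vocabulary)
lesson_vec = one_hot("lesson", vocabulary)

print(class_vec)
print(lesson_vec)
print(dot(class_vec, lesson_vec))  # two different words never share a non-zero position
```

Since the dot product of two distinct one-hot vectors is always zero, this representation carries no information about how similar “class” and “lesson” are.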

You can get a lot of value by representing a word by means of its neighbors.

*“You shall know a word by the company it keeps”* (J. R. Firth 1957)

 - Today’s **class** is about spaCy
 - Daniele is teaching a **lesson** on spaCy


The words around “class” and “lesson” provide semantic information about the words themselves.
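One simple way to make "the company a word keeps" concrete is to count which words occur near it. The sketch below does this for the two example sentences with a window of three words on each side; the whitespace tokenization, the window size and the function name `cooccurrences` are assumptions chosen for illustration, not how embeddings are actually trained.

```python
from collections import Counter, defaultdict

sentences = [
    "Today's class is about spaCy",
    "Daniele is teaching a lesson on spaCy",
]


def cooccurrences(sentences, window=3):
    """Count, for every word, the words seen at most `window` positions away."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][words[j]] += 1
    return counts


counts = cooccurrences(sentences, window=3)
shared = set(counts["class"]) & set(counts["lesson"])

print(sorted(counts["class"]))
print(sorted(counts["lesson"]))
print(sorted(shared))
```

Even in this tiny corpus “class” and “lesson” share part of their context, which is the kind of signal that embedding models exploit on a much larger scale.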

With word embeddings we can represent words as dense vectors of features. These features are weights trained by means of a neural network model (Collobert & Weston 2008, Turian et al. 2010, Mikolov et al. 2013).

![title](embs.png)

With spaCy we can use word embeddings by downloading the complete (large) model:


https://spacy.io/models/en#en_core_web_lg


    python -m spacy download en_core_web_lg
    


In [73]:
nlp_large = spacy.load('en_core_web_lg')

In [74]:
cat = nlp_large('cat')
dog = nlp_large('dog')
car = nlp_large('car')

In [75]:
cat.vector

array([-0.15067  , -0.024468 , -0.23368  , -0.23378  , -0.18382  ,
        0.32711  , -0.22084  , -0.28777  ,  0.12759  ,  1.1656   ,
       -0.64163  , -0.098455 , -0.62397  ,  0.010431 , -0.25653  ,
        0.31799  ,  0.037779 ,  1.1904   , -0.17714  , -0.2595   ,
       -0.31461  ,  0.038825 , -0.15713  , -0.13484  ,  0.36936  ,
       -0.30562  , -0.40619  , -0.38965  ,  0.3686   ,  0.013963 ,
       -0.6895   ,  0.004066 , -0.1367   ,  0.32564  ,  0.24688  ,
       -0.14011  ,  0.53889  , -0.80441  , -0.1777   , -0.12922  ,
        0.16303  ,  0.14917  , -0.068429 , -0.33922  ,  0.18495  ,
       -0.082544 , -0.46892  ,  0.39581  , -0.13742  , -0.35132  ,
        0.22223  , -0.144    , -0.048287 ,  0.3379   , -0.31916  ,
        0.20526  ,  0.098624 , -0.23877  ,  0.045338 ,  0.43941  ,
        0.030385 , -0.013821 , -0.093273 , -0.18178  ,  0.19438  ,
       -0.3782   ,  0.70144  ,  0.16236  ,  0.0059111,  0.024898 ,
       -0.13613  , -0.11425  , -0.31598  , -0.14209  ,  0.0281

In [76]:
cat.similarity(dog)

0.8016855517329495

In [77]:
cat.similarity(car)

0.3190753256322656
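By default, `similarity` computes the cosine similarity between the two vectors. A minimal pure-Python sketch of that measure follows; the three-dimensional vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up toy vectors (assumption): "cat" and "dog" point in similar directions.
cat_vec = [0.9, 0.8, 0.1]
dog_vec = [0.8, 0.9, 0.2]
car_vec = [0.1, 0.2, 0.9]

print(cosine_similarity(cat_vec, dog_vec))
print(cosine_similarity(cat_vec, car_vec))
```

The measure depends only on the direction of the vectors, not on their length, so scaling a vector does not change its similarity to others.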

## Sentence similarity

Word embeddings are powerful tools to compute word similarity.
Moreover, they can also be used to compute document-level similarities.

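$e_{doc} = \frac{1}{N} \sum_{i=1}^{N} e_{w_i}$

where $N$ is the number of words in the document and $e_{w_i}$ is the embedding of the $i$-th word. In other words, a document can be represented by the average of the embeddings of the words composing it!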
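The averaging can be sketched in a few lines of plain Python. The two-dimensional word vectors below are made up for illustration, and `average_vector` is a hypothetical helper, not a spaCy function.

```python
def average_vector(vectors):
    """Component-wise mean of a non-empty list of equally sized vectors."""
    n = len(vectors)
    return [sum(components) / n for components in zip(*vectors)]


# Made-up toy embeddings (assumption), one vector per word.
word_vectors = {
    "the": [0.0, 1.0],
    "president": [1.0, 0.0],
    "speaks": [2.0, 2.0],
}

doc_vector = average_vector([word_vectors[w] for w in ["the", "president", "speaks"]])
print(doc_vector)  # [1.0, 1.0]

# A word without an embedding, represented here by an all-zero vector,
# still counts towards N and therefore shrinks the average.
with_unknown = average_vector(list(word_vectors.values()) + [[0.0, 0.0]])
print(with_unknown)  # [0.75, 0.75]
```

The second call illustrates why words without a vector matter: they add nothing to the sum but still increase the number of words the sum is divided by.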

In [81]:
doc_ = nlp_large('Daniele Bonadiman is the president of the United States')
doc_.similarity(nlp_large('Who was the head of the White House?'))

0.7862439084398206

In [82]:
for tok in doc_:
    print(tok.has_vector)

True
False
True
True
True
True
True
True
True


In [84]:
list(doc_)[1].vector

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.