# NLP with spaCy

spaCy is a modern natural language processing library for Python: https://spacy.io/

We follow the material in the official GitHub repository: https://github.com/explosion/spacy-notebooks

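First of all, install the spaCy library and the language-specific resources, if necessary. The short model names used here (`it`, `en`) are shortcuts from spaCy 2.x; spaCy 3 expects full package names such as `it_core_news_sm`.

In [1]:
#!pip3 install spacy
#!python3 -m spacy download it # "it" is the Italian language code; use "en" for English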

Then import the library and load the language resources in Python (this might take a while):

In [2]:
import spacy
nlp = spacy.load('it')

# Introduction to spaCy

The following examples are based on the notebook available at:
https://github.com/explosion/spacy-notebooks/blob/master/notebooks/conference_notebooks/pycon_nlp/00_spacy_intro.ipynb

This is just a quick overview of some features of spaCy, with the goal of getting familiar with the library.

We will briefly cover tokenization, sentence splitting, POS tagging, syntactic dependencies, named entities and word embeddings.

## Tokens and sentences

Having loaded the language-specific resources into the `nlp` object, we can call it on a text to process it automatically and turn it into a "document":

In [3]:
example_text = u"Una frase di esempio. NLP in poche righe di codice."
example_text

'Una frase di esempio. NLP in poche righe di codice.'

In [4]:
example_doc = nlp(example_text)
example_doc

Una frase di esempio. NLP in poche righe di codice.

The doc object supports a number of useful operations. For instance, compare indexing the raw string with indexing the doc:

In [5]:
print('first position of example_text is {}'.format(example_text[0])) # indexing a string returns a character

first position of example_text is U


In [6]:
print('first position of example_doc is {}'.format(example_doc[0])) # indexing a doc returns a token

first position of example_doc is Una


It's tokenized!
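
The difference between indexing characters and indexing tokens can also be seen without loading a model. The following is a minimal sketch that uses a crude regular expression as a stand-in for a tokenizer; it is not how spaCy tokenizes, and the pattern is only an assumption that happens to work for this simple sentence.

```python
import re

text = "Una frase di esempio. NLP in poche righe di codice."

# Crude stand-in for a tokenizer (an assumption, not spaCy's algorithm):
# a token is either a run of word characters or a single punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", text)

print(text[0])    # indexing the string gives its first character
print(tokens[0])  # indexing the token list gives the first word
print(tokens)
```

A real tokenizer additionally handles abbreviations, clitics, URLs and other language-specific cases, which is why spaCy ships tokenization rules per language.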

Another example, with multiple sentences:

In [7]:
example_text = u"Questa è la prima frase. Questa è la seconda, mentre questa è la terza."

In [8]:
example_doc = nlp(example_text)

In [9]:
for sent in example_doc.sents: # .sents yields the sentences of the doc
    print(sent)

Questa è la prima frase.
Questa è la seconda, mentre questa è la terza.


## Part of speech tagging

In [10]:
# For each token, print corresponding part of speech tag
for token in example_doc:
    print('{} - {}'.format(token, token.pos_))

Questa - PRON
è - VERB
la - DET
prima - ADJ
frase - NOUN
. - PUNCT
Questa - PRON
è - VERB
la - DET
seconda - ADJ
, - PUNCT
mentre - SCONJ
questa - PRON
è - VERB
la - DET
terza - ADJ
. - PUNCT


## Syntactic dependencies

In [11]:
# Write a function that walks up the syntactic tree from the given token
# and collects all tokens up to the root token (including the root token).

def tokens_to_root(token):
    """
    Walk up the syntactic tree, collecting tokens from `token` to the root.
    :param token: spaCy token
    :return: list of spaCy tokens, starting with `token` and ending with the root
    """
    tokens_to_r = []
    while token.head != token:  # the root token is its own head
        tokens_to_r.append(token)
        token = token.head
    tokens_to_r.append(token)  # append the root exactly once

    return tokens_to_r

In [12]:
# For every token in the document, print its path to the root
for token in example_doc:
    print('{} --> {}'.format(token, tokens_to_root(token)))

Questa --> [Questa, frase]
è --> [è, frase]
la --> [la, frase]
prima --> [prima, frase]
frase --> [frase]
. --> [., frase]
Questa --> [Questa, seconda]
è --> [è, seconda]
la --> [la, seconda]
seconda --> [seconda]
, --> [,, seconda]
mentre --> [mentre, terza, seconda]
questa --> [questa, terza, seconda]
è --> [è, terza, seconda]
la --> [la, terza, seconda]
terza --> [terza, seconda]
. --> [., seconda]


In [13]:
# Print the dependency labels of the tokens on each path to the root
for token in example_doc:
    print('-> '.join(['{}-{}'.format(dependent_token, dependent_token.dep_) for dependent_token in tokens_to_root(token)]))

Questa-nsubj-> frase-ROOT
è-cop-> frase-ROOT
la-det-> frase-ROOT
prima-amod-> frase-ROOT
frase-ROOT
.-punct-> frase-ROOT
Questa-nsubj-> seconda-ROOT
è-cop-> seconda-ROOT
la-det-> seconda-ROOT
seconda-ROOT
,-punct-> seconda-ROOT
mentre-mark-> terza-advcl-> seconda-ROOT
questa-nsubj-> terza-advcl-> seconda-ROOT
è-cop-> terza-advcl-> seconda-ROOT
la-det-> terza-advcl-> seconda-ROOT
terza-advcl-> seconda-ROOT
.-punct-> seconda-ROOT
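
The head-following walk can also be tried without a loaded model. Below is a minimal sketch, assuming each token is stored as a `(text, head index)` pair, with the heads copied from the parse of the second sentence shown above. The names `SENTENCE` and `path_to_root` are illustrative and not part of spaCy; the root points to itself, mirroring the fact that a spaCy root token is its own head.

```python
# (text, index of the head token); the root is its own head.
# Heads are copied from the parse of the second sentence above.
SENTENCE = [
    ("Questa", 3),
    ("è", 3),
    ("la", 3),
    ("seconda", 3),   # root
    (",", 3),
    ("mentre", 9),
    ("questa", 9),
    ("è", 9),
    ("la", 9),
    ("terza", 3),
    (".", 3),
]

def path_to_root(index, sentence):
    """Return the token texts from position `index` up to and including the root."""
    path = [sentence[index][0]]
    while sentence[index][1] != index:   # stop once a token is its own head
        index = sentence[index][1]
        path.append(sentence[index][0])
    return path

for i, (text, _) in enumerate(SENTENCE):
    print("{} --> {}".format(text, path_to_root(i, SENTENCE)))
```

Tokens inside the subordinate clause ("mentre", "questa", "è", "la") need two hops to reach the root, passing through "terza".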


## Named Entities

In [14]:
# Print all named entities with their corresponding named entity types

example_doc = nlp(u"Sono andato al Parc Güell di Barcellona guidando una Fiat, per incontrare Salvador Dalì.")
for ent in example_doc.ents: # .ents holds the named entity spans of the doc
    print('{} - {}'.format(ent, ent.label_)) # .label_ is the entity type as a string

Parc Güell - LOC
Barcellona - LOC
Fiat - ORG
Salvador Dalì - PER


## Noun Chunks

In [15]:
# .noun_chunks yields the base noun phrases of the doc
example_doc = nlp(u"Il mio amico Salvador ha dipinto un orologio sciolto sulla mia Fiat nuova.")
print([chunk for chunk in example_doc.noun_chunks])

[]


The result is empty, and it is not obvious why. Let's try the same thing in English:

In [16]:
nlp2 = spacy.load("en")
example_doc2 = nlp2(u"My friend Salvador painted a melted clock on my new Fiat.")
print([chunk for chunk in example_doc2.noun_chunks])

[My friend, Salvador, a melted clock, my new Fiat]


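So noun chunking is probably not implemented for Italian yet in this version: `noun_chunks` relies on language-specific rules, which the Italian resources used here apparently lack.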

## Word Embeddings

First, we need to download a larger spaCy model that includes pre-trained English word vectors:

(As far as I understand, the Italian model currently provides only context-sensitive tensors, i.e. representations derived from the POS tagger, parser and NER, rather than pre-trained word vectors. Cf. https://spacy.io/models/it)

In [17]:
#!python3 -m spacy download en_core_web_md

In [18]:
# Compute the similarity between 'apples' and 'oranges', and between 'boots' and 'hippos'
nlp3 = spacy.load('en_core_web_md')
example_doc3 = nlp3(u"Apples and oranges are similar. Boots and hippos aren't.")
apples = example_doc3[0]
oranges = example_doc3[2]
boots = example_doc3[6]
hippos = example_doc3[8]
print(apples.similarity(oranges))
print(boots.similarity(hippos))

0.77809423
0.11093954
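
By default, spaCy's `similarity` is estimated as the cosine similarity between the two word vectors. The following is a minimal sketch of that computation in plain Python; the helper `cosine_similarity` and the three-dimensional `toy_vectors` are made up for illustration and are not taken from `en_core_web_md`, whose real vectors have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, invented for this example only.
toy_vectors = {
    "apples": [0.9, 0.8, 0.1],
    "oranges": [0.8, 0.9, 0.2],
    "boots": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["apples"], toy_vectors["oranges"]))
print(cosine_similarity(toy_vectors["apples"], toy_vectors["boots"]))
```

Vectors pointing in similar directions score close to 1, while unrelated directions score close to 0, which matches the pattern of the two values printed above.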
