# spaCy

- See https://spacy.io/usage/spacy-101 for an introduction
- The site also has recipes for different tasks (sentiment analysis etc.)
- Feature comparison and benchmarks: https://spacy.io/usage/facts-figures


- Pretrained models are available for Spanish and German
- Easy to try: a pipeline including dependency parsing takes three lines of code
- Combines a classical pipeline with neural NLP components; word embeddings are available from the outset


- No built-in coreference resolution

# Installation

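- Several package managers are supported (pip, conda); it can also be installed with the OS package manager

- Download the language models after installation, e.g. for Spanish: `python -m spacy download es` (spaCy v3 and later require the full package name, e.g. `es_core_news_sm`)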

# Simple usage

In [36]:
import spacy
nlp = spacy.load("es")
print("Loaded ES\n")
doc = nlp("El hombre bajo toca un bajo bajo el baobab")
for word in doc:
    print(word.text, word.lemma_, word.tag_, " "* (55-len(word.tag_)), word.pos_, word.dep_, sep='\t')

Loaded ES

El	El	DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art	 	DET	det
hombre	hombre	NOUN__Gender=Masc|Number=Sing	                          	NOUN	nsubj
bajo	bajar	ADJ__Gender=Masc|Number=Sing	                           	ADJ	mark
toca	tocar	VERB__Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin		VERB	ROOT
un	uno	DET__Definite=Ind|Gender=Masc|Number=Sing|PronType=Art	 	DET	det
bajo	bajar	NOUN__Gender=Masc|Number=Sing	                          	NOUN	obj
bajo	bajar	ADP__AdpType=Prep	                                      	ADP	case
el	el	DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art	 	DET	det
baobab	baobab	NOUN__Gender=Masc|Number=Sing	                          	NOUN	nmod
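
The `tag_` column in the output above packs a coarse part-of-speech label and a list of morphological features into one string: the two parts are separated by `__`, and the features are written as `Key=Value` pairs joined by `|`. A minimal sketch of splitting such a string in plain Python follows. The helper name `parse_tag` is hypothetical (it is not part of spaCy), and the format is inferred only from the output shown here.

```python
def parse_tag(tag):
    """Split a tag such as 'NOUN__Gender=Masc|Number=Sing' into (pos, features)."""
    # Everything before '__' is the coarse tag, everything after is the feature list
    pos, _, feats = tag.partition("__")
    features = {}
    for item in feats.split("|"):
        if "=" in item:  # skip empty pieces when there are no features
            key, value = item.split("=", 1)
            features[key] = value
    return pos, features


pos, features = parse_tag("DET__Definite=Def|Gender=Masc|Number=Sing|PronType=Art")
print(pos)       # DET
print(features)  # {'Definite': 'Def', 'Gender': 'Masc', 'Number': 'Sing', 'PronType': 'Art'}
```

With the features in a dictionary, filtering tokens becomes straightforward, for example keeping only those whose `Number` is `Sing`.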


## Named entities

In [41]:
txt = """Mucha sangre de Caín
tiene la gente labriega, 
y en el hogar campesino 
armó la envidia pelea.

Juan y Martín, los mayores 
de Alvargonzález, un día 
pesada marcha emprendieron 
con el alba, Duero arriba.

Llegaron los asesinos 
hasta la Laguna Negra, 
agua transparente y muda 
que enorme muro de piedra
"""

doc = nlp(txt.replace("\n", " "))

for ent in doc.ents:
    print(ent.text, ent.label_, ent.start, ent.end)

Caín PER 3 4
Juan PER 22 23
Martín PER 24 25
 de Alvargonzález PER 28 31
Duero LOC 43 44
Laguna Negra LOC 53 55


# Matching sequences (Rule-based matching)

See https://spacy.io/usage/linguistic-features#rule-based-matching


Here the task is to match phrases in which an adjective is predicated of "Facebook", using a sequence of token patterns and the spaCy `Matcher` class.

In [7]:
import spacy
from spacy import displacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
matcher = Matcher(nlp.vocab)
matched_sents = [] # collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end] # matched span
    sent = span.sent # sentence containing matched span
    # append mock entity for match in displaCy style to matched_sents
    # get the match span by offsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{'start': span.start_char - sent.start_char,
                   'end': span.end_char - sent.start_char,
                   'label': 'MATCH'}]
    matched_sents.append({'text': sent.text, 'ents': match_ents })

pattern = [{'LOWER': 'facebook'}, {'LEMMA': 'be'}, {'POS': 'ADV', 'OP': '*'},
           {'POS': 'ADJ'}]
# add pattern (spaCy v2 signature; in v3: matcher.add('FacebookIs', [pattern], on_match=collect_sents))
matcher.add('FacebookIs', collect_sents, pattern)
matches = matcher(nlp("Facebook is really horrible and FAcebook is very bad")) # match on your text

print(matched_sents)

[{'ents': [{'label': 'MATCH', 'end': 27, 'start': 0}], 'text': 'Facebook is really horrible and FAcebook is very bad'}, {'ents': [{'label': 'MATCH', 'end': 52, 'start': 32}], 'text': 'Facebook is really horrible and FAcebook is very bad'}]
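
The offsets in the output above are relative to the sentence, not to the document: `collect_sents` subtracts `sent.start_char` from the span's character offsets. Because the example text is a single sentence starting at character 0, the two coincide here. The arithmetic can be checked in plain Python without spaCy. The helper `relative_offsets` and the second, two-sentence text are illustrative assumptions, not part of the original example.

```python
def relative_offsets(span_start, span_end, sent_start):
    """Convert document-level character offsets into sentence-level ones."""
    return {"start": span_start - sent_start, "end": span_end - sent_start}


text = "Facebook is really horrible and FAcebook is very bad"
for phrase in ("Facebook is really horrible", "FAcebook is very bad"):
    start = text.index(phrase)
    print(relative_offsets(start, start + len(phrase), sent_start=0))
# {'start': 0, 'end': 27}
# {'start': 32, 'end': 52}

# Hypothetical two-sentence document: the matched sentence starts at character 22
doc_text = "I deleted my account. Facebook is very bad"
sent_start = doc_text.index("Facebook")
span_start = sent_start
span_end = span_start + len("Facebook is very bad")
print(relative_offsets(span_start, span_end, sent_start))
# {'start': 0, 'end': 20}
```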


To visualize the matches, use displaCy (it serves the results on localhost:5000):

In [8]:
# serve visualization of sentences containing match with displaCy
# set manual=True to make displaCy render straight from a dictionary
displacy.serve(matched_sents, style='ent', manual=True)



    Serving on port 5000...
    Using the 'ent' visualizer


    Shutting down server on port 5000.
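
When a running server is not convenient, displaCy can also return the markup as a string through `displacy.render`. The following is a sketch, assuming spaCy is installed (no language model is needed when `manual=True`). The `matched_sents` entry is copied from the output shown earlier rather than recomputed.

```python
from spacy import displacy

# Same structure that collect_sents builds: text plus sentence-relative offsets
matched_sents = [
    {"text": "Facebook is really horrible and FAcebook is very bad",
     "ents": [{"start": 0, "end": 27, "label": "MATCH"}]},
]

# page=True wraps the markup in a full HTML document;
# jupyter=False forces a string return value even inside a notebook
html = displacy.render(matched_sents, style="ent", manual=True,
                       page=True, jupyter=False)

print(len(html), "characters of HTML")
```

The returned string can be written to an `.html` file and opened in a browser.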



# Possible activity

Write rules for identifying novel metadata with the `Matcher` (or perhaps the `PhraseMatcher`) class.
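
As a possible starting point for this activity, here is a minimal `PhraseMatcher` sketch. It assumes spaCy v3 (where `add` takes a list of pattern docs) and uses a blank Spanish pipeline, so no model download is required. The label `PLACE`, the small list of names, and the test sentence are hypothetical, chosen only to echo the poem used earlier.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank pipeline only tokenizes, so no trained model is needed
nlp = spacy.blank("es")

# attr="LOWER" compares lowercased token text, making matches case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Hypothetical gazetteer; in practice this would come from a curated list
places = ["Laguna Negra", "Duero"]
matcher.add("PLACE", [nlp.make_doc(name) for name in places])  # spaCy v3 signature

doc = nlp("Llegaron los asesinos hasta la laguna negra, Duero arriba.")

# Each match is (match_id, start_token, end_token); look up the label string
found = [(nlp.vocab.strings[match_id], doc[start:end].text)
         for match_id, start, end in matcher(doc)]
print(found)
```

A `PhraseMatcher` suits fixed lists of names or titles; patterns that depend on part-of-speech tags or lemmas, like the `FacebookIs` pattern, need the token-based `Matcher` and a trained pipeline.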