# Introduction

The goal of this notebook is to show how to write and use custom processors and components using the new methods provided by spaCy 2.


# Pipelines

spaCy is built around pipelines: ordered sequences of processors that are applied to a document one after another. For example, the basic English model pipeline is:

In [None]:
import spacy

nlp = spacy.load('en')

print(nlp.pipe_names) # Default processing components for en model

# Adding a new processor to the pipeline

Custom processors can be added to existing pipelines. A processor can be a function or a class with a `__call__` method that receives a spaCy doc and returns a potentially modified doc:

In [None]:
def custom_processor(doc):
    # Do something with doc here: add annotations, merge spans, ...
    print('I am a silly processor')
    return doc

nlp.add_pipe(custom_processor, name='silly_processor', first=True)

print(nlp.pipe_names)
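
To make the idea of "a callable that receives a doc and returns a doc" concrete, here is a toy model of a pipeline written in plain Python. `ToyPipeline`, `lowercase`, and `drop_short` are hypothetical names invented for this sketch, and a "doc" is just a list of strings; this is not how spaCy implements its `Language` class, only an illustration of the ordering behaviour and of what `first=True` does.

```python
class ToyPipeline:
    """Hypothetical stand-in for a pipeline: an ordered list of (name, callable)."""

    def __init__(self):
        self.pipeline = []

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def add_pipe(self, component, name=None, first=False):
        # Fall back to the function name when no explicit name is given
        entry = (name or component.__name__, component)
        if first:
            self.pipeline.insert(0, entry)
        else:
            self.pipeline.append(entry)

    def __call__(self, doc):
        # Each component receives the output of the previous one
        for _, component in self.pipeline:
            doc = component(doc)
        return doc


def lowercase(doc):
    return [token.lower() for token in doc]


def drop_short(doc):
    return [token for token in doc if len(token) > 2]


toy = ToyPipeline()
toy.add_pipe(drop_short)
toy.add_pipe(lowercase, first=True)  # inserted before drop_short

result = toy(['This', 'is', 'A', 'Pipeline'])
print(toy.pipe_names)
print(result)
```

Because every component returns the (possibly modified) doc, the next component always has something to work on. A component that forgets to return the doc breaks the chain.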

## Exercise 1
Write a component that prints the number of tokens in a document

In [None]:
def print_len(doc):
    # Print the number of tokens, then return the doc so the pipeline can continue
    print(len(doc))
    return doc

nlp.add_pipe(print_len, name='print_len')
nlp

# Adding stateful components to the pipeline

Classes can be used to initialize a component with state (e.g., a vocabulary, an endpoint for accessing a web API, etc.):


In [None]:
class CustomComponent(object):
    name = 'still_silly'

    def __init__(self, config):
        # We can initialize this with settings
        self.config = config

    def __call__(self, doc):
        # Do things
        return doc

custom_component = CustomComponent({}) 
nlp.add_pipe(custom_component) 
print(nlp.pipe_names)

## Exercise 2

Write a custom component that can be initialized with a NER type (e.g., ORG) and that prints the number of entities of that type in the doc.
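
One possible shape for such a component is sketched below. The class name `EntityTypeCounter` and the `last_count` attribute are assumptions made for this sketch. The component only relies on `doc.ents` and `ent.label_`, so it is exercised here with stand-in objects built from `SimpleNamespace` rather than a real spaCy doc.

```python
from types import SimpleNamespace


class EntityTypeCounter(object):
    name = 'entity_type_counter'

    def __init__(self, label):
        # State: the entity label this instance counts, e.g. 'ORG'
        self.label = label
        self.last_count = None

    def __call__(self, doc):
        count = sum(1 for ent in doc.ents if ent.label_ == self.label)
        print('{}: {}'.format(self.label, count))
        self.last_count = count  # kept only so the result can be inspected
        return doc


# Stand-in doc (hypothetical): mimics only the attributes the component reads
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(label_='ORG'),
    SimpleNamespace(label_='PERSON'),
    SimpleNamespace(label_='ORG'),
])

org_counter = EntityTypeCounter('ORG')
returned = org_counter(fake_doc)
```

With a loaded model, the component has to run after the `ner` component so that `doc.ents` is already filled, for example via `nlp.add_pipe(EntityTypeCounter('ORG'), after='ner')`.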

# Using extensions to write new attributes and annotations to tokens, spans and documents
spaCy 2 allows the developer to create custom properties and attach them to Tokens, Spans, and Docs. Typically, these custom attributes will be created dynamically by custom processors, as we will see at the end with the Entity Linking example. First, let's see a simple example:



In [None]:
from spacy.tokens import Doc, Span, Token

experiment_keywords = ['experiment', 'results', 'validation', 'experimental']

is_experiment_keyword = lambda token: token.lower_ in experiment_keywords
is_experiment_part = lambda text: any(token.lower_ in experiment_keywords for token in text)

Token.set_extension('is_experiment_keyword', getter=is_experiment_keyword)
Doc.set_extension('has_experiment_part', getter=is_experiment_part)
Span.set_extension('is_experiment_part', getter=is_experiment_part) 

In [None]:
doc = nlp(u"This section presents the experimental results.") 
print(doc._.has_experiment_part)
print(doc[4:5]._.is_experiment_part)
print(doc[1:2]._.is_experiment_part)
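
An extension registered with a `getter` is computed when the attribute is accessed rather than stored on the object. The toy model below illustrates that behaviour in plain Python. `ToyUnderscore` and `ToyDoc` are hypothetical stand-ins written for this sketch and do not reproduce spaCy's actual implementation.

```python
class ToyUnderscore:
    """Hypothetical stand-in for the `._` namespace."""
    getters = {}  # attribute name -> getter function

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. registered extensions
        try:
            getter = ToyUnderscore.getters[name]
        except KeyError:
            raise AttributeError(name)
        return getter(self._obj)


class ToyDoc:
    def __init__(self, words):
        self.words = words

    @property
    def _(self):
        return ToyUnderscore(self)


keywords = ['experiment', 'results', 'validation', 'experimental']
ToyUnderscore.getters['has_experiment_part'] = (
    lambda doc: any(word.lower() in keywords for word in doc.words)
)

toy_doc = ToyDoc(['This', 'section', 'presents', 'the', 'method'])
before = toy_doc._.has_experiment_part

toy_doc.words.append('results')
after = toy_doc._.has_experiment_part  # getter runs again on the changed doc

print(before, after)
```

The second access reflects the modified word list because the getter is evaluated on every access.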

# Putting it all together: Entity Linking using Agdistis endpoints

To demonstrate the capabilities of custom components in spaCy, we have created a simple component which uses the nice Agdistis entity linking services described at https://github.com/dice-group/AGDISTIS.

The custom component is available at: `lib/linkers.py`


In [None]:
from lib import AgdistisEntityLinker

linker = AgdistisEntityLinker()

In [None]:
nlp.add_pipe(linker)
print(nlp.pipe_names)

In [None]:
text = "Google LLC is an American multinational technology company that specializes in Internet-related services and products. These include online advertising technologies, search, cloud computing, software, and hardware. Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University, in California. Together, they own about 14 percent of its shares, and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its new headquarters in Mountain View, California, nicknamed the Googleplex. "
doc = nlp(text)
for ent in doc.ents:
    if ent._.has_dbpedia_uri:
        print(ent._.dbpedia_uri)

## Final exercises

Agdistis provides endpoints for different languages. Test the custom component with different languages. Please remember to download and install the corresponding models using `python -m spacy download lang`, where `lang` is the language code (e.g., `de`).


In [None]:
# de

In [None]:
# es

In [None]:
# fr

## Extra: 
Use similarity methods on entity surface forms (text) to list the most dissimilar entities. You might need to download a spa

In [44]:
counter = 0
cumulative_sim = {}
for ent in doc.ents:
    for ent2 in doc.ents:
        if ent._.has_dbpedia_uri and ent2._.has_dbpedia_uri:
            counter += 1
            sim = ent.similarity(ent2)
            # Accumulate the similarity of each linked entity to all linked entities
            uri = ent._.dbpedia_uri
            cumulative_sim[uri] = cumulative_sim.get(uri, 0.0) + sim

# Sort in ascending order so the least similar entities come first
less_similar = sorted(cumulative_sim.items(), key=lambda x: x[1])
print(less_similar)

[('http://dbpedia.org/resource/American_McGee', 3.0031245001739504), ('http://dbpedia.org/resource/Ph.D._(band)', 3.5573368906102765), ('http://dbpedia.org/resource/Sergey_Brin', 5.5983635527915059), ('http://dbpedia.org/resource/Googleplex', 6.1589201220928214), ('http://dbpedia.org/resource/Mountain_View,_California', 6.3304150115303051), ('http://dbpedia.org/resource/Larry_Page', 6.4600735751997433), ('http://dbpedia.org/resource/Stanford_University', 6.5145620198743925), ('http://dbpedia.org/resource/September_1,_1939', 7.6217991003315637), ('http://dbpedia.org/resource/California', 12.373009386132999)]
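
The accumulate-and-sort pattern above can be tried without a model or a linker. The sketch below assumes a cosine-style similarity over vectors and uses made-up three-dimensional vectors; the names in `toy_vectors` and their values are invented for illustration and are not real word vectors.

```python
import math
from collections import defaultdict


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: the first two point in similar directions, the third does not
toy_vectors = {
    'Google': (1.0, 0.9, 0.1),
    'Stanford': (0.9, 1.0, 0.2),
    'September': (0.1, 0.2, 1.0),
}

# Sum each item's similarity to every item (itself included, as in the cell above)
cumulative = defaultdict(float)
for name_a, vec_a in toy_vectors.items():
    for name_b, vec_b in toy_vectors.items():
        cumulative[name_a] += cosine(vec_a, vec_b)

# Ascending order: the item least similar to the rest comes first
ranking = sorted(cumulative.items(), key=lambda item: item[1])
print([name for name, _ in ranking])
```

Items with a low cumulative score are outliers relative to the rest of the set, which is why wrongly linked entities tend to appear at the start of the list.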
