# Spacy Entity Linker

* [Spacy Entity Linker](https://github.com/egerber/spaCy-entity-linker#spacy-entity-linker)
                    
> Spacy Entity Linker is a pipeline for spaCy that performs Linked Entity Extraction with Wikidata on a given Document. The Entity Linking System operates by matching potential candidates from each sentence (subject, object, prepositional phrase, compounds, etc.) to aliases from Wikidata. 
> 
> The package makes it easy to **find the category** behind each entity (e.g. "banana" is of type "food" and "Microsoft" is of type "company"). It is therefore useful for information extraction and labeling tasks.
> ```
> pip install spacy-entity-linker
> ```

## Usage
```
import spacy

# initialize language model
nlp = spacy.load("en_core_web_md")

# add the entity linker as the last pipeline component
# (the component is declared through entry_points in setup.py)
nlp.add_pipe("entityLinker", last=True)
```

## EntityElement

Each linked entity is an object of type `EntityElement`, which provides the following methods:

* `get_description()` returns the description from Wikidata.
* `get_id()` returns the Wikidata ID.
* `get_label()` returns the Wikidata label.
* `get_span(doc)` returns the span of the spaCy document that contains the linked entity. Pass the current `doc` as an argument to receive an actual `spacy.tokens.Span` object; otherwise you receive a `SpanInfo` that emulates the behaviour of a `Span`.
* `get_url()` returns the URL of the corresponding Wikidata item.
* `pretty_print()` prints information about the entity element.
* `get_sub_entities(limit=10)` returns an `EntityCollection` of all entities that derive from the current `EntityElement` (e.g. fruit -> apple, banana, etc.).
* `get_super_entities(limit=10)` returns an `EntityCollection` of all entities that the current `EntityElement` derives from (e.g. New England Patriots -> Football Team).
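The hierarchy methods listed above can be chained to look more than one level up. The following is a minimal sketch, not part of the package: `collect_super_labels` and the `FakeEntity` stand-in are hypothetical names. It assumes that the collection returned by `get_super_entities(limit=...)` can be iterated and that each item exposes `get_label()`. `FakeEntity` and its toy hierarchy are made up so the snippet runs without the Wikidata knowledge base.

```python
def collect_super_labels(entity, depth=2, limit=10):
    """Return labels of the entities found up to `depth` levels above `entity`.

    Works on any object that offers get_label() and get_super_entities(limit=...).
    Labels come back in breadth-first order, without duplicates.
    """
    labels = []
    frontier = [entity]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for parent in node.get_super_entities(limit=limit):
                label = parent.get_label()
                if label not in labels:
                    labels.append(label)
                    next_frontier.append(parent)
        frontier = next_frontier
    return labels


class FakeEntity:
    """Hypothetical stand-in for EntityElement (assumption, not the real class)."""

    def __init__(self, label, parents=()):
        self._label = label
        self._parents = list(parents)

    def get_label(self):
        return self._label

    def get_super_entities(self, limit=10):
        return self._parents[:limit]


# Toy hierarchy (invented for illustration, not real Wikidata content).
sports_team = FakeEntity("sports team")
football_team = FakeEntity("Football Team", [sports_team])
patriots = FakeEntity("New England Patriots", [football_team])

print(collect_super_labels(patriots, depth=1))  # ['Football Team']
print(collect_super_labels(patriots, depth=2))  # ['Football Team', 'sports team']
```

With a real pipeline, the same function would be called on an item of `doc._.linkedEntities`; note that each level may trigger further knowledge-base lookups, so a small `depth` is advisable.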

In [1]:
import pandas as pd
import spacy
import spacy_entity_linker

## Model

In [2]:
nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("entityLinker", last=True)

nlp.pipe_names

['tok2vec',
 'tagger',
 'parser',
 'attribute_ruler',
 'lemmatizer',
 'ner',
 'entityLinker']

In [3]:
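def get_linked_entity_detail(entity: spacy_entity_linker.EntityElement.EntityElement):
    """Collect the Wikidata details of one linked entity in a dictionary."""
    return {
        # No doc is passed, so get_span() returns a SpanInfo rather than a spacy.tokens.Span
        "entity": entity.get_span().text,
        "id": entity.get_id(),
        "label": entity.get_label(),
        "description": entity.get_description(),
        # Up to three entities that the current entity derives from
        "category": entity.get_super_entities(limit=3),
        "url": entity.get_url(),
    }

def get_linked_entity_as_pd_dataframe(doc: spacy.tokens.Doc) -> pd.DataFrame:
    """Build a DataFrame with one row per linked entity in the document."""
    return pd.DataFrame([
        get_linked_entity_detail(entity)
        for entity in doc._.linkedEntities
    ])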
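
The `category` field above stores whatever `get_super_entities` returns, which is a collection object rather than plain text. If the table is meant to be exported or filtered, it may help to flatten that value to a string first. A small sketch under stated assumptions: `category_labels` is a hypothetical helper, and it assumes the returned collection is iterable and yields objects with `get_label()`. `StubEntity` only stands in for a real `EntityElement` so the snippet runs on its own.

```python
def category_labels(entity, limit=3, sep=", "):
    """Join the labels of an entity's super entities into one string."""
    parents = entity.get_super_entities(limit=limit)
    return sep.join(parent.get_label() for parent in parents)


class StubEntity:
    """Hypothetical stand-in for EntityElement (assumption, not the real class)."""

    def __init__(self, label, parents=()):
        self._label = label
        self._parents = list(parents)

    def get_label(self):
        return self._label

    def get_super_entities(self, limit=10):
        return self._parents[:limit]


# Stub data for illustration only.
narcolepsy = StubEntity(
    "narcolepsy",
    [StubEntity("disease"), StubEntity("rare disease"), StubEntity("sleep disorder")],
)

print(category_labels(narcolepsy))           # disease, rare disease, sleep disorder
print(category_labels(narcolepsy, limit=2))  # disease, rare disease
```

In `get_linked_entity_detail`, the `"category"` value could then be `category_labels(entity)` instead of the raw collection.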

## Data

In [4]:
# Alternative input (unused):
# text = "I watched the Pirates of the Caribbean last silvester. Jonny Depp is a fantastic actor."
text = "Alterations in the hypocretin receptor 2 and preprohypocretin genes produce narcolepsy in some animals."

## Document

In [5]:
doc = nlp(text)

## Entities

In [6]:
get_linked_entity_as_pd_dataframe(doc)

| | entity | id | label | description | category | url |
|---|---|---|---|---|---|---|
| 0 | Alterations | 16827625 | Alterations | 1988 film | (film) | https://www.wikidata.org/wiki/Q16827625 |
| 1 | receptor | 208467 | biochemical receptor | protein molecule receiving signals for a cell | (protein) | https://www.wikidata.org/wiki/Q208467 |
| 2 | genes | 7187 | gene | basic physical and functional unit of heredity | (nucleic acid sequence, biological sequence, b... | https://www.wikidata.org/wiki/Q7187 |
| 3 | narcolepsy | 189561 | narcolepsy | sleep disorder that involves an excessive urge... | (disease, rare disease, sleep disorder) | https://www.wikidata.org/wiki/Q189561 |
| 4 | animals | 729 | animal | kingdom of multicellular eukaryotic organisms | (taxon, multicellular organism) | https://www.wikidata.org/wiki/Q729 |
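
In the output above, the `id` column holds a bare number, while the `url` column ends in the same number prefixed with `Q`. The sketch below captures that relationship. It is inferred from this table rather than from the package source, and `to_qid` and `wikidata_url` are hypothetical helper names.

```python
WIKIDATA_ITEM_URL = "https://www.wikidata.org/wiki/{qid}"


def to_qid(entity_id):
    """Format a numeric ID (int or string, optional 'Q' prefix) as an item ID like 'Q729'."""
    text = str(entity_id).strip()
    if text[:1] in ("Q", "q"):
        text = text[1:]
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a numeric Wikidata ID: {entity_id!r}")
    return f"Q{int(text)}"


def wikidata_url(entity_id):
    """Build the Wikidata item URL for a numeric or Q-prefixed ID."""
    return WIKIDATA_ITEM_URL.format(qid=to_qid(entity_id))


print(to_qid(729))             # Q729
print(wikidata_url(16827625))  # https://www.wikidata.org/wiki/Q16827625
```

This can be handy when only the `id` column has been stored and the link needs to be rebuilt later.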
