Once your data is in the supported format and you have a spaCy model with vectors, you can 
use the `spacy_ann` CLI to compute the nearest-neighbors index for your aliases and train an 
encoder that disambiguates entity spans to their canonical ID.
 
Usage:
```console
spacy_ann create_index en_core_web_md examples/tutorial/data examples/tutorial/models
```

In [1]:
import spacy
from spacy_ann import AnnLinker

# Load the spaCy model from the output_dir you passed to the create_index command
model_dir = "models/ann_linker/"
nlp = spacy.load(model_dir)

# The NER component of en_core_web_md doesn't recognize the aliases as entities,
# so we add a spaCy EntityRuler that matches every alias string in the knowledge base.

# spaCy v2 equivalent, kept for reference:
# ruler = nlp.create_pipe('entity_ruler')
# patterns = [{"label": "SKILL", "pattern": alias} for alias in nlp.get_pipe('ann_linker').kb.get_alias_strings()]
# ruler.add_patterns(patterns)
# nlp.add_pipe(ruler, before="ann_linker")

# spaCy v3: add_pipe takes the component name and returns the created component.
ruler = nlp.add_pipe('entity_ruler', before="ann_linker")
patterns = [{"label": "SKILL", "pattern": alias} for alias in nlp.get_pipe('ann_linker').kb.get_alias_strings()]
ruler.add_patterns(patterns)

In [12]:
doc = nlp("NLP is a highly researched subset of Machine learning.")
[(e.text, e.label_, e.kb_id_) for e in doc.ents]

[('NLP', 'ORG', 'a3'), ('Machine', 'ORG', '')]

In [3]:
doc.vector_norm

3.171593248390084

In [4]:
import srsly
import numpy as np
entities = list(srsly.read_jsonl('data/entities.jsonl'))
natl_doc = nlp.make_doc(entities[2]['description'])
neur_doc = nlp.make_doc(entities[3]['description']) 

In [5]:
entity_encodings = np.asarray([natl_doc.vector, neur_doc.vector])
entity_norm = np.linalg.norm(entity_encodings, axis=1)
entity_norm

array([3.2457936, 2.6232092], dtype=float32)

In [6]:
sims = np.dot(entity_encodings, doc.vector.T) / (doc.vector_norm * entity_norm)
sims.argmax()

0
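
The cell above scores each entity description against the document by cosine similarity and picks the highest-scoring one with `argmax`. The sketch below repeats that calculation in plain Python so it runs without the model. The vectors are toy stand-ins for `doc.vector` and the description vectors, and `cosine_similarity` is a helper name introduced here for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins (assumptions) for doc.vector and the entity description vectors.
doc_vector = [1.0, 2.0, 0.0]
entity_vectors = [
    [2.0, 4.0, 0.0],  # points in the same direction as doc_vector
    [0.0, 0.0, 3.0],  # orthogonal to doc_vector
]

# One similarity per candidate entity, then the index of the best match.
sims = [cosine_similarity(v, doc_vector) for v in entity_vectors]
best = max(range(len(sims)), key=sims.__getitem__)
print(sims, best)
```

Because cosine similarity divides by both norms, only the direction of the vectors matters: the first toy entity is twice as long as the document vector but still scores as a perfect match.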

In [7]:
patterns = [
    {"label": "SKILL", "pattern": alias}
    for alias in nlp.get_pipe('ann_linker').kb.get_alias_strings()
]

In [8]:
print([(e.text, e.label_, e.kb_id_, e._.alias_candidates) for e in doc.ents])

[('NLP', 'ORG', 'a3', [AliasCandidate(alias='NLP', similarity=1.0)]), ('Machine', 'ORG', '', [])]
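
Each entity span carries a list of alias candidates with a similarity score. To give a feel for what a nearest-neighbor lookup over alias strings does, here is a toy brute-force version based on character trigram overlap (Jaccard similarity). This is an illustrative assumption, not the index that `spacy_ann` builds; `char_ngrams`, `jaccard`, `alias_candidates`, and the `threshold` value are all hypothetical names and settings.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def alias_candidates(mention, aliases, threshold=0.3):
    """Return (alias, similarity) pairs at or above threshold, best first."""
    mention_grams = char_ngrams(mention)
    scored = [(alias, jaccard(mention_grams, char_ngrams(alias))) for alias in aliases]
    kept = [(alias, score) for alias, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy alias list (assumption); a real knowledge base would supply these strings.
aliases = ["NLP", "Natural language processing", "Machine learning"]

exact = alias_candidates("NLP", aliases)   # exact match scores 1.0
fuzzy = alias_candidates("nlpe", aliases)  # a misspelling still finds "NLP"
print(exact)
print(fuzzy)
```

A brute-force scan like this compares the mention against every alias, which is fine for a handful of strings; an approximate nearest-neighbors index exists to avoid that full scan on large alias sets.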


In [9]:
nlp("More text about nlpe")

More text about nlpe

In [10]:
ent = list(doc.ents)[0]

In [13]:
doc.ents

(NLP, Machine)

In [14]:
text = """Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."""

In [15]:
doc = nlp(text)
[(e.text, e.label_, e.kb_id_) for e in doc.ents]

[('Natural language processing', 'SKILL', 'a3'), ('NLP', 'ORG', 'a3')]