In [None]:
#!pip install -U pip setuptools wheel
#!pip install -U spacy spacy-lookups-data
#!pip install spacy-llm spacy-transformers
#!python -m spacy download en_core_web_trf  # optimized for accuracy (transformer-based)
#!python -m spacy download en_core_web_sm   # optimized for efficiency
#!python -m spacy download en_core_web_lg   # includes word vectors

In [1]:
import spacy


In [None]:
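# Option 1: load the installed pipeline by its package name
nlp = spacy.load("en_core_web_trf")

# Option 2 (equivalent): import the pipeline package and call its load()
import en_core_web_trf
nlp = en_core_web_trf.load()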

The central data structures in spaCy are the Language class, the Vocab and the Doc object. The Language class is used to process a text and turn it into a Doc object. It’s typically stored as a variable called nlp. The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory, and ensures there’s a single source of truth.

Text annotations are also designed to allow a single source of truth: the Doc object owns the data, and Span and Token are views that point into it. The Doc object is constructed by the Tokenizer, and then modified in place by the components of the pipeline. The Language object coordinates these components. It takes raw text and sends it through the pipeline, returning an annotated document. It also orchestrates training and serialization.
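The relationship between Doc, Span, Token and the Vocab can be checked directly. The following is a minimal sketch that assumes only a blank English pipeline (tokenizer only, so no trained model is needed); the variable names are illustrative.

```python
import spacy

# A blank pipeline contains only the tokenizer, so nothing has to be downloaded.
nlp = spacy.blank("en")
doc = nlp("This is a sentence.")

span = doc[1:3]   # Span: a slice of the Doc
token = doc[3]    # Token: a single position in the Doc

# Both views point back to the same Doc object instead of copying its data.
print(span.text, "|", token.text)
print(span.doc is doc, token.doc is doc)

# Strings are stored once in the Vocab's StringStore and referenced by hash.
string_id = nlp.vocab.strings["sentence"]
print(token.orth == string_id, nlp.vocab.strings[string_id])
```

Because the token stores only the hash, the text itself lives in one place, which is the "single source of truth" described above.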

In [2]:
doc = nlp("This is a sentence.")
print([(w.text, w.pos_) for w in doc])

[('This', 'PRON'), ('is', 'AUX'), ('a', 'DET'), ('sentence', 'NOUN'), ('.', 'PUNCT')]


In [3]:
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps, which together are referred to as the processing pipeline. The pipeline used by the trained pipelines typically includes a tagger, a lemmatizer, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only really be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc.
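To make the two roles concrete, here is a small sketch that runs the steps by hand. It assumes a blank English pipeline with only the rule-based `sentencizer` added, so it needs no trained weights; calling `nlp(text)` is normally the better way to do this.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, no trained weights

# The tokenizer does not appear among the pipeline components.
print(nlp.pipe_names)

text = "This is a sentence. This is another one."
doc = nlp.make_doc(text)            # tokenizer: str -> Doc
for name, component in nlp.pipeline:
    doc = component(doc)            # every other component: Doc -> Doc

print([sent.text for sent in doc.sents])
```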

The capabilities of a processing pipeline always depend on the components, their models and how they were trained. For example, a pipeline for named entity recognition needs to include a trained named entity recognizer component with a statistical model and weights that enable it to make predictions of entity labels. This is why each pipeline specifies its components and their settings in the config.
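The config of a loaded pipeline can be inspected through `nlp.config`. The sketch below uses a blank pipeline with a `sentencizer` as a stand-in, since a trained pipeline such as `en_core_web_trf` may not be installed; the same keys should be present for trained pipelines, with more components listed.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

config = nlp.config

# Component names, in the order they run.
print(config["nlp"]["pipeline"])

# Each component has its own block with the factory that creates it and its settings.
print(config["components"]["sentencizer"])
```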

Order of components: it matters if you add the EntityRuler before or after the statistical entity recognizer: if it’s added before, the entity recognizer will take the existing entities into account when making predictions. The EntityLinker, which resolves named entities to knowledge base IDs, should be preceded by a pipeline component that recognizes entities such as the EntityRecognizer.
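Component placement is controlled with the `before`, `after`, `first` and `last` arguments of `nlp.add_pipe`. As a sketch, the example below adds an untrained `ner` component to a blank pipeline purely to show the ordering; the component would have to be trained or initialized before the pipeline could process text.

```python
import spacy

nlp = spacy.blank("en")

# Untrained entity recognizer, added only to illustrate ordering.
nlp.add_pipe("ner")

# Place the rule-based component in front of the statistical one,
# so the recognizer can take the ruler's entities into account.
nlp.add_pipe("entity_ruler", before="ner")

print(nlp.pipe_names)
```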

# NER classes

## EntityRecognizer

A transition-based named entity recognition component. The entity recognizer identifies non-overlapping labelled spans of tokens. The transition-based algorithm used encodes certain assumptions that are effective for “traditional” named entity recognition tasks, but may not be a good fit for every span identification problem. Specifically, the loss function optimizes for whole entity accuracy, so if your inter-annotator agreement on boundary tokens is low, the component will likely perform poorly on your problem. The transition-based algorithm also assumes that the most decisive information about your entities will be close to their initial tokens. If your entities are long and characterized by tokens in their middle, the component will likely not be a good fit for your task.

## EntityRuler

The entity ruler lets you add spans to the Doc.ents using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system. For usage examples, see the docs on [rule-based entity recognition](https://spacy.io/usage/rule-based-matching#entityruler).
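A minimal rule-based sketch, closely following the pattern format from the linked documentation: one exact phrase pattern and one token-based pattern, added to a blank English pipeline. The labels and example sentence are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    # Exact phrase match
    {"label": "ORG", "pattern": "Apple"},
    # Token-based rule: case-insensitive match on two consecutive tokens
    {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]},
]
ruler.add_patterns(patterns)

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])
```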

## EntityLinker

An EntityLinker component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities into the “real world”. It requires a KnowledgeBase, as well as a function to generate plausible candidates from that KnowledgeBase given a certain textual mention, and a machine learning model to pick the right candidate, given the local context of the mention. EntityLinker defaults to using the InMemoryLookupKB implementation.
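The knowledge base side of this can be sketched without a trained linker. The example assumes a spaCy 3.x release that ships `InMemoryLookupKB` (3.5 or later); the entity IDs, frequencies, vectors and prior probabilities are made-up illustration values, not data from a real knowledge base.

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")

# Hypothetical toy knowledge base with 3-dimensional entity vectors.
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=3)
kb.add_entity(entity="Q312", freq=100, entity_vector=[1.0, 0.0, 0.0])  # illustrative: the company
kb.add_entity(entity="Q89", freq=50, entity_vector=[0.0, 1.0, 0.0])    # illustrative: the fruit

# One textual mention can map to several candidate entities with prior probabilities.
kb.add_alias(alias="Apple", entities=["Q312", "Q89"], probabilities=[0.7, 0.3])

# Candidate generation: which entities could the mention "Apple" refer to?
candidates = kb.get_alias_candidates("Apple")
for candidate in candidates:
    print(candidate.entity_, round(float(candidate.prior_prob), 2))
```

The trained EntityLinker model would then choose among these candidates using the context around the mention.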

In [None]:
# In the transformer models, ner listens to the transformer component,
# so you can disable all components related to tagging, parsing, and lemmatization.

nlp = spacy.load("en_core_web_trf", disable=["tagger", "parser", "attribute_ruler", "lemmatizer"])
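
Components can also be switched off temporarily after loading, using `nlp.select_pipes`. This sketch uses a blank pipeline with two rule-based components as stand-ins, because the transformer pipeline above may not be installed; the same call should work with the component names of a trained pipeline.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("entity_ruler")

# Inside the block only the remaining components are active.
with nlp.select_pipes(disable=["sentencizer"]):
    active_inside = list(nlp.pipe_names)

# The disabled components are restored when the block exits.
active_after = list(nlp.pipe_names)

print(active_inside, active_after)
```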

# spaCy with LLMs (rather than BERT-based models)

The spacy-llm package integrates Large Language Models (LLMs) into spaCy pipelines, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks, no training data required.

It supports open-source Hugging Face models and integrates with LangChain.

Tasks available out of the box: Named Entity Recognition; Text classification; Lemmatization; Relationship extraction; Sentiment analysis; Span categorization; Summarization. You can also implement your own functions via spaCy's registry for custom prompting, parsing and model integrations.

You can quickly initialize a pipeline with components powered by LLM prompts, and freely mix in components powered by other approaches. As your project progresses, you can look at replacing some or all of the LLM-powered components as you require.

Of course, there can be components in your system for which the power of an LLM is fully justified. If you want a system that can synthesize information from multiple documents in subtle ways and generate a nuanced summary for you, bigger is better. However, even if your production system needs an LLM for some of the task, that doesn't mean you need an LLM for all of it. Maybe you want to use a cheap text classification model to help you find the texts to summarize, or maybe you want to add a rule-based system to sanity check the output of the summary. These before-and-after tasks are much easier with a mature and well-thought-out library, which is exactly what spaCy provides.

The task and the model have to be supplied to the llm pipeline component using the config system.
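As a sketch of what such a config might contain, the snippet below parses a config string with thinc's `Config` class and inspects it. The registered names `spacy.NER.v2` and `spacy.Dolly.v1` and the model name `dolly-v2-3b` are assumptions based on the spacy-llm documentation at the time of writing; check them against the version you have installed. Parsing alone does not require spacy-llm and does not download a model.

```python
from thinc.api import Config

# Assumed config layout: one "llm" component with a task block and a model block.
CONFIG_TEXT = """
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["PERSON", "LOCATION"]

[components.llm.model]
@llm_models = "spacy.Dolly.v1"
name = "dolly-v2-3b"
"""

config = Config().from_str(CONFIG_TEXT)
print(config["components"]["llm"]["task"])
print(config["components"]["llm"]["model"])

# To hand this to spacy-llm, it could be saved first:
# config.to_disk("config.cfg")
```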

In [4]:
# Example creating the component directly
nlp = spacy.blank("en")
llm_ner = nlp.add_pipe("llm_ner")
llm_ner.add_label("PERSON")
llm_ner.add_label("LOCATION")
nlp.initialize()
doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes")
print([(ent.text, ent.label_) for ent in doc.ents])



KeyError: 'data'

In [3]:
# Example using a Hugging Face model
from spacy_llm.util import assemble

nlp = assemble("config.cfg")
doc = nlp("Jack and Jill rode up the hill in Les Deux Alpes")
print([(ent.text, ent.label_) for ent in doc.ents])



A new version of the following files was downloaded from https://huggingface.co/databricks/dolly-v2-3b:
- instruct_pipeline.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


  input_ids = input_ids.repeat_interleave(expand_size, dim=0)


RuntimeError: MPS does not support cumsum op with int64 input

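Note: none of the tutorial examples ran successfully here. The first failed with `KeyError: 'data'` and the second with a `RuntimeError` on the MPS backend.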

__Choice of model for spacy-llm__

All built-in models are registered in llm_models. If no model is specified, the repo currently connects to the OpenAI API by default using REST, and accesses the "gpt-3.5-turbo" model.

Currently three different approaches to use LLMs are supported:

1. spacy-llm's native REST interface. This is the default for all hosted models (e.g. OpenAI, Cohere, Anthropic, …).

2. A Hugging Face integration that allows you to run a limited set of Hugging Face models locally.

3. A LangChain integration that allows you to run any model supported by LangChain (hosted or local).

Approaches 1 and 2 are the defaults for hosted models and local models, respectively. Alternatively, you can use LangChain to access hosted or local models by specifying one of the models registered with the `langchain.` prefix.

The locally runnable models include `spacy.Llama2.v1` (Llama 2 models through Hugging Face) and `spacy.OpenLLaMA.v1` (OpenLLaMA models through Hugging Face). Note that the chat variants of Llama 2 are currently not supported. This is because they need a particular prompting setup and don’t add any discernible benefit over the completion variants in the spacy-llm use case (i.e. no interactive chat).