I've been working with [`spacy`](https://spacy.io/) more and more over the years, and I thought it'd be a good idea to write about the configuration system. There are mentions of it throughout the [docs](https://spacy.io/usage/training#config) and in some of the `spacy` 3.0 [videos](https://youtu.be/BWhh3r6W-qE), but I have yet to find a super detailed breakdown of what's going on (except maybe this [blog](https://explosion.ai/blog/spacy-v3-project-config-systems#spacy-config-system)). Hopefully this post will shed some light.

Let's start with a brief demo of `spacy`.

> Install `spacy` and the `en_core_web_sm` model if you want to follow along:
> ```shell
> $ pip install spacy
> $ python -m spacy download en_core_web_sm
> ```

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hi, my name is Ian and this is my blog.")
print(doc)

Hi, my name is Ian and this is my blog.


Nothing fancy on the surface, but this [`doc`](https://spacy.io/api/doc) object that we've created is the product of sending our string of characters through a [pipeline of models](https://spacy.io/usage/processing-pipelines), or as `spacy` likes to call them, [components](https://spacy.io/usage/processing-pipelines#pipelines). We can view the pipeline components via the [`nlp.pipeline` property](https://spacy.io/api/language#attributes).

In [2]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x21c3db2bbf0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x21c3dcd4350>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x21c3da31700>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x21c3db76e50>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x21c3d9b5490>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x21c3da313f0>)]

We can get more information about each component with [`nlp.analyze_pipes`](https://spacy.io/api/language#analyze_pipes), such as what it assigns, what it requires, its scoring metrics, whether it retokenizes, and the order in which the components perform their annotations.

In [3]:
# The trailing semicolon (;) keeps the notebook from echoing the returned value after the table.
nlp.analyze_pipes(pretty=True);


#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False      
                                                                                     
1   tagger            token.tag                        tag_acc            False      
                                                                                     
2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                

Notice the first component, `tok2vec`. This [component](https://spacy.io/api/tok2vec) is responsible for mapping tokens to vectors, i.e., creating an [embedding layer](https://spacy.io/usage/embeddings-transformers), and making them available for later components to use via the `doc.tensor` attribute.
> Note, this is not the same as a [`tokenizer`](https://spacy.io/api/tokenizer).

In the `en_core_web_sm` pipeline, the [`tagger`](https://spacy.io/api/tagger) and [`parser`](https://spacy.io/api/dependencyparser) components both use the `tok2vec` component's output. We can confirm this by checking `tok2vec.listening_components`.

In [4]:
tok2vec = nlp.get_pipe("tok2vec")
tok2vec.listening_components

['tagger', 'parser']

On the flip side, we can see which components *use* a `tok2vec` model by checking their configuration via `nlp.get_pipe_config`.

In [5]:
[
    name
    for name in nlp.pipe_names
    if (model := nlp.get_pipe_config(name).get("model")) is not None
    and model.get("tok2vec") is not None
]

['tagger', 'parser', 'ner']

The `tagger` and `parser` are both present as expected, but so is the `ner` component. The `ner` component has its own `tok2vec` "layer", separate from the `tok2vec` at the beginning of the `nlp.pipeline`.

In [6]:
ner_tok2vec = nlp.get_pipe_config("ner")["model"]["tok2vec"]
ner_tok2vec

{'@architectures': 'spacy.Tok2Vec.v2',
 'embed': {'@architectures': 'spacy.MultiHashEmbed.v2',
  'width': 96,
  'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE'],
  'rows': [5000, 1000, 2500, 2500],
  'include_static_vectors': False},
 'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2',
  'width': 96,
  'depth': 4,
  'window_size': 1,
  'maxout_pieces': 3}}

The `tagger` and `parser` components, on the other hand, both "listen to" (i.e., share) the tensors produced by the `tok2vec` component in the `nlp.pipeline`.

In [7]:
tagger_tok2vec = nlp.get_pipe_config("tagger")["model"]["tok2vec"]
tagger_tok2vec

{'@architectures': 'spacy.Tok2VecListener.v1',
 'width': '${components.tok2vec.model.encode:width}',
 'upstream': 'tok2vec'}

In [8]:
parser_tok2vec = nlp.get_pipe_config("parser")["model"]["tok2vec"]
parser_tok2vec

{'@architectures': 'spacy.Tok2VecListener.v1',
 'width': '${components.tok2vec.model.encode:width}',
 'upstream': 'tok2vec'}
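The outputs above suggest a simple way to tell the two setups apart: a listener's `@architectures` value starts with `spacy.Tok2VecListener`, while an independent layer uses `spacy.Tok2Vec`. Here is a minimal sketch of that check. It runs on trimmed, hand-copied versions of the dicts printed above, so it doesn't need a loaded pipeline, and `tok2vec_kind` is a hypothetical helper of mine, not part of `spacy`.

```python
# Trimmed copies of the `tok2vec` sub-configs printed above.
tok2vec_configs = {
    "tagger": {
        "@architectures": "spacy.Tok2VecListener.v1",
        "width": "${components.tok2vec.model.encode:width}",
        "upstream": "tok2vec",
    },
    "parser": {
        "@architectures": "spacy.Tok2VecListener.v1",
        "width": "${components.tok2vec.model.encode:width}",
        "upstream": "tok2vec",
    },
    "ner": {"@architectures": "spacy.Tok2Vec.v2"},
}


def tok2vec_kind(tok2vec_config):
    """Hypothetical helper: classify a tok2vec sub-config by its architecture name."""
    architecture = tok2vec_config.get("@architectures", "")
    if architecture.startswith("spacy.Tok2VecListener"):
        return "listener"
    return "independent"


kinds = {name: tok2vec_kind(cfg) for name, cfg in tok2vec_configs.items()}
print(kinds)
```

With a loaded pipeline, the same helper could presumably be fed `nlp.get_pipe_config(name)["model"]["tok2vec"]` for each component that has one.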

Listening to (sharing) an upstream component has pros and cons, including speed and flexibility (docs; stack overflow answer). Sometimes sharing a component can boost later components' metrics, and other times it's easier to keep a component independent.

For example, suppose we wanted to fine-tune the weights in a `senter` component.

In [9]:
nlp.enable_pipe("senter")
senter_config = nlp.get_pipe_config("senter")

In [45]:
nlp.get_pipe_config("senter")

{'factory': 'senter',
 'model': {'@architectures': 'spacy.Tagger.v2',
  'nO': None,
  'normalize': False,
  'tok2vec': {'@architectures': 'spacy.Tok2Vec.v2',
   'embed': {'@architectures': 'spacy.MultiHashEmbed.v2',
    'width': 16,
    'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE', 'SPACY'],
    'rows': [1000, 500, 500, 500, 50],
    'include_static_vectors': False},
   'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2',
    'width': 16,
    'depth': 2,
    'window_size': 1,
    'maxout_pieces': 2}}},
 'overwrite': False,
 'scorer': {'@scorers': 'spacy.senter_scorer.v1'}}

In [46]:
nlp.get_pipe_config("tok2vec")

{'factory': 'tok2vec',
 'model': {'@architectures': 'spacy.Tok2Vec.v2',
  'embed': {'@architectures': 'spacy.MultiHashEmbed.v2',
   'width': '${components.tok2vec.model.encode:width}',
   'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE', 'SPACY', 'IS_SPACE'],
   'rows': [5000, 1000, 2500, 2500, 50, 50],
   'include_static_vectors': False},
  'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2',
   'width': 96,
   'depth': 4,
   'window_size': 1,
   'maxout_pieces': 3}}}
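Note that the `embed` width in the `tok2vec` config is not a number but a variable reference, `${components.tok2vec.model.encode:width}`, which points at the `encode` block's `width` of 96. To make that concrete, here is a rough sketch of how such a reference could be looked up in a plain nested dict. This is not how `spacy` resolves variables internally. `resolve_reference` is a hypothetical helper, and reading the reference as "dotted section path, colon, key" is my assumption based on the printed output.

```python
import re

# Trimmed copy of the relevant parts of the config shown above.
config = {
    "components": {
        "tok2vec": {
            "model": {
                "embed": {"width": "${components.tok2vec.model.encode:width}"},
                "encode": {"width": 96},
            }
        }
    }
}

# Assumed shape of a reference: ${dotted.section.path:key}
REFERENCE = re.compile(r"^\$\{(?P<section>[\w.]+):(?P<key>\w+)\}$")


def resolve_reference(config, value):
    """Hypothetical helper: follow a '${section:key}' reference through a nested dict."""
    match = REFERENCE.match(value) if isinstance(value, str) else None
    if match is None:
        return value  # not a reference, so return it unchanged
    node = config
    for part in match["section"].split("."):
        node = node[part]  # walk down one level per dotted name
    return node[match["key"]]


embed = config["components"]["tok2vec"]["model"]["embed"]
embed_width = resolve_reference(config, embed["width"])
print(embed_width)  # 96, the same value as the encode width
```

The `tagger` and `parser` listeners shown earlier carry the same reference in their own `width` entries, so all of these widths follow the single `encode` value.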