I've been working with [`spacy`](https://spacy.io/) more and more over the years, and I thought it'd be a good idea to write about the configuration system. There are mentions of it throughout the [docs](https://spacy.io/usage/training#config) and in some of the `spacy` 3.0 [videos](https://youtu.be/BWhh3r6W-qE), but I have yet to find a super detailed breakdown of what's going on (except maybe this [blog](https://explosion.ai/blog/spacy-v3-project-config-systems#spacy-config-system)). Hopefully this post will shed some light.

Let's start with a brief demo of `spacy`.

> Install `spacy` and the `en_core_web_sm` model if you want to follow along:
> ```shell
> $ pip install spacy
> $ python -m spacy download en_core_web_sm
> ```

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Hi, my name is Ian and this is my blog.")
print(doc)

Hi, my name is Ian and this is my blog.


Nothing fancy on the surface, but this [`doc`](https://spacy.io/api/doc) object that we've created is the product of sending our string of characters through a [pipeline of models](https://spacy.io/usage/processing-pipelines), or as `spacy` likes to call them, [components](https://spacy.io/usage/processing-pipelines#pipelines). We can view the pipeline components via the [`nlp.pipeline` property](https://spacy.io/api/language#attributes).

In [2]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1ca902cbbf0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1ca90474230>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1ca901d1700>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1ca90318990>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1ca90154c10>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1ca901d13f0>)]

We can get more information about each component with [`nlp.analyze_pipes`](https://spacy.io/api/language#analyze_pipes): what it assigns, what it requires, its scoring metrics, whether it retokenizes, and the order in which the components perform their annotations.

In [3]:
nlp.analyze_pipes(pretty=True);  # note the semicolon (;) to reduce output after the table.

#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False      
                                                                                     
1   tagger            token.tag                        tag_acc            False      
                                                                                     
2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                

Notice the first component, `tok2vec`. This [component](https://spacy.io/api/tok2vec) is responsible for mapping tokens to vectors, i.e., creating an [embedding layer](https://spacy.io/usage/embeddings-transformers), and making them available for later components to use via the `doc.tensor` attribute.
> Note, this is not the same as a [`tokenizer`](https://spacy.io/api/tokenizer).

In the `en_core_web_sm` pipeline, the [`tagger`](https://spacy.io/api/tagger) and [`parser`](https://spacy.io/api/dependencyparser) components both use the `tok2vec` component's output. We can confirm this by checking `tok2vec.listening_components`.

In [5]:
tok2vec = nlp.get_pipe("tok2vec")
tok2vec.listening_components

On the flip side, we can see which components *use* a `tok2vec` model by checking their configuration via `nlp.get_pipe_config`.

In [55]:
nlp.get_pipe_config("parser")

{'factory': 'parser',
 'learn_tokens': False,
 'min_action_freq': 30,
 'model': {'@architectures': 'spacy.TransitionBasedParser.v2',
  'state_type': 'parser',
  'extra_state_tokens': False,
  'hidden_width': 64,
  'maxout_pieces': 2,
  'use_upper': True,
  'nO': None,
  'tok2vec': {'@architectures': 'spacy.Tok2VecListener.v1',
   'width': '${components.tok2vec.model.encode:width}',
   'upstream': 'tok2vec'}},
 'moves': None,
 'scorer': {'@scorers': 'spacy.parser_scorer.v1'},
 'update_with_oracle_cut_size': 100}
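
The `width` entry in the listener block above is not a number but a reference to another part of the config. As a rough illustration of how such a reference can be read, here is a toy resolver for the `${section.path:key}` form shown in the output. This is only a sketch and not how `spacy` resolves variables internally; the `resolve` helper, the trimmed `config` dict, and the width of 96 are all assumptions made up for the example.

```python
import re

# Matches the "${section.path:key}" form shown in the config output above.
VARIABLE = re.compile(r"^\$\{([\w.]+):(\w+)\}$")


def resolve(value, config):
    """Return the value a "${section.path:key}" string points at, else the value unchanged."""
    if not isinstance(value, str):
        return value
    match = VARIABLE.match(value)
    if match is None:
        return value
    section_path, key = match.groups()
    node = config
    for part in section_path.split("."):
        node = node[part]  # walk down the nested sections
    return node[key]


# Hypothetical, trimmed-down config; the width of 96 is an illustrative value.
config = {
    "components": {
        "tok2vec": {"model": {"encode": {"width": 96}}},
        "parser": {
            "model": {
                "tok2vec": {
                    "@architectures": "spacy.Tok2VecListener.v1",
                    "width": "${components.tok2vec.model.encode:width}",
                    "upstream": "tok2vec",
                }
            }
        },
    }
}

listener = config["components"]["parser"]["model"]["tok2vec"]
resolved_width = resolve(listener["width"], config)
print(resolved_width)
```

The point is that the listener's width is tied to the upstream `tok2vec` encoder's width, so the two can't drift apart when one of them is edited.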

In [30]:
parser = nlp.get_pipe("parser")

In [50]:
parser.cfg

{'moves': None,
 'update_with_oracle_cut_size': 100,
 'multitasks': [],
 'min_action_freq': 30,
 'learn_tokens': False,
 'beam_width': 1,
 'beam_density': 0.0,
 'beam_update_prob': 0.0,
 'incorrect_spans_key': None}

In [31]:
parser.tok2vec

<thinc.model.Model at 0x1ca90495ec0>

In [35]:
parser.model.layers[0].layers

[<spacy.pipeline.tok2vec.Tok2VecListener at 0x1ca90498550>,
 <thinc.model.Model at 0x1ca904895c0>,
 <thinc.model.Model at 0x1ca90495440>]

In [45]:
tagger = nlp.get_pipe("tagger")

In [46]:
tagger.tok2vec

AttributeError: 'Tagger' object has no attribute 'tok2vec'

In [48]:
tok2vec.listener_map

{'tagger': [<spacy.pipeline.tok2vec.Tok2VecListener at 0x1ca904982d0>],
 'parser': [<spacy.pipeline.tok2vec.Tok2VecListener at 0x1ca90498550>]}
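
To make the `listener_map` idea more concrete, here is a heavily simplified, pure-Python sketch of the pattern: one upstream component computes the token vectors once and hands the same result to every registered listener. The `ToyTok2Vec` and `ToyListener` classes are hypothetical stand-ins invented for this post. The real `spacy` classes also deal with batches, backpropagation, and `thinc` models, none of which is modelled here.

```python
class ToyListener:
    """Stands in for a listener layer: it returns whatever the upstream last sent."""

    def __init__(self, upstream_name):
        self.upstream_name = upstream_name
        self.received = None

    def receive(self, outputs):
        self.received = outputs

    def __call__(self, docs):
        # No computation here; reuse the upstream component's output.
        return self.received


class ToyTok2Vec:
    """Stands in for the shared embedding component."""

    def __init__(self, name="tok2vec"):
        self.name = name
        self.listener_map = {}  # component name -> list of listeners

    def add_listener(self, listener, component_name):
        self.listener_map.setdefault(component_name, [])
        if listener not in self.listener_map[component_name]:
            self.listener_map[component_name].append(listener)

    @property
    def listening_components(self):
        return list(self.listener_map)

    def predict(self, docs):
        # Fake "vectors": one number per token, computed once per batch.
        outputs = [[float(len(token)) for token in doc.split()] for doc in docs]
        for listeners in self.listener_map.values():
            for listener in listeners:
                listener.receive(outputs)
        return outputs


toy_tok2vec = ToyTok2Vec()
tagger_listener = ToyListener("tok2vec")
parser_listener = ToyListener("tok2vec")
toy_tok2vec.add_listener(tagger_listener, "tagger")
toy_tok2vec.add_listener(parser_listener, "parser")

docs = ["Hi Ian"]
vectors = toy_tok2vec.predict(docs)
print(toy_tok2vec.listening_components)
print(tagger_listener(docs) is vectors)
```

Both listeners end up holding the very same object the upstream component produced, which is the whole appeal of sharing an embedding layer: the expensive part runs once per batch.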

In [33]:
tagger.model.layers

[<spacy.pipeline.tok2vec.Tok2VecListener at 0x1ca904982d0>,
 <thinc.model.Model at 0x1ca9048b8c0>]

In [27]:
nlp.get_pipe_config("ner")

{'factory': 'ner',
 'incorrect_spans_key': None,
 'model': {'@architectures': 'spacy.TransitionBasedParser.v2',
  'state_type': 'ner',
  'extra_state_tokens': False,
  'hidden_width': 64,
  'maxout_pieces': 2,
  'use_upper': True,
  'nO': None,
  'tok2vec': {'@architectures': 'spacy.Tok2Vec.v2',
   'embed': {'@architectures': 'spacy.MultiHashEmbed.v2',
    'width': 96,
    'attrs': ['NORM', 'PREFIX', 'SUFFIX', 'SHAPE'],
    'rows': [5000, 1000, 2500, 2500],
    'include_static_vectors': False},
   'encode': {'@architectures': 'spacy.MaxoutWindowEncoder.v2',
    'width': 96,
    'depth': 4,
    'window_size': 1,
    'maxout_pieces': 3}}},
 'moves': None,
 'scorer': {'@scorers': 'spacy.ner_scorer.v1'},
 'update_with_oracle_cut_size': 100}

In [26]:
nlp.get_pipe_config("tagger")

{'factory': 'tagger',
 'label_smoothing': 0.0,
 'model': {'@architectures': 'spacy.Tagger.v2',
  'nO': None,
  'normalize': False,
  'tok2vec': {'@architectures': 'spacy.Tok2VecListener.v1',
   'width': '${components.tok2vec.model.encode:width}',
   'upstream': 'tok2vec'}},
 'neg_prefix': '!',
 'overwrite': False,
 'scorer': {'@scorers': 'spacy.tagger_scorer.v1'}}
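
Reading these dictionaries by eye gets tedious, so here is a small sketch that walks a config and reports whether it mentions a `Tok2VecListener` architecture anywhere. The helper names `find_architectures` and `uses_listener` are my own, and the two dictionaries are trimmed copies of the `model` blocks printed above rather than the full configs.

```python
def find_architectures(config):
    """Yield every "@architectures" value found anywhere in a nested config dict."""
    for key, value in config.items():
        if key == "@architectures":
            yield value
        elif isinstance(value, dict):
            yield from find_architectures(value)


def uses_listener(config):
    """True if any architecture in the config is a Tok2VecListener."""
    return any(
        name.startswith("spacy.Tok2VecListener")
        for name in find_architectures(config)
    )


# Trimmed copy of the tagger's `model` block shown above.
tagger_model = {
    "@architectures": "spacy.Tagger.v2",
    "tok2vec": {
        "@architectures": "spacy.Tok2VecListener.v1",
        "upstream": "tok2vec",
    },
}

# Trimmed copy of the ner's `model` block shown above.
ner_model = {
    "@architectures": "spacy.TransitionBasedParser.v2",
    "tok2vec": {
        "@architectures": "spacy.Tok2Vec.v2",
        "embed": {"@architectures": "spacy.MultiHashEmbed.v2"},
        "encode": {"@architectures": "spacy.MaxoutWindowEncoder.v2"},
    },
}

print(uses_listener(tagger_model))
print(uses_listener(ner_model))
```

With a loaded pipeline, the same helper should work on the real configs, e.g. by calling `uses_listener(nlp.get_pipe_config(name))` for each `name` in `nlp.pipe_names`.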

In [22]:
[(name, hasattr(component, "tok2vec")) for name, component in nlp.pipeline]

[('tok2vec', False),
 ('tagger', False),
 ('parser', True),
 ('attribute_ruler', False),
 ('lemmatizer', False),
 ('ner', True)]

In [14]:
ner = nlp.get_pipe("ner")

In [17]:
hasattr(ner, "tok2vec")

True

In [10]:
for pipe in nlp.pipe_names:
    print(nlp.get_pipe(pipe))

<spacy.pipeline.tok2vec.Tok2Vec object at 0x000001CA902CBBF0>
<spacy.pipeline.tagger.Tagger object at 0x000001CA90474230>
<spacy.pipeline.dep_parser.DependencyParser object at 0x000001CA901D1700>
<spacy.pipeline.attributeruler.AttributeRuler object at 0x000001CA90318990>
<spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x000001CA90154C10>
<spacy.pipeline.ner.EntityRecognizer object at 0x000001CA901D13F0>


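You can see that the `ner` component has its own `tok2vec` model rather than a listener: its config uses the `spacy.Tok2Vec.v2` architecture directly, so it does not rely on the shared `tok2vec` component at the front of the pipeline.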

In [49]:
nlp.tokenizer

<spacy.tokenizer.Tokenizer at 0x1cc3d03f9a0>

In [33]:
tok2vec = nlp.get_pipe("tok2vec")

In [48]:
nlp.components

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1cc3df79970>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1cc3df7aed0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1cc3d103bc0>),
 ('senter', <spacy.pipeline.senter.SentenceRecognizer at 0x1cc3df79cd0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1cc3d417450>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1cc3df4bf50>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1cc3d103b50>)]

In [47]:
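# nlp.components yields (name, component) pairs; find_listeners works in place and returns None.
[tok2vec.find_listeners(component) for _, component in nlp.components]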

[None, None, None, None, None, None, None]