# Spancat Architecture Walkthrough

In this notebook, we'll go through how the spancat architecture works. We have three goals:
- Understand how each component of the spancat pipeline works.
- Familiarize ourselves with working with the spaCy registry.
- Figure out which component to update whenever we want to change something in spancat.

In [1]:
import spacy
print(f"Your spaCy version is {spacy.__version__}")

Your spaCy version is 3.3.1


The spaCy registry is based on [`explosion/catalogue`](https://github.com/explosion/catalogue). It lets us look up spaCy's registered functions by name and call them directly from our code. For convenience, I'll alias the overall spaCy registry as `reg`.

In [2]:
reg = spacy.registry
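
To get a feel for the pattern before using the real thing, here is a toy registry in plain Python. It is only an illustration of "register a factory under a name, look it up later, then call it with config values". It is not how `catalogue` or spaCy implement it, and the name `toy.fixed_sizes.v1` is made up for this example.

```python
# Toy illustration only: NOT the catalogue/spaCy implementation.
from typing import Callable, Dict, List, Tuple

_REGISTRY: Dict[Tuple[str, str], Callable] = {}


def register(table: str, name: str) -> Callable:
    """Decorator that stores a factory under the key (table, name)."""
    def decorator(func: Callable) -> Callable:
        _REGISTRY[(table, name)] = func
        return func
    return decorator


def get(table: str, name: str) -> Callable:
    """Look up a factory by table and name, shaped like reg.get(table, name)."""
    return _REGISTRY[(table, name)]


@register("misc", "toy.fixed_sizes.v1")  # hypothetical name
def make_size_lister(sizes: List[int]) -> Callable[[int], List[int]]:
    """Factory: takes config values and returns the configured function."""
    def list_valid_sizes(n_tokens: int) -> List[int]:
        return [size for size in sizes if size <= n_tokens]
    return list_valid_sizes


# Same two-step usage as in the notebook: get the factory, then call it.
factory = get("misc", "toy.fixed_sizes.v1")
lister = factory(sizes=[1, 2, 3])
print(lister(2))
```

The two-step shape (get a factory, then call it with settings) is the one we'll repeat for every component below.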

Let's make some sample spaCy Doc objects for debugging: a short sentence and a longer one.

In [3]:
nlp = spacy.blank("en")
texts = [
    "A short sentence.",
    "Multivariate analysis revealed that septic shock and bacteremia originating from lower respiratory tract infection were two independent risk factors for 30-day mortality."
]
docs = list(nlp.pipe(texts))

## Initialize the spancat component 

In this section, we'll initialize the spancat component by getting a factory for each of its parts. Note that we'll work backwards: we start from the `spancat` pipeline and construct each component it depends on as we go along.

![](spancat_archi.png)

### Suggester Function (`suggester=ngram_suggester.v1`)

The first thing we need to create is the suggester function. It's also worth checking which spans the suggester proposes.

In [4]:
suggester_factory = reg.get("misc", "spacy.ngram_suggester.v1")
suggester = suggester_factory(sizes=[1,2,3])

In [5]:
suggested = suggester(docs)

In [6]:
len(docs[0]), len(docs[1])

(4, 25)

In [7]:
suggested.lengths

array([ 9, 72], dtype=int32)

`suggested` is a ragged array: `suggested.data` holds the `(start, end)` token offsets of every candidate span, and `suggested.lengths` gives the number of candidates for each [spaCy Doc](https://spacy.io/api/doc). With window sizes `1`, `2`, and `3`, the four-token short sentence ("A", "short", "sentence", ".") yields 4 + 3 + 2 = 9 candidates, and the 25-token sentence yields 25 + 24 + 23 = 72. For the short sentence, the candidates look like this:

In [8]:
window_idxs = suggested.data[:suggested.lengths[0]].tolist()
for idx, (start, end) in enumerate(window_idxs):
    print(f"Span {idx}: {docs[0][start:end]}")

Span 0: A
Span 1: short
Span 2: sentence
Span 3: .
Span 4: A short
Span 5: short sentence
Span 6: sentence.
Span 7: A short sentence
Span 8: short sentence.
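
To make the enumeration concrete, here is a minimal pure-Python sketch that produces the same `(start, end)` offsets in the same order as the output above. It is a re-implementation for illustration, not spaCy's code, and `ngram_spans` is a name made up for this example.

```python
from typing import List, Tuple


def ngram_spans(n_tokens: int, sizes: List[int]) -> List[Tuple[int, int]]:
    """Enumerate (start, end) offsets for every window of each size."""
    spans = []
    for size in sizes:
        # A window of `size` tokens fits at n_tokens - size + 1 positions
        # (range() is empty when the window is longer than the doc).
        for start in range(n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


tokens = ["A", "short", "sentence", "."]
spans = ngram_spans(len(tokens), sizes=[1, 2, 3])
for idx, (start, end) in enumerate(spans):
    # Joining with spaces ignores spaCy's whitespace handling, so the last
    # token prints as "sentence ." here instead of "sentence.".
    print(f"Span {idx}: {' '.join(tokens[start:end])}")

# Candidate counts for a 4-token and a 25-token doc.
print(len(spans), len(ngram_spans(25, [1, 2, 3])))
```

The last line prints `9 72`, which matches the per-doc counts reported by `suggested.lengths`.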


### Model (`model=spacy.SpanCategorizer.v1`)

The `model`, on the other hand, requires us to pass `tok2vec`, `reducer`, and `scorer` components. In addition, the `tok2vec` architecture we want to use requires `embed` and `encode` sub-layers (check the dependency graph above). Let's start with `tok2vec` and work our way from left to right.

In [9]:
# tok2vec: get factories
tok2vec_factory = reg.get("architectures", "spacy.Tok2Vec.v2")
embed_factory = reg.get("architectures", "spacy.MultiHashEmbed.v2")
encode_factory = reg.get("architectures", "spacy.MaxoutWindowEncoder.v2")

# tok2vec: construct components
embed = embed_factory(
    width=96,
    rows=[5000, 2000, 1000, 1000],
    attrs=["ORTH", "PREFIX", "SUFFIX", "SHAPE"],
    include_static_vectors=False,
)

encode = encode_factory(
    width=96,
    window_size=1,
    maxout_pieces=3,
    depth=4,
)

# tok2vec: assemble
tok2vec = tok2vec_factory(embed=embed, encode=encode)

In [10]:
# reducer: get factories
reducer_factory = reg.get("layers", "spacy.mean_max_reducer.v1")

# reducer: construct components and assemble
reducer = reducer_factory(hidden_size=128)
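
As a rough intuition for what the reducer does, here is a pure-Python sketch of the mean and max pooling its name refers to: a span with any number of token vectors is squashed into one fixed-size vector. This is an assumption-laden toy, not the spaCy layer. The real layer works on batched ragged arrays, may concatenate additional pooled features, and then applies a learned hidden layer of size `hidden_size`, all of which is left out here.

```python
from typing import List


def mean_max_pool(token_vectors: List[List[float]]) -> List[float]:
    """Concatenate the element-wise mean and max of a span's token vectors."""
    n_tokens = len(token_vectors)
    width = len(token_vectors[0])
    mean = [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(width)]
    maximum = [max(vec[i] for vec in token_vectors) for i in range(width)]
    return mean + maximum


# A two-token span with made-up 2-dimensional token vectors.
span_vectors = [[1.0, 4.0], [3.0, 2.0]]
pooled = mean_max_pool(span_vectors)
print(pooled)  # [2.0, 3.0, 3.0, 4.0]
```

Whatever the span length, the pooled vector has the same width, which is what lets a single scorer layer handle spans of size 1, 2, and 3 alike.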

In [11]:
# scorer: get factories
scorer_factory = reg.get("layers", "spacy.LinearLogistic.v1")

# scorer: construct components and assemble
scorer = scorer_factory()
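
As the name suggests, the scorer is a linear layer followed by a logistic (sigmoid) activation. The sketch below shows that computation for a single span vector in plain Python, with made-up weights. It is only meant to show that each label gets its own independent probability, so the scores don't have to sum to 1 and a span can receive several labels.

```python
import math
from typing import List


def linear_logistic(
    x: List[float], weights: List[List[float]], bias: List[float]
) -> List[float]:
    """Apply z = W.x + b, then a sigmoid to each label's score."""
    probs = []
    for w_row, b in zip(weights, bias):
        z = sum(w * xi for w, xi in zip(w_row, x)) + b
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs


span_vector = [1.0, -1.0]           # made-up reduced span vector
weights = [[0.0, 0.0], [2.0, 0.0]]  # made-up weights, one row per label
bias = [0.0, 0.0]

probs = linear_logistic(span_vector, weights, bias)
print(probs)
```

The first label gets exactly 0.5 (its weights are all zero) and the second gets sigmoid(2), roughly 0.88.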

Now that we have the `tok2vec`, `reducer`, and `scorer`, we can assemble them into the spancat model.

In [12]:
model_factory = reg.get("architectures", "spacy.SpanCategorizer.v1")
model = model_factory(tok2vec=tok2vec, reducer=reducer, scorer=scorer)

### Pipeline (`factory=spancat`)

Now that we have the `model` and `suggester`, we can finally assemble the `spancat` pipeline. There are other parameters we need to provide, such as `nlp` and `name`, but unlike the other two, we don't need to construct them from factories.

Optionally, we can also add a `Scorer` (this is different from the `LinearLogistic.v1` scorer layer that is part of the model) to report our results. We need to construct it from a factory again, but this should be quick and easy:

In [13]:
spacy_scorer_factory = reg.get("scorers", "spacy.spancat_scorer.v1")
spacy_scorer = spacy_scorer_factory()

`spacy.spancat_scorer.v1` adds the `spans_sc_{}` values to the `spacy.Scorer` results so that they get reported during training and evaluation. Again, note that this component only performs scoring and evaluation. It does not affect the model; it only reports metrics on the predictions that come out of the model's own scorer layer, `LinearLogistic.v1`.

In [14]:
pipeline_factory = reg.get("factories", "spancat")
pipeline = pipeline_factory(
    nlp=nlp,  ## a blank:en pipeline
    name="spancat",
    threshold=0.5,
    spans_key="sc",
    max_positive=None,
    # These are the two that we constructed earlier
    model=model,
    suggester=suggester,
    # The optional spaCy scorer for storing the computed scores
    scorer=spacy_scorer,
)
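
The `threshold` and `max_positive` arguments decide which scored labels end up as predictions: `threshold` is the minimum probability for a label to count as positive, and `max_positive` caps the number of positive labels per span (`None` means no cap). Here is a small sketch of that selection rule for one span. The function name and the labels are made up, and this is a simplification of the idea rather than spaCy's actual code.

```python
from typing import Dict, List, Optional


def select_labels(
    scores: Dict[str, float], threshold: float, max_positive: Optional[int]
) -> List[str]:
    """Keep labels scoring at or above threshold, best first, up to max_positive."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [label for label, score in ranked if score >= threshold]
    if max_positive is not None:
        kept = kept[:max_positive]
    return kept


# Made-up scores for one candidate span (hypothetical labels).
span_scores = {"DISEASE": 0.91, "SYMPTOM": 0.62, "DRUG": 0.08}

print(select_labels(span_scores, threshold=0.5, max_positive=None))
print(select_labels(span_scores, threshold=0.5, max_positive=1))
```

With the settings used above (`threshold=0.5`, `max_positive=None`), both `DISEASE` and `SYMPTOM` would be kept. Setting `max_positive=1` keeps only the top-scoring label.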

From here we can now access **anything** in the [SpanCategorizer pipeline](https://spacy.io/api/spancategorizer). For example, we can train our model by calling the `pipeline.update` function. However, it's usually better to use the [Training Config system](https://spacy.io/usage/training) for that purpose.
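
For reference, the component tree we built by hand maps onto nested config sections. The sketch below keeps such a fragment in a Python string, using the same registered names and values as the cells above, and parses it with the standard library only to show how the sections nest. spaCy loads real configs with its own loader, so treat the parsing code as illustration, not as the way spaCy reads them.

```python
import configparser
import json

CONFIG_TEXT = """
[components.spancat]
factory = "spancat"
spans_key = "sc"
threshold = 0.5
max_positive = null

[components.spancat.scorer]
@scorers = "spacy.spancat_scorer.v1"

[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1, 2, 3]

[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"

[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128

[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"

[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"

[components.spancat.model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 96
rows = [5000, 2000, 1000, 1000]
attrs = ["ORTH", "PREFIX", "SUFFIX", "SHAPE"]
include_static_vectors = false

[components.spancat.model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4
"""

# Stdlib-only parsing, for illustration: values are JSON-encoded strings.
parser = configparser.ConfigParser(interpolation=None, delimiters=("=",))
parser.optionxform = str  # keep keys exactly as written
parser.read_string(CONFIG_TEXT)

parsed = {
    section: {key: json.loads(value) for key, value in parser[section].items()}
    for section in parser.sections()
}

for section in parsed:
    print(section)
print(parsed["components.spancat.suggester"])
```

Each `@architectures`, `@layers`, `@misc`, or `@scorers` key names the registry table and entry to look up, and the remaining keys in that section are the arguments we passed to the factory by hand.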

In [20]:
pipeline.get_loss??