# Transformers in spacy

Why transformers?

Compared with classical spacy models:
 - They are better at noticing context, so presumably any trainable component will train a little better.
 - They are also better at making something of unknown words, which is useful in languages that make new words by combining morphemes ([agglutination](https://en.wikipedia.org/wiki/Agglutination)), like Dutch does.
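
To make the "unknown words" point concrete, here is a toy sketch of subword splitting. It uses greedy longest-match against a tiny, made-up vocabulary (the `toy_vocab` entries are invented for illustration). Real transformer tokenizers learn their vocabulary from data and use other algorithms (RobBERT, used further down, uses byte-level BPE), so treat this only as the general idea: a word the model never saw whole is still broken into pieces it does know.

```python
def split_into_subwords(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Falls back to single characters when no known piece matches.
    """
    pieces = []
    pos = 0
    while pos < len(word):
        # try the longest remaining substring first, then shorter ones
        for end in range(len(word), pos, -1):
            if word[pos:end] in vocab:
                pieces.append(word[pos:end])
                pos = end
                break
        else:
            # nothing in the vocabulary matches here: emit one character
            pieces.append(word[pos])
            pos += 1
    return pieces


# hypothetical vocabulary, not taken from any real model
toy_vocab = {"binnen", "water", "kerende", "landschaps", "elementen"}

print(split_into_subwords("binnenwaterkerende", toy_vocab))
print(split_into_subwords("landschapselementen", toy_vocab))
```

A classical word-vector lookup would treat `binnenwaterkerende` as a single out-of-vocabulary item, whereas the pieces above each carry some learned meaning.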


Consider a transformer-based model that was already molded into spacy-style usability, like [en_core_web_trf](https://spacy.io/models/en#en_core_web_trf).

Now say you want a transformer-based spacy model for Dutch.  Transformers for Dutch exist, ***but*** things are not quite that simple.

## Some distinctions you may care about

Now maybe you've heard that transformers are cooler and you've found [huggingface.co/models](https://huggingface.co/models). It is easy to assume this all comes from the same company, but it does not: [explosion.ai](https://explosion.ai/) is [behind at least](https://explosion.ai/software#thinc) [spacy](https://spacy.io/), [prodigy](https://prodi.gy/), [thinc](https://thinc.ai/) and [`spacy-transformers`](https://github.com/explosion/spacy-transformers), while [Hugging Face](https://huggingface.co/) is a separate company that runs the model hub and maintains the [`transformers`](https://github.com/huggingface/transformers) library.

However, [huggingface.co/models](https://huggingface.co/models) is intended as a central store for models in a much broader sense (it includes things for text, images, audio, and more; transformer is one architecture among many).   So the spacy models are there too (e.g. [en_core_web_trf](https://huggingface.co/spacy/en_core_web_trf) again).

This makes it a little harder to explain where these things are not the same.

**On models and models**

The models listed at https://spacy.io/models are the ones you can fetch via `python -m spacy download modelname` and then load with `spacy.load()`.
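
As a minimal sketch of that download-then-load step: `spacy.load()` raises `OSError` when the named model is not installed, so a small wrapper can turn that into a hint. The helper name `load_spacy_model` is made up for this example, and it also returns `None` when spacy itself is not installed.

```python
def load_spacy_model(name):
    """Return a loaded spacy pipeline, or None if it cannot be loaded."""
    try:
        import spacy
    except ImportError:
        # spacy itself is not installed
        return None
    try:
        return spacy.load(name)
    except OSError:
        # spacy raises OSError when the model package or path is not found
        return None


if __name__ == "__main__":
    nlp = load_spacy_model("en_core_web_trf")
    if nlp is None:
        print("not available, try:  python -m spacy download en_core_web_trf")
    else:
        print(nlp.pipe_names)
```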



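And then there's the [`transformers`](https://huggingface.co/docs/transformers/index) library, by Hugging Face.
And there's `spacy-transformers`, by Explosion, which wraps `transformers` models so that they can be used as a spacy pipeline component.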


"Is `transformers` like spacy? Can you just use them in spacy?"

Broadly, no.
There is a lot of shared approach, e.g. specifying a model through configuration.
But the `transformers` library has its own pipelines, tokenizers and output formats, and does not produce spacy `Doc` objects.


If it is a spacy model, its page will probably give you the code to use it.
See e.g. [nl_udv25_dutchalpino_trf](https://huggingface.co/explosion/nl_udv25_dutchalpino_trf)'s "Use in spaCy" button.


The [huggingface-listed models](https://huggingface.co/models)
are their own thing, and broader, also having models for images and audio.

The spacy models are **also** there, as are some other models that are made **for** spacy.

It can be confusing that a single site hosts both a general collection of transformer models and a collection of ready-made spacy pipelines.






A lot of transformer models aren't made for spacy at all.

Maybe you found something that seems to be doing masked language modeling (MLM), or NER.

This can be confusing if you didn't do a deep dive into spacy and just found some code that says it uses transformers but, say, only seems to do masking.
Such models are not spacy pipelines in themselves; they can only be adapted for use by spacy.



"Can I just drop in a transformer thing and have it deal with words better?"

Broadly, that's the idea.

But if you want to make a usable, loadable, tunable spacy model, you'll need to do a deep dive.

There's a reason that if you look for "how to adapt transformers in spacy"
you will usually just find people loading a `_trf` model and training a classifier on top.


https://www.youtube.com/watch?v=RB9uDpJPZdc


In [None]:
import spacy
english_trf = spacy.load('en_core_web_trf')

# print each pipeline component's name, followed by its docstring
for pipe_name in english_trf.pipe_names:
    print( '==== %s ====\n%s\n'%(pipe_name, english_trf.get_pipe(pipe_name).__doc__) )

In [None]:
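from transformers import pipeline
# Note: robbert-v2-dutch-base is a base language model without a trained NER head,
# so the token-classification head is freshly initialized and the LABEL_0/LABEL_1 output carries no meaning.
nlp = pipeline("ner", model="pdelobelle/robbert-v2-dutch-base")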

In [14]:
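# `nlp` here is a transformers pipeline, so the result is a list of dicts (one per subword piece), not a spacy Doc
results = nlp('Wat zijn binnenwaterkerende landschapselementen?')
results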

[{'entity': 'LABEL_0',
  'score': 0.61068934,
  'index': 1,
  'word': 'Wat',
  'start': 0,
  'end': 3},
 {'entity': 'LABEL_0',
  'score': 0.6292593,
  'index': 2,
  'word': 'Ġzijn',
  'start': 4,
  'end': 8},
 {'entity': 'LABEL_0',
  'score': 0.5361042,
  'index': 3,
  'word': 'Ġbinnen',
  'start': 9,
  'end': 15},
 {'entity': 'LABEL_0',
  'score': 0.50466686,
  'index': 4,
  'word': 'wat',
  'start': 15,
  'end': 18},
 {'entity': 'LABEL_0',
  'score': 0.50857574,
  'index': 5,
  'word': 'erk',
  'start': 18,
  'end': 21},
 {'entity': 'LABEL_0',
  'score': 0.6450409,
  'index': 6,
  'word': 'erende',
  'start': 21,
  'end': 27},
 {'entity': 'LABEL_0',
  'score': 0.5645136,
  'index': 7,
  'word': 'Ġlandschaps',
  'start': 28,
  'end': 38},
 {'entity': 'LABEL_0',
  'score': 0.571217,
  'index': 8,
  'word': 'elementen',
  'start': 38,
  'end': 47},
 {'entity': 'LABEL_0',
  'score': 0.60749924,
  'index': 9,
  'word': '?',
  'start': 47,
  'end': 48}]
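
The output above is per subword piece, whereas spacy works with word-level tokens. One rough way to see the gap is to glue pieces back together wherever their character offsets touch. This is only a sketch over the recorded output (the `merge_pieces` helper is made up here, and the `entity`, `score` and `index` fields are dropped for brevity); it is not how `spacy-transformers` aligns pieces to tokens.

```python
def merge_pieces(text, pieces):
    """Group subword pieces whose character offsets touch into word-level strings."""
    spans = []
    for piece in pieces:
        if spans and piece["start"] == spans[-1][1]:
            # no gap since the previous piece: extend the current span
            spans[-1] = (spans[-1][0], piece["end"])
        else:
            # whitespace (or the start of the text) precedes this piece: new span
            spans.append((piece["start"], piece["end"]))
    return [text[start:end] for start, end in spans]


text = "Wat zijn binnenwaterkerende landschapselementen?"
# offsets copied from the pipeline output above
pieces = [
    {"word": "Wat", "start": 0, "end": 3},
    {"word": "Ġzijn", "start": 4, "end": 8},
    {"word": "Ġbinnen", "start": 9, "end": 15},
    {"word": "wat", "start": 15, "end": 18},
    {"word": "erk", "start": 18, "end": 21},
    {"word": "erende", "start": 21, "end": 27},
    {"word": "Ġlandschaps", "start": 28, "end": 38},
    {"word": "elementen", "start": 38, "end": 47},
    {"word": "?", "start": 47, "end": 48},
]

print(merge_pieces(text, pieces))
# ['Wat', 'zijn', 'binnenwaterkerende', 'landschapselementen?']
```

Note that the question mark stays attached to the last word, because nothing separates them in the text. A spacy tokenizer would typically split that off as its own token, which is one small example of why the two worlds need explicit alignment.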

In [None]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
# the tokenizer turns text into subword ids; the model adds a (here untrained) classification head on top of RobBERT
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robbert-v2-dutch-base")

In [16]:
tokenizer('Wat zijn binnenwaterkerende landschapselementen?')

{'input_ids': [0, 815, 20, 143, 3063, 1576, 1237, 19472, 8304, 51, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [None]:
model

In [None]:
#https://github.com/explosion/spaCy/discussions/8243

In [None]:
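# Note: en_trf_bertbaseuncased_lg is a spacy-transformers model name from the spaCy 2.x era;
# under spaCy 3.x this download is expected to fail, and en_core_web_trf is the transformer-based English model.
!python3 -m spacy download en_trf_bertbaseuncased_lg

import spacy
nlp = spacy.load("en_trf_bertbaseuncased_lg")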

In [11]:
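# Note: the recorded output below has the transformers "ner" pipeline shape (dicts, 'Ġ' markers),
# which suggests `nlp` was still the earlier transformers pipeline at this point, not a spacy model.
nlp("Foo bar you")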

[{'entity': 'LABEL_1',
  'score': 0.62097406,
  'index': 1,
  'word': 'F',
  'start': 0,
  'end': 1},
 {'entity': 'LABEL_1',
  'score': 0.5734882,
  'index': 2,
  'word': 'oo',
  'start': 1,
  'end': 3},
 {'entity': 'LABEL_1',
  'score': 0.5073553,
  'index': 3,
  'word': 'Ġbar',
  'start': 4,
  'end': 7},
 {'entity': 'LABEL_1',
  'score': 0.5949716,
  'index': 4,
  'word': 'Ġyou',
  'start': 8,
  'end': 11}]

In [None]:
from transformers import RobertaTokenizer, RobertaForSequenceClassification
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base")

In [3]:
tokenizer('Ik ben een kaas')
#help(tokenizer)

{'input_ids': [0, 204, 131, 9, 3975, 2], 'attention_mask': [1, 1, 1, 1, 1, 1]}
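
Both tokenizer outputs start with id 0 and end with id 2, which in RoBERTa-style vocabularies are the begin and end markers, and the `attention_mask` is all ones because nothing was padded. The mask only becomes interesting once sequences of different lengths go into one batch. The sketch below pads the two id lists from this notebook by hand; `pad_batch` is a made-up helper, and `pad_id=1` is an assumption based on the usual RoBERTa convention (check `tokenizer.pad_token_id` for the real value). In practice the tokenizer does this for you when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad id sequences to equal length and build matching attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [0, 815, 20, 143, 3063, 1576, 1237, 19472, 8304, 51, 2],  # 'Wat zijn binnenwaterkerende landschapselementen?'
    [0, 204, 131, 9, 3975, 2],                                # 'Ik ben een kaas'
])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```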

In [None]:
from transformers import pipeline
pipe = pipeline('fill-mask', model='GroNLP/bert-base-dutch-cased') # BERTje

In [None]:
help(pipe)



Help on FillMaskPipeline in module transformers.pipelines.fill_mask object:

class FillMaskPipeline(transformers.pipelines.base.Pipeline)
 |  FillMaskPipeline(model: Union[ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')], tokenizer: Union[transformers.tokenization_utils.PreTrainedTokenizer, NoneType] = None, feature_extractor: Union[ForwardRef('SequenceFeatureExtractor'), NoneType] = None, modelcard: Union[transformers.modelcard.ModelCard, NoneType] = None, framework: Union[str, NoneType] = None, task: str = '', args_parser: transformers.pipelines.base.ArgumentHandler = None, device: int = -1, binary_output: bool = False, **kwargs)
 |  
 |  Masked language modeling prediction pipeline using any `ModelWithLMHead`. See the [masked language modeling
 |  examples](../task_summary#masked-language-modeling) for more information.
 |  
 |  This mask filling pipeline can currently be loaded from [`pipeline`] using the following task identifier:
 |  `"fill-mask"`.
 |  
 |  The mod

In [None]:
import random, re

cur = 'Parijs is de [MASK] '

# Repeatedly fill the trailing [MASK], pick one of the suggestions at random,
# then append a fresh [MASK] so the sentence can grow by one token per round.
for _ in range(30):
    print( cur )
    res = pipe(cur)

    # keep only suggestions that contain at least one letter (skips bare punctuation)
    fres = [ r  for r in res  if re.search(r'[A-Za-z]+', r['token_str']) is not None ]
    if not fres:  # random.choice() raises IndexError on an empty list
        break

    choice = random.choice( fres )
    print( '  -> %s'%choice['sequence'] )
    # 'sequence' is the input text with the mask filled in; add a new mask at the end
    tt = choice['sequence'].split() + [' [MASK] ']
    cur = ' '.join( tt )

print( cur )


Parijs is de [MASK] 
  -> Parijs is de stad
Parijs is de stad  [MASK] 
  -> Parijs is de stad van
Parijs is de stad van  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs is de stad van Frankrijk
Parijs is de stad van Frankrijk  [MASK] 
  -> Parijs i