
Can We Tune Together

Can we tune multiple language models together? There are many pre-trained transformer language models available on the HuggingFace model hub. The current go-to approach for sentence-level tasks is to pick one language model (e.g. BERT, ELECTRA, DeBERTa) and fine-tune it for the task at hand. Each LM brings something different: a different pretraining objective, a different pretraining corpus, or some other twist in the transformer architecture. They also score differently when evaluated on the GLUE benchmark.

I couldn't find any work that combines multiple LMs and tunes them together, so I'm running some experiments here! 🤷‍♂️

The initial idea was to concatenate the CLS token representations of each LM and let the complete model (multi-encoder) figure out how to combine them. Later on, I came across other work (Dynamic Meta-Embeddings for Sentence Representations by Kiela et al.) that applies attention to combine static GloVe and FastText embeddings. Even though that work on static/frozen embeddings does not directly translate to language models, the attention mechanism still lets us visualize the attention scores for each LM and look at what is happening inside the multi-encoder.
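
To make the two combination strategies concrete, here is a minimal PyTorch sketch. It is not the actual MultiEncoder from this repo: it assumes each language model hands over one CLS vector per sentence, and shows plain concatenation next to DME-style attention, where every CLS vector is projected into a shared space and weighted by a learned attention score.

import torch
import torch.nn as nn

class CombineCLS(nn.Module):
    """Toy version of the two combine methods: 'concat' and 'dme'."""

    def __init__(self, hidden_sizes, combine_method='dme', dme_dim=256):
        super().__init__()
        self.combine_method = combine_method
        if combine_method == 'dme':
            # project every CLS vector into a shared space, then score each
            # projection with a learned attention vector (as in Kiela et al.)
            self.projections = nn.ModuleList(
                [nn.Linear(size, dme_dim) for size in hidden_sizes]
            )
            self.attention = nn.Linear(dme_dim, 1)

    def forward(self, cls_vectors):
        # cls_vectors: one tensor per LM, each of shape (batch, hidden_size_i)
        if self.combine_method == 'concat':
            # simplest option: concatenate and let the task head sort it out
            return torch.cat(cls_vectors, dim=-1)
        projected = torch.stack(
            [project(cls) for project, cls in zip(self.projections, cls_vectors)],
            dim=1,
        )  # (batch, num_models, dme_dim)
        weights = torch.softmax(self.attention(projected), dim=1)
        # 'weights' are the per-model DME scores shown later in this README
        return (weights * projected).sum(dim=1)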

My code is very much based on FlairNLP (shout-out to them! Look it up, it's a cool NLP framework 🙃). If combining and tuning multiple LMs together happens to work, I might create a PR later.

Datasets

Datasets for experiments

Text Classification

| Corpus Name | Sentences | Labels | Task type | Source |
|---|---|---|---|---|
| CLICKBAIT | 32k | 2 | Sentence classification | clickbait |
| ISEAR | 7k | 7 | Sentence classification | isear (#6: only in .sav and .mdb formats) |
| TREC | 6k | 6 | Sentence classification | trec |
| EMOTION_STIMULUS | 1.6k | 7 | Sentence classification | emotion stimulus |
| GLUE_COLA | 9.5k | 2 | Sentence classification | glue |
| GLUE_SST2 | 68k | 2 | Sentence classification | glue |
| GLUE_MRPC | 4k | 3 | Sentence-pair | glue |
| GLUE_RTE | 3k | 3 | Sentence-pair | glue |
| GLUE_MNLI | 413k | 3 | Sentence-pair | glue |
| GLUE_QNLI | 114k | 3 | Sentence-pair | glue |
| GLUE_QQP | 404k | 3 | Sentence-pair | glue |
| GLUE_WNLI | 700 | 2 | Sentence-pair | glue |
| SICK | 10k | 3 | Sentence-pair | sick |

Text Regression

| Corpus Name | Sentences | Labels | Task type | Reference |
|---|---|---|---|---|
| GLUE_STSB | 8.5k | Similarity score | Sentence regression | glue |
| EMOBANK | 10k | Valence, arousal, or dominance scores | Sentence regression | emobank |
| FB_VALENCE_AROUSAL | 6k | Valence or arousal scores | Sentence regression | valence arousal in fb posts |

Sequence Labeling

| Corpus Name | Sentences | Labels | Task type | Reference |
|---|---|---|---|---|
| CONLL_NER | 20k | 4 | NER | conll03 |
| WNUT_NER | 5k | 6 | NER | wnut17 |
| MIT_MOVIE_NER | 10k | 13 | NER | mit movie |
| MIT_RESTAURANT_NER | 9k | 9 | NER | mit restaurant |

Tasks: Text Classification ✅ Text Regression ✅ Sequence Labeling (still working on SequenceTagger).

Tuning a single Language Model

You can pick any transformer-based language model from the model hub and tune it for GLUE tasks:

from datasets import GLUE_COLA
from encoders import LanguageModel
from task_heads import TextClassifier
from trainer import ModelTrainer

# 1. load any GLUE corpus (e.g. Corpus of Linguistic Acceptability)
corpus = GLUE_COLA()

print(corpus)

# 2. pick a language model
language_model = LanguageModel('google/electra-base-discriminator')

# 3. use classification or regression head
classifier = TextClassifier(encoder=language_model,
                            num_classes=corpus.num_classes)

# 4. create model trainer
trainer = ModelTrainer(model=classifier,
                       corpus=corpus)

# 5. start training
trainer.train(learning_rate=2e-5,
              batch_size=16,
              epochs=10,
              shuffle_data=True)
  • Electra base scores 67.8 ± 1.2 (Matthews correlation coefficient) on the CoLA dev set. You can compare against the scores reported by the authors: expected electra results.
  • RoBERTa base scores 62.0 ± 1.3: expected roberta results.
  • SpanBERT scores 57.2 ± 1.0.
  • Large models need much smaller learning rates, e.g. 5e-6 (see the sketch below).
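
If you want to try a large model, the only things that change are the checkpoint name and the learning rate. A quick sketch, assuming the same API as above and 'roberta-large' as an example checkpoint:

# same pipeline as above, just a large checkpoint and a smaller learning rate
language_model = LanguageModel('roberta-large')

classifier = TextClassifier(encoder=language_model,
                            num_classes=corpus.num_classes)

trainer = ModelTrainer(model=classifier,
                       corpus=corpus)

# large models tend to diverge with 2e-5, so start much lower
trainer.train(learning_rate=5e-6,
              batch_size=16,
              epochs=10,
              shuffle_data=True)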

Combining Language Models

from datasets import GLUE_COLA
from encoders import LanguageModel, MultiEncoder
from task_heads import TextClassifier
from trainer import ModelTrainer

# 1. load any GLUE corpus (e.g. Corpus of Linguistic Acceptability)
corpus = GLUE_COLA()

print(corpus)

# 2. pick some language models
language_models = [
    LanguageModel('google/electra-base-discriminator'),
    LanguageModel('roberta-base'),
]

# 3. create multi-encoder and choose the combine method: 'concat' or 'dme'
multi_encoder = MultiEncoder(language_models=language_models,
                             combine_method='dme')

# 4. use classification or regression head
classifier = TextClassifier(encoder=multi_encoder,
                            num_classes=corpus.num_classes)

# 5. create model trainer
trainer = ModelTrainer(model=classifier,
                       corpus=corpus)

# 6. start training
trainer.train(learning_rate=2e-5,
              batch_size=16,
              epochs=10,
              shuffle_data=True)
  • We can increase Electra's score by adding RoBERTa and tuning the two together. The average MCC score over 5 runs is 68.6.
  • Concatenation (MultiEncoder(combine_method='concat')) scores a bit better than DME; the expected difference is ↑ 0.1.
  • The increase is still very minor and mostly within the standard deviation of Electra on its own.
  • You can pick a more stable dataset where the difference between runs is much smaller (e.g. the GLUE_STSB regression task has a stdev of 0.2). Combining Electra with Ernie on STS-B scores 91.6 Spearman's rank correlation (↑ 0.5); see the sketch after this list.
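
For reference, the Electra + Ernie run on STS-B could look roughly like this. It's a sketch with two assumptions that the README does not spell out: the name of the regression head (written here as a hypothetical TextRegressor from task_heads) and the Ernie checkpoint id (written as 'nghuyong/ernie-2.0-base-en'):

from datasets import GLUE_STSB
from encoders import LanguageModel, MultiEncoder
from task_heads import TextRegressor  # hypothetical name, check task_heads for the actual regression head
from trainer import ModelTrainer

corpus = GLUE_STSB()

language_models = [
    LanguageModel('google/electra-base-discriminator'),
    LanguageModel('nghuyong/ernie-2.0-base-en'),  # assumed checkpoint id for Ernie
]

multi_encoder = MultiEncoder(language_models=language_models,
                             combine_method='dme')

regressor = TextRegressor(encoder=multi_encoder)

trainer = ModelTrainer(model=regressor,
                       corpus=corpus)

trainer.train(learning_rate=2e-5,
              batch_size=16,
              epochs=10,
              shuffle_data=True)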

Looking at attention weights

If you use DME as the combine method, you can embed a few sentences from the CoLA dev set and look at the attention scores for each language model:

# pick a few sentences from CoLA dev set
sentences = [
  "The government's imposition of a fine.",
  "Somebody just left - guess who.",
  "He can will go"
]

# let's load the best model after training
model_path = "models/CoLA-multi-transformer-electra-base-discriminator" \
    "-roberta-base-classifier/best-model.pt"
model = TextClassifier.from_checkpoint(model_path)

# classify sentences and look at the dme scores of multi-encoder
predictions = model.predict_tags(sentences, corpus.label_map)

print(predictions)

"""
{'Sentences': 
     ["The government's imposition of a fine.", 
      "Somebody just left - guess who.",
      "He can will go"], 
 'DME scores': 
     [[0.7901, 0.2099],
      [0.8116, 0.1884],
      [0.6635, 0.3365]],
 'Predictions': 
     ['acceptable', 'acceptable', 'not_acceptable']}
"""
# if you didn't save the model to file, you can predict with the best model directly from trainer
predictions = trainer.best_model.predict_tags(sentences, corpus.label_map)

# or skip the corpus.label_map if your dataset doesn't have one
predictions = trainer.best_model.predict_tags(sentences)
print(predictions)

DME scores show how much weight the multi-encoder assigns to each CLS representation when embedding a given sentence:

  • When predicting whether the sentence "He can will go" is linguistically acceptable, the model uses 0.66 of Electra and 0.34 of RoBERTa.
  • It's quite fun to inspect these scores, especially if you train on multilingual corpora (i.e. sentences in different languages) and mix two monolingual models. A small helper for printing the weights is sketched below.
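
A tiny helper for eyeballing these weights, assuming the predictions dict has the 'Sentences' and 'DME scores' keys shown above and that the scores follow the order in which the language models were passed to the MultiEncoder:

# pretty-print per-sentence DME weights, one column per language model
model_names = ['electra-base', 'roberta-base']  # same order as passed to MultiEncoder

def show_dme_scores(predictions, model_names):
    header = ' | '.join(f'{name:>15}' for name in model_names)
    print(f'{"sentence":<35} | {header}')
    for sentence, scores in zip(predictions['Sentences'], predictions['DME scores']):
        row = ' | '.join(f'{score:>15.2f}' for score in scores)
        print(f'{sentence[:35]:<35} | {row}')

show_dme_scores(predictions, model_names)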

Combining more than two models

You can throw even more LMs into the mix:

# 2. mix some language models
language_models = [
    LanguageModel('google/electra-base-discriminator'),
    LanguageModel('roberta-base'),
    LanguageModel('SpanBERT/spanbert-base-cased'),
]

# 3. create multi-encoder and choose the combine method: 'dme' or 'concat'
multi_encoder = MultiEncoder(
    language_models=language_models,
    combine_method='dme',
)

This scored 69.3. I ran it only once, so the score might differ after averaging multiple runs. Combining many language models suffers from overfitting and needs further analysis 🤷‍♂️

Notes on joint tuning and why it sometimes doesn't work

This approach is very sensitive to overfitting.

  • There are cases where one language model fits the training data much faster than the others. The DME scores show that the newly added weights can learn to ignore some language models. One option is to play with the hidden dropout parameters of the LMs: LanguageModel('roberta-base', hidden_dropout_prob=0.4).
  • You can also assign a different learning rate to each language model: trainer.train(learning_rate=[3e-5, 1e-5]).
  • I am currently experimenting with larger learning rates just for the linear decoder layer: trainer.train(learning_rate=2e-5, decoder_learning_rate=1e-3). All three options are combined in the sketch after this list.
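
Putting the three options together; a sketch that assumes the parameters keep the names shown above, that the learning-rate list follows the order of the language models, and that trainer.train accepts both a per-model learning rate and a decoder learning rate at the same time:

# extra dropout for the LM that overfits fastest, a smaller learning rate for it,
# and a larger learning rate just for the freshly initialized decoder layer
language_models = [
    LanguageModel('google/electra-base-discriminator'),
    LanguageModel('roberta-base', hidden_dropout_prob=0.4),
]

multi_encoder = MultiEncoder(language_models=language_models,
                             combine_method='dme')

classifier = TextClassifier(encoder=multi_encoder,
                            num_classes=corpus.num_classes)

trainer = ModelTrainer(model=classifier,
                       corpus=corpus)

trainer.train(learning_rate=[3e-5, 1e-5],    # one learning rate per language model
              decoder_learning_rate=1e-3,    # larger rate for the linear decoder
              batch_size=16,
              epochs=10,
              shuffle_data=True)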

Let me know if you find any interesting model combinations or hyper-parameter settings ✌️
