# Word Pair Classification using TextPairClassifier in Flair

Let's see what's possible here! 🚀

__Goal__: solve a word pair classification task using a model that was originally built to classify sentence pairs.

Main thing to consider: TextPairClassifier has an option to embed the two inputs separately (which goes in the direction of siamese networks). If embed_separately is set to False (the default configuration), the model creates a single sentence out of the two by adding a [SEP] between them.

__Dataset__: A dummy dataset to see if the model works. I will take an existing sentence-pair dataset and keep only the first word of each sentence.

__Steps__:
- We will try to solve the word pair classification task using TransformerDocumentEmbeddings + TextPairClassifier (similar to a sentence-pair classification task)
- We will try to use FlairEmbeddings and/or WordEmbeddings
- TODO: We will check whether TransformerWordEmbeddings can be fine-tuned together with the TextPairClassifier model.


__Notes__: 
- ❌ I just ran into the same issue (using the latest flair version) as described [in the last message here](https://github.com/flairNLP/flair/issues/2536). That's why I will use flair==0.10. I will create a PR to fix it soonish.
- My dataset is just a dummy dataset. We are classifying two random words as entailment or not_entailment, so the results don't make sense. This is just to test if the model runs and to share some ideas for your work 😊


Let's try 👉

In [None]:
!pip install flair==0.10 &> /dev/null

In [None]:
import flair

print(flair.__version__)

0.10


# Step 1. Create a dataset of single word pairs

I will take an existing sentence-pair dataset and leave only first words in each sentence. The goal is to test if TextPairClassifier can be used with WordEmbeddings, FlairEmbeddings, or TransformerWordEmbeddings!

If you are using your own custom dataset, you will need to load it as described in [this issue](https://github.com/flairNLP/flair/issues/2536).
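For a custom word-pair dataset, one possible preparation step is to write the pairs to a tab-separated file with one pair per line. The sketch below uses only the standard library. The helper names (`first_word`, `write_word_pairs`), the column order (first word, second word, label), and the file name are assumptions for illustration; the actual loading call should follow the linked issue.

```python
import csv
import tempfile
from pathlib import Path


def first_word(text: str) -> str:
    """Return the first whitespace-separated token, or '' for empty text."""
    parts = text.split()
    return parts[0] if parts else ""


def write_word_pairs(rows, path):
    """Write (sentence_a, sentence_b, label) rows as first-word pairs in TSV form."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for sentence_a, sentence_b, label in rows:
            writer.writerow([first_word(sentence_a), first_word(sentence_b), label])


# Two pairs taken from the RTE examples printed in this notebook
# (the first sentence of the second pair is shortened here).
rows = [
    ("No Weapons of Mass Destruction Found in Iraq Yet .",
     "Weapons of Mass Destruction Found in Iraq .",
     "not_entailment"),
    ("Judie Vivian , chief executive at ProMedica",
     "The previous name of Ho Chi Minh City was Saigon .",
     "entailment"),
]

# Hypothetical output location; any writable folder works.
out_path = Path(tempfile.mkdtemp()) / "train.tsv"
write_word_pairs(rows, out_path)

lines = out_path.read_text(encoding="utf-8").splitlines()
print(lines)
```

Each line of the file then holds one word pair and its label, separated by tabs.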

In [None]:
from flair.datasets import GLUE_RTE
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextPairClassifier
from flair.trainers import ModelTrainer

# Step 1: You can look at an existing text pair corpus (e.g. GLUE_RTE, GLUE_MRPC)
corpus = GLUE_RTE()

# textual entailment: predict whether the first sentence entails the second (entailment / not_entailment)
print(corpus)
print(corpus.train[0])

label_type = 'entailment'

label_dictionary = corpus.make_label_dictionary(label_type=label_type)

2022-11-21 16:59:15,395 Reading data from /root/.flair/datasets/glue/RTE
2022-11-21 16:59:15,397 Train: /root/.flair/datasets/glue/RTE/train.tsv
2022-11-21 16:59:15,400 Dev: /root/.flair/datasets/glue/RTE/dev.tsv
2022-11-21 16:59:15,402 Test: None
Corpus: 2241 train + 277 dev + 249 test sentences
DataPair:
 − First Sentence: "No Weapons of Mass Destruction Found in Iraq Yet ."   [− Tokens: 10]
 − Second Sentence: "Weapons of Mass Destruction Found in Iraq ."   [− Tokens: 8]
 − Labels: [not_entailment (1.0)]
2022-11-21 16:59:20,828 Computing label dictionary. Progress:


100%|██████████| 2241/2241 [00:00<00:00, 53986.85it/s]

2022-11-21 16:59:20,877 Corpus contains the labels: entailment (#2241)
2022-11-21 16:59:20,881 Created (for label 'entailment') Dictionary with 3 tags: <unk>, not_entailment, entailment





Let's inspect our original dataset

In [None]:
# print a few data pairs from our original train set
print('A random instance from training set:')
print(corpus.train[2])

# print a few data pairs from our original dev set
print('\nA random instance from validation set:')
print(corpus.dev[5])

# print a few data pairs from our original test set
print('\nA random instance from testing set:')
print(corpus.test[0])

A random instance from training set:
DataPair:
 − First Sentence: "Judie Vivian , chief executive at ProMedica , a medical service company that helps sustain the 2-year-old Vietnam Heart Institute in Ho Chi Minh City ( formerly Saigon ) , said that so far about 1,500 children have received treatment ."   [− Tokens: 41]
 − Second Sentence: "The previous name of Ho Chi Minh City was Saigon ."   [− Tokens: 11]
 − Labels: [entailment (1.0)]

A random instance from validation set:
DataPair:
 − First Sentence: "In 1979 , the leaders signed the Egypt-Israel peace treaty on the White House lawn . Both President Begin and Sadat received the Nobel Peace Prize for their work . The two nations have enjoyed peaceful relations to this day ."   [− Tokens: 41]
 − Second Sentence: "The Israel-Egypt Peace Agreement was signed in 1979 ."   [− Tokens: 9]
 − Labels: [entailment (1.0)]

A random instance from testing set:
DataPair:
 − First Sentence: "Herceptin was already approved to treat the sickest brea

I will only keep the first word of each sentence

In [None]:
from flair.data import Sentence

for data_split in [corpus.train, corpus.dev, corpus.test]:
  for sentence in data_split:
    sentence.first = Sentence(sentence.first[0].text)
    sentence.second = Sentence(sentence.second[0].text)

In [None]:
# print a few data pairs from our new train set
print('A random instance from training set:')
print(corpus.train[2])

# print a few data pairs from our new dev set
print('\nA random instance from validation set:')
print(corpus.dev[5])

# print a few data pairs from our new test set
print('\nA random instance from testing set:')
print(corpus.test[0])

A random instance from training set:
DataPair:
 − First Sentence: "Judie"   [− Tokens: 1]
 − Second Sentence: "The"   [− Tokens: 1]
 − Labels: [entailment (1.0)]

A random instance from validation set:
DataPair:
 − First Sentence: "In"   [− Tokens: 1]
 − Second Sentence: "The"   [− Tokens: 1]
 − Labels: [entailment (1.0)]

A random instance from testing set:
DataPair:
 − First Sentence: "Herceptin"   [− Tokens: 1]
 − Second Sentence: "Herceptin"   [− Tokens: 1]
 − Labels: [entailment (1.0)]


# Step 2. Two ways to use TransformerDocumentEmbeddings + TextPairClassifier.

Keep in mind that TransformerDocumentEmbeddings use the [CLS] token representation instead of the actual word (or sub-word) representation. This might not be ideal for our case, where each input is a single word (but it can work anyway, given that we fine-tune the model).
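To make the difference concrete, here is a toy sketch with made-up vectors (not real model output). It assumes a BERT-style input in which a single word is wrapped as `[CLS] word [SEP]`: the document embedding reads the vector at the `[CLS]` position, while a word embedding would read the vector at the word's own position. The function names are hypothetical.

```python
# Toy encoder output for the input "[CLS] fine [SEP]": one vector per position.
# The numbers are invented for illustration only.
tokens = ["[CLS]", "fine", "[SEP]"]
hidden_states = [
    [0.1, 0.9],  # [CLS]
    [0.7, 0.2],  # fine
    [0.0, 0.5],  # [SEP]
]


def cls_representation(states):
    """Document-style embedding: the vector at the first ([CLS]) position."""
    return states[0]


def word_representation(states, tokens, word):
    """Word-style embedding: the vector at the position of the given word."""
    return states[tokens.index(word)]


doc_vector = cls_representation(hidden_states)
word_vector = word_representation(hidden_states, tokens, "fine")
print(doc_vector, word_vector)
```

In a real model a word may also be split into several sub-words, which this toy example ignores.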



In [None]:
from flair.embeddings import WordEmbeddings
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextPairClassifier
from flair.trainers import ModelTrainer

# Step 2: Pick a small BERT model as document embeddings (fine-tuned together with the classifier)
embeddings = TransformerDocumentEmbeddings('prajjwal1/bert-tiny', fine_tune=True)

# Step 3: Use text pair classification model
classifier = TextPairClassifier(document_embeddings=embeddings,
                                label_type=label_type,
                                label_dictionary=label_dictionary,
                                embed_separately=False,
                                )

# Step 4: Initialize trainer and train the model
trainer = ModelTrainer(classifier, corpus)

# if you are using transformer embeddings, you can simply call trainer.fine_tune()
trainer.fine_tune(base_path='resources/word-pair-test-flair',
                  use_final_model_for_eval=False,
                  learning_rate=2e-5,
                  mini_batch_size=16,
                  max_epochs=10)

2022-11-21 17:00:20,580 No model_max_length in Tokenizer's config.json - setting it to 512. Specify desired model_max_length by passing it as attribute to embedding instance.
2022-11-21 17:00:21,388 ----------------------------------------------------------------------------------------------------
2022-11-21 17:00:21,391 Model: "TextPairClassifier(
  (loss_function): CrossEntropyLoss()
  (document_embeddings): TransformerDocumentEmbeddings(
    (model): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 128, padding_idx=0)
        (position_embeddings): Embedding(512, 128)
        (token_type_embeddings): Embedding(2, 128)
        (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linea

  "There should be no best model saved at epoch 1 except there is a model from previous trainings"


2022-11-21 17:00:21,689 epoch 1 - iter 14/141 - loss 0.14913489 - samples/sec: 887.96 - lr: 0.000002
2022-11-21 17:00:21,931 epoch 1 - iter 28/141 - loss 0.14022871 - samples/sec: 937.75 - lr: 0.000004
2022-11-21 17:00:22,175 epoch 1 - iter 42/141 - loss 0.13314404 - samples/sec: 929.21 - lr: 0.000006
2022-11-21 17:00:22,420 epoch 1 - iter 56/141 - loss 0.12583450 - samples/sec: 928.16 - lr: 0.000008
2022-11-21 17:00:22,666 epoch 1 - iter 70/141 - loss 0.11711008 - samples/sec: 918.57 - lr: 0.000010
2022-11-21 17:00:22,905 epoch 1 - iter 84/141 - loss 0.10797442 - samples/sec: 951.39 - lr: 0.000012
2022-11-21 17:00:23,148 epoch 1 - iter 98/141 - loss 0.10057374 - samples/sec: 931.72 - lr: 0.000014
2022-11-21 17:00:23,394 epoch 1 - iter 112/141 - loss 0.09489216 - samples/sec: 922.68 - lr: 0.000016
2022-11-21 17:00:23,637 epoch 1 - iter 126/141 - loss 0.09015299 - samples/sec: 935.01 - lr: 0.000018
2022-11-21 17:00:23,874 epoch 1 - iter 140/141 - loss 0.08607270 - samples/sec: 954.65 - 

{'test_score': 0.4779116465863454,
 'dev_score_history': [0.4981949458483754,
  0.4693140794223827,
  0.4693140794223827,
  0.48375451263537905,
  0.4693140794223827,
  0.48375451263537905,
  0.5054151624548736,
  0.4693140794223827,
  0.4693140794223827,
  0.48014440433212996],
 'train_loss_history': [0.08617967565322443,
  0.049560129150435735,
  0.04661472768775058,
  0.04455986306476891,
  0.04429628055364838,
  0.04454586302154248,
  0.04387105435518642,
  0.04389802295679708,
  0.0439148366065602,
  0.04345289485787558],
 'dev_loss_history': [tensor(0.0492, device='cuda:0'),
  tensor(0.0509, device='cuda:0'),
  tensor(0.0465, device='cuda:0'),
  tensor(0.0483, device='cuda:0'),
  tensor(0.0461, device='cuda:0'),
  tensor(0.0460, device='cuda:0'),
  tensor(0.0458, device='cuda:0'),
  tensor(0.0460, device='cuda:0'),
  tensor(0.0462, device='cuda:0'),
  tensor(0.0461, device='cuda:0')]}

## Trying to embed the words separately instead of together

- embed_separately=True means: 'fine' and 'great' will be embedded separately (i.e. passed through the embedding model separately).
- embed_separately=False means: 'fine' and 'great' will create a new sentence 'fine [SEP] great' and this will be embedded (i.e. a single sentence will be passed through the embedding model).
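The two modes can be sketched in plain Python. `toy_embed` is a hypothetical stand-in for the embedding model (it is not part of Flair), and the sketch only shows how the inputs are combined and how the size of the resulting vector changes.

```python
def toy_embed(text, dim=4):
    """Hypothetical stand-in for an embedding model: text -> fixed-size vector."""
    # Deterministic toy features derived from character codes; not a real embedding.
    vector = [0.0] * dim
    for i, ch in enumerate(text):
        vector[i % dim] += ord(ch) / 1000.0
    return vector


def embed_pair(first, second, embed_separately, sep=" [SEP] "):
    """Combine two inputs the way the two bullet points above describe."""
    if embed_separately:
        # Two passes through the model, then the two vectors are concatenated.
        return toy_embed(first) + toy_embed(second)
    # One pass through the model over the joined text 'first [SEP] second'.
    return toy_embed(first + sep + second)


joint = embed_pair("fine", "great", embed_separately=False)
separate = embed_pair("fine", "great", embed_separately=True)
print(len(joint), len(separate))
```

With separate embedding the classifier input is twice as wide, because it sees one vector per word rather than one vector for the joined text.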

In [None]:
from flair.embeddings import WordEmbeddings
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextPairClassifier
from flair.trainers import ModelTrainer

# Step 2: Pick a small BERT model as document embeddings
embeddings = TransformerDocumentEmbeddings('prajjwal1/bert-tiny')

# Step 3: Use text pair classification model
classifier = TextPairClassifier(document_embeddings=embeddings,
                                label_type=label_type,
                                label_dictionary=label_dictionary,
                                embed_separately=True,  # embed sentences separately and concatenate them later
                                )

# Step 4: Initialize trainer and train the model
trainer = ModelTrainer(classifier, corpus)

# if you are using transformer embeddings, you can simply call trainer.fine_tune()
trainer.fine_tune(base_path='resources/word-pair-test-flair',
                  use_final_model_for_eval=False,
                  learning_rate=2e-5,
                  mini_batch_size=16,
                  max_epochs=10)

2022-11-21 17:02:40,723 No model_max_length in Tokenizer's config.json - setting it to 512. Specify desired model_max_length by passing it as attribute to embedding instance.
2022-11-21 17:02:41,537 ----------------------------------------------------------------------------------------------------
2022-11-21 17:02:41,539 Model: "TextPairClassifier(
  (loss_function): CrossEntropyLoss()
  (document_embeddings): TransformerDocumentEmbeddings(
    (model): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 128, padding_idx=0)
        (position_embeddings): Embedding(512, 128)
        (token_type_embeddings): Embedding(2, 128)
        (LayerNorm): LayerNorm((128,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0): BertLayer(
            (attention): BertAttention(
              (self): BertSelfAttention(
                (query): Linea

  "There should be no best model saved at epoch 1 except there is a model from previous trainings"


2022-11-21 17:02:41,960 epoch 1 - iter 14/141 - loss 0.09959839 - samples/sec: 598.72 - lr: 0.000002
2022-11-21 17:02:42,329 epoch 1 - iter 28/141 - loss 0.09516682 - samples/sec: 611.23 - lr: 0.000004
2022-11-21 17:02:42,707 epoch 1 - iter 42/141 - loss 0.09305085 - samples/sec: 597.19 - lr: 0.000006
2022-11-21 17:02:43,080 epoch 1 - iter 56/141 - loss 0.09119593 - samples/sec: 606.35 - lr: 0.000008
2022-11-21 17:02:43,452 epoch 1 - iter 70/141 - loss 0.08587865 - samples/sec: 605.75 - lr: 0.000010
2022-11-21 17:02:43,833 epoch 1 - iter 84/141 - loss 0.08145745 - samples/sec: 593.42 - lr: 0.000012
2022-11-21 17:02:44,202 epoch 1 - iter 98/141 - loss 0.07789742 - samples/sec: 612.77 - lr: 0.000014
2022-11-21 17:02:44,574 epoch 1 - iter 112/141 - loss 0.07593339 - samples/sec: 606.97 - lr: 0.000016
2022-11-21 17:02:44,941 epoch 1 - iter 126/141 - loss 0.07390003 - samples/sec: 616.09 - lr: 0.000018
2022-11-21 17:02:45,312 epoch 1 - iter 140/141 - loss 0.07169289 - samples/sec: 607.13 - 

{'test_score': 0.5502008032128514,
 'dev_score_history': [0.4729241877256318,
  0.46570397111913364,
  0.48014440433212996,
  0.4981949458483754,
  0.51985559566787,
  0.5090252707581228,
  0.5306859205776173,
  0.51985559566787,
  0.5126353790613718,
  0.51985559566787],
 'train_loss_history': [0.07207265992294407,
  0.053268743925784005,
  0.047408804078019125,
  0.04746661044986372,
  0.04686431778908202,
  0.045738561764618796,
  0.04571370057685202,
  0.04469791082652857,
  0.0450230745639401,
  0.045077076505093315],
 'dev_loss_history': [tensor(0.0512, device='cuda:0'),
  tensor(0.0491, device='cuda:0'),
  tensor(0.0468, device='cuda:0'),
  tensor(0.0463, device='cuda:0'),
  tensor(0.0461, device='cuda:0'),
  tensor(0.0464, device='cuda:0'),
  tensor(0.0459, device='cuda:0'),
  tensor(0.0460, device='cuda:0'),
  tensor(0.0460, device='cuda:0'),
  tensor(0.0460, device='cuda:0')]}
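A quick way to compare the two runs is to look up the best dev score in each returned dictionary. The sketch below copies the two `dev_score_history` lists from the outputs above; `best_epoch` is a hypothetical helper, not a Flair function.

```python
# dev_score_history of the run with embed_separately=False
joint_dev = [
    0.4981949458483754, 0.4693140794223827, 0.4693140794223827,
    0.48375451263537905, 0.4693140794223827, 0.48375451263537905,
    0.5054151624548736, 0.4693140794223827, 0.4693140794223827,
    0.48014440433212996,
]

# dev_score_history of the run with embed_separately=True
separate_dev = [
    0.4729241877256318, 0.46570397111913364, 0.48014440433212996,
    0.4981949458483754, 0.51985559566787, 0.5090252707581228,
    0.5306859205776173, 0.51985559566787, 0.5126353790613718,
    0.51985559566787,
]


def best_epoch(history):
    """Return (1-based epoch, score) of the highest score; ties go to the earliest epoch."""
    best_index = max(range(len(history)), key=lambda i: history[i])
    return best_index + 1, history[best_index]


print("joint:   ", best_epoch(joint_dev))
print("separate:", best_epoch(separate_dev))
```

Both runs peak at epoch 7 and stay close to 0.5, which is what we expect when the labels are effectively random for the word pairs.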

# Step 3: Train a model using WordEmbeddings

Issue: a document embedding (i.e. a sentence representation) is a single vector, whereas WordEmbeddings produce a list of word vectors, one per token.

TextPairClassifier accepts document embeddings. Since we have one word per sentence, we can wrap the word embeddings in DocumentPoolEmbeddings: with mean pooling, averaging a single word vector returns that same vector, so only the shape of the tensor changes.
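A minimal sketch of that pooling argument, in plain Python rather than Flair tensors. `mean_pool` is a hypothetical helper; the dimension 300 is taken from the `Embedding(1000001, 300)` line in the model log of this step.

```python
def mean_pool(token_vectors):
    """Average a list of equally sized token vectors into one vector."""
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# A "sentence" with a single token: the mean is the token vector itself.
word_vector = [0.25, -1.0, 3.5]
pooled = mean_pool([word_vector])
print(pooled == word_vector)

# With embed_separately=True the two pooled vectors are concatenated,
# so the classifier input is twice the word embedding size.
word_dim = 300
pair_dim = 2 * word_dim
print(pair_dim)
```

The value 600 agrees with the `in_features=600` of the decoder in the model log of this step.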

In [None]:
from flair.embeddings import WordEmbeddings
from flair.embeddings import DocumentPoolEmbeddings
from flair.models import TextPairClassifier
from flair.trainers import ModelTrainer

# Step 2: Pick English word embeddings ('en') and mean-pool them into a document embedding
embeddings = DocumentPoolEmbeddings([WordEmbeddings('en')])

# Step 3: Use text pair classification model
classifier = TextPairClassifier(document_embeddings=embeddings,
                                label_type=label_type,
                                label_dictionary=label_dictionary,
                                embed_separately=True,
                                )

# Step 4: Initialize trainer and train the model
trainer = ModelTrainer(classifier, corpus)

# static word embeddings are not fine-tuned here, so we call trainer.train() instead of trainer.fine_tune()
trainer.train(base_path='resources/word-pair-test-flair',
              use_final_model_for_eval=False,
              learning_rate=0.1,
              max_epochs=100,
              )

2022-11-21 17:04:05,331 ----------------------------------------------------------------------------------------------------
2022-11-21 17:04:05,333 Model: "TextPairClassifier(
  (loss_function): CrossEntropyLoss()
  (document_embeddings): DocumentPoolEmbeddings(
    fine_tune_mode=none, pooling=mean
    (embeddings): StackedEmbeddings(
      (list_embedding_0): WordEmbeddings(
        'en'
        (embedding): Embedding(1000001, 300)
      )
    )
  )
  (decoder): Linear(in_features=600, out_features=3, bias=True)
  (weights): None
  (weight_tensor) None
)"
2022-11-21 17:04:05,336 ----------------------------------------------------------------------------------------------------
2022-11-21 17:04:05,337 Corpus: "Corpus: 2241 train + 277 dev + 249 test sentences"
2022-11-21 17:04:05,339 ----------------------------------------------------------------------------------------------------
2022-11-21 17:04:05,340 Parameters:
2022-11-21 17:04:05,341  - learning_rate: "0.1"
2022-11-21 17:04:

  "There should be no best model saved at epoch 1 except there is a model from previous trainings"


2022-11-21 17:04:05,605 epoch 1 - iter 21/71 - loss 0.02686282 - samples/sec: 3015.11 - lr: 0.100000
2022-11-21 17:04:05,680 epoch 1 - iter 28/71 - loss 0.02606895 - samples/sec: 3096.57 - lr: 0.100000
2022-11-21 17:04:05,750 epoch 1 - iter 35/71 - loss 0.02548538 - samples/sec: 3285.58 - lr: 0.100000
2022-11-21 17:04:05,823 epoch 1 - iter 42/71 - loss 0.02511740 - samples/sec: 3201.32 - lr: 0.100000
2022-11-21 17:04:05,894 epoch 1 - iter 49/71 - loss 0.02478288 - samples/sec: 3272.58 - lr: 0.100000
2022-11-21 17:04:05,966 epoch 1 - iter 56/71 - loss 0.02450089 - samples/sec: 3232.60 - lr: 0.100000
2022-11-21 17:04:06,041 epoch 1 - iter 63/71 - loss 0.02430677 - samples/sec: 3069.20 - lr: 0.100000
2022-11-21 17:04:06,116 epoch 1 - iter 70/71 - loss 0.02415550 - samples/sec: 3113.38 - lr: 0.100000
2022-11-21 17:04:06,120 ----------------------------------------------------------------------------------------------------
2022-11-21 17:04:06,122 EPOCH 1 done: loss 0.0246 - lr 0.1000000
20

{'test_score': 0.5220883534136547,
 'dev_score_history': [0.46570397111913364,
  0.555956678700361,
  0.5415162454873647,
  0.5848375451263538,
  0.5740072202166066,
  0.5415162454873647,
  0.48375451263537905,
  0.48375451263537905,
  0.51985559566787,
  0.5270758122743683,
  0.5342960288808665,
  0.5415162454873647,
  0.5451263537906137,
  0.5234657039711191,
  0.5451263537906137,
  0.5234657039711191,
  0.5415162454873647,
  0.5379061371841155,
  0.5379061371841155,
  0.5451263537906137,
  0.5415162454873647,
  0.5379061371841155,
  0.5451263537906137,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5451263537906137,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5415162454873647,
  0.5379061371841155,
  0.5379061371841155,
  0.5379061371841155,
  0.5379061371841155,
  0.5379061371841155,
  0.5379061371841155,
  0.5379061371841155

# Step 4: Stack Flair and Word embeddings
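Stacking concatenates the vectors of the individual embeddings for each token, so their sizes add up. Here is a small arithmetic sketch with placeholder vectors. The sizes are read from the model log of this step (300 for `WordEmbeddings('en')`, 2048 for the LSTM of `FlairEmbeddings('en-forward')`); that the Flair embedding size equals the LSTM hidden size is my reading of that log, and `stack` is a hypothetical helper.

```python
def stack(*vectors):
    """Concatenate several per-token vectors into one longer vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Placeholder vectors with the sizes reported in the model log.
word_vector = [0.0] * 300
flair_vector = [0.0] * 2048

stacked = stack(word_vector, flair_vector)

# embed_separately=True: one stacked vector per word, concatenated for the pair.
pair_vector = stacked + stacked
print(len(stacked), len(pair_vector))
```

The pair size of 4696 agrees with the `in_features=4696` of the decoder in the model log of this step.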

In [None]:
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.embeddings import DocumentPoolEmbeddings
from flair.models import TextPairClassifier
from flair.trainers import ModelTrainer

embedding_stack = StackedEmbeddings([
    WordEmbeddings('en'),
    FlairEmbeddings('en-forward'),
])

embedding_stack = DocumentPoolEmbeddings([embedding_stack])

# Step 3: Use text pair classification model
classifier = TextPairClassifier(document_embeddings=embedding_stack,
                                label_type=label_type,
                                label_dictionary=label_dictionary,
                                embed_separately=True,
                                )

# Step 4: Initialize trainer and train the model
trainer = ModelTrainer(classifier, corpus)

# static embeddings are not fine-tuned here, so we call trainer.train() instead of trainer.fine_tune()
trainer.train(base_path='resources/word-pair-test-flair',
              use_final_model_for_eval=False,
              learning_rate=0.1,
              max_epochs=100,
              )

2022-11-21 17:05:53,029 ----------------------------------------------------------------------------------------------------
2022-11-21 17:05:53,032 Model: "TextPairClassifier(
  (loss_function): CrossEntropyLoss()
  (document_embeddings): DocumentPoolEmbeddings(
    fine_tune_mode=none, pooling=mean
    (embeddings): StackedEmbeddings(
      (list_embedding_0): StackedEmbeddings(
        (list_embedding_0): WordEmbeddings(
          'en'
          (embedding): Embedding(1000001, 300)
        )
        (list_embedding_1): FlairEmbeddings(
          (lm): LanguageModel(
            (drop): Dropout(p=0.05, inplace=False)
            (encoder): Embedding(300, 100)
            (rnn): LSTM(100, 2048)
            (decoder): Linear(in_features=2048, out_features=300, bias=True)
          )
        )
      )
    )
  )
  (decoder): Linear(in_features=4696, out_features=3, bias=True)
  (weights): None
  (weight_tensor) None
)"
2022-11-21 17:05:53,033 ---------------------------------------------

  "There should be no best model saved at epoch 1 except there is a model from previous trainings"


2022-11-21 17:05:53,334 epoch 1 - iter 7/71 - loss 0.02813241 - samples/sec: 925.79 - lr: 0.100000
2022-11-21 17:05:53,545 epoch 1 - iter 14/71 - loss 0.02616650 - samples/sec: 1073.75 - lr: 0.100000
2022-11-21 17:05:53,741 epoch 1 - iter 21/71 - loss 0.02519040 - samples/sec: 1156.79 - lr: 0.100000
2022-11-21 17:05:53,943 epoch 1 - iter 28/71 - loss 0.02460725 - samples/sec: 1123.95 - lr: 0.100000
2022-11-21 17:05:54,144 epoch 1 - iter 35/71 - loss 0.02417782 - samples/sec: 1130.63 - lr: 0.100000
2022-11-21 17:05:54,341 epoch 1 - iter 42/71 - loss 0.02391600 - samples/sec: 1151.71 - lr: 0.100000
2022-11-21 17:05:54,534 epoch 1 - iter 49/71 - loss 0.02373013 - samples/sec: 1175.77 - lr: 0.100000
2022-11-21 17:05:54,726 epoch 1 - iter 56/71 - loss 0.02352508 - samples/sec: 1179.30 - lr: 0.100000
2022-11-21 17:05:54,918 epoch 1 - iter 63/71 - loss 0.02337854 - samples/sec: 1184.35 - lr: 0.100000
2022-11-21 17:05:55,112 epoch 1 - iter 70/71 - loss 0.02329357 - samples/sec: 1169.71 - lr: 0

{'test_score': 0.5301204819277109,
 'dev_score_history': [0.48375451263537905,
  0.46570397111913364,
  0.45126353790613716,
  0.5415162454873647,
  0.5415162454873647,
  0.51985559566787,
  0.46570397111913364,
  0.5234657039711191,
  0.5415162454873647,
  0.4368231046931408,
  0.5018050541516246,
  0.51985559566787,
  0.4981949458483754,
  0.48375451263537905,
  0.49097472924187724,
  0.4981949458483754,
  0.48736462093862815,
  0.49097472924187724,
  0.5018050541516246,
  0.5018050541516246,
  0.5090252707581228,
  0.5018050541516246,
  0.4981949458483754,
  0.5018050541516246,
  0.5018050541516246,
  0.5054151624548736,
  0.5018050541516246,
  0.48736462093862815,
  0.48736462093862815,
  0.49097472924187724,
  0.4981949458483754,
  0.49097472924187724,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  0.49458483754512633,
  

# Step 5: Add TransformerWordEmbeddings to the mix

We will set fine-tuning to False for the transformer (static embeddings). We will be combining multiple static embeddings this time.
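The cell below sets `subtoken_pooling='first'`. A transformer tokenizer may split one word into several sub-tokens, and the pooling option decides how their vectors become one word vector. Here is a toy sketch of the `'first'` and `'mean'` strategies with invented numbers. The split of "Herceptin" into three pieces and the helper `pool_subtokens` are assumptions for illustration, not actual tokenizer or Flair output.

```python
def pool_subtokens(subtoken_vectors, mode="first"):
    """Reduce the vectors of a word's sub-tokens to a single word vector."""
    if mode == "first":
        # Keep only the vector of the first sub-token.
        return list(subtoken_vectors[0])
    if mode == "mean":
        # Average all sub-token vectors component by component.
        count = len(subtoken_vectors)
        dim = len(subtoken_vectors[0])
        return [sum(vec[i] for vec in subtoken_vectors) / count for i in range(dim)]
    raise ValueError(f"unsupported pooling mode: {mode}")


# Hypothetical split of "Herceptin" into three sub-tokens with toy vectors.
subtoken_vectors = [
    [1.0, 0.0],  # "Her"
    [3.0, 2.0],  # "cept"
    [2.0, 4.0],  # "in"
]

first_pooled = pool_subtokens(subtoken_vectors, mode="first")
mean_pooled = pool_subtokens(subtoken_vectors, mode="mean")
print(first_pooled, mean_pooled)
```

For single-word inputs this choice matters more than usual, since the pooled word vector is all the classifier gets to see from the transformer.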

In [None]:
from flair.embeddings.token import TransformerWordEmbeddings
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.embeddings import DocumentPoolEmbeddings
from flair.models import TextPairClassifier
from flair.trainers import ModelTrainer

embedding_stack = StackedEmbeddings([
    WordEmbeddings('en'),
    FlairEmbeddings('en-forward'),
    # subtoken_pooling is worth experimenting with here; you can also try 'mean'
    TransformerWordEmbeddings('xlm-roberta-base',
                              fine_tune=False,
                              subtoken_pooling='first'),
])

# wrap the stack defined above (the variable names must match, otherwise the
# already pooled stack from Step 4 is reused and the transformer is left out)
embedding_stack = DocumentPoolEmbeddings([embedding_stack])

# Step 3: Use text pair classification model
classifier = TextPairClassifier(document_embeddings=embedding_stack,
                                label_type=label_type,
                                label_dictionary=label_dictionary,
                                )

# Step 4: Initialize trainer and train the model
trainer = ModelTrainer(classifier, corpus)

# the transformer is kept static (fine_tune=False), so we call trainer.train() instead of trainer.fine_tune()
trainer.train(base_path='resources/word-pair-test-flair',
              use_final_model_for_eval=False,
              learning_rate=0.1,
              max_epochs=100,
              )

2022-11-21 17:07:32,683 ----------------------------------------------------------------------------------------------------
2022-11-21 17:07:32,686 Model: "TextPairClassifier(
  (loss_function): CrossEntropyLoss()
  (document_embeddings): DocumentPoolEmbeddings(
    fine_tune_mode=none, pooling=mean
    (embeddings): StackedEmbeddings(
      (list_embedding_0): DocumentPoolEmbeddings(
        fine_tune_mode=none, pooling=mean
        (embeddings): StackedEmbeddings(
          (list_embedding_0): StackedEmbeddings(
            (list_embedding_0): WordEmbeddings(
              'en'
              (embedding): Embedding(1000001, 300)
            )
            (list_embedding_1): FlairEmbeddings(
              (lm): LanguageModel(
                (drop): Dropout(p=0.05, inplace=False)
                (encoder): Embedding(300, 100)
                (rnn): LSTM(100, 2048)
                (decoder): Linear(in_features=2048, out_features=300, bias=True)
              )
            )
          )

  "There should be no best model saved at epoch 1 except there is a model from previous trainings"


2022-11-21 17:07:33,105 epoch 1 - iter 7/71 - loss 0.02971298 - samples/sec: 1487.88 - lr: 0.100000
2022-11-21 17:07:33,256 epoch 1 - iter 14/71 - loss 0.02766108 - samples/sec: 1511.02 - lr: 0.100000
2022-11-21 17:07:33,375 epoch 1 - iter 21/71 - loss 0.02654351 - samples/sec: 1917.05 - lr: 0.100000
2022-11-21 17:07:33,491 epoch 1 - iter 28/71 - loss 0.02584137 - samples/sec: 1979.05 - lr: 0.100000
2022-11-21 17:07:33,611 epoch 1 - iter 35/71 - loss 0.02531000 - samples/sec: 1901.32 - lr: 0.100000
2022-11-21 17:07:33,727 epoch 1 - iter 42/71 - loss 0.02493115 - samples/sec: 1980.79 - lr: 0.100000
2022-11-21 17:07:33,848 epoch 1 - iter 49/71 - loss 0.02464105 - samples/sec: 1884.11 - lr: 0.100000
2022-11-21 17:07:33,962 epoch 1 - iter 56/71 - loss 0.02440316 - samples/sec: 2026.27 - lr: 0.100000
2022-11-21 17:07:34,072 epoch 1 - iter 63/71 - loss 0.02418056 - samples/sec: 2086.95 - lr: 0.100000
2022-11-21 17:07:34,185 epoch 1 - iter 70/71 - loss 0.02401489 - samples/sec: 2035.61 - lr: 

{'test_score': 0.5140562248995983,
 'dev_score_history': [0.46570397111913364,
  0.4729241877256318,
  0.5234657039711191,
  0.4620938628158845,
  0.5487364620938628,
  0.5342960288808665,
  0.5451263537906137,
  0.5415162454873647,
  0.4693140794223827,
  0.5451263537906137,
  0.5487364620938628,
  0.5379061371841155,
  0.4620938628158845,
  0.5595667870036101,
  0.5234657039711191,
  0.4548736462093863,
  0.49458483754512633,
  0.49458483754512633,
  0.49097472924187724,
  0.5234657039711191,
  0.5090252707581228,
  0.5342960288808665,
  0.5415162454873647,
  0.51985559566787,
  0.5270758122743683,
  0.5234657039711191,
  0.4981949458483754,
  0.49458483754512633,
  0.5054151624548736,
  0.5126353790613718,
  0.4981949458483754,
  0.4981949458483754,
  0.5126353790613718,
  0.5090252707581228,
  0.5090252707581228,
  0.5090252707581228,
  0.5090252707581228,
  0.5090252707581228,
  0.5054151624548736,
  0.4981949458483754,
  0.4981949458483754,
  0.49458483754512633,
  0.494584837545