# Transfer Learning mit FARM

## Ziele


*   BERT und Transfer Learning
*   Einführung in FARM

*   Wie funktioniert das Fine-Tuning eines BERT-Modells?





## BERT und Transfer Learning
Transfer Learning ist eine Machine Learning-Methode, bei der ein Modell, das für einen bestimmten Task trainiert wurde, als Ausgangspunkt für ein Modell für einen anderen Task verwendet wird.

Die Idee hinter diesem Konzept ist, dass man das Erlernte von einem Problem auf ein anderes überträgt. Wenn man z.B. Java als erste Programmiersprache gelernt hat, fällt es uns leichter, Python oder eine andere Programmiersprache zu erlernen. Denn die grundlegenden Programmierkonzepte aus Java/OOP können leicht auf jede andere Programmiersprache übertragen werden. Man muss also nicht von Null beginnnen, sondern kann auf existierendem Wissen aufbauen. Ähnlich verhält es sich auch mit dem Transfer Learning von Sprachmodellen wie BERT. Existierende BERT-Modelle wurden hauptsächlich auf der Wikipedia trainiert. Möchte man allerdings Textklassifikationen in bestimmten Domänen durchführen, ist die Wikipedia mit ihrem Vokabular nicht repräsentativ genug. Daher nutzt man das Fine-Tuning von vortrainierten BERT-Modellen, um sie an den neuen Task/das neue Problem anzupassen. Durch diesen Schritt erhofft man sich bessere Performance.

## FARM 
(**F**ramework for **A**dapting **R**epresentation **M**odels)
![farm_logo](https://github.com/deepset-ai/FARM/raw/master/docs/img/farm_logo_text_right_wide.png?raw=true)
... ermöglicht uns, das Transfer Learning mit BERT (und anderen Sprachmodellen) umzusetzen. Damit kann man einfach und schnell Modelle für Tasks wie Text Classification, NER und Question Answering erstellen:
![task_table](https://miro.medium.com/max/3840/1*0RYwLSOMegTKfhnyTwkjIQ.png)

# Fine-Tuning eines BERT-Modells mit FARM (adapted from FARM-Tutorial 1)

Welcome to the FARM building blocks tutorial! There are many different ways to make use of this repository, but in this notebook, we will be going through the most import building blocks that will help you harvest the rewards of a successfully trained NLP model.

Happy FARMing!

## 1) Text Classification

GermEval 2018 (GermEval2018) (https://projects.fzai.h-da.de/iggsa/) is an open data set containing texts that need to be classified by whether they are offensive or not. There are a set of coarse and fine labels, but here we will only be looking at the coarse set which labels each example as either OFFENSE or OTHER. To tackle this task, we are going to build a classifier that is composed of Google's BERT language model and a feed forward neural network prediction head.

### Setup

In [1]:
# Install FARM
!pip install torch==1.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
!pip install farm==0.5.0

Looking in links: https://download.pytorch.org/whl/torch_stable.html
Collecting torch==1.6.0+cu101
  Downloading https://download.pytorch.org/whl/cu101/torch-1.6.0%2Bcu101-cp37-cp37m-linux_x86_64.whl (708.0 MB)
[K     |████████████████████████████████| 708.0 MB 11 kB/s 
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0+cu102
    Uninstalling torch-1.9.0+cu102:
      Successfully uninstalled torch-1.9.0+cu102
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.10.0+cu102 requires torch==1.9.0, but you have torch 1.6.0+cu101 which is incompatible.
torchtext 0.10.0 requires torch==1.9.0, but you have torch 1.6.0+cu101 which is incompatible.[0m
Successfully installed torch-1.6.0+cu101
Collecting farm==0.5.0
  Downloading farm-0.5.0-py3-none-any.whl (207 kB)
[K     |█████████████████████

In [2]:
# Here are the imports we need

import torch
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer
from farm.utils import MLFlowLogger
import pandas as pd

09/17/2021 09:49:24 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


In [3]:
# Farm allows simple logging of many parameters & metrics. Let's use the MLflow framework to track our experiment ...
# You will see your results on https://public-mlflow.deepset.ai/
ml_logger = MLFlowLogger(tracking_uri="https://public-mlflow.deepset.ai/")
ml_logger.init_experiment(experiment_name="Public_FARM", run_name="Tutorial1_Colab")



 __          __  _                            _        
 \ \        / / | |                          | |       
  \ \  /\  / /__| | ___ ___  _ __ ___   ___  | |_ ___  
   \ \/  \/ / _ \ |/ __/ _ \| '_ ` _ \ / _ \ | __/ _ \ 
    \  /\  /  __/ | (_| (_) | | | | | |  __/ | || (_) |
     \/  \/ \___|_|\___\___/|_| |_| |_|\___|  \__\___/ 
  ______      _____  __  __  
 |  ____/\   |  __ \|  \/  |              _.-^-._    .--.
 | |__ /  \  | |__) | \  / |           .-'   _   '-. |__|
 |  __/ /\ \ |  _  /| |\/| |          /     |_|     \|  |
 | | / ____ \| | \ \| |  | |         /               \  |
 |_|/_/    \_\_|  \_\_|  |_|        /|     _____     |\ |
                                     |    |==|==|    |  |
|---||---|---|---|---|---|---|---|---|    |--|--|    |  |
|---||---|---|---|---|---|---|---|---|    |==|==|    |  |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 


In [4]:
# We need to fetch the right device to drive the growth of our model
# Make sure that you have gpu turned on in this notebook by going to
# Runtime>Change runtime type and select GPU as Hardware accelerator.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Devices available: {}".format(device))

Devices available: cuda


### Data Handling

In FARM the Processor contains the functions which handle the conversion from file or request to PyTorch Datasets. In essence, it prepares data to be consumed by the modelling components. This is done in stages to allow for easier debugging. It should be able to handle file input or requests. This class contains everything that needs to be customized when adapting a new dataset. Custom datasets can be handled by extending the Processor (e.g. see CONLLProcessor).

The DataSilo is a generic class that stores the train, dev and test data sets. It calls upon the methods from the Processor to do the loading and then exposes a DataLoader for each set. In cases where there is not a separate dev file, it will create one by slicing the train set.
![data_handling](https://farm.deepset.ai/_images/data_silo_no_bg.jpg)

## distilbert-base-uncased-finetuned-sst-2-english

In [None]:
# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer which uses the byte pair encoding method.
# It is currently loaded with a German model

tokenizer = Tokenizer.load(
    pretrained_model_name_or_path="distilbert-base-uncased-finetuned-sst-2-english",
    do_lower_case=False)




09/16/2021 10:09:10 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'DistilBertTokenizer'


In [None]:
# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handling.processor.Processor
# TOXIC = 1
# OTHER = 0
LABEL_LIST = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        warmup = 600,
                                        train_filename="train.tsv",
                                        dev_filename="test_with_solutions.tsv",
                                        test_filename="impermium_verification_labels.tsv",
                                        data_dir="../content",
                                        label_list=LABEL_LIST,
                                        metric="f1_macro",
                                        text_column_name="Comment",
                                        label_column_name="Insult")



In [None]:
# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

09/16/2021 10:09:12 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/16/2021 10:09:12 - INFO - farm.data_handler.data_silo -   Loading train set from: ../content/train.tsv 
09/16/2021 10:09:12 - INFO - farm.data_handler.data_silo -   Got ya 1 parallel workers to convert 3947 dictionaries to pytorch datasets (chunksize = 790)...
09/16/2021 10:09:12 - INFO - farm.data_handler.data_silo -    0 
09/16/2021 10:09:12 - INFO - farm.data_handler.data_silo -   /w\
09/16/2021 10:09:12 - INFO - farm.data_handler.data_silo -   / \
09/16/2021 10:09:12 - INFO - farm.data_handler.data_silo -   
09/16/2021 10:09:14 - INFO - farm.data_handler.processor -   *** Show 2 random examples ***
09/16/2021 10:09:14 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|    

### Modeling

![modeling](https://farm.deepset.ai/_images/adaptive_model_no_bg.jpg)
In FARM, we make a strong distinction between the language model and prediction head so that you can mix and match different building blocks for your needs.

For example, in the transfer learning paradigm, you might have the one language model that you will be using for both document classification and NER. Or perhaps you have a pretrained language model which you would like to adapt to your domain, then use it for a downstream task such as question answering. 

All this is possible within FARM and requires only the replacement of a few modular components, as we shall see below.

Let's first have a look at how we might set up a model.

In [None]:
# The language model is the foundation on which modern NLP systems are built.
# They encapsulate a general understanding of sentence semantics
# and are not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace. 
# The model being loaded is a German model that we trained. 
# You can also change the MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved or download one connected to the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "distilbert-base-uncased-finetuned-sst-2-english"
# MODEL_NAME_OR_PATH = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
language_model = LanguageModel.load(MODEL_NAME_OR_PATH)

09/16/2021 10:09:38 - INFO - farm.modeling.language_model -   Automatically detected language from language model name: english


In [None]:
# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions 

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 2 (the number of classes in the GermEval18 (coarse) dataset)

prediction_head = TextClassificationHead(num_labels=len(LABEL_LIST), class_weights=data_silo.calculate_class_weights(task_name="text_classification"))

09/16/2021 10:09:38 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
09/16/2021 10:09:38 - INFO - farm.modeling.prediction_head -   Using class weights for task 'text_classification': [0.6810969 1.8804762]


In [None]:
# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
# EMBEDS_DROPOUT_PROB = 0.1 distilbert-base-uncased-finetuned-sst-2-english_1.PNG
# EMBEDS_DROPOUT_PROB = 0.01
# EMBEDS_DROPOUT_PROB = 0.2 distilbert-base-uncased-finetuned-sst-2-english_2.PNG
EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)



### Training

In [None]:
# Here we initialize a Bert Adam optimizer that has a linear warmup and warmdown
# Here you can set learning rate, the warmup proportion and number of epochs to train for

#LEARNING_RATE = 2e-5 # distilbert-base-uncased-finetuned-sst-2-english_1.PNG
#LEARNING_RATE = 1e-7
LEARNING_RATE = 1e-5  # distilbert-base-uncased-finetuned-sst-2-english_2.PNG

N_EPOCHS = 3

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    device=device,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

09/16/2021 10:09:38 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 1e-05}'
09/16/2021 10:09:38 - INFO - farm.modeling.optimization -   Using scheduler 'get_linear_schedule_with_warmup'
09/16/2021 10:09:38 - INFO - farm.modeling.optimization -   Loading schedule `get_linear_schedule_with_warmup`: '{'num_warmup_steps': 37.2, 'num_training_steps': 372}'


In [None]:
# Training loop handled by this
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 1

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device, 
)

In [None]:
!nvidia-smi

Thu Sep 16 10:09:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   52C    P0    60W / 149W |   5737MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:

model = trainer.train()

09/16/2021 10:09:39 - INFO - farm.train -   
 

          &&& &&  & &&             _____                   _             
      && &\/&\|& ()|/ @, &&       / ____|                 (_)            
      &\/(/&/&||/& /_/)_&/_&     | |  __ _ __ _____      ___ _ __   __ _ 
   &() &\/&|()|/&\/ '%" & ()     | | |_ | '__/ _ \ \ /\ / / | '_ \ / _` |
  &_\_&&_\ |& |&&/&__%_/_& &&    | |__| | | | (_) \ V  V /| | | | | (_| |
&&   && & &| &| /& & % ()& /&&    \_____|_|  \___/ \_/\_/ |_|_| |_|\__, |
 ()&_---()&\&\|&&-&&--%---()~                                       __/ |
     &&     \|||                                                   |___/
             |||
             |||
             |||
       , -=-~  .-^- _
              `

Train epoch 0/2 (Cur. train loss: 0.8731):  81%|████████  | 100/124 [01:04<00:15,  1.56it/s]
Evaluating:   0%|          | 0/83 [00:00<?, ?it/s][A
Evaluating: 100%|██████████| 83/83 [00:18<00:00,  4.45it/s]
09/16/2021 10:11:02 - INFO - farm.eval -   

\\|//       \\|//  

In [None]:
# Test your model on a sample (Inference)
from farm.infer import Inferencer
from pprint import PrettyPrinter

infer_model = Inferencer(processor=processor, model=model, task_type="text_classification", gpu=True)

basic_texts = [
    {"text": "Martin is an idiot"},
    {"text": "Martin Müller plays Voleyball in Berlin"},
]
result = infer_model.inference_from_dicts(dicts=basic_texts)
PrettyPrinter().pprint(result)


09/16/2021 10:14:49 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None
09/16/2021 10:14:49 - INFO - farm.infer -   Got ya 1 parallel workers to do inference ...
09/16/2021 10:14:49 - INFO - farm.infer -    0 
09/16/2021 10:14:49 - INFO - farm.infer -   /w\
09/16/2021 10:14:49 - INFO - farm.infer -   /'\
09/16/2021 10:14:49 - INFO - farm.infer -   
09/16/2021 10:14:49 - INFO - farm.data_handler.processor -   *** Show 2 random examples ***
09/16/2021 10:14:49 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|                     | |     
    '. /\ .'    | (___   __ _ _ __ ___  _ __ | | ___ 
      "||"       \___ \ / _` | '_ ` _ \| '_ \| |/ _ \ 
       || /\     ____) | (_| | | | | | | |_) | |  __/
    /\ ||//\)   |_____/ \__,_|_| |_| |_| .__/|_|\___|
   (/\||/                             |_|           
______\||/_________________________________________

[{'predictions': [{'context': 'Martin is an idiot',
                   'end': None,
                   'label': '1',
                   'probability': 0.99127847,
                   'start': None},
                  {'context': 'Martin Müller plays Voleyball in Berlin',
                   'end': None,
                   'label': '0',
                   'probability': 0.9728008,
                   'start': None}],
  'task': 'text_classification'}]





In [None]:
#model.save('../data/detecting-insults-in-social-commentary/model')
#processor.save('../data/detecting-insults-in-social-commentary/processor')

## Aufgabe
Verwenden Sie den imdb-Datensatz aus der letzten Sitzung und nutzen Sie ein englisches BERT-Modell (z.B. ```bert-base-uncased```), um es für die Sentiment-Analyse von Film-Reviews finezutunen. Bilden Sie Gruppen und gehen Sie wie folgt vor:

1. Verwenden Sie ein Subset des Datensatzes. Grund: Es ist unklar, wie viel die (kostenlose) Google Colab-GPU zu leisten im Stande ist. Verwenden Sie innerhalb der Gruppe unterschiedliche Samplezahlen des Datensatzes und testen Sie aus, wo die Grenzen liegen. 
2. Adjustieren Sie verschiedene Parameter (Learning Rate, Dropout Rate ...), um Unterschiede in der Performance festzustellen.
3. Speichern Sie das fine-getunte BERT-Modell ab: 
```python
model.save('path/to/directory')
processor.save('path/to/directory')
```


## BERT <br>
https://huggingface.co/bert-base-uncased 

In [None]:
from transformers import AutoTokenizer
import torch
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer
from farm.utils import MLFlowLogger

# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer which uses the byte pair encoding method.
# It is currently loaded with a German model

#tokenizer = Tokenizer.load(
#    pretrained_model_name_or_path="finiteautomata/bertweet-base-sentiment-analysis",
#    do_lower_case=False)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handling.processor.Processor
# TOXIC = 1
# OTHER = 0
LABEL_LIST = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        warmup = 600,
                                        train_filename="train.tsv",
                                        dev_filename="test_with_solutions.tsv",
                                        test_filename="impermium_verification_labels.tsv",
                                        data_dir="../content",
                                        label_list=LABEL_LIST,
                                        metric="f1_macro",
                                        text_column_name="Comment",
                                        label_column_name="Insult")

# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

# The language model is the foundation on which modern NLP systems are built.
# They encapsulate a general understanding of sentence semantics
# and are not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace. 
# The model being loaded is a German model that we trained. 
# You can also change the MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved or download one connected to the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "bert-base-uncased"
# MODEL_NAME_OR_PATH = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
language_model = LanguageModel.load(MODEL_NAME_OR_PATH)

# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions 

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 2 (the number of classes in the GermEval18 (coarse) dataset)

prediction_head = TextClassificationHead(num_labels=len(LABEL_LIST), class_weights=data_silo.calculate_class_weights(task_name="text_classification"))

# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
# EMBEDS_DROPOUT_PROB = 0.1 distilbert-base-uncased-finetuned-sst-2-english_1.PNG
# EMBEDS_DROPOUT_PROB = 0.01
# EMBEDS_DROPOUT_PROB = 0.2 distilbert-base-uncased-finetuned-sst-2-english_2.PNG
EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

# Here we initialize a Bert Adam optimizer that has a linear warmup and warmdown
# Here you can set learning rate, the warmup proportion and number of epochs to train for

#LEARNING_RATE = 2e-5 # distilbert-base-uncased-finetuned-sst-2-english_1.PNG
#LEARNING_RATE = 1e-7
LEARNING_RATE = 1e-5  # distilbert-base-uncased-finetuned-sst-2-english_2.PNG

N_EPOCHS = 3

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    device=device,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

# Training loop handled by this
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 1

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device, 
)

!nvidia-smi

model = trainer.train()

09/16/2021 10:14:50 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/16/2021 10:14:50 - INFO - farm.data_handler.data_silo -   Loading train set from: ../content/train.tsv 
09/16/2021 10:14:50 - INFO - farm.data_handler.data_silo -   Got ya 1 parallel workers to convert 3947 dictionaries to pytorch datasets (chunksize = 790)...
09/16/2021 10:14:50 - INFO - farm.data_handler.data_silo -    0 
09/16/2021 10:14:50 - INFO - farm.data_handler.data_silo -   /w\
09/16/2021 10:14:50 - INFO - farm.data_handler.data_silo -   / \
09/16/2021 10:14:50 - INFO - farm.data_handler.data_silo -   
09/16/2021 10:14:53 - INFO - farm.data_handler.processor -   *** Show 2 random examples ***
09/16/2021 10:14:53 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|    

Thu Sep 16 10:15:25 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    60W / 149W |   5919MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

09/16/2021 10:15:25 - INFO - farm.train -   
 

          &&& &&  & &&             _____                   _             
      && &\/&\|& ()|/ @, &&       / ____|                 (_)            
      &\/(/&/&||/& /_/)_&/_&     | |  __ _ __ _____      ___ _ __   __ _ 
   &() &\/&|()|/&\/ '%" & ()     | | |_ | '__/ _ \ \ /\ / / | '_ \ / _` |
  &_\_&&_\ |& |&&/&__%_/_& &&    | |__| | | | (_) \ V  V /| | | | | (_| |
&&   && & &| &| /& & % ()& /&&    \_____|_|  \___/ \_/\_/ |_|_| |_|\__, |
 ()&_---()&\&\|&&-&&--%---()~                                       __/ |
     &&     \|||                                                   |___/
             |||
             |||
             |||
       , -=-~  .-^- _
              `

Train epoch 0/2 (Cur. train loss: 0.4456):  81%|████████  | 100/124 [02:06<00:30,  1.27s/it]
Evaluating:   0%|          | 0/83 [00:00<?, ?it/s][A
Evaluating:  28%|██▊       | 23/83 [00:10<00:27,  2.21it/s][A
Evaluating:  55%|█████▌    | 46/83 [00:20<00:16,  2.21it/s][

### BERT: k-fold cross validation attempt

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score,recall_score,precision_score
from sklearn.model_selection import cross_val_score
import pandas as pd
from sklearn.model_selection import train_test_split

# splitte die daten einfach manuell für k-fold cross validation
X_train = train.Comment.Values
y_train = train.Insult.Values

# Aufteilen des train.csv Datensatzes für k-fold cross validation

#train = pd.read_csv("../content/train.csv")

#X  = train.Comment.values
#y = train.Insult.values

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [27]:
logger = logging.getLogger(__name__)
xval_folds = 5
xval_stratification = True

set_all_seeds(seed=42)
device, n_gpu = initialize_device_settings(use_cuda=True)
n_epochs = 20
batch_size = 32
evaluate_every = 100
dev_split = 0.1
# For xval the dev_stratification parameter must not be None: with None, the devset cannot be created
# using the default method of only splitting by the available chunks as initial train set for each fold
# is just a single chunk!
dev_stratification = True
lang_model = "bert-base-uncased"
do_lower_case = True
use_amp = None

09/17/2021 11:03:58 - INFO - farm.utils -   device: cuda n_gpu: 1, distributed training: False, automatic mixed precision training: None


In [28]:
# 1.Create a tokenizer
tokenizer = Tokenizer.load(
    pretrained_model_name_or_path=lang_model,
    do_lower_case=do_lower_case)

def mymetrics(preds, labels):
    acc = simple_accuracy(preds, labels).get("acc")
    f1other = f1_score(y_true=labels, y_pred=preds, pos_label="0")
    f1offense = f1_score(y_true=labels, y_pred=preds, pos_label="1")
    f1macro = f1_score(y_true=labels, y_pred=preds, average="macro")
    f1micro = f1_score(y_true=labels, y_pred=preds, average="micro")
    mcc = matthews_corrcoef(labels, preds)
    return {
        "acc": acc,
        "f1_other": f1other,
        "f1_offense": f1offense,
        "f1_macro": f1macro,
        "f1_micro": f1micro,
        "mcc": mcc
    }
register_metrics('mymetrics', mymetrics)
metric = 'mymetrics'

label_list = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        data_dir=Path("../content"),
                                        train_filename="train.tsv",
                                        dev_filename="test_with_solutions.tsv",
                                        test_filename="impermium_verification_labels.tsv",
                                        label_list=label_list,
                                        metric=metric,
                                        dev_split=dev_split,
                                        dev_stratification=dev_stratification,
                                        label_column_name="Insult",
                                        text_column_name="Comment"
                                        )

09/17/2021 11:04:01 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'BertTokenizer'


In [29]:
data_silo = DataSilo(
    processor=processor,
    batch_size=batch_size)

silos = DataSiloForCrossVal.make(data_silo,
                                  sets=["train", "dev"],
                                  n_splits=xval_folds)

09/17/2021 11:04:04 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/17/2021 11:04:04 - INFO - farm.data_handler.data_silo -   Loading train set from: ../content/train.tsv 
09/17/2021 11:04:04 - INFO - farm.data_handler.data_silo -   Got ya 1 parallel workers to convert 3947 dictionaries to pytorch datasets (chunksize = 790)...
09/17/2021 11:04:04 - INFO - farm.data_handler.data_silo -    0 
09/17/2021 11:04:04 - INFO - farm.data_handler.data_silo -   /w\
09/17/2021 11:04:04 - INFO - farm.data_handler.data_silo -   /'\
09/17/2021 11:04:04 - INFO - farm.data_handler.data_silo -   
09/17/2021 11:04:07 - INFO - farm.data_handler.processor -   *** Show 2 random examples ***
09/17/2021 11:04:07 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|    

In [30]:
def train_on_split(silo_to_use, n_fold, save_dir):
        logger.info(f"############ Crossvalidation: Fold {n_fold} of {xval_folds} ############")
        logger.info(f"Fold training   samples: {len(silo_to_use.data['train'])}")
        logger.info(f"Fold dev        samples: {len(silo_to_use.data['dev'])}")
        logger.info(f"Fold testing    samples: {len(silo_to_use.data['test'])}")
        logger.info( "Total number of samples: "
                    f"{len(silo_to_use.data['train'])+len(silo_to_use.data['dev'])+len(silo_to_use.data['test'])}")

        # Create an AdaptiveModel
        # a) which consists of a pretrained language model as a basis
        language_model = LanguageModel.load(lang_model)
        # b) and a prediction head on top that is suited for our task => Text classification
        prediction_head = TextClassificationHead(
            class_weights=data_silo.calculate_class_weights(task_name="text_classification"),
            num_labels=len(label_list))

        model = AdaptiveModel(
            language_model=language_model,
            prediction_heads=[prediction_head],
            embeds_dropout_prob=0.2,
            lm_output_types=["per_sequence"],
            device=device)

        # Create an optimizer
        model, optimizer, lr_schedule = initialize_optimizer(
            model=model,
            learning_rate=0.5e-5,
            device=device,
            n_batches=len(silo_to_use.loaders["train"]),
            n_epochs=n_epochs,
            use_amp=use_amp)

        # Feed everything to the Trainer, which keeps care of growing our model into powerful plant and evaluates it from time to time
        # Also create an EarlyStopping instance and pass it on to the trainer

        # An early stopping instance can be used to save the model that performs best on the dev set
        # according to some metric and stop training when no improvement is happening for some iterations.
        # NOTE: Using a different save directory for each fold, allows us afterwards to use the
        # nfolds best models in an ensemble!
        save_dir = Path(str(save_dir) + f"-{n_fold}")
        earlystopping = EarlyStopping(
            metric="f1_offense", mode="max",   # use the metric from our own metrics function instead of loss
            save_dir=save_dir,  # where to save the best model
            patience=5    # number of evaluations to wait for improvement before terminating the training
        )

        trainer = Trainer(
            model=model,
            optimizer=optimizer,
            data_silo=silo_to_use,
            epochs=n_epochs,
            n_gpu=n_gpu,
            lr_schedule=lr_schedule,
            evaluate_every=evaluate_every,
            device=device,
            early_stopping=earlystopping,
            evaluator_test=False)

        # train it
        trainer.train()

        return trainer.model

In [None]:
allresults = []
bestfold = None
bestf1_offense = -1
save_dir = Path("saved_models/bert-doc-tutorial-es")
for num_fold, silo in enumerate(silos):
    mlflow.start_run(run_name=f"fold-{num_fold + 1}-of-{len(silos)}", nested=True)
    model = train_on_split(silo, num_fold, save_dir)

    # do eval on test set here (and not in Trainer),
    #  so that we can easily store the actual preds and labels for a "global" eval across all folds.
    evaluator_test = Evaluator(
        data_loader=silo.get_data_loader("test"),
        tasks=silo.processor.tasks,
        device=device
    )
    result = evaluator_test.eval(model, return_preds_and_labels=True)
    evaluator_test.log_results(result, "Test", steps=len(silo.get_data_loader("test")), num_fold=num_fold)

    allresults.append(result)

    # keep track of best fold
    f1_offense = result[0]["f1_offense"]
    if f1_offense > bestf1_offense:
        bestf1_offense = f1_offense
        bestfold = num_fold
    mlflow.end_run()
    # emtpy cache to avoid memory leak and cuda OOM across multiple folds
    model.cpu()
    torch.cuda.empty_cache()

09/17/2021 11:04:45 - INFO - __main__ -   ############ Crossvalidation: Fold 0 of 5 ############
09/17/2021 11:04:45 - INFO - __main__ -   Fold training   samples: 4748
09/17/2021 11:04:45 - INFO - __main__ -   Fold dev        samples: 527
09/17/2021 11:04:45 - INFO - __main__ -   Fold testing    samples: 1319
09/17/2021 11:04:45 - INFO - __main__ -   Total number of samples: 6594
09/17/2021 11:04:45 - INFO - filelock -   Lock 139741784311376 acquired on /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157.lock


Downloading:   0%|          | 0.00/440M [00:00<?, ?B/s]

09/17/2021 11:05:05 - INFO - filelock -   Lock 139741784311376 released on /root/.cache/torch/transformers/f2ee78bdd635b758cc0a12352586868bef80e47401abe4c4fcc3832421e7338b.36ca03ab34a1a5d5fa7bc3d03d55c4fa650fed07220e2eeebc06ce58d0e9a157.lock
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
09/17/2021 11:05:08 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
09/17/2021 11:05:08 - INFO - farm.modeling.prediction_head -   Using class weights for task 'text_classification': [0.6810969 1.8804762]
09/17/2021 11:05:11 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 5e-06}'
09/17/2021 11:05:12 - INFO - farm.modeling.optimization -   Using scheduler 'get_linear_schedule_with_warmup'
09/17/2021 11:05:12 - INFO - farm.modeling.optimization -   Loading schedule `get_linear_schedule_with_warmup`: '{'num_warmup_steps'

In [None]:
X_train.shape, y_train.shape

((3157,), (3157,))

## RoBERTa <br>
https://huggingface.co/roberta-base 

In [None]:
from transformers import AutoTokenizer
import torch
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer
from farm.utils import MLFlowLogger

# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer which uses the byte pair encoding method.
# It is currently loaded with a German model

#tokenizer = Tokenizer.load(
#    pretrained_model_name_or_path="finiteautomata/bertweet-base-sentiment-analysis",
#    do_lower_case=False)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handling.processor.Processor
# TOXIC = 1
# OTHER = 0
LABEL_LIST = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        warmup = 600,
                                        train_filename="train.tsv",
                                        dev_filename="test_with_solutions.tsv",
                                        test_filename="impermium_verification_labels.tsv",
                                        data_dir="../content",
                                        label_list=LABEL_LIST,
                                        metric="f1_macro",
                                        text_column_name="Comment",
                                        label_column_name="Insult")

# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

# The language model is the foundation on which modern NLP systems are built.
# They encapsulate a general understanding of sentence semantics
# and are not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace. 
# The model being loaded is a German model that we trained. 
# You can also change the MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved or download one connected to the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "roberta-base"
# MODEL_NAME_OR_PATH = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
language_model = LanguageModel.load(MODEL_NAME_OR_PATH)

# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions 

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 2 (the number of classes in the GermEval18 (coarse) dataset)

prediction_head = TextClassificationHead(num_labels=len(LABEL_LIST), class_weights=data_silo.calculate_class_weights(task_name="text_classification"))

# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
# EMBEDS_DROPOUT_PROB = 0.1 distilbert-base-uncased-finetuned-sst-2-english_1.PNG
# EMBEDS_DROPOUT_PROB = 0.01
# EMBEDS_DROPOUT_PROB = 0.2 distilbert-base-uncased-finetuned-sst-2-english_2.PNG
EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

# Here we initialize a Bert Adam optimizer that has a linear warmup and warmdown
# Here you can set learning rate, the warmup proportion and number of epochs to train for

#LEARNING_RATE = 2e-5 # distilbert-base-uncased-finetuned-sst-2-english_1.PNG
#LEARNING_RATE = 1e-7
LEARNING_RATE = 1e-5  # distilbert-base-uncased-finetuned-sst-2-english_2.PNG

N_EPOCHS = 3

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    device=device,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

# Training loop handled by this
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 1

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device, 
)

!nvidia-smi

model = trainer.train()

09/16/2021 10:25:36 - INFO - filelock -   Lock 139649219672016 acquired on /root/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690.lock


Downloading:   0%|          | 0.00/481 [00:00<?, ?B/s]

09/16/2021 10:25:37 - INFO - filelock -   Lock 139649219672016 released on /root/.cache/torch/transformers/e1a2a406b5a05063c31f4dfdee7608986ba7c6393f7f79db5e69dcd197208534.117c81977c5979de8c088352e74ec6e70f5c66096c28b61d3c50101609b39690.lock
09/16/2021 10:25:37 - INFO - filelock -   Lock 139649218496720 acquired on /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock


Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

09/16/2021 10:25:38 - INFO - filelock -   Lock 139649218496720 released on /root/.cache/torch/transformers/d0c5776499adc1ded22493fae699da0971c1ee4c2587111707a4d177d20257a2.ef00af9e673c7160b4d41cfda1f48c5f4cba57d5142754525572a846a1ab1b9b.lock
09/16/2021 10:25:38 - INFO - filelock -   Lock 139649522597264 acquired on /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock


Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

09/16/2021 10:25:39 - INFO - filelock -   Lock 139649522597264 released on /root/.cache/torch/transformers/b35e7cd126cd4229a746b5d5c29a749e8e84438b14bcdb575950584fe33207e8.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda.lock
09/16/2021 10:25:39 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/16/2021 10:25:39 - INFO - farm.data_handler.data_silo -   Loading train set from: ../content/train.tsv 
09/16/2021 10:25:39 - INFO - farm.data_handler.data_silo -   Got ya 1 parallel workers to convert 3947 dictionaries to pytorch datasets (chunksize = 790)...
09/16/2021 10:25:39 - INFO - farm.data_handler.data_silo -    0 
09/16/2021 10:25:39 - INFO - farm.data_handler.data_silo -   /w\
09/16/2021 10:25:39 - INFO - farm.data_handler.data_silo -   / \
09/16/2021 10:25:39 - INFO - farm.data_handler.data_silo 

Downloading:   0%|          | 0.00/501M [00:00<?, ?B/s]

09/16/2021 10:26:25 - INFO - filelock -   Lock 139649147428560 released on /root/.cache/torch/transformers/80b4a484eddeb259bec2f06a6f2f05d90934111628e0e1c09a33bd4a121358e1.49b88ba7ec2c26a7558dda98ca3884c3b80fa31cf43a1b1f23aef3ff81ba344e.lock
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
09/16/2021 10:26:28 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
09/16/2021 10:26:28 - INFO - farm.modeling.prediction_head -   Using class weights for task 'text_classification': [0.67650837 1.9163636 ]
09/16/2021 10:26:28 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 1e-05}'
09/16/2021 10:26:28 - INFO - farm.modeling.optimization -   Using scheduler 'get_linear_schedule_with_warmup'
09/16/2021 10:26:28 - INFO - farm.modeling.optimization -   Loading schedule `get_linear_schedule_with_warmup`: '{'num_warmup_step

Thu Sep 16 10:26:29 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P0    61W / 149W |   6429MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

09/16/2021 10:26:29 - INFO - farm.train -   
 

          &&& &&  & &&             _____                   _             
      && &\/&\|& ()|/ @, &&       / ____|                 (_)            
      &\/(/&/&||/& /_/)_&/_&     | |  __ _ __ _____      ___ _ __   __ _ 
   &() &\/&|()|/&\/ '%" & ()     | | |_ | '__/ _ \ \ /\ / / | '_ \ / _` |
  &_\_&&_\ |& |&&/&__%_/_& &&    | |__| | | | (_) \ V  V /| | | | | (_| |
&&   && & &| &| /& & % ()& /&&    \_____|_|  \___/ \_/\_/ |_|_| |_|\__, |
 ()&_---()&\&\|&&-&&--%---()~                                       __/ |
     &&     \|||                                                   |___/
             |||
             |||
             |||
       , -=-~  .-^- _
              `

Train epoch 0/2 (Cur. train loss: 0.4021): 100%|██████████| 99/99 [02:05<00:00,  1.27s/it]
Train epoch 1/2 (Cur. train loss: 0.2414):   1%|          | 1/99 [00:01<02:01,  1.24s/it]
Evaluating:   0%|          | 0/25 [00:00<?, ?it/s][A
Evaluating: 100%|██████████| 25/25 [

## ALBERT <br>
https://huggingface.co/albert-base-v2

In [None]:
from transformers import AutoTokenizer
import torch
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer
from farm.utils import MLFlowLogger

# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer which uses the byte pair encoding method.
# It is currently loaded with a German model

#tokenizer = Tokenizer.load(
#    pretrained_model_name_or_path="finiteautomata/bertweet-base-sentiment-analysis",
#    do_lower_case=False)

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handling.processor.Processor
# TOXIC = 1
# OTHER = 0
LABEL_LIST = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        warmup = 600,
                                        train_filename="train.tsv",
                                        dev_filename="test_with_solutions.tsv",
                                        test_filename="impermium_verification_labels.tsv",
                                        data_dir="../content",
                                        label_list=LABEL_LIST,
                                        metric="f1_macro",
                                        text_column_name="Comment",
                                        label_column_name="Insult")

# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

# The language model is the foundation on which modern NLP systems are built.
# They encapsulate a general understanding of sentence semantics
# and are not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace. 
# The model being loaded is a German model that we trained. 
# You can also change the MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved or download one connected to the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "albert-base-v2"
# MODEL_NAME_OR_PATH = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
language_model = LanguageModel.load(MODEL_NAME_OR_PATH)

# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions 

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 2 (the number of classes in the GermEval18 (coarse) dataset)

prediction_head = TextClassificationHead(num_labels=len(LABEL_LIST), class_weights=data_silo.calculate_class_weights(task_name="text_classification"))

# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
# EMBEDS_DROPOUT_PROB = 0.1 distilbert-base-uncased-finetuned-sst-2-english_1.PNG
# EMBEDS_DROPOUT_PROB = 0.01
# EMBEDS_DROPOUT_PROB = 0.2 distilbert-base-uncased-finetuned-sst-2-english_2.PNG
EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

# Here we initialize a Bert Adam optimizer that has a linear warmup and warmdown
# Here you can set learning rate, the warmup proportion and number of epochs to train for

#LEARNING_RATE = 2e-5 # distilbert-base-uncased-finetuned-sst-2-english_1.PNG
#LEARNING_RATE = 1e-7
LEARNING_RATE = 1e-5  # distilbert-base-uncased-finetuned-sst-2-english_2.PNG

N_EPOCHS = 3

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    device=device,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

# Training loop handled by this
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 1

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device, 
)

!nvidia-smi

model = trainer.train()

09/16/2021 10:33:39 - INFO - filelock -   Lock 139649145630416 acquired on /root/.cache/torch/transformers/0bbb1531ce82f042a813219ffeed7a1fa1f44cd8f78a652c47fc5311e0d40231.978ff53dd976bbf4bc66f09bf4205da0542be753d025263787842df74d15bbca.lock


Downloading:   0%|          | 0.00/684 [00:00<?, ?B/s]

09/16/2021 10:33:40 - INFO - filelock -   Lock 139649145630416 released on /root/.cache/torch/transformers/0bbb1531ce82f042a813219ffeed7a1fa1f44cd8f78a652c47fc5311e0d40231.978ff53dd976bbf4bc66f09bf4205da0542be753d025263787842df74d15bbca.lock
09/16/2021 10:33:40 - INFO - filelock -   Lock 139649145661200 acquired on /root/.cache/torch/transformers/dd1588b85b6fdce1320e224d29ad062e97588e17326b9d05a0b29ee84b8f5f93.c81d4deb77aec08ce575b7a39a989a79dd54f321bfb82c2b54dd35f52f8182cf.lock


Downloading:   0%|          | 0.00/760k [00:00<?, ?B/s]

09/16/2021 10:33:41 - INFO - filelock -   Lock 139649145661200 released on /root/.cache/torch/transformers/dd1588b85b6fdce1320e224d29ad062e97588e17326b9d05a0b29ee84b8f5f93.c81d4deb77aec08ce575b7a39a989a79dd54f321bfb82c2b54dd35f52f8182cf.lock
09/16/2021 10:33:41 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/16/2021 10:33:41 - INFO - farm.data_handler.data_silo -   Loading train set from: ../content/train.tsv 
09/16/2021 10:33:41 - INFO - farm.data_handler.data_silo -   Got ya 1 parallel workers to convert 3947 dictionaries to pytorch datasets (chunksize = 790)...
09/16/2021 10:33:41 - INFO - farm.data_handler.data_silo -    0 
09/16/2021 10:33:41 - INFO - farm.data_handler.data_silo -   /w\
09/16/2021 10:33:41 - INFO - farm.data_handler.data_silo -   /'\
09/16/2021 10:33:41 - INFO - farm.data_handler.data_silo 

Downloading:   0%|          | 0.00/47.4M [00:00<?, ?B/s]

09/16/2021 10:34:02 - INFO - filelock -   Lock 139649522373072 released on /root/.cache/torch/transformers/c7c1b2b621933bfa9a5f6ed18b1d6dc2f445001779b13d37286a806117ebeb10.ab806923413c2af99835e13fdbb6014b24af86b0de8edc2d71ef5c646fc54f24.lock
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
09/16/2021 10:34:02 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
09/16/2021 10:34:02 - INFO - farm.modeling.prediction_head -   Using class weights for task 'text_classification': [0.6791237 1.8956834]
09/16/2021 10:34:02 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 1e-05}'
09/16/2021 10:34:03 - INFO - farm.modeling.optimization -   Using scheduler 'get_linear_schedule_with_warmup'
09/16/2021 10:34:03 - INFO - farm.modeling.optimization -   Loading schedule `get_linear_schedule_with_warmup`: '{'num_warmup_steps'

Thu Sep 16 10:34:03 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0    60W / 149W |   7021MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

09/16/2021 10:34:03 - INFO - farm.train -   
 

          &&& &&  & &&             _____                   _             
      && &\/&\|& ()|/ @, &&       / ____|                 (_)            
      &\/(/&/&||/& /_/)_&/_&     | |  __ _ __ _____      ___ _ __   __ _ 
   &() &\/&|()|/&\/ '%" & ()     | | |_ | '__/ _ \ \ /\ / / | '_ \ / _` |
  &_\_&&_\ |& |&&/&__%_/_& &&    | |__| | | | (_) \ V  V /| | | | | (_| |
&&   && & &| &| /& & % ()& /&&    \_____|_|  \___/ \_/\_/ |_|_| |_|\__, |
 ()&_---()&\&\|&&-&&--%---()~                                       __/ |
     &&     \|||                                                   |___/
             |||
             |||
             |||
       , -=-~  .-^- _
              `

Train epoch 0/2 (Cur. train loss: 0.4100): 100%|██████████| 99/99 [02:08<00:00,  1.30s/it]
Train epoch 1/2 (Cur. train loss: 0.4458):   1%|          | 1/99 [00:01<02:07,  1.30s/it]
Evaluating:   0%|          | 0/25 [00:00<?, ?it/s][A
Evaluating: 100%|██████████| 25/25 [

## DistilBERT <br>
https://huggingface.co/distilbert-base-uncased

In [None]:
from transformers import AutoTokenizer
import torch
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer
from farm.utils import MLFlowLogger

# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer which uses the byte pair encoding method.
# It is currently loaded with a German model

#tokenizer = Tokenizer.load(
#    pretrained_model_name_or_path="finiteautomata/bertweet-base-sentiment-analysis",
#    do_lower_case=False)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handling.processor.Processor
# TOXIC = 1
# OTHER = 0
LABEL_LIST = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        warmup = 600,
                                        train_filename="train.tsv",
                                        dev_filename="test_with_solutions.tsv",
                                        test_filename="impermium_verification_labels.tsv",
                                        data_dir="../content",
                                        label_list=LABEL_LIST,
                                        metric="f1_macro",
                                        text_column_name="Comment",
                                        label_column_name="Insult")

# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

# The language model is the foundation on which modern NLP systems are built.
# They encapsulate a general understanding of sentence semantics
# and are not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace. 
# The model being loaded is a German model that we trained. 
# You can also change the MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved or download one connected to the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "distilbert-base-uncased"
# MODEL_NAME_OR_PATH = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
language_model = LanguageModel.load(MODEL_NAME_OR_PATH)

# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions 

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 2 (the number of classes in the GermEval18 (coarse) dataset)

prediction_head = TextClassificationHead(num_labels=len(LABEL_LIST), class_weights=data_silo.calculate_class_weights(task_name="text_classification"))

# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
# EMBEDS_DROPOUT_PROB = 0.1 distilbert-base-uncased-finetuned-sst-2-english_1.PNG
# EMBEDS_DROPOUT_PROB = 0.01
# EMBEDS_DROPOUT_PROB = 0.2 distilbert-base-uncased-finetuned-sst-2-english_2.PNG
EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

# Here we initialize a Bert Adam optimizer that has a linear warmup and warmdown
# Here you can set learning rate, the warmup proportion and number of epochs to train for

#LEARNING_RATE = 2e-5 # distilbert-base-uncased-finetuned-sst-2-english_1.PNG
#LEARNING_RATE = 1e-7
LEARNING_RATE = 1e-5  # distilbert-base-uncased-finetuned-sst-2-english_2.PNG

N_EPOCHS = 3

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    device=device,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

# Training loop handled by this
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 1

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device, 
)

!nvidia-smi

model = trainer.train()

09/16/2021 10:41:37 - INFO - filelock -   Lock 139649146410704 acquired on /root/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.8949e27aafafa845a18d98a0e3a88bc2d248bbc32a1b75947366664658f23b1c.lock


Downloading:   0%|          | 0.00/442 [00:00<?, ?B/s]

09/16/2021 10:41:37 - INFO - filelock -   Lock 139649146410704 released on /root/.cache/torch/transformers/a41e817d5c0743e29e86ff85edc8c257e61bc8d88e4271bb1b243b6e7614c633.8949e27aafafa845a18d98a0e3a88bc2d248bbc32a1b75947366664658f23b1c.lock
09/16/2021 10:41:38 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/16/2021 10:41:38 - INFO - farm.data_handler.data_silo -   Loading train set from: ../content/train.tsv 
09/16/2021 10:41:38 - INFO - farm.data_handler.data_silo -   Got ya 1 parallel workers to convert 3947 dictionaries to pytorch datasets (chunksize = 790)...
09/16/2021 10:41:38 - INFO - farm.data_handler.data_silo -    0 
09/16/2021 10:41:38 - INFO - farm.data_handler.data_silo -   /w\
09/16/2021 10:41:38 - INFO - farm.data_handler.data_silo -   /'\
09/16/2021 10:41:38 - INFO - farm.data_handler.data_silo 

Downloading:   0%|          | 0.00/268M [00:00<?, ?B/s]

09/16/2021 10:42:13 - INFO - filelock -   Lock 139649522998864 released on /root/.cache/torch/transformers/ae9df7a8d658c4f3e1917a471a8a21cf678fa1d4cb91e7702dfe0598dbdcf354.c2015533705b9dff680ae707e205a35e2860e8d148b45d35085419d74fe57ac5.lock
	 We guess it's an *ENGLISH* model ... 
	 If not: Init the language model by supplying the 'language' param.
09/16/2021 10:42:14 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
09/16/2021 10:42:14 - INFO - farm.modeling.prediction_head -   Using class weights for task 'text_classification': [0.68979055 1.8172414 ]
09/16/2021 10:42:15 - INFO - farm.modeling.optimization -   Loading optimizer `TransformersAdamW`: '{'correct_bias': False, 'weight_decay': 0.01, 'lr': 1e-05}'
09/16/2021 10:42:15 - INFO - farm.modeling.optimization -   Using scheduler 'get_linear_schedule_with_warmup'
09/16/2021 10:42:15 - INFO - farm.modeling.optimization -   Loading schedule `get_linear_schedule_with_warmup`: '{'num_warmup_step

Thu Sep 16 10:42:15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   51C    P0    60W / 149W |   7165MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

09/16/2021 10:42:15 - INFO - farm.train -   
 

          &&& &&  & &&             _____                   _             
      && &\/&\|& ()|/ @, &&       / ____|                 (_)            
      &\/(/&/&||/& /_/)_&/_&     | |  __ _ __ _____      ___ _ __   __ _ 
   &() &\/&|()|/&\/ '%" & ()     | | |_ | '__/ _ \ \ /\ / / | '_ \ / _` |
  &_\_&&_\ |& |&&/&__%_/_& &&    | |__| | | | (_) \ V  V /| | | | | (_| |
&&   && & &| &| /& & % ()& /&&    \_____|_|  \___/ \_/\_/ |_|_| |_|\__, |
 ()&_---()&\&\|&&-&&--%---()~                                       __/ |
     &&     \|||                                                   |___/
             |||
             |||
             |||
       , -=-~  .-^- _
              `

Train epoch 0/2 (Cur. train loss: 0.5709): 100%|██████████| 99/99 [01:02<00:00,  1.57it/s]
Train epoch 1/2 (Cur. train loss: 0.2287):   1%|          | 1/99 [00:00<01:21,  1.21it/s]
Evaluating: 100%|██████████| 25/25 [00:05<00:00,  4.58it/s]
09/16/2021 10:43:25 - INFO - f

### Optimizing DistilBERT

In [None]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments, AutoModelForMaskedLM

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased")

trainer = Trainer(
    model=model,                            # the instantiated 🤗 Transformers model to be trained
    args=training_args,                     # training arguments, defined above
    train_dataset="train.tsv",              # training dataset
    eval_dataset="test_with_solutions.tsv"  # evaluation dataset
)

trainer.train()

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1 [00:00<?, ?it/s]

TypeError: ignored

## XLMRoBERTa <br>
https://huggingface.co/xlm-roberta-base 

In [None]:
from transformers import AutoTokenizer
import torch
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer
from farm.utils import MLFlowLogger

# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer which uses the byte pair encoding method.
# It is currently loaded with a German model

#tokenizer = Tokenizer.load(
#    pretrained_model_name_or_path="finiteautomata/bertweet-base-sentiment-analysis",
#    do_lower_case=False)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handling.processor.Processor
# TOXIC = 1
# OTHER = 0
LABEL_LIST = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        warmup = 600,
                                        train_filename="train.tsv",
                                        dev_filename="test_with_solutions.tsv",
                                        test_filename="impermium_verification_labels.tsv",
                                        data_dir="../content",
                                        label_list=LABEL_LIST,
                                        metric="f1_macro",
                                        text_column_name="Comment",
                                        label_column_name="Insult")

# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

# The language model is the foundation on which modern NLP systems are built.
# They encapsulate a general understanding of sentence semantics
# and are not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace. 
# The model being loaded is a German model that we trained. 
# You can also change the MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved or download one connected to the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "xlm-roberta-base"
# MODEL_NAME_OR_PATH = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
language_model = LanguageModel.load(MODEL_NAME_OR_PATH)

# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions 

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 2 (the number of classes in the GermEval18 (coarse) dataset)

prediction_head = TextClassificationHead(num_labels=len(LABEL_LIST), class_weights=data_silo.calculate_class_weights(task_name="text_classification"))

# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
# EMBEDS_DROPOUT_PROB = 0.1 distilbert-base-uncased-finetuned-sst-2-english_1.PNG
# EMBEDS_DROPOUT_PROB = 0.01
# EMBEDS_DROPOUT_PROB = 0.2 distilbert-base-uncased-finetuned-sst-2-english_2.PNG
EMBEDS_DROPOUT_PROB = 0.1

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

# Here we initialize a Bert Adam optimizer that has a linear warmup and warmdown
# Here you can set learning rate, the warmup proportion and number of epochs to train for

#LEARNING_RATE = 2e-5 # distilbert-base-uncased-finetuned-sst-2-english_1.PNG
#LEARNING_RATE = 1e-7
LEARNING_RATE = 1e-5  # distilbert-base-uncased-finetuned-sst-2-english_2.PNG

N_EPOCHS = 3

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    device=device,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

# Training loop handled by this
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 1

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device, 
)

!nvidia-smi

model = trainer.train()

ModuleNotFoundError: ignored

## XLNet Cased <br>
https://huggingface.co/xlnet-base-cased 

In [None]:
import torch
from farm.modeling.tokenization import Tokenizer
from farm.data_handler.processor import TextClassificationProcessor
from farm.data_handler.data_silo import DataSilo
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer
from farm.utils import MLFlowLogger

# Here we initialize a tokenizer that will be used for preprocessing text
# This is the BERT Tokenizer which uses the byte pair encoding method.
# It is currently loaded with a German model

#tokenizer = Tokenizer.load(
#    pretrained_model_name_or_path="finiteautomata/bertweet-base-sentiment-analysis",
#    do_lower_case=False)

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

# In order to prepare the data for the model, we need a set of
# functions to transform data files into PyTorch Datasets.
# We group these together in Processor objects.
# We will need a new Processor object for each new source of data.
# The abstract class can be found in farm.data_handling.processor.Processor
# TOXIC = 1
# OTHER = 0
LABEL_LIST = ["0", "1"]
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        warmup = 600,
                                        train_filename="train.tsv",
                                        test_filename="test_with_solutions.tsv",
                                        data_dir="../content",
                                        label_list=LABEL_LIST,
                                        metric="f1_macro",
                                        text_column_name="Comment",
                                        label_column_name="Insult")

# We need a DataSilo in order to keep our train, dev and test sets separate.
# The DataSilo will call the functions in the Processor to generate these sets.
# From the DataSilo, we can fetch a PyTorch DataLoader object which will
# be passed on to the model.
# Here is a good place to define a batch size for the model

BATCH_SIZE = 32

data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)

# The language model is the foundation on which modern NLP systems are built.
# They encapsulate a general understanding of sentence semantics
# and are not specific to any one task.

# Here we are using Google's BERT model as implemented by HuggingFace. 
# The model being loaded is a German model that we trained. 
# You can also change the MODEL_NAME_OR_PATH to point to a BERT model that you
# have saved or download one connected to the HuggingFace repository.
# See farm.modeling.language_model.PRETRAINED_MODEL_ARCHIVE_MAP for a list of
# available models

MODEL_NAME_OR_PATH = "xlnet-base-cased"
# MODEL_NAME_OR_PATH = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
language_model = LanguageModel.load(MODEL_NAME_OR_PATH)

# A prediction head is a model that processes the output of the language model
# for a specific task.
# Prediction heads will look different depending on whether you're doing text classification
# Named Entity Recognition (NER), question answering or some other task.
# They should generate logits over the available prediction classes and contain methods
# to convert these logits to losses or predictions 

# Here we use TextClassificationHead which receives a single fixed length sentence vector
# and processes it using a feed forward neural network. layer_dims is a list of dimensions:
# [input_dims, hidden_1_dims, hidden_2_dims ..., output_dims]

# Here by default we have a single layer network.
# It takes in a vector of length 768 (the default size of BERT's output).
# It outputs a vector of length 2 (the number of classes in the GermEval18 (coarse) dataset)

prediction_head = TextClassificationHead(num_labels=len(LABEL_LIST), class_weights=data_silo.calculate_class_weights(task_name="text_classification"))

# The language model and prediction head are coupled together in the Adaptive Model.
# This class takes care of model saving and loading and also coordinates
# cases where there is more than one prediction head.

# EMBEDS_DROPOUT_PROB is the probability that an element of the output vector from the
# language model will be set to zero.
# EMBEDS_DROPOUT_PROB = 0.1 distilbert-base-uncased-finetuned-sst-2-english_1.PNG
# EMBEDS_DROPOUT_PROB = 0.01
# EMBEDS_DROPOUT_PROB = 0.2 distilbert-base-uncased-finetuned-sst-2-english_2.PNG
EMBEDS_DROPOUT_PROB = 0.01

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

# Here we initialize a Bert Adam optimizer that has a linear warmup and warmdown
# Here you can set learning rate, the warmup proportion and number of epochs to train for

#LEARNING_RATE = 2e-5 # distilbert-base-uncased-finetuned-sst-2-english_1.PNG
#LEARNING_RATE = 1e-7
LEARNING_RATE = 1e-5  # distilbert-base-uncased-finetuned-sst-2-english_2.PNG

N_EPOCHS = 3

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    device=device,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS)

# Training loop handled by this
# It will also trigger evaluation during training using the dev data
# and after training using the test data.

# Set N_GPU to a positive value if CUDA is available
N_GPU = 1

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device, 
)

!nvidia-smi

model = trainer.train()

09/13/2021 16:40:25 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
09/13/2021 16:40:25 - INFO - farm.data_handler.data_silo -   Loading train set from: ../content/train.tsv 
09/13/2021 16:40:26 - INFO - farm.data_handler.data_silo -   Got ya 1 parallel workers to convert 3947 dictionaries to pytorch datasets (chunksize = 790)...
09/13/2021 16:40:26 - INFO - farm.data_handler.data_silo -    0 
09/13/2021 16:40:26 - INFO - farm.data_handler.data_silo -   /w\
09/13/2021 16:40:26 - INFO - farm.data_handler.data_silo -   /'\
09/13/2021 16:40:26 - INFO - farm.data_handler.data_silo -   
09/13/2021 16:40:27 - INFO - farm.data_handler.processor -   *** Show 2 random examples ***
09/13/2021 16:40:27 - INFO - farm.data_handler.processor -   

      .--.        _____                       _      
    .'_\/_'.     / ____|    

Mon Sep 13 16:41:00 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P0    61W / 149W |  10553MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

09/13/2021 16:41:00 - INFO - farm.train -   
 

          &&& &&  & &&             _____                   _             
      && &\/&\|& ()|/ @, &&       / ____|                 (_)            
      &\/(/&/&||/& /_/)_&/_&     | |  __ _ __ _____      ___ _ __   __ _ 
   &() &\/&|()|/&\/ '%" & ()     | | |_ | '__/ _ \ \ /\ / / | '_ \ / _` |
  &_\_&&_\ |& |&&/&__%_/_& &&    | |__| | | | (_) \ V  V /| | | | | (_| |
&&   && & &| &| /& & % ()& /&&    \_____|_|  \___/ \_/\_/ |_|_| |_|\__, |
 ()&_---()&\&\|&&-&&--%---()~                                       __/ |
     &&     \|||                                                   |___/
             |||
             |||
             |||
       , -=-~  .-^- _
              `

  attn_score = (ac + bd + ef) * self.scale
Train epoch 0/2 (Cur. train loss: 0.0000):   0%|          | 0/99 [00:00<?, ?it/s]


ModuleAttributeError: ignored

## Switching to NER (NOT WORKING)

In [None]:
# Import the new building blocks

from farm.data_handler.processor import NERProcessor
from farm.modeling.prediction_head import TokenClassificationHead
ml_logger.init_experiment(experiment_name="Public_FARM", run_name="Tutorial1_Colab_NER")

# This processor will preprocess the data for the CoNLL03 NER task
ner_labels = ["[PAD]", "X", "O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-OTH", "I-OTH"]

ner_processor = NERProcessor(tokenizer=tokenizer, 
                             max_seq_len=128, 
                             data_dir="../data/conll03-de",
                             label_list=ner_labels,
                             metric="seq_f1")

# This prediction head is also a feed forward neural network but expects one
# vector per token in the input sequence and will generate a set of logits
# for each input

ner_prediction_head = TokenClassificationHead(num_labels=len(ner_labels))

# We can integrate these new pieces with the rest using this code
# It is pretty much the same structure as what we had above for text classification

BATCH_SIZE = 32
EMBEDS_DROPOUT_PROB = 0.1
LEARNING_RATE = 2e-5
N_EPOCHS = 1
N_GPU = 1

data_silo = DataSilo(
    processor=ner_processor,
    batch_size=BATCH_SIZE)

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[ner_prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_token"],
    device=device)

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS,
    device=device)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device,
)

model = trainer.train()

In [None]:
# Import the new building blocks

from farm.data_handler.processor import NERProcessor
from farm.modeling.prediction_head import TokenClassificationHead
ml_logger.init_experiment(experiment_name="Public_FARM", run_name="Tutorial1_Colab_NER")


In [None]:
# This processor will preprocess the data for the CoNLL03 NER task
ner_labels = ["1", "0"]

ner_processor = NERProcessor(tokenizer=tokenizer, 
                             max_seq_len=128,
                             train_filename="train.tsv",
                             test_filename="test_with_solutions.tsv",
                             dev_filename=None,
                             data_dir="../content",
                             label_list=ner_labels,
                             metric="f1_macro",
                             text_column_name="Comment",
                              label_column_name="Insult")

In [None]:
# This prediction head is also a feed forward neural network but expects one
# vector per token in the input sequence and will generate a set of logits
# for each input

ner_prediction_head = TokenClassificationHead(num_labels=len(ner_labels))

In [None]:
# We can integrate these new pieces with the rest using this code
# It is pretty much the same structure as what we had above for text classification

BATCH_SIZE = 32
EMBEDS_DROPOUT_PROB = 0.1
LEARNING_RATE = 2e-5
N_EPOCHS = 1
N_GPU = 1

data_silo = DataSilo(
    processor=ner_processor,
    batch_size=BATCH_SIZE)

model = AdaptiveModel(
    language_model=language_model,
    prediction_heads=[ner_prediction_head],
    embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
    lm_output_types=["per_sequence"],
    device=device)

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=LEARNING_RATE,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=N_EPOCHS,
    device=device)

trainer = Trainer(
    model=model,
    optimizer=optimizer,
    data_silo=data_silo,
    epochs=N_EPOCHS,
    n_gpu=N_GPU,
    lr_schedule=lr_schedule,
    device=device,
)

In [None]:
model = trainer.train()