# BERT Fine-tuning on Report Texts

For fine-tuning, the 'simpletransformers' library is used because it requires only a few lines of code for training. The library is available on GitHub: https://github.com/ThilinaRajapakse/simpletransformers.

## Creating the environment


```bash
conda create --name=finetuning
conda activate finetuning
conda install tensorflow-gpu pytorch scikit-learn

# Install transformers from a local checkout (assumes the repository is already cloned here)
cd transformers
pip install .
cd ..

git clone https://github.com/ThilinaRajapakse/simpletransformers
cd simpletransformers
pip install .
cd ..

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..

conda install ipykernel pandas
ipython kernel install --user --name=finetuning

pip install python-box ipywidgets
jupyter nbextension enable --py widgetsnbextension
```
Installing 'nvidia-apex' raised an error about incompatible CUDA versions. As a workaround, the function that performs this version check was commented out.



## Data preparation
Before fine-tuning, the data needs to be pre-processed. The 'simpletransformers' library requires the following data structure, with one row per report and one binary value per label:

| text | labels |
|------|--------|
| 'some text for finetuning' | \[1,0,0,1,1,0,1] |
| 'some more texts for finetuning' | \[0,0,0,0,0,1,1] |
| 'even more texts for finetuning' | \[0,1,0,1,0,0,1] |

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.classification import MultiLabelClassificationModel
from statistics import mean, median, stdev

Path to the folder containing the data files.

In [None]:
DATADIR = '../data/'

In [None]:
!ls $DATADIR

Load the training data-set and bring it into the desired format (as specified above).

In [None]:
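# Metadata columns that are not used for fine-tuning
to_drop = ['Filename', 'Annotator', 'Confidence']
data = pd.read_csv(DATADIR + 'train.csv', header=0).drop(to_drop, axis=1)
# The nine findings to predict, one binary column each
labels = ['Stauung', 'Verschattung', 'Erguss', 'Pneumothorax', 'Thoraxdrainage', 'ZVK', 'Magensonde', 'Tubus', 'Materialfehllage']
# Collapse the label columns into a single list per report
data['labels'] = data[labels].values.tolist()
data = data.drop(labels, axis=1)
data.shape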
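
The cell above collapses the individual label columns into one list per report. A minimal plain-Python sketch of the same transformation, without pandas, may make the target structure easier to see. The records, the shortened label list, and the helper name `to_multilabel` are made up for illustration and are not part of the notebook.

```python
# Toy records standing in for rows of train.csv (values are invented).
LABELS = ['Stauung', 'Verschattung', 'Erguss']  # shortened label list, illustration only

records = [
    {'text': 'some text for finetuning', 'Stauung': 1, 'Verschattung': 0, 'Erguss': 1},
    {'text': 'some more texts for finetuning', 'Stauung': 0, 'Verschattung': 0, 'Erguss': 0},
]


def to_multilabel(record, label_names):
    """Collapse one-hot label columns into a single list, keeping the label order."""
    return {
        'text': record['text'],
        'labels': [int(record[name]) for name in label_names],
    }


rows = [to_multilabel(r, LABELS) for r in records]
for row in rows:
    print(row)
```

The order of `LABELS` fixes the meaning of each position in the `labels` list, so the same order has to be used for the training and the test data.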

In [None]:
data.sample(10)

In [None]:
test = pd.read_csv(DATADIR + 'test.csv', header=0).drop(to_drop,axis=1)
test['labels']=test[labels].values.tolist()
test=test.drop(labels, axis = 1)
test.sample(10)

Define the training arguments and the models to compare.

In [None]:
# Training arguments shared by all MultiLabelClassificationModel runs
args={'output_dir': 'outputs/',
      'cache_dir': 'cache_dir/',
      'fp16': False,
      'fp16_opt_level': 'O1',
      'max_seq_length': 512,           
      'train_batch_size': 8,
      'gradient_accumulation_steps': 1,
      'eval_batch_size': 12,
      'num_train_epochs': 4,          
      'weight_decay': 0,
      'learning_rate': 4e-5,
      'adam_epsilon': 1e-8,
      'warmup_ratio': 0.06,
      'warmup_steps': 0,
      'max_grad_norm': 1.0,
      'logging_steps': 50,
      'save_steps': 2000,  
      'evaluate_during_training': False,
      'overwrite_output_dir': True,
      'reprocess_input_data': True,
      'n_gpu': 2,
      'use_multiprocessing': True,
      'silent': False,
      'threshold': 0.5,
      
      # for long texts     
      'sliding_window': True,
      'tie_value': 1}

model_names= ['../models/pt-radiobert-base-german-cased/', 'bert-base-german-cased', '../models/pt-radiobert-from-scratch/', 'bert-base-multilingual-cased']
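
To get a feeling for what these arguments mean for a single run, the number of optimizer steps and warmup steps can be estimated with a little arithmetic. The sketch below assumes that the warmup length is derived from `warmup_ratio` when `warmup_steps` is 0, and that one optimizer step is taken every `gradient_accumulation_steps` batches. Both are assumptions about the library's behaviour and should be checked against the installed version. `training_schedule` is a hypothetical helper, not a simpletransformers function.

```python
import math


def training_schedule(n_examples, batch_size, grad_accum_steps, epochs, warmup_ratio):
    """Estimate batches per epoch, total optimizer steps and warmup steps."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    total_steps = batches_per_epoch // grad_accum_steps * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return batches_per_epoch, total_steps, warmup_steps


# Smallest and largest training-set sizes used in this notebook
for n in (100, 4500):
    schedule = training_schedule(n, batch_size=8, grad_accum_steps=1,
                                 epochs=4, warmup_ratio=0.06)
    print(n, schedule)
```

Under these assumptions, a run on 100 reports takes 13 batches per epoch, 52 optimizer steps in total, and 4 warmup steps, so with `save_steps` set to 2000 no intermediate checkpoint would be written for the small training sets.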

In [None]:
# Header: one column per test report (range(1, 501) implies 500 reports)
with open('results.csv', 'w') as f:
    f.write('train_size,model,' + ','.join(map(str, range(1, 501))) + ',\n')

# Fine-tune every model on increasingly large random subsets of the training data
for i in [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, 2000, 2250, 2500, 2750, 3000, 3250, 3500, 3750, 4000, 4500]:
    train = data.sample(i)
    for model_name in model_names:

        model = MultiLabelClassificationModel('bert', model_name, num_labels=9, args=args)
        model.train_model(train)
        result, model_outputs, wrong_predictions = model.eval_model(test)
        pred, raw = model.predict(test.text.tolist())

        # Shorten local model paths to a plain name for the results file
        short_name = model_name
        for rep in ['models', '..', '/']:
            short_name = short_name.replace(rep, '')

        # Append one row with the raw outputs for every test report
        with open('results.csv', 'a') as f:
            f.write(str(train.shape[0]) + ',' + short_name + ',' + ','.join(map(str, raw)).replace('\n', '') + '\n')

    # Push the intermediate results after each training-set size
    !git add *
    !git commit -m "update accuracy"
    !git push
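
The raw outputs stored in `results.csv` can later be turned into label predictions and scored. The following sketch shows one way to do this per label. It assumes that each raw output holds one probability per label and that values at or above `threshold` count as positive. The helper names and the toy numbers are invented for illustration and are not results from this notebook.

```python
def binarize(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probabilities]


def per_label_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of reports classified correctly, computed separately for each label."""
    n_reports = len(y_true)
    n_labels = len(y_true[0])
    correct = [0] * n_labels
    for truth, probs in zip(y_true, y_prob):
        predicted = binarize(probs, threshold)
        for j, (t, p) in enumerate(zip(truth, predicted)):
            correct[j] += int(t == p)
    return [c / n_reports for c in correct]


# Toy data: four reports, three labels (made-up values)
y_true = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
y_prob = [[0.9, 0.2, 0.4], [0.1, 0.6, 0.8], [0.7, 0.7, 0.3], [0.4, 0.9, 0.6]]

accuracy = per_label_accuracy(y_true, y_prob, threshold=0.5)
print(accuracy)
```

Scoring each label separately keeps rare findings from being hidden behind frequent ones, which a single accuracy averaged over all labels would not show.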

Save the final models, each trained on the whole training data-set.

In [None]:
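# Output directories, in the same order as model_names
out_dirs = ['outputs/final/radbert/', 'outputs/final/gerbert/', 'outputs/final/fsbert/', 'outputs/final/multibert/']

# Note: range(3, 4) trains only the last model (multilingual BERT);
# use range(len(model_names)) to train and save all four.
for i in range(3, 4):
    args['output_dir'] = out_dirs[i]
    model = MultiLabelClassificationModel('bert', model_names[i], args=args, num_labels=9)
    model.train_model(data)
    pred, raw = model.predict(test.text.tolist())

    # Append the raw outputs of the final model to the results file
    with open('results.csv', 'a') as f:
        f.write(str(data.shape[0]) + ',' + model_names[i] + ',' + ','.join(map(str, raw)).replace('\n', '') + '\n')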

# Evaluation on long texts

On short report texts, our model outperforms the standard German BERT model or the multilingual BERT model only if the training set for fine-tuning is very small. With larger training sets, the added value of pretraining is low.
However, because the vocabularies of the models differ significantly, we believe that our model will perform better on longer report texts, for example text reports for computed tomography, due to a more efficient tokenization.
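
Why the vocabulary matters for long texts can be illustrated with a toy version of greedy longest-match-first WordPiece splitting, the scheme BERT tokenizers use for individual words. The two vocabularies below are invented for this example and are not taken from any of the models compared here. The sketch only shows that a vocabulary containing domain terms produces fewer tokens for the same sentence, so more of a report fits into `max_seq_length`.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shorten it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation pieces carry a prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lower-case, split on whitespace and apply WordPiece to every word."""
    return [p for w in text.lower().split() for p in wordpiece(w, vocab)]


# Hypothetical vocabularies, for illustration only
general_vocab = {'kein', 'pneu', '##mo', '##thorax', 'pleura', '##erg', '##uss', 'links'}
domain_vocab = {'kein', 'pneumothorax', 'pleuraerguss', 'links'}

sentence = 'Kein Pneumothorax Pleuraerguss links'
general_tokens = tokenize(sentence, general_vocab)
domain_tokens = tokenize(sentence, domain_vocab)
print(len(general_tokens), general_tokens)
print(len(domain_tokens), domain_tokens)
```

In this toy case the general vocabulary needs twice as many tokens as the domain vocabulary for the same sentence. Whether the real models differ by a similar margin has to be measured on the actual report texts.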

In [None]:
test = pd.read_csv(DATADIR + 'ct.csv', header=0).drop(to_drop,axis=1)
test['labels']=test[labels].values.tolist()
test=test.drop(labels, axis = 1)

In [None]:
from simpletransformers.classification import ClassificationModel

# Header: one column per CT report (range(1, 165) implies 164 reports)
with open('results-long-text.csv', 'w') as f:
    f.write('train_size,model,' + ','.join(map(str, range(1, 165))) + ',\n')

# Directories of the final fine-tuned models
model_dirs = ['outputs/final/radbert/', 'outputs/final/fsbert/', 'outputs/final/gerbert/', 'outputs/final/multibert/']

for model_dir in model_dirs:
    model = ClassificationModel('bert', model_dir, args=args)
    pred, raw = model.predict(test.text.tolist())

    # Reduce the directory path to the short model name
    short_name = model_dir
    for rep in ['outputs', 'final', '/']:
        short_name = short_name.replace(rep, '')

    with open('results-long-text.csv', 'a') as f:
        f.write(str(4000) + ',' + short_name + ',' + ','.join(map(str, raw)).replace('\n', '') + '\n')
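
The `sliding_window` and `tie_value` entries in `args` are what allow reports longer than `max_seq_length` to be processed. The sketch below illustrates the general idea on plain token lists: a long sequence is cut into overlapping windows, each window is classified, and the window-level predictions are combined by majority vote. The stride of 0.8 times the window length and the tie handling are assumptions about the library's behaviour, and both helper functions are hypothetical rather than part of simpletransformers.

```python
from collections import Counter


def sliding_windows(tokens, max_len, stride_ratio=0.8):
    """Split a token list into overlapping windows of at most max_len tokens."""
    step = max(1, int(max_len * stride_ratio))
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the text
    return windows


def combine_votes(window_preds, tie_value=1):
    """Majority vote over window-level predictions; ties fall back to tie_value."""
    counts = Counter(window_preds).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_value
    return counts[0][0]


# A 25-token "report" with a window of 10 tokens gives three overlapping windows
tokens = list(range(25))
windows = sliding_windows(tokens, max_len=10)
print([len(w) for w in windows])
print(combine_votes([0, 1], tie_value=1))
```

With a `tie_value` of 1, a finding counts as present when the windows disagree evenly, which favours sensitivity over specificity for long reports.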