# Train BERT for the Preselection of Reports	 

Fine-tuning uses the 'simpletransformers' library, which requires only a few lines of code for training. The library is available on GitHub: https://github.com/ThilinaRajapakse/simpletransformers

## Creating the environment


```bash
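# Create and activate a dedicated conda environment
conda create --name=finetuning
conda activate finetuning
conda install tensorflow-gpu pytorch scikit-learn

# Install transformers from a local checkout (assumed to exist in ./transformers)
cd transformers
pip install .
cd ..

# Install simpletransformers from source
git clone https://github.com/ThilinaRajapakse/simpletransformers
cd simpletransformers
pip install .
cd ..

# Install NVIDIA apex with its C++ and CUDA extensions (used for fp16 training)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..

# Register the environment as a Jupyter kernel
conda install ipykernel pandas
ipython kernel install --user --name=finetuning

# Widgets for progress bars in the notebook
pip install python-box ipywidgets
jupyter nbextension enable --py widgetsnbextension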
```
Installing 'nvidia-apex' may raise an error about incompatible CUDA versions. As a workaround, the function that performs this version check was commented out before installing.


In [None]:
import pandas as pd
from simpletransformers.classification import ClassificationModel

Path to the folder containing the data files.

In [None]:
DATADIR = '../data/'

In [None]:
!ls $DATADIR

Load the training dataset.

In [None]:
data = pd.read_csv(DATADIR + 'train-evaluable.csv', header=0)
data.shape

In [None]:
data.sample(10)

In [None]:
# Training arguments for the binary ClassificationModel
args={'output_dir': 'outputs/',
      'cache_dir': 'cache_dir/',
      'fp16': False,
      'fp16_opt_level': 'O1',
      'max_seq_length': 512,           
      'train_batch_size': 8,
      'gradient_accumulation_steps': 10,
      'eval_batch_size': 12,
      'num_train_epochs': 10,          
      'weight_decay': 0,
      'learning_rate': 4e-5,
      'adam_epsilon': 1e-8,
      'warmup_ratio': 0.06,
      'warmup_steps': 0,
      'max_grad_norm': 1.0,
      'logging_steps': 50,
      'save_steps': 2000,  
      'evaluate_during_training': True,
      'overwrite_output_dir': True,
      'reprocess_input_data': True,
      'n_gpu': 2,
      'use_multiprocessing': True,
      'silent': False,
      'threshold': 0.5,
      'wandb_project': 'bert-for-radiology',
      
      # for long texts     
      'sliding_window': True,
      'tie_value': 1}

model_names = ['../models/pt-radiobert-base-german-cased/',
               'bert-base-german-cased',
               '../models/pt-radiobert-from-scratch/',
               'bert-base-multilingual-cased']
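
The `sliding_window` and `tie_value` arguments above target reports that are longer than `max_seq_length`. The following is a minimal sketch of the general idea, not the library's actual implementation: a long token sequence is cut into overlapping windows, each window is classified on its own, and the per-window labels are combined by majority vote, with ties resolved to `tie_value`. The function names, the window size of 510 and the stride of 408 are illustrative assumptions.

```python
from collections import Counter


def sliding_windows(tokens, window_size, stride):
    """Split a token list into overlapping windows of at most window_size tokens."""
    if len(tokens) <= window_size:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window_size])
        # Stop once a window has reached the end of the sequence.
        if start + window_size >= len(tokens):
            break
    return windows


def vote(window_predictions, tie_value):
    """Majority vote over per-window labels; a tie resolves to tie_value."""
    counts = Counter(window_predictions).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_value
    return counts[0][0]


# Stand-in for a tokenized report with 1200 tokens (hypothetical length).
tokens = list(range(1200))
windows = sliding_windows(tokens, window_size=510, stride=408)
print([len(w) for w in windows])

# Three windows disagree 2:1, two windows tie and fall back to tie_value.
print(vote([1, 0, 1], tie_value=1), vote([1, 0], tie_value=1))
```

With `tie_value` set to 1, a report whose windows split evenly is treated as positive, which favours recall over precision in a preselection setting.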

Training the models.

In [None]:
test = pd.read_csv(DATADIR + 'test-evaluable.csv', header=0)
args["output_dir"] = 'outputs/final/radbert-binary/'
model = ClassificationModel('bert', '../models/pt-radiobert-base-german-cased/', args=args)
model.train_model(data, eval_df = test)
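
To get a feeling for what the batch and warmup settings in `args` mean for a training run, the arithmetic can be sketched as follows. This is a rough estimate under stated assumptions: the number of training examples (4000) is hypothetical and should be replaced with `len(data)`, the helper `schedule_summary` is not part of any library, and the warmup step count is assumed to be derived from `warmup_ratio` because `warmup_steps` is 0. With `sliding_window` enabled, each report can contribute several windows, so the real step counts are likely higher.

```python
import math

# Values copied from the args dictionary.
train_batch_size = 8
gradient_accumulation_steps = 10
num_train_epochs = 10
warmup_ratio = 0.06

# Hypothetical number of training examples; replace with len(data).
num_train_examples = 4000


def schedule_summary(num_examples, batch_size, accumulation_steps, epochs, ratio):
    """Return (effective batch size, optimizer steps, warmup steps)."""
    # Gradients are accumulated over several batches before each update.
    effective_batch_size = batch_size * accumulation_steps
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # One optimizer step is taken after every `accumulation_steps` batches.
    optimizer_steps = batches_per_epoch // accumulation_steps * epochs
    warmup_steps = math.ceil(optimizer_steps * ratio)
    return effective_batch_size, optimizer_steps, warmup_steps


summary = schedule_summary(num_train_examples, train_batch_size,
                           gradient_accumulation_steps, num_train_epochs,
                           warmup_ratio)
print(summary)
```

The first value shows that each weight update sees 80 examples, even though only 8 are held in GPU memory per forward pass.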

In [None]:
args["output_dir"] = 'outputs/final/radbert-binary/'
model = ClassificationModel('bert', '../models/pt-radiobert-base-german-cased/', args=args)
model.train_model(data)

args["output_dir"] = 'outputs/final/fsbert-binary/'
model = ClassificationModel('bert', '../models/pt-radiobert-from-scratch/', args=args)
model.train_model(data)

args["output_dir"] = 'outputs/final/gerbert-binary/'
model = ClassificationModel('bert', 'bert-base-german-cased', args=args)
model.train_model(data)

args["output_dir"] = 'outputs/final/multibert-binary/'
model = ClassificationModel('bert', 'bert-base-multilingual-cased', args=args)
model.train_model(data)

In [None]:
test = pd.read_csv(DATADIR + 'test-evaluable.csv', header=0)

# Write the CSV header: one column per test report (the header assumes 500 reports)
with open('results-binary.csv', 'w') as f:
    f.write('model,' + ','.join(map(str, range(1, 501))) + '\n')

model_dirs = ['outputs/final/radbert-binary/', 'outputs/final/fsbert-binary/', 'outputs/final/gerbert-binary/', 'outputs/final/multibert-binary/']

for model_dir in model_dirs:
    # Load the fine-tuned model and predict on the test reports
    model = ClassificationModel('bert', model_dir, args=args)
    pred, raw = model.predict(test.text)

    # Shorten the path to the bare model name, e.g. 'radbert-binary'
    for rep in ['outputs', 'final', '/']:
        model_dir = model_dir.replace(rep, '')

    # Append one row per model containing the raw model outputs
    with open('results-binary.csv', 'a') as f:
        f.write(model_dir + ',' + ','.join(map(str, raw)).replace('\n', '') + '\n')
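
The result file above is assembled with manual string operations. As an alternative sketch, the model name can be taken from the last path component with `pathlib`, and the rows can be written with the standard `csv` module, which handles separators and quoting. The helpers `short_name` and `write_results` and the example predictions are hypothetical; the sketch writes predicted labels rather than raw model outputs and uses an in-memory buffer instead of `results-binary.csv`.

```python
import csv
import io
from pathlib import PurePosixPath


def short_name(model_dir):
    """Return the last path component, e.g. 'radbert-binary'."""
    return PurePosixPath(model_dir).name


def write_results(handle, rows):
    """Write a header and one CSV row per model: name, then its predictions."""
    writer = csv.writer(handle, lineterminator='\n')
    n_reports = max(len(predictions) for _, predictions in rows)
    writer.writerow(['model'] + list(range(1, n_reports + 1)))
    for model_dir, predictions in rows:
        writer.writerow([short_name(model_dir)] + list(predictions))


# Hypothetical predictions for three test reports per model.
rows = [('outputs/final/radbert-binary/', [1, 0, 1]),
        ('outputs/final/gerbert-binary/', [1, 1, 0])]

buffer = io.StringIO()
write_results(buffer, rows)
print(buffer.getvalue())
```

Deriving the header length from the data avoids hard-coding the number of test reports.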