# Connect to Google Drive and change directory





In [None]:
from google.colab import drive 
drive.mount('/content/drive')
%cd /content/drive/MyDrive/BachelorThesis/

Mounted at /content/drive
/content/drive/MyDrive/BachelorThesis


# Installations

In [None]:
!pip install datasets transformers

Installing collected packages: multidict, frozenlist, yarl, asynctest, async-timeout, aiosignal, pyyaml, fsspec, aiohttp, xxhash, tokenizers, sacremoses, huggingface-hub, transformers, datasets
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 asynctest-0.13.0 datasets-1.18.3 frozenlist-1.3.0 fsspec-2022.1.0 huggingface-hub-0.4.0 multidict-6.0.2 pyyaml-6.0 sacremoses-0.0.47 tokenizers-0.11.4 transformers-4.16.2 xxhash-2.0.2 yarl-1.7.2


# GPU Support

## Enable GPU

To enable GPUs for the notebook:
- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [None]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Check which GPU was allocated in the given session:

In [None]:
!nvidia-smi -L
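
The pretraining script used later in this notebook comes from the `examples/pytorch` folder of the Transformers repository, so it can be useful to confirm that PyTorch also sees the GPU. The following is a minimal sketch, not part of the original notebook; the helper name `describe_torch_device` is hypothetical, and the snippet assumes that PyTorch is preinstalled in the Colab runtime (it reports a message instead of failing if it is not).

```python
def describe_torch_device():
    """Return a short, human-readable description of the device PyTorch would use."""
    try:
        import torch  # assumed to be preinstalled in the Colab runtime
    except ImportError:
        return "PyTorch is not installed in this runtime"

    if torch.cuda.is_available():
        # Index 0 is the first (and on Colab usually the only) visible GPU.
        return "CUDA GPU visible to PyTorch: " + torch.cuda.get_device_name(0)
    return "No CUDA GPU visible to PyTorch; training would run on the CPU"


print(describe_torch_device())
```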

## GPU Speedup relative to CPU

This example constructs a typical convolutional neural network layer over a random image and manually places the resulting ops on either the CPU or the GPU to compare execution speed.

In [None]:
import timeit

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise SystemError('GPU device not found')

def cpu():
  with tf.device('/cpu:0'):
    random_image_cpu = tf.random.normal((100, 100, 100, 3))
    net_cpu = tf.keras.layers.Conv2D(32, 7)(random_image_cpu)
    return tf.math.reduce_sum(net_cpu)

def gpu():
  with tf.device('/device:GPU:0'):
    random_image_gpu = tf.random.normal((100, 100, 100, 3))
    net_gpu = tf.keras.layers.Conv2D(32, 7)(random_image_gpu)
    return tf.math.reduce_sum(net_gpu)
  
# We run each op once to warm up; see: https://stackoverflow.com/a/45067900
cpu()
gpu()

# Run the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

# Run Further Pretraining

For domain-adaptive pretraining (DAPT) of BERT on the RecipeNLG dataset, the [run_mlm.py](https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) script (retrieved on 06.01.2022) from the 🤗 [Hugging Face Transformers library](https://huggingface.co/transformers/) was used in a slightly modified form.

The following code cell runs the modified script and thus starts the training process.

**Pretraining was started from the BERT-base-uncased checkpoint**. After every 1000 training steps, a model checkpoint is saved in the *model_output* folder, which also holds all other output. These checkpoints can then be loaded to continue pretraining.
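
To get a feeling for how much training data lies between two checkpoints, the following sketch multiplies out the batch settings. The values are taken from the command in the cell below (`per_device_train_batch_size=16`, `gradient_accumulation_steps=2`, `save_steps=1000`); the function name and the assumption of a single GPU (`num_devices=1`) are illustrative and not part of the original notebook.

```python
def sequences_per_checkpoint(per_device_batch_size, grad_accum_steps,
                             save_steps, num_devices=1):
    """Return (effective batch size, sequences processed between two checkpoints).

    One training step here means one optimizer update, i.e. it already
    includes all gradient accumulation micro-batches.
    """
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    return effective_batch, effective_batch * save_steps


# Assumption: a single GPU, as is typical for a Colab session.
effective_batch, sequences = sequences_per_checkpoint(16, 2, 1000)
print("Effective batch size:", effective_batch)
print("Sequences between checkpoints:", sequences)
```

Under these assumptions, at most 1000 optimizer steps of work are lost if a session is terminated shortly before the next checkpoint is written.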

**Note:** Pretraining takes multiple days. Google Colab's GPU allocation per user is restricted to 12 hours (24 hours for Pro users), so saving checkpoints and continuing from them is necessary!
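
When a session ends, the most recent checkpoint has to be located before training can continue. The sketch below assumes that the checkpoints are stored as sub-folders named `checkpoint-<step>` inside the output folder, which is the naming scheme the Transformers `Trainer` uses by default; the helper `find_latest_checkpoint` is hypothetical and not part of the original notebook.

```python
import re
from pathlib import Path

# Default Trainer naming scheme: checkpoint-1000, checkpoint-2000, ...
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the checkpoint folder with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    candidates = []
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            # Compare numerically so that checkpoint-10000 beats checkpoint-9000.
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


latest = find_latest_checkpoint("CookBERT/further_pretraining/model_output")
print("Latest checkpoint:", latest)
```

The returned path could then be passed to the training script, for example through the `--resume_from_checkpoint` argument of the unmodified `run_mlm.py`; whether the modified script still accepts it is an assumption that should be checked first.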

In [None]:
!python CookBERT/further_pretraining/run_mlm.py \
--model_name_or_path=bert-base-uncased \
--output_dir=CookBERT/further_pretraining/model_output \
--do_train \
--do_eval \
--validation_split_percentage=5 \
--train_file=datasets/recipeNLG/recipeNLG_instructions.txt \
--per_device_train_batch_size=16 \
--per_device_eval_batch_size=16 \
--gradient_accumulation_steps=2 \
--learning_rate=2e-5 \
--num_train_epochs=3 \
--save_total_limit=10 \
--save_strategy=steps \
--save_steps=1000 \
--line_by_line \
--max_seq_length=256 \
--evaluation_strategy=steps \
--eval_steps=1000 \