First, make sure the latest version of [🤗 Optimum Graphcore](https://github.com/huggingface/optimum-graphcore) is installed in your environment.

In [None]:
#! pip install optimum[graphcore]

Let's print out the versions of Transformers and Optimum Graphcore:

In [None]:
import transformers
import optimum.graphcore

print(transformers.__version__)
print(optimum.graphcore.__version__)

# Train a language model

In this notebook, we'll see how to train a model on a language modeling task when that model is not supported by Optimum Graphcore and is not even part of [🤗 Transformers](https://github.com/huggingface/transformers).

We will see how to load and preprocess the dataset for this task, and how to use the `IPUTrainer` API to train a model on it.

This notebook assumes you have trained a tokenizer on the corpus you are using; see the [How to train a tokenizer](https://github.com/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb) notebook ([open in Colab](https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/tokenizer_training.ipynb)).

## Preparing the dataset

We will use the Wikitext 2 dataset as an example. You can load it with the 🤗 Datasets library.

In [1]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Reusing dataset wikitext (/localdata/jincheng/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)

## Causal Language modeling

In [2]:
model_checkpoint = "gpt2"
tokenizer_checkpoint = "gpt2"

ipu_config_name = "Graphcore/gpt2-small-ipu"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [3]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

We can now call the tokenizer on all our texts, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [4]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object.

In [5]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

Then we set the block size, that is, the sequence length our model will be trained with.

In [6]:
block_size = 128

Then we write the preprocessing function that will group our texts:

In [7]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
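
To make the grouping step concrete, here is a small self-contained sketch that runs the same logic on a toy batch. The token ids and the `toy_block_size` of 4 are made up for illustration (the notebook itself uses `block_size = 128`), and `group_texts_toy` is a hypothetical copy of the function above that takes the block size as a parameter.

```python
# Toy restatement of the grouping logic, with the block size passed in explicitly.
def group_texts_toy(examples, toy_block_size):
    # Concatenate all sequences of each column into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // toy_block_size) * toy_block_size
    # Cut each column into consecutive blocks of equal length.
    result = {
        k: [t[i : i + toy_block_size] for i in range(0, total_length, toy_block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


# Three "tokenized texts" with 10 tokens in total (made-up ids).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}

grouped = group_texts_toy(toy_batch, toy_block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that block boundaries ignore the original text boundaries, and the last two tokens (9 and 10) are dropped because they do not fill a whole block.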

Again we apply it to all the splits in our `datasets` object.

In [8]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

Let's define a custom model. You might notice that it is just a customized version of GPT-2.

In [9]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn import TransformerEncoder, TransformerEncoderLayer

class TransformerModel(nn.Module):

    def __init__(self, block_size, vocab_size, d_model, nhead, dim_feedforward, nlayers, dropout=0.1, embd_pdrop=0.1):
        super(TransformerModel, self).__init__()
        self.block_size = block_size
        self.word_embeddings = nn.Embedding(vocab_size, d_model)
        self.position_embeddings = nn.Embedding(block_size, d_model)
        self.drop = nn.Dropout(embd_pdrop)
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, batch_first=True)
        self.transformer_encoder = TransformerEncoder(encoder_layer, nlayers)
        self.lm_head = nn.Linear(d_model, vocab_size)

        self.tie_weights(self.lm_head, self.word_embeddings)

        self.init_weights()

    def tie_weights(self, output_embeddings, input_embeddings):
        output_embeddings.weight = input_embeddings.weight
        output_embeddings.bias.data = nn.functional.pad(
            output_embeddings.bias.data,
            (
                0,
                output_embeddings.weight.shape[0] - output_embeddings.bias.shape[0],
            ),
            "constant",
            0,
        )
        output_embeddings.out_features = input_embeddings.num_embeddings

    def _generate_square_subsequent_mask(self, sz):
        mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1)
        mask = mask.float().masked_fill(mask == 0, -10000.0).masked_fill(mask == 1, float(0.0))
        return mask

    def init_weights(self):
        initrange = 0.1
        nn.init.uniform_(self.word_embeddings.weight, -initrange, initrange)
        nn.init.uniform_(self.position_embeddings.weight, -initrange, initrange)
        nn.init.zeros_(self.lm_head.bias)
        nn.init.uniform_(self.lm_head.weight, -initrange, initrange)

    def forward(self, input_ids, attention_mask=None, labels=None):
        device = input_ids.device
        input_shape = input_ids.size()

        mask = self._generate_square_subsequent_mask(self.block_size).to(device)

        inputs_embeds = self.word_embeddings(input_ids)
        position_ids = torch.arange(0, input_shape[-1], dtype=torch.long, device=device)
        position_ids = position_ids.unsqueeze(0).view(-1, input_shape[-1])
        position_embeds = self.position_embeddings(position_ids)
        hidden_states = inputs_embeds + position_embeds
        hidden_states = self.drop(hidden_states)

        hidden_states = self.transformer_encoder(hidden_states, mask)
        lm_logits = self.lm_head(hidden_states)

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n. Use roll() + ignore_index instead of slicing for better efficiency on IPUs.
            labels = torch.roll(labels, -1, 1)
            # By default the ignore_index of CrossEntropyLoss is -100
            labels[:, -1] = -100
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1))

        output = (lm_logits,)
        return (loss,) if loss is not None else output
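
The loss computation in `forward` shifts the labels with `torch.roll` and masks the last position with the ignore index, instead of slicing logits and labels. The following pure-Python sketch (lists instead of tensors, made-up token ids) illustrates why both approaches select the same (position, target) pairs. The helper names are hypothetical and not part of the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def shift_with_roll(labels):
    """Mimic torch.roll(labels, -1) on one sequence, then mask the wrapped-around slot."""
    rolled = labels[1:] + labels[:1]  # first token wraps around to the end
    rolled[-1] = IGNORE_INDEX         # the wrapped token must not contribute to the loss
    return rolled


def pairs_from_roll(labels):
    # Positions that still carry a real target after rolling and masking.
    return [(pos, tgt) for pos, tgt in enumerate(shift_with_roll(labels)) if tgt != IGNORE_INDEX]


def pairs_from_slicing(labels):
    # The slicing formulation: logits[:-1] are scored against labels[1:].
    return list(enumerate(labels[1:]))


tokens = [11, 22, 33, 44]
print(shift_with_roll(tokens))  # [22, 33, 44, -100]
print(pairs_from_roll(tokens) == pairs_from_slicing(tokens))  # True
```

The roll-based version keeps all tensors at their full, fixed shape, which is the property the code comment above refers to when it mentions efficiency on IPUs.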

We then subclass the model so that it also inherits from `PipelineMixin`, which gives it the `parallelize` and `deparallelize` methods. Here we override `parallelize` to customize the optimization: which IPU each block is placed on, and where recomputation checkpoints go. If the model is simple and needs no such customization, there is no need to override these methods.

In [15]:
import poptorch
from optimum.graphcore.modeling_utils import PipelineMixin, get_layer_ipu, recomputation_checkpoint, register
from optimum.utils import logging
logger = logging.get_logger(__name__)

class IPUTransformerModel(TransformerModel, PipelineMixin):
    def parallelize(self):
        super().parallelize()
        logger.info("---------- Device Allocation -----------")
        logger.info("Embedding  --> IPU 0")
        self.word_embeddings = poptorch.BeginBlock(self.word_embeddings, "word_embeddings", ipu_id=0)
        self.position_embeddings = poptorch.BeginBlock(self.position_embeddings, "position_embeddings", ipu_id=0)

        layer_ipu = get_layer_ipu(self.ipu_config.layers_per_ipu)
        for index, layer in enumerate(self.transformer_encoder.layers):
            if self.ipu_config.recompute_checkpoint_every_layer:
                # Put checkpoints on every encoder layer
                h = recomputation_checkpoint(layer)
                self._hooks.append(h)
            ipu = layer_ipu[index]
            logger.info(f"Encoder {index:<2} --> IPU {ipu}")
            self.transformer_encoder.layers[index] = poptorch.BeginBlock(layer, f"Encoder{index}", ipu_id=ipu)

        logger.info("Head       --> IPU 0")
        logger.info("---------------------------------------")
        self.lm_head = poptorch.BeginBlock(self.lm_head, "lm_head", ipu_id=0)
        return self
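
In `parallelize`, `get_layer_ipu` turns the `layers_per_ipu` setting into one IPU index per encoder layer. The sketch below shows that kind of expansion in plain Python. It assumes `layers_per_ipu` is a list of per-IPU layer counts; the function name `expand_layers_per_ipu` and the value `[0, 4, 4, 4]` are hypothetical, and the library's actual implementation may differ.

```python
def expand_layers_per_ipu(layers_per_ipu):
    """Expand per-IPU layer counts into a list holding the IPU index of each layer."""
    layer_ipu = []
    for ipu_id, n_layers in enumerate(layers_per_ipu):
        layer_ipu.extend([ipu_id] * n_layers)
    return layer_ipu


# Hypothetical layout: no encoder layers on IPU 0, four on each of IPUs 1 to 3.
hypothetical_layers_per_ipu = [0, 4, 4, 4]
mapping = expand_layers_per_ipu(hypothetical_layers_per_ipu)

for index, ipu in enumerate(mapping):
    print(f"Encoder {index:<2} --> IPU {ipu}")
```

With a layout like this, IPU 0 would hold only the embeddings and the LM head, which matches the code above placing both on `ipu_id=0`.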

Let's instantiate the model.

In [11]:
model = IPUTransformerModel(
    block_size=block_size,
    vocab_size=tokenizer.vocab_size,
    d_model=768,
    nhead=12,
    dim_feedforward=768,
    nlayers=12,
)

To instantiate an `IPUTrainer`, we first define the `IPUConfig`, which is a class that specifies attributes and configuration parameters to compile and put the model on the device. We initialize it with one config name or path, which we set earlier.

In [12]:
from optimum.graphcore import IPUConfig, IPUTrainer, IPUTrainingArguments

# Load the IPU configuration from the name (or path) defined earlier.
ipu_config = IPUConfig.from_pretrained(ipu_config_name)

The other thing we need to define is the `IPUTrainingArguments`, which is a class that contains all the attributes to customize the training. It requires one folder name, which will be used to save the checkpoints of the model, and all other arguments are optional:

In [13]:
micro_batch_size = 1
gradient_accumulation_steps = 16
pod_type = "pod16"

training_args = IPUTrainingArguments(
    "mymodel-wikitext2",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=micro_batch_size,
    per_device_eval_batch_size=micro_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    pod_type=pod_type,
    num_train_epochs=1,
    loss_scaling=16384,
    warmup_ratio=0.1,
    dataloader_drop_last=True,
    dataloader_num_workers=64,
    logging_steps=10,
)

Finally, we pass along all of those to the `IPUTrainer` class:

In [16]:
trainer = IPUTrainer(
    model=model,
    ipu_config=ipu_config,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

Overriding IPU config: gradient_accumulation_steps=16
Cloning https://huggingface.co/Jinchen/gpt2-wikitext2 into local empty directory.



And we can train our model:

In [17]:
trainer.train()

Compiling Model...

{'loss': 12.0398, 'learning_rate': 6.872852233676977e-07, 'epoch': 0.03}
{'loss': 11.9625, 'learning_rate': 1.3745704467353954e-06, 'epoch': 0.07}
{'loss': 11.7398, 'learning_rate': 2.061855670103093e-06, 'epoch': 0.1}
{'loss': 11.4867, 'learning_rate': 2.7491408934707907e-06, 'epoch': 0.14}
{'loss': 11.1125, 'learning_rate': 3.436426116838488e-06, 'epoch': 0.17}
{'loss': 10.7312, 'learning_rate': 4.123711340206186e-06, 'epoch': 0.21}
{'loss': 10.4102, 'learning_rate': 4.810996563573884e-06, 'epoch': 0.24}
{'loss': 10.0617, 'learning_rate': 5.4982817869415815e-06, 'epoch': 0.27}
{'loss': 9.7906, 'learning_rate': 6.185567010309279e-06, 'epoch': 0.31}
{'loss': 9.6422, 'learning_rate': 6.872852233676976e-06, 'epoch': 0.34}
{'loss': 9.3812, 'learning_rate': 7.560137457044674e-06, 'epoch': 0.38}
{'loss': 9.3125, 'learning_rate': 8.247422680412371e-06, 'epoch': 0.41}
{'loss': 9.1977, 'learning_rate': 8.93470790378007e-06, 'epoch': 0.45}
{'loss': 9.1625, 'learning_rate': 9.621993127147768e-06

Saving model checkpoint to gpt2-wikitext2/checkpoint-500


{'loss': 7.2898, 'learning_rate': 1.840397098129057e-05, 'epoch': 1.72}


Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Configuration saved in gpt2-wikitext2/checkpoint-500/ipu_config.json


{'loss': 7.2383, 'learning_rate': 1.8327605956471936e-05, 'epoch': 1.75}
{'loss': 7.3582, 'learning_rate': 1.8251240931653306e-05, 'epoch': 1.79}
{'loss': 7.2828, 'learning_rate': 1.817487590683467e-05, 'epoch': 1.82}
{'loss': 7.2621, 'learning_rate': 1.809851088201604e-05, 'epoch': 1.86}
{'loss': 7.1852, 'learning_rate': 1.8022145857197405e-05, 'epoch': 1.89}
{'loss': 7.1805, 'learning_rate': 1.794578083237877e-05, 'epoch': 1.92}
{'loss': 7.1793, 'learning_rate': 1.786941580756014e-05, 'epoch': 1.96}
{'loss': 7.159, 'learning_rate': 1.7793050782741504e-05, 'epoch': 1.99}
{'loss': 7.0969, 'learning_rate': 1.7716685757922874e-05, 'epoch': 2.03}
{'loss': 7.2242, 'learning_rate': 1.764032073310424e-05, 'epoch': 2.06}
{'loss': 7.1527, 'learning_rate': 1.7563955708285607e-05, 'epoch': 2.1}
{'loss': 7.1473, 'learning_rate': 1.7487590683466973e-05, 'epoch': 2.13}
{'loss': 7.2023, 'learning_rate': 1.741122565864834e-05, 'epoch': 2.16}
{'loss': 6.9977, 'learning_rate': 1.7334860633829706e-05, '

Saving model checkpoint to gpt2-wikitext2/checkpoint-1000
Trainer.model is not a `PreTrainedModel`, only saving its state dict.


{'loss': 6.8582, 'learning_rate': 1.4585719740358917e-05, 'epoch': 3.44}


Configuration saved in gpt2-wikitext2/checkpoint-1000/ipu_config.json


{'loss': 7.0332, 'learning_rate': 1.4509354715540286e-05, 'epoch': 3.47}
{'loss': 6.9055, 'learning_rate': 1.443298969072165e-05, 'epoch': 3.51}
{'loss': 7.0617, 'learning_rate': 1.4356624665903018e-05, 'epoch': 3.54}
{'loss': 6.977, 'learning_rate': 1.4280259641084385e-05, 'epoch': 3.57}
{'loss': 6.9223, 'learning_rate': 1.4203894616265753e-05, 'epoch': 3.61}
{'loss': 6.918, 'learning_rate': 1.4127529591447117e-05, 'epoch': 3.64}
{'loss': 6.9617, 'learning_rate': 1.4051164566628486e-05, 'epoch': 3.68}
{'loss': 6.8781, 'learning_rate': 1.3974799541809852e-05, 'epoch': 3.71}
{'loss': 6.95, 'learning_rate': 1.389843451699122e-05, 'epoch': 3.75}
{'loss': 6.8121, 'learning_rate': 1.3822069492172585e-05, 'epoch': 3.78}
{'loss': 6.9035, 'learning_rate': 1.3745704467353953e-05, 'epoch': 3.81}
{'loss': 6.759, 'learning_rate': 1.3669339442535321e-05, 'epoch': 3.85}
{'loss': 6.9887, 'learning_rate': 1.3592974417716687e-05, 'epoch': 3.88}
{'loss': 6.909, 'learning_rate': 1.3516609392898055e-05, '

Saving model checkpoint to gpt2-wikitext2/checkpoint-1500


{'loss': 6.8297, 'learning_rate': 1.0767468499427262e-05, 'epoch': 5.15}


Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Configuration saved in gpt2-wikitext2/checkpoint-1500/ipu_config.json


{'loss': 6.8754, 'learning_rate': 1.069110347460863e-05, 'epoch': 5.19}
{'loss': 6.9203, 'learning_rate': 1.0614738449789997e-05, 'epoch': 5.22}
{'loss': 6.8203, 'learning_rate': 1.0538373424971365e-05, 'epoch': 5.26}
{'loss': 6.7996, 'learning_rate': 1.046200840015273e-05, 'epoch': 5.29}
{'loss': 6.7547, 'learning_rate': 1.0385643375334097e-05, 'epoch': 5.33}
{'loss': 6.9426, 'learning_rate': 1.0309278350515464e-05, 'epoch': 5.36}
{'loss': 6.791, 'learning_rate': 1.0232913325696832e-05, 'epoch': 5.4}
{'loss': 6.8102, 'learning_rate': 1.01565483008782e-05, 'epoch': 5.43}
{'loss': 6.7168, 'learning_rate': 1.0080183276059565e-05, 'epoch': 5.46}
{'loss': 6.8051, 'learning_rate': 1.0003818251240933e-05, 'epoch': 5.5}
{'loss': 6.8586, 'learning_rate': 9.927453226422299e-06, 'epoch': 5.53}
{'loss': 6.7195, 'learning_rate': 9.851088201603667e-06, 'epoch': 5.57}
{'loss': 6.8895, 'learning_rate': 9.774723176785034e-06, 'epoch': 5.6}
{'loss': 6.7758, 'learning_rate': 9.6983581519664e-06, 'epoch'

Saving model checkpoint to gpt2-wikitext2/checkpoint-2000


{'loss': 6.7289, 'learning_rate': 6.94921725849561e-06, 'epoch': 6.87}


Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Configuration saved in gpt2-wikitext2/checkpoint-2000/ipu_config.json


{'loss': 6.7098, 'learning_rate': 6.872852233676976e-06, 'epoch': 6.91}
{'loss': 6.7488, 'learning_rate': 6.796487208858344e-06, 'epoch': 6.94}
{'loss': 6.5922, 'learning_rate': 6.72012218403971e-06, 'epoch': 6.98}
{'loss': 6.6934, 'learning_rate': 6.643757159221077e-06, 'epoch': 7.01}
{'loss': 6.6152, 'learning_rate': 6.567392134402444e-06, 'epoch': 7.04}
{'loss': 6.6137, 'learning_rate': 6.491027109583811e-06, 'epoch': 7.08}
{'loss': 6.7324, 'learning_rate': 6.414662084765179e-06, 'epoch': 7.11}
{'loss': 6.7504, 'learning_rate': 6.338297059946545e-06, 'epoch': 7.15}
{'loss': 6.6051, 'learning_rate': 6.2619320351279125e-06, 'epoch': 7.18}
{'loss': 6.5969, 'learning_rate': 6.185567010309279e-06, 'epoch': 7.22}
{'loss': 6.5906, 'learning_rate': 6.109201985490646e-06, 'epoch': 7.25}
{'loss': 6.8258, 'learning_rate': 6.0328369606720125e-06, 'epoch': 7.29}
{'loss': 6.623, 'learning_rate': 5.95647193585338e-06, 'epoch': 7.32}
{'loss': 6.6352, 'learning_rate': 5.880106911034746e-06, 'epoch':

Saving model checkpoint to gpt2-wikitext2/checkpoint-2500


{'loss': 6.6824, 'learning_rate': 3.1309660175639563e-06, 'epoch': 8.59}


Trainer.model is not a `PreTrainedModel`, only saving its state dict.
Configuration saved in gpt2-wikitext2/checkpoint-2500/ipu_config.json


{'loss': 6.607, 'learning_rate': 3.054600992745323e-06, 'epoch': 8.63}
{'loss': 6.6441, 'learning_rate': 2.97823596792669e-06, 'epoch': 8.66}
{'loss': 6.5324, 'learning_rate': 2.9018709431080567e-06, 'epoch': 8.69}
{'loss': 6.5539, 'learning_rate': 2.8255059182894235e-06, 'epoch': 8.73}
{'loss': 6.616, 'learning_rate': 2.7491408934707907e-06, 'epoch': 8.76}
{'loss': 6.6035, 'learning_rate': 2.6727758686521575e-06, 'epoch': 8.8}
{'loss': 6.6469, 'learning_rate': 2.5964108438335243e-06, 'epoch': 8.83}
{'loss': 6.7117, 'learning_rate': 2.520045819014891e-06, 'epoch': 8.87}
{'loss': 6.7086, 'learning_rate': 2.4436807941962584e-06, 'epoch': 8.9}
{'loss': 6.7434, 'learning_rate': 2.367315769377625e-06, 'epoch': 8.93}
{'loss': 6.6223, 'learning_rate': 2.290950744558992e-06, 'epoch': 8.97}
{'loss': 6.7184, 'learning_rate': 2.2145857197403592e-06, 'epoch': 9.0}
{'loss': 6.5652, 'learning_rate': 2.138220694921726e-06, 'epoch': 9.04}
{'loss': 6.5812, 'learning_rate': 2.061855670103093e-06, 'epoch



Training completed. Do not forget to share your model on huggingface.co/models =)




{'loss': 6.634, 'learning_rate': 0.0, 'epoch': 10.0}
{'train_runtime': 841.0864, 'train_samples_per_second': 221.428, 'train_steps_per_second': 3.46, 'train_loss': 7.128000161082475, 'epoch': 10.0}


TrainOutput(global_step=2910, training_loss=7.128000161082475, metrics={'train_runtime': 841.0864, 'train_samples_per_second': 221.428, 'train_steps_per_second': 3.46, 'train_loss': 7.128000161082475, 'epoch': 10.0})
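
The summary metrics can be sanity-checked with a little arithmetic. The sketch below uses only numbers printed in the log above; reading the first result as the combined batch size per optimizer step is an assumption about how the trainer counts samples.

```python
# Values copied from the training summary above.
train_runtime = 841.0864        # seconds
samples_per_second = 221.428
steps_per_second = 3.46
global_step = 2910
epochs = 10.0

# Samples consumed per optimizer step (assumed to be the combined batch size).
samples_per_step = samples_per_second / steps_per_second
# Optimizer steps in one pass over the training set.
steps_per_epoch = global_step / epochs
# Cross-check: runtime times step rate should land near the reported step count.
estimated_steps = train_runtime * steps_per_second

print(f"samples per step  ~ {samples_per_step:.1f}")
print(f"steps per epoch   = {steps_per_epoch:.0f}")
print(f"runtime x steps/s ~ {estimated_steps:.0f}")
```

About 64 samples per step, with a micro batch size of 1 and 16 gradient accumulation steps, would suggest a further factor of 4 coming from the IPU configuration (for example replication), although the notebook does not show that configuration.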

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [18]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Compiling Model...
Compiled/Loaded model in 138.40369308274


Perplexity: 780.67


The perplexity is still quite high because, for this demo, we trained on a small dataset for a small number of epochs. For real language model training, you would need a larger dataset and more epochs.
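
Perplexity is the exponential of the mean cross-entropy loss (in nats), so the two can be converted back and forth. As a rough illustration, this sketch recovers the evaluation loss implied by the perplexity printed above and compares it with a uniform-guess baseline, assuming the GPT-2 tokenizer's vocabulary size of 50257.

```python
import math

# Perplexity reported by the evaluation cell above.
reported_perplexity = 780.67
# Invert perplexity = exp(loss) to get the mean cross-entropy loss.
implied_eval_loss = math.log(reported_perplexity)

# Baseline: a model that assigns equal probability to every token in the vocabulary
# has loss log(V) and perplexity V (assumes the GPT-2 vocabulary size).
gpt2_vocab_size = 50257
uniform_loss = math.log(gpt2_vocab_size)

print(f"implied eval loss:     {implied_eval_loss:.2f}")
print(f"uniform baseline loss: {uniform_loss:.2f}")
```

The gap between the two numbers gives a sense of how much the model has learned relative to guessing uniformly over the vocabulary.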