If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Right now this requires the current master branch of both, so run the following cell.

In [2]:
! pip install git+https://github.com/huggingface/transformers.git
! pip install git+https://github.com/huggingface/datasets.git

Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to c:\users\marty\appdata\local\temp\pip-req-build-632ct81a
  Resolved https://github.com/huggingface/transformers.git to commit c4a96cecbc3a7ec1794c4f4c3b4359887afb6bce
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'


  Running command git clone --filter=blob:none -q https://github.com/huggingface/transformers.git 'C:\Users\marty\AppData\Local\Temp\pip-req-build-632ct81a'


Collecting git+https://github.com/huggingface/datasets.git
  Cloning https://github.com/huggingface/datasets.git to c:\users\marty\appdata\local\temp\pip-req-build-7nu3617l
  Resolved https://github.com/huggingface/datasets.git to commit 6a7467be15428c3de46702e1bc2d86cc1a7c8e37


  Running command git clone --filter=blob:none -q https://github.com/huggingface/datasets.git 'C:\Users\marty\AppData\Local\Temp\pip-req-build-7nu3617l'


  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'


If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!). Then uncomment the following cell and input your username and password (this only works on Colab; in a regular notebook, you need to do this in a terminal):

Then you need to install Git-LFS and set up Git if you haven't already. On Linux, uncomment the following instructions and adapt them with your name and email. On Windows, please download Git-LFS from https://git-lfs.github.com/

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

Make sure your version of Transformers is at least 4.8.1 since the functionality was introduced in that version:

In [2]:
import transformers

print(transformers.__version__)

4.15.0.dev0
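
If you would rather have the notebook fail early than compare the version by eye, a minimal sketch of such a check might look like the following. The helper `version_tuple` is a hypothetical name introduced here for illustration, not part of Transformers, and it simply ignores suffixes such as `.dev0`.

```python
import re


def version_tuple(version):
    """Turn a version string such as "4.15.0.dev0" into a tuple like (4, 15, 0).

    Hypothetical helper for illustration: only the three leading numeric
    components are kept, so suffixes such as ".dev0" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


MINIMUM_VERSION = (4, 8, 1)  # the minimum version stated above

# In the notebook you would pass transformers.__version__ instead of a literal.
installed = version_tuple("4.15.0.dev0")
assert installed >= MINIMUM_VERSION, "Please upgrade Transformers"
print(installed)
```

Comparing tuples of integers avoids the pitfall of comparing version strings directly, where `"4.15.0"` would sort before `"4.8.1"`.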


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/language-modeling).

# Train a language model

## Preparing the dataset

In [3]:
import os
import re

import pandas as pd

speech_dir = "trump_speeches"
texts = []
for filename in os.listdir(speech_dir):
    path = os.path.join(speech_dir, filename)
    with open(path, encoding="cp850") as f:
        file_text = f.read()
    # Keep only letters, digits, periods and spaces.
    file_text = re.sub(r"[^a-zA-Z0-9\. ]", "", file_text)
    texts.append(file_text)

# One row per speech; building from a list avoids hard-coding the file count.
df = pd.DataFrame({"text": texts})
df.head()

| | text |
|---|---|
| 0 | Thank you. Thank you. Thank you to Vice Presid... |
| 1 | Theres a lot of people. Thats great. Thank you... |
| 2 | Thank you. Thank you. Thank you. All I can say... |
| 3 | I want to thank you very much. North Carolina ... |
| 4 | Thank you all. Thank you very much. Thank you ... |


In [4]:
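sentences = []
too_short_save = ""  # text accumulated until a chunk reaches about 30 words

for _, row in df.iterrows():
    dot_split = row["text"].split(".")

    mem_words = 0  # words accumulated so far for the current speech
    for sentence in dot_split:
        words = sentence.split(" ")
        if (len(words) + mem_words) < 30:
            # Chunk still too short: keep accumulating.
            too_short_save = too_short_save + ". " + sentence
            mem_words = mem_words + len(words)
        else:
            # Chunk is long enough: store it without the leading ". ".
            too_short_save = too_short_save + ". " + sentence + ". "
            sentences.append(too_short_save[2:])
            mem_words = 0
            too_short_save = ""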

In [5]:
from datasets import Dataset

dataset = Dataset.from_pandas(pd.DataFrame(sentences))

In [6]:
dataset = dataset.train_test_split(test_size=0.1)

In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['0'],
        num_rows: 9384
    })
    test: Dataset({
        features: ['0'],
        num_rows: 1043
    })
})

For both tasks below (causal and masked language modeling), we will use the speech dataset built above as an example: chunks of roughly 30 words taken from the transcripts in the `trump_speeches` folder, wrapped in a 🤗 Datasets `DatasetDict` with a train split and a test split.

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files; just replace the paths with values that lead to your files.

You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split first, then give an index:

In [8]:
dataset["train"][0]

{'0': ' Were taking our soldiers were bringing them back home.  Were not law enforcement.  Were bringing them back home.  American troops cannot be the policemen for the world or to create democracy in other nations that frankly probably dont want it. '}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [9]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(
        dataset
    ), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

In [10]:
show_random_elements(dataset["train"])

| | 0 |
|---|---|
| 0 | His people want to defund the police. They actually have places Seattle they actually dont want to have police. Minneapolis they dont want to have police. They dont want to have police. |
| 1 | Theyre playing the song YMCA. So when youre having a hard time gee whats US just think of the song YMCA and youll remember it USMCA. Its a massive win for Iowa farmers and workers of all kinds. |
| 2 | So when I got in always when you get in there are none. How many do you have I have none. None. So I got to I say How many federal judges do I have Sir you have 142 federal judges. |
| 3 | I said did Raisin Kane and a couple of other generals that were there these guys are central casting like from a movie except better. Theyre stronger bigger tougher meaner and actually better looking in a certain way. |
| 4 | Again. Because we did it and then we had to close it up. We saved millions of lives. If we didnt do that you would have had three million people. |
| 5 | He doesnt know what platform. He doesnt. Its your radical left people. Its AOC plus three. Its all these people. Bernie. Its Bernie. |
| 6 | It was a disaster. It was incompetent. They called themselves incompetent. They call and now theyre coming in like well we would have done this and Biden by the way was against you remember xenophobic racist because I closed down China. |
| 7 | I went there four or five times. People are great. I won it. And they were right. There is no path to 270 but there was a path to 306. |
| 8 | Its going to be protected. What theyre doing is crazy. And we will always protect patients with preexisting conditions. And we will also protect you with preexisting physicians. |
| 9 | You had to see we were in Iowa we were in New Hampshire. You saw what happened in Colorado last. That was unbelievable. Every place we go its the same. |


As we can see, each example is a short chunk of a few consecutive sentences from a speech, with all punctuation other than periods removed by the cleaning step.

## Causal language modeling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them in examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or 
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether or not they span several of the original texts in the dataset. The labels are a copy of the inputs; the model shifts them internally so that each position is trained to predict the next token.
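
To make the shift concrete, here is a minimal sketch in plain Python. The token ids are made up and do not come from the GPT-2 vocabulary, and `next_token_pairs` is a hypothetical helper name, not a library function. It pairs each position of a chunk with the token the model is asked to predict there.

```python
def next_token_pairs(chunk):
    """Return (context, target) pairs for next-token prediction on one chunk."""
    inputs = chunk[:-1]   # every token except the last one
    targets = chunk[1:]   # the same tokens shifted one position to the left
    # At position i the model sees inputs[0..i] and must predict targets[i].
    return [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]


chunk = [11, 22, 33, 44]  # made-up token ids
for context, target in next_token_pairs(chunk):
    print(context, "->", target)
```

A chunk of `n` tokens therefore yields `n - 1` training signals, one per position that has a following token.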

We will use the [`gpt2`](https://huggingface.co/gpt2) architecture for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead. For the tokenizer, you can replace the checkpoint by the one you trained yourself.

In [11]:
model_checkpoint = "gpt2"
tokenizer_checkpoint = 'gpt2'

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)

def tokenize_function(examples):
    return tokenizer(examples["0"])

tokenized_datasets = dataset.map(
    tokenize_function, batched=True, num_proc=1, remove_columns=["0"]
)

  0%|          | 0/10 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

The cell above calls the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts.

Then we apply it to all the splits in our `dataset` object, using `batched=True` and a single process (raise `num_proc` to speed up the preprocessing on larger datasets). We won't need the `0` column afterward, so we discard it.

Now for the harder part: we need to concatenate all our texts together then split the result in small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.
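
The effect of `batched=True` can be illustrated without the library. The sketch below is plain Python, not the 🤗 Datasets implementation: `toy_batched_map` and `chunk_batch` are hypothetical names, and the token ids are made up. A batched function receives a dictionary of lists and may return lists of a different length, which is how three tokenized texts become two fixed-size chunks here.

```python
def chunk_batch(batch, block_size=4):
    """Concatenate the token lists in a batch and cut them into fixed-size blocks."""
    flat = [token for ids in batch["input_ids"] for token in ids]
    usable = (len(flat) // block_size) * block_size  # drop the small remainder
    return {"input_ids": [flat[i : i + block_size] for i in range(0, usable, block_size)]}


def toy_batched_map(rows, function, batch_size=1000):
    """Apply `function` to successive batches of `rows` and collect the results."""
    output = {"input_ids": []}
    for start in range(0, len(rows["input_ids"]), batch_size):
        batch = {"input_ids": rows["input_ids"][start : start + batch_size]}
        output["input_ids"].extend(function(batch)["input_ids"])
    return output


tokenized = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}  # made-up ids
grouped = toy_batched_map(tokenized, chunk_batch)
print(len(tokenized["input_ids"]), "texts ->", len(grouped["input_ids"]), "chunks")
```

Note that in this sketch the remainder is dropped once per batch, so a larger `batch_size` discards fewer tokens overall.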

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take much less, at just 50.

Then we write the preprocessing function that will group our texts:

In [13]:
# block_size = tokenizer.model_max_length
block_size = 50

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=1,
)

  0%|          | 0/10 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

In [14]:
tokenizer.decode(lm_datasets["train"][3]["input_ids"])

' close all their factories here.  President I have no choice.  I said Youre right.  Youre right.  First thing Ive ever heard him say that I agreed with hes got no choice.  Actually I honestly dont think he'

Now that the data has been cleaned, we're ready to instantiate our `Model`. First we create the model using the same config as our checkpoint, but initialized with random weights:

In [15]:
from transformers import AutoConfig, TFAutoModelForCausalLM
from transformers import AdamWeightDecay
import tensorflow as tf

learning_rate = 2e-5
weight_decay = 0.01

config = AutoConfig.from_pretrained(model_checkpoint)
model = TFAutoModelForCausalLM.from_config(config)
optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as the 'labels' key of the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.


In [16]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors='tf')

train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator
)
validation_set = lm_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator
)

Now we can train our model. We can also add a callback to sync up our model with the Hub - this allows us to resume training from other machines and even test the model's inference quality midway through training! The cell below trains without it; to enable it, create a `PushToHubCallback` (from `transformers.keras_callbacks`) and pass it to `fit()` through the `callbacks` argument, making sure to change the `username`.

In [None]:


model.fit(train_set, validation_data=validation_set, epochs=2)

Epoch 1/2
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
 35/508 [=>............................] - ETA: 51:34 - loss: 9.3971

Once the training is completed, we can evaluate our model and get its loss on the validation set like this:

In [29]:
eval_loss = model.evaluate(validation_set)



In [19]:
prompt = "its a great day"
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="tf")["input_ids"]



In [34]:
# The generated sequence already starts with the prompt tokens, so decoding it
# in full avoids slicing by character offsets (which previously dropped a character).
outputs = model.generate(inputs, max_length=16, do_sample=True, top_p=0.9, temperature=0.8, top_k=150)
generated = tokenizer.decode(outputs[0])

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


In [35]:
generated

'its a great dayof the world.  And I mean I just dont know'

The quality of language models is often measured in 'perplexity' rather than cross-entropy. To convert to perplexity, we simply raise e to the power of the cross-entropy loss.
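
As a sanity check on this conversion, consider a model that assigns the same probability to each of V tokens: its cross-entropy is ln V, so its perplexity is exactly V. The sketch below verifies this in plain Python; the function names and the toy probabilities are made up for illustration.

```python
import math


def cross_entropy(true_token_probs):
    """Average negative log-probability assigned to the correct tokens (in nats)."""
    return -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)


def perplexity(loss):
    """Perplexity is e raised to the cross-entropy loss."""
    return math.exp(loss)


vocab_size = 4
uniform_probs = [1 / vocab_size] * 10  # a model guessing uniformly on 10 tokens
print(perplexity(cross_entropy(uniform_probs)))
```

Perplexity can thus be read as the number of equally likely tokens the model is effectively choosing between at each step, so lower is better.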

In [None]:
import math

print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 574.92


The perplexity is still quite high since for this demo we trained on a small dataset for a small number of epochs. For a real LM training, you would need a larger dataset and more epochs.

If you used the callback above, you can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("your-username/my-awesome-model")
```

## Masked language modeling

For masked language modeling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them by `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens). If you use a tokenizer you trained yourself, make sure the `[MASK]` token is among the special tokens you passed during training!

We will use the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead. For the tokenizer, replace the checkpoint with the one you trained.

In [None]:
model_checkpoint = "bert-base-cased"
tokenizer_checkpoint = "sgugger/bert-like-tokenizer"

We can apply the same tokenization function as before; we just need to update our tokenizer to use the checkpoint we just picked:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
# `dataset` and its single text column "0" were created in the data preparation cells.
tokenized_datasets = dataset.map(
    tokenize_function, batched=True, num_proc=4, remove_columns=["0"]
)

Token indices sequence length is longer than the specified maximum sequence length for this model (571 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (554 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (522 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (657 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (514 > 512). Running this sequence through the model will result in indexing errors


And like before, we group texts together and chunk them in samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

The rest is very similar to what we had, with two exceptions. First we use a model suitable for masked LM:

In [None]:
from transformers import AutoConfig, TFAutoModelForMaskedLM

config = AutoConfig.from_pretrained(model_checkpoint)
model = TFAutoModelForMaskedLM.from_config(config)

We redefine our hyperparameters and choose a new name:

In [None]:
learning_rate = 2e-5
weight_decay = 0.01
push_to_hub_model_id = f"{model_checkpoint}-wikitext2"

Now we initialize our optimizer.

In [None]:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)

All Transformers models compute loss internally, so as in the CLM example we can just leave the loss argument blank to use the internal loss.

In [None]:
import tensorflow as tf

model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! Please ensure your labels are passed as the 'labels' key of the input dict so that they are accessible to the model during the forward pass. To disable this behaviour, please pass a loss argument, or explicitly pass loss=None if you do not want your model to compute a loss.


Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them into tensors. In the previous example, we had nothing special to do, so we just used the default collator. Here we want to do the random masking. We could do it as a preprocessing step (like the tokenization), but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.
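
The difference between masking once during preprocessing and masking inside the collator can be sketched in plain Python. This is an illustration only, not the library's implementation: `mask_tokens` and `MASK_ID` are hypothetical, the token ids are made up, and the real collator also sometimes substitutes a random token or leaves a selected token unchanged instead of always inserting the mask.

```python
import random

MASK_ID = 0          # hypothetical id standing in for the [MASK] token
IGNORE_INDEX = -100  # label value assumed to be skipped by the loss


def mask_tokens(input_ids, rng, mlm_probability=0.15):
    """Mask each token independently with probability `mlm_probability`."""
    masked, labels = [], []
    for token in input_ids:
        if rng.random() < mlm_probability:
            masked.append(MASK_ID)       # hide the token from the model
            labels.append(token)         # and ask the model to recover it
        else:
            masked.append(token)
            labels.append(IGNORE_INDEX)  # not masked: no prediction required
    return masked, labels


rng = random.Random(0)
sample = list(range(1, 21))  # made-up token ids
for epoch in range(2):
    # Calling the function again draws a fresh mask, as a collator would each epoch.
    masked, labels = mask_tokens(sample, rng)
    print(f"epoch {epoch}: {masked}")
```

Had the masking been applied once as a preprocessing step, the model would see the same masked positions in every epoch.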

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking. Make sure to set `return_tensors="tf"` too - the `DataCollator` objects all support multiple frameworks, and we don't want to accidentally get a bunch of `torch.Tensor` objects floating around in our TensorFlow code!

In [None]:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf"
)

Now we pass our data collator to `to_tf_dataset()` through its `collate_fn` argument.

In [None]:
train_set = lm_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
validation_set = lm_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

And now we can train our model:

In [None]:
from transformers.keras_callbacks import PushToHubCallback

model_name = model_checkpoint.split("/")[-1]
push_to_hub_model_id = f"{model_name}-finetuned-wikitext2"
username = "Rocketknight1"

callback = PushToHubCallback(
    output_dir="./mlm_from_scratch_model_save",
    tokenizer=tokenizer,
    hub_model_id=f"{username}/{push_to_hub_model_id}",
)

model.fit(train_set, validation_data=validation_set, epochs=2, callbacks=[callback])

/home/matt/PycharmProjects/notebooks/examples/mlm_from_scratch_model_save is already a clone of https://huggingface.co/Rocketknight1/bert-base-cased-finetuned-wikitext2. Make sure you pull the latest changes with `repo.git_pull()`.


Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f82c04458b0>

Like before, we can evaluate our model on the validation set. The perplexity is lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.

In [None]:
eval_loss = model.evaluate(validation_set)
print(f"Perplexity: {math.exp(eval_loss):.2f}")

Perplexity: 524.21


The perplexity is still quite high since for this demo we trained on a small dataset for a small number of epochs. For a real LM training, you would need a larger dataset and more epochs.

If you used the callback above, you can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import TFAutoModelForMaskedLM

model = TFAutoModelForMaskedLM.from_pretrained("your-username/my-awesome-model")
```