If you're opening this notebook on Colab, you will probably need to install 🤗 Transformers and 🤗 Datasets. Run the following cell to do so.

In [1]:
! pip install datasets transformers



If you're opening this notebook locally, make sure your environment has the latest version of those libraries installed.

To be able to share your model with the community and generate results like the one shown in the picture below via the inference API, there are a few more steps to follow.

First you have to store your authentication token from the Hugging Face website (sign up [here](https://huggingface.co/join) if you haven't already!). Execute the following cell and enter your access token when prompted:

In [2]:
from huggingface_hub import notebook_login

notebook_login()


Then you need to install Git-LFS. Run the following instruction (it assumes a system where `apt` is the package manager, such as Colab):

In [3]:
!apt install git-lfs

The operation couldn’t be completed. Unable to locate a Java Runtime that supports apt.
Please visit http://www.java.com for information on installing Java.



Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [3]:
import transformers

print(transformers.__version__)

4.28.0


You can find a script version of this notebook to fine-tune your model in a distributed fashion using multiple GPUs or TPUs [here](https://github.com/huggingface/transformers/tree/master/examples/language-modeling).

We also quickly upload some telemetry - this tells us which examples and software versions are getting used so we know where to prioritize our maintenance efforts. We don't collect (or care about) any personally identifiable information, but if you'd prefer not to be counted, feel free to skip this step or delete this cell entirely.

In [4]:
from transformers.utils import send_example_telemetry

# This notebook fine-tunes with the PyTorch `Trainer`, so the framework is reported as "pytorch".
send_example_telemetry("language_modeling_notebook", framework="pytorch")

# Fine-tuning a language model

In this notebook, we'll see how to fine-tune one of the [🤗 Transformers](https://github.com/huggingface/transformers) models on a language modeling task. We will cover two types of language modeling tasks:

- Causal language modeling: the model has to predict the next token in the sentence (so the labels are the same as the inputs, shifted by one position). To make sure the model does not cheat, it gets an attention mask that prevents it from accessing the tokens after token i when trying to predict token i+1 in the sentence.

![Widget inference representing the causal language modeling task](images/causal_language_modeling.png)

- Masked language modeling: the model has to predict some tokens that are masked in the input. It still has access to the whole sentence, so it can use the tokens before and after the tokens masked to predict their value.

![Widget inference representing the masked language modeling task](images/masked_language_modeling.png)
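To make the difference between the two tasks concrete, here is a minimal pure-Python sketch on a toy sentence. The helper names (`causal_visibility`, `next_token_pairs`, `mask_positions`) and the toy tokens are made up for illustration; they are not part of 🤗 Transformers, and real models work on token ids rather than words.

```python
def causal_visibility(seq_len):
    """Row i marks with 1 the positions a causal model may look at from position i."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


def next_token_pairs(tokens):
    """Causal LM: for each prefix, the target is the token that follows it."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]


def mask_positions(tokens, positions, mask_token="[MASK]"):
    """Masked LM: hide the chosen positions and keep their original tokens as targets."""
    masked = [mask_token if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(positions)}
    return masked, targets


tokens = ["The", "cat", "sat", "down"]  # toy example

# Causal: a lower-triangular mask, each position only sees itself and the past.
for row in causal_visibility(len(tokens)):
    print(row)
for context, target in next_token_pairs(tokens):
    print(context, "->", target)

# Masked: the whole sentence stays visible except for the hidden token.
print(mask_positions(tokens, {1}))
```

In the causal case every position yields a training target, while in the masked case only the hidden positions do.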

We will see how to easily load and preprocess the dataset for each one of those tasks, and how to use the `Trainer` API to fine-tune a model on it.

A script version of this notebook you can directly run on a distributed environment or on TPU is available in our [examples folder](https://github.com/huggingface/transformers/tree/master/examples).

## Preparing the dataset

For each of those tasks, we will use the Wikitext 2 dataset as an example. You can load it very easily with the 🤗 Datasets library.

In [5]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

len(datasets["validation"])

3760

You can replace the dataset above with any dataset hosted on [the hub](https://huggingface.co/datasets) or use your own files. Just uncomment the following cell and replace the paths with values that will lead to your files:

In [6]:
datasets = load_dataset("text", data_files={"train": "../preprocessing/txt_files/train_output.txt", "validation": "../preprocessing/txt_files/dev_output.txt"})
len(datasets["train"])

8034

You can also load datasets from a csv or a JSON file, see the [full documentation](https://huggingface.co/docs/datasets/loading_datasets.html#from-local-files) for more information.

To access an actual element, you need to select a split first, then give an index:

In [7]:
datasets["train"][10]

{'text': "customer: Hi, is this jacket dry clean only? agent: Hello, it is not dry clean only. We use high quality fabric in our materials, so you can wash and dry them in a typical washer and dryer! customer: Great! It is a little inconvenient when one is not machine washable. agent: I highly agree! Especially with everything happening right now, you may not even be able to find one that is open! agent: Anyway, is there anything else I can help you with? customer: Do they come in different colors agent: Hello, I would have to direct you to our website, as our stock is very dynamic. If you do not see the color you like, I suggest checking back at the end of the month, when we refill the stock customer: That's alright. I actually want it in red if you have it. But if not, I'll get this. agent: That sounds great! Is there anything else I can help you with? customer: this is good. thank you agent: Have a great day!"}

To get a sense of what the data looks like, the following function will show some examples picked randomly in the dataset.

In [8]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))
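The rejection loop in `show_random_elements` draws indices until it has enough distinct ones. As a possible alternative, the standard library's `random.sample` draws distinct indices in one call. The sketch below is only an illustration: `pick_unique_indices` and its `seed` parameter are hypothetical names, not part of the notebook or of 🤗 Datasets.

```python
import random


def pick_unique_indices(dataset_length, num_examples=10, seed=None):
    """Return `num_examples` distinct indices in [0, dataset_length)."""
    if num_examples > dataset_length:
        raise ValueError("Can't pick more elements than there are in the dataset.")
    rng = random.Random(seed)  # a local generator, so the global random state is untouched
    return rng.sample(range(dataset_length), num_examples)


# 8034 is the size of the train split loaded above.
picks = pick_unique_indices(8034, num_examples=10, seed=0)
print(picks)
```

The resulting list could be used in place of `picks` inside the function above, for instance with `dataset[picks]`.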

In [9]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,"agent: Hello. How may I assist you? customer: Hi! I would like to know a little more about these Hilfiger jeans agent: Sure, what is your question? customer: How long is the length/outseam agent: Our regular pair is 42"" on the outseam, this extra long length is meant so that the customer will tailor the jeans for their exact height. agent: Can I help you with anything else? customer: No thank you agent: Great, have a good night."
1,"agent: Hello, how can I help you today? customer: I would like to know the state of my refund customer: I just want to double check the status agent: Glad to help. May I have your full name please? customer: Joseph Banter agent: thanks Joseph. To validate your purchase, can I have your username, email address and order id please? customer: josephb1 customer: josephb1@email.com customer: 7032690209 agent: Perfect. The refund is in progress according to the system. customer: Okay, thank you agent: Great, is there anything else that I can help you with? customer: That is everything agent: Have a nice day! customer: you too"
2,"agent: Hello, what can I help you with today? customer: It looks like i was given and charged for a premium subscription, but I don't want it. can you refund the fee for me? customer: i didn't sign up for it agent: Let me check on that for you, can I please have your name? customer: Alessandro Phoenix agent: Thank you, can I also have your account ID and order ID number customer: I don't have either agent: No problem, how did you hear that you were charged for the subscription? customer: I got an email agent: I apologize, it looks like that was our mistake! agent: How much was the charge for? customer: 40 customer: thanks! agent: Ok, I have refunded that amount for you. Is there anything else I can help you with today? customer: nope, that covers it! thanks! agent: You're welcome!"
3,"agent: Thank you for contacting AmceBrands! How can I help you? customer: never had this problem before? customer: I add things to my cart but they don't stay there customer: it doesn't add agent: I'm sorry your having trouble completing your order. So the shopping cart isn't updating when you add items? customer: that's correct customer: would like to buy but can't won't let me! agent: I understand your frustration. Could I have your name? customer: alright customer: crystal minh silver level customer: never had this problem customer: cminh202 agent: Great. Crystal, could you log out of your account and try to log back in for me to see if your cart updates? customer: ok, i've tried it but I'll do it again customer: afraid that isn't helping agent: Now that you're logged back in and on the shopping cart page, click the refresh button and let me know if it updates. customer: hmm customer: I doubted that one but guess it did the trick! customer: really weird customer: already had tried that agent: Great! Is there anything else I can help you with? customer: no that's it sorry about that silly problem! agent: It's okay. No worries. I hope you enjoy the rest of your day!"
4,"agent: Hello! Thank you for contacting us today. How can I help you? customer: Hi. I signed up for the premium subscription recently. I need to make a payment to keep it active. agent: Sure I can help you with that. Can I get your name please? agent: I will also need your Account ID and the Order ID. customer: ok, my name is David Williams and my account id is UK0R5XII5E and my order id is 6939563683 agent: Perfect. Thank you, David. How much of the fee were you looking to pay today? customer: As much as I can. agent: Okay. Let me check what you need to pay. agent: Okay the total today is $49. Do you have a credit card number you want me to put that on? customer: i don't have it on hand right now. agent: I can use the one on the account if that is sufficient. customer: yeah that's my main one. agent: Perfect. I have made that payment to renew the subscription for you. agent: Can I help you with anything else today? customer: nope that's it agent: Thank you and have a great afternoon!"
5,"agent: hello, how may i help you today? customer: Hello, I would like information on how to upgrade the shipping for Order ID 6432922258? agent: if i understand correctly, you would like to upgrade the shipping on your order? customer: That is correct. agent: i can assist with that today customer: I want to upgrade to overnight so I can receive it asap. agent: may i please have your full name and account ID? customer: My name is Joseph Banter. My account ID is CMWD4EXJN customer: That was the wrong ID. Correct ID is CMWD4JDXJN. My apologies for the typo. agent: no problem, and thank you agent: do you have the shipping status of the item currently customer: My account states that it is ""In Transit"" agent: since the item has already shipped, it is not possible for me to upgrade it to overnight shipping agent: however, because it is already in transit, it will arrive within a day or two customer: Does that mean it will be here tomorrow or the day after? agent: i do however apologize for the inconvience agent: it is likely to arrive today or tomorrow agent: if you check your tracking number, that should give you a better idea of the timefram (based on if it is local or not yet, and your standard shipper delivery time windows for your residence) customer: Thank you for your assistance. I'll need to remember to check my order type next time. agent: is there anything else i can assist with today? customer: I am all set today. Thank you for your assistance."
6,"agent: Hello, how can I help you? customer: My husband just told me that a premium subscription was added to our account that we didn't want agent: Let me check on that for you agent: Can you provide your full name or account ID please? customer: Rodriguez Domingo agent: And you heard about this from your spouse, correct? customer: yes agent: Hello, I have checked the system and it looks like you have not had a service added. I think your husband may have misheard. You have nothing to worry about and have NOT been charged! customer: Well thank you. I'm happy about that. agent: Is there anything else I can help you with? customer: Not today, thanks. agent: Have a great day :)"
7,"agent: Hello, how may I help you today? customer: Hi, I want to know when my annual subscription fee is due customer: My name is Joseph Banter customer: Account ID: SS68H8TXK5 agent: Okay, Joseph. Let's look at your subscription fee due date. agent: I need your order ID. customer: Order ID: 6238541609 agent: Your payment is due three days from now. customer: Oh okay, I will pay it later then. agent: Here is a link to your account login page so you can view your information when needed. customer: Great, thank you agent: Will that be all? customer: Yes, that's all. Thank you for helping me agent: No problem, have a great day."
8,"agent: Welcome to AcmeBrands. How may I help? customer: Can't access account, I wanted to check status of my order? agent: OK. Why can't you access your account? customer: I forgot my username agent: OK. I can help with that. customer: Thank you kindly agent: May I have your full name or account ID customer: Alessandro Phoenix customer: Account Id is CORONA123 agent: Thanks. I have pulled up the account. agent: I also need your ip code and phone nuber agent: ip agent: sorry, zip customer: 61196 customer: (875) 054-9867 customer: Can you please hurry, I need to cook a rabbit for Easter dinner agent: Thanks I have verified your identity. customer: Coolio agent: aphoenix 1 is your username. Anything else? customer: Ok thanks. Okay I'm log in"
9,"agent: Hello, how can i help you today customer: Hello! is your website down right now? customer: It is incredibly slow... customer: I can't proceed to the check out section agent: Not that i know of but i will put in a not with our internal team. Could you try logging out and logging in and seeing if it is still slow customer: Ok let me try customer: Hmm. still not working agent: Okay could you see if other websites are slow for you customer: They are good agent: Okay, looks like our site is having issues right now. If you have a purchase in mind i could make it for you if not the site will be up soon customer: yes please customer: it is a CK jeans customer: size 4 please agent: Okay sure could i get the credit card you would like to use please customer: 2345 4567 5789 6766 customer: the exp date is 03/20 agent: Okay, i have made the purchase and will have them shipped to the adress on file, is there anything else i can do for you customer: Thank you. That's all customer: Have a great one agent: You too."


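As we can see, each text in this custom dataset is a full customer-service dialogue made of alternating `agent:` and `customer:` turns. (With the Wikitext 2 dataset, some of the texts are a full paragraph of a Wikipedia article while others are just titles or empty lines.)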

## Causal language modeling

For causal language modeling (CLM) we are going to take all the texts in our dataset and concatenate them after they are tokenized. Then we will split them into examples of a certain sequence length. This way the model will receive chunks of contiguous text that may look like:
```
part of text 1
```
or 
```
end of text 1 [BOS_TOKEN] beginning of text 2
```
depending on whether they span over several of the original texts in the dataset or not. The labels will be the same as the inputs, shifted to the left.

We will use the [`distilgpt2`](https://huggingface.co/distilgpt2) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=causal-lm) instead:

In [10]:
model_checkpoint = "distilgpt2"

To tokenize all our texts with the same vocabulary that was used when training the model, we have to download a pretrained tokenizer. This is all done by the `AutoTokenizer` class:

In [11]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

We can now call the tokenizer on all our texts. This is very simple, using the [`map`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasets.Dataset.map) method from the Datasets library. First we define a function that calls the tokenizer on our texts:

In [12]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

Then we apply it to all the splits in our `datasets` object, using `batched=True` and 4 processes to speed up the preprocessing. We won't need the `text` column afterward, so we discard it.

In [13]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

If we now look at an element of our datasets, we will see the text has been replaced by the `input_ids` the model will need:

In [14]:
tokenized_datasets["train"][1]

{'input_ids': [25781,
  25,
  922,
  6672,
  11,
  703,
  460,
  314,
  1037,
  345,
  30,
  6491,
  25,
  655,
  2227,
  284,
  2198,
  319,
  262,
  3722,
  286,
  257,
  12929,
  5797,
  25,
  1654,
  11,
  561,
  345,
  1577,
  502,
  534,
  1336,
  1438,
  393,
  1848,
  4522,
  6491,
  25,
  47319,
  28092,
  9643,
  6491,
  25,
  49812,
  8538,
  24,
  2670,
  5797,
  25,
  3224,
  284,
  428,
  345,
  561,
  1577,
  502,
  262,
  1502,
  4522,
  290,
  3053,
  5797,
  25,
  3387,
  6491,
  25,
  9225,
  1433,
  3134,
  2414,
  1983,
  6491,
  25,
  49812,
  8538,
  24,
  2670,
  31,
  12888,
  13,
  785,
  6491,
  25,
  645,
  18572,
  5797,
  25,
  2279,
  287,
  1502,
  11,
  2582,
  314,
  481,
  7603,
  262,
  3722,
  286,
  534,
  12929,
  13,
  6491,
  25,
  1049,
  6491,
  25,
  1309,
  502,
  760,
  5797,
  25,
  632,
  318,
  3058,
  287,
  4371,
  290,
  262,
  6074,
  2446,
  351,
  543,
  340,
  318,
  852,
  13686,
  318,
  2691,
  3371,
  534,
  3884,
  2657,
  13

Now for the harder part: we need to concatenate all our texts together, then split the result into small chunks of a certain `block_size`. To do this, we will use the `map` method again, with the option `batched=True`. This option actually lets us change the number of examples in the datasets by returning a different number of examples than we got. This way, we can create our new samples from a batch of examples.

First, we grab the maximum length our model was pretrained with. This might be a bit too big to fit in your GPU RAM, so here we take a bit less at just 128.

In [15]:
# block_size = tokenizer.model_max_length
block_size = 128

Then we write the preprocessing function that will group our texts:

In [16]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder. We could add padding instead if the model supported it;
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

First note that we duplicate the inputs for our labels. This is because the models in the 🤗 Transformers library apply the shift themselves, so we don't need to do it manually.
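As a rough illustration of that internal shift, the sketch below pairs each prediction position with the label it is scored against when `labels` is a plain copy of `input_ids`. The function `shifted_pairs` and the toy ids are hypothetical; the library does the equivalent on tensors inside the model's loss computation.

```python
def shifted_pairs(input_ids, labels):
    """Pair the context seen at each position with the label it is compared to.

    The prediction made at position i is scored against labels[i + 1], so the
    last position has no target and the first label is never used.
    """
    shifted_labels = labels[1:]
    return [
        (input_ids[: pos + 1], target)
        for pos, target in enumerate(shifted_labels)
    ]


input_ids = [10, 11, 12, 13]  # toy token ids
labels = input_ids.copy()     # same duplication as in group_texts

for context, target in shifted_pairs(input_ids, labels):
    print(context, "->", target)
```

A chunk of `block_size` tokens therefore contributes `block_size - 1` next-token predictions to the loss.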

Also note that by default, the `map` method will send a batch of 1,000 examples to be treated by the preprocessing function. So here, we will drop the remainder to make the concatenated tokenized texts a multiple of `block_size` every 1,000 examples. You can adjust this behavior by passing a higher batch size (which will also be processed more slowly). You can also speed up the preprocessing by using multiprocessing:

In [17]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)
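To see what the grouping does on a single batch, here is a self-contained toy version with a tiny block size. `group_texts_toy` is a hypothetical copy of the function above that takes `block_size` as a parameter so it can run on its own; the token ids are made up.

```python
def group_texts_toy(examples, block_size):
    """Concatenate every column of the batch, then cut it into blocks of `block_size`."""
    concatenated = {k: sum(examples[k], []) for k in examples}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


# Three "texts" of lengths 3, 5 and 2: 10 tokens in total.
batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
grouped = group_texts_toy(batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Three input examples become two output examples, the first block spans two of the original texts, and the last two tokens are dropped. With `map`, this drop happens once per batch, which is why a larger `batch_size` discards fewer tokens overall.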

And we can check our datasets have changed: now the samples contain chunks of `block_size` contiguous tokens, potentially spanning over several of our original texts.

In [20]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

" a bronze agent: ok, was the purchase made in the last 90 days? customer: No, I bought it in November. agent: ok, unfortunately because it has been more than 90 days we cannot accept the return. Would there be anything else I can help you with? customer: What if I ask really, really nicely? agent: I can escalate to my manager if you'd like agent: I'd just need your phone number. customer: (977) 625-2661 customer: I'll look forward to hearing from them. customer: Thanks for trying to help. agent: OK, I have let my manager know"

Now that the data has been preprocessed, we're ready to instantiate our `Trainer`. We will need a model:

In [18]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

And some `TrainingArguments`:

In [19]:
from transformers import Trainer, TrainingArguments

In [20]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

# If you run into out-of-memory errors, pass a smaller `per_device_train_batch_size`
# (for example 1 or 2) to `TrainingArguments`.

The last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/gpt-finetuned-wikitext2"` or `"huggingface/gpt-finetuned-wikitext2"`).

We pass along all of those to the `Trainer` class:

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

/Users/graceli/Desktop/asapp-2b/finetuning/distilgpt2-finetuned-wikitext2 is already a clone of https://huggingface.co/xrzhangli/distilgpt2-finetuned-wikitext2. Make sure you pull the latest changes with `repo.git_pull()`.


And we can train our model:

In [22]:
trainer.train()



  0%|          | 0/5283 [00:00<?, ?it/s]

{'loss': 2.7968, 'learning_rate': 1.8107136096914633e-05, 'epoch': 0.28}


Several commits (12) will be pushed upstream.


{'loss': 2.4696, 'learning_rate': 1.6214272193829265e-05, 'epoch': 0.57}


Adding files tracked by Git LFS: ['checkpoint-1000/optimizer.pt', 'checkpoint-1000/pytorch_model.bin', 'pytorch_model.bin', 'checkpoint-1000/rng_state.pth', 'checkpoint-1000/scheduler.pt', 'checkpoint-1000/training_args.bin', 'runs/Oct17_21-25-01_Graces-MacBook-Pro-104.local/events.out.tfevents.1697592324.Graces-MacBook-Pro-104.local.75929.0']. This may take a bit of time if the files are large.
Several commits (13) will be pushed upstream.


{'loss': 2.3517, 'learning_rate': 1.4321408290743897e-05, 'epoch': 0.85}


Adding files tracked by Git LFS: ['checkpoint-1500/optimizer.pt', 'checkpoint-1500/rng_state.pth', 'checkpoint-1500/scheduler.pt', 'checkpoint-1500/training_args.bin']. This may take a bit of time if the files are large.
Several commits (14) will be pushed upstream.


  0%|          | 0/223 [00:00<?, ?it/s]

{'eval_loss': 2.2283639907836914, 'eval_runtime': 56.0152, 'eval_samples_per_second': 31.741, 'eval_steps_per_second': 3.981, 'epoch': 1.0}
{'loss': 2.2948, 'learning_rate': 1.2428544387658528e-05, 'epoch': 1.14}


Adding files tracked by Git LFS: ['checkpoint-2000/optimizer.pt', 'checkpoint-2000/rng_state.pth', 'checkpoint-2000/scheduler.pt', 'checkpoint-2000/training_args.bin']. This may take a bit of time if the files are large.
Several commits (15) will be pushed upstream.


{'loss': 2.2359, 'learning_rate': 1.0535680484573161e-05, 'epoch': 1.42}


Adding files tracked by Git LFS: ['checkpoint-2500/optimizer.pt', 'checkpoint-2500/rng_state.pth', 'checkpoint-2500/scheduler.pt', 'checkpoint-2500/training_args.bin']. This may take a bit of time if the files are large.
Several commits (16) will be pushed upstream.


{'loss': 2.2086, 'learning_rate': 8.642816581487791e-06, 'epoch': 1.7}


Adding files tracked by Git LFS: ['checkpoint-3000/optimizer.pt', 'checkpoint-3000/rng_state.pth', 'checkpoint-3000/scheduler.pt', 'checkpoint-3000/training_args.bin']. This may take a bit of time if the files are large.
Several commits (17) will be pushed upstream.


{'loss': 2.1882, 'learning_rate': 6.749952678402424e-06, 'epoch': 1.99}


Adding files tracked by Git LFS: ['checkpoint-3500/optimizer.pt', 'checkpoint-3500/rng_state.pth', 'checkpoint-3500/scheduler.pt', 'checkpoint-3500/training_args.bin']. This may take a bit of time if the files are large.
Several commits (18) will be pushed upstream.


  0%|          | 0/223 [00:00<?, ?it/s]

{'eval_loss': 2.1384809017181396, 'eval_runtime': 80.4342, 'eval_samples_per_second': 22.105, 'eval_steps_per_second': 2.772, 'epoch': 2.0}
{'loss': 2.1602, 'learning_rate': 4.8570887753170555e-06, 'epoch': 2.27}


Adding files tracked by Git LFS: ['checkpoint-4000/optimizer.pt', 'checkpoint-4000/rng_state.pth', 'checkpoint-4000/scheduler.pt', 'checkpoint-4000/training_args.bin']. This may take a bit of time if the files are large.
Several commits (19) will be pushed upstream.


{'loss': 2.154, 'learning_rate': 2.9642248722316867e-06, 'epoch': 2.56}


Adding files tracked by Git LFS: ['checkpoint-4500/optimizer.pt', 'checkpoint-4500/rng_state.pth', 'checkpoint-4500/scheduler.pt', 'checkpoint-4500/training_args.bin']. This may take a bit of time if the files are large.
Several commits (20) will be pushed upstream.


{'loss': 2.1532, 'learning_rate': 1.0713609691463186e-06, 'epoch': 2.84}


Adding files tracked by Git LFS: ['checkpoint-5000/optimizer.pt', 'checkpoint-5000/rng_state.pth', 'checkpoint-5000/scheduler.pt', 'checkpoint-5000/training_args.bin']. This may take a bit of time if the files are large.
Several commits (21) will be pushed upstream.


  0%|          | 0/223 [00:00<?, ?it/s]

{'eval_loss': 2.1155452728271484, 'eval_runtime': 55.4917, 'eval_samples_per_second': 32.041, 'eval_steps_per_second': 4.019, 'epoch': 3.0}
{'train_runtime': 6130.9185, 'train_samples_per_second': 6.891, 'train_steps_per_second': 0.862, 'train_loss': 2.2927263770030346, 'epoch': 3.0}


TrainOutput(global_step=5283, training_loss=2.2927263770030346, metrics={'train_runtime': 6130.9185, 'train_samples_per_second': 6.891, 'train_steps_per_second': 0.862, 'train_loss': 2.2927263770030346, 'epoch': 3.0})
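The `learning_rate` values in the log above decrease steadily from the `2e-5` we passed to `TrainingArguments`. They are consistent with a linear decay to zero over the 5283 training steps with no warmup, which is the default schedule of `Trainer` as far as we know. The sketch below reproduces the logged values under that assumption; `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, base_lr=2e-5, total_steps=5283, warmup_steps=0):
    """Learning rate after `step` optimizer steps under linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# The log reports a value every 500 steps.
for step in (500, 1000, 1500):
    print(step, linear_lr(step))
```

For step 500 this gives about `1.8107e-05`, matching the first logged line.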

Once the training is completed, we can evaluate our model and get its perplexity on the validation set like this:

In [23]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

  0%|          | 0/223 [00:00<?, ?it/s]

Perplexity: 8.29
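The perplexity is simply the exponential of the average cross-entropy loss per token. As a quick check, the sketch below recomputes it from the final `eval_loss` printed in the training log; the `perplexity` helper is a hypothetical name used only here.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


eval_loss = 2.1155452728271484  # last eval_loss reported in the training log above
print(f"Perplexity: {perplexity(eval_loss):.2f}")  # Perplexity: 8.29
```

A loss of 0 would give a perplexity of 1, meaning the model assigns probability 1 to every correct token.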


You can now upload the result of the training to the Hub by executing this instruction:

In [24]:
trainer.push_to_hub()

Adding files tracked by Git LFS: ['runs/Oct17_21-25-01_Graces-MacBook-Pro-104.local/events.out.tfevents.1697598651.Graces-MBP-104.fios-router.home.75929.2']. This may take a bit of time if the files are large.
Several commits (22) will be pushed upstream.
The progress bars may be unreliable.
batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to push some refs to 'https://github.com/graceli458/asapp-2b'



OSError: batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to push some refs to 'https://github.com/graceli458/asapp-2b'


You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("sgugger/my-awesome-model")
```

## Masked language modeling

For masked language modeling (MLM) we are going to use the same preprocessing as before for our dataset with one additional step: we will randomly mask some tokens (by replacing them by `[MASK]`) and the labels will be adjusted to only include the masked tokens (we don't have to predict the non-masked tokens).

We will use the [`distilroberta-base`](https://huggingface.co/distilroberta-base) model for this example. You can pick any of the checkpoints listed [here](https://huggingface.co/models?filter=masked-lm) instead:

In [None]:
model_checkpoint = "distilroberta-base"

We can apply the same tokenization function as before. We just need to update our tokenizer to use the checkpoint we just picked:

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

And like before, we group texts together and chunk them into samples of length `block_size`. You can skip that step if your dataset is composed of individual sentences.

In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

The rest is very similar to what we had, with two exceptions. First we use a model suitable for masked LM:

In [None]:
from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

We redefine our `TrainingArguments`:

In [None]:
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

Like before, the last argument sets everything up so we can push the model to the [Hub](https://huggingface.co/models) regularly during training. Remove it if you didn't follow the installation steps at the top of the notebook. If you want to save your model locally under a name that is different from the name of the repository it will be pushed to, or if you want to push your model under an organization and not your namespace, use the `hub_model_id` argument to set the repo name (it needs to be the full name, including your namespace: for instance `"sgugger/bert-finetuned-wikitext2"` or `"huggingface/bert-finetuned-wikitext2"`).

Finally, we use a special `data_collator`. The `data_collator` is a function that is responsible for taking the samples and batching them in tensors. In the previous example, we had nothing special to do, so we just used the default for this argument. Here we want to do the random masking. We could do it as a preprocessing step (like the tokenization) but then the tokens would always be masked the same way at each epoch. By doing this step inside the `data_collator`, we ensure this random masking is done in a new way each time we go over the data.

To do this masking for us, the library provides a `DataCollatorForLanguageModeling`. We can adjust the probability of the masking:

In [None]:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
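To give an idea of what happens to each batch, here is a deliberately simplified pure-Python sketch of random masking. It is not the library's implementation: `mask_tokens`, the toy ids and the `seed` parameter are assumptions for illustration, and the real collator works on tensors and, by default, also sometimes keeps a selected token unchanged or swaps in a random one instead of always inserting the mask token. The value `-100` is the label index that the loss ignores.

```python
import random

IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def mask_tokens(input_ids, mask_token_id, mlm_probability=0.15, seed=None):
    """Mask each token independently with probability `mlm_probability`."""
    rng = random.Random(seed)
    masked_inputs, labels = [], []
    for token_id in input_ids:
        if rng.random() < mlm_probability:
            masked_inputs.append(mask_token_id)  # hide the token from the model
            labels.append(token_id)              # and ask the model to recover it
        else:
            masked_inputs.append(token_id)
            labels.append(IGNORE_INDEX)          # nothing to predict here
    return masked_inputs, labels


toy_ids = list(range(100, 120))  # toy token ids
for epoch in range(2):
    # A different seed per pass stands in for the fresh randomness of each epoch.
    inputs, labels = mask_tokens(toy_ids, mask_token_id=0, seed=epoch)
    print(inputs)
    print(labels)
```

Because the masking is drawn again on every pass, the same sample is generally seen with different masked positions from one epoch to the next.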

Then we just have to pass everything to `Trainer` and begin training:

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=data_collator,
)

In [None]:
trainer.train()

Like before, we can evaluate our model on the validation set. The perplexity is much lower than for the CLM objective because for the MLM objective, we only have to make predictions for the masked tokens (which represent 15% of the total here) while having access to the rest of the tokens. It's thus an easier task for the model.

In [None]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

You can now upload the result of the training to the Hub by executing this instruction:

In [None]:
trainer.push_to_hub()

You can now share this model with all your friends, family, favorite pets: they can all load it with the identifier `"your-username/the-name-you-picked"` so for instance:

```python
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("sgugger/my-awesome-model")
```