# Hugging Face Trainer

* [Processing the data](https://huggingface.co/course/chapter3/2?fw=pt)

> The last thing we will need to do is pad all the examples to the length of the longest element when we batch elements together — a technique we refer to as dynamic padding.
> The function that is responsible for **putting together samples inside a batch** is called a **collate function**. The default is a function that will just **convert your samples to PyTorch tensors and concatenate them (recursively if your elements are lists, tuples, or dictionaries)**.
> 
> The Transformers library provides such a function via `DataCollatorWithPadding`. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need:
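> ```python
> from transformers import DataCollatorWithPadding
> 
> # `tokenizer` and `samples` are defined in the earlier steps of the course chapter:
> # a loaded tokenizer and a handful of tokenized examples that have not been padded yet.
> data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
> batch = data_collator(samples)  # pads every field to the longest sample in the batch
> ```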
> 
> Dynamic padding means the samples in this batch should all be padded to a length of 67, the maximum length inside the batch (67 is the longest sample in the course's example batch, not a fixed setting).
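
To make the idea concrete, here is a minimal pure-Python sketch of what a dynamic-padding collate function does. It is not the `DataCollatorWithPadding` implementation: `pad_batch`, its parameters, and the toy token IDs are hypothetical names and values chosen for illustration, and it returns plain lists rather than PyTorch tensors.

```python
from typing import Dict, List


def pad_batch(
    samples: List[Dict[str, List[int]]],
    pad_token_id: int = 0,        # assumption: a real tokenizer supplies its own pad ID
    padding_side: str = "right",  # assumption: "right" or "left", as the model expects
) -> Dict[str, List[List[int]]]:
    """Pad every sample to the length of the longest sample in this batch."""
    # The target length is computed per batch, not once for the whole dataset.
    max_len = max(len(s["input_ids"]) for s in samples)

    batch: Dict[str, List[List[int]]] = {"input_ids": [], "attention_mask": []}
    for s in samples:
        ids = list(s["input_ids"])
        n_pad = max_len - len(ids)
        pad_ids = [pad_token_id] * n_pad
        mask = [1] * len(ids)   # 1 marks a real token
        pad_mask = [0] * n_pad  # 0 marks padding
        if padding_side == "right":
            batch["input_ids"].append(ids + pad_ids)
            batch["attention_mask"].append(mask + pad_mask)
        else:
            batch["input_ids"].append(pad_ids + ids)
            batch["attention_mask"].append(pad_mask + mask)
    return batch


# Toy samples with made-up token IDs of different lengths.
toy_samples = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
]
toy_batch = pad_batch(toy_samples)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Because the target length is recomputed for each batch, a batch of short samples stays short instead of being padded to the longest sample in the whole dataset.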

## Trainer

* [Fine-tuning a model with the Trainer API](https://huggingface.co/course/chapter3/3) ([YouTube](https://youtu.be/nvBXf7s7vTI))
* [Trainer class](https://huggingface.co/docs/transformers/main_classes/trainer)

## Tutorials

* [Fine-tune a pretrained model](https://huggingface.co/docs/transformers/training)
* [Summarization fine-tuning](https://huggingface.co/docs/transformers/tasks/summarization)
* [Question answering fine-tuning](https://huggingface.co/course/chapter7/7?fw=pt)
* [Fine-tuning Bert for Abstractive Summarisation with the Curation Dataset](https://medium.com/curation-corporation/fine-tuning-bert-for-abstractive-summarisation-with-the-curation-dataset-79ea4b40a923) ([GitHub](https://github.com/CurationCorp/curation-corpus/blob/master/examples/bertextabs/finetune_bertabs_walkthrough.ipynb))
* [Text Summarization with Pretrained Encoders](https://github.com/nlpyang/PreSumm)

> This code is for the EMNLP 2019 paper [Text Summarization with Pretrained Encoders](https://arxiv.org/abs/1908.08345).



## Hugging Face Notebooks

* [How to train a new language model from scratch using Transformers and Tokenizers (Feb 2020)](https://huggingface.co/blog/how-to-train) ([GitHub notebook: 01_how_to_train.ipynb](https://github.com/huggingface/blog/blob/main/notebooks/01_how_to_train.ipynb))
* [Pre-training SmallBERTa - A tiny model to train on a tiny dataset](https://gist.github.com/aditya-malte/2d4f896f471be9c38eb4d723a710768b)

* [Transformers Notebooks](https://huggingface.co/docs/transformers/main/notebooks)
<img src="./image/huggingface_notebooks.png" align="left" width=200/>

