#### Overview
This notebook helps you download all the required models from the Hugging Face servers.  
They are stored in `models/huggingface/` (the code cells below write to `../../models/huggingface/`, relative to this notebook).

#### List of models required to run each notebook: 
- bert.ipynb: `bert-large-uncased-whole-word-masking-finetuned-squad`, `bert-large-uncased-whole-word-masking`
- gpt2.ipynb: `gpt2-xl`
- t5.ipynb: `t5-3b`

You can also modify these notebooks to work with whatever model you fancy. I've explained how to download any model from Hugging Face using BERT as an example. You can browse the available models [here](https://huggingface.co/models).
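As a quick sanity check, the list above can be turned into a small lookup that reports which model folders are still missing. This is only a sketch: `REQUIRED_FOLDERS` and `missing_folders` are hypothetical names, and the folder names assume the `<model name><suffix>` pattern used by the download cells in this notebook.

```python
from pathlib import Path

# Folder names follow the "<model name><suffix>" pattern used by the
# download cells in this notebook (an assumption about what the other
# notebooks expect to find on disk).
REQUIRED_FOLDERS = {
    "bert.ipynb": [
        "bert-large-uncased-whole-word-masking-finetuned-squad_qa",
        "bert-large-uncased-whole-word-masking_mlm",
    ],
    "gpt2.ipynb": ["gpt2-xl_lmheadmodel"],
    "t5.ipynb": ["t5-3b_forconditionalgeneration"],
}


def missing_folders(base_dir="../../models/huggingface"):
    """Return {notebook: [folders not yet on disk]} for incomplete notebooks."""
    base = Path(base_dir)
    missing = {}
    for notebook, folders in REQUIRED_FOLDERS.items():
        absent = [name for name in folders if not (base / name).is_dir()]
        if absent:
            missing[notebook] = absent
    return missing
```

Calling `missing_folders()` before opening a notebook shows which downloads still need to be run; an empty dictionary means every folder is present.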

#### Downloading the BERT Model
Before we begin, we need to download the pre-trained models so that we can run them offline. The first model we will download is `bert-large-uncased-whole-word-masking-finetuned-squad`. We load it with the `BertForQuestionAnswering` class, which places a question-answering layer on top of the base BERT model. We also need to download the matching tokenizer for this model using the `AutoTokenizer` class. Finally, we save the `BertForMaskedLM` version of the `bert-large-uncased-whole-word-masking` model, together with its tokenizer.

In [None]:
from transformers import BertForQuestionAnswering, AutoTokenizer, BertForMaskedLM

# Define the directory to save the model and tokenizer
qa_model = "bert-large-uncased-whole-word-masking-finetuned-squad"
mlm_model = "bert-large-uncased-whole-word-masking"
qa_directory = "../../models/huggingface/" + qa_model + "_qa"
mlm_directory = "../../models/huggingface/" + mlm_model + "_mlm"

# Download/save the tokenizer for the BertForQuestionAnswering model
tokenizer = AutoTokenizer.from_pretrained(qa_model)
tokenizer.save_pretrained(qa_directory)

# Download/save the BertForQuestionAnswering model
model = BertForQuestionAnswering.from_pretrained(qa_model)
model.save_pretrained(qa_directory)

# Download/save the tokenizer for the BertForMaskedLM model
tokenizer = AutoTokenizer.from_pretrained(mlm_model)
tokenizer.save_pretrained(mlm_directory)

# Download/save the BertForMaskedLM model
model = BertForMaskedLM.from_pretrained(mlm_model)
model.save_pretrained(mlm_directory)

### In detail
#### What do `BertForQuestionAnswering` and `BertForMaskedLM` do?
- These are task-specific model classes: each one combines the base BERT model with an extra output layer (a "head") for a single task.
- The base BERT model on its own is not easy to use directly for tasks like "question answering" or "fill-in-the-blank."
- To carry out these specific tasks, an extra layer is added on top of the BERT architecture. In the checkpoints used here, the weights of that layer are already trained.
- The `transformers` library provides pre-defined model classes specifically for these purposes (e.g., `BertForQuestionAnswering`).

#### What does `AutoTokenizer` do?
- The `AutoTokenizer` class is designed to load the correct tokenizer for each model.

#### What does `from_pretrained()` do?
- Each model/tokenizer class in the `transformers` library has a `from_pretrained()` class method that loads a pre-trained model or tokenizer.
- In this example, it is called on the `BertForQuestionAnswering`, `BertForMaskedLM` and `AutoTokenizer` classes to load the pre-trained models and their tokenizers.
- The first time you run this method for a given model:
  - It downloads the respective model/tokenizer files from the Hugging Face servers and stores them in a local cache directory, by default `~/.cache/huggingface/hub`.
  - It sets up the model/tokenizer configuration automatically.
  - Finally, it initializes the model with the pre-trained weights.
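To see what has already been downloaded, you can inspect the cache directory mentioned above. The sketch below assumes the default cache location and the `models--<org>--<name>` folder naming that the Hugging Face hub cache uses; `cached_model_ids` is a hypothetical helper name, and the location may differ if you have changed it (for example through the `HF_HOME` environment variable).

```python
from pathlib import Path

# Default cache location; this is an assumption and can be overridden
# by environment variables on your machine.
DEFAULT_CACHE = Path.home() / ".cache" / "huggingface" / "hub"


def cached_model_ids(cache_dir=DEFAULT_CACHE):
    """List model ids found in the cache, based on folder names only."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    prefix = "models--"
    ids = []
    for entry in cache_dir.iterdir():
        # Model folders look like "models--gpt2-xl" or "models--org--name".
        if entry.is_dir() and entry.name.startswith(prefix):
            ids.append(entry.name[len(prefix):].replace("--", "/"))
    return sorted(ids)
```

Running `cached_model_ids()` after the download cells should list the models you fetched. The helper only reads folder names, so it does not verify that a download completed.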

#### What does `save_pretrained()` do?
- This method saves the model's pre-trained weights and configuration (or, for a tokenizer, its vocabulary and tokenizer configuration) to a local folder of your choice.
- This allows you to reuse the model later on without needing an internet connection.
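Loading from the saved folder later could look like the sketch below. It assumes the QA folder written by the BERT download cell in this notebook; `load_qa_offline` is a hypothetical helper name. The two environment variables ask the Hugging Face libraries not to contact the network and should be set before `transformers` is imported, which is why the import sits inside the function here.

```python
import os

# Ask the Hugging Face libraries to stay offline. These variables are
# read when the libraries are imported, so set them first.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

# Folder written by the BERT download cell (assumed to exist already).
QA_DIRECTORY = (
    "../../models/huggingface/"
    "bert-large-uncased-whole-word-masking-finetuned-squad_qa"
)


def load_qa_offline(directory=QA_DIRECTORY):
    """Load the saved tokenizer and QA model from a local folder only."""
    # Imported here so the offline variables above are already in place.
    from transformers import AutoTokenizer, BertForQuestionAnswering

    tokenizer = AutoTokenizer.from_pretrained(directory, local_files_only=True)
    model = BertForQuestionAnswering.from_pretrained(
        directory, local_files_only=True
    )
    return tokenizer, model
```

Passing a folder path to `from_pretrained()` instead of a model name makes it read the files saved by `save_pretrained()`; `local_files_only=True` additionally makes the call fail rather than attempt a download if something is missing.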


#### Download GPT-2 models

In [None]:
from transformers import GPT2LMHeadModel, AutoTokenizer

# Alternate options: "gpt2", "gpt2-medium", "gpt2-large"
model_name = "gpt2-xl"

# Define the directory to save the model and tokenizer
directory = "../../models/huggingface/" + model_name + "_lmheadmodel"

# Download/save the tokenizer for the GPT2LMHeadModel model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(directory)

# Download/save the GPT2LMHeadModel model
# (kept in a separate variable so the model name string is not overwritten)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.save_pretrained(directory)

#### Download T5 models

In [None]:
from transformers import T5ForConditionalGeneration, T5TokenizerFast

# Available options: "t5-small", "t5-base", "t5-large", "t5-3b" (~11 GB model), "t5-11b" (~40 GB model)
model_name = "t5-3b"

# Define the directory to save the model and tokenizer
directory = "../../models/huggingface/" + model_name + "_forconditionalgeneration"

# Download/save the tokenizer for the T5ForConditionalGeneration model
# Using the fast tokenizer class explicitly for speed; it seemed to perform better than AutoTokenizer here
tokenizer = T5TokenizerFast.from_pretrained(model_name)
tokenizer.save_pretrained(directory)

# Download/save the T5ForConditionalGeneration model
# (kept in a separate variable so the model name string is not overwritten)
model = T5ForConditionalGeneration.from_pretrained(model_name)
model.save_pretrained(directory)
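The BERT, GPT-2 and T5 cells all follow the same pattern: pick a name, load the tokenizer and model with `from_pretrained()`, then write both to one folder with `save_pretrained()`. A minimal sketch of that pattern as a reusable helper is shown below. `download_and_save` is a hypothetical name that is not part of `transformers`, and skipping folders that already contain files is an added assumption, not something the cells above do.

```python
from pathlib import Path


def download_and_save(model_name, model_class, tokenizer_class, suffix,
                      base_dir="../../models/huggingface"):
    """Download a tokenizer and model, then save both to one folder.

    model_class and tokenizer_class are expected to provide
    from_pretrained(), and the returned objects save_pretrained(),
    as the transformers classes used in this notebook do.
    """
    directory = Path(base_dir) / f"{model_name}{suffix}"

    # Skip the download if the folder already holds files (assumption:
    # a non-empty folder means an earlier run completed).
    if directory.is_dir() and any(directory.iterdir()):
        return directory

    tokenizer = tokenizer_class.from_pretrained(model_name)
    tokenizer.save_pretrained(str(directory))

    model = model_class.from_pretrained(model_name)
    model.save_pretrained(str(directory))
    return directory
```

With `transformers` imported, the GPT-2 cell would reduce to `download_and_save("gpt2-xl", GPT2LMHeadModel, AutoTokenizer, "_lmheadmodel")`, and the other cells follow the same shape with their own class and suffix.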