In [1]:
!pip install accelerate -U



In [1]:
!pip install transformers



# The Hugging Face Hub:

## Introduction:

* The [Hugging Face Hub](https://huggingface.co/) is the central platform where models and datasets are stored and shared.
* In this chapter we will focus on how to:
    - Use a fine-tuned model from the Hub
    - Share and deploy our model to the Hub
    - Build a model card
* At its core, a shared model is just a Git repository, which means that it can be cloned and used by others.
* When a new model is shared with the community, a hosted inference API is deployed automatically, so anyone can test that model directly or build on top of it.
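
Because a model repo is an ordinary Git repository, its clone URL follows directly from its repo id. The helper below is only a sketch: `clone_command` and the repo id `my-username/my-model` are hypothetical names used for illustration, and the command is built but not executed. Large weight files are typically stored with Git LFS, so `git lfs install` may be needed before cloning.

```python
HUB_URL = "https://huggingface.co"


def clone_command(repo_id: str) -> list[str]:
    """Build the `git clone` command for a model repo on the Hub.

    `repo_id` has the form "<namespace>/<name>"; the value used below
    is a placeholder, not a real repository.
    """
    return ["git", "clone", f"{HUB_URL}/{repo_id}"]


# Build (but do not run) the command for a placeholder repo id.
command = clone_command("my-username/my-model")
print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` to perform the clone.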

## Using pretrained models

* As we saw in previous chapters, using fine-tuned models from the Hub for our tasks is easy and takes only a few lines of code:

In [2]:
from transformers import pipeline
unmasker = pipeline("fill-mask", model = 'camembert-base')
unmasker("This course will teach you all about <mask> models.", top_k=2)


Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassific

[{'score': 0.1466376781463623,
  'token': 808,
  'token_str': 'the',
  'sequence': 'This course will teach you all about the models.'},
 {'score': 0.06081351637840271,
  'token': 9098,
  'token_str': 'this',
  'sequence': 'This course will teach you all about this models.'}]
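
The pipeline returns a plain list of dictionaries, so its predictions can be post-processed with ordinary Python. As a minimal sketch, the snippet below copies the output shown above and keeps only the tokens whose score reaches a threshold. The helper name `confident_tokens` and the threshold of `0.1` are arbitrary choices made for illustration.

```python
# Output of the fill-mask pipeline, copied from the cell above.
predictions = [
    {"score": 0.1466376781463623, "token": 808, "token_str": "the",
     "sequence": "This course will teach you all about the models."},
    {"score": 0.06081351637840271, "token": 9098, "token_str": "this",
     "sequence": "This course will teach you all about this models."},
]


def confident_tokens(preds, min_score):
    """Return the predicted tokens whose score is at least `min_score`."""
    return [p["token_str"] for p in preds if p["score"] >= min_score]


# 0.1 is an arbitrary cut-off chosen for this example.
print(confident_tokens(predictions, min_score=0.1))
```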

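* Of course, we need to pick a checkpoint that is suitable for our task, otherwise we will get results that don't make sense at all.
  - In this case we pick `camembert-base`, which was trained with a masked-language-modeling objective and is therefore a good fit for fill-mask tasks. Note that it is a French model, which likely explains the low scores on the English sentence above.
* We could also instantiate the checkpoint from the model class directly: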

In [3]:
from transformers import CamembertTokenizer, CamembertForMaskedLM
ckpt = 'camembert-base'
tokenizer = CamembertTokenizer.from_pretrained(ckpt)
model = CamembertForMaskedLM.from_pretrained(ckpt)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


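* However, it's recommended to use the `Auto*` classes to handle the instantiation of models and tokenizers, since they pick the right architecture from the checkpoint and make switching checkpoints easy: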

In [4]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)

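
To give an intuition for why the `Auto*` classes are checkpoint-agnostic: they read the checkpoint's configuration (its `model_type` field) and dispatch to the matching architecture class. The snippet below is a toy illustration of that idea only. The `Toy*` classes, `TOY_MODEL_MAPPING` and `toy_auto_from_config` are hypothetical names and this is not the library's actual implementation.

```python
class ToyModel:
    """Stand-in for an architecture class (hypothetical, for illustration)."""

    def __init__(self, config: dict):
        self.config = config


class ToyBertForMaskedLM(ToyModel):
    pass


class ToyCamembertForMaskedLM(ToyModel):
    pass


# Toy registry: maps the `model_type` stored in a config to a class.
TOY_MODEL_MAPPING = {
    "bert": ToyBertForMaskedLM,
    "camembert": ToyCamembertForMaskedLM,
}


def toy_auto_from_config(config: dict) -> ToyModel:
    """Pick the class that matches `config["model_type"]` and build it."""
    model_type = config["model_type"]
    if model_type not in TOY_MODEL_MAPPING:
        raise ValueError(f"Unknown model_type: {model_type!r}")
    return TOY_MODEL_MAPPING[model_type](config)


# The same call works for either checkpoint type.
model_a = toy_auto_from_config({"model_type": "camembert"})
model_b = toy_auto_from_config({"model_type": "bert"})
print(type(model_a).__name__, type(model_b).__name__)
```

With this kind of dispatch, switching checkpoints only requires changing the checkpoint name, not the class used in the code.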


* What's important is to understand how that specific model was trained, which dataset was used, and what its limitations and biases are.
  - All of this information should be mentioned in the model card (which we will build later).
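
As a rough preview of what such a card records, a model card is a `README.md` file that starts with a YAML metadata block and continues with free-form Markdown sections. The sketch below renders a minimal one. The function `render_model_card`, the section titles and all the values passed to it are placeholders chosen for illustration, not a prescribed template.

```python
def render_model_card(name, language, license_id, datasets,
                      training_data, limitations):
    """Render a minimal model card (README.md) as a string."""
    lines = ["---", f"language: {language}", f"license: {license_id}", "datasets:"]
    lines += [f"- {dataset}" for dataset in datasets]
    lines += [
        "---",
        "",
        f"# {name}",
        "",
        "## Training data",
        "",
        training_data,
        "",
        "## Limitations and biases",
        "",
        limitations,
        "",
    ]
    return "\n".join(lines)


# All values below are placeholders.
card = render_model_card(
    name="my-model",
    language="en",
    license_id="mit",
    datasets=["my-dataset"],
    training_data="Describe the dataset(s) the model was trained on.",
    limitations="Describe known limitations and biases.",
)
print(card)
```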

## Sharing a pretrained model:

* In general there are 3 ways to create a new model repository:
  
  - Using the push_to_hub API
  - Using the huggingface_hub Python library
  - Using the web interface
* Once the repo is created, we can add and edit files just like in any other repo on GitHub.

### Using the push_to_hub API:

* The simplest way to create a model repo is to use the `push_to_hub` API.
  - But first we need to log in so that the API can use our credentials:

In [5]:
from huggingface_hub import notebook_login
notebook_login()


* Earlier we used the `TrainingArguments` class to pass hyperparameters when building the training loop. The easiest way to push a model is to set `push_to_hub=True` as one of its arguments:

In [6]:
from transformers import Trainer, TrainingArguments
training_arguments = TrainingArguments('test-train-0', save_strategy = 'epoch', push_to_hub= True)

* Once `trainer.train()` is called, the `Trainer` uploads the model to the Hub each time it is saved (here, every epoch), in a repo named after the output directory we picked, `test-train-0`. We can choose another name by passing `hub_model_id="my_model_name"`.
* Once the training is complete, we should do a final `trainer.push_to_hub()` to upload the last version of the model. This will also generate the model card, which contains all the metadata, hyperparameters and evaluation results.

![model card *From Hugging Face*](model_card.png)
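
The repository naming rule described above can be summarized in a few lines. This is a simplified sketch of the behavior, not the `Trainer` code itself: `resolve_repo_id` and the namespace `my-username` are hypothetical, and the helper assumes that a `hub_model_id` containing a `/` already includes its namespace.

```python
from pathlib import Path


def resolve_repo_id(namespace, output_dir, hub_model_id=None):
    """Return the repo id a training run would push to (simplified sketch).

    Without `hub_model_id`, the repo is named after the last component
    of the output directory.
    """
    name = hub_model_id if hub_model_id is not None else Path(output_dir).name
    # Assumption: an id such as "my-org/my_model" already names its namespace.
    if "/" in name:
        return name
    return f"{namespace}/{name}"


# "my-username" is a placeholder namespace.
print(resolve_repo_id("my-username", "test-train-0"))
print(resolve_repo_id("my-username", "test-train-0", hub_model_id="my_model_name"))
```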

* The `push_to_hub()` method can be applied to models, tokenizers and configs. It takes care of both creating the repo and pushing the files directly to that repository.
* Now let's see exactly how this works:



In [8]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
ckpt = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForMaskedLM.from_pretrained(ckpt)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


* Now we can take this model and tokenizer and do whatever we want with them (modify, add, ...), and when we are satisfied with the results we can use the `push_to_hub()` method:

In [9]:
model.push_to_hub('test-ch4')


CommitInfo(commit_url='https://huggingface.co/Smail/test-ch4/commit/7bd99cf8dede8382f9f084399e04a5f16b07804e', commit_message='Upload BertForMaskedLM', commit_description='', oid='7bd99cf8dede8382f9f084399e04a5f16b07804e', pr_url=None, pr_revision=None, pr_num=None)
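
The returned `CommitInfo` identifies both the repository and the exact commit that was created. As a small sketch, the repo id and the commit hash can be read off the commit URL shown above. The helper `split_commit_url` is a hypothetical name, and the URL layout it assumes (`<namespace>/<name>/commit/<hash>`) is taken from this one output.

```python
from urllib.parse import urlparse


def split_commit_url(commit_url):
    """Split a Hub commit URL into (repo_id, commit_hash).

    Assumes the path looks like "<namespace>/<name>/commit/<hash>".
    """
    parts = urlparse(commit_url).path.strip("/").split("/")
    if len(parts) != 4 or parts[2] != "commit":
        raise ValueError(f"Unexpected commit URL: {commit_url!r}")
    return f"{parts[0]}/{parts[1]}", parts[3]


# Commit URL copied from the output above.
url = "https://huggingface.co/Smail/test-ch4/commit/7bd99cf8dede8382f9f084399e04a5f16b07804e"
repo_id, commit_hash = split_commit_url(url)
print(repo_id, commit_hash)
```

The repo id is what others would pass to `from_pretrained()` to load the model, and the hash could be passed as its `revision` argument to pin this exact version.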