# SHARING MODELS AND TOKENIZERS

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

# Using pretrained models

For example, suppose we are looking for a French-language model that can perform mask filling. The `camembert-base` checkpoint fits, and we can load it with `pipeline()`:

In [None]:
from transformers import pipeline

camembert_fill_mask = pipeline('fill-mask', model='camembert-base')

In [3]:
results = camembert_fill_mask("Le camembert est <mask> :)")
results

[{'score': 0.5239004492759705,
  'token': 7200,
  'token_str': 'délicieux',
  'sequence': 'Le camembert est délicieux :)'},
 {'score': 0.09637846052646637,
  'token': 2183,
  'token_str': 'excellent',
  'sequence': 'Le camembert est excellent :)'},
 {'score': 0.036324650049209595,
  'token': 26202,
  'token_str': 'succulent',
  'sequence': 'Le camembert est succulent :)'},
 {'score': 0.02907998487353325,
  'token': 528,
  'token_str': 'meilleur',
  'sequence': 'Le camembert est meilleur :)'},
 {'score': 0.027218414470553398,
  'token': 1654,
  'token_str': 'parfait',
  'sequence': 'Le camembert est parfait :)'}]
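
The pipeline returns a plain list of dictionaries, so ordinary Python is enough to post-process it. The sketch below picks the highest-scoring candidate and filters candidates by score. `sample_results` is copied by hand from the first two entries of the output above (scores rounded), the helper names are hypothetical, and the 0.05 cut-off is an arbitrary illustrative value.

```python
# First two entries of the fill-mask output shown above (scores rounded).
sample_results = [
    {"score": 0.5239, "token": 7200, "token_str": "délicieux",
     "sequence": "Le camembert est délicieux :)"},
    {"score": 0.0964, "token": 2183, "token_str": "excellent",
     "sequence": "Le camembert est excellent :)"},
]


def best_prediction(results):
    """Return the candidate dictionary with the highest score."""
    return max(results, key=lambda r: r["score"])


def confident_tokens(results, threshold=0.05):
    """Return token strings whose score is at least `threshold` (arbitrary cut-off)."""
    return [r["token_str"] for r in results if r["score"] >= threshold]


print(best_prediction(sample_results)["sequence"])
print(confident_tokens(sample_results))
```

With a live pipeline, the same helpers can be applied to `results` directly.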

We can also instantiate the checkpoint using the model architecture directly:

In [4]:
from transformers import CamembertTokenizer, CamembertForMaskedLM

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')
model = CamembertForMaskedLM.from_pretrained('camembert-base')

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


However, we recommend using the `Auto*` classes instead, as they are architecture-agnostic by design, which makes switching checkpoints simple.

In [5]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('camembert-base')
model = AutoModelForMaskedLM.from_pretrained('camembert-base')

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# Sharing pretrained models

There are three ways to go about creating new model repositories:
* Using the `push_to_hub` API
* Using the `huggingface_hub` Python library
* Using the web interface

Once we have created a repository, we can upload files to it via git and git-lfs.

## Using the `push_to_hub` API

First, we need to generate an authentication token so that the `huggingface_hub` API knows who we are and what namespaces we have write access to.

If we are in a notebook, we can use:

In [5]:
from huggingface_hub import notebook_login

notebook_login()

In a terminal, we can run:
```bash
huggingface-cli login
```
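
In scripts or CI jobs, neither prompt is convenient. A minimal sketch, assuming the token has been exported beforehand in the `HF_TOKEN` environment variable: read it and pass it to `login`. The helper names `read_token` and `login_from_env` are hypothetical, and the import sits inside the function so that the module loads even where `huggingface_hub` is not installed.

```python
import os


def read_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env(env_var="HF_TOKEN"):
    """Log in non-interactively; hypothetical helper, not part of huggingface_hub."""
    token = read_token(env_var)
    if token is None:
        raise RuntimeError(f"{env_var} is not set; run `huggingface-cli login` instead")
    from huggingface_hub import login  # imported lazily on purpose
    login(token=token)
```

Calling `login_from_env()` at the top of a script then replaces the interactive step.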

If we have played around with the `Trainer` API to train a model, the easiest way to upload it to the Hub is to set `push_to_hub=True` when we define our `TrainingArguments`:

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    'bert-finetuned-mrpc',
    save_strategy='epoch',
    push_to_hub=True,
)

When we call `trainer.train()`, the `Trainer` uploads our model to the Hub each time it is saved (here, every epoch), to a repository in our namespace. The repository is named after the output directory, here `bert-finetuned-mrpc`.

To upload our model to an organization we are a member of, pass `hub_model_id="my_organization/my_repo_name"` to `TrainingArguments`.
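
A sketch of the organization case, reusing the arguments from the cell above. `output_dir`, `save_strategy`, `push_to_hub`, and `hub_model_id` are the `TrainingArguments` parameters already used in this section; `hub_training_kwargs` is a hypothetical helper, and `my_organization` is the placeholder name from the text.

```python
def hub_training_kwargs(output_dir, organization=None, repo_name=None):
    """Build keyword arguments for TrainingArguments (hypothetical helper)."""
    kwargs = {
        "output_dir": output_dir,
        "save_strategy": "epoch",
        "push_to_hub": True,
    }
    if organization is not None:
        # Without hub_model_id, the repository is named after output_dir
        # and is created in the user's own namespace.
        kwargs["hub_model_id"] = f"{organization}/{repo_name or output_dir}"
    return kwargs


kwargs = hub_training_kwargs("bert-finetuned-mrpc", organization="my_organization")
print(kwargs["hub_model_id"])

# With transformers installed, the dictionary can be unpacked directly:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**kwargs)
```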

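Once training is finished, we should call `trainer.push_to_hub()` one final time to upload the last version of our model. This also generates a model card with the relevant metadata, reporting the hyperparameters used and the evaluation results.

Models and tokenizers also expose `push_to_hub()` directly, without going through the `Trainer`. To illustrate, we first load a model and a tokenizer:
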
In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = 'camembert-base'

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

Once we are happy with the resulting model and tokenizer, we can call the `push_to_hub()` method directly on the `model` object:

In [None]:
model.push_to_hub('dummy-model')

This creates the new repository `dummy-model` in our profile and populates it with our model files. We do the same with the tokenizer, so that all the files are available in this repository:

In [None]:
tokenizer.push_to_hub('dummy-model')

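If we belong to an organization, we can specify the `organization` argument to upload to that organization's namespace. (Recent `transformers` releases deprecate this argument in favor of a fully qualified repository name such as `'binliu/dummy-model'`.)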

In [None]:
tokenizer.push_to_hub('dummy-model', organization='binliu')

If we wish to use a specific Hugging Face token, we can pass it to the `push_to_hub()` method as well (newer releases name this argument `token`):

In [None]:
tokenizer.push_to_hub('dummy-model', organization='binliu', use_auth_token='<TOKEN>')
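
Since the model and the tokenizer must land in the same repository, it can help to push them together. A minimal sketch: `push_all` is a hypothetical helper that calls `push_to_hub()` on each object with the same repository name and forwards any extra keyword arguments unchanged (for instance the `organization` or `use_auth_token` arguments used above). The `Recorder` class is only a stand-in so that the example runs without network access.

```python
def push_all(repo_name, *objects, **push_kwargs):
    """Call push_to_hub(repo_name, **push_kwargs) on every object, in order."""
    for obj in objects:
        obj.push_to_hub(repo_name, **push_kwargs)


class Recorder:
    """Stand-in for a model or tokenizer; records calls instead of uploading."""

    def __init__(self):
        self.calls = []

    def push_to_hub(self, repo_name, **kwargs):
        self.calls.append((repo_name, kwargs))


fake_model, fake_tokenizer = Recorder(), Recorder()
push_all("dummy-model", fake_model, fake_tokenizer, organization="binliu")
print(fake_model.calls)

# With the real objects loaded earlier: push_all("dummy-model", model, tokenizer)
```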

Now head to the Model Hub to find our newly uploaded model: http://huggingface.co/user-or-organization/dummy-model.

## Using the `huggingface_hub` Python library

In [None]:
from huggingface_hub import (
    # User management
    login,
    logout,
    whoami,

    # Repository creation and management
    create_repo,
    delete_repo,
    update_repo_visibility,

    # Retrieve or change information about repository content
    list_models,
    list_datasets,
    list_metrics,
    list_repo_files,
    upload_file,
    delete_file,
)

The `huggingface_hub` library also offers the `Repository` class to manage a local clone of a repository.

The `create_repo` method can be used to create a new repository on the hub:

In [None]:
create_repo('dummy-model')

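To create a repository in an organization's namespace, use the `organization` argument (recent `huggingface_hub` releases instead expect the namespace in the repository name, as in `'binliu/dummy-model'`):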

In [None]:
create_repo('dummy-model', organization='binliu')

Other arguments:
* `private`, to specify whether the repository should be visible to others.
* `token`, to override the token stored in our cache with a given token.
* `repo_type`, if we would like to create a `dataset` or a `space` instead of a model. Accepted values are `"dataset"` and `"space"`.
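
These arguments can be combined in one call. The sketch below wraps `create_repo` in a hypothetical helper, `create_private_repo`, that checks `repo_type` against the values listed above before making the call. The `create_fn` parameter exists only so that the example can be exercised with a stand-in function instead of contacting the Hub, and any extra keyword arguments are forwarded unchanged.

```python
ACCEPTED_REPO_TYPES = (None, "dataset", "space")  # None means a model repository


def create_private_repo(name, repo_type=None, token=None, create_fn=None, **extra):
    """Create a private repository; hypothetical wrapper around create_repo."""
    if repo_type not in ACCEPTED_REPO_TYPES:
        raise ValueError(f"unsupported repo_type: {repo_type!r}")
    if create_fn is None:
        from huggingface_hub import create_repo as create_fn  # lazy import
    return create_fn(name, private=True, token=token, repo_type=repo_type, **extra)


def fake_create(name, **kwargs):
    """Stand-in that returns its arguments instead of calling the Hub."""
    return {"name": name, **kwargs}


print(create_private_repo("dummy-dataset", repo_type="dataset", create_fn=fake_create))
```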

## Uploading the model files

### The `upload_file` approach

A limitation of this approach is that it does not handle files larger than 5GB.

In [None]:
upload_file(
    path_or_fileobj="<path_to_file>/config.json",
    path_in_repo='config.json',
    repo_id="<namespace>/dummy-model",
)

This uploads the file `config.json` located at `<path_to_file>` to the root of the `dummy-model` repository, where it is stored as `config.json`.
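
To upload several files while respecting the size limit mentioned above, one option is to check file sizes first. The sketch below is an illustration under assumptions: the "5GB" figure is taken from the text and interpreted in binary units, and both `partition_by_size` and `upload_small_files` are hypothetical helpers. Only `upload_file` and its `path_or_fileobj`, `path_in_repo`, and `repo_id` arguments come from `huggingface_hub`.

```python
import os

SIZE_LIMIT_BYTES = 5 * 1024**3  # "5GB" from the text; binary units are an assumption


def partition_by_size(paths, limit_bytes=SIZE_LIMIT_BYTES):
    """Split paths into (uploadable, too_large) based on file size on disk."""
    uploadable, too_large = [], []
    for path in paths:
        target = too_large if os.path.getsize(path) > limit_bytes else uploadable
        target.append(path)
    return uploadable, too_large


def upload_small_files(paths, repo_id):
    """Upload each small enough file to the repository root (hypothetical helper)."""
    from huggingface_hub import upload_file  # lazy import

    uploadable, too_large = partition_by_size(paths)
    for path in uploadable:
        upload_file(
            path_or_fileobj=path,
            path_in_repo=os.path.basename(path),
            repo_id=repo_id,
        )
    return too_large  # left for a git-lfs based upload
```

Files returned by `upload_small_files` would then need the git-based route.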

### The `Repository` class

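Using this class requires having git and git-lfs installed. (Recent `huggingface_hub` releases deprecate `Repository` in favor of HTTP-based methods such as `upload_file`, so the cells below assume a version that still ships it.)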

In [None]:
from huggingface_hub import Repository

repo = Repository(
    '<path_to_dummy_folder>',
    clone_from='<namespace>/dummy-model'
)

This creates the folder `<path_to_dummy_folder>` in our working directory. The folder contains only the `.gitattributes` file, as that is the only file created when the repository is instantiated through `create_repo`.

From now on, we can use several of the traditional git methods:
```python
repo.git_pull()
repo.git_add()
repo.git_commit()
repo.git_push()
repo.git_tag()
```

We first make sure that our local clone is up to date by pulling the latest changes:

In [None]:
repo.git_pull()

Once that is done, we save the model and tokenizer files:

In [None]:
tokenizer.save_pretrained('<path_to_dummy_folder>')
model.save_pretrained('<path_to_dummy_folder>')

The `<path_to_dummy_folder>` folder now contains all the model and tokenizer files. We push them to the Hub by staging the files, committing them, and pushing:

In [None]:
repo.git_add()
repo.git_commit('Add model and tokenizer files')
repo.git_push()
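
The whole sequence can be gathered into one function. This is a sketch rather than a recommended API: `publish` is a hypothetical helper that repeats the calls above in the same order, and `CallLog` is a stand-in that records method names so that the example runs without git, git-lfs, or network access.

```python
def publish(repo, folder, model, tokenizer, message="Add model and tokenizer files"):
    """Pull, save the files into the local clone, then add, commit, and push."""
    repo.git_pull()                    # sync the local clone first
    tokenizer.save_pretrained(folder)  # write tokenizer files into the clone
    model.save_pretrained(folder)      # write model weights and config
    repo.git_add()                     # stage the new files
    repo.git_commit(message)
    repo.git_push()


class CallLog:
    """Stand-in that records every method call in a shared list."""

    def __init__(self, log, prefix):
        self._log, self._prefix = log, prefix

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self._log.append(f"{self._prefix}.{name}")
        return record


log = []
publish(CallLog(log, "repo"), "<path_to_dummy_folder>",
        CallLog(log, "model"), CallLog(log, "tokenizer"))
print(log)

# With the real objects: publish(repo, '<path_to_dummy_folder>', model, tokenizer)
```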