# Sharing pretrained models (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
!pip install datasets evaluate "transformers[sentencepiece]" "accelerate>=0.20.1"
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 15 not upgraded.


You will need to set up git: adapt your email and name in the following cell.

In [2]:
!git config --global user.email "vinmehta007@gmail.com"
!git config --global user.name "ivineetm007"

You will also need to be logged in to the Hugging Face Hub. Execute the following cell and enter your credentials. Use a token with write access so that you can create repositories and push models.

In [3]:
from huggingface_hub import notebook_login
# Pass a write token so that repositories can be created
notebook_login()


## Using the push_to_hub API

The easiest way to upload a model to the Hub is to set **push_to_hub=True** when you define your TrainingArguments.
1. When you call trainer.train(), the Trainer uploads your model to the Hub each time it is saved (here, every epoch) to a repository in your namespace.
2. The repository is named after the output directory you picked, but you can choose a different name with hub_model_id = "a_different_name".
3. Once training is finished, call **trainer.push_to_hub()** one last time to upload the final version of your model. This also generates a model card with all the relevant metadata, reporting the hyperparameters used and the evaluation results.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "bert-finetuned-mrpc", save_strategy="epoch", push_to_hub=True
)
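The naming rule described above can be sketched in plain Python. This is a minimal sketch, not part of Transformers: `hub_training_kwargs` and `expected_repo_name` are hypothetical helpers, and the sketch assumes the default repository name is the last path component of the output directory.

```python
from pathlib import Path


def hub_training_kwargs(output_dir, hub_model_id=None):
    """Build keyword arguments for TrainingArguments with Hub uploads enabled.

    Hypothetical helper for this notebook, not part of Transformers.
    """
    kwargs = {
        "output_dir": output_dir,
        "save_strategy": "epoch",  # save (and therefore upload) once per epoch
        "push_to_hub": True,
    }
    if hub_model_id is not None:
        # Overrides the repository name that would be derived from output_dir.
        kwargs["hub_model_id"] = hub_model_id
    return kwargs


def expected_repo_name(kwargs):
    """Repository name we expect: hub_model_id if set, else the output directory name.

    Assumption: the default name is the last component of output_dir.
    """
    return kwargs.get("hub_model_id") or Path(kwargs["output_dir"]).name


kwargs = hub_training_kwargs("bert-finetuned-mrpc", hub_model_id="a_different_name")
print(expected_repo_name(kwargs))

# To use the arguments for real (requires transformers and a logged-in session):
# from transformers import TrainingArguments
# training_args = TrainingArguments(**kwargs)
```

Keeping the arguments in a dictionary makes it easy to inspect which repository a run would target before any training starts.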

At a lower level, accessing the Model Hub can be done directly on models, tokenizers, and configuration objects via their push_to_hub() method.

In [13]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "camembert-base"

model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

config.json:   0%|          | 0.00/508 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/445M [00:00<?, ?B/s]

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing CamembertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


sentencepiece.bpe.model:   0%|          | 0.00/811k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.40M [00:00<?, ?B/s]

In [None]:
model.push_to_hub("dummy-model")

model.safetensors:   0%|          | 0.00/443M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vinm007/dummy-model/commit/f20d1bb93a4e4bf5aa910560f2a76037c55eae7a', commit_message='Upload CamembertForMaskedLM', commit_description='', oid='f20d1bb93a4e4bf5aa910560f2a76037c55eae7a', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer.push_to_hub("dummy-tokenizer")

sentencepiece.bpe.model:   0%|          | 0.00/811k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/vinm007/dummy-tokenizer/commit/23e45b8a947112b069fe9bc6011b507d6043c117', commit_message='Upload tokenizer', commit_description='', oid='23e45b8a947112b069fe9bc6011b507d6043c117', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
## If you belong to an organization, prefix the repository name with the organization to upload to that organization’s namespace.
## (Older Transformers releases took an organization argument instead.)
# tokenizer.push_to_hub("huggingface/dummy-model")

## Using a specific token (the argument was called use_auth_token in older releases)
# tokenizer.push_to_hub("huggingface/dummy-model", token="<TOKEN>")
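As a sketch of pushing to a namespace other than your own, the helpers below build a `namespace/name` repository id and hand it to `push_to_hub()`. Both `namespaced_repo_id` and `push_to_namespace` are hypothetical names, and the sketch assumes a recent Transformers release in which `push_to_hub()` accepts a `token` keyword.

```python
def namespaced_repo_id(name, namespace=None):
    """Return "namespace/name", or just "name" for the logged-in user's namespace."""
    if namespace is None:
        return name
    if "/" in name:
        raise ValueError(f"{name!r} already contains a namespace")
    return f"{namespace}/{name}"


def push_to_namespace(obj, name, namespace=None, token=None):
    """Push a model or tokenizer; `obj` is anything exposing push_to_hub().

    Assumption: push_to_hub() takes the repository id first and a `token`
    keyword. When token is None, the token saved at login is used.
    """
    repo_id = namespaced_repo_id(name, namespace)
    return obj.push_to_hub(repo_id, token=token)


# Example (needs write access to the organization):
# push_to_namespace(tokenizer, "dummy-model", namespace="huggingface")
```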

## Using the huggingface_hub Python library
The huggingface_hub library provides simple APIs that work on top of git to manage the content of repositories on the Hub and to integrate the Hub into your projects and libraries.

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Your token has been saved in your con

In [4]:
from huggingface_hub import (
    # User management
    login,
    logout,
    whoami,

    # Repository creation and management
    create_repo,
    delete_repo,
    update_repo_visibility,

    # And some methods to retrieve/change information about the content
    list_models,
    list_datasets,
    list_metrics,
    list_repo_files,
    upload_file,
    delete_file,
)

**create_repo**

Other arguments
1. **private**: whether the repository should be visible to others or not.
2. **token**: overrides the token stored in your cache with the given token.
3. **repo_type**: set this to create a dataset or a space instead of a model. Accepted values are "dataset" and "space".
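The three arguments above can be combined as in the following sketch. `create_repo_kwargs` is a hypothetical helper that only assembles and checks the keyword arguments, and the repository name `dummy-dataset` in the commented call is likewise an illustration.

```python
ACCEPTED_REPO_TYPES = (None, "dataset", "space")  # None creates a model repository


def create_repo_kwargs(private=False, token=None, repo_type=None):
    """Assemble keyword arguments for create_repo (hypothetical helper)."""
    if repo_type not in ACCEPTED_REPO_TYPES:
        raise ValueError(f"repo_type must be one of {ACCEPTED_REPO_TYPES}, got {repo_type!r}")
    kwargs = {"private": private}
    if token is not None:
        kwargs["token"] = token  # overrides the cached login token
    if repo_type is not None:
        kwargs["repo_type"] = repo_type
    return kwargs


print(create_repo_kwargs(private=True, repo_type="dataset"))

# To create the repository for real (requires huggingface_hub and a write token):
# from huggingface_hub import create_repo
# create_repo("dummy-dataset", **create_repo_kwargs(private=True, repo_type="dataset"))
```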

In [22]:
from huggingface_hub import create_repo

create_repo("dummy-model2")

RepoUrl('https://huggingface.co/vinm007/dummy-model2', endpoint='https://huggingface.co', repo_type='model', repo_id='vinm007/dummy-model2')

In [None]:
# This will create the dummy-model repository in the huggingface namespace, assuming you belong to that organization.
# (Older huggingface_hub releases took an organization argument instead.)
# create_repo("huggingface/dummy-model")

## Uploading the model files

**upload_file**\
Using upload_file does **not require git and git-lfs** to be installed on your system. It pushes files directly to the 🤗 Hub using **HTTP POST requests**. A limitation of this approach is that it doesn't handle files that are larger than **5GB in size**.

In [None]:
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="<path_to_file>/config.json",  # passed by keyword; recent releases do not accept it positionally
    path_in_repo="config.json",
    repo_id="<namespace>/dummy-model",
)
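Because of the size limit mentioned above, it can help to check a file before uploading it. The sketch below is one way to do that; `can_upload_over_http` and `upload_if_small_enough` are hypothetical helpers, and reading "5GB" as 5 GiB is an assumption.

```python
import os

# The 5GB limit quoted above, interpreted here as 5 GiB (assumption).
HTTP_UPLOAD_LIMIT_BYTES = 5 * 1024**3


def can_upload_over_http(path, limit=HTTP_UPLOAD_LIMIT_BYTES):
    """Return True if the file at `path` is within the size limit."""
    return os.path.getsize(path) <= limit


def upload_if_small_enough(path, repo_id, path_in_repo=None):
    """Upload one file with upload_file, refusing files over the limit."""
    if not can_upload_over_http(path):
        raise ValueError(f"{path} exceeds the HTTP upload limit; use git and git-lfs instead")
    # Imported lazily so the size check works without huggingface_hub installed.
    from huggingface_hub import upload_file

    return upload_file(
        path_or_fileobj=path,
        path_in_repo=path_in_repo or os.path.basename(path),
        repo_id=repo_id,
    )


# Example (requires a logged-in session):
# upload_if_small_enough("<path_to_file>/config.json", "<namespace>/dummy-model")
```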

**Repository class**\
The Repository class manages a local repository in a git-like manner.

In [9]:
from huggingface_hub import Repository

repo = Repository("./dummy-model", clone_from="vinm007/dummy-model")

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/vinm007/dummy-model into local empty directory.


In [10]:
# Available Methods
# repo.git_pull()
# repo.git_add()
# repo.git_commit()
# repo.git_push()
# repo.git_tag()

In [11]:
repo.git_pull()

In [14]:
model.save_pretrained("./dummy-model")
tokenizer.save_pretrained("./dummy-model")

('./dummy-model/tokenizer_config.json',
 './dummy-model/special_tokens_map.json',
 './dummy-model/sentencepiece.bpe.model',
 './dummy-model/added_tokens.json',
 './dummy-model/tokenizer.json')

In [15]:
repo.git_add()
repo.git_commit("Add model and tokenizer files")
repo.git_push()

Upload file model.safetensors:   0%|          | 1.00/422M [00:00<?, ?B/s]

Upload file sentencepiece.bpe.model:   0%|          | 1.00/792k [00:00<?, ?B/s]

To https://huggingface.co/vinm007/dummy-model
   be2b2a4..c1c0876  main -> main



'https://huggingface.co/vinm007/dummy-model/commit/c1c0876249c791b3c7def647feb992476b82323f'

**Git-based approach**

In [25]:
# Clone the repository created earlier with create_repo
!git clone https://huggingface.co/vinm007/dummy-model2

Cloning into 'dummy-model2'...
remote: Enumerating objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (3/3), 418 bytes | 418.00 KiB/s, done.


In [30]:
%cd dummy-model2

/content/dummy-model2


In [32]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "camembert-base"

model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Do whatever with the model, train it, fine-tune it...

model.save_pretrained("./")
tokenizer.save_pretrained("./")

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForMaskedLM: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing CamembertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


('./tokenizer_config.json',
 './special_tokens_map.json',
 './sentencepiece.bpe.model',
 './added_tokens.json',
 './tokenizer.json')

In [33]:
!ls -a -lh

total 426M
drwxr-xr-x 3 root root 4.0K Dec  2 09:43 .
drwxr-xr-x 1 root root 4.0K Dec  2 09:43 ..
-rw-r--r-- 1 root root   28 Dec  2 09:43 added_tokens.json
-rw-r--r-- 1 root root  701 Dec  2 09:43 config.json
drwxr-xr-x 8 root root 4.0K Dec  2 09:42 .git
-rw-r--r-- 1 root root 1.5K Dec  2 09:42 .gitattributes
-rw-r--r-- 1 root root 423M Dec  2 09:43 model.safetensors
-rw-r--r-- 1 root root 792K Dec  2 09:43 sentencepiece.bpe.model
-rw-r--r-- 1 root root  374 Dec  2 09:43 special_tokens_map.json
-rw-r--r-- 1 root root 1.8K Dec  2 09:43 tokenizer_config.json
-rw-r--r-- 1 root root 2.4M Dec  2 09:43 tokenizer.json


In [34]:
!git add .

In [35]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   added_tokens.json
	new file:   config.json
	new file:   model.safetensors
	new file:   sentencepiece.bpe.model
	new file:   special_tokens_map.json
	new file:   tokenizer.json
	new file:   tokenizer_config.json



In [36]:
!git lfs status

On branch main
Objects to be pushed to origin/main:


Objects to be committed:

	added_tokens.json (Git: 43734cd)
	config.json (Git: c267925)
	model.safetensors (LFS: 2785d2e)
	sentencepiece.bpe.model (LFS: 988bc5a)
	special_tokens_map.json (Git: b547935)
	tokenizer.json (Git: 9a9362e)
	tokenizer_config.json (Git: c49982e)

Objects not staged for commit:
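Which files go through LFS is decided by the patterns in .gitattributes. The sketch below shows roughly how that lookup works. The helper names are hypothetical, the sample .gitattributes content is illustrative rather than copied from this repository, and `fnmatch` only approximates git's own pattern matching.

```python
from fnmatch import fnmatch


def lfs_patterns(gitattributes_text):
    """Collect the path patterns that .gitattributes routes through Git LFS."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        # Skip blank lines and comments; keep patterns carrying filter=lfs.
        if len(parts) >= 2 and not parts[0].startswith("#") and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def is_lfs_tracked(filename, patterns):
    """Approximate check: does any LFS pattern match this file name?"""
    return any(fnmatch(filename, pattern) for pattern in patterns)


# Illustrative .gitattributes content (an assumption, not this repository's file).
sample = """\
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
"""

patterns = lfs_patterns(sample)
for name in ["config.json", "model.safetensors", "sentencepiece.bpe.model"]:
    print(name, "LFS" if is_lfs_tracked(name, patterns) else "Git")
```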




In [37]:
!git commit -m "First model version"

[main 294e2fa] First model version
 7 files changed, 128351 insertions(+)
 create mode 100644 added_tokens.json
 create mode 100644 config.json
 create mode 100644 model.safetensors
 create mode 100644 sentencepiece.bpe.model
 create mode 100644 special_tokens_map.json
 create mode 100644 tokenizer.json
 create mode 100644 tokenizer_config.json


In [38]:
!git push

Uploading LFS objects: 100% (2/2), 444 MB | 82 MB/s, done.
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 2 threads
Compressing objects: 100% (8/8), done.
Writing objects: 100% (9/9), 592.03 KiB | 4.32 MiB/s, done.
Total 9 (delta 0), reused 0 (delta 0), pack-reused 0
To https://huggingface.co/vinm007/dummy-model2
   6594ac3..294e2fa  main -> main
