# Sharing pretrained models (PyTorch)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [1]:
%pip install datasets evaluate transformers[sentencepiece]

Note: you may need to restart the kernel to use updated packages.


apt install requires root privileges. Prefixing the command with sudo does not help here, because an IPython notebook cell cannot take the password prompt as input.

Solution #1: Run the sudo command through the os library, passing the password as an argument. This is not recommended because of the security risk.

Solution #2: Edit the sudoers file to allow specific commands for non-root users with sudo permission.

Simpler solution: If you are running locally, run the install command from a shell (or a shell script) and then execute this notebook with the following cell commented out.

In [3]:
!apt install git-lfs

E: Could not open lock file /var/lib/dpkg/lock-frontend - open (13: Permission denied)
E: Unable to acquire the dpkg frontend lock (/var/lib/dpkg/lock-frontend), are you root?
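The permission error above only matters if git-lfs is not installed yet. A minimal sketch of checking for it from Python before attempting the install; git_lfs_available is a hypothetical helper name, and the check only looks for an executable named git-lfs on PATH.

```python
import shutil


def git_lfs_available() -> bool:
    """Return True if an executable named `git-lfs` is found on PATH."""
    return shutil.which("git-lfs") is not None


if git_lfs_available():
    print("git-lfs found; the apt install cell can be skipped.")
else:
    print("git-lfs not found; install it from a shell with root privileges first.")
```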


You will need to set up Git; adapt your email and name in the following cell.

In [4]:
!git config user.email "you@example.com"
!git config user.name "Your Name"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [5]:
from huggingface_hub import notebook_login

notebook_login()


In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    "bert-finetuned-mrpc",
    save_strategy="epoch",
    push_to_hub=True
)

Access the Model Hub via the push_to_hub() method

In [8]:
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "camembert-base"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

Some weights of the model checkpoint at camembert-base were not used when initializing CamembertForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing CamembertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [9]:
model_name = "dummy-camembert-base-model"

In [10]:
model.push_to_hub(model_name)

pytorch_model.bin:   0%|          | 0.00/443M [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kaushikacharya/dummy-camembert-base-model/commit/d833f8058183f56bc24490d29457ab72d2bacb41', commit_message='Upload CamembertForMaskedLM', commit_description='', oid='d833f8058183f56bc24490d29457ab72d2bacb41', pr_url=None, pr_revision=None, pr_num=None)

In [11]:
tokenizer.push_to_hub(model_name)

sentencepiece.bpe.model:   0%|          | 0.00/811k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/kaushikacharya/dummy-camembert-base-model/commit/6d48ec0d5f77e3516d59107c5d54143790beb451', commit_message='Upload tokenizer', commit_description='', oid='6d48ec0d5f77e3516d59107c5d54143790beb451', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
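# The organization is given as the namespace in the repo id;
# the separate `organization` argument is deprecated in recent Transformers releases.
tokenizer.push_to_hub("huggingface/dummy-model")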

In [None]:
# Pass an access token explicitly; recent Transformers releases
# use `token` in place of the deprecated `use_auth_token`.
tokenizer.push_to_hub("huggingface/dummy-model", token="<TOKEN>")
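Hub repositories are addressed as namespace/name, the same form this notebook builds further down with f"{author}/{model_name}". A minimal sketch of composing such an id for either a user or an organization; build_repo_id is a hypothetical helper, not part of transformers or huggingface_hub.

```python
from typing import Optional


def build_repo_id(name: str, namespace: Optional[str] = None) -> str:
    """Join an optional user/organization namespace and a repo name."""
    if "/" in name:
        raise ValueError("name must not contain '/'; pass the namespace separately")
    # Without a namespace, the bare name is returned unchanged
    return f"{namespace}/{name}" if namespace else name


print(build_repo_id("dummy-model", "huggingface"))  # huggingface/dummy-model
print(build_repo_id("dummy-model"))                 # dummy-model
```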

Using the huggingface_hub Python library

In [12]:
from huggingface_hub import (
    # User management
    login,
    logout,
    whoami,

    # Repository creation and management
    create_repo,
    delete_repo,
    update_repo_visibility,

    # And some methods to retrieve/change information about the content
    list_models,
    list_datasets,
    list_metrics,
    list_repo_files,
    upload_file,
    delete_file,
)

In [13]:
model_name = "test-dummy-model"

In [26]:
from huggingface_hub import create_repo, delete_repo, list_models, RepoUrl

author = "kaushikacharya"
models = [x.modelId for x in list_models(author=author)]
print(f"Models: {models}")

# Delete the repo if it already exists
if f"{author}/{model_name}" in models:
    delete_repo(repo_id=model_name)

repo_url = create_repo(model_name)

Models: ['kaushikacharya/dummy-model', 'kaushikacharya/dummy-camembert-base-model']
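The membership test in the cell above can be replayed offline with the list that was printed. This is only a sketch of the check itself, with the printed output copied in as a literal; it does not contact the Hub.

```python
author = "kaushikacharya"
model_name = "test-dummy-model"

# Copied from the printed output of the cell above
models = [
    "kaushikacharya/dummy-model",
    "kaushikacharya/dummy-camembert-base-model",
]

repo_id = f"{author}/{model_name}"
already_exists = repo_id in models

# The repo was not in the list, so delete_repo was skipped in that run
print(repo_id, already_exists)  # kaushikacharya/test-dummy-model False
```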


In [None]:
from huggingface_hub import create_repo

# The organization is given as the namespace in the repo id
create_repo("huggingface/dummy-model")

upload_file approach

In [15]:
# Check the current working directory
import os

os.getcwd()

'/home/kaushik/projects/Transformers_Hugging_Face/code/notebooks/chapter4'

In [18]:
# Create a dummy config json file
import json

config_dict = {
    "_name_or_path": model_name,
    "architectures": [
        "CamembertForMaskedLM"
    ],
    "vocab_size": 32005
}

with open("config.json", "w") as fp:
    json.dump(config_dict, fp=fp)
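Before uploading, it can be worth confirming that the file reads back as the same dictionary. A minimal sketch of that round trip, assuming the same dummy values as above; it writes to a temporary directory (an assumption made here so the notebook's own config.json is left untouched).

```python
import json
import os
import tempfile

config_dict = {
    "_name_or_path": "test-dummy-model",
    "architectures": ["CamembertForMaskedLM"],
    "vocab_size": 32005,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")

    # Write the config, then read it back from disk
    with open(config_path, "w", encoding="utf-8") as fp:
        json.dump(config_dict, fp, indent=2)
    with open(config_path, encoding="utf-8") as fp:
        loaded = json.load(fp)

# The loaded dictionary should match what was written
assert loaded == config_dict
print(loaded["architectures"])  # ['CamembertForMaskedLM']
```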

Upload the config.json created above

In [27]:
from huggingface_hub import upload_file

upload_file(
    path_or_fileobj="config.json",
    path_in_repo="config.json",
    repo_id=f"{author}/{model_name}"
)

'https://huggingface.co/kaushikacharya/test-dummy-model/blob/main/config.json'

Repository class

In [29]:
from huggingface_hub import Repository

repo = Repository(
    local_dir=f"../../{model_name}",
    clone_from=f"{author}/{model_name}",
)

Cloning https://huggingface.co/kaushikacharya/test-dummy-model into local empty directory.
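The local_dir above is relative, so the clone lands relative to the working directory printed earlier by os.getcwd(). A small sketch of how that path resolves, using the directory from that output as a literal; posixpath is used so the result is the same on any platform.

```python
import posixpath

# Working directory as printed earlier in this notebook
cwd = "/home/kaushik/projects/Transformers_Hugging_Face/code/notebooks/chapter4"
model_name = "test-dummy-model"

# Two levels up from chapter4, then into the repo folder
local_dir = posixpath.normpath(posixpath.join(cwd, f"../../{model_name}"))
print(local_dir)
# /home/kaushik/projects/Transformers_Hugging_Face/code/test-dummy-model
```

In this setup, that resolved folder is what the <path_to_dummy_folder> placeholder in the following cells would refer to.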


In [None]:
repo.git_pull()
repo.git_add()
repo.git_commit()
repo.git_push()
repo.git_tag()

In [30]:
repo.git_pull()

In [None]:
model.save_pretrained("<path_to_dummy_folder>")
tokenizer.save_pretrained("<path_to_dummy_folder>")

In [None]:
repo.git_add()
repo.git_commit("Add model and tokenizer files")
repo.git_push()

In [None]:
from transformers import AutoModelForMaskedLM, AutoTokenizer

checkpoint = "camembert-base"

model = AutoModelForMaskedLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Do whatever with the model, train it, fine-tune it...

model.save_pretrained("<path_to_dummy_folder>")
tokenizer.save_pretrained("<path_to_dummy_folder>")