hf: warn of deprecating internal callback #740

Merged
merged 27 commits into from
Feb 27, 2024
Changes from 9 commits
8ecc790
hf: warn of deprecating internal callback
dberenbaum Nov 17, 2023
4ef660e
hf: update notebook
dberenbaum Nov 17, 2023
c386e89
Merge branch 'main' into hf-deprecation-warning
dberenbaum Nov 17, 2023
42cd1ab
fix notebook
dberenbaum Nov 17, 2023
cb89b98
Merge branch 'main' into hf-deprecation-warning
dberenbaum Dec 12, 2023
e2c23d7
hf: test without passing live instance
dberenbaum Dec 12, 2023
99aff66
Merge branch 'hf-deprecation-warning' of github.com:iterative/dvclive…
dberenbaum Dec 12, 2023
801178e
Merge branch 'main' into hf-deprecation-warning
dberenbaum Dec 12, 2023
fa244f9
Merge branch 'main' into hf-deprecation-warning
dberenbaum Dec 22, 2023
8e9afeb
Merge branch 'main' into hf-deprecation-warning
dberenbaum Jan 22, 2024
d52461b
merge hf notebooks and fix dvclivecallback import
dberenbaum Jan 22, 2024
b0c3c5c
Merge branch 'hf-deprecation-warning' of github.com:iterative/dvclive…
dberenbaum Jan 22, 2024
f7b4efb
hf: test HF_DVCLIVE_LOG_MODEL env var
dberenbaum Jan 22, 2024
6bef79e
fix ci: account for huggingface transformers changes
mattseddon Feb 12, 2024
a9f01b0
see if loss is broken
mattseddon Feb 12, 2024
88931f0
show what is going through on_log
mattseddon Feb 12, 2024
be1cfde
try unparallelize
mattseddon Feb 12, 2024
ec73bf7
check next_step call count
mattseddon Feb 12, 2024
e6f1845
try report_to none
mattseddon Feb 12, 2024
27e38cb
revert all code
mattseddon Feb 12, 2024
8c25642
Merge branch 'check-huggingface' into hf-deprecation-warning
dberenbaum Feb 12, 2024
153f72b
clean up hf tests
dberenbaum Feb 12, 2024
4043b8d
merge
dberenbaum Feb 12, 2024
8f2f91b
revert moving spy call
dberenbaum Feb 12, 2024
63c40f5
hf: test log_model=None
dberenbaum Feb 12, 2024
1977009
Merge branch 'main' into hf-deprecation-warning
dberenbaum Feb 27, 2024
7e9d1f9
Update src/dvclive/huggingface.py
dberenbaum Feb 27, 2024
333 changes: 333 additions & 0 deletions examples/DVCLive_HuggingFace.ipynb
@@ -0,0 +1,333 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "3SJ8SY6ldmsS"
},
"source": [
"### How to do Experiment tracking with DVCLive\n",
"\n",
"What will you learn?\n",
"\n",
"- Fine-tune a model on a binary text classification task\n",
"- Track machine learning experiments with DVCLive\n",
"- Visualize results and create a report\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nxiSBytidmsU"
},
"source": [
"#### Setup (Install Dependencies & Set Up Git)\n",
"\n",
"- Install accelerate, Datasets, evaluate, transformers and dvclive\n",
"- Start a Git repo. Your experiments will be saved in a commit but hidden in\n",
" order to not clutter your repo.\n",
"- Initialize DVC\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "CLRgy2W4dmsU"
},
"outputs": [],
"source": [
"!pip install datasets dvclive evaluate pandas 'transformers[torch]' --upgrade"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "fo0sq84UdmsV"
},
"outputs": [],
"source": [
"!git init -q\n",
"!git config --local user.email \"you@example.com\"\n",
"!git config --local user.name \"Your Name\"\n",
"!dvc init -q\n",
"!git commit -m \"DVC init\""
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "T5WYJ31UdmsV"
},
"source": [
"### Fine-tuning a model on a text classification task\n",
"\n",
"#### Loading the dataset\n",
"\n",
"We will use the [imdb](https://huggingface.co/datasets/imdb) Large Movie Review Dataset. This is a dataset for binary\n",
"sentiment classification containing a set of 25K movie reviews for training and\n",
"25K for testing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "41fP0WCbdmsV"
},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"from transformers import AutoTokenizer\n",
"\n",
"dataset = load_dataset(\"imdb\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "V3gDKbbSdmsV"
},
"source": [
"#### Preprocessing the data\n",
"\n",
"We use `transformers.AutoTokenizer`, which transforms the inputs and puts them in a format\n",
"the model expects.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "uVr5lufodmsV"
},
"outputs": [],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(\"distilbert-base-cased\")\n",
"\n",
"def tokenize_function(examples):\n",
" return tokenizer(examples[\"text\"], padding=\"max_length\", truncation=True)\n",
"\n",
"small_train_dataset = dataset[\"train\"].shuffle(seed=42).select(range(2000)).map(tokenize_function, batched=True)\n",
"small_eval_dataset = dataset[\"test\"].shuffle(seed=42).select(range(200)).map(tokenize_function, batched=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "g9sELYMHdmsV"
},
"source": [
"#### Define evaluation metrics\n",
"\n",
"F1 combines precision and recall into a single value, so\n",
"we use it as the criterion for evaluating the models.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wmJoy5V-dmsW"
},
"outputs": [],
"source": [
"import numpy as np\n",
"import evaluate\n",
"\n",
"metric = evaluate.load(\"f1\")\n",
"\n",
"def compute_metrics(eval_pred):\n",
" logits, labels = eval_pred\n",
" predictions = np.argmax(logits, axis=-1)\n",
" return metric.compute(predictions=predictions, references=labels)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NwFntrIKdmsW"
},
"source": [
"### Training and Tracking experiments with DVCLive\n",
"\n",
"Track experiments in DVC by changing a few lines of your Python code.\n",
"Save model artifacts using `HF_DVCLIVE_LOG_MODEL=true`."
]
},
{
"cell_type": "code",
"source": [
"%env HF_DVCLIVE_LOG_MODEL=true"
],
"metadata": {
"id": "-A1oXCxE4zGi"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "gKKSTh0ZdmsW"
},
"outputs": [],
"source": [
"from dvclive.huggingface import DVCLiveCallback\n",
Review comment (Member): q: is it expected that we use our own callback in this example?
"from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer\n",
"\n",
"model = AutoModelForSequenceClassification.from_pretrained(\"distilbert-base-cased\", num_labels=2)\n",
"for param in model.base_model.parameters():\n",
" param.requires_grad = False\n",
"\n",
"lr = 3e-4\n",
"\n",
"training_args = TrainingArguments(\n",
" evaluation_strategy=\"epoch\",\n",
" learning_rate=lr,\n",
" logging_strategy=\"epoch\",\n",
" num_train_epochs=5,\n",
" output_dir=\"output\",\n",
" overwrite_output_dir=True,\n",
" load_best_model_at_end=True,\n",
" save_strategy=\"epoch\",\n",
" weight_decay=0.01,\n",
")\n",
"\n",
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=small_train_dataset,\n",
" eval_dataset=small_eval_dataset,\n",
" compute_metrics=compute_metrics,\n",
")\n",
"trainer.train()"
]
},
{
"cell_type": "markdown",
"source": [
"To customize tracking, include `transformers.integrations.DVCLiveCallback` in the `Trainer` callbacks and pass additional keyword arguments to `dvclive.Live`."
],
"metadata": {
"id": "KKJCw0Vj6UTw"
}
},
{
"cell_type": "code",
"source": [
"from dvclive import Live\n",
"from transformers.integrations import DVCLiveCallback\n",
"\n",
"lr = 1e-4\n",
"\n",
"training_args = TrainingArguments(\n",
" evaluation_strategy=\"epoch\",\n",
" learning_rate=lr,\n",
" logging_strategy=\"epoch\",\n",
" num_train_epochs=5,\n",
" output_dir=\"output\",\n",
" overwrite_output_dir=True,\n",
" load_best_model_at_end=True,\n",
" save_strategy=\"epoch\",\n",
" weight_decay=0.01,\n",
")\n",
"\n",
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=small_train_dataset,\n",
" eval_dataset=small_eval_dataset,\n",
" compute_metrics=compute_metrics,\n",
" callbacks=[DVCLiveCallback(live=Live(report=\"notebook\"), log_model=True)],\n",
")\n",
"trainer.train()"
],
"metadata": {
"id": "M4FKUYTi5zYQ"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "l29wqAaDdmsW"
},
"source": [
"### Comparing Experiments\n",
"\n",
"We create a dataframe with the experiments in order to visualize them.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "wwMwHvVtdmsW"
},
"outputs": [],
"source": [
"import dvc.api\n",
"import pandas as pd\n",
"\n",
"columns = [\"Experiment\", \"epoch\", \"eval.f1\"]\n",
"\n",
"df = pd.DataFrame(dvc.api.exp_show(), columns=columns)\n",
"\n",
"df.dropna(inplace=True)\n",
"df.reset_index(drop=True, inplace=True)\n",
"df\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "TNBGUqoCdmsW"
},
"outputs": [],
"source": [
"!dvc plots diff $(dvc exp list --names-only)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "sL5pH4X5dmsW"
},
"outputs": [],
"source": [
"from IPython.display import HTML\n",
"HTML(filename='./dvc_plots/index.html')"
]
}
],
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
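Note on the notebook's `compute_metrics` cell: it reduces each row of logits to a class prediction with `argmax` and then scores the predictions with F1. The sketch below illustrates that arithmetic in plain Python. The helper names `argmax_rows` and `binary_f1` and the toy logits and labels are assumptions made for illustration; the notebook itself uses `np.argmax` and `evaluate.load("f1")`.

```python
def argmax_rows(logits):
    """Return the index of the largest value in each row of logits."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_f1(predictions, references, positive=1):
    """Compute F1 for the positive class from predicted and true labels."""
    pairs = list(zip(predictions, references))
    tp = sum(1 for p, r in pairs if p == positive and r == positive)
    fp = sum(1 for p, r in pairs if p == positive and r != positive)
    fn = sum(1 for p, r in pairs if p != positive and r == positive)
    if tp == 0:
        # No true positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): four examples, two classes.
logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 1.5], [1.2, 0.3]]
labels = [0, 1, 0, 1]

predictions = argmax_rows(logits)
score = binary_f1(predictions, labels)
print(predictions, score)
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so their harmonic mean is also 0.5.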
4 changes: 4 additions & 0 deletions src/dvclive/huggingface.py
@@ -24,6 +24,10 @@ def __init__(
log_model: Optional[Union[Literal["all"], bool]] = None,
**kwargs,
):
logger.warning(
"This callback will be deprecated in DVCLive 4.0 in favor of"
Review comment (Member): [Q] Do you want to deprecate it now and remove it in 4.0?
" `transformers.integrations.DVCLiveCallback`"
)
super().__init__()
self._log_model = log_model
self.live = live if live is not None else Live(**kwargs)
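
Note on the hunk above: it logs a warning each time the callback is constructed. The sketch below reproduces that pattern in isolation so the resulting message can be inspected. `LegacyCallback`, the logger name `dvclive_sketch`, `_ListHandler`, and `capture_warning` are hypothetical names used only for illustration, and a plain dict stands in for the `Live` instance so the example needs no third-party imports.

```python
import logging

logger = logging.getLogger("dvclive_sketch")  # hypothetical logger name


class LegacyCallback:
    """Hypothetical stand-in for the callback changed in this hunk."""

    def __init__(self, live=None, log_model=None, **kwargs):
        # Warn on every instantiation, as the diff does.
        logger.warning(
            "This callback will be deprecated in DVCLive 4.0 in favor of"
            " `transformers.integrations.DVCLiveCallback`"
        )
        self._log_model = log_model
        # The real code builds Live(**kwargs); a dict is an assumption here.
        self.live = live if live is not None else dict(kwargs)


class _ListHandler(logging.Handler):
    """Collects log messages in a list so they can be checked."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def capture_warning():
    """Instantiate the callback once and return the messages it logged."""
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        LegacyCallback(log_model=True)
    finally:
        logger.removeHandler(handler)
    return handler.messages


messages = capture_warning()
print(messages[0])
```

Because the two string literals are adjacent, Python joins them into one message; the leading space in the second literal keeps "of" and the backticked name separated.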