feat: add support for csv files #592

Merged (16 commits, Nov 4, 2022)
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Added

- Add support for CSV files to the `fit` function. [#592](https://github.com/jina-ai/finetuner/pull/592)

### Removed

### Changed
26 changes: 25 additions & 1 deletion README.md
@@ -148,7 +148,8 @@ run = finetuner.fit(
)
```

Fine-tuning might take 5 minute to finish. You can later re-connect your run with:
Here, the training data used is gathered from the Jina AI Cloud; however, data can also be passed as a CSV file or a DocumentArray, as described [here](https://finetuner.jina.ai/walkthrough/create-training-data/).
Fine-tuning might take 5 minutes to finish. You can later re-connect your run with:

Member:
I think it would be nice to show in the README how to train a model with a CSV file. This was also requested by Han, so I think we should add something.

```python
import finetuner
@@ -186,6 +187,29 @@ finetuner.encode(model=model, data=da)
da.summary()
```

## Training on your own data

If you want to train a model using your own dataset instead of one on the Jina AI Cloud, you can provide labeled data in a CSV file in the following way:

```plaintext
This is an apple apple_label
This is a pear pear_label
...
```

You can then provide the path to your CSV file as your training data:

```python
run = finetuner.fit(
model='bert-base-cased',
run_name='bert-my-own-run',
train_data='path/to/some/data.csv',
)
```
More information on providing your own training data can be found in the [Prepare Training Data](https://finetuner.jina.ai/walkthrough/create-training-data/) section of the [walkthrough](https://finetuner.jina.ai/walkthrough/).

### Next steps

- Take the [walkthrough](https://finetuner.jina.ai/walkthrough/) and submit your first fine-tuning job.
2 changes: 1 addition & 1 deletion docs/get-started/design-principles.md
@@ -40,6 +40,6 @@ while all these parameters are optional.

If you do not have a machine learning background,
don't worry about it.
As was stated before, you only need to provide the training data organized as a {class}`~docarray.array.document.DocumentArray`.
As was stated before, you only need to provide the training data organized as a {class}`~docarray.array.document.DocumentArray` or as a CSV file.
In case you do not know which backbone to choose,
use {meth}`~finetuner.describe_models()` to let Finetuner suggest a backbone model for you.
2 changes: 1 addition & 1 deletion docs/get-started/how-it-works.md
@@ -19,7 +19,7 @@ but instead outputs a feature vector to represent your data.
### Step 2: Triplet construction and training on-the-fly

Finetuner works on labeled data.
It expects a {class}`~docarray.array.document.DocumentArray` consisting of {class}`~docarray.document.Document`s where each one contains `finetuner_label` corresponding to the class of a specific training example.
It expects either a CSV file or a {class}`~docarray.array.document.DocumentArray` consisting of {class}`~docarray.document.Document`s, where each one contains a `finetuner_label` corresponding to the class of a specific training example. When a CSV file is received, its contents are parsed and a {class}`~docarray.array.document.DocumentArray` is constructed from them.

During the fine-tuning, Finetuner creates Triplets `(anchor, positive, negative)` on-the-fly.
For each anchor,
85 changes: 65 additions & 20 deletions docs/walkthrough/create-training-data.md
@@ -1,13 +1,73 @@
(create-training-data)=
# Prepare Training Data

Finetuner accepts training data and evaluation data in the form of {class}`~docarray.array.document.DocumentArray` objects.
Because Finetuner follows a [supervised-learning](https://en.wikipedia.org/wiki/Supervised_learning) scheme,
you should assign a label to each {class}`~docarray.document.Document` inside your {class}`~docarray.array.document.DocumentArray`.
Finetuner accepts training data and evaluation data in the form of CSV files
or {class}`~docarray.array.document.DocumentArray` objects.
Because Finetuner follows a [supervised-learning](https://en.wikipedia.org/wiki/Supervised_learning) scheme, each element requires a label that identifies which other elements it should be similar to.
If you need to evaluate metrics on separate evaluation data, it is recommended to create a dataset only for evaluation purposes. This can be done in the same way as a training dataset is created, as described below.

Data can be prepared in two different formats, either as a CSV file or as a {class}`~docarray.array.document.DocumentArray`. In the sections below, you can see examples that demonstrate what the training datasets should look like for each format.

## Preparing CSV Files

To record data in a CSV file, the contents of each element are stored in plain text, with each row representing either one labeled item, a pair of items that should be semantically similar, or two items of different modalities when a CLIP model is being used. The provided CSV files are then parsed and a {class}`~docarray.array.document.DocumentArray` is constructed containing the elements within the CSV file. Any references to local images within the CSV file are then loaded into memory.

Currently, the `excel`, `excel-tab` and `unix` CSV dialects are supported. To specify which dialect to use, provide a {class}`~finetuner.data.CSVOptions` object with `dialect=chosen_dialect` as the `csv_options` argument to the {meth}`~finetuner.fit` function. The list of all options for reading CSV files can be found in the description of the {class}`~finetuner.data.CSVOptions` class.
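
For example, a tab-separated file could be read by selecting the corresponding dialect (a minimal sketch; the model name and file path are illustrative):

```python
import finetuner
from finetuner.data import CSVOptions

run = finetuner.fit(
    model='bert-base-cased',
    train_data='path/to/some/data.csv',
    # read the file using the tab-separated 'excel-tab' dialect
    csv_options=CSVOptions(dialect='excel-tab'),
)
```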


````{tab} two elements per row
If you want two elements to be semantically close together, they can be placed on the same row as a pair; each pair is then assigned its own distinct label:

```markdown
This is an English sentence Das ist ein englischer Satz
This is another English sentence Dies ist ein weiterer englischer Satz
...
```
This format can be used to construct training data for text-to-text and image-to-image retrieval models:

```markdown
apple.jpg https://example.com/apple-styling.jpg
orange.jpg https://example.com/orange-styling.jpg
```
````

````{tab} Labeled data
In cases where you want multiple elements grouped together, you can provide a label in the second column. This way, all elements in the first column that have the same label will be considered similar when training. To indicate that the second column of your CSV file represents a label instead of a second element, set `is_labeled = True` in the `csv_options` argument of the {meth}`~finetuner.fit` function. Your data can then be structured like so:

```markdown
Hello! greeting-english
Hi there. greeting-english
Good morning. greeting-english
I'm (…) sorry! apologize-english
I'm sorry to have… apologize-english
Please, forgive me! apologize-english
```
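
A run on such a labeled CSV file could then be started as follows (a minimal sketch; the model name and file path are illustrative):

```python
import finetuner
from finetuner.data import CSVOptions

run = finetuner.fit(
    model='bert-base-cased',
    train_data='path/to/some/labeled_data.csv',
    # tell Finetuner that the second column holds labels, not a second element
    csv_options=CSVOptions(is_labeled=True),
)
```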
````

````{tab} text-to-image search using CLIP
To prepare data for text-to-image search, each row must contain the URI of an image and one piece of text. The order in which these two appear does not matter, as long as it is kept consistent across all rows.

```markdown
This is a photo of an apple. apple.jpg
This is a black-and-white photo of an orange. orange.jpg
```

```{admonition} CLIP model explained
:class: hint
The OpenAI CLIP model wraps two models: a vision transformer and a text transformer.
During fine-tuning, we're optimizing two models in parallel.

When the model is saved, you will find that two models are saved to your local directory.
```
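
A CLIP fine-tuning run on such a CSV file could then be started as follows (a minimal sketch; the model name and file path are illustrative; run `finetuner.describe_models()` to list the available backbones):

```python
import finetuner

run = finetuner.fit(
    model='openai/clip-vit-base-patch32',  # illustrative CLIP backbone name
    train_data='path/to/some/clip_data.csv',
)
```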
Member:
I would suggest adding a note that CSV data is loaded into memory (as a DocumentArray object) before fine-tuning, so that locally stored images are also loaded into memory.

````



## Preparing a DocumentArray
When providing training data in a DocumentArray, each element is represented as a {class}`~docarray.document.Document`. You should assign a label to each {class}`~docarray.document.Document` inside your {class}`~docarray.array.document.DocumentArray`.
For most models, this is done by adding a `finetuner_label` tag to each document.
Only for cross-modality (text-to-image) fine-tuning with CLIP, this is not necessary as explained at the bottom of this section.
Only for cross-modality (text-to-image) fine-tuning with CLIP, is this not necessary as explained at the bottom of this section.
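
For instance, a small labeled text dataset might be constructed like this (a minimal sketch; the texts and labels are illustrative):

```python
from docarray import Document, DocumentArray

# each Document carries its class under the 'finetuner_label' tag
train_da = DocumentArray([
    Document(text='Hello!', tags={'finetuner_label': 'greeting-english'}),
    Document(text='Hi there.', tags={'finetuner_label': 'greeting-english'}),
    Document(text='Please, forgive me!', tags={'finetuner_label': 'apologize-english'}),
])
```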

In the code blocks below, you can see examples that demonstrate what the training datasets should look like:

````{tab} text-to-text search
```python
@@ -86,18 +146,3 @@ The image and text form a pair.
During the training, CLIP learns to place documents that are part of a pair close to
each other and documents that are not part of a pair far from each other.
As a result, no further labels need to be provided.

Evaluation data should be created in the same way as the training data in the examples above.

```{admonition} CLIP model explained
:class: hint
OpenAI CLIP model wraps two models: a vision transformer and a text transformer.
During fine-tuning, we're optimizing two models in parallel.

At the model saving time, you will discover, we are saving two models to your local directory.
```

If you need to evaluate metrics on separate evaluation data,
it is recommended to create another `DocumentArray` as above only for evaluation purposes.

Carry on, you're almost there!
1 change: 1 addition & 0 deletions docs/walkthrough/integrate-with-jina.md
@@ -2,6 +2,7 @@

Once fine-tuning is finished, it's time to actually use the model.
You can use the fine-tuned models directly to encode [DocumentArray](https://docarray.jina.ai/) objects or to set up an encoding service.
It is worth noting that, while training data can be provided as a CSV file, data for encoding cannot.
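
For instance, encoding still operates on DocumentArray objects only (a minimal sketch; the artifact reference is illustrative, and the `get_model` usage follows the README example):

```python
import finetuner
from docarray import Document, DocumentArray

# data for encoding must be a DocumentArray, not a CSV file
da = DocumentArray([Document(text='how to turn on the lights')])

model = finetuner.get_model('<artifact-id>')  # illustrative artifact reference
finetuner.encode(model=model, data=da)
da.summary()
```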
Member:
In my opinion we don't need to mention this.

Member:
Why not?

Contributor Author:
Data would be structured differently in the CSV, meaning a new method would need to be written; this could be added if it is important.

Member:
I think we should be consistent everywhere we accept datasets. If we want to make CSV the primary format, this should apply to the API as a whole. Generally, our build_dataset function should support unlabeled data as well. @bwanglzu

If we leave this for another PR, we are creating technical debt.

Member:
Yes, we should not leave technical debt; everything should be consistent.


(integrate-with-docarray)=
## Embed DocumentArray
10 changes: 6 additions & 4 deletions docs/walkthrough/run-job.md
@@ -1,7 +1,7 @@
(start-finetuner)=
# Run Job

Now you should have your training data and evaluation data (optional) prepared as {class}`~docarray.array.document.DocumentArray`s,
Now you should have your training data and evaluation data (optional) prepared as CSV files or {class}`~docarray.array.document.DocumentArray`s,
and have selected your backbone model.

Up until now, you have worked locally to prepare a dataset and select your model. From here on out, you will send your processes to the cloud!
@@ -14,7 +14,7 @@ To start fine-tuning, you can call:
import finetuner
from docarray import DocumentArray

train_data = DocumentArray(...)
train_data = 'path/to/some/data.csv'

run = finetuner.fit(
model='efficientnet_b0',
@@ -45,9 +45,10 @@ Finetuner gives you the flexibility to set hyper-parameters explicitly:
```python
import finetuner
from docarray import DocumentArray
from finetuner.data import CSVOptions

train_data = DocumentArray(...)
eval_data = DocumentArray(...)
train_data = 'path/to/some/train_data.csv'
eval_data = 'path/to/some/eval_data.csv'

# Create an experiment
finetuner.create_experiment(name='finetune-flickr-dataset')
@@ -74,6 +75,7 @@ run = finetuner.fit(
device='cuda',
num_workers=4,
to_onnx=False, # If set, please pass `is_onnx` when making inference.
csv_options=CSVOptions(), # Additional options for reading data from a CSV file
)
```

7 changes: 5 additions & 2 deletions docs/walkthrough/using-callbacks.md
@@ -43,7 +43,8 @@ The evaluation callback is used to calculate performance metrics for the model b
```

The evaluation callback is triggered at the end of each epoch, in which the model is evaluated using the `query_data` and `index_data` datasets that were provided when the callback was created.
It is worth noting that the evaluation callback and the `eval_data` parameter of the fit method do not do the same thing. The eval data parameter takes a `DocumentArray` (or the name of one that has been pushed on the Jina AI Cloud) and uses its contents to evaluate the loss of the model whereas the evaluation callback is used to evaluate the quality of the searches using metrics such as average precision and recall. These search metrics can be used by other callbacks if the evaluation callback is first in the list of callbacks when creating a run.
It is worth noting that the evaluation callback and the `eval_data` parameter of the fit method do not do the same thing. The `eval_data` parameter takes a dataset, in the form of a path to a CSV file, a {class}`~docarray.array.document.DocumentArray`, or the name of a {class}`~docarray.array.document.DocumentArray` that has been pushed to the Jina AI Cloud, and uses its contents to evaluate the loss of the model. The evaluation callback, on the other hand, is used to evaluate the quality of searches, using metrics such as average precision and recall. These search metrics can be used by other callbacks if the evaluation callback is first in the list of callbacks when creating a run.
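
For instance, both can be combined in a single run (a minimal sketch; the model name, file paths, and dataset names are illustrative, and the `callbacks` parameter and import path are assumed from this walkthrough):

```python
import finetuner
from finetuner.callback import EvaluationCallback

run = finetuner.fit(
    model='bert-base-cased',
    train_data='path/to/some/train_data.csv',
    eval_data='path/to/some/eval_data.csv',  # used to compute the evaluation loss
    callbacks=[
        # computes search metrics such as average precision and recall
        EvaluationCallback(
            query_data='my-query-data',  # illustrative names of DocumentArrays
            index_data='my-index-data',  # pushed to the Jina AI Cloud
        ),
    ],
)
```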


## BestModelCheckpoint

@@ -138,5 +139,7 @@ Please refer to {ref}`Apply WiSE-FT <wise-ft>` in the CLIP fine-tuning example.

```{warning}
It is recommended to use WiSEFTCallback when fine-tuning CLIP.
We can not ensure it works for other category of models, such as ResNet or Bert.
We cannot ensure that it works for other types of models, such as ResNet or BERT.
```
21 changes: 14 additions & 7 deletions finetuner/__init__.py
@@ -1,7 +1,7 @@
import inspect
import os
import warnings
from typing import TYPE_CHECKING, Any, Dict, List, Optional, Union
from typing import TYPE_CHECKING, Any, Dict, List, Optional, TextIO, Union

from _finetuner.runner.stubs import model as model_stub
from docarray import DocumentArray
@@ -12,6 +12,7 @@
HOST,
HUBBLE_REGISTRY,
)
from finetuner.data import CSVOptions
from finetuner.run import Run
from hubble import login_required

@@ -95,8 +96,8 @@ def describe_models() -> None:
@login_required
def fit(
model: str,
train_data: Union[str, DocumentArray],
eval_data: Optional[Union[str, DocumentArray]] = None,
train_data: Union[str, TextIO, DocumentArray],
eval_data: Optional[Union[str, TextIO, DocumentArray]] = None,
run_name: Optional[str] = None,
description: Optional[str] = None,
experiment_name: Optional[str] = None,
@@ -117,15 +118,16 @@ def fit(
device: str = 'cuda',
num_workers: int = 4,
to_onnx: bool = False,
csv_options: Optional[CSVOptions] = None,
) -> Run:
"""Start a finetuner run!

:param model: The name of model to be fine-tuned. Run `finetuner.list_models()` or
`finetuner.describe_models()` to see the available model names.
:param train_data: Either a `DocumentArray` for training data or a
name of the `DocumentArray` that is pushed on Hubble.
:param eval_data: Either a `DocumentArray` for evaluation data or a
name of the `DocumentArray` that is pushed on Hubble.
:param train_data: Either a `DocumentArray` for training data, a name of the
`DocumentArray` that is pushed on Jina AI Cloud or a path to a CSV file.
:param eval_data: Either a `DocumentArray` for evaluation data, a name of the
`DocumentArray` that is pushed on Jina AI Cloud or a path to a CSV file.
:param run_name: Name of the run.
:param description: Run description.
:param experiment_name: Name of the experiment.
@@ -178,11 +180,15 @@ def fit(
workers used by the dataloader.
:param to_onnx: If the model is an onnx model or not. If you call the `fit` function
with `to_onnx=True`, please set this parameter as `True`.
:param csv_options: A :class:`CSVOptions` object containing options used for
reading in training and evaluation data from a CSV file, if they are
provided as such.

.. note::
Unless necessary, please stick with `device="cuda"`, `cpu` training could be
extremely slow and inefficient.
"""

return ft.create_run(
model=model,
train_data=train_data,
@@ -207,6 +213,7 @@
device=device,
num_workers=num_workers,
to_onnx=to_onnx,
csv_options=csv_options,
)


1 change: 1 addition & 0 deletions finetuner/constants.py
@@ -35,6 +35,7 @@
MODEL = 'model'
MODEL_OPTIONS = 'model_options'
ARTIFACT_ID = 'artifact_id'
DEFAULT_TAG_KEY = 'finetuner_label'
# Run status
CREATED = 'CREATED'
STARTED = 'STARTED'