feat: add support for csv files #592
@@ -1,13 +1,73 @@
(create-training-data)=
# Prepare Training Data

Finetuner accepts training data and evaluation data in the form of CSV files
or {class}`~docarray.array.document.DocumentArray` objects.
Because Finetuner follows a [supervised-learning](https://en.wikipedia.org/wiki/Supervised_learning) scheme, each element requires a label that identifies which other elements it should be similar to.
If you need to evaluate metrics on separate evaluation data, it is recommended to create a dataset only for evaluation purposes. This can be done in the same way as a training dataset is created, as described below.

Data can be prepared in two different formats: either as a CSV file, or as a {class}`~docarray.array.document.DocumentArray`. In the sections below, you can see examples that demonstrate what the training datasets should look like in each format.
## Preparing CSV Files

To record data in a CSV file, the contents of each element are stored plainly, with each row representing either one labeled item, a pair of items that should be semantically similar, or two items of different modalities in the case that a CLIP model is being used. The provided CSV files are then parsed, and a {class}`~docarray.array.document.DocumentArray` is constructed containing the elements within the CSV file. Any references to local images within the CSV file are then loaded into memory.
Currently, the `excel`, `excel-tab` and `unix` CSV dialects are supported. To specify which dialect to use, provide a {class}`~finetuner.data.CSVOptions` object with `dialect=chosen_dialect` as the `csv_options` argument to the {meth}`~finetuner.fit` function. The full list of options for reading CSV files can be found in the description of the {class}`~finetuner.data.CSVOptions` class.
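The three dialect names above match the dialects registered by Python's standard `csv` module, which you can use to preview how a given dialect splits a file into rows and columns. This is only a sketch of the dialect behavior, assuming it mirrors the standard library; the sample string is illustrative:

```python
import csv
import io

# One pair of semantically similar sentences per row, stored in the
# tab-separated 'excel-tab' dialect.
raw = "This is an English sentence\tDas ist ein englischer Satz\n"

rows = list(csv.reader(io.StringIO(raw), dialect='excel-tab'))
print(rows[0])
# ['This is an English sentence', 'Das ist ein englischer Satz']
```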
````{tab} two elements per row
If you want two elements to be semantically close together, they can be placed on the same row as a pair; doing so will assign each pair a distinct label:

```markdown
This is an English sentence Das ist ein englischer Satz
This is another English sentence Dies ist ein weiterer englischer Satz
...
```
This format can be used to construct training data for text-to-text and image-to-image retrieval models:

```markdown
apple.jpg https://example.com/apple-styling.jpg
orange.jpg https://example.com/orange-styling.jpg
```
````

````{tab} Labeled data
In cases where you want multiple elements grouped together, you can provide a label in the second column. This way, all elements in the first column that share the same label will be considered similar during training. To indicate that the second column of your CSV file represents a label instead of a second element, set `is_labeled = True` in the `csv_options` argument of the {meth}`~finetuner.fit` function. Your data can then be structured like so:

```markdown
Hello! greeting-english
Hi there. greeting-english
Good morning. greeting-english
I'm (…) sorry! apologize-english
I'm sorry to have… apologize-english
Please, forgive me! apologize-english
```
````

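As a rough sketch of how such labeled rows group together, the snippet below parses hypothetical labeled data with Python's standard `csv` module; the separator and grouping logic here are illustrative, not Finetuner's actual parser:

```python
import csv
import io
from collections import defaultdict

# Hypothetical labeled rows in the comma-separated 'excel' dialect:
# the first column is the element, the second column its label.
raw = "Hello!,greeting-english\nHi there.,greeting-english\nI'm sorry!,apologize-english\n"

groups = defaultdict(list)
for text, label in csv.reader(io.StringIO(raw)):
    groups[label].append(text)

# Elements sharing a label are treated as similar during training.
print(dict(groups))
# {'greeting-english': ['Hello!', 'Hi there.'], 'apologize-english': ["I'm sorry!"]}
```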
````{tab} text-to-image search using CLIP
To prepare data for text-to-image search, each row must contain one URI pointing to an image and one piece of text. The order in which these two are placed does not matter, so long as the ordering is kept consistent across all rows.

```markdown
This is a photo of an apple. apple.jpg
This is a black-white photo of an orange. orange.jpg
```

```{admonition} CLIP model explained
:class: hint
The OpenAI CLIP model wraps two models: a vision transformer and a text transformer.
During fine-tuning, we are optimizing the two models in parallel.

When the model is saved, you will find that two models are written to your local directory.
```
````

> Review comment: I would suggest adding a note that, before fine-tuning, CSV data is loaded into memory (a DocumentArray object), and thereby locally stored images are also loaded into memory.
## Preparing a DocumentArray
When providing training data in a DocumentArray, each element is represented as a {class}`~docarray.document.Document`. You should assign a label to each {class}`~docarray.document.Document` inside your {class}`~docarray.array.document.DocumentArray`.
For most models, this is done by adding a `finetuner_label` tag to each document.
Only for cross-modality (text-to-image) fine-tuning with CLIP is this not necessary, as explained at the bottom of this section.

In the code blocks below, you can see examples that demonstrate what the training datasets should look like:

@@ -86,18 +146,3 @@
The image and text form a pair.
During training, CLIP learns to place documents that are part of a pair close to
each other, and documents that are not part of a pair far from each other.
As a result, no further labels need to be provided.

Evaluation data should be created in the same way as the training data in the examples above.

```{admonition} CLIP model explained
:class: hint
The OpenAI CLIP model wraps two models: a vision transformer and a text transformer.
During fine-tuning, we are optimizing the two models in parallel.

When the model is saved, you will find that two models are written to your local directory.
```

If you need to evaluate metrics on separate evaluation data,
it is recommended to create another `DocumentArray` as above, only for evaluation purposes.

Carry on, you're almost there!
@@ -2,6 +2,7 @@

Once fine-tuning is finished, it's time to actually use the model.
You can use the fine-tuned models directly to encode [DocumentArray](https://docarray.jina.ai/) objects, or set up an encoding service.
It is worth noting that, while training data can be provided as a CSV file, data for encoding cannot.

> Review discussion:
> - In my opinion we don't need to mention this.
> - Why not?
> - Data would be structured differently in the CSV, meaning a new method would need to be written; this could be added if it is important.
> - I think we should be consistent everywhere we accept datasets. If we want to make CSV the primary format, this should apply to the API in total. Generally, our build_dataset function should support unlabeled data as well. @bwanglzu If we leave this for another PR, we are creating technical debt.
> - Yes, we should not leave technical debt; everything should be consistent.

(integrate-with-docarray)=
## Embed DocumentArray

> Review comment: I think it would be nice to show in the README how to train a model with a CSV file. This was also demanded by Han, so I think we should add something.