feat: add support for csv files #592

Merged
LMMilliken merged 16 commits into main from feat-support-csv on Nov 4, 2022

@LMMilliken (Contributor) commented on Oct 31, 2022

This PR adds support for CSV files to the finetuner.fit function.
Both train_data and eval_data can now be supplied as a path to a CSV file or as an in-memory TextIO stream. The CSV may use any dialect listed by the csv.list_dialects function. Currently, each row of the CSV file can contain:

  • text-text pairs
  • image-image pairs
  • text-label pairs
  • image-label pairs
  • text-image pairs (for CLIP models)

Each row must have the same format. To indicate that the second column contains labels rather than text, provide a dictionary with is_labeled = True as the csv_options argument.
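As a rough illustration of the input shapes described above, the following sketch uses only Python's standard csv module. It is not finetuner's implementation: the helper `read_rows` and its `dialect` and `is_labeled` parameters are hypothetical stand-ins that mirror the behaviour this description attributes to csv_options.

```python
import csv
import io
from typing import List, TextIO, Tuple, Union


def read_rows(
    file: Union[str, TextIO],
    dialect: str = 'excel',
    is_labeled: bool = False,
) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: read two-column rows from a path or a text stream.

    Returns (first, second, kind) tuples, where kind is 'label' when
    is_labeled is True and 'pair' otherwise.
    """
    if dialect not in csv.list_dialects():
        raise ValueError(f'unknown csv dialect: {dialect}')
    kind = 'label' if is_labeled else 'pair'
    # Accept either a path or an already-open in-memory stream.
    stream = open(file, newline='') if isinstance(file, str) else file
    try:
        rows = []
        for row in csv.reader(stream, dialect=dialect):
            # Every row is expected to have the same two-column format.
            if len(row) != 2:
                raise ValueError(f'expected 2 columns, got {len(row)}')
            rows.append((row[0], row[1], kind))
        return rows
    finally:
        # Only close streams that this function opened itself.
        if isinstance(file, str):
            stream.close()


# Text-label rows supplied as an in-memory TextIO stream.
labeled = read_rows(io.StringIO('red apple,fruit\ncarrot,vegetable\n'), is_labeled=True)
# Text-text pairs in the tab-separated 'excel-tab' dialect.
pairs = read_rows(io.StringIO('hello\thi\n'), dialect='excel-tab')
```

The same file content is read either as pairs or as labeled rows depending on the flag, which is why the flag has to be passed explicitly: the two formats cannot be told apart from the CSV alone.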


  • This PR references an open issue
  • I have added a line about this change to CHANGELOG

@github-actions bot added the area/testing label (This issue/PR affects testing) on Nov 1, 2022
@LMMilliken linked an issue that may be closed by this pull request on Nov 1, 2022
@guenthermi (Member) left a comment:

Added some comments

Resolved review threads on:

  • docs/get-started/how-it-works.md
  • docs/get-started/design-principles.md
  • docs/walkthrough/create-training-data.md
  • finetuner/utils.py
  • tests/unit/test_utils.py
  • CHANGELOG.md

Co-authored-by: George Mastrapas <32414777+gmastrapas@users.noreply.github.com>
Co-authored-by: Michael Günther <guenthermi50@gmail.com>
@bwanglzu (Member) left a comment:

let's also add an integration test

Resolved review threads on:

  • docs/walkthrough/create-training-data.md
  • finetuner/experiment.py
  • finetuner/utils.py

@bwanglzu (Member) left a comment:

left some minor comments

Resolved review threads on:

  • finetuner/utils.py

@gmastrapas (Member) left a comment:

Nice! Some minor comments

Resolved review threads on:

  • README.md
  • docs/walkthrough/create-training-data.md
  • docs/walkthrough/run-job.md
  • docs/walkthrough/using-callbacks.md
  • finetuner/utils.py

LMMilliken and others added 2 commits November 3, 2022 14:12
Co-authored-by: George Mastrapas <32414777+gmastrapas@users.noreply.github.com>
@bwanglzu (Member) left a comment:

great job!

@gmastrapas (Member) left a comment:

LGTM!

Resolved review threads on:

  • tests/unit/test_data.py

@guenthermi (Member) left a comment:

LGTM, only added some minor comments


At the model saving time, you will discover, we are saving two models to your local directory.

Inline comment (Member):

I would suggest adding a note that the CSV data is loaded into memory (as a DocumentArray object) before fine-tuning, so locally stored images are loaded into memory as well.

def load_finetune_data_from_csv(
    file: Union[str, TextIO],
    task: str = 'text-to-text',
    options: CSVOptions = CSVOptions(),

Inline comment (Member):

I think it is not recommended to call a constructor in a function declaration, since it is then called only once and every function call will use the exact same instance.
So it is better to do:

def load_finetune_data_from_csv(
[...]
    options=None,
[...]
):
  options = options or CSVOptions()
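
The concern raised in this comment can be reproduced in isolation. The sketch below uses a hypothetical stand-in class, `Options`, instead of the real CSVOptions, whose fields are not shown in this thread; the function names `load_shared` and `load_fresh` are likewise made up for illustration. It only demonstrates that a default value written in the signature is evaluated once, when the function is defined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Options:
    """Hypothetical stand-in for an options object such as CSVOptions."""
    is_labeled: bool = False


def load_shared(options: Options = Options()) -> Options:
    # The default is built once, at definition time, and reused by every call.
    return options


def load_fresh(options: Optional[Options] = None) -> Options:
    # A new instance is created on each call that omits the argument.
    options = options or Options()
    return options


first = load_shared()
first.is_labeled = True  # mutating the shared default...
leaked = load_shared().is_labeled  # ...is visible to a later call

shared_identity = load_shared() is load_shared()  # same object both times
fresh_identity = load_fresh() is load_fresh()  # two distinct objects
independent = load_fresh().is_labeled  # unaffected by earlier mutations
```

Note that `options or Options()` also replaces any falsy object a caller passes in; an explicit `if options is None` check is the stricter variant when that distinction matters.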

@github-actions bot commented on Nov 4, 2022

📝 Docs are deployed on https://ft-feat-support-csv--jina-docs.netlify.app 🎉

@LMMilliken merged commit a3b62a0 into main on Nov 4, 2022
@LMMilliken deleted the feat-support-csv branch on November 4, 2022 08:09
Successfully merging this pull request may close these issues.

Add support for csv files
4 participants