feat: add support for csv files #592
@@ -1,13 +1,73 @@
(create-training-data)=
# Prepare Training Data

Finetuner accepts training data and evaluation data in the form of CSV files
or {class}`~docarray.array.document.DocumentArray` objects.
Because Finetuner follows a [supervised-learning](https://en.wikipedia.org/wiki/Supervised_learning) scheme, each element requires a label that identifies which other elements it should be similar to.
If you need to evaluate metrics on separate evaluation data, it is recommended to create a dataset only for evaluation purposes. This can be done in the same way as a training dataset is created, as described below.

Data can be prepared in two different formats: either as a CSV file, or as a {class}`~docarray.array.document.DocumentArray`. In the sections below, you can see examples that demonstrate what the training datasets should look like in each format.
## Preparing CSV Files

To record data in a CSV file, the contents of each element are stored plainly, with each row representing either one labeled item, a pair of items that should be semantically similar, or two items of different modalities in the case that a CLIP model is being used. The provided CSV files are then parsed, and a {class}`~docarray.array.document.DocumentArray` is constructed containing the elements within the CSV file. Any references to local images within the CSV file are then loaded into memory.
Currently, the `excel`, `excel-tab` and `unix` CSV dialects are supported. To specify which dialect to use, provide a {class}`~finetuner.data.CSVOptions` object with `dialect=chosen_dialect` as the `csv_options` argument to the {meth}`~finetuner.fit` function. The full list of options for reading CSV files can be found in the description of the {class}`~finetuner.data.CSVOptions` class.
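The three dialect names above match the dialects registered by Python's standard `csv` module, which you can use to preview how a given dialect splits a file into rows and columns. This is only a sketch of the dialect behavior, assuming it mirrors the standard library; the sample string is illustrative:

```python
import csv
import io

# One pair of semantically similar sentences per row, stored in the
# tab-separated 'excel-tab' dialect.
raw = "This is an English sentence\tDas ist ein englischer Satz\n"

rows = list(csv.reader(io.StringIO(raw), dialect='excel-tab'))
print(rows[0])
# ['This is an English sentence', 'Das ist ein englischer Satz']
```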
````{tab} two elements per row
If you want two elements to be semantically close together, they can be placed on the same row as a pair; doing so will assign each pair a distinct label:

```markdown
This is an English sentence Das ist ein englischer Satz
This is another English sentence Dies ist ein weiterer englischer Satz
...
```
This format can be used to construct training data for text-to-text and image-to-image retrieval models:

```markdown
apple.jpg https://example.com/apple-styling.jpg
orange.jpg https://example.com/orange-styling.jpg
```
````

````{tab} Labeled data
In cases where you want multiple elements grouped together, you can provide a label in the second column. This way, all elements in the first column that share the same label will be considered similar during training. To indicate that the second column of your CSV file represents a label instead of a second element, set `is_labeled = True` in the `csv_options` argument of the {meth}`~finetuner.fit` function. Your data can then be structured like so:

```markdown
Hello! greeting-english
Hi there. greeting-english
Good morning. greeting-english
I'm (…) sorry! apologize-english
I'm sorry to have… apologize-english
Please, forgive me! apologize-english
```
````

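As a rough sketch of how such labeled rows group together, the snippet below parses hypothetical labeled data with Python's standard `csv` module; the separator and grouping logic here are illustrative, not Finetuner's actual parser:

```python
import csv
import io
from collections import defaultdict

# Hypothetical labeled rows in the comma-separated 'excel' dialect:
# the first column is the element, the second column its label.
raw = "Hello!,greeting-english\nHi there.,greeting-english\nI'm sorry!,apologize-english\n"

groups = defaultdict(list)
for text, label in csv.reader(io.StringIO(raw)):
    groups[label].append(text)

# Elements sharing a label are treated as similar during training.
print(dict(groups))
# {'greeting-english': ['Hello!', 'Hi there.'], 'apologize-english': ["I'm sorry!"]}
```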
````{tab} text-to-image search using CLIP
To prepare data for text-to-image search, each row must contain one URI pointing to an image and one piece of text. The order in which these two are placed does not matter, so long as the ordering is kept consistent across all rows.

```markdown
This is a photo of an apple. apple.jpg
This is a black-white photo of an orange. orange.jpg
```

```{admonition} CLIP model explained
:class: hint
The OpenAI CLIP model wraps two models: a vision transformer and a text transformer.
During fine-tuning, we are optimizing the two models in parallel.

When the model is saved, you will find that two models are written to your local directory.
```
````

> Review comment: I would suggest adding a note that, before fine-tuning, CSV data is loaded into memory (a DocumentArray object), and thereby locally stored images are also loaded into memory.
## Preparing a DocumentArray
When providing training data in a DocumentArray, each element is represented as a {class}`~docarray.document.Document`. You should assign a label to each {class}`~docarray.document.Document` inside your {class}`~docarray.array.document.DocumentArray`.
For most models, this is done by adding a `finetuner_label` tag to each document.
Only for cross-modality (text-to-image) fine-tuning with CLIP is this not necessary, as explained at the bottom of this section.

In the code blocks below, you can see examples that demonstrate what the training datasets should look like:

@@ -86,18 +146,3 @@
The image and text form a pair.
During training, CLIP learns to place documents that are part of a pair close to
each other, and documents that are not part of a pair far from each other.
As a result, no further labels need to be provided.

Evaluation data should be created in the same way as the training data in the examples above.

```{admonition} CLIP model explained
:class: hint
The OpenAI CLIP model wraps two models: a vision transformer and a text transformer.
During fine-tuning, we are optimizing the two models in parallel.

When the model is saved, you will find that two models are written to your local directory.
```

If you need to evaluate metrics on separate evaluation data,
it is recommended to create another `DocumentArray` as above, only for evaluation purposes.

Carry on, you're almost there!
@@ -2,6 +2,7 @@

Once fine-tuning is finished, it's time to actually use the model.
You can use the fine-tuned models directly to encode [DocumentArray](https://docarray.jina.ai/) objects, or set up an encoding service.
It is worth noting that, while training data can be provided as a CSV file, data for encoding cannot.

> Review discussion:
> - In my opinion we don't need to mention this.
> - Why not?
> - Data would be structured differently in the CSV, meaning a new method would need to be written; this could be added if it is important.
> - I think we should be consistent everywhere we accept datasets. If we want to make CSV the primary format, this should apply to the API in total. Generally, our build_dataset function should support unlabeled data as well. @bwanglzu If we leave this for another PR, we are creating technical debt.
> - Yes, we should not leave technical debt; everything should be consistent.

(integrate-with-docarray)=
## Embed DocumentArray

> Review comment: I think it would be nice to show in the README how to train a model with a CSV file. This was also demanded by Han, so I think we should add something.