docs: refine documentation for 0.7 (#643)
bwanglzu committed Jan 16, 2023
1 parent fb9296d commit 23aebe3
Showing 41 changed files with 729 additions and 830 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
# ignore ipynb line counts
*.ipynb linguist-documentation
8 changes: 8 additions & 0 deletions CHANGELOG.md

- Add `finetuner` namespace to artifact names in the documentation. ([#649](https://github.com/jina-ai/finetuner/pull/649))

- Rewrite M-CLIP notebook to use German fashion dataset. ([#643](https://github.com/jina-ai/finetuner/pull/643))

- New advanced topics section. ([#643](https://github.com/jina-ai/finetuner/pull/643))

- Improve developer reference. ([#643](https://github.com/jina-ai/finetuner/pull/643))

- Improve walkthrough sections. ([#643](https://github.com/jina-ai/finetuner/pull/643))


## [0.6.7] - 2022-11-25

163 changes: 31 additions & 132 deletions README.md

<!-- start elevator-pitch -->

Fine-tuning is an effective way to improve performance on [neural search](https://jina.ai/news/what-is-neural-search-and-learn-to-build-a-neural-search-engine/) tasks.
However, setting up and performing fine-tuning can be very time-consuming and resource-intensive.

Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure in the cloud.
With Finetuner, you can easily enhance the performance of pre-trained models,
making them production-ready [without extensive labeling](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/) or expensive hardware.

🎏 **Better embeddings**: Create high-quality embeddings for semantic search, visual similarity search, cross-modal text<->image search, recommendation systems,
clustering, duplication detection, anomaly detection, or other uses.

**Low budget, high expectations**: Bring considerable improvements to model performance, making the most out of as little as a few hundred training samples, and finish fine-tuning in as little as an hour.

📈 **Performance promise**: Enhance the performance of pre-trained models so that they deliver state-of-the-art performance on
domain-specific applications.

🔱 **Simple yet powerful**: Easy access to 40+ mainstream loss functions, 10+ optimisers, layer pruning, weight
freezing, dimensionality reduction, hard-negative mining, cross-modal models, and distributed training.

**All-in-cloud**: Train using our free GPU infrastructure, manage runs, experiments and artifacts on Jina AI Cloud
without worrying about resource availability, complex integration, or infrastructure costs.

<!-- end elevator-pitch -->
<td>0.430</td>
<td>0.648</td>
<td><span style="color:green">50.7%</span></td>
<td rowspan="2"><p align=center><a href="https://colab.research.google.com/drive/10Wldbu0Zugj7NmQyZwZzuorZ6SSAhtIo"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a></p></td>
</tr>
<tr>
<td>Recall</td>
<td>0.247</td>
<td>0.340</td>
<td><span style="color:green">37.7%</span></td>
</tr>
<tr>
<td rowspan="2">PointNet++</td>
<td rowspan="2"><a href="https://modelnet.cs.princeton.edu/">ModelNet40</a> 3D Mesh Search</td>
<td>mRR</td>
<td>0.791</td>
<td>0.891</td>
<td><span style="color:green">12.7%</span></td>
<td rowspan="2"><p align=center><a href="https://colab.research.google.com/drive/1lIMDFkUVsWMshU-akJ_hwzBfJ37zLFzU?usp=sharing"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a></p></td>
</tr>
<tr>
<td>Recall</td>
<td>0.154</td>
<td>0.242</td>
<td><span style="color:green">57.1%</span></td>
</tr>

</tbody>
</table>

<sub><sup>All metrics were evaluated for k@20 after training for 5 epochs using the Adam optimizer with learning rates of 1e-4 for ResNet, 1e-7 for CLIP, 1e-5 for the BERT models, and 5e-4 for PointNet++.</sup></sub>

<!-- start install-instruction -->

`pip install "finetuner[full]"`
> ⚠️ Starting with version 0.5.0, Finetuner computing is performed on Jina AI Cloud. The last local version is `0.4.1`.
> This version is still available for installation via `pip`. See [Finetuner git tags and releases](https://github.com/jina-ai/finetuner/releases).

## Get Started

The following code snippet describes how to fine-tune ResNet50 on the [_Totally Looks Like_ dataset](https://sites.google.com/view/totally-looks-like-dataset).
You can run it as-is. The model and training data are already hosted in Jina AI Cloud and Finetuner will
download them automatically.
(NB: If there is already a run called `resnet50-tll-run`, choose a different run-name in the code below.)

```python
import finetuner
from finetuner.callback import EvaluationCallback

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    run_name='resnet50-tll-run',
    train_data='finetuner/tll-train-data',
    callbacks=[
        EvaluationCallback(
            query_data='finetuner/tll-test-query-data',
            index_data='finetuner/tll-test-index-data',
        )
    ],
)
```
This code snippet describes the following steps:

1. Log in to Jina AI Cloud.
2. Select backbone model, training and evaluation data for your evaluation callback.
3. Start the cloud run.

You can also pass data to Finetuner as a CSV file or a `DocumentArray` object, as described [in the Finetuner documentation](https://finetuner.jina.ai/walkthrough/create-training-data/).
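
For example, here is a minimal sketch of training on an in-memory `DocumentArray`. It assumes that each `Document` carries its class label in a `finetuner_label` tag, as the data-preparation guide describes; the model, run name, and contents below are placeholders to adjust to your own data:

```python
import finetuner
from docarray import Document, DocumentArray

finetuner.login()

# A tiny, hypothetical labeled dataset: Documents that share the same
# 'finetuner_label' value are treated as belonging to the same class.
train_da = DocumentArray(
    [
        Document(content='This is an apple', tags={'finetuner_label': 'apple'}),
        Document(content='This is a red apple', tags={'finetuner_label': 'apple'}),
        Document(content='This is a pear', tags={'finetuner_label': 'pear'}),
    ]
)

run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-docarray-run',  # placeholder run name
    train_data=train_da,
)
```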

Depending on the data, task, model, and hyperparameters, fine-tuning might take some time to finish. You can leave your jobs
running on Jina AI Cloud and reconnect to them later, using code like this:

```python
import finetuner

finetuner.login()

run = finetuner.get_run('resnet50-tll-run')

for log_entry in run.stream_logs():
    print(log_entry)

run.save_artifact('resnet-tll')
```

This code logs into Jina AI Cloud, then connects to your run by name. After that, it does the following:
* Monitors the status of the run and prints out the logs.
* Saves the model once fine-tuning is done.

## Using Finetuner to encode

Finetuner provides an interface for encoding data with your fine-tuned models:

```python
import finetuner
from docarray import Document, DocumentArray

da = DocumentArray([Document(uri='~/Pictures/your_img.png')])

model = finetuner.get_model('resnet-tll')
finetuner.encode(model=model, data=da)

da.summary()
```

When encoding, you can provide data either as a `DocumentArray` or a list. Since the modality of your input data can be inferred from the model being used, there is no need to provide any additional information besides the content you want to encode. When providing data as a list, the `finetuner.encode` method returns an `np.ndarray` of embeddings instead of a `docarray.DocumentArray`:

```python
import finetuner
from docarray import Document, DocumentArray

images = ['~/Pictures/your_img.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)
```

## Training on your own data

If you want to train a model using your own dataset instead of one on the Jina AI Cloud, you can provide labeled data in a CSV file.

A CSV file is a tab or comma-delimited plain text file. For example:

```plaintext
This is an apple apple_label
This is a pear pear_label
...
```
The file should have two columns: the first for the data and the second for the category label.

You can then provide a path to a CSV file as training data for Finetuner:

```python
run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-my-own-run',
    train_data='path/to/some/data.csv',
)
```
More information on providing your own training data is found in the [Prepare Training Data](https://finetuner.jina.ai/walkthrough/create-training-data/) section of the [Finetuner documentation](https://finetuner.jina.ai/).



### Next steps

- Take the [walkthrough](https://finetuner.jina.ai/walkthrough/) and submit your first fine-tuning job.
- Try out different search tasks:
- [Text-to-Text Search via BERT](https://finetuner.jina.ai/notebooks/text_to_text/)
- [Image-to-Image Search via ResNet50](https://finetuner.jina.ai/notebooks/image_to_image/)
- [Text-to-Image Search via CLIP](https://finetuner.jina.ai/notebooks/text_to_image/)

[Read our documentation](https://finetuner.jina.ai/) to learn more about what Finetuner can do.

<!-- start support-pitch -->
## Support

92 changes: 92 additions & 0 deletions docs/advanced-topics/budget.md
(budget)=
# {octicon}`database` How much data?

```{admonition} Read full blog
:class: hint
Please check out [Fine-tuning with Low Budget and High Expectations](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/)
to read the full tech blog.
```

Fine-tuning takes a model that has been pre-trained on a related task and further trains it for a new task.
Alternatively, it can mean taking a model pre-trained for an open-domain task and further training it for a domain-specific one.
Compared to training from scratch, fine-tuning is a much more cost-efficient solution whenever it is feasible. But:

+ Exactly how much **data** do you need to get a good result?
+ Exactly how much **time** do you need to get good results?

## Experiments

We designed two experiments to quantitatively study how labeled data and training time affect fine-tuning performance.
For each experiment, we constructed three search tasks by fine-tuning three models.
We chose seven datasets, two of which are non-domain-specific public datasets, to ensure the generality of our experiment.

We measured the performance of the fine-tuned models by evaluating their ability to perform search tasks, using Mean Reciprocal Rank (mRR), Recall, and Mean Average Precision (mAP).
These metrics are calculated using the top 20 results of each search in the validation subset held out from each dataset.
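
As a refresher, mRR and Recall follow their standard textbook definitions; written out for a query set `Q` and the top-20 cutoff used here (these are generic formulas, not Finetuner-specific ones):

```{math}
\mathrm{mRR} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q},
\qquad
\mathrm{Recall} = \frac{1}{|Q|} \sum_{q \in Q} \frac{\lvert \mathrm{relevant}(q) \cap \mathrm{top}_{20}(q) \rvert}{\lvert \mathrm{relevant}(q) \rvert},
```

where `rank_q` is the position of the first relevant result for query `q` within the top 20 (a query with no relevant result in the top 20 contributes 0).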

### How much labeled data is needed?

We gradually increase the amount of labeled data fed to Finetuner from 100 items to 100,000 and see how this affects performance on the metrics described in the previous section.

In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the relative improvement over the pre-trained model. The higher, the better.

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--3-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--3-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-looks-like.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--4-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--5-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CoCoCaptions--4-.svg)

These results are promising but not particularly surprising.
Performance improves with more labeled data on nearly all tasks and all datasets, more for some tasks and datasets than for others.
However, the only conclusion we can draw from these figures is that the Finetuner works as advertised. So far so good.

We further calculate the return on investment (ROI) by dividing the relative improvement (a proxy for net profit) by the amount of labeled data (a proxy for investment cost).
**This is useful because it indicates the point at which adding more data is producing diminishing returns.**
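
Written out, this is:

```{math}
\mathrm{ROI}_{\mathrm{data}} = \frac{\text{relative improvement over the pre-trained model}}{\text{number of labeled data items}}
```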

In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the ROI per labeled data item. The higher, the better.
In particular, `ROI=0` means adding new labeled data at that point no longer contributes to any improvement.

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--7-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--7-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-looks-like--1-.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--5-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--6-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CoCoCaptions--5-.svg)

Surprisingly, the ROI per unit of new labeled data starts to drop almost immediately. We expected it to decrease eventually, but not nearly this soon.

### How much time is needed?

To measure the value of added training time, we fixed the amount of new labeled data to 1000 items, and then we gradually increased the number of training epochs from 1 to 10.
At each increase, we measured the improvement over the pre-trained model and calculated the ROI.
For these experiments, the ROI is calculated by dividing the relative improvement by the elapsed time in seconds.
This means that when `ROI=0`, adding training time no longer improves performance.
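
That is:

```{math}
\mathrm{ROI}_{\mathrm{time}} = \frac{\text{relative improvement over the pre-trained model}}{\text{elapsed fine-tuning time in seconds}}
```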

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--4-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--4-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-look-like--2-.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--2-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--3-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CocoCaptions--2-.svg)

We knew in advance that adding more time does not guarantee any improvement at all.
It can, in fact, reduce performance due to the overfitting problem.
Some models (e.g. CLIP) are more prone to overfitting than others.
In principle, if we keep training with the same 1000 data points over and over, we are guaranteed to overfit on the data and the overall performance will drop.

Let's look at the ROI curves.

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--5-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--9-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-look-like--3-.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--3-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--4-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CocoCaptions--3-.svg)

The ROI drops immediately after the first epoch of fine-tuning.
Unlike in the previous experiment, where the ROI approached zero but stayed positive as more labeled data was added, here the ROI on added training time can go negative because of overfitting!

## Summary

What does this mean for users looking to maximize gains and minimize costs?

+ Many state-of-the-art deep neural networks are capable of few-shot learning. They are quick learners and can make large improvements with only a few hundred items of labeled data and only a few minutes of training time. You might have thought that deep neural network training requires millions of data items and a week of runtime, but we have shown in these examples how that stereotype does not hold up to reality.
+ Because they can learn so much, so fast, from so little data, ROI drops quickly as you put more time and data into fine-tuning. In the experiments above, ROI shrinks by 70% from its highest value after 500 labeled data items or 600 added seconds of GPU training time. Further investment beyond a few hundred items of training data and very minimal training time may not pay off as well as you would like.