docs: refine documentation for 0.7 #643

Merged Jan 16, 2023 (70 commits)
Changes from 61 commits
49cd801
docs: refine icons
bwanglzu Dec 29, 2022
b5d43fa
docs: add icons
bwanglzu Dec 29, 2022
c77aebe
docs: add advanced section
bwanglzu Dec 29, 2022
9a0f694
docs: restructure advanced topics
bwanglzu Dec 29, 2022
20c3169
docs: compact readme
bwanglzu Dec 30, 2022
7faa254
docs: improve value proposition
bwanglzu Dec 30, 2022
2f201eb
docs: improve value proposition
bwanglzu Dec 30, 2022
d36df78
docs: add pointnet to readme
bwanglzu Dec 30, 2022
053fed9
docs: rewrite how it works
bwanglzu Dec 30, 2022
03abcb2
docs: refine inference section
bwanglzu Dec 30, 2022
592ec5b
docs: create self-hosting page
bwanglzu Dec 30, 2022
7244027
docs: improve index inference delete principle
bwanglzu Jan 2, 2023
f03b745
docs: add clip inference
bwanglzu Jan 2, 2023
28d0124
docs: finish mining
bwanglzu Jan 2, 2023
01dd176
docs: improve callbacks
bwanglzu Jan 2, 2023
88a79f2
docs: add linear probe section
bwanglzu Jan 2, 2023
665097a
docs: remove unused code example
bwanglzu Jan 2, 2023
8b3e779
docs: improve api doc with autosummary
bwanglzu Jan 2, 2023
f373c0a
docs: fix grammar
bwanglzu Jan 2, 2023
f4663b8
docs: fix grammar
bwanglzu Jan 2, 2023
7bf0de9
docs: fix grammar
bwanglzu Jan 2, 2023
1cc833c
docs: fix grammar
bwanglzu Jan 2, 2023
343f002
docs: improve wording
bwanglzu Jan 2, 2023
20b469b
docs: fix grammar
bwanglzu Jan 2, 2023
f08de7c
docs: fix grammar
bwanglzu Jan 2, 2023
3b8dda4
docs: improve wording
bwanglzu Jan 2, 2023
02923ff
docs: fix grammar
bwanglzu Jan 2, 2023
5619c92
docs: improve wording
bwanglzu Jan 2, 2023
76ef462
chore: ignore line counts from ipynb
bwanglzu Jan 2, 2023
5be3549
docs: improve docstring
bwanglzu Jan 3, 2023
e65b9fd
docs: improve wording
bwanglzu Jan 3, 2023
3de8086
docs: create budget page
bwanglzu Jan 4, 2023
07f1071
docs: mimimize header
bwanglzu Jan 4, 2023
177cbbc
docs: rewrite mclip example
bwanglzu Jan 4, 2023
24f42d7
docs: improve wording
bwanglzu Jan 4, 2023
9318f70
docs: improve wording
bwanglzu Jan 4, 2023
549b794
docs: improve wording
bwanglzu Jan 4, 2023
b6f08fb
docs: improve wording
bwanglzu Jan 4, 2023
c8aa0e4
docs: fix grammar
bwanglzu Jan 4, 2023
3f3e163
docs: improve wording
bwanglzu Jan 4, 2023
93b1a58
docs: improve wording
bwanglzu Jan 4, 2023
f59ba5d
docs: improve wording
bwanglzu Jan 4, 2023
c5b698c
docs: improve wording
bwanglzu Jan 4, 2023
0163528
docs: improve wording
bwanglzu Jan 4, 2023
dc5f584
docs: improve wording
bwanglzu Jan 5, 2023
4b5cd3b
docs: improve wording
bwanglzu Jan 5, 2023
906563f
docs: improve wording
bwanglzu Jan 5, 2023
29a35fe
docs: improve wording
bwanglzu Jan 5, 2023
1136dee
docs: improve wording
bwanglzu Jan 5, 2023
35f10dc
docs: improve wording
bwanglzu Jan 5, 2023
9713314
docs: improve wording
bwanglzu Jan 5, 2023
f249a85
docs: improve wording
bwanglzu Jan 5, 2023
af39881
docs: improve wording
bwanglzu Jan 5, 2023
a2965d8
docs: improve wording
bwanglzu Jan 5, 2023
3a6bab8
docs: improve wording
bwanglzu Jan 5, 2023
c72f6b4
docs: improve wording
bwanglzu Jan 5, 2023
661bd34
docs: rename linear probe to projection head
bwanglzu Jan 5, 2023
5d4361c
refactor: solve merge conflict
guenthermi Jan 12, 2023
5e57b91
refactor: implement review comments
guenthermi Jan 12, 2023
68cddfb
chore: remove self-hosting section
guenthermi Jan 12, 2023
7a5aa20
docs: add hint and batch construction
guenthermi Jan 13, 2023
9466c52
refactor: Scotts review comments
guenthermi Jan 13, 2023
72a2349
refactor: further adjustments
guenthermi Jan 13, 2023
be373d2
refactor: Update docs/walkthrough/inference.md
guenthermi Jan 13, 2023
3dbb911
refactor: Update docs/walkthrough/inference.md
guenthermi Jan 13, 2023
bf57e8b
refactor: update notebooks
guenthermi Jan 16, 2023
0e2bc02
docs: update links
guenthermi Jan 16, 2023
d17f583
chore: update changelog
guenthermi Jan 16, 2023
86b4a5f
refactor: apply suggestions from code review
guenthermi Jan 16, 2023
9f69ab6
docs: add -
guenthermi Jan 16, 2023
2 changes: 2 additions & 0 deletions .gitattributes
@@ -0,0 +1,2 @@
# ignore ipynb line counts
*.ipynb linguist-documentation
161 changes: 30 additions & 131 deletions README.md
@@ -18,20 +18,25 @@

<!-- start elevator-pitch -->

Fine-tuning is an effective way to improve performance on neural search tasks. However, setting up and performing
fine-tuning can be very time-consuming and resource-intensive.
Fine-tuning is an effective way to improve performance on [neural search](https://jina.ai/news/what-is-neural-search-and-learn-to-build-a-neural-search-engine/) tasks.
However, setting up and performing fine-tuning can be very time-consuming and resource-intensive.

Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all complexity and
infrastructure in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models, making them
production-ready without buying expensive hardware.
Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all the complexity and infrastructure in the cloud.
With Finetuner, you can easily enhance the performance of pre-trained models,
making them production-ready [without extensive labeling](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/) or expensive hardware.

📈 **Performance promise**: enhance the performance of pre-trained models and deliver state-of-the-art performance on
domain-specific neural search applications.
🎏 **Better embeddings**: Create high-quality embeddings for semantic search, visual similarity search, cross-modal text<->image search, recommendation systems,
clustering, duplication detection, anomaly detection, or other uses.

🔱 **Simple yet powerful**: easy access to 40+ mainstream loss functions, 10+ optimisers, layer pruning, weight
⏰ **Low budget, high expectations**: Bring considerable improvements to model performance with as little as a few hundred training samples, and finish fine-tuning in as little as an hour.

📈 **Performance promise**: Enhance the performance of pre-trained models so that they deliver state-of-the-art performance on
domain-specific applications.

🔱 **Simple yet powerful**: Easy access to 40+ mainstream loss functions, 10+ optimisers, layer pruning, weight
freezing, dimensionality reduction, hard-negative mining, cross-modal models, and distributed training.

☁ **All-in-cloud**: train using our free GPU infrastructure, manage runs, experiments and artifacts on Jina AI Cloud
☁ **All-in-cloud**: Train using our free GPU infrastructure, manage runs, experiments and artifacts on Jina AI Cloud
without worrying about resource availability, complex integration, or infrastructure costs.

<!-- end elevator-pitch -->
@@ -113,11 +118,26 @@ without worrying about resource availability, complex integration, or infrastructure costs.
<td>0.340</td>
<td><span style="color:green">37.7%</span></td>
</tr>
<tr>
<td rowspan="2">PointNet++</td>
<td rowspan="2"><a href="https://modelnet.cs.princeton.edu/">ModelNet40</a> 3D Mesh Search</td>
<td>mRR</td>
<td>0.791</td>
<td>0.891</td>
<td><span style="color:green">12.7%</span></td>
<td rowspan="2"><p align=center><a href="https://colab.research.google.com/drive/1lIMDFkUVsWMshU-akJ_hwzBfJ37zLFzU?usp=sharing"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a></p></td>
</tr>
<tr>
<td>Recall</td>
<td>0.154</td>
<td>0.242</td>
<td><span style="color:green">57.1%</span></td>
</tr>

</tbody>
</table>

<sub><sup>All metrics were evaluated for k@20 after training for 5 epochs using the Adam optimizer with learning rates of 1e-4 for ResNet, 1e-7 for CLIP and 1e-5 for the BERT models.</sup></sub>
<sub><sup>All metrics were evaluated for k@20 after training for 5 epochs using the Adam optimizer with learning rates of 1e-4 for ResNet, 1e-7 for CLIP, 1e-5 for the BERT models, and 5e-4 for PointNet++.</sup></sub>

<!-- start install-instruction -->

@@ -142,127 +162,6 @@
> ⚠️ Starting with version 0.5.0, Finetuner computing is performed on Jina AI Cloud. The last local version is `0.4.1`.
> This version is still available for installation via `pip`. See [Finetuner git tags and releases](https://github.com/jina-ai/finetuner/releases).





## Get Started

The following code snippet describes how to fine-tune ResNet50 on the [_Totally Looks Like_ dataset](https://sites.google.com/view/totally-looks-like-dataset).
You can run it as-is. The model and training data are already hosted in Jina AI Cloud and Finetuner will
download them automatically.
(NB: If there is already a run called `resnet50-tll-run`, choose a different run-name in the code below.)

```python
import finetuner
from finetuner.callback import EvaluationCallback

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    run_name='resnet50-tll-run',
    train_data='finetuner/tll-train-data',
    callbacks=[
        EvaluationCallback(
            query_data='finetuner/tll-test-query-data',
            index_data='finetuner/tll-test-index-data',
        )
    ],
)
```
This code snippet performs the following steps:

1. Log in to Jina AI Cloud.
2. Select backbone model, training and evaluation data for your evaluation callback.
3. Start the cloud run.

You can also pass data to Finetuner as a CSV file or a `DocumentArray` object, as described [in the Finetuner documentation](https://finetuner.jina.ai/walkthrough/create-training-data/).
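
For illustration, here is a minimal sketch of building a `DocumentArray` in memory and passing it to `finetuner.fit`. It assumes the class label is stored under a `finetuner_label` tag; see the linked documentation for the exact conventions.

```python
import finetuner
from docarray import Document, DocumentArray

# A tiny labeled dataset built in memory; each Document carries its text
# and a class label in the `finetuner_label` tag (assumed convention here,
# see the documentation linked above).
train_da = DocumentArray(
    [
        Document(text='This is an apple', tags={'finetuner_label': 'apple'}),
        Document(text='This is a pear', tags={'finetuner_label': 'pear'}),
    ]
)

finetuner.login()

run = finetuner.fit(
    model='bert-base-cased',
    train_data=train_da,  # a DocumentArray instead of a dataset name or CSV path
)
```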

Depending on the data, task, model, and hyperparameters, fine-tuning might take some time to finish. You can leave your jobs
running on Jina AI Cloud and reconnect to them later using code like this:

```python
import finetuner

finetuner.login()

run = finetuner.get_run('resnet50-tll-run')

for log_entry in run.stream_logs():
    print(log_entry)

run.save_artifact('resnet-tll')
```

This code logs into Jina AI Cloud, then connects to your run by name. After that, it does the following:
* Monitors the status of the run and prints out the logs.
* Saves the model once fine-tuning is done.

## Using Finetuner to encode

Finetuner has interfaces for using models to do encoding:

```python
import finetuner
from docarray import Document, DocumentArray

da = DocumentArray([Document(uri='~/Pictures/your_img.png')])

model = finetuner.get_model('resnet-tll')
finetuner.encode(model=model, data=da)

da.summary()
```

When encoding, you can provide data either as a DocumentArray or a list. Since the modality of your input data can be inferred from the model being used, there is no need to provide any additional information besides the content you want to encode. When providing data as a list, the `finetuner.encode` method will return a `np.ndarray` of embeddings, instead of a `docarray.DocumentArray`:

```python
import finetuner
from docarray import Document, DocumentArray

images = ['~/Pictures/your_img.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)
```

## Training on your own data

If you want to train a model using your own dataset instead of one on the Jina AI Cloud, you can provide labeled data in a CSV file.

A CSV file is a tab or comma-delimited plain text file. For example:

```plaintext
This is an apple apple_label
This is a pear pear_label
...
```
The file should have two columns: the first for the data and the second for the category label.

You can then provide a path to a CSV file as training data for Finetuner:

```python
run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-my-own-run',
    train_data='path/to/some/data.csv',
)
```
More information on providing your own training data is found in the [Prepare Training Data](https://finetuner.jina.ai/walkthrough/create-training-data/) section of the [Finetuner documentation](https://finetuner.jina.ai/).



### Next steps

- Take the [walkthrough](https://finetuner.jina.ai/walkthrough/) and submit your first fine-tuning job.
- Try out different search tasks:
- [Text-to-Text Search via BERT](https://finetuner.jina.ai/notebooks/text_to_text/)
- [Image-to-Image Search via ResNet50](https://finetuner.jina.ai/notebooks/image_to_image/)
- [Text-to-Image Search via CLIP](https://finetuner.jina.ai/notebooks/text_to_image/)

[Read our documentation](https://finetuner.jina.ai/) to learn more about what Finetuner can do.

<!-- start support-pitch -->
## Support

92 changes: 92 additions & 0 deletions docs/advanced-topics/budget.md
@@ -0,0 +1,92 @@
(budget)=
# {octicon}`database` How much data?

```{admonition} Read full blog
:class: hint
Please check out [Fine-tuning with Low Budget and High Expectations](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/)
to read the full tech blog.
```

Fine-tuning takes a pre-trained model,
trained on a related task, and then further trains it for the new task.
Alternatively, it can mean taking a model pre-trained for an open-domain task and further training it for a domain-specific one.
Compared to training from scratch, fine-tuning is a much more cost-efficient solution whenever it is feasible. But:

+ Exactly how much data do you need to get a good result?
+ Exactly how much time do you need to get good results?

## Experiments

We designed two experiments to quantitatively study how labeled data and training time affect fine-tuning performance.
For each experiment, we constructed three search tasks by fine-tuning three models.
We chose seven datasets, two of which are non-domain-specific public datasets, to ensure the generality of our experiment.

We measured the performance of the fine-tuned models by evaluating their ability to perform search tasks, using Mean Reciprocal Rank (mRR), Recall, and Mean Average Precision (mAP).
These metrics are calculated using the top 20 results of each search in the validation subset held out from each dataset.
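
For readers unfamiliar with these metrics, here is a minimal, self-contained sketch of how mRR@20 and Recall@20 can be computed per query and then averaged. It uses the standard textbook definitions and toy data; it is not Finetuner's actual evaluation code.

```python
from typing import List, Set


def reciprocal_rank(retrieved: List[str], relevant: Set[str], k: int = 20) -> float:
    """1 / rank of the first relevant result within the top k, or 0 if none is found."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 20) -> float:
    """Fraction of the relevant items that appear within the top k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0


# Averaging over every query in the held-out validation subset yields
# mRR@20 and Recall@20 for one model (toy ids below).
queries = {
    'q1': (['d3', 'd7', 'd1'], {'d1'}),        # (retrieved ids, relevant ids)
    'q2': (['d2', 'd9', 'd4'], {'d9', 'd4'}),
}
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries.values()) / len(queries)
recall = sum(recall_at_k(r, rel) for r, rel in queries.values()) / len(queries)
print(mrr, recall)
```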

### How much labeled data is needed?

We gradually increase the amount of labeled data fed to Finetuner from 100 items to 100,000 and see how this affects performance on the metrics described in the previous section.

In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the relative improvement over the pre-trained model. The higher, the better.

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--3-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--3-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-looks-like.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--4-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--5-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CoCoCaptions--4-.svg)

These results are promising but not particularly surprising.
Performance improves with more labeled data on nearly all tasks and all datasets, more for some tasks and datasets than for others.
However, the only conclusion we can draw from these figures is that Finetuner works as advertised. So far, so good.

We further calculate the return on investment (ROI) by dividing the relative improvement (a proxy for net profit) by the amount of labeled data (a proxy for investment cost).
**This is useful because it indicates the point at which adding more data is producing diminishing returns.**
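
Spelled out in code, the calculation is just a division; the numbers below are invented and serve only to make the definition concrete.

```python
def roi(relative_improvement: float, cost: float) -> float:
    """ROI = relative improvement (net-profit proxy) / cost (investment proxy).

    In this experiment the cost is the number of labeled data items; the
    timing experiment below uses elapsed seconds as the cost instead.
    """
    return relative_improvement / cost


# e.g. a hypothetical 20% relative improvement after labeling 1,000 items
print(roi(0.20, 1_000))  # -> 0.0002 improvement per labeled item
```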

In the figures below, the X-axis represents the amount of labeled data, and the Y-axis represents the ROI per labeled data item. The higher, the better.
In particular, `ROI=0` means adding new labeled data at that point no longer contributes to any improvement.

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--7-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--7-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-looks-like--1-.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--5-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--6-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CoCoCaptions--5-.svg)

Surprisingly, the ROI per unit of new labeled data starts to drop almost immediately. We expected it to decrease eventually, but not this quickly.

### How much time is needed?

To measure the value of added training time, we fixed the amount of new labeled data to 1000 items, and then we gradually increased the number of training epochs from 1 to 10.
At each increase, we measured the improvement over the pre-trained model and calculated the ROI.
For these experiments, the ROI is calculated by dividing the relative improvement by the elapsed time in seconds.
This means that when `ROI=0`, adding training time no longer improves performance.
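
As a quick, invented illustration of the time-based variant: dividing the measured relative improvement by the elapsed seconds gives a per-second ROI, and once overfitting pushes the improvement below the pre-trained baseline, the ROI itself turns negative.

```python
# (relative improvement over the pre-trained model, elapsed fine-tuning seconds)
# after each epoch -- invented numbers; the last epoch overfits.
checkpoints = [(0.15, 60), (0.18, 120), (0.17, 180), (-0.02, 240)]

for improvement, seconds in checkpoints:
    # Time-based ROI: zero means added time no longer helps,
    # negative means it actively hurts.
    print(improvement / seconds)
```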

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--4-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--4-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-look-like--2-.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--2-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--3-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CocoCaptions--2-.svg)

We knew in advance that adding more time does not guarantee any improvement at all.
It can, in fact, reduce performance due to the overfitting problem.
Some models (e.g. CLIP) are more prone to overfitting than others.
In principle, if we keep training with the same 1000 data points over and over, we are guaranteed to overfit on the data and the overall performance will drop.

Let's look at the ROI curves.

... | ...
:-------------------------:|:-------------------------:
![text-text-quora](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-QuoraQA--5-.svg) | ![text-text-clinc](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-text-search-on-Clinc150--9-.svg)
![image-image-tll](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Totally-look-like--3-.svg) | ![image-image-celeba](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Image-to-image-search-on-Celeba--3-.svg)
![image-image-flickr30k](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-Flickr30K--4-.svg) | ![image-image-coco](https://jina-ai-gmbh.ghost.io/content/images/2022/12/Text-to-image-search-on-CocoCaptions--3-.svg)

The ROI drops immediately after the first epoch of fine-tuning.
Unlike in the last experiment, where ROI approached zero but stayed positive as we added more labeled data, here the ROI on added training time can go negative due to overfitting!

## Summary

What does this mean for users looking to maximize gains and minimize costs?

+ Many state-of-the-art deep neural networks are capable of few-shot learning. They are quick learners and can make large improvements with only a few hundred items of labeled data and only a few minutes of training time. You might have thought that deep neural network training requires millions of data items and a week of runtime, but we have shown in these examples that this stereotype does not hold up in practice.
+ Because they can learn so much, so fast, from so little data, ROI drops quickly as you put more time and data into fine-tuning. In the experiments above, ROI shrinks by 70% from its highest value after 500 labeled data items or 600 added seconds of GPU training time. Further investment beyond a few hundred items of training data and very minimal training time may not pay off as well as you would like.