
docs: refine documentation for 0.7 #643

Merged 70 commits on Jan 16, 2023
Changes from 17 commits
49cd801
docs: refine icons
bwanglzu Dec 29, 2022
b5d43fa
docs: add icons
bwanglzu Dec 29, 2022
c77aebe
docs: add advanced section
bwanglzu Dec 29, 2022
9a0f694
docs: restructure advanced topics
bwanglzu Dec 29, 2022
20c3169
docs: compact readme
bwanglzu Dec 30, 2022
7faa254
docs: improve value proposition
bwanglzu Dec 30, 2022
2f201eb
docs: improve value proposition
bwanglzu Dec 30, 2022
d36df78
docs: add pointnet to readme
bwanglzu Dec 30, 2022
053fed9
docs: rewrite how it works
bwanglzu Dec 30, 2022
03abcb2
docs: refine inference section
bwanglzu Dec 30, 2022
592ec5b
docs: create self-hosting page
bwanglzu Dec 30, 2022
7244027
docs: improve index inference delete principle
bwanglzu Jan 2, 2023
f03b745
docs: add clip inference
bwanglzu Jan 2, 2023
28d0124
docs: finish mining
bwanglzu Jan 2, 2023
01dd176
docs: improve callbacks
bwanglzu Jan 2, 2023
88a79f2
docs: add linear probe section
bwanglzu Jan 2, 2023
665097a
docs: remove unused code example
bwanglzu Jan 2, 2023
8b3e779
docs: improve api doc with autosummary
bwanglzu Jan 2, 2023
f373c0a
docs: fix grammar
bwanglzu Jan 2, 2023
f4663b8
docs: fix grammar
bwanglzu Jan 2, 2023
7bf0de9
docs: fix grammar
bwanglzu Jan 2, 2023
1cc833c
docs: fix grammar
bwanglzu Jan 2, 2023
343f002
docs: improve wording
bwanglzu Jan 2, 2023
20b469b
docs: fix grammar
bwanglzu Jan 2, 2023
f08de7c
docs: fix grammar
bwanglzu Jan 2, 2023
3b8dda4
docs: improve wording
bwanglzu Jan 2, 2023
02923ff
docs: fix grammar
bwanglzu Jan 2, 2023
5619c92
docs: improve wording
bwanglzu Jan 2, 2023
76ef462
chore: ignore line counts from ipynb
bwanglzu Jan 2, 2023
5be3549
docs: improve docstring
bwanglzu Jan 3, 2023
e65b9fd
docs: improve wording
bwanglzu Jan 3, 2023
3de8086
docs: create budget page
bwanglzu Jan 4, 2023
07f1071
docs: mimimize header
bwanglzu Jan 4, 2023
177cbbc
docs: rewrite mclip example
bwanglzu Jan 4, 2023
24f42d7
docs: improve wording
bwanglzu Jan 4, 2023
9318f70
docs: improve wording
bwanglzu Jan 4, 2023
549b794
docs: improve wording
bwanglzu Jan 4, 2023
b6f08fb
docs: improve wording
bwanglzu Jan 4, 2023
c8aa0e4
docs: fix grammar
bwanglzu Jan 4, 2023
3f3e163
docs: improve wording
bwanglzu Jan 4, 2023
93b1a58
docs: improve wording
bwanglzu Jan 4, 2023
f59ba5d
docs: improve wording
bwanglzu Jan 4, 2023
c5b698c
docs: improve wording
bwanglzu Jan 4, 2023
0163528
docs: improve wording
bwanglzu Jan 4, 2023
dc5f584
docs: improve wording
bwanglzu Jan 5, 2023
4b5cd3b
docs: improve wording
bwanglzu Jan 5, 2023
906563f
docs: improve wording
bwanglzu Jan 5, 2023
29a35fe
docs: improve wording
bwanglzu Jan 5, 2023
1136dee
docs: improve wording
bwanglzu Jan 5, 2023
35f10dc
docs: improve wording
bwanglzu Jan 5, 2023
9713314
docs: improve wording
bwanglzu Jan 5, 2023
f249a85
docs: improve wording
bwanglzu Jan 5, 2023
af39881
docs: improve wording
bwanglzu Jan 5, 2023
a2965d8
docs: improve wording
bwanglzu Jan 5, 2023
3a6bab8
docs: improve wording
bwanglzu Jan 5, 2023
c72f6b4
docs: improve wording
bwanglzu Jan 5, 2023
661bd34
docs: rename linear probe to projection head
bwanglzu Jan 5, 2023
5d4361c
refactor: solve merge conflict
guenthermi Jan 12, 2023
5e57b91
refactor: implement review comments
guenthermi Jan 12, 2023
68cddfb
chore: remove self-hosting section
guenthermi Jan 12, 2023
7a5aa20
docs: add hint and batch construction
guenthermi Jan 13, 2023
9466c52
refactor: Scotts review comments
guenthermi Jan 13, 2023
72a2349
refactor: further adjustments
guenthermi Jan 13, 2023
be373d2
refactor: Update docs/walkthrough/inference.md
guenthermi Jan 13, 2023
3dbb911
refactor: Update docs/walkthrough/inference.md
guenthermi Jan 13, 2023
bf57e8b
refactor: update notebooks
guenthermi Jan 16, 2023
0e2bc02
docs: update links
guenthermi Jan 16, 2023
d17f583
chore: update changelog
guenthermi Jan 16, 2023
86b4a5f
refactor: apply suggestions from code review
guenthermi Jan 16, 2023
9f69ab6
docs: add -
guenthermi Jan 16, 2023
153 changes: 26 additions & 127 deletions README.md
@@ -18,12 +18,17 @@

<!-- start elevator-pitch -->

Fine-tuning is an effective way to improve performance on neural search tasks. However, setting up and performing
fine-tuning can be very time-consuming and resource-intensive.
Fine-tuning is an effective way to improve performance on [neural search](https://jina.ai/news/what-is-neural-search-and-learn-to-build-a-neural-search-engine/) tasks.
However, setting up and performing fine-tuning can be very time-consuming and resource-intensive.

Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all complexity and
infrastructure in the cloud. With Finetuner, one can easily enhance the performance of pre-trained models, making them
production-ready without buying expensive hardware.
Jina AI's Finetuner makes fine-tuning easier and faster by streamlining the workflow and handling all complexity and infrastructure in the cloud.
With Finetuner, one can easily enhance the performance of pre-trained models,
making them production-ready [without extensive labeling](https://jina.ai/news/fine-tuning-with-low-budget-and-high-expectations/) and maintaining hardware.

🎏 **Better embeddings**: create high-quality embeddings for semantic search, visual similarity search, cross-modal text image search, recommendation,
clustering, duplication detection, anomaly detection etc.

⏰ **Low budget, high expectation**: effectively use a few hundred training samples and finish tuning within an hour while bringing considerable improvements.

📈 **Performance promise**: enhance the performance of pre-trained models and deliver state-of-the-art performance on
domain-specific neural search applications.
@@ -113,11 +118,26 @@ without worrying about resource availability, complex integration, or infrastructure
<td>0.340</td>
<td><span style="color:green">37.7%</span></td>
</tr>
<tr>
<td rowspan="2">PointNet++</td>
<td rowspan="2"><a href="https://modelnet.cs.princeton.edu/">ModelNet40</a> 3D Mesh Search</td>
<td>mRR</td>
<td>0.791</td>
<td>0.891</td>
<td><span style="color:green">12.7%</span></td>
<td rowspan="2"><p align=center><a href="https://colab.research.google.com/drive/1lIMDFkUVsWMshU-akJ_hwzBfJ37zLFzU?usp=sharing"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a></p></td>
</tr>
<tr>
<td>Recall</td>
<td>0.154</td>
<td>0.242</td>
<td><span style="color:green">57.1%</span></td>
</tr>

</tbody>
</table>

<sub><sup>All metrics were evaluated for k@20 after training for 5 epochs using the Adam optimizer with learning rates of 1e-4 for ResNet, 1e-7 for CLIP and 1e-5 for the BERT models.</sup></sub>
<sub><sup>All metrics were evaluated for k@20 after training for 5 epochs using the Adam optimizer with learning rates of 1e-4 for ResNet, 1e-7 for CLIP, 1e-5 for the BERT models, and 5e-4 for PointNet++.</sup></sub>
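The percentage gains in the table are simple relative improvements over the pre-trained baseline; for instance, the PointNet++ Recall gain can be reproduced with a one-liner (the table's mRR figures may be rounded from unrounded metrics, so this is shown for the Recall row only):

```python
# Relative improvement of a fine-tuned metric over the pre-trained baseline.
def relative_gain(before: float, after: float) -> float:
    return (after - before) / before * 100

# PointNet++ Recall@20 on ModelNet40, values from the table above.
print(f"{relative_gain(0.154, 0.242):.1f}%")  # 57.1%
```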

<!-- start install-instruction -->

@@ -142,127 +162,6 @@
> ⚠️ Starting with version 0.5.0, Finetuner computing is performed on Jina AI Cloud. The last local version is `0.4.1`.
> This version is still available for installation via `pip`. See [Finetuner git tags and releases](https://github.com/jina-ai/finetuner/releases).





## Get Started

The following code snippet describes how to fine-tune ResNet50 on the [_Totally Looks Like_ dataset](https://sites.google.com/view/totally-looks-like-dataset).
You can run it as-is. The model and training data are already hosted in Jina AI Cloud and Finetuner will
download them automatically.
(NB: If there is already a run called `resnet50-tll-run`, choose a different run-name in the code below.)

```python
import finetuner
from finetuner.callback import EvaluationCallback

finetuner.login()

run = finetuner.fit(
    model='resnet50',
    run_name='resnet50-tll-run',
    train_data='tll-train-data',
    callbacks=[
        EvaluationCallback(
            query_data='tll-test-query-data',
            index_data='tll-test-index-data',
        )
    ],
)
```
This code snippet describes the following steps:

1. Log in to Jina AI Cloud.
2. Select the backbone model and the training and evaluation data for your evaluation callback.
3. Start the cloud run.

You can also pass data to Finetuner as a CSV file or a `DocumentArray` object, as described [in the Finetuner documentation](https://finetuner.jina.ai/walkthrough/create-training-data/).

Depending on the data, task, model, and hyperparameters, fine-tuning might take some time to finish. You can leave your jobs
running on the Jina AI Cloud and reconnect to them later using code like this:

```python
import finetuner

finetuner.login()

run = finetuner.get_run('resnet50-tll-run')

for log_entry in run.stream_logs():
    print(log_entry)

run.save_artifact('resnet-tll')
```

This code logs into Jina AI Cloud, then connects to your run by name. After that, it does the following:
* Monitors the status of the run and prints out the logs.
* Saves the model once fine-tuning is done.

## Using Finetuner to encode

Finetuner has interfaces for using models to do encoding:

```python
import finetuner
from docarray import Document, DocumentArray

da = DocumentArray([Document(uri='~/Pictures/your_img.png')])

model = finetuner.get_model('resnet-tll')
finetuner.encode(model=model, data=da)

da.summary()
```

When encoding, you can provide data either as a DocumentArray or a list. Since the modality of your input data can be inferred from the model being used, there is no need to provide any additional information besides the content you want to encode. When providing data as a list, the `finetuner.encode` method will return a `np.ndarray` of embeddings, instead of a `docarray.DocumentArray`:

```python
import finetuner
from docarray import Document, DocumentArray

images = ['~/Pictures/your_img.png']

model = finetuner.get_model('resnet-tll')
embeddings = finetuner.encode(model=model, data=images)
```

## Training on your own data

If you want to train a model using your own dataset instead of one on the Jina AI Cloud, you can provide labeled data in a CSV file.

A CSV file is a tab or comma-delimited plain text file. For example:

```plaintext
This is an apple apple_label
This is a pear pear_label
...
```
The file should have two columns: the first for the data and the second for the category label.
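As a minimal sketch, such a tab-delimited file can be produced with Python's `csv` module (the file name and labels here are illustrative, not required by Finetuner):

```python
import csv

# Write a tab-delimited training file: first column data, second column label.
rows = [
    ("This is an apple", "apple_label"),
    ("This is a pear", "pear_label"),
]
with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)
```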

You can then provide a path to a CSV file as training data for Finetuner:

```python
run = finetuner.fit(
    model='bert-base-cased',
    run_name='bert-my-own-run',
    train_data='path/to/some/data.csv',
)
```
More information on providing your own training data can be found in the [Prepare Training Data](https://finetuner.jina.ai/walkthrough/create-training-data/) section of the [Finetuner documentation](https://finetuner.jina.ai/).



### Next steps

- Take the [walkthrough](https://finetuner.jina.ai/walkthrough/) and submit your first fine-tuning job.
- Try out different search tasks:
- [Text-to-Text Search via BERT](https://finetuner.jina.ai/notebooks/text_to_text/)
- [Image-to-Image Search via ResNet50](https://finetuner.jina.ai/notebooks/image_to_image/)
- [Text-to-Image Search via CLIP](https://finetuner.jina.ai/notebooks/text_to_image/)

[Read our documentation](https://finetuner.jina.ai/) to learn more about what Finetuner can do.

<!-- start support-pitch -->
## Support

34 changes: 34 additions & 0 deletions docs/advanced-topics/budget.md
@@ -0,0 +1,34 @@
(budget)=
# {octicon}`database` How much data?

## Motivation

Fine-tuning is a transfer learning technique developed as part of the Deep Learning revolution in artificial intelligence.
Instead of learning a new task from scratch,
fine-tuning takes a pre-trained model,
trained on a related task, and then further trains it for the new task.
Alternatively, it can mean taking a model pre-trained for an open-domain task and further training it for a domain-specific one.
Compared to training from scratch, fine-tuning is a much more cost-efficient solution whenever it is feasible. It requires:

+ **less labeled data**: as there is no need to learn everything all over again. All the training is devoted to acquiring domain-specific knowledge.
+ **less time to train**: since the number of trainable parameters is much smaller and most layers of the deep neural network are frozen during fine-tuning.
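A minimal PyTorch sketch of this layer freezing (a toy model for illustration, not Finetuner's actual internals):

```python
import torch.nn as nn

# Toy backbone: freeze everything except the final projection layer.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 32),  # only this layer will be fine-tuned
)

for param in model.parameters():
    param.requires_grad = False
for param in model[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"training {trainable}/{total} parameters")
```

Only the unfrozen parameters receive gradient updates, which is why fine-tuning converges in far fewer steps than training from scratch.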

But:

+ **Exactly how much data do you need to get a good result?** One labeled data point? Ten? One thousand? Ten thousand?
+ **Exactly how much time do you need to get good results?** One minute of fine-tuning? An hour? A day? A week?

## Experiments

We designed two experiments to quantitatively study how labeled data and training time affect fine-tuning performance.
For each experiment, we construct three search tasks by fine-tuning three deep neural networks.
We chose seven datasets, two of which are non-domain-specific public datasets, to ensure the generality of our experiment.

We measure the performance of fine-tuned models by evaluating their search results using Mean Reciprocal Rank (mRR), Recall, and Mean Average Precision (mAP).
These metrics are calculated using the top 20 results of each search in the validation subset held out from each dataset.
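As a minimal sketch of how such top-k metrics are computed (assuming, per query, a ranked result list and a set of relevant items; this is illustrative, not Finetuner's evaluation code):

```python
def mrr_at_k(results, relevant, k=20):
    """Mean Reciprocal Rank: average of 1/rank of the first relevant hit."""
    scores = []
    for ranked, rel in zip(results, relevant):
        score = 0.0
        for rank, item in enumerate(ranked[:k], start=1):
            if item in rel:
                score = 1.0 / rank
                break
        scores.append(score)
    return sum(scores) / len(scores)


def recall_at_k(results, relevant, k=20):
    """Fraction of relevant items retrieved in the top k, averaged over queries."""
    scores = []
    for ranked, rel in zip(results, relevant):
        hits = len(set(ranked[:k]) & rel)
        scores.append(hits / len(rel))
    return sum(scores) / len(scores)


# Two toy queries: ranked result ids and the set of truly relevant ids.
results = [['a', 'b', 'c'], ['x', 'y', 'z']]
relevant = [{'b'}, {'x', 'z'}]
print(mrr_at_k(results, relevant))     # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(results, relevant))  # (1/1 + 2/2) / 2 = 1.0
```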

### How much labeled data is needed?

### How much time is needed?

## Summary
@@ -1,91 +1,5 @@
# Encode Documents

Once fine-tuning is finished, it's time to actually use the model.
You can use the fine-tuned models directly to encode [DocumentArray](https://docarray.jina.ai/) objects or to set up an encoding service.
When encoding, data can also be provided as a regular list.

(integrate-with-docarray)=
## Embed DocumentArray

To embed a [DocumentArray](https://docarray.jina.ai/) with a fine-tuned model, you can get the model of your Run via the {func}`~finetuner.get_model` function and embed it via the {func}`finetuner.encode` function:

````{tab} Artifact id and token
```python
from docarray import DocumentArray, Document
import finetuner

finetuner.login()

token = finetuner.get_token()
run = finetuner.get_run(
    experiment_name='YOUR-EXPERIMENT',
    run_name='YOUR-RUN'
)

model = finetuner.get_model(
    run.artifact_id,
    token=token,
    device='cuda',  # model will be placed on cpu by default
)

da = DocumentArray([Document(text='some text to encode')])

finetuner.encode(model=model, data=da)

for doc in da:
    print(f'Text of the returned document: {doc.text}')
    print(f'Shape of the embedding: {doc.embedding.shape}')
```
````
````{tab} Locally saved artifact
```python
from docarray import DocumentArray, Document
import finetuner

model = finetuner.get_model('/path/to/YOUR-MODEL.zip')

da = DocumentArray([Document(text='some text to encode')])

finetuner.encode(model=model, data=da)

for doc in da:
    print(f'Text of the returned document: {doc.text}')
    print(f'Shape of the embedding: {doc.embedding.shape}')
```
````

```console
Text of the returned document: some text to encode
Shape of the embedding: (768,)
```

## Encoding a List
Data that is stored in a regular list can be embedded in the same way you would a [DocumentArray](https://docarray.jina.ai/). Since the modality of your input data can be inferred from the model being used, there is no need to provide any additional information besides the content you want to encode. When providing data as a list, the `finetuner.encode` method will return a `np.ndarray` of embeddings, instead of a `docarray.DocumentArray`:

```python
from docarray import DocumentArray, Document
import finetuner

model = finetuner.get_model('/path/to/YOUR-MODEL.zip')

texts = ['some text to encode']

embeddings = finetuner.encode(model=model, data=texts)

for text, embedding in zip(texts, embeddings):
    print(f'Text of the returned document: {text}')
    print(f'Shape of the embedding: {embedding.shape}')
```


```{admonition} Inference with ONNX
:class: tip
If you set `to_onnx=True` when calling the `finetuner.fit` function,
please use `model = finetuner.get_model('/path/to/YOUR-MODEL.zip', is_onnx=True)`
```

(integrate-with-jina)=
## Fine-tuned model as Executor
(finetuner-executor)=
# {octicon}`gear` Use FinetunerExecutor inside a Jina Flow

Finetuner, being part of the Jina AI Cloud, provides a convenient way to use tuned models via [Jina Executors](https://docs.jina.ai/fundamentals/executor/).

@@ -190,58 +104,6 @@ into the same vector space.
To use those models, you have to provide the name of the model via an additional
`select_model` parameter to the {func}`~finetuner.get_model` function.


````{tab} CLIP text model
```python
from docarray import DocumentArray, Document
import finetuner

finetuner.login()

token = finetuner.get_token()
run = finetuner.get_run(
    experiment_name='YOUR-EXPERIMENT',
    run_name='YOUR-RUN'
)

model = finetuner.get_model(
    run.artifact_id,
    token=token,
    device='cuda',
    select_model='clip-text'
)

da = DocumentArray([Document(text='some text to encode')])

finetuner.encode(model=model, data=da)
```
````
````{tab} CLIP vision model
```python
from docarray import DocumentArray, Document
import finetuner

finetuner.login()

token = finetuner.get_token()
run = finetuner.get_run(
    experiment_name='YOUR-EXPERIMENT',
    run_name='YOUR-RUN'
)

model = finetuner.get_model(
    run.artifact_id,
    token=token,
    device='cuda',
    select_model='clip-vision'
)

da = DocumentArray([Document(uri='~/Pictures/my_img.png')])

finetuner.encode(model=model, data=da)
```
````

If you want to host the CLIP models, you also have to provide the name of the model via the
`select_model` parameter inside the `uses_with` attribute:

@@ -264,4 +126,5 @@ f = Flow().add(
},
)

```