# Multilingual Text To Image search with MultilingualCLIP

Most text-image models can only produce embeddings for text in a single language, typically English. Multilingual CLIP models, however, have been trained on multiple languages. This allows the model to produce similar embeddings for the same sentence in different languages.
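As a toy illustration of what "similar embeddings" means, the sketch below compares vectors with cosine similarity. The three vectors and the sentences attached to them are invented for this example and are not outputs of any model; real CLIP embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example (not model outputs).
english = [0.9, 0.1, 0.3]      # stands in for "a red apple"
german = [0.85, 0.15, 0.35]    # stands in for "ein roter Apfel"
unrelated = [-0.2, 0.9, -0.4]  # stands in for an unrelated sentence

same_meaning = cosine_similarity(english, german)
different_meaning = cosine_similarity(english, unrelated)

# A multilingual model aims for the first score to be much higher.
print(f"same sentence, two languages: {same_meaning:.3f}")
print(f"unrelated sentences:          {different_meaning:.3f}")
```

A score close to 1 means the two vectors point in nearly the same direction, which is the property that makes cross-language retrieval possible.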

This guide will show you how to fine-tune a multilingual CLIP model for text-to-image retrieval in non-English languages.

*Note, please consider switching to GPU/TPU Runtime for faster inference.*


## Install

In [None]:
!pip install 'finetuner[full]'

## Task

We'll be fine-tuning multilingual CLIP on the electronics section of the [German XMarket dataset](https://xmrec.github.io/data/de/).

Each product in the dataset has several attributes. We will use the image and category attributes to create a [`Document`](https://docarray.jina.ai/fundamentals/document/#document) containing two [chunks](https://docarray.jina.ai/fundamentals/document/nested/#nested-structure): one with the product image and one with the product category.

## Data
No sample of the data is shown here yet. The cells below log in to Finetuner and load the prepared training and evaluation sets.


In [7]:
import finetuner
from docarray import DocumentArray, Document
import os
os.environ['JINA_FINETUNER_REGISTRY'] = 'https://api-staging.finetuner.fit'
finetuner.login(force=True)


In [8]:
LANG = "de"
train_data = DocumentArray.load(f"train-{LANG}.da")
eval_data = DocumentArray.load(f"eval-{LANG}.da")
train_data.summary()

## Backbone Model
Currently, we only support one multilingual CLIP model, which has been made available by [open-clip](https://github.com/mlfoundations/open_clip).

## Fine-tuning
Now that our data has been prepared, we can start our fine-tuning run.

In [11]:
import finetuner

run = finetuner.fit(
    model='xlm-roberta-base-ViT-B-32::laion5b_s13b_b90k',
    train_data=train_data,
    eval_data=eval_data,
    epochs=5,
    learning_rate=1e-6,
    loss='CLIPLoss',
    device='cpu',
)

Pushing a DocumentArray to Hubble under the name finetuner-dastorage-finetuner-stupefied-einstein-train ...



Pushing a DocumentArray to Hubble under the name finetuner-dastorage-finetuner-stupefied-einstein-eval ...



You may notice that this piece of code looks very similar to the one used to fine-tune regular CLIP models, as shown [here](https://finetuner.jina.ai/notebooks/text_to_image/). The only real differences are the data being provided and the model being used.
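The `loss='CLIPLoss'` argument is not explained further in this guide. As a rough sketch, and assuming it follows the symmetric contrastive objective described for the original CLIP model (an assumption about Finetuner's internals, not something stated here), the loss for a batch of matching image-text pairs can be written in plain Python as follows. The function names and the fixed `temperature` value are illustrative choices; CLIP itself learns the temperature during training.

```python
import math


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        largest = max(row)  # subtract the max for numerical stability
        log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image i, text i).

    Assumes the embeddings are already L2-normalised, so each dot
    product is a cosine similarity.
    """
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embeddings]
        for img in image_embeddings
    ]
    transposed = [list(column) for column in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


# Hypothetical unit-length embeddings invented for this example.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = clip_style_loss(images, aligned_texts)
swapped_loss = clip_style_loss(images, swapped_texts)
print(aligned_loss < swapped_loss)
```

The loss is small when each image is most similar to its own text and large when the pairs are mismatched, which is what pushes matching images and texts together during fine-tuning.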

## Monitoring

Now that we've created a run, let's check on it. You can monitor the run by checking its status with `run.status()` and by reading its logs with `run.logs()` or `run.stream_logs()`.
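If you prefer to wait for the run in a script rather than stream the logs, a polling loop around `run.status()` is one option. The sketch below is hypothetical: `wait_for_run` and `FakeRun` are names made up for this example, and it assumes that `status()` returns a dictionary with a `'status'` key whose terminal values are `'FINISHED'` and `'FAILED'`. Check what `run.status()` returns in your Finetuner version before relying on it.

```python
import time


def wait_for_run(run, poll_seconds=30.0, max_polls=100):
    """Poll run.status() until the run reaches a terminal state.

    Assumption: status() returns a dict with a 'status' key, and the
    terminal states are named 'FINISHED' and 'FAILED'.
    """
    terminal_states = {"FINISHED", "FAILED"}
    state = None
    for _ in range(max_polls):
        state = run.status()["status"]
        if state in terminal_states:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"run still in state {state!r} after {max_polls} polls")


class FakeRun:
    """Offline stand-in for a Finetuner run, used only to exercise the loop."""

    def __init__(self, states):
        self._states = iter(states)

    def status(self):
        return {"status": next(self._states)}


fake = FakeRun(["CREATED", "STARTED", "STARTED", "FINISHED"])
final_state = wait_for_run(fake, poll_seconds=0.0)
print(final_state)
```

With a real run you would pass the `run` object returned by `finetuner.fit` in place of `FakeRun` and keep a non-zero `poll_seconds`.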

In [None]:
# Note: fine-tuning might take around 20 minutes
for entry in run.stream_logs():
    print(entry)

[10:10:06] INFO     Starting finetuner run ...                                                           __main__.py:113
           DEBUG    Found Jina AI Cloud authentication token                                             __main__.py:125
           DEBUG    Running in online mode                                                               __main__.py:126
           INFO     Reading config ...                                                                   __main__.py:133
           DEBUG    Reading config from stream                                                           __main__.py:145
           INFO     Parsing config ...                                                                   __main__.py:148
           INFO     Config loaded 📜                                                                     __main__.py:150
           INFO     Run name: stupefied-einstein                                                         __main__.py:152
           INFO     Experiment na

Since some runs might take up to several hours/days, it's important to know how to reconnect to Finetuner and retrieve your run.

```python
import finetuner

finetuner.login()
run = finetuner.get_run(run.name)
```

You can continue monitoring the run by checking its status with `finetuner.run.Run.status()` or its logs with `finetuner.run.Run.logs()`.

## Evaluating
Currently, there is no user-friendly way to retrieve evaluation metrics from the {class}`~finetuner.callback.EvaluationCallback`. The end of the run logs only shows that the fine-tuned model was saved and pushed:

```bash
           INFO     Done ✨                                                                              __main__.py:219
           INFO     Saving fine-tuned models ...                                                         __main__.py:222
           INFO     Saving model 'model' in /usr/src/app/tuned-models/model ...                          __main__.py:233
           INFO     Pushing saved model to Jina AI Cloud ...                                                    __main__.py:240
[10:38:14] INFO     Pushed model artifact ID: '62a1af491597c219f6a330fe'                                 __main__.py:246
           INFO     Finished 🚀                                                                          __main__.py:248
```

```{admonition} Evaluation of CLIP

In this example, we did not plug in an `EvaluationCallback`, since the callback can evaluate only one model at a time.
In most cases, we want to evaluate two models together: use `CLIPTextEncoder` to encode textual Documents as `query_data`, and use `CLIPImageEncoder` to encode image Documents as `index_data`.
Then use the textual Documents to search the image Documents.

We have done the evaluation for you in the table below.
```

TODO
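The evaluation setup described in the admonition above (text embeddings as queries, image embeddings as the index) can be sketched without any Finetuner calls. The example below computes recall@k, one common retrieval metric chosen here for illustration; this guide does not state which metrics the evaluation reports. All embeddings are invented 2-d vectors, and `recall_at_k` is a name made up for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def recall_at_k(query_embeddings, index_embeddings, relevant, k):
    """Fraction of queries whose relevant index item is ranked in the top k.

    relevant[i] is the position in index_embeddings of the item that
    matches query i.
    """
    hits = 0
    for i, query in enumerate(query_embeddings):
        scores = [cosine_similarity(query, item) for item in index_embeddings]
        # Index positions sorted by descending similarity to the query.
        ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if relevant[i] in ranking[:k]:
            hits += 1
    return hits / len(query_embeddings)


# Hypothetical embeddings invented for this example (not model outputs).
text_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.05]]
image_index = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
relevant = [0, 1, 2]  # query i should retrieve image i

recall_1 = recall_at_k(text_queries, image_index, relevant, k=1)
recall_2 = recall_at_k(text_queries, image_index, relevant, k=2)
print(f"recall@1 = {recall_1:.2f}, recall@2 = {recall_2:.2f}")
```

In this toy data the third query ranks a different image first and its own image second, so it counts as a miss at k=1 and a hit at k=2.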

## Saving

After the run has finished successfully, you can download the tuned model to your local machine:

In [None]:
artifact = run.save_artifact('m-clip-model')

## Inference

Now that you have saved the `artifact` to your host machine,
let's use the fine-tuned model to encode a new `Document`:

In [None]:
text_da = DocumentArray([Document(text='etwas Text zum Codieren')])
image_da = DocumentArray([Document(uri='https://upload.wikimedia.org/wikipedia/commons/4/4e/Single_apple.png')])

clip_text_encoder = finetuner.get_model(artifact=artifact, select_model='clip-text')
clip_image_encoder = finetuner.get_model(artifact=artifact, select_model='clip-vision')

finetuner.encode(model=clip_text_encoder, data=text_da)
finetuner.encode(model=clip_image_encoder, data=image_da)

print(text_da.embeddings.shape)
print(image_da.embeddings.shape)

```bash
(1, 512)
(1, 512)
```

```{admonition} what is select_model?
When fine-tuning CLIP, we are fine-tuning the CLIPVisionEncoder and CLIPTextEncoder in parallel.
The artifact contains two models: `clip-vision` and `clip-text`.
The parameter `select_model` tells Finetuner which model to use for inference. In the example above,
we use `clip-text` to encode a Document with text content and `clip-vision` to encode a Document with image content.
```

```{admonition} Inference with ONNX
If you set `to_onnx=True` when calling the `finetuner.fit` function,
please use `model = finetuner.get_model(artifact, is_onnx=True)`.
```