
docs: document multilingual clip #611

Merged: 18 commits, Nov 24, 2022
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -17,6 +17,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Docs

- Add notebook for multilingual CLIP models. ([#611](https://github.com/jina-ai/finetuner/pull/611))

- Improve `describe_models` with `task` to better organize list of backbones. ([#610](https://github.com/jina-ai/finetuner/pull/610))


Binary file added docs/notebooks/images/mclip-example-1.jpg
Binary file added docs/notebooks/images/mclip-example-2.jpg
328 changes: 328 additions & 0 deletions docs/notebooks/using_mclip.ipynb
@@ -0,0 +1,328 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "72867ba9-6a8c-4b14-acbf-487ea0a61836",
"metadata": {},
"source": [
"# Multilingual Text To Image search with MultilingualCLIP"
]
},
{
"cell_type": "markdown",
"id": "f576573b-a48f-4790-817d-e99f8bd28fd0",
"metadata": {},
"source": [
"Most text-image models are only able to provide embeddings for text in a single language, typically English. Multilingual CLIP models, however, are models that have been trained on multiple different languages. This allows the model the produce similar embeddings for the same sentence in multiple different languages. \n",
"\n",
"This guide will show you how to finetune a multilingual CLIP model for a text to image retrieval in non-English languages.\n",
"\n",
"*Note, please consider switching to GPU/TPU Runtime for faster inference.*\n"
]
},
{
"cell_type": "markdown",
"id": "ed1e7d55-a458-4dfd-8f4c-eeb02521c221",
"metadata": {},
"source": [
"## Install"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9261d0a7-ad6d-461f-bdf7-54e9804cc45d",
"metadata": {},
"outputs": [],
"source": [
"!pip install 'finetuner[full]'"
]
},
{
"cell_type": "markdown",
"id": "11f13ad8-e0a7-4ba6-b52b-f85dd221db0f",
"metadata": {},
"source": [
"## Task"
]
},
{
"cell_type": "markdown",
"id": "ed1f88d4-f140-48d4-9d20-00e628c73e38",
"metadata": {},
"source": [
"We'll be finetuning multilingual CLIP on the `toloka-fashion` dataset, which contains information about fashion products, with all descriptions being in German. \n",
"\n",
"Each product in the dataset contains several attributes, we will be making use of the image and category attributes to create a [`Document`](https://docarray.jina.ai/fundamentals/document/#document) containing two [chunks](https://docarray.jina.ai/fundamentals/document/nested/#nested-structure), one containing the image and another containing the category of the product."
]
},
{
"cell_type": "markdown",
"id": "2a40f0b1-7272-4ae6-9d0a-f5c8d6d534d8",
"metadata": {},
"source": [
"## Data\n",
"We will use the `toloka-fashion` dataset, which we have already pre-processed and made available on the Jina AI Cloud. You can access it by like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4420a4ac-531a-4db3-af75-ebb58d8f828b",
"metadata": {},
"outputs": [],
"source": [
"import finetuner\n",
"from docarray import DocumentArray, Document\n",
"\n",
"finetuner.login(force=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bab5c3fb-ee75-4818-bd18-23c7a5983e1b",
"metadata": {},
"outputs": [],
"source": [
"train_data = DocumentArray.pull('toloka-fashion-train-data', show_progress=True)\n",
"eval_data = DocumentArray.pull('toloka-fashion-eval-data', show_progress=True)\n",
"\n",
"train_data.summary()"
]
},
{
"cell_type": "markdown",
"id": "3b859e9c-99e0-484b-98d5-643ad51de8f0",
"metadata": {},
"source": [
"## Backbone Model\n",
"Currently, we only support one multilingual CLIP model, which has been made available by [open-clip](https://github.com/mlfoundations/open_clip). This model is the `xlm-roberta-base-ViT-B-32`, which has been trained on the `laion5b` dataset."
]
},
{
"cell_type": "markdown",
"id": "0b57559c-aa55-40ff-9d05-f061dfb01354",
"metadata": {},
"source": [
"## Fine-tuning\n",
"Now that our data has been prepared, we can start our fine-tuning run."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a0cba20d-e335-43e0-8936-d926568034b3",
"metadata": {},
"outputs": [],
"source": [
"import finetuner\n",
"\n",
"run = finetuner.fit(\n",
" model='xlm-roberta-base-ViT-B-32::laion5b_s13b_b90k',\n",
" train_data=train_data,\n",
" eval_data=eval_data,\n",
" epochs=5,\n",
" learning_rate=1e-6,\n",
" loss='CLIPLoss',\n",
" device='cuda',\n",
")"
]
},
{
"cell_type": "markdown",
"id": "6be36da7-452b-4450-a5d5-6cae84522bb5",
"metadata": {},
"source": [
"You may notice that this piece of code looks very similar to the one used to fine-tune regular clip models, as shown [here](https://finetuner.jina.ai/notebooks/text_to_image/). The only real difference is the data being provided and the model being used. "
]
},
{
"cell_type": "markdown",
"id": "923e4206-ac60-4a75-bb3d-4acfc4218cea",
"metadata": {},
"source": [
"## Monitoring\n",
"\n",
"Now that we've created a run, let's see its status. You can monitor the run by checking the status - `run.status()` and - the logs - `run.logs()` or - `run.stream_logs()`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56d020bf-8095-4a83-a532-9b6c296e985a",
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"# note, the fine-tuning might takes 20~ minutes\n",
"for entry in run.stream_logs():\n",
" print(entry)"
]
},
{
"cell_type": "markdown",
"id": "b58930f1-d9f5-43d3-b852-5cbaa04cb1aa",
"metadata": {},
"source": [
"Since some runs might take up to several hours/days, it's important to know how to reconnect to Finetuner and retrieve your run.\n",
"\n",
"```python\n",
"import finetuner\n",
"\n",
"finetuner.login()\n",
"run = finetuner.get_run(run.name)\n",
"```\n",
"\n",
"You can continue monitoring the run by checking the status - `finetuner.run.Run.status()` or the logs - `finetuner.run.Run.logs()`."
]
},
{
"cell_type": "markdown",
"id": "f0b81ec1-2e02-472f-b2f4-27085bb041cc",
"metadata": {},
"source": [
"## Evaluating\n",
"Currently, we don't have a user-friendly way to get evaluation metrics from the {class}`~finetuner.callback.EvaluationCallback` we initialized previously.\n",
"\n",
"```bash\n",
" INFO Done ✨ __main__.py:219\n",
" INFO Saving fine-tuned models ... __main__.py:222\n",
" INFO Saving model 'model' in /usr/src/app/tuned-models/model ... __main__.py:233\n",
" INFO Pushing saved model to Jina AI Cloud ... __main__.py:240\n",
"[10:38:14] INFO Pushed model artifact ID: '62a1af491597c219f6a330fe' __main__.py:246\n",
" INFO Finished 🚀 __main__.py:248\n",
"```\n",
"\n",
"```{admonition} Evaluation of CLIP\n",
"\n",
"In this example, we did not plug-in an `EvaluationCallback` since the callback can evaluate one model at one time.\n",
"In most cases, we want to evaluate two models: i.e. use `CLIPTextEncoder` to encode textual Documents as `query_data` while use `CLIPImageEncoder` to encode image Documents as `index_data`.\n",
"Then use the textual Documents to search image Documents.\n",
"\n",
"We have done the evaulation for you in the table below.\n",
"```\n",
"\n",
"| | Before Finetuning | After Finetuning |\n",
"|-------------------|---------------------|---------------------|\n",
"| average_precision | 0.449150592183874 | 0.5229004685258555 |\n",
"| dcg_at_k | 0.6027663856128129 | 0.669843418638272 |\n",
"| f1_score_at_k | 0.0796103896103896 | 0.08326118326118326 |\n",
"| hit_at_k | 0.83 | 0.8683333333333333 |\n",
"| ndcg_at_k | 0.5998242304751983 | 0.6652403194597005 |\n",
"| precision_at_k | 0.04183333333333333 | 0.04375 |\n",
"| r_precision | 0.4489283699616517 | 0.5226226907480778 |\n",
"| recall_at_k | 0.8283333333333334 | 0.8666666666666667 |\n",
"| reciprocal_rank | 0.44937281440609617 | 0.5231782463036333 |"
]
},
{
"cell_type": "markdown",
"id": "2b8da34d-4c14-424a-bae5-6770f40a0721",
"metadata": {},
"source": [
"## Saving\n",
"\n",
"After the run has finished successfully, you can download the tuned model on your local machine:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0476c03f-838a-4589-835c-60d1b7f3f893",
"metadata": {},
"outputs": [],
"source": [
"artifact = run.save_artifact('m-clip-model')"
]
},
{
"cell_type": "markdown",
"id": "baabd6be-8660-47cc-a48d-feb43d0a507b",
"metadata": {},
"source": [
"## Inference\n",
"\n",
"Now you saved the `artifact` into your host machine,\n",
"let's use the fine-tuned model to encode a new `Document`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fe43402f-4191-4343-905c-c75c64694662",
"metadata": {},
"outputs": [],
"source": [
"text_da = DocumentArray([Document(text='setwas Text zum Codieren')])\n",
"image_da = DocumentArray([Document(uri='https://upload.wikimedia.org/wikipedia/commons/4/4e/Single_apple.png')])\n",
"\n",
"clip_text_encoder = finetuner.get_model(artifact=artifact, select_model='clip-text')\n",
"clip_image_encoder = finetuner.get_model(artifact=artifact, select_model='clip-vision')\n",
"\n",
"finetuner.encode(model=clip_text_encoder, data=text_da)\n",
"finetuner.encode(model=clip_image_encoder, data=image_da)\n",
"\n",
"print(text_da.embeddings.shape)\n",
"print(image_da.embeddings.shape)"
]
},
{
"cell_type": "markdown",
"id": "ff2e7818-bf11-4179-a34d-d7b790b0db12",
"metadata": {},
"source": [
"```bash\n",
"(1, 512)\n",
"(1, 512)\n",
"```\n",
"\n",
"```{admonition} what is select_model?\n",
"When fine-tuning CLIP, we are fine-tuning the CLIPVisionEncoder and CLIPTextEncoder in parallel.\n",
"The artifact contains two models: `clip-vision` and `clip-text`.\n",
"The parameter `select_model` tells finetuner which model to use for inference, in the above example,\n",
"we use `clip-text` to encode a Document with text content.\n",
"```\n",
"\n",
"```{admonition} Inference with ONNX\n",
"In case you set `to_onnx=True` when calling `finetuner.fit` function,\n",
"please use `model = finetuner.get_model(artifact, is_onnx=True)`\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "b8d6aec2-b306-4a03-a3e1-d79c1b4d6e42",
"metadata": {},
"source": [
"## Before and After\n",
"Now that we have shown you how to fine tune multilingual CLIP model, we can compare the difference between the returned results for the two models:\n",
"![mclip-example-1](images/mclip-example-1.jpg)\n",
"![mclip-example-2](images/mclip-example-2.jpg)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Member:

We need to log in to the team account and push the DocumentArray to the cloud.

Member:

Which model? Can you elaborate?

Member:

Always use `cuda`.

Contributor Author:

The method of pulling data and the use of 'cpu' were for testing purposes; updated now.
