# **Getting Started with Marqtune - A Guided Walkthrough**

This notebook shows how to get up and running with Marqtune, the embedding model training platform. An in-depth [article](https://marqo.ai/blog/getting-started-with-marqtune) accompanies this walkthrough, and we highly recommend reading it.

This guide walks you through fine-tuning a base OpenCLIP model on a multi-modal training dataset. We then evaluate the tuned model and compare it with an equivalent evaluation of the base model to demonstrate the improvement in performance. The tuned model can subsequently be used in a [Marqo index](https://cloud.marqo.ai/) to provide more relevant results for queries.

Let's get stuck in!

### **1. Set Up and Installation**

To use Marqtune you will need the Marqtune Python client. You can install this using pip:

In [1]:
!pip install marqtune

Collecting marqtune
  Downloading marqtune-0.2.2-py3-none-any.whl.metadata (5.8 kB)
Downloading marqtune-0.2.2-py3-none-any.whl (10 kB)
Installing collected packages: marqtune
Successfully installed marqtune-0.2.2


Next, you will need a Marqo API key with access to Marqtune. To obtain one, sign in to Marqo Cloud, navigate to the API keys section, and create a key. For more information on obtaining your Marqo API key, see this [article](https://marqo.ai/blog/finding-my-marqo-api-key).

We are using Google Colab, so we will take advantage of 'Secrets', which lets you store environment variables privately. Click the 'key' icon in the left-hand navigation bar and store your API key there.

In [2]:
# Store Marqo API Key using Secrets in Google Colab
from google.colab import userdata
api_key = userdata.get('MARQO_API_KEY')   # alternatively, api_key = "..."
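If you run this notebook outside Colab, `google.colab` cannot be imported. The following is a minimal sketch of a fallback, assuming you export the key as an environment variable named `MARQO_API_KEY` (the name simply mirrors the Colab secret above). The helper `load_api_key` is hypothetical and not part of the Marqtune client.

```python
import os


def load_api_key(name="MARQO_API_KEY", env=None):
    """Return the key from Colab Secrets if available, else from the environment."""
    env = os.environ if env is None else env
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not accessible.
        value = None
    # Fall back to the environment variable of the same name (an assumption).
    value = value or env.get(name)
    if not value:
        raise RuntimeError(f"Set {name} as a Colab secret or an environment variable")
    return value
```

With this in place, `api_key = load_api_key()` could replace the Colab-specific lines above.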

### **2. Initializing the Client**

We now make the necessary imports and set up the Marqtune Python client.

In [3]:
from marqtune.client import Client
from marqtune.enums import DatasetType, InstanceType
from urllib.request import urlopen
import gzip
import json
import uuid
import os

# Suffix is used just to make the dataset and model names unique
suffix = str(uuid.uuid4())[:8]
print(f"Using suffix={suffix} for this walkthrough")

# Set up Marqtune Client
# To find your API Key, go to Marqo Cloud and click 'API Keys' from the lefthand side navigation bar or visit https://www.marqo.ai/blog/finding-my-marqo-api-key
marqtune_client = Client(url="https://marqtune.marqo.ai", api_key=api_key)

Using suffix=d166c36f for this walkthrough


To see the results of datasets and other resources generated in this walkthrough in the Marqtune UI, please refer to our [article](https://marqo.ai/blog/getting-started-with-marqtune).

### **3. Dataset Creation**

We will now create two datasets, one for training and another for evaluation. The datasets are sourced from two CSV files containing shopping data generated from a subset of `Marqo-GS-10M`, which is described in more detail in our [open-source GCL repository](https://github.com/marqo-ai/GCL).

Both CSV files have the same format. The first is larger (100,000 rows) and is used for training a model; the second is smaller (25,000 rows) and is used for model evaluation.

The datasets are multi-modal, consisting of both text and images. The images are represented by URLs, which Marqtune uses to download them.

Let’s begin by downloading these data files:

In [4]:
print("Downloading data files:")
base_path = (
    "https://marqo-gcl-public.s3.us-west-2.amazonaws.com/marqtune_test/datasets/v1"
)
training_data = "gs_100k_training.csv"
eval_data = "gs_25k_eval.csv"
open(training_data, "w").write(
    gzip.open(urlopen(f"{base_path}/{training_data}.gz"), "rb").read().decode("utf-8")
)
open(eval_data, "w").write(
    gzip.open(urlopen(f"{base_path}/{eval_data}.gz"), "rb").read().decode("utf-8")
)

Downloading data files:


5079946

We now want to create datasets in Marqtune. In order to do this, we need to identify the columns in the CSVs as well as their types by defining a data schema. We will reuse the same data schema for both training and evaluation datasets though this is not strictly necessary.

In [5]:
data_schema = {
    "query": "text",
    "title": "text",
    "image": "image_pointer",
    "score": "score",
}

After defining the data schema we can create the two datasets. Note that creating a dataset takes a few minutes because it involves several steps:

1. The CSV file has to be uploaded
2. Some simple validations have to pass (e.g. the data schema needs to be validated against each row in the CSV input)
3. The URLs in the `image_pointer` columns are used to download the image files to the dataset
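Because dataset creation takes a few minutes, it can be useful to catch obvious problems locally first. The sketch below is only an approximation of the kind of checks described in the steps above: the exact validations Marqtune performs are not documented here, and the helper `check_rows` and its rules (a `score` must parse as a number, an `image_pointer` must look like an http(s) URL) are assumptions made for illustration.

```python
import csv
import io


def check_rows(csv_file, data_schema, max_problems=5):
    """Return a list of human-readable problems found in an open CSV file."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [column for column in data_schema if column not in header]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # Row 1 is the header, so data rows start at 2.
    for row_number, row in enumerate(reader, start=2):
        for column, kind in data_schema.items():
            value = (row[column] or "").strip()
            if kind == "score":
                try:
                    float(value)
                except ValueError:
                    problems.append(f"row {row_number}: {column} is not numeric")
            elif kind == "image_pointer":
                if not value.startswith(("http://", "https://")):
                    problems.append(f"row {row_number}: {column} is not an http(s) URL")
        if len(problems) >= max_problems:
            break
    return problems


# Tiny in-memory example; the rows are made up for illustration.
schema = {"query": "text", "title": "text", "image": "image_pointer", "score": "score"}
sample = io.StringIO(
    "query,title,image,score\n"
    "red shoes,Red running shoes,https://example.com/a.jpg,3\n"
    "blue hat,Blue hat,not-a-url,high\n"
)
for problem in check_rows(sample, schema):
    print(problem)
```

To check one of the downloaded files, you could pass an open file handle instead, for example `check_rows(open("gs_100k_training.csv", newline=""), data_schema)`.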

In [6]:
# Create the training dataset.
training_dataset_name = f"{training_data}-{suffix}"
print(f"Creating training dataset ({training_dataset_name}):")
training_dataset = marqtune_client.create_dataset(
    dataset_name=training_dataset_name,
    file_path=training_data,
    dataset_type=DatasetType.TRAINING,
    data_schema=data_schema,
    query_columns=["query"],
    result_columns=["title", "image"],
    # setting wait_for_completion=True will make this a blocking call and will also print logs interactively
    wait_for_completion=True,
)

Creating training dataset (gs_100k_training.csv-d166c36f):
Dataset was initialised. Dataset ID: 426c58ac-0732-44dc-ac10-d734ffaada38
Attempting to upload file...
File uploaded successfully. Job will start soon
Uploading..
Creating: Provisioning...
Creating: Running
2024-10-01 15:09:12,619 - INFO - Initialising task
2024-10-01 15:09:12,688 - INFO - Downloading files for task 426c58ac-0732-44dc-ac10-d734ffaada38
2024-10-01 15:09:13,247 - INFO - File download is completed
2024-10-01 15:09:13,249 - INFO - Preparing dataset with {'input_file': 'dataset/426c58ac-0732-44dc-ac10-d734ffaada38/dataset.csv', 'data_schema': {'query': 'text', 'title': 'text', 'image': 'image_pointer', 'score': 'score'}, 'output_path': '426c58ac-0732-44dc-ac10-d734ffaada38', 'dataset_type': 'training', 'result_columns': None, 'image_download_headers': None, 'logger': <Logger __main__ (DEBUG)>, 'metrics_collector': <clients.cw_client.CWClient object at 0xffffa151cb90>}
2024-10-01 15:09:13,518 - INFO - Total rows afte

We do the same for the evaluation dataset.

In [7]:
# Similarly we create the Evaluation dataset.
eval_dataset_name = f"{eval_data}-{suffix}"
print(f"Creating evaluation dataset ({eval_dataset_name}):")
eval_dataset = marqtune_client.create_dataset(
    dataset_name=eval_dataset_name,
    file_path=eval_data,
    dataset_type=DatasetType.EVALUATION,
    data_schema=data_schema,
    query_columns=["query"],
    result_columns=["title", "image"],
    wait_for_completion=True,
)

Creating evaluation dataset (gs_25k_eval.csv-d166c36f):
Dataset was initialised. Dataset ID: 9744d86a-505c-4a32-bb55-eec30826f00f
Attempting to upload file...
File uploaded successfully. Job will start soon
Uploading..
Creating: Provisioning...
Creating: Running
2024-10-01 15:13:07,903 - INFO - Initialising task
2024-10-01 15:13:07,971 - INFO - Downloading files for task 9744d86a-505c-4a32-bb55-eec30826f00f
2024-10-01 15:13:08,197 - INFO - File download is completed
2024-10-01 15:13:08,199 - INFO - Preparing dataset with {'input_file': 'dataset/9744d86a-505c-4a32-bb55-eec30826f00f/dataset.csv', 'data_schema': {'query': 'text', 'title': 'text', 'image': 'image_pointer', 'score': 'score'}, 'output_path': '9744d86a-505c-4a32-bb55-eec30826f00f', 'dataset_type': 'evaluation', 'result_columns': ['title', 'image'], 'image_download_headers': None, 'logger': <Logger __main__ (DEBUG)>, 'metrics_collector': <clients.cw_client.CWClient object at 0xffff91cd1910>, 'query_columns': ['query']}
2024-10

### **4. Model Tuning**

Now we're ready to train a model. To do so, we define a few training hyperparameters. In this example we've set parameters that work well with the sample dataset, but you are encouraged to experiment with these values for your own datasets.

For the base pre-trained OpenCLIP model, this example uses `ViT-B-32` with the `laion2b_s34b_b79k` checkpoint (the values set in the code below), which is a good model to start with as it gives good performance with low latency and memory usage. We have previously published a guide to help you [choose the right model](https://www.marqo.ai/blog/benchmarking-models-for-multimodal-search) for your use case.

In [8]:
# Set up training hyperparameters:
training_params = {
    "leftKeys": ["query"],
    "leftWeights": [1],
    "rightKeys": ["image", "title"],
    "rightWeights": [0.9, 0.1],
    "weightKey": "score",
    "epochs": 5,
}

base_model = "ViT-B-32"
base_checkpoint = "laion2b_s34b_b79k"

The `training_params` dictionary defines the training hyperparameters. We've chosen a minimal set to get you started: the left and right keys define the columns in the input CSV that we train on. You are encouraged to experiment with these parameters yourself. Refer to the [Training Parameters documentation](https://docs.marqo.ai/2.10/Marqtune/API/evaluation/evaluation_parameters/) for details on these and the other parameters available for training.
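Since the keys in `training_params` must refer to columns in the dataset, a quick local consistency check can save a failed training run. The sketch below is not part of the Marqtune client: `check_training_params` is a hypothetical helper, and the rules it applies (one weight per key, every key present in the data schema, `weightKey` pointing at a `score` column) are assumptions inferred from the example values in this walkthrough.

```python
def check_training_params(params, data_schema):
    """Return a list of inconsistencies between hyperparameters and the schema."""
    problems = []
    for side in ("left", "right"):
        keys = params.get(f"{side}Keys", [])
        weights = params.get(f"{side}Weights", [])
        # Assumption: each key is paired with exactly one weight.
        if len(keys) != len(weights):
            problems.append(f"{side}Keys and {side}Weights differ in length")
        for key in keys:
            if key not in data_schema:
                problems.append(f"{side} key '{key}' is not a column in the schema")
    weight_key = params.get("weightKey")
    # Assumption: the weight key should name a column of type "score".
    if weight_key is not None and data_schema.get(weight_key) != "score":
        problems.append(f"weightKey '{weight_key}' is not a score column")
    return problems


# The values used in this walkthrough.
schema = {"query": "text", "title": "text", "image": "image_pointer", "score": "score"}
params = {
    "leftKeys": ["query"],
    "leftWeights": [1],
    "rightKeys": ["image", "title"],
    "rightWeights": [0.9, 0.1],
    "weightKey": "score",
    "epochs": 5,
}
print(check_training_params(params, schema))
```

An empty list means the parameters are consistent with the schema under these assumed rules.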

This training will take a while to complete, though you may choose to run it faster using more powerful hardware: `instance_type=InstanceType.PERFORMANCE`.

It's also worth noting that once training has been successfully kicked off in Marqtune, it will continue until completion regardless of what happens to your local client session. When training starts, the logs show the new model ID that identifies your model. Copy this ID so that, if your local console disconnects during training, you can resume the rest of this guide after loading the completed model: `tuned_model = marqtune_client.model('<model id>')`.
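One way to avoid losing the model ID is to write it to a small file as soon as you have copied it from the logs. This is a minimal sketch using only the standard library; the helpers `save_model_id` and `load_model_id` and the default file name are hypothetical and not part of Marqtune.

```python
import json
from pathlib import Path


def save_model_id(model_id, path="marqtune_model_id.json"):
    """Persist the model ID so a later session can pick it up."""
    Path(path).write_text(json.dumps({"model_id": model_id}))


def load_model_id(path="marqtune_model_id.json"):
    """Read back a model ID previously stored with save_model_id."""
    return json.loads(Path(path).read_text())["model_id"]
```

After a disconnect, the stored ID can be passed to the call shown above: `tuned_model = marqtune_client.model(load_model_id())`.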

In [None]:
model_name = f"{training_data}-model-{suffix}"
print(f"Training a new model ({model_name}):")
tuned_model = marqtune_client.train_model(
    dataset_id=training_dataset.dataset_id,
    model_name=model_name,
    instance_type=InstanceType.BASIC,
    base_model=f"Marqo/{base_model}.{base_checkpoint}",
    hyperparameters=training_params,
    wait_for_completion=True,
)

Training a new model (gs_100k_training.csv-model-d166c36f):
Model creation was initialised. Model ID: a4f61808-819f-4576-8fd2-30144047a1d4
Initializing.
Creating: Provisioning...
Creating: Running
2024-10-01 15:14:42,115 - INFO - Initialising task
2024-10-01 15:14:42,188 - INFO - Downloading files for task a4f61808-819f-4576-8fd2-30144047a1d4
2024-10-01 15:15:17,857 - INFO - File download is completed
2024-10-01 15:15:17,935 - INFO - Initializing training job for model ViT-B-32 and dataset 426c58ac-0732-44dc-ac10-d734ffaada38. Please wait, it may take some time.
METRIC: imagesPreprocessingTime=29;
Executing training job with 1 GPUs.
Torchrun command:  ['torchrun', '--nproc_per_node', '1', 'main.py', '--', '--left-keys', "['query']", '--left-weights', '[1]', '--right-keys', "['image', 'title']", '--right-weights', '[0.9, 0.1]', '--weight-key', 'score', '--epochs', '5', '--train-data', '/app/src/data/label.csv', '--img-or-txt', "[['txt'], ['img', 'txt']]", '--id-keys', "['query', 'image'

Note that the logs contain information about the training process. Here’s an example:

```text
1721298452795 2024-07-18 10:27:32,795 - INFO - 2024-07-18T10:27:22.521705828Z 2024-07-18,10:27:22 | INFO | Train Epoch: 0 [   256/100000 (0%)] Data (t): 1.996 Batch (t): 6.086, 42.0608/s, 42.0608/s/gpu LR: 0.000000 Logit Scale: 100.003, Logit Bias: 0.000, Txt_img_0_0_loss: 1.5323 (1.5323) Txt_txt_0_1_loss: 2.2189 (2.2189) Weighted_mean_loss: 1.1285 (1.1285) Loss: 1.6030 (1.6030)
1721298602882 2024-07-18 10:30:02,882 - INFO - 2024-07-18T10:29:57.309371399Z 2024-07-18,10:29:57 | INFO | Train Epoch: 0 [ 25856/100000 (26%)] Data (t): 0.742 Batch (t): 1.548, 160.529/s, 160.529/s/gpu LR: 0.000005 Logit Scale: 99.984, Logit Bias: 0.000, Txt_img_0_0_loss: 0.89163 (1.2119) Txt_txt_0_1_loss: 0.40258 (1.3107) Weighted_mean_loss: 0.61996 (0.87425) Loss: 0.70145 (1.1522)
1721298762976 2024-07-18 10:32:42,975 - INFO - 2024-07-18T10:32:34.148183677Z 2024-07-18,10:32:34 | INFO | Train Epoch: 0 [ 51456/100000 (52%)] Data (t): 0.766 Batch (t): 1.568, 161.327/s, 161.327/s/gpu LR: 0.000010 Logit Scale: 99.969, Logit Bias: 0.000, Txt_img_0_0_loss: 0.89120 (1.1050) Txt_txt_0_1_loss: 0.32536 (0.98227) Weighted_mean_loss: 0.66933 (0.80594) Loss: 0.69427 (0.99957)
1721298913063 2024-07-18 10:35:13,063 - INFO - 2024-07-18T10:35:10.632618682Z 2024-07-18,10:35:10 | INFO | Train Epoch: 0 [ 77056/100000 (77%)] Data (t): 0.762 Batch (t): 1.565, 167.016/s, 167.016/s/gpu LR: 0.000015 Logit Scale: 99.943, Logit Bias: 0.000, Txt_img_0_0_loss: 1.0737 (1.0972) Txt_txt_0_1_loss: 0.36037 (0.82679) Weighted_mean_loss: 0.74134 (0.78979) Loss: 0.81226 (0.95274)
1721299053145 2024-07-18 10:37:33,145 - INFO - 2024-07-18T10:37:29.910381986Z 2024-07-18,10:37:29 | INFO | Train Epoch: 0 [ 99840/100000 (100%)] Data (t): 0.762 Batch (t): 1.565, 165.425/s, 165.425/s/gpu LR: 0.000019 Logit Scale: 99.911, Logit Bias: 0.000, Txt_img_0_0_loss: 0.97796 (1.0733) Txt_txt_0_1_loss: 0.27951 (0.71734) Weighted_mean_loss: 0.60969 (0.75377) Loss: 0.71128 (0.90445)
```

Each line reports the epoch and progress through it, data and batch times, learning rate, logit scale, logit bias, text-image loss, text-text loss, weighted mean loss and overall loss.
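If you want to track these values programmatically, the log lines can be parsed with regular expressions. The sketch below is written against the format of the example lines shown here and may need adjusting if the log format changes. The helper `parse_train_line` is hypothetical, and reading the number in parentheses as a running average is an assumption based on how OpenCLIP-style training logs are usually laid out.

```python
import re

# "Train Epoch: 0 [ 25856/100000 (26%)]"
PROGRESS = re.compile(r"Train Epoch: (\d+) \[\s*(\d+)/(\d+) \((\d+)%\)\]")
# "Txt_img_0_0_loss: 0.89163 (1.2119)" or "Loss: 0.70145 (1.1522)"
LOSS = re.compile(r"(\w*[Ll]oss): ([\d.]+) \(([\d.]+)\)")


def parse_train_line(line):
    """Extract epoch, progress and losses from one training log line."""
    progress = PROGRESS.search(line)
    if progress is None:
        return None  # not a training progress line
    epoch, seen, total, _percent = map(int, progress.groups())
    # Each loss maps to (current value, value in parentheses).
    losses = {
        name: (float(current), float(bracketed))
        for name, current, bracketed in LOSS.findall(line)
    }
    return {
        "epoch": epoch,
        "samples_seen": seen,
        "samples_total": total,
        "losses": losses,
    }


# One of the example lines above, shortened to the part after "| INFO |".
line = (
    "Train Epoch: 0 [ 25856/100000 (26%)] Data (t): 0.742 Batch (t): 1.548, "
    "160.529/s, 160.529/s/gpu LR: 0.000005 Logit Scale: 99.984, Logit Bias: 0.000, "
    "Txt_img_0_0_loss: 0.89163 (1.2119) Txt_txt_0_1_loss: 0.40258 (1.3107) "
    "Weighted_mean_loss: 0.61996 (0.87425) Loss: 0.70145 (1.1522)"
)
print(parse_train_line(line))
```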

### **5. Evaluation**

Once the model has been tuned, we want to quantify its performance against the baseline set by the original base model. To do this, we have Marqtune run an evaluation of the base model on the evaluation dataset to establish a baseline, and then run a second evaluation with the same dataset on the last checkpoint generated by our freshly tuned model.

Finally, we will print out the results of each evaluation which should show the tuned model returning better performance numbers than the base model.

In [None]:
eval_params = {
    "leftKeys": ["query"],
    "leftWeights": [1],
    "rightKeys": ["image", "title"],
    "rightWeights": [0.9, 0.1],
    "weightKey": "score",
}

In [None]:
print("Evaluating the base model:")
base_model_eval = marqtune_client.evaluate(
    dataset_id=eval_dataset.dataset_id,
    model=f"Marqo/{base_model}.{base_checkpoint}",
    hyperparameters=eval_params,
    wait_for_completion=True,
)

print("Evaluating the tuned model:")
tuned_model_id = tuned_model.model_id
tuned_checkpoint = tuned_model.describe()["checkpoints"][-1]
tuned_model_eval = marqtune_client.evaluate(
    dataset_id=eval_dataset.dataset_id,
    model=f"{tuned_model_id}/{tuned_checkpoint}",
    hyperparameters=eval_params,
    wait_for_completion=True,
)

In [None]:
# convenience function to inspect evaluation logs and extract the results
def print_eval_results(description, evaluation):
    results = next(
        (
            json.loads(log["message"][index:].replace("'", '"'))
            for log in evaluation.logs()[-10:]
            if (index := log["message"].find("{'mAP@1000': ")) != -1
        ),
        None,
    )
    print(description)
    print(json.dumps(results, indent=4))


print_eval_results("Evaluation results from base model:", base_model_eval)
print_eval_results("Evaluation results from tuned model:", tuned_model_eval)
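The convenience function above turns a Python-style dictionary in the log message into JSON by swapping quote characters, which works for simple numeric results but would break on values such as `None` or strings containing apostrophes. A slightly more tolerant sketch uses `ast.literal_eval` from the standard library. The helper `extract_metrics` is hypothetical, it assumes the results dictionary is the last `{...}` in the message, and the metric values in the example are placeholders rather than real results.

```python
import ast


def extract_metrics(message, marker="{'mAP@1000': "):
    """Parse the results dictionary out of a single log message, if present."""
    start = message.find(marker)
    if start == -1:
        return None
    end = message.rfind("}")
    if end < start:
        return None
    # literal_eval safely evaluates Python literals (dicts, numbers, strings, None).
    return ast.literal_eval(message[start : end + 1])


# Placeholder message in the same shape the function above searches for.
example = "2024-01-01 00:00:00 - INFO - {'mAP@1000': 0.5, 'NDCG@10': 0.25}"
print(extract_metrics(example))
```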

Again, we've chosen a minimal set of hyperparameters for the evaluation tasks, and you can read about these in the [Evaluation Parameters documentation](https://docs.marqo.ai/2.10/Marqtune/API/evaluation/evaluation_parameters/).

Due to the inherent stochasticity of training and evaluation, the results you see will likely differ from our measurements, but you should still see the tuned model improving on the base model (higher numbers are better).

Picking out one of the reported metrics, NDCG@10 (Normalized Discounted Cumulative Gain, a measure of the ranking and retrieval quality of the model that compares the top 10 model retrievals with the ground truth), we can see that our tuned model performed better than the base model. The other metrics show similarly consistent improvements. Refer to our blog post on [Generalised Contrastive Learning for Multimodal Retrieval and Ranking](https://www.marqo.ai/blog/generalized-contrastive-learning-for-multi-modal-retrieval-and-ranking) for more information as well as an explanation of each of the metrics.
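To make NDCG@10 more concrete, here is a small worked sketch of the textbook formula with linear gain: each retrieved item's relevance is discounted by the log of its rank, and the total is normalized by the best achievable ordering. This is for intuition only. Marqtune's evaluation may use a different gain function or compute the ideal ranking over the full ground truth, and the relevance lists below are toy values, not results from this walkthrough.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (ranks start at 1)."""
    return sum(
        relevance / math.log2(rank + 1)
        for rank, relevance in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same items in ideal (descending) order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance of each retrieved item, in the order the model ranked them.
perfect_ranking = [3, 2, 1, 0]
reversed_ranking = [0, 1, 2, 3]
print(ndcg_at_k(perfect_ranking), ndcg_at_k(reversed_ranking))
```

A ranking that places the most relevant items first scores 1.0, and the score drops as relevant items are pushed further down the list.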

### **6. Download and Cleanup**

At this point, you can download the model to your local disk:

In [None]:
tuned_model.download()

From here you can choose to [create a Marqo index with this custom model](https://gh-previews.marqo.pages.dev/marqtune_walkthrough/Guides/Models-Reference/bring_your_own_model/#3-use-your-model-in-marqo).

Finally, you can optionally clean up the resources you generated:

In [None]:
training_dataset.delete()
eval_dataset.delete()
tuned_model.delete()
base_model_eval.delete()
tuned_model_eval.delete()
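If one of these calls fails (for example because a resource was already removed), the remaining deletions are skipped. A best-effort variant could look like the sketch below. The helper `delete_all` is hypothetical and relies only on each object exposing the `delete()` method used above.

```python
def delete_all(resources):
    """Call .delete() on every resource and collect failures instead of stopping."""
    failures = []
    for name, resource in resources.items():
        try:
            resource.delete()
        except Exception as error:  # keep going so later resources are still removed
            failures.append((name, error))
    return failures
```

It could be called as `delete_all({"training_dataset": training_dataset, "eval_dataset": eval_dataset, "tuned_model": tuned_model, "base_model_eval": base_model_eval, "tuned_model_eval": tuned_model_eval})`, and any entries in the returned list can then be retried or removed through the Marqtune UI.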

### **Conclusion**

This notebook has guided you through the process of fine-tuning a base OpenCLIP model on a multi-modal training dataset with Marqtune. We evaluated the performance of the newly fine-tuned model and found significant improvements compared to the base model. [Marqtune](https://cloud.marqo.ai/) can be used to fine-tune a variety of different models. Try it yourself today!

### **Code**

You can find this code on our GitHub [here](https://github.com/marqo-ai/marqtune-examples/).