# Data Synthesis

<a href="https://colab.research.google.com/drive/1sX5K0eophlHXu1S7joysZJUj1zfh28Gi?usp=sharing"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg"></a>

When using Finetuner, each item in your training data must either have a label or have a similarity score comparing it to some other item. See the Finetuner documentation on [preparing training data](https://finetuner.jina.ai/walkthrough/create-training-data/).
If your data is not labelled, and you don't want to spend time manually organizing and labelling it, you can use the `finetuner.synthesize` function to automatically construct a dataset that can be used in training.

This guide will walk you through the process of using the `finetuner.synthesize` function, as well as how to use its output for training.

![synthesis_flowchart](https://user-images.githubusercontent.com/58855099/240291609-5b3711d6-7c1b-4656-882e-5de9b488d395.png)


### Install

In [None]:
!pip install 'finetuner[full]'


## Prepare Synthesis Data
To perform synthesis, we need a query dataset and a corpus dataset, with the query dataset containing examples of user queries, and the corpus containing example search results.

We'll be generating training data based on the electronics section of the [Amazon cross-market dataset](https://xmrec.github.io/data/us/), a collection of products, ratings and reviews taken from Amazon. For our purposes, we will only be using the product names.  

We use the `xmarket_queries_da` and `xmarket_corpus_da` datasets, which we have already pre-processed and made available on the Jina AI Cloud. You can access them using `DocumentArray.pull`:

In [None]:
import finetuner
from docarray import Document, DocumentArray

finetuner.login(force=True)


In [None]:
query_data = DocumentArray.pull('finetuner/xmarket_queries_da')
corpus_data = DocumentArray.pull('finetuner/xmarket_corpus_da')

query_data.summary()
query_data[0].summary()

The format of the data in these `DocumentArray`s is very simple: each `Document` wraps a single item, contained in its `text` field.

### Choosing models
Data synthesis jobs require two different models: a relation miner and a cross encoder.  

The relation miner is used to identify one similar and several dissimilar documents from the corpus data for each query in the query data.  

The cross encoder is then used to calculate a similarity between each query and its corresponding (dis)similar documents.  

Currently, we only support synthesis jobs for data in English, so when choosing a model you can just provide the `synthesis_model_en` object which contains the appropriate models for each of these tasks.
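
To make the two stages concrete, here is a minimal toy sketch of the mine-then-score flow described above. It is not Finetuner's implementation: the `token_overlap` scorer stands in for both models, the rule "next-ranked documents are the dissimilar ones" is an assumption, and the record layout with its field names is hypothetical.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str, str], float]


def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase tokens: a toy stand-in for a real model."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def synthesize_toy(
    queries: List[str],
    corpus: List[str],
    num_relations: int = 3,
    miner: Scorer = token_overlap,          # assumption: toy relation miner
    cross_encoder: Scorer = token_overlap,  # assumption: toy cross encoder
) -> List[Dict]:
    """Return one record per (query, dissimilar document) pair."""
    records = []
    for query in queries:
        # Stage 1 (relation mining): rank the corpus against the query, then
        # keep one similar and (num_relations - 1) dissimilar documents.
        ranked = sorted(corpus, key=lambda doc: miner(query, doc), reverse=True)
        similar, dissimilar = ranked[0], ranked[1:num_relations]
        # Stage 2 (cross encoding): score the query against each mined document.
        similar_score = cross_encoder(query, similar)
        for doc in dissimilar:
            records.append(
                {
                    "query": query,
                    "similar": similar,
                    "dissimilar": doc,
                    "similar_score": similar_score,
                    "dissimilar_score": cross_encoder(query, doc),
                }
            )
    return records


corpus = [
    "usb c charging cable",
    "wireless gaming mouse",
    "hdmi cable 2m",
    "bluetooth speaker",
]
for record in synthesize_toy(["usb c cable"], corpus, num_relations=3):
    print(record)
```

In a real synthesis run the two stages use different, much stronger models; the sketch only shows how one query turns into several scored records.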

## Start Synthesis Run
Now that we have the query and corpus datasets available as `DocumentArray`s, we can begin our synthesis run. We only need to provide the query and corpus data, passed here by their names on the Jina AI Cloud, and the models that we are using.  

In this example, the `num_relations` parameter is set to 10. This parameter determines how many documents are retrieved for each query: one similar document and `(num_relations - 1)` dissimilar documents. The dissimilar documents are what make up the generated data, so the size of the generated `DocumentArray` is always equal to `len(query_data) * (num_relations - 1)`. By default this parameter is set to 3, in which case the generated dataset is twice as large as the query dataset.
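
The size formula above is easy to check before submitting a run. The helper below is a small sketch; `synthesized_size` is a hypothetical name and not part of Finetuner, and the query count of 1000 is only an illustrative number.

```python
def synthesized_size(num_queries: int, num_relations: int = 3) -> int:
    """Expected number of generated documents: queries * (num_relations - 1)."""
    if num_relations < 2:
        # With fewer than two relations there are no dissimilar documents.
        raise ValueError("num_relations must be at least 2")
    return num_queries * (num_relations - 1)


# Illustrative query count; substitute len(query_data) for a real dataset.
print(synthesized_size(1000, num_relations=10))
print(synthesized_size(1000))  # default num_relations=3
```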

In [None]:
from finetuner.model import synthesis_model_en

synthesis_run = finetuner.synthesize(
    query_data='finetuner/xmarket_queries_da',
    corpus_data='finetuner/xmarket_corpus_da',
    models=synthesis_model_en,
    num_relations=10,
)


### Monitoring

Now that we've created a run, we can check its status. You can monitor the run's progress with the function `synthesis_run.status()`, and the logs with `synthesis_run.logs()` or `synthesis_run.stream_logs()`. 

*Note: The job will take around 15 minutes to finish.*

In [None]:
for entry in synthesis_run.stream_logs():
  print(entry)


[12:25:24] INFO     Starting finetuner generation run ...                                                __main__.py:350
DEBUG    Found Jina AI Cloud authentication token                                             __main__.py:362
DEBUG    Running in online mode                                                               __main__.py:363
INFO     Reading config ...                                                                   __main__.py:370
DEBUG    Reading config from stream                                                           __main__.py:382
INFO     Parsing config ...                                                                   __main__.py:385
INFO     Config loaded 📜                                                                     __main__.py:389
INFO     Run name: nostalgic-chaplygin                                                        __main__.py:391
INFO     Experiment name: default                                                             __main__.py:392


Depending on the size of the training data, some runs might take up to several hours. You can easily reconnect to your run later to monitor its status.

```python
import finetuner

finetuner.login()
synthesis_run = finetuner.get_run('my-synthesis-run')
print(f'Run status: {synthesis_run.status()}')
```
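
If you would rather block until the run finishes than check by hand, a small polling loop is enough. The sketch below is generic and does not call Finetuner itself: `wait_for_run` is a hypothetical helper, and the terminal state names `FINISHED` and `FAILED`, as well as the shape of the value returned by `status()`, are assumptions you should check against your installed version.

```python
import time
from typing import Callable, Iterable


def wait_for_run(
    get_status: Callable[[], str],
    poll_seconds: float = 60.0,
    timeout_seconds: float = 6 * 3600,
    terminal_states: Iterable[str] = ("FINISHED", "FAILED"),  # assumed names
) -> str:
    """Call get_status() until it returns a terminal state or time runs out."""
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still in state {status!r} after timeout")
        time.sleep(poll_seconds)


# Hypothetical usage, assuming status() returns a dict with a 'status' key:
# final_state = wait_for_run(lambda: synthesis_run.status()['status'])
```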

### Retrieving the data

Once the synthesis run has finished, the synthesised data will be pushed to the Jina AI Cloud under your account. The name of the pushed `DocumentArray` will be stored in `synthesis_run.train_data`.

In [None]:
train_data_name = synthesis_run.train_data
train_data = DocumentArray.pull(train_data_name)
train_data.summary()

## Start Training with Synthesised Data

Using your synthesised data, you can now train a model using the `MarginMSELoss` function.  
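
For intuition, margin MSE as it is usually defined compares the gap between the scores a model gives to the similar and dissimilar documents with the same gap in the reference (cross encoder) scores. The plain-Python sketch below follows that common definition; it is an illustration, not Finetuner's `MarginMSELoss` code, and the function and argument names are hypothetical.

```python
from typing import Sequence


def margin_mse(
    model_pos: Sequence[float],
    model_neg: Sequence[float],
    target_pos: Sequence[float],
    target_neg: Sequence[float],
) -> float:
    """Mean squared error between model margins and target margins."""
    if not (len(model_pos) == len(model_neg) == len(target_pos) == len(target_neg)):
        raise ValueError("all score lists must have the same length")
    if not model_pos:
        raise ValueError("score lists must not be empty")
    squared_errors = [
        # margin = score(query, similar) - score(query, dissimilar)
        ((mp - mn) - (tp - tn)) ** 2
        for mp, mn, tp, tn in zip(model_pos, model_neg, target_pos, target_neg)
    ]
    return sum(squared_errors) / len(squared_errors)


# Two triplets: the first model margin (1.0) is too small compared with the
# target margin (2.0); the second matches its target exactly.
print(margin_mse([2.0, 1.0], [1.0, 1.0], [3.0, 2.0], [1.0, 2.0]))
```

Because only the margins are compared, the model does not need to reproduce the reference scores themselves, only how much better the similar document is than the dissimilar one.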

We have prepared the index and query datasets `xmarket-gpl-eval-index` and `xmarket-gpl-eval-queries` so that we can evaluate the improvement provided by training on this data:

In [None]:
from finetuner.callback import EvaluationCallback

training_run = finetuner.fit(
    model='sbert-base-en',
    train_data=synthesis_run.train_data,
    loss='MarginMSELoss',
    optimizer='Adam',
    learning_rate=1e-5,
    epochs=3,
    callbacks=[
        EvaluationCallback(
            query_data='finetuner/xmarket-gpl-eval-queries',
            index_data='finetuner/xmarket-gpl-eval-index',
            batch_size=32,
        )
    ]
)

Just as before, you can monitor the progress of your run using `training_run.stream_logs()`:

In [None]:
for entry in training_run.stream_logs():
  print(entry)

[12:52:10] INFO     Starting finetuner training run ...                                                  __main__.py:350
DEBUG    Found Jina AI Cloud authentication token                                             __main__.py:362
DEBUG    Running in online mode                                                               __main__.py:363
INFO     Reading config ...                                                                   __main__.py:370
DEBUG    Reading config from stream                                                           __main__.py:382
INFO     Parsing config ...                                                                   __main__.py:385
INFO     Config loaded 📜                                                                     __main__.py:389
INFO     Run name: epic-torvalds                                                              __main__.py:391
INFO     Experiment name: default                                                             __main__.py:392


### Evaluating

Our `EvaluationCallback` during fine-tuning ensures that after each epoch, an evaluation of our model is run. We can access the evaluation results in the logs using `print(training_run.logs())`:

```bash
Training [3/3] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 470/470 0:00:00 0:02:34 • loss: 5.191
INFO     Done ✨                                                                              __main__.py:192
DEBUG    Finetuning took 0 days, 0 hours 11 minutes and 55 seconds                            __main__.py:194
INFO     Metric: 'sentence-transformers/msmarco-distilbert-base-v3_precision_at_k' before     __main__.py:207
fine-tuning:  0.16069 after fine-tuning: 0.19134
INFO     Metric: 'sentence-transformers/msmarco-distilbert-base-v3_recall_at_k' before        __main__.py:207
fine-tuning:  0.29887 after fine-tuning: 0.34635
INFO     Metric: 'sentence-transformers/msmarco-distilbert-base-v3_f1_score_at_k' before      __main__.py:207
fine-tuning:  0.13676 after fine-tuning: 0.16519
INFO     Metric: 'sentence-transformers/msmarco-distilbert-base-v3_hit_at_k' before           __main__.py:207
fine-tuning:  0.64277 after fine-tuning: 0.66069
INFO     Metric: 'sentence-transformers/msmarco-distilbert-base-v3_average_precision' before  __main__.py:207
fine-tuning:  0.34337 after fine-tuning: 0.39265
INFO     Metric: 'sentence-transformers/msmarco-distilbert-base-v3_reciprocal_rank' before    __main__.py:207
fine-tuning:  0.39998 after fine-tuning: 0.44711
INFO     Metric: 'sentence-transformers/msmarco-distilbert-base-v3_dcg_at_k' before           __main__.py:207
fine-tuning:  1.49618 after fine-tuning: 1.77899
INFO     Building the artifact ...                                                            __main__.py:231
INFO     Pushing artifact to Jina AI Cloud ...                                                __main__.py:260
```
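
To put the before and after figures in perspective, you can turn them into relative improvements. The sketch below copies the values by hand from the log above; the dictionary layout and the `relative_gain` helper are illustrative and not part of Finetuner.

```python
# (before, after) values copied from the evaluation log above.
metrics = {
    "precision_at_k": (0.16069, 0.19134),
    "recall_at_k": (0.29887, 0.34635),
    "f1_score_at_k": (0.13676, 0.16519),
    "hit_at_k": (0.64277, 0.66069),
    "average_precision": (0.34337, 0.39265),
    "reciprocal_rank": (0.39998, 0.44711),
    "dcg_at_k": (1.49618, 1.77899),
}


def relative_gain(before: float, after: float) -> float:
    """Relative change of a metric, e.g. 0.10 means a 10% improvement."""
    return (after - before) / before


gains = {name: relative_gain(*values) for name, values in metrics.items()}
for name, gain in gains.items():
    print(f"{name:>18}: {gain:+.1%}")
```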

The amount of improvement is highly dependent on the amount of data generated during synthesis: **as the amount of training data increases, so will the performance of the finetuned model**. To increase the number of documents generated, we can either increase the size of the query dataset provided to the `finetuner.synthesize` function, or increase the value of the `num_relations` parameter, which results in more documents being generated per query. Conversely, choosing a smaller value for `num_relations` results in shorter generation and training times, but less improvement after training.  
To better understand the relationship between the amount of training data and the increase in performance, have a look at the [how much data?](https://finetuner.jina.ai/advanced-topics/budget/) section of our documentation.
