<a href="https://colab.research.google.com/github/maiquealmeida/public-colab-notebooks/blob/main/Replicate%20Test%2001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running BAAI/bge-large-en-v1.5 on Replicate

In this notebook, we'll see how to run [`BAAI/bge-large-en-v1.5`](https://hf.co/baai/bge-large-en-v1.5) on Replicate. As of 10/27/23, it's the current SOTA open-source model for text embeddings!

As you'll see, this Replicate model is both better than OpenAI embeddings and 4x cheaper to run for large-scale text embedding.

👀 See the model in the Replicate UI [here](https://replicate.com/nateraw/bge-large-en-v1.5), and more ways to run it (node, curl, docker, etc.) [here](https://replicate.com/nateraw/bge-large-en-v1.5/api).

In [None]:
%%capture
! pip install replicate

# to count tokens
! pip install transformers sentencepiece

# For our example dataset samsum, we need these
! pip install datasets py7zr scikit-learn

Authenticate with [Replicate](https://replicate.com) :)

In [None]:
import os
from getpass import getpass
os.environ["REPLICATE_API_TOKEN"] = getpass("Enter your Replicate API Token from:\nhttps://replicate.com/account/api-tokens\n\nPress Enter when done\n")

## From list of text

A quick example using a JSON-encoded list of texts.

Run this to get the model warmed up. Read about how cold boots work on Replicate [here](https://replicate.com/docs/how-does-replicate-work#cold-boots).

In [None]:
import json

import replicate

texts = [
    "the happy cat",
    "the quick brown fox jumps over the lazy dog",
    "lorem ipsum dolor sit amet",
    "this is a test",
]

output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input={"texts": json.dumps(texts)}
)

print(output)

## From jsonl file

I recommend using a file for predictions if you've got a larger amount of text to embed (>100 embeddings).

Here's a dummy example to show you the best way to do that.

In [None]:
%%writefile dummy_example.jsonl
{"text": "the happy cat"}
{"text": "the quick brown fox jumps over the lazy dog"}
{"text": "lorem ipsum dolor sit amet"}
{"text": "this is a test"}

In [None]:
output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input={"path": open("dummy_example.jsonl", "rb")}
)

In [None]:
len(output)
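
If your texts already live in a Python list, you can write the jsonl file from code instead of using `%%writefile`. Here's a minimal sketch, assuming the same one-object-per-line `{"text": ...}` layout as the dummy file above. The helpers `write_jsonl` and `read_jsonl` and the `my_texts.jsonl` filename are hypothetical names for illustration; they are not part of the Replicate client.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line and return the row count."""
    with Path(path).open("w", encoding="utf-8") as f:
        for t in texts:
            f.write(json.dumps({"text": t}, ensure_ascii=False) + "\n")
    return len(texts)


def read_jsonl(path):
    """Read the file back as a list of strings, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]


example_texts = ["the happy cat", "this is a test"]

# A temporary directory keeps this sketch from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = Path(tmp_dir) / "my_texts.jsonl"
    n_written = write_jsonl(example_texts, jsonl_path)
    round_trip = read_jsonl(jsonl_path)

print(n_written, round_trip)
```

Reading the file back before uploading is a cheap way to catch quoting or newline problems in the text.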

## Real Example - big jsonl file (via `datasets` library)

Here, we'll encode the whole [samsum](https://hf.co/datasets/samsum) dataset. ~14k examples.

In [None]:
from pathlib import Path

from datasets import load_dataset

dataset_name = "samsum"
text_field = "dialogue"
outfile_name = "samsum_dialogue.jsonl"

ds = load_dataset(dataset_name, split='train')
ds = ds.remove_columns([x for x in ds.column_names if x != text_field])
ds = ds.rename_column(text_field, "text")
texts = ds["text"]
texts[0]

Write to jsonl text file!

In [None]:
ds.to_json(outfile_name)

Looks like this!

In [None]:
! head -n 5 {outfile_name}

## Run Predictions

This time, we'll set `convert_to_numpy`, which means the response will be a link to a saved `.npy` file instead of the embeddings themselves. This is recommended when you want to compute a lot of embeddings at once, like we're doing here.

In [None]:
import time

start = time.time()
output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input=dict(
        path=open(outfile_name, "rb"),
        convert_to_numpy=True,
        batch_size=64
    )
)
time_to_embed = time.time() - start
print(f"that took {time_to_embed:.2f} seconds.")

print("output", output)

## Load the predictions

Since we chose to convert to numpy, we'll load with numpy here.

In [None]:
import requests
from io import BytesIO

import numpy as np

embeds = np.load(BytesIO(requests.get(output).content))
embeds.shape
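
Once the embeddings are loaded as a 2D array, a common next step is comparing rows by cosine similarity. Here's a minimal sketch of that idea on a tiny hand-made array rather than the real `embeds`. The function names `cosine_similarity_matrix` and `nearest_neighbor` are hypothetical, and the toy vectors are made up purely for illustration.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the (n, n) matrix of pairwise cosine similarities between rows."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)  # guard against zero-length rows
    return unit @ unit.T


def nearest_neighbor(vectors, query_index):
    """Return the index of the row most similar to row `query_index`."""
    sims = cosine_similarity_matrix(vectors)[query_index].copy()
    sims[query_index] = -np.inf  # exclude the query row itself
    return int(np.argmax(sims))


# Toy stand-in for `embeds`: three 2-dimensional vectors.
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sims = cosine_similarity_matrix(toy)
nn = nearest_neighbor(toy, 0)
print(nn)
```

For an array as large as the full samsum embeddings, the full pairwise matrix can get big, so you may prefer to compare one query row against the rest at a time.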

## Price vs. OpenAI

At the time of writing this, OpenAI's Ada v2 model costs $0.0001 / 1K tokens.

| Model | Price |
|---|---|
| Ada v2 | $0.0001 / 1K tokens |

On Replicate, you're charged by the second for the hardware you're running on. In this case, we're using A40 (Large) instances, which cost $0.000725/sec.

👀 Read more about Replicate's pricing [here](https://replicate.com/pricing).

Below, we'll compare the two.
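
Because one price is per token and the other is per second, it helps to work out the break-even point first: how many billed seconds on Replicate cost the same as embedding a given number of tokens with Ada v2. Here's a rough sketch using only the two prices quoted above. The function names and the 1,000,000-token workload are hypothetical, chosen for illustration.

```python
OPENAI_USD_PER_1K_TOKENS = 0.0001    # Ada v2 price quoted above
REPLICATE_USD_PER_SECOND = 0.000725  # A40 (Large) price quoted above


def openai_usd(total_tokens):
    """Cost in USD of embedding `total_tokens` tokens at the Ada v2 price."""
    return total_tokens / 1000 * OPENAI_USD_PER_1K_TOKENS


def replicate_usd(billed_seconds):
    """Cost in USD of `billed_seconds` of A40 (Large) time."""
    return billed_seconds * REPLICATE_USD_PER_SECOND


def break_even_seconds(total_tokens):
    """Billed seconds at which both options cost the same."""
    return openai_usd(total_tokens) / REPLICATE_USD_PER_SECOND


example_tokens = 1_000_000  # hypothetical workload
print(f"OpenAI: ${openai_usd(example_tokens):.3f} USD")
print(f"Break-even: {break_even_seconds(example_tokens):.1f} billed seconds")
```

If the run finishes in fewer billed seconds than the break-even figure, Replicate comes out cheaper for that workload.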

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

Prepare a benchmark file with 512 tokens in each example (the max seq length of this model).

Our benchmark will have 5,120,000 tokens.

In [None]:
text = """\
Lorem ipsum dolor sit amet, consectetur adipiscing elit, \
sed do eiusmod tempor a b
""" * 16  # Not long enough, need >= 512 tokens, so multiply by 16

# no truncation (how many input tokens)
print(len(tokenizer.encode(text, truncation=False, add_special_tokens=False)))
# with truncation (just for fun)
print(len(tokenizer.encode(text, truncation=True, add_special_tokens=False)))

In [None]:
from datasets import Dataset

ds = Dataset.from_dict({"text": [text] * 10000})

In [None]:
def count_tokens(ex):
    ex['num_tokens'] = len(tokenizer.encode(ex["text"], truncation=True, add_special_tokens=False))
    return ex

ds = ds.map(count_tokens)

In [None]:
total_tokens = sum(ds['num_tokens'])
total_tokens

In [None]:
outfile_name = "benchmark.jsonl"
ds.to_json(outfile_name)

---

Here, we'll run the model using `replicate.predictions.create`, which will return a prediction object that we can use to get the actual time our run is billed for. This way, we can accurately calculate the cost.

In [None]:
model = replicate.models.get("nateraw/bge-large-en-v1.5")
version = model.latest_version
prediction = replicate.predictions.create(
    version,
    input=dict(
        path=open(outfile_name, "rb"),
        convert_to_numpy=True,
        batch_size=64
    )
)
prediction.wait()

In [None]:
time_to_embed = prediction.metrics['predict_time']
print(f"that took {time_to_embed:.2f} seconds.")

In [None]:
openai_cost = 0.0001  # per 1k tokens
openai_price = total_tokens / 1000 * openai_cost
print(f"OpenAI price: ${openai_price:.3f} USD")

In [None]:
replicate_price = time_to_embed * 0.000725  # billed seconds x A40 (Large) price per second
print(f"Replicate price: ${replicate_price:.3f} USD")