<a href="https://colab.research.google.com/github/nateraw/BeautifulSauce/blob/master/notebooks/replicate_batched_bge_large_en_v1_5_inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Running BAAI/bge-large-en-v1.5 on Replicate

In this notebook, we'll see how to run [`BAAI/bge-large-en-v1.5`](https://hf.co/baai/bge-large-en-v1.5), the current SOTA open source model for text embeddings (as of 10/27/23), on Replicate!

This Replicate model is both better than OpenAI embeddings and, as you'll see in the pricing comparison at the end, over 4x cheaper to run for large-scale text embedding.

👀 See the model in the Replicate UI [here](https://replicate.com/nateraw/bge-large-en-v1.5), and more ways to run it (node, curl, docker, etc.) [here](https://replicate.com/nateraw/bge-large-en-v1.5/api).

In [None]:
%%capture
! pip install replicate

# to count tokens
! pip install transformers sentencepiece

# For our example dataset samsum, we need these
! pip install datasets py7zr scikit-learn

Authenticate with [Replicate](https://replicate.com) :)

In [None]:
import os
from getpass import getpass
os.environ["REPLICATE_API_TOKEN"] = getpass("Enter your Replicate API Token from:\nhttps://replicate.com/account/api-tokens\n\nPress Enter when done\n")

## From list of text

Quick example from JSON list of text.

Run this to get the model warmed up. Might take a few mins to spin up, but then the next cells should start running right away :)

Read about how cold boots work on Replicate [here](https://replicate.com/docs/how-does-replicate-work#cold-boots).

In [34]:
import json

import replicate

texts = [
    "the happy cat",
    "the quick brown fox jumps over the lazy dog",
    "lorem ipsum dolor sit amet",
    "this is a test",
]

output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input={"texts": json.dumps(texts)}
)

print(len(output))
print(len(output[0]))

4
1024
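
The output above is a list with one 1024-dimensional embedding per input text. Here's a rough sketch of how you might compare those embeddings with cosine similarity. To keep it runnable without calling the API, `output` below is a random stand-in with the same shape, so the similarity values themselves are meaningless; the explicit normalization step is my assumption, since I'm not relying on the model returning unit-length vectors.

```python
import numpy as np

# Stand-in for the `output` returned by replicate.run above:
# 4 embeddings of 1024 floats each (random values, for illustration only).
rng = np.random.default_rng(0)
output = rng.normal(size=(4, 1024)).tolist()

embeds = np.asarray(output, dtype=np.float32)

# L2-normalize each row so that a dot product equals cosine similarity.
normed = embeds / np.linalg.norm(embeds, axis=1, keepdims=True)
similarity = normed @ normed.T

print(similarity.shape)
```

With real embeddings, `similarity[i, j]` would tell you how close text `i` is to text `j`.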


## From jsonl file

I recommend using a file for predictions if you've got a larger amount of text to embed (>100 embeddings).

Here's a dummy example to show you the best way to do that.

In [None]:
%%writefile dummy_example.jsonl
{"text": "the happy cat"}
{"text": "the quick brown fox jumps over the lazy dog"}
{"text": "lorem ipsum dolor sit amet"}
{"text": "this is a test"}

Writing dummy_example.jsonl


In [None]:
output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input={"path": open("dummy_example.jsonl", "rb")}
)

In [None]:
len(output)

4
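
If your texts already live in a Python list, you don't need `%%writefile`. Here's a minimal sketch that writes them in the same `{"text": ...}` one-object-per-line format used above. The helper name `write_jsonl` and the output filename are just made up for this example.

```python
import json
from pathlib import Path


def write_jsonl(texts, path):
    """Write one {"text": ...} JSON object per line (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        for t in texts:
            f.write(json.dumps({"text": t}) + "\n")
    return Path(path)


texts = [
    "the happy cat",
    "the quick brown fox jumps over the lazy dog",
    "lorem ipsum dolor sit amet",
    "this is a test",
]

jsonl_path = write_jsonl(texts, "dummy_example_from_list.jsonl")
print(jsonl_path.read_text(encoding="utf-8"))
```

The resulting file can then be passed as `path` the same way as `dummy_example.jsonl`.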

## Real Example - big jsonl file (via `datasets` library)

Here, we'll encode the whole [samsum](https://hf.co/datasets/samsum) dataset. ~14k examples.

In [None]:
from pathlib import Path

from datasets import load_dataset

dataset_name = "samsum"
text_field = "dialogue"
outfile_name = "samsum_dialogue.jsonl"

ds = load_dataset(dataset_name, split='train')
ds = ds.remove_columns([x for x in ds.column_names if x != text_field])
ds = ds.rename_column(text_field, "text")
texts = ds["text"]
texts[0]

"Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"

Write to jsonl text file!

In [None]:
ds.to_json(outfile_name)

8083570

Looks like this!

In [None]:
! head -n 5 {outfile_name}

{"text":"Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"}
{"text":"Olivia: Who are you voting for in this election? \r\nOliver: Liberals as always.\r\nOlivia: Me too!!\r\nOliver: Great"}
{"text":"Tim: Hi, what's up?\r\nKim: Bad mood tbh, I was going to do lots of stuff but ended up procrastinating\r\nTim: What did you plan on doing?\r\nKim: Oh you know, uni stuff and unfucking my room\r\nKim: Maybe tomorrow I'll move my ass and do everything\r\nKim: We were going to defrost a fridge so instead of shopping I'll eat some defrosted veggies\r\nTim: For doing stuff I recommend Pomodoro technique where u use breaks for doing chores\r\nTim: It really helps\r\nKim: thanks, maybe I'll do that\r\nTim: I also like using post-its in kaban style"}
{"text":"Edward: Rachel, I think I'm in ove with Bella..\r\nrachel: Dont say anything else..\r\nEdward: What do you mean??\r\nrachel: Open your fu**ing door.. I'm outside"}
{"text":"Sam: hey  overheard r

## Run Predictions

This time, we'll set `convert_to_numpy=True`, which means our response will be a URL to a saved `.npy` file instead of the embeddings themselves. This is recommended when you want to compute a lot of embeddings at once, like we're doing here.

In [None]:
import time

start = time.time()
output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input=dict(
        path=open(outfile_name, "rb"),
        convert_to_numpy=True,
        batch_size=64
    )
)
time_to_embed = time.time() - start
print(f"that took {time_to_embed:.2f} seconds.")

print("output", output)

that took 65.51 seconds.
output https://replicate.delivery/pbxt/ZpzzGcdZf5VbCCgynfufoXww7MtymKITDa0HfAZOOVsvNNJHB/embeddings.npy
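
To get a feel for what `batch_size=64` means at this scale, here's a small sketch of fixed-size batching. This is only my assumption about how the texts get chunked on the model side, not something taken from the model's code; `num_batches` and `iter_batches` are hypothetical helpers.

```python
import math


def num_batches(num_texts, batch_size):
    """Number of batches needed to cover num_texts items."""
    return math.ceil(num_texts / batch_size)


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 14,732 samsum dialogues at 64 per batch
print(num_batches(14732, 64))  # 231

# The last batch is smaller when the count isn't a multiple of batch_size
print([len(b) for b in iter_batches(list(range(130)), 64)])  # [64, 64, 2]
```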


## Load the predictions

Since we chose to convert to numpy, we'll load with numpy here.

In [None]:
import requests
from io import BytesIO

import numpy as np

embeds = np.load(BytesIO(requests.get(output).content))
embeds.shape

(14732, 1024)
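
In case the `BytesIO` trick above looks odd: `np.load` accepts any file-like object, so the downloaded bytes never have to touch disk. Here's a sketch that shows the same round trip without the network, using a small zero array as a stand-in for the real embeddings, and then saves a local copy so you don't have to download again. The filename `embeddings_local.npy` is just an example.

```python
from io import BytesIO

import numpy as np

# Build some .npy bytes in memory, standing in for requests.get(output).content
fake_embeds = np.zeros((3, 1024), dtype=np.float32)
buffer = BytesIO()
np.save(buffer, fake_embeds)
raw_bytes = buffer.getvalue()

# Same call as above: wrap the bytes in a file-like object and load
loaded = np.load(BytesIO(raw_bytes))
print(loaded.shape)

# Optionally keep a copy on disk for later sessions
np.save("embeddings_local.npy", loaded)
reloaded = np.load("embeddings_local.npy")
```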

## Price vs. OpenAI

At the time of writing this, OpenAI's Ada v2 model costs $0.0001 / 1K tokens.

| Model | Usage |
| --- | --- |
| Ada v2 | $0.0001 / 1K tokens |

On Replicate, you're charged by the second for the hardware you're running on. In this case, we're using A40 (Large) instances, which cost $0.000725/sec.

👀 Read more about Replicate's pricing [here](https://replicate.com/pricing).

Below, we'll compare both.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

Prepare a benchmark file with 512 tokens in each example (the max seq length of this model).

Our benchmark will have 5,120,000 tokens.

In [None]:
text = """\
Lorem ipsum dolor sit amet, consectetur adipiscing elit, \
sed do eiusmod tempor a b
""" * 16  # Not long enough, need >= 512 tokens, so multiply by 16

# no truncation (how many input tokens)
print(len(tokenizer.encode(text, truncation=False, add_special_tokens=False)))
# with truncation (just for fun)
print(len(tokenizer.encode(text, truncation=True, add_special_tokens=False)))

512
512


In [None]:
from datasets import Dataset

ds = Dataset.from_dict({"text": [text] * 10000})

In [None]:
def count_tokens(ex):
    ex['num_tokens'] = len(tokenizer.encode(ex["text"], truncation=True, add_special_tokens=False))
    return ex

ds = ds.map(count_tokens)

In [None]:
total_tokens = sum(ds['num_tokens'])
total_tokens

5120000

In [None]:
outfile_name = "benchmark.jsonl"
ds.to_json(outfile_name)

13730000

---

**Note - the time this cell takes depends on you *not* running into a cold boot, which should be the case if you ran the cells above recently.**

Even if you do run into a cold boot, the result should be the same. The way we time here is lazy: it's wall-clock time measured from the notebook, while the real time is the one you see in the Replicate dashboard, which should be ~148 seconds.

In [None]:
import time

start = time.time()
output = replicate.run(
    "nateraw/bge-large-en-v1.5:9cf9f015a9cb9c61d1a2610659cdac4a4ca222f2d3707a68517b18c198a9add1",
    input=dict(
        path=open(outfile_name, "rb"),
        convert_to_numpy=True,
        batch_size=64
    )
)
time_to_embed = time.time() - start
print(f"that took {time_to_embed:.2f} seconds.")

print("output", output)

that took 151.92 seconds.
output https://replicate.delivery/pbxt/VVrkEaiaem3uHCzAqAOmCaewTobbvrmA20QNpJo8tE39VTyRA/embeddings.npy


In [None]:
openai_cost = 0.0001  # per 1k tokens
openai_price = total_tokens / 1000 * openai_cost
print(f"OpenAI price: ${openai_price:.3f} USD")

OpenAI price: $0.512 USD


In [None]:
replicate_price = time_to_embed * 0.000725
print(f"Replicate cost: ${replicate_price:.3f}")

Replicate cost: $0.110


I'm seeing 148 seconds on average in my Replicate dashboard, so I'm hard-coding that here in case the number above is off.

In [None]:
time_to_embed_actual = 148
replicate_price = time_to_embed_actual * 0.000725
print(f"Replicate cost (actual, from dashboard): ${replicate_price:.3f}")

Replicate cost (actual, from dashboard): $0.107
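
To put the two numbers side by side, here's a small sketch that recomputes both prices from the figures above (5,120,000 tokens, $0.0001 / 1K tokens for Ada v2, $0.000725/sec for the A40 (Large), and 148 seconds from the dashboard) and derives the ratio and an effective per-1K-token price. The variable names are mine, and the results only hold for this one benchmark run.

```python
total_tokens = 5_120_000          # benchmark size from above
openai_price_per_1k = 0.0001      # Ada v2, USD per 1K tokens
replicate_price_per_sec = 0.000725  # A40 (Large), USD per second
seconds_on_dashboard = 148        # run time reported in the Replicate dashboard

openai_price = total_tokens / 1000 * openai_price_per_1k
replicate_price = seconds_on_dashboard * replicate_price_per_sec

# How many times more expensive OpenAI is for this benchmark
ratio = openai_price / replicate_price

# Effective Replicate price per 1K tokens, and throughput
replicate_per_1k = replicate_price / (total_tokens / 1000)
tokens_per_sec = total_tokens / seconds_on_dashboard

print(f"OpenAI: ${openai_price:.3f}, Replicate: ${replicate_price:.3f}")
print(f"OpenAI / Replicate price ratio: {ratio:.2f}x")
print(f"Replicate effective price: ${replicate_per_1k:.7f} / 1K tokens")
print(f"Throughput: {tokens_per_sec:,.0f} tokens/sec")
```

Keep in mind the per-token price on Replicate depends on throughput, so shorter texts or a different batch size would change it.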
