Before starting batch processing, the cost should be estimated. To do so, the batch request file is tokenized with the Hugging Face `transformers` library, using the tokenizer for `llama-3.1-8b`.

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

In [10]:
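import json

# Each non-empty line of the JSONL file is one batch request.
# Blank lines (e.g. a trailing newline) are skipped so json.loads does not fail on them.
with open('../test-batch-syn-queries.jsonl', 'r') as file:
    calls = [json.loads(line) for line in file if line.strip()]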

calls[0]

{'custom_id': 'doc0',
 'method': 'POST',
 'url': '/v1/chat/completions',
 'body': {'model': 'llama-3.1-8b-instant',
  'messages': [{'role': 'system',
    'content': 'You are a helpful AI assistant.\nGenerate most relevant and short search question for the document excerpt provided by the user.\nReturn only the generated query without quotes or trailing punctuation marks.'},
   {'role': 'user',
    'content': "Minority interest In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]"}]}}

In [11]:
calls[0]['body']['messages']

[{'role': 'system',
  'content': 'You are a helpful AI assistant.\nGenerate most relevant and short search question for the document excerpt provided by the user.\nReturn only the generated query without quotes or trailing punctuation marks.'},
 {'role': 'user',
  'content': "Minority interest In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]"}]

In [14]:
chat_str = tokenizer.apply_chat_template(calls[0]['body']['messages'], tokenize=False)
chat_str

"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 Jul 2024\n\nYou are a helpful AI assistant.\nGenerate most relevant and short search question for the document excerpt provided by the user.\nReturn only the generated query without quotes or trailing punctuation marks.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nMinority interest In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]<|eot_id|>"

In [16]:
tokens = tokenizer.apply_chat_template(calls[0]['body']['messages'], tokenize=True)
len(tokens)

136

In [18]:
# Count input tokens for every request, printing progress every 100,000 calls.
total_tokens = 0
for i, call in enumerate(calls):
    if i % 100_000 == 0:
        print(f'Reached {i}th call. Total tokens: {total_tokens}.')

    chat = call['body']['messages']
    input_tokens = tokenizer.apply_chat_template(chat, tokenize=True)
    total_tokens += len(input_tokens)

Reached 0th call. Total tokens: 0.
Reached 100000th call. Total tokens: 18300360.
Reached 200000th call. Total tokens: 36235060.
Reached 300000th call. Total tokens: 54098371.
Reached 400000th call. Total tokens: 71861357.
Reached 500000th call. Total tokens: 89629704.
Reached 600000th call. Total tokens: 107128596.
Reached 700000th call. Total tokens: 124768551.
Reached 800000th call. Total tokens: 142355643.
Reached 900000th call. Total tokens: 159878336.
Reached 1000000th call. Total tokens: 177354372.
Reached 1100000th call. Total tokens: 194829832.
Reached 1200000th call. Total tokens: 212097375.
Reached 1300000th call. Total tokens: 229576563.
Reached 1400000th call. Total tokens: 246873547.
Reached 1500000th call. Total tokens: 264179256.
Reached 1600000th call. Total tokens: 281414757.
Reached 1700000th call. Total tokens: 298771323.
Reached 1800000th call. Total tokens: 315969639.
Reached 1900000th call. Total tokens: 333226101.
Reached 2000000th call. Total tokens: 350495059.

In [19]:
total_tokens

466948568

That is 466,948,568 input tokens, or roughly 467 million.

**It is a lot.**
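
Tokenizing all 2.68 million calls takes a while, so a quicker rough figure can be obtained by tokenizing only a random sample and extrapolating. The sketch below is one possible way to do that, not what the notebook ran. `estimate_total_tokens`, `count_tokens`, and `demo_items` are hypothetical names introduced here, and the demo uses a whitespace word count as a stand-in for the real tokenizer.

```python
import random
from typing import Callable, Sequence


def estimate_total_tokens(
    items: Sequence,
    count_tokens: Callable[[object], int],
    sample_size: int = 10_000,
    seed: int = 0,
) -> int:
    """Extrapolate a total token count from a random sample of items."""
    if not items:
        return 0
    rng = random.Random(seed)          # fixed seed keeps the estimate reproducible
    k = min(sample_size, len(items))
    sample = rng.sample(items, k)      # sampling without replacement
    mean_tokens = sum(count_tokens(item) for item in sample) / k
    return round(mean_tokens * len(items))


# Stand-in counter for demonstration only: whitespace-separated words.
# With the notebook's objects, count_tokens would wrap the tokenizer call
# used in the loop above (an assumption about how it would be wired in).
demo_items = ["a b c", "d e f", "g h i", "j k l"]
print(estimate_total_tokens(demo_items, lambda s: len(s.split()), sample_size=2))
```

The estimate is only as good as the sample: if document lengths vary a lot, a larger `sample_size` gives a more stable number.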

### Price estimation (input only)

Llama 3.1 8B Instant (128k context): \$0.05 per million input tokens, i.e. 20 million tokens per dollar.



In [21]:
res = total_tokens // 20_000_000
res

23

That is 23 full blocks of 20 million tokens, i.e. 23 dollars.

In [26]:
left = total_tokens - (res * 20_000_000)
left

6948568

`left` is approximately 7 million tokens, priced at 0.05 dollars per million.

In [28]:
23 * 1 + 7 * 0.05

23.35

$23.35 just for input tokens, not counting output tokens. However, with batch processing, this price could be cut in half, resulting in:

In [29]:
23.35 / 2

11.675
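
The same input estimate can be computed in one step, without splitting the total into 20-million blocks and a remainder. This is a minimal sketch: `token_cost_usd` is a hypothetical helper, the price is the one quoted above, and the 50% batch discount is the assumption made in the text, not a verified rate.

```python
def token_cost_usd(tokens: int, usd_per_million: float, discount: float = 0.0) -> float:
    """Cost in USD for `tokens` tokens at a per-million price, after an optional discount."""
    return tokens / 1_000_000 * usd_per_million * (1 - discount)


input_tokens = 466_948_568  # total_tokens from the counting loop above

full_price = token_cost_usd(input_tokens, 0.05)
# Assumption from the text: batch processing halves the price.
batch_price = token_cost_usd(input_tokens, 0.05, discount=0.5)

print(round(full_price, 2), round(batch_price, 2))
```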

### Price estimation (output only)

Output tokens cost $0.08 per million, i.e. 12.5 million tokens per dollar.

In [31]:
len(calls)

2681468

Suppose one query is generated for each of the 2,681,468 calls, and each query is 16 tokens long plus 2 special tokens, i.e. 18 tokens per query. The total number of output tokens is then:

In [35]:
total_output_tokens = len(calls) * 18
total_output_tokens

48266424
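
The 16-token query length is a guess, so it may help to see how the output total moves with it. The sketch below uses a hypothetical helper, `estimated_output_tokens`, and the alternative lengths of 8 and 32 tokens are illustrative values only, not measurements.

```python
def estimated_output_tokens(n_calls: int, query_tokens: int, special_tokens: int = 2) -> int:
    """Total output tokens if every call returns one query of the given length."""
    return n_calls * (query_tokens + special_tokens)


n_calls = 2_681_468  # len(calls) from the cell above

# 16 is the length assumed in the text; 8 and 32 are illustrative alternatives.
for query_tokens in (8, 16, 32):
    total = estimated_output_tokens(n_calls, query_tokens)
    print(f"{query_tokens:>2} tokens per query -> {total:,} output tokens")
```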

In [36]:
res = total_output_tokens // 12_500_000
res

3

In [37]:
left = total_output_tokens - (res * 12_500_000)
left

10766424

`left` is approximately 11 million tokens (rounded up from 10.77 million), priced at $0.08 per million.

In [40]:
3 * 1 + 11 * 0.08

3.88

With batch processing: 

In [41]:
3.88 / 2 

1.94

| type     | price (USD)            |
| -------- | ---------------------- |
| no batch | 3.88 + 23.35 = 27.23   |
| batch    | 1.94 + 11.675 = 13.615 |
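
As a cross-check, the totals in the table can be recomputed from the raw token counts without rounding the remainders to whole millions. This is a sketch using the prices and token counts quoted above. The dictionary names are introduced here for illustration, and the 50% batch discount is the assumption made earlier in the text.

```python
PRICES_USD_PER_MILLION = {"input": 0.05, "output": 0.08}
TOKENS = {"input": 466_948_568, "output": 48_266_424}
BATCH_DISCOUNT = 0.5  # assumption: batch processing halves the price

# Price each token type at its own per-million rate and add them up.
no_batch = sum(TOKENS[kind] / 1_000_000 * PRICES_USD_PER_MILLION[kind] for kind in TOKENS)
batch = no_batch * (1 - BATCH_DISCOUNT)

print(f"no batch: {no_batch:.2f} USD")
print(f"batch:    {batch:.2f} USD")
```

The results come out a couple of cents below the table values, because the table rounds the 10.77 million leftover output tokens up to 11 million.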

Even with batch processing, $13.615 is still a lot. It feels like deploying a quantized model on my PC is the only option left.