Before processing the dataset with a model hosted using vLLM, max-model-len should be estimated to determine the optimal token count for efficient GPU usage.

In [1]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

  from .autonotebook import tqdm as notebook_tqdm


In [2]:
from mteb.tasks import NQ

nq = NQ() 
nq.load_data()

Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4201/4201 [00:00<00:00, 14675.49 examples/s]


In [3]:
split = nq.eval_splits[0]
nq.eval_splits

['test']

In [4]:
queries = list(nq.queries[split].items())
queries[:3]

[('test0', 'what is non controlling interest on balance sheet'),
 ('test1', 'how many episodes are in chicago fire season 4'),
 ('test2', 'who sings love will keep us alive by the eagles')]

In [6]:
corpus = list(nq.corpus[split].items())
corpus[:3]

[('doc0',
  "Minority interest In accounting, minority interest (or non-controlling interest) is the portion of a subsidiary corporation's stock that is not owned by the parent corporation. The magnitude of the minority interest in the subsidiary company is generally less than 50% of outstanding shares, or the corporation would generally cease to be a subsidiary of the parent.[1]"),
 ('doc1',
  'Minority interest It is, however, possible (such as through special voting rights) for a controlling interest requiring consolidation to be achieved without exceeding 50% ownership, depending on the accounting standards being employed. Minority interest belongs to other investors and is reported on the consolidated balance sheet of the owning company to reflect the claim on assets belonging to other, non-controlling shareholders. Also, minority interest is reported on the consolidated income statement as a share of profit belonging to minority shareholders.'),
 ('doc2',
  "Minority interest The

In [8]:
queries = sorted(queries, key=lambda x: len(tokenizer.encode(x[1])), reverse=True)

In [9]:
queries[0]

('test2302',
 'when did bihar bifurcate from bengal and some parts of chota nagpur merged into bengal')

In [12]:
len(tokenizer.encode(queries[0]))

29

In [13]:
corpus = sorted(corpus, key=lambda x: len(tokenizer.encode(x[1])), reverse=True)

In [14]:
corpus[0]

('doc2391438',
 'Shiva Sahasranama Adaikkalam Kaththan       -        அடைக்கலம் காத்தான் \nAdaivarkkamudhan          -        அடைவார்க்கமுதன்\nAdaivorkkiniyan           -        அடைவோர்க்கினியன்\nAdalarasan                -        ஆடலரசன்\nAdalazagan                -        ஆடலழகன் \nAdalerran                 -        அடலேற்றன்\nAdalvallan                -        ஆடல்வல்லான்\nAdalvidaippagan           -        அடல்விடைப்பாகன்\nAdalvidaiyan              -        அடல்விடையான்\nAdangakkolvan             -        அடங்கக்கொள்வான்\nAdaravan                  -        ஆடரவன்\nAdarchadaiyan             -        அடர்ச்சடையன்\nAdarko                    -        ஆடற்கோ\nAdhaladaiyan              -        அதளாடையன் \nAdhi                      -        ஆதி \nAdhibagavan               -        ஆதிபகவன்\nAdhipuranan               -        ஆதிபுராணன் \nAdhiraiyan                -        ஆதிரையன் \nAdhirthudiyan             -        அதிர்துடியன் \nAdhirunkazalon            -        அதிருங்கழலோன் \nAdhiy

In [15]:
len(tokenizer.encode(corpus[0]))

4610

In [17]:
list(nq.relevant_docs[split].items())[:3]

[('test0', {'doc0': 1, 'doc1': 1}),
 ('test1', {'doc6': 1}),
 ('test2', {'doc10': 1})]

In [21]:
for qrel in nq.relevant_docs[split].items():
    if 'doc2391438' in qrel[1].keys():
        print(qrel[0])
        break

Hmm, it seems that a better strategy, is to use an average document in corpus size. 

In [None]:
total_tokens = 0
for _, text in corpus:
    tokens = tokenizer.encode(text, add_special_tokens=False)
    total_tokens += len(tokens)

mean_tokens = total_tokens / len(corpus)

In [None]:
mean_tokens