<a href="https://colab.research.google.com/github/jlonge4/gen_ai_utils/blob/main/jinaai_late_chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install transformers --upgrade

In [1]:
import requests

url = "https://www.gutenberg.org/files/244/244-0.txt"  # Sherlock Holmes text

response = requests.get(url)

response.raise_for_status()  # fail loudly instead of leaving sample_text undefined
sample_text = response.text

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

model_name = "jinaai/jina-embeddings-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
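# trust_remote_code=True loads the model's custom code, which provides model.encode()
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to('cuda')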

In [99]:
# max_length only takes effect reliably when truncation is requested explicitly
inputs = tokenizer(sample_text[1700:38200], return_tensors='pt', truncation=True, max_length=8192).to('cuda')
len(inputs['input_ids'][0])

8031

In [100]:
with torch.no_grad():
    outputs = model(**inputs)
    token_embeddings = outputs.last_hidden_state  # Shape: (1, num_tokens, embedding_dim)

In [101]:
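outputs = []  # rebinds `outputs`; the model output stays available as token_embeddings
# One entry per 350-token chunk, rounded up to cover the whole tokenized sequence
chunk_sizes = [350] * (token_embeddings.shape[1] // 350 + 1)

for emb in [token_embeddings[0]]:  # single document: one (num_tokens, embedding_dim) tensor
    pooled_chunks = []
    start = 0

    for size in chunk_sizes:
        end = start + size
        if end > len(emb):
            # Stop at the first chunk that would run past the sequence,
            # so the trailing partial chunk is not pooled
            break
        chunk = emb[start:end]
        pooled_chunk = chunk.mean(dim=0)  # mean-pool token embeddings into one chunk vector
        pooled_chunks.append(pooled_chunk)
        start = end

    outputs.append(pooled_chunks)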

In [102]:
print("Number of chunks:", len(outputs[0]))
print("Embedding dimension:", outputs[0][0].shape)

Number of chunks: 22
Embedding dimension: torch.Size([1024])
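
The count of 22 follows from the pooling loop stopping at the first chunk that would overrun the 8031-token sequence: 22 full chunks cover 7700 tokens, and the remaining 331 tokens are left unpooled. As a minimal sketch, assuming a hypothetical helper named `chunk_spans` (it is not part of this notebook or of any library), spans that also keep the remainder could be computed like this:

```python
def chunk_spans(num_tokens, chunk_size):
    """Return (start, end) token index pairs covering num_tokens.

    Hypothetical helper for illustration: unlike the pooling loop,
    it keeps a final span that is shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_tokens))
        for start in range(0, num_tokens, chunk_size)
    ]


spans = chunk_spans(8031, 350)
print(len(spans))  # 23
print(spans[-1])   # (7700, 8031)
```

Each span could then be pooled with `emb[start:end].mean(dim=0)`, as in the loop. Whether a short trailing chunk is worth embedding is a judgment call, since its mean is taken over fewer tokens.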


In [109]:
from transformers import AutoTokenizer
import torch

def chunk_text(text, tokenizer, chunk_size=350):
    # Request truncation explicitly so max_length is honored without a warning
    tokens = tokenizer.encode(text, add_special_tokens=False, truncation=True, max_length=8192)

    num_chunks = len(tokens) // chunk_size + (1 if len(tokens) % chunk_size != 0 else 0)

    chunks = []
    for i in range(num_chunks):
        start = i * chunk_size
        end = min((i + 1) * chunk_size, len(tokens))
        chunk = tokens[start:end]
        chunks.append(chunk)

    text_chunks = [tokenizer.decode(chunk) for chunk in chunks]

    return text_chunks

all_tokens = tokenizer.encode(sample_text[1700:38200], add_special_tokens=False)

chunk_size = 350
num_chunks = len(all_tokens) // chunk_size + (1 if len(all_tokens) % chunk_size != 0 else 0)
chunk_sizes = [chunk_size] * num_chunks

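# Note: this chunks the full sample_text, not the sample_text[1700:38200] slice
# embedded above, so chunk counts and indices differ between the two
chunked_texts = chunk_text(sample_text, tokenizer, chunk_size)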

print(f"Number of chunks: {len(chunked_texts)}")
print(f"Number of chunk_sizes: {len(chunk_sizes)}")

for i, chunk in enumerate(chunked_texts[:3]):
    print(f"\nChunk {i+1}:")
    print(chunk[:200] + "...")

Number of chunks: 24
Number of chunk_sizes: 23

Chunk 1:
i » ¿ the project gutenberg ebook of a study in scarlet, by arthur conan doyle this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with a...

Chunk 2:
being a reprint from the reminiscences of _ john h. watson, m. d., _ late of the army medical department. _ ) chapter i. mr. sherlock holmes. in the year 1878 i took my degree of doctor of medicine of...

Chunk 3:
rallied, and had already improved so far as to be able to walk about the wards, and even to bask a little upon the verandah, when i was struck down by enteric fever, that curse of our indian possessio...


In [103]:
query_text = "Where did Sherlock find his regiment?"

qv = model.encode([query_text])

In [105]:
import numpy as np
import torch
import torch.nn.functional as F

if isinstance(qv, np.ndarray):
    qv = torch.from_numpy(qv).float().to('cuda')

if qv.dim() == 1:
    qv = qv.unsqueeze(0)

qv_norm = F.normalize(qv, p=2, dim=1)

In [122]:
import torch
import torch.nn.functional as F

query_embedding_norm = qv_norm.to(torch.float32)

# outputs[0] is a list of pooled chunk tensors; stack them into (num_chunks, embedding_dim)
chunked_embeddings = torch.stack(outputs[0]).to('cuda', dtype=torch.float32)

# L2-normalize so the matrix product below yields cosine similarities
chunked_embeddings_norm = F.normalize(chunked_embeddings, p=2, dim=1)

cosine_similarities = torch.mm(query_embedding_norm, chunked_embeddings_norm.t())
top_k = 5
top_results = torch.topk(cosine_similarities.squeeze(), k=min(top_k, cosine_similarities.numel()))

for i, (score, idx) in enumerate(zip(top_results.values, top_results.indices)):
    print(f"Rank {i+1}: Chunk {idx.item()}, Score: {score.item():.4f}")

Rank 1: Chunk 1, Score: 0.2363
Rank 2: Chunk 21, Score: 0.2338
Rank 3: Chunk 2, Score: 0.2331
Rank 4: Chunk 20, Score: 0.2319
Rank 5: Chunk 19, Score: 0.2317
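
The ranking above relies on the fact that, once both sides are L2-normalized, a dot product equals cosine similarity. The following is a dependency-free sketch of the same idea on toy vectors. The function names `l2_normalize` and `rank_by_cosine` and the toy data are illustrative assumptions, not part of this notebook:

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_by_cosine(query, candidates, top_k=5):
    """Return up to top_k (index, score) pairs, highest cosine similarity first."""
    q = l2_normalize(query)
    scores = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        # Dot product of unit vectors is the cosine of the angle between them
        scores.append((sum(a * b for a, b in zip(q, c)), idx))
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scores[:top_k]]


toy_query = [1.0, 0.0]
toy_chunks = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
for idx, score in rank_by_cosine(toy_query, toy_chunks, top_k=2):
    print(f"Chunk {idx}, Score: {score:.4f}")
```

In the cell above, `F.normalize` followed by `torch.mm` and `torch.topk` performs the same three steps in batched form on the GPU.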


In [110]:
chunked_texts[1]

'being a reprint from the reminiscences of _ john h. watson, m. d., _ late of the army medical department. _ ) chapter i. mr. sherlock holmes. in the year 1878 i took my degree of doctor of medicine of the university of london, and proceeded to netley to go through the course prescribed for surgeons in the army. having completed my studies there, i was duly attached to the fifth northumberland fusiliers as assistant surgeon. the regiment was stationed in india at the time, and before i could join it, the second afghan war had broken out. on landing at bombay, i learned that my corps had advanced through the passes, and was already deep in the enemyas country. i followed, however, with many other officers who were in the same situation as myself, and succeeded in reaching candahar in safety, where i found my regiment, and at once entered upon my new duties. the campaign brought honours and promotion to many, but for me it had nothing but misfortune and disaster. i was removed from my br

# Traditional chunking and encoding

In [115]:
embeddings = model.encode(chunked_texts)

In [123]:
import torch
import torch.nn.functional as F
import numpy as np

if isinstance(embeddings, np.ndarray):
    embeddings = torch.from_numpy(embeddings).float()

if isinstance(qv, np.ndarray):
    qv = torch.from_numpy(qv).float()

if qv.dim() == 1:
    qv = qv.unsqueeze(0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embeddings = embeddings.to(device)
qv = qv.to(device)

embeddings_norm = F.normalize(embeddings, p=2, dim=1)
qv_norm = F.normalize(qv, p=2, dim=1)

cosine_similarities = torch.mm(qv_norm, embeddings_norm.t())

top_results = torch.topk(cosine_similarities.squeeze(), k=min(top_k, cosine_similarities.numel()))

for i, (score, idx) in enumerate(zip(top_results.values, top_results.indices)):
    print(f"Rank {i+1}: Chunk {idx.item()}, Score: {score.item():.4f}")
    print(chunked_texts[idx.item()][:200] + "...")

Rank 1: Chunk 17, Score: 0.6247
as mr. lestrade, and who came three or four times in a single week. one morning a young girl called, fashionably dressed, and stayed for half an hour or more. the same afternoon brought a grey - heade...
Rank 2: Chunk 15, Score: 0.6187
difference to me or to my work. a i was on the point of asking him what that work might be, but something in his manner showed me that the question would be an unwelcome one. i pondered over our short...
Rank 3: Chunk 11, Score: 0.6069
##ity, a he said. aa good many people have wanted to know how he finds things out. a aoh! a mystery is it? a i cried, rubbing my hands. athis is very piquant. i am much obliged to you for bringing us ...
Rank 4: Chunk 23, Score: 0.5987
know well that i have it in me to make my name famous. no man lives or has ever lived who has brought the same amount of study and of natural talent to the detection of crime which i have done. and wh...
Rank 5: Chunk 19, Score: 0.5957
, let the enquirer begin

In [124]:
chunked_texts[17]

'as mr. lestrade, and who came three or four times in a single week. one morning a young girl called, fashionably dressed, and stayed for half an hour or more. the same afternoon brought a grey - headed, seedy visitor, looking like a jew pedlar, who appeared to me to be much excited, and who was closely followed by a slip - shod elderly woman. on another occasion an old white - haired gentleman had an interview with my companion ; and on another a railway porter in his velveteen uniform. when any of these nondescript individuals put in an appearance, sherlock holmes used to beg for the use of the sitting - room, and i would retire to my bed - room. he always apologized to me for putting me to this inconvenience. ai have to use this room as a place of business, a he said, aand these people are my clients. a again i had an opportunity of asking him a point blank question, and again my delicacy prevented me from forcing another man to confide in me. i imagined at the time that he had some