In [1]:
text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."

# Fixed-size chunking

In [2]:
chunk_size = 20
[text[i: i + chunk_size] for i in range(0, len(text), chunk_size)]

['Lorem ipsum dolor si',
 't amet, consectetur ',
 'adipiscing elit, sed',
 ' do eiusmod tempor i',
 'ncididunt ut labore ',
 'et dolore magna aliq',
 'ua. Ut enim ad minim',
 ' veniam, quis nostru',
 'd exercitation ullam',
 'co laboris nisi ut a',
 'liquip ex ea commodo',
 ' consequat. Duis aut',
 'e irure dolor in rep',
 'rehenderit in volupt',
 'ate velit esse cillu',
 'm dolore eu fugiat n',
 'ulla pariatur. Excep',
 'teur sint occaecat c',
 'upidatat non proiden',
 't, sunt in culpa qui',
 ' officia deserunt mo',
 'llit anim id est lab',
 'orum.']

# Splitting based on special characters

In [3]:
text.split(".")

['Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua',
 ' Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat',
 ' Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur',
 ' Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum',
 '']
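
The output above keeps the leading space of each sentence and ends with an empty string, because `str.split` cuts exactly at the delimiter and nothing else. A minimal sketch of one way to tidy this, assuming the period should stay attached to each sentence; `split_sentences` is a hypothetical helper, not part of the notebook:

```python
def split_sentences(text, delimiter="."):
    """Split text on a delimiter, trimming whitespace and dropping empty pieces."""
    pieces = (piece.strip() for piece in text.split(delimiter))
    # Re-attach the delimiter so each chunk still reads as a full sentence.
    return [piece + delimiter for piece in pieces if piece]


sample = "Lorem ipsum dolor sit amet. Ut enim ad minim veniam. Duis aute irure dolor."
print(split_sentences(sample))
# ['Lorem ipsum dolor sit amet.', 'Ut enim ad minim veniam.', 'Duis aute irure dolor.']
```

Splitting on a bare "." is still naive: it also cuts at abbreviations and decimal numbers, so real text may need a proper sentence tokenizer.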

# Overlap between chunks

In [4]:
chunk_size = 20
overlap = 5
[text[i: i + chunk_size] for i in range(0, len(text), chunk_size-overlap)]

['Lorem ipsum dolor si',
 'or sit amet, consect',
 'nsectetur adipiscing',
 'scing elit, sed do e',
 ' do eiusmod tempor i',
 'por incididunt ut la',
 'ut labore et dolore ',
 'lore magna aliqua. U',
 'ua. Ut enim ad minim',
 'minim veniam, quis n',
 'uis nostrud exercita',
 'rcitation ullamco la',
 'co laboris nisi ut a',
 ' ut aliquip ex ea co',
 'ea commodo consequat',
 'equat. Duis aute iru',
 'e irure dolor in rep',
 'n reprehenderit in v',
 ' in voluptate velit ',
 'elit esse cillum dol',
 'm dolore eu fugiat n',
 'iat nulla pariatur. ',
 'tur. Excepteur sint ',
 'sint occaecat cupida',
 'upidatat non proiden',
 'oident, sunt in culp',
 ' culpa qui officia d',
 'cia deserunt mollit ',
 'llit anim id est lab',
 't laborum.']
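
The list comprehension above steps forward by `chunk_size - overlap`, so each chunk repeats the last `overlap` characters of the previous one. The same idea can be wrapped in a small function that also guards against a step of zero. This is a sketch under that assumption; `chunk_with_overlap` is a hypothetical name:

```python
def chunk_with_overlap(text, chunk_size=20, overlap=5):
    """Return fixed-size character chunks where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        # overlap == chunk_size would give a step of 0, which range() rejects.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


chunks = chunk_with_overlap("Lorem ipsum dolor sit amet, consectetur", chunk_size=20, overlap=5)
print(chunks)
# ['Lorem ipsum dolor si', 'or sit amet, consect', 'nsectetur']
```

With `overlap=0` the function reduces to plain fixed-size chunking. The final chunk can be shorter than `chunk_size`, as in the output above.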

# Embeddings, Vectors, RAG

In [None]:
!pip install sentence-transformers chromadb

In [7]:
from sentence_transformers import SentenceTransformer

An embedding model maps a piece of text, whether a single word or a whole sentence, to a numeric vector, so that texts with similar meaning end up close to each other.

In [10]:
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = text.split(".")

text_embeddings = model.encode(sentences)
print(text_embeddings.shape)

(5, 384)
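
Once text is represented as vectors, "similar" becomes something you can compute. Cosine similarity is one common choice; the notebook does not say which metric is used later, so treat this as an illustration only. The sketch uses toy 3-dimensional vectors in plain Python instead of the 384-dimensional embeddings above, and `cosine_similarity` is a hypothetical helper:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [1.0, 0.0, 1.0]
candidates = {
    "close": [2.0, 0.0, 2.0],      # same direction as the query, different length
    "unrelated": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Pick the candidate whose vector points most nearly the same way as the query.
best = max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))
print(best)  # close
```

Retrieval in a vector database follows the same pattern: embed the query, score it against the stored vectors, and return the best matches.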


ChromaDB is an open-source vector database. In a RAG application, we query it for relevant text before the request is sent to the LLM. By default, ChromaDB uses the same sentence embedding model we used above, "all-MiniLM-L6-v2".

In [24]:
import chromadb
client = chromadb.Client()
collection = client.create_collection(name="MySentenceStore")
# Each document needs a unique string ID; here we use its position in the list.
collection.add(documents=sentences, ids=[str(i) for i in range(len(sentences))])



In [27]:
query_results = collection.query(query_texts=["What is deserunt mollit est labt laborum?"], n_results=1)

In [29]:
print(query_results["documents"])

[[' Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum']]
