In this notebook, I demonstrate how to preprocess an input document using the example below (assume it was loaded by a document loader), convert it to vectors (using Ollama), and store all the important information in MongoDB.

In [1]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pymongo import MongoClient

In [2]:
## Example text input; assume it was loaded in the document-loader step
text = """I am a full-stack software and AI engineer especially interested in building web application experiences, scaling systems up, and producing reliable AI applications. I seek a full-time role to apply my skills, embrace challenges, collaborate with diverse teams, and contribute meaningfully to an organization."""
print(text)

I am a full-stack software and AI engineer especially interested in building web application experiences, scaling systems up, and producing reliable AI applications. I seek a full-time role to apply my skills, embrace challenges, collaborate with diverse teams, and contribute meaningfully to an organization.


Next, set chunk_size and chunk_overlap. There are various kinds of text splitters; RecursiveCharacterTextSplitter is one of the most commonly used and performs well, so this project uses it. The choice of splitter could also be made selectable by the user.

Note that the values for chunk_size and chunk_overlap are decided by the user. Powers of two (e.g. 2, 4, 8, 16, 32, 64, and so on) are advisable for chunk_size, and chunk_overlap should be about 10-20% of the chunk size.
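
As a rough sketch of the overlap rule of thumb above, the overlap can be derived from the chunk size instead of being typed in by hand. The helper name `suggest_overlap` and its default ratio of 0.1 are assumptions made for illustration; they are not part of LangChain.

```python
def suggest_overlap(chunk_size: int, ratio: float = 0.1) -> int:
    """Return an overlap of roughly `ratio` times the chunk size (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in the range [0, 1)")
    # Truncate toward zero so the overlap never exceeds the requested share
    return int(chunk_size * ratio)


# Compare the 10% and 20% suggestions for a few chunk sizes
for size in (32, 64, 128, 256):
    print(size, suggest_overlap(size), suggest_overlap(size, ratio=0.2))
```

With a chunk size of 35, the 10% setting gives the same overlap of 3 used in the next cell.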

In [3]:
chunk_size = 35
chunk_overlap = chunk_size // 10  # about 10% of the chunk size

chunk_option = {"chunk_size": chunk_size, "chunk_overlap": chunk_overlap}

# Separators are tried in order: paragraphs, lines, words, then single characters
text_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", " ", ""], **chunk_option)
res = text_splitter.create_documents([text], metadatas=[{"source": "what is these document about"}])
res

[Document(metadata={'source': 'what is these document about'}, page_content='I am a full-stack software and AI'),
 Document(metadata={'source': 'what is these document about'}, page_content='AI engineer especially interested'),
 Document(metadata={'source': 'what is these document about'}, page_content='in building web application'),
 Document(metadata={'source': 'what is these document about'}, page_content='experiences, scaling systems up,'),
 Document(metadata={'source': 'what is these document about'}, page_content='and producing reliable AI'),
 Document(metadata={'source': 'what is these document about'}, page_content='AI applications. I seek a'),
 Document(metadata={'source': 'what is these document about'}, page_content='a full-time role to apply my'),
 Document(metadata={'source': 'what is these document about'}, page_content='my skills, embrace challenges,'),
 Document(metadata={'source': 'what is these document about'}, page_content='collaborate with diverse teams,'),
 Docume

The code below demonstrates a basic operation (bulk insertion) for storing data in MongoDB.

In [7]:
import datetime

## Make sure your MongoClient URL is not publicly visible. This example runs on a local machine, so it is fine here.
dbConn = MongoClient("mongodb://root:rootpass@192.168.1.7:27017/?authSource=admin")
database = dbConn["LLM_PROJECT"]
col = database["Vector_DB"]

docs = []
for doc in res:
    doc_to_insert = {
        "content": doc.page_content,
        "metadata": doc.metadata,
        "embedding": "",  # placeholder; no embedding has been computed at this point
        "create_date": datetime.datetime.now(datetime.timezone.utc),
        "update_date": datetime.datetime.now(datetime.timezone.utc),
    }

    docs.append(doc_to_insert)

col.insert_many(docs)
print("Successfully inserted data into MongoDB")



Successfully inserted data into MongoDB
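
One possible way to keep the connection string out of the notebook is to read it from an environment variable. This is only a sketch: the variable name `MONGODB_URI`, the helper `get_mongo_uri`, and the fallback URL are assumptions chosen for illustration.

```python
import os


def get_mongo_uri(default: str = "mongodb://localhost:27017") -> str:
    """Read the MongoDB connection string from the environment (hypothetical helper).

    MONGODB_URI is an assumed variable name; `default` is used when it is unset.
    """
    return os.environ.get("MONGODB_URI", default)


uri = get_mongo_uri()
# The result could then be passed to MongoClient(uri) instead of a hard-coded URL.
```

With this in place, the credentials live in the shell or in a deployment secret rather than in the saved notebook.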


The snippet below shows how to use an Ollama model running locally.

In [10]:
import ollama
from ollama import AsyncClient
import re

host = "http://localhost:11434"
model = "deepseek-r1:1.5b"
message = [{"role":"user","content":"Hello there, who am i"}]

client = AsyncClient(host=host)

response = await client.chat(
    model=model,
    messages=message,
    stream=True
)

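def clean_response(text: str) -> str:
    # deepseek-r1 wraps its reasoning in <think>...</think>; drop the whole block,
    # whether it is empty or not (DOTALL lets "." match newlines)
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()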

result = ""
async for chunk in response:
    result += chunk["message"]["content"]
print(clean_response(result))




Hello! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.


The snippet below embeds the data and stores it in batches, using Ollama and MongoDB.

In [12]:
import datetime

dbConn = MongoClient("mongodb://root:rootpass@192.168.1.7:27017/?authSource=admin")
database = dbConn["LLM_PROJECT"]
col = database["Vector_DB"]

## batch processing for embedding
batch_size = 2
model = "nomic-embed-text"

for i in range(0, len(res), batch_size):
    batch = res[i:i + batch_size]
    # Collect the raw text of every chunk in this batch
    batch_texts = [chunk.page_content for chunk in batch]
    embedding = ollama.embed(model=model, input=batch_texts)
    # print(embedding["embeddings"])
    docs_to_insert = []

    for j,embed_text in enumerate(embedding.embeddings):
        doc = {
            "content": batch[j].page_content,
            "metadata": batch[j].metadata,
            "embedding":embed_text,
            "create_date": datetime.datetime.now(datetime.timezone.utc),
            "update_date": datetime.datetime.now(datetime.timezone.utc),
        }
        docs_to_insert.append(doc)
    
    if docs_to_insert:
        col.insert_many(docs_to_insert)
print("Embedding Process Done")
 

Embedding Process Done
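
The slicing pattern used in the batch loop above can be factored into a small generator, which keeps the embedding code focused on the embedding call itself. This is a minimal sketch; `batched` is a hypothetical helper name (Python 3.12 ships a similar `itertools.batched`, which yields tuples rather than lists).

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in strings; in the notebook these would be the split documents in `res`
chunks = ["a", "b", "c", "d", "e"]
for batch in batched(chunks, 2):
    print(batch)
```

The last batch may be shorter than `batch_size`, which matches the behaviour of the slicing in the loop above.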
