Load content and attached metadata

In [1]:
content1 = "This service package manages a fleet of commercial vehicles. The Fleet and Freight Management Center monitors the vehicle fleet and can provide routes using either an in-house capability or an external provider. Routes generated by either approach are constrained by hazardous materials and other restrictions (such as height or weight). A route is electronically sent to the Commercial Vehicle with any appropriate dispatch instructions. The location of the Commercial Vehicle can be monitored by the Fleet and Freight Management Center and routing changes can be made depending on current road network conditions. This service package also supports maintenance of fleet vehicles with on-board monitoring equipment. Records of vehicle mileage, preventative maintenance and repairs are maintained"
metadata1 = {'item type' : 'service', 'name' : 'carrier operations and fleet management'}

content2 = "The 'Basic Commercial Vehicle' represents the commercial vehicle that hosts the on-board equipment that provides ITS capabilities. It includes the heavy vehicle databus and all other interface points between on-board systems and the rest of the commercial vehicle. This vehicle is used to transport goods, is operated by a professional driver and typically administered as part of a larger fleet. Commercial Vehicle classification applies to all goods transport vehicles ranging from small panel vans used in local pick-up and delivery services to large, multi-axle tractor-trailer rigs operating on long haul routes."
metadata2 = {'item type' : 'physical', 'name' : 'basic commercial vehicle', 'kind' : 'terminator', 'class' : 'vehicle', 'service' : ['carrier operations and fleet management', 'freight administration']}

Build Documents (for index)

In [2]:
from langchain.schema import Document

doc1 = Document(page_content=content1, metadata=metadata1)

doc2 = Document(page_content=content2, metadata=metadata2)

docs = [doc1, doc2]

Assign an id to each document (link between collection and vector index)

In [3]:
import uuid

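def assign_ids(docs: list[Document]) -> list[Document]:
    """Tag each document with a UUID. Chunks inherit it, which links the collection to the vector index."""
    for doc in docs:
        doc.metadata["id"] = str(uuid.uuid4())
    return docs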

docs_with_id = assign_ids(docs)
len(docs_with_id)

2

Chunk documents

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap = 100,
    separators=["\n\n", "\n", " ", ""]
)

docs_chunked = splitter.split_documents(docs_with_id)
len(docs_chunked)

4
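
The roles of chunk_size and chunk_overlap, and the fact that every chunk carries a copy of its parent's metadata (including the id assigned above), can be sketched in plain Python. This is a deliberately naive fixed-width splitter, not the separator-aware algorithm of RecursiveCharacterTextSplitter; `split_with_overlap` and the sample text are hypothetical and serve only as an illustration.

```python
import copy

def split_with_overlap(text, metadata, chunk_size=512, chunk_overlap=100):
    """Naive fixed-width splitter: consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "content": text[start:start + chunk_size],
            # Each chunk gets its own copy of the parent metadata, id included
            "metadata": copy.deepcopy(metadata),
        })
        if start + chunk_size >= len(text):
            break  # The current window already reaches the end of the text
    return chunks

# Hypothetical 800-character text standing in for a real document
sample = split_with_overlap("x" * 800, {"id": "parent-1"})
print(len(sample), [len(c["content"]) for c in sample])  # 2 [512, 388]
```

Because the id is assigned before chunking, all chunks of one document share the same id.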

Instantiate embedding model

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = HuggingFaceEmbeddings(  # Instantiate the embedding model
    model_name="Alibaba-NLP/gte-multilingual-base",
    model_kwargs={"device": "cpu", "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Instantiate vector index (similarity search) and collection (complex pre-filtering)

In [9]:
from langchain_chroma import Chroma
from pymongo import MongoClient

index_path = "../data/demo" # We use a persistent directory for our chroma db. This way, data is saved between calls
vector_db = Chroma(
    collection_name="demo_db",
    embedding_function=embedding_model,
    persist_directory=index_path,
)

print("vector db : ", vector_db._collection.count())

mongo_path = "mongodb://192.168.211.96:27017/"  # MongoDB connection URI
client = MongoClient(mongo_path)
metadata_db = client["metadata_db"]["demo_collection"]

vector db :  0


Store the content and the id in the index; store the metadata in the collection

In [10]:
def index_chroma(chroma_db, docs: list[Document]):
    """Store only the content plus the 'id' and 'name' metadata in the vector db."""
    chroma_docs = []  # No list-valued metadata here: fields such as 'service' stay in Mongo
    for doc in docs:
        new_doc = Document(
            page_content=doc.page_content,
            metadata={"id": doc.metadata["id"], "name": doc.metadata["name"]},
        )
        chroma_docs.append(new_doc)

    chroma_db.add_documents(chroma_docs)

def batchify(docs: list[Document], batch_size):
    """Yield successive batches. Chroma's max batch size is about 5,400, so index batch by batch and not the full corpus at once."""
    for i in range(0, len(docs), batch_size):
        yield docs[i:i + batch_size]

def doc_into_dict(docs: list[Document]) -> list[dict]:
    """Mongo collections store dictionaries, not Document objects."""
    return [
        {"content": doc.page_content, "metadata": doc.metadata}
        for doc in docs
    ]
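
index_chroma keeps only two hand-picked scalar fields. A more generic variant could drop every non-scalar value automatically. The sketch below assumes the vector store accepts only str, int, float and bool metadata values, which is what the "Without any list in metadata" comment suggests; `scalar_metadata` and the example record are hypothetical.

```python
SCALAR_TYPES = (str, int, float, bool)

def scalar_metadata(metadata):
    """Keep only the key/value pairs whose value is a plain scalar."""
    return {k: v for k, v in metadata.items() if isinstance(v, SCALAR_TYPES)}

# Hypothetical record shaped like metadata2, with an id already assigned
example = {
    "id": "hypothetical-id",
    "name": "basic commercial vehicle",
    "class": "vehicle",
    "service": ["carrier operations and fleet management", "freight administration"],
}
print(scalar_metadata(example))  # The list-valued 'service' field is dropped
```

The full metadata, lists included, is still available in the Mongo collection through the shared id.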

In [11]:
collection_docs = doc_into_dict(docs_chunked) # Mongo collection stores dict and not Document
len(collection_docs)

4

Embedding and indexing

In [13]:
metadata_db.insert_many(collection_docs)

for batch in batchify(docs_chunked, 5000):
    index_chroma(vector_db, batch)
print(vector_db._collection.count())

4
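
With only four chunks, the loop above runs a single batch. To show what batchify does on a larger corpus, here is a self-contained sketch that repeats the function on a hypothetical list of 12,000 integers standing in for documents.

```python
def batchify(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

corpus = list(range(12_000))  # Hypothetical stand-in for 12,000 chunks
sizes = [len(batch) for batch in batchify(corpus, 5000)]
print(sizes)  # [5000, 5000, 2000]
```

The last batch is simply shorter; no item is dropped or duplicated.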


The index now contains the chunk embeddings.
The collection contains the chunks' metadata.
We will perform similarity search on the former and pre-filtering on the latter.
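
One way the two stores could be combined is sketched below: a Mongo query selects the ids that pass the metadata filter, and those ids then restrict the similarity search. This is a minimal sketch, not necessarily the workflow the notebook goes on to use. The helper names `matching_ids` and `prefiltered_search` are hypothetical; the sketch assumes `collection` behaves like a pymongo collection and `vector_db` like the langchain_chroma store, whose `similarity_search` accepts a `filter` dict with Chroma's `$in` operator.

```python
def matching_ids(collection, mongo_filter):
    """Return the distinct document ids whose stored metadata satisfies mongo_filter."""
    # Project only the id field; chunks of one document share the same id
    cursor = collection.find(mongo_filter, {"metadata.id": 1, "_id": 0})
    return sorted({record["metadata"]["id"] for record in cursor})

def prefiltered_search(vector_db, collection, query, mongo_filter, k=4):
    """Pre-filter in Mongo, then run similarity search on the surviving ids only."""
    ids = matching_ids(collection, mongo_filter)
    if not ids:
        return []  # Nothing passed the filter, so skip the vector search
    return vector_db.similarity_search(query, k=k, filter={"id": {"$in": ids}})
```

With the objects created above, a call could look like `prefiltered_search(vector_db, metadata_db, "vehicle routing", {"metadata.service": "freight administration"})`. Mongo matches a list-valued field when any element equals the given value, which is the kind of filter that the scalar-only metadata kept in the index cannot express.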