Import libraries

In [3]:
# Add the project root (parent directory) to the import path
import sys, os
root_path = os.path.abspath("..")
if root_path not in sys.path:
    sys.path.append(root_path)

# Import retrievers
from retrieval.retrievers.filter_retriever import FilterRetriever 
from retrieval.retrievers.adaptative_retriever import AdaptativeRetriever
from retrieval.retrievers.alignment_retriever import AlignmentRetriever 
from retrieval.retrievers.vanila_retriever import VanilaRetriever 

# Import generation modules
from generation.summarization import Summarization
from generation.generator import Generation

Instantiate embedding model

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings

# Instantiate the embedding model
embedding_model = HuggingFaceEmbeddings(
    model_name="Alibaba-NLP/gte-multilingual-base",
    model_kwargs={"device": "cpu", "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
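
The `normalize_embeddings=True` option asks the encoder to return unit-length vectors, so a plain dot product between two embeddings equals their cosine similarity. The sketch below illustrates that property with hand-written vectors; the helper names (`l2_normalize`, `dot`, `cosine_similarity`) are illustrative and not part of this project or of LangChain.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; the zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed on the raw (un-normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for two embeddings
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both values agree up to floating-point rounding
print(dot(a_unit, b_unit))
print(cosine_similarity(a, b))
```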


Instantiate indexes

In [None]:
from langchain_chroma import Chroma
from pymongo import MongoClient

community_db = Chroma(
    collection_name="community_db",
    embedding_function=embedding_model,
    persist_directory="../../data/db/rag",
)

items_db = Chroma( # All items in collection
    collection_name="items_db",
    embedding_function=embedding_model,
    persist_directory="../../data/db/testbench",
)

chunked_db = Chroma( # For fine-grained similarity search
    collection_name="chunked_db",
    embedding_function=embedding_model,
    persist_directory="../../data/db/chunked",
)

client = MongoClient("mongodb://192.168.211.96:27017/")
# Metadata collection, used for the pre-filtering step
collection = client["metadata_db"]["metadata_collection"]
# fr_collection = client["metadata_db"]["filter_metadata_collection"]

print('community db', community_db._collection.count())
print('items db', items_db._collection.count())
print("chunked db : ", chunked_db._collection.count())

community db 174
items db 2520
chunked db :  7032


Instantiate retriever

In [None]:
from langchain_ollama import OllamaLLM

model = 'mistral-small3.1:24b'
base_url = 'http://192.168.249.7:11434'
llm = OllamaLLM(base_url=base_url, model=model)
top_k = 20

retriever = AdaptativeRetriever(llm, items_db, community_db, chunked_db, collection, top_k=top_k)
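
The `top_k` argument presumably caps how many candidates the retriever keeps; how `AdaptativeRetriever` uses it internally is not shown here, so that reading is an assumption. As a minimal sketch of the general idea, the hypothetical helper `top_k_by_score` below keeps the `k` best-scoring items from made-up `(doc_id, score)` pairs.

```python
import heapq

def top_k_by_score(scored_docs, k):
    """Return the k (doc_id, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])

# Hypothetical similarity scores, for illustration only
scored = [("doc-a", 0.42), ("doc-b", 0.91), ("doc-c", 0.17), ("doc-d", 0.66)]
best = top_k_by_score(scored, k=2)
print(best)  # [('doc-b', 0.91), ('doc-d', 0.66)]
```

Asking for more items than exist simply returns all of them, sorted by score.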

## Generation

Retrieve context

In [11]:
question = "What is the goal of the data management domain ?"
is_diagram, retrieved_context, ids = retriever.retrieve(question)
print(retrieved_context)

Extracted filter is : {'name': 'data management'}
This area addresses the management of data that can be used by some or all transportation agencies and other organizations to support transportation planning, performance monitoring, safety analysis, and research. Data are collected from detectors and sensors, connected vehicles, and operational data feeds from centers.
The services related to the 'data management' domain are : dm01: its data warehouse, dm02: performance monitoring
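
The output above shows a filter, `{'name': 'data management'}`, extracted from the question before retrieval. How the retriever applies it is not visible in this notebook. One plausible reading is an equality match on the metadata records, which with pymongo would be `collection.find(extracted_filter)`. The sketch below mimics that on in-memory dictionaries; the records and the helpers `matches` and `pre_filter` are hypothetical.

```python
def matches(record, flt):
    """True when every key in the filter equals the record's value."""
    return all(record.get(key) == value for key, value in flt.items())

def pre_filter(records, flt):
    """Keep only the records that satisfy an equality-only filter."""
    return [record for record in records if matches(record, flt)]

# Hypothetical metadata records, for illustration only
records = [
    {"name": "data management", "type": "domain"},
    {"name": "traffic management", "type": "domain"},
]
extracted_filter = {"name": "data management"}

candidates = pre_filter(records, extracted_filter)
print([record["name"] for record in candidates])  # ['data management']
```

Narrowing the candidate set this way means the similarity search only has to rank documents that already match the metadata.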


Instantiate summary model and generation model

In [None]:
summarizer = Summarization(llm)
generator = Generation(llm)

Summarize retrieved context to facilitate the generation task

In [None]:
summary = summarizer.summarize(question, retrieved_context)
print(summary)

The data management domain focuses on handling data for transportation agencies and organizations to support planning, performance monitoring, safety analysis, and research. This data is collected from various sources such as detectors, sensors, connected vehicles, and operational data feeds. Key services in this domain include a data warehouse and performance monitoring.


Answer the user's query using the retrieved and summarized context

In [None]:
print(question)
response = generator.generate(question, summary)
print(response)

What is the goal of the data management domain ?
The goal of the data management domain is to handle data for transportation agencies and organizations to support planning, performance monitoring, safety analysis, and research. This data is collected from various sources such as detectors, sensors, connected vehicles, and operational data feeds. Key services in this domain include a data warehouse and performance monitoring.
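
The cells above chain two LLM calls: one to condense the retrieved context, one to answer from the summary. The internals of `Summarization` and `Generation` are not shown, so the sketch below is only an assumption about the overall shape. The prompts, the function names and the `RecordingLLM` stand-in are all hypothetical. Any callable that maps a prompt string to a reply string could be passed as `llm`, for example the `invoke` method of a LangChain LLM.

```python
class RecordingLLM:
    """Stand-in for a real model: stores each prompt and returns a canned reply."""

    def __init__(self):
        self.prompts = []

    def __call__(self, prompt):
        self.prompts.append(prompt)
        return f"reply-{len(self.prompts)}"

def summarize(llm, question, context):
    """First call: condense the context, keeping what is relevant to the question."""
    prompt = f"Summarize the context for this question.\nQuestion: {question}\nContext: {context}"
    return llm(prompt)

def generate(llm, question, summary):
    """Second call: answer the question from the summary only."""
    prompt = f"Answer the question using the summary.\nQuestion: {question}\nSummary: {summary}"
    return llm(prompt)

def answer(llm, question, context):
    """Run summarization, then generation, and return the final reply."""
    summary = summarize(llm, question, context)
    return generate(llm, question, summary)

fake_llm = RecordingLLM()
result = answer(fake_llm, "What is the goal of the domain?", "Some retrieved context.")
print(result)                 # reply-2
print(len(fake_llm.prompts))  # 2
```

The second prompt contains the first reply, which is what makes the summary the only context the generation step sees.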
