<a href="https://colab.research.google.com/github/noambassat/RAG_Agent_GITHUB_Rep.project/blob/main/ChromaDB_RAG_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install -q pandas chromadb langchain langchain-community openai sentence-transformers faiss-cpu transformers tqdm rank_bm25 --quiet

In [2]:
# Import Libraries
import pandas as pd
import numpy as np
import os
import torch # Import torch to check for CUDA availability

# LangChain specific imports
from langchain_community.document_loaders import DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings # Using BGE for embeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import OpenAI
from langchain.agents import initialize_agent, Tool, AgentType
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# For agent prompting customization - keep these if you intend to use them later;
# they are not required for the basic agent setup below.
# (AgentType is already imported above, so it is not imported again here.)
from langchain.agents import AgentExecutor
from langchain.agents.format_scratchpad import format_log_to_messages
from langchain_core.messages import AIMessage, HumanMessage
from langchain.pydantic_v1 import BaseModel, Field
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough


# Sentence Transformers for embeddings and reranking
from sentence_transformers import SentenceTransformer, CrossEncoder
from transformers import AutoTokenizer # For token-based chunking

# BM25 for lexical search
from rank_bm25 import BM25Okapi

# Utilities
from tqdm import tqdm
from google.colab import userdata, drive




For example, replace imports like: `from langchain.pydantic_v1 import BaseModel`
with: `from pydantic import BaseModel`
or the v1 compatibility namespace if you are working in a code base that has not been fully upgraded to pydantic 2 yet: `from pydantic.v1 import BaseModel`


In [3]:
# Initialize tqdm for pandas operations (if used later)
tqdm.pandas()

# Mount Google Drive and Set OpenAI API Key
# Mount Google Drive to access your data file
drive.mount('/content/drive')

# Set OpenAI API Key from Colab secrets
# Ensure 'open_ai_key' is securely stored in Colab Secrets
os.environ["OPENAI_API_KEY"] = userdata.get('open_ai_key')

print("--- Initial Setup Complete ---")
print("Google Drive mounted and OpenAI API key loaded.")

# Path to the cleaned dataset on Google Drive (replace with your actual path).
# The file itself is loaded and preprocessed in the next cell.
# Ensure this path is correct after mounting Google Drive.
path = "/content/drive/MyDrive/GitHubRepositoriesProject/clean_df.xlsx"

Mounted at /content/drive
--- Initial Setup Complete ---
Google Drive mounted and OpenAI API key loaded.


In [4]:
# Load only relevant columns from the Excel file
df = pd.read_excel(path, usecols=["Name", "Description", "URL", "Topics"])

# Drop rows where 'Description' or 'Topics' are missing
# These rows would not provide useful context for RAG
df.dropna(subset=["Description", "Topics"], inplace=True)

# Ensure 'Topics' column is treated as string type
# This prevents potential errors if 'Topics' contains non-string types
df["Topics"] = df["Topics"].astype(str)

# Combine 'Description' and 'Topics' into a single 'Full_Text' column.
# This concatenated text will be used as the primary content for chunking and embedding.
df["Full_Text"] = df["Description"] + " " + df["Topics"]

print("\n--- Data Preprocessing Summary ---")
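# Note: despite the label, this shape is measured after dropping incomplete rows and adding 'Full_Text'.
print(f"Original DataFrame shape: {df.shape}")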
df[['Name', 'Full_Text']].head(2)



--- Data Preprocessing Summary ---
Original DataFrame shape: (11663, 5)


   Name                Full_Text
0  PyPOTS              toolboxlibrary data mining partially observed ...
1  changedetection.io  best simplest free open source website change ...


In [5]:
# Define the embedding model name. This model's tokenizer will be used for accurate token counting.
embedding_model_name = "BAAI/bge-base-en-v1.5"

# Load the tokenizer for the chosen embedding model.
# Sizing chunks in tokens (not characters) keeps each chunk within the embedding
# model's maximum input length (512 tokens for bge-base-en-v1.5).
try:
    tokenizer = AutoTokenizer.from_pretrained(embedding_model_name)
except Exception as e:
    print(f"Error loading tokenizer: {e}. Please ensure model name is correct and internet connection is available.")
    # Fall back to a generic BERT tokenizer; token counts may differ from the BGE tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased") # Fallback example

# Define a custom length function using the loaded tokenizer.
def num_tokens_from_string(text: str) -> int:
    """Returns the number of tokens in a text string using the pre-loaded tokenizer."""
    return len(tokenizer.encode(text))

# Define chunking parameters.
# chunk_size: Maximum number of tokens per chunk. A common range is 256-512 tokens.
# chunk_overlap: Number of overlapping tokens between consecutive chunks to maintain context across splits.
chunk_size = 500  # tokens
chunk_overlap = 100 # tokens

# Initialize RecursiveCharacterTextSplitter.
# This splitter attempts to preserve semantic boundaries by trying a sequence of separators
# (e.g., newlines, spaces) to avoid breaking sentences or important phrases.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=num_tokens_from_string, # Use the token-based length function for accuracy
    add_start_index=True # Adds the starting character index of each chunk to its metadata
)
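
To make the interplay of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter over a token list. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter cuts on separators and merges the pieces); the function name `sliding_token_chunks` and the demo input are hypothetical.

```python
def sliding_token_chunks(tokens, chunk_size=500, chunk_overlap=100):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the current window already reaches the end
    return chunks

# Hypothetical input: 1,200 integer "token ids".
demo_tokens = list(range(1200))
demo_chunks = sliding_token_chunks(demo_tokens)
print([len(c) for c in demo_chunks])
```

With the values used in this notebook (500 and 100), each window advances by 400 tokens, so the last 100 tokens of one chunk reappear at the start of the next.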


In [6]:
# Prepare documents for LangChain's text splitter.
# DataFrameLoader converts each row of the DataFrame into a LangChain Document object.
# The 'page_content_column' specifies which DataFrame column's text becomes the main content of the Document.
# All other DataFrame columns are automatically stored as 'metadata' within each Document.
loader = DataFrameLoader(df, page_content_column="Full_Text")
documents = loader.load()

# Split the original documents (each representing a GitHub project) into smaller chunks.
chunks = text_splitter.split_documents(documents)

print("\n--- Text Chunking Summary ---")
print(f"Number of original documents processed: {len(documents)}")
print(f"Number of chunks created after splitting: {len(chunks)}")
if chunks:
    # Display the first 200 characters of the first chunk's content
    print(f"Example chunk content (first 200 chars):\n'{chunks[0].page_content[:200]}...'")
    # Display the metadata associated with the first chunk
    print(f"Example chunk metadata:\n{chunks[0].metadata}")
print("Chunking complete. Each chunk now contains its original document's metadata.")




--- Text Chunking Summary ---
Number of original documents processed: 11663
Number of chunks created after splitting: 11665
Example chunk content (first 200 chars):
'toolboxlibrary data mining partially observed time series including sota models supporting tasks forecasting incomplete irregularly sampled multivariate time series missing values classification, clus...'
Example chunk metadata:
{'Name': 'PyPOTS', 'Description': 'toolboxlibrary data mining partially observed time series including sota models supporting tasks forecasting incomplete irregularly sampled multivariate time series missing values', 'URL': 'https://github.com/WenjieDu/PyPOTS', 'Topics': 'classification, clustering, data mining, forecasting, imputation, incomplete data, incomplete time series, irregularly sampled time series, machine learning, missing data, missing values, partially observed time series, pytorch, time series, time series analysis, time series classification, time series clustering, time series for
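
The row-to-document mapping described in this cell can be sketched in plain Python. This is only an illustration of the idea, not LangChain's implementation; `SimpleDocument` and `rows_to_documents` are hypothetical names, and the sample row is abbreviated.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDocument:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def rows_to_documents(rows, page_content_column):
    """Turn each row (a dict) into a document; every column except the
    content column is copied into the metadata."""
    docs = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append(SimpleDocument(row[page_content_column], metadata))
    return docs

# Abbreviated, illustrative row.
rows = [{"Name": "PyPOTS", "URL": "https://github.com/WenjieDu/PyPOTS",
         "Full_Text": "toolboxlibrary data mining partially observed time series"}]
docs = rows_to_documents(rows, page_content_column="Full_Text")
print(docs[0].metadata)
```

Because the splitter copies metadata onto every chunk, each chunk can later be traced back to its project by `Name` or `URL`.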

In [7]:
# Initialize the BGE embedding model.
# 'model_kwargs': Arguments passed to the SentenceTransformer model.
#   'device': Set to 'cuda' if a GPU is available for faster embedding generation, otherwise 'cpu'.
# 'encode_kwargs': Arguments passed to the model's encode method.
#   'normalize_embeddings': Set to True for cosine similarity, which is standard for vector search.
embedding_model = HuggingFaceBgeEmbeddings(
    model_name=embedding_model_name,
    model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'},
    encode_kwargs={'normalize_embeddings': True}
)

# Define a directory to persist ChromaDB data.
# This allows the vector store to be saved to disk and reloaded later,
# avoiding the need to regenerate embeddings on subsequent runs.
persist_directory = "./chroma_db_github_repos"

print(f"\n--- Embedding Generation and Vector Store Creation ---")
print(f"Using embedding model: '{embedding_model_name}'")
print(f"ChromaDB persistence directory: '{persist_directory}'")

# Create and persist the ChromaDB vector store from the generated chunks.
# This process involves:
# 1. Generating embeddings for each chunk using the specified embedding_model.
# 2. Storing the chunks' content, their embeddings, and their metadata in ChromaDB.
db = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=persist_directory
)

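# Explicitly persist the database to ensure all data is written to disk.
# Note: with chromadb >= 0.4 data is persisted automatically and this call is
# deprecated, so it can be dropped on newer versions.
db.persist()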
print("ChromaDB vector store successfully created and saved to disk.")



--- Embedding Generation and Vector Store Creation ---
Using embedding model: 'BAAI/bge-base-en-v1.5'
ChromaDB persistence directory: './chroma_db_github_repos'



ChromaDB vector store successfully created and saved to disk.
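
Why `normalize_embeddings=True` matters can be checked with a few lines of arithmetic. The sketch below uses made-up 2-D vectors and shows that, for unit-length vectors, the dot product equals cosine similarity and the squared L2 distance equals `2 - 2 * cosine`. It assumes the collection ranks by an L2-type distance (believed to be Chroma's default; verify for your version), in which case the ordering matches a cosine ranking.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors standing in for a query and a document embedding.
query_vec = [3.0, 4.0]
doc_vec = [1.0, 2.0]
q_hat, d_hat = l2_normalize(query_vec), l2_normalize(doc_vec)

cos = cosine(query_vec, doc_vec)
print(abs(dot(q_hat, d_hat) - cos) < 1e-12)               # dot of unit vectors == cosine
print(abs(squared_l2(q_hat, d_hat) - (2 - 2 * cos)) < 1e-12)  # squared L2 == 2 - 2*cos
```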



In [8]:
# Prepare corpus for BM25. BM25 works on tokenized documents.
# We'll use the 'Full_Text' content of each chunk for the BM25 index.
# Tokenization is a simple lowercase split on single spaces, so punctuation stays
# attached to tokens; queries must be tokenized the same way (see the search function).
tokenized_corpus = [doc.page_content.lower().split(" ") for doc in chunks]
bm25 = BM25Okapi(tokenized_corpus)

print("\n--- Lexical Indexing with BM25 ---")
print(f"BM25 index created for {len(tokenized_corpus)} chunks.")
print("This will enhance keyword-based retrieval.")



--- Lexical Indexing with BM25 ---
BM25 index created for 11665 chunks.
This will enhance keyword-based retrieval.
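
Splitting on a single space leaves punctuation attached (`"clustering,"`) and produces empty tokens on double spaces, both of which can hurt BM25 matching. One possible alternative, sketched here as an assumption rather than what the notebook does, is a small regex tokenizer shared by the index and the query; `simple_tokenize` is a hypothetical helper.

```python
import re

_TOKEN_RE = re.compile(r"[a-z0-9]+")

def simple_tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return _TOKEN_RE.findall(text.lower())

sample = "classification, clustering, time series  missing values."
naive = sample.lower().split(" ")   # what the cell above does
cleaned = simple_tokenize(sample)   # punctuation and empty tokens removed

print(naive)
print(cleaned)
```

The trade-off is that hyphenated or dotted names are broken apart (for example `fine-tuning` becomes `fine` and `tuning`), so whichever tokenizer is chosen must be applied identically to the corpus and to every query.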


In [9]:
# Load the Cross-Encoder model for reranking.
# This model takes a query-document pair and outputs a single relevance score,
# effectively re-ranking the initial retrieval results.
reranker_model_name = 'cross-encoder/ms-marco-MiniLM-L-6-v2'
cross_encoder = CrossEncoder(reranker_model_name)

def hybrid_search_with_reranking(user_query: str, top_k_semantic: int = 20, top_k_bm25: int = 20, top_k_reranked: int = 5):
    """
    Performs a robust hybrid search combining semantic (vector) and BM25 lexical matching,
    followed by reranking to refine the final set of relevant documents.

    Args:
        user_query (str): The user's natural language query.
        top_k_semantic (int): Number of top documents to retrieve initially from the vector store.
        top_k_bm25 (int): Number of top documents to retrieve initially from the BM25 index.
        top_k_reranked (int): Number of final, most relevant documents to return after reranking.

    Returns:
        list[dict]: A list of dictionaries, where each dictionary represents a highly
                    relevant GitHub project, including its metadata and reranking score.
    """
    print(f"\n--- Executing Hybrid Search with Reranking for query: '{user_query}' ---")

    # Step 1: Semantic Search using ChromaDB (Vector Store)
    # db.similarity_search_with_score returns a list of (Document, score) tuples.
    semantic_docs_lc = db.similarity_search_with_score(user_query, k=top_k_semantic)
    print(f"  > Retrieved {len(semantic_docs_lc)} documents from semantic search.")

    # Step 2: Lexical Search using BM25
    tokenized_query = user_query.lower().split(" ")
    bm25_scores = bm25.get_scores(tokenized_query)

    # Get top BM25 indices and filter unique documents
    bm25_ranked_indices = np.argsort(bm25_scores)[::-1] # Sort descending
    bm25_unique_docs = []
    seen_doc_ids = set() # Use a set to track unique original document IDs

    for idx in bm25_ranked_indices:
        if idx >= len(chunks): # Safety check
            continue
        doc = chunks[idx]
        original_doc_id = doc.metadata.get('id', doc.metadata.get('Name', str(idx))) # Use original ID
        if original_doc_id not in seen_doc_ids:
            bm25_unique_docs.append({'page_content': doc.page_content, 'metadata': doc.metadata})
            seen_doc_ids.add(original_doc_id)
        if len(bm25_unique_docs) >= top_k_bm25:
            break

    print(f"  > Retrieved {len(bm25_unique_docs)} documents from BM25 search.")

    # Step 3: Combine Results for Reranking
    all_candidate_docs = []
    added_to_candidates_ids = set()

    # Add semantic results first
    for doc_lc, _score in semantic_docs_lc:
        original_doc_id = doc_lc.metadata.get('id', doc_lc.metadata.get('Name', 'N/A'))
        if original_doc_id not in added_to_candidates_ids:
            all_candidate_docs.append({'page_content': doc_lc.page_content, 'metadata': doc_lc.metadata})
            added_to_candidates_ids.add(original_doc_id)

    # Add BM25 results, avoiding duplicates
    for doc_bm25 in bm25_unique_docs:
        original_doc_id = doc_bm25['metadata'].get('id', doc_bm25['metadata'].get('Name', 'N/A'))
        if original_doc_id not in added_to_candidates_ids:
            all_candidate_docs.append(doc_bm25)
            added_to_candidates_ids.add(original_doc_id)

    if not all_candidate_docs:
        print("  > No documents found from initial retrieval for reranking.")
        return []

    # Prepare input pairs for the reranker: Each pair is [query_string, document_content_string].
    reranker_inputs = [[user_query, doc['page_content']] for doc in all_candidate_docs]

    # Get relevance scores from the Cross-Encoder reranker.
    reranker_scores = cross_encoder.predict(reranker_inputs)
    print(f"  > Reranked {len(all_candidate_docs)} candidate documents.")

    # Combine original document information with their new reranker scores.
    scored_candidates = sorted(zip(reranker_scores, all_candidate_docs), key=lambda x: x[0], reverse=True)

    # Prepare the final results list, taking only the top_k_reranked documents.
    final_results = []
    for score, doc_info in scored_candidates[:top_k_reranked]:
        final_results.append({
            'name': doc_info['metadata'].get('Name', 'N/A'),
            'description': doc_info['metadata'].get('Description', 'N/A'),
            'url': doc_info['metadata'].get('URL', 'N/A'),
            'topics': doc_info['metadata'].get('Topics', 'N/A'),
            'full_text_chunk': doc_info['page_content'], # The actual chunk content that was embedded
            'rerank_score': score
        })
    print(f"  > Returned top {len(final_results)} reranked results.")
    return final_results
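
The merge-then-rerank logic of the function above can be isolated into a small, dependency-free sketch. The toy `overlap_score` merely stands in for the cross-encoder, and all names and sample records here are hypothetical.

```python
def merge_candidates(semantic_hits, lexical_hits, key="Name"):
    """Union two ranked lists of dicts, keeping the first occurrence per key."""
    merged, seen = [], set()
    for hit in list(semantic_hits) + list(lexical_hits):
        ident = hit.get(key)
        if ident not in seen:
            seen.add(ident)
            merged.append(hit)
    return merged

def rerank(query, candidates, score_fn, top_k=5):
    """Score every candidate against the query and keep the top_k best."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

semantic = [{"Name": "A", "text": "time series forecasting"},
            {"Name": "B", "text": "image classification"}]
lexical = [{"Name": "B", "text": "image classification"},
           {"Name": "C", "text": "time series clustering with missing values"}]

merged = merge_candidates(semantic, lexical)
top = rerank("time series missing values", merged, overlap_score, top_k=2)
print([c["Name"] for _, c in top])
```

Deduplicating before reranking is why the candidate count reported by the search function can be smaller than `top_k_semantic + top_k_bm25`: projects found by both retrievers are scored only once.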


In [10]:
# Define a function that wraps the enhanced `hybrid_search_with_reranking` and formats its output.
def search_and_format_for_agent(query: str) -> str:
    """
    Performs the hybrid search with reranking and formats the results
    into a human-readable string suitable for consumption by an LLM agent.
    """
    results = hybrid_search_with_reranking(query)
    if not results:
        return "No relevant GitHub projects found based on your query."

    output = "Found the following relevant GitHub projects:\n"
    for i, project in enumerate(results):
        output += f"{i+1}. Project Name: {project['name']}\n"
        output += f"   Description: {project['description']}\n"
        output += f"   URL: {project['url']}\n"
        output += f"   Topics: {project['topics']}\n"
        output += f"   Relevance Score: {project['rerank_score']:.4f}\n\n"
    return output.strip()

# Define the tools that the LangChain agent will have access to.
tools = [
    Tool(
        name="GitHubProjectSearch",
        func=search_and_format_for_agent,
        description=(
            "Useful for finding open-source GitHub projects. "
            "Input should be a natural language query describing desired project features, "
            "technologies, or topics (e.g., 'a Python library for time series forecasting with missing data'). "
            "The tool returns a list of highly relevant projects with their names, descriptions, URLs, topics, and a relevance score. "
            "**Crucially, when using this tool, your final answer MUST summarize the details of the projects found and relate them directly to the user's query.** "
            "**DO NOT attempt to filter, re-rank, or pick only a subset of the returned projects unless explicitly asked by the user.** "
            "**Focus on providing a concise yet informative summary based ONLY on the details provided by the tool.**"
        )
    )
]

# Initialize the Large Language Model (LLM) that the agent will use.
llm = OpenAI(temperature=0.2)

# Define the system message for the agent to guide its overall behavior
# This is where we reinforce general instructions for the agent.
system_message_for_agent = """
You are an AI assistant expert in finding and summarizing information about open-source GitHub projects.
Your role is to answer user questions by utilizing the sole tool at your disposal, 'GitHubProjectSearch'.

When you receive results from the tool, read them carefully.
In your final answer, summarize the relevant projects found.
**For each project, ensure you highlight its specific features, main functionalities, and list its relevant topics.**
**If the user's query mentions specific data challenges (e.g., 'missing values', 'noisy data'), explicitly state how each relevant project addresses or is capable of handling those challenges, based on the provided descriptions.**
It's crucial to stay faithful to the information the tool provided – do not add information not present in the search results.
Do not attempt to filter, re-rank, or pick only a subset of the projects returned by the tool, unless explicitly asked by the user.
If the tool finds multiple relevant projects, detail the relevant information for each.
Your answer should be clear, focused, informative, and as specific as possible regarding the project's capabilities in relation to the query.
"""

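# Initialize the LangChain agent with custom instructions.
# ZERO_SHOT_REACT_DESCRIPTION builds its prompt from `prefix`, `suffix`, and
# `format_instructions`; a `system_message` key is not part of that prompt,
# so the instructions are passed as the prompt prefix instead.
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, # Agent type that relies on tool descriptions
    verbose=True,
    handle_parsing_errors=True,
    agent_kwargs={
        "prefix": system_message_for_agent
    }
)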

print("\n--- LangChain Agent Integration ---")
print("Agent initialized and configured to use the GitHub project search tool with enhanced instructions.")
print("It is now ready to process natural language queries.")



--- LangChain Agent Integration ---
Agent initialized and configured to use the GitHub project search tool with enhanced instructions.
It is now ready to process natural language queries.
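
ReAct-style agents sometimes wrap the `Action Input` in quotes, and those quotes are then forwarded to the tool as part of the query. A small guard like the following could be applied at the top of `search_and_format_for_agent`; `strip_wrapping_quotes` is a hypothetical helper and not part of LangChain.

```python
def strip_wrapping_quotes(tool_input):
    """Remove one pair of matching quotes around the whole string, if present."""
    text = tool_input.strip()
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1].strip()
    return text

print(strip_wrapping_quotes("'python library for fine-tuning large language models'"))
print(strip_wrapping_quotes("it's a query without wrapping quotes"))
```

Only a matching pair around the whole input is removed, so apostrophes inside the query are left untouched.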



In [11]:

# Initialize a separate LLM for evaluation, or use the same one with a specific prompt
eval_llm = OpenAI(temperature=0.0) # Lower temperature for more deterministic evaluation

# Define the prompt for the evaluation LLM
EVAL_PROMPT = PromptTemplate(
    input_variables=["query", "retrieved_context", "generated_answer"],
    template="""You are an AI assistant tasked with evaluating the quality of an answer generated by another AI.
Here's the user's original query:
<QUERY>{query}</QUERY>

Here's the context retrieved from the knowledge base (GitHub projects):
<RETRIEVED_CONTEXT>{retrieved_context}</RETRIEVED_CONTEXT>

Here's the answer generated by the AI:
<GENERATED_ANSWER>{generated_answer}</GENERATED_ANSWER>

Please evaluate the GENERATED_ANSWER based on the following criteria. Provide a score from 1 to 5 (1=Poor, 5=Excellent) for each criterion, and a brief explanation.

1.  **Faithfulness (Adherence to Source):** Is the GENERATED_ANSWER fully supported by the RETRIEVED_CONTEXT? Does it contain any information not present in the context?
    Score:
    Explanation:

2.  **Relevance:** Is the GENERATED_ANSWER directly and comprehensively relevant to the USER'S QUERY?
    Score:
    Explanation:

3.  **Coherence and Fluency:** Is the GENERATED_ANSWER well-written, easy to understand, grammatically correct, and free of awkward phrasing?
    Score:
    Explanation:

Provide your final assessment and any overall comments.
"""
)

eval_chain = LLMChain(llm=eval_llm, prompt=EVAL_PROMPT)

def evaluate_llm_response_with_llm(query: str, retrieved_docs: list[dict], generated_answer: str):
    """
    Evaluates the generated answer using an LLM based on retrieved context and original query.
    """
    # Ensure retrieved_docs is not empty to avoid errors
    if not retrieved_docs:
        retrieved_context_str = "No documents were retrieved for this query."
    else:
        # Limit the context string length to avoid hitting LLM token limits for evaluation
        # We also want to make sure the essential info is there for the evaluation LLM
        context_parts = []
        for d in retrieved_docs:
            project_name = d.get('name', 'N/A')
            description = d.get('description', 'N/A')
            url = d.get('url', 'N/A')
            topics = d.get('topics', 'N/A')
            # Limit chunk excerpt to 500 characters to prevent excessive length
            full_chunk = d.get('full_text_chunk', '')
            chunk_excerpt = full_chunk[:500] + ('...' if len(full_chunk) > 500 else '')
            context_parts.append(
                f"Project Name: {project_name}\n"
                f"Description: {description}\n"
                f"URL: {url}\n"
                f"Topics: {topics}\n"
                f"Chunk Excerpt: {chunk_excerpt}"
            )
        retrieved_context_str = "\n---\n".join(context_parts)

    print("\n--- Running LLM-based Evaluation ---")
    try:
        evaluation_result = eval_chain.run(
            query=query,
            retrieved_context=retrieved_context_str,
            generated_answer=generated_answer
        )
    except Exception as e:
        evaluation_result = f"Error during LLM-based evaluation: {e}. Check LLM rate limits or context window."
    print("\n--- Evaluation Result ---")
    print(evaluation_result)
    return evaluation_result
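
The evaluation comes back as free text, which is hard to aggregate across queries. As a rough sketch, the per-criterion scores can be pulled out with a regular expression. This assumes the evaluator answers in the `Criterion: N - explanation` shape; other layouts (for example a separate `Score:` line) would need a different pattern. `parse_eval_scores` and the abbreviated sample text are hypothetical.

```python
import re

_SCORE_RE = re.compile(r"(Faithfulness|Relevance|Coherence and Fluency):\s*([1-5])\b")

def parse_eval_scores(evaluation_text):
    """Return {criterion: score} for every 'Criterion: N' pair found."""
    return {name: int(score) for name, score in _SCORE_RE.findall(evaluation_text)}

# Abbreviated, illustrative evaluator output.
sample_eval = (
    "\n1. Faithfulness: 4 - mostly supported by the context.\n"
    "2. Relevance: 5 - directly relevant to the query.\n"
    "3. Coherence and Fluency: 4 - well-written overall.\n"
)
print(parse_eval_scores(sample_eval))
```

Criteria that the evaluator omits or formats differently are simply missing from the returned dict, so callers should check for absent keys rather than assume all three are present.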



In [12]:

# Example Query 1: Focusing on LLM fine-tuning
query_1 = "I need a python library for fine-tuning large language models on custom datasets."
print(f"\nAgent Query 1: '{query_1}'")
retrieved_docs_1 = hybrid_search_with_reranking(query_1) # Call search to get docs separately for evaluation context
agent_response_1 = agent.run(query_1) # Run agent to get final answer
print("\n--- Agent's Final Response (Query 1) ---")
print(agent_response_1)
evaluate_llm_response_with_llm(query_1, retrieved_docs_1, agent_response_1)




Agent Query 1: 'I need a python library for fine-tuning large language models on custom datasets.'

--- Executing Hybrid Search with Reranking for query: 'I need a python library for fine-tuning large language models on custom datasets.' ---
  > Retrieved 20 documents from semantic search.
  > Retrieved 20 documents from BM25 search.


  > Reranked 34 candidate documents.
  > Returned top 5 reranked results.


> Entering new AgentExecutor chain...
 I should use GitHubProjectSearch to find relevant projects.
Action: GitHubProjectSearch
Action Input: 'python library for fine-tuning large language models on custom datasets'
--- Executing Hybrid Search with Reranking for query: ''python library for fine-tuning large language models on custom datasets'' ---
  > Retrieved 20 documents from semantic search.
  > Retrieved 20 documents from BM25 search.
  > Reranked 34 candidate documents.
  > Returned top 5 reranked results.

Observation: Found the following relevant GitHub projects:
1. Project Name: LongLoRA
   Description: efficient long context fine supervised fine longqa dataset
   URL: https://github.com/dvlab-research/LongLoRA
   Topics: fine tuning llm, large language models, llm, long context, lora
   Relevance Score: 1.7873

2. Project Name: xtuner
   Description: xtuner toolkit

1. Faithfulness: 4 - The GENERATED_ANSWER is mostly supported by the RETRIEVED_CONTEXT, as it mentions relevant projects and their focus on efficient fine-tuning of large language models. However, it does not mention the specific technologies used in each project, which are mentioned in the context.
2. Relevance: 5 - The GENERATED_ANSWER is directly and comprehensively relevant to the USER'S QUERY, as it provides multiple relevant projects for fine-tuning large language models on custom datasets in Python.
3. Coherence and Fluency: 4 - The GENERATED_ANSWER is well-written and easy to understand, but there are a few grammatical errors and awkward phrasing that could be improved.
Overall, the GENERATED_ANSWER is a good summary of the relevant projects found in the RETRIEVED_CONTEXT. It provides a comprehensive and relevant answer to the USER'S QUERY, but could benefit from some minor improvements in terms of grammar and phrasing.

In [13]:


# Example Query 2: Focusing on data mining and time series
query_2 = "Show me tools for data mining and clustering of time series with missing values."
print(f"\nAgent Query 2: '{query_2}'")
retrieved_docs_2 = hybrid_search_with_reranking(query_2) # Call search to get docs separately for evaluation context
agent_response_2 = agent.run(query_2) # Run agent to get final answer
print("\n--- Agent's Final Response (Query 2) ---")
print(agent_response_2)
evaluate_llm_response_with_llm(query_2, retrieved_docs_2, agent_response_2)


Agent Query 2: 'Show me tools for data mining and clustering of time series with missing values.'

--- Executing Hybrid Search with Reranking for query: 'Show me tools for data mining and clustering of time series with missing values.' ---
  > Retrieved 20 documents from semantic search.
  > Retrieved 20 documents from BM25 search.
  > Reranked 31 candidate documents.
  > Returned top 5 reranked results.


> Entering new AgentExecutor chain...
 I should use GitHubProjectSearch to find relevant projects.
Action: GitHubProjectSearch
Action Input: 'data mining, clustering, time series, missing values'
--- Executing Hybrid Search with Reranking for query: ''data mining, clustering, time series, missing values'' ---
  > Retrieved 20 documents from semantic search.
  > Retrieved 20 documents from BM25 search.
  > Reranked 31 candidate documents.
  > Returned top 5 reranked results.

Observation: Found the following relevant GitHub projects:
1. Project N

Final Assessment:
1. Faithfulness: 4 - The GENERATED_ANSWER is mostly supported by the RETRIEVED_CONTEXT, but it does contain some additional information not present in the context, such as the use of dynamic time warping and matrix profile for time series analysis.
2. Relevance: 5 - The GENERATED_ANSWER is directly and comprehensively relevant to the USER'S QUERY, as it provides a list of relevant projects for data mining and clustering of time series with missing values.
3. Coherence and Fluency: 5 - The GENERATED_ANSWER is well-written, easy to understand, and free of grammatical errors or awkward phrasing.

Overall, the GENERATED_ANSWER is a high-quality response that effectively addresses the USER'S QUERY and provides relevant and accurate information. However, it could be improved by being more faithful to the RETRIEVED_CONTEXT and avoiding the inclusion of any additional information not present in the context.