## Creating a RAG based Chatbot with LangChain and OpenAI

## **STEP 1- Data Ingestion Pipeline**

1. Connect to the NASA data on Google Dive
2. Parsing and chunking of data using LangChain
3. Embeddings using sentence transformer(huggingface)
4. Ingestion into VectorDB

In [2]:
pip install -r requirements.txt

Collecting langchain-groq (from -r requirements.txt (line 9))
  Using cached langchain_groq-1.0.0-py3-none-any.whl.metadata (1.7 kB)
Collecting groq<1.0.0,>=0.30.0 (from langchain-groq->-r requirements.txt (line 9))
  Using cached groq-0.33.0-py3-none-any.whl.metadata (16 kB)
INFO: pip is looking at multiple versions of langchain-groq to determine which version is compatible with other requirements. This could take a while.
Collecting langchain-groq (from -r requirements.txt (line 9))
  Downloading langchain_groq-0.3.8-py3-none-any.whl.metadata (2.6 kB)
Downloading langchain_groq-0.3.8-py3-none-any.whl (16 kB)
Using cached groq-0.33.0-py3-none-any.whl (135 kB)
Installing collected packages: groq, langchain-groq
Successfully installed groq-0.33.0 langchain-groq-0.3.8
Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Connecting to NASA Data on GoogleDrive

In [None]:
import os
folder_path = "/content/drive/My Drive/data"
os.listdir(folder_path)

['Audits and Investigations',
 'Financial Management',
 'Human Resources and Personnel',
 'Legal Policies',
 'Program Management',
 'Program Formulation',
 'Procurement, Small Business and Industrial Relations',
 'Organization and Administration',
 'Transportation',
 'Property, Supply and Equipment']

In [1]:
import os
folder_path = "Data"
os.listdir(folder_path)

['Audits and Investigations',
 'Financial Management',
 'Human Resources and Personnel',
 'Legal Policies',
 'Organization and Administration',
 'Procurement, Small Business and Industrial Relations',
 'Program Formulation',
 'Program Management',
 'Property, Supply and Equipment',
 'Transportation']

In [2]:
#directory Loader
from langchain_core.documents import Document
from langchain_community.document_loaders import TextLoader, PyPDFLoader,PyMuPDFLoader, DirectoryLoader


  from .autonotebook import tqdm as notebook_tqdm


In [3]:
dir_loader= DirectoryLoader(folder_path,
                            loader_cls=PyMuPDFLoader,
                            show_progress= True
                            )

In [4]:
documents= dir_loader.load()

100%|██████████| 212/212 [00:35<00:00,  5.94it/s]


Embedding and VectorDB

In [5]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
#uses sentence transformer to handle document embeddings
class EmbeddingManager:
  def  __init__(self, model_name: str= "all-MiniLM-L6-v2"):
    '''
    initialize the embedding manager

    Args:
      model_name: HuggingFace model name for sentence embeddings
    '''
    self.model_name= model_name
    self.model= None
    self._load_model()

  def _load_model(self):

    '''
    load the sentence transformer model
    '''
    try:
      print(f'Loading embedding model:{self.model_name}')
      self.model= SentenceTransformer(self.model_name)
      print(f'Embedding model loaded successfully. Embedding dimension:{self.model.get_sentence_embedding_dimension()}')
    except Exception as e:
      print(f'Error loading model: {self.model_name}:{e}')
      raise

  def generate_embeddings(self, texts:list[str])->np.ndarray:
    if not self.model:
      raise ValueError("Model not loaded")
    print(f"Generating embeddings for {len(texts)} texts...")
    embeddings = self.model.encode(texts, show_progress_bar=True)
    print(f"Generated embeddings with shape: {embeddings.shape}")
    return embeddings

In [7]:
embedding_manager= EmbeddingManager()
embedding_manager

Loading embedding model:all-MiniLM-L6-v2


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Embedding model loaded successfully. Embedding dimension:384


<__main__.EmbeddingManager at 0x205cabd3650>

VectoreStore

In [8]:
import os
import numpy as np
import chromadb
from chromadb.config import Settings
import uuid
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity
from langchain_core.documents import Document

class VectorStore:
  def __init__(self, collection_name: str="pdf_documents", persist_directory: str="../data/vector_store"):
    self.collection_name= collection_name
    self.persist_directory= persist_directory
    self.client=None
    self.collection= None
    self.initialize_store()

  def initialize_store(self):
    '''initialize ChromaDBclient and collection'''
    try:
      #create persistent Chromadb client
      os.makedirs(self.persist_directory, exist_ok=True)
      self.client=chromadb.PersistentClient(path=self.persist_directory)

      #get or create collection
      self.collection= self.client.get_or_create_collection(
          name=self.collection_name,
          metadata={"description": "NASA documents"}
      )
      print(f"Vector store initialized. Collection: {self.collection_name}")
      print(f"Total documents in the collection: {self.collection_count()}")

    except Exception as e:
      print(f"Error initializing vector store: {e}")
      raise

  def collection_count(self) -> int:
      if self.collection:
          return self.collection.count()
      return 0

  def add_documents(self, documents:List[Any], embeddings: np.ndarray):
    '''
    Add documents to the vector store
    Args:
      documents: List of LangChain documents
      embeddings: corresponding embeddings for thedocument
      '''
    if len(documents)!= len(embeddings):
      raise ValueError("Number of documents must match number of embeddings")

    print(f"Adding {len(documents)} documents to the vector store...")

    #prepare datafor ChromaDB
    ids=[]
    metadatas=[]
    documents_text=[]
    embeddings_list=[]

    for i, (doc, embedding) in enumerate(zip(documents, embeddings)):
      doc_id=f"doc_{uuid.uuid4().hex[:8]}_{i}"
      ids.append(doc_id)

      #prepare metadata
      metadata=dict(doc.metadata)
      metadata['doc_index']=i
      metadata['content_length']= len(doc.page_content)
      metadatas.append(metadata)

      #document content
      documents_text.append(doc.page_content)

      embeddings_list.append(embedding.tolist())

    #Add to collection
    try:
      self.collection.add(
          ids=ids,
          embeddings=embeddings_list,
          metadatas=metadatas,
          documents=documents_text
      )
      print(f"Sucessfully added {len(documents)} documents to the vector store")
      print(f"Total documents in the collection: {self.collection_count()}")
    except Exception as e:
      print(f"Error adding documents to the vector store: {e}")
      raise

In [9]:
vectorstore=VectorStore()
vectorstore

Vector store initialized. Collection: pdf_documents
Total documents in the collection: 0


<__main__.VectorStore at 0x205cdae9bd0>

In [10]:
### convert the text to embedding
texts= [document.page_content for document in documents]
embeddings= embedding_manager.generate_embeddings(texts)
vectorstore.add_documents(documents, embeddings)

Generating embeddings for 4285 texts...


Batches: 100%|██████████| 134/134 [15:50<00:00,  7.10s/it]


Generated embeddings with shape: (4285, 384)
Adding 4285 documents to the vector store...
Sucessfully added 4285 documents to the vector store
Total documents in the collection: 4285


# Retrieval Pipeline
1. Create a retrieval class
2. Test some queries
3. Intergrate LLM

In [11]:
# create a retrieval class
class RetrievalPipeline:
  def __init__(self, vectorstore: VectorStore, embedding_manager: EmbeddingManager):
    self.vectorstore= vectorstore
    self.embedding_manager= embedding_manager

  def retrieve(self, query:str, top_k: int=5, score_threshold: float=0.0)->List[Dict[str, Any]]:
    print(f"Retrieving documents for query: '{query}'")
    print(f"Top K: {top_k}, score threshold: {score_threshold}")

    #Generate embedding
    query_embedding= self.embedding_manager.generate_embeddings([query])[0]

    #search vector store
    try:
      results= self.vectorstore.collection.query(
          query_embeddings=[query_embedding.tolist()],
          n_results=top_k,
          include=['documents', 'metadatas', 'distances'] # Explicitly include what's needed
      )
      #process results
      retrieved_docs=[] # Corrected indentation
      if results['documents'] and results['documents'][0]:
        documents= results['documents'][0]
        metadatas= results['metadatas'][0]
        distances= results['distances'][0]
        # ids=results['ids'][0] # Not used in the current logic, can be removed or used if needed

        for i, document in enumerate(documents):
          similarity_score= 1 - distances[i] # Corrected: access individual distance
          if similarity_score >= score_threshold:
            retrieved_docs.append({
                'document': document,
                'metadata': metadatas[i], # Corrected: access individual metadata
                'distance': distances[i], # Corrected: access individual distance
                'rank': i+1
            })
        print(f"Retrieved {len(retrieved_docs)} documents") # Corrected print statement
      else:
        print("No documents retrieved")
      return retrieved_docs
    except Exception as e:
      print(f"Error retrieving documents: {e}")
      return []

In [14]:
rag_retrieval=RetrievalPipeline(vectorstore=vectorstore, embedding_manager=embedding_manager)

In [None]:
rag_retrieval.retrieve("what is the policy for Court Actions or Proceedings Involving NASA Employee")


Retrieving documents for query: 'what Court Actions or Proceedings Involving NASA Employee'
Top K: 5, score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:00<00:00,  2.49it/s]

Generated embeddings with shape: (1, 384)
Retrieved 5 documents





[{'document': "| NODIS Library | Legal Policies(2000s) | Search | \n NASA\nPolicy\nDirective \nNPD 2010.1E\nEffective Date: June 06, 2013\nExpiration Date: June 06, 2028\nCOMPLIANCE IS MANDATORY FOR NASA EMPLOYEES \nPrintable Format (PDF)\nSubject: Court Actions or Proceedings Involving NASA or NASA Employees\n(Updated w/Administrative Change 3 on 09/26/2024)\nResponsible Office: Office of the General Counsel\nChg# \nDate \nDescription/Comments \n1\n07/02/2018 Update to comply with 1400 Compliance with administrative changes,\ncorrect citations and and added an Appendix A: Reference.\n2\n06/27/2023 Update to comply with 1400 Compliance, with administrative changes.\n3\n09/26/2024 This is an update with administrative change, to clarify the handling of\nsubpoenas.\n1. POLICY \nNASA policy is that any information concerning court actions, or administrative or regulatory proceedings, brought on\nbehalf of, or against, the United States, NASA, or any NASA current or former employee, result

In [39]:

from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv
load_dotenv()

##initializing the Groq LLM

groq_api_key= os.getenv("GROQ_API_KEY")


llm = ChatGroq(
    groq_api_key=groq_api_key,          # Your existing key
    model="llama-3.1-8b-instant",       # Groq's top rec for gemma2-9b-it
    temperature=0.1,                    # Keeps answers factual/concise
    max_tokens=1024                     # Plenty for RAG responses
)

# Quick test 
test_response = llm.invoke("what are the policy for Court Actions or Proceedings Involving NASA Employee")
print(test_response.content)

The policies for court actions or proceedings involving NASA employees are outlined in the NASA Policy Directive (NPD) 1280.1, "Legal Services." Here are some key points:

1. **Confidentiality**: NASA employees are expected to maintain confidentiality regarding any court actions or proceedings involving the agency. This includes not disclosing sensitive information about the case, the parties involved, or the agency's position.
2. **Representation**: NASA employees are entitled to representation by the NASA Office of the General Counsel (OGC) in any court action or proceeding. The OGC will provide counsel and support to ensure that the employee's rights are protected.
3. **Employee Conduct**: NASA employees are expected to conduct themselves in a professional and respectful manner during court proceedings. This includes avoiding any behavior that could be perceived as unprofessional or disruptive.
4. **Testimony**: NASA employees may be required to testify in court proceedings. In such

#### GROQ LLM INTEGRATION

In [None]:
from langchain_groq import ChatGroq
import os
from dotenv import load_dotenv
load_dotenv()

groq_api_key= os.getenv("GROQ_API_KEY")


llm = ChatGroq(
    groq_api_key=groq_api_key,         
    model="llama-3.1-8b-instant",       
    temperature=0.1,                   
    max_tokens=1024                    
)



#RAG function
def rag_simple(query, retriever, llm, top_k=3):
    results = retriever.retrieve(query, top_k=top_k)

    # NEW: Use the correct key ('document') instead of 'content'
    context = "\n\n".join([doc['document'] for doc in results]) if results else ""
    if not context:
        return "no relevant context found to answer the question"

    # Fixed prompt (no .format() needed – f-string already injects values)
    prompt = f"""Use the following context to answer the question concisely.
context:
{context}
question: {query}
answer:"""

    response = llm.invoke(prompt)         
    return response.content

In [41]:
answer= rag_simple("what is the policy on Court Actions or Proceedings Involving NASA Employee", rag_retrieval, llm)
print(answer)

Retrieving documents for query: 'what is the policy on Court Actions or Proceedings Involving NASA Employee'
Top K: 3, score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:02<00:00,  2.03s/it]


Generated embeddings with shape: (1, 384)
Retrieved 3 documents
According to NPD 2010.1E, the policy on Court Actions or Proceedings Involving NASA Employees is as follows:

- Any information concerning court actions or administrative or regulatory proceedings brought on behalf of, or against, the United States, NASA, or any NASA current or former employee, resulting from alleged NASA activities, must be promptly reported to the Office of the General Counsel at the appropriate Center.
- Service of process for private actions is generally voluntary, but Field Centers may provide otherwise.
- NASA employees and contractor employees may voluntarily accept service of process onsite, except if provided for otherwise by Center-specific regulation, in their personal capacity (i.e., an allegation, complaint, or dispute not arising from their official NASA duties).


## More concise answer with score

In [37]:
def rag_advanced(
    query,
    retriever,
    llm,
    top_k=5,
    min_score=0.2,
    return_context=False
):
    """
    Returns answer + sources + confidence + (optional) full context.
    """
    # 1. Retrieve
    results = retriever.retrieve(query, top_k=top_k, score_threshold=min_score)
    if not results:
        return {
            "answer": "No relevant context found to answer the question",
            "sources": [],
            "confidence": 0.0,
            "context": ""
        }

    # 2. Build context & sources (use the correct keys!)
    context = "\n\n".join([doc["document"] for doc in results])          # <-- 'document'
    sources = [
        {
            "source": doc["metadata"].get("source_file", "unknown"),
            "page": doc["metadata"].get("page", "unknown"),
            "score": doc["distance"],                                 # <-- 'distance' = similarity
            "preview": doc["document"][:300] + "..."
        }
        for doc in results
    ]
    confidence = max(doc["distance"] for doc in results)                # higher = better match

    # 3. Prompt (no .format() needed – use f-string)
    prompt = f"""Use the following context to answer the question concisely.

Context:
{context}

Question: {query}

Answer:"""

    # 4. Call LLM (pass string directly)
    response = llm.invoke(prompt)

    # 5. Build output
    output = {
        "answer": response.content,
        "sources": sources,
        "confidence": confidence,
    }
    if return_context:
        output["context"] = context
    return output

In [44]:
answer = rag_advanced(
    query="what is the policy on Court Actions or Proceedings Involving NASA Employee",
    retriever=rag_retrieval,
    llm=llm,
    top_k=3,
    min_score=0.0,
    return_context=True
)

print("Answer:", answer["answer"])
print("\nConfidence:", answer["confidence"])
print("\nSources:")
for s in answer["sources"]:
    print(f"  • {s['source']} (page {s['page']}) | score: {s['score']:.3f}")

Retrieving documents for query: 'what is the policy on Court Actions or Proceedings Involving NASA Employee'
Top K: 3, score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:02<00:00,  2.64s/it]


Generated embeddings with shape: (1, 384)
Retrieved 3 documents
Answer: According to NPD 2010.1E, NASA policy is that any information concerning court actions, or administrative or regulatory proceedings, brought on behalf of, or against, the United States, NASA, or any NASA current or former employee, resulting from alleged NASA activities, are promptly reported to the Office of the General Counsel at the appropriate Center.

Confidence: 0.5865508317947388

Sources:
  • unknown (page 0) | score: 0.406
  • unknown (page 32) | score: 0.585
  • unknown (page 1) | score: 0.587


In [45]:
answer = rag_advanced(
    query="what is the policy on burger king",
    retriever=rag_retrieval,
    llm=llm,
    top_k=3,
    min_score=0.0,
    return_context=True
)

print("Answer:", answer["answer"])
print("\nConfidence:", answer["confidence"])
print("\nSources:")
for s in answer["sources"]:
    print(f"  • {s['source']} (page {s['page']}) | score: {s['score']:.3f}")

Retrieving documents for query: 'what is the policy on burger king'
Top K: 3, score threshold: 0.0
Generating embeddings for 1 texts...


Batches: 100%|██████████| 1/1 [00:02<00:00,  2.58s/it]

Generated embeddings with shape: (1, 384)
Retrieved 0 documents
Answer: No relevant context found to answer the question

Confidence: 0.0

Sources:



