### Building a RAG System with LangChain and ChromaDB

Introduction

RAG (Retrieval-Augmented Generation) is a technique that combines the capabilities of a large language model with external knowledge retrieval: relevant text is retrieved from a document store and passed to the model as context for its answer.

This notebook uses:

- LangChain: A framework for developing applications powered by language models
- ChromaDB: An open-source vector database for storing and retrieving embeddings
- HuggingFace: For embeddings and the language model (you can also use OpenAI or Groq)

In [1]:
import os
import numpy as np
from typing import List
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain_core.documents import Document
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma



RAG Architecture:

1. Document Loading: Load documents from various sources
2. Document Splitting: Break documents into smaller chunks
3. Embedding Generation: Convert chunks into vector representations
4. Vector Storage: Store embeddings in ChromaDB
5. Query Processing: Convert the user query to an embedding
6. Similarity Search: Find relevant chunks in the vector store
7. Context Augmentation: Combine the retrieved chunks with the query
8. Response Generation: The LLM generates an answer using the context
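
Steps 5 to 7 above can be sketched without any libraries. The following is a minimal illustration, not the LangChain or ChromaDB implementation: it replaces embedding similarity with a simple word-overlap (Jaccard) score, and the names `chunks`, `overlap_score`, `retrieve`, and `build_prompt`, as well as the prompt wording, are made up for this example.

```python
import re
from typing import List, Set

# Toy corpus standing in for the chunks produced by steps 1-2.
chunks = [
    "Machine learning builds systems that learn from data.",
    "Python is a high-level, interpreted programming language.",
    "Data science extracts insights and knowledge from data.",
]

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def overlap_score(query: str, chunk: str) -> float:
    """Jaccard similarity of the word sets (assumed stand-in for embedding similarity)."""
    q, c = tokenize(query), tokenize(chunk)
    union = q | c
    return len(q & c) / len(union) if union else 0.0

def retrieve(query: str, k: int = 1) -> List[str]:
    """Steps 5-6: score every chunk against the query and keep the top k."""
    ranked = sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: List[str]) -> str:
    """Step 7: combine the retrieved chunks with the query."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

question = "What kind of programming language is Python?"
top_chunks = retrieve(question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)
```

In the real pipeline, `overlap_score` would be replaced by a comparison of embedding vectors, and the resulting prompt would be sent to the LLM (step 8).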

Benefits of RAG:
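- Reduces hallucination by grounding answers in retrieved text
- Provides up-to-date information without retraining the model
- Allows citing sources, since each retrieved chunk keeps its source metadata
- Works with domain-specific knowledge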

### 1. Sample Data

In [2]:
sample_docs=[
    """
    Machine Learning Fundamentals
    
    Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from and 
    make decisions based on data. It involves various algorithms and statistical models that enable computers to 
    perform specific tasks without explicit instructions.
    """,
    """
    Introduction to Python Programming
    
    Python is a high-level, interpreted programming language known for its simplicity and readability. It supports
    multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is widely
    used in web development, data analysis, artificial intelligence, and scientific computing.
    """,
    """
    Data Science Overview

    Data science is an interdisciplinary field that combines statistics, computer science, and domain knowledge to
    extract insights and knowledge from data. It involves data collection, cleaning, analysis, visualization, and
    interpretation to support decision-making processes in various industries.
    """
]
sample_docs

['\n    Machine Learning Fundamentals\n\n    Machine learning is a subset of artificial intelligence that focuses on building systems that can learn from and \n    make decisions based on data. It involves various algorithms and statistical models that enable computers to \n    perform specific tasks without explicit instructions.\n    ',
 '\n    Introduction to Python Programming\n\n    Python is a high-level, interpreted programming language known for its simplicity and readability. It supports\n    multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is widely\n    used in web development, data analysis, artificial intelligence, and scientific computing.\n    ',
 '\n    Data Science Overview\n\n    Data science is an interdisciplinary field that combines statistics, computer science, and domain knowledge to\n    extract insights and knowledge from data. It involves data collection, cleaning, analysis, visualization, and\n    inter

In [3]:
# Save the sample documents as text files in the "data" directory,
# which is the directory the loader in the next section reads from
data_dir = "data"
os.makedirs(data_dir, exist_ok=True)

for i, doc in enumerate(sample_docs):
    with open(os.path.join(data_dir, f"doc_{i+1}.txt"), "w", encoding="utf-8") as f:
        f.write(doc)
print(f"Sample documents saved to {data_dir}")

Sample documents saved to data


### 2. Document Loading

In [4]:
loader = DirectoryLoader(
    "data",
    glob="*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=True,
)

documents = loader.load()
print(f"Loaded {len(documents)} documents.")
print("Sample document content:")
print(documents[0].page_content[:500])  # Print first 500 characters of the first document

100%|██████████| 3/3 [00:00<00:00, 1067.98it/s]

Loaded 3 documents.
Sample document content:

    Data Science Overview

    Data science is an interdisciplinary field that combines statistics, computer science, and domain knowledge to
    extract insights and knowledge from data. It involves data collection, cleaning, analysis, visualization, and
    interpretation to support decision-making processes in various industries.
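
To make the loading step less opaque, here is a rough pure-Python sketch of the idea: read every matching file and keep its path as metadata. This is not how `DirectoryLoader` or `TextLoader` are implemented; `SimpleDocument` and `load_text_files` are hypothetical names used only for illustration. The sketch sorts the paths because, as the output above suggests (the Data Science document came back first), the load order should not be relied on.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List

@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)

def load_text_files(directory: str, pattern: str = "*.txt") -> List[SimpleDocument]:
    """Read every file matching the pattern and record its path under 'source'."""
    docs = []
    # Sorting makes the order deterministic; a plain directory listing is not.
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        docs.append(SimpleDocument(text, {"source": str(path)}))
    return docs

# Demo on a throwaway directory so the sketch does not touch the real data folder.
with tempfile.TemporaryDirectory() as tmp:
    for name, body in [("doc_2.txt", "second"), ("doc_1.txt", "first"), ("notes.md", "skipped")]:
        Path(tmp, name).write_text(body, encoding="utf-8")
    loaded = load_text_files(tmp)
    print([Path(d.metadata["source"]).name for d in loaded])
    print([d.page_content for d in loaded])
```

The `notes.md` file is ignored because it does not match the `*.txt` pattern, which mirrors the role of the `glob` argument in the cell above.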
    





In [5]:
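# Initialize the text splitter.
# Separators are tried in order, so the empty string (which always matches)
# must come last; otherwise any separator listed after it is never used.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ".", " ", ""]
)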

# split documents into chunks
texts = text_splitter.split_documents(documents)
print(f"Split into {len(texts)} chunks from {len(documents)} documents.")
print("Sample chunk content:")
print(texts[0].page_content[:150])  # Print the first 150 characters of the first chunk
print(f"Metadata: {texts[0].metadata}")

Split into 3 chunks from 3 documents.
Sample chunk content:
Data Science Overview

    Data science is an interdisciplinary field that combines statistics, computer science, and domain knowledge to
    extract 
Metadata: {'source': 'data/doc_3.txt'}
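
The output above shows one chunk per document, presumably because each sample document is shorter than `chunk_size`, so `chunk_overlap` never comes into play here. To show what the two parameters mean, the sketch below uses a deliberately simplified fixed-width sliding window. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one splits on separators first); `sliding_window_chunks` is a hypothetical helper written for this illustration.

```python
from typing import List

def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one. In a real document that overlap keeps a sentence that straddles a chunk boundary available in both neighbouring chunks.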


Embeddings Model

In [6]:
# Hugging Face embeddings example

# Already imported in the first cell; repeated so this cell can run on its own
from langchain_huggingface import HuggingFaceEmbeddings

# Initialize the Hugging Face embedding model (no API key is required)

embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
embedding_model

HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2', cache_folder=None, model_kwargs={}, encode_kwargs={}, query_encode_kwargs={}, multi_process=False, show_progress=False)

In [9]:
sample_text = "ML or AI is the future of technology."
vector = embedding_model.embed_query(sample_text)
vector

[-0.03312410041689873,
 -0.05918993428349495,
 0.004337981343269348,
 -0.03739072382450104,
 -0.011759608052670956,
 -0.004276704974472523,
 -0.0234364103525877,
 0.024586819112300873,
 0.03545262664556503,
 -0.0030211014673113823,
 0.015179022215306759,
 0.0744793489575386,
 0.07151101529598236,
 -0.011977767571806908,
 -0.049196723848581314,
 0.08573225140571594,
 0.022614751011133194,
 -0.04269898682832718,
 -0.07812504470348358,
 -0.05562921613454819,
 -0.0023514016065746546,
 -0.002881679916754365,
 0.00027759003569372,
 -0.023849360644817352,
 -0.009256625548005104,
 0.06912091374397278,
 0.041870854794979095,
 -0.10859371721744537,
 0.013667579740285873,
 0.024943890050053596,
 -0.007728179916739464,
 0.023151669651269913,
 0.018692098557949066,
 0.02918194606900215,
 -0.07642051577568054,
 0.022142376750707626,
 -0.02349327877163887,
 -0.030036956071853638,
 0.0747632309794426,
 -0.05341344699263573,
 -0.036218248307704926,
 -0.13010069727897644,
 -0.025390813127160072,
 -0.055
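
As the output shows, `embed_query` returns a plain list of floats. Vectors like this are compared numerically during similarity search. The sketch below computes cosine similarity, one common choice, on made-up three-dimensional vectors; the metric a vector store actually uses depends on its configuration, and `cosine_similarity` here is a hand-written helper, not a library call.

```python
import math
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)

# Made-up vectors; real embeddings from the model above are much longer.
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]  # same direction as the query
far_vec = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_similarity(query_vec, close_vec))  # approximately 1
print(cosine_similarity(query_vec, far_vec))    # 0.0
```

Because cosine similarity ignores vector length, `close_vec` scores as a near-perfect match even though its components are twice as large as those of `query_vec`.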

## Initialize the ChromaDB vector store and store the chunks as vector representations