Skip to content

Implementation of a Retrieval-Augmented Generation (RAG) system using MongoDB Atlas Vector Search to enhance LLM-powered applications through semantic information retrieval. Developed as part of the MongoDB University - MongoDB Skills course.

Notifications You must be signed in to change notification settings

glhrmbg/rag-mongodb

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

9 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

πŸ” RAG with MongoDB

RAG (Retrieval-Augmented Generation) application using MongoDB Atlas Vector Search to enhance LLM-powered applications through semantic information retrieval.

  • Batch PDF document processing
  • Automatic metadata extraction using LLM
  • Vector embeddings with OpenAI
  • MongoDB Atlas Vector Search integration
  • Semantic search capabilities

This project is part of MongoDB University educational materials. Part of MongoDB University - MongoDB Skills course: RAG with MongoDB

βš™οΈ MongoDB Atlas Setup

1. Create Database and Collection

  • Database: documents
  • Collection: docs-chunks

2. Create Vector Search Index

Navigate to Atlas Search and create a new Vector Search Index:

Index Name: vector_index

JSON Configuration:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    {
      "type": "filter",
      "path": "hasCode"
    }
  ]
}

πŸ‘¨πŸ»β€πŸ’» How to use

  • Prerequisites:
    • Python 3.13+
    • MongoDB Atlas account with Vector Search Index configured
    • OpenAI API key
  1. Install dependencies:
pip install langchain langchain_community langchain_core langchain_openai langchain_mongodb pymongo pypdf python-dotenv
  1. Create a .env file in the root directory:
OPENAI_API_KEY=your-openai-api-key
MONGODB_URI=your-mongodb-connection-string
  1. Place your PDF files in the docs/ folder

  2. Run the ingestion script:

python load_data.py

The script will automatically process all PDFs in the docs/ folder and:

  • Load and clean each PDF
  • Extract metadata (title, keywords, code detection)
  • Split documents into chunks (500 chars with 150 overlap)
  • Generate embeddings using OpenAI
  • Store everything in MongoDB Atlas
  1. After ingesting documents, you can query your data:
python rag.py

Or import and use programmatically:

from rag import query_data

answer = query_data("What is the difference between a collection and database in MongoDB?")
print(answer)

About

Implementation of a Retrieval-Augmented Generation (RAG) system using MongoDB Atlas Vector Search to enhance LLM-powered applications through semantic information retrieval. Developed as part of the MongoDB University - MongoDB Skills course.

Topics

Resources

Stars

Watchers

Forks

Languages