#### This Jupyter Notebook converts PDF files to Markdown and then builds a vector database from the converted Markdown files.

In [2]:
import os
from warnings import filterwarnings
from dotenv import load_dotenv
from docling.document_converter import DocumentConverter
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.groq import Groq
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb


##### Configuration

This cell defines the configuration variables: the directories for input PDFs, output Markdown files, and the vector database, plus the name of the ChromaDB collection. It also loads environment variables from a .env file, suppresses FutureWarning messages from easyocr to avoid cluttering the output, and disables tokenizer parallelism.

In [3]:
INPUT_PDF_DIR = 'input/pdfs/'
OUTPUT_MD_DIR = 'input/mds/'
CHROMADB_DIR = 'database/vector_store/'
CHROMADB_COLLECTION = 'rag_collection'

load_dotenv()

filterwarnings(action="ignore", category=FutureWarning, module="easyocr")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
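
The model cells further down read OPENAI_API_KEY and GROQ_API_KEY with os.getenv, which returns None when a variable is unset, so a missing key is not reported at this point. A quick check right after load_dotenv() can surface a misconfigured .env file early. The sketch below is one possible way to do that; the helper missing_env_vars is hypothetical and not part of the original notebook.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    # Hypothetical helper, not part of the original notebook.
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Report missing API keys before the model cells run.
missing = missing_env_vars(["OPENAI_API_KEY", "GROQ_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Passing a plain dictionary as env makes the helper easy to try out without touching the real environment.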

##### Function convert_pdfs_to_markdown

This function takes two arguments: the directory containing the PDF files and the directory where the converted Markdown files will be saved. It creates the output directory if it does not exist, then iterates over the PDF files in the input directory. Any PDF that does not already have a Markdown file in the output directory is converted with the DocumentConverter class, and the result is saved in the output directory.

In [1]:
def convert_pdfs_to_markdown(pdf_dir, md_dir):
	"""Convert each PDF in pdf_dir to a Markdown file in md_dir, skipping files already converted."""
	# Create the output directory if it does not exist yet.
	os.makedirs(md_dir, exist_ok=True)

	# One converter instance is reused for all files.
	doc_converter = DocumentConverter()

	pdf_files = [f for f in os.listdir(pdf_dir) if f.endswith('.pdf')]
	for pdf_file in pdf_files:
		pdf_path = os.path.join(pdf_dir, pdf_file)
		md_path = os.path.join(md_dir, f"{os.path.splitext(pdf_file)[0]}.md")

		# Skip PDFs that already have a Markdown counterpart.
		if not os.path.exists(md_path):
			print(f"Converting `{pdf_file}` to Markdown ...")
			result = doc_converter.convert(source=pdf_path)

			with open(md_path, 'w', encoding='utf-8') as md_file:
				md_file.write(result.document.export_to_markdown())
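
The skip-if-already-converted logic described above can be tried out separately from the conversion itself. The following is a minimal sketch using pathlib; the helper pending_conversions is hypothetical and not part of the original notebook. Unlike the function above, it matches the .pdf extension case-insensitively, which is an assumption about what is wanted.

```python
from pathlib import Path

def pending_conversions(pdf_dir, md_dir):
    """Return (pdf_path, md_path) pairs for PDFs that have no Markdown file yet."""
    # Hypothetical helper, not part of the original notebook.
    pdf_dir, md_dir = Path(pdf_dir), Path(md_dir)
    pairs = []
    for pdf_path in sorted(pdf_dir.iterdir()):
        # Assumption: accept .pdf, .PDF, etc.; the notebook only accepts lowercase .pdf.
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        md_path = md_dir / f"{pdf_path.stem}.md"
        if not md_path.exists():
            pairs.append((pdf_path, md_path))
    return pairs
```

Calling pending_conversions(INPUT_PDF_DIR, OUTPUT_MD_DIR) before running the conversion would list the files that still need processing, without loading any conversion models.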

##### Executes the convert_pdfs_to_markdown function

This cell converts all PDFs in the input directory to Markdown and saves them in the output directory. The function prints a message for each file it converts, so the output shows the progress of the conversion.

In [None]:
convert_pdfs_to_markdown(INPUT_PDF_DIR, OUTPUT_MD_DIR)

##### Initializes models and clients required for generating the vector database.

This cell creates an embedding model using the OpenAI API and a language model using the Groq API. Only the embedding model is needed to build the index; the language model is created here but not used in this cell. The cell then reads the converted Markdown documents from the output directory with SimpleDirectoryReader.

Next, the code initializes a persistent ChromaDB client and creates or retrieves a collection within the database. It wraps that collection in a ChromaVectorStore and builds a storage context around the vector store, leaving the other settings at their defaults. Finally, it creates a VectorStoreIndex from the loaded documents, using the embedding model for vectorization. The cell ends with a print statement indicating that the vector database has been generated.

This notebook is intended to be run first in the project workflow, as it prepares the necessary data and vector database for subsequent tasks.

In [6]:
# Embedding model (OpenAI) used to vectorize the documents.
chroma_embed_model = OpenAIEmbedding(api_key=os.getenv("OPENAI_API_KEY"))
# Language model (Groq); created here but not used when building the index.
llm_model = Groq(model="llama3-70b-8192", api_key=os.getenv("GROQ_API_KEY"))

# Load the converted Markdown files.
documents = SimpleDirectoryReader(input_dir=OUTPUT_MD_DIR).load_data()

# Open (or create) the on-disk ChromaDB store and its collection.
chroma_client = chromadb.PersistentClient(path=CHROMADB_DIR)
chroma_collection = chroma_client.get_or_create_collection(name=CHROMADB_COLLECTION)

# Embed the documents and write the vectors into the Chroma collection.
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
	documents,
	storage_context=storage_context,
	embed_model=chroma_embed_model,
)

print("Vector database successfully generated!")
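
Since this notebook prepares the vector database for subsequent tasks, it may help to see how a later notebook could reopen the persisted collection without re-embedding the documents. The sketch below is an illustration under assumptions, not the project's actual follow-up code: the helper load_query_engine and its parameter names are hypothetical, and the imports are kept inside the function so that defining it does not require the packages to be installed.

```python
def load_query_engine(chroma_dir, collection_name, embed_model, llm):
    """Reopen a persisted Chroma collection and wrap it in a LlamaIndex query engine."""
    # Hypothetical helper, not part of the original notebook.
    import chromadb
    from llama_index.core import VectorStoreIndex
    from llama_index.vector_stores.chroma import ChromaVectorStore

    client = chromadb.PersistentClient(path=chroma_dir)
    collection = client.get_or_create_collection(name=collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    # Build the index on top of the stored vectors; no documents are re-embedded here.
    index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
    return index.as_query_engine(llm=llm)
```

With the names from this notebook, a call might look like load_query_engine(CHROMADB_DIR, CHROMADB_COLLECTION, chroma_embed_model, llm_model). The embedding model passed in should be the same one used to build the index, because query vectors are only comparable with stored vectors produced by the same model.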