### Libraries:

- LangChain modules handle embeddings, the vector store, document loading, and LLM interaction.
- TensorFlow is imported only to set its log verbosity to ERROR, which keeps warnings out of the output.
### Functionality:

##### Indexing:
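- Scans the specified directory for files with the requested extensions (.txt and .pdf by default; the loader also has a branch for .xlsx).
- Splits documents into chunks targeting roughly 1,000 characters, with a 200-character overlap between neighbouring chunks.
- Embeds the chunks and stores the vectors in a FAISS vector database saved under ./vectorstore.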
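
The overlap in the chunking step can be illustrated with a minimal sketch. This is not how LangChain's `CharacterTextSplitter` works internally (it splits on a separator and merges the pieces); the hypothetical `chunk_text` helper below uses plain fixed-size windows only to show what `chunk_size` and `chunk_overlap` mean.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-size windows that overlap by chunk_overlap characters.

    Hypothetical helper for illustration; not part of LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Small values make the overlap visible
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.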
##### Querying:
- Retrieves the document chunks most similar to the query by comparing embeddings.
- Passes the retrieved chunks to the Llama2 model (llama2:7b, served through Ollama) to generate an answer.
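
The retrieval step can be pictured as a nearest-neighbour search over embedding vectors. The sketch below is a simplified stand-in that ranks toy two-dimensional vectors by cosine similarity. The script itself delegates this search to FAISS, whose distance metric and internals may differ, and the names `cosine_similarity` and `top_k_chunks` and the value of `k` are illustrative assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k chunks whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy "embeddings"; real ones from all-mpnet-base-v2 have many more dimensions
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k_chunks([1.0, 0.0], chunks, k=2))
# ['a', 'c']
```

Only the top-ranked chunks are passed to the language model, which is why the answer quality depends on both the chunking and the embedding model.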
### Changes Made:

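- Passed allow_dangerous_deserialization=True explicitly when loading the FAISS vector database. The flag does not make loading safe; it acknowledges that the saved index includes a pickle file, so only load a vector store you created yourself.
- Enhanced error handling and added print statements for easier debugging.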
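
Because the flag opts into unpickling, it helps to confirm that the directory really holds an index you saved before loading it. The sketch below assumes the default file names written by LangChain's `FAISS.save_local` (`index.faiss` and `index.pkl`); the helper name `has_saved_faiss_index` is hypothetical, and the file names change if a custom `index_name` was used when saving.

```python
import os


def has_saved_faiss_index(path: str, index_name: str = "index") -> bool:
    """Check that both files of a saved LangChain FAISS store are present.

    Assumes the default layout: <index_name>.faiss and <index_name>.pkl.
    """
    required = (f"{index_name}.faiss", f"{index_name}.pkl")
    return all(os.path.isfile(os.path.join(path, name)) for name in required)


# Stricter than os.path.exists, which is also true for an empty directory
if has_saved_faiss_index("./vectorstore"):
    print("Saved index found; safe to attempt loading if you created it.")
else:
    print("No complete saved index found.")
```

This check does not verify who wrote the files, so it complements rather than replaces the rule of loading only self-created vector stores.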

By combining the text embeddings of HuggingFace, the retrieval capabilities of FAISS, and the generative responses of Llama2, this script provides a complete pipeline for document-based question answering.

In [3]:
import os
from langchain.embeddings import HuggingFaceEmbeddings  # To generate text embeddings using HuggingFace models
from langchain.vectorstores import FAISS  # To store and query vectorized representations of text
from langchain.llms import Ollama  # To interact with Llama2 language models for generative tasks
from langchain.chains import RetrievalQA  # Combines retrieval and question-answering functionality
from langchain.document_loaders import TextLoader, PyPDFLoader  # To load .txt and .pdf documents
from langchain.text_splitter import CharacterTextSplitter  # To split large documents into smaller chunks

# Suppress TensorFlow warnings for a cleaner output
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

# Initialize HuggingFace embeddings
# Provides pre-trained embeddings for semantic similarity search
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Path to the FAISS vector database
vector_db_path = "./vectorstore"
vector_db = None

# Check if the vector database already exists
if os.path.exists(vector_db_path):
    print("Loading existing vector database...")
    vector_db = FAISS.load_local(
        vector_db_path,  # Path to the existing vector database
        embeddings,  # Embeddings model to use
        allow_dangerous_deserialization=True  # Explicitly allow loading pickle files
    )
else:
    print("No existing vector database found. Starting fresh.")

def index_documents(directory, file_extensions=(".txt", ".pdf")):
    """Index documents in a directory and store them in the vector database."""
    global vector_db
    docs = []  # List to store loaded documents

    # Traverse the directory and find all files matching the specified extensions
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(file_extensions):
                file_path = os.path.join(root, file)
                print(f"Loading document: {file_path}")
                try:
                    # Load the file with the loader that matches its extension
                    if file.endswith(".txt"):
                        loaded_docs = TextLoader(file_path, encoding="utf-8").load()
                    elif file.endswith(".pdf"):
                        loaded_docs = PyPDFLoader(file_path).load()
                    elif file.endswith(".xlsx"):
                        # pandas objects have no .load() method, so wrap the first
                        # sheet's text in a Document by hand (reading .xlsx needs openpyxl)
                        import pandas as pd
                        from langchain.schema import Document
                        workbook = pd.ExcelFile(file_path)
                        print(f"Available sheets: {workbook.sheet_names}")
                        data = workbook.parse(workbook.sheet_names[0])
                        loaded_docs = [Document(
                            page_content=data.to_string(index=False),
                            metadata={"source": file_path},
                        )]
                    else:
                        print(f"Unsupported file type: {file}")
                        continue
                    docs.extend(loaded_docs)  # Add the loaded documents to the list
                except Exception as e:
                    print(f"Failed to load document: {file_path}. Error: {e}")

    # Exit if no documents were loaded
    if not docs:
        print("No valid documents found in the specified directory.")
        return

    # Split documents into smaller chunks to improve embedding quality
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    split_docs = text_splitter.split_documents(docs)

    # Exit if no document chunks were created
    if not split_docs:
        print("No valid splits created from the loaded documents.")
        return

    # Create or update the FAISS vector database
    if not vector_db:
        print(f"Creating a new FAISS vector database with {len(split_docs)} documents...")
        vector_db = FAISS.from_documents(split_docs, embeddings)
    else:
        print(f"Adding {len(split_docs)} documents to the existing vector database...")
        vector_db.add_documents(split_docs)

    # Save the updated database to disk
    vector_db.save_local(vector_db_path)
    print("Documents indexed successfully.")

def query_documents(query):
    """Query the vector database and use Llama2 for answering questions."""
    if not vector_db:
        print("No documents indexed. Please index documents first.")
        return

    # Initialize the Llama2 language model using Ollama
    llm = Ollama(model="llama2:7b")  # Ensure the model is pulled using `ollama pull llama2:7b`

    # Set up RetrievalQA with the vector database and Llama2 model
    retriever = vector_db.as_retriever()  # Converts vector database into a retriever
    qa_chain = RetrievalQA.from_chain_type(
        llm, retriever=retriever, return_source_documents=True
    )

    # Execute the query using the QA chain
    print(f"\nQuerying: {query}")
    result = qa_chain({"query": query})
    answer = result.get("result", "No answer available")
    source_docs = result.get("source_documents", [])

    # Display the generated answer and the source documents
    print(f"\nAnswer:\n{answer}")
    print("\nSource Documents:")
    for doc in source_docs:
        print(f"- {doc.metadata.get('source', 'Unknown')}")

# Example Usage:

# 1. Index documents from the specified folder
directory_to_index = "C:/Users/Ianth/Documents/Docs"
index_documents(directory_to_index, file_extensions=(".txt", ".pdf"))

# 2. Query the indexed documents
query = """Open the Mos fs file and prepare a template in this format:
Company Overview
- Company Name: 
- Business Segments:

Financial Overview
- 2024 Net Sales vs 2023 Net Sales:
- 2024 Operating Earnings vs 2023 Operating Earnings:
- Cash Balances:
- Short and long term debt:

Financial Highlights:

Environmental Highlights:

In the first section "Company Overview" this is more of a data entry section. The Company name can be 
located in the Organization and Nature of Business section of the Mos fs doc. Please list out the
business segments that the business is organized into, centered around the commodities starting with Potash which, can be found on page 7 within section 1. Organization and Nature of Business 
They can be found in the Organization and Nature of Business section. 

In the second section "Financial Overview" this is more of a data entry section so please just list the 
results, you can expand further on them in the 3rd section. For example, 2024 Net Sales vs 2023 Net Sales 
should be entered just as the actual figure for 2024 net sales vs 2023 net sales. Please include only information about results 
for the 9 month ended period. The 2024 and 2023 net sales and operating earnings can be found in the 
financial statements section. Cash and cash equivalents can also be found in the same section. Short and 
long term debt can be found in the same section.

In the 3rd section "Financial Highlights", please provide a detailed paragraph about the changes in 
financial performance and what the drivers were for the business and its segments in the 2024 9 months 
ended period, which can be found in the Overview of Consolidated Results for the nine months ended September 
30, 2024 and 2023 section. For example, explain what caused the change in net sales. 

In the 4th section "Environmental Highlights", please provide an informative paragraph about the legal and 
financial challenges Mosaic faces as a result of the business it engages in, which can be found in the
Environmental, Health, Safety and Security Matters section. Provide relevant information that impacts changes
in financial performance."""

query_documents(query)


Loading existing vector database...
Loading document: C:/Users/Ianth/Documents/Docs\Mos fs.pdf
Adding 54 documents to the existing vector database...
Documents indexed successfully.

Querying: Open the Mos fs file and prepare a template in this format:
Company Overview
- Company Name: 
- Business Segments:

Financial Overview
- 2024 Net Sales vs 2023 Net Sales:
- 2024 Operating Earnings vs 2023 Operating Earnings:
- Cash Balances:
- Short and long term debt:

Financial Highlights:

Environmental Highlights:

In the first section "Company Overview" this is more of a data entry section. The Company name can be 
located in the Organization and Nature of Business section of the Mos fs doc. Please list out the
business segments that the business is organized into, centered around the commodities starting with Potash which, can be found on page 7 within section 1. Organization and Nature of Business 
They can be found in the Organization and Nature of Business section. 

In the second sectio