# Local File Retrieval Demo Notebook

This notebook demonstrates how to use the Local File Retrieval application to:

- Load and process documents (including PDFs).
- Create embeddings using a configurable model.
- Store embeddings in a SQLite vector database.
- Perform similarity searches with a configurable `k` value.
- Interactively query the database.

Feel free to modify the code and configurations to suit your needs.

## Customization and Experimentation

- **Change the Query:** Modify the `query_text` variable to test different queries.
- **Adjust the `k` Value:** Change the value of `k` in the configuration cell to retrieve more or fewer results.
- **Use a Different Embedding Model:** Update `model_name` to use a different SentenceTransformer model.
- **Add More Data:** Place additional documents (including PDFs) in your data directory and rerun the notebook.
- **Modify Chunk Size and Overlap:** Adjust `chunk_size` and `chunk_overlap` in the configuration cell to see how they affect the results.
- **Explore the Code:** Feel free to delve into the source code in the `src/` directory to understand how each component works.
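
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a minimal sketch of character-based splitting with a sliding window. The helper `split_text` is hypothetical and written only for illustration; the project's `data_loader.split_documents` may split differently (for example, on sentence or paragraph boundaries).

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap.

    Hypothetical helper for illustration; not the project's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one. A larger overlap preserves more context across chunk boundaries but produces more chunks to embed and store.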

In [None]:
# Standard library
import os
import sqlite3
import sys

# Third-party
import numpy as np
import yaml

# Application modules
from src import data_loader, embedding, database, query

## Configuration

In [None]:

# Define the folder path where your documents are located
folder_path = "../data/example_data"  # adjust as needed

# Define the SQLite database file to use
db_file = "../documents.db"  

# Specify the model name to use for creating embeddings
model_name = "all-MiniLM-L6-v2"  # see the Hugging Face model hub for other SentenceTransformer models

# Define chunk size and overlap for document splitting
chunk_size = 1000  # number of characters per chunk
chunk_overlap = 100  # number of characters to overlap between chunks

# Define the file extensions to load; files with other extensions are ignored
file_extensions = [".txt", ".md", ".py", ".json", ".csv", ".pdf"]

# Specify the number of top similar documents to retrieve for each query
k = 5

# Display the configurations
print(f"Folder Path: {folder_path}")
print(f"Database File: {db_file}")
print(f"Embedding Model: {model_name}")
print(f"Chunk Size: {chunk_size}")
print(f"Chunk Overlap: {chunk_overlap}")
print(f"File Extensions: {file_extensions}")
print(f"Number of Results to Retrieve (k): {k}")
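
Before running the remaining cells, it can help to sanity-check the configuration values. The sketch below uses a hypothetical `validate_config` helper that is not part of the application. The rules it enforces are assumptions based on how the values are described above, for example that extensions are written with a leading dot as in `file_extensions`.

```python
def validate_config(chunk_size, chunk_overlap, k, file_extensions):
    """Return a list of problems; an empty list means the values look sane.

    Hypothetical helper; the checks are assumptions, not rules from the project.
    """
    problems = []
    if chunk_size <= 0:
        problems.append("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        problems.append("chunk_overlap must be >= 0 and smaller than chunk_size")
    if k < 1:
        problems.append("k must be at least 1")
    bad = [ext for ext in file_extensions if not ext.startswith(".")]
    if bad:
        problems.append(f"extensions should start with a dot: {bad}")
    return problems


ok_result = validate_config(1000, 100, 5, [".txt", ".md", ".pdf"])
bad_result = validate_config(100, 100, 0, ["txt"])
print(ok_result)        # []
print(len(bad_result))  # 3
```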

## Document processing

In [None]:
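# Load documents
print("Loading documents...")
# Note: folder_path is joined onto '..' and resolved against the current working directory
documents = data_loader.load_documents(os.path.abspath(os.path.join('..', folder_path)), file_extensions)
if not documents:
    # Stop here: the later cells need at least one document
    raise ValueError("No valid documents found in the specified folder.")
print(f"Loaded {len(documents)} documents.")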

# Split documents into chunks
print("Splitting documents into chunks...")
docs = data_loader.split_documents(documents, chunk_size, chunk_overlap)
print(f"Split into {len(docs)} chunks.")
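
The loading step above only considers files whose extension appears in `file_extensions`. As a rough sketch of what that filtering might look like, the hypothetical `find_files` helper below walks a folder recursively and matches suffixes case-insensitively. Both of those behaviors are assumptions; `data_loader.load_documents` may differ, and it also handles reading file contents (including PDFs), which this sketch does not.

```python
import tempfile
from pathlib import Path


def find_files(folder, extensions):
    """Return sorted file paths under folder whose suffix is in extensions.

    Hypothetical helper: recursive search and case-insensitive matching
    are assumptions made for this illustration.
    """
    wanted = {ext.lower() for ext in extensions}
    root = Path(folder)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )


# Build a small throwaway folder to demonstrate the filter
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "notes.txt").write_text("hello")
    (Path(tmp) / "image.png").write_bytes(b"")
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "README.MD").write_text("# hi")
    found = [p.name for p in find_files(tmp, [".txt", ".md"])]

print(found)  # image.png is skipped because ".png" is not in the list
```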

## Embeddings

In [None]:
# Initialize embedding model
print(f"Initializing embedding model: {model_name}")
model = embedding.initialize_model(model_name)

In [None]:
# Create embeddings
print("Creating embeddings...")
contents, sources, embeddings = embedding.create_embeddings(model, docs)
print(f"Created embeddings for {len(embeddings)} chunks.")

## Database setup and population

In [None]:
# Initialize database
print("Initializing database...")
db_path = os.path.abspath(os.path.join('..', db_file))
db = database.initialize_database(db_path, embedding_dim=embeddings[0].shape[0])

# Insert data into database
print("Inserting data into database...")
database.insert_data(db, contents, sources, embeddings)
print("Data insertion complete.")
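
For readers curious how vectors can live in SQLite at all, one common approach is to pack each embedding into a BLOB column. The sketch below uses only the standard library. The table name, column names, and `to_blob`/`from_blob` helpers are hypothetical, and the application's `database` module may instead rely on a dedicated vector extension or a different schema.

```python
import sqlite3
import struct


def to_blob(vector):
    """Pack a sequence of floats into little-endian float32 bytes."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_blob(blob):
    """Unpack float32 bytes back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
conn.execute(
    "CREATE TABLE chunks ("
    "id INTEGER PRIMARY KEY, content TEXT, source TEXT, embedding BLOB)"
)
rows = [
    ("first chunk", "a.txt", to_blob([0.5, 0.25, -1.0])),
    ("second chunk", "b.md", to_blob([1.0, 0.0, 2.0])),
]
conn.executemany(
    "INSERT INTO chunks (content, source, embedding) VALUES (?, ?, ?)", rows
)
conn.commit()

content, source, blob = conn.execute(
    "SELECT content, source, embedding FROM chunks WHERE id = 1"
).fetchone()
restored = from_blob(blob)
print(content, source, restored)
conn.close()
```

Storing float32 rather than float64 halves the storage per vector, at the cost of some precision that rarely matters for similarity search.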

## Performing your similarity search

In [None]:
# Define a query
query_text = "Enter your query here"  # Replace with your query or use input()

# Perform similarity search
print(f"Performing similarity search for: '{query_text}'")
results = query.query_database(db, model, query_text, k=k)

# Display results
if results:
    for idx, result in enumerate(results, start=1):
        doc_id, distance, content, source = result
        print(f"\nResult {idx}:")
        print(f"Source: {source}")
        print(f"Distance: {distance}")
        print(f"Content Snippet:\n{content[:500]}")
        print("-" * 50)
else:
    print("No relevant documents found.")
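
Conceptually, a top-`k` similarity search compares the query embedding against every stored embedding and keeps the `k` closest. The sketch below shows a brute-force version using cosine distance in pure Python. The `cosine_distance` and `top_k` helpers are hypothetical, and the metric actually used by `query.query_database` is not shown in this notebook, so treat cosine distance as an assumption (Euclidean distance is another common choice).

```python
import heapq
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat zero vectors as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vec, vectors, k):
    """Return (index, distance) pairs for the k vectors closest to query_vec."""
    scored = ((i, cosine_distance(query_vec, v)) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.0], vectors, k=2)
print([index for index, _ in nearest])  # [0, 2]
```

With a distance, smaller values indicate closer matches, which is why the helper keeps the `k` smallest scores. Brute force is fine for small collections; larger ones usually call for an index.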

In [None]:
# Close the database connection
db.close()
print("Database connection closed.")
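
If you adapt this notebook into a script, a context manager can guarantee that the connection is closed even when a query raises. The sketch below assumes the handle is a standard `sqlite3.Connection`, which this notebook does not confirm for the `db` object returned by `database.initialize_database`.

```python
import sqlite3
from contextlib import closing

# closing() calls conn.close() when the block exits, even on an exception
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]

# Using the connection after the block raises ProgrammingError
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(count, still_open)  # 1 False
```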