Gemini File Search Store Extension

A complete solution for managing Gemini's File Search Stores, enabling RAG (Retrieval Augmented Generation) workflows with automatic document uploads, category organization, and bidirectional sync between your local database and Gemini's cloud storage.

Features include background processing, SHA-256 deduplication, state tracking, and a RESTful API for seamless integration.

Key Features

Intelligent Document Management

Automatic Deduplication: SHA-256 hash-based duplicate detection prevents redundant uploads
Category Organization: Organize documents into logical categories for better management
Custom Metadata: Track documents with ID, hash, and category metadata
State Tracking: Monitor document states (PENDING, ACTIVE, FAILED) throughout their lifecycle

Background Upload Worker

Asynchronous Processing: Automatically processes pending uploads in the background
Auto-start on Upload: Worker automatically starts when new documents are uploaded
Startup Processing: Processes any pending uploads from previous sessions on extension startup
Batch Processing: Efficiently handles multiple documents in batches of 10
Automatic Metadata Updates: Keeps filestore statistics up-to-date after uploads complete

Smart Synchronization

Bidirectional Sync: Identify documents missing from local or remote stores
Metadata Validation: Detect and fix metadata mismatches between local and remote
Duplicate Detection: Find and flag duplicate documents in remote stores
State Management: Automatically update document states based on sync results
Detailed Reporting: Comprehensive sync reports with counts and sample documents

Custom MIME Type Support

Configurable Types: Override MIME types for specific file extensions via environment variable
Markdown Extensions: Pre-configured support for mdx, l, ss, sc extensions as text/markdown
Upload Optimization: Ensures correct MIME types for better search indexing

Robust Error Handling

Comprehensive Logging: Track all operations with detailed debug information
Error Recovery: Gracefully handle failures and store error messages for review
Retry Capability: Manual retry endpoint for failed uploads
ClientError Handling: Proper handling of 404s and other Gemini API errors

Configuration

Environment Variables

To use this extension, you must configure your Gemini API key.

Obtain an API key from Google AI Studio.
Add it to your environment variables or .env file:

Required

GEMINI_API_KEY=your_api_key_here

Optional

# Override MIME types for specific file extensions (comma-separated)
# Format: extension:mime/type,extension:mime/type
GEMINI_UPLOAD_MIME_TYPES="mdx:text/markdown,ss:text/markdown"

Database Storage

The extension automatically creates a SQLite database at:

.llms/user/default/gemini/gemini.sqlite

File Cache

Uploaded files are stored in the cache directory with SHA-256 hash-based filenames:

~/.llms/cache/[hash_prefix]/[hash].[ext]
~/.llms/cache/[hash_prefix]/[hash].info.json

API Endpoints

Filestore Management

Query Filestores

GET /filestores?take=50&skip=0&sort=-id&q=search_term

Query parameters:

take: Number of results (default: 50, max: 1000)
skip: Offset for pagination
sort: Sort order (-id, id, failed)
q: Search by display name
null: Comma-separated columns that should be NULL
not_null: Comma-separated columns that should NOT be NULL

Create Filestore

POST /filestores
Content-Type: application/json

{
  "displayName": "My Documents"
}

Delete Filestore

DELETE /filestores/{id}

Deletes the filestore and all associated documents (uses force: true).

Get Filestore Categories

GET /filestores/{id}/categories

Returns categories with document counts and total sizes.

Sync Filestore Documents

POST /filestores/{id}/sync

Synchronizes local documents with remote Gemini file search store. Returns detailed report:

{
  "Missing from Local": { "count": 0, "docs": [] },
  "Missing from Gemini": { "count": 2, "docs": ["category/file1.pdf"] },
  "Missing Metadata": { "count": 0, "docs": [] },
  "Metadata Mismatch": { "count": 1, "docs": ["category/file2.txt"] },
  "Unmatched Fields": { "count": 3, "docs": ["category/file3.md"] },
  "Duplicate Documents": { "count": 0, "docs": [] },
  "Summary": {
    "Local Documents": 45,
    "Remote Documents": 43,
    "Matched Documents": 43
  }
}

List Remote Documents

GET /filestores/{id}/documents

Fetches current state of all documents from Gemini API.

Document Management

Upload Documents

POST /filestores/{id}/upload?category=my_category
Content-Type: multipart/form-data

file: [binary file data]
file: [binary file data]
...

Uploads one or more files to the filestore. Documents are:

Hashed (SHA-256) for deduplication
Saved to cache directory
Added to local database
Queued for background upload to Gemini

Query Documents

GET /documents?filestoreId=1&take=50&skip=0&sort=-id

Query parameters:

filestoreId: Filter by filestore
category: Filter by category
take, skip, sort: Pagination and sorting
q: Search by display name
ids_in: Comma-separated IDs
displayNames: Filter by display names
null, not_null: Column filters
Sort options: -id, uploading, failed, issues

Delete Document

DELETE /documents/{id}

Deletes document from both local database and Gemini (handles 404s gracefully).

Retry Upload

POST /documents/{id}/upload

Manually retry uploading a failed document. Waits for upload to complete and returns updated document state.

Usage Examples

Creating a Filestore

curl -X POST http://localhost:8080/ext/gemini/filestores \
  -H "Content-Type: application/json" \
  -d '{"displayName": "Technical Documentation"}'

Uploading Documents with Categories

curl -X POST http://localhost:8080/ext/gemini/filestores/1/upload?category=guides \
  -F "file=@guide1.pdf" \
  -F "file=@guide2.pdf"

Syncing Documents

curl -X POST http://localhost:8080/ext/gemini/filestores/1/sync

Querying Failed Uploads

curl "http://localhost:8080/ext/gemini/documents?sort=failed&not_null=error"

Querying by Category

curl "http://localhost:8080/ext/gemini/documents?filestoreId=1&category=guides&sort=-id&take=20"

How It Works

Upload Process

File Reception: Multipart file upload with optional category
Hashing: SHA-256 hash calculated for deduplication
Deduplication Check: Existing documents with same hash are deleted
Local Storage: File saved to cache directory with hash-based filename
Database Record: Document metadata stored in SQLite
Worker Trigger: Background upload worker starts automatically
Gemini Upload: Worker uploads file to Gemini file search store
Metadata Update: Document updated with Gemini response data

Background Worker

The upload worker:

Starts automatically on extension initialization to process pending uploads
Starts automatically when new documents are uploaded
Processes up to 10 documents per batch
Polls Gemini operations until completion
Updates filestore statistics after batch completion
Handles errors gracefully and stores error messages
Stops automatically when queue is empty

Synchronization Process

The sync endpoint:

Fetches Local Documents: Queries all documents for the filestore
Builds Hash Lookup: Creates maps by hash and name for fast matching
Lists Remote Documents: Fetches all documents from Gemini API
Matching Logic:
- Matches by custom metadata hash (preferred)
- Falls back to matching by document name
Detects Issues:
- Documents in Gemini but not in local database
- Documents in local database but not in Gemini
- Documents missing custom metadata (id, hash, category)
- Metadata mismatches (wrong id or hash in metadata)
- Field mismatches (name, size, mime type, etc.)
- Duplicate hashes in remote store
Updates States: Automatically updates document states based on findings
Returns Report: Detailed breakdown with counts and sample filenames

MIME Type Handling

Default MIME type determined by file extension using Python's mimetypes
Custom overrides via GEMINI_UPLOAD_MIME_TYPES environment variable
JSON files uploaded without explicit MIME type (Gemini API limitation workaround)
Custom MIME types applied during upload configuration

Database Schema

Filestore Table

Stores Gemini file search store metadata:

id, user, createdAt, updatedAt
name: Gemini resource name (e.g., "fileSearchStores/...")
displayName: Human-readable name
createTime, updateTime: Gemini timestamps
activeDocumentsCount, pendingDocumentsCount, failedDocumentsCount
sizeBytes: Total size of all documents
metadata: JSON metadata
error: Last error message
ref: Optional reference field

Document Table

Stores document metadata:

id, filestoreId, user, createdAt
filename: SHA-256 based filename
url: Cache URL path
hash: SHA-256 hash of content
size: File size in bytes
displayName: Original filename
name: Gemini resource name
customMetadata: JSON metadata (id, hash, category)
createTime, updateTime: Gemini timestamps
sizeBytes: Size reported by Gemini
mimeType: MIME type
state: Document state (STATE_PENDING, STATE_ACTIVE, STATE_FAILED, etc.)
category: User-defined category
tags: JSON tags
startedAt, uploadedAt: Upload timing
metadata: JSON metadata
error: Upload error message
ref: Optional reference field

Benefits

For Development Workflows

Automatic Processing: Set it and forget it - uploads happen in the background
Persistent Queue: Pending uploads survive restarts
Fast Uploads: Multipart support for batch operations
Easy Testing: Example JSON file included for quick testing

For Document Management

Space Efficiency: Deduplication prevents storing identical files multiple times
Organization: Category-based organization with statistics
Traceability: Complete audit trail with timestamps and states
Search Ready: Proper MIME types ensure optimal Gemini search indexing

For System Integration

RESTful API: Standard HTTP endpoints for easy integration
User Isolation: Multi-user support with user-scoped data
Extensible: JSON metadata fields for custom extensions
Observable: Comprehensive logging and error messages

For Data Integrity

Sync Validation: Regular sync checks ensure consistency
Error Recovery: Failed uploads can be retried
State Tracking: Always know the current state of your documents
Duplicate Detection: Identify and handle duplicate uploads

Troubleshooting

Check Upload Worker Status

The worker logs when it starts and stops. Check your logs for:

UploadWorker started
UploadWorker stopped

View Pending Uploads

curl "http://localhost:8080/ext/gemini/documents?sort=uploading&null=uploadedAt,error"

View Failed Uploads

curl "http://localhost:8080/ext/gemini/documents?sort=failed&not_null=error"

Retry Failed Upload

curl -X POST http://localhost:8080/ext/gemini/documents/{id}/upload

Check Sync Status

curl -X POST http://localhost:8080/ext/gemini/filestores/{id}/sync

Enable Debug Logging

Set the debug flag in your context to see detailed operation logs including:

DEBUG=1 ./llms.sh --serve 8000 --verbose

Gemini API requests and responses
Document matching logic
Metadata updates
Upload configurations

Integration with Gemini AI

This extension uses Google's Gemini AI File Search API to:

Create and manage file search stores
Upload documents for AI-powered search
Attach custom metadata for enhanced organization
Query document status and retrieve results

Once documents are uploaded, they can be referenced in Gemini chat sessions using the fileSearch tool, enabling the AI to search through your document collection and provide informed responses based on your content.

License

This extension is released under the New BSD License.

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
ui		ui
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
db.py		db.py
requirements.txt		requirements.txt
upload_worker.py		upload_worker.py

llmspy/gemini

Folders and files

Latest commit

History

Repository files navigation