Skip to content

Conversation

@paullizer
Copy link
Contributor

@paullizer paullizer commented Nov 21, 2025

Out of cycle release today for Simple Chat

🐛 Fixes

✨ Adds

New File Type Support

Added support for .xml, .yaml/.yml, .doc, .docm, and .log file types

Multi-Modal Vision Analysis for Images

Implemented comprehensive image upload support with AI-powered analysis:

Features:

  • Base64 Conversion & Inline Display: Uploaded images are converted to base64 and displayed inline in the chat (like AI-generated images) instead of as file links
  • Automatic Chunking: Large images (>1.5MB) are automatically split across multiple Cosmos DB documents to avoid the 2MB document limit, then seamlessly reassembled on retrieval
  • Dual Text Extraction:
    • Document Intelligence OCR: Extracts all visible text from the image
    • GPT-4o Vision Analysis: Provides AI-generated description, object detection, contextual analysis, and text interpretation
  • Info Button: User-uploaded images display an info button that reveals extracted text and vision analysis in a formatted, scrollable drawer
  • Token-Efficient Chat History: Image context (OCR + Vision analysis) is included in the chat history as system messages so the AI can answer questions about uploaded images, but base64 image data is explicitly excluded to prevent token waste
    • OCR + Vision analysis: ~625 tokens per image ✅
    • Full base64 data: ~350K tokens per 1MB image ❌ (prevented)
  • Settings Control: Multi-modal vision can be enabled/disabled in admin settings with model selection (GPT-4o, GPT-4o-mini, o-series, GPT-5, etc.)

Technical Implementation:

  • Images stored with role: 'image' and metadata.is_user_upload: true flag
  • Stores extracted_text (OCR), vision_analysis (AI insights), and filename metadata
  • Backend automatically includes image context in conversation history for AI reasoning
  • Runtime safety checks prevent base64 data leakage into chat history
  • Debug logging tracks image context addition and character counts

User Experience:

  • Upload an image → Displays as thumbnail in chat
  • Click info button → View formatted OCR text and AI vision analysis
  • Ask questions about the image → AI uses extracted context to respond accurately

📦 Chunking Strategy for New File Types

Update README with chunking strategy · Issue #98 · microsoft/simplechat

DOC / DOCM

  • Processed with Python package docx2txt
  • Chunked by ~400 words, approximating an A4 page

XML

  • Uses RecursiveCharacterTextSplitter with XML-aware separators
  • Structure-preserving chunking:
    • Separators prioritized: \n\n\n> (end of XML tags) → space → character
    • Splits at logical boundaries to maintain tag integrity
  • Chunked by 4000 characters
  • Goal: Preserve XML structure by splitting at tag boundaries rather than mid-element, ensuring chunks are more semantically meaningful for LLM processing
  • See process_xml

YAML / YML

  • Uses RecursiveCharacterTextSplitter with YAML-aware separators
  • Structure-preserving chunking:
    • Separators prioritized: \n\n\n- (YAML list items) → space → character
    • Splits at logical boundaries to maintain YAML structure
  • Chunked by 4000 characters
  • Goal: Preserve YAML hierarchy and list structures by splitting at section boundaries and list items rather than mid-key or mid-value
  • See process_yaml

LOG

  • Processed using line-based chunking to maintain log record integrity
  • Never splits mid-line to preserve complete log entries
  • Line-Level Chunking:
    1. Split file by lines using splitlines(keepends=True) to preserve line endings
    2. Accumulate complete lines until reaching target word count ≈1000 words
    3. When adding next line would exceed target AND chunk already has content:
      • Finalize current chunk
      • Start new chunk with current line
    4. If single line exceeds target, it gets its own chunk to prevent infinite loops
    5. Emit chunks with complete log records
  • Goal: Provide substantial log context (1000 words) while ensuring no log entry is split across chunks
  • See process_log

📝 Version Updates

  • Incremented version to 0.229.098

#### **JSON**

- Uses `RecursiveJsonSplitter`:
  - `max_chunk_size=600`
  - `convert_lists=True`
- Produces JSON strings that retain original structure.
- See `process_json_file`.

#### **XML**

- Uses `RecursiveCharacterTextSplitter` with XML-aware separators.
- **Structure-preserving chunking**:
  - Separators prioritized: `\n\n` → `\n` → `>` (end of XML tags) → space → character
  - Splits at logical boundaries to maintain tag integrity
- **Chunked by 4000 characters** with 200-character overlap for context preservation.
- **Goal**: Preserve XML structure while providing manageable chunks for LLM processing.
- See `process_xml`.

#### **YAML / YML**

- Processed using regex word splitting (similar to TXT).
- **Chunked by 400 words**.
- Maintains YAML structure through simple word-based splitting.
- See `process_yaml`.

#### **LOG**

- Processed using line-based chunking to maintain log record integrity.
- **Never splits mid-line** to preserve complete log entries.
- **Line-Level Chunking**:
  1. Split file by lines using `splitlines(keepends=True)` to preserve line endings.
  2. Accumulate complete lines until reaching target word count ≈1000 words.
  3. When adding next line would exceed target AND chunk already has content:
     - Finalize current chunk
     - Start new chunk with current line
  4. If single line exceeds target, it gets its own chunk to prevent infinite loops.
  5. Emit chunks with complete log records.
- **Goal**: Provide substantial log context (1000 words) while ensuring no log entry is split across chunks.
- See `process_log`.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR is an out-of-cycle release that includes two major fixes and new file type support. The primary changes address a scoping issue in chat search functionality where the "All" option was missing groups, and complete removal of API key authentication for Video Indexer in favor of managed identity-only authentication. Additionally, support is added for new file types (.xml, .yaml/.yml, .doc, .docm, .log) with appropriate chunking strategies.

Key Changes:

  • Fixed "All" scope chat search to include group documents by passing active_group_id when scope is 'all'
  • Removed Video Indexer API key authentication, transitioning to managed identity-only authentication
  • Added support for 5 new file types with tailored chunking strategies optimized for each format

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 18 comments.

Show a summary per file
File Description
functional_tests/test_video_indexer_dual_authentication_support.py Issue: New functional test file that tests for dual authentication support (API key + managed identity), but the actual implementation only supports managed identity. Tests check for functions and settings that don't exist in the codebase.
docs/setup_instructions_manual.md Updated Video Indexer setup instructions to focus on managed identity authentication with detailed permission configuration steps
docs/fixes/VIDEO_INDEXER_API_KEY_TOKEN_GENERATION_FIX.md Issue: New documentation describing API key token generation "fix", but API key authentication was actually removed from the codebase, making this documentation misleading
docs/features/VIDEO_INDEXER_DUAL_AUTHENTICATION.md Issue: New documentation describing dual authentication support (API key + managed identity), but implementation only supports managed identity
docs/admin_configuration.md Updated to reflect managed identity-only authentication for Video Indexer
application/single_app/templates/workspace.html Changed hardcoded file extension list to use dynamic allowed_extensions variable
application/single_app/templates/group_workspaces.html Changed hardcoded file extension list to use dynamic allowed_extensions variable
application/single_app/templates/chats.html Updated file input accept attribute and tooltip to include new file extensions (.doc, .docm, .xml, .yaml, .yml, .log)
application/single_app/templates/admin_settings.html Removed API key field, reordered and enhanced Video Indexer settings fields with better help text and managed identity prerequisites
application/single_app/templates/_video_indexer_info.html Comprehensive update emphasizing managed identity authentication, removed API key instructions, added detailed permission setup steps
application/single_app/route_frontend_workspace.py Added dynamic allowed_extensions list building and passing to template (with spacing issue)
application/single_app/route_frontend_public_workspaces.py Added new file extensions to allowed list (with spacing issue)
application/single_app/route_frontend_group_workspaces.py Added dynamic allowed_extensions list building and passing to template (with spacing issue)
application/single_app/route_frontend_chats.py Added handling for new file types (.doc, .docm, .xml, .yaml, .yml, .log) in upload_file and view_document functions
application/single_app/route_frontend_admin_settings.py Removed video_indexer_api_key from form handling, updated default ARM API version to 2024-01-01
application/single_app/route_backend_chats.py Fixed scoping issue - now passes active_group_id when document_scope is 'all' to include group search results
application/single_app/functions_settings.py Removed video_indexer_api_key from default settings, updated ARM API version to 2024-01-01
application/single_app/functions_search.py Fixed scoping issue - added conditional group search that only executes when active_group_id is provided
application/single_app/functions_documents.py Issue: Contains remnant authentication type checking code; Added 4 new processing functions (process_xml, process_yaml, process_log, process_doc) with appropriate chunking strategies; Added .xlsm support to tabular processing
application/single_app/functions_content.py Added .xlsm to extract_table_file function
application/single_app/functions_authentication.py Updated get_video_indexer_account_token to only call managed identity authentication function
application/single_app/config.py Updated version to 0.229.072, added new file extensions to ALLOWED_EXTENSIONS (with spacing issue)


# Build allowed extensions string
allowed_extensions = [
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html",
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after comma between 'xlsm' and "csv". Should be 'xlsm', "csv" for consistency with other items in the list.

Suggested change
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html",
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm", "csv", "pptx", "html",

Copilot uses AI. Check for mistakes.
allowed_extensions = [
"txt", "pdf", "docx", "xlsx", "xls", "csv", "pptx", "html",
"jpg", "jpeg", "png", "bmp", "tiff", "tif", "heif", "md", "json"
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html",
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after comma between 'xlsm' and "csv". Should be 'xlsm', "csv" for consistency with other items in the list.

Suggested change
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html",
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm", "csv", "pptx", "html",

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +103
required_functions = [
'def get_video_indexer_account_token(',
'def get_video_indexer_api_key_token(',
'def get_video_indexer_managed_identity_token('
]

for func in required_functions:
if func not in auth_content:
print(f"❌ Missing function: {func}")
return False

# Check for proper conditional logic
if 'auth_type = settings.get("video_indexer_authentication_type"' not in auth_content:
print("❌ Missing authentication type conditional logic")
return False

# Check for API key authentication pattern - should generate access token
if 'if auth_type == "key":' not in auth_content:
print("❌ Missing API key authentication conditional logic")
return False

# Verify API key method generates an access token via API
required_api_key_patterns = [
'/auth/',
'/AccessToken',
'Ocp-Apim-Subscription-Key',
'requests.get',
'allowEdit'
]

for pattern in required_api_key_patterns:
if pattern not in auth_content:
print(f"❌ Missing API key token generation pattern: {pattern}")
return False
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functional test expects get_video_indexer_api_key_token() to exist in functions_authentication.py (line 72), but this function is not present in the codebase. According to the PR description, API key authentication was removed in favor of managed identity only.

Either:

  1. Remove the test checks for API key authentication (lines 70-103), or
  2. Update the test to reflect that API key authentication is not supported

Given the PR description states "Removed api key" and the documentation emphasizes "API keys are not supported", the test should not be checking for API key authentication functionality.

Copilot uses AI. Check for mistakes.
Comment on lines +82 to +84
if 'auth_type = settings.get("video_indexer_authentication_type"' not in auth_content:
print("❌ Missing authentication type conditional logic")
return False
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks for conditional logic based on video_indexer_authentication_type, but since API key authentication has been removed, this conditional logic should not exist in the codebase. The test will fail as the authentication type switching logic has been removed.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +199
# Video Indexer API Key Token Generation Fix

## Issue Description
When using API key authentication with Azure Video Indexer, video uploads were failing with 401 Unauthorized errors. The system was attempting to use the raw API key directly in API calls instead of generating an access token first.

**Fixed in version:** 0.229.065

## Error Symptoms
```
[VIDEO] API key authentication completed
[VIDEO] UPLOAD ERROR: 401 Client Error: Unauthorized for url: https://api.videoindexer.ai/eastus/Accounts/.../Videos?name=...
```

## Root Cause
The initial implementation of API key authentication (`get_video_indexer_api_key_token()`) was simply returning the API key directly. However, the Video Indexer API requires a two-step authentication process:

1. **Step 1**: Use API key to generate an access token via the auth endpoint
2. **Step 2**: Use the access token for all subsequent Video Indexer API operations

The code was skipping Step 1 and trying to use the API key directly, which caused authentication failures.

## Technical Details

### Before Fix
```python
def get_video_indexer_api_key_token(settings, video_id=None):
"""Returns API key directly (incorrect)"""
api_key = settings.get("video_indexer_api_key", "")
return api_key # Wrong - this won't work with Video Indexer API
```

### After Fix
```python
def get_video_indexer_api_key_token(settings, video_id=None):
"""Generate access token using API key"""
api_key = settings.get("video_indexer_api_key", "")
account_id = settings.get("video_indexer_account_id", "")
location = settings.get("video_indexer_location", "trial")

# Generate access token via auth endpoint
api_url = "https://api.videoindexer.ai"
auth_url = f"{api_url}/auth/{location}/Accounts/{account_id}/AccessToken"

headers = {"Ocp-Apim-Subscription-Key": api_key}
params = {"allowEdit": "true"}

if video_id:
params["videoId"] = video_id

response = requests.get(auth_url, headers=headers, params=params)
response.raise_for_status()

access_token = response.text.strip('"')
return access_token
```

## Files Modified

### 1. functions_authentication.py
- **Function**: `get_video_indexer_api_key_token()`
- **Changes**:
- Added API key to access token conversion logic
- Calls Video Indexer auth endpoint: `/auth/{location}/Accounts/{account_id}/AccessToken`
- Uses `Ocp-Apim-Subscription-Key` header for auth endpoint call
- Returns generated access token instead of raw API key
- Added error handling with detailed debug logging

### 2. functions_documents.py
- **Changes**: Simplified authentication handling since both methods now return access tokens
- **Video Upload**: Both auth methods now use `accessToken` query parameter
- **Video Polling**: Both auth methods now use `accessToken` query parameter
- **Video Deletion**: Both auth methods now use `accessToken` query parameter

#### Before Fix
```python
if auth_type == "key":
headers = {'Ocp-Apim-Subscription-Key': token} # Wrong
params = {"name": original_filename}
else:
headers = {}
params = {"accessToken": token, "name": original_filename}
```

#### After Fix
```python
# Both methods now return access tokens
headers = {}
params = {"accessToken": token, "name": original_filename}
```

## Authentication Flow

### API Key Authentication Flow
1. User configures API key in admin settings
2. System calls `get_video_indexer_api_key_token()`
3. Function makes GET request to: `https://api.videoindexer.ai/auth/{location}/Accounts/{account_id}/AccessToken`
4. Request includes `Ocp-Apim-Subscription-Key: {api_key}` header
5. Video Indexer returns access token
6. Access token is used for all subsequent API calls via `accessToken` query parameter

### Managed Identity Authentication Flow
1. User configures managed identity in admin settings
2. System calls `get_video_indexer_managed_identity_token()`
3. Function acquires ARM token using DefaultAzureCredential
4. Function makes POST to ARM generateAccessToken endpoint
5. ARM returns access token
6. Access token is used for all subsequent API calls via `accessToken` query parameter

## Video Indexer API Reference

### Token Generation Endpoint
```
GET https://api.videoindexer.ai/auth/{location}/Accounts/{account_id}/AccessToken
Headers:
Ocp-Apim-Subscription-Key: {your_api_key}
Query Parameters:
allowEdit: true
videoId: {optional_video_id}
```

### API Operations (Upload, Poll, Delete)
```
All operations use the generated access token:
?accessToken={generated_token}
```

## Testing

### Functional Test Updates
Updated `test_video_indexer_dual_authentication_support.py` to verify:
- ✅ API key authentication generates access tokens via auth endpoint
- ✅ Access tokens are used consistently across all operations
- ✅ Both authentication methods use the same API call patterns
- ✅ Default settings maintain backward compatibility

### Test Patterns Verified
```python
# Verify token generation patterns
required_api_key_patterns = [
'/auth/',
'/AccessToken',
'Ocp-Apim-Subscription-Key',
'requests.get',
'allowEdit'
]

# Verify consistent access token usage
required_patterns = [
'auth_type = settings.get("video_indexer_authentication_type"',
'"accessToken": token',
'get_video_indexer_account_token'
]
```

## Benefits of Fix
- **Authentication Works Correctly**: API key method now properly authenticates with Video Indexer
- **Consistent Pattern**: Both auth methods return access tokens and use identical API call patterns
- **Proper Token Scope**: Can generate video-specific or account-level tokens as needed
- **Better Error Handling**: Detailed logging for token generation failures
- **Cleaner Code**: Simplified video processing logic since both methods work the same way

## Configuration Requirements

### API Key Method Settings
```python
video_indexer_authentication_type: "key"
video_indexer_api_key: "your-api-key"
video_indexer_account_id: "your-account-id"
video_indexer_location: "trial" or your region
```

### Required Permissions
- API key must have permissions to generate access tokens
- Access tokens generated with `allowEdit=true` for upload/delete operations

## Known Limitations
- Access tokens have expiration times (typically 1 hour)
- Token generation requires additional API call before operations
- API key must be manually rotated (not handled automatically)

## Debug Logging
Enhanced logging helps troubleshoot token generation:
```
[VIDEO INDEXER AUTH] Using API key authentication
[VIDEO INDEXER AUTH] Requesting access token for account: {id}, location: {loc}
[VIDEO INDEXER AUTH] Calling auth endpoint: {url}
[VIDEO INDEXER AUTH] Access token acquired successfully (token length: {len})
[VIDEO] API key authentication and token generation completed
```

## Related Documentation
- Feature Documentation: `docs/features/VIDEO_INDEXER_DUAL_AUTHENTICATION.md`
- Functional Test: `functional_tests/test_video_indexer_dual_authentication_support.py`
- Azure Video Indexer API Documentation: https://api-portal.videoindexer.ai/

## Backward Compatibility
- ✅ Existing managed identity configurations continue working without changes
- ✅ Default authentication type remains "managed_identity"
- ✅ No changes required to admin UI or settings structure
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation describes a "fix" for API key token generation, but the PR actually removes API key authentication entirely. This document is misleading as it suggests API key authentication is supported and was fixed, when in reality it has been removed.

According to the PR description: "The API key did not work with paid service only works with trial - Removed api key".

This documentation should either be removed or updated to explain that API key authentication was attempted, found to be incompatible with paid services, and subsequently removed in favor of managed identity only.

Copilot uses AI. Check for mistakes.
Comment on lines +133 to +141
'auth_type = settings.get("video_indexer_authentication_type"',
'"accessToken": token',
'get_video_indexer_account_token'
]

for pattern in required_patterns:
if pattern not in docs_content:
print(f"❌ Missing pattern in video processing: {pattern}")
return False
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test checks for authentication type handling with conditional logic, but the PR removes API key authentication entirely. The code now only uses managed identity, so there should be no authentication type branching logic in functions_documents.py.

Copilot uses AI. Check for mistakes.
Comment on lines +169 to +185
required_ui_elements = [
'id="video_indexer_authentication_type"',
'name="video_indexer_authentication_type"',
'value="managed_identity"',
'value="key"',
'Managed Identity (Azure ARM)',
'API Key',
'id="video_indexer_api_key_section"',
'id="video_indexer_arm_section"',
'id="video_indexer_arm_fields"',
'toggleVideoIndexerAuthFields'
]

for element in required_ui_elements:
if element not in template_content:
print(f"❌ Missing UI element: {element}")
return False
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test checks for UI elements related to authentication type selection including API key fields (lines 170-179), but the admin UI changes in this PR remove API key support entirely. These UI elements should not exist in the updated template.

Copilot uses AI. Check for mistakes.
Comment on lines +218 to +224
if "'video_indexer_authentication_type':" not in route_content:
print("❌ Missing authentication type form field handling")
return False

if "form_data.get('video_indexer_authentication_type'" not in route_content:
print("❌ Missing authentication type form data extraction")
return False
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test checks for video_indexer_authentication_type form field handling, but since API key authentication was removed, this form field should not exist in the route handler.

Copilot uses AI. Check for mistakes.

# Build allowed extensions string
allowed_extensions = [
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html",
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space after comma between 'xlsm' and "csv". Should be 'xlsm', "csv" for consistency with other items in the list.

Suggested change
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html",
"txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm", "csv", "pptx", "html",

Copilot uses AI. Check for mistakes.
is_ppt = file_ext in ('.pptx', '.ppt')
is_image = file_ext in ('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.tif', '.gif', '.webp') # Added more image types
is_text = file_ext in ('.txt', '.md', '.csv', '.json', '.log', '.xml', '.html', '.htm') # Common text-based types
is_text = file_ext in ('.txt', '.md', '.csv', '.json', '.log', '.xml', '.yaml', '.yml', '.html', '.htm') # Common text-based types
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable is_text is not used.

Copilot uses AI. Check for mistakes.
This was linked to issues Nov 21, 2025
@paullizer paullizer merged commit 5546e5a into main Nov 21, 2025
1 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants