-
Notifications
You must be signed in to change notification settings - Fork 83
Add 372 fix 489 #528
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add 372 fix 489 #528
Conversation
#### **JSON**
- Uses `RecursiveJsonSplitter`:
- `max_chunk_size=600`
- `convert_lists=True`
- Produces JSON strings that retain original structure.
- See `process_json_file`.
#### **XML**
- Uses `RecursiveCharacterTextSplitter` with XML-aware separators.
- **Structure-preserving chunking**:
- Separators prioritized: `\n\n` → `\n` → `>` (end of XML tags) → space → character
- Splits at logical boundaries to maintain tag integrity
- **Chunked by 4000 characters** with 200-character overlap for context preservation.
- **Goal**: Preserve XML structure while providing manageable chunks for LLM processing.
- See `process_xml`.
#### **YAML / YML**
- Processed using regex word splitting (similar to TXT).
- **Chunked by 400 words**.
- Maintains YAML structure through simple word-based splitting.
- See `process_yaml`.
#### **LOG**
- Processed using line-based chunking to maintain log record integrity.
- **Never splits mid-line** to preserve complete log entries.
- **Line-Level Chunking**:
1. Split file by lines using `splitlines(keepends=True)` to preserve line endings.
2. Accumulate complete lines until reaching target word count ≈1000 words.
3. When adding next line would exceed target AND chunk already has content:
- Finalize current chunk
- Start new chunk with current line
4. If single line exceeds target, it gets its own chunk to prevent infinite loops.
5. Emit chunks with complete log records.
- **Goal**: Provide substantial log context (1000 words) while ensuring no log entry is split across chunks.
- See `process_log`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR is an out-of-cycle release that includes two major fixes and new file type support. The primary changes address a scoping issue in chat search functionality where the "All" option was missing groups, and complete removal of API key authentication for Video Indexer in favor of managed identity-only authentication. Additionally, support is added for new file types (.xml, .yaml/.yml, .doc, .docm, .log) with appropriate chunking strategies.
Key Changes:
- Fixed "All" scope chat search to include group documents by passing active_group_id when scope is 'all'
- Removed Video Indexer API key authentication, transitioning to managed identity-only authentication
- Added support for 5 new file types with tailored chunking strategies optimized for each format
Reviewed Changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| functional_tests/test_video_indexer_dual_authentication_support.py | Issue: New functional test file that tests for dual authentication support (API key + managed identity), but the actual implementation only supports managed identity. Tests check for functions and settings that don't exist in the codebase. |
| docs/setup_instructions_manual.md | Updated Video Indexer setup instructions to focus on managed identity authentication with detailed permission configuration steps |
| docs/fixes/VIDEO_INDEXER_API_KEY_TOKEN_GENERATION_FIX.md | Issue: New documentation describing API key token generation "fix", but API key authentication was actually removed from the codebase, making this documentation misleading |
| docs/features/VIDEO_INDEXER_DUAL_AUTHENTICATION.md | Issue: New documentation describing dual authentication support (API key + managed identity), but implementation only supports managed identity |
| docs/admin_configuration.md | Updated to reflect managed identity-only authentication for Video Indexer |
| application/single_app/templates/workspace.html | Changed hardcoded file extension list to use dynamic allowed_extensions variable |
| application/single_app/templates/group_workspaces.html | Changed hardcoded file extension list to use dynamic allowed_extensions variable |
| application/single_app/templates/chats.html | Updated file input accept attribute and tooltip to include new file extensions (.doc, .docm, .xml, .yaml, .yml, .log) |
| application/single_app/templates/admin_settings.html | Removed API key field, reordered and enhanced Video Indexer settings fields with better help text and managed identity prerequisites |
| application/single_app/templates/_video_indexer_info.html | Comprehensive update emphasizing managed identity authentication, removed API key instructions, added detailed permission setup steps |
| application/single_app/route_frontend_workspace.py | Added dynamic allowed_extensions list building and passing to template (with spacing issue) |
| application/single_app/route_frontend_public_workspaces.py | Added new file extensions to allowed list (with spacing issue) |
| application/single_app/route_frontend_group_workspaces.py | Added dynamic allowed_extensions list building and passing to template (with spacing issue) |
| application/single_app/route_frontend_chats.py | Added handling for new file types (.doc, .docm, .xml, .yaml, .yml, .log) in upload_file and view_document functions |
| application/single_app/route_frontend_admin_settings.py | Removed video_indexer_api_key from form handling, updated default ARM API version to 2024-01-01 |
| application/single_app/route_backend_chats.py | Fixed scoping issue - now passes active_group_id when document_scope is 'all' to include group search results |
| application/single_app/functions_settings.py | Removed video_indexer_api_key from default settings, updated ARM API version to 2024-01-01 |
| application/single_app/functions_search.py | Fixed scoping issue - added conditional group search that only executes when active_group_id is provided |
| application/single_app/functions_documents.py | Issue: Contains remnant authentication type checking code; Added 4 new processing functions (process_xml, process_yaml, process_log, process_doc) with appropriate chunking strategies; Added .xlsm support to tabular processing |
| application/single_app/functions_content.py | Added .xlsm to extract_table_file function |
| application/single_app/functions_authentication.py | Updated get_video_indexer_account_token to only call managed identity authentication function |
| application/single_app/config.py | Updated version to 0.229.072, added new file extensions to ALLOWED_EXTENSIONS (with spacing issue) |
|
|
||
| # Build allowed extensions string | ||
| allowed_extensions = [ | ||
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html", |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after comma between 'xlsm' and "csv". Should be 'xlsm', "csv" for consistency with other items in the list.
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html", | |
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm", "csv", "pptx", "html", |
| allowed_extensions = [ | ||
| "txt", "pdf", "docx", "xlsx", "xls", "csv", "pptx", "html", | ||
| "jpg", "jpeg", "png", "bmp", "tiff", "tif", "heif", "md", "json" | ||
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html", |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after comma between 'xlsm' and "csv". Should be 'xlsm', "csv" for consistency with other items in the list.
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html", | |
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm", "csv", "pptx", "html", |
| required_functions = [ | ||
| 'def get_video_indexer_account_token(', | ||
| 'def get_video_indexer_api_key_token(', | ||
| 'def get_video_indexer_managed_identity_token(' | ||
| ] | ||
|
|
||
| for func in required_functions: | ||
| if func not in auth_content: | ||
| print(f"❌ Missing function: {func}") | ||
| return False | ||
|
|
||
| # Check for proper conditional logic | ||
| if 'auth_type = settings.get("video_indexer_authentication_type"' not in auth_content: | ||
| print("❌ Missing authentication type conditional logic") | ||
| return False | ||
|
|
||
| # Check for API key authentication pattern - should generate access token | ||
| if 'if auth_type == "key":' not in auth_content: | ||
| print("❌ Missing API key authentication conditional logic") | ||
| return False | ||
|
|
||
| # Verify API key method generates an access token via API | ||
| required_api_key_patterns = [ | ||
| '/auth/', | ||
| '/AccessToken', | ||
| 'Ocp-Apim-Subscription-Key', | ||
| 'requests.get', | ||
| 'allowEdit' | ||
| ] | ||
|
|
||
| for pattern in required_api_key_patterns: | ||
| if pattern not in auth_content: | ||
| print(f"❌ Missing API key token generation pattern: {pattern}") | ||
| return False |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functional test expects get_video_indexer_api_key_token() to exist in functions_authentication.py (line 72), but this function is not present in the codebase. According to the PR description, API key authentication was removed in favor of managed identity only.
Either:
- Remove the test checks for API key authentication (lines 70-103), or
- Update the test to reflect that API key authentication is not supported
Given the PR description states "Removed api key" and the documentation emphasizes "API keys are not supported", the test should not be checking for API key authentication functionality.
| if 'auth_type = settings.get("video_indexer_authentication_type"' not in auth_content: | ||
| print("❌ Missing authentication type conditional logic") | ||
| return False |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test looks for conditional logic based on video_indexer_authentication_type, but since API key authentication has been removed, this conditional logic should not exist in the codebase. The test will fail as the authentication type switching logic has been removed.
| # Video Indexer API Key Token Generation Fix | ||
|
|
||
| ## Issue Description | ||
| When using API key authentication with Azure Video Indexer, video uploads were failing with 401 Unauthorized errors. The system was attempting to use the raw API key directly in API calls instead of generating an access token first. | ||
|
|
||
| **Fixed in version:** 0.229.065 | ||
|
|
||
| ## Error Symptoms | ||
| ``` | ||
| [VIDEO] API key authentication completed | ||
| [VIDEO] UPLOAD ERROR: 401 Client Error: Unauthorized for url: https://api.videoindexer.ai/eastus/Accounts/.../Videos?name=... | ||
| ``` | ||
|
|
||
| ## Root Cause | ||
| The initial implementation of API key authentication (`get_video_indexer_api_key_token()`) was simply returning the API key directly. However, the Video Indexer API requires a two-step authentication process: | ||
|
|
||
| 1. **Step 1**: Use API key to generate an access token via the auth endpoint | ||
| 2. **Step 2**: Use the access token for all subsequent Video Indexer API operations | ||
|
|
||
| The code was skipping Step 1 and trying to use the API key directly, which caused authentication failures. | ||
|
|
||
| ## Technical Details | ||
|
|
||
| ### Before Fix | ||
| ```python | ||
| def get_video_indexer_api_key_token(settings, video_id=None): | ||
| """Returns API key directly (incorrect)""" | ||
| api_key = settings.get("video_indexer_api_key", "") | ||
| return api_key # Wrong - this won't work with Video Indexer API | ||
| ``` | ||
|
|
||
| ### After Fix | ||
| ```python | ||
| def get_video_indexer_api_key_token(settings, video_id=None): | ||
| """Generate access token using API key""" | ||
| api_key = settings.get("video_indexer_api_key", "") | ||
| account_id = settings.get("video_indexer_account_id", "") | ||
| location = settings.get("video_indexer_location", "trial") | ||
|
|
||
| # Generate access token via auth endpoint | ||
| api_url = "https://api.videoindexer.ai" | ||
| auth_url = f"{api_url}/auth/{location}/Accounts/{account_id}/AccessToken" | ||
|
|
||
| headers = {"Ocp-Apim-Subscription-Key": api_key} | ||
| params = {"allowEdit": "true"} | ||
|
|
||
| if video_id: | ||
| params["videoId"] = video_id | ||
|
|
||
| response = requests.get(auth_url, headers=headers, params=params) | ||
| response.raise_for_status() | ||
|
|
||
| access_token = response.text.strip('"') | ||
| return access_token | ||
| ``` | ||
|
|
||
| ## Files Modified | ||
|
|
||
| ### 1. functions_authentication.py | ||
| - **Function**: `get_video_indexer_api_key_token()` | ||
| - **Changes**: | ||
| - Added API key to access token conversion logic | ||
| - Calls Video Indexer auth endpoint: `/auth/{location}/Accounts/{account_id}/AccessToken` | ||
| - Uses `Ocp-Apim-Subscription-Key` header for auth endpoint call | ||
| - Returns generated access token instead of raw API key | ||
| - Added error handling with detailed debug logging | ||
|
|
||
| ### 2. functions_documents.py | ||
| - **Changes**: Simplified authentication handling since both methods now return access tokens | ||
| - **Video Upload**: Both auth methods now use `accessToken` query parameter | ||
| - **Video Polling**: Both auth methods now use `accessToken` query parameter | ||
| - **Video Deletion**: Both auth methods now use `accessToken` query parameter | ||
|
|
||
| #### Before Fix | ||
| ```python | ||
| if auth_type == "key": | ||
| headers = {'Ocp-Apim-Subscription-Key': token} # Wrong | ||
| params = {"name": original_filename} | ||
| else: | ||
| headers = {} | ||
| params = {"accessToken": token, "name": original_filename} | ||
| ``` | ||
|
|
||
| #### After Fix | ||
| ```python | ||
| # Both methods now return access tokens | ||
| headers = {} | ||
| params = {"accessToken": token, "name": original_filename} | ||
| ``` | ||
|
|
||
| ## Authentication Flow | ||
|
|
||
| ### API Key Authentication Flow | ||
| 1. User configures API key in admin settings | ||
| 2. System calls `get_video_indexer_api_key_token()` | ||
| 3. Function makes GET request to: `https://api.videoindexer.ai/auth/{location}/Accounts/{account_id}/AccessToken` | ||
| 4. Request includes `Ocp-Apim-Subscription-Key: {api_key}` header | ||
| 5. Video Indexer returns access token | ||
| 6. Access token is used for all subsequent API calls via `accessToken` query parameter | ||
|
|
||
| ### Managed Identity Authentication Flow | ||
| 1. User configures managed identity in admin settings | ||
| 2. System calls `get_video_indexer_managed_identity_token()` | ||
| 3. Function acquires ARM token using DefaultAzureCredential | ||
| 4. Function makes POST to ARM generateAccessToken endpoint | ||
| 5. ARM returns access token | ||
| 6. Access token is used for all subsequent API calls via `accessToken` query parameter | ||
|
|
||
| ## Video Indexer API Reference | ||
|
|
||
| ### Token Generation Endpoint | ||
| ``` | ||
| GET https://api.videoindexer.ai/auth/{location}/Accounts/{account_id}/AccessToken | ||
| Headers: | ||
| Ocp-Apim-Subscription-Key: {your_api_key} | ||
| Query Parameters: | ||
| allowEdit: true | ||
| videoId: {optional_video_id} | ||
| ``` | ||
|
|
||
| ### API Operations (Upload, Poll, Delete) | ||
| ``` | ||
| All operations use the generated access token: | ||
| ?accessToken={generated_token} | ||
| ``` | ||
|
|
||
| ## Testing | ||
|
|
||
| ### Functional Test Updates | ||
| Updated `test_video_indexer_dual_authentication_support.py` to verify: | ||
| - ✅ API key authentication generates access tokens via auth endpoint | ||
| - ✅ Access tokens are used consistently across all operations | ||
| - ✅ Both authentication methods use the same API call patterns | ||
| - ✅ Default settings maintain backward compatibility | ||
|
|
||
| ### Test Patterns Verified | ||
| ```python | ||
| # Verify token generation patterns | ||
| required_api_key_patterns = [ | ||
| '/auth/', | ||
| '/AccessToken', | ||
| 'Ocp-Apim-Subscription-Key', | ||
| 'requests.get', | ||
| 'allowEdit' | ||
| ] | ||
|
|
||
| # Verify consistent access token usage | ||
| required_patterns = [ | ||
| 'auth_type = settings.get("video_indexer_authentication_type"', | ||
| '"accessToken": token', | ||
| 'get_video_indexer_account_token' | ||
| ] | ||
| ``` | ||
|
|
||
| ## Benefits of Fix | ||
| - **Authentication Works Correctly**: API key method now properly authenticates with Video Indexer | ||
| - **Consistent Pattern**: Both auth methods return access tokens and use identical API call patterns | ||
| - **Proper Token Scope**: Can generate video-specific or account-level tokens as needed | ||
| - **Better Error Handling**: Detailed logging for token generation failures | ||
| - **Cleaner Code**: Simplified video processing logic since both methods work the same way | ||
|
|
||
| ## Configuration Requirements | ||
|
|
||
| ### API Key Method Settings | ||
| ```python | ||
| video_indexer_authentication_type: "key" | ||
| video_indexer_api_key: "your-api-key" | ||
| video_indexer_account_id: "your-account-id" | ||
| video_indexer_location: "trial" or your region | ||
| ``` | ||
|
|
||
| ### Required Permissions | ||
| - API key must have permissions to generate access tokens | ||
| - Access tokens generated with `allowEdit=true` for upload/delete operations | ||
|
|
||
| ## Known Limitations | ||
| - Access tokens have expiration times (typically 1 hour) | ||
| - Token generation requires additional API call before operations | ||
| - API key must be manually rotated (not handled automatically) | ||
|
|
||
| ## Debug Logging | ||
| Enhanced logging helps troubleshoot token generation: | ||
| ``` | ||
| [VIDEO INDEXER AUTH] Using API key authentication | ||
| [VIDEO INDEXER AUTH] Requesting access token for account: {id}, location: {loc} | ||
| [VIDEO INDEXER AUTH] Calling auth endpoint: {url} | ||
| [VIDEO INDEXER AUTH] Access token acquired successfully (token length: {len}) | ||
| [VIDEO] API key authentication and token generation completed | ||
| ``` | ||
|
|
||
| ## Related Documentation | ||
| - Feature Documentation: `docs/features/VIDEO_INDEXER_DUAL_AUTHENTICATION.md` | ||
| - Functional Test: `functional_tests/test_video_indexer_dual_authentication_support.py` | ||
| - Azure Video Indexer API Documentation: https://api-portal.videoindexer.ai/ | ||
|
|
||
| ## Backward Compatibility | ||
| - ✅ Existing managed identity configurations continue working without changes | ||
| - ✅ Default authentication type remains "managed_identity" | ||
| - ✅ No changes required to admin UI or settings structure |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation describes a "fix" for API key token generation, but the PR actually removes API key authentication entirely. This document is misleading as it suggests API key authentication is supported and was fixed, when in reality it has been removed.
According to the PR description: "The API key did not work with paid service only works with trial - Removed api key".
This documentation should either be removed or updated to explain that API key authentication was attempted, found to be incompatible with paid services, and subsequently removed in favor of managed identity only.
| 'auth_type = settings.get("video_indexer_authentication_type"', | ||
| '"accessToken": token', | ||
| 'get_video_indexer_account_token' | ||
| ] | ||
|
|
||
| for pattern in required_patterns: | ||
| if pattern not in docs_content: | ||
| print(f"❌ Missing pattern in video processing: {pattern}") | ||
| return False |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test checks for authentication type handling with conditional logic, but the PR removes API key authentication entirely. The code now only uses managed identity, so there should be no authentication type branching logic in functions_documents.py.
| required_ui_elements = [ | ||
| 'id="video_indexer_authentication_type"', | ||
| 'name="video_indexer_authentication_type"', | ||
| 'value="managed_identity"', | ||
| 'value="key"', | ||
| 'Managed Identity (Azure ARM)', | ||
| 'API Key', | ||
| 'id="video_indexer_api_key_section"', | ||
| 'id="video_indexer_arm_section"', | ||
| 'id="video_indexer_arm_fields"', | ||
| 'toggleVideoIndexerAuthFields' | ||
| ] | ||
|
|
||
| for element in required_ui_elements: | ||
| if element not in template_content: | ||
| print(f"❌ Missing UI element: {element}") | ||
| return False |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test checks for UI elements related to authentication type selection including API key fields (lines 170-179), but the admin UI changes in this PR remove API key support entirely. These UI elements should not exist in the updated template.
| if "'video_indexer_authentication_type':" not in route_content: | ||
| print("❌ Missing authentication type form field handling") | ||
| return False | ||
|
|
||
| if "form_data.get('video_indexer_authentication_type'" not in route_content: | ||
| print("❌ Missing authentication type form data extraction") | ||
| return False |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test checks for video_indexer_authentication_type form field handling, but since API key authentication was removed, this form field should not exist in the route handler.
|
|
||
| # Build allowed extensions string | ||
| allowed_extensions = [ | ||
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html", |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing space after comma between 'xlsm' and "csv". Should be 'xlsm', "csv" for consistency with other items in the list.
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm","csv", "pptx", "html", | |
| "txt", "pdf", "doc", "docm", "docx", "xlsx", "xls", "xlsm", "csv", "pptx", "html", |
| is_ppt = file_ext in ('.pptx', '.ppt') | ||
| is_image = file_ext in ('.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.tif', '.gif', '.webp') # Added more image types | ||
| is_text = file_ext in ('.txt', '.md', '.csv', '.json', '.log', '.xml', '.html', '.htm') # Common text-based types | ||
| is_text = file_ext in ('.txt', '.md', '.csv', '.json', '.log', '.xml', '.yaml', '.yml', '.html', '.htm') # Common text-based types |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable is_text is not used.
Out of cycle release today for Simple Chat
🐛 Fixes
✨ Adds
New File Type Support
Added support for
.xml,.yaml/.yml,.doc,.docm, and.logfile typesMulti-Modal Vision Analysis for Images
Implemented comprehensive image upload support with AI-powered analysis:
Features:
Technical Implementation:
role: 'image'andmetadata.is_user_upload: trueflagextracted_text(OCR),vision_analysis(AI insights), andfilenamemetadataUser Experience:
📦 Chunking Strategy for New File Types
Update README with chunking strategy · Issue #98 · microsoft/simplechat
DOC / DOCM
docx2txtXML
RecursiveCharacterTextSplitterwith XML-aware separators\n\n→\n→>(end of XML tags) → space → characterprocess_xmlYAML / YML
RecursiveCharacterTextSplitterwith YAML-aware separators\n\n→\n→-(YAML list items) → space → characterprocess_yamlLOG
splitlines(keepends=True)to preserve line endingsprocess_log📝 Version Updates
0.229.098