LCORE-302: User Data Collection #237
Conversation
- Introduced DataCollectorConfiguration model to manage data collection settings.
- Updated lightspeed-stack.yaml to include data collector configuration options.
- Integrated data collector service startup and shutdown in main application flow.
- Introduced a new target in the Makefile to run the data collector service.
- Enhanced `lightspeed_stack.py` to support starting the data collector via command-line argument.
- Refactored `data_collector.py` to simplify the service's run method and remove unnecessary async handling.
- Cleaned up the main application startup process by removing the data collector's startup and shutdown events from `main.py`.
- Added `types-requests` to development dependencies for type checking.
- Modified `lightspeed_stack.py` to pass the data collector configuration to the `start_data_collector` function.
- Updated `DataCollectorConfiguration` model to enforce required fields when data archival is enabled.
- Refactored logging statements in `DataCollectorService` for improved clarity and consistency.
- Added unit tests for `UserDataCollection` to validate data collector configuration scenarios.
- Included `types-requests` in the development dependencies for type checking; updated the lock file to reflect the new package and its version, with relevant wheel and source information.
- Updated exception handling in tests to use specific exception types (OSError, requests.RequestException, tarfile.TarError).
- Added a new test for handling a missing ingress server URL in tarball sending.
- Improved test coverage for the _perform_collection method to ensure proper exception catching.
- Renamed the validation method in DataCollectorConfiguration for clarity and updated error messages to reflect the data collector context.
- Improved exception handling in start_data_collector to log errors.
- Modified DataCollectorService to conditionally send feedback and transcript files based on configuration.
- Introduced new constants for data collector settings (collection interval, connection timeout, retry interval) and updated DataCollectorConfiguration to use them as defaults with enforced positive integer types.
- Introduced `ingress_content_service_name` in the DataCollectorConfiguration model to specify the service name for data collection; updated `lightspeed-stack.yaml`, required the field when the data collector is enabled, and used it in the content type header during data transmission.
- Updated unit tests throughout to cover the new configuration, validation, and error scenarios.
- Bumped versions for `astroid` (3.3.10 to 3.3.11) and `certifi` (2025.7.9 to 2025.7.14) in the lock file, including updated source URLs and hashes.
- Removed the outdated `typing-extensions` package entry from the lock file.
- Cleaned up import statements in `config.py` to resolve merge conflicts and improve readability.
Walkthrough

A new data collector feature was introduced, including configuration, service implementation, command-line integration, Makefile and documentation updates, and comprehensive unit tests. The data collector periodically gathers user feedback and transcript data, packages them, and sends them to an ingress server with configurable parameters and robust error handling.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI/Makefile
    participant lightspeed_stack.py
    participant DataCollectorService
    participant Ingress Server
    User->>CLI/Makefile: make run-data-collector
    CLI/Makefile->>lightspeed_stack.py: python src/lightspeed_stack.py --data-collector
    lightspeed_stack.py->>DataCollectorService: start_data_collector(config)
    DataCollectorService->>DataCollectorService: run()
    DataCollectorService->>DataCollectorService: Collect feedback/transcript files
    DataCollectorService->>DataCollectorService: Package files into tarball
    DataCollectorService->>Ingress Server: Send tarball (HTTP POST)
    DataCollectorService->>DataCollectorService: Cleanup files/tarball
    DataCollectorService-->>lightspeed_stack.py: Wait for next interval or exit
```
Actionable comments posted: 2
🧹 Nitpick comments (2)
README.md (1)
312-354: Comprehensive documentation for the data collector service.

The documentation effectively covers all aspects of the new service, including features, configuration, and usage instructions. The YAML configuration example is helpful and the command-line usage is clearly explained.
However, there are markdown linting issues that should be addressed for consistency.
Fix the markdown list style inconsistency flagged by static analysis:
```diff
-### Features
-
-- **Periodic Collection**: Runs at configurable intervals
-- **Data Packaging**: Packages feedback and transcript files into compressed tar.gz archives
-- **Secure Transmission**: Sends data to a configured ingress server with optional authentication
-- **File Cleanup**: Optionally removes local files after successful transmission
-- **Error Handling**: Includes retry logic and comprehensive error handling
+### Features
+
+* **Periodic Collection**: Runs at configurable intervals
+* **Data Packaging**: Packages feedback and transcript files into compressed tar.gz archives
+* **Secure Transmission**: Sends data to a configured ingress server with optional authentication
+* **File Cleanup**: Optionally removes local files after successful transmission
+* **Error Handling**: Includes retry logic and comprehensive error handling
```

src/services/data_collector.py (1)
25-50: Remove redundant null check for `collection_interval`.

The `collection_interval` field is defined as `PositiveInt` in the configuration model, which means it cannot be `None`. The null check on line 40 is redundant. Apply this diff to remove the redundant check:
```diff
-        if collector_config.collection_interval is not None:
-            time.sleep(collector_config.collection_interval)
+        time.sleep(collector_config.collection_interval)
```
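For illustration, a minimal sketch (assuming Pydantic v2; the model and field names here are hypothetical stand-ins for the PR's actual configuration class) of why a `PositiveInt` field cannot be `None`:

```python
from pydantic import BaseModel, PositiveInt, ValidationError


class SleepConfig(BaseModel):
    # PositiveInt requires an int > 0; None (or 0) is rejected at validation time
    collection_interval: PositiveInt = 7200


cfg = SleepConfig()
print(cfg.collection_interval)  # 7200

try:
    SleepConfig(collection_interval=None)
except ValidationError as e:
    print("rejected:", len(e.errors()), "error(s)")
```

Since validation already guarantees a positive integer, any later `is not None` guard is dead code.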
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (13)
- Makefile (1 hunks)
- README.md (2 hunks)
- lightspeed-stack.yaml (1 hunks)
- pyproject.toml (1 hunks)
- src/constants.py (1 hunks)
- src/lightspeed_stack.py (3 hunks)
- src/models/config.py (2 hunks)
- src/runners/data_collector.py (1 hunks)
- src/services/__init__.py (1 hunks)
- src/services/data_collector.py (1 hunks)
- tests/unit/models/test_config.py (5 hunks)
- tests/unit/runners/test_data_collector_runner.py (1 hunks)
- tests/unit/services/test_data_collector.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
src/runners/data_collector.py (2)
- src/models/config.py (1)
  - DataCollectorConfiguration (88-110)
- src/services/data_collector.py (2)
  - DataCollectorService (18-258)
  - run (25-49)

src/lightspeed_stack.py (2)
- src/runners/data_collector.py (1)
  - start_data_collector (11-26)
- src/configuration.py (2)
  - configuration (47-52)
  - user_data_collection_configuration (71-76)

tests/unit/models/test_config.py (1)
- src/models/config.py (6)
  - Configuration (172-188)
  - LLamaStackConfiguration (57-85)
  - ServiceConfiguration (26-46)
  - UserDataCollection (113-131)
  - ModelContextProtocolServer (49-54)
  - DataCollectorConfiguration (88-110)
🪛 markdownlint-cli2 (0.17.2)
README.md
- 318: Unordered list style: expected asterisk, actual dash (MD004, ul-style)
- 319: Unordered list style: expected asterisk, actual dash (MD004, ul-style)
- 320: Unordered list style: expected asterisk, actual dash (MD004, ul-style)
- 321: Unordered list style: expected asterisk, actual dash (MD004, ul-style)
- 322: Unordered list style: expected asterisk, actual dash (MD004, ul-style)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build-pr
🔇 Additional comments (20)
src/services/__init__.py (1)
1-1: LGTM: Clean package initialization. The package initialization file is properly structured with a clear docstring.
pyproject.toml (1)
50-50: LGTM: Appropriate type stub addition. The `types-requests` dependency addition is necessary for type checking the data collector's HTTP functionality.

Makefile (1)
11-12: LGTM: Consistent Makefile target. The new target follows the established pattern and provides a convenient way to run the data collector service.
src/constants.py (1)
46-49: LGTM: Well-chosen timing constants. The data collector constants have appropriate values and clear naming. The inline comments helpfully explain the time units.
src/runners/data_collector.py (1)
11-17: LGTM: Clean function structure with proper early return. The function signature is clear, and the early return when data collection is disabled is good practice.
tests/unit/runners/test_data_collector_runner.py (1)
1-61: Test coverage looks comprehensive and well-structured. The test file effectively covers the key scenarios for the `start_data_collector` function:
- Service execution when enabled
- Service bypass when disabled
- Exception propagation and handling
The use of mocking is appropriate to isolate the unit under test from the actual service implementation.
src/lightspeed_stack.py (3)
13-13: Clean integration of the data collector import. The import follows the existing pattern and is properly organized with the other runner imports.
51-57: Command-line argument addition is well-implemented. The new `--data-collector` argument is properly configured with appropriate help text and a default value. The naming convention is consistent with existing arguments.
81-84: Conditional logic correctly implements mutual exclusivity. The implementation properly ensures that either the data collector or the web service runs, but not both simultaneously. The configuration access via `user_data_collection_configuration.data_collector` is correct based on the relevant code snippets.

tests/unit/models/test_config.py (4)
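The either/or dispatch described above can be sketched with `argparse` (names and return values here are illustrative, not the PR's actual code):

```python
import argparse


def main(argv=None):
    parser = argparse.ArgumentParser(prog="lightspeed_stack")
    # Flag selects the data collector instead of the web service
    parser.add_argument(
        "--data-collector",
        action="store_true",
        default=False,
        help="run the data collector service instead of the web server",
    )
    args = parser.parse_args(argv)

    if args.data_collector:
        return "data-collector"  # would call start_data_collector(config)
    return "web-service"         # would launch the web application


print(main(["--data-collector"]))  # data-collector
print(main([]))                    # web-service
```

Because the branch returns (or blocks) in each arm, exactly one of the two services runs per invocation.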
8-13: Import additions are properly organized. The new constants and model imports are appropriately added and follow the existing import structure.
139-153: Positive test case is well-structured. The test properly verifies that a correctly configured `DataCollectorConfiguration` with all required fields constructs successfully and sets the enabled flag properly.
155-184: Comprehensive validation testing. The test effectively covers both required-field validation scenarios:

- Missing `ingress_server_url` when enabled
- Missing `ingress_content_service_name` when enabled

The error messages are properly matched and the test structure is consistent.
423-431: JSON serialization test updated correctly. The test verifies that the new `data_collector` section is properly included in the JSON dump with the correct default values from the constants.

README.md (1)
160-160: Make target addition is correctly documented. The new `run-data-collector` target is properly added to the available targets list and follows the existing documentation format.

src/models/config.py (2)
88-111: Well-structured configuration model with proper validation. The `DataCollectorConfiguration` class is well-designed with appropriate field types and validation logic. The use of `PositiveInt` for numeric fields and proper validation in the `check_data_collector_configuration` method ensures configuration consistency.
120-120: Clean integration with existing configuration structure. The integration of the data collector configuration into the `UserDataCollection` model is clean and follows the established pattern with proper default initialization.
170-212: Robust HTTP transmission with proper error handling. The tarball transmission logic is well-implemented with proper authentication handling, timeout configuration, and comprehensive error checking. The use of bearer token authentication and custom content-type headers follows best practices.
143-169: Efficient tarball creation with proper error handling. The tarball creation method properly handles file addition errors and maintains directory structure. The use of relative paths and error logging for individual file failures ensures robustness.
222-251: Comprehensive cleanup logic for empty directories. The cleanup method properly handles the nested directory structure (user/conversation) and gracefully handles permission errors. The logic correctly removes empty directories bottom-up.
tests/unit/services/test_data_collector.py (1)
1-588: Comprehensive test suite with excellent coverage. The test suite provides thorough coverage of the `DataCollectorService` functionality, including:
- Normal operation and error conditions
- File collection with various configuration states
- Tarball creation and transmission
- Cleanup operations with error handling
- Proper mocking of dependencies and external resources
The tests are well-structured and follow good testing practices with clear test names and proper assertions.
```python
    try:
        service = DataCollectorService()
        service.run()
    except Exception as e:
        logger.error(
            "Data collector service encountered an exception: %s", e, exc_info=True
        )
        raise
```
💡 Verification agent
🧩 Analysis chain
Verify configuration consistency between runner and service
The runner receives a `DataCollectorConfiguration` parameter, but from the relevant code snippets the `DataCollectorService` appears to access configuration from a global configuration object (`configuration.user_data_collection_configuration.data_collector`). This could lead to inconsistency where the runner thinks data collection is enabled but the service has different configuration.
Consider either:
- Passing the configuration to the service constructor
- Using the same global configuration source in both places
- Adding validation to ensure consistency
🏁 Script executed:
```shell
#!/bin/bash
# Check how DataCollectorService accesses configuration
ast-grep --pattern 'class DataCollectorService {
  $$$
}'

# Check for configuration access patterns in the service
rg -A 10 "configuration\." src/services/data_collector.py
```

Length of output: 3498
I’d like to confirm how the runner receives and passes the configuration. Let’s pull the full runner code:
🏁 Script executed:
```shell
#!/bin/bash
sed -n '1,200p' src/runners/data_collector.py
```

Length of output: 783
Continuing verification of how configuration is passed into the service:
🏁 Script executed:
```shell
#!/bin/bash
# Show imports and top of DataCollectorService
sed -n '1,50p' src/services/data_collector.py

# Locate DataCollectorConfiguration definition
rg -n "class DataCollectorConfiguration" -A5 src/models/config.py
```

Length of output: 2088
Synchronize runner and service configuration sources

The `start_data_collector` runner takes a `DataCollectorConfiguration` parameter but never passes it into `DataCollectorService`, which always reads from the global `configuration.user_data_collection_configuration.data_collector`. This disconnect can cause the service to run with different settings than the runner expects.

Please update the implementation so both use the same config:

- In `src/runners/data_collector.py`, pass the runner's `configuration` into the service.
- In `src/services/data_collector.py`, add a constructor (or method parameter) to accept and store that config instead of pulling from the global.
- Replace all occurrences of `configuration.user_data_collection_configuration.data_collector` in the service with the injected config object.
Example diff:

```diff
# src/runners/data_collector.py
-    service = DataCollectorService()
+    service = DataCollectorService(configuration)

# src/services/data_collector.py
-class DataCollectorService:  # pylint: disable=too-few-public-methods
+class DataCollectorService:  # pylint: disable=too-few-public-methods
+    def __init__(self, config: DataCollectorConfiguration):
+        self._collector_config = config

     def run(self) -> None:
-        collector_config = configuration.user_data_collection_configuration.data_collector
+        collector_config = self._collector_config
         logger.info("Starting data collection service")
         …
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
--- src/runners/data_collector.py
@@ lines 19-26
     try:
-        service = DataCollectorService()
+        service = DataCollectorService(configuration)
         service.run()
     except Exception as e:
         logger.error(
             "Data collector service encountered an exception: %s", e, exc_info=True
         )
         raise
```

```diff
--- src/services/data_collector.py
@@
 class DataCollectorService:  # pylint: disable=too-few-public-methods
+    def __init__(self, config: DataCollectorConfiguration):
+        self._collector_config = config

     def run(self) -> None:
-        collector_config = configuration.user_data_collection_configuration.data_collector
+        collector_config = self._collector_config
         logger.info("Starting data collection service")
         …
```
🤖 Prompt for AI Agents
In src/runners/data_collector.py around lines 19 to 26, the DataCollectorService
is instantiated without passing the runner's DataCollectorConfiguration, causing
a mismatch in configuration sources. Modify the runner to pass its configuration
object to the DataCollectorService constructor. Then, in
src/services/data_collector.py, add a constructor parameter to accept this
configuration and store it as an instance variable. Replace all references to
the global configuration.user_data_collection_configuration.data_collector in
the service with this injected configuration to ensure both runner and service
use the same settings.
```yaml
data_collector:
  enabled: false
  ingress_server_url: null
  ingress_server_auth_token: null
  ingress_content_service_name: null
  collection_interval: 7200  # 2 hours in seconds
  cleanup_after_send: true
  connection_timeout_seconds: 30
```
💡 Verification agent
🧩 Analysis chain
Configuration section is well-structured with appropriate defaults.
The new data_collector section includes all necessary parameters with sensible default values. The service is disabled by default, which is appropriate for a new feature.
However, there's a potential field name inconsistency that should be verified.
Verify the field name consistency between the YAML configuration and the model definition:
The YAML uses connection_timeout_seconds while the model likely uses connection_timeout. This inconsistency could cause configuration parsing issues.
🏁 Script executed:
```shell
#!/bin/bash
# Check if the field name in YAML matches the model definition
echo "Checking field name in YAML configuration:"
grep -n "connection_timeout" lightspeed-stack.yaml

echo -e "\nChecking field name in model definition:"
ast-grep --pattern 'connection_timeout$_: $_' src/models/config.py
```

Length of output: 349
🏁 Script executed:
```shell
#!/bin/bash
# Search for 'connection_timeout_seconds' in model definitions
echo "Searching for 'connection_timeout_seconds' in model definitions:"
grep -R "connection_timeout_seconds" -n src/models

echo -e "\nSearching for DataCollector class definition and its fields:"
grep -R "class DataCollector" -n src/models
```

Length of output: 447
🏁 Script executed:

```shell
#!/bin/bash
# Locate the DataCollectorConfiguration class and list its fields
grep -n "class DataCollectorConfiguration" src/models/config.py
grep -A20 "class DataCollectorConfiguration" src/models/config.py

# Search specifically for timeout-related fields within the class
sed -n '88,120p' src/models/config.py | grep -n "timeout"
```

Length of output: 1387
Rename `connection_timeout_seconds` to `connection_timeout` in YAML

The Pydantic model `DataCollectorConfiguration` defines its timeout field as `connection_timeout`, so the YAML key `connection_timeout_seconds` will not be picked up (and may raise an "extra field" error or be ignored).
Please update lightspeed-stack.yaml at line 30:
```diff
   cleanup_after_send: true
-  connection_timeout_seconds: 30
+  connection_timeout: 30
```

This ensures the configuration maps correctly to `DataCollectorConfiguration.connection_timeout`.
🤖 Prompt for AI Agents
In lightspeed-stack.yaml at line 30, rename the key `connection_timeout_seconds`
to `connection_timeout` to match the Pydantic model field name
`connection_timeout`. This change ensures the YAML configuration correctly maps
to the DataCollectorConfiguration model without causing errors or being ignored.
tisnik left a comment:
it looks pretty ok
```yaml
  ingress_content_service_name: "lightspeed-team"
  collection_interval: 7200  # 2 hours in seconds
  cleanup_after_send: true
  connection_timeout: 30
```
Shouldn't it be a separate root-level configuration? Like

```yaml
user_data_collection:
  feedback_disabled: false
  feedback_storage: "/tmp/data/feedback"
  transcripts_disabled: false
  transcripts_storage: "/tmp/data/transcripts"

user_data_export:
  enabled: true
  ingress_server_url: "https://your-ingress-server.com"
  ingress_server_auth_token: "your-auth-token"
  ingress_content_service_name: "lightspeed-team"
  collection_interval: 7200  # 2 hours in seconds
  cleanup_after_send: true
  connection_timeout: 30
```

As those are two independent pieces.
Description
This pull request implements support for delivering user chat transcripts and feedback at a regular period to an ingress server.
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation
Tests
Chores