224 changes: 224 additions & 0 deletions docs/user_data_collection.md
@@ -0,0 +1,224 @@
# Lightspeed Stack user data collection

## Overview
This document outlines the process of capturing user interactions and system responses in the Lightspeed Core Stack service. Understanding this process will help optimize the system for better responses and outcomes.

## Components

### Lightspeed Core Stack
- Every user interaction results in the storage of its transcript as a JSON file on the local disk.
- When a user provides feedback (whether the LLM answer was satisfactory or not), the data is posted to the `/feedback` endpoint. This action also results in the creation of a JSON file.
- Both transcripts and feedback are stored in configurable local directories with unique filenames.

### Data Export Integration
- The Lightspeed Core Stack integrates with the [lightspeed-to-dataverse-exporter](https://github.com/lightspeed-core/lightspeed-to-dataverse-exporter) service to automatically export various types of user interaction data to Red Hat's Dataverse for analysis.
- The exporter service acts as a sidecar that periodically scans the configured data directories for new JSON files (transcripts and feedback).
- It packages the collected files into archives and uploads them to the appropriate ingress endpoints.

### Red Hat Dataverse Integration
- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.

Comment on lines +18 to +21

💡 Verification agent

🧩 Analysis chain

Validate “Red Hat Dataverse” naming/branding.

Confirm the correct upstream/product name and link to avoid brand inaccuracies.

Also applies to: 215-224


🌐 Web query:

Is the destination service officially called “Red Hat Dataverse”? If not, what is the correct product/name to reference for Lightspeed data export?

💡 Result:

Short answer: No — it's not called "Red Hat Dataverse." The destination is Microsoft Dataverse (formerly Common Data Service) when referring to business/app data exports; "Dataverse" (Harvard/IQSS) is a separate open‑source research data platform and Red Hat is not the product owner. [1][2][3]

Sources:

  • Microsoft Dataverse (Data Export / Synapse Link). [1]
  • Dataverse (Harvard/IQSS) — separate open research platform. [2]
  • Red Hat mentions/use of "Dataverse" in research context (not an official Red Hat product named "Dataverse"). [3]

Update “Red Hat Dataverse” to “Microsoft Dataverse” and add official link
In docs/user_data_collection.md (18–21, 215–224), rename and relink as follows:

- ### Red Hat Dataverse Integration
+ ### Microsoft Dataverse Integration

- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
+ The exporter service uploads data to [Microsoft Dataverse](https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro) for analysis and system improvement.

- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
+ Data flows through the same pipeline as other services for consistent processing and analysis.

Also update any remaining “Red Hat Dataverse” mentions in lines 215–224.

📝 Committable suggestion


Suggested change
### Red Hat Dataverse Integration
- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
### Microsoft Dataverse Integration
- The exporter service uploads data to [Microsoft Dataverse](https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro) for analysis and system improvement.
- Data flows through the same pipeline as other services for consistent processing and analysis.
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 18–21 and 215–224, occurrences of
"Red Hat Dataverse" should be renamed to "Microsoft Dataverse" and any
references relinked to the official Microsoft Dataverse page
(https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro).
Update the heading and bullet text to use "Microsoft Dataverse" and replace or
add the official link where appropriate, and scan the two line ranges to ensure
all remaining mentions are renamed and the link is added or updated.

## Configuration

User data collection is configured in the `user_data_collection` section of the configuration file:

```yaml
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
```
Comment on lines +28 to +31

🛠️ Refactor suggestion

Avoid /tmp for persisted data.

/tmp is ephemeral and often world-readable. Recommend a dedicated directory with restricted permissions.

Apply (first snippet; mirror in others):

-  feedback_storage: "/tmp/data/feedback"
-  transcripts_storage: "/tmp/data/transcripts"
+  feedback_storage: "/var/lib/lightspeed/data/feedback"     # ensure dir exists with 700 perms
+  transcripts_storage: "/var/lib/lightspeed/data/transcripts" # ensure dir exists with 700 perms

And add a note:

+> mkdir -p /var/lib/lightspeed/data/{feedback,transcripts} && chmod -R 700 /var/lib/lightspeed

Also applies to: 124-128, 171-174

🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 28-31 (and similarly at 124-128 and
171-174), the examples use /tmp for persisted feedback/transcripts which is
ephemeral and often world-readable; replace the paths with a dedicated
application data directory (e.g., /var/lib/your-app/data or
${APP_DATA_DIR}/feedback and .../transcripts) and update the other snippets to
match, and add a short note below explaining to create the directory with
restricted permissions (chmod 700 or chown to the app user) and to avoid using
/tmp for long-lived sensitive data.

```yaml
  # continuation of the user_data_collection block above
  data_collector:
    enabled: false
    ingress_server_url: null
    ingress_server_auth_token: null
    ingress_content_service_name: null
    collection_interval: 7200 # 2 hours in seconds
    cleanup_after_send: true
    connection_timeout_seconds: 30
```
Comment on lines +38 to +39

🛠️ Refactor suggestion

Safer default for cleanup_after_send.

Deleting local files immediately after upload can cause data loss on partial/failed uploads. Default to false and document verification semantics.

Apply:

-    cleanup_after_send: true
+    cleanup_after_send: false  # set to true only after verifying idempotent uploads and success checks

Also applies to: 180-181

🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 38-39 (also apply same change to
lines 180-181), the example default for cleanup_after_send is unsafe; change the
example value from true to false, and add one sentence clarifying verification
semantics: explain that files are retained locally until successful upload is
confirmed (or checksum/ack returned), and recommend enabling
cleanup_after_send=true only when uploads are guaranteed atomic or data is
otherwise backed up; update both occurrences and ensure the documentation notes
how users can verify successful upload before enabling automatic cleanup.


### Configuration Options

#### Basic Data Collection
- `feedback_enabled`: Enable/disable collection of user feedback data
- `feedback_storage`: Directory path where feedback JSON files are stored
- `transcripts_enabled`: Enable/disable collection of conversation transcripts
- `transcripts_storage`: Directory path where transcript JSON files are stored

#### Data Collector Service (Advanced)
- `enabled`: Enable/disable the data collector service that uploads data to ingress
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server (see the sketch after this list for keeping it out of config files)
- `ingress_content_service_name`: Service name identifier for the ingress server
Comment on lines +52 to +54

🛠️ Refactor suggestion

Don’t document storing raw tokens in config files.

Recommend environment variables or secret stores instead of inline tokens.

Apply:

-- `ingress_server_auth_token`: Authentication token for the ingress server
+- `ingress_server_auth_token`: Authentication token for the ingress server. Do not store raw tokens in files; use environment variables or a secret manager (e.g., `${INGRESS_TOKEN}`) and mount/inject at runtime.
📝 Committable suggestion


Suggested change
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server
- `ingress_content_service_name`: Service name identifier for the ingress server
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server. Do not store raw tokens in files; use environment variables or a secret manager (e.g., `${INGRESS_TOKEN}`) and mount/inject at runtime.
- `ingress_content_service_name`: Service name identifier for the ingress server
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 52 to 54, the docs currently imply
storing raw authentication tokens and sensitive ingress details in config files;
update the wording to discourage inline tokens and instead instruct users to use
environment variables or a secrets manager (e.g., provide placeholder names like
INGRESS_SERVER_AUTH_TOKEN and point to retrieving them from process.env or the
project's secret store), remove any example raw token values, and add a brief
note about least-privilege and rotation practices for ingress credentials.

- `collection_interval`: Interval in seconds between data collection cycles (default: 7200 = 2 hours)
- `cleanup_after_send`: Whether to delete local files after successful upload (default: true)
- `connection_timeout_seconds`: Timeout for connection to ingress server (default: 30)
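
The review comments above recommend not committing the token itself. As a minimal illustration of that pattern (the environment variable name is hypothetical, and this is not the service's actual config loader), the token can be resolved from the environment at startup:

```python
# Hypothetical pattern: read the ingress token from the environment instead of
# hard-coding it in the configuration file. The variable name is an example.
import os

ingress_server_auth_token = os.environ.get("INGRESS_SERVER_AUTH_TOKEN")
if not ingress_server_auth_token:
    raise RuntimeError("Set INGRESS_SERVER_AUTH_TOKEN before enabling the data collector")
```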

## Data Storage

### Feedback Data
Feedback data is stored as JSON files in the configured `feedback_storage` directory. Each file contains:

```json
{
"user_id": "user-uuid",
"timestamp": "2024-01-01T12:00:00Z",
"conversation_id": "conversation-uuid",
"user_question": "What is Kubernetes?",
"llm_response": "Kubernetes is an open-source container orchestration system...",
"sentiment": 1,
"user_feedback": "This response was very helpful",
"categories": ["helpful"]
}
```
Comment on lines +64 to +75

🛠️ Refactor suggestion

Clarify PII redaction vs. stored fields (feedback includes raw Q/A).

Security section claims redaction, but feedback example stores user_question and llm_response verbatim. Align either by redacting here too or explicitly documenting scope.

Apply (feedback example):

-  "user_question": "What is Kubernetes?",
-  "llm_response": "Kubernetes is an open-source container orchestration system...",
+  "redacted_user_question": "What is <technology>?",
+  "redacted_llm_response": "…",

And expand Security:

- **Data Redaction**: Query data is stored as "redacted_query" to ensure sensitive information is not captured
+ **Data Redaction**: All persisted user-visible fields (queries, responses, feedback) must be redacted before storage (`redacted_*` fields). Do not persist raw values unless you have explicit consent and a lawful basis.

Also applies to: 190-193


### Transcript Data
Transcript data is stored as JSON files in the configured `transcripts_storage` directory, organized by user and conversation:

```
/transcripts_storage/
    {user_id}/
        {conversation_id}/
            {unique_id}.json
```

Each transcript file contains:

```json
{
"metadata": {
"provider": "openai",
"model": "gpt-4",
"query_provider": "openai",
"query_model": "gpt-4",
"user_id": "user-uuid",
"conversation_id": "conversation-uuid",
"timestamp": "2024-01-01T12:00:00Z"
},
"redacted_query": "What is Kubernetes?",
"query_is_valid": true,
"llm_response": "Kubernetes is an open-source container orchestration system...",
"rag_chunks": [],
"truncated": false,
"attachments": []
}
```
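
As a minimal sketch of how such a file could be produced (illustrative only, not the Lightspeed Core Stack's actual implementation; the root path follows the example configuration and the layout and fields follow the samples above):

```python
# Minimal sketch: write one interaction into the layout above with a unique
# filename. Function name and details are illustrative, not the service's code.
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path

TRANSCRIPTS_ROOT = Path("/tmp/data/transcripts")  # transcripts_storage from the examples

def store_transcript(user_id: str, conversation_id: str, transcript: dict) -> Path:
    """Persist a single interaction as {user_id}/{conversation_id}/{unique_id}.json."""
    target_dir = TRANSCRIPTS_ROOT / user_id / conversation_id
    target_dir.mkdir(parents=True, exist_ok=True)
    transcript.setdefault("metadata", {}).setdefault(
        "timestamp", datetime.now(timezone.utc).isoformat()
    )
    path = target_dir / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(transcript, indent=2))
    return path
```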

## Data Flow

1. **User Interaction**: User submits a query to the `/query` or `/streaming_query` endpoint
2. **Transcript Storage**: If transcripts are enabled, the interaction is stored as a JSON file
3. **Feedback Collection**: User can submit feedback via the `/feedback` endpoint
4. **Feedback Storage**: If feedback is enabled, the feedback is stored as a JSON file
5. **Data Export**: The exporter service (if enabled) periodically scans for new files and uploads them to the ingress server (a simplified sketch of this cycle follows)
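
The sketch below walks through this cycle end to end using the configuration options documented above. It is a simplified stand-in, not the actual lightspeed-to-dataverse-exporter code; the tar.gz archive format, the `Authorization` header scheme, the environment variable name, and the use of `requests` are assumptions.

```python
# Simplified export cycle: scan the data directories, bundle new JSON files
# into an archive, upload it, then optionally clean up local copies.
import os
import tarfile
import tempfile
import time
from pathlib import Path

import requests  # assumption: any HTTP client would do

DATA_DIRS = [Path("/tmp/data/feedback"), Path("/tmp/data/transcripts")]
INGRESS_URL = "https://your-ingress-server.com/upload"   # ingress_server_url
TOKEN = os.environ.get("INGRESS_SERVER_AUTH_TOKEN", "")  # keep tokens out of config files
COLLECTION_INTERVAL = 7200                               # collection_interval, seconds
CONNECTION_TIMEOUT = 30                                  # connection_timeout_seconds
CLEANUP_AFTER_SEND = False                               # enable only once uploads are verified

def collect_once() -> None:
    files = [p for d in DATA_DIRS for p in d.rglob("*.json")]
    if not files:
        return
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = Path(tmp) / "collection.tar.gz"
        with tarfile.open(archive_path, "w:gz") as archive:
            for f in files:
                archive.add(f, arcname=f.name)
        with archive_path.open("rb") as payload:
            response = requests.post(
                INGRESS_URL,
                data=payload,
                headers={"Authorization": f"Bearer {TOKEN}"},  # header scheme is an assumption
                timeout=CONNECTION_TIMEOUT,
            )
    response.raise_for_status()
    if CLEANUP_AFTER_SEND:  # delete local files only after a successful upload
        for f in files:
            f.unlink()

if __name__ == "__main__":
    while True:
        collect_once()
        time.sleep(COLLECTION_INTERVAL)
```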

## How to Test Locally

### Basic Data Collection Testing

1. **Enable data collection** in your `lightspeed-stack.yaml`:
```yaml
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
```

2. **Start the Lightspeed Core Stack**:
```bash
python -m src.app.main
```

3. **Submit a query** to generate transcript data:
```bash
curl -X POST "http://localhost:8080/query" \
-H "Content-Type: application/json" \
-d '{
"query": "What is Kubernetes?",
"provider": "openai",
"model": "gpt-4"
}'
```

4. **Submit feedback** to generate feedback data:
```bash
curl -X POST "http://localhost:8080/feedback" \
-H "Content-Type: application/json" \
-d '{
"conversation_id": "your-conversation-id",
"user_question": "What is Kubernetes?",
"llm_response": "Kubernetes is...",
"sentiment": 1,
"user_feedback": "Very helpful response"
}'
```
Comment on lines +146 to +157

🛠️ Refactor suggestion

Add instruction to obtain conversation_id before posting feedback.

The feedback example requires conversation_id but prior steps don’t show how to get it.

Insert before the feedback curl:

+4a. Extract the conversation_id from the latest transcript:
+```bash
+CONVERSATION_ID=$(jq -r '.metadata.conversation_id' "$(find /var/lib/lightspeed/data/transcripts -type f -name '*.json' -print0 | xargs -0 ls -t | head -n1)")
+echo "$CONVERSATION_ID"
+```

And update the payload:

-       "conversation_id": "your-conversation-id",
+       "conversation_id": "'"$CONVERSATION_ID"'",
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 146 to 157, the feedback curl
example uses conversation_id but earlier steps never show how to obtain it; add
a short step immediately before the curl that explains and demonstrates
extracting the latest conversation_id from the transcripts directory (using jq
to read .metadata.conversation_id from the most recent JSON file) and echoing it
for verification, then update the curl payload example to reference that
variable (e.g., use the shell variable in the JSON payload) so readers can
copy/paste the full flow end-to-end.


5. **Check stored data** (see also the inspection helper after this listing):
```bash
ls -la /tmp/data/feedback/
ls -la /tmp/data/transcripts/
```
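
Beyond `ls`, a small helper like the following (paths assume the `/tmp` example configuration above) prints the newest transcript and its `conversation_id`, which can be reused in the feedback request from step 4:

```python
# Print the newest transcript and its conversation_id for reuse in /feedback.
import json
from pathlib import Path

transcripts = sorted(
    Path("/tmp/data/transcripts").rglob("*.json"),
    key=lambda p: p.stat().st_mtime,
    reverse=True,
)
if transcripts:
    latest = json.loads(transcripts[0].read_text())
    print("conversation_id:", latest["metadata"]["conversation_id"])
    print(json.dumps(latest, indent=2))
else:
    print("No transcripts found yet.")
```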

### Advanced Data Collector Testing

1. **Enable data collector** in your configuration:
```yaml
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: "your-auth-token"
    ingress_content_service_name: "lightspeed-stack"
```
Comment on lines +169 to +178

🛠️ Refactor suggestion

Use env var for auth token in examples.

Inline secrets are risky and get copy-pasted.

Apply:

      data_collector:
        enabled: true
        ingress_server_url: "https://your-ingress-server.com/upload"
-       ingress_server_auth_token: "your-auth-token"
+       ingress_server_auth_token: ${INGRESS_TOKEN}  # injected via environment/secret
        ingress_content_service_name: "lightspeed-stack"

Add a brief note below with a safe export example:

+> Note: In local testing, export the token for the current shell only:
+> `export INGRESS_TOKEN='<redacted>'`
📝 Committable suggestion


Suggested change
user_data_collection:
feedback_enabled: true
feedback_storage: "/tmp/data/feedback"
transcripts_enabled: true
transcripts_storage: "/tmp/data/transcripts"
data_collector:
enabled: true
ingress_server_url: "https://your-ingress-server.com/upload"
ingress_server_auth_token: "your-auth-token"
ingress_content_service_name: "lightspeed-stack"
user_data_collection:
feedback_enabled: true
feedback_storage: "/tmp/data/feedback"
transcripts_enabled: true
transcripts_storage: "/tmp/data/transcripts"
data_collector:
enabled: true
ingress_server_url: "https://your-ingress-server.com/upload"
ingress_server_auth_token: ${INGRESS_TOKEN} # injected via environment/secret
ingress_content_service_name: "lightspeed-stack"
> Note: In local testing, export the token for the current shell only:
> `export INGRESS_TOKEN='<redacted>'`
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 169 to 178, the example inlines an
auth token which is unsafe; change the example to reference an environment
variable (e.g., INGRESS_AUTH_TOKEN) instead of embedding the secret, update the
key to show a placeholder like ingress_server_auth_token:
"${INGRESS_AUTH_TOKEN}" or similar config-variable syntax used by the project,
and add a short note beneath the block showing how to export the variable
locally (e.g., export INGRESS_AUTH_TOKEN="your-auth-token") and recommend using
a secret manager or CI/CD secrets for production.

```yaml
    # continuation of the data_collector block above
    collection_interval: 60 # 1 minute for testing
    cleanup_after_send: true
    connection_timeout_seconds: 30
```

2. **Deploy the exporter service** pointing to the same data directories

3. **Monitor the data collection** by checking the logs and verifying that files are being uploaded and cleaned up

## Security Considerations

- **Data Privacy**: All user data is stored locally and can be configured to be cleaned up after upload
- **Authentication**: The data collector service uses authentication tokens for secure uploads
- **Data Redaction**: Query data is stored as "redacted_query" to ensure sensitive information is not captured
- **Access Control**: Data directories should be properly secured with appropriate file permissions (see the sketch below)
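
One way to apply the access-control point above, mirroring the `chmod 700` recommendation from the review comments (paths follow the example configuration; adjust to your deployment):

```python
# Create the data directories up front with owner-only permissions (0o700).
from pathlib import Path

for directory in ("/tmp/data/feedback", "/tmp/data/transcripts"):
    path = Path(directory)
    path.mkdir(parents=True, exist_ok=True)
    path.chmod(0o700)
```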

## Troubleshooting

### Common Issues

1. **Data not being stored**: Check that the storage directories exist and are writable
2. **Data collector not uploading**: Verify the ingress server URL and authentication token
3. **Permission errors**: Ensure the service has write permissions to the configured directories
4. **Connection timeouts**: Adjust the `connection_timeout_seconds` setting if needed

### Logging

Enable debug logging to troubleshoot data collection issues:

```yaml
service:
  log_level: debug
```

This will provide detailed information about data collection, storage, and upload processes.

## Integration with Red Hat Dataverse

For production deployments, the Lightspeed Core Stack integrates with Red Hat's Dataverse through the exporter service. This provides:

- Centralized data collection and analysis
- Consistent data processing pipeline
- Integration with other Red Hat services
- Automated data export and cleanup

For complete integration setup, deployment options, and configuration details, see the [exporter repository](https://github.com/lightspeed-core/lightspeed-to-dataverse-exporter).