LCORE-427: (docs) user data collection #518
Conversation
Walkthrough

Adds a new documentation file describing Lightspeed Stack user data collection: configuration options, JSON schemas for feedback/transcripts, local storage structure, sidecar exporter workflow to Red Hat Dataverse, testing steps (basic/advanced), security considerations, and troubleshooting.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant U as User
participant LS as Lightspeed Stack
participant FS as Local Storage<br/>(JSON files)
participant EX as Sidecar Exporter
participant DV as Red Hat Dataverse
rect rgba(200,220,255,0.25)
note over U,LS: Transcript flow
U->>LS: Submit query
LS-->>U: Return response
LS->>FS: Write transcript JSON<br/>(by user/conversation)
end
rect rgba(220,255,220,0.25)
note over U,LS: Feedback flow
U->>LS: Submit feedback
LS->>FS: Append feedback JSON
end
rect rgba(255,240,200,0.25)
note over EX,DV: Export cycle
EX-->>FS: Periodic scan (interval)
EX->>DV: Upload archives (auth token)
alt cleanup enabled
EX->>FS: Remove sent data
else keep data
EX-->>FS: Retain local copies
end
    end
```
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Pre-merge checks (2 passed, 1 inconclusive)
❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 7
🧹 Nitpick comments (4)
docs/user_data_collection.md (4)
80-85: Fix MD040: add language to fenced block. The directory layout block lacks a language hint; markdownlint flags this.
Apply:
````diff
-```
+```text
 /transcripts_storage/
   /{user_id}/
     /{conversation_id}/
       /{unique_id}.json
-```
+```
````
91-96: Use vendor-neutral placeholders in examples. Avoid locking docs to a specific provider/model in general guidance.
Apply:
- "provider": "openai", - "model": "gpt-4", - "query_provider": "openai", - "query_model": "gpt-4", + "provider": "<provider>", + "model": "<model>", + "query_provider": "<provider>", + "query_model": "<model>",
59-75: Specify schema/constraints for feedback. Define field requirements and allowed values (e.g., sentiment domain, categories enum) to ensure data quality.
Consider adding a JSON Schema (required fields: user_id, timestamp, conversation_id, sentiment; sentiment: integer in [-1,0,1] or [0,1]; categories: array of enum).
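As a quick illustration of those constraints, a `jq` check against a stored feedback file (field names and the `/tmp/data/feedback` path are taken from the examples in this document; the exact schema remains to be defined):

```bash
# Hypothetical sanity check for a stored feedback record: exits non-zero
# if any proposed required field is missing or sentiment is out of range.
jq -e 'has("user_id") and has("timestamp") and has("conversation_id")
       and has("sentiment") and (.sentiment | IN(-1, 0, 1))' \
  /tmp/data/feedback/example.json
```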
8-12: Document filename format and retention. You mention “unique filenames”; specify the naming convention (e.g., `<ts>-<uuid>.json`) and retention policy to aid ops and compliance.
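Purely as a sketch of that convention (the real naming scheme used by the service should be confirmed before documenting it):

```bash
# Produce a "<ts>-<uuid>.json" style filename, e.g. 20240101T120000Z-<uuid>.json
printf '%s-%s.json\n' "$(date -u +%Y%m%dT%H%M%SZ)" "$(uuidgen)"
```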
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
docs/user_data_collection.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/user_data_collection.md
80-80: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (2)
docs/user_data_collection.md (2)
137-144: Confirm endpoints, port, and response shape. Verify that `/query`, `/feedback`, and port 8080 match the current API, and note the expected response fields (e.g., `conversation_id`) to reduce guesswork. Would you confirm the base URL, endpoints, and that `/query` returns `conversation_id` (or point to the transcript path to obtain it)?

Also applies to: 148-157
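If `/query` does return the identifier in its JSON body, a capture along these lines would close the gap (assumption: a top-level `conversation_id` field; adjust once the response shape is confirmed):

```bash
# Sketch only — assumes POST /query responds with JSON containing conversation_id.
CONVERSATION_ID=$(curl -s -X POST "http://localhost:8080/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Kubernetes?"}' | jq -r '.conversation_id')
echo "$CONVERSATION_ID"
```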
206-213: Verify logging config keys and provide env-based toggle. Ensure `service.log_level` matches the actual config. Add an env override example.

Apply:

```diff
+# Alternatively, set via environment:
+# export LS_LOG_LEVEL=debug
```
> ### Red Hat Dataverse Integration
> - The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
> - Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
💡 Verification agent
🧩 Analysis chain
Validate “Red Hat Dataverse” naming/branding.
Confirm the correct upstream/product name and link to avoid brand inaccuracies.
Also applies to: 215-224
🌐 Web query:
Is the destination service officially called “Red Hat Dataverse”? If not, what is the correct product/name to reference for Lightspeed data export?
💡 Result:
Short answer: No — it's not called "Red Hat Dataverse." The destination is Microsoft Dataverse (formerly Common Data Service) when referring to business/app data exports; "Dataverse" (Harvard/IQSS) is a separate open‑source research data platform and Red Hat is not the product owner. [1][2][3]
Sources:
- Microsoft Dataverse (Data Export / Synapse Link). [1]
- Dataverse (Harvard/IQSS) — separate open research platform. [2]
- Red Hat mentions/use of "Dataverse" in research context (not an official Red Hat product named "Dataverse"). [3]
Update “Red Hat Dataverse” to “Microsoft Dataverse” and add official link
In docs/user_data_collection.md (18–21, 215–224), rename and relink as follows:
```diff
- ### Red Hat Dataverse Integration
+ ### Microsoft Dataverse Integration
- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
+ The exporter service uploads data to [Microsoft Dataverse](https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro) for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
+ Data flows through the same pipeline as other services for consistent processing and analysis.
```

Also update any remaining “Red Hat Dataverse” mentions in lines 215–224.
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 18–21 and 215–224, occurrences of
"Red Hat Dataverse" should be renamed to "Microsoft Dataverse" and any
references relinked to the official Microsoft Dataverse page
(https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro).
Update the heading and bullet text to use "Microsoft Dataverse" and replace or
add the official link where appropriate, and scan the two line ranges to ensure
all remaining mentions are renamed and the link is added or updated.
```yaml
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
```
🛠️ Refactor suggestion
Avoid /tmp for persisted data.
/tmp is ephemeral and often world-readable. Recommend a dedicated directory with restricted permissions.
Apply (first snippet; mirror in others):
```diff
-  feedback_storage: "/tmp/data/feedback"
-  transcripts_storage: "/tmp/data/transcripts"
+  feedback_storage: "/var/lib/lightspeed/data/feedback"        # ensure dir exists with 700 perms
+  transcripts_storage: "/var/lib/lightspeed/data/transcripts"  # ensure dir exists with 700 perms
```

And add a note:

```diff
+> mkdir -p /var/lib/lightspeed/data/{feedback,transcripts} && chmod -R 700 /var/lib/lightspeed
```

Also applies to: 124-128, 171-174
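A slightly fuller variant of that note, assuming the service runs under a dedicated `lightspeed` account (the user and group names are placeholders for whatever account your deployment uses):

```bash
# Create the data directories owned by the service account with 700 permissions.
sudo install -d -m 700 -o lightspeed -g lightspeed \
  /var/lib/lightspeed/data/feedback \
  /var/lib/lightspeed/data/transcripts
```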
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 28-31 (and similarly at 124-128 and
171-174), the examples use /tmp for persisted feedback/transcripts which is
ephemeral and often world-readable; replace the paths with a dedicated
application data directory (e.g., /var/lib/your-app/data or
${APP_DATA_DIR}/feedback and .../transcripts) and update the other snippets to
match, and add a short note below explaining to create the directory with
restricted permissions (chmod 700 or chown to the app user) and to avoid using
/tmp for long-lived sensitive data.
```yaml
    cleanup_after_send: true
    connection_timeout_seconds: 30
```
🛠️ Refactor suggestion
Safer default for cleanup_after_send.
Deleting local files immediately after upload can cause data loss on partial/failed uploads. Default to false and document verification semantics.
Apply:
```diff
-    cleanup_after_send: true
+    cleanup_after_send: false  # set to true only after verifying idempotent uploads and success checks
```

Also applies to: 180-181
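To make the verification semantics concrete, a rough sketch of "delete only after confirmed success" (the endpoint variable, token variable, and archive name are placeholders, not the exporter's actual implementation):

```bash
# Upload one archive and remove the local copy only on a 2xx response.
STATUS=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$INGRESS_SERVER_URL" \
  -H "Authorization: Bearer $INGRESS_TOKEN" \
  --data-binary @archive.tar.gz)
if [ "$STATUS" -ge 200 ] && [ "$STATUS" -lt 300 ]; then
  rm -f archive.tar.gz   # clean up only after the server acknowledged the upload
fi
```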
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 38-39 (also apply same change to
lines 180-181), the example default for cleanup_after_send is unsafe; change the
example value from true to false, and add one sentence clarifying verification
semantics: explain that files are retained locally until successful upload is
confirmed (or checksum/ack returned), and recommend enabling
cleanup_after_send=true only when uploads are guaranteed atomic or data is
otherwise backed up; update both occurrences and ensure the documentation notes
how users can verify successful upload before enabling automatic cleanup.
> - `ingress_server_url`: URL of the ingress server for data upload
> - `ingress_server_auth_token`: Authentication token for the ingress server
> - `ingress_content_service_name`: Service name identifier for the ingress server
🛠️ Refactor suggestion
Don’t document storing raw tokens in config files.
Recommend environment variables or secret stores instead of inline tokens.
Apply:
```diff
-- `ingress_server_auth_token`: Authentication token for the ingress server
+- `ingress_server_auth_token`: Authentication token for the ingress server. Do not store raw tokens in files; use environment variables or a secret manager (e.g., `${INGRESS_TOKEN}`) and mount/inject at runtime.
```
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 52 to 54, the docs currently imply
storing raw authentication tokens and sensitive ingress details in config files;
update the wording to discourage inline tokens and instead instruct users to use
environment variables or a secrets manager (e.g., provide placeholder names like
INGRESS_SERVER_AUTH_TOKEN and point to retrieving them from process.env or the
project's secret store), remove any example raw token values, and add a brief
note about least-privilege and rotation practices for ingress credentials.
```json
{
  "user_id": "user-uuid",
  "timestamp": "2024-01-01T12:00:00Z",
  "conversation_id": "conversation-uuid",
  "user_question": "What is Kubernetes?",
  "llm_response": "Kubernetes is an open-source container orchestration system...",
  "sentiment": 1,
  "user_feedback": "This response was very helpful",
  "categories": ["helpful"]
}
```
🛠️ Refactor suggestion
Clarify PII redaction vs. stored fields (feedback includes raw Q/A).
Security section claims redaction, but feedback example stores user_question and llm_response verbatim. Align either by redacting here too or explicitly documenting scope.
Apply (feedback example):
- "user_question": "What is Kubernetes?",
- "llm_response": "Kubernetes is an open-source container orchestration system...",
+ "redacted_user_question": "What is <technology>?",
+ "redacted_llm_response": "…",And expand Security:
```diff
- **Data Redaction**: Query data is stored as "redacted_query" to ensure sensitive information is not captured
+ **Data Redaction**: All persisted user-visible fields (queries, responses, feedback) must be redacted before storage (`redacted_*` fields). Do not persist raw values unless you have explicit consent and a lawful basis.
```

Also applies to: 190-193
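One way to picture that redaction step, using the `redacted_*` field names proposed above (an illustrative transform, not the service's actual pipeline):

```bash
# Rewrite a raw feedback record so verbatim question/response text is not persisted.
jq '{user_id, timestamp, conversation_id, sentiment, categories,
     redacted_user_question: "<redacted>", redacted_llm_response: "<redacted>"}' \
  raw_feedback.json > feedback_redacted.json
```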
4. **Submit feedback** to generate feedback data:
   ```bash
   curl -X POST "http://localhost:8080/feedback" \
     -H "Content-Type: application/json" \
     -d '{
       "conversation_id": "your-conversation-id",
       "user_question": "What is Kubernetes?",
       "llm_response": "Kubernetes is...",
       "sentiment": 1,
       "user_feedback": "Very helpful response"
     }'
   ```
🛠️ Refactor suggestion
Add instruction to obtain conversation_id before posting feedback.
The feedback example requires conversation_id but prior steps don’t show how to get it.
Insert before the feedback curl:
````diff
+4a. Extract the conversation_id from the latest transcript:
+```bash
+CONVERSATION_ID=$(jq -r '.metadata.conversation_id' "$(find /var/lib/lightspeed/data/transcripts -type f -name '*.json' -print0 | xargs -0 ls -t | head -n1)")
+echo "$CONVERSATION_ID"
+```
````

And update the payload:

```diff
-      "conversation_id": "your-conversation-id",
+      "conversation_id": "'"$CONVERSATION_ID"'",
```
+ "conversation_id": "'"$CONVERSATION_ID"'",🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 146 to 157, the feedback curl
example uses conversation_id but earlier steps never show how to obtain it; add
a short step immediately before the curl that explains and demonstrates
extracting the latest conversation_id from the transcripts directory (using jq
to read .metadata.conversation_id from the most recent JSON file) and echoing it
for verification, then update the curl payload example to reference that
variable (e.g., use the shell variable in the JSON payload) so readers can
copy/paste the full flow end-to-end.
```yaml
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: "your-auth-token"
    ingress_content_service_name: "lightspeed-stack"
```
🛠️ Refactor suggestion
Use env var for auth token in examples.
Inline secrets are risky and get copy-pasted.
Apply:
```diff
 data_collector:
   enabled: true
   ingress_server_url: "https://your-ingress-server.com/upload"
-  ingress_server_auth_token: "your-auth-token"
+  ingress_server_auth_token: ${INGRESS_TOKEN}  # injected via environment/secret
   ingress_content_service_name: "lightspeed-stack"
```
+> Note: In local testing, export the token for the current shell only:
+> `export INGRESS_TOKEN='<redacted>'`📝 Committable suggestion
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 169 to 178, the example inlines an
auth token which is unsafe; change the example to reference an environment
variable (e.g., INGRESS_AUTH_TOKEN) instead of embedding the secret, update the
key to show a placeholder like ingress_server_auth_token:
"${INGRESS_AUTH_TOKEN}" or similar config-variable syntax used by the project,
and add a short note beneath the block showing how to export the variable
locally (e.g., export INGRESS_AUTH_TOKEN="your-auth-token") and recommend using
a secret manager or CI/CD secrets for production.
tisnik left a comment
LGTM
Description
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit