
Conversation

Contributor

@gallettilance gallettilance commented Sep 9, 2025

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guide for Lightspeed Stack user data collection.
    • Covers configuration options (feedback/transcripts settings and data collector), data flow, and storage structure.
    • Includes JSON schema overviews for transcripts and feedback.
    • Provides step-by-step testing instructions for basic and advanced collection.
    • Details integration with Red Hat Dataverse and exporter usage.
    • Outlines security considerations (privacy, tokens, redaction, access control).
    • Adds troubleshooting tips and guidance on enabling debug logging.

Contributor

coderabbitai bot commented Sep 9, 2025

Walkthrough

Adds a new documentation file describing Lightspeed Stack user data collection: configuration options, JSON schemas for feedback/transcripts, local storage structure, sidecar exporter workflow to Red Hat Dataverse, testing steps (basic/advanced), security considerations, and troubleshooting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: User Data Collection<br/>`docs/user_data_collection.md` | New doc detailing user interaction/transcript capture, JSON schemas, config block (user_data_collection and data_collector), exporter workflow to Dataverse, test procedures, security notes, and troubleshooting. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant LS as Lightspeed Stack
  participant FS as Local Storage<br/>(JSON files)
  participant EX as Sidecar Exporter
  participant DV as Red Hat Dataverse

  rect rgba(200,220,255,0.25)
  note over U,LS: Transcript flow
  U->>LS: Submit query
  LS-->>U: Return response
  LS->>FS: Write transcript JSON<br/>(by user/conversation)
  end

  rect rgba(220,255,220,0.25)
  note over U,LS: Feedback flow
  U->>LS: Submit feedback
  LS->>FS: Append feedback JSON
  end

  rect rgba(255,240,200,0.25)
  note over EX,DV: Export cycle
  EX-->>FS: Periodic scan (interval)
  EX->>DV: Upload archives (auth token)
  alt cleanup enabled
    EX->>FS: Remove sent data
  else keep data
    EX-->>FS: Retain local copies
  end
  end

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Suggested reviewers

  • umago
  • tisnik

Pre-merge checks (2 passed, 1 inconclusive)

❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ❓ Inconclusive | The current description consists solely of template headings and unchecked checkboxes without any details about the added documentation or its purpose, leaving reviewers without context on what the pull request entails. | Please update the description by summarizing the contents of the new docs/user_data_collection.md file, outlining key configuration options and testing steps, and linking any related issues or tickets to provide clear context for reviewers. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title “LCORE-427: (docs) user data collection” concisely identifies the main change by referencing the documentation addition and the feature name, making it clear for reviewers what the pull request implements. It avoids unnecessary detail while remaining specific about the scope of the change. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

Poem

A nibble of notes, a hop through the logs,
I stash little JSONs like carrots in bogs.
An exporter friend hums, “Off to Dataverse!”
While I tidy the burrow—concise and terse.
With twitchy ears I cheer the flow:
Queries in, insights go! 🥕🐇


@gallettilance gallettilance changed the title (docs) user data collection LCORE-427: (docs) user data collection Sep 9, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (4)
docs/user_data_collection.md (4)

80-85: Fix MD040: add language to fenced block.

The directory layout block lacks a language hint; markdownlint flags this.

Apply:

-```
+```text
 /transcripts_storage/
   /{user_id}/
     /{conversation_id}/
       /{unique_id}.json
-```
+```

91-96: Use vendor-neutral placeholders in examples.

Avoid locking docs to a specific provider/model in general guidance.

Apply:

-    "provider": "openai",
-    "model": "gpt-4",
-    "query_provider": "openai",
-    "query_model": "gpt-4",
+    "provider": "<provider>",
+    "model": "<model>",
+    "query_provider": "<provider>",
+    "query_model": "<model>",

59-75: Specify schema/constraints for feedback.

Define field requirements and allowed values (e.g., sentiment domain, categories enum) to ensure data quality.

Consider adding a JSON Schema (required fields: user_id, timestamp, conversation_id, sentiment; sentiment: integer in [-1,0,1] or [0,1]; categories: array of enum).
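
A minimal sketch of what such a schema could look like, assuming the field names from the feedback example in this doc and the `check-jsonschema` CLI (e.g., installed via `pip install check-jsonschema`); the exact required set and enum values would need to match what the service actually emits:

```bash
# Hypothetical JSON Schema for a feedback record; field names and the sentiment
# enum are assumptions based on the example payload, not a confirmed contract.
cat > feedback.schema.json <<'EOF'
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["user_id", "timestamp", "conversation_id", "sentiment"],
  "properties": {
    "user_id": {"type": "string"},
    "timestamp": {"type": "string", "format": "date-time"},
    "conversation_id": {"type": "string"},
    "user_question": {"type": "string"},
    "llm_response": {"type": "string"},
    "sentiment": {"type": "integer", "enum": [-1, 0, 1]},
    "user_feedback": {"type": "string"},
    "categories": {"type": "array", "items": {"type": "string"}}
  },
  "additionalProperties": false
}
EOF

# Validate a stored feedback file against the schema (path is a placeholder).
check-jsonschema --schemafile feedback.schema.json sample_feedback.json
```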


8-12: Document filename format and retention.

You mention “unique filenames”; specify the naming convention (e.g., <ts>-<uuid>.json) and retention policy to aid ops and compliance.
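
Purely for illustration, a sketch of one convention the doc could standardize on (UTC-timestamp-prefixed UUID names plus an age-based retention sweep); the service's real filename generation and the 30-day window are assumptions, not documented behavior:

```bash
# Hypothetical filename format: <UTC timestamp>-<uuid>.json
fname="$(date -u +%Y%m%dT%H%M%SZ)-$(uuidgen).json"
echo "$fname"   # e.g. 20240101T120000Z-2f1c9a7e-....json

# Example retention sweep: delete transcripts older than 30 days
# (path and window are placeholders; align with actual compliance requirements).
find /tmp/data/transcripts -type f -name '*.json' -mtime +30 -delete
```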

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 528e0b9 and 7e4f62a.

📒 Files selected for processing (1)
  • docs/user_data_collection.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/user_data_collection.md

80-80: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests
🔇 Additional comments (2)
docs/user_data_collection.md (2)

137-144: Confirm endpoints, port, and response shape.

Verify that /query, /feedback, and port 8080 match the current API, and note the expected response fields (e.g., conversation_id) to reduce guesswork.

Would you confirm the base URL, endpoints, and that /query returns conversation_id (or point to the transcript path to obtain it)?

Also applies to: 148-157
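
If it helps with that check, here is one way to probe it locally; the port, the `query` request field, and the `conversation_id` response field are exactly the assumptions this comment asks to have confirmed:

```bash
# Send a test query and try to extract conversation_id from the response.
# Endpoint, payload shape, and response field are unverified assumptions.
resp=$(curl -s -X POST "http://localhost:8080/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Kubernetes?"}')
echo "$resp" | jq -r '.conversation_id // "no conversation_id field in response"'
```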


206-213: Verify logging config keys and provide env-based toggle.

Ensure service.log_level matches the actual config. Add an env override example.

Apply:

+# Alternatively, set via environment:
+# export LS_LOG_LEVEL=debug

Comment on lines +18 to +21
### Red Hat Dataverse Integration
- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.

Contributor


💡 Verification agent

🧩 Analysis chain

Validate “Red Hat Dataverse” naming/branding.

Confirm the correct upstream/product name and link to avoid brand inaccuracies.

Also applies to: 215-224


🌐 Web query:

Is the destination service officially called “Red Hat Dataverse”? If not, what is the correct product/name to reference for Lightspeed data export?

💡 Result:

Short answer: No — it's not called "Red Hat Dataverse." The destination is Microsoft Dataverse (formerly Common Data Service) when referring to business/app data exports; "Dataverse" (Harvard/IQSS) is a separate open‑source research data platform and Red Hat is not the product owner. [1][2][3]

Sources:

  • Microsoft Dataverse (Data Export / Synapse Link). [1]
  • Dataverse (Harvard/IQSS) — separate open research platform. [2]
  • Red Hat mentions/use of "Dataverse" in research context (not an official Red Hat product named "Dataverse"). [3]

Update “Red Hat Dataverse” to “Microsoft Dataverse” and add official link
In docs/user_data_collection.md (18–21, 215–224), rename and relink as follows:

- ### Red Hat Dataverse Integration
+ ### Microsoft Dataverse Integration

- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
+ The exporter service uploads data to [Microsoft Dataverse](https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro) for analysis and system improvement.

- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
+ Data flows through the same pipeline as other services for consistent processing and analysis.

Also update any remaining “Red Hat Dataverse” mentions in lines 215–224.
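
A quick way to catch any remaining occurrences before committing the rename (file path as used in this PR):

```bash
# List every line still referring to the old name.
grep -nE "Red Hat('s)? Dataverse" docs/user_data_collection.md
```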

📝 Committable suggestion


Suggested change
### Red Hat Dataverse Integration
- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
### Microsoft Dataverse Integration
- The exporter service uploads data to [Microsoft Dataverse](https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro) for analysis and system improvement.
- Data flows through the same pipeline as other services for consistent processing and analysis.
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 18–21 and 215–224, occurrences of
"Red Hat Dataverse" should be renamed to "Microsoft Dataverse" and any
references relinked to the official Microsoft Dataverse page
(https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro).
Update the heading and bullet text to use "Microsoft Dataverse" and replace or
add the official link where appropriate, and scan the two line ranges to ensure
all remaining mentions are renamed and the link is added or updated.

Comment on lines +28 to +31
feedback_enabled: true
feedback_storage: "/tmp/data/feedback"
transcripts_enabled: true
transcripts_storage: "/tmp/data/transcripts"
Contributor


🛠️ Refactor suggestion

Avoid /tmp for persisted data.

/tmp is ephemeral and often world-readable. Recommend a dedicated directory with restricted permissions.

Apply (first snippet; mirror in others):

-  feedback_storage: "/tmp/data/feedback"
-  transcripts_storage: "/tmp/data/transcripts"
+  feedback_storage: "/var/lib/lightspeed/data/feedback"     # ensure dir exists with 700 perms
+  transcripts_storage: "/var/lib/lightspeed/data/transcripts" # ensure dir exists with 700 perms

And add a note:

+> mkdir -p /var/lib/lightspeed/data/{feedback,transcripts} && chmod -R 700 /var/lib/lightspeed

Also applies to: 124-128, 171-174

🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 28-31 (and similarly at 124-128 and
171-174), the examples use /tmp for persisted feedback/transcripts which is
ephemeral and often world-readable; replace the paths with a dedicated
application data directory (e.g., /var/lib/your-app/data or
${APP_DATA_DIR}/feedback and .../transcripts) and update the other snippets to
match, and add a short note below explaining to create the directory with
restricted permissions (chmod 700 or chown to the app user) and to avoid using
/tmp for long-lived sensitive data.

Comment on lines +38 to +39
cleanup_after_send: true
connection_timeout_seconds: 30
Contributor


🛠️ Refactor suggestion

Safer default for cleanup_after_send.

Deleting local files immediately after upload can cause data loss on partial/failed uploads. Default to false and document verification semantics.

Apply:

-    cleanup_after_send: true
+    cleanup_after_send: false  # set to true only after verifying idempotent uploads and success checks

Also applies to: 180-181
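
To make the verification semantics concrete, a purely illustrative shell pattern (not the exporter's actual logic) that keeps the local archive unless the ingress confirms success; the URL, multipart upload form, token variable, and archive path are all placeholders:

```bash
# Illustrative only: delete the local archive only on a confirmed 2xx response.
archive="/tmp/data/feedback/archive.tar.gz"
status=$(curl -s -o /dev/null -w '%{http_code}' \
  -H "Authorization: Bearer ${INGRESS_TOKEN}" \
  -F "file=@${archive}" \
  "https://your-ingress-server.com/upload")
if [ "$status" -ge 200 ] && [ "$status" -lt 300 ]; then
  rm -- "$archive"   # clean up only after confirmed success
else
  echo "upload failed (HTTP $status); keeping $archive" >&2
fi
```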

🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 38-39 (also apply same change to
lines 180-181), the example default for cleanup_after_send is unsafe; change the
example value from true to false, and add one sentence clarifying verification
semantics: explain that files are retained locally until successful upload is
confirmed (or checksum/ack returned), and recommend enabling
cleanup_after_send=true only when uploads are guaranteed atomic or data is
otherwise backed up; update both occurrences and ensure the documentation notes
how users can verify successful upload before enabling automatic cleanup.

Comment on lines +52 to +54
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server
- `ingress_content_service_name`: Service name identifier for the ingress server
Contributor


🛠️ Refactor suggestion

Don’t document storing raw tokens in config files.

Recommend environment variables or secret stores instead of inline tokens.

Apply:

-- `ingress_server_auth_token`: Authentication token for the ingress server
+- `ingress_server_auth_token`: Authentication token for the ingress server. Do not store raw tokens in files; use environment variables or a secret manager (e.g., `${INGRESS_TOKEN}`) and mount/inject at runtime.
📝 Committable suggestion


Suggested change
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server
- `ingress_content_service_name`: Service name identifier for the ingress server
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server. Do not store raw tokens in files; use environment variables or a secret manager (e.g., `${INGRESS_TOKEN}`) and mount/inject at runtime.
- `ingress_content_service_name`: Service name identifier for the ingress server
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 52 to 54, the docs currently imply
storing raw authentication tokens and sensitive ingress details in config files;
update the wording to discourage inline tokens and instead instruct users to use
environment variables or a secrets manager (e.g., provide placeholder names like
INGRESS_SERVER_AUTH_TOKEN and point to retrieving them from process.env or the
project's secret store), remove any example raw token values, and add a brief
note about least-privilege and rotation practices for ingress credentials.

Comment on lines +64 to +75
```json
{
  "user_id": "user-uuid",
  "timestamp": "2024-01-01T12:00:00Z",
  "conversation_id": "conversation-uuid",
  "user_question": "What is Kubernetes?",
  "llm_response": "Kubernetes is an open-source container orchestration system...",
  "sentiment": 1,
  "user_feedback": "This response was very helpful",
  "categories": ["helpful"]
}
```
Contributor


🛠️ Refactor suggestion

Clarify PII redaction vs. stored fields (feedback includes raw Q/A).

Security section claims redaction, but feedback example stores user_question and llm_response verbatim. Align either by redacting here too or explicitly documenting scope.

Apply (feedback example):

-  "user_question": "What is Kubernetes?",
-  "llm_response": "Kubernetes is an open-source container orchestration system...",
+  "redacted_user_question": "What is <technology>?",
+  "redacted_llm_response": "…",

And expand Security:

- **Data Redaction**: Query data is stored as "redacted_query" to ensure sensitive information is not captured
+ **Data Redaction**: All persisted user-visible fields (queries, responses, feedback) must be redacted before storage (`redacted_*` fields). Do not persist raw values unless you have explicit consent and a lawful basis.

Also applies to: 190-193

Comment on lines +146 to +157
4. **Submit feedback** to generate feedback data:
```bash
curl -X POST "http://localhost:8080/feedback" \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_id": "your-conversation-id",
    "user_question": "What is Kubernetes?",
    "llm_response": "Kubernetes is...",
    "sentiment": 1,
    "user_feedback": "Very helpful response"
  }'
```
Contributor


🛠️ Refactor suggestion

Add instruction to obtain conversation_id before posting feedback.

The feedback example requires conversation_id but prior steps don’t show how to get it.

Insert before the feedback curl:

+4a. Extract the conversation_id from the latest transcript:
+```bash
+CONVERSATION_ID=$(jq -r '.metadata.conversation_id' "$(find /var/lib/lightspeed/data/transcripts -type f -name '*.json' -print0 | xargs -0 ls -t | head -n1)")
+echo "$CONVERSATION_ID"
+```

And update the payload:

-       "conversation_id": "your-conversation-id",
+       "conversation_id": "'"$CONVERSATION_ID"'",
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 146 to 157, the feedback curl
example uses conversation_id but earlier steps never show how to obtain it; add
a short step immediately before the curl that explains and demonstrates
extracting the latest conversation_id from the transcripts directory (using jq
to read .metadata.conversation_id from the most recent JSON file) and echoing it
for verification, then update the curl payload example to reference that
variable (e.g., use the shell variable in the JSON payload) so readers can
copy/paste the full flow end-to-end.

Comment on lines +169 to +178
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: "your-auth-token"
    ingress_content_service_name: "lightspeed-stack"
Contributor


🛠️ Refactor suggestion

Use env var for auth token in examples.

Inline secrets are risky and get copy-pasted.

Apply:

      data_collector:
        enabled: true
        ingress_server_url: "https://your-ingress-server.com/upload"
-       ingress_server_auth_token: "your-auth-token"
+       ingress_server_auth_token: ${INGRESS_TOKEN}  # injected via environment/secret
        ingress_content_service_name: "lightspeed-stack"

Add a brief note below with a safe export example:

+> Note: In local testing, export the token for the current shell only:
+> `export INGRESS_TOKEN='<redacted>'`
📝 Committable suggestion


Suggested change
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: "your-auth-token"
    ingress_content_service_name: "lightspeed-stack"
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: ${INGRESS_TOKEN} # injected via environment/secret
    ingress_content_service_name: "lightspeed-stack"

> Note: In local testing, export the token for the current shell only:
> `export INGRESS_TOKEN='<redacted>'`
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 169 to 178, the example inlines an
auth token which is unsafe; change the example to reference an environment
variable (e.g., INGRESS_AUTH_TOKEN) instead of embedding the secret, update the
key to show a placeholder like ingress_server_auth_token:
"${INGRESS_AUTH_TOKEN}" or similar config-variable syntax used by the project,
and add a short note beneath the block showing how to export the variable
locally (e.g., export INGRESS_AUTH_TOKEN="your-auth-token") and recommend using
a secret manager or CI/CD secrets for production.

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit 2ba31bd into lightspeed-core:main Sep 9, 2025
19 checks passed