
Conversation

Contributor

@gallettilance gallettilance commented Sep 9, 2025

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • Documentation
    • Added comprehensive guide for Lightspeed Stack user data collection.
    • Covers configuration options (feedback/transcripts settings and data collector), data flow, and storage structure.
    • Includes JSON schema overviews for transcripts and feedback.
    • Provides step-by-step testing instructions for basic and advanced collection.
    • Details integration with Red Hat Dataverse and exporter usage.
    • Outlines security considerations (privacy, tokens, redaction, access control).
    • Adds troubleshooting tips and guidance on enabling debug logging.

Contributor

coderabbitai bot commented Sep 9, 2025

Walkthrough

Adds a new documentation file describing Lightspeed Stack user data collection: configuration options, JSON schemas for feedback/transcripts, local storage structure, sidecar exporter workflow to Red Hat Dataverse, testing steps (basic/advanced), security considerations, and troubleshooting.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation: User Data Collection<br/>`docs/user_data_collection.md` | New doc detailing user interaction/transcript capture, JSON schemas, config block (user_data_collection and data_collector), exporter workflow to Dataverse, test procedures, security notes, and troubleshooting. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant LS as Lightspeed Stack
  participant FS as Local Storage<br/>(JSON files)
  participant EX as Sidecar Exporter
  participant DV as Red Hat Dataverse

  rect rgba(200,220,255,0.25)
  note over U,LS: Transcript flow
  U->>LS: Submit query
  LS-->>U: Return response
  LS->>FS: Write transcript JSON<br/>(by user/conversation)
  end

  rect rgba(220,255,220,0.25)
  note over U,LS: Feedback flow
  U->>LS: Submit feedback
  LS->>FS: Append feedback JSON
  end

  rect rgba(255,240,200,0.25)
  note over EX,DV: Export cycle
  EX-->>FS: Periodic scan (interval)
  EX->>DV: Upload archives (auth token)
  alt cleanup enabled
    EX->>FS: Remove sent data
  else keep data
    EX-->>FS: Retain local copies
  end
  end

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Suggested reviewers

  • umago
  • tisnik

Pre-merge checks (2 passed, 1 inconclusive)

❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Check | ❓ Inconclusive | The current description consists solely of template headings and unchecked checkboxes without any details about the added documentation or its purpose, leaving reviewers without context on what the pull request entails. | Please update the description by summarizing the contents of the new docs/user_data_collection.md file, outlining key configuration options and testing steps, and linking any related issues or tickets to provide clear context for reviewers. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title “LCORE-427: (docs) user data collection” concisely identifies the main change by referencing the documentation addition and the feature name, making it clear for reviewers what the pull request implements. It avoids unnecessary detail while remaining specific about the scope of the change. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |

Poem

A nibble of notes, a hop through the logs,
I stash little JSONs like carrots in bogs.
An exporter friend hums, “Off to Dataverse!”
While I tidy the burrow—concise and terse.
With twitchy ears I cheer the flow:
Queries in, insights go! 🥕🐇


@gallettilance gallettilance changed the title (docs) user data collection LCORE-427: (docs) user data collection Sep 9, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (4)
docs/user_data_collection.md (4)

80-85: Fix MD040: add language to fenced block.

The directory layout block lacks a language hint; markdownlint flags this.

Apply:

-```
+```text
 /transcripts_storage/
   /{user_id}/
     /{conversation_id}/
       /{unique_id}.json
-```
+```

91-96: Use vendor-neutral placeholders in examples.

Avoid locking docs to a specific provider/model in general guidance.

Apply:

-    "provider": "openai",
-    "model": "gpt-4",
-    "query_provider": "openai",
-    "query_model": "gpt-4",
+    "provider": "<provider>",
+    "model": "<model>",
+    "query_provider": "<provider>",
+    "query_model": "<model>",

59-75: Specify schema/constraints for feedback.

Define field requirements and allowed values (e.g., sentiment domain, categories enum) to ensure data quality.

Consider adding a JSON Schema (required fields: user_id, timestamp, conversation_id, sentiment; sentiment: integer in [-1,0,1] or [0,1]; categories: array of enum).
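
A minimal sketch of what such a schema could look like, assuming the field names from the feedback example in this doc and the `check-jsonschema` CLI (e.g., installed via `pip install check-jsonschema`); the exact required set and enum values would need to match what the service actually emits:

```bash
# Hypothetical JSON Schema for a feedback record; field names and the sentiment
# enum are assumptions based on the example payload, not a confirmed contract.
cat > feedback.schema.json <<'EOF'
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["user_id", "timestamp", "conversation_id", "sentiment"],
  "properties": {
    "user_id": {"type": "string"},
    "timestamp": {"type": "string", "format": "date-time"},
    "conversation_id": {"type": "string"},
    "user_question": {"type": "string"},
    "llm_response": {"type": "string"},
    "sentiment": {"type": "integer", "enum": [-1, 0, 1]},
    "user_feedback": {"type": "string"},
    "categories": {"type": "array", "items": {"type": "string"}}
  },
  "additionalProperties": false
}
EOF

# Validate a stored feedback file against the schema (path is a placeholder).
check-jsonschema --schemafile feedback.schema.json sample_feedback.json
```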


8-12: Document filename format and retention.

You mention “unique filenames”; specify the naming convention (e.g., <ts>-<uuid>.json) and retention policy to aid ops and compliance.
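
Purely for illustration, a sketch of one convention the doc could standardize on (UTC-timestamp-prefixed UUID names plus an age-based retention sweep); the service's real filename generation and the 30-day window are assumptions, not documented behavior:

```bash
# Hypothetical filename format: <UTC timestamp>-<uuid>.json
fname="$(date -u +%Y%m%dT%H%M%SZ)-$(uuidgen).json"
echo "$fname"   # e.g. 20240101T120000Z-2f1c9a7e-....json

# Example retention sweep: delete transcripts older than 30 days
# (path and window are placeholders; align with actual compliance requirements).
find /tmp/data/transcripts -type f -name '*.json' -mtime +30 -delete
```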

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 528e0b9 and 7e4f62a.

📒 Files selected for processing (1)
  • docs/user_data_collection.md (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.17.2)
docs/user_data_collection.md

80-80: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-pr
  • GitHub Check: e2e_tests
🔇 Additional comments (2)
docs/user_data_collection.md (2)

137-144: Confirm endpoints, port, and response shape.

Verify that /query, /feedback, and port 8080 match the current API, and note the expected response fields (e.g., conversation_id) to reduce guesswork.

Would you confirm the base URL, endpoints, and that /query returns conversation_id (or point to the transcript path to obtain it)?

Also applies to: 148-157
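
If it helps with that check, here is one way to probe it locally; the port, the `query` request field, and the `conversation_id` response field are exactly the assumptions this comment asks to have confirmed:

```bash
# Send a test query and try to extract conversation_id from the response.
# Endpoint, payload shape, and response field are unverified assumptions.
resp=$(curl -s -X POST "http://localhost:8080/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Kubernetes?"}')
echo "$resp" | jq -r '.conversation_id // "no conversation_id field in response"'
```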


206-213: Verify logging config keys and provide env-based toggle.

Ensure service.log_level matches the actual config. Add an env override example.

Apply:

+# Alternatively, set via environment:
+# export LS_LOG_LEVEL=debug

Comment on lines +18 to +21
### Red Hat Dataverse Integration
- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.

Contributor


💡 Verification agent

🧩 Analysis chain

Validate “Red Hat Dataverse” naming/branding.

Confirm the correct upstream/product name and link to avoid brand inaccuracies.

Also applies to: 215-224


🌐 Web query:

Is the destination service officially called “Red Hat Dataverse”? If not, what is the correct product/name to reference for Lightspeed data export?

💡 Result:

Short answer: No — it's not called "Red Hat Dataverse." The destination is Microsoft Dataverse (formerly Common Data Service) when referring to business/app data exports; "Dataverse" (Harvard/IQSS) is a separate open‑source research data platform and Red Hat is not the product owner. [1][2][3]

Sources:

  • Microsoft Dataverse (Data Export / Synapse Link). [1]
  • Dataverse (Harvard/IQSS) — separate open research platform. [2]
  • Red Hat mentions/use of "Dataverse" in research context (not an official Red Hat product named "Dataverse"). [3]

Update “Red Hat Dataverse” to “Microsoft Dataverse” and add official link
In docs/user_data_collection.md (18–21, 215–224), rename and relink as follows:

- ### Red Hat Dataverse Integration
+ ### Microsoft Dataverse Integration

- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
+ The exporter service uploads data to [Microsoft Dataverse](https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro) for analysis and system improvement.

- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
+ Data flows through the same pipeline as other services for consistent processing and analysis.

Also update any remaining “Red Hat Dataverse” mentions in lines 215–224.
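
A quick way to catch any remaining occurrences before committing the rename (file path as used in this PR):

```bash
# List every line still referring to the old name.
grep -nE "Red Hat('s)? Dataverse" docs/user_data_collection.md
```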

📝 Committable suggestion


Suggested change
### Red Hat Dataverse Integration
- The exporter service uploads data to Red Hat's Dataverse for analysis and system improvement.
- Data flows through the same pipeline as other Red Hat services for consistent processing and analysis.
### Microsoft Dataverse Integration
- The exporter service uploads data to [Microsoft Dataverse](https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro) for analysis and system improvement.
- Data flows through the same pipeline as other services for consistent processing and analysis.
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 18–21 and 215–224, occurrences of
"Red Hat Dataverse" should be renamed to "Microsoft Dataverse" and any
references relinked to the official Microsoft Dataverse page
(https://learn.microsoft.com/power-apps/maker/data-platform/data-platform-intro).
Update the heading and bullet text to use "Microsoft Dataverse" and replace or
add the official link where appropriate, and scan the two line ranges to ensure
all remaining mentions are renamed and the link is added or updated.

Comment on lines +28 to +31
feedback_enabled: true
feedback_storage: "/tmp/data/feedback"
transcripts_enabled: true
transcripts_storage: "/tmp/data/transcripts"
Contributor


🛠️ Refactor suggestion

Avoid /tmp for persisted data.

/tmp is ephemeral and often world-readable. Recommend a dedicated directory with restricted permissions.

Apply (first snippet; mirror in others):

-  feedback_storage: "/tmp/data/feedback"
-  transcripts_storage: "/tmp/data/transcripts"
+  feedback_storage: "/var/lib/lightspeed/data/feedback"     # ensure dir exists with 700 perms
+  transcripts_storage: "/var/lib/lightspeed/data/transcripts" # ensure dir exists with 700 perms

And add a note:

+> mkdir -p /var/lib/lightspeed/data/{feedback,transcripts} && chmod -R 700 /var/lib/lightspeed

Also applies to: 124-128, 171-174

🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 28-31 (and similarly at 124-128 and
171-174), the examples use /tmp for persisted feedback/transcripts which is
ephemeral and often world-readable; replace the paths with a dedicated
application data directory (e.g., /var/lib/your-app/data or
${APP_DATA_DIR}/feedback and .../transcripts) and update the other snippets to
match, and add a short note below explaining to create the directory with
restricted permissions (chmod 700 or chown to the app user) and to avoid using
/tmp for long-lived sensitive data.

Comment on lines +38 to +39
cleanup_after_send: true
connection_timeout_seconds: 30
Contributor


🛠️ Refactor suggestion

Safer default for cleanup_after_send.

Deleting local files immediately after upload can cause data loss on partial/failed uploads. Default to false and document verification semantics.

Apply:

-    cleanup_after_send: true
+    cleanup_after_send: false  # set to true only after verifying idempotent uploads and success checks

Also applies to: 180-181
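
To make the verification semantics concrete, a purely illustrative shell pattern (not the exporter's actual logic) that keeps the local archive unless the ingress confirms success; the URL, multipart upload form, token variable, and archive path are all placeholders:

```bash
# Illustrative only: delete the local archive only on a confirmed 2xx response.
archive="/tmp/data/feedback/archive.tar.gz"
status=$(curl -s -o /dev/null -w '%{http_code}' \
  -H "Authorization: Bearer ${INGRESS_TOKEN}" \
  -F "file=@${archive}" \
  "https://your-ingress-server.com/upload")
if [ "$status" -ge 200 ] && [ "$status" -lt 300 ]; then
  rm -- "$archive"   # clean up only after confirmed success
else
  echo "upload failed (HTTP $status); keeping $archive" >&2
fi
```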

🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 38-39 (also apply same change to
lines 180-181), the example default for cleanup_after_send is unsafe; change the
example value from true to false, and add one sentence clarifying verification
semantics: explain that files are retained locally until successful upload is
confirmed (or checksum/ack returned), and recommend enabling
cleanup_after_send=true only when uploads are guaranteed atomic or data is
otherwise backed up; update both occurrences and ensure the documentation notes
how users can verify successful upload before enabling automatic cleanup.

Comment on lines +52 to +54
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server
- `ingress_content_service_name`: Service name identifier for the ingress server
Contributor


🛠️ Refactor suggestion

Don’t document storing raw tokens in config files.

Recommend environment variables or secret stores instead of inline tokens.

Apply:

-- `ingress_server_auth_token`: Authentication token for the ingress server
+- `ingress_server_auth_token`: Authentication token for the ingress server. Do not store raw tokens in files; use environment variables or a secret manager (e.g., `${INGRESS_TOKEN}`) and mount/inject at runtime.
📝 Committable suggestion


Suggested change
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server
- `ingress_content_service_name`: Service name identifier for the ingress server
- `ingress_server_url`: URL of the ingress server for data upload
- `ingress_server_auth_token`: Authentication token for the ingress server. Do not store raw tokens in files; use environment variables or a secret manager (e.g., `${INGRESS_TOKEN}`) and mount/inject at runtime.
- `ingress_content_service_name`: Service name identifier for the ingress server
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 52 to 54, the docs currently imply
storing raw authentication tokens and sensitive ingress details in config files;
update the wording to discourage inline tokens and instead instruct users to use
environment variables or a secrets manager (e.g., provide placeholder names like
INGRESS_SERVER_AUTH_TOKEN and point to retrieving them from process.env or the
project's secret store), remove any example raw token values, and add a brief
note about least-privilege and rotation practices for ingress credentials.

Comment on lines +64 to +75
```json
{
  "user_id": "user-uuid",
  "timestamp": "2024-01-01T12:00:00Z",
  "conversation_id": "conversation-uuid",
  "user_question": "What is Kubernetes?",
  "llm_response": "Kubernetes is an open-source container orchestration system...",
  "sentiment": 1,
  "user_feedback": "This response was very helpful",
  "categories": ["helpful"]
}
```
Contributor


🛠️ Refactor suggestion

Clarify PII redaction vs. stored fields (feedback includes raw Q/A).

Security section claims redaction, but feedback example stores user_question and llm_response verbatim. Align either by redacting here too or explicitly documenting scope.

Apply (feedback example):

-  "user_question": "What is Kubernetes?",
-  "llm_response": "Kubernetes is an open-source container orchestration system...",
+  "redacted_user_question": "What is <technology>?",
+  "redacted_llm_response": "…",

And expand Security:

- **Data Redaction**: Query data is stored as "redacted_query" to ensure sensitive information is not captured
+ **Data Redaction**: All persisted user-visible fields (queries, responses, feedback) must be redacted before storage (`redacted_*` fields). Do not persist raw values unless you have explicit consent and a lawful basis.

Also applies to: 190-193

Comment on lines +146 to +157
4. **Submit feedback** to generate feedback data:
```bash
curl -X POST "http://localhost:8080/feedback" \
  -H "Content-Type: application/json" \
  -d '{
    "conversation_id": "your-conversation-id",
    "user_question": "What is Kubernetes?",
    "llm_response": "Kubernetes is...",
    "sentiment": 1,
    "user_feedback": "Very helpful response"
  }'
```
Contributor


🛠️ Refactor suggestion

Add instruction to obtain conversation_id before posting feedback.

The feedback example requires conversation_id but prior steps don’t show how to get it.

Insert before the feedback curl:

+4a. Extract the conversation_id from the latest transcript:
+```bash
+CONVERSATION_ID=$(jq -r '.metadata.conversation_id' "$(find /var/lib/lightspeed/data/transcripts -type f -name '*.json' -print0 | xargs -0 ls -t | head -n1)")
+echo "$CONVERSATION_ID"
+```

And update the payload:

-       "conversation_id": "your-conversation-id",
+       "conversation_id": "'"$CONVERSATION_ID"'",
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 146 to 157, the feedback curl
example uses conversation_id but earlier steps never show how to obtain it; add
a short step immediately before the curl that explains and demonstrates
extracting the latest conversation_id from the transcripts directory (using jq
to read .metadata.conversation_id from the most recent JSON file) and echoing it
for verification, then update the curl payload example to reference that
variable (e.g., use the shell variable in the JSON payload) so readers can
copy/paste the full flow end-to-end.

Comment on lines +169 to +178
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: "your-auth-token"
    ingress_content_service_name: "lightspeed-stack"
Contributor


🛠️ Refactor suggestion

Use env var for auth token in examples.

Inline secrets are risky and get copy-pasted.

Apply:

      data_collector:
        enabled: true
        ingress_server_url: "https://your-ingress-server.com/upload"
-       ingress_server_auth_token: "your-auth-token"
+       ingress_server_auth_token: ${INGRESS_TOKEN}  # injected via environment/secret
        ingress_content_service_name: "lightspeed-stack"

Add a brief note below with a safe export example:

+> Note: In local testing, export the token for the current shell only:
+> `export INGRESS_TOKEN='<redacted>'`
📝 Committable suggestion


Suggested change
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: "your-auth-token"
    ingress_content_service_name: "lightspeed-stack"
user_data_collection:
  feedback_enabled: true
  feedback_storage: "/tmp/data/feedback"
  transcripts_enabled: true
  transcripts_storage: "/tmp/data/transcripts"
  data_collector:
    enabled: true
    ingress_server_url: "https://your-ingress-server.com/upload"
    ingress_server_auth_token: ${INGRESS_TOKEN} # injected via environment/secret
    ingress_content_service_name: "lightspeed-stack"

> Note: In local testing, export the token for the current shell only:
> `export INGRESS_TOKEN='<redacted>'`
🤖 Prompt for AI Agents
In docs/user_data_collection.md around lines 169 to 178, the example inlines an
auth token which is unsafe; change the example to reference an environment
variable (e.g., INGRESS_AUTH_TOKEN) instead of embedding the secret, update the
key to show a placeholder like ingress_server_auth_token:
"${INGRESS_AUTH_TOKEN}" or similar config-variable syntax used by the project,
and add a short note beneath the block showing how to export the variable
locally (e.g., export INGRESS_AUTH_TOKEN="your-auth-token") and recommend using
a secret manager or CI/CD secrets for production.

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit 2ba31bd into lightspeed-core:main Sep 9, 2025
19 checks passed