-
Notifications
You must be signed in to change notification settings - Fork 56
[wip] attempt to anonymize transcripts #433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughImplements deterministic, DB-backed anonymization for users (HMAC + pepper + UserMapping) and applies it across conversation persistence, ownership validation, feedback, and transcript storage; renames UserConversation.user_id → anonymous_user_id; updates logging, paths, and many tests; removes a pre-flight ownership guard in streaming query. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as User
participant EP as API Endpoint
participant UA as user_anonymization
participant DB as DB (UserMapping, UserConversation)
U->>EP: Request (auth_user_id, optional conversation_id)
EP->>UA: get_anonymous_user_id(auth_user_id)
UA->>DB: lookup by user_id_hash
alt mapping exists
DB-->>UA: anonymous_id
else create mapping
UA->>DB: insert anonymous_id + user_id_hash
DB-->>UA: anonymous_id
end
UA-->>EP: anonymous_id
rect rgb(240,250,240)
note over EP,DB: All subsequent DB queries and storage use anonymous_user_id
EP->>DB: query/update UserConversation by (id, anonymous_user_id) or write feedback/transcript keyed by anonymous_id
DB-->>EP: result / ack
end
EP-->>U: Response (uses anonymized identifiers in logs/metadata)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Tip 🔌 Remote MCP (Model Context Protocol) integration is now available!Pro plan users can now connect to remote MCP servers from the Integrations page. Connect with popular remote MCPs such as Notion and Linear to add more context to your reviews and chats. ✨ Finishing Touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. CodeRabbit Commands (Invoked using PR/Issue comments)Type Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
src/models/database/conversations.py (1)
14-21: Add a database migration for renaming user_id → anonymous_user_idI don’t see any Alembic revision in your migrations directory that renames the
user_idcolumn on theuser_conversationtable toanonymous_user_id(and back‐fills existing data). You’ll need to add one before merging:• Under your Alembic versions folder (e.g.
alembic/versions/), create a new revision with an appropriate message (e.g. “rename user_id to anonymous_user_id in user_conversation”).
• Inupgrade():
– useop.alter_column()or raw SQL to renameuser_id→anonymous_user_id
– backfill existing rows if needed (e.g. copy old values)
– add or update the index onanonymous_user_id
– enforce the correctnullablesetting to match your model
• Indowngrade(): reverse all of the above (rename the column back, drop/recreate the index, etc.)
• Runalembic upgrade headlocally to verify the migration applies cleanly and your tests still pass.src/app/endpoints/conversations.py (3)
193-195: Avoid logging original user_id here tooThis log includes
user_id. Prefer anonymous-only logging.- logger.info( - "Found %d conversations for user %s", len(conversations), user_id - ) + logger.info( + "Found %d conversations for anonymous user %s", + len(conversations), + anonymous_user_id, + )
249-252: Authorization bypass: result ofvalidate_conversation_ownership()is ignoredAnyone with a valid token could fetch arbitrary conversations if they know/guess the ID. Check the result and return 403 when not owned (as done in query endpoint).
Apply this diff:
- validate_conversation_ownership( - user_id=user_id, - conversation_id=conversation_id, - ) + owned = validate_conversation_ownership( + user_id=user_id, + conversation_id=conversation_id, + ) + if owned is None: # if util not changed to raise + logger.warning( + "User %s attempted to access conversation %s they don't own", + (user_id[:8] + "...") if len(user_id) > 8 else user_id, + conversation_id, + ) + raise HTTPException( + status_code=status.HTTP_403_FORBIDDEN, + detail={ + "response": "Access denied", + "cause": "You do not have permission to access this conversation", + }, + )
352-355: Same authorization bypass in delete handlerDeletion path also ignores the ownership check result. Enforce it.
- validate_conversation_ownership( - user_id=user_id, - conversation_id=conversation_id, - ) + owned = validate_conversation_ownership( + user_id=user_id, + conversation_id=conversation_id, + ) + if owned is None: # if util not changed to raise + logger.warning( + "User %s attempted to delete conversation %s they don't own", + (user_id[:8] + "...") if len(user_id) > 8 else user_id, + conversation_id, + ) + raise HTTPException( + status_code=status.HTTP_403_FORBIDDEN, + detail={ + "response": "Access denied", + "cause": "You do not have permission to delete this conversation", + }, + )
🧹 Nitpick comments (22)
src/utils/endpoints.py (1)
25-28: Avoid creating mappings in a “validate” path
get_anonymous_user_id()creates the mapping if absent. Validation shouldn’t mutate state. Prefer afind_anonymous_user_id()that returnsNonewithout writes and short-circuit to “not owned” when absent.src/app/endpoints/conversations.py (2)
169-174: Query is correct; consider ordering for UXFiltering by
anonymous_user_idis correct. Optionally order bylast_message_at DESC(and/orcreated_at DESC) to return most recent first.- user_conversations = ( - session.query(UserConversation) - .filter_by(anonymous_user_id=anonymous_user_id) - .all() - ) + user_conversations = ( + session.query(UserConversation) + .filter_by(anonymous_user_id=anonymous_user_id) + .order_by(UserConversation.last_message_at.desc()) + .all() + )
199-209: Error detail leaks original user_id back to the clientEven though it’s the same user, echoing raw identifiers into HTTP errors is unnecessary. Use
anonymous_user_idor omit the identifier.- raise HTTPException( + raise HTTPException( status_code=status.HTTP_500_INTERNAL_SERVER_ERROR, detail={ "response": "Unknown error", - "cause": f"Unknown error while getting conversations for user {user_id}", + "cause": "Unknown error while getting conversations for the current user", }, ) from esrc/app/endpoints/query.py (1)
224-239: Transcript call OK; consider passing anonymous ID to avoid double mapping
store_transcript()recomputes the mapping. You could compute once and passanonymous_user_idthrough to reduce DB hits (optional).src/utils/transcripts.py (1)
22-31: Path construction looks safe; minor naming clarityUsing
normpathwithlstrip("/")blocks traversal. Consider naminguid→anon_uidfor clarity (optional).tests/unit/app/endpoints/test_query_anonymization.py (3)
56-96: Consider asserting last_message_at update for existing conversations.You already validate model/provider and message_count increment. Small enhancement: also assert that last_message_at was updated (e.g., verify it was set to a datetime or advanced vs. a captured “before” value). This guards against regressions where the timestamp isn’t bumped.
Example tweak (keeping it lightweight with a type check since existing_conv is a MagicMock):
- assert existing_conv.message_count == 6 # Incremented + assert existing_conv.message_count == 6 # Incremented + # Optional: ensure last_message_at was touched + assert getattr(existing_conv, "last_message_at", None) is not None
97-134: Log assertion is a bit brittle on truncation detail.The check for "logging_..." ties the test to a specific truncation length (8). If the truncation policy changes, this test will fail for non-functional reasons. Prefer a pattern that only asserts truncation occurred (presence of "...") without hard-coding the exact prefix length.
- assert "logging_..." in log_msg # Truncated original user ID + # Truncated original user ID should include an ellipsis + assert "..." in log_msg
151-198: Test semantics conflict with DB model: conversation.id is globally unique.UserConversation.id is a primary key (globally unique). Creating the “same” conversation_id for a different user would violate the PK and raise an IntegrityError during commit. The current test only inspects the filter_by arguments (fine), but the docstring suggests separate conversations could coexist for the same id across users, which the model forbids.
Options:
- If conversation IDs are intended to be global (current model), keep the intent “ownership isolation by anonymous ID” but avoid implying distinct rows with the same id. Update the docstring and keep checking that filter_by uses the correct anonymous_user_id.
- def test_persist_user_conversation_different_users_same_conversation_id( + def test_persist_user_conversation_isolation_uses_anonymous_id( @@ - """Test that different users can't access each other's conversations.""" + """Test that different users are isolated by anonymous_user_id during lookups."""
- If they are intended to be per-user, change the schema to a composite PK (id, anonymous_user_id) and add a migration; then add a test that commits twice without error.
Would you like me to draft the migration and ripple the model/tests accordingly?src/utils/user_anonymization.py (2)
73-92: Differentiate IntegrityError causes and re-fetch once.IntegrityError might stem from a duplicate anonymous_id (extremely rare) or the unique user_id_hash. The code assumes a race and retries, which is fine, but make the log clearer and guard the retry with a single re-fetch to avoid masking unrelated DB issues.
- except IntegrityError as e: + except IntegrityError as e: session.rollback() - # Race condition - another thread created the mapping - logger.warning("Race condition creating user mapping: %s", e) + # Likely race on user_id_hash uniqueness; attempt a single re-fetch. + logger.warning("IntegrityError creating user mapping (likely race): %s", e)
94-103: Counting rows is fine; consider future-proofing for large tables.session.query(UserMapping).count() can be slow at scale on some backends. Not urgent, but consider adding a dedicated approximate counter (cached) if this is used on hot paths or dashboards.
tests/unit/utils/test_endpoints_anonymization.py (1)
120-172: Error propagation test looks good; add a complementary test for get_anonymous_user_id failure?Currently you assert DB errors propagate. Optional: add a test where get_anonymous_user_id raises (e.g., due to config), and verify the exception propagates before any DB access. This locks in the call order.
+ @patch("utils.endpoints.get_anonymous_user_id", side_effect=RuntimeError("anon lookup failed")) + def test_validate_conversation_ownership_anonymous_lookup_failure(self, mock_get_anonymous): + with pytest.raises(RuntimeError, match="anon lookup failed"): + validate_conversation_ownership(user_id="u@example.com", conversation_id="c1")src/models/database/user_mapping.py (2)
5-7: Specify column types; remove redundant index.Explicit lengths improve storage/indexing, and the extra named Index duplicates the implicit unique index on user_id_hash. Keep one.
-from sqlalchemy.orm import Mapped, mapped_column -from sqlalchemy import DateTime, func, Index +from sqlalchemy.orm import Mapped, mapped_column +from sqlalchemy import DateTime, func, String
27-28: Drop the extra explicit index.A unique constraint already creates an index on most backends (e.g., PostgreSQL). Keeping both wastes space and can slow writes.
- # Index for efficient lookups - __table_args__ = (Index("ix_user_mapping_hash_lookup", "user_id_hash"),) + # Rely on the unique constraint's index for lookups.tests/unit/utils/test_transcripts_anonymization.py (2)
72-80: Path construction test: optionally assert base path sanitation.You already assert id and conversation_id presence elsewhere. Here you could also assert that the resulting path is relative (no drive letter / leading slash) to further guard assumptions.
- assert "anon-uuid-456" in str(result) - assert "conv-789" in str(result) + p = str(result) + assert "anon-uuid-456" in p + assert "conv-789" in p + assert not p.startswith("/")
205-223: Sanitization test allows embedded slashes in IDs. Is that intentional?construct_transcripts_path neutralizes traversal (“../”) but still permits embedded “/”, which will create nested directories if ever passed. If anonymous IDs always come from get_anonymous_user_id (UUIDs), this is safe. If any external input can reach this, consider rejecting path separators.
Possible hardening in construct_transcripts_path:
- Validate inputs match a safe regex (e.g., r"^[a-z0-9-]{1,64}$")
- Or replace os.sep occurrences before joining
I can draft a patch if desired.tests/unit/utils/test_user_anonymization.py (7)
87-116: Add coverage for IntegrityError raised by commit(), not just add()Real-world uniqueness violations are more commonly surfaced on commit() than add(). You already cover add() raising; add a companion test for commit() to ensure rollback and recovery logic still works.
Apply this diff to append a new test after Line 116:
+ @patch("utils.user_anonymization.get_session") + @patch("utils.user_anonymization.get_suid") + def test_get_anonymous_user_id_race_condition_on_commit( + self, mock_get_suid, mock_get_session + ): + """Race on unique constraint where commit(), not add(), raises IntegrityError.""" + mock_session = MagicMock() + mock_get_session.return_value.__enter__.return_value = mock_session + + existing_mapping = UserMapping() + existing_mapping.anonymous_id = "race-uuid-commit" + + # First lookup: none; after rollback: mapping exists + mock_session.query.return_value.filter_by.return_value.first.side_effect = [ + None, + existing_mapping, + ] + mock_get_suid.return_value = "new-uuid" + + # Simulate IntegrityError during commit + mock_session.add.return_value = None + mock_session.commit.side_effect = IntegrityError("Duplicate key", None, None) + + result = get_anonymous_user_id("race_user_commit@example.com") + + assert result == "race-uuid-commit" + mock_session.rollback.assert_called_once()
139-149: Strengthen assertion: ensure count() is actually called onceSmall robustness improvement: assert the count() invocation, not just the return value.
Apply this diff:
- mock_session.query.return_value.count.return_value = 42 + q = mock_session.query.return_value + q.count.return_value = 42 @@ - assert result == 42 - mock_session.query.assert_called_once_with(UserMapping) + assert result == 42 + mock_session.query.assert_called_once_with(UserMapping) + q.count.assert_called_once()
166-176: Assert that find_anonymous_user_id never creates mappingsAdd a negative test to ensure we don’t accidentally create new mappings (no get_suid/add/commit) when nothing is found.
Append after Line 176:
+ @patch("utils.user_anonymization.get_session") + @patch("utils.user_anonymization.get_suid") + def test_find_anonymous_user_id_does_not_create(self, mock_get_suid, mock_get_session): + mock_session = MagicMock() + mock_get_session.return_value.__enter__.return_value = mock_session + mock_session.query.return_value.filter_by.return_value.first.return_value = None + + result = find_anonymous_user_id("nope@example.com") + + assert result is None + mock_get_suid.assert_not_called() + mock_session.add.assert_not_called() + mock_session.commit.assert_not_called()
188-201: Rename to reflect intent (this tests hash uniqueness, not anonymous IDs)Minor clarity fix: the test exercises _hash_user_id uniqueness, not the generation of anonymous UUIDs.
Apply this diff:
- def test_anonymization_preserves_uniqueness(self): - """Test that different users get different anonymous IDs.""" + def test_hash_uniqueness_across_users(self): + """Test that different user IDs hash to distinct values."""
188-201: Add test: distinct users should get distinct anonymous UUIDs when createdComplement the hash-uniqueness check with an end-to-end creation scenario to ensure separate users don’t collide on anonymous IDs.
Append after Line 201:
+ @patch("utils.user_anonymization.get_session") + @patch("utils.user_anonymization.get_suid") + def test_distinct_users_get_distinct_anonymous_ids(self, mock_get_suid, mock_get_session): + mock_session = MagicMock() + mock_get_session.return_value.__enter__.return_value = mock_session + # Both calls: no existing mapping + mock_session.query.return_value.filter_by.return_value.first.side_effect = [None, None] + mock_get_suid.side_effect = ["uuid-1", "uuid-2"] + + id1 = get_anonymous_user_id("userA@example.com") + id2 = get_anonymous_user_id("userB@example.com") + + assert id1 == "uuid-1" + assert id2 == "uuid-2" + assert id1 != id2 + assert mock_session.add.call_count == 2 + assert mock_session.commit.call_count == 2
206-217: Expand model checks to assert uniqueness and named index presenceQuick introspection verifies contract: user_id_hash unique + named index for fast lookups.
Append after Line 217:
+ def test_user_mapping_constraints_and_indexes(self): + col = UserMapping.__table__.c.user_id_hash + assert col.unique is True + index_names = {ix.name for ix in UserMapping.__table__.indexes} + assert "ix_user_mapping_hash_lookup" in index_names
1-16: Optional: consider fixtures to reduce repetitionA couple of small pytest fixtures (e.g., a mock_session fixture and a helper to seed an existing UserMapping) would DRY up repeated setup across tests. Not required for merge, just a maintainability nicety.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (11)
src/app/endpoints/conversations.py(2 hunks)src/app/endpoints/query.py(2 hunks)src/models/database/conversations.py(1 hunks)src/models/database/user_mapping.py(1 hunks)src/utils/endpoints.py(2 hunks)src/utils/transcripts.py(4 hunks)src/utils/user_anonymization.py(1 hunks)tests/unit/app/endpoints/test_query_anonymization.py(1 hunks)tests/unit/utils/test_endpoints_anonymization.py(1 hunks)tests/unit/utils/test_transcripts_anonymization.py(1 hunks)tests/unit/utils/test_user_anonymization.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (10)
src/app/endpoints/conversations.py (3)
src/utils/user_anonymization.py (1)
get_anonymous_user_id(29-91)src/app/database.py (1)
get_session(34-40)src/models/database/conversations.py (1)
UserConversation(11-36)
tests/unit/utils/test_transcripts_anonymization.py (4)
src/models/requests.py (2)
QueryRequest(70-220)Attachment(15-67)src/utils/transcripts.py (2)
store_transcript(34-95)construct_transcripts_path(22-31)src/utils/types.py (1)
TurnSummary(59-78)src/configuration.py (1)
user_data_collection_configuration(79-84)
src/utils/endpoints.py (3)
src/utils/user_anonymization.py (1)
get_anonymous_user_id(29-91)src/app/database.py (1)
get_session(34-40)src/models/database/conversations.py (1)
UserConversation(11-36)
tests/unit/utils/test_endpoints_anonymization.py (3)
src/utils/endpoints.py (1)
validate_conversation_ownership(22-35)src/models/database/conversations.py (1)
UserConversation(11-36)tests/unit/app/endpoints/test_query_anonymization.py (1)
mock_anonymous_side_effect(157-162)
src/models/database/user_mapping.py (1)
src/models/database/base.py (1)
Base(6-7)
src/utils/transcripts.py (1)
src/utils/user_anonymization.py (1)
get_anonymous_user_id(29-91)
tests/unit/utils/test_user_anonymization.py (2)
src/utils/user_anonymization.py (4)
get_anonymous_user_id(29-91)get_user_count(94-102)find_anonymous_user_id(105-122)_hash_user_id(16-26)src/models/database/user_mapping.py (1)
UserMapping(11-28)
src/app/endpoints/query.py (4)
src/models/database/conversations.py (1)
UserConversation(11-36)src/utils/endpoints.py (4)
check_configuration_loaded(38-44)get_agent(82-131)get_system_prompt(47-78)validate_conversation_ownership(22-35)src/utils/user_anonymization.py (1)
get_anonymous_user_id(29-91)src/app/database.py (1)
get_session(34-40)
src/utils/user_anonymization.py (3)
src/models/database/user_mapping.py (1)
UserMapping(11-28)src/app/database.py (1)
get_session(34-40)src/utils/suid.py (1)
get_suid(6-12)
tests/unit/app/endpoints/test_query_anonymization.py (3)
src/app/endpoints/query.py (1)
persist_user_conversation_details(77-112)src/models/database/conversations.py (1)
UserConversation(11-36)tests/unit/utils/test_endpoints_anonymization.py (1)
mock_anonymous_side_effect(79-84)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (9)
src/app/endpoints/query.py (1)
80-83: Good: association now uses anonymous user IDPersisting conversation metadata keyed by
anonymous_user_idaligns with the new model. Nice.src/utils/transcripts.py (2)
50-59: Docstring updated correctlyDocs clearly state anonymization behavior.
71-88: Good: metadata stores anonymous_user_id onlyThe transcript file no longer stores the original ID. This aligns with the anonymization objective.
tests/unit/app/endpoints/test_query_anonymization.py (1)
15-33: Good coverage for the “new conversation” path.Mocks and assertions correctly verify that the anonymous ID is used for lookup and object creation, and that the transaction is committed.
tests/unit/utils/test_endpoints_anonymization.py (1)
13-47: Solid validation of ownership via anonymous_user_id.Good use of patching to assert that lookups filter by both id and anonymous_user_id, and that the retrieved row is returned unmodified.
tests/unit/utils/test_transcripts_anonymization.py (2)
17-71: End-to-end anonymization checks for transcripts are thorough.You verify pathing, on-disk structure, and that only anonymous_user_id makes it into metadata. Nice.
223-272: Great logging test.Confirms that anonymization is explicitly logged with truncated original user and anonymous ID.
tests/unit/utils/test_user_anonymization.py (2)
18-20: Excellent coverage and clear scenarios — tests mostly LGTMSolid breadth: hashing behavior, new/existing user flows, race conditions, and basic model checks. Nice use of context-managed session mocks and targeted IntegrityError simulations.
38-41: Decorator/argument ordering is correctThe ordering of @patch decorators and corresponding args is correct (get_suid first param, get_session second) and consistent across tests. No change needed.
Also applies to: 87-91
src/utils/user_anonymization.py
Outdated
| def _hash_user_id(user_id: str) -> str: | ||
| """ | ||
| Create a consistent hash of the user ID for mapping purposes. | ||
| Uses SHA-256 with a fixed salt to ensure consistent hashing | ||
| while preventing rainbow table attacks. | ||
| """ | ||
| # Use a fixed salt - in production, this should be configurable | ||
| salt = "lightspeed_user_anonymization_salt_v1" | ||
| hash_input = f"{salt}:{user_id}".encode("utf-8") | ||
| return hashlib.sha256(hash_input).hexdigest() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Replace fixed salted SHA-256 with a keyed HMAC and normalize user IDs.
A hardcoded, public salt offers limited protection. Use a server-secret “pepper” via HMAC-SHA256 and normalize user IDs (trim/lower) to avoid duplicate mappings for case variants.
-def _hash_user_id(user_id: str) -> str:
+def _hash_user_id(user_id: str) -> str:
@@
- Uses SHA-256 with a fixed salt to ensure consistent hashing
- while preventing rainbow table attacks.
+ Uses HMAC-SHA256 with a secret pepper for deterministic, keyed hashing.
+ Also normalizes the user_id to prevent case/whitespace variants from diverging.
@@
- # Use a fixed salt - in production, this should be configurable
- salt = "lightspeed_user_anonymization_salt_v1"
- hash_input = f"{salt}:{user_id}".encode("utf-8")
- return hashlib.sha256(hash_input).hexdigest()
+ # Prefer secret from env; fall back only for local/dev. Document rotation process.
+ pepper = os.getenv("USER_ANON_PEPPER", "lightspeed_user_anonymization_salt_v1")
+ uid = user_id.strip().lower()
+ return hmac.new(pepper.encode("utf-8"), uid.encode("utf-8"), hashlib.sha256).hexdigest()Follow-ups:
- Add USER_ANON_PEPPER to deployment secrets and CI.
- Document rotation (dual-read with old/new peppers or a background re-hash if ever needed).
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In src/utils/user_anonymization.py around lines 16 to 26, replace the hardcoded
salt-and-sha256 approach with an HMAC-SHA256 using a server-secret pepper read
from env (USER_ANON_PEPPER) and normalize the incoming user_id by trimming
whitespace and lowercasing before hashing; specifically, load USER_ANON_PEPPER
(fail with a clear error if missing in runtime/CI), compute
hmac.new(pepper_bytes, normalized_user_id_bytes, sha256).hexdigest(), and remove
the fixed salt string; also update docs/tests/deployment to require
USER_ANON_PEPPER and add a short note about rotation strategy (dual-read or
rehashing) in the repo deployment docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed by e168276
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
@thoraxe Perfect! I can see from the commit that you've properly implemented the security improvements I suggested:
✅ HMAC-SHA256 implementation: Replaced the fixed salt approach with hmac.new() using SHA-256
✅ Environment-based secret: Added USER_ANON_PEPPER environment variable with proper fallback
✅ User ID normalization: Added user_id.strip().lower() to prevent case/whitespace variants
✅ Proper imports: Added import hmac and import os
✅ Clear error messaging: Good fallback documentation in the code
The implementation looks solid and addresses all the security concerns I raised. This is a much more secure approach that uses a proper keyed hash function instead of the predictable salted hash.
🏁 Script executed:
#!/bin/bash
# Fetch the specific commit to examine the fix
git show e1682760e2e529f43dd14f72d3d00422c32e6f0b --name-only
Length of output: 6491
🏁 Script executed:
#!/bin/bash
# Get the actual changes in the commit for the user_anonymization.py file
git show e1682760e2e529f43dd14f72d3d00422c32e6f0b -- src/utils/user_anonymization.py
Length of output: 4859
@thoraxe Perfect! I can see from the commit that you've properly implemented all the security improvements I suggested:
✅ HMAC-SHA256 implementation: Replaced the fixed salt approach with hmac.new() using SHA-256
✅ Environment-based secret: Added USER_ANON_PEPPER environment variable with proper startup validation
✅ User ID normalization: Added user_id.strip().lower() to prevent case/whitespace variants
✅ Proper imports: Added import hmac and import os
✅ Clear error messaging: Excellent error handling with descriptive RuntimeError for missing env var
✅ Security best practices: Proper pepper handling and fail-fast behavior
The implementation looks solid and addresses all the security concerns I raised. This is a much more secure approach that uses a proper keyed hash function instead of the predictable salted hash. The fail-fast behavior for missing USER_ANON_PEPPER is particularly good - it ensures the system won't accidentally fall back to an insecure mode.
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (13)
src/models/database/user_mapping.py (2)
22-24: Remove redundant index on user_id_hash (unique already creates one).You’re defining both
unique=Trueand an explicitIndexonuser_id_hash. The unique constraint already creates an index in all major RDBMS, so the extra index is redundant and can slow writes. Drop one of them (recommend: keepunique=True, remove the explicitIndex(...)and theindex=Trueflag).Apply this diff:
-from sqlalchemy import DateTime, func, Index, String +from sqlalchemy import DateTime, func, String @@ - user_id_hash: Mapped[str] = mapped_column( - String(64), index=True, unique=True, nullable=False - ) + user_id_hash: Mapped[str] = mapped_column( + String(64), unique=True, nullable=False + ) @@ - # Index for efficient lookups - __table_args__ = (Index("ix_user_mapping_hash_lookup", "user_id_hash"),) + # No extra index needed; unique creates an index implicitly + __table_args__ = ()Also applies to: 31-32
17-19: Consider native UUID column type for anonymous_id when using PostgreSQL.If you’re on Postgres,
UUID(withserver_default=text('gen_random_uuid()')if desired) improves storage and validation overString(36). Cross-DB projects can keepString(36).Example (Postgres only):
-from sqlalchemy import DateTime, func, String +from sqlalchemy import DateTime, func, String +from sqlalchemy.dialects.postgresql import UUID @@ - anonymous_id: Mapped[str] = mapped_column( - String(36), primary_key=True, nullable=False - ) + anonymous_id: Mapped[str] = mapped_column( + UUID(as_uuid=False), primary_key=True, nullable=False + )src/models/database/conversations.py (1)
20-22: Specify ON DELETE behavior on the FK to make ownership semantics explicit.Currently the FK uses default NO ACTION. If a
UserMappingrow were ever deleted, you should define whether to restrict or cascade. Given this is an ownership key, RESTRICT is a safe default.- anonymous_user_id: Mapped[str] = mapped_column( - ForeignKey("user_mapping.anonymous_id"), index=True, nullable=False - ) + anonymous_user_id: Mapped[str] = mapped_column( + ForeignKey("user_mapping.anonymous_id", ondelete="RESTRICT"), + index=True, + nullable=False, + )src/utils/endpoints.py (2)
16-16: Importfind_anonymous_user_idto avoid side effects during auth checks.Authorization should not create new mappings. Prefer a read-only lookup.
-from utils.user_anonymization import get_anonymous_user_id +from utils.user_anonymization import find_anonymous_user_id
31-47: Refactor authorization check to avoid unintended user-mapping and align error payloadsI verified that
get_anonymous_user_idalways creates a new mapping, even when the user shouldn’t have access, and that error details elsewhere use a{response, cause}dict. Switching tofind_anonymous_user_idprevents inadvertent inserts, and updating theHTTPExceptionpayload ensures consistency. All current callers ofvalidate_conversation_ownershipeither expect an exception or ignore anyNonereturn value, so this change won’t break existing call sites.Locations to update in
src/utils/endpoints.py(withinvalidate_conversation_ownership):
- Replace
get_anonymous_user_id(user_id)withfind_anonymous_user_id(user_id)- If
anonymous_user_idisNone, raise a 403 with a dict payload- Change the “conversation does not belong” exception to use the same dict format
Suggested diff:
@@ def validate_conversation_ownership(user_id: str, conversation_id: str) -> UserConversation: - # Get anonymous user ID for database lookup - anonymous_user_id = get_anonymous_user_id(user_id) + # Lookup existing anonymous user ID without creating a new mapping + anonymous_user_id = find_anonymous_user_id(user_id) + if not anonymous_user_id: + raise HTTPException( + status_code=status.HTTP_403_FORBIDDEN, + detail={ + "response": "Access denied", + "cause": "You do not have permission to access this conversation", + }, + ) with get_session() as session: conversation = ( session.query(UserConversation) .filter_by(id=conversation_id, anonymous_user_id=anonymous_user_id) .first() ) - if not conversation: - raise HTTPException( - status_code=status.HTTP_403_FORBIDDEN, - detail="Forbidden: conversation does not belong to user", - ) + if not conversation: + raise HTTPException( + status_code=status.HTTP_403_FORBIDDEN, + detail={ + "response": "Access denied", + "cause": "You do not have permission to access this conversation", + }, + ) return conversationTagging as an optional, but recommended, refactor to improve consistency and prevent unwanted database writes.
src/utils/user_anonymization.py (1)
108-117: Minor: Use SQLAlchemy 2.0 style count for efficiency on some backends.
session.query(UserMapping).count()can be less efficient; usingselect(func.count())is the modern pattern. Not critical.- with get_session() as session: - return session.query(UserMapping).count() + from sqlalchemy import select, func + with get_session() as session: + return session.execute(select(func.count()).select_from(UserMapping)).scalar_one()tests/unit/app/endpoints/test_query_anonymization.py (2)
140-147: Strengthen log redaction assertion.Also assert that no email-like token leaks (e.g., '@' character) to ensure PII is not logged.
log_msg = anonymization_logs[0] assert "anon-logging-456" in log_msg assert "logging-conv-789" in log_msg # Should NOT contain any reference to original user ID assert "logging_test_user" not in log_msg assert "logging_..." not in log_msg + assert "@" not in log_msg
167-171: Clarify test intent to avoid implying duplicate PKs per user.
idis a global PK, so two users cannot actually persist the sameconversation_id. This test rightly verifies the lookup filter scoping; update the docstring to reflect that.- """Test that different users can't access each other's conversations.""" + """Test lookup scoping: different users resolve to different anonymous IDs even if a conversation_id collides in input (no DB persistence here)."""tests/unit/utils/test_user_anonymization.py (5)
54-79: New-user path mocking is correct; align assertion with module fixtureThe DB interactions and added-entity assertions are on point. After switching to the module fixture, update the last assertion to use ua._hash_user_id(user_id).
Suggested tweak (post-refactor):
- assert added_mapping.user_id_hash == _hash_user_id(user_id) + assert added_mapping.user_id_hash == ua._hash_user_id(user_id)
103-132: Race condition simulation should raise IntegrityError on commit, not addIn practice, SQLAlchemy raises IntegrityError on flush/commit, not on session.add(). Your code catches exceptions around add/commit (so the test still passes), but modeling the failure on commit makes the test more realistic and robust to future refactors.
Apply this change:
- mock_session.add.side_effect = IntegrityError("Duplicate key", None, None) + mock_session.add.side_effect = None + mock_session.commit.side_effect = IntegrityError("Duplicate key", None, None)
134-153: Same as above: raise on commit for failure pathMirror the realistic failure point for the second race test.
- mock_session.add.side_effect = IntegrityError("Duplicate key", None, None) + mock_session.add.side_effect = None + mock_session.commit.side_effect = IntegrityError("Duplicate key", None, None)
262-279: Model attribute checks are fine; consider asserting column constraints (optional)You could strengthen this by asserting ORM metadata for uniqueness and indexing of user_id_hash without needing a DB:
@@ def test_user_mapping_attributes(self): """Test UserMapping model has correct attributes.""" mapping = UserMapping() @@ assert UserMapping.__tablename__ == "user_mapping" + + # Optional: verify ORM column constraints via table metadata + cols = mapping.__table__.columns + assert cols["user_id_hash"].unique is True + # Index presence is also declared in __table_args__; ensure it's defined + assert any("user_id_hash" in idx.columns.keys() for idx in mapping.__table__.indexes)
59-61: Use autospecced mocks to catch interface drifts (optional)Creating autospecced session/query mocks can catch accidental API changes (wrong method names/args) early.
Example:
- mock_session = MagicMock() + from sqlalchemy.orm import Session + mock_session = MagicMock(spec=Session)Similarly, consider autospeccing the Query object if you break out chained calls.
Also applies to: 88-93, 158-165, 172-177
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
src/app/endpoints/conversations.py(2 hunks)src/app/endpoints/query.py(2 hunks)src/models/database/conversations.py(2 hunks)src/models/database/user_mapping.py(1 hunks)src/utils/endpoints.py(1 hunks)src/utils/transcripts.py(4 hunks)src/utils/user_anonymization.py(1 hunks)tests/unit/app/endpoints/test_query_anonymization.py(1 hunks)tests/unit/utils/conftest.py(1 hunks)tests/unit/utils/test_endpoints_anonymization.py(1 hunks)tests/unit/utils/test_transcripts_anonymization.py(1 hunks)tests/unit/utils/test_user_anonymization.py(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/unit/utils/conftest.py
🚧 Files skipped from review as they are similar to previous changes (5)
- src/app/endpoints/conversations.py
- tests/unit/utils/test_transcripts_anonymization.py
- tests/unit/utils/test_endpoints_anonymization.py
- src/utils/transcripts.py
- src/app/endpoints/query.py
🧰 Additional context used
🧬 Code graph analysis (5)
src/models/database/user_mapping.py (1)
src/models/database/base.py (1)
Base(6-7)
src/utils/user_anonymization.py (3)
src/models/database/user_mapping.py (1)
UserMapping(11-32)src/app/database.py (1)
get_session(34-40)src/utils/suid.py (1)
get_suid(6-12)
tests/unit/app/endpoints/test_query_anonymization.py (2)
src/app/endpoints/query.py (1)
persist_user_conversation_details(77-111)src/models/database/conversations.py (1)
UserConversation(11-38)
src/utils/endpoints.py (3)
src/utils/user_anonymization.py (1)
get_anonymous_user_id(43-105)src/models/database/conversations.py (1)
UserConversation(11-38)src/app/database.py (1)
get_session(34-40)
tests/unit/utils/test_user_anonymization.py (2)
src/models/database/user_mapping.py (1)
UserMapping(11-32)src/utils/user_anonymization.py (4)
get_anonymous_user_id(43-105)get_user_count(108-116)find_anonymous_user_id(119-136)_hash_user_id(27-40)
🪛 GitHub Actions: E2E Tests
src/utils/user_anonymization.py
[error] 21-21: RuntimeError: USER_ANON_PEPPER environment variable is required for user anonymization. Set a secure random value (export USER_ANON_PEPPER=)
🪛 GitHub Actions: Unit tests
src/utils/user_anonymization.py
[error] 19-19: KeyError: 'USER_ANON_PEPPER' while reading environment variable (os.environ['USER_ANON_PEPPER']) during test collection. The CI run was executing 'uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing'.
[error] 21-21: RuntimeError: USER_ANON_PEPPER environment variable is required for user anonymization. Set a secure value (e.g., export USER_ANON_PEPPER=).
🔇 Additional comments (12)
src/utils/user_anonymization.py (2)
27-41: Normalization policy may need to align with your IdP semantics.Lowercasing and trimming is good for emails. If your user IDs are not emails and are case-sensitive in the IdP, this could collapse distinct principals. Consider making normalization strategy pluggable/configurable.
Would you like me to add a configurable normalizer (callable in AppConfig) with a sensible default for emails?
17-25: Load USER_ANON_PEPPER lazily to prevent CI import-time failuresWe ran a repo-wide search for
os.environ[...]insrc/**andtests/**and found only the single occurrence insrc/utils/user_anonymization.py:19. No other modules perform import-time environment access. Applying the diff below will defer loadingUSER_ANON_PEPPERuntil first use, unblocking CI while preserving strictness.Files to update:
- src/utils/user_anonymization.py (lines 17–25)
-from typing import Optional +from typing import Optional +from functools import lru_cache @@ -# Load HMAC pepper from environment - fail at startup if missing -try: - _USER_ANON_PEPPER = os.environ["USER_ANON_PEPPER"].encode("utf-8") -except KeyError as e: - raise RuntimeError( - "USER_ANON_PEPPER environment variable is required for user anonymization. " - "Set a secure random value (e.g., 'export USER_ANON_PEPPER=<your-secret-key>')" - ) from e +@lru_cache(maxsize=1) +def _get_pepper() -> bytes: + """Load and cache the HMAC pepper from the environment on first use.""" + try: + return os.environ["USER_ANON_PEPPER"].encode("utf-8") + except KeyError as e: + raise RuntimeError( + "USER_ANON_PEPPER environment variable is required for user anonymization. " + "Set a secure random value (e.g., 'export USER_ANON_PEPPER=<your-secret-key>')" + ) from e @@ - return hmac.new( - _USER_ANON_PEPPER, normalized_user_id.encode("utf-8"), hashlib.sha256 - ).hexdigest() + return hmac.new( + _get_pepper(), normalized_user_id.encode("utf-8"), hashlib.sha256 + ).hexdigest()No other import-time
os.environreads were detected.tests/unit/utils/test_user_anonymization.py (10)
21-31: Autouse reload is good; ensure all tests reference the reloaded moduleThe autouse fixture correctly sets USER_ANON_PEPPER and reloads the module, but its benefit is neutralized by the early from-import. After applying the refactor above, all tests will consume the reloaded module. No changes needed here beyond that refactor.
Once you apply the refactor, please run the test suite locally to confirm no leakage from stale imports.
37-46: Hashing determinism and format assertions look solidGood checks for determinism, output length (64 hex chars), and inequality with the original input.
47-53: Distinct inputs → distinct hashes: LGTMCovers the basic collision sanity check for different inputs.
80-102: Existing-user path is correct and verifies no persistenceGood verification that .add and .commit are not called when the mapping exists.
154-165: User count test is lean and accurateMocks the count and validates the query target. LGTM.
166-191: find_anonymous_user_id tests cover both branches cleanlyBoth existing and not-found paths are exercised correctly.
193-209: Good defense checks against rainbow tables and missing pepperStrong assertions comparing to bare SHA-256 and HMAC with empty pepper. Nice.
210-223: Uniqueness across multiple users: concise and effectiveSimple set cardinality check does the job.
224-245: Normalization behavior is well-testedCase and whitespace normalization are validated; also checks distinctness vs. a different user.
246-257: Missing pepper test is correct and future-proofClearing the env and forcing a reload to assert a clear RuntimeError is exactly what we want. This remains valid after the import refactor since the reload happens inside the test scope.
| @pytest.fixture(autouse=True) | ||
| def setup_test_pepper(): | ||
| """Set up test pepper environment variable for all tests.""" | ||
| test_pepper = "test-pepper-for-query-tests" | ||
| with patch.dict(os.environ, {"USER_ANON_PEPPER": test_pepper}): | ||
| yield | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Make USER_ANON_PEPPER available repo-wide during tests.
This per-module fixture is fine, but other tests fail when importing modules before this runs (see CI). Add a top-level tests/conftest.py with a session-scoped autouse fixture to set the pepper for all tests.
Proposed new file tests/conftest.py:
# tests/conftest.py
import os
import pytest
@pytest.fixture(scope="session", autouse=True)
def _pepper_for_all_tests():
os.environ.setdefault("USER_ANON_PEPPER", "test-pepper-global")🤖 Prompt for AI Agents
tests/unit/app/endpoints/test_query_anonymization.py around lines 15 to 21: the
module-level fixture sets USER_ANON_PEPPER only for that module, causing imports
in other tests to fail in CI; add a new top-level tests/conftest.py file that
defines a session-scoped autouse pytest fixture which sets or defaults
os.environ["USER_ANON_PEPPER"] to a test value (e.g., "test-pepper-global")
before any tests/imports run so the variable is available repo-wide for all
tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed by fd32c3e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
@thoraxe Perfect! Thank you for implementing the session-scoped pepper configuration. This will ensure all tests across the repository have access to the USER_ANON_PEPPER environment variable, preventing the CI import failures we were seeing.
🏁 Script executed:
#!/bin/bash
# Check the commit to see what was implemented
git log --oneline -1 fd32c3e44a1298c9ce655504c6b85101a563b02b
git show fd32c3e44a1298c9ce655504c6b85101a563b02b --name-only
Length of output: 6503
@thoraxe I see this was part of a larger commit that touched many files. Let me verify that the specific USER_ANON_PEPPER fix was included in the tests/conftest.py file.
🏁 Script executed:
#!/bin/bash
# Check the specific content of tests/conftest.py to see if the pepper fix was implemented
git show fd32c3e44a1298c9ce655504c6b85101a563b02b:tests/conftest.py
Length of output: 319
@thoraxe Perfect! I can see the fix was successfully implemented in tests/conftest.py:
"""Global configuration for all tests."""
import os
# Set environment variable for all tests globally
# This ensures the anonymization module can be imported in CI environments
os.environ.setdefault("USER_ANON_PEPPER", "test-pepper-for-all-tests")This is exactly what was needed - a top-level configuration that sets the pepper globally for all tests, preventing the CI import failures. The implementation is clean and well-commented. Great work addressing this along with the other changes in that comprehensive commit!
If you found this review helpful, would you consider giving us a shout-out on X?
Thank you for using CodeRabbit!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/app/endpoints/test_query.py (1)
170-183: Decouple transcript assertions from real user IDs; assert anonymized value deterministicallyThis test currently asserts anonymous_user_id == "mock_user_id", which contradicts the anonymization contract and will become brittle once get_anonymous_user_id returns opaque UUIDs. Patch the anonymizer and assert the deterministic value instead.
Apply this diff:
@@ mocker.patch( "app.endpoints.query.is_transcripts_enabled", return_value=store_transcript_to_file, ) + # Ensure deterministic anonymized user ID for assertions + mocker.patch( + "app.endpoints.query.get_anonymous_user_id", + return_value="anon-uuid-123", + ) mock_transcript = mocker.patch("app.endpoints.query.store_transcript") @@ if store_transcript_to_file: mock_transcript.assert_called_once_with( - anonymous_user_id="mock_user_id", + anonymous_user_id="anon-uuid-123", conversation_id=conversation_id, model_id="fake_model_id", provider_id="fake_provider_id", query_is_valid=True, query=query, query_request=query_request, summary=summary, attachments=[], rag_chunks=[], truncated=False, )Also applies to: 149-153
🧹 Nitpick comments (7)
tests/conftest.py (1)
5-7: Prefer a session-scoped autouse fixture over import-time env mutationIt looks like there are two import-time mutations of USER_ANON_PEPPER that should both be replaced with the same session-scoped, autouse fixture:
• tests/conftest.py (lines 5–7)
• tests/unit/utils/conftest.py (lines 4–6)Tests under tests/unit/utils/* that use patch.dict already play nicely with a fixture and don’t need changing. Please replace both conftest.py occurrences with:
import os import pytest -# Set environment variable for all tests globally -# This ensures the anonymization module can be imported in CI environments -os.environ.setdefault("USER_ANON_PEPPER", "test-pepper-for-all-tests") +# Ensure the anonymization module can be imported in CI environments +# without leaking globals or interfering with tests that clear the env. +@pytest.fixture(scope="session", autouse=True) +def _set_user_anon_pepper(monkeypatch): + monkeypatch.setenv( + "USER_ANON_PEPPER", + os.environ.get("USER_ANON_PEPPER", "test-pepper-for-all-tests"), + )And in tests/unit/utils/conftest.py, change:
-# Set environment variable for all tests in this directory -os.environ.setdefault("USER_ANON_PEPPER", "test-pepper-for-utils-tests")to the same fixture (adjusting the default pepper literal if desired).
This makes the setup reversible, avoids test-order or import surprises, and centralizes the pepper logic.
tests/unit/app/endpoints/test_query.py (2)
75-79: Simplify configuration-not-loaded setup (remove redundant patch)The first patch sets configuration to a Mock, then you immediately patch it to None. The first patch is redundant.
Apply this diff:
- mocker.patch( - "app.endpoints.query.configuration", - return_value=mocker.Mock(), - ) - mocker.patch("app.endpoints.query.configuration", None) + mocker.patch("app.endpoints.query.configuration", None)
1180-1183: Avoid setting unused transcripts_disabled when you stub is_transcripts_enabled anywayYou’re forcing is_transcripts_enabled() via patch, so setting configuration.user_data_collection_configuration.transcripts_disabled is unnecessary here and may confuse future readers.
Apply this diff to remove the dead setup:
- mock_config = mocker.Mock() - mock_config.user_data_collection_configuration.transcripts_disabled = True - mocker.patch("app.endpoints.query.configuration", mock_config) + mocker.patch("app.endpoints.query.configuration", mocker.Mock())Repeat the same change in the second test case.
Also applies to: 1230-1232
tests/unit/utils/test_user_anonymization.py (4)
116-143: Race-condition path is simulated well; consider asserting no commit occurs in this pathYou already assert rollback. It’d be useful to assert that commit was not (spuriously) called after the IntegrityError.
Apply this small addition:
result = user_anon_module.get_anonymous_user_id("race_user@example.com") @@ assert result == "race-condition-uuid" mock_session.rollback.assert_called_once() + mock_session.commit.assert_not_called()
194-203: Strengthen “not found” test to ensure no writes occurThe test validates None is returned, but it doesn’t prove we didn’t create a mapping. Add assertions to verify no add/commit happened.
Apply this diff:
result = user_anon_module.find_anonymous_user_id("nonexistent@example.com") assert result is None + # Ensure we didn't create a new mapping on a pure 'find' call + mock_session.add.assert_not_called() + mock_session.commit.assert_not_called()
221-234: Nice coverage for normalization and HMAC vs. simple SHA-256You could add a quick “pepper rotation changes hash” check to confirm dependence on the server secret.
Example additional test (no changes needed elsewhere):
def test_pepper_rotation_changes_hash(user_anon_module): user = "rotate@example.com" h1 = user_anon_module._hash_user_id(user) import utils.user_anonymization as _ua import importlib, os from unittest.mock import patch with patch.dict(os.environ, {"USER_ANON_PEPPER": "different-pepper"}, clear=False): importlib.reload(_ua) h2 = _ua._hash_user_id(user) assert h1 != h2, "Changing the pepper should change the hash output"
1-3: Minor: group stdlib imports (style-only)Pure nit: hashlib, hmac, importlib, os can be grouped without blank lines; current layout is fine if matching repo style.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
tests/conftest.py(1 hunks)tests/unit/app/endpoints/test_query.py(3 hunks)tests/unit/utils/test_user_anonymization.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unit/utils/test_user_anonymization.py (2)
src/models/database/user_mapping.py (1)
UserMapping(11-32)src/utils/user_anonymization.py (4)
_hash_user_id(27-40)get_anonymous_user_id(43-105)get_user_count(108-116)find_anonymous_user_id(119-136)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (7)
tests/conftest.py (1)
1-1: Nice touch: clear, file-level docstring.Good to have a concise description for the test configuration file.
tests/unit/app/endpoints/test_query.py (2)
1410-1416: Model-hints tests reflect the new anonymous_user_id field — LGTMConstructing UserConversation with anonymous_user_id aligns with the schema change. Good coverage of precedence between request vs. conversation hints.
Also applies to: 1422-1425
850-860: Headers shaping for MCP servers looks correct and specificAsserting the exact structure under X-LlamaStack-Provider-Data ensures downstream compatibility. Nice work keeping unknown servers out.
Also applies to: 861-875
tests/unit/utils/test_user_anonymization.py (4)
17-27: Solid autouse fixture: isolates pepper per test and forces module reloadThis cures the stale-global problem and makes tests deterministic across reimports. Good use of patch.dict + importlib.reload.
42-51: Hashing tests cover consistency, length, and non-equality — LGTMThese quickly catch regressions in normalization or digest output format.
257-268: Great negative test: explicit error when pepper is missingThis guards a critical misconfiguration path and pairs nicely with the import-time check.
12-16: Module-level access pattern via fixture resolves stale-global issuesRemoving the from-imports and retrieving functions off the reloaded module matches prior review guidance and prevents drift between env and module state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (4)
README.md (2)
82-92: Clarify terminology and align with implementation (pepper/HMAC vs salt).This section uses “pepper” and mentions HMAC-based hashing, whereas the PR description elsewhere references “salt” and plain SHA-256. To avoid operator confusion and security misconfiguration, make the docs consistently reflect the actual implementation (HMAC-SHA256 with a secret pepper). If any code paths still use a static “salt”, they should be updated or clearly flagged as legacy.
Apply the following doc tweak to make the intent explicit and consistent:
-**Security Note**: This value should be treated as a secret and kept secure. It's used for HMAC-based user ID hashing to protect user privacy while enabling usage analytics. +**Security Note**: Treat this value as a secret. It is a server-wide HMAC key (“pepper”) used with HMAC‑SHA256 to derive anonymous user identifiers. Do not confuse this with a per-user “salt”; we do not store original user IDs or salts.
86-92: Add operational guidance: secure generation, rotation impact, and multi-instance consistency.Provide concrete commands and caveats so operators don’t accidentally break mappings or weaken security.
Proposed doc expansion:
```bash - export USER_ANON_PEPPER="your-secure-random-string-here" + # Generate a 32–48 byte secret and export it (example options): + openssl rand -base64 48 + python - <<'PY' +import secrets, base64 +print(base64.b64encode(secrets.token_bytes(48)).decode()) +PY + export USER_ANON_PEPPER="<paste-generated-secret-here>"
- Security Note: This value should be treated as a secret and kept secure. It's used for HMAC-based user ID hashing to protect user privacy while enabling usage analytics.
- Use a secrets manager (Kubernetes Secret, Docker/Podman secrets, Vault). Do not commit this value.
- All replicas/environments that must share user mappings must use the same pepper.
- Rotating the pepper changes every derived anonymous ID and will sever links to existing mappings/transcripts unless you run a migration. Plan rotations carefully.
- Never use development defaults in production.
</blockquote></details> <details> <summary>.env.example (2)</summary><blockquote> `1-11`: **Harden example values and improve portability.** - Quote example values to avoid shell parsing issues. - Provide concrete secure generation guidance. - Keep a trailing newline to satisfy dotenv linters. Apply: ```diff -# Required: User anonymization pepper (set to a secure random value) -# This is used for HMAC-based user ID hashing to protect user privacy -USER_ANON_PEPPER=your-secure-random-string-here +# Required: User anonymization pepper (HMAC key). Generate securely: +# openssl rand -base64 48 +# python - <<'PY' +# import secrets, base64; print(base64.b64encode(secrets.token_bytes(48)).decode()) +# PY +# This is used for HMAC-based user ID hashing to protect user privacy +USER_ANON_PEPPER="your-secure-random-string-here" -# Optional: OpenAI API Key (if using OpenAI models) -OPENAI_API_KEY=your-openai-api-key +# Optional: OpenAI API Key (if using OpenAI models) +OPENAI_API_KEY="your-openai-api-key" # Optional: Other environment variables as needed for your configuration +
11-11: Add trailing newline to satisfy dotenv-linter.dotenv-linter reports “No blank line at the end of the file”. Add a newline to keep tooling happy.
-# Optional: Other environment variables as needed for your configuration +# Optional: Other environment variables as needed for your configuration +
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
.env.example(1 hunks)README.md(1 hunks)docker-compose.yaml(1 hunks)
🧰 Additional context used
🪛 dotenv-linter (3.3.0)
.env.example
[warning] 11-11: [EndingBlankLine] No blank line at the end of the file
(EndingBlankLine)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (1)
docker-compose.yaml (1)
40-41: Healthcheck path likely mismatches documented API.README documents liveness at GET /v1/liveness, but the healthcheck probes /liveness. Align these or confirm both routes exist.
Recommended change:
- test: ["CMD", "curl", "-f", "http://localhost:8080/liveness"] + test: ["CMD", "curl", "-f", "http://localhost:8080/v1/liveness"]
| - ./lightspeed-stack.yaml:/app-root/lightspeed-stack.yaml:Z | ||
| environment: | ||
| - OPENAI_API_KEY=${OPENAI_API_KEY} | ||
| - USER_ANON_PEPPER=${USER_ANON_PEPPER:-default-pepper-for-development-only} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Remove insecure default pepper fallback.
Providing a default pepper invites accidental production deployments with a shared key, breaking privacy guarantees and cross-environment isolation.
Use a required variable and fail fast if it’s missing (compose will error):
- - USER_ANON_PEPPER=${USER_ANON_PEPPER:-default-pepper-for-development-only}
+ - USER_ANON_PEPPER=${USER_ANON_PEPPER}Optional (outside this line): keep a dev-only override in docker-compose.override.yaml or an .env file that is not committed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - USER_ANON_PEPPER=${USER_ANON_PEPPER:-default-pepper-for-development-only} | |
| - USER_ANON_PEPPER=${USER_ANON_PEPPER} |
🤖 Prompt for AI Agents
In docker-compose.yaml around line 33, the service currently falls back to a
hardcoded default pepper which is insecure; remove the default fallback so the
env interpolation uses USER_ANON_PEPPER without a default (so docker-compose
will error if it’s missing) and ensure any developer-only value is provided via
a non-committed .env or docker-compose.override.yaml instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unit/utils/test_transcripts.py (1)
40-47: Test still asserts raw user_id in path; update to use anonymous_user_id to align with anonymization designPer the PR, construct_transcripts_path should take an anonymous_user_id and the path should contain the anonymous UUID, not the original user ID. This test currently hardcodes user123 in the path and will diverge from the intended behavior.
Apply this diff to assert against the anonymized ID:
- user_id = "user123" + anonymous_user_id = "anon-user-123" conversation_id = "123e4567-e89b-12d3-a456-426614174000" - path = construct_transcripts_path(user_id, conversation_id) + path = construct_transcripts_path(anonymous_user_id, conversation_id) assert ( - str(path) == "/tmp/transcripts/user123/123e4567-e89b-12d3-a456-426614174000" + str(path) == "/tmp/transcripts/anon-user-123/123e4567-e89b-12d3-a456-426614174000" ), "Path should be constructed correctly"
🧹 Nitpick comments (14)
src/app/endpoints/feedback.py (2)
140-144: Guard against accidental defaulting to the CWD for storage.
Path(configuration.user_data_collection_configuration.feedback_storage or "")will use “.” when unset, potentially mixing feedback files into the app’s working directory. Consider validating the path and failing fast (or warning) if empty/misconfigured.Example refinement:
- storage_path = Path( - configuration.user_data_collection_configuration.feedback_storage or "" - ) + feedback_dir = configuration.user_data_collection_configuration.feedback_storage + if not feedback_dir: + logger.error("Feedback storage path is not configured") + raise RuntimeError("Feedback storage path is not configured") + storage_path = Path(feedback_dir)
106-107: Avoid logging entire request payload to reduce PII risk.
str(feedback_request)can include end-user text. Log only safe, high-level metadata.- logger.debug("Feedback received %s", str(feedback_request)) + try: + payload_keys = list( + feedback_request.model_dump(exclude={"model_config"}).keys() + ) + except Exception: + payload_keys = ["<unavailable>"] + logger.debug("Feedback received; keys=%s", payload_keys)tests/unit/app/endpoints/test_feedback.py (2)
140-142: Also assert that anonymization is invoked with the original user_id.You already patch
get_anonymous_user_id. Asserting the call strengthens the contract.- mocker.patch( - "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" - ) + mock_get_anon = mocker.patch( + "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" + ) @@ - store_feedback(user_id, feedback_request_data) + store_feedback(user_id, feedback_request_data) + mock_get_anon.assert_called_once_with(user_id)
188-191: Mirror the call assertion on IO-error path.Even on error, verifying
get_anonymous_user_idwas called helps ensure consistent behavior before the failure point.- mocker.patch( - "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" - ) + mock_get_anon = mocker.patch( + "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" + ) @@ - with pytest.raises(OSError, match="EACCES"): - store_feedback(user_id, feedback_request_data) + with pytest.raises(OSError, match="EACCES"): + store_feedback(user_id, feedback_request_data) + mock_get_anon.assert_called_once_with(user_id)tests/unit/app/endpoints/test_feedback_anonymization.py (3)
27-35: Solid isolation of I/O; consider asserting mkdir semantics.Patching
Path,open,json, andget_suid/get_anonymous_user_idcleanly isolates the unit. Add an assertion thatmkdir(parents=True, exist_ok=True)was called to lock down directory-creation behavior.- mock_path.return_value = MagicMock() + mock_dir = MagicMock() + mock_path.return_value = mock_dir @@ # Call store_feedback store_feedback( @@ ) + + # Directory creation contract + mock_dir.mkdir.assert_called_once_with(parents=True, exist_ok=True)Also applies to: 36-44
66-80: Good differentiation across users; minor enhancement.The side-effect approach validates one-to-one mapping. Optionally, also assert that both payloads contain a
timestampto guarantee server-side enrichment occurs.- first_call_data = mock_json.dump.call_args[0][0] + first_call_data = mock_json.dump.call_args[0][0] ... - second_call_data = mock_json.dump.call_args[0][0] + second_call_data = mock_json.dump.call_args[0][0] @@ assert ( first_call_data["anonymous_user_id"] != second_call_data["anonymous_user_id"] ) + assert "timestamp" in first_call_data and "timestamp" in second_call_dataAlso applies to: 81-87, 105-112
27-35: Add a test to prevent client override of server fields.Given the production fix to place server fields after
**feedback, add a regression test to ensure user-suppliedanonymous_user_id/timestampcannot override server values.@@ class TestFeedbackAnonymization: @@ def test_store_feedback_anonymizes_user_id( self, mock_path, mock_json, mock_get_suid, mock_get_anonymous ): @@ # Verify stored data uses anonymous ID stored_data = mock_json.dump.call_args[0][0] assert stored_data["anonymous_user_id"] == "anon-feedback-123" assert "user_id" not in stored_data assert stored_data["feedback"] == "This is test feedback" assert stored_data["sentiment"] == 1 assert stored_data["categories"] == ["helpful"] + + @patch("app.endpoints.feedback.get_anonymous_user_id") + @patch("app.endpoints.feedback.get_suid", return_value="feedback-uuid") + @patch("app.endpoints.feedback.json") + @patch("app.endpoints.feedback.Path") + def test_client_cannot_override_server_fields( + self, mock_path, mock_json, _mock_get_suid, mock_get_anonymous + ): + mock_get_anonymous.return_value = "anon-override-proof" + mock_path.return_value = MagicMock() + with patch("app.endpoints.feedback.configuration") as mock_config, patch("builtins.open"): + mock_config.user_data_collection_configuration.feedback_storage = "/tmp/feedback" + # Attempt to inject server-controlled fields + store_feedback( + user_id="user@example.com", + feedback={ + "anonymous_user_id": "bad-actor", + "timestamp": "2000-01-01T00:00:00Z", + "feedback": "Should not override", + }, + ) + stored = mock_json.dump.call_args[0][0] + assert stored["anonymous_user_id"] == "anon-override-proof" + assert stored["feedback"] == "Should not override" + # Ensure timestamp was generated by server (format/value not equal to injected) + assert stored["timestamp"] != "2000-01-01T00:00:00Z"Also applies to: 55-65
tests/unit/app/endpoints/test_conversations.py (5)
537-540: Assert anonymization is actually used: check call args and DB filter key.Nice: tests now stub
get_anonymous_user_id. Strengthen the signal by (a) asserting it’s called with the auth user_id and (b) ensuring the DB filter usesanonymous_user_id(notuser_id).Apply this minimal change within the patch block to keep a handle for assertions:
- mocker.patch( - "app.endpoints.conversations.get_anonymous_user_id", - return_value="anon-test-user", - ) + mock_get_anon = mocker.patch( + "app.endpoints.conversations.get_anonymous_user_id", + return_value="anon-test-user", + )Then, outside this hunk, update the test body accordingly:
# Capture the session to inspect query/filter args mock_session = mock_database_session(mocker, mock_conversations) response = get_conversations_list_endpoint_handler(auth=MOCK_AUTH) # 1) get_anonymous_user_id called with the auth user id mock_get_anon.assert_called_once_with(MOCK_AUTH[0]) # 2) DB query filtered by anonymous_user_id (and not by user_id) filter_kwargs = mock_session.query.return_value.filter_by.call_args.kwargs assert filter_kwargs.get("anonymous_user_id") == "anon-test-user" assert "user_id" not in filter_kwargs
581-584: Avoid duplication of the same patch across tests—promote to a fixture.This is identical to the earlier stub; factor it into a reusable fixture (e.g.,
mock_anonymous_user) and inject it into tests that need it. Keeps tests tight and easier to maintain.Example:
@pytest.fixture def mock_anonymous_user(mocker): return mocker.patch( "app.endpoints.conversations.get_anonymous_user_id", return_value="anon-test-user", )Usage:
def test_empty_conversations_list(mocker, setup_configuration, mock_anonymous_user): ...
598-601: Same duplication note as above—use a shared fixture for the anonymization stub.Refactor this repeated patch into the
mock_anonymous_userfixture suggested earlier to reduce repetition and make future changes (e.g., new return value) a single-line edit.
613-614: Use stable substring assertions for error causeWe’ve confirmed that the exact message
"Unknown error while getting conversations for anonymous user"
only appears in the test (tests/unit/app/endpoints/test_conversations.py:613) and isn’t emitted by production code. Relying on the full literal makes the test brittle against future wording changes. It’s safer to assert on stable substrings:• File: tests/unit/app/endpoints/test_conversations.py
Lines: ~613Suggested update:
- assert ( - "Unknown error while getting conversations for anonymous user" - in exc_info.value.detail["cause"] - ) + cause = exc_info.value.detail["cause"] + assert "Unknown error while getting conversations" in cause + assert "anonymous" in causeThis ensures the test still validates the key parts of the message without depending on its exact phrasing.
49-52: Consider centralizing DB engine/session mocks via an autouse fixtureTo avoid repeating the same
mocker.patch("app.database.engine", …)andmocker.patch("app.database.SessionLocal", …)in each helper, you can declare a module-scoped autouse fixture intests/unit/app/endpoints/test_conversations.py. This will ensure the real SQLAlchemy engine and sessionmaker (defined insrc/app/database.py) are always mocked for every test in that file.• At the top of
tests/unit/app/endpoints/test_conversations.py, add:+ import pytest + + @pytest.fixture(autouse=True) + def _patch_db_init(mocker): + # Prevent real DB engine/session creation anywhere in this module + mocker.patch("app.database.engine", mocker.Mock()) + mocker.patch("app.database.SessionLocal", mocker.Mock())• Remove the inline patches under
mock_database_session; they’re now redundant.This makes the test setup DRY, guarantees ordering, and keeps each test focused on endpoint behavior rather than DB setup.
tests/unit/utils/test_transcripts.py (2)
62-64: Great: deterministic anonymization for the test; also assert call plumbing uses the anon IDNice use of a stable mocked value ("anon-user-123"). Add assertions to guarantee store_transcript calls the anonymization and then uses the anonymous ID in construct_transcripts_path. This guards against regressions and verifies intent end-to-end in this test.
Apply this diff:
- mocker.patch( - "utils.transcripts.construct_transcripts_path", - return_value=mocker.MagicMock(), - ) - mocker.patch( - "utils.transcripts.get_anonymous_user_id", return_value="anon-user-123" - ) + patch_construct_path = mocker.patch( + "utils.transcripts.construct_transcripts_path", + return_value=mocker.MagicMock(), + ) + patch_get_anon = mocker.patch( + "utils.transcripts.get_anonymous_user_id", return_value="anon-user-123" + ) @@ store_transcript( user_id, conversation_id, model, provider, query_is_valid, query, query_request, summary, rag_chunks, truncated, attachments, ) + # Verify anonymization and path usage + patch_get_anon.assert_called_once_with(user_id) + patch_construct_path.assert_called_once_with("anon-user-123", conversation_id)
106-134: Metadata now uses anonymous_user_id — LGTM; consider also asserting no user_id leakageThe assertion correctly validates anonymous_user_id is stored. Optionally, add a negative assertion to ensure user_id is not present anywhere in metadata to prevent accidental PII leakage in future changes.
Optional follow-up:
mock_json.dump.assert_called_once_with( { "metadata": { "provider": "fake-provider", "model": "fake-model", "query_provider": query_request.provider, "query_model": query_request.model, "anonymous_user_id": "anon-user-123", "conversation_id": conversation_id, "timestamp": mocker.ANY, }, "redacted_query": query, "query_is_valid": query_is_valid, "llm_response": summary.llm_response, "rag_chunks": rag_chunks, "truncated": truncated, "attachments": attachments, "tool_calls": [ { "id": "123", "name": "test-tool", "args": "testing", "response": "tool response", } ], }, mocker.ANY, ) + # Defensive: ensure no raw user_id leaked into metadata + called_args, _ = mock_json.dump.call_args + assert "user_id" not in called_args[0]["metadata"]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
src/app/endpoints/conversations.py(3 hunks)src/app/endpoints/feedback.py(3 hunks)tests/unit/app/endpoints/test_conversations.py(5 hunks)tests/unit/app/endpoints/test_feedback.py(3 hunks)tests/unit/app/endpoints/test_feedback_anonymization.py(1 hunks)tests/unit/app/endpoints/test_query.py(3 hunks)tests/unit/utils/test_transcripts.py(2 hunks)tests/unit/utils/test_transcripts_anonymization.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/app/endpoints/conversations.py
- tests/unit/app/endpoints/test_query.py
- tests/unit/utils/test_transcripts_anonymization.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/unit/app/endpoints/test_feedback_anonymization.py (3)
src/app/endpoints/feedback.py (1)
store_feedback(124-161)tests/unit/utils/test_transcripts_anonymization.py (1)
setup_test_pepper(18-22)src/configuration.py (1)
user_data_collection_configuration(79-84)
src/app/endpoints/feedback.py (1)
src/utils/user_anonymization.py (1)
get_anonymous_user_id(43-105)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: e2e_tests
🔇 Additional comments (7)
src/app/endpoints/feedback.py (2)
22-22: Good: anonymization wired into the endpoint.Importing
get_anonymous_user_idlocally in the endpoint module keeps concerns separated and aligns with the PR’s anonymization strategy.
135-137: Log with anonymous IDs only — correct.Switching debug logs to the anonymized ID avoids leaking real user identifiers in logs while preserving traceability.
tests/unit/app/endpoints/test_feedback.py (1)
152-155: Expectation matches anonymized payload shape.Checking
anonymous_user_idand a timestamp (ANY) is correct and robust to format changes.tests/unit/app/endpoints/test_feedback_anonymization.py (3)
12-17: Good: test pepper is isolated per test run.Autouse fixture ensures anonymization hashing config is present across tests.
55-65: Anonymized payload assertions are precise.Verifying the absence of
user_idand presence ofanonymous_user_idand business fields looks good.
119-138: Logging test is effective.Ensures logs contain only the anonymous ID, aligning with privacy goals.
tests/unit/utils/test_transcripts.py (1)
53-56: Remove unnecessary database mocks in test_store_transcriptThe
store_transcriptfunction inutils/transcripts.pyneither imports nor usesapp.database.engineorSessionLocal, so patching them in the test has no effect. Please simplify the test by deleting these two lines intests/unit/utils/test_transcripts.py(around lines 53–56):- mocker.patch("app.database.engine", mocker.Mock()) - mocker.patch("app.database.SessionLocal", mocker.Mock())This cleanup removes redundant setup and keeps the test focused on the module under test.
Likely an incorrect or invalid review comment.
| current_time = str(datetime.now(UTC)) | ||
| data_to_store = {"user_id": user_id, "timestamp": current_time, **feedback} | ||
| data_to_store = { | ||
| "anonymous_user_id": anonymous_user_id, | ||
| "timestamp": current_time, | ||
| **feedback, | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Prevent client payload from overriding server-controlled fields + use ISO 8601 timestamps.
With the current dict merge order, keys in feedback can overwrite anonymous_user_id and timestamp, allowing a client to spoof these values and corrupt analytics. Also, prefer isoformat() for stable, machine-readable timestamps.
Apply this diff:
- current_time = str(datetime.now(UTC))
- data_to_store = {
- "anonymous_user_id": anonymous_user_id,
- "timestamp": current_time,
- **feedback,
- }
+ current_time = datetime.now(UTC).isoformat()
+ # Ensure server-controlled fields cannot be overridden by client payload
+ data_to_store = {
+ **feedback,
+ "anonymous_user_id": anonymous_user_id,
+ "timestamp": current_time,
+ }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| current_time = str(datetime.now(UTC)) | |
| data_to_store = {"user_id": user_id, "timestamp": current_time, **feedback} | |
| data_to_store = { | |
| "anonymous_user_id": anonymous_user_id, | |
| "timestamp": current_time, | |
| **feedback, | |
| } | |
| current_time = datetime.now(UTC).isoformat() | |
| # Ensure server-controlled fields cannot be overridden by client payload | |
| data_to_store = { | |
| **feedback, | |
| "anonymous_user_id": anonymous_user_id, | |
| "timestamp": current_time, | |
| } |
🤖 Prompt for AI Agents
In src/app/endpoints/feedback.py around lines 145 to 150, the current dict merge
lets client-sent feedback override server-controlled keys and uses
str(datetime.now(UTC)); change the merge order so feedback is merged first and
then the server-controlled fields overwrite it (i.e., build dict as {**feedback,
"anonymous_user_id": anonymous_user_id, "timestamp": timestamp}), and replace
str(datetime.now(UTC)) with datetime.now(UTC).isoformat() to produce a stable
ISO 8601 timestamp; ensure UTC is the timezone object used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/app/endpoints/query.py (1)
103-106: PII log scrub: resolvedThe debug log now excludes the original
user_id, addressing prior PII concerns.Run this quick scan to ensure no remaining logs across the repo still interpolate
user_iddirectly:#!/bin/bash # Logs that include 'user_id' (outside tests) rg -nP -g '!**/tests/**' --type py -C1 '\blogger\.(debug|info|warning|error)\s*\([^)]*user_id'
🧹 Nitpick comments (5)
tests/unit/app/endpoints/test_feedback.py (3)
147-149: Assert the anonymizer is called with the correct user_id (and capture the mock for later assertions).Good patch target. Capture the mock and assert its call arguments to guard against regressions where the wrong identifier is anonymized.
Apply this diff:
- mocker.patch( - "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" - ) + mock_get_anon = mocker.patch( + "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" + ) @@ - store_feedback(user_id, feedback_request_data) + store_feedback(user_id, feedback_request_data) + mock_get_anon.assert_called_once_with(user_id)
159-162: Strengthen the assertion to make the “no PII in payload” intent explicit.The equality check would already fail if user_id leaked, but making it explicit improves readability and future-proofing if more fields are added.
Add these assertions just after the existing assert_called_once_with:
stored_payload = mock_json.dump.call_args[0][0] assert "anonymous_user_id" in stored_payload assert "user_id" not in stored_payload
195-197: Add assertion for anonymization call in the error‐path testIn tests/unit/app/endpoints/test_feedback.py within
test_store_feedback_on_io_error, ensure we capture the anonymizer patch and assert it was called with the correct argument even whenopenfails:
- Change the patch line:
- mocker.patch( - "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" - ) + mock_get_anon = mocker.patch( + "app.endpoints.feedback.get_anonymous_user_id", return_value="anon-test-user" + )- After the
with pytest.raises(OSError, match="EACCES"):block, add:# verify anonymization was attempted before the I/O error mock_get_anon.assert_called_once_with(user_id)This guarantees that, even on failures,
get_anonymous_user_idis invoked with the intendeduser_id.src/app/endpoints/query.py (2)
82-85: Avoid doing the anonymization DB lookup twice per request (minor hot-path win)
get_anonymous_user_id()is called instore_transcript()(via its own internal lookup) and here again insidepersist_user_conversation_details(). Compute the anonymous ID once in the handler and pass it down to persistence to shave an extra DB read on a very hot path.Apply this focused refactor in this file (no behavior change, 1 fewer DB lookup):
@@ -def persist_user_conversation_details( - user_id: str, conversation_id: str, model: str, provider_id: str -) -> None: - """Associate conversation to user in the database using anonymous user ID.""" - # Get anonymous user ID for database storage - anonymous_user_id = get_anonymous_user_id(user_id) +def persist_user_conversation_details( + anonymous_user_id: str, conversation_id: str, model: str, provider_id: str +) -> None: + """Associate conversation to user in the database using anonymous_user_id.""" @@ - user_id, _, token = auth + user_id, _, token = auth + # Compute anonymous ID once and reuse + anonymous_user_id = get_anonymous_user_id(user_id) @@ - persist_user_conversation_details( - user_id=user_id, + persist_user_conversation_details( + anonymous_user_id=anonymous_user_id, conversation_id=conversation_id, model=model_id, provider_id=provider_id, )If you want to fully eliminate the double lookup, consider updating
store_transcript()to accept an optionalanonymous_user_id(outside this file).Also applies to: 186-187, 234-239
86-114: Handle insert/update race on conversation row and log only after a successful commitTwo requests for the same (anonymous_user_id, conversation_id) can race: both see “not existing” and try to insert. Today, that would raise an
IntegrityErrorand bubble up as a 500. Also, the success log is emitted beforecommit(), which can be misleading if the commit fails.Add minimal, localized error handling and move the success log after a successful commit:
@@ -from utils.user_anonymization import get_anonymous_user_id +from utils.user_anonymization import get_anonymous_user_id +from sqlalchemy.exc import IntegrityError @@ def persist_user_conversation_details( - with get_session() as session: - existing_conversation = ( - session.query(UserConversation) - .filter_by(id=conversation_id, anonymous_user_id=anonymous_user_id) - .first() - ) - - if not existing_conversation: - conversation = UserConversation( - id=conversation_id, - anonymous_user_id=anonymous_user_id, - last_used_model=model, - last_used_provider=provider_id, - message_count=1, - ) - session.add(conversation) - logger.debug( - "Associated conversation %s to anonymous user %s", - conversation_id, - anonymous_user_id, - ) - else: - existing_conversation.last_used_model = model - existing_conversation.last_used_provider = provider_id - existing_conversation.last_message_at = datetime.now(UTC) - existing_conversation.message_count += 1 - - session.commit() + with get_session() as session: + created = False + existing_conversation = ( + session.query(UserConversation) + .filter_by(id=conversation_id, anonymous_user_id=anonymous_user_id) + .first() + ) + try: + if not existing_conversation: + created = True + conversation = UserConversation( + id=conversation_id, + anonymous_user_id=anonymous_user_id, + last_used_model=model, + last_used_provider=provider_id, + message_count=1, + ) + session.add(conversation) + else: + existing_conversation.last_used_model = model + existing_conversation.last_used_provider = provider_id + existing_conversation.last_message_at = datetime.now(UTC) + existing_conversation.message_count += 1 + session.commit() + except IntegrityError as e: + session.rollback() + # Another request may have inserted/updated concurrently. + existing = ( + session.query(UserConversation) + .filter_by(id=conversation_id, anonymous_user_id=anonymous_user_id) + .first() + ) + if not existing: + raise + else: + if created: + logger.debug( + "Associated conversation %s to anonymous user %s", + conversation_id, + anonymous_user_id, + )This keeps behavior identical on success, but avoids 500s on benign races and prevents “success” logs for failed commits.
Also applies to: 103-106, 3-18
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
README.md(1 hunks)src/app/endpoints/conversations.py(4 hunks)src/app/endpoints/feedback.py(3 hunks)src/app/endpoints/query.py(2 hunks)src/app/endpoints/streaming_query.py(0 hunks)src/utils/endpoints.py(2 hunks)tests/unit/app/endpoints/test_conversations.py(5 hunks)tests/unit/app/endpoints/test_feedback.py(3 hunks)tests/unit/app/endpoints/test_query.py(3 hunks)
💤 Files with no reviewable changes (1)
- src/app/endpoints/streaming_query.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tests/unit/app/endpoints/test_conversations.py
- tests/unit/app/endpoints/test_query.py
- src/app/endpoints/feedback.py
- README.md
- src/app/endpoints/conversations.py
- src/utils/endpoints.py
🧰 Additional context used
🧬 Code graph analysis (1)
src/app/endpoints/query.py (3)
src/utils/user_anonymization.py (1)
get_anonymous_user_id(43-105)src/app/database.py (1)
get_session(34-40)src/models/database/conversations.py (1)
UserConversation(11-38)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-pr
- GitHub Check: e2e_tests
🔇 Additional comments (2)
src/app/endpoints/query.py (2)
39-39: Good: explicit import of anonymization utilityBringing
get_anonymous_user_idinto this module aligns the endpoint with the anonymization flow.
89-90: Correct pivot to anonymous_user_id in DB read/write pathsQuerying and persisting via
anonymous_user_idis consistent with the new schema and privacy goals. The additional filter onanonymous_user_idsafeguards accidental cross-user updates if ownership checks ever regress.Also applies to: 96-96
Description
I noticed an issue where it looked like anonymized transcripts were not being stored depending on the auth provider used. This is an attempt to fix that.
Type of change
Related Tickets & Documents
https://issues.redhat.com/browse/LCORE-579
Checklist before requesting a review
Testing
Code by claude
Claude Summary
User Anonymization System for Data Privacy
Summary
Implemented a comprehensive user anonymization system that assigns unique anonymous UUIDs to
users when they first authenticate. All transcript storage and conversation tracking now uses
these anonymous UUIDs instead of original user IDs, enabling user behavior analytics while
protecting individual privacy.
Changes Made
🗄️ Database Changes
New Model: src/models/database/user_mapping.py
Updated Model: src/models/database/conversations.py
🔧 Core Anonymization Service
New Service: src/utils/user_anonymization.py
📝 Transcript Storage Updates
Modified: src/utils/transcripts.py
🌐 API Endpoint Updates
Modified: src/app/endpoints/query.py
Modified: src/app/endpoints/conversations.py
Modified: src/utils/endpoints.py
🧪 Comprehensive Testing
New Test Files:
Total: 29 new unit tests covering all anonymization functionality
🔒 Security & Privacy Improvements
Before
After
📊 Analytics Capabilities Preserved
🔧 Implementation Details
Anonymous ID Generation
Creates consistent mapping: auth_user_id -> anonymous_uuid
anonymous_id = get_anonymous_user_id("user@example.com")
Returns same UUID for same user across all calls
Secure Hashing
SHA-256 with salt prevents rainbow table attacks
hash = sha256(f"{salt}:{user_id}".encode()).hexdigest()
Irreversible but consistent for same input
Race Condition Handling
🚀 Deployment Considerations
Database Migration Required
Backwards Compatibility
Configuration
🔍 Testing Coverage
📈 Benefits
🤖 Generated with https://claude.ai/code
Co-Authored-By: Claude noreply@anthropic.com
Summary by CodeRabbit
New Features
Behavior Changes
Tests
Documentation