Skip to content

Conversation

@steven10a
Copy link
Collaborator

@steven10a steven10a commented Nov 3, 2025

Using Presidio's built in AnonymizerEngine to handle the PII masking

  • Provides a more robust solution that handles overlapping entities
  • Remove checked_text field from checks that don't use them

Made PII detection more robust against obfuscation attempts

  • Implemented Unicode normalization
  • Added the ability to decode Base64, URL, and Hex encoded content for entity detection
  • Accept an optional parameter detect_encoded_pii that can be set via the config to turn this on

Added custom identifiers to Presidio

  • CVV/BIC detection and email-in-URL contexts

Updated docs and added more tests

@steven10a steven10a marked this pull request as ready for review November 4, 2025 22:20
Copilot AI review requested due to automatic review settings November 4, 2025 22:20
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the checked_text field handling across guardrails and significantly enhances the PII detection guardrail with advanced security features.

Key changes:

  • Removes checked_text from guardrails that don't modify text (moderation, keywords, URLs, secret keys, etc.)
  • Keeps checked_text only in PII guardrail where text is actively masked
  • Adds advanced PII detection: Unicode normalization, encoded PII detection (Base64/URL/hex), custom CVV/BIC_SWIFT recognizers
  • Refactors _apply_preflight_modifications to use Presidio's anonymizer engine directly for better performance and correctness

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/guardrails/checks/text/pii.py Major enhancement with Unicode normalization, encoded PII detection, custom recognizers (CVV, BIC_SWIFT), and Presidio anonymizer integration
src/guardrails/_base_client.py Refactored PII masking to use guardrail's checked_text directly and Presidio for structured content; removed summary_text type
src/guardrails/checks/text/moderation.py Removed checked_text field (no text modification)
src/guardrails/checks/text/keywords.py Removed checked_text field (no text modification)
src/guardrails/checks/text/urls.py Removed checked_text field (no text modification)
src/guardrails/checks/text/secret_keys.py Removed checked_text field (no text modification)
src/guardrails/checks/text/prompt_injection_detection.py Removed checked_text field (no text modification)
src/guardrails/checks/text/hallucination_detection.py Removed checked_text parameter from error result creation
src/guardrails/checks/text/llm_base.py Removed checked_text parameter from create_error_result function
tests/unit/test_base_client.py Updated tests to reflect new PII guardrail result structure and removed summary_text type
tests/unit/checks/test_pii.py Added comprehensive tests for Unicode normalization and encoded PII detection
tests/unit/checks/test_keywords.py Removed assertion on checked_text field
tests/unit/test_safety_identifier.py Removed trailing blank line
src/guardrails/utils/safety_identifier.py Removed trailing blank line
pyproject.toml Added presidio-anonymizer>=2.2.360 dependency
examples/basic/pii_mask_example.py Added CVV, BIC_SWIFT entities and enabled encoded PII detection
docs/ref/checks/*.md Updated documentation to remove checked_text field references where appropriate

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

len(candidate_lower) >= 3
and any( # Any 3-char chunk overlaps
candidate_lower[i : i + 3] in detected_lower
for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overlap detection logic performs substring matching in a loop that could be expensive for large texts. For a candidate of length N, this generates N/2 substring checks. Consider using a more efficient algorithm like rolling hash or KMP for substring matching, or at least increase the step size from 2 to a larger value like 4 or 5 to reduce iterations while still catching meaningful overlaps.

Suggested change
for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency
for i in range(0, len(candidate_lower) - 2, 4) # Step by 4 for better efficiency

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. We are only looking at detected entities for overlap (emails, ssn, location, etc...) so the strings will be relatively small and this is not an issue


# Patterns for detecting encoded content
# Note: Hex must be checked BEFORE Base64 since hex strings can match Base64 pattern
_HEX_PATTERN = re.compile(r"\b[0-9a-fA-F]{24,}\b") # Reduced from 32 to 24 (12 bytes min)
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hex pattern with a minimum of 24 characters will match many non-PII hex strings (UUIDs without dashes are 32 chars, but legitimate hex data like color codes or short hashes could be 24+ chars). This could lead to false positives where legitimate hex data is incorrectly flagged as encoded PII. Consider adding context-aware matching (e.g., preceded by 'hex:', 'encoded:', etc.) or increasing the minimum length to reduce false positives.

Suggested change
_HEX_PATTERN = re.compile(r"\b[0-9a-fA-F]{24,}\b") # Reduced from 32 to 24 (12 bytes min)
_HEX_PATTERN = re.compile(
r"(?:\b(?:hex:|encoded:|0x)[ ]*)[0-9a-fA-F]{24,}\b"
) # Only match hex strings with context prefix to reduce false positives

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. This feature is opt-in for heightened security so higher false positives is acceptable. It's purpose is to detect intentional obfuscation, so checking context wouldn't work.

Comment on lines +537 to +559
# URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)
# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
url_decoded = urllib.parse.unquote(decoded_text)

# If URL decoding changed the text, track encoded spans for masking
if url_decoded != decoded_text:
# Find URL-encoded spans in the ORIGINAL text for masking purposes
for match in _URL_ENCODED_PATTERN.finditer(text):
if any(start <= match.start() < end or start < match.end() <= end for start, end in used_spans):
continue

decoded_chunk = urllib.parse.unquote(match.group())
if decoded_chunk != match.group():
candidates.append(
EncodedCandidate(
encoded_text=match.group(),
decoded_text=decoded_chunk,
encoding_type="url",
start=match.start(),
end=match.end(),
)
)
decoded_text = url_decoded
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function calls urllib.parse.unquote on the entire decoded_text after Base64/hex replacements have been made. This means if decoded_text contains a legitimate %XX sequence that was part of a Base64 decode (not URL encoding), it will be incorrectly decoded again. For example, if a Base64 string decodes to 'test%20value', the %20 will be converted to a space, which may not be the intended behavior. Consider tracking which parts are URL-encoded separately before applying unquote.

Suggested change
# URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)
# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
url_decoded = urllib.parse.unquote(decoded_text)
# If URL decoding changed the text, track encoded spans for masking
if url_decoded != decoded_text:
# Find URL-encoded spans in the ORIGINAL text for masking purposes
for match in _URL_ENCODED_PATTERN.finditer(text):
if any(start <= match.start() < end or start < match.end() <= end for start, end in used_spans):
continue
decoded_chunk = urllib.parse.unquote(match.group())
if decoded_chunk != match.group():
candidates.append(
EncodedCandidate(
encoded_text=match.group(),
decoded_text=decoded_chunk,
encoding_type="url",
start=match.start(),
end=match.end(),
)
)
decoded_text = url_decoded
# URL decode only actual URL-encoded spans (handles partial encodings like %6a%61%6e%65%40securemail.net)
# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
url_candidates = []
# Find URL-encoded spans in the ORIGINAL text for masking purposes
for match in _URL_ENCODED_PATTERN.finditer(text):
if any(start <= match.start() < end or start < match.end() <= end for start, end in used_spans):
continue
decoded_chunk = urllib.parse.unquote(match.group())
if decoded_chunk != match.group():
url_candidates.append(
EncodedCandidate(
encoded_text=match.group(),
decoded_text=decoded_chunk,
encoding_type="url",
start=match.start(),
end=match.end(),
)
)
# Replace URL-encoded spans in decoded_text (from last to first to avoid offset issues)
url_candidates.sort(key=lambda c: c.start, reverse=True)
for candidate in url_candidates:
decoded_text = decoded_text[: candidate.start] + candidate.decoded_text + decoded_text[candidate.end :]
candidates.append(candidate)

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. This is acceptable


if found_entities:
# Mask the encoded version in original text
entity_marker = f"<{next(iter(found_entities))}_ENCODED>"
Copy link

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When multiple entity types overlap in the same encoded chunk, only the first entity type from found_entities is used for the marker. This could be misleading if an encoded string contains both an email and a phone number - only one will be indicated in the mask. Consider using all found entity types (e.g., <EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation.

Suggested change
entity_marker = f"<{next(iter(found_entities))}_ENCODED>"
entity_marker = f"<{'|'.join(sorted(found_entities))}_ENCODED>"

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. The entire encoded span is still being masked before being sent to the LLM. All of the detected entities are still being listed in the response object if a developer wants to see them.

@steven10a steven10a requested a review from Copilot November 5, 2025 15:38
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@steven10a steven10a requested a review from Copilot November 5, 2025 15:49
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@steven10a steven10a requested a review from Copilot November 5, 2025 16:13
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@steven10a steven10a requested a review from Copilot November 5, 2025 21:20
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@steven10a steven10a requested a review from Copilot November 5, 2025 21:48
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +702 to +712
has_overlap = (
candidate_lower in detected_lower # Candidate is substring of detection
or detected_lower in candidate_lower # Detection is substring of candidate
or (
len(candidate_lower) >= 3
and any( # Any 3-char chunk overlaps
candidate_lower[i : i + 3] in detected_lower
for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency
)
)
)
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The overlap detection logic in _mask_encoded_pii uses a nested loop with substring checks that could be O(n*m) in the worst case. For each candidate, it checks against all analyzer results, and within that, performs chunked substring matching. Consider using more efficient string matching algorithms (e.g., KMP or Boyer-Moore) for the chunked overlap detection, or pre-computing a set of 3-character chunks from detections for O(1) lookup.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. Already addressed this comment. Checking small entities so this won't be an issue


if found_entities:
# Mask the encoded version in original text
entity_marker = f"<{next(iter(found_entities))}_ENCODED>"
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When multiple entity types are found in a single encoded candidate (e.g., an encoded string containing both EMAIL_ADDRESS and PHONE_NUMBER), only the first entity type from the set is used for the marker. This could lead to misleading masking labels. Consider either using all entity types in the marker (e.g., <EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation that only one entity type is shown when multiple are detected.

Suggested change
entity_marker = f"<{next(iter(found_entities))}_ENCODED>"
entity_types_str = "|".join(sorted(found_entities))
entity_marker = f"<{entity_types_str}_ENCODED>"

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. Already addressed this comment

Comment on lines +322 to +326
if detected_value in candidate.decoded_text:
# Mask the encoded version
entity_marker = f"<{entity_type}_ENCODED>"
masked = masked[: candidate.start] + entity_marker + masked[candidate.end :]
break
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The condition if detected_value in candidate.decoded_text: may fail to find the correct candidate when multiple candidates exist or when the detected value spans multiple candidates. This could result in encoded PII not being masked. Consider checking overlap ranges more precisely using start/end positions or ensuring that the candidate list is properly sorted and all overlaps are checked.

Suggested change
if detected_value in candidate.decoded_text:
# Mask the encoded version
entity_marker = f"<{entity_type}_ENCODED>"
masked = masked[: candidate.start] + entity_marker + masked[candidate.end :]
break
# Check if detection overlaps candidate's decoded text range
candidate_decoded_start = getattr(candidate, "decoded_start", None)
candidate_decoded_end = getattr(candidate, "decoded_end", None)
if candidate_decoded_start is not None and candidate_decoded_end is not None:
if (
result.start < candidate_decoded_end
and result.end > candidate_decoded_start
):
# Mask the encoded version
entity_marker = f"<{entity_type}_ENCODED>"
masked = masked[: candidate.start] + entity_marker + masked[candidate.end :]
break

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. This is only with structured content which is checking each part individually. For unstructured content we have more sophisticated overlap checking

# CVV/CVC recognizer (3-4 digits, often near credit card context)
cvv_pattern = Pattern(
name="cvv_pattern",
regex=r"\b(?:cvv|cvc|security\s*code|card\s*code)[:\s=]*(\d{3,4})\b",
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The CVV pattern requires a label prefix ('cvv', 'cvc', 'security code', 'card code') before the digits. This means standalone 3-4 digit numbers won't be detected as CVV codes, which could miss obfuscated CVV values. While this reduces false positives, it creates a security gap. Consider documenting this limitation or adding a configurable option for stricter CVV detection that catches standalone 3-4 digit sequences in financial contexts.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. I am okay with this simple approach and we don't want to over flag 3 digit numbers

OpenAIResponseType = Union[Completion, ChatCompletion, ChatCompletionChunk, Response] # noqa: UP007

# Text content types recognized in message content parts
_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text"}
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of 'summary_text' from _TEXT_CONTENT_TYPES means that content parts with type 'summary_text' will no longer have PII masking applied or text extracted. If 'summary_text' is still a valid content type in the system, this change could lead to PII leaks. Ensure this removal is intentional and that 'summary_text' is truly deprecated, or document why it should not be processed.

Suggested change
_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text"}
_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text", "summary_text"}

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. summary_text is not a valid input


# Should detect and mask the obfuscated email
assert "<EMAIL_ADDRESS>" in modified[0]["content"][0]["text"] # noqa: S101
assert "@" not in modified[0]["content"][0]["text"] or "@" not in modified[0]["content"][0]["text"] # noqa: S101
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion assert '@' not in modified[0]['content'][0]['text'] or '@' not in modified[0]['content'][0]['text'] will pass if EITHER character is absent, which means the test passes even if the fullwidth '@' (@) is still present. This should use and instead of or to ensure both the ASCII and fullwidth '@' characters are properly masked.

Suggested change
assert "@" not in modified[0]["content"][0]["text"] or "@" not in modified[0]["content"][0]["text"] # noqa: S101
assert "@" not in modified[0]["content"][0]["text"] and "@" not in modified[0]["content"][0]["text"] # noqa: S101

Copilot uses AI. Check for mistakes.
Comment on lines +459 to +466
try:
decoded_bytes = base64.b64decode(text, validate=True)
# Security: Fail closed - reject content > 10KB to prevent memory DoS and PII bypass
if len(decoded_bytes) > 10_000:
msg = (
f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
)
raise ValueError(msg)
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 10KB size limit for decoded content is enforced by raising a ValueError, but this exception is not caught in the calling code (_build_decoded_text), which could cause the entire PII check to fail rather than gracefully skipping oversized encoded content. Consider catching this ValueError and logging a warning instead, or document that requests with large encoded payloads will be rejected entirely.

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

@steven10a steven10a Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. We want to fail if there is a massive encoded file. Checking for encoded content is opt-in so users can not default to not checking encoded content if they wish

Comment on lines 290 to 291
decoded_text = normalized

Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assignment to 'decoded_text' is unnecessary as it is redefined before this value is used.

Suggested change
decoded_text = normalized

Copilot uses AI. Check for mistakes.
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


# Zero-width and invisible Unicode characters that can be used to bypass detection
_ZERO_WIDTH_CHARS = re.compile(
r"[\u200b\u200c\u200d\u2060\ufeff]" # Zero-width space, ZWNJ, ZWJ, word joiner, BOM
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions 'BOM' but the actual character \ufeff is the Zero Width No-Break Space, not the Byte Order Mark. While \ufeff can function as a BOM when at the beginning of a file, in the middle of text it's a zero-width no-break space. The comment should clarify this to avoid confusion: 'Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space'.

Suggested change
r"[\u200b\u200c\u200d\u2060\ufeff]" # Zero-width space, ZWNJ, ZWJ, word joiner, BOM
r"[\u200b\u200c\u200d\u2060\ufeff]" # Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space (\ufeff; BOM only at start of file)

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit]

Comment on lines +463 to +466
msg = (
f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
)
raise ValueError(msg)
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message leaks the size of decoded content, which could help attackers fingerprint the system or understand size limits for crafted payloads. Consider using a generic error message without the specific size: 'Base64 decoded content exceeds maximum allowed size of 10KB.'

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is okay. Returning to dev not user

Comment on lines +496 to +497
msg = f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
raise ValueError(msg)
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the Base64 case, the error message leaks the size of decoded content. Use a generic error message: 'Hex decoded content exceeds maximum allowed size of 10KB.'

Suggested change
msg = f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
raise ValueError(msg)
raise ValueError("Hex decoded content exceeds maximum allowed size of 10KB.")

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is okay. Returning to dev not user

len(candidate_lower) >= 3
and any( # Any 3-char chunk overlaps
candidate_lower[i : i + 3] in detected_lower
for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 3-character chunk overlap check at line 707-710 iterates with step=2 for efficiency, but this could miss valid overlaps. For example, if candidate_lower='abcdef' and detected_lower contains 'bcd', the check at i=0 ('abc') and i=2 ('cde') would both miss it. Consider using step=1 or documenting why step=2 is acceptable for this security-critical matching.

Suggested change
for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency
for i in range(0, len(candidate_lower) - 2) # Step by 1 to avoid missing overlaps

Copilot uses AI. Check for mistakes.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WAI. This is fine. This is the third check and stepping by 2 will only miss PII that is exactly 3 characters long and isn't caught by all of the other checks. Very rare and unrealistic case

async def test_pii_detects_base64_encoded_email() -> None:
"""Base64-encoded email should be detected when flag is enabled."""
config = PIIConfig(entities=[PIIEntity.EMAIL_ADDRESS], block=False, detect_encoded_pii=True)
# am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com
Copy link

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The comment states the Base64 string decodes to 'john@example.com', but the actual decoded value is 'john@example.com' which is correct. However, for consistency with other test comments that specify the exact encoding (e.g., line 383 shows the URL-encoded version), consider verifying this comment is accurate by decoding the Base64 string.

Suggested change
# am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com
# am9obkBleGFtcGxlLmNvbQ== (base64 for 'john@example.com')

Copilot uses AI. Check for mistakes.
Copy link
Collaborator

@gabor-openai gabor-openai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gabor-openai gabor-openai merged commit bf0cb52 into main Nov 6, 2025
9 checks passed
@gabor-openai gabor-openai deleted the dev/steven/pii_updates branch November 6, 2025 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants