Use Presidio for masking #40

steven10a · 2025-11-03T22:15:38Z

Using Presidio's built in AnonymizerEngine to handle the PII masking

Provides a more robust solution that handles overlapping entities
Remove checked_text field from checks that don't use them

Made PII detection more robust against obfuscation attempts

Implemented Unicode normalization
Added the ability to decode Base64, URL, and Hex encoded content for entity detection
Accept an optional parameter detect_encoded_pii that can be set via the config to turn this on

Added custom identifiers to Presidio

CVV/BIC detection and email-in-URL contexts

Updated docs and added more tests

…o dev/steven/pii_updates

Copilot

Pull Request Overview

This PR refactors the checked_text field handling across guardrails and significantly enhances the PII detection guardrail with advanced security features.

Key changes:

Removes checked_text from guardrails that don't modify text (moderation, keywords, URLs, secret keys, etc.)
Keeps checked_text only in PII guardrail where text is actively masked
Adds advanced PII detection: Unicode normalization, encoded PII detection (Base64/URL/hex), custom CVV/BIC_SWIFT recognizers
Refactors _apply_preflight_modifications to use Presidio's anonymizer engine directly for better performance and correctness

Reviewed Changes

Copilot reviewed 28 out of 28 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
src/guardrails/checks/text/pii.py	Major enhancement with Unicode normalization, encoded PII detection, custom recognizers (CVV, BIC_SWIFT), and Presidio anonymizer integration
src/guardrails/_base_client.py	Refactored PII masking to use guardrail's `checked_text` directly and Presidio for structured content; removed `summary_text` type
src/guardrails/checks/text/moderation.py	Removed `checked_text` field (no text modification)
src/guardrails/checks/text/keywords.py	Removed `checked_text` field (no text modification)
src/guardrails/checks/text/urls.py	Removed `checked_text` field (no text modification)
src/guardrails/checks/text/secret_keys.py	Removed `checked_text` field (no text modification)
src/guardrails/checks/text/prompt_injection_detection.py	Removed `checked_text` field (no text modification)
src/guardrails/checks/text/hallucination_detection.py	Removed `checked_text` parameter from error result creation
src/guardrails/checks/text/llm_base.py	Removed `checked_text` parameter from `create_error_result` function
tests/unit/test_base_client.py	Updated tests to reflect new PII guardrail result structure and removed `summary_text` type
tests/unit/checks/test_pii.py	Added comprehensive tests for Unicode normalization and encoded PII detection
tests/unit/checks/test_keywords.py	Removed assertion on `checked_text` field
tests/unit/test_safety_identifier.py	Removed trailing blank line
src/guardrails/utils/safety_identifier.py	Removed trailing blank line
pyproject.toml	Added `presidio-anonymizer>=2.2.360` dependency
examples/basic/pii_mask_example.py	Added CVV, BIC_SWIFT entities and enabled encoded PII detection
docs/ref/checks/*.md	Updated documentation to remove `checked_text` field references where appropriate

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/guardrails/_base_client.py

Copilot · 2025-11-04T22:23:34Z

src/guardrails/checks/text/pii.py

+                    len(candidate_lower) >= 3
+                    and any(  # Any 3-char chunk overlaps
+                        candidate_lower[i : i + 3] in detected_lower
+                        for i in range(0, len(candidate_lower) - 2, 2)  # Step by 2 for efficiency


The overlap detection logic performs substring matching in a loop that could be expensive for large texts. For a candidate of length N, this generates N/2 substring checks. Consider using a more efficient algorithm like rolling hash or KMP for substring matching, or at least increase the step size from 2 to a larger value like 4 or 5 to reduce iterations while still catching meaningful overlaps.

Suggested change

for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency

for i in range(0, len(candidate_lower) - 2, 4) # Step by 4 for better efficiency

WAI. We are only looking at detected entities for overlap (emails, ssn, location, etc...) so the strings will be relatively small and this is not an issue

Copilot · 2025-11-04T22:23:34Z

src/guardrails/checks/text/pii.py

+
+# Patterns for detecting encoded content
+# Note: Hex must be checked BEFORE Base64 since hex strings can match Base64 pattern
+_HEX_PATTERN = re.compile(r"\b[0-9a-fA-F]{24,}\b")  # Reduced from 32 to 24 (12 bytes min)


The hex pattern with a minimum of 24 characters will match many non-PII hex strings (UUIDs without dashes are 32 chars, but legitimate hex data like color codes or short hashes could be 24+ chars). This could lead to false positives where legitimate hex data is incorrectly flagged as encoded PII. Consider adding context-aware matching (e.g., preceded by 'hex:', 'encoded:', etc.) or increasing the minimum length to reduce false positives.

Suggested change

_HEX_PATTERN = re.compile(r"\b[0-9a-fA-F]{24,}\b") # Reduced from 32 to 24 (12 bytes min)

_HEX_PATTERN = re.compile(

r"(?:\b(?:hex:|encoded:|0x)[ ]*)[0-9a-fA-F]{24,}\b"

) # Only match hex strings with context prefix to reduce false positives

WAI. This feature is opt-in for heightened security so higher false positives is acceptable. It's purpose is to detect intentional obfuscation, so checking context wouldn't work.

Copilot · 2025-11-04T22:23:34Z

src/guardrails/checks/text/pii.py

+    # URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)
+    # This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly
+    url_decoded = urllib.parse.unquote(decoded_text)
+
+    # If URL decoding changed the text, track encoded spans for masking
+    if url_decoded != decoded_text:
+        # Find URL-encoded spans in the ORIGINAL text for masking purposes
+        for match in _URL_ENCODED_PATTERN.finditer(text):
+            if any(start <= match.start() < end or start < match.end() <= end for start, end in used_spans):
+                continue
+
+            decoded_chunk = urllib.parse.unquote(match.group())
+            if decoded_chunk != match.group():
+                candidates.append(
+                    EncodedCandidate(
+                        encoded_text=match.group(),
+                        decoded_text=decoded_chunk,
+                        encoding_type="url",
+                        start=match.start(),
+                        end=match.end(),
+                    )
+                )
+        decoded_text = url_decoded


The function calls urllib.parse.unquote on the entire decoded_text after Base64/hex replacements have been made. This means if decoded_text contains a legitimate %XX sequence that was part of a Base64 decode (not URL encoding), it will be incorrectly decoded again. For example, if a Base64 string decodes to 'test%20value', the %20 will be converted to a space, which may not be the intended behavior. Consider tracking which parts are URL-encoded separately before applying unquote.

Suggested change

# URL decode the ENTIRE text (handles partial encodings like %6a%61%6e%65%40securemail.net)

# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly

url_decoded = urllib.parse.unquote(decoded_text)

# If URL decoding changed the text, track encoded spans for masking

if url_decoded != decoded_text:

# Find URL-encoded spans in the ORIGINAL text for masking purposes

for match in _URL_ENCODED_PATTERN.finditer(text):

if any(start <= match.start() < end or start < match.end() <= end for start, end in used_spans):

continue

decoded_chunk = urllib.parse.unquote(match.group())

if decoded_chunk != match.group():

candidates.append(

EncodedCandidate(

encoded_text=match.group(),

decoded_text=decoded_chunk,

encoding_type="url",

start=match.start(),

end=match.end(),

)

)

decoded_text = url_decoded

# URL decode only actual URL-encoded spans (handles partial encodings like %6a%61%6e%65%40securemail.net)

# This must happen AFTER Base64/Hex replacement to handle mixed encodings correctly

url_candidates = []

# Find URL-encoded spans in the ORIGINAL text for masking purposes

for match in _URL_ENCODED_PATTERN.finditer(text):

if any(start <= match.start() < end or start < match.end() <= end for start, end in used_spans):

continue

decoded_chunk = urllib.parse.unquote(match.group())

if decoded_chunk != match.group():

url_candidates.append(

EncodedCandidate(

encoded_text=match.group(),

decoded_text=decoded_chunk,

encoding_type="url",

start=match.start(),

end=match.end(),

)

)

# Replace URL-encoded spans in decoded_text (from last to first to avoid offset issues)

url_candidates.sort(key=lambda c: c.start, reverse=True)

for candidate in url_candidates:

decoded_text = decoded_text[: candidate.start] + candidate.decoded_text + decoded_text[candidate.end :]

candidates.append(candidate)

WAI. This is acceptable

Copilot · 2025-11-04T22:23:35Z

src/guardrails/checks/text/pii.py

+
+        if found_entities:
+            # Mask the encoded version in original text
+            entity_marker = f"<{next(iter(found_entities))}_ENCODED>"


When multiple entity types overlap in the same encoded chunk, only the first entity type from found_entities is used for the marker. This could be misleading if an encoded string contains both an email and a phone number - only one will be indicated in the mask. Consider using all found entity types (e.g., <EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation.

Suggested change

entity_marker = f"<{next(iter(found_entities))}_ENCODED>"

entity_marker = f"<{'|'.join(sorted(found_entities))}_ENCODED>"

WAI. The entire encoded span is still being masked before being sent to the LLM. All of the detected entities are still being listed in the response object if a developer wants to see them.

src/guardrails/checks/text/pii.py

Copilot

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/guardrails/checks/text/pii.py

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 1 comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/guardrails/_base_client.py

Copilot

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 8 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-05T21:51:33Z

src/guardrails/checks/text/pii.py

+            has_overlap = (
+                candidate_lower in detected_lower  # Candidate is substring of detection
+                or detected_lower in candidate_lower  # Detection is substring of candidate
+                or (
+                    len(candidate_lower) >= 3
+                    and any(  # Any 3-char chunk overlaps
+                        candidate_lower[i : i + 3] in detected_lower
+                        for i in range(0, len(candidate_lower) - 2, 2)  # Step by 2 for efficiency
+                    )
+                )
+            )


[nitpick] The overlap detection logic in _mask_encoded_pii uses a nested loop with substring checks that could be O(n*m) in the worst case. For each candidate, it checks against all analyzer results, and within that, performs chunked substring matching. Consider using more efficient string matching algorithms (e.g., KMP or Boyer-Moore) for the chunked overlap detection, or pre-computing a set of 3-character chunks from detections for O(1) lookup.

WAI. Already addressed this comment. Checking small entities so this won't be an issue

Copilot · 2025-11-05T21:51:34Z

src/guardrails/checks/text/pii.py

+
+        if found_entities:
+            # Mask the encoded version in original text
+            entity_marker = f"<{next(iter(found_entities))}_ENCODED>"


When multiple entity types are found in a single encoded candidate (e.g., an encoded string containing both EMAIL_ADDRESS and PHONE_NUMBER), only the first entity type from the set is used for the marker. This could lead to misleading masking labels. Consider either using all entity types in the marker (e.g., <EMAIL_ADDRESS|PHONE_NUMBER_ENCODED>) or documenting this limitation that only one entity type is shown when multiple are detected.

Suggested change

entity_marker = f"<{next(iter(found_entities))}_ENCODED>"

entity_types_str = "|".join(sorted(found_entities))

entity_marker = f"<{entity_types_str}_ENCODED>"

WAI. Already addressed this comment

Copilot · 2025-11-05T21:51:34Z

src/guardrails/_base_client.py

+                            if detected_value in candidate.decoded_text:
+                                # Mask the encoded version
+                                entity_marker = f"<{entity_type}_ENCODED>"
+                                masked = masked[: candidate.start] + entity_marker + masked[candidate.end :]
+                                break


The condition if detected_value in candidate.decoded_text: may fail to find the correct candidate when multiple candidates exist or when the detected value spans multiple candidates. This could result in encoded PII not being masked. Consider checking overlap ranges more precisely using start/end positions or ensuring that the candidate list is properly sorted and all overlaps are checked.

Suggested change

if detected_value in candidate.decoded_text:

# Mask the encoded version

entity_marker = f"<{entity_type}_ENCODED>"

masked = masked[: candidate.start] + entity_marker + masked[candidate.end :]

break

# Check if detection overlaps candidate's decoded text range

candidate_decoded_start = getattr(candidate, "decoded_start", None)

candidate_decoded_end = getattr(candidate, "decoded_end", None)

if candidate_decoded_start is not None and candidate_decoded_end is not None:

if (

result.start < candidate_decoded_end

and result.end > candidate_decoded_start

):

# Mask the encoded version

entity_marker = f"<{entity_type}_ENCODED>"

masked = masked[: candidate.start] + entity_marker + masked[candidate.end :]

break

WAI. This is only with structured content which is checking each part individually. For unstructured content we have more sophisticated overlap checking

Copilot · 2025-11-05T21:51:34Z

src/guardrails/checks/text/pii.py

+    # CVV/CVC recognizer (3-4 digits, often near credit card context)
+    cvv_pattern = Pattern(
+        name="cvv_pattern",
+        regex=r"\b(?:cvv|cvc|security\s*code|card\s*code)[:\s=]*(\d{3,4})\b",


[nitpick] The CVV pattern requires a label prefix ('cvv', 'cvc', 'security code', 'card code') before the digits. This means standalone 3-4 digit numbers won't be detected as CVV codes, which could miss obfuscated CVV values. While this reduces false positives, it creates a security gap. Consider documenting this limitation or adding a configurable option for stricter CVV detection that catches standalone 3-4 digit sequences in financial contexts.

WAI. I am okay with this simple approach and we don't want to over flag 3 digit numbers

Copilot · 2025-11-05T21:51:35Z

src/guardrails/_base_client.py

 OpenAIResponseType = Union[Completion, ChatCompletion, ChatCompletionChunk, Response]  # noqa: UP007

+# Text content types recognized in message content parts
+_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text"}


The removal of 'summary_text' from _TEXT_CONTENT_TYPES means that content parts with type 'summary_text' will no longer have PII masking applied or text extracted. If 'summary_text' is still a valid content type in the system, this change could lead to PII leaks. Ensure this removal is intentional and that 'summary_text' is truly deprecated, or document why it should not be processed.

Suggested change

_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text"}

_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text", "summary_text"}

WAI. summary_text is not a valid input

Copilot · 2025-11-05T21:51:35Z

tests/unit/test_base_client.py

+
+    # Should detect and mask the obfuscated email
+    assert "<EMAIL_ADDRESS>" in modified[0]["content"][0]["text"]  # noqa: S101
+    assert "@" not in modified[0]["content"][0]["text"] or "＠" not in modified[0]["content"][0]["text"]  # noqa: S101


The assertion assert '@' not in modified[0]['content'][0]['text'] or '＠' not in modified[0]['content'][0]['text'] will pass if EITHER character is absent, which means the test passes even if the fullwidth '@' (＠) is still present. This should use and instead of or to ensure both the ASCII and fullwidth '@' characters are properly masked.

Suggested change

assert "@" not in modified[0]["content"][0]["text"] or "＠" not in modified[0]["content"][0]["text"] # noqa: S101

assert "@" not in modified[0]["content"][0]["text"] and "＠" not in modified[0]["content"][0]["text"] # noqa: S101

Copilot · 2025-11-05T21:51:35Z

src/guardrails/checks/text/pii.py

+    try:
+        decoded_bytes = base64.b64decode(text, validate=True)
+        # Security: Fail closed - reject content > 10KB to prevent memory DoS and PII bypass
+        if len(decoded_bytes) > 10_000:
+            msg = (
+                f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
+            )
+            raise ValueError(msg)


The 10KB size limit for decoded content is enforced by raising a ValueError, but this exception is not caught in the calling code (_build_decoded_text), which could cause the entire PII check to fail rather than gracefully skipping oversized encoded content. Consider catching this ValueError and logging a warning instead, or document that requests with large encoded payloads will be rejected entirely.

WAI. We want to fail if there is a massive encoded file. Checking for encoded content is opt-in so users can not default to not checking encoded content if they wish

Copilot · 2025-11-05T21:51:36Z

src/guardrails/_base_client.py

+            decoded_text = normalized



This assignment to 'decoded_text' is unnecessary as it is redefined before this value is used.

Suggested change

decoded_text = normalized

Copilot

Pull Request Overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 5 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2025-11-05T22:56:21Z

src/guardrails/checks/text/pii.py


+# Zero-width and invisible Unicode characters that can be used to bypass detection
+_ZERO_WIDTH_CHARS = re.compile(
+    r"[\u200b\u200c\u200d\u2060\ufeff]"  # Zero-width space, ZWNJ, ZWJ, word joiner, BOM


The comment mentions 'BOM' but the actual character \ufeff is the Zero Width No-Break Space, not the Byte Order Mark. While \ufeff can function as a BOM when at the beginning of a file, in the middle of text it's a zero-width no-break space. The comment should clarify this to avoid confusion: 'Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space'.

Suggested change

r"[\u200b\u200c\u200d\u2060\ufeff]" # Zero-width space, ZWNJ, ZWJ, word joiner, BOM

r"[\u200b\u200c\u200d\u2060\ufeff]" # Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space (\ufeff; BOM only at start of file)

Copilot · 2025-11-05T22:56:21Z

src/guardrails/checks/text/pii.py

+            msg = (
+                f"Base64 decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
+            )
+            raise ValueError(msg)


The error message leaks the size of decoded content, which could help attackers fingerprint the system or understand size limits for crafted payloads. Consider using a generic error message without the specific size: 'Base64 decoded content exceeds maximum allowed size of 10KB.'

That is okay. Returning to dev not user

Copilot · 2025-11-05T22:56:21Z

src/guardrails/checks/text/pii.py

+        msg = f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
+        raise ValueError(msg)


Similar to the Base64 case, the error message leaks the size of decoded content. Use a generic error message: 'Hex decoded content exceeds maximum allowed size of 10KB.'

Suggested change

msg = f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."

raise ValueError(msg)

raise ValueError("Hex decoded content exceeds maximum allowed size of 10KB.")

That is okay. Returning to dev not user

Copilot · 2025-11-05T22:56:22Z

src/guardrails/checks/text/pii.py

+                    len(candidate_lower) >= 3
+                    and any(  # Any 3-char chunk overlaps
+                        candidate_lower[i : i + 3] in detected_lower
+                        for i in range(0, len(candidate_lower) - 2, 2)  # Step by 2 for efficiency


The 3-character chunk overlap check at line 707-710 iterates with step=2 for efficiency, but this could miss valid overlaps. For example, if candidate_lower='abcdef' and detected_lower contains 'bcd', the check at i=0 ('abc') and i=2 ('cde') would both miss it. Consider using step=1 or documenting why step=2 is acceptable for this security-critical matching.

Suggested change

for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency

for i in range(0, len(candidate_lower) - 2) # Step by 1 to avoid missing overlaps

WAI. This is fine. This is the third check and stepping by 2 will only miss PII that is exactly 3 characters long and isn't caught by all of the other checks. Very rare and unrealistic case

Copilot · 2025-11-05T22:56:22Z

tests/unit/checks/test_pii.py

+async def test_pii_detects_base64_encoded_email() -> None:
+    """Base64-encoded email should be detected when flag is enabled."""
+    config = PIIConfig(entities=[PIIEntity.EMAIL_ADDRESS], block=False, detect_encoded_pii=True)
+    # am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com


[nitpick] The comment states the Base64 string decodes to 'john@example.com', but the actual decoded value is 'john@example.com' which is correct. However, for consistency with other test comments that specify the exact encoding (e.g., line 383 shows the URL-encoded version), consider verifying this comment is accurate by decoding the Base64 string.

Suggested change

# am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com

# am9obkBleGFtcGxlLmNvbQ== (base64 for 'john@example.com')

gabor-openai

LGTM

steven10a added 3 commits November 3, 2025 17:06

Use Presidio for masking

cb2dbdb

Improve PII to handle encoded content

d992fc5

Merge branch 'main' of github.com:openai/openai-guardrails-python int…

9a5a72b

…o dev/steven/pii_updates

steven10a marked this pull request as ready for review November 4, 2025 22:20

Copilot AI review requested due to automatic review settings November 4, 2025 22:20

Copilot AI reviewed Nov 4, 2025

View reviewed changes

Reject large encoded content as DOS

40aa80f

steven10a requested a review from Copilot November 5, 2025 15:38

Copilot AI reviewed Nov 5, 2025

View reviewed changes

src/guardrails/checks/text/pii.py Outdated Show resolved Hide resolved

src/guardrails/checks/text/pii.py Outdated Show resolved Hide resolved

src/guardrails/checks/text/pii.py Show resolved Hide resolved

Handle hex errors

62ed5e0

steven10a requested a review from Copilot November 5, 2025 15:49

Copilot AI reviewed Nov 5, 2025

View reviewed changes

steven10a requested a review from Copilot November 5, 2025 16:13

Copilot AI reviewed Nov 5, 2025

View reviewed changes

steven10a requested a review from Copilot November 5, 2025 21:20

Copilot AI reviewed Nov 5, 2025

View reviewed changes

src/guardrails/_base_client.py Outdated Show resolved Hide resolved

fix structured output masking path

a901d25

steven10a requested a review from Copilot November 5, 2025 21:48

Copilot AI reviewed Nov 5, 2025

View reviewed changes

address copilot nits

7be1b04

steven10a requested review from Copilot and gabor-openai November 5, 2025 22:53

Copilot AI reviewed Nov 5, 2025

View reviewed changes

gabor-openai approved these changes Nov 6, 2025

View reviewed changes

gabor-openai merged commit bf0cb52 into main Nov 6, 2025
9 checks passed

gabor-openai deleted the dev/steven/pii_updates branch November 6, 2025 00:52

	for i in range(0, len(candidate_lower) - 2, 2) # Step by 2 for efficiency
	for i in range(0, len(candidate_lower) - 2, 4) # Step by 4 for better efficiency

	entity_marker = f"<{next(iter(found_entities))}_ENCODED>"
	entity_marker = f"<{'\|'.join(sorted(found_entities))}_ENCODED>"

	entity_marker = f"<{next(iter(found_entities))}_ENCODED>"
	entity_types_str = "\|".join(sorted(found_entities))
	entity_marker = f"<{entity_types_str}_ENCODED>"

	_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text"}
	_TEXT_CONTENT_TYPES: Final[set[str]] = {"text", "input_text", "output_text", "summary_text"}

	assert "@" not in modified[0]["content"][0]["text"] or "＠" not in modified[0]["content"][0]["text"] # noqa: S101
	assert "@" not in modified[0]["content"][0]["text"] and "＠" not in modified[0]["content"][0]["text"] # noqa: S101

	r"[\u200b\u200c\u200d\u2060\ufeff]" # Zero-width space, ZWNJ, ZWJ, word joiner, BOM
	r"[\u200b\u200c\u200d\u2060\ufeff]" # Zero-width space, ZWNJ, ZWJ, word joiner, zero-width no-break space (\ufeff; BOM only at start of file)

		msg = f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
		raise ValueError(msg)

	msg = f"Hex decoded content too large ({len(decoded_bytes):,} bytes). Maximum allowed is 10KB."
	raise ValueError(msg)
	raise ValueError("Hex decoded content exceeds maximum allowed size of 10KB.")

	# am9obkBleGFtcGxlLmNvbQ== is base64 for john@example.com
	# am9obkBleGFtcGxlLmNvbQ== (base64 for 'john@example.com')

Use Presidio for masking #40

Use Presidio for masking #40

Conversation

steven10a commented Nov 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Nov 5, 2025

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

steven10a commented Nov 3, 2025 •

edited

Loading

steven10a Nov 5, 2025 •

edited

Loading