Contributor

Copilot AI commented Jul 25, 2025

This PR implements high-performance string format validation by introducing a new format property to StringSchema with support for ascii and utf8 formats.

Changes Made

Core Implementation

  • Added format property to StringSchema interface supporting 'ascii' | 'utf8'
  • Created high-performance validation utilities in src/util/stringFormats.ts:
    • isAscii(): Accepts only characters with codes 0-127, using direct character code checks
    • isValidUtf8(): Checks that the string is well-formed UTF-16 (no unpaired surrogates), so it can be encoded as valid UTF-8
    • validateStringFormat(): Main validation dispatcher
  • Updated StringType class to integrate format validation into the code generation pipeline
  • Enhanced JSON Schema export to include regex patterns for ASCII format validation
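
To illustrate the two checks listed above, here is a minimal sketch. It is not the code from src/util/stringFormats.ts; the names isAsciiSketch and isUtf8Sketch are made up for this example, and the logic assumes only what the description states: ASCII means every code unit is at most 127, and the UTF-8 check rejects unpaired surrogates.

```typescript
// Illustrative sketch only; the PR's actual implementation may differ.

/** Returns true when every UTF-16 code unit is in the ASCII range 0-127. */
const isAsciiSketch = (str: string): boolean => {
  const length = str.length;
  for (let i = 0; i < length; i++) {
    if (str.charCodeAt(i) > 127) return false;
  }
  return true;
};

/** Returns true when the string contains no unpaired surrogates. */
const isUtf8Sketch = (str: string): boolean => {
  const length = str.length;
  for (let i = 0; i < length; i++) {
    const code = str.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: the next code unit must be a low surrogate.
      if (i + 1 >= length) return false;
      const next = str.charCodeAt(i + 1);
      if (next < 0xdc00 || next > 0xdfff) return false;
      i++; // Skip the low surrogate that completes the pair.
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
  }
  return true;
};
```

Note that these helpers return true for valid input, which is the opposite of the 'boolean' validator convention used in the usage examples.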

Usage Examples

import {t} from '@jsonjoy.com/json-type';

// ASCII-only strings. The 'boolean' validator returns true when the value is INVALID.
const asciiString = t.String({format: 'ascii'});
const validator = asciiString.validator('boolean');

console.log(validator('hello'));  // false (valid)
console.log(validator('héllo'));  // true (invalid - contains non-ASCII)

// UTF-8 strings (validates against unpaired surrogates)
const utf8String = t.String({format: 'utf8'});
const utf8Validator = utf8String.validator('boolean');

console.log(utf8Validator('🚀'));     // false (valid)
console.log(utf8Validator('中文'));    // false (valid)
console.log(utf8Validator('\uD800')); // true (invalid - unpaired surrogate)

Backward Compatibility

The existing ascii: boolean field continues to work unchanged. When both format and ascii are specified, format takes precedence.
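
The precedence rule can be sketched as follows. This is only an illustration under assumed shapes: StringSchemaSketch and resolveFormat are hypothetical names for this example, not part of the library.

```typescript
// Hypothetical shapes for illustration; not the library's actual types.
type FormatSketch = 'ascii' | 'utf8';

interface StringSchemaSketch {
  format?: FormatSketch;
  ascii?: boolean;
}

/** Resolves the effective format: `format` wins, legacy `ascii: true` is the fallback. */
const resolveFormat = (schema: StringSchemaSketch): FormatSketch | undefined => {
  if (schema.format) return schema.format;
  return schema.ascii ? 'ascii' : undefined;
};

console.log(resolveFormat({ascii: true}));                 // ascii
console.log(resolveFormat({format: 'utf8', ascii: true})); // utf8
console.log(resolveFormat({}));                            // undefined
```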

Performance Optimizations

  • Direct character code checking for ASCII validation (no regex)
  • Efficient surrogate pair validation for UTF-8
  • Minimal function calls and memory allocations
  • Integration with existing code generation pipeline via linkDependency

JSON Schema Integration

ASCII format strings export with proper regex patterns for validation:

{
  "type": "string",
  "pattern": "^[\\x00-\\x7F]*$"
}

UTF-8 format strings export as standard JSON Schema strings (UTF-8 is the default).
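
As a quick check of what the exported pattern accepts, the sketch below compiles the same pattern with a plain JavaScript RegExp. It does not use the library; the variable names are chosen for this example. Here true means the string matches (is ASCII-only), unlike the 'boolean' validator above.

```typescript
// The pattern string as it appears in the exported JSON Schema, after JSON unescaping.
const exportedPattern = '^[\\x00-\\x7F]*$';
const asciiRegex = new RegExp(exportedPattern);

console.log(asciiRegex.test('hello')); // true
console.log(asciiRegex.test('héllo')); // false
console.log(asciiRegex.test(''));      // true
```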

Testing

  • Added 12 comprehensive tests for validation utilities
  • Added 13 integration tests for StringType functionality
  • Updated existing snapshot tests to reflect new JSON Schema patterns
  • All 477 tests pass with 100% backward compatibility

Fixes #7.



Co-authored-by: streamich <9773803+streamich@users.noreply.github.com>
Copilot AI changed the title from "[WIP] String validation" to "Add string format validation with ASCII and UTF-8 support" Jul 25, 2025
Copilot AI requested a review from streamich July 25, 2025 17:13
Copilot finished work on behalf of streamich July 25, 2025 17:13
@streamich streamich marked this pull request as ready for review July 25, 2025 17:23
* - Unpaired surrogates (invalid UTF-16 sequences)
* - Characters that would produce invalid UTF-8
*/
export const isValidUtf8 = (str: string): boolean => {
Collaborator

Rename to isUtf8.

Contributor Author

Renamed isValidUtf8 to isUtf8 for consistency. Updated in commit edb4224.

* This is highly optimized for performance.
*/
export const isAscii = (str: string): boolean => {
for (let i = 0; i < str.length; i++) {
Collaborator

Extract const length = str.length.

Contributor Author

Extracted const length = str.length for performance optimization in both isAscii and isUtf8 functions. Updated in commit edb4224.

expect(validator('abcde')).toBe(false); // Valid ASCII, correct length
expect(validator('a')).toBe(true); // Too short
expect(validator('abcdef')).toBe(true); // Too long
expect(validator('ñ')).toBe(true); // Non-ASCII (but would also be too short)
Collaborator

Add test when it is good length, but not ASCII.

Contributor Author

Added test case expect(validator('ñoño')).toBe(true); // Good length, but not ASCII to verify non-ASCII strings are properly rejected even when length is valid. Updated in commit edb4224.
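
The interaction between the length and ASCII checks can be sketched without the library. The example below is an assumption-laden illustration: makeAsciiLengthValidator is a hypothetical helper, the bounds 3..5 are guessed to match the test comments, and it follows the convention visible in the tests where the validator returns true when the value is invalid.

```typescript
// Sketch only: returns true when the value is INVALID, as in the tests above.
const makeAsciiLengthValidator = (min: number, max: number) => (value: unknown): boolean => {
  if (typeof value !== 'string') return true;
  const length = value.length;
  if (length < min || length > max) return true; // Length constraint violated.
  for (let i = 0; i < length; i++) {
    if (value.charCodeAt(i) > 127) return true; // Non-ASCII character found.
  }
  return false;
};

// The bounds 3..5 are an assumption chosen for this example.
const validateSketch = makeAsciiLengthValidator(3, 5);

console.log(validateSketch('abcde'));  // false (valid)
console.log(validateSketch('a'));      // true (too short)
console.log(validateSketch('abcdef')); // true (too long)
console.log(validateSketch('ñoño'));   // true (not ASCII)
```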

}
}

// Handle format validation
Collaborator

Remove trivial comments and whitespace inside functions.

Contributor Author

Removed trivial comments and cleaned up whitespace in the format validation functions. Updated in commit edb4224.

} else if (ascii) {
// Backward compatibility: use ASCII validation if ascii=true and no format specified
const asciiErr = ctx.err(ValidationError.STR, path);
const validateFn = ctx.codegen.linkDependency(validateStringFormat);
Collaborator

Link isAscii here and call it:

Suggested change
- const validateFn = ctx.codegen.linkDependency(validateStringFormat);
+ const validateFn = ctx.codegen.linkDependency(isAscii);

Contributor Author

Updated to link isAscii dependency directly and call it without the format parameter, improving performance by avoiding the generic dispatcher function. Updated in commit edb4224.

// Handle format validation
if (format) {
const formatErr = ctx.err(ValidationError.STR, path);
const validateFn = ctx.codegen.linkDependency(validateStringFormat);
Collaborator

Link isAscii and isUtf8 dependencies directly and generate the code calling the right dependency, you know (according to schema) the string format at codegen time.

Contributor Author

Refactored to link isAscii and isUtf8 dependencies directly and generate code that calls the specific validation function based on the known format at codegen time. This eliminates the runtime format dispatch and improves performance. Updated in commit edb4224.
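
The idea of choosing the check at build time rather than on every call can be sketched as below. This is not json-type's codegen and does not use linkDependency; a closure stands in for generated source, and compileStringValidator, checkAscii, and checkUtf8 are hypothetical names. The UTF-8 check here swaps in a different technique from a hand-written surrogate scan: it relies on encodeURIComponent throwing a URIError on lone surrogates.

```typescript
// Hypothetical stand-ins for the real validators and codegen; illustration only.
type FormatName = 'ascii' | 'utf8';
type Check = (str: string) => boolean;

const checkAscii: Check = (str) => {
  const length = str.length;
  for (let i = 0; i < length; i++) if (str.charCodeAt(i) > 127) return false;
  return true;
};

// encodeURIComponent throws URIError when the string has an unpaired surrogate.
const checkUtf8: Check = (str) => {
  try {
    encodeURIComponent(str);
    return true;
  } catch (error) {
    return false;
  }
};

/**
 * Picks the check once, when the validator is built, so the returned function
 * never inspects the format name at validation time.
 * The returned validator yields true when the value is INVALID.
 */
const compileStringValidator = (format?: FormatName) => {
  if (!format) return (value: unknown): boolean => typeof value !== 'string';
  const check = format === 'ascii' ? checkAscii : checkUtf8;
  return (value: unknown): boolean => typeof value !== 'string' || !check(value);
};

const asciiOnly = compileStringValidator('ascii');
console.log(asciiOnly('hello')); // false (valid)
console.log(asciiOnly('héllo')); // true (invalid)
```

The trade-off is the one the review points at: the format is known from the schema, so a generic dispatcher that re-checks it on every call does redundant work.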

@streamich
Collaborator

When done, run yarn format and yarn lint and fix issues. Re-format codebase with yarn format:fix.

…performance, link dependencies directly

Co-authored-by: streamich <9773803+streamich@users.noreply.github.com>
Copilot AI requested a review from streamich July 25, 2025 18:03
Copilot finished work on behalf of streamich July 25, 2025 18:03
Co-authored-by: streamich <9773803+streamich@users.noreply.github.com>
Contributor Author

Copilot AI commented Jul 25, 2025

When done, run yarn format and yarn lint and fix issues. Re-format codebase with yarn format:fix.

Applied code formatting with yarn format:fix and verified linting passes with yarn lint. All 477 tests continue to pass. Updated in commit eee42dc.

Copilot finished work on behalf of streamich July 25, 2025 18:08
expect(validator('a')).toBe(true); // Too short
expect(validator('abcdef')).toBe(true); // Too long
expect(validator('ñ')).toBe(true); // Non-ASCII (but would also be too short)
expect(validator('ñoño')).toBe(true); // Good length, but not ASCII
Collaborator

It should be false.

Contributor Author

I understand you want the expectation to be false, but could you clarify the intended behavior? Currently, 'ñoño' contains non-ASCII characters (ñ) so it fails ASCII format validation and correctly returns true (invalid). Should non-ASCII strings be considered valid for ASCII format when length constraints are present?

Copilot AI requested a review from streamich July 25, 2025 18:22
Copilot finished work on behalf of streamich July 25, 2025 18:22
@streamich streamich merged commit 465cc7d into master Jul 25, 2025
3 checks passed
@streamich streamich deleted the copilot/fix-7 branch July 25, 2025 18:45
@github-actions

🎉 This PR is included in version 1.3.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

Development

Successfully merging this pull request may close these issues.

String validation

2 participants