Update ICU to 78 #177
Closed
aoruganti-msft wants to merge 37 commits into
Upstream ref: upstream/maint/maint-78
Upstream SHA: 21d1eb0f306e1141c10931e914dfc038c06121da
Previous version: 72.1.0.4
New version: 78.3.0.0
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ion_number
Windows OS ICU build uses a versionless data filename (icudtl.dat) instead of the versioned one (icudt78l.dat), so the filename does not churn with each upgrade. Guarded by ICU_DATA_DIR_WINDOWS; the public/SDK build is unaffected.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
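The naming rule in this commit can be sketched as follows. This is a minimal illustration, not the patch itself: `dataFileName` and its `windowsOsBuild` parameter are hypothetical, with the boolean standing in for the ICU_DATA_DIR_WINDOWS compile-time guard.

```cpp
#include <string>

// Hypothetical helper (not ICU code): picks the common-data file name.
// `windowsOsBuild` stands in for the ICU_DATA_DIR_WINDOWS compile-time guard.
std::string dataFileName(bool windowsOsBuild, int majorVersion) {
    if (windowsOsBuild) {
        // Versionless name: stays the same across ICU upgrades.
        return "icudtl.dat";
    }
    // Versioned little-endian name used by the public/SDK build.
    return "icudt" + std::to_string(majorVersion) + "l.dat";
}
```

With major version 78, the non-Windows-OS branch yields icudt78l.dat, which is the name the later extendICUData() commit refers to.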
Add IGNORE_WINDOWS_HEADERS_START/END markers around regions of putil.h and unistr.h that should be stripped from the Windows OS SDK header. putil.h: data-directory + timezone-files-directory + filesystem-separator constants are not user-mutable in Windows OS ICU. unistr.h: UStringCaseMapper internal callback typedef is meaningless to SDK consumers that don't expose C++ UnicodeString. No C/C++ semantics change; markers are pure comments interpreted by the Windows SDK header-stripping tool only. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add IGNORE_WINDOWS_HEADERS_START/END markers around regions of 4 public ICU headers so the Windows SDK header-stripping tool omits them.
Reworked from ICU 72 form (8/12 hunks landed at new offsets, 2 hand-ported, 2 dropped):
- uchar.h: wraps U_UNICODE_VERSION macro (runtime-variable; SDK consumers should use u_getUnicodeVersion() API instead).
- uconfig.h: wraps uconfig_local.h include hook and UCONFIG_USE_WINDOWS_LCID_MAPPING_API switch (compile-time settings irrelevant to SDK).
- utypes.h: wraps ICUDATA naming scheme constants (Windows OS uses a fixed single-data-file layout).
- uversion.h: wraps U_NAMESPACE_BEGIN/END and C++ namespace plumbing (Windows OS SDK exposes flat C APIs only).
Dropped umachine.h hunks: U_OVERRIDE and U_FINAL macros no longer exist in ICU 78 (upstream removed them in favor of using the C++11 keywords directly). The patch's intent for that file is resolved by upstream.
No C/C++ semantics change; markers are comments consumed only by the Windows SDK header-stripping tool.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make u_cleanup() a no-op for the Windows OS ICU build to prevent race-condition crashes when multiple threads (Windows.Globalization, default OS sort, app code) are concurrently using ICU.
- Real implementation renamed to uprv_u_cleanup() (private; combined DLL can still call it; not exported from DEF).
- New public u_cleanup() is a no-op under ICU_DATA_DIR_WINDOWS; otherwise it delegates to uprv_u_cleanup() so public/NuGet consumers retain the original behavior.
Reworked: ICU 78 modernized the function signature from (void) to () and NULL to nullptr; the rework matches the new style.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
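The split between the public entry point and the private implementation might look roughly like this. It is a self-contained sketch under stated assumptions: the `_sketch` names, the counter, and `kWindowsOsBuild` are illustrative stand-ins, and the real patch works on ICU's own u_cleanup() and uprv_u_cleanup().

```cpp
// Illustrative stand-ins only; the real functions live in ICU's common library.
#if defined(ICU_DATA_DIR_WINDOWS)
constexpr bool kWindowsOsBuild = true;
#else
constexpr bool kWindowsOsBuild = false;
#endif

// Counts how often the "real" cleanup ran, so the effect is observable.
static int g_cleanupRuns = 0;

// Stand-in for the renamed private implementation (uprv_u_cleanup in the commit).
void uprv_u_cleanup_sketch() { ++g_cleanupRuns; }

// Stand-in for the public entry point.
void u_cleanup_sketch() {
    if (kWindowsOsBuild) {
        // Windows OS build: do nothing, so one caller cannot tear down
        // ICU state that other threads are still using.
        return;
    }
    // Public/NuGet build: keep the original behavior.
    uprv_u_cleanup_sketch();
}
```

Keeping the real work in a separate private function is what lets the combined DLL still run cleanup deliberately while external callers get the no-op.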
…unk 1 only) icu4c/source/data/build.xml: Change CLDR_TMP_DIR from cldr-aux to cldr-staging (MS-ICU pipeline uses cldr-staging as its CLDR temp directory rather than vanilla CLDR's cldr-aux default). Dropped hunk 2 (build-icu-data.xml): target file removed in ICU 73+; the cldr-to-icu data build toolchain is now driven by config.xml + Maven + Cldr2Icu.java rather than that Ant build script. The hunk's intent (forceDelete=true; mvn->mvn.cmd for Windows) does not have a direct landing zone in the new toolchain. If Maven-on-Windows breaks during Step 6 data build, fix at the new invocation point then. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Under ICU_DATA_DIR_WINDOWS, make extendICUData() return early (false). Windows OS ICU has only one data file (versionless icudtl.dat from patch 000) and never has extended data; running the normal extension path would try to load icudt78l.dat on top of the already-loaded common data, creating redundant work or load conflicts. Reworked: ICU 78 modernized FALSE -> false in this file; the new guard matches that style. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two Linux make-dist adjustments:
- DISTY_FILES strips DISTY_DOC_ZIP. MS-ICU does not build Doxygen docs in its release pipeline; the dist target would otherwise fail looking for a non-existent docs.zip.
- git archive path adapted for MS-ICU GitHub layout: cd ../.. (two levels up instead of one) and HEAD:icu/icu4c/ (extra icu/ prefix) because microsoft/icu has its icu4c tree at icu/icu4c/ rather than vanilla ICU's top-level icu4c/.
Reworked: ICU 78 fixed an upstream typo (we watn -> we want) and added testdata/ copy logic to dist.mk (lines 72-73); both are preserved.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…regions Remove test entries for blocked region codes (DG, EA, EH, IC, etc.) from ICU's region tests. MS-ICU strips these codes from data via GeoPol policy; the tests would otherwise fail looking up regions that no longer exist in MS-ICU's region data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…namicLink_UCRT
Statically link VCRuntime + VCStartup + STL into icuuc.dll / icuin.dll, but keep UCRT (ucrtbase.dll) dynamic. This eliminates the VC Redist dependency for consumers: Windows 10+ ships UCRT, but VCRuntime and STL would otherwise require a manual VC Redist install.
Mechanism (applied to both common.vcxproj and i18n.vcxproj, Debug + Release blocks):
- RuntimeLibrary: MultiThreadedDebugDLL/MultiThreadedDLL -> MultiThreadedDebug/MultiThreaded (compiler switches to static C++ runtime).
- IgnoreSpecificDefaultLibraries=libucrtd.lib;libucrt.lib (linker drops the static UCRT pulled in by /MT[d]).
- /DEFAULTLIB:ucrt[d].lib via AdditionalOptions (force the dynamic UCRT).
Reworked: ICU 78 already uses $(IcuMajorVersion) macro for DLL names (unchanged by this patch). Verified no arch-specific overrides exist (only generic Debug/Release ItemDefinitionGroups), so the fix applies uniformly to x86, x64, and ARM64.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add ICU major version number to PDB (debug symbol) filenames so they match the DLL filenames. Prevents PDB filename collision when two ICU versions are deployed side-by-side and lets debuggers correctly correlate symbols across versions. Files updated: common.vcxproj, i18n.vcxproj, stubdata.vcxproj. PDBs become icuuc78.pdb / icuuc78d.pdb, icuin78.pdb / icuin78d.pdb, icudt78.pdb (matching the existing icuuc78.dll / icuuc78d.dll / icuin78.dll / icuin78d.dll / icudt78.dll naming). Reworked: Used $(IcuMajorVersion) MSBuild macro rather than hardcoding 78. This is the same pattern ICU 78's <OutputFile> tags already use in these vcxproj files, so future ICU upgrades won't need to touch these strings. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tra_locales Bump STRING_STORE_SIZE from 100000 to 120000 in package.h. The package tool uses this as a static buffer for item names when building the .dat data file; CLDR-MS adds extra locales that overflow the vanilla 100K buffer. Applied verbatim from the patch (120000). If CLDR 48 + MS-CLDR overlay overflows this in the Step 6 data build, bump further (see followup todo). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
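To show why a fixed-size store overflows once enough item names are added, here is a minimal sketch. It assumes a simple append-only layout; `StringStore` is a hypothetical miniature and does not reproduce the package tool's actual data structures.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical miniature of a fixed-capacity, append-only string store.
// The template parameter plays the role of STRING_STORE_SIZE.
template <std::size_t Capacity>
class StringStore {
public:
    // Copies `s` (including its terminating NUL) into the buffer.
    // Returns nullptr instead of writing past the end when it does not fit.
    const char* add(const char* s) {
        const std::size_t need = std::strlen(s) + 1;
        if (need > Capacity - used_) {
            return nullptr;  // overflow: the capacity needs a bump
        }
        char* dest = buffer_ + used_;
        std::memcpy(dest, s, need);
        used_ += need;
        return dest;
    }

    std::size_t used() const { return used_; }

private:
    char buffer_[Capacity] = {};
    std::size_t used_ = 0;
};
```

Because the capacity is a compile-time constant, raising it costs only memory in the tool while it runs.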
…locale_as_full_BCP47_tag
Add MS-only uprefs library so uprv_getDefaultLocaleID() on Windows
returns a full BCP47 tag (e.g. en-US-u-ca-gregory-hc-h12-fw-mon-ms-metric)
that encodes the user's calendar, currency, hour cycle, first day of
week, sort method, and measurement system from Windows Globalization
APIs. Vanilla ICU only returns language+region.
Files:
- new uprefs.cpp/h (common library, gated on UCONFIG_USE_WINDOWS_PREFERENCES_LIBRARY
and U_PLATFORM_USES_ONLY_WIN32_API)
- new uprefstest.cpp/h
- putil.cpp wires uprefs_getBCP47Tag() into uprv_getDefaultLocaleID();
unifies buffer sizing (POSIX_LOCALE_CAPACITY -> length * 2)
- uconfig.h defines UCONFIG_USE_WINDOWS_PREFERENCES_LIBRARY = 1
- sources.txt lists uprefs.cpp
- common.vcxproj + common_uwp.vcxproj reference uprefs.cpp/h
- test/intltest/ Makefile.in + intltest.vcxproj wire uprefstest.{cpp,h};
itutil.cpp registers UPrefsTest class
Reworked from ICU 72 form:
- 8 hunks applied cleanly via git apply (with offsets only)
- 5 build-system list-insertion hunks reworked manually due to context drift
(ICU 78 added fixedstring.cpp, new test files between the patch's anchor
lines)
- New file uprefstest.cpp uses backup version from ICU 72.1.0.4 (commit
860c2ea by Rahul Pandey, "Add missing parameters to MockGetLocaleInfoEx"
Nov 2022) which contains style/whitespace cleanup not present in the
original 2021 patch file. Patch file in icu-patches/patches/ remains stale
and will be regenerated at end of upgrade.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
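As a rough illustration of the tag shape described above, the sketch below assembles a BCP47 tag with a -u- extension from a set of preferences. `UserPrefs` and `buildBcp47Tag` are hypothetical names; the real uprefs code reads its values from Windows Globalization APIs and also covers currency and sort method, which are omitted here. The key order follows the example tag in the commit message.

```cpp
#include <string>

// Hypothetical preference bundle; empty fields are skipped.
struct UserPrefs {
    std::string languageRegion;  // e.g. "en-US"
    std::string calendar;        // "ca" value, e.g. "gregory"
    std::string hourCycle;       // "hc" value, e.g. "h12"
    std::string firstDay;        // "fw" value, e.g. "mon"
    std::string measurement;     // "ms" value, e.g. "metric"
};

// Builds language-region plus a "-u-" extension holding each non-empty key/value.
std::string buildBcp47Tag(const UserPrefs& p) {
    std::string ext;
    auto add = [&ext](const char* key, const std::string& value) {
        if (!value.empty()) {
            ext += "-";
            ext += key;
            ext += "-";
            ext += value;
        }
    };
    add("ca", p.calendar);
    add("hc", p.hourCycle);
    add("fw", p.firstDay);
    add("ms", p.measurement);
    // Without any preferences, fall back to plain language+region.
    return ext.empty() ? p.languageRegion : p.languageRegion + "-u" + ext;
}
```

With all four fields set to the values in the comments, this produces the example tag from the commit message; with none set, it degrades to the language+region form that vanilla ICU returns.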
Patch 018 originally bumped this from 100000 -> 120000 for the CLDR-MS extra-locales overflow at the package-tool stage of data build. CLDR 48 has substantially more locales than CLDR 44 (which 120000 was sized for); prior session evidence (now-deleted branch) suggests 120000 may still overflow at Step 6 data build. Pre-emptively bump to 200000 to avoid a Step 6 rerun on overflow. If the actual measurement at Step 6 shows 120000 was enough, we can revisit post-shipping. Bumping high now is harmless (static array sized at compile time in tool; trivial RAM increase only while makedata runs). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ry_validation The new ICU 78 Cldr2Icu Maven/Java toolchain replaces the deleted build-icu-data.xml Ant entry point. Its CLI-options constructor runs validateEnvironment() and System.exit(1)s unless ICU_DIR contains an icu4j/ subdirectory. The microsoft/icu fork is icu4c-only -- there is no icu4j source tree -- so the Step 6 data-build pipeline cannot run without bypassing this check. The runtime Java dependency on icu4j (used by TransformsMapper for Transliterator) is satisfied via the Maven artifact com.ibm.icu:icu4j in ~/.m2, which is unaffected by source-tree absence. This is an MS-only divergence. Both the patch file and the applied source change land here as a single commit; the patch file documents the divergence in icu-patches/patches/ for future upgrades. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Output of running `cldr/scripts/runme.cmd` (parent repo) after the CLDR_DIR fix (1ff7af3dd) and remove_data.pl empty-parent cleanup (c7fed06a2):
- existing locale/region/currency/zone/unit/lang/coll/rbnf .txt files regenerated from CLDR 48.2 production data
- new locales added by CLDR 48 (vs the CLDR 42 set)
- Minguo blocked-term leak that affected 22 files in the pre-fix run is resolved (0 hits in this regeneration)
The runme.cmd script then failed at the blocked-zones verification step because zoneinfo64.txt (untouched here, came in with the ICU 72->78 source swap) still contains Urumqi/Kashgar. That's tracked separately and will be resolved by regenerating zoneinfo64.txt from IANA tzdata 2026a with blocked zones removed at source.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…zones removed
Output of `bash cldr/scripts/regenerate_zoneinfo64_with_blocked_timezones_removed.sh 2026a` run on WSL Ubuntu 24.04 after parent-side fixes:
- 1ff7af3dd cldr: reset CLDR_DIR to CLDR_MS_ROOT before ant proddata
- c7fed06a2 remove_data.pl scrubs empty parent containers
- 085f3ce84 tz regen normalizes CRLF on WSL before ICU build
- a72cfa836 tz regen uses HTTPS + fail-fast on download error
- 2bfb4f6e0 boundary-aware blocked-term match in iana scrub helper
- 97c2591bf iana scrub handles multi-line Zone blocks in all files
- f3c9d1f2b tz regen pre-cleans stale tzcode build state
- 3ec9ea25b tz regen pre-cleans stale ICU data build outputs
Resulting file:
  Build tool: tz2icu
  Build date: Sat May 16 09:04:19 2026
  tz version: 2026a
  ICU version: 78.3
Blocked-term verification (post-regen):
  Urumqi: 0 hits (was 3 in upstream maint-78)
  Kashgar: 0 hits (was 3)
  Vostok: 0 hits (was 3, new in tzdata 2023d for Antarctica/Vostok)
  Vladivostok: 3 hits (preserved - false-positive guard verified)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
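The Vostok/Vladivostok guard comes down to requiring word boundaries around a blocked term. A minimal sketch follows, assuming case-insensitive matching and treating any non-alphanumeric character as a boundary; `containsBlockedTerm` is a hypothetical name, and the parent repo's scrub helper is a script that is not reproduced here.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Lowercases ASCII letters so matching is case-insensitive (an assumption).
std::string toLowerAscii(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// True when `term` occurs in `line` with no letter or digit directly
// before or after it, so "Vostok" does not match inside "Vladivostok".
bool containsBlockedTerm(const std::string& line, const std::string& term) {
    if (term.empty()) return false;
    const std::string hay = toLowerAscii(line);
    const std::string needle = toLowerAscii(term);
    auto isWordChar = [](char c) {
        return std::isalnum(static_cast<unsigned char>(c)) != 0;
    };
    for (std::size_t pos = hay.find(needle); pos != std::string::npos;
         pos = hay.find(needle, pos + 1)) {
        const std::size_t end = pos + needle.size();
        const bool leftOk = (pos == 0) || !isWordChar(hay[pos - 1]);
        const bool rightOk = (end == hay.size()) || !isWordChar(hay[end]);
        if (leftOk && rightOk) return true;
    }
    return false;
}
```

A plain case-insensitive substring search would flag Asia/Vladivostok as containing Vostok; the boundary check is what keeps that zone in the data.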
Output of running `cldr/scripts/runme.cmd` end-to-end on Windows after parent commit 6bc35e28e fixed remove_blocked_zones.pl's path-separator regex. The fixed verifier now correctly matches forward-slash paths that File::Find::name produces on this Perl build, and the in-place scrub branch runs against testdata/zoneinfo64.txt and testdata/windowsZones.txt instead of being silently skipped.
Resulting deltas (Urumqi + Kashgar removed, indices reflowed):
- testdata/zoneinfo64.txt: -2 zone tables, -2 country aliases
- testdata/metaZones.txt: -1 metaZone block
- testdata/timezoneTypes.txt: -1 timezone type line
- testdata/windowsZones.txt: -1 zone in CN line
Verified via final stage6 log line: "The ICU data files (.txt) have been updated."
Blocked-term check across all 8 tz files (data/misc + test/testdata):
  Urumqi: 0 (target)
  Kashgar: 0 (target)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
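The separator bug can be illustrated with a separator-agnostic path comparison. Note the swap in technique: the actual fix adjusted a regex in a Perl script, while this sketch normalizes separators before comparing. `normalizeSeparators` and `endsWithPath` are hypothetical helpers.

```cpp
#include <algorithm>
#include <string>

// Rewrites backslashes to forward slashes so both spellings compare equal.
std::string normalizeSeparators(std::string path) {
    std::replace(path.begin(), path.end(), '\\', '/');
    return path;
}

// True when `path` ends with `suffix`, regardless of which separator each uses.
bool endsWithPath(const std::string& path, const std::string& suffix) {
    const std::string p = normalizeSeparators(path);
    const std::string s = normalizeSeparators(suffix);
    return p.size() >= s.size() &&
           p.compare(p.size() - s.size(), s.size(), s) == 0;
}
```

A check that only recognizes backslashes fails when the directory walker reports forward slashes, and the guarded branch is then skipped without any error, which is the failure mode this commit describes.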
UWP projects (common_uwp, i18n_uwp, makedata_uwp) failed to build on Visual Studio 2026 (BuildToolVersion=18.0) with:
  Microsoft.Cpp.WindowsSDK.targets(46,5): error MSB8036: The Windows SDK version 10.0.0.0 was not found.
Root cause: Build.Windows.PlatformToolset.props had toolset mappings only for VS 14.0/15.0/16.0/17.0 (v140/v141/v142/v143). On VS 18.0 no AutoDetectedPlatformToolset matched, so PlatformToolset fell through. The subsequent latest-SDK alias block at line 39 was conditioned on PlatformToolset == v142|v143 and did not fire, leaving WindowsTargetPlatformVersion unset. VS 18 UWP Default.props (.../Application Type/Windows Store/10.0/Default.props) then assigned the sentinel value 10.0.0.0 to WindowsTargetPlatformVersion, which is not a real installed SDK, producing MSB8036.
Fix: add VS 18.0 -> v145 mapping and include v145 in the latest-SDK alias condition. With this change, building UWP projects on VS 18 resolves WindowsTargetPlatformVersion to the literal "10.0" alias, which VC targets dynamically resolve to the latest installed Windows 10 SDK via _LatestWindowsTargetPlatformVersion (no hard-coded SDK number; works on any host with a Windows 10 SDK installed).
Verified:
- common_uwp.vcxproj Release|x64: exit 0, common_uwp.dll produced
- i18n_uwp.vcxproj Release|x64: exit 0, i18n_uwp.dll produced
- makedata_uwp via allinone.sln Release|x64: exit 0, icudt78.dll produced in bin64UWP/, testdata built
Diagnosis verified by independent gpt-5.5 DA pass (cited primary sources: VS UWP Default.props line 27-32, Microsoft.Cpp.WindowsSDK.props line 124).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
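The fall-through can be modeled in a few lines. This is only a model of the property logic written in C++ for illustration; the real change is in MSBuild .props files, and the function names here are hypothetical.

```cpp
#include <string>

// Maps a Visual Studio version to its platform toolset. An empty result
// models "no AutoDetectedPlatformToolset matched".
std::string toolsetFor(const std::string& vsVersion) {
    if (vsVersion == "14.0") return "v140";
    if (vsVersion == "15.0") return "v141";
    if (vsVersion == "16.0") return "v142";
    if (vsVersion == "17.0") return "v143";
    if (vsVersion == "18.0") return "v145";  // the mapping added by this commit
    return "";
}

// Models the condition on the latest-SDK alias block, with v145 included.
bool latestSdkAliasApplies(const std::string& toolset) {
    return toolset == "v142" || toolset == "v143" || toolset == "v145";
}

// Returns the "10.0" alias when the block fires; empty models "left unset".
std::string targetPlatformVersion(const std::string& vsVersion) {
    return latestSdkAliasApplies(toolsetFor(vsVersion)) ? "10.0" : "";
}
```

Before the fix, the 18.0 lookup produced the empty result, so the alias condition could not match and the SDK version stayed unset.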
Build.Windows.ProjectConfiguration.props had hardcoded SDK pinning for ARM and ARM64 desktop configs:
  L66, L70: <WindowsTargetPlatformVersion>10.0.22621.0</WindowsTargetPlatformVersion>
  L188, L204: <AdditionalLibraryDirectories>...Windows Kits\10\Lib\10.0.22621.0\um\arm[64]</AdditionalLibraryDirectories>
x86/x64 do not pin a specific SDK; they inherit the latest-installed "10.0" alias from Build.Windows.PlatformToolset.props (lines 32-43). ARM/ARM64 had been pinned since 2020 when Windows 10 SDK was required explicitly for desktop ARM but default desktop SDK was Win 8.1; later upgrades migrated x86/x64 to the dynamic alias but the ARM/ARM64 hardcode stayed, only being bumped (16299 -> 22621) instead of removed.
On a host without 10.0.22621.0 installed (this dev box has 10.0.26100.0 only), ARM64 builds of any non-UWP project (e.g., stubdata.vcxproj as a dependency of common_uwp) fail with MSB8036.
Remove the 4 hardcoded lines so ARM/ARM64 inherit the same dynamic-SDK behavior as x86/x64. Keep all other ARM/ARM64-specific configuration: output dirs, preprocessor defines, TargetMachine, kernel32.lib, the WindowsSDKDesktopARMSupport / WindowsSDKDesktopARM64Support flags.
ARM32 (Platform=ARM) is not in the build matrix iterated by upgrade-ver/icu/build-test.ps1 (lines 138-142 only enumerate Win32/x64/ARM64), so dropping the ARM32 SDK pinning has no impact on our supported configurations.
Verified by independent gpt-5.5 DA pass; cited primary sources:
- source\stubdata\stubdata.vcxproj:11-16
- source\allinone\Build.Windows.ProjectConfiguration.props:61-72,185-188,201-204
- source\allinone\Build.Windows.PlatformToolset.props:32-43
- source\allinone\allinone.sln:13-15,186-188
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove generated yue ICU resource bundles so blocked locale data does not get packaged into runtime ICU data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove ICU-generated ku and tok locale bundles that alias into blocked or ignored generated data, and update tzdata to 2026b. This unblocks UnifiedCacheTest eviction stress from hanging on empty alias targets. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Align ICU test expectations with MS English locale data that preserves regular spaces before AM/PM markers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adjust ICU tests for CLDR-MS display-name differences and removed yue collation aliases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Treat 029 locales as US measurement/paper locales and expect en_029 currency data from CLDR-MS. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Update data-driven date-format spacing expectations and en_GB display-context expectations for CLDR-MS data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The UCPTrie golden comparison is byte-for-byte, so Windows CRLF checkout makes the generated set1 TOMLs fail even though the normalized content is identical. Force LF for the codepoint trie TOML goldens. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
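A small sketch of why a byte-for-byte comparison trips on CRLF checkouts. `normalizeToLf` is a hypothetical helper for illustration only; the commit itself fixes the problem at checkout time by forcing LF, not by normalizing inside the test.

```cpp
#include <cstddef>
#include <string>

// Drops the '\r' of every "\r\n" pair, leaving lone '\r' characters untouched.
std::string normalizeToLf(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        const bool crBeforeLf =
            text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n';
        if (!crBeforeLf) {
            out.push_back(text[i]);
        }
    }
    return out;
}
```

Two files that differ only in line endings compare unequal as raw bytes but equal after this normalization, which matches the "normalized content is identical" observation in the commit message.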
Remove blocked subdivision prefixes from supplemental data and update the generated region validation bitmap so Region::getInstance(), subdivision validity, and loclikely agree on the MS blocked-region set. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force LF checkout for the shell fragments that are concatenated into config/icu-config so WSL/Linux selfcheck can execute the generated script. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refresh the stale/corrupt MSFT patch records for the ICU 78 upgrade: re-derived test expectations, final string-store size, uprefs refresh, and LF-only checkout rules. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the generated development review report for the ICU 78 / CLDR 48 upgrade. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Use the Unicode U+20C1 SAUDI RIYAL SIGN for the ar-SA SAR currency symbol and add a C API regression test for the locale-specific override. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add patch 023 to record the ar-SA SAR symbol override and its regression test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Record patch 023 and the targeted validation for the ar-SA SAR symbol override. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@copilot review
Repair the icu4c LICENSE symlink so releaseDist can install the license from the repo layout. Move the ARM64 Linux build image to Ubuntu 18.04 so clang++-9 has C++17 standard library support for ICU 78 string_view headers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a root CodeQL.yml so 1ES CodeQL treats imported upstream ICU sources as library code while keeping MSFT-owned patches and build scripts visible. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace the root 1ES-style CodeQL.yml with GitHub's .github/codeql/codeql-config.yml so CodeQL can ignore imported upstream ICU source and test data. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Summary
Updates the Microsoft ICU fork from ICU 72 to ICU 78.
What changed
- icu-patches/patches matches the final source tree.
- icu-patches/dev_report.md for review notes, validation, and patch status.
Patch records
Refreshed final patch records:
- 002-MSFT-Patch-ICU_test_changes_for_MSFT_changes.patch
- 017-MSFT-Patch_ICU_test_changes_for_extra_CLDR-MS_locales.patch
- 018-MSFT-Patch_ICU_toolutil_increase_string_store_for_extra_locales.patch
- 020-MSFT-Patch_ICU_Add_uprefs_library_to_obtain_default_locale_as_full_BCP47_tag.patch
- 022-MSFT-Patch-ICU_keep_generated_test_and_shell_artifacts_LF_only.patch
New patch record:
- 021-MSFT-Patch_ICU_Cldr2Icu_remove_icu4j_directory_validation.patch
Validation
icu-patches/dev_report.md.
Notes
This is the nested ICU source PR. The parent ms-icu PR depends on this nested ICU commit via the submodule pointer.