Skip to content

fix: clamp pread() size to avoid EINVAL on macOS for large reads#1053

Merged
veloman-yunkan merged 1 commit intoopenzim:mainfrom
jasontitus:fix/macos-large-read
Apr 1, 2026
Merged

fix: clamp pread() size to avoid EINVAL on macOS for large reads#1053
veloman-yunkan merged 1 commit intoopenzim:mainfrom
jasontitus:fix/macos-large-read

Conversation

@jasontitus
Copy link
Copy Markdown

Problem

ZIM files with more than ~268 million entries fail to open on macOS with:

Cannot read chars.
 - Reading offset at 120075518264
 - size is 3049426704
 - error is Cannot read file: Invalid argument

On macOS, pread() returns EINVAL when the requested size exceeds INT32_MAX (~2.1 GB). The URL pointer table is entry_count * 8 bytes — at 381 million entries this is 3.05 GB, triggering the failure.

Fix

Clamp each pread() call to 1 GB. The existing read loop in FD::readAt() already handles partial reads and iterates until the full request is satisfied, so the remaining data is read in subsequent iterations.

Testing

Tested on macOS 26.3 (Apple Silicon) with a 381-million-entry, 117 GB ZIM file (world OpenStreetMap build with vector tiles, terrain-RGB tiles, and a full-text search index). Before the fix, zim::Archive() throws immediately. After the fix, the archive opens and all entries are accessible.

The Kiwix macOS app (3.13.0) reports "cannot be opened" for any ZIM exceeding this threshold. This is the same root cause.

Comment thread src/fs_unix.cpp Outdated
@veloman-yunkan
Copy link
Copy Markdown
Collaborator

@jasontitus Thank you for your bug report and fix for it!

@benoit74 @kelson42 @rgaudin I think that the reported fact rings a bell. During our discussions of libzim's memory management via various caches we never considered the possibility that

  1. the dirent pointer table could be that huge and
  2. it would need to be loaded in memory.

Under Linux, the second point is circumvented by assuming mmap()-based access to the dirent pointer table (and similar data) but obviously there are systems where mmap() is not supported. Should libzim be allowed to silently load such huge data in memory or we should do something about it?

@rgaudin
Copy link
Copy Markdown
Member

rgaudin commented Mar 27, 2026

I remember some time ago we had to introduce an exception to catch mmap failures (and fall back to memory) because those we more common than expected. I also know mmap is not present on Windows and I believe we don't use the similar-goal Windows API.

If I understand correctly, under those circumstances with that maps ZIM, opening the Archive would consume 1GB of memory… that would not be acceptable IMO.

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

veloman-yunkan commented Mar 27, 2026

If I understand correctly, under those circumstances with that maps ZIM, opening the Archive would consume 1GB of memory…

Correction - opening that ZIM file on a system where mmap() is not supported will consume 3.05 GB of memory for the dirent pointer table alone. On systems supporting mmap() the same amount of virtual memory will be consumed but it will be disk-backed and loaded (and unloaded) as needed automatically by the OS.

@benoit74
Copy link
Copy Markdown

I don't have much to say besides that "it looks like we have a problem".

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 29, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 56.23%. Comparing base (0f16c3c) to head (659d1c7).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/fs_unix.cpp 50.00% 0 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (50.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1053      +/-   ##
==========================================
- Coverage   56.26%   56.23%   -0.04%     
==========================================
  Files         101      101              
  Lines        5014     5015       +1     
  Branches     2186     2188       +2     
==========================================
- Hits         2821     2820       -1     
  Misses        737      737              
- Partials     1456     1458       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kelson42
Copy link
Copy Markdown
Contributor

kelson42 commented Mar 29, 2026

@jasontitus Thank very much for reporting this bug and fixing it straight. I would like to get this merge and release ASAP. Then we would do the same at Kiwix level. Can you please give a feedback to the code reviewer so we can move forward with the merging?

@kelson42
Copy link
Copy Markdown
Contributor

kelson42 commented Apr 1, 2026

@veloman-yunkan Can you please complete the PR so we can merge and make a release of libzim?

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

@veloman-yunkan Can you please complete the PR so we can merge and make a release of libzim?

@kelson42 Done

@veloman-yunkan
Copy link
Copy Markdown
Collaborator

@veloman-yunkan Can you please complete the PR so we can merge and make a release of libzim?

@kelson42 Done

Oops, not yet. MacOS compiler doesn't like my version. Will fix in a moment.

On macOS, pread() returns EINVAL when the requested size exceeds
INT32_MAX (~2.1 GB). This causes ZIM files with more than ~268 million
entries to fail to open, since the URL pointer table (entry_count * 8
bytes) exceeds 2 GB and is read in a single pread() call.

The existing read loop already handles partial reads (short reads),
but never encounters them because the oversized request fails outright
before any data is read.

Fix: clamp each pread() call to 1 GB. The existing loop naturally
handles the remaining data in subsequent iterations.

Tested with a 381-million-entry, 117 GB ZIM file (world OpenStreetMap
with vector tiles, terrain, and search index) on macOS 26.3 / Apple
Silicon.
@veloman-yunkan veloman-yunkan merged commit 7e30f56 into openzim:main Apr 1, 2026
26 of 27 checks passed
@veloman-yunkan
Copy link
Copy Markdown
Collaborator

This PR touched a piece of code that contained an old pre-existing bug. The change enabled the bug to manifest itself deterministically. Now it is fixed in #1066.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants