fix: Use decoding as a heuristic for binary data, and download accordingly#8858
fix: Use decoding as a heuristic for binary data, and download accordingly#8858dmadisetti merged 7 commits intomainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Pull request overview
Adds an explicit isBase64/is_base64 flag to the file-details API so the frontend can distinguish textual contents from base64-encoded binary payloads and download files correctly when MIME type detection is missing/insufficient.
Changes:
- Backend:
FileDetailsResponsenow includesis_base64;OSFileSystem.open_filereturnsbytesfor non-decodable files andget_detailsbase64-encodes those bytes while setting the flag. - OpenAPI + frontend: schema updated and download logic updated to decode base64 to bytes before creating a Blob.
- Tests/smoke test: adds coverage for binary file behavior and a smoke test app for GH-5361.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/_server/files/test_os_file_system.py | Adds assertions for open_file returning str vs bytes and verifies is_base64 behavior. |
| tests/_server/api/endpoints/test_file_explorer.py | Extends endpoint tests to assert isBase64 and base64 contents for binary. |
| packages/openapi/src/api.ts | Updates generated TS types to include isBase64. |
| packages/openapi/api.yaml | Adds isBase64 to the FileDetailsResponse schema. |
| marimo/_smoke_tests/issues/5361.py | New smoke test for binary download scenarios (GH-5361). |
| marimo/_server/models/files.py | Adds is_base64 to FileDetailsResponse (camel-cased in JSON). |
| marimo/_server/files/os_file_system.py | Implements is_base64 computation and changes open_file to return raw bytes on decode failure. |
| marimo/_server/files/file_system.py | Updates the abstract open_file return type to `str |
| marimo/_server/api/endpoints/file_explorer.py | Removes outdated TODO comment about raw bytes support. |
| frontend/src/components/editor/file-tree/file-viewer.tsx | Decodes base64 payloads to bytes for downloads; continues media download path. |
| frontend/src/components/editor/file-tree/file-explorer.tsx | Decodes base64 payloads to bytes for downloads from the tree dropdown. |
| def open_file(self, path: str, encoding: str | None = None) -> str | bytes: | ||
| file_path = Path(path) | ||
| try: | ||
| return file_path.read_text(encoding=encoding) | ||
| except UnicodeDecodeError: | ||
| # If its a UnicodeDecodeError, try as bytes and convert to base64 | ||
| return base64.b64encode(file_path.read_bytes()).decode("utf-8") | ||
| return file_path.read_bytes() | ||
|
|
||
| def create_file_or_directory( |
There was a problem hiding this comment.
open_file() currently falls back to Path.read_text() with encoding=None, which uses the platform’s preferred encoding (e.g. cp1252 on Windows). That encoding can successfully decode arbitrary bytes, meaning truly-binary files may be misclassified as text and never trigger the UnicodeDecodeError path. This breaks the new is_base64 behavior (and the associated tests) and can corrupt downloads. Consider always attempting a strict UTF-8 decode when encoding is not provided (or implement an explicit bytes-read + decode attempt with the desired encoding) so binary detection is consistent across OSes.
JiwaniZakir
left a comment
There was a problem hiding this comment.
The change in os_file_system.py from file_path.read_text(encoding=encoding) to file_path.read_text(encoding=encoding or "utf-8") is a subtle but impactful behavioral shift: previously, passing encoding=None would use the platform's default encoding, which correctly handled Latin-1, Shift-JIS, or Windows-1252 text files on appropriately configured systems. Now, those files will silently fail the UTF-8 decode, get treated as binary, and be base64-encoded — meaning they'll download as raw bytes rather than being viewable as text in the file viewer. If the intent is to standardize on UTF-8, that's reasonable, but the downstream consequence (text files in other encodings now presented as binary) deserves a comment or explicit design decision note.
In file-viewer.tsx, the handleDownload path for isBase64 && !isMediaMime does a round-trip through base64ToDataURL → deserializeBlob, which converts base64 → data URL string → Blob. This could be done more directly by decoding the base64 string to a Uint8Array and constructing the Blob from that, avoiding the intermediate data URL string allocation — worth considering for large binary files.
The addition of the explicit is_base64: bool = False field to FileDetailsResponse is a clean improvement over the previous implicit convention where consumers had to guess whether contents was raw text or base64 depending on context.
| if (data.isBase64 && data.contents) { | ||
| if (isMediaMime(mimeType)) { | ||
| const dataURL = base64ToDataURL( | ||
| data.contents as Base64String, | ||
| mimeType, | ||
| ); | ||
| downloadByURL(dataURL, data.file.name); | ||
| } else { | ||
| const blob = deserializeBlob( | ||
| base64ToDataURL( | ||
| data.contents as Base64String, | ||
| data.mimeType || "application/octet-stream", | ||
| ), | ||
| ); | ||
| downloadBlob(blob, data.file.name); | ||
| } |
There was a problem hiding this comment.
isBase64 is only used to decide how to download, but the rest of this component still treats data.contents as text/base64 based purely on mimeType. In particular, media rendering later in the file passes data.contents as base64 even when isBase64 is false (e.g., image/svg+xml where contents is UTF-8 text), which will produce an invalid data URL and broken preview. Consider gating media rendering (and the editable text editor / copy-save actions) on data.isBase64, and for non-base64 media either render via a Blob/object URL or fall back to a text preview.
| downloadBlob( | ||
| new Blob([details.contents || ""]), | ||
| node.data.name, | ||
| ); |
There was a problem hiding this comment.
When details.isBase64 is false, the download path creates new Blob([details.contents || ""]) without setting a MIME type. This can cause the downloaded file to be treated as an unknown type (and can break expected behavior for text files like SVG/XML). Prefer passing { type: details.mimeType || "text/plain" } (or a sensible default) when constructing the Blob.
| downloadBlob( | |
| new Blob([details.contents || ""]), | |
| node.data.name, | |
| ); | |
| const blob = new Blob([details.contents || ""], { | |
| type: details.mimeType || "text/plain", | |
| }); | |
| downloadBlob(blob, node.data.name); |
| with open(bin_path, "wb") as f: | ||
| f.write(raw_bytes) | ||
| response = client.post( | ||
| "/api/files/file_details", | ||
| headers=HEADERS, | ||
| json={"path": bin_path}, | ||
| ) | ||
| assert response.status_code == 200, response.text | ||
| data = response.json() | ||
| assert data["isBase64"] is True | ||
| assert data["contents"] == base64.b64encode(raw_bytes).decode("utf-8") | ||
| os.remove(bin_path) |
There was a problem hiding this comment.
This test manually writes binary_file.bin into the shared module-level TemporaryDirectory and only deletes it at the end. If an assertion fails before os.remove(bin_path), the file will be left behind and can affect subsequent tests. Use a try/finally for cleanup (or a per-test tmp_path fixture) to guarantee removal.
| with open(bin_path, "wb") as f: | |
| f.write(raw_bytes) | |
| response = client.post( | |
| "/api/files/file_details", | |
| headers=HEADERS, | |
| json={"path": bin_path}, | |
| ) | |
| assert response.status_code == 200, response.text | |
| data = response.json() | |
| assert data["isBase64"] is True | |
| assert data["contents"] == base64.b64encode(raw_bytes).decode("utf-8") | |
| os.remove(bin_path) | |
| try: | |
| with open(bin_path, "wb") as f: | |
| f.write(raw_bytes) | |
| response = client.post( | |
| "/api/files/file_details", | |
| headers=HEADERS, | |
| json={"path": bin_path}, | |
| ) | |
| assert response.status_code == 200, response.text | |
| data = response.json() | |
| assert data["isBase64"] is True | |
| assert data["contents"] == base64.b64encode(raw_bytes).decode("utf-8") | |
| finally: | |
| if os.path.exists(bin_path): | |
| os.remove(bin_path) |
| test_dir = os.path.dirname(os.path.abspath(__file__)) | ||
|
|
||
| # 1. cProfile .prof file | ||
| prof_path = os.path.join(test_dir, "sample.prof") | ||
| profiler = cProfile.Profile() | ||
| profiler.enable() | ||
| sum(range(10000)) | ||
| profiler.disable() | ||
| profiler.dump_stats(prof_path) | ||
|
|
||
| # 2. Generic binary file with non-UTF8 bytes | ||
| bin_path = os.path.join(test_dir, "sample.bin") | ||
| with open(bin_path, "wb") as f: | ||
| f.write(struct.pack("!4sIf16s", b"MAGIC", 42, 3.14, b"\xff\xfe\xfd" + b"\x00" * 13)) | ||
|
|
||
| # 3. Pickle file | ||
| pkl_path = os.path.join(test_dir, "sample.pkl") | ||
| with open(pkl_path, "wb") as f: | ||
| pickle.dump({"key": "value", "numbers": [1, 2, 3], "bytes": b"\xff\xfe"}, f) | ||
|
|
||
| # 4. A plain text file (control: should still work) | ||
| txt_path = os.path.join(test_dir, "sample.txt") | ||
| with open(txt_path, "w") as f: | ||
| f.write("This is a plain text file.\nIt should download correctly.\n") | ||
|
|
||
| # 5. A minimal valid PNG (1x1 red pixel) — tests the isMediaMime/dataURL path | ||
| def _make_png(): | ||
| def chunk(ctype, data): | ||
| c = ctype + data | ||
| return struct.pack(">I", len(data)) + c + struct.pack(">I", zlib.crc32(c) & 0xFFFFFFFF) | ||
|
|
||
| return ( | ||
| b"\x89PNG\r\n\x1a\n" | ||
| + chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0)) | ||
| + chunk(b"IDAT", zlib.compress(b"\x00\xff\x00\x00")) | ||
| + chunk(b"IEND", b"") | ||
| ) | ||
|
|
||
| png_path = os.path.join(test_dir, "sample.png") | ||
| with open(png_path, "wb") as f: | ||
| f.write(_make_png()) | ||
|
|
||
| print(f"Created test files in {test_dir}:") | ||
| for name in ["sample.prof", "sample.bin", "sample.pkl", "sample.txt", "sample.png"]: | ||
| p = os.path.join(test_dir, name) | ||
| print(f" {name}: {os.path.getsize(p)} bytes") |
There was a problem hiding this comment.
This smoke test writes several files (sample.prof, sample.bin, sample.pkl, sample.txt, sample.png) directly into the repository directory containing the test (os.path.dirname(__file__)) and never cleans them up. Running smoke tests locally/CI can leave the working tree dirty and may cause confusion if these artifacts are committed accidentally. Consider writing into a temporary directory (and pointing the user to it), or add cleanup logic at the end of the cell/script.
|
🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.21.2-dev59 |
📝 Summary
We previously encoded blobs as base64 if they were not
utf-8compat. As a consequence, for files whose mimetype we could not dertermine, we saved the result as a base64 blob.This PR introduces a
is_base64flag to inform the fe it should handle the file contents a little differently. The media download behavior is preserved if the mimetype is recognized- otherwise the raw bytes are downloaded.Closes #5361