Skip to content

fix: Use decoding as a heuristic for binary data, and download accordingly#8858

Merged
dmadisetti merged 7 commits intomainfrom
dm/mo-3217
Mar 26, 2026
Merged

fix: Use decoding as a heuristic for binary data, and download accordingly#8858
dmadisetti merged 7 commits intomainfrom
dm/mo-3217

Conversation

@dmadisetti
Copy link
Copy Markdown
Collaborator

@dmadisetti dmadisetti commented Mar 24, 2026

📝 Summary

We previously encoded blobs as base64 if they were not utf-8 compat. As a consequence, for files whose mimetype we could not dertermine, we saved the result as a base64 blob.

This PR introduces a is_base64 flag to inform the fe it should handle the file contents a little differently. The media download behavior is preserved if the mimetype is recognized- otherwise the raw bytes are downloaded.

Closes #5361

Copilot AI review requested due to automatic review settings March 24, 2026 23:23
@vercel
Copy link
Copy Markdown

vercel bot commented Mar 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
marimo-docs Ready Ready Preview, Comment Mar 25, 2026 5:15pm

Request Review

@github-actions github-actions bot added the bash-focus Area to focus on during release bug bash label Mar 24, 2026
@dmadisetti dmadisetti changed the title Dm/mo 3217 @dmadisetti fix: Use decoding as a heuristic for binary data, and download accordingly Mar 24, 2026
@dmadisetti dmadisetti changed the title @dmadisetti fix: Use decoding as a heuristic for binary data, and download accordingly fix: Use decoding as a heuristic for binary data, and download accordingly Mar 24, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an explicit isBase64/is_base64 flag to the file-details API so the frontend can distinguish textual contents from base64-encoded binary payloads and download files correctly when MIME type detection is missing/insufficient.

Changes:

  • Backend: FileDetailsResponse now includes is_base64; OSFileSystem.open_file returns bytes for non-decodable files and get_details base64-encodes those bytes while setting the flag.
  • OpenAPI + frontend: schema updated and download logic updated to decode base64 to bytes before creating a Blob.
  • Tests/smoke test: adds coverage for binary file behavior and a smoke test app for GH-5361.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/_server/files/test_os_file_system.py Adds assertions for open_file returning str vs bytes and verifies is_base64 behavior.
tests/_server/api/endpoints/test_file_explorer.py Extends endpoint tests to assert isBase64 and base64 contents for binary.
packages/openapi/src/api.ts Updates generated TS types to include isBase64.
packages/openapi/api.yaml Adds isBase64 to the FileDetailsResponse schema.
marimo/_smoke_tests/issues/5361.py New smoke test for binary download scenarios (GH-5361).
marimo/_server/models/files.py Adds is_base64 to FileDetailsResponse (camel-cased in JSON).
marimo/_server/files/os_file_system.py Implements is_base64 computation and changes open_file to return raw bytes on decode failure.
marimo/_server/files/file_system.py Updates the abstract open_file return type to `str
marimo/_server/api/endpoints/file_explorer.py Removes outdated TODO comment about raw bytes support.
frontend/src/components/editor/file-tree/file-viewer.tsx Decodes base64 payloads to bytes for downloads; continues media download path.
frontend/src/components/editor/file-tree/file-explorer.tsx Decodes base64 payloads to bytes for downloads from the tree dropdown.

Comment on lines +128 to 135
def open_file(self, path: str, encoding: str | None = None) -> str | bytes:
file_path = Path(path)
try:
return file_path.read_text(encoding=encoding)
except UnicodeDecodeError:
# If its a UnicodeDecodeError, try as bytes and convert to base64
return base64.b64encode(file_path.read_bytes()).decode("utf-8")
return file_path.read_bytes()

def create_file_or_directory(
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open_file() currently falls back to Path.read_text() with encoding=None, which uses the platform’s preferred encoding (e.g. cp1252 on Windows). That encoding can successfully decode arbitrary bytes, meaning truly-binary files may be misclassified as text and never trigger the UnicodeDecodeError path. This breaks the new is_base64 behavior (and the associated tests) and can corrupt downloads. Consider always attempting a strict UTF-8 decode when encoding is not provided (or implement an explicit bytes-read + decode attempt with the desired encoding) so binary detection is consistent across OSes.

Copilot uses AI. Check for mistakes.
Light2Dark
Light2Dark previously approved these changes Mar 25, 2026
Copy link
Copy Markdown

@JiwaniZakir JiwaniZakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change in os_file_system.py from file_path.read_text(encoding=encoding) to file_path.read_text(encoding=encoding or "utf-8") is a subtle but impactful behavioral shift: previously, passing encoding=None would use the platform's default encoding, which correctly handled Latin-1, Shift-JIS, or Windows-1252 text files on appropriately configured systems. Now, those files will silently fail the UTF-8 decode, get treated as binary, and be base64-encoded — meaning they'll download as raw bytes rather than being viewable as text in the file viewer. If the intent is to standardize on UTF-8, that's reasonable, but the downstream consequence (text files in other encodings now presented as binary) deserves a comment or explicit design decision note.

In file-viewer.tsx, the handleDownload path for isBase64 && !isMediaMime does a round-trip through base64ToDataURLdeserializeBlob, which converts base64 → data URL string → Blob. This could be done more directly by decoding the base64 string to a Uint8Array and constructing the Blob from that, avoiding the intermediate data URL string allocation — worth considering for large binary files.

The addition of the explicit is_base64: bool = False field to FileDetailsResponse is a clean improvement over the previous implicit convention where consumers had to guess whether contents was raw text or base64 depending on context.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Comment on lines +125 to +140
if (data.isBase64 && data.contents) {
if (isMediaMime(mimeType)) {
const dataURL = base64ToDataURL(
data.contents as Base64String,
mimeType,
);
downloadByURL(dataURL, data.file.name);
} else {
const blob = deserializeBlob(
base64ToDataURL(
data.contents as Base64String,
data.mimeType || "application/octet-stream",
),
);
downloadBlob(blob, data.file.name);
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isBase64 is only used to decide how to download, but the rest of this component still treats data.contents as text/base64 based purely on mimeType. In particular, media rendering later in the file passes data.contents as base64 even when isBase64 is false (e.g., image/svg+xml where contents is UTF-8 text), which will produce an invalid data URL and broken preview. Consider gating media rendering (and the editable text editor / copy-save actions) on data.isBase64, and for non-base64 media either render via a Blob/object URL or fall back to a text preview.

Copilot uses AI. Check for mistakes.
Comment on lines +664 to +667
downloadBlob(
new Blob([details.contents || ""]),
node.data.name,
);
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When details.isBase64 is false, the download path creates new Blob([details.contents || ""]) without setting a MIME type. This can cause the downloaded file to be treated as an unknown type (and can break expected behavior for text files like SVG/XML). Prefer passing { type: details.mimeType || "text/plain" } (or a sensible default) when constructing the Blob.

Suggested change
downloadBlob(
new Blob([details.contents || ""]),
node.data.name,
);
const blob = new Blob([details.contents || ""], {
type: details.mimeType || "text/plain",
});
downloadBlob(blob, node.data.name);

Copilot uses AI. Check for mistakes.
Comment on lines +59 to +70
with open(bin_path, "wb") as f:
f.write(raw_bytes)
response = client.post(
"/api/files/file_details",
headers=HEADERS,
json={"path": bin_path},
)
assert response.status_code == 200, response.text
data = response.json()
assert data["isBase64"] is True
assert data["contents"] == base64.b64encode(raw_bytes).decode("utf-8")
os.remove(bin_path)
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test manually writes binary_file.bin into the shared module-level TemporaryDirectory and only deletes it at the end. If an assertion fails before os.remove(bin_path), the file will be left behind and can affect subsequent tests. Use a try/finally for cleanup (or a per-test tmp_path fixture) to guarantee removal.

Suggested change
with open(bin_path, "wb") as f:
f.write(raw_bytes)
response = client.post(
"/api/files/file_details",
headers=HEADERS,
json={"path": bin_path},
)
assert response.status_code == 200, response.text
data = response.json()
assert data["isBase64"] is True
assert data["contents"] == base64.b64encode(raw_bytes).decode("utf-8")
os.remove(bin_path)
try:
with open(bin_path, "wb") as f:
f.write(raw_bytes)
response = client.post(
"/api/files/file_details",
headers=HEADERS,
json={"path": bin_path},
)
assert response.status_code == 200, response.text
data = response.json()
assert data["isBase64"] is True
assert data["contents"] == base64.b64encode(raw_bytes).decode("utf-8")
finally:
if os.path.exists(bin_path):
os.remove(bin_path)

Copilot uses AI. Check for mistakes.
Comment on lines +23 to +68
test_dir = os.path.dirname(os.path.abspath(__file__))

# 1. cProfile .prof file
prof_path = os.path.join(test_dir, "sample.prof")
profiler = cProfile.Profile()
profiler.enable()
sum(range(10000))
profiler.disable()
profiler.dump_stats(prof_path)

# 2. Generic binary file with non-UTF8 bytes
bin_path = os.path.join(test_dir, "sample.bin")
with open(bin_path, "wb") as f:
f.write(struct.pack("!4sIf16s", b"MAGIC", 42, 3.14, b"\xff\xfe\xfd" + b"\x00" * 13))

# 3. Pickle file
pkl_path = os.path.join(test_dir, "sample.pkl")
with open(pkl_path, "wb") as f:
pickle.dump({"key": "value", "numbers": [1, 2, 3], "bytes": b"\xff\xfe"}, f)

# 4. A plain text file (control: should still work)
txt_path = os.path.join(test_dir, "sample.txt")
with open(txt_path, "w") as f:
f.write("This is a plain text file.\nIt should download correctly.\n")

# 5. A minimal valid PNG (1x1 red pixel) — tests the isMediaMime/dataURL path
def _make_png():
def chunk(ctype, data):
c = ctype + data
return struct.pack(">I", len(data)) + c + struct.pack(">I", zlib.crc32(c) & 0xFFFFFFFF)

return (
b"\x89PNG\r\n\x1a\n"
+ chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0))
+ chunk(b"IDAT", zlib.compress(b"\x00\xff\x00\x00"))
+ chunk(b"IEND", b"")
)

png_path = os.path.join(test_dir, "sample.png")
with open(png_path, "wb") as f:
f.write(_make_png())

print(f"Created test files in {test_dir}:")
for name in ["sample.prof", "sample.bin", "sample.pkl", "sample.txt", "sample.png"]:
p = os.path.join(test_dir, name)
print(f" {name}: {os.path.getsize(p)} bytes")
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This smoke test writes several files (sample.prof, sample.bin, sample.pkl, sample.txt, sample.png) directly into the repository directory containing the test (os.path.dirname(__file__)) and never cleans them up. Running smoke tests locally/CI can leave the working tree dirty and may cause confusion if these artifacts are committed accidentally. Consider writing into a temporary directory (and pointing the user to it), or add cleanup logic at the end of the cell/script.

Copilot uses AI. Check for mistakes.
@dmadisetti dmadisetti merged commit 285f02c into main Mar 26, 2026
52 checks passed
@dmadisetti dmadisetti deleted the dm/mo-3217 branch March 26, 2026 17:48
@github-actions
Copy link
Copy Markdown

🚀 Development release published. You may be able to view the changes at https://marimo.app?v=0.21.2-dev59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bash-focus Area to focus on during release bug bash bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Downloading binaries files from the marimo file explorer turns to base64

4 participants