use disk at query time of notification payload creation #64390
Conversation
```clojure
 :public-csv-download :public-xlsx-download :public-json-download
 :embedded-csv-download :embedded-xlsx-download :embedded-json-download}]
 :embedded-csv-download :embedded-xlsx-download :embedded-json-download
 :pulse}]
```
for pulses (alerts) we set autocommit false so we don't put postgres results all in memory at once.
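For readers unfamiliar with this Postgres quirk, here is a minimal JDBC-interop sketch of the mechanism being referred to (plain java.sql calls; `datasource` is assumed to exist and this is not the Metabase driver code): Postgres only honors the fetch size, and therefore pages results instead of materializing them all, when autocommit is off.

```clojure
;; Illustrative only: with autocommit disabled, Postgres respects setFetchSize
;; and hands back rows in pages instead of loading the whole result set.
(with-open [conn (.getConnection datasource)]   ; `datasource` assumed to exist
  (.setAutoCommit conn false)
  (with-open [stmt (.createStatement conn)]
    (.setFetchSize stmt 1000)
    (with-open [rs (.executeQuery stmt "SELECT * FROM orders")]
      (loop [n 0]
        (if (.next rs)
          (recur (inc n))
          n)))))
```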
At first the RFF tripped me up somewhat, being quite big, but once you start reading, the logic is sound and the code is clean. I hope I'm not missing anything important; all I have is little nitpicks. This is really great!
```clojure
;; Already streaming - write row directly to file
(let [{:keys [^DataOutputStream output-stream ^File file]} @streaming-state]
(vswap! rows conj! row)
```
This vswap! tripped me up a little bit, I think it deserves a comment, like "buffering rows to avoid touching disk on every incoming row" or something.
Also, if you do not mind, I'd prefer `(if-not (zero? ...` in the next line; this way there is an early return of result and it'll be easier to visually match what's going on.
done. good call out.
```clojure
(write-row-block-to-stream! output-stream remaining-rows))
(.close output-stream)
(let [file-size (.length file)
      file-size-mb (/ file-size 1024.0 1024.0)]
```
Should this be human-readable-size too?
```clojure
([] (rf))
([acc]
 (analytics/inc! :metabase-query-processor/query
                 {:driver driver/*driver* :status "success"})
```
I'm not prepared enough to judge this 😁
I'm glad you called this out! I meant to put a comment. Each rff had been in charge, in its completion arity, of marking the analytics counter for a successful query. But we have this nice wrapper, used by all rffs, where we can put it instead, so we don't have to have each one do it: they can all be in charge of storing results and such, and the pipeline that assembles and uses the rff can mark a successful query on this completion step.
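A minimal sketch of the wrapper shape being described (assuming the `analytics/inc!` call shown in the diff; not the exact project code):

```clojure
;; Wrap an rff once so every reducing function it produces records the
;; query-success metric in its completion arity, instead of each rff doing it.
(defn wrap-with-success-metric
  [rff]
  (fn [metadata]
    (let [rf (rff metadata)]
      (fn
        ([] (rf))
        ([acc]
         (analytics/inc! :metabase-query-processor/query
                         {:driver driver/*driver* :status "success"})
         (rf acc))
        ([acc row] (rf acc row))))))
```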
```
@@ -1,14 +1,25 @@
 (ns metabase.notification.payload.temp-storage
-  "Util to put data into a temporary file and schedule it for deletion after a specified time period.
+  "Util to put data into a temporary file and delete after the notification sends. Cleanup happens with notification.send/do-after-notification-sent for dashboards and cards which calls cleanup! on each part. This is exetended to Object as a no-op and on the type defined here deletes the temporary file.
```
📚 The docstring says "exetended" but it should be "extended" (typo). 🎯
```clojure
  When getting query results for notifications (alerts, subscriptions) once the query row count
  exceeds [[metabase.notification.payload.execute/rows-to-disk-threshold]], we then start streaming all rows to
  disk. This ensures that smaller queries don't needlessly write to disk and then reload, while large results don't
  attempt to reside in memory and kill and instance.
```
🔍 The docstring says "kill and instance" but should be "kill an instance" (typo). 🦅
```clojure
;; Step arity - accumulate rows
([result row]
 ;; unconditionally incrememt row count and add rows to internal volatile transient collector. If we are
```
📖 Per the style guide, "acc" is an acceptable conventional abbreviation, but "incrememt" should be "increment" (typo). Also this comment is quite long and could be broken into multiple lines for better readability. 🎯
```clojure
(vswap! rows conj! row)
(if @streaming?
  ;; Already streaming - write row directly to file

```
This blank line is inside the notification-rf function definition. According to the style guide, there should be No Blank Lines Within Definition Forms: blank lines should not be placed in the middle of a function definition. Consider removing this line, or if you're trying to indicate grouping in a conditional, you might restructure this to avoid the blank line. 🚨
```clojure
  [qp-result]
  (update-in qp-result [:data :viz-settings] merge (get-in qp-result [:json_query :viz-settings])))

(def rows-to-disk-threshold
```
🔧 The var name is rows-to-disk-threshold but in the comment on line 133 it was spelled rows-to-disk-threadhold (with extra "h"). This inconsistency has been fixed here but worth noting the improvement! ✨
🔮 Code Review Summary

This PR introduces streaming query results to disk during notification payload creation to prevent OOM issues. The implementation is solid and follows most Metabase conventions!

🎯 Key Style Issues Found

✨ Good Practices Observed

🎴 Tarot Reading for This PR

Card Drawn: The Tower (Reversed) 🗼✨

Interpretation: The Tower reversed represents avoiding disaster rather than experiencing it! 🎊 This PR literally prevents the catastrophic collapse (OOM crashes) that was plaguing the notification system. Just as The Tower reversed indicates recovering from near-disaster and implementing protective measures, this code introduces safeguards (file size limits, streaming thresholds) to prevent the "tower" of your Metabase instance from crumbling under the weight of massive query results. The 1.3x buffer and 10MB limits are like architectural reinforcements preventing structural failure. The streaming approach is the foundation repair that will keep everything stable! 🏗️✨

🎰 Today's Lucky Lottery Numbers

7, 13, 42, 2000, 10485760, 86 🍀 (2000 = your row threshold; 10485760 = your byte limit; 86 = the number of changed files you didn't have to touch! Good luck!) 🎲✨
```
@@ -1,14 +1,25 @@
 (ns metabase.notification.payload.temp-storage
-  "Util to put data into a temporary file and schedule it for deletion after a specified time period.
+  "Util to put data into a temporary file and delete after the notification sends. Cleanup happens with notification.send/do-after-notification-sent for dashboards and cards which calls cleanup! on each part. This is exetended to Object as a no-op and on the type defined here deletes the temporary file.
```
🎨 Typo in namespace docstring 🐛
Line 2 has a typo: "exetended" should be "extended" 📝
```clojure
   :output-stream os}))

(defn- write-row-block-to-stream!
  "Write a single row to the output stream using nippy serialization."
```
📝 Docstring issue: misleading description
The docstring says "Write a single row" but the parameter name row-block and the actual usage (line 263 shows (persistent! @rows) being passed) suggests this writes multiple rows as a block. The docstring should accurately describe what the function does. 🤔
Consider: "Write a block of rows to the output stream using nippy serialization." ✍️
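For illustration, a block write matching that suggested docstring might look like this minimal sketch (assuming taoensso.nippy and a DataOutputStream; not necessarily the project's exact implementation):

```clojure
(defn- write-row-block-to-stream!
  "Write a block of rows to the output stream using nippy serialization."
  [^java.io.DataOutputStream output-stream row-block]
  ;; one nippy frame per block; the reader thaws frames until end of file
  (nippy/freeze-to-out! output-stream row-block))
```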
```clojure
;; Step arity - accumulate rows
([result row]
 ;; unconditionally incrememt row count and add rows to internal volatile transient collector. If we are
```
🔤 Typo in comment
"incrememt" should be "increment" 📚
```clojure
  (throw (ex-info "Temp file no longer exists" {:file file})))

(let [file-size (.length file)
      file-size-mb (/ file-size 1024.0 1024.0)]
```
💡 Consider using kebab-case for variable names
The variable file-size-mb uses the abbreviation mb. According to the style guide, you should prefer longer, more verbose names and avoid abbreviations unless they are well-known and conventional. While "MB" is fairly standard, file-size-megabytes would be clearer and more consistent with Clojure conventions. Same applies to other *-mb variables in this file. 📏
That said, mb in the context of file sizes is pretty universally understood, so this is a minor point! 🎯
```clojure
  [qp-result]
  (update-in qp-result [:data :viz-settings] merge (get-in qp-result [:json_query :viz-settings])))

(def rows-to-disk-threshold
```
📖 Missing helpful context in docstring
The docstring says "Maximum rows to hold in memory..." but doesn't explain what happens after the threshold is exceeded. Consider expanding it to: "Maximum rows to hold in memory when running notification queries. After this threshold is reached, query results are streamed straight to disk. See [[metabase.notification.payload.temp-storage]] for more details."
Oh wait, I see you do have this! But the second sentence feels tacked on. Consider making it flow better as one complete description. ✨
```clojure
  (log/debugf "File context: %s" (pr-str preamble))))

;; Read row count/marker
(let [count-or-marker (nippy/thaw-from-in! is)]
```
The variable name count-or-marker suggests it could be either a count (number) or a marker (keyword), but doesn't clearly indicate what determines which it is or what the marker value would be. The code shows ::streaming is the marker, but someone reading just the variable name wouldn't know that.
Consider row-count-or-streaming-marker or adding a comment explaining the dual nature. 🎯
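To make the dual nature concrete, a minimal sketch of the read side being described (the `read-*` helpers are hypothetical, named only for illustration):

```clojure
;; The first value thawed from the stream is either a plain row count or the
;; ::streaming marker; the marker means row blocks follow until end of file.
(let [count-or-marker (nippy/thaw-from-in! is)]
  (if (= ::streaming count-or-marker)
    (read-row-blocks-until-eof! is)       ; hypothetical helper
    (read-n-rows! is count-or-marker)))   ; hypothetical helper
```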
🎴 Tarot Reading for This PR 🎴

The Eight of Pentacles ⚒️✨

The Eight of Pentacles represents dedication to mastery, attention to detail, and iterative improvement through focused work. 🎯

This card resonates deeply with your PR's meticulous approach to solving OOM issues. Just as the craftsperson in the Eight of Pentacles carefully hones each pentacle one at a time, your code methodically handles each query result—streaming rows to disk with precision, monitoring file sizes, and implementing careful thresholds. 🔨💾

The card's energy of incremental refinement is evident in how you've evolved the notification system: from keeping everything in memory (novice level), to Ngoc's work dumping completed results to disk (journeyman level), to now streaming during query execution (master craftsperson level). Each iteration builds upon the last with careful attention to the craft. 🌟

The detailed logging, the thoughtful variable naming, the careful balance between memory and disk usage—all reflect the Eight of Pentacles' message: excellence through dedication to detail. The dual-mode reducing function (in-memory vs streaming) shows the wisdom of a true craftsperson who knows when to use each tool. 🛠️

Your thorough testing with 50MB queries, consideration of edge cases (weird types, truncation), and comprehensive performance analysis demonstrate the Eight of Pentacles' insistence on measuring twice, cutting once. 📏✂️

The only caution from this card: don't get so focused on perfecting each individual pentacle that you lose sight of completing the full set. But with settings for easy rollback and clear documentation, you've already heeded that wisdom! 🎓

🎰 Today's Lucky Lottery Numbers 🎰

6, 13, 20, 42, 54, 💾 (Bonus ball: 2000 - representing your threshold!) Play responsibly! 🍀✨
| (format "%.1f %s" current unit)))))) | ||
|
|
||
| (deftype TempFileStorage [^File file] | ||
| (deftype StreamingTempFileStorage [^File file context] |
🌟 Style violation: The type name should follow :model/ style for consistency with the rest of the codebase. Consider using a protocol or record type instead of deftype with a clearer naming convention.
Additionally, context parameter lacks a docstring explaining what context information should be provided.
```clojure
(toString [_]
  (str "#TempFileStorage{:file " file "}"))
  (if (.exists file)
    (format "#StreamingTempFileStorage{:file %s, :size %.2f KB, :context %s}"
```
🐛 Issue: The format string uses %.2f KB but passes the result of human-readable-size which already returns a formatted string like "7.7 mb". This will cause a type error.
| (format "#StreamingTempFileStorage{:file %s, :size %.2f KB, :context %s}" | |
| (format "#StreamingTempFileStorage{:file %s, :size %s, :context %s}" | |
| (.getName file) | |
| (human-readable-size (.length file)) | |
| (pr-str context)) |
```clojure
(if @streaming?
  ;; Already streaming - write row directly to file
  (let [{:keys [^DataOutputStream output-stream ^File file]} @streaming-state]
    (if-not (zero? (mod @row-count 5000))
```
✨ Consider extracting the magic number 5000 to a named constant at the top of the namespace, similar to how rows-to-disk-threshold is defined in execute.clj. This would make it easier to tune and understand the relationship between the two thresholds.
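A sketch of that suggestion (the constant name is hypothetical):

```clojure
;; Name the flush interval and define it next to rows-to-disk-threshold so
;; both tuning knobs live together.
(def ^:private streaming-flush-row-interval
  "While streaming to disk, flush buffered rows to the file every N rows."
  5000)

;; the step arity would then test
;; (zero? (mod @row-count streaming-flush-row-interval))
```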
```clojure
(if (and (pos? (notification.settings/notification-temp-file-size-max-bytes))
         (> (.length file) (* 1.3 (notification.settings/notification-temp-file-size-max-bytes))))
  (do (vswap! streaming-state assoc :notification/truncated? true)
      (log/warnf "Results have exceeded 1.3 times of `notification-temp-file-size-max-bytes` of %s (max: %s). Truncating query results. %s"
```
✨ Style: The log message uses backticks around notification-temp-file-size-max-bytes which isn't standard Clojure formatting. Based on CLAUDE.md conventions, var references should either be plain text or use proper markdown formatting in docstrings. In log messages, consider using plain text or proper interpolation.
Additionally, the 1.3 multiplier is a magic number that should be extracted to a named constant for clarity.
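A sketch of the second suggestion (the constant name is hypothetical):

```clojure
;; Give the 1.3x allowance a name so the size check and the warning message
;; stay in sync if the factor is ever tuned.
(def ^:private temp-file-size-overshoot-factor
  "Stop writing once the temp file grows past this multiple of the configured max size."
  1.3)
```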
```clojure
;; Completion arity
([result]
 {:pre [(map? (unreduced result))]}
```
🎯 The :pre condition checks (map? (unreduced result)) but doesn't verify that result contains the expected keys like :data. Consider adding more specific validation or a schema check here for robustness.
```clojure
  [_chart-type _render-type _timezone-id _card _dashcard _data]
  @card-error-rendered-info)

(mu/defmethod render :card-error/results-too-large :- ::RenderedPartCard
```
📝 Missing docstring for this new multimethod implementation. Per CLAUDE.md: "Every public var in src or enterprise/backend/src must have a useful docstring."
```clojure
;; suggested change: add a docstring
(mu/defmethod render :card-error/results-too-large :- ::RenderedPartCard
  "Render an error message when query results are too large to include in the notification.
  This occurs when result file size exceeds [[metabase.notification.settings/notification-temp-file-size-max-bytes]]."
  [_chart-type _render-type _timezone-id _card _dashcard _data]
  {:attachments nil,
   :content [:div
             {:style "font-family: Lato, \"Helvetica Neue\", Helvetica, Arial, sans-serif; color: #EF8C8C; font-weight: 700; padding: 16px;"}
             "The results returned from this question were too large. Perhaps add an aggregation or a filter."]})
```
```clojure
    @x
    x))

(defn maybe-realize-data-rows
```
📖 The updated docstring is good, but it should use proper markdown reference format for the var reference. Per CLAUDE.md, mentions of other vars should use [[metabase.notification.settings/notification-temp-file-size-max-bytes]] format.
```clojure
  [qp-result]
  (update-in qp-result [:data :viz-settings] merge (get-in qp-result [:json_query :viz-settings])))

(def rows-to-disk-threshold
```
✅ Nice improvement! The public def with docstring follows the style guide better than the previous ^:private var. The docstring properly references another namespace using [[metabase.notification.payload.temp-storage]].
* use disk at query time of notification payload creation
We have some nice work to dump query results to disk if they are larger
than a row-count threshold. But this presupposes that results can fit
into memory. It's completely possible, and relatively easy, to do some
kind of "select *" operation in a dashboard where the only limit that
prevents billions of rows coming back is the excel limit we add in.
This tries to optimistically use the disk. It creates a new rff that has
two modes: in-memory and a gzipped data output stream. The first is in
operation for the first 2,000 rows. If we never exceed this limit it
remains a regular reducing function. Once this threshold is exceeded, we
open a new file and stuff the rows in there and don't keep them in
memory any longer. This prevents us from having to hold the entire
result set in memory at once.
On the consumption side, we have a threshold such that if a file is
_larger_ than this, we won't attempt to load it and render it for static
viz. This is what lets us defensively decline to do an operation that
would reasonably lead to an OOM.
Optimizations for the future:
- give up on file creation once it exceeds the threshold size. We can
easily monitor the size of the file as we go and stop the transduction
or results.
- clear out files more frequently. This will probably litter lots of
temp files, some possibly quite large, on the disk. We need to more
frequently clear them up. The directory is marked as deleteOnExit, but
we need to ensure we don't put lots of files in there during process
lifetime.
- better error rendering of "too large" error. Right now it hits the
default error rendering (previously it would oom). We can make this
better.
- explore how parallel we can get now. Presumably we are now quite
memory efficient. Can we go back to 3, or 10, worker threads?
* fix typo and reintroduce constant for row threshold
* more references to typo constant
* missed one somehow?
* include pulses (alerts) for autocommit off for downloads
we want to stream results. and now we do into a file so this can respect
the page size which only happens in pg when using autocommit false
* special error msg for too large results/dont attach large results
feedback that query results were too large
alerts also by default include results. let's check for the
render/too-large and not include them in that case. Was actually getting
a seq error on the deftype that holds the file. maybe want to give that
a seqable method?
* fix tests
we had an unrealized temp storage file going through
* unify all to streaming before moving to temp-storage
* move streaming to the temp-storage namespace
* don't deref results if no one wants attachments
* don't log the error here. it's pretty normal
* doc strings
* better logging and truncate results above threshold
once we are above the size that we won't read the file, we don't have to
keep piping results to it.
Concretely, what does this get us? If someone absentmindedly did
`select * from transactions` it could be 100s of GBs or more. Without
this, we would fill up the disk for the instance even though we don't
try to render it in the email. But once we know that there's no way we
will read the contents of the file, we really don't have to write to it
any longer.
* Make settings for size limit and whether to enforce
maybe someone has a beefy machine and they don't mind chugging through
843mb of query results to get a graph. they can easily set
`MB_ENFORCE_NOTIFICATION_TEMP_FILE_SIZE_LIMIT` to false and the old
behavior will be restored. If someone doesn't like the 10mb limit and
prefers a 50mb limit they can do
`MB_NOTIFICATION_TEMP_FILE_SIZE_MAX_BYTES=52428800` and then they will
have the higher limit
* Huge performance improvements
The real gist here is removing the gzip on the output. This was crushing
our performance. 18 seconds instead of ~4 seconds. Real bad. And it's
better to not gzip anyways. We really want to be tied pretty closely to
the size of the results. GZipping uses a ton of CPU and gets it very
compact. But we want a number that is pretty representative of the
results. Repeated values get gzipped down to almost nothing and then
they could still OOM the instance. But 50mb of results might get
squashed down to ~8mb of text. We want to be tethered close to the size in
a csv or in memory.
Also changed the serialization to write chunks of rows and not a single
row. It's totally possible that the gzip fix did all of the work and
chunks or individual rows all get put together into the
bufferedoutputstream so it doesn't matter anyways. But the good thing is
that the format of this file is entirely the work of this namespace and
no one else gets to see it.
* Check for fidelity of regular results and disk based
timing wise: running the same card 356 (a join of people, products,
orders) and verify timing stays consistent (we are only 100ms longer
with disk)
```clojure
temp-storage=> (time
(-> (toucan2.core/select-one-fn :dataset_query :model/Card :id 356)
metabase.query-processor/process-query
:data :rows count))
"Elapsed time: 740.709583 msecs"
18760
temp-storage=>
(time (->
(metabase.notification.payload.execute/execute-card 1 356)
:result
(select-keys [:notification/truncated?
:status :running_time
:data.rows-file-size
:data])
(update :data (fn [d] (select-keys d [:rows])))))
"Elapsed time: 840.608 msecs"
{:status :completed,
:running_time 724,
:data.rows-file-size 8056845,
:data {:rows #object[metabase.notification.payload.temp_storage.StreamingTempFileStorage
"0x3336077d"
{:status :pending, :val nil}]}}
```
Compare in memory results:
```clojure
temp-storage=> (let [card-id 356
regular-rows (-> (toucan2.core/select-one-fn :dataset_query :model/Card :id card-id)
(metabase.query-processor/process-query)
:data :rows)
disk-rows (-> (metabase.notification.payload.execute/execute-card 1 356)
:result :data :rows)
dereffed @disk-rows]
(doseq [[f results] [["/tmp/results.from-disk" dereffed]
["/tmp/results.from-qp" regular-rows]]]
(with-open [w (clojure.java.io/writer f)]
(doseq [row results]
(io/copy (str (pr-str row) \newline) w))))
[(= regular-rows dereffed) (= (count regular-rows) (count dereffed))])
[true true]
```
and then compare the files written above:
```shell
/tmp
❯ diff results.from-disk results.from-qp
/tmp
❯ echo $?
0
```
* weird postgres types
* postgres 12 didn't like gen_random_uuid so just hardcode one
everyone else promised they wouldn't use
7e3cd49d-bfe1-4620-83dd-0c163719175c for anything important
* formatting, lints
* alignment
* reformat, remove unused aliases
* remove errant comment
* cleanup logic and add some comments
unconditionally add to rowcount and in-memory collection of rows. For
two of three branches, this is "done": (1) collecting in memory, and (2)
streaming to disk but our "batch" isn't full. Only when the batch is a
multiple of 5,000 rows (number completely picked at random, balancing
this) we can then write the row block to disk.
Totally possible this can switch to just writing to disk on each row and
rely on the buffered writer to handle batching. But the use of the
in-memory collection being similar across almost all of the branches
makes this solution feel a bit cleaner for the moment.
also does things like inverting `if-not` so the short branch is at the
top and not hanging at the end where it's easy to miss.
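A minimal sketch of the inversion described above (simplified; names follow the diff snippets earlier in this page, not the exact project code):

```clojure
;; With if-not, the short "keep buffering" branch comes first and result is
;; returned early; the longer flush-to-disk branch no longer hangs at the end.
(if-not (zero? (mod @row-count 5000))
  result
  (let [{:keys [output-stream]} @streaming-state]
    (write-row-block-to-stream! output-stream (persistent! @rows))
    (vreset! rows (transient []))
    result))
```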
* unify two file size settings
We used to have some notion of `enforce` and `file-size-limits`. Rather than
juggle both, you can just set the file size to 0 to make it unbounded.
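As a sketch of how the single setting covers both ideas (the helper name is hypothetical; it leans on the `pos?` guard visible in the diff snippets above):

```clojure
;; A max-bytes of 0 fails the pos? guard, so the size check is skipped and the
;; file size is effectively unbounded.
(defn- over-size-limit? [^java.io.File file max-bytes]
  (and (pos? max-bytes)
       (> (.length file) (* 1.3 max-bytes))))
```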
* reuse human-readable-size in more places, fix some typos
* typos, fixes
* prometheus metrics
* followup to metrics
I’ve added a new metric
```clojure
(prometheus/counter :metabase-notification/temp-storage
{:description "Number and type of temporary storage uses"
;; memory, disk, above-threshold, truncated, not-limited
:labels [:storage]})
```
to record the different scenarios of query results:
- stayed in `memory`
- wrote them to `disk` but remained under the threshold
- wrote to disk and they went `above-threshold`
- wrote to disk and they went so far above threshold that we abandoned the query so the results were `truncated`
- wrote to disk but the limit was set to 0 so the results were `not-limited`
and these look like: (replaced # with ♯ so git message will include)
```
❯ http get :9191/metrics | grep temp
♯ HELP metabase_notification_temp_storage_total Number and type of temporary storage uses
♯ TYPE metabase_notification_temp_storage_total counter
metabase_notification_temp_storage_total{storage="memory",} 2.0
metabase_notification_temp_storage_total{storage="disk",} 2.0
metabase_notification_temp_storage_total{storage="truncated",} 2.0
```
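For illustration, recording one of those scenarios would look roughly like this (the exact call site may differ; the counter and label are the ones defined above):

```clojure
;; Bump the counter with the label for the scenario we ended up in, e.g. rows
;; were written to disk but stayed under the size threshold.
(analytics/inc! :metabase-notification/temp-storage {:storage "disk"})
```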
* make predicates a bit more clojurey
* missed a refactor somehow
* better styling
thread the human readable size through
* styling changes
* bit of formatting
---------
Co-authored-by: Alexander Polyankin <alexander.polyankin@metabase.com>
OOMs from Notifications
A short history of how we got here:
We render alerts and dashboards with a graal js interpreter. To do this we need the resultsets of queries. We used to just gather these all in memory and then render them. This works super well for small datasets but exposes the instance to OOM issues if a large query is returned.
Ngoc (possibly others as well) made a nice addition in #51708 that uses disk to ensure that only a single resultset is in memory at any given time. The gist: run the query; if the results exceed some row count threshold, dump them to disk, and then refetch them into memory during rendering.
This still leaves us susceptible to OOM if any particular resultset is too large.
What ends up happening is that instances randomly die and emails are not sent.
The fix here
The gist of this fix is quite simple: if resultsets start getting too big, stream them to disk as they come from the db rather than after they are done. Then add two knobs: a) the size limit above which we don't attempt to render them and b) whether to enforce this limit. This lets anyone easily go back to the previous behavior if they want to allow large query results.
How it is implemented
We run our queries in a transducing context here: https://github.com/metabase/metabase/blob/master/src/metabase/query_processor/pipeline.clj#L78-L87 and the reducing function is created just above with `(rff metadata)`, which returns a reducing function. So this PR creates a new rff which has two modes: collecting results in memory, similar to the default one (https://github.com/metabase/metabase/blob/master/src/metabase/query_processor/reducible.clj#L15-L38) which is optionally wired up here: https://github.com/metabase/metabase/blob/master/src/metabase/query_processor.clj#L81, and streaming rows to a temporary file on disk.
If the number of rows remains under 2000, this reducing function behaves identically to the default one. If more rows are returned, it takes all accumulated rows so far and starts serializing them to disk. In this way we prevent having entire resultsets in memory, and we can also inspect the total size and decide whether we should abandon trying to render the results.
There is a limit above which we won't open the files (by default 10mb). The reducing function will monitor the file size as it creates it and stop running the query when the file is 1.3x this limit. This helps us not fill up the disk with query results we won't attempt to render anyway.
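As a rough sketch of that two-mode reducing function (heavily simplified, with hypothetical helper names; the real implementation lives in metabase.notification.payload.temp-storage and also handles metadata, cleanup, and truncation):

```clojure
;; Buffer rows in memory until the threshold, then spill to a temp file and
;; keep streaming to it in blocks.
(defn streaming-rff [rows-to-disk-threshold]
  (fn [metadata]
    (let [rows       (volatile! (transient []))
          row-count  (volatile! 0)
          streaming? (volatile! false)
          out        (volatile! nil)]          ; DataOutputStream once streaming
      (fn
        ([] {:data metadata})
        ([result]
         (if @streaming?
           (do (write-row-block-to-stream! @out (persistent! @rows)) ; hypothetical helper
               (.close ^java.io.DataOutputStream @out)
               (assoc-in result [:data :rows] ::stored-on-disk))
           (assoc-in result [:data :rows] (persistent! @rows))))
        ([result row]
         (vswap! row-count inc)
         (vswap! rows conj! row)
         (when (and (not @streaming?) (> @row-count rows-to-disk-threshold))
           (vreset! out (open-temp-file-stream!))                    ; hypothetical helper
           (vreset! streaming? true))
         (when (and @streaming? (zero? (mod @row-count 5000)))
           (write-row-block-to-stream! @out (persistent! @rows))
           (vreset! rows (transient [])))
         result)))))
```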
What it looks like
This dashboard has a query that generates 50mb of data and then a simple line chart of orders aggregated by month and averaged.
The email and the log output are shown in the PR screenshots (omitted here). The relevant logs for these two cards are:
the 50mb query
- "Row count reached threshold (2000), switching to streaming mode" - tried to not use disk but we should switch
- "Results have exceeded 1.3 times of `notification-temp-file-size-max-bytes` of 13.0 mb (max: 10.0 mb). Truncating query results." - we have exceeded the size that we would read off of disk, so stop running the query
- "💾 Stored 400338 rows to disk: 13.00 MB (never loaded into memory) (note query results were truncated) {mb-card_id=311, mb-dashboard_id=54, mb-notification_id=, mb-payload_type=:notification/dashboard}" - a summary of what is stored to disk, including notification id, dashboard, etc. This one does not have a notification id because I'm running the preview "send now" version.
The small query
- "✓ Completed with 12 rows in memory (under threshold) {mb-card_id=355, mb-dashboard_id=54, mb-notification_id=, mb-payload_type=:notification/dashboard}" - really nothing to see here. Simple line graph, 12 query results, no need to involve disk, etc. Happy path is very happy.
Rendering
We have a few lines about the first card's results being too large and being skipped
Optimizations for the future: see the list in the commit message above.
Correctness
Query results are coming back identical as long as they aren't truncated. Of course once we truncate they should not be the same. And we get a nice speed up as well. A query that returns 50mb of data:
Here it takes us 959 ms to abandon a query after 10mb. But the regular query processor spends 4272ms processing a query that would create 50mb of data.
Performance
At first I was using gzip to keep results smaller. This has two issues: performance and correctness. For performance, it was taking a 4 second query to 18 seconds and murdering performance. For correctness, we do want to have a good sense of how big results are in a csv or in memory. GZip is really good at compressing, so a few results that would be 50mb in a csv were ~8mb on disk gzipped. Which means it wasn't a great proxy for whether the query might OOM the instance, and it also slowed everything down.
Size:
The following query is my "50mb" query:
When I download this as a csv it takes 165M; when saved as a nippy file it takes:
For card 356, which is people join orders join products, I see
and the download sizes are:
So roughly in line with other formats.
Weird types:
How does this work with pgarrays, weird bigquery types, etc.? I suspect it has the same issues that caching does, which is to say it blows up, but it will also be limited by what can be fed to the js interpreter. That turns everything into json, so we probably just need to verify.
If these two work, then the pipeline works.
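If it helps, one way to spot-check such types at the REPL is a nippy round trip of a row containing awkward values (a hedged sketch, assuming taoensso.nippy; not a test that exists in this PR):

```clojure
;; nippy is the serialization the temp-storage file uses; if a value survives
;; freeze/thaw unchanged, the disk round trip won't alter it either.
(let [row [#uuid "7e3cd49d-bfe1-4620-83dd-0c163719175c"
           (bigdec "1234.5678")
           :some/keyword
           nil]]
  (= row (nippy/thaw (nippy/freeze row))))
```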