Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix in-memory logger #24616

Merged
merged 18 commits into from Aug 5, 2022
Merged

Fix in-memory logger #24616

merged 18 commits into from Aug 5, 2022

Conversation

dpsutton
Copy link
Contributor

@dpsutton dpsutton commented Aug 4, 2022

Our Admin > Troubleshooting > Logs page broke, just showing a spinner
and never showing the logs.

Remove our memoizing logger

For 39.2 we made our uberjar multi-release but we had to revert due to degraded query performance. Jeff tracked this down to a huge decrease in speed of logging, even noops like (log/trace "info").

To re-institute the multi-release property, we've attempted to increase the speed of this logging by memoizing the lookup from namespace to logger.

Summary from

(bench (log/trace "no-op"))
Summary for benching log/info which is a no-op:
39.2:                              11534ns (11.534614 µs)
RC2:                               98.408357 ns
Multi-release:                     2236 ns (2.236756 µs)
multi-release with logger factory: 141.080913 ns

But we didn't notice one critical issue when doing this: it was using the (log.impl/slf4j-factory) which would delegate to log4j2.

But if we just set the logger factory as the log4j2 factor we're back to native speeds:

(bench (log/trace "no-op"))
Memoizing factory: 141 ns
slf4j factory:     2269 ns
log4j2 factory:    31 ns

One issue still remained: the difference between (LogManager/getContext true) vs when called with false. Log4j2 by default uses a ClassLoaderContextSelector which can handle multiple logging contexts, usually for web apps in the same jvm. (see logging separation docs). We can change this with a system property to a BasicContextSelector where there is only ever one context. But to do this we need the system property set before the logging static initializers are run. Thus metabase.bootstrap.

;; run locally built jar
;; MB_JETTY_PORT=4000 java "$(socket-repl 4001)" -cp locally-built.jar:criterium.jar metabase.bootstrap

❯ netcat localhost 4001
user=> (doto 'metabase.logger require in-ns)
metabase.logger

;; confirm that our context selector is the BasicContextSelector
metabase.logger=> (type (.. (LogManager/getFactory) getSelector))
org.apache.logging.log4j.core.selector.BasicContextSelector

;; and that selector ensures we have only one logger context
metabase.logger=> (identical? (LogManager/getContext true) (LogManager/getContext false))
true
metabase.logger=> (require '[criterium.core :refer [bench]])
nil

;; we're still speedy at 23 nanoseconds
metabase.logger=> (bench (log/trace "no-op"))
Evaluation count : 1953831000 in 60 samples of 32563850 calls.
             Execution time mean : 23.818734 ns
    Execution time std-deviation : 0.065813 ns
   Execution time lower quantile : 23.749761 ns ( 2.5%)
   Execution time upper quantile : 24.006678 ns (97.5%)
                   Overhead used : 6.883426 ns

Found 4 outliers in 60 samples (6.6667 %)
	low-severe	 1 (1.6667 %)
	low-mild	 3 (5.0000 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
nil

;; verify we are correctly setting the log4j2 factory
metabase.logger=> (clojure.tools.logging.impl/name clojure.tools.logging/*logger-factory*) 
"org.apache.logging.log4j"
metabase.logger=>

EDIT: no need to backport this since release-x.44.x doesn't have the new logger (related to multi-release and upgrading graal/js dep)

Our Admin > Troubleshooting > Logs page broke, just showing a spinner
and never showing the logs.

Don't quite understand why this fixes it. Javadocs are
https://logging.apache.org/log4j/2.x/log4j-api/apidocs/org/apache/logging/log4j/LogManager.html#getContext-boolean-

```clojure
logger=> (log/warn "test")
nil
logger=> (count @messages*)
0
;; no in-memory logs so page is empty
;; change `(LogManager/getContext true)` in the momoized ns-logger fn
;; and then observe:
logger=> (log/warn "test")
nil
logger=> (count @messages*)
4
```

Some explorations that might shine some light:

```clojure
logger=> (into {} (.getAppenders (.getLogger (LogManager/getContext false) (str *ns*))))
{}
logger=> (into {} (.getAppenders (.getLogger (LogManager/getContext true) (str *ns*))))
{"metabase-appender" #object[metabase.logger.proxy$org.apache.logging.log4j.core.appender.AbstractAppender$ff19274a
                             "0x4d680247"
                             "metabase-appender"]}
```

So something is not hooked up quite right.
@dpsutton dpsutton requested a review from a team August 4, 2022 19:28
@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #24616 (72cd632) into master (d6bc5e9) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #24616      +/-   ##
==========================================
- Coverage   65.29%   65.25%   -0.04%     
==========================================
  Files        2788     2803      +15     
  Lines       85301    85326      +25     
  Branches    10530    10653     +123     
==========================================
- Hits        55696    55682      -14     
- Misses      25258    25266       +8     
- Partials     4347     4378      +31     
Flag Coverage Δ
back-end 85.82% <100.00%> (+0.01%) ⬆️
front-end 46.36% <ø> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/metabase/core.clj 17.46% <ø> (ø)
src/metabase/logger.clj 80.64% <100.00%> (+15.73%) ⬆️
frontend/src/metabase/lib/formatting/date.js 86.66% <0.00%> (-8.69%) ⬇️
src/metabase/task/sync_databases.clj 77.08% <0.00%> (-3.48%) ⬇️
frontend/src/metabase/lib/time.ts 62.68% <0.00%> (ø)
...nd/src/metabase-lib/lib/queries/StructuredQuery.ts 73.26% <0.00%> (ø)
...src/metabase/visualizations/lib/settings/column.js 70.32% <0.00%> (ø)
frontend/src/metabase/lib/formatting/colors.ts 0.00% <0.00%> (ø)
frontend/src/metabase/lib/formatting/email.tsx 77.77% <0.00%> (ø)
...components/template_tags/TagEditorParam.styled.tsx 100.00% <0.00%> (ø)
... and 17 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

Copy link
Member

@camsaul camsaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably figure out how to write a test for this e.g. log a message and see if it comes back in the logs API endpoint. But since we didn't have one yet I think it's ok to do that in a follow-on PR.

@camsaul
Copy link
Member

camsaul commented Aug 4, 2022

Probably also worth opening an issue about the FE spinning forever. I'm guessing the API response comes back empty, but since there are no results the FE never hides the loading spinner? Seems like a bug. We should probably show the "No results" image or "there was a problem" image or something like that.

IRL hopefully the endpoint is always going to be returning things, but if this is how it handles empty results I'm worried about its error handling in general. If the request times out or something does it show the spinner forever as well? How about a 5xx error response? We need to make that page more robust

@dpsutton
Copy link
Contributor Author

dpsutton commented Aug 4, 2022

Tests added and FE issue created for error handling / empty logs cases. #24621

When in a namespace with a record, `Foo resolves to ns.Foo. But outside
it resolves to ns/Foo. When running tests from the command line *ns* is
user so it gets more complicated.
History:

39.2 we set `Multi-Release: true` in our manifest file and query speed
drops like a stone. Jeff was able to track this to our logger calls in
tight loops.

We revert the multi-release and keep seeing the warning on startup

> WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.

Benchmarking on 39.2
(bench (log/trace "hi")) -> 15383 ns

So we freaked out and set up a memoizing logger factory

(bench (log/trace "hi")) -> 141 ns

What a win.

But we never noticed that the default *logger-factory* being picked up
was slf4j ( `(log.impl/name log/*logger-factory*)` -> "org.slf4j" ). On
39.2 if you set the factory to the log4j2 version you get back to a
great number: `(bench (log/trace "hi"))` -> 25 ns

And thus ensuring that our logger is the default log4j2 version is even
faster than our memoizing logger-factory.

Memoizing factory: 141 ns
slf4j factory: 2269 ns
log4j2 factory: 31 ns

What does `(LogManager/getContext false)` mean versus using `true`? We
only need and want a single context. But log4j2 by default uses a
context selector called `ClassLoaderContextSelector`. We could put all
of this behind us if we used a context selector type
`BasicContextSelector` but this is surprisingly hard to do: you have to
set a system property. And since all of this stuff gets initialized in
static initializers, we don't have an opportunity to do this
programmatically. The way around this is to require people to pass this
system property on startup which is not acceptable.

So getContext true checks for a threadlocal context in a specific static
variable and falls back to a Default context. getContext false looks at
classloaders and ends up at a different context. BUT: the log.impl
version uses a closure over getContext false instead of getting it each
time. And I suspect that when it does this there is only one so it is
the default and continues to use this one. In our LoggerFactory
implementation we were looking up the context each time. This still
seems to work and everything is playing nice in our application
classloader but its totally possible that our drivers are not hitting
this. I'll have to investigate this.

Verification:
- build the uberjar locally (`bin/build`)
- copy to some temp directory and also copy criterium.jar

```shell
MB_JETTY_PORT=4000 java "$(socket-repl 4001)" -cp locally-built.jar:criterium.jar metabase.core
```

```clojure
/tmp/locally-built via ☕ v17.30 on ☁️  metabase-query
❯ nc localhost 4001
user=> (doto 'metabase.logger require in-ns)
metabase.logger
metabase.logger=> (require '[criterium.core :refer [bench]])
nil
metabase.logger=> (bench (log/trace "hi"))
Evaluation count : 1686535500 in 60 samples of 28108925 calls.
             Execution time mean : 22.487972 ns
    Execution time std-deviation : 0.101004 ns
   Execution time lower quantile : 22.326806 ns ( 2.5%)
   Execution time upper quantile : 22.648368 ns (97.5%)
                   Overhead used : 6.924761 ns
nil
metabase.logger=> (count (messages))
358
metabase.logger=>
```

Verifies that we are on the order of 22 nanoseconds and the in-memory
logger has messages in it.
New entrypoint for the application: metabase.bootstrap

sets two properties for logging (context selector, log4j2 factory) and
ensures those properties are set before any logging code is loaded.
deps.edn Outdated
Comment on lines 10 to 11
amalloy/ring-gzip-middleware {:mvn/version "0.1.4"} ; Ring middleware to GZIP responses if client can handle it
bigml/histogram {:mvn/version "4.1.4"} ; Histogram data structure
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty big indentation diff here -- I'm worried this will cause merge conflicts in other PRs that bump deps or whatever. Maybe worth reverting it

We do have a comment-column declaration here

;; -*- comment-column: 80; -*-

so not sure why it did this to you 😢

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah. inadvertent. good catch

Copy link
Member

@camsaul camsaul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM but I humbly request you revert those comment indentation changes in deps.edn to reduce unnecessary code churn

@dpsutton dpsutton merged commit df165ba into master Aug 5, 2022
@dpsutton dpsutton deleted the fix-in-memory-logger branch August 5, 2022 21:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants