Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: improve error handling, and more! #524

Merged
merged 25 commits into from
Mar 27, 2024
Merged

Conversation

MicaiahReid
Copy link
Collaborator

@MicaiahReid MicaiahReid commented Mar 13, 2024

This PR introduces a few fixes in an effort to improve reliability and debugging problems when running Chainhook as a service:

@MicaiahReid MicaiahReid marked this pull request as ready for review March 15, 2024 20:01
Copy link
Contributor

@csgui csgui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes are looking good.

Only some changes requested due to the usage of the format! macro.

components/chainhook-cli/src/config/mod.rs Outdated Show resolved Hide resolved
components/chainhook-cli/src/config/mod.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@csgui csgui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for addressing the change requests.

@MicaiahReid MicaiahReid merged commit d6b8816 into develop Mar 27, 2024
10 checks passed
@MicaiahReid MicaiahReid deleted the fix-handle-errors branch March 27, 2024 15:26
@CharlieC3
Copy link
Member

erro - something went wrong the could lead to a critical error, or that could impact all users

@MicaiahReid Just briefly reading this, but is this a typo? erro isn't a term recognized by Grafana, so it won't be properly formatted in our logs by default. I mentioned this in the referenced issue here, but I may have not called it out enough.

@csgui
Copy link
Contributor

csgui commented Mar 27, 2024

@MicaiahReid @CharlieC3 Just chiming in!
Where are you reading that erro? I saw (in someplace that I can't recall right now) that a logger was truncating the error word to erro.

@CharlieC3
Copy link
Member

Where are you reading that erro?

It's in the PR description. Admittedly I haven't gone very deep in the code to verify, but though it may be best just to verify directly you or Micaiah :)

@MicaiahReid
Copy link
Collaborator Author

@CharlieC3
The logging library (slog that Chainhook (as well as the devnet API and I believe a few of our other rust-based tools) uses truncates the log type to four characters (ERROR => ERRO, DEBUG => DEBG).

From what I've seen, Grafana hasn't ever had a problem recognizing these as errors. The panic message referenced in #521 wasn't using this library, so Grafana wasn't picking it up. Now all logs are done through slog.

I was curious if the CRIT log level would be a problem - slog doesn't have a FATAL option.

@CharlieC3
Copy link
Member

Grafana hasn't ever had a problem recognizing these as errors

haha that is because of a quick fix I put up in our log collection 😄 If there's no way around this I'm happy to keep it in, but if it's something easy to configure, I'd rather the logging be compatible with Grafana by default without needing extra configuration at the infra level.

@csgui
Copy link
Contributor

csgui commented Mar 27, 2024

Quick investigation on that:

Here is where it defines the log level constants: https://github.com/slog-rs/slog/blob/master/src/lib.rs#L2112

And it seems that the short naming is the default: https://github.com/slog-rs/slog/blob/master/src/lib.rs#L2291. I've never used slog, but maybe/probably, there is a formatter where that info can be defined.

@CharlieC3
Copy link
Member

CharlieC3 commented Mar 27, 2024

Seems like it's possible to configure slog to use Grafana-compatible logging levels using slog.Leveler like so: https://betterstack.com/community/guides/logging/logging-in-go/#creating-custom-log-levels

Again, I'm somewhat flexible on this if it's too difficult at a service level, but I think aligning ourselves to what Grafana categorizes by default is an easy way to get compatibility across all our services and fall under a new standard for Hiro without the need for one-offs for individual services at an infra level.

@MicaiahReid
Copy link
Collaborator Author

MicaiahReid commented Mar 27, 2024

Cool @CharlieC3, I didn't realize we had custom patches that were required to make our logging work. I've created this PR in the logging library that is used by a few of our tools - this will just print the full log level prefix so that you won't need to have these patches. (granted, it will require users of this lib to upgrade + use the new feature, but Chainhook will be the first on that front!)

@CharlieC3
Copy link
Member

Nice! Thanks so much, seems like a perfect way to handle this.

MicaiahReid added a commit that referenced this pull request Mar 27, 2024
This PR introduces a few fixes in an effort to improve reliability and
debugging problems when running Chainhook as a service:
- Revisits log levels throughout the tool (fixes #498, fixes #521). The
general approach for the logs were:
- `crit` - fatal errors that will crash mission critical component of
Chainhook. In these cases, Chainhook should automatically kill all main
threads (not individual scanning threads, which is tracked by #404) to
crash the service.
- `erro` - something went wrong the could lead to a critical error, or
that could impact all users
- `warn` - something went wrong that could impact an end user (usually
due to user error)
- `info` - control flow logging and updates on the state of _all_
registered predicates
   - `debug` - updates on the state of _a_ predicate
- Crash the service if a mission critical thread fails (see
#517 (comment)
for a list of these threads). Previously, if one of these threads
failed, the remaining services would keep running. For example, if the
event observer handler crashed, the event observer API would keep
running. This means that the stacks node is successfully emitting blocks
that Chainhook is acknowledging but not ingesting. This causes gaps in
our database Fixes #517
- Removes an infinite loop with bitcoin ingestion, crashing the service
instead: Fixes #506
- Fixes how we delete predicates from our db when one is deregistered.
This should reduce the number of logs we have on startup. Fixes #510
 - Warns on all reorgs. Fixes #519
MicaiahReid added a commit that referenced this pull request Mar 27, 2024
This PR introduces a few fixes in an effort to improve reliability and
debugging problems when running Chainhook as a service:
- Revisits log levels throughout the tool (fixes #498, fixes #521). The
general approach for the logs were:
- `crit` - fatal errors that will crash mission critical component of
Chainhook. In these cases, Chainhook should automatically kill all main
threads (not individual scanning threads, which is tracked by #404) to
crash the service.
- `erro` - something went wrong the could lead to a critical error, or
that could impact all users
- `warn` - something went wrong that could impact an end user (usually
due to user error)
- `info` - control flow logging and updates on the state of _all_
registered predicates
   - `debug` - updates on the state of _a_ predicate
- Crash the service if a mission critical thread fails (see
#517 (comment)
for a list of these threads). Previously, if one of these threads
failed, the remaining services would keep running. For example, if the
event observer handler crashed, the event observer API would keep
running. This means that the stacks node is successfully emitting blocks
that Chainhook is acknowledging but not ingesting. This causes gaps in
our database Fixes #517
- Removes an infinite loop with bitcoin ingestion, crashing the service
instead: Fixes #506
- Fixes how we delete predicates from our db when one is deregistered.
This should reduce the number of logs we have on startup. Fixes #510
 - Warns on all reorgs. Fixes #519
github-actions bot pushed a commit that referenced this pull request Mar 27, 2024
## [1.4.0](v1.3.1...v1.4.0) (2024-03-27)

### Features

* detect http / rpc errors as early as possible ([ad78669](ad78669))
* use stacks.rocksdb for predicate scan ([#514](#514)) ([a4f1663](a4f1663)), closes [#513](#513) [#485](#485)

### Bug Fixes

* enable debug logs in release mode ([#537](#537)) ([fb49e28](fb49e28))
* improve error handling, and more! ([#524](#524)) ([86b5c78](86b5c78)), closes [#498](#498) [#521](#521) [#404](#404) [/github.com//issues/517#issuecomment-1992135101](https://github.com/hirosystems//github.com/hirosystems/chainhook/issues/517/issues/issuecomment-1992135101) [#517](#517) [#506](#506) [#510](#510) [#519](#519)
* log errors on block download failure; implement max retries ([#503](#503)) ([0fc38cb](0fc38cb))
* **metrics:** update latest ingested block on reorg ([#515](#515)) ([8f728f7](8f728f7))
* order and filter blocks used to seed forking block pool ([#534](#534)) ([a11bc1c](a11bc1c))
* seed forking handler with unconfirmed blocks to improve startup stability ([#505](#505)) ([485394e](485394e)), closes [#487](#487)
* skip db consolidation if no new dataset was downloaded ([#513](#513)) ([983a165](983a165))
* update scan status for non-triggering predicates ([#511](#511)) ([9073f42](9073f42)), closes [#498](#498)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
3 participants