
Catch Redis server errors #1119

Merged
merged 12 commits into from Feb 22, 2021
Conversation

mrunge
Contributor

@mrunge mrunge commented Feb 17, 2021

There seems to be a bad interaction with Redis over
HAProxy: Redis connections are reported as closed,
which is only temporary.
@mrunge mrunge closed this Feb 17, 2021
@mrunge mrunge reopened this Feb 17, 2021
jd
jd previously requested changes Feb 17, 2021
gnocchi/cli/metricd.py (outdated; 3 review threads, resolved)
@mrunge
Contributor Author

mrunge commented Feb 17, 2021

Thank you for the review. The follow-up patch should address the review comments.

@mrunge mrunge requested a review from jd February 17, 2021 19:56
@mergify mergify bot dismissed jd’s stale review February 17, 2021 19:57

Pull request has been modified.

jd
jd previously requested changes Feb 18, 2021
Member

@jd jd left a comment

Sorry, it's been a long time since I did Gnocchi stuff, so my brain was a bit slow on the first review.

@@ -85,6 +86,8 @@ def run(self):
with utils.StopWatch() as timer:
try:
self._run_job()
except ConnectionError:
Member

OK, I see another problem here: this violates separation of concerns.
The metricd tool does not need to know about Redis, and any other driver can be used. That means the exception must be managed in the Redis driver itself, not here.

Contributor Author

The underlying incoming/redis driver does not handle any exceptions, but it probably should. There was a change in redis-py around 3.0 which suggested that applications, not redis-py, should catch and handle exceptions. I cannot find the issue/PR in redis-py at the moment.

Where would you want the exception handled then?

Member

In files such as https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/incoming/redis.py and its methods.
I'm not sure which part of Redis raises ConnectionError; if it's all of them, you might need to decorate the operations with a tenacity decorator so they retry a few times and bail out if they fail too often.
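For illustration, a minimal sketch of decorating an incoming-driver style operation with tenacity so that it retries only on Redis connection errors and gives up after a few attempts. The function name, wait parameters, and attempt count below are illustrative assumptions, not the code this PR ends up with:

import tenacity
from redis import Redis
# redis-py's own ConnectionError, assumed here to be the exception to retry.
from redis.exceptions import ConnectionError


@tenacity.retry(
    # Retry only when the Redis server (or a proxy such as HAProxy)
    # drops the connection; any other exception propagates immediately.
    retry=tenacity.retry_if_exception_type(ConnectionError),
    # Back off exponentially between attempts.
    wait=tenacity.wait_exponential(multiplier=0.5, max=10),
    # Bail out after a handful of attempts instead of retrying forever.
    stop=tenacity.stop_after_attempt(5),
    # Re-raise the original ConnectionError once we give up.
    reraise=True)
def count_incoming_keys(client: Redis, match: bytes) -> int:
    # Any redis-py call here may raise ConnectionError.
    return sum(1 for _ in client.scan_iter(match=match, count=1000))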

@@ -192,6 +195,8 @@ def _fill_sacks_to_process(self):
self.wakeup()
except exceptions.NotImplementedError:
LOG.info("Incoming driver does not support notification")
except ConnectionError:
Member

same as above

@@ -219,6 +224,7 @@ def _get_sacks_to_process(self):
finally:
return self._tasks or self.fallback_tasks

@utils.retry_on_exception.wraps
Member

That's unrelated to Redis; it should probably not be there, since it'll retry on ANY exception.

setup.cfg Outdated
@@ -34,6 +34,7 @@ install_requires =
futures; python_version < '3'
jsonpatch
cotyledon>=1.5.0
redis >= 3.2.0 # MIT
Member

Redis is not mandatory to run gnocchi, do not make that change.

Contributor Author

I agree. Moving the change to the incoming redis driver should not make it a requirement here. I've moved it back to the redis section.

@mergify mergify bot dismissed jd’s stale review February 19, 2021 08:45

Pull request has been modified.

in tests. s3 was spuriously failing; we turned off testing for ceph some time ago. Time to turn it on again.
jd
jd previously requested changes Feb 19, 2021
.travis.yml Outdated
@@ -18,10 +18,10 @@ env:

- TARGET: py36-mysql-file
- TARGET: py36-mysql-swift
- TARGET: py36-mysql-s3
- TARGET: py36-mysql-ceph
Member

This looks unrelated to this PR :(

Contributor Author

Yes, indeed it is.
Ideally, I would like to have all backends enabled. There seems to be a timing(?) issue with CI, and lately the s3 tests were failing more than 50% of the time.

Comment on lines 99 to 102
@tenacity.retry(
    wait=utils.wait_exponential,
    # Never retry except when explicitly asked by raising TryAgain
    retry=tenacity.retry_never)
Member

I'd go with:

Suggested change
@tenacity.retry(
    wait=utils.wait_exponential,
    # Never retry except when explicitly asked by raising TryAgain
    retry=tenacity.retry_never)
@tenacity.retry(
    wait=utils.wait_exponential,
    retry=tenacity.retry_if_exception_type(ConnectionError))

That should be simpler.
You probably want it to stop at some point too, though; I'd add a condition for it to stop retrying after a few tries.
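As a rough sketch of that combination: keeping the wait helper already used in the diff and adding a stop clause (the attempt count of 5 is an illustrative assumption, not taken from the PR; ConnectionError is assumed to be redis-py's exception):

import tenacity
from redis.exceptions import ConnectionError

from gnocchi import utils


@tenacity.retry(
    wait=utils.wait_exponential,
    retry=tenacity.retry_if_exception_type(ConnectionError),
    # Give up after a few attempts instead of retrying forever.
    stop=tenacity.stop_after_attempt(5),
    reraise=True)
def query_redis(client):
    # Placeholder body; any redis-py call here can raise ConnectionError.
    return client.ping()

With reraise=True the final ConnectionError propagates once the attempts are exhausted; without it, tenacity raises a RetryError instead (the behaviour mentioned further down in this conversation).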

Contributor Author

That looks better. Thank you for the suggestion.

So far, I've never seen this repeat more than twice.

@mergify mergify bot dismissed jd’s stale review February 19, 2021 10:38

Pull request has been modified.

jd
jd previously requested changes Feb 19, 2021
Member

@jd jd left a comment

You're still missing a stop condition, right?

Comment on lines 116 to 132
try:
    for key in self._client.scan_iter(match=match, count=1000):
        metrics += 1
        pipe.llen(key)
        if details:
            m_list.append(key.split(redis.SEP)[1].decode("utf8"))
        # group 100 commands/call
        if metrics % 100 == 0:
            results = pipe.execute()
            update_report(results, m_list)
            m_list = []
            pipe = self._client.pipeline()
    else:
        results = pipe.execute()
        update_report(results, m_list)
except ConnectionError:
    LOG.debug("Redis Server closed connection. Retrying.")
Member

you don't need this anymore (the try/except)

Comment on lines 200 to 207
try:
    for message in p.listen():
        if message['type'] == 'pmessage' and message['pattern'] == pattern:
            # FIXME(jd) This is awful, we need a better way to extract this
            # Format is defined by _get_sack_name: incoming128-17
            yield self._make_sack(int(message['channel'].split(b"-")[-1]))
except ConnectionError:
    LOG.debug("Redis Server closed connection. Retrying.")
Member

ditto

@mergify mergify bot dismissed jd’s stale review February 19, 2021 11:29

Pull request has been modified.

jd
jd previously requested changes Feb 19, 2021
@@ -176,6 +183,10 @@ def process_measures_for_sack(self, sack):
pipe.ltrim(key, item_len + 1, -1)
pipe.execute()

@tenacity.retry(
wait=utils.wait_exponential,
# Never retry except when explicitly asked by raising TryAgain
Member

wrong comment

gnocchi/incoming/redis.py (resolved)
@mergify mergify bot dismissed jd’s stale review February 19, 2021 14:17

Pull request has been modified.

@jd
Member

jd commented Feb 19, 2021

@mrunge LGTM
Are you able to test it on your setup to see if it fixes your issue?

@mrunge
Contributor Author

mrunge commented Feb 19, 2021

Thank you for your reviews!

Yes, I've tested this a couple of times already, at each iteration. That's why I had the debug logs etc. and wasn't that concerned about the stop condition.

This latest proposal does not fix the issue; the try/except is required, apparently.

@jd
Member

jd commented Feb 19, 2021

This latest proposal does not fix the issue; the try/except is required, apparently.

That does not make sense. What's the traceback?

@mrunge
Contributor Author

mrunge commented Feb 22, 2021

Sorry for not getting back to you earlier. While I got stack traces on Friday, the deployment has now been running flawlessly for over an hour, with no stack traces. Before, these kinds of errors happened every 2 to 5 minutes. We should consider this PR to fix the issue.

@mrunge
Contributor Author

mrunge commented Feb 22, 2021

Just after my last comment, I saw this trace in the logs:

2021-02-22 08:09:20,432 [32] ERROR    gnocchi.cli.metricd: Unexpected error during reporting job
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/gnocchi/cli/metricd.py", line 87, in run
    self._run_job()
  File "/usr/lib/python3.6/site-packages/gnocchi/cli/metricd.py", line 127, in _run_job
    report = self.incoming.measures_report(details=False)
  File "/usr/lib/python3.6/site-packages/gnocchi/incoming/__init__.py", line 211, in measures_report
    metrics, measures, full_details = self._build_report(details)
TypeError: 'NoneType' object is not iterable

@mrunge mrunge closed this Feb 22, 2021
@mrunge mrunge reopened this Feb 22, 2021
@mrunge
Contributor Author

mrunge commented Feb 22, 2021

It turned out I had a try/except left over in my test deployment, where I had also tried this out. It has been running for quite a while now in the form proposed here. Sorry for causing confusion.

Thank you for your effort here. With this PR merged, I consider the issue fixed.

@jd
Member

jd commented Feb 22, 2021

Your traceback makes little sense, as the _build_report function never returns None, and if the retry fails it should raise a RetryError. 🤔
Are you sure you got this right?

@mrunge
Contributor Author

mrunge commented Feb 22, 2021

I had wrapped

for message in p.listen():
    if message['type'] == 'pmessage' and message['pattern'] == pattern:
        # FIXME(jd) This is awful, we need a better way to extract this
        # Format is defined by _get_sack_name: incoming128-17
        yield self._make_sack(int(message['channel'].split(b"-")[-1]))
with a try/except, which could then lead to returning None.

try:
    for message in p.listen():
        if message['type'] == 'pmessage' and message['pattern'] == pattern:
            # FIXME(jd) This is awful, we need a better way to extract this
            # Format is defined by _get_sack_name: incoming128-17
            yield self._make_sack(int(message['channel'].split(b"-")[-1]))
except ....:
    LOG.debug....
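For reference, a minimal, self-contained sketch (hypothetical names, not the Gnocchi code) of how a leftover except clause that only logs can make a reporting helper return None implicitly and produce exactly the TypeError from the traceback above:

import logging

logging.basicConfig(level=logging.DEBUG)
LOG = logging.getLogger(__name__)


def build_report(details):
    # Stand-in for the real helper, which queries Redis for a report.
    try:
        # Simulate the Redis server (or HAProxy) closing the connection.
        raise ConnectionError("Connection closed by server.")
    except ConnectionError:
        LOG.debug("Redis Server closed connection. Retrying.")
        # No return statement on this path, so the function returns None.


# Unpacking the implicit None fails just like in the traceback:
# TypeError: 'NoneType' object is not iterable
metrics, measures, full_details = build_report(details=False)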

@mrunge
Contributor Author

mrunge commented Feb 22, 2021

My understanding is that my additional changes/leftovers were causing the exception reported in #1119 (comment).

@jd
Member

jd commented Feb 22, 2021

Oh OK, I got confused; I thought this PR was doing that. So it should be fine!

@mergify mergify bot merged commit df55ac6 into gnocchixyz:master Feb 22, 2021
@mrunge
Contributor Author

mrunge commented Feb 22, 2021

thank you @jd

@mrunge mrunge deleted the redis_reconnect branch February 23, 2021 11:06