
neobolt.exceptions.ServiceUnavailable in ecr.load_ecr_repository_images() when connecting to remote Neo4j #522

Closed
achantavy opened this issue Jan 30, 2021 · 11 comments · Fixed by williamjacksn/docker-cartography#30 or #705
Labels: stale (stalebot believes this issue/PR is no longer active)

Comments

achantavy (Contributor) commented Jan 30, 2021

Description:

When loading 50MB worth of data to a remote Neo4j server (i.e. not located on the same machine), ecr.load_ecr_repository_images() crashes with a neobolt.exceptions.ServiceUnavailable error after running for 2 hours.

To Reproduce:

Run ecr.load_ecr_repository_images() with 50MB of data.

POC code:

from neo4j import GraphDatabase
import cartography.intel.aws.ecr
import time
# You will need to provide your own data here.
# data shape = [{'repo_uri': 'uri', 'repo_images': [{'imageDigest': 'mydigest', 'imageTag': 'mytag'}, ...]},...]
from image_data import image_list

neo4j_driver = GraphDatabase.driver("bolt://your-remote-endpoint:7687")
neo4j_session = neo4j_driver.session()
account_id = '1234'
aws_update_tag = int(time.time())
region = 'us-east-1'

common_job_parameters = {
    "UPDATE_TAG": aws_update_tag,
    "AWS_ID": account_id,
}

cartography.intel.aws.ecr.load_ecr_repository_images(neo4j_session, image_list, region, aws_update_tag)
cartography.intel.aws.ecr.cleanup(neo4j_session, common_job_parameters)

Logs:

Traceback (most recent call last):
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 54] Connection reset by peer
Exception ignored in: 'neobolt.bolt._io.ChunkedInputBuffer.receive'
Traceback (most recent call last):
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 54] Connection reset by peer
Traceback (most recent call last):
  File "load_ecr_list_images.py", line 17, in <module>
    cartography.intel.aws.ecr.load_ecr_repository_images(neo4j_session, image_list, region, aws_update_tag)
  File "/Users/achantavy/lyftsrc/cartography/cartography/util.py", line 63, in timed
    return method(*args, **kwargs)
  File "/Users/achantavy/lyftsrc/cartography/cartography/intel/aws/ecr.py", line 122, in load_ecr_repository_images
    Region=region,
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neo4j/__init__.py", line 503, in run
    self._connection.fetch()
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neobolt/direct.py", line 414, in fetch
    return self._fetch()
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neobolt/direct.py", line 431, in _fetch
    self._receive()
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neobolt/direct.py", line 472, in _receive
    raise self.Error("Failed to read from defunct connection {!r}".format(self.server.address))
neobolt.exceptions.ServiceUnavailable: Failed to read from defunct connection Address(host={IP}, port=7687)

Please complete the following information:

  • Cartography release version or commit hash: 0c9a662
  • Python version: 3.7.9
  • OS: observed in a Docker container based on Debian as well as on my OSX laptop. Neither appears to be resource constrained: CPU usage is around 0% and memory usage of the Python process is about 200-300 MB.

Additional context:

This appears related to several other issues, all of which involve sending fairly large objects over the Bolt connection.


Update:

I've also observed this issue on load_ecr_repositories().

@achantavy achantavy changed the title neobolt.exceptions.ServiceUnavailable in ecr.load_ecr_repository_images() neobolt.exceptions.ServiceUnavailable in ecr.load_ecr_repository_images() when connecting to remote Neo4j Jan 30, 2021
achantavy (Contributor, Author) commented:

Actually going to reopen this issue as #523 does not completely resolve it.

voutilad commented Feb 2, 2021

@achantavy I believe I know what the root cause is given my Neo4j experience. See my PR #526.

You were on the right track with UNWIND, but you still need to be cautious about batching. In Neo4j 3.5, transaction state is stored on-heap in the JVM by default and can create memory pressure when running large transactions. In the worst case, the server becomes unresponsive on the network and the client gives up the connection.

(edited to explain batching since I hit comment too soon)
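
For illustration, a minimal sketch of the batching idea described above, assuming an UNWIND-based ingest query. The Cypher, labels, and helper names here are made up for the example; they are not the actual cartography or PR code:

from itertools import islice


def batched(iterable, size=1000):
    # Yield successive lists of at most `size` items from `iterable`.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Illustrative Cypher only; the real ingest query in cartography differs.
INGEST_IMAGES = """
UNWIND $Images AS img
MERGE (image:ECRImage{id: img.imageDigest})
SET image.tag = img.imageTag, image.lastupdated = $aws_update_tag
"""


def load_images_batched(neo4j_session, images, aws_update_tag, batch_size=1000):
    # Each run() now sends a bounded chunk, so the server-side transaction
    # state (held on-heap in Neo4j 3.5) stays small.
    for chunk in batched(images, batch_size):
        neo4j_session.run(INGEST_IMAGES, Images=chunk, aws_update_tag=aws_update_tag)

A batch size of around a thousand rows is only a starting point; as the commit message below notes, the right value depends on the characteristics of the database host.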

voutilad added a commit to voutilad/cartography that referenced this issue Feb 2, 2021
Attempts to fix lyft#522 by batching the data to minimize transaction size.
Should resolve network errors due to unresponsive Neo4j server when
under too much memory pressure.

I chose the batch size arbitrarily, but given it's creating nodes as
well as relationships to highly dense nodes, it shouldn't be set too
high without knowing the characteristics of the database host.
stale bot commented Feb 23, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale stalebot believes this issue/PR is no longer active label Feb 23, 2021
achantavy (Contributor, Author) commented:

Hey @voutilad, I put together a plan for improving write performance in this project. If you have 10 minutes, I would really appreciate your input: https://docs.google.com/document/d/1IZ12R3oROn11LcYj5XunokyOjJkKu-H2O1TEk065Dsk/edit#

@stale stale bot removed the stale stalebot believes this issue/PR is no longer active label Apr 13, 2021
stale bot commented May 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale stalebot believes this issue/PR is no longer active label May 19, 2021
stale bot commented Jun 16, 2021

This issue has been automatically closed for inactivity. If you still wish to make these changes, please open a new change or reopen this one.

@stale stale bot closed this as completed Jun 16, 2021
@achantavy achantavy reopened this Jun 17, 2021
@stale stale bot removed the stale stalebot believes this issue/PR is no longer active label Jun 17, 2021
stale bot commented Jul 2, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale stalebot believes this issue/PR is no longer active label Jul 2, 2021
jexp commented Jul 2, 2021

@achantavy I'm also happy to help if @voutilad is busy. If you want, we can have a look at the code together; just drop me an email to schedule a call.

@stale stale bot removed the stale stalebot believes this issue/PR is no longer active label Jul 2, 2021
stale bot commented Jul 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale stalebot believes this issue/PR is no longer active label Jul 20, 2021
achantavy (Contributor, Author) commented:

@voutilad @jexp Thanks for the help so far. I saw this issue happen on a different section of internal code and was able to resolve it by using explicit transactions!

Deployment information: we have a k8s cronjob running Neo4j Python driver 1.7.6, writing data to a Neo4j Enterprise 3.5.19 database across an AWS Network Load Balancer.

To summarize, the code would

  1. Instantiate a single neo4j session
  2. Use that session to write some data to the graph
  3. Do some more parsing/data transforms
  4. Go to (2) until we are done

Most of the time, step (3) would take longer than 380 seconds and the code would still work, which is not what I would expect, because that is longer than both the idle timeout of our AWS NLB and our neo4j driver's max_connection_lifetime setting. In any case, we found that running this code with a specific set of data would reliably cause step (4) to fail with a ConnectionResetError, resulting in a neo4j.ServiceUnavailable exception.

To fix this, I changed the code to explicitly use session.write_transaction() instead of auto-commit transactions via session.run(), and now the code seems to work magically! I have not implemented any retry logic myself at all.

To get to this solution, I stumbled upon this section of the current driver doc: https://neo4j.com/docs/api/python-driver/current/api.html#managed-transactions-transaction-functions

[Managed transactions] allow a function object representing the transactional unit of work to be passed as a parameter. This function is called one or more times, within a configurable time limit, until it succeeds.

Sure enough, this seemed encouraging, and it worked! Prior to this I had only been reading the docs for the 1.7 driver, but the docs for the current drivers seem to have become more thorough.
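
For reference, a minimal sketch of the change, assuming a placeholder ingest query; the names INGEST_IMAGES, _load_images_tx, and load_images are illustrative and not the actual cartography code:

# Placeholder Cypher; substitute the real ingest query.
INGEST_IMAGES = """
UNWIND $Images AS img
MERGE (image:ECRImage{id: img.imageDigest})
SET image.lastupdated = $aws_update_tag
"""


def _load_images_tx(tx, images, aws_update_tag):
    # The driver may invoke this function more than once (it retries on
    # transient failures), so the unit of work must be safe to re-run.
    result = tx.run(INGEST_IMAGES, Images=images, aws_update_tag=aws_update_tag)
    # Consume the result inside the transaction function so all records are
    # fetched before the transaction commits.
    result.consume()


def load_images(neo4j_session, images, aws_update_tag):
    # Before (auto-commit, no retry handling):
    #   neo4j_session.run(INGEST_IMAGES, Images=images, aws_update_tag=aws_update_tag)
    # After (managed transaction with the driver's built-in retry handling):
    neo4j_session.write_transaction(_load_images_tx, images, aws_update_tag)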

I'll push out a similar fix to address this specific issue and other related ones. To summarize the Python driver best practices we've learned in this project:

  • Use the UNWIND pattern for speed and batching
  • Use explicit transaction functions
  • Consume the results within the transaction functions
  • Be aware of max_connection_lifetime, especially if you're forced to deal with load balancers (see the sketch after this list)
  • Ensure the size of the transaction is not too large
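
On the load-balancer point, a minimal driver-configuration sketch; the timeout value is an assumption, so use your load balancer's actual idle timeout:

from neo4j import GraphDatabase

# Assumed idle timeout of the load balancer in front of Neo4j, in seconds;
# replace with your environment's real value.
LB_IDLE_TIMEOUT_SECONDS = 350

# Keeping max_connection_lifetime below the load balancer's idle timeout lets
# the driver retire connections before the load balancer silently drops them.
driver = GraphDatabase.driver(
    "bolt://your-remote-endpoint:7687",
    max_connection_lifetime=LB_IDLE_TIMEOUT_SECONDS - 50,
)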

This has been bugging me for months and I'm glad to finally have forward movement on this problem. :)

@stale stale bot removed the stale stalebot believes this issue/PR is no longer active label Jul 23, 2021
stale bot commented Sep 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
