
[query] show and checkpoint are very slow in Jupyter Notebooks; they are not slow outside Jupyter #13690

Closed
danking opened this issue Sep 22, 2023 · 8 comments · Fixed by #13759

@danking (Collaborator) commented Sep 22, 2023

What happened?

This started happening in 0.2.123. It does not happen in 0.2.120.

Version

0.2.123

Relevant log output

No response

danking added the bug label Sep 22, 2023
@danking (Collaborator, Author) commented Sep 22, 2023

I can't reproduce locally with

import hail as hl
hl.utils.range_table(10).show()

Must be something more complex.

iris-garden self-assigned this Sep 22, 2023
@iris-garden (Collaborator) commented
The local notebook works fine for me as well, so it looks like it's just Dataproc that isn't working as expected. Submitting that test command as a script finished in 36.2s (see the sketch after the log output below). The notebook is currently still hanging with this output; it's been 11 minutes:

BokehJS 3.2.2 successfully loaded.
Initializing Hail with default parameters...
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
SPARKMONITOR_LISTENER: Started SparkListener for Jupyter Notebook
SPARKMONITOR_LISTENER: Port obtained from environment: 55989
SPARKMONITOR_LISTENER: Application Started: application_1695402030462_0001 ...Start Time: 1695402594764
Running on Apache Spark version 3.3.0
SparkUI available at http://notebook-slowdown-repro-m.c.broad-ctsa.internal:43055/
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.124-ee7fef6fc40d
LOGGING: writing to /home/hail/hail-20230922-1709-0.2.124-ee7fef6fc40d.log
[Stage 0:>                                                          (0 + 2) / 2]
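(For reference, a sketch of submitting the same repro as a script rather than through the notebook, which is the path that finished in 36.2s; the script name repro.py is just an example.)

```bash
# repro.py holds the same two-line example:
#   import hail as hl
#   hl.utils.range_table(10).show()
# Submit it to the running Dataproc cluster as a job instead of running it
# in the notebook kernel. Depending on gcloud defaults, the region used at
# cluster creation may also need to be configured.
hailctl dataproc submit notebook-slowdown-repro repro.py
```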

@danking (Collaborator, Author) commented Sep 22, 2023

@iris-garden can you grab that log file and upload it here? It should live on the CLUSTER_NAME-m machine.
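(A sketch of one way to copy it off the master node with gcloud, assuming the cluster from this repro; the zone is a guess and the log name is the one printed in the notebook output above.)

```bash
# Copy the Hail log from the Dataproc master node to the current directory.
# The zone is an assumption; the log file name is the one Hail printed above.
gcloud compute scp \
  notebook-slowdown-repro-m:/home/hail/hail-20230922-1709-0.2.124-ee7fef6fc40d.log . \
  --zone=us-central1-a
```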

@iris-garden (Collaborator) commented
Yep, here it is.

@danking (Collaborator, Author) commented Sep 22, 2023

Nothing suspicious there. Something is going wrong in the executors. I think the only way we're gonna solve this is by running a pipeline and looking at the executor logs. I'm at a complete loss for how Jupyter could affect what happens on the executors.
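(A sketch of pulling the executor logs with YARN, using the application id printed when Hail initialized in the notebook; assumes you're SSHed into the -m machine.)

```bash
# Run on the Dataproc master node. The application id is the one from the
# notebook output above; redirect to a file so it can be attached here.
yarn logs -applicationId application_1695402030462_0001 > executor-logs.txt
```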

@iris-garden (Collaborator) commented
Okay, so this makes no sense to me, and I don't really understand Gradle, but I tried reproducing the issue with each recent release until I found the one where it started presenting (0.2.123), then tried every commit between the previous release and that one, and found that the issue started presenting after #13551 merged. I tried reverting that commit on the current main and confirmed the issue stopped showing up. I also tried downgrading just the google-cloud-storage version back to 2.17.1, since that was bumped in that commit, but the issue still presented.
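(For what it's worth, a sketch of automating that kind of search with git bisect; the tag names and the manual test step are assumptions about the workflow, not what was actually run.)

```bash
# Bisect between the last release without the hang and the first one with it.
# Tag names are assumed; adjust to the repo's actual release tags.
git bisect start
git bisect bad 0.2.123    # first release where the notebook hang appears
git bisect good 0.2.122   # last release without the hang
# At each step git checks out a commit: rebuild, start a cluster, run the
# notebook repro, then report the result with one of:
git bisect good   # or: git bisect bad
git bisect reset  # when finished
```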

@danking (Collaborator, Author) commented Sep 27, 2023

!!!!

@danking (Collaborator, Author) commented Sep 27, 2023

Wow, talk about a tour de force of debugging, well done!!


OK, so this kind of makes sense. We import our own copies of the GCS libraries and relocate (rename) them all under is.hail.relocated..... We do this so that we're not stuck with whatever version Dataproc is including.

We pin our Dataproc image version to 2.1.2-debian11 (see here), which was released in January 2023.

The latest available version of Dataproc's Debian images is 2.1.25-debian11, which depends on GoogleCloudDataproc Hadoop connector version 2.2.15, which in turn relies on Google Cloud Storage client library version 2.22.3.

I have a PR to upgrade us to 2.27.1 because the library broke retries in versions [2.25.0, 2.27.0).

AFAICT, Google's image version page only shows the most recent five; there's no way to go back further in time. Luckily, the Wayback Machine has a March 2023 capture that includes our version: 2.1.2-debian11 used Google Cloud Dataproc Hadoop connector version 2.2.9. That version of the Hadoop connector was using an alpha, gRPC-based version of the Cloud Storage library. I'm not sure what's up with that.

OK, here's my proposal: let's change that IMAGE_VERSION to the latest one and see if that fixes things.
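(Illustrative only: the underlying gcloud flag that IMAGE_VERSION feeds into; this is not the hailctl code path, and the cluster name and region here are just examples.)

```bash
# What pinning a Dataproc image version looks like at the gcloud level;
# cluster name and region are examples, not the values hailctl uses.
gcloud dataproc clusters create notebook-slowdown-repro \
    --region=us-central1 \
    --image-version=2.1.25-debian11
```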

If that works, let's just merge and forget this happened. If that doesn't work, we gotta wade into the Lovecraftian horror of JARs. Most likely we're not fully relocating the dependencies pulled in by the Google Cloud Storage client libraries and they conflict with what Dataproc produces.
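(If we do end up in the JAR weeds, a rough sketch of checking the shaded jar for unrelocated GCS classes; the jar path assumes a local build and may differ.)

```bash
# List Google Cloud Storage client classes that did NOT get moved under
# is/hail/relocated; any hits could clash with Dataproc's own copies.
# The jar path assumes a local build and may differ.
unzip -l hail/build/libs/hail-all-spark.jar \
    | grep 'com/google/cloud/storage' \
    | grep -v 'is/hail/relocated'
```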

iris-garden added a commit to iris-garden/hail that referenced this issue Oct 2, 2023
closes hail-is#13690.

To test that this works, I've been running these commands from the root
of my clone of the hail repo:

```bash
make -C hail install-editable
make -C hail install-hailctl
hailctl dataproc start notebook-slowdown-repro --region us-central1
hailctl dataproc connect notebook-slowdown-repro notebook
```

and then running this minimal example in the notebook:

```python
import hail
hail.utils.range_table(10).show()
```

and making sure it outputs a visual of the table, instead of getting stuck
displaying `Stage 0:> (0+X)/Y` and not progressing.
iris-garden added a commit to iris-garden/hail that referenced this issue Oct 3, 2023
danking pushed a commit that referenced this issue Oct 3, 2023