PyGlove interface errors with google-vizier 0.1.13 #1044

Closed
gelatinouscube42 opened this issue Jan 23, 2024 · 11 comments · Fixed by #1047

Comments

@gelatinouscube42

There was an error in backend.py that caused the library imports to fail: some stray text that needed to be deleted.

Then in core.py, there is an attempted import from vizier.google of "metadata_to_user" which breaks the import of that file.

Looking at this repo, it seems these were already addressed at some point, but perhaps not pushed to the pip repository.

@sagipe
Member

sagipe commented Jan 23, 2024

Apologies, it should work fine at HEAD.
I'm sending a PR to release version 0.1.14 to pypi now.
When it's released, please upgrade to 0.1.14.
Best,
Sagi

@gelatinouscube42
Author

Thanks for your quick response.

For the record, I pulled your latest commit, and am now having an issue importing vizier_server from vizier.service. Trying to run this down now.

For the record, I am trying to follow the example in the docs for using Vizier as a backend for PyGlove. A more complete example would be helpful, for what it's worth. E.g., it's not clear from that section how pg_vizier.init("my_study") creates an object that can be used to query for the optimal trials, as in the examples in other sections.

@sagipe
Member

sagipe commented Jan 23, 2024

Our unit tests and Colab notebooks run fine with version 0.1.14.

If you're having an issue importing vizier_server, and you are doing this in Colab, perhaps try to restart the Colab runtime first?
What error are you getting?

RE the pyglove question, see an example of querying the result in
https://github.com/google/vizier/blob/main/vizier/_src/pyglove/e2e_test.py#L42
which uses this Result object:
https://github.com/google/vizier/blob/main/vizier/_src/pyglove/core.py#L350

result = pg.poll_result('')
result.trials
result.best_trials

We can add an example to the docs as well.

Best,
Sagi

@xingyousong
Collaborator

(FYI) The tutorial for running PyGlove with OSS Vizier is here: https://oss-vizier.readthedocs.io/en/latest/advanced_topics/pyglove/vizier_as_backend.html

Is there something missing / not working about it?

@gelatinouscube42
Author

@xingyousong

I will provide more detail as I attempt to run through the example on my machine. With respect to the PyGlove example specifically, it was not clear to me at all what the line

pg_vizier.init("my_study")

is doing. It seems to be using a different interface than the Vizier basics examples, which had you initialize a server and a client separately, and then use the client to query the database for the results.

A slightly separate issue, but still relevant, is that the examples do not appear to show anywhere how the datastore is initialized, or how we might configure it to interface with a pre-existing database. I'm planning on running my tuning experiments when the machines on my network are otherwise idle, and would need the database to persist. The answer will probably be obvious once I find the relevant code in the repo, but doubts like this are slowing me down.

@gelatinouscube42
Author

gelatinouscube42 commented Jan 23, 2024

Related to the database concern, I thought I had it running on my machine as I had a run without error, but tried another run, and am getting the error

"Failed to find study name: my_study.basic_run"

Presumably that should have been created somewhere behind the scenes, but not clear where.

edit:
when I run the pyglove tests via

bash run_tests.sh pyglove

I get two failures consistently, both of which seem to be pointing to a study not being found. The failed tests are
performance_test.py::PerformanceTest::test_multiple_workers0
and
oss_vizier_test.py::OSSVizierSampleTest::testSamplingWithMultiObjectiveAlgorithm

I am also intermittently getting the error "Cannot start already-started server!", which appears to come from an attempt to initialize a new Pythia process. I do not get these errors when running the Vizier Basics example.

Edit:
Hypothesis for what is going on: the error messages are coming from the call to _setup_study in backend.py; they appear to be intentional, as the try/except seems to be serving as de facto control flow logic, hinging on whether or not a study with a particular name has already been created.

On my machine, practically all of the threads query the database for the study before any other process has created it, causing an error.
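
To make the suspected race concrete, here is a minimal sketch using a toy in-memory datastore. Everything in it is hypothetical (`ToyDatastore`, `load_study`, `get_or_create_study`, and the lock are illustration-only names, not Vizier's `_setup_study` or its datastore API). It only shows how making the lookup-then-create step atomic would keep concurrent workers from seeing a missing study while another worker is still creating it.

```python
import threading
from multiprocessing.pool import ThreadPool


class ToyDatastore:
    """Hypothetical in-memory stand-in for a study datastore (not Vizier's)."""

    def __init__(self):
        self._studies = {}
        self._lock = threading.Lock()
        self.create_count = 0

    def load_study(self, name):
        # Mimics the "Failed to find study name" failure mode seen above.
        if name not in self._studies:
            raise KeyError(f"Failed to find study name: {name}")
        return self._studies[name]

    def get_or_create_study(self, name):
        # Holding the lock makes lookup + create a single atomic step, so
        # exactly one worker creates the study and the others find it.
        with self._lock:
            try:
                return self.load_study(name)
            except KeyError:
                self.create_count += 1
                self._studies[name] = {"name": name, "trials": []}
                return self._studies[name]


datastore = ToyDatastore()

# Ten workers ask for the same study at roughly the same time.
with ThreadPool(10) as pool:
    studies = pool.map(
        lambda _: datastore.get_or_create_study("my_study.worker_run"),
        range(10),
    )
```

Without the lock, two workers could both fail the lookup and both attempt the create, which is roughly the behaviour I suspect here.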

By the time this error is handled, it appears as if the RPCs are killed, either by a timeout or by the simple fact that an error occurred. I get the error:

<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNKNOWN
	details = "Exception calling application: 'Failed to find study name: owners/<username>/studies/my_study.worker_run'"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2024-01-24T11:58:37.20558528-05:00", grpc_status:2, grpc_message:"Exception calling application: \'Failed to find study name: owners/<username>/studies/my_study.worker_run\'"}"

This reads to me as though the RPC terminates when that exception is first encountered, but sometimes my process crashes with:

grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception calling application: 'Failed to find trial name: owners/<username>/studies/my_study.worker_run/trials/10'"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Exception calling application: \'Failed to find trial name: owners/<username>/studies/my_study.worker_run/trials/10\'", grpc_status:2, created_time:"2024-01-24T12:07:09.54003539-05:00"}"

This leads me to believe that sometimes the service is able to generate/start the study and actually gets to attempting to run trials before crashing, which points more to something like a socket timeout. Not sure.
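
One way to test the timing hypothesis would be to retry the failing lookup a few times with a short backoff and see whether the study eventually becomes visible. The sketch below is a generic, hypothetical helper (`retry_lookup` and `flaky_lookup` are made-up names, and the toy lookup raises `KeyError` rather than a gRPC error). With a real gRPC client one would presumably pass `grpc.RpcError` as `retry_on`, but that is an assumption on my part. Note that a retry would only mask a race, not fix it.

```python
import time


def retry_lookup(lookup, attempts=5, initial_delay=0.01, retry_on=(KeyError,)):
    """Calls lookup() until it succeeds or the attempts are exhausted."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return lookup()
        except retry_on:
            if attempt == attempts:
                raise  # Out of attempts: surface the original error.
            time.sleep(delay)
            delay *= 2  # Exponential backoff between attempts.


# Toy lookup that fails twice before the "study" becomes visible.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise KeyError("Failed to find study name: my_study.worker_run")
    return "my_study.worker_run"


found = retry_lookup(flaky_lookup)
```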

@gelatinouscube42
Author

gelatinouscube42 commented Jan 24, 2024

@sagipe

With regard to importing vizier_server, it works if I change the import to vizier._src.service, but still does not work if I try to import from vizier.service.

Error:
ImportError: cannot import name 'vizier_server' from 'vizier.service' (/vizier/vizier/service/__init__.py)

I did pull the recent changes and re-install before trying this, fwiw.
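
As a stopgap, a script could try the public path first and fall back to the private one. The helper below is a hypothetical sketch (`from_import` and `first_importable` are names I made up). It mimics `from package import name` using only `importlib`, and the demo uses standard-library modules so it runs without Vizier installed.

```python
import importlib


def from_import(package, name):
    """Mimics `from package import name` for attributes and submodules."""
    module = importlib.import_module(package)
    try:
        return getattr(module, name)
    except AttributeError:
        # Not an attribute of the package; try it as a submodule instead.
        return importlib.import_module(f"{package}.{name}")


def first_importable(candidates):
    """Returns the first (package, name) pair that imports cleanly."""
    last_error = None
    for package, name in candidates:
        try:
            return from_import(package, name)
        except ImportError as err:  # Also covers ModuleNotFoundError.
            last_error = err
    raise ImportError(f"None of {candidates} could be imported") from last_error


# Demo with stdlib modules: the first candidate does not exist, so the
# second one is returned.
resolved = first_importable([("json", "no_such_name"), ("os", "path")])
```

For this thread, the candidates would be `("vizier.service", "vizier_server")` followed by `("vizier._src.service", "vizier_server")`, i.e. the two paths mentioned above.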

Edit:
For the record, I am trying to explicitly initialize the service, since it seems to be the only way I can find to connect to a pre-existing database. Even with this import and explicit server initialization, the "Failed to find study name" errors still occur, and they seem to trace back to something in the interaction with the datastore...

@xingyousong
Collaborator

xingyousong commented Jan 24, 2024

There's a few facts that might help this thread overall:

@gelatinouscube42
Author

@xingyousong

Did as you suggested, still have the same errors.

Btw, I had run the script to compile the protos after I had pulled from the repo and re-installed the local copy, so I think that part should have been fine.

@xingyousong
Collaborator

@gelatinouscube42 can you send a code snippet to reproduce this issue?

@gelatinouscube42
Author

Sure, see below. It is more or less verbatim taken from the example in the docs...

import multiprocessing
import multiprocessing.pool
import os

import pyglove as pg
from vizier import pyglove as pg_vizier
from vizier._src.service import vizier_server

search_space = pg.Dict(x=pg.floatv(0.0, 1.0), y=pg.floatv(0.0, 1.0))
algorithm = pg.evolution.regularized_evolution()
num_trials = 100


def evaluator(value: pg.Dict):
    return value.x**2 - value.y**2

server = vizier_server.DefaultVizierServer()
pg_vizier.init("my_study", vizier_endpoint=server.endpoint)

num_workers = 10

def work_fun(worker_id):
    print(f"Worker ID: {worker_id}")
    for value, feedback in pg.sample(
        search_space,
        algorithm=algorithm,
        num_examples=num_trials // num_workers,
        name='worker_run',
        ):
        reward = evaluator(value)
        feedback(reward=reward)

with multiprocessing.pool.ThreadPool(num_workers) as pool:
    pool.map(work_fun, range(num_workers))

copybara-service bot pushed a commit that referenced this issue Jan 31, 2024