Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT: Add support for Postgres UDFs #1871

Merged
merged 23 commits into from
Jul 29, 2019
Merged

Conversation

scottcode
Copy link
Contributor

The following functionality is included:

  • ability to use user-defined functions (UDFs)
  • ability to use UDFs that are already defined in the database (I just want to refer to them and use them)
  • ability to wrap a python function, automatically define it as a PL/Python function in the database, and be able to use it with ibis objects
  • changed Docker image to support testing of Postgres PL/Python UDF functionality in the ibis test suite
  • Tests for UDF functionality

The current API looks like this. Comments and feedback are welcome.

import ibis.expr.datatypes as dt
from ibis.sql.postgres.udf.api import existing_udf, func_to_udf

my_udf = existing_udf(
    'my_udf',
    input_types=[dt.String()],
    output_type=dt.Integer(),
    schema='my_schema'
)

expr = table[my_udf(table['str_col']).name('int_col_output')]

def mult_a_b(a, b):
    return a * b

mult_a_b_udf = func_to_udf(
    sqlalchemy_engine,
    mult_a_b,
    (dt.Int32(), dt.Int32()),
    dt.Int32(),
    schema='my_schema',
    overwrite=True
)

expr2 = table[mult_a_b_udf(table['int_col1'], table['int_col2']).name('int_col_output')]

@scottcode
Copy link
Contributor Author

It appears that some of the failures were due to...

  • timeouts from impala startup (in LinuxBuildDocs pipeline).
  • All of the Windows pipelines failed due to some problem creating the PL/Python extension: could not access file "$libdir/plpython2". I wonder if there's some Windows-specific problem not resolving an environment variable due to the $ notation instead of %VAR%
  • Conda build (I don't understand enough about feedstock.py and conda builds to know what's going on. I don't think anything in my PR itself would have affected this process.

@scottcode
Copy link
Contributor Author

Figured out why it's probably failing on the conda build. The conda build doesn't spin up the backends and is supposed to exclude tests that rely on them. It excludes things marked with pytest.mark.postgresql, but I hadn't marked my test_udf.py module with that. My new commit adds that.

@scottcode
Copy link
Contributor Author

Aha, I figured out that the Windows tests use a different installation of Postgres, one which does not include any extensions (neither PostGIS nor pl/python). The Windows pipeline was already excluding postgis tests. I tweaked my tests to not try enabling PL/Python by default, and I added postgis marker so the plpython tests would be excluded along with the postgis ones. Eventually we need to update the Azure Windows pipeline to exclude a new marker called postgres_extensions, as this would cover both postgis and pl/python.

I included the updated windows.yml in this commit. @cpcloud, what is necessary to update the Azure pipelines?

I am eager to hear feedback from the community.

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2019

@scottcode Wow, this is really interesting, thanks for doing this. I'm going to review a bit today, and then the rest over the weekend.

@scottcode
Copy link
Contributor Author

Hmm, this is frustrating. I fixed the Windows issues, rebased on Master, and now the linux stuff fails, but only for geospatial-related stuff (not my new UDF code). I'll try to figure this out, but I'm certainly confused at the moment.

ci/docker-compose.yml Outdated Show resolved Hide resolved
ci/datamgr.py Outdated Show resolved Hide resolved
ci/datamgr.py Outdated Show resolved Hide resolved
ibis/sql/postgres/tests/test_udf.py Outdated Show resolved Hide resolved
ibis/sql/postgres/tests/test_udf.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2019

@scottcode Don't spend any time on fixing the geospatial stuff, I'm about to merge #1872 which should fix it.

@scottcode
Copy link
Contributor Author

Thanks @cpcloud for all the great feedback! I just wanted to call out separately how much I appreciate it. I'm starting to make my way through all of it.

@cpcloud cpcloud changed the title Adds Support for Postgres UDFs FEAT: Add support for Postgres UDFs Jul 3, 2019
@scottcode
Copy link
Contributor Author

@cpcloud I've addressed most of your comments and pushed a new version. The @ibis_client.udf decorator is still in progress, and there might be a few other outstanding issues, but wanted you to be able to see all the progress.

@cpcloud cpcloud self-assigned this Jul 9, 2019
@cpcloud cpcloud added the postgres The PostgreSQL backend label Jul 9, 2019
@cpcloud cpcloud added this to the Next Feature Release milestone Jul 9, 2019
ibis/sql/postgres/client.py Outdated Show resolved Hide resolved
ibis/sql/postgres/client.py Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/__init__.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/tests/test_udf.py Outdated Show resolved Hide resolved
@scottcode
Copy link
Contributor Author

@cpcloud Looks like there's some problem with the Azure pipeline setup. Looking at the logs it says

/azure-pipelines.yml: Could not retrieve file content for /ci/azure/linux.yml from repository self hosted on https://github.com/ using commit e97f541. GitHub reported the error, "Bad credentials"

@scottcode scottcode force-pushed the pg_udf branch 2 times, most recently from 91f22bc to 774f201 Compare July 17, 2019 21:54
@scottcode
Copy link
Contributor Author

@cpcloud All the CI tests passed, and I believe I've addressed all the changes you've requested so far. Are there any others you'd like me to make?

ibis/sql/postgres/client.py Outdated Show resolved Hide resolved
ibis/sql/postgres/client.py Outdated Show resolved Hide resolved
ibis/sql/postgres/tests/test_udf.py Outdated Show resolved Hide resolved
ibis/sql/postgres/tests/test_udf.py Outdated Show resolved Hide resolved
ibis/sql/postgres/tests/test_udf.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
@scottcode
Copy link
Contributor Author

The only errors at this point are in the Lint step (only in Windows Python 3.6 and 3.7). It is unclear to me why it is erroring out; the logs don't give any hints. Running the lint locally everything checks out.

Based on feedback from @cpcloud:

* changes to datamgr.py cli
* lots of refactoring to UDF tests
* lots of refactoring to UDF implementation
* began ibis_client.udf() decorator (work in progress)

Some of @cpcloud feedback still outstanding
* Includes helpers for removing decorators from function definition

Usage:
```
@ibis_postgres_client.udf([dt.int32], dt.int32)
def my_square(x):
    return x ** 2
```
...using a SEQUENCE object in the database
* For type mapping, use simple dictionaries and stop even trying
to support type lookup from anything except ibis DataTypes
* don't store all non-decorator line numbers, just the min
* give better names to temporary field and node variables
* remove UDF decorator API and supporting decorator utilities
* add an exception if user tries to define a UDF with a function that
has decorator(s)
* utilize ibis' and sqlalchemy's existing type translation for defining
ibis type to Postgres datatype string conversion
* removed placeholder conditionals + NotImplementedError's. This can
be a future feature request
* Point to a postgres docker image based directly on an "Official Image"
(see https://docs.docker.com/docker-hub/official_images/)
* previously had a two-step process in which one repo created an
intermediate image, and a second repo used that intermediate image
to create the final image.
* The image is currently hosted on
shajekpivotal/ibis-docker-postgres-9.5 and based on
https://github.com/autoscott/ibis-docker-postgres, but this can easily
be migrated to ibis-project-controlled accounts.
Azure test pipelines keep unpredictably failing due to timeout waiting
for the various backends to be available. Increasing the timeout from
5 minutes to 10 minutes in the hopes that will prevent failures.

Another avenue to pursue in the future might be to look for ways to speed up the
instantiation of the backends.
@scottcode
Copy link
Contributor Author

Only error at this point is backend waiter timeout on LInux Python 3.5 after 10 minutes.

@cpcloud
Copy link
Member

cpcloud commented Jul 25, 2019

Yeah I just kicked it off again to see if it was a transient error.

@scottcode
Copy link
Contributor Author

Looking at logs from several builds that timed out at that step, it's always only Impala that it's still waiting on. I wonder if there's a way to speed up Impala's load time.

@scottcode
Copy link
Contributor Author

It looks like the current image for Impala includes unnecessarily installing and starting up Kudu (since in our docker-compose we start them up from separate images/containers and use those anyway). Maybe we could cut the following out of the Impala startup to speed it up.

ibis-project/docker-impala/etc/supervisord.conf

[program:kudu-master]
command=/bin/bash -c 'exec /etc/init.d/kudu-master start'
autostart=false
startsecs=0

[program:kudu-tserver]
command=/bin/bash -c 'exec /etc/init.d/kudu-tserver start'
autostart=false
startsecs=0

The impala service actually relies on there being an instance of
postgres available to use as its metadata store. It has a timer to
check up to 120 seconds and then impala startup hangs indefinitely or
fails.

When we switched to a new image for postgres that includes PL/Python,
this pushed the startup time just long enough to cause impala startup
to sometimes fail (unpredictably).

Instead of increasing the wait time hard-coded in the Impala image,
it seems better/easier to enforce the dependency with the
`docker-compose.yml`.

See https://github.com/ibis-project/docker-impala/blob/master/bin/start-impala.sh line 21 at revision dcb7388

`/wait-for-it.sh postgres:5432 -t 120`
@scottcode
Copy link
Contributor Author

Think I've figured out why this still sometimes fails while waiting for backends. Just pushed my fix. Fingers crossed that it passes the Azure pipelines. Below is my commit message explanation.

Respect dependency of impala on postgres service

The impala service actually relies on there being an instance of
postgres available to use as its metadata store. It has a timer to
check up to 120 seconds and then impala startup hangs indefinitely or
fails.

When we switched to a new image for postgres that includes PL/Python,
this pushed the startup time just long enough to cause impala startup
to sometimes fail (unpredictably).

Instead of increasing the wait time hard-coded in the Impala image,
it seems better/easier to enforce the dependency with the
docker-compose.yml.

See https://github.com/ibis-project/docker-impala/blob/master/bin/start-impala.sh line 21 at revision dcb7388

/wait-for-it.sh postgres:5432 -t 120

@scottcode
Copy link
Contributor Author

@cpcloud I got it all green 😃

Copy link
Member

@cpcloud cpcloud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small API nit/question. Once that's addressed this is ready for merge. @scottcode Can you create a follow up issue for moving the dockerfiles you're using here to the ibis-project if you haven't already?

ibis/sql/postgres/udf/api.py Outdated Show resolved Hide resolved
* rename `func_to_udf` to `udf` and change its signature to take an
ibis client instead of a sqlalchemy connection
@scottcode
Copy link
Contributor Author

Phillip, I know you already saw this, but for others' reference the issue for the Postgres Docker image is #1853.

@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2019

Kaboom. Merge time.

@cpcloud cpcloud merged commit 323967f into ibis-project:master Jul 29, 2019
@cpcloud
Copy link
Member

cpcloud commented Jul 29, 2019

Thanks @scottcode, keep 'em coming!

@scottcode scottcode deleted the pg_udf branch July 29, 2019 20:28
@cpcloud cpcloud modified the milestones: Next Feature Release, Next Major Release Aug 26, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
postgres The PostgreSQL backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants