FEAT: Improve pandas udf performance for many arguments #2071

Merged (3 commits) on Feb 11, 2020

@icexelloss icexelloss commented Feb 4, 2020

Problem

The existing pandas UDF implementation suffers from a severe performance problem when the UDF takes many arguments.

The problem is that the existing implementation tries to register execution rules for every possible combination of argument types, so the number of execution rules grows exponentially with the number of UDF arguments.

For example, if a UDF takes n arguments, it ends up registering O(4^n) rules, because each argument can take 4 or more different types (e.g., for floating-point numbers: float, np.floating, pd.Series, pd.SeriesGroupBy) and the existing implementation registers the Cartesian product of those.
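
To make the growth concrete, here is a minimal sketch that counts the signatures in such a Cartesian product. The type names are plain strings standing in for the real classes, and `count_signatures` is a hypothetical helper for illustration, not part of ibis.

```python
from itertools import product

# Stand-ins for the four floating-point types listed above.
# Strings keep the sketch free of numpy/pandas dependencies.
FLOAT_TYPES = ("float", "np.floating", "pd.Series", "pd.SeriesGroupBy")


def count_signatures(n_args, types_per_arg=FLOAT_TYPES):
    """Count the type signatures in the Cartesian product for n_args arguments."""
    return sum(1 for _ in product(types_per_arg, repeat=n_args))


for n in (1, 2, 6):
    print(n, count_signatures(n))
```

With four candidate types per argument, the 6-argument UDF used in the microbenchmark would already need 4^6 = 4096 signatures under this scheme.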

Proposed Solution

In this PR, the number of rules registered for a given number of arguments is reduced to 3. For example, for a 10-argument UDF, only 3 rules are registered:

  • (pd.SeriesGroupBy, pd.SeriesGroupBy, ..., pd.SeriesGroupBy)
  • (pd.Series, pd.Series, ..., pd.Series)
  • (object, object, ..., object)

The last rule (object, object, ..., object) is a "catch all" rule to handle scalar inputs.
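
The three rules can be sketched roughly as follows. This is an illustration only: `Series` and `SeriesGroupBy` are empty stand-in classes rather than the pandas types, `rules_for` and `match_rule` are hypothetical names, and the first-match loop is a simplification of what a real multiple-dispatch registry does.

```python
class Series:
    """Hypothetical stand-in for pd.Series."""


class SeriesGroupBy:
    """Hypothetical stand-in for pd.SeriesGroupBy."""


def rules_for(n_args):
    """Return the three signatures registered for an n_args-argument UDF."""
    return [
        (SeriesGroupBy,) * n_args,
        (Series,) * n_args,
        (object,) * n_args,  # catch-all, also covers scalar inputs
    ]


def match_rule(args):
    """Return the first signature whose types match every argument."""
    for signature in rules_for(len(args)):
        if all(isinstance(a, t) for a, t in zip(args, signature)):
            return signature
    raise TypeError("unreachable: the object rule matches any arguments")


print(len(rules_for(10)))
print(match_rule((Series(), Series())) == (Series, Series))
print(match_rule((Series(), 1.0)) == (object, object))
```

In this sketch, a call that mixes a series with a scalar matches neither of the first two rules and falls through to the `object` rule, which is the role the catch-all plays for scalar inputs.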

Additional Changes

In this PR, I also take the opportunity to simplify pandas/core.py, because I found a bug in the existing implementation where pre_execute is not called for all nodes.

Microbenchmark

import numpy as np
import pandas as pd
import ibis
import ibis.expr.datatypes as dt
from ibis.pandas.udf import udf
import time

df = pd.DataFrame(
    {
        'a': np.arange(4, dtype=float).tolist()
        + np.random.rand(3).tolist(),
        'b': np.arange(4, dtype=float).tolist()
        + np.random.rand(3).tolist(),
        'c': np.arange(7, dtype=int).tolist(),
        'key': list('ddeefff'),
    }
)

client = ibis.pandas.connect({'table': df})

t2 = client.table('table')

@udf.elementwise(input_type=[dt.double] * 6, output_type=dt.double)
def my_udf(
    c1,
    c2,
    c3,
    c4,
    c5,
    c6
):
    return c1

expr = my_udf(*([t2.a] * 6))

start = time.time()
result = expr.execute()

print("Elapsed time:", time.time() - start)

Before the PR:

Elapsed time: 0.1456921100616455

After the PR:

Elapsed time: 0.0021181106567382812
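
The two timings above correspond to a speedup of roughly 69x. Since a single wall-clock measurement can be noisy, a repeated measurement may give a steadier number. The helper below is a small sketch using only the standard library; `best_of` is a hypothetical name, and it would be called as `best_of(expr.execute)` with the `expr` from the script above.

```python
import time


def best_of(func, repeats=5):
    """Return the smallest wall-clock time over `repeats` calls of func()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Speedup implied by the two numbers reported above.
before, after = 0.1456921100616455, 0.0021181106567382812
speedup = before / after
print(f"speedup: {speedup:.1f}x")
```

Taking the minimum rather than the mean is a common choice for microbenchmarks, since it is least affected by unrelated activity on the machine.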

xmnlab commented Feb 4, 2020

@icexelloss will this PR also fix the UDF issues related to the 2 tests that were failing? ref #2065

@icexelloss

I don't believe so. I didn't address any window-related issues. I will take a look at that separately.

xmnlab commented Feb 4, 2020

sounds good! thanks @icexelloss !

@icexelloss icexelloss changed the title Improve pandas udf performance for many arguments FEAT: Improve pandas udf performance for many arguments Feb 6, 2020
@jreback jreback added this to the Next Feature Release milestone Feb 11, 2020
@jreback jreback added pandas The pandas backend performance Issues related to ibis's performance labels Feb 11, 2020

@jreback jreback left a comment

can you show some before / after timings? there are asv benchmarks (in ~/ibis/benchmarks); can you add some for elementwise UDFs and show the changes?

also add this issue / PR number to the UDF one already in the release notes.

def pre_execute_elementwise_udf(
op, *clients, scope=None, aggcontet=None, **kwargs
):
@pre_execute.register(ops.ElementWiseVectorizedUDF, ibis.client.Client)

doesn't this catch more generically here? I guess we don't have ElementWiseVector UDFs on other clients atm, but why this change?

Contributor Author

This isn't really necessary but follows the same convention as https://github.com/ibis-project/ibis/blob/master/ibis/pandas/dispatch.py#L49

@emilyreff7 emilyreff7 left a comment

LGTM

@jreback jreback merged commit 0601252 into ibis-project:master Feb 11, 2020

jreback commented Feb 11, 2020

thanks @icexelloss

did we have an associated issue to close?
