Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: window operations for pyspark backend #1945

Merged
merged 6 commits into from
Sep 9, 2019

Conversation

hjoo
Copy link
Contributor

@hjoo hjoo commented Sep 5, 2019

Implemented window operations for PySpark backend to pass all tests in ibis/tests/all/test_window.py:

  • WindowOp
  • Lag
  • Lead
  • MinRank
  • DenseRank
  • NTile
  • FirstValue
  • LastValue
  • RowNumber
  • CumulativeSum
  • CumulativeMean
  • CumulativeMin
  • CumulativeMax

Also enhanced select aggregation operations (e.g. Any, NotAny, All, NotAll, Count, Max, Min, Mean, Sum) to be interoperable with windows.

@codecov
Copy link

codecov bot commented Sep 5, 2019

Codecov Report

Merging #1945 into master will decrease coverage by 1.54%.
The diff coverage is 98.18%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1945      +/-   ##
==========================================
- Coverage   87.68%   86.13%   -1.55%     
==========================================
  Files          93       93              
  Lines       16971    17068      +97     
  Branches     2145     2157      +12     
==========================================
- Hits        14881    14702     -179     
- Misses       1681     1961     +280     
+ Partials      409      405       -4
Impacted Files Coverage Δ
ibis/pyspark/compiler.py 96.05% <98.18%> (+0.44%) ⬆️
ibis/bigquery/client.py 41.1% <0%> (-53.39%) ⬇️
ibis/bigquery/compiler.py 59.92% <0%> (-37.5%) ⬇️
ibis/bigquery/udf/api.py 80.48% <0%> (-14.64%) ⬇️
ibis/impala/compiler.py 91.23% <0%> (-5.2%) ⬇️
ibis/pandas/client.py 85.54% <0%> (-3.47%) ⬇️
ibis/bigquery/api.py 63.33% <0%> (-3.34%) ⬇️
ibis/expr/schema.py 90.1% <0%> (-1.1%) ⬇️
ibis/sql/compiler.py 94.26% <0%> (-0.39%) ⬇️
ibis/expr/api.py 92.33% <0%> (-0.37%) ⬇️
... and 4 more

Copy link
Contributor

@toryhaavik toryhaavik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments/questions, but overall looks good



@compiles(ops.Lead)
def compile_lead(t, expr, scope, *, window, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since this and Lag are so similar, can we factor them into a single function that takes the pyspark function as a parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, done.

ibis/pyspark/compiler.py Show resolved Hide resolved
ibis/pyspark/compiler.py Show resolved Hide resolved
@@ -3,7 +3,8 @@

import ibis
import ibis.common.exceptions as com
from ibis.tests.backends import Csv, OmniSciDB, Pandas, Parquet
from ibis.tests.backends import Csv, Impala, OmniSciDB, Pandas, Parquet, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we generally prefer wrapping these in parentheses rather than \

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it

@hjoo
Copy link
Contributor Author

hjoo commented Sep 6, 2019

Rebased on top of master.

return compile_aggregator(t, expr, scope, F.max, context)

def fn(col):
if "window" in kwargs:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default for string used in ibis is single quotes. probably we should keep all string using single quotes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@toryhaavik toryhaavik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, thanks @hjoo

@toryhaavik toryhaavik merged commit d89cb6c into ibis-project:master Sep 9, 2019
costrouc pushed a commit to costrouc/ibis that referenced this pull request Oct 10, 2019
Implemented window operations for PySpark backend to pass all tests in
`ibis/tests/all/test_window.py`:    - `WindowOp`  - `Lag`  - `Lead`  -
`MinRank`  - `DenseRank`  - `NTile`  - `FirstValue`  - `LastValue`  -
`RowNumber`  - `CumulativeSum`  - `CumulativeMean`  - `CumulativeMin`
- `CumulativeMax`    Also enhanced select aggregation operations (e.g.
`Any`, `NotAny`, `All`, `NotAll`, `Count`, `Max`, `Min`, `Mean`,
`Sum`) to be interoperable with windows.
Author: Hyonjee <hyonjee.joo@twosigma.com>

Closes ibis-project#1945 from hjoo/pyspark-window and squashes the following commits:

54a3405 [Hyonjee] change double quotes to single quotes in pyspark compiler.py
7cb2291 [Hyonjee] add helper methods for pyspark shift and cumulative window ops, refactor import line
5c82b19 [Hyonjee] remove extra test_window lead/lag tests that were added in previous commit but don't pass with non pyspark backends
7ac2dd8 [Hyonjee] skip unsupported window tests for OmniSciDB
419c309 [Hyonjee] fix test_window() in ibis/pyspark/tests/test_basic.py and remove xfail
7bd6585 [Hyonjee] window operations for pyspark backend
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants