FEAT: Adding more window operations to OmniSciDB #1976

Merged
merged 1 commit into from
Jan 29, 2020

Contributor

@xmnlab xmnlab commented Sep 22, 2019

This PR adds the following window operations:

  • DenseRank
  • PercentRank (that is translated to cume_dist)
  • RowNumber
  • MinRank
  • Count
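
As a rough illustration of what these ranking operations compute, here is a minimal sketch in plain Python. It is not the code from this PR: the function names are hypothetical, the numbering follows the SQL convention of starting at 1 (an assumption that may not match the offset ibis uses), and `cume_dist` is included because PercentRank is described above as being translated to it.

```python
from typing import List, Sequence


def row_number(values: Sequence[float]) -> List[int]:
    """Position of each row in sort order; ties are broken by original position."""
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    out = [0] * len(values)
    for position, i in enumerate(order, start=1):
        out[i] = position
    return out


def min_rank(values: Sequence[float]) -> List[int]:
    """SQL RANK(): one plus the number of strictly smaller rows (gaps after ties)."""
    return [1 + sum(v < x for v in values) for x in values]


def dense_rank(values: Sequence[float]) -> List[int]:
    """SQL DENSE_RANK(): like RANK() but without gaps after ties."""
    position = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [position[x] for x in values]


def cume_dist(values: Sequence[float]) -> List[float]:
    """SQL CUME_DIST(): fraction of rows less than or equal to the current row."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


scores = [10, 20, 20, 30]  # hypothetical sample column, already in window order
print(row_number(scores))  # [1, 2, 3, 4]
print(min_rank(scores))    # [1, 2, 2, 4]
print(dense_rank(scores))  # [1, 2, 2, 3]
print(cume_dist(scores))   # [0.25, 0.75, 0.75, 1.0]
```

The tied value 20 shows how the operations differ: `min_rank` leaves a gap after the tie, `dense_rank` does not, and `row_number` never repeats.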

@xmnlab xmnlab changed the title ENH: Adding more window operations to OmniSciDB FEAT: Adding more window operations to OmniSciDB Oct 2, 2019
Contributor Author

xmnlab commented Nov 26, 2019

Currently CI is raising some errors:

py36 building:

>                   raise Exception("Java gateway process exited before sending its port number")
E                   Exception: Java gateway process exited before sending its port number

/opt/conda/envs/ibis-env/lib/python3.6/site-packages/pyspark/java_gateway.py:108: Exception
---------------------------- Captured stderr setup -----------------------------
/usr/bin/env: ‘bash’: No such file or directory
__________________ ERROR at setup of test_array_length_scalar __________________
[gw0] linux -- Python 3.6.7 /opt/conda/envs/ibis-env/bin/python

    @pytest.fixture(scope='session')
    def client():
        from pyspark.sql import SparkSession
        import pyspark.sql.functions as F
    
>       session = SparkSession.builder.getOrCreate()

ibis/pyspark/tests/conftest.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/envs/ibis-env/lib/python3.6/site-packages/pyspark/sql/session.py:173: in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
/opt/conda/envs/ibis-env/lib/python3.6/site-packages/pyspark/context.py:367: in getOrCreate
    SparkContext(conf=conf or SparkConf())
/opt/conda/envs/ibis-env/lib/python3.6/site-packages/pyspark/context.py:133: in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
/opt/conda/envs/ibis-env/lib/python3.6/site-packages/pyspark/context.py:316: in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
/opt/conda/envs/ibis-env/lib/python3.6/site-packages/pyspark/java_gateway.py:46: in launch_gateway
    return _launch_gateway(conf)
>           raise Exception((retcode, cmd))
E           Exception: (127, 'tar zc 80c3e427b01d4471a63a57e035a599f7 > 80c3e427b01d4471a63a57e035a599f7.tar.gz')

ibis/tests/test_filesystems.py:364: Exception
----------------------------- Captured stderr call -----------------------------
/bin/sh: 1: tar: not found

doc building

/opt/conda/envs/ibis-env/lib/python3.6/site-packages/IPython/utils/_process_posix.py in sh(self)
     63             self._sh = pexpect.which('sh')
     64             if self._sh is None:
---> 65                 raise OSError('"sh" shell not found')
     66 
     67         return self._sh

OSError: "sh" shell not found


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/cmd/build.py", line 284, in build_main
    app.build(args.force_all, filenames)
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/application.py", line 337, in build
    self.builder.build_update()
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 326, in build_update
    len(to_build))
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 339, in build
    updated_docnames = set(self.read())
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 445, in read
    self._read_serial(docnames)
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 467, in _read_serial
    self.read_doc(docname)
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/builders/__init__.py", line 511, in read_doc
    doctree = read_doc(self.app, self.env, self.env.doc2path(docname))
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/io.py", line 323, in read_doc
    pub.publish()
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/docutils/core.py", line 217, in publish
    self.settings)
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/sphinx/io.py", line 116, in read
    self.parse()
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/docutils/readers/__init__.py", line 77, in parse
    self.parser.parse(self.input, document)
  File "/opt/conda/envs/ibis-env/lib/python3.6/site-packages/nbsphinx.py", line 865, in parse
    raise NotebookError('\n'.join(lines))
nbsphinx.NotebookError: CellExecutionError in notebooks/tutorial/5-IO-Create-Insert-External-Data.ipynb:
------------------
!rm -rf parquet_dir/
hdfs.get('/__ibis/ibis-testing-data/parquet/functional_alltypes', 'parquet_dir')
------------------

@xmnlab xmnlab marked this pull request as ready for review December 6, 2019 14:49
Contributor Author

xmnlab commented Dec 12, 2019

@jreback any more thoughts about this PR? Let me know; if it is OK, I will rebase to fix the conflict in release.rst.

Contributor Author

xmnlab commented Jan 7, 2020

@jreback any more thoughts about this PR? Let me know; if it is OK, I will rebase to fix the conflict in release.rst.

Contributor Author

xmnlab commented Jan 20, 2020

hey @jreback ! a gentle reminder about this PR :)

@jreback jreback added omnisci window functions Issues or PRs related to window functions labels Jan 20, 2020
@jreback jreback added this to the Next Feature Release milestone Jan 20, 2020
ibis/tests/all/test_window.py (outdated)
result.extend(sub_result)
if diff > 0:
    diff -= 1
return pd.Series(result, index=x.index)
Contributor

why are you evaluating directly in the api like this?

are the ntile ops not defined?

these should be dispatched generically, not done like this.

Contributor Author

pandas doesn't have an ntile operation.
the most similar operation is qcut, but it seems there are some differences.

what is your recommendation in this case?
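
For context, a minimal sketch of the SQL-style NTILE semantics under discussion, written in plain Python. This is an illustration under stated assumptions, not the helper from this PR: the `ntile` name and signature are hypothetical, and the rows are assumed to be already ordered. NTILE assigns buckets by row position, with the earlier buckets receiving the extra rows when the count does not divide evenly, whereas qcut bins by value quantiles, so tied values fall into the same bin.

```python
from typing import List


def ntile(n_rows: int, buckets: int) -> List[int]:
    """Return a 1-based bucket number for each of n_rows ordered rows.

    Bucket sizes differ by at most one row; the first (n_rows % buckets)
    buckets each receive one extra row.
    """
    if buckets <= 0:
        raise ValueError("buckets must be a positive integer")
    base, extra = divmod(n_rows, buckets)
    result: List[int] = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        result.extend([bucket] * size)
    return result


print(ntile(10, 3))  # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(ntile(2, 3))   # [1, 2]
```

With fewer rows than buckets, the trailing buckets are simply left empty, as the second call shows.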

@xmnlab xmnlab force-pushed the Add-more-window-functions branch 3 times, most recently from 7971a1a to 9871e6f Compare January 21, 2020 19:38
"""
# internal ntile function
def _ntile(x: pandas.Series, bucket: int):
    n = x.shape[0]
Contributor

why are you actually trying to execute operations inside api definitions?

this is an anti pattern in ibis

further these are actually definitions of pandas ops and not omnisci db native ops

so am puzzled what you are attempting to accomplish

the way to do this is to define these as pandas ops
the api then should return these ops and ibis will dispatch to them

Contributor Author

I just created that because AFAIK there is currently no way to test the ibis ntile operation,
so that would be just a temporary way to test ibis ntile operations.
Just for clarification, by pandas ops do you mean 1) ibis pandas ops or 2) actual pandas ops?

If it is 2), that could take extra time, so I would prefer to remove ntile here and implement it in a different PR.

Contributor

I am not sure you are following. This completely breaks the ibis style & philosophy. I am really not sure what your end goal is. Does omniscidb actually have these functions? If so, certainly you can add them to that backend, but this shouldn't have anything to do with pandas.

Contributor Author

Yes, maybe I am not following you. What do you mean by api definitions? I added that only to test_window.py, so it shouldn't be available in the API.

I can remove this code with no problem. My question is: how can I test the omniscidb ntile operation? Normally we just compare the result from the backend to the result from a pure pandas operation.

Contributor

I have no idea what these operations actually do. Some documentation and examples would go a long way. You need to write a much more explicit test.

Contributor Author

sounds fair @jreback thanks for the feedback ... I will probably move ntile to a new PR and I will do it for sure. thanks!

@jreback jreback removed this from the Next Feature Release milestone Jan 24, 2020
@xmnlab xmnlab requested a review from jreback January 28, 2020 01:40
Contributor Author

xmnlab commented Jan 28, 2020

@jreback I removed the ntile operation from this PR. It is ready for a new review. Thanks!

@jreback jreback added this to the Next Feature Release milestone Jan 29, 2020
Contributor

jreback commented Jan 29, 2020

@xmnlab pls rebase. this looks fine. you missed my point above. I don't think it is necessary to add any operations to pandas proper itself (and that would likely be rejected). But on the ibis pandas back-end I think you could simply implement these, and that would provide a good testing comparison.

Contributor Author

xmnlab commented Jan 29, 2020

rebase done, thanks @jreback !

I just opened that issue there to see if there is any interest from the pandas community in that implementation, and also to check if there are other ways to get the desired result :)

About your suggestion to implement that on the pandas backend: that is reasonable. I will need to create a separate test for that, because tests/all compares the result of each backend against pure pandas, and in this new case it will compare each backend against the ibis pandas backend (let me know if I am missing anything).

again, thanks for the review and suggestions!

@jreback jreback merged commit 95e17c3 into ibis-project:master Jan 29, 2020
Contributor

jreback commented Jan 29, 2020

this is ok thanks @xmnlab
