This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

pyspark packages to support dataskipping #451

Merged
merged 1 commit into opendatahub-io:master on Feb 28, 2022

Conversation

@oshritf (Contributor) commented Jul 1, 2021

Add Xskipper to the pyspark packages in the Spark 2.4 based images for data skipping support in Spark SQL queries.
https://issues.redhat.com/projects/ODH/issues/ODH-447
opendatahub-io/opendatahub-community#2
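For reviewers unfamiliar with the library: data skipping works by building per-object metadata indexes (e.g. min/max values for a column) and consulting them at query time so Spark SQL can avoid reading irrelevant objects. Below is a minimal sketch of how a notebook user would exercise this once the jar is on the classpath, adapted from the xskipper.io quick-start; the existing `spark` session, the paths, the Parquet format, and the column name are placeholders, not part of this PR:

```python
from xskipper import Xskipper

# Tell Xskipper where to store its metadata (placeholder path)
md_location = "/tmp/xskipper_metadata"
Xskipper.setConf(spark, {
    "io.xskipper.parquet.mdlocation": md_location,
    "io.xskipper.parquet.mdlocation.type": "EXPLICIT_BASE_PATH_LOCATION",
})

# Index a sample Parquet dataset with a min/max index on the "temp" column
dataset_location = "/tmp/sample_dataset"  # placeholder path
xskipper = Xskipper(spark, dataset_location)
xskipper.indexBuilder() \
    .addMinMaxIndex("temp") \
    .build(spark.read.format("parquet")) \
    .show(10, False)

# Enable filtering so subsequent Spark SQL queries consult the index
Xskipper.enableFiltering(spark)
spark.read.parquet(dataset_location).createOrReplaceTempView("sample")
spark.sql("SELECT COUNT(*) FROM sample WHERE temp < 30").show()
```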

@durandom

openshift-ci bot commented Jul 1, 2021

Hi @oshritf. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nakfour commented Jul 7, 2021

@oshritf thank you for the integration PR. Please follow our new guidelines for submitting new components in ODH: opendatahub-io/opendatahub-community#1

@durandom (Member) commented

@nakfour this is not a new component, but an extension.

@erikerlandson can you review and guide if this is the right approach?
I would guess, the default env variables only contain basic flags and the user extends them on demand...

@LaVLaS (Member) commented Jul 20, 2021

> @nakfour this is not a new component, but an extension.
>
> @erikerlandson can you review and guide if this is the right approach?
> I would guess, the default env variables only contain basic flags and the user extends them on demand...

This is not a new component, but I do agree with @nakfour that it requires additional discussion and documentation before adding it to the default Spark Notebook singleuserprofile in ODH.

@oshritf (Contributor, Author) commented Jul 20, 2021

@LaVLaS @nakfour Should I open a story for it?

@paulata commented Aug 2, 2021

Hi @nakfour @LaVLaS, there is considerable documentation for data skipping available here: https://xskipper.io/master/
Can you please clarify ASAP what additional documentation is needed and where/how it should be contributed?
@oshritf has been trying to clarify this for the last few weeks but has not received a response so far.
@durandom @erikerlandson, we would also appreciate your help in moving this forward.

@LaVLaS (Member) commented Aug 4, 2021

@oshritf @paulata Apologies for the delay in responding. The documentation would be required in ODH as there is no reference or indication to the user that xskipper would be included in any of the ODH spark notebooks by default.

This PR is adding io.xskipper:xskipper-core_2.11:1.1.1 to the Spark Notebook singleuserprofile that is deployed with the Spark 2.4.5 notebook image and Spark cluster. Based on the xskipper requirement of Spark 3.x, will this work as intended?

From previous discussions, a better solution would be to provide a separate xskipper Spark notebook image with a dedicated custom singleuserprofile just for that image. This could be included as a new kfdef overlay that could be deployed with JupyterHub.

@oshritf (Contributor, Author) commented Aug 8, 2021

@LaVLaS Could you point to where Xskipper documentation should be added in ODH?

Different Spark versions may require different xskipper-core jar versions: 1.1.1 is for Spark 2.4, 1.2.3 for Spark 3.0.

We opted to enable data skipping in all Spark notebooks. It only requires adding the xskipper-core jar, which has a small footprint, to spark.jars, and it makes things easy for Spark Jupyter notebook users because it eliminates the need to override os.environ['PYSPARK_SUBMIT_ARGS']. We did work on creating a separate Xskipper notebook image with a dedicated custom singleuserprofile in order to introduce a Spark 3.0 based image, which is currently not available in ODH, but that work was paused since a Spark 3.0 based image is already on the ODH timeline.
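For illustration, this is roughly the manual override a notebook user would otherwise need before creating the Spark session. It is a hedged sketch: the package coordinates are the ones added by this PR, and in practice the image's other existing submit args (AWS/Hadoop packages, executor settings) would also have to be preserved in the string:

```python
import os

# Without the package baked into the profile, users would have to set this
# themselves before the first SparkSession is created (Spark 2.4 / Scala 2.11
# coordinates; Spark 3.0 would presumably use io.xskipper:xskipper-core_2.12:1.2.3)
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.xskipper:xskipper-core_2.11:1.1.1 pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("xskipper-demo").getOrCreate()
```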

@nakfour commented Aug 9, 2021

@oshritf @paulata as I explained earlier, even though this is not a new component, it is still a significant feature. Please submit a PR with a proposal per the guidelines here: https://github.com/opendatahub-io/opendatahub-community
The proposal will serve as an initial guideline for users wanting to use or understand this component.

@oshritf (Contributor, Author) commented Aug 18, 2021

@nakfour Could you share the url for the proposal folder mentioned in the guidelines? ("The first step in submitting a new component or a feature in Open Data Hub is to create a proposal document and place that in the proposal folder")

@nakfour commented Aug 20, 2021

@oshritf you would have to create that folder since this will be the first proposal.

@oshritf (Contributor, Author) commented Aug 23, 2021

@nakfour commented Aug 26, 2021

@oshritf can you also include specific steps to test this feature?

@nakfour commented Aug 31, 2021

@oshritf the proposal PR is merged. If you can please add some how-to-test steps in this PR, we can review it.

@oshritf (Contributor, Author) commented Sep 1, 2021

Testing:
1. In JupyterHub, select the Spark 2.4 based image (Minimal Python with Apache Spark).
2. Import the xskipper pyspark notebook sample (see the link in the proposal).
3. To make sure the xskipper jar is loaded, add a new cell with: 'xskipper-core' in spark.sparkContext.getConf().get('spark.jars') — the result should be True (see the sketch below).
4. Run the notebook to verify that the xskipper functionality and API run as expected.
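Concretely, the verification cell in step 3 could look like the following sketch, assuming the notebook image's default `spark` session and that the resolved packages appear under `spark.jars` as described in that step:

```python
# Confirm the xskipper jar was picked up from the default submit args
conf = spark.sparkContext.getConf()
jars = conf.get("spark.jars", "")
print("xskipper-core" in jars)  # expected output: True
```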

@nakfour commented Sep 8, 2021

@oshritf I tried importing xskipper as in the example notebook, but got errors.
from xskipper import Xskipper

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-a5dd2bc22c39> in <module>
----> 1 from xskipper import Xskipper

ModuleNotFoundError: No module named 'xskipper'

Checking in the pod that the env var is set

env | grep PYSPARK_SUBMIT_ARGS
PYSPARK_SUBMIT_ARGS=--conf spark.cores.max=2 --conf spark.executor.instances=2 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.memory=2G --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3,io.xskipper:xskipper-core_2.11:1.1.1 pyspark-shell

@oshritf (Contributor, Author) commented Sep 11, 2021

@nakfour Could you try restarting the kernel (from the notebook toolbar)?

@nakfour commented Sep 15, 2021

@oshritf I did restart the kernel, same error, see snapshot.
[screenshot: Screen Shot 2021-09-15 at 11 29 13 AM]

@nakfour commented Sep 15, 2021

@oshritf since you are adding it to pyspark, shouldn't the import statement be `from pyspark import XXX`?

@oshritf (Contributor, Author) commented Sep 29, 2021

@nakfour restarting the kernel (or stopping and starting the server) should solve it. Retested now on the Operate First smaug cluster.
I updated the proposal with a testing section (the notebook setup section now references testing with the local filesystem instead of object storage, to make it clearer and easier, plus the restart-kernel remark):
opendatahub-io/opendatahub-community#4

@nakfour commented Dec 10, 2021

@oshritf tried it again, and it is still not working as explained above. I did restart the kernel and start/stop the server, and I still cannot import the library, as I explained earlier. Even if this worked, we do not want our users to restart the kernel and start/stop the server every time they want to use this library. I asked the Operate First team whether anyone tested this before merging, and no one did. I am going to put this PR on hold until we get this working and the instructions no longer include starting and stopping the server.
/hold

openshift-ci bot added the do-not-merge/hold label (Hold off on merging; provide reason in comment) on Dec 10, 2021
@oshritf (Contributor, Author) commented Dec 12, 2021

@nakfour Tested on https://jupyterhub-opf-jupyterhub.apps.smaug.na.operate-first.cloud. What environment are you using?
Once this PR is merged, users will not need to restart the kernel/server to use the library. It is only needed for this testing, to clear the library cache and reload it after setting the parameter.
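For context on why the restart only matters when the profile changes underneath an already-running notebook: PYSPARK_SUBMIT_ARGS is read once, when the first SparkSession launches the JVM, so a kernel started before the change keeps the old package list. A quick way to see what the running kernel picked up (a sketch, not part of the PR):

```python
import os

# Shows the submit args the current kernel was started with; after this PR
# the --packages list should include io.xskipper:xskipper-core_2.11:1.1.1
print(os.environ.get("PYSPARK_SUBMIT_ARGS", "<not set>"))
```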

@nakfour commented Jan 31, 2022

Tested this again with the example notebook in the proposal and it worked.

@nakfour commented Jan 31, 2022

/lgtm

@oshritf (Contributor, Author) commented Feb 8, 2022

Hi @nakfour, what are the next steps?

@LaVLaS (Member) left a comment

/lgtm

openshift-ci bot commented Feb 10, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LaVLaS, oshritf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LaVLaS (Member) commented Feb 17, 2022

/unhold

openshift-ci bot removed the do-not-merge/hold label (Hold off on merging; provide reason in comment) on Feb 17, 2022
@gallushi commented

/retest

openshift-ci bot commented Feb 22, 2022

@oshritf: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/odh-manifests-e2e | b8e7c38 | link | true | /test odh-manifests-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@LaVLaS LaVLaS added this to the ODH v1.2 Release milestone Feb 22, 2022
@LaVLaS (Member) commented Feb 28, 2022

Merging this since it has successfully passed CI tests in the past month

@LaVLaS LaVLaS merged commit 4c13bb1 into opendatahub-io:master Feb 28, 2022
cfchase added a commit to cfchase/odh-manifests that referenced this pull request Mar 8, 2022
…crds

* origin/master:
  Model Mesh Serving 0.8 initial cut (opendatahub-io#526)
  add pachyderm as a new components in the README (opendatahub-io#532)
  pachyderm integration (opendatahub-io#516)
  Update chromedriver to 98.0.4758.102 (opendatahub-io#528)
  Add Xskipper to pyspark packages in spark2.4 based images for dataskipping support (opendatahub-io#451)
  Update JupyterHub to version v0.3.5 (opendatahub-io#530)
  Remove startingCSV (opendatahub-io#522)
cfchase added a commit to cfchase/odh-manifests that referenced this pull request Mar 14, 2022
* notebook-controller:
  Add notebook controller to ODH manifests
  Update chromedriver to 98.0.4758.102 (opendatahub-io#528)
  Add Xskipper to pyspark packages in spark2.4 based images for dataskipping support (opendatahub-io#451)
  Update Grafana operator to work on OCP 4.9
@shalberd (Contributor) commented

I don't think the new package list should be comma-separated.

See 4c13bb1.

israel-hdez pushed a commit to israel-hdez/odh-manifests that referenced this pull request Aug 15, 2023
…nifest_upstream

Sync kserve manifest with upstream files