This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

pyspark packages to support dataskipping #451

Merged
merged 1 commit into opendatahub-io:master on Feb 28, 2022

Conversation

@oshritf (Contributor) commented Jul 1, 2021

Add Xskipper to the pyspark packages in the Spark 2.4 based images for data skipping support in Spark SQL queries.
https://issues.redhat.com/projects/ODH/issues/ODH-447
opendatahub-io/opendatahub-community#2
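For reviewers unfamiliar with the library: data skipping works by building per-object metadata indexes (e.g. min/max values for a column) and consulting them at query time so Spark SQL can avoid reading irrelevant objects. Below is a minimal sketch of how a notebook user would exercise this once the jar is on the classpath, adapted from the xskipper.io quick-start; the existing `spark` session, the paths, the Parquet format, and the column name are placeholders, not part of this PR:

```python
from xskipper import Xskipper

# Tell Xskipper where to store its metadata (placeholder path)
md_location = "/tmp/xskipper_metadata"
Xskipper.setConf(spark, {
    "io.xskipper.parquet.mdlocation": md_location,
    "io.xskipper.parquet.mdlocation.type": "EXPLICIT_BASE_PATH_LOCATION",
})

# Index a sample Parquet dataset with a min/max index on the "temp" column
dataset_location = "/tmp/sample_dataset"  # placeholder path
xskipper = Xskipper(spark, dataset_location)
xskipper.indexBuilder() \
    .addMinMaxIndex("temp") \
    .build(spark.read.format("parquet")) \
    .show(10, False)

# Enable filtering so subsequent Spark SQL queries consult the index
Xskipper.enableFiltering(spark)
spark.read.parquet(dataset_location).createOrReplaceTempView("sample")
spark.sql("SELECT COUNT(*) FROM sample WHERE temp < 30").show()
```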

@durandom

openshift-ci bot commented Jul 1, 2021

Hi @oshritf. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nakfour commented Jul 7, 2021

@oshritf thank you for the integration PR. Please follow our new guidelines for submitting new components in ODH: opendatahub-io/opendatahub-community#1

@durandom (Member) commented

@nakfour this is not a new component, but an extension.

@erikerlandson can you review and guide if this is the right approach?
I would guess, the default env variables only contain basic flags and the user extends them on demand...

@LaVLaS (Member) commented Jul 20, 2021

> @nakfour this is not a new component, but an extension.
>
> @erikerlandson can you review and guide if this is the right approach?
> I would guess, the default env variables only contain basic flags and the user extends them on demand...

This is not a new component, but I do agree with @nakfour that it requires additional discussion and documentation before adding it to the default Spark Notebook singleuserprofile in ODH.

@oshritf (Contributor, Author) commented Jul 20, 2021

@LaVLaS @nakfour Should I open a story for it?

@paulata commented Aug 2, 2021

Hi @nakfour @LaVLaS, there is considerable documentation for data skipping available here: https://xskipper.io/master/
Can you please clarify ASAP what additional documentation is needed and where/how it should be contributed?
@oshritf has been trying to clarify this for the last few weeks but has not received a response so far.
@durandom @erikerlandson, we would also appreciate your help in moving this forward.

@LaVLaS (Member) commented Aug 4, 2021

@oshritf @paulata Apologies for the delay in responding. The documentation would be required in ODH as there is no reference or indication to the user that xskipper would be included in any of the ODH spark notebooks by default.

This PR is adding io.xskipper:xskipper-core_2.11:1.1.1 to the Spark Notebook singleuserprofile that is deployed with the Spark 2.4.5 notebook image and Spark cluster. Based on the xskipper requirement of Spark 3.x, will this work as intended?

From previous discussions, a better solution would be to provide a separate xskipper Spark notebook image with a dedicated custom singleuserprofile just for that image. This could be included as a new kfdef overlay that could be deployed with JupyterHub.

@oshritf (Contributor, Author) commented Aug 8, 2021

@LaVLaS Could you point to where Xskipper documentation should be added in ODH?

Different Spark versions may require different xskipper-core jar versions: 1.1.1 is for Spark 2.4, 1.2.3 for Spark 3.0.

We opted to enable data skipping in all Spark notebooks. It only requires adding the xskipper-core jar, which has a small footprint, to spark.jars, and it makes things easy for Spark Jupyter notebook users because it eliminates the need to override os.environ['PYSPARK_SUBMIT_ARGS']. We did work on creating a separate Xskipper notebook image with a dedicated custom singleuserprofile in order to introduce a Spark 3.0 based image, which is currently not available in ODH, but that work was paused since a Spark 3.0 based image is already on the ODH timeline.
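For illustration, this is roughly the manual override a notebook user would otherwise need before creating the Spark session. It is a hedged sketch: the package coordinates are the ones added by this PR, and in practice the image's other existing submit args (AWS/Hadoop packages, executor settings) would also have to be preserved in the string:

```python
import os

# Without the package baked into the profile, users would have to set this
# themselves before the first SparkSession is created (Spark 2.4 / Scala 2.11
# coordinates; Spark 3.0 would presumably use io.xskipper:xskipper-core_2.12:1.2.3)
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages io.xskipper:xskipper-core_2.11:1.1.1 pyspark-shell"
)

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("xskipper-demo").getOrCreate()
```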

@nakfour commented Aug 9, 2021

@oshritf @paulata as I explained earlier, even though this is not a new component, it is still a significant feature. Please submit a PR with a proposal per the guidelines here: https://github.com/opendatahub-io/opendatahub-community
The proposal will serve as an initial guideline for users wanting to use or understand this component.

@oshritf (Contributor, Author) commented Aug 18, 2021

@nakfour Could you share the url for the proposal folder mentioned in the guidelines? ("The first step in submitting a new component or a feature in Open Data Hub is to create a proposal document and place that in the proposal folder")

@nakfour commented Aug 20, 2021

@oshritf you would have to create that folder since this will be the first proposal.

@oshritf (Contributor, Author) commented Aug 23, 2021

@nakfour commented Aug 26, 2021

@oshritf can you also include specific steps to test this feature?

@nakfour commented Aug 31, 2021

@oshritf the proposal PR is merged. If you can please add some how-to-test steps in this PR, we can review it.

@oshritf (Contributor, Author) commented Sep 1, 2021

Testing:
1. In JupyterHub, select the Spark 2.4 based image (Minimal Python with Apache Spark).
2. Import the xskipper pyspark notebook sample (see the link in the proposal).
3. To make sure the xskipper jar is loaded, add a new cell with: 'xskipper-core' in spark.sparkContext.getConf().get('spark.jars') — the result should be True (see the sketch below).
4. Run the notebook to verify that the xskipper functionality and API run as expected.
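Concretely, the verification cell in step 3 could look like the following sketch, assuming the notebook image's default `spark` session and that the resolved packages appear under `spark.jars` as described in that step:

```python
# Confirm the xskipper jar was picked up from the default submit args
conf = spark.sparkContext.getConf()
jars = conf.get("spark.jars", "")
print("xskipper-core" in jars)  # expected output: True
```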

@nakfour commented Sep 8, 2021

@oshritf I tried importing xskipper as in the example notebook, but got errors.
from xskipper import Xskipper

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-1-a5dd2bc22c39> in <module>
----> 1 from xskipper import Xskipper

ModuleNotFoundError: No module named 'xskipper'

Checking in the pod that the env var is set

env | grep PYSPARK_SUBMIT_ARGS
PYSPARK_SUBMIT_ARGS=--conf spark.cores.max=2 --conf spark.executor.instances=2 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.memory=2G --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3,io.xskipper:xskipper-core_2.11:1.1.1 pyspark-shell

@oshritf (Contributor, Author) commented Sep 11, 2021

@nakfour Could you try restarting the kernel (from the notebook toolbar)?

@nakfour commented Sep 15, 2021

@oshritf I did restart the kernel, same error, see snapshot.
[screenshot: Screen Shot 2021-09-15 at 11 29 13 AM]

@nakfour commented Sep 15, 2021

@oshritf since you are adding it to pyspark, shouldn't the import statement be `from pyspark import XXX`?

@oshritf (Contributor, Author) commented Sep 29, 2021

@nakfour restarting the kernel (or stopping and starting the server) should solve it. Retested now on the Operate First smaug cluster.
I updated the proposal with a testing section (the notebook setup section now references testing with the local filesystem instead of object storage, to make it clearer and easier, plus the restart-kernel remark):
opendatahub-io/opendatahub-community#4

@nakfour commented Dec 10, 2021

@oshritf tried it again, and it is still not working as explained above. I did restart the kernel and start/stop the server, and I still cannot import the library, as I explained earlier. Even if this worked, we do not want our users to restart the kernel and start/stop the server every time they want to use this library. I asked the Operate First team whether anyone tested this before merging, and no one did. I am going to put this PR on hold until we get this working and the instructions no longer include starting and stopping the server.
/hold

openshift-ci bot added the do-not-merge/hold label (Hold off on merging; provide reason in comment) on Dec 10, 2021
@oshritf (Contributor, Author) commented Dec 12, 2021

@nakfour Tested on https://jupyterhub-opf-jupyterhub.apps.smaug.na.operate-first.cloud. What environment are you using?
Once this PR is merged, users will not need to restart the kernel/server to use the library. It is only needed for this testing, to clear the library cache and reload it after setting the parameter.
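For context on why the restart only matters when the profile changes underneath an already-running notebook: PYSPARK_SUBMIT_ARGS is read once, when the first SparkSession launches the JVM, so a kernel started before the change keeps the old package list. A quick way to see what the running kernel picked up (a sketch, not part of the PR):

```python
import os

# Shows the submit args the current kernel was started with; after this PR
# the --packages list should include io.xskipper:xskipper-core_2.11:1.1.1
print(os.environ.get("PYSPARK_SUBMIT_ARGS", "<not set>"))
```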

@nakfour commented Jan 31, 2022

Tested this again with the example notebook in the proposal and it worked.

@nakfour commented Jan 31, 2022

/lgtm

@oshritf (Contributor, Author) commented Feb 8, 2022

Hi @nakfour, what are the next steps?

@LaVLaS (Member) left a comment

/lgtm

openshift-ci bot commented Feb 10, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LaVLaS, oshritf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@LaVLaS (Member) commented Feb 17, 2022

/unhold

openshift-ci bot removed the do-not-merge/hold label (Hold off on merging; provide reason in comment) on Feb 17, 2022
@gallushi commented

/retest

openshift-ci bot commented Feb 22, 2022

@oshritf: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/odh-manifests-e2e | b8e7c38 | link | true | /test odh-manifests-e2e

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@LaVLaS LaVLaS added this to the ODH v1.2 Release milestone Feb 22, 2022
@LaVLaS (Member) commented Feb 28, 2022

Merging this since it has successfully passed CI tests in the past month

@LaVLaS LaVLaS merged commit 4c13bb1 into opendatahub-io:master Feb 28, 2022
cfchase added a commit to cfchase/odh-manifests that referenced this pull request Mar 8, 2022
…crds

* origin/master:
  Model Mesh Serving 0.8 initial cut (opendatahub-io#526)
  add pachyderm as a new components in the README (opendatahub-io#532)
  pachyderm integration (opendatahub-io#516)
  Update chromedriver to 98.0.4758.102 (opendatahub-io#528)
  Add Xskipper to pyspark packages in spark2.4 based images for dataskipping support (opendatahub-io#451)
  Update JupyterHub to version v0.3.5 (opendatahub-io#530)
  Remove startingCSV (opendatahub-io#522)
cfchase added a commit to cfchase/odh-manifests that referenced this pull request Mar 14, 2022
* notebook-controller:
  Add notebook controller to ODH manifests
  Update chromedriver to 98.0.4758.102 (opendatahub-io#528)
  Add Xskipper to pyspark packages in spark2.4 based images for dataskipping support (opendatahub-io#451)
  Update Grafana operator to work on OCP 4.9
@shalberd (Contributor) commented

I don't think the new package list should be comma-separated.

See 4c13bb1.

israel-hdez pushed a commit to israel-hdez/odh-manifests that referenced this pull request Aug 15, 2023
…nifest_upstream

Sync kserve manifest with upstream files