pyspark packages to support dataskipping #451
Conversation
Hi @oshritf. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with the appropriate command. Once the patch is verified, the new status will be reflected. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@oshritf Thank you for the integration PR; please follow our new guidelines for submitting new components in ODH: opendatahub-io/opendatahub-community#1
@nakfour This is not a new component, but an extension. @erikerlandson can you review and guide if this is the right approach?
This is not a new component, but I do agree with @nakfour that it requires additional discussion and documentation before adding it to the default
hi @nakfour @LaVLaS there is considerable documentation for data skipping available here https://xskipper.io/master/ |
@oshritf @paulata Apologies for the delay in responding. The documentation would be required in ODH, as there is no reference or indication to the user that xskipper would be included in any of the ODH Spark notebooks by default. This PR is adding it to the default images. From previous discussions, a better solution would be to provide a separate xskipper Spark notebook image with a dedicated custom singleuserprofile just for that image. This could be included as a new kfdef overlay that could be deployed with JupyterHub.
@LaVLaS Could you point to where the Xskipper documentation should be added in ODH? For different Spark versions there may be different xskipper-core jar versions: 1.1.1 is for Spark 2.4, 1.2.3 for Spark 3.0. We opted to enable data skipping in all Spark notebooks; it only requires adding the xskipper-core jar, which has a small footprint, to spark.jars, making it easy for Spark Jupyter notebook users (it eliminates the need to override os.environ['PYSPARK_SUBMIT_ARGS']). We worked on creating a separate Xskipper notebook image with a dedicated custom singleuserprofile to introduce a Spark 3.0 based image, which is currently not available in ODH; this work was paused since a Spark 3.0 based image is on the ODH timeline.
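For context, without the jar baked into the image, a notebook user has to inject the package themselves before the first SparkSession is created. A minimal sketch of that per-notebook override (the Maven coordinate shown is an assumption, derived from the 1.1.1-for-Spark-2.4 mapping above; it is not taken from this PR):

```python
import os

# Assumed Maven coordinate for the Spark 2.4 / Scala 2.11 build of
# xskipper-core 1.1.1 (the version the comment above maps to Spark 2.4).
XSKIPPER_PACKAGE = "io.xskipper:xskipper-core_2.11:1.1.1"

# This must run before the first SparkSession is created; otherwise the
# JVM has already started without the package on its classpath.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--packages {XSKIPPER_PACKAGE} pyspark-shell"

# ...then create the SparkSession as usual (pyspark import omitted here).
```

Shipping the jar via spark.jars inside the image removes exactly this boilerplate, which is the motivation stated above.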
@oshritf @paulata As I explained earlier, even though this is not a new component, it is still a significant feature. Please submit a PR with a proposal per the guidelines here: https://github.com/opendatahub-io/opendatahub-community
@nakfour Could you share the URL for the proposal folder mentioned in the guidelines? ("The first step in submitting a new component or a feature in Open Data Hub is to create a proposal document and place that in the proposal folder")
@oshritf You would have to create that folder since this will be the first proposal.
@oshritf Can you also include specific steps to test this feature?
@oshritf The proposal PR is merged; if you can add some how-to-test steps in this PR, we can review it.
Testing: |
@oshritf I tried importing xskipper as in the example notebook, but got errors.
Checking in the pod that the env var is set
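The same check can also be done from a notebook cell. Note that PYSPARK_SUBMIT_ARGS is only read when the kernel's JVM starts, which is why a kernel restart is needed after the image change. A minimal stdlib-only sketch:

```python
import os

# If this prints "<not set>", or Spark was already started earlier in
# the session, the xskipper jar will not be on the classpath until the
# kernel is restarted.
submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "<not set>")
print(submit_args)
```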
@nakfour Could you try restarting the kernel (from the notebook toolbar)? |
@oshritf I did restart the kernel, same error, see snapshot. |
@oshritf since you are adding it to pyspark, shouldn't the import statement be |
@nakfour A kernel restart (or stop + start of the server) should solve it. Retested now on the OperateFirst smaug cluster.
@oshritf Tried it again, and it is still not working as explained above. I did restart the kernel and stop/start the server, and I still cannot import the library. Even if this worked, we do not want our users to restart the kernel and stop and start the server every time they want to use this library. I asked the Operate First team if someone tested this before merging, and no one did. I am going to put this PR on hold until we get this working and the instructions do not include stopping and starting the server.
@nakfour Tested on https://jupyterhub-opf-jupyterhub.apps.smaug.na.operate-first.cloud. What environment do you use? |
Tested this again with the example notebook in the proposal and it worked. |
/lgtm |
hi @nakfour, What are the next steps? |
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: LaVLaS, oshritf. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/unhold
/retest
@oshritf: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Merging this since it has successfully passed CI tests in the past month |
…crds * origin/master: Model Mesh Serving 0.8 initial cut (opendatahub-io#526) add pachyderm as a new components in the README (opendatahub-io#532) pachyderm integration (opendatahub-io#516) Update chromedriver to 98.0.4758.102 (opendatahub-io#528) Add Xskipper to pyspark packages in spark2.4 based images for dataskipping support (opendatahub-io#451) Update JupyterHub to version v0.3.5 (opendatahub-io#530) Remove startingCSV (opendatahub-io#522)
* notebook-controller: Add notebook controller to ODH manifests Update chromedriver to 98.0.4758.102 (opendatahub-io#528) Add Xskipper to pyspark packages in spark2.4 based images for dataskipping support (opendatahub-io#451) Update Grafana operator to work on OCP 4.9
I don't think the new package list should be comma-separated. See |
…nifest_upstream Sync kserve manifest with upstream files
Add Xskipper to pyspark packages in spark2.4 based images for dataskipping support in spark sql queries
https://issues.redhat.com/projects/ODH/issues/ODH-447
opendatahub-io/opendatahub-community#2
@durandom