# Component to run a PySpark Transformation
This notebook prepares data with PySpark on Cloud Dataproc: it downloads a BigQuery table through the BigQuery Storage API connector and writes the output to CSV files on Google Cloud Storage.

The [submit_pyspark_job](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/submit_pyspark_job) component submits a PySpark job through the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).
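
To make the shape of a submit request concrete, here is a minimal sketch of the JSON body that the REST endpoint accepts for a PySpark job. The field names (`job`, `placement.clusterName`, `pysparkJob.mainPythonFileUri`, `args`, `jarFileUris`) come from the REST reference linked above; the helper `build_submit_request` and all values are hypothetical placeholders, and the component may assemble its request differently.

```python
import json


def build_submit_request(cluster_name, main_python_file_uri, args=None, jar_file_uris=None):
    """Build a jobs.submit request body for a PySpark job (hypothetical helper)."""
    pyspark_job = {"mainPythonFileUri": main_python_file_uri}
    if args:
        # Driver arguments are passed as a list of strings.
        pyspark_job["args"] = list(args)
    if jar_file_uris:
        # JAR dependencies are also a list, even when there is only one.
        pyspark_job["jarFileUris"] = list(jar_file_uris)
    return {
        "job": {
            "placement": {"clusterName": cluster_name},
            "pysparkJob": pyspark_job,
        }
    }


# Placeholder values, for illustration only.
request_body = build_submit_request(
    cluster_name="my-cluster",
    main_python_file_uri="gs://my-bucket/transform_run.py",
    args=["--table", "my_table"],
    jar_file_uris=["gs://my-bucket/deps.jar"],
)
print(json.dumps(request_body, indent=2))
```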


# Details on the submit_pyspark_job component arguments

### submit_pyspark_job component arguments
| Argument | Description | Optional | Data type | Accepted values | Default |
|----------------------|------------|----------|--------------|-----------------|---------|
| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID |  |  |
| region | The Cloud Dataproc region to handle the request. | No | GCPRegion |  |  |
| cluster_name | The name of the cluster to run the job. | No | String |  |  |
| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath |  |  |
| args | The arguments to pass to the driver. See the driver program args below. | Yes | List |  | None |
| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict |  | None |
| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict |  | None |

### driver program args
| Argument | Description | Optional | Data type | Accepted values | Default |
|----------------------|------------|----------|--------------|-----------------|---------|
| tableProjectID | The ID of the Google Cloud Platform (GCP) project that the table belongs to. | No | GCPProjectID |  |  |
| table | The name of the BigQuery table to download. | No | String |  |  |
| dataset | The name of the BigQuery dataset to download the table from. | No | String |  |  |
| output | The output file name and location, for example `gs://bucket/output/file.csv`. | No | String |  |  |
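
The driver script itself is not shown in this notebook, so the following is only a sketch of how a driver could parse the four arguments in the table above with `argparse`. The function name `parse_driver_args` is hypothetical; the sample values are the pipeline defaults used later in this notebook, and the output path is a placeholder.

```python
import argparse


def parse_driver_args(argv=None):
    """Parse the driver arguments listed in the table (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Download a BigQuery table and write it as CSV to Cloud Storage")
    parser.add_argument("--tableProjectID", required=True,
                        help="GCP project that the table belongs to")
    parser.add_argument("--dataset", required=True,
                        help="BigQuery dataset to download the table from")
    parser.add_argument("--table", required=True,
                        help="BigQuery table to download")
    parser.add_argument("--output", required=True,
                        help="Output location, e.g. gs://bucket/output/file.csv")
    return parser.parse_args(argv)


opts = parse_driver_args([
    "--tableProjectID", "bigquery-samples",
    "--dataset", "wikipedia_benchmark",
    "--table", "Wiki10M",
    "--output", "gs://bucket/output/file.csv",
])
# Combine the three parts into one table reference for illustration.
table_ref = "{}.{}.{}".format(opts.tableProjectID, opts.dataset, opts.table)
print(table_ref)  # bigquery-samples.wikipedia_benchmark.Wiki10M
```

All four arguments are marked `required=True` because the table lists none of them as optional.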


### Output
Name | Description | Type
:--- | :---------- | :---
job_id | The ID of the created job. | String


# Setup & requirements for a test

To run the pipeline, you must:
*   Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).
*   [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).
*   Create a Google Cloud Storage bucket to hold your dependencies and output. Follow this [guide](https://cloud.google.com/storage/docs/creating-buckets).
*   Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.
*   Grant the [default compute service account](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts), used by Dataproc, the role `roles/bigquery.user`. This is the `[project-number]-compute@developer.gserviceaccount.com` service account.
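
As a sketch of the last step, the snippet below assembles a `gcloud projects add-iam-policy-binding` command that grants `roles/bigquery.user` to the default compute service account. It only builds and prints the command; nothing is executed. The helper names, project ID, and project number are hypothetical placeholders. The `roles/dataproc.editor` grant takes the same form, using the email of your Kubeflow user service account instead.

```python
import shlex


def default_compute_sa(project_number):
    """Return the email of the default compute service account."""
    return "{}-compute@developer.gserviceaccount.com".format(project_number)


def iam_binding_command(project_id, member, role):
    """Build the gcloud command that adds one IAM role binding on a project."""
    return ["gcloud", "projects", "add-iam-policy-binding", project_id,
            "--member", member, "--role", role]


# Placeholder project ID and project number; replace with your own.
cmd = iam_binding_command(
    "my-project",
    "serviceAccount:" + default_compute_sa("123456789012"),
    "roles/bigquery.user",
)
# Print the command so it can be reviewed and pasted into a shell.
print(" ".join(shlex.quote(part) for part in cmd))
```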

### Copy dependencies to Google Cloud storage

In [None]:
import json
import os
BUCKET_NAME = 'lf-ml-demo-eu-w1/test_test' # replace with your bucket, optionally followed by a path prefix
MAIN_FILE_PATH = 'gs://{}/transform_run.py'.format(BUCKET_NAME)
JAR_PATH = 'gs://{}/sparkicson-0.1-dependencies.jar'.format(BUCKET_NAME)
os.environ['MAIN_FILE_PATH'] = MAIN_FILE_PATH
os.environ['JAR_PATH'] = JAR_PATH

Run the `upload.sh` script from the Cloud Console to upload the dependencies to Google Cloud Storage. Make sure to pass the same value as `BUCKET_NAME`.

This approach is needed because the node pool does not have the scope required to write to GCS.
![scope](img/original_scope.png)
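
Note that the `BUCKET_NAME` value used above contains a slash, so it is really a bucket name followed by an object prefix. Tools that take the bucket and the object path separately need the two parts split. The sketch below shows one way to do that; `split_gcs_uri` is a hypothetical helper, and the contents of `upload.sh` are not shown in this notebook.

```python
def split_gcs_uri(uri):
    """Split a gs:// URI into (bucket, object_name) (hypothetical helper)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError("not a Cloud Storage URI: {!r}".format(uri))
    # Everything up to the first slash is the bucket; the rest is the object name.
    bucket, _, object_name = uri[len(prefix):].partition("/")
    return bucket, object_name


BUCKET_NAME = 'lf-ml-demo-eu-w1/test_test'
MAIN_FILE_PATH = 'gs://{}/transform_run.py'.format(BUCKET_NAME)

bucket, object_name = split_gcs_uri(MAIN_FILE_PATH)
print(bucket)       # lf-ml-demo-eu-w1
print(object_name)  # test_test/transform_run.py
```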

### Load the dataproc_submit_pyspark component

In [None]:
import kfp.components as comp

dataproc_submit_pyspark_job_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/a97f1d0ad0e7b92203f35c5b0b9af3a314952e05/components/gcp/dataproc/submit_pyspark_job/component.yaml')
help(dataproc_submit_pyspark_job_op)

# Build the pipeline

In [None]:
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(
    name='Dataproc submit PySpark job pipeline',
    description='Dataproc submit PySpark job pipeline'
)
def dataproc_submit_pyspark_job_pipeline(
    cluster_project_id = 'kfp-primer-workshop', 
    cluster_region = 'us-central1', 
    cluster_name = 'cluster-5d99',
    bq_project_id = 'bigquery-samples',
    bq_dataset = 'wikipedia_benchmark',
    bq_table = 'Wiki10M',
    output_path = 'gs://{0}/output/{{{{workflow.uid}}}}/{{{{pod.name}}}}/test.csv'.format(BUCKET_NAME),
    main_python_file_uri = '{0}'.format(MAIN_FILE_PATH), 
    jar_file_uris = '{0}'.format(JAR_PATH),
    args = '', 
    job='{}', 
    wait_interval='30'
):
    
    dataproc_submit_pyspark_job_op(
        project_id=cluster_project_id, 
        region=cluster_region, 
        cluster_name=cluster_name, 
        main_python_file_uri=main_python_file_uri, 
        args=args, 
        pyspark_job=json.dumps({
            'main_python_file_uri': str(main_python_file_uri),
            'jar_file_uris': str(jar_file_uris),
            'args' : ['--tableProjectID', str(bq_project_id), 
                      '--dataset', str(bq_dataset), 
                      '--table', str(bq_table),
                      '--output', str(output_path)]
        }), 
        job=job, 
        wait_interval=wait_interval).apply(gcp.use_gcp_secret('user-gcp-sa'))
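
The quadruple braces in the `output_path` default are easy to misread. `str.format` collapses each doubled brace into a single one, so `{{{{workflow.uid}}}}` becomes `{{workflow.uid}}`, which is the placeholder syntax that the workflow engine is expected to resolve at run time. A small sketch with a placeholder bucket name:

```python
BUCKET_NAME = 'my-bucket'  # placeholder bucket name

template = 'gs://{0}/output/{{{{workflow.uid}}}}/{{{{pod.name}}}}/test.csv'
output_path = template.format(BUCKET_NAME)

# Only {0} is substituted; the escaped braces survive as run-time placeholders.
print(output_path)
# gs://my-bucket/output/{{workflow.uid}}/{{pod.name}}/test.csv
```

Because the placeholders are resolved per run and per pod, each run writes to its own output directory.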

## Compile the pipeline

In [None]:
import kfp.compiler as compiler

pipeline_func = dataproc_submit_pyspark_job_pipeline
pipeline_filename = pipeline_func.__name__ + '.zip'
compiler.Compiler().compile(pipeline_func, pipeline_filename)

## Upload the pipeline

If you run this notebook outside the Kubeflow cluster, set `GOOGLE_APPLICATION_CREDENTIALS` to handle authorization. The service account needs the role `IAP-secured Web App User`.

In [None]:
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '' # path to the JSON key file of the service account used to log in; it needs the role IAP-secured Web App User
# HOST = '' # URL of the cluster, e.g. https://demo-kubeflow.endpoints.lf-ml-demo.cloud.goog/pipeline
# CLIENT_ID = '' # the client ID used by Identity-Aware Proxy
# NAMESPACE = '' # user namespace

In [None]:
from kfp import Client as KfpClient

In [None]:
client = KfpClient(
# we are running inside the same Kubeflow cluster, so no arguments are needed
#     host=HOST,
#     client_id=CLIENT_ID,
#     namespace=NAMESPACE  
)

In [None]:
client.upload_pipeline(
    pipeline_package_path=pipeline_filename, 
    pipeline_name='pyspark_run_test_001') # make the name unique, e.g. with your username
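
One way to make the pipeline name unique is to derive it from the current user and a timestamp. The helper below is a hypothetical sketch and not part of the KFP SDK; its result could be passed as `pipeline_name` in the call above.

```python
import getpass
import re
from datetime import datetime, timezone


def unique_pipeline_name(base, user=None, now=None):
    """Append a sanitized username and a UTC timestamp to a base name."""
    user = user or getpass.getuser()
    now = now or datetime.now(timezone.utc)
    # Replace anything other than letters, digits, '_' and '-' with '_'.
    safe_user = re.sub(r'[^A-Za-z0-9_-]', '_', user)
    return '{}_{}_{}'.format(base, safe_user, now.strftime('%Y%m%d%H%M%S'))


# Fixed inputs so the result is reproducible; omit them to use the real user and time.
name = unique_pipeline_name(
    'pyspark_run_test',
    user='alice',
    now=datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(name)  # pyspark_run_test_alice_20200102030405
```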

## Run the pipeline from the UI

## References

*   [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) 
*   [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob)
*   [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)