# Component to run a PySpark Transformation
Data preparation using PySpark on Cloud Dataproc. Download a BigQuery table using the BigQuery Storage API connector and write the output to CSV files on Google Cloud Storage.

The [submit_pyspark_job](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataproc/submit_pyspark_job) component submits a PySpark job through the [Dataproc submit job REST API](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs/submit).


# Details on the submit_pyspark_job component arguments

### submit_pyspark_job component arguments
| Argument | Description | Optional | Data type | Accepted values | Default |
|----------------------|------------|----------|--------------|-----------------|---------|
| project_id | The ID of the Google Cloud Platform (GCP) project that the cluster belongs to. | No | GCPProjectID |  |  |
| region | The Cloud Dataproc region to handle the request. | No | GCPRegion |  |  |
| cluster_name | The name of the cluster to run the job. | No | String |  |  |
| main_python_file_uri | The HCFS URI of the Python file to use as the driver. This must be a .py file. | No | GCSPath |  |  |
| args | The arguments to pass to the driver. See the driver program args table below. | Yes | List |  | None |
| pyspark_job | The payload of a [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob). | Yes | Dict |  | None |
| job | The payload of a [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs). | Yes | Dict |  | None |
| wait_interval | The number of seconds to wait between polls of the job status. | Yes | Integer |  | 30 |

### driver program args 
| Argument | Description | Optional | Data type | Accepted values | Default |
|----------------------|------------|----------|--------------|-----------------|---------|
| tableProjectID | The ID of the Google Cloud Platform (GCP) project that the table belongs to. | No | GCPProjectID |  |  |
| table | The name of the BigQuery table to download. | No | String |  |  |
| dataset | The name of the BigQuery dataset to download the table from. | No | String |  |  |
| output | The output file name and location, for example `gs://bucket/output/file.csv`. | No | String |  |  |
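
The driver arguments in the table above are passed to the component through its `args` list. The following is a minimal sketch of assembling that list. It assumes the driver script parses `--<name> <value>` pairs named after the table rows; the source does not show how `transform_run.py` reads its arguments, so the flag syntax and the helper name `build_driver_args` are hypothetical.

```python
def build_driver_args(table_project_id, dataset, table, output):
    """Return the driver arguments as a flat list of strings.

    Assumption: the driver accepts `--<name> <value>` pairs. Adjust the
    flag names to whatever transform_run.py actually parses.
    """
    if not output.startswith('gs://'):
        raise ValueError('output must be a gs:// URI, got: {}'.format(output))
    return [
        '--tableProjectID', table_project_id,
        '--dataset', dataset,
        '--table', table,
        '--output', output,
    ]


# Values taken from the pipeline defaults and the table example.
driver_args = build_driver_args(
    table_project_id='bigquery-samples',
    dataset='wikipedia_benchmark',
    table='Wiki10M',
    output='gs://bucket/output/file.csv',
)
print(driver_args)
```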


### Output
Name | Description | Type
:--- | :---------- | :---
job_id | The ID of the created job. | String


# Setup & requirements for a test

To run the pipeline, you must:
*   Set up a GCP project by following this [guide](https://cloud.google.com/dataproc/docs/guides/setup-project).
*   [Create a new cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster).
*   Create a Google Cloud Storage bucket to hold your dependencies and output. Follow this [guide](https://cloud.google.com/storage/docs/creating-buckets).
*   Grant the Kubeflow user service account the role `roles/dataproc.editor` on the project.
*   Grant the [default compute service account](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/service-accounts), used by Dataproc, the role `roles/bigquery.user`. This is the `[project-number]-compute@developer.gserviceaccount.com` service account.
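
The last step above can be sketched as follows. The code derives the default compute service account email from a project number and builds the `gcloud projects add-iam-policy-binding` command that would grant the role, without running it. The project ID `my-project`, the project number and both helper names are hypothetical placeholders.

```python
def default_compute_service_account(project_number):
    """Return the email of the default Compute Engine service account."""
    return '{}-compute@developer.gserviceaccount.com'.format(project_number)


def add_iam_binding_command(project_id, service_account, role):
    """Return the gcloud command, as an argument list, that grants `role`."""
    return [
        'gcloud', 'projects', 'add-iam-policy-binding', project_id,
        '--member=serviceAccount:{}'.format(service_account),
        '--role={}'.format(role),
    ]


# Hypothetical project ID and number, used only for illustration.
compute_sa = default_compute_service_account('123456789012')
command = add_iam_binding_command('my-project', compute_sa, 'roles/bigquery.user')

# Print the command so it can be reviewed before running it in a shell.
print(' '.join(command))
```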

### Copy dependencies to Google Cloud Storage

In [1]:
import json
import os
BUCKET_NAME = 'lf-ml-demo-eu-w1/kfp_primer/test/01/dataproc'
MAIN_FILE_PATH = 'gs://{}/transform_run.py'.format(BUCKET_NAME)
JAR_PATH = 'gs://{}/sparkicson-0.1-dependencies.jar'.format(BUCKET_NAME)
os.environ['MAIN_FILE_PATH'] = MAIN_FILE_PATH
os.environ['JAR_PATH'] = JAR_PATH
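
Note that `BUCKET_NAME` above holds a bucket name followed by an object prefix, not a bare bucket name. A minimal sketch of separating the two and of joining path parts into a `gs://` URI follows; the helper names `split_gcs_location` and `gcs_uri` are hypothetical and not part of any library.

```python
def split_gcs_location(location):
    """Split 'bucket/some/prefix' into ('bucket', 'some/prefix')."""
    bucket, _, prefix = location.strip('/').partition('/')
    return bucket, prefix


def gcs_uri(location, *parts):
    """Join a bucket location and path parts into a gs:// URI."""
    pieces = [location.strip('/')] + [part.strip('/') for part in parts]
    return 'gs://' + '/'.join(pieces)


location = 'lf-ml-demo-eu-w1/kfp_primer/test/01/dataproc'
bucket, prefix = split_gcs_location(location)
main_file_uri = gcs_uri(location, 'transform_run.py')
print(bucket, prefix, main_file_uri)
```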

Copy the main Python file

In [None]:
%%bash
gsutil cp ./pyspark_job/src/transform_run.py $MAIN_FILE_PATH

Copy the dependencies

In [None]:
%%bash
gsutil cp ./pyspark_job/src/target/sparkicson-0.1-dependencies.jar $JAR_PATH

Load the dataproc_submit_pyspark component

In [2]:
import kfp.components as comp

dataproc_submit_pyspark_job_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/a97f1d0ad0e7b92203f35c5b0b9af3a314952e05/components/gcp/dataproc/submit_pyspark_job/component.yaml')
help(dataproc_submit_pyspark_job_op)

Help on function dataproc_submit_pyspark_job:

dataproc_submit_pyspark_job(project_id: 'GCPProjectID', region: 'GCPRegion', cluster_name: 'String', main_python_file_uri: 'GCSPath', args: 'List' = '', pyspark_job: 'Dict' = '', job: 'Dict' = '', wait_interval: 'Integer' = '30')
    dataproc_submit_pyspark_job
    Submits a Cloud Dataproc job for running Apache PySpark applications on YARN.



In [11]:
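def test_dict_arguments(my_dict: dict) -> None:
    # Show the type and content received by the component.
    print(type(my_dict))
    print(my_dict)
    
    # Print only the driver arguments when the payload carries them.
    if 'args' in my_dict:
        print(my_dict['args'])
    else:
        print(my_dict)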

In [12]:
test_dict_arguments_op = comp.func_to_container_op(func=test_dict_arguments)

In [42]:
def create_pyspark_job_dict(
    main_python_file_uri: str, jar_file_uris: str, bq_project_id: str, 
    bq_dataset: str, bq_table: str, output_path: str) -> dict:
    # Build the PySpark job payload. The BigQuery and output arguments are
    # accepted but not yet added to the payload.
    pyspark_job = {'main_python_file_uri': main_python_file_uri,
                   'jar_file_uris': jar_file_uris}
    
    print(pyspark_job)
    
    return pyspark_job
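
The function above returns only the driver file and the JAR. A minimal sketch of a fuller payload follows. It uses the field names listed in the PySparkJob REST reference (`mainPythonFileUri`, `jarFileUris`, `args`); whether the component expects exactly these keys in its `pyspark_job` input is an assumption, as are the helper name `build_pyspark_job_payload`, the `gs://bucket/...` URIs and the `--table` flag.

```python
def build_pyspark_job_payload(main_python_file_uri, jar_file_uris, driver_args=None):
    """Build a dict shaped like the PySparkJob REST resource."""
    # Accept either a list of URIs or a comma-separated string.
    if isinstance(jar_file_uris, str):
        jar_file_uris = [uri.strip() for uri in jar_file_uris.split(',') if uri.strip()]

    payload = {
        'mainPythonFileUri': main_python_file_uri,
        'jarFileUris': list(jar_file_uris),
    }
    # Only add driver arguments when some are given.
    if driver_args:
        payload['args'] = [str(arg) for arg in driver_args]
    return payload


# Hypothetical URIs and driver flag, used only for illustration.
payload = build_pyspark_job_payload(
    'gs://bucket/transform_run.py',
    'gs://bucket/a.jar, gs://bucket/b.jar',
    driver_args=['--table', 'Wiki10M'],
)
print(payload)
```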

In [43]:
create_pyspark_job_dict(main_python_file_uri='main_python_file_uri',
        jar_file_uris='jar_file_uris',
        bq_project_id='bq_project_id', 
        bq_dataset='bq_dataset',
        bq_table='bq_table',
        output_path='output_path')

{'main_python_file_uri': 'main_python_file_uri', 'jar_file_uris': 'jar_file_uris'}


{'main_python_file_uri': 'main_python_file_uri',
 'jar_file_uris': 'jar_file_uris'}

In [44]:
create_pyspark_job_dict_op = comp.func_to_container_op(create_pyspark_job_dict)

# Build the pipeline

In [48]:
import kfp.dsl as dsl
import kfp.gcp as gcp

@dsl.pipeline(
    name='Dataproc submit PySpark job pipeline',
    description='Dataproc submit PySpark job pipeline'
)
def dataproc_submit_pyspark_job_pipeline(
    cluster_project_id = 'lf-ml-demo', 
    cluster_region = 'europe-west1',
    cluster_name = 'cluster-edc5',
    bq_project_id = 'bigquery-samples',
    bq_dataset = 'wikipedia_benchmark',
    bq_table = 'Wiki10M',
    #output_path = 'gs://{0}/output/{{workflow.uid}}/{{pod.name}}/test.csv'.format(BUCKET_NAME),
    output_path = 'gs://{0}/output/test.csv'.format(BUCKET_NAME),
    main_python_file_uri = '{0}'.format(MAIN_FILE_PATH), 
    jar_file_uris = '{0}'.format(JAR_PATH),
    args = '', 
    job='{}', 
    wait_interval='30'
):
    create_pyspark_job_dict_task = create_pyspark_job_dict_op(
        main_python_file_uri=main_python_file_uri,
        jar_file_uris=jar_file_uris,
        bq_project_id=bq_project_id, 
        bq_dataset=bq_dataset,
        bq_table=bq_table,
        output_path=output_path)
    
    test_dict_arguments_task = test_dict_arguments_op(my_dict=create_pyspark_job_dict_task.output)

#     dataproc_submit_pyspark_job_op(
#         project_id=cluster_project_id, 
#         region=cluster_region, 
#         cluster_name=cluster_name, 
#         main_python_file_uri=main_python_file_uri, 
#         args=args, 
#         pyspark_job=pyspark_job, 
#         job=job, 
#         wait_interval=wait_interval).apply(gcp.use_gcp_secret('user-gcp-sa'))
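
The component signature shown earlier declares `args` as a `List` and `pyspark_job` and `job` as `Dict`, while the pipeline defaults are plain strings such as `'{}'`. A minimal sketch of turning structured values into JSON strings that can serve as such defaults follows. That the component parses JSON-encoded strings for these inputs is an assumption here, and `to_parameter_default` is a hypothetical helper.

```python
import json


def to_parameter_default(value):
    """Serialize lists and dicts to JSON; leave strings unchanged."""
    if isinstance(value, str):
        return value
    return json.dumps(value, sort_keys=True)


# The '--table' flag is hypothetical; the driver may expect other names.
args_default = to_parameter_default(['--table', 'Wiki10M'])
job_default = to_parameter_default({})
wait_default = to_parameter_default('30')
print(args_default, job_default, wait_default)
```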

## Compile the pipeline

In [49]:
import kfp.compiler as compiler

pipeline_func = dataproc_submit_pyspark_job_pipeline
# The compiler writes the pipeline package to this archive.
pipeline_filename = pipeline_func.__name__ + '.zip'
compiler.Compiler().compile(pipeline_func, pipeline_filename)

## Upload the pipeline

Set `GOOGLE_APPLICATION_CREDENTIALS` to the path of a service account key file to handle authorisation. The service account used here has the `IAP-secured Web App User` role.

In [50]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/Users/lfloretta/.secrets/lf-ml-demo-20819be29240.json'
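
A wrong key path only surfaces later, when the client tries to authenticate. A minimal sketch of checking the file before setting the variable follows; `set_credentials` is a hypothetical helper and the path in the usage comment is a placeholder.

```python
import os


def set_credentials(key_path, environ=os.environ):
    """Point GOOGLE_APPLICATION_CREDENTIALS at an existing key file."""
    expanded = os.path.expanduser(key_path)
    if not os.path.isfile(expanded):
        raise FileNotFoundError('No key file at {}'.format(expanded))
    environ['GOOGLE_APPLICATION_CREDENTIALS'] = expanded
    return expanded


# Usage with a placeholder path:
# set_credentials('~/.secrets/my-service-account-key.json')
```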

In [51]:
from kfp import Client as KfpClient

In [52]:
client = KfpClient(
    host='https://demo-kubeflow.endpoints.lf-ml-demo.cloud.goog/pipeline',
    client_id='49311432881-9u2qfhilqci5fdthfsh8t0njpuugkj18.apps.googleusercontent.com',
    namespace='kubeflow_lfloretta'   
)

In [53]:
client.upload_pipeline(
    pipeline_package_path=pipeline_filename, 
    pipeline_name='pyspark_run_test_10')  # make the name unique, for example by adding your username

{'created_at': datetime.datetime(2019, 9, 10, 9, 30, 29, tzinfo=tzutc()),
 'description': None,
 'error': None,
 'id': 'da4e6d42-bd89-48f4-956d-c04f3ec2d29a',
 'name': 'pyspark_run_test_10',
 'parameters': [{'name': 'cluster-project-id', 'value': 'lf-ml-demo'},
                {'name': 'cluster-region', 'value': 'europe-west1'},
                {'name': 'cluster-name', 'value': 'cluster-edc5'},
                {'name': 'bq-project-id', 'value': 'bigquery-samples'},
                {'name': 'bq-dataset', 'value': 'wikipedia_benchmark'},
                {'name': 'bq-table', 'value': 'Wiki10M'},
                {'name': 'output-path',
                 'value': 'gs://lf-ml-demo-eu-w1/kfp_primer/test/01/dataproc/output/test.csv'},
                {'name': 'main-python-file-uri',
                 'value': 'gs://lf-ml-demo-eu-w1/kfp_primer/test/01/dataproc/transform_run.py'},
                {'name': 'jar-file-uris',
                 'value': 'gs://lf-ml-demo-eu-w1/kfp_primer/test/01/dataproc

## Run the pipeline from the UI

## References

*   [Create a new Dataproc cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster) 
*   [PySparkJob](https://cloud.google.com/dataproc/docs/reference/rest/v1/PySparkJob)
*   [Dataproc job](https://cloud.google.com/dataproc/docs/reference/rest/v1/projects.regions.jobs)