# Prep for running pipelines
### This notebook covers:
- Component `.yaml` for the Dataflow embedding export
- One-time setup of network resources (e.g., a VPC)
- Configuration of BigQuery flex slots (compute for BQML modeling)

### Dataflow YAML creation

Many example components are available in the [Kubeflow Pipelines repository](https://github.com/kubeflow/pipelines/tree/master/components/gcp/dataflow/launch_template).

In [3]:
%%writefile dataflow-launch_python-component.yaml

# Copyright 2018 The Kubeflow Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Launch Python
description: |
  Launch a self-executing beam python file.
metadata:
  labels:
    add-pod-env: 'true'
inputs:
  - name: project
    description: 'The ID of the GCP project to run the Dataflow job.'
    type: String
  - name: location
    description: 'The GCP region to run the Dataflow job.'
    type: String
  - name: python_file_path
    description: 'The gcs or local path to the python file to run.'
    type: String
  - name: staging_dir
    description: >-
      Optional. The GCS directory for keeping staging files.
      A random subdirectory will be created under the directory to keep job info
      for resuming the job in case of failure and it will be passed as
      `staging_location` and `temp_location` command line args of the beam code.
    default: ''
    type: String
  - name: requirements_file_path
    description: 'Optional, the gcs or local path to the pip requirements file'
    default: ''
    type: String
  - name: args
    description: 'The list of args to pass to the python file.'
    default: '[]'
    type: typing.List
  - name: wait_interval
    default: '30'
    description: 'Optional wait interval between calls to get job status. Defaults to 30.'
    type: Integer
outputs:
  - name: job_id
    description: 'The id of the created dataflow job.'
    type: String
  - name: MLPipeline UI metadata
    type: UI metadata
implementation:
  container:
    image: gcr.io/ml-pipeline/ml-pipeline-gcp:1.7.0-rc.3
    command: ['python', '-u', '-m', 'kfp_component.launcher']
    args: [
      --ui_metadata_path, {outputPath: MLPipeline UI metadata},
      kfp_component.google.dataflow, launch_python,
      --python_file_path, {inputValue: python_file_path},
      --project_id, {inputValue: project},
      --region, {inputValue: location},
      --staging_dir, {inputValue: staging_dir},
      --requirements_file_path, {inputValue: requirements_file_path},
      --args, {inputValue: args},
      --wait_interval, {inputValue: wait_interval},
      --job_id_output_path, {outputPath: job_id},
    ]
    env:
      KFP_POD_NAME: "{{pod.name}}"

Overwriting dataflow-launch_python-component.yaml
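
The `args` list in the component above mixes literal strings with `{inputValue: ...}` and `{outputPath: ...}` placeholders that are filled in when the step runs. The sketch below illustrates that substitution on a trimmed copy of the template. It is a simplified illustration, not the Kubeflow Pipelines resolver: `resolve_args`, the example input values, and the `/tmp/outputs/...` paths are all hypothetical. It also assumes the list-typed `args` input is supplied as a JSON-encoded string, which matches its `'[]'` default.

```python
import json

# Trimmed copy of the component's `args` template. Plain strings pass through;
# single-key dicts stand in for the YAML placeholders.
ARGS_TEMPLATE = [
    "--ui_metadata_path", {"outputPath": "MLPipeline UI metadata"},
    "kfp_component.google.dataflow", "launch_python",
    "--python_file_path", {"inputValue": "python_file_path"},
    "--project_id", {"inputValue": "project"},
    "--region", {"inputValue": "location"},
    "--args", {"inputValue": "args"},
    "--job_id_output_path", {"outputPath": "job_id"},
]


def resolve_args(template, inputs, output_paths):
    """Replace each placeholder dict with a concrete string (hypothetical helper)."""
    resolved = []
    for item in template:
        if isinstance(item, str):
            resolved.append(item)
        elif "inputValue" in item:
            resolved.append(str(inputs[item["inputValue"]]))
        elif "outputPath" in item:
            resolved.append(output_paths[item["outputPath"]])
        else:
            raise ValueError(f"unsupported placeholder: {item}")
    return resolved


# Hypothetical values, for illustration only.
example_inputs = {
    "python_file_path": "gs://example-bucket/main.py",
    "project": "example-project",
    "location": "us-central1",
    "args": json.dumps(["--example_flag", "example_value"]),
}
example_outputs = {
    "MLPipeline UI metadata": "/tmp/outputs/ui_metadata",
    "job_id": "/tmp/outputs/job_id",
}

command_line = resolve_args(ARGS_TEMPLATE, example_inputs, example_outputs)
print(command_line)
```

Each placeholder becomes exactly one command-line token, so the Beam arguments travel as a single JSON string after `--args` rather than as separate tokens.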


### Create the VPC and other network resources - *RUN ONLY ONCE*

In [None]:
### RUN ONLY ONCE - NOTE THESE ARE CREATED IN NOTEBOOK 05

# PROJECT_ID = 'rec-ai-demo-326116'  # @param {type:"string"}
# NETWORK_NAME = "default"  # @param {type:"string"}
# PEERING_RANGE_NAME = 'google-reserved-range'
# BUCKET = 'rec_bq_jsw' # Change to the bucket you created.


# # Run this only once - this sets up your subnets to provide high-speed predictions

# # Create a VPC network
# ! gcloud compute networks create {NETWORK_NAME} --bgp-routing-mode=regional --subnet-mode=auto --project={PROJECT_ID}

# # Add necessary firewall rules
# ! gcloud compute firewall-rules create {NETWORK_NAME}-allow-icmp --network {NETWORK_NAME} --priority 65534 --project {PROJECT_ID} --allow icmp

# ! gcloud compute firewall-rules create {NETWORK_NAME}-allow-internal --network {NETWORK_NAME} --priority 65534 --project {PROJECT_ID} --allow all --source-ranges 10.128.0.0/9

# ! gcloud compute firewall-rules create {NETWORK_NAME}-allow-rdp --network {NETWORK_NAME} --priority 65534 --project {PROJECT_ID} --allow tcp:3389

# ! gcloud compute firewall-rules create {NETWORK_NAME}-allow-ssh --network {NETWORK_NAME} --priority 65534 --project {PROJECT_ID} --allow tcp:22

# # Reserve IP range
# ! gcloud compute addresses create {PEERING_RANGE_NAME} --global --prefix-length=16 --network={NETWORK_NAME} --purpose=VPC_PEERING --project={PROJECT_ID} --description="peering range for uCAIP Haystack."

# # Set up peering with service networking
# ! gcloud services vpc-peerings connect --service=servicenetworking.googleapis.com --network={NETWORK_NAME} --ranges={PEERING_RANGE_NAME} --project={PROJECT_ID}
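
The CIDR values in the cell above can be sanity-checked with Python's standard `ipaddress` module. This minimal sketch shows what the `10.128.0.0/9` source range of the allow-internal rule spans and how many addresses a `--prefix-length=16` reservation holds. The base address `10.10.0.0` is a hypothetical stand-in, since the command above does not pin a specific block.

```python
import ipaddress

# Source range used by the allow-internal firewall rule above.
internal_range = ipaddress.ip_network("10.128.0.0/9")

# A /16 block like the one requested with --prefix-length=16.
# The base address is hypothetical; the actual block is allocated at creation time.
example_peering_range = ipaddress.ip_network("10.10.0.0/16")

# First and last address covered by the internal range.
print(internal_range[0], "-", internal_range[-1])

# Number of addresses in a /16 reservation.
print(example_peering_range.num_addresses)

# Whether the example peering block collides with the internal range.
print(internal_range.overlaps(example_peering_range))
```

The overlap check is a useful habit before reserving a range by hand, on the assumption that the peering range should stay clear of the subnets already in use on the network.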

### Optional - set up BigQuery flex slots for model computations

In [None]:
# BQ_REGION = 'US' # Change to your BigQuery region.
# RESERVATION = 'default'
# SLOTS=10
# !bq mk --reservation --project_id=$PROJECT_ID --slots=$SLOTS --location=$BQ_REGION $RESERVATION
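
The `bq mk` line above relies on `$PROJECT_ID`, which is only defined in the commented-out VPC cell. As a sketch, the command can be assembled and printed from Python first to confirm that every variable is set before anything is executed. The helper name `build_reservation_command` and the project ID below are hypothetical; the flags are the ones used in the cell above.

```python
import shlex


def build_reservation_command(project_id, reservation, slots, location):
    """Return the `bq mk --reservation` command as a printable string (dry run only)."""
    parts = [
        "bq", "mk", "--reservation",
        f"--project_id={project_id}",
        f"--slots={slots}",
        f"--location={location}",
        reservation,
    ]
    # shlex.join quotes any token that would need it in a shell.
    return shlex.join(parts)


# Hypothetical project ID; the other values mirror the cell above.
reservation_command = build_reservation_command(
    project_id="example-project",
    reservation="default",
    slots=10,
    location="US",
)
print(reservation_command)
```

Nothing is run here; the printed string can be pasted into a `!` cell once it looks right.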

### Upload the embeddings exporter code and stored procedures for use in the Vertex pipeline

In [4]:
# Note: the pipeline will use the code copied up from the embeddings_exporter folder
!gsutil cp -r embeddings_exporter/ gs://rec_bq_jsw

# Do the same with the stored procedure (sproc) files
!gsutil cp -r sql_scripts/ gs://rec_bq_jsw

Copying file://embeddings_exporter/setup.py [Content-Type=text/x-python]...
Copying file://embeddings_exporter/pipeline_kfp.py [Content-Type=text/x-python]...
Copying file://embeddings_exporter/__init__.py [Content-Type=text/x-python]...  
Copying file://embeddings_exporter/pipeline.py [Content-Type=text/x-python]...  
/ [4 files][  5.5 KiB/  5.5 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file://embeddings_exporter/beam_kfp2.py [Content-Type=text/x-python]...
Copying file://embeddings_exporter/runner.py [Content-Type=text/x-python]...    
Copying file://embeddings_exporter/embedding_exporter.egg-info/SOURCES.txt [Content-Type=text/plain]...
Copying file://embeddings_exporter/embedding_exporter.egg-info/top_level.txt [Con