# Create a function to ingest data from Snowflake with a Dask cluster

The Dask framework enables users to parallelize their Python code and run it as a distributed process on an Iguazio cluster, which can dramatically accelerate performance. <br>
In this notebook we'll create an MLRun function that runs as a Dask client to ingest data from Snowflake. <br>
It also demonstrates how to run a parallelized query against Snowflake, using Dask Delayed to load a large data set. <br>
The function will be published on the function marketplace. <br>
For more information on Dask over Kubernetes: https://kubernetes.dask.org/en/latest/

### Set up the environment

In [1]:
import mlrun
import os
import warnings
import yaml

project_name = "snowflake-dask"
dask_cluster_name="snowflake-dask-cluster"
artifact_path = mlrun.set_environment(project=project_name,
                                      artifact_path = os.path.join(os.path.abspath('/v3io/projects/'), project_name))

warnings.filterwarnings("ignore")

print(f'artifact_path = {artifact_path}')

> 2022-03-17 17:11:56,500 [info] loaded project snowflake-dask from MLRun DB
artifact_path = ('snowflake-dask', '/v3io/projects/snowflake-dask')


### Load the Snowflake configuration from a config file
This is for demo purposes only. In real production code, you would put the Snowflake connection info into secrets and use those secrets in the running pod to connect to Snowflake.

In [2]:
# Load connection info
with open(".config.yaml") as f:
    connection_info = yaml.safe_load(f)

# verify the config
print(connection_info['account'])

nf77378.eu-west-2.aws
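
As a rough illustration of the production advice above, the connection info can be assembled from values that a secret store injects into the pod as environment variables, instead of a config file. This is a minimal sketch, not the notebook's method: the `SNOWFLAKE_` prefix, the key list, and the helper name `connection_info_from_env` are all hypothetical.

```python
import os

# Hypothetical key list; adjust to the fields your Snowflake connection needs.
REQUIRED_KEYS = ("user", "password", "account", "warehouse")


def connection_info_from_env(prefix="SNOWFLAKE_", environ=None):
    """Build a Snowflake connection dict from environment variables.

    The variable names (e.g. SNOWFLAKE_USER) are an assumption for this
    sketch; use whatever names your secret store exposes in the pod.
    """
    environ = os.environ if environ is None else environ
    missing = [k for k in REQUIRED_KEYS if f"{prefix}{k.upper()}" not in environ]
    if missing:
        raise KeyError(f"missing environment variables for: {', '.join(missing)}")
    return {k: environ[f"{prefix}{k.upper()}"] for k in REQUIRED_KEYS}
```

The resulting dict has the same shape as the one loaded from the YAML file, so it could be passed to the connector in the same way.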


### Create a python function

This function queries data from Snowflake using the Snowflake Python connector and processes the query results in parallel. <br>
With the Snowflake Python connector, when you execute a query, the cursor returns the result in batches. <br>
Using Dask Delayed, the function loads and processes those result batches in parallel. <br>

#### write the function to a py file

In [3]:
%%writefile snowflake_dask.py
"""Snowflake Dask - Ingest Snowflake data with Dask"""
import warnings
import mlrun
from mlrun.execution import MLClientCtx
import snowflake.connector as snow
from dask.distributed import Client
from dask.dataframe import from_delayed
from dask import delayed
from dask import dataframe as dd

warnings.filterwarnings("ignore")

@delayed
def load(batch):

    """Load one result batch into a pandas DataFrame (delayed)."""

    try:
        print("BATCHING")
        df_ = batch.to_pandas()
        return df_
    except Exception as e:
        print(f"Failed on {batch} for {e}")
        raise

def load_results(context: MLClientCtx,
                 dask_client: str,
                 connection_info: dict,
                 query: str,
                 parquet_out_dir = None,
                 publish_name = None
                ) -> None:

    """Snowflake Dask - Ingest Snowflake data with Dask

    :param context:           the function context
    :param dask_client:       dask cluster function name
    :param connection_info:   Snowflake database connection info (this will be in a secret later)
    :param query:             query to run against Snowflake
    :param parquet_out_dir:   directory path for the output parquet files
                              (default None, do not write out)
    :param publish_name:      name of the dask dataframe to publish to the dask cluster
                              (default None, do not publish)

    """
    context = mlrun.get_or_create_ctx('snowflake-dask-cluster')

    # setup dask client from the MLRun dask cluster function
    if dask_client:
        client = mlrun.import_function(dask_client).client
        context.logger.info(f'Existing dask client === >>> {client}\n')
    else:
        client = Client()
        context.logger.info(f'\nNewly created dask client === >>> {client}\n')

    conn = snow.connect(**connection_info)
    cur = conn.cursor()
    cur.execute(query)
    batches = cur.get_result_batches()
    context.logger.info(f'batches len === {len(batches)}\n')

    dfs = []
    for batch in batches:
        if batch.rowcount > 0:
            df = load(batch)
            dfs.append(df)
    ddf = from_delayed(dfs)

    # materialize the query results set for some sample compute

    ddf_describe = ddf.describe().compute()

    context.logger.info(f'query  === >>> {query}\n')
    context.logger.info(f'ddf  === >>> {ddf}\n')
    context.log_result('number of rows', len(ddf.index))
    context.log_dataset("ddf_describe", df=ddf_describe)

    if publish_name:
        context.log_result('data_set_name', publish_name)
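        if publish_name not in client.list_datasets():
            # persist() returns a new collection, so keep the result
            ddf = ddf.persist()
            # publish under the value of publish_name, not the literal keyword
            client.publish_dataset(ddf, name=publish_name)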

    if parquet_out_dir:
        dd.to_parquet(df=ddf, path=parquet_out_dir)
        context.log_result('parquet directory', parquet_out_dir)

Overwriting snowflake_dask.py


### Convert the code to an MLRun function

Use `code_to_function` to convert the code to an MLRun function. <br>

In [4]:
fn = mlrun.code_to_function(name="snowflake-dask",  
                            kind='job', 
                            filename='snowflake_dask.py',
                            image='mlrun/mlrun',
                            requirements='requirements.txt',
                            handler="load_results", 
                            description="Snowflake Dask - Ingest snowflake data in parallel with Dask cluster",
                            categories=["data-prep"],
                            labels={"author": "xingsheng"}
                           )
fn.apply(mlrun.platforms.auto_mount())
fn.deploy()

> 2022-03-17 17:11:56,703 [info] Started building image: .mlrun/func-snowflake-dask-snowflake-dask:latest
INFO[0000] Retrieving image manifest mlrun/mlrun:0.10.0
INFO[0000] Retrieving image mlrun/mlrun:0.10.0 from registry index.docker.io
INFO[0000] Built cross stage deps: map[]
INFO[0000] Retrieving image manifest mlrun/mlrun:0.10.0
INFO[0000] Returning cached image manifest
INFO[0000] Executing 0 build triggers
INFO[0000] Unpacking rootfs as cmd RUN python -m pip install bokeh snowflake-connector-python[pandas] requires it.
INFO[0019] RUN python -m pip install bokeh snowflake-connector-python[pandas]
INFO[0019] Taking snapshot of full filesystem...
INFO[0036] cmd: /bin/sh
INFO[0036] args: [-c python -m pip install bokeh snowflake-connector-python[pandas]]
INFO[0036] Running: [/bin/sh

True

#### export function to local `function.yaml` file for testing
In real usage, we will import the function from the hub.

In [5]:
fn.export('function.yaml')
# print(fn.to_yaml())

> 2022-03-17 17:12:47,044 [info] function spec saved to path: function.yaml


<mlrun.runtimes.kubejob.KubejobRuntime at 0x7fae6fe3e690>

#### import a function from local `function.yaml` for testing (need to change it to import from hub before PR)

In [6]:
fn = mlrun.import_function("./function.yaml")

In [7]:
# fn = mlrun.import_function("hub://snowflake_dask")

#### create a dask cluster and specify the configuration for the dask process (e.g. replicas, memory etc)

In [8]:
# function URI is db://<project>/<name>
dask_uri = f'db://{project_name}/{dask_cluster_name}'
dask_uri

'db://snowflake-dask/snowflake-dask-cluster'
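
The `db://<project>/<name>` form used above can be built and split with two small helpers. This is a sketch covering only the plain form shown in this notebook; the helper names are hypothetical and are not part of the MLRun API.

```python
def function_uri(project, name):
    """Build a function URI of the form db://<project>/<name>."""
    return f"db://{project}/{name}"


def parse_function_uri(uri):
    """Split a db://<project>/<name> URI into (project, name)."""
    scheme, sep, rest = uri.partition("://")
    if scheme != "db" or not sep:
        raise ValueError(f"not a db:// function URI: {uri!r}")
    project, sep, name = rest.partition("/")
    if not project or not sep or not name:
        raise ValueError(f"expected db://<project>/<name>, got {uri!r}")
    return project, name


uri = function_uri("snowflake-dask", "snowflake-dask-cluster")
project, name = parse_function_uri(uri)
```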

In [9]:
dsf = mlrun.new_function(name=dask_cluster_name, 
                         kind='dask', 
                         image='mlrun/mlrun',
                         requirements=["bokeh", "snowflake-connector-python[pandas]"]
                        )
dsf.apply(mlrun.mount_v3io())
dsf.spec.remote = True
dsf.spec.min_replicas = 1
dsf.spec.max_replicas = 10
dsf.spec.service_type = "NodePort"
dsf.with_requests(mem='4G', cpu='2')
# dsf.spec.node_port=30088
# dsf.spec.scheduler_timeout = "5 days"

In [10]:
dsf.deploy()

> 2022-03-17 17:12:47,426 [info] Started building image: .mlrun/func-snowflake-dask-snowflake-dask-cluster:latest
INFO[0000] Retrieving image manifest mlrun/mlrun:0.10.0
INFO[0000] Retrieving image mlrun/mlrun:0.10.0 from registry index.docker.io
INFO[0000] Built cross stage deps: map[]
INFO[0000] Retrieving image manifest mlrun/mlrun:0.10.0
INFO[0000] Returning cached image manifest
INFO[0000] Executing 0 build triggers
INFO[0000] Unpacking rootfs as cmd RUN python -m pip install bokeh snowflake-connector-python[pandas] requires it.
INFO[0019] RUN python -m pip install bokeh snowflake-connector-python[pandas]
INFO[0019] Taking snapshot of full filesystem...
INFO[0036] cmd: /bin/sh
INFO[0036] args: [-c python -m pip install bokeh snowflake-connector-python[pandas]]
INFO[0036] Running:

True

In [11]:
client = dsf.client

> 2022-03-17 17:13:51,354 [info] trying dask client at: tcp://mlrun-snowflake-dask-cluster-15ea793c-d.default-tenant:8786
> 2022-03-17 17:13:51,391 [info] using remote dask scheduler (mlrun-snowflake-dask-cluster-15ea793c-d) at: tcp://mlrun-snowflake-dask-cluster-15ea793c-d.default-tenant:8786


### Run the function

When you run the function, a remote dashboard link appears as part of the output. Clicking this link takes you to the Dask monitoring dashboard.

In [12]:
p = 'my-local-test'
parquet_path = f"/v3io/bigdata/pq_from_sf_dask/{p}"

fn.run(handler = 'load_results',
       params={"dask_client": dask_uri, 
               "connection_info": connection_info, 
               "query": "SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER",
               "parquet_out_dir": parquet_path,
               "publish_name": "customer",
              }
      )

> 2022-03-17 17:13:51,401 [info] starting run snowflake-dask-load_results uid=8f59e678a2dd4b93ad89313f3e5d0455 DB=http://mlrun-api:8080
> 2022-03-17 17:13:51,562 [info] Job is running in the background, pod: snowflake-dask-load-results-nrsf2
> 2022-03-17 17:13:57,110 [info] trying dask client at: tcp://mlrun-snowflake-dask-cluster-15ea793c-d.default-tenant:8786
> 2022-03-17 17:13:57,119 [info] using remote dask scheduler (mlrun-snowflake-dask-cluster-15ea793c-d) at: tcp://mlrun-snowflake-dask-cluster-15ea793c-d.default-tenant:8786
remote dashboard: default-tenant.app.us-sales-322.iguazio-cd1.com:30088
> 2022-03-17 17:13:57,120 [info] Existing dask client === >>> <Client: 'tcp://172.31.1.12:8786' processes=0 threads=0, memory=0 B>

> 2022-03-17 17:14:00,599 [info] batches len === 22

> 2022-03-17 17:14:15,285 [info] query  === >>> SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER

> 2022-03-17 17:14:15,294 [info] ddf  === >>> Dask DataFrame Structure:
               C_CUSTKEY  C_NAM

| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| snowflake-dask | ...3e5d0455 | 0 | Mar 17 17:13:56 | completed | snowflake-dask-load_results | v3io_user=xingsheng<br>kind=job<br>owner=xingsheng<br>mlrun/client_version=0.10.0<br>host=snowflake-dask-load-results-nrsf2 | | dask_client=db://snowflake-dask/snowflake-dask-cluster<br>connection_info={'user': 'xingsheng', 'password': 'Xgg2jcDDbxBsB7oL', 'warehouse': 'compute_sh', 'account': 'nf77378.eu-west-2.aws', 'application': 'Iguazio', 'query_tag': 'Iguazio'}<br>query=SELECT * FROM SNOWFLAKE_SAMPLE_DATA.TPCH_SF1.CUSTOMER<br>parquet_out_dir=/v3io/bigdata/pq_from_sf_dask/my-local-test<br>publish_name=customer | number of rows=150000<br>data_set_name=customer<br>parquet directory=/v3io/bigdata/pq_from_sf_dask/my-local-test | ddf_describe |





> 2022-03-17 17:14:26,456 [info] run executed, status=completed


<mlrun.model.RunObject at 0x7fae64a4ce10>

In [13]:
client.close()

## Track the progress in the UI

Users can view the progress and detailed information in the MLRun UI by clicking on the uid above. <br>
To track the Dask progress in the Dask UI, click the "dashboard link" above the "client" section.