# Setting up Prefect

Prefect is an orchestration tool: it defines the flow of everything that runs in a database load, such as pulling, validating, loading and transforming data. See it as an alternative to SSIS and SQL Agent Jobs that looks and works a lot nicer!

It also means that we aren't tied to a single toolset to accomplish things: if something is better done in Python, which then kicks off a bit of SQL, the orchestration sets that up for us.

Prefect is a Python module that uses _decorators_ to orchestrate the data flow.
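If decorators are new to you, the sketch below shows the general idea in plain Python, without Prefect. The `logged` decorator and the `add` function are hypothetical names made up for illustration; this is not how Prefect implements `@flow`, only the same language feature in miniature.

```python
import functools

def logged(func):
    """Hypothetical decorator: report when the wrapped function starts and finishes."""
    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"Starting {func.__name__}")
        result = func(*args, **kwargs)
        print(f"Finished {func.__name__}")
        return result
    return wrapper

@logged  # equivalent to: add = logged(add)
def add(a, b):
    return a + b

total = add(2, 3)
```

The decorated function is called exactly as before; the decorator wraps extra behaviour around it. Prefect uses the same mechanism to attach orchestration to an ordinary function.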

In [None]:
from prefect import flow

@flow
def my_first_flow():
    print("This function doesn't do too much")
    return 42

We set up a basic function decorated with `@flow`. Calling it runs the flow:

In [None]:
state = my_first_flow()

In [None]:
print(state)
print(state.result())

## So what happened here?

1. We created a basic function that prints something and returns something
2. Decorated it with `@flow`
3. Assigned the return value of the function to the variable `state`, which we can then query. The return value is a Prefect `State` object; to see what the function itself returned, we need to call `.result()` on it

In [None]:
print(type(state))

## Flows and tasks

Flows and tasks are the basic building blocks of Prefect: they are containers for the workflow logic. Flows can run other flows or tasks. Tasks are optional, but they provide extra encapsulation in observable units that can be reused across flows and subflows.

In [None]:
import os
import requests
from prefect import flow, task

@task
def call_api(url):
    # Fetch the URL and return the decoded JSON body
    response = requests.get(url)
    print(response.status_code)
    return response.json()

@task
def parse_fact(response):
    # Print the "fact" field of the API response
    print(response["fact"])

# The version is taken from the GIT_COMMIT_SHA environment variable (None if it is not set)
@flow(name="Example API call flow",
      description="An example flow for this tutorial",
      version=os.getenv("GIT_COMMIT_SHA"))
def api_flow(url):
    fact_json = call_api(url)
    parse_fact(fact_json)

state = api_flow("https://catfact.ninja/fact")

## Applying this to NHS Numbers

Now that we have the basics, let's apply them to our OptOuts csv.

### The way we'll do it

1. Simulate csv delivery: just copy a file from one area to the source folder
2. Validate the data with Great Expectations
3. Run a checksum against the data
4. Load the csv into a dataframe, save it as a timestamped csv and load it into a database
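The text does not say which checksum step 3 refers to. Assuming it means the Modulus 11 check digit carried by every NHS Number, a minimal sketch of that check could look like the following. The function name `is_valid_nhs_number` is hypothetical, and the number used in the usage line is a made-up test value, not real data.

```python
def is_valid_nhs_number(value: str) -> bool:
    """Check a 10-digit NHS Number against its Modulus 11 check digit."""
    digits = value.replace(" ", "")  # tolerate the common "3 3 4" spaced format
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products
    weighted_sum = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check_digit = 11 - (weighted_sum % 11)
    if check_digit == 11:
        check_digit = 0
    if check_digit == 10:
        # A computed check digit of 10 means the number is invalid
        return False
    return check_digit == int(digits[9])

print(is_valid_nhs_number("943 476 5919"))
```

A function like this could be wrapped in its own `@task` and applied to the NHS Number column once the csv is loaded; if step 3 actually means a file-level checksum, a hash of the file contents would be the equivalent sketch.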

In [None]:
import os
import shutil
import pandas as pd
from prefect import flow, task

@task(name='Copy file to dropzone', description='A simulated drop of the OptOuts csv to the dropzone')
def simulate_datadrop(sourcePath: str, destinationPath: str, file: str) -> None:
    # Build the paths with os.path.join instead of hard-coded separators
    sourceFile = os.path.join(sourcePath, file)
    destFile = os.path.join(destinationPath, file)
    shutil.copyfile(sourceFile, destFile)
    print("File copied")

@task(name='Validate Data', description='Validate data against great expectations checkpoint')
def validate_data(checkpointPath: str) -> None:
    # Run the checkpoint script in-process; a sys.exit() call inside it propagates out of this task
    with open(checkpointPath) as checkpointScript:
        exec(checkpointScript.read())

@flow(name='OptOuts Prefect Flow')
def optout_flow():
    simulate_datadrop('D:\\git\\dePOC\\data\\raw', 'D:\\git\\dePOC\\data\\src', 'OptOuts.csv')
    validate_data('D:\\git\\dePOC\\expectations\\great_expectations\\uncommitted\\run_OptOuts_checkpoint.py')
    return

state = optout_flow()

22:18:57.751 | INFO    | prefect.engine - Created flow run 'carmine-duck' for flow 'OptOuts Prefect Flow'
22:18:57.755 | INFO    | Flow run 'carmine-duck' - Using task runner 'ConcurrentTaskRunner'
22:18:57.887 | INFO    | Flow run 'carmine-duck' - Created task run 'Copy file to dropzone-046d9bd5-0' for task 'Copy file to dropzone'
22:18:57.965 | INFO    | Flow run 'carmine-duck' - Created task run 'Validate Data-d13a1dbb-0' for task 'Validate Data'
22:18:57.998 | INFO    | Task run 'Copy file to dropzone-046d9bd5-0' - Finished in state Completed()

File copied

Calculating Metrics: 100%|████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:00<00:00, 216.87it/s]
22:19:06.926 | INFO    | Task run 'Validate Data-d13a1dbb-0' - Crash detected! Execution was aborted by Python system exit call.
22:19:06.962 | ERROR   | Flow run 'carmine-duck' - Crash detected! Execution was aborted by Python system exit call.
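
The "Crash detected" lines above say the run was aborted by a Python system exit call. A likely explanation, which is an assumption and not something the log confirms, is that the checkpoint script ends with `sys.exit(...)`, so running it inside the task raises `SystemExit` even when validation passes. A minimal sketch of one way around this is to run the script with the standard library's `runpy` and turn the exit call into a return code. `run_script` is a hypothetical helper name.

```python
import runpy

def run_script(script_path: str) -> int:
    """Run a Python script in-process and return its exit code
    instead of letting SystemExit propagate to the caller."""
    try:
        runpy.run_path(script_path, run_name="__main__")
    except SystemExit as exit_call:
        code = exit_call.code
        if code is None:
            return 0          # sys.exit() with no argument means success
        if isinstance(code, int):
            return code       # sys.exit(0), sys.exit(1), ...
        return 1              # sys.exit("message") signals failure
    return 0                  # the script finished without calling sys.exit
```

Inside `validate_data`, the task could then call `run_script(checkpointPath)` and raise an ordinary exception when the returned code is non-zero, so that a failed validation is reported as a failed task and not as a crash.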
