In [40]:
%load_ext autoreload

In [41]:
%autoreload 2

# Introduction

## A guide to developing and deploying a modern data pipeline on production infrastructure
Using two tasks, this guide walks through the three principal stages necessary to develop a modern data pipeline in Python.

#### Pipeline tasks
1. Extract MOC imbalance data from the TSX website
2. Load the data into a `postgres` database

In reality, a pipeline will include other tasks that might: (1) extract data from many sources in different time zones, (2) prepare the data for analytics or prediction algorithm(s), or (3) run multiple statistical experiments on decentralized infrastructure.

Our motivating example is a toy ETL pipeline that extracts data from the Toronto Stock Exchange's Market on Close facility website and loads it into a database. The data is published every trading day at 15:40 Toronto time, and the table is flushed at midnight.
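
The publication window described above can be sketched as a small check. This is a minimal sketch and not part of the pipeline itself: the name `moc_data_available` is hypothetical, the function expects a datetime that is already expressed in Toronto local time, and it ignores exchange holidays.

```python
from datetime import datetime, time

# Assumption: data appears at 15:40 Toronto time and is flushed at midnight.
PUBLISH_TIME = time(15, 40)


def moc_data_available(now_toronto: datetime) -> bool:
    """Return True if MOC data should be on the site at `now_toronto`.

    `now_toronto` must already be in Toronto local time.
    Exchange holidays are not handled in this sketch.
    """
    is_weekday = now_toronto.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and now_toronto.time() >= PUBLISH_TIME


# 2020-04-29 was a Wednesday
print(moc_data_available(datetime(2020, 4, 29, 16, 0)))   # after publication
print(moc_data_available(datetime(2020, 4, 29, 9, 30)))   # before publication
```

A check like this is one way to decide whether to hit the live url or fall back to an archived copy while developing.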
 

### Principal stages and tools
1.  Write the base code
2.  Implement task configuration and orchestration logic for individual tasks and the pipeline, i.e. error handling and execution logic
3.  Deploy the flow to a compute environment: local, self-hosted, or an IaaS provider (in our case, AWS)

| Development Stage | Tool | Purpose | pip install |
|------|------|------|------|
|  Base code  | `requests`, `lxml`, `html5lib` | extract content from web | pip install requests lxml html5lib |
| | `pandas` | transforming and passing data | pip install pandas |
| | `datetime` | handling dates | standard library, no install needed |
| | `psycopg2` | database driver | pip install psycopg2 |
|  Task configuration and orchestration  | `prefect` | orchestration, configuration, execution method | pip install -U "prefect[viz, aws]" |
|  Deployment  | `prefect` | scheduling and monitoring | - |
| | AWS Batch | compute | Instructions in guide |
| | AWS ECR | docker repo | Instructions in guide |
| | AWS S3 | object storage | - |

### Requirements
- An AWS account
    - aws cli configured on the local machine. [Instructions](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html)
- postgres db - The lazy remote option: [Lightsail](https://aws.amazon.com/lightsail/) is free for a month and easy to set up and tear down for experiments. It offers postgres or mysql.
- Docker - [docker install](https://docs.docker.com/engine/install/#supported-platforms).  
- graphviz - [download](https://www.graphviz.org/download/).  Choose your os.
- python 3.7
    - prefect - `pip install -U "prefect[viz, aws]"` See [prefect install](https://docs.prefect.io/core/getting_started/installation.html) for other possibilities.
    

## Base pipeline tasks
Context - Our base code will live in a directory called `prefect_guide`, in a single file called `get_moc.py`. Both can be given any valid Unix name.

### a. Extract MOC imbalance data from the TSX website

The first task retrieves a table from the [TSX website](https://api.tmxmoney.com/mocimbalance/en/TSX/moc.html) using `requests`. The task returns a `dataframe` with the MOC imbalances for the day.

In [6]:
import pandas as pd
import requests


import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

def get_tsx_moc_imb(url: str):
    """
    Scrape the TSX Market on Close website. Data is only available on weekdays
    from 15:40 Toronto time until midnight.
    
    Use the archived url for testing.       
    "https://web.archive.org/web/20200414202757/https://api.tmxmoney.com/mocimbalance/en/TSX/moc.html"
    """
    
    # 1. Get the html content
    html = requests.get(url).content
    
    # 2. Read all the tables
    df_list = pd.read_html(html, header=[0], na_values=[''], keep_default_na=False)
    
    # 3. The imbalance table is the last table on the page
    tsx_imb_df = df_list[-1]
    
    logger.info(f"MOC download shape {tsx_imb_df.shape}")

    return tsx_imb_df

#### Run the function

After running the function below, you will get one of three results:

   1. A dataframe having a shape of `[num_rows > 1, 4]`
   2. A dataframe having a shape of `[0, 4]`
   3. A connection error, e.g. `RemoteDisconnected`, ...

Whichever result is returned, it will be handled later in the configuration and orchestration stage. For the purposes of validating the next step, use the `backup_url` given below.
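
The three outcomes can be told apart with a small wrapper. This is a minimal sketch under stated assumptions: `classify_moc_result` is a hypothetical helper, it relies only on the returned object having a `.shape` attribute (as a `dataframe` does), and it deliberately catches every exception instead of only connection errors.

```python
from types import SimpleNamespace
from typing import Any, Callable, Tuple


def classify_moc_result(fetch: Callable[[], Any]) -> Tuple[str, Any]:
    """Call `fetch` and label the outcome as 'data', 'empty' or 'error'."""
    try:
        df = fetch()
    except Exception as exc:  # broad on purpose for this sketch
        return "error", exc
    if df.shape[0] == 0:
        return "empty", df
    return "data", df


# Stand-ins for real dataframes so the sketch runs without network access.
full = SimpleNamespace(shape=(389, 4))
empty = SimpleNamespace(shape=(0, 4))


def failing_fetch():
    raise ConnectionError("remote end closed connection")


print(classify_moc_result(lambda: full)[0])
print(classify_moc_result(lambda: empty)[0])
print(classify_moc_result(failing_fetch)[0])
```

With the real function, `fetch` could be `lambda: get_tsx_moc_imb(backup_url)`.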

In [43]:
tsx_url = 'https://api.tmxmoney.com/mocimbalance/en/TSX/moc.html'
backup_url = "https://web.archive.org/web/20200414202757/https://api.tmxmoney.com/mocimbalance/en/TSX/moc.html"

tsx_imb_df = get_tsx_moc_imb(backup_url)
print(tsx_imb_df.shape)
tsx_imb_df.head(3)  

(389, 4)


Unnamed: 0,Symbol,Imbalance Side,Imbalance Size,Imbalance Reference Price
0,AAV,BUY,34003,1.725
1,ABX,BUY,460592,34.005
2,ACB,BUY,211790,1.035


### b. Load to database

This task takes the transformed data as input and loads the `imb_df` into a database using `SQLAlchemy`. There are many ways to accomplish this, including some baked into `prefect`, but this is the simplest way to write a small `df`.

In [55]:
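import sqlalchemy as sa


def df_to_db(df, tbl_name, conn_str):
    """Append `df` to the table `tbl_name` and return the shape of the data written."""

    engine = sa.create_engine(conn_str)
    
    df.to_sql(
        name=tbl_name,
        con=engine,
        if_exists="append",  # add rows if the table already exists
        index=True,          # write the dataframe index as column(s)
        method="multi",      # pass multiple rows per INSERT statement
        chunksize=5000,      # rows written per batch
        )
    
    # Release the engine's connection pool
    engine.dispose()
  
    return df.shape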

Build a connection string using your own credentials for the db.

In [56]:
usr_nm = 'something'
pwd = "verysecret"
host = "some_endpoint or ip"
db_nm = "tst"

db_string = f"postgresql://{usr_nm}:{pwd}@{host}/{db_nm}"
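
Passwords containing characters such as `@` or `/` can break a connection string built by plain interpolation. Below is a minimal sketch of one way to guard against that by percent-encoding the credentials with the standard library. The helper name `build_pg_url` and the default port of 5432 are assumptions for illustration and are not part of the original code.

```python
from urllib.parse import quote_plus


def build_pg_url(user: str, password: str, host: str, db_name: str, port: int = 5432) -> str:
    """Build a postgres connection url with percent-encoded credentials."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{db_name}"
    )


# Hypothetical credentials, for illustration only
print(build_pg_url("something", "very@secret/1", "localhost", "tst"))
```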

Prepare the data for loading, then run the function. The imb data should now be inserted into the table, which is created if it does not already exist.

In [51]:
tranData = PrepareLoad()
imb_df = tranData.run(tsx_imb_df)
imb_df.head(2)

Unnamed: 0_level_0,Unnamed: 1_level_0,imbalance_side,imbalance_size,imbalance_reference_price,dlr_delta
moc_date,symbol,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-04-29,AAV,BUY,34003,1.725,58655
2020-04-29,ABX,BUY,460592,34.005,15662430


In [57]:
df_shape = df_to_db(imb_df, tbl_name="moc_tst", conn_str=db_string)
df_shape

(389, 4)

## Stage 2 - Configuration and Orchestration for production

Setting up the prefect config for deployment: 

1. In the terminal, go to the prefect directory

    cd ~
    
    cd .prefect
2. Check if a config exists

    ls

    ![image.png](attachment:41cfd786-48c5-4495-a53c-aeaf37ee28e0.png)

    If it does not exist:
    
    touch config.toml

3. Open the file.

    nano config.toml

4. Get a cloud token from prefect

5. Add the following configuration.  [prefect config](https://docs.prefect.io/core/concepts/configuration.html)
  
 
![image.png](attachment:85d6901e-5c58-4b4e-801b-3b262153b614.png)

6. Save the file and quit nano
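
Steps 1 to 3 can also be done from Python. This is a minimal sketch, assuming the config lives at `~/.prefect/config.toml` as in the steps above. `ensure_prefect_config` is a hypothetical helper; it only creates an empty file when none exists and does not write any settings.

```python
import tempfile
from pathlib import Path


def ensure_prefect_config(prefect_dir: Path) -> Path:
    """Create `config.toml` inside `prefect_dir` if missing and return its path."""
    prefect_dir.mkdir(parents=True, exist_ok=True)
    config_path = prefect_dir / "config.toml"
    # touch() leaves the contents of an existing file unchanged
    config_path.touch(exist_ok=True)
    return config_path


# Demonstrated in a temporary directory so nothing in the home directory changes.
with tempfile.TemporaryDirectory() as tmp:
    path = ensure_prefect_config(Path(tmp) / ".prefect")
    print(path.name, path.exists())
```

For the real location, pass `Path.home() / ".prefect"` as `prefect_dir`.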

Configure `get_tsx_moc_imb` to handle retries and save its result to `s3`:
- Add the `prefect` task decorator with [keyword configuration](https://docs.prefect.io/api/latest/core/task.html#task-2).

In [15]:
from datetime import timedelta

from prefect import task
from prefect.engine.result_handlers import LocalResultHandler, S3ResultHandler

s3_handler = S3ResultHandler(bucket='tsx-moc-bcp')  
lcl_handler = LocalResultHandler()



@task(
    result_handler=s3_handler,
    max_retries=3, 
    retry_delay=timedelta(seconds=0)
    )
def get_tsx_moc_imb(url: str):
    """
    Scrape the TSX Market on Close website. Data is only available on weekdays
    from 15:40 Toronto time until midnight.
    
    Use the archived url for testing.       
    "https://web.archive.org/web/20200414202757/https://api.tmxmoney.com/mocimbalance/en/TSX/moc.html"
    """
    
    # 1. Get the html content
    html = requests.get(url).content
    
    # 2. Read all the tables
    df_list = pd.read_html(html, header=[0], na_values=[''], keep_default_na=False)
    
    # 3. The imbalance table is the last table on the page
    tsx_imb_df = df_list[-1]
    
    logger.info(f"MOC download shape {tsx_imb_df.shape}")

    return tsx_imb_df
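
To make the `max_retries` and `retry_delay` settings concrete, here is a plain-Python sketch of roughly what they ask for. It is not how `prefect` implements retries; `with_retries` is a hypothetical decorator, and it assumes a retry means re-running the function after any exception, so the function runs at most `max_retries + 1` times.

```python
import time
from datetime import timedelta
from functools import wraps


def with_retries(max_retries: int, retry_delay: timedelta):
    """Re-run the wrapped function up to `max_retries` times after a failure."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            failures = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception:
                    failures += 1
                    if failures > max_retries:
                        raise  # retries exhausted, surface the last error
                    time.sleep(retry_delay.total_seconds())
        return wrapper
    return decorator


calls = []


@with_retries(max_retries=3, retry_delay=timedelta(seconds=0))
def flaky_download():
    # Fails on the first two calls, then succeeds
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(flaky_download(), len(calls))
```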

Notice that the decorated function is now a `prefect` task, so it is called through its `run` method:

In [16]:
tsx_url = 'https://api.tmxmoney.com/mocimbalance/en/TSX/moc.html'
backup_url = "https://web.archive.org/web/20200414202757/https://api.tmxmoney.com/mocimbalance/en/TSX/moc.html"

tsx_imb_df = get_tsx_moc_imb.run(tsx_url)
tsx_imb_df

Unnamed: 0,Symbol,Imbalance Side,Imbalance Size,Imbalance Reference Price
0,AAV,BUY,5043,2.295
1,ABX,BUY,149503,36.745
2,ACO.X,BUY,23905,40.410
3,ACQ,BUY,9800,6.720
4,ADN,BUY,102,14.110
...,...,...,...,...
412,WPM,SELL,66284,56.185
413,WTE,SELL,2871,14.965
414,X,SELL,2846,121.950
415,XAU,SELL,300,2.740
