# Pipeline prep
This notebook logs the creation of the Toronto Building Permits ELT data pipeline. First, we take inventory of the sources and transformations required to generate the gold table of monthly permit summaries per neighbourhood. Next, we choose a database service to host the database and connect to it from `dbt`, the framework we will use for our data pipeline. Finally, we create our transformations and implement tests to ensure data quality.

## Data inventory
From our exploratory data analysis, we arrived at the following data flow:

**Source (Bronze) Data:**
- Building Permits Data, Cleared `BPC`
- Building Permits Data, Active `BPA`
- Municipal Address Points `AP`
- Neighbourhoods `N`

**Silver Data:**
- Building Permits Data, processed and annotated `BP`
- Addresses by Neighbourhoods `AN`

**Gold Data:**
- Neighbourhoods by Month `NBM`

**Transformations (Mermaid)**

![mermaid code for flowchart in following cell](diagrams/bp-flow-original.svg "a title")


```mermaid
flowchart TB

%% STAGE 1: Concat and prepare building permits

    bpa1[BPA] ---h1(["`hash:
        'PERMIT_NUM', 
        'REVISION_NUM', 
        'PERMIT_TYPE', 
        'BUILDER_NAME'`"]) 
        -->bpa2[BPA]
    bpc1[BPC] ---h2(["`hash:
        'PERMIT_NUM', 
        'REVISION_NUM', 
        'PERMIT_TYPE', 
        'BUILDER_NAME'`"]) 
        -->bpc2[BPC]
    bpa2 & bpc2 ---c1([concat]) -->bp1[BP]
    
    bp1 ---f1([filter by permit type and status])
        ---ed1([add effective date])
        ---s3(["`SELECT
        _GEO_ID,
        effective_date,
        EST_CONST_COST_`"])
        -->bp2[BP]

%% STAGE 2: Geofence properties into neighbourhoods
        
    ap1[AP] ---s1(["`SELECT
        _ADDRESS_POINT_ID, 
        geometry_`"])
        ---pg1([parse geometry])
        -->ap2[AP]

    n1[N] ---s2(["`SELECT
        _AREA_ID,
        AREA_SHORT_CODE,
        AREA_NAME, 
        geometry_`"])
        ---pg2([parse geometry])
        -->n2[N]
    
    ap2 & n2 ---gpd1(["`geopandas overlay
        (intersection):
        _ADDRESS_POINT_ID 
        to AREA_ID_`"])
        -->an1[AN]

%% STAGE 3: Join neighbourhoods to permits (Possible split here)
    
    bp2 & an1 ---lj(["`left join: 
        _GEO_ID to 
        ADDRESS_POINT_ID_`"])
        --> bp3[BP]

%% STAGE 4: Group and summarise by interval (month)

    bp3 ---cd1([convert dates to datetime])
        ---am1([add month column])
        ---gb1(["`group by 
        _AREA_NAME, month_`"])
        ---sm1(["`summarise:
        sum(EST_CONST_COST),
        count(GEO_ID)
        `"])
        -->nbm1[NBM]

%% STAGE 5: Add total properties per neighbourhood

    an1 ---gb2(["`group by
        _AREA_NAME_`"])
        ---sm2(["`summarise:
        count(unique(ADDRESS_POINT_ID))
        `"])
        -->nc1[NC]
    
    nbm1 & nc1 ---lj2(["`left join:
        on _AREA_NAME_`"])
        -->nbm2[NBM]
```
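
Stage 1 (hash and concat) and Stage 4 (group and summarise by month) of the flowchart can be sketched in plain Python as below. This is a minimal, dependency-free sketch rather than the pipeline's actual implementation. It assumes the `hash` step builds a surrogate key by hashing the four listed columns with SHA-256, and that `effective_date` has already been converted to a `date`. The names `permit_key` and `monthly_summary`, the output fields `total_cost` and `permit_count`, and all sample rows are hypothetical.

```python
import hashlib
from collections import defaultdict
from datetime import date

# Columns hashed in Stage 1 of the flowchart.
KEY_COLUMNS = ("PERMIT_NUM", "REVISION_NUM", "PERMIT_TYPE", "BUILDER_NAME")


def permit_key(row):
    """Return a SHA-256 hex digest over the key columns (assumed hashing scheme)."""
    # Missing values become ""; the "|" separator keeps ("AB", "C") distinct from ("A", "BC").
    parts = ["" if row.get(col) is None else str(row[col]) for col in KEY_COLUMNS]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def monthly_summary(rows):
    """Group by (AREA_NAME, month); sum EST_CONST_COST and count non-null GEO_ID."""
    summary = defaultdict(lambda: {"total_cost": 0.0, "permit_count": 0})
    for row in rows:
        month = row["effective_date"].strftime("%Y-%m")  # the "add month column" step
        bucket = summary[(row["AREA_NAME"], month)]
        bucket["total_cost"] += row["EST_CONST_COST"]
        if row.get("GEO_ID") is not None:
            bucket["permit_count"] += 1
    return dict(summary)


# Made-up rows for illustration only; these are not real permit records.
bpa = [{"PERMIT_NUM": "00 000001", "REVISION_NUM": "00",
        "PERMIT_TYPE": "Type A", "BUILDER_NAME": None}]
bpc = [{"PERMIT_NUM": "00 000002", "REVISION_NUM": "01",
        "PERMIT_TYPE": "Type B", "BUILDER_NAME": "Builder X"}]

# Stage 1: hash each source row, then concatenate BPA and BPC into BP.
bp = [{**row, "permit_key": permit_key(row)} for row in bpa + bpc]

# Stage 4 input: BP as it would look after the Stage 3 join added AREA_NAME.
joined = [
    {"GEO_ID": 1, "AREA_NAME": "Area 1", "effective_date": date(2024, 1, 5), "EST_CONST_COST": 1000.0},
    {"GEO_ID": 2, "AREA_NAME": "Area 1", "effective_date": date(2024, 1, 20), "EST_CONST_COST": 500.0},
    {"GEO_ID": 3, "AREA_NAME": "Area 1", "effective_date": date(2024, 2, 1), "EST_CONST_COST": 250.0},
]
nbm = monthly_summary(joined)
```

In the `dbt` pipeline these steps would more likely be written as SQL models; the sketch only pins down the intended semantics of each node.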

## Optimizing transformations
There are several optimizations we can carry out throughout the pipeline.

### Connecting to data
1. We can download data from the CKAN portals using queries. By restricting our API downloads to the columns we actually use, we reduce the storage our database needs. If we make the column list configurable, we can switch columns on and off as needed. Here are the columns we need for each source table:

**BPA:**

`'PERMIT_NUM', 'GEO_ID', 'REVISION_NUM', 'PERMIT_TYPE', 'STATUS', 'BUILDER_NAME', 'APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE', 'EST_CONST_COST'`

**BPC:**

`'PERMIT_NUM', 'GEO_ID', 'REVISION_NUM', 'PERMIT_TYPE', 'STATUS', 'BUILDER_NAME', 'APPLICATION_DATE', 'ISSUED_DATE', 'COMPLETED_DATE', 'EST_CONST_COST'`

**AP:**

`'ADDRESS_POINT_ID', 'geometry'`

**N:**

`'AREA_ID', 'AREA_SHORT_CODE', 'AREA_NAME', 'geometry'`
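
One way to make the column selection configurable is to keep the lists above in a single mapping and build the CKAN request from it. The sketch below assumes the source tables are available through CKAN's DataStore `datastore_search` action, which accepts a comma-separated `fields` parameter. The base URL `https://ckan.example.org`, the resource id, and the names `SOURCE_COLUMNS` and `datastore_search_url` are placeholders for illustration, not values from this project.

```python
from urllib.parse import urlencode

# Columns shared by the active (BPA) and cleared (BPC) permit tables.
PERMIT_COLUMNS = [
    "PERMIT_NUM", "GEO_ID", "REVISION_NUM", "PERMIT_TYPE", "STATUS",
    "BUILDER_NAME", "APPLICATION_DATE", "ISSUED_DATE", "COMPLETED_DATE",
    "EST_CONST_COST",
]

# Single place to switch columns on and off per source table.
SOURCE_COLUMNS = {
    "BPA": PERMIT_COLUMNS,
    "BPC": PERMIT_COLUMNS,
    "AP": ["ADDRESS_POINT_ID", "geometry"],
    "N": ["AREA_ID", "AREA_SHORT_CODE", "AREA_NAME", "geometry"],
}


def datastore_search_url(base_url, resource_id, columns, limit=1000, offset=0):
    """Build a datastore_search URL that returns only the requested columns."""
    params = {
        "resource_id": resource_id,
        "fields": ",".join(columns),  # comma-separated list of fields to return
        "limit": limit,
        "offset": offset,
    }
    return f"{base_url.rstrip('/')}/api/3/action/datastore_search?{urlencode(params)}"


# Hypothetical portal URL and resource id; substitute the real values.
ap_url = datastore_search_url(
    "https://ckan.example.org", "hypothetical-resource-id", SOURCE_COLUMNS["AP"]
)
```

Paging through a full table would repeat the request with an increasing `offset` until no records come back.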