# Title

## Introduction

This process depends on upstream processes. See the "Prerequisites and usage notes" section below.

The workflow defined herein is identified as workflow ID #TBD in the [Data Team Master Document List](https://morpc1.sharepoint.com/:x:/s/GISteam/EfC4j3HhohZCrSZzxJdyt5cBFEqVD7zHick8ZW0INqgCYA?e=0WhrAI). References to document list identifiers are denoted by a number in brackets, e.g. [TBD].

## Process outline

## Prerequisites and usage notes

  - Outputs of one or more upstream workflows must be available at the indicated paths. Make sure that those outputs are up to date prior to running this script. 
  - This script includes several intentional RuntimeError instances that may be triggered to alert the user to conditions that may require their attention. If the script triggers one of these errors, review the error, verify that the condition is acceptable or resolve any issues, then proceed.
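
The intentional-error behavior described above might look roughly like the following. This is only a minimal sketch: the helper name `check_stale`, the `max_age_days` threshold, and the use of file modification time as the staleness test are all assumptions made for illustration, not part of this script. The `interrupt` argument plays the role of the `STALE_DATA_INTERRUPT` flag defined under "User-specified parameters".

```python
import os
import time
import warnings


def check_stale(path, max_age_days, interrupt=True):
    """Hypothetical helper: flag an input file that looks out of date.

    Returns False when the file is fresh. When it is older than
    max_age_days, raises RuntimeError if interrupt is True, otherwise
    issues a warning and returns True.
    """
    age_days = (time.time() - os.path.getmtime(path)) / 86400
    if age_days <= max_age_days:
        return False
    message = "{} is {:.1f} days old; verify that it is up to date.".format(path, age_days)
    if interrupt:
        # Stop execution so the user can review the condition.
        raise RuntimeError(message)
    # Otherwise let execution continue, but leave a visible trace.
    warnings.warn(message)
    return True
```

A call such as `check_stale(somePath, 30, interrupt=STALE_DATA_INTERRUPT)` would then either halt the notebook or warn and continue, depending on the flag.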

## Setup

### Import required packages

In [13]:
import importlib
import os
import morpc
import pickle
import sys
sys.path.append("./main")
import download_and_unzip
import build_database

### User-specified parameters

In [4]:
# When STALE_DATA_INTERRUPT == True, the script will produce a RuntimeError in certain situations where the input 
# data may be stale and updates might be required prior to running the script.  Otherwise, a warning will be generated 
# but script execution will continue.  Regardless of whether an error or warning occurs, be sure to verify the readiness 
# of all input data.
STALE_DATA_INTERRUPT = True

# This script may pull data from outputs of upstream workflows.  The locations of these outputs are specified by their path relative
# to a GitHub root directory. This is a single directory which is presumed to contain local working copies of MORPC GitHub repositories.
# Specify the path to the directory on your system where the local working copies are stored. By default, the GitHub root directory is
# assumed to be one level up from this script.
GITHUB_ROOT = "../"

# Specify the path to the directory where the input data is stored. Sometimes the data may be sourced from this location and sometimes 
# it may be sourced from elsewhere and archived here.
INPUT_DIR = "./input_data"

# Specify the path to the directory where the output data is stored. Typically it is not necessary to change this, and changing it for 
# established scripts may break other scripts that depend on outputs from this one.
OUTPUT_DIR = "./output_data"

# Specify the path to the directory where temporary outputs are stored.  Typically this is used to capture data or artifacts that are useful
# for understanding the internal workings of a script but which are not considered to be official outputs of the script and may not be 
# acceptable for use in downstream workflows.
TEMP_DIR = "./temp_data"

### Static parameters

### Define inputs

The following datasets are required by this notebook. They will be retrieved from the specified location and temporarily stored in INPUT_DIR.

#### Create input data directory

Create input data directory if it doesn't exist.

In [5]:
inputDir = os.path.normpath(INPUT_DIR)
if not os.path.exists(inputDir):
    os.makedirs(inputDir)

#### MORPC counties reference data [81]

Reference data for counties in the MORPC region will be loaded automatically as a morpc.countyLookup() object (see below).

#### Example input dataset [TBD]

### Define outputs

#### Create output data directory

Create output data directory if it doesn't exist.

In [6]:
outputDir = os.path.normpath(OUTPUT_DIR)
if not os.path.exists(outputDir):
    os.makedirs(outputDir)   

#### Create temporary data directory

Create temporary data directory if it doesn't exist.

In [7]:
tempDir = os.path.normpath(TEMP_DIR)
if not os.path.exists(tempDir):
    os.makedirs(tempDir)   
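
The three directory-creation cells above repeat the same check-then-create pattern. As a possible simplification, the pattern could be folded into one helper. The name `ensure_dir` is hypothetical and not part of this notebook; `os.makedirs(..., exist_ok=True)` is the standard-library call that makes the explicit existence check unnecessary.

```python
import os


def ensure_dir(path):
    """Hypothetical helper: normalize a path and create the directory if needed."""
    normalized = os.path.normpath(path)
    # exist_ok=True makes this a no-op when the directory already exists.
    os.makedirs(normalized, exist_ok=True)
    return normalized
```

With such a helper, each cell would reduce to a single line, e.g. `tempDir = ensure_dir(TEMP_DIR)`.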

#### Example output dataset [TBD]

## Prepare input data

### Load county reference data

### Load example input data

## Transform data

In [9]:
OUTPUT_DB_PATH = os.path.join(outputDir, "morpc-lodes-standardize.db")
FILELIST_FILENAME = "lodesFiles"
FILELIST_PATH = os.path.join(tempDir, "{}.pkl".format(FILELIST_FILENAME))

In [17]:
if os.path.exists(FILELIST_PATH):
    # Reuse the cached file list rather than scraping it again.
    print("Using existing file list at {}".format(FILELIST_PATH))
    with open(FILELIST_PATH, "rb") as f:
        lodesFiles = pickle.load(f)
else:
    print("Scraping file list from Census Bureau LEHD website.")
    lodesFiles = download_and_unzip.get_all_possible_files(save=True, savepath=tempDir, savename=FILELIST_FILENAME)

Using existing file list at temp_data\lodesFiles.pkl
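
The cell above follows a load-or-build caching pattern: reuse the pickled file list if it exists, otherwise build it and save it. A generic sketch of that pattern is shown here. The helper name `load_or_build` and the `build_fn` callback are assumptions for illustration; in this notebook the saving is handled by `get_all_possible_files` itself via its `save` argument.

```python
import os
import pickle


def load_or_build(cache_path, build_fn):
    """Hypothetical helper: return the cached object, building and caching it on a miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    # Cache miss: build the object, then persist it for the next run.
    result = build_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Note that a cache like this never expires on its own. Deleting the pickle file is the simplest way to force a fresh scrape.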


In [23]:
downloadDir = download_and_unzip.download_state_lodes_file(save_loc=tempDir,
                          st='oh',
                          links_dict=lodesFiles,
                          skip_existing=True)

start time: 23:19:08
oh + od...
downloading 220 od files
0.5% complete...
23.2% complete...
Skipped 50 files that already existed
45.9% complete...
Skipped 100 files that already existed
68.6% complete...
Skipped 150 files that already existed
91.4% complete...
Skipped 200 files that already existed
100.0% complete...
Skipped 219 files that already existed
oh + rac...
downloading 1098 rac files
0.1% complete...
Skipped 220 files that already existed
4.6% complete...
Skipped 270 files that already existed
9.2% complete...
Skipped 320 files that already existed
13.8% complete...
Skipped 370 files that already existed
18.3% complete...
Skipped 420 files that already existed
22.9% complete...
Skipped 470 files that already existed
27.4% complete...
Skipped 520 files that already existed
32.0% complete...
Skipped 570 files that already existed
36.5% complete...
Skipped 620 files that already existed
41.1% complete...
Skipped 670 files that already existed
45.6% complete...
Skipped 720 files
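
The "Skipped N files that already existed" messages reflect the `skip_existing=True` argument. How `download_state_lodes_file` implements this is not shown here, so the following is only a sketch of one way such a check could work: it splits a list of URLs into those still needing download and those whose target file is already present. The function name `split_existing` is hypothetical.

```python
import os
from urllib.parse import urlparse


def split_existing(urls, dest_dir):
    """Hypothetical helper: separate URLs into (to_download, skipped) lists."""
    to_download, skipped = [], []
    for url in urls:
        # Use the last path segment of the URL as the local file name.
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest_dir, filename)
        if os.path.exists(target):
            skipped.append(url)
        else:
            to_download.append(url)
    return to_download, skipped
```

A check based only on file existence cannot detect a partially downloaded file, which is one reason a later corruption check is useful.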

In [21]:
downloadDir = os.path.join(tempDir, "oh")

In [24]:
download_and_unzip.unzip_state_lodes_file(state_fold=downloadDir, skip_existing=True, delete_corrupt=True)

start time: 23:53:57
unzipping 1 cw files
100.0% complete...
unzipping 440 od files
0.2% complete...
11.6% complete...
Skipped 25 files that already existed
23.0% complete...
Skipped 50 files that already existed
34.3% complete...
Skipped 75 files that already existed
45.7% complete...
Skipped 100 files that already existed
57.0% complete...
Skipped 125 files that already existed
68.4% complete...
Skipped 150 files that already existed
79.8% complete...
Skipped 175 files that already existed
91.1% complete...
Skipped 200 files that already existed
100.0% complete...
Skipped 219 files that already existed
unzipping 1877 rac files
0.1% complete...
2.7% complete...
Skipped 25 files that already existed
5.4% complete...
Skipped 50 files that already existed
8.0% complete...
Skipped 75 files that already existed
10.7% complete...
Skipped 100 files that already existed
13.4% complete...
Skipped 125 files that already existed
16.0% complete...
Skipped 150 files that already existed
18.7% comp
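
The `skip_existing` and `delete_corrupt` arguments passed to `unzip_state_lodes_file` suggest per-file logic along the following lines. This is a sketch under stated assumptions, not the module's actual implementation: it assumes the archives are gzip-compressed (`.gz`) files, and the function name `decompress_gz` and its return values are hypothetical.

```python
import gzip
import os
import shutil
import zlib


def decompress_gz(src_path, skip_existing=True, delete_corrupt=False):
    """Hypothetical helper: decompress one .gz file next to itself.

    Returns "skipped", "ok", or "corrupt".
    """
    if not src_path.endswith(".gz"):
        raise ValueError("expected a .gz file: {}".format(src_path))
    dest_path = src_path[:-3]  # strip the ".gz" suffix
    if skip_existing and os.path.exists(dest_path):
        return "skipped"
    try:
        with gzip.open(src_path, "rb") as fin, open(dest_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    except (OSError, EOFError, zlib.error):
        # Invalid or truncated archive: discard the partial output.
        if os.path.exists(dest_path):
            os.remove(dest_path)
        if delete_corrupt:
            # Remove the bad archive so a later run can download it again.
            os.remove(src_path)
        return "corrupt"
    return "ok"
```

Deleting a corrupt archive pairs naturally with a skip-existing download step: the next download pass sees the file as missing and fetches it again.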

In [28]:
importlib.reload(build_database)

<module 'build_database' from 'C:\\Users\\aporr\\github\\morpc-lodes-standardize\\main\\build_database.py'>

In [29]:
# Load the downloaded data into a SpatiaLite database.
# Caution: build_db overwrites existing data.
build_database.build_db(spath=OUTPUT_DB_PATH)

building sqlite db...
could not create sqlite db at: output_data\morpc-lodes-standardize.db. The specified module could not be found.
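
The message "The specified module could not be found" may indicate that the SpatiaLite extension library could not be loaded, although the output above does not confirm this. One way to test that hypothesis outside of `build_db` is sketched below. The function name `spatialite_available` is hypothetical, and the extension name `mod_spatialite` is an assumption about how the library is named on the system.

```python
import sqlite3


def spatialite_available(extension_name="mod_spatialite"):
    """Hypothetical check: can sqlite3 load the named extension?"""
    conn = sqlite3.connect(":memory:")
    try:
        # Extension loading is disabled by default and may be unavailable
        # entirely if Python was built without support for it.
        conn.enable_load_extension(True)
        conn.load_extension(extension_name)
        return True
    except (AttributeError, sqlite3.OperationalError, sqlite3.NotSupportedError):
        return False
    finally:
        conn.close()
```

If this returns `False`, the extension library is likely missing or not on the system's library search path.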



In [None]:
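# Assumption: both loader functions are defined in the build_database module imported above.
# folder_path is the state download folder (downloadDir) and spath is the database path (OUTPUT_DB_PATH).
build_database.load_lodes_into_db(folder_path=downloadDir, spath=OUTPUT_DB_PATH, base_only=True)
# Note: load_geometries_into_db is essentially a custom function for Texas geometries and will need work for other states.
build_database.load_geometries_into_db(spath=OUTPUT_DB_PATH)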

## Export data

In [None]:
outputData.to_csv(OUTPUT_PATH, index=False)

In [None]:
outputResource = morpc.frictionless.create_resource(OUTPUT_FILENAME, 
    resourcePath=OUTPUT_RESOURCE_PATH,
    title="Enter a meaningful title for the output dataset", 
    description="Enter a more detailed description of the output dataset.",
    writeResource=True,
    validate=True
)
outputResource