# Quarterly Census of Employment and Wages (QCEW) Data Importer

## Introduction

This Jupyter Notebook is designed to automate the retrieval and processing of the Quarterly Census of Employment and Wages (QCEW) data, provided by the U.S. Bureau of Labor Statistics. The datasets retrieved and processed in this notebook contain detailed employment statistics, including the number of establishments, employment levels, total quarterly wages, and more, broken down by industry and ownership sectors for each county. This notebook specifically focuses on extracting data for a defined set of counties and years within a user specified lists, and either annual or quarterly data is collected based on user specification. See the Parameters section below to select the variables to be included in the extract.

### Process Outline

The process carried out by this workflow can be described as follows:
  - Users specify which years, quarters, and regions they are interested in analyzing through the script's parameter settings. 
  - The script accesses the QCEW data through the BLS' API. This interface automaticaly retrieves QCEW datasets based on specified parameters: year(s), quarter(s), and area code(s) (FIPS codes for counties).
  - Once downloaded, the data undergoes a series of processing steps to format it correctly for analysis. This includes trimming unnecessary characters and converting data types where necessary.
  - For each county within the specified region and for each year and/or quarter selected, the script generates a CSV file.
  - With each CSV file, a .schema.yaml file is generated. This YAML (YAML Ain't Markup Language) file provides a human-readable schema, defining the structure, data types, and descriptions of the CSV file's columns based on the Frictionless Data specifications. This schema ensures data consistency and aids in validation.
  - Similarly, a .resource.yaml file is created for each dataset, following the Frictionless Data Resource specification. This file includes metadata about the CSV file, such as its name, path, format, and the schema it conforms to, as well as a hash code for integrity checking. Additionally, it contains descriptive information about the dataset and references to its source.
  - The YAML files for schemas and resource descriptors are used to make data more usable by simplifying its publication and consumption. By adhering to Frictionless standards, the script ensures that the datasets it produces are easily shareable, validatable, and integrable into a wide range of data tools and platforms.

## Setup

### Import required packages

In [66]:
import csv
import os
import urllib
import pandas as pd
import hashlib
from io import StringIO
import urllib.request
import yaml
from frictionless import Resource, validate
import sys
sys.path.append(os.path.normpath("../morpc-common"))
import morpc
import morpcCensus

### Parameters

#### Static parameters

In [67]:
# Define input and output directories
INPUT_DIR = "./input_data"
OUTPUT_DIR = "./output_data"

# Define quarterly and annual schema directories
QUARTERLY_TABLE_SCHEMA_FILENAME = "morpc-qcew_quarterly.schema.yaml"
QUARTERLY_TABLE_SCHEMA_PATH = os.path.join(OUTPUT_DIR, QUARTERLY_TABLE_SCHEMA_FILENAME)
ANNUAL_TABLE_SCHEMA_FILENAME = "morpc-qcew_annual.schema.yaml"
ANNUAL_TABLE_SCHEMA_PATH = os.path.join(OUTPUT_DIR, ANNUAL_TABLE_SCHEMA_FILENAME)

# Documentation URL for the QCEW data - static because it points to the general documentation page
QCEW_TABLE_DOC_URL="https://www.bls.gov/cew/additional-resources/open-data/csv-data-slices.htm"

#### User-specified Parameters

In [68]:
# User-specified Parameters:
# These parameters are intended for the user to edit based on the specific analysis requirements.


# Define years for data collection
YEARS_PARAMETER = {
    "2015-2021": [2015, 2016, 2017, 2018, 2019, 2020, 2021],
    "2020":[2020],
}

# Define FIPS codes for data collection or use morpc.countyLookup for dynamic resolution (see below)
# Replace "15-County Region" with your desired region name and FIPS codes as needed.
CONST_REGIONS_CODES = {
    "15-County Region": ["39041", "39045", "39047", "39049", "39073", "39083", "39089", "39091", "39097", "39101", "39117", "39127", "39129", "39141", "39159"],
}

# Annual and Quarter codes for looping
# The user can specify whether to fetch annual or quarterly data.
ALL_QUARTERS = {
    "QUARTERS": ["1", "2", "3", "4"],  # For quarterly data analysis.
    "ANNUAL": ["a"],  # For annual data analysis, use 'a'.
}

### Define Outputs

Data, schema, and resource files for each county, year, and quaters selected

In [69]:
TEMP_TABLE_FILENAME = "morpc-qcew_{year}_{qtr}_{region}.csv"
TEMP_TABLE_PATH = os.path.join(OUTPUT_DIR, TEMP_TABLE_FILENAME)
TEMP_TABLE_RESOURCE_FILENAME = TEMP_TABLE_FILENAME.replace(".csv",".resource.yaml")
TEMP_TABLE_RESOURCE_PATH = os.path.join(OUTPUT_DIR, TEMP_TABLE_RESOURCE_FILENAME)
print("Data: {}".format(TEMP_TABLE_PATH))
print("Resource file: {}".format(TEMP_TABLE_RESOURCE_PATH))

Data: ./output_data\morpc-qcew_{year}_{qtr}_{region}.csv
Resource file: ./output_data\morpc-qcew_{year}_{qtr}_{region}.resource.yaml


## Getting input data

### Calling API and Creating Rows

In [70]:
def qcewGetAreaDataPandas(year, qtr, area):
    """
    Fetches QCEW area data for a specific year, quarter, and area using pandas.read_csv().
    Parameters:
        year (str): The year for the data.
        qtr (str): The quarter ('1', '2', '3', '4', or 'a' for annual).
        area (str): The area code (FIPS code).
    
    Returns:
        DataFrame: A pandas DataFrame containing the fetched data.
    """
    # Construct the URL based on function parameters
    url_template = "https://data.bls.gov/cew/data/api/{year}/{qtr}/area/{area}.csv"
    url = url_template.format(year=year, qtr=qtr.lower(), area=area.upper())

    try:
        # Attempt to read the CSV data directly into a pandas DataFrame
        df = pd.read_csv(url)
        return df
    except pd.errors.EmptyDataError:
        print("No data returned from the API.")
    except Exception as e:
        print(f"An error occurred: {e}")

### Cleaning Dataframe

In [71]:
def clean_data(df):
    """
    Clean the DataFrame by stripping quotes from strings if they exist.
    Assumes df is a pandas DataFrame.
    """
    # Strip '"' from column names if they exist
    df.columns = df.columns.str.replace('"', '')

    # Apply the same stripping for all string columns in the DataFrame
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].str.strip('"')
    
    return df

### Creating data, schema, and resource file paths

In [72]:
def createFiles(year, qtr, region):
    TEMP_TABLE_NAME = f"morpc-qcew_{year}_{qtr}_{region}.csv"
    TEMP_TABLE_PATH = os.path.join(OUTPUT_DIR, TEMP_TABLE_NAME)
    TEMP_TABLE_RESOURCE_NAME = f"morpc-qcew_{year}_{qtr}_{region}.resource.yaml"
    TEMP_TABLE_RESOURCE_PATH = os.path.join(OUTPUT_DIR, TEMP_TABLE_RESOURCE_NAME)
    print("Data: {}".format(TEMP_TABLE_PATH))
    print("Resource file: {}".format(TEMP_TABLE_RESOURCE_PATH))

    if qtr == "a":
        schemaPath = ANNUAL_TABLE_SCHEMA_PATH  # Use the annual data schema path
    else:
        schemaPath = QUARTERLY_TABLE_SCHEMA_PATH  # Use a different schema path for non-annual data

    
    # Return paths
    return {
        "data_path": TEMP_TABLE_PATH,
        "resource_path": TEMP_TABLE_RESOURCE_PATH,
        "schema_path":schemaPath,
    }  

### Saves data, schema, and resource files for each iteration

In [73]:
def exportDataResourceComponents(data, dataPath, schemaPath, resourcePath, year, region,qtr):

    # Make sure to use absolute paths for validation
    abs_dataPath = os.path.abspath(dataPath)
    
    # Load the schema from the YAML file
    with open(schemaPath, 'r') as file:
        schema = yaml.safe_load(file)

    acsResource = {
      "profile": "tabular-data-resource",
      "name": os.path.basename(dataPath).replace(".csv","").lower(),
      "path": os.path.basename(abs_dataPath),
      "title": f"Quarterly Census of Employment and Wages data {year}_{region}_{qtr}",
      "description": f"Quarterly Census of Employment and Wages data for {region} in {qtr} of {year}.",
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "bytes": os.path.getsize(dataPath),
      "hash": morpc.md5(dataPath),
      "rows": data.shape[0],
      "columns": data.shape[1],    
      "schema": schema,
      "sources": [{
          "title": "Quarterly Census of Employment and Wages, U.S. Bureau of Labor Statistics",
          "path": QCEW_TABLE_DOC_URL
      }]
    }

    # Create the resource object
    resource = Resource(acsResource)
    
    # Save the resource descriptor to a YAML file
    resource.to_yaml(resourcePath)
    print(f"Resource file written to {resourcePath}")
    
    # Validating the resource
    report = validate(resource)
    if report.valid:
        print("Resource is valid")
    else:
        print("Validation errors:", report.flatten(["code", "message"]))

### Iterating through each year, region, and qtr combination

In [74]:
def iterate_years_and_regions(years, regions, qtrs, schema_fields):
    for year in years:
        for region in regions:
            for qtr in qtrs:

                # Receive paths from createFiles
                paths = createFiles(year, qtr, region)

                # Getting data from API
                temp_data = qcewGetAreaDataPandas(str(year), qtr, region)

                # Cleaning data
                temp_df = clean_data(temp_data)

                print("Writing data to {}".format(paths["data_path"]))
                temp_df.to_csv(paths["data_path"], index=False)

                # Calling function to save data, schema, and resource files
                exportDataResourceComponents(temp_df, paths["data_path"], paths["schema_path"], paths["resource_path"], year, region,qtr)


###########################################################################
# For annual QCEW data: ALL_QUARTERS["ANNUAL"], SCHEMA_PARAMETER["ANNUAL"] 
# For quarterly QCEW data: ALL_QUARTERS["QUARTERS"], SCHEMA_PARAMETER["QUARTERS"] 
iterate_years_and_regions(YEARS_PARAMETER["2020"], CONST_REGIONS_CODES["15-County Region"], ALL_QUARTERS["QUARTERS"], SCHEMA_PARAMETER["QUARTERS"])
print("DONE!")

Data: ./output_data\morpc-qcew_2020_1_39041.csv
Resource file: ./output_data\morpc-qcew_2020_1_39041.resource.yaml
Writing data to ./output_data\morpc-qcew_2020_1_39041.csv
Resource file written to ./output_data\morpc-qcew_2020_1_39041.resource.yaml
Validation errors: [[None, "The data source could not be successfully loaded: [Errno 2] No such file or directory: 'morpc-qcew_2020_1_39041.csv'"]]
Data: ./output_data\morpc-qcew_2020_2_39041.csv
Resource file: ./output_data\morpc-qcew_2020_2_39041.resource.yaml


KeyboardInterrupt: 