# Quarterly Census of Employment and Wages (QCEW) Data Importer

## Introduction

This Jupyter Notebook automates the retrieval and processing of Quarterly Census of Employment and Wages (QCEW) data provided by the [U.S. Bureau of Labor Statistics](https://www.bls.gov/cew/). The datasets contain detailed employment statistics, including the number of establishments, employment levels, and total quarterly wages, broken down by industry and ownership sector for each county, as described in the [technical documentation](https://www.bls.gov/cew/additional-resources/open-data/csv-data-slices.htm). The notebook extracts data for user-specified lists of counties and years, and collects either annual or quarterly data depending on the user's selection. See the Parameters section below to define the scope of the extract.

### Process Outline

The process carried out by this workflow can be described as follows:
  - Users specify which years, quarters, and regions they are interested in analyzing through the script's parameter settings. 
  - The script accesses the QCEW data through the BLS API. This interface automatically retrieves QCEW datasets based on the specified parameters: year(s), quarter(s), and area code(s) (FIPS codes for counties).
  - Once downloaded, the data undergoes a series of processing steps to format it correctly for analysis. This includes trimming unnecessary characters and converting data types where necessary.
  - For each county within the specified region and for each year and/or quarter selected, the script generates a CSV file.
  - With each CSV file, a .schema.yaml file is generated. This YAML (YAML Ain't Markup Language) file provides a human-readable schema, defining the structure, data types, and descriptions of the CSV file's columns based on the [Frictionless Data specifications](https://frictionlessdata.io/). This schema ensures data consistency and aids in validation.
  - Similarly, a .resource.yaml file is created for each dataset, following the [Frictionless Data Resource specification](https://framework.frictionlessdata.io/docs/framework/resource.html). This file includes metadata about the CSV file, such as its name, path, format, and the schema it conforms to, as well as a hash code for integrity checking. Additionally, it contains descriptive information about the dataset and references to its source.
  - The YAML files for schemas and resource descriptors are used to make data more usable by simplifying its publication and consumption. By adhering to Frictionless standards, the script ensures that the datasets it produces are easily shareable, validatable, and integrable into a wide range of data tools and platforms.
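
To make the schema idea above concrete, here is a minimal sketch of a Table Schema style descriptor written as a Python dictionary, plus a small helper that casts one raw CSV row to the declared types. The three field names are QCEW column names chosen for illustration only; the actual `morpc-qcew-*.schema.yaml` files are not reproduced in this notebook and may differ. `cast_row` and `CASTERS` are hypothetical helpers, not part of the Frictionless library.

```python
# Illustrative schema with a small subset of fields; NOT the actual MORPC schema file.
example_schema = {
    "fields": [
        {"name": "area_fips", "type": "string", "description": "County FIPS code"},
        {"name": "year", "type": "integer", "description": "Reference year"},
        {"name": "qtrly_estabs", "type": "integer", "description": "Establishment count"},
    ]
}

# Simplified, assumed mapping from schema type names to Python casting functions.
CASTERS = {"string": str, "integer": int, "number": float}


def cast_row(row, schema):
    """Cast a dict of raw string values to the types declared in the schema."""
    typed = {}
    for field in schema["fields"]:
        caster = CASTERS[field["type"]]
        typed[field["name"]] = caster(row[field["name"]])
    return typed


# Made-up row values, used only to show the casting step.
raw_row = {"area_fips": "39049", "year": "2020", "qtrly_estabs": "100"}
typed_row = cast_row(raw_row, example_schema)
print(typed_row)
```

Keeping `area_fips` as a string rather than an integer preserves leading zeros in FIPS codes, which is one reason an explicit schema is useful.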

## Setup

### Import required packages

In [167]:
import os
import pandas as pd
import yaml
import frictionless
import sys
sys.path.append(os.path.normpath("../morpc-common"))
import morpc

### Parameters

#### User-specified Parameters

This section contains the user-specified parameters that define the scope of data retrieval: years, regions (via FIPS codes), and whether the data should be annual or quarterly.

In [168]:
# These parameters are intended for the user to edit based on the specific analysis requirements.

# Define years for data collection
YEARS_PARAMETER = ["2020"]                                  # List of years: 2020
# YEARS_PARAMETER = ["2018", "2019", "2020", "2021", "2022"]  # List of years: 2018-2022

# List of Census GEOIDs for counties to be included in data extract, default MORPC-15
# COUNTY_CODES = ["39041","39049"]                          # List of GEOIDs for a few specific counties
COUNTY_CODES = morpc.countyLookup().list_ids()              # GEOIDs for all counties in the MORPC 15-County Region


# Annual and Quarter codes for looping
# The user can switch between the two lines below to fetch either annual or quarterly data.
#QUARTERS = ["a"] # For annual data analysis.
QUARTERS = ["1", "2", "3", "4"] # For quarterly data analysis.

Loading data for MORPC 15-County region only


#### Static parameters

This section defines static parameters such as input and output directories, schema file names and paths, and a documentation URL.

In [169]:
# Define the output directory
OUTPUT_DIR = "./output_data"

# Define file names and paths for the local copies of the quarterly and annual schemas
QUARTERLY_TABLE_SCHEMA_FILENAME = "morpc-qcew-quarterly.schema.yaml"
QUARTERLY_TABLE_SCHEMA_PATH = os.path.join(OUTPUT_DIR, QUARTERLY_TABLE_SCHEMA_FILENAME)
ANNUAL_TABLE_SCHEMA_FILENAME = "morpc-qcew-annual.schema.yaml"
ANNUAL_TABLE_SCHEMA_PATH = os.path.join(OUTPUT_DIR, ANNUAL_TABLE_SCHEMA_FILENAME)

# Documentation URL for the QCEW data - static because it points to the general documentation page
QCEW_TABLE_DOC_URL="https://www.bls.gov/cew/additional-resources/open-data/csv-data-slices.htm"

### Define Inputs

This cell prints the locations of the existing schema YAML files, which are used when building the resource files.

In [170]:
print("Annual schema file stored in: {}".format(ANNUAL_TABLE_SCHEMA_PATH))
print("Quarterly schema file stored in: {}".format(QUARTERLY_TABLE_SCHEMA_PATH))

Annual schema file stored in: ./output_data\morpc-qcew-annual.schema.yaml
Quarterly schema file stored in: ./output_data\morpc-qcew-quarterly.schema.yaml


### Define Outputs

This section shows the naming convention and paths for the data files and resource files that the script generates and saves.

In [171]:
print("Data files will be saved as: {}".format(os.path.join(OUTPUT_DIR, "morpc-qcew-{year}-{qtr}-{region}.csv")))
print("Resource files will be saved as: {}".format(os.path.join(OUTPUT_DIR, "morpc-qcew-{year}-{qtr}-{region}.resource.yaml")))

Data files will be saved as: ./output_data\morpc-qcew-{year}-{qtr}-{region}.csv
Resource files will be saved as: ./output_data\morpc-qcew-{year}-{qtr}-{region}.resource.yaml


## Function definitions

Based on the user-set parameters, this section defines functions that retrieve the data as data frames, clean them, and save them as ".csv" files along with the associated ".yaml" schema and resource files.

### Calling API and Creating Rows

In [172]:
def qcewGetAreaDataPandas(year, qtr, area):
    """
    Fetches QCEW area data for a specific year, quarter, and area using pandas.read_csv().
    Parameters:
        year (str): The year for which to fetch data.
        qtr (str): The quarter ('1', '2', '3', '4', or 'a' for annual data).
        area (str): The area code (FIPS code) for which to fetch data.
    Returns:
        DataFrame: A pandas DataFrame containing the fetched data, or None if an error occurs.
    """
    # Construct the URL based on function parameters
    url_template = "https://data.bls.gov/cew/data/api/{year}/{qtr}/area/{area}.csv"
    url = url_template.format(year=year, qtr=qtr.lower(), area=area.upper())

    try:
        # Attempt to read the CSV data directly into a pandas DataFrame
        df = pd.read_csv(url)
        return df
    except pd.errors.EmptyDataError:
        print("No data returned from the API.")
    except Exception as e:
        print(f"An error occurred: {e}")
    # Reached only when an error occurred; callers check for None
    return None
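
As a complement to the function above, the following sketch separates URL construction from the download so that malformed arguments are caught before any request is made. The URL pattern is the one used in `qcewGetAreaDataPandas`; `build_qcew_url`, `VALID_QUARTERS`, and the specific validation rules (a four-digit year and a five-character area code) are assumptions introduced for illustration.

```python
from itertools import product

# URL pattern taken from qcewGetAreaDataPandas above.
URL_TEMPLATE = "https://data.bls.gov/cew/data/api/{year}/{qtr}/area/{area}.csv"
VALID_QUARTERS = {"1", "2", "3", "4", "a"}


def build_qcew_url(year, qtr, area):
    """Return the request URL, raising ValueError for malformed arguments."""
    year, qtr, area = str(year), str(qtr).lower(), str(area).upper()
    if not (year.isdigit() and len(year) == 4):
        raise ValueError(f"year must be four digits, got {year!r}")
    if qtr not in VALID_QUARTERS:
        raise ValueError(f"qtr must be one of {sorted(VALID_QUARTERS)}, got {qtr!r}")
    if len(area) != 5:  # assumed rule: area codes are five characters long
        raise ValueError(f"area must be a 5-character code, got {area!r}")
    return URL_TEMPLATE.format(year=year, qtr=qtr, area=area)


# Enumerate every (year, area, quarter) combination without nested loops.
urls = [
    build_qcew_url(y, q, a)
    for y, a, q in product(["2020"], ["39041", "39049"], ["1", "2"])
]
print(urls[0])
```

Failing fast on a bad quarter or area code gives a clearer message than the generic error printed when the download itself fails.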

### Cleaning Dataframe

In [173]:
def clean_data(df):
    """
    Cleans the input DataFrame by removing unnecessary characters from strings.
    Parameters:
        df (DataFrame): The DataFrame to clean.
    Returns:
        DataFrame: The cleaned DataFrame.
    """
    
    # Strip '"' from column names if they exist
    df.columns = df.columns.str.replace('"', '')

    # Apply the same stripping for all string columns in the DataFrame
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].str.strip('"')
    
    return df
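
The function above uses two different string operations: `replace('"', '')` on the column names and `strip('"')` on the cell values. The plain-Python sketch below shows how they differ; the sample strings are made up for illustration and do not come from the QCEW files.

```python
# Made-up sample values that mimic a quoted CSV header and quoted cell values.
raw_header = ['"area_fips"', '"own_code"']
raw_values = ['"39049"', '"say "hi""', 'plain']

# replace() removes every double quote, wherever it occurs in the string.
clean_header = [name.replace('"', '') for name in raw_header]

# strip() removes double quotes only from the two ends of each string,
# so a quote in the middle of a value is left in place.
clean_values = [value.strip('"') for value in raw_values]

print(clean_header)
print(clean_values)
```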

### Creating data, schema, and resource file paths

In [174]:
def generate_file_paths(year, qtr, region):
    """
    Creates and returns file paths for the data CSV, schema YAML, and resource YAML based on specified parameters.
    Parameters:
        year (str): The year of the data.
        qtr (str): The quarter or 'a' for annual data.
        region (str): The region code (FIPS code).
    Returns:
        dict: A dictionary containing paths for the data file, resource file, and schema file.
    """
    TEMP_TABLE_NAME = f"morpc-qcew-{year}-{qtr}-{region}.csv"
    TEMP_TABLE_PATH = os.path.join(OUTPUT_DIR, TEMP_TABLE_NAME)
    TEMP_TABLE_RESOURCE_NAME = f"morpc-qcew-{year}-{qtr}-{region}.resource.yaml"
    TEMP_TABLE_RESOURCE_PATH = os.path.join(OUTPUT_DIR, TEMP_TABLE_RESOURCE_NAME)


    if qtr == "a":
        schemaPath = ANNUAL_TABLE_SCHEMA_PATH  # Use the annual data schema path
    else:
        schemaPath = QUARTERLY_TABLE_SCHEMA_PATH  # Use the quarterly data schema path

    
    # Return paths
    return {
        "data_path": TEMP_TABLE_PATH,
        "resource_path": TEMP_TABLE_RESOURCE_PATH,
        "schema_path":schemaPath,
    }  
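
For comparison, here is a minimal sketch of the same naming convention built with `pathlib`, which avoids repeating the file stem and makes the file name easy to inspect independently of the operating system's path separator. `sketch_file_paths` and `SKETCH_OUTPUT_DIR` are hypothetical names used only in this example, and the schema path is omitted for brevity.

```python
from pathlib import Path

# Same directory name as the OUTPUT_DIR static parameter, under a separate name
# so that running this sketch does not overwrite the notebook's own variable.
SKETCH_OUTPUT_DIR = Path("./output_data")


def sketch_file_paths(year, qtr, region):
    """Hypothetical pathlib variant of generate_file_paths (schema path omitted)."""
    stem = f"morpc-qcew-{year}-{qtr}-{region}"
    return {
        "data_path": SKETCH_OUTPUT_DIR / f"{stem}.csv",
        "resource_path": SKETCH_OUTPUT_DIR / f"{stem}.resource.yaml",
    }


example_paths = sketch_file_paths("2020", "1", "39041")
print(example_paths["data_path"].name)
print(example_paths["resource_path"].name)
```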

### Saves data, schema, and resource files for each iteration

In [175]:
def format_quarter(qtr):
    """Format quarter for readability."""
    quarters = {"1": "Q1", "2": "Q2", "3": "Q3", "4": "Q4", "a" : "Annual"}
    return quarters.get(qtr, qtr)

def exportDataResourceComponents(data, dataPath, schemaPath, resourcePath, year, region, qtr, county_lookup):
    """
    Generates a resource YAML file for a DataFrame that has already been saved as a CSV file, then validates the resource.
    Parameters:
        data (DataFrame): The DataFrame that was saved to dataPath.
        dataPath (str): Path of the CSV file.
        schemaPath (str): Path of the existing schema YAML file.
        resourcePath (str): Path for the resource YAML file.
        year (str): Year of the data.
        region (str): Region code (FIPS code).
        qtr (str): Quarter or 'a' for annual data.
        county_lookup: Lookup object (from morpc.countyLookup) used to convert the GEOID to a county name.
    """
    
    # Resolve the absolute path of the data file
    abs_dataPath = os.path.abspath(dataPath)

    # Load the schema from the YAML file
    with open(schemaPath, 'r') as file:
        schema = yaml.safe_load(file)

    # Convert region from GEOID to county name
    region_name = county_lookup.get_name(region)

    # Format quarter for readability
    formatted_qtr = format_quarter(qtr)

    # Update title and description with the county name
    title = f"QCEW Data for {region_name} County, {year}, {formatted_qtr}"
    description = f"{formatted_qtr} employment and wage data for {region_name} County in {year}, sourced from the U.S. Bureau of Labor Statistics."

    
    acsResource = {
      "profile": "tabular-data-resource",
      "name": os.path.basename(dataPath).replace(".csv","").lower(),
      "path": os.path.basename(abs_dataPath),
      "title": title,
      "description": description,
      "format": "csv",
      "mediatype": "text/csv",
      "encoding": "utf-8",
      "bytes": os.path.getsize(dataPath),
      "hash": morpc.md5(dataPath),
      "rows": data.shape[0],
      "columns": data.shape[1],    
      "schema": schema,
      "sources": [{
          "title": "Quarterly Census of Employment and Wages, U.S. Bureau of Labor Statistics",
          "path": QCEW_TABLE_DOC_URL
      }]
    }

    # Create the resource object
    resource = frictionless.Resource(acsResource)

    print("Writing resource file to {}".format(resourcePath))
    dummy = resource.to_yaml(resourcePath)
    
    print("Using schema from {}".format(schemaPath))
    
    print("Validating resource on disk (including data and schema). This may take some time.")
    resourceOnDisk = frictionless.Resource(resourcePath)
    results = resourceOnDisk.validate()
    if results.valid:
        print("Resource is valid\n")
    else:
        print("ERROR: Resource is NOT valid. Errors follow.\n")
        print(results)
        raise RuntimeError("Resource validation failed for {}".format(resourcePath))
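
The resource descriptor above records the file size (`bytes`) and a checksum (`hash`) so that consumers can verify the CSV file has not changed. The implementation of `morpc.md5` is not shown in this notebook, so the sketch below uses the standard library's `hashlib` as a stand-in to illustrate how such integrity fields can be computed; `file_md5` is a hypothetical helper and the file content is made up.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstrate on a small temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    with open(sample_path, "wb") as handle:
        handle.write(b"area_fips,year\n39049,2020\n")
    integrity = {
        "bytes": os.path.getsize(sample_path),
        "hash": file_md5(sample_path),
    }

print(integrity)
```

Because both values are derived from the file on disk, they should be computed after the CSV file has been written, which is the order the main loop follows.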



## Main code for iterating through each year, region, and qtr combination for data, resource, and schema files

In [176]:
"""
Iterates over specified years, regions, and quarters, retrieves and processes data, and saves it along with schema and resource files.
Parameters:
    years (list): List of years to process, taken from YEARS_PARAMETER.
    regions (list): List of region codes (county FIPS codes) to process, taken from COUNTY_CODES.
    qtrs (list): List of quarters, taken from QUARTERS: ["1", "2", "3", "4"] for quarterly data or ["a"] for annual data.
"""
# Instantiate countyLookup with a scope relevant to your data
county_lookup = morpc.countyLookup(scope="ohio")  # Adjust scope as needed

for year in YEARS_PARAMETER:
    for region in COUNTY_CODES:
        for qtr in QUARTERS:

            # Receive paths from generate_file_paths
            paths = generate_file_paths(year, qtr, region)
            temp_data = qcewGetAreaDataPandas(str(year), qtr, region)
            
            if temp_data is not None:  # Check if data is not None before proceeding
                temp_df = clean_data(temp_data)
                print("Writing data to {}".format(paths["data_path"]))
                temp_df.to_csv(paths["data_path"], index=False)
                exportDataResourceComponents(temp_df, paths["data_path"], paths["schema_path"], paths["resource_path"], year, region, qtr, county_lookup)
            else:
                print(f"No data available for {year}, {qtr}, {region}")
                
print("DONE!")

Loading data for Ohio counties only
Writing data to ./output_data\morpc-qcew-2020-1-39041.csv


PermissionError: [Errno 13] Permission denied: './output_data\\morpc-qcew-2020-1-39041.csv'