# Domain Constraint Checks

In this notebook, we perform a series of domain constraint checks on the cleaned dataset `PPP-Data-up-to-150k-080820-HI-OpenRefine-PythonCleaned.csv` to confirm that the data adheres to the expected rules and standards.


In [8]:
# Import necessary libraries
import pandas as pd

## Load the data

First, we load the cleaned dataset `PPP-Data-up-to-150k-080820-HI-OpenRefine-PythonCleaned.csv` into a pandas DataFrame.


In [9]:
# Load the data
df = pd.read_csv('../../data/cleaned/PPP-Data-up-to-150k-080820-HI-OpenRefine-PythonCleaned.csv')

## Check `LoanAmount` column

We expect every value in the `LoanAmount` column to be a positive number. The column represents the amount of money loaned to each business, so a zero or negative amount would not make sense.


In [10]:
# Check that LoanAmount is a positive number
loan_amount_violations = df[df['LoanAmount'] <= 0].shape[0]
print(f"Number of `LoanAmount` column domain constraint violations: {loan_amount_violations}")

Number of `LoanAmount` column domain constraint violations: 0
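
The comparison above assumes `LoanAmount` was read as a numeric column. As a minimal sketch of a slightly more defensive variant, the snippet below coerces the column to numeric first and counts unparseable (or missing) values separately from non-positive ones. The `sample` DataFrame and its values are hypothetical stand-ins for illustration, not rows from the real dataset.

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset (illustration only)
sample = pd.DataFrame({"LoanAmount": [20833.0, 0.0, -150.0, "n/a", 5000.0]})

# Coerce to numeric; values that cannot be parsed become NaN
amounts = pd.to_numeric(sample["LoanAmount"], errors="coerce")

# NaN here means "missing or not a number"; NaN <= 0 evaluates to False,
# so these rows would otherwise pass the positivity check silently
non_numeric = int(amounts.isna().sum())
non_positive = int((amounts <= 0).sum())

print(f"Non-numeric or missing `LoanAmount` values: {non_numeric}")
print(f"Non-positive `LoanAmount` values: {non_positive}")
```

Reporting the two counts separately makes it clear whether a violation is a bad value or an unreadable one.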


## Check `Zip` column

We expect all values in the `Zip` column to be 5-digit ZIP codes.


In [11]:
# Check that the Zip column only contains 5-digit numbers
zip_violations = df[~df['Zip'].astype(str).str.match(r'^\d{5}$')].shape[0]
print(f"Number of `Zip` column domain constraint violations: {zip_violations}")

Number of `Zip` column domain constraint violations: 0
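
One caveat with casting to string after loading: if pandas infers `Zip` as an integer column, any leading zeros are lost before the check runs. That does not seem to affect this dataset, given the zero violations above, but the sketch below shows one way to guard against it by reading the column as text. The inline CSV and its values are hypothetical, and `str.fullmatch` is assumed to be available (pandas 1.1 or later).

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for the real file (illustration only)
csv_text = "Zip\n96813\n00501\n9681\n96813-1234\n"

# Read Zip as text so leading zeros survive the load
zips = pd.read_csv(io.StringIO(csv_text), dtype={"Zip": str})["Zip"]

# fullmatch requires the whole string to be exactly five digits;
# na=False treats missing values as violations rather than propagating NaN
is_valid = zips.str.fullmatch(r"\d{5}", na=False)
zip_violations = int((~is_valid).sum())

print(f"Number of `Zip` column domain constraint violations: {zip_violations}")
```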


## Check `JobsReported` column

We expect `JobsReported` to be a non-negative number. This field represents the number of jobs the business reported when applying for the loan, and a business cannot have a negative number of jobs.


In [12]:
# Check that JobsReported is a non-negative number
jobs_reported_violations = df[df['JobsReported'] < 0].shape[0]
print(f"Number of `JobsReported` column domain constraint violations: {jobs_reported_violations}")


Number of `JobsReported` column domain constraint violations: 0
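
Note that a comparison such as `< 0` evaluates to `False` for missing values, so rows with no `JobsReported` value are not counted as violations by this check. The sketch below, using a hypothetical `sample` DataFrame, reports negative and missing values side by side so that neither is overlooked.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset (illustration only)
sample = pd.DataFrame({"JobsReported": [12.0, 0.0, np.nan, -3.0, np.nan]})
jobs = sample["JobsReported"]

# NaN < 0 is False, so missing values are excluded from this count
negative = int((jobs < 0).sum())

# Count missing values explicitly so they are reported alongside the check
missing = int(jobs.isna().sum())

print(f"Negative `JobsReported` values: {negative}")
print(f"Missing `JobsReported` values: {missing}")
```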


## Check `DateApproved` column

We expect the `DateApproved` column values to follow the ISO 8601 timestamp format `YYYY-MM-DDTHH:MM:SSZ`, since we standardized these values during our OpenRefine operations to ensure consistency and ease of use.


In [13]:
# Check that DateApproved follows the ISO date format YYYY-MM-DDTHH:MM:SSZ
date_approved_violations = df[~df['DateApproved'].astype(str).str.match(
    r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$')].shape[0]
print(f"Number of `DateApproved` column domain constraint violations: {date_approved_violations}")

Number of `DateApproved` column domain constraint violations: 0
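
A regular expression of this kind checks only the shape of the string, so a value such as `2020-13-45T00:00:00Z` would pass even though it is not a real calendar date. As a hedged sketch of a stricter variant, the snippet below also parses the values with `pd.to_datetime` and counts those that fail. The `sample` Series is hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values (illustration only)
sample = pd.Series(
    ["2020-04-15T00:00:00Z", "2020-13-45T00:00:00Z", "04/15/2020"],
    name="DateApproved",
)

# Shape-only check, equivalent in spirit to the regex used above
pattern = r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z$"
shape_violations = int((~sample.str.match(pattern)).sum())

# Parsing check: values that are not real dates become NaT
parsed = pd.to_datetime(sample, format="%Y-%m-%dT%H:%M:%SZ", errors="coerce")
parse_violations = int(parsed.isna().sum())

print(f"Values failing the shape check: {shape_violations}")
print(f"Values failing the parsing check: {parse_violations}")
```

Here the second value has the right shape but an impossible month and day, so only the parsing check flags it.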


## Check Missing Values

Finally, we count the missing values in every column of the cleaned dataset.

Missing values in columns might indicate data entry errors, or they might occur when certain information is not applicable or not available. Regardless of the reasons, it's crucial to identify these missing values and decide how to handle them appropriately in the subsequent data analysis.

The likely cause depends on the column and its context. For instance, missing values in the `JobsReported` column could arise if some businesses did not report any jobs, perhaps because they have no employees or because they were not comfortable reporting this information.

For the `NAICSCode` and `NAICSTitle` columns, missing values might indicate that the business did not classify its industry under any of the NAICS codes. This could be due to the business owner not knowing their appropriate NAICS code, or the business not fitting neatly into any of the predefined NAICS categories.

In [14]:
# Check for missing values in all columns
missing_values = df.isna().sum()

print("Number of missing values in each column:")
print(missing_values)

Number of missing values in each column:
LoanAmount          0
City                0
State               0
Zip                 0
NAICSCode         105
BusinessType        0
RaceEthnicity       0
Gender              0
Veteran             0
NonProfit           0
JobsReported     2446
DateApproved        0
Lender              0
CD                  0
NAICSTitle        105
dtype: int64
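
To help decide how to handle these gaps, it can be useful to see each count as a share of all rows and to list only the affected columns. The following is a minimal sketch on a hypothetical `sample` DataFrame. The column names mirror the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the real dataset (illustration only)
sample = pd.DataFrame({
    "LoanAmount": [1000.0, 2500.0, 400.0, 9000.0],
    "JobsReported": [5.0, np.nan, np.nan, 1.0],
    "NAICSCode": [722511.0, np.nan, 541110.0, 812112.0],
})

# isna().mean() gives the fraction of missing rows per column
summary = pd.DataFrame({
    "missing": sample.isna().sum(),
    "percent": (sample.isna().mean() * 100).round(2),
})

# Keep only columns with at least one missing value, largest first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)

print(summary)
```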
