In [1]:
import glob
import pandas as pd
from src.static import DATA_DIR

# Data Completeness
Checking the data for completeness takes a few methodical steps to confirm that no critical fields are missing.

In [2]:
files = glob.glob(f'{DATA_DIR}/*.csv')
files

['/home/mango/data/CS504/raw_loan_data.csv',
 '/home/mango/data/CS504/unit_data.csv',
 '/home/mango/data/CS504/loan_data.csv',
 '/home/mango/data/CS504/raw_unit_data.csv']

In [3]:
loan_data = pd.read_csv(files[1])
unit_data = pd.read_csv(files[2])
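
`glob.glob` returns paths in an arbitrary, filesystem-dependent order, so positional indices such as `files[1]` can point at a different file on another machine or run. A minimal sketch of selecting files by name instead is shown here. The helper `index_csvs` and the temporary demo directory are hypothetical illustrations, not part of the project code.

```python
from pathlib import Path
import tempfile

import pandas as pd


def index_csvs(data_dir):
    """Map each CSV's stem (file name without '.csv') to its path."""
    return {path.stem: path for path in sorted(Path(data_dir).glob("*.csv"))}


# Demo on a throwaway directory standing in for DATA_DIR (assumption).
with tempfile.TemporaryDirectory() as tmp:
    for stem in ("loan_data", "unit_data"):
        pd.DataFrame({"year": [2023], "record_number": [1]}).to_csv(
            Path(tmp) / f"{stem}.csv", index=False
        )

    csv_paths = index_csvs(tmp)
    # Each frame is now chosen by file name, not by list position.
    demo_loans = pd.read_csv(csv_paths["loan_data"])
    demo_units = pd.read_csv(csv_paths["unit_data"])
```

With the real data, the same lookup would be `index_csvs(DATA_DIR)["loan_data"]`, assuming the file names listed above.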


## Missing values check

In [4]:
def null_pct(df: pd.DataFrame, col: str) -> str:
    """Return the share of missing values in `col` as a percentage string."""
    return f'{round(df[col].isna().sum() * 100 / df.shape[0], 2)}%'

In [5]:
def null_report(df, **kwargs):
    print(f"{kwargs.get('name', 'DF')+' '}Data Completeness'")
    for col in df.columns:
        print(f'{col.capitalize()}: {null_pct(df, col)}')
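
As an alternative to looping over columns, the same percentages can be computed for every column at once. The sketch below is illustrative only: `null_pct_table` and the toy frame are hypothetical, and it returns numbers rather than formatted strings.

```python
import numpy as np
import pandas as pd


def null_pct_table(df):
    """Return the percentage of missing values per column, rounded to 2 places."""
    # isna() gives booleans; their column mean is the fraction of missing values.
    return (df.isna().mean() * 100).round(2)


# Toy data (assumption): one of four 'year' values is missing.
toy = pd.DataFrame({"year": [2021, 2022, np.nan, 2023], "num_units": [1, 2, 3, 4]})
pcts = null_pct_table(toy)
```

Returning a `Series` keeps the result sortable and filterable, for example `pcts[pcts > 0]` to list only the incomplete columns.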

With so few null values in the data, we can examine the nulls by hand.

In [6]:
nulls = loan_data.loc[loan_data.enterprise_flag.isna()]
nulls.shape

(0, 7)

It looks like there are 32 records missing nearly every field, so let's drop them. (The `(0, 7)` shape shown above likely reflects a re-run of the cell after the drop.)

In [7]:
loan_data.dropna(axis=0, inplace=True)
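
Before dropping rows, it can help to see how many fields each affected row is missing and how many rows the drop removes. The following is a minimal sketch on toy data. The frame and its values are made up for illustration, and only the column names are borrowed from the report.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): the second record is missing most of its fields.
toy = pd.DataFrame(
    {
        "record_number": [1, 2, 3],
        "num_bedrooms": [2, np.nan, 1],
        "num_units": [10, np.nan, 4],
    }
)

# Number of missing fields in each row.
missing_per_row = toy.isna().sum(axis=1)

# Rows with at least one missing value: exactly the rows dropna(axis=0) removes.
to_drop = toy[missing_per_row > 0]

cleaned = toy.dropna(axis=0)
rows_removed = len(toy) - len(cleaned)
```

Comparing `rows_removed` with the number of rows expected to be incomplete is a cheap guard against dropping more data than intended.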

In [8]:
null_report(loan_data, **{'name':'Loan Data'})

Loan Data Data Completeness'
Year: 0.0%
Enterprise_flag: 0.0%
Record_number: 0.0%
Num_bedrooms: 0.0%
Num_units: 0.0%
Affordability_level: 0.0%
Tenant_income_ind: 0.0%


In [9]:
null_report(unit_data, **{'name':'Unit Data'})

Unit Data Data Completeness'
Year: 0.0%
Enterprise_flag: 0.0%
Record_number: 0.0%
Census_tract_2020: 0.0%
Tract_income_ratio: 0.0%
Affordability_cat: 0.0%
Date_of_mortgage_note: 0.0%
Purpose_of_loan: 0.0%
Type_of_seller: 0.0%
Federal_guarantee: 0.0%
Tot_num_units: 0.0%


# Uniqueness
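A quick check confirms whether there are any duplicate records. Only `False` appears in the counts below, so neither table contains fully duplicated rows.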


In [10]:
loan_data.duplicated().value_counts()

False    895729
Name: count, dtype: int64

In [12]:
unit_data.duplicated().value_counts()

False    107310
Name: count, dtype: int64
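
`duplicated()` with no arguments only flags rows that match on every column. If a subset of columns is supposed to identify a record, a second check on that subset can surface rows that share a key but differ elsewhere. The sketch below uses toy data, and the choice of key columns is an assumption for illustration, not a statement about the real schema.

```python
import pandas as pd

# Toy data (assumption): the first two rows share a key but differ in num_units.
toy = pd.DataFrame(
    {
        "year": [2023, 2023, 2023],
        "enterprise_flag": [1, 1, 2],
        "record_number": [100, 100, 100],
        "num_units": [10, 12, 10],
    }
)

# Full-row duplicates: none here, because num_units differs.
full_dupes = int(toy.duplicated().sum())

# Hypothetical key columns; keep=False marks every row in a duplicated group.
key_cols = ["year", "enterprise_flag", "record_number"]
key_dupes = toy[toy.duplicated(subset=key_cols, keep=False)]
```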

# Accuracy
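We need to check that every column covered by a data dictionary contains only the values that dictionary defines. The loops below print a message for each unexpected value they find.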


In [15]:
from src.datamappers import unit_data_dict, loan_data_dict

In [None]:
for col in unit_data:
    if col in unit_data_dict:
        # Values the data dictionary allows for this column
        check_vals = list(unit_data_dict[col].values())
        for val in unit_data[col].unique().tolist():
            if val not in check_vals:
                print(f"Unexpected value {val} in column {col}")

In [29]:
for col in loan_data:
    if col in loan_data_dict:
        # Values the data dictionary allows for this column
        check_vals = list(loan_data_dict[col].values())
        for val in loan_data[col].unique().tolist():
            if val not in check_vals:
                print(f"Unexpected value {val} in column {col}")
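
Since the same check runs on both tables, it could be wrapped in one function that returns its findings instead of printing them. This is a sketch under assumptions: `unexpected_values` is a hypothetical helper, and the toy dictionary and labels are invented. It also assumes, as the loops above do, that the allowed values are the dictionary's values rather than its keys.

```python
import pandas as pd


def unexpected_values(df, value_maps):
    """Return {column: [values not allowed by the column's mapping]}."""
    found = {}
    for col, mapping in value_maps.items():
        if col not in df.columns:
            continue  # dictionary entry with no matching column
        allowed = list(mapping.values())
        bad = [val for val in df[col].unique().tolist() if val not in allowed]
        if bad:
            found[col] = bad
    return found


# Toy inputs (assumption): 'Unknown' is not defined in the dictionary.
toy = pd.DataFrame({"purpose_of_loan": ["Purchase", "Refinance", "Unknown"]})
toy_dict = {"purpose_of_loan": {1: "Purchase", 2: "Refinance"}}

result = unexpected_values(toy, toy_dict)
```

An empty result means every checked column conforms, which is easy to assert in a pipeline test.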

# Atomicity
Atomicity in this context means that data transformations and updates are applied completely and correctly. During data cleaning, atomicity ensures that each operation (such as removing duplicates, filling missing values, or correcting errors) is fully applied across all relevant records. If an operation fails partway through, the data reverts to its previous state, which maintains data integrity. The data pipeline scripts provide this behavior by packaging all necessary operations into one executable that fails unless every component operation runs successfully. In other words, if the logic of the data pipeline script is sound, the atomicity of the data is intact.
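
One way to picture this all-or-nothing behavior is to run every cleaning step on a copy and hand back the result only if no step raises. The sketch below is not the project's pipeline script. `run_atomically` and the step functions are hypothetical names used for illustration.

```python
import pandas as pd


def run_atomically(df, steps):
    """Apply every step to a copy; return the result only if all steps succeed."""
    working = df.copy()  # the caller's frame is never modified
    for step in steps:
        working = step(working)  # an exception here aborts the whole run
    return working


def drop_nulls(df):
    return df.dropna(axis=0)


def drop_dupes(df):
    return df.drop_duplicates()


def failing_step(df):
    raise ValueError("simulated failure")


original = pd.DataFrame({"a": [1, 1, None]})

# All steps succeed: the cleaned frame is returned.
cleaned = run_atomically(original, [drop_nulls, drop_dupes])

# One step fails: the assignment never happens, so 'original' is unchanged.
try:
    original = run_atomically(original, [drop_nulls, failing_step])
except ValueError:
    pass
```

Working on a copy costs extra memory, but it means a failure can never leave the input half-cleaned.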

# Conformity
Visual inspection of the data dictionaries shows that the data encodings have changed little over time. Every column in the 2023 data has a consistent definition across all years of the data. However, one additional column appears in earlier versions of the data but not in later ones. The data therefore has a slight conformity issue, which will have to be addressed in the data pipeline.
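
The visual inspection could be supplemented by comparing each year's column list against a reference year. The following is a minimal sketch. `column_drift` is a hypothetical helper, and the column lists, including `legacy_col`, are placeholders rather than the real schema.

```python
def column_drift(columns_by_year, reference_year):
    """Report columns each year has beyond, or lacks relative to, the reference year."""
    reference = set(columns_by_year[reference_year])
    drift = {}
    for year, cols in columns_by_year.items():
        extra = sorted(set(cols) - reference)    # present this year only
        missing = sorted(reference - set(cols))  # present in the reference only
        if extra or missing:
            drift[year] = {"extra": extra, "missing": missing}
    return drift


# Placeholder schemas (assumption): an older year carries one extra column.
columns_by_year = {
    2021: ["year", "record_number", "legacy_col"],
    2023: ["year", "record_number"],
}

drift = column_drift(columns_by_year, reference_year=2023)
```

In practice the column lists could come from reading only the header of each yearly file, for example with `pd.read_csv(path, nrows=0).columns`.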