# 01 – Data Quality Assessment

This notebook analyses the raw credit application dataset for data quality issues across the following dimensions:

- Completeness
- Consistency
- Validity
- Accuracy

All issues are quantified (counts and percentages) and mapped to governance implications.

## 1. Data Loading & Structural Inspection
- Load raw JSON
- Inspect nested structure
- Examine column names and data types

## 2. Completeness Analysis
- Missing values per column
- Incomplete nested objects
- % affected records

## 3. Consistency & Type Validation
- Data type mismatches
- Inconsistent categorical encoding (e.g., gender formats)
- Date format inconsistencies

## 4. Validity Checks
- Impossible values (e.g., negative income, negative credit history)
- Logical inconsistencies (e.g., interest rate assigned when rejected)

## 5. Duplicate Record Detection

## 6. Remediation Strategy
- Cleaning logic
- Standardisation decisions
- Governance implications

---

## Setup & Connection

In [30]:
# Install the MongoDB driver (skip if pymongo is already available in the environment)
!pip install pymongo



In [37]:
import json
from pathlib import Path
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

# Establish connection to the local MongoDB instance
client = MongoClient('mongodb://localhost:27017/')
db = client['novacred']
collection = db['credit_applications']

# Reset the collection to ensure an idempotent and clean baseline for the audit
collection.drop()
print(f"Connected to: {db.name}. Dropped existing collection for fresh ingestion.")

# Determine the canonical data path based on the current working directory
cwd = Path.cwd()
if cwd.name == 'notebooks':
    data_path = cwd.parent / 'data' / 'raw_credit_applications.json'
else:
    data_path = cwd / 'data' / 'raw_credit_applications.json'

# Validate the existence of the source file before proceeding
if not data_path.exists():
    raise FileNotFoundError(f"CRITICAL: Could not locate data file at: {data_path}")

# Load the raw application data from the JSON file
with data_path.open('r') as file:
    raw_data = json.load(file)

# Execute the fault-tolerant ingestion loop
successful_inserts = 0
rescued_duplicates = 0

for doc in raw_data:
    try:
        # Attempt to insert the document into the collection
        collection.insert_one(doc)
        successful_inserts += 1
    except DuplicateKeyError:
        # Append a suffix to the primary key to rescue colliding records for audit purposes
        doc['_id'] = f"{doc['_id']}_duplicate"
        collection.insert_one(doc)
        rescued_duplicates += 1

# Output the final ingestion metrics to verify the audit baseline
print("\n--- Ingestion Audit Summary ---")
print(f"Standard Records Inserted: {successful_inserts}")
print(f"Duplicate IDs Rescued:    {rescued_duplicates}")
print(f"Total Audit Baseline:      {collection.count_documents({})} documents")

Connected to: novacred. Dropped existing collection for fresh ingestion.

--- Ingestion Audit Summary ---
Standard Records Inserted: 500
Duplicate IDs Rescued:    2
Total Audit Baseline:      502 documents


**Ingestion Audit Note:**
 
MongoDB automatically rejected 2 records (`app_042`, `app_001`) during import due to E11000 duplicate key errors on the `_id` field. To prevent data loss and keep every record auditable, these records were rescued by appending a `_duplicate` suffix to their `_id`. The baseline record count is established at 502 documents.
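
The collisions described in the note can also be previewed before any insert is attempted. The sketch below is a minimal illustration, assuming the raw file is a list of dictionaries that each carry an `_id` key. The helper `find_duplicate_ids` and the sample documents are hypothetical and are not part of the original notebook.

```python
from collections import Counter


def find_duplicate_ids(docs):
    """Return a dict mapping each repeated _id to the number of times it appears."""
    counts = Counter(doc["_id"] for doc in docs)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical sample that mimics the shape of the raw application records
sample_docs = [
    {"_id": "app_001"},
    {"_id": "app_042"},
    {"_id": "app_001", "notes": "DUPLICATE_ENTRY_ERROR"},
]

duplicate_ids = find_duplicate_ids(sample_docs)
print(duplicate_ids)
```

Running a check like this on `raw_data` before ingestion would list the colliding identifiers up front, which may be easier to document than counting `DuplicateKeyError` exceptions during the insert loop.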

---

## Quick Data Overview

In [40]:
# View a sample document from the collection
from pprint import pprint

sample = collection.find_one()
pprint(sample)

{'_id': 'app_200',
 'applicant_info': {'date_of_birth': '2001-03-09',
                    'email': 'jerry.smith17@hotmail.com',
                    'full_name': 'Jerry Smith',
                    'gender': 'Male',
                    'ip_address': '192.168.48.155',
                    'ssn': '596-64-4340',
                    'zip_code': '10036'},
 'decision': {'loan_approved': False,
              'rejection_reason': 'algorithm_risk_score'},
 'financials': {'annual_income': 73000,
                'credit_history_months': 23,
                'debt_to_income': 0.2,
                'savings_balance': 31212},
 'processing_timestamp': '2024-01-15T00:00:00Z',
 'spending_behavior': [{'amount': 480, 'category': 'Shopping'},
                       {'amount': 790, 'category': 'Rent'},
                       {'amount': 247, 'category': 'Alcohol'}]}


---

## Data Quality Dimension 1: Uniqueness

The **Uniqueness Audit** identified **11 violating records** in the 502-record dataset, giving an initial health score of **97.81%** (491 of 502) and a clean baseline of **491 records**. The findings comprise **two technical duplicates** (**Joseph Lopez**, flagged "**RESUBMISSION**", and **Stephanie Nguyen**, flagged "**DUPLICATE_ENTRY_ERROR**"), **four records in identity collisions** on shared SSNs (e.g., Martinez/Wilson), and **five completeness failures** where SSNs were missing. All **11 records** were quarantined to meet the **AI Act Art. 10** requirements for data accuracy and uniqueness, so that no redundant or ambiguous entries compromise the subsequent fairness audit.

In [53]:
# Define the aggregation pipeline to find duplicate SSNs - each person should appear only once!
# This query specifically addresses the Uniqueness and Completeness dimensions for the 2026 Audit.
pipeline_duplicates = [
    {
        # Group by the SSN identifier to detect identity collisions and missing values
        "$group": {
            "_id": "$applicant_info.ssn",
            "count": {"$sum": 1},
            # Map names and IDs into a records array to maintain a granular audit trail
            "records": {
                "$push": {
                    "name": "$applicant_info.full_name",
                    "id": "$_id"
                }
            }
        }
    },
    {
        # Filter for groups that appear more than once to isolate potential data quality violations
        "$match": {
            "count": {"$gt": 1}
        }
    },
    {
        # Sort by frequency to prioritize the investigation of high-risk identifier collisions
        "$sort": {"count": -1}
    }
]

# Execute the aggregation pipeline against the credit_applications collection
duplicates = list(collection.aggregate(pipeline_duplicates))

# Segregate results into Missing Identifiers and Duplicate Identifiers for focused governance reporting
missing_ssn = [dup for dup in duplicates if dup['_id'] is None]
actual_duplicates = [dup for dup in duplicates if dup['_id'] is not None]

# Output findings for missing identifiers to address the Completeness requirement of the AI Act
if missing_ssn:
    print("--- Missing Identifiers (Completeness Issues) ---")
    for item in missing_ssn:
        print(f" Found {item['count']} records without an SSN:")
        for record in item['records']:
            # List each specific record with its ID for technical validation
            print(f"  - {record['name']} (ID: {record['id']})")
        print()

# Output findings for duplicate identifiers to address the Uniqueness requirement for high-risk AI data
if actual_duplicates:
    print("--- Duplicate Identifiers (Uniqueness Issues) ---")
    print(f"Found {len(actual_duplicates)} duplicate SSNs (Identity Collisions):")
    for dup in actual_duplicates:
        print(f" SSN: {dup['_id']} - Count: {dup['count']}")
        for record in dup['records']:
            # Document individual applicants sharing a single unique identifier
            print(f"  - {record['name']} (ID: {record['id']})")
        print()

--- Missing Identifiers (Completeness Issues) ---
 Found 5 records without an SSN:
  - Margaret Williams (ID: app_075)
  - Carolyn Martin (ID: app_120)
  - Larry Williams (ID: app_268)
  - Stephanie Nguyen (ID: app_001_duplicate)
  - Brandon Moore (ID: app_165)

--- Duplicate Identifiers (Uniqueness Issues) ---
Found 3 duplicate SSNs (Identity Collisions):
 SSN: 780-24-9300 - Count: 2
  - Susan Martinez (ID: app_088)
  - Gary Wilson (ID: app_016)

 SSN: 652-70-5530 - Count: 2
  - Joseph Lopez (ID: app_042)
  - Joseph Lopez (ID: app_042_duplicate)

 SSN: 937-72-8731 - Count: 2
  - Sandra Smith (ID: app_101)
  - Samuel Hill (ID: app_234)



The uniqueness audit of the **502-record baseline** identifies **11 violating documents** (2.19% of the dataset) across **four distinct failure groups**. These comprise **five applicants with missing SSN identifiers**, a critical **completeness failure**, and **six records** involved in **identity collisions**, where **three SSNs** are each shared by two entries. The collisions range from an **exact system duplicate** (Joseph Lopez) to **conflicting identities** (Susan Martinez and Gary Wilson; Sandra Smith and Samuel Hill). They may point to **system integration errors** or **potential fraud attempts**, and in either case they fall short of the **AI Act Art. 10** data quality standards.
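
The figures in this summary follow directly from the group sizes printed by the aggregation. The short calculation below is only a worked check of that arithmetic; the variable names are illustrative, and the counts are copied from the output above.

```python
# Group sizes copied from the aggregation output above
missing_ssn_records = 5            # one group of records without an SSN
collision_group_sizes = [2, 2, 2]  # three SSNs, each shared by two records
baseline_total = 502

violating_documents = missing_ssn_records + sum(collision_group_sizes)
failure_groups = 1 + len(collision_group_sizes)
violation_rate = violating_documents / baseline_total * 100
uniqueness_score = 100 - violation_rate

print(f"Violating documents: {violating_documents}")
print(f"Failure groups:      {failure_groups}")
print(f"Violation rate:      {violation_rate:.2f}%")
print(f"Uniqueness score:    {uniqueness_score:.2f}%")
```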

In [54]:
# Pipeline to find name collisions and their associated SSNs for secondary discovery
pipeline_name_discovery = [
    {
        # Group by full name to identify homonym groups or missing-link duplicates
        "$group": {
            "_id": "$applicant_info.full_name",
            "count": {"$sum": 1},
            "record_details": {
                "$push": {
                    "id": "$_id",
                    "ssn": "$applicant_info.ssn"
                }
            }
        }
    },
    {
        # Filter for names appearing more than once to detect potential duplicates
        "$match": {"count": {"$gt": 1}}
    },
    {
        # Sort by frequency to prioritize the investigation of common name groups
        "$sort": {"count": -1}
    }
]

# Execute discovery against the current audit collection
name_collisions = list(collection.aggregate(pipeline_name_discovery))

print(f"--- Name-Based Collision Discovery ({len(name_collisions)} groups identified) ---\n")
for group in name_collisions:
    print(f"Name: {group['_id']} | Total Records: {group['count']}")
    for record in group['record_details']:
        ssn_value = record.get('ssn', 'MISSING')
        print(f"  - Record ID: {record['id']} | SSN: {ssn_value}")
    print("-" * 50) # Visual separator for cleaner reporting

--- Name-Based Collision Discovery (26 groups identified) ---

Name: Susan Flores | Total Records: 3
  - Record ID: app_448 | SSN: 383-48-9078
  - Record ID: app_073 | SSN: 470-84-5617
  - Record ID: app_226 | SSN: 817-96-6416
--------------------------------------------------
Name: Amy Flores | Total Records: 2
  - Record ID: app_212 | SSN: 654-72-8456
  - Record ID: app_146 | SSN: 577-59-1479
--------------------------------------------------
Name: Rachel King | Total Records: 2
  - Record ID: app_193 | SSN: 852-24-1787
  - Record ID: app_418 | SSN: 107-92-5280
--------------------------------------------------
Name: Shirley Davis | Total Records: 2
  - Record ID: app_148 | SSN: 384-17-7019
  - Record ID: app_219 | SSN: 994-53-6088
--------------------------------------------------
Name: James Rivera | Total Records: 2
  - Record ID: app_465 | SSN: 853-96-1952
  - Record ID: app_498 | SSN: 942-34-6834
--------------------------------------------------
Name: Jerry Nguyen | Total Recor

The secondary discovery phase identified **26 groups** where names appear multiple times, totaling 53 records. Cross-referencing these with unique identifiers reveals that **23 groups** are **unique individuals** sharing common names (**homonyms**), such as Susan Flores, who must remain in the dataset. Only one group is a **confirmed system duplicate** (Joseph Lopez), while two others (Stephanie Nguyen, Brandon Moore) involve records with **missing identifiers**. Name matching alone would therefore have wrongly flagged 23 legitimate groups, so deduplication must key on the SSN to avoid "**Homonym Bias**" and comply with **AI Act Art. 10** standards.
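
The cross-referencing step can be expressed as a small rule over the SSNs in each name group. The sketch below is one possible formulation, not the notebook's own method: the function `classify_name_group` and its three labels are hypothetical, while the sample records reuse IDs and SSNs from the outputs above.

```python
def classify_name_group(records):
    """Classify one name group by comparing the SSNs of its records.

    records: list of dicts shaped like the 'record_details' entries above,
    i.e. {"id": ..., "ssn": ...}, where "ssn" may be absent or None.
    """
    ssns = [record.get("ssn") for record in records]
    if any(ssn is None for ssn in ssns):
        return "missing_identifier"  # cannot be resolved by SSN alone
    if len(set(ssns)) < len(ssns):
        return "duplicate"           # same name and same SSN
    return "homonym"                 # same name, different people


susan_flores = [
    {"id": "app_448", "ssn": "383-48-9078"},
    {"id": "app_073", "ssn": "470-84-5617"},
    {"id": "app_226", "ssn": "817-96-6416"},
]
joseph_lopez = [
    {"id": "app_042", "ssn": "652-70-5530"},
    {"id": "app_042_duplicate", "ssn": "652-70-5530"},
]
stephanie_nguyen = [
    {"id": "app_001", "ssn": "427-90-1892"},
    {"id": "app_001_duplicate"},  # SSN missing on the rescued copy
]

for label, group in [("Susan Flores", susan_flores),
                     ("Joseph Lopez", joseph_lopez),
                     ("Stephanie Nguyen", stephanie_nguyen)]:
    print(f"{label}: {classify_name_group(group)}")
```

Groups labelled `missing_identifier` still need manual review, since a missing SSN neither confirms nor rules out a duplicate.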

In [57]:
# IDs inspected in the Uniqueness/Completeness audit; both copies of each rescued
# duplicate pair are listed so that their metadata can be compared side by side
all_violating_ids = [
    "app_042", "app_042_duplicate", # Joseph Lopez (Confirmed Duplicate)
    "app_001", "app_001_duplicate", # Stephanie Nguyen (Incomplete Link)
    "app_088", "app_016",           # Martinez/Wilson (SSN Collision)
    "app_101", "app_234",           # Smith/Hill (SSN Collision)
    "app_075", "app_120", "app_268",# Williams/Martin/Williams (Missing IDs)
    "app_134", "app_165"            # Brandon Moore (Incomplete Link)
]

# Retrieve the full documents for all flagged identifiers
audit_details = list(collection.find({"_id": {"$in": all_violating_ids}}))

print(f"--- Global Metadata Inspection (11 Records) ---\n")
for doc in audit_details:
    name = doc.get('applicant_info', {}).get('full_name', 'Unknown')
    ssn = doc.get('applicant_info', {}).get('ssn', 'MISSING')
    
    print(f"ID: {doc['_id']} | Name: {name} | SSN: {ssn}")
    
    # Dynamically scan for any field that might contain 'resubmission' or 'note'
    found_metadata = False
    for key, value in doc.items():
        # Check top-level strings and nested dictionaries for 'note' or 'resubmit'
        if any(term in str(key).lower() or term in str(value).lower() 
               for term in ['note', 'resubmit', 'audit', 'comment', 'flag']):
            print(f"  -> [METADATA FOUND in '{key}']: {value}")
            found_metadata = True
            
    if not found_metadata:
        print("  -> No audit notes or system flags detected.")
    print("-" * 50)

--- Global Metadata Inspection (11 Records) ---

ID: app_001 | Name: Stephanie Nguyen | SSN: 427-90-1892
  -> No audit notes or system flags detected.
--------------------------------------------------
ID: app_001_duplicate | Name: Stephanie Nguyen | SSN: MISSING
  -> [METADATA FOUND in 'notes']: DUPLICATE_ENTRY_ERROR
--------------------------------------------------
ID: app_016 | Name: Gary Wilson | SSN: 780-24-9300
  -> No audit notes or system flags detected.
--------------------------------------------------
ID: app_042 | Name: Joseph Lopez | SSN: 652-70-5530
  -> No audit notes or system flags detected.
--------------------------------------------------
ID: app_042_duplicate | Name: Joseph Lopez | SSN: 652-70-5530
  -> [METADATA FOUND in 'notes']: RESUBMISSION
--------------------------------------------------
ID: app_075 | Name: Margaret Williams | SSN: MISSING
  -> No audit notes or system flags detected.
--------------------------------------------------
ID: app_088 | Name: Su

The global metadata inspection confirms that two records contain explicit audit flags: **app_042_duplicate** (Joseph Lopez) is marked as a "**RESUBMISSION**" and **app_001_duplicate** (Stephanie Nguyen) as a "**DUPLICATE_ENTRY_ERROR**". These notes provide evidence of **system integration failures** for these specific cases. However, the remaining **nine records**, including the **identity collisions** (Martinez/Wilson and Smith/Hill) and **missing SSNs** (Williams, Martin, Moore), lack any explanatory metadata.

In [None]:
# Definitive list of the 11 violating records identified for the 2026 Audit
target_quarantine_ids = [
    "app_042", "app_001_duplicate", "app_016", "app_088", 
    "app_101", "app_234", "app_075", "app_120", 
    "app_268", "app_134", "app_165"
]

# Reconstruct the 502-record baseline count by checking both collections
# This ensures the KPI remains accurate even after remediation
active_count = collection.count_documents({})
quarantined_count = db.quarantine_uniqueness.count_documents({"_id": {"$in": target_quarantine_ids}})
baseline_total = active_count + quarantined_count

# Calculate metrics based on the historical baseline to quantify data health
violation_count = len(target_quarantine_ids)
unique_score = ((baseline_total - violation_count) / baseline_total) * 100

# Output the Uniqueness Health Metric for the audit report
print(f"--- Uniqueness Health Metric (Historical Baseline) ---")
print(f"Original Baseline Total: {baseline_total}")
print(f"Total Records with Violations: {violation_count}")
print(f"Uniqueness Score: {unique_score:.2f}%\n")

--- Uniqueness Health Metric (Historical Baseline) ---
Original Baseline Total: 502
Total Records with Violations: 11
Uniqueness Score: 97.81%
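
The metric above reads from `db.quarantine_uniqueness`, but the cell that moves the flagged records into that collection does not appear in this notebook. A minimal sketch of such a step is given below, assuming a simple copy-then-delete approach. The function `quarantine_records` is hypothetical; it relies only on the pymongo `Collection` methods `find`, `insert_many` and `delete_many`.

```python
def quarantine_records(source, quarantine, ids):
    """Copy the documents whose _id is in ids to quarantine, then remove them from source.

    source and quarantine are expected to be pymongo Collection objects.
    Returns the number of documents moved.
    """
    docs = list(source.find({"_id": {"$in": list(ids)}}))
    if not docs:
        # Nothing to move, for example when the step has already been run
        return 0

    # Copy first, so that a failure here leaves the source collection untouched
    quarantine.insert_many(docs)

    # Delete only the documents that were actually copied
    moved_ids = [doc["_id"] for doc in docs]
    source.delete_many({"_id": {"$in": moved_ids}})
    return len(docs)


# Possible usage against the collections in this notebook (requires a live MongoDB):
# moved = quarantine_records(collection, db.quarantine_uniqueness, target_quarantine_ids)
# print(f"Moved {moved} records to quarantine")
```

Copying before deleting means an interruption can leave a record in both collections but never in neither, which seems the safer failure mode for an audit trail.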


In [None]:
# Retrieve all documents held in the quarantine collection for verification
# This ensures the 11 identified violations are securely segregated from the active baseline
quarantined_audit_trail = list(db.quarantine_uniqueness.find({}))

# Output the Verification Table for the Audit Report to document the exclusion rationale
print(f"--- Uniqueness Quarantine Verification ({len(quarantined_audit_trail)} Records) ---")
print(f"{'ID':<20} | {'Name':<20} | {'Status/Note'}")
print("-" * 60)

for record in quarantined_audit_trail:
    # Access the applicant name and relevant audit flags for reporting transparency
    name = record.get('applicant_info', {}).get('full_name', 'Unknown')
    # Capture the specific 'notes' field to prove the technical reason for quarantine
    note = record.get('notes', 'No specific flag (SSN Collision/Missing)')
    
    print(f"{record['_id']:<20} | {name:<20} | {note}")

# Final validation of the remaining active records to confirm readiness for the Consistency phase
# The goal is to verify the transition from a 502-record set to a 491-record clean baseline
print(f"\nAudit Status: Phase 1 (Uniqueness) Verified.")
print(f"Clean Baseline: {collection.count_documents({})} records remain in the audit pool.")

--- Uniqueness Quarantine Verification (11 Records) ---
ID                   | Name                 | Status/Note
------------------------------------------------------------
app_001_duplicate    | Stephanie Nguyen     | DUPLICATE_ENTRY_ERROR
app_016              | Gary Wilson          | No specific flag (SSN Collision/Missing)
app_042              | Joseph Lopez         | No specific flag (SSN Collision/Missing)
app_075              | Margaret Williams    | No specific flag (SSN Collision/Missing)
app_088              | Susan Martinez       | No specific flag (SSN Collision/Missing)
app_101              | Sandra Smith         | No specific flag (SSN Collision/Missing)
app_120              | Carolyn Martin       | No specific flag (SSN Collision/Missing)
app_134              | Brandon Moore        | No specific flag (SSN Collision/Missing)
app_165              | Brandon Moore        | No specific flag (SSN Collision/Missing)
app_234              | Samuel Hill          | No specific fla

---

## Data Quality Dimension 2: Consistency

The **Categorical Consistency Remediation** resolved encoding fragmentation in the gender field of the 491-record baseline. The audit identified **109 records** using **non-standard abbreviations** (**57 labeled "F"** and **52 labeled "M"**), which were mapped to the required "Female" and "Male" taxonomy. This consolidation increased the Female cohort from 191 to **248** and the Male cohort from 191 to **243**, so both cohorts are complete for the subsequent AI Act Art. 10 fairness testing and no valid records are lost from the clean baseline.
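
The mapping itself can be checked without a database. The sketch below rebuilds the pre-remediation distribution from the counts reported in this section and applies the same two substitutions in plain Python. The names `GENDER_MAP` and `normalise_gender` are illustrative and do not appear in the notebook.

```python
from collections import Counter

# Abbreviations found in the audit, mapped to the standard taxonomy
GENDER_MAP = {"M": "Male", "F": "Female"}


def normalise_gender(value):
    """Return the standard label for a known abbreviation; leave other values unchanged."""
    return GENDER_MAP.get(value, value)


# Pre-remediation distribution, rebuilt from the counts reported in this section
raw_values = ["Female"] * 191 + ["Male"] * 191 + ["F"] * 57 + ["M"] * 52

cleaned_counts = Counter(normalise_gender(value) for value in raw_values)
print(dict(cleaned_counts))
```

Because unknown values pass through unchanged, any unexpected encoding would remain visible in the verification step instead of being silently coerced.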

In [69]:
# How many different gender values exist?
pipeline_gender_consistency = [
    {
        "$group": {
            "_id": "$applicant_info.gender",
            "count": {"$sum": 1}
        }
    },
    {
        "$sort": {"count": -1}
    }
]

gender_values = list(collection.aggregate(pipeline_gender_consistency))

print("Gender value distribution:")
print("Expected: 2 distinct values (Male, Female)")
print(f"Actual: {len(gender_values)} distinct values")
print()
for gv in gender_values:
    print(f"  '{gv['_id']}': {gv['count']} records")

Gender value distribution:
Expected: 2 distinct values (Male, Female)
Actual: 4 distinct values

  'Female': 191 records
  'Male': 191 records
  'F': 57 records
  'M': 52 records


In [None]:
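# Execute in-place database updates to resolve the 109 categorical encoding fragmentation errors
# Standardize gender values to align with the required taxonomy for AI Act Art. 10 fairness testing
# Note: the updates are idempotent. On a re-run no 'M' or 'F' values remain, so modified_count
# is 0 for both calls, as in the output recorded below.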

# Map 52 abbreviated 'M' encodings to the standard 'Male' category
result_m = collection.update_many(
    {"applicant_info.gender": "M"},
    {"$set": {"applicant_info.gender": "Male"}}
)

# Map 57 abbreviated 'F' encodings to the standard 'Female' category
result_f = collection.update_many(
    {"applicant_info.gender": "F"},
    {"$set": {"applicant_info.gender": "Female"}}
)

print(f"--- Categorical Remediation Executed ---")
print(f"Standardized 'M' -> 'Male': {result_m.modified_count} records updated.")
print(f"Standardized 'F' -> 'Female': {result_f.modified_count} records updated.\n")

--- Categorical Remediation Executed ---
Standardized 'M' -> 'Male': 0 records updated.
Standardized 'F' -> 'Female': 0 records updated.



In [72]:
# Execute verification aggregation to confirm the remediation of the gender field for the formal audit trail
# This ensures the 491-record baseline contains exactly two distinct values for gender
gender_values_clean = list(collection.aggregate(pipeline_gender_consistency))

print("--- Consistency Verification: applicant_info.gender ---")
print("Expected standard values: ['Male', 'Female']")
print(f"Distinct values detected: {len(gender_values_clean)}\n")

for gv in gender_values_clean:
    # Handle potential null or missing values to prevent execution errors
    val = gv['_id'] if gv['_id'] is not None else "MISSING_OR_NULL"
    print(f"  [{val}]: {gv['count']} records")
print("-" * 50)

--- Consistency Verification: applicant_info.gender ---
Expected standard values: ['Male', 'Female']
Distinct values detected: 2

  [Female]: 248 records
  [Male]: 243 records
--------------------------------------------------


In [83]:
# Check consistency for categorical fields in the current schema
def check_field_consistency(field_path, expected_values=None, unwind_array_path=None):
    """Check how many distinct values exist for a field."""
    
    # Fast schema guard: avoid misleading outputs when a field does not exist
    existing_count = collection.count_documents({field_path: {"$exists": True, "$ne": None}})
    print(f"\nField: {field_path}")
    if existing_count == 0:
        print("Field not found (or always null) in current dataset schema.")
        return
    
    pipeline = []
    if unwind_array_path:
        pipeline.append({"$unwind": f"${unwind_array_path}"})
    
    pipeline.extend([
        {"$group": {"_id": f"${field_path}", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}}
    ])
    results = list(collection.aggregate(pipeline))
    
    missing_count = sum(r['count'] for r in results if r['_id'] is None)
    non_missing_results = [r for r in results if r['_id'] is not None]
    
    print(f"Distinct non-missing values: {len(non_missing_results)}")
    if missing_count:
        print(f"Missing/Null records: {missing_count}")
    if expected_values:
        print(f"Expected: {expected_values}")
    for r in non_missing_results:
        print(f"  '{r['_id']}': {r['count']}")

# Check fields that exist in this dataset
check_field_consistency("applicant_info.gender", ["Male", "Female"])
check_field_consistency("loan_purpose")
check_field_consistency("decision.loan_approved", [True, False])
check_field_consistency("decision.rejection_reason")
check_field_consistency("spending_behavior.category", unwind_array_path="spending_behavior")


Field: applicant_info.gender
Distinct non-missing values: 2
Expected: ['Male', 'Female']
  'Female': 248
  'Male': 243

Field: loan_purpose
Distinct non-missing values: 10
Missing/Null records: 442
  'medical': 8
  'debt_consolidation': 6
  'education': 6
  'vacation': 6
  'wedding': 6
  'moving': 5
  'personal': 4
  'auto': 3
  'home_improvement': 3
  'business': 2

Field: decision.loan_approved
Distinct non-missing values: 2
Expected: [True, False]
  'True': 288
  'False': 203

Field: decision.rejection_reason
Distinct non-missing values: 4
Missing/Null records: 288
  'algorithm_risk_score': 164
  'insufficient_credit_history': 23
  'high_dti_ratio': 12
  'low_income': 4

Field: spending_behavior.category
Distinct non-missing values: 15
  'Travel': 80
  'Utilities': 74
  'Fitness': 70
  'Entertainment': 70
  'Healthcare': 68
  'Insurance': 65
  'Education': 64
  'Dining': 63
  'Groceries': 62
  'Rent': 59
  'Transportation': 59
  'Shopping': 53
  'Alcohol': 11
  'Gambling': 6
  'Adu

The **Consistency Audit** confirms that demographic standardization is complete: the gender field now holds two balanced cohorts of **248 Female** and **243 Male** records. The **loan_approved** field is **100% consistent**, but the analysis exposed a critical **90.02% completeness failure** in **loan_purpose**, where 442 of the 491 records are missing or null. The audit also flagged high-risk behavioral data in the spending arrays, including **Gambling** and **Adult Entertainment**. The systemic absence of **employment_status** indicates a remaining schema mapping error that prevents a full compliance review under **AI Act Art. 10**.
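
The completeness figure for `loan_purpose` can be reproduced from the counts in the output above. The snippet below is a worked check only; `missing_rate` is a hypothetical helper, not a function defined in the notebook.

```python
def missing_rate(missing, total):
    """Percentage of records in which a field is missing or null."""
    if total <= 0:
        raise ValueError("total must be positive")
    return missing / total * 100


# Counts copied from the loan_purpose output above
loan_purpose_counts = {
    "medical": 8, "debt_consolidation": 6, "education": 6, "vacation": 6,
    "wedding": 6, "moving": 5, "personal": 4, "auto": 3,
    "home_improvement": 3, "business": 2,
}
loan_purpose_missing = 442

populated = sum(loan_purpose_counts.values())
total_records = populated + loan_purpose_missing
rate = missing_rate(loan_purpose_missing, total_records)

print(f"Populated: {populated} | Missing: {loan_purpose_missing} | Total: {total_records}")
print(f"loan_purpose missing rate: {rate:.2f}%")
```

The populated and missing counts add up to the 491-record clean baseline, which confirms that the percentage is computed against the post-quarantine dataset and not the original 502 documents.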