# Lab 01.3: Data Quality Assessment & Cleansing
## Big Data Analytics Workshop - Banking Data Integrity

### 🎯 **Learning Objectives**
After completing this lab, you will understand:
- Data quality dimensions and their importance in banking
- Common data quality issues in financial datasets
- Spark-based data profiling and quality assessment techniques
- Automated data cleansing and validation strategies
- Regulatory compliance through data quality management

### 🏦 **Banking Context: Why Data Quality Matters**
In banking, data quality is critical for:
- **Regulatory Compliance**: Basel III, GDPR, PSD2 requirements
- **Risk Management**: Accurate credit scoring and fraud detection
- **Customer Experience**: Personalized services and recommendations
- **Operational Efficiency**: Automated decision-making systems

Poor data quality is expensive: a widely cited industry estimate puts the average cost at around **$15M per organization annually**.

### 🔍 **Data Quality Dimensions**
- **Completeness**: Are all required fields populated?
- **Accuracy**: Does the data correctly reflect the real-world value it describes?
- **Consistency**: Is data uniform across systems?
- **Timeliness**: Is the data current enough for its intended use?
- **Uniqueness**: Are there duplicate records?
- **Validity**: Does data conform to business rules?
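
Several of these dimensions reduce to simple ratios over a dataset. The sketch below is one possible way to score four of them in plain Python, so it runs without a Spark session. The sample records, field names, the email rule, and the 365-day freshness threshold are all hypothetical and chosen only for illustration. Accuracy and consistency are left out because they need a trusted reference source or a second system to compare against.

```python
from datetime import date

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"account_id": "A001", "email": "ana@example.com", "updated": date(2024, 5, 1)},
    {"account_id": "A002", "email": None,              "updated": date(2024, 5, 2)},
    {"account_id": "A002", "email": "bo@example.com",  "updated": date(2021, 1, 15)},
    {"account_id": "A003", "email": "cy@example",      "updated": date(2024, 4, 30)},
]

def completeness(rows, field):
    """Share of rows where `field` is populated (not None or empty)."""
    filled = [r for r in rows if r.get(field) not in (None, "")]
    return len(filled) / len(rows)

def uniqueness(rows, field):
    """Ratio of distinct values to rows; 1.0 means no duplicates."""
    return len({r[field] for r in rows}) / len(rows)

def validity(rows, field, rule):
    """Share of populated values that satisfy the business rule `rule`."""
    values = [r[field] for r in rows if r.get(field) is not None]
    return len([v for v in values if rule(v)]) / len(values)

def timeliness(rows, field, as_of, max_age_days):
    """Share of rows updated within `max_age_days` of the `as_of` date."""
    fresh = [r for r in rows if (as_of - r[field]).days <= max_age_days]
    return len(fresh) / len(rows)

def looks_like_email(value):
    """Deliberately simple rule (an assumption): an '@' followed by a dotted domain."""
    return "@" in value and "." in value.split("@")[-1]

report = {
    "email_completeness": completeness(records, "email"),
    "account_id_uniqueness": uniqueness(records, "account_id"),
    "email_validity": validity(records, "email", looks_like_email),
    "updated_timeliness": timeliness(records, "updated", date(2024, 5, 3), 365),
}

for metric, score in report.items():
    print(f"{metric}: {score:.2f}")
```

The helpers use `len` over list comprehensions rather than the built-in `sum`, so they keep working in a notebook where `from pyspark.sql.functions import *` has replaced `sum` with Spark's column aggregate. At scale, the same ratios would normally be computed with Spark DataFrame aggregations instead of Python loops.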

---

In [None]:
# Environment Setup and Imports
import os
import sys
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import SparkSession
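# NOTE: these wildcard imports replace Python built-ins such as sum, min, max,
# and round with Spark column functions for the rest of the notebook.
from pyspark.sql.types import *
from pyspark.sql.functions import *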
from pyspark.sql.window import Window
import warnings
warnings.filterwarnings('ignore')

# Add utils to path
sys.path.append('../utils')
from banking_data_generator import BankingDataGenerator
from performance_monitor import PerformanceMonitor
from banking_analytics import BankingAnalytics

print("🔍 Lab 01.3: Data Quality Assessment & Cleansing")
print("=" * 60)
print("🎯 Focus: Banking Data Integrity & Validation")
print("📊 Skills: Profiling, Cleansing, Validation")
print("=" * 60)