# Medallion Architecture Data Cleaning Pipeline 

Delta Live Tables offer a fault-tolerant, optimized approach for building reliable data pipelines, making them ideal for this use case.

In real-world end-to-end (E2E) data projects, roles and responsibilities are typically split as follows: 
- **Data Engineers**: Focus on building pipelines that handle common data issues such as duplicates, column formatting, schema definition, and invalid values.

- **Data Scientists**: Work on EDA, imputing missing values, handling outliers, and preparing data for modeling (feature engineering, feature selection, dimensionality reduction, etc.).

In this notebook, I will be implementing a simplified **Medallion Architecture** using **Delta Live Tables (DLT)** in **Azure Databricks** to simulate real-world data engineering practices. 

I will be using the following visualisation as a guide to build the data pipeline. 


<br>

<img src="https://media.datacamp.com/cms/ad_4nxe4oejrhu9gexxri3ea6vmsu1fgxcxbvlwmbaj4ji5s2u31dg3hbyyg4sxmd7ma8-9zamnbxadzz_h4kllvjylicug3v4-iinvx65erdijn4htymmqvc3mjqblskqzdu5ttmodyua.png">



By the end of this notebook, I should be able to: 
- Output a **thoroughly cleansed target dataset** ready for data scientists to conduct EDA, dataset preprocessing and other model building practices. 

- Clearly define the **feature and target variables** from the target table. 

## Bronze Delta Table

This serves as a 'landing place' for raw data and acts as the single source of truth. If data processing in a subsequent stage goes wrong, data specialists can refer back to the **Bronze Delta Table**, which helps preserve data integrity. 



In [0]:
# ===============================
# + --------------------------- +
# | Bronze Delta Table Pipeline |
# + --------------------------- +
# ===============================

import dlt
import pyspark.sql.functions as F
import pyspark.sql.types



# The DLT version below only runs inside a DLT pipeline, which I cannot create on Azure for Students

# @dlt.table(name="bronze_raw_lendingclub_data", comment="Ingest raw loan data from Lending Club csv")
# def bronze_raw_loans():
#     return spark.read.csv("/FileStore/tables/accepted_2007_to_2018Q4.csv", 
#                           header=True, 
#                           inferSchema=True)
    
# inferSchema=True auto-detects the column dtypes, which lessens my workload later 

# ✅ 1. Read the raw csv with plain Spark, so that no DLT pipeline is needed 
bronze_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("/FileStore/tables/accepted_2007_to_2018Q4.csv")
)

spark.sql("CREATE SCHEMA IF NOT EXISTS bronze")

# ✅ 2. Save as a Delta table in the `bronze` schema
bronze_df.write.format("delta").mode("overwrite").saveAsTable("bronze.lendingclub_raw")


## Silver Delta Table

Next, the pipeline that produces the Silver Delta Table performs the key data cleaning steps:
  - Deal with Duplicates
  - Remove String Column Spaces
  - Handle String Formatting / Spelling Issues 
  - Ensure UTF-8 for String Columns 
  - Schema Definition 
  - Invalid Value Handling 
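
As a minimal sketch of the schema definition step listed above, the helper below builds one Spark SQL expression per column and casts only the columns named in a type mapping. The helper name `build_cast_exprs`, the example column names and the target types are assumptions for illustration, not the final schema. The helper is plain Python, so it can be checked without a cluster.

```python
def build_cast_exprs(columns, target_types):
    """Return one Spark SQL expression per column.

    Columns listed in target_types are cast to the given type;
    all other columns are passed through unchanged.
    """
    exprs = []
    for name in columns:
        if name in target_types:
            # Backticks quote the identifier in Spark SQL
            exprs.append(f"CAST(`{name}` AS {target_types[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")
    return exprs


# Hypothetical mapping: column names and types are assumed, not taken from the real table
ASSUMED_TYPES = {"loan_amnt": "DOUBLE", "addr_state": "STRING"}

cast_exprs = build_cast_exprs(["id", "loan_amnt", "addr_state"], ASSUMED_TYPES)
print(cast_exprs)

# On a cluster, the expressions could be applied with:
#   typed_df = bronze_df.selectExpr(*cast_exprs)
```

Depending on the Spark ANSI setting, a value that cannot be cast either becomes NULL or raises an error, so the result should be inspected before it is treated as clean.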

In [0]:
# Functions needed for Silver Delta Pipeline 

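def drop_duplicates(df):
    """Drop exact duplicate rows and report how many were removed."""
    deduped_df = df.dropDuplicates()  # computed once and reused for the count and the return value
    duplicate_rows = df.count() - deduped_df.count()
    print(f"Number of duplicate rows: {duplicate_rows}")

    return deduped_df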


################################
# +---------------------------+#
# |Handle String Column Issues|#
# +---------------------------+# 
################################

def handle_string_cols_spaces(df): 
    string_cols = [
        field.name for field in df.schema.fields
        if isinstance(field.dataType, pyspark.sql.types.StringType)]
    
    # Replace each string column with its trimmed values 
    for col_name in string_cols:
        df = df.withColumn(col_name, F.trim(F.col(col_name)))
    
    return df 

def handle_string_cols_formatting(df):  
    """
    Fix formatting / spelling issues in string columns.

    Planned approach: use the RapidFuzz library for lightweight similarity calculations, optimised for performance.
    The string issues themselves are documented in ../sandbox/string_issues.ipynb
    """
    # TODO: Drop unusable string columns 

    # TODO: Fix addr_state (check len() > 2)

    # TODO: Fix invalid string column values (tedious)

    # TODO: Drop meaningless string columns 

    return df  # returned unchanged until the steps above are implemented


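def utf8_string_cols(df):  
    # TODO: ensure string columns are valid UTF-8
    return df  # returned unchanged for now, so the pipeline does not receive None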


#################################
# +----------------------------+#
# |Handle Numeric Column Issues|#
# +----------------------------+# 
#################################

# - Type Casting...

# def define_new_schema(df: DataFrame) -> DataFrame: 
#     # ensure nullable = true 
#     return 




In [0]:




    
# 1. Define Schema  
# def define_schema(df: DataFrame) -> DataFrame:  # Spark DataFrame rather than pandas; see the commented draft below



# # 4. ... 

# from pyspark.sql import DataFrame
# from pyspark.sql import functions as F

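# def define_schema(df: DataFrame) -> DataFrame:
#     """Define an explicit schema instead of relying on schema inference, which is error-prone"""
#     # An alternative syntax is a StructType / StructField definition
#     # NOTE: the column names below look like placeholders from another dataset, not LendingClub columns
#     my_ddl_schema = '''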
#                     Item_Identifier STRING,
#                     Item_Weight STRING, 
#                     Item_Fat_Content STRING,
#                     Item_Visibility DOUBLE,
#                     Item_Type STRING,
#                     Item_MRP DOUBLE,
#                     Outlet_Identifier STRING,
#                     Outlet_Establishment_Year INT,
#                     Outlet_Size STRING,
#                     Outlet_Location_Type STRING,
#                     Outlet_Type STRING,
#                     Item_Outlet_Sales DOUBLE

#                     ''' 

#     df = dlt.read("bronze_raw_lendingclub_data")  # reads the bronze delta table as a DataFrame

    
    




In [0]:
# @dlt.table(name="silver_cleaned_lendingclub_data", comment="Full data cleaning pipeline to create Silver Delta Table")
# def silver_cleaned_loans() -> DataFrame: 
#     # Outliers are not removed here, since they are normally dealt with by data scientists with knowledge of statistics
#     bronze_df = dlt.read("bronze_raw_lendingclub_data")  # DLT table functions take no arguments

#     bronze_df = drop_duplicates(bronze_df)
#     print('✅ Duplicates removed...')

#     bronze_df = handle_string_cols_spaces(bronze_df)
#     print('✅ Trailing / Leading Spaces removed...')

#     bronze_df = handle_string_cols_formatting(bronze_df)
#     bronze_df = utf8_string_cols(bronze_df)
#     # TODO: remaining cleaning step (left unfinished)

#     bronze_df = define_new_schema(bronze_df)

#     return bronze_df
    

## Gold Delta Table 
Finally, to produce a Gold Delta Table, the pipeline should perform the following steps: 
  - Add derived columns, e.g. KPI ratios 
  - Create ML target variables 
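
The two Gold steps above could be sketched with Spark SQL expressions, as below. This is only an illustration under assumptions: the column names `loan_amnt`, `annual_inc` and `loan_status`, the status labels, and the output names `loan_to_income` and `is_default` are hypothetical choices, not confirmed against the actual table.

```python
# Hypothetical Gold expressions; every column name and label here is an assumption
GOLD_EXPRS = [
    "*",  # keep all Silver columns
    # Derived KPI ratio, guarded so a zero or missing income yields NULL
    "CASE WHEN annual_inc > 0 THEN loan_amnt / annual_inc END AS loan_to_income",
    # Binary ML target; any other status is left as NULL
    "CASE WHEN loan_status = 'Charged Off' THEN 1 "
    "WHEN loan_status = 'Fully Paid' THEN 0 END AS is_default",
]


def build_gold(silver_df):
    """Append the derived column and the target column to a Silver DataFrame."""
    return silver_df.selectExpr(*GOLD_EXPRS)


# On a cluster: gold_df = build_gold(silver_df)
```

Rows whose status is neither label keep a NULL target, which leaves the decision on how to treat them to the data scientists.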

In [0]:
# @dlt.table(name="gold_lendingclub_data", comment="Ready for data scientists")
# def gold_processed_loans() -> DataFrame:  # DLT table functions take no arguments; read the Silver table with dlt.read(...)



## 3. Renaming Columns 
Some of the column names are too abbreviated to understand at a glance. As such, I will be renaming them for ease of interpretation. 
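
A minimal sketch of the renaming step, assuming a mapping from abbreviated to descriptive names. The entries in `RENAME_MAP` are hypothetical examples, not the final list; the helper relies only on the DataFrame's `columns` attribute and `withColumnRenamed` method.

```python
# Hypothetical mapping: old (abbreviated) name -> new (descriptive) name
RENAME_MAP = {
    "loan_amnt": "loan_amount",
    "annual_inc": "annual_income",
    "addr_state": "address_state",
}


def rename_columns(df, rename_map):
    """Rename every column that appears in rename_map; ignore names that are absent."""
    for old_name, new_name in rename_map.items():
        if old_name in df.columns:
            df = df.withColumnRenamed(old_name, new_name)
    return df


# On a cluster: renamed_df = rename_columns(silver_df, RENAME_MAP)
```

Keeping the mapping in one dictionary makes the renames easy to review and extend as more unclear column names turn up.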