This notebook aims to simulate a data scientist consuming data from a data pipeline, using PySpark. 

In this notebook, I will execute the following steps:
  - Conduct EDA on the data produced by the Medallion data pipeline
  - Handle missing values

Thereafter, feature engineering, feature selection and multicollinearity checks are handled individually in their respective notebooks. I have done this deliberately, since feature engineering should be targeted separately at each of the LGD, PD and EAD models. Engineering features once and producing a single master table for all 3 models would make model building tedious, computationally expensive and hard to interpret.

# 1. Import Libraries

In [1]:
# == Standard libraries == 
import os

# == Load Environment Variables == 
from dotenv import load_dotenv
load_dotenv()

# == Start Spark == 
from init_spark import start_spark
spark = start_spark()

# == Wandb for logging == 
import wandb
wandb.login(key=os.getenv("WANDB_API_KEY"))

# == Pyspark Functions == 
from pyspark.sql import functions as F
from pyspark.sql.functions import (
    col,
    when,
    count,
    desc,
    isnan,
    isnull,
    lit,
    length,
    trim,
    lower,
    upper,
    to_date,
    concat_ws,
    regexp_extract,
    sum,
    unix_timestamp,
    from_unixtime,
)

from pyspark.sql.types import (
    StructType,
    StructField,
    StringType,
    DoubleType,
    IntegerType,
    DateType,
    NumericType,
)

from pyspark.sql import DataFrame
from datetime import date

# == Pandas, Seaborn, Matplotlib == 
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns 

# == Global Functions == 
from functions import * # Import functions from the functions module

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/08/10 16:36:48 WARN Utils: Your hostname, Chengs-MacBook-Pro.local, resolves to a loopback address: 127.0.0.1; using 192.168.0.77 instead (on interface en0)
25/08/10 16:36:48 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/Users/lunlun/Downloads/Github/Credit-Risk-Modeling-PySpark/venv/lib/python3.11/site-packages/pyspark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /Users/lunlun/.ivy2.5.2/cache
The jars for the packages stored in: /Users/lunlun/.ivy2.5.2/jars
io.delta#delta-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-59a5969e-74eb-48a0-97af-fd012a3b144b;1.0
	confs: [default]
	found io.delta#delta-spark_2.13;4.0.0 in central
	found io.delta#delta-storage;4.0.0 in central
	found org.antlr#antlr4-runtime;4.13.1 in central
:: resolution report :: 

4.0.0


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/lunlun/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mwlunlun1212[0m ([33mwlunlun1212-singapore-management-university[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /Users/lunlun/.netrc


In [None]:
# ============ Helper Functions ===================

# == Miscellaneous == 
def drop_constant_columns(df):
    """
    Removes all columns in the DataFrame that have only one distinct value.
    Returns a new DataFrame with those columns removed.
    """
    cols_to_drop = []

    for column_name in df.columns:
        if df.select(col(column_name)).distinct().count() <= 1:
            cols_to_drop.append(column_name)

    print(f"⚠️ Dropping constant columns: {cols_to_drop}")
    return df.drop(*cols_to_drop)

# == Outliers Handling ==
def compute_outlier_pct(df, col_name, lower_pct=0.25, upper_pct=0.75):
    '''Computes pct of outliers per column based on IQR method '''
    
    # 1. Compute percentile bounds
    quantiles = df.approxQuantile(col_name, [lower_pct, upper_pct], 0.01)
    q1, q3 = quantiles[0], quantiles[1]
    iqr = q3 - q1 

    # 2. Obtain lower and upper bound, any data points outside of this are seen as outliers 
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    total_rows = df.count()

    return round(df.filter( (col(col_name) < lower_bound) | (col(col_name) > upper_bound) ).count() / total_rows * 100, 2) 

def display_distributions(df): 
    """Takes in Spark Dataframe. Samples it and display distribution for skewness checking"""
    # 1. Select numerical columns
    numeric_cols = [field.name for field in df.schema if isinstance(field.dataType, NumericType)]

    # 2. Sample a small portion of the data (10%) and convert to pandas
    sample_df = df.select(numeric_cols).sample(fraction=0.1, seed=42)
    sample_pdf = sample_df.toPandas()

    # 3. Plot histograms as subplots
    n_cols = 3  # Number of plots per row
    n_rows = (len(numeric_cols) + n_cols - 1) // n_cols

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4 * n_rows))
    axes = axes.flatten()

    for i, col_name in enumerate(numeric_cols):
        axes[i].hist(sample_pdf[col_name].dropna(), bins=50, color='skyblue')
        axes[i].set_title(col_name, fontsize=10)
        axes[i].tick_params(axis='x', rotation=45)

    # Hide any unused subplots
    for j in range(i + 1, len(axes)):
        axes[j].axis('off')

    plt.tight_layout()
    plt.show()

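def inspect_outliers_iqr(df: DataFrame, columns: list, sample_size: int = 5):
    """
    Prints the IQR bounds, outlier count and percentage for each column,
    then shows up to `sample_size` of the most extreme outliers on each side.
    """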
    for col_name in columns:
        try:
            print(f"\n📊 Inspecting Outliers for Column: `{col_name}`")

            # Step 1: Calculate Q1, Q3, and IQR
            q1, q3 = df.approxQuantile(col_name, [0.25, 0.75], 0.01)
            iqr = q3 - q1
            lower = q1 - 1.5 * iqr
            upper = q3 + 1.5 * iqr

            print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
            print(f"Lower Bound = {lower}, Upper Bound = {upper}")

            # Step 2: Count outliers
            outlier_count = df.filter((col(col_name) < lower) | (col(col_name) > upper)).count()
            total_count = df.count()
            outlier_pct = round(outlier_count / total_count * 100, 2)
            print(f"Outlier Count: {outlier_count} ({outlier_pct}%)")

            # Step 3: Sample outlier values (top and bottom)
            print(f"🔼 Top Outliers (>{upper}):")
            df.filter(col(col_name) > upper).select(col_name).orderBy(col(col_name).desc()).show(sample_size)

            print(f"🔽 Bottom Outliers (<{lower}):")
            df.filter(col(col_name) < lower).select(col_name).orderBy(col(col_name)).show(sample_size)

        except Exception as e:
            print(f"❌ Could not process column `{col_name}`: {str(e)}")

def winsorise_col(df, col_name, operator: str, condition_val, final_val):
    """
    Winsorises a column by replacing values beyond a cut-off with a final value.

    Args:
        df (DataFrame): The input DataFrame.
        col_name (str): The name of the column to winsorise.
        operator (str): '<' to replace values below the cut-off, '>' to replace values above it.
        condition_val (float): The cut-off value.
        final_val (float): The value to replace with.

    Returns:
        DataFrame: The DataFrame with the winsorised column.

    Raises:
        ValueError: If operator is neither '<' nor '>'.
    """
    if operator == '<':
        condition = col(col_name) < condition_val
    elif operator == '>':
        condition = col(col_name) > condition_val
    else:
        # Previously an unsupported operator silently returned None
        raise ValueError("Operator must be '<' or '>'")

    print("✅ Winsorising column:", col_name, "...")
    return df.withColumn(col_name, when(condition, final_val).otherwise(col(col_name)))

def retain_rows(df: DataFrame, col_name: str, condition_val: float, operator: str) -> DataFrame:
    """
    Retains rows in the DataFrame where the specified column meets a condition.

    Args:
        df (DataFrame): The input DataFrame.
        col_name (str): The column to test.
        condition_val (float): The value to compare against.
        operator (str): One of '<=', '<', '>', '>='.

    Returns:
        DataFrame: The DataFrame containing only the rows that satisfy the condition.
    """
    
    if operator == '<=':
        return df.filter(col(col_name) <= condition_val)    
    
    elif operator == '<':
        return df.filter(col(col_name) < condition_val)
    
    elif operator == '>':
        return df.filter(col(col_name) > condition_val)
    
    elif operator == '>=':
        return df.filter(col(col_name) >= condition_val)
    
    else:
        raise ValueError("Operator must be one of '<=', '<', '>', '>='")
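
The IQR rule used by the outlier helpers above can be checked by hand on a small list. The following is a minimal plain-Python sketch, not the notebook's Spark code: `iqr_outlier_pct` is a hypothetical name, and it uses exact quartiles from the standard library, whereas `approxQuantile` with a relative error of 0.01 may return slightly different bounds on real data.

```python
import statistics

def iqr_outlier_pct(values, k=1.5):
    """Percentage of values outside [Q1 - k*IQR, Q3 + k*IQR], rounded to 2 d.p."""
    if len(values) < 2:
        # Quartiles are undefined for fewer than two points
        return 0.0

    # Exact quartiles (assumption: the 'inclusive' interpolation method)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    # Anything outside these bounds is treated as an outlier
    lower, upper = q1 - k * iqr, q3 + k * iqr
    n_outliers = len([v for v in values if v < lower or v > upper])
    return round(n_outliers / len(values) * 100, 2)

# Illustrative data only: Q1 = 3.25, Q3 = 7.75, so the bounds are -3.5 and 14.5
toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
toy_outlier_pct = iqr_outlier_pct(toy_values)
```

Here only the value 100 falls outside the bounds, which is the same logic `compute_outlier_pct` applies per Spark column.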

In [5]:
df1 = df

### 6.2 Outlier Handling 

### 7.2 Highly Skewed Features Detection 


Now, let's deal with skewness. We will group the numerical features into the following categories, each with its planned treatment:
- **Highly Right Skewed (Pct / Probability)**: Logit Transformation
- **Highly Right Skewed (Continuous Feature)**: Log Transformation, Else Box-Cox 
- **Discrete Variables**: WoE binning in Feature Selection 
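
For reference, the first two treatments in the list above can be sketched in plain Python. This is a sketch under stated assumptions rather than the notebook's implementation: the function names, the clipping constant `eps`, and the use of `log(1 + x)` instead of a plain log (so that zeros stay defined) are all illustrative choices.

```python
import math

def logit(p, eps=1e-6):
    """Map a proportion in [0, 1] to the log-odds scale.

    Values are clipped to [eps, 1 - eps] first (assumed constant), because
    the logit is infinite at exactly 0 and 1.
    """
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))

def log_transform(x):
    """log(1 + x) for a non-negative, right-skewed value; defined at x = 0."""
    if x < 0:
        raise ValueError("log_transform expects a non-negative value")
    return math.log1p(x)

# A percentage column would be divided by 100 before applying logit
example_logit = logit(0.5)        # midpoint of the probability scale
example_log = log_transform(0.0)  # zero stays at zero
```

In PySpark the same expressions could be written with column functions instead of `math`, but that variant is not shown here.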


⚠️ However, at this checkpoint I realised that the WoE binning used in feature selection is robust to outliers and skewness. For example, outliers can be placed in a bin of their own. WoE encoding then replaces each continuous column with the WoE value (on the log-odds scale) of the bin that each row falls into. Since we are not feeding the raw continuous values into our future models, no skewness correction or transformation is needed here.
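
To make this concrete, here is a minimal plain-Python sketch of WoE encoding. Everything in it is an assumption for illustration: the function names, the bin edges, the additive smoothing that avoids division by zero, the toy data, and the sign convention WoE = ln(share of non-defaults / share of defaults), which some libraries reverse.

```python
import math
from bisect import bisect_right

def woe_by_bin(values, defaults, edges, smoothing=0.5):
    """Return {bin_index: WoE}. Bin k covers edges[k-1] <= value < edges[k]."""
    n_bins = len(edges) + 1
    good = [0] * n_bins  # non-defaults per bin
    bad = [0] * n_bins   # defaults per bin
    total_good = total_bad = 0
    for v, d in zip(values, defaults):
        b = bisect_right(edges, v)
        if d:
            bad[b] += 1
            total_bad += 1
        else:
            good[b] += 1
            total_good += 1

    woe = {}
    for b in range(n_bins):
        # Smoothed share of all non-defaults / defaults that land in this bin
        good_share = (good[b] + smoothing) / (total_good + smoothing * n_bins)
        bad_share = (bad[b] + smoothing) / (total_bad + smoothing * n_bins)
        woe[b] = math.log(good_share / bad_share)
    return woe

def woe_encode(values, edges, woe):
    """Replace each raw value with the WoE of the bin it falls into."""
    return [woe[bisect_right(edges, v)] for v in values]

# Toy data: the last value is an extreme outlier that gets its own bin
toy_values = [1, 2, 3, 4, 5, 6, 1000]
toy_defaults = [0, 0, 0, 1, 0, 1, 1]
toy_edges = [3.5, 100]

toy_woe = woe_by_bin(toy_values, toy_defaults, toy_edges)
toy_encoded = woe_encode(toy_values, toy_edges, toy_woe)
```

The outlier 1000 is encoded with the same bounded WoE value as any other value above 100, so its raw magnitude never reaches the model.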

Since we still made some minor changes, i.e. removing placeholder and illogical values from some of our features, let's log them to `wandb` to simulate adherence to strict Basel regulations.
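
One way such a log entry could look is sketched below. The helper names, the project name `credit-risk-eda`, the metric keys and the row counts are hypothetical and not taken from the pipeline; only `wandb.init`, `run.log` and `run.finish` are actual `wandb` calls. The `wandb` import sits inside the function so that the summary can be built and inspected without a logged-in session.

```python
def build_cleaning_summary(rows_before, rows_after, dropped_columns):
    """Summarise a cleaning step as a flat dict of loggable metrics."""
    rows_removed = rows_before - rows_after
    pct_removed = round(rows_removed / rows_before * 100, 2) if rows_before else 0.0
    return {
        "rows_before": rows_before,
        "rows_after": rows_after,
        "rows_removed": rows_removed,
        "pct_rows_removed": pct_removed,
        "n_dropped_columns": len(dropped_columns),
    }

def log_cleaning_run(summary, config, project="credit-risk-eda"):
    """Send the summary to wandb (project name is a placeholder)."""
    import wandb  # imported lazily; requires a prior wandb.login()

    run = wandb.init(project=project, job_type="data-cleaning", config=config)
    run.log(summary)
    run.finish()

# Hypothetical numbers for illustration; in the notebook these would come
# from df.count() before and after the placeholder and illogical-value fixes.
example_summary = build_cleaning_summary(
    rows_before=1000, rows_after=950, dropped_columns=["example_constant_col"]
)
```

Calling `log_cleaning_run(example_summary, config={"step": "outlier_handling"})` would then create one run per cleaning step, which keeps each change traceable.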