# Numeric Formatting EDA (Lending Club)
This notebook looks for issues in the numeric columns of the Lending Club dataset, read here from the string-cleaned Silver Delta table. Any issues found will then be resolved in the Medallion Architecture data cleaning pipeline.

## Import Required Libraries

In [0]:
# Import Libraries 

from pyspark.sql.functions import (
    col, when, count, desc, isnan, isnull, lit, length, trim, lower, upper, to_date, concat_ws
)

from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, IntegerType, DateType, NumericType
)


## Read Silver Delta Table 1 (Strings Cleaned)

In [0]:
df = spark.read.table("silver.lendingclub_cleaned_string")

df.printSchema()

In [0]:
df.limit(10).display()

## Identify all Numeric Columns

In [0]:
all_numeric_cols = [
    f.name
    for f in df.schema.fields
    if isinstance(f.dataType, NumericType)
]

print("Numeric columns:", all_numeric_cols)
print(f"Number of Numeric Columns: {len(all_numeric_cols)}")


## Inspect Summary Statistics
Summary statistics help me quickly spot numeric columns that have invalid entries, so I can deal with them accordingly.

In [0]:
df.select(all_numeric_cols).summary().display()
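
As a complement to reading the summary table by eye, the minimums can also be scanned programmatically. The sketch below is only an illustration: the helper names `columns_with_negative_min` and `negative_min_columns` are hypothetical, and it assumes `df` and `all_numeric_cols` exist as defined above. A negative minimum is only a hint, since some numeric columns may legitimately hold negative values.

```python
def columns_with_negative_min(minimums):
    """Return the column names whose minimum is below zero.

    `minimums` maps column name -> minimum value; None (all-null column) is skipped.
    """
    return sorted(
        name for name, value in minimums.items()
        if value is not None and value < 0
    )


def negative_min_columns(df, numeric_cols):
    """Compute every column's minimum in a single Spark job and flag negatives."""
    if not numeric_cols:
        return []
    # SQL expression strings, so no extra pyspark imports are needed here.
    exprs = [f"min(`{c}`) AS `{c}`" for c in numeric_cols]
    minimums = df.selectExpr(*exprs).first().asDict()
    return columns_with_negative_min(minimums)


# In the notebook: negative_min_columns(df, all_numeric_cols)
# Pure-Python check with made-up minimums (not real dataset values):
print(columns_with_negative_min({"dti": -1.0, "loan_amnt": 500.0, "empty_col": None}))
# ['dti']
```

Any column this flags still needs a domain check before it is treated as invalid.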


The columns with invalid entries are:
- **dti**: Should never be negative 
- **total_rec_late_fee**: Should not be negative (Penalty Fees)

We check how many such invalid entries there are relative to the entire dataset to determine whether it is acceptable to drop them.

In [0]:
print(f"Number of records with DTI < 0: {df.where(col('dti') < 0).count()}")
print(f"Number of records with total_rec_late_fee < 0: {df.where(col('total_rec_late_fee') < 0).count()}")
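
To turn these raw counts into a share of the dataset, something like the following could be used. This is a minimal sketch, not part of the original pipeline: `invalid_share` and `report_negative_share` are hypothetical helper names, and `df` is assumed to be the DataFrame loaded above.

```python
def invalid_share(invalid_rows, total_rows):
    """Fraction of rows that are invalid; 0.0 when the table is empty."""
    return invalid_rows / total_rows if total_rows else 0.0


def report_negative_share(df, column):
    """Print and return the share of rows where `column` is negative."""
    total = df.count()
    invalid = df.where(f"`{column}` < 0").count()
    share = invalid_share(invalid, total)
    print(f"{column} < 0: {invalid} of {total} rows ({share:.4%})")
    return share


# In the notebook:
# for c in ("dti", "total_rec_late_fee"):
#     report_negative_share(df, c)

# Pure-Python check with made-up counts (not real dataset values):
print(f"{invalid_share(3, 1200):.4%}")
# 0.2500%
```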


Dropping these rows will have little impact on the dataset.

In [0]:
# Preview silver.lendingclub_cleaned_numeric
df_numeric = spark.read.table('silver.lendingclub_cleaned_numeric')
df_numeric.limit(10).display()
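
The row drop mentioned above is not shown in this notebook. A minimal sketch of how it might look follows, under two assumptions that should be checked against the actual pipeline: that only `dti` and `total_rec_late_fee` are filtered, and that rows with nulls in those columns are kept (a plain `>= 0` filter would silently drop them). The names `NON_NEGATIVE_COLS`, `non_negative_condition`, and `drop_negative_rows` are hypothetical.

```python
# Assumption: the columns that must not be negative, per the findings above.
NON_NEGATIVE_COLS = ("dti", "total_rec_late_fee")


def non_negative_condition(columns):
    """Build a Spark SQL predicate that rejects negatives but keeps nulls."""
    return " AND ".join(f"(`{c}` >= 0 OR `{c}` IS NULL)" for c in columns)


def drop_negative_rows(df, columns=NON_NEGATIVE_COLS):
    """Return `df` without rows that have a negative value in any of `columns`."""
    if not columns:
        return df
    return df.where(non_negative_condition(columns))


# In the notebook: cleaned = drop_negative_rows(df)
print(non_negative_condition(NON_NEGATIVE_COLS))
```

Keeping nulls here leaves the decision about missing values to a separate step.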