# Silver Layer Analysis

`This notebook performs exploratory and validation analysis on the Silver datasets
to ensure data consistency and analytical readiness.`


In [0]:
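from pyspark.sql import functions as F  # referenced as F in every aggregation below

# container and storage_account are assumed to be defined earlier (e.g. in a configuration cell)
silver_path = f"abfss://{container}@{storage_account}.dfs.core.windows.net/Silver"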


In [0]:
display(dbutils.fs.ls(silver_path))

In [0]:
# Load the enriched loan table from the Silver layer
loan_df = spark.read.format("delta").load(f"{silver_path}/loan_enriched")
display(loan_df)


### Portfolio Structure by Product

`This analysis aims to understand how the credit portfolio is distributed across product types.`\
`In credit risk frameworks such as IFRS 9, PD, LGD, and EAD models are systematically segmented by product category, as each product exhibits distinct risk characteristics.`\
`Analyzing the portfolio structure by product is therefore a necessary first step to interpret risk metrics and to provide a meaningful view of the portfolio composition.`


In [0]:

df_prod = (
    loan_df
        .groupBy("product_type")
        .agg(
            F.count("*").alias("nb_loans"),
            F.sum("principal_amount").alias("total_exposure"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("total_exposure", ascending=False)
)

display(df_prod)


product_type,nb_loans,total_exposure,default_rate
IMMO,37930,13243416017.832453,1.0
CONSO,96489,2032270994.59028,1.0
REVOLVING,1505,8215390.96650434,1.0


In [0]:
# A default_rate of 1.0 for every product is implausible: inspect the distinct values of default_flag
loan_df.select("default_flag").distinct().show()

+------------+
|default_flag|
+------------+
|        NULL|
|           1|
+------------+
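
The `default_rate` of 1.0 seen above follows from how averages treat missing values: Spark's `avg`, like SQL `AVG`, skips NULLs, so a column containing only `1` and `NULL` averages to exactly 1.0. The plain-Python sketch below mimics that behaviour on a made-up five-loan column; `sql_style_avg` is a hypothetical helper written for this illustration, not a Spark function.

```python
def sql_style_avg(values):
    """Average that skips None, mimicking SQL AVG semantics."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


# Made-up flags for five loans: one default, four with no recorded event.
flags_with_nulls = [1, None, None, None, None]
# Same column after replacing missing flags with 0 (non-default).
flags_filled = [0 if v is None else v for v in flags_with_nulls]

rate_with_nulls = sql_style_avg(flags_with_nulls)  # only the single 1 is averaged
rate_filled = sql_style_avg(flags_filled)          # all five loans are averaged

print(rate_with_nulls, rate_filled)
```

Only after the missing flags are encoded as 0 does the average become a default rate over all loans.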



In [0]:

# We assume that default events have been joined without explicitly encoding the absence of an event.
# Missing default indicators are therefore interpreted as non-default cases and are replaced with zero before proceeding with the analysis.


loan_df = loan_df.withColumn(
    "default_flag",
    F.coalesce(F.col("default_flag"), F.lit(0)),
)


In [0]:
loan_df.select("default_flag").distinct().show()

+------------+
|default_flag|
+------------+
|           1|
|           0|
+------------+



In [0]:

df_prod = (
    loan_df
        .groupBy("product_type")
        .agg(
            F.count("*").alias("nb_loans"),
            F.sum("principal_amount").alias("total_exposure"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("total_exposure", ascending=False)
)

display(df_prod)


product_type,nb_loans,total_exposure,default_rate
IMMO,37930,13243416017.832453,0.0284207751120485
CONSO,96489,2032270994.59028,0.0734384230326773
REVOLVING,1505,8215390.96650434,0.1269102990033222


##### Interpretation of Results

`The observed default rates by product type are consistent with typical retail banking risk profiles.`\
`Mortgage loans exhibit a low default rate (around 2.8%), which is realistic given the presence of strong collateral, more solvent borrowers, larger loan amounts, and generally more stable repayment capacity.`\
`Consumer loans show a higher default rate (around 7.3%), reflecting their unsecured nature, easier access, and their frequent use by more financially constrained borrowers.`\
`Revolving credit displays the highest default rate (around 12.7%), which is expected for this product category.`\
`Revolving credit is structurally riskier due to the absence of collateral, high utilization behavior, elevated interest rates, fragile borrower profiles, debt cycles, and high credit conversion factors (CCF) reflecting frequent usage of credit limits.`
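
To put the portfolio structure in numbers, the figures from the product table can be turned into shares of loan count and shares of exposure, and the product-level rates can be reconciled with a portfolio-wide default rate. The sketch below is plain Python over values copied from the table above; default counts are recovered by rounding `nb_loans * default_rate`, which assumes the rates are exact ratios of counts.

```python
# Figures copied from the product table above: (nb_loans, total_exposure, default_rate).
products = {
    "IMMO": (37930, 13243416017.832453, 0.0284207751120485),
    "CONSO": (96489, 2032270994.59028, 0.0734384230326773),
    "REVOLVING": (1505, 8215390.96650434, 0.1269102990033222),
}

total_loans = sum(n for n, _, _ in products.values())
total_exposure = sum(e for _, e, _ in products.values())

# Share of loan count versus share of exposure, per product.
loan_share = {p: n / total_loans for p, (n, _, _) in products.items()}
exposure_share = {p: e / total_exposure for p, (_, e, _) in products.items()}

# Recover default counts from the rates, then the portfolio-wide default rate.
defaults = {p: round(n * r) for p, (n, _, r) in products.items()}
portfolio_default_rate = sum(defaults.values()) / total_loans

for p in products:
    print(f"{p:<10} loans {loan_share[p]:6.1%}  exposure {exposure_share[p]:6.1%}")
print(f"portfolio default rate: {portfolio_default_rate:.2%}")
```

On these figures, mortgages account for about 28% of loans but roughly 87% of exposure, so count-based and exposure-based views of the portfolio differ substantially.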



### Customer Risk Analysis — Default Rate by Income Segment

`The objective of this analysis is to assess whether lower-income customers carry higher credit risk.`

`Annual income is a key driver of repayment capacity and is therefore a fundamental variable in IFRS 9 frameworks and PD modeling.`

`Analyzing default rates by income segment allows the identification of more fragile customer groups, whose solvency and income stability may be more sensitive to economic conditions.`

`Customers are segmented into low, medium, and high income buckets to facilitate comparative risk analysis.`


In [0]:
loan_df.select("annual_income").show()

+------------------+
|     annual_income|
+------------------+
| 45057.86837163758|
| 82359.76886963609|
|45276.664880487006|
|63915.201074141914|
|  21400.2783605274|
|28312.678256321517|
| 32868.53091173815|
| 22335.39213692309|
| 35595.03540649403|
| 40590.42106010997|
|24079.792073740893|
| 52627.82106424119|
| 73093.51123168226|
| 26210.94008856016|
|37861.880721824375|
| 26950.23062394332|
| 72935.34230545466|
| 23087.01074096215|
|54594.420105640456|
|55031.774173592195|
+------------------+
only showing top 20 rows



In [0]:
loan_df = loan_df.withColumn(
    "income_bucket",
    F.when(F.col("annual_income") < 20000, "LOW")
     .when((F.col("annual_income") >= 20000) & (F.col("annual_income") < 40000), "MEDIUM")
     .otherwise("HIGH")
)
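
The bucketing rule can be mirrored in plain Python to check the boundary cases. One behaviour worth knowing: in the Spark expression above, a NULL `annual_income` satisfies neither `when` condition and therefore falls through to `otherwise("HIGH")`. The sketch below uses a hypothetical `income_bucket` helper and, as an assumption of this example rather than something the notebook does, routes missing incomes to a separate `"UNKNOWN"` label.

```python
from typing import Optional


def income_bucket(annual_income: Optional[float]) -> str:
    """Mirror of the notebook's thresholds, with an explicit branch for missing values."""
    if annual_income is None:
        return "UNKNOWN"  # assumption: keep missing incomes out of the HIGH bucket
    if annual_income < 20000:
        return "LOW"
    if annual_income < 40000:
        return "MEDIUM"
    return "HIGH"


# Boundary checks: lower bounds are inclusive, upper bounds are exclusive.
examples = [19999.99, 20000, 39999.99, 40000, None]
labels = [income_bucket(x) for x in examples]
print(list(zip(examples, labels)))
```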


In [0]:
df_income = (
    loan_df
        .groupBy("income_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.sum("principal_amount").alias("exposure"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("income_bucket")
)

display(df_income)


income_bucket,nb_loans,exposure,default_rate
HIGH,83470,9367762683.126446,0.053576135138373
LOW,4311,480811081.9972477,0.1057759220598469
MEDIUM,48143,5435328638.26556,0.0711837650333381


##### Interpretation by Income Segment

`The LOW income segment exhibits the highest default rate (approximately 10.6%), indicating a structurally fragile borrower profile.`\
`Low-income customers typically have limited repayment capacity, high sensitivity to economic shocks, more unstable employment, frequent use of consumer and revolving credit, and little financial buffer.`\
`Although this segment represents a relatively small share of the portfolio in volume, it carries disproportionate risk and is therefore critical in PD analysis.`\
`The MEDIUM income segment shows an intermediate level of risk, with a default rate around 7.1%, reflecting more balanced but still vulnerable borrower profiles.`\
`The HIGH income segment is both the largest and the safest, with the lowest default rate (around 5.4%).`\
`High-income customers concentrate the majority of the exposure while exhibiting the strongest repayment capacity and the lowest observed credit risk.`
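
The idea of disproportionate risk can be made concrete by comparing each segment's share of loans with its share of defaults. The sketch below is plain Python over the figures in the income table above; default counts are recovered by rounding `nb_loans * default_rate`, which assumes the rates are exact ratios of counts.

```python
# Figures copied from the income table above: (nb_loans, default_rate).
segments = {
    "LOW": (4311, 0.1057759220598469),
    "MEDIUM": (48143, 0.0711837650333381),
    "HIGH": (83470, 0.053576135138373),
}

total_loans = sum(n for n, _ in segments.values())
# Default counts recovered from the rates (rate = defaults / nb_loans).
defaults = {s: round(n * r) for s, (n, r) in segments.items()}
total_defaults = sum(defaults.values())

loan_share = {s: n / total_loans for s, (n, _) in segments.items()}
default_share = {s: d / total_defaults for s, d in defaults.items()}

for s in segments:
    ratio = default_share[s] / loan_share[s]  # above 1 means over-represented in defaults
    print(f"{s:<7} loans {loan_share[s]:6.1%}  defaults {default_share[s]:6.1%}  ratio {ratio:4.2f}")
```

On these figures the LOW segment holds about 3.2% of loans but about 5.5% of defaults, while the HIGH segment is under-represented among defaults.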


### Credit Score Analysis

`The objective of this analysis is to assess the relationship between credit score and default risk.`\
`Credit score is a core indicator of borrower creditworthiness and is a central input in PD modeling and IFRS 9 risk frameworks.`\
`To facilitate analysis and interpretation, borrowers are grouped into discrete credit score buckets before evaluating default behavior across segments.`


In [0]:
loan_df = loan_df.withColumn(
    "credit_score_bucket",
    F.when(F.col("credit_score") < 500, "LOW")
     .when((F.col("credit_score") >= 500) & (F.col("credit_score") < 650), "MEDIUM")
     .otherwise("HIGH")
)
 

In [0]:
df_score = (
    loan_df
        .groupBy("credit_score_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.sum("principal_amount").alias("exposure"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("credit_score_bucket")
)

display(df_score)


credit_score_bucket,nb_loans,exposure,default_rate
HIGH,93357,10469064301.844631,0.0479449853787075
LOW,3533,406508565.2030793,0.1327483724879705
MEDIUM,39034,4408329536.341547,0.0873597376646



##### Interpretation of Results 

`Default probabilities decrease as credit score increases, from approximately 13% for low-score customers to around 5% for high-score customers.`\
`This monotonic pattern matches the expected relationship between credit score and default risk and indicates that the score buckets discriminate risk effectively.`
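
Because the bucket labels sort alphabetically (HIGH, LOW, MEDIUM), the decreasing pattern is easier to verify once the buckets are put in score order. The sketch below is plain Python over the rates in the table above; the `bucket_rank` mapping is a hypothetical helper introduced for this illustration.

```python
# Default rates copied from the credit score table above.
default_rate = {
    "HIGH": 0.0479449853787075,
    "LOW": 0.1327483724879705,
    "MEDIUM": 0.0873597376646,
}

# Explicit rank so buckets sort from lowest to highest score, not alphabetically.
bucket_rank = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}
ordered = sorted(default_rate, key=bucket_rank.get)

rates_in_order = [default_rate[b] for b in ordered]
# True when every bucket has a strictly lower default rate than the one before it.
is_monotonic = all(a > b for a, b in zip(rates_in_order, rates_in_order[1:]))

print(ordered, is_monotonic)
```

The same rank mapping could be used to order the result table itself, for instance by joining it to the aggregated DataFrame before sorting.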


### Analysis by Age Segment

`Borrowers are grouped into four age segments to analyze default behavior across life stages.`\
`Default rates are computed for each age segment to assess how credit risk evolves with borrower age.`


In [0]:

loan_df = loan_df.withColumn(
    "age_bucket",
    F.when(F.col("age") < 26, "18-25")
     .when((F.col("age") >= 26) & (F.col("age") < 41), "26-40")
     .when((F.col("age") >= 41) & (F.col("age") < 61), "41-60")
     .otherwise("60+")  # label kept as-is; the bucket actually covers ages 61 and above
)

In [0]:
df_age = (
    loan_df
        .groupBy("age_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.sum("principal_amount").alias("exposure"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("age_bucket")
)

display(df_age)

age_bucket,nb_loans,exposure,default_rate
18-25,14043,1580531228.275617,0.0596026490066225
26-40,41490,4626941610.363953,0.0616534104603518
41-60,55330,6267245068.422494,0.0617928790891017
60+,25061,2809184496.3272066,0.0614899644866525


##### Interpretation of Results 

`The analysis by age segment shows a relatively stable default rate around 6% across all age groups.`\
`Age does not appear to be a major discriminating factor of credit risk in this simulated portfolio, which is consistent with a framework where credit score, income, and product type are the primary risk drivers.`
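
One crude way to compare how strongly each dimension separates risk is the spread between its highest and lowest segment default rate. The sketch below is plain Python over rates copied from the result tables above; it ignores segment sizes, so it should be read as a first-pass indicator only, not as a formal measure of discriminating power.

```python
# Default rates copied from the result tables in this notebook.
rates_by_dimension = {
    "product_type": [0.0284207751120485, 0.0734384230326773, 0.1269102990033222],
    "credit_score_bucket": [0.0479449853787075, 0.1327483724879705, 0.0873597376646],
    "income_bucket": [0.053576135138373, 0.1057759220598469, 0.0711837650333381],
    "age_bucket": [0.0596026490066225, 0.0616534104603518,
                   0.0617928790891017, 0.0614899644866525],
}

# Spread = highest minus lowest segment default rate, in percentage points.
spread_pp = {dim: (max(r) - min(r)) * 100 for dim, r in rates_by_dimension.items()}

for dim, s in sorted(spread_pp.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{dim:<20} {s:5.2f} pp")
```

Age shows a spread of well under one percentage point, against several points for product type, credit score, and income.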

### Cross-Analysis: Product × Income - A cross-risk perspective

`This type of cross-analysis is systematically used in banking, as it combines product risk, customer risk, portfolio fragility, and profitability.`\
`It provides a consolidated view of where risk is concentrated and is typically one of the first analyses reviewed by risk management teams. This analysis is essential because:`

`A high-risk product sold to a fragile customer represents a significant risk.`\
`A high-risk product sold to a solid customer can be profitable.`\
`A low-risk product sold to a fragile customer is generally acceptable.`\
`A low-risk product sold to a solid customer represents the optimal risk-return profile.`


In [0]:
df_cross = (
    loan_df
        .groupBy("product_type", "income_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("product_type", "income_bucket")
)

display(df_cross)

product_type,income_bucket,nb_loans,default_rate
CONSO,HIGH,59368,0.0639907020617167
CONSO,LOW,3068,0.1277705345501955
CONSO,MEDIUM,34053,0.0850145361642146
IMMO,HIGH,23202,0.024480648219981
IMMO,LOW,1194,0.0452261306532663
IMMO,MEDIUM,13534,0.033692921530959
REVOLVING,HIGH,900,0.1166666666666666
REVOLVING,LOW,49,0.2040816326530612
REVOLVING,MEDIUM,556,0.1366906474820144


`The cross-analysis of product type and income reveals risk patterns that are consistent with typical retail banking portfolios.`\
`Low-income customers exhibit the highest default rates within every product category.`\
`Mortgage products remain the most stable, with default rates between roughly 2.4% and 4.5%, even in the low-income segment.`\
`Revolving credit concentrates the highest risk, with a default rate above 20% in the low-income segment, although that cell contains only 49 loans and the estimate is therefore imprecise.`\
`These results confirm that income is a major driver of credit risk and that product characteristics significantly amplify or mitigate this sensitivity.`
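
Some cells in the cross-table are small (REVOLVING / LOW has 49 loans), so their default rates carry considerable sampling uncertainty. As an illustration that is not part of the original analysis, the sketch below computes a Wilson score interval for a binomial proportion in plain Python; `wilson_interval` is a hypothetical helper, and the default counts are recovered from the table as `nb_loans * default_rate`.

```python
import math


def wilson_interval(defaults: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%)."""
    p = defaults / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# REVOLVING / LOW: 49 loans at a 20.4% default rate, i.e. 10 defaults.
low, high = wilson_interval(10, 49)
# CONSO / HIGH: 59368 loans at about 6.4%, i.e. 3799 defaults.
low_big, high_big = wilson_interval(3799, 59368)

print(f"REVOLVING/LOW: {low:.1%} to {high:.1%}")
print(f"CONSO/HIGH:    {low_big:.1%} to {high_big:.1%}")
```

The small cell yields an interval more than twenty percentage points wide, whereas the large cell is pinned down to within a fraction of a point, which is why comparisons involving sparse cells should be made with caution.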



### Vintage Analysis — Default Rate by Origination Year

`This analysis examines default rates by loan origination year to assess the evolution of credit risk across vintages.`\
`Vintage analysis helps identify changes in underwriting quality, portfolio risk appetite, and macroeconomic effects over time.`


In [0]:

loan_df = loan_df.withColumn(
    "vintage_year",
    F.year(F.col("origination_date"))
) 

In [0]:
df_vintage = (
    loan_df
        .groupBy("vintage_year")
        .agg(
            F.count("*").alias("nb_loans"),
            F.sum("principal_amount").alias("exposure"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("vintage_year")
)

display(df_vintage)

vintage_year,nb_loans,exposure,default_rate
2018,19441,2184313432.8183603,0.0615709068463556
2019,19425,2175795882.8671,0.0638352638352638
2020,19402,2209033283.798316,0.060973095557159
2021,19596,2184183006.3385305,0.0615431720759338
2022,19483,2197313542.245272,0.061489503669866
2023,19220,2162523147.690524,0.0596774193548387
2024,19357,2170740107.631158,0.0611665030738234


##### Interpretation of Results 

`The analysis by origination year shows stable credit risk over time.`\
`Default rates remain within a narrow range (approximately 6.0%–6.4%), with no evidence of structural deterioration in the portfolio.`\
`In a real portfolio, recent vintages would be expected to show lower default rates because of their shorter seasoning period; here, 2023 has the lowest rate (about 6.0%) but 2024 (about 6.1%) is in line with earlier years, so no clear seasoning effect is visible.`\
`Overall, the results indicate a consistent portfolio, with no signs of abnormal credit policy shifts or significant macroeconomic shocks.`
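
The range of vintage default rates, and where the most recent years sit within it, can be checked directly from the table. The sketch below is plain Python over the rates copied from the vintage table above.

```python
# Default rates copied from the vintage table above.
vintage_rate = {
    2018: 0.0615709068463556,
    2019: 0.0638352638352638,
    2020: 0.060973095557159,
    2021: 0.0615431720759338,
    2022: 0.061489503669866,
    2023: 0.0596774193548387,
    2024: 0.0611665030738234,
}

lowest_year = min(vintage_rate, key=vintage_rate.get)
highest_year = max(vintage_rate, key=vintage_rate.get)
# Width of the range in percentage points.
range_pp = (vintage_rate[highest_year] - vintage_rate[lowest_year]) * 100

# Rank vintages from lowest to highest default rate to see where recent years fall.
ranked_years = sorted(vintage_rate, key=vintage_rate.get)

print(lowest_year, highest_year, f"{range_pp:.2f} pp", ranked_years)
```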


### Cross-Analysis — Credit Score × Income

`This analysis examines the combined effect of credit score and income on default risk.`\
`By crossing these two key borrower dimensions, it highlights how repayment capacity and creditworthiness jointly shape credit risk levels.`


In [0]:
df_score_income = (
    loan_df
        .groupBy("credit_score_bucket", "income_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("credit_score_bucket", "income_bucket")
)

display(df_score_income)


credit_score_bucket,income_bucket,nb_loans,default_rate
HIGH,HIGH,69529,0.0479799795768672
HIGH,LOW,1,0.0
HIGH,MEDIUM,23827,0.0478448818567171
LOW,LOW,708,0.1242937853107344
LOW,MEDIUM,2825,0.1348672566371681
MEDIUM,HIGH,13941,0.0814862635392009
MEDIUM,LOW,3602,0.1021654636313159
MEDIUM,MEDIUM,21491,0.0886882881206086
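
When two dimensions are crossed, some combinations become very sparse or disappear entirely, and their default rates should not be interpreted on their own. The sketch below is plain Python over the rows of the table above; it flags cells below a minimum loan count and lists combinations absent from the result. `MIN_LOANS` is a hypothetical threshold chosen only for illustration.

```python
# Rows copied from the credit score x income table above:
# (credit_score_bucket, income_bucket, nb_loans, default_rate).
cells = [
    ("HIGH", "HIGH", 69529, 0.0479799795768672),
    ("HIGH", "LOW", 1, 0.0),
    ("HIGH", "MEDIUM", 23827, 0.0478448818567171),
    ("LOW", "LOW", 708, 0.1242937853107344),
    ("LOW", "MEDIUM", 2825, 0.1348672566371681),
    ("MEDIUM", "HIGH", 13941, 0.0814862635392009),
    ("MEDIUM", "LOW", 3602, 0.1021654636313159),
    ("MEDIUM", "MEDIUM", 21491, 0.0886882881206086),
]

MIN_LOANS = 100  # hypothetical threshold, for illustration only

# Cells with too few loans for a meaningful default rate.
sparse = [(s, i, n) for s, i, n, _ in cells if n < MIN_LOANS]

# Combinations that never occur are absent from a groupBy result altogether.
buckets = ["LOW", "MEDIUM", "HIGH"]
observed = {(s, i) for s, i, _, _ in cells}
missing = [(s, i) for s in buckets for i in buckets if (s, i) not in observed]

print("sparse:", sparse)
print("missing:", missing)
```

Here the HIGH score / LOW income cell rests on a single loan, and the LOW score / HIGH income combination does not appear at all.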



### Cross-Analysis — Credit Score × Age

`This analysis explores the interaction between credit score and borrower age in explaining default risk.`\
`It helps assess whether creditworthiness dominates age effects or whether specific life-stage patterns emerge in the portfolio.`


In [0]:
df_score_age = (
    loan_df
        .groupBy("credit_score_bucket", "age_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("credit_score_bucket", "age_bucket")
)

display(df_score_age)




credit_score_bucket,age_bucket,nb_loans,default_rate
HIGH,18-25,9624,0.0475893599334995
HIGH,26-40,28540,0.0491941135248773
HIGH,41-60,37880,0.0479408658922914
HIGH,60+,17313,0.0460925316236354
LOW,18-25,391,0.1048593350383631
LOW,26-40,1041,0.138328530259366
LOW,41-60,1412,0.1352691218130311
LOW,60+,689,0.1349782293178519
MEDIUM,18-25,4028,0.0839126117179741
MEDIUM,26-40,11909,0.0848098077084558


### Cross-Analysis — Income × Age

`This analysis examines the combined effect of income level and borrower age on default risk.`\
`It helps identify whether income-related risk patterns vary across different life stages within the portfolio.`


In [0]:
df_income_age = (
    loan_df
        .groupBy("income_bucket", "age_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("income_bucket", "age_bucket")
)

display(df_income_age)


income_bucket,age_bucket,nb_loans,default_rate
HIGH,18-25,8631,0.0530645348163596
HIGH,26-40,25446,0.0545075846891456
HIGH,41-60,34015,0.0533000146993973
HIGH,60+,15378,0.0529327610872675
LOW,18-25,420,0.1095238095238095
LOW,26-40,1419,0.1035940803382663
LOW,41-60,1732,0.1050808314087759
LOW,60+,740,0.1094594594594594
MEDIUM,18-25,4992,0.0667067307692307
MEDIUM,26-40,14625,0.070017094017094


### Cross-Analysis — Product × Credit Score × Income

`This analysis combines product type, credit score, and income level to provide a multidimensional view of credit risk.`\
`It highlights how product risk interacts with borrower creditworthiness and repayment capacity, allowing a granular identification of high-risk and high-value segments within the portfolio.`


In [0]:
df_prod_score_income = (
    loan_df
        .groupBy("product_type", "credit_score_bucket", "income_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("product_type", "credit_score_bucket", "income_bucket")
)

display(df_prod_score_income)


product_type,credit_score_bucket,income_bucket,nb_loans,default_rate
CONSO,HIGH,HIGH,49421,0.0569798263895914
CONSO,HIGH,LOW,1,0.0
CONSO,HIGH,MEDIUM,16938,0.0579171094580233
CONSO,LOW,LOW,484,0.1570247933884297
CONSO,LOW,MEDIUM,1994,0.1544633901705115
CONSO,MEDIUM,HIGH,9947,0.0988237659595858
CONSO,MEDIUM,LOW,2583,0.1223383662408052
CONSO,MEDIUM,MEDIUM,15121,0.1062099067521989
IMMO,HIGH,HIGH,19353,0.0224254637523898
IMMO,HIGH,MEDIUM,6605,0.020893262679788


### Analysis — Exposure Buckets

`This analysis segments loans into exposure buckets to examine how credit risk varies with loan size.`\
`It helps assess whether higher exposures concentrate more risk or whether risk remains primarily driven by borrower and product characteristics.`


In [0]:
# bucket creation 

loan_df = loan_df.withColumn(
    "exposure_bucket",
    F.when(F.col("principal_amount") < 5000, "<5K")
     .when((F.col("principal_amount") >= 5000) & (F.col("principal_amount") < 20000), "5K-20K")
     .when((F.col("principal_amount") >= 20000) & (F.col("principal_amount") < 100000), "20K-100K")
     .otherwise("100K+")
)


In [0]:
# Aggregate loan count, exposure, and default rate by exposure bucket
df_exposure_bucket = (
    loan_df
        .groupBy("exposure_bucket")
        .agg(
            F.count("*").alias("nb_loans"),
            F.sum("principal_amount").alias("exposure"),
            F.avg("default_flag").alias("default_rate")
        )
        .orderBy("exposure_bucket")
)

display(df_exposure_bucket)


exposure_bucket,nb_loans,exposure,default_rate
100K+,34989,13018866117.666883,0.0283517676984195
20K-100K,50419,1723773630.3455698,0.0709653106963644
5K-20K,41210,511289306.9178891,0.0741567580684299
<5K,9306,29973348.45888235,0.0783365570599613
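
To relate loan size to risk concentration, the figures in the table above can be turned into exposure shares, average loan sizes, and an approximate exposure-weighted default rate. The sketch below is plain Python over values copied from the table; the weighted rate treats the default rate as uniform within each bucket, which is an assumption of this illustration.

```python
# Rows copied from the exposure table above: (nb_loans, exposure, default_rate).
buckets = {
    "<5K": (9306, 29973348.45888235, 0.0783365570599613),
    "5K-20K": (41210, 511289306.9178891, 0.0741567580684299),
    "20K-100K": (50419, 1723773630.3455698, 0.0709653106963644),
    "100K+": (34989, 13018866117.666883, 0.0283517676984195),
}

total_exposure = sum(e for _, e, _ in buckets.values())
exposure_share = {b: e / total_exposure for b, (_, e, _) in buckets.items()}
avg_loan_size = {b: e / n for b, (n, e, _) in buckets.items()}

# Approximate exposure-weighted default rate (bucket rates weighted by exposure).
exposure_weighted_rate = sum(e * r for _, e, r in buckets.values()) / total_exposure

for b in buckets:
    print(f"{b:<9} exposure share {exposure_share[b]:6.1%}  avg size {avg_loan_size[b]:12,.0f}")
print(f"exposure-weighted default rate: {exposure_weighted_rate:.2%}")
```

Because the largest bucket dominates total exposure and has the lowest default rate, the exposure-weighted rate comes out well below the loan-count-based rates reported elsewhere in this notebook.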


