# Credit Risk Analysis


In this mini project, I work as a data scientist at a finance company that provides loans to customers. The company earns profit from loan interest but incurs losses when customers fail to repay; such customers are known as defaulters.
The main goal of this task is to build a classification model that can predict whether an applicant is likely to repay or default on the loan.
By identifying risky applicants early, the company can minimize potential losses and make smarter lending decisions.


# Attribute Information
|Column | Description |
|-|-|
|person_age | Age |
|person_income | Annual income |
|person_home_ownership | Home ownership status |
|person_emp_length | Employment length |
|loan_intent | Loan intent (purpose of the loan) |
|loan_grade | Loan grade |
|loan_amnt | Loan amount |
|loan_int_rate | Interest rate |
|loan_status | Loan status (0 = non-default, 1 = default) |
|loan_percent_income | Percent income (share of income used for the loan) |
|cb_person_default_on_file | Historical default (whether the customer has a past default on their credit file) |
|cb_person_cred_hist_length | Credit history length |

# Import Libraries

In [40]:
import pandas as pd
import yaml


In [41]:
class ConfigManager:
    def __init__(self, config_path: str) -> None:
        self.config_path = config_path
        self.config = None

    def load_config(self) -> dict:
        """ 
        Load configuration parameters from a YAML file.

        Returns:
            dict: Configuration parameters.
        """
        with open(self.config_path, 'r') as file:
            self.config = yaml.safe_load(file)
        return self.config

    def update_config(self, key: str, value: object) -> None:
        """
        Update a configuration parameter and save it back to the YAML file.
        Args:
            key (str): The configuration key to update (can be nested using dot notation).
            value (any): The new value for the configuration key.
        Returns:
            None
        """
        if self.config is None:
            raise RuntimeError("Config not loaded. Please load the config before updating.")

        # split if the key is nested
        keys = key.split(".")
        cfg = self.config

        for k in keys[:-1]:
            # auto create dict if key doesn't exist
            if k not in cfg or not isinstance(cfg[k], dict):
                cfg[k] = {}  
            cfg = cfg[k]

        cfg[keys[-1]] = value

        with open(self.config_path, 'w') as file:
            yaml.safe_dump(self.config, file)

        print(f"Updated: {key} = {value}")
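
The dot-notation handling in `update_config` can be tried in isolation, without touching a YAML file. The sketch below repeats the same traversal on an in-memory dictionary; `set_nested` and the `example` dictionary are hypothetical names used only for illustration.

```python
def set_nested(cfg: dict, dotted_key: str, value: object) -> dict:
    """Set cfg[a][b][c] = value for a dotted key 'a.b.c', creating dicts as needed."""
    keys = dotted_key.split(".")
    node = cfg
    for k in keys[:-1]:
        # Create (or overwrite) intermediate levels that are not dictionaries
        if k not in node or not isinstance(node[k], dict):
            node[k] = {}
        node = node[k]
    node[keys[-1]] = value
    return cfg


# Hypothetical starting config with one nested section
example = {"path": {"raw_data": "data/raw/credit_risk_dataset.csv"}}

# Add a key to an existing section, then create a new nested section
set_nested(example, "path.processed_data", "data/processed")
set_nested(example, "schema.person_age.maximum", 100)
print(example)
```

Note that an intermediate key holding a non-dictionary value is silently replaced by an empty dictionary, so a mistyped key path can discard an existing setting.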


In [42]:
# Use the configuration manager
config_manager = ConfigManager('config/config.yaml')
# Load the config
config = config_manager.load_config()
config

{'features': {'categorical': ['person_home_ownership',
   'loan_intent',
   'loan_grade'],
  'numerical': ['person_age',
   'person_income',
   'person_emp_length',
   'loan_amnt',
   'loan_int_rate',
   'loan_status',
   'loan_percent_income',
   'cb_person_cred_hist_length']},
 'path': {'path_config': 'config/config.yaml',
  'raw_data': 'data/raw/credit_risk_dataset.csv',
  'processed_data': 'data/processed'},
 'schema': {'cb_person_cred_hist_length': {'minimum': 0, 'type': 'integer'},
  'cb_person_default_on_file': {'enum': ['Y', 'N'], 'type': 'string'},
  'loan_amnt': {'minimum': 100, 'type': 'integer'},
  'loan_grade': {'enum': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
   'type': 'string'},
  'loan_int_rate': {'minimum': 0.0, 'type': 'decimal'},
  'loan_intent': {'enum': ['PERSONAL',
    'EDUCATION',
    'MEDICAL',
    'VENTURE',
    'HOMEIMPROVEMENT',
    'DEBTCONSOLIDATION'],
   'type': 'string'},
  'loan_percent_income': {'minimum': 0.0, 'type': 'decimal'},
  'loan_status': {'enum':

In [43]:
class DataLoaderAndValidator:
    def __init__(self, config: dict) -> None:
        self.data_path = config.get('path', {}).get('raw_data', '')
        self.schema = config.get('schema', {})
        self.data = None

    def load_data(self) -> pd.DataFrame:
        """ 
        Load data from a CSV file.

        Returns:
            pd.DataFrame: Loaded data.
        """
        pd.set_option('display.max_colwidth', None)
        self.data = pd.read_csv(self.data_path)


        # Create missing report of data columns
        missing_report = self.data.isna().sum().to_frame(name='missing_count').reset_index()
        missing_report['missing_percentage'] = ((missing_report['missing_count'] / len(self.data)) * 100).round(2).astype(str) + '%'
        # Create data types report
        result = list()
        for col in self.data.columns:
            result.append([col, self.data[col].dtypes, self.data[col].nunique(), self.data[col].unique()])
        data_types_report = pd.DataFrame(result, columns=['column_name', 'data_type', 'unique_count', 'unique_values'])
        # Merge reports
        data_reports = data_types_report.merge(missing_report, left_on='column_name', right_on='index').drop(columns=['index'])

        print(f"Duplicated rows: {self.data.duplicated().sum()}")
        print(f"Data shape: {self.data.shape}")
        print()
        print("Data Overview:")
        print(50*"-")
        display(
            self.data.head(),
            data_reports,
            self.data.describe(include='object'),
            self.data.describe(),
        )

        return self.data
    
    def validate_data(self) -> bool:
        """ 
        Validate data against the schema.

        Returns:
            bool: True if data is valid, False otherwise.
        """
        TYPE_MAPPING = {
            'integer': ['int64', 'int32'],
            'decimal': ['float64', 'float32'],
            'string': ['object', 'string'],
        }

        # Check if data is not loaded
        if self.data is None:
            raise RuntimeError("Data not loaded. Please load the data before validation.")
        
        # Store validation errors
        errors = list()

        # Compare the schema's columns with the dataset's columns
        schema_cols = set(self.schema.keys())
        data_cols = set(self.data.columns)

        missing_in_data = schema_cols - data_cols # columns in schema but not in data
        extra_in_data = data_cols - schema_cols # columns in data but not in schema

        if missing_in_data:
            errors.append(f"Columns in schema but missing in data: {missing_in_data}")
        if extra_in_data:
            errors.append(f"Columns in data but not in schema: {extra_in_data}")
        
        for column, column_schema in self.schema.items():
            if column not in self.data.columns:
                continue # already reported as missing

            # Check data type
            data_type = self.data[column].dtype
            expected_type = column_schema.get('type', None)
            if expected_type in TYPE_MAPPING:
                if data_type not in TYPE_MAPPING[expected_type]:
                    errors.append(f"Column '{column}' has type '{data_type}', expected '{expected_type}'")
            else:
                errors.append(f"Column '{column}' has unrecognized schema type '{expected_type}'")

            # Check minimum value
            if 'minimum' in column_schema:
                min_value = column_schema['minimum']
                if not self.data[column].dropna().ge(min_value).all():
                    errors.append(f"Column '{column}' has values below minimum of {min_value}")
            
            # Check maximum value
            if 'maximum' in column_schema:
                max_value = column_schema['maximum']
                if not self.data[column].dropna().le(max_value).all():
                    errors.append(f"Column '{column}' has values above maximum of {max_value}")
            
            # Check enum values
            if 'enum' in column_schema:
                enum_values = set(column_schema['enum'])
                data_values = set(self.data[column].dropna().unique())
                invalid_values = data_values - enum_values
                if invalid_values:
                    errors.append(f"Column '{column}' has invalid enum values: {invalid_values}")

        if errors:
            print("Validation errors found:")
            for error in errors:
                print(f" - {error}")
            return False
        else:
            print("Data validation passed.")
            return True

In [44]:
alala_uye = DataLoaderAndValidator(config)
df = alala_uye.load_data()

Duplicated rows: 165
Data shape: (32581, 12)

Data Overview:
--------------------------------------------------


Unnamed: 0,person_age,person_income,person_home_ownership,person_emp_length,loan_intent,loan_grade,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_default_on_file,cb_person_cred_hist_length
0,22,59000,RENT,123.0,PERSONAL,D,35000,16.02,1,0.59,Y,3
1,21,9600,OWN,5.0,EDUCATION,B,1000,11.14,0,0.1,N,2
2,25,9600,MORTGAGE,1.0,MEDICAL,C,5500,12.87,1,0.57,N,3
3,23,65500,RENT,4.0,MEDICAL,C,35000,15.23,1,0.53,N,2
4,24,54400,RENT,8.0,MEDICAL,C,35000,14.27,1,0.55,Y,4


Unnamed: 0,column_name,data_type,unique_count,unique_values,missing_count,missing_percentage
0,person_age,int64,58,"[22, 21, 25, 23, 24, 26, 144, 123, 20, 32, 34, 29, 33, 28, 35, 31, 27, 30, 36, 40, 50, 45, 37, 39, 44, 43, 41, 46, 38, 47, 42, 48, 49, 58, 65, 51, 53, 66, 61, 54, 57, 59, 62, 60, 55, 52, 64, 70, 78, 69, 56, 73, 63, 94, 80, 84, 76, 67]",0,0.0%
1,person_income,int64,4295,"[59000, 9600, 65500, 54400, 9900, 77100, 78956, 83000, 10000, 85000, 95000, 108160, 115000, 500000, 120000, 92111, 113000, 10800, 162500, 137000, 65000, 10980, 80000, 67746, 11000, 11389, 11520, 306000, 300000, 12000, 48000, 64000, 75000, 71500, 62050, 80690, 66300, 89028, 78000, 92004, 97000, 280000, 277104, 277000, 128000, 131000, 275000, 263000, 221850, 70000, 260000, 259000, 255000, 250000, 56950, 88000, 83004, 100000, 110000, 108000, 151200, 69000, 240000, 73200, 73399, 62500, 12360, 60000, 234000, 221004, 232500, 230000, 12600, 42500, 41000, 12816, 12960, 226000, 225000, 213000, 12996, 42000, 42360, 49000, 49464, 50000, 44000, 50004, 52000, 13200, 220000, 216000, 215000, 46000, 53000, 47000, 210000, 54996, 55000, 55164, ...]",0,0.0%
2,person_home_ownership,object,4,"[RENT, OWN, MORTGAGE, OTHER]",0,0.0%
3,person_emp_length,float64,36,"[123.0, 5.0, 1.0, 4.0, 8.0, 2.0, 6.0, 7.0, 0.0, 9.0, 3.0, 10.0, nan, 11.0, 18.0, 12.0, 17.0, 14.0, 16.0, 13.0, 19.0, 15.0, 20.0, 22.0, 21.0, 24.0, 23.0, 26.0, 25.0, 27.0, 28.0, 31.0, 41.0, 34.0, 29.0, 38.0, 30.0]",895,2.75%
4,loan_intent,object,6,"[PERSONAL, EDUCATION, MEDICAL, VENTURE, HOMEIMPROVEMENT, DEBTCONSOLIDATION]",0,0.0%
5,loan_grade,object,7,"[D, B, C, A, E, F, G]",0,0.0%
6,loan_amnt,int64,753,"[35000, 1000, 5500, 2500, 1600, 4500, 30000, 1750, 34800, 34000, 1500, 33950, 33000, 4575, 1400, 32500, 4000, 2000, 32000, 31050, 24250, 7800, 20000, 10000, 25000, 18000, 12000, 29100, 28000, 9600, 3000, 6100, 4200, 4750, 4800, 2700, 27600, 3250, 27500, 27050, 27000, 26000, 25600, 25475, 21600, 11900, 25300, 3650, 6000, 2400, 3600, 7500, 4950, 21000, 16000, 22000, 7750, 24000, 15000, 15500, 9000, 23050, 5375, 6250, 5000, 2100, 14000, 6200, 9950, 4475, 2600, 8000, 4600, 3500, 7200, 8800, 3175, 2800, 13000, 1800, 3300, 3200, 2275, 5600, 3625, 4375, 24750, 24500, 3900, 13750, 15250, 24150, 2250, 4975, 4900, 23975, 23750, 23600, 23575, 5400, ...]",0,0.0%
7,loan_int_rate,float64,348,"[16.02, 11.14, 12.87, 15.23, 14.27, 7.14, 12.42, 11.11, 8.9, 14.74, 10.37, 8.63, 7.9, 18.39, 10.65, 20.25, 18.25, 10.99, 7.49, 16.77, 17.58, 7.29, 14.54, 12.68, 17.74, 9.32, 9.99, 12.84, 11.12, 6.62, 14.17, 13.85, 13.49, 7.51, 16.89, nan, 17.99, 12.69, 7.88, 19.41, 10.38, 15.33, 16.45, 18.62, 15.96, 11.48, 5.99, 11.58, 15.7, 15.99, 14.84, 14.42, 6.99, 13.61, 9.91, 13.48, 12.98, 13.57, 15.68, 13.06, 15.62, 11.71, 8.88, 12.18, 13.99, 5.42, 12.73, 11.49, 19.91, 11.83, 14.59, 9.64, 16.35, 18.67, 10.08, 10.36, 12.23, 16.07, 14.22, 14.79, 13.22, 11.86, 13.43, 15.28, 17.93, 9.25, 10.62, 18.43, 11.36, 15.65, 13.04, 17.04, 14.83, 14.65, 16.82, 10.25, 14.96, 11.99, 8.49, 6.17, ...]",3116,9.56%
8,loan_status,int64,2,"[1, 0]",0,0.0%
9,loan_percent_income,float64,77,"[0.59, 0.1, 0.57, 0.53, 0.55, 0.25, 0.45, 0.44, 0.42, 0.16, 0.41, 0.37, 0.32, 0.3, 0.06, 0.29, 0.31, 0.22, 0.52, 0.14, 0.49, 0.13, 0.5, 0.35, 0.17, 0.27, 0.33, 0.08, 0.03, 0.21, 0.63, 0.47, 0.4, 0.07, 0.38, 0.34, 0.04, 0.23, 0.15, 0.11, 0.43, 0.51, 0.02, 0.28, 0.26, 0.19, 0.39, 0.09, 0.05, 0.61, 0.18, 0.6, 0.01, 0.48, 0.12, 0.54, 0.56, 0.46, 0.36, 0.24, 0.2, 0.72, 0.64, 0.69, 0.77, 0.83, 0.65, 0.67, 0.58, 0.71, 0.68, 0.7, 0.66, 0.0, 0.76, 0.62, 0.78]",0,0.0%


Unnamed: 0,person_home_ownership,loan_intent,loan_grade,cb_person_default_on_file
count,32581,32581,32581,32581
unique,4,6,7,2
top,RENT,EDUCATION,A,N
freq,16446,6453,10777,26836


Unnamed: 0,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
count,32581.0,32581.0,31686.0,32581.0,29465.0,32581.0,32581.0,32581.0
mean,27.7346,66074.85,4.789686,9589.371106,11.011695,0.218164,0.170203,5.804211
std,6.348078,61983.12,4.14263,6322.086646,3.240459,0.413006,0.106782,4.055001
min,20.0,4000.0,0.0,500.0,5.42,0.0,0.0,2.0
25%,23.0,38500.0,2.0,5000.0,7.9,0.0,0.09,3.0
50%,26.0,55000.0,4.0,8000.0,10.99,0.0,0.15,4.0
75%,30.0,79200.0,7.0,12200.0,13.47,0.0,0.23,8.0
max,144.0,6000000.0,123.0,35000.0,23.22,1.0,0.83,30.0
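
The missing-value percentages in the report and the `count` row of the numeric summary can be cross-checked with simple arithmetic. The sketch below uses only the figures printed above (32581 rows, 895 and 3116 missing values); the variable names are illustrative.

```python
n_rows = 32581  # from "Data shape: (32581, 12)"
missing_counts = {"person_emp_length": 895, "loan_int_rate": 3116}

report = {}
for column, missing in missing_counts.items():
    pct = round(missing / n_rows * 100, 2)  # same formula as missing_percentage
    non_missing = n_rows - missing          # should match describe()'s count row
    report[column] = (pct, non_missing)
    print(f"{column}: {pct}% missing, {non_missing} non-missing values")
```

Both results agree with the outputs: 2.75% and 9.56% missing, with counts of 31686 and 29465 in the summary table.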


In [45]:
# Validate data
alala_uye.validate_data()

Validation errors found:
 - Column 'person_age' has values above maximum of 100
 - Column 'person_emp_length' has type 'float64', expected 'integer'
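
Both errors can be reproduced on a tiny frame. A column that contains missing values is read by pandas as `float64`, which is why `person_emp_length` fails the `integer` check; the nullable `Int64` dtype is one possible way to keep integers alongside missing values. The sketch below is only an illustration on made-up rows, not the cleaning step used in this project.

```python
import pandas as pd

# Hypothetical miniature of two columns from the dataset
sample = pd.DataFrame({
    "person_age": [22, 144, 35],
    "person_emp_length": [123.0, None, 5.0],
})

# The missing value forces the column to float64
print(sample["person_emp_length"].dtype)

# pandas' nullable integer dtype stores whole numbers together with <NA>
emp_length_nullable = sample["person_emp_length"].astype("Int64")
print(emp_length_nullable.dtype)

# Rows that violate the schema maximum of 100 for person_age
too_old = sample[sample["person_age"] > 100]
print(too_old)
```

Whether to drop, cap, or impute such rows is a separate modelling decision; the snippet only shows how to locate them.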


In [46]:
class DataPreparation:
    def __init__(self, data: pd.DataFrame, config: dict)-> None:
        self.data = data
        self.features = config.get('features', {})
        self.target = config.get('target', [])
        self.path_processed_data = config.get('path', {}).get('processed_data', '')
    
    def split_input_output(self) -> tuple[pd.DataFrame, pd.Series]:
        """ 
        Split data into input features and target variable.

        Returns:
            tuple[pd.DataFrame, pd.Series]: Input features and target variable.
        """
        features_columns = list() 

        for _, value_list in self.features.items():
            for col in value_list:
                if col not in self.data.columns:
                    raise ValueError(f"Feature column '{col}' not found in data.")
                features_columns.append(col)

        # ------ Print features data types
        print("Input Features Data Types:")
        print(f"Input Features Categorical: {self.features.get('categorical', [])}")
        print(f"Input Features Numerical: {self.features.get('numerical', [])}")
        print(100*"-")
        # ------ Extract input features and target variable
        X = self.data[features_columns]
        y = self.data[self.target]

        print(f"Input features shape: {X.shape}")
        print(f"Target variable shape: {y.shape}")
        return X, y
    
    def split_train_test(
        self, 
        X: pd.DataFrame, 
        y: pd.Series, 
        test_size: float = 0.2, 
        random_state: int = 42
    ) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]:
        """ 
        Split data into training and testing sets.

        Args:
            test_size (float): Proportion of the dataset to include in the test split.
            random_state (int): Random seed for reproducibility.
        Returns:
            tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]: Training and testing sets.
        """
        from sklearn.model_selection import train_test_split

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state
        )

        print(f"Training set shape: X_train: {X_train.shape}, y_train: {y_train.shape}")
        print(f"Testing set shape: X_test: {X_test.shape}, y_test: {y_test.shape}")
        return X_train, X_test, y_train, y_test
    
    def serialized_data(self, data: object, name: str) -> None:
        """ 
        Serialize data to a file using joblib. 
        Args:
            data (any): Data to serialize.
            name (str): Name of the file to save the serialized data.
        Returns:
            None
        """
        import joblib
        file_path = f"{self.path_processed_data}/{name}.pkl"
        joblib.dump(data, file_path)
        print(f"Serialized data saved to {file_path}")

    def deserialize_data(self, name: str) -> object:
        """ 
        Deserialize data from a file using joblib.
        Args:
            name (str): Name of the file to load the serialized data from.
        Returns:
            any: Deserialized data.
        """
        import joblib
        file_path = f"{self.path_processed_data}/{name}.pkl"
        data = joblib.load(file_path)
        print(f"Deserialized data loaded from {file_path}")
        return data

In [47]:
# Split input and output
data_preparer = DataPreparation(df, config)
X, y = data_preparer.split_input_output()

Input Features Data Types:
Input Features Categorical: ['person_home_ownership', 'loan_intent', 'loan_grade']
Input Features Numerical: ['person_age', 'person_income', 'person_emp_length', 'loan_amnt', 'loan_int_rate', 'loan_status', 'loan_percent_income', 'cb_person_cred_hist_length']
----------------------------------------------------------------------------------------------------
Input features shape: (32581, 11)
Target variable shape: (32581, 1)
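
The printed numerical feature list contains `loan_status`, which the attribute table describes as the label. If the `target` entry of the config is also `loan_status` (an assumption, since the config output above is truncated before that key), the label would be present in `X` and leak into the model. A minimal check, with the hypothetical helper `find_leaked_targets`, could look like this:

```python
# Feature lists copied from the printed config
features = {
    "categorical": ["person_home_ownership", "loan_intent", "loan_grade"],
    "numerical": [
        "person_age", "person_income", "person_emp_length", "loan_amnt",
        "loan_int_rate", "loan_status", "loan_percent_income",
        "cb_person_cred_hist_length",
    ],
}
target = ["loan_status"]  # assumption: not visible in the truncated config output


def find_leaked_targets(features: dict, target: list) -> list:
    """Return target columns that also appear among the input features."""
    feature_columns = [col for cols in features.values() for col in cols]
    return sorted(set(feature_columns) & set(target))


leaked = find_leaked_targets(features, target)
print(f"Target columns found among the features: {leaked}")
```

An empty list would mean the features and the target are disjoint; under the stated assumption, the check reports `loan_status`.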


In [48]:
# Split data into a training set and a hold-out set (divided into validation and test sets in the next cell)
X_train, X_not_train, y_train, y_not_train = data_preparer.split_train_test(X=X, y=y)

Training set shape: X_train: (26064, 11), y_train: (26064, 1)
Testing set shape: X_test: (6517, 11), y_test: (6517, 1)


In [49]:
# Split data into validation and test sets
X_valid, X_test, y_valid, y_test = data_preparer.split_train_test(X=X_not_train, y=y_not_train, test_size=0.5, random_state=42)

Training set shape: X_train: (3258, 11), y_train: (3258, 1)
Testing set shape: X_test: (3259, 11), y_test: (3259, 1)
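
The two calls produce an approximately 80/10/10 split. Because the helper reuses the same print labels, the "Training set" line in the second output is actually the validation set. The row counts can be reproduced by hand, assuming that `train_test_split` rounds the test size up and gives the remaining rows to the other part when only `test_size` is passed (my understanding of scikit-learn's behaviour). `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_first, n_second) for a fractional test_size."""
    n_second = math.ceil(test_size * n_samples)  # test part is rounded up
    n_first = n_samples - n_second               # remainder goes to the other part
    return n_first, n_second


n_train, n_holdout = split_sizes(32581, 0.2)     # first call: train vs. hold-out
n_valid, n_test = split_sizes(n_holdout, 0.5)    # second call: validation vs. test

print(n_train, n_valid, n_test)
```

The resulting sizes match the shapes printed above: 26064 training rows, 3258 validation rows, and 3259 test rows.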


In [50]:
# Serialize the processed training data
data_preparer.serialized_data(X_train, 'X_train')
data_preparer.serialized_data(y_train, 'y_train')
# Serialize the processed validation data
data_preparer.serialized_data(X_valid, 'X_valid')
data_preparer.serialized_data(y_valid, 'y_valid')
# Serialize the processed test data
data_preparer.serialized_data(X_test, 'X_test')
data_preparer.serialized_data(y_test, 'y_test')

Serialized data saved to data/processed/X_train.pkl
Serialized data saved to data/processed/y_train.pkl
Serialized data saved to data/processed/X_valid.pkl
Serialized data saved to data/processed/y_valid.pkl
Serialized data saved to data/processed/X_test.pkl
Serialized data saved to data/processed/y_test.pkl
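
The calls above rely on `data/processed` already existing; as far as I know, `joblib.dump` does not create missing directories. The sketch below shows one way to build the path with `pathlib` and create the folder first. It uses the standard-library `pickle` as a stand-in for `joblib` so that it runs without extra dependencies, and `save_object` / `load_object` are hypothetical helpers, not part of the notebook's class.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj: object, directory: Path, name: str) -> Path:
    """Pickle obj to <directory>/<name>.pkl, creating the directory if needed."""
    directory.mkdir(parents=True, exist_ok=True)
    file_path = directory / f"{name}.pkl"
    with open(file_path, "wb") as fh:
        pickle.dump(obj, fh)
    return file_path


def load_object(directory: Path, name: str) -> object:
    """Load the object stored in <directory>/<name>.pkl."""
    with open(directory / f"{name}.pkl", "rb") as fh:
        return pickle.load(fh)


# Round trip inside a temporary folder so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    processed_dir = Path(tmp) / "data" / "processed"
    saved_path = save_object({"rows": 26064}, processed_dir, "X_train_meta")
    restored = load_object(processed_dir, "X_train_meta")

print(saved_path.name, restored)
```

With `joblib`, the same idea would amount to calling `mkdir` on the target directory before `joblib.dump`.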


In [51]:
# Deserialize the processed training data
X_train = data_preparer.deserialize_data('X_train')
y_train = data_preparer.deserialize_data('y_train')
# Deserialize the processed validation data
X_valid = data_preparer.deserialize_data('X_valid')
y_valid = data_preparer.deserialize_data('y_valid')
# Deserialize the processed test data
X_test = data_preparer.deserialize_data('X_test')
y_test = data_preparer.deserialize_data('y_test')

Deserialized data loaded from data/processed/X_train.pkl
Deserialized data loaded from data/processed/y_train.pkl
Deserialized data loaded from data/processed/X_valid.pkl
Deserialized data loaded from data/processed/y_valid.pkl
Deserialized data loaded from data/processed/X_test.pkl
Deserialized data loaded from data/processed/y_test.pkl


In [52]:
X_train

Unnamed: 0,person_home_ownership,loan_intent,loan_grade,person_age,person_income,person_emp_length,loan_amnt,loan_int_rate,loan_status,loan_percent_income,cb_person_cred_hist_length
32377,RENT,PERSONAL,C,64,46000,2.0,4800,11.09,0,0.10,24
1338,OWN,DEBTCONSOLIDATION,E,26,26000,0.0,8500,16.45,1,0.33,3
7047,MORTGAGE,PERSONAL,C,23,51000,3.0,16000,13.11,0,0.31,3
8225,MORTGAGE,MEDICAL,A,22,56004,6.0,6000,7.88,0,0.11,4
7178,RENT,PERSONAL,C,24,79000,3.0,7000,12.54,0,0.09,3
...,...,...,...,...,...,...,...,...,...,...,...
29802,MORTGAGE,MEDICAL,C,39,38500,7.0,3500,13.98,0,0.09,17
5390,RENT,HOMEIMPROVEMENT,A,25,69000,5.0,8500,6.92,1,0.12,4
860,RENT,DEBTCONSOLIDATION,E,26,148000,1.0,20000,17.99,1,0.14,3
15795,MORTGAGE,PERSONAL,C,26,175000,0.0,15000,,0,0.09,3
