# Data Science Competition - Predicting Probability of Default 

## Problem Statement 
Financial institutions face significant risks due to loan defaults. Accurately predicting the 
probability of default (PD) on loans is critical for risk management and strategic planning. In this
competition, participants are tasked with developing a predictive model that estimates the
probability of default on loans using historical loan data.

### 1. Data Cleaning 
In this section we load the data and take a summary view of it: column types, non-null counts, and basic statistics.

In [9]:
#Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn 
import seaborn as sns

In [39]:
#Load the data, drop the unnamed index column, then preview the first rows
main_data = pd.read_csv('data_science_competition_2024.csv')
main_data = main_data.drop('Unnamed: 0', axis=1)
main_data.head()

In [40]:
#Display information about the dataset (info() prints directly and returns None)
main_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_id               100000 non-null  object 
 1   gender                100000 non-null  object 
 2   disbursemet_date      100000 non-null  object 
 3   currency              100000 non-null  object 
 4   country               99900 non-null   object 
 5   sex                   100000 non-null  object 
 6   is_employed           100000 non-null  bool   
 7   job                   95864 non-null   object 
 8   location              99405 non-null   object 
 9   loan_amount           100000 non-null  float64
 10  number_of_defaults    100000 non-null  int64  
 11  outstanding_balance   100000 non-null  float64
 12  interest_rate         100000 non-null  float64
 13  age                   100000 non-null  int64  
 14  number_of_defaults.1  100000 non-null  int64  
 15  r

In [42]:
#Check whether age.1 and number_of_defaults.1 are exact copies of age and number_of_defaults; drop them if so
if (main_data["age"] != main_data["age.1"]).sum() == 0:
    main_data = main_data.drop('age.1', axis=1)

In [43]:
if (main_data["number_of_defaults"] != main_data["number_of_defaults.1"]).sum() == 0:
    main_data = main_data.drop('number_of_defaults.1', axis=1)
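The two checks above follow the same pattern, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `drop_if_duplicate` is a hypothetical name, and the toy frame uses made-up values in place of `main_data`. It relies on `Series.equals`, which, unlike the `!=` comparison above, treats missing values in the same position as equal.

```python
import pandas as pd

def drop_if_duplicate(df, original, copy):
    """Drop `copy` when it holds exactly the same values as `original`."""
    if copy in df.columns and df[original].equals(df[copy]):
        return df.drop(columns=copy)
    return df

# Toy frame standing in for main_data (values are made up for illustration)
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "age.1": [30, 45, 52],                 # exact copy of age
    "number_of_defaults": [0, 1, 0],
    "number_of_defaults.1": [0, 2, 0],     # differs in one row
})

toy = drop_if_duplicate(toy, "age", "age.1")
toy = drop_if_duplicate(toy, "number_of_defaults", "number_of_defaults.1")
print(list(toy.columns))
```

In this toy example only `age.1` is dropped, because `number_of_defaults.1` differs from `number_of_defaults` in one row.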


In [44]:
main_data.describe()

        loan_amount  number_of_defaults  outstanding_balance  interest_rate       age       salary
count      100000.0            100000.0             100000.0       100000.0  100000.0     100000.0
mean        31120.0             0.44197         36964.909763       0.210435  43.57069  2781.804324
std    15895.093631            0.688286         10014.758477       0.018725   4.86376   696.450055
min          1000.0                 0.0                  0.0            0.1      21.0        250.0
25%         21000.0                 0.0         29625.227472            0.2      40.0  2273.929349
50%         31000.0                 0.0         35063.852394           0.21      44.0  2665.441567
75%         40000.0                 1.0         42133.388817           0.22      47.0  3146.577655
max        273000.0                 2.0             150960.0            0.3      65.0      10000.0


In [45]:
#Check for null values 
null = main_data.isnull().sum()
null

loan_id                   0
gender                    0
disbursemet_date          0
currency                  0
country                 100
sex                       0
is_employed               0
job                    4136
location                595
loan_amount               0
number_of_defaults        0
outstanding_balance       0
interest_rate             0
age                       0
remaining term            0
salary                    0
marital_status            0
Loan Status               0
dtype: int64
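Raw counts are easier to judge as a share of the 100000 rows: `country` is missing in 0.1% of rows, `job` in about 4.1%, and `location` in about 0.6%. A minimal sketch of how that view could be produced is shown here; `missing_summary` is a hypothetical helper and the toy frame contains made-up values rather than the competition data.

```python
import pandas as pd

def missing_summary(df):
    """Return count and percentage of nulls for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only columns with at least one null, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy frame standing in for main_data (values are made up for illustration)
toy = pd.DataFrame({
    "country": ["A", None, "A", "A"],
    "job": ["x", None, None, "y"],
    "loan_amount": [1000.0, 2000.0, 3000.0, 4000.0],
})

result = missing_summary(toy)
print(result)
```

Calling `missing_summary(main_data)` in the notebook would give the same view for the real columns.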

In [22]:
#Check for duplicate values: fully duplicated rows first, then rows that repeat a loan_id
duplicate = main_data[main_data.duplicated()]
main_data[main_data.duplicated(subset=['loan_id'])]

Unnamed: 0.1,Unnamed: 0,loan_id,gender,disbursemet_date,currency,country,sex,is_employed,job,location,...,number_of_defaults,outstanding_balance,interest_rate,age,number_of_defaults.1,remaining term,salary,marital_status,age.1,Loan Status


In [None]:
#Inspect the loan_id column to check whether its values are unique
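One way this check could be written is sketched below. It is an illustration under assumptions, not the notebook's own cell: `id_uniqueness` is a hypothetical helper, and the toy frame uses made-up identifiers in place of the real `loan_id` values.

```python
import pandas as pd

def id_uniqueness(df, column="loan_id"):
    """Summarise how many rows and how many distinct values a column has."""
    ids = df[column]
    return {
        "rows": len(ids),
        "distinct": ids.nunique(dropna=False),
        "is_unique": ids.is_unique,
    }

# Toy frame standing in for main_data (identifiers are made up)
toy = pd.DataFrame({"loan_id": ["a1", "a2", "a2", "a3"]})

report = id_uniqueness(toy)
print(report)

# List the identifiers that occur more than once
repeated = toy.loc[toy["loan_id"].duplicated(keep=False), "loan_id"].unique()
print(list(repeated))
```

If `is_unique` comes back `True` for `main_data`, `loan_id` acts as a row identifier and carries no predictive signal of its own.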

