# Loan Flagging Dataset

<span style="font-size:17px;">We have a dataset containing information on thousands of loans, including features like loan info, age, past billings, credit score, etc. The target variable indicates whether the loan was flagged or not. </span>

In [1]:
import pandas as pd

In [2]:
# importing the dataset
data = pd.read_csv("test_task.csv")

In [3]:
# we have 4157 rows and 22 columns
data.shape

(4157, 22)

In [4]:
# prints the first 5 rows of the data.
data.head(5)

Unnamed: 0,loanKey,rep_loan_date,first_loan,dpd_5_cnt,dpd_15_cnt,dpd_30_cnt,first_overdue_date,close_loans_cnt,federal_district_nm,TraderKey,...,payment_type_2,payment_type_3,payment_type_4,payment_type_5,past_billings_cnt,score_1,score_2,age,gender,bad_flag
0,708382,2016-10-06,2015-11-13,,,,,3.0,region_6,6,...,10,0,0,0,10.0,,,21.0,False,0
1,406305,2016-03-26,2015-09-28,1.0,0.0,0.0,2016-01-30,0.0,region_6,6,...,6,0,0,0,5.0,,,20.0,False,0
2,779736,2016-10-30,2015-12-21,,,,,2.0,region_1,6,...,0,5,0,0,5.0,,,19.0,False,0
3,556376,2016-06-29,2015-06-30,,,,,1.0,region_6,14,...,4,0,0,0,6.0,,,21.0,False,0
4,266968,2015-12-01,2015-08-03,,,,,0.0,region_5,22,...,0,0,0,0,3.0,,,33.0,False,0


# Details of Features

<span style="font-size:17px;">

`loanKey`: This is a unique identifier for each loan.

`rep_loan_date`: This indicates the repayment date for a loan.

`first_loan`: An object column holding dates (e.g. 2015-11-13), most likely the date of the customer's first loan.

`dpd_5_cnt`, `dpd_15_cnt`, `dpd_30_cnt`: These features probably represent the number of times the borrower was at least 5, 15, and 30 days past due (DPD), respectively.

`first_overdue_date`: An object column representing the date when the loan first became overdue.

`close_loans_cnt`: A float column representing the count of closed loans, with a few missing values.

`federal_district_nm`: An object column that represents the name of the federal district related to the loan or borrower.

`TraderKey`: An integer column which is a unique identifier for a trader or an agent associated with the loan.

`payment_type_0` to `payment_type_5`: Integer columns representing different types of payments. These might be categorical indicators or counts of each type of payment.

`past_billings_cnt`: A float column representing the count of past billings, with some missing values.

`score_1`, `score_2`: These columns represent credit scores or internal risk assessment scores. The score_2 column has a lot of missing data.

`age`: A column representing the age of the borrower or associated entity.

`gender`: A boolean column, which indicates two gender categories.

`bad_flag`: An integer column which is a binary flag indicating whether a loan went bad (defaulted) or not.
    
</span>

# First Step 

### Exploratory Data Analysis




<span style="font-size:17px;">Walk me through how you would explore this dataset and perform some initial EDA. What are some key things you would look at to start understanding the data, relationships between features, and what might be predictive of our target variable? Feel free to speak generally about your approach first, then get more specific in terms of what you might examine with this particular dataset.</span>
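
One possible starting point for the EDA discussion is sketched below: the share of missing values per column, the class balance of `bad_flag`, and the bad rate per category of one candidate predictor. The helper name `eda_summary` and the toy frame are assumptions made for illustration; the toy values are made up and are not taken from `test_task.csv`.

```python
import pandas as pd


def eda_summary(df: pd.DataFrame, target: str = "bad_flag") -> dict:
    """Collect a few first-pass EDA facts about a loan table (hypothetical helper)."""
    return {
        "shape": df.shape,
        # Share of missing values per column, largest first.
        "missing_share": df.isna().mean().sort_values(ascending=False),
        # Class balance: mean of a 0/1 target is the share of flagged loans.
        "target_rate": df[target].mean(),
        # Bad rate per category of one candidate predictor.
        "bad_rate_by_district": df.groupby("federal_district_nm")[target].mean(),
    }


# Toy frame for illustration only; these values are invented.
toy = pd.DataFrame(
    {
        "federal_district_nm": ["region_6", "region_6", "region_1", "region_5"],
        "score_1": [None, 550.0, None, 610.0],
        "age": [21.0, 20.0, 19.0, 33.0],
        "bad_flag": [0, 1, 0, 0],
    }
)
summary = eda_summary(toy)
print(summary["missing_share"])
print(summary["bad_rate_by_district"])
```

On the real data the same call would be `eda_summary(data)`. From there the natural next steps are distributions per feature and correlations with the target.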

# Second Step

### Feature Engineering


<span style="font-size:17px;">Walk me through how you would explore this data and identify opportunities for feature engineering. What types of new features might you extract or derive from the existing data that could help a model better predict loan risk?</span>
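
As a rough illustration of date-based feature engineering, the sketch below derives three features from the date columns. It assumes that `first_loan` holds the date of the customer's first loan, which the data only suggests. The helper name `add_loan_features` and the new column names are hypothetical. The two example rows reuse the date values shown in the `data.head(5)` output above.

```python
import pandas as pd


def add_loan_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive simple date-based features (hypothetical helper and column names)."""
    out = df.copy()
    rep = pd.to_datetime(out["rep_loan_date"])
    first = pd.to_datetime(out["first_loan"])
    overdue = pd.to_datetime(out["first_overdue_date"])

    # Length of the customer relationship, assuming first_loan is a date.
    out["days_since_first_loan"] = (rep - first).dt.days
    # 1 if the customer has ever been overdue, 0 if the date is missing.
    out["ever_overdue"] = overdue.notna().astype(int)
    # Time until the first overdue event; stays NaN when there was none.
    out["days_to_first_overdue"] = (overdue - first).dt.days
    return out


# First two rows of the head() output, date columns only.
rows = pd.DataFrame(
    {
        "rep_loan_date": ["2016-10-06", "2016-03-26"],
        "first_loan": ["2015-11-13", "2015-09-28"],
        "first_overdue_date": [None, "2016-01-30"],
    }
)
features = add_loan_features(rows)
print(features[["days_since_first_loan", "ever_overdue", "days_to_first_overdue"]])
```

Keeping the missing overdue date as an explicit indicator, rather than imputing it, preserves the information that the customer has no recorded overdue event.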


# Third Step

### Machine Learning Problem

<span style="font-size:17px;">As you already know, we are trying to predict whether a customer is 'bad' or not based on their attributes and past behavior. What type of machine learning problem is this, and how would you approach selecting an appropriate algorithm?</span>


# Fourth Step

### XGBoost

<span style="font-size:17px;">
    
- First, can you briefly explain your understanding of how XGBoost works and what makes it effective for problems like this?

- What are some of the key advantages XGBoost can provide over other common algorithms for this use case?

- Our data and business needs may evolve over time. Do you know of any newer or more advanced algorithms that could potentially outperform XGBoost in the future as a better solution for this problem?
</span>

# Fifth Step

### Alternative Methods

<span style="font-size:17px;">As we just discussed, XGBoost has been the dominant gradient boosting algorithm for some time now. However, machine learning is constantly evolving with new innovations.
Are you aware of any newer or more recently developed algorithms that could challenge or surpass XGBoost's performance, especially for problems similar to our use case?</span>

# Sixth Step

### Hyper-Parameter Optimization 

<span style="font-size:17px;">Hyperparameter tuning is crucial to maximizing XGBoost's performance. Have you used tools like Hyperopt before for tuning XGBoost models? If so, how does it compare to other tuning approaches like random search or grid search with cross-validation? </span>
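
To make the comparison concrete, the sketch below runs grid search and random search with the same budget of six candidate configurations. It is not a Hyperopt example: Hyperopt's sequential search is not shown here. It also uses scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for XGBoost, so the parameter ranges are illustrative assumptions rather than recommended values.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
# Stand-in for an XGBoost classifier.
model = GradientBoostingClassifier(n_estimators=30, random_state=0)

# Grid search: evaluates every combination (2 x 3 = 6 candidates).
grid = GridSearchCV(
    model,
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1, 0.2]},
    scoring="roc_auc",
    cv=3,
)
grid.fit(X, y)

# Random search: samples 6 candidates from continuous/integer ranges.
rand = RandomizedSearchCV(
    model,
    param_distributions={
        "max_depth": randint(2, 5),  # integers 2, 3, 4
        "learning_rate": uniform(0.01, 0.3),  # uniform on [0.01, 0.31]
    },
    n_iter=6,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
rand.fit(X, y)

print("grid  :", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

With the same number of model fits, random search covers continuous ranges instead of a fixed lattice. This coverage is the usual argument for preferring it once the number of hyperparameters grows.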

# Seventh Step

### Calibration

<span style="font-size:17px;">Let's say we've trained our XGBoost model and now want to assess how calibrated its predictions are before deployment. Walk me through some techniques you could use for this.</span>

<span style="font-size:17px;">- Platt scaling - show how you would implement it with and without log transforming the raw predictions first.</span>

<span style="font-size:17px;">- Isotonic regression - code how you would apply this to get calibrated probabilities.</span>

<span style="font-size:17px;">- Histogram binning - write the code to bucket predictions and calibrate.</span>

<span style="font-size:17px;">- Platt scaling on binned predictions - show how you'd combine both techniques.</span>
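
A minimal sketch of three of these techniques follows: Platt scaling (on raw probabilities and on their logit), isotonic regression, and histogram binning. It assumes the model's predictions on a held-out set are available as probabilities. Here they are simulated with NumPy as deliberately over-confident scores, so `p_raw`, the 50/50 fit/evaluation split, and the choice of 10 bins are all illustrative assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
# Simulated held-out data: true probabilities, labels, over-confident scores.
p_true = rng.uniform(0.02, 0.98, size=n)
y = (rng.uniform(size=n) < p_true).astype(int)
p_raw = 1 / (1 + np.exp(-2.0 * np.log(p_true / (1 - p_true))))
fit, ev = slice(0, n // 2), slice(n // 2, n)  # calibration / evaluation halves


def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))


def platt(scores_fit, y_fit, scores_new):
    # Platt scaling: a one-feature logistic regression on the model scores.
    lr = LogisticRegression(C=1e6).fit(scores_fit.reshape(-1, 1), y_fit)
    return lr.predict_proba(scores_new.reshape(-1, 1))[:, 1]


# Platt scaling without and with a logit transform of the raw predictions.
p_platt_raw = platt(p_raw[fit], y[fit], p_raw[ev])
p_platt_logit = platt(logit(p_raw[fit]), y[fit], logit(p_raw[ev]))

# Isotonic regression: monotone, piecewise-constant mapping of scores.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(p_raw[fit], y[fit])
p_iso = iso.predict(p_raw[ev])

# Histogram binning: replace each score by the observed bad rate of its bin.
edges = np.linspace(0.0, 1.0, 11)  # 10 equal-width bins
bin_fit = np.digitize(p_raw[fit], edges[1:-1])  # bin index in 0..9
bin_rate = np.array(
    [
        y[fit][bin_fit == b].mean()
        if np.any(bin_fit == b)
        else (edges[b] + edges[b + 1]) / 2  # empty bin: fall back to midpoint
        for b in range(10)
    ]
)
p_hist = bin_rate[np.digitize(p_raw[ev], edges[1:-1])]


def brier(p):
    return float(np.mean((p - y[ev]) ** 2))


calibrated = {
    "raw": p_raw[ev],
    "platt_raw": p_platt_raw,
    "platt_logit": p_platt_logit,
    "isotonic": p_iso,
    "histogram": p_hist,
}
for name, p in calibrated.items():
    print(f"{name:12s} Brier = {brier(p):.4f}")
```

Each calibrator is fitted on one half of the held-out data and scored on the other. This matters because a calibrator evaluated on the data it was fitted to will look better than it is.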

# Eighth Step

### Explain the concept of Nested Cross-Validation

<span style="font-size:17px;">Also explain how you would implement nested CV with the following setup:

- 2 folds for the outer CV loop to split the data into train/validation sets

- 5 folds for the inner CV loop to tune hyperparameters</span>
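
The setup above could be sketched with scikit-learn as follows: a 5-fold `GridSearchCV` (inner loop) is wrapped in a 2-fold `cross_val_score` (outer loop). The synthetic data, the small parameter grid, and the use of `GradientBoostingClassifier` as a stand-in for XGBoost are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=400, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Inner loop: 5 folds, used only to pick hyperparameters.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# Outer loop: 2 folds, used only to estimate generalization performance.
outer_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.2]},
    scoring="roc_auc",
    cv=inner_cv,
)

# For each outer fold, the full inner search is refit on the outer training
# part, and the selected model is scored on the outer validation part.
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(outer_scores, outer_scores.mean())
```

The outer validation folds never influence the hyperparameter choice. As a result, the mean of `outer_scores` estimates the performance of the whole tuning procedure, not of one fixed configuration.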


# Last Step

### Deep Learning 


<span style="font-size:17px;">Do you have any experience using deep neural networks for machine learning problems? What types of problems have you applied them to? How did the performance compare to other machine learning algorithms?</span>

<span style="font-size:17px;">How could we potentially apply deep learning to our problem of predicting high-risk loans?</span>