# MLE challenge - Feature engineering

### Notebook 1

In this notebook we compute five features for the **credit risk** dataset.
Each row in the dataset represents a loan that a user took out on a given date.

These features are roughly defined as follows:

**nb_previous_loans:** number of loans granted to a given user, before the current loan.

**avg_amount_loans_previous:** average amount of the loans granted to a user before the current loan.

**age:** user age in years.

**years_on_the_job:** years the user has been in employment.

**flag_own_car:** flag that indicates whether the user owns a car.



In [21]:
import pandas as pd

In [22]:
df = pd.read_csv('dataset_credit_risk.csv')

In [23]:
df

| | loan_id | id | code_gender | flag_own_car | flag_own_realty | cnt_children | amt_income_total | name_income_type | name_education_type | name_family_status | ... | flag_work_phone | flag_phone | flag_email | occupation_type | cnt_fam_members | status | birthday | job_start_date | loan_date | loan_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 208089 | 5044500 | F | N | Y | 0 | 45000.0 | Pensioner | Secondary / secondary special | Widow | ... | 0 | 0 | 0 | NaN | 1.0 | 0 | 1955-08-04 | 3021-09-18 | 2019-01-01 | 133.714974 |
| 1 | 112797 | 5026631 | F | N | Y | 0 | 99000.0 | Working | Secondary / secondary special | Separated | ... | 0 | 0 | 0 | Medicine staff | 1.0 | 0 | 1972-03-30 | 1997-06-05 | 2019-01-01 | 158.800558 |
| 2 | 162434 | 5036645 | M | Y | N | 0 | 202500.0 | Working | Incomplete higher | Married | ... | 0 | 0 | 0 | Drivers | 2.0 | 0 | 1987-03-24 | 2015-02-22 | 2019-01-01 | 203.608487 |
| 3 | 144343 | 5033584 | F | N | Y | 0 | 292500.0 | Working | Higher education | Married | ... | 0 | 0 | 0 | NaN | 2.0 | 0 | 1973-03-15 | 2009-06-29 | 2019-01-01 | 113.204964 |
| 4 | 409695 | 5085755 | F | Y | Y | 1 | 112500.0 | Commercial associate | Secondary / secondary special | Civil marriage | ... | 0 | 0 | 0 | Core staff | 3.0 | 0 | 1989-10-15 | 2019-07-03 | 2019-01-01 | 109.376260 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 777710 | 46869 | 5021552 | F | N | Y | 0 | 202500.0 | Commercial associate | Secondary / secondary special | Married | ... | 0 | 0 | 0 | Medicine staff | 2.0 | 0 | 1964-07-28 | 2017-05-18 | 2020-12-30 | 156.011689 |
| 777711 | 520468 | 5117453 | F | N | N | 2 | 90000.0 | Working | Secondary / secondary special | Married | ... | 1 | 0 | 0 | NaN | 4.0 | 0 | 1989-03-26 | 2021-06-13 | 2020-12-30 | 181.019600 |
| 777712 | 375790 | 5067951 | F | N | Y | 0 | 157500.0 | Commercial associate | Secondary / secondary special | Single / not married | ... | 0 | 1 | 1 | NaN | 1.0 | 0 | 1972-01-25 | 2011-11-18 | 2020-12-30 | 128.972541 |
| 777713 | 2763 | 5008914 | F | N | Y | 0 | 297000.0 | Commercial associate | Secondary / secondary special | Single / not married | ... | 0 | 0 | 0 | Laborers | 1.0 | 0 | 1979-03-23 | 2012-11-09 | 2020-12-30 | 132.357583 |


In [24]:
df.shape

(777715, 24)

In [25]:
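# Parse the dates first so the sort is chronological rather than lexical
df["loan_date"] = pd.to_datetime(df["loan_date"])
# Order each user's loans by date; the features below rely on this ordering
df = df.sort_values(by=["id", "loan_date"])
df = df.reset_index(drop=True)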
df.head(2)

| | loan_id | id | code_gender | flag_own_car | flag_own_realty | cnt_children | amt_income_total | name_income_type | name_education_type | name_family_status | ... | flag_work_phone | flag_phone | flag_email | occupation_type | cnt_fam_members | status | birthday | job_start_date | loan_date | loan_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1008 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | ... | 1 | 0 | 0 | NaN | 2.0 | 0 | 1988-11-04 | 2009-04-11 | 2019-02-01 | 102.283361 |
| 1 | 1000 | 5008804 | M | Y | Y | 0 | 427500.0 | Working | Higher education | Civil marriage | ... | 1 | 0 | 0 | NaN | 2.0 | 0 | 1988-11-04 | 2009-04-11 | 2019-02-15 | 136.602049 |


#### Feature nb_previous_loans

In [26]:
# Rank each user's loans by date; rank - 1 is the number of earlier loans
df_grouped = df.groupby("id")
df["nb_previous_loans"] = df_grouped["loan_date"].rank(method="first") - 1
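
To illustrate the ranking step above, the sketch below applies it to a small made-up frame (all values are hypothetical) and compares it with `groupby(...).cumcount()`. Assuming the rows are already sorted by `id` and `loan_date`, both give the same counts; `cumcount()` returns integers, whereas the rank-based version returns floats. It is shown only as a possible alternative, not as the method used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values), already sorted by id and loan_date.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "loan_date": pd.to_datetime(
        ["2019-01-05", "2019-02-10", "2019-02-10", "2019-03-01", "2019-04-01"]
    ),
})

# Same idea as the cell above: rank within each user, minus one (float result).
by_rank = toy.groupby("id")["loan_date"].rank(method="first") - 1

# Alternative: position of each row within its group (integer result).
by_cumcount = toy.groupby("id").cumcount()

toy["nb_previous_loans"] = by_cumcount
print(toy)
```

Because `rank(method="first")` breaks ties by row order, the two approaches agree even when a user has two loans on the same date, as user 1 does here.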

#### Feature avg_amount_loans_previous

In [27]:
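# Mean of all earlier loan amounts per user: shift() excludes the current loan,
# and transform keeps the result aligned with df's index
df['avg_amount_loans_previous'] = (
    df.groupby('id')['loan_amount'].transform(lambda x: x.shift().expanding().mean())
)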
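
The shift-then-expanding-mean pattern can be checked by hand on a tiny example. The sketch below uses made-up amounts (hypothetical values) and `transform`, which returns a result aligned with the original index. The first loan of each user has no history, so its value is `NaN`.

```python
import pandas as pd

# Toy data (hypothetical values), sorted by id and loan order.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "loan_amount": [100.0, 200.0, 300.0, 50.0, 150.0],
})

# shift() drops the current loan from the window; expanding().mean()
# then averages everything seen so far within the same user.
toy["avg_amount_loans_previous"] = (
    toy.groupby("id")["loan_amount"]
       .transform(lambda s: s.shift().expanding().mean())
)
print(toy)
```

For user 1 the third row averages the first two loans, (100 + 200) / 2 = 150, and never includes the current amount of 300.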

#### Feature age

In [28]:
from datetime import datetime, date

In [29]:
df['birthday'] = pd.to_datetime(df['birthday'], errors='coerce')


In [30]:
# Approximate age in whole years as of the day the notebook is run (365-day years)
df['age'] = (pd.to_datetime('today').normalize() - df['birthday']).dt.days // 365
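
The age computed above depends on the day the notebook is run, and dividing by 365 ignores leap days. If reproducible values are needed, one option is to compute completed calendar years against a fixed reference date. The sketch below assumes a hypothetical `REFERENCE_DATE` and a hypothetical helper `age_in_years`; neither is part of the original notebook.

```python
import pandas as pd

# Hypothetical fixed reference date, so reruns give the same ages.
REFERENCE_DATE = pd.Timestamp("2021-12-31")


def age_in_years(birthdays: pd.Series, ref: pd.Timestamp) -> pd.Series:
    """Completed calendar years between each birthday and ref (NaT -> <NA>)."""
    years = ref.year - birthdays.dt.year
    # Subtract one when the birthday has not yet occurred in the reference year.
    not_yet = (birthdays.dt.month > ref.month) | (
        (birthdays.dt.month == ref.month) & (birthdays.dt.day > ref.day)
    )
    return (years - not_yet.astype(int)).astype("Int64")


# Toy input (hypothetical values); the invalid entry is coerced to NaT.
birthdays = pd.to_datetime(
    pd.Series(["1988-11-04", "1955-08-04", "not a date"]), errors="coerce"
)
ages = age_in_years(birthdays, REFERENCE_DATE)
print(ages)
```

The nullable `Int64` dtype keeps ages as integers while still allowing missing values for unparseable birthdays.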

#### Feature years_on_the_job

In [31]:
df['job_start_date'] = pd.to_datetime(df['job_start_date'], errors='coerce')

In [32]:
# Approximate whole years since the job start date, as of the day the notebook is run
df['years_on_the_job'] = (pd.to_datetime('today').normalize() - df['job_start_date']).dt.days // 365
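
One sample row above shows a `job_start_date` of 3021-09-18. That is beyond the range of pandas' default nanosecond timestamps (which end in 2262), so `errors='coerce'` may turn it into `NaT`, depending on the pandas version. Start dates that are valid but lie after the reference date would yield negative values. The sketch below, on made-up data, treats both cases as missing; the masking rule and the fixed `REFERENCE_DATE` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical fixed reference date, so reruns give the same result.
REFERENCE_DATE = pd.Timestamp("2021-12-31")

# Toy input (hypothetical values): a valid date, a future date, an invalid string.
job_start = pd.to_datetime(
    pd.Series(["2009-04-11", "2030-01-01", "not a date"]), errors="coerce"
)

# Elapsed days; NaT propagates as NaN.
days = (REFERENCE_DATE - job_start).dt.days

# Whole 365-day years, keeping only non-negative durations.
years_on_the_job = (days // 365).where(days >= 0)
print(years_on_the_job)
```

Whether missing values are then imputed or the affected rows dropped is a modelling decision left open here.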

#### Feature flag_own_car

In [33]:
# Encode the Y/N flag as 1/0 (any value other than 'N' maps to 1)
df['flag_own_car'] = df['flag_own_car'].apply(lambda x: 0 if x == 'N' else 1)
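
The encoding above maps every value other than `'N'` to 1, including missing or misspelled entries. The sketch below, on made-up values, contrasts that behaviour with a stricter mapping that leaves unexpected values as missing so they can be inspected. The strict variant is an illustrative alternative, not the notebook's method.

```python
import pandas as pd

# Toy input (hypothetical values), including a missing and a lowercase entry.
flags = pd.Series(["Y", "N", None, "y"], dtype=object)

# Same rule as the cell above: anything that is not exactly 'N' becomes 1.
lenient = flags.apply(lambda x: 0 if x == "N" else 1)

# Stricter alternative: only 'Y' and 'N' are mapped; everything else becomes <NA>.
strict = flags.map({"Y": 1, "N": 0}).astype("Int64")

print(pd.DataFrame({"raw": flags, "lenient": lenient, "strict": strict}))
```

If the column is known to contain only `'Y'` and `'N'`, the two encodings give identical results.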

## Save dataset for model training

In [34]:
df = df[['id', 'age', 'years_on_the_job', 'nb_previous_loans', 'avg_amount_loans_previous', 'flag_own_car', 'status']]
df

| | id | age | years_on_the_job | nb_previous_loans | avg_amount_loans_previous | flag_own_car | status |
|---|---|---|---|---|---|---|---|
| 0 | 5008804 | 33 | 13.0 | 0.0 | NaN | 1 | 0 |
| 1 | 5008804 | 33 | 13.0 | 1.0 | 102.283361 | 1 | 0 |
| 2 | 5008804 | 33 | 13.0 | 2.0 | 119.442705 | 1 | 0 |
| 3 | 5008804 | 33 | 13.0 | 3.0 | 117.873035 | 1 | 0 |
| 4 | 5008804 | 33 | 13.0 | 4.0 | 114.289538 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 777710 | 5150487 | 53 | 6.0 | 25.0 | 132.585287 | 1 | 0 |
| 777711 | 5150487 | 53 | 6.0 | 26.0 | 132.016323 | 1 | 0 |
| 777712 | 5150487 | 53 | 6.0 | 27.0 | 131.044545 | 1 | 0 |
| 777713 | 5150487 | 53 | 6.0 | 28.0 | 130.375785 | 1 | 0 |


In [35]:
df.to_csv('train_model.csv', index=False)