# MLE challenge - Feature engineering

### Notebook 1

In this notebook we compute five features for the **credit risk** dataset.
Each row of the dataset describes one loan that a user took out on a given date.

These features are roughly defined as follows:

**nb_previous_loans:** number of loans granted to a given user, before the current loan.

**avg_amount_loans_previous:** average amount of the loans granted to a user before the current loan.

**age:** user age in years.

**years_on_the_job:** years the user has been in employment.

**flag_own_car:** flag that indicates whether the user owns a car.



In [4]:

import pandas as pd

In [6]:
df = pd.read_csv('../data/input/dataset_credit_risk.csv')

In [7]:
df.shape

(777715, 24)

In [8]:
# Parse loan_date first so the sort below is chronological rather than lexicographic.
df["loan_date"] = pd.to_datetime(df["loan_date"])
df = df.sort_values(by=["id", "loan_date"])
df = df.reset_index(drop=True)
df.head(2)

Unnamed: 0,loan_id,id,code_gender,flag_own_car,flag_own_realty,cnt_children,amt_income_total,name_income_type,name_education_type,name_family_status,...,flag_work_phone,flag_phone,flag_email,occupation_type,cnt_fam_members,status,birthday,job_start_date,loan_date,loan_amount
0,1008,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,...,1,0,0,,2.0,0,1988-11-04,2009-04-11,2019-02-01,102.283361
1,1000,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,...,1,0,0,,2.0,0,1988-11-04,2009-04-11,2019-02-15,136.602049


#### Feature nb_previous_loans

In [9]:
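# Rank each user's loans by date: the earliest loan has rank 1, so rank - 1
# is the number of loans granted before the current one.
df_grouped = df.groupby("id")
df["nb_previous_loans"] = df_grouped["loan_date"].rank(method="first") - 1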
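As a quick check of the ranking logic, here is a minimal sketch on hypothetical toy data (the `toy` frame and the `nb_previous_loans_cumcount` column are illustrative, not part of the dataset). It also shows that, once rows are sorted by `id` and `loan_date`, `groupby(...).cumcount()` should give the same counts as integers.

```python
import pandas as pd

# Hypothetical toy data: two users, loans listed out of date order.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "loan_date": pd.to_datetime(
        ["2019-03-01", "2019-01-01", "2019-02-01", "2020-05-01", "2020-01-01"]
    ),
})

# Same idea as the cell above: rank by date within each user, then subtract 1.
toy["nb_previous_loans"] = (
    toy.groupby("id")["loan_date"].rank(method="first").sub(1).astype(int)
)

# Alternative for data already sorted by id and loan_date:
# cumcount numbers the rows of each group 0, 1, 2, ...
toy_sorted = toy.sort_values(["id", "loan_date"]).reset_index(drop=True)
toy_sorted["nb_previous_loans_cumcount"] = toy_sorted.groupby("id").cumcount()

print(toy_sorted)
```

Note that `rank` returns floats, which is why the sketch casts to `int`; the notebook cell keeps the float values.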

#### Feature avg_amount_loans_previous

In [10]:
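# shift() drops the current loan from each window, so the expanding mean only
# covers earlier loans. transform() keeps the original row index, whereas
# apply() can return a group-keyed index in recent pandas versions, which
# breaks the column assignment.
df['avg_amount_loans_previous'] = (
    df.groupby('id')['loan_amount'].transform(lambda x: x.shift().expanding().mean())
)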
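To make the shift-then-expand pattern concrete, the following sketch applies it to a small hypothetical frame (the `toy` data is made up for illustration). Each user's first loan has no history, so its average comes out as `NaN`.

```python
import pandas as pd

# Hypothetical toy data, already sorted by user and loan date.
toy = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "loan_amount": [100.0, 200.0, 600.0, 50.0],
})

# For each user: shift by one row to exclude the current loan,
# then take the running mean of everything seen so far.
toy["avg_amount_loans_previous"] = (
    toy.groupby("id")["loan_amount"]
    .transform(lambda s: s.shift().expanding().mean())
)

print(toy)
```

For user 1 the third loan gets the mean of 100 and 200, that is 150; the amount of the third loan itself (600) never enters its own feature.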

#### Feature age

In [11]:
from datetime import datetime, date

In [12]:
df['birthday'] = pd.to_datetime(df['birthday'], errors='coerce')


In [13]:
# Approximate age in whole years as of today's date (365-day years, leap days ignored).
df['age'] = (pd.to_datetime('today').normalize() - df['birthday']).dt.days // 365
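Dividing a day count by 365 ignores leap days, and using today's date makes the result depend on when the notebook is run. As one possible alternative, the sketch below counts completed calendar years up to a fixed reference date. The helper `years_between` and the reference date are hypothetical names and values chosen for illustration; they do not appear in the notebook.

```python
import pandas as pd


def years_between(start: pd.Series, end: pd.Timestamp) -> pd.Series:
    """Completed calendar years from each date in `start` up to `end`."""
    years = end.year - start.dt.year
    # Subtract one year when the anniversary has not yet occurred in end's year.
    not_yet = (start.dt.month > end.month) | (
        (start.dt.month == end.month) & (start.dt.day > end.day)
    )
    # Keep missing dates as missing values.
    return (years - not_yet.astype(int)).where(start.notna())


birthdays = pd.to_datetime(pd.Series(["1988-11-04", "2000-02-29", None]))
reference = pd.Timestamp("2020-11-03")  # hypothetical fixed reference date

exact_age = years_between(birthdays, reference)
approx_age = (reference - birthdays).dt.days // 365  # the notebook's approach

print(pd.DataFrame({"exact": exact_age, "approx": approx_age}))
```

For the first birthday the two approaches differ: the reference date falls one day before the 32nd birthday, yet the 365-day approximation already reports 32. The same helper could be applied to `job_start_date`.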

#### Feature years_on_the_job

In [14]:
df['job_start_date'] = pd.to_datetime(df['job_start_date'], errors='coerce')

In [15]:
# Approximate tenure in whole years as of today's date (365-day years, leap days ignored).
df['years_on_the_job'] = (pd.to_datetime('today').normalize() - df['job_start_date']).dt.days // 365

#### Feature flag_own_car

In [16]:
# Encode the flag as 0/1: 'N' becomes 0, any other value (including missing) becomes 1.
df['flag_own_car'] = df.flag_own_car.apply(lambda x : 0 if x == 'N' else 1)
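The encoding above treats every value other than `'N'` as 1. If the column could contain missing or unexpected values, an explicit mapping may be preferable. The sketch below contrasts the two on a hypothetical series; whether the real column has missing values is not stated in the notebook.

```python
import pandas as pd

# Hypothetical values, including one missing entry.
flags = pd.Series(["Y", "N", None, "Y"], name="flag_own_car")

# Behaviour of the notebook cell: everything that is not 'N' becomes 1.
as_in_notebook = flags.apply(lambda x: 0 if x == "N" else 1)

# Alternative: map known values explicitly; anything else stays missing.
# The nullable "Int64" dtype keeps integers alongside missing values.
explicit = flags.map({"N": 0, "Y": 1}).astype("Int64")

print(pd.DataFrame({"as_in_notebook": as_in_notebook, "explicit": explicit}))
```

With the explicit mapping, the missing entry stays missing instead of being counted as a car owner.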

## Save dataset for model training

In [17]:
df = df[['id', 'age', 'years_on_the_job', 'nb_previous_loans', 'avg_amount_loans_previous', 'flag_own_car', 'status']]


In [18]:
df.to_csv('../data/output/train_model.csv', index=False)
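Before training, it can help to confirm that the saved file round-trips and to count missing values, since `avg_amount_loans_previous` is undefined for each user's first loan. The sketch below uses an in-memory buffer and a hypothetical three-row frame instead of the real output file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the feature table written above.
features = pd.DataFrame({
    "id": [1, 1, 2],
    "nb_previous_loans": [0, 1, 0],
    "avg_amount_loans_previous": [float("nan"), 100.0, float("nan")],
    "status": [0, 0, 1],
})

# Write to an in-memory buffer rather than disk, then read it back.
buffer = io.StringIO()
features.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Missing averages should line up with first loans (nb_previous_loans == 0).
missing_per_column = reloaded.isna().sum()
first_loans = (reloaded["nb_previous_loans"] == 0).sum()

print(missing_per_column)
print("first loans:", first_loans)
```

How those missing values are handled (imputation, a sentinel value, or dropping rows) is a modelling decision left to the training step.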