# Developing a Loan Scoring Algorithm for "Prêt à dépenser"

### Notebook by [Nasr-edine DRAI](https://www.hackerrank.com/d_nasredine)



### [Openclassrooms](https://openclassrooms.com/en/)

## Introduction

In this project, you are a Data Scientist at "Prêt à dépenser", a financial company that offers consumer credit to people with little or no loan history. The company wants to build a credit scoring tool that computes the probability that a client will repay a loan and then classifies the loan application as approved or rejected. The goal is to develop a classification algorithm that helps decide whether a loan can be granted to a client.
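
The scoring step described above (estimate a repayment probability, then approve or reject) can be sketched as a simple thresholding rule. This is only an illustration, not the project's actual decision logic: the function name `decide_loan` and the 0.5 cut-off are assumptions made for the example.

```python
def decide_loan(repayment_probability: float, threshold: float = 0.5) -> str:
    """Classify an application from the estimated probability of repayment.

    The function name and the default threshold are hypothetical; a real
    cut-off would be tuned to the lender's tolerance for defaults.
    """
    if not 0.0 <= repayment_probability <= 1.0:
        raise ValueError("repayment_probability must lie in [0, 1]")
    return "approved" if repayment_probability >= threshold else "rejected"


print(decide_loan(0.82))  # probability above the threshold
print(decide_loan(0.31))  # probability below the threshold
```

Keeping the threshold as a parameter makes it possible to adjust the approval rule without retraining the underlying model.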

## Scope of the problem

Customer relationship managers will be the users of the scoring tool. Because they interact with clients, they need the model to be easily interpretable. They also want a measure of the importance of the variables that led the model to assign a particular probability to a client.
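
One way to picture the per-client variable importance mentioned above is a linear (logistic) score, where each variable's contribution is its weight times the client's value. The sketch below assumes such a model purely for illustration; the weights, the intercept, the standardized client values, and the helper name `explain_score` are all hypothetical and are not fitted on the project data.

```python
import math


def explain_score(weights, intercept, client):
    """Return the predicted probability and each variable's contribution.

    Assumes a logistic model: probability = sigmoid(intercept + sum(w * x)).
    Contributions are sorted by absolute size, largest first.
    """
    contributions = {name: weights[name] * client[name] for name in weights}
    logit = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return probability, ranked


# Hypothetical weights and standardized client values (illustration only)
weights = {"AMT_INCOME_TOTAL": 0.8, "AMT_CREDIT": -0.5, "CNT_CHILDREN": -0.1}
client = {"AMT_INCOME_TOTAL": 1.2, "AMT_CREDIT": 0.4, "CNT_CHILDREN": 2.0}

probability, ranked = explain_score(weights, intercept=0.3, client=client)
print(f"estimated repayment probability: {probability:.3f}")
for name, value in ranked:
    print(f"{name}: {value:+.2f}")
```

A ranking like this gives a relationship manager a short list of the variables that pushed a given client's score up or down.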

<img src="../imgs/french_public_health_agency.png" />

## Verify Python Virtual Environments

#### Check the Version of the Python Interpreter

In [2]:
!python --version

Python 3.10.1


#### Verify that I'm using the right virtual environment

In [1]:
!pip -V

pip 23.0 from /Users/drainasr-edine/github/ingenieur_ia/P4_drai_nasr-edine/.venv/lib/python3.10/site-packages/pip (python 3.10)
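
The two shell commands above confirm the interpreter version and the location of pip. A similar check can be made from Python itself. The sketch below assumes an environment created with `venv` (or a recent `virtualenv`), where `sys.prefix` differs from `sys.base_prefix`; the helper name `in_virtualenv` is made up for this example.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```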


#### Check Installed Modules in Python

Run through this notebook to make sure my environment is properly set up. Be sure to launch Jupyter from inside the virtual environment.

In [1]:
import os, sys

parent = os.path.abspath('..')
sys.path.insert(1, parent)
print(parent)

/Users/drainasr-edine/github/ingenieur_ia/P4_drai_nasr-edine


The `sys.path.insert` call above adds the parent directory to the module search path, so that packages such as `src` can be imported from this notebook.
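
The `run_checks` helper used in this section lives in `src/check_environment.py`, whose source is not reproduced in this notebook. As a rough idea of how such a check could work, the sketch below tests whether top-level modules are importable with `importlib.util.find_spec`. The function name `check_packages` and the module names passed to it are assumptions for the example, not the project's implementation.

```python
from importlib.util import find_spec


def check_packages(names):
    """Map each top-level module name to True if it can be found, else False.

    find_spec locates a module without importing it; only top-level names
    are used here, since dotted names would import their parent package.
    """
    return {name: find_spec(name) is not None for name in names}


# Two standard-library modules plus one deliberately missing name
results = check_packages(["json", "csv", "surely_not_installed_pkg"])
for name, ok in results.items():
    print(f"[{' OK ' if ok else 'FAIL'}] {name}")
```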

In [2]:
from src.check_environment import run_checks
run_checks()

Using Python in /Users/drainasr-edine/github/ingenieur_ia/P4_drai_nasr-edine/.venv:
[ OK ] Python is version 3.10.1 (v3.10.1:2cd268a3a9, Dec  6 2021, 14:28:59) [Clang 13.0.0 (clang-1300.0.29.3)]

[ OK ] jupyterlab
[ OK ] jupyterlab_git
[ OK ] matplotlib
[ OK ] numpy
[ OK ] pandas
[ OK ] seaborn
[ OK ] statsmodels
[ OK ] plotly
[ OK ] colorama
[ OK ] sklearn
[ OK ] missingno
[ OK ] wordcloud


## Import Python libraries for data science

In [7]:
# NumPy library for numerical computing
import numpy as np

# Import the statsmodels library for statistical analysis and modeling
import statsmodels.api as sm

# Pandas library for data manipulation and analysis
import pandas as pd

# Matplotlib library for data visualization
import matplotlib.pyplot as plt

# Seaborn library for data visualization based on Matplotlib
import seaborn as sns

# Scikit-Learn library for machine learning
# import sklearn

# Tensorflow library for building and training machine learning models
# import tensorflow as tf

# Wordcloud library for generating word clouds from text data
# from wordcloud import WordCloud

### Display CSV Files with Their Sizes in a Pandas DataFrame

In [None]:
import os
import pandas as pd

folder_path = '../data'
file_list = os.listdir(folder_path)

file_details = []
for file_name in file_list:
    if file_name.endswith(".csv"):
        file_path = os.path.join(folder_path, file_name)
        file_size = os.path.getsize(file_path)
        file_details.append([file_name, file_size/10**6])

df_csv_files = pd.DataFrame(file_details, columns=["Name", "Size (MB)"])
df_csv_files.sort_values("Size (MB)", axis=0, ascending=True, inplace=True)
df_csv_files


Unnamed: 0,Name,Size (MB)
1,HomeCredit_columns_description.csv,0.037383
9,sample_submission.csv,0.536202
0,application_test.csv,26.567651
5,application_train.csv,166.13337
6,bureau.csv,170.016717
8,bureau_balance.csv,375.592889
2,POS_CASH_balance.csv,392.703158
7,previous_application.csv,404.973293
3,credit_card_balance.csv,424.582605
4,installments_payments.csv,723.118349
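
Several of the files listed above are hundreds of megabytes, so it can help to inspect a file's header and row count without loading it fully into memory. The sketch below streams a CSV with the standard `csv` module; the helper name `peek_csv` is hypothetical, and the demonstration runs on a small temporary file rather than on the real data.

```python
import csv
import os
import tempfile


def peek_csv(path, encoding="utf-8"):
    """Return (column_names, row_count) by streaming the file row by row."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count


# Demonstration on a tiny temporary file (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(
            [["SK_ID_CURR", "TARGET"], [100001, 0.5], [100005, 0.5]]
        )
    demo_header, demo_rows = peek_csv(demo_path)

print(demo_header, demo_rows)
```

With pandas, the `nrows` and `chunksize` arguments of `pd.read_csv` serve a similar purpose when only a preview or a chunked pass over the file is needed.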


### Importing and Previewing CSV Data Files with Pandas

In [None]:
import pandas as pd

# Column names for the description file; the file's own header row is skipped
header = ["Table", "Row", "Description", "Special"]

df_homeCredit_columns_description = pd.read_csv(
    "../data/HomeCredit_columns_description.csv",
    skiprows=1,
    names=header,
    index_col=0,
    encoding="unicode_escape",
)
df_homeCredit_columns_description.head()

Unnamed: 0,Table,Row,Description,Special
1,application_{train|test}.csv,SK_ID_CURR,ID of loan in our sample,
2,application_{train|test}.csv,TARGET,Target variable (1 - client with payment diffi...,
5,application_{train|test}.csv,NAME_CONTRACT_TYPE,Identification if loan is cash or revolving,
6,application_{train|test}.csv,CODE_GENDER,Gender of the client,
7,application_{train|test}.csv,FLAG_OWN_CAR,Flag if the client owns a car,


In [None]:
# Import sample_submission.csv
df_sample_submission = pd.read_csv("../data/sample_submission.csv", sep=',')
df_sample_submission.head()

Unnamed: 0,SK_ID_CURR,TARGET
0,100001,0.5
1,100005,0.5
2,100013,0.5
3,100028,0.5
4,100038,0.5


In [None]:
# Import application_test.csv
df_application_test = pd.read_csv("../data/application_test.csv", sep=',')
df_application_test.head()

Unnamed: 0,SK_ID_CURR,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100001,Cash loans,F,N,Y,0,135000.0,568800.0,20560.5,450000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
1,100005,Cash loans,M,N,Y,0,99000.0,222768.0,17370.0,180000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
2,100013,Cash loans,M,Y,Y,0,202500.0,663264.0,69777.0,630000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,4.0
3,100028,Cash loans,F,N,Y,2,315000.0,1575000.0,49018.5,1575000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,3.0
4,100038,Cash loans,M,Y,N,1,180000.0,625500.0,32067.0,625500.0,...,0,0,0,0,,,,,,


In [None]:
# Import application_train.csv
df_application_train = pd.read_csv("../data/application_train.csv", sep=',')
df_application_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Import bureau.csv
df_bureau = pd.read_csv("../data/bureau.csv", sep=',')
df_bureau.head()

Unnamed: 0,SK_ID_CURR,SK_ID_BUREAU,CREDIT_ACTIVE,CREDIT_CURRENCY,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,CREDIT_TYPE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
0,215354,5714462,Closed,currency 1,-497,0,-153.0,-153.0,,0,91323.0,0.0,,0.0,Consumer credit,-131,
1,215354,5714463,Active,currency 1,-208,0,1075.0,,,0,225000.0,171342.0,,0.0,Credit card,-20,
2,215354,5714464,Active,currency 1,-203,0,528.0,,,0,464323.5,,,0.0,Consumer credit,-16,
3,215354,5714465,Active,currency 1,-203,0,,,,0,90000.0,,,0.0,Credit card,-16,
4,215354,5714466,Active,currency 1,-629,0,1197.0,,77674.5,0,2700000.0,,,0.0,Consumer credit,-21,
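
In the preview above, all five rows share the same `SK_ID_CURR`, which suggests that `bureau.csv` holds several records per client and would need to be summarized per client before being combined with the application table. The sketch below counts credit statuses per client using only the `(SK_ID_CURR, CREDIT_ACTIVE)` pairs visible in the preview; the helper name `credit_status_counts` is hypothetical.

```python
from collections import Counter, defaultdict

# (SK_ID_CURR, CREDIT_ACTIVE) pairs copied from the five preview rows above
preview_rows = [
    (215354, "Closed"),
    (215354, "Active"),
    (215354, "Active"),
    (215354, "Active"),
    (215354, "Active"),
]


def credit_status_counts(rows):
    """Count how many credits of each status every client has."""
    counts = defaultdict(Counter)
    for client_id, status in rows:
        counts[client_id][status] += 1
    return counts


summary = credit_status_counts(preview_rows)
print(dict(summary[215354]))
```

On the full DataFrame, a comparable summary could be obtained with `df_bureau.groupby("SK_ID_CURR")["CREDIT_ACTIVE"].value_counts()`.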


In [None]:
# Import bureau_balance.csv
df_bureau_balance = pd.read_csv("../data/bureau_balance.csv", sep=',')
df_bureau_balance.head()

Unnamed: 0,SK_ID_BUREAU,MONTHS_BALANCE,STATUS
0,5715448,0,C
1,5715448,-1,C
2,5715448,-2,C
3,5715448,-3,C
4,5715448,-4,C


In [None]:
# Import POS_CASH_balance.csv
df_POS_CASH_balance = pd.read_csv("../data/POS_CASH_balance.csv", sep=',')
df_POS_CASH_balance.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,CNT_INSTALMENT,CNT_INSTALMENT_FUTURE,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,1803195,182943,-31,48.0,45.0,Active,0,0
1,1715348,367990,-33,36.0,35.0,Active,0,0
2,1784872,397406,-32,12.0,9.0,Active,0,0
3,1903291,269225,-35,48.0,42.0,Active,0,0
4,2341044,334279,-35,36.0,35.0,Active,0,0


In [None]:
# Import previous_application.csv
df_previous_application = pd.read_csv("../data/previous_application.csv", sep=',')
df_previous_application.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NAME_CONTRACT_TYPE,AMT_ANNUITY,AMT_APPLICATION,AMT_CREDIT,AMT_DOWN_PAYMENT,AMT_GOODS_PRICE,WEEKDAY_APPR_PROCESS_START,HOUR_APPR_PROCESS_START,...,NAME_SELLER_INDUSTRY,CNT_PAYMENT,NAME_YIELD_GROUP,PRODUCT_COMBINATION,DAYS_FIRST_DRAWING,DAYS_FIRST_DUE,DAYS_LAST_DUE_1ST_VERSION,DAYS_LAST_DUE,DAYS_TERMINATION,NFLAG_INSURED_ON_APPROVAL
0,2030495,271877,Consumer loans,1730.43,17145.0,17145.0,0.0,17145.0,SATURDAY,15,...,Connectivity,12.0,middle,POS mobile with interest,365243.0,-42.0,300.0,-42.0,-37.0,0.0
1,2802425,108129,Cash loans,25188.615,607500.0,679671.0,,607500.0,THURSDAY,11,...,XNA,36.0,low_action,Cash X-Sell: low,365243.0,-134.0,916.0,365243.0,365243.0,1.0
2,2523466,122040,Cash loans,15060.735,112500.0,136444.5,,112500.0,TUESDAY,11,...,XNA,12.0,high,Cash X-Sell: high,365243.0,-271.0,59.0,365243.0,365243.0,1.0
3,2819243,176158,Cash loans,47041.335,450000.0,470790.0,,450000.0,MONDAY,7,...,XNA,12.0,middle,Cash X-Sell: middle,365243.0,-482.0,-152.0,-182.0,-177.0,1.0
4,1784265,202054,Cash loans,31924.395,337500.0,404055.0,,337500.0,THURSDAY,9,...,XNA,24.0,high,Cash Street: high,,,,,,


In [None]:
# Import credit_card_balance.csv
df_credit_card_balance = pd.read_csv("../data/credit_card_balance.csv", sep=',')
df_credit_card_balance.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,MONTHS_BALANCE,AMT_BALANCE,AMT_CREDIT_LIMIT_ACTUAL,AMT_DRAWINGS_ATM_CURRENT,AMT_DRAWINGS_CURRENT,AMT_DRAWINGS_OTHER_CURRENT,AMT_DRAWINGS_POS_CURRENT,AMT_INST_MIN_REGULARITY,...,AMT_RECIVABLE,AMT_TOTAL_RECEIVABLE,CNT_DRAWINGS_ATM_CURRENT,CNT_DRAWINGS_CURRENT,CNT_DRAWINGS_OTHER_CURRENT,CNT_DRAWINGS_POS_CURRENT,CNT_INSTALMENT_MATURE_CUM,NAME_CONTRACT_STATUS,SK_DPD,SK_DPD_DEF
0,2562384,378907,-6,56.97,135000,0.0,877.5,0.0,877.5,1700.325,...,0.0,0.0,0.0,1,0.0,1.0,35.0,Active,0,0
1,2582071,363914,-1,63975.555,45000,2250.0,2250.0,0.0,0.0,2250.0,...,64875.555,64875.555,1.0,1,0.0,0.0,69.0,Active,0,0
2,1740877,371185,-7,31815.225,450000,0.0,0.0,0.0,0.0,2250.0,...,31460.085,31460.085,0.0,0,0.0,0.0,30.0,Active,0,0
3,1389973,337855,-4,236572.11,225000,2250.0,2250.0,0.0,0.0,11795.76,...,233048.97,233048.97,1.0,1,0.0,0.0,10.0,Active,0,0
4,1891521,126868,-1,453919.455,450000,0.0,11547.0,0.0,11547.0,22924.89,...,453919.455,453919.455,0.0,1,0.0,1.0,101.0,Active,0,0


In [None]:
# Import installments_payments.csv
df_installments_payments = pd.read_csv("../data/installments_payments.csv", sep=',')
df_installments_payments.head()

Unnamed: 0,SK_ID_PREV,SK_ID_CURR,NUM_INSTALMENT_VERSION,NUM_INSTALMENT_NUMBER,DAYS_INSTALMENT,DAYS_ENTRY_PAYMENT,AMT_INSTALMENT,AMT_PAYMENT
0,1054186,161674,1.0,6,-1180.0,-1187.0,6948.36,6948.36
1,1330831,151639,0.0,34,-2156.0,-2156.0,1716.525,1716.525
2,2085231,193053,2.0,1,-63.0,-63.0,25425.0,25425.0
3,2452527,199697,1.0,3,-2418.0,-2426.0,24350.13,24350.13
4,2714724,167756,1.0,2,-1383.0,-1366.0,2165.04,2160.585
