<h1 align='center' style="color: darksalmon; font-size: 48px"><strong>Alura Cash</strong></h1>

<h3 align='center'>A cute finance project using Data Science concepts</h3><br>

<p align='center' style='text-align: justify'>
I've been hired as a freelance data scientist by an international digital bank called Alura Cash. At the first meeting, the finance team tells me that borrowers are repeatedly defaulting after their loans have been released. I'm therefore asked for a solution that reduces the financial losses caused by borrowers who do not repay their debts.
</p>

<p align='center' style='text-align: justify'>
As a data scientist, I propose studying financial and loan application data to find patterns that might indicate a likely default.
</p>

<p align='center' style='text-align: justify'>
So I request a dataset containing information about customers, loan applications, and credit history, as well as whether each borrower is delinquent. With this data, I can train a classifier that flags potentially delinquent customers and addresses Alura Cash's problem. <br> <br>

>&nbsp; <br>
>Note: This is a fictional case study. The dataset used in this project is not real. <br>
>&nbsp; <br>

</p>

# 1. Data Collection

Here I'll use the dataset that Alura Cash provided at the first meeting, which contains information about 34,501 customers. <br><br>

<p align='center'>
    <i>To access this data yourself, follow <a href='https://raw.githubusercontent.com/Mirlaa/Challenge-Data-Science-1ed/main/Dados/dados_juntos.csv'>this link</a></i> 😊 
</p>

## 1.1 - Importing the dataset

Since I downloaded it to my local environment, I'll import it from its path using the pandas library. I'll also import numpy and plotly for now.

In [2]:
import plotly.express as px
import pandas as pd
import numpy as np

uri = './data/alura_cash_data.csv'
df = pd.read_csv(uri)
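
If you'd rather not download the file first, `pd.read_csv` also accepts a URL. The sketch below wraps that in a small helper; `load_loans` and `DATA_URL` are names I'm introducing for illustration, and I'm assuming the file behind the link matches my local copy, which I haven't verified here.

```python
import pandas as pd

# URL copied from the link above; reading it requires network access.
DATA_URL = (
    "https://raw.githubusercontent.com/Mirlaa/"
    "Challenge-Data-Science-1ed/main/Dados/dados_juntos.csv"
)


def load_loans(source=DATA_URL):
    """Read the loan data from a local path, a URL, or a file-like object."""
    return pd.read_csv(source)
```

Calling `load_loans('./data/alura_cash_data.csv')` reads the local copy, while `load_loans()` fetches the remote file.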

## 1.2 - Data Description

Let's check some useful info about this set regarding the number of rows and columns, the data types, the number of missing values, and some statistics around the numerical variables.

### 1.2.1 - Number of rows and columns

In [18]:
print('''
    This data contains the following amount of rows and columns:

    => {} rows and {} columns.
'''.format(df.shape[0], df.shape[1]))



    This data contains the following amount of rows and columns:

    => 34501 rows and 12 columns.



### 1.2.2 - Data types

In [14]:
for column in df.columns:
    print('{} <-> {}'.format(column, df[column].dtype))

person_age <-> float64
person_income <-> float64
person_home_ownership <-> object
person_emp_length <-> float64
loan_intent <-> object
loan_grade <-> object
loan_amnt <-> float64
loan_int_rate <-> float64
loan_status <-> float64
loan_percent_income <-> float64
cb_person_default_on_file <-> object
cb_person_cred_hist_length <-> float64
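
Since the columns fall into two groups, it can be handy to keep their names in two lists. A minimal sketch using `select_dtypes` follows; the helper name `split_columns_by_kind` is hypothetical, and it assumes every non-numeric column should be treated as categorical, which holds for the dtypes listed above.

```python
import pandas as pd


def split_columns_by_kind(frame):
    """Return (numeric_columns, categorical_columns) as two lists of names."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    # Everything that is not numeric is treated as categorical here.
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical
```

With the dataset loaded as `df`, `numeric_cols, categorical_cols = split_columns_by_kind(df)` gives the two lists.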


### 1.2.3 - Missing values

In [19]:
df.isna().sum()

person_age                     324
person_income                  339
person_home_ownership          331
person_emp_length             1254
loan_intent                    315
loan_grade                     313
loan_amnt                      331
loan_int_rate                 3630
loan_status                    343
loan_percent_income            319
cb_person_default_on_file      370
cb_person_cred_hist_length       4
dtype: int64
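
Raw counts are easier to judge next to their share of the rows. The sketch below builds a small report with both; `missing_report` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd


def missing_report(frame):
    """Count missing values per column and express them as a percentage of rows."""
    report = pd.DataFrame({
        "missing": frame.isna().sum(),
        # isna().mean() is the fraction of missing rows in each column.
        "percent": (frame.isna().mean() * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)
```

Running `missing_report(df)` on this dataset would put `loan_int_rate` at the top, since it has the most missing entries.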

### 1.2.4 - Numerical variables

In [20]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| person_age | 34177.0 | 27.731018 | 6.345281 | 20.0 | 23.0 | 26.0 | 30.0 | 144.0 |
| person_income | 34162.0 | 66028.687957 | 61405.057742 | 4000.0 | 38493.0 | 55000.0 | 79200.0 | 6000000.0 |
| person_emp_length | 33247.0 | 4.787229 | 4.137463 | 0.0 | 2.0 | 4.0 | 7.0 | 123.0 |
| loan_amnt | 34170.0 | 9590.576529 | 6320.429041 | 500.0 | 5000.0 | 8000.0 | 12200.0 | 35000.0 |
| loan_int_rate | 30871.0 | 11.01363 | 3.24124 | 5.42 | 7.9 | 10.99 | 13.47 | 23.22 |
| loan_status | 34158.0 | 0.218192 | 0.413024 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| loan_percent_income | 34182.0 | 0.170227 | 0.106783 | 0.0 | 0.09 | 0.15 | 0.23 | 0.83 |
| cb_person_cred_hist_length | 34497.0 | 5.808186 | 4.063231 | 2.0 | 3.0 | 4.0 | 8.0 | 30.0 |


## 1.3 - Notes

In [21]:
df.sample(3)

| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16242 | 22.0 | 55000.0 | Mortgage | 6.0 | Venture | A | 10000.0 | 5.79 | 0.0 | 0.18 | N | 2.0 |
| 11034 | 24.0 | 26000.0 | Rent | 5.0 | Venture | C | 12000.0 | 14.26 | 1.0 | 0.46 | N | 3.0 |
| 11998 | 37.0 | 66150.0 | Own | 11.0 | Personal | C | 17000.0 | 12.73 | 0.0 | 0.26 | N | 17.0 |


<p align='center'>
With these few commands, I know that I'm dealing with more than 34 thousand rows and 12 columns. <br> 
The data types are mostly numerical, but there are also four categorical variables. Missing values are few relative to the whole: most columns lack about 1% of their entries, and the largest gap, <code>loan_int_rate</code>, lacks about 10.5%. Finally, the numerical variables have very different scales, which is a problem that I'll have to deal with later.
</p>

# 2. Data Cleaning

In this section, I'll be cleaning the dataset, removing unnecessary columns, filling missing values, and transforming categorical variables into numerical ones.
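
To make the filling and encoding steps concrete, here is a rough sketch. The specific choices are assumptions for illustration, not necessarily what the rest of this project does: rows without a target value are dropped, numeric gaps are filled with the column median, categorical gaps with the most frequent value, and categories are one-hot encoded. The helper name `basic_clean` is hypothetical, and column removal is not covered.

```python
import pandas as pd


def basic_clean(frame, target="loan_status"):
    """Fill missing values and one-hot encode categorical columns (illustrative)."""
    # A row without a target value cannot be used to train a classifier.
    cleaned = frame.dropna(subset=[target]).copy()

    numeric = [c for c in cleaned.select_dtypes(include="number").columns if c != target]
    categorical = [c for c in cleaned.select_dtypes(exclude="number").columns if c != target]

    # Assumption: the median is a reasonable fill for numeric gaps.
    cleaned[numeric] = cleaned[numeric].fillna(cleaned[numeric].median())

    # Assumption: the most frequent category fills categorical gaps.
    # This raises IndexError if a column is entirely missing.
    for column in categorical:
        cleaned[column] = cleaned[column].fillna(cleaned[column].mode().iloc[0])

    # One 0/1 indicator column per category value.
    return pd.get_dummies(cleaned, columns=categorical, dtype=int)
```

The median is used rather than the mean because columns such as `person_income` have a maximum far above their 75th percentile, which would pull a mean-based fill upward.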

## 