<h1 align='center' style="color: #844bff; font-size: 48px"><strong>Alura Cash</strong></h1>

<h3 align='center'>A cute finance project using Data Science concepts</h3><br>

<p align='center' style='text-align: justify'>
I've been hired as a freelance data scientist to work for an international digital bank called Alura Cash. At the first meeting, the finance team informs me that customers are repeatedly defaulting after their loans have been released. Therefore, I'm asked for a solution to decrease the financial losses caused by borrowers who do not pay their debts.
</p>

<p align='center' style='text-align: justify'>
As a data scientist, I suggest a study around financial and loan application information aiming to find patterns that might indicate a possible default.
</p>

<p align='center' style='text-align: justify'>
So, I request a dataset that contains information about customers, loan applications, and credit history, as well as whether each borrower is delinquent or not. With this data, I can build a classifier that flags potentially delinquent customers and helps solve Alura Cash's problem. <br> <br>

>&nbsp; <br>
>Note: This is a fictional case study. The dataset used in this project is not real. <br>
>&nbsp; <br>

</p>

# 1. Data Collection

Here, I'll be using the dataset provided by Alura Cash at the first meeting, which contains information about 34,501 customers. <br><br>

<p align='center'>
    <i>To access this data yourself, you can follow <a href='https://raw.githubusercontent.com/Mirlaa/Challenge-Data-Science-1ed/main/Dados/dados_juntos.csv'>this link</a></i> 😊 
</p>

## 1.1 - Importing the dataset

Since I downloaded it to my local environment, I'll import it from its path using the pandas library. I'll also import numpy and plotly for now.

In [409]:
import plotly.express as px
import pandas as pd
import numpy as np

uri = './data/alura_cash_data.csv'
df = pd.read_csv(uri)

## 1.2 - Data Description

Let's check some useful info about this set regarding the number of rows and columns, the data types, the number of missing values, and some statistics around the numerical variables.

### 1.2.1 - Number of rows and columns

In [410]:
print('''
    This data contains the following amount of rows and columns:

    => {} rows and {} columns.
'''.format(df.shape[0], df.shape[1]))



    This data contains the following amount of rows and columns:

    => 34501 rows and 12 columns.



### 1.2.2 - Data types

In [411]:
for column in df.columns:
    print('{} <-> {}'.format(column, df[column].dtype))

person_age <-> float64
person_income <-> float64
person_home_ownership <-> object
person_emp_length <-> float64
loan_intent <-> object
loan_grade <-> object
loan_amnt <-> float64
loan_int_rate <-> float64
loan_status <-> float64
loan_percent_income <-> float64
cb_person_default_on_file <-> object
cb_person_cred_hist_length <-> float64


### 1.2.3 - Missing values

In [412]:
df.isna().sum()

person_age                     324
person_income                  339
person_home_ownership          331
person_emp_length             1254
loan_intent                    315
loan_grade                     313
loan_amnt                      331
loan_int_rate                 3630
loan_status                    343
loan_percent_income            319
cb_person_default_on_file      370
cb_person_cred_hist_length       4
dtype: int64

### 1.2.4 - Numerical variables

In [413]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| person_age | 34177.0 | 27.731018 | 6.345281 | 20.0 | 23.0 | 26.0 | 30.0 | 144.0 |
| person_income | 34162.0 | 66028.687957 | 61405.057742 | 4000.0 | 38493.0 | 55000.0 | 79200.0 | 6000000.0 |
| person_emp_length | 33247.0 | 4.787229 | 4.137463 | 0.0 | 2.0 | 4.0 | 7.0 | 123.0 |
| loan_amnt | 34170.0 | 9590.576529 | 6320.429041 | 500.0 | 5000.0 | 8000.0 | 12200.0 | 35000.0 |
| loan_int_rate | 30871.0 | 11.01363 | 3.24124 | 5.42 | 7.9 | 10.99 | 13.47 | 23.22 |
| loan_status | 34158.0 | 0.218192 | 0.413024 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| loan_percent_income | 34182.0 | 0.170227 | 0.106783 | 0.0 | 0.09 | 0.15 | 0.23 | 0.83 |
| cb_person_cred_hist_length | 34497.0 | 5.808186 | 4.063231 | 2.0 | 3.0 | 4.0 | 8.0 | 30.0 |


## 1.3 - Notes

In [414]:
df.sample(3)

| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8964 | 37.0 | 33888.0 | Rent | 7.0 | Education | D | 3200.0 | 12.86 | 0.0 | 0.09 | Y | 16.0 |
| 30451 | 24.0 | 46000.0 | Mortgage | NaN | Medical | D | 10000.0 | 15.99 | 1.0 | 0.22 | N | 3.0 |
| 34173 | 21.0 | 12000.0 | Rent | NaN | Medical | B | 2500.0 | 11.99 | 1.0 | 0.21 | N | 4.0 |


<p align='center'>
With these few commands, I know that I'm dealing with more than 34 thousand rows and 12 columns. <br> 
Most of the columns are numerical, but there are also some categorical variables. I also know that relatively few values are missing: the worst case is loan_int_rate, with 3,630 missing entries out of 34,501 rows (about 10.5%). Finally, I can see that the numerical variables have different scales, which is a problem that I'll have to deal with later.
</p>
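
One common way to handle differing scales is standardization, which rescales each numerical column to mean 0 and standard deviation 1. The block below is only a minimal sketch on a small made-up frame, not necessarily the approach used later in this project; the `standardize` helper and the toy values are assumptions introduced for illustration.

```python
import pandas as pd

# Toy frame with two columns on very different scales (illustrative values only).
toy = pd.DataFrame({
    'person_income': [38000.0, 55000.0, 79000.0, 120000.0],
    'loan_percent_income': [0.09, 0.15, 0.23, 0.40],
})


def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where every numeric column has mean 0 and std 1."""
    numeric = frame.select_dtypes(include='number')
    scaled = frame.copy()
    # Subtract each column's mean and divide by its (sample) standard deviation.
    scaled[numeric.columns] = (numeric - numeric.mean()) / numeric.std()
    return scaled


scaled_toy = standardize(toy)
print(scaled_toy.round(3))
```

After this step both columns live on a comparable scale, so a distance-based or gradient-based model would not be dominated by the income values.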

# 2. Data Cleaning

In this section, I'll be cleaning the dataset, removing unnecessary columns, filling missing values, and transforming categorical variables into numerical ones.

## 2.2 - Dealing with missing values

As seen before, there are some missing values in the dataset. I'll be filling them with the median of each column if they are numerical, and with the value "Unknown" otherwise.

But, to make sure that I'm not introducing too much bias into the dataset, I'll first check that the share of missing values in each column, including the target variable `loan_status`, is less than or equal to 15% of the whole dataset.

In [415]:
for column in df.columns:
  null_rows = df[column].isna().sum()   
  print('{} -> {} => {}%'.format(
        column,
        null_rows, 
        round((null_rows / df.shape[0]) * 100, 2)
      )
    )

person_age -> 324 => 0.94%
person_income -> 339 => 0.98%
person_home_ownership -> 331 => 0.96%
person_emp_length -> 1254 => 3.63%
loan_intent -> 315 => 0.91%
loan_grade -> 313 => 0.91%
loan_amnt -> 331 => 0.96%
loan_int_rate -> 3630 => 10.52%
loan_status -> 343 => 0.99%
loan_percent_income -> 319 => 0.92%
cb_person_default_on_file -> 370 => 1.07%
cb_person_cred_hist_length -> 4 => 0.01%
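
The same per-column check can also be collected into a single table, which is easier to sort and filter than printed lines. This is a minimal sketch on a made-up frame; the `missing_summary` helper and the toy values are assumptions introduced for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only).
toy = pd.DataFrame({
    'person_age': [22.0, np.nan, 31.0, 45.0],
    'loan_grade': ['A', 'B', None, None],
    'loan_status': [0.0, 1.0, 0.0, 0.0],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and missing percentage per column."""
    summary = pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'missing': frame.isna().sum(),
        # isna().mean() is the fraction of missing rows in each column.
        'missing_pct': (frame.isna().mean() * 100).round(2),
    })
    return summary.sort_values('missing_pct', ascending=False)


summary = missing_summary(toy)
print(summary)
```

Filtering the result with something like `summary[summary['missing_pct'] > 15]` would list any column that exceeds the 15% threshold.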


### 2.2.1 - Filling numerical missing values

In [416]:
numerical_columns = [
  'person_age', 'person_income',
  'person_emp_length', 'loan_amnt',
  'loan_int_rate', 'loan_status',
  'loan_percent_income', 'cb_person_cred_hist_length',
]

# Fill each numerical column with its own median.
# Assigning the result back (instead of fillna(inplace=True) on df.<column>)
# keeps this working under pandas' copy-on-write behavior.
for column in numerical_columns:
  df[column] = df[column].fillna(df[column].median())

In [417]:
for column in df.columns:
  null_rows = df[column].isna().sum()   
  
  if null_rows == 0:
    print('{} -> {} => {}%'.format(
          column,
          null_rows, 
          round((null_rows / df.shape[0]) * 100, 2)
        )
      )

person_age -> 0 => 0.0%
person_income -> 0 => 0.0%
person_emp_length -> 0 => 0.0%
loan_amnt -> 0 => 0.0%
loan_int_rate -> 0 => 0.0%
loan_status -> 0 => 0.0%
loan_percent_income -> 0 => 0.0%
cb_person_cred_hist_length -> 0 => 0.0%


### 2.2.2 - Filling categorical missing values

In [418]:
categorical_columns = [
  'person_home_ownership', 'loan_intent',
  'loan_grade', 'cb_person_default_on_file',
]

# Assign the result back so the fill also works under pandas' copy-on-write behavior.
for column in categorical_columns:
  df[column] = df[column].fillna('Unknown')

In [419]:
print('NUMERICAL VARIABLES')
print('--------------------\n')

for column in df.columns:
  null_rows = df[column].isna().sum()   

  if df[column].dtype != 'object':
    print('{} -> {} => {}%'.format(
        column,
        null_rows, 
        round((null_rows / df.shape[0]) * 100, 2)
      )
    )

print('\nCATEGORICAL VARIABLES')
print('------------------------\n')

for column in df.columns:
  null_rows = df[column].isna().sum()   

  if df[column].dtype == 'object':
    print('{} -> {} => {}%'.format(
        column,
        null_rows, 
        round((null_rows / df.shape[0]) * 100, 2)
      )
    )

NUMERICAL VARIABLES
--------------------

person_age -> 0 => 0.0%
person_income -> 0 => 0.0%
person_emp_length -> 0 => 0.0%
loan_amnt -> 0 => 0.0%
loan_int_rate -> 0 => 0.0%
loan_status -> 0 => 0.0%
loan_percent_income -> 0 => 0.0%
cb_person_cred_hist_length -> 0 => 0.0%

CATEGORICAL VARIABLES
------------------------

person_home_ownership -> 0 => 0.0%
loan_intent -> 0 => 0.0%
loan_grade -> 0 => 0.0%
cb_person_default_on_file -> 0 => 0.0%
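
The two filling steps can also be expressed without listing column names by hand. The block below is a minimal sketch of that idea on a made-up frame; the `fill_missing` helper and the toy values are assumptions introduced for illustration. It follows the same rule as this section: the median for numerical columns and "Unknown" for everything else.

```python
import numpy as np
import pandas as pd

# Toy frame with one numerical and one categorical gap (illustrative values only).
toy = pd.DataFrame({
    'person_income': [30000.0, np.nan, 50000.0, 90000.0],
    'loan_intent': ['Medical', None, 'Education', 'Medical'],
})


def fill_missing(frame: pd.DataFrame, placeholder: str = 'Unknown') -> pd.DataFrame:
    """Fill numeric columns with their median and all other columns with a placeholder."""
    filled = frame.copy()
    numeric_cols = filled.select_dtypes(include='number').columns
    other_cols = filled.columns.difference(numeric_cols)
    # fillna with a Series matches the medians to columns by name.
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
    filled[other_cols] = filled[other_cols].fillna(placeholder)
    return filled


filled_toy = fill_missing(toy)
print(filled_toy)
```

Because the helper returns a copy, the original frame stays untouched, which makes it easy to compare the data before and after imputation.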


# 3. Exploratory Data Analysis

Now, I'll be exploring the dataset, looking for patterns and insights that might help me to build a better model. <br>
For this, I'll be using the plotly library, which is a great tool for data visualization.

## 3.1 - Grouping by Home Ownership

In [420]:
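# numeric_only=True restricts the median to numerical columns;
# without it, newer pandas versions raise an error on the text columns.
aux = df.groupby('person_home_ownership').median(numeric_only=True).reset_index()[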
  [
    'person_home_ownership',
    'person_age', 
    'person_income', 
    'loan_amnt', 
    'loan_status'
  ]
]

aux

| | person_home_ownership | person_age | person_income | loan_amnt | loan_status |
|---|---|---|---|---|---|
| 0 | Mortgage | 26.0 | 68500.0 | 9000.0 | 0.0 |
| 1 | Other | 24.0 | 59000.0 | 10000.0 | 0.0 |
| 2 | Own | 26.0 | 47900.0 | 7500.0 | 0.0 |
| 3 | Rent | 26.0 | 48000.0 | 7750.0 | 0.0 |
| 4 | Unknown | 26.0 | 55000.0 | 8000.0 | 0.0 |


In [430]:
def bar_plot(data, x, y, title, color, x_title, y_title):
  fig = px.bar(
    data, 
    x=x,
    y=y, 
    color_discrete_sequence=[color], 
    labels={ 
        'person_home_ownership': 'Home Ownership',
        'person_age': 'Age',
        'person_income': 'Median Income',
        'loan_amnt': 'Loan Amount',
        'loan_status': 'Loan Status',
      }
    )

  fig.update_layout(
    template='plotly_white',
    title=title,
    title_font_size=20,
    title_font_family='Fira Code',
    title_x=0.5,
  )

  fig.update_xaxes(
    title_text=x_title,
    tickfont=dict(
      family='Fira Code',
      size=14,
    )
  )

  fig.update_yaxes(
    title_text=y_title,
    tickfont=dict(
      family='Fira Code',
      size=12,
    )
  )

  fig.show()

aux = aux.sort_values('person_income', ascending=False)

bar_plot(
  data=aux, 
  x='person_home_ownership', 
  y='person_income', 
  x_title='Home Ownership',
  y_title='Median Income',
  color='#1f77b4',
  title='''Median Income by Home Ownership <br> 
  <span style=\'font-size: 14px\'> 
        Sum of defaulted and paid loans 
  </span>''', 
)

In [436]:
aux = aux.sort_values('loan_amnt', ascending=False)

bar_plot(
  data=aux, 
  x='person_home_ownership', 
  y='loan_amnt', 
  x_title='Home Ownership',
  y_title='Median Loan Amount',
  color='#ff7f0e',
  title='''Median Loan Amount by Home Ownership <br> 
  <span style=\'font-size: 14px\'> 
              Sum of defaulted and paid loans 
  </span>'''
)

In [457]:
# Keep only the loans that ended in default (loan_status == 1).
defaulted = df.query('loan_status == 1')
aux = defaulted.groupby('person_home_ownership') \
            .count() \
            .sort_values('loan_status', ascending=False) \
            .reset_index() \
            [['person_home_ownership', 'loan_status']]


bar_plot(
  data=aux, 
  x='person_home_ownership', 
  y='loan_status', 
  x_title='Home Ownership',
  y_title='Number of Defaulted Loans',
  color='#2ca02c',
  title='''Defaulted Loans by Home Ownership <br> 
  <span style=\'font-size: 14px\'> 
                                  Count of defaulted loans
  </span>'''
)
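
The bar chart above counts defaulted loans per category, so a category with many customers can dominate simply because of its size. Since loan_status is coded as 0 or 1, its mean within each group gives the share of loans that defaulted, which is closer to a "possibility of default". The block below is a minimal sketch of that calculation on made-up rows; the toy values and the output column names (`defaults`, `loans`, `default_rate`) are assumptions introduced for illustration.

```python
import pandas as pd

# Toy data: 1.0 means the loan defaulted, 0.0 means it was paid (illustrative values only).
toy = pd.DataFrame({
    'person_home_ownership': [
        'Rent', 'Rent', 'Rent', 'Rent',
        'Own', 'Own',
        'Mortgage', 'Mortgage', 'Mortgage', 'Mortgage',
    ],
    'loan_status': [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0],
})

default_rate = (
    toy.groupby('person_home_ownership')['loan_status']
       # sum = number of defaults, count = number of loans, mean = share of defaults.
       .agg(defaults='sum', loans='count', default_rate='mean')
       .sort_values('default_rate', ascending=False)
       .reset_index()
)

print(default_rate)
```

On the real dataset, the same pattern applied to `df` would yield a table that can be passed to `bar_plot` with `y='default_rate'`.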