# 9. A Loan Data Practical Example with NumPy
## 9_03 Setting Up: Checking for Incomplete Data

## Quick glance at loan-data.csv (notepad++)
- Contains both text and numeric data.
- A header clarifying the contents of each column
- The 1st col is called 'id'. Each row holds the info for one loan candidate's application, and each candidate is identified by their id. => We refer to the rows as accounts, candidates or applications.
- A quick look in the editor doesn't show whether there are missing values in the dataset.
- ';' as delimiter
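
The header and the missing values can also be checked in code rather than by eye. The sketch below uses a small in-memory sample in the same `;`-delimited layout. Its column names and values are invented for illustration and are not taken from loan-data.csv.

```python
import io

# Hypothetical sample in the same ';'-delimited layout;
# the column names and values are made up for illustration.
sample = io.StringIO(
    "id;issue_d;loan_amnt\n"
    "1;May-15;35000\n"
    "2;;30000\n"
    "3;Sep-15;\n"
)

# First line is the header; the remaining lines are data rows.
header = sample.readline().rstrip("\n").split(";")
rows = [line.rstrip("\n").split(";") for line in sample]

# An empty string between two delimiters is a missing value.
empty_per_column = {
    name: sum(1 for row in rows if row[i] == "")
    for i, name in enumerate(header)
}

print(header)
print(empty_per_column)
```

To try this on the real file, the `io.StringIO(...)` object could be swapped for `open('9_02_loan-data.csv')`, assuming the file sits next to the notebook.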

### genfromtxt(autostrip=False)
- autostrip: bool, optional. Whether to automatically strip white spaces from the variables.
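
The effect of `autostrip` is easiest to see on text fields. A minimal sketch on a made-up two-line sample (not the loan data), loaded once with the default `autostrip=False` and once with `autostrip=True`:

```python
from io import StringIO

import numpy as np

# Made-up sample with stray spaces around some of the fields.
text = "1; abc ; 2\n3; xxx; 4"

# Default behaviour: the spaces around each field are kept.
raw = np.genfromtxt(StringIO(text), delimiter=";", dtype="U5")

# autostrip=True: leading and trailing spaces are removed from every field.
stripped = np.genfromtxt(StringIO(text), delimiter=";", dtype="U5", autostrip=True)

print(raw)
print(stripped)
```

With text columns, leftover spaces would make `' abc '` and `'abc'` count as different values, which is why stripping matters before any comparison.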

In [12]:
import numpy as np
np.__version__
np.set_printoptions(suppress=True, linewidth=100, precision=2)

In [13]:
# Function show_attr

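def show_attr(arrnm: str) -> str:
    """Return a one-line summary of the main attributes of the array named `arrnm`."""
    strout = f' {arrnm}: '

    # eval() looks the name up in the notebook's global scope, so `arrnm`
    # must be the name of an existing global array.
    for attr in ('shape', 'ndim', 'size', 'dtype'):     #, 'itemsize'):
        arrnm_attr = arrnm + '.' + attr
        strout += f'| {attr}: {eval(arrnm_attr)} '

    return strout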
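
Because `show_attr` relies on `eval`, it only works for arrays that live in the global namespace. One possible alternative, sketched here under the hypothetical name `show_attr_of`, takes the array itself plus a label and reads the attributes with `getattr`. It is not part of the original notebook.

```python
import numpy as np


def show_attr_of(name: str, arr: np.ndarray) -> str:
    """Hypothetical eval-free variant: same output format as show_attr."""
    parts = [f" {name}: "]
    for attr in ("shape", "ndim", "size", "dtype"):
        # getattr reads the attribute directly from the array object.
        parts.append(f"| {attr}: {getattr(arr, attr)} ")
    return "".join(parts)


demo = np.zeros((2, 3))
print(show_attr_of("demo", demo))
```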

In [20]:
raw_data_np = np.genfromtxt('9_02_loan-data.csv',
                            delimiter=';',
                            skip_header=1,
                            autostrip=True)
display(raw_data_np)
display(show_attr('raw_data_np'))
print('Num of NANs:', np.isnan(raw_data_np).sum())

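# Lots of NaNs: genfromtxt returns NaN for text fields as well as for missing values.
# skip_header=1 drops the header row, which would otherwise be parsed as a full row of NaNs.
# autostrip=True removes surrounding white spaces, which could otherwise distort our cols.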

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])

' raw_data_np: | shape: (10000, 14) | ndim: 2 | size: 140000 | dtype: float64 '

Num of NANs: 88005
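
A single total says little about where the NaNs are. Counting them per column is a natural next step. The sketch below uses a small stand-in array with invented values, not the real `raw_data_np`.

```python
import numpy as np

# Small stand-in array (invented values), not the real loan data.
arr = np.array([
    [1.0, np.nan, 35000.0],
    [2.0, np.nan, np.nan],
    [3.0, np.nan, 15000.0],
])

nan_mask = np.isnan(arr)

total_nans = nan_mask.sum()           # NaNs in the whole array
nans_per_col = nan_mask.sum(axis=0)   # NaNs in each column
share_per_col = nan_mask.mean(axis=0) # fraction of NaN rows in each column

print("Total NaNs:", total_nans)
print("NaNs per column:", nans_per_col)
print("Share per column:", share_per_col)
```

A column whose share is 1.0 holds no parsable numbers at all, which points to a text column rather than to missing entries.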


## Structure of the working process
- Gathering, cleaning and preprocessing the data <- Data Analysts
- DAnalysts hand the data over to the DScientists (who have the ML knowledge) to construct complex predictive models.
### DAnalysts Role:
1. Our goal is to obtain a clean and preprocessed dataset.
2. We'll note down all the changes we make to the original dataset in a documentation file where we describe what each column of the new dataset represents.
3. This info will be invaluable to the DScientists who will work with this data after us.

## A day in the life of a DAnalyst
- Explain our role in the project.
- Examine the data.
- Import the data.
- Split the data.

## The Case
- Role: DAnalyst in the data science team of a central bank in Europe.
- Team assignment: create a credit risk model (CRM) which estimates the probability of default for every personal account.
- Key terms: probability of default, recovery rate, and credit risk modeling.
- Task: take the raw dataset and prepare it for the models they plan to run.
- Details provided:
    1. What data is stored in every column.
    2. A set of rules on how to clean and preprocess the values in each column.
> This is the essence of the DAnalyst job, and it is much more demanding and sizable than it might initially sound.

## Step by Step approach to the problem
1. The loan data is a sample from a larger dataset that belongs to an affiliate bank based in the USA. All the values are therefore in dollars, so we need to provide their Euro equivalents.
2. Every categorical variable must be quantified. We need to change any text columns into numbers based on the info they contain.
    - Issue date on each loan: the transformation is straightforward since we can split the accounts by months.
    - For other cols, we only care whether they carry a positive or a negative connotation. So we'll be turning them into __*dummy variables*__ that hold either zero or one.
3. Missing Data:
    - When we're measuring creditworthiness, we need to be extremely risk-averse and distrustful of any unavailable data.
    - The consensus in the field is that missing info suggests foul play, because loan applications are self-reported. To elaborate: since candidates fill out their loan applications manually, they have an incentive to withhold info that could lower their chances of getting a loan.
    - Of course we prefer to give out loans to applicants who can repay them. So __*if the information isn't available, we'll just assume the worst*__.
    - What counts as worst varies from one column to the next, so the team has provided us with casting directions for each variable in the dataset.
    - Therefore, as we go through the dataset, we'll usually know whether we want to use the minimum, maximum, or some other value when taking care of missing data.
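
Steps 1 and 2 above can be sketched on toy arrays. Everything in this block is an assumption made for illustration: the exchange rate, the amounts, the month labels and the status labels are invented, and the real casting directions come from the team.

```python
import numpy as np

# Step 1: dollars to euros with a hypothetical fixed exchange rate.
USD_TO_EUR = 0.9  # assumed rate, not a real quote
loan_usd = np.array([35000.0, 30000.0, 15000.0])
loan_eur = loan_usd * USD_TO_EUR

# Step 2a: issue date -> month number (1..12) through a lookup table.
month_names = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_to_num = {name: i + 1 for i, name in enumerate(month_names)}
issue_month = np.array(["May", "Sep", "May"])       # invented labels
issue_month_num = np.array([month_to_num[m] for m in issue_month])

# Step 2b: text column -> dummy variable (0 = negative, 1 = positive).
status = np.array(["Fully Paid", "Charged Off", "Current"])  # invented labels
negative_labels = np.array(["Charged Off"])                  # assumed "bad" set
status_dummy = np.where(np.isin(status, negative_labels), 0, 1)

print(loan_eur)
print(issue_month_num)
print(status_dummy)
```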

> Loan info is stored in a .csv file called loan-data.csv

#### Missing Data in Summary
When measuring creditworthiness, we must be extremely risk-averse and distrustful of any unavailable data. The consensus in the field is that missing information suggests foul play, because loan applications are self-reported: since candidates complete their applications by hand, they have an incentive to withhold information that could lower their chances of getting a loan. We prefer to grant loans to applicants who can repay them, so if the information is not available, we simply assume the worst. What counts as worst varies from one column to the next, so the team has given us casting directions for each variable in the dataset. As we work through the dataset, we will therefore usually know whether to use the minimum, the maximum, or some other value when handling missing data.
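
The "assume the worst" rule can be sketched with `np.nanmin`, `np.nanmax` and `np.where`. Which column takes the minimum and which takes the maximum is an assumption here (lowest funded amount, highest interest rate), and the values are invented. The real choices follow the team's casting directions.

```python
import numpy as np

# Invented numeric block:
#   column 0 = funded amount  (worst case assumed to be the minimum)
#   column 1 = interest rate  (worst case assumed to be the maximum)
data = np.array([
    [35000.0, 13.33],
    [np.nan, 28.99],
    [15000.0, np.nan],
])

# Column-wise extremes that ignore the NaNs.
col_min = np.nanmin(data, axis=0)
col_max = np.nanmax(data, axis=0)

# Replace each NaN with the assumed worst value of its column.
filled = data.copy()
filled[:, 0] = np.where(np.isnan(data[:, 0]), col_min[0], data[:, 0])
filled[:, 1] = np.where(np.isnan(data[:, 1]), col_max[1], data[:, 1])

print(filled)
```

Working on a copy keeps the original array intact, so the number of filled values can still be checked against the NaN count afterwards.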
