# 7. Data Preprocessing with NumPy
## 9. A Loan Data Practical Example with NumPy
- Summary

### 0. Import Libraries, set_printoptions, and def show_attr()

In [1]:
import numpy as np
np.__version__
np.set_printoptions(suppress=True, linewidth=100, precision=2)

In [2]:
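# show_attr: one-line summary of an array's basic attributes.
# The array is passed by NAME (a string) and looked up with eval(),
# so it must be a global variable; only pass names you defined yourself.

def show_attr(arrnm: str) -> str:
    strout = f' {arrnm}: '

    for attr in ('shape', 'ndim', 'size', 'dtype'):     #, 'itemsize'):
        arrnm_attr = arrnm + '.' + attr
        strout += f'| {attr}: {eval(arrnm_attr)} '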

    return strout
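
A possible alternative, shown only as a sketch: the same summary can be built without `eval()` by passing the array itself and reading each attribute with `getattr`. The name `describe_array` is hypothetical and is not used elsewhere in these notes.

```python
import numpy as np

def describe_array(name: str, arr: np.ndarray) -> str:
    """Return a one-line summary of an array's basic attributes."""
    parts = [f' {name}: ']
    for attr in ('shape', 'ndim', 'size', 'dtype'):
        # getattr reads the attribute directly from the array object
        parts.append(f'| {attr}: {getattr(arr, attr)} ')
    return ''.join(parts)

demo = np.zeros((3, 2))
print(describe_array('demo', demo))
```

This version also works for arrays that are local variables, at the cost of typing the name twice in each call.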

### 1. Setting Up: Introduction to the Practical Example

#### What working as a data analyst in a data science team looks like
1. Introduce the project.
2. Review the tasks and responsibilities that the head of data analytics assigned to us.
3. Examine the dataset (before we start cleaning and preprocessing).
4. Clean and preprocess the dataset.
5. Save it to an external *.csv file (once the data is ready to be analyzed).
6. Pass it on to the data scientists.
> DScientists will use the cleaned and preprocessed dataset to build a CRM (credit risk model) that measures the probability of default.

#### Structure of the working process
- Gathering, cleaning, and preprocessing the data <- Data Analysts
- DAnalysts hand the data over to the DScientists (who have ML knowledge) to build complex predictive models.
##### DAnalysts' Role:
1. Our goal is to obtain a clean and preprocessed dataset.
2. We'll note down every change we make to the original dataset in a documentation file that describes what each column of the new dataset represents.
3. This information will be invaluable to the DScientists who work with the data after us.

#### A day in the life of a DAnalyst
- Explain our role in the project.
- Examine the data.
- Import the data.
- Split the data.

#### The Case
- Role: DAnalyst in a data science team of a central bank in Europe.
- Team assignment: create a CRM that estimates the probability of default for every personal account.
- Key terms: probability of default, recovery rate, and credit risk modeling.
- Chore: take the raw dataset and prepare it for the models they plan to run.
- Details provided:
    1. What data is stored in every column.
    2. A set of rules on how to clean and preprocess the values in each column.
> This is the essence of the DAnalyst job, and it is much more demanding and sizable than it might initially sound.

#### Step by Step approach to the problem
1. The loan data is a sample from a larger dataset that belongs to an affiliate bank based in the USA. All the values are therefore in dollars, so we need to provide their euro equivalents.
2. Every categorical variable must be quantified. We need to convert any text columns into numbers based on the information they contain.
    - Issue date of each loan: the transformation is straightforward, since we can split the accounts by month.
    - For the other columns, we only care whether they carry positive or negative connotations, so we'll turn them into __*dummy variables*__ that hold either zero or one.
3. Missing data:
    - When measuring creditworthiness, we need to be extremely risk-averse and distrustful of any unavailable data.
    - The consensus in the field is that missing information suggests foul play, because loan applications are self-reported. Since candidates fill out their applications manually, they have an incentive to withhold information that could lower their chances of getting a loan.
    - Of course, we prefer to give out loans to applicants who can repay them. So __*if the information isn't available, we'll just assume the worst*__.
    - What counts as worst varies from one column to the next, so the team has provided us with casting directions for each variable in the dataset.
    - Therefore, as we go through the dataset, we'll usually know whether to use the minimum, the maximum, or some other value when handling missing data.
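
The first two rules above can be sketched on toy arrays. Everything in this example is an assumption made for illustration: the exchange rate `USD_TO_EUR`, the sample amounts, and the status labels treated as "bad" do not come from the dataset or from the team's rules.

```python
import numpy as np

# Hypothetical values for illustration only; not taken from the loan data.
USD_TO_EUR = 0.9                                  # assumed exchange rate
loan_usd = np.array([35000.0, 30000.0, 15000.0])
status = np.array(['Current', 'Charged Off', '', 'Fully Paid'])

# Rule 1: provide a euro equivalent for each dollar amount.
loan_eur = loan_usd * USD_TO_EUR

# Rule 2: collapse a text column into a 0/1 dummy variable.
# Assumed rule: these labels, and missing values (''), count as negative.
bad_labels = np.array(['', 'Charged Off', 'Default'])
status_dummy = np.where(np.isin(status, bad_labels), 0, 1)

print(loan_eur)
print(status_dummy)
```

Treating the empty string as negative mirrors the "assume the worst" principle for missing information.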

> The loan data is stored in a .csv file called loan-data.csv

### 2. Setting Up: Importing the Data Set

In [3]:
raw_data_np = np.genfromtxt('9_loan-data.csv',
                            delimiter=';',
                            skip_header=1,
                            autostrip=True)
display(raw_data_np)
display(show_attr('raw_data_np'))

# The first row of the file holds the column names, which parse as NaN, hence skip_header=1

array([[48010226.  ,         nan,    35000.  , ...,         nan,         nan,     9452.96],
       [57693261.  ,         nan,    30000.  , ...,         nan,         nan,     4679.7 ],
       [59432726.  ,         nan,    15000.  , ...,         nan,         nan,     1969.83],
       ...,
       [50415990.  ,         nan,    10000.  , ...,         nan,         nan,     2185.64],
       [46154151.  ,         nan,         nan, ...,         nan,         nan,     3199.4 ],
       [66055249.  ,         nan,    10000.  , ...,         nan,         nan,      301.9 ]])

' raw_data_np: | shape: (10000, 14) | ndim: 2 | size: 140000 | dtype: float64 '
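
Why so many values come back as `nan` can be reproduced on a small scale. The sketch below uses an in-memory stand-in for the file; the three rows are made up, and only the `;` layout mirrors the real data. With the default float dtype, `np.genfromtxt` turns both text fields and empty fields into `nan`.

```python
import io
import numpy as np

# Toy stand-in for the real file (hypothetical rows, same ';' layout).
csv_text = "id;issue_d;loan_amnt\n1;May-15;35000\n2;;30000\n3;Sep-15;\n"

toy = np.genfromtxt(io.StringIO(csv_text),
                    delimiter=';',
                    skip_header=1,      # skip the row of column names
                    autostrip=True)
print(toy)

# Column 1 holds text, so every entry is nan;
# column 2 is numeric with one genuinely missing value.
print(np.isnan(toy).sum(axis=0))
```

A column that is entirely `nan` is therefore a hint that it holds text, while scattered `nan` values in a numeric column are real gaps.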

### 3. Setting Up: Checking for Incomplete Data

In [4]:
print('Number of NANs: ', np.isnan(raw_data_np).sum())

# Compute tmp_fill (a temporary filler above the global maximum) and the mean of each column,
# then stack min, mean, and max per column into orig_stats
print('Original Maximum: ', np.nanmax(raw_data_np))
display(tmp_fill := np.nanmax(raw_data_np) + 1)
display(orig_means := np.nanmean(raw_data_np, axis=0))

orig_stats = np.array([np.nanmin(raw_data_np, axis=0),
                       orig_means,
                       np.nanmax(raw_data_np, axis=0)])
display(orig_stats)

# orig_means shows 8 columns that are entirely NaN (the text columns);
# np.nanmean, np.nanmin, and np.nanmax emit RuntimeWarnings for them

Number of NANs:  88005
Original Maximum:  68616519.0


68616520.0



array([54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
            440.92,         nan,         nan,         nan,         nan,         nan,     3143.85])



array([[  373332.  ,         nan,     1000.  ,         nan,     1000.  ,         nan,        6.  ,
              31.42,         nan,         nan,         nan,         nan,         nan,        0.  ],
       [54015809.19,         nan,    15273.46,         nan,    15311.04,         nan,       16.62,
             440.92,         nan,         nan,         nan,         nan,         nan,     3143.85],
       [68616519.  ,         nan,    35000.  ,         nan,    35000.  ,         nan,       28.99,
            1372.97,         nan,         nan,         nan,         nan,         nan,    41913.62]])

### 4. Setting Up: Splitting the Dataset

In [5]:
# Column indices of the string columns (their mean is NaN)
cols_str = np.argwhere(np.isnan(orig_means)).squeeze()
display(cols_str)

# Column indices of the numeric columns (their mean is a number)
cols_num = np.argwhere(~np.isnan(orig_means)).squeeze()
display(cols_num)

# Check: len(cols_str) + len(cols_num) == raw_data_np.shape[1]
len(cols_str) + len(cols_num) == raw_data_np.shape[1]

array([ 1,  3,  5,  8,  9, 10, 11, 12], dtype=int64)

array([ 0,  2,  4,  6,  7, 13], dtype=int64)

True
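
One caveat with `np.argwhere(...).squeeze()`, shown here as a sketch on a toy vector of column means: if only a single column matches, `squeeze()` collapses the result to a 0-d array, and `len()` no longer works on it. `np.flatnonzero` is one alternative that always returns a 1-D index array.

```python
import numpy as np

# Toy column means with exactly one text (all-NaN) column.
col_means = np.array([54015809.19, np.nan, 15273.46])
is_text = np.isnan(col_means)

text_cols = np.flatnonzero(is_text)      # always 1-D, even for one match
num_cols = np.flatnonzero(~is_text)
print(text_cols, num_cols)

squeezed = np.argwhere(is_text).squeeze()
print(squeezed.ndim)                     # 0-d array: len(squeezed) would raise TypeError
```

With eight text columns and six numeric ones, as in this dataset, both approaches give the same indices.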

In [6]:
# Reload only the string columns to build the text sub-array (data rows)
data_str = np.genfromtxt('9_loan-data.csv',
                         delimiter=';',
                         dtype=str,
                         usecols=cols_str,
                         skip_header=1,
                         autostrip=True)
display(data_str)
show_attr('data_str')

array([['May-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=48010226', 'CA'],
       ['', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=57693261', 'NY'],
       ['Sep-15', 'Current', '36 months', ..., 'Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=59432726', 'PA'],
       ...,
       ['Jun-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=50415990', 'CA'],
       ['Apr-15', 'Current', '36 months', ..., 'Source Verified',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=46154151', 'OH'],
       ['Dec-15', 'Current', '36 months', ..., '',
        'https://www.lendingclub.com/browse/loanDetail.action?loan_id=66055249', 'IL']],
      dtype='<U69')

' data_str: | shape: (10000, 8) | ndim: 2 | size: 80000 | dtype: <U69 '

In [7]:
# Reload only the numeric columns to build the numeric sub-array (data rows)
# and fill the NaNs with tmp_fill
data_num = np.genfromtxt('9_loan-data.csv',
                         delimiter=';',
                         usecols=cols_num,
                         skip_header=1,
                         autostrip=True,
                         filling_values=tmp_fill)
display(data_num)
display(show_attr('data_num'))
print("Num of NANs:", np.isnan(data_num).sum())

array([[48010226.  ,    35000.  ,    35000.  ,       13.33,     1184.86,     9452.96],
       [57693261.  ,    30000.  ,    30000.  , 68616520.  ,      938.57,     4679.7 ],
       [59432726.  ,    15000.  ,    15000.  , 68616520.  ,      494.86,     1969.83],
       ...,
       [50415990.  ,    10000.  ,    10000.  , 68616520.  , 68616520.  ,     2185.64],
       [46154151.  , 68616520.  ,    10000.  ,       16.55,      354.3 ,     3199.4 ],
       [66055249.  ,    10000.  ,    10000.  , 68616520.  ,      309.97,      301.9 ]])

' data_num: | shape: (10000, 6) | ndim: 2 | size: 60000 | dtype: float64 '

Num of NANs: 0
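
The idea behind the temporary filler can be sketched on a toy array. Because the filler is larger than every real value, it cannot be confused with genuine data and can later be swapped for a per-column "worst" value. Which statistic counts as worst for each column is an assumption in this example; the real choices come from the team's rules.

```python
import numpy as np

# Toy numeric data with gaps (made-up values).
raw = np.array([[13.33, 1184.86],
                [np.nan, 938.57],
                [16.55, np.nan]])

sentinel = np.nanmax(raw) + 1            # larger than every real value
filled = np.where(np.isnan(raw), sentinel, raw)

# Column statistics computed on the original data, ignoring NaNs.
col_min = np.nanmin(raw, axis=0)
col_max = np.nanmax(raw, axis=0)

# Assumed rule for this toy: worst case is the max for column 0, the min for column 1.
filled[:, 0] = np.where(filled[:, 0] == sentinel, col_max[0], filled[:, 0])
filled[:, 1] = np.where(filled[:, 1] == sentinel, col_min[1], filled[:, 1])
print(filled)
```

Computing the statistics before filling matters: once the filler is in place, a plain `max` would return the filler itself.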


In [8]:
# Load the header of the string columns (skip_footer drops every data row)
header_str = np.genfromtxt('9_loan-data.csv',
                            delimiter=';',
                            dtype=str,
                            usecols=cols_str,
                            skip_footer=raw_data_np.shape[0],
                            autostrip=True)
display(header_str)
show_attr('header_str')

array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

' header_str: | shape: (8,) | ndim: 1 | size: 8 | dtype: <U19 '

In [9]:
# Load the header of the numeric columns (skip_footer drops every data row)
header_num = np.genfromtxt('9_loan-data.csv',
                            delimiter=';',
                            dtype=str,
                            usecols=cols_num,
                            skip_footer=raw_data_np.shape[0],
                            autostrip=True)
display(header_num)
show_attr('header_num')

array(['id', 'loan_amnt', 'funded_amnt', 'int_rate', 'installment', 'total_pymnt'], dtype='<U11')

' header_num: | shape: (6,) | ndim: 1 | size: 6 | dtype: <U11 '
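
As a sketch of an alternative, the header row can also be read once with `max_rows=1` and then split by index, which avoids a second pass with `skip_footer`. The in-memory text and the index arrays below are toy stand-ins, not the real file or its column indices.

```python
import io
import numpy as np

# Toy stand-in for the real file (hypothetical rows, same ';' layout).
csv_text = "id;issue_d;loan_amnt\n1;May-15;35000\n2;;30000\n"

# Read only the first line, as text.
header_all = np.genfromtxt(io.StringIO(csv_text),
                           delimiter=';',
                           dtype=str,
                           max_rows=1,
                           autostrip=True)

toy_cols_str = np.array([1])             # toy index arrays
toy_cols_num = np.array([0, 2])
toy_header_str = header_all[toy_cols_str]
toy_header_num = header_all[toy_cols_num]
print(toy_header_str, toy_header_num)
```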

### 5. Setting Up: Creating Checkpoints

In [10]:
# Function chkpt (checkpoint)

def chkpt(filenm: str, chk_header: np.ndarray,
          chk_data: np.ndarray) -> np.lib.npyio.NpzFile:
    np.savez(filenm, header=chk_header, data=chk_data)
    chkpt_var = np.load(f'{filenm}.npz')
    return chkpt_var

In [11]:
chkpt_tst = chkpt('chkpt-tst', header_num, data_num)
display(chkpt_tst.files)
np.array_equal(chkpt_tst['data'], data_num)

['header', 'data']

True
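
The save-and-reload round trip can be tried in isolation. This sketch writes to a temporary directory so it leaves no files behind; the arrays and the file name `chkpt-demo` are made up. `np.load` on an `.npz` file returns a lazy `NpzFile` that keeps the file open, so the example copies the arrays out inside a `with` block.

```python
import os
import tempfile
import numpy as np

header = np.array(['id', 'loan_amnt'])
data = np.array([[1.0, 35000.0], [2.0, 30000.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'chkpt-demo')      # np.savez appends '.npz'
    np.savez(path, header=header, data=data)

    # Copy the arrays into a plain dict before the file is closed.
    with np.load(path + '.npz') as npz:
        restored = {key: npz[key] for key in npz.files}

print(list(restored))
print(np.array_equal(restored['data'], data))
```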

### 6. Manipulating Text Data: Issue Date

- Rename the header of column issue_d to issue_date (more descriptive).
- Remove the '-15' suffix (all loans are from 2015) and replace each month name with its number (1-12), using 0 for missing values ('').

In [12]:
# Rename the header of column issue_d to issue_date (more descriptive)
display(header_str)
display(np.argwhere(header_str == 'issue_d'))
header_str[0] = 'issue_date'
display(f'{header_str[0] = }')
display(header_str)


array(['issue_d', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

array([[0]], dtype=int64)

"header_str[0] = 'issue_date'"

array(['issue_date', 'loan_status', 'term', 'grade', 'sub_grade', 'verification_status', 'url',
       'addr_state'], dtype='<U19')

In [13]:
# Inspect the column's unique values, then view them sorted by count
id_uniq, id_ucnt = np.unique(data_str[:,0], return_counts=True)
display(id_uniq, id_ucnt)
id_ucnt_ixsorted = np.argsort(id_ucnt)
id_uniq[id_ucnt_ixsorted], id_ucnt[id_ucnt_ixsorted]

array(['', 'Apr-15', 'Aug-15', 'Dec-15', 'Feb-15', 'Jan-15', 'Jul-15', 'Jun-15', 'Mar-15',
       'May-15', 'Nov-15', 'Oct-15', 'Sep-15'], dtype='<U69')

array([ 500,  757,  846,  997,  530,  770, 1061,  654,  559,  741,  847, 1095,  643], dtype=int64)

(array(['', 'Feb-15', 'Mar-15', 'Sep-15', 'Jun-15', 'May-15', 'Apr-15', 'Jan-15', 'Aug-15',
        'Nov-15', 'Dec-15', 'Jul-15', 'Oct-15'], dtype='<U69'),
 array([ 500,  530,  559,  643,  654,  741,  757,  770,  846,  847,  997, 1061, 1095], dtype=int64))

In [14]:
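# Remove the '-15' suffix. strip() drops any of the characters '-', '1', '5'
# from both ends, which is safe here because no month name contains them.
data_str[:,0] = np.char.strip(data_str[:,0], '-15')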
display(np.unique(data_str[:,0]))

# Replace each month name with its number; '' (missing) becomes 0
months = np.array(['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov','Dec'])

for mth_num in range(13):
    # Pass the number as text so the result keeps the column's string dtype
    data_str[:,0] = np.where(data_str[:,0] == months[mth_num],
                             str(mth_num),
                             data_str[:,0])
    
display(np.unique(data_str[:,0]))
display(np.unique(data_str[:,0])[id_ucnt_ixsorted])

array(['', 'Apr', 'Aug', 'Dec', 'Feb', 'Jan', 'Jul', 'Jun', 'Mar', 'May', 'Nov', 'Oct', 'Sep'],
      dtype='<U69')

array(['0', '1', '10', '11', '12', '2', '3', '4', '5', '6', '7', '8', '9'], dtype='<U69')

array(['0', '12', '5', '9', '4', '6', '1', '2', '10', '7', '11', '3', '8'], dtype='<U69')
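
The month conversion can also be sketched without a loop over `np.where`. In this toy version, `np.char.partition` splits each value at the first `-`, and a plain dictionary maps the month abbreviation to its number. The sample dates are made up, and the names `month_of` and `issue_month` are hypothetical.

```python
import numpy as np

issue = np.array(['May-15', '', 'Sep-15', 'Dec-15'])    # toy sample

months = ['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_of = {name: num for num, name in enumerate(months)}  # '' maps to 0

# partition returns (before, separator, after) per element; keep the part before '-'.
abbr = np.char.partition(issue, '-')[:, 0]
issue_month = np.array([month_of[a] for a in abbr])
print(issue_month)
```

Unlike the string column above, this yields an integer array directly, so no later string-to-number conversion is needed for this column.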

### 7. Manipulating Text Data: Loan Status and Term

### 8. Manipulating Text Data: Grade and Sub Grade

### 9. Manipulating Text Data: Verification Status & URL

### 10. Manipulating Text Data: State Address

### 11. Manipulating Text Data: Converting Strings and Creating a Checkpoint

### 12. Manipulating Numeric Data: Substitute Filler Values