# Introduction
## Purpose
The purpose of this article is to make a portfolio project to showcase some basic machine learning skills.

To do this I used a [tutorial](https://www.dataquest.io/blog/data-science-portfolio-machine-learning/) from Dataquest.

This article is a rehash of the Dataquest tutorial project. In a separate article, I will build a different project with the same format.

The source code of this project is also available on [GitHub](https://github.com/m4rtinpf/loan-prediction).


## Topic
According to [Investopedia](https://www.investopedia.com/articles/investing/091814/fannie-mae-what-it-does-and-how-it-operates.asp):

>Fannie Mae is a government-sponsored enterprise that makes mortgages available to low- and moderate-income borrowers. It does not provide loans, but backs or guarantees them in the secondary mortgage market.

The goal of this project is to predict if a loan acquired by Fannie Mae will go into foreclosure or not.

## Basic datasets
Fannie Mae publishes [data](https://capitalmarkets.fanniemae.com/credit-risk-transfer/single-family-credit-risk-transfer/fannie-mae-single-family-loan-performance-data) on the loans it has acquired and how they perform over time.

In 2020, Fannie Mae modified the way they publish their data. To avoid having to make too many modifications to the tutorial project, I used the [old dataset](https://rapidsai.github.io/demos/datasets/mortgage-data).

## Project structure
This project is structured as a set of Python scripts (`.py` files) instead of a Jupyter Notebook, so that it can be run in an automated way rather than interactively.

# Getting the data
## Organising the files
First, we'll create a directory called `loan-prediction`, and then another directory inside it called `data`.

```
mkdir loan-prediction
cd loan-prediction
mkdir data
cd data
```

## Downloading the data
The following script downloads a `gzip` compressed `tar` file containing the data, from 2000 to 2015. It does so like this:

* Checks if the file already exists, using the `os.path.isfile()` function.
  
  If it doesn't exist, it gets the file from the `url` using `urllib.request.urlopen()`, and saves it using `shutil.copyfileobj()`.

  ```
import os
import shutil
import urllib.request

url = 'http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000-2015.tgz'
file_name = url.split('/')[-1]
# Check if the file exists
if not os.path.isfile(file_name):
    # Download the file from "url" and save it as "file_name"
    with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
```

* Uses the `tarfile` library to extract only the files to be used (the same ones the tutorial used).

  ```
import tarfile

extract_path = 'data'
# Open the tar file
tar = tarfile.open(file_name, "r")
# Loop through each file and extract only the ones that are needed
for f in [
    'acq/Acquisition_2015Q1.txt',
    'acq/Acquisition_2014Q4.txt',
    'acq/Acquisition_2014Q3.txt',
    'acq/Acquisition_2014Q2.txt',
    'acq/Acquisition_2014Q1.txt',
    'acq/Acquisition_2013Q4.txt',
    'acq/Acquisition_2013Q3.txt',
    'acq/Acquisition_2013Q2.txt',
    'acq/Acquisition_2013Q1.txt',
    'acq/Acquisition_2012Q4.txt',
    'acq/Acquisition_2012Q3.txt',
    'acq/Acquisition_2012Q2.txt',
    'acq/Acquisition_2012Q1.txt',
    'perf/Performance_2015Q1.txt',
    'perf/Performance_2014Q4.txt',
    'perf/Performance_2014Q3.txt',
    'perf/Performance_2014Q2.txt',
    'perf/Performance_2014Q1.txt',
    'perf/Performance_2013Q4.txt',
    'perf/Performance_2013Q3.txt_1',
    'perf/Performance_2013Q3.txt_0',
    'perf/Performance_2013Q2.txt_1',
    'perf/Performance_2013Q2.txt_0',
    'perf/Performance_2013Q1.txt_1',
    'perf/Performance_2013Q1.txt_0',
    'perf/Performance_2012Q4.txt_1',
    'perf/Performance_2012Q4.txt_0',
    'perf/Performance_2012Q3.txt_1',
    'perf/Performance_2012Q3.txt_0',
    'perf/Performance_2012Q2.txt_1',
    'perf/Performance_2012Q2.txt_0',
    'perf/Performance_2012Q1.txt_1',
    'perf/Performance_2012Q1.txt_0',
]:
    tar.extract(f, path=extract_path)
```

* Removes the `tar` file as it's no longer needed.

  ```
os.remove(file_name)
```

* Moves the extracted text files to the `data` directory (essentially, it removes the directory structure that the compressed file had).

  ```
# Move files from "acq" and "perf" directories to "data"
for directory in ['acq', 'perf']:
    for f in os.listdir('{0}/{1}'.format(extract_path, directory)):
        shutil.move('{0}/{1}/{2}'.format(extract_path, directory, f), extract_path)
```

* Removes the no longer used directories inside of `data`.

  ```
# Remove each of the now-empty directories
for directory in ['acq', 'perf']:
    os.rmdir('{0}/{1}'.format(extract_path, directory))
```
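The extraction and clean-up steps above can be folded into one helper. This is a minimal sketch rather than the project's actual script: the name `extract_flat` and its parameters are assumptions made for illustration, and it assumes every member sits one directory deep (as `acq/...` and `perf/...` do). It opens the archive in a `with` block so the file is closed even if extraction fails.

```python
import os
import shutil
import tarfile


def extract_flat(archive_path, members, extract_path):
    """Extract only `members` and move them directly under `extract_path`.

    `extract_flat` is a hypothetical helper, not part of the project.
    """
    os.makedirs(extract_path, exist_ok=True)
    # The context manager closes the archive once extraction is done
    with tarfile.open(archive_path, "r") as tar:
        for name in members:
            tar.extract(name, path=extract_path)
    # Move each file out of its sub-directory (e.g. "acq/") into extract_path
    for name in members:
        if os.path.dirname(name):
            shutil.move(os.path.join(extract_path, name), extract_path)
    # Remove the now-empty sub-directories
    for directory in {os.path.dirname(name) for name in members}:
        if directory:
            os.rmdir(os.path.join(extract_path, directory))
```

Calling `extract_flat(file_name, [...], 'data')` with the list of files shown earlier would leave only the text files inside `data`.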


# Installing the requirements
We now go back to the `loan-prediction` directory with `cd ..`.

## Requirements file
To make it easier for users to install the required libraries, we create a text file called `requirements.txt`. Each line of this file contains the name of one required library:

```
pandas
matplotlib
scikit-learn
numpy
ipython
scipy
```

## Installing
Now, installing the requirements is as easy as running `pip install -r requirements.txt`.

# Configuration file
To keep all the settings of the process in one place, we'll make a `settings.py` file to specify:

* The directory for the data files.

  ```
DATA_DIR = "data"
```

* The directory for the processed files.

  ```
PROCESSED_DIR = "processed"
```

* The minimum number of quarters that a loan has to be in the dataset in order to use it.

  ```
MINIMUM_TRACKING_QUARTERS = 4
```

* The name of the target variable.

  ```
TARGET = "foreclosure_status"
```

* The names of the columns that are not predictors.

  ```
NON_PREDICTORS = [TARGET, "id"]
```

* The number of cross-validation folds to use.

  ```
CV_FOLDS = 3
```
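Put together, the settings above form a short file. The sketch below repeats them in one place and then shows how a script might use `NON_PREDICTORS` to pick predictor columns; the `columns` list is a made-up example, not the real column set.

```python
# settings.py, assembled from the fragments above
DATA_DIR = "data"
PROCESSED_DIR = "processed"
MINIMUM_TRACKING_QUARTERS = 4
TARGET = "foreclosure_status"
NON_PREDICTORS = [TARGET, "id"]
CV_FOLDS = 3

# Illustration only: a made-up list of column names
columns = ["id", "balance", "dti", "foreclosure_status"]
# Everything that is neither the target nor the loan id is a predictor
predictors = [c for c in columns if c not in NON_PREDICTORS]
print(predictors)  # ['balance', 'dti']
```

In the real scripts these names are read through the module, for example `settings.DATA_DIR`.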


# Assembling the datasets
Right now there are a lot of `Acquisition` and `Performance` files in the `data` directory, one for each quarter.

We want to combine these into two files: `Acquisition.txt` and `Performance.txt`.

The text files don't have headers, so we'll define a `HEADERS` dictionary, with the name of the two types of datasets as keys, and the names of the columns of each one as values. The headers can be found [here](https://s3.amazonaws.com/dq-blog-files/lppub_file_layout.pdf).

```
HEADERS = {
    "Acquisition": [
        "id",
        "channel",
        "seller",
        "interest_rate",
        "balance",
        "loan_term",
        "origination_date",
        "first_payment_date",
        "ltv",
        "cltv",
        "borrower_count",
        "dti",
        "borrower_credit_score",
        "first_time_homebuyer",
        "loan_purpose",
        "property_type",
        "unit_count",
        "occupancy_status",
        "property_state",
        "zip",
        "insurance_percentage",
        "product_type",
        "co_borrower_credit_score"
    ],
    "Performance": [
        "id",
        "reporting_period",
        "servicer_name",
        "interest_rate",
        "balance",
        "loan_age",
        "months_to_maturity",
        "maturity_date",
        "msa",
        "delinquency_status",
        "modification_flag",
        "zero_balance_code",
        "zero_balance_date",
        "last_paid_installment_date",
        "foreclosure_date",
        "disposition_date",
        "foreclosure_costs",
        "property_repair_costs",
        "recovery_costs",
        "misc_costs",
        "tax_costs",
        "sale_proceeds",
        "credit_enhancement_proceeds",
        "repurchase_proceeds",
        "other_foreclosure_proceeds",
        "non_interest_bearing_balance",
        "principal_forgiveness_balance"
    ]
}
```

To speed up reading the files and reduce memory use, we specify the data type of each column in each dataset, so that pandas doesn't have to infer it.

```
TYPES = {
    "Acquisition": {
        'id': 'int64',
        'channel': 'object',
        'seller': 'object',
        'interest_rate': 'float64',
        'balance': 'int64',
        'loan_term': 'int64',
        'origination_date': 'object',
        'first_payment_date': 'object',
        'ltv': 'int64',
        'cltv': 'float64',
        'borrower_count': 'float64',
        'dti': 'float64',
        'borrower_credit_score': 'float64',
        'first_time_homebuyer': 'object',
        'loan_purpose': 'object',
        'property_type': 'object',
        'unit_count': 'int64',
        'occupancy_status': 'object',
        'property_state': 'object',
        'zip': 'int64',
        'insurance_percentage': 'float64',
        'product_type': 'object',
        'co_borrower_credit_score': 'float64',
    },
    "Performance": {
        'id': 'int64',
        'reporting_period': 'object',
        'servicer_name': 'object',
        'interest_rate': 'float64',
        'balance': 'float64',
        'loan_age': 'float64',
        'months_to_maturity': 'float64',
        'maturity_date': 'float64',
        'msa': 'object',
        'delinquency_status': 'float64',
        'modification_flag': 'int64',
        'zero_balance_code': 'object',
        'zero_balance_date': 'float64',
        'last_paid_installment_date': 'object',
        'foreclosure_date': 'object',
        'disposition_date': 'object',
        'foreclosure_costs': 'object',
        'property_repair_costs': 'float64',
        'recovery_costs': 'float64',
        'misc_costs': 'float64',
        'tax_costs': 'float64',
        'sale_proceeds': 'float64',
        'credit_enhancement_proceeds': 'float64',
        'repurchase_proceeds': 'float64',
        'other_foreclosure_proceeds': 'float64',
        'non_interest_bearing_balance': 'float64',
        'principal_forgiveness_balance': 'float64',
    },
}
```
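Since `HEADERS` and `TYPES` are both written by hand, a quick consistency check can catch a misspelled or missing column name before any file is read. The helper below is a sketch; `check_schema` is a hypothetical name, and the two small dictionaries are stand-ins for the real ones.

```python
def check_schema(headers, types):
    """Return a list of mismatches between column names and typed columns."""
    problems = []
    for dataset, columns in headers.items():
        typed = types.get(dataset, {})
        # Columns listed in the headers but without a declared dtype
        missing = [c for c in columns if c not in typed]
        # Declared dtypes whose column name is not in the headers
        extra = [c for c in typed if c not in columns]
        if missing:
            problems.append("{}: no dtype for {}".format(dataset, missing))
        if extra:
            problems.append("{}: dtype for unknown column {}".format(dataset, extra))
    return problems


# Tiny stand-ins for HEADERS and TYPES, with a deliberate typo ("selller")
demo_headers = {"Acquisition": ["id", "channel", "seller"]}
demo_types = {"Acquisition": {"id": "int64", "channel": "object", "selller": "object"}}

for problem in check_schema(demo_headers, demo_types):
    print(problem)
```

An empty list means every column has exactly one declared type.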

Now we make a dictionary of the columns we wish to keep: all the columns from the `Acquisition` datasets, to have as many predictors as possible; and only the `id` and `foreclosure_date` columns from the `Performance` datasets.

```
SELECT = {
    "Acquisition": HEADERS["Acquisition"],
    "Performance": [
        "id",
        "foreclosure_date"
    ]
}
```

To actually join the files, we are going to make a function called `concatenate()`, that takes `prefix` as an argument.

The function will do this:
* Get a list of the files in `DATA_DIR`.

  ```
files = os.listdir(settings.DATA_DIR)
```

* Loop through the files:
  * If the file name starts with `prefix`, read it as a `DataFrame`.
  * Keep only the columns in `SELECT`.
  * Write the `DataFrame` to a `.csv` file; only write the headers on the first write.

  ```
import os

import pandas as pd

import settings


def concatenate(prefix="Acquisition"):
    files = os.listdir(settings.DATA_DIR)
    for f in files:
        if not f.startswith(prefix):
            continue
        data = pd.read_csv(os.path.join(settings.DATA_DIR, f), sep="|", header=None, names=HEADERS[prefix],
                           index_col=False, dtype=TYPES[prefix])
        data = data[SELECT[prefix]]
        if not os.path.isfile(os.path.join(settings.PROCESSED_DIR, "{}.txt".format(prefix))):
            data.to_csv(os.path.join(settings.PROCESSED_DIR, "{}.txt".format(prefix)), sep="|", header=SELECT[prefix],
                        index=False)
        else:
            data.to_csv(os.path.join(settings.PROCESSED_DIR, "{}.txt".format(prefix)), mode='a', sep="|", header=False,
                        index=False)
```
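The "header only on the first write" logic can also be expressed with a single `to_csv()` call. The sketch below isolates that pattern; `append_frame` is a hypothetical helper name, and the two small frames are made-up data.

```python
import os

import pandas as pd


def append_frame(data, path, sep="|"):
    """Append `data` to `path`, writing the header only if the file is new."""
    is_new = not os.path.isfile(path)
    # mode="a" creates the file if it is missing and appends otherwise
    data.to_csv(path, mode="a", sep=sep, header=is_new, index=False)


# Illustration with made-up data, written to the current directory
first = pd.DataFrame({"id": [1, 2], "balance": [100, 200]})
second = pd.DataFrame({"id": [3], "balance": [300]})
demo_path = "append_demo.txt"
if os.path.isfile(demo_path):
    os.remove(demo_path)  # start clean so the demo is repeatable
append_frame(first, demo_path)
append_frame(second, demo_path)
os.remove(demo_path)
```

Because the output file is opened in append mode, running the assembly step twice without deleting the previous output would duplicate every row.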

We only want this script to call the function if it is run, not if it is imported. So we'll call the function inside this conditional:

```
if __name__ == "__main__":
    concatenate("Acquisition")
    concatenate("Performance")
```

To make it easy to run this step, we can put all the code in a single file called `assemble.py`.

# Generating the training data
Now we have two files in the `processed` directory: `Acquisition.txt` and `Performance.txt`.

But the machine learning models that we want to use need a single dataset as input, with one row per loan.

We are going to create an `annotate.py` file, with five functions in it:

* `read()`
* `count_performance_rows()`
* `get_performance_summary_value()`
* `annotate()`
* `write()`

## `read()`
This function simply reads in the `Acquisition` data.

```
def read():
    acquisition = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "Acquisition.txt"), sep="|")
    return acquisition
```

## `count_performance_rows()`
Its job is to count how many times each loan appears in the `Performance` dataset and to check whether it was foreclosed on.

To do this it:

* Opens the `Performance.txt` file.

  ```
with open(os.path.join(settings.PROCESSED_DIR, "Performance.txt"), 'r') as f:
```

* Loops through each line in the file, skipping the first (header row).

  ```
        for i, line in enumerate(f):
            if i == 0:
                continue
```

* Checks if the loan is in the `counts` dictionary and adds it if it isn't.

  ```
            loan_id, date = line.split("|")
            loan_id = int(loan_id)
            if loan_id not in counts:
                counts[loan_id] = {
                    "foreclosure_status": False,
                    "performance_count": 0
                }
```
* Increments the `performance_count` value of the current line's loan in the `counts` dictionary.

  ```
            counts[loan_id]["performance_count"] += 1
```       

* If the foreclosure date is not empty, sets the `foreclosure_status` of the loan to `True`.

  ```
            if len(date.strip()) > 0:
                counts[loan_id]["foreclosure_status"] = True
```

* Returns the `counts` dictionary.

  ```
    return counts  
```
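Assembled, the fragments above give roughly the following function. This is a sketch with two assumptions: the `counts = {}` initialisation is not shown in the fragments and is filled in here, and the file path is passed as a `path` argument instead of being built from `settings.PROCESSED_DIR`, so the example runs on its own.

```python
def count_performance_rows(path):
    """Count rows per loan and flag loans that have a foreclosure date."""
    counts = {}  # assumed initialisation, not shown in the fragments
    with open(path, "r") as f:
        for i, line in enumerate(f):
            if i == 0:
                # Skip the header row
                continue
            loan_id, date = line.split("|")
            loan_id = int(loan_id)
            if loan_id not in counts:
                counts[loan_id] = {
                    "foreclosure_status": False,
                    "performance_count": 0
                }
            counts[loan_id]["performance_count"] += 1
            # A non-empty date means the loan was foreclosed on
            if len(date.strip()) > 0:
                counts[loan_id]["foreclosure_status"] = True
    return counts
```

Reading the file line by line keeps memory use low, since only one small dictionary entry per loan is kept instead of the whole `Performance` table.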
      
## `get_performance_summary_value()`
Returns the value stored under `key` for the given loan if the loan exists in the `performance_summary` dictionary, or a default value if it doesn't.

```
def get_performance_summary_value(loan_id, key, performance_summary):
    value = performance_summary.get(loan_id, {
        "foreclosure_status": False,
        "performance_count": 0
    })
    return value[key]
```    
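A short usage example may help here. The loan ids and the summary dictionary below are made up for illustration; the function itself is copied from above.

```python
def get_performance_summary_value(loan_id, key, performance_summary):
    value = performance_summary.get(loan_id, {
        "foreclosure_status": False,
        "performance_count": 0
    })
    return value[key]


# Made-up summary containing a single loan
performance_summary = {
    100001: {"foreclosure_status": True, "performance_count": 12}
}

# A loan that is in the summary returns its stored value
known = get_performance_summary_value(100001, "performance_count", performance_summary)
# A loan that is missing falls back to the defaults
unknown = get_performance_summary_value(999999, "performance_count", performance_summary)
print(known, unknown)  # 12 0
```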

## `annotate()`
Creates two new columns:
* `foreclosure_status`: contains a `Boolean` value that tells us if the loan was foreclosed on or not.

  ```
acquisition["foreclosure_status"] = acquisition["id"].apply(
        lambda x: get_performance_summary_value(x, "foreclosure_status", performance_summary))
```

* `performance_count`: contains the number of times that a loan appears in the `Performance` dataset.

  ```
acquisition["performance_count"] = acquisition["id"].apply(
        lambda x: get_performance_summary_value(x, "performance_count", performance_summary))
```

It also loops through each categorical (text) column and converts its values to numeric category codes.

```
    for column in [
        "channel",
        "seller",
        "first_time_homebuyer",
        "loan_purpose",
        "property_type",
        "occupancy_status",
        "property_state",
        "product_type"
    ]:
        acquisition[column] = acquisition[column].astype('category').cat.codes
```
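To see what this conversion does, here is a small sketch on made-up values. Each distinct string gets an integer code, assigned in sorted order of the categories, and missing values get the code `-1`.

```python
import pandas as pd

# Made-up sample values, including one missing entry
channel = pd.Series(["R", "C", "B", "R", None])

# Categories are sorted ("B", "C", "R"), so B -> 0, C -> 1, R -> 2
codes = channel.astype("category").cat.codes
print(codes.tolist())  # [2, 1, 0, 2, -1]
```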

And splits the date columns into year and month columns, deleting the original date columns.

```
    for start in ["first_payment", "origination"]:
        column = "{}_date".format(start)
        acquisition["{}_year".format(start)] = pd.to_numeric(acquisition[column].str.split('/').str.get(1))
        acquisition["{}_month".format(start)] = pd.to_numeric(acquisition[column].str.split('/').str.get(0))
        del acquisition[column]
```        
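As a small illustration of the split, the sketch below applies the same steps to one made-up column. It assumes the dates are strings in `MM/YYYY` form, which is what the code above implies by reading the month from position `0` and the year from position `1`.

```python
import pandas as pd

# Made-up dates in MM/YYYY form
acquisition = pd.DataFrame({"origination_date": ["11/2012", "02/2013"]})

parts = acquisition["origination_date"].str.split("/")
acquisition["origination_year"] = pd.to_numeric(parts.str.get(1))
acquisition["origination_month"] = pd.to_numeric(parts.str.get(0))
del acquisition["origination_date"]

print(acquisition.to_dict("list"))
# {'origination_year': [2012, 2013], 'origination_month': [11, 2]}
```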

Now it fills all `NA` values with `-1`.

```
    acquisition = acquisition.fillna(-1)
```

Finally, it keeps only the loans whose `performance_count` is greater than `MINIMUM_TRACKING_QUARTERS`.

```
    acquisition = acquisition[acquisition["performance_count"] > settings.MINIMUM_TRACKING_QUARTERS]
```    

## `write()`
Writes the resulting dataset.

```
def write(acquisition):
    acquisition.to_csv(os.path.join(settings.PROCESSED_DIR, "train.csv"), index=False)
```

## Calling the functions
Again, we don't want the functions to be called if the file is imported.

```
if __name__ == "__main__":
    acquisition = read()
    performance_summary = count_performance_rows()
    acquisition = annotate(acquisition, performance_summary)
    write(acquisition)
```    

# Creating a machine learning model
After running `annotate.py` we get a `train.csv` file in the `processed` directory. Now we can use this file to train our machine learning model and make predictions.

We are going to use the simplest model for this type of problem: [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression).

To better organise the script (`predict.py`), we are going to, again, make several functions:
* `read()`
* `cross_validate()`
* `compute_error()`
* `compute_false_negatives()`
* `compute_false_positives()`

## Reading in the data
The `read()` function loads the `train.csv` file.

```
def read():
    train = pd.read_csv(os.path.join(settings.PROCESSED_DIR, "train.csv"))
    return train
```

## Performing cross-validation
The `cross_validate()` function takes the `train` `DataFrame` as argument and:

* Instantiates a `LogisticRegression` model from [`scikit-learn`](https://scikit-learn.org/).

  ```
  clf = LogisticRegression(random_state=1, class_weight="balanced", solver='saga', n_jobs=-1, dual=False, tol=0.01)
  ```

  * `random_state=1` sets the seed for the pseudo-random number generator, so the results are repeatable.
  * `class_weight="balanced"` makes the algorithm account for the class imbalance (there are far fewer foreclosed loans than non-foreclosed ones).
  * `solver='saga'` tells scikit-learn to use the [SAGA solver](https://arxiv.org/pdf/1407.0202.pdf), which tends to be faster on large datasets.
  * `n_jobs=-1` asks scikit-learn to use all available CPU cores.
  * `dual=False` keeps the default primal formulation.
  * `tol=0.01` sets the stopping tolerance, which is looser than the default of `1e-4`, so the solver stops sooner.
* Gets the names of the columns to use as predictors.

  ```
  predictors = train.columns.tolist()
  predictors = [p for p in predictors if p not in settings.NON_PREDICTORS]
```    
* Performs n-fold cross-validation, where n is set in the `settings.py` file.
    
  ```
    predictions = model_selection.cross_val_predict(estimator=clf, X=train[predictors], y=train[settings.TARGET],
    cv=settings.CV_FOLDS, n_jobs=-1)
```                                
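The three steps above can be sketched as one function and tried on synthetic data. Everything in the data below is made up (a single informative column named `dti` and a minority of positive cases), the settings are inlined as constants so the example runs on its own, and the `n_jobs` and `dual` arguments are left at their defaults for brevity.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

# Inlined stand-ins for the values in settings.py
TARGET = "foreclosure_status"
NON_PREDICTORS = [TARGET, "id"]
CV_FOLDS = 3


def cross_validate(train):
    clf = LogisticRegression(random_state=1, class_weight="balanced",
                             solver="saga", tol=0.01)
    predictors = [p for p in train.columns.tolist() if p not in NON_PREDICTORS]
    # Each row is predicted by a model trained on the other folds
    return model_selection.cross_val_predict(
        estimator=clf, X=train[predictors], y=train[TARGET], cv=CV_FOLDS)


# Synthetic stand-in for train.csv
rng = np.random.default_rng(0)
score = rng.normal(size=300)
train = pd.DataFrame({
    "id": np.arange(300),
    "dti": score,
    TARGET: score > 1.0,
})

predictions = cross_validate(train)
print(predictions.shape)  # (300,)
```

`cross_val_predict()` returns one prediction per row, so the result can be compared directly against the target column.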

## Computing error metrics
### Accuracy score
First, we compute the accuracy, defined as the proportion of all loans whose foreclosure status was predicted correctly.

The `compute_error()` function computes the [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

```
def compute_error(target, predictions):
    return metrics.accuracy_score(target, predictions)
```    

Previously we noted that the classes are heavily imbalanced. For example, suppose that `99%` of the values in the `foreclosure_status` column are `False`. Then a model that always predicts `foreclosure_status=False` would be `99%` accurate. It would also be completely useless!
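The arithmetic behind this example is easy to check in plain Python. The numbers below are made up to match the `99%` scenario; they are not taken from the real dataset.

```python
# 1000 made-up loans, of which 1% were foreclosed on
target = [True] * 10 + [False] * 990
# A "model" that always predicts no foreclosure
predictions = [False] * len(target)

# Share of loans whose status was predicted correctly
correct = sum(t == p for t, p in zip(target, predictions))
accuracy = correct / len(target)

# Share of actual foreclosures that the model failed to flag
missed = sum(t and not p for t, p in zip(target, predictions))
missed_share = missed / sum(target)

print(accuracy)      # 0.99
print(missed_share)  # 1.0
```

The model looks excellent by accuracy while missing every single foreclosure.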

### False negatives
A more useful metric is the false negatives rate `FNR`. A false negative is a result that was predicted as negative (`foreclosure_status=False`), but it was actually positive (`foreclosure_status=True`).

$$
    FNR=\frac{FN}{FN+TP}=\frac{FN}{P}
$$

Where:
* `FN`: false negatives
* `TP`: true positives
* `P`: positives

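The `compute_false_negatives()` function calculates the false negatives rate. Note that the code adds `1` to the denominator, which avoids a division by zero when there are no positives but makes the result slightly smaller than the formula above.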

```
def compute_false_negatives(target, predictions):
    df = pd.DataFrame({"target": target, "predictions": predictions})
    return df[(df["target"] == 1) & (df["predictions"] == 0)].shape[0] / (df[(df["target"] == 1)].shape[0] + 1)
```    

### False positives
Similarly, we can compute the false positives rate `FPR`. A false positive is a result that was predicted as positive (`foreclosure_status=True`), but it was actually negative (`foreclosure_status=False`).

$$
    FPR=\frac{FP}{TN+FP}=\frac{FP}{N}
$$

Where:
* `FP`: false positives
* `TN`: true negatives
* `N`: negatives

The `compute_false_positives()` function calculates the false positives rate.

```
def compute_false_positives(target, predictions):
    df = pd.DataFrame({"target": target, "predictions": predictions})
    return df[(df["target"] == 0) & (df["predictions"] == 1)].shape[0] / (df[(df["target"] == 0)].shape[0] + 1)
```
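As a worked example of both formulas, the sketch below counts the four outcomes on ten made-up predictions and computes the exact rates. The helper name `confusion_counts` is an assumption made for illustration. The last line shows the false negatives rate with `1` added to the denominator, as the functions above do.

```python
def confusion_counts(target, predictions):
    """Return (TP, FN, FP, TN) for two equally long sequences of 0/1 values."""
    pairs = list(zip(target, predictions))
    tp = sum(1 for t, p in pairs if t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp, fn, fp, tn


# Made-up data: 4 actual positives and 6 actual negatives
target = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(target, predictions)
fnr = fn / (fn + tp)  # FN / P
fpr = fp / (fp + tn)  # FP / N

print(tp, fn, fp, tn)      # 3 1 2 4
print(fnr)                 # 0.25
print(fn / (fn + tp + 1))  # 0.2
```

With only a handful of rows the extra `1` changes the result noticeably; with many rows in each class the difference becomes negligible.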

## Calling the functions
As in the previous scripts, we need to actually call the functions.

```
if __name__ == "__main__":
    train = read()
    predictions = cross_validate(train)
    error = compute_error(train[settings.TARGET], predictions)
    fn = compute_false_negatives(train[settings.TARGET], predictions)
    fp = compute_false_positives(train[settings.TARGET], predictions)
    print("Accuracy Score: {}".format(error))
    print("False Negatives: {}".format(fn))
    print("False Positives: {}".format(fp))
```    

# Predicting with the model
To make predictions with our model, we need to run the script with `python predict.py`. We get this output after doing it:

```
Accuracy Score: 0.6590917718026998
False Negatives: 0.2350684017350684
False Positives: 0.34100154945883804
```

With (most of) the default parameters, we got a `66%` accuracy, `24%` false negatives rate, and `34%` false positives rate.

# Conclusion
In this article, we created a machine learning model, going through the whole workflow:
* Getting the raw data.
* Selecting the useful columns and discarding the rest.
* Making a training dataset.
* Training the model.
* Making predictions.
* Computing error metrics.

There is much room for improvement with this model, so we could perform [hyperparameter tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization).

This was a first approach to machine learning, done following a tutorial, and it should serve as a basis for a more custom-made project.