# Data Collection and Conditioning
Our datasets are distributed across several sources, so we need data collection techniques that bring them together in a form that is simple to use in later analysis.

## Multifamily National File All Multifamily Properties By Units And Mortgages
Our primary datasets are available on [fhfa.gov](https://www.fhfa.gov/data/multifamily-national-file-all-multifamily-properties-by-units-and-mortgages), but without an API we need web scraping and `requests` to pull them into the code. A set of scraping tools specific to the webpage for this dataset will be developed in `./src/scraping.py` to address the findings of the exploration below.

In [None]:
import re
import requests
import itertools
import glob
import pandas as pd
from bs4 import BeautifulSoup
from pprint import pprint
from src.static import DATA_DIR
from src.datamappers import loan_data_mapper, unit_data_mapper

In [2]:
fhfa_url = 'https://www.fhfa.gov/data/multifamily-national-file-all-multifamily-properties-by-units-and-mortgages'

In [3]:
def get_page_data(url: str) -> BeautifulSoup:
    """Fetch a page and parse its HTML
    Args:
        url: str -> address of the page to fetch
    Returns:
        A bs4 object holding the parsed page"""
    response = requests.get(url)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    return BeautifulSoup(response.content, 'html.parser')

In [4]:
# get main page html
soup = get_page_data(fhfa_url)
# collect every anchor tag on the page
a_tags = soup.find_all('a')
print(f'There are {len(a_tags)} total anchor tags')

There are 176 total anchor tags


In [5]:
a_tags[120:125]

[<a href="/sites/default/files/2023-10/2020_MFNationalFile2020.zip">[ZIP]</a>,
 <a href="/sites/default/files/2023-10/2020_Multifamily_National_File_Unit_Class-Level_Data.pdf">[PDF]</a>,
 <a href="/sites/default/files/2023-10/2020_Multifamily_National_File_Property-Level_Data.pdf">[PDF]</a>,
 <a href="/sites/default/files/2023-10/2019_MFNationalFile2019.zip">​[ZIP]</a>,
 <a href="/sites/default/files/2023-10/2019_Multifamily_National_File_Unit_Class-Level_Data.pdf">​[PDF]</a>]

It looks like the tags for each year start with that year's zip file, followed by two metadata PDFs. This pattern can be used to extract the relevant files and store them in the user's `DATA_DIR`. We won't fetch the files from the web each time we run our analysis, since scraping only works as long as the page's HTML stays the same. What works now may not work in the future.
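One way to avoid repeated downloads is to check for a local copy before requesting a file. The sketch below is only an illustration of that idea: the helper names `local_path_for` and `needs_download` are hypothetical and are not taken from `src/scraping.py`, and it assumes files are stored flat in the data directory under the name that appears at the end of the href.

```python
from pathlib import Path, PurePosixPath


def local_path_for(href: str, data_dir: Path) -> Path:
    """Map a relative href from the page to a file path inside data_dir."""
    # Assumption: the local file keeps the final path component of the href.
    return data_dir / PurePosixPath(href).name


def needs_download(href: str, data_dir: Path) -> bool:
    """Return True when no non-empty local copy of the file exists yet."""
    target = local_path_for(href, data_dir)
    return not (target.is_file() and target.stat().st_size > 0)
```

With a check like this, a rerun of the notebook would only touch the network for files that are missing locally.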

In [6]:
counter = itertools.count()
for a_tag in a_tags:
    if a_tag['href'].endswith(('.pdf', '.zip')):
        print(a_tag['href'])
        if next(counter) == 5:
            break


/sites/default/files/2024-08/2023_MFNationalFile2023.zip
/sites/default/files/2024-08/2023_Multifamily_National_File_Unit_Class-Level_Data.pdf
/sites/default/files/2024-08/2023_Multifamily_National_File_Property-Level_Data.pdf
/sites/default/files/2023-09/2022_MFNationalFile2022.zip
/sites/default/files/2023-09/2022_Multifamily_National_File_Unit_Class-Level_Data.pdf
/sites/default/files/2023-09/2022_Multifamily_National_File_Property-Level_Data.pdf


We now have a way to home in on the necessary data files and download them using `requests`.
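The hrefs printed above are relative, so they need to be resolved against the site root before they can be requested. The following is a minimal sketch of that step, which also buckets the links by the year prefix in the file name. The names `BASE_URL`, `YEAR_PATTERN`, and `group_links_by_year` are hypothetical, and the regex assumes every data file keeps the `YYYY_` prefix seen in the output above.

```python
import re
from collections import defaultdict
from urllib.parse import urljoin

BASE_URL = "https://www.fhfa.gov"  # site root, taken from fhfa_url above
# Assumption: data files are named "<year>_<something>.zip" or ".pdf".
YEAR_PATTERN = re.compile(r"/(\d{4})_[^/]+\.(?:zip|pdf)$", re.IGNORECASE)


def group_links_by_year(hrefs):
    """Resolve relative hrefs and bucket the zip/pdf links by their leading year."""
    grouped = defaultdict(list)
    for href in hrefs:
        match = YEAR_PATTERN.search(href)
        if match:  # skip anchors that are not year-prefixed data files
            grouped[int(match.group(1))].append(urljoin(BASE_URL, href))
    return dict(grouped)


hrefs = [
    "/sites/default/files/2024-08/2023_MFNationalFile2023.zip",
    "/sites/default/files/2024-08/2023_Multifamily_National_File_Unit_Class-Level_Data.pdf",
    "/sites/default/files/2023-09/2022_MFNationalFile2022.zip",
    "/about",
]
links = group_links_by_year(hrefs)
```

Each resulting URL could then be passed to `requests.get` and the response body written to `DATA_DIR`.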

In [7]:
from src.scraping import get_fhfa_data, extract_zips

In [8]:
get_fhfa_data()
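
The implementations of `get_fhfa_data` and `extract_zips` live in `src/scraping.py` and are not shown in this notebook. As a rough sketch of what an extraction step might look like, the hypothetical helper below unpacks every archive into a folder named after the zip file. That naming is an assumption, chosen because it would produce folders such as `2021_MFNationalFile2021`.

```python
import zipfile
from pathlib import Path


def extract_all_zips(data_dir: Path) -> list[Path]:
    """Unpack every .zip in data_dir into a sibling folder named after the archive."""
    extracted = []
    for archive in sorted(data_dir.glob("*.zip")):
        target = data_dir / archive.stem  # e.g. 2021_MFNationalFile2021
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```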

## FHFA Data Conditioning
Now that the FHFA data is downloaded, we can begin working with it. Let's open one file and examine the contents.

### Loan Data Parsing

In [9]:
files = glob.glob(f'{DATA_DIR}/*')
files[:5]

['/home/mango/data/CS504/2023_Multifamily_National_File_Unit_Class-Level_Data.pdf',
 '/home/mango/data/CS504/2018_Multifamily_National_File_Property-Level_Data.pdf',
 '/home/mango/data/CS504/2014_Multifamily_National_File_Property-Level_Data.pdf',
 '/home/mango/data/CS504/2008_Multifamily_National_File_Property-Level_Data.pdf',
 '/home/mango/data/CS504/2021_MFNationalFile2021']

>The data directory contains folders that hold the actual data files, so we need to go one level deeper.

In [10]:
files = glob.glob(f'{DATA_DIR}/*MFNationalFile*/*')
files[:5]

['/home/mango/data/CS504/2021_MFNationalFile2021/fhlmc_mf2021b_loans.txt',
 '/home/mango/data/CS504/2021_MFNationalFile2021/2021_Multifamily_National_File_Unit_Class-Level_Data.pdf',
 '/home/mango/data/CS504/2021_MFNationalFile2021/fhlmc_mf2021b_units.txt',
 '/home/mango/data/CS504/2021_MFNationalFile2021/2021_Multifamily_National_File_Property-Level_Data.pdf',
 '/home/mango/data/CS504/2021_MFNationalFile2021/fnma_mf2021b_loans.txt']

The data is locked up in these text files. Let's import and parse one.

In [11]:
filepath = files[4]
filepath

'/home/mango/data/CS504/2021_MFNationalFile2021/fnma_mf2021b_loans.txt'

In [12]:
with open(filepath, 'r') as f:
    str_data: str = f.read()
print(str_data[:200])

1       1 3 1 3 2 2 1 2 2
1       2 1 2 4 2 1 2 2 2
1       3 2 2 3 2 1 1 2 2
1       4 3 2 4 2 2 1 2 5
1       5 2 2 4 2 1 1 2 2
1       6 3 3 4 2 1 2 2 4
1       7 3 1 4 2 2 2 2 2
1       8 3 3 4 2 


The data is encoded in these text files. First we need to parse the data; then we can worry about decoding it. The first issue is that the fields are separated by a varying number of spaces. We need a uniform pattern to split on, so let's use a regex to collapse each run of whitespace (other than newlines) into a single space.

In [13]:
str_data_clean: str = re.sub(r'[^\S\n]+', ' ', str_data)

and split on newlines

In [14]:
list_data: list[str] = str_data_clean.split('\n')
list_data[:5]

['1 1 3 1 3 2 2 1 2 2',
 '1 2 1 2 4 2 1 2 2 2',
 '1 3 2 2 3 2 1 1 2 2',
 '1 4 3 2 4 2 2 1 2 5',
 '1 5 2 2 4 2 1 1 2 2']

and now let's wrap it in a DataFrame

In [15]:
df = pd.DataFrame([x.split(' ') for x in list_data], columns=[x+1 for x in range(10)])
df.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10
0,1,1,3,1,3,2,2,1,2,2
1,1,2,1,2,4,2,1,2,2,2
2,1,3,2,2,3,2,1,1,2,2
3,1,4,3,2,4,2,2,1,2,5
4,1,5,2,2,4,2,1,1,2,2


Using the exploration above, a set of tools can be developed outside this notebook to parse the data and read it into DataFrames for later use.
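
The parsing steps explored above can be condensed into one small function. This is a sketch rather than the project's actual tooling: `parse_whitespace_table` is a hypothetical name, and it swaps the regex for `str.split()` with no argument, which splits on any run of whitespace. It also skips blank lines, since a file that ends in a newline would otherwise leave an empty final row after splitting on `'\n'`.

```python
def parse_whitespace_table(text: str, n_columns: int) -> list[list[str]]:
    """Split whitespace-delimited text into rows of string fields."""
    rows = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # no argument: splits on any run of whitespace
        if not fields:  # skip blank lines, e.g. after a trailing newline
            continue
        if len(fields) != n_columns:
            raise ValueError(
                f"line {line_number}: expected {n_columns} fields, got {len(fields)}"
            )
        rows.append(fields)
    return rows


sample = "1       1 3 1 3 2 2 1 2 2\n1       2 1 2 4 2 1 2 2 2\n"
rows = parse_whitespace_table(sample, n_columns=10)
```

The resulting list of rows can be handed to `pd.DataFrame(rows, columns=range(1, 11))` to get the same frame as above.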

### Data Mapping
The next challenge is to map the data dictionary values to the data. This mapping should be able to transform the data from its encoded version to a human-readable format and back again. Unfortunately, there is no apparent way to build it other than by hand.

In [16]:
# first we overwrite existing column names with something easier to read
column_mapper = {1: 'enterprise_flag', 2: 'record_number', 3: 'census_tract_2020',
                 4: 'tract_income_ratio', 5: 'affordability_cat', 6: 'date_of_mortgage_note',
                 7: 'purpose_of_loan', 8: 'type_of_seller', 9: 'federal_guarantee',
                 10: 'tot_num_units'}

df.rename(column_mapper, axis=1, inplace=True)
df.head()

Unnamed: 0,enterprise_flag,record_number,census_tract_2020,tract_income_ratio,affordability_cat,date_of_mortgage_note,purpose_of_loan,type_of_seller,federal_guarantee,tot_num_units
0,1,1,3,1,3,2,2,1,2,2
1,1,2,1,2,4,2,1,2,2,2
2,1,3,2,2,3,2,1,1,2,2
3,1,4,3,2,4,2,2,1,2,5
4,1,5,2,2,4,2,1,1,2,2


In [17]:
# next we apply a mapper to the values to decode them into human readable formats
loan_data_mapper = {
    'enterprise_flag': {'1': 'fannie', '2': 'freddie'},
    'census_tract_2020': {'1': '<10%', '2': '>=10%, <30%', '3': '>=30% <100%', '4': 'missing'},
    'tract_income_ratio': {'1': '>0, <=80%', '2': '>80%, <=120%', '3': '>120%', '4': 'missing'},
    # adjacent string literals are joined; note the trailing spaces between the pieces
    'affordability_cat': {'1': '>=20% of the units in the property are affordable at or below '
                               '50% of Area Median Income (AMI), and <40% are affordable at or below '
                               '60% AMI',
                          '2': '<20% and >=40%', '3': '>=20% and >=40%',
                          '4': '<20% and <40%', '8': 'Not available', '9': 'Not eligible',
                          '0': 'Missing'},
    'date_of_mortgage_note': {'1': 'originated in same year as acquired',
                              '2': 'originated prior to calendar year of acquisition',
                              '9': 'missing'},
    'purpose_of_loan': {'1': 'Purchase', '2': 'Refinancing (all types)', '3': 'New construction',
                        '4': 'Home Improvement/Rehabilitation',
                        '9': 'Not applicable/not available'},
    'type_of_seller': {'1': 'Mortgage Company',
                       '2': 'Savings Association Insurance Fund (SAIF)'
                            '- or Bank Insurance Fund (BIF)-insured depository institution',
                       '3': 'NCUA-insured Credit Union', '4': 'Other'},
    'federal_guarantee': {'1': 'Yes (has some type of Federal Guarantee)', '2': 'No',
                          '3': 'FHA Risk Sharing', '9': 'Not available'},
    'tot_num_units': {'1': '5 to 24 units', '2': '25 to 50', '3': '51 to 99', '4': '100 to 149',
                      '5': 'over 149', '9': 'Unknown'}
    }

# decode each mapped column; columns with no mapping (record_number) are left untouched
# and codes missing from a mapping are kept as-is
for column, codes in loan_data_mapper.items():
    df[column] = df[column].map(lambda x: codes.get(x, x))

In [18]:
df.head()

Unnamed: 0,enterprise_flag,record_number,census_tract_2020,tract_income_ratio,affordability_cat,date_of_mortgage_note,purpose_of_loan,type_of_seller,federal_guarantee,tot_num_units
0,fannie,1,>=30% <100%,">0, <=80%",>=20% and >=40%,originated prior to calendar year of acquisition,Refinancing (all types),Mortgage Company,No,25 to 50
1,fannie,2,<10%,">80%, <=120%",<20% and <40%,originated prior to calendar year of acquisition,Purchase,Savings Association Insurance Fund (SAIF)- or ...,No,25 to 50
2,fannie,3,">=10%, <30%",">80%, <=120%",>=20% and >=40%,originated prior to calendar year of acquisition,Purchase,Mortgage Company,No,25 to 50
3,fannie,4,>=30% <100%,">80%, <=120%",<20% and <40%,originated prior to calendar year of acquisition,Refinancing (all types),Mortgage Company,No,over 149
4,fannie,5,">=10%, <30%",">80%, <=120%",<20% and <40%,originated prior to calendar year of acquisition,Purchase,Mortgage Company,No,25 to 50


The loan data is successfully mapped. Using the logic developed here we will develop an object to convert between readable and encoded formats.
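
As an illustration of what such an object could look like, here is a minimal sketch of a two-way mapper for a single column. The class name `CodeMapper` and its methods are hypothetical and are not the contents of `src.datamappers`. It assumes the labels within a column are unique, since otherwise the reverse lookup would be ambiguous, and it passes unknown values through unchanged, mirroring the `.get(x, x)` fallback used above.

```python
class CodeMapper:
    """Convert one column's values between encoded and readable forms."""

    def __init__(self, codes: dict[str, str]):
        self._decode = dict(codes)
        # Invert the mapping so labels can be turned back into codes.
        self._encode = {label: code for code, label in codes.items()}
        if len(self._encode) != len(self._decode):
            raise ValueError("labels must be unique to allow a round trip")

    def decode(self, code: str) -> str:
        """Return the readable label, or the input itself if the code is unknown."""
        return self._decode.get(code, code)

    def encode(self, label: str) -> str:
        """Return the code for a label, or the input itself if the label is unknown."""
        return self._encode.get(label, label)


purpose = CodeMapper({"1": "Purchase", "2": "Refinancing (all types)", "3": "New construction"})
```

A column could then be converted in either direction with `df['purpose_of_loan'].map(purpose.decode)` or `.map(purpose.encode)`.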

### Unit Data Parsing

In [19]:
files = glob.glob(f'{DATA_DIR}/*MFNationalFile*/*')
files[:5]

['/home/mango/data/CS504/2021_MFNationalFile2021/fhlmc_mf2021b_loans.txt',
 '/home/mango/data/CS504/2021_MFNationalFile2021/2021_Multifamily_National_File_Unit_Class-Level_Data.pdf',
 '/home/mango/data/CS504/2021_MFNationalFile2021/fhlmc_mf2021b_units.txt',
 '/home/mango/data/CS504/2021_MFNationalFile2021/2021_Multifamily_National_File_Property-Level_Data.pdf',
 '/home/mango/data/CS504/2021_MFNationalFile2021/fnma_mf2021b_loans.txt']

In [20]:
df = pd.read_csv(files[2], header=None)
df.head()

Unnamed: 0,0
0,2 1 1 27.0 2 0
1,2 1 1 40.0 4 0
2,2 1 1 83.0 1 0
3,2 1 1 6.0 1 0
4,2 1 1 48.0 3 0


In [21]:
# first we overwrite existing column names with something easier to read
column_mapper = {1: 'enterprise_flag', 2: 'record_number', 3: 'num_bedrooms',
                 4: 'num_units', 5: 'affordability_level', 6: 'tenant_income_ind'}

df.rename(column_mapper, axis=1, inplace=True)
df.head()

Unnamed: 0,0
0,2 1 1 27.0 2 0
1,2 1 1 40.0 4 0
2,2 1 1 83.0 1 0
3,2 1 1 6.0 1 0
4,2 1 1 48.0 3 0
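
The frame above still has a single column named `0` because `pd.read_csv` splits on commas by default, so each line is read as one field and the rename has nothing to match. A possible alternative, sketched here on an in-memory sample rather than the real file, is to pass a whitespace regex as the separator and renumber the columns from 1 so that the `column_mapper` keys line up. The sample rows are copied from the output above.

```python
import io

import pandas as pd

sample = "2 1 1 27.0 2 0\n2 1 1   40.0 4 0\n"

# sep=r"\s+" treats any run of whitespace as a single delimiter
df_units = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)
df_units.columns = range(1, df_units.shape[1] + 1)  # 1-based, to match column_mapper

column_mapper = {1: 'enterprise_flag', 2: 'record_number', 3: 'num_bedrooms',
                 4: 'num_units', 5: 'affordability_level', 6: 'tenant_income_ind'}
df_units = df_units.rename(columns=column_mapper)
```

Note that this lets pandas infer numeric dtypes. Passing `dtype=str` would keep the codes as strings, which is what the string-keyed mappers in the loan section expect.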


In [22]:
from src.data_pipeline import clean_data

In [23]:
clean_data(files[2])

Unnamed: 0,1,2,3,4,5,6
0,2,1,1,27.0,2,0
1,2,1,1,40.0,4,0
2,2,1,1,83.0,1,0
3,2,1,1,6.0,1,0
4,2,1,1,48.0,3,0
...,...,...,...,...,...,...
245778,2,4647,1,1.0,4,0
245779,2,4647,1,1.0,4,0
245780,2,4647,1,1.0,4,0
245781,2,4647,1,1.0,1,0
