# Data Collection and Conditioning
Our datasets come from several sources, so we need data collection techniques that bring them together in a simple, consistent way before any analysis.

## Multifamily National File All Multifamily Properties By Units And Mortgages
Our primary datasets are available on [fhfa.gov](https://www.fhfa.gov/data/multifamily-national-file-all-multifamily-properties-by-units-and-mortgages), but without an API we need web scraping with `requests` to bring them into the code. A set of scraping tools specific to this dataset's webpage will be developed in `./src/scraping.py` to address the findings of the exploration below.

In [1]:
import re
import requests
import itertools
from bs4 import BeautifulSoup
from pprint import pprint

In [2]:
fhfa_url = 'https://www.fhfa.gov/data/multifamily-national-file-all-multifamily-properties-by-units-and-mortgages'

In [3]:
def get_page_data(url) -> BeautifulSoup:
    """Gets a page's data
    Args:
        url: str -> address of the page to fetch
    Returns:
        A bs4 object holding the parsed page HTML"""
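    response = requests.get(url, timeout=30)
    # fail early on an HTTP error instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')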

In [4]:
# get main page html
soup = get_page_data(fhfa_url)
# collect every anchor (<a>) tag on the page
a_tags = soup.find_all('a')
print(f'There are {len(a_tags)} total anchor tags')

There are 176 total anchor tags


In [5]:
a_tags[120:125]

[<a href="/sites/default/files/2023-10/2020_MFNationalFile2020.zip">[ZIP]</a>,
 <a href="/sites/default/files/2023-10/2020_Multifamily_National_File_Unit_Class-Level_Data.pdf">[PDF]</a>,
 <a href="/sites/default/files/2023-10/2020_Multifamily_National_File_Property-Level_Data.pdf">[PDF]</a>,
 <a href="/sites/default/files/2023-10/2019_MFNationalFile2019.zip">​[ZIP]</a>,
 <a href="/sites/default/files/2023-10/2019_Multifamily_National_File_Unit_Class-Level_Data.pdf">​[PDF]</a>]

It looks like the file links are grouped by year: each group starts with the zip archive, followed by two metadata PDFs. We can use this pattern to extract the relevant files and store them in the user's `DATA_DIRECTORY`. We won't access the files from the web each time we run our analysis, because scraping depends on the webpage's HTML staying the same. What works now may not work in the future.
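The per-year pattern can be sketched as a small grouping step. This is only an illustration: the helper name `group_links_by_year` and the regular expression are assumptions, not the contents of `./src/scraping.py`, and the sample links are copied from the output above so the sketch runs without network access.

```python
import re
from collections import defaultdict

# Sample hrefs copied from the notebook output; a real run would read them from a_tags.
sample_hrefs = [
    '/sites/default/files/2023-10/2020_MFNationalFile2020.zip',
    '/sites/default/files/2023-10/2020_Multifamily_National_File_Unit_Class-Level_Data.pdf',
    '/sites/default/files/2023-10/2020_Multifamily_National_File_Property-Level_Data.pdf',
    '/sites/default/files/2023-10/2019_MFNationalFile2019.zip',
    '/sites/default/files/2023-10/2019_Multifamily_National_File_Unit_Class-Level_Data.pdf',
]

# Assumption: every file name starts with a four-digit year followed by an underscore.
YEAR_PATTERN = re.compile(r'/(\d{4})_[^/]+\.(zip|pdf)$')


def group_links_by_year(hrefs):
    """Hypothetical helper: map each year to its zip and pdf links."""
    grouped = defaultdict(lambda: {'zip': [], 'pdf': []})
    for href in hrefs:
        match = YEAR_PATTERN.search(href)
        if match:  # skip links that do not follow the assumed naming pattern
            year, extension = match.groups()
            grouped[int(year)][extension].append(href)
    return dict(grouped)


links_by_year = group_links_by_year(sample_hrefs)
```

Keying on the year in the file name, rather than on the position of a tag in the page, should keep the grouping stable if the site adds or reorders links, although it still breaks if the naming scheme changes.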

In [6]:
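# print the first six links that point to a zip archive or a pdf
counter = itertools.count()
for a_tag in a_tags:
    # some anchors may have no href attribute, so fall back to an empty string
    if a_tag.get('href', '').endswith(('.pdf', '.zip')):
        print(a_tag['href'])
        if next(counter) == 5:
            break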


/sites/default/files/2024-08/2023_MFNationalFile2023.zip
/sites/default/files/2024-08/2023_Multifamily_National_File_Unit_Class-Level_Data.pdf
/sites/default/files/2024-08/2023_Multifamily_National_File_Property-Level_Data.pdf
/sites/default/files/2023-09/2022_MFNationalFile2022.zip
/sites/default/files/2023-09/2022_Multifamily_National_File_Unit_Class-Level_Data.pdf
/sites/default/files/2023-09/2022_Multifamily_National_File_Property-Level_Data.pdf


We now have a way to home in on the necessary data files and download them using `requests`.
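A download step along these lines might look like the sketch below. It is not the implementation of `get_fhfa_data` in `src/scraping.py`; the helper name `download_file`, the `BASE_URL` constant, and the `.part` temporary suffix are assumptions made for illustration. The function expects a `requests.Session` (or any object with a compatible `get` method) and streams the file to disk so large archives are not held in memory.

```python
from pathlib import Path
from urllib.parse import urljoin

# Assumption: hrefs on the page are relative to the site root.
BASE_URL = 'https://www.fhfa.gov'


def download_file(href, data_directory, session, chunk_size=8192):
    """Hypothetical helper: stream one linked file into data_directory.

    `session` is expected to be a requests.Session.
    Returns the local path; a file that already exists is not downloaded again.
    """
    target = Path(data_directory) / Path(href).name
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)

    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file on the next run.
    partial = target.with_name(target.name + '.part')
    url = urljoin(BASE_URL, href)
    with session.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()
        with open(partial, 'wb') as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                file.write(chunk)
    partial.replace(target)
    return target
```

With network access, this could be called as `download_file(href, DATA_DIRECTORY, requests.Session())` for each zip or pdf link found above. Skipping files that already exist fits the goal of not returning to the web on every run of the analysis.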

In [7]:
from src.scraping import get_fhfa_data, extract_zips

In [None]:
# extract_zips()

In [None]:
get_fhfa_data()