![consumer financial protection beureau](https://files.consumerfinance.gov/f/images/FCM_mortgages.original.png)

# Scrape the CFPB Website
---
The [Consumer Financial Protection Bureau](https://www.consumerfinance.gov/consumer-tools/mortgages/) provides a wealth of unbiased information about mortgages and the application process. We will scrape all the articles they've posted about mortgages to use in our model.  
After scraping the raw text, we will format it so that it works well with our encoding and GPT model.

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import string
import pandas as pd
from tqdm.notebook import tqdm
from transformers import GPT2TokenizerFast
import tiktoken
from nltk.tokenize import sent_tokenize
import os

import warnings
warnings.filterwarnings("ignore", message="Unverified HTTPS request is being made to host")
os.environ["CURL_CA_BUNDLE"] = ""

pd.set_option("display.max_colwidth", None)
pd.set_option("display.html.use_mathjax", False)

---
### Text formatting
Web scraping is notoriously dirty, so we define a function that makes the string prettier and easier for a machine learning model - and humans - to read.

In [2]:
def prettify_string(text):
    """ 
    Reformat a text string to follow the rules of English 
    punctuation and have only readable characters
    """
    # Replace quotes and dashes with their ASCII counterparts
    text = (
        text
        .replace("’", "'")
        .replace("‘", "'")
        .replace('“', '"')
        .replace('”', '"')
        .replace("–", " - ")
        .replace("—", " - ")
        .replace("\n", " ") # No new lines in text
        .replace("\t", " ") # No tabs
    )
    # Remove any phone numbers
    phone_pattern = "(at )??(1\s+)??\(??[0-9]{3}\)??\s+[0-9A-Z]{3}[-\s]+[0-9A-Z]{4}\s+(\([0-9]{4}\))?"
    text = re.sub(phone_pattern, "", text)
    # Remove any emails
    email_pattern = r"(at )??([a-zA-Z0-9\.]+)@([a-zA-Z0-9]+)\.([a-zA-Z]+)"
    text = re.sub(email_pattern, "", text)
    # Replace all non-breaking whitespace or tabs with regular space
    text = re.sub(r"\xa0|\t", " ", text)
    # Remove all other non-printable characters
    text = "".join([c for c in text if c in  string.printable])
    # No double white space
    text = re.sub(r"\s+", " ", text)
    # Remove space before punctuation
    text = re.sub(r"(\s)([.,;:'])", r"\2", text) 
    # Remove double punctuation
    text = re.sub(r"([,.;])([,.;]+)", r"\1", text) 
    # Space between words
    text = re.sub(r"([a-zA-Z])?([,.;])([a-zA-Z])", r"\1\2 \3", text)
    # Switch punctuation to be on outside of quotes
    text = re.sub(r"([\?\.])([\"\'])",  r"\2\1", text)
    # In case of merged words that are lower then upper
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    
    # Strip leading/trailing whitespace
    return text.strip()

---
### Key Terms
The CFPB provides a [helpful explaination](https://www.consumerfinance.gov/consumer-tools/mortgages/answers/key-terms/) of some of the most new and confusing terms surrounding mortgages. It explains everything from APR to initial adjustment cap and beyond.  
We will scrape this page to get the definitions of the terms. Since some of the definitions start with "It means," we will add back in the context of the term.

In [3]:
# Scrap raw HTML from webpage
base_url = "https://www.consumerfinance.gov/consumer-tools/mortgages/answers/"
key_terms_url = base_url + "key-terms/"
page = requests.get(key_terms_url, verify=False)
soup = BeautifulSoup(page.content, "html.parser")

In [4]:
# Extract the term and the definition
def get_term(term):
    found = term.find("dt", class_="term_name")
    return prettify_string(found.get_text())
        
def get_definition(term):
    found = term.find("dd", class_="term_definition")
    text = prettify_string(found.get_text())
    text = (
        text
        .split("Read more")[0]
        .split("Learn more")[0]
    )
    return text

def_records = [
    (get_term(term), get_definition(term))
    for term in soup.find_all("div", class_="term")]
term_df = pd.DataFrame(def_records, columns=["term", "definition"])


# Adjust the answer in case it's ambigouous without the term
def adjust_answer(row):
    if not row.definition.startswith("It means"):
        return row.definition
    return re.sub("It", row.term, row.definition)
term_df["content"] = term_df.apply(adjust_answer, axis=1)

In [5]:
# Here's what it looks like
term_df.loc[4:5]

Unnamed: 0,term,definition,content
4,Amount financed,"It means the amount of money you are borrowing from the lender, minus most of the upfront fees the lender is charging you.","Amount financed means the amount of money you are borrowing from the lender, minus most of the upfront fees the lender is charging you."
5,Annual income,"Annual income is a factor in a mortgage loan application and generally refers to your total earned, pre-tax income over a year. Annual income may include income from full-time or part-time work, self-employment, tips, commissions, overtime, bonuses, or other sources. A lender will use information about your annual income and your existing monthly debts to determine if you have the ability to repay the loan. Whether a lender will rely upon a specific income source or amount when considering you for a loan will often depend upon whether you can reasonably expect the income to continue.","Annual income is a factor in a mortgage loan application and generally refers to your total earned, pre-tax income over a year. Annual income may include income from full-time or part-time work, self-employment, tips, commissions, overtime, bonuses, or other sources. A lender will use information about your annual income and your existing monthly debts to determine if you have the ability to repay the loan. Whether a lender will rely upon a specific income source or amount when considering you for a loan will often depend upon whether you can reasonably expect the income to continue."


---
### Mortgage Questions and Answers
The CFPB also provides more [detailed answers](https://www.consumerfinance.gov/consumer-tools/mortgages/answers/) to specific questions. These are grouped into 4 categories: basics, common issues, know your rights, and how-to guides.  
Each question gets its own article. The articles are listed in a paginated format on their website, so we can scrape the article urls, then scrape the articles themselves. Each article has a title (which is the question), a short answer (optional), and a full answer. 

In [6]:
# Get all pages with article links
URL_pages = {
    "basics": 5,
    "common-issues": 5,
    "know-your-rights": 3,
    "how-to-guides": 2
}
URLs = []
for k, v in URL_pages.items():
    URLs.extend([f"{base_url}{k}?page={i+1}" for i in range(v)])

# For each of these pages, scrape all of the article URLs
articles = []
for url in URLs:
    page = requests.get(url, verify=False)
    soup = BeautifulSoup(page.content, "html.parser")
    articles.extend([
        f"https://{link.find('a').get('data-pretty-href')}" 
        for link in soup.find_all("article")])

In [7]:
def parse_cfpb_article(article_link):
    """ Scrape and parse and article from the CFPB website. """
    # Scrape article
    article = requests.get(article_link, verify=False)
    article_soup = BeautifulSoup(article.content, "html.parser")
    page_contents = {}
    
    # Get main question/page title
    page_contents["question"] = article_soup.find('h1').get_text().strip()
        
    # See if there's a short answer for this article
    lead = article_soup.find("div", class_="lead-paragraph")
    page_contents["short_answer"] = lead.get_text().strip() if lead else ""

    # Get long answer, including list items
    long_answer = ""
    answer_html = article_soup.find("div", class_="answer-text")
    for row in answer_html.find_all("div", class_="row"):
        for line in row.find_all(["h1", "h2", "p", "li"]):
            text = line.get_text().strip()
            long_answer += " " + text
            if line.name == "li" and text:
                long_answer += ";"
    page_contents["long_answer"] = long_answer
    return pd.Series(page_contents).apply(prettify_string)
  
# Load and parse all articles into a pandas dataframe
tqdm.pandas(desc="Scraping")
answer_df = (
    pd.Series(articles)
    .progress_apply(parse_cfpb_article)
    .drop_duplicates()
)
print(f"\n{len(answer_df)} records")

Scraping:   0%|          | 0/313 [00:00<?, ?it/s]


240 records


When we feed the data into a language model, we don't want 
individual texts to be too long, so we set a token cutoff and 
break text that is longer than that cutoff into multiple records.  
Once we've broken down large records, we'll join together the 
question, short answer, and long answer (section) to get one
cohesive content entry.

In [8]:
# We want to format the text to have the question, the short answer, and the 
# long answer. There are some really long bits of text, so we should break 
# those into smaller chunks.

TOKENIZER = GPT2TokenizerFast.from_pretrained("gpt2")
ENCODING = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    """ Count number of tokens. """
    return len(ENCODING.encode(text))
      
    
def format_content(row, max_len=512):
    """
    Create the content column of our answer data frame for use 
    in our embeddings model. Include the question, short answer, 
    and long answer. If the body is too long, break it into 
    several entries by returning a list and later using .explode()
    For each of the chunks of the body, we will prepend the 
    title question and short answer
    """
    prefix = row.question + " " + row.short_answer + " "
    prefix_tokens = count_tokens(prefix)
    max_len -= prefix_tokens
    remaining_tokens = count_tokens(row.long_answer)
    if prefix_tokens + remaining_tokens < max_len:
        return prefix + row.long_answer
    
    # Break into smaller chunks by tokenizing sentences
    sentences = sent_tokenize(row.long_answer)
    use_text = []
    while True:
        # Count the remaining total tokens
        remaining = count_tokens(". ".join(sentences))
        
        # Remaining text is shorter than maximum length
        if remaining <= max_len:
            use_text.append(sentences)
            break

        # Find the maximum number of sentences until the
        # maximum token length is reached
        ntokens = 0
        for i, sentence in enumerate(sentences):
            ntokens += 1 + count_tokens(sentence)
            if ntokens > max_len:
                use_text.append(sentences[:i][:-1])
                sentences = sentences[i:]
                break
                
    # Join each of those sublists into cohesive chunks
    chunks = [re.sub(r"\.+", ".", f"{prefix} {'. '.join(x)}.") 
              for x in use_text]
    return chunks

In [9]:
MAX_LENGTH = 512
answer_df["content"] = answer_df.apply(
    format_content, max_len=MAX_LENGTH, axis=1)
answer_df = answer_df.explode("content")

print(f"New length: {len(answer_df)} records")

New length: 291 records


### Create one data frame that contains just the content
We will need one data frame to do our embedding. So we will concatenate our term and our answer data frames and limit them to just the content column.  
Then we will add a "tokens" column that specifies how many tokens each text has.

In [10]:
# Concatenate the content columns and add a tokens counter column
content_df = pd.concat(
    [term_df[["content"]], answer_df[["content"]]],
    ignore_index=True
)
content_df["tokens"] = content_df.content.apply(count_tokens)

print(f"Total of {len(content_df)} records \n")
content_df.head(2)

Total of 394 records 



Unnamed: 0,content,tokens
0,"A 5/1 adjustable rate mortgage (ARM) or 5-year ARM is a mortgage loan where ""5"" is the number of years your initial interest rate will stay fixed. The ""1"" represents how often your interest rate will adjust after the initial five-year period ends. The most common fixed periods are 3, 5, 7, and 10 years and ""1,"" is the most common adjustment period. It's important to carefully read the contract and ask questions if you're considering an ARM.",106
1,The ability-to-repay rule is the reasonable and good faith determination most mortgage lenders are required to make that you are able to pay back the loan.,31


In [11]:
# Save all dataframes
term_save = "../data/cfpb_key_terms.csv"
term_df.to_csv(term_save, index=False)

answer_save = "../data/cfpb_mortgage_questions.csv"
answer_df.to_csv(answer_save, index=False)

content_save = "../data/mortgage_context_text.csv"
content_df.to_csv(content_save, index=False)

---

In [12]:
answer_df[answer_df.question.str.contains("flipped and that I have to get a second appraisal")]

Unnamed: 0,question,short_answer,long_answer,content
121,I was told I'm buying a home that was flipped and that I have to get a second appraisal. How does that work?,"If the home you're buying is considered a ""flip"" and you're getting a higher-priced mortgage loan covered under new mortgage rules, you will have to get a second appraisal.","A ""flip"" is when: You buy a home from a seller who bought the home less than six months ago and; You pay a certain amount more than the seller paid for the home:10 percent more if the seller bought the home within the past 90 days.20 percent more if the seller bought the home in the past 91 to 180 days. 10 percent more if the seller bought the home within the past 90 days. 20 percent more if the seller bought the home in the past 91 to 180 days. When you buy a ""flipped"" home, your lender must pay for a second appraisal of the home that includes an inside inspection. The lender cannot charge you for this second appraisal. Keep in mind that not all flips are subject to this requirement. For example, flips in rural areas are exempt because those areas might have fewer appraisers available. Also, properties acquired from a government agency are exempt. If you have a problem with your mortgage, you can submit a complaint with the CFPB online or by calling.","I was told I'm buying a home that was flipped and that I have to get a second appraisal. How does that work? If the home you're buying is considered a ""flip"" and you're getting a higher-priced mortgage loan covered under new mortgage rules, you will have to get a second appraisal. A ""flip"" is when: You buy a home from a seller who bought the home less than six months ago and; You pay a certain amount more than the seller paid for the home:10 percent more if the seller bought the home within the past 90 days.20 percent more if the seller bought the home in the past 91 to 180 days. 10 percent more if the seller bought the home within the past 90 days. 20 percent more if the seller bought the home in the past 91 to 180 days. When you buy a ""flipped"" home, your lender must pay for a second appraisal of the home that includes an inside inspection. The lender cannot charge you for this second appraisal. Keep in mind that not all flips are subject to this requirement. For example, flips in rural areas are exempt because those areas might have fewer appraisers available. Also, properties acquired from a government agency are exempt. If you have a problem with your mortgage, you can submit a complaint with the CFPB online or by calling."


In [13]:
token_df = answer_df.drop(columns=["content"]).drop_duplicates().applymap(count_tokens)
token_df.describe().round(2)

Unnamed: 0,question,short_answer,long_answer
count,240.0,240.0,240.0
mean,19.14,30.05,300.63
std,9.1,14.64,235.67
min,5.0,0.0,18.0
25%,13.0,20.0,129.25
50%,17.0,29.0,263.0
75%,23.25,39.0,396.5
max,59.0,81.0,1512.0


In [14]:
token_df[token_df.long_answer.lt(20)]

Unnamed: 0,question,short_answer,long_answer
140,18,30,18


In [15]:
answer_df.loc[[140]]

Unnamed: 0,question,short_answer,long_answer,content
140,I was told that I was too young to get a mortgage loan. Is this possible?,A creditor such as a lender or broker cannot discriminate against a credit applicant because of age unless the applicant is too young to legally enter into a contract.,State law governs the age at which an individual can enter into a legally binding contract.,I was told that I was too young to get a mortgage loan. Is this possible? A creditor such as a lender or broker cannot discriminate against a credit applicant because of age unless the applicant is too young to legally enter into a contract. State law governs the age at which an individual can enter into a legally binding contract.
