![Consumer Financial Protection Bureau](https://files.consumerfinance.gov/f/images/FCM_mortgages.original.png)

# Scrape the CFPB Website
---
The [Consumer Financial Protection Bureau](https://www.consumerfinance.gov/consumer-tools/mortgages/) provides a wealth of unbiased information about mortgages and the application process. We will scrape all the articles they've posted about mortgages to use in our model.  
After scraping the raw text, we will format it so that it works well with our encoding and GPT model.

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import string
import pandas as pd
from tqdm.notebook import tqdm
from transformers import GPT2TokenizerFast
import tiktoken
from nltk.tokenize import sent_tokenize
import os

import warnings
warnings.filterwarnings("ignore", message="Unverified HTTPS request is being made to host")
os.environ["CURL_CA_BUNDLE"] = ""

pd.set_option("display.max_colwidth", None)
pd.set_option("display.html.use_mathjax", False)

---
### Text formatting
Web scraping is notoriously dirty, so we define a function that makes the string prettier and easier for a machine learning model - and humans - to read.

In [2]:
def prettify_string(text):
    """ 
    Reformat a text string to follow the rules of English 
    punctuation and have only readable characters
    """
    # Replace quotes and dashes with their ASCII counterparts
    text = (
        text
        .replace("’", "'")
        .replace("‘", "'")
        .replace('“', '"')
        .replace('”', '"')
        .replace("–", " - ")
        .replace("—", " - ")
        .replace("\n", " ") # No new lines in text
        .replace("\t", " ") # No tabs
    )
    # Remove any phone numbers
    phone_pattern = r"(at )??(1\s+)??\(??[0-9]{3}\)??\s+[0-9A-Z]{3}[-\s]+[0-9A-Z]{4}\s+(\([0-9]{4}\))?"
    text = re.sub(phone_pattern, "", text)
    # Remove any emails
    email_pattern = r"(at )??([a-zA-Z0-9\.]+)@([a-zA-Z0-9]+)\.([a-zA-Z]+)"
    text = re.sub(email_pattern, "", text)
    # Only printable characters
    text = "".join([c for c in text if c in string.printable])
    # No double white space
    text = re.sub(r"\s+", " ", text)
    # Remove space before punctuation
    text = re.sub(r"(\s)([.,;:'])", r"\2", text) 
    # Remove double punctuation
    text = re.sub(r"([,.;])([,.;]+)", r"\1", text) 
    # Space between words
    text = re.sub(r"([a-zA-Z])?([,.;])([a-zA-Z])", r"\1\2 \3", text)
    # Switch punctuation to be on outside of quotes
    text = re.sub(r"([\?\.])([\"\'])",  r"\2\1", text)
    # In case of merged words that are lower then upper
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    
    # Strip leading/trailing whitespace
    return text.strip()
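
To make the effect of these rules concrete, here is a minimal, self-contained sketch that applies only a few of them to a made-up string. The helper name `tidy` and the sample text are assumptions for illustration; this is not the full `prettify_string` function.

```python
import re

def tidy(text):
    """Apply a small subset of the cleanup rules (illustrative only)."""
    # Curly quotes to ASCII quotes
    text = text.replace("\u2019", "'").replace("\u201c", '"').replace("\u201d", '"')
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Remove the space before punctuation
    text = re.sub(r"(\s)([.,;:'])", r"\2", text)
    # Split merged words where a lowercase letter meets an uppercase one
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return text.strip()

# Hypothetical scraped fragment with a curly apostrophe, stray spaces,
# and two words merged together
sample = "Your lender\u2019s  fees , explainedHere ."
print(tidy(sample))
```

Note that the lowercase-to-uppercase split is a blunt rule: it also separates legitimate mixed-case names, so it is worth checking a few cleaned records by eye.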

---
### Key Terms
The CFPB provides a [helpful explanation](https://www.consumerfinance.gov/consumer-tools/mortgages/answers/key-terms/) of some of the most unfamiliar and confusing terms surrounding mortgages. It explains everything from APR to initial adjustment cap and beyond.  
We will scrape this page to get the definitions of the terms. Since some of the definitions start with "It means," we will add back in the context of the term.
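
As a rough sketch of what "adding back the context" means, the snippet below swaps a leading "It" for the term name. The function name `add_term_context` and the sample definition are assumptions for illustration; they do not come from the CFPB page.

```python
def add_term_context(term, definition):
    """Replace a leading "It" with the term so the definition stands alone."""
    if definition.startswith("It means"):
        # Keep everything after the leading "It" and put the term in front
        return term + definition[len("It"):]
    return definition

# Hypothetical records: (term, definition)
records = [
    ("Amount financed", "It means the amount of money you are borrowing."),
    ("Annual income", "Annual income is a factor in a mortgage loan application."),
]
for term, definition in records:
    print(add_term_context(term, definition))
```

Definitions that already name the term pass through unchanged, so the helper is safe to apply to every row.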

In [3]:
# Scrape raw HTML from webpage
base_url = "https://www.consumerfinance.gov/consumer-tools/mortgages/answers/"
key_terms_url = base_url + "key-terms/"
page = requests.get(key_terms_url, verify=False)
soup = BeautifulSoup(page.content, "html.parser")

In [4]:
# Extract the term and the definition
def get_term(term):
    found = term.find("dt", class_="term_name")
    return prettify_string(found.get_text())
        
def get_definition(term):
    found = term.find("dd", class_="term_definition")
    text = prettify_string(found.get_text())
    text = (
        text
        .split("Read more")[0]
        .split("Learn more")[0]
    )
    return text

def_records = [
    (get_term(term), get_definition(term))
    for term in soup.find_all("div", class_="term")]
term_df = pd.DataFrame(def_records, columns=["term", "definition"])


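# Adjust the answer in case it's ambiguous without the term
def adjust_answer(row):
    if not row.definition.startswith("It means"):
        return row.definition
    # Swap only the leading "It" for the term; re.sub would also
    # replace later occurrences such as the "It" in "Items"
    return row.term + row.definition[len("It"):]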
term_df["content"] = term_df.apply(adjust_answer, axis=1)

In [5]:
# Here's what it looks like
term_df.loc[4:5]

Unnamed: 0,term,definition,content
4,Amount financed,"It means the amount of money you are borrowing from the lender, minus most of the upfront fees the lender is charging you.","Amount financed means the amount of money you are borrowing from the lender, minus most of the upfront fees the lender is charging you."
5,Annual income,"Annual income is a factor in a mortgage loan application and generally refers to your total earned, pre-tax income over a year. Annual income may include income from full-time or part-time work, self-employment, tips, commissions, overtime, bonuses, or other sources. A lender will use information about your annual income and your existing monthly debts to determine if you have the ability to repay the loan. Whether a lender will rely upon a specific income source or amount when considering you for a loan will often depend upon whether you can reasonably expect the income to continue.","Annual income is a factor in a mortgage loan application and generally refers to your total earned, pre-tax income over a year. Annual income may include income from full-time or part-time work, self-employment, tips, commissions, overtime, bonuses, or other sources. A lender will use information about your annual income and your existing monthly debts to determine if you have the ability to repay the loan. Whether a lender will rely upon a specific income source or amount when considering you for a loan will often depend upon whether you can reasonably expect the income to continue."


---
### Mortgage Questions and Answers
The CFPB also provides more [detailed answers](https://www.consumerfinance.gov/consumer-tools/mortgages/answers/) to specific questions. These are grouped into 4 categories: basics, common issues, know your rights, and how-to guides.  
Each question gets its own article. The articles are listed in a paginated format on the CFPB website, so we first scrape the article URLs and then scrape the articles themselves. Each article has a title (which is the question), an optional short answer, and a full answer. 

In [6]:
# Get all pages with article links
URL_pages = {
    "basics": 5,
    "common-issues": 5,
    "know-your-rights": 3,
    "how-to-guides": 2
}
URLs = []
for k, v in URL_pages.items():
    URLs.extend([f"{base_url}{k}?page={i+1}" for i in range(v)])

# For each of these pages, scrape all of the article URLs
articles = []
for url in URLs:
    page = requests.get(url, verify=False)
    soup = BeautifulSoup(page.content, "html.parser")
    articles.extend([
        f"https://{link.find('a').get('data-pretty-href')}" 
        for link in soup.find_all("article")])

In [7]:
answer_df = pd.DataFrame(columns=["question", "short_answer", "long_answer"])
errors = []

# Loop through all of the URLs to parse the article
for i, article_link in enumerate(tqdm(articles, desc="Scraping")):
    try:
        # Scrape article
        article = requests.get(article_link, verify=False)
        article_soup = BeautifulSoup(article.content, "html.parser")
        
        # Get main question/page title
        question = article_soup.find('h1').get_text()
        
        # Skip if we've seen this article before
        if question in answer_df.question.values:
            # Note: there are duplicate articles that have
            #       different URLs. We don't want those.
            
            continue
        
        # See if there's a short answer for this article
        lead = article_soup.find("div", class_="lead-paragraph")
        short_answer = lead.get_text() if lead else ""
        
        # Get long answer, including list items
        long_answer = (
            article_soup
            .find("div", class_="answer-text")
            .find("div", class_="row")
        )
        long_answer = [
            f"{line.get_text()}{';' if line.name == 'li' else ''}"
            for line in long_answer.find_all(["h1", "h2", "p", "li"])
        ]
        long_answer = " ".join(long_answer)
        
        # Append information to dataframe
        answer_df.loc[i] = {
            "question": question, 
            "short_answer": short_answer, 
            "long_answer": long_answer
        }
        
    except Exception as e:
        # If it fails, add to errors
        print(f"Failed on {article_link}:\n{e}")
        errors.append(article_link)

# Apply formatting to all text
answer_df = answer_df.applymap(prettify_string)

print(f"\n{len(answer_df)} records")

Scraping:   0%|          | 0/313 [00:00<?, ?it/s]


240 records


When we feed the data into a language model, we don't want 
individual texts to be too long, so we set a token cutoff and 
break text that is longer than that cutoff into multiple records.  
Once we've broken down large records, we'll join together the 
question, short answer, and long answer (section) to get one
cohesive content entry.
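
The idea can be sketched without any external libraries. The version below is a simplified illustration under stated assumptions, not the notebook's implementation: it counts whitespace-separated words as a stand-in for tiktoken tokens, splits sentences with a regular expression instead of `sent_tokenize`, and uses a made-up question and answer. The name `chunk_sentences` is hypothetical.

```python
import re

def count_tokens(text):
    # Stand-in for a real tokenizer: one "token" per whitespace-separated word
    return len(text.split())

def chunk_sentences(text, max_len):
    """Greedily pack whole sentences into chunks of at most max_len tokens."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Start a new chunk when this sentence would overflow the budget
        if current and used + n > max_len:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Hypothetical question and long answer
prefix = "What is escrow?"
body = "Escrow holds funds. It pays taxes. It pays insurance. Ask your servicer."

# Reserve room for the prefix, then prepend it to every chunk
budget = 10 - count_tokens(prefix)
records = [f"{prefix} {chunk}" for chunk in chunk_sentences(body, budget)]
for record in records:
    print(record)
```

Subtracting the prefix length from the budget before chunking keeps every final record within the overall limit, because the question and short answer are repeated at the start of each chunk. A single sentence longer than the budget is kept whole in this sketch rather than split further.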

In [8]:
# We want to format the text to have the question, the short answer, and the 
# long answer. There are some really long bits of text, so we should break 
# those into smaller chunks.

TOKENIZER = GPT2TokenizerFast.from_pretrained("gpt2")
ENCODING = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    """ Count number of tokens. """
    return len(ENCODING.encode(text))
      
    
def format_content(row, max_len=590):
    """
    Create the content column of our answer data frame for use 
    in our embeddings model. Include the question, short answer, 
    and long answer. If the body is too long, break it into 
    several entries by returning a list and later using .explode()
    For each of the chunks of the body, we will prepend the 
    title question and short answer
    """
    prefix = row.question + " " + row.short_answer + " "
    prefix_tokens = count_tokens(prefix)
    max_len -= prefix_tokens
    remaining_tokens = count_tokens(row.long_answer)
    # max_len now excludes the prefix, so compare the body alone
    if remaining_tokens <= max_len:
        return prefix + row.long_answer
    
    # Break into smaller chunks by tokenizing sentences
    sentences = sent_tokenize(row.long_answer)
    use_text = []
    while sentences:
        # Count the remaining total tokens
        remaining = count_tokens(". ".join(sentences))
        
        # Remaining text is shorter than maximum length
        if remaining <= max_len:
            use_text.append(sentences)
            break

        # Find the largest run of leading sentences that fits
        # within the maximum token length
        ntokens = 0
        cut = len(sentences)
        for i, sentence in enumerate(sentences):
            ntokens += 1 + count_tokens(sentence)
            if ntokens > max_len:
                # Take at least one sentence so the loop always advances
                cut = max(i, 1)
                break
        # Keep every sentence before the cut; none are dropped
        use_text.append(sentences[:cut])
        sentences = sentences[cut:]
                
    # Join each of those sublists into cohesive chunks
    chunks = [re.sub(r"([.])([.]+)", r".", f"{prefix} {'. '.join(x)}.") 
              for x in use_text]
    return chunks

In [9]:
MAX_LENGTH = 590
answer_df["content"] = answer_df.apply(
    format_content, max_len=MAX_LENGTH, axis=1)
answer_df = answer_df.explode("content")

print(f"New length: {len(answer_df)} records")

New length: 266 records


In [10]:
answer_df[answer_df.question.eq("What is a deed-in-lieu of foreclosure?")]

Unnamed: 0,question,short_answer,long_answer,content
99,What is a deed-in-lieu of foreclosure?,A deed-in-lieu of foreclosure is an arrangement where you voluntarily turn over ownership of your home to the lender to avoid the foreclosure process.,"A deed-in-lieu of foreclosure may help you avoid being personally liable for any amount remaining on the mortgage. If you choose this option, a U. S. Department of Housing and Urban Development (HUD)-approved housing counseling agency can help you plan your next steps. Borrowers who are considering a deed-in-lieu of foreclosure should also ask their lenders or servicers about help with their relocation expenses through private programs that are sometimes called ""cash-for-keys"". If you live in a state in which you are responsible for any deficiency, which is a difference between the value of your property and the amount you still owe on your mortgage loan, you will want to ask your lender to waive the deficiency. If the lender waives the deficiency, get the waiver in writing and keep it for your records. A deed-in-lieu of foreclosure is one type of loss mitigation. For help in exploring your options, call the CFPB to be connected to a HUD-approved housing counseling agency today. Tip: See our handout for more information on how to avoid foreclosure.","What is a deed-in-lieu of foreclosure? A deed-in-lieu of foreclosure is an arrangement where you voluntarily turn over ownership of your home to the lender to avoid the foreclosure process. A deed-in-lieu of foreclosure may help you avoid being personally liable for any amount remaining on the mortgage. If you choose this option, a U. S. Department of Housing and Urban Development (HUD)-approved housing counseling agency can help you plan your next steps. Borrowers who are considering a deed-in-lieu of foreclosure should also ask their lenders or servicers about help with their relocation expenses through private programs that are sometimes called ""cash-for-keys"". 
If you live in a state in which you are responsible for any deficiency, which is a difference between the value of your property and the amount you still owe on your mortgage loan, you will want to ask your lender to waive the deficiency. If the lender waives the deficiency, get the waiver in writing and keep it for your records. A deed-in-lieu of foreclosure is one type of loss mitigation. For help in exploring your options, call the CFPB to be connected to a HUD-approved housing counseling agency today. Tip: See our handout for more information on how to avoid foreclosure."


### Create one data frame that contains just the content
We will need one data frame to do our embedding. So we will concatenate our term and our answer data frames and limit them to just the content column.  
Then we will add a "tokens" column that specifies how many tokens each text has.

In [11]:
# First save the original data frames in case we want to use 
# them as reference in the future
term_save = "../data/cfpb_key_terms.csv"
term_df.to_csv(term_save, index=False)

answer_save = "../data/cfpb_mortgage_questions.csv"
answer_df.to_csv(answer_save, index=False)

In [12]:
# Concatenate the content columns
df = pd.concat(
    [term_df[["content"]], answer_df[["content"]]],
    ignore_index=True
)

# Add a tokens counter column
df["tokens"] = df.content.apply(count_tokens)

# Save to file
save_name = "../data/mortgage_context_text.csv"
df.to_csv(save_name, index=False)

In [13]:
print(f"Total of {len(df)} records \n")
df.sample(n=2)

Total of 369 records 



Unnamed: 0,content,tokens
298,"Someone offered me the ability to make 26 bi-weekly mortgage payments a year for a fee. Is there a way I can pay down my loan faster on my own without paying a fee to sign up for this plan? In the bi-weekly payment plan, the servicer is collecting half of your monthly payment every two weeks, resulting in 26 payments over the course of the year (totaling one extra monthly payment per year). By making additional payments and applying your payments to the principal, you may be able to pay off your loan early. Before choosing a bi-weekly payment, be sure to review your loan terms to see if you will be subject to a pre-payment penalty if you do so. Check if your servicer charges any fees for a bi-weekly payment plan. You may be able to accomplish the same goal without the fee by making an extra monthly mortgage payment each year. Tip: Even if you don't set up a bi-weekly plan with your servicer, you can accomplish the same goal of paying down your loan faster by: Making one extra mortgage payment a year on your own; or; Dividing your monthly payment by 12, and adding that amount to your payment every month;",246
330,"If I can't pay my mortgage loan, what are my options? If you can't pay your mortgage or are worried about missing a mortgage payment, call your mortgage servicer right away. You should also contact a HUD-approved housing counseling agency to get free, expert assistance on avoiding foreclosure. ;",59
