# Advanced Data Processing
---
We will refine our data processing in the hope of getting better results. Specifically, we will:
- Create questions about what acronyms stand for (e.g., ARM)
- Summarize the answer passages using GPT-3 so that responses aren't cut off by max_tokens
- Possibly make other adjustments

In [28]:
# Imports
import pandas as pd
import numpy as np
import os
import re
import string
import warnings
from numpy.random import Generator, PCG64

warnings.filterwarnings("ignore", message="Unverified HTTPS request is being made to host")
os.environ["CURL_CA_BUNDLE"] = ""
pd.set_option("display.max_colwidth", None)
rand = Generator(PCG64(seed=13))

## Acronyms
---
A common question is what an acronym stands for. We don't want the model to elaborate more than it needs to, so we will create question-and-answer pairs from the data whose answers are short and to the point.

In [2]:
# Load data frames
key_path = "../data/cfpb_key_terms.csv"
key_df = pd.read_csv(key_path)

question_path = "../data/cfpb_mortgage_questions.csv"
quest_df = pd.read_csv(question_path)

In [18]:
def find_acronym_meaning(text):
    """
    From a passage, pull out all acronyms in parentheses
    and extract their meanings.
    """
    # Format the text
    text = re.sub(r"[\"\-]", " ", text) 
    text = re.sub(r"([a-zA-Z])(\'s)", r"\1", text)
    
    # Split on parentheses and get first mention of acronym
    split = re.split(r"(\([A-Z]+)[s]?\)", text)[::-1]
    if len(split) <= 1 or ("(TTY)" in text and len(split) <= 3):
        return None
    acronyms = {e[1:]: split[i+1].split() for i, e in enumerate(split) 
                if e.startswith("(")}

    # Find the meaning for each one
    ignore = ["the", "and", "a", "of", "an", "to"] + list(string.punctuation)
    answers = []
    for acronym, before in acronyms.items():
        # Look backwards through what precedes acronym in text
        match = acronym.lower()[::-1]
        answer = [""]

        for word in before[::-1]:
            # If there's no more matches and previous word isn't 'of', stop
            if not match and word.lower() != "of" and answer[-1].lower() != "of":
                break

            # If word is a filler word, include it
            if word.lower() in ignore:
                answer.append(word)

            else:
                # Add words that start with acronym letters
                for i, letter in enumerate(match):
                    if word.lower().startswith(letter):
                        answer.append(word)
                        # Remove those letters from acronym
                        match = match[i+1:]
                        break
                # Any intervening capitalized words
                if word.istitle() and word != answer[-1]:
                    answer.append(word)

        # Remove any starting filler words
        answer = answer[::-1]
        for i, word in enumerate(answer):
            if word.lower() not in ignore:
                break
        definition = " ".join(answer[i:]).strip()

        # get the appropriate case
        if not all([c.istitle() for c in definition.split() if c not in ignore]):
            definition = definition.lower()
        if definition:
            answers.append({"acronym": acronym, "meaning": definition})
        
    return answers
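
The splitting step is the least obvious part of `find_acronym_meaning`, so here is a small standalone sketch of it. Because the pattern contains a capturing group, `re.split` keeps each matched `(ACRONYM` piece in its output, with the text that precedes it in the slot just before. The sample sentence is made up for illustration and is not taken from the CFPB data.

```python
import re

# Hypothetical sample passage (not taken from the CFPB data)
sample = "Under the Equal Credit Opportunity Act (ECOA) you are protected."

# Same pattern as in find_acronym_meaning: group 1 captures "(" plus the
# capital letters; an optional plural "s" and the ")" are discarded
parts = re.split(r"(\([A-Z]+)[s]?\)", sample)
print(parts)
# ['Under the Equal Credit Opportunity Act ', '(ECOA', ' you are protected.']

# Pair each acronym with the words that come before it
pairs = {p[1:]: parts[i - 1].split() for i, p in enumerate(parts) if p.startswith("(")}
print(pairs["ECOA"][-4:])
# ['Equal', 'Credit', 'Opportunity', 'Act']
```

The function in this notebook walks the list in reverse, so when an acronym appears more than once, the entry from its first mention is the one that is kept.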

In [27]:
acronymns = (
    pd.concat([
        key_df.definition.apply(find_acronym_meaning),
        quest_df.content.apply(find_acronym_meaning)
    ], ignore_index=True)
    .explode()
    .apply(pd.Series)
    .sort_values("meaning")
    .drop_duplicates(subset=["acronym"], keep="first")
    .dropna()
    .reset_index(drop=True)
)


acronymns

Unnamed: 0,acronym,meaning
0,APR,Annual Percentage Rate
1,AAA,Area Agencies Aging
2,APOR,Average Prime Offer Rate
3,COE,Certificate of Eligibility
4,CCPA,Consumer Credit Protection Act
5,HUD,Department of Housing and Urban Development
6,VA,Department of Veteran Affairs
7,ECOA,Equal Credit Opportunity Act
8,FEMA,Federal Emergency Management Agency
9,FHA,Federal Housing Administration


In [52]:
def make_abbreviation_question(a):
    """ 
    Given an acronym, generate 2 questions. One will
    randomly be lower case. A question mark may or may 
    not be at the end of the question.
    """
    punc = lambda: "?" if rand.random() > 0.5 else ""
    
    # shuffle order of lower case acronym
    a1, a2 = rand.choice([a, a.lower()], size=2, replace=False)
    response = [
        f"What does {a1} stand for" + punc(),
        f"What does {a2} mean" + punc()
    ]
    return response

adf = pd.DataFrame({
    "prompt": acronymns.acronym.apply(make_abbreviation_question),
    "completion": acronymns.meaning
}).explode("prompt")


adf.head(4)

Unnamed: 0,prompt,completion
0,What does APR stand for?,Annual Percentage Rate
0,What does apr mean?,Annual Percentage Rate
1,What does AAA stand for,Area Agencies Aging
1,What does aaa mean?,Area Agencies Aging


## Summarizing completions
---
Generated completions take the form (language and length) of the training data. Many of our answers are too long to be helpful, and long answers are also more expensive to generate.

To address this, we will use GPT-3 to summarize the long answers and use those summaries in the training data instead of the originals.
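
A minimal sketch of how the long answers might be selected before summarizing. The cutoff of 250 words mirrors the limit requested in this notebook's summarization prompt, but using it as the selection threshold is an assumption, as are the helper name `needs_summary` and the sample strings.

```python
def needs_summary(text, max_words=250):
    """Return True when a passage has more than max_words whitespace-separated words."""
    return len(text.split()) > max_words


# Hypothetical answers: one short, one padded past the cutoff
short_answer = "A Loan Estimate is a three-page form."
long_answer = " ".join(["word"] * 300)

print(needs_summary(short_answer))  # False
print(needs_summary(long_answer))   # True
```

Word count is only a rough proxy for tokens, which is what the max_tokens limit actually counts, so a real cutoff would need some margin.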

In [54]:
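def make_prompt(row, prompt_stop):
    """
    Build a summarization prompt from one row of quest_df.
    """
    p1 = "Including as many details as possible, summarize the following text about "
    p2 = " in under 250 words:\n"
    # Use the first sentence of the question as the topic
    topic = re.split(r"[\.\?]", row["question"])[0]
    # The text to summarize is the full answer (short + long)
    long_text = row["short_answer"] + " " + row["long_answer"]
    return p1 + topic + p2 + long_text + prompt_stop


quest_df.apply(make_prompt, prompt_stop="\nX0X\n", axis=1)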

Unnamed: 0,question,short_answer,long_answer,content
0,What are my rights as a servicemember under the Equal Credit Opportunity Act? Can a creditor refuse to extend credit to me just because I'm a servicemember?,It depends. Service in the military - whether current or prior service - is not a protected class under the Equal Credit Opportunity Act (ECOA).,"But just like your civilian counterparts, you are covered by the ECOA. The ECOA is a federal law that makes it illegal for a creditor to discriminate against you, in any aspect of a credit transaction, because of: Sex (including sexual orientation and gender identity); Age (provided the applicant is old enough to enter a contract); Race; Color; National origin; Religion; Marital status; Whether all or part of your income is from any public assistance program. This includes, but is not limited to, Social Security and Supplemental Security Income (SSI), unemployment compensation, Temporary Assistance to Needy Families (TANF), and Supplemental Nutritional Assistance Program benefits (SNAP). Some Veterans' benefits may also be considered income from a public assistance program (in which case no creditor may discriminate against a veteran on the basis of receiving such benefits). Whether you've exercised in good faith a right under the Consumer Credit Protection Act (CCPA). The CCPA is a collection of consumer protection laws, including ECOA, relating to credit; Some state laws do make it illegal to discriminate against covered persons based on active duty or veteran/military status. These states include: Illinois; Massachusetts; New York; Ohio; Washington; If you are in one of these states and believe you may have been discriminated against, contact your local JAG. Use the JAG Locator to find your local JAG Legal Assistance attorney.","What are my rights as a servicemember under the Equal Credit Opportunity Act? Can a creditor refuse to extend credit to me just because I'm a servicemember? It depends. 
Service in the military - whether current or prior service - is not a protected class under the Equal Credit Opportunity Act (ECOA). But just like your civilian counterparts, you are covered by the ECOA. The ECOA is a federal law that makes it illegal for a creditor to discriminate against you, in any aspect of a credit transaction, because of: Sex (including sexual orientation and gender identity); Age (provided the applicant is old enough to enter a contract); Race; Color; National origin; Religion; Marital status; Whether all or part of your income is from any public assistance program. This includes, but is not limited to, Social Security and Supplemental Security Income (SSI), unemployment compensation, Temporary Assistance to Needy Families (TANF), and Supplemental Nutritional Assistance Program benefits (SNAP). Some Veterans' benefits may also be considered income from a public assistance program (in which case no creditor may discriminate against a veteran on the basis of receiving such benefits). Whether you've exercised in good faith a right under the Consumer Credit Protection Act (CCPA). The CCPA is a collection of consumer protection laws, including ECOA, relating to credit; Some state laws do make it illegal to discriminate against covered persons based on active duty or veteran/military status. These states include: Illinois; Massachusetts; New York; Ohio; Washington; If you are in one of these states and believe you may have been discriminated against, contact your local JAG. Use the JAG Locator to find your local JAG Legal Assistance attorney."
1,Will I receive the Know Before You Owe disclosures when I shop for a mortgage?,"If you applied for a mortgage on or after Oct. 3, 2015, you should have received the forms in most cases.","The CFPB has created interactive tools and resources to help you shop for a mortgage, review your Loan Estimate, and review your Closing Disclosure. We also have adownloadable step-by-step guide. Our Know Before You Owe mortgage initiativeempowers consumers with the information they need to make informed mortgage choices. Under this initiative, we have consolidated the mortgage disclosures required under federal law in two forms. The disclosure forms are called the Loan Estimate and the Closing Disclosure. The Loan Estimate is provided early in the mortgage process, within three business days of your application. The Closing Disclosure is provided at the end of the process, and you must receive it at least three business days before closing. These new forms replace the old federal mortgage disclosures, known as the Good Faith Estimate, the HUD-1 Settlement Statement, and the Truth in Lending disclosures. If you applied for a mortgage before Oct. 3, 2015, you will receive the old forms even if you close after Oct. 3, 2015. Certain kinds of mortgages are not covered by the new forms. You will not receive a Loan Estimate or Closing Disclosure if you are shopping for: A reverse mortgage; A home equity line of credit (HELOC); A manufactured housing or mobile home loan not secured by real estate; A subordinate loan through certain types of homebuyer assistance programs; For these kinds of loans, you should receive Truth-in-Lending disclosures. If you are shopping for a reverse mortgage, you will also receive a Good Faith Estimate (GFE) and a HUD-1 Settlement Statement.","Will I receive the Know Before You Owe disclosures when I shop for a mortgage? If you applied for a mortgage on or after Oct. 3, 2015, you should have received the forms in most cases. 
The CFPB has created interactive tools and resources to help you shop for a mortgage, review your Loan Estimate, and review your Closing Disclosure. We also have adownloadable step-by-step guide. Our Know Before You Owe mortgage initiativeempowers consumers with the information they need to make informed mortgage choices. Under this initiative, we have consolidated the mortgage disclosures required under federal law in two forms. The disclosure forms are called the Loan Estimate and the Closing Disclosure. The Loan Estimate is provided early in the mortgage process, within three business days of your application. The Closing Disclosure is provided at the end of the process, and you must receive it at least three business days before closing. These new forms replace the old federal mortgage disclosures, known as the Good Faith Estimate, the HUD-1 Settlement Statement, and the Truth in Lending disclosures. If you applied for a mortgage before Oct. 3, 2015, you will receive the old forms even if you close after Oct. 3, 2015. Certain kinds of mortgages are not covered by the new forms. You will not receive a Loan Estimate or Closing Disclosure if you are shopping for: A reverse mortgage; A home equity line of credit (HELOC); A manufactured housing or mobile home loan not secured by real estate; A subordinate loan through certain types of homebuyer assistance programs; For these kinds of loans, you should receive Truth-in-Lending disclosures. If you are shopping for a reverse mortgage, you will also receive a Good Faith Estimate (GFE) and a HUD-1 Settlement Statement."
2,What exactly happens when a mortgage lender checks my credit?,"The credit check is reported to the credit reporting agencies as an ""inquiry"".","Inquiries tell other creditors that you are thinking of taking on new debt. An inquiry typically has a small, but negative, impact on your credit score. Inquiries are a necessary part of applying for a mortgage, so you can't avoid them altogether. But it pays to be smart about them. As a general rule, apply for credit only when you need it. Applying for a credit card, car loan, or other type of loan also results in an inquiry that can lower your score, so try to avoid applying for these other types of credit right before getting a mortgage or during the mortgage process. Learn more about credit scores You can shop around for a mortgage and it will not hurt your credit Within a 45-day window, multiple credit checks from mortgage lenders are recorded on your credit report as a single inquiry. This is because other creditors realize that you are only going to buy one home. You can shop around and get multiple preapprovals and official Loan Estimates. The impact on your credit is the same no matter how many lenders you consult, as long as the last credit check is within 45 days of the first credit check. Even if a lender needs to check your credit after the 45-day window is over, shopping around is usually still worth it. The impact of an additional inquiry is small, while shopping around for the best deal can save you a lot of money in the long run. Note: the 45-day rule applies only to credit checks from mortgage lenders or brokers' credit card and other inquiries are processed separately. You can check your own credit with no impact on your score When you check your own credit - whether you're getting a credit report or a credit score - it's handled differently by the credit reporting agencies and does not affect your credit score. 
If you are applying for a mortgage and haven't already checked your credit report for errors, do so now. You can get a free copy of your credit report at www. annualcreditreport. com. If you find any errors, get them corrected as soon as possible.","What exactly happens when a mortgage lender checks my credit? The credit check is reported to the credit reporting agencies as an ""inquiry"". Inquiries tell other creditors that you are thinking of taking on new debt. An inquiry typically has a small, but negative, impact on your credit score. Inquiries are a necessary part of applying for a mortgage, so you can't avoid them altogether. But it pays to be smart about them. As a general rule, apply for credit only when you need it. Applying for a credit card, car loan, or other type of loan also results in an inquiry that can lower your score, so try to avoid applying for these other types of credit right before getting a mortgage or during the mortgage process. Learn more about credit scores You can shop around for a mortgage and it will not hurt your credit Within a 45-day window, multiple credit checks from mortgage lenders are recorded on your credit report as a single inquiry. This is because other creditors realize that you are only going to buy one home. You can shop around and get multiple preapprovals and official Loan Estimates. The impact on your credit is the same no matter how many lenders you consult, as long as the last credit check is within 45 days of the first credit check. Even if a lender needs to check your credit after the 45-day window is over, shopping around is usually still worth it. The impact of an additional inquiry is small, while shopping around for the best deal can save you a lot of money in the long run. Note: the 45-day rule applies only to credit checks from mortgage lenders or brokers' credit card and other inquiries are processed separately. 
You can check your own credit with no impact on your score When you check your own credit - whether you're getting a credit report or a credit score - it's handled differently by the credit reporting agencies and does not affect your credit score. If you are applying for a mortgage and haven't already checked your credit report for errors, do so now. You can get a free copy of your credit report at www. annualcreditreport. com. If you find any errors, get them corrected as soon as possible."
3,How much does it cost to receive a Loan Estimate?,The only fee a lender can ask you to pay prior to providing a Loan Estimateis a fee for obtaining your credit report. Credit report fees are typically less than $30.,"The Loan Estimate is a form that went into effect on Oct. 3, 2015. A lender cannot collect any other fees before providing you with a Loan Estimate. In fact, a lender must wait until you indicate that you'd like to proceed with the loan application before charging you any other fees. Until that time, a lender also cannot collect your credit card number or require you to provide a check for anything other than a reasonable fee to obtain your credit report. Once you receive a Loan Estimate, it's up to you to decide whether you want to proceed with that particular lender and that particular loan application. If you have received your Loan Estimate and you tell the lender that you want to proceed, then the lender can charge you additional fees. For example, lenders commonly charge an application fee or an appraisal fee after you decide to proceed with the loan application. Learn how to proceed with a Loan Estimate. See a sample Loan Estimate form with interactive tips and definitions. You won't receive a Loan Estimate if you applied for a mortgage prior to Oct. 3, 2015, or if you're applying for a reverse mortgage. For those loans, you will receive two forms - a Good Faith Estimate (GFE) and an initial Truth-in-Lending disclosure - instead of a Loan Estimate. If you are applying for a HELOC, a manufactured housing loan that is not secured by real estate, or a loan through certain types of homebuyer assistance programs, you will not receive a GFE or a Loan Estimate, but you should receive a Truth-in-Lending disclosure.","How much does it cost to receive a Loan Estimate? The only fee a lender can ask you to pay prior to providing a Loan Estimateis a fee for obtaining your credit report. Credit report fees are typically less than $30. 
The Loan Estimate is a form that went into effect on Oct. 3, 2015. A lender cannot collect any other fees before providing you with a Loan Estimate. In fact, a lender must wait until you indicate that you'd like to proceed with the loan application before charging you any other fees. Until that time, a lender also cannot collect your credit card number or require you to provide a check for anything other than a reasonable fee to obtain your credit report. Once you receive a Loan Estimate, it's up to you to decide whether you want to proceed with that particular lender and that particular loan application. If you have received your Loan Estimate and you tell the lender that you want to proceed, then the lender can charge you additional fees. For example, lenders commonly charge an application fee or an appraisal fee after you decide to proceed with the loan application. Learn how to proceed with a Loan Estimate. See a sample Loan Estimate form with interactive tips and definitions. You won't receive a Loan Estimate if you applied for a mortgage prior to Oct. 3, 2015, or if you're applying for a reverse mortgage. For those loans, you will receive two forms - a Good Faith Estimate (GFE) and an initial Truth-in-Lending disclosure - instead of a Loan Estimate. If you are applying for a HELOC, a manufactured housing loan that is not secured by real estate, or a loan through certain types of homebuyer assistance programs, you will not receive a GFE or a Loan Estimate, but you should receive a Truth-in-Lending disclosure."
4,What is a Loan Estimate?,A Loan Estimate is a three-page form that you receive after applying for a mortgage.,"The Loan Estimate tells you important details about the loan you have requested. The lender must provide you a Loan Estimate within three business days of receiving your application. The Loan Estimate is a form that took effect on Oct. 3, 2015. The form provides you with important information, including the estimated interest rate, monthly payment, and total closing costs for the loan. The Loan Estimate also gives you information about the estimated costs of taxes and insurance, and how the interest rate and payments may change in the future. In addition, the form indicates if the loan has special features that you will want to be aware of, like penalties for paying off the loan early (a prepayment penalty) or increases to the mortgage loan balance even if payments are made on time (negative amortization). If your loan has a negative amortization feature, it appears in the description of the loan product. The form uses clear language and design to help you better understand the terms of the mortgage loan you've applied for. All lenders are required to use the same standard Loan Estimate form. This makes it easier for you to compare mortgage loans so that you can choose the one that is right for you. When you receive a Loan Estimate, the lender has not yet approved or denied your loan application. The Loan Estimate shows you what loan terms the lender expects to offer if you decide to move forward. If you decide to move forward, the lender will ask you for additional financial information. See a sample Loan Estimate form with interactive tips and definitions. Note:You won't receive a Loan Estimate if you're applying for a reverse mortgage. For those loans, you will receive two forms - a Good Faith Estimate (GFE) and an initial Truth-in-Lending disclosure - instead of a Loan Estimate. 
If you are applying for a HELOC, a manufactured housing loan that is not secured by real estate, or a loan through certain types of homebuyer assistance programs, you will not receive a GFE or a Loan Estimate, but you should receive a Truth-in-Lending disclosure.","What is a Loan Estimate? A Loan Estimate is a three-page form that you receive after applying for a mortgage. The Loan Estimate tells you important details about the loan you have requested. The lender must provide you a Loan Estimate within three business days of receiving your application. The Loan Estimate is a form that took effect on Oct. 3, 2015. The form provides you with important information, including the estimated interest rate, monthly payment, and total closing costs for the loan. The Loan Estimate also gives you information about the estimated costs of taxes and insurance, and how the interest rate and payments may change in the future. In addition, the form indicates if the loan has special features that you will want to be aware of, like penalties for paying off the loan early (a prepayment penalty) or increases to the mortgage loan balance even if payments are made on time (negative amortization). If your loan has a negative amortization feature, it appears in the description of the loan product. The form uses clear language and design to help you better understand the terms of the mortgage loan you've applied for. All lenders are required to use the same standard Loan Estimate form. This makes it easier for you to compare mortgage loans so that you can choose the one that is right for you. When you receive a Loan Estimate, the lender has not yet approved or denied your loan application. The Loan Estimate shows you what loan terms the lender expects to offer if you decide to move forward. If you decide to move forward, the lender will ask you for additional financial information. See a sample Loan Estimate form with interactive tips and definitions. 
Note:You won't receive a Loan Estimate if you're applying for a reverse mortgage. For those loans, you will receive two forms - a Good Faith Estimate (GFE) and an initial Truth-in-Lending disclosure - instead of a Loan Estimate. If you are applying for a HELOC, a manufactured housing loan that is not secured by real estate, or a loan through certain types of homebuyer assistance programs, you will not receive a GFE or a Loan Estimate, but you should receive a Truth-in-Lending disclosure."
