# Advanced Data Processing
---
We will enhance our data processing in the hope of getting better results. The planned steps are:
- Create questions about what acronyms stand for (e.g., ARM)
- Summarize the answer passages using GPT-3 so that responses aren't cut off by max_tokens
- Possibly other steps, to be decided

In [1]:
# Imports
import pandas as pd
import numpy as np
import os
import re
import string
import warnings
from numpy.random import Generator, PCG64

warnings.filterwarnings("ignore", message="Unverified HTTPS request is being made to host")
os.environ["CURL_CA_BUNDLE"] = ""
pd.set_option("display.max_colwidth", None)
pd.set_option("display.html.use_mathjax", False)
rand = Generator(PCG64(seed=13))

## Acronyms
---
A common question is what an acronym stands for. We don't want the model to elaborate more than it needs to, so we will build questions from the data whose answers are short and to the point.

In [2]:
# Load data frames
key_path = "../data/cfpb_key_terms.csv"
key_df = pd.read_csv(key_path)

question_path = "../data/cfpb_mortgage_questions.csv"
quest_df = pd.read_csv(question_path)

In [3]:
def find_acronym_meaning(text):
    """
    From a passage, pull out all acronyms in parentheses
    and extract their meanings.
    """
    # Format the text
    text = re.sub(r"[\"\-]", " ", text) 
    text = re.sub(r"([a-zA-Z])(\'s)", r"\1", text)
    
    # Split on parentheses and get first mention of acronym
    split = re.split(r"(\([A-Z]+)[s]?\)", text)[::-1]
    if len(split) <= 1 or ("(TTY)" in text and len(split) <= 3):
        return None
    acronyms = {e[1:]: split[i+1].split() for i, e in enumerate(split) 
                if e.startswith("(")}

    # Find the meaning for each one
    ignore = ["the", "and", "a", "of", "an", "to"] + list(string.punctuation)
    answers = []
    for acronym, before in acronyms.items():
        # Look backwards through what precedes acronym in text
        match = acronym.lower()[::-1]
        answer = [""]

        for word in before[::-1]:
            # If there's no more matches and previous word isn't 'of', stop
            if not match and word.lower() != "of" and answer[-1].lower() != "of":
                break

            # If word is a filler word, include it
            if word.lower() in ignore:
                answer.append(word)

            else:
                # Add words that start with acronym letters
                for i, letter in enumerate(match):
                    if word.lower().startswith(letter):
                        answer.append(word)
                        # Remove those letters from acronym
                        match = match[i+1:]
                        break
                # Any intervening capitalized words
                if word.istitle() and word != answer[-1]:
                    answer.append(word)

        # Remove any starting filler words
        answer = answer[::-1]
        for i, word in enumerate(answer):
            if word.lower() not in ignore:
                break
        definition = " ".join(answer[i:]).strip()

        # get the appropriate case
        if not all([c.istitle() for c in definition.split() if c not in ignore]):
            definition = definition.lower()
        if definition:
            answers.append({"acronym": acronym, "meaning": definition})
        
    return answers
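
To see what the split step hands to the matching loop, here is a minimal sketch that reruns only the `re.split` and dictionary-building lines from `find_acronym_meaning`. The sentence is made up for illustration and is not taken from the CFPB data.

```python
import re

# Hypothetical sentence, not from the CFPB data
text = (
    "The Federal Housing Administration (FHA) and the "
    "Department of Veterans Affairs (VA) insure loans."
)

# Same pattern as in find_acronym_meaning: the capture group keeps "(FHA"
# in the output, while the optional plural "s" and the ")" are dropped.
parts = re.split(r"(\([A-Z]+)[s]?\)", text)

# Reversing puts each acronym directly before the text that preceded it,
# and lets the earliest mention win when an acronym appears more than once.
split = parts[::-1]
acronyms = {e[1:]: split[i + 1].split() for i, e in enumerate(split)
            if e.startswith("(")}

print(parts)
print(acronyms)
```

Each dictionary value is the list of words that came before the acronym, which is what the backward letter-matching loop walks through.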

In [4]:
acronymns = (
    pd.concat([
        key_df.definition.apply(find_acronym_meaning),
        quest_df.content.apply(find_acronym_meaning)
    ], ignore_index=True)
    .explode()
    .apply(pd.Series)
    .sort_values("meaning")
    .drop_duplicates(subset=["acronym"], keep="first")
    .dropna()
    .reset_index(drop=True)
)


acronymns

Unnamed: 0,acronym,meaning
0,APR,Annual Percentage Rate
1,AAA,Area Agencies Aging
2,APOR,Average Prime Offer Rate
3,COE,Certificate of Eligibility
4,CCPA,Consumer Credit Protection Act
5,HUD,Department of Housing and Urban Development
6,VA,Department of Veteran Affairs
7,ECOA,Equal Credit Opportunity Act
8,FEMA,Federal Emergency Management Agency
9,FHA,Federal Housing Administration
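
The reason for sorting before dropping duplicates can be shown on toy data. This is a sketch under assumptions: the records are invented, and it builds the frame with `pd.DataFrame(records)` rather than the `.apply(pd.Series)` call used above.

```python
import pandas as pd

# Toy stand-in for the output of find_acronym_meaning (hypothetical values)
extracted = pd.Series([
    [{"acronym": "ARM", "meaning": "adjustable rate mortgage"}],
    None,  # passages without acronyms return None
    [{"acronym": "ARM", "meaning": "Adjustable Rate Mortgage"},
     {"acronym": "FHA", "meaning": "Federal Housing Administration"}],
])

# explode() yields one dict per row; dropna() removes the None rows
records = extracted.explode().dropna().tolist()

table = (
    pd.DataFrame(records)
    .sort_values("meaning")  # uppercase letters sort before lowercase
    .drop_duplicates(subset=["acronym"], keep="first")
    .reset_index(drop=True)
)
print(table)
```

Because capitalized strings sort first, `keep="first"` retains the title-case meaning when the same acronym was extracted with two different casings.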


In [5]:
def make_abbreviation_question(a):
    """ 
    Given an acronym, generate 2 questions. One will
    randomly be lower case. A question mark may or may 
    not be at the end of the question.
    """
    punc = lambda: "?" if rand.random() > 0.5 else ""
    
    # shuffle order of lower case acronym
    a1, a2 = rand.choice([a, a.lower()], size=2, replace=False)
    response = [
        f"What does {a1} stand for" + punc(),
        f"What does {a2} mean" + punc()
    ]
    return response

adf = pd.DataFrame({
    "prompt": acronymns.acronym.apply(make_abbreviation_question),
    "completion": acronymns.meaning
}).explode("prompt")


adf.sample(2)

Unnamed: 0,prompt,completion
7,What does ecoa mean?,Equal Credit Opportunity Act
30,What does RESPA stand for?,Real Estate Settlement Procedures Act
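
The index values in the sample repeat because `explode("prompt")` turns each list of two questions into two rows that share one completion. A minimal sketch with a single invented row:

```python
import pandas as pd

# One hypothetical acronym with its two generated questions
toy = pd.DataFrame({
    "prompt": [["What does ARM stand for?", "What does arm mean"]],
    "completion": ["adjustable rate mortgage"],
})

# Each list element becomes its own row; the completion is copied
pairs = toy.explode("prompt")
print(pairs)
```

Both rows keep the original index label, so a `reset_index(drop=True)` may be useful before writing the pairs out as training data.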


## Summarizing completions
---
When completions are generated, they mirror the language and length of the training data. Many of our answers are too long to be helpful, and long answers are also more expensive to generate.

To combat this, we will use GPT-3 to summarize the long answers and use those summaries in the training data in place of the original text.
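
As a rough sketch of how the long answers might be picked out, the word count of each answer can be compared to a threshold. The column name `long_answer` matches the data used here; the toy rows and the 250-word cutoff are assumptions for illustration.

```python
import pandas as pd

# Toy answers: one short, one with 300 words
toy = pd.DataFrame({"long_answer": ["short reply", "word " * 300]})

MAX_WORDS = 250  # assumed cutoff

# Count whitespace-separated words and flag answers over the cutoff
word_counts = toy.long_answer.str.split().str.len()
needs_summary = word_counts > MAX_WORDS
print(needs_summary.tolist())
```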

In [46]:
def make_summary_prompt(row, min_words=100, max_words=250):
    """
    Build a summarization prompt if the given answer is very long
    """
    # Under the maximum output length, don't summarize
    if len(row.long_answer.split()) <= max_words:
        return None
    prompt = (
        f"In {min_words} to {max_words} words, summarize the following passage "
        f"that answers the question '{row.question}'. Include as many details "
        f"as possible:\n\n{row.short_answer} {row.long_answer}"
    )

    # Future: limit total request to 2,048 tokens in case of a super long passage.
    # That's the limit for most models (~1500 words). Note: that 2,048 includes
    # the response
    
    return prompt

qdf = quest_df.fillna("").drop(columns=["content"]).drop_duplicates().copy()
summary_prompts = (
    qdf
    .apply(
        make_summary_prompt, 
        max_words=250,
        axis=1
    ).dropna()
)

print(summary_prompts.sample(n=1).iloc[0])

In 100 to 250 words, summarize the following passage that answers the question 'What should I do if the house or apartment I'm renting goes into foreclosure?'. Include as many details as possible:

Know Your Rights, look for notices, ask questions. You may want to consult an attorney. If your landlord stops paying the mortgage, foreclosure proceedings may begin. Some state and local laws may offer protections for renters in the foreclosure process. More information about landlord-tenant laws in your state is available as well as a summary of state and local tenant protections from foreclosure. Tip:If you need help finding an attorney, you can view this list ofresources from the American Bar Association and you can find your local legal aid office or volunteer attorney program. Be aware of the following: Look for notices. If notices of a possible foreclosure are delivered to or posted on your property, contact the sender right away and let them know that you are a tenant. You should als
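
The comment in `make_summary_prompt` leaves the 2,048-token cap as future work. One way it might be handled is to trim the passage before building the prompt. The sketch below is an assumption, not the notebook's method: `fit_to_budget`, its parameters, and the whitespace counter are hypothetical, and in practice a tokenizer-based counter would be passed in.

```python
def fit_to_budget(passage, count_tokens, context_limit=2048,
                  reserved=175, overhead=0):
    """
    Drop trailing words until the passage fits the token budget.
    `reserved` is the room kept for the response and `overhead` is
    the size of the instruction text around the passage.
    """
    budget = context_limit - reserved - overhead
    words = passage.split()
    while words and count_tokens(" ".join(words)) > budget:
        words.pop()  # simple, not fast: removes one word per pass
    return " ".join(words)


# Stand-in counter that treats every word as one token
def count_words(text):
    return len(text.split())


long_passage = "w " * 3000
trimmed = fit_to_budget(long_passage, count_words)
print(len(trimmed.split()))
```

Cutting from the end can lose the closing details of a passage, so splitting on sentence boundaries may be a better choice for real data.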

In [51]:
import tiktoken
ENCODING = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    """ Count number of tokens. """
    return len(ENCODING.encode(text))


n_tokens = qdf.applymap(count_tokens)
n_tokens.describe().round()

Unnamed: 0,question,short_answer,long_answer
count,240.0,240.0,240.0
mean,19.0,30.0,276.0
std,9.0,15.0,217.0
min,5.0,0.0,1.0
25%,13.0,20.0,116.0
50%,17.0,29.0,246.0
75%,23.0,39.0,371.0
max,59.0,81.0,1512.0


In [52]:
n_tokens[n_tokens.long_answer == 1]

Unnamed: 0,question,short_answer,long_answer
136,12,31,1
221,19,49,1
227,14,44,1


In [55]:
qdf.loc[[136, 221, 227]]

Unnamed: 0,question,short_answer,long_answer
136,Does my mortgage servicer have to help me avoid foreclosure?,Many mortgage servicers have to work with you to see if you qualify for ways to avoid foreclosure. Your servicer may refer to this as loss mitigation.,;
221,I can't make my mortgage payments. How long will it take before I'll face foreclosure?,"The legal foreclosure process generally can't start during the first 120 days after you're behind on your mortgage. After that, once your servicer begins the legal process, the amount of time you have until an actual foreclosure sale varies by state.",;
227,"If I can't pay my mortgage loan, what are my options?","If you can't pay your mortgage or are worried about missing a mortgage payment, call your mortgage servicer right away. You should also contact a HUD-approved housing counseling agency to get free, expert assistance on avoiding foreclosure.",;
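
These three rows hold only a `;` as their long answer, which explains the one-token minimum. A possible cleanup, sketched on invented rows, is to blank out any answer that contains no letters or digits. The rule itself is an assumption and may need adjusting for other placeholder patterns.

```python
import pandas as pd

# Toy rows: one placeholder answer and one real answer
toy = pd.DataFrame({
    "question": ["q1", "q2"],
    "long_answer": [";", "A real answer."],
})

# Flag answers with no letters or digits, then replace them with ""
is_placeholder = ~toy.long_answer.str.contains(r"[A-Za-z0-9]", regex=True)
toy.loc[is_placeholder, "long_answer"] = ""
print(toy)
```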


In [24]:
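import openai  # legacy (pre-1.0) SDK, which exposes openai.Completion


def get_summary(text, model, prompt_stop="", completion_stop=None):
    """
    Send a summary prompt to a completion model and return the text.
    `model`, `prompt_stop`, and `completion_stop` are not defined in
    this notebook, so they are taken as arguments here.
    """
    MAX_TOKENS = 175
    TEMPERATURE = 0
    FREQ_PENALTY = 2
    PRES_PENALTY = -1
    BEST_OF = 3

    response = openai.Completion.create(
        model=model,
        prompt=text + prompt_stop,
        max_tokens=MAX_TOKENS,
        temperature=TEMPERATURE,
        frequency_penalty=FREQ_PENALTY,
        presence_penalty=PRES_PENALTY,
        best_of=BEST_OF,
        stop=completion_stop
    )
    # Return the first choice with surrounding whitespace removed
    return response["choices"][0]["text"].strip()
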

count    266.000000
mean      26.015038
std       12.461252
min        0.000000
25%       17.000000
50%       25.000000
75%       34.000000
max       62.000000
Name: short_answer, dtype: float64

In [40]:
# Reuse qdf from above and keep only the first sentence of each short answer
y = qdf.short_answer.apply(lambda s: s.split(".")[0])
print(y.value_counts())

Yes                                                                                                                                              10
                                                                                                                                                 10
No                                                                                                                                               10
It depends                                                                                                                                        5
Reverse mortgage loans typically must be repaid either when you move out of the home or when you die                                              2
                                                                                                                                                 ..
It depends on your situation                                                                                    

In [42]:
sa = (
    "Reverse mortgage loans typically must be repaid either when you move out of the home or when you die"
)

qdf[qdf.short_answer.str.contains(sa)]

Unnamed: 0,question,short_answer,long_answer
48,When do I have to pay back a reverse mortgage loan?,"Reverse mortgage loans typically must be repaid either when you move out of the home or when you die. However, the loan may need to be paid back sooner if the home is no longer your principal residence, you fail to pay your property taxes or homeowners insurance, or do not keep the home in good repair.","Most reverse mortgage loans are Home Equity Conversion Mortgages (HECMs). A HECM must be paid off when the last surviving borrower or Eligible Non-Borrowing Spouse: Dies; Sells their home, or; No longer lives in the home as their principal residence, meaning where they live for a majority of the year. If the you are away for more than 12 consecutive months in a healthcare facility such as a hospital, rehabilitation center, nursing home, or assisted living facility and there is no co-borrower living in the home, anyone living with you will have to move out unless they are able to pay back the loan or qualify as an Eligible Non-Borrowing Spouse. An ""Eligible Non-Borrowing Spouse"" is a term used for your spouse when they are not a co-borrower, but qualify under the U. S. Department of Housing and Urban Development's (HUD) rules to stay in your home after you have died. Learn more about what happens to your reverse mortgage after you die Learn more about reverse mortgages"
179,"What happens if I have a reverse mortgage and I have to move out of my home, such as moving into a nursing home or to live with family?","Reverse mortgage loans typically must be repaid either when you move out of the home or when you die. However, you may not need to immediately pay it back if you are away from your home for more than 12 consecutive months in a healthcare facility or have a co-borrower or Eligible Non-Borrowing Spouse living in the home.","If your spouse or person living with you is a co-borrower If you move out of your home for any reason such as to live in a nursing home, or downsize to a smaller house and your spouse or the person living with you is a co-borrower on the reverse mortgage loan, they can stay in the home and continue to receive loan disbursements as long as they fulfill the ongoing obligations of the reverse mortgage. If your spouse or person living with you isn't a co-borrower If your spouse or partner is not a co-borrower and you move someplace else for the majority of the year, the reverse mortgage loan will need to be paid back. The most common way to pay back a reverse mortgage is by selling the home, in which case your spouse or partner will have to move. If you are away from your home and in a healthcare facility such as a hospital, assisted living, nursing home, or rehabilitation center for more than 12 consecutive months, your non-borrowing spouse may be able to stay in the home without paying off the loan, depending on when you took out (""originated"") the loan. and whether they qualify as an Eligible Non-Borrowing Spouse under HUD's rules. Qualifying as an ""Eligible Non-Borrowing Spouse"" can be difficult, so your spouse may want to consider contacting an attorney or HUD-approved housing counseling agency. Learn more about whether your spouse can qualify as an Eligible Non-Borrowing Spouse. 
Note: This information only applies to Home Equity Conversion Mortgages (HECMs), which are the most common type of reverse mortgage loan. Learn more about reverse mortgages"


In [41]:
# Print every first sentence that occurs more than once
v = y.value_counts()
for sentence, count in v.items():
    if count > 1:
        print(sentence)
        print(f"({count} times)\n")

Yes
(10 times)


(10 times)

No
(10 times)

It depends
(5 times)

Reverse mortgage loans typically must be repaid either when you move out of the home or when you die
(2 times)
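
The same list of repeated openings can also be produced without a loop by filtering the counts directly. A minimal sketch on invented short answers:

```python
import pandas as pd

# Hypothetical short answers, including one empty string
short_answers = pd.Series(["Yes. It can.", "Yes. Always.", "No. Never.", ""])

# Take the text before the first period, then count each opening
first_sentence = short_answers.str.split(".").str[0]
counts = first_sentence.value_counts()

# Keep only the openings that appear more than once
repeated = counts[counts > 1]
print(repeated)
```

Note that an empty short answer yields an empty first sentence, which is why a blank entry shows up among the counts above.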



In [26]:
quest_df[quest_df.question.str.contains("flood insurance")]

Unnamed: 0,question,short_answer,long_answer,content
62,Do I ever have to buy property or flood insurance from my lender?,"No. You may shop for property or flood insurance. But if you do not get homeowner's insurance, or let your policy lapse, your lender may insure your property and charge you for it.","This is called ""force-placed"" or ""collateral protection"" insurance. It is usually much more expensive than a regular policy. A lender may also buy ""force-placed"" flood insurance for homeowners in flood zones who do not have adequate flood insurance to meet the legal minimum required to protect the property. If you can obtain your own insurance, it will generally be less expensive than the insurance bought by your lender for you. In some cases of force-placed insurance, the policy that the lender buys protects their interest but not your interest in the property. If you believe that any force-placed insurance was purchased in error, you should contact your lender immediately and give proof of your current insurance policy. Tip: If you disagree with your lender's determination that you need flood insurance, you can review the FEMA flood maps. If you think there has been an error, you can ask FEMA to issue a Letter of Map Amendment (LOMA), or a Letter of Map Revision Based on Fill (LOMR-F). If your loan doesn't include an escrow account, you will have to plan for potentially large property-related expenses, such as property taxes and homeowner's insurance premiums. Be sure you budget for your monthly mortgage payments plus these extra costs and stay current on your taxes and insurance payments. If you fail to pay your property taxes, your state or local government may impose fines and penalties or place a tax lien on your home. 
In addition, if you fail to pay any of your property-related costs, your lender may add the amounts to your loan balance, add an escrow account to your loan, or require you to pay for insurance on your home that your lenders buys on your behalf, which likely would be more expensive and provide fewer benefits than what you could obtain on your own.","Do I ever have to buy property or flood insurance from my lender? No. You may shop for property or flood insurance. But if you do not get homeowner's insurance, or let your policy lapse, your lender may insure your property and charge you for it. This is called ""force-placed"" or ""collateral protection"" insurance. It is usually much more expensive than a regular policy. A lender may also buy ""force-placed"" flood insurance for homeowners in flood zones who do not have adequate flood insurance to meet the legal minimum required to protect the property. If you can obtain your own insurance, it will generally be less expensive than the insurance bought by your lender for you. In some cases of force-placed insurance, the policy that the lender buys protects their interest but not your interest in the property. If you believe that any force-placed insurance was purchased in error, you should contact your lender immediately and give proof of your current insurance policy. Tip: If you disagree with your lender's determination that you need flood insurance, you can review the FEMA flood maps. If you think there has been an error, you can ask FEMA to issue a Letter of Map Amendment (LOMA), or a Letter of Map Revision Based on Fill (LOMR-F). If your loan doesn't include an escrow account, you will have to plan for potentially large property-related expenses, such as property taxes and homeowner's insurance premiums. Be sure you budget for your monthly mortgage payments plus these extra costs and stay current on your taxes and insurance payments. 
If you fail to pay your property taxes, your state or local government may impose fines and penalties or place a tax lien on your home. In addition, if you fail to pay any of your property-related costs, your lender may add the amounts to your loan balance, add an escrow account to your loan, or require you to pay for insurance on your home that your lenders buys on your behalf, which likely would be more expensive and provide fewer benefits than what you could obtain on your own."
