# Gathering Data for FLAIR (Financial Literacy AI Resource)
---
FLAIR answers questions using a financial literacy text dataset. In this notebook, we walk through creating that dataset. Because the hackathon limits both time and resources, the dataset is rather small. A production version would draw on a much larger volume of information.  

FLAIR uses only information published by the U.S. government. This avoids problems such as advertising, bias, opinion, and false claims, all of which are possible, if not likely, in a simple Google search.  

The text is split into "chunks" of around 100 words. The dataset is a list of dictionaries of the form:  
```
{
    "_id": int,
    "source_title": str,
    "source_filename": str,
    "source_url": str,
    "section_title": str,
    "passage_text": str
}
```
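
To make the record format concrete, here is a minimal sketch of how a block of text could be turned into records of this shape. The real chunking logic lives in `parse_FL_pdfs` and is not shown in this notebook, so `chunk_words` and `build_records` are hypothetical helpers, and splitting purely on word count is an assumption made for illustration.

```python
from __future__ import annotations


def chunk_words(text: str, target_words: int = 100) -> list[str]:
    """Split text into consecutive chunks of at most `target_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + target_words])
        for i in range(0, len(words), target_words)
    ]


def build_records(text, source_title, source_filename, source_url,
                  section_title, starting_id=0):
    """Wrap each chunk in a dictionary that follows the dataset schema."""
    return [
        {
            "_id": starting_id + offset,
            "source_title": source_title,
            "source_filename": source_filename,
            "source_url": source_url,
            "section_title": section_title,
            "passage_text": chunk,
        }
        for offset, chunk in enumerate(chunk_words(text))
    ]


# Placeholder input: 250 dummy words, so we expect chunks of 100, 100, and 50.
sample_text = " ".join(f"word{i}" for i in range(250))
records = build_records(
    sample_text,
    source_title="Example title",
    source_filename="example.pdf",
    source_url="https://example.com/example.pdf",
    section_title="Example section",
)
print([len(r["passage_text"].split()) for r in records])
```

A naive word-count split like this can cut a sentence in half, so a real parser would likely need extra handling for sentence boundaries.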

---
## Data source: CFPB
<img src="https://upload.wikimedia.org/wikipedia/commons/b/bb/CFPB_logo.svg" alt="drawing" width="300"/>

The Consumer Financial Protection Bureau, or CFPB, is an agency of the U.S. government that is responsible for consumer protection in the financial sector. In short, its aim is to make sure people are treated fairly by banks, lenders, and other financial institutions. Its website has a plethora of information for consumers, on topics as broad as "buying a house" or as specific as "tips when using mobile devices for financial services."  

Here we pull only a very small portion of the financial literacy information that the CFPB provides. Read more about the CFPB and access additional resources on their website at https://www.consumerfinance.gov/

### Your Money, Your Goals
The [Your Money, Your Goals Toolkit](https://www.consumerfinance.gov/consumer-tools/educator-tools/your-money-your-goals/toolkit/) has information on a wide range of money issues. The information is intended to help people make smart spending decisions, fix credit reports, keep track of bills, and more. Basically, it is a broad overview of financial literacy. 

The CFPB also publishes [companion guides](https://www.consumerfinance.gov/consumer-tools/educator-tools/your-money-your-goals/companion-guides/) to the Your Money, Your Goals toolkit, designed to help specific populations. Currently, there are companion guides for the military community, native communities, people with disabilities, and people with criminal records.  

Let's download, parse, and format the text data contained in the PDF publications.

In [1]:
### Imports
from parse_FL_pdfs import parsePDF


# Initialize the list that will hold the text data
financial_literacy = list()

In [2]:
# The main Your Money, Your Goals Toolkit document
pdf_name = "cfpb_your-money-your-goals_financial-empowerment_toolkit.pdf"
download_url = f"https://www.consumerfinance.gov/documents/8956/{pdf_name}"

# Use the PDF parser
PP = parsePDF(download_url, pdf_name)
raw_text = PP.extract_contents(start_regex=r"MODULE 1\s+", download=True)  # Skip the intro
clean_text = PP.clean_text(specific="cfpb_ymyg")
text_data = PP.get_text_data(chapter_regex=r"MODULE [0-9].+", starting_id=0)
financial_literacy.extend(text_data)

# Look at a sample of the extracted text
print("\n" + financial_literacy[-10]["passage_text"])


A freeze helps prevent identity thieves from opening fraudulent accounts in your name. This also means you won't be able to apply for credit as easily if you were planning to open a new account or apply for a loan. You must contact each of the credit reporting companies to freeze your credit report. You will have to contact them to lift the freeze before a third-party can access your credit report. An initial fraud alert requires creditors to verify your identity before opening a new account, issuing an additional card, or increasing the credit limit on an existing account. This is a good first step if you're worried that your identity may be stolen, like after a data breach. The alert lasts for one year and can be renewed after it expires.


In [3]:
# The Your Money, Your Goals companion guides
companion_pdfs = [
    "cfpb_ymyg-servicemembers-companion-guide.pdf",     # military
    "cfpb_ymyg_focus-on-people-with-disabilities.pdf",  # disabilities
    "cfpb_ymyg_reentry_supplement.pdf",                 # criminal justice reentry
    "cfpb_ymyg_focus-on-native-communities.pdf"         # native communities
]
for pdf_name in companion_pdfs:
    download_url = f"https://files.consumerfinance.gov/f/documents/{pdf_name}"
    starting_id = max([d["_id"] for d in financial_literacy]) + 1

    PP = parsePDF(download_url, pdf_name)
    raw_text = PP.extract_contents(start_regex=r"MODULE 1\s+", download=True)
    clean_text = PP.clean_text(specific="cfpb_ymyg")
    text_data = PP.get_text_data(chapter_regex=r"MODULE [0-9].+", starting_id=starting_id)
    financial_literacy.extend(text_data)
    
print("\n" + financial_literacy[-10]["passage_text"])


Preventing elder financial exploitation takes a coordinated community response that includes engaging elders, the people who provide direct services to elders, community leaders, and law enforcement responders. Use this tool and to start engaging community members in protecting elders. the tool contains both strategies for communities to use and actions for elders and their trusted family members to take. taking action at both levels will contribute to a safer community for elders. What to do Identify the steps you can take to prevent elder financial exploitation. these steps. may be different if you are an elder, a family member or caregiver, a community member, or a tribal leader. Check the step when it's completed. To access a dynamic and fillable version of this tool, visit:  consumerfinance. gov/practitioner-resources/your-money-your-goals/companion-guides.
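
The loop above picks each `starting_id` as the largest existing `_id` plus one. Note that `max()` raises a `ValueError` on an empty list, so that expression only works once the first document has been added. A small helper, sketched here under the hypothetical name `next_id`, handles the empty case with the `default` argument of the built-in `max`:

```python
def next_id(records):
    """Return one more than the largest `_id` in `records`, or 0 if it is empty."""
    return max((r["_id"] for r in records), default=-1) + 1


print(next_id([]))                        # 0
print(next_id([{"_id": 0}, {"_id": 7}]))  # 8
```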


### Tools and Resources for Newcomers
There are [tools and resources](https://www.consumerfinance.gov/consumer-tools/educator-tools/adult-financial-education/tools-and-resources/#newcomers) available from the CFPB for many different topics. Here, we download information designed to help the 41 million+ immigrants in America. We will download resources about opening a bank account, selecting financial products and services, how to pay your bills, and ways to receive your money.

Let's download, parse, and format the text data contained in the PDF publications.

In [4]:
# Resources for newcomers
newcomers_documents = [
    "cfpb_adult-fin-ed_checklist-for-opening-an-account.pdf",
    "cfpb_adult-fin-ed_selecting-financial-products-and-services.pdf",
    "201507_cfpb_ways-to-pay-your-bills.pdf",
    "201507_cfpb_ways-to-receive-your-money.pdf"]
for pdf_name in newcomers_documents:
    base_url = "https://files.consumerfinance.gov/f/"
    if pdf_name[0].isdigit():
        download_url = base_url + pdf_name
    else:
        download_url = base_url + "documents/" + pdf_name
    starting_id = max([d["_id"] for d in financial_literacy]) + 1

    PP = parsePDF(download_url, pdf_name)
    raw_text = PP.extract_contents(download=True)
    clean_text = PP.clean_text()
    text_data = PP.get_text_data(starting_id=starting_id)
    financial_literacy.extend(text_data)
    
print("\n" + financial_literacy[-10]["passage_text"])


Credit card. Definition Benefits Risks A credit card allows you to borrow money up to an approved credit limit. You will pay interest if you carry a balance, and you can be charged other fees based on the terms of the contract. You can expect to make a minimum monthly payment and you may want to pay more than the minimum to pay it off sooner. Can use a credit card to pay bills over the phone or online. Easy to prove payment should a dispute arise. Protects you from having to pay for some or all the charges if your card or information is stolen or lost and you report the theft. Can be set up to automatically pay recurring bills. Can help build your credit history if you make payments on time and don't get close to your credit limit. Costs more than paying for the purchase with cash or a check if you can't pay the credit card balance in full every month. If you carry a balance, you have to pay interest on the balance. Creates another bill you have to pay. Creates debtyou are borrowing m

### Money Management
Another section of the [tools and resources](https://www.consumerfinance.gov/consumer-tools/educator-tools/adult-financial-education/tools-and-resources/#money-management) covers money management. We will download resources about overdraft options, gift cards, managing spending, managing cash flow, making financial decisions, and organizing finances. These resources are all quick overviews of one to two pages.

Let's download, parse, and format the text data contained in the PDF publications.

In [5]:
# Money management resources
money_management_pdfs = [
    "cfpb_adult-fin-ed_know-your-overdraft-options.pdf",
    "cfpb_adult-fin-ed_unwrapping-gift-cards-avoid-surprises.pdf",
    "201702_cfpb_Managing-Spending-Ideas-for-Financial-Educators.pdf",
    "201702_cfpb_Consumer-Tips-on-Managing-Spending.pdf", 
    "cfpb_adult-fin-ed_managing-cash-flow-and-bill-payments.pdf",
    "cfpb_adult-fin-ed_five-steps-for-making-financial-decisions.pdf",
    "cfpb_fin-ed-digest_organizing-finances.pdf"
]
for pdf_name in money_management_pdfs:
    download_url = f"https://files.consumerfinance.gov/f/documents/{pdf_name}"
    starting_id = max([d["_id"] for d in financial_literacy]) + 1

    PP = parsePDF(download_url, pdf_name)
    raw_text = PP.extract_contents(download=True)
    clean_text = PP.clean_text()
    text_data = PP.get_text_data(starting_id=starting_id)
    financial_literacy.extend(text_data)
    
print("\n" + financial_literacy[-10]["passage_text"])


Ask; Ask questions about costs and risks. Keep asking more questions until you're sure you understand what you're payingand what you're getting. How much does this cost now? How much will it cost over time? Are there fees, taxes, penalties, or other charges? When do those apply? How do I avoid paying for extra services or add-ons if I don't want them? Can I cancel and get my money back? What's the deadline for canceling? How does the salesperson or company make money from me? Am I comfortable with that? What payment options do I have? Can I adjust my payment date? What's the worst-case scenario? How much money can I lose? How high can the cost of using the product go?
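
Before saving, it can be worth checking that every record still follows the dictionary format described at the top of the notebook. The sketch below is one possible check, not part of `parse_FL_pdfs`. The helper name `find_schema_problems` is hypothetical, and it assumes every field is always present and never `None`, which the notebook does not confirm.

```python
EXPECTED_FIELDS = {
    "_id": int,
    "source_title": str,
    "source_filename": str,
    "source_url": str,
    "section_title": str,
    "passage_text": str,
}


def find_schema_problems(records):
    """Return a list of human-readable problems; an empty list means all is well."""
    problems = []
    seen_ids = set()
    for position, record in enumerate(records):
        # Keys must match the schema exactly.
        if set(record) != set(EXPECTED_FIELDS):
            problems.append(f"record {position}: unexpected or missing keys")
            continue
        # Each value must have the type the schema names.
        for field, expected_type in EXPECTED_FIELDS.items():
            if not isinstance(record[field], expected_type):
                problems.append(
                    f"record {position}: {field} is not {expected_type.__name__}"
                )
        # IDs must be unique across the whole dataset.
        record_id = record["_id"]
        if isinstance(record_id, int):
            if record_id in seen_ids:
                problems.append(f"record {position}: duplicate _id {record_id}")
            seen_ids.add(record_id)
    return problems


# Placeholder record used only to demonstrate the check.
good_record = {
    "_id": 0,
    "source_title": "Example title",
    "source_filename": "example.pdf",
    "source_url": "https://example.com/example.pdf",
    "section_title": "Example section",
    "passage_text": "Example passage.",
}
print(find_schema_problems([good_record]))
print(find_schema_problems([good_record, dict(good_record)]))
```

In the notebook itself, the call would be `find_schema_problems(financial_literacy)`.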


---
## Saving our Data
Now, all we have to do is save our data as a CSV file. We will use pandas to make this easier and to give us a preview of what our data looks like.

In [None]:
import pandas as pd
pd.set_option("display.max_colwidth", None)

# Make dataframe
flair_df = pd.DataFrame(financial_literacy)

# View a sample of 4 entries
flair_df[["source_title", "section_title", "passage_text"]].sample(4)

In [None]:
# Save it to a pipe-delimited csv file
save_name = "financial_literacy_data.csv"
flair_df.to_csv(save_name, index=False, sep="|")
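
Anyone loading the file later has to pass the same separator to `pd.read_csv`. The sketch below runs a round trip on a tiny stand-in dataframe written to a temporary directory (the rows and the `demo.csv` name are placeholders, not real FLAIR data). With pandas' default quoting, a passage that itself contains a `|` is wrapped in quotes on write, so it should survive the round trip.

```python
import os
import tempfile

import pandas as pd

# Stand-in data; the first passage deliberately contains the separator.
demo_df = pd.DataFrame([
    {"_id": 0, "section_title": "Demo", "passage_text": "Fees | penalties may apply."},
    {"_id": 1, "section_title": "Demo", "passage_text": "Ask questions about costs."},
])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.csv")
    demo_df.to_csv(path, index=False, sep="|")
    loaded_df = pd.read_csv(path, sep="|")  # must match the separator used on write

print(loaded_df)
```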