# Questions
- How has the federal budget for programs relating to the environment changed over the past eight years? 
- What assistance programs did these broad programs support? 
- Where did the money from these programs go? 
- What does all this mean for the state of the environment today?  

# Original 3 Data Sources
- [The budget of the US government](https://www.govinfo.gov/app/collection/budget/2021)
- [Federal assistance programs related to the environment](https://beta.sam.gov/)
- [The dollar-for-dollar spending on these programs](https://www.usaspending.gov/download_center/custom_award_data)  
- A crosswalk between environmental outcomes data and the programs intended to effect change 

# Kickoff Video Notes
- This is available 2 hours after kickoff - a bit disappointing, and I can't work that hour, so 3 hours into the challenge, here are my notes upon watching the kickoff video.
- Can I get the slideshow? It would be very useful.

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import os

# Challenge AM - 4 tabs in a spreadsheet

## Federal Legislative Actions
- One way to look at how the Federal Government has tried to affect environmental issues is to count the legislation proposed and enacted over the past few Congresses. The Library of Congress (www.congress.gov) has a searchable database showing all legislative actions taken by Congress. Searching it is often difficult because the term "environment" shows up in legislation that isn't really directed toward the environment, for example an appropriations bill that funds the EPA. This is an exercise in finding legislation that is directly tied to environmental issues and tagging those items appropriately, to quantify the attempts (successful or not) at enacting environmental laws.

### Solution Design?
- Two levels
  - Document Classifier
  - Document Miner?  I don't know what this looks like yet
- Document Classifier
  - Get known environmental legislation and label it
    - [Good article](https://www.findlaw.com/smallbusiness/business-laws-and-regulations/overview-key-federal-environmental-laws.html)
  - Get known non-environmental legislation and label it
  - Try different classification models on the words
- PDFMiner looks like the best Python package to extract all the text (and other objects) from a PDF
  - [PDF Miner Readthedocs site](https://pdfminersix.readthedocs.io/en/latest/)
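
The classifier level of this design could be sketched as a small bag-of-words Naive Bayes model. This is only an illustration under stated assumptions, not the model the notebook ends up using: the function names `train_naive_bayes` and `predict` and the toy word lists are made up here, and a real run would feed in the filtered words from the labeled PDFs.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (list_of_words, label) pairs."""
    doc_counts = Counter()               # number of documents per label
    word_counts = defaultdict(Counter)   # label -> word frequencies
    vocab = set()
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def predict(model, words):
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        # log prior for the label
        score = math.log(doc_counts[label] / total_docs)
        for word in words:
            if word not in vocab:
                continue  # ignore words never seen in training
            # log likelihood with add-one (Laplace) smoothing
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up training data (hypothetical; not taken from the real PDFs)
toy_docs = [
    (["endangered", "species", "habitat", "conservation"], "Environmental"),
    (["marine", "debris", "ocean", "conservation"], "Environmental"),
    (["veterans", "benefits", "appropriations"], "NotEnvironmental"),
    (["doping", "fraud", "sports", "penalties"], "NotEnvironmental"),
]
model = train_naive_bayes(toy_docs)
print(predict(model, ["ocean", "habitat", "species"]))
```

With only a handful of labeled documents, any such model is mainly a way to check that the pipeline works end to end; more labeled legislation would be needed before trusting its labels.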

In [2]:
import pdfminer as pdfm
print(pdfm.__version__)

20201018


In [6]:
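# Fails: `import pdfminer` alone does not load the `high_level` submodule.
# Use `from pdfminer.high_level import extract_text` (or `import pdfminer.high_level`) instead.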
test_text = pdfm.high_level.extract_text('Data/esa_act.pdf')
print(type(test_text))
print(len(test_text))
print(test_text[:1000])

AttributeError: module 'pdfminer' has no attribute 'high_level'

In [5]:
# Works Well
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

output_string = StringIO()
with open('Data/esa_act.pdf', 'rb') as in_file:
    parser = PDFParser(in_file)
    doc = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.create_pages(doc):
        interpreter.process_page(page)

print(output_string.getvalue())

ENDANGERED SPECIES ACT OF 1973


1 

2


ENDANGERED SPECIES ACT OF 1973 1 
[As Amended Through P.L. 108-136, November 24, 2003] 

AN ACT To provide for the conservation of endangered and threatened species of 
fish, wildlife, and plants, and for other purposes. 

Be it enacted by the Senate and House of Representatives of the 
United States of America in Congress assembled, That this Act may 
be cited as the ‘‘Endangered Species Act of 1973’’. 

TABLE OF CONTENTS 

Sec.  2.  Findings, purposes, and policy.

Sec.  3.  Definitions.

Sec.  4.  Determination of endangered species and threatened species.

Sec.  5.  Land acquisition.

Sec.  6.  Cooperation with the States.

Sec.  7.  Interagency cooperation.

Sec.  8.  International cooperation.

Sec.  8A.  Convention implementation.

Sec.  9.  Prohibited acts.

Sec.  10.  Exceptions.

Sec.  11.  Penalties and enforcement.

Sec.  12.  Endangered plants.

Sec.  13.  Conforming amendments.

Sec.  14.  Repealer.

Sec.  15.  Authorization of a

In [3]:
from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

In [4]:
import nltk
nltk.download("punkt")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\MickC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [6]:
# Only needs to be run once, as an installation step
#nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [5]:
import string
from collections import Counter

In [6]:
nltk_stopwords = nltk.corpus.stopwords.words("english")
# dont_stop = [] # Put any words that we want to keep that are in nltk_stopwords here, so we can remove them from the list
punct_list = list(string.punctuation)
# dont_stop_punct = []

#mick_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
#             'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
#             '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'id', '2d', 'one', 'two', '3d', 'ibid'
#            ]

tmp = nltk_stopwords + punct_list # + mick_list
useless_words = tmp
# useless_words = tmp minus the dont* variables
#useless_words

In [7]:
def get_text_from_pdf(pdf_path) :
    output_string = StringIO()
    with open(pdf_path, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)

    return output_string.getvalue()    

In [8]:
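# Placeholder cleaner: strips newlines and lowercases the text. Should be replaced with regular expressions.
# Known limitations: removing newlines without inserting a space fuses words across line breaks,
# and "\\x0c" matches a literal backslash sequence rather than the form-feed character, so form feeds remain.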
def clean_text(input_string) :
    str1 = input_string.replace(" \n", "")
    str2 = str1.replace("\n", "")
    str3 = str2.replace("\\x0c", "")
    final_string = str3.lower()
    return final_string
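
The comment on `clean_text` suggests moving to regular expressions. A minimal sketch of what that might look like follows; the name `clean_text_regex` and the sample string are made up for illustration. Unlike the placeholder, it replaces line breaks and form feeds with a space instead of deleting them, so it is a behavior change rather than a drop-in equivalent.

```python
import re

def clean_text_regex(input_string):
    # \s matches spaces, newlines, and form feeds (\x0c), so one substitution
    # turns every run of layout whitespace into a single space.
    collapsed = re.sub(r"\s+", " ", input_string)
    return collapsed.strip().lower()

# Made-up sample imitating the line breaks and page breaks in extracted PDF text
sample = "threatened species of \nfish, wildlife\n\n\x0c2\n\nSec.  3.  Definitions."
print(clean_text_regex(sample))
```

This sketch does not rejoin words hyphenated across line breaks or strip page numbers; those would need additional patterns.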

In [9]:
def filter_words(input_words) :
    filtered_words = []
    for word in input_words :
        append_it = True
        if word in useless_words :
            #print(f"useless word {word}")
            append_it = False
        elif len(word) == 1 :
            #print(f"word length 1 {word}")
            append_it = False    
        elif word.isdigit() :
            #print(f"number {word}")
            append_it = False
        elif word[0] == chr(167) :
            #print(f"section symbol {word}")
            append_it = False
        if append_it :
            filtered_words.append(word)
    return filtered_words

In [10]:
def count_words(word_list) :
    # most_common() returns (word, count) pairs sorted by descending count
    return Counter(word_list).most_common()

In [12]:
# Create Data For ML Algorithms
# Each labeled directory holds PDFs that all share one class label.
label_dirs = [
    ('Data/EnvironmentLabel', 'Environmental'),
    ('Data/NotEnvironmentLabel', 'NotEnvironmental'),
]

records = []
for envdir, label in label_dirs :
    for file_nm in os.listdir(envdir) :
        filepath = envdir + '/' + file_nm
        doc_str = get_text_from_pdf(filepath)
        txt_cln = clean_text(doc_str)
        nltk_wds = nltk.word_tokenize(txt_cln)
        filt_words = filter_words(nltk_wds)
        wc = count_words(filt_words)
        records.append({
            'doc_num': len(records),
            'doc_filepath': filepath,
            'doc_text': doc_str,
            'text_cleaned': txt_cln,
            'nltk_words': nltk_wds,
            'filtered_words': filt_words,
            'word_counts': wc,
            'env_label': label,
        })

# One row per document; column order matches the dict keys above
doc_df = pd.DataFrame(records)
print(doc_df)

# Split Data

# Build Model

# Optionally Test Model

   doc_num                                  doc_filepath  \
0        0             Data/EnvironmentLabel/esa_act.pdf   
1        1     Data/EnvironmentLabel/PLAW-116publ186.pdf   
2        2      Data/EnvironmentLabel/PLAW-116publ63.pdf   
3        3         Data/EnvironmentLabel/SaveOurSeas.pdf   
4        4  Data/NotEnvironmentLabel/PLAW-116publ153.pdf   
5        5  Data/NotEnvironmentLabel/PLAW-116publ206.pdf   
6        6  Data/NotEnvironmentLabel/PLAW-116publ254.pdf   

                                            doc_text  \
0  ENDANGERED SPECIES ACT OF 1973\n\n\n1 \n\n2\n...   
1  PUBLIC LAW 116–186—OCT. 30, 2020 \n\n134 STAT....   
2  133 STAT. 1120 \n\nPUBLIC LAW 116–63—OCT. 4, 2...   
3  PUBLIC LAW 116–224—DEC. 18, 2020 \n\nSAVE OUR ...   
4  134 STAT. 688 \n\nPUBLIC LAW 116–153—AUG. 8, 2...   
5  PUBLIC LAW 116–206—DEC. 4, 2020 \n\nRODCHENKOV...   
6  PUBLIC LAW 116–254—DEC. 23, 2020 \n\n134 STAT....   

                                        text_cleaned  \
0  endangered 

In [32]:
doc_str

'ENDANGERED SPECIES ACT OF 1973\n\n\n1 \n\n\x0c2\n\n\n\x0cENDANGERED SPECIES ACT OF 1973 1 \n[As Amended Through P.L. 108-136, November 24, 2003] \n\nAN ACT To provide for the conservation of endangered and threatened species of \nfish, wildlife, and plants, and for other purposes. \n\nBe it enacted by the Senate and House of Representatives of the \nUnited States of America in Congress assembled, That this Act may \nbe cited as the ‘‘Endangered Species Act of 1973’’. \n\nTABLE OF CONTENTS \n\nSec.  2.  Findings, purposes, and policy.\n\nSec.  3.  Definitions.\n\nSec.  4.  Determination of endangered species and threatened species.\n\nSec.  5.  Land acquisition.\n\nSec.  6.  Cooperation with the States.\n\nSec.  7.  Interagency cooperation.\n\nSec.  8.  International cooperation.\n\nSec.  8A.  Convention implementation.\n\nSec.  9.  Prohibited acts.\n\nSec.  10.  Exceptions.\n\nSec.  11.  Penalties and enforcement.\n\nSec.  12.  Endangered plants.\n\nSec.  13.  Conforming amendments.\n

In [33]:
txt_cln

'endangered species act of 19731\x0c2\x0cendangered species act of 1973 1[as amended through p.l. 108-136, november 24, 2003]an act to provide for the conservation of endangered and threatened species offish, wildlife, and plants, and for other purposes.be it enacted by the senate and house of representatives of theunited states of america in congress assembled, that this act maybe cited as the ‘‘endangered species act of 1973’’.table of contentssec.  2.  findings, purposes, and policy.sec.  3.  definitions.sec.  4.  determination of endangered species and threatened species.sec.  5.  land acquisition.sec.  6.  cooperation with the states.sec.  7.  interagency cooperation.sec.  8.  international cooperation.sec.  8a.  convention implementation.sec.  9.  prohibited acts.sec.  10.  exceptions.sec.  11.  penalties and enforcement.sec.  12.  endangered plants.sec.  13.  conforming amendments.sec.  14.  repealer.sec.  15.  authorization of appropriations.sec.  16.  effective date.sec.  17. 

In [None]:
mth_num = pd.Series([], name='mth_num', dtype='int')
year = pd.Series([], name='year', dtype='int')
tank_capacity_ser = pd.Series([], name='tcs', dtype='int')
m_usage = (daily_usage * 30)
mth_usage = pd.Series([], name='mth_usage', dtype='float')
mth_rainfall = pd.Series([], name='mth_rainfall', dtype='float')
mth_rf_collected = pd.Series([], name='mth_rf_collected', dtype='float')


s_idx = 0
for fmonth in range(1, 25) :
    use_month = fmonth % 12
    if use_month == 0 :
        use_month = 12
    fyear = 2003
    if fmonth > 12 :
        fyear = 2004
    #tcap = 10
    mth_num.at[s_idx] = use_month
    year.at[s_idx] = fyear
    tank_capacity_ser.at[s_idx] = tank_capacity
    mth_usage[s_idx] = m_usage
    filter = rf_monthly_df['mth_num'] == use_month
    if fyear == 2003 :
        m_rf = rf_monthly_df.loc[filter, '2003'].values[0]
    else :
        m_rf = rf_monthly_df.loc[filter, '2004'].values[0]
    mth_rainfall.at[s_idx] = m_rf
    mth_rf_collected.at[s_idx] = m_rf * roof_size
    s_idx += 1

#print(len(mth_rainfall.values), len(mth_rainfall_collected))    

forecast_mth_df = mth_num.to_frame().\
               join(year).\
               join(tank_capacity_ser).\
               join(mth_usage).\
               join(mth_rainfall).\
               join(mth_rf_collected)
print(forecast_mth_df)

forecast_mth_df.to_csv("monthly_forecast.csv")

## State Legislative Actions
- The National Conference of State Legislatures (NCSL) maintains a terrific database on legislation proposed and enacted at the state level. You can search this by various criteria. It would be interesting to see if additional breakouts are warranted and to quantify the amount of legislation proposed and enacted at the state level. 
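
Once search results are exported, quantifying them could be as simple as the tally below. This is a sketch only: the `(state, status)` record layout and the placeholder rows are assumptions, not the actual NCSL export format or real data.

```python
from collections import Counter

# Hypothetical placeholder records: (state, status). Not real NCSL data.
bills = [
    ("State A", "proposed"),
    ("State A", "enacted"),
    ("State A", "proposed"),
    ("State B", "proposed"),
]

# Count bills per (state, status) pair
counts = Counter(bills)
for (state, status), n in sorted(counts.items()):
    print(f"{state}: {n} {status}")
```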

## EPA Strategic Plans
- The Federal Government's Strategic Planning process was designed to connect programs and resources to the goals and objectives each administration intends to affect. The overall strategic plans provide a narrative about how the administration plans to effect change, and the Annual Performance Report details progress. Although the narrative in these reports provides insight into key initiatives and actions taken, the reports also include specific metrics that provide objective and measurable outcomes to measure performance. Below are the goals, objectives, and metrics measured in the last Annual Performance Report for the Trump and Obama administrations. Because each administration sets its own strategic plans, comparing performance across administrations is difficult. Lagged data also makes it difficult to measure performance or outcomes for the administration itself. Finding sources for these data (in machine-readable format) that live beyond an administration makes these measurements more useful for the public to understand how each successive administration has advanced, held steady, or rolled back performance compared to the prior administration. This exercise will help the public better understand the effects of administration change on the measures in the EPA strategic plans.

### Superfund Work
- I was able to fill in about 10 rows on the spreadsheet as there is an EPA Superfund site that I knew about and was quickly able to find.  I added some analysis work there.

## EPA Budget
- The EPA's budgets and other financial reports can be found here: https://www.epa.gov/planandbudget/archive#FinancialReports. What we're looking to do is bring its budget data into a machine-readable format. Choose one of the budgets listed and find the breakdown of spending by program (usually in the appendix). Note the line item below with the year, category, and the budget request. If you need to stop at some point, note the status (likely the page number) of where you stopped so that someone can take over.
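
One possible machine-readable shape for those line items is a flat CSV with one row per item. The sketch below is illustrative only: the `BudgetLineItem` class, its field names, the assumed units, and the placeholder row are all assumptions, not the layout of the challenge spreadsheet.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields

@dataclass
class BudgetLineItem:
    year: int
    category: str
    budget_request: float  # units are an assumption (e.g. thousands of dollars)
    source_page: int       # page where the item was found, so someone can take over

def write_line_items(items, out_file):
    # Column order follows the dataclass field order
    writer = csv.DictWriter(out_file, fieldnames=[f.name for f in fields(BudgetLineItem)])
    writer.writeheader()
    for item in items:
        writer.writerow(asdict(item))

# Placeholder row only; not a real EPA budget figure
items = [BudgetLineItem(year=2000, category="Placeholder program",
                        budget_request=0.0, source_page=1)]
buffer = io.StringIO()
write_line_items(items, buffer)
print(buffer.getvalue())
```

Recording the source page on every row doubles as the "where I stopped" marker the instructions ask for.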

### Trying to help others
- Emily Ng looks like she's working on this and was looking for help in the general channel.
- Emily did a code walkthrough for me and we talked some.  I think I only offered one useful suggestion on her code.  
- She shared the PDF libraries she is using: PyPDF2, apparently good for getting text from PDFs, and camelot, apparently good for pulling tables out of PDFs.