# U.S. Farm Bills (1933 to 2018)

This notebook covers 18 "farm bills". The bills from 1996 to 2018 are available as text online, while the bills from 1933 to 1990 are available only as PDFs on this [website](https://nationalaglawcenter.org/farmbills/), which hosts an archive of agriculture-related federal legislation.

**Questions of interest**: What are the most important priorities across the different farm bills and how have they changed over time? How has the role of soil conservation in agriculture changed over time in the bills?

In [4]:
# installing spacy
#!pip install -U pip setuptools wheel
#!pip install -U spacy
#!python -m spacy download en_core_web_md # (for word vector models)
#!pip install spacy-transformers
#!python -m spacy download en_trf_bertbaseuncased_lg # transformer architectures (large model); spaCy 2.x / spacy-transformers 0.x model name

In [143]:
# libraries
import pandas as pd
import geopandas as gpd
import numpy as np

import requests
import urllib
from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup

from pdfminer.high_level import extract_text
from PyPDF2 import PdfFileReader, PdfFileWriter, PdfFileMerger

import re
import os
import os.path
from sklearn import svm
import spacy
import torch
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
import json

# set display
pd.options.display.max_columns = 150
pd.options.display.max_rows = 300

In [47]:
# scraping from URL: unable to scrape using urllib and requests, have to use Selenium
URL = 'https://nationalaglawcenter.org/farmbills/'

In [6]:
# initializing
#service = Service(executable_path = r'C:\Users\melod\Documents\data science\geckodriver-v0.33.0-win64\geckodriver.exe')
options = webdriver.FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(options = options, service = Service(GeckoDriverManager().install()))
driver.get(URL)



In [20]:
# collecting the section headers (h3) to see how the page is organized
headers = driver.find_elements(By.TAG_NAME, 'h3')

sections = [header.text for header in headers]

sections

['GENERAL RESOURCES',
 'FARM BILL LEGISLATION',
 'Additional Historical Farm Legislation',
 'GENERAL RESOURCES',
 'Congressional Research Service Report Subjects:',
 'FARM BILL LEGISLATION',
 'The 2014 Farm Bill: Agricultural Act of 2014',
 'The 2008 Farm Bill: Food, Conservation, and Energy Act of 2008 (H.R. 6124)',
 'The 2002 Farm Bill: The Farm Security and Rural Investment Act of 2002',
 'The 1996 Farm Bill: The Federal Agriculture Improvement and Reform Act of 1996',
 'Food, Agriculture, Conservation, and Trade Act of 1990',
 'Food Security Act of 1985',
 'Agriculture and Food Act of 1981',
 'Food and Agriculture Act of 1977',
 'Agricultural and Consumer Protection Act of 1973',
 'Agricultural Act of 1970',
 'Food and Agricultural Act of 1965',
 'Agricultural Act of 1956',
 'Agricultural Act of 1954',
 'Agricultural Act of 1949',
 'Agricultural Act of 1948',
 'Agricultural Adjustment Act of 1938',
 'Agricultural Adjustment Act of 1933',
 'Amendments to the National Wool Act',
 'Om

In [14]:
# collecting every link on the page to locate the bill text/PDF URLs
#bills = driver.find_elements(By.XPATH, '//a[@class="toc-item-heading"]')
bills = driver.find_elements(By.TAG_NAME, 'a')
#bills = driver.find_elements(By.PARTIAL_LINK_TEXT, 'congress.gov/bill')

billlist = []
for bill in bills:
    bill = bill.get_attribute('href')
    billlist.append(bill)

billlist

[None,
 'https://nationalaglawcenter.org/',
 'http://nationalaglawcenter.org/ag-and-food-law-blog/',
 'https://nationalaglawcenter.org/about-the-center/',
 'https://nationalaglawcenter.org/about-the-center/professional-staff/',
 'https://nationalaglawcenter.org/partners/',
 'https://nationalaglawcenter.org/research-by-topic/',
 'https://nationalaglawcenter.org/center-publications/',
 'https://nationalaglawcenter.org/webinars/',
 'https://nationalaglawcenter.org/state-compilations/',
 'https://nationalaglawcenter.org/farmbills/',
 'https://nationalaglawcenter.org/ag-law-bibliography/',
 'https://nationalaglawcenter.org/ag-law-glossary/',
 'https://nationalaglawcenter.org/aglaw-reporter/',
 'https://nationalaglawcenter.org/general-resources/',
 'https://nationalaglawcenter.org/website-guide/',
 'https://nationalaglawcenter.org/disclaimer/',
 'https://nationalaglawcenter.org/about-the-center/',
 'https://nationalaglawcenter.org/about-the-center/professional-staff/',
 'https://www.congress

In [67]:
# scrape URL for online bill texts for Farm bills 1996 to 2018
bills = driver.find_elements(By.XPATH, '//*[contains(text(), "Text")]')

list1 = []
for bill in bills[11:16]:
    bill = bill.get_attribute('href')
    list1.append(bill)
    
list1

['https://www.congress.gov/bill/115th-congress/house-bill/2/text?format=txt&q=%7B%22search%22%3A%5B%22hr2%22%5D%7D&r=1',
 'https://www.govinfo.gov/content/pkg/PLAW-113publ79/html/PLAW-113publ79.htm',
 'https://www.congress.gov/bill/110th-congress/house-bill/6124/text',
 'https://www.congress.gov/bill/107th-congress/house-bill/2646/text',
 'https://www.govtrack.us/congress/bills/104/hr2854/text']

### Scraping links to PDFs

Every bill has been uploaded as a PDF. Some of the earlier bills were uploaded in several parts, while the most recent bills are also publicly archived online through the Congress.gov website.

For consistency, all bills will be collected as PDFs before being analyzed.

In [35]:
# scrape PDFs for Farm bills 1996 to 2018
PDFS = driver.find_elements(By.XPATH, '//*[text()="PDF"]')

list1 = []
for PDF in PDFS:
    PDF = PDF.get_attribute('href')
    list1.append(PDF)
    
list1

['https://www.congress.gov/115/bills/hr2/BILLS-115hr2enr.pdf',
 'https://www.govinfo.gov/content/pkg/BILLS-113hr2642enr/pdf/BILLS-113hr2642enr.pdf',
 'https://www.congress.gov/110/plaws/publ246/PLAW-110publ246.pdf',
 'http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=107_cong_public_laws&docid=f:publ171.107.pdf',
 'https://www.congress.gov/104/plaws/publ127/PLAW-104publ127.pdf']

In [45]:
# scrape PDFs for Farm bills 1977 to 1990
PDFS = driver.find_elements(By.XPATH, '//*[contains(text(), "Part ")]')

list1 = []
for PDF in PDFS:
    PDF = PDF.get_attribute('href')
    list1.append(PDF)
    
list1

['https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-1.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-2.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-3.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-4.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-5.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-6.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-7.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-8.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-9.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-10.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990-11.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1990conf-house9

In [46]:
# scrape PDFs for Farm bills before 1977: only first 9
PDFS = driver.find_elements(By.XPATH, '//*[text()="Full Text"]')

list1 = []
for PDF in PDFS[:9]:
    PDF = PDF.get_attribute('href')
    list1.append(PDF)
    
list1

['https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1973.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1970.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1965.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1956.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1954.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1949.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1948.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1938.pdf',
 'https://nationalaglawcenter.org/wp-content/uploads/assets/farmbills/1933.pdf']

### Prepping bills from 1977 to 1990

Each of the 4 bills from this period was scanned, split, and uploaded as multiple documents. The parts of each bill need to be merged back into a single PDF before the bill can be analyzed.

However, the "title" page of each part should be removed before merging, so that the repeated title text does not artificially inflate the counts of specific keywords (e.g. "agricultural") in each bill.
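The title-page removal described above could be folded into the merge step. The following is a minimal sketch, not the method used in the cells below: it assumes that each part has exactly one title page at the front, that the parts are named `<year>-<n>.pdf`, and that PyPDF2 is older than 3.0 (where `PdfFileMerger` and `PdfFileReader` still exist). The helper names `part_paths` and `merge_without_title_pages` are hypothetical.

```python
from pathlib import Path

def part_paths(parts_dir, year, n_parts):
    """Build the expected part file paths, e.g. 1990-1.pdf ... 1990-11.pdf."""
    return [Path(parts_dir) / f"{year}-{i}.pdf" for i in range(1, n_parts + 1)]

def merge_without_title_pages(paths, out_path, skip=1):
    """Merge PDFs into out_path, dropping the first `skip` pages of every part.

    Assumption: the title page(s) are the first `skip` pages of each part.
    """
    # Imported here so that part_paths() can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileMerger, PdfFileReader

    merger = PdfFileMerger()
    for path in paths:
        n_pages = PdfFileReader(str(path)).getNumPages()
        # pages=(start, stop) keeps pages start..stop-1 (0-indexed)
        merger.append(str(path), pages=(skip, n_pages), import_bookmarks=False)
    merger.write(str(out_path))
    merger.close()

# Hypothetical usage for the four multi-part bills (part counts from the merge cells below):
# parts_dir = r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts"
# for year, n_parts in {1977: 2, 1981: 2, 1985: 2, 1990: 11}.items():
#     merge_without_title_pages(part_paths(parts_dir, year, n_parts), f"{year}.pdf")
```

It would be worth opening one or two parts by hand first to confirm how many leading pages are title pages before relying on `skip=1`.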

In [93]:
# 1977

# initializing merge
merger = PdfFileMerger()

# merging
for pdf in [r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1977-1.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1977-2.pdf"]:
    merger.append(pdf, import_bookmarks=False) # kept getting an error otherwise

# save to folder
merger.write("1977.pdf")
merger.close()
#path = 'C://Users//melod//Documents//data science//Food-Systems-Policy-Research//Food and Ag CA Legislation//Data'
#saved = os.path.join(path, "1977.pdf") 
#PDF = open(saved, "w")
#toFile = input("1977")
#PDF.write(toFile)
#PDF.close()

In [95]:
# 1981

# initializing merge
merger = PdfFileMerger()

# merging
for pdf in [r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1981-1.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1981-2.pdf"]:
    merger.append(pdf, import_bookmarks=False) # kept getting an error otherwise

# save
merger.write("1981.pdf")
merger.close()

In [96]:
# 1985

# initializing merge
merger = PdfFileMerger()

# merging
for pdf in [r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1985-1.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1985-2.pdf"]:
    merger.append(pdf, import_bookmarks=False) # kept getting an error otherwise

# save
merger.write("1985.pdf")
merger.close()

In [98]:
# 1990

# initializing merge
merger = PdfFileMerger()

# merging
for pdf in [r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-1.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-2.pdf",
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-3.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-4.pdf",
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-5.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-6.pdf",
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-7.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-8.pdf",
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-9.pdf", 
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-10.pdf",
           r"C:\Users\melod\Documents\data science\data\Farm Bills\Parts\1990-11.pdf"]:
    merger.append(pdf, import_bookmarks=False) # kept getting an error otherwise

# save
merger.write("1990.pdf")
merger.close()

### Pre-processing Bill Texts 


In [146]:
# installing wordninja
#!pip install wordninja

In [147]:
# libraries
from textblob import TextBlob # spellcorrecting
# python -m textblob.download_corpora
import wordninja # separating misjoined words

In [7]:
# loading bill PDFs from data folder in directory
bills = os.listdir(r"C:\Users\melod\Documents\data science\data\Farm Bills")

# inspect (the count includes the 'Parts' subfolder, so there are 18 bill PDFs)
print(type(bills))
print(len(bills))
print(bills)

<class 'list'>
19
['1933.pdf', '1938.pdf', '1948.pdf', '1949.pdf', '1954.pdf', '1956.pdf', '1965.pdf', '1970.pdf', '1973.pdf', '1977.pdf', '1981.pdf', '1985.pdf', '1990.pdf', '1996.pdf', '2002.pdf', '2008.pdf', '2014.pdf', '2018.pdf', 'Parts']


In [12]:
# extracting text from bill PDFs
def readPDF(filename):
    # read the raw text of one bill PDF
    txt = extract_text("C:/Users/melod/Documents/data science/data/Farm Bills/" + filename)
    
    # strip quotation marks
    txt = re.sub(r'"', '', txt)
    # remove punctuation, numbers, etc. ([A-z] would also keep [ \ ] ^ _ and `)
    txt = re.sub(r'[^A-Za-z\s]', '', txt)
    # remove any extra whitespace
    txt = re.sub(r'\s+', ' ', txt) 
    
    # Add year identifier to the beginning of each bill
    
    # clean up bill name, e.g. '1933.pdf' -> '1933'
    bill = filename.split(".")[0]
    
    # insert year ID to start of bill string: indexable bill ID w/i each bill txt string
    txt = bill+', '+txt
    
    # confirm
    print('Finished {}'.format(bill))
    return txt

# read in all pdf files (this skips the 'Parts' folder)
suffixes = ('.pdf', '.PDF')
farmbills = [readPDF(bill) for bill in bills if bill.endswith(suffixes)]

Finished 1933
Finished 1938
Finished 1948
Finished 1949
Finished 1954
Finished 1956
Finished 1965
Finished 1970
Finished 1973
Finished 1977
Finished 1981
Finished 1985
Finished 1990
Finished 1996
Finished 2002
Finished 2008
Finished 2014
Finished 2018


In [13]:
# inspect
print(type(farmbills))
print(len(farmbills))

<class 'list'>
18


In [None]:
# save list of bills to hard drive so that it can be reloaded directly instead of
#  having to rerun the time- and resource-intensive PDF text extraction
#import pickle

#with open('farmbills.pickle', 'wb') as f:
 #   pickle.dump(farmbills, f)
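A small load-or-build helper is one way to make this caching automatic. This is a sketch only; `load_or_build` is a hypothetical name, and the cache file name is taken from the commented-out cell above.

```python
import os
import pickle

def load_or_build(cache_path, build):
    """Return the pickled object at cache_path if it exists; otherwise call build(), cache and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    result = build()
    with open(cache_path, 'wb') as f:
        pickle.dump(result, f)
    return result

# Hypothetical usage with the names from the cells above:
# farmbills = load_or_build(
#     'farmbills.pickle',
#     lambda: [readPDF(bill) for bill in bills if bill.endswith(('.pdf', '.PDF'))])
```

The cache file has to be deleted by hand whenever the cleaning steps in `readPDF` change, otherwise the stale text is reloaded.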

In [None]:
# list of words to exclude
swords = stopwords.words('english')

### TEST CODE AREA

In [105]:
#TEST
string = '(7) PARTICIPATION AGREEMENTS.-"(A) IN GENERAL.-Producers on a farm desiring toparticipate in the program conducted under this subsectionshall execute an agreement with the Secretary providingfor the participation not later than such date as the Sec-retary may prescribe."(B) MODIFICATION OR TERMINATION.-The Secretary may,by mutual agreement with producers on a farm, modify orterminate any such agreement if the Secretary determinesthe action necessary because of an emergency created bydrought or other disaster or to prevent or alleviate a short-age in the supply of agricultural commodities. The Sec-retary may modify the agreement under this subparagraphfor the purpose of alleviating a shortage in the supply ofagricultural commodities only if there has been a signifi-cant change in the estimated stocks of the commodity sincethe Secretary announced the final terms and conditions ofthe program for the crop of rice.'

In [153]:
# separating misjoined words: wordninja.split() takes a string and returns a list of words,
# so loop over whitespace-separated tokens (looping over `string` itself yields single characters)
for token in string.split():
    print(' '.join(wordninja.split(token)))

In [None]:
# POS tagging (requires tb_string, which is created in the TEST cell below)
tb_string.tags

In [167]:
# TEST

# strip quotation marks
text = re.sub(r'(")+', '', string)

# remove parenthesized markers of up to 2 characters, e.g. (7) or (A)
text = re.sub(r'(\([A-Za-z0-9]{,2}\))+', '', text)

# remove hyphens followed by whitespace, replace a period followed by hyphens with a space,
# and remove hyphens between letters (e.g. Sec-retary -> Secretary; this also joins
# genuinely hyphenated words)
text = re.sub(r'\-(\n|\s)+', '', text)
text = re.sub(r'\.\-+', r' ', text)
text = re.sub(r'(?<=[A-Za-z])\-(?=[A-Za-z])', '', text)

# add space after commas directly followed by a letter
text = re.sub(r'(,)(?=[A-Za-z])', r'\1 ', text)

# correct spelling errors (created when removing hyphens or spaces)
# NOTE: correct() returns a new TextBlob; the corrected text is not applied to `text` here
tb_string = TextBlob(text)
tb_string.correct()

# separating misjoined words
#for word in text.split():
 #   ' '.join(wordninja.split(word))

# remove any extra whitespace
text = re.sub(r'\s+', ' ', text) 

# separate into sentences
#text = re.sub(r'(?<!\d)(\.)(?!\d)', r', ', text) # using regex: replace periods w/ commas when not preceded/followed by digits
text = sent_tokenize(text, language = "english") # using nltk
        
# inspect
text

[' PARTICIPATION AGREEMENTS.IN GENERAL Producers on a farm desiring toparticipate in the program conducted under this subsectionshall execute an agreement with the Secretary providingfor the participation not later than such date as the Secretary may prescribe.',
 'MODIFICATION OR TERMINATION The Secretary may, by mutual agreement with producers on a farm, modify orterminate any such agreement if the Secretary determinesthe action necessary because of an emergency created bydrought or other disaster or to prevent or alleviate a shortage in the supply of agricultural commodities.',
 'The Secretary may modify the agreement under this subparagraphfor the purpose of alleviating a shortage in the supply ofagricultural commodities only if there has been a significant change in the estimated stocks of the commodity sincethe Secretary announced the final terms and conditions ofthe program for the crop of rice.']
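The regex steps tested above could be wrapped in one function so that they can be applied to every bill. This is a sketch under assumptions: `clean_bill_text` is a hypothetical name, the sample string is made up for illustration, and the function covers only the regex steps (no spelling correction, word splitting, or sentence tokenizing). Unlike the test cell, it also strips leading and trailing whitespace.

```python
import re

def clean_bill_text(raw):
    """Apply the regex clean-up steps from the test cell to one string."""
    text = re.sub(r'"+', '', raw)                              # quotation marks
    text = re.sub(r'(\([A-Za-z0-9]{,2}\))+', '', text)         # markers such as (7) or (A)
    text = re.sub(r'-\s+', '', text)                           # hyphen followed by whitespace
    text = re.sub(r'\.-+', ' ', text)                          # period followed by hyphens
    text = re.sub(r'(?<=[A-Za-z])-(?=[A-Za-z])', '', text)     # hyphen between letters
    text = re.sub(r',(?=[A-Za-z])', ', ', text)                # missing space after a comma
    return re.sub(r'\s+', ' ', text).strip()                   # collapse whitespace

# made-up sample in the style of the scanned bill text
sample = '"(B) MODIFICATION.-The Sec-retary may,by mutual agreement, modify it.'
print(clean_bill_text(sample))
# MODIFICATION The Secretary may, by mutual agreement, modify it.
```

Words that were run together in the scan (e.g. "toparticipate") are not repaired by these steps and would still need the word-splitting pass.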

In [137]:
# testing search: first comma directly followed by letters
comp = re.compile(r'(,)[A-Za-z]+')
find = comp.search(string)
find

<re.Match object; span=(321, 324), match=',by'>

### NLP Analysis: Identifying Interests, Policy Goals, and Mechanisms

1. Basic keyword search
2. Keyword proximities: Do specific keywords appear near each other in the bill language (e.g. soil and food)?
   (create a word list per bill document; each document is a list of words)
3. Text classification using word vector representations: Search the bill text for terms under the categories "INTERESTS", "OBJECTIVES", and "INCENTIVES" to identify which stakeholder interests each bill addresses, which policy goals are prioritized or ignored, and which monetary and punitive tools are used to facilitate both stated and inferred policy outcomes. The results of (1, 2) are then compared with (3) to ascertain the extent to which community-scale food production, and land and environmental health as a prerequisite for healthy food systems, are aligned with or contrary to the stated and inferred policy goals in the farm bills, and how, if at all, this changes over time.
   (create a sentence list per bill document; each document is a list of sentences)
4. Text summarization: Summarizing the bill contents

Using pretrained sentence BERT (SBERT)
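Step 2 of the plan (keyword proximity) can be sketched with plain Python once a bill is available as a list of words. The function name `keyword_proximity`, the window size, and the sample sentence are assumptions made for illustration, not part of the analysis above.

```python
def keyword_proximity(words, kw_a, kw_b, window=10):
    """Count occurrences of kw_a that have kw_b within `window` words on either side."""
    positions_b = [i for i, w in enumerate(words) if w == kw_b]
    hits = 0
    for i, w in enumerate(words):
        if w == kw_a and any(abs(i - j) <= window for j in positions_b):
            hits += 1
    return hits

# made-up example sentence, already lowercased and split into words
words = "the secretary shall promote soil conservation to protect the food supply".split()
print(keyword_proximity(words, 'soil', 'food', window=6))  # 1
print(keyword_proximity(words, 'soil', 'food', window=3))  # 0
```

The count is not symmetric: it tallies occurrences of the first keyword, so swapping the two keywords can give a different number when one of them is much more frequent.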

### Keywords: How Are Farm Bills Talking About Food and Environment?

In [None]:
# descriptive statistics: document word count (whitespace-separated tokens per bill)
count = [len(bill.split()) for bill in farmbills]

In [None]:
# list of keywords of interest
keywords = ['food', 'agriculture', 'producer', 'soil', 'conservation', 'regenerative']


In [None]:
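# keyword counts

# turning cleaned bill text into list of words
# NOTE: `textonly` is not defined in the cells above; it is assumed to hold the
#  cleaned text of a single bill (e.g. one element of farmbills)
wordlist = [word for word in word_tokenize(textonly.lower()) 
                 if word not in swords]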
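Given such a word list, the keyword counts themselves could be tallied with `collections.Counter`. This is a minimal sketch: `keyword_counts` is a hypothetical helper and the sample word list is made up.

```python
from collections import Counter

keywords = ['food', 'agriculture', 'producer', 'soil', 'conservation', 'regenerative']

def keyword_counts(wordlist, keywords):
    """Return {keyword: number of exact-match occurrences in wordlist}."""
    counts = Counter(wordlist)
    return {kw: counts[kw] for kw in keywords}

# made-up sample word list
sample_words = ['soil', 'conservation', 'program', 'soil', 'producers']
print(keyword_counts(sample_words, keywords))
# {'food': 0, 'agriculture': 0, 'producer': 0, 'soil': 2, 'conservation': 1, 'regenerative': 0}
```

Matching is exact, so "producers" is not counted under "producer" and "agricultural" is not counted under "agriculture". Stemming the word list and the keywords first (for example with the `PorterStemmer` imported earlier) would be one way to group such variants.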

### Identifying Stakeholders, Policy Goals, and Tools/Incentives: (A) Stakeholders


In [None]:
# libraries
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# creating classifier categories of stakeholder interests represented
class Category:
    GOVERNMENT = "Government"
    BUSINESS = "Business"
    CIVIC = "Civic"

# NOTE: `billsentence` is not defined above; it must hold six cleaned sentences,
#  in the same order as the six labels below
train_xstakeholder = billsentence # sample of cleaned sentences from sentence list
train_ystakeholder = [Category.GOVERNMENT, Category.GOVERNMENT, 
                      Category.BUSINESS, Category.BUSINESS, Category.CIVIC,
                     Category.CIVIC]

In [None]:
# initializing
#vectorizer = CountVectorizer(binary = True) # binary: presence/absence of each word
vectorizer = CountVectorizer(binary = True, ngram_range = (1,2)) # binary, unigrams and bigrams
#vectorizer = CountVectorizer() # absolute counts of words (captures multiple instances)

# fit dictionary of unique words pulled from corpus
train_xvectors = vectorizer.fit_transform(train_xstakeholder)

# inspect sample
print(vectorizer.get_feature_names_out()) # get_feature_names() was removed in scikit-learn 1.2
print(train_xvectors.toarray())

In [None]:
# load pre-trained spacy model (size = medium)
nlp = spacy.load('en_core_web_md')

In [None]:
# vectorize each sentence used to train the model (doc.vector averages the word vectors)
docs = [nlp(text) for text in train_xstakeholder]

# inspect sample
#print(docs[0].vector)

# creating train x for vectorized corpus
train_xstakeholder_v = [doc.vector for doc in docs] 

In [None]:
# fitting
clf_svm_wv = svm.SVC(kernel = 'linear') # linear is a good classification model for text?
clf_svm_wv.fit(train_xstakeholder_v, train_ystakeholder)

In [None]:
# predicting: testing model 
# insert into loop for each sentence in sentence list for each bill corpus
test_txt = [] # insert leftover bills corpus (minus text from train_xstakeholder)
test_corp = [nlp(text) for text in test_txt]
test_txt_wv = [x.vector for x in test_corp]

clf_svm_wv.predict(test_txt_wv) # requires test_txt to be non-empty
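The loop mentioned in the comments above (predicting a label for every sentence of every bill) could look roughly like this. It is a sketch under assumptions: `label_counts_by_bill` and the `{bill_id: [sentences]}` layout are hypothetical, and the embedding and prediction steps are passed in as functions so that the helper does not depend on a particular model.

```python
from collections import Counter

def label_counts_by_bill(bill_sentences, embed, predict):
    """Tally predicted labels per bill.

    bill_sentences: dict mapping a bill ID to its list of sentences
    embed:          function mapping one sentence to a vector
    predict:        function mapping a list of vectors to a list of labels
    """
    results = {}
    for bill_id, sentences in bill_sentences.items():
        if not sentences:
            # skip prediction for empty bills; classifiers usually reject empty input
            results[bill_id] = Counter()
            continue
        vectors = [embed(sentence) for sentence in sentences]
        results[bill_id] = Counter(predict(vectors))
    return results

# With the objects from the cells above, the call would be along the lines of:
# label_counts_by_bill(bill_sentences, lambda s: nlp(s).vector, clf_svm_wv.predict)
```

Comparing the resulting counts across bill years would then show how the share of each stakeholder category shifts over time, provided the training sample is large enough for the labels to be trusted.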

Using open-source BERT model: when not training model from scratch

In [None]:
nlp = spacy.load('en_trf_bertbaseuncased_lg') # spaCy 2.x / spacy-transformers 0.x model name; spaCy 3 pipelines use e.g. 'en_core_web_trf'
doc = nlp('Here is some text to encode.')

In [None]:
# rerun vector model code
