# Food, Agriculture, and Soils: California Legislation

This notebook explores state level legislation from the Legiscan API. It conducts text analysis of all available California Legislation that mentions food and agriculture from years 20XX to 2023, with special attention to mentions of land, soil, and environmental management.

LegiScan API documentation [here](https://legiscan.com/gaits/documentation/legiscan) and [here](https://api.legiscan.com/dl/).

In [1]:
# libraries
import pandas as pd
import geopandas as gpd
import numpy as np

import requests
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup

import re
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
import json

# set display
pd.options.display.max_columns = 150
pd.options.display.max_rows = 300

First, I will pull the bill ID, year/session, and full texts for all CA legisation that passed mentioning "food" and "agriculture". 

Then I will narrow my search to all bills mentioning "food", "agriculture", "environment", and "soil".

I will store relevant information into a dictionary that can be transformed into a pandas dataframe.

Pull API template: "https://api.legiscan.com/?key=APIKEY&op=OPERATION&PARAMS"
APIKey = '7e00040f1f7618af234e7415484d2494'

**OPERATIONS**
getBill, getBillText, getSearch, getSearchRaw

**PARAMETERS**
state, year, query (URL encoded), page

**IDENTIFIERS**
bill_id, doc_id

In [3]:
# request: CA, food and ag, results for current year
requeststring = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=2&query=food%20and%20agriculture"

In [15]:
# request: CA, food and ag, results for recent years
requeststring1 = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=3&query=food%20and%20agriculture"

In [5]:
# request: CA, food and ag, results for all available years
requeststring2 = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=1&query=food%20and%20agriculture"

In [16]:
# pull request: getting results as JSON object
r = requests.get(requeststring1)

In [17]:
# examining results
#print(r.text)

In [18]:
# turning JSON object into a dict
dict1 = json.loads(r.text)

In [19]:
# examining structure of dictionary and identifying relevant keys
print(dict1.keys())
print(dict1['searchresult'].keys())
print(dict1['searchresult']['summary'].keys())
print(dict1['searchresult']['results'])

dict_keys(['status', 'searchresult'])
dict_keys(['summary', 'results'])
dict_keys(['page', 'range', 'relevancy', 'count', 'page_current', 'page_total', 'query'])
[{'relevance': 100, 'bill_id': 1711942, 'change_hash': 'cd9d825f6d643ff19d5d789db7760174'}, {'relevance': 99, 'bill_id': 1693920, 'change_hash': '464be2cde28c223849b4efe7d1b2e894'}, {'relevance': 99, 'bill_id': 1702780, 'change_hash': '266b25df310d9ebd3e42bb97fdec3adf'}, {'relevance': 99, 'bill_id': 1712167, 'change_hash': '37ed87e95a3bc0627706e74408a2a63a'}, {'relevance': 99, 'bill_id': 1707639, 'change_hash': '6bddb0f6f615ed90a9bafc8ec369b181'}, {'relevance': 99, 'bill_id': 1671132, 'change_hash': '3f4199a2ffe1d27113d5607be727e6db'}, {'relevance': 99, 'bill_id': 1712103, 'change_hash': 'ce1242ff43bcfe48ae2d9a3325bd5959'}, {'relevance': 99, 'bill_id': 1707512, 'change_hash': '85d63281d189d8a6f47466e7268baa96'}, {'relevance': 99, 'bill_id': 1714501, 'change_hash': 'f094de3f6fb0046b0ab05b73f8ae9b46'}, {'relevance': 99, 'bill_id

In [19]:
# creating subset dict with only keys of interest
list1 = dict1['searchresult']['results']
list1[:5]

[{'relevance': 100,
  'bill_id': 1453447,
  'change_hash': 'c4cbd2bf405a5c9f9d8db0604c01ee91'},
 {'relevance': 99,
  'bill_id': 1693920,
  'change_hash': '2cc2c16e46855bab164790df5a2dc955'},
 {'relevance': 99,
  'bill_id': 1456707,
  'change_hash': '0f15fc913ea7989404f0e73f48c7b1de'},
 {'relevance': 99,
  'bill_id': 1388510,
  'change_hash': '75ed87f9287e2be9b2670e33d0fb03c5'},
 {'relevance': 99,
  'bill_id': 1711942,
  'change_hash': 'cd9d825f6d643ff19d5d789db7760174'}]

In [20]:
# generating list of bill_ids
# https://www.geeksforgeeks.org/python-get-values-of-particular-key-in-list-of-dictionaries/
bills = [value['bill_id'] for value in list1]
bills[:5]

[1453447, 1693920, 1456707, 1388510, 1711942]

In [21]:
# turning list of bill_ids for each relevant result from query into a df
df = pd.DataFrame(bills, columns = ['bill_id'])
df.head()

Unnamed: 0,bill_id
0,1453447
1,1693920
2,1456707
3,1388510
4,1711942


In [173]:
# sampling bill pull
bill = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id=1711942'
billa = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id=1592779'
b = requests.get(bill)
ba = requests.get(billa)
binfo = json.loads(b.text)
binfoa = json.loads(ba.text)
print(binfo.keys()) 
print(binfoa.keys()) 
#print(binfo)
print(binfoa)

dict_keys(['status', 'bill'])
dict_keys(['status', 'alert'])
{'status': 'ERROR', 'alert': {'message': 'Unknown bill id'}}


In [160]:
# exploring structure and extracting doc_ids from bill: BILL
print(binfo.keys())
print(binfo['bill'].keys())
print(binfo['bill']['texts'])

dict_keys(['status', 'bill'])
dict_keys(['bill_id', 'change_hash', 'session_id', 'session', 'url', 'state_link', 'completed', 'status', 'status_date', 'progress', 'state', 'state_id', 'bill_number', 'bill_type', 'bill_type_id', 'body', 'body_id', 'current_body', 'current_body_id', 'title', 'description', 'pending_committee_id', 'committee', 'referrals', 'history', 'sponsors', 'sasts', 'subjects', 'texts', 'votes', 'amendments', 'supplements', 'calendar'])
[{'doc_id': 2704856, 'date': '2023-02-16', 'type': 'Introduced', 'type_id': 1, 'mime': 'text/html', 'mime_id': 1, 'url': 'https://legiscan.com/CA/text/AB1197/id/2704856', 'state_link': 'https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB1197#99INT', 'text_size': 7488, 'text_hash': 'b048ef4ee2d51bcb0266641e75d77798', 'alt_bill_text': 0, 'alt_mime': '', 'alt_mime_id': 0, 'alt_state_link': '', 'alt_text_size': 0, 'alt_text_hash': ''}, {'doc_id': 2743399, 'date': '2023-03-13', 'type': 'Amended', 'type_id': 3,

In [175]:
# exploring structure and extracting doc_ids from bill: ALERT
print(binfoa.keys())
print(binfoa['alert'].keys())
print(binfoa['status'])
print(binfoa['alert']['message'])

dict_keys(['status', 'alert'])
dict_keys(['message'])
ERROR
Unknown bill id


In [161]:
# isolating list of dictionaries
doclist = binfo['bill']['texts']

In [132]:
# generating list of bill_ids
docs = [sub['doc_id'] for sub in doclist]
docs

[2698931, 2753361, 2766250, 2785841, 2796256, 2829887, 2833039]

In [133]:
# indexing last element (most recently amended version of bill text)
docs[-1]

2833039

In [184]:
# sampling bill text pull
btext1 = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBillText&id=2833039'
b = requests.get(btext1)
btext = json.loads(b.text)
print(btext.keys()) 
print(btext)

dict_keys(['status', 'text'])
{'status': 'OK', 'text': {'doc_id': 2833039, 'bill_id': 1707639, 'date': '2023-07-06', 'type': 'Amended', 'type_id': 3, 'mime': 'text/html', 'mime_id': 1, 'url': 'https://legiscan.com/CA/text/SB485/id/2833039', 'state_link': 'https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240SB485#93AMD', 'text_size': 14009, 'text_hash': 'd91e4b644ac770f54ffa4d1f537a2af0', 'doc': 'PGRpdiBpZD0iYmlsbF9hbGwiIGFsaWduPSJqdXN0aWZ5Ij48ZGl2IGlkPSJhYm91dCI+PGJyIGNsZWFyPSJhbGwiLz48ZGl2IGFsaWduPSJjZW50ZXIiPjx0YWJsZSBzdHlsZT0iYm9yZGVyLWNvbGxhcHNlOiBjb2xsYXBzZSIgY2VsbFNwYWNpbmc9IjAiIGFsaWduPSJjZW50ZXIiIHJvbGU9InByZXNlbnRhdGlvbiI+PHRib2R5Pjx0cj48dGQgYWxpZ249ImNlbnRlciI+PHNwYW4gc3R5bGU9IiB0ZXh0LXRyYW5zZm9ybTogdXBwZXJjYXNlOyBmb250LXNpemU6IDFlbSI+CiAgICAgICAgICAgICAgICBBbWVuZGVkCiAgICAgICAgICAgICAgwqBJTsKgPC9zcGFuPjxzcGFuIHN0eWxlPSIgdGV4dC10cmFuc2Zvcm06IHVwcGVyY2FzZTsgZm9udC1zaXplOiAxZW0iPgogICAgICAgICAgICAgICAgQXNzZW1ibHkKICAgICAgICAgICAgICA8L3NwYW4+wqA8c3Bhbi

In [185]:
# exploring structure
print(btext.keys())
print(btext['text'].keys())
print(btext['text']['url'])
print(btext['text']['state_link']) # this URL less likely to turn up errors

dict_keys(['status', 'text'])
dict_keys(['doc_id', 'bill_id', 'date', 'type', 'type_id', 'mime', 'mime_id', 'url', 'state_link', 'text_size', 'text_hash', 'doc', 'alt_bill_text', 'alt_mime', 'alt_mime_id', 'alt_state_link', 'alt_text_size', 'alt_text_hash', 'alt_doc'])
https://legiscan.com/CA/text/SB485/id/2833039
https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240SB485#93AMD


In [186]:
# extracting status info from leginfo
#statusURL = btext['text']['state_link'].replace('Text', 'Status')
statusURL = btext['text']['url'].split("/id")[0].replace('text', 'bill')
statusURL

'https://legiscan.com/CA/bill/SB485'

In [187]:
# extracting text from bill text url
URL = btext['text']['state_link']
URL1 = statusURL
#URL2 = "https://leginfo.legislature.ca.gov/faces/billStatusClient.xhtml?bill_id=202320240AB778" #test
r = urlopen(URL)
html_bytes = r.read()
html = html_bytes.decode("utf-8")
soup = BeautifulSoup(html, features='html.parser')

In [188]:
# inspecting structure to locate text bloc
print(soup.prettify())

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head id="j_idt5">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <!--&lt;meta http-equiv="refresh" content=";url=faces/home.xhtml"/&gt;-->
  <link href="https://leginfo.legislature.ca.gov/resources/css/leginfo_master.css" media="screen   ,print" rel="stylesheet" type="text/css"/>
  <!--&lt;link rel="stylesheet" type="text/css" href="./resources/css/leginfo_mobile.css" media="screen and (-webkit-min-device-pixel-ratio: 1.1) and (max-device-width: 990px)" /&gt;
    &lt;link rel="stylesheet" type="text/css" href="./resources/css/leginfo_mobile.css" media="screen and (min-device-width: 481px)and (max-device-width: 640px)and (-webkit-max-device-pixel-ratio: 1)" /&gt;-->
  <link href="https://leginfo.legislature.ca.gov/resources/css/leginfo_mobile.css" media="screen and (orient

In [45]:
# main page: locating title in html
t = soup.find_all('title')
print(len(t))
print(t[0])

1
<title>Bill Status - SB-485 Elections: election worker protections.</title>


In [100]:
# main page: locating bill year introduced in html
y = soup.find_all(id = 'bill_intro_date')
print(len(y))
print(y[0])

0


IndexError: list index out of range

In [140]:
# main page: locating bill year amended in html
y1 = soup.find_all('tbody')
print(len(y1))
print(type(y1))
print(y1[0])

3
<class 'bs4.element.ResultSet'>
<tbody><tr><td align="center"><span style=" text-transform: uppercase; font-size: 1em">
                Amended
               IN </span><span style=" text-transform: uppercase; font-size: 1em">
                Assembly
              </span> <span style="text-transform: uppercase">June 26, 2023</span><br/></td></tr></tbody>


In [141]:
# main page: locating bill text bloc in html
ti = soup.find_all('div', id = 'bill')
print(len(ti))
print(type(ti))
print(ti)

1
<class 'bs4.element.ResultSet'>
[<div id="bill"><div style=" text-transform: uppercase"><h2 style=" text-align: left;">The people of the State of California do enact as follows:</h2><br/></div><div id="s10.8470851203510501"><div class="ActionLine" style="margin:0 0 1em 0"><h3 style="font-weight:bold; display:inline;"><font class="blue_text" color="blue"><i>SECTION 1.</i></font></h3> <font class="blue_text" color="blue"><i>Section 1295 of the Code of Civil Procedure is amended to read:</i></font></div><div><div id="id_784DF561-ED41-4C7B-8CC5-C5414171AEBD"><div><p><div style="margin:0 0 1em 0;"><h6 style="display:inline;">1295.</h6> (a) Any contract for medical services which contains a provision for arbitration of any dispute as to professional negligence of a health care provider shall have such provision as the first article of the contract and shall be expressed in the following language: “It is understood that any dispute as to medical malpractice, that is as to whether any medica

In [153]:
# status page: locating bill status
#status = soup.find_all('tr')
status = soup.find_all('div', id = 'bill-last-action')
#status = soup.find_all(class_= 'tab ls-tab-track')
(lambda tag: tag.name == "strong" and text in tag.text)
print(len(status))
print(type(status))
print(status[0])
#print(status.pop())

1
<class 'bs4.element.ResultSet'>
<div id="bill-last-action" style="margin: 0 1em">Spectrum: Partisan Bill (Democrat 1-0)<br/>Status: Engrossed on May 30 2023 - 50% progression<br/>Action: 2023-07-13 - From committee: Do pass as amended and re-refer to Com. on APPR. (Ayes 6. Noes 1.) (July 11).<br/>Pending: <a href="/CA/pending/assembly-appropriations-committee/id/449" title="View other bills referred to CA Assembly Appropriations Committee">Assembly Appropriations Committee</a><br/>Text: <a href="/CA/text/SB485/2023" title="View latest bill text for California SB485">Latest bill text (Amended) [HTML]</a></div>


In [57]:
# remove html style and tags from title
for text in t[0]:
    title = text.text.strip()
    title = title.split("- ")[-1]
title

'SB-118 Budget Act of 2023: health.'

In [58]:
# remove html style and tags from year int
for text in y:
    yeari = text.text.strip()
    yeari = yeari[-4:]
yeari

'2023'

In [59]:
# remove html style and tags from year amended
for text in y1[0]:
    yeara = text.text.strip()
    yeara = yeara[-4:]
yeara

'2023'

In [236]:
# remove html style and tags from bill text
for text in ti:
    
    billtxt = text.text.strip()#.encode("ascii", "ignore")
    #billtxt = str(billtxt)
    
    # remove non ASCII chars
    # https://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space
    billtxt = re.sub(r'[^\x00-\x7F]+', ' ', billtxt)#.replace('\n\t\t\t\t\t\t', ' ')
    
    # remove tabs and new lines
    billtxt = re.sub(r'(\s)', ' ', billtxt)

    # add space after colons and semicolons
    billtxt = re.sub(r'(\:|\;)+', r'\1 ', billtxt)
    
    # add space after periods only if preceded by words
    billtxt = re.sub(r'([A-z]{4,}\.)+', r'\1 ', billtxt)
    
    # remove any extra white spaces
    billtxt = re.sub(r'\s+', ' ', billtxt)
    
    # testing search
    #comp = re.compile(r'([A-z]{4,}\.)+')
   # find = re.search(comp, billtxt)

#find
print(type(billtxt))
billtxt

<class 'str'>


'The people of the State of California do enact as follows: SECTION 1. Section 1295 of the Code of Civil Procedure is amended to read: 1295. (a) Any contract for medical services which contains a provision for arbitration of any dispute as to professional negligence of a health care provider shall have such provision as the first article of the contract and shall be expressed in the following language: It is understood that any dispute as to medical malpractice, that is as to whether any medical services rendered under this contract were unnecessary or unauthorized or were improperly, negligently or incompetently rendered, will be determined by submission to arbitration as provided by California law, and not by a lawsuit or resort to court process except as California law provides for judicial review of arbitration proceedings. Both parties to this contract, by entering into it, are giving up their constitutional right to have any such dispute decided in a court of law before a jury, a

In [87]:
# remove html style and tags from bill status
for text in status[-2]:
    status = text.text.strip()
status

''

In [68]:
# stripping lingering unicode characters
billtxt1 = billtxt.astype('str')
billtxt1

AttributeError: 'str' object has no attribute 'astype'

In [122]:
# creating bill_id
bill_id = title.split("- ")[-1][:7]

# concatenating into one list
infotxt1 = [bill_id, title, yeari, status, billtxt]
infotxt1

['SB-118 ',
 'SB-118 Budget Act of 2023: health.',
 '2023',
 'Referred to Coms. on  N.R. & W. and  AGRI.',
 'The people of the State of California do enact as follows:SECTION 1.\xa0Section 1295 of the Code of Civil Procedure is amended to read:1295.\xa0(a)\xa0Any contract for medical services which contains a provision for arbitration of any dispute as to professional negligence of a health care provider shall have such provision as the first article of the contract and shall be expressed in the following language: “It is understood that any dispute as to medical malpractice, that is as to whether any medical services rendered under this contract were unnecessary or unauthorized or were improperly, negligently or incompetently rendered, will be determined by submission to arbitration as provided by California law, and not by a lawsuit or resort to court process except as California law provides for judicial review of arbitration proceedings. Both parties to this contract, by entering i

In [102]:
# remove html style and tags from all bill info (B)

# list of all bill elements
billinfo = [t[0], y, y1[0], ti]

# remove html style and tags from bill text
infotxt = []

for info in billinfo:
    for text in info:
        row = text.text.strip()
        infotxt.append(row)
        
    # create bill id from bill title
    bill_id = infotxt[0].split("- ")[-1][:7]
    
# add an item to the list that will serve as the bill id later
infotxt.insert(0, bill_id)
        
print(len(infotxt))
infotxt

5


['AB-1197',
 'Bill Text  - AB-1197 Agricultural Protection Planning Grant Program: local food producers.',
 'February\xa016,\xa02023',
 'Amended\n              \xa0IN\xa0\n                Senate\n              \xa0June\xa026,\xa02023',
 'The people of the State of California do enact as follows:SECTION 1.\xa0Section 10280 of the Public Resources Code is amended to read:10280.\xa0The Agricultural Protection Planning Grant Program is hereby established within the Department of Conservation, to provide planning grants to do all of the following:(a)\xa0Conserve California’s most productive farmlands and ecologically important rangelands.(b)\xa0Advance California’s climate change goals through carbon sequestration and greenhouse gas emissions reductions resulting from the implementation of local plans.(c)\xa0Maintain local food supplies, local food producers, and agricultural economies through the protection of agricultural lands.SEC. 2.\xa0Section 10280.5 of the Public Resources Code is am

In [104]:
# storing bill info in df    
itxtdf = pd.DataFrame(infotxt1).transpose()

# rename columns
#itxtdf.rename(columns = {'0':'title', '1':'y_intro', '2': 'y_recent', '3':'text'}, inplace = True)
itxtdf.columns = ['bill_id', 'title', 'y_intro', 'y_recent', 'text']

# optional: create column to serve as new index ID?
#itxtdf.set_index('bill_id', inplace = True)

itxtdf

Unnamed: 0,bill_id,title,y_intro,y_recent,text
0,AB-1197,AB-1197 Agricultural Protection Planning Grant...,2023,2023,The people of the State of California do enact...


In [68]:
# further processing bill text for NLP

# list of stopwords to exclude
swords = stopwords.words('english')

# stripping nonessential characters
textonly = re.sub(r"[^A-z\s]", "", billtxt)

# turning cleaned bill text into list of words
wordlist = [word for word in word_tokenize(textonly.lower()) 
                 if word not in swords]
print(len(wordlist))
wordlist

641


['people',
 'state',
 'california',
 'enact',
 'followssection',
 'section',
 'public',
 'resources',
 'code',
 'amended',
 'read',
 'agricultural',
 'protection',
 'planning',
 'grant',
 'program',
 'hereby',
 'established',
 'within',
 'department',
 'conservation',
 'provide',
 'planning',
 'grants',
 'followinga',
 'conserve',
 'californias',
 'productive',
 'farmlands',
 'ecologically',
 'important',
 'rangelandsb',
 'advance',
 'californias',
 'climate',
 'change',
 'goals',
 'carbon',
 'sequestration',
 'greenhouse',
 'gas',
 'emissions',
 'reductions',
 'resulting',
 'implementation',
 'local',
 'plansc',
 'maintain',
 'local',
 'food',
 'supplies',
 'local',
 'food',
 'producers',
 'agricultural',
 'economies',
 'protection',
 'agricultural',
 'landssec',
 'section',
 'public',
 'resources',
 'code',
 'amended',
 'read',
 'following',
 'terms',
 'following',
 'meanings',
 'used',
 'division',
 'unless',
 'context',
 'clearly',
 'requires',
 'otherwisea',
 'authority',
 'means'

## Putting All the Pieces Together

This code aims to:

1. Extract the individual *bill_id* for each bill in the search results for a query of "food and agriculture" bills in California for the most recent legislative session;
2. extract the *doc_id* for the most recently amended version of each bill;
3. extract (A) the bill title and (B) the bill text for each document;
4. and finally, clean the bill text (prepare into either (A) a relatively clean string text block or (B) a cleaned wordlist).

In [154]:
## SCRAPE SEARCH RESULTS for bill_ids

# scrape LegiScan search results for CA bills RE: food and ag for current year
requeststring = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=2&query=food%20and%20agriculture"
# request: CA, food and ag, results for recent years
requeststring1 = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=3&query=food%20and%20agriculture"
# request: CA, food and ag, results for all available years
requeststring2 = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=1&query=food%20and%20agriculture"

# pull request: ge\tting results as JSON object
r = requests.get(requeststring1)
# turning JSON object into a dict
dict1 = json.loads(r.text)

# creating subset dict with only keys of interest
list1 = dict1['searchresult']['results']
# generating list of bill_ids
bills = [value['bill_id'] for value in list1]

# each bill in bills is stored as an integer; convert to list of str
bills1 = [str(bill) for bill in bills]

#inspect/show results
#bills1[:5]
print(len(bills1))

565


In [166]:
# TEST
docs = []
docs1 = {}

for bill in bills1:
    
    # extract bill info
    bill_id = bill
    requestbill = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id="+bill_id
    b = requests.get(requestbill)
    binfo = json.loads(b.text)
    
    # FIND doc_id for most recent bill version
    docs1[bill] = binfo.keys()
    
docs1

{'1453447': dict_keys(['status', 'bill']),
 '1693920': dict_keys(['status', 'bill']),
 '1456707': dict_keys(['status', 'bill']),
 '1388510': dict_keys(['status', 'bill']),
 '1711942': dict_keys(['status', 'bill']),
 '1454994': dict_keys(['status', 'bill']),
 '1594832': dict_keys(['status', 'bill']),
 '1614129': dict_keys(['status', 'bill']),
 '1592497': dict_keys(['status', 'bill']),
 '1453388': dict_keys(['status', 'bill']),
 '1593665': dict_keys(['status', 'bill']),
 '1702780': dict_keys(['status', 'bill']),
 '1458834': dict_keys(['status', 'bill']),
 '1433188': dict_keys(['status', 'bill']),
 '1451034': dict_keys(['status', 'bill']),
 '1578060': dict_keys(['status', 'bill']),
 '1438442': dict_keys(['status', 'bill']),
 '1588122': dict_keys(['status', 'bill']),
 '1417874': dict_keys(['status', 'bill']),
 '1398281': dict_keys(['status', 'bill']),
 '1712167': dict_keys(['status', 'bill']),
 '1581321': dict_keys(['status', 'bill']),
 '1707639': dict_keys(['status', 'bill']),
 '1592396':

In [164]:
# TEST
docs = []
docs1 = {}

for bill in bills1:
    
    # extract bill info
    bill_id = bill
    requestbill = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id="+bill_id
    b = requests.get(requestbill)
    binfo = json.loads(b.text)
    
    # FIND doc_id for most recent bill version: https://www.geeksforgeeks.org/python-check-whether-given-key-already-exists-in-a-dictionary/#
    if 'bill' in binfo.keys():
        doclist = binfo['bill']['texts']
        docs = [value['doc_id'] for value in doclist]
        docs1 = [str(doc) for doc in docs] # converting int to str
    if 'alert' in binfo.keys():
        doc_id = None
        
docs1

{'1453447': dict_keys(['status', 'bill']),
 '1693920': dict_keys(['status', 'bill']),
 '1456707': dict_keys(['status', 'bill']),
 '1388510': dict_keys(['status', 'bill']),
 '1711942': dict_keys(['status', 'bill']),
 '1454994': dict_keys(['status', 'bill']),
 '1594832': dict_keys(['status', 'bill']),
 '1614129': dict_keys(['status', 'bill']),
 '1592497': dict_keys(['status', 'bill']),
 '1453388': dict_keys(['status', 'bill']),
 '1593665': dict_keys(['status', 'bill']),
 '1702780': dict_keys(['status', 'bill']),
 '1458834': dict_keys(['status', 'bill']),
 '1433188': dict_keys(['status', 'bill']),
 '1451034': dict_keys(['status', 'bill']),
 '1578060': dict_keys(['status', 'bill']),
 '1438442': dict_keys(['status', 'bill']),
 '1588122': dict_keys(['status', 'bill']),
 '1417874': dict_keys(['status', 'bill']),
 '1398281': dict_keys(['status', 'bill']),
 '1712167': dict_keys(['status', 'bill']),
 '1581321': dict_keys(['status', 'bill']),
 '1707639': dict_keys(['status', 'bill']),
 '1592396':

In [168]:
# TEST
docs = []
docs1 = {}

for bill in bills1:
    
    # extract bill info
    bill_id = bill
    requestbill = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id="+bill_id
    b = requests.get(requestbill)
    binfo = json.loads(b.text)
    
    # FIND doc_id for most recent bill version: https://www.geeksforgeeks.org/python-check-whether-given-key-already-exists-in-a-dictionary/#
    if 'alert' in binfo.keys():
        docs1[bill] = binfo['alert'].keys()
        
docs1

{'1592779': dict_keys(['message']),
 '1592715': dict_keys(['message']),
 '1592770': dict_keys(['message']),
 '1592712': dict_keys(['message'])}

In [178]:
## EXTRACT doc_id for most recent version of bill, select bill info, save as df

# empty dict to store cleaned bill info
infotxt = {}
# empty list to store list of bill texts
texts = []

# loop
for bill in bills1:
    
    # extract bill info
    bill_id = bill
    requestbill = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id="+bill_id
    b = requests.get(requestbill)
    binfo = json.loads(b.text)
    
    # FIND doc_id for most recent bill version:
    if 'bill' in binfo.keys():
        doclist = binfo['bill']['texts']
        docs = [value['doc_id'] for value in doclist]
        docs1 = [str(doc) for doc in docs] # converting int to str
        # identify most recently amended version of doc text
        doc_id = docs1[-1] 

        # get bill text for doc text
        requestdoc = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBillText&id='+doc_id
        d = requests.get(requestdoc)
        btext = json.loads(d.text)

        # get URL for doc
        URL = btext['text']['state_link']

        # get status page URL
        statusURL = btext['text']['state_link'].replace('Text', 'Status') 
        #statusURL = btext['text']['url'].split("/id")[0].replace('text', 'bill')
        URL1 = statusURL

        ## EXTRACT select info from URLs

        # main page
        r = urlopen(URL)
        html_bytes = r.read()
        html = html_bytes.decode("utf-8")
        soup = BeautifulSoup(html, features='html.parser')

        # status page
        r1 = urlopen(URL1)
        html_bytes = r1.read()
        html1 = html_bytes.decode("utf-8")
        soup1 = BeautifulSoup(html1, features='html.parser')

        # bill title
        t = soup.find_all('title')
        ti = t[0]

        # bill year: introduced
        yi = soup.find_all(id = 'bill_intro_date') 

        # bill year: amended
        #ya = soup.find_all('tbody')
        #yam = ya[0]

        # bill text
        txt = soup.find_all('div', id = 'bill')

        # bill status
        status = soup.find_all('td')

        # creating bill_id
        #bill_id = title.split("- ")[-1][:7]

        ## REMOVE html style and isolate key info

        # title
        for text in ti:
            title = text.text.strip()
            title = title.split("- ")[-1]

        # year recent activity
        #for text in yam:
         #   recent = text.text.strip()

        # bill text: remove html formatting and clean
        for text in txt:
            billtxt = text.text.strip()
            # remove non ASCII chars; adapted from: https://stackoverflow.com/questions/20078816/replace-non-ascii-characters-with-a-single-space
            billtxt = re.sub(r'[^\x00-\x7F]+', ' ', billtxt)
            # remove tabs and new lines
            billtxt = re.sub(r'(\s)', ' ', billtxt)
            # add space after colons and semicolons
            billtxt = re.sub(r'(\:|\;)+', r'\1 ', billtxt)
            # add space after periods only if preceded by words >=4 chars
            billtxt = re.sub(r'([A-z]{4,}\.)+', r'\1 ', billtxt)
            # remove any extra white spaces
            billtxt = re.sub(r'\s+', ' ', billtxt)

       # year intro
        for text in yi:
            if text is None:
                year = None
            else:
                year = text.text.strip()
                year = year[-4:]

       # bill status
        if len(status) < 1:
            status = None
        else:
            for text in status[-1]:
                status = text.text.strip()
    
    # PLACEHOLDER code until bills can be identified
    if 'alert' in binfo.keys():
        title = None
        year = None
        status = None
        billtxt = None
            
    # concatenate into one list
    infotxt = [title, year, status, billtxt]

    # add bill_id to list
    infotxt.insert(0, bill_id)
    
     ## CONVERT lists to df  
    itxtdf = pd.DataFrame(infotxt)
    texts.append(itxtdf)
    fulltext = pd.concat(texts, axis = 1).transpose()

# rename columns
fulltext.columns = ['bill_id', 'title', 'year', 'status', 'text']

# inspect/show
fulltext.head()

Unnamed: 0,bill_id,title,year,status,text
0,1453447,AB-778 Institutional purchasers: purchase of C...,,,The people of the State of California do enact...
0,1693920,"AB-408 Climate-resilient Farms, Sustainable He...",2023.0,"February 02, 2023",The people of the State of California do enact...
0,1456707,AB-1009 Farm to Community Food Hub Program.,2023.0,,The people of the State of California do enact...
0,1388510,"AB-125 Equitable Economic Recovery, Healthy Fo...",2020.0,"December 18, 2020",The people of the State of California do enact...
0,1711942,AB-1197 Agricultural Protection Planning Grant...,2023.0,"February 16, 2023",The people of the State of California do enact...


In [181]:
# create bill_id column after df created
#fulltext['bill_id'] = infotxt[0].split("- ")[-1][:7]
#fulltext['bill_id'] = fulltext['title']#.str.split("- ")[-1][:7]
#fulltext.head()

In [9]:
fulltext

Unnamed: 0,bill_id,title,year,status,text
0,AB-1197,AB-1197 Agricultural Protection Planning Grant...,2023,Referred to Coms. on N.R. & W. and AGRI.,The people of the State of California do enact...
0,AB-408,"AB-408 Climate-resilient Farms, Sustainable He...",2023,Referred to Coms. on AGRI and GOV. & F.,The people of the State of California do enact...
0,AB-660,"AB-660 Food labeling: quality dates, safety da...",2023,In Senate. Read first time. To Com. on RLS. ...,The people of the State of California do enact...
0,SB-688,SB-688 Agrivoltaic systems: grant funding.,2023,In Assembly. Read first time. Held at Desk.,The people of the State of California do enact...
0,SB-485,SB-485 Elections: election worker protections.,2023,12,The people of the State of California do enact...
0,SB-224,SB-224 Agricultural land: foreign ownership an...,2023,From committee: Do pass and re-refer to Com. o...,The people of the State of California do enact...
0,SB-624,SB-624 Horse racing: state-designated fairs: a...,2023,From committee with author's amendments. Read ...,The people of the State of California do enact...
0,AB-865,AB-865 Sale of agricultural products: requirem...,2023,"From committee chair, with author's amendments...",The people of the State of California do enact...
0,AB-1583,AB-1583 California Seed Law: subventions: suns...,2023,Read third time. Passed. Ordered to the Assemb...,The people of the State of California do enact...
0,AB-1763,AB-1763 Food and agriculture: industry-funded ...,2023,In Senate. Read first time. To Com. on RLS. ...,The people of the State of California do enact...


In [10]:
## TEST: SCRAPE SEARCH RESULTS for bill_ids: RECENT

# scrape LegiScan search results for CA bills RE: food and ag for current year
requeststring = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=2&query=food%20and%20agriculture"
# request: CA, food and ag, results for recent years
requeststring1 = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=3&query=food%20and%20agriculture"
# request: CA, food and ag, results for all available years
requeststring2 = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=1&query=food%20and%20agriculture"

# pull request: ge\tting results as JSON object
r = requests.get(requeststring1)
# turning JSON object into a dict
dict1 = json.loads(r.text)

# creating subset dict with only keys of interest
list1 = dict1['searchresult']['results']
# generating list of bill_ids
bills = [value['bill_id'] for value in list1]

# each bill in bills is stored as an integer; convert to list of str
bills1 = [str(bill) for bill in bills]

#inspect/show results
print(len(bills1))
print(bills1[:5])

565
['1453447', '1693920', '1456707', '1388510', '1711942']


In [120]:
# TEST CELL 
bills2 = bills1[:5]
bills2

['1453447', '1693920', '1456707', '1388510', '1711942']

In [122]:
## EXTRACT doc_id for most recent version of bill, select bill info, save as df

# empty dict to store cleaned bill info
infotxt = {}
# empty list to store list of bill texts
texts = []

# loop
for bill in bills2:
    
    # extract bill info
    bill_id = bill
    requestbill = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id="+bill_id
    b = requests.get(requestbill)
    binfo = json.loads(b.text)
    
    # FIND doc_id for most recent bill version
    doclist = binfo['bill']['texts']
    docs = [value['doc_id'] for value in doclist]
    docs1 = [str(doc) for doc in docs] # converting int to str
    
    # identify most recently amended version of text
    doc_id = docs1[-1] 
    
    # get bill text for doc
    requestdoc = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBillText&id='+doc_id
    d = requests.get(requestdoc)
    btext = json.loads(d.text)
    
    # get URL for amended doc
    URL = btext['text']['state_link']
    
    # get status page URL
    statusURL = btext['text']['state_link'].replace('Text', 'Status') 
    #statusURL = btext['text']['url'].split("/id")[0].replace('text', 'bill')
    URL1 = statusURL

    ## EXTRACT select info from URL
    
    # main page
    r = urlopen(URL)
    html_bytes = r.read()
    html = html_bytes.decode("utf-8")
    soup = BeautifulSoup(html, features='html.parser')

    # status page
    r1 = urlopen(URL1)
    html_bytes = r1.read()
    html1 = html_bytes.decode("utf-8")
    soup1 = BeautifulSoup(html1, features='html.parser')
    
    # bill title
    t = soup.find_all('title')
    ti = t[0]

    # bill year: introduced
    yi = soup.find_all(id = 'bill_intro_date') # will have to be cleaned

    # bill year: amended
    #ya = soup.find_all('tbody')
    #yam = ya[0]
    
    # bill text
    txt = soup.find_all('div', id = 'bill')

    # bill status
    status = soup.find_all('td')
    
    # creating bill_id
    bill_id = title.split("- ")[-1][:7]
    
    ## REMOVE html style and tags from bill text
    
    # title
    for text in ti:
        title = text.text.strip()
        title = title.split("- ")[-1]
    
    # year intro
    for text in yi:
        if text is None:
            year = None
        else:
            year = text.text.strip()
            year = year[-4:]

    # year recent activity
    #for text in yam:
     #   recent = text.text.strip()
                  
    # bill text
    for text in txt:
        billtxt = text.text.strip()
    
    # bill status
    if len(status) < 1:
        status = None
    else:
        for text in status[-2]:
            status = text.text.strip()
            
    # concatenating into one list
    infotxt = [title, year, status, billtxt]

    # creating bill_id
    bill_id = infotxt[0].split("- ")[-1][:7]
    
    # adding bill_id to list
    infotxt.insert(0, bill_id)
    
     ## CONVERT to df  
    itxtdf = pd.DataFrame(infotxt)
    texts.append(itxtdf)
    fulltext = pd.concat(texts, axis = 1).transpose()

# rename columns
fulltext.columns = ['bill_id', 'title', 'year', 'status', 'text']

# inspect/show
fulltext.head()

Unnamed: 0,bill_id,title,year,status,text
0,AB-778,AB-778 Institutional purchasers: purchase of C...,2023,,The people of the State of California do enact...
0,AB-408,"AB-408 Climate-resilient Farms, Sustainable He...",2023,Introduced by Assembly Members Wilson and Conn...,The people of the State of California do enact...
0,AB-1009,AB-1009 Farm to Community Food Hub Program.,2023,,The people of the State of California do enact...
0,AB-125,"AB-125 Equitable Economic Recovery, Healthy Fo...",2020,Introduced by Assembly Member Robert Rivas(Coa...,The people of the State of California do enact...
0,AB-1197,AB-1197 Agricultural Protection Planning Grant...,2023,Introduced by Assembly Member Hart,The people of the State of California do enact...


In [234]:
# SAVE: pieces

# FIND doc_id for most recent bill version
doclist = binfo['bill']['texts']
docs = [value['doc_id'] for value in doclist]
docs1 = [str(doc) for doc in docs] # converting int to str

# call the most recently amended version of bill text
doc_id = docs1[-1] 

# get bill text for doc
btext1 = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBillText&id='+doc_id
b = requests.get(btext1)
btext = json.loads(b.text)

# get URL for amended doc
URL = btext['text']['state_link']

## EXTRACTION: relevant info from URL
r = urlopen(URL)
html_bytes = r.read()
html = html_bytes.decode("utf-8")
soup = BeautifulSoup(html, features='html.parser')

# bill title
t = soup.find_all('title')
ti = t[0]

# bill year: introduced
yi = soup.find_all(id = 'bill_intro_date') # will have to be cleaned

# bill year: amended
ya = soup.find_all('tbody')
yam = (ya[0])

# bill text
txt = soup.find_all('div', id = 'bill')

## REMOVE: html style and tags from bill text

# title
for text in ti:
    title = text.text.strip()
    title = title.split("- ")[-1]

# year intro
for text in yi:
    yeari = text.text.strip()
    yeari = yeari[-4:]

# year recent activity
for text in yam:
    yeara = text.text.strip()
    yeara = yeara[-4:]

# bill text
for text in txt:
    billtxt = text.text.strip()
    
# creating bill_id
bill_id = title.split("- ")[-1][:7]

# concatenating into one list
infotxt = [bill_id, title, yeari, yeara, billtxt]
infotxt

## CONVERT to df

itxtdf = pd.DataFrame(infotxt).transpose()

# rename columns
itxtdf.columns = ['bill_id', 'title', 'y_intro', 'y_recent', 'text']
itxtdf

Unnamed: 0,status,bill
text,OK,"{'bill_id': 1711942, 'change_hash': 'cd9d825f6..."
text,OK,"{'bill_id': 1693920, 'change_hash': '464be2cde..."
text,OK,"{'bill_id': 1702780, 'change_hash': '266b25df3..."
text,OK,"{'bill_id': 1712167, 'change_hash': '37ed87e95..."
text,OK,"{'bill_id': 1707639, 'change_hash': '6bddb0f6f..."
text,OK,"{'bill_id': 1671132, 'change_hash': '3f4199a2f..."
text,OK,"{'bill_id': 1712103, 'change_hash': 'ce1242ff4..."
text,OK,"{'bill_id': 1707512, 'change_hash': '85d63281d..."
text,OK,"{'bill_id': 1714501, 'change_hash': 'f094de3f6..."
text,OK,"{'bill_id': 1738922, 'change_hash': '95e252ff3..."
