# Food, Agriculture, and Soils: California Legislation

This notebook explores state-level legislation retrieved from the LegiScan API. It analyzes the text of all available California legislation that mentions food and agriculture from 20XX to 2023, with special attention to mentions of land, soil, and environmental management.

LegiScan API documentation [here](https://legiscan.com/gaits/documentation/legiscan) and [here](https://api.legiscan.com/dl/).

In [1]:
# libraries
import pandas as pd
import geopandas as gpd
import numpy as np

import requests
import urllib
from urllib.request import urlopen
from bs4 import BeautifulSoup

import re
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
import json

# set display
pd.options.display.max_columns = 150
pd.options.display.max_rows = 300

First, I will pull the bill ID, year/session, and full text for all CA legislation that passed mentioning "food" and "agriculture".

Then I will narrow my search to all bills mentioning "food", "agriculture", "environment", and "soil".

I will store relevant information into a dictionary that can be transformed into a pandas dataframe.

Pull API template: "https://api.legiscan.com/?key=APIKEY&op=OPERATION&PARAMS"
APIKey = '7e00040f1f7618af234e7415484d2494'

**OPERATIONS**
getBill, getBillText, getSearch, getSearchRaw

**PARAMETERS**
state, year, query (URL encoded), page

**IDENTIFIERS**
bill_id, doc_id
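
The template and parameter list above can be wrapped in a small helper so that the query string is encoded the same way every time. This is a minimal sketch: `build_url` and the `LEGISCAN_API_KEY` environment variable are hypothetical names, not part of the LegiScan API; only the `key`, `op`, `state`, `year`, `query`, and `id` parameters come from the requests used in this notebook.

```python
import os
from urllib.parse import urlencode, quote

BASE_URL = "https://api.legiscan.com/"

def build_url(op, api_key=None, **params):
    """Build a LegiScan request URL from an operation name and its parameters."""
    # Assumption: the key is read from a (hypothetical) environment variable
    # when it is not passed in explicitly.
    if api_key is None:
        api_key = os.environ.get("LEGISCAN_API_KEY", "APIKEY")
    query = {"key": api_key, "op": op, **params}
    # quote_via=quote encodes spaces as %20, matching the request strings used here
    return BASE_URL + "?" + urlencode(query, quote_via=quote)

search_url = build_url("getSearchRaw", api_key="APIKEY", state="CA", year=2,
                       query="food and agriculture")
print(search_url)
```

Reading the key from the environment rather than typing it into each request string also keeps it out of the saved notebook.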

In [2]:
# request: CA, food and ag, results for current year
requeststring = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=2&query=food%20and%20agriculture"

In [3]:
# pull request: getting results as JSON object
r = requests.get(requeststring)

In [4]:
# examining results
#print(r.text)

In [5]:
# turning JSON object into a dict
dict1 = json.loads(r.text)

In [6]:
# examining structure of dictionary and identifying relevant keys
print(dict1.keys())
print(dict1['searchresult'].keys())
print(dict1['searchresult']['summary'].keys())
print(dict1['searchresult']['results'])

dict_keys(['status', 'searchresult'])
dict_keys(['summary', 'results'])
dict_keys(['page', 'range', 'relevancy', 'count', 'page_current', 'page_total', 'query'])
[{'relevance': 100, 'bill_id': 1711942, 'change_hash': 'cd9d825f6d643ff19d5d789db7760174'}, {'relevance': 99, 'bill_id': 1693920, 'change_hash': '464be2cde28c223849b4efe7d1b2e894'}, {'relevance': 99, 'bill_id': 1702780, 'change_hash': '266b25df310d9ebd3e42bb97fdec3adf'}, {'relevance': 99, 'bill_id': 1712167, 'change_hash': '37ed87e95a3bc0627706e74408a2a63a'}, {'relevance': 99, 'bill_id': 1707639, 'change_hash': '6bddb0f6f615ed90a9bafc8ec369b181'}, {'relevance': 99, 'bill_id': 1671132, 'change_hash': '3f4199a2ffe1d27113d5607be727e6db'}, {'relevance': 99, 'bill_id': 1712103, 'change_hash': 'ce1242ff43bcfe48ae2d9a3325bd5959'}, {'relevance': 99, 'bill_id': 1707512, 'change_hash': '85d63281d189d8a6f47466e7268baa96'}, {'relevance': 99, 'bill_id': 1714501, 'change_hash': 'f094de3f6fb0046b0ab05b73f8ae9b46'}, {'relevance': 99, 'bill_id
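
The `summary` keys printed above include `page_current` and `page_total`, which suggests that a search can span more than one page, while the request above reads only the first. The sketch below loops over pages until `page_total` is reached. `collect_bill_ids` and `fetch_page` are hypothetical names; `fetch_page(page)` is assumed to return the parsed JSON for one page (for example by adding the `page` parameter to the request), and the stand-in data is made up for illustration.

```python
def collect_bill_ids(fetch_page):
    """Gather bill_ids across every page of a search result."""
    bill_ids = []
    page = 1
    while True:
        result = fetch_page(page)["searchresult"]
        bill_ids.extend(item["bill_id"] for item in result["results"])
        # int() guards against the page count arriving as a string
        if page >= int(result["summary"]["page_total"]):
            break
        page += 1
    return bill_ids

# Stand-in for the API: two fake pages shaped like the response shown above
fake_pages = {
    1: {"searchresult": {"summary": {"page_current": 1, "page_total": 2},
                         "results": [{"bill_id": 1711942}, {"bill_id": 1693920}]}},
    2: {"searchresult": {"summary": {"page_current": 2, "page_total": 2},
                         "results": [{"bill_id": 1702780}]}},
}
all_ids = collect_bill_ids(lambda page: fake_pages[page])
print(all_ids)
```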

In [7]:
# isolating the list of search results (one dict per bill)
list1 = dict1['searchresult']['results']
list1[:5]

[{'relevance': 100,
  'bill_id': 1711942,
  'change_hash': 'cd9d825f6d643ff19d5d789db7760174'},
 {'relevance': 99,
  'bill_id': 1693920,
  'change_hash': '464be2cde28c223849b4efe7d1b2e894'},
 {'relevance': 99,
  'bill_id': 1702780,
  'change_hash': '266b25df310d9ebd3e42bb97fdec3adf'},
 {'relevance': 99,
  'bill_id': 1712167,
  'change_hash': '37ed87e95a3bc0627706e74408a2a63a'},
 {'relevance': 99,
  'bill_id': 1707639,
  'change_hash': '6bddb0f6f615ed90a9bafc8ec369b181'}]

In [8]:
# generating list of bill_ids
# https://www.geeksforgeeks.org/python-get-values-of-particular-key-in-list-of-dictionaries/
bills = [value['bill_id'] for value in list1]
bills[:5]

[1711942, 1693920, 1702780, 1712167, 1707639]

In [9]:
# turning list of bill_ids for each relevant result from query into a df
df = pd.DataFrame(bills, columns = ['bill_id'])
df.head()

Unnamed: 0,bill_id
0,1711942
1,1693920
2,1702780
3,1712167
4,1707639


In [10]:
# sampling bill pull
bill1 = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id=1711942'
b = requests.get(bill1)
binfo = json.loads(b.text)
print(binfo.keys()) 
print(binfo)

dict_keys(['status', 'bill'])
{'status': 'OK', 'bill': {'bill_id': 1711942, 'change_hash': 'cd9d825f6d643ff19d5d789db7760174', 'session_id': 2016, 'session': {'session_id': 2016, 'state_id': 5, 'year_start': 2023, 'year_end': 2024, 'prefile': 0, 'sine_die': 0, 'prior': 0, 'special': 0, 'session_tag': 'Regular Session', 'session_title': '2023-2024 Regular Session', 'session_name': '2023-2024 Session'}, 'url': 'https://legiscan.com/CA/bill/AB1197/2023', 'state_link': 'https://leginfo.legislature.ca.gov/faces/billStatusClient.xhtml?bill_id=202320240AB1197', 'completed': 0, 'status': 2, 'status_date': '2023-05-30', 'progress': [{'date': '2023-02-16', 'event': 1}, {'date': '2023-03-02', 'event': 9}, {'date': '2023-03-13', 'event': 10}, {'date': '2023-03-14', 'event': 9}, {'date': '2023-04-19', 'event': 10}, {'date': '2023-04-19', 'event': 9}, {'date': '2023-05-10', 'event': 9}, {'date': '2023-05-18', 'event': 10}, {'date': '2023-05-30', 'event': 2}, {'date': '2023-05-31', 'event': 9}, {'dat
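
Both responses so far carry a top-level `status` of `'OK'`. Checking it before indexing into the payload gives a clearer failure than a `KeyError` further down. In this sketch, `require_ok` is a hypothetical helper, and the shape of a failed response is an assumption; only the `'OK'` value is taken from the output above.

```python
import json

def require_ok(raw_text):
    """Parse a LegiScan JSON response and fail early unless status is 'OK'."""
    payload = json.loads(raw_text)
    status = payload.get("status")
    if status != "OK":
        # Assumption: any other status means the payload cannot be used
        raise RuntimeError(f"LegiScan request failed with status {status!r}")
    return payload

good = require_ok('{"status": "OK", "bill": {"bill_id": 1711942}}')
print(good["bill"]["bill_id"])
```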

In [11]:
# exploring structure and extracting doc_ids from bill
print(binfo.keys())
print(binfo['bill'].keys())
print(binfo['bill']['texts'])

dict_keys(['status', 'bill'])
dict_keys(['bill_id', 'change_hash', 'session_id', 'session', 'url', 'state_link', 'completed', 'status', 'status_date', 'progress', 'state', 'state_id', 'bill_number', 'bill_type', 'bill_type_id', 'body', 'body_id', 'current_body', 'current_body_id', 'title', 'description', 'pending_committee_id', 'committee', 'referrals', 'history', 'sponsors', 'sasts', 'subjects', 'texts', 'votes', 'amendments', 'supplements', 'calendar'])
[{'doc_id': 2704856, 'date': '2023-02-16', 'type': 'Introduced', 'type_id': 1, 'mime': 'text/html', 'mime_id': 1, 'url': 'https://legiscan.com/CA/text/AB1197/id/2704856', 'state_link': 'https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB1197#99INT', 'text_size': 7488, 'text_hash': 'b048ef4ee2d51bcb0266641e75d77798', 'alt_bill_text': 0, 'alt_mime': '', 'alt_mime_id': 0, 'alt_state_link': '', 'alt_text_size': 0, 'alt_text_hash': ''}, {'doc_id': 2743399, 'date': '2023-03-13', 'type': 'Amended', 'type_id': 3,

In [12]:
# isolating list of dictionaries
doclist = binfo['bill']['texts']

In [13]:
# generating list of doc_ids
docs = [sub['doc_id'] for sub in doclist]
docs

[2704856, 2743399, 2814717, 2826589, 2830807]

In [14]:
# indexing last element (most recently amended version of bill text)
docs[-1]

2830807
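
Taking `docs[-1]` relies on the `texts` list being ordered from oldest to newest, which holds for this bill but is not something the notebook verifies. Each entry also carries a `date`, so the latest version can be picked explicitly. A minimal sketch, with `latest_doc_id` as a hypothetical helper and a shortened copy of three entries from this bill:

```python
def latest_doc_id(texts):
    """Return the doc_id of the most recently dated bill text, or None if there are none."""
    if not texts:
        return None
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings
    newest = max(texts, key=lambda entry: entry["date"])
    return newest["doc_id"]

# Three entries for this bill, deliberately out of order
sample_texts = [
    {"doc_id": 2830807, "date": "2023-06-26", "type": "Amended"},
    {"doc_id": 2704856, "date": "2023-02-16", "type": "Introduced"},
    {"doc_id": 2743399, "date": "2023-03-13", "type": "Amended"},
]
print(latest_doc_id(sample_texts))
```

Returning `None` for an empty list also covers bills that have no text versions yet, where `docs[-1]` would raise an `IndexError`.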

In [15]:
# sampling bill text pull
btext1 = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBillText&id=2830807'
b = requests.get(btext1)
btext = json.loads(b.text)
print(btext.keys()) 
print(btext)

dict_keys(['status', 'text'])
{'status': 'OK', 'text': {'doc_id': 2830807, 'bill_id': 1711942, 'date': '2023-06-26', 'type': 'Amended', 'type_id': 3, 'mime': 'text/html', 'mime_id': 1, 'url': 'https://legiscan.com/CA/text/AB1197/id/2830807', 'state_link': 'https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB1197#95AMD', 'text_size': 16144, 'text_hash': 'ad35b5cac4cbc4fc4fb154c8f1ed301b', 'doc': 'PGRpdiBpZD0iYmlsbF9hbGwiIGFsaWduPSJqdXN0aWZ5Ij48ZGl2IGlkPSJhYm91dCI+PGJyIGNsZWFyPSJhbGwiLz48ZGl2IGFsaWduPSJjZW50ZXIiPjx0YWJsZSBzdHlsZT0iYm9yZGVyLWNvbGxhcHNlOiBjb2xsYXBzZSIgY2VsbFNwYWNpbmc9IjAiIGFsaWduPSJjZW50ZXIiIHJvbGU9InByZXNlbnRhdGlvbiI+PHRib2R5Pjx0cj48dGQgYWxpZ249ImNlbnRlciI+PHNwYW4gc3R5bGU9IiB0ZXh0LXRyYW5zZm9ybTogdXBwZXJjYXNlOyBmb250LXNpemU6IDFlbSI+CiAgICAgICAgICAgICAgICBBbWVuZGVkCiAgICAgICAgICAgICAgwqBJTsKgPC9zcGFuPjxzcGFuIHN0eWxlPSIgdGV4dC10cmFuc2Zvcm06IHVwcGVyY2FzZTsgZm9udC1zaXplOiAxZW0iPgogICAgICAgICAgICAgICAgU2VuYXRlCiAgICAgICAgICAgICAgPC9zcGFuPsKgPHNwYW4g
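
The `doc` field in this response appears to hold the bill text itself, base64-encoded: its leading characters decode to the start of an HTML fragment. Decoding it would avoid a second request to the state website, although the notebook continues with the `state_link` page instead. A minimal sketch, where `decode_doc` is a hypothetical helper and the sample is only the first 24 characters of the `doc` value shown above:

```python
import base64

def decode_doc(doc_b64, encoding="utf-8"):
    """Decode the base64 'doc' field of a getBillText response into text."""
    # Assumption: the document is HTML (mime 'text/html'); other mime types
    # such as PDF would need to stay as bytes.
    return base64.b64decode(doc_b64).decode(encoding)

prefix = "PGRpdiBpZD0iYmlsbF9hbGwi"
print(decode_doc(prefix))
```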

In [16]:
# exploring structure
print(btext.keys())
print(btext['text'].keys())
print(btext['text']['url'])
print(btext['text']['state_link']) # this URL less likely to turn up errors

dict_keys(['status', 'text'])
dict_keys(['doc_id', 'bill_id', 'date', 'type', 'type_id', 'mime', 'mime_id', 'url', 'state_link', 'text_size', 'text_hash', 'doc', 'alt_bill_text', 'alt_mime', 'alt_mime_id', 'alt_state_link', 'alt_text_size', 'alt_text_hash', 'alt_doc'])
https://legiscan.com/CA/text/AB1197/id/2830807
https://leginfo.legislature.ca.gov/faces/billTextClient.xhtml?bill_id=202320240AB1197#95AMD


In [17]:
# extracting text from bill text url
URL = btext['text']['state_link']
r = urlopen(URL)
html_bytes = r.read()
html = html_bytes.decode("utf-8")
soup = BeautifulSoup(html, features='html.parser')

In [18]:
# inspecting structure to locate text block
print(soup.prettify())

<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html>
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head id="j_idt5">
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <!--&lt;meta http-equiv="refresh" content=";url=faces/home.xhtml"/&gt;-->
  <link href="https://leginfo.legislature.ca.gov/resources/css/leginfo_master.css" media="screen   ,print" rel="stylesheet" type="text/css"/>
  <!--&lt;link rel="stylesheet" type="text/css" href="./resources/css/leginfo_mobile.css" media="screen and (-webkit-min-device-pixel-ratio: 1.1) and (max-device-width: 990px)" /&gt;
    &lt;link rel="stylesheet" type="text/css" href="./resources/css/leginfo_mobile.css" media="screen and (min-device-width: 481px)and (max-device-width: 640px)and (-webkit-max-device-pixel-ratio: 1)" /&gt;-->
  <link href="https://leginfo.legislature.ca.gov/resources/css/leginfo_mobile.css" media="screen and (orient

In [57]:
# locating title in html
t = soup.find_all('title')
print(len(t))
print(t[0])

2
<title>Bill Text  - AB-1197 Agricultural Protection Planning Grant Program: local food producers.</title>


In [58]:
# inspecting all title elements in html (the page has two)
s = soup.find_all('title')
print(len(s))
print(s)

2
[<title>Bill Text  - AB-1197 Agricultural Protection Planning Grant Program: local food producers.</title>, <title>AB1197:v95#DOCUMENT</title>]


In [59]:
# locating bill year introduced in html
y = soup.find_all(id = 'bill_intro_date')
print(len(y))
print(y[0])

1
<td align="center" class="textcenter" id="bill_intro_date"><br/>February 16, 2023</td>


In [60]:
# locating bill year amended in html
y1 = soup.find_all('tbody')
print(len(y1))
print(y1[0])

6
<tbody><tr><td align="center"><span style=" text-transform: uppercase; font-size: 1em">
                Amended
               IN </span><span style=" text-transform: uppercase; font-size: 1em">
                Senate
              </span> <span style="text-transform: uppercase">June 26, 2023</span><br/></td></tr></tbody>


In [61]:
# locating bill text block in html
ti = soup.find_all('div', id = 'bill')
print(len(ti))
print(type(ti))
print(ti)

1
<class 'bs4.element.ResultSet'>
[<div id="bill"><div style=" text-transform: uppercase"><h2 style=" text-align: left;">The people of the State of California do enact as follows:</h2><br/></div><div id="s10.8574569622048498"><div class="ActionLine" style="margin:0 0 1em 0"><h3 style="font-weight:bold; display:inline;">SECTION 1.</h3> Section 10280 of the Public Resources Code is amended to read:</div><div><div id="id_3C722B0F-3B2E-4E9E-84DB-4C2AF0F1DC0C"><div><p><div style="margin:0 0 1em 0;"><h6 style="display:inline;">10280.</h6> The Agricultural Protection Planning Grant Program is hereby established within the Department of Conservation, to provide planning grants to do all of the following:</div><div style="margin:0 0 1em 0;">(a) Conserve California’s most productive farmlands and ecologically important rangelands.</div><div style="margin:0 0 1em 0;">(b) Advance California’s climate change goals through carbon sequestration and greenhouse gas emissions reductions resulting from t

In [100]:
# remove html style and tags from title
for text in t[0]:
    title = text.text.strip()
    title = title.split("- ")[-1]
title

'AB-1197 Agricultural Protection Planning Grant Program: local food producers.'

In [97]:
# remove html style and tags from year introduced
for text in y:
    yeari = text.text.strip()
    yeari = yeari[-4:]
yeari

'2023'

In [99]:
# remove html style and tags from year amended
for text in y1[0]:
    yeara = text.text.strip()
    yeara = yeara[-4:]
yeara

'2023'
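
Slicing the last four characters works as long as the string ends with the year. Parsing the whole date is a little more robust and keeps the month and day available. A minimal sketch, assuming the dates always follow the `Month day, year` pattern seen on this page; `parse_bill_date` is a hypothetical helper:

```python
import re
from datetime import datetime

DATE_PATTERN = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")

def parse_bill_date(raw):
    """Find a 'Month day, year' date in scraped text and return it as a date."""
    # Collapse non-breaking spaces and runs of whitespace first
    cleaned = " ".join(raw.replace("\xa0", " ").split())
    match = DATE_PATTERN.search(cleaned)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%B %d, %Y").date()

intro = parse_bill_date("February\xa016,\xa02023")
amended = parse_bill_date("Amended\n \xa0IN\xa0\n Senate\n \xa0June\xa026,\xa02023")
print(intro.year, amended.year)
```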

In [65]:
# remove html style and tags from bill text
for text in ti:
    billtxt = text.text.strip()
billtxt

'The people of the State of California do enact as follows:SECTION 1.\xa0Section 10280 of the Public Resources Code is amended to read:10280.\xa0The Agricultural Protection Planning Grant Program is hereby established within the Department of Conservation, to provide planning grants to do all of the following:(a)\xa0Conserve California’s most productive farmlands and ecologically important rangelands.(b)\xa0Advance California’s climate change goals through carbon sequestration and greenhouse gas emissions reductions resulting from the implementation of local plans.(c)\xa0Maintain local food supplies, local food producers, and agricultural economies through the protection of agricultural lands.SEC. 2.\xa0Section 10280.5 of the Public Resources Code is amended to read:10280.5.\xa0The following terms have the following meanings as used in this division, unless the context clearly requires otherwise:(a)\xa0“Authority” means an entity established by the state that requires its members, incl

In [103]:
# creating bill_id
bill_id = title.split("- ")[-1][:7]

# concatenating into one list
infotxt1 = [bill_id, title, yeari, yeara, billtxt]
infotxt1

['AB-1197',
 'AB-1197 Agricultural Protection Planning Grant Program: local food producers.',
 '2023',
 '2023',
 'The people of the State of California do enact as follows:SECTION 1.\xa0Section 10280 of the Public Resources Code is amended to read:10280.\xa0The Agricultural Protection Planning Grant Program is hereby established within the Department of Conservation, to provide planning grants to do all of the following:(a)\xa0Conserve California’s most productive farmlands and ecologically important rangelands.(b)\xa0Advance California’s climate change goals through carbon sequestration and greenhouse gas emissions reductions resulting from the implementation of local plans.(c)\xa0Maintain local food supplies, local food producers, and agricultural economies through the protection of agricultural lands.SEC. 2.\xa0Section 10280.5 of the Public Resources Code is amended to read:10280.5.\xa0The following terms have the following meanings as used in this division, unless the context clear

In [102]:
# remove html style and tags from all bill info (B)

# list of all bill elements
billinfo = [t[0], y, y1[0], ti]

# remove html style and tags from bill text
infotxt = []

for info in billinfo:
    for text in info:
        row = text.text.strip()
        infotxt.append(row)
        
    # create bill id from bill title
    bill_id = infotxt[0].split("- ")[-1][:7]
    
# add an item to the list that will serve as the bill id later
infotxt.insert(0, bill_id)
        
print(len(infotxt))
infotxt

5


['AB-1197',
 'Bill Text  - AB-1197 Agricultural Protection Planning Grant Program: local food producers.',
 'February\xa016,\xa02023',
 'Amended\n              \xa0IN\xa0\n                Senate\n              \xa0June\xa026,\xa02023',
 'The people of the State of California do enact as follows:SECTION 1.\xa0Section 10280 of the Public Resources Code is amended to read:10280.\xa0The Agricultural Protection Planning Grant Program is hereby established within the Department of Conservation, to provide planning grants to do all of the following:(a)\xa0Conserve California’s most productive farmlands and ecologically important rangelands.(b)\xa0Advance California’s climate change goals through carbon sequestration and greenhouse gas emissions reductions resulting from the implementation of local plans.(c)\xa0Maintain local food supplies, local food producers, and agricultural economies through the protection of agricultural lands.SEC. 2.\xa0Section 10280.5 of the Public Resources Code is am

In [104]:
# storing bill info in df    
itxtdf = pd.DataFrame(infotxt1).transpose()

# rename columns
#itxtdf.rename(columns = {'0':'title', '1':'y_intro', '2': 'y_recent', '3':'text'}, inplace = True)
itxtdf.columns = ['bill_id', 'title', 'y_intro', 'y_recent', 'text']

# optional: create column to serve as new index ID?
#itxtdf.set_index('bill_id', inplace = True)

itxtdf

Unnamed: 0,bill_id,title,y_intro,y_recent,text
0,AB-1197,AB-1197 Agricultural Protection Planning Grant...,2023,2023,The people of the State of California do enact...


In [68]:
# further processing bill text for NLP

# list of stopwords to exclude
swords = stopwords.words('english')

# stripping nonessential characters
# (A-Za-z rather than A-z, which would also keep [ \ ] ^ _ and `)
textonly = re.sub(r"[^A-Za-z\s]", "", billtxt)

# turning cleaned bill text into list of words
wordlist = [word for word in word_tokenize(textonly.lower()) 
                 if word not in swords]
print(len(wordlist))
wordlist

641


['people',
 'state',
 'california',
 'enact',
 'followssection',
 'section',
 'public',
 'resources',
 'code',
 'amended',
 'read',
 'agricultural',
 'protection',
 'planning',
 'grant',
 'program',
 'hereby',
 'established',
 'within',
 'department',
 'conservation',
 'provide',
 'planning',
 'grants',
 'followinga',
 'conserve',
 'californias',
 'productive',
 'farmlands',
 'ecologically',
 'important',
 'rangelandsb',
 'advance',
 'californias',
 'climate',
 'change',
 'goals',
 'carbon',
 'sequestration',
 'greenhouse',
 'gas',
 'emissions',
 'reductions',
 'resulting',
 'implementation',
 'local',
 'plansc',
 'maintain',
 'local',
 'food',
 'supplies',
 'local',
 'food',
 'producers',
 'agricultural',
 'economies',
 'protection',
 'agricultural',
 'landssec',
 'section',
 'public',
 'resources',
 'code',
 'amended',
 'read',
 'following',
 'terms',
 'following',
 'meanings',
 'used',
 'division',
 'unless',
 'context',
 'clearly',
 'requires',
 'otherwisea',
 'authority',
 'means'
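
Tokens such as `followssection`, `followinga`, and `rangelandsb` in the list above are two words fused together: the text of adjacent HTML elements is joined with no whitespace in between, and stripping punctuation then merges the neighbours. Inserting a space between text nodes during extraction avoids this. The sketch below uses only the standard library's `HTMLParser` to show the idea (`SpacedText` and `extract_text` are hypothetical names, and the snippet is a shortened stand-in for the bill HTML); with BeautifulSoup, `get_text(separator=" ")` serves a similar purpose.

```python
from html.parser import HTMLParser

class SpacedText(HTMLParser):
    """Collect text nodes separately so they can be joined with spaces."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Normalise non-breaking spaces and skip whitespace-only nodes
        text = data.replace("\xa0", " ").strip()
        if text:
            self.chunks.append(text)

def extract_text(html):
    parser = SpacedText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

snippet = ('<div><h2>do enact as follows:</h2><div><h3>SECTION 1.</h3>'
           '\xa0Section 10280 is amended to read:</div></div>')
print(extract_text(snippet))
```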

## Putting All the Pieces Together

This code aims to:

1. extract the individual *bill_id* for each bill in the search results for a query of "food and agriculture" bills in California for the most recent legislative session;
2. extract the *doc_id* for the most recently amended version of each bill;
3. extract (A) the bill title and (B) the bill text for each document; and
4. clean the bill text into either (A) a relatively clean block of text or (B) a cleaned word list.
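
One way to organise these four steps is to keep each one in its own function and let a short driver loop call them, so that a bill with no text versions or a failed download is recorded rather than stopping the whole run. The sketch below shows only that control flow: `process_bills`, `get_latest_doc_id`, and `get_bill_record` are hypothetical names, and the stand-in functions return made-up values in place of real API calls.

```python
def process_bills(bill_ids, get_latest_doc_id, get_bill_record):
    """Run the per-bill steps, collecting rows and the bills that failed."""
    rows, failures = [], []
    for bill_id in bill_ids:
        try:
            doc_id = get_latest_doc_id(bill_id)        # step 2
            if doc_id is None:
                failures.append((bill_id, "no text versions"))
                continue
            rows.append(get_bill_record(doc_id))       # steps 3 and 4
        except Exception as err:  # record the problem and move on
            failures.append((bill_id, str(err)))
    return rows, failures

# Stand-ins for the API calls (illustrative values only)
fake_docs = {"1711942": "2830807", "1693920": None}

def fake_latest(bill_id):
    return fake_docs[bill_id]          # raises KeyError for unknown bills

def fake_record(doc_id):
    return {"doc_id": doc_id, "title": "example title", "text": "example text"}

rows, failures = process_bills(["1711942", "1693920", "0000000"],
                               fake_latest, fake_record)
print(len(rows), failures)
```

Because `rows` is a list of dictionaries with the same keys, it can be passed directly to `pd.DataFrame(rows)` once the loop has finished.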

In [110]:
## SCRAPE SEARCH RESULTS for bill_ids

# scrape LegiScan search results for CA bills RE: food and ag for current year
requeststring = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getSearchRaw&state=CA&year=2&query=food%20and%20agriculture"
# pull request: getting results as JSON object
r = requests.get(requeststring)
# turning JSON object into a dict
dict1 = json.loads(r.text)

# isolating the list of search results (one dict per bill)
list1 = dict1['searchresult']['results']
# generating list of bill_ids
bills = [value['bill_id'] for value in list1]

# each bill in bills is stored as an integer; convert to list of str
bills1 = [str(bill) for bill in bills]

#inspect/show results
#bills1[:5]

In [233]:
## EXTRACT doc_id for most recent version of bill, select bill info, save as df

# list to collect one single-row df per bill
texts = []

# loop
for bill in bills1:
    
    # extract bill info
    bill_id = bill
    requestbill = "https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBill&id="+bill_id
    b = requests.get(requestbill)
    binfo = json.loads(b.text)
    
    # FIND doc_id for most recent bill version
    doclist = binfo['bill']['texts']
    docs = [value['doc_id'] for value in doclist]
    docs1 = [str(doc) for doc in docs] # converting int to str
    
    # identify most recently amended version of text
    doc_id = docs1[-1] 
    
    # get bill text for doc
    requestdoc = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBillText&id='+doc_id
    d = requests.get(requestdoc)
    btext = json.loads(d.text)

    # get URL for amended doc
    URL = btext['text']['state_link']

    ## EXTRACT select info from URL
    r = urlopen(URL)
    html_bytes = r.read()
    html = html_bytes.decode("utf-8")
    soup = BeautifulSoup(html, features='html.parser')

    # bill title
    t = soup.find_all('title')
    ti = t[0]

    # bill year: introduced
    yi = soup.find_all(id = 'bill_intro_date') # will have to be cleaned

    # bill year: amended
    ya = soup.find_all('tbody')
    yam = (ya[0])

    # bill text
    txt = soup.find_all('div', id = 'bill')

    # title
    for text in ti:
        title = text.text.strip()
        title = title.split("- ")[-1]
    
    ## REMOVE html style and tags from bill text
    # year intro
    for text in yi:
        yeari = text.text.strip()
        yeari = yeari[-4:]

    # year recent activity
    for text in yam:
        yeara = text.text.strip()
        yeara = yeara[-4:]

    # bill text
    for text in txt:
        billtxt = text.text.strip()

    # creating bill_id
    bill_id = title.split("- ")[-1][:7]
    
    # concatenating into one list
    infotxt = [bill_id, title, yeari, yeara, billtxt]

    ## CONVERT to a single-row df (transpose so each field becomes a column)
    itxtdf = pd.DataFrame(infotxt).transpose()
    texts.append(itxtdf)

# combine all single-row dfs once the loop has finished
fulltext = pd.concat(texts, ignore_index=True)

# rename columns
fulltext.columns = ['bill_id', 'title', 'y_intro', 'y_recent', 'text']

# inspect/show
fulltext.head()





In [234]:
# SAVE: pieces

# FIND doc_id for most recent bill version
doclist = binfo['bill']['texts']
docs = [value['doc_id'] for value in doclist]
docs1 = [str(doc) for doc in docs] # converting int to str

# call the most recently amended version of bill text
doc_id = docs1[-1] 

# get bill text for doc
btext1 = 'https://api.legiscan.com/?key=7e00040f1f7618af234e7415484d2494&op=getBillText&id='+doc_id
b = requests.get(btext1)
btext = json.loads(b.text)

# get URL for amended doc
URL = btext['text']['state_link']

## EXTRACTION: relevant info from URL
r = urlopen(URL)
html_bytes = r.read()
html = html_bytes.decode("utf-8")
soup = BeautifulSoup(html, features='html.parser')

# bill title
t = soup.find_all('title')
ti = t[0]

# bill year: introduced
yi = soup.find_all(id = 'bill_intro_date') # will have to be cleaned

# bill year: amended
ya = soup.find_all('tbody')
yam = (ya[0])

# bill text
txt = soup.find_all('div', id = 'bill')

## REMOVE: html style and tags from bill text

# title
for text in ti:
    title = text.text.strip()
    title = title.split("- ")[-1]

# year intro
for text in yi:
    yeari = text.text.strip()
    yeari = yeari[-4:]

# year recent activity
for text in yam:
    yeara = text.text.strip()
    yeara = yeara[-4:]

# bill text
for text in txt:
    billtxt = text.text.strip()
    
# creating bill_id
bill_id = title.split("- ")[-1][:7]

# concatenating into one list
infotxt = [bill_id, title, yeari, yeara, billtxt]
infotxt

## CONVERT to df

itxtdf = pd.DataFrame(infotxt).transpose()

# rename columns
itxtdf.columns = ['bill_id', 'title', 'y_intro', 'y_recent', 'text']
itxtdf

Unnamed: 0,status,bill
text,OK,"{'bill_id': 1711942, 'change_hash': 'cd9d825f6..."
text,OK,"{'bill_id': 1693920, 'change_hash': '464be2cde..."
text,OK,"{'bill_id': 1702780, 'change_hash': '266b25df3..."
text,OK,"{'bill_id': 1712167, 'change_hash': '37ed87e95..."
text,OK,"{'bill_id': 1707639, 'change_hash': '6bddb0f6f..."
text,OK,"{'bill_id': 1671132, 'change_hash': '3f4199a2f..."
text,OK,"{'bill_id': 1712103, 'change_hash': 'ce1242ff4..."
text,OK,"{'bill_id': 1707512, 'change_hash': '85d63281d..."
text,OK,"{'bill_id': 1714501, 'change_hash': 'f094de3f6..."
text,OK,"{'bill_id': 1738922, 'change_hash': '95e252ff3..."
