</center></div>
<div style = "background-color:indigo"><center>
<h1 style="font-size: 50px; font-weight: bold; color:goldenrod; border-top: 3px solid goldenrod; padding-top: 10px">AI California Legislative Policy Analysis (CALPA-AI)</h1>
<div style="font-size: 35px; font-weight: bold; color: goldenrod"> Part 1 - Preliminary Data Processing</div>
<div style="font-size: 30px; font-weight: bold; color: goldenrod; border-bottom: 3px solid goldenrod; padding-bottom: 20px">v.1.0 April 2025</div>
</center></div>

This is the main notebook for the AI California Legislative Policy Analysis (CALPA) project. The goal of this project is to analyze California legislative bills using natural language processing (NLP) techniques. This notebook will cover the preliminary data processing steps, including data loading, cleaning, and preparation for analysis.
The project is divided into several parts, each focusing on a specific aspect of the analysis. The first part will cover the data loading and cleaning process, while subsequent parts will focus on feature extraction, model training, and evaluation.

<h1 style="font-weight:bold; color:orangered; border-bottom: 2px solid orangered">1. Preliminaries</h1>

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">1.1 Referencing Libraries and Initialization</h2>

To reset the kernel if needed, run the following cell:

In [1]:
#%reset

Import the Python libraries used in the project.

In [1]:
# Import required libraries
import os
from dotenv import load_dotenv
import time
from datetime import date
from datetime import datetime
import json
import mimetypes
import glob
import base64
import zipfile
import io
import requests
import pandas as pd

Load the local python modules containing classes and functions for the project from the local directory. There are two modules:
- `calpa`: This module contains the main classes and functions for the project.
- `legiscan`: This module contains the classes and functions for the LegiScan API.

In [2]:
# Load the Calpa module located in the scripts/python/calpa directory
from calpa import Calpa, LegiScan

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">1.2. Project and Workspace Variables</h2>

<h3 style="font-weight:bold; color:lime; padding-left: 50px">Load Environment Variables</h3>

Define and maintain the project, workspace, and metadata settings. Below we use the `dotenv` library to load the environment variables from the `.env` file into the Python environment. These variables configure the project and workspace settings. The environment also contains the LegiScan API key, stored in the `LEGISCAN_API_KEY` environment variable, which is used to access the LegiScan API.

In [3]:
# Load environment variables from .env file
load_dotenv()

True
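
A quick check after loading can confirm that the key is actually available before any API call is made. The sketch below is a minimal illustration only; the helper name `require_env` is hypothetical and is not part of the project's modules.

```python
import os

def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Fail early with a readable message instead of a later API error
        raise RuntimeError(f"Environment variable {name} is not set; check the .env file")
    return value
```

After `load_dotenv()` has run, calling `require_env("LEGISCAN_API_KEY")` would either return the key or stop the notebook with an explicit message.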

<h3 style="font-weight:bold; color:lime; padding-left: 50px">Main Class Instantiation</h3>

Instantiate the two main classes for the project:
- `calpa`: This class is used to access the CALPA API and retrieve legislative data.
- `legiscan`: This class is used to access the LegiScan API and retrieve legislative data.

In [4]:
# Instantiate the LegiScan and Calpa classes
calpa = Calpa()
legiscan = LegiScan()

Create project metadata for the project

In [None]:
# Create project metadata for the AI project
prjMetadata = calpa.projectMetadata("AI", "1")

Project Global Settings:
- Name: California Legislative Policy Analysis
- Title: AI Legislative Policy Analysis
- Version: 1.0
- Author: Dr. Kostas Alexandridis, GISP
Data Dates
- Start Date: 2010-12-02
- End Date: 2025-04-24
- Periods: 2009-2010, 2011-2012, 2013-2014, 2015-2016, 2017-2018, 2019-2020, 2021-2022, 2023-2024, 2025-2026


Create the project directories dictionary

In [6]:
# Create the project directories dictionary
prjDirs = calpa.projectDirectories(os.getcwd())

Directory Global Settings:

General:
- Project (pathPrj): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA
- Admin (pathAdmin): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\admin
- Metadata (pathMetadata): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\metadata
- Analysis (pathAnalysis): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\analysis
Scripts:
- Python Calpa Module (pathScriptsCalpa): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\calpa
- Markdown Scripts (pathScriptsMd): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\markdown
- RIS Scripts (pathScriptsRis): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\ris
Data:
- Main Data (pathData): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\data
- Documents (pathDataDocs): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\data\docs
- LegiScan (pathDataLegis): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\data\legis
- LookUp (pathDataLookup): c:\Users\ktalexan\OneDrive\Documents\GitHub\CaLPA\data\lookup
- Ma

<h3 style="font-weight:bold; color:lime; padding-left: 50px">Load Lookup DataFrames</h3>

Load the lookup data frames needed for the project. These are located in the `data/lookup` project directory.
Specifically we will load the following lookup tables:

- `codebookBill`: This table contains the mapping of bill codes to their descriptions.
- `codebookRollCall`: This table contains the mapping of roll call codes to their descriptions.
- `codebookBillText`: This table contains the mapping of bill text codes to their descriptions.
- `codebookAmendment`: This table contains the mapping of amendment codes to their descriptions.
- `codebookSupplement`: This table contains the mapping of supplement codes to their descriptions.
- `codebookPerson`: This table contains the mapping of person codes to their descriptions.
- `codebookSessionList`: This table contains the mapping of session codes to their descriptions.

In [None]:
# Load the codebookBill pickle file from the data/lookup directory
codebookBill = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookBill.pkl"))

# Load the codebookRollCall pickle file from the data/lookup directory
codebookRollCall = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookRollCall.pkl"))

# Load the codebookBillText pickle file from the data/lookup directory
codebookBillText = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookBillText.pkl"))

# Load the codebookAmendment pickle file from the data/lookup directory
codebookAmendment = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookAmendment.pkl"))

# Load the codebookSupplement pickle file from the data/lookup directory
codebookSupplement = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookSupplement.pkl"))

# Load the codebookPerson pickle file from the data/lookup directory
codebookPerson = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookPerson.pkl"))

# Load the codebookSessionList pickle file from the data/lookup directory
codebookSessionList = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookSessionList.pkl"))

<h1 style="font-weight:bold; color:orangered; border-bottom: 2px solid orangered">2. Baseline LegiScan Data</h1>

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">2.1. Session List</h2>

Using the LegiScan API, we will retrieve the list of sessions for California. This will be used to get the session ID for the current session and the previous session. The session ID is needed to retrieve the bills for each session.

In [9]:
# Get the list of sessions from LegiScan
sessionList = legiscan.getSessionList()

Convert the session list to a pandas dataframe

In [10]:
# Convert the sessionList to a pandas DataFrame
sessionDf = pd.DataFrame(sessionList)
sessionDf.head()

Unnamed: 0,2025-2026,2023-2024,2021-2022,2019-2020,2017-2018,2015-2016,2013-2014,2011-2012,2009-2010
session_id,2172,2016,1791,1624,1400,1120,993,82,30
state_id,5,5,5,5,5,5,5,5,5
state_abbr,CA,CA,CA,CA,CA,CA,CA,CA,CA
year_start,2025,2023,2021,2019,2017,2015,2013,2011,2009
year_end,2026,2024,2022,2020,2018,2016,2014,2012,2010


We need to compare the session list obtained from the LegiScan API with the previous session list (stored on disk under `data/lookup/sessionList.json`). Here, we load the stored session list into a new dictionary called `sessionListStored`.

In [11]:
# Obtain the stored sessions list from JSON dictionary on disk (data/lookup directory)
sessionListStored = legiscan.getStoredData(dataType = "session")

Now that we have both session lists (the one from the LegiScan API, `sessionList`, and the stored version, `sessionListStored`), we can compare them. We check whether the session list from the LegiScan API is the same as the session list stored on disk. If they differ, we first identify which sessions need updating, and later update the stored session list with the new one from the LegiScan API. The same comparison also reveals any new sessions added to the LegiScan API since the last time we retrieved the session list.

The `matchHash` method of the `legiscan` class uses hash values to compare the two lists. In this case, the relevant JSON key is the `session_hash` of each `session_id`.

In [12]:
# Compare the sessionList and sessionListStored dictionaries for any changes
unmatchedSessions = legiscan.matchHash(sessionList, sessionListStored, hashType = "session", silent = True)

# if the unmatchedSessions is empty, print "All sessions match", and delete the unmatchedSessions variable
if unmatchedSessions is None:
    print("All sessions match")
    del unmatchedSessions
else:
    print("Unmatched sessions found")
    # Print the unmatched sessions
    print(unmatchedSessions)

All sessions match
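
The idea behind a hash-based comparison can be sketched in a few lines. This is not the implementation of `legiscan.matchHash`; the function `find_unmatched` is a hypothetical stand-in, it assumes both inputs are dictionaries of records keyed by session period, and the hash values in the example are made up.

```python
def find_unmatched(current, stored, hash_key):
    """Return records in `current` whose hash differs from, or is missing in, `stored`.

    Both arguments are assumed to map a label to a record (a dictionary)
    that carries a hash under `hash_key`. Returns None when everything matches.
    """
    unmatched = {}
    for label, record in current.items():
        stored_record = stored.get(label)
        if stored_record is None or stored_record.get(hash_key) != record.get(hash_key):
            unmatched[label] = record
    return unmatched or None

# Placeholder records: the hash strings are invented for illustration
current = {
    "2023-2024": {"session_id": 2016, "session_hash": "aaa"},
    "2025-2026": {"session_id": 2172, "session_hash": "bbb"},
}
stored = {
    "2023-2024": {"session_id": 2016, "session_hash": "aaa"},
    "2025-2026": {"session_id": 2172, "session_hash": "old"},
}
changed = find_unmatched(current, stored, "session_hash")
print(changed)
```

Only the session whose hash changed is reported, so only that session would need to be refreshed on disk.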


Export the LegiScan query records to the `data/legis/json` directory as a JSON file for future reference.

In [13]:
# Export the sessionList to a JSON file in the data/legis/json directory
with open(os.path.join(prjDirs["pathDataLegis"], "json", "sessionList.json"), "w", encoding="utf-8") as f:
    json.dump(sessionList, f, ensure_ascii=False, indent=4)
del f

If needed, update the stored session list with the new session list from the LegiScan API.

In [14]:
# Update the stored sessions list with the new sessionList
with open(os.path.join(prjDirs["pathDataLookup"], "sessionListStored.json"), "w", encoding="utf-8") as f:
    json.dump(sessionList, f, ensure_ascii=False, indent=4)
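
The export and update cells in this notebook all follow the same write pattern, which could be wrapped in a small helper. The sketch below is one possible way to do that; `save_json` is a hypothetical name, and the demonstration writes placeholder data to a temporary directory rather than to the project folders.

```python
import json
import os
import tempfile

def save_json(data, directory, filename):
    """Write `data` as indented UTF-8 JSON to directory/filename and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
    return path

# Demonstration with a temporary directory and placeholder data
with tempfile.TemporaryDirectory() as tmp:
    demoPath = save_json({"2025-2026": {"session_id": 2172}}, tmp, "sessionList.json")
    with open(demoPath, encoding="utf-8") as f:
        roundTrip = json.load(f)
```

With such a helper, a cell like the one above would reduce to a single call, for example `save_json(sessionList, prjDirs["pathDataLookup"], "sessionListStored.json")`.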

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">2.2. Session People</h2>

In this step, we obtain the list of California legislature members (Senate and Assembly) for each legislative session, using the LegiScan API. The session ID is needed to retrieve the members of each session.

The `legiscan.getSessionPeople` method takes a session ID as its argument and returns the members of that session. The results are stored in a dictionary called `sessionPeople`, keyed by the same session keys as `sessionList`, with the returned members as the value.

In [15]:
# Get the list of session people from LegiScan
sessionPeople = {}
for key, value in sessionList.items():
    sessionId = value["session_id"]
    sessionPeople[key] = legiscan.getSessionPeople(sessionId)
del key, value, sessionId

As with the legislative session list, we compare the session people list obtained from the LegiScan API with the previous session people list (stored on disk under `data/lookup/sessionPeople.json`). Here, we load the stored session people list into a new dictionary called `sessionPeopleStored`.

In [16]:
# Obtain the stored session People list from JSON dictionary on disk (data/lookup directory)
sessionPeopleStored = legiscan.getStoredData(dataType = "people")

This time, the task is not that simple, since `sessionPeople` lists are nested for each session. The comparison of the Legislature members needs to be done in a loop for each session. In the following code segment, we perform this task in sequential steps:

1. Create a dictionary named `unmatchedPeople` to hold the unmatched session people (will be nested for each session).
2. Loop through the `sessionPeople` and `sessionPeopleStored` dictionaries to compare the session people lists for each session.
3. For each session, compare the session people lists and store the unmatched session people in the `unmatched` variable, based on the `person_hash` key attribute in both lists. We will use the `matchHash` method from the `legiscan` module to compare the two lists.
4. If there are any unmatched session people, we will update the `unmatchedPeople` dictionary with the unmatched session people (for each session).
5. Finally, we will check if there are any unmatched session people in the `unmatchedPeople` dictionary. If there are, we will update the `sessionPeopleStored` dictionary with the unmatched session people and save it to the disk. 

In [17]:
# Compare the sessionPeople and sessionPeopleStored dictionaries for any changes
# Create a dictionary to store unmatched people
unmatchedPeople = {}
# Iterate through each session and compare the people lists
for key, value in sessionPeople.items():
    unmatchedPeople[key] = {}
    unmatched = legiscan.matchHash(sessionPeople[key]["people"], sessionPeopleStored[key]["people"], hashType = "person", silent = True)
    # If there are unmatched people, store them in the unmatchedPeople dictionary
    unmatchedPeople[key] = unmatched if unmatched is not None else None
del key, value, unmatched

# if the unmatchedPeople is empty, print "All people match", and delete the unmatchedPeople variable
if all(not value for value in unmatchedPeople.values()):
    print("All people match")
    # Delete the unmatchedPeople variable
    del unmatchedPeople
else:
    print("Unmatched people found")
    # Print the unmatched people
    print(unmatchedPeople)

All people match
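
One detail worth noting is that a session present in the API response but absent from the stored copy would raise a `KeyError` in the loop above. The sketch below shows one way to treat such a session as entirely unmatched. It is an illustration under assumptions: `unmatched_by_session` is a hypothetical helper, each session is assumed to hold a `people` list of records with `people_id` and `person_hash` fields, and all values in the example are made up.

```python
def unmatched_by_session(current, stored, id_key="people_id", hash_key="person_hash"):
    """Compare per-session lists of records and return {session: [changed records]}."""
    result = {}
    for session, payload in current.items():
        # A session absent from the stored copy yields an empty lookup,
        # so every one of its records is reported as unmatched.
        stored_people = stored.get(session, {}).get("people", [])
        stored_hashes = {p[id_key]: p[hash_key] for p in stored_people}
        changed = [p for p in payload["people"] if stored_hashes.get(p[id_key]) != p[hash_key]]
        if changed:
            result[session] = changed
    return result

# Placeholder data: identifiers and hashes are invented for illustration
currentPeople = {
    "2023-2024": {"people": [{"people_id": 1, "person_hash": "h1"}]},
    "2025-2026": {"people": [{"people_id": 1, "person_hash": "h1"},
                             {"people_id": 2, "person_hash": "h2"}]},
}
storedPeople = {
    "2023-2024": {"people": [{"people_id": 1, "person_hash": "h1"}]},
}
newPeople = unmatched_by_session(currentPeople, storedPeople)
print(newPeople)
```

Sessions with no differences are left out of the result, so an empty dictionary means that all people match.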


Export the LegiScan query data for the session people to the `data/legis/json` directory for future reference.

In [18]:
# Export the sessionPeople to a JSON file in the data/legis/json directory
with open(os.path.join(prjDirs["pathDataLegis"], "json", "sessionPeople.json"), "w", encoding="utf-8") as f:
    json.dump(sessionPeople, f, ensure_ascii=False, indent=4)
del f

If needed, update the stored session people list with the new session people list from the LegiScan API.

In [19]:
# Update the stored session People list with the new sessionPeople
with open(os.path.join(prjDirs["pathDataLookup"], "sessionPeopleStored.json"), "w", encoding="utf-8") as f:
    json.dump(sessionPeople, f, ensure_ascii=False, indent=4)

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">2.3. Dataset List</h2>

In this section, we obtain the list of datasets and their attributes for each California legislative session from LegiScan. This step provides the dataset `access_key` for each session, which is needed to query the dataset contents in a later step.

The `legiscan.getDatasetList` method retrieves the list of available datasets. In the call below, no session ID is passed, and the method returns the datasets for all sessions at once. The result is stored in a dictionary called `datasetList`.

In [20]:
# Get the list of datasets from LegiScan for each legislative session
datasetList = legiscan.getDatasetList()

Obtain the stored dataset list from disk, so that it can be compared with the dataset list returned by the LegiScan API. The stored copy is kept in the `data/lookup/datasetList.json` file and contains the list of datasets for each session. Here, we load it into a new dictionary called `datasetListStored`.

In [21]:
# Obtain the stored dataset list from JSON dictionary on disk (data/lookup directory)
datasetListStored = legiscan.getStoredData(dataType = "dataset")

Now that we have both dataset lists (the one from the LegiScan API, `datasetList`, and the stored version, `datasetListStored`), we can compare them. We check whether the dataset list from the LegiScan API is the same as the dataset list stored on disk. If they differ, we first identify which datasets need updating, and later update the stored dataset list with the new one from the LegiScan API. The same comparison also reveals any new datasets added to the LegiScan API since the last time we retrieved the dataset list.

The `matchHash` method of the `legiscan` class uses hash values to compare the two lists. In this case, the relevant JSON key is the `dataset_hash` of each `session_id`.

In [22]:
# Compare the datasetList and datasetListStored dictionaries for any changes
unmatchedDatasets = legiscan.matchHash(datasetList, datasetListStored, hashType = "dataset", silent = True)

# if the unmatchedDatasets is empty, print "All datasets match", and delete the unmatchedDatasets variable
if unmatchedDatasets is None:
    print("All datasets match")
    del unmatchedDatasets
else:
    print("Unmatched datasets found")
    # Print the unmatched datasets
    print(unmatchedDatasets)

All datasets match


Export the LegiScan query records to the `data/legis/json` directory as a JSON file for future reference.

In [23]:
# export the datasetList to a JSON file in the data/legis/json directory
with open(os.path.join(prjDirs["pathDataLegis"], "json", "datasetList.json"), "w", encoding="utf-8") as f:
    json.dump(datasetList, f, ensure_ascii=False, indent=4)
del f

Update the stored dataset list with the new dataset list from the LegiScan API.

In [24]:
# Update the stored dataset list with the new datasetList
with open(os.path.join(prjDirs["pathDataLookup"], "datasetListStored.json"), "w", encoding="utf-8") as f:
    json.dump(datasetList, f, ensure_ascii=False, indent=4)

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">2.4. Master List</h2>

In this step, we obtain the master list for each legislative session, using the LegiScan API. The master list contains the list of bills in a session, and the session ID is needed to retrieve it.

There are two options for this method. The first obtains the master list with bill attributes (when `raw = False`), and the second obtains the raw master list containing only the bill ID and hash (when `raw = True`).

We will use the `legiscan.getMasterList(sessionID, raw)` method and will store the results in a dictionary called `masterList` or `masterListRaw` depending on the option provided in the method invocation.
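
For orientation, the sketch below shows how the two options could map onto LegiScan API requests. It rests on assumptions: that the wrapper sends requests to the public LegiScan endpoint with the `getMasterList` and `getMasterListRaw` operations, and that the function name `master_list_url` is purely illustrative. The actual internals of `legiscan.getMasterList` are not shown in this notebook.

```python
from urllib.parse import urlencode

# Assumed public LegiScan endpoint; the project's wrapper may differ
LEGISCAN_BASE_URL = "https://api.legiscan.com/"

def master_list_url(api_key, session_id, raw=False):
    """Build a request URL for the full or raw master list of one session."""
    # The raw flag only changes which API operation is requested
    op = "getMasterListRaw" if raw else "getMasterList"
    query = urlencode({"key": api_key, "op": op, "id": session_id})
    return f"{LEGISCAN_BASE_URL}?{query}"

# Placeholder key; session 2172 is the 2025-2026 session listed earlier
url = master_list_url("YOUR_API_KEY", 2172, raw=True)
print(url)
```

The raw variant is the lighter request and is sufficient for change detection, while the full variant carries the bill attributes.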

In [25]:
# Get the Raw Master List from LegiScan for each legislative session
masterListRaw = {}
for key, value in sessionList.items():
    sessionId = value["session_id"]
    masterListRaw[key] = legiscan.getMasterList(sessionId, raw = True)
del key, value, sessionId

In [26]:
# Get the Master List from LegiScan for each legislative session
masterList = {}
for key, value in sessionList.items():
    sessionId = value["session_id"]
    masterList[key] = legiscan.getMasterList(sessionId, raw = False)
del key, value, sessionId

Obtain the stored master lists (both raw and full) from disk, so that each bill can be compared against the master list returned by the LegiScan API. The stored copies are kept in the `data/lookup/masterList.json` and `data/lookup/masterListRaw.json` files and contain the list of bills for each session. Here, we load them into two new dictionaries called `masterListStored` and `masterListRawStored`.

In [27]:
# Get the stored raw master list from JSON dictionary on disk (data/lookup directory)
masterListRawStored = legiscan.getStoredData(dataType = "master", raw = True)
# Get the stored master list from JSON dictionary on disk (data/lookup directory)
masterListStored = legiscan.getStoredData(dataType = "master", raw = False)

Now that we have both master lists (the one from the LegiScan API, `masterListRaw`, and the stored version, `masterListRawStored`), we can compare them. We check whether the master list from the LegiScan API is the same as the master list stored on disk. If they differ, we first identify which bills need updating, and later update the stored master list with the new one from the LegiScan API. The same comparison also reveals any new bills added to the LegiScan API since the last time we retrieved the master list.

The `matchHash` method of the `legiscan` class uses hash values to compare the two lists. In this case, the relevant JSON key is the `change_hash` of each bill.

In [28]:
# Compare the masterListRaw and masterListRawStored dictionaries for any changes
# Create a dictionary to store unmatched bills
unmatchedMasterList = {}
for key, value in masterListRaw.items():
    # if key is not "session"
    if key != "session":
        unmatchedMasterList[key] = {}
        # Compare the fresh raw master list against the stored copy (not against itself)
        unmatched = legiscan.matchHash(masterListRaw[key]["bills"], masterListRawStored[key]["bills"], hashType = "change", silent = True)
        # If there are unmatched bills, store them in the unmatchedMasterList dictionary
        unmatchedMasterList[key] = unmatched if unmatched is not None else None
del key, value, unmatched

# if the unmatchedMasterList is empty, print "All bills match", and delete the unmatchedMasterList variable
if all(not value for value in unmatchedMasterList.values()):
    print("All bills match")
    # Delete the unmatchedMasterList variable
    del unmatchedMasterList
else:
    print("Unmatched bills found")
    # Print the unmatched bills
    print(unmatchedMasterList)

All bills match


Export both the LegiScan query records (raw and full master list) to the `data/legis/json` directory as a JSON file for future reference.

In [29]:
# export the raw master list to a JSON file in the data/legis/json directory
with open(os.path.join(prjDirs["pathDataLegis"], "json", "masterListRaw.json"), "w", encoding="utf-8") as f:
    json.dump(masterListRaw, f, ensure_ascii=False, indent=4)
del f

In [30]:
# export the master list to a JSON file in the data/legis/json directory
with open(os.path.join(prjDirs["pathDataLegis"], "json", "masterList.json"), "w", encoding="utf-8") as f:
    json.dump(masterList, f, ensure_ascii=False, indent=4)
del f

If needed, update the stored master lists (raw and full) with the new master lists from the LegiScan API.

In [31]:
# Update the stored raw master list with the new masterListRaw
with open(os.path.join(prjDirs["pathDataLookup"], "masterListRawStored.json"), "w", encoding="utf-8") as f:
    json.dump(masterListRaw, f, ensure_ascii=False, indent=4)

In [32]:
# Update the master list with the new masterList
with open(os.path.join(prjDirs["pathDataLookup"], "masterListStored.json"), "w", encoding="utf-8") as f:
    json.dump(masterList, f, ensure_ascii=False, indent=4)

<h1 style="font-weight:bold; color:orangered; border-bottom: 2px solid orangered">3. Bill Monitoring Operations</h1>

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">3.1. Stored Monitoring Lists</h2>

Get the stored monitoring lists from the disk, for each sub-project operation.

In [33]:
# Get the AI monitoring list from disk (data/lookup directory)
aiBillListStored = legiscan.getStoredData(dataType = "bills", project = "AI")

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">3.2. Get LegiScan Bills</h2>

Get the full LegiScan record for each bill in the stored AI monitoring list. The list is organized by legislative session and maps each bill number to its LegiScan bill ID, which is passed to `legiscan.getBill` to query the LegiScan API.

In [34]:
# Define a dictionary to store AI bills
aiBills = {}
# Iterate through the AI bill list and fetch the bill details from LegiScan
for key, value in aiBillListStored.items():
    # Set the key to the legislative session period
    aiBills[key] = {}
    print(f"Legislative session: {key}")
    # Iterate through the bills for the legislative session
    for bno, bid in value.items():
        # add the legiscan query to the aiBills dictionary
        aiBills[key][bno] = legiscan.getBill(billId = bid)
print("Completed fetching AI bills from LegiScan")

Legislative session: 2013-2014
Legislative session: 2017-2018
Legislative session: 2019-2020
Legislative session: 2021-2022
Legislative session: 2023-2024
Legislative session: 2025-2026
Completed fetching AI bills from LegiScan


Get the bill counts for each legislative session from the obtained data.

In [35]:
i=0
for key, value in aiBills.items():
    print(f"{key}: {len(value)} bills")
    i+=len(value)
print(f"    Total: {i} bills")
del i, key, value

2013-2014: 3 bills
2017-2018: 5 bills
2019-2020: 16 bills
2021-2022: 15 bills
2023-2024: 81 bills
2025-2026: 39 bills
    Total: 159 bills


Export the LegiScan query records to the `data/legis/json` directory as a JSON file for future reference.

In [36]:
# Export the AI bills to a JSON file in the data/legis/json directory
with open(os.path.join(prjDirs["pathDataLegis"], "json", "aiBills.json"), "w", encoding="utf-8") as f:
    json.dump(aiBills, f, ensure_ascii=False, indent=4)

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">3.3. Get Bill Text</h2>

Get the LegiScan bill text for each session. This process involves two legiscan functions:

- `legiscan.getBillText`: This function retrieves the bill text for each bill in the session. The document ID is passed as an argument to the function. The function runs the LegiScan API call, and returns the bill text JSON information for each bill (which includes the base64 encoded bill text).
- `legiscan.summarizeBillText`: This function summarizes the bill text for each bill in the session. The bill JSON information is passed as an argument to this function. The function performs a number of tasks:
    - Looks up the `texts` JSON object group of the bill, finds the latest bill text version (the last one in the list, with the most recent bill date), and retrieves the `doc_id` identifier of that bill text.
    - Uses the `legiscan.getBillText` function above to retrieve the encoded (base64) bill text. It then proceeds to decode the encoded bill text, and converts it to a string.
    - Uses an `Azure OpenAI` API call to create a TL;DR summary of the bill text, along with a list of keywords (tags) for the bill text. 
    - Finally, it constructs and returns a dictionary with the bill number, summarized bill text, the tags, and the bill text itself.
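
The first two steps above can be sketched as follows. This is a minimal illustration, not the code of `legiscan.summarizeBillText`: the helper names are hypothetical, the sample bill and its values are made up, and the decoded document is assumed to be UTF-8 text, which will not hold for documents delivered as PDF.

```python
import base64

def latest_text_version(bill):
    """Return the entry of bill["texts"] with the most recent date."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings
    return max(bill["texts"], key=lambda t: t["date"])

def decode_document(encoded):
    """Decode a base64 string into text, assuming a UTF-8 document."""
    return base64.b64decode(encoded).decode("utf-8", errors="replace")

# Placeholder data shaped like the fields described above (values are invented)
bill = {
    "bill_number": "AB1",
    "texts": [
        {"doc_id": 101, "date": "2025-01-10"},
        {"doc_id": 102, "date": "2025-03-02"},
    ],
}
encoded = base64.b64encode("<p>Bill text</p>".encode("utf-8")).decode("ascii")

docId = latest_text_version(bill)["doc_id"]
text = decode_document(encoded)
```

Selecting by date rather than by list position keeps the choice correct even if the versions are not returned in chronological order.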
 

In [37]:
# Create a function that loops through the aiBills dictionary, gets the bill text for each bill, and then processes the text through Azure OpenAI to obtain the summary and keywords.
def createBillTextSummary(billList):
    """
    Create a dictionary of bill text objects from the bill list.
    """
    # Create a dictionary to store the bill text objects
    billTextDict = {}
    # First, loop through the legislative sessions in the bill list
    for key, value in billList.items():
        # Set the key to the legislative session period
        billTextDict[key] = {}
        # Count the number of bills in the legislative session
        billCount = len(value)
        print(f"Legislative session: {key} ({billCount} bills)")
        # Reset the counter for the number of bills processed
        i = 1        
        # Loop through the bills for the legislative session
        for billKey, billContent in value.items():
            maxRetries = 5
            attempts = 0
            success = False
            while not success and attempts < maxRetries:
                try:
                    # Get the bill text from LegiScan
                    billTextJson = legiscan.summarizeBillText(billContent)
                    success = True
                except Exception as e:
                    attempts += 1
                    print(f"  - Error fetching bill text for {billKey}: {e}")
            if success:
                # If the bill text is processed,
                print(f"- Bill {i}/{billCount} ({billKey})")
                # Add the bill text to the billTextDict dictionary
                billTextDict[key][billKey] = billTextJson
                # Increment the counter for the number of bills processed
                i += 1
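            else:
                # After maxRetries failed attempts, stop processing the remaining bills in this session
                print(f"  - Giving up on {billKey} after {maxRetries} attempts; skipping the rest of session {key}")
                break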
    # Return the billTextDict dictionary
    return billTextDict
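
The retry loop in `createBillTextSummary` tries again immediately after a failure. When failures come from rate limits or transient network errors, a pause that grows between attempts may help. The sketch below shows one such variant with exponential backoff; `call_with_retries` is a hypothetical helper, not part of the `legiscan` module, and the delay values are arbitrary.

```python
import time

def call_with_retries(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_retries attempts have failed."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < max_retries - 1:
                # Wait 1x, 2x, 4x, ... the base delay before the next attempt
                sleep(base_delay * (2 ** attempt))
    # Every attempt failed: surface the last error to the caller
    raise last_error

# Demonstration with a function that fails twice and then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

delays = []
# Record the delays instead of actually sleeping
result = call_with_retries(flaky, sleep=delays.append)
```

In the function above, the call could be wrapped as `call_with_retries(lambda: legiscan.summarizeBillText(billContent))`, with the exception handled by the caller once all attempts are exhausted.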

We now execute the `createBillTextSummary` function above (which in turn runs the `legiscan.summarizeBillText` function) for each bill in each legislative session, and store the resulting data in a new dictionary.

In [39]:
aiBillsSummaries = createBillTextSummary(aiBills)

Legislative session: 2013-2014 (3 bills)
- Bill 1/3 (AB1465)
- Bill 2/3 (SB836)
- Bill 3/3 (SB860)
Legislative session: 2017-2018 (5 bills)
- Bill 1/5 (AB1809)
- Bill 2/5 (AB2662)
- Bill 3/5 (ACR215)
- Bill 4/5 (SB843)
- Bill 5/5 (SB1470)
Legislative session: 2019-2020 (16 bills)
- Bill 1/16 (AB156)
- Bill 2/16 (AB459)
- Bill 3/16 (AB485)
- Bill 4/16 (AB594)
- Bill 5/16 (AB976)
- Bill 6/16 (AB1576)
- Bill 7/16 (AB2269)
- Bill 8/16 (AB3317)
- Bill 9/16 (AB3339)
- Bill 10/16 (ACR125)
- Bill 11/16 (SB348)
- Bill 12/16 (SB444)
- Bill 13/16 (SB730)
- Bill 14/16 (SB752)
- Bill 15/16 (SCR13)
- Bill 16/16 (SJR6)
Legislative session: 2021-2022 (15 bills)
- Bill 1/15 (AB13)
- Bill 2/15 (AB1400)
- Bill 3/15 (AB1545)
- Bill 4/15 (AB1651)
- Bill 5/15 (AB178)
- Bill 6/15 (AB179)
- Bill 7/15 (AB2224)
- Bill 8/15 (AB2826)
- Bill 9/15 (AB587)
- Bill 10/15 (SB54)
- Bill 11/15 (SB178)
- Bill 12/15 (SB179)
- Bill 13/15 (SB1018)
- Bill 14/15 (SB1216)
- Bill 15/15 (SR11)
Legislative session: 2023-2024 (81 b

Export the LegiScan query records to the `data/legis/json` directory as a JSON file for future reference.

In [40]:
# Export the AI bills summaries to a JSON file in the data/legis/json directory
with open(os.path.join(prjDirs["pathDataLegis"], "json", "aiBillsSummaries.json"), "w", encoding="utf-8") as f:
    json.dump(aiBillsSummaries, f, ensure_ascii=False, indent=4)

In [41]:
# Export the AI bill summaries to a JSON file in the data/lookup directory for reference
with open(os.path.join(prjDirs["pathDataLookup"], "aiBillsSummariesStored.json"), "w", encoding="utf-8") as f:
    json.dump(aiBillsSummaries, f, ensure_ascii=False, indent=4)

<div style = "background-color:indigo"><center>
<h1 style="font-weight:bold; color:goldenrod; border-top: 2px solid goldenrod; border-bottom: 2px solid goldenrod; padding-top: 5px; padding-bottom: 10px">End of Script</h1>
</center></div>