</center></div>
<div style = "background-color:indigo"><center>
<h1 style="font-size: 50px; font-weight: bold; color:goldenrod; border-top: 3px solid goldenrod; padding-top: 10px">AI California Legislative Policy Analysis (CALPA-AI)</h1>
<div style="font-size: 35px; font-weight: bold; color: goldenrod"> Part 2 - Preliminary Data Processing</div>
<div style="font-size: 30px; font-weight: bold; color: goldenrod; border-bottom: 3px solid goldenrod; padding-bottom: 20px">v.1.0 April 2025</div>
</center></div>

This is the main notebook for the AI California Legislative Policy Analysis (CALPA) project, which analyzes California legislative bills using natural language processing (NLP) techniques. This notebook covers the preliminary data processing steps: data loading, cleaning, and preparation for analysis.
The project is divided into several parts, each focusing on a specific aspect of the analysis. The first part covers the data loading and cleaning process, while subsequent parts focus on feature extraction, model training, and evaluation.

<h1 style="font-weight:bold; color:orangered; border-bottom: 2px solid orangered">1. Preliminaries</h1>

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">1.1 Referencing Libraries and Initialization</h2>

To reset the kernel if needed, uncomment and run the following cell:

In [None]:
#%reset

Import the Python libraries used in the project:

In [1]:
# Import required libraries
import os
from dotenv import load_dotenv
import time
from datetime import date
from datetime import datetime
import json
import mimetypes
import glob
import base64
import zipfile
import io
import requests
import pandas as pd

Load the local Python modules that contain the project's classes and functions. There are two modules:
- `calpa`: This module contains the main classes and functions for the project.
- `legiscan`: This module contains the classes and functions for the LegiScan API.

In [2]:
# Load the Calpa module located in the scripts/python/calpa directory
from calpa import Calpa, LegiScan

<h2 style="font-weight:bold; color:dodgerblue; border-bottom: 1px solid dodgerblue; padding-left: 25px">1.2. Project and Workspace Variables</h2>

<h3 style="font-weight:bold; color:lime; padding-left: 50px">Load Environment Variables</h3>

Define and maintain the project, workspace, and metadata settings. The cell below uses the `dotenv` library to load the environment variables from the `.env` file into the Python environment. These variables configure the project and workspace settings. They also include the LegiScan API key, stored in the `LEGISCAN_API_KEY` environment variable, which is used to access the LegiScan API.

In [3]:
# Load environment variables from .env file
load_dotenv()

True
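
Before any API call is made, it can be useful to check explicitly that `LEGISCAN_API_KEY` is available. The following is a minimal sketch of such a check using only the standard library. The helper name `require_env` is hypothetical and is not part of the `calpa` module.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Raises RuntimeError if the variable is unset or contains only whitespace.
    (Hypothetical helper for illustration; not part of the calpa module.)
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check the .env file."
        )
    return value
```

In the notebook, this could be called right after `load_dotenv()`, for example as `require_env("LEGISCAN_API_KEY")`, so that a missing key fails early with a clear message.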

<h3 style="font-weight:bold; color:lime; padding-left: 50px">Main Class Instantiation</h3>

Instantiate the two main classes for the project:
- `Calpa` (instantiated as `calpa`): This class is used to access the CALPA API and retrieve legislative data.
- `LegiScan` (instantiated as `legiscan`): This class is used to access the LegiScan API and retrieve legislative data.

In [4]:
# Instantiate the LegiScan and Calpa classes
calpa = Calpa()
legiscan = LegiScan()

Create the project metadata:

In [6]:
# Create project metadata for the AI project
prjMetadata = calpa.projectMetadata("AI", "1")

Project Global Settings:
- Name: California Legislative Policy Analysis
- Title: AI Legislative Policy Analysis
- Version: 1.0
- Author: Dr. Kostas Alexandridis, GISP
Data Dates
- Start Date: 2010-12-02
- End Date: 2025-04-24
- Periods: 2009-2010, 2011-2012, 2013-2014, 2015-2016, 2017-2018, 2019-2020, 2021-2022, 2023-2024, 2025-2026
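
The periods listed in the output above are consecutive two-year legislative sessions. As an illustration only, the sketch below shows one way such labels could be generated. The function name `session_periods` is hypothetical, and this is not necessarily how `calpa.projectMetadata` builds the list.

```python
def session_periods(first_start_year: int, last_start_year: int) -> list[str]:
    """Build two-year session labels such as '2009-2010', stepping two years at a time.

    (Hypothetical helper for illustration; not part of the calpa module.)
    """
    return [
        f"{year}-{year + 1}"
        for year in range(first_start_year, last_start_year + 1, 2)
    ]


# Sessions starting in 2009 through 2025, matching the range shown in the output above
periods = session_periods(2009, 2025)
print(", ".join(periods))
```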


Create the project directories dictionary

In [7]:
# Create the project directories dictionary
prjDirs = calpa.projectDirectories(os.getcwd())

Directory Global Settings:

General:
- Project (pathPrj): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA
- Admin (pathAdmin): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\admin
- Metadata (pathMetadata): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\metadata
- Analysis (pathAnalysis): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\analysis
Scripts:
- Python Calpa Module (pathScriptsCalpa): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\calpa
- Markdown Scripts (pathScriptsMd): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\markdown
- RIS Scripts (pathScriptsRis): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\ris
Data:
- Main Data (pathData): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\data
- Documents (pathDataDocs): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\data\docs
- LegiScan (pathDataLegis): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\data\legis
- LookUp (pathDataLookup): c:\Users\ktale\OneDrive\Documents\GitHub\CaLPA\data\lookup
- Markdown (pathDataMd): c:\Users\kta
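
The `prjDirs` dictionary is used in later cells by key, for example `prjDirs["pathDataLookup"]`. To make that structure concrete, here is a minimal sketch of how a dictionary with some of the keys shown above could be built from a project root. The function `build_project_dirs` is hypothetical, covers only a subset of the listed keys, and is not the implementation of `calpa.projectDirectories`.

```python
from pathlib import Path


def build_project_dirs(root: str) -> dict[str, str]:
    """Map a few directory keys to paths under the project root.

    (Hypothetical helper for illustration; the key names are taken from
    the printed output, the layout is assumed from the same output.)
    """
    base = Path(root)
    data = base / "data"
    return {
        "pathPrj": str(base),
        "pathAdmin": str(base / "admin"),
        "pathMetadata": str(base / "metadata"),
        "pathAnalysis": str(base / "analysis"),
        "pathData": str(data),
        "pathDataDocs": str(data / "docs"),
        "pathDataLegis": str(data / "legis"),
        "pathDataLookup": str(data / "lookup"),
    }
```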

<h3 style="font-weight:bold; color:lime; padding-left: 50px">Load Lookup DataFrames</h3>

Load the lookup data frames needed for the project. These are located in the `data/lookup` project directory.
Specifically, we will load the following lookup tables:

- `codebookBill`: This table contains the mapping of bill codes to their descriptions.
- `codebookRollCall`: This table contains the mapping of roll call codes to their descriptions.
- `codebookBillText`: This table contains the mapping of bill text codes to their descriptions.
- `codebookAmendment`: This table contains the mapping of amendment codes to their descriptions.
- `codebookSupplement`: This table contains the mapping of supplement codes to their descriptions.
- `codebookPerson`: This table contains the mapping of person codes to their descriptions.
- `codebookSessionList`: This table contains the mapping of session codes to their descriptions.

In [8]:
# Load the codebookBill pickle file from the data/lookup directory
codebookBill = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookBill.pkl"))

# Load the codebookRollCall pickle file from the data/lookup directory
codebookRollCall = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookRollCall.pkl"))

# Load the codebookBillText pickle file from the data/lookup directory
codebookBillText = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookBillText.pkl"))

# Load the codebookAmendment pickle file from the data/lookup directory
codebookAmendment = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookAmendment.pkl"))

# Load the codebookSupplement pickle file from the data/lookup directory
codebookSupplement = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookSupplement.pkl"))

# Load the codebookPerson pickle file from the data/lookup directory
codebookPerson = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookPerson.pkl"))

# Load the codebookSessionList pickle file from the data/lookup directory
codebookSessionList = pd.read_pickle(os.path.join(prjDirs["pathDataLookup"], "codebookSessionList.pkl"))
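
Since all seven codebooks follow the same `<name>.pkl` naming pattern, the file paths can also be derived from a list of names and checked before loading. The sketch below assumes that naming pattern holds. The helper names `codebook_paths` and `missing_codebooks` are hypothetical and not part of the `calpa` module.

```python
import os

# Codebook names as used in the cell above
CODEBOOK_NAMES = [
    "codebookBill",
    "codebookRollCall",
    "codebookBillText",
    "codebookAmendment",
    "codebookSupplement",
    "codebookPerson",
    "codebookSessionList",
]


def codebook_paths(lookup_dir, names=CODEBOOK_NAMES):
    """Map each codebook name to its expected .pkl path inside lookup_dir."""
    return {name: os.path.join(lookup_dir, f"{name}.pkl") for name in names}


def missing_codebooks(lookup_dir, names=CODEBOOK_NAMES):
    """Return the codebook names whose pickle file is not present on disk."""
    return [
        name
        for name, path in codebook_paths(lookup_dir, names).items()
        if not os.path.isfile(path)
    ]
```

In the notebook, `missing_codebooks(prjDirs["pathDataLookup"])` could be checked first, and the tables could then be loaded into one dictionary with `{name: pd.read_pickle(path) for name, path in codebook_paths(prjDirs["pathDataLookup"]).items()}`. Keeping separate variables, as the cell above does, is equally valid; the dictionary form mainly reduces repetition.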

<h3 style="font-weight:bold; color:lime; padding-left: 50px">Load Stored Data</h3>

Load the stored data from the `data` directory. This includes the following data files:
- `sessionListStored`: This file contains the list of legislative sessions.
- `sessionPeopleStored`: This file contains the list of legislative session people.
- `datasetListStored`: This file contains the list of legislative datasets.
- `masterListRawStored`: This file contains the raw legislative master list.
- `masterListStored`: This file contains the legislative master list.
- `aiBillListStored`: This file contains the list of AI legislative bills.
- `aiBills`: This file contains the AI legislative bills data.
- `aiBillsSummariesStored`: This file contains the list of AI legislative bill text summaries.

In [10]:
# Obtain the stored sessions list from JSON dictionary on disk (data/lookup directory)
sessionListStored = legiscan.getStoredData(dataType = "session")

# Obtain the stored session People list from JSON dictionary on disk (data/lookup directory)
sessionPeopleStored = legiscan.getStoredData(dataType = "people")

# Obtain the stored dataset list from JSON dictionary on disk (data/lookup directory)
datasetListStored = legiscan.getStoredData(dataType = "dataset")

# Get the stored raw master list from JSON dictionary on disk (data/lookup directory)
masterListRawStored = legiscan.getStoredData(dataType = "master", raw = True)
# Get the stored master list from JSON dictionary on disk (data/lookup directory)
masterListStored = legiscan.getStoredData(dataType = "master", raw = False)

# Get the AI monitoring list from disk (data/lookup directory)
aiBillListStored = legiscan.getStoredData(dataType = "bills", project = "AI")

# Get the full list of AI bills from disk (data/legis/json directory)
aiBills = legiscan.getStoredData(dataType = "data", project = "AI")

# Get the AI bill summaries list from disk (data/lookup directory)
aiBillsSummariesStored = legiscan.getStoredData(dataType = "summaries", project = "AI")
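
After loading this many objects, a quick overview of what was returned can help catch an empty or missing file early. The sketch below reports the type and size of each object. It assumes nothing about what `getStoredData` returns beyond ordinary Python objects; the helper name `summarize_loaded` is hypothetical, and the demo values are stand-ins rather than real project data.

```python
def summarize_loaded(objects: dict) -> dict:
    """Return {name: (type name, size)} for each object.

    Size is len(obj), or None when the object has no length (for example None).
    (Hypothetical helper for illustration; not part of the calpa module.)
    """
    summary = {}
    for name, obj in objects.items():
        try:
            size = len(obj)
        except TypeError:
            size = None
        summary[name] = (type(obj).__name__, size)
    return summary


# Stand-in values for demonstration only; in the notebook these would be
# the variables loaded in the cell above.
demo = {"sessionListStored": ["a", "b"], "aiBills": {}, "notLoaded": None}
for name, (kind, size) in summarize_loaded(demo).items():
    print(f"{name}: {kind}, size={size}")
```

In the notebook, the same function could be called with a dictionary such as `{"sessionListStored": sessionListStored, "aiBills": aiBills}` to review the loaded objects in one place.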