# Downloading bill data from LegiScan

[LegiScan](https://legiscan.com/) is a legislative tracking service. From their about page:
    
> LegiScan launched to support the release of the national LegiScan data service, providing the nation's first impartial real-time legislative tracking service designed for both public citizens and government affairs professionals across all sectors in organizations large and small. Utilizing the LegiScan API, having nearly 20 years of development maturity, allows us to provide monitoring of every bill in the 50 states and Congress. Giving our users and clients a central and uniform interface with the ability to easily track a wide array of legislative information. Paired with one of the country's most powerful national full bill text legislative search engines.

We're going to use their API to **download data on over a million different pieces of legislation in the US.**


## Imports

In [4]:
import zipfile
import base64
import io
import glob
import time
import json
import os
import requests
import mimetypes

In [5]:
import csv

In [36]:
pip install pypdf

Collecting pypdf
  Downloading pypdf-3.4.1-py3-none-any.whl (241 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.4.1
Note: you may need to restart the kernel to use updated packages.


In [37]:
from pypdf import PdfReader

In [42]:
from base64 import b64decode

In [6]:
import pandas as pd
import numpy as np

In [253]:
from bs4 import BeautifulSoup

In [93]:
# United States of America Python Dictionary to translate States,
# Districts & Territories to Two-Letter codes and vice versa.
#
# Canonical URL: https://gist.github.com/rogerallen/1583593
#
# Dedicated to the public domain.  To the extent possible under law,
# Roger Allen has waived all copyright and related or neighboring
# rights to this code.  Data originally from Wikipedia at the url:
# https://en.wikipedia.org/wiki/ISO_3166-2:US
#
# Automatically Generated 2021-09-11 18:04:36 via Jupyter Notebook from
# https://gist.github.com/rogerallen/d75440e8e5ea4762374dfd5c1ddf84e0 

us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
    "US": "US"
}

## pylegiscan

To talk to LegiScan's API, we're borrowing some code from [pylegiscan](https://github.com/poliquin/pylegiscan). Since it isn't a package you can install with `pip`, it wound up being easier for distribution to just cut and paste it here.

In [7]:
# Taken from https://github.com/poliquin/pylegiscan/blob/master/pylegiscan/legiscan.py

import os
import json
import requests
from urllib.parse import urlencode
from urllib.parse import quote_plus

# current aggregate status of bill
BILL_STATUS = {1: "Introduced",
               2: "Engrossed",
               3: "Enrolled",
               4: "Passed",
               5: "Vetoed",
               6: "Failed/Dead"}

# significant steps in bill progress.
BILL_PROGRESS = {1: "Introduced",
                 2: "Engrossed",
                 3: "Enrolled",
                 4: "Passed",
                 5: "Vetoed",
                 6: "Failed/Dead",
                 7: "Veto Override",
                 8: "Chapter/Act/Statute",
                 9: "Committee Referral",
                10: "Committee Report Pass",
                11: "Committee Report DNP"}


"""
Interact with LegiScan API.

"""

# a helpful list of valid legiscan state abbreviations (no Puerto Rico)
STATES = ['ak', 'al', 'ar', 'az', 'ca', 'co', 'ct', 'dc', 'de', 'fl', 'ga',
          'hi', 'ia', 'id', 'il', 'in', 'ks', 'ky', 'la', 'ma', 'md', 'me',
          'mi', 'mn', 'mo', 'ms', 'mt', 'nc', 'nd', 'ne', 'nh', 'nj', 'nm',
          'nv', 'ny', 'oh', 'ok', 'or', 'pa', 'ri', 'sc', 'sd', 'tn', 'tx',
          'ut', 'va', 'vt', 'wa', 'wi', 'wv', 'wy']

class LegiScanError(Exception):
    pass

class LegiScan(object):
    BASE_URL = 'http://api.legiscan.com/?key={0}&op={1}&{2}'

    def __init__(self, apikey=None):
        """LegiScan API.  State parameters should always be passed as
           USPS abbreviations.  Bill numbers and abbreviations are case
           insensitive.  Register for API at http://legiscan.com/legiscan
        """
        # fall back to the key stored in the local config module
        # (config must be imported before LegiScan() is called without a key)
        if apikey is None:
            apikey = config.LEGISCAN_API_KEY
        self.key = apikey.strip()

    def _url(self, operation, params=None):
        """Build a URL for querying the API."""
        if not isinstance(params, str) and params is not None:
            params = urlencode(params)
        elif params is None:
            params = ''
        return self.BASE_URL.format(self.key, operation, params)

    def _get(self, url):
        """Get and parse JSON from API for a url."""
        req = requests.get(url)
        if not req.ok:
            raise LegiScanError('Request returned {0}: {1}'\
                    .format(req.status_code, url))
        data = json.loads(req.content)
        if data['status'] == "ERROR":
            raise LegiScanError(data['alert']['message'])
        return data

    def get_session_list(self, state):
        """Get list of available sessions for a state."""
        url = self._url('getSessionList', {'state': state})
        data = self._get(url)
        return data['sessions']

    def get_dataset_list(self, state=None, year=None):
        """Get list of available datasets, with optional state and year filtering.
        """
        # send whichever filters were supplied (state, year, or both)
        params = {}
        if state is not None:
            params['state'] = state
        if year is not None:
            params['year'] = year
        url = self._url('getDatasetList', params)
        data = self._get(url)
        # return the list of dataset descriptions
        return data['datasetlist']

    def get_dataset(self, id, access_key):
        """Get a single dataset by id, using the access key issued for it.
        """
        url = self._url('getDataset', {'id': id, 'access_key': access_key})
        data = self._get(url)
        # return the dataset record
        return data['dataset']
      
    def get_master_list(self, state=None, session_id=None):
        """Get list of bills for the current session in a state or for
           a given session identifier.
        """
        if state is not None:
            url = self._url('getMasterList', {'state': state})
        elif session_id is not None:
            url = self._url('getMasterList', {'id': session_id})
        else:
            raise ValueError('Must specify session identifier or state.')
        data = self._get(url)
        # return a list of the bills
        return [data['masterlist'][i] for i in data['masterlist']]

    def get_bill(self, bill_id=None, state=None, bill_number=None):
        """Get primary bill detail information including sponsors, committee
           references, full history, bill text, and roll call information.

           This function expects either a bill identifier or a state and bill
           number combination.  The bill identifier is preferred, and required
           for fetching bills from prior sessions.
        """
        if bill_id is not None:
            url = self._url('getBill', {'id': bill_id})
        elif state is not None and bill_number is not None:
            url = self._url('getBill', {'state': state, 'bill': bill_number})
        else:
            raise ValueError('Must specify bill_id or state and bill_number.')
        return self._get(url)['bill']

    def get_bill_text(self, doc_id):
        """Get bill text, including date, draft revision information, and
           MIME type.  Bill text is base64 encoded to allow for PDF and Word
           data transfers.
        """
        url = self._url('getBillText', {'id': doc_id})
        return self._get(url)['text']

    def get_amendment(self, amendment_id):
        """Get amendment text including date, adoption status, MIME type, and
           title/description information.  The amendment text is base64 encoded
           to allow for PDF and Word data transfer.
        """
        url = self._url('getAmendment', {'id': amendment_id})
        return self._get(url)['amendment']

    def get_supplement(self, supplement_id):
        """Get supplement text including type of supplement, date, MIME type
           and text/description information.  Supplement text is base64 encoded
           to allow for PDF and Word data transfer.
        """
        url = self._url('getSupplement', {'id': supplement_id})
        return self._get(url)['supplement']

    def get_roll_call(self, roll_call_id):
        """Roll call detail for individual votes and summary information."""
        data = self._get(self._url('getRollcall', {'id': roll_call_id}))
        return data['roll_call']

    def get_sponsor(self, people_id):
        """Sponsor information including name, role, and a followthemoney.org
           person identifier.
        """
        url = self._url('getSponsor', {'id': people_id})
        return self._get(url)['person']

    def search(self, state, bill_number=None, query=None, year=2, page=1):
        """Get a page of results for a search against the LegiScan full text
           engine; returns a paginated result set.

           Specify a bill number or a query string.  Year can be an exact year
           or a number between 1 and 4, inclusive.  These integers have the
           following meanings:
               1 = all years
               2 = current year, the default
               3 = recent years
               4 = prior years
           Page is the result set page number to return.
        """
        if bill_number is not None:
            params = {'state': state, 'bill': bill_number}
        elif query is not None:
            params = {'state': state, 'query': query,
                      'year': year, 'page': page}
        else:
            raise ValueError('Must specify bill_number or query')
        data = self._get(self._url('search', params))['searchresult']
        # return a summary of the search and the results as a dictionary
        summary = data.pop('summary')
        results = {'summary': summary, 'results': [data[i] for i in data]}
        return results

    def __str__(self):
        return '<LegiScan API {0}>'.format(self.key)

    def __repr__(self):
        return str(self)

# Connect to LegiScan

Using pylegiscan, you just pass your API key to `LegiScan` and you're good to go. I keep mine in a local `config.py` module as `LEGISCAN_API_KEY`, but you can also just pass your key to `LegiScan` as a string.

In [8]:
import config

api_key = config.LEGISCAN_API_KEY
legis = LegiScan(api_key)
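
If you would rather not keep a `config.py` file around, one alternative is to read the key from an environment variable. This is only a sketch: the helper name `load_api_key` and the variable name `LEGISCAN_API_KEY` are assumptions for illustration and are not part of pylegiscan.

```python
import os


def load_api_key(env=None, var_name="LEGISCAN_API_KEY"):
    """Return the API key stored in an environment variable.

    `env` defaults to os.environ; passing a plain dict makes this easy to test.
    The variable name is an assumption, use whatever you actually exported.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key
```

With that in place, `legis = LegiScan(load_api_key())` would replace the `config` import.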

If you wanted to search for bills based on state or text, that's easy to do.

## Search for bills containing "biological sex"

In [9]:
bills = legis.search(state='ALL', query='biological sex')
bills['summary'] # how many results did we get?
# this returns 173 results; the LGBTQ+ legislative tracker lists 368
# https://docs.google.com/spreadsheets/d/1fTxHLjBa86GA7WCT-V6AbEMGRFPMJndnaVGoZZX4PMw/edit#gid=0
# but I think this is OK to start with

{'page': '1 of 4',
 'range': '1 - 50',
 'relevancy': '100% - 98%',
 'count': 173,
 'page_current': '1',
 'page_total': 4,
 'query': '((Zbiolog:(pos=1) AND Zsex:(pos=2)))'}

## Print their titles

In [8]:
for b in bills['results']:
    print(b['title'])

Establishes the "Missouri Save Adolescents from Experimentation (SAFE) Act"
Gender Reassignment Surgery
Creates provisions relating to gender transition procedures
Transgender Medical Treatments and Procedures Amendments
Creates provisions relating to gender transition procedures
Gender transition procedures for minors.
Establishes the "Missouri Save Adolescents from Experimentation (SAFE) Act"
Health care; creating the Oklahoma Save Adolescents from Experimentation (SAFE) Act; prohibiting gender transition procedures; providing for administrative and civil enforcement. Emergency.
Adopt the Let Them Grow Act
Establishes the "Missouri Child and Adolescent Protection Act"
Prohibiting certain medical practices
Public health and safety; defining terms; health care professionals; gender transition procedures; referrals; exceptions; public funds; Medicaid program reimbursements; felony penalties; statute of limitations; unprofessional conduct; license revocation; statute of limitations; clai

## Get the bill ID and sponsor for the first result

In [10]:
bill_id = bills['results'][0]['bill_id']
bill_detail = legis.get_bill(bill_id=bill_id)

In [11]:
people_id = bill_detail['sponsors'][0]['people_id']
sponsor = legis.get_sponsor(people_id)
print(sponsor['name'])

Mike Moon


In [64]:
doc_id = bill_detail['texts'][0]['doc_id']
doc_id

2631259

## Get all the bill IDs for my search query

In [11]:
# get the bill ids from this page of search results (50 per page)
bill_ids = [result['bill_id'] for result in bills['results']]

bill_ids

[1635057,
 1637125,
 1675866,
 1639439,
 1640339,
 1650034,
 1634951,
 1668356,
 1662540,
 1696931,
 1656233,
 1634447,
 1651270,
 1634198,
 1702869,
 1643087,
 1654313,
 1665859,
 1650067,
 1663548,
 1635919,
 1649602,
 1634572,
 1683622,
 1635148,
 1715006,
 1640281,
 1660616,
 1698406,
 1668929,
 1642575,
 1640216,
 1633355,
 1642113,
 1686385,
 1674808,
 1689600,
 1632730,
 1645224,
 1700889,
 1698754,
 1659858,
 1634331,
 1665893,
 1649395,
 1632709,
 1660515,
 1699553,
 1669660,
 1668972]

# Get all the doc IDs for my query

In [12]:
# fetch the full bill detail for every bill id (one API call per bill)
bill_details = [legis.get_bill(bill_id=bill_id) for bill_id in bill_ids]
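
Each `get_bill` call is a separate API request, so re-running this cell repeats all 50 of them. One way to avoid that is to cache each response on disk. A minimal sketch, assuming the responses are plain JSON-serialisable dicts; `cached_fetch`, `cache_dir` and `name` are names I made up for illustration.

```python
import json
from pathlib import Path


def cached_fetch(cache_dir, name, fetch):
    """Return the result of fetch(), saving it to disk so later calls skip it.

    `fetch` is any zero-argument callable returning JSON-serialisable data,
    e.g. lambda: legis.get_bill(bill_id=1635057).
    """
    path = Path(cache_dir) / f"{name}.json"
    if path.exists():
        # Cache hit: reuse the saved response instead of calling the API
        return json.loads(path.read_text(encoding="utf-8"))
    result = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

In the loop that would look like `cached_fetch("cache", f"bill-{bill_id}", lambda: legis.get_bill(bill_id=bill_id))`.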

In [229]:
# get the doc id of the first listed text for each bill,
# reusing bill_details from the previous cell instead of fetching it again
doc_ids = [detail['texts'][0]['doc_id'] for detail in bill_details]

doc_ids

[2631259,
 2616404,
 2658485,
 2619414,
 2620652,
 2629930,
 2631144,
 2650157,
 2644341,
 2684043,
 2637391,
 2614865,
 2631317,
 2614535,
 2692266,
 2623314,
 2637539,
 2647692,
 2629969,
 2645325,
 2615718,
 2629675,
 2614977,
 2666618,
 2632040,
 2708885,
 2620606,
 2680361,
 2686001,
 2650792,
 2622929,
 2620550,
 2613416,
 2622424,
 2670276,
 2674746,
 2696929,
 2612852,
 2625328,
 2689368,
 2686660,
 2641747,
 2614708,
 2647723,
 2629442,
 2612832,
 2642392,
 2687762,
 2651535,
 2650842]

## Get all the bill text from these doc IDs

In [230]:
# download the text document for each doc id (one API call per document)
bill_texts = [legis.get_bill_text(doc_id) for doc_id in doc_ids]

bill_texts[0:3]

[{'doc_id': 2631259,
  'bill_id': 1635057,
  'date': '0000-00-00',
  'type': 'Introduced',
  'type_id': 1,
  'mime': 'application/pdf',
  'mime_id': 2,
  'text_size': 347111,
  'text_hash': '2b017580afa21ca9a37198a3291693d5',
  'doc': 'JVBERi0xLjUNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFuZyhlbi1VUykgL1N0cnVjdFRyZWVSb290IDMzIDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4+Pg0KZW5kb2JqDQoyIDAgb2JqDQo8PC9UeXBlL1BhZ2VzL0NvdW50IDUvS2lkc1sgMyAwIFIgMTkgMCBSIDIxIDAgUiAyMyAwIFIgMjUgMCBSXSA+Pg0KZW5kb2JqDQozIDAgb2JqDQo8PC9UeXBlL1BhZ2UvUGFyZW50IDIgMCBSL1Jlc291cmNlczw8L0ZvbnQ8PC9GMSA1IDAgUi9GMiA5IDAgUi9GMyAxMSAwIFIvRjQgMTMgMCBSL0Y1IDE1IDAgUi9GNiAxNyAwIFI+Pi9FeHRHU3RhdGU8PC9HUzcgNyAwIFIvR1M4IDggMCBSPj4vUHJvY1NldFsvUERGL1RleHQvSW1hZ2VCL0ltYWdlQy9JbWFnZUldID4+L01lZGlhQm94WyAwIDAgNjEyIDc5Ml0gL0NvbnRlbnRzIDQgMCBSL0dyb3VwPDwvVHlwZS9Hcm91cC9TL1RyYW5zcGFyZW5jeS9DUy9EZXZpY2VSR0I+Pi9UYWJzL1MvU3RydWN0UGFyZW50cyAwPj4NCmVuZG9iag0KNCAwIG9iag0KPDwvRmlsdGVyL0ZsYXRlRGVjb2RlL0xlbmd0aCAyNjQ1Pj4NCnN0cmVh

## Check the length of the list

In [234]:
len(bill_texts)

50

## Function to decode from base64 into PDF file

In [None]:
# create function
def decodepdf(bill_text):
    # The API returns the document as a Base64 string
    b64 = bill_text['doc']

    # Decode the Base64 string, making sure that it contains only valid characters
    pdf_bytes = b64decode(b64, validate=True)

    # Basic sanity check that the result looks like a PDF.
    # The magic number (file signature) is not a fully reliable way to validate PDF files,
    # and Base64 from an untrusted source should be sanitized before use.
    if pdf_bytes[0:4] != b'%PDF':
        raise ValueError('Missing the PDF file signature')

    bill_id_name = bill_text['bill_id']

    # Write the PDF contents to a local file
    with open(f"bill_id-{bill_id_name}.pdf", "wb") as f:
        f.write(pdf_bytes)

# Merge it with function to get text from PDF and output it to a txt file

In [245]:
# create function
def decodepdftotext(bill_text):
    # The API returns the document as a Base64 string
    b64 = bill_text['doc']

    # Decode the Base64 string, making sure that it contains only valid characters
    pdf_bytes = b64decode(b64, validate=True)

    # Basic sanity check that the result looks like a PDF.
    # The magic number (file signature) is not a fully reliable way to validate PDF files,
    # and Base64 from an untrusted source should be sanitized before use.
    if pdf_bytes[0:4] != b'%PDF':
        raise ValueError('Missing the PDF file signature')

    bill_id_name = bill_text['bill_id']

    # Write the PDF contents to a local file
    with open(f"bill_id-{bill_id_name}.pdf", "wb") as f:
        f.write(pdf_bytes)

    # Extract the text page by page and save it next to the PDF
    reader = PdfReader(f"bill_id-{bill_id_name}.pdf")
    text = ""
    for page in reader.pages:
        text = text + page.extract_text()
    with open(f"bill_id-{bill_id_name}.txt", "w") as g:
        g.write(text)

## Do it for all the bill texts

In [None]:
for bill_text in bill_texts:
    # only PDFs are handled here; the HTML documents are dealt with below
    if bill_text['mime'] == "application/pdf":
        decodepdftotext(bill_text)
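
The loop above only handles PDFs. One way to keep every document, whatever its type, is to choose the file extension from the `mime` field before writing the decoded bytes. A minimal sketch, relying only on the `mime`, `doc` and `bill_id` keys seen in the output above; the helper name `save_document`, the `EXTENSIONS` table and the `.bin` fallback are made up for illustration.

```python
from base64 import b64decode
from pathlib import Path

# Only the two MIME types that show up in this notebook; extend as needed
EXTENSIONS = {"application/pdf": ".pdf", "text/html": ".html"}


def save_document(bill_text, out_dir="."):
    """Decode bill_text['doc'] and write it with an extension chosen by MIME type."""
    raw = b64decode(bill_text["doc"], validate=True)
    ext = EXTENSIONS.get(bill_text["mime"], ".bin")  # ".bin" is an arbitrary fallback
    path = Path(out_dir) / f"bill_id-{bill_text['bill_id']}{ext}"
    path.write_bytes(raw)
    return path
```

Text extraction can then be a separate step that looks at the suffix of the saved file.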

Finally I extracted all the text from these PDFs. Now how do I deal with the HTML? Let's make a list of just those so I can deal with them properly:

In [None]:
for i in range(33):
    bill_id = bill_texts[i]['bill_id']

# Deal with the HTML-formatted bill texts

In [None]:
htmlbills = []
for i in range(0,len(bill_texts)):
    if(bill_texts[i]['mime'] == "text/html"):
        htmlbills.append(bill_texts[i])
htmlbills[0:3]

In [None]:
htmlbills[0]

In [274]:
def b64tohtml(bill_text):
    # The API returns the HTML document as a Base64 string
    b64 = bill_text['doc']

    # Decode the Base64 string, making sure that it contains only valid characters
    bytes = b64decode(b64, validate=True)

    bill_id_name = bill_text['bill_id']
    
    # Write the HTML contents to a local file
    h = open(f"bill_id-{bill_id_name}.html", "wb")
    print(h)
    h.write(bytes)
    h.close()
    
    # Re-open the file in text mode (default encoding) and strip the markup
    with open(f"bill_id-{bill_id_name}.html") as fp:
        soup = BeautifulSoup(fp)
    # Save the plain text next to the HTML file
    i = open(f"bill_id-{bill_id_name}.txt", "w")
    i.write(soup.get_text())
    i.close()

In [285]:
b64tohtml(htmlbills[9])

<_io.BufferedWriter name='bill_id-1686385.html'>


UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 8910: invalid start byte
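
The error comes from reopening the saved file in text mode, which tries to decode it as UTF-8. Bytes like `0xa0` and `0x96` are not valid UTF-8 start bytes, but they are ordinary characters in Windows-1252 (a non-breaking space and an en dash), so my guess is that some of these HTML files use that encoding. A minimal sketch of a fallback decoder under that assumption; `decode_html_bytes` is a hypothetical helper, not something from pylegiscan or BeautifulSoup.

```python
def decode_html_bytes(raw, fallback="cp1252"):
    """Decode bytes as UTF-8, falling back to another encoding on failure.

    The cp1252 fallback is an assumption based on the failing bytes (0xa0, 0x96).
    errors="replace" keeps the fallback from raising on bytes cp1252 leaves undefined.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace")
```

The decoded string can then go straight into `BeautifulSoup(decode_html_bytes(raw), "html.parser")`, skipping the text-mode `open` entirely. BeautifulSoup also accepts raw bytes and tries to detect the encoding itself, which may be enough on its own.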

In [266]:
len(htmlbills)

17

In [265]:
for i in range(len(htmlbills)):
    b64tohtml(htmlbills[i])

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 12504: invalid start byte

{'doc_id': 2670276,
 'bill_id': 1686385,
 'date': '2023-01-30',
 'type': 'Introduced',
 'type_id': 1,
 'mime': 'text/html',
 'mime_id': 1,
 'text_size': 14317,
 'text_hash': '42e6da8617f25b52bfd92053fd37c28a',
 'doc': 'PHN0eWxlPg0KIEBmb250LWZhY2UNCgl7Zm9udC1mYW1pbHk6IkNhbWJyaWEgTWF0aCI7DQoJcGFub3NlLTE6MiA0IDUgMyA1IDQgNiAzIDIgNDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OkxldHRlci1Hb3RoaWMtRHJhZnRpbmc7DQoJcGFub3NlLTE6MCAwIDAgMCAwIDAgMCAwIDAgMDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OkxldHRlci1Hb3RoaWMtVXBwZXItRHJhZnRpbmc7DQoJcGFub3NlLTE6MCAwIDAgMCAwIDAgMCAwIDAgMDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OlRhaG9tYTsNCglwYW5vc2UtMToyIDExIDYgNCAzIDUgNCA0IDIgNDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OiJMZXR0ZXIgR290aGljLURyYWZ0aW5nIjsNCglwYW5vc2UtMTowIDAgMCAwIDAgMCAwIDAgMCAwO30NCiAvKiBTdHlsZSBEZWZpbml0aW9ucyAqLw0KIHAuTXNvTm9ybWFsLCBsaS5Nc29Ob3JtYWwsIGRpdi5Nc29Ob3JtYWwNCgl7bWFyZ2luOjBpbjsNCgl0ZXh0LWFsaWduOmp1c3RpZnk7DQoJZm9udC1zaXplOjEwLjBwdDsNCglmb250LWZhbWlseTpMZXR0ZXItR290aGljLURyYWZ0aW5nOw0KCWxheW91d

In [271]:
with open('bill_id-1633355.html') as f:
    print(f)

<_io.TextIOWrapper name='bill_id-1633355.html' mode='r' encoding='UTF-8'>


In [286]:
htmlbills[9]

{'doc_id': 2670276,
 'bill_id': 1686385,
 'date': '2023-01-30',
 'type': 'Introduced',
 'type_id': 1,
 'mime': 'text/html',
 'mime_id': 1,
 'text_size': 14317,
 'text_hash': '42e6da8617f25b52bfd92053fd37c28a',
 'doc': 'PHN0eWxlPg0KIEBmb250LWZhY2UNCgl7Zm9udC1mYW1pbHk6IkNhbWJyaWEgTWF0aCI7DQoJcGFub3NlLTE6MiA0IDUgMyA1IDQgNiAzIDIgNDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OkxldHRlci1Hb3RoaWMtRHJhZnRpbmc7DQoJcGFub3NlLTE6MCAwIDAgMCAwIDAgMCAwIDAgMDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OkxldHRlci1Hb3RoaWMtVXBwZXItRHJhZnRpbmc7DQoJcGFub3NlLTE6MCAwIDAgMCAwIDAgMCAwIDAgMDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OlRhaG9tYTsNCglwYW5vc2UtMToyIDExIDYgNCAzIDUgNCA0IDIgNDt9DQpAZm9udC1mYWNlDQoJe2ZvbnQtZmFtaWx5OiJMZXR0ZXIgR290aGljLURyYWZ0aW5nIjsNCglwYW5vc2UtMTowIDAgMCAwIDAgMCAwIDAgMCAwO30NCiAvKiBTdHlsZSBEZWZpbml0aW9ucyAqLw0KIHAuTXNvTm9ybWFsLCBsaS5Nc29Ob3JtYWwsIGRpdi5Nc29Ob3JtYWwNCgl7bWFyZ2luOjBpbjsNCgl0ZXh0LWFsaWduOmp1c3RpZnk7DQoJZm9udC1zaXplOjEwLjBwdDsNCglmb250LWZhbWlseTpMZXR0ZXItR290aGljLURyYWZ0aW5nOw0KCWxheW91d

In [287]:
def b64tohtmlonly(bill_text):
    # The API returns the HTML document as a Base64 string
    b64 = bill_text['doc']

    # Decode the Base64 string, making sure that it contains only valid characters
    bytes = b64decode(b64, validate=True)

    bill_id_name = bill_text['bill_id']
    
    # Write the HTML contents to a local file
    h = open(f"bill_id-{bill_id_name}.html", "wb")
    print(h)
    h.write(bytes)
    h.close()

In [290]:
# 8 and 9 are the problematic ones
b64tohtmlonly(htmlbills[8])
b64tohtmlonly(htmlbills[9])

<_io.BufferedWriter name='bill_id-1633355.html'>
<_io.BufferedWriter name='bill_id-1686385.html'>


In [291]:
with open('bill_id-1633355.html') as f:
    print(f)

<_io.TextIOWrapper name='bill_id-1633355.html' mode='r' encoding='UTF-8'>


In [292]:
bill_details[0]

{'bill_id': 1635057,
 'change_hash': 'e09e6c58de86d4d7f02c80d880b2fd09',
 'session_id': 2012,
 'session': {'session_id': 2012,
  'state_id': 25,
  'year_start': 2023,
  'year_end': 2023,
  'prefile': 0,
  'sine_die': 0,
  'prior': 0,
  'special': 0,
  'session_tag': 'Regular Session',
  'session_title': '2023 Regular Session',
  'session_name': '2023 Regular Session'},
 'url': 'https://legiscan.com/MO/bill/SB49/2023',
 'state_link': 'https://www.senate.mo.gov/23info/BTS_Web/Bill.aspx?SessionType=R&BillID=44407',
 'completed': 0,
 'status': 1,
 'status_date': '2023-01-04',
 'progress': [{'date': '2023-01-04', 'event': 1},
  {'date': '2023-01-12', 'event': 9}],
 'state': 'MO',
 'state_id': 25,
 'bill_number': 'SB49',
 'bill_type': 'B',
 'bill_type_id': '1',
 'body': 'S',
 'body_id': 60,
 'current_body': 'S',
 'current_body_id': 60,
 'title': 'Establishes the "Missouri Save Adolescents from Experimentation (SAFE) Act"',
 'description': 'Establishes the "Missouri Save Adolescents from Expe

In [294]:
# test one file to see if it'll work
with open("billtxts/1632709.txt", "rb") as f:
    txt = f.readlines()
print(txt)

[b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\n', b'\n', b'\xc2\xa0\n', b'\n', b'\t\tBy:\xc2\xa0Swanson\n', b'H.B.\xc2\xa0No.\xc2\xa023\n', b'\n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\n', b'\n', b'\xc2\xa0\n', b'\n', b'\n', b'\n', b'\xc2\xa0\t\t\n', b'\t\t\t\n', b'\n', b'A BILL TO BE ENTITLED\n', b'\n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\t\t\t\n', b'\n', b'AN ACT\n', b'\n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\t\t\t\n', b'relating to participation in athletic activities based on \n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\t\t\t\n', b'biological sex; providing a civil right to action for K-12 athletes \n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\t\t\t\n', b'and college athletes.\n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\t\t\t\n', b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0BE IT ENACTED BY THE LEGISLATURE OF THE STATE OF TEXAS:\n', b'\n', b'\n', b'\n', b'\xc2\xa0\n', b'\t\t\t\n', b'\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\xa0\xc2\x
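
The `\xc2\xa0` sequences are UTF-8 non-breaking spaces, and most of the lines are empty. Here is a rough sketch of tidying that up before analysis; `tidy_lines` is a name I made up, and it assumes the file really is UTF-8, as this one appears to be.

```python
def tidy_lines(raw_lines):
    """Decode byte lines, turn non-breaking spaces into spaces, drop blank lines."""
    cleaned = []
    for raw in raw_lines:
        line = raw.decode("utf-8").replace("\u00a0", " ")
        line = " ".join(line.split())  # collapse tabs and runs of spaces
        if line:
            cleaned.append(line)
    return cleaned
```

Calling `tidy_lines(txt)` on the list read above would give a list of plain strings with one entry per non-empty line.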

# Add text from txt files to bill details

In [13]:
for i in range(len(bill_details)):
    filename = bill_details[i]['bill_id']
    with open(f"billtxts/{filename}.txt", "rb") as f:
        bill_details[i]['text'] = f.readlines()

In [296]:
bill_details[0]['text']

[b' \n',
 b'FIRST REGULAR SESSION  \n',
 b'SENATE BILL NO. 49  \n',
 b'102ND GENERAL ASSEMBLY   \n',
 b'INTRODUCED BY SENATOR MOON.  \n',
 b'0202S.01I  KRISTINA MARTIN, Secretary   \n',
 b'AN ACT  \n',
 b'To amend chapter 191, RSMo, by adding thereto one new section relating to gender transition \n',
 b'procedures.  \n',
 b' \n',
 b'Be it enacted by the General Assembly of the State of Missouri, as follows:  \n',
 b'     Section A.  Chapter 191, RSMo, is amended by adding thereto 1 \n',
 b'one new section, to be known as section 191.1720, to read as 2 \n',
 b'follows: 3 \n',
 b'     191.1720.   1.  This section shall be known and may be  1 \n',
 b'cited as the "Missouri Save Adolescents from Experimentation  2 \n',
 b'(SAFE) Act".  3 \n',
 b'     2.  For purposes of this section, the following t erms  4 \n',
 b'mean: 5 \n',
 b'     (1)  "Biological sex", the biological indication of  6 \n',
 b'male or female in the context of reproductive potential or  7 \n',
 b'capacity, such as sex c

In [298]:
type(bill_details[0])

dict

In [14]:
keys = bill_details[0].keys()

with open('bill_details.csv','w',newline='') as output_file:
    dict_writer = csv.DictWriter(output_file,keys)
    dict_writer.writeheader()
    dict_writer.writerows(bill_details)    
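
One caveat: `csv.DictWriter` converts nested values such as `session`, `progress` or the `text` list with `str()`, so they land in the CSV as Python reprs that are awkward to load back. If that becomes a problem, JSON Lines is one alternative. A sketch, with `write_jsonl` and `read_jsonl` as hypothetical helper names, assuming any bytes values are UTF-8 text.

```python
import json


def write_jsonl(records, path):
    """Write one JSON object per line; bytes values are decoded as UTF-8 text."""
    def default(value):
        # json cannot serialise bytes, so decode them (assumes UTF-8 content)
        if isinstance(value, bytes):
            return value.decode("utf-8", errors="replace")
        raise TypeError(f"Cannot serialise {type(value).__name__}")

    with open(path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record, default=default) + "\n")


def read_jsonl(path):
    """Read the file back into a list of dicts."""
    with open(path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]
```

Nested dicts and lists survive the round trip, at the cost of no longer opening directly in a spreadsheet.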

# More stuff below

## Get the bill info for the first result
This is a dict.

In [18]:
print(bills['results'][0])

{'relevance': 100, 'state': 'MO', 'bill_number': 'SB49', 'bill_id': 1635057, 'change_hash': 'e09e6c58de86d4d7f02c80d880b2fd09', 'url': 'https://legiscan.com/MO/bill/SB49/2023', 'text_url': 'https://legiscan.com/MO/text/SB49/2023', 'research_url': 'https://legiscan.com/MO/research/SB49/2023', 'last_action_date': '2023-02-27', 'last_action': 'Formal Calendar S Bills for Perfection', 'title': 'Establishes the "Missouri Save Adolescents from Experimentation (SAFE) Act"'}


## Get the bill text for the first result
This is a base64 encoded PDF, as we can see in 'mime':'application/pdf'.

In [65]:
testbilltext = legis.get_bill_text(2631259)
print(testbilltext)

{'doc_id': 2631259, 'bill_id': 1635057, 'date': '0000-00-00', 'type': 'Introduced', 'type_id': 1, 'mime': 'application/pdf', 'mime_id': 2, 'text_size': 347111, 'text_hash': '2b017580afa21ca9a37198a3291693d5', 'doc': 'JVBERi0xLjUNCiW1tbW1DQoxIDAgb2JqDQo8PC9UeXBlL0NhdGFsb2cvUGFnZXMgMiAwIFIvTGFuZyhlbi1VUykgL1N0cnVjdFRyZWVSb290IDMzIDAgUi9NYXJrSW5mbzw8L01hcmtlZCB0cnVlPj4+Pg0KZW5kb2JqDQoyIDAgb2JqDQo8PC9UeXBlL1BhZ2VzL0NvdW50IDUvS2lkc1sgMyAwIFIgMTkgMCBSIDIxIDAgUiAyMyAwIFIgMjUgMCBSXSA+Pg0KZW5kb2JqDQozIDAgb2JqDQo8PC9UeXBlL1BhZ2UvUGFyZW50IDIgMCBSL1Jlc291cmNlczw8L0ZvbnQ8PC9GMSA1IDAgUi9GMiA5IDAgUi9GMyAxMSAwIFIvRjQgMTMgMCBSL0Y1IDE1IDAgUi9GNiAxNyAwIFI+Pi9FeHRHU3RhdGU8PC9HUzcgNyAwIFIvR1M4IDggMCBSPj4vUHJvY1NldFsvUERGL1RleHQvSW1hZ2VCL0ltYWdlQy9JbWFnZUldID4+L01lZGlhQm94WyAwIDAgNjEyIDc5Ml0gL0NvbnRlbnRzIDQgMCBSL0dyb3VwPDwvVHlwZS9Hcm91cC9TL1RyYW5zcGFyZW5jeS9DUy9EZXZpY2VSR0I+Pi9UYWJzL1MvU3RydWN0UGFyZW50cyAwPj4NCmVuZG9iag0KNCAwIG9iag0KPDwvRmlsdGVyL0ZsYXRlRGVjb2RlL0xlbmd0aCAyNjQ1Pj4NCnN0cmVhbQ0KeJzFXG1v2zgS/h4

I struggled with opening this bill text and finally found the solution [here!](https://base64.guru/developers/python/examples/decode-pdf)

In [66]:
# Define the Base64 string of the PDF file
b64 = testbilltext['doc']

# Decode the Base64 string, making sure that it contains only valid characters
bytes = b64decode(b64, validate=True)

# Perform a basic validation to make sure that the result is a valid PDF file
# Be aware! The magic number (file signature) is not 100% reliable solution to validate PDF files
# Moreover, if you get Base64 from an untrusted source, you must sanitize the PDF contents
if bytes[0:4] != b'%PDF':
  raise ValueError('Missing the PDF file signature')

# Write the PDF contents to a local file
f = open('2631259.pdf', 'wb')
f.write(bytes)
f.close()

Finally, extracting the text from the entire PDF by finding the number of pages and getting their text one by one.

In [71]:
reader = PdfReader("2631259.pdf")
text = ""  # start empty so each page's text can be appended
for page in reader.pages:
    text = text + page.extract_text()

In [72]:
print(text)

 
FIRST REGULAR SESSION  
SENATE BILL NO. 49  
102ND GENERAL ASSEMBLY   
INTRODUCED BY SENATOR MOON.  
0202S.01I  KRISTINA MARTIN, Secretary   
AN ACT  
To amend chapter 191, RSMo, by adding thereto one new section relating to gender transition 
procedures.  
 
Be it enacted by the General Assembly of the State of Missouri, as follows:  
     Section A.  Chapter 191, RSMo, is amended by adding thereto 1 
one new section, to be known as section 191.1720, to read as 2 
follows: 3 
     191.1720.   1.  This section shall be known and may be  1 
cited as the "Missouri Save Adolescents from Experimentation  2 
(SAFE) Act".  3 
     2.  For purposes of this section, the following t erms  4 
mean: 5 
     (1)  "Biological sex", the biological indication of  6 
male or female in the context of reproductive potential or  7 
capacity, such as sex chromosomes, naturally occurring sex  8 
hormones, gonads, and nonambiguous internal and external  9 
genitali a present at birth, without regard to an

In [73]:
with open('2631259.txt', 'w') as g:
    g.write(text)

I previously extracted the wrong text, because `get_bill_text` takes the doc ID as its argument, not the bill ID!

# Next steps
I think I need to [create a pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html#pandas.DataFrame.from_records) to hold all the `bills['results']` info, then add an empty column for the bill text, then go through each of the bills and download its text using the doc ID, accounting for each file type (some doc, some txt, some PDF, some HTML? Check using the `'mime'` field that `get_bill_text` returns).
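The mime check could be sketched like this. It's only a sketch: `save_bill_doc` is a made-up helper name, and it assumes the response is a dict with the base64 text under `'doc'` and the file type under `'mime'`, like the `get_bill_text` output shown earlier. The file extension comes from Python's built-in `mimetypes` module, and the fake response is there so the cell runs without an API key.

```python
import base64
import mimetypes
import os
import tempfile


def save_bill_doc(doc, folder):
    """Decode a get_bill_text-style dict and save it with a matching extension.

    Hypothetical helper: assumes `doc` has 'doc_id', 'mime' and 'doc' keys.
    """
    raw = base64.b64decode(doc["doc"])
    # Fall back to .bin when the MIME type is not one mimetypes knows about
    extension = mimetypes.guess_extension(doc["mime"]) or ".bin"
    path = os.path.join(folder, f"{doc['doc_id']}{extension}")
    with open(path, "wb") as f:
        f.write(raw)
    return path


# Made-up stand-in for a real API response
fake_doc = {
    "doc_id": 123,
    "mime": "application/pdf",
    "doc": base64.b64encode(b"%PDF-1.5 fake contents").decode("ascii"),
}
saved_path = save_bill_doc(fake_doc, tempfile.mkdtemp())
print(saved_path)
```

Once the file is saved with the right extension, each type can get its own text extractor (pypdf for the PDFs, something else for HTML or Word files).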

Then I have to clean it up, removing things like the line numbers at the ends of the lines in the extracted text above. Then I can begin to use NLP tools to mess around with them.
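Here is a rough sketch of what removing those trailing numbers might look like, using a regular expression. The name `strip_line_numbers` is made up, and the pattern is a guess based on the one bill printed above: it assumes the margin number is a 1-3 digit number sitting at the very end of a line. That means it would also eat a real number that ends a line (like the 49 in "SENATE BILL NO. 49"), so the output needs checking.

```python
import re

# A run of spaces/tabs, then a 1-3 digit number, at the very end of a line
TRAILING_LINE_NUMBER = re.compile(r"[ \t]+\d{1,3}[ \t]*$")


def strip_line_numbers(text):
    """Remove what looks like a margin line number from the end of each line."""
    cleaned = [TRAILING_LINE_NUMBER.sub("", line) for line in text.splitlines()]
    return "\n".join(cleaned)


# A few lines copied from the extracted bill text
sample = (
    "     Section A.  Chapter 191, RSMo, is amended by adding thereto 1 \n"
    "one new section, to be known as section 191.1720, to read as 2 \n"
    "follows: 3 "
)
cleaned_sample = strip_line_numbers(sample)
print(cleaned_sample)
```

Numbers in the middle of a line, like `191.1720`, are left alone because the pattern only matches at the end of the line.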

In [52]:
df = pd.DataFrame.from_records(bills['results'])
df.shape

(50, 11)

I just realized that bills is only the first 50 results, from the first page of the search query. To get them all I have to iterate through the pages.

In [55]:
billslist = []
for i in range(1,10):
    bills = legis.search(state='ALL', query='biological sex', page=i)
    billslist.append(bills)

In [57]:
for i in range(1,10):
    print(billslist[i]['summary'])

{'page': '2 of 6', 'range': '51 - 100', 'relevancy': '98% - 97%', 'count': 298, 'page_current': '2', 'page_total': 6, 'query': '((Zbiolog:(pos=1) AND Zsex:(pos=2)))'}
{'page': '3 of 8', 'range': '101 - 150', 'relevancy': '97% - 96%', 'count': 378, 'page_current': '3', 'page_total': 8, 'query': '((Zbiolog:(pos=1) AND Zsex:(pos=2)))'}
{'page': '4 of 9', 'range': '151 - 200', 'relevancy': '96% - 94%', 'count': 445, 'page_current': '4', 'page_total': 9, 'query': '((Zbiolog:(pos=1) AND Zsex:(pos=2)))'}
{'page': '5 of 10', 'range': '201 - 250', 'relevancy': '94% - 92%', 'count': 478, 'page_current': '5', 'page_total': 10, 'query': '((Zbiolog:(pos=1) AND Zsex:(pos=2)))'}
{'page': '6 of 11', 'range': '251 - 300', 'relevancy': '92% - 89%', 'count': 501, 'page_current': '6', 'page_total': 11, 'query': '((Zbiolog:(pos=1) AND Zsex:(pos=2)))'}
{'page': '7 of 11', 'range': '301 - 350', 'relevancy': '89% - 85%', 'count': 534, 'page_current': '7', 'page_total': 11, 'query': '((Zbiolog:(pos=1) AND Zsex

IndexError: list index out of range

In [59]:
len(billslist)

9
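The `IndexError` above comes from how Python numbers lists: `billslist` has 9 items, numbered 0 through 8, so `range(1,10)` skips the first page and then asks for item 9, which doesn't exist. The loop also guesses at the number of pages, when each response already reports it in `page_total`. Below is a minimal sketch of reading the page count from the first response instead. `fetch_all_pages` and `fake_search` are made-up names, and the sketch assumes every response is a dict with `'summary'` and `'results'` keys, like the output above. It doesn't explain why the page totals above change from call to call.

```python
def fetch_all_pages(search_page):
    """Collect results from every page of a paginated search.

    `search_page` is any function that takes a page number and returns a dict
    with 'summary' and 'results' keys (an assumption based on the output above).
    """
    first = search_page(1)
    all_results = list(first["results"])
    # Read the page count from the first response rather than guessing it
    page_total = int(first["summary"]["page_total"])
    for page in range(2, page_total + 1):
        all_results.extend(search_page(page)["results"])
    return all_results


# Made-up stand-in for the API so the sketch runs without a key:
# 3 pages with 2 results each
def fake_search(page):
    return {
        "summary": {"page_current": str(page), "page_total": 3},
        "results": [{"bill_id": page * 10 + n} for n in range(2)],
    }


all_bills = fetch_all_pages(fake_search)
print(len(all_bills))
```

With the real client, the call would presumably look something like `fetch_all_pages(lambda page: legis.search(state='ALL', query='biological sex', page=page))`, though I haven't tested that.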

In [60]:
billslist

[{'summary': {'page': '1 of 4',
   'range': '1 - 50',
   'relevancy': '100% - 98%',
   'count': 173,
   'page_current': '1',
   'page_total': 4,
   'query': '((Zbiolog:(pos=1) AND Zsex:(pos=2)))'},
  'results': [{'relevance': 100,
    'state': 'MO',
    'bill_number': 'SB49',
    'bill_id': 1635057,
    'change_hash': 'e09e6c58de86d4d7f02c80d880b2fd09',
    'url': 'https://legiscan.com/MO/bill/SB49/2023',
    'text_url': 'https://legiscan.com/MO/text/SB49/2023',
    'research_url': 'https://legiscan.com/MO/research/SB49/2023',
    'last_action_date': '2023-02-27',
    'last_action': 'Formal Calendar S Bills for Perfection',
    'title': 'Establishes the "Missouri Save Adolescents from Experimentation (SAFE) Act"'},
   {'relevance': 99,
    'state': 'SC',
    'bill_number': 'S0274',
    'bill_id': 1637125,
    'change_hash': '613f2f6755636d08d7795134c641ada1',
    'url': 'https://legiscan.com/SC/bill/S0274/2023',
    'text_url': 'https://legiscan.com/SC/text/S0274/2023',
    'research_u

## Different approach
I don't know what's going on with my queries! I did something weird to the page numbers. Maybe it would make more sense to:
1. Download the [tracker](https://docs.google.com/spreadsheets/d/1fTxHLjBa86GA7WCT-V6AbEMGRFPMJndnaVGoZZX4PMw/edit#gid=0) as a CSV
2. Use the URL to get the state & bill number for each bill
3. Use pylegiscan's `get_bill` to get the bill info, including the doc ID
4. Add it to a pandas dataframe
5. Use the doc ID to call `get_bill_text`
6. Download the text and add that to the dataframe
7. Clean up all the text

Or, should I try downloading just a few more for now and see what results when I play with them?
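Step 2 of the list above could be sketched like this. It assumes every URL follows the same `/STATE/bill/NUMBER/YEAR` pattern as the LegiScan URLs in the search results, and `parse_legiscan_url` is a made-up helper name.

```python
from urllib.parse import urlparse


def parse_legiscan_url(url):
    """Split a LegiScan bill URL into (state, bill_number, year).

    Assumes the path looks like /STATE/bill/NUMBER/YEAR.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 4 or parts[1] != "bill":
        raise ValueError(f"Unexpected LegiScan URL: {url}")
    state, _, bill_number, year = parts
    return state, bill_number, year


parsed = parse_legiscan_url("https://legiscan.com/AK/bill/HB27/2023")
print(parsed)
```

Raising an error on anything that doesn't fit the pattern makes odd rows in the spreadsheet show up right away instead of turning into bad API requests later.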

Looking at the [code](https://github.com/alliraine/legialerts/blob/main/main.py) used to update the Google Sheet in the first place using the LegiScan API, they pull the legislative session master list and turn that into a CSV.

In [216]:
mainlist = legis.get_master_list(state=None,session_id=2031)

In [217]:
len(mainlist)

10258

In [218]:
df = pd.DataFrame.from_records(mainlist)

In [219]:
df.shape

(10258, 21)

In [146]:
list(df.columns)

['session_id',
 'state_id',
 'year_start',
 'year_end',
 'prefile',
 'sine_die',
 'prior',
 'special',
 'session_tag',
 'session_title',
 'session_name',
 'bill_id',
 'number',
 'change_hash',
 'url',
 'status_date',
 'status',
 'last_action_date',
 'last_action',
 'title',
 'description']

In [220]:
df[400:425]

Unnamed: 0,session_id,state_id,year_start,year_end,prefile,sine_die,prior,special,session_tag,session_title,...,bill_id,number,change_hash,url,status_date,status,last_action_date,last_action,title,description
400,,,,,,,,,,,...,1646294.0,A00420,7a963b760f926d3f5aba8427c7b893e2,https://legiscan.com/NY/bill/A00420/2023,2023-01-09,1.0,2023-01-09,referred to agriculture,Requires the installation and testing of fire ...,Requires the installation and testing of fire ...
401,,,,,,,,,,,...,1646271.0,A00421,1eb78a8ff7c8f6d33a91720d58a84e6c,https://legiscan.com/NY/bill/A00421/2023,2023-01-09,1.0,2023-01-09,referred to codes,Relates to repeat offenders of driving accidents.,Relates to repeat offenders of driving accidents.
402,,,,,,,,,,,...,1646107.0,A00422,9e8a4c2e07217ff373667276d69f3a74,https://legiscan.com/NY/bill/A00422/2023,2023-01-09,1.0,2023-01-09,referred to housing,Relates to eliminating the price index of oper...,Relates to eliminating the price index of oper...
403,,,,,,,,,,,...,1646176.0,A00423,4bba4871a4d8616414cd7d8377b18d67,https://legiscan.com/NY/bill/A00423/2023,2023-01-09,1.0,2023-01-09,referred to education,Requires consent prior to sharing personally i...,Requires consent prior to sharing personally i...
404,,,,,,,,,,,...,1645968.0,A00424,64aacc2367567a4fabb3645e4b46fdda,https://legiscan.com/NY/bill/A00424/2023,2023-01-09,1.0,2023-01-09,referred to local governments,Relates to prohibiting police officers from ca...,Relates to prohibiting police officers from ca...
405,,,,,,,,,,,...,1645933.0,A00425,47241287756bda7d6482df2e207fd109,https://legiscan.com/NY/bill/A00425/2023,2023-01-09,1.0,2023-01-09,referred to transportation,Relates to notifying persons renewing their li...,Relates to notifying persons renewing their li...
406,,,,,,,,,,,...,1645912.0,A00426,b84fddfd0e3fb8caa986504591b7591b,https://legiscan.com/NY/bill/A00426/2023,2023-01-09,1.0,2023-01-09,referred to transportation,Provides that whenever the estimate for constr...,Provides that whenever the estimate for constr...
407,,,,,,,,,,,...,1646244.0,A00427,7bbf91281a16312f60f2b4bb9351dff0,https://legiscan.com/NY/bill/A00427/2023,2023-01-09,1.0,2023-01-09,referred to labor,Requires the Olympic regional development auth...,Requires the Olympic regional development auth...
408,,,,,,,,,,,...,1646289.0,A00428,39b14084d460d247e688c06fe3546380,https://legiscan.com/NY/bill/A00428/2023,2023-01-09,1.0,2023-01-09,referred to environmental conservation,Relates to the taking of wildlife without a pe...,Relates to the taking of wildlife without a pe...
409,,,,,,,,,,,...,1646038.0,A00429,e6275deadea3df44193a294fe6b49b85,https://legiscan.com/NY/bill/A00429/2023,2023-01-09,1.0,2023-01-09,referred to codes,Prohibits handcuffing or forcibly restraining ...,Prohibits handcuffing or forcibly restraining ...


In [107]:
antibillsdf = pd.read_csv("anti-lgbtq-bills-tracker.csv")
antibillsdf.shape

(367, 37)

In [108]:
antibillsdf.head()

Unnamed: 0,State,Number,Summary,Bill Type,Date,Status,Erin Reed's State Risk,Notes,URL,Sponsors,...,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36
0,Alaska,HB27,Designate Sex For School-sponsored Sports,Trans Sports Ban,1/19/2023,REFERRED TO EDUCATION,Moderate,,https://legiscan.com/AK/bill/HB27/2023,Thomas McKay,...,,,,,,,,,,
1,Arizona,HB2312,Women's shelters; male employees; liability,Bans trans people from working at shelters,2/21/2023,House minority caucus: Do pass,Low,,https://legiscan.com/AZ/bill/HB2312/2023,"Rachel Jones, Lupe Diaz, Liz Harris, Cory McGa...",...,,,,,,,,,,
2,Arizona,HB2711,Student information; parental notification; re...,Forced Outing by Schools,2/8/2023,House read second time,Low,,https://legiscan.com/AZ/bill/HB2711/2023,"Alexander Kolodin, Joseph Chaplik",...,,,,,,,,,,
3,Arizona,SB1001,Pronouns; biological sex; school policies,Forced Outing by Schools,1/18/2023,"Senate ED Committee action: Do Pass Amended, v...",Low,Pronoun Ban. It's requires parental permission...,https://legiscan.com/AZ/bill/SB1001/2023,John Kavanagh,...,,,,,,,,,,
4,Arizona,SB1026,State monies; drag shows; minors,Drag Ban,2/8/2023,"Senate GOV Committee action: Do Pass Amended, ...",Low,Bans use of state money at places where drag s...,https://legiscan.com/AZ/bill/SB1026/2023,John Kavanagh,...,,,,,,,,,,


In [109]:
us_state_to_abbrev["Alaska"]

'AK'

In [110]:
antibillsdf['State'] = antibillsdf['State'].replace(us_state_to_abbrev)
antibillsdf.head()

Unnamed: 0,State,Number,Summary,Bill Type,Date,Status,Erin Reed's State Risk,Notes,URL,Sponsors,...,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30,Unnamed: 31,Unnamed: 32,Unnamed: 33,Unnamed: 34,Unnamed: 35,Unnamed: 36
0,AK,HB27,Designate Sex For School-sponsored Sports,Trans Sports Ban,1/19/2023,REFERRED TO EDUCATION,Moderate,,https://legiscan.com/AK/bill/HB27/2023,Thomas McKay,...,,,,,,,,,,
1,AZ,HB2312,Women's shelters; male employees; liability,Bans trans people from working at shelters,2/21/2023,House minority caucus: Do pass,Low,,https://legiscan.com/AZ/bill/HB2312/2023,"Rachel Jones, Lupe Diaz, Liz Harris, Cory McGa...",...,,,,,,,,,,
2,AZ,HB2711,Student information; parental notification; re...,Forced Outing by Schools,2/8/2023,House read second time,Low,,https://legiscan.com/AZ/bill/HB2711/2023,"Alexander Kolodin, Joseph Chaplik",...,,,,,,,,,,
3,AZ,SB1001,Pronouns; biological sex; school policies,Forced Outing by Schools,1/18/2023,"Senate ED Committee action: Do Pass Amended, v...",Low,Pronoun Ban. It's requires parental permission...,https://legiscan.com/AZ/bill/SB1001/2023,John Kavanagh,...,,,,,,,,,,
4,AZ,SB1026,State monies; drag shows; minors,Drag Ban,2/8/2023,"Senate GOV Committee action: Do Pass Amended, ...",Low,Bans use of state money at places where drag s...,https://legiscan.com/AZ/bill/SB1026/2023,John Kavanagh,...,,,,,,,,,,


In [173]:
antibillsdf.columns

Index(['State', 'Number', 'Summary', 'Bill Type', 'Date', 'Status',
       'Erin Reed's State Risk', 'Notes', 'URL', 'Sponsors', 'Calendar',
       'History', 'Manual Status', 'Change Hash', 'Unnamed: 14', 'Unnamed: 15',
       'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19',
       'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23',
       'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27',
       'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30', 'Unnamed: 31',
       'Unnamed: 32', 'Unnamed: 33', 'Unnamed: 34', 'Unnamed: 35',
       'Unnamed: 36'],
      dtype='object')

In [118]:
# get just the state abbrev, bill number and URL from the whole CSV
statebillnodf = antibillsdf.filter(['State','Number','URL'])

In [139]:
statebillnodf.head()

Unnamed: 0,State,Number,URL
0,AK,HB27,https://legiscan.com/AK/bill/HB27/2023
1,AZ,HB2312,https://legiscan.com/AZ/bill/HB2312/2023
2,AZ,HB2711,https://legiscan.com/AZ/bill/HB2711/2023
3,AZ,SB1001,https://legiscan.com/AZ/bill/SB1001/2023
4,AZ,SB1026,https://legiscan.com/AZ/bill/SB1026/2023


In [177]:
len(antibillsdf)

367

In [196]:
urllist = list(antibillsdf['URL'])

In [180]:
stateabbvs = list(antibillsdf['State'])
billnos = list(antibillsdf['Number'])

In [189]:
stateabbvs[0]

'AK'

In [190]:
billnos[0]

'HB27'

God this is such a dumb way to do this, but it's the way I know how!

In [200]:
antibillsdf.URL

0        https://legiscan.com/AK/bill/HB27/2023
1      https://legiscan.com/AZ/bill/HB2312/2023
2      https://legiscan.com/AZ/bill/HB2711/2023
3      https://legiscan.com/AZ/bill/SB1001/2023
4      https://legiscan.com/AZ/bill/SB1026/2023
                         ...                   
362    https://legiscan.com/WY/bill/SF0111/2023
363    https://legiscan.com/WY/bill/SF0117/2023
364    https://legiscan.com/WY/bill/SF0133/2023
365    https://legiscan.com/WY/bill/SF0144/2023
366    https://legiscan.com/WY/bill/SF0159/2023
Name: URL, Length: 367, dtype: object

In [221]:
len(antibillsdf.URL)

367

In [225]:
billinfos = []
for i in range(len(antibillsdf.URL)):
    if(billnos[i] != "nan"):
        billinfo = legis.get_bill(bill_id=None, state=stateabbvs[i], bill_number=billnos[i])    
        billinfos.append(billinfo)
    else:
        continue

LegiScanError: Invalid bill id
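One likely cause of this error (I haven't confirmed it against the API): the check `billnos[i] != "nan"` never filters anything out. Empty cells in a CSV come through pandas as the float `NaN`, not the string `"nan"`, so the comparison is always true and the blank rows get sent to the API anyway. Here is a small sketch of a check that does catch them. `is_missing` is a made-up helper and the rows are invented.

```python
import math


def is_missing(value):
    """Return True for None, float NaN, or blank strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and value.strip() == ""


# Made-up rows shaped like the (state, bill number) pairs above;
# float("nan") is what an empty CSV cell turns into
rows = [("AK", "HB27"), (float("nan"), float("nan")), ("AZ", "HB2312")]

# The original check is always True, so it lets the empty row through
print(float("nan") != "nan")

usable_rows = [
    (state, number)
    for state, number in rows
    if not is_missing(state) and not is_missing(number)
]
print(usable_rows)
```

In pandas itself, `antibillsdf.dropna(subset=['State', 'Number'])` should do the same filtering before the lists are built.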

# Example code continues below

In [None]:
# bills = legis.search(state='tx', query='abortion')
# bills['summary'] # how many results did we get?

{'page': '1 of 2',
 'range': '1 - 50',
 'relevancy': '100% - 87%',
 'count': 59,
 'page_current': '1',
 'page_total': 2,
 'query': '(Zabort:(pos=1))'}

You can also get single bills, one at a time, as long as you know their ID in the LegiScan database.

In [12]:
legis.get_bill('1635057')

{'bill_id': 1635057,
 'change_hash': 'e09e6c58de86d4d7f02c80d880b2fd09',
 'session_id': 2012,
 'session': {'session_id': 2012,
  'state_id': 25,
  'year_start': 2023,
  'year_end': 2023,
  'prefile': 0,
  'sine_die': 0,
  'prior': 0,
  'special': 0,
  'session_tag': 'Regular Session',
  'session_title': '2023 Regular Session',
  'session_name': '2023 Regular Session'},
 'url': 'https://legiscan.com/MO/bill/SB49/2023',
 'state_link': 'https://www.senate.mo.gov/23info/BTS_Web/Bill.aspx?SessionType=R&BillID=44407',
 'completed': 0,
 'status': 1,
 'status_date': '2023-01-04',
 'progress': [{'date': '2023-01-04', 'event': 1},
  {'date': '2023-01-12', 'event': 9}],
 'state': 'MO',
 'state_id': 25,
 'bill_number': 'SB49',
 'bill_type': 'B',
 'bill_type_id': '1',
 'body': 'S',
 'body_id': 60,
 'current_body': 'S',
 'current_body_id': 60,
 'title': 'Establishes the "Missouri Save Adolescents from Experimentation (SAFE) Act"',
 'description': 'Establishes the "Missouri Save Adolescents from Expe

In [195]:
df2.shape

(10258, 21)

In [197]:
df2[df2.url.isin(urllist)]

Unnamed: 0,session_id,state_id,year_start,year_end,prefile,sine_die,prior,special,session_tag,session_title,...,bill_id,number,change_hash,url,status_date,status,last_action_date,last_action,title,description
0,2031.0,32.0,2023.0,2024.0,0.0,0.0,0.0,0.0,Regular Session,2023-2024 Regular Session,...,,,,,,,,,,


In [199]:
df2.url

0                                             NaN
1        https://legiscan.com/NY/bill/A00021/2023
2        https://legiscan.com/NY/bill/A00022/2023
3        https://legiscan.com/NY/bill/A00023/2023
4        https://legiscan.com/NY/bill/A00024/2023
                           ...                   
10253    https://legiscan.com/NY/bill/J00473/2023
10254    https://legiscan.com/NY/bill/J00474/2023
10255    https://legiscan.com/NY/bill/J00475/2023
10256    https://legiscan.com/NY/bill/J00476/2023
10257    https://legiscan.com/NY/bill/C00010/2023
Name: url, Length: 10258, dtype: object

In [212]:
df3 = df2['url'].isin(antibillsdf['URL'])

# LegiScan Datasets

It'd take forever to download the bills one at a time, so we'll take advantage of LegiScan's [datasets](https://legiscan.com/datasets) capability. Each dataset is a full set of bill data for one session of a legislature.

In [None]:
datasets = legis.get_dataset_list()
dataset = legis.get_dataset(datasets[20]['session_id'], datasets[20]['access_key'])
dataset.keys()

dict_keys(['state_id', 'session_id', 'session_name', 'dataset_hash', 'dataset_date', 'dataset_size', 'mime_type', 'zip'])

In [213]:
print(df3)

0         True
1        False
2        False
3        False
4        False
         ...  
10253    False
10254    False
10255    False
10256    False
10257    False
Name: url, Length: 10258, dtype: bool


The datasets come in a _really_ weird format, though: a [base64-encoded](https://en.wikipedia.org/wiki/Base64) zip file. So first we need to convert the base64 text back into a normal zip file, then unzip it!

In [None]:
z_bytes = base64.b64decode(dataset['zip'])
z = zipfile.ZipFile(io.BytesIO(z_bytes))
z.extractall("./sample-data")
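To try those two steps without an API key, here is a small self-contained round trip. It builds a tiny zip in memory, base64-encodes it the way the dataset's `zip` field appears to be encoded, then decodes and opens it again. The file name and contents are invented for the example.

```python
import base64
import io
import zipfile

# Build a tiny zip in memory and base64-encode it, imitating dataset['zip']
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("AK/example_session/bill/HB1.json", '{"bill": {"bill_id": 1}}')
encoded = base64.b64encode(buffer.getvalue()).decode("ascii")

# Same two steps as above: base64 text -> raw bytes -> zip archive
z_bytes = base64.b64decode(encoded)
with zipfile.ZipFile(io.BytesIO(z_bytes)) as z:
    names = z.namelist()  # peek at the contents without extracting anything
    first_file = z.read(names[0])

print(names)
print(first_file)
```

Calling `namelist()` before `extractall()` is a cheap way to see what a dataset contains before it writes thousands of files to disk.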

It creates a lot lot lot lot lot of `.json` files. For example, let's take a look at a sample of what we just extracted.

In [None]:
import glob

# One * per folder level: state / session / "bill" / file
filenames = glob.glob("./sample-data/*/*/bill/*")
filenames[:15]

['./sample-data/AK/2017-2018_30th_Legislature/bill/SCR10.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/SB124.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB65.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB392.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB238.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HCR25.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB111.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HJR2.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HCR1.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB404.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB280.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/SB173.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/SCR9.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HCR401.json',
 './sample-data/AK/2017-2018_30th_Legislature/bill/HB32.json']

Each file has all sorts of information about the bill, but **none of the text of the bill itself!** You can see for yourself:

In [None]:
import json

with open("./sample-data/AK/2017-2018_30th_Legislature/bill/SCR10.json") as f:
    json_data = json.load(f)
json_data

{'bill': {'bill_id': 1004624,
  'change_hash': '557d10e3e229284c17c4354e988bad06',
  'session_id': 1397,
  'session': {'session_id': 1397,
   'session_name': '30th Legislature',
   'session_title': '30th Legislature',
   'year_start': 2017,
   'year_end': 2018,
   'special': 0},
  'url': 'https://legiscan.com/AK/bill/SCR10/2017',
  'state_link': 'http://www.akleg.gov/basis/Bill/Detail/30?Root=SCR10',
  'completed': 0,
  'status': 3,
  'status_date': '2018-04-28',
  'progress': [{'date': '2017-04-07', 'event': 1},
   {'date': '2018-02-02', 'event': 10},
   {'date': '2018-02-09', 'event': 2},
   {'date': '2018-03-09', 'event': 10},
   {'date': '2018-04-28', 'event': 3}],
  'state': 'AK',
  'state_id': 2,
  'bill_number': 'SCR10',
  'bill_type': 'CR',
  'bill_type_id': '3',
  'body': 'S',
  'body_id': 14,
  'current_body': 'H',
  'current_body_id': 13,
  'title': 'Alaska Year Of Innovation',
  'description': 'Proclaiming 2019 to be the Year of Innovation in Alaska.',
  'committee': [],
  

You _can_ download the bill text if you have the ID, but... for some reason we don't do this. I'm going to be honest: I don't remember why. Maybe it's because they're older versions? They're incomplete? I truly have forgotten.

In [None]:
doc = legis.get_bill_text('2015157')
contents = base64.b64decode(doc['doc'])
with open("filename.html", "wb") as file:
    file.write(contents)

What we're going to need is the **URL to the published version.**

In [None]:
json_data['bill']['texts'][-1]

{'doc_id': 1790359,
 'date': '2018-05-01',
 'type': 'Enrolled',
 'type_id': 5,
 'mime': 'application/pdf',
 'mime_id': 2,
 'url': 'https://legiscan.com/AK/text/SCR10/id/1790359',
 'state_link': 'http://www.legis.state.ak.us/PDF/30/Bills/SCR010Z.PDF',
 'text_size': 592822}

We're going to need the URL to the published version from _every single one of those JSON files_.

# Download and extract all of the datasets from LegiScan

In [None]:
datasets = legis.get_dataset_list()
len(datasets)

583

Downloading and extracting all 583 is going to take a while, so we'll use a progress bar from [tqdm](https://github.com/tqdm/tqdm) to keep track of where we're at.

In [None]:
from tqdm.notebook import tqdm

for dataset in tqdm(datasets):
    session_id = dataset['session_id']
    access_key = dataset['access_key']
    details = legis.get_dataset(session_id, access_key)
    z_bytes = base64.b64decode(details['zip'])
    z = zipfile.ZipFile(io.BytesIO(z_bytes))
    z.extractall("./bill_data")


In [206]:
print(df3.index.where(df3[1] == True))

ValueError: putmask: mask and data must be the same size

In [207]:
df3.index.isin("True")

TypeError: only list-like objects are allowed to be passed to isin(), you passed a [str]
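Both errors above come from treating `df3` like a table, when it is really a single column of True/False values, one per row. The usual way to use a column like that is as a mask. My guess about the lone `True` at row 0 is that it's a false match: that row has no URL, the tracker has blank rows too, and pandas seems to count a missing value as matching another missing value. Here is a sketch with invented data that drops the blanks first and then pulls out the matching rows.

```python
import pandas as pd

# Made-up stand-ins for df2['url'] and antibillsdf['URL'], each with a blank entry
urls = pd.Series(
    [
        float("nan"),
        "https://legiscan.com/NY/bill/A00021/2023",
        "https://legiscan.com/NY/bill/A00022/2023",
    ],
    name="url",
)
tracker_urls = pd.Series(["https://legiscan.com/NY/bill/A00022/2023", float("nan")])

# Drop blanks on both sides so a missing URL can never count as a match
mask = urls.notna() & urls.isin(tracker_urls.dropna())

# A boolean mask selects rows directly; .index gives their row labels
matching_rows = list(urls[mask].index)
print(matching_rows)
```

On the real data the same idea would be `df2[mask]`, which returns the full rows instead of just their labels.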

# Converting the many JSON files to a single CSV file

The data isn't doing us much good sitting around as a zillion JSON files, so we'll convert them into a single CSV file with the pieces of information we're interested in. The main pieces are:

* State
* Bill title
* Bill URL

In [None]:
filenames = glob.glob("bill_data/*/*/bill/*.json")
len(filenames)

1253402

In [None]:
filenames[:5]

['bill_data/VT/2011-2012_Regular_Session/bill/HCR143.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/H0291.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/S0162.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/S0027.json',
 'bill_data/VT/2011-2012_Regular_Session/bill/H0784.json']

If we want to process over a million rows, it's going to take a while! To speed things up we're going to turn to [swifter](https://github.com/jmcarpenter2/swifter), a package that can parallelize work on pandas dataframes. It's pretty easy to use:

**without swifter:**

```python
df = pd.Series(filenames).apply(process_json)
```

**with swifter:**

```python
df = pd.Series(filenames).swifter.apply(process_json)
```

And it does all the hard work for you! You just use it and hope for the best.

In [None]:
import json
import os
import swifter
import pandas as pd

def process_json(filename):
    with open(filename) as file:
        bill_data = {}
        # Swap LegiScan's placeholder date "0000-00-00" for null so it loads as a missing value
        json_str = file.read().replace('"0000-00-00"', 'null')
        content = json.loads(json_str)['bill']

        bill_data['bill_id'] = content['bill_id']
        bill_data['code'] = os.path.splitext(os.path.basename(filename))[0]
        bill_data['bill_number'] = content['bill_number']
        bill_data['title'] = content['title']
        bill_data['description'] = content['description']
        bill_data['state'] = content['state']
        bill_data['session'] = content['session']['session_name']
        bill_data['filename'] = filename
        bill_data['status'] = content['status']
        bill_data['status_date'] = content['status_date']

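        try:
            # Link to the most recent published version of the bill text
            bill_data['url'] = content['texts'][-1]['state_link']
        except (KeyError, IndexError):
            # No published text for this bill, so leave the URL out
            pass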

        return pd.Series(bill_data)

df = pd.Series(filenames).swifter.apply(process_json)
df.head()


Unnamed: 0,bill_id,code,bill_number,title,description,state,session,filename,status,status_date,url
0,325258,HCR143,HCR143,House Concurrent Resolution Congratulating The...,House Concurrent Resolution Congratulating The...,VT,2011-2012 Session,bill_data/VT/2011-2012_Regular_Session/bill/HC...,4,2011-04-22,http://www.leg.state.vt.us/docs/2012/Acts/ACTR...
1,285625,H0291,H0291,An Act Relating To Raising The Penalties For A...,An Act Relating To Raising The Penalties For A...,VT,2011-2012 Session,bill_data/VT/2011-2012_Regular_Session/bill/H0...,1,2011-02-22,http://www.leg.state.vt.us/docs/2012/bills/Int...
2,398232,S0162,S0162,An Act Relating To Powers Of Attorney,An Act Relating To Powers Of Attorney,VT,2011-2012 Session,bill_data/VT/2011-2012_Regular_Session/bill/S0...,1,2012-01-03,http://www.leg.state.vt.us/docs/2012/bills/Int...
3,243054,S0027,S0027,An Act Relating To The Role Of Municipalities ...,An Act Relating To The Role Of Municipalities ...,VT,2011-2012 Session,bill_data/VT/2011-2012_Regular_Session/bill/S0...,1,2011-01-25,http://www.leg.state.vt.us/docs/2012/bills/Int...
4,417691,H0784,H0784,An Act Relating To Approval Of The Adoption An...,An Act Relating To Approval Of The Adoption An...,VT,2011-2012 Session,bill_data/VT/2011-2012_Regular_Session/bill/H0...,4,2012-05-05,http://www.leg.state.vt.us/docs/2012/Acts/ACTM...


And now we'll save it to prepare for the next step: **inserting it into a database.**

In [None]:
df.to_csv("data/bills-with-urls.csv", index=False)