# StateLegiscraper: PDF Format Example Notebook

*Author*: Katherine Chang (kachang@uw.edu)

*Last Updated*: 12/16/2021

StateLegiscraper is a Python package that scrapes and processes data from U.S. state legislature websites. As of writing, the package focuses on converting standing committee hearings from each state legislature's native archival format to text, so that the text data can be easily used for NLP research and for public review. For more details about StateLegiscraper, please visit its [GitHub repository](https://github.com/ka-chang/StateLegiscraper), where it is under active development.

This notebook walks a new user through the StateLegiscraper workflow, with a focus on the Nevada State Legislature and working with PDF file formats. 

This notebook makes several assumptions about the user, which are that they have:

- At least a novice level familiarity with Python, including importing packages, running basic functions, and saving files.
- Knowledge of different Python data types, particularly lists and dictionaries.
- Comfort working in the command line, as StateLegiscraper is installed through the user's choice of terminal. 
- At least 100 MB of free space on their local hard drive or a mounted cloud drive to store the raw data.

## The Nevada Context

The Nevada State Legislature is a part-time, biennial legislature: state legislators meet in odd-numbered years, between February and June. The state legislature website, [www.leg.state.nv.us](https://www.leg.state.nv.us), hosts human-transcribed transcripts of its standing committee meetings in PDF format.

## Setup

Please ensure StateLegiscraper is installed on your local drive. Please refer to the [following instructions for details](https://github.com/ka-chang/StateLegiscraper/blob/main/README.md).

The following two code chunks add your local StateLegiscraper directory to Python's module search path, which allows us to import the package's modules.

In [1]:
import os
from pathlib import Path
import sys

In [2]:
github_file_path = str(Path(os.getcwd()).parents[0]) #Sets to local Github directory path

sys.path.insert(1, github_file_path) 

Code chunk 3 below prints your unique local `github_file_path`. It should end with the GitHub root directory, `StateLegiscraper`.

In [3]:
print(github_file_path)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper
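If you'd rather check this in code than by eye, here is a minimal sketch. The helper name `looks_like_repo_root` is hypothetical (it is not part of StateLegiscraper), and it assumes you kept the default folder name `StateLegiscraper` when cloning.

```python
from pathlib import Path


def looks_like_repo_root(path_string, expected_name="StateLegiscraper"):
    """Return True when the final path component matches the expected folder name."""
    return Path(path_string).name == expected_name


# Example path copied from the printed output above.
example_path = "/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper"
print(looks_like_repo_root(example_path))
```

In your own session you could pass `github_file_path` instead of `example_path`.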


## Nevada Assets

Before we start scraping data, we should decide what data we're interested in. As of writing, StateLegiscraper's coverage of Nevada supports scraping PDF transcripts from Nevada's standing committee hearings from 2011-2021. To access the weblinks used to scrape the PDF links, we can call on the `statelegiscraper.assets.weblinks` module and import `nv_weblinks`.

In [4]:
from statelegiscraper.assets.package import nv_weblinks

I'm going to go ahead and print the nv_weblinks source so that you can review the file.

In [5]:
import inspect
links = inspect.getsource(nv_weblinks)

In [6]:
print(links)

"""Weblinks for Nevada committee meeting pages, organized by chamber and committee name 
for regular sessions from 2011-2021"""

#ASSEMBLY

assem_comlabor=[
    "https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/340/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/219/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/79th2017/Committee/184/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/78th2015/Committee/47/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/77th2013/Committee/1/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/76th2011/Committee/24/Meetings"
]

assem_ed=[
    "https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/348/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/228/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/79th2017/Committee/168/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/78th2015/Committee/50/Meetings",
    "https://www.leg.stat

So you can see `nv_weblinks` includes lists for all Assembly and Senate standing committee meetings from 2011-2021. Simply choose the standing committee you're interested in and call it into your environment by prefixing the list name with `nv_weblinks.`.

In [7]:
sen_hhs = nv_weblinks.sen_hhs

In [8]:
print(sen_hhs)

['https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/350/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/221/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/79th2017/Committee/170/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/78th2015/Committee/63/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/77th2013/Committee/22/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/76th2011/Committee/45/Meetings']


If you'd only like data from specific years, e.g., 2021 and 2019, then simply slice the list. The links are ordered from the most recent session to the oldest, so the first two items are 2021 and 2019.

In [9]:
sen_hhs_2021_2019 = sen_hhs[0:2]

In [10]:
print(sen_hhs_2021_2019)

['https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/350/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/221/Meetings']
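Slicing works well for neighboring sessions. For years that are not adjacent in the list, one option is to filter on the session segment of each URL. The sketch below is only an illustration: `select_links_by_year` is a hypothetical helper, not a StateLegiscraper function, and it assumes every link contains a segment like `/REL/81st2021/`, as the links printed above do.

```python
import re

# Sample links copied from the printed sen_hhs list above.
sen_hhs_sample = [
    "https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/350/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/221/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/79th2017/Committee/170/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/78th2015/Committee/63/Meetings",
]


def select_links_by_year(links, years):
    """Keep links whose session segment (e.g. '81st2021') ends in one of the years."""
    wanted = {str(year) for year in years}
    selected = []
    for link in links:
        # Capture the four-digit year that follows the ordinal session number.
        match = re.search(r"/REL/\d+(?:st|nd|rd|th)(\d{4})/", link)
        if match and match.group(1) in wanted:
            selected.append(link)
    return selected


print(select_links_by_year(sen_hhs_sample, [2021, 2017]))
```

The result is still a plain list of links, so it can be passed along in the same way as the sliced list.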


## Nevada Scrape Class

Now that you've selected your targeted data through the weblinks asset, let's begin scraping data! StateLegiscraper is structured with two main classes of functions: Scrape and Process. We'll start with the Scrape class, which we import using the following code.

In [11]:
from statelegiscraper.states.nv import Scrape

There's one function in Nevada's Scrape class, `nv_scrape_pdf()`. All of StateLegiscraper's state module functions include detailed docstrings, so use the `help(classname)` function to easily access the documentation.

In [12]:
help(Scrape)

Help on class Scrape in module statelegiscraper.states.nv:

class Scrape(builtins.object)
 |  Scrape functions for Nevada State Legislature website
 |  
 |  Methods defined here:
 |  
 |  nv_scrape_pdf(webscrape_links, dir_chrome_webdriver, dir_save)
 |      Webscrape function for Nevada State Legislature Website.
 |      
 |      Parameters
 |      ----------
 |      webscrape_links : List
 |          List of direct link(s) to NV committee webpage.
 |          see assets/weblinks/nv_weblinks.py for lists organized by chamber and committee
 |      dir_chrome_webdriver : String
 |          Local directory that contains the appropriate Chrome Webdriver.
 |      dir_save : String
 |          Local directory to save PDFs.
 |      
 |      Returns
 |      -------
 |      All PDF files found on the webscrape_links, saved on local dir_save.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary fo

So we can see here that `nv_scrape_pdf()` requires three parameters:
1. webscrape_links: A list of links to Nevada committee hearing webpages. This is what we covered in the Assets section. We'll use `sen_hhs_2021_2019`, which is currently in our environment. 
2. dir_chrome_webdriver: The path to your Chrome Webdriver. You should have reviewed this in the installation section and downloaded it to the `StateLegiscraper/statelegiscraper/assets/chromedriver` directory. [See more details here.](https://github.com/ka-chang/StateLegiscraper/blob/main/README.md)
3. dir_save: A local directory in which to save the scraped raw data (the PDF files). We'll use `StateLegiscraper/examples/outputs` for this example.

Remember `github_file_path`? This is your unique local path to wherever you cloned the StateLegiscraper repository. Let's use it to build the two directory parameters: the path to the Chrome Webdriver and the folder where the PDFs will be saved.

Please note:
- Change your chromedriver file to the one appropriate for your Chrome version and hardware specification. I am using Google Chrome, version 96 on a Mac Mini M1, but you are probably not. Read the installation guide to download the appropriate Chrome Driver and save it in the assets folder.
- The save folder can be anywhere in your local drive, but for now we will be using `StateLegiscraper/examples/outputs`.

In [13]:
directory_chrome_webdriver = str(os.path.join(github_file_path, "statelegiscraper/assets/chromedriver/chromedriver_v96_m1"))

In [14]:
print(directory_chrome_webdriver)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper/statelegiscraper/assets/chromedriver/chromedriver_v96_m1


In [15]:
directory_raw_data = str(os.path.join(github_file_path, "examples/outputs/"))

In [16]:
print(directory_raw_data)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper/examples/outputs/


Let's make sure that the outputs directory really exists, and create it if it doesn't.

In [17]:
# Create the outputs directory if it does not already exist.
os.makedirs(directory_raw_data, exist_ok=True)

Now that all our parameters are established, we can run the `nv_scrape_pdf()` function to export all the archived meeting PDF transcripts from Nevada's Senate Health and Human Services Committee from 2019 and 2021.

In [18]:
Scrape.nv_scrape_pdf(sen_hhs_2021_2019, directory_chrome_webdriver, directory_raw_data)

https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1351.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1321.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1257.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1231.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1216.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1164.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1146.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1117.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1024.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/972.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/884.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/787.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Se

Congratulations! You have transcript data from Nevada's Senate Health and Human Services Committee for the 2019 and 2021 legislative sessions! Let's check the raw data outputs folder to ensure the PDF files were saved correctly.

In [19]:
os.listdir(directory_raw_data)

['81st2021_Minutes_Senate_HHS_Final_1351.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1321.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1257.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1231.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1216.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1164.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1146.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1117.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1024.pdf',
 '81st2021_Minutes_Senate_HHS_Final_972.pdf',
 '81st2021_Minutes_Senate_HHS_Final_884.pdf',
 '81st2021_Minutes_Senate_HHS_Final_787.pdf',
 '81st2021_Minutes_Senate_HHS_Final_766.pdf',
 '81st2021_Minutes_Senate_HHS_Final_686.pdf',
 '81st2021_Minutes_Senate_HHS_Final_660.pdf',
 '81st2021_Minutes_Senate_HHS_Final_755.pdf',
 '81st2021_Minutes_Senate_HHS_Final_737.pdf',
 '81st2021_Minutes_Senate_HHS_Final_647.pdf',
 '81st2021_Minutes_Senate_HHS_Final_597.pdf',
 '81st2021_Minutes_Senate_HHS_Final_563.pdf',
 '81st2021_Minutes_Senate_HHS_Final_479.pdf',
 '81st2021_Minutes_Senate
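A quick way to confirm that both sessions were downloaded is to count the PDFs by the session prefix in each file name. This is a rough sketch: `count_pdfs_by_session` is a hypothetical helper, and the `80th2019` file name below is made up for illustration, on the assumption that the 2019 files follow the same naming pattern as the 2021 files listed above.

```python
from collections import Counter

# The first two names come from the listing above; the 80th2019 name and
# notes.txt are invented here to show how the helper treats other files.
example_files = [
    "81st2021_Minutes_Senate_HHS_Final_1351.pdf",
    "81st2021_Minutes_Senate_HHS_Final_1321.pdf",
    "80th2019_Minutes_Senate_HHS_Final_1000.pdf",
    "notes.txt",
]


def count_pdfs_by_session(file_names):
    """Count PDF files by the session prefix that precedes the first underscore."""
    sessions = [
        name.split("_", 1)[0]
        for name in file_names
        if name.lower().endswith(".pdf")
    ]
    return Counter(sessions)


print(count_pdfs_by_session(example_files))
```

With your own data, you could call `count_pdfs_by_session(os.listdir(directory_raw_data))`.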

## Nevada Process Class

Now that you have raw data in the form of PDF files, you will want to process them for text analysis. StateLegiscraper's Process class contains custom functions for each state's output raw data generated from the Scrape class.

First, let's import the Process class of functions and take a look at what's in them.

In [20]:
from statelegiscraper.states.nv import Process

In [21]:
help(Process)

Help on class Process in module statelegiscraper.states.nv:

class Process(builtins.object)
 |  Process functions for PDF transcripts scraped from Nevada State Legislature website
 |  
 |  Methods defined here:
 |  
 |  nv_pdf_to_text(dir_load, nv_json_name)
 |      Convert all PDFs to a dictionary and then saved locally as a JSON file.
 |      
 |      Parameters
 |      ----------
 |      dir_load : String
 |          Local location of the directory holding PDFs.
 |      nv_json_name : String
 |          JSON file name, include full local path.
 |      
 |      Returns
 |      -------
 |      A single JSON file, can be loaded as dictionary to work with.
 |  
 |  nv_text_clean(nv_json_path, trim=None)
 |      Loads JSON into environment as dictionary
 |      Preprocesses the raw PDF export from previously generated json
 |      Optional: Trims transcript to exclude list of those present 
 |      and signature page/list of exhibits
 |      
 |      Parameters
 |      ----------
 |     

There are two Process class functions, `nv_pdf_to_text()` and `nv_text_clean()`. Let's work with `nv_pdf_to_text()` first.

`nv_pdf_to_text()` takes all PDF files in a directory and dumps their text into a single JSON file. The benefit of using the JSON file format for our data is that the text of multiple PDFs can be stored in a single file, which users can load into their environment as a dictionary. This makes it possible to work with hundreds of PDFs in a logical manner.
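To make the idea concrete, here is a small sketch of several texts stored in one JSON file and read back as a dictionary. It is not the implementation of `nv_pdf_to_text()`; the placeholder texts and file name are invented, and the string index keys mirror the keys used later in this notebook.

```python
import json
import os
import tempfile

# Placeholder text standing in for the contents of two PDFs.
transcripts = {
    "0": "Text of the first hearing.",
    "1": "Text of the second hearing.",
}

with tempfile.TemporaryDirectory() as temp_dir:
    json_path = os.path.join(temp_dir, "example_transcripts.json")

    # Write every transcript into one file.
    with open(json_path, "w", encoding="utf-8") as json_file:
        json.dump(transcripts, json_file)

    # Read the file back; a JSON object loads as a Python dictionary.
    with open(json_path, encoding="utf-8") as json_file:
        loaded = json.load(json_file)

print(sorted(loaded))
print(loaded["1"])
```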

`nv_pdf_to_text()` takes two parameters: the local path for the scraped raw data and a string that details the file name for the JSON output.

We know that the data we scraped is currently in `directory_raw_data`, so let's remind ourselves where it's pointing to and what files are in there.

In [22]:
directory_raw_data

'/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper/examples/outputs/'

In [23]:
os.listdir(directory_raw_data)

['81st2021_Minutes_Senate_HHS_Final_1351.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1321.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1257.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1231.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1216.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1164.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1146.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1117.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1024.pdf',
 '81st2021_Minutes_Senate_HHS_Final_972.pdf',
 '81st2021_Minutes_Senate_HHS_Final_884.pdf',
 '81st2021_Minutes_Senate_HHS_Final_787.pdf',
 '81st2021_Minutes_Senate_HHS_Final_766.pdf',
 '81st2021_Minutes_Senate_HHS_Final_686.pdf',
 '81st2021_Minutes_Senate_HHS_Final_660.pdf',
 '81st2021_Minutes_Senate_HHS_Final_755.pdf',
 '81st2021_Minutes_Senate_HHS_Final_737.pdf',
 '81st2021_Minutes_Senate_HHS_Final_647.pdf',
 '81st2021_Minutes_Senate_HHS_Final_597.pdf',
 '81st2021_Minutes_Senate_HHS_Final_563.pdf',
 '81st2021_Minutes_Senate_HHS_Final_479.pdf',
 '81st2021_Minutes_Senate

Looks good! We'll use `directory_raw_data` for our first parameter that points to the directory where the raw data is. 

Now let's create a local path that points to where we want the JSON to be saved. I'll put it in the `data` folder within the examples directory and add the file name at the end. Make sure that this file name string includes .json at the end to import correctly.
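If you're unsure whether the target folder exists yet, a small check like the one below can help. This is a sketch under assumptions: `prepare_json_path` is a hypothetical helper, and I have not confirmed whether `nv_pdf_to_text()` creates missing folders itself, so creating the folder up front is simply a precaution.

```python
import tempfile
from pathlib import Path


def prepare_json_path(path_string):
    """Check for the .json suffix and create the parent folder if it is missing."""
    json_path = Path(path_string)
    if json_path.suffix.lower() != ".json":
        raise ValueError(f"Expected a file name ending in .json, got: {json_path.name}")
    json_path.parent.mkdir(parents=True, exist_ok=True)
    return str(json_path)


# Demonstration in a throwaway folder, so nothing is written to your repository.
with tempfile.TemporaryDirectory() as temp_dir:
    demo_path = prepare_json_path(Path(temp_dir) / "data" / "nv_example.json")
    parent_created = Path(demo_path).parent.is_dir()

print(parent_created)
```

The returned string could then be used wherever a JSON save path is expected.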

In [24]:
file_save_json = str(os.path.join(github_file_path, "examples/data/nv_sen_hhs_2021_2019.json"))

In [25]:
file_save_json

'/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper/examples/data/nv_sen_hhs_2021_2019.json'

You're now ready to run `nv_pdf_to_text()`, which will convert all the PDF files in `directory_raw_data` to text and aggregate it all into a single JSON file on your local drive at `file_save_json`.

In [26]:
Process.nv_pdf_to_text(directory_raw_data, file_save_json)

You now have a single JSON file saved on your local drive with the scraped text from the 2019 and 2021 Health and Human Services Committee meetings. You are free to take this JSON file and run analyses on it in other software environments, such as R. The JSON file is not locked down to the StateLegiscraper package or Python.

If you'd like to work with this JSON file in Python, there are three different ways to load this JSON file into your Python environment. I'm going to use a short work session transcript from May 22, 2019 to illustrate the difference between different load approaches.

First, you can simply use `json.load()` to open the text in its rawest form.

In [27]:
import json

# Open the JSON file with a context manager so that it is closed after loading.
with open(file_save_json) as json_file:
    nv_sen_hhs_2021_2019_raw = json.load(json_file)

In [28]:
nv_sen_hhs_2021_2019_raw["4"]

'\nMINUTES OF THE  \nSENATE COMMITTEE ON HEALTH AND HUMAN SERVICES \n \nEightieth Session \nMay 22, 2019 \n \n \nThe Senate Committee on Health and Human Services was called to order by \nChair Julia Ratti at 2:50 p.m. on Wednesday, May 22, 2019, in Room 2135 of \nthe  Legislative  Building,  Carson City,  Nevada.  The  meeting  was \nvideoconferenced to Room 4404B of the Grant Sawyer State Office Building, \n555 East  Washington  Avenue,  Las Vegas,  Nevada.  Exhibit A  is  the  Agenda. \nExhibit B is the Attendance Roster. All exhibits are available and on file in the \nResearch Library of the Legislative Counsel Bureau. \n \nCOMMITTEE MEMBERS PRESENT: \n \nSenator Julia Ratti, Chair \nSenator Pat Spearman, Vice Chair \nSenator Joyce Woodhouse \nSenator Joseph P. Hardy \nSenator Scott Hammond \n \nSTAFF MEMBERS PRESENT: \n \nMegan Comlossy, Committee Policy Analyst \nEric Robbins, Committee Counsel \nMichelle Hamilton, Committee Secretary \n \nCHAIR RATTI: \nI will open with the work

You can see that the text is not very clean and is formatted with line breaks from the original PDF.

The second and third options are bundled into StateLegiscraper's Process class. `nv_text_clean()` loads the JSON into your Python environment and provides a trim option for some light cleaning to prepare it for analysis.

`nv_text_clean()` requires only a single parameter: the `file_save_json` path where `nv_pdf_to_text()` exported the JSON file. It also accepts an optional `trim` parameter, which defaults to `trim=None`; pass `trim=True` to enable trimming.

Let's run `nv_text_clean()` without the optional trim parameter.

In [29]:
nv_sen_hhs_2021_2019 = Process.nv_text_clean(file_save_json)

In [30]:
nv_sen_hhs_2021_2019["4"]

'MINUTES OF THE SENATE COMMITTEE ON HEALTH AND HUMAN SERVICES Eightieth Session May 22, 2019 The Senate Committee on Health and Human Services was called to order by Chair Julia Ratti at 2:50 p.m. on Wednesday, May 22, 2019, in Room 2135 of the Legislative Building, Carson City, Nevada. The meeting was videoconferenced to Room 4404B of the Grant Sawyer State Office Building, 555 East Washington Avenue, Las Vegas, Nevada. Exhibit A is the Agenda. Exhibit B is the Attendance Roster. All exhibits are available and on file in the Research Library of the Legislative Counsel Bureau. COMMITTEE MEMBERS PRESENT: Senator Julia Ratti, Chair Senator Pat Spearman, Vice Chair Senator Joyce Woodhouse Senator Joseph P. Hardy Senator Scott Hammond STAFF MEMBERS PRESENT: Megan Comlossy, Committee Policy Analyst Eric Robbins, Committee Counsel Michelle Hamilton, Committee Secretary CHAIR RATTI: I will open with the work session on Assembly Bill (A.B. 151). ASSEMBLY BILL 151: Provides for the protection o

Much cleaner than just using `json.load()` for the raw form. `nv_sen_hhs_2021_2019` is now a dictionary object where each item is a single committee hearing. Light text cleaning has been conducted with the removal of line breaks and page numbers.
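For intuition, here is one way line breaks and repeated spaces can be collapsed by hand. This is only a sketch of the general technique, not the code inside `nv_text_clean()`, and it does not attempt page-number removal, since that depends on how page numbers appear in the PDFs. `collapse_whitespace` is a hypothetical helper name.

```python
import re

# The opening of the raw transcript shown earlier, shortened.
raw_excerpt = (
    "\nMINUTES OF THE  \nSENATE COMMITTEE ON HEALTH AND HUMAN SERVICES \n \n"
    "Eightieth Session \nMay 22, 2019 \n"
)


def collapse_whitespace(text):
    """Replace every run of spaces, tabs, and line breaks with a single space."""
    return re.sub(r"\s+", " ", text).strip()


print(collapse_whitespace(raw_excerpt))
```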

Currently, the dictionary keys are just index numbers (stored as strings) reflecting the order in which the files were downloaded to the raw data folder; additional functionality is planned that will add committee hearing dates as keys.
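Until then, you could derive date-based keys yourself. The sketch below pulls the first "Month D, YYYY" date out of each cleaned transcript and builds a new dictionary. Both helpers are hypothetical, and the sketch assumes the first date in the text is the hearing date, which holds for the example above but may not hold for every transcript. The old index is kept as a suffix so that two hearings on the same day do not overwrite each other.

```python
import re

MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]
DATE_PATTERN = re.compile(r"\b(" + "|".join(MONTH_NAMES) + r") (\d{1,2}), (\d{4})\b")


def first_date_in_text(text):
    """Return the first 'Month D, YYYY' date in the text as YYYY-MM-DD, or None."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month_name, day, year = match.groups()
    month = MONTH_NAMES.index(month_name) + 1
    return f"{year}-{month:02d}-{int(day):02d}"


def rekey_by_date(transcripts):
    """Build a new dictionary keyed by '<date>_<old key>'."""
    rekeyed = {}
    for old_key, text in transcripts.items():
        date = first_date_in_text(text) or "unknown"
        rekeyed[f"{date}_{old_key}"] = text
    return rekeyed


# The start of the cleaned transcript shown above, shortened.
example = {
    "4": "MINUTES OF THE SENATE COMMITTEE ON HEALTH AND HUMAN SERVICES "
         "Eightieth Session May 22, 2019 The Senate Committee on Health",
}
print(list(rekey_by_date(example)))
```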

Let's check that `nv_sen_hhs_2021_2019` is a dictionary object.

In [31]:
isinstance(nv_sen_hhs_2021_2019, dict) 

True

Yep, that's a dictionary! Now let's use the optional `trim=True` parameter for somewhat more aggressive cleaning. Trimming removes text from the roll sheet (i.e., committee details, date, location, list of all attendees), the signature page, and the exhibit summary. This leaves you with the portion of the transcript where hearing attendees were speaking to one another.

In [32]:
nv_sen_hhs_2021_2019_trim = Process.nv_text_clean(file_save_json, trim=True)

In [33]:
nv_sen_hhs_2021_2019_trim["4"]

'CHAIR RATTI: I will open with the work session on Assembly Bill (A.B. 151). ASSEMBLY BILL 151: Provides for the protection of children who are victims of commercial sexual exploitation. (BDR 38-457) MEGAN COMLOSSY (Committee Policy Analyst): I will read the summary of the bill and amendments from the work session document (Exhibit C). ERIC ROBBINS (Committee Counsel): The Nevada Rules of Professional Conduct prescribe the ethical responsibilities of lawyers in this State. Nevada Rule of Professional Conduct 1.6 generally prohibits a lawyer from revealing information relating to the representation of a Senate Committee on Health and Human Services May 22, 2019 client unless the client consents to the disclosure or the disclosure is impliedly authorized in order to carry out the representation. The policy reason behind this rule is that people have a right to counsel under the Constitution. In order for them to have effective representation, they might need to discuss incriminating info

`nv_sen_hhs_2021_2019_trim` is also a dictionary object and can be used in the same way as the untrimmed version. And while `nv_text_clean()` conducts light cleaning, additional processing will be required to segment the speakers and remove bill summaries and references. 
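As a starting point for speaker segmentation, here is a rough sketch based on the pattern visible in the trimmed text: a name in capital letters, an optional role in parentheses, then a colon. `split_by_speaker` and `SPEAKER_LABEL` are hypothetical names, and the pattern is a heuristic that I have only checked against the excerpt below. Bill headings such as "ASSEMBLY BILL 151:" are not treated as speakers, so they stay inside the preceding speaker's text and would still need to be removed separately.

```python
import re

# An upper-case name, an optional "(role)", then a colon and a space.
SPEAKER_LABEL = re.compile(
    r"\b([A-Z][A-Z.'\-]*(?: [A-Z][A-Z.'\-]*)*)(?: \(([^)]*)\))?: "
)


def split_by_speaker(text):
    """Return (speaker, role, speech) tuples in the order they appear."""
    matches = list(SPEAKER_LABEL.finditer(text))
    segments = []
    for index, match in enumerate(matches):
        speech_start = match.end()
        # A speech runs until the next speaker label, or to the end of the text.
        if index + 1 < len(matches):
            speech_end = matches[index + 1].start()
        else:
            speech_end = len(text)
        speech = text[speech_start:speech_end].strip()
        segments.append((match.group(1), match.group(2), speech))
    return segments


# Shortened excerpt from the trimmed transcript shown above.
excerpt = (
    "CHAIR RATTI: I will open with the work session on Assembly Bill (A.B. 151). "
    "ASSEMBLY BILL 151: Provides for the protection of children who are victims "
    "of commercial sexual exploitation. (BDR 38-457) "
    "MEGAN COMLOSSY (Committee Policy Analyst): I will read the summary of the "
    "bill and amendments from the work session document (Exhibit C). "
    "ERIC ROBBINS (Committee Counsel): The Nevada Rules of Professional Conduct "
    "prescribe the ethical responsibilities of lawyers in this State."
)

segments = split_by_speaker(excerpt)
print([speaker for speaker, role, speech in segments])
```

Any text that appears before the first speaker label is dropped by this sketch, which may or may not be what you want for your analysis.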

In [34]:
isinstance(nv_sen_hhs_2021_2019_trim, dict) 

True

## What Now?

This is the stage where you, the user, have free rein to begin working with popular NLP Python packages, such as nltk and spaCy. I have provided a few key resources below.

### nltk Resources
- [nltk Book](https://www.nltk.org/book/)
- [Hands on nltk Tutorial](https://github.com/hb20007/hands-on-nltk-tutorial) 

### spaCy Resources
- [Official Usage Guide](https://spacy.io/usage)