# StateLegiscraper: Audio Format Example Notebook

*Author*: Katherine Chang (kachang@uw.edu)

*Last Updated*: 1/3/2022

StateLegiscraper is a Python package that scrapes and processes data from U.S. state legislature websites. As of writing, the package focuses on converting standing committee hearings from each state legislature's native archival format to text, so that the text can be easily used for NLP research and for public review. For more details about StateLegiscraper, please visit its [GitHub repository](https://github.com/ka-chang/StateLegiscraper), where it is under active development.

This notebook walks a new user through the StateLegiscraper workflow, with a focus on the Washington State Legislature and working with audio file formats. 

This notebook makes several assumptions about the user, which are that they have:

- At least a novice level familiarity with Python, including importing packages, running basic functions, and saving files.
- Knowledge of basic Python data types, particularly lists and dictionaries.
- Comfort working in the command line, as StateLegiscraper is installed through the user's choice of terminal. 
- At least 500 MB of space on their local hard drive or a mounted cloud drive to save the raw data on.

## The Washington Context

Washington hosts its standing committee meeting data as audio and video files, necessitating the use of speech-to-text engines to convert the audio to text for NLP analysis.

StateLegiscraper provides users with two speech-to-text engine options: an open-source option, DeepSpeech, and a paid option using Google Cloud's Speech-to-Text API. The open-source engine is used by default, with Google Cloud's option integrated as a helper function for those who are interested.

## Setup

Please ensure StateLegiscraper is installed on your local drive. Please refer to the [following instructions for details](https://github.com/ka-chang/StateLegiscraper/blob/main/README.md).

The following two code chunks add your local StateLegiscraper directory to Python's module search path, which allows us to import its modules.

In [1]:
import os
from pathlib import Path
import sys

In [2]:
github_file_path = str(Path(os.getcwd()).parents[0]) #Sets to local Github directory path

sys.path.insert(1, github_file_path) 

The code chunk below prints your local `github_file_path`. It should end with the repository's root directory, `StateLegiscraper`.

In [3]:
print(github_file_path)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper
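
As a quick sanity check on the path logic above, the sketch below shows what `parents[0]` returns for a path and how one might confirm that the result ends with the repository root. The helper name `is_repo_root` and the example path are hypothetical and are not part of StateLegiscraper.

```python
from pathlib import PurePosixPath


def is_repo_root(path, repo_name="StateLegiscraper"):
    """Return True if the final component of `path` matches `repo_name`."""
    return PurePosixPath(path).name == repo_name


# Hypothetical location of this notebook inside a local clone
notebook_dir = PurePosixPath("/home/user/StateLegiscraper/notebooks")

# parents[0] is the directory one level above the notebook folder
repo_root = notebook_dir.parents[0]

print(repo_root)
print(is_repo_root(repo_root))
```

If the check fails on your machine, the notebook is probably being run from a folder other than the one directly inside the repository.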


## Washington Assets

Before we start scraping data, we should decide what data we're interested in. As of writing, StateLegiscraper's coverage of Washington supports scraping audio files from Washington's standing committee hearings from 2015-2021. To look up the committee names available to scrape, we can call on the `statelegiscraper.assets.package` module and import `wa_committees`.

In [4]:
from statelegiscraper.assets.package import wa_committees

Let's go ahead and print the wa_committees source so that you can review the file.

In [5]:
import inspect
wa_committees_source = inspect.getsource(wa_committees)

In [6]:
print(wa_committees_source)

"""
Standing committee names for Washington State Legislature, 
organized by chamber and committee name from 2015-2021
"""

house_standing =[
    "House Appropriations",
    "House Capital Budget",
    "House Children, Youth & Families",
    "House Civil Rights & Judiciary",
    "House College & Workforce Development",
    "House Commerce & Gaming",
    "House Community & Economic Development",
    "House Consumer Protection & Business",
    "House Education",
    "House Environment",
    "House Environment & Energy", #no results in 2017
    "House Finance",
    "House Health Care & Wellness",
    "House Housing, Human Services & Veterans",
    "House Labor & Workplace Standards",
    "House Local Government",
    "House Public Safety",
    "House Rules",
    "House Rural Development, Agriculture, & Natural Resources",
    "House State Government & Tribal Relations",
    "House Transportation"
    ]

senate_standing=[
    "Senate Agriculture, Water, Natural Resources & Parks",
    "Sen

In [7]:
wa_committees.house_standing

['House Appropriations',
 'House Capital Budget',
 'House Children, Youth & Families',
 'House Civil Rights & Judiciary',
 'House College & Workforce Development',
 'House Commerce & Gaming',
 'House Community & Economic Development',
 'House Consumer Protection & Business',
 'House Education',
 'House Environment',
 'House Environment & Energy',
 'House Finance',
 'House Health Care & Wellness',
 'House Housing, Human Services & Veterans',
 'House Labor & Workplace Standards',
 'House Local Government',
 'House Public Safety',
 'House Rules',
 'House Rural Development, Agriculture, & Natural Resources',
 'House State Government & Tribal Relations',
 'House Transportation']

In [8]:
wa_committees.senate_standing

['Senate Agriculture, Water, Natural Resources & Parks',
 'Senate Behavioral Health Subcommittee',
 'Senate Business, Financial Services & Trade',
 'Senate Early Learning & K-12 Education',
 'Senate Environment, Energy & Technology',
 'Senate Health & Long Term Care',
 'Senate Higher Education & Workforce Development',
 'Senate Housing & Local Government',
 'Senate Human Services, Reentry & Rehabilitation',
 'Senate Labor, Commerce & Tribal Affairs',
 'Senate Law & Justice',
 'Senate Rules',
 'Senate State Government & Elections',
 'Senate Transportation',
 'Senate Ways & Means']
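
Because the committee names are plain Python lists, it can be useful to check a name against them before starting a scrape. The sketch below is one way to do that. The helper `find_chamber` is hypothetical, and the two lists are short excerpts of the output above so that the example runs on its own; in this notebook you would pass `wa_committees.house_standing` and `wa_committees.senate_standing` instead.

```python
def find_chamber(committee, house, senate):
    """Return 'house' or 'senate' for a known committee name.

    Raises ValueError when the name is in neither list, which catches
    typos before any scraping starts.
    """
    if committee in house:
        return "house"
    if committee in senate:
        return "senate"
    raise ValueError(f"Unknown committee name: {committee!r}")


# Excerpts of the lists printed above (not the full lists)
house_sample = ["House Appropriations", "House Education", "House Transportation"]
senate_sample = ["Senate Rules", "Senate Transportation", "Senate Ways & Means"]

print(find_chamber("Senate Transportation", house_sample, senate_sample))
```

The names must match exactly, including capitalization and the `&` character.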

## Washington Scrape Class

In [None]:
from statelegiscraper.states.wa import Scrape

In [None]:
help(Scrape)

In [None]:
Scrape.wa_scrape_links("Senate Transportation", "2017", dir_chrome_webdriver, dir_save)
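
The call above uses two variables, `dir_chrome_webdriver` and `dir_save`, that are not defined earlier in this notebook. A minimal sketch of how they might be set up follows, assuming `dir_chrome_webdriver` is the path to a ChromeDriver executable and `dir_save` is the folder where scraped files are written; check `help(Scrape)` for what the method actually expects. The helper `prepare_save_dir`, the folder naming scheme, and all paths are hypothetical.

```python
import tempfile
from pathlib import Path


def prepare_save_dir(base_dir, committee, year):
    """Create and return a per-committee, per-year folder under base_dir."""
    # Build a folder name without spaces, commas, or ampersands
    safe_name = committee.replace(",", "").replace("&", "and").replace(" ", "_")
    save_dir = Path(base_dir) / f"{safe_name}_{year}"
    save_dir.mkdir(parents=True, exist_ok=True)
    return save_dir


# Hypothetical placeholder: replace with the location of your ChromeDriver
dir_chrome_webdriver = "/path/to/chromedriver"

# A temporary folder stands in for your data drive in this example
base_dir = tempfile.mkdtemp()
dir_save = prepare_save_dir(base_dir, "Senate Transportation", "2017")

print(dir_save.name)
```

If the method expects a string rather than a `Path` object, pass `str(dir_save)`.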

## Speech-to-Text Engines: DeepSpeech vs. Google Cloud

Choosing between an open-source engine and a paid engine involves several strengths and trade-offs. This section covers a few quick points to get you situated. I encourage you to try both options if you'd like to compare transcript quality between the two engines. Google Cloud currently provides new users with $300 in credits over 3 months, which should be sufficient to become familiar with its strengths.