# Food, Agriculture, and Soils: California Counties Legislation

This notebook explores county-level legislation from the LA County Code of Ordinances, which is hosted on the Municode Library website [here](https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances). Municode Library does not offer an API.

More background on the structure and powers of county-level government is available [here](https://www.counties.org/general-information/county-structure-0): *"Most legislative acts, including using the police power, are adopted by ordinance."*

In [5]:
# installing selenium and webdriver manager
#!pip install selenium
#!pip install webdriver_manager

In [46]:
# libraries
import pandas as pd
import geopandas as gpd
import numpy as np

import requests
import urllib
from urllib.request import urlopen
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.firefox.options import Options as FirefoxOptions
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager
from bs4 import BeautifulSoup

import re
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer
import json

# set display
pd.options.display.max_columns = 150
pd.options.display.max_rows = 300

### Scraping the Municode Library

Codes of Ordinances typically have a common structure: titles, chapters, sections, and subsections.

A survey of the site suggests that each title (there are more than 20) is hosted on its own subpage, and that each section within a chapter is hosted on a further subpage of its own.

So, to scrape all of the text in the code of ordinances, each title subpage must be opened from the main page, and each section must then be opened from its title subpage.
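
The nested traversal described above can be sketched independently of any browser. This is a minimal illustration, not the notebook's implementation: `walk` and `get_children` are hypothetical names, and the toy `toc` dictionary (with made-up section labels) stands in for whatever function would return the child links of a page.

```python
from typing import Callable, Iterator, List, Tuple


def walk(node: str,
         get_children: Callable[[str], List[str]],
         depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, node) pairs depth-first, each parent before its children."""
    yield depth, node
    for child in get_children(node):
        yield from walk(child, get_children, depth + 1)


# Toy table of contents (hypothetical stand-in for scraped child links).
toc = {
    "main page": ["Title 1", "Title 2"],
    "Title 1": ["Chapter 1.01"],
    "Chapter 1.01": ["Section A", "Section B"],
}

visited = list(walk("main page", lambda n: toc.get(n, [])))
for depth, node in visited:
    print("  " * depth + node)
```

In a real scraper, `get_children` would load the page for `node` and return the links found on it, so each section is only requested after its parent page has been opened.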

In [24]:
# SELENIUM: finding the link for the bank of policies
URL = 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=LOS_ANGELES_CO_CODE'

# headless Firefox; webdriver_manager downloads a matching geckodriver
# (a local geckodriver can be used instead via Service(executable_path=...))
options = FirefoxOptions()
options.add_argument("--headless")
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()), options=options)
driver.get(URL)



In [54]:
# scraping titles
titles = driver.find_elements(By.XPATH, '//a[@class="toc-item-heading"]')

titlelist = [title.text for title in titles]

titlelist

['LOS ANGELES COUNTY CODE',
 'SUPPLEMENT HISTORY TABLE',
 'CHARTER OF THE COUNTY OF LOS ANGELES',
 'Title 1 - GENERAL PROVISIONS',
 'Title 2 - ADMINISTRATION',
 'Title 3 - ADVISORY COMMISSIONS AND COMMITTEES',
 'Title 4 - REVENUE AND FINANCE',
 'Title 5 - PERSONNEL',
 'Title 6 - SALARIES',
 'Title 7 - BUSINESS LICENSES',
 'Title 8 - CONSUMER PROTECTION, BUSINESS AND WAGE REGULATIONS',
 'Title 10 - ANIMALS',
 'Title 11 - HEALTH AND SAFETY',
 'Title 12 - ENVIRONMENTAL PROTECTION',
 'Title 13 - PUBLIC PEACE, MORALS AND WELFARE',
 'Title 15 - VEHICLES AND TRAFFIC',
 'Title 16 - HIGHWAYS',
 'Title 17 - PARKS, BEACHES AND OTHER PUBLIC AREAS',
 'Title 19 - AIRPORTS AND HARBORS',
 'Title 20 - UTILITIES',
 'FLOOD CONTROL DISTRICT CODE',
 'Title 21 - SUBDIVISIONS',
 'Title 22 - PLANNING AND ZONING',
 'Title 26 - BUILDING CODE',
 'Title 27 - ELECTRICAL CODE',
 'Title 28 - PLUMBING CODE',
 'Title 29 - MECHANICAL CODE',
 'Title 30 - RESIDENTIAL CODE',
 'Title 31 - GREEN BUILDING STANDARDS CODE',
 '
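
The scraped list mixes numbered titles with other entries (the charter, the supplement history table, the flood control district code). A minimal sketch of separating them is shown below. `TITLE_RE` and `parse_titles` are hypothetical names, and the pattern assumes every numbered title follows the `Title N - NAME` form seen in the output above.

```python
import re

# Pattern for headings such as 'Title 11 - HEALTH AND SAFETY' (assumed format).
TITLE_RE = re.compile(r"^Title (\d+) - (.+)$")


def parse_titles(headings):
    """Return {title number: title name} for headings that match TITLE_RE."""
    parsed = {}
    for heading in headings:
        match = TITLE_RE.match(heading.strip())
        if match:
            parsed[int(match.group(1))] = match.group(2)
    return parsed


# A few headings copied from the output above.
sample = [
    "SUPPLEMENT HISTORY TABLE",
    "Title 10 - ANIMALS",
    "Title 11 - HEALTH AND SAFETY",
    "FLOOD CONTROL DISTRICT CODE",
    "Title 22 - PLANNING AND ZONING",
]
numbered = parse_titles(sample)
print(numbered)
```

Keying on the title number makes it straightforward to pick out the titles most relevant to food, agriculture, and soils later on.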

In [63]:
# collecting the URL of each title's subpage
titles = driver.find_elements(By.XPATH, '//a[@class="toc-item-heading"]')

linklist = [title.get_attribute('href') for title in titles]

linklist

['https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=LOS_ANGELES_CO_CODE',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=SUHITA',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=CHCOLOAN',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT1GEPR',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT2AD',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT3ADCOCO',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT4REFI',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT5PE',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT6SA',
 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT7BULI',
 'https://library.municode.com/ca/l
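
Each link in this output differs only in its `nodeId` query parameter, so the identifier can be pulled out of a URL, or a URL rebuilt from an identifier, with the standard library. The sketch below assumes that pattern holds for every page; `BASE`, `node_id`, and `node_url` are hypothetical names.

```python
from urllib.parse import urlparse, parse_qs, urlencode

BASE = "https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances"


def node_id(url):
    """Return the nodeId query parameter of a Municode URL, or None if absent."""
    values = parse_qs(urlparse(url).query).get("nodeId")
    return values[0] if values else None


def node_url(node):
    """Build the page URL for a given nodeId."""
    return f"{BASE}?{urlencode({'nodeId': node})}"


title_link = BASE + "?nodeId=TIT1GEPR"
print(node_id(title_link))
print(node_url("TIT4REFI"))
```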

In [93]:
# identifying code chapters and individual chapter text URLs (Title 1 as a test case)
URL1 = linklist[3]
driver.get(URL1)

#chapters = driver.find_elements(By.CLASS_NAME, 'chunk-title-wrapper')
#chapters = driver.find_elements(By.TAG_NAME, 'li')
chapters = driver.find_elements(By.XPATH, '//*[contains(text(), "Chapter")]')
#chapters = driver.find_elements(By.XPATH, '//div[@class="chunk-title-wrapper"]')

chapterlist = []
for chapter in chapters:
    link = chapter.get_attribute('href')
    chapterlist.append(link)

chapterlist

[]
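
An empty list means the XPath matched no elements at all (matched elements without an `href` would have produced `None` entries), which suggests the query ran before the page's JavaScript had rendered the table of contents. One possible approach is to wait explicitly for the links to appear. The sketch below is untested against the live site: the CSS selector is a guess about the page markup, and the assumption that chapter nodeIds start with the title's nodeId plus `_` is based only on the `TIT1GEPR_CH1.01COAD` URL used in this notebook. `child_links` and `chapter_links` are hypothetical helpers.

```python
from urllib.parse import urlparse, parse_qs


def child_links(hrefs, parent_node):
    """Keep unique hrefs whose nodeId starts with '<parent_node>_' (assumed scheme)."""
    kept = []
    for href in hrefs:
        if not href:
            continue  # anchors without an href come back as None
        node = parse_qs(urlparse(href).query).get("nodeId", [""])[0]
        if node.startswith(parent_node + "_") and href not in kept:
            kept.append(href)
    return kept


def chapter_links(driver, title_url, parent_node, timeout=20):
    """Load a title page, wait for its rendered links, and return the child links."""
    # Selenium is imported here so child_links can be used without a browser.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver.get(title_url)
    locator = (By.CSS_SELECTOR, f'a[href*="nodeId={parent_node}_"]')  # assumed selector
    anchors = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located(locator)
    )
    return child_links([a.get_attribute("href") for a in anchors], parent_node)


# Browser-free check of the filtering step.
base = "https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances"
demo = child_links(
    [base + "?nodeId=TIT1GEPR", base + "?nodeId=TIT1GEPR_CH1.01COAD", None,
     base + "?nodeId=TIT1GEPR_CH1.01COAD", base + "?nodeId=TIT2AD"],
    "TIT1GEPR",
)
print(demo)
```

With a live driver, the call would look like `chapter_links(driver, linklist[3], "TIT1GEPR")`. If the wait times out, Selenium raises `TimeoutException`, which is a signal to re-inspect the page and adjust the selector.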

In [92]:
# extracting the text of a single chapter page
URL2 = 'https://library.municode.com/ca/los_angeles_county/codes/code_of_ordinances?nodeId=TIT1GEPR_CH1.01COAD'
driver.get(URL2)

#texts = driver.find_elements(By.XPATH, '//p[@class="runin"]')
#texts = driver.find_elements(By.CLASS_NAME, 'runin')
texts = driver.find_elements(By.TAG_NAME, 'div')

textlist = []

for text in texts:
    text = text.text
    textlist.append(text)

textlist

['SIGN IN\nHELP',
 'SIGN IN\nHELP',
 '',
 '',
 '',
 '',
 'SIGN IN\nHELP',
 'SIGN IN\nHELP\nSelect Language▼',
 '',
 'SIGN IN',
 'SIGN IN',
 'SIGN IN',
 'Select Language▼',
 'Select Language▼',
 'Select Language▼',
 'Initializing application...',
 '',
 'Initializing application...',
 'Initializing application...',
 'Initializing application...',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '',
 '']
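
The output above contains only page chrome and the placeholder "Initializing application...", and selecting every `div` returns the same text once per nesting level. A small post-processing sketch can drop the empty and duplicate entries and flag a page that has not finished loading. `clean_blocks` and `still_loading` are hypothetical helpers, and the placeholder string is taken from the output above.

```python
def clean_blocks(blocks):
    """Drop empty strings and exact duplicates, keeping first-seen order."""
    seen = set()
    cleaned = []
    for block in blocks:
        text = block.strip()
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


def still_loading(blocks, placeholder="Initializing application..."):
    """True if any block is the loading placeholder seen in the output above."""
    return any(block.strip() == placeholder for block in blocks)


# A shortened copy of the scraped list above.
sample = ["SIGN IN\nHELP", "SIGN IN\nHELP", "", "SIGN IN",
          "Select Language▼", "Initializing application...", ""]
print(clean_blocks(sample))
print(still_loading(sample))
```

A `True` result from `still_loading(textlist)` would indicate that the ordinance text had not yet been rendered when the elements were collected.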

In [None]:
# inspecting structure to locate text block
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.prettify())