## Scraping New York State pharmacy registration numbers

This notebook uses Selenium to scrape pharmacy registration numbers from the New York State Education Department’s Office of the Professions [online verification search engine](http://www.op.nysed.gov/opsearches.htm#rx). The scrape was run on June 21, 2022. All pharmacy owners must register their pharmacy with the Office of the Professions, which oversees the state’s Board of Pharmacy. Owners must renew their registration every three years and notify the state when a pharmacy closes.

Each six-digit registration number is also associated with a webpage containing more information about the pharmacy. Those webpages are scraped in another notebook.

The website's search engine requires at least one character of input before it displays results. To scrape all results, the scraper searches each letter of the alphabet and each digit from 1 to 9. The search was limited to retail pharmacies, excluding manufacturers, wholesalers, and outsource facilities.

In [104]:
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select

from webdriver_manager.chrome import ChromeDriverManager

import pandas as pd
import time
from datetime import datetime

In [172]:
%load_ext jupyternotify


In [176]:
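from selenium.webdriver.chrome.service import Service

# Selenium 4 expects the driver path to be wrapped in a Service object;
# passing the path positionally is deprecated
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))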



Current google-chrome version is 102.0.5005
Get LATEST chromedriver version for 102.0.5005 google-chrome
Driver [/Users/jmingram/.wdm/drivers/chromedriver/mac64/102.0.5005.61/chromedriver] found in cache


In [170]:
url = 'http://www.op.nysed.gov/opsearches.htm#rx'

In [53]:
# Entering a letter in the search bar
def fill_form(letter):
    driver.get(url)
    driver.find_element(By.XPATH, '//*[@id="content_column"]/div[4]/form/div[1]/select/option[2]').click()
    driver.find_element(By.XPATH, '//*[@id="content_column"]/div[4]/form/div[4]/select/option[1]').click()
    driver.find_element(By.XPATH, '//*[@id="content_column"]/div[4]/form/div[5]/select/option[3]').click()
    driver.find_element(By.XPATH, '//*[@id="content_column"]/div[4]/form/div[3]/input').send_keys(letter)
    driver.find_element(By.XPATH, '//*[@id="content_column"]/div[4]/form/div[6]/input[1]').click()

In [90]:
# Scraping all registration numbers on a page of search results 
def get_reg_numbers(all_numbers):
    for n in driver.find_elements(By.TAG_NAME, 'a')[23:39]:
        if n.text == 'Laws & Regulations':
            break
        all_numbers.append(n.text)
    return all_numbers
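
The slice `[23:39]` above depends on the order of links on the results page. Because registration numbers are six-digit strings, one alternative is to filter link text by pattern instead of by position. This is a minimal sketch, assuming every registration number is exactly six digits and that no other link on the page has six-digit text. `filter_reg_numbers` is a hypothetical helper, not part of the original scraper, and the sample values are made up.

```python
import re

# Assumption: a registration number is exactly six ASCII digits
SIX_DIGITS = re.compile(r"[0-9]{6}")

def filter_reg_numbers(link_texts):
    """Return only the strings that look like six-digit registration numbers."""
    return [t.strip() for t in link_texts if SIX_DIGITS.fullmatch(t.strip())]

# Made-up link texts, not real registration numbers
sample = ["Home", "012345", " 678901 ", "Laws & Regulations", "1234567"]
print(filter_reg_numbers(sample))
```

In the notebook, this could be called as `filter_reg_numbers(a.text for a in driver.find_elements(By.TAG_NAME, 'a'))`, which would remove the need for the `'Laws & Regulations'` sentinel check.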

In [122]:
# The page shows a maximum of 16 results. This function clicks to display more,
# calling the above function to retrieve the numbers as they are displayed
def scroll_results(all_numbers):
    counter = 0
    # The loop runs until the page contains at least two <b> elements
    while len(driver.find_elements(By.TAG_NAME, 'b')) < 2:
        time.sleep(1)
        all_numbers = get_reg_numbers(all_numbers)
        try:
            driver.find_element(By.XPATH, '//*[@id="content_column"]/form/input[7]').click()
            counter += 1
        except Exception:
            # counter is an int, so convert it before concatenating
            print('ERROR: ' + str(counter) + ' clicks')
    all_numbers = get_reg_numbers(all_numbers)
    return all_numbers
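
The `while` loop in `scroll_results` has no upper bound, so it keeps running if the click fails repeatedly and the end-of-results condition is never met. The sketch below shows the same read-then-advance pattern with a page limit. It is deliberately separated from Selenium so it can run without a browser: `collect_pages`, `read_page`, `go_next`, `is_last`, and `max_pages` are all hypothetical names, and the page data is made up.

```python
def collect_pages(read_page, go_next, is_last, max_pages=1000):
    """Collect items page by page until is_last() is true or max_pages is hit."""
    items = []
    for _ in range(max_pages):
        items.extend(read_page())
        if is_last():
            break
        go_next()
    return items

# Toy stand-in for the browser: three pages of made-up values
pages = [["a", "b"], ["c"], ["d", "e"]]
state = {"i": 0}

def read_page():
    return pages[state["i"]]

def is_last():
    return state["i"] == len(pages) - 1

def go_next():
    state["i"] += 1

print(collect_pages(read_page, go_next, is_last))
```

In the notebook, `read_page` could wrap `get_reg_numbers`, `go_next` could wrap the button click, and `is_last` could wrap the `<b>` element count, though that wiring is untested here.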

In [99]:
alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 
            'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 
            'u', 'v', 'w', 'x', 'y', 'z', '1', '2', '3', '4',
            '5', '6', '7', '8', '9']
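
The same 35 search terms can also be generated from the standard library rather than typed by hand. This is only a sketch of an equivalent construction; `search_terms` is a hypothetical name. Note that the digit `0` is not among the terms in `alphabet`, so it is left out here as well.

```python
import string

# Lowercase letters a-z followed by the digits 1-9 (no '0'), 35 terms in total
search_terms = list(string.ascii_lowercase) + list(string.digits[1:])
print(len(search_terms))
```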

In [None]:
%%notify

# Start with an empty list; scroll_results appends to it and returns it
registration_numbers = []

for letter in alphabet:
    print(datetime.now().strftime("%m/%d/%Y %H:%M:%S") + ' searching letter ' + letter)
    fill_form(letter)
    registration_numbers = scroll_results(registration_numbers)
    print(len(registration_numbers))
    time.sleep(5)

In [186]:
# Write the unique registration numbers to a text file, one per line.
# The with block closes the file automatically.
with open('all_registration_numbers_no_dupes.txt', 'w') as f:
    for n in set(registration_numbers):
        f.write(f"{n}\n")
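
Iterating over a `set` gives no guaranteed order, so the line order of the output file can differ between runs. If a reproducible file is useful, for example for diffing two scrapes, the values can be sorted before writing. The sketch below uses hypothetical helper names (`write_unique_sorted`, `read_numbers`), made-up values, and a temporary path rather than the notebook's real output file.

```python
import os
import tempfile

def write_unique_sorted(numbers, path):
    """Write the distinct values in numbers to path, one per line, sorted."""
    unique = sorted(set(numbers))
    with open(path, "w") as f:
        f.writelines(f"{n}\n" for n in unique)
    return unique

def read_numbers(path):
    """Read the file back into a list, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Made-up values, not real registration numbers
sample = ["222222", "111111", "222222"]
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
write_unique_sorted(sample, path)
print(read_numbers(path))
```

A reader like `read_numbers` is also one way the follow-up notebook could load the list, assuming it reads from this text file.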

In [165]:
len(registration_numbers)

14703