# Webscraping Hospital Data

Gather data from the American Hospital Directory to get the number of hospitals and hospital beds for every state/region.
https://www.ahd.com/state_statistics.html

Skip to Section 3 for scraping each state individually.

## Section 1: Scrape By State


In [2]:
from selenium import webdriver
import pandas as pd

In [32]:
# path to your Chrome driver; download the version that matches your Chrome version
CHROME_DRIVER_PATH = 'path' # insert path to your Chrome driver

# initialize the Selenium WebDriver
driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)

website = 'https://www.ahd.com/state_statistics.html'
driver.get(website)
matches = driver.find_elements_by_tag_name('tr') # gather all rows of table 

state = []
num_hospitals = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_revenue = []

for match in matches[1:-1]: # runs through each row and extracts individual elements 
    state.append(match.find_element_by_xpath('./th').text)
    num_hospitals.append(match.find_element_by_xpath('./td[1]').text)
    staffed_beds.append(match.find_element_by_xpath('./td[2]').text)
    total_discharges.append(match.find_element_by_xpath('./td[3]').text)
    patient_days.append(match.find_element_by_xpath('./td[4]').text)
    gross_revenue.append(match.find_element_by_xpath('./td[5]').text)

    
driver.quit()

In [33]:
columns = ['State', 'Number_Hospitals', 'Staffed_Beds', 'Total_Discharges', "Patient_Days", 'Gross_Patient_Revenue']
by_state = pd.DataFrame(list(zip(state, num_hospitals, staffed_beds, total_discharges, patient_days, gross_revenue)), 
                       columns = columns)

In [34]:
by_state.head()

Unnamed: 0,State,Number_Hospitals,Staffed_Beds,Total_Discharges,Patient_Days,Gross_Patient_Revenue
0,AK - Alaska,11,1288,44807,261604,"$6,913,063"
1,AL - Alabama,90,15019,533204,2888426,"$72,463,215"
2,AR - Arkansas,55,7835,303107,1486441,"$30,233,263"
3,AS - American Samoa,1,131,4607,28024,"$68,160"
4,AZ - Arizona,79,13938,599898,3064973,"$97,200,727"


In [59]:
by_state.to_csv(r'your directory') # insert path to save
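
The scraped values are all strings (for example `"$6,913,063"`), so they need converting before any arithmetic. A minimal sketch, assuming the values only contain digits, commas, and an optional dollar sign; `to_int` is a hypothetical helper, not part of the original notebook:

```python
def to_int(text):
    """Convert a scraped string such as '$6,913,063' to an int (None if empty)."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    return int(cleaned) if cleaned else None


print(to_int("$6,913,063"))  # 6913063
print(to_int("15019"))       # 15019
```

With pandas this could be applied per column, e.g. `by_state['Staffed_Beds'].map(to_int)`.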

## Section 2: Scrape by Hospital in one code

I didn't get a csv for this one because I accessed the page too many times lol.


We won't scrape all states in one run in case our IP addresses get banned in the process. Instead we'll go state by state, which is more tedious but safer; see Section 3.
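
One way to lower the load on the site is to pause between page visits. This is only a sketch: `visit_politely` and the 5-second default are assumptions, and a delay is no guarantee against getting blocked.

```python
import time


def visit_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Stand-in fetch function so the example runs without a browser
pages = visit_politely(["a", "b"], fetch=str.upper, delay_seconds=0)
print(pages)  # ['A', 'B']
```

In the notebook, `fetch` would be a function that loads one state page with the driver and returns its rows.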

In [56]:
# create empty dataframe 
columns_hospital = ['State', 'Hospital_Name', 'City', 'Staffed_Beds', 'Total_Discharges', "Patient_Days", 'Gross_Patient_Revenue']

by_hospital = pd.DataFrame(columns = columns_hospital)

In [58]:
# path to your Chrome driver; download the version that matches your Chrome version
CHROME_DRIVER_PATH = 'path' # insert path to your Chrome driver

# initialize the Selenium WebDriver
driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH)

website = 'https://www.ahd.com/state_statistics.html'
driver.get(website)
matches = driver.find_elements_by_tag_name('a') # all <a> tags, i.e. the links on the page


for match in matches[8:-7]: # skips the links that do not point to a state page
    state = [match.text]
    match.click()
    
    matches_state = driver.find_elements_by_tag_name('tr')

    hospital_name = []
    city = []
    staffed_beds = []
    total_discharges = []
    patient_days = []
    gross_revenue = []
    
    for row in matches_state[1:-1]: # separate name so the outer loop variable is not shadowed
        hospital_name.append(row.find_element_by_xpath('./th').text)
        city.append(row.find_element_by_xpath('./td[1]').text)
        staffed_beds.append(row.find_element_by_xpath('./td[2]').text)
        total_discharges.append(row.find_element_by_xpath('./td[3]').text)
        patient_days.append(row.find_element_by_xpath('./td[4]').text)
        gross_revenue.append(row.find_element_by_xpath('./td[5]').text)
    
    # create list of state names 
    length = len(hospital_name)
    state_name = state * length
    
    # create dataframe 
    by_hospital_state = pd.DataFrame(list(zip(state_name, hospital_name, city, staffed_beds, 
                                              total_discharges, patient_days, gross_revenue)), 
                       columns = columns_hospital)
    
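    # DataFrame.append returned a new frame (and was removed in pandas 2.0), so reassign with concat
    by_hospital = pd.concat([by_hospital, by_hospital_state], ignore_index=True)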


NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"./th"}
  (Session info: chrome=114.0.5735.199)
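
The error says that at least one `tr` on the page has no `th` child. A possible workaround is to use `find_elements` (plural), which returns an empty list instead of raising, and skip such rows. This is a sketch under that assumption: `extract_row` is a hypothetical helper, and I haven't checked which rows on the AHD pages are missing the `th`.

```python
def extract_row(row):
    """Return [th_text, td_1_text, ...] for a table row, or None if it has no <th>."""
    # "xpath" is the string value of selenium's By.XPATH
    headers = row.find_elements("xpath", "./th")  # empty list when nothing matches
    if not headers:
        return None
    cells = row.find_elements("xpath", "./td")
    return [headers[0].text] + [cell.text for cell in cells]
```

Rows for which this returns `None` can then be dropped before building the dataframe.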


In [None]:
by_hospital.to_csv(r'your directory') # insert path to save

## Section 3: Webscrape all hospitals by each individual state

To make sure we save all the data we scrape, we're going to scrape each state's page separately instead of starting from the original page as in Section 2. Please see the following assignments:

* Jacob - Alaska to Iowa (AK, AL, AR, AS, AZ, CA, CO, CT, DC, DE, FL, GA, GU, HI, IA)
* Jason - Idaho to Northern Mariana Islands (ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MP)
* Lorenzo - Mississippi to Pennsylvania (MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA)
* Xinyu - Puerto Rico to Wyoming (PR, RI, SC, SD, TN, TX, UT, VA, VI, VT, WA, WI, WV, WY)

You'll need to install the selenium package and download a Chrome driver, which you can find here: https://chromedriver.chromium.org/downloads. Once you've scraped a state, save the csv to your local drive as "ahd_StateABB.csv" (for example, ahd_CA.csv) and upload it to the GitHub so we know which states have been successfully scraped.
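
To see which of your states still need scraping, you could compare your assignment against the csv files already saved. A rough sketch, assuming the files follow the `ahd_<ABB>.csv` naming and sit in one folder; `missing_states` and the folder path are made up for illustration:

```python
from pathlib import Path


def missing_states(assigned, folder):
    """Return the assigned abbreviations that have no ahd_<ABB>.csv in folder."""
    prefix = "ahd_"
    done = {p.stem[len(prefix):] for p in Path(folder).glob("ahd_*.csv")}
    return [abb for abb in assigned if abb not in done]


# Jacob's assignment from the list above; "." stands in for your local data folder
jacob = ["AK", "AL", "AR", "AS", "AZ", "CA", "CO", "CT",
         "DC", "DE", "FL", "GA", "GU", "HI", "IA"]
print(missing_states(jacob, "."))
```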

If you have any questions just let me know haha. Section 1 has an example of a successful scrape, and its csv is in the GitHub folder. The code might not be perfect because I don't have access to the webpages, so I'm going off of what I can remember.


In [4]:
# first import packages 

from selenium import webdriver
import pandas as pd

#### Run the code chunk below for each state you are responsible for 

Steps include: 

* set your chrome driver path
* change the website url to your specific state 
* change the state variable to the state abbreviation
* change the path of saving the csv to your local drive and change the name to ahd_StateABB.csv 
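
Since the URL and the file name both depend only on the state abbreviation, the per-state changes can be derived from one variable. A small sketch using the URL pattern from the cell below; `state_url` and `csv_name` are hypothetical helper names:

```python
BASE_URL = "https://www.ahd.com/states/hospital_{abb}.html"


def state_url(abb):
    """Build the AHD page URL for a state abbreviation such as 'CA'."""
    return BASE_URL.format(abb=abb.upper())


def csv_name(abb):
    """Build the csv file name for a state, following the ahd_StateABB.csv convention."""
    return f"ahd_{abb.upper()}.csv"


print(state_url("CA"))  # https://www.ahd.com/states/hospital_CA.html
print(csv_name("CA"))   # ahd_CA.csv
```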
    

In [7]:
# path to your Chrome driver; download the version that matches your Chrome version - see instructions at the beginning of Section 3
CHROME_DRIVER_PATH = 'D:/UCLA/Capstone/HMS-MGH-Capstone-Project/chromedriver.exe' # insert path to your Chrome driver

# initialize the Selenium WebDriver
driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH) 

# You can use this website to find the page URL for each individual state
# 'https://www.ahd.com/state_statistics.html'

website = 'https://www.ahd.com/states/hospital_CA.html' # CHANGE THIS to your state html 
driver.get(website)

matches_state = driver.find_elements_by_tag_name('tr') # gathers all rows of the table with the tag

hospital_name = []
city = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_revenue = []

for match in matches_state[1:-1]:  # excludes header and total row 
    hospital_name.append(match.find_element_by_xpath('./th').text)
    city.append(match.find_element_by_xpath('./td[1]').text)
    staffed_beds.append(match.find_element_by_xpath('./td[2]').text)
    total_discharges.append(match.find_element_by_xpath('./td[3]').text)
    patient_days.append(match.find_element_by_xpath('./td[4]').text)
    gross_revenue.append(match.find_element_by_xpath('./td[5]').text)

# create list of state names 
state = "CA" # CHANGE THIS to your state abbreviation
length = len(hospital_name)
state_name = [state] * length # list of repeated abbreviations; "CA" * length would build one long string

# create dataframe 
columns_hospital = ['State', 'Hospital_Name', 'City', 'Staffed_Beds', 'Total_Discharges', "Patient_Days", 'Gross_Patient_Revenue']

by_hospital_state = pd.DataFrame(list(zip(state_name, hospital_name, city, staffed_beds, 
                                          total_discharges, patient_days, gross_revenue)), 
                   columns = columns_hospital)

by_hospital_state.to_csv(r'D:/UCLA/Capstone/HMS-MGH-Capstone-Project/Datasets/AHD_Data/Alaska_to_Iowa/ahd_CA.csv') # CHANGE THIS to your own save path, ending in the file name (e.g. ahd_CA.csv)
    
driver.quit()

TypeError: __init__() got an unexpected keyword argument 'executable_path'

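Newer Selenium versions no longer accept `executable_path` in `webdriver.Chrome` (hence the error above), so use the cell below instead: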

In [38]:
# path to your Chrome driver; download the version that matches your Chrome version - see instructions at the beginning of Section 3
CHROME_DRIVER_PATH = 'D:/UCLA/Capstone/HMS-MGH-Capstone-Project/chromedriver.exe' # insert path to your Chrome driver

# initialize the Selenium WebDriver

from selenium.webdriver.chrome.service import Service
service = Service(executable_path=CHROME_DRIVER_PATH) # pass the variable, not the literal string "CHROME_DRIVER_PATH"
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

# You can use this website to find the page URL for each individual state
# 'https://www.ahd.com/state_statistics.html'


### Change it to the desired state ###
state = 'MP'

website = 'https://www.ahd.com/states/hospital_' + state + '.html'
driver.get(website)

from selenium.webdriver.common.by import By
matches_state = driver.find_elements(By.TAG_NAME,'tr') # gathers all rows of the table with the tag

hospital_name = []
city = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_revenue = []

for match in matches_state[1:-1]:  # excludes header and total row 
    hospital_name.append(match.find_element(By.XPATH, './th').text)
    city.append(match.find_element(By.XPATH,'./td[1]').text)
    staffed_beds.append(match.find_element(By.XPATH, './td[2]').text)
    total_discharges.append(match.find_element(By.XPATH,'./td[3]').text)
    patient_days.append(match.find_element(By.XPATH, './td[4]').text)
    gross_revenue.append(match.find_element(By.XPATH, './td[5]').text)

# create list of state names 
length = len(hospital_name)
state_name = [state] * length

# create dataframe 
columns_hospital = ['State', 'Hospital_Name', 'City', 'Staffed_Beds', 'Total_Discharges', "Patient_Days", 'Gross_Patient_Revenue']

by_hospital_state = pd.DataFrame(list(zip(state_name, hospital_name, city, staffed_beds, 
                                        total_discharges, patient_days, gross_revenue)), 
                columns = columns_hospital)

file_name = 'D:/UCLA/Capstone/HMS-MGH-Capstone-Project/Datasets/AHD_Data/ID_to_MP/ahd_' + state + '.csv' # CHANGE THIS path to your own folder
by_hospital_state.to_csv(file_name) # saves as ahd_<state>.csv, e.g. ahd_CA.csv
    
driver.quit()
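
The two versions of the cell differ only in how the driver is created, so one option is to check the installed Selenium version and pick the matching style. This is a sketch: `needs_service_object` is a hypothetical helper, and the 4.10 cutoff is my understanding of when the `executable_path` keyword was removed, so adjust it if your version behaves differently.

```python
def needs_service_object(version):
    """True if this Selenium version is assumed to reject executable_path in webdriver.Chrome."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (4, 10)  # assumed cutoff, see note above


print(needs_service_object("3.141.0"))  # False
print(needs_service_object("4.10.0"))   # True

# In the notebook you would pass the installed version:
# import selenium
# needs_service_object(selenium.__version__)
```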