# Webscraping Hospital Data

Gather data online from the American Hospital Directory to get the number of hospitals and hospital beds for every state/region.
https://www.ahd.com/state_statistics.html

Skip to Section 3 for the individual parts of the webscraping.

## Section 3: Webscrape all hospitals by each individual state

To ensure we save all the data we webscrape, we're going to webscrape each state's HTML page separately instead of scraping from the original page as in Section 2. Please see the following assignments:

* Jacob - Alaska to Iowa (AK, AL, AR, AS, AZ, CA, CO, CT, DC, DE, FL, GA, GU, HI, IA)
* Jason - Idaho to Northern Mariana Islands (ID, IL, IN, KS, KY, LA, MA, MD, ME, MI, MN, MO, MP)
* Lorenzo - Mississippi to Pennsylvania (MS, MT, NC, ND, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA)
* Xinyu - Puerto Rico to Wyoming (PR, RI, SC, SD, TN, TX, UT, VA, VI, VT, WA, WI, WV, WY)

You'll need to install the selenium package and download a Chrome driver that matches your Chrome version, which you can find here https://chromedriver.chromium.org/downloads. Once you webscrape a state, save the csv to your local drive as "ahd_StateABB.csv" (for example, ahd_CA.csv) and upload it to the GitHub so we know which states have been successfully scraped.
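
To see which of your assigned states still need a csv, a small check like the one below may help. It is only a sketch: the function name `missing_states` and the example file list are hypothetical, and it assumes every saved file follows the ahd_StateABB.csv naming used in this notebook.

```python
from pathlib import Path


def missing_states(assigned, filenames):
    """Return the assigned abbreviations that have no ahd_<ABB>.csv file yet."""
    done = set()
    for name in filenames:
        stem = Path(name).stem  # file name without the extension
        if name.endswith(".csv") and stem.startswith("ahd_"):
            done.add(stem[len("ahd_"):].upper())
    # keep the original assignment order
    return [abb for abb in assigned if abb not in done]


# hypothetical example: two of five states already saved
assigned = ["VT", "WA", "WI", "WV", "WY"]
filenames = ["ahd_WY.csv", "ahd_VT.csv", "notes.txt"]
print(missing_states(assigned, filenames))
```

In practice the file list could come from `os.listdir()` on your local copy of the GitHub folder.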

If you have any questions, just let me know. Section 1 has an example of a successful webscrape, and its csv is in the GitHub folder. The code might not be perfect because I don't have access to the webpages, so I'm going off of what I can remember.


In [14]:
# first import packages 
import selenium
print(selenium.__version__)

4.10.0


In [13]:
from selenium import webdriver
import pandas as pd

#### Run the code chunk below for each state you are responsible for 

Steps include: 

* set CHROME_DRIVER_PATH to the path of your Chrome driver
* change the website url to your specific state
* change the state variable to the state abbreviation
* change the path for saving the csv to your local drive and change the file name to ahd_StateABB.csv
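
The per-state values in the steps above can also be derived from the abbreviation alone, which may reduce copy-paste mistakes. The sketch below is an assumption-heavy illustration: `state_settings` and `URL_TEMPLATE` are hypothetical names, and it assumes every state page follows the same URL pattern as the Wyoming example in this notebook, which has not been checked for all states.

```python
from pathlib import Path

# assumed pattern, copied from the WY example url
URL_TEMPLATE = "https://www.ahd.com/states/hospital_{state}.html"


def state_settings(state, save_dir):
    """Build the website url and csv path for one state abbreviation."""
    state = state.strip().upper()
    if len(state) != 2 or not state.isalpha():
        raise ValueError(f"expected a two-letter abbreviation, got {state!r}")
    return {
        "state": state,
        "website": URL_TEMPLATE.format(state=state),
        "csv_path": Path(save_dir) / f"ahd_{state}.csv",
    }


settings = state_settings("wy", "AHD_data")  # "AHD_data" is a placeholder folder
print(settings["website"])
print(settings["csv_path"].name)
```

The returned values would then replace the hand-edited website, state, and csv path in the code chunk.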
    

In [28]:
# path to your Chrome driver; download the one that matches your Chrome version - see instructions in beginning of Section 3
CHROME_DRIVER_PATH = 'D:/Chromedriver/chromedriver_win32/chromedriver.exe' # insert path to your chromedriver

# initialize the Selenium WebDriver
# the code differs between Selenium v3.x and v4.x, see:
# https://stackoverflow.com/questions/70534875/typeerror-init-got-an-unexpected-keyword-argument-service-error-using-p

# Selenium v3.x version, kept for reference:
# driver = webdriver.Chrome(executable_path=CHROME_DRIVER_PATH) 

from selenium.webdriver.chrome.service import Service
service = Service(executable_path=CHROME_DRIVER_PATH) # use the path variable set above, not a quoted string
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

# You can use this website to find the HTML page for each individual state
# 'https://www.ahd.com/state_statistics.html'

website = 'https://www.ahd.com/states/hospital_WY.html' # CHANGE THIS to your state html 
driver.get(website)

# Selenium v3.x version, kept for reference:
# matches_state = driver.find_elements_by_tag_name('tr')

from selenium.webdriver.common.by import By
matches_state = driver.find_elements(By.TAG_NAME, 'tr') # gathers all rows of the table (every element with the tr tag)

hospital_name = []
city = []
staffed_beds = []
total_discharges = []
patient_days = []
gross_revenue = []

for match in matches_state[1:-1]:  # excludes header and total row 
    hospital_name.append(match.find_element(By.XPATH, './th').text)
    city.append(match.find_element(By.XPATH,'./td[1]').text)
    staffed_beds.append(match.find_element(By.XPATH, './td[2]').text)
    total_discharges.append(match.find_element(By.XPATH,'./td[3]').text)
    patient_days.append(match.find_element(By.XPATH, './td[4]').text)
    gross_revenue.append(match.find_element(By.XPATH, './td[5]').text)

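# create list with the state abbreviation repeated once per hospital
state = "WY" # CHANGE THIS to whatever state abbreviation 
length = len(hospital_name)
state_name = [state] * length # a list, so zip() pairs the full abbreviation with each row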

# create dataframe 
columns_hospital = ['State', 'Hospital_Name', 'City', 'Staffed_Beds', 'Total_Discharges', "Patient_Days", 'Gross_Patient_Revenue']

by_hospital_state = pd.DataFrame(list(zip(state_name, hospital_name, city, staffed_beds, 
                                          total_discharges, patient_days, gross_revenue)), 
                   columns = columns_hospital)

by_hospital_state.to_csv(r'C:\Users\TANTAN\Desktop\MEng\Capstone\AHD_data\ahd_WY.csv') # CHANGE THIS to your own save path and file name (ex. ahd_CA.csv)
    
driver.quit()
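
One detail worth checking when building the State column: multiplying a string repeats its characters, while multiplying a one-element list repeats the whole abbreviation. Because `zip()` walks a string one character at a time, the string version would put single letters in the State column. The short sketch below illustrates the difference with placeholder hospital names (not real scraped data).

```python
state = "WY"
hospital_name = ["Hospital A", "Hospital B", "Hospital C"]  # placeholder names

as_string = state * len(hospital_name)  # one long string: 'WYWYWY'
as_list = [state] * len(hospital_name)  # one entry per hospital: ['WY', 'WY', 'WY']

# zip() iterates the string character by character
rows_from_string = list(zip(as_string, hospital_name))
# zip() iterates the list element by element
rows_from_list = list(zip(as_list, hospital_name))

print(rows_from_string)  # [('W', 'Hospital A'), ('Y', 'Hospital B'), ('W', 'Hospital C')]
print(rows_from_list)    # [('WY', 'Hospital A'), ('WY', 'Hospital B'), ('WY', 'Hospital C')]
```

A quick look at the State column of the saved csv should confirm that each row holds the full two-letter abbreviation.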