# Get PA Nursing Home Data

This notebook scrapes data from the Pennsylvania Department of Health's Nursing Care Facility Information database.

The goal is to create a database of nursing homes in Montgomery County, PA that accept Medicaid payments.

## Import dependencies

In [1]:
import re
import requests

from bs4 import BeautifulSoup
import pandas as pd

## Get the data

In [2]:
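url = 'https://sais.health.pa.gov/commonpoc/content/publicweb/nhinformation2.asp?COUNTY=Montgomery'
# Fail fast on network problems or HTTP errors instead of parsing an error page
html_page = requests.get(url, timeout=30)
html_page.raise_for_status()
soup = BeautifulSoup(html_page.content, 'html.parser')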

In [3]:
# Uncomment to inspect the indented HTML
# print(soup.prettify())

### Note
Inspecting the HTML showed that table rows (`tr`) are nested inside other rows: a `tr` within a `tr` within a `tr`.
So the loop over the rows, below, has to skip the first two and start with the third `tr` (that is, `table_rows[2:]`).
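
The nesting described in this note can be reproduced on a small, made-up HTML fragment. The sketch below is only an illustration: the markup and facility names are invented, not copied from the Department of Health page, and it assumes `html.parser` keeps the rows nested exactly as written. It shows why the outer rows have to be skipped, plus one possible alternative to a fixed offset: keeping only the rows that contain no further `tr`.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics rows nested inside rows (not the real page)
sample_html = """
<table>
  <tr><td>
    <tr><td>
      <tr><td>Home A</td><td>Medicaid, Medicare</td></tr>
      <tr><td>Home B</td><td>Private Pay</td></tr>
    </td></tr>
  </td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.find_all('tr')

# find_all() searches all descendants, so the two wrapper rows come first,
# and each wrapper also "contains" every cell of the rows nested inside it
for tr in sample_rows:
    print(len(tr.find_all('td')), 'td elements')

# Option 1: skip the wrapper rows by position, as this notebook does
by_position = [[td.text for td in tr.find_all('td')] for tr in sample_rows[2:]]

# Option 2 (an alternative, not the notebook's method): keep only leaf rows
leaf_rows = [tr for tr in sample_rows if tr.find('tr') is None]
by_structure = [[td.text for td in tr.find_all('td')] for tr in leaf_rows]
```

Skipping by position is simple but depends on the page layout staying the same; filtering on structure is slightly more tolerant of layout changes.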

## Extract the HTML table with the target data
The target data are in the last table (`tables[-1]`) on the page.

In [4]:
tables = soup.find_all('table')

### Get column header info

In [5]:
table_headers = tables[-1].find_all('th')

In [6]:
columns = [header.getText() for header in table_headers]

### Make DataFrame

In [7]:
table_rows = tables[-1].find_all('tr')

list_of_rows = []

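# Because of the nested tr elements, skip the first two rows (start at [2:])
for tr in table_rows[2:]:
    cells = tr.find_all('td')
    # Use a distinct name so the comprehension does not shadow the loop variable tr
    row = [cell.text for cell in cells]

    list_of_rows.append(row)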

df = pd.DataFrame(data=list_of_rows, columns=columns)

## Select just the nursing homes that have Medicaid beds

In [8]:
df_Medicaid = df[df['Payment Options'].str.contains('Medicaid')]
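
Depending on the pandas version and column dtype, `str.contains` can return a missing value, rather than `False`, for a row whose `Payment Options` cell is missing, and such a mask may then be rejected by boolean indexing. The sketch below uses an invented three-row DataFrame (not the scraped data) to show how passing `na=False` treats those rows as non-matches. Whether the real table contains incomplete rows is not known from this notebook, so this is a precaution rather than a required fix.

```python
import pandas as pd

# Invented sample rows; None stands in for a row with a missing cell
sample = pd.DataFrame({
    'Facility': ['Home A', 'Home B', 'Home C'],
    'Payment Options': ['Medicaid, Medicare', 'Private Pay', None],
})

# na=False makes missing values count as "does not contain 'Medicaid'"
mask = sample['Payment Options'].str.contains('Medicaid', na=False)
sample_medicaid = sample[mask]
print(sample_medicaid)
```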

## Clean the DataFrame

In [9]:
# Reassign instead of dropping in place: df_Medicaid is a filtered slice of df,
# and modifying it in place triggers pandas' SettingWithCopyWarning
df_Medicaid = df_Medicaid.drop(columns='Select')
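
In some pandas versions, modifying a DataFrame in place after obtaining it by filtering another one produces a `SettingWithCopyWarning`, because pandas cannot tell whether the change was meant to reach the original. A minimal sketch of one common way to avoid this, using invented data and hypothetical variable names: take an explicit `.copy()` of the filtered rows before cleaning them.

```python
import pandas as pd

# Invented stand-in for the scraped table
sample = pd.DataFrame({
    'Select': ['', ''],
    'Facility': ['Home A', 'Home B'],
    'Payment Options': ['Medicaid, Medicare', 'Private Pay'],
})

# .copy() makes the filtered result an independent DataFrame
sample_medicaid = sample[sample['Payment Options'].str.contains('Medicaid')].copy()

# Dropping the column in place now only affects the copy
sample_medicaid.drop(columns='Select', inplace=True)

print(list(sample_medicaid.columns))
print(list(sample.columns))
```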


## Write the data to a CSV file

In [10]:
df_Medicaid.to_csv('output/Montgomery_County_PA_nursing_homes_with_Medicaid_beds.csv')
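
`to_csv` does not create missing directories, so the `output/` folder has to exist before the cell above runs. A minimal sketch, using an invented DataFrame and a temporary directory so that it runs anywhere: create the folder with `pathlib` first, then write the file. The file name here is hypothetical. Passing `index=False` is an optional choice that leaves the row index out of the CSV; the original cell keeps it.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented sample data standing in for df_Medicaid
sample_medicaid = pd.DataFrame({
    'Facility': ['Home A'],
    'Payment Options': ['Medicaid, Medicare'],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical location; in the notebook this would be Path('output')
    output_dir = Path(tmp) / 'output'
    output_dir.mkdir(parents=True, exist_ok=True)

    csv_path = output_dir / 'sample_nursing_homes.csv'
    sample_medicaid.to_csv(csv_path, index=False)

    # Read the file back to confirm the round trip
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```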