# Week 7 Assignment

_McKinney 6.1_

This week has been all about getting information off the internet, both in structured data formats (CSV, JSON, etc.) and as HTML.  For these exercises, we're going to use two practical examples of fetching data from web pages to show how to use Pandas and BeautifulSoup to extract structured information from the web.

---

### 33.1 Parsing a list in HTML

Go to the Banner Health Price Transparency Page: https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency

Notice that there is a list of hospitals grouped by the state they are in.  We want to parse the underlying HTML to create a list of all the hospitals along with which state they're in.

```json
[
    ["Banner - University Medical Center Phoenix", "Arizona"],
    ["Banner - University Medical Center South ", "Arizona"],
    ...
]
```

To examine the underlying HTML code, you can use Chrome, right-click, and choose **Inspect**.

For reference, the documentation for BeautifulSoup is here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [None]:
from bs4 import BeautifulSoup
import requests
headers = { "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36" }

response = requests.get('https://www.bannerhealth.com/patients/billing/pricing-resources/hospital-price-transparency', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
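# The hospital lists sit inside the first "col-md-8" column of the page.
div = soup.find_all('div', {"class": "col-md-8"})[0]
hospitals = []
for hospital_list in div.find_all('ul'):
    # The state name is two siblings back from each <ul>
    # (the nearest sibling is presumably a whitespace text node).
    state = hospital_list.previous_sibling.previous_sibling.string
    for hospital in hospital_list.find_all('li'):
        hospitals.append([hospital.text, state])
hospitals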
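
Chaining `previous_sibling` twice depends on the exact whitespace between each heading and its list. As a minimal sketch of the same idea on a made-up HTML fragment, `find_previous_sibling` skips text nodes and looks only at tags. The markup, the `h3` headings, the hospital names, and the `parse_hospitals` helper are all assumptions for illustration, not the actual structure of the Banner Health page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a state heading followed by a list of hospitals.
SAMPLE_HTML = """
<div class="col-md-8">
  <h3>Arizona</h3>
  <ul>
    <li>Hospital A</li>
    <li>Hospital B</li>
  </ul>
  <h3>Colorado</h3>
  <ul>
    <li>Hospital C</li>
  </ul>
</div>
"""


def parse_hospitals(html):
    """Return [hospital, state] pairs from heading + <ul> groups."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="col-md-8")
    pairs = []
    for hospital_list in container.find_all("ul"):
        # find_previous_sibling only considers tags, so whitespace
        # text nodes between the heading and the list are ignored.
        heading = hospital_list.find_previous_sibling("h3")
        state = heading.get_text(strip=True)
        for item in hospital_list.find_all("li"):
            pairs.append([item.get_text(strip=True), state])
    return pairs


hospital_pairs = parse_hospitals(SAMPLE_HTML)
print(hospital_pairs)
```

On the real page, the heading tag would need to be confirmed with **Inspect** before swapping this in.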

---

### 33.2 Using Pandas to Read Tables


Pandas documentation for loading data: https://pandas.pydata.org/pandas-docs/version/0.23.4/api.html#input-output

Pandas documentation for describing the shape of data: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.shape.html

In [None]:
import pandas as pd

In [None]:
tables = pd.read_html('https://en.wikipedia.org/wiki/Diagnosis-related_group')
len(tables)

In [None]:
for index,table in enumerate(tables):
    print("**************TABLE {}".format(index))
    print(table)
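
Printing every table in full can be hard to scan when a page has many of them. A minimal sketch of an alternative is to summarize each table by its `shape` and column names first, then look closely at only the promising one. The `summarize_tables` helper is hypothetical, and the `tables` list here is built by hand as a stand-in for the output of `pd.read_html` so the example runs offline.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html(...); built by hand
# so the sketch runs without network access. The values are made up.
tables = [
    pd.DataFrame({"Code": [1, 2, 3], "Description": ["a", "b", "c"]}),
    pd.DataFrame({"Year": [2019, 2020]}),
]


def summarize_tables(tables):
    """Return one row per table: its index, shape, and column names."""
    rows = []
    for index, table in enumerate(tables):
        n_rows, n_cols = table.shape  # shape is a (rows, columns) tuple
        rows.append(
            {
                "table": index,
                "rows": n_rows,
                "columns": n_cols,
                "column_names": list(table.columns),
            }
        )
    return pd.DataFrame(rows)


summary = summarize_tables(tables)
print(summary)
```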

In [None]:
drgs = tables[4]
drgs

---

### 33.3 Find Something of Your Own

Do some web searches and find an HTML page with some data that is relevant to something you're studying.  You can extract and parse that information using either BeautifulSoup or Pandas.  If you're using Pandas, then do something interesting to format and structure your data.  If you're using BeautifulSoup, you'll just need to do the work of parsing the data out of HTML -- that's hard enough!

You don't need to build this as a function.  Just use notebook cells as I've done above.  You will be graded based on _style_.  Use variable names that make sense for your problem / solution.  Clean up anything you don't need before you submit your work.

##### Question: Is there another way to access Glassdoor? I think my code is correct, but Glassdoor seems to have extra security.
##### https://www.glassdoor.com/Award/Best-Places-to-Work-LST_KQ0,19.htm  - 2021 list


In [None]:
# import requests
# from bs4 import BeautifulSoup

# url = 'https://www.glassdoor.com/Award/Best-Places-to-Work-LST_KQ0,19.htm'
# headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"}

# page = requests.get(url, headers=headers)
# soup=BeautifulSoup(page.text,'html.parser')

#page.status_code #gives a 503 code 

#### Create a list of companies featured in Fortune's "Best Workplaces for Millennials" articles

1. Includes data from the 2016 to 2020 articles (the HTML structure is the same each year)
2. Provides brief information about the data
3. High-level question: Once I have a list of top companies, how would you go about checking their careers websites for openings relevant to data science?

'https://www.greatplacetowork.com/best-workplaces/Millennials/2020' #alternates between Large (75 companies listed)/Small & Medium (25 companies listed) 

In [254]:
import requests
from bs4 import BeautifulSoup
from datetime import date
import pandas as pd

site = 'https://www.greatplacetowork.com/best-workplaces/Millennials' 
current_year = date.today().year
companies = []
error_msg = []

# range() excludes its end value, so the loop stops at last year's list.
for year in range(2016, current_year):
    url = f"{site}/{year}"
    page = requests.get(url)

    # Record any non-2xx response and move on to the next year.
    if not (200 <= page.status_code < 300):
        error_msg.append("The webpage {} gave the error {}".format(url, page.status_code))
        continue

    soup = BeautifulSoup(page.text, 'html.parser')
    div = soup.find_all('div', {"class": "col-md-5 col-xs-12 company-text"})

    for company in div:
        if len(company.find_all("ul")) < 3:  # skip companies that don't have complete info
            continue
        companies.append(
            {"Name": company.a['title'],
             "Industry": company.find("ul", {"class": "industry fa-ul"}).li.i.next_sibling,
             "Location": company.find("ul", {"class": "location fa-ul"}).li.i.next_sibling,
             "Year": year})

if error_msg:
    print(error_msg)
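
The extraction step can be tried offline on a small HTML fragment. The following is a minimal sketch, assuming markup modelled on the class names used in the cell above. The sample HTML, the `size` list, and the `text_after_icon` and `parse_companies` helpers are hypothetical. It also swaps the "fewer than three `<ul>` elements" check for an explicit check that the industry and location fields were found.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used above.
SAMPLE_HTML = """
<div class="col-md-5 col-xs-12 company-text">
  <a title="Example Health" href="#">Example Health</a>
  <ul class="industry fa-ul"><li><i class="fa"></i>Health Care</li></ul>
  <ul class="location fa-ul"><li><i class="fa"></i>Tampa, FL</li></ul>
  <ul class="size fa-ul"><li><i class="fa"></i>1,000 employees</li></ul>
</div>
<div class="col-md-5 col-xs-12 company-text">
  <a title="Incomplete Co" href="#">Incomplete Co</a>
  <ul class="industry fa-ul"><li><i class="fa"></i>Retail</li></ul>
</div>
"""


def text_after_icon(card, css_class):
    """Return the text that follows the <i> icon in the matching <ul>."""
    ul = card.find("ul", class_=css_class)
    if ul is None or ul.li is None or ul.li.i is None:
        return None
    sibling = ul.li.i.next_sibling
    # NavigableString is a str subclass; anything else is not plain text.
    return sibling.strip() if isinstance(sibling, str) else None


def parse_companies(html, year):
    """Return one dict per company card that has industry and location."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.find_all("div", class_="company-text"):
        industry = text_after_icon(card, "industry")
        location = text_after_icon(card, "location")
        if industry is None or location is None:
            continue  # skip cards with incomplete information
        records.append(
            {"Name": card.a["title"], "Industry": industry,
             "Location": location, "Year": year}
        )
    return records


records = parse_companies(SAMPLE_HTML, 2020)
print(records)
```

Matching on a single class such as `company-text` is less brittle than matching the full class string, at the cost of possibly matching more elements on a real page.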
  

In [None]:
#Create dataframe
companies_df = pd.DataFrame(companies)
companies_df

In [253]:
#Get Health Care Information
healthcare = companies_df[companies_df['Industry'] == 'Health Care'] #list of health care companies 
healthcare.groupby('Name').count() #how often health care companies occurred on list

| Name | Industry | Location | Year |
|---|---|---|---|
| Aledade | 1 | 1 | 1 |
| American Heart Association | 1 | 1 | 1 |
| BayCare Health System, Inc. | 3 | 3 | 3 |
| Cottage Health | 1 | 1 | 1 |
| CoverMyMeds | 1 | 1 | 1 |
| Encompass Health - Home Health & Hospice | 3 | 3 | 3 |
| Exact Sciences Corporation | 1 | 1 | 1 |
| Jackson Healthcare | 1 | 1 | 1 |
| Nicklaus Childrens Health System | 1 | 1 | 1 |
| OhioHealth | 1 | 1 | 1 |


In [255]:
#HTML Notes:
#div[0]
#div[0].li.i.attrs #can get attributes of a nested tag
#div[0].li.i.next_sibling #Since industry is listed first in li tag sequence, this returns industry dict

#DF Notes:
#counts = companies_df.groupby('Name').count() #gives counts across all columns 
#winners = counts[counts.Year == max(counts.Year)] #companies that made the list every year 
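
The idea in the DF notes can be sketched on a small stand-in frame. This version assumes the same column names as `companies_df` but uses made-up rows. It counts distinct years per company and compares that count against the number of years in the data, which is a slightly stricter test than taking the maximum count.

```python
import pandas as pd

# Stand-in rows with the same columns as companies_df; values are made up.
sample_df = pd.DataFrame(
    [
        {"Name": "Alpha", "Industry": "Health Care", "Location": "X", "Year": 2018},
        {"Name": "Alpha", "Industry": "Health Care", "Location": "X", "Year": 2019},
        {"Name": "Alpha", "Industry": "Health Care", "Location": "X", "Year": 2020},
        {"Name": "Beta", "Industry": "Health Care", "Location": "Y", "Year": 2019},
        {"Name": "Gamma", "Industry": "Retail", "Location": "Z", "Year": 2018},
        {"Name": "Gamma", "Industry": "Retail", "Location": "Z", "Year": 2020},
    ]
)

# Number of distinct years in which each company appears.
years_listed = sample_df.groupby("Name")["Year"].nunique()

# Companies present in every year covered by the data.
total_years = sample_df["Year"].nunique()
every_year = years_listed[years_listed == total_years].index.tolist()

print(years_listed)
print(every_year)
```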

---

## Submitting Your Work

In order to submit your work, you'll need to use the `git` command line program to **add** your homework file (this file) to your local repository, **commit** your changes to your local repository, and then **push** those changes up to github.com.  From there, I'll be able to **pull** the changes down and do my grading.  I'll provide some feedback, **commit** and **push** my comments back to you.  Next week, I'll show you how to **pull** down my comments.

To run through everything one last time and submit your work:
1. Use the `Kernel` -> `Restart Kernel and Run All Cells` menu option to run everything from top to bottom and stop here.
2. Save this notebook with Ctrl-S (or Cmd-S).
3. Skip down to the last command cell (the one starting with `%%bash`) and run that cell.

If anything fails along the way with this submission part of the process, let me know.  I'll help you troubleshoot.

In [None]:
assert False, "DO NOT REMOVE THIS LINE"

---

In [None]:
%%bash
git pull
git add week08_assignment_2.ipynb
git commit -a -m "Submitting the week 8 programming assignment"
git push


---

If the message above says something like _Submitting the week 8 programming assignment_ or _Everything up-to-date_, then your work was submitted correctly.