In [119]:
import re
import requests
import pandas as pd

from bs4 import BeautifulSoup

## Part I: Working with HTML and JSON

#### Reading data from the JSON file

In [12]:
json_data = pd.read_json("https://raw.githubusercontent.com/ktxdev/AIM-5001/main/M9/1.%20Data/albums.json")
json_data

|   | title | artist | year | tracks |
|---|---|---|---|---|
| 0 | And Then | Christopher Martin | 2019 | [Life, Come Back, Bun Fi Bun, Can't Dweet Agai... |
| 1 | Mustard Seed | Nutty O | 2021 | [Open Doors, Safe, Ndiwe, Peter Pan, Ready, Ku... |
| 2 | Melody | Demarco | 2021 | [My Way, Do It Again, For You, In My Heart, St... |


#### Reading data from the HTML file

In [13]:
# read_html returns a list of DataFrames (one per <table>); take the first
html_data = pd.read_html("https://raw.githubusercontent.com/ktxdev/AIM-5001/main/M9/1.%20Data/albums.html")[0]
html_data

|   | Title | Artist | Year | Tracks |
|---|---|---|---|---|
| 0 | And Then | Christopher Martin | 2019 | Life Come Back Bun Fi Bun Can't Dweet Again... |
| 1 | Mustard Seed | Nutty O | 2021 | Open Doors Safe Ndiwe Ready Kungfu Peter Pan |
| 2 | Melody | Demarco | 2021 | My Way Do It Again For You In My Heart Stu... |


In [37]:
print("Tracks Type (json_data): ", type(json_data['tracks'][0]))
print("Tracks Type (html_data): ", type(html_data['Tracks'][0]))

Tracks Type (json_data):  <class 'list'>
Tracks Type (html_data):  <class 'str'>


The two DataFrames look alike, but their track columns differ, so they are not truly equivalent. Each cell of `json_data['tracks']` is a Python `list` of track names, whereas each cell of `html_data['Tracks']` is a single `str` in which the track names have been run together. The column labels also differ in case (`tracks` vs. `Tracks`).
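To make the practical consequence of that type difference concrete, here is a minimal sketch using two tiny stand-in frames. The names `json_like` and `html_like` and their contents are hypothetical, not the real album data. It shows that a list cell can be expanded into one row per track, while a string cell has lost the boundaries between track names.

```python
import pandas as pd

# Hypothetical miniature stand-ins for json_data and html_data (not the real albums)
json_like = pd.DataFrame({"title": ["A"], "tracks": [["One", "Two Three"]]})
html_like = pd.DataFrame({"Title": ["A"], "Tracks": ["One Two Three"]})

# A list cell keeps track boundaries, so it can be expanded to one row per track
exploded = json_like.explode("tracks", ignore_index=True)

# A string cell has lost those boundaries: splitting on spaces over-splits
# any track whose name contains a space
naive_split = html_like["Tracks"].str.split(" ").iloc[0]

# Going the other way is easy: joining the list gives the flattened string form
joined = json_like["tracks"].apply(" ".join)

print(exploded)
print(naive_split)
print(joined.iloc[0] == html_like["Tracks"].iloc[0])
```

In this toy case the list form yields two tracks, while the naive split of the string yields three pieces, which is why the list representation is the easier one to work with.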

## Part II: Scraping the Katz School’s “Staff” Web Page

In [147]:
page = requests.get("https://www.yu.edu/katz/staff")
soup = BeautifulSoup(page.content, 'html.parser')

# Body-text blocks of the page; the walk below starts in the first one
fields = soup.find_all(class_="field--name-field-paragraph-body")

records = []  # one dict per staff member, converted to a DataFrame at the end

# Walk the siblings in order: each <h3> names an office, each non-empty <p> is a staff member
curr_sibling = fields[0].find('h3')
current_office = None
while curr_sibling:

    if curr_sibling.name == 'h3':
        current_office = curr_sibling.contents[0].strip()
    elif curr_sibling.name == 'p' and curr_sibling.contents[0].strip() != "":
        contents = curr_sibling.contents

        staff_member = { "name": contents[0].split(",")[0], "office": current_office, "email": "N/A", "phone": "N/A" }

        # The title is either wrapped in a <span> or follows the name after a comma
        if curr_sibling.find('span'):
            staff_member["title"] = curr_sibling.find('span').contents[0].strip()
        else:
            staff_member["title"] = contents[0].split(",")[1].strip()

        # An entry may have no link at all, so guard against a missing <a> tag
        mail = curr_sibling.find('a')

        if mail is not None and mail.get("href", "").startswith("mailto"):
            staff_member["email"] = mail.contents[0]

        # Runs of 10+ digits, dashes, dots, or spaces are treated as phone numbers
        phone = re.findall(r"[0-9-. ]{10,}", curr_sibling.text)

        if phone:
            staff_member["phone"] = phone[0].strip()

        records.append(staff_member)

    curr_sibling = curr_sibling.find_next_sibling()

staff_info = pd.DataFrame(records, columns = ["name", "title", "office", "email", "phone"])


staff_info

|   | name | title | office | email | phone |
|---|---|---|---|---|---|
| 0 | Paul Russo | Vice Provost and Dean | Office of the Dean |  |  |
| 1 | Aaron Ross | Assistant Dean for Academic Programs and Deput... | Office of the Dean | aaron.ross2@yu.edu | 646-592-4148 |
| 2 | Jackie Hamilton | Executive Director of Enrollment Management an... | Office of the Dean | jackie.hamilton@yu.edu | 646-787-6194 |
| 3 | Pamela Rodman | Director of Finance and Administration | Office of the Dean | pamela.rodman@yu.edu | 646.592.4777 |
| 4 | Tabitha Collazo | Business and Operations Coordinator | Office of the Dean | tabitha.collazo@yu.edu | 646-592-4735 |
| 5 | Ann Leary | Office Manager/Executive Assistant to the Dean... | Office of the Dean | ann.leary@yu.edu | 646-592-4724 |
| 6 | Jared Hakimi | Director | Graduate Admissions | jared.hakimi@yu.edu | 646-592-4722 |
| 7 | Xavier Velasquez | Associate Director of Graduate Admissions Oper... | Graduate Admissions | xavier.velasquez@yu.edu | 646-592-4737 |
| 8 | Shayna Matzner | Assistant Director | Graduate Admissions | shayna.matzner@yu.edu | 646-592-4726 |
| 9 | Linyu Zheng | Assistant Director | Graduate Admissions | linyu.zheng@yu.edu | 1-332-271-5865 |
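
The phone numbers in the output above come in several formats: dashes, dots, and a leading country code. The following is a minimal sketch of how the pattern `[0-9-. ]{10,}` from the scraping cell behaves on such values. The sample strings and the helper `extract_phone` are hypothetical, assembled from values shown in the table rather than taken from the page's actual markup, and stripping surrounding whitespace is an assumption made here for illustration.

```python
import re

# Same pattern as in the scraping cell: a run of 10+ digits, dashes, dots, or spaces
PHONE_PATTERN = re.compile(r"[0-9-. ]{10,}")

def extract_phone(text):
    """Return the first phone-like run in text, or None if there is none."""
    match = PHONE_PATTERN.search(text)
    # The character class includes spaces, so the raw match can carry
    # surrounding whitespace; strip it before returning
    return match.group().strip() if match else None

# Hypothetical sample strings built from values in the output table
samples = [
    "aaron.ross2@yu.edu 646-592-4148",
    "pamela.rodman@yu.edu 646.592.4777",
    "linyu.zheng@yu.edu 1-332-271-5865",
    "Vice Provost and Dean",
]

for sample in samples:
    print(repr(extract_phone(sample)))
```

Dots and digits inside the email addresses are not picked up because those runs are shorter than ten characters. An entry with no number returns `None`, which corresponds to the case where the scraping cell keeps its default phone value.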


## Part III: Working with Web APIs