## 3.3 Reading and Resources

The Essential reading for this topic is:
[McKinney, W. Python for Data Analysis. (Boston: O'Reilly, 2017) 2nd edition. pp.181-185.](https://onlinelibrary.london.ac.uk/)

Useful reference material can be found at:

- [Python 'http protocol client', Internet protocols and support](https://docs.python.org/3/library/http.client.html)
- [Python 'Simple HTML and XHTML parser'](https://docs.python.org/3/library/html.parser.html)
- [Pyquery 'A jquery-like library for Python'](https://pythonhosted.org/pyquery/)
- [Requests: HTTP for Humans](https://requests.readthedocs.io/en/master)

## JSON Data

JSON is very nearly valid Python code; the main exception is its null value, `null` (the booleans are also spelled `true` and `false`)  
The `json` library is built into Python  

`json.loads` => converts a JSON string to Python form

`json.dumps` => converts a Python object back to JSON
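
As a small illustration of the two functions, here is a minimal sketch of a round trip. The JSON text is made up for this example; it shows how `null` and `true` map to Python's `None` and `True` and back again.

```python
import json

# Made-up JSON text for illustration only
text = '{"name": "Wes", "pet": null, "active": true, "scores": [1, 2.5]}'

# JSON string -> Python objects (dict, list, str, int, float, bool, None)
parsed = json.loads(text)
print(parsed["pet"], parsed["active"])

# Python objects -> JSON string
round_tripped = json.dumps(parsed)
print(round_tripped)
```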

**Converting to a DataFrame**

You can pass a list of dicts (previously JSON objects) to the DataFrame constructor and select a subset of the data fields.

`pandas.read_json` can automatically convert JSON datasets in specific arrangements into a Series or DataFrame  
By default, it assumes that each object in the JSON array is a row in the table  
Example: `data = pd.read_json('examples/example.json')`  

`data.to_json()` exports data from pandas to JSON
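
The default row-per-object behaviour can be tried without a file on disk. The sketch below assumes a small made-up JSON array and wraps it in `StringIO` so that `read_json` treats it as a file-like object. `orient="records"` is one of several output layouts that `to_json` supports.

```python
import json
from io import StringIO

import pandas as pd

# Made-up JSON array: each object becomes one row
json_text = '[{"a": 1, "b": 2}, {"a": 3, "b": 4}]'

frame = pd.read_json(StringIO(json_text))
print(frame)

# Export back to JSON, one object per row
records_json = frame.to_json(orient="records")
print(json.loads(records_json))
```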

In [7]:
import json
import pandas as pd

obj = """ 
{"name": "Wes", 
"places_lived": ["United States", "Spain", "Germany"], 
"pet": null, 
"siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, 
            {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}] 
} 
"""

result = json.loads(obj)
asjson = json.dumps(result)

# build a DataFrame from the list of sibling dicts, keeping only name and age

siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings



|   | name  | age |
|---|-------|-----|
| 0 | Scott | 30  |
| 1 | Katie | 38  |


## Web Scraping

### Pandas `read_html`
A built-in pandas function that uses the libraries 'lxml' and 'BeautifulSoup' to automatically parse tables out of HTML files as DataFrame objects.



In [13]:
tables = pd.read_html('data/fdic_failed_bank_list.html')
len(tables)

failures = tables[0]
failures.head()

# compute number of bank failures by year

close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64
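
The year count above works on any column of date strings. Here is a minimal sketch of the same steps on a small made-up frame; the bank names and dates are invented for illustration and are not taken from the FDIC file.

```python
import pandas as pd

# Hypothetical stand-in for the FDIC table; names and dates are invented
failures_demo = pd.DataFrame({
    "Bank Name": ["Bank A", "Bank B", "Bank C", "Bank D"],
    "Closing Date": ["2010-04-30", "2010-07-16", "2009-11-06", "2011-01-21"],
})

# Parse the strings into timestamps, then count rows per calendar year
close_dates = pd.to_datetime(failures_demo["Closing Date"])
failures_per_year = close_dates.dt.year.value_counts().sort_index()
print(failures_per_year)
```

`sort_index()` orders the result by year; without it, `value_counts()` orders by count, as in the output above.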

### Parsing XML with lxml.objectify


In [19]:
from lxml import objectify

path = 'data/Performance_MNR.xml'
# use a context manager so the file handle is closed after parsing
with open(path) as f:
    parsed = objectify.parse(f)
root = parsed.getroot()

data = []

skip_fields=['PARENT_SEQ', 'INDICATOR_SEQ', 'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data={}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)
    
# convert into a dataframe
perf = pd.DataFrame(data)

perf.head()

|   | AGENCY_NAME | INDICATOR_NAME | DESCRIPTION | PERIOD_YEAR | PERIOD_MONTH | CATEGORY | FREQUENCY | INDICATOR_UNIT | YTD_TARGET | YTD_ACTUAL | MONTHLY_TARGET | MONTHLY_ACTUAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 1 | Service Indicators | M | % | 95 | 96.9 | 95 | 96.9 |
| 1 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 2 | Service Indicators | M | % | 95 | 96.0 | 95 | 95.0 |
| 2 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 3 | Service Indicators | M | % | 95 | 96.3 | 95 | 96.9 |
| 3 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 4 | Service Indicators | M | % | 95 | 96.8 | 95 | 98.3 |
| 4 | Metro-North Railroad | On-Time Performance (West of Hudson) | Percent of commuter trains that arrive at thei... | 2008 | 5 | Service Indicators | M | % | 95 | 96.6 | 95 | 95.8 |


XML data can get much more complicated than this  
Each tag can carry metadata too, in the form of attributes

Example:

In [22]:
from io import StringIO
tag = "<a href='http://www.google.com'>Google</a>"
root = objectify.parse(StringIO(tag)).getroot()

# now can access any of the fields like href in the tag or the link text
root

print(root.get('href'), root.text)

http://www.google.com Google
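
If lxml is not installed, the same attribute access can be sketched with the standard library's `xml.etree.ElementTree`. This is a swapped-in alternative to `lxml.objectify`, not what the example above uses; the element API (`get`, `text`, `attrib`) is similar.

```python
import xml.etree.ElementTree as ET

tag = "<a href='http://www.google.com'>Google</a>"

# Parse the string directly into a single element
element = ET.fromstring(tag)

# Attributes behave like a dict; the link text is in .text
print(element.tag, element.attrib)
print(element.get('href'), element.text)
```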


## `http.client` -- HTTP protocol client

### [examples](https://docs.python.org/3/library/http.client.html#examples)


In [26]:
import http.client
conn = http.client.HTTPSConnection('www.python.org')
conn.request('GET', '/')
r1 = conn.getresponse()
print(r1.status, r1.reason)

# return entire content
data1 = r1.read()


200 OK


In [28]:
# read the response body in 200-byte chunks

conn.request('GET', '/')
r1 = conn.getresponse()
while chunk := r1.read(200):
    print(repr(chunk))
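
The chunked-read loop can be tried offline. The sketch below wraps the same `read(size)` loop in a hypothetical helper, `read_in_chunks`, and feeds it an `io.BytesIO` object as a stand-in for the HTTP response; both names are assumptions for illustration. Any object with a compatible `read` method should work the same way.

```python
import io

def read_in_chunks(response, chunk_size=200):
    """Yield successive chunks until read() returns empty bytes."""
    while chunk := response.read(chunk_size):
        yield chunk

# Stand-in for a response body: 450 bytes of dummy data
fake_response = io.BytesIO(b"x" * 450)

chunks = list(read_in_chunks(fake_response))
sizes = [len(c) for c in chunks]
body = b"".join(chunks)
print(sizes, len(body))
```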

## `html.parser` -- Simple HTML and XHTML parser


### [examples](https://docs.python.org/3/library/html.parser.html#examples)



In [29]:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')



Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
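
Beyond printing events, the handlers can collect data. Here is a minimal sketch, assuming a hypothetical `LinkCollector` subclass and a made-up HTML snippet, that gathers the `href` of every `<a>` tag. `attrs` arrives as a list of `(name, value)` pairs.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from anchor tags (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

collector = LinkCollector()
collector.feed('<p><a href="https://www.python.org">Python</a> '
               'and <a href="/docs">docs</a></p>')
print(collector.links)
```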


## pyquery: a jQuery-like library for Python

### [quickstart docs](https://pythonhosted.org/pyquery/)

## Requests: HTTP for Humans

A simple HTTP library for Python.

### [requests quickstart docs](https://docs.python-requests.org/en/latest/user/quickstart/)
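
A slightly more defensive way to make a GET request is sketched below. The helper name `fetch_json` and the 10-second default are assumptions for illustration. `timeout` stops the call from hanging indefinitely, and `raise_for_status()` turns 4xx/5xx responses into exceptions instead of letting them pass silently.

```python
import requests

def fetch_json(url, timeout=10):
    """GET a URL and return the decoded JSON body (illustrative helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.json()
```

Calling `fetch_json('https://api.github.com/events')` would be expected to return a list of event dicts like the one printed in the cell that follows.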


In [33]:
import requests
r = requests.get('https://api.github.com/events')
r.status_code
r.json()

[{'id': '16196834118',
  'type': 'CreateEvent',
  'actor': {'id': 78611625,
   'login': 'BaeTheDreamBoat',
   'display_login': 'BaeTheDreamBoat',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/BaeTheDreamBoat',
   'avatar_url': 'https://avatars.githubusercontent.com/u/78611625?'},
  'repo': {'id': 364302571,
   'name': 'BaeTheDreamBoat/School-particle-simulation',
   'url': 'https://api.github.com/repos/BaeTheDreamBoat/School-particle-simulation'},
  'payload': {'ref': 'main',
   'ref_type': 'branch',
   'master_branch': 'main',
   'description': 'Version controll for the WIP School disease/virus spread simulation',
   'pusher_type': 'user'},
  'public': True,
  'created_at': '2021-05-04T15:30:33Z'},
 {'id': '16196834110',
  'type': 'PushEvent',
  'actor': {'id': 80655633,
   'login': 'CzY0913',
   'display_login': 'CzY0913',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/CzY0913',
   'avatar_url': 'https://avatars.githubusercontent.com/u/80655633?'},
  'r