# Module 8: Working With Data from the Web

## Scraping a Web Site

- Before we scrape a website, we __must__ examine the site's 'robots.txt' file to determine whether or not the site allows its content to be scraped. 


- To find the robots.txt file for a given web site, append “/robots.txt” to the site's base URL. For example, if we wanted to scrape some data from apartments.com, we would enter https://www.apartments.com/robots.txt in the browser's address bar.


- If the robots.txt allows full access, the __Disallow__ field will be blank, e.g.:

__User-agent: *__

__Disallow:__


- If the robots.txt blocks all access, the __Disallow__ field will contain a single forward slash, e.g.:

__User-agent: *__

__Disallow: /__


- If the robots.txt allows partial access, each prohibited or allowed section will be identified by a path enclosed in forward slashes, e.g.:

__User-agent: *__

__Disallow: /section/__

__Allow: /section/__
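
These rules can also be checked from Python code. The sketch below uses the standard library's `urllib.robotparser` module; the robots.txt lines, the `example.com` URLs, and the `is_allowed` helper are made up for illustration and do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines so that
# the example runs without a network connection
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /public/",
]

parser = RobotFileParser()
parser.parse(robots_lines)


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://www.example.com/public/page.html"))
print(is_allowed("https://www.example.com/private/page.html"))
```

To check a live site instead, the same parser can be pointed at the site's robots.txt with `parser.set_url(...)` followed by `parser.read()`.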


In the case of apartments.com, the current contents of the robots.txt file can be viewed at:

https://www.apartments.com/robots.txt


This tells us that we are __NOT__ allowed to scrape the 'services' and 'virtual-leasing-office' section(s) of that web site.


If we examine Google's robots.txt we find that there is a long list of allowed and disallowed scraping activities:

https://www.google.com/robots.txt


### Remember, you MUST check this for any web site you are interested in scraping. If the robots.txt tells you that scraping of either parts or the entirety of the web site is prohibited, DO NOT SCRAPE ANY DATA FROM THE SPECIFIED SECTIONS OF THAT SITE.

## How to Access Web Pages from your Python Code: 'urllib' vs. 'requests'

- 'urllib' and 'requests' are two widely used Python libraries that facilitate the reading and sending of data from/to web pages.


- 'urllib' is the older of the two packages. It ships with Python's standard library and has therefore been widely adopted.


- 'requests' is newer and attempts to offer improved ease of use relative to the older 'urllib' package.


- Understanding how both packages work is a necessity for any Python user who wishes to interact with web pages via a Python application.


### The urllib library

- The __urllib.request__ module facilitates the opening of a URL from within a Python application (https://docs.python.org/3/library/urllib.request.html#module-urllib.request). 


- A simple example: Access the __python.org__ main page and extract its first 300 bytes.

In [1]:
# load the urllib.request module
import urllib.request

# use the urlopen() function to open the python.org URL
with urllib.request.urlopen('http://www.python.org/') as f:
    # read and print the first 300 bytes found on the python.org web page
    print(f.read(300))

b'<!doctype html>\n<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->\n<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->\n<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->\n<!--[if gt IE 8]><!--><html class="no-js"'


Note how urlopen() simply provides access to the underlying HTML (or XML) code: it does not parse or otherwise filter/arrange the data it is extracting. 


### The requests library

- __requests__ aims to simplify the syntax used in the __urllib__ library.


- Some Python users say they don't see much difference when using __requests__ vs. __urllib__.


- A simple example: Request GitHub's retired __timeline.json__ endpoint and display the text of the response.

In [2]:
import requests

r = requests.get('https://github.com/timeline.json')
print(r.text)

{"message":"Hello there, wayfaring stranger. If you’re reading this then you probably didn’t see our blog post a couple of years back announcing that this API would go away: http://git.io/17AROg Fear not, you should be able to get what you need from the shiny new Events API instead.","documentation_url":"https://docs.github.com/v3/activity/events/#list-public-events"}


In [3]:
# a second simple example: Access the github.com home page and display the
# HTTP response headers returned along with that page
r = requests.get('https://github.com')
r.headers

{'Server': 'GitHub.com', 'Date': 'Wed, 19 Oct 2022 16:50:56 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Vary': 'X-PJAX, X-PJAX-Container, Turbo-Visit, Turbo-Frame, Accept-Language, Accept-Encoding, Accept, X-Requested-With', 'content-language': 'en-US', 'ETag': 'W/"2f314c3a72594987926a64698dfc6e7e"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Strict-Transport-Security': 'max-age=31536000; includeSubdomains; preload', 'X-Frame-Options': 'deny', 'X-Content-Type-Options': 'nosniff', 'X-XSS-Protection': '0', 'Referrer-Policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'Expect-CT': 'max-age=2592000, report-uri="https://api.github.com/_private/browser/errors"', 'Content-Security-Policy': "default-src 'none'; base-uri 'self'; block-all-mixed-content; child-src github.com/assets-cdn/worker/ gist.github.com/assets-cdn/worker/; connect-src 'self' uploads.github.com objects-origin.githubusercontent.com www.githubstatus.com collector.github.com raw.githubuse

Note that like urlopen(), requests.get() simply provides access to the underlying HTML (or XML) code: it does not parse or otherwise filter/arrange the data it is extracting.
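
Because neither function parses what it downloads, pulling specific items out of a page is a separate step. As a minimal sketch of that step, the example below uses the standard library's `html.parser` module to collect link targets from an HTML string. The `LinkCollector` class and the HTML snippet are made up for illustration; the snippet stands in for the text that `r.text` or a decoded `f.read()` would return.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that is encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# made-up HTML standing in for downloaded page content
html_text = (
    '<html><body>'
    '<a href="https://www.python.org/">Python</a> '
    '<a href="https://github.com">GitHub</a>'
    '</body></html>'
)

collector = LinkCollector()
collector.feed(html_text)
print(collector.links)
```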

## JSON Data

- JSON (JavaScript Object Notation): One of the standard formats for transmitting data via HTTP requests (e.g., send data from a web browser to some other application).


- JSON relies on a set of Python-esque data structures to manage content. For example, a JSON 'object' is very similar to a Python 'dict' object, JSON 'arrays' are very similar to Python lists, and JSON also allows for strings, numbers, booleans, and null items.


- There are several pre-built Python libraries for reading and writing JSON data, including:

__json__: https://docs.python.org/3/library/json.html


__simplejson__: https://pypi.org/project/simplejson/


__ujson__: https://pypi.org/project/ujson/


__python-rapidjson__: https://pypi.org/project/python-rapidjson/


- We'll use the __json__ library (comes pre-installed with a basic Python installation) for some simple examples:

In [10]:
# load pandas
import pandas as pd

# define a JSON string: note that we are using a Python string object to store it
obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""

In [20]:
# import the json library
import json

In [6]:
# convert a json string into a Python object using the json.loads() function
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [7]:
# note that the Python object we've created is actually a dict object
type(result)

dict

In [7]:
# convert a python object to JSON format using json.dumps()
asjson = json.dumps(result)
asjson

'{"name": "Wes", "places_lived": ["United States", "Spain", "Germany"], "pet": null, "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}]}'

In [8]:
# note that the output of json.dumps() is a Python string object
type(asjson)

str
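
As a small additional sketch, `json.dumps()` also accepts the `indent` and `sort_keys` arguments, which make the output easier to read. The `record` dict below is a shortened, illustrative version of the data used above. Loading the formatted string back with `json.loads()` should reproduce the original Python object.

```python
import json

# a shortened, illustrative version of the record used above
record = {"name": "Wes", "pet": None, "places_lived": ["United States", "Spain"]}

# indent=2 spreads the output over several lines; sort_keys orders the keys
pretty = json.dumps(record, indent=2, sort_keys=True)
print(pretty)

# converting the string back yields an equal Python object
round_trip = json.loads(pretty)
print(round_trip == record)
```

Note how Python's `None` is written as JSON `null` and converted back to `None` on the return trip.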

### How to convert JSON data into a Pandas Dataframe?

- There is not one universally applicable approach for converting JSON data into a Pandas data frame. However, there are some built-in functions that will work with specific types of JSON data.

Here's an example of how to convert a subset of the data fields contained within a list of Python dict objects to a data frame:

In [10]:
# create a data frame using only the 'name' and 'age' fields of the 'siblings'
# key of the 'result' dict object we created above
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])
siblings

|   | name | age |
|---|------|-----|
| 0 | Scott | 30 |
| 1 | Katie | 38 |
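
When the records of interest are nested inside a larger JSON object, `pd.json_normalize()` is one of the built-in options. The sketch below is only an illustration: it rebuilds a shortened copy of the `result` dict, and the `parent_` prefix is an arbitrary choice made here to keep the top-level `name` field from colliding with each sibling's `name` field.

```python
import pandas as pd

# shortened copy of the 'result' dict from the earlier example
result = {
    "name": "Wes",
    "siblings": [
        {"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
        {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]},
    ],
}

# one row per sibling, with the top-level 'name' repeated on every row
flat = pd.json_normalize(
    result,
    record_path="siblings",
    meta=["name"],
    meta_prefix="parent_",
)
print(flat)
```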


### How to read JSON data from a file?

- pd.read_json() can convert properly formatted JSON data into a Pandas dataframe. The function assumes each object in a JSON array is supposed to be a row within the Pandas data frame that it creates.


- A simple example: https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/example.json

In [11]:
data = pd.read_json('https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/example.json')
data

|   | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 |
| 2 | 7 | 8 | 9 |


### How to export data from a Pandas object to JSON format?

- Use the __to_json()__ method:

In [12]:
# export content of a data frame to JSON format
jdata = data.to_json()
jdata

'{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}'

In [13]:
# the resulting object is a Python string. The contents of the string
# are very similar to that of a Python dict object
type(jdata)

str
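
The layout of the exported JSON can be changed with the `orient` argument of `to_json()`. As a hedged sketch, the example below rebuilds the same three-row data frame by hand (rather than downloading it) and exports it as one JSON object per row, which is the array-of-objects layout that `pd.read_json()` reads back as rows.

```python
import json
from io import StringIO

import pandas as pd

# the same values as the example data frame, built locally for illustration
data = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

# orient='records' writes a JSON array containing one object per row
records_json = data.to_json(orient="records")
print(records_json)

# the exported string can be read back into a data frame
restored = pd.read_json(StringIO(records_json), orient="records")
print(restored)
```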

## Reading & Writing HTML and XML Data

- XML is designed for __conveying/managing__ data, i.e., its focus is on data __content__.


- HTML is designed for __displaying__ data, i.e., its focus is on how data __looks__.


- Python has a variety of libraries for working with XML and HTML including:

__Beautiful Soup__ (https://www.crummy.com/software/BeautifulSoup/ ): A Python library that is used for extracting data from HTML and XML files/web pages. __Beautiful Soup__ by default uses Python's standard HTML parser. However, users can specify alternative parsers for __Beautiful Soup__ to use, such as __lxml__ and __html5lib__.


__lxml__ (https://lxml.de/index.html ): A parser that can work with both XML and HTML data. __lxml__ is often used instead of the default Python HTML parser due to its processing speed.


__html5lib__ (https://pypi.org/project/html5lib/ ): An HTML-specific parser that parses pages the same way a web browser does, which makes it very tolerant of malformed HTML. It is noticeably slower than both __lxml__ and the default Python HTML parser.


### Reading HTML via Pandas

Simple example of HTML content: https://www.pybloggers.com/2016/11/python-web-scraping-tutorial-using-beautifulsoup/

Pandas provides the __read_html()__ function, which relies on components of the libraries listed above. 

__NOTE__: If you get an error message when trying to use the __pd.read_html()__ function, check to see whether you have the __lxml__, __BeautifulSoup__, and __html5lib__ libraries installed. If not, install them as follows:

__conda install lxml__

__pip install beautifulsoup4 html5lib__

By default, __read_html()__ (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html) searches for __table__ tags within an HTML document and attempts to convert the rows and cells it finds inside each table, i.e., the content of the __tr__ ("table row"), __td__ ("table data"), and __th__ ("table header") tags, into a Pandas data frame (https://www.w3schools.com/tags/tag_th.asp). 
    
Here's an example from the PfDA textbook:

https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/fdic_failed_bank_list.html

This HTML file has embedded within it a lengthy table containing information on U.S. banks that failed. The closing dates in the data span 2000 through 2016, so the table covers far more than the 2007-2009 financial crisis. The table tells us the name and location of each failed bank, a certificate number (CERT), the name of the banking institution that acquired the failed bank, the date on which the bank was closed, and the date on which the record was last updated.

In [2]:
# example of reading an HTML file housed within the PfDA author's Github repo
tables = pd.read_html('https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/examples/fdic_failed_bank_list.html')
len(tables)

1

In [3]:
# pd.read_html returns a Python list object
type(tables)

list

In [4]:
# the first item (and in this case, the only item) in the list is a data frame
failures = tables[0]
type(failures)

pandas.core.frame.DataFrame

In [5]:
# check the first few rows of the data frame
failures.head()

|   | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date |
|---|-----------|------|----|------|-----------------------|--------------|--------------|
| 0 | Allied Bank | Mulberry | AR | 91 | Today's Bank | September 23, 2016 | November 17, 2016 |
| 1 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | August 19, 2016 | November 17, 2016 |
| 2 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | September 6, 2016 |
| 3 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | September 6, 2016 |
| 4 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 |


Now do some analysis on the extracted data: find the number of bank failures for each year:

In [6]:
# find the number of bank failures for each year within the data set
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()

2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: Closing Date, dtype: int64

### Web Scrape Example: Using BeautifulSoup to parse an HTML web page

This tutorial provides a step-by-step guide on how to use the BeautifulSoup library to extract specific content from a web page:

- https://www.pybloggers.com/2016/11/python-web-scraping-tutorial-using-beautifulsoup/

The BeautifulSoup reference documentation provides a detailed overview of all of the different ways in which the library can be used to extract content from various components of an HTML page: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
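
As a minimal sketch of the kind of extraction those resources describe, the example below parses a small HTML string with BeautifulSoup and Python's built-in `html.parser`. The HTML snippet, its class names, and its values are all made up for illustration; a real page would first be downloaded with `requests` or `urllib`.

```python
from bs4 import BeautifulSoup

# made-up HTML standing in for a downloaded web page
html_doc = """
<html><body>
  <h1 class="title">Forecast</h1>
  <div class="day"><p class="period">Tonight</p><p class="temp">Low: 49 F</p></div>
  <div class="day"><p class="period">Thursday</p><p class="temp">High: 63 F</p></div>
</body></html>
"""

# parse the string using Python's standard HTML parser
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching tag; find_all() returns every match
heading = soup.find("h1", class_="title").get_text()
periods = [p.get_text() for p in soup.find_all("p", class_="period")]
temps = [p.get_text() for p in soup.find_all("p", class_="temp")]

print(heading)
print(periods)
print(temps)
```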

### Reading XML

- eXtensible Markup Language (XML) supports hierarchical and nested data.


- XML is more "generalized" than HTML, meaning it can be used for a wider variety of applications than can HTML (remember: XML = how data is __conveyed/organized__ vs. HTML = how data __appears__).


An example from the PfDA text on how to read XML data using the __lxml__ library: NYC MTA bus and train performance data (data set located on PfDA author's GitHub repo)

https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/mta_perf/Performance_MNR.xml

In [7]:
# load the urllib.request module so that we can download the XML file
# from a web address to a local temporary file
import urllib.request

# load the objectify module from the lxml library
from lxml import objectify

In [8]:
# download the data set to a local temporary file: urlretrieve() returns
# the local file path along with the HTTP header info
path, headers = urllib.request.urlretrieve('https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/mta_perf/Performance_MNR.xml')

# objectify.parse() is then used to parse the downloaded file; the 'with'
# block makes sure the file is closed afterwards
with open(path) as f:
    parsed = objectify.parse(f)

# now get a reference to the root node of the XML file
root = parsed.getroot()

In [9]:
# what is 'root'? It contains the root node of our file, in this case
# 'PERFORMANCE'
root

<Element PERFORMANCE at 0x21f91513548>

In [10]:
# define an empty list that will be used to store the parsed data
data = []

# define a list of XML fields that we will ignore for this particular
# data set: Note that this requires DOMAIN KNOWLEDGE - you would have
# had to explore the XML content BEFORE specifying the list of fields
# to ignore
skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
               'DESIRED_CHANGE', 'DECIMAL_PLACES']

# root.INDICATOR is a generator that we use to extract each <INDICATOR>
# element from the XML data
for elt in root.INDICATOR:
    # for each record, create a dict of tag names (e.g., 'YTD_ACTUAL')
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

In [23]:
# now check the results
perf = pd.DataFrame(data)
perf.head()

|   | AGENCY_NAME | CATEGORY | DESCRIPTION | FREQUENCY | INDICATOR_NAME | INDICATOR_UNIT | MONTHLY_ACTUAL | MONTHLY_TARGET | PERIOD_MONTH | PERIOD_YEAR | YTD_ACTUAL | YTD_TARGET |
|---|-------------|----------|-------------|-----------|----------------|----------------|----------------|----------------|--------------|-------------|------------|------------|
| 0 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 96.9 | 95 | 1 | 2008 | 96.9 | 95 |
| 1 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 95.0 | 95 | 2 | 2008 | 96.0 | 95 |
| 2 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 96.9 | 95 | 3 | 2008 | 96.3 | 95 |
| 3 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 98.3 | 95 | 4 | 2008 | 96.8 | 95 |
| 4 | Metro-North Railroad | Service Indicators | Percent of commuter trains that arrive at thei... | M | On-Time Performance (West of Hudson) | % | 95.8 | 95 | 5 | 2008 | 96.6 | 95 |
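
The same record-by-record loop can be tried without downloading anything. The sketch below swaps in the standard library's `xml.etree.ElementTree` in place of `lxml.objectify`, and the XML string is a made-up, much smaller imitation of the MTA file's structure. One difference to keep in mind: ElementTree returns every value as text, whereas `child.pyval` in `lxml.objectify` converts values to Python types.

```python
import xml.etree.ElementTree as ET

# made-up XML that imitates the <PERFORMANCE>/<INDICATOR> structure
xml_text = """
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>1</INDICATOR_SEQ>
    <AGENCY_NAME>Example Agency</AGENCY_NAME>
    <PERIOD_YEAR>2008</PERIOD_YEAR>
    <YTD_ACTUAL>96.9</YTD_ACTUAL>
  </INDICATOR>
  <INDICATOR>
    <INDICATOR_SEQ>2</INDICATOR_SEQ>
    <AGENCY_NAME>Example Agency</AGENCY_NAME>
    <PERIOD_YEAR>2009</PERIOD_YEAR>
    <YTD_ACTUAL>95.2</YTD_ACTUAL>
  </INDICATOR>
</PERFORMANCE>
"""

root = ET.fromstring(xml_text)

# fields to leave out of each record
skip_fields = {"INDICATOR_SEQ"}

records = []
for indicator in root.findall("INDICATOR"):
    # map each child tag name to its text, skipping the unwanted fields
    record = {
        child.tag: child.text
        for child in indicator
        if child.tag not in skip_fields
    }
    records.append(record)

print(records)
```

The resulting list of dicts can be passed to `pd.DataFrame()` in the same way as the `data` list above.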


What if the XML data you encounter is more complicated than the MTA data we explored here? For example, consider an HTML tag, which itself is valid XML:

In [24]:
# load 'StringIO()' from io to enable string buffering
from io import StringIO

# a sample HTML tag to use in our example:
tag = '<a href="http://www.google.com">Google</a>'

# buffer the 'tag' string and then parse it using 'objectify()'
root = objectify.parse(StringIO(tag)).getroot()

In [25]:
# what was identified as the root of 'tag'?
root

<Element a at 0x16905203508>

In [26]:
# what was found to be contained in the 'href' field?
root.get('href')

'http://www.google.com'

In [27]:
# what text was found within the html tag?
root.text

'Google'

## Web Scrape Example: Pandas Tech Support Issues

- An example from the PfDA text using the __requests__ library: Find the 30 most recent issues for Pandas listed on GitHub:

https://api.github.com/repos/pandas-dev/pandas/issues


- Github will automatically return the requested information to us in JSON format; there is no need for us to scrape the specified web page's HTML. The automated provision of JSON serves as a form of API for interacting with the Github web site.

In [2]:
import requests
import pandas as pd

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

<Response [200]>

In [3]:
# now convert the 'resp' object's JSON content into a list of 
# native Python objects
data = resp.json()
type(data)


list

In [4]:
# how many items are in the list object?
len(data)

30

In [5]:
# check the content of the list
data

[{'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/51934',
  'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
  'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/51934/labels{/name}',
  'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/51934/comments',
  'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/51934/events',
  'html_url': 'https://github.com/pandas-dev/pandas/pull/51934',
  'id': 1622040323,
  'node_id': 'PR_kwDOAA0YD85L7ERg',
  'number': 51934,
  'title': 'REF: move ExtensionIndex.map to be part of DatetimeLikeArrayMixin.map',
  'user': {'login': 'topper-123',
   'id': 26364415,
   'node_id': 'MDQ6VXNlcjI2MzY0NDE1',
   'avatar_url': 'https://avatars.githubusercontent.com/u/26364415?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/topper-123',
   'html_url': 'https://github.com/topper-123',
   'followers_url': 'https://api.github.com/users/topper-123/followers',
   'f

In [6]:
# let's look at the data type of the first item in the list
type(data[0])

dict

In [7]:
# what's in the 'title' component of the dict?
data[0]['title']

'REF: move ExtensionIndex.map to be part of DatetimeLikeArrayMixin.map'

In [8]:
# now convert some of the extracted JSON data into a data frame: in this
# example we are extracting the 'number', 'title', 'labels' and 'state'
# components of the dict object
issues = pd.DataFrame(data, columns=['number', 'title',
                                     'labels', 'state'])
issues

|   | number | title | labels | state |
|---|--------|-------|--------|-------|
| 0 | 51934 | REF: move ExtensionIndex.map to be part of Dat... | [] | open |
| 1 | 51933 | Backport PR #51082 on branch 2.0.x (API / CoW:... | [{'id': 2085877452, 'node_id': 'MDU6TGFiZWwyMD... | open |
| 2 | 51932 | DOC: add examples to offsets CustomBusinessDay... | [{'id': 134699, 'node_id': 'MDU6TGFiZWwxMzQ2OT... | open |
| 3 | 51931 | BLD: Try strip-all instead of strip-debug | [{'id': 129350, 'node_id': 'MDU6TGFiZWwxMjkzNT... | open |
| 4 | 51930 | API / BUG: copy non-Index arrays in Index cons... | [{'id': 1218227310, 'node_id': 'MDU6TGFiZWwxMj... | open |
| 5 | 51929 | BUG: Wrong column order after inner merge oper... | [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... | open |
| 6 | 51928 | BUG: Incorrect parsing of ISO 8601 durations s... | [{'id': 49597148, 'node_id': 'MDU6TGFiZWw0OTU5... | open |
| 7 | 51926 | CoW: change ChainedAssignmentError exception t... | [{'id': 2085877452, 'node_id': 'MDU6TGFiZWwyMD... | open |
| 8 | 51925 | BUG: Series in a DataFrame can overwrite other... | [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... | open |
| 9 | 51924 | BUG: When csv has 1 line, pandas cannot read c... | [{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=... | open |
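
Each entry in the 'labels' column is itself a list of dicts, which is hard to read in a data frame. As a hedged sketch, the example below reduces each list to just the label names. The `sample_issues` records are made up to imitate the shape of the GitHub response (each label dict is assumed to carry a 'name' key), so the example runs without calling the API.

```python
import pandas as pd

# made-up records that imitate the structure of the GitHub issues response
sample_issues = [
    {
        "number": 101,
        "title": "Example bug report",
        "labels": [{"id": 1, "name": "Bug"}, {"id": 2, "name": "IO CSV"}],
        "state": "open",
    },
    {
        "number": 102,
        "title": "Example refactor",
        "labels": [],
        "state": "open",
    },
]

issues = pd.DataFrame(sample_issues, columns=["number", "title", "labels", "state"])

# replace each list of label dicts with a plain list of label names
issues["label_names"] = issues["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)
print(issues[["number", "label_names"]])
```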


## Web Scrape Example: Using Selenium to scrape a dynamically loaded web page

__Selenium__ is a software library that enables the scraping of __web pages whose content is loaded dynamically__ (i.e., as we interact with them) as opposed to "static" web pages that have fixed content for the duration of our page visit. 

Scraping web pages that are loaded dynamically can be challenging. However, the Selenium package makes it possible to successfully extract content from such pages via a relatively easy-to-understand set of constructs.  

With Selenium, your Python code drives a web browser: it navigates (or "interacts with") a web site and then extracts information from the browser as the pages respond to those navigation instructions.

- https://towardsdatascience.com/step-by-step-web-scraping-project-using-selenium-in-python-3be887e6e35c

## Interacting with Web APIs

- Many websites provide publicly available APIs that allow users to either access data feeds or download web page content via JSON, XML, HTML, or some other format.


- These APIs can be a great tool for collecting all sorts of data for further analysis.


- Unfortunately, the vast majority of websites __DO NOT__ provide Web APIs.


- Get started by searching the website for Web API information.


- If the site offers a Web API, you will most likely need to sign up for a Web API key (a unique identifier that enables your personal access to the content of the web site).


- Then, follow the instructions shown on the website for guidance on how to construct requests for specific information via their proprietary API.
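
Most Web API requests boil down to a base URL plus a set of query parameters, one of which is usually the API key. The sketch below shows one way to assemble such a URL with the standard library. The endpoint, the parameter names, and the `build_request_url` helper are all hypothetical; the real names come from the documentation of whichever API you are using.

```python
from urllib.parse import urlencode

# hypothetical endpoint and placeholder key, for illustration only
BASE_URL = "https://api.example.com/v1/data"
API_KEY = "YOUR_API_KEY"


def build_request_url(base_url, **params):
    """Append URL-encoded query parameters to a base URL."""
    return base_url + "?" + urlencode(params)


url = build_request_url(BASE_URL, q="New York", units="metric", apikey=API_KEY)
print(url)
```

Using `urlencode()` rather than plain string concatenation ensures that spaces and other special characters in the parameter values are escaped correctly.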


## Web API Example: Fetch the live weather report for a given city

- Get the current temperature, wind speed, description and sky conditions for a given city from __openweathermap.org__ via their free API: https://openweathermap.org/current


- Sourced from https://www.w3resource.com/python-exercises/web-scraping/web-scraping-exercise-21.php


Note the use of an __openweathermap.org__ API key: 

        'APPID =5cb6444eb6e961ddec86187085ac45ef'
        
__BE SURE TO SIGN UP FOR YOUR OWN API KEY IF YOU WANT TO TRY OUT THIS EXAMPLE__

In [10]:
import requests


def weather_data(query):
    # submit our request via the openweathermap.org api
    # note the inclusion of the required API key
    res = requests.get('https://api.openweathermap.org/data/2.5/weather?' + query +
                       '&appid=5cb6444eb6e961ddec86187085ac45ef&units=metric')
    # convert the JSON response into a Python dict
    return res.json()


def print_weather(result, city):
    print("{}'s temperature: {}°C ".format(city, result['main']['temp']))
    print("Wind speed: {} m/s".format(result['wind']['speed']))
    print("Description: {}".format(result['weather'][0]['description']))
    print("Weather: {}".format(result['weather'][0]['main']))


def main():
    city = input('Enter the city:')
    print()
    try:
        query = 'q=' + city
        w_data = weather_data(query)
        print_weather(w_data, city)
        print()
    except (KeyError, ValueError, requests.RequestException):
        # the expected fields are missing (e.g., unknown city) or the
        # request itself failed
        print('City name not found...')


if __name__ == '__main__':
    main()

Enter the city:paris

paris's temperature: 10.94°C 
Wind speed: 10.29 m/s
Description: overcast clouds
Weather: Clouds
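
Two small refinements are sketched below under stated assumptions: reading the API key from an environment variable instead of writing it into the code, and returning `None` when a response lacks the expected fields. The variable name `OPENWEATHER_API_KEY` and both helper functions are hypothetical, and the `sample` dict simply mirrors the fields printed in the output above.

```python
import os


def get_api_key(env_var="OPENWEATHER_API_KEY"):
    """Read the API key from an environment variable (hypothetical name)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError("Set the {} environment variable first".format(env_var))
    return key


def summarize_weather(result):
    """Return a one-line summary, or None if expected fields are missing."""
    try:
        return "{}°C, wind {} m/s, {}".format(
            result['main']['temp'],
            result['wind']['speed'],
            result['weather'][0]['description'],
        )
    except (KeyError, IndexError):
        return None


# sample dict mirroring the fields used in the example above
sample = {
    "main": {"temp": 10.94},
    "wind": {"speed": 10.29},
    "weather": [{"main": "Clouds", "description": "overcast clouds"}],
}

print(summarize_weather(sample))
print(summarize_weather({"message": "city not found"}))
```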



## Case Study: Scraping 'Apartments.com' for the name, address, description, and prices of available apartments

https://towardsdatascience.com/an-introduction-to-web-scraping-with-python-bc9563fe8860
