## Lecture 9 - Data - Web Scraping HTML, JSON and XML

Web scraping is the process of obtaining data from websites and structuring it into a usable format.

First, let's check the response status from a website. We will use the following site and query:

``https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia``

What we want to see is status code 200, which indicates that the HTTP request succeeded.

In [89]:
import requests

url = 'https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia'
page = requests.get(url)
print(page.status_code)

200
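
As a small illustration of how a status code can be interpreted, the sketch below uses the standard library's `http.HTTPStatus` to look up the reason phrase and treats any 2xx code as a success. The helper name `describe_status` is hypothetical and is not part of the lecture code.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short summary of an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered HTTP status codes
        phrase = "Unknown"
    # Any code in the 2xx range means the request succeeded
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {phrase} ({outcome})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With `requests`, the same check is available directly on the response through `page.ok`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses.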


### Extracting HTML Data

In this next example we will look at extracting the HTML data. We will use the following URL:

``https://www.university-list.net/New-Zealand/universities-1000.htm``

In [28]:
import requests
from bs4 import BeautifulSoup

university_url = "https://www.university-list.net/New-Zealand/universities-1000.htm"

In [29]:
university_url

'https://www.university-list.net/New-Zealand/universities-1000.htm'

Let's check the status of our request.

In [31]:
page = requests.get(university_url)
print(page)

<Response [200]>


Let's examine the HTML text of the page. At this stage it is not very readable.

In [33]:
print(page.text)

<!doctype html>
<html lang="en"><!-- InstanceBegin template="/Templates/eng.dwt" codeOutsideHTMLIsLocked="false" -->
<head>
<meta charset="utf-8">
<meta name=viewport content="width=device-width, initial-scale=1.0">
<link href="../css/css.css" rel="stylesheet" type="text/css">
<!--menu.jså¨åºé¨-->
<!-- IEè»å­è½¯ä»¶å¼å§ -->
<!â[if lt IE 9]> 
<script src="../js/polyfill.js"></script>
<![endif]â> 
<!-- IEè»å­è½¯ä»¶ç»æ -->
<!-- auto-adså¼å§ -->
<script data-ad-client="ca-pub-2375604138158108" async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<!-- auto-adsç»æ -->
<!-- InstanceBeginEditable name="doctitle" -->
<link rel="amphtml" href="https://m.university-list.net/New-Zealand/universities-1000.htm" /><!--å¯¹åºçç§»å¨ç½é¡µ-->
<!--<link rel="canonical" href="https://www.university-list.net/New-Zealand/universities-1000.htm" /><!--ææé¦éç½å-->
<title>List of Universities and Colleges in New Zealand (36 Schoo
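
The garbled comment characters in the output above (for example `å¨åºé¨`) look like UTF-8 text that was decoded as ISO-8859-1, even though the page declares `charset="utf-8"`. The sketch below reproduces that effect with a made-up comment string. The sample text is an assumption for illustration, not the page's real content.

```python
# Hypothetical HTML comment containing non-ASCII characters
original = "<!-- 底部 -->"

# Bytes as they would travel over the network when the page is UTF-8 encoded
raw_bytes = original.encode("utf-8")

# Decoding with the wrong codec produces "mojibake" (garbled characters)
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the codec declared by the page restores the text
right = raw_bytes.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

If this is the cause here, setting `page.encoding = 'utf-8'` before reading `page.text`, or passing the raw bytes in `page.content` to `BeautifulSoup`, would likely avoid the garbling.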

Let's tidy the data up by parsing it with **BeautifulSoup()** and then using the **prettify()** function to make it more readable.

In [34]:
soup = BeautifulSoup(page.text,'html.parser')

# look at the type of object returned by BeautifulSoup
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [36]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <!-- InstanceBegin template="/Templates/eng.dwt" codeOutsideHTMLIsLocked="false" -->
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="../css/css.css" rel="stylesheet" type="text/css"/>
  <!--menu.jså¨åºé¨-->
  <!-- IEè»å­è½¯ä»¶å¼å§ -->
  <!--â[if lt IE 9]-->
  <script src="../js/polyfill.js">
  </script>
  &lt;![endif]â&gt;
  <!-- IEè»å­è½¯ä»¶ç»æ -->
  <!-- auto-adså¼å§ -->
  <script async="" data-ad-client="ca-pub-2375604138158108" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js">
  </script>
  <!-- auto-adsç»æ -->
  <!-- InstanceBeginEditable name="doctitle" -->
  <link href="https://m.university-list.net/New-Zealand/universities-1000.htm" rel="amphtml">
   <!--å¯¹åºçç§»å¨ç½é¡µ-->
   <!--<link rel="canonical" href="https://www.university-list.net/New-Zealand/universities-1000.htm" /><!--ææé¦éç½å-->
   <title>
    List of Un

As you can see, the structured output is much easier to read.

Next, let's find all ``a`` tags with the attribute ``{'rel':'nofollow'}`` and display them.

In [37]:
all_links = soup.findAll('a',attrs={'rel':'nofollow'})
all_links

[<a href="http://www.aut.ac.nz/" rel="nofollow" target="_blank">Auckland University of Technology</a>,
 <a href="http://www.Auckland.ac.nz/" rel="nofollow" target="_blank">The University of Auckland</a>,
 <a href="http://www.canterbury.ac.nz" rel="nofollow" target="_blank">University of canterbury</a>,
 <a href="http://www.lincoln.ac.nz/" rel="nofollow" target="_blank">Lincoln University</a>,
 <a href="http://www.massey.ac.nz/" rel="nofollow" target="_blank">Massey University</a>,
 <a href="http://www.otago.ac.nz" rel="nofollow" target="_blank">University of Otago</a>,
 <a href="http://www.victoria.ac.nz/" rel="nofollow" target="_blank">Victoria University of Wellington</a>,
 <a href="http://www.waikato.ac.nz/" rel="nofollow" target="_blank">The University or waikato</a>,
 <a href="http://www.ara.ac.nz/" rel="nofollow" target="_blank">Ara Institute of Canterbury</a>,
 <a href="http://www.boppoly.ac.nz/" rel="nofollow" target="_blank">Bay of Plenty Polytechnic</a>,
 <a href="http://www.

Next we will use a loop to provide us with only the URL and university names.

In [38]:
for eachuniversity in all_links: 
    print(eachuniversity['href']+" ,"+eachuniversity.string)

http://www.aut.ac.nz/ ,Auckland University of Technology
http://www.Auckland.ac.nz/ ,The University of Auckland
http://www.canterbury.ac.nz ,University of canterbury
http://www.lincoln.ac.nz/ ,Lincoln University
http://www.massey.ac.nz/ ,Massey University
http://www.otago.ac.nz ,University of Otago
http://www.victoria.ac.nz/ ,Victoria University of Wellington
http://www.waikato.ac.nz/ ,The University or waikato
http://www.ara.ac.nz/ ,Ara Institute of Canterbury
http://www.boppoly.ac.nz/ ,Bay of Plenty Polytechnic
http://www.cpit.ac.nz/ ,Christchurch Polytechnic Institute of Technology
http://www.eit.ac.nz/ ,Eastern Institute of Technology
http://www.laidlaw.ac.nz/ ,Laidlaw College
http://www.manukau.ac.nz ,Manukau Institute of Technology
https://www.nmit.ac.nz/ ,Nelson Marlborough Institute of Technology
http://www.northtec.ac.nz/ ,NorthTec
http://www.openpolytechnic.ac.nz/ ,The Open Polytechnic of New Zealand
http://www.otagopolytechnic.ac.nz ,Otago Polytechnic
https://www.sit.ac.nz/ 
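
Printing `url ,name` pairs is fine for inspection, but a structured format is easier to reuse. The sketch below writes a few of the rows shown above with the standard `csv` module. It uses an in-memory buffer as a stand-in for a real file, and the column names `url` and `name` are assumptions.

```python
import csv
import io

# A few (url, name) pairs copied from the printed output above
rows = [
    ("http://www.aut.ac.nz/", "Auckland University of Technology"),
    ("http://www.Auckland.ac.nz/", "The University of Auckland"),
    ("http://www.lincoln.ac.nz/", "Lincoln University"),
]

# io.StringIO stands in for open("universities.csv", "w", newline="")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url", "name"])  # header row (assumed column names)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same rows as lists of strings
parsed = list(csv.reader(io.StringIO(csv_text)))
```

The `csv` writer also quotes any field that contains a comma, which plain string concatenation would not handle.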

### Activity Web Scraping

In this activity we will scrape the Wikipedia page for the following URL:

``"https://en.wikipedia.org/wiki/States_of_Germany"``

The final goal of this activity is to get a listing of German States.

First we will import our libraries, fetch and parse the page, and display the page title.

In [65]:
#import the library used to query a website
import requests

# import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
import time


#specify the url

wiki = "https://en.wikipedia.org/wiki/States_of_Germany"

#Query the website and return the html to the variable 'page'
page = requests.get(wiki)

# Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page.text,'html.parser')

# Goal 1
print(soup.title.string)



States of Germany - Wikipedia


Next we will use the ``findAll()`` function to find all ``a`` tags that have an ``href`` attribute, and then print each link's ``href`` value.

In [66]:
# Goal 2

all_links = soup.findAll('a',href=True)

# print all links from the webpage
for link in all_links:
    print(link['href'])

#mw-head
#searchInput
/wiki/States_of_Germany_(disambiguation)
/wiki/Regions_of_Germany_(disambiguation)
https://en.wikipedia.org/w/index.php?title=States_of_Germany&action=edit
/wiki/Talk:States_of_Germany
/wiki/Help:Maintenance_template_removal
/wiki/File:Question_book-new.svg
/wiki/Wikipedia:Verifiability
https://en.wikipedia.org/w/index.php?title=States_of_Germany&action=edit
/wiki/Help:Referencing_for_beginners
//www.google.com/search?as_eq=wikipedia&q=%22States+of+Germany%22
//www.google.com/search?tbm=nws&q=%22States+of+Germany%22+-wikipedia
//www.google.com/search?&q=%22States+of+Germany%22+site:news.google.com/newspapers&source=newspapers
//www.google.com/search?tbs=bks:1&q=%22States+of+Germany%22+-wikipedia
//scholar.google.com/scholar?q=%22States+of+Germany%22
https://www.jstor.org/action/doBasicSearch?Query=%22States+of+Germany%22&acc=on&wc=on
/wiki/Help:Maintenance_template_removal
/wiki/Wikipedia:Manual_of_Style/Layout
https://en.wikipedia.org/w/index.php?title=States_of_
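
The list above mixes fragment links (`#mw-head`), site-relative links (`/wiki/...`), protocol-relative links (`//www.google.com/...`) and absolute URLs. A minimal sketch of turning them all into absolute URLs with the standard library's `urljoin`, using a handful of the values printed above:

```python
from urllib.parse import urljoin

# The page the links were scraped from
base = "https://en.wikipedia.org/wiki/States_of_Germany"

# A sample of the href values printed above
hrefs = [
    "#mw-head",
    "/wiki/Talk:States_of_Germany",
    "//scholar.google.com/scholar?q=%22States+of+Germany%22",
    "https://www.jstor.org/action/doBasicSearch?Query=%22States+of+Germany%22&acc=on&wc=on",
]

# urljoin resolves each href against the base URL
absolute = [urljoin(base, href) for href in hrefs]

for url in absolute:
    print(url)
```

Links that are already absolute pass through unchanged, so the same call can be applied to every `href` on the page.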

Finally, we will find the ``table`` element with the classes ``sortable wikitable``, loop over its ``tr`` rows, and read the state name from the third ``td`` cell of each row.

In [67]:
# Find the table by its CSS classes using the class_ keyword argument, then collect its rows

right_table = soup.find('table', class_="sortable wikitable").findAll('tr')

# Collect the td cells of each row; header rows have no td cells and are skipped

states = []

for row in right_table:
    state_data_entry = row.findAll('td')
    if(len(state_data_entry)!=0):
        states.append(state_data_entry[2].find(text=True))
print(states)

['Baden-Württemberg', 'Bavaria', 'Berlin', 'Brandenburg', 'Bremen', 'Hamburg', 'Hesse', 'Lower Saxony', 'Mecklenburg-Vorpommern', 'North Rhine-', 'Rhineland-Palatinate', 'Saarland', 'Saxony', 'Saxony-Anhalt', 'Schleswig-Holstein', 'Thuringia']
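
Note that one entry above is cut off as `'North Rhine-'`. A likely explanation is that `find(text=True)` returns only the first text node in a cell, so a name split across several nodes is truncated. The sketch below shows the difference with `get_text()`. The cell markup is a made-up example, not the actual Wikipedia HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical table cell whose text is split across two nodes by a <br/> tag
html = (
    '<table><tr><td>'
    '<a href="/wiki/North_Rhine-Westphalia">North Rhine-<br/>Westphalia</a>'
    '</td></tr></table>'
)

cell = BeautifulSoup(html, "html.parser").find("td")

# find(string=True) stops at the first text node it encounters
first_node = cell.find(string=True)

# get_text() concatenates every text node inside the cell
full_text = cell.get_text(strip=True)

print(repr(first_node))
print(repr(full_text))
```

Swapping `find(text=True)` for `get_text(strip=True)` in the loop would also remove surrounding whitespace from each value.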


In this task we will use the same Wikipedia page to create a dataframe with the following columns, filled with statistical data for each German state:

['STATE','SINCE','CAPITAL','LEGISLATURE','HEAD_OF_GOVERNMENT','GOVERNMENT_COALITION','BUNDES_RAT_VOTES','AREA_KM_SQUARE','POPULATION','POP_PER_KM_SQ','HUMAN_DEVELOPMENT_INDEX','GDP_PER_CAPITA']

In [82]:
#import the library used to query a website
import requests

# import the Beautiful soup functions to parse the data returned from the website
from bs4 import BeautifulSoup
import time


#specify the url

wiki = "https://en.wikipedia.org/wiki/States_of_Germany"

#Query the website and return the html to the variable 'page'
page = requests.get(wiki)

# Parse the html in the 'page' variable, and store it in Beautiful Soup format
soup = BeautifulSoup(page.text,'html.parser')


right_table = soup.find('table', class_="sortable wikitable")



# Generate lists
TABLE_COLS = []
STATE = []
SINCE = []
CAPITAL=[]
LEGISLATURE=[]
HEAD_OF_GOVERNMENT = []
GOVERNMENT_COALITION = []
BUNDES_RAT_VOTES = []
AREA_KM_SQUARE = []
POPULATION = []
POP_PER_KM_SQ = []
HUMAN_DEVELOPMENT_INDEX = []
GDP_PER_CAPITA = []

for row in right_table.find_all("tr"):
    cells = row.find_all("th") # to store col names
    states = row.find_all("td") # to store state values

    if len(cells) == 14:
        # append the names of the columns
        for col in cells:
            TABLE_COLS.append(col.find(text=True))

    if len(states)!=0:
        STATE.append(states[2].find(text=True))
        SINCE.append(states[3].find(text=True))
        CAPITAL.append(states[4].find(text=True))
        LEGISLATURE.append(states[5].find(text=True))
        HEAD_OF_GOVERNMENT.append(states[6].find(text=True))
        GOVERNMENT_COALITION.append(states[7].find(text=True))
        BUNDES_RAT_VOTES.append(states[8].find(text=True))
        AREA_KM_SQUARE.append(states[9].find(text=True))
        POPULATION.append(states[10].find(text=True))
        POP_PER_KM_SQ.append(states[11].find(text=True))
        HUMAN_DEVELOPMENT_INDEX.append(states[12].find(text=True))
        GDP_PER_CAPITA.append(states[14].find(text=True))


# import pandas to convert list to data frame

import pandas as pd
df=pd.DataFrame()
df['STATE']=STATE
df['SINCE']=SINCE
df['CAPITAL']=CAPITAL
df['LEGISLATURE']=LEGISLATURE
df['HEAD_OF_GOVERNMENT']=HEAD_OF_GOVERNMENT
df['GOVERNMENT_COALITION']=GOVERNMENT_COALITION
df['BUNDES_RAT_VOTES']=BUNDES_RAT_VOTES
df['AREA_KM_SQUARE']=AREA_KM_SQUARE
df['POPULATION']=POPULATION
df['POP_PER_KM_SQ']=POP_PER_KM_SQ
df['HUMAN_DEVELOPMENT_INDEX']=HUMAN_DEVELOPMENT_INDEX
df['GDP_PER_CAPITA']=GDP_PER_CAPITA
print(df)


                     STATE   SINCE      CAPITAL  \
0        Baden-Württemberg    1952    Stuttgart   
1                  Bavaria  1949\n       Munich   
2                   Berlin    1990          –\n   
3              Brandenburg  1990\n      Potsdam   
4                   Bremen  1949\n       Bremen   
5                  Hamburg  1949\n          –\n   
6                    Hesse  1949\n    Wiesbaden   
7             Lower Saxony  1949\n      Hanover   
8   Mecklenburg-Vorpommern  1990\n     Schwerin   
9             North Rhine-  1949\n   Düsseldorf   
10    Rhineland-Palatinate  1949\n        Mainz   
11                Saarland  1957\n  Saarbrücken   
12                  Saxony  1990\n      Dresden   
13           Saxony-Anhalt  1990\n    Magdeburg   
14      Schleswig-Holstein  1949\n         Kiel   
15               Thuringia  1990\n       Erfurt   

                          LEGISLATURE    HEAD_OF_GOVERNMENT  \
0        Landtag of Baden-Württemberg  Winfried Kretschmann   
1     
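
Several values in the output above still carry a trailing `\n`. One way to tidy them, sketched here on a few values copied from that output, is to strip whitespace from each list before building the dataframe. The helper name `clean` is hypothetical.

```python
# Sample values copied from the SINCE and CAPITAL columns above
SINCE = ["1952", "1949\n", "1990", "1990\n"]
CAPITAL = ["Stuttgart", "Munich", "–\n", "Potsdam"]


def clean(values):
    """Strip surrounding whitespace from string values (hypothetical helper)."""
    # Non-string entries such as None are passed through untouched
    return [v.strip() if isinstance(v, str) else v for v in values]


# Years can be converted to integers once the newline is gone
since_years = [int(v) for v in clean(SINCE)]
capitals = clean(CAPITAL)

print(since_years)
print(capitals)
```

With pandas, the equivalent step on an existing column is `df['SINCE'].str.strip()`.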

### JSON

In the next two examples we will parse a JSON string into a Python object and then serialize a Python object back into a JSON string.

In [83]:
import json
json_data="""{"name": "Zophie", "isCat": true, "miceCaught": 0, "napsTaken": 37.5, "felineIQ": null}"""

print(json.loads(json_data))

{'name': 'Zophie', 'isCat': True, 'miceCaught': 0, 'napsTaken': 37.5, 'felineIQ': None}


In [84]:
import json
pythonValue = {'isCat': True, 'miceCaught': 0, 'name': 'Zophie','felineIQ': None}
json_data = json.dumps(pythonValue)
print(json_data)

{"isCat": true, "miceCaught": 0, "name": "Zophie", "felineIQ": null}
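
The same conversions work with files through `json.dump()` and `json.load()`. The sketch below writes the example dictionary to a temporary file and reads it back. The file name `cat.json` is an assumption chosen for illustration.

```python
import json
import os
import tempfile

python_value = {
    "name": "Zophie",
    "isCat": True,
    "miceCaught": 0,
    "napsTaken": 37.5,
    "felineIQ": None,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cat.json")  # hypothetical file name

    # Serialize the dictionary to a file; indent and sort_keys make it readable
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(python_value, fh, indent=2, sort_keys=True)

    # Parse the file back into a Python dictionary
    with open(path, encoding="utf-8") as fh:
        loaded = json.load(fh)

print(loaded)
```

The round trip preserves the values: `True`, `None` and the numbers come back as the same Python types.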


In [85]:
import requests
import json


# Get the upcoming ISS pass times for a given location (latitude and longitude)

def getNextTimeList(parameters):
    response_time = requests.get("http://api.open-notify.org/iss-pass.json",params=parameters)

    data_time = json.loads(response_time.text)

    response_final = data_time['response']
    total_num_passes = len(response_final)

    response_time_list = []
    for i in response_final:
        response_time_list.append(i['risetime'])

    return (total_num_passes, response_time_list)


#parameters_newyork = {"lat": 40.71, "lon": -74}

#(num_passes_ny, time_list_ny) = getNextTimeList(parameters_newyork)

parameters_regensburg = {"lat": 49.0145, "lon": 12.1009}
(num_passes_rgb, time_list_rgb) = getNextTimeList(parameters_regensburg)


# Get the response from the API endpoint.
response = requests.get("http://api.open-notify.org/astros.json")
response_astro = json.loads(response.text)
num_astro_space = response_astro["number"]
print(num_astro_space)

astro_names = response_astro["people"]
astro_name_list = []

for name in astro_names:
    astro_name_list.append(name['name'])

print(astro_name_list)

7
['Sergey Ryzhikov', 'Kate Rubins', 'Sergey Kud-Sverchkov', 'Mike Hopkins', 'Victor Glover', 'Shannon Walker', 'Soichi Noguchi']
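
The `risetime` values collected by `getNextTimeList()` are not printed above. Assuming they are Unix timestamps in seconds, they can be converted into readable dates with `datetime`. The payload below is a made-up sample that only mimics the `response` and `risetime` keys used in the code, not real API output.

```python
from datetime import datetime, timezone

# Hypothetical payload with the same keys the code above reads
sample = {"response": [{"risetime": 1600000000}, {"risetime": 1600005800}]}

# Same extraction as in getNextTimeList(), written as a list comprehension
rise_times = [entry["risetime"] for entry in sample["response"]]

# Convert each Unix timestamp (assumed to be seconds) to an ISO 8601 string in UTC
readable = [
    datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    for ts in rise_times
]

for value in readable:
    print(value)
```

Passing `tz=timezone.utc` keeps the result independent of the local time zone of the machine running the code.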


### XML

In this example we will read and parse XML data. Our XML file has the following entries:

<?xml version="1.0"?>
<company>
	<name>ABC</name>
	<staff id="123">
		<name>Andy</name>
		<expense>200</expense>
	</staff>
	<staff id="324">
		<name>Mike</name>
		<expense>300</expense>
	</staff>
	<staff id="567">
		<name>Chris</name>
		<expense>400</expense>
	</staff>
</company>

We will start off with the following basic parse. More information can be obtained from https://docs.python.org/3/library/xml.etree.elementtree.html

In [92]:
import xml.etree.ElementTree as ET
tree = ET.parse("data/test.xml")
root = tree.getroot()
print(root.tag)

company


In [93]:
from xml.dom import minidom

#import pandas as pd

doc = minidom.parse("data/test.xml")

# doc.getElementsByTagName returns NodeList
name = doc.getElementsByTagName("name")[0]
print(name.firstChild.data)

staffs = doc.getElementsByTagName("staff")
for staff in staffs:
    sid = staff.getAttribute("id")
    name = staff.getElementsByTagName("name")[0]
    expense = staff.getElementsByTagName("expense")[0]
    print("id:%s, name:%s, expense:%s" %
          (sid, name.firstChild.data, expense.firstChild.data))

ABC
id:123, name:Andy, expense:200
id:324, name:Mike, expense:300
id:567, name:Chris, expense:400
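
For comparison, the same records can be read with `xml.etree.ElementTree`. This sketch parses the XML from a string with `ET.fromstring()` instead of from `data/test.xml` so that it is self-contained, and it collects the staff entries into a list of dictionaries.

```python
import xml.etree.ElementTree as ET

# Same content as the XML file shown earlier, inlined as a string
xml_text = """<?xml version="1.0"?>
<company>
    <name>ABC</name>
    <staff id="123"><name>Andy</name><expense>200</expense></staff>
    <staff id="324"><name>Mike</name><expense>300</expense></staff>
    <staff id="567"><name>Chris</name><expense>400</expense></staff>
</company>
"""

root = ET.fromstring(xml_text)

# findtext() looks only at direct children, so this is the company name
company_name = root.findtext("name")

# Build one dictionary per <staff> element
records = [
    {
        "id": staff.get("id"),                       # attribute value
        "name": staff.findtext("name"),              # text of the child element
        "expense": int(staff.findtext("expense")),   # converted to a number
    }
    for staff in root.findall("staff")
]

total_expense = sum(record["expense"] for record in records)

print(company_name)
print(records)
print(total_expense)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)` if a dataframe is needed.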
