### University of Michigan: Programming for Everyone
    Module #3: Web Data
    date: Saturday, June 25th 2022

#### Web Data

To represent the wide range of characters that computers must handle, each character may need more than one byte.

- UTF-16: variable length; two (2) or four (4) bytes

- UTF-32: fixed length; four (4) bytes
- UTF-8: variable length; 1-4 bytes
    - UTF-8 is recommended practice for encoding data to be exchanged between systems.
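
A small sketch to make the byte counts above concrete. The helper name `byte_lengths` and the four sample characters are illustrative choices, not part of the course material.

```python
def byte_lengths(ch):
    """Return the number of bytes one character needs in UTF-8, UTF-16 and UTF-32."""
    # The "-le" variants skip the byte-order mark, so only the character itself is counted.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

# "A", e-acute, the euro sign, and an emoji outside the Basic Multilingual Plane
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(repr(ch), byte_lengths(ch))
```

The emoji needs four bytes in all three encodings, which shows that only UTF-32 uses the same width for every character.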

----

**When we read data from an external resource, we must decode it based on its character set so that it is properly represented in Python 3 as a string:**

![Python Strings to Bytes](images/web_data_01.jpg)

- where "data = mysock.recv(512)" returns bytes

and

- "mystring = data.decode()" returns a Unicode string (str)

**The 'decode()' method takes bytes and converts them to a Unicode string (str)**
<br>

**The 'encode()' method takes a string (str) and converts it to bytes**
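
A minimal round trip through both methods, as a sketch. The sample text is made up, and both calls rely on Python's default encoding (UTF-8).

```python
text = "caf\u00e9"               # a str holding 4 Unicode characters
raw = text.encode()              # str -> bytes; UTF-8 is the default encoding
restored = raw.decode()          # bytes -> str; UTF-8 is again the default

print(type(raw), len(raw))       # the accented letter needs 2 bytes, so 5 bytes in total
print(type(restored), restored == text)
```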

----
"import socket" 

![HTTP Requests in Python](images/web_data_02.png)

----

"D.R.Y" = "don't repeat yourself" :)

#### Using "urllib" in Python

Because HTTP is so common, Python provides a library that handles all of the "socket" work for us and makes a web page look like a file.

- importing the library in Python:

    - import urllib.request, urllib.parse, urllib.error

- example:
  
    - fhand = urllib.request.urlopen("<http://....>") [where "fhand" stands for "file handle"; this line is similar to an "open file" call]

**example:**

    for line in fhand:
        print(line.decode().strip())

**note: only the body of the page is printed, because urllib reads the HTTP response headers for us. The headers are not discarded; they remain available on the object returned by "urlopen" if needed.**
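
The sketch below illustrates that the headers stay available while the loop sees only the body. As an assumption made purely so the example runs without network access, it opens a `data:` URL (which `urllib` also understands) instead of a real web page; the two-line payload is invented.

```python
import urllib.request

# Assumption: a "data:" URL is used as a stand-in so that no network access is needed.
fhand = urllib.request.urlopen("data:text/plain,first%20line%0Asecond%20line")

# The headers are kept on the response object.
content_type = fhand.headers["Content-Type"]

# Iterating over the handle yields only the body, one line of bytes at a time.
lines = [line.decode().strip() for line in fhand]

print(content_type)
print(lines)
```

With a real `http://` address the same two steps apply: read `fhand.headers` for the metadata and loop over `fhand` for the page content.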

----

#### Reading Web Pages continued

![Google web scraper](images/wd04.png)

#### Next, we read through each line, decode it into Unicode, and count each word in a dictionary.

![Treat like File](images/web_data_03.png)

In [None]:
# practicing the "urllib" import and functionality of this module/library to read files 
# urllib also has embedded "socket" code/syntax that makes this process more efficient and easier for us

import urllib.request, urllib.parse, urllib.error

# example

fhandle = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')


counts = dict()
for line in fhandle:
    words = line.decode().split() # decode the bytes to str, then split the line into words
    for word in words: # iterate over every word in the current line
        # use each word as a "key" in the "counts" dictionary;
        # get() returns 0 the first time a word is seen, so every occurrence adds 1 to its "value"
        counts[word] = counts.get(word, 0) + 1

# finally, print the "key and value" count pair for every word in the file
print(counts)


### Understanding Web Scraping
    Network Programs (Part 5)
    date: Sunday, June 26th 2022

![Web Scraping](images/wd05.png)

##### Why Scrape?

Reasons may include:

    1. Pulling data from the internet, particularly social data (e.g., "who links to whom?")
    2. Getting your own data back out of systems/platforms that have no "export" capability
    3. Monitoring a site for new or updated information
    4. "Spidering" (as scraping is sometimes called) the web to build a database for a search engine


**NOTE: You should be very careful when scraping/spidering web sites**

----
### In Summary - 

![module summary](images/wd06.png)

In [None]:
import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\r\n\r\n'.encode()  # HTTP lines end with CRLF; a blank line ends the request
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())
mysock.close()

In [None]:
import urllib.request, urllib.parse, urllib.error

# example

fhandle = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')

counts = dict()
for line in fhandle:
    words = line.decode().split() # decode the bytes to str, then split the line into words
    for word in words: # iterate over every word in the current line
        # use each word as a "key" in the "counts" dictionary;
        # get() returns 0 the first time a word is seen, so every occurrence adds 1 to its "value"
        counts[word] = counts.get(word, 0) + 1

# finally, print the "key and value" count pair for every word in the file
print(counts)

In [None]:
# To run this, download the BeautifulSoup zip file
# http://www.py4e.com/code3/bs4.zip
# and unzip it in the same directory as this file

from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = "http://py4e-data.dr-chuck.net/comments_42.html"
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")


# Retrieve all of the span tags (each one holds a comment count)
tags = soup('span')
num_of_comments = 0
counter = 0
# spans = [int(tag.contents[0]) for tag in tags] / list comprehension to get all numbers in a list
# sum_of_spans = sum(spans) / adding all the numbers in the list
# print(sum_of_spans) / printing the total sum

for tag in tags:
    num_of_comments += 1
    value = int(tag.contents[0])  # text inside the span, converted to an integer
    counter += value

print(num_of_comments)
print(counter)


In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://py4e-data.dr-chuck.net/comments_1495387.html"
html = urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, "html.parser")


# Retrieve all of the span tags (each one holds a comment count)
tags = soup('span')
num_of_comments = 0
counter = 0

for tag in tags:
    num_of_comments += 1
    tag = int(tag.contents[0])
    counter += tag

print(num_of_comments)
print(counter)

----
### Following Links in Python

In this assignment you will write a Python program that expands on http://www.py4e.com/code3/urllinks.py. The program will use urllib to read the HTML from the data files below, extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link and repeat the process a number of times and report the last name you find.

We provide two files for this assignment. One is a sample file where we give you the name for your testing and the other is the actual data you need to process for the assignment

Sample problem: Start at http://py4e-data.dr-chuck.net/known_by_Fikret.html
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.

Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah

Actual problem: Start at: http://py4e-data.dr-chuck.net/known_by_Unaiza.html
Find the link at position 18 (the first name is 1). Follow that link. Repeat this process 7 times. The answer is the last name that you retrieve.

Hint: The first character of the name of the last page that you will load is: R

**Strategy**

The web pages tweak the height between the links and hide the page after a few seconds to make it difficult for you to do the assignment without writing a Python program. But frankly with a little effort and patience you can overcome these attempts to make it a little harder to complete the assignment without writing a Python program. But that is not the point. The point is to write a clever Python program to solve the problem.

----
1. write a Python program that expands on http://www.py4e.com/code3/urllinks.py

2. The program will use urllib to read the HTML from the data files below
   
3. extract the href = values from the anchor tags
   
4. scan for a tag that is in a particular position relative to the first name in the list 
   
5. follow that link and repeat the process a number of times and...
   
6. report the ~~last~~ **name** in the link that is returned
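
The steps above can be sketched without any network access. This is only an illustration under stated assumptions: the three tiny pages in `PAGES`, the `AnchorCollector` class, and the `follow_links` helper are all hypothetical, and the standard-library `html.parser` is swapped in for BeautifulSoup so the block is self-contained.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, link text) pairs for every anchor tag, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical stand-in for the web: page name -> HTML text
PAGES = {
    "known_by_Ann.html": '<a href="known_by_Bob.html">Bob</a> <a href="known_by_Cy.html">Cy</a>',
    "known_by_Bob.html": '<a href="known_by_Cy.html">Cy</a> <a href="known_by_Ann.html">Ann</a>',
    "known_by_Cy.html": '<a href="known_by_Ann.html">Ann</a> <a href="known_by_Bob.html">Bob</a>',
}

def follow_links(fetch, start, position, repeats):
    """Follow the link at `position` (counting from 1) `repeats` times; return the names visited."""
    url = start
    names = []
    for _ in range(repeats):
        collector = AnchorCollector()
        collector.feed(fetch(url))
        url, name = collector.links[position - 1]  # position 1 is index 0
        names.append(name)
    return names

names = follow_links(PAGES.get, "known_by_Ann.html", position=2, repeats=3)
print(names)
print("Last name in sequence:", names[-1])
```

Passing the fetch function as a parameter keeps the link-following logic separate from how a page is retrieved, so the same loop could be driven by a function that wraps `urlopen`.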

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl 

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# start from the first page; "url" is updated inside the loop so each pass follows a new link
url = "http://py4e-data.dr-chuck.net/known_by_Unaiza.html"
name_lst = list()

for i in range(7):
    html = urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup("a")
    # position 18 (counting from 1) is index 17
    tag = tags[17]
    url = tag.get("href", None)
    name_lst.append(tag.contents[0])

# the last entry in the list is the answer
print(name_lst)

In [None]:
import urllib.request, urllib.parse, urllib.error
# from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl 

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input("Enter web url: ")
num_of_repetitions = int(input("Enter number of repetitions: "))
ele_position = int(input("Enter link position: "))

for i in range(num_of_repetitions):
    html = urllib.request.urlopen(url, context=ctx).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup("a")
    count = 0

    for tag in tags:
        count += 1
        if count > ele_position:
            break
        url = tag.get("href", None)
        name = tag.contents[0]

print(f'{name} is the last person in loop.')

----
![data_on_the_web](images/wd07.png)

#### Sending data across the net

Two programs that exchange data must agree on a "wire protocol": meaning, **what we send on the "wire"**

"eXtensible Markup Language" or **XML** is a wire format: ("more robust") \
**JSON** is a wire format: ("lighter weight")

    <person>
        <name>
        </name>
        <phone>
        </phone>
    </person>

----
**Serialize: the internal structure is converted to the wire format**

    - Python Dictionary

**De-Serialize: the wire format is converted back to an internal structure**

    - Java HashMap

![XML](images/wd08.jpg)

<u>**XML documents are composed of:**</u>

* "Simple Elements" 
  
* "Complex Elements"

**key features include:**

    - start tag
    - end tag
    - text content 
    - attribute
    - opening tag of XML
    - self-closing tag " />"

<u>**where:**</u>

**"TAGS":** indicate the beginning and end of an element

**"ATTRIBUTES":** represent keyword/value pairs on the opening tag of an element

**"SERIALIZE/De-SERIALIZE":** convert data in one program into a common format that can be stored and/or transmitted between systems in a programming language-independent manner

    - de-serialization refers to receiving that common format across the network and converting it back into a data structure native to the receiving program
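
As a small sketch of the round trip, the block below serializes a Python dictionary to a JSON string and de-serializes it again. JSON is used here only because it needs a single standard-library module; the `person` record is made up.

```python
import json

person = {"name": "Chuck", "phone": "+1 555 0100"}   # internal structure (a Python dict)

wire = json.dumps(person)       # serialize: dict -> text that can be sent on the "wire"
restored = json.loads(wire)     # de-serialize: text -> dict on the receiving side

print(type(wire), wire)
print(restored == person)
```

A receiving program written in another language would parse the same text into its own native structure instead of a Python dictionary.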

-----
![text and attributes](images/wd09.jpg)


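XML as paths:

An XML document can be viewed as a tree, so each element can be addressed by a path, much like a folder on disk.

**example:**

    - <a> (parent folder)

    - <b> X </b> (child folder; its path is /a/b)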
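
A minimal sketch of the path idea with `xml.etree.ElementTree`. The extra `c`, `d`, and `e` elements are invented for illustration; paths passed to `find()` are relative to the element it is called on (here the root `a`).

```python
import xml.etree.ElementTree as ET

# <a> is the root ("parent folder"); <b> and <c> are its children
doc = ET.fromstring("<a><b>X</b><c><d>Y</d><e>Z</e></c></a>")

b_text = doc.find("b").text      # path /a/b
d_text = doc.find("c/d").text    # path /a/c/d
e_text = doc.find("c/e").text    # path /a/c/e

print(b_text, d_text, e_text)
```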
----

#### XML Schemas

![schemas](images/wd10.jpg)


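<u>**"Validation"**</u>

*refers to checking that an XML document conforms to a schema, which acts as a contract between applications. Validation therefore involves:*

        1. XML Document 
        2. XML Schema Contract

**XSD Data types**

Occurrence constraints: 

- minOccurs (the minimum number of times an element may appear)

- maxOccurs (the maximum number of times an element may appear)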


![xsd_data_types](images/wd11.jpg)

----
### Parsing XML in Python - 

In [None]:
import xml.etree.ElementTree as ET

data = ''' 
<person>
    <name>Miguel</name>
    <phone type="intl">
        +1 555 690 1516
        </phone>
        <email hide = "yes"/>
</person>
'''

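tree = ET.fromstring(data)
print("Name:", tree.find('name').text)
# retrieve the phone number from the "phone" tag; strip() removes the surrounding whitespace and newlines
print("Phone number:", tree.find("phone").text.strip())
# attributes are read with get() on the element
print("Hide email:", tree.find("email").get("hide"))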

-----
### Extracting Data from XML

In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/geoxml.py. 

1. The program will prompt for a URL, 

2. read the XML data from that URL using urllib and 

3. then parse and extract the comment counts from the XML data

4. compute the sum of the numbers in the file.

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

Sample data: http://py4e-data.dr-chuck.net/comments_42.xml (Sum=2553)

Actual data: http://py4e-data.dr-chuck.net/comments_1495389.xml (Sum ends with 83)

You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.

**Data Format and Approach**

The data consists of a number of names and comment counts in XML as follows:

    <comment>
        <name>Matthias</name>
        <count>97</count>
    </comment>

* You are to look through all the `<comment>` tags,
  
* find the `<count>` values, and sum the numbers.
  
* The closest sample code that shows how to parse XML is geoxml.py. 

But since the nesting of the elements in our data is different than the data we are parsing in that sample code you will have to make real changes to the code.

To make the code a little simpler, you can use an XPath selector string to look through the entire tree of XML for any tag named 'count' with the following line of code:

counts = tree.findall('.//count')


Take a look at the Python ElementTree documentation and look for the supported XPath syntax for details. You could also work from the top of the XML down to the comments node and then loop through the child nodes of the comments node.
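
Both approaches can be tried on an inline string before touching the real URL. This is a sketch under assumptions: the root element name `commentinfo`, the `comments` wrapper, and the second comment entry are invented for illustration; only the `comment`/`name`/`count` shape comes from the description above.

```python
import xml.etree.ElementTree as ET

# Assumed document layout; only <comment>, <name> and <count> are given in the assignment text.
sample = """
<commentinfo>
  <comments>
    <comment><name>Matthias</name><count>97</count></comment>
    <comment><name>Example</name><count>3</count></comment>
  </comments>
</commentinfo>
"""

tree = ET.fromstring(sample)

# Approach 1: XPath selector that finds every <count> anywhere in the tree
counts = tree.findall(".//count")
total = sum(int(node.text) for node in counts)

# Approach 2: walk from the top down to each <comment>, then read its <count> child
total_top_down = sum(int(c.find("count").text) for c in tree.findall("comments/comment"))

print(len(counts), total, total_top_down)
```

The XPath form is shorter and does not depend on how deeply the `count` elements are nested; the top-down form fails loudly if the nesting differs from what is expected.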

Sample Execution

In [None]:
# 1. The program will prompt for a URL
# 2. read the XML data from that URL using urllib and 
# 3. then parse and extract the comment counts from the XML data
# 4. compute the sum of the numbers in the file.

import urllib.request
import xml.etree.ElementTree as ET

url_address = input("Enter address: ")
# fall back to the assignment data when no address is entered
if len(url_address) < 1:
    url_address = "http://py4e-data.dr-chuck.net/comments_1495389.xml"
print("Retrieving: " + url_address)

xml = urllib.request.urlopen(url_address).read()  # read() returns bytes
print("Retrieved: " + str(len(xml)), "characters")

tree = ET.fromstring(xml)

# XPath selector: every "count" element anywhere in the tree
counts = tree.findall('.//count')
print("Count: " + str(len(counts)))

counter = 0
for count in counts:
    counter += int(count.text)

print("Sum: " + str(counter))

----
### Web Services Part V: JavaScript Object Notation "JSON"

JSON represents data as nested lists and dictionaries
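
A short sketch of what "nested lists and dictionaries" means once the text is parsed. The JSON string is a trimmed, made-up variant of the quiz example further down.

```python
import json

text = '{"users": [{"name": "Leah Culver", "location": "San Francisco, California"}]}'
parsed = json.loads(text)

outer_type = type(parsed)             # a JSON object becomes a dict
inner_type = type(parsed["users"])    # a JSON array becomes a list
name = parsed["users"][0]["name"]     # dict -> list -> dict -> value

print(outer_type, inner_type, name)
```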

``Service Oriented Approach``

- Application Programming Interfaces (APIs)

![json](images/wd12.jpg)

In [None]:
# example question

x = {
    "users": [
        {
            "status": {
                "text": "@jazzychad I just bought one .__.",
             },
             "location": "San Francisco, California",
             "screen_name": "leahculver",
             "name": "Leah Culver",
         }]}

In [None]:
print(x["users"][0]["name"])

**``Week 6 Quiz``**
1. Who is credited with getting the JSON movement started? \
(answer) Douglas Crockford

2. What Python library do you have to import to parse and handle JSON? \
(answer) import json 

3. Which of the following is a web services approach used by the Twitter API? \
(answer) REST

4. What kind of variable will you get in Python when the following JSON is parsed: \
   
   { "id" : "001",
  "x" : "2",
  "name" : "Chuck"
} \
(answer) A dictionary with three key/value pairs

5. Which of the following is not true about the service-oriented approach? \
(answer) An application runs together all in one place

6. If the following JSON were parsed and put into the variable x, \
   
   x = {
    "users": [
        {
            "status": {
                "text": "@jazzychad I just bought one .__.",
             },
             "location": "San Francisco, California",
             "screen_name": "leahculver",
             "name": "Leah Culver",
         }]} \

what Python code would extract "Leah Culver" from the JSON? \
(answer) x["users"][0]["name"]

7. What library call do you make to append properly encoded parameters to the end of a URL like the following: \
   http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=Ann+Arbor%2C+MI \
(answer) urllib.parse.urlencode()

8. What happens when you exceed the Google geocoding API rate limit? \
(answer) You cannot use the API for 24 hours

9. What protocol does Twitter use to protect its API? \
(answer) OAuth

10. What header does Twitter use to tell you how many more API requests you can make before you will be rate limited? \
(answer) "x-rate-limit-remaining"
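
To illustrate question 7, the sketch below builds the query string from that question with `urllib.parse.urlencode()`. No request is sent; the URL is only assembled as text.

```python
import urllib.parse

service_url = "http://maps.googleapis.com/maps/api/geocode/json?"
params = {"sensor": "false", "address": "Ann Arbor, MI"}

# urlencode() escapes each value: spaces become "+", the comma becomes "%2C"
query = urllib.parse.urlencode(params)
full_url = service_url + query

print(full_url)
```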

**``Extracting Data from JSON``**

In this assignment you will write a Python program somewhat similar to http://www.py4e.com/code3/json2.py. 

1. The program will prompt for a URL
2. read the JSON data from that URL using urllib and then...
3. parse and extract the comment counts from the JSON data
4. compute the sum of the numbers in the file and enter the sum below:

We provide two files for this assignment. One is a sample file where we give you the sum for your testing and the other is the actual data you need to process for the assignment.

**Sample data:** http://py4e-data.dr-chuck.net/comments_42.json (Sum=2553) \
**Actual data:** http://py4e-data.dr-chuck.net/comments_1495390.json (Sum ends with 52)

You do not need to save these files to your folder since your program will read the data directly from the URL. Note: Each student will have a distinct data url for the assignment - so only use your own data url for analysis.

Data Format \
The data consists of a number of names and comment counts in JSON as follows:

{
  comments: [
    {
      name: "Matthias"
      count: 97
    },
    {
      name: "Geomer"
      count: 97
    }
    ...
  ]
}

The closest sample code that shows how to parse JSON and extract a list is json2.py. You might also want to look at geoxml.py to see how to prompt for a URL and retrieve data from a URL.
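
The extraction step can be rehearsed on an inline string first. This sketch assumes the real payload is strict JSON, with quoted keys and commas between fields, unlike the abbreviated listing above; the two entries reuse the names and counts shown there.

```python
import json

# Assumption: strict JSON version of the abbreviated listing above
payload = '''
{
  "comments": [
    {"name": "Matthias", "count": 97},
    {"name": "Geomer", "count": 97}
  ]
}
'''

info = json.loads(payload)

# "comments" is a list of dictionaries; pull the "count" out of each one
counts = [int(item["count"]) for item in info["comments"]]

print("Count:", len(counts))
print("Sum:", sum(counts))
```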

----

**Sample Execution:**

$ python3 solution.py \
Enter location: http://py4e-data.dr-chuck.net/comments_42.json \
Retrieving http://py4e-data.dr-chuck.net/comments_42.json \
Retrieved 2733 characters \
Count: 50 \
Sum: 2...



In [None]:
import json
import urllib.request

url_address = input("Enter URL: ")
# fall back to the assignment data when no URL is entered
if len(url_address) < 1:
    url_address = "http://py4e-data.dr-chuck.net/comments_1495390.json"
print("Retrieving:", url_address)

uh = urllib.request.urlopen(url_address)
data = uh.read().decode()

print("Retrieved:", len(data), "characters")
json_object = json.loads(data)

comment_count = 0  # number of comment entries
total = 0          # sum of the "count" values; avoids shadowing the built-in sum()

for comment in json_object["comments"]:
    total += int(comment["count"])
    comment_count += 1

print("Count:", comment_count)
print("Sum:", total)