### University of Michigan: Programming for Everyone
    Module #3: Web Data
    date: Saturday, June 25th 2022

#### Web Data

To represent the wide range of characters that computers must be able to handle, we encode characters using more than one byte.

- UTF-16: variable length; two (2) or four (4) bytes
- UTF-32: fixed length; four (4) bytes
- UTF-8: variable length; 1-4 bytes
    - UTF-8 is recommended practice for encoding data to be exchanged between systems.
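
To make the byte counts concrete, here is a small sketch that encodes a few sample characters and reports how many bytes each encoding needs. The helper name `encoded_lengths` and the sample characters are assumptions chosen for illustration. The "-le" codec variants are used so that no byte-order mark is added to the count.

```python
# Sample characters: "A", e-acute, the euro sign, and an emoji (written as escapes).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

def encoded_lengths(ch):
    """Return the number of bytes ch needs in UTF-8, UTF-16, and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

for ch in samples:
    # ascii() keeps the printout safe on terminals that cannot show the character
    print(ascii(ch), encoded_lengths(ch))
```

UTF-8 grows from 1 to 4 bytes across these samples, while UTF-32 always uses 4.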

----

**When we read data from an external resource, we must decode it according to its character set so that it is properly represented in Python 3 as a string:**

![Python Strings to Bytes](images/web_data_01.jpg)

- "data = mysock.recv(512)" gives us bytes
- "mystring = data.decode()" gives us Unicode (str)

**The 'decode()' method takes bytes and converts them to Unicode (str)**
<br>

**The 'encode()' method takes a string (str) and converts it to bytes**
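
A minimal round-trip sketch of the two methods. The sample text is an assumption, and both calls rely on Python 3's default of UTF-8.

```python
text = "caf\u00e9"          # a str with one non-ASCII character (e-acute)
data = text.encode()        # str -> bytes (UTF-8 by default)
restored = data.decode()    # bytes -> str (UTF-8 by default)

print(type(data).__name__, len(data))          # the bytes object is 5 bytes long
print(type(restored).__name__, len(restored))  # the str is 4 characters long
```

The lengths differ because the one non-ASCII character takes two bytes in UTF-8.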

----
"import socket" 

![HTTP Requests in Python](images/web_data_02.png)
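
For reference, here is a rough sketch of what an HTTP request over a raw socket can look like. The function names `build_get_request` and `fetch` are hypothetical helpers, and the sketch assumes a plain HTTP server on port 80 that sends UTF-8 text. It is an illustration, not necessarily the code in the image.

```python
import socket

def build_get_request(host, path):
    """Return the bytes of a minimal HTTP/1.0 GET request."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return request.encode()  # sockets send bytes, not str

def fetch(host, path, port=80):
    """Send the request and return the whole response (headers + body) as a str."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as mysock:
        mysock.connect((host, port))
        mysock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = mysock.recv(512)   # bytes, up to 512 at a time
            if len(data) < 1:         # an empty result means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode()  # bytes -> str

# Example (needs a network connection):
# print(fetch("www.dr-chuck.com", "/page1.htm"))
```

Note that the response returned this way still includes the HTTP headers ahead of the page content.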

----

"D.R.Y." = "don't repeat yourself" :)

#### Using "urllib" in Python

Because HTTP is so common, Python includes a library that handles all of the "socket" work for us and makes a web page look like a file.

- importing the module/library in Python:

    - import urllib.request, urllib.parse, urllib.error

- example:

    - fhand = urllib.request.urlopen("<http://....>") [where "fhand" stands for "file handle"] [this line is similar to an "open file" call]

**example:**

    for line in fhand:
        print(line.decode().strip())

**note: urllib handles the HTTP headers for us, so this loop prints only the body of the web page. The headers are not deleted; they remain available on the response object if needed.**
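
The sketch below shows the same pattern end to end, including where the headers live. It uses a "data:" URL as a stand-in (an assumption, so that it runs without a network connection). With a real "http://" URL, the loop works the same way.

```python
import urllib.request

# A "data:" URL stands in for a real web address so the sketch runs offline.
url = "data:text/plain,first%20line%0Asecond%20line"

fhand = urllib.request.urlopen(url)

# The headers are stored on the response object, not mixed into the body.
content_type = fhand.headers["Content-type"]

# Iterating over the handle yields the body one line (of bytes) at a time.
lines = [line.decode().strip() for line in fhand]
fhand.close()

print(content_type)
print(lines)
```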

----

#### Next: read through each line, decode it into Unicode, and count its words in a dictionary

![Treat like File](images/web_data_03.png)

#### Reading Web Pages continued

![Google web scraper](images/wd04.png)

In [29]:
# practicing with "urllib" to read a web page as if it were a file
# urllib handles the underlying "socket" code for us, which makes this much easier

import urllib.request, urllib.parse, urllib.error

# example

fhandle = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')

counts = dict()
for line in fhandle:
    words = line.decode().split() # decode the bytes to a string, then split the line into words
    for word in words: # iterate over every word in the line
        # each word becomes a "key" in the "counts" dictionary;
        # its "value" goes up by 1 every time the word is seen
        counts[word] = counts.get(word, 0) + 1

# finally - print the "key: value" count pair for every word on the page
print(counts)


{'<h1>The': 1, 'First': 1, 'Page</h1>': 1, '<p>': 1, 'If': 1, 'you': 2, 'like,': 1, 'can': 1, 'switch': 1, 'to': 1, 'the': 1, '<a': 1, 'href="http://www.dr-chuck.com/page2.htm">': 1, 'Second': 1, 'Page</a>.': 1, '</p>': 1}


### Understanding Web Scraping
    Network Programs (Part 5)
    date: Sunday, June 26th 2022

![Web Scraping](images/wd05.png)

##### Why Scrape?

Reasons may include:

    1. Pulling data from the internet, particularly social data (e.g., "who links to whom?")
    2. Getting your own data back out of systems/platforms that have no "export" capability
    3. Monitoring a site for new or updated information
    4. "Spidering" (as scraping is sometimes called) to build a database for a search engine


**NOTE: You should be very careful when scraping/spidering web sites**
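
One concrete way to be careful is to check a site's robots.txt rules before fetching a page. The sketch below uses the standard library's `urllib.robotparser`. The rules and the example.com URLs are hypothetical, and they are parsed from a list of lines so that no network connection is needed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules (a real site publishes its own at /robots.txt).
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch() reports whether the given user agent may request the URL.
allowed = parser.can_fetch("*", "http://example.com/page1.htm")
blocked = parser.can_fetch("*", "http://example.com/private/notes.htm")
print(allowed, blocked)
```

This covers only robots.txt. A site's terms of service may impose further limits.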

In [32]:
# Working with "BeautifulSoup"

import urllib.request
from bs4 import BeautifulSoup

url = input("Enter - ")
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# retrieving all of the anchor tags

tags = soup("a") # list of all 'anchor tags' in the document/file
for tag in tags:
    print(tag.get("href", None)) # print the link target, or None if the tag has no "href"

http://www.dr-chuck.com/page2.htm
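
If BeautifulSoup is not installed, a similar result can be sketched with the standard library's `html.parser`. This is a swapped-in technique for illustration, not the approach used in the cell above. The class name `AnchorCollector` is hypothetical, and the HTML string is an assumption reconstructed from the word counts printed earlier.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; href is None if missing
            self.hrefs.append(dict(attrs).get("href"))

html = ('<h1>The First Page</h1><p>If you like, you can switch to the '
        '<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p>')

collector = AnchorCollector()
collector.feed(html)
print(collector.hrefs)
```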


----
### In Summary

![module summary](images/wd06.png)