# Understanding HTML 

## Reading Web pages

Here is an example of reading a simple webpage. 

In [1]:
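import urllib.request

# In Python 3, urlopen lives in the urllib.request module
fhand = urllib.request.urlopen('http://www.dr-chuck.com/page1.htm')

for line in fhand:
    # Each line arrives as bytes, so decode it before printing
    print(line.decode().strip())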
    

<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>


When we read the webpage, we can identify the links it contains and follow them by opening and reading those pages, and so on.
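As a rough illustration of how a program might identify those links, the sketch below feeds the page text shown above to the standard library's `html.parser` and collects every `href` it sees. The class name `LinkCollector` is a hypothetical helper, not something the original example defines, and the page text is pasted in as a string so the sketch runs without a network connection.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


# The page content printed above, pasted in as a string
page = '''<h1>The First Page</h1>
<p>
If you like, you can switch to the
<a href="http://www.dr-chuck.com/page2.htm">
Second Page</a>.
</p>'''

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Each collected link could then be opened and read in the same way as the first page.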


## Parsing HTML (also known as Web Scraping)


### What is Web Scraping?

* It's when a program or script pretends to be a browser: it retrieves web pages, looks at them, extracts information, and then moves on to more web pages.

* Search engines scrape web pages - we call this "spidering the web" or "web crawling". 
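The retrieve, extract, and follow loop described above can be sketched as a small breadth-first crawl. This is only a minimal sketch under stated assumptions: the `PAGES` dictionary is a made-up stand-in for the web so that the code runs offline, `fetch` and `crawl` are hypothetical names, and the regular expression is a simplification that only matches absolute, double-quoted `href` values.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML (an assumption for illustration)
PAGES = {
    'http://example.com/a': '<a href="http://example.com/b">B</a> '
                            '<a href="http://example.com/c">C</a>',
    'http://example.com/b': '<a href="http://example.com/a">A</a>',
    'http://example.com/c': '<p>No links here</p>',
}


def fetch(url):
    """Stand-in for a real HTTP request; returns '' for unknown pages."""
    return PAGES.get(url, '')


def crawl(start, max_pages=10):
    """Visit pages breadth-first, never queuing the same URL twice."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        html = fetch(url)
        # Simplified link extraction: absolute, double-quoted hrefs only
        for link in re.findall(r'href="(http[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl('http://example.com/a')
print(visited)
```

The `seen` set keeps the crawl from looping forever when pages link back to each other, and `max_pages` puts an upper bound on how much is retrieved.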

**Note:** Be careful with this: not every website may be scraped or crawled. Some require you to log in, so they know who you are, and the terms and conditions you accepted probably prohibit scraping (e.g., Facebook and Google).

### Why scrape?

* Pull data - particularly social data - who is linked to whom?
* Get your own data from some system that has no "export capability".
* Monitor a site for new information.
* Spider the web to make a database for a search engine.

## Parsing HTML with Beautiful Soup

* Retrieve a list of the anchor tags.
* Each tag is like a dictionary of HTML attributes.
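To make the "tag is like a dictionary" idea concrete, here is a small sketch that parses an inline HTML string rather than a live page. The HTML snippet and the variable names are assumptions made for illustration; the calls (`find_all`, indexing a tag, `get`, and `attrs`) come from Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one anchor with an href, one without
html = ('<p><a href="http://www.dr-chuck.com/page2.htm" title="next">'
        'Second Page</a> <a name="top">No link</a></p>')

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all('a')        # list of the anchor tags

first, second = tags
href = first['href']             # dictionary-style lookup
missing = second.get('href')     # get() returns None when the attribute is absent
attributes = first.attrs         # all attributes of the tag as a dict

print(href)
print(missing)
print(attributes)
```

Using `get` instead of square brackets avoids a `KeyError` for anchors that have no `href` attribute.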

In the following example we will grab the value of the `href` attribute of each anchor tag.

In [2]:
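import urllib.request
from bs4 import BeautifulSoup

In [3]:
# input() replaces Python 2's raw_input()
url = input('Enter - ')

Enter - http://www.dr-chuck.com/


In [4]:
# Download the page and hand the raw HTML to Beautiful Soup
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")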

In [5]:
# Calling the soup object is shorthand for soup.find_all('a')
tags = soup('a')

In [6]:
for tag in tags:
    # Print the href attribute, or None if the anchor has no href
    print(tag.get('href', None))

http://www.dr-chuck.com/csev-blog/
http://www.si.umich.edu/
http://www.ratemyprofessors.com/ShowRatings.jsp?tid=1159280
http://www.dr-chuck.com/csev-blog/
http://www.twitter.com/drchuck/
http://www.dr-chuck.com/dr-chuck/resume/speaking.htm
http://www.slideshare.net/csev
/dr-chuck/resume/index.htm
http://amzn.to/1K5Q81K
http://afs.dr-chuck.com/papers/
https://itunes.apple.com/us/podcast/computing-conversations/id731495760
http://www.youtube.com/playlist?list=PLHJB2bhmgB7dFuY7HmrXLj5BmHGKTD-3R
http://developers.imsglobal.org/
http://www.youtube.com/user/csev
http://vimeo.com/drchuck/videos
https://backpack.openbadges.org/share/4f76699ddb399d162a00b89a452074b3/
http://www.linkedin.com/pub/chuck-severance/2/92a/3a8
https://www.researchgate.net/profile/Charles_Severance/
http://www.tsugi.org/
http://youtu.be/slscHD40r78
https://www.coursera.org/course/pythonlearn
https://www.coursera.org/course/insidetheinternet
http://open.umich.edu/education/si/si502/winter2009/
http://www.pythonlearn.com
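One of the links printed above, `/dr-chuck/resume/index.htm`, is relative to the site root and cannot be opened on its own. A minimal sketch of how it might be resolved before following it, using the standard library's `urljoin`; the variable names and the two-item sample list are chosen here for illustration.

```python
from urllib.parse import urljoin

# The page the links were read from
base = 'http://www.dr-chuck.com/'

# Two of the hrefs from the output above: one relative, one absolute
links = ['/dr-chuck/resume/index.htm', 'http://www.tsugi.org/']

# urljoin resolves relative links against the base and leaves absolute ones alone
absolute = [urljoin(base, link) for link in links]

for link in absolute:
    print(link)
```

After this step every entry is a full URL that could be passed to `urlopen`.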

## Summary

* TCP/IP gives us pipes (sockets) between applications.
* We designed application protocols to make use of these pipes.
* HTTP is a simple yet powerful protocol.
* Python has good support for sockets, HTTP, and HTML parsing.
