# Welcome to the Dark Art of Coding:
## Introduction to Python
Gathering data from the web

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Use and understand the basics of the `urllib` module
* Use and understand the basics of the `Beautiful Soup` library

# Networks
---

## TCP

TCP (Transmission Control Protocol) is a protocol used to send data across a network.

* It relies upon some built-in mechanisms to help increase reliability
* TCP creates connections between two devices (it is referred to as a connection-oriented protocol)
* It uses checks to ensure that all data has been received correctly; if not, it can request that missing data be resent
* TCP organizes packets in order
* Between the reliability checks and the ordering of packets, it is very effective for sending files (like web pages)
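
To make the idea of a connection a little more concrete, here is a minimal sketch using Python's standard `socket` module. It is not part of the original session: the helper names `echo_once` and `tcp_round_trip` are made up for illustration, and everything runs on the local machine (`127.0.0.1`), so no Internet access is assumed.

```python
import socket
import threading


def echo_once(server_socket):
    """Accept one connection and send back whatever arrives."""
    connection, _address = server_socket.accept()
    with connection:
        while True:
            data = connection.recv(1024)
            if not data:  # the client has finished sending
                break
            connection.sendall(data)


def tcp_round_trip(messages):
    """Send each message over one TCP connection and return the echoed bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
        # Port 0 asks the operating system for any free port
        server_socket.bind(('127.0.0.1', 0))
        server_socket.listen(1)
        port = server_socket.getsockname()[1]

        worker = threading.Thread(target=echo_once, args=(server_socket,))
        worker.start()

        # create_connection() sets up the TCP connection for us
        with socket.create_connection(('127.0.0.1', port)) as client:
            for message in messages:
                client.sendall(message)
            client.shutdown(socket.SHUT_WR)  # signal that we are done sending

            received = b''
            while True:
                chunk = client.recv(1024)
                if not chunk:
                    break
                received += chunk

        worker.join()
    return received


print(tcp_round_trip([b'one ', b'two ', b'three']))
```

The bytes come back in the same order they were sent, which is the ordering guarantee described above.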


## Port numbers

The TCP protocol incorporates the use of port numbers:

* Port numbers are used by computers to ensure that traffic coming to a given computer gets funneled to the correct application
* Multiple ports allow multiple applications on the same computer to talk without interfering with each other
* Higher-level protocols typically have a default TCP port number that their applications use

Task | Port
:----|:----
Telnet | 23
SSH | 22
HTTP | 80
HTTPS | 443
SMTP (E-mail) | 25
DNS (Domain Name) | 53
FTP (File Transfer) | 21
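
As a rough illustration of how default ports come into play, the sketch below works out which port a URL refers to. The `DEFAULT_PORTS` dictionary and the `effective_port` helper are hypothetical names introduced here; the port numbers themselves are copied from the table above.

```python
from urllib.parse import urlsplit

# Default ports copied from the table above (not a complete list)
DEFAULT_PORTS = {
    'ftp': 21,
    'ssh': 22,
    'telnet': 23,
    'smtp': 25,
    'http': 80,
    'https': 443,
}


def effective_port(url):
    """Return the explicit port in the URL, or the default for its protocol."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


print(effective_port('http://www.example.com/index.html'))       # 80
print(effective_port('https://www.example.com/'))                # 443
print(effective_port('http://localhost:8000/jabberwocky.html'))  # 8000
```

When a URL names a port explicitly (as in the `localhost:8000` example), that port wins over the default.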

## HTTP (Hyper Text Transfer Protocol)

HTTP is a common protocol that is typically sent over TCP.

* HTTP is the standard protocol of the web and is used by many other applications on the internet
* It was invented to retrieve HTML, images, documents, etc.
* Basic concept:
    * Make a connection
    * Request a document
    * Retrieve the document
    * Close the connection

HTTP uses Uniform Resource Locators (URLs) to identify documents and other resources. A URL has several components:

* The protocol, generally HTTP (but it could be others)
* The server that hosts the document
* The name and path of the document

http://  | www.example.com | index.html
:--------|:-------------|:----------------
Protocol | Host         | Document
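
The same breakdown can be done in code. This is a small sketch using `urlsplit` from the standard `urllib.parse` module; the `describe_url` helper is a made-up name for illustration.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three components described above."""
    parts = urlsplit(url)
    return {
        'protocol': parts.scheme,
        'host': parts.netloc,
        'document': parts.path,
    }


print(describe_url('http://www.example.com/index.html'))
```

Note that `urlsplit` reports the document with its leading slash (`/index.html`), since the slash is part of the path.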

# HTTP

* The browser attempts to connect to `http://www.example.com`
* It issues a request for a document such as `index.html`
* The server sends the HTML document
* The browser renders the HTML
* The connection is closed when done
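
The steps above can be sketched with the standard library alone. This example is an illustration, not part of the original session: it starts a tiny throwaway web server on the local machine (the `OnePageHandler` class and the `PAGE` content are made up) and then plays the role of the browser using `http.client`.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b'<html><body><h1>Hello</h1></body></html>'  # made-up document


class OnePageHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET request with PAGE."""

    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def fetch(host, port, path):
    """Connect, request a document, retrieve it, and close the connection."""
    # The connection is actually opened when the request is sent
    connection = http.client.HTTPConnection(host, port, timeout=5)
    try:
        connection.request('GET', path)       # request a document
        response = connection.getresponse()   # retrieve the document
        return response.status, response.read()
    finally:
        connection.close()                    # close the connection


# Port 0 asks the operating system for any free port
server = HTTPServer(('127.0.0.1', 0), OnePageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    status, body = fetch('127.0.0.1', server.server_port, '/index.html')
finally:
    server.shutdown()
    server.server_close()

print(status, body.decode())
```

In day-to-day use, `urllib.request.urlopen()` wraps these steps into a single call, which is what the next section relies on.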

# HTTP requests in Python using urllib
---

In [2]:
# First we have to import the request module from urllib

import urllib.request

In [34]:
# urllib allows us to open web pages just like opening files.
# The following command creates an http.client.HTTPResponse object that
#     gives us access to a number of attributes and behaviors
#     related to the data retrieved

file = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/11757/pg11757.txt')

In [35]:
# A common technique is to use a for loop to cycle through every
# line and print out the data one line at a time
# In this case, the data is read in as bytes

for line in file:
    # We convert each line from bytes to a string using the
    #     .decode() method.
    print(line.decode().strip())

﻿The Project Gutenberg eBook, The Velveteen Rabbit, by Margery Williams,
Illustrated by William Nicholson


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.net





Title: The Velveteen Rabbit

Author: Margery Williams

Release Date: March 29, 2004  [eBook #11757]

Language: English


***START OF THE PROJECT GUTENBERG EBOOK THE VELVETEEN RABBIT***



This eBook is courtesy of the Celebration of Women Writers, online at
http://digital.library.upenn.edu/women/.

THE
Velveteen Rabbit

OR
HOW TOYS BECOME REAL

by Margery Williams
Illustrations by William Nicholson

DOUBLEDAY & COMPANY, INC.
Garden City                   New York
_________________________________________________________________

To Francesco Bianco
from
The Velveteen Rabbit
_________________________________________________________

In [36]:
# Much like other files we have looked at, we can 
# read and evaluate the text in web-based text files,
# for example by counting words

import pprint
import urllib.request
file = urllib.request.urlopen('http://www.gutenberg.org/cache/epub/11757/pg11757.txt')

In [37]:
count = {}

for line in file:
    
    # Again, we take the line and use .decode() to convert
    #     the data to a string
    #     Then we strip the newline
    #     Then we split it on spaces
    words = line.decode().strip().split()
    
    # We cycle through the words one at a time
    for word in words:
        
        # If a key for the word already exists .get() grabs the value otherwise it automatically returns 0
        count[word] = count.get(word, 0) + 1

In [38]:
pprint.pprint(count)

{'"Can': 1,
 '"Come': 1,
 '"Defects,"': 1,
 '"Does': 3,
 '"Fancy': 2,
 '"Give': 1,
 '"He': 3,
 '"Here,"': 1,
 '"Ho!"': 1,
 '"How': 1,
 '"Hurrah!"': 1,
 '"I': 14,
 '"I\'d': 1,
 '"I\'ve': 1,
 '"Information': 1,
 '"It': 1,
 '"It\'s': 2,
 '"Little': 1,
 '"Oh,': 2,
 '"Plain': 2,
 '"Project': 5,
 '"Real': 1,
 '"Right': 1,
 '"Run': 1,
 '"Sometimes,"': 1,
 '"That': 1,
 '"That?"': 1,
 '"The': 1,
 '"Then': 1,
 '"To-morrow': 1,
 '"Wasn\'t': 1,
 '"What': 1,
 '"When': 1,
 '"Why': 2,
 '"Why,': 2,
 '"You': 5,
 '"because': 1,
 '"don\'t': 1,
 '"or': 1,
 '"take': 1,
 '"tidying': 1,
 '#10000,': 2,
 '#11757]': 1,
 '$5,000)': 1,
 '&': 1,
 "'AS-IS,'": 1,
 '("the': 1,
 '($1': 1,
 '(801)': 1,
 '(Or': 1,
 '(a)': 1,
 '(and': 1,
 '(any': 1,
 '(available': 1,
 '(b)': 1,
 '(c)': 1,
 '(does': 1,
 '(if': 1,
 '(or': 3,
 '(trademark/copyright)': 1,
 '(which': 1,
 '(www.gutenberg.net),': 1,
 '(zipped),': 1,
 '***': 4,
 '*******': 2,
 '***END': 1,
 '***START': 1,
 '-': 7,
 '/etext': 1,
 '00,': 1,
 '01,': 1,
 '02,': 1,
 

 'glade': 1,
 'glass': 1,
 'go': 5,
 'goals': 1,
 'going': 4,
 'golden': 1,
 'gone': 1,
 'gone.': 1,
 'got': 5,
 'govern': 1,
 'granted': 1,
 'grass,': 3,
 'grass.': 2,
 'gratefully': 1,
 'great': 5,
 'green': 2,
 'grew': 7,
 'grey,': 1,
 'gross': 1,
 'ground,': 2,
 'ground.': 1,
 'group': 1,
 'growing': 2,
 'grumbled': 1,
 'had': 44,
 "hadn't": 1,
 'hair': 1,
 'hair,': 1,
 'hairs': 1,
 'half': 1,
 'handle?"': 1,
 'hands': 1,
 'hands.': 1,
 'happen': 3,
 'happened': 2,
 'happened.': 1,
 'happening': 1,
 'happens': 1,
 'happy': 3,
 'happy,': 1,
 'happy-so': 1,
 'hard': 2,
 'harmless': 1,
 'has': 3,
 "hasn't": 2,
 'hated': 1,
 'have': 15,
 'have!"': 1,
 'having': 1,
 'he': 111,
 'head': 2,
 'hear': 1,
 'heard': 1,
 'heart': 1,
 'held': 2,
 'help': 2,
 'help,': 1,
 'helped': 1,
 'her': 8,
 'her,': 2,
 'her.': 1,
 'hidden': 1,
 'higher': 1,
 'him': 33,
 'him,': 10,
 'him.': 14,
 'himself': 2,
 'himself,': 2,
 'himself:': 1,
 'hind': 10,
 'his': 46,
 'hold': 1,
 'holder': 1,
 'holder),': 1,
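
The output above counts `'him'`, `'him,'` and `'him.'` as three different words because punctuation stays attached. One possible refinement, shown here as a sketch rather than as the method used in the session, is to strip punctuation and lowercase each word before counting. The `count_words` helper and the two sample lines are made up for illustration; `collections.Counter` is a standard library class that replaces the `count.get(word, 0) + 1` pattern.

```python
import string
from collections import Counter


def count_words(lines):
    """Count words in an iterable of bytes lines, ignoring case and punctuation."""
    counts = Counter()
    for line in lines:
        for word in line.decode('utf-8').split():
            # Remove punctuation from both ends, then normalize the case
            cleaned = word.strip(string.punctuation).lower()
            if cleaned:
                counts[cleaned] += 1
    return counts


# Made-up sample lines standing in for the lines of a downloaded file
sample = [
    b'"Real?" asked the Rabbit.\r\n',
    b'The rabbit was real.\r\n',
]

counts = count_words(sample)
print(counts['real'], counts['rabbit'], counts['asked'])
```

Because the function accepts any iterable of bytes lines, the object returned by `urllib.request.urlopen()` could be passed in directly in place of `sample`.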

# Unicode and Python text

* Internally, within Python 3+, all Python strings are Unicode
* When we talk to a network we usually have to encode and decode our data (generally using `utf-8`)
* When we receive data we typically receive it as a `bytes` object, which we then pass through a `.decode()` method to get a string
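
Here is a short sketch of the round trip between strings and bytes. The sample text and the `decode_line` helper are made up for illustration. The `utf-8-sig` codec is a standard Python codec that behaves like `utf-8` but also drops a leading byte order mark (BOM), which some text files start with.

```python
text = 'naïve café'

# Encoding turns a string into bytes, ready to send over a network
data = text.encode('utf-8')
print(type(data), data)

# Decoding turns the bytes back into the original string
print(data.decode('utf-8') == text)

# Non-ASCII characters take more than one byte in utf-8
print(len(text), len(data))


def decode_line(raw, encoding='utf-8-sig'):
    """Decode one line of bytes and strip surrounding whitespace."""
    return raw.decode(encoding).strip()


# A line that starts with a byte order mark and ends with a newline
print(decode_line(b'\xef\xbb\xbfThe\r\n'))
```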


In [39]:
# Poor man's debugging...
# I find this to be one of the most useful lines of code for a 
#     new Pythonista

print(type(line), line)

<class 'bytes'> b'*** END: FULL LICENSE ***\r\n'


In [40]:
# Let us look at the difference between outputting:
#     a bytes object vs.
#     a string

print(line)
print(line.decode())

b'*** END: FULL LICENSE ***\r\n'
*** END: FULL LICENSE ***



# Reading web pages
---

In [41]:
# Our earlier examples were fairly straightforward, since we 
#     retrieved text files. Most of the web is not 
#     plain text files; it is composed of 
#     Hyper Text Markup Language (HTML)

# We request a page using urllib.request.urlopen()

page = urllib.request.urlopen('http://www.example.com/index.html')

for line in page:
    print(line.decode().strip())

<!doctype html>
<html>
<head>
<title>Example Domain</title>

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
background-color: #f0f0f2;
margin: 0;
padding: 0;
font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
width: 600px;
margin: 5em auto;
padding: 50px;
background-color: #fff;
border-radius: 1em;
}
a:link, a:visited {
color: #38488f;
text-decoration: none;
}
@media (max-width: 700px) {
body {
background-color: #fff;
}
div {
width: auto;
margin: 0 auto;
border-radius: 0;
padding: 1em;
}
}
</style>
</head>

<body>
<div>
<h1>Example Domain</h1>
<p>This domain is established to be used for illustrative examples in documents. You may use this
domain in examples without prior coordination or asking for permission.</p>
<p><a href="http://www.iana.org/domains/example">More information...</a></p>
</div>
</body>
</

# Beautiful soup
---

While it is possible to use `urllib` alone to read data from the web, a third-party library, `Beautiful Soup`, is commonly used alongside it to parse the pages that `urllib` retrieves. `Beautiful Soup`:

* Makes reading and parsing web pages a lot easier
* Allows you to extract tags of only certain types
* Lets you find tags based on their relationship in the tag hierarchy
* Makes getting hyperlinks a whole lot easier
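
As a quick taste of those points, here is a minimal sketch that parses a small, made-up HTML string instead of a live web page, so it runs without network access. It assumes `beautifulsoup4` is installed; the HTML content and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page content standing in for a downloaded web page
HTML = """
<html><body>
<div id="menu">
  <a href="/about.html">About</a>
  <a href="/contact.html">Contact</a>
</div>
<p>See the <a href="http://www.example.com/">example site</a>.</p>
</body></html>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# Extract tags of only one type: every anchor tag on the page
all_links = [tag.get('href') for tag in soup.find_all('a')]

# Use the tag hierarchy: only the anchors inside the menu <div>
menu = soup.find('div', id='menu')
menu_links = [tag.get('href') for tag in menu.find_all('a')]

# Get the text of a tag with the markup removed
paragraph_text = soup.find('p').get_text()

print(all_links)
print(menu_links)
print(paragraph_text)
```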

## On the command line

To install Beautiful Soup, if not already installed, you can run this on the command line:

*`conda install beautifulsoup4`*

## In a Python file/interpreter

In [43]:
# Import the necessary modules

from bs4 import BeautifulSoup
import urllib.request

In [44]:
# Get the html text from the HTTPResponse object
# Notice the read() method >>>

htmlText = urllib.request.urlopen('http://www.unicode.org/').read()

In [45]:
# Use bs4 to create a soup object from our html text
# Provide an argument to identify which type of parser to
#     use, in this case, an html parser

soup = BeautifulSoup(htmlText, 'html.parser')

In [46]:
# The soup object allows you to retrieve specific types of tags, in this
#     case anchor tags (identified using an 'a'). Anchor tags are used for links.

tags = soup('a') 

In [47]:
# Let's cycle through the tags and get the 'href' data portion.
#     This is the data that contains the link itself

for tag in tags:
    print(tag.get('href', None))

http://www.unicode.org/contacts.html
http://www.unicode.org/sitemap/
http://www.unicode.org/search/

http://www.unicode.org/standard/WhatIsUnicode.html
http://www.unicode.org/consortium/newcomer.html
http://www.unicode.org/glossary/
http://www.unicode.org/copyright.html

http://www.unicode.org/standard/where/
http://www.unicode.org/help/display_problems.html
http://www.unicode.org/consortium/distlist.html
http://www.unicode.org/resources/
http://www.unicode.org/history/
http://www.unicode.org/casestudy/
http://www.unicode.org/policies/lastresortfont_eula.html
http://www.unicode.org/timesens/calendar.html

http://www.unicode.org/education/
http://www.unicode.org/education/consortwork.html
http://www.unicode.org/education/students.html
http://www.unicode.org/education/ngos.html

http://www.unicode.org/conference/about-conf.html
http://www.unicode.org/standard/tutorial-info.html

http://www.unicode.org/faq/
http://www.unicode.org/faq/basic_q.html
http://www.unicode.org/faq/emoji_dingbats.
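
A list of links gathered this way often needs a little tidying before it is useful: some anchor tags have no `href` at all (`.get()` returns `None` for those), some links repeat, and some are relative to the page they came from. The sketch below shows one way to handle that with `urljoin` from the standard library. The `clean_links` helper and the sample list are made up for illustration, including the relative `'glossary/'` entry.

```python
from urllib.parse import urljoin


def clean_links(hrefs, base_url):
    """Drop missing links, resolve relative ones, and remove duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:  # skips both None and empty strings
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Made-up sample of what tag.get('href') might return for several tags
raw = [
    'http://www.unicode.org/faq/',
    None,
    '',
    'glossary/',
    'http://www.unicode.org/faq/',
]

print(clean_links(raw, 'http://www.unicode.org/'))
```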

# Using documentation
---

Let's explore the documentation for a third party library.

The documentation for Beautiful Soup has a number of nice attributes that can get you started fairly quickly, so let's use the documentation to enhance our knowledge of the subject.

[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

# Web scraping
---

## What is web scraping?

Web scraping is a technique used to retrieve data from the web OR from similar networks (intranets, etc).

* Web scrapers simulate the behavior of a browser
* They look at the data from specific site(s)
* They extract specific information you need from it
* Typically this is done over and over again across multiple sites

## Why web scrape?

* Get data from sites that don't provide mechanisms to export the data
* Collect information on sites to build a search engine database
* Monitor sites for changes
* Collect social network data
    * Who is connected to or communicates with whom?
    * What is being said?

# Miscellaneous:

In [29]:
# source:
# http://www.jabberwocky.com/carroll/jabber/jabberwocky.html

In [33]:
# the following command will run an HTTP server on your local computer...
# Run this from the command line.
#     This allows a class like this to be run in an isolated environment that may not have access to the Internet.
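
The command itself does not appear in these notes. Python's standard `http.server` module is one way to serve files locally; for example, running `python -m http.server 8000` from a directory serves that directory on port 8000, which may or may not be the exact command originally intended. The sketch below does the same thing from inside Python. The `QuietHandler` and `serve_directory` names, the temporary directory, and the `hello.html` file are all made up for illustration.

```python
import functools
import pathlib
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class QuietHandler(SimpleHTTPRequestHandler):
    """File-serving handler that does not print a log line per request."""

    def log_message(self, format, *args):
        pass


def serve_directory(directory):
    """Start a local web server for the given directory in a background thread."""
    handler = functools.partial(QuietHandler, directory=str(directory))
    # Port 0 asks the operating system for any free port
    server = ThreadingHTTPServer(('127.0.0.1', 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


with tempfile.TemporaryDirectory() as folder:
    # Create a made-up page to serve
    pathlib.Path(folder, 'hello.html').write_text('<h1>Hello</h1>', encoding='utf-8')

    server = serve_directory(folder)
    try:
        url = f'http://127.0.0.1:{server.server_port}/hello.html'
        # An empty ProxyHandler ignores any proxy settings, since the server is local
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url) as page:
            text = page.read().decode('utf-8')
    finally:
        server.shutdown()
        server.server_close()

print(text)
```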

In [30]:
page = urllib.request.urlopen('http://localhost:8000/jabberwocky.html')

In [31]:
page

<http.client.HTTPResponse at 0x10ef50cc0>

In [32]:
text = page.read()
print(text)

b'<html><head><title>Jabberwocky</title></head><body bgcolor="#FFFFFF">\n<center><h1>JABBERWOCKY</h1>\n\n<h2>Lewis Carroll</h2>\n\n(from <cite>Through the Looking-Glass and What Alice Found There</cite>,\n1872)\n\n<p>\n<font size="+2">\n`Twas brillig, and the slithy toves<br>\n&nbsp;&nbsp;Did gyre and gimble in the wabe:<br>\nAll mimsy were the borogoves,<br>\n&nbsp;&nbsp;And the mome raths outgrabe.<p>\n</center>\n\n<img src="/pics/jabberwocky.jpg" align="right" border=0 width=291\nheight=432>\n\n<p><br>\n\n"Beware the Jabberwock, my son!<br>\n&nbsp;&nbsp;The jaws that bite, the claws that catch!<br>\nBeware the Jubjub bird, and shun<br>\n&nbsp;&nbsp;The frumious Bandersnatch!"<br>\n\n<p>\n\nHe took his vorpal sword in hand:<br>\n&nbsp;&nbsp;Long time the manxome foe he sought --<br>\nSo rested he by the Tumtum tree,<br>\n&nbsp;&nbsp;And stood awhile in thought.<br>\n\n<p>\n\nAnd, as in uffish thought he stood,<br>\n&nbsp;&nbsp;The Jabberwock, with eyes of flame,<br>\nCame whiffling t