# Welcome to the Dark Art of Coding:
## Introduction to Python
Gathering data from the web

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Use and understand the basics of the `urllib` module
* Use and understand the basics of the `Beautiful Soup` library

# Networks
---

## TCP

TCP (Transmission Control Protocol) is a protocol used to send data across a network.

* It relies on built-in mechanisms to increase reliability
* TCP creates connections between two devices (it is referred to as a connection-oriented protocol)
* It uses checks to ensure that all data has been correctly received; if not, it can request that missing data be resent
* TCP reassembles packets in the correct order
* Between the reliability checks and the ordering of packets, it is very effective for sending files (like web pages)
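
To make the connection-oriented idea concrete, here is a minimal sketch using Python's standard `socket` module. It is only an illustration: the helper `echo_once`, the loopback address and the message are assumptions made for the demo, not part of the lesson files.

```python
import socket
import threading


def echo_once(server_socket):
    """Accept one TCP connection and send back whatever arrives, uppercased."""
    connection, _address = server_socket.accept()
    with connection:
        data = connection.recv(1024)
        connection.sendall(data.upper())


# SOCK_STREAM asks for TCP. Binding to port 0 lets the operating
# system pick any free port (an assumption to keep the demo portable).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('127.0.0.1', 0))
server.listen(1)
host, port = server.getsockname()

worker = threading.Thread(target=echo_once, args=(server,))
worker.start()

# The client must connect before any data can be exchanged.
with socket.create_connection((host, port), timeout=5) as client:
    client.sendall(b'hello over tcp')
    reply = client.recv(1024)

worker.join()
server.close()

print(reply)
```

Both ends run in the same process here only so the sketch works without a network; on a real network the two sockets would normally live on different machines.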


## Port numbers

The TCP protocol incorporates the use of port numbers:

* Port numbers are used by computers to ensure that traffic coming to a given computer gets funneled to the correct application
* Multiple ports allow multiple applications on the same computer to talk without interfering with each other
* Many higher-level protocols have a default TCP port number, as shown in the table below

Protocol | Default port
:----|:----
Telnet | 23
SSH | 22
HTTP | 80
HTTPS | 443
SMTP (E-mail) | 25
DNS (Domain Name) | 53
FTP (File Transfer) | 21

## HTTP (Hyper Text Transfer Protocol)

HTTP is a common protocol that may be sent using TCP.

* HTTP is the standard protocol for most applications on the web
* Invented to retrieve HTML, images, documents, etc.
* Basic concept:
    * Make a connection
    * Request a document
    * Retrieve the document
    * Close the connection

HTTP uses Uniform Resource Locators (URLs) to identify resources. A URL has several components:

* The protocol, generally HTTP (but it could be others)
* The server that hosts the document
* The name and path to the document

http://  | www.example.com | /index.html
:--------|:-------------|:----------------
Protocol | Host         | Document
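
The same breakdown can be sketched in Python with `urllib.parse.urlsplit` from the standard library. The helper `describe_url` and the `DEFAULT_PORTS` dictionary are hypothetical names introduced for this example; the port numbers come from the table of ports earlier in this lesson.

```python
from urllib.parse import urlsplit

# Default ports for a few protocols (hypothetical helper table,
# values taken from the port table earlier in the lesson).
DEFAULT_PORTS = {'http': 80, 'https': 443, 'ftp': 21}


def describe_url(url):
    """Split a URL into protocol, host, port and document path."""
    parts = urlsplit(url)
    # parts.port is None when the URL does not name a port explicitly.
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return {
        'protocol': parts.scheme,
        'host': parts.hostname,
        'port': port,
        'document': parts.path,
    }


print(describe_url('http://www.example.com/index.html'))
print(describe_url('http://localhost:9999/annabel_lee.txt'))
```

When the URL does not include a port, the sketch falls back to the protocol's default, which is roughly what a browser does.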

# HTTP

* Browser attempts to connect to `http://www.example.com`
* Issues a request for a document such as `index.html`
* The server sends the HTML document
* Browser renders the HTML
* Closes connection when done
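
The steps above can be sketched with the standard `http.client` module. To keep the example runnable offline, it starts a tiny throwaway server in the same process; `HelloHandler` and its fixed response are assumptions made purely for this demo.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a tiny HTML page."""

    def do_GET(self):
        body = b'<html><body>Hello</body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system choose a free port.
server = HTTPServer(('127.0.0.1', 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1. Make a connection
connection = http.client.HTTPConnection('127.0.0.1', port, timeout=5)
connection.connect()
# 2. Request a document
connection.request('GET', '/index.html')
# 3. Retrieve the document
response = connection.getresponse()
status = response.status
document = response.read()
# 4. Close the connection
connection.close()

server.shutdown()
server.server_close()

print(status, document)
```

A browser does the same four steps and then renders the HTML it received instead of printing it.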

# Standing up a local HTTP server
---

Python lets you stand up your own HTTP server.
This lesson is designed for use in offline environments, where you may not have access to the Internet.

In those cases, we start this lesson by standing up our own server and using Python to interact with web pages on that server. The behaviors and code will all be the same; the only thing that changes is the URL.

Do the following on the command line. It will run an HTTP server on your local computer, serving the folder where you execute the Python command:

```bash
$ cd path/to/this/lesson_directory/12_internet/
$ python -m http.server 9999
```

|Command|Purpose|
|:--|:---|
|`python` | calls the Python interpreter|
|`-m http.server` | asks the interpreter to run the `http.server` module as a script, which starts a basic HTTP server|
|`9999` | designates which port to use for your server|

Open your browser and surf to:

```bash
localhost:9999
```


You will see something that looks like this:
    
<img src='./http_server_dir_list.png' width='300' style="float:center">
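
If you would rather start the server from inside Python (for example, from a notebook), the sketch below shows one possible approach with the same standard-library pieces. The temporary folder and `hello.txt` are hypothetical stand-ins for the lesson directory, and port `0` (any free port) is used instead of `9999` so the demo cannot collide with a server that is already running.

```python
import functools
import tempfile
import threading
import urllib.request
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

# Hypothetical demo folder with one file, standing in for the lesson directory.
folder = tempfile.mkdtemp()
Path(folder, 'hello.txt').write_text('served from a local folder', encoding='utf-8')

# SimpleHTTPRequestHandler serves files from a folder; `directory`
# tells it which folder to use instead of the current one.
handler = functools.partial(SimpleHTTPRequestHandler, directory=folder)

# Port 0 lets the operating system choose a free port.
server = ThreadingHTTPServer(('127.0.0.1', 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Fetch the file back over HTTP to confirm the server is working.
with urllib.request.urlopen(f'http://127.0.0.1:{port}/hello.txt', timeout=5) as response:
    text = response.read().decode('utf-8')

server.shutdown()
server.server_close()

print(text)
```

The command-line version is still the simplest choice for this lesson; the in-process version is mainly handy when you cannot open a second terminal.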

# HTTP requests in Python using urllib
---

In [1]:
# First we have to import the request module from urllib

import urllib.request

In [2]:
# urllib allows us to open web pages just like opening files.
# The following command creates an http.client.HTTPResponse object that
#     gives us access to a number of attributes and behaviors
#     related to the data retrieved

file = urllib.request.urlopen('http://localhost:9999/annabel_lee.txt')

In [3]:
# A common technique is to use a for loop to cycle through every
# line and print out the data one line at a time
# In this case, the data is read in as bytes

for line in file:
    # We convert each line from bytes to a string using the
    #     .decode() method.
    print(line.decode().strip())

Annabel Lee
Edgar Allen Poe, 1809 - 1849

It was many and many a year ago,
In a kingdom by the sea,
That a maiden there lived whom you may know
By the name of Annabel Lee;
And this maiden she lived with no other thought
Than to love and be loved by me.

I was a child and she was a child,
In this kingdom by the sea:
But we loved with a love that was more than love--
I and my Annabel Lee;
With a love that the winged seraphs of heaven
Coveted her and me.

And this was the reason that, long ago,
In this kingdom by the sea,
A wind blew out of a cloud, chilling
My beautiful Annabel Lee;
So that her highborn kinsman came
And bore her away from me,
To shut her up in a sepulchre
In this kingdom by the sea.

The angels, not half so happy in heaven,
Went envying her and me--
Yes!--that was the reason (as all men know,
In this kingdom by the sea)
That the wind came out of the cloud by night,
Chilling and killing my Annabel Lee.

But our love it was stronger by far than the love
Of those who were o
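
In practice a URL may be unreachable, so it can help to wrap the request in a `with` block and catch `urllib.error.URLError`. The following is a minimal sketch, not the lesson's own code: `read_lines` is a hypothetical helper, and a temporary file addressed with a `file://` URL stands in for the local server so the example runs offline.

```python
import tempfile
import urllib.error
import urllib.request
from pathlib import Path


def read_lines(url, encoding='utf-8', timeout=5):
    """Return the decoded lines found at url, or None if it cannot be opened."""
    try:
        # The with block closes the response even if decoding fails.
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return [line.decode(encoding).rstrip('\r\n') for line in response]
    except urllib.error.URLError as error:
        print('Could not open', url, '->', error.reason)
        return None


# Offline stand-in (an assumption for this demo): a temporary file
# addressed with a file:// URL instead of http://localhost:9999/...
sample = Path(tempfile.mkdtemp(), 'sample.txt')
sample.write_text('first line\nsecond line\n', encoding='utf-8')

lines = read_lines(sample.as_uri())
missing = read_lines(sample.with_name('missing.txt').as_uri())

print(lines)
print(missing)
```

With the local server running, the same helper could be called with `'http://localhost:9999/annabel_lee.txt'`.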

In [6]:
# Much like other files we have looked at, we can 
# read and evaluate the text in web-based text files,
# for example by counting words

import pprint
import urllib.request
file = urllib.request.urlopen('http://localhost:9999/annabel_lee.txt')

In [7]:
count = {}

for line in file:
    
    # Again, we take the line and use .decode() to convert
    #     the data to a string
    #     Then we strip the newline
    #     Then we split it on spaces
    words = line.decode().strip().split()
    
    # We cycle through the words one at a time
    for word in words:
        
        # If a key for the word already exists, .get() returns its value;
        #     otherwise it returns the default of 0
        count[word] = count.get(word, 0) + 1

In [8]:
pprint.pprint(count)

{'(as': 1,
 '-': 1,
 '1809': 1,
 '1849': 1,
 'A': 1,
 'Allen': 1,
 'And': 6,
 'Annabel': 8,
 'But': 2,
 'By': 1,
 'Can': 1,
 'Chilling': 1,
 'Coveted': 1,
 'Edgar': 1,
 'For': 1,
 'I': 4,
 'In': 7,
 'It': 1,
 'Lee': 1,
 'Lee.': 1,
 'Lee:': 1,
 'Lee;': 5,
 'My': 1,
 'Nor': 1,
 'Of': 6,
 'Poe,': 1,
 'So': 1,
 'Than': 1,
 'That': 2,
 'The': 1,
 'To': 1,
 'Went': 1,
 'With': 1,
 'Yes!--that': 1,
 'a': 9,
 'above,': 1,
 'ago,': 2,
 'all': 2,
 'and': 8,
 'angels': 1,
 'angels,': 1,
 'away': 1,
 'be': 1,
 'beams,': 1,
 'beautiful': 4,
 'blew': 1,
 'bore': 1,
 'bride,': 1,
 'bright': 1,
 'bringing': 1,
 'but': 1,
 'by': 11,
 'came': 2,
 'child': 1,
 'child,': 1,
 'chilling': 1,
 'cloud': 1,
 'cloud,': 1,
 'darling--my': 2,
 'demons': 1,
 'dissever': 1,
 'down': 2,
 'dreams': 1,
 'envying': 1,
 'ever': 1,
 'eyes': 1,
 'far': 2,
 'feel': 1,
 'from': 2,
 'half': 1,
 'happy': 1,
 'heaven': 2,
 'heaven,': 1,
 'her': 7,
 'highborn': 1,
 'in': 3,
 'it': 1,
 'killing': 1,
 'kingdom': 5,
 'kinsman': 1,

# Unicode and Python text

* Internally, within Python 3+, all Python strings are Unicode
* When we talk to a network, we usually have to encode outgoing strings to bytes and decode incoming bytes to strings (generally using `utf-8`)
* When we receive data, we typically receive it as a `bytes` object, which we then pass through a `.decode()` method to get a string
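
A short sketch of the round trip between `str` and `bytes` may help. The word `'café'` is just an illustrative sample chosen because it contains a non-ASCII character.

```python
text = 'café'

# Encoding turns a string into bytes that can travel over a network.
data = text.encode('utf-8')
print(type(data), data)

# The non-ASCII character takes two bytes in UTF-8, so the
# lengths differ: 4 characters but 5 bytes.
print(len(text), len(data))

# Decoding turns the bytes back into the original string.
round_trip = data.decode('utf-8')
print(type(round_trip), round_trip)
```

Calling `.decode()` with no argument, as the earlier cells do, uses `utf-8` by default.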


In [9]:
# Poor man's debugging...
# I find this to be one of the most useful lines of code for a 
#     new Pythonista

print(type(line), line)

<class 'bytes'> b'   In her tomb by the sounding sea.'


In [10]:
# Let us look at the difference between outputting:
#     a bytes object vs.
#     a string

print(line)
print(line.decode())

b'   In her tomb by the sounding sea.'
   In her tomb by the sounding sea.


# Reading web pages
---

In [19]:
# Our earlier examples were fairly straightforward, since we 
#     retrieved text files. Most of the web is not 
#     plain text files; it is composed of 
#     Hyper Text Markup Language (HTML)

# We request a page using urllib.request.urlopen()

page = urllib.request.urlopen('http://localhost:9999/jabberwocky.html')

for line in page:
    print(line.decode().strip())
    
# source:
# http://www.jabberwocky.com/carroll/jabber/jabberwocky.html    

<html><head><title>Jabberwocky</title></head><body bgcolor="#FFFFFF">
<center><h1>JABBERWOCKY</h1>

<h2>Lewis Carroll</h2>

(from <cite>Through the Looking-Glass and What Alice Found There</cite>,
1872)

<p>
<font size="+2">
`Twas brillig, and the slithy toves<br>
&nbsp;&nbsp;Did gyre and gimble in the wabe:<br>
All mimsy were the borogoves,<br>
&nbsp;&nbsp;And the mome raths outgrabe.<p>
</center>

<img src="./pics/jabberwocky.jpg" align="right" border=0 width=291
height=432>

<p><br>

"Beware the Jabberwock, my son!<br>
&nbsp;&nbsp;The jaws that bite, the claws that catch!<br>
Beware the Jubjub bird, and shun<br>
&nbsp;&nbsp;The frumious Bandersnatch!"<br>

<p>

He took his vorpal sword in hand:<br>
&nbsp;&nbsp;Long time the manxome foe he sought --<br>
So rested he by the Tumtum tree,<br>
&nbsp;&nbsp;And stood awhile in thought.<br>

<p>

And, as in uffish thought he stood,<br>
&nbsp;&nbsp;The Jabberwock, with eyes of flame,<br>
Came whiffling through the tulgey wood,<br>
&nbsp;&nbs

# Beautiful soup
---

While it is possible to use `urllib` alone to read data from the web, a third-party library, `Beautiful Soup`, is commonly used alongside it to parse the pages it retrieves. `Beautiful Soup`:

* Makes reading and parsing web pages a lot easier
* Allows you to extract only tags of certain types
* Lets you find tags based on their relationship in the tag hierarchy
* Makes getting hyperlinks a whole lot easier

## On the command line

To install Beautiful Soup, if not already installed, you can run this on the command line:

*`conda install beautifulsoup4`*

## In a Python file/interpreter

In [14]:
# Import the necessary modules

from bs4 import BeautifulSoup
import urllib.request

In [15]:
# Get the html text from the HTTPResponse object
# Notice the .read() method, which returns the whole response body as bytes

htmlText = urllib.request.urlopen('http://localhost:9999/jabberwocky.html').read()

In [16]:
# Use bs4 to create a soup object from our html text
# Provide an argument to identify which type of parser to
#     use, in this case, an html parser

soup = BeautifulSoup(htmlText, 'html.parser')

In [17]:
# The soup object allows you to retrieve specific types of tags, in this
#     case anchor tags (identified using an 'a'). Anchor tags are used for links.
# Calling soup('a') is shorthand for soup.find_all('a')

tags = soup('a') 

In [18]:
# Let's cycle through the tags and get the 'href' data portion.
# This is the data that contains the link itself

for tag in tags:
    print(tag.get('href', None))

mailto:dshaw@jabberwocky.com
/carroll/jabber/
/carroll/
/
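
Most of the links printed above are relative to the page they came from. If you want full URLs, one option is `urllib.parse.urljoin` from the standard library. This sketch assumes the page was fetched from the local server URL used earlier; the list of links is copied from the output above.

```python
from urllib.parse import urljoin

# The URL the page was fetched from (the local server used in this lesson).
page_url = 'http://localhost:9999/jabberwocky.html'

# The href values printed above.
hrefs = [
    'mailto:dshaw@jabberwocky.com',
    '/carroll/jabber/',
    '/carroll/',
    '/',
]

# urljoin resolves relative links against the page URL and leaves
# links with a different scheme, such as mailto:, unchanged.
absolute = [urljoin(page_url, href) for href in hrefs]

for link in absolute:
    print(link)
```

Because the page was served locally, the resolved links point at the local server rather than the site the page was originally copied from.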


# Using documentation
---

Let's explore the documentation for a third party library.

The documentation for Beautiful Soup has a number of nice attributes that can get you started fairly quickly, so let's use the documentation to enhance our knowledge of the subject.

[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
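
As a starting point for exploring, here is a small sketch of a few calls described in that documentation: `soup.title`, `find()`, `find_all()` and `get_text()`. It assumes Beautiful Soup is installed. The inline HTML is a shortened, hypothetical stand-in for the Jabberwocky page (the link text is made up) so the example runs without a server.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical stand-in for the Jabberwocky page.
html = """
<html><head><title>Jabberwocky</title></head>
<body>
<h1>JABBERWOCKY</h1>
<h2>Lewis Carroll</h2>
<a href="/carroll/">Carroll</a>
<a href="/">Home</a>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Tags can be reached as attributes of the soup object.
title = soup.title.string

# find() returns the first matching tag; get_text() returns its text.
first_heading = soup.find('h1').get_text()

# find_all() returns every matching tag.
links = [anchor.get('href') for anchor in soup.find_all('a')]

print(title)
print(first_heading)
print(links)
```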

# Web scraping
---

## What is web scraping?

Web scraping is a technique used to retrieve data from the web OR from similar networks (intranets, etc).

* Web scrapers simulate the behavior of a browser
* They look at the data from specific site(s)
* They extract specific information you need from it
* Typically this is done over and over again across multiple sites

## Why web scrape?

* Get data from sites that don't provide mechanisms to export the data
* Collect information on sites to build a search engine database
* Monitor sites for changes
* Collect social network data
    * Who is connected to or communicates with whom?
    * What is being said?
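
As one small illustration of the "monitor sites for changes" idea, a scraper could store a fingerprint of each page and compare it on the next visit. This is only a sketch of one possible approach using `hashlib` from the standard library; `fingerprint` and `has_changed` are hypothetical helper names, and the byte strings stand in for page content that would normally come from `urlopen(...).read()`.

```python
import hashlib


def fingerprint(content):
    """Return a short, fixed-size summary of a page's bytes."""
    return hashlib.sha256(content).hexdigest()


def has_changed(previous_fingerprint, content):
    """Report whether the page content differs from the last visit."""
    return fingerprint(content) != previous_fingerprint


# Stand-ins for two visits to the same page.
first_visit = b'<html><body>Version 1</body></html>'
second_visit = b'<html><body>Version 2</body></html>'

saved = fingerprint(first_visit)

print(has_changed(saved, first_visit))
print(has_changed(saved, second_visit))
```

Storing only the fingerprint keeps the record small, at the cost of not knowing what changed.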