In [1]:
import pandas as pd
import numpy as np

# Beautiful Soup Refresher


## First: HTTP Response Codes

- 1XX - Informational
- 2XX - Success
- 3XX - Redirection
- 4XX - Client Error
- 5XX - Server Error

### Response Codes - The Greatest Hits

- **200 - OK** - The requested action was successfully executed
- **301 - Moved Permanently** - The resource has been relocated (and will not be back, so please stop asking me)
- **400 - Bad Request** - The client request is malformed in some way
- **403 - Forbidden** - The requesting client (i.e. you) does not have permission to view the resource
- **404 - Not Found** - The resource can't be found at the moment (may be in the future, so check back later)
- **405 - Method Not Allowed** - For example, GET was used when only POST is allowed
- **418 - I'm a teapot** - For when the server is a teapot
- **420 - NOT an HTTP code** - you're thinking of something else
- **429 - Too Many Requests** - They're on to you, and if you keep it up, they'll block you permanently
- **500 - Internal Server Error** - Something non-specific went wrong on their end
- **502 - Bad Gateway** - The server was waiting on another resource and it ended badly
- **503 - Service Unavailable** - The server is overloaded or down at the moment
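
The first digit of a code gives its class, so the two lists above can be combined into a small lookup. The following is only a sketch: the `describe` helper and the `CLASSES` dict are names invented here for illustration, and the reason phrases come from Python's standard `http.HTTPStatus` enum rather than from any particular server.

```python
from http import HTTPStatus

# Class names keyed by the first digit of the status code
CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe(code):
    """Return '<code> <phrase> (<class>)' for an integer status code."""
    category = CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes such as 420 are not registered HTTP statuses
        phrase = "not a registered HTTP status"
    return f"{code} {phrase} ({category})"

for code in (200, 404, 420, 503):
    print(describe(code))
```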

In [2]:
import requests

In [3]:
r = requests.get('http://news.ycombinator.com')

## We can check the response code

In [4]:
r

<Response [200]>
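
The repr above only shows the code. A `requests.Response` also exposes it as the integer attribute `status_code`, along with `ok` and `raise_for_status()`. Below is a minimal sketch of branching on that attribute; the `summarize` helper is hypothetical, and the `SimpleNamespace` objects are stand-ins so the cell runs without network access.

```python
from types import SimpleNamespace

def summarize(response):
    """Return a short description based on response.status_code."""
    code = response.status_code  # same attribute name as requests.Response
    if 200 <= code < 300:
        return f"{code}: success"
    if code == 429:
        return f"{code}: rate limited, slow down"
    return f"{code}: request failed"

# Stand-in objects; in the notebook you would pass `r` instead
fake_ok = SimpleNamespace(status_code=200)
fake_limited = SimpleNamespace(status_code=429)

print(summarize(fake_ok))
print(summarize(fake_limited))
```

With a real response, `r.raise_for_status()` is a common alternative: it raises an exception for 4XX and 5XX codes and does nothing otherwise.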

## DOM

> The Document Object Model (DOM) is a programming interface for HTML and XML documents. It provides a structured representation of the document and it defines a way that the structure can be accessed from programs so that they can change the document structure, style and content. The DOM provides a representation of the document as a structured group of nodes and objects that have properties and methods. Essentially, it connects web pages to scripts or programming languages.

## Typical Web Page Structure

    <html>
        <head>
        </head>
        <body>
            <div id="header" class="extraFancy">I'm a header!</div>
            <div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
                    <li>I'm list item 2</li>
                </ul>
            </div>
            <div id="footer" class="extraFancy">I'm a footer</div>
        </body>
    </html>

In [5]:
page_html = """
    <html>
        <head>
        <title>Super Cool Website!</title>
        </head>
        <body>
            <div id="header" class="extraFancy">I'm a header!</div>
            <div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
                    <li>I'm list item 2</li>
                </ul>
            </div>
            <div id="footer" class="extraFancy">I'm a footer</div>
        </body>
    </html>
"""

## We're going to feed this full HTML into a library called Beautiful Soup

<img src="http://i.imgur.com/klVeXY7.png" width="800">

## Coding BeautifulSoup

In [6]:
from bs4 import BeautifulSoup

## Pass the HTML into the BS object

In [7]:
soup = BeautifulSoup(page_html, "lxml")

From there, it can be searched and parsed.
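
The second argument names the parser. `"lxml"` is a separate package, so here is a hedged sketch of falling back to Python's built-in `"html.parser"` when it is missing. The `make_soup` helper is a name made up for this example, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html, preferred="lxml"):
    """Parse html with the preferred parser, falling back to the stdlib one."""
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is not installed
        return BeautifulSoup(html, "html.parser")

soup_demo = make_soup("<html><body><p>Hello</p></body></html>")
print(soup_demo.p.text)
```

The two parsers can repair broken HTML differently, so it may be worth sticking to one per project.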

## Print the html

In [8]:
print(soup.prettify())

<html>
 <head>
  <title>
   Super Cool Website!
  </title>
 </head>
 <body>
  <div class="extraFancy" id="header">
   I'm a header!
  </div>
  <div id="main">
   I'm a div!
   <ul>
    I'm an unordered list!
    <li>
     I'm list item 1
    </li>
    <li>
     I'm list item 2
    </li>
   </ul>
  </div>
  <div class="extraFancy" id="footer">
   I'm a footer
  </div>
 </body>
</html>



## Let's now do some parsing of the HTML using the DOM

## Get the title

In [9]:
soup.title

<title>Super Cool Website!</title>

In [10]:
soup.title.text

'Super Cool Website!'

## Find - get the first result

In [11]:
soup.find('div')

<div class="extraFancy" id="header">I'm a header!</div>
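
One thing to watch for: when nothing matches, `find` returns `None` rather than raising. The sketch below uses a one-line snippet and `"html.parser"` (chosen here only to avoid the extra lxml dependency) to show a simple guard.

```python
from bs4 import BeautifulSoup

html = '<div id="header">I\'m a header!</div>'
demo = BeautifulSoup(html, "html.parser")

table = demo.find("table")   # there is no <table> in this snippet
print(table)                 # find() gives None when nothing matches

# Guard before touching attributes; None.text would raise AttributeError
header = demo.find("div")
header_text = header.text if header is not None else ""
table_text = table.text if table is not None else ""
print(repr(header_text), repr(table_text))
```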

## FindAll - get all matching results

In [12]:
# find_all is the current name; findAll is a legacy alias kept for compatibility
for i, d in enumerate(soup.find_all('div')):
    print(i, d)
    print('\n')

0 <div class="extraFancy" id="header">I'm a header!</div>


1 <div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
<li>I'm list item 2</li>
</ul>
</div>


2 <div class="extraFancy" id="footer">I'm a footer</div>




## Get the page's text

In [13]:
print(soup.text)



Super Cool Website!


I'm a header!

                I'm a div!
                
                    I'm an unordered list!
                    I'm list item 1
I'm list item 2


I'm a footer





## Get the class of an element

In [14]:
# find returns the first result
soup.find('div')['class']

['extraFancy']
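
Attributes behave like dictionary entries on a tag. Here is a small sketch on a trimmed copy of the page, parsed with `"html.parser"` to keep it dependency-free, showing why `class` comes back as a list and how `.get()` avoids a `KeyError` when an attribute is absent.

```python
from bs4 import BeautifulSoup

html = """
<div id="header" class="extraFancy">I'm a header!</div>
<div id="main">I'm a div!</div>
"""
demo = BeautifulSoup(html, "html.parser")
header, main = demo.find_all("div")

print(header["class"])     # class is multi-valued, so bs4 returns a list
print(header["id"])        # id is single-valued, so this is a plain string
print(header.attrs)        # every attribute as a dict
print(main.get("class"))   # .get() returns None instead of raising KeyError
```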

## Search by the id of an element

In [15]:
print(soup.find(id='main'))

<div id="main">
                I'm a div!
                <ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
<li>I'm list item 2</li>
</ul>
</div>


## Search by the class

In [16]:
# note the trailing underscore: class is a reserved word in Python
print(soup.find_all(class_='extraFancy'))

[<div class="extraFancy" id="header">I'm a header!</div>, <div class="extraFancy" id="footer">I'm a footer</div>]
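
The same lookups can also be written as CSS selectors with `select` and `select_one`. This is a sketch on a shortened copy of the page, parsed with `"html.parser"`; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div id="header" class="extraFancy">I'm a header!</div>
<div id="main"><ul><li>I'm list item 1</li><li>I'm list item 2</li></ul></div>
<div id="footer" class="extraFancy">I'm a footer</div>
"""
demo = BeautifulSoup(html, "html.parser")

# tag.class matches by class, #id matches by id, a space means "descendant of"
fancy_ids = [div["id"] for div in demo.select("div.extraFancy")]
items = [li.text for li in demo.select("#main li")]
first_fancy = demo.select_one(".extraFancy")  # first match or None

print(fancy_ids)
print(items)
```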


## Get the children of an element

In [17]:
my_ul = soup.find('ul')

In [18]:
print(my_ul)

<ul>
                    I'm an unordered list!
                    <li>I'm list item 1</li>
<li>I'm list item 2</li>
</ul>


In [19]:
# findChildren is a legacy alias of find_all, so this returns every descendant tag
my_ul.find_all()

[<li>I'm list item 1</li>, <li>I'm list item 2</li>]
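
For the `<ul>` the children and the descendants happen to be the same, but one level up they differ. Below is a minimal sketch, again using `"html.parser"` and a trimmed snippet, of limiting a search to direct children with `recursive=False`.

```python
from bs4 import BeautifulSoup

html = '<div id="main"><ul><li>I\'m list item 1</li><li>I\'m list item 2</li></ul></div>'
demo = BeautifulSoup(html, "html.parser")
main = demo.find(id="main")

all_li = main.find_all("li")                      # searches every descendant
direct_li = main.find_all("li", recursive=False)  # direct children only
direct_tags = [child.name for child in main.find_all(recursive=False)]

print(len(all_li), len(direct_li), direct_tags)
```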

## Exercise

Using Requests and BeautifulSoup, pull down hacker news and print out the headlines and the story links in your notebook

In [20]:
hn = requests.get('http://news.ycombinator.com')

In [21]:
# pass the content into BS
hn_soup = BeautifulSoup(hn.content, "lxml")
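
The extraction step can be tried offline first. The sketch below runs against a hand-written stand-in for the page: the titles and URLs in `sample_html` are made up, `extract_stories` is a hypothetical helper, and the `storylink` class is simply the one this notebook uses, so the live site's markup may differ.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://news.ycombinator.com/"

# Invented sample markup, not real Hacker News output
sample_html = """
<table>
  <tr><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
  <tr><td><a class="storylink" href="item?id=1">Ask HN: Story B</a></td></tr>
  <tr><td><a href="https://example.com/login">login</a></td></tr>
</table>
"""

def extract_stories(soup, base_url=BASE_URL):
    """Return (title, absolute_url) pairs for every storylink anchor."""
    stories = []
    for link in soup.find_all("a", class_="storylink"):
        href = link.get("href")
        if href is None:
            continue  # skip anchors that carry no link
        # urljoin turns relative links like "item?id=1" into absolute URLs
        stories.append((link.get_text(strip=True), urljoin(base_url, href)))
    return stories

stories = extract_stories(BeautifulSoup(sample_html, "html.parser"))
for title, url in stories:
    print(title)
    print(url)
```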

In [22]:
for link in hn_soup.find_all('a', class_='storylink'):
    print(link.text)
    print(link['href'])
    print('\n')

Big Companies and the Military Are Paying Novelists to Write Sci-Fi for Them
http://www.newyorker.com/tech/elements/better-business-through-sci-fi


Linux Load Averages: Solving the Mystery
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html


https://www.nytimes.com/2017/08/07/business/dealbook/initial-coin-offerings-sec-virtual-currency.html


Cruise is running an autonomous ride hailing service for employees in SF
https://techcrunch.com/2017/08/08/cruise-is-running-an-autonomous-ride-hailing-service-for-employees-in-sf/


SKYACTIV-X: first commercial gasoline engine to use compression ignition
http://www2.mazda.com/en/publicity/release/2017/201708/170808a.html


Mongoose OS – An Open Source Operating System for the Internet of Things
https://mongoose-os.com/


Show HN: An interactive guide to compression basics
http://unwttng.com/compression-decompressed


Instagram photos reveal predictive markers of depression
https://epjdatascience.springeropen.com/articles/10.11