# Import needed libraries

In [1]:
import requests
from lxml import html

We used the "requests" library last time to get Twitter data through its RESTful API.  Here we introduce the "lxml" library for parsing HTML and extracting elements and attributes.

# Use Requests to get HackerNews content

HackerNews is a community-contributed news website with an emphasis on technology-related content.  Let's grab the set of articles at the top of the HN list.

In [57]:
response = requests.get('http://news.ycombinator.com/')
response

<Response [200]>
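The request above succeeded with status 200, but a scraper usually should not assume that. As a minimal sketch, the fetch can be wrapped in a small helper that sets a timeout and raises on HTTP errors. The helper name `fetch_html`, the constant `HN_URL`, and the 10-second timeout are assumptions made for illustration; `requests.get(..., timeout=...)` and `Response.raise_for_status()` are part of the requests library.

```python
import requests

# Assumed for illustration: same site as above, fetched over HTTPS.
HN_URL = "https://news.ycombinator.com/"


def fetch_html(url, timeout=10.0):
    """Return the raw response body, raising if the request fails."""
    # Without a timeout, requests may wait indefinitely for a slow server.
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into an exception instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(HN_URL)` would then return the same bytes as `response.content` in the cells that follow.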

In [78]:
response.content

'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?vvS6khQlZQ8ssGkyEBXp">\n        <link rel="shortcut icon" href="favicon.ico">\n          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">\n        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">\n        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="http://www.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>\n                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>\n              <a href="newest">new</a> | <a href="newcomments">comments</a> | <a 

We will now use lxml to get programmatic access to the content from HackerNews.

# Analyzing HTML Content

In [79]:
page = html.fromstring(response.content)
page

<Element html at 0x10482b578>
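To experiment without hitting the network, the same call can be tried on a small inline document. The sketch below uses a hypothetical, heavily trimmed stand-in for the HackerNews markup (the `SAMPLE` string is invented for illustration) and shows that `html.fromstring` returns the root element, from which other elements can be reached.

```python
from lxml import html

# Hypothetical, heavily trimmed stand-in for the real HackerNews page.
SAMPLE = b"""<html><head><title>Hacker News</title></head>
<body><table id="hnmain">
<tr><td class="title"><a href="item?id=1">Example post</a></td></tr>
</table></body></html>"""

page = html.fromstring(SAMPLE)

root_tag = page.tag                              # tag name of the root element
page_title = page.findtext(".//title")           # text of the first <title> element
main_table = page.get_element_by_id("hnmain")    # look up an element by its id attribute

print(root_tag, page_title, main_table.tag)
```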

## CSS Selectors

Those of you who are web designers are likely very familiar with Cascading Style Sheets (CSS).  Here is an example of how to use a CSS selector to find specific HTML elements.

In [80]:
posts = page.cssselect('.title')

In [81]:
len(posts)

61

Details on how to use CSS selectors can be found on the W3Schools site:

http://www.w3schools.com/cssref/css_selectors.asp

## XPath

Alternatively, we can use a standard called "XPath" to find specific content in the HTML.

In [61]:
posts = page.xpath('//td[contains(@class, "title")]')

In [62]:
len(posts)

61
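Both approaches return 61 elements here, but the two expressions are not strictly equivalent. `contains(@class, "title")` is a substring test, so it would also match a class such as "subtitle", whereas the CSS selector `.title` matches only the whole class token. A minimal sketch of the difference, using an invented `SAMPLE` snippet rather than the live page:

```python
from lxml import html

# Invented markup: one row uses a class that merely contains "title".
SAMPLE = """<html><body><table>
<tr><td class="title">1.</td><td class="title"><a href="a.html">A</a></td></tr>
<tr><td class="subtitle">not a post</td></tr>
<tr><td class="title votelinks">x</td></tr>
</table></body></html>"""

page = html.fromstring(SAMPLE)

# Substring test: also matches class="subtitle".
loose = page.xpath('//td[contains(@class, "title")]')

# Whole-token test: pad the class list with spaces and look for " title ".
exact = page.xpath(
    '//td[contains(concat(" ", normalize-space(@class), " "), " title ")]'
)

print(len(loose), len(exact))
```

On this sample the substring version finds one more element than the whole-token version, because of the "subtitle" cell.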

We are only interested in those "td" tags that contain an anchor link to the referenced article.

In [84]:
posts = page.xpath('//td[contains(@class, "title")]/a')

In [85]:
len(posts)

31

So, only about half of those "td" tags with "title" (31 of 61) contain the anchor links that we are interested in.  Let's take a look at the first such post.

In [86]:
first_post = posts[0]
first_post.text

'Create React Apps with No Configuration'

There is a lot of "content" in the anchor tag's attributes.

In [88]:
first_post.attrib

{'href': 'https://facebook.github.io/react/blog/2016/07/22/create-apps-with-no-configuration.html', 'class': 'storylink'}

In [89]:
first_post.attrib["href"]

'https://facebook.github.io/react/blog/2016/07/22/create-apps-with-no-configuration.html'
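XPath can also return attribute values and text nodes directly, which avoids going through `.attrib` on each element. The sketch below runs on an invented `SAMPLE` snippet, not the live page. It also shows `.get()`, which returns `None` for a missing attribute where `.attrib[...]` would raise a `KeyError`.

```python
from lxml import html

# Invented markup that mimics the structure of two story links.
SAMPLE = """<html><body><table>
<tr><td class="title"><a href="https://example.com/a" class="storylink">First post</a></td></tr>
<tr><td class="title"><a href="item?id=2" class="storylink">Second post</a></td></tr>
</table></body></html>"""

page = html.fromstring(SAMPLE)

# "/@href" selects the attribute value; "/text()" selects the text node.
hrefs = page.xpath('//td[contains(@class, "title")]/a/@href')
titles = page.xpath('//td[contains(@class, "title")]/a/text()')

first = page.xpath('//td[contains(@class, "title")]/a')[0]
missing = first.get("rel")  # no such attribute on this element, so None

print(hrefs, titles, missing)
```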

In [90]:
all_links = []
for p in posts:
    all_links.append((p.text, p.attrib["href"]))

In [91]:
all_links

[('Create React Apps with No Configuration',
  'https://facebook.github.io/react/blog/2016/07/22/create-apps-with-no-configuration.html'),
 (u'Apple says Pok\xc3\xa9mon Go is the most downloaded app in its first week ever',
  'https://techcrunch.com/2016/07/22/apple-says-pokemon-go-is-the-most-downloaded-app-in-its-first-week-ever/'),
 ('Verizon nears deal to acquire Yahoo',
  'http://www.bloomberg.com/news/articles/2016-07-22/verizon-said-nearing-deal-to-buy-yahoo-beating-rival-bidders'),
 ('Opus Interactive Audio Codec v1.1.3 released', 'http://opus-codec.com/'),
 (u'David Chang\xe2\x80\x99s Unified Theory of Deliciousness',
  'http://www.wired.com/2016/07/chef-david-chang-on-deliciousness/'),
 ('Boost Your Data Munging with R',
  'http://jangorecki.github.io/blog/2016-06-30/Boost-Your-Data-Munging-with-R.html'),
 ('A Compiler for 3D Machine Knitting',
  'https://www.disneyresearch.com/publication/machine-knitting-compiler/'),
 ('Kubernetes at Box: Microservices at Maximum Velocity',

Great!  Because HackerNews updates continuously, rerunning the code above (starting from the HTTP request) will produce a different list of top content from time to time.
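The loop that builds `all_links` can also be written as a list comprehension. The sketch below does that on an invented `SAMPLE` snippet and, assuming that some links on the page are relative (for example links to HackerNews's own item pages), resolves each href against a base URL with `urljoin`. It is written for Python 3, and `BASE_URL` and `SAMPLE` are assumptions made for illustration.

```python
from urllib.parse import urljoin

from lxml import html

# Assumed base URL used to resolve relative links.
BASE_URL = "https://news.ycombinator.com/"

# Invented markup: one absolute link, one relative link, one anchor with no href.
SAMPLE = """<html><body><table>
<tr><td class="title"><a href="https://example.com/a">First post</a></td></tr>
<tr><td class="title"><a href="item?id=2">Second post</a></td></tr>
<tr><td class="title"><a>No link here</a></td></tr>
</table></body></html>"""

page = html.fromstring(SAMPLE)
posts = page.xpath('//td[contains(@class, "title")]/a')

# Keep only anchors that have an href, and make every URL absolute.
all_links = [
    (p.text_content(), urljoin(BASE_URL, p.get("href")))
    for p in posts
    if p.get("href")
]

for title, url in all_links:
    print(title, url)
```

`urljoin` leaves absolute URLs untouched and only rewrites the relative ones.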

More details on how to use XPath can be found on the W3Schools site:

http://www.w3schools.com/xsl/xpath_syntax.asp
