# Using lxml to Parse HTML 

The `lxml` library, a Python binding for libxml2 and libxslt, provides sophisticated, low-level functionality for traversing XML and HTML documents. This example uses its specialized `html` submodule.

We download some data from Wikipedia:

In [1]:
import requests
url = "https://en.wikipedia.org/wiki/United_States_presidential_election_in_Virginia,_2004"
resp = requests.get(url)
resp

<Response [200]>

## Construct the DOM

Parse the response content into a document object model (DOM):

In [2]:
from lxml import html
dom = html.document_fromstring(resp.content)

In [3]:
dom

<Element html at 0x1f8a7954548>
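As a small offline illustration of what the parser returns, the sketch below feeds an inline snippet (made up for this example, not taken from the Wikipedia page) to `html.document_fromstring` and to the related `html.fromstring`. The first always hands back the root `html` element; the second returns the most specific element it can.

```python
from lxml import html

# Hypothetical input, chosen only to show the difference between the parsers.
snippet = b"<p>Hello <b>world</b></p>"

# document_fromstring always returns the root <html> element,
# adding <html> and <body> wrappers when the input lacks them.
doc = html.document_fromstring(snippet)
print(doc.tag)              # html

# fromstring returns the single top-level element itself when there is one.
frag = html.fromstring(snippet)
print(frag.tag)             # p
print(frag.text_content())  # Hello world
```

For a full page such as the one downloaded above, `document_fromstring` is the natural choice because the input already is a complete document.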

## Traversing the DOM

In [4]:
list(dom)  # iterating an element yields its direct children; getchildren() is deprecated

[<Element head at 0x1f8a8a5eb38>, <Element body at 0x1f8a8a5ecc8>]
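The same kind of traversal can be tried on a tiny inline document. This is a minimal sketch; the markup and the `id` values are invented for illustration.

```python
from lxml import html

# Hypothetical document used only for this sketch.
doc = html.document_fromstring(
    "<html><head><title>t</title></head>"
    "<body><div id='a'><p>one</p></div><div id='b'></div></body></html>"
)

# Iterating an element yields its direct children, in document order.
child_tags = [child.tag for child in doc]
print(child_tags)   # ['head', 'body']

# iter() walks the whole subtree (the element itself included);
# passing a tag name filters the walk.
div_ids = [div.get("id") for div in doc.iter("div")]
print(div_ids)      # ['a', 'b']
```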

## Jumping directly to an Element

In [5]:
body = dom.find("body")

In [6]:
list(body)  # direct children of <body>; getchildren() is deprecated

[<Element div at 0x1f8a8a5ed68>,
 <Element div at 0x1f8a8a5ea48>,
 <Element div at 0x1f8a8a6e048>,
 <Element div at 0x1f8a8a6e098>,
 <Element div at 0x1f8a8a6e0e8>,
 <Element div at 0x1f8a8a6e138>,
 <Element script at 0x1f8a8a6e188>,
 <Element script at 0x1f8a8a6e1d8>,
 <Element script at 0x1f8a8a6e228>]
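`find` returns only the first match and, with a bare tag name, looks only at direct children. The sketch below, again on invented markup, shows `find`, `findall`, a descendant search with `.//`, and `getparent` for stepping back up.

```python
from lxml import html

# Hypothetical document used only for this sketch.
doc = html.document_fromstring(
    "<html><head></head><body>"
    "<div><p>one</p></div><div><p>two</p></div><script></script>"
    "</body></html>"
)

body = doc.find("body")       # first direct child named body
divs = body.findall("div")    # all direct children named div
first_p = doc.find(".//p")    # first <p> anywhere below doc

print(len(divs))                # 2
print(first_p.text)             # one
print(first_p.getparent().tag)  # div
```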

# Using XPath to query HTML 

The following XPath query starts anywhere in the tree (`//`) at a `table` element, descends through `/tbody`, a row `/tr`, and a data cell `/td` to a link `a` whose `@title` attribute equals `"Accomack County, Virginia"`. It then climbs back up with four `..` steps, from the link to its parent `td`, then `tr`, then `tbody`, then `table`, and returns that table.

In [7]:
tables = dom.xpath('//table/tbody/tr/td/a[@title="Accomack County, Virginia"]/../../../..')

Inspecting the result and printing the returned table's markup:

In [8]:
tables

[<Element table at 0x1f8a8a6e278>]

print(html.tostring(tables[0], pretty_print=True).decode('UTF8'))
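To see the parent steps in isolation, here is a sketch on a small hand-written document (the two tables and their `id` values are invented). It runs the same style of query and, as one possible alternative, uses the `ancestor` axis to name the target element directly instead of counting `..` steps. Note that lxml does not add `tbody` elements the way browsers do, so the `/tbody` step only matches when the markup itself contains one.

```python
from lxml import html

# Hypothetical document; only the second table contains the link.
doc = html.document_fromstring("""
<html><body>
<table id="other"><tbody><tr><td>no link here</td></tr></tbody></table>
<table id="results"><tbody>
  <tr><td><a title="Accomack County, Virginia" href="#">Accomack</a></td><td>41.3%</td></tr>
</tbody></table>
</body></html>
""")

# Four parent steps: a -> td -> tr -> tbody -> table.
by_parents = doc.xpath(
    '//table/tbody/tr/td/a[@title="Accomack County, Virginia"]/../../../..'
)

# ancestor::table[1] selects the nearest enclosing table, so the query
# keeps working if the nesting depth between link and table changes.
by_ancestor = doc.xpath(
    '//a[@title="Accomack County, Virginia"]/ancestor::table[1]'
)

print([t.get("id") for t in by_parents])   # ['results']
print([t.get("id") for t in by_ancestor])  # ['results']
```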

Building a DataFrame from the table:

In [9]:
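from io import StringIO
import pandas as pd
# Serialize the table to text and wrap it in StringIO; recent pandas versions deprecate passing literal HTML to read_html.
df = pd.read_html(StringIO(html.tostring(tables[0], encoding="unicode")))[0]
df.head()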

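|   | County or City | Kerry % | Kerry # | Bush % | Bush # | Other % | Other # |
|---|----------------|---------|---------|--------|--------|---------|---------|
| 0 | Accomack       | 41.3%   | 5518    | 57.8%  | 7726   | 0.8%    | 112     |
| 1 | Albemarle      | 50.5%   | 22088   | 48.5%  | 21189  | 1.0%    | 449     |
| 2 | Alleghany      | 44.5%   | 3203    | 55.1%  | 3962   | 0.4%    | 30      |
| 3 | Amelia         | 34.5%   | 1862    | 64.8%  | 3499   | 0.7%    | 36      |
| 4 | Amherst        | 38.3%   | 4866    | 61.1%  | 7758   | 0.6%    | 71      |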
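The percentage columns arrive as strings such as `41.3%`. If numeric values are needed, one possible follow-up is to strip the sign and cast to float. This is a minimal sketch that rebuilds two rows from the output above by hand rather than re-running the download; the variable names `sample` and `pct_cols` are chosen here for illustration.

```python
import pandas as pd

# Two rows copied from the output above, rebuilt by hand for this sketch.
sample = pd.DataFrame({
    "County or City": ["Accomack", "Albemarle"],
    "Kerry %": ["41.3%", "50.5%"],
    "Bush %": ["57.8%", "48.5%"],
})

pct_cols = ["Kerry %", "Bush %"]
for col in pct_cols:
    # Drop the trailing "%" and parse the remainder as a float.
    sample[col] = sample[col].str.rstrip("%").astype(float)

print(sample["Kerry %"].tolist())   # [41.3, 50.5]
```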
