# Scraping HTML with lxml

When you're looking at a web page there can be a big difference between what a web browser displays to you and what was actually sent to your web browser. For example:


<br>
<br>
<div class="asdf">
    <div class="qwerty">
        <p class="para">What you see:</p>
    </div>
    <div class="qwerty">
        <p class="para">
            Some datum:
            <span id="data_id" class="data-class" style="font-weight: bold">42</span>
        </p>
    </div>
</div>

What your computer sees:

```
<div class="asdf">
    <div class="qwerty">
        <p class="para">What you see:</p>
    </div>
    <div class="qwerty">
        <p class="para">
            Some datum:
            <span id="data_id" class="data-class" style="font-weight: bold">42</span>
        </p>
    </div>
</div>
```

lxml docs: http://lxml.de/lxmlhtml.html

In [1]:
from lxml import html
import requests

### Some different ways of accessing elements

In [2]:
fragment = """\
<div class="asdf">
    <div class="qwerty">
        <p class="para">What you see:</p>
    </div>
    <div class="qwerty">
        <p class="para">
            Some datum:
            <span id="data_id" class="data-class" style="font-weight: bold">42</span>
        </p>
    </div>
</div>
"""

In [3]:
doc = html.fragment_fromstring(fragment)
doc

<Element div at 0x105930d68>

In [4]:
print(html.tostring(doc, encoding='unicode'))

<div class="asdf">
    <div class="qwerty">
        <p class="para">What you see:</p>
    </div>
    <div class="qwerty">
        <p class="para">
            Some datum:
            <span id="data_id" class="data-class" style="font-weight: bold">42</span>
        </p>
    </div>
</div>


Each of the following methods returns the same HTML element from the above fragment, but `get_element_by_id` only works if the element you're interested in has an `id` attribute.

In [5]:
doc.get_element_by_id('data_id')

<Element span at 0x105930908>

In [6]:
doc.xpath('div[2]/p/span')

[<Element span at 0x105930908>]

In [7]:
doc.cssselect(':nth-child(2) > p > span')

[<Element span at 0x105930908>]
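
The return types above differ, which matters when the element you want is not there. Here is a minimal sketch, using a trimmed copy of the fragment and a made-up id (`no_such_id`) for the missing case. `get_element_by_id` returns a single element and raises `KeyError` on a miss unless you pass a default, while `xpath` always returns a list.

```python
from lxml import html

fragment = """\
<div class="asdf">
    <div class="qwerty">
        <p class="para">
            Some datum:
            <span id="data_id" class="data-class">42</span>
        </p>
    </div>
</div>
"""
doc = html.fragment_fromstring(fragment)

# get_element_by_id returns one element. Without a second argument it
# raises KeyError when the id is missing; with one, it returns that default.
span = doc.get_element_by_id('data_id')
missing = doc.get_element_by_id('no_such_id', None)  # hypothetical id

# xpath returns a list, which is simply empty when nothing matches.
matches = doc.xpath('.//span[@id="data_id"]')
no_matches = doc.xpath('.//span[@id="no_such_id"]')

print(span.text, len(matches))
print(missing, no_matches)
```

Checking for an empty list (or a `None` default) before indexing avoids an `IndexError` or `KeyError` partway through a scrape.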

Get the actual text content of elements using the `.text` attribute or the `.text_content()` method. `.text_content()` can be especially useful when you know you want the textual content of an element that contains sub-elements.

In [8]:
doc.get_element_by_id('data_id').text

'42'

In [9]:
doc.xpath('div[2]/p/span')[0].text

'42'

In [10]:
doc.text_content()

'\n    \n        What you see:\n    \n    \n        \n            Some datum:\n            42\n        \n    \n'
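
To make the difference between `.text` and `.text_content()` concrete, here is a small sketch on a one-line paragraph (the markup is a simplified stand-in for the fragment above, with the trailing word "units" added for illustration). It also shows one common way to collapse the whitespace that `.text_content()` preserves.

```python
from lxml import html

p = html.fragment_fromstring(
    '<p class="para">Some datum: <span id="data_id">42</span> units</p>'
)

# .text holds only the text that comes before the first child element.
before_child = p.text

# Text that follows a child element is stored on that child's .tail.
after_child = p[0].tail

# .text_content() joins the text of the element and all its descendants.
everything = p.text_content()

# Splitting on whitespace and re-joining collapses newlines and indentation.
tidy = ' '.join('\n   Some datum:\n      42\n'.split())

print(repr(before_child), repr(after_child), repr(everything), repr(tidy))
```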

### Getting data from a web page

Using `xpath` can be a quick way to get a specific element from a large document, if you know the `xpath` to use for your element.
In Chrome you can right-click on an element, click Inspect, then right-click again on the HTML element in the inspector and click Copy > Copy XPath in the resulting menu to get an `xpath` copied onto your clipboard. Here's an example:

```
//*[@id="inner-content"]/div[3]/div[1]/div[1]/div/city-current-conditions/div/div[2]/div/div/div[2]/display-unit/span/span[1]
```

When trying to parse data from a web page, use the [requests](http://docs.python-requests.org/en/master/) library to load the page, then pass the resulting string to lxml.

In [11]:
resp = requests.get('https://www.wunderground.com/weather/us/ca/san-francisco/94102')
resp.raise_for_status()

weather_doc = html.document_fromstring(resp.text)

In [12]:
weather_doc

<Element html at 0x10597b638>

In [13]:
elem = weather_doc.xpath('//*[@id="inner-content"]/div[3]/div[1]/div[1]/div/city-current-conditions/div/div[2]/div/div/div[2]/display-unit/span/span[1]')

In [14]:
elem[0].text_content()

'58'
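
An XPath copied from the browser is tied to the page's exact layout. If the site changes, `xpath` returns an empty list and `elem[0]` raises `IndexError`. One way to guard against that is a small helper like the sketch below. The helper name `first_text`, the stand-in page, and its `temp` class are assumptions made for illustration, not the real Weather Underground markup.

```python
from lxml import html


def first_text(doc, xpath, default=None):
    """Return the stripped text of the first element matched, or `default`.

    Assumes the XPath selects elements (not attributes or strings).
    """
    matches = doc.xpath(xpath)
    if not matches:
        return default
    return matches[0].text_content().strip()


# Hypothetical stand-in for a downloaded page.
page = html.document_fromstring("""
<html><body>
  <div id="inner-content">
    <div class="conditions"><span class="temp"> 58 </span></div>
  </div>
</body></html>
""")

# A shorter XPath anchored on an id and a class depends on less of the layout
# than a long chain of positional div[3]/div[1]/... steps.
temp = first_text(page, '//*[@id="inner-content"]//span[@class="temp"]')
missing = first_text(page, '//*[@id="no-such-id"]/span')

print(temp, missing)
```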

### Getting multiple elements

If you aren't looking for one specific element, but rather a specific _kind of element_, lxml has tools for that as well. For example, you can iterate through all of the `<a>` elements on a page.

In [15]:
resp = requests.get('https://en.wikipedia.org/wiki/PyLadies')
resp.raise_for_status()

wiki_doc = html.document_fromstring(resp.text)

In [16]:
for elem in wiki_doc.iterdescendants('a'):
    print(elem.get('href'))

None
#mw-head
#p-search
/wiki/File:Question_book-new.svg
/wiki/Wikipedia:Verifiability
/wiki/Wikipedia:No_original_research#Primary.2C_secondary_and_tertiary_sources
/wiki/Wikipedia:No_original_research#Primary.2C_secondary_and_tertiary_sources
/wiki/Help:Maintenance_template_removal
/wiki/Mentorship
/wiki/Python_(programming_language)
/wiki/Open-source_community
#cite_note-1
#cite_note-AboutPyLadies-2
#cite_note-GitHub-3
/wiki/Outreach
/wiki/Los_Angeles
#cite_note-4
/wiki/Washington,_D.C.
#cite_note-5
#cite_note-6
#Events
#Activities
#References
#External_links
/w/index.php?title=PyLadies&action=edit&section=1
#cite_note-7
#cite_note-Geek_Chicks-8
#cite_note-PyLadies-9
/w/index.php?title=PyLadies&action=edit&section=2
/w/index.php?title=PyLadies&action=edit&section=3
#cite_ref-1
http://www.themarysue.com/pyladies/
#cite_ref-AboutPyLadies_2-0
http://www.pyladies.com/about/
#cite_ref-GitHub_3-0
https://github.com/pyladies/pyladies#readme
#cite_ref-4
http://heatherpayne.ca/review-of-pyla

You could also combine `xpath` and `iterdescendants` to scrape `<a>` tags from a specific part of a web page.

In [18]:
for elem in wiki_doc.xpath('//*[@id="mw-content-text"]/div')[0].iterdescendants('a'):
    print(elem.get('href'))

/wiki/File:Question_book-new.svg
/wiki/Wikipedia:Verifiability
/wiki/Wikipedia:No_original_research#Primary.2C_secondary_and_tertiary_sources
/wiki/Wikipedia:No_original_research#Primary.2C_secondary_and_tertiary_sources
/wiki/Help:Maintenance_template_removal
/wiki/Mentorship
/wiki/Python_(programming_language)
/wiki/Open-source_community
#cite_note-1
#cite_note-AboutPyLadies-2
#cite_note-GitHub-3
/wiki/Outreach
/wiki/Los_Angeles
#cite_note-4
/wiki/Washington,_D.C.
#cite_note-5
#cite_note-6
#Events
#Activities
#References
#External_links
/w/index.php?title=PyLadies&action=edit&section=1
#cite_note-7
#cite_note-Geek_Chicks-8
#cite_note-PyLadies-9
/w/index.php?title=PyLadies&action=edit&section=2
/w/index.php?title=PyLadies&action=edit&section=3
#cite_ref-1
http://www.themarysue.com/pyladies/
#cite_ref-AboutPyLadies_2-0
http://www.pyladies.com/about/
#cite_ref-GitHub_3-0
https://github.com/pyladies/pyladies#readme
#cite_ref-4
http://heatherpayne.ca/review-of-pyladies-intro-to-python-wor
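
The output above mixes `None` values, same-page `#...` anchors, and site-relative paths. If what you want is a list of URLs you can actually follow, a sketch along these lines may help. It uses a small hand-written page as a stand-in for the Wikipedia article and `urllib.parse.urljoin` from the standard library to resolve relative links.

```python
from urllib.parse import urljoin

from lxml import html

base_url = 'https://en.wikipedia.org/wiki/PyLadies'

# Hypothetical stand-in for the downloaded page.
page = html.document_fromstring("""
<html><body>
  <a name="top">anchor without href</a>
  <a href="#Events">Events</a>
  <a href="/wiki/Mentorship">Mentorship</a>
  <a href="http://www.pyladies.com/about/">About</a>
</body></html>
""")

# '//a/@href' selects the href attribute values directly, so <a> elements
# without an href (the ones that print as None) are skipped.
hrefs = page.xpath('//a/@href')

# Drop same-page fragment links and resolve the rest against the page URL.
links = [urljoin(base_url, href) for href in hrefs if not href.startswith('#')]

print(links)
```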

### Tables

Scraping data from tables into data structures like Pandas DataFrames can be its own special challenge, but if you're lucky Pandas' [read_html](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html#pandas.read_html) function might do all the work for you.

#### Pandas

In [19]:
import pandas as pd

In [20]:
tables = pd.read_html('https://jiffyclub.github.io/pyladies-2017-scraping/tables/example.html')
tables[0]

|   | Col 1 | Col 2 | Col 3 |
|---|-------|-------|-------|
| 0 | A     | 1     | 1.1   |
| 1 | B     | 2     | 2.2   |
| 2 | C     | 3     | 3.3   |
| 3 | D     | 4     | 4.4   |
| 4 | E     | 5     | 5.5   |


In [21]:
tables[0].dtypes

Col 1     object
Col 2      int64
Col 3    float64
dtype: object
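
`read_html` returns a list with one DataFrame per table it finds, so on a page with several tables you may want to narrow the search. The sketch below assumes a hand-written two-table page (the second table and its id are made up) and uses the `attrs` parameter to pick a table by its `id`. The HTML string is wrapped in `StringIO` because recent pandas versions expect a URL or file-like object rather than a literal string.

```python
from io import StringIO

import pandas as pd

# Hypothetical page containing two tables.
page = """
<table id="first-table">
  <thead><tr><th>Col 1</th><th>Col 2</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>1</td></tr>
    <tr><td>B</td><td>2</td></tr>
  </tbody>
</table>
<table id="second-table">
  <thead><tr><th>Other</th></tr></thead>
  <tbody><tr><td>x</td></tr></tbody>
</table>
"""

# attrs restricts the result to tables whose HTML attributes match.
tables = pd.read_html(StringIO(page), attrs={'id': 'first-table'})

print(len(tables))
print(tables[0])
```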

#### lxml

If `pandas.read_html` doesn't work, you can always parse the data manually using lxml.
First find the table element, then start working through its rows as appropriate for that table's structure.

In [22]:
resp = requests.get('https://jiffyclub.github.io/pyladies-2017-scraping/tables/example.html')
resp.raise_for_status()

table_doc = html.document_fromstring(resp.text)

In [23]:
table_elem = table_doc.get_element_by_id('first-table')

In [24]:
heading = table_elem.xpath('thead/tr')[0]
heading

<Element tr at 0x106b64228>

In [25]:
col_names = [elem.text for elem in heading.iterchildren('td')]
col_names

['Col 1', 'Col 2', 'Col 3']

In [26]:
tbody = table_elem.xpath('tbody')[0]
tbody

<Element tbody at 0x10996a8b8>

In [27]:
data_rows = []
for row in tbody.iterchildren('tr'):
    data_rows.append([elem.text.strip() for elem in row.iterchildren('td')])
data_rows

[['A', '1', '1.1'],
 ['B', '2', '2.2'],
 ['C', '3', '3.3'],
 ['D', '4', '4.4'],
 ['E', '5', '5.5']]

In [28]:
df = pd.DataFrame(data=data_rows, columns=col_names)
df

|   | Col 1 | Col 2 | Col 3 |
|---|-------|-------|-------|
| 0 | A     | 1     | 1.1   |
| 1 | B     | 2     | 2.2   |
| 2 | C     | 3     | 3.3   |
| 3 | D     | 4     | 4.4   |
| 4 | E     | 5     | 5.5   |
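
One difference from the `read_html` result is easy to miss: every value parsed by hand is a string, so the numeric columns are not numeric yet. A minimal sketch of converting them with `pd.to_numeric`, using a shortened copy of the rows above:

```python
import pandas as pd

col_names = ['Col 1', 'Col 2', 'Col 3']
data_rows = [['A', '1', '1.1'], ['B', '2', '2.2'], ['C', '3', '3.3']]
df = pd.DataFrame(data=data_rows, columns=col_names)

# Convert the string columns that hold numbers. pd.to_numeric picks an
# integer or float dtype based on the values it finds.
df['Col 2'] = pd.to_numeric(df['Col 2'])
df['Col 3'] = pd.to_numeric(df['Col 3'])

print(df.dtypes)
```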
