In [2]:
from lxml import etree

## Element tree object

In [3]:
html_file = "src/web_page.html"  # forward slashes also work on Windows
tree = etree.parse(html_file)
tree

<lxml.etree._ElementTree at 0x19c8b3d8f48>

`etree.parse` reads the HTML file and converts it into a tree structure like this:

![](https://www.researchgate.net/profile/Antanas-Cenys/publication/266611108/figure/fig10/AS:668860244045832@1536480117529/HTML-source-code-represented-as-tree-structure.png)
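`etree.parse` uses an XML parser by default, so it only accepts markup that is also well-formed XML, which `web_page.html` happens to be. The sketch below uses a hypothetical inline snippet (not the notebook's file) with an unclosed `<br>` to show the difference between the default parser and `etree.HTMLParser`.

```python
from lxml import etree

# Hypothetical snippet: the unclosed <br> is valid HTML but not well-formed XML.
broken_html = "<html><body><p>Hello<br>World</p></body></html>"

# The default XML parser rejects it.
try:
    etree.fromstring(broken_html)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# An HTML parser tolerates it and builds a tree anyway.
root = etree.fromstring(broken_html, parser=etree.HTMLParser())

print(xml_ok)                    # False
print(root.tag)                  # html
print(root.find("body/p").text)  # Hello
```

For a file on disk, the same parser can be passed as `etree.parse(html_file, parser=etree.HTMLParser())`.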

To get the HTML source back from the tree, we can call lxml's `etree.tostring` function on it:

In [4]:
etree.tostring(tree)

b'<html lang="en">\n\n<head>\n    <title>This is the title</title>\n</head>\n\n<body>\n    <p>Hello World</p>\n    <ul>\n        <li id="myID">Web Scraping with Python using Requests, LXML and Splash</li>\n        <li class="myClass">Created by:\n            <a href="https://twitter.com/AhmedRafik__">Ahmed Rafik</a>\n        </li>\n    </ul>\n</body>\n\n</html>'

This returns the HTML of the page as a `bytes` object, which is why the output starts with `b'`.
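If a regular string or indented output is more convenient, `tostring` accepts `encoding` and `pretty_print` arguments. A minimal sketch on a small inline fragment (an assumption for illustration, not the notebook's file):

```python
from lxml import etree

root = etree.fromstring("<ul><li>one</li><li>two</li></ul>")

raw = etree.tostring(root)                          # bytes by default
as_text = etree.tostring(root, encoding="unicode")  # str instead of bytes
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")

print(type(raw).__name__)      # bytes
print(type(as_text).__name__)  # str
print(pretty)                  # same markup, indented across several lines
```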

## Element object

In [5]:
title = tree.find('head/title')
print(title)

<Element title at 0x19c8b587d88>


`find` returns an `Element` object for the `title` tag. To get the text inside it, we read its `text` attribute:

In [6]:
title.text

'This is the title'

In [7]:
tree.find("body/p").text

'Hello World'

In [8]:
list_items = tree.findall("body/ul/li")
print(list_items)

[<Element li at 0x19c8b5a60c8>, <Element li at 0x19c8b5a6108>]
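`find` and `findall` behave differently when nothing matches, which matters before reading `.text`. The sketch below uses a cut-down inline copy of the page (an assumption, so it runs without `web_page.html`) and also shows `findtext`, which returns the text directly.

```python
from lxml import etree

root = etree.fromstring(
    "<html><head><title>This is the title</title></head>"
    "<body><p>Hello World</p><ul><li>one</li><li>two</li></ul></body></html>"
)

first = root.find("body/ul/li")          # first match only
all_items = root.findall("body/ul/li")   # every match, as a list
missing = root.find("body/table")        # None when nothing matches
title_text = root.findtext("head/title")
fallback = root.findtext("body/h1", default="(no heading)")

print(first.text)      # one
print(len(all_items))  # 2
print(missing)         # None
print(title_text)      # This is the title
print(fallback)        # (no heading)
```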


In [9]:
for li in list_items:
    a = li.find("a")
    if a is not None:
        print(li.text.strip() + a.text)
    else:
        print(li.text)

Web Scraping with Python using Requests, LXML and Splash
Created by:Ahmed Rafik
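The loop above combines `li.text` and `a.text` because `text` only holds the text before an element's first child. Text that follows a child is stored on that child's `tail`. A small sketch, reusing the second `<li>` from the page with a hypothetical trailing " (author)" added to make the tail visible:

```python
from lxml import etree

# Second <li> of the page, plus hypothetical trailing text after the link.
li = etree.fromstring(
    '<li class="myClass">Created by: '
    '<a href="https://twitter.com/AhmedRafik__">Ahmed Rafik</a> (author)</li>'
)
a = li.find("a")

print(repr(li.text))           # 'Created by: '
print(repr(a.text))            # 'Ahmed Rafik'
print(repr(a.tail))            # ' (author)'
print(a.get("href"))           # https://twitter.com/AhmedRafik__
print(li.get("class"))         # myClass
print("".join(li.itertext()))  # Created by: Ahmed Rafik (author)
```

`itertext()` walks `text` and `tail` values in document order, so it avoids handling each child by hand.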


## Introduction to lxml with xpath

The `xpath` method works much like `findall`: for a path expression it returns a list of matches, so we index with `[0]` to get the first one.

In [10]:
title = tree.xpath("//title/text()")[0]
title

'This is the title'

In [11]:
para = tree.xpath("//p/text()")[0]
para

'Hello World'
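The two cells above index the result with `[0]`, which assumes at least one match. The sketch below, run on a cut-down inline copy of the page (an assumption for illustration), shows what `xpath` returns for a few kinds of expression, including the empty list you get when nothing matches.

```python
from lxml import etree

root = etree.fromstring(
    "<html><head><title>This is the title</title></head>"
    "<body><ul><li id='myID'>first</li><li class='myClass'>second</li></ul></body></html>"
)

elements = root.xpath("//li")          # list of Element objects
texts = root.xpath("//li/text()")      # list of strings
ids = root.xpath("//li/@id")           # list of attribute values
count = root.xpath("count(//li)")      # XPath number, returned as a float
title = root.xpath("string(//title)")  # XPath string, returned as a str
missing = root.xpath("//table")        # no match: empty list

print(len(elements))  # 2
print(texts)          # ['first', 'second']
print(ids)            # ['myID']
print(count)          # 2.0
print(title)          # This is the title
print(missing)        # []
```

Because `missing` is an empty list, `missing[0]` would raise `IndexError`, so it is worth checking the list before indexing when a match is not guaranteed.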

In [12]:
list_items = tree.xpath("//body/ul/li")
for li in list_items:
    a = li.xpath("a")
    if len(a) > 0:
        print(li.text.strip() + a[0].text)
    else:
        print(li.text)

Web Scraping with Python using Requests, LXML and Splash
Created by:Ahmed Rafik


In [13]:
list_items = tree.xpath("//body/ul/li")
for li in list_items:
    text = "".join(map(str.strip, li.xpath(".//text()")))
    print(text)

Web Scraping with Python using Requests, LXML and Splash
Created by:Ahmed Rafik


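**Note**: Start the XPath expression with a "." (for example `.//text()`) when calling `xpath` on an element. An expression that starts with `/` or `//` is evaluated from the document root, not from that element.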
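To see why the leading "." matters, the sketch below runs three expressions from the first `<li>`, which contains no link. The inline markup is a shortened stand-in for the page, assumed here for illustration.

```python
from lxml import etree

root = etree.fromstring(
    "<ul>"
    "<li id='myID'>Web Scraping</li>"
    "<li class='myClass'>Created by:<a href='#'>Ahmed Rafik</a></li>"
    "</ul>"
)
first_li = root.xpath("//li")[0]

anywhere = first_li.xpath("//a")  # absolute: searches the whole document
inside = first_li.xpath(".//a")   # relative: searches only inside first_li
children = first_li.xpath("a")    # relative: direct children of first_li

print(len(anywhere))  # 1
print(len(inside))    # 0
print(len(children))  # 0
```

The absolute expression finds the link in the second `<li>` even though it was called on the first one.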

## Introduction to lxml with CSS selectors

In [15]:
tree = etree.parse("src/web_page.html")
tree

<lxml.etree._ElementTree at 0x19c8b3d8d48>

Here `tree` is an `ElementTree` object, but the `cssselect` method is only available on `Element` objects (and relies on the separate `cssselect` package being installed). We can get the root `html` element by calling `getroot()`:

In [16]:
html = tree.getroot()
html

<Element html at 0x19c8b4a1b48>
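A quick way to check the difference between the two objects is to inspect them directly. This sketch parses a tiny inline document through `io.BytesIO` (an assumption, standing in for the file on disk) and only tests whether the method exists, so it does not call `cssselect` itself.

```python
import io
from lxml import etree

source = b"<html><head><title>This is the title</title></head></html>"
tree = etree.parse(io.BytesIO(source))
root = tree.getroot()

print(type(tree).__name__)         # _ElementTree
print(type(root).__name__)         # _Element
print(root.tag)                    # html
print(hasattr(tree, "cssselect"))  # False
print(hasattr(root, "cssselect"))  # True
```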

In [20]:
title = html.cssselect("title")[0]
title.text

'This is the title'

In [21]:
para = html.cssselect("p")[0]
para.text

'Hello World'

In [22]:
list_items = html.cssselect("li")
print(list_items)

[<Element li at 0x19c8b5a6048>, <Element li at 0x19c8b641b08>]


In [24]:
for li in list_items:
    a = li.cssselect("a")
    if len(a) > 0:
        print(li.text.strip() + a[0].text)
    else:
        print(li.text)

Web Scraping with Python using Requests, LXML and Splash
Created by:Ahmed Rafik
