# Web Scraping

## Learning Outcomes:
- Learn the structure of HTML
- Learn how to use XPath to navigate HTML (via lxml)
- Use Selenium to scrape data from websites

One of the most common ways to obtain data is **web scraping**. Web scraping, as the name suggests, means pulling information from websites programmatically (because copying and pasting would be way too much effort).

## The challenge

Let's say we wanted to build a model which would predict house prices given some features - for example, location, number of bedrooms, number of bathrooms. We need some way of obtaining this data - both the features and the response (target) variable.

To introduce you to the concept of web scraping, let's try and extract data for 100 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address
    
[This URL shows houses listed for sale in London](https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list). Let's take a look at where the information that we want to extract is on the webpage.

Before we look at solving this challenge, let's take a look at what websites and HTML actually are.

## Websites

### What format does information on a website exist in?

We know that websites don't just print data in a nice CSV or JSON format. 
They have content, like text, images and buttons, laid out on the page in a way that makes sense to a human reader. 
This content is defined in an HTML file.

They also have styling, which is usually defined separately in CSS and controls how that content looks.

#### What is HTML?

HTML stands for HyperText Markup Language. It consists of a tree structure of different types of web elements, like buttons, page divisions, images and more. This means that it is used to define what **content** is rendered on any webpage that you visit.

HTML markup contains elements (tags) that may contain other elements.

Here is an example of some HTML markup, and the page it is rendered as.

![](./images/example_html.png)

[Let's play around with some HTML](https://code.sololearn.com/WoNr8gIeKYDr/)
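
To make the "elements inside elements" idea concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to print the nesting of a small page. The page content and the `OutlineParser` class are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser

# A small, made-up page used only for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <div id="search">
      <h1>Find a home</h1>
      <button>Search</button>
    </div>
  </body>
</html>
"""


class OutlineParser(HTMLParser):
    """Record each opening tag together with how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_HTML)

# Print one tag per line, indented by its depth in the tree.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

Each level of indentation in the output corresponds to one level of nesting: `h1` and `button` sit inside `div`, which sits inside `body`, which sits inside `html`.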

### How can we get the website HTML, which contains data that we want?

When you enter a URL in a browser, here's what happens:
- Your browser makes a **GET request** to the computer (server) that serves requests for that URL.
- The server knows what web content to send back, so it sends it in a response to the request. This response includes the HTML of the page that you want to view.
- Your browser receives the HTML and knows how to present it to you (it renders the webpage).

The point here being that you can get the HTML, which defines the content for any site, by making a GET request to that website.

Let's try that!

We can use the `requests` library to get the HTML from a website:

In [None]:
import requests # import the requests library
r = requests.get('https://www.zoopla.co.uk') # make an HTTP GET request to this website
html_string = r.text # the text attribute of this response is the HTML as a string
print(html_string)
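
To see both sides of the request/response exchange without depending on an external site, the sketch below starts a tiny web server on your own machine and then makes a GET request to it. It uses only the standard library (`http.server` and `urllib.request`) rather than `requests`. The page content and the `PageHandler` class are made up for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>Hello</h1></body></html>"  # made-up page content


class PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server answers every GET request with the same HTML document.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:{}/".format(server.server_port)

# Ignore any system-wide proxy settings, since we are talking to localhost.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(url, timeout=5) as response:
    status = response.status
    content_type = response.headers["Content-Type"]
    body = response.read().decode("utf-8")

server.shutdown()
server.server_close()

print(status, content_type)
print(body)
```

The response carries a status code and headers as well as the HTML body. With `requests`, the same pieces are available as `r.status_code`, `r.headers` and `r.text`.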

## Finding the data by Parsing HTML

Now that we have the page which we know contains the data, we need to find the information that we care about within it. 

The process of converting the string of text, which we know represents HTML, into structured HTML as well as maybe extracting information that we want from within it, is called **parsing**.

### Converting the string to a HTML tree

HTML is just a nested set of elements, all inside a single `<html>...</html>` tag. 
By looking at this top `html` node, and branching its child nodes from it recursively, we can visualise the webpage as a tree. This is shown below.

![](./images/html_tree.jpg)

### Showing the webpage and markup with this structure

We can use the `lxml` library to turn the flat string into a tree data structure. 
`lxml` has a module called `etree`, which is a simple and efficient API for parsing and creating XML or HTML data, and a module called `html`, which adds HTML-specific helpers on top of it.
Here we use the `fromstring` function from `lxml.html`.

See more about `lxml.etree` [here](https://lxml.de/api/)

In [None]:
from lxml import etree, html

tree = html.fromstring(html_string) # use the fromstring function to build an HtmlElement from an HTML string
# print(tree)
print(etree.tostring(tree)) # print the tree

This tree object that we've converted the string HTML into now has methods that can be used to find elements within it. 

Check out the lxml.html documentation [here](https://lxml.de/3.1/tutorial.html).

### Finding tree elements within a `HtmlElement` using XPath

XPath is a query language for selecting nodes (elements) within a tree-like data structure such as HTML or XML. 

Below is a very simple XPath expression. This one finds all of the button elements in the HTML.

#### `//button` 

The `//` says "anywhere in the tree" and the `button` says "find elements that have the tag type button". So this XPath expression says "find button tags anywhere within the tree".

The `xpath` method of `HtmlElement` takes in an XPath expression and returns a list of all elements in the tree that match it.

Below are more examples of how to use xpath

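`./button` - finds button tags that are direct **children** of the current element (not all of its descendants). A path starting with a single `/`, such as `/html/body`, is instead resolved from the root of the document.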

`//div/button` - finds all of the button tags inside div tags anywhere on the page

`//div[@id='custom_id']` - finds all div tags with the attribute (`@`) `id` equal to `custom_id`, anywhere on the page 

If any of these don't make sense, let us know after [looking it up](https://www.w3schools.com/xml/xpath_syntax.asp).
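
The expressions above can be tried out on a small, made-up snippet. This is a minimal sketch that uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra installs. ElementTree only supports a subset of XPath, and on an element its paths must be relative (starting with `.`). The same relative expressions can also be passed to lxml's `.xpath` method.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup used only for illustration.
SAMPLE = """
<html>
  <body>
    <button>Top-level button</button>
    <div id="custom_id">
      <button>Search</button>
      <p><button>Nested deeper</button></p>
    </div>
    <div id="other">
      <button>Cancel</button>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)
body = root.find("body")

# './/button' : button elements anywhere below the current element
all_buttons = [b.text for b in root.findall(".//button")]

# './button' : only buttons that are direct children of the current element
child_buttons = [b.text for b in body.findall("./button")]

# './/div/button' : buttons that are direct children of any div
div_buttons = [b.text for b in root.findall(".//div/button")]

# ".//div[@id='custom_id']" : divs whose id attribute equals 'custom_id'
custom_divs = root.findall(".//div[@id='custom_id']")

print(all_buttons)
print(child_buttons)
print(div_buttons)
print([d.get("id") for d in custom_divs])
```

Note that "Nested deeper" is found by `.//button` but not by `.//div/button`, because that button's parent is a `p`, not a `div`.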

Use the `//button` XPath expression as an argument to find the buttons on the page.

In [None]:
buttons = tree.xpath('//button') # find every button element anywhere in the tree
print(buttons) # the list of elements returned by the XPath expression

The elements of the tree that match the XPath expression are returned from the call to the `.xpath` method of the tree. To see the text that each of the buttons contains, we can check its `.text` attribute. Let's use a list comprehension to explore this.

In [None]:
btn_texts = [b.text for b in buttons] # map each button to its text
print(btn_texts)

## Using the developer console to identify the right xpath

### How to open the console

Modern browsers come with tools to maximise web developers' productivity and help find bugs.

The developer console has a lot of different tools. 

Open your element inspector by pressing `CTRL + SHIFT + C`.
It should open on the right hand side of your screen as shown below.

The elements tab of the developer console shows you the HTML and CSS that make up the website code (actually it shows the DOM. Read more about what exactly the DOM is [here](https://css-tricks.com/dom/)).

You can always close the developer console by clicking the cross in the corner. 

Pressing `CTRL + SHIFT + J` opens the JavaScript console and closes the developer console if it is already open.

![title](./images/dev_console_opened.png)

Check out the Zoopla website for yourself. Try using your selector to see the HTML structure of the page.

![](./images/form_selector.png)

Now use your selector to find the location of the button as shown below.

![](./images/button_selector.png)

As mentioned, the selector allows us to visualise the DOM and find elements within our webpage.


### Challenge: How many HTML buttons are there on the homepage? 

### We can find elements, and then search for elements within them!

Elements returned by an XPath search are the same type of object as the tree itself, so they have the same search methods. This means you can find a container element first and then search within it.
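
A minimal sketch of this two-step search is shown below. It uses the standard library's `xml.etree.ElementTree` in place of lxml, and the markup is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up markup used only for illustration.
SAMPLE = """
<body>
  <form id="search-form">
    <input name="q" />
    <button>Search</button>
  </form>
  <footer>
    <button>Subscribe</button>
  </footer>
</body>
"""

root = ET.fromstring(SAMPLE)

# Step 1: find the container we care about.
form = root.find(".//form[@id='search-form']")

# Step 2: search again, this time starting from that container only.
form_buttons = [b.text for b in form.findall(".//button")]
page_buttons = [b.text for b in root.findall(".//button")]

print(type(form) is type(root))  # the container is the same kind of object as the tree
print(form_buttons)
print(page_buttons)
```

With lxml, remember to keep the leading `.` when searching inside an element: `element.xpath('.//button')` searches below `element`, whereas `element.xpath('//button')` searches the whole document again.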

### We can search for elements in more ways than just xpath

There are loads of ways to find elements within HTML.

Let's check what methods and properties our tree object has by calling the built-in `dir()` function.

In [None]:
print(dir(tree))

If we wanted to find the documentation which would explain all of these in detail, we should print its type and then find the corresponding page in the docs.

In [None]:
print(type(tree))

The type of our tree is currently `lxml.html.HtmlElement`. 

If we find the documentation, we can see all of the methods [here](https://lxml.de/api/lxml.html.HtmlElement-class.html)

Sometimes it's not easy, but use your element inspector, and consider the many ways to select elements.

In [None]:
all_children = tree.getchildren()
print('all_children:')
print(all_children)

In [None]:
find = tree.find('.//div')
print('using "find" method:')
print(find)

In [None]:
# In this cell, we will find ALL the <select> elements within the body element

## Assign a variable which contains the body element (hint: look at the elements within all_children)
body_tree = all_children[1]

## Find all the select elements within body_tree
find_all = body_tree.findall('.//select')
print('using "findall" method:')
print(find_all)

Using lxml's `cssselect` method (which relies on the separate `cssselect` package being installed), we can find elements with a CSS selector, for example those which have a certain class. In the example below, we find all `a` tags which have the class `mnav__link`.

In [None]:
tab_links = tree.cssselect("a.mnav__link")
print(tab_links)
print()

for element in tab_links:
    print(element.text_content().strip()) # text_content() always returns a string, whereas .text can be None
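
A CSS class selector such as `a.mnav__link` matches any `a` tag that has that class among possibly several, whereas an XPath test like `[@class='...']` compares the whole attribute string. The sketch below illustrates the difference in plain Python with the standard library's `xml.etree.ElementTree`. The markup and the `has_class` helper are made up for illustration and only approximate what a CSS engine does.

```python
import xml.etree.ElementTree as ET

# Made-up markup used only for illustration.
SAMPLE = """
<nav>
  <a class="mnav__link">For sale</a>
  <a class="mnav__link mnav__link--active">To rent</a>
  <a class="footer-link">Contact</a>
</nav>
"""

root = ET.fromstring(SAMPLE)


def has_class(element, class_name):
    """Return True if class_name is one of the element's space-separated classes."""
    return class_name in element.get("class", "").split()


# Roughly what the CSS selector 'a.mnav__link' matches: the class is one token among several.
by_token = [a.text for a in root.iter("a") if has_class(a, "mnav__link")]

# An exact attribute comparison only matches when the whole string is identical.
by_exact_string = [a.text for a in root.findall(".//a[@class='mnav__link']")]

print(by_token)
print(by_exact_string)
```

This is worth keeping in mind when an element carries several classes: an exact `@class` comparison stops matching as soon as the site adds or reorders a class.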

In [None]:
## Find all the list items (li) which have "homepage-browse__carousel-item" in their classname
li_carousel = tree.cssselect("li.homepage-browse__carousel-item")
print(li_carousel)


assert len(li_carousel) == 8

What is the difference between:
- `tree.cssselect("li.homepage-browse__carousel-item")`
- `tree.cssselect(".homepage-browse__carousel-item")`

We can also get elements by their ID, which returns a `HtmlElement` object that we can continue parsing as normal.

In [None]:
search_submit = tree.get_element_by_id('search-submit') # name the variable after the id it holds
print(search_submit)
print(search_submit.text)

### Once you've got an element, you can do all kinds of things with it

A likely thing that you'll want to do is get the text inside an element. This might be a house name for example.

Find the documentation [here](https://lxml.de/3.1/lxmlhtml.html)

In [None]:
element = tree.get_element_by_id('mn-advice')

# this element has a load of different attributes
as_a_string = etree.tostring(element) # get element as string
tag_type = element.tag # get the type of the tag

print('as a string:')
print(as_a_string[:100])
print()
print('type of element:', tag_type)
print()

In [None]:
# Text directly inside the element, before its first child element
element_text = element.text

# All text inside the element, including text within its descendants
element_text2 = element.text_content()

print('Direct text:', element_text)
print('Text within child elements:', element_text2)
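
The difference between the two is easiest to see on a tiny, made-up element that mixes text and a child tag. The sketch below uses the standard library's `xml.etree.ElementTree`, whose elements share the `.text` and `.tail` attributes with lxml. Joining `itertext()` stands in for lxml's `text_content()`.

```python
import xml.etree.ElementTree as ET

# Made-up element used only for illustration.
element = ET.fromstring("<p>Listed: <strong>3 bedrooms</strong>, 2 bathrooms</p>")

direct_text = element.text              # text before the first child element
child_text = element[0].text            # text inside <strong>
tail_text = element[0].tail             # text after </strong>, still inside <p>
all_text = "".join(element.itertext())  # every piece of text, in document order

print(repr(direct_text))
print(repr(child_text))
print(repr(tail_text))
print(repr(all_text))
```

If the value you need appears to be cut off when you read `.text`, it is probably split across child elements, and the full text content is what you want.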


### Challenge: Extract the property address for the following URL: https://www.zoopla.co.uk/new-homes/details/55319462

In [None]:
URL = "https://www.zoopla.co.uk/new-homes/details/55319462"
r = requests.get(URL) # make a HTTP GET request to this website
html_string = r.text # the text attribute of this response is the HTML as a string

tree = html.fromstring(html_string)
# Two alternative ways to select the same <h2>; either line on its own is enough
address = tree.cssselect("h2.ui-property-summary__address")[0].text
address = tree.find(".//h2[@class='ui-property-summary__address']").text
print(address)

## Beyond just GETTING static HTML


### Why might using requests to get the website content not work?

Some elements on webpages are inserted or manipulated by JavaScript code that runs only after the HTML has been loaded by the browser.

Some information that you want may be shown only after interacting with certain elements.

A GET request to the website just fetches the HTML file. It doesn't run the JavaScript code or interact with the page after it renders, so the data we want may simply not be present in the HTML we parse.
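
Here is a minimal sketch of the problem, using a made-up page in which the price is only written into the page by a script. Parsing the raw HTML (here with the standard library's `html.parser`) finds an empty `span`, because nothing ever runs the script. The `SpanTextParser` class is hypothetical and exists only for this illustration.

```python
from html.parser import HTMLParser

# Made-up page: the price is only written into the page when the script runs.
PAGE = """
<html>
  <body>
    <span id="price"></span>
    <script>
      document.getElementById("price").textContent = "450000";
    </script>
  </body>
</html>
"""


class SpanTextParser(HTMLParser):
    """Collect the text found directly inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.span_text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.span_text += data


parser = SpanTextParser()
parser.feed(PAGE)

# The span is empty in the raw HTML, even though a browser would display the price.
print(repr(parser.span_text))
```

The value does appear somewhere in the downloaded text (inside the script), but not in the element where a browser would show it.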

Again, there is a way around this. We can use a library called Selenium to take control of a browser, which can then be programmatically instructed to fill in forms, click elements, and find data on any webpage.

## Selenium

Selenium is a tool for programmatically controlling a browser. It was originally intended for automating tests of web applications, but it can be used for anything that needs a browser to be controlled.

Check out the docs [here](https://selenium-python.readthedocs.io/)

### Webdriving

Selenium can "drive" a web browser. This means it can take full control of it: find elements, click, scroll, execute JavaScript, etc.

You need to specify which browser the webdriver will drive, such as Chrome or Firefox. To drive a browser you need to have its driver installed. We'll use the Chrome browser and download its driver, called ChromeDriver.

You should ensure you have the correct version of ChromeDriver, which should match the version of Chrome that you wish to drive. 

[Check your chrome version here](https://help.zenplanner.com/hc/en-us/articles/204253654-How-to-Find-Your-Internet-Browser-Version-Number-Google-Chrome)

[Download chromedriver from here](https://chromedriver.chromium.org/downloads)

*We've assumed you're running Chrome 83 and have downloaded and placed the relevant driver for that in the 'chrome_driver' folder. If you're running a different version, please download the relevant driver*


### Getting pages

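To start Selenium driving a browser, we need to instantiate a Selenium webdriver. 

`driver = webdriver.Chrome(<DRIVER_PATH>)`

The cells below use the Selenium 3 API. In Selenium 4 the driver path is passed through a `Service` object (`webdriver.Chrome(service=Service(<DRIVER_PATH>))`), and the `find_element_by_*` / `find_elements_by_*` helpers are replaced by `find_element(By.XPATH, ...)` and `find_elements(By.XPATH, ...)`.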

In [None]:
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome('chrome_driver/chromedriver') 
driver.get("https://zoopla.co.uk")

Cool! We see that we've navigated to the Zoopla website. We can search for elements via XPath and can also send mouse and keyboard actions through Selenium. Let's recall the challenge we want to solve - extracting data for 100 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address

We'll focus our efforts just on the London area.

In [None]:
driver = webdriver.Chrome('chrome_driver/chromedriver') 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)

Oh... Looks like cookies are blocking us... We need to find a way to get around this 🤔. Let's start by using xpath to find the "Accept All Cookies" button

In [None]:
## Find the "Accept all cookies" button
accept_cookies = driver.find_elements_by_xpath('//button[@data-responsibility="acceptAll"]')
print(accept_cookies)

Looks like there's more than one element - we can find the one we want by searching for the "Accept all cookies" text

In [None]:
for button in accept_cookies:
    if button.text == "Accept all cookies":
        relevant_button = button
        
print(relevant_button)
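
One thing to watch with the loop above: if no button has exactly that text, `relevant_button` is never assigned and the `print` raises a `NameError`. A minimal sketch of a safer pattern is shown below. `FakeButton` is a hypothetical stand-in for a Selenium element (only its `.text` attribute is modelled) so that the example runs without a browser, and `first_with_text` is a made-up helper name.

```python
from dataclasses import dataclass


@dataclass
class FakeButton:
    """Hypothetical stand-in for a Selenium element; only the .text attribute is modelled."""
    text: str


def first_with_text(elements, wanted):
    """Return the first element whose text equals wanted, or None if there is none."""
    return next((el for el in elements if el.text.strip() == wanted), None)


buttons = [FakeButton("Manage preferences"), FakeButton("Accept all cookies")]

match = first_with_text(buttons, "Accept all cookies")
missing = first_with_text(buttons, "Reject all")

print(match)
print(missing)
```

With real Selenium elements the same helper could be called as `first_with_text(accept_cookies, "Accept all cookies")`, followed by a check that the result is not `None` before clicking it.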

Now that we have a button, we can send a click action to it!

In [None]:
relevant_button.click()

By using the dev tools, we can examine where, and in what format, all the listings are present:

![image.png](attachment:image.png)

Use XPath to get all the `li` elements under the `ul` element:

In [None]:
properties = driver.find_elements_by_xpath("//ul[@class='listing-results clearfix js-gtm-list']/li")
print(properties)

In [None]:
properties[0].text

From looking at the website, we can see that to access the full description, we need to navigate to the actual page of the listing. Let's create a dictionary to store all of the acquired data, and a for loop which will loop over the properties we've just obtained and extract the relevant data.

In [None]:
def load_and_accept_cookies():
        
    driver = webdriver.Chrome('chrome_driver/chromedriver') 
    URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
    driver.get(URL)
    accept_cookies = driver.find_elements_by_xpath('//button[@data-responsibility="acceptAll"]')
    for button in accept_cookies:
        if button.text == "Accept all cookies":
            button.click() # click the matching button and stop looking
            break
    return driver

In [None]:
def get_properties(num_to_get=2):

    driver = load_and_accept_cookies()
    
    data = {"sale_price": [], "num_bedrooms": [], "sqft": [], "description": [], "address": []}
    PROPERTY_BASE = "//ul[@class='listing-results clearfix js-gtm-list']/li"
    
    for i in range(num_to_get):

        sleep(3)

        ## Find the price and append it to the dictionary
        price_elem = driver.find_elements_by_xpath("{}//a[@class='listing-results-price text-price']".format(PROPERTY_BASE))[i]
        data["sale_price"].append(price_elem.text)

        ## Find a link to click on
        driver.find_elements_by_xpath("{}//a[@class='photo-hover']".format(PROPERTY_BASE))[i].click() # click() returns None, so there is nothing to store
        
        sleep(3)
        
        ## Find the number of bedrooms and append it to the dictionary
        ## HINT: https://lmgtfy.com/?q=xpath+svg+tag
        ## HINT: https://stackoverflow.com/questions/11657223/xpath-get-following-sibling
        bedrooms_elem = driver.find_element_by_xpath("//*[name()='svg' and @class='ui-icon icon-bed']/following-sibling::span")
        num_beds = bedrooms_elem.text[:-9] # " bedrooms" is 9 chars
        data["num_bedrooms"].append(num_beds)

        try:
            ## Find the square footage and append it to the dictionary
            sqft_elem = driver.find_element_by_xpath("//*[name()='svg' and @class='ui-icon icon-area']/following-sibling::span")
            data["sqft"].append(sqft_elem.text)
        except Exception: # not every listing shows its area; a bare except would also swallow KeyboardInterrupt
            data["sqft"].append("None")


        ## Find the description and append it to the dictionary
        description_elem = driver.find_element_by_xpath("//div[@class='dp-description__text']")
        data["description"].append(description_elem.text)

        ## Find the address
        address_elem = driver.find_element_by_xpath("//h2[@class='ui-property-summary__address']")
        data["address"].append(address_elem.text)

        driver.execute_script("window.history.go(-1)")

    return data

properties = get_properties()
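
The fixed `sleep(3)` calls in the function above wait three seconds whether the page needs it or not, and may still be too short on a slow connection. A common alternative is to poll until a condition holds, giving up after a timeout. The sketch below shows the idea in plain Python. `wait_until` and `page_ready` are made-up names for illustration. Selenium ships its own version of this pattern as `WebDriverWait`.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met within {} seconds".format(timeout))
        time.sleep(poll_interval)


# Stand-in for "has the page finished loading?": becomes true on the third check.
attempts = []


def page_ready():
    attempts.append(1)
    return len(attempts) >= 3


wait_until(page_ready, timeout=2.0, poll_interval=0.01)
print("checks needed:", len(attempts))
```

With a real driver, the condition could be something like `lambda: driver.find_elements_by_xpath(PROPERTY_BASE)`, which returns an empty (falsy) list until the listings have loaded.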

In [None]:
print(properties)
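
The scraped values are all strings, and slicing by a fixed number of characters (as `[:-9]` does for the bedrooms) is fragile: a listing that reads "1 bedroom" would be reduced to an empty string. A minimal sketch of a more forgiving approach is to pull the first number out with a regular expression. The example strings below are assumptions about what the scraped text might look like, and `first_int` is a made-up helper name.

```python
import re


def first_int(text):
    """Return the first integer found in text (ignoring thousands separators), or None."""
    if text is None:
        return None
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


# Example strings are assumptions about what the scraped text might look like.
examples = ["£450,000", "3 bedrooms", "1 bedroom", "1,200 sq. ft", "POA", None]
parsed = [first_int(text) for text in examples]
print(parsed)
```

Values that contain no number at all (such as a price shown as "POA") come back as `None`, which is easier to handle later than a string.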

### Challenge: Extend the function above to navigate to the next page and continue the data collection
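
As a starting hint, one possible approach is to build the URL of each results page rather than clicking a "next" link. The sketch below assumes that the `pn` query parameter in the search URL is the page number, which is inferred from the URL itself rather than from any documentation. `url_for_page` is a made-up helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

BASE_URL = (
    "https://www.zoopla.co.uk/new-homes/property/london/"
    "?q=London&results_sort=newest_listings&search_source=new-homes"
    "&page_size=25&pn=1&view_type=list"
)


def url_for_page(url, page_number, page_param="pn"):
    """Return url with its page query parameter set to page_number."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query[page_param] = [str(page_number)]  # assumption: this parameter selects the page
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


# 100 houses at 25 per page means four pages of results.
page_urls = [url_for_page(BASE_URL, n) for n in range(1, 5)]
for page_url in page_urls:
    print(page_url)
```

Each of these URLs could then be passed to `driver.get` in turn, with the per-property loop run once per page.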

In [None]:
driver.quit()