# Web Scraping

## Learning Outcomes:
- Learn the structure of HTML
- Learn how to use XPath to navigate HTML (via lxml)
- Use Selenium to scrape data from websites

One of the most common ways to obtain data is **web scraping**. Web scraping, as the name suggests, is about pulling information from websites programmatically, because copying and pasting it by hand would take far too much effort.

## The challenge

Let's say we wanted to build a model which would predict house prices given some features - for example, location, number of bedrooms, number of bathrooms. We need some way of obtaining this data - both the features and the response (target) variable.

To introduce you to the concept of web scraping, let's try and extract data for 100 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address
    
[This URL shows houses listed for sale in London](https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list). Let's take a look at where the information that we want to extract is on the webpage.

Before we look at solving this challenge, let's take a look at what websites and HTML actually are.

## Websites

### What format does information on a website exist in?

We know that websites don't just print data in a nice CSV or JSON format. 
They display content, such as text, images and buttons, in a way that makes sense to a human reader. 
This content is defined in an HTML file.

They also have styling, which is defined separately in CSS.

#### What is HTML?

HTML stands for HyperText Markup Language. It consists of a tree structure of different types of web elements, like buttons, page divisions, images and more. This means that it is used to define what **content** is rendered on any webpage that you visit.

HTML markup consists of elements (written as tags), and an element may contain other elements.
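To make the nesting concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to print each opening tag indented by its depth in the tree. The HTML snippet and the `TreePrinter` class are made up for illustration. Real pages also contain tags with no closing tag (such as `<br>`), which this simple depth counter does not handle.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Collects each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current depth, then step one level deeper
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag takes us back up one level
        self.depth -= 1


# A made-up snippet: a div that contains a heading and a button
snippet = "<div><h1>Houses</h1><button>Search</button></div>"

printer = TreePrinter()
printer.feed(snippet)
print("\n".join(printer.lines))
```

The `h1` and `button` lines are indented under `div`, which reflects that they are children of the `div` element.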


[Let's play around with some HTML](https://code.sololearn.com/WoNr8gIeKYDr/)

### How can we get the website HTML, which contains data that we want?

When you enter a URL in a browser, here's what happens:
- Your browser makes a **GET request** to the computer (server) that serves requests for that URL
- The server knows what web content to send back, so it sends it in a response to the request. This response includes the HTML of the page that you want to view.
- Your browser receives the HTML and knows how to present it to you (it renders the webpage)

The point here being that you can get the HTML, which defines the content for any site, by making a GET request to that website.

Let's try that!

We can use the `requests` library to get the HTML from a website.

In [1]:
import requests # import the requests library
r = requests.get('http://pythonscraping.com/pages/page3.html') # make an HTTP GET request to this website
html_string = r.text # the text attribute of this response is the HTML as a string
print(html_string)

<html>
<head>
<style>
img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
</style>
</head>
<body>
<div id="wrapper">
<img src="../img/gifts/logo.jpg" style="float:left;">
<h1>Totally Normal Gifts</h1>
<div id="content">Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.<p>
We haven't figured out how to make online shopping carts yet, but you can send us a check to:<br>
123 Main St.<br>
Abuja, Nigeria
</br>We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.</div>
<table id="giftList">
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>

<tr id="gift1" class="gift"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friend

# BeautifulSoup

What we saw above only gives us the HTML, but we want to be able to extract the data from it. After requesting the data from the webpage, we obtain a string of HTML, but looking for some specific data is a bit of a pain. We can use the **BeautifulSoup** library to extract the data from the HTML looking for specific tags and their attributes.

In [3]:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://pythonscraping.com/pages/page3.html')
html = page.text # Get the content of the webpage
soup = BeautifulSoup(html, 'html.parser') # Convert that into a BeautifulSoup object, which has methods that make searching for tags easier
print(soup.prettify())

<html>
 <head>
  <style>
   img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
  </style>
 </head>
 <body>
  <div id="wrapper">
   <img src="../img/gifts/logo.jpg" style="float:left;"/>
   <h1>
    Totally Normal Gifts
   </h1>
   <div id="content">
    Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
    <p>
     We haven't figured out how to make online shopping carts yet, but you can send us a check to:
     <br/>
     123 Main St.
     <br/>
     Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.
    </p>
   </div>
   <table id="giftList">
    <tr>
     <th>
      Item Title
     </th>
     <th>
      Description
     </th>
     <th>
      Cost
     </th>
     <th>
      Image
     </th>
   

Let's see an example using the following URL `'http://pythonscraping.com/pages/page3.html'`

In that webpage you will find a small list with a set of items. Let's say that you want to extract the data from the Fish Painting.

<p align="center">
  <img src='images/BS4_1.png' width=500>
</p>


In your browser, right-click on the Fish Painting row and select **Inspect** (or **Inspect Element**). The developer tools will show the HTML for the page, all of which sits inside the `<body>` tag, and you can see that the Fish Painting is in a `<tr>` tag.

<p align="center">
  <img src='images/BS4_2.png' width=500>
</p>

You can find that tag using the `find` method, which accepts the tag name and the attributes of that tag.

In [None]:
fish = soup.find(name='tr', attrs={'id': 'gift3', 'class': 'gift'}) # If it doesn't find anything it returns None

print(fish)

Inside the `tr` tag, you will find different `<td>` tags. You can find all the `<td>` tags using the method `find_all` that accepts the tag name and the attributes of said tag.

In [None]:
fish_row = fish.find_all('td') # This returns a list where each item corresponds to each td tag 

You now have a list in which each element corresponds to the data for one column. Thus, you can index the list to get the data you want.

In [None]:
title = fish_row[0].text
description = fish_row[1].text
price = fish_row[2].text

print(title)
print(description)
print(price)

You can keep looking for more data in the tree. For example, you can look for the parrot row taking into account that it is the sibling of the fish row.

In [None]:
parrot = fish.find_next_sibling()

And you can also find the parrot's children using the method `findChildren` (an older alias of `find_all`, so it returns every tag nested inside the row):

In [None]:
parrot_children = parrot.findChildren()
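The steps above can be tried without a network connection. The sketch below runs the same `find`, `find_all` and `find_next_sibling` calls on a small, made-up HTML string that only mimics the layout of the gift table. The titles, descriptions and prices in it are invented and are not the values from the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the layout of the gift table (values are invented)
html = """
<table id="giftList">
  <tr id="gift1" class="gift"><td>Fish Painting</td><td>A painting of a fish</td><td>$10</td></tr>
  <tr id="gift2" class="gift"><td>Parrot</td><td>A talking parrot</td><td>$20</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Find one row by its id, then collect the text of each cell in that row
row = soup.find("tr", attrs={"id": "gift1"})
cells = [td.get_text(strip=True) for td in row.find_all("td")]
print(cells)

# Passing a tag name to find_next_sibling skips the whitespace between rows
next_row = row.find_next_sibling("tr")
next_title = next_row.find("td").get_text(strip=True)
print(next_title)
```

Using `get_text(strip=True)` rather than `.text` removes the surrounding newlines and spaces that often appear inside table cells.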

# Challenge: What is a Method?

It is time to apply what you have learned so far about BeautifulSoup. Go to the following page and look for the `Methods` section: [https://en.wikipedia.org/wiki/Python_(programming_language)](https://en.wikipedia.org/wiki/Python_(programming_language))

You only need to extract the text from that section.

<p align="center">
  <img src='images/BS4_3.png' width=500>
</p>

_Tip_: The `p` tag containing the text does not have any attributes. Try looking at the `h3` tag before it, or the `a` child tag. From there, you can start moving around.
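As a hint for "moving around", here is a minimal sketch on a made-up HTML string (not the Wikipedia page, whose real structure you will need to inspect yourself). It starts from an `a` tag that does have an attribute, moves up to its parent heading, and then across to the paragraph that follows it.

```python
from bs4 import BeautifulSoup

# Made-up HTML: headings that contain an anchor, each followed by a plain paragraph
html = """
<h3><a id="intro">Intro</a></h3>
<p>First paragraph.</p>
<h3><a id="details">Details</a></h3>
<p>Second paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

anchor = soup.find("a", attrs={"id": "details"})  # the tag we can identify
heading = anchor.parent                            # move up to the h3
paragraph = heading.find_next_sibling("p")         # move across to the next p
print(paragraph.text)
```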

# Finding tree elements within an `HtmlElement` using XPath

Xpath is a query language for selecting nodes/branches/elements within a tree-like data structure like HTML or XML. 

Below is a very simple xpath expression. This one finds all of the button elements in the html

#### `//button` 

The `//` says "anywhere in the tree" and the `button` says find elements that have the tag type button. So this xpath expression says "find button tags anywhere within the tree"

The `xpath` method of `HtmlElement` (from lxml) takes an XPath expression and returns a list of all elements in the tree that match it.

Below are more examples of how to use xpath

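`./button` - finds button tags that are direct **children** (not all descendants) of the current element. Note that a leading single `/`, as in `/button`, starts from the root of the document instead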

`//div/button` - finds all of the button tags inside div tags anywhere on the page

`//div[@id='custom_id']` - finds all div tags with the attribute (`@`) `id` equal to `custom_id`, anywhere on the page 

If any of these don't make sense, let us know after [looking it up](https://www.w3schools.com/xml/xpath_syntax.asp).
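The expressions above can be tried out with lxml on a small HTML string. This is a minimal sketch: the markup is made up for illustration, and it assumes the `lxml` package is installed.

```python
from lxml import html

# Made-up page with one button at the top level and one inside a div
page = html.fromstring("""
<html><body>
  <button>Top</button>
  <div id="custom_id"><button>Inside</button></div>
  <div><p>No button here</p></div>
</body></html>
""")

all_buttons = page.xpath("//button")                # buttons anywhere in the tree
div_buttons = page.xpath("//div/button")            # buttons directly inside a div
custom_divs = page.xpath("//div[@id='custom_id']")  # divs with a given id

print([b.text for b in all_buttons])
print([b.text for b in div_buttons])
print(len(custom_divs))
```

Each call returns a list, even when only one element matches, so index into the result (for example `custom_divs[0]`) to get the element itself.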

Use the `//button` xpath expression as an argument to find the button on the page

In [None]:
%%bash
pip install selenium

## Using the developer console to identify the right xpath

### How to open the console

Modern browsers come with tools to maximise web developers' productivity and help find bugs.

The developer console has a lot of different tools. 

Open your element inspector by pressing `CTRL + SHIFT + C`.
It should open on the right hand side of your screen as shown below.

The elements tab of the developer console shows you the HTML and CSS that make up the website code (actually it shows the DOM. Read more about what exactly the DOM is [here](https://css-tricks.com/dom/)).

You can always close the developer console by clicking the cross in the corner. 

Check out the zoopla website for yourself. Try using your selector to see the HTML structure of the page.
![](./images/form_selector.png)

Now use your selector to find the location of the button as shown below.

![](./images/button_selector.png)

As mentioned, the selector allows us to visualise the DOM and find elements within our webpage.


### Challenge: How many HTML buttons are there on the homepage? 

### We can find elements, and then search for elements within them!

Elements returned from finding them by xpath also have the same search methods. They are the same object type.
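A minimal sketch of this idea with lxml, on made-up markup: first find a `div`, then run a second XPath query on the element that was returned. Starting the expression with `.` makes it relative to that element rather than to the whole page.

```python
from lxml import html

# Made-up page: a form-like div with a nested button, plus a button outside it
page = html.fromstring(
    "<html><body>"
    "<div id='form'><button>Search</button><span><button>Reset</button></span></div>"
    "<button>Menu</button>"
    "</body></html>"
)

form = page.xpath("//div[@id='form']")[0]

direct = form.xpath("./button")   # buttons that are direct children of the div
nested = form.xpath(".//button")  # buttons anywhere inside the div

print([b.text for b in direct])
print([b.text for b in nested])
# Both the div and the buttons found inside it are HtmlElement objects
print(isinstance(form, html.HtmlElement), isinstance(direct[0], html.HtmlElement))
```

Neither query returns the "Menu" button, because it sits outside the `div` that the search started from.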

### We can search for elements in more ways than just xpath

There are loads of ways to find elements within HTML.

Let's check what methods and properties of our tree object exist by calling the built-in `dir()` function.
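As a small sketch, assuming the tree object is an lxml `HtmlElement` built from a made-up snippet, `dir()` can be filtered down to the public names that look search-related:

```python
from lxml import html

# A made-up fragment, just to have an element to inspect
tree = html.fromstring("<div><button>Search</button></div>")

# Keep public names only, then those that look like search methods
public_names = [name for name in dir(tree) if not name.startswith("_")]
search_like = [name for name in public_names if "find" in name or "xpath" in name]
print(search_like)
```

The exact list depends on the installed lxml version, so treat the printed names as a starting point for reading the documentation.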

## Beyond just GETTING static HTML


### Why might using requests to get the website content not work?

Some elements on webpages are inserted or manipulated by JavaScript code that runs only after the HTML is loaded.

Some information that you want may be shown only after interacting with certain elements.

A GET request to the website just gets the HTML file. It doesn't run the JavaScript code or interact with the page after it renders, so data that only appears at that stage will be missing from the HTML we parse.
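The effect can be illustrated without a browser. In the made-up page below, the price is written into the `span` by a script. Parsing the raw HTML, which is all a GET request gives us, finds the `span` but not the price, because nothing has executed the script.

```python
from bs4 import BeautifulSoup

# Made-up page: the price only appears once a browser runs the script
static_html = """
<html><body>
  <span id="price"></span>
  <script>document.getElementById("price").textContent = "£550,000";</script>
</body></html>
"""
soup = BeautifulSoup(static_html, "html.parser")

price_tag = soup.find("span", attrs={"id": "price"})
print(repr(price_tag.text))  # the span exists, but it is empty
```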

Again, there is a way around this. We can use a library called Selenium to take control of a browser that can then be programmatically instructed to fill in forms, click elements, and find data on any webpage.

## Selenium

Selenium is a tool for programmatically controlling a browser. It was originally intended for automated testing of web applications, but it can be used for anything that requires a browser to be controlled.

Check out the docs [here](https://selenium-python.readthedocs.io/)

### Webdriving

Selenium can "drive" a web browser. This means it can take full control of it: find elements, click, scroll, execute JavaScript and so on.

You need to specify which browser the webdriver will drive, such as Chrome or Firefox, and you need to have the matching driver installed. We'll use the Chrome browser, whose driver is called ChromeDriver.

Make sure you install the correct version of ChromeDriver, which should match the version of Chrome that you wish to drive. 

[Check your chrome version here](https://help.zenplanner.com/hc/en-us/articles/204253654-How-to-Find-Your-Internet-Browser-Version-Number-Google-Chrome)

[Download chromedriver from here](https://chromedriver.chromium.org/downloads)

If you are using FireFox, you need to download the geckodriver. You can download it from [here](https://github.com/mozilla/geckodriver/releases)

If you are using Edge, you need to download the MicrosoftWebDriver. You can download it from [here](https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/)

If you are using Safari, you can go to your browser, go to `Developer`, and set the "Allow Remote Automation" to "Yes".

Once you download the driver, you can move it to a directory that is on your `PATH`. The driver can then be found no matter which directory you are working in. To do so, follow these steps:

1. Observe your `PATH` environment variable by running `echo $PATH`:


In [6]:
%%bash
echo $PATH

/opt/miniconda3/bin:/opt/miniconda3/condabin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin


2. Move your driver file to any of the directories listed in your `PATH` environment variable. For example, if `/usr/local/bin` is one of those directories, you can move the driver there by running `mv chromedriver /usr/local/bin`. Make sure to replace `chromedriver` with the name of your driver.
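To check whether the move worked, one option is Python's built-in `shutil.which`, which searches the directories in `PATH` for an executable. This is only a sketch, and `find_driver` is a hypothetical helper name.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, otherwise None."""
    return shutil.which(name)


location = find_driver()
if location is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {location}")
```

If you use a different browser, pass the name of its driver instead, for example `find_driver("geckodriver")`.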

In [1]:
from selenium import webdriver
from time import sleep

driver = webdriver.Chrome()
driver.get("https://zoopla.co.uk")

Cool! We see that we've navigated to the Zoopla website. We can search for elements via `xpath` and can also send mouse and keyboard actions through Selenium. Let's recall the challenge we want to solve - extracting data for 100 houses:
- **Sale Price**: Our response variable
- Number of bedrooms
- Square footage
- Description
- Address

We'll focus our efforts on the London area.

In [2]:
driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)

Oh... Looks like cookies are blocking us... We need to find a way to get around this 🤔. Let's start by using xpath to find the "Accept All Cookies" button

In [4]:
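from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

## Find the "Accept all cookies" button
try:
    accept_cookies = driver.find_element(By.XPATH, '//*[@id="cookie-consent-form"]/div/div/div/button[2]')
    print(accept_cookies.text)
except NoSuchElementException:
    # The banner is not on the page (for example, cookies were already accepted)
    print("Cookie button not found")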

Accept all cookies


In [5]:
accept_cookies.click()

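If an XPath matches more than one element, we can find the one we want by checking its text, for example "Accept all cookies". With the cookie banner out of the way, we can open a listing and read its price, address, number of bedrooms and description: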

In [10]:
from selenium.webdriver.common.by import By

price = driver.find_element(By.XPATH, '//span[@data-testid="price"]').text
print(price)
address = driver.find_element(By.XPATH, '//span[@data-testid="address-label"]').text
print(address)
bedrooms = driver.find_element(By.XPATH, '//span[@data-testid="beds-label"]').text
print(bedrooms)
# The description text sits in a span inside this container div
div_tag = driver.find_element(By.XPATH, '//div[@data-testid="truncated_text_container"]')
span_tag = div_tag.find_element(By.XPATH, './/span')
description = span_tag.text
print(description)

£550,000
Walnut Road, Walthamstow E10
4 beds
Icon Estates are delighted to offer for sale this four-bedroom terraced house.
The property is located in the very popular location of Leyton.
This property offers great size throughout and has through lounge, 4 bedrooms, dining room / kitchen, 1 family bathroom, 1WC, partially paved garden and a conservatory. This property is suitable for a family, investors or first-time buyers. It is close to all local amenities including schools, parks and shops.

Bedroom 1 - 3.82m x 3.44m (12'53" x 11'29")
Bedroom 2 - 2.25m x 3.95m (7'38" x 12'96")
Bedroom 3 - 3.83m x 2.27m (12'57" x 7'45")
Bedroom 4 - 3.83m x 2.85m (12'57" x 9'35")
Kitchen - 3.99m x 2.71m (13'1" x 8'89")
Living / Dining Room - 3.45m x 3.85m (11'32" x 12'63")
Bathroom - 1.48m x 2.38m (4'86" x 7'81")
- 0.84m x 1.54m (2'76" x 5'05")

Garden - Partially paved garden at the back.

Disclaimer:

The information provided about this property does not constitute or form part of any offer or cont

We found the button and sent a click action to it. Next, let's store the values we extract in a dictionary, with one list per field:

In [11]:
from selenium.webdriver.common.by import By

dict_properties = {'Price': [], 'Address': [], 'Bedrooms': [], 'Sqft': [], 'Description': []}
price = driver.find_element(By.XPATH, '//span[@data-testid="price"]').text
dict_properties['Price'].append(price)
address = driver.find_element(By.XPATH, '//span[@data-testid="address-label"]').text
dict_properties['Address'].append(address)
bedrooms = driver.find_element(By.XPATH, '//span[@data-testid="beds-label"]').text
dict_properties['Bedrooms'].append(bedrooms)
div_tag = driver.find_element(By.XPATH, '//div[@data-testid="truncated_text_container"]')
span_tag = div_tag.find_element(By.XPATH, './/span')
description = span_tag.text
print(description)

Icon Estates are delighted to offer for sale this four-bedroom terraced house.
The property is located in the very popular location of Leyton.
This property offers great size throughout and has through lounge, 4 bedrooms, dining room / kitchen, 1 family bathroom, 1WC, partially paved garden and a conservatory. This property is suitable for a family, investors or first-time buyers. It is close to all local amenities including schools, parks and shops.

Bedroom 1 - 3.82m x 3.44m (12'53" x 11'29")
Bedroom 2 - 2.25m x 3.95m (7'38" x 12'96")
Bedroom 3 - 3.83m x 2.27m (12'57" x 7'45")
Bedroom 4 - 3.83m x 2.85m (12'57" x 9'35")
Kitchen - 3.99m x 2.71m (13'1" x 8'89")
Living / Dining Room - 3.45m x 3.85m (11'32" x 12'63")
Bathroom - 1.48m x 2.38m (4'86" x 7'81")
- 0.84m x 1.54m (2'76" x 5'05")

Garden - Partially paved garden at the back.

Disclaimer:

The information provided about this property does not constitute or form part of any offer or contract, nor may it be relied upon as representa

In [13]:
dict_properties['Sqft'].append(None)
print(dict_properties)

{'Price': ['£550,000'], 'Address': ['Walnut Road, Walthamstow E10'], 'Bedrooms': ['4 beds'], 'Sqft': [None], 'Description': []}
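The scraped values are strings such as `'£550,000'` and `'4 beds'`, whereas a model will need numbers. A minimal sketch of one way to convert them is shown below. The helper names `parse_price` and `parse_bedrooms` are hypothetical, and the sketch assumes prices are whole pounds and that the first number in the bedrooms label is the bedroom count.

```python
import re


def parse_price(text):
    """Turn a price string such as '£550,000' into an integer number of pounds."""
    digits = re.sub(r"[^\d]", "", text)  # drop everything that is not a digit
    return int(digits) if digits else None


def parse_bedrooms(text):
    """Turn a label such as '4 beds' into an integer."""
    match = re.search(r"\d+", text)  # first run of digits in the label
    return int(match.group()) if match else None


print(parse_price("£550,000"))
print(parse_bedrooms("4 beds"))
```

Both helpers return `None` when no number is present, for example for a listing priced as "POA".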


In [14]:
driver.back()

Awesome! Selenium will allow us to do many other things, such as scroll, click, and send keystrokes. For example, you can run the following cells one by one and observe the results.

In [7]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome()

driver.get("http://www.python.org")

You are now on the official Python website. Let's scroll down to the bottom of the page.

In [None]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

The next cell will look for the search bar and click it.

In [9]:
from selenium.webdriver.common.by import By

search_bar = driver.find_element(By.XPATH, '//*[@id="id-search-field"]')
search_bar.click()

Now that you clicked it, you can send a keystroke to the search bar.

In [10]:
search_bar.send_keys("method")

And once you enter the text, you can 'Press Enter'

In [11]:
search_bar.send_keys(Keys.RETURN)

Good stuff. With that in mind, we can keep going with the Zoopla challenge.

In [13]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
accept_cookies = driver.find_elements(By.XPATH, '//button[@data-responsibility="acceptAll"]')
for button in accept_cookies:
    if button.text == "Accept all cookies":
        relevant_button = button

relevant_button.click()
# Send your driver to sleep before doing anything else, so the webpage doesn't suspect you are a bot

time.sleep(2)

properties = driver.find_elements(By.XPATH, "//ul[@class='listing-results clearfix js-gtm-list']/li")
print(properties)

[]
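A fixed pause is the simplest option. As a hedged alternative, the sketch below waits for a random duration instead, so that requests are not sent at perfectly regular intervals. `polite_sleep` is a hypothetical helper name, the default bounds are arbitrary, and there is no guarantee that any particular website will treat this differently from a fixed delay.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Short bounds here so that the demonstration finishes quickly
waited = polite_sleep(0.1, 0.2)
print(f"waited {waited:.2f} seconds")
```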


From looking at the website, we can see that to access the full description, we need to navigate to the actual page of the listing. Let's create a dictionary which we can use to store all of the acquired data, and a for loop which will loop over the properties we've just obtained and extract the relevant data.

In [15]:
from selenium import webdriver
from selenium.webdriver.common.by import By

def load_and_accept_cookies():
    driver = webdriver.Chrome() 
    URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
    driver.get(URL)
    accept_cookies = driver.find_elements(By.XPATH, '//button[@data-responsibility="acceptAll"]')
    for button in accept_cookies:
        if button.text == "Accept all cookies":
            button.click()  # click the matching button and stop looking
            break

    return driver

In [22]:
from selenium.webdriver.common.by import By

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
accept_cookies = driver.find_elements(By.XPATH, '//button[@data-responsibility="acceptAll"]')
for button in accept_cookies:
    if button.text == "Accept all cookies":
        relevant_button = button

relevant_button.click()
# prop_container = driver.find_element(By.XPATH, '//div[@class="css-kdnpqc-ListingsContainer earci3d2"]')

In [25]:
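from selenium.webdriver.common.by import By

# Each direct child div of the listings container is one property
prop_container = driver.find_element(By.XPATH, '//div[@class="css-kdnpqc-ListingsContainer earci3d2"]')
prop_list = prop_container.find_elements(By.XPATH, './div')
print(len(prop_list))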

25


In [26]:
from selenium.webdriver.common.by import By

driver = webdriver.Chrome() 
URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"
driver.get(URL)
accept_cookies = driver.find_elements(By.XPATH, '//button[@data-responsibility="acceptAll"]')
for button in accept_cookies:
    if button.text == "Accept all cookies":
        relevant_button = button

relevant_button.click()
prop_container = driver.find_element(By.XPATH, '//div[@class="css-kdnpqc-ListingsContainer earci3d2"]')
properties = prop_container.find_elements(By.XPATH, './div')
num_props = len(properties)
for i in range(num_props):
    # Find the container again after each driver.back(), since the old elements go stale
    prop_container = driver.find_element(By.XPATH, '//div[@class="css-kdnpqc-ListingsContainer earci3d2"]')
    house = prop_container.find_elements(By.XPATH, './div')[i]
    house.click()
    time.sleep(5)
    driver.back()


KeyboardInterrupt: 

In [16]:
import time
from selenium.webdriver.common.by import By

def get_properties():

    driver = load_and_accept_cookies()

    data = {"sale_price": [], "num_bedrooms": [], "sqft": [], "description": [], "address": []}
    container_xpath = '//div[@class="css-kdnpqc-ListingsContainer earci3d2"]'
    prop_container = driver.find_element(By.XPATH, container_xpath)
    num_props = len(prop_container.find_elements(By.XPATH, './div'))
    for i in range(num_props):
        # Find the listings again on every iteration: elements found before navigating away go stale
        prop_container = driver.find_element(By.XPATH, container_xpath)
        house = prop_container.find_elements(By.XPATH, './div')[i]
        house.click()
        time.sleep(5)

        # The XPaths below are the ones used earlier in this notebook on a listing page
        ## Find the price and the number of bedrooms and append them to the dictionary
        price_elem = driver.find_element(By.XPATH, '//span[@data-testid="price"]')
        bedrooms_elem = driver.find_element(By.XPATH, '//span[@data-testid="beds-label"]')
        data["sale_price"].append(price_elem.text)
        data["num_bedrooms"].append(bedrooms_elem.text)

        ## No XPath for the square footage was identified above, so store None as before
        data["sqft"].append(None)

        ## Find the description and append it to the dictionary
        div_tag = driver.find_element(By.XPATH, '//div[@data-testid="truncated_text_container"]')
        description_elem = div_tag.find_element(By.XPATH, './/span')
        data["description"].append(description_elem.text)

        ## Find the address
        address_elem = driver.find_element(By.XPATH, '//span[@data-testid="address-label"]')
        data["address"].append(address_elem.text)
        # Go to the previous page
        driver.back()

    return data

properties = get_properties()

In [None]:
print(properties)

### Challenge: Extend the function above to navigate to the next page and continue the data collection
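One possible starting point, offered as a sketch rather than the intended solution: the search URL used above contains the query parameters `pn=1` and `page_size=25`. Assuming that `pn` is the page number (this is a guess based on the URL, so check it in your browser), the URL for each page can be built with the standard library. `page_url` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_url(url, page_number):
    """Return a copy of url with its 'pn' query parameter set to page_number."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)      # maps each parameter name to a list of values
    query["pn"] = [str(page_number)]   # assumption: 'pn' selects the results page
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


URL = "https://www.zoopla.co.uk/new-homes/property/london/?q=London&results_sort=newest_listings&search_source=new-homes&page_size=25&pn=1&view_type=list"

# Four pages of 25 results would cover 100 properties
page_urls = [page_url(URL, n) for n in range(1, 5)]
for link in page_urls:
    print(link)
```

Each of these URLs could then be passed to `driver.get()` in turn, or you could instead find and click the "next page" element with Selenium.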

In [None]:
driver.quit()

In the challenge above, you clicked on every link. Another way to achieve this is to first obtain a list of all the property links, and then iterate through that list and visit each link individually using the `.get()` method.

Give it a try, and if you get stuck, you can check the `zoopla_scraper.py` file in examples.
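If you want a hint first, here is a minimal sketch of the idea. It is not the contents of `zoopla_scraper.py`. The helper names `collect_links` and `visit_each` are hypothetical, and the XPath you pass in has to be worked out from the page yourself. The string `"xpath"` is the value of Selenium's `By.XPATH` constant, which keeps the sketch free of imports.

```python
def collect_links(driver, xpath):
    """Return the unique href values of all elements matching xpath, in page order."""
    links = []
    for element in driver.find_elements("xpath", xpath):
        href = element.get_attribute("href")
        # Skip elements without a link and links we have already seen
        if href and href not in links:
            links.append(href)
    return links


def visit_each(driver, links, handle_page):
    """Open every link with driver.get() and collect what handle_page returns."""
    results = []
    for link in links:
        driver.get(link)
        results.append(handle_page(driver))
    return results
```

Because the links are plain strings collected up front, there is no need to call `driver.back()` or to find the listing elements again after each visit. A call might look like `visit_each(driver, collect_links(driver, "//a"), lambda d: d.title)`, although in practice you would use a more specific XPath than `//a`.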