In this notebook we explore web scraping with the **Selenium** library. Web scraping has many applications; here we use a script to automate the process of downloading CSV files from a website, an approach that can be adapted to other data sources. As an example, we download the CSV file containing the UK **Consumer price inflation** time series, which starts in 1989.

In [1]:
#importing the required libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import pandas as pd
import time

First, we need to set the path to the browser driver executable (here `geckodriver.exe`). Make sure to change it to the directory where you saved your own copy. This example uses the **Firefox** browser, but Selenium supports other browsers as well (e.g. Chrome). To navigate through a webpage with Selenium, a basic understanding of HTML is necessary. <br>

To inspect the elements of a webpage, right-click on the page once it has loaded and then click **Inspect**. We then need to locate an element we can use. The image below shows one example: we want to find the search bar and click on it, so we need an attribute that identifies it uniquely. There are several options, such as locating the element by class or by name. Here we use its **id** (circled in the image). The script below also locates elements by class name and by XPath.

<img src="img/example.jpg" width="1080" height="960" />
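To make "locating an element by id or class" concrete, the sketch below scans a small HTML fragment with Python's standard `html.parser` module instead of Selenium. The fragment is hypothetical: it only reuses the `nav-search` id and `box__clickable` class from this notebook and is not copied from the real page. `AttributeFinder` and `find_tags` are illustrative names.

```python
from html.parser import HTMLParser

# Hypothetical fragment for illustration only; the real page markup will differ.
SAMPLE_HTML = """
<form>
  <input id="nav-search" class="search__input" type="search" name="q">
  <a class="box__clickable" href="/economy">Economy</a>
</form>
"""


class AttributeFinder(HTMLParser):
    """Collect the tag names whose attribute `attr` contains `value`."""

    def __init__(self, attr, value):
        super().__init__()
        self.attr = attr
        self.value = value
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A class attribute may hold several space-separated names,
        # so compare against each token rather than the whole string.
        tokens = (attributes.get(self.attr) or "").split()
        if self.value in tokens:
            self.matches.append(tag)


def find_tags(html, attr, value):
    """Return the tags in `html` whose `attr` attribute includes `value`."""
    finder = AttributeFinder(attr, value)
    finder.feed(html)
    return finder.matches


print(find_tags(SAMPLE_HTML, "id", "nav-search"))         # ['input']
print(find_tags(SAMPLE_HTML, "class", "box__clickable"))  # ['a']
```

An id is expected to be unique within a page, which is why it is usually the most reliable choice; a class name may match several elements, in which case Selenium's single-element lookup returns the first match.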

In [2]:
#Setting the relevant paths
path = r'C:\Program Files (x86)\geckodriver.exe'  #raw string, so the backslashes are not treated as escape characters
link_1 = 'https://www.ons.gov.uk/'
href_1 = '/economy/inflationandpriceindices/timeseries/l55o/mm23'
href_2 = '/generator?format=csv&uri=/economy/inflationandpriceindices/timeseries/l55o/mm23'
element_id = 'nav-search'
element_class = 'box__clickable'
keyword = 'Inflation price'

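#using the Mozilla Firefox browser
#note: executable_path and the find_element_by_* methods used below belong to the Selenium 3 API;
#they were deprecated in Selenium 4.0 and removed in later 4.x releases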
driver = webdriver.Firefox(executable_path=path)
driver.implicitly_wait(0.5)
driver.get(link_1)

#find the search box by its id and click on it
element_1 = driver.find_element_by_id(element_id)
element_1.click()
#keyword to be entered in the search box
element_1.send_keys(keyword)
#wait for one second and then search the keyword
time.sleep(1)
element_1.send_keys(Keys.RETURN)
#wait 2 seconds for the search to be completed
time.sleep(2)
#click on all the relevant elements until we reach the file we are looking for
element_2 = driver.find_element_by_class_name(element_class)
element_2.click()
time.sleep(2)
#we are using the xpath since we want to locate the element by href
element_3 = driver.find_element_by_xpath('//a[@href="'+href_1+'"]')
element_3.click()
time.sleep(2)
element_4 = driver.find_element_by_xpath('//a[@href="'+href_2+'"]')
element_4.click()
#wait as before
time.sleep(2)
#closing the browser
driver.quit()
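The script above relies on fixed `time.sleep` calls, which can be too short on a slow connection and needlessly long on a fast one. As a hedged alternative, the sketch below uses Selenium 4 style calls: a `Service` object for the driver path and an explicit wait that blocks until an element is clickable. The helper names `href_xpath`, `make_firefox_driver` and `click_when_ready` are hypothetical, and the 10 second timeout is an arbitrary assumption. This is a sketch, not a drop-in replacement tested against the live site.

```python
def href_xpath(href):
    """Build the XPath used in the script above to match a link by its exact href."""
    return '//a[@href="' + href + '"]'


def make_firefox_driver(driver_path):
    """Start Firefox via a Service object (Selenium 4 style)."""
    # Imported lazily so that href_xpath can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    return webdriver.Firefox(service=Service(driver_path))


def click_when_ready(driver, by, value, timeout=10):
    """Wait up to `timeout` seconds for an element to be clickable, then click it."""
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    element = WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((by, value))
    )
    element.click()
    return element


# Possible usage with the variables defined in the cell above
# (By comes from selenium.webdriver.common.by):
#   driver = make_firefox_driver(path)
#   driver.get(link_1)
#   click_when_ready(driver, By.ID, element_id)
#   click_when_ready(driver, By.XPATH, href_xpath(href_1))
#   driver.quit()
```

If explicit waits are adopted, it is generally advisable to drop the `implicitly_wait` call, since the Selenium documentation warns that mixing the two can lead to unpredictable wait times.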

If everything worked as expected, we should have successfully downloaded the file. We can then use a simple function to check if the file exists and if so we can open it.

In [3]:
#function to open the file if it exists
def open_file(file):
    if os.path.isfile(file):
        #skip the first 7 lines of the file before reading the header row
        return pd.read_csv(file, skiprows=7)
    else:
        print('File does not exist')
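A fixed two-second pause before closing the browser may not always be enough for the download to finish. One option is to poll the download folder until the file shows up. This is a minimal sketch under stated assumptions: the `series-*.csv` pattern is inferred from the single file name used in this notebook, the `.part` check reflects how Firefox names in-progress downloads, and `wait_for_download` is a hypothetical helper.

```python
import time
from pathlib import Path


def wait_for_download(directory, pattern="series-*.csv", timeout=30.0, poll=0.5):
    """Return the newest file matching `pattern`, or None if none appears in time.

    Files ending in .part are treated as downloads still in progress.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        in_progress = list(directory.glob("*.part"))
        # Sort by modification time so the most recent download comes last.
        matches = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime)
        if matches and not in_progress:
            return matches[-1]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


# Possible usage (the folder is whatever your browser downloads to):
#   csv_path = wait_for_download(r'C:\Users\Geo\Downloads')
#   if csv_path is not None:
#       df = open_file(str(csv_path))
```

Returning the path also avoids hard-coding the date-stamped file name, which changes from one download to the next.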

Lastly, let's have a look at the first few rows of the dataset.

In [4]:
df = open_file(r'C:\Users\Geo\Downloads\series-070721.csv')
df.head()

   Important notes  Unnamed: 1
0             1989         5.7
1             1990         8.0
2             1991         7.5
3             1992         4.6
4             1993         2.6
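The column labels in this output (`Important notes` and `Unnamed: 1`) are not descriptive, presumably because the row that pandas reads as the header belongs to the file's preamble rather than to the data. A small sketch of renaming them is shown below. It rebuilds only the five rows displayed above, and the new names `period` and `inflation_rate` are hypothetical choices, not names taken from the file.

```python
import pandas as pd

# The five rows exactly as displayed in the output above.
df_head = pd.DataFrame({
    "Important notes": [1989, 1990, 1991, 1992, 1993],
    "Unnamed: 1": [5.7, 8.0, 7.5, 4.6, 2.6],
})

# 'period' and 'inflation_rate' are illustrative names chosen for readability.
tidy = df_head.rename(
    columns={"Important notes": "period", "Unnamed: 1": "inflation_rate"}
)

print(tidy.columns.tolist())  # ['period', 'inflation_rate']
```

The same `rename` call can be applied to the full `df` returned by `open_file`.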
