# Getting started with Web Scraping in Python


(*juliankanjere@gmail.com, May 2019*)

This tutorial is intended as a starting point for Web Scraping. You may be wondering, what is Web Scraping? The English definition of scraping (*according to Dr Google*) is to drag or pull a sharp implement across a surface so as to remove dirt or other matter. By analogy, Web Scraping is an automated way to retrieve a Web Page and extract information from it using code. Why would one need Web Scraping? Suppose one wanted to store data from a website locally in order to do some data analysis or aggregation. For a website with, say, 1000 pages, doing this by hand would be incredibly time consuming: one would need to load each page, copy and paste repeatedly, and hope not to make a mistake. The alternative is to use a Web Scraping application to collect the data automatically. This not only saves time but is also likely to be more accurate than manual effort.

In this tutorial, we walk through an example of a simple Web Scraping application written in Python 3. We will be scraping a Web Page that has a list of different courses, their descriptions and additional information. Once we have scraped the course information, we persist the data in a CSV file.

This tutorial assumes some working knowledge of Python and Web Programming concepts (such as HTML, CSS, requests and responses).



## 1. Introduction

This tutorial is broken up into sections that resemble the typical flow of a Web Scraping program written in Python, which is as follows:
- Define a list of URLs that point to the Web Pages that will be scraped.
- Loop over the list of URLs and for each URL
    - Issue a web request to the Web Server.
    - Receive a response from the Web Server.
    - Parse the response and store the parsed data in a data frame.
- Persist the data frame to the hard disk (in the form of a CSV file or an Excel spreadsheet).



## 2. Import Libraries

The key Python libraries used include:
- `Pandas` - Pandas stands for Python Data Analysis Library. It is a library with powerful data structures for data analysis, time series and statistics.
- `Beautiful Soup` - Screen scraping Python library used for pulling data out of HTML and XML files.
- `requests` - Python library for sending HTTP requests to a Web Server.
- `time` - The Python time module provides many ways of representing time in code, such as objects, numbers, and strings.
- `random` - Python module for generating pseudo-random numbers.
- `os` - Python module that provides a way of using operating system dependent functionality, such as interacting with the File System.

These libraries are imported below.

In [1]:
from bs4 import BeautifulSoup
import requests
from random import randint
import pandas as pd
import time
import os

## 3. Initial Setup

In this example, we will be scraping data from the African Institute of Financial Markets and Risk Management's (AIFMRM) list of online short courses. You can navigate to this [link](http://www.aifmrm.uct.ac.za/education/online-short-courses/ "AIFMRM list of online short courses") to see the courses and the details available for each course. You will see from this page that each course has details such as Course Name, Course Description and Course Further Information. The Course Further Information consists of one or more links.

![title](img/webscrap1.png)

We would like to store Course Name, Course Description and Course Further Information in tabular form with one row per course.

The initial setup involves the following:
- Defining the URL for the website that we will be scraping from. See `BASE_URL` and `site_urls_list`.
- Updating the User Agent HTTP header of the Web Request that we will be sending to the server when we access the URL specified above. This is done so that the Web Request appears to be coming from a Web Browser instead of a program. Some Web Servers are configured to deny Web Requests that do not come from a Web Browser, for example because hackers write bots that try to brute force logins on web applications. Therefore, if we do not set the User Agent so that the request appears to come from a Web Browser, there is a chance that our Web Scraper will not be able to access the online short course web page. See `headers`.
- Defining the data buckets that will hold each of the course details of interest. Put another way, we set up a Python list to represent each column containing some detail of the course. Recall that the final output of this exercise should be the course information in tabular form with one course per row. The columns we have decided on are `title`, `description`, `source_url`, `further_info1_description`, `further_info1_url`, `further_info2_description`, `further_info2_url`, `further_info3_description`, `further_info3_url`, `further_info4_description` and `further_info4_url`. If you looked at this [link](http://www.aifmrm.uct.ac.za/education/online-short-courses/ "AIFMRM list of online short courses"), you will have noticed that each course generally has one or two items under Further Information. In our design, we make provision for up to four items, each item having a description (the text) and a URL (the link that the user is navigated to when they click on the item).
- Setting up a counter for the number of Web Pages that have been scraped in order to give a summary at the end. See `n_pages`.
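The effect of overriding the User Agent can be checked without contacting the server. The sketch below is only an illustration: it builds a request with the same `headers` dictionary but never sends it, and compares the header that would be sent against the default that `requests` uses when no override is given.

```python
import requests

BASE_URL = 'http://www.aifmrm.uct.ac.za/education/online-short-courses/'
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) '
           'Chrome/41.0.2228.0 Safari/537.36'}

# What requests announces itself as when the header is not overridden
default_agent = requests.utils.default_user_agent()
print('Default User-Agent:', default_agent)

# Build (but do not send) the GET request so that its headers can be inspected
session = requests.Session()
prepared = session.prepare_request(requests.Request('GET', BASE_URL, headers=headers))
print('Spoofed User-Agent:', prepared.headers['User-Agent'])
```

The default value starts with `python-requests/`, which is exactly what a server-side filter could use to recognise a script.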

In [2]:
#url we will be scraping from
BASE_URL = 'http://www.aifmrm.uct.ac.za/education/online-short-courses/'

#Spoof the User Agent HTTP header so that queries look like they are coming from an actual browser instead of python-requests/version
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})

# setting up the lists that will form pandas dataframe with all the results
title = []
description = []
source_url = []
further_info1_description = []
further_info1_url = []
further_info2_description = []
further_info2_url = []
further_info3_description = []
further_info3_url = []
further_info4_description = []
further_info4_url = []

n_pages = 0

#list to hold the urls we will be scraping
site_urls_list = []
site_urls_list.append(BASE_URL)
empty_list = ["",""]


## 4. Web Request and Response

At a very high level, when a user enters a URL into their browser, the browser (as an HTTP client) sends a request message to the HTTP server (where the web page/application resides). The server, in turn, returns a response message which is the web page that was requested. When Web Scraping, we make use of the Python `requests` library, which sends the HTTP request and receives the HTTP response. Assuming that the request was successful and the server sends a response, the response is generally in the form of HTML (which a Browser would render in a well formatted way).

In the code snippet that follows Section 5, you will notice that we perform some basic error handling using Python's `try except`. We have set a timeout of 60 seconds, which is to say that if we do not receive a response from the Web Server within 60 seconds, we abort the operation. We also make provision for the case where there is an issue with the network and we do not receive the response that we were expecting.
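The error handling described above can also be packaged as a small helper. The sketch below is one possible arrangement, not the code used later in this tutorial: the function name `fetch_html` is made up for illustration, and it additionally calls `raise_for_status()` so that 4XX and 5XX responses are treated as failures rather than parsed.

```python
import requests


def fetch_html(url, headers=None, timeout=60):
    """Return the page HTML as a string, or None if the request failed.

    fetch_html is a hypothetical helper name used for this illustration.
    """
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Turn 4XX/5XX status codes into an exception (requests.exceptions.HTTPError)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print('Connection timed out after {} seconds, please try later'.format(timeout))
        return None
    except requests.exceptions.RequestException as exc:
        # Base class for requests errors: connection problems, bad URLs, HTTP errors
        print('Request failed: {}'.format(exc))
        return None
    return response.text
```

A caller can then skip a page simply by checking for `None`, for example `html = fetch_html(site_url, headers=headers)` followed by `if html is None: continue`.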

Once we have received the raw HTML response from the server, we convert it into a Beautiful Soup object.


## 5. Parse Data

Beautiful Soup is the backbone of the Web Scraping exercise. The `page_html` object that we create from the response is a Beautiful Soup object that can be queried based on the HTML elements it holds. Below is a snippet of the raw HTML that we will be parsing to return the course details.
![title](img/webscrap.png)

Looking at the source HTML, one can see that:
- The course information is found in a `div` whose CSS class is `course accordian`.
- Each course is self contained in an `article` element.
- Course Title is in a `label` element whose `for` attribute corresponds to the accordion identifier of the course, e.g. *accordion_1* (this is the identifier that the code constructs).
- Course Description is in a `div` whose class is `content`.
- Further information is in an unordered HTML list where each list entry is denoted by `li`.
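To get a feel for these queries before running them against the live page, they can be tried on a small, made-up HTML fragment. Everything in the fragment below (course names, descriptions, URLs) is hypothetical and only mimics the structure listed above. It is a minimal sketch rather than the tutorial's scraper.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that mimics the structure described above
sample_html = """
<div class="course accordian">
  <article>
    <label for="accordion_1">Example Course A</label>
    <div class="content">An example description.</div>
    <ul><li><a href="http://example.com/a">Brochure</a></li></ul>
  </article>
  <article>
    <label for="accordion_2">Example Course B</label>
    <div class="content">Another description.</div>
    <ul></ul>
  </article>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
section = soup.find('div', class_='course accordian')

rows = []
for number, article in enumerate(section.find_all('article'), start=1):
    # find() returns None when nothing matches, so guard before reading the text
    label = article.find('label', {'for': 'accordion_{}'.format(number)})
    content = article.find('div', class_='content')
    links = []
    for item in article.find_all('li'):
        anchor = item.find('a', href=True)
        if anchor is not None:
            links.append((anchor.get_text(strip=True), anchor['href']))
    rows.append((
        label.get_text(strip=True) if label else '',
        content.get_text(strip=True) if content else '',
        links,
    ))

for row in rows:
    print(row)
```

The second course has no list entries, so its list of links is empty. This is the situation the scraper has to allow for when filling the Further Information columns.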

Using this information, we proceed to query the `page_html` Beautiful Soup object, starting by returning only the HTML within the `div` with class set to `course accordian`. This gives us the `course_section`, which we then query for all `article` elements (there is one `article` element for each course). We finally iterate over all the courses to return the individual details of interest. For each of the course details of interest, we employ error handling because it is not always the case that the HTML is well formed (e.g. the Web Developer might not have been very thorough in labelling and identifying elements) and we would not want our Web Scraper to fall over when it comes across an irregularity.

For each course, we return the details of interest and then assign these details to the column buckets mentioned earlier, i.e. we append the details to the respective Python list for the column. It follows that at the end of the exercise, all of the column lists are expected to have the same length. It is therefore necessary to update the lists with empty strings in cases where a column does not have the expected data (this is particularly relevant for the further_information fields).
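The padding idea can be shown in isolation. The helper below is a sketch with a made-up name (`pad_further_info`). It takes however many (description, URL) pairs a course has and always returns exactly four, filling the gaps with empty strings so that every column list grows by one entry per course.

```python
MAX_ITEMS = 4  # the tutorial's design allows up to four Further Information items


def pad_further_info(items, max_items=MAX_ITEMS):
    """Return exactly max_items (description, url) pairs.

    Extra items are dropped and missing ones are filled with empty strings.
    pad_further_info is a hypothetical helper used only for this illustration.
    """
    padded = list(items[:max_items])
    padded += [("", "")] * (max_items - len(padded))
    return padded


# A course with a single Further Information item (made-up values)
print(pad_further_info([("Brochure", "http://example.com/brochure")]))
```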

In this example, since there are only three courses, all on one page, we are only scraping one page. If there were dozens of courses spanning multiple pages (with a similar HTML format), our scraper would proceed onto the next page (assuming that the page had been added to the `site_urls_list` list). However, in order not to exhaust a Web Server's resources with multiple requests in quick succession, it is good practice to add a delay between requests. In this example, we pause for either one or two seconds (chosen at random) after each page using `time.sleep(randint(1, 2))`.
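Because `randint(1, 2)` only ever returns the whole numbers 1 or 2, the delay has just two possible values. If a less regular rhythm is wanted, `random.uniform` gives a fractional delay anywhere in the range. The helper below is a sketch of that variation (the name `polite_pause` is made up) and is not what the tutorial code uses.

```python
import random
import time


def polite_pause(min_seconds=1.0, max_seconds=2.0):
    """Sleep for a random, possibly fractional, number of seconds.

    Returns the delay that was used. polite_pause is a hypothetical helper name.
    """
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

Inside the scraping loop, the call `polite_pause()` would take the place of `time.sleep(randint(1, 2))`.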

Lastly, once we are done with the scraping, we print out a summary of the number of pages we have scraped and the number of courses.

In [3]:
for site_url in site_urls_list:
    n_pages += 1
    print("%r %r " % ("Page: ", n_pages))

    #4. Web Request and Response
    #Make a simple GET request (just fetching a page)
    try:
        r = requests.get(site_url, headers=headers, timeout=60)  # wait up to 60 seconds
    # explicitly handle timeout
    except requests.exceptions.Timeout:
        print('Connection timed out after 60 seconds, please try later')
        continue #skip onto next iteration of for loop
    # explicitly handle other network issues
    except requests.exceptions.RequestException:
        print('Network error occurred, please try later')
        continue  # skip onto next iteration of for loop

    #Check response code from the web server, a code of 200 is good, 4XX or 5XX errors are indicative of something gone wrong
    print("%r %r " % ("HTTP Status Code: ", r.status_code))

    #use r.text to access the returned HTML of the page in a single string
    #convert the HTML text from the response into a BeautifulSoup Document Object Model'ish structure that can be queried
    page_html = BeautifulSoup(r.text, 'html.parser')
    
    #5. Parse Data
    course_section = page_html.find_all(class_="course accordian")
    if course_section != []:
        course_details = course_section[0].find_all('article')
        for course_id, course in enumerate(course_details):
            accordion_id = "accordion_" + str(course_id + 1) #construct ID of the HTML elements in course accordion div
            further_info_dict = {}
            try:
                course_title = course.find('label', {'for': accordion_id}).text.replace('\n', '')
                # e.g. Business Risk Management Short Course
            except AttributeError:
                # find() returns None when no matching element exists, so .text raises AttributeError
                course_title = 'null'

            try:
                course_description = course.find('div', {'class': 'content'}).text.replace('\n', '')
                # e.g. This 10 week course...
            except AttributeError:
                course_description = 'null'

            course_further_details = course.find_all('li')
            for i, course_further_item in enumerate(course_further_details):
                try:
                    course_further_item_description = course_further_item.find_all('a', href=True)[0].contents[0]
                except IndexError:
                    course_further_item_description = ""

                try:
                    course_further_item_url = course_further_item.find_all('a', href=True)[0]['href']
                except IndexError:
                    course_further_item_url = ""

                further_info_dict[i] = [course_further_item_description, course_further_item_url]

            #update lists with column data, these lists need to be same size
            title.append(course_title)
            description.append(course_description)
            source_url.append(site_url)

            #we assume maximum of 4 further info items for each course
            further_info1 = further_info_dict.get(0, empty_list)
            further_info2 = further_info_dict.get(1, empty_list)
            further_info3 = further_info_dict.get(2, empty_list)
            further_info4 = further_info_dict.get(3, empty_list)

            further_info1_description.append(further_info1[0])
            further_info1_url.append(further_info1[1])
            further_info2_description.append(further_info2[0])
            further_info2_url.append(further_info2[1])
            further_info3_description.append(further_info3[0])
            further_info3_url.append(further_info3[1])
            further_info4_description.append(further_info4[0])
            further_info4_url.append(further_info4[1])
    else:
        continue

    # Sleep for 1 or 2 seconds before requesting the next url
    time.sleep(randint(1, 2))

print('Summary: scraped {} pages containing {} courses.'.format(n_pages, len(title)))

'Page: ' 1 
'HTTP Status Code: ' 200 
Summary: scraped 1 pages containing 3 courses.


## 6. Persist Data

Once the major work is complete, i.e. the Web Scraping, we have all the data held in memory (more specifically in the different column lists we created earlier). The object of a Web Scraping exercise is often to retrieve data which will then be analysed, and therefore it makes sense to persist the data. One could persist the data in a database table or in a flat file (Excel or CSV).

In this example, we proceed to persist the data in a CSV file. To do this, we make use of a nifty library called Pandas. Pandas makes it easy to work with data in tabular form, more specifically in a Data Frame. We combine the various column lists into a single Data Frame. Once we have a Pandas Data Frame, it is very easy to save the data as a CSV file by calling the `to_csv` method on the Data Frame.

In [4]:
#  save all these variables in a single dataframe
column_names = ['Title', 'Description', 'Source URL', 'Further Info 1', 'Further Info 1 URL', 'Further Info 2',
        'Further Info 2 URL','Further Info 3', 'Further Info 3 URL', 'Further Info 4', 'Further Info 4 URL']

course_data = pd.DataFrame({
                            'Title': title,
                            'Description': description,
                            'Source URL': source_url,
                            'Further Info 1': further_info1_description,
                            'Further Info 1 URL': further_info1_url,
                            'Further Info 2': further_info2_description,
                            'Further Info 2 URL': further_info2_url,
                            'Further Info 3': further_info3_description,
                            'Further Info 3 URL': further_info3_url,
                            'Further Info 4': further_info4_description,
                            'Further Info 4 URL': further_info4_url
                            })[column_names]

timestr = time.strftime("%Y%m%d-%H%M%S")
filename = 'course_data_raw_' + timestr + '.csv'
currentDirectory = os.getcwd()
try:
    course_data.to_csv(filename, index=False, header=True)
    print('{} successfully saved to {}'.format(filename, currentDirectory))
except OSError:
    print('Error occurred, {} was not saved to {}. Please try again.'.format(filename, currentDirectory))
#to read the csv file back into a dataframe
#course_data = pd.read_csv(filename)

#alternatively you can write to an Excel file
#you will need the openpyxl library to write .xlsx files (recent pandas versions no longer support xlwt, which wrote the older .xls format)
#filename = 'course_data_raw_' + timestr + '.xlsx'
#course_data.to_excel(filename, index=False)
#to read the Excel file
#course_data = pd.read_excel(filename)


course_data_raw_20190510-144324.csv successfully saved to /Users/juliankanjere/Documents/Academic/MPhil Data Science/Course Material/DOC5039F Financial Software Engineering/KNJJUL001
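Reading the data back is a quick way to confirm that the write worked. One detail is worth knowing, given that missing Further Information fields are stored as empty strings: by default `pd.read_csv` loads empty fields as `NaN`. The sketch below uses an in-memory buffer instead of a file and two made-up rows (hypothetical data) to show that passing `keep_default_na=False` keeps them as empty strings.

```python
import io

import pandas as pd

# Two made-up rows; the second has no Further Information URL
sample = pd.DataFrame({
    'Title': ['Example Course A', 'Example Course B'],
    'Further Info 1 URL': ['http://example.com/a', ''],
})

# Write to an in-memory text buffer rather than to disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# keep_default_na=False stops empty fields from being converted to NaN
restored = pd.read_csv(buffer, keep_default_na=False)
print(restored)
```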


## 7. Other Considerations
Other considerations that have not been discussed in this tutorial with respect to Web Scraping include permissions, use of sessions and use of proxy servers. These are briefly discussed below.

**Permissions**

In some cases, one might require permission from a Website's Administrator to scrape their site for data. It is recommended to request approval and to ensure that one is not in contravention of any cyber laws specific to the jurisdiction they find themselves in before beginning to scrape a website.

**Sessions**

HTTP is generally stateless (i.e. Web requests are independent of each other). However, some websites require a user to log in in order to access a protected section of the website. These websites then employ some mechanism to identify the logged-in user across multiple requests. This can be achieved by use of cookies: when a user first logs in successfully to a website, a session cookie is set that is then used to identify the user to the website. As long as future requests send this cookie along, the site knows the user and what functionality they have access to. In some cases, when performing Web Scraping, a user might need to be logged in in order to scrape information that they would otherwise not have access to. The code below shows an example of sessions in Python, using the requests library.


**Proxy Server**

The Web Server that one accesses whilst scraping will be able to see the source IP address of the Web Request. The source IP address can be masked by use of a Proxy Server. The Python requests library also supports proxy servers. See an example below.



In [None]:
############Sessions#########################
# create a session
# session = requests.Session()

# use session to make a login POST request (i.e. sending email and password hence POST)
# session.post("http://apple.com/login", data=dict(email="steve@apple.com",password="jobs"))

# assuming successful login, subsequent requests using the session will automatically include the cookie
#r = session.get("http://apple.com/user_only_content")


############Proxy Servers#####################
# assuming a proxy server is set up on localhost (127.0.0.1); replace "port" with the proxy's port number
#r = requests.get("http://apple.com/user_only_content", proxies=dict(http="http://proxy_username:proxy_password@127.0.0.1:port"))

##############################################
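The way a session carries a cookie from one request to the next can be seen without logging in anywhere. In the sketch below the cookie is placed in the session by hand as a stand-in for the cookie a server would set after a successful login. The cookie name, its value and the URL are all made up. The request is built but never sent, so that its headers can be inspected.

```python
import requests

session = requests.Session()

# Stand-in for the session cookie a server would set after a successful login
# (hypothetical name and value)
session.cookies.set('sessionid', 'abc123')

# Build (but do not send) a later request made through the same session
request = requests.Request('GET', 'http://example.com/members-only')
prepared = session.prepare_request(request)

# The session has attached its cookie to the outgoing request
print(prepared.headers.get('Cookie'))
```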


## 8. Next Steps
In this tutorial, we have explored Web Scraping using Python. The main libraries used in this exercise were `Beautiful Soup`, `requests` and `Pandas`. We scraped Course Information from the AIFMRM online short courses web page and persisted this data to disk in CSV format.

A great next step would be to build on this to create a Web Scraper that pulls cars for sale from a listings website and downloads the images of the cars. After all, the best way to learn anything is by doing.

Good Luck and Happy Scraping!