# Lab session 8: Web Mining

## Introduction 

The aim of this lab is for students to get experience with **Web Mining** methods covered in week 9.

- This lab is the first part of a **two-week assignment** that covers weeks 9 and 10.
- This lab corresponds to **Assignment 4**, which is due on **Tuesday 8th December at 10am** and accounts for 10% of your overall grade. Questions in this lab sheet contribute 5% of your overall grade; questions in the lab sheet for week 10 contribute the other 5%.
- <font color = 'maroon'>The last section of this notebook includes the questions that are assessed towards your final grade.</font> 

## Important notes about the assignment: 

- **PLAGIARISM** <ins>is an irreversible non-negotiable failure in the course</ins> (if in doubt of what constitutes plagiarism, ask!). 
- The total assessed coursework is worth 40% of your final grade.
- There will be 9 lab sessions and 4 assignments.
- One assignment will cover 2 consecutive lab sessions and will be worth 10 marks (percentages of your final grade).
- The submission cut-off date will be 7 days after the deadline and penalties will be applied for late submissions in accordance with the School policy on late submissions.
- You are asked to submit a **report** that answers the questions specified in the last section of this notebook. The report should be in **PDF format** (so **NOT** *doc, docx, notebook* etc). It should be clearly identified with your name, student number, assignment number (for instance, Assignment 4), and module, and marked with question numbers. 
- No other means of submission other than submitting your assignment through the appropriate QM+ link are acceptable at any time. Submissions sent via email will **not** be considered.
- Please name your report as follows: Assignment4-StudentName-StudentNumber.pdf
- Cases of **Extenuating Circumstances (ECs)** have to go through the proper procedure of the School in due time. Only cases approved by the School in due time can be considered.

## Web Scraping using Python

Web scraping is a term used to describe the use of a program or algorithm to extract and process large amounts of data from the web. In this lab notebook, we will be working on data extraction from the web using Python's [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) module. Please make sure to familiarise yourselves with the [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), or to refer to the documentation if you need more information for a particular class or function.

The dataset used in the first 3 sections of the notebook is taken from a 10km race that took place in Hillsboro, USA in June 2017.

### 1. Opening HTML content

Most websites are created using HTML (Hypertext Markup Language), along with CSS (Cascading Style Sheets) and JavaScript. HTML elements are delimited by tags, and they directly introduce content to the web page. Here is what a basic HTML document looks like: [https://www.w3schools.com/html/html_basic.asp](https://www.w3schools.com/html/html_basic.asp) - please take some time to study the link and the HTML code; we will come back to it later in the tutorial.

In that example, the content of the first heading is contained between the ‘h1’ tags, and the first paragraph is contained between the ‘p’ tags. On a real website, we need to find out which tags enclose the relevant data and tell our scraper where to look. We also need to specify which links should be explored and where they can be found within the HTML file. With all this information, our scraper should be able to gather the required data.

We start by loading standard Python modules for data mining:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In order to perform web scraping, we also import the libraries shown below. The urllib.request module is used to open URLs. The Beautiful Soup package is used to extract data from HTML files. The Beautiful Soup library's name is bs4 which stands for Beautiful Soup, version 4.

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

After importing the necessary modules, we specify the URL containing the dataset mentioned above and pass it to urlopen() to get the HTML contents of the page:

In [3]:
url = "https://www.hubertiming.com/results/2017GPTR"
html = urlopen(url)

### 2. Parsing HTML content 

Opening the HTML content of the page is just the first step. The next step is to create a Beautiful Soup object from the HTML content. This is done by passing the html content to the BeautifulSoup() function. The Beautiful Soup package is used to parse the HTML content, that is, take the raw HTML text and break it into Python objects. The second argument 'lxml' is the HTML parser whose details we do not need to worry about at this point.

In [4]:
soup = BeautifulSoup(html, 'lxml')
print(type(soup))

<class 'bs4.BeautifulSoup'>
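To see the parsing step in isolation, the following minimal sketch parses a small inline HTML string rather than a live page. The string html_doc and the variable names are made up for illustration, and Python's built-in 'html.parser' is used in place of 'lxml' only so that the example has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document, similar to the basic example linked in section 1
html_doc = """
<html>
  <head><title>Page Title</title></head>
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>
"""

# 'html.parser' is Python's built-in parser (an assumption made here to avoid needing lxml)
soup_demo = BeautifulSoup(html_doc, 'html.parser')

heading = soup_demo.h1.get_text()    # text between the 'h1' tags
paragraph = soup_demo.p.get_text()   # text between the first pair of 'p' tags
print(heading)
print(paragraph)
```

Accessing a tag name as an attribute (soup_demo.h1) returns the first matching tag only; find_all() is needed to get every match.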


The soup object lets us extract information about the webpage we are scraping, such as the title of the page, as shown below:

In [5]:
# Get the title
title = soup.title
print(title)

<title>Race results for the 2017 Intel Great Place to Run \ Urban Clash Games!</title>


We can also get the text of the webpage and quickly print it out to check if it is what we expect:

In [6]:
# Print out the text
text = soup.get_text()
#print(soup.text)

We can view the html content of the [webpage we are scraping](https://www.hubertiming.com/results/2017GPTR) by opening the webpage in another tab in a web browser, right-clicking anywhere on the webpage and selecting "View Source" or "View Page Source" (for Chrome and Firefox respectively - similar options to view the HTML source exist for other web browsers). 

Please take some time to inspect the html content of the webpage and look for examples of useful tags. Examples of useful tags include < a > for hyperlinks, < table > for tables, < tr > for table rows, < th > for table headers, and < td > for table cells.

We can use the find_all() method of soup to extract useful html tags within a webpage. The code below shows how to extract all the **hyperlinks** within the webpage:

In [7]:
soup.find_all('a')

[<a href="mailto:timing@hubertiming.com">timing@hubertiming.com</a>,
 <a href="https://www.hubertiming.com/">Huber Timing Home</a>,
 <a class="btn btn-primary btn-lg" href="/results/2017GPTR10K" role="button" style="margin: 0px 0px 5px 5px"><i aria-hidden="true" class="fa fa-user"></i> 10K</a>,
 <a class="btn btn-primary btn-lg" href="/results/summary/2017GPTR" role="button" style="margin: 0px 0px 5px 5px"><i class="fa fa-stream"></i> Summary</a>,
 <a id="individual" name="individual"></a>,
 <a data-url="/results/2017GPTR" href="#tabs-1" id="rootTab" style="font-size: 18px">5K Results</a>,
 <a href="https://www.hubertiming.com/"><img height="65" src="https://www.hubertiming.com//sites/all/themes/hubertiming/images/clockWithFinishSign_small.png" width="50"/>Huber Timing</a>,
 <a href="https://facebook.com/hubertiming/"><img src="https://www.hubertiming.com/results/FB-f-Logo__blue_50.png"/></a>,
 <a class="small" id="bestFeatureEver" style="color:#007bff">Dark Mode</a>]

As we can see from the output above, HTML tags sometimes come with attributes such as *class* and *src*. These attributes provide additional information about html elements. We can use a for loop and the get("href") method to extract and print out only hyperlinks:

In [8]:
all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href"))

mailto:timing@hubertiming.com
https://www.hubertiming.com/
/results/2017GPTR10K
/results/summary/2017GPTR
None
#tabs-1
https://www.hubertiming.com/
https://facebook.com/hubertiming/
None
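The None entries above come from 'a' tags that have no 'href' attribute, and some of the remaining links are relative to the page URL. As a rough sketch (the HTML snippet and variable names below are illustrative, not the full race page), we could skip the missing values and resolve relative links with urljoin from the standard library:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.hubertiming.com/results/2017GPTR"

# Illustrative snippet: one relative link, one anchor without href, one absolute link
snippet = """
<a href="/results/2017GPTR10K">10K</a>
<a id="individual" name="individual"></a>
<a href="https://www.hubertiming.com/">Huber Timing Home</a>
"""
soup_links = BeautifulSoup(snippet, 'html.parser')  # built-in parser, used here for simplicity

absolute_links = []
for link in soup_links.find_all('a'):
    href = link.get('href')          # returns None when the attribute is missing
    if href is None:
        continue                     # skip anchors that carry no hyperlink
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```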


To print out table rows only, pass the 'tr' argument in soup.find_all():

In [9]:
# Collect all table rows (uncomment the print below to view the first 10)
rows = soup.find_all('tr')  # the 'tr' tag in html denotes a table row
#print(rows[:10])

### 3. Converting an HTML table into a Pandas dataframe

The goal of this lab notebook is to take a table from a webpage and convert it into a pandas dataframe for easier manipulation using Python. For an example of how tables are encoded in HTML, please study the following [example table](https://www.w3schools.com/tags/tag_tr.asp). As you'll see from that example, rows in HTML tables are identified using the 'tr' tag; header cells are identified by the 'th' tag; and each data cell in the table is identified by the 'td' tag. Please take some time to familiarise yourselves with the example table's HTML code in the above link.


To convert the HTML table to a pandas dataframe, we should get all table rows in list form first and then convert that list into a dataframe. Below is a for loop that iterates through the table rows and finds the cells of each row (uncomment the print statement to display them):

In [10]:
for row in rows:
    row_td = row.find_all('td')  # the 'td' tag in html code denotes a table cell
    #print(row_td)
type(row_td)

bs4.element.ResultSet

If you uncomment the print statement above, you will see that each row is printed with HTML tags embedded in it. This is not what we want. We can remove the HTML tags using Beautiful Soup or regular expressions. Using regular expressions (which requires importing the *re* module) is discouraged here, since it takes several lines of code and it is easy to make mistakes.

The easiest way to remove html tags is to use Beautiful Soup, and it takes just one line of code to do this. We pass the string of interest into BeautifulSoup() and use the get_text() method to extract the text without html tags. The following is an example of removing html tags for one row of the table:

In [11]:
str_cells = str(row_td)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
print(cleantext)

[1458, 1400, 

                    SUMALATHA PURMA

                , F, PORTLAND, OR, 1:48:13, 34:54, 1:48:13]


Having removed HTML tags for one row, we can now convert the entire HTML table into a pandas dataframe.

First, we try to scrape the header of the table, which includes names for all the table attributes. We create an empty list object, and using Beautiful Soup we locate the 'th' HTML tag which denotes a table header. We convert the header from HTML to a string, and then append it to the list object.

In [12]:
# Create an empty list where the table header will be stored
header_list = []

# Find the 'th' html tags which denote table header
col_labels = soup.find_all('th')
col_str = str(col_labels)
cleantext_header = BeautifulSoup(col_str, "lxml").get_text()  # extract the text without HTML tags
header_list.append(cleantext_header) # Add the clean table header to the list

print(header_list)

['[Place, Bib, Name, Gender, City, State, Chip Time, Chip Pace, Gun Time]']


We see that the header above contains 9 elements, separated by commas. 

Now, we do the same process as above but for every row in the table that contains cell elements identified by the 'td' tag:

In [13]:
# Create an empty list where the table will be stored
table_list = []

# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()  # extract the text without HTML tags
    table_list.append(row_cleantext)  # Add the clean table row to the list
    
#print(table_list)

If we print table_list, we see that it includes all the information stored in the original table, where elements in each row are separated by commas. We also see a lot of special and unnecessary characters that will need to be removed later on.

Now, we have a python list object for the header called 'header_list' and another list object for the main table called 'table_list'. We can now convert the header list into a pandas dataframe:

In [14]:
df_header = pd.DataFrame(header_list)
df_header.head()

Unnamed: 0,0
0,"[Place, Bib, Name, Gender, City, State, Chip T..."


The dataframe is not in the format we want, since it only includes one column instead of 9 columns. To clean it up, we should split the "0" column into multiple columns at the comma position. This is accomplished by using the str.split() method:

In [15]:
df_header2 = df_header[0].str.split(',', expand=True)
df_header2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,[Place,Bib,Name,Gender,City,State,Chip Time,Chip Pace,Gun Time]
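The splitting step can be tried on a tiny example. The sketch below uses a made-up three-column header (not the full race header) and also strips whitespace from each resulting cell, since splitting on ',' alone can leave a leading space in every column after the first:

```python
import pandas as pd

# Hypothetical one-row dataframe holding a bracketed, comma-separated header string
header = pd.DataFrame(['[Place, Bib, Name]'])

# Remove the surrounding brackets, then split at each comma into separate columns
parts = header[0].str.strip('[]').str.split(',', expand=True)

# Strip leftover whitespace from every cell
parts = parts.apply(lambda col: col.str.strip())

print(parts.iloc[0].tolist())
```

Splitting on ', ' (comma followed by a space) is another option when every separator is known to have that exact form.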


We can carry out the same process as above to convert the table list into a pandas dataframe for the table values:

In [16]:
df_table = pd.DataFrame(table_list)
df_table2 = df_table[0].str.split(',', expand=True)
df_table2.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,[],,,,,,,,
1,[Finishers:,1458],,,,,,,
2,[Male:,771],,,,,,,
3,[Female:,687],,,,,,,
4,[],,,,,,,,
5,[1,2320,\r\n\r\n DANIEL M HINCKLEY...,M,HILLSBORO,OR,16:42,5:23,16:44]
6,[2,2335,\r\n\r\n KORY F GRAY\r\n\r...,M,HILLSBORO,OR,17:34,5:40,17:35]
7,[3,1770,\r\n\r\n FILIP SCHMOLE\r\n...,M,PORTLAND,OR,18:13,5:52,18:14]
8,[4,2584,\r\n\r\n TRENTON C ROLLING...,M,PORTLAND,OR,18:32,5:58,18:35]
9,[5,2688,\r\n\r\n YEAN-AN LIAO\r\n\...,M,HILLSBORO,OR,19:12,6:11,19:18]


This looks much better, but there is still work to do. The dataframe has unwanted square brackets surrounding each row. It also has some newline and carriage return characters (\n, \r) that can be removed. We can use the strip() method to remove the square brackets and unnecessary characters in columns 0, 1, 2 and 8. 

We also notice that the first few rows of the table contain overall statistics on the race; they are not formatted like the rest of the table rows and contain missing values. Therefore, any rows with missing values can be removed from the table.

In [17]:
# Remove unnecessary characters
df_table2[0] = df_table2[0].str.strip('[')
df_table2[0] = df_table2[0].str.strip(']')
df_table2[1] = df_table2[1].str.strip(']')
df_table2[8] = df_table2[8].str.strip(']')
df_table2[2] = df_table2[2].str.strip('\r\n\r\n ')

# Remove all rows with any missing values
df_table3 = df_table2.dropna(axis=0, how='any')

df_table3.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8
5,1,2320,DANIEL M HINCKLEY,M,HILLSBORO,OR,16:42,5:23,16:44
6,2,2335,KORY F GRAY,M,HILLSBORO,OR,17:34,5:40,17:35
7,3,1770,FILIP SCHMOLE,M,PORTLAND,OR,18:13,5:52,18:14
8,4,2584,TRENTON C ROLLING,M,PORTLAND,OR,18:32,5:58,18:35
9,5,2688,YEAN-AN LIAO,M,HILLSBORO,OR,19:12,6:11,19:18
10,6,1576,JORGE1 LOPEZ,M,PORTLAND,OR,19:19,6:14,19:20
11,7,1479,SCOTT E HAMPSHIRE,M,HILLSBORO,OR,19:27,6:16,19:29
12,8,895,KEVIN CANADA,M,BEAVERTON,OR,19:53,6:24,20:02
13,9,2631,SCOTT GERWIG,M,PORTLAND,OR,19:57,6:26,19:59
14,10,2431,NICOLAUS L ROCK,M,HILLSBORO,OR,20:00,6:27,20:01
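It may help to note that str.strip() treats its argument as a set of characters to remove from both ends of the string, not as a literal substring. The sketch below illustrates this, together with dropna(), on a hypothetical two-row table that mimics one summary row and one result row:

```python
import pandas as pd

# Hypothetical two-row table: a summary row (missing a cell) and a result row
raw = pd.DataFrame({
    0: ['[Finishers:', '[1'],
    1: [' 1458]', ' 2320'],
    2: [None, ' \r\n\r\n DANIEL M HINCKLEY\r\n\r\n ]'],
})

clean = raw.copy()
for col in clean.columns:
    # Remove any of the characters '[', ']', space, '\r', '\n' from both ends
    clean[col] = clean[col].str.strip('[] \r\n')

# Drop every row that has at least one missing value
clean = clean.dropna(axis=0, how='any')
print(clean)
```

Characters in the middle of a string, such as the spaces inside the name, are left untouched because strip() only works from the ends inwards.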


Almost there! Now we can concatenate the header dataframe with the table dataframe:

In [18]:
# We remove unnecessary characters from the header
df_header2[0] = df_header2[0].str.strip('[')
df_header2[8] = df_header2[8].str.strip(']')

# We concatenate the two dataframes
frames = [df_header2, df_table3]
df = pd.concat(frames)

df2 = df.rename(columns=df.iloc[0]) # We assign the first row to be the dataframe header
df3 = df2.drop(df2.index[0]) # We drop the replicated header from the first row of the dataframe

df3.head(10)

Unnamed: 0,Place,Bib,Name,Gender,City,State,Chip Time,Chip Pace,Gun Time
5,1,2320,DANIEL M HINCKLEY,M,HILLSBORO,OR,16:42,5:23,16:44
6,2,2335,KORY F GRAY,M,HILLSBORO,OR,17:34,5:40,17:35
7,3,1770,FILIP SCHMOLE,M,PORTLAND,OR,18:13,5:52,18:14
8,4,2584,TRENTON C ROLLING,M,PORTLAND,OR,18:32,5:58,18:35
9,5,2688,YEAN-AN LIAO,M,HILLSBORO,OR,19:12,6:11,19:18
10,6,1576,JORGE1 LOPEZ,M,PORTLAND,OR,19:19,6:14,19:20
11,7,1479,SCOTT E HAMPSHIRE,M,HILLSBORO,OR,19:27,6:16,19:29
12,8,895,KEVIN CANADA,M,BEAVERTON,OR,19:53,6:24,20:02
13,9,2631,SCOTT GERWIG,M,PORTLAND,OR,19:57,6:26,19:59
14,10,2431,NICOLAUS L ROCK,M,HILLSBORO,OR,20:00,6:27,20:01


That's it! It took a while to get here, but at this point the dataframe is in the desired format. The table has now been scraped from the web and converted into an appropriate representation, to which we can apply the data mining operations covered in the previous lectures and labs.
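As an aside, the same result can often be reached without converting rows to strings and splitting on commas. The sketch below is an alternative, not the method used above: it reads the text of each cell separately with get_text(strip=True), so a comma inside a cell cannot break the columns. The inline table and all names are made up for illustration, and the built-in 'html.parser' is assumed in place of 'lxml'.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical table with untidy whitespace and a comma inside one cell
html_doc = """
<table>
  <tr><th>Place</th><th>Name</th><th>City</th></tr>
  <tr><td>1</td><td>
      RUNNER ONE
  </td><td>HILLSBORO</td></tr>
  <tr><td>2</td><td>RUNNER, TWO</td><td>PORTLAND</td></tr>
</table>
"""
soup_demo = BeautifulSoup(html_doc, 'html.parser')

# Header cells: one clean string per 'th' tag
header = [th.get_text(strip=True) for th in soup_demo.find_all('th')]

# Data rows: one list of clean strings per 'tr' tag that contains 'td' cells
records = []
for tr in soup_demo.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:                      # the header row has no 'td' cells, so skip it
        records.append(cells)

df_demo = pd.DataFrame(records, columns=header)
print(df_demo)
```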

### 4. Second example - scraping URLs

For this second example, we will see how to scrape information from a fictional store, in this case a book store which is available at the following URL: [http://books.toscrape.com/](http://books.toscrape.com/)

Please visit the above website, inspect the webpage, and inspect the HTML source code from your browser (using the same process described in section 2 of the lab notebook).

We see that the webpage lists 20 books. For each book, there is an associated URL, which points to a separate webpage describing each book in detail. The goal of this example is to scrape the URLs for all these 20 books.

We first follow the same process as in sections 1 and 2 of the lab notebook to open the URL and parse the HTML content:

In [19]:
url_bookstore = "http://books.toscrape.com/index.html"
html_bookstore = urlopen(url_bookstore)
soup_bookstore = BeautifulSoup(html_bookstore, 'lxml')

If we inspect the HTML code of the [webpage](http://books.toscrape.com/), we see that each of the 20 books is enclosed in an 'article' tag whose class attribute has the value 'product_pod'. Inside each such 'article' element is the corresponding URL for the book, in the 'href' attribute of an 'a' tag. This seems to be a reliable way to locate product URLs.

So the first thing to attempt is to find 'article' tags in the HTML code whose class is 'product_pod':

In [20]:
soup_bookstore.find("article", class_ = "product_pod")

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

This seems to produce too much information. So instead of only looking for the 'product_pod' value in an 'article' tag, let's look for URLs only. 

If we inspect the HTML code further, we see that the URLs we are looking for are within the 'a' tag, which is within the 'div' tag. So we can modify the above command as:

In [21]:
soup_bookstore.find("article", class_ = "product_pod").div.a

<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>

This is better: we now have all the information contained within the 'a' tag for a book. But we only need the URL contained in the 'href' attribute, which in the above example should be "catalogue/a-light-in-the-attic_1000/index.html".

We can get this URL by adding .get('href') to the previous instruction: 

In [22]:
soup_bookstore.find("article", class_ = "product_pod").div.a.get('href')

'catalogue/a-light-in-the-attic_1000/index.html'

We have now managed to get our first product URL with BeautifulSoup. Now let’s gather all the product URLs on the main web page at once using the find_all() method (findAll() is an older alias for the same method), which returns every 'article' tag whose class is 'product_pod':

In [23]:
book_urls = [x.div.a.get('href') for x in soup_bookstore.find_all("article", class_ = "product_pod")]

# Display number of fetched URLs
print(str(len(book_urls)) + " fetched book URLs")

# We can print all fetched URLS
for book in book_urls:
    print(book)

20 fetched book URLs
catalogue/a-light-in-the-attic_1000/index.html
catalogue/tipping-the-velvet_999/index.html
catalogue/soumission_998/index.html
catalogue/sharp-objects_997/index.html
catalogue/sapiens-a-brief-history-of-humankind_996/index.html
catalogue/the-requiem-red_995/index.html
catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
catalogue/the-black-maria_991/index.html
catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html
catalogue/shakespeares-sonnets_989/index.html
catalogue/set-me-free_988/index.html
catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html
catalogue/rip-it-up-and-start-again_986/index.html
catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-198

We have now fetched the URLs for all 20 books listed on the page and placed them in a Python list object, which can be used for further data mining.
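The fetched URLs are relative to the page they were scraped from, so they cannot be opened with urlopen() as they stand. A minimal sketch of turning them into absolute URLs with urljoin from the standard library is given below; only the first two URLs from the output above are used, and the variable names are illustrative.

```python
from urllib.parse import urljoin

# The page the relative URLs were scraped from
page_url = "http://books.toscrape.com/index.html"

# Two of the relative URLs fetched above
relative_urls = [
    "catalogue/a-light-in-the-attic_1000/index.html",
    "catalogue/tipping-the-velvet_999/index.html",
]

# Resolve each relative URL against the page URL
full_urls = [urljoin(page_url, rel) for rel in relative_urls]

for full_url in full_urls:
    print(full_url)
```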

## <font color = 'maroon'>Assignment</font>

The first two questions are coding exercises; question 3 does not require you to write any code.

The supplementary material for this week contains offline copies of the assignment material in case there are internet connectivity issues with School websites. The supplementary material includes file "income_table.html" (for Question 1), file "programmes.html" (for Question 2), and file "Question-3-Graph.png" (for Question 3).

1. You are provided with the following URL: [http://eecs.qmul.ac.uk/~emmanouilb/income_table.html](http://eecs.qmul.ac.uk/~emmanouilb/income_table.html). This webpage includes a table on individuals' income and shopping habits - the same that was used in the Week 3 lab.
  1. Inspect the HTML code of the above URL, and provide a short report on the various tags present in the code. What is the function of each unique tag present in the HTML code? [0.5 marks out of 5]
  2. Using Beautiful Soup, scrape the table and convert it into a pandas dataframe. Perform data cleaning when necessary to remove extra characters (no need to handle missing values). In the report include the code that was used to scrape and convert the table and provide evidence that the table has been successfully scraped and converted (e.g. by displaying the contents of the dataframe). [1 mark out of 5]
  

In [136]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open the URL
url = "http://eecs.qmul.ac.uk/~emmanouilb/income_table.html"
html = urlopen(url)

# Get BS4 ready:
soup = BeautifulSoup(html, 'lxml')
#print(type(soup))

# List all tags present in the code (find_all(True) matches every tag):
#soup.find_all(True)
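Since find_all(True) returns every tag, including repeats, one possible way to get a list of the unique tag names is to collect the name of each tag into a set. The sketch below does this on a made-up table fragment (not the contents of income_table.html), parsed with the built-in 'html.parser' so that no extra tags are added around the fragment:

```python
from bs4 import BeautifulSoup

# Hypothetical table fragment used only to illustrate the idea
fragment = """
<table>
  <tr><th>Region</th><th>Age</th></tr>
  <tr><td>India</td><td>49.0</td></tr>
</table>
"""
soup_tags = BeautifulSoup(fragment, 'html.parser')

# Every Tag object has a .name attribute; a set keeps each name only once
unique_tags = sorted({tag.name for tag in soup_tags.find_all(True)})
print(unique_tags)
```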

In [25]:
# Scrape and Clean the table - Form a dataframe

# Header Preparation:
header_list = []

column_labels = str(soup.find_all('th'))
cleantext_header = BeautifulSoup(column_labels, "lxml").get_text()  # extract the text without HTML tags
header_list.append(cleantext_header) # Add the clean table header to the list
# print(header_list)

df_header = pd.DataFrame(header_list)
df_header2 = df_header[0].str.split(', ', expand=True)
df_header2[0] = df_header2[0].str.strip('[')
df_header2[3] = df_header2[3].str.strip(']')
display(df_header2.head())

# Table Preparation:
table_list = []

# Collect all table rows (uncomment the print below to view the first 10)
rows = soup.find_all('tr')  # the 'tr' tag in html denotes a table row
#print(rows[:10])

# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td_cells = str(row.find_all('td'))
    row_cleantext = BeautifulSoup(row_td_cells, "lxml").get_text()  # extract the text without HTML tags
    table_list.append(row_cleantext)  # Add the clean table row to the list 
#print(table_list)

df_table = pd.DataFrame(table_list)
df_table2 = df_table[0].str.split(', ', expand=True)
df_table2.head(10)

# Remove unnecessary characters
df_table2[0] = df_table2[0].str.strip('[')
df_table2[3] = df_table2[3].str.strip(']')

# Place everything in:
frames = [df_header2, df_table2]
df = pd.concat(frames)
df2 = df.rename(columns=df.iloc[0]) # We assign the first row to be the dataframe header
df3 = df2.drop(df2.index[0]) # We drop the replicated header from the first row of the dataframe
df3.head(10)


Unnamed: 0,0,1,2,3
0,Region,Age,Income,Online Shopper


Unnamed: 0,Region,Age,Income,Online Shopper
1,India,49.0,86400.0,No
2,Brazil,32.0,57600.0,Yes
3,USA,35.0,64800.0,No
4,Brazil,43.0,73200.0,No
5,USA,45.0,,Yes
6,India,40.0,69600.0,Yes
7,Brazil,,62400.0,No
8,India,53.0,94800.0,Yes
9,USA,55.0,99600.0,No
10,India,42.0,80400.0,Yes


2. The list of the various MSc programmes offered by the School of EECS is provided at the following URL: [http://eecs.qmul.ac.uk/postgraduate/programmes/](http://eecs.qmul.ac.uk/postgraduate/programmes/). Perform web scraping on the table present in the above URL and convert it into a pandas dataframe that includes one row for each programme of study as shown in the webpage. The dataframe should include the following 5 columns: name of postgraduate degree programme (e.g. Advanced Electronic and Electrical Engineering), programme code for part-time study (e.g. H60C), programme code for full-time study (e.g. H60A), URL for part-time study programme details, URL for full-time study programme details. Perform data cleaning to remove unnecessary characters when needed. In the report include the code that was used to scrape, convert and clean the table and provide evidence that the table has been successfully scraped (e.g. by displaying the contents of the dataframe). [1 mark out of 5]



In [139]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Open the URL
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)

# Get BS4 ready:
soup = BeautifulSoup(html, 'lxml')
#print(type(soup))

# Inspect HTML for finding a way to scrape this:
# soup.find_all(True)

In [140]:
# Header Preparation:
header_list = []

column_labels = str(soup.find_all('th'))
cleantext_header = BeautifulSoup(column_labels, "lxml").get_text()  # extract the text without HTML tags
header_list.append(cleantext_header) # Add the clean table header to the list
#print(header_list)

df_header = pd.DataFrame(header_list)
df_header2 = df_header[0].str.split(', ', expand=True)
df_header2[0] = df_header2[0].str.strip('[')
df_header2[2] = df_header2[2].str.strip(']')
df_header2[3] = ['Part-time URL(2 year)']
df_header2[4] = ['Full-time URL(1 year)']
display(df_header2.head())

Unnamed: 0,0,1,2,3,4
0,Postgraduate degree programmes,Part-time(2 year),Full-time(1 year),Part-time URL(2 year),Full-time URL(1 year)


In [141]:
# Get all rows & print a sample row:
rows = soup.find_all('tr')  # the 'tr' tag in html denotes a table row
# print(rows[3])
# print()

table_list = []
for row in rows[1:]: # ignore header
    # Get the texts:
    columns = row.find_all('td')  # the 'td' tag in html code denotes a table cell  
    cols_str = str(columns)
    
    cleantext = BeautifulSoup(cols_str, "lxml").get_text()
#     print(cleantext)

    
    # Get the links (an empty string is appended when a cell has no link):
    for column in columns[1:]:
        anchor = column.find('a')
        if anchor is not None:
            links = anchor.get('href')
            cleantext = cleantext + ', ' + links
#             print(links)
        else:
            cleantext = cleantext + ', ' + ''
            
    table_list.append(cleantext)
    
# Construct the structure:
df_table = pd.DataFrame(table_list)
df_table2 = df_table[0].str.split(', ', expand=True)

# Cleaning: - Remove Unnecessary Characters:
df_table2[0] = df_table2[0].str.strip('[')
df_table2[2] = df_table2[2].str.strip(']')

# df_table2.head(20)

# Tie the header and the list:
frames = [df_header2, df_table2]
df = pd.concat(frames)
df2 = df.rename(columns=df.iloc[0]) # We assign the first row to be the dataframe header
df3 = df2.drop(df2.index[0]) # We drop the replicated header from the first row of the dataframe
df3.head(20)

Unnamed: 0,Postgraduate degree programmes,Part-time(2 year),Full-time(1 year),Part-time URL(2 year),Full-time URL(1 year)
1,Artificial Intelligence,I4U2,I4U1,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
2,Big Data Science,H6J6,H6J7,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
3,Computer Games,,I4U4,,https://www.qmul.ac.uk/postgraduate/taught/cou...
4,Computer Science,G4U2,G4U1,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
5,Computer Science by Research,G4Q2,G4Q1,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
6,Computing and Information Systems,G5U6,G5U5,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
7,Data Science and Artificial Intelligence by Co...,,I4U5,,https://www.qmul.ac.uk/postgraduate/taught/cou...
8,Electronic Engineering by Research,H6T6,H6T5,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
9,Internet of Things (Data),I1T2,I1T0,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
10,Machine Learning for Visual Data Analytics,H6JZ,H6JE,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...


3. Consider the graph in the figure below as displaying the links for a group of 5 webpages.
  1. Which of the 5 nodes would you consider hubs and which would you consider authorities? Explain why. [0.5 marks out of 5]
  2. Assume that this graph is to be used as input to the PageRank algorithm. Calculate the transition probabilities $p_{ij}$ for all 5 nodes in the graph below (where $i$ and $j$ take values from 1 to 5). Add transitions with a uniform probability distribution in the case of dead-end nodes (do not consider cases of dead-end components). [1 mark out of 5]
  3. Derive the PageRank $\pi(i)$ for all nodes, where $i=\{1,...,5\}$ corresponds to the node index. Assume that the teleportation probability is set to $\alpha$. [1 mark out of 5]

![FigGraph](http://eecs.qmul.ac.uk/~emmanouilb/FigGraph.png)
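To check hand calculations of this kind, a small numerical sketch can be useful. The code below is not a solution to the question: it uses a hypothetical 3-node graph (not the graph in the figure), and the function names, the value alpha = 0.15 and the power-iteration approach are all assumptions made for illustration. It builds the transition matrix with uniform transitions added for dead-end nodes, then mixes in teleportation with probability alpha.

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-normalise a 0/1 adjacency matrix; dead-end rows become uniform."""
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    out_degree = A.sum(axis=1, keepdims=True)
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)  # avoid division by zero
    # Nodes with out-links: 1/out_degree per link; dead ends: 1/n to every node
    return np.where(out_degree > 0, A / safe_degree, 1.0 / n)

def pagerank(P, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration on the chain that teleports with probability alpha."""
    n = P.shape[0]
    G = (1.0 - alpha) * P + alpha / n   # follow a link, or jump to a random node
    pi = np.full(n, 1.0 / n)            # start from the uniform distribution
    for _ in range(max_iter):
        new_pi = pi @ G
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi
    return pi

# Hypothetical graph (NOT the assignment graph): 1->2, 1->3, 2->3; node 3 is a dead end
adjacency = [[0, 1, 1],
             [0, 0, 1],
             [0, 0, 0]]

P = transition_matrix(adjacency)
pi = pagerank(P, alpha=0.15)
print(P)
print(pi)
```

Each row of P sums to 1, and the returned vector pi sums to 1 and is (approximately) unchanged by one more step of the teleporting chain, which is the defining property of the stationary distribution.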