# SCC.413 Applied Data Mining
# NLP: Week 16
# Web Scraping
# Answers

## Contents
- [Introduction](#intro)
- [Packages & imports](#packages)
- [Justext](#justext)
- [Requests](#requests)
- [Beautiful Soup](#beautiful_soup)
- [Wikipedia exercise](#wiki_assessed)
- [Optional Wikipedia exercise](#wiki_optional)
- [Forum exercise](#forum)
- [Advanced Forum exercise](#forum_advanced)

<a name="intro"></a>
## Introduction

If you use spidering or just download a list of webpage URLs (e.g. with [curl](https://curl.haxx.se/docs/manpage.html) or [requests](https://requests.readthedocs.io/)), you will be left with a collection of HTML files. These raw HTML files contain a lot of redundant information (e.g. adverts, menus, headings, etc.), known as "boilerplate". What we want is only the main text of the webpage (i.e. the news article, the blog post, etc.), in plain text without the HTML tags. If you are not familiar with HTML, [there is a guide here](https://www.w3schools.com/html/).

<a name="packages"></a>
## Packages & Imports

The packages required for the week's lab work are already available on Colab, except for justext. Running the cell below will install justext on Colab.

In [7]:
!pip install justext



Non-standard packages are also included in `requirements.txt`, if you need to install them on your own machine.

The imports for all of the code in this lab are provided in one cell here for convenience.

In [8]:
import json
import justext
import requests
from urllib.parse import urljoin #https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin
from bs4 import BeautifulSoup, Tag #https://www.crummy.com/software/BeautifulSoup/bs4/doc/
import re

<a name="justext"></a>
## Justext

Webpages come in all shapes and sizes, and it varies massively how easy it is to scrape the text of interest. If we are lucky, it is relatively easy to pick out the text, and there are fully automated tools available for that. One of these tools is [justext](https://github.com/miso-belica/jusText).

1. Have a read through the [description of the justext algorithm](http://corpus.tools/wiki/Justext/Algorithm).
2. There is a good [online demo of the tool](https://nlp.fi.muni.cz/projects/justext/). Try out a webpage, e.g. a BBC news article. This article demonstrates it well: <http://www.bbc.co.uk/bbcthree/article/cc72247b-e658-4af8-a838-dfe4e68e2776>.
3. Manually compare the filtered text and the text on the web page. How accurate is the text extraction? How much is missed, and how much text is incorrectly included? What are the potential impacts of this in a later analysis of the text?
4. You can use justext via Python too. The HTML needs to be downloaded from a website, or using a file already collected. The `print_justext` function below takes some html and prints out the justext output.



In [9]:
def print_justext(html):
  paragraphs = justext.justext(html, justext.get_stoplist("English"))
  for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
      tag = 'h' if paragraph.heading else 'p'
      print("<{}> {}".format(tag, paragraph.text))

5. You can use this function to process a previously gathered webpage, or a sample webpage is available alongside this notebook (/content/5.html), as shown below.
6. Examine the text produced. How much of the text is correctly outputted?
7. Note the `<h>` and `<p>` tags, indicating headers and paragraphs. These may be useful, if for instance you are only interested in headers or the actual body of the text, or want to separate them in later analysis. You'll learn next week how you could filter these, e.g. using regular expressions.

In [10]:
with open("/content/5.html", "r") as html_file:
  html = html_file.read()
  print_justext(html)

<h> Saturday, 3 January 2015
<p> Happy new year! I feel with the turn of 2015 I should talk about my new year's resolutions. As big of a year as 2014 was, I feel like this will be even bigger. I'll be graduating for one, turning 21 for two. So I thought instead of talking about the past year, I'll talk about what I have planned for 2015 and how I seek to improve myself.
<p> Graduating uni with a 2:1 and finding a job
<p> SAVING SAVING SAVING MONEY
<p> With that comes learning to drive - my dad's really been pushing me to at least do my theory for a couple of years now and I think with the end of school, it'll be the perfect time to dedicate myself to driving
<p> Toning up - I've had really bad back pains the past few months so the least I need to be doing is pilates. But I also wanna hit the gym just to tone up and feel better about myself. I used to have very severe body image issues but having lost a lot of weight, that mostly passed, but I still have some days where I just wish I lo
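The `<h>` and `<p>` prefixes in output like the above can already be separated with plain string handling, before regular expressions are introduced. The following is a minimal sketch; `split_justext_lines` is a hypothetical helper name, and the sample lines are copied from the output above.

```python
def split_justext_lines(lines):
    """Separate '<h> ...' and '<p> ...' lines into headings and paragraphs."""
    headings, paragraphs = [], []
    for line in lines:
        if line.startswith("<h> "):
            headings.append(line[4:])  # drop the 4-character "<h> " prefix
        elif line.startswith("<p> "):
            paragraphs.append(line[4:])  # drop the 4-character "<p> " prefix
    return headings, paragraphs


sample = [
    "<h> Saturday, 3 January 2015",
    "<p> Graduating uni with a 2:1 and finding a job",
    "<p> SAVING SAVING SAVING MONEY",
]
headings, paragraphs = split_justext_lines(sample)
print(headings)
print(paragraphs)
```

Keeping the two lists apart makes it easy to analyse only the body text, or only the headings, later on.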

Often an automated approach will not provide accurate enough results. Fortunately, there are methods available other than manually copying and pasting the text. Many tools have been built that assist in parsing web pages and extracting the text of interest, although a separate script usually needs to be written with these tools for each set of websites. Some mimic a user's interaction with a website to get to the relevant data (e.g. [Selenium](http://seleniumhq.github.io/selenium/docs/api/py/)). [Scrapy](https://doc.scrapy.org/en/latest/index.html) is another good option for grabbing and processing webpages.

<a name="requests"></a>
## Requests

Here we will be using the Python requests package, which makes downloading webpages easy: <https://requests.readthedocs.io/>.

We simply provide a URL, and the HTML of the response (`response.text`) can be fed into justext, as below.

In [11]:
response = requests.get("http://www.bbc.co.uk/bbcthree/article/cc72247b-e658-4af8-a838-dfe4e68e2776")
print_justext(response.text)

<h> The kids TV shows begging for a reboot
<p> Crack out the orange soda: Kenan and Kel are getting back together to make some new material.
<p> Now, before you go too wild, we're not getting new episodes of their self-titled '90s kids' TV classic, or their sketch show All That. Instead, the pair - embodied by actors Kenan Thompson and Kel Mitchell - have paired up with some former All That castmates for a sketch on the MTV comedy show, Wild'N'Out.
<p> Looks like this post is no longer available from its original source. It might've been taken down or had its privacy settings changed."
<p> Seeing the guys back together - even in a limited capacity - is making us hanker for a proper reunion between Kenan and Kel - either a TV show or, even better, a movie.
<p> Now, before you point it out, we'd like to state we do know there already was a film, but that was made-for-TV and we want this to be a big cinema blockbuster.
<p> The duo filled our screens from 1996 to 2000, and, basically, now 

<a name="beautiful_soup"></a>
## Beautiful Soup

Requests will provide raw HTML files. The key part of web scraping is extracting the relevant parts of the webpage. For this we will use Beautiful Soup: <https://www.crummy.com/software/BeautifulSoup/bs4/doc/>

The basic process is to look at the HTML of the target webpage and look for ways of drilling down to the elements of interest, with the overall aim of extracting just the specific text of interest. This could be metadata, or actual text for further analysis. Beautiful Soup provides numerous methods; please consult [the documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the options available.

All websites are different (although some have consistent structures), and often a custom scraper needs to be developed. You can use a standard web browser to look at the information of interest: right-click on the first part of the data of interest and select "Inspect Element" (Firefox & Safari) or "Inspect" (Chrome). You will then see the HTML code for that element and the surrounding elements.

You can then use Beautiful Soup's functions for drilling down and traversing the relevant parts of the web page. You can also extract links and use the requests package to download further webpages for processing.
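As a small offline illustration of drilling down, the sketch below parses a made-up HTML fragment (the table, class name and links are invented for the example, not taken from a real page) and pulls out the link text and `href` from each cell of a given class.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; real pages will be structured differently.
html = """
<table>
  <tr><td class="summary"><a href="/wiki/First">First title</a></td><td>1966</td></tr>
  <tr><td class="summary"><a href="/wiki/Second">Second title</a></td><td>1967</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching element; cell.a is the first <a> inside the cell.
links = []
for cell in soup.find_all("td", {"class": "summary"}):
    links.append((cell.a.text.strip(), cell.a["href"]))

print(links)
```

The same pattern (find the distinguishing tag and class, then read the text or attributes) carries over to live pages once you have inspected their HTML.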

To demonstrate, we will be parsing Wikipedia pages. As an example we will look to extract plot summaries for Star Trek episodes: <https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes>. We download the webpage using requests, and then use Beautiful Soup to put the html into a parseable document.

In [12]:
base_url = "https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes"
#load webpage
req = requests.get(base_url)
soup = BeautifulSoup(req.text, "html.parser")

Use a web browser to view the [Wikipedia page](https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes). We are targeting the list of episodes in the tables starting under "Season 1 (1966–67)". Right-click on the table cell with the first episode title ("The Man Trap") and select "Inspect Element" (or just "Inspect" in Chrome). Note that this cell (`td`) is of class "summary", as is every title cell, and other cells are not. This is our way in. We are going to collect a list of titles from the table, along with the URL of the Wikipedia page about each episode.

In [13]:
episodes = [] #to store the list of episodes.
#find and loop through all tds (table cells) with class name "summary" (which we know is an episode title)
for episode_cell in soup.find_all('td', {'class': 'summary'}):
    title = episode_cell.a.text.strip() #Get the actual text from the cell.
    episode_url = episode_cell.a['href'] #extract the url
    episodes.append({'title': title, 'url': episode_url}) #store in dictionary format
    
episodes

[{'title': 'The Man Trap', 'url': '/wiki/The_Man_Trap'},
 {'title': 'Charlie X', 'url': '/wiki/Charlie_X'},
 {'title': 'Where No Man Has Gone Before',
  'url': '/wiki/Where_No_Man_Has_Gone_Before'},
 {'title': 'The Naked Time', 'url': '/wiki/The_Naked_Time'},
 {'title': 'The Enemy Within',
  'url': '/wiki/The_Enemy_Within_(Star_Trek:_The_Original_Series)'},
 {'title': "Mudd's Women", 'url': '/wiki/Mudd%27s_Women'},
 {'title': 'What Are Little Girls Made Of?',
  'url': '/wiki/What_Are_Little_Girls_Made_Of%3F'},
 {'title': 'Miri', 'url': '/wiki/Miri_(Star_Trek:_The_Original_Series)'},
 {'title': 'Dagger of the Mind', 'url': '/wiki/Dagger_of_the_Mind'},
 {'title': 'The Corbomite Maneuver', 'url': '/wiki/The_Corbomite_Maneuver'},
 {'title': 'The Menagerie',
  'url': '/wiki/The_Menagerie_(Star_Trek:_The_Original_Series)'},
 {'title': 'The Menagerie',
  'url': '/wiki/The_Menagerie_(Star_Trek:_The_Original_Series)'},
 {'title': 'The Conscience of the King',
  'url': '/wiki/The_Conscience_of_t

Note that the URLs are relative. They can be made absolute with `urljoin`, using the base URL as a reference:

In [14]:
from urllib.parse import urljoin #https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urljoin
for episode in episodes:
    episode['url'] = urljoin(base_url, episode['url'])
    
episodes

[{'title': 'The Man Trap',
  'url': 'https://en.wikipedia.org/wiki/The_Man_Trap'},
 {'title': 'Charlie X', 'url': 'https://en.wikipedia.org/wiki/Charlie_X'},
 {'title': 'Where No Man Has Gone Before',
  'url': 'https://en.wikipedia.org/wiki/Where_No_Man_Has_Gone_Before'},
 {'title': 'The Naked Time',
  'url': 'https://en.wikipedia.org/wiki/The_Naked_Time'},
 {'title': 'The Enemy Within',
  'url': 'https://en.wikipedia.org/wiki/The_Enemy_Within_(Star_Trek:_The_Original_Series)'},
 {'title': "Mudd's Women",
  'url': 'https://en.wikipedia.org/wiki/Mudd%27s_Women'},
 {'title': 'What Are Little Girls Made Of?',
  'url': 'https://en.wikipedia.org/wiki/What_Are_Little_Girls_Made_Of%3F'},
 {'title': 'Miri',
  'url': 'https://en.wikipedia.org/wiki/Miri_(Star_Trek:_The_Original_Series)'},
 {'title': 'Dagger of the Mind',
  'url': 'https://en.wikipedia.org/wiki/Dagger_of_the_Mind'},
 {'title': 'The Corbomite Maneuver',
  'url': 'https://en.wikipedia.org/wiki/The_Corbomite_Maneuver'},
 {'title': '
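To see how `urljoin` treats different kinds of link, the short sketch below resolves a root-relative path, a page-relative path and an already absolute URL against the same base. The `example.org` address is a placeholder used only for illustration.

```python
from urllib.parse import urljoin

base = "https://en.wikipedia.org/wiki/List_of_Star_Trek:_The_Original_Series_episodes"

# A link starting with "/" replaces the whole path of the base URL.
root_relative = urljoin(base, "/wiki/The_Man_Trap")

# A link without a leading "/" replaces only the last path segment.
page_relative = urljoin(base, "The_Man_Trap")

# An absolute URL is returned unchanged (placeholder address).
already_absolute = urljoin(base, "https://example.org/page")

print(root_relative)
print(page_relative)
print(already_absolute)
```

Because absolute links pass through untouched, it is safe to apply `urljoin` to every extracted `href` without checking its form first.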

Now we have a list of episodes and their individual Wikipedia pages for downloading. We can try justext on the first episode to see if automatic extraction will do:

In [15]:
episode_req = requests.get(episodes[0]['url']) #use requests to download episode webpage.

paragraphs = justext.justext(episode_req.text, justext.get_stoplist("English"))
text = ""

for p in paragraphs:
    if not p.is_boilerplate:
        text += p.text.strip() + "\n"
        
print(text)

In the episode, the crew visit an outpost on planet M-113 to conduct routine medical exams on the residents, only to be attacked by a shapeshifting alien who kills by extracting salt from the victim's body.
This was the first Star Trek episode to air on television, although the sixth to be filmed; it was chosen as the first of the series to be broadcast by the studio due to the horror plot. "The Man Trap" placed first in the timeslot with a Nielsen rating of 25.2 for the first half-hour and 24.2 for the remainder. It aired two days earlier on Canadian network CTV.
The USS Enterprise arrives at planet M-113 to provide supplies and medical exams for the only known inhabitants of the planet, Professor Robert Crater (Alfred Ryder) and his wife Nancy (Jeanne Bal), who operate an archaeological research station there. Captain Kirk, Chief Medical Officer Dr. Leonard McCoy, and Crewman Darnell (Michael Zaslow) transport to the surface as Kirk teases McCoy about his affection for Nancy ten year

Compare this to the [Wikipedia page](https://en.wikipedia.org/wiki/The_Man_Trap). It does a pretty good job of getting the text from the whole page. However, we only want the plot section. To get it, we need to be more specific about what to extract, so we can use Beautiful Soup.

Looking at the [Wikipedia page](https://en.wikipedia.org/wiki/The_Man_Trap), you can see the sections are headed with an `h2` element, so to find the "Plot" section we just need to go through the `h2` tags, find the one with "Plot" and hoover up all of the text between there and the next section. We can do this for the first episode as follows:

In [16]:
from bs4 import Tag #we need the Tag class from Beautiful Soup to check whether a node we are looking at is a Tag.

episode_soup = BeautifulSoup(episode_req.text, "html.parser") #use beautiful soup to decode into a parseable document.

episode_plot = ""

for h2 in episode_soup.find_all("h2"): #Go through all of the h2 elements.
    if h2.text.strip().startswith("Plot"): #This is the h2 with "Plot" (and "Plot Summary")
        node = h2.next_sibling #start looking for tags after the Plot h2, will be strings and Tags.
        while node is not None: #stop if we run out of siblings (no later h2 at this level).
            if isinstance(node, Tag): #Check if this element is actually a Tag.
                if node.name == "p": #p tag, we want this.
                    episode_plot += node.text.strip() + "\n" #append the text from p.
                elif node.name == "h2": #at the next h2, so a new section, no longer the plot. Stop processing.
                    break
            node = node.next_sibling #get next element at same level.
                    
print(episode_plot)

The USS Enterprise arrives at planet M-113 to provide supplies and medical exams for the only known inhabitants of the planet, Professor Robert Crater (Alfred Ryder) and his wife Nancy (Jeanne Bal), who operate an archaeological research station there. Captain Kirk, Chief Medical Officer Dr. Leonard McCoy, and Crewman Darnell (Michael Zaslow) transport to the surface as Kirk teases McCoy about his affection for Nancy ten years earlier. They arrive in the research station, and each of the three men sees Nancy differently: McCoy as she was when he first met her, Kirk as she should look accounting for her age, and Darnell as an attractive blonde woman whom he met on a pleasure planet. When Nancy goes out to fetch her husband, she beckons Darnell to follow her.
Professor Crater is reluctant to be examined, telling Kirk that they only require salt tablets. Before McCoy can complete the examination, they hear a scream from outside. They find Darnell dead, with red ring-like mottling on his f

You'll see we now just have the plot text. We just need to wrap this up in a loop and add the plot to each episode.

In [17]:
for episode in episodes:
    episode_plot = ""
    episode_req = requests.get(episode['url']) #do a new request for the episode page.
    episode_soup = BeautifulSoup(episode_req.text, "html.parser") #use beautiful soup to decode into a parseable document.
    for h2 in episode_soup.find_all("h2"): #Go through all of the h2 elements.
        if h2.text.strip().startswith("Plot"): #This is the h2 with "Plot" (and "Plot Summary")
            node = h2.next_sibling #start looking for tags after the Plot h2, will be strings and Tags.
            while node is not None: #stop if we run out of siblings (no later h2 at this level).
                if isinstance(node, Tag): #Check if this element is actually a Tag.
                    if node.name == "p": #p tag, we want this.
                        episode_plot += node.text.strip() + "\n" #append the text from p.
                    elif node.name == "h2": #at the next h2, so a new section, no longer the plot. Stop processing.
                        break
                node = node.next_sibling #get next element at same level.

    episode['plot'] = episode_plot #add the plot to episode (empty string if no Plot section was found).
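The heading-to-heading walk can also be packaged as a function and tried on an inline HTML snippet, which makes it easy to check without downloading anything. This is a sketch only: `section_paragraphs` is a hypothetical helper name, the HTML is made up, and unlike the loop above it joins paragraphs with newlines rather than appending one after each.

```python
from bs4 import BeautifulSoup, Tag


def section_paragraphs(soup, heading_prefix):
    """Return the text of <p> siblings after the first matching <h2>, up to the next <h2>."""
    for h2 in soup.find_all("h2"):
        if h2.text.strip().startswith(heading_prefix):
            parts = []
            node = h2.next_sibling
            while node is not None:  # stop cleanly if the section runs to the end
                if isinstance(node, Tag):
                    if node.name == "h2":
                        break  # next section reached
                    if node.name == "p":
                        parts.append(node.text.strip())
                node = node.next_sibling
            return "\n".join(parts)
    return ""  # no matching heading found


# Made-up HTML for illustration only.
html = (
    "<div><h2>Plot</h2><p>First paragraph.</p>\n<p>Second paragraph.</p>"
    "<h2>Production</h2><p>Not part of the plot.</p></div>"
)
demo_soup = BeautifulSoup(html, "html.parser")
plot = section_paragraphs(demo_soup, "Plot")
print(plot)
```

Note that this only works when the paragraphs are siblings of the `h2`; if a site wraps its headings in another element, the starting node would need adjusting.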




As this is in dictionary format, it's nice to convert to JSON:



In [18]:
print(json.dumps(episodes,indent=4)) #print out the resulting json (pretty printed).

[
    {
        "title": "The Man Trap",
        "url": "https://en.wikipedia.org/wiki/The_Man_Trap",
        "plot": "The USS Enterprise arrives at planet M-113 to provide supplies and medical exams for the only known inhabitants of the planet, Professor Robert Crater (Alfred Ryder) and his wife Nancy (Jeanne Bal), who operate an archaeological research station there. Captain Kirk, Chief Medical Officer Dr. Leonard McCoy, and Crewman Darnell (Michael Zaslow) transport to the surface as Kirk teases McCoy about his affection for Nancy ten years earlier. They arrive in the research station, and each of the three men sees Nancy differently: McCoy as she was when he first met her, Kirk as she should look accounting for her age, and Darnell as an attractive blonde woman whom he met on a pleasure planet. When Nancy goes out to fetch her husband, she beckons Darnell to follow her.\nProfessor Crater is reluctant to be examined, telling Kirk that they only require salt tablets. Before McCoy can

And the output can be saved to a file:

In [19]:
with open('startrek.json', 'w') as f:
  #Dump json file. indent=4 prints the output prettier, but will increase disk space.
  json.dump(episodes, f, indent=4)
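As a quick sanity check on a saved file, the data can be read back and compared with what was written. The sketch below uses a made-up record and a temporary directory (both assumptions for the example) so that it does not touch `startrek.json`.

```python
import json
import os
import tempfile

# Made-up record standing in for the scraped episodes.
records = [{"title": "The Man Trap", "plot": "First line.\nSecond line."}]

# Write to a throwaway location rather than the real output file.
path = os.path.join(tempfile.mkdtemp(), "episodes_demo.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=4)

# Read the file back; newlines inside strings survive the round trip.
with open(path, "r", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)
```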

The code is split up here to explain it more neatly in a notebook; the full code is available in `startrekscrape.py`.

<a name="wiki_assessed"></a>

## Exercise 1: Wikipedia scraping

To practice using Beautiful Soup, try extracting details of films by Stanley Kubrick from this page: <https://en.wikipedia.org/wiki/Filmography_and_awards_of_Stanley_Kubrick>. The aim is to extract the year and title for the 13 feature films listed. To make it a little trickier, try only outputting films with Kubrick as a Producer. Some starting code is provided below.

Hint: You can use [find_all with a limit](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#the-limit-argument) to return only a limited number of nodes from another node, e.g. `td`s in a `tr`.

In [20]:
# Exercise 1: Wikipedia scraping

base_url = "https://en.wikipedia.org/wiki/Filmography_and_awards_of_Stanley_Kubrick"
#load webpage
req = requests.get(base_url)
#Use beautiful soup to decode webpage text into parseable document.
soup = BeautifulSoup(req.text, "html.parser")

table = soup.find('table', {'class': 'wikitable'}) #find() returns only the first match, which is the table we want.
trs = table.find_all('tr') #all table rows
trs = trs[1:] #skip header row

films = []

for tr in trs:
    year_td = tr.find('td')
    title_th = tr.find('th')
    year = year_td.text.strip() #save year.
    title = title_th.a.text.strip() #take the text from the a tag as this contains the actual title, earlier text sometimes contains sorting data.
    films.append({'title': title, 'year': year}) #add to list of films, using the title and year extracted from this row.

with open("kubrick.json", 'w') as f:
    json.dump(films, f, indent=4) #save the resulting json (pretty printed).
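The code above collects every film in the table. It does not apply the "Kubrick as a Producer" filter mentioned in the exercise. One way that filter might look is sketched below on a made-up table: the column order (year, title, director, producer), the "Yes"/"No" markers and the film names are all assumptions for illustration, so the real page's markup needs inspecting before adapting this.

```python
from bs4 import BeautifulSoup

# Hypothetical table: column order and "Yes"/"No" markers are assumptions,
# not the real Wikipedia markup.
html = """
<table class="wikitable">
  <tr><th>Year</th><th>Title</th><th>Director</th><th>Producer</th></tr>
  <tr><td>1990</td><th><a href="/wiki/A">Film A</a></th><td>Yes</td><td>No</td></tr>
  <tr><td>1995</td><th><a href="/wiki/B">Film B</a></th><td>Yes</td><td>Yes</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.find("table", {"class": "wikitable"}).find_all("tr")[1:]  # skip header row

produced = []
for tr in rows:
    tds = tr.find_all("td", limit=3)  # year, director and producer cells only
    if len(tds) < 3:
        continue  # row does not have the expected cells
    year, _director, producer = (td.text.strip() for td in tds)
    if producer == "Yes":
        produced.append({"title": tr.find("th").a.text.strip(), "year": year})

print(produced)
```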

<a name="wiki_optional"></a>
### Optional Wikipedia Exercise

You can also use what you've learnt to parse details of something else from Wikipedia, e.g. other TV series, or albums from an artist's discography.

In [None]:
# Optional Task Wikipedia



<a name="forum"></a>
## Exercise 2: Forum scraping

The optional lab workbook uses Scrapy to download pages from a forum. You can perform a similar task for forums with Requests and Beautiful Soup. You can choose any forum and use what you've learnt to parse thread posts into plain text. Start with an individual thread, and one page of posts within that thread. A good example to try is from Mumsnet on noisy baby toys: <https://www.mumsnet.com/Talk/toys_and_games_chat/3414974-noisy-baby-toys-which-are-the-worst>.

1. Try using justext to parse the first page of posts and examine the results. Is this good enough?
2. Use Beautiful Soup to collect just the posts from the first page and output as plain text.

Hint: one issue you may come across is `<br>` tags instead of newlines. If ignored, this will run lines of text in the same post directly next to each other (without even a space), e.g. something horrible like this:
"Allllllll of the toot tootSpin and bounce zebraAnd best (worst!) of all... peppa pig alphaphonic board- it has no off switch and the slightest touch sets it off for ages..."

This makes tokenisation difficult and is tricky to rectify later. It is a good lesson in sanity checking the exported text: it is better to get the text as close as possible to how it appears on the webpage at extraction time. To deal with this issue, the `<br>` tags can be replaced with a newline character (or some other marker), e.g.:

```
for br in post.find_all("br"):
    br.replace_with("\n")
```

In [21]:
# Exercise 2: Forum scraping

base_url = "https://www.mumsnet.com/Talk/toys_and_games_chat/3414974-noisy-baby-toys-which-are-the-worst"
posts = []

#load page
req = requests.get(base_url)

#Use beautiful soup to decode webpage text into parseable document.
soup = BeautifulSoup(req.text, "html.parser")

for post in soup.find_all('div', {'class': 'message'}): #all posts are in a div with the message class.
  p = post.find("p")  #some message divs contain extra info., so just get p tag.
  for br in p.find_all("br"):
    br.replace_with("\n")
  posts.append(p.text.strip()) #strip out whitespace around actual text.
    
with open("mumsnet.txt", 'w') as f:
  for post in posts:
    f.write("----\n%s\n" % post)

<a name="forum_advanced"></a>
## Advanced Exercise: Forum paging

You will notice that the topic's posts are spread across multiple pages. This means that "paging" needs to be performed to extract the posts from every page. This is a little tricky, but build on the code for collecting one page. You will need to consult the [requests documentation](https://requests.readthedocs.io/) to discover how to pass parameters to match the links to other pages. Be careful not to have duplicate posts in your extraction (the original post is repeated on each page).
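To check what URL a set of parameters produces without sending anything over the network, a request can be built and prepared but not sent. The sketch below uses the `pg` parameter name from the solution later in this notebook; other forums will use different names, so treat it as an assumption to verify against the site's own page links.

```python
import requests

base_url = "https://www.mumsnet.com/Talk/toys_and_games_chat/3414974-noisy-baby-toys-which-are-the-worst"

# Build (but do not send) a request, to inspect the URL that params produces.
prepared = requests.Request("GET", base_url, params={"pg": 2}).prepare()
print(prepared.url)
```

Comparing the printed URL with the link behind the forum's "next page" button is a quick way to confirm the parameter is right before looping over pages.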

The trickiest part is to know when to stop. Going past the number of pages available just brings the user to the final page. We will be covering regular expressions next week, so to help you out, the following code can be used to find the final page.

In [22]:
#this isn't complete code, you'll need to incorporate it into your solution.

lastpage = re.compile(r"page (\d+) of \1\b") #compiled regular expression, checking if we're at page n of n, i.e. last page. \b stops "page 1 of 12" from matching.

#...

pages = soup.find('div', {'class': 'pages'}).text.strip() #find pages element which has the "This is page x of y" text.
if lastpage.search(pages) is not None: #if our regex matches, then we're on last page, so make this the last one to parse.
    at_last = True
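To see how a back-reference pattern of this kind behaves, the sketch below runs two variants over some sample strings (the strings are made up to follow the "This is page x of y" wording). The variant without a trailing `\b` also matches "page 1 of 12", because the back-reference `\1` only needs to match the start of the second number.

```python
import re

loose = re.compile(r"page (\d+) of \1")     # back-reference only
strict = re.compile(r"page (\d+) of \1\b")  # also require the second number to end there

samples = ["This is page 4 of 4", "This is page 1 of 4", "This is page 1 of 12"]

# Map each sample to (loose match?, strict match?).
results = {
    text: (loose.search(text) is not None, strict.search(text) is not None)
    for text in samples
}

for text, (is_loose, is_strict) in results.items():
    print(text, is_loose, is_strict)
```

On a thread with fewer than ten pages the two patterns agree, so the difference only shows up on longer threads.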

In [23]:
# Advanced Exercise: Forum paging

base_url = "https://www.mumsnet.com/Talk/toys_and_games_chat/3414974-noisy-baby-toys-which-are-the-worst"
posts = []

lastpage = re.compile(r"page (\d+) of \1\b") #compile here as always same, checking if we're at page n of n, i.e. last page. \b stops "page 1 of 12" from matching.

pg = 1
at_last = False
while not at_last:

    #load page
    req = requests.get(base_url, params={'pg': pg})
    if req.status_code != requests.codes.ok: #extra check to see if page is up, not strictly necessary.
        print("bad request: end")
        break

    #Use beautiful soup to decode webpage text into parseable document.
    soup = BeautifulSoup(req.text, "html.parser")
    pages = soup.find('div', {'class': 'pages'}).text.strip() #find pages element which has the "This is page x of y" text.
    if lastpage.search(pages) is not None: #if our regex matches, then we're on last page, so make this the last one to parse.
        at_last = True

    print("parsing page %d" % pg)

    post_count = 1
    for post in soup.find_all('div', {'class': 'message'}): #all posts are in a div with the message class.
        if pg == 1 or post_count > 1: #if past page 1, the original post is repeated first, we don't want this.
            p = post.find("p")  #some message divs contain extra info., so just get p tag.
            for br in p.find_all("br"):
                br.replace_with("\n")
            posts.append(p.text.strip()) #strip out whitespace around actual text.
        post_count += 1
    pg += 1

with open("mumsnet2.txt", 'w') as f:
    for post in posts:
        f.write("%s\n" % post)

parsing page 1
parsing page 2
parsing page 3
parsing page 4
