In [None]:
import warnings
warnings.filterwarnings('ignore')

# HTML Refresher
This part is based on chapter 11 of *Automate the Boring Stuff with Python* by Al Sweigart

HTML files are plain text files containing *tags*, which are words enclosed in angle brackets. Tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags.

There are many different tags in HTML. Some of them carry extra properties in the form of attributes within the angle brackets. For example, the `<a>` tag encloses text that should be a link, and its `href` attribute holds the URL the link points to.

Some elements have an `id` attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.

In [21]:
%%bash
cat << EOF > example.html
<!DOCTYPE html>
<html>
<head>
<title>Hello!</title>
</head>
<body>
<h1>Hello World!</h1>
You are extremely welcome!<br>
<br>
The <a href=\"https://github.com/datsoftlyngby/dat4sem2019spring-python-materials\">Lecture Notes</a>.<br>
<div>
<p>paragraph 1</p>
<p>and paragraph 2: <span id="span01">This is span 1</span id="span02"><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
</div>
</body>
</html>
EOF

In [11]:
%%bash
xdg-open example.html

# View a Page's HTML Sources

Here, I will only describe how to use Firefox's developer tools.

To view a page's source, right-click on it and choose **View Page Source**, which opens a new tab with the HTML source.

<img src="images/view_source_small.png" width="500"> 

In Firefox, you can bring up the Web Developer Tools Inspector by pressing `CTRL-SHIFT-C` on Windows and Linux or `CMD-OPTION-C` on macOS.

<img src="images/inspector_small.png" width="600"> 

# Parsing HTML with BeautifulSoup

BeautifulSoup is a module for parsing and extracting information from HTML sources. In case it is not already installed on your machine:
- install it with `pip install beautifulsoup4`;
- import it with `import bs4`. Note that `beautifulsoup4` is the name used for installation, while `bs4` is the name of the module you import.

According to its documentation (https://www.crummy.com/software/BeautifulSoup/) *"Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text.""*

## Creating a BeautifulSoup Object from a Local HTML File

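- The `bs4.BeautifulSoup()` function needs to be called with the HTML it will parse, given either as a string or as an open file object. It is best to also name the parser explicitly, for example `'html.parser'`.
- The `bs4.BeautifulSoup()` function returns a `BeautifulSoup` object.

To parse a local HTML file, you can read its content into a string and pass that string to `bs4.BeautifulSoup()`, as in the following cell. Passing the open file object directly works as well.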

In [7]:
import bs4

with open('./example.html') as f:
    example_html = f.read()
    
soup = bs4.BeautifulSoup(example_html, 'html.parser')
print(type(soup))
print(soup.prettify())

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
 <head>
  <title>
   Hello!
  </title>
 </head>
 <body>
  <h1>
   Hello World!
  </h1>
  You are extremely welcome!
  <br/>
  <br/>
  The
  <a href='\"https://github.com/datsoftlyngby/dat4sem2018fall-python\"'>
   Lecture Notes
  </a>
  .
  <br/>
 </body>
</html>



## Creating a BeautifulSoup Object from a Remote HTML File



In [9]:
import bs4
import requests


r = requests.get('https://github.com/datsoftlyngby/dat4sem2018fall-python')
r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html.parser')

print(soup.prettify()[:1500])

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-aea1483741d7c7c127f6294ced2bb4d0.css" integrity="sha512-Cc73o557zxvw32bFLLW7+eLwVRIAuuA73yJTaC3Sa4e0hIej7Jp20LEujbX810MZ9UZvxB8tZrxGnv/ZTZzhBg==" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-0e25450ff46c1fbf61d504f9ffc4d784.css" integrity="sha512-5LDBQ/Dhs

## Finding an Element with the `select()` Method

You can retrieve HTML elements from a `BeautifulSoup` object by calling the `select()` method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.

Common CSS selector patterns include:

  * `soup.select('div')` ... selects all elements named `<div>`
  * `soup.select('#lecturer')`  ... selects the element with an id attribute of lecturer
  * `soup.select('.notice')` ... selects all elements that use a CSS class attribute named notice
  * `soup.select('div span')` ... selects all elements named `<span>` that are within an element named `<div>`
  * `soup.select('div > span')` ... selects all elements named `<span>` that are directly within an element named `<div>`, with no other element in between
  * `soup.select('input[name]')` ... selects all elements named `<input>` that have a name attribute with any value
  * `soup.select('input[type="button"]')` ... selects all elements named `<input>` that have an attribute named type with value button
  
See more in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
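
The selector patterns listed above can be tried on a small in-memory document. The following is a minimal sketch; the HTML snippet is made up for illustration and is not part of `example.html`.

```python
import bs4

# Hypothetical snippet, chosen so that every selector pattern matches something.
html = """
<div id="lecturer" class="notice">
  <span>direct child</span>
  <p><span>nested deeper</span></p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

selectors = [
    'div',                    # all <div> elements
    '#lecturer',              # the element with id="lecturer"
    '.notice',                # all elements with class "notice"
    'div span',               # <span> elements anywhere inside a <div>
    'div > span',             # <span> elements that are direct children of a <div>
    'input[name]',            # <input> elements that have a name attribute
    'input[type="button"]',   # <input> elements whose type is "button"
]

# Count how many elements each selector matches.
counts = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print('{:25} matches {} element(s)'.format(selector, count))
```

Note how `'div span'` matches both spans, whereas `'div > span'` matches only the one that sits directly inside the `<div>`.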

In [38]:
import bs4


with open('./example.html') as f:
    example_html = f.read()

soup = bs4.BeautifulSoup(example_html, 'html.parser')

elems = soup.select('body')

#print(soup.prettify())
print('1: return type of select()',type(elems))
print('2: length of the returned list',len(elems))
print('3: type of elements in the list',type(elems[0]))
print('4: get text from the element',elems[0].getText()[:40])
print('5: string representation of an element: ',str(elems[0]))
print('6: the attributes of the element: ',elems[0].attrs)

1: return type of select() <class 'list'>
2: length of the returned list 1
3: type of elements in the list <class 'bs4.element.Tag'>
4: get text from the element 
Hello World!
You are extremely welcome!
5: string representation of an element:  <body>
<h1>Hello World!</h1>
You are extremely welcome!<br/>
<br/>
The <a href='\"https://github.com/datsoftlyngby/dat4sem2019spring-python-materials\"'>Lecture Notes</a>.<br/>
<div>
<p>paragraph 1</p>
<p>and paragraph 2: <span id="span01">This is span 1</span><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
</div>
</body>
6: the attributes of the element:  {}


In [40]:
p_elems = soup.select('p')

for el in p_elems:
    print(str(el))
    print(el.getText())
    print('------------')

<p>paragraph 1</p>
paragraph 1
------------
<p>and paragraph 2: <span id="span01">This is span 1</span><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
and paragraph 2: This is span 1Second span element
Here is the third span

------------


## Getting Data from an Element’s Attributes

The `get()` method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value.

In [42]:
{'id': 'lecturer'}.get('id', 0)

'lecturer'

In [43]:
import bs4

with open('./example.html') as f:
    example_html = f.read()
    
soup = bs4.BeautifulSoup(example_html, 'html.parser')
# soup.find_all?
span_elem = soup.select('span')[0]
print(str(span_elem))
print(span_elem.get('id'))
print(span_elem.get('some_nonexistent_addr') is None)
print(span_elem.attrs)

<span id="span01">This is span 1</span>
span01
True
{'id': 'span01'}


### What is the difference between the `select` and the `find`/`find_all` functions?

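In short, `select()` takes a CSS selector string, whereas `find()` and `find_all()` take a tag name plus optional attribute filters. `find()` returns the first matching element or `None`, while `find_all()` and `select()` return all matches. For a longer discussion, see:
https://stackoverflow.com/questions/38028384/beautifulsoup-is-there-a-difference-between-find-and-select-python-3-x#38033910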
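
As a small illustration of how these functions relate, the sketch below runs them side by side on a shortened, made-up variant of the paragraph from `example.html`.

```python
import bs4

# Shortened stand-in for the second paragraph of example.html.
html = ('<p>and paragraph 2: <span id="span01">This is span 1</span>'
        '<span class="red_border">Here is the third span</span></p>')
soup = bs4.BeautifulSoup(html, 'html.parser')

first_span = soup.find('span')        # first matching Tag, or None
all_spans = soup.find_all('span')     # all matching Tags
missing = soup.find('table')          # no <table> in the document

# The same query expressed as an attribute filter and as a CSS selector.
by_attrs = soup.find_all('span', attrs={'class': 'red_border'})
by_css = soup.select('span.red_border')

print(first_span.get('id'))
print(len(all_spans))
print(missing)
print([tag.get_text() for tag in by_attrs] == [tag.get_text() for tag in by_css])
```

Which style to use is largely a matter of taste: CSS selectors are compact for nested structures, while `find_all()` with an attribute dictionary can be easier to build programmatically.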

# Example Scraping Events from a Page


Usually, you will use web scraping to collect information that you cannot gather otherwise.
For example, let's imagine we want to do some statistics about:
- concerts in Copenhagen, 
- their start times and 
- their door prices.

Since we cannot find an API or any other open dataset, we decide to scrape the publicly available homepage www.kultunaut.dk.

The website lists all possible events in Denmark. 
Concerts in Copenhagen are for example accessible here: 
- http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&periode=&Genre=Musik

**Note:** Many web pages are not built to support high traffic, or they explicitly discourage automated access. Keep this in mind when writing your scraping tool, for example by pausing between requests with `from time import sleep` followed by `sleep(3)`, which sleeps for 3 seconds.
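
One way to build such a pause into a scraper is sketched below. The helper name `fetch_politely` and its parameters are assumptions made for this illustration; the actual download is passed in as a function, so the sketch runs without network access.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=3.0):
    """Call fetch(url) for every URL, pausing between consecutive requests.

    fetch is any callable that takes a URL and returns the page content,
    for example: lambda url: requests.get(url).text
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # wait before every request except the first one
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in fetch function so that the example does not hit a real server.
def fake_fetch(url):
    return 'content of ' + url


pages = fetch_politely(['page-1', 'page-2'], fake_fetch, delay_seconds=0.1)
print(pages)
```

A fixed delay is the simplest option; a site's terms of use or `robots.txt` may call for longer pauses.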


Considering our example:
- First, we have to figure out how many events there are in total.
- We need this number because the events are paginated, with twenty events per page.
- The link given above returns only the first page with the first twenty events.
- From the total number of events, we can generate the URLs for the subsequent pages.
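
The last step can be sketched as follows. The URL template is the one this notebook uses further down; the function names are made up for illustration, and the sketch assumes that `Startnr` is the 1-based number of the first event on a page, which matches the "Showing 1-20 of ..." header but is not confirmed by the site's documentation.

```python
# URL template as used later in this notebook; Startnr is ASSUMED to be the
# 1-based number of the first event shown on a page.
BASE_URL = ('http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?Startnr={}'
            '&showmap=&Area=Kbh.%20og%20Frederiksberg&periode=&Genre=Musik&')


def page_start_numbers(total_events, page_size=20):
    """Return the number of the first event on every result page."""
    return list(range(1, total_events + 1, page_size))


def page_urls(total_events, page_size=20):
    """Build one URL per result page from the total number of events."""
    return [BASE_URL.format(start)
            for start in page_start_numbers(total_events, page_size)]


print(page_start_numbers(45))   # [1, 21, 41]
print(len(page_urls(1310)))     # 66
```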

In [1]:
import bs4
import requests


r = requests.get('http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&periode=&Genre=Musik')
r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html.parser')

elems = soup.find_all('td', {'style': 'background-color:#CBDCEE;color:#324669'})
number = int(str(elems[0]).split(' ')[4].split('\n')[0])

print(number)

1282


In [45]:
r = requests.get('http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&periode=&Genre=Musik')
r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html.parser')

elems = soup.find_all('td', attrs={'style': 'background-color:#CBDCEE;color:#324669'})[0].find('b')
print(elems)

<b>Showing 1-20 of 1294
events  in Cph. and Frederiksberg <span style="white-space: nowrap;"> from/after 26 Mar 2019</span> </b>


### Looking at the browser inspector pane:

<img src="images/inspect_element.png" width="500">

The inspector shows that the desired element sits in the following structure: a `b` tag inside an `h3` tag inside a `td` tag, or as a CSS selector:
- `('td h3 b')`

In [47]:
import bs4
import requests
html = requests.get('http://www.kultunaut.dk/perl/arrlist/type-nynaut?Area=&ArrStartday=22&ArrStartmonth=Marts&ArrStartyear=2019&ArrSlutday=29&ArrSlutmonth=Marts&ArrSlutyear=2019&ArrMaalgruppe=&DefaultGenre=Musik&ArrKunstner=')
txt = html.text
soup = bs4.BeautifulSoup(txt, 'html.parser')
events = soup.select('td h3 b')
print('number of events is {}'.format(len(events)))
for e in events:
    print(e.getText())

number of events is 20
Nedenunder: Underwxlrd Showcase #2 + After Party.
Odense Symfoniorkester: Clarinet.
Odense Symfoniorkester: Flute.
Aarhus International Piano Competition.
Erindringsdans.
Lunchkonsert.
I and You.
John Mogensen Live Duo.
Aarhus International Piano Competition.
Forårsfestival: Museumskoncert.
Morten Grønvad The End.
Orgel-musikkens kvarter.
Fredagsbaren.
Sebastian Klein - Mer' styr på dyr!
Fyraftenskoncert.
Fredagskoncert: Rued Langgaard.
Den blå time.
Wave Blues Band ( Fyraftens Jazz ).
Vicious Rock Indoors 2019.
Koncert med Stepp2 og fernisering på udstilling med Wiliam Skotte Olsen m.fl.


In [11]:
import requests
from bs4 import BeautifulSoup
gov = requests.get('https://analytics.usa.gov')
soup = BeautifulSoup(gov.text, 'lxml')
print(soup.prettify()[:100])
print('------------------------')
for link in soup.find_all('a'):
    print(link.get('href'))

<!DOCTYPE html>
<html lang="en">
 <!-- Initalize title and data source variables -->
 <head>
  <!--

------------------------
/
#explanation
https://analytics.usa.gov/data/
data/
#top-pages-realtime
#top-pages-7-days
#top-pages-30-days
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
mailto:DAP@support.digitalgov.gov
https://analytics.usa.gov/data/
https://analytics.usa.gov/developer
mailto:DAP@support.digitalgov.gov
https://github.com/GSA/analytics.usa.gov/issues
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
http://www.gsa.gov/
https://www.digitalgov.gov/services/dap/
https://cloud.gov/
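
Several of the `href` values printed above are relative (`/`, `data/`) or not web pages at all (`mailto:`). Before following links, they are typically resolved against the URL of the page they came from. A minimal sketch using only the standard library, with a made-up helper name:

```python
from urllib.parse import urljoin, urlparse


def absolute_http_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only http(s) links."""
    links = []
    for href in hrefs:
        if not href:
            # <a> tags without an href attribute yield None
            continue
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme in ('http', 'https'):
            links.append(absolute)
    return links


hrefs = ['/', 'data/', 'https://cloud.gov/',
         'mailto:DAP@support.digitalgov.gov', None]
for link in absolute_http_links('https://analytics.usa.gov', hrefs):
    print(link)
```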


In [41]:
r = requests.get('http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&periode=&Genre=Musik')
r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html.parser')

elems = soup.find_all('td', attrs={'style': 'background-color:#CBDCEE;color:#324669'})[0].find('b')
print(elems.text, type(elems.text))
no_events = elems.text.split(' ')[3].split('\r')[0]
no_events = int(no_events)
print(no_events)

Showing 1-20 of 1310
events  in Cph. and Frederiksberg  from/after 22 Mar 2019  <class 'str'>
1310
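
Splitting on single spaces and picking a fixed position works only as long as the header text keeps exactly this whitespace. A regular expression is a somewhat more robust alternative. The sketch below assumes the header always contains the phrase "of" followed by the total; the function name is made up.

```python
import re


def extract_total(header_text):
    """Return the total from text such as 'Showing 1-20 of 1310 events', or None."""
    m = re.search(r'\bof\s+(?P<total>\d+)', header_text)
    return int(m.group('total')) if m else None


sample = ('Showing 1-20 of 1310\r\n'
          'events  in Cph. and Frederiksberg  from/after 22 Mar 2019')
print(extract_total(sample))           # 1310
print(extract_total('no number here')) # None
```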


Now we can scrape the events page by page. Observe that our `base_url` http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?Startnr={}&showmap=&Area=Kbh.%20og%20Frederiksberg&periode=&Genre=Musik& now has a placeholder for the paginated results (`Startnr={}`).

Consequently, we scrape each page separately; see the function `scrape_events_per_page` below. From examining the page's source code, we know that all events are given as table cells with a corresponding header. We iterate over these table cells and extract the strings for dates and prices if they exist.

In [51]:
from tqdm import tqdm

    
def scrape_events_per_page(url):
    """
    returns:
        A list of tuples of strings holding title, place, date, and price
        for concerts in Copenhagen scraped from kultunaut.dk
    """
    r = requests.get(url)
    r.raise_for_status()

    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    event_cells = soup.find_all('td', {'width': '100%', 'valign' : 'top'})
    scraped_events_per_page = []
    
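    for event_cell in event_cells:
        try:
            title = event_cell.find('b').text
            spans = event_cell.find_all('span')
            place = spans[1].text
            try:
                date, price = spans[0].text.splitlines()
            except ValueError:
                # not exactly two lines, so treat the cell as having no price
                date = spans[0].text.splitlines()[0]
                price = ''
        except Exception as e:
            # skip cells that do not look like an event entry instead of
            # appending stale values from the previous iteration
            print(e)
            continue
            
        scraped_events_per_page.append((title, place, date, price))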
        
    return scraped_events_per_page


base_url = 'http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?Startnr={}&showmap=&Area=Kbh.%20og%20Frederiksberg&periode=&Genre=Musik&'

scraped_events = []
indexes = list(range(1, no_events, 20))
indexes[0] = 0

for idx in tqdm(indexes):
    scrape_url = base_url.format(idx)
    scraped_events += scrape_events_per_page(scrape_url)

 80%|████████  | 53/66 [00:48<00:11,  1.09it/s]

list index out of range


100%|██████████| 66/66 [01:02<00:00,  1.18s/it]


### What do we have so far?

Now you can see that we extracted a list of four-element tuples of strings, each consisting of the title of the event, its location, its date and time, and its entrance fee.

In [47]:
scraped_events[:20]

[('Nedenunder: Underwxlrd Showcase #2 + After Party.',
  'Rust, Guldbergsgade 8, Copenhagen N',
  'Fri 22 March 2019, kl. 00.',
  ''),
 ('Morten Grønvad The End.',
  'Jazzcup, Gothersgade 107, Copenhagen K',
  'Fri 22 March 2019, 3.30 pm.',
  '(Entrance fee: 120,- (100,-))'),
 ('Orgel-musikkens kvarter.',
  'Kristkirken, Enghave Plads 18, Copenhagen V',
  'Fri 22 March 2019, 4 pm.',
  ''),
 ('Fredagskoncert: Rued Langgaard.',
  'Trinitatis Kirke, Landemærket 2, Copenhagen K',
  'Fri 22 March 2019, 4.30 pm  - 5.30 pm.',
  ''),
 ('Wave Blues Band ( Fyraftens Jazz ).',
  'Drop Inn, Kompagnistræde 34, Copenhagen K',
  'Fri 22 March 2019, 4.30 pm  - 7.30 pm.',
  '(Free admission)'),
 ('Thirst For You Gospel Workshop.',
  'Timotheuskirken, Christen Bergs Allé 5, Valby',
  'Fri 22 March 2019, 5.30 pm.',
  '(Entrance fee: 129.50 DKK)'),
 ('Friday Rituals feat. Bryld & Haack.',
  'Brochner Hotels - SP34, Sankt Peders Stræde 34, Copenhagen K',
  'Fri 22 March 2019, 6 pm  - 7 pm.',
  '(Free admis

### How to Extract Dates and Prices from Strings

Remember that the raw data we extracted from the web pages is all of type `str`. To compute statistics about a possible correlation between start times and entrance fees, we need to convert the corresponding tuple fields into datetimes and integers, respectively.


Since dates given on the web do not necessarily conform to standardized time formats, we can apply the `dateparser` (https://pypi.python.org/pypi/dateparser) module, which tries to parse arbitrary strings into datetimes.

You can install the module via:

```bash
pip install dateparser
```

You can read more about the module and its capabilities at https://dateparser.readthedocs.io/en/latest/.

In [None]:
%%bash
pip install dateparser

In [51]:
import re

from tqdm import tqdm
from dateparser import parse


def get_dates_and_prices(scraped_events):
    """
    Clean up the data. Get price as integer and date as datetime.
    
    returns:
        A list of two-element tuples, each with a datetime representing the
        start time of an event and an integer representing the price in DKK.
    """

    # (?P<price>...) is a named group that can be referred to as 'price',
    # \d matches a digit and + means one or more; the r prefix marks a raw string.
    price_regexp = r"(?P<price>\d+)"

    data_points = []

    for event_data in tqdm(scraped_events):
        title_str, place_str, date_str, price_str = event_data
        
        if 'Free admission' in price_str:
            price = 0
        else:
            # re.search returns a Match object, or None if no digits are found
            m = re.search(price_regexp, price_str)
            # use the first number as the price, or 0 if there is none
            price = int(m.group('price')) if m else 0

        date_str = date_str.strip().strip('.')
        if '&' in date_str:
            date_str = date_str.split('&')[0]
        if '-' in date_str:
            date_str = date_str.split('-')[0]
        if '.' in date_str:
            date_str = date_str.replace('.', ':')
        
        date = parse(date_str)
        if date:
            data_points.append((date, price))
            
    return data_points


dates_and_prices = get_dates_and_prices(scraped_events)

100%|██████████| 1310/1310 [03:23<00:00,  6.43it/s] 


In [52]:
dates_and_prices[10:20]

[(datetime.datetime(2019, 3, 26, 20, 0), 110),
 (datetime.datetime(2019, 3, 26, 20, 0), 160),
 (datetime.datetime(2019, 3, 26, 20, 0), 130),
 (datetime.datetime(2019, 3, 26, 20, 0), 0),
 (datetime.datetime(2019, 3, 26, 20, 0), 80),
 (datetime.datetime(2019, 3, 26, 20, 15), 50),
 (datetime.datetime(2019, 3, 26, 21, 0), 0),
 (datetime.datetime(2019, 3, 27, 8, 30), 0),
 (datetime.datetime(2019, 3, 27, 17, 0), 0),
 (datetime.datetime(2019, 3, 27, 17, 0), 0)]
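
The price extraction used above can also be tried in isolation on a few of the scraped strings. The helper name `parse_price` is made up for this sketch, but it applies the same rules as `get_dates_and_prices`.

```python
import re

# Named group 'price': the first run of one or more digits in the string.
PRICE_PATTERN = re.compile(r"(?P<price>\d+)")


def parse_price(price_str):
    """Return the first number in price_str as an int, or 0 if there is none."""
    if 'Free admission' in price_str:
        return 0
    m = PRICE_PATTERN.search(price_str)
    return int(m.group('price')) if m else 0


samples = ['(Entrance fee: 120,- (100,-))',
           '(Entrance fee: 129.50 DKK)',
           '(Free admission)',
           '']
for s in samples:
    print(repr(s), '->', parse_price(s))
```

This makes the simplifications visible: only the first number is used, so reduced prices are ignored, decimals are cut off (`129.50` becomes `129`), and events without any price information are counted as free.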

### Plotting Times vs. Prices

In [59]:
%matplotlib notebook 
from datetime import datetime
import matplotlib.pyplot as plt


dates, prices = zip(*dates_and_prices)
ref_day = datetime.today()
times = [datetime.combine(ref_day, t.time()) for t in dates] # get the time intervals for today

# plot
plt.plot(times, prices, 'ro')
#plt.plot(dates, prices, 'ro')
# beautify the x-labels
plt.gcf().autofmt_xdate()

plt.show()


### The Pearson Correlation Coefficient

The Pearson correlation coefficient measures the linear relationship
between two datasets. Like other correlation coefficients, this one varies between -1 and +1
with 0 implying no correlation. Correlations of -1 or +1 imply an exact
linear relationship. Positive correlations imply that as x increases, so
does y. Negative correlations imply that as x increases, y decreases.

The p-value roughly indicates the probability of an uncorrelated system
producing datasets that have a Pearson correlation at least as extreme
as the one computed from these datasets.
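
To make the definition concrete, the coefficient can be computed by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The following sketch uses only the standard library and made-up sample data; it does not compute a p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length with at least 2 items')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # sums of squared deviations
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # raises ZeroDivisionError if one of the sequences is constant
    return cov / math.sqrt(var_x * var_y)


print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))        # exact positive linear relationship
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))        # exact negative linear relationship
print(pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # positive, but not exact
```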

In [58]:
import matplotlib.dates
from scipy.stats import pearsonr


x, y = zip(*dates_and_prices)
x = matplotlib.dates.date2num(x)

pearsonr(x, y)

(0.008840273117261523, 0.7727034143471962)

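Consequently, since our Pearson correlation coefficient is very close to zero and the p-value of about 0.77 is large, the data give no evidence of a linear relationship between when a concert starts and the price you have to pay for it. Note that `x` here is the full date and time of each event, not only its time of day.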

# Class exercise: Scraping Images from a Page

In the following code you will use Beautiful Soup to extract all links to images, which are in `img` tags on a web page.

In [72]:
import bs4
import os
import sys
import requests
import shutil


def collect_img_links(url):
    r = requests.get(url)
    r.raise_for_status()
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    
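    # keep only images whose src attribute is present and an absolute URL
    return [img.get('src') for img in soup.select('img') 
            if img.get('src', '').startswith('http')]


def download_imgs(links, out_folder="./test/"):
    # create the output folder if it does not exist yet
    os.makedirs(out_folder, exist_ok=True)
    for img_no, link in enumerate(links, start=1):
        r = requests.get(link, stream=True)
        with open(os.path.join(out_folder, 'img' + str(img_no) + '.jpg'), 'wb') as f:
            r.raw.decode_content = True
            shutil.copyfileobj(r.raw, f)     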
        
links = collect_img_links('https://www.google.dk/search?site=&tbm=isch&source=hp&biw=1163&bih=812&q=minions&oq=minions')
download_imgs(links)
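
The code above stores every image with a `.jpg` extension, whatever its actual type. One possible refinement is to take the extension from the path of the image URL and fall back to `.jpg` only when there is none. The helper name and the example.com URLs below are made up for this sketch.

```python
import os
from urllib.parse import urlparse


def image_filename(url, index, default_ext='.jpg'):
    """Build a file name such as 'img3.png' from an image URL and a counter."""
    path = urlparse(url).path                  # drops the query string
    ext = os.path.splitext(path)[1].lower()    # '' if the path has no extension
    return 'img{}{}'.format(index, ext or default_ext)


print(image_filename('https://example.com/pics/minion.PNG?size=large', 1))
print(image_filename('https://example.com/image?id=7', 2))
```

The extension in a URL is only a hint; the `Content-Type` header of the response would be a more reliable source.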

# Exercise 2: Writing a Simple Web Crawler

Write a simple web crawler, that is, a program that recursively extracts all links from web pages. The result of running the web crawler is a dictionary in which each key is the URL of a web page and the corresponding value holds the outgoing links found on that page.


In case a page returns a status code other than `200`, we simply disregard this page. See https://en.wikipedia.org/wiki/List_of_HTTP_status_codes for more details on the various HTTP status codes.

In [77]:
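def scrape_links(from_url, for_depth, all_links=None):
    # This is what the exercise above asks you to implement!
    # all_links defaults to None rather than {} because a mutable default
    # would be shared between calls; create the dictionary inside the function.
    pass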


start_url = 'https://www.version2.dk/artikel/google-deepmind-vi-oeger-sikkerheden-mod-misbrug-sundhedsdata-1074452'

link_dict = scrape_links(from_url=start_url, for_depth=2)

The web crawler that you wrote above is perhaps not the most performant. If you are interested in more web scraping and applications of crawlers, have a look at the `scrapy` module (https://scrapy.org).