In [1]:
import warnings
warnings.filterwarnings('ignore')

# HTML Refresher
This part is based on chapter 11 of *Automate the Boring Stuff with Python* by Al Sweigart.

HTML files are plain text files containing *tags*, which are words enclosed in angle brackets. Tags tell the browser how to format the web page. A starting tag and closing tag can enclose some text to form an element. The text (or inner HTML) is the content between the starting and closing tags.

There are many different tags in HTML. Some of these tags have extra properties in the form of attributes within the angle brackets. For example, the `<a>` tag encloses text that should be a link, and its `href` attribute determines the URL that the text links to.

Some elements have an `id` attribute that is used to uniquely identify the element in the page. You will often instruct your programs to seek out an element by its id attribute, so figuring out an element’s id attribute using the browser’s developer tools is a common task in writing web scraping programs.
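
To make these terms concrete, here is a minimal sketch that uses only Python's standard-library `html.parser` module (not BeautifulSoup, which is introduced further down). The HTML snippet and the class name `TagLister` are made up for illustration; the sketch merely records every start tag together with its attributes.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collects a (tag, attributes) pair for every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from inside the angle brackets
        self.seen.append((tag, dict(attrs)))


# illustrative snippet: a link with an href attribute and a span with an id attribute
snippet = '<p>The <a href="https://example.com">notes</a> and <span id="span01">a span</span></p>'

lister = TagLister()
lister.feed(snippet)
for tag, attrs in lister.seen:
    print(tag, attrs)
```

The `<p>` element has no attributes, the `<a>` element carries its link target in `href`, and the `<span>` element can be identified by its `id`.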

In [2]:
%%bash
cat << EOF > data/example.html
<!DOCTYPE html>
<html>
<head>
<title>Hello!</title>
</head>
<body>
<h1>Hello World!</h1>
You are extremely welcome!<br>
<br>
The <a href=\"https://github.com/datsoftlyngby/dat4sem2019spring-python-materials\">Lecture Notes</a>.<br>
<div>
<p>paragraph 1</p>
<p>and paragraph 2: <span id="span01">This is span 1</span id="span02"><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
</div>
</body>
</html>
EOF

In [3]:
!cat data/example.html

<!DOCTYPE html>
<html>
<head>
<title>Hello!</title>
</head>
<body>
<h1>Hello World!</h1>
You are extremely welcome!<br>
<br>
The <a href=\"https://github.com/datsoftlyngby/dat4sem2019spring-python-materials\">Lecture Notes</a>.<br>
<div>
<p>paragraph 1</p>
<p>and paragraph 2: <span id="span01">This is span 1</span id="span02"><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
</div>
</body>
</html>


# View a Page's HTML Sources

Here, I will only describe how to use Firefox's developer tools.

To view a page's source, right-click on it and choose **View page source**, which opens a new tab with the HTML source.

<img src="images/view_source_small.png" width="500"> 

In Firefox, you can bring up the Web Developer Tools Inspector by pressing `CTRL-SHIFT-C` on Windows and Linux or `CMD-OPTION-C` on macOS.

<img src="images/inspector_small.png" width="600"> 

# Parsing HTML with BeautifulSoup

BeautifulSoup is a module for parsing and extracting information from HTML sources. In case it is not already installed on your machine:
- Install it with `pip install beautifulsoup4`.
- While `beautifulsoup4` is the name used for installation, the module’s name is `bs4`, so to import BeautifulSoup you have to use `import bs4`.

According to its documentation (https://www.crummy.com/software/BeautifulSoup/) *"Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match "foo.com", or "Find the table heading that's got bold text, then give me that text.""*

## Creating a BeautifulSoup Object from a Local HTML File

- The `bs4.BeautifulSoup()` function needs to be called with a string containing the HTML it will parse and, optionally, the name of the parser to use, e.g. `'html.parser'`.
- The `bs4.BeautifulSoup()` function returns a `BeautifulSoup` object.

You can load a local HTML file and pass either its contents as a string or the open file object to `bs4.BeautifulSoup()`.

In [1]:
import bs4

with open('./data/example.html') as f:
    example_html = f.read()
    
# name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = bs4.BeautifulSoup(example_html, 'html.parser')
print(type(soup))
print(soup.prettify())

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
 <head>
  <title>
   Hello!
  </title>
 </head>
 <body>
  <h1>
   Hello World!
  </h1>
  You are extremely welcome!
  <br/>
  <br/>
  The
  <a href='\"https://github.com/datsoftlyngby/dat4sem2019spring-python-materials\"'>
   Lecture Notes
  </a>
  .
  <br/>
  <div>
   <p>
    paragraph 1
   </p>
   <p>
    and paragraph 2:
    <span id="span01">
     This is span 1
    </span>
    <span id="span03">
     Second span element
    </span>
    <span class="red_border">
     Here is the third span
    </span>
   </p>
  </div>
 </body>
</html>
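
As an alternative to reading the file into a string first, the open file object itself can be handed to `bs4.BeautifulSoup()`. The following is a small sketch of that variant; the temporary file and its made-up content only serve to keep the example self-contained.

```python
import os
import tempfile

import bs4

# write a tiny, made-up HTML file so the sketch does not depend on data/example.html
html = '<html><head><title>Hello!</title></head><body><p>paragraph 1</p></body></html>'
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False, encoding='utf-8') as tmp:
    tmp.write(html)
    path = tmp.name

# pass the open file object itself instead of the string returned by f.read()
with open(path, encoding='utf-8') as f:
    soup_from_file = bs4.BeautifulSoup(f, 'html.parser')

os.remove(path)
print(soup_from_file.title.text)
```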



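## Creating a BeautifulSoup Object from a Remote HTML File

To parse a remote page, first download it with the `requests` module and then pass the response text to `bs4.BeautifulSoup()`.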



In [2]:
import bs4
import requests


r = requests.get('https://github.com/datsoftlyngby/dat4sem2020spring-python')
r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html.parser')

print(soup.prettify()[:1500])

<!DOCTYPE html>
<html data-a11y-animated-images="system" data-color-mode="auto" data-dark-theme="dark" data-light-theme="light" lang="en">
 <head>
  <meta charset="utf-8"/>
  <link href="https://github.githubassets.com" rel="dns-prefetch"/>
  <link href="https://avatars.githubusercontent.com" rel="dns-prefetch"/>
  <link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
  <link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
  <link crossorigin="" href="https://github.githubassets.com" rel="preconnect"/>
  <link href="https://avatars.githubusercontent.com" rel="preconnect"/>
  <link crossorigin="anonymous" href="https://github.githubassets.com/assets/light-5178aee0ee76.css" media="all" rel="stylesheet">
   <link crossorigin="anonymous" href="https://github.githubassets.com/assets/dark-217d4f9c8e70.css" media="all" rel="stylesheet">
    <link crossorigin="anonymous" data-color-theme="dark_dimmed" data-href="https://github.githubassets.com/assets/da

## Finding an Element with the `select()` Method

You can retrieve HTML elements from a `BeautifulSoup` object by calling the `select()` method and passing a string of a CSS selector for the element you are looking for. Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings.

Common CSS selector patterns include:

  * `soup.select('div')` ... selects all elements named `<div>`
  * `soup.select('#lecturer')`  ... selects the element with an id attribute of lecturer
  * `soup.select('.notice')` ... selects all elements that use a CSS class attribute named notice
  * `soup.select('div span')` ... selects all elements named `<span>` that are within an element named `<div>`
  * `soup.select('div > span')` ... selects all elements named `<span>` that are directly within an element named `<div>`, with no other element in between
  * `soup.select('input[name]')` ... selects all elements named `<input>` that have a name attribute with any value
  * `soup.select('input[type="button"]')` ... selects all elements named `<input>` that have an attribute named type with value button
  
See more in the documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
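
The selector patterns listed above can be tried out on a small document. The HTML below is made up so that every pattern has at least one match; the variable name `demo_soup` is likewise only illustrative.

```python
import bs4

# made-up HTML that contains a match for each selector pattern listed above
html = """
<div id="lecturer">
  <span class="notice">direct child</span>
  <p><span>nested deeper</span></p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
"""
demo_soup = bs4.BeautifulSoup(html, 'html.parser')

print(len(demo_soup.select('div')))             # all <div> elements
print(demo_soup.select('#lecturer')[0].name)    # the element with id="lecturer"
print(demo_soup.select('.notice')[0].text)      # elements with the class "notice"
print(len(demo_soup.select('div span')))        # <span> anywhere inside a <div>
print(len(demo_soup.select('div > span')))      # <span> directly inside a <div>
print(len(demo_soup.select('input[name]')))     # <input> with a name attribute
print(demo_soup.select('input[type="button"]')[0].get('value'))
```

Note how `'div span'` also finds the span nested inside the paragraph, while `'div > span'` only finds the span that is a direct child of the `<div>`.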

In [3]:
import bs4


with open('./data/example.html') as f:
    example_html = f.read()

soup = bs4.BeautifulSoup(example_html, 'html.parser')

elems = soup.select('body')

#print(soup.prettify())
print('1: return type of select()',type(elems))
print('2: length of the returned list',len(elems))
print('3: type of elements in the list',type(elems[0]))
print('4: get text from the element',elems[0].getText()[:40])
print('5: string representation of an element: ',str(elems[0]))
print('6: the attributes of the element: ',elems[0].attrs)
elements = soup.select('div > p > span')
for element in elements:
    print(element.attrs)
print(soup.title.text)

1: return type of select() <class 'bs4.element.ResultSet'>
2: length of the returned list 1
3: type of elements in the list <class 'bs4.element.Tag'>
4: get text from the element 
Hello World!
You are extremely welcome!
5: string representation of an element:  <body>
<h1>Hello World!</h1>
You are extremely welcome!<br/>
<br/>
The <a href='\"https://github.com/datsoftlyngby/dat4sem2019spring-python-materials\"'>Lecture Notes</a>.<br/>
<div>
<p>paragraph 1</p>
<p>and paragraph 2: <span id="span01">This is span 1</span><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
</div>
</body>
6: the attributes of the element:  {}
{'id': 'span01'}
{'id': 'span03'}
{'class': ['red_border']}
Hello!


In [4]:
p_elems = soup.select('p')

for el in p_elems:
    print(str(el))
    print(el.getText())
    print('------------')

<p>paragraph 1</p>
paragraph 1
------------
<p>and paragraph 2: <span id="span01">This is span 1</span><span id="span03">Second span element</span>
<span class="red_border">Here is the third span</span>
</p>
and paragraph 2: This is span 1Second span element
Here is the third span

------------


## Getting Data from an Element’s Attributes

The `get()` method for Tag objects makes it simple to access attribute values from an element. The method is passed a string of an attribute name and returns that attribute’s value.

In [5]:
{'id': 'lecturer'}.get('id')

'lecturer'

In [6]:
import bs4

with open('./data/example.html') as f:
    example_html = f.read()
    
soup = bs4.BeautifulSoup(example_html, 'html.parser')
# soup.find_all?
span_elem = soup.select('span')[0]
print(str(span_elem))
print(span_elem.get('id'))
# get() returns None for attributes that do not exist
print(span_elem.get('some_nonexistent_addr') is None)
print(span_elem.attrs)

<span id="span01">This is span 1</span>
span01
True
{'id': 'span01'}


### What is the difference between the `select` and the `find`/`find_all` functions?

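You are not the first to wonder about this. In short, `select()` takes a CSS selector string, whereas `find()` and `find_all()` take a tag name and optional attribute filters; `find()` returns only the first match. See: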
https://stackoverflow.com/questions/38028384/beautifulsoup-is-there-a-difference-between-find-and-select-python-3-x#38033910
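
A small side-by-side sketch may help here. The HTML is made up, and the comparison is only meant to illustrate the different calling conventions, not to settle which style is preferable.

```python
import bs4

html = '<div><p>paragraph 1</p><p>and <span id="span01">a span</span></p></div>'
demo = bs4.BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns a list-like collection of matches
print(demo.select('p > span'))
# find_all() takes a tag name plus optional attribute filters
print(demo.find_all('span', {'id': 'span01'}))
# select_one() and find() return only the first match ...
print(demo.select_one('p').text)
print(demo.find('p').text)
# ... or None if nothing matches
print(demo.find('table'))
```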

# Example: Scraping Events from a Page


Usually, you will use web scraping to collect information that you cannot gather otherwise. 
For example, let's imagine we want to do some statistics about:
- concerts in Copenhagen, 
- their start times and 
- their door prices.

Since we cannot find an API or any other open dataset, we decide to scrape the publicly available homepage www.kultunaut.dk.

The website lists all possible events in Denmark. 
Concerts in Copenhagen are for example accessible here: 
- http://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&periode=&Genre=Musik

**Note:** Many web pages are not built to support high traffic, or they explicitly discourage automated access. Keep this in mind when writing your scraping tool, for example by pausing between requests:
- `from time import sleep`
- `sleep(3)  # sleep 3 seconds`
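
One way to build such a pause into a scraper is sketched below. The helper name `fetch_politely` and its parameters are hypothetical; `fetch` can be any callable that downloads a single URL.

```python
from time import sleep


def fetch_politely(urls, fetch, delay=3):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # give the server a break before the next request
        results.append(fetch(url))
    return results


# demonstration with a stand-in for a real download function and no delay
pages = fetch_politely(['page-a', 'page-b'], fetch=str.upper, delay=0)
print(pages)
```

With `requests`, the `fetch` argument could for instance be `lambda url: requests.get(url).text`.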


Considering our example:
- We first have to figure out how many events there are in total. 
- We need this information because the events are paginated, i.e., twelve events per page.
- The link given above only returns the first page with the first twelve events. 
- From the total number of events we can generate the URLs for the subsequent result pages.
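
Generating the page URLs from the total count can be sketched as follows. The function name `page_urls`, the example template, and the numbers are hypothetical, and the sketch assumes that result pages are addressed by a zero-based start offset, which may differ from the numbering a particular site uses.

```python
def page_urls(url_template, total, page_size):
    """Return one URL per result page, assuming zero-based start offsets."""
    return [url_template.format(start) for start in range(0, total, page_size)]


# hypothetical template and numbers, for illustration only
urls = page_urls('https://example.com/list?Startnr={}', total=50, page_size=20)
print(urls)
```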

In [14]:
kultunaut_2022e = 'https://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&nearmeradius=2000&ArrStartdato=15%2F11+2022&ArrSlutdato=15%2F12+2022&Genre=Musik'
kultunaut_url2 = 'https://www.kultunaut.dk/perl/arrlist/type-nynaut/UK?showmap=&Area=Kbh.+og+Frederiksberg&nearmeradius=2000&ArrStartdato=15%2F11+2022&ArrSlutdato=15%2F12+2022&Genre=Musik&Startnr={}'

In [17]:
r = requests.get(kultunaut_2022e)

r.raise_for_status()
soup = bs4.BeautifulSoup(r.text, 'html.parser')

# the total number of events sits in a <strong> inside the element with class result-count
no_of_events = soup.select('.result-count > span > strong:nth-child(2)')[0].text
print('no of events: {}'.format(no_of_events))

# find_all() with a tag name and an attribute filter
elems = soup.find_all('div', {'class': 'arr-genre'})
print('Div with class arr-genre', elems[0])
print()
print()
# another notation: reach the event titles with a CSS selector
elems = soup.select('div[class=arr-genre] > h3 > strong')
print(elems)

no of events: 319
Div with class arr-genre <div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Morgensang med Christian Bro</strong></h3>
</div>


[<strong>Morgensang med Christian Bro</strong>, <strong>Korsang i København: Corpo Fairies</strong>, <strong>Johanitterhjælpens støttekoncert</strong>, <strong>Trompetfest med piano og orgel</strong>, <strong>Natkirke: Tonen fra Himlen</strong>, <strong>Cæcilie Norby</strong>, <strong>Ally Venable Band (US)</strong>, <strong>Morgensang - Tema: Kærlighed til alt det skabte</strong>, <strong>Jacob Banks (UK)</strong>, <strong>Rag'n'Bone Man: Life By Misadventure Tour</strong>, <strong>Cæcilie Norby</strong>, <strong>Cmat (ie)</strong>]


### Looking at the browser inspector pane:

<img src="images/inspect_element.png" width="500">

We can see that the desired element is hiding in a structure like this: a `<strong>` tag inside an `<h3>` tag inside a `<div>` tag with the class `arr-genre`, or as a CSS selector:
- `('div[class=arr-genre] > h3 > strong')`

In [12]:
# use select with css-selectors rather than find_all
import bs4
import requests
html = requests.get(kultunaut_2022e)
txt = html.text
soup = bs4.BeautifulSoup(txt, 'html.parser')
events = soup.select('div[class=arr-genre] > h3 > strong')
for e in events:
    print(e.getText())

Morgensang med Christian Bro
Korsang i København: Corpo Fairies
Johanitterhjælpens støttekoncert
Trompetfest med piano og orgel
Natkirke: Tonen fra Himlen
Cæcilie Norby
Ally Venable Band (US)
Morgensang - Tema: Kærlighed til alt det skabte
Jacob Banks (UK)
Rag'n'Bone Man: Life By Misadventure Tour
Cæcilie Norby
Cmat (ie)


In [13]:
# Get all the links in a document
import requests
from bs4 import BeautifulSoup
gov = requests.get('https://analytics.usa.gov')
soup = BeautifulSoup(gov.text, 'lxml')
print(soup.prettify()[:100])
print('------------------------')
for link in soup.find_all('a'):
    href = link.get('href')
    # skip <a> tags without an href and links that do not start with https
    if href is None or not href.startswith('https'):
        continue
    print(href)

<!DOCTYPE html>
<html lang="en">
 <!-- Initalize title and data source variables -->
 <head>
  <!--

------------------------
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
https://analytics.usa.gov/data/live/all-pages-realtime.csv
https://analytics.usa.gov/data/live/all-domains-30-days.csv
https://www.digitalgov.gov/services/dap/
https://www.digitalgov.gov/services/dap/common-questions-about-dap-faq/#part-4
https://support.google.com/analytics/answer/2763052?hl=en
https://analytics.usa.gov/data/live/second-level-domains.csv
https://analytics.usa.gov/data/live/sites.csv
https://analytics.usa.gov/data/
https://open.gsa.gov/api/dap/
https://github.com/GSA/analytics.usa.gov/issues
https://github.com/GSA/analytics.usa.gov
https://github.com/18F/analytics-reporter
https://www.digital.gov/guides/dap/
https://cloud.gov/


Now we can scrape the events page by page. Observe that our URL template `kultunaut_url2` has a placeholder for the paginated results (`Startnr={}`).

Consequently, we scrape each page separately, see the function `scrape_events_per_page` below. From examining the page's source code, we know that each event is given as a `<div>` element with the class `arr-info`. We iterate over these elements and extract the strings for title, date, and place if they exist.

In [18]:
from tqdm import tqdm

    
def scrape_events_per_page(url):
    """
    returns:
        A list of tuples of strings holding title, place, and date
        for concerts in Copenhagen scraped from kultunaut.dk
    """
    r = requests.get(url)
    r.raise_for_status()

    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    event_cells = soup.find_all('div', {'class' : 'arr-info'})
    # debugging aid: show the raw HTML of the first event on this page
    print(event_cells[0])
    scraped_events_per_page = []
    
    for event_cell in event_cells:
        try:
            title = event_cell.select('h3 > strong')[0].text
            # the <time> text looks like 'Tues 15 Nov 2022 8.20 am, Aalholm Kirke';
            # split at the first comma only, since place names may contain commas
            date, _, place = event_cell.find('time').text.partition(',')
            date, place = date.strip(), place.strip()
        except (IndexError, AttributeError) as e:
            # no title or no <time> element: report it and skip this event
            print(e)
            continue
            
        scraped_events_per_page.append((title, place, date))
        
    return scraped_events_per_page


scraped_events = []
indexes = list(range(1, int(no_of_events), 12))
indexes[0] = 0


for idx in tqdm(indexes):
    scrape_url = kultunaut_url2.format(idx)
    #print(scrape_url)
    scraped_events += scrape_events_per_page(scrape_url)

  4%|▎         | 1/27 [00:00<00:18,  1.43it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Morgensang med Christian Bro</strong></h3>
</div>
<div class="arr-description"><span>Morgensang i Aalholm kirke hver tirsdag morgen. Den bedste måde at begynde dagen på er med en sang. Hver tirsdag morgen i Aalholm Kirke synger vi i vores smukke kirkerum, hvor organisten sidder ved flyglet. Denne tirsdag er det cand. theol.

</span>
</div>
<div class="kult-month-day">
<time>Tues 15 Nov 2022 8.20 am, <b>Aalholm Kirke</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


  7%|▋         | 2/27 [00:01<00:17,  1.40it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Whitney (US)</strong></h3>
</div>
<div class="arr-description"><span>Whitney vender tilbage til et af deres "yndlingsspillesteder i hele verden", når den populære amerikanske folkpop-gruppe indtager Store VEGA i København med ny musik den 16. november.

</span>
</div>
<div class="kult-month-day">
<time>Wed 16 Nov 2022 8 pm, <b>VEGA</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 11%|█         | 3/27 [00:02<00:17,  1.36it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Funk/Blues/R&amp;B
        </span>
<h3><strong>Kehlani</strong></h3>
</div>
<div class="arr-description"><span>Support: Destin Conrad. KEHLANI I DEN GRÅ HAL 17. NOVEMBER 2022. En af tidens mest betagende R&amp;B-sangerinder, Kehlani, giver koncert i Den Grå Hal til efteråret.

</span>
</div>
<div class="kult-month-day">
<time>Thur 17 Nov 2022 8 pm, <b>Den Grå Hal, Christiania</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 15%|█▍        | 4/27 [00:02<00:17,  1.31it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Jazz
        </span>
<h3><strong>Heine Hansen Trio</strong></h3>
</div>
<div class="arr-description"><span>Pianisten Heine Hansen forstår både jazzens amerikanske rødder og lader musikken vokse ind i den skandinaviske tradition vi kender fra bl.a. Jan Johansson og Thomas Clausen. Med sin trio tager han chancer og citerer fra jazzens store sangbog.

</span>
</div>
<div class="kult-month-day">
<time>Fri 18 Nov 2022 7.30 pm  - 9.30 pm, <b>Kvarterhuset</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 19%|█▊        | 5/27 [00:03<00:17,  1.27it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Koncert vigdis</strong></h3>
</div>
<div class="arr-description"><span>Solstrejf i sneen. Kvindekoret VIGDIS hylder med denne koncert den kolde tid. VIGDIS består af unge kvinder, som alle tidligere har sunget i DR-pigekoret.

</span>
</div>
<div class="kult-month-day">
<time>Sat 19 Nov 2022 5 pm, <b>Solbjerg Kirke</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 22%|██▏       | 6/27 [00:04<00:15,  1.38it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Classical
        </span>
<h3><strong>Chopins 2. klaverkoncert</strong></h3>
</div>
<div class="arr-description"><span>Velkommen til gratis koncert i Ansgarkirken med Kristoffer Hersnack og strygekvartet! Kristoffer Hersnack opfører ved denne koncert F. Chopins 2. klaverkoncert i f-mol med strygekvartet.

</span>
</div>
<div class="kult-month-day">
<time>Sun 20 Nov 2022 4 pm  - 5.30 pm, <b>Ansgarkirken</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 26%|██▌       | 7/27 [00:05<00:14,  1.37it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Tobe Nwigwe - The moMINTs abroad tour</strong></h3>
</div>
<div class="arr-description"><span>Dørene åbner kl. 19.00. https://www.allthingslive.dk/. https://www.facebook.com/allthingslivedenmark/. https://www.tobenwigwe.com/. https://www.vega.dk/.

</span>
</div>
<div class="kult-month-day">
<time>Mon 21 Nov 2022 8 pm, <b>VEGA</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 30%|██▉       | 8/27 [00:05<00:14,  1.35it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Home Concerts - Josephine Philip</strong></h3>
</div>
<div class="arr-description"><span>
</span>
</div>
<div class="kult-month-day">
<time>Wed 23 Nov 2022 7.30 pm, <b>La Oficina</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 33%|███▎      | 9/27 [00:06<00:13,  1.37it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Juniorkor</strong></h3>
</div>
<div class="arr-description"><span>Du kan læse mere om vores kor her!

</span>
</div>
<div class="kult-month-day">
<time>Thur 24 Nov 2022 3.30 pm, <b>Simon Peters Kirke</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 37%|███▋      | 10/27 [00:07<00:12,  1.35it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Fredagskoncert</strong></h3>
</div>
<div class="arr-description"><span>Musik af Heinrich Schütz i anledning af 350 året for hans død, bl.a. Gejstliche Konzerte. Annemette Pødenphandt, sopran. Rikke Lender, mezzosopran. Gerald Geerink, tenor. Rasmus Kure Thomsen, bas. Kristoffer Thams, orgel og cembalo. Fri entré.

</span>
</div>
<div class="kult-month-day">
<time>Fri 25 Nov 2022 4.30 pm  - 5.30 pm, <b>Trinitatis Kirke</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 41%|████      | 11/27 [00:08<00:11,  1.37it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Queen machine - udsolgt</strong></h3>
</div>
<div class="arr-description"><span>Hvis der er udsolgt i forsalg, vil der så være udsolgt i døren.

</span>
</div>
<div class="kult-month-day">
<time>Fri 25 Nov 2022 9 pm, <b>Amager Bio</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 44%|████▍     | 12/27 [00:08<00:10,  1.37it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Soirée karaoke chansons françaises/franske sange</strong></h3>
</div>
<div class="arr-description"><span>Venez participer à notre soirée karaoké 100% chansons françaises. Kom og deltag i vores 100 % franske sange karaoke aften. inscription obligatoire/tilmelding nødvendig - info : chansonsfrancaisesdk@gmail.com coronapas nødvendig.

</span>
</div>
<div class="kult-month-day">
<time>Sat 26 Nov 2022 6.30 pm  - 10.00 pm, <b>Metronomen</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 48%|████▊     | 13/27 [00:09<00:10,  1.32it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Adventsmusik</strong></h3>
</div>
<div class="arr-description"><span>Sankt Markus Kirkes kor opfører en halvt times musik til adventstiden under ledelse af organisten David Bendix Nielsen.

</span>
</div>
<div class="kult-month-day">
<time>Sun 27 Nov 2022 10 am, <b>Sankt Markus Kirke</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 52%|█████▏    | 14/27 [00:10<00:10,  1.26it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Set it off</strong></h3>
</div>
<div class="arr-description"><span>Koncerten er flyttet til d. 28. november 2022. Allerde købte billetter gælder til den nye dato. Refunderingsfrist: 8. april 2022. __________________________________________.

</span>
</div>
<div class="kult-month-day">
<time>Mon 28 Nov 2022 8 pm, <b>Pumpehuset</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 56%|█████▌    | 15/27 [00:11<00:09,  1.28it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Nothing,nowhere</strong></h3>
</div>
<div class="arr-description"><span>Dørene åbner kl. 19.00. Se mere på. https://www.facebook.com/allthingslivedenmark. http://www.allthingslive.dk/.

</span>
</div>
<div class="kult-month-day">
<time>Wed 30 Nov 2022 8 pm, <b>Pumpehuset</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 59%|█████▉    | 16/27 [00:11<00:08,  1.35it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Gospel/Choir/Sing-along
        </span>
<h3><strong>Copenhagen Gospel Choir</strong></h3>
</div>
<div class="arr-description"><span>Syng med! Næste sæsonstart: Torsdag, d. 1. september 2022. Vi øver hver torsdag, fordelt på 2 hold: · Hold 1: Kl. 17:00 - 19:00 · Hold 2: Kl. 19:30 - 21:30 Vælg det hold der passer dig bedst. V Sted: Timotheuskirken, Christen Bergs Allé 5, 2500 Valby.

</span>
</div>
<div class="kult-month-day">
<time>Thur 1 Dec 2022 5 pm, <b>Timotheuskirken</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 63%|██████▎   | 17/27 [00:12<00:07,  1.38it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Hip-Hop/Reggae
        </span>
<h3><strong>Masego // Studying Abroad Tour - Ekstra Koncert</strong></h3>
</div>
<div class="arr-description"><span>Se mere.

</span>
</div>
<div class="kult-month-day">
<time>Fri 2 Dec 2022 8 pm, <b>Amager Bio</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 67%|██████▋   | 18/27 [00:13<00:06,  1.34it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Classical
        </span>
<h3><strong>Klassisk koncert med Kian Kristensen Rashid</strong></h3>
</div>
<div class="arr-description"><span>En klassisk koncert som har til måls at give et indblik ind i de store og de små følelser der ligger skjult mellem musikkens linjer. Med et program præget af Mozart, Schumann og Prokofiev.

</span>
</div>
<div class="kult-month-day">
<time>Sun 4 Dec 2022 Kl 18:30 - 7.40 pm, <b>Metronomen</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 70%|███████   | 19/27 [00:14<00:05,  1.35it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Madison Cunningham (US)</strong></h3>
</div>
<div class="arr-description"><span>Læs mere.

</span>
</div>
<div class="kult-month-day">
<time>Sun 4 Dec 2022 8 pm, <b>Ideal Bar</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 74%|███████▍  | 20/27 [00:14<00:05,  1.31it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Gospel/Choir/Sing-along
        </span>
<h3><strong>Korsang i København: Corpo Fairies</strong></h3>
</div>
<div class="arr-description"><span>Et utraditionelt kor, vi synger sange fra Bulgarien, Georgien, Romasange og gospel. Plus improvisation, lydunivers, tonelandskaber, meditation og lydhealing.

  Organizer VoiceColour.

</span>
</div>
<div class="kult-month-day">
<time>Tues 6 Dec 2022 6.00 pm  - 8.30 pm, <b>Christianshavns Beboerhus</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 78%|███████▊  | 21/27 [00:15<00:04,  1.33it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Nicklas Sahl (DK)</strong></h3>
</div>
<div class="arr-description"><span>Læs mere.

</span>
</div>
<div class="kult-month-day">
<time>Wed 7 Dec 2022 8 pm, <b>VEGA</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 81%|████████▏ | 22/27 [00:16<00:03,  1.34it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Julefryd, stille fred</strong></h3>
</div>
<div class="arr-description"><span>På omtrent dette tidspunkt hvert år i december er skuldrene godt på vej op omkring ørerne, jagten på den perfekte jul er gået ind og de fleste har glemt hvad det nu egentlig var det der med jul drejede sig om.

</span>
</div>
<div class="kult-month-day">
<time>Thur 8 Dec 2022 9 pm, <b>Lindevang Kirke</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 85%|████████▌ | 23/27 [00:17<00:02,  1.36it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Moonjam / Julekoncert M. Special Guest</strong></h3>
</div>
<div class="arr-description"><span>Hvis der er udsolgt i forsalg, vil der så være udsolgt i døren.

</span>
</div>
<div class="kult-month-day">
<time>Fri 9 Dec 2022 9 pm, <b>Amager Bio</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 89%|████████▉ | 24/27 [00:17<00:02,  1.37it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Paul Potts</strong></h3>
</div>
<div class="arr-description"><span>Paul Potts' vender tilbage til Danmark, og med sig har han Gry Johansen på dette års turne.

</span>
</div>
<div class="kult-month-day">
<time>Sat 10 Dec 2022 8 pm, <b>Docken Nordhavn</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 93%|█████████▎| 25/27 [00:18<00:01,  1.38it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Rock/Pop
        </span>
<h3><strong>Mouritz/Hørslev Projektet (DK)</strong></h3>
</div>
<div class="arr-description"><span>Læs mere.

</span>
</div>
<div class="kult-month-day">
<time>Sun 11 Dec 2022 8 pm, <b>VEGA</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


 96%|█████████▋| 26/27 [00:19<00:00,  1.39it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Hip-Hop/Reggae
        </span>
<h3><strong>Little Simz</strong></h3>
</div>
<div class="arr-description"><span>OBS: Koncerten med Little Simz d. 15 januar 2022 i Amager Bio udskydes til d. 14 december 2022. Billetkøbere skal blot beholde den nuværende billet og møde op til koncerten d. 14 december 2022.

</span>
</div>
<div class="kult-month-day">
<time>Wed 14 Dec 2022 8 pm, <b>Amager Bio</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>


100%|██████████| 27/27 [00:19<00:00,  1.35it/s]

<div class="arr-info">
<div class="arr-genre">
<span class="genre_cat notranslate">
          Music
          
        </span>
<h3><strong>Julekoncert med Nathanaelskoret</strong></h3>
</div>
<div class="arr-description"><span>Nathanaelskorets julekoncert.

</span>
</div>
<div class="kult-month-day">
<time>Thur 15 Dec 2022 7.30 pm, <b>Nathanaels Kirke  og sognegård</b></time>
<div class="arrow_right"><img alt="arrow right" src="https://www.kultunaut.dk/images/nynaut22/np-arrow-right.png"/></div>
</div>
</div>





In [None]:
scraped_events[:20]

### How to Extract Dates and Prices from Strings

Remember that the raw data we extracted from the web pages is all of type `str`. To compute statistics, for example on a possible correlation between start times and entry fees, we need to convert the corresponding tuple fields into datetimes and integers, respectively.
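As a minimal sketch of that conversion, the cell below uses only the standard library. The helper names `parse_price` and `parse_fixed_format_date` and the sample strings are assumptions made for illustration; they are not taken from the scraped data. The date helper only works when the string already follows one fixed, known format.

```python
import re
from datetime import datetime

# Named group "price" captures the first run of digits in the string.
PRICE_PATTERN = re.compile(r"(?P<price>\d+)")


def parse_price(price_str):
    """Return the first integer found in price_str, or 0 if there is none."""
    if "Free admission" in price_str:
        return 0
    match = PRICE_PATTERN.search(price_str)
    return int(match.group("price")) if match else 0


def parse_fixed_format_date(date_str):
    """Parse a date string that follows the fixed format YYYY-MM-DD HH:MM."""
    return datetime.strptime(date_str, "%Y-%m-%d %H:%M")


print(parse_price("Entrance: 150 kr."))              # 150
print(parse_price("Free admission"))                 # 0
print(parse_fixed_format_date("2022-12-08 21:00"))   # 2022-12-08 21:00:00
```

`datetime.strptime` raises a `ValueError` as soon as a string deviates from the given format, so it is only suitable when the format is known in advance.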


Since dates given on the web do not necessarily conform to standardized time formats, we can apply the `dateparser` (https://pypi.python.org/pypi/dateparser) module, which tries to parse arbitrary strings into datetimes.

You can read more about the module and its capabilities at https://dateparser.readthedocs.io/en/latest/.

In [None]:
%%bash
#pip install dateparser

In [19]:
from tqdm import tqdm
import re
from dateparser import parse
from datetime import datetime


def get_dates_and_prices(scraped_events):
    """
    NOTE: No longer works well with the Kultunaut website, which changed its layout and now hides the prices behind a JavaScript function.
    Clean up the data: get the price as an integer and the date as a datetime.
    
    returns:
        A list of two-element tuples, each holding a datetime for the 
        start time of an event and an integer for the price in DKK.
    """

    # r"..." is a raw string. (?P<price>...) is a named group (not a lookbehind) that can be read with m.group('price'). \d is a digit, + means one or more.
    price_regexp = r"(?P<price>\d+)"

    data_points = []
    three_at_night = datetime.now().replace(hour=3, minute=0, second=0, microsecond=0).time()
    for event_data in tqdm(scraped_events):
        title_str, place_str, date_str, price_str = event_data
        
        if 'Free admission' in price_str:
            price = 0
        else:
            m = re.search(price_regexp, price_str)  # Match object, or None when no digits are found
            # Use the first number found as the price; fall back to 0 when there is no match.
            price = int(m.group('price')) if m else 0

        date_str = date_str.strip().strip('.')
        if '&' in date_str:
            date_str = date_str.split('&')[0]
        if '-' in date_str:
            date_str = date_str.split('-')[0]
        if '.' in date_str:
            date_str = date_str.replace('.', ':')
        
        date = parse(date_str)
        if date and date.time() > three_at_night:
            data_points.append((date, price))
            
    return data_points

def get_dates(scraped_events):
    """
    Clean up the data. Get each date as a datetime.
    
    returns:
        A list of datetimes, one for the start 
        time of each event that could be parsed.
    """
    three_at_night = datetime.now().replace(hour=3, minute=0, second=0, microsecond=0).time()
    dates = []
    for event_data in tqdm(scraped_events):
        title_str, place_str, date_str = event_data
        
        date_str = date_str.strip().strip('.')
        if '&' in date_str:
            date_str = date_str.split('&')[0]
        if '-' in date_str:
            date_str = date_str.split('-')[0]
        if '.' in date_str:
            date_str = date_str.replace('.', ':')
        
        date = parse(date_str)
        if date and date.time() > three_at_night:
            dates.append(date)
    return dates

dates = get_dates(scraped_events)
print(dates)

100%|██████████| 319/319 [00:04<00:00, 65.51it/s] 

[datetime.datetime(2022, 11, 15, 8, 20), datetime.datetime(2022, 11, 15, 18, 0), datetime.datetime(2022, 11, 15, 19, 0), datetime.datetime(2022, 11, 15, 19, 30), datetime.datetime(2022, 11, 15, 19, 30), datetime.datetime(2022, 11, 15, 20, 0), datetime.datetime(2022, 11, 15, 20, 0), datetime.datetime(2022, 11, 16, 8, 20), datetime.datetime(2022, 11, 16, 20, 0), datetime.datetime(2022, 11, 16, 20, 0), datetime.datetime(2022, 11, 16, 20, 0), datetime.datetime(2022, 11, 16, 20, 0), datetime.datetime(2022, 11, 16, 20, 0), datetime.datetime(2022, 11, 16, 20, 0), datetime.datetime(2022, 11, 18, 21, 0), datetime.datetime(2022, 11, 18, 14, 0), datetime.datetime(2022, 11, 18, 15, 0), datetime.datetime(2022, 11, 18, 15, 30), datetime.datetime(2022, 11, 18, 18, 0), datetime.datetime(2022, 11, 18, 18, 0), datetime.datetime(2022, 11, 18, 19, 30), datetime.datetime(2022, 11, 18, 19, 30), datetime.datetime(2022, 11, 18, 19, 30), datetime.datetime(2022, 11, 18, 19, 30), datetime.datetime(2022, 11, 18, 




In [None]:
dates[10:20]
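The string cleanup performed inside `get_dates` before calling `parse` can be tried in isolation. The sketch below copies those steps into a hypothetical helper named `clean_date_string`; the first sample string comes from the scraped output above, the second is made up to show the `&` case.

```python
def clean_date_string(date_str):
    """Apply the same cleanup steps that get_dates uses before parsing."""
    date_str = date_str.strip().strip('.')
    if '&' in date_str:
        # Several start times: keep only the first one.
        date_str = date_str.split('&')[0]
    if '-' in date_str:
        # Time range: keep only the start.
        date_str = date_str.split('-')[0]
    if '.' in date_str:
        # "7.30 pm" becomes "7:30 pm".
        date_str = date_str.replace('.', ':')
    return date_str


print(repr(clean_date_string("Thur 15 Dec 2022 7.30 pm")))
print(repr(clean_date_string(" Fri 9 Dec 2022 8 pm & 10 pm. ")))
```

Note that splitting on `&` or `-` can leave trailing whitespace in the result, which is why the values are printed with `repr`.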

## Scraping Images from a Page

In the following code you will use Beautiful Soup to extract all links to images, which are in `img` tags on a web page.

In [21]:
!mkdir -p data/test

In [22]:
import bs4
import os
import sys
import requests
import shutil


def collect_img_links(url):
    """Return a list of the absolute image links (src values starting with 'http') found in the page at url."""
    r = requests.get(url)
    r.raise_for_status()
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    #print(soup.select('img'))
    return [img.get('src') for img in soup.select('img') 
            if img.get('src') and img.get('src').startswith('http')]


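def download_imgs(links, out_folder="./data/test/"):
    """Download all images from a list of image links.
    The output folder (data/test by default) must already exist.
    Every file gets a .jpg extension regardless of the actual image format."""
    for img_no, link in enumerate(links, start=1):
        r = requests.get(link, stream=True)
        with open(os.path.join(out_folder, f'img{img_no}.jpg'), 'wb') as f:
            r.raw.decode_content = True  # decode gzip/deflate content while reading the raw stream
            shutil.copyfileobj(r.raw, f)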
        
links = collect_img_links('https://www.google.dk/search?site=&tbm=isch&source=hp&biw=1163&bih=812&q=minions&oq=minions')
print(links)
download_imgs(links)

['https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQsoEDUxaYMfyrf9YpsbZI84gSY5xvvjQ-dOFy9pqD2nFir8-D4Jmed-GAFXQ&s', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQeyctG9tUDC5VuiPT0fmcNfpka1vbjkyaz4b0EfwEFJbM6thWGtBZA4IbpOR8&s', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSJWWVrFCSqO1s9SLTpANuqMXnJYUYvCdCT29e3J7dNsNU8CIvSnVl9eP71QVQ&s', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSbX6jFwptGkZ0GFTx0s5HSuqx-6ichbd9yhFdLCIOTWi7ng7vKBfD_Yuch7zI&s', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSp22_qmF6ljKHcBt-d-HIZ-jW8WNcIKuhxWZIawRFQ3xW-P9T4v_x_CrDtjQ&s', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTnUim_kuO8HLKOi_3p7VdlO3Tm-5znWb1oIiq13oBqxsA9yn-UnJLdZGZnNnc&s', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSxfMwig0ZYdlJpPsFjw8HefcjgMTeTlB5hWDXSlSQTsnQXYnPkheh3mqo8Ayg&s', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQc8M4CKrXBdTY3U4ek9Z1mITF82r8qoN6xiLGPzaQvT7r_Eg0BBMN1GuP-CA&s', 'https://encrypted-tbn0.gstatic.co
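`collect_img_links` keeps only `src` values that start with `http`, so images referenced by a relative path are skipped. One possible way to include them is to resolve every `src` against the page URL with `urllib.parse.urljoin` from the standard library. The helper name `resolve_img_links` and the `example.com` URLs below are assumptions for illustration only.

```python
from urllib.parse import urljoin


def resolve_img_links(page_url, src_values):
    """Turn raw src attribute values into absolute URLs, skipping empty ones."""
    return [urljoin(page_url, src) for src in src_values if src]


page_url = "https://example.com/gallery/index.html"
src_values = ["https://cdn.example.com/a.png", "/static/b.png", "c.png", None]

for link in resolve_img_links(page_url, src_values):
    print(link)
# https://cdn.example.com/a.png
# https://example.com/static/b.png
# https://example.com/gallery/c.png
```

Absolute URLs pass through unchanged, while root-relative and document-relative paths are resolved against the page they came from.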

## 01 Exercise: Writing a Simple Web Crawler

1. Write a simple web scraper that can capture all images in a document (for example https://www.cphbusiness.dk/). 
2. Write the image links to a `.md` file, formatted so that all images are displayed when the Markdown is pasted into a notebook cell and executed.
3. Extend the web scraper to find all links (`<a>` elements) in the document.
4. If you have time, let the crawler find all links in the linked documents as well (so we get one more level of the hypertext graph). Use threads if helpful.


If a page returns a status code other than `200`, we simply disregard that page. See https://en.wikipedia.org/wiki/List_of_HTTP_status_codes for more details on the various HTTP status codes.
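As a hint for step 2, Markdown displays an image with the syntax `![alt text](url)`. The sketch below shows one way to turn a list of image links into such lines and write them to a file. The function name `images_to_markdown`, the `example.com` links, and the output location in the temporary directory are all assumptions for illustration.

```python
import os
import tempfile


def images_to_markdown(links):
    """Return one Markdown image line per link, numbered from 1."""
    return "\n".join(
        f"![image {i}]({link})" for i, link in enumerate(links, start=1)
    )


links = ["https://example.com/a.png", "https://example.com/b.jpg"]
markdown = images_to_markdown(links)
print(markdown)

# Write the result to a .md file (here in the system's temporary directory).
out_path = os.path.join(tempfile.gettempdir(), "images.md")
with open(out_path, "w", encoding="utf-8") as f:
    f.write(markdown + "\n")
```

Pasting the printed lines into a Markdown cell and running it should render the images, provided the links are reachable.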


In [None]:
def scrape_links(from_url, for_depth, all_links=None):
    # all_links defaults to None rather than {} so that separate calls do not share one dict
    # TODO: Implement code for the web scraper
    # return dict(key=url, value=list of outgoing urls)
    pass


start_url = 'https://www.version2.dk/artikel/google-deepmind-vi-oeger-sikkerheden-mod-misbrug-sundhedsdata-1074452'

link_dict = scrape_links(from_url=start_url, for_depth=2)

# Scrapy
The web crawler that you wrote above is perhaps not the most performant. If you are interested in more web scraping and in applications of crawlers, have a look at the `scrapy` module (https://scrapy.org).