# Data formats and open data
**Exercises for week 5B** in Digital Methods, University of Copenhagen

## 1. HTML

HTML is the markup language used by web pages. It's ubiquitous on the web; even when editing this notebook you are interacting with HTML (right-click and hit "View Page Source" if you need proof). Here follow some exercises to get you comfortable with navigating HTML on web pages.

> **Ex. 1**: Right click inside the cell below and hit "Inspect". This should launch the "Inspector" tool in your browser, showing you where the element that renders the cell sits inside the DOM.
1. How deeply is it nested? Are there any sibling elements?
2. What happens when you update it? Change the text and see for yourself.
>
> *Hint: Most modern browsers (e.g. Firefox, Chrome, Brave) will let you hover over elements in the DOM to show where they display on the web page.*

*HTML is a beautiful soup of hypertext! I know*

> **Ex. 2**: In the HTML code below:
1. What are the `<p>`, `<h1>` and `<h2>` tags typically used for? Look them up.
2. What are the attributes of the `div` element?
3. Create a text file that ends with ".html" and open it in a browser.

    <html>
    <body>

    <div width=200 height=100 id="main">
        <h1>This is the main title of the webpage</h1>
        <h2>This is a sub-heading</h2>
        <p>This is a paragraph of text.</p>
    </div>

    <h2>This is another sub-heading</h2>
    <p>This is a paragraph of text with some words in bold.</p>
    <img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&f=1&nofb=1" width="493" height="340">
    <p>And that just above is an image.</p>

    </body>
    </html>
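
For step 3, the file can also be created from Python instead of a text editor. The following is a minimal sketch, assuming the file name `toy_html.html` and a shortened version of the markup above; neither is prescribed by the exercise.

```python
from pathlib import Path

# A shortened version of the toy document (assumption: any valid HTML works here)
html_text = """<html>
<body>
<div id="main">
    <h1>This is the main title of the webpage</h1>
    <p>This is a paragraph of text.</p>
</div>
</body>
</html>
"""

# The file name is arbitrary; the ".html" ending tells the browser how to treat it
path = Path("toy_html.html")
path.write_text(html_text, encoding="utf-8")

# A file:// address that can be pasted into the browser's address bar
file_url = path.resolve().as_uri()
print(file_url)
```

Passing `file_url` to `webbrowser.open` from the standard library should open the page in the default browser.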


> **Ex. 3**: Using the `requests` module, download [this web page](https://www.boliga.dk/resultat?propertyType=3&zipCodes=2200&page=1). Print the first 100 lines of the HTML string. How many lines are there in total?
>
> *Hint: use the `requests.get` function. To figure out how it works, execute `?requests.get` (after importing `requests`); this displays the function's documentation.*

In [20]:
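import requests as rq

# Download the page and split the HTML string into lines
url = 'https://www.boliga.dk/resultat?propertyType=3&zipCodes=2200&page=1'
lines = rq.get(url).text.split('\n')

# Print the first 100 lines (slicing is safe even if the page has fewer)
for line in lines[:100]:
    print(line)

# Total number of lines in the downloaded HTML
print(len(lines))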

<!DOCTYPE html><html lang="en"><head>
  <!-- Google Tag Manager -->
  <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
    new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
    j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
    'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
    })(window,document,'script','dataLayer','GTM-PWD5VZT');
  </script>
  <!-- End Google Tag Manager -->

  <meta charset="utf-8">
  <meta name="description" content="2200 København N &amp; Ejerlejlighed. Udforsk boligerne og få alle oplysninger om boligen, inden du køber. Se resultaterne på en liste her.">
  <meta name="keywords" content="">

  <title>2200 København N &amp; Ejerlejlighed - Boliga.dk</title>
  <base href="/">
  <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0">
  <link rel="icon" type="image/x-icon" href="favicon.ico">
  <link href="https://fonts.g

## 1.2 Scraping

*Scraping* means parsing HTML and collecting the important pieces of information inside. *Crawling* is
another important concept: the word refers to automatically sifting through pages of the web and scraping
information on each page. Most scraping and crawling work can be done using the two modules `requests` and
`BeautifulSoup`.

> **Ex. 4:** Load the toy example HTML with BeautifulSoup. Use the [documentation page](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for reference on how to do this.
1. Access the `h1` element inside the `div` and print out its content (which is "This is the main title of the webpage").
2. Get the value of the `src` attribute inside the `img` element.
3. Get the second subheading that contains "This is another sub-heading" and print out that content.
4. Get the `div` element by searching for its id.

In [36]:
from bs4 import BeautifulSoup

#from IPython.display import HTML
#HTML(filename='toy_html.html')

html_doc = """
<html>
<body>

<div width=200 height=100 id="main">
    <h1>This is the main title of the webpage</h1>
    <h2>This is a sub-heading</h2>
    <p>This is a paragraph of text.</p>
</div>

<h2>This is another sub-heading</h2>
<p>This is a paragraph of text with some words in bold.</p>
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&f=1&nofb=1" width="493" height="340">
<p>And that just above is an image.</p>

</body>
</html>"""

print(html_doc)



<html>
<body>

<div width=200 height=100 id="main">
    <h1>This is the main title of the webpage</h1>
    <h2>This is a sub-heading</h2>
    <p>This is a paragraph of text.</p>
</div>

<h2>This is another sub-heading</h2>
<p>This is a paragraph of text with some words in bold.</p>
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&f=1&nofb=1" width="493" height="340">
<p>And that just above is an image.</p>

</body>
</html>
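
One possible way to approach the four steps of Ex. 4 is sketched below. It assumes `bs4` is installed and uses the built-in `html.parser` backend (a choice made here; other parsers should work too). The toy document is repeated so that the block runs on its own.

```python
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<div width=200 height=100 id="main">
    <h1>This is the main title of the webpage</h1>
    <h2>This is a sub-heading</h2>
    <p>This is a paragraph of text.</p>
</div>
<h2>This is another sub-heading</h2>
<p>This is a paragraph of text with some words in bold.</p>
<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmedia.giphy.com%2Fmedia%2FkFgzrTt798d2w%2Fgiphy.gif&f=1&nofb=1" width="493" height="340">
<p>And that just above is an image.</p>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# 1. The h1 element nested inside the div
main_title = soup.div.h1.get_text()
print(main_title)

# 2. Attributes are accessed like dictionary keys
img_src = soup.img["src"]
print(img_src)

# 3. find_all returns matches in document order, so index 1 is the second h2
second_heading = soup.find_all("h2")[1].get_text()
print(second_heading)

# 4. Search by id; .attrs holds all attributes of the matched element
main_div = soup.find(id="main")
print(main_div.attrs)
```

The `.attrs` dictionary printed in the last step also lists the attributes of the `div` asked about in Ex. 2.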


> **Ex. 5:** Load the HTML you downloaded in Ex. 3. For each post, extract price, square meter size and "Ejerudgift". You should create three different lists that contain each variable across posts.
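
Text scraped from a listing usually needs cleaning before it can be treated as a number. The helper below is a hypothetical sketch: the name `parse_number` and the example strings are assumptions for illustration and are not taken from Boliga's markup. It targets text that uses `.` as a thousands separator, as Danish number formatting does.

```python
import re


def parse_number(text):
    """Return the first integer found in `text`, treating '.' as a thousands separator.

    Returns None when the text contains no digits.
    """
    match = re.search(r"\d[\d.]*", text)
    if match is None:
        return None
    # Drop the separators so that "2.495.000" becomes "2495000"
    return int(match.group().replace(".", ""))


# Made-up example strings, only meant to show the expected input shape
print(parse_number("2.495.000 kr."))
print(parse_number("78 m²"))
print(parse_number("no number here"))
```

Applying such a helper to the text of each extracted element would give lists of integers rather than lists of strings, which makes the later plotting step easier.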

> **Ex. 6:** Make a scatter plot of square meter size vs. extracted price. Then make a new variable that 
measures price per square meter and scatter plot this against "Ejerudgift". Can you say anything about how
"Ejerudgift" influences square meter price?

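The new variable in Ex. 6 is an element-wise division of the price list by the size list. A minimal sketch follows, assuming the three lists from Ex. 5 are aligned post by post; the function name `price_per_sqm` is hypothetical, and the numbers are made-up placeholders, not Boliga data.

```python
def price_per_sqm(prices, sizes):
    """Divide each price by the matching size.

    Returns None for a post whose price is missing or whose size is missing or zero.
    """
    return [
        price / size if price is not None and size else None
        for price, size in zip(prices, sizes)
    ]


# Placeholder values for illustration only
prices = [2_000_000, 3_000_000, 4_500_000]
sizes = [50, 100, 0]

sqm_prices = price_per_sqm(prices, sizes)
print(sqm_prices)
```

With `matplotlib` installed, `plt.scatter(sizes, prices)` would give the first plot, and the second follows from scattering the "Ejerudgift" list against `sqm_prices` once posts with `None` are filtered out.
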
> **Supercharge:** Crawl over pages of Boliga to collect this data for the entire borough of Nørrebro. Or all of Copenhagen!
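
A crawl over several result pages could be organised as in the sketch below. It assumes that later pages follow the URL pattern from Ex. 3 with only the `page` parameter changing, which should be checked in the browser first. The names `page_url` and `crawl`, the injected `fetch` callable and the one-second pause are all choices made for this illustration.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://www.boliga.dk/resultat"


def page_url(page, zip_code=2200, property_type=3):
    """Build the search URL for one results page (pattern copied from Ex. 3)."""
    query = urlencode(
        {"propertyType": property_type, "zipCodes": zip_code, "page": page}
    )
    return f"{BASE_URL}?{query}"


def crawl(fetch, n_pages, delay=1.0):
    """Call `fetch(url)` for pages 1..n_pages and collect the results.

    `fetch` is any function that takes a URL and returns the page's HTML string.
    The pause between requests keeps the load on the server low.
    """
    pages = []
    for page in range(1, n_pages + 1):
        pages.append(fetch(page_url(page)))
        if page < n_pages:
            time.sleep(delay)
    return pages


# Offline demonstration: a stand-in fetcher that returns the URL instead of HTML
for url in crawl(lambda url: url, n_pages=3, delay=0):
    print(url)
```

With `requests`, the `fetch` argument could be `lambda url: requests.get(url).text`. How many pages exist for a given zip code is not known in advance, so `n_pages` has to be read off the site or the loop has to stop once a page returns no posts.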