# Web scraping
To understand how web scraping works, you need to understand the basic idea of HTML.

HTML uses tags to wrap elements, and elements can be nested inside each other. For example, the following code defines the text `This is a headline` as a first-level headline:
```html
<h1>This is a headline</h1>
```

Everything between the opening `<tag>` and the closing `</tag>` is defined by this tag. Other examples are:
```html
<a>This is a link</a>
```
The `a` tag defines a link, so everything between the tags is clickable.

```html
<div>This is a headline</div>
```
This defines a `div`, which is a very common tag. It is used for various purposes and defines a section without intrinsic functionality.

Now, you might ask how the browser knows where the link from the `a` example should point to. We use *attributes* to declare this information:
```html
<a href="http://wikipedia.org">This is a link</a>
```
`href` is the attribute and everything inside the quotes is its value. There are various attributes:
```html
<span id="logo">Wikipedia</span>
<header class="introduction"><p>Long time ago…</p></header>
```
The first example is a simple text string with the id `logo`. Note that ids must be unique on a page and should hence only be used once, i.e. there is no second element with the id `logo`.
The second example defines an area similar to `div`, but with a semantic meaning that marks it as the »header« of something. Unlike ids, classes can occur multiple times on the website. Inside the header is a `p` tag that defines the text as a paragraph.

Some tags don’t follow the opening and closing syntax. Images, for example, are self-closing (with the trailing `/`):
```html
<img src="https://en.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-en.svg" class="logo small" />
```
Here, `src` defines the source of the image, i.e. the path to the image. Note how the `class` attribute can assign multiple classes to the image, separated by spaces.
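
To make the split into tag, attribute and value more tangible, here is a minimal sketch that uses only Python's built-in `html.parser` module. The class name `AttributeCollector` and the shortened image path `wordmark.svg` are made up for this illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects (tag, attribute name, attribute value) for every opening tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        for name, value in attrs:
            self.found.append((tag, name, value))


parser = AttributeCollector()
parser.feed('<a href="http://wikipedia.org">This is a link</a>')
# Hypothetical, shortened image path; the self-closing tag is handled as well
parser.feed('<img src="wordmark.svg" class="logo small" />')
parser.close()

for tag, name, value in parser.found:
    print(tag, name, value)
```

Note that this low-level parser reports the value of `class` as one string (`logo small`); splitting it into separate classes is left to whoever reads the value.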

You might worry about how to memorise all these tags. The good news is that there are not too many. The following tags probably account for about 90% of all the content on websites:
```html
<html>This is wrapped around the whole document</html>
<head>This defines the head of the website. It does not contain any visible content, but instead defines, for example, the title of the website.</head>
<body>This is wrapped around the content of the website.</body>
<div>This is an undefined area</div>
<section>This is an undefined section</section>
<main>This is the main part of the website (whatever this is on the website)</main>
<header>This is the header of the specific area</header>
<footer>This is the footer of the specific area</footer>
<aside>This is some additional content of the specific area</aside>
<nav>This is the navigation of the website</nav>
<h1>This is a first-level headline</h1>
<h6>This is a sixth-level headline</h6>
<span>This is a simple text</span>
<p>This is a paragraph of text</p>
<a>This is a link</a>
<ul>
    <li>This is a list with two items</li>
    <li>This is the second item</li>
</ul>
<img>
```

**Please note that HTML only defines areas. It does not style the items (though the browser applies some default styles). Apart from those defaults, the website looks the same whether something is wrapped inside a `header` or a `span` tag. Links will work regardless of whether they are placed inside `nav` tags or `h3` tags. The styling is applied separately by CSS. HTML is only used to give semantic meaning to elements. In theory, you could wrap nearly everything inside `div` tags, but the semantic meaning of elements helps, for example, search engines, screen readers, bots and browsers to make sense of the website. There are some rules about what you should not do (for example, wrap a headline inside a paragraph), but browsers will still display the website.**

Now, go to any website, right-click and click on *Inspect element* (or something similar depending on your browser). You will see many elements inside each other.
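
The nesting you see in the inspector can also be printed from a script. The following is a rough sketch with Python's built-in `html.parser`. The sample markup, the class name `OutlineCollector` and the short list of void tags are assumptions chosen for this illustration, not a complete treatment of HTML.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not change the nesting depth.
# This list is deliberately incomplete.
VOID_TAGS = {'img', 'br', 'hr', 'input', 'meta', 'link'}


class OutlineCollector(HTMLParser):
    """Records every opening tag, indented by how deeply it is nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


sample = (
    '<body><header><h1>Title</h1></header>'
    '<main><p>Text with <a href="#">a link</a></p></main></body>'
)
outline = OutlineCollector()
outline.feed(sample)
outline.close()
print('\n'.join(outline.lines))
```

Each level of indentation corresponds to one element being wrapped inside another, which is the same structure the browser's inspector shows as a collapsible tree.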

Now that you know how websites are built, you can tell a scraper how to extract content from them. When scraping a website, you basically tell a script (or bot, crawler, …) to visit the website, get all its content and navigate through the elements it contains.

We use the Python library *BeautifulSoup* to »navigate« the HTML. First, import the library:

In [54]:
from bs4 import BeautifulSoup

Now, let’s give the library some HTML:

In [60]:
soup = BeautifulSoup('<h1 id="headline">This is how it all starts</h1><h2 class="caption">A small introduction into something</h2>', 'html.parser')

Let’s say we want the script to find the caption: 

In [63]:
soup.find('h2')

<h2 class="caption">A small introduction into something</h2>

If you just want the text inside the element, use `.text`:

In [66]:
soup.find('h2').text

'A small introduction into something'

You want the value of the `class` attribute?

In [67]:
soup.find('h2')['class']

['caption']
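
The result is a list because an element can carry several classes at once. A small sketch with an image tag, using a made-up short path `wordmark.svg`, shows the difference between `class` and other attributes:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
demo_html = '<img src="wordmark.svg" class="logo small" />'
img = BeautifulSoup(demo_html, 'html.parser').find('img')

print(img['class'])    # a list with one entry per class
print(img['src'])      # a plain string
print(img.get('alt'))  # .get() returns None instead of raising an error
```

Using `.get()` is handy when you are not sure that every element has the attribute you are looking for.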

Okay, let’s get some real content from an existing website. For that, we use the `requests` library.

In [68]:
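import requests
# Download the page; the timeout avoids waiting forever if the server does not answer
response = requests.get('https://datavis.berlin', timeout=10)
response.raise_for_status()  # stop early if the server returned an error status
soup = BeautifulSoup(response.text, 'html.parser')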

Let’s search for a tag that has the `id` with the value `professional`.

In [75]:
result = soup.find(id='professional')
print(result.prettify())

<h2 id="professional">
 <i aria-hidden="true" class="icon icon-link">
 </i>
 Professional data visualisation
</h2>


Wondering what `.prettify()` does? Just remove it and see the result.
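
If you prefer to compare both variants side by side without downloading anything, the following sketch rebuilds the element from a string. The markup is copied from the output above and collapsed onto one line, which is an assumption about how the page actually delivers it.

```python
from bs4 import BeautifulSoup

demo_html = (
    '<h2 id="professional"><i aria-hidden="true" class="icon icon-link"></i>'
    'Professional data visualisation</h2>'
)
tag = BeautifulSoup(demo_html, 'html.parser').find(id='professional')

compact = str(tag)       # the element as one line of markup
pretty = tag.prettify()  # the same element, one tag per line and indented

print(compact)
print(pretty)
print(tag.get_text(strip=True))  # only the text, without surrounding whitespace
```

`.prettify()` only changes how the markup is laid out for reading; the element itself stays the same.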

But what happens if you search by tag and multiple elements have the same tag?

In [76]:
headlines = soup.find_all('h2')
for headline in headlines:
    print(headline.get_text())

Professional data visualisation
Places
Meet other data people
Related degree programmes in Germany


Here, we »loop over« all elements from `headlines` and print out the text for each.
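
The difference between `find` and `find_all` is easy to try on a small, made-up snippet. The names `demo` and `titles` are only used in this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
demo_html = '<h2>Places</h2><p>Some text</p><h2>Meet other data people</h2>'
demo = BeautifulSoup(demo_html, 'html.parser')

first = demo.find('h2')        # only the first match
all_h2 = demo.find_all('h2')   # every match, in document order
missing = demo.find('h3')      # None, because there is no h3

# Collect the text of all headlines in a list
titles = [headline.get_text() for headline in all_h2]
print(first.get_text())
print(titles)
print(missing)
```

Because `find` returns `None` when nothing matches, it is worth checking the result before calling `.get_text()` on it.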

Similarly, we can search by class and tag:

In [77]:
headlines = soup.find_all('a', class_='headline')
for headline in headlines:
    print(headline.get_text())
    print(headline.h2['id'])

Professional data visualisation
professional
Places
places
Meet other data people
meet
Related degree programmes in Germany
programmes


The code reads: Find all elements with the tag `a` that have the class `headline`. Since `class` is a reserved keyword in Python, BeautifulSoup uses `class_` instead.

Inside each headline link, we search for an `h2` element and print out its `id`.
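
If you already know CSS, the same search can be written with a CSS selector via `.select()`. The markup below is a guess at the page structure, reduced to three links for illustration; the real page may look different.

```python
from bs4 import BeautifulSoup

# Assumed structure: each headline link wraps an h2 that carries the id
demo_html = '''
<a class="headline" href="#professional"><h2 id="professional">Professional data visualisation</h2></a>
<a class="headline" href="#places"><h2 id="places">Places</h2></a>
<a class="other" href="#x"><h2 id="x">Not a headline link</h2></a>
'''
demo = BeautifulSoup(demo_html, 'html.parser')

by_keyword = demo.find_all('a', class_='headline')  # keyword argument
by_selector = demo.select('a.headline')             # CSS selector

# "a.headline h2" means: every h2 somewhere inside an a with class headline
ids = [h2['id'] for h2 in demo.select('a.headline h2')]
print(len(by_keyword), len(by_selector))
print(ids)
```

Both approaches find the same elements here, so which one to use is mostly a matter of taste.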

Elements can have all kinds of attributes. One that helps screen readers is `aria-labelledby`. Because its name contains a hyphen, it cannot be passed as a keyword argument, so we search for it with the `attrs` dictionary instead.

Inside each list item we search for a link. We save the result in a variable, then get the text of the link and its `href`. If the `href` starts with `https://twitter.com`, we print it out.

In [79]:
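# Find the list that is labelled by the element with the id "people"
people_list = soup.find('ul', attrs={'aria-labelledby': 'people'})
# Loop over the list items only (this skips whitespace between the tags)
for item in people_list.find_all('li'):
    link = item.find('a')
    if link is None:
        continue  # this item contains no link
    text = link.get_text()
    href = link.get('href', '')  # empty string if the link has no href
    if href.startswith('https://twitter.com'):
        print(text, href)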

Alsino Skowronnek https://twitter.com/Alsinosko
André Pätzold https://twitter.com/alterpaetz
Angelo Zehr https://twitter.com/angelozehr
Arran Ridley https://twitter.com/arranarranarran
Bernd Riedel https://twitter.com/berndriedellery
Boris Müller https://twitter.com/borism
Burak Korkmaz https://twitter.com/BKorkmaz_KD
Cedric Kiefer https://twitter.com/CedricKiefer
Christian Laesser https://twitter.com/christianlaessr
Christian Schlippes https://twitter.com/geplapper
Christopher Möller https://twitter.com/chrtze
Christopher Pietsch https://twitter.com/chrispiecom
David Elsche https://twitter.com/ElscheDavid
David Wendler https://twitter.com/newreld
Dilyana Suleymanova https://twitter.com/DilyanaFlower
Dirk Aschoff https://twitter.com/dirkaschoff
Dirk Brockmann https://twitter.com/DirkBrockmann
Elena Erdmann https://twitter.com/elena_erdmann
Fabian Dinklage https://twitter.com/fdnklg
Fabian Ehmel https://twitter.com/fabianehmel
Fidel Thomet https://twitter.com/fidelthomet
Flavio Gortana 
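
One thing to watch out for: looping directly over a tag yields all of its children, which can include the whitespace between the tags as well as the tags themselves. The sketch below shows this on a made-up list (the names and URLs are invented) and uses a slightly more defensive loop.

```python
from bs4 import BeautifulSoup

# Hypothetical list for illustration; names and URLs are invented
demo_html = '''
<ul aria-labelledby="people">
  <li><a href="https://twitter.com/example_one">Person One</a></li>
  <li><a href="https://example.org/two">Person Two</a></li>
  <li>Person Three (no link)</li>
</ul>
'''
demo = BeautifulSoup(demo_html, 'html.parser')
people = demo.find('ul', attrs={'aria-labelledby': 'people'})

# The children include the line breaks between the <li> tags
print(len(list(people.children)), 'children,', len(people.find_all('li')), 'list items')

twitter_links = []
for item in people.find_all('li'):
    link = item.find('a')
    if link is None:
        continue  # list item without a link
    href = link.get('href', '')
    if href.startswith('https://twitter.com'):
        twitter_links.append((link.get_text(), href))

print(twitter_links)
```

Asking explicitly for the `li` elements and checking for a missing link keeps the loop from failing on entries that do not match the expected pattern.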

Now, let’s save it to a CSV file:

In [81]:
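import csv
# newline='' lets the csv module control line endings; utf-8 keeps umlauts intact
with open('people.csv', 'w', newline='', encoding='utf-8') as output:
    writer = csv.writer(output, lineterminator='\n')

    people_list = soup.find('ul', attrs={'aria-labelledby': 'people'})
    # Loop over the list items only (this skips whitespace between the tags)
    for item in people_list.find_all('li'):
        link = item.find('a')
        if link is None:
            continue  # this item contains no link
        text = link.get_text()
        href = link.get('href', '')  # empty string if the link has no href
        if href.startswith('https://twitter.com'):
            writer.writerow([text, href])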
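
To check that a file like this can be read back, here is a small self-contained sketch. It writes two invented rows plus a header row to a temporary file and loads them again; the column names `name` and `url` and the file name `people_demo.csv` are assumptions for this example.

```python
import csv
import os
import tempfile

# Invented sample rows; the second name contains a non-ASCII character
rows = [
    ('Person One', 'https://twitter.com/example_one'),
    ('Pérson Two', 'https://twitter.com/example_two'),
]

# Write into a temporary folder so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'people_demo.csv')

with open(path, 'w', newline='', encoding='utf-8') as output:
    writer = csv.writer(output)
    writer.writerow(['name', 'url'])  # header row
    writer.writerows(rows)

# DictReader uses the header row as keys for every following row
with open(path, newline='', encoding='utf-8') as source:
    loaded = list(csv.DictReader(source))

for person in loaded:
    print(person['name'], person['url'])
```

A header row is optional, but it makes the file easier to work with later, for example when loading it into a spreadsheet.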