# Obtaining, parsing and structuring static HTML websites

In this notebook we will learn how to scrape basic static, i.e. non-interactive HTML-based websites. We will
- learn best practices of Webscraping
- learn to efficiently search text objects using `RegEx`
- obtain the HTML raw content using the `requests` module
- convert the raw HTML into a format that is easier to search, or parse, using the `BeautifulSoup` module
- learn how to identify the elements of interest in the raw HTML using the browser's inspect functionality and the CSS SelectorGadget
- construct a table, or dataframe, with the popular table calculation module `pandas` and store the output locally in a standard spreadsheet format

## Rules of the game
(*Recommendations taken from Pablo Barberá's course on Webscraping*)

1. Respect the hosting site's wishes:

    - Check if an API exists or if data are available for download
    - Keep in mind where data comes from and give credit (and respect copyright if you want to republish the data!)
    - Some websites **disallow** scrapers in their `robots.txt` file
    
2. Limit your bandwidth use:
    
    - Wait one or two seconds after each hit
    - Scrape only what you need, and just once (e.g. store the HTML file on disk, then parse it)

3. When using APIs, read the documentation

    - Is there a batch download option?
    - Are there any rate limits?
    - Can you share the data?

## The art of webscraping

Workflow:

1. Learn about structure of website
2. Choose your strategy (static vs. dynamic, identification of elements, scale & backend)
3. Build prototype code: extract, prepare, validate
4. Generalize: functions, loops, debugging
5. Data cleaning

## HTML Basics: A primer

- HTML (= Hypertext Markup Language): a raw "text" file interpreted by an internet browser
- structure defined by `tags`
    1. opening and closing `<>` `</>`
    2. **attributes** of the element inside of tag
    3. The **text** to be structured

<img src="https://www.w3schools.com/js/pic_htmltree.gif" alt="Drawing" style="width: 500px;">


Example:

```html
<!DOCTYPE html>
<html>
<body>
    
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <a href="https://hu-berlin.de/">Link to HU Berlin</a>      <!-- the visible link text comes after the opening tag -->
    
</body>
    
</html>
```

In [1]:
# browser's interpretation

from IPython.display import display, HTML
display(HTML('<!DOCTYPE html> <html> <body> <h1>My First Heading</h1> <p>My first paragraph. <br> <body> <strong> Lorem ipsum... </strong> <br> Another text </body></p> <a href="https://hu-berlin.de/">Link to HU Berlin</a> </body> </html>'))

## Some common tags:

- Document elements: `<head>`, `<body>`, `<footer>`...
- Document components: `<title>`, `<h1>`, `<div>`...
- Text style: `<b>`, `<i>`, `<strong>`...
- Hyperlinks: `<a>`

## Some additional website components:

**CSS**

- Cascading Style Sheets (CSS) describe the formatting and e.g. colors of HTML components (e.g. `<h1>`, `<div>`,...)
- CSS is useful because we can use CSS pointers (selectors) to identify our HTML elements of interest

**Javascript**

- Javascript extends the functionality of websites (e.g. a change of content after loading of a website)
- Javascript is executed on the client-side and not server-side!
- Therefore, JS based websites are a big problem for us as client-side changes of the HTML document cannot be captured using conventional `requests`

## HTML inspection in a browser

- Open the page https://www.ecb.europa.eu/ and open the Developer Tools with key `F12` (alternatively right click + Inspect)
- Hover over the different elements and observe how the elements are being highlighted
- In the Developer section you can access all information regarding specific elements, e.g. `id`, `class` etc. as well as traffic

## Example: ResearchGate.net - is scraping allowed?

[ResearchGate.net](https://de.wikipedia.org/wiki/ResearchGate) is a social network for researchers.

First check: `robots.txt` file: https://www.researchgate.net/robots.txt

````
User-agent: *
Allow: /
Disallow: /connector/
Disallow: /plugins.
Disallow: /firststeps.
Disallow: /publicliterature.PublicLiterature.search.html
Disallow: /lite.publication.PublicationRequestFulltextPromo.requestFulltext.html
Disallow: /amp/authorize
Allow: /signup.SignUp.html
Disallow: /signup.
````

- `User-agent: *` means that the following rules apply to **any** user agent (e.g. Google's bots or our Python programs).
- The file defines which paths of the site must not be scraped, e.g. `/connector/`
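
Such rules can also be checked programmatically. The following is a minimal sketch using `urllib.robotparser` from the standard library. It parses a shortened excerpt of the rules shown above offline instead of downloading the file, and the helper name `is_allowed` is made up for illustration. The leading `Allow: /` line is left out on purpose: Python's parser applies the first rule that matches, so that line would mask the `Disallow` entries below it.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the robots.txt shown above (without the leading "Allow: /")
robots_lines = [
    "User-agent: *",
    "Disallow: /connector/",
    "Disallow: /plugins.",
    "Disallow: /signup.",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse the rules offline, no request is sent


def is_allowed(path, user_agent="*"):
    """Hypothetical helper: may `user_agent` fetch `path` under the parsed rules?"""
    return parser.can_fetch(user_agent, "https://www.researchgate.net" + path)


print(is_allowed("/jobs/"))           # True: no Disallow rule matches this path
print(is_allowed("/connector/test"))  # False: path starts with a disallowed prefix
```

Keep in mind that `robots.txt` is only the first check; the terms of service of a site can be stricter.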

As such, it would be permissible to scrape e.g. the job ads under the path `/jobs/`:

![alt text](researchgate_jobs.PNG)

**But** the [Terms of Service of ResearchGate](https://www.researchgate.net/application.TermsAndConditions.html) clearly state that the operators of this website do not allow web scraping:

![alt text](researchgate_tac.PNG)

**Conclusion**
- You should not scrape this website without the explicit permission of its operators
- (and you won't get it - other colleagues have tried several times :P)

## What about https://www.ecb.europa.eu/?
- Check the `robots.txt` and define a `list` containing all prohibited paths.

## BeautifulSoup

BeautifulSoup is a Python parsing package which "understands" HTML and XML strings and can represent them in the tree structure shown above.
```python
!conda install pip
!pip install beautifulsoup4
```

In [2]:
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

display(HTML(html_doc))

A string containing HTML content is processed through the function `BeautifulSoup()`:

In [4]:
soup = BeautifulSoup(html_doc, "html.parser")
type(soup)

bs4.BeautifulSoup

Afterwards, attributes can be retrieved from the tree structure:

In [8]:
type(soup.title)

bs4.element.Tag

In [6]:
print(soup.title.text)

The Dormouse's story


What's the difference? How can you check?

In [7]:
print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>


It's especially useful for efficiently searching the entire HTML document for specific tags, e.g. `p` for paragraphs or `a` for links:

In [11]:
soup.find_all('p')

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

Moreover, elements can be identified through attributes such as `id`, `href` or `class`:

In [12]:
print('id', soup.find(id='link2'))
print('href', soup.find(href='http://example.com/lacie'))
print('class', soup.find(class_='story'))

id <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
href <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
class <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>


## Intermezzo: Regular Expressions

> Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

    Jamie Zawinski
    
Regular expressions describe sets of strings over a finite alphabet of characters/symbols $\Sigma$. They are built from three operations:
If $x$ and $y$ are regular expressions, then so are

1. Concatenation: ($xy$)
2. Alternative: ($x | y$)
3. Repetition (Kleene-star): ($x^{*}$)
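
The three operations can be tried out directly with Python's `re` module. The following is a small sketch; the helper `matches` is introduced only for this illustration and checks whether a pattern describes an entire string.

```python
import re


def matches(pattern, text):
    """Illustrative helper: True if the whole text is described by the pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches('ab', 'ab'))          # True: concatenation, 'a' followed by 'b'
print(matches('ab', 'ba'))          # False: the order matters
print(matches('a|b', 'b'))          # True: alternative, 'a' or 'b'
print(matches('a*', 'aaaa'))        # True: repetition, zero or more 'a'
print(matches('a*', ''))            # True: zero repetitions are allowed
print(matches('(a|b)*c', 'abbac'))  # True: all three operations combined
```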

## `re`

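Python's module for regular expressions. Its operations are available both as module-level functions (e.g. `re.search(pattern, string)`) and as methods of a compiled pattern object (e.g. `pattern.search(string)`).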

Additional information: [PyDocs RegEx HowTo](https://docs.python.org/3/howto/regex.html)

In [15]:
import re

pattern = 'a'
string = 'Spam, Eggs and Bacon'

print(re.search(pattern, string))

# returns only one match-object!

<re.Match object; span=(2, 3), match='a'>


In [20]:
pattern = '.*(gg).*'
# '.*' consumes the text around the group '(gg)'; group(0) is the entire match, group(1) the first captured group
match = re.search(pattern, string)
print(match.group(0,1))

('Spam, Eggs and Bacon', 'gg')


In [24]:
pattern = re.compile('a.')
print(pattern.findall(string))

['am', 'an', 'ac']


In [None]:
# Bonus: what type of object is match.group(0,1)? Is it immutable?

`re.compile` turns a pattern string into a proper `re` pattern object with methods for various operations.

## `re.findall`

Finds all non-overlapping occurrences of a given regular expression in a string and returns them as a list of strings.

In [None]:
print(string)
print(pattern)

## Special characters

1. `.` (dot) is the most general regular expression. It matches an arbitrary character (except a newline, unless `DOTALL` is set).
2. `^` (caret) refers to the beginning of the string.
3. `$` (dollar) refers to the end of the string (or the position just before a trailing newline `\n`); in `MULTILINE` mode it also matches in front of every newline.

In [28]:
# matches 'a' followed by one arbitrary character
# findall returns all such matches as a list of strings
print(re.findall('a.', string))

['am', 'an', 'ac']


## Concatenation

Specifies characters in a fixed order. Enclosing the characters in a set `[]` instead matches any single character from the set, regardless of order.

In [32]:
print(re.search('AND', 'AND DNA ZYZ')) # "normal" order
print(re.findall('[AND]', 'AND DNA ZYZ')) # a set matches every single character that is 'A', 'N' or 'D', in any order
print(string)
# a set specifies the individual characters, not the entire sequence, so they may appear in an arbitrary sequence

<re.Match object; span=(0, 3), match='AND'>
['A', 'N', 'D', 'D', 'N', 'A']
Spam, Eggs and Bacon


## Alternatives

Finds regular expression $x$ **or** $y$ and returns list. Operator is `|`.

In [None]:
# scheme: pattern, string
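
One possible way to fill in the cell above is sketched below. It reuses the example string from the `re` section; the two patterns are chosen only for illustration.

```python
import re

string = 'Spam, Eggs and Bacon'

# scheme: pattern, string
print(re.findall('Spam|Bacon', string))  # ['Spam', 'Bacon']
print(re.findall('gg|ac', string))       # ['gg', 'ac']
```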

## Special characters: Summary

The following characters have special meanings in regular expressions:

Character | Meaning
- | - 
`.` | Arbitrary character. With `DOTALL` including Newline (`\n`)
`^` | Beginning of a string. If `MULTILINE` also after each `\n`
`$` | End of a string. If `MULTILINE` also in front of each `\n`
`\` | Escapes special characters or introduces a predefined set (e.g. `\d`)
`[]` | Defines a set of characters
`()` | Defines the scope, i.e. sets groups.

To search for a literal special character such as `.` (dot), escape it with a backslash: `\.`

## Repetitions

Specifies the number of repetitions of a preceding regular expression $x$. The following repetitions are possible:

Syntax | Meaning
- | - 
`*` | 0 or more repetitions
`+` | 1 or more repetitions
`{m}` | Exactly `m` repetitions
`{m,n}` | From `m` up until (including) `n`

By default, repetitions are *greedy*, i.e. they consume as much of the string as possible. This behavior can be switched off by placing a `?` after the repetition, which makes it *non-greedy* (as little as possible is consumed).

In [None]:
input_string = '/2021/abcdefg'

# objective: search for anything within first occurrence of / and subsequent occurrence of /
# specify this to be from set of type digits
# restrict matches to m=4 repetitions
# return every match as a list
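
A possible solution for the cell above, as a sketch: the group `(\d{4})` captures exactly four digits, and the surrounding slashes anchor the match between the first `/` and the next one.

```python
import re

input_string = '/2021/abcdefg'

# exactly four digits enclosed by two slashes; findall returns the captured group
years = re.findall(r'/(\d{4})/', input_string)
print(years)  # ['2021']
```

Because the pattern contains one group, `re.findall` returns only the group's content and not the slashes.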



In [33]:
spamskit = '''The screen is filled by the face of PETER PARKER, a seventeen year
old boy. High school must not be any fun for Peter, he's one
hundred percent nerd- skinny, zitty, glasses. His face is just
frozen there, a cringing expression on it, which strikes us odd
until we realize the image is freeze framed.'''

# return every occurrence of 'ee' (repeated at least once) as a list

A useful regex combination is `.*?`: an arbitrary character (`.`), repeated arbitrarily often (`*`), but only **up until** the first occurrence of the subsequent pattern (`?`). Remember: *(non-)greedy*.

In [None]:
test = 'eeeAiiZuuuuAoooZeeee'

# 'A' followed by an arbitrary character repeated for an arbitrary number of times until closed by the last occurrence of 'Z' - greedy!

# non-greedy!
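
The difference between the two variants can be seen in a short sketch on the `test` string from the cell above:

```python
import re

test = 'eeeAiiZuuuuAoooZeeee'

# greedy: '.*' runs up to the LAST 'Z' that still allows a match
greedy = re.findall('A.*Z', test)
print(greedy)      # ['AiiZuuuuAoooZ']

# non-greedy: '.*?' stops at the FIRST 'Z' after each 'A'
non_greedy = re.findall('A.*?Z', test)
print(non_greedy)  # ['AiiZ', 'AoooZ']
```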

## Extensions of RegEx

Regular expressions have a variety of syntactical nuances to simplify the specification of any possibly imaginable string.

## Specification of sets

Syntax (short cut) | Equivalent | Meaning
-|-|-
`\d` | `[0-9]` | Digits
`\D` | `[^0-9]` | Anything except digits
`\s` | `[ \t\n\r\f\v]` | Anything that is whitespace
`\S` | `[^ \t\n\r\f\v]` | Anything that is not whitespace
`\w` | `[a-zA-Z0-9_]` | Alphanumeric characters and underscore
`\W` | `[^a-zA-Z0-9_]` | Anything but alphanumeric characters and underscore

In [34]:
spamskit

"The screen is filled by the face of PETER PARKER, a seventeen year\nold boy. High school must not be any fun for Peter, he's one\nhundred percent nerd- skinny, zitty, glasses. His face is just\nfrozen there, a cringing expression on it, which strikes us odd\nuntil we realize the image is freeze framed."

In [None]:
# Substitution
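
Substitution is done with `re.sub(pattern, replacement, string)`. The sketch below uses a shortened piece of `spamskit` (the variable `sample` is introduced only for this example) to collapse whitespace and to replace upper-case words.

```python
import re

sample = 'PETER PARKER, a seventeen year\nold boy.'

# replace every run of whitespace (including the newline) by a single blank
one_line = re.sub(r'\s+', ' ', sample)
print(one_line)  # PETER PARKER, a seventeen year old boy.

# replace every word of two or more capital letters by 'X'
masked = re.sub(r'[A-Z]{2,}', 'X', one_line)
print(masked)    # X X, a seventeen year old boy.
```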



## Exercise 1: Tokenization using `re`

Write a regular expression which returns a list of strings from the input string `spamskit`, where each string is a sequence of one or more alphanumeric characters (i.e. no whitespace!). A sequence ends at the next non-matching character, e.g. whitespace or punctuation.

## --> Back to `BeautifulSoup`

Any query can also be given a regular expression instead of a fixed string.

In [39]:
# search all `a` tags, then restrict the search to ids matching 'link' followed by digits
soup.find_all('a')
soup.find_all('a', id=re.compile(r'link\d+'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [None]:
# search in attribute id: 'link' + set of integers

## First attempt: ECB homepage

1. Open the Anaconda Prompt and install the module `requests`

In [41]:
import requests

In [43]:
seed_external = 'https://www.ecb.europa.eu/'

2. What data type is the object `seed_external`? How can you check?

In [None]:
proxies =  {'https' : 'https://ap-python-proxy:x2o7rCPYuN1JuV8H@app-gw-2.ecb.de:8080',
           'http'  : 'http://ap-python-proxy:x2o7rCPYuN1JuV8H@app-gw-2.ecb.de:8080'}

3. Is this domain an admissible path? Hint: Check the `robots.txt`

In [44]:
html = requests.get(seed_external)

4. Was the request successful? How can you check the status? Hint: Check the available methods by using Jupyter's auto-complete functionality, i.e. type a dot at the end of the object you're investigating followed by <kbd>Tab</kbd>

In [45]:
html.status_code

200

5. Which method could be most informative w.r.t. actual content? How many characters long is the raw HTML file?

In [48]:
html.headers

{'Server': 'myracloud', 'Date': 'Mon, 06 Sep 2021 09:27:12 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=63072000; includeSubDomains', 'Cache-Control': 'must-revalidate, max-age=10', 'Pragma': 'cache', 'X-XSS-Protection': '1; mode=block', 'Content-Encoding': 'gzip', 'vary': 'accept-encoding', 'Expires': 'Mon, 06 Sep 2021 09:27:16 GMT', 'ETag': '"myra-180d3fad"', 'X-CDN': '1'}

6. Display the first 518 characters of the `html` object.

7. Display meta information on the origin of the HTTP request, e.g. the date. Note that it is possible to specify the `user-agent` header that the server receives; the server may use it to tailor its response (website representation), e.g. desktop vs. mobile. If it is not specified, the request is sent with the library's default headers. Be aware that every request reveals information about you, such as your IP address and, depending on the client, details like operating system or language settings.

The cell below saves the HTML object's text attribute in HTML format locally.

In [49]:
with open('ECB.html', 'w', encoding='utf-8') as f:
    
    f.write(html.text)

8. Install the module `BeautifulSoup` via `pip install beautifulsoup4`

In [None]:
from bs4 import BeautifulSoup

In [50]:
soup = BeautifulSoup(html.text, 'html.parser')

9. Parse the BeautifulSoup object `soup` for all hyperlinks. Hint: In an HTML document all elements that link to another page are indicated by an `a` tag and follow the structure `<a href="..." ...>text</a>`. Hint: Use `soup`'s method `find_all()` where the input argument is the tag name. What object type is the output? Can you iterate over it? How many hyperlink elements are contained in the HTML file?

In [53]:
print(soup.find_all("a"))

[<a aria-hidden="true" href="/home/html/index.en.html" tabindex="-1">
<img src="/shared/img/logo/logo_name.en.svg"/>
</a>, <a aria-hidden="true" href="/home/html/index.en.html">
<div style="background-image:url(/shared/img/logo/logo_name_mobile.en.svg)"></div>
</a>, <a class="available" href="index.bg.html" lang="bg" title="Български"><span class="ecb-full">Български</span></a>, <a class="available" href="index.cs.html" lang="cs" title="Čeština"><span class="ecb-full">Čeština</span></a>, <a class="available" href="index.da.html" lang="da" title="Dansk"><span class="ecb-full">Dansk</span></a>, <a class="available" href="index.de.html" lang="de" title="Deutsch"><span class="ecb-full">Deutsch</span></a>, <a class="available" href="index.el.html" lang="el" title="Eλληνικά"><span class="ecb-full">Eλληνικά</span></a>, <a class="selected available" href="index.en.html" lang="en" title="English"><span class="ecb-full">English</span></a>, <a class="available" href="index.es.html" lang="es" titl

10. Convert the result into a "plain" Python list object containing the elements' **text** attributes by iterating over it. Hint: Instantiate an empty `list` object, write a for-loop and `append` each element's text to the list object. You may also remove any unwanted whitespace by using the `strip` method.

In [55]:
empty_list = []
for link in soup.find_all('a'):
    empty_list.append(link.text.strip())
print(empty_list)

['', '', 'Български', 'Čeština', 'Dansk', 'Deutsch', 'Eλληνικά', 'English', 'Español', 'Eesti keel', 'Suomi', 'Français', 'Gaeilge', 'Hrvatski', 'Magyar', 'Italiano', 'Lietuvių', 'Latviešu', 'Malti', 'Nederlands', 'Polski', 'Português', 'Română', 'Slovenčina', 'Slovenščina', 'Svenska', 'Banking supervision website', '', 'About', 'Media', 'Explainers', 'Research & Publications', 'Statistics', 'Monetary Policy', 'The euro', 'Payments & Markets', 'Careers', 'Banking supervision', 'Search', '', 'Interview', 'Interview', 'Explainer', 'Euro area bank interest rate statistics: July 2021', 'English', 'Consolidated financial statement of the Eurosystem as at 27 August 2021', 'English', 'БългарскиBG', 'ČeštinaCS', 'DanskDA', 'DeutschDE', 'EλληνικάEL', 'EspañolES', 'Eesti keelET', 'SuomiFI', 'FrançaisFR', 'MagyarHU', 'HrvatskiHR', 'ItalianoIT', 'LietuviųLT', 'LatviešuLV', 'MaltiMT', 'NederlandsNL', 'PolskiPL', 'PortuguêsPT', 'RomânăRO', 'SlovenčinaSK', 'SlovenščinaSL', 'SvenskaSV', 'Commentary', 

#### Pro tip
Instead of explicitly writing a for-loop when disentangling specific objects from an aggregate object you can use Python's built-in `map` and `lambda` functions as a one-liner.

In [58]:
results_list = list(map(lambda x: x.text.strip(), soup.find_all('a')))
print(results_list)

['', '', 'Български', 'Čeština', 'Dansk', 'Deutsch', 'Eλληνικά', 'English', 'Español', 'Eesti keel', 'Suomi', 'Français', 'Gaeilge', 'Hrvatski', 'Magyar', 'Italiano', 'Lietuvių', 'Latviešu', 'Malti', 'Nederlands', 'Polski', 'Português', 'Română', 'Slovenčina', 'Slovenščina', 'Svenska', 'Banking supervision website', '', 'About', 'Media', 'Explainers', 'Research & Publications', 'Statistics', 'Monetary Policy', 'The euro', 'Payments & Markets', 'Careers', 'Banking supervision', 'Search', '', 'Interview', 'Interview', 'Explainer', 'Euro area bank interest rate statistics: July 2021', 'English', 'Consolidated financial statement of the Eurosystem as at 27 August 2021', 'English', 'БългарскиBG', 'ČeštinaCS', 'DanskDA', 'DeutschDE', 'EλληνικάEL', 'EspañolES', 'Eesti keelET', 'SuomiFI', 'FrançaisFR', 'MagyarHU', 'HrvatskiHR', 'ItalianoIT', 'LietuviųLT', 'LatviešuLV', 'MaltiMT', 'NederlandsNL', 'PolskiPL', 'PortuguêsPT', 'RomânăRO', 'SlovenčinaSK', 'SlovenščinaSL', 'SvenskaSV', 'Commentary', 

In [60]:
results_list.index('Research & Publications')

31

11. Identify the element whose text attribute's value is equal to `"Research & Publications"`. Return the element's position (`index`) within the list.

In [64]:
new_seed = soup.find_all('a')[31].get('href')
new_seed

'/pub/html/index.en.html'

12. Obtain this element's value of the `href` attribute. It should be a relative URL pointing to the ECB's "Research & Publications" section.

In [67]:
new_new_seed = seed_external + new_seed[1:]
new_new_seed

'https://www.ecb.europa.eu/pub/html/index.en.html'
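
Slicing off the leading slash works here, but it depends on the exact shape of both strings. As an alternative sketch, `urllib.parse.urljoin` from the standard library resolves relative links against a base URL. The second call uses one of the language links from the output above.

```python
from urllib.parse import urljoin

seed_external = 'https://www.ecb.europa.eu/'
new_seed = '/pub/html/index.en.html'

# a link starting with '/' is resolved against the root of the site
print(urljoin(seed_external, new_seed))         # https://www.ecb.europa.eu/pub/html/index.en.html

# a link without a leading '/' is resolved relative to the base URL
print(urljoin(seed_external, 'index.de.html'))  # https://www.ecb.europa.eu/index.de.html
```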

In [71]:
def URL_to_BS(url):
    
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "html.parser")
    return soup

In [72]:
sub_page = URL_to_BS(new_new_seed)
sub_page

<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link as="font" crossorigin="" href="/shared/dist/fonts/opensans_fixed/OpenSans-Regular.woff2" rel="preload"/>
<link as="font" crossorigin="" href="/shared/dist/fonts/opensans_fixed/OpenSans-SemiBold.woff2" rel="preload"/>
<link as="font" crossorigin="" href="/shared/dist/fonts/opensans_fixed/OpenSans-Bold.woff2" rel="preload"/>
<link as="font" crossorigin="" href="/shared/dist/fonts/ecb-iconset/ECB-icon-set.woff2" rel="preload"/>
<link href="/fav.ico" rel="icon" sizes="16x16"/>
<link href="/favicon-16.png" rel="icon" sizes="16x16"/>
<link href="/favicon-32.png" rel="icon" sizes="32x32"/>
<link href="/favicon-64.png" rel="icon" sizes="64x64"/>
<link href="/favicon-128.png" rel="icon" sizes="128x128"/>
<link href="/favicon-192.png" rel="icon" sizes="192x192"/>
<link href="/favicon-256.png" rel="icon" sizes="256x256"/>
<link href="/favicon-196.png" rel="shortcut icon" s

13. Write a function which takes a string-type object (e.g. a URL) as input and returns a readily parsable `BeautifulSoup` object.

In [76]:
results_list = list(map(lambda x: x.text.strip(), sub_page.find_all('a')))

Navigate further into the section of "Our researchers".

In [77]:
results_list.index('Our researchers')

50

In [78]:
results_list

['',
 '',
 'Български',
 'Čeština',
 'Dansk',
 'Deutsch',
 'Eλληνικά',
 'English',
 'Español',
 'Eesti keel',
 'Suomi',
 'Français',
 'Gaeilge',
 'Hrvatski',
 'Magyar',
 'Italiano',
 'Lietuvių',
 'Latviešu',
 'Malti',
 'Nederlands',
 'Polski',
 'Português',
 'Română',
 'Slovenčina',
 'Slovenščina',
 'Svenska',
 'Banking supervision website',
 '',
 'About',
 'Media',
 'Explainers',
 'Research & Publications',
 'Statistics',
 'Monetary Policy',
 'The euro',
 'Payments & Markets',
 'Careers',
 'Banking supervision',
 'Search',
 '',
 '',
 '',
 '',
 '',
 '',
 'Read our latest Economic Bulletin',
 'Read the Research Bulletin',
 'Read our Financial Stability Review',
 'Read our latest Macroprudential Bulletin',
 'Read our latest working papers',
 'Our researchers',
 '31 August 2021\nAsset encumbrance in euro area banks: analysing trends, drivers and prediction properties for individual bank crises',
 '27 August 2021\nThe corporate saving glut and the current account in Germany',
 '25 August 2

14. Open the `new_new_seed` URL in your browser and enable the CSS SelectorGadget. Highlight the box containing the first article. The other, similar boxes should be highlighted as well. Copy the identified CSS selector and search the `sub_page` object again, but this time for the elements corresponding to the CSS selector you found (use `.select()` instead of `find_all()`). Store the subset of elements in a list. You can achieve all of this in one line of code. How many items does this list contain?

15. Split the list's elements into their hyperlinks (`href`) and text attributes' values.

In [None]:
# a dictionary cannot be indexed by position; use an explicit key OR list(D.values())[positional index]

In [None]:
import json

In [None]:
with open('ECB_research_dict.json', 'w', encoding='utf-8') as f:
    
    json.dump(D, f, ensure_ascii=False)

In [None]:
with open('ECB_research_dict.json', 'r', encoding='utf-8') as f:
    
    D_read = json.load(f)

In [None]:
D == D_read

## Directories & navigating with `os`

We have seen how to change directory with inline magic 
```python 
%cd
```

A more precise way of creating directories, changing and retrieving references is facilitated by `os`:

1. os.getcwd() - get current working directory
2. os.chdir(path) - change directory to specified `path`
3. os.mkdir(path) - create directory named `path`
4. os.listdir(path) - lists all files and folders inside of `path`
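
The four functions can be combined as in the sketch below. To avoid leaving files behind, it works inside a temporary folder created with `tempfile`; the folder name `Images` and the file name `example.txt` are chosen only for this illustration.

```python
import os
import tempfile

start_dir = os.getcwd()                      # remember where we started

with tempfile.TemporaryDirectory() as base:  # throwaway folder for the demonstration
    images_dir = os.path.join(base, 'Images')
    if not os.path.exists(images_dir):       # create the folder only if it is missing
        os.mkdir(images_dir)

    os.chdir(images_dir)                     # change into the new folder
    with open('example.txt', 'w') as f:      # create a small file in it
        f.write('test')

    listing = os.listdir(images_dir)         # list the folder's content
    print(listing)                           # ['example.txt']

    os.chdir(start_dir)                      # go back before the folder is removed
```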

## Try & catch

When interacting with web services many things can go wrong, e.g. timeouts, incomplete downloads, unanticipated client-side behavior etc.

In general, an unhandled exception, e.g. caused by accessing an object that does not exist, makes your program crash.

A way to handle such exceptional cases and instruct your program to continue running *as if* no exception occurred is try & catch (`try` / `except` in Python):

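```python
try:
    operation_1    # code that may raise an exception
except Exception:  # prefer a specific exception class over a bare `except:`
    operation_2    # fallback code; use `pass` instead to ignore the error
```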
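
A typical case in scraping is a search that returns no element, so that indexing the empty result with `[0]` raises an `IndexError`. The following sketch mimics this with plain lists; the function name `first_or_default` and the default text are assumptions made for this example.

```python
def first_or_default(matches, default='Not found.'):
    """Return the first element of `matches`, or `default` if there is none."""
    try:
        return matches[0]
    except IndexError:  # raised when `matches` is empty
        return default


print(first_or_default(['6 September 2021']))  # 6 September 2021
print(first_or_default([]))                    # Not found.
```

Catching only `IndexError` keeps other, unexpected errors visible instead of silently hiding them.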

## Machine Learning intermezzo: Image recognition
```python
!pip install --user deepface
```
Using [`deepface`](https://github.com/serengil/deepface), a lightweight face recognition library based on TensorFlow & Keras and openCV comprising state-of-the-art models such as VGG-Face (University of Oxford), Google FaceNet and Facebook DeepFace, we can predict `gender`, `age` and `facial expression` from an image. Note that all image inputs need to have identical dimensions. Without going too much into technical details, this is the broad algorithm:

1. **Feature engineering**: Re-arranging pixels into one array with each index position, one value [0, 255] per color from the RGB spectrum + **segmentation**
2. **Model training**: 
    - Labeled data
    - (Convolutional) neural networks seem to perform well, probably because they capture interactions between features flexibly, and because exploiting the correlation of neighbouring pixels actually **reduces** the parameter space to be estimated and scales well
    - flexible way of scanning the image to allow misalignment
    - additional features = convolutional layers, "edges" etc.
    
    1) Kernel: randomly distributed sub-image per iteration
    
    2) Backpropagation: Tweak weights until Kernel matches its corresponding position within the original image (Dot product)
    
    3) Convolution: "Scanning of image" in >= 1 step sizes
    
    ![Alt Text](2D_Convolution_Animation.gif)
    
    4) Feature map + bias: summary of neighbouring pixels / features
    
    5) Rectified Linear Unit (ReLU): Set all negative values in feature map to 0
    
    $f(x) = \max(0, x)$ (a smooth alternative is softplus, $f(x) = \ln(1+e^{x})$)
    
    6) Additional (smaller) filter on rectified feature map: maximum or mean of the values in each window
    
    7) Max-pooled (mean-pooled): summary of first filter where highest similarity was found from feature map
    
    8) Input nodes: flattened max-pool + associated weights (Dot product + bias)
    
    9) ReLU + Output nodes (as many as classes), e.g. two for female & male
    
    10) Dot product + bias = 1 or 0 
      
3. **Cross-Validation**: Trial&Error, adjust parameter weights such that cost function is minimized
4. **Evaluation**: Final check on unseen Test dataset, usually several different models/specifications
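
Steps 3) to 7) can be illustrated with a toy example in plain Python. This is only a sketch of the mechanics on a tiny hand-made "image" and a fixed, hand-picked kernel; it is not how `deepface` is implemented, and the function names are made up for this illustration.

```python
def convolve2d(image, kernel, bias=0):
    """Slide the kernel over the image (step size 1) and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)) + bias
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Set all negative values in the feature map to 0."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# 5x5 grayscale "image" with a bright vertical stripe
image = [[0, 0, 9, 9, 0] for _ in range(5)]
# kernel that responds to a change from dark (left) to bright (right)
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d(image, kernel)  # 4x4 feature map
rectified = relu(feature_map)            # negative responses removed
pooled = max_pool(rectified)             # 2x2 summary

print(rectified[0])  # [0, 18, 0, 0]
print(pooled)        # [[18, 0], [18, 0]]
```

In a real network the kernel values are not chosen by hand but learned during training.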

We will be using a pre-trained model from Keras/TensorFlow which is commonly stored in HDF5 file format and can contain both data and model weights.

In [None]:
# only for ECB laptops
import os
os.environ["HTTP_PROXY"] = "http://ap-python-proxy:x2o7rCPYuN1JuV8H@app-gw-2.ecb.de:8080"
os.environ["HTTPS_PROXY"] = "https://ap-python-proxy:x2o7rCPYuN1JuV8H@app-gw-2.ecb.de:8080"

from deepface import DeepFace

In [None]:
from IPython.display import Image


In [None]:
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
%matplotlib inline
from PIL import Image

im = Image.open('./Images/sample_image_0.png')
plt.imshow(im)
ax = plt.gca()
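# `obj` is assumed to hold the result of an earlier DeepFace.analyze(...) call for this image (not shown here)
rect = Rectangle((obj['region']['x'],obj['region']['y']),obj['region']['w'],obj['region']['h'],linewidth=1,edgecolor='r',facecolor='none')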
ax.add_patch(rect)

## Exercise 2: Identifying elements, writing to disk and performing operations using `requests`

1. Instantiate an empty data container, e.g. a `list` which will be appended with a well-structured dictionary in each iteration.
2. Start a `for`-loop to iterate over each element in the dictionary `D`. Use the `len()` function to specify the upper bound of iterations.
3. Assign objects for each field:
    - `researcher_lastname` (Hint: Use the .split() function)
    - `researcher_firstname` (Hint: Use the .split() function)
    - `researcher_url` (link to the researcher's dedicated page)
    - `researcher_image_url` (link to the researcher's image space)
4. `request` the `researcher_image_url` and write the `response`'s `content` to your local directory, e.g.
    ```python
       file = open("./Images/researcher_image_{}.png".format(i), "wb")
       file.write(...)
       file.close()
    ```
    Hint: Check if the directory `dirName` = 'Images' exists (os.path.exists(dirName)), if not: create it.
    Include a pause of one second after each `request`. Hint: use the `time` module and time.sleep(1)
5. Call the `DeepFace.analyze()` function and supply your `file_path` of the image. Retrieve predictions for `age`, `gender` and `emotion`. **Note:** Not every `researcher_image_url` has an actual file associated with it! Implement a fallback-mechanism to record this instance for the corresponding cases.
6. Go back to step 2) until you have reached the end of your loop.

*Bonus: use a `while`-loop instead of a `for`-loop that runs until you receive a `response.status_code != 200`*

In [None]:
from tqdm.notebook import tqdm
import numpy as np
import os
import time

# Code for exercise 2








## Getting rid of SSL verification errors using requests
1. Download the `ECBInterceptionRootCA.cer` from this [page](https://confluence.ecb.de/display/~dirienzo/Work+with+GitHub+repos+from+ECB+laptop) to your `J:` drive.
2. Run this code:
```python
import certifi
certifi.where()
```
This returns the path of the certificate bundle file that is used by `requests`.
3. Open this file and append the entire content of `ECBInterceptionRootCA.cer`, including the -----BEGIN CERTIFICATE----- and -----END CERTIFICATE----- tags.

This maintains the ECB's security standards (e.g. against man-in-the-middle attacks) instead of simply ignoring the warnings.

## Exercise 3: Pagination

Go to the [news page](https://www.uni-potsdam.de/de/nachrichten/) of the University of Potsdam.

You have probably realised that the articles presented on the first news page are not the entire collection of the University of Potsdam. Your goal is to retrieve a complete collection of all articles that are available on the university's website and you can easily apply your new knowledge in a repetitive manner.

16. Figure out how many pages containing articles there are in total. You can do it manually, e.g. by inspecting the URL when you proceed through the collection in your browser, or programmatically by writing a `while` loop that continues until some condition, such as a status returned from your request, is violated. Make sure to include a short pause (1 second) in order not to overload the server, which in some cases could lead to a temporary ban of your device.

In [None]:
# Code for exercise 3







In [None]:
with open('articles_links.txt', 'w') as output:
    
    output.writelines("%s\n" % line for line in articles_links)

17. Read in the file of links you stored in the previous step (`articles_links.txt`). Split the list into 4 evenly sized chunks and iterate over each chunk. In each iteration, obtain the HTML of every hyperlink in the chunk, parse it and identify the publication date, the contact, the image's hyperlink/reference and the length of the main text body. Note that some, or even all, of these elements may not be available. Define an appropriate data type for each field and, in each iteration, append the fields **as a dictionary** to a list.
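The reading and partitioning part of step 17 might be sketched like this. `read_links` and `split_into_chunks` are hypothetical helper names, and the example links are placeholders.

```python
def read_links(path):
    """Read one link per line, skipping empty lines."""
    with open(path) as fh:
        return [line.strip() for line in fh if line.strip()]


def split_into_chunks(items, n_chunks=4):
    """Split a list into n_chunks parts whose lengths differ by at most one."""
    size, remainder = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < remainder else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


demo_links = [f"https://example.org/article/{i}" for i in range(10)]
chunks = split_into_chunks(demo_links, 4)
print([len(chunk) for chunk in chunks])
```

In the exercise, `articles_links_r = read_links('articles_links.txt')` would take the place of `demo_links`. `numpy.array_split` offers similar behaviour if you prefer a library call.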

In [None]:
# Partitioning




## Asynchronous HTTP requests

18. Install the libraries `aiohttp` and `tqdm` (`asyncio` is part of Python's standard library and needs no installation).

In [None]:
import asyncio
import aiohttp # !pip install aiohttp
import bs4
import tqdm

In [None]:
import nest_asyncio # !pip install nest_asyncio
nest_asyncio.apply()

async def get(*args, **kwargs):
    
    async with aiohttp.ClientSession() as session:
        
        async with session.get(*args, **kwargs) as resp:
            return (await resp.text())
        
def get_fields(page):
    
    soup = bs4.BeautifulSoup(page, "html.parser")
    
    # Each field falls back to a placeholder when its element is missing
    try:
        publication_date = soup.find_all(class_=['time', 'up-news-single-date'])[0].text.strip()
    except IndexError:
        publication_date = 'No publication date found.'
        
    try:
        # Skip the first five characters of the author element's text
        author_name = soup.find_all(class_=['up-news-single-author'])[0].text[5:].strip()
    except IndexError:
        author_name = 'No author found.'
    
    try:
        # Keep the first image whose source ends in ".jpg"; ignore <img> tags without src
        image_url = ['https://www.uni-potsdam.de' + x['src'] for x in soup.select('img')
                     if x.get('src', '').endswith('.jpg')][0]
    except IndexError:
        image_url = "No image found."
        
    try:
        abstract = soup.select('.up-opener-text-with-border')[0].text.strip()
        abstract_length = len(abstract)
    except IndexError:
        abstract = 'No abstract found.'
        abstract_length = 0
    
    return publication_date, author_name, image_url, abstract, abstract_length


async def save_fields(query):
    
    url = query
    
    async with sem:
        
        page = await get(url, proxy = list(proxies.values())[1], compress=True)
        
        empty_dict = {}
    
        [publication_date, author_name, image_url, abstract, abstract_length] = get_fields(page)
    
        empty_dict['URL'] = url
        empty_dict['Publication date'] = publication_date
        empty_dict['Author Name'] = author_name
        empty_dict['Image URL'] = image_url
        empty_dict['Abstract'] = abstract
        empty_dict['Abstract Length'] = int(abstract_length)
    
        results_list.append(empty_dict)

sem = asyncio.Semaphore(5)

In [None]:
start = time.time()

results_list = []

loop = asyncio.get_event_loop()
# asyncio.wait() rejects bare coroutines since Python 3.11; gather() wraps them in tasks
f = asyncio.gather(*[save_fields(d) for d in articles_links_r])
result = loop.run_until_complete(f)

end = time.time()

print(end - start)
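The concurrency pattern used above, a semaphore that caps the number of simultaneous requests, can be tried out without network access. The sketch below is an illustration under assumptions, not the notebook's code: `crawl` and `fetch` are hypothetical names, `asyncio.sleep` stands in for the HTTP round trip, and results are returned by `asyncio.gather` instead of being appended to a global list.

```python
import asyncio


async def crawl(urls, limit=5):
    sem = asyncio.Semaphore(limit)  # at most `limit` fetches run at once
    in_flight = 0
    peak = 0

    async def fetch(url):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the HTTP round trip
            in_flight -= 1
            return {"URL": url}

    # gather() returns the results in the same order as the input URLs
    results = await asyncio.gather(*(fetch(u) for u in urls))
    return results, peak


demo_urls = [f"https://example.org/article/{i}" for i in range(12)]
results, peak = asyncio.run(crawl(demo_urls))
print(len(results), "results, peak concurrency", peak)
```

Inside a notebook, `asyncio.run` only works once `nest_asyncio.apply()` has been called, as in the cell above. Returning the results from `gather` keeps them in input order, which also makes it easy to spot links that produced no result.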

19. Use a list comprehension to find any links that appear in `articles_links_r` but not in `results_list`. Are there any?
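A minimal sketch of such a comparison on toy data is given below. The links are placeholders; the key `'URL'` matches the dictionaries built in `save_fields`.

```python
# Toy stand-ins for the real lists
articles_links_r = [
    "https://example.org/article/1",
    "https://example.org/article/2",
    "https://example.org/article/3",
]
results_list = [
    {"URL": "https://example.org/article/1"},
    {"URL": "https://example.org/article/3"},
]

# Collect the scraped URLs in a set for fast membership tests
scraped_urls = {row["URL"] for row in results_list}
missing_links = [link for link in articles_links_r if link not in scraped_urls]
print(missing_links)
```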

20. Install the `pandas` library.

In [None]:
import pandas as pd

21. Convert the `publication_date` into a `pandas` `datetime` object and plot a time series of published articles on a daily basis. Bonus: Aggregate the time series into monthly frequency. In which month-year were most articles published?
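One way to approach step 21 is sketched below on toy data. The date format `%d.%m.%Y` is an assumption and must be adapted to what the scraped dates actually look like; rows with the placeholder text become `NaT` and are dropped. The result is named `monthly_count` so that it matches the plotting cells further down.

```python
import pandas as pd

# Toy stand-in for the scraped results
results_list = [
    {"URL": "https://example.org/a", "Publication date": "01.03.2021"},
    {"URL": "https://example.org/b", "Publication date": "15.03.2021"},
    {"URL": "https://example.org/c", "Publication date": "02.04.2021"},
    {"URL": "https://example.org/d", "Publication date": "No publication date found."},
]

df = pd.DataFrame(results_list)
# Unparseable entries are coerced to NaT instead of raising an error
df["date"] = pd.to_datetime(df["Publication date"], format="%d.%m.%Y", errors="coerce")
dated = df.dropna(subset=["date"]).set_index("date")

daily_count = dated.resample("D")[["URL"]].count()     # articles per day
monthly_count = dated.resample("MS")[["URL"]].count()  # articles per month (month start)
busiest_month = monthly_count["URL"].idxmax()
print(monthly_count)
print(busiest_month)
```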

22. Install the library `matplotlib`.

In [None]:
import matplotlib
import matplotlib.pyplot as plt

In [None]:


fig, ax = plt.subplots()
ax.plot(monthly_count.index, monthly_count.URL)

ax.set(xlabel='Date (monthly)', ylabel='Number of articles published',
       title='A simple so-so-looking graph')
ax.grid()

#fig.savefig("monthly_publications.png")
plt.show()

23. Install the library `plotly`.

In [None]:
# !pip install plotly

import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

24. Install the `chart-studio` library.

In [None]:
import chart_studio
import chart_studio.plotly as py
import plotly.graph_objs as go

25. Log in to [Plotly Chart Studio](https://chart-studio.plotly.com/Auth/login/#/) and obtain your `Username` and `API key`. Store each of them on its own line in a `.py` file, e.g. `plotly_config.py`.
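A possible layout for such a config file, and a way to reload it after editing, is sketched below. The variable names `username` and `api_key` and the placeholder values are assumptions, and the file is written to a temporary folder only so that the sketch is self-contained. In practice you would create `plotly_config.py` by hand next to the notebook and keep it out of version control.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical config file: one assignment per line
config_dir = Path(tempfile.mkdtemp())
config_file = config_dir / "plotly_config.py"
config_file.write_text('username = "your-username"\napi_key = "your-api-key"\n')

sys.path.insert(0, str(config_dir))  # make the folder importable
importlib.invalidate_caches()
import plotly_config

print(plotly_config.username)

# After editing the file, reload it so the notebook sees the new values
config_file.write_text('username = "new-name"\napi_key = "your-api-key"\n')
importlib.reload(plotly_config)
print(plotly_config.username)
```

The loaded values can then be handed to Chart Studio, for example via `chart_studio.tools.set_credentials_file(username=plotly_config.username, api_key=plotly_config.api_key)`.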

In [None]:
# only for ECB laptops
import os
os.environ["HTTP_PROXY"] = "http://ap-python-proxy:x2o7rCPYuN1JuV8H@app-gw-2.ecb.de:8080"
os.environ["HTTPS_PROXY"] = "https://ap-python-proxy:x2o7rCPYuN1JuV8H@app-gw-2.ecb.de:8080"

In [None]:
# useful for reloading files into memory



In [None]:
fig = go.Figure(data=[
    
    go.Scatter(name='Published articles', x = list(monthly_count.index),
    y = list(monthly_count['URL']))
    
])

fig.layout.update(title = go.layout.Title(
                        text='Published articles (monthly)'))

fig.layout.update(yaxis= go.layout.YAxis(title=go.layout.yaxis.Title(
                        text='Count')))

fig.layout.update(xaxis = go.layout.XAxis(title = go.layout.xaxis.Title(text = 'Date'), rangeslider = dict(visible = True)));

