# Python II - Assignment 2

This **Home Assignment** is to be submitted and you will be given points for each of the tasks. It familiarizes you with the basics of *web scraping* and *regular expressions*.

## Formalities
**Submit in a group of 2-3 people by 22.06.2020 23:59 CET. The deadline is strict!**

## Evaluation and Grading
General advice for programming exercises at *CSSH*:
Evaluation of your submission is done semi-automatically. Think of it as this notebook being 
executed once. Afterwards, some test functions are appended to this file and executed as well.

Therefore:
* Submit valid _Python3_ code only!
* Use external libraries only when specified by task.
* Ensure your definitions (functions, classes, methods, variables) follow the specification if
  given. The concrete signature of e.g. a function can usually be inferred from the task
  description, code skeletons and test cases.
* Ensure the notebook does not rely on current notebook or system state!
  * Use `Kernel --> Restart & Run All` to see if you are using any definitions, variables etc. that 
    are not in scope anymore.
  * Double check whether your code relies on the presence of files or directories other than those
    mentioned in the given tasks. Tests run under Linux, so don't use Windows-style paths
    (`some\path`, `C:\another\path`). Also, use only paths that are relative to and within your
    working directory (OK: `some/path`, `./some/path`; NOT OK: `/home/alice/python`,
    `../../python`).
* Keep your code idempotent! Running it or parts of it multiple times must not yield different
  results. Minimize usage of global variables.
* Ensure your code / notebook terminates in reasonable time.
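
As an illustration of the points on relative paths and idempotency, here is a minimal sketch. The helper name `cache_path` is hypothetical and not part of any task; it only shows that a path built relative to the working directory, combined with `exist_ok=True`, gives the same result no matter how often it is called.

```python
import os


def cache_path(language_edition, article_name):
    """Return a relative path below ./cache and make sure its folder exists."""
    # Relative to the working directory; no absolute or Windows-style paths.
    directory = os.path.join("cache", language_edition)
    # exist_ok=True makes the call idempotent: a second run changes nothing.
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, f"{article_name}.html")
```

Calling `cache_path("en", "Example")` twice returns the same relative path both times and does not fail if the folder is already there.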

**There's a story behind each of these points! Don't expect us to fix your stuff!**

Regarding the scores, you will get no points for a task if:
- your function throws an unexpected error (e.g. takes the wrong number of arguments)
- gets stuck in an infinite loop
- takes much much longer than expected (e.g. >1s to compute the mean of two numbers)
- does not produce the desired output (e.g. returns a list sorted in descending order even though we asked for ascending, returns the mean and the std even though we asked for the mean only, or only prints the output instead of returning it!)
- ...

# History in Wikipedia (10 points total)

Wikipedia has a lot of information on historic events. Assume you want to conduct a study that examines which language edition talks more about different historic events (as indicated by years). As an example, you first consider the article "History_of_Germany" in the English and German Wikipedia.

To get articles from the web, use the `requests` library. To deal with HTML content, you can use the Beautiful Soup library.

In [2]:
import requests
from bs4 import BeautifulSoup
import os
import re

## a) Grabbing a wikipedia article (1 + 0.5 + 0.5 + 0.5)
Write a function `get_article_from_web(article_name, language_edition)` that returns the HTML as string for that article in a specific language edition. Assume that the article exists.

To save bandwidth when conducting multiple experiments, we want to set up a cache of Wikipedia articles.
To set up the cache, write the function `save_article_to_disk(article_name, language_edition, content)` that saves the content for that Wikipedia article in `'./cache/{language_edition}/{article_name}.html'`. If any of the folders do not exist, they are created. Please read the information on evaluation and grading.


Then write a function `get_article_from_disk(article_name, language_edition)` that returns the cached version of the article from disk. If the article does not exist in the cache, it raises a `ValueError`.


Write a function `get_article(article_name, language_edition)` that uses a local cache for fetching articles from Wikipedia. If the article exists in the cache, it returns the cached version. If not, it fetches the HTML for that article from the web and writes it to the cache, so there is no need to get it from the web the next time. Use the previously defined functions for this.

Use the "caching" version `get_article` for all of the following tasks.

In [3]:
def get_article_from_web(article_name, language_edition):
    # Build the URL locally; a global variable would leak state between calls.
    url = f"https://{language_edition}.wikipedia.org/wiki/{article_name}"
    response = requests.get(url, timeout=30)
    # Return the HTML as served instead of a re-indented copy.
    return response.text

In [4]:
def save_article_to_disk(article_name, language_edition, content):
    # exist_ok=True keeps repeated calls safe without hiding other OS errors.
    os.makedirs(f"./cache/{language_edition}", exist_ok=True)

    # The with-block closes the file even if writing fails.
    with open(f"./cache/{language_edition}/{article_name}.html", "w", encoding="utf-8") as f:
        f.write(content)


In [5]:
def get_article_from_disk(article_name, language_edition):
    try:
        with open(f"./cache/{language_edition}/{article_name}.html", "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        # The task asks for a ValueError when the article is not cached.
        raise ValueError("The article does not exist")



In [6]:
def get_article(article_name, language_edition):
    try:
        return get_article_from_disk(article_name, language_edition)
    except ValueError:
        # Cache miss: fetch once, store the result, then serve it from the cache.
        content = get_article_from_web(article_name, language_edition)
        save_article_to_disk(article_name, language_edition, content)
        return get_article_from_disk(article_name, language_edition)

## b) Links from one article to other articles  (1)
Write a function `get_links(article_name, language_edition)` that returns a list of Wikipedia article names. These are the articles linked from the article specified by `article_name` (in the given `language_edition`). Only include links inside its 'content' div. Do not include links that are reachable through the left navigation bar.

Sort the links in alphabetically increasing order.
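
One possible way to approach this is sketched below on a small hand-written HTML snippet rather than a real article, so it runs without network access. The helper name `article_names_in_content` is hypothetical, and skipping every name that contains a colon is a simplifying assumption to drop namespace pages such as `File:`; it would also drop regular articles with a colon in their title.

```python
from urllib.parse import unquote

from bs4 import BeautifulSoup

# Hand-written sample markup, not taken from a real page.
SAMPLE_HTML = """
<div id="mw-navigation"><a href="/wiki/Main_Page">Main page</a></div>
<div id="content">
  <a href="/wiki/History_of_Europe">Europe</a>
  <a href="/wiki/History_of_Berlin#Early_history">Berlin</a>
  <a href="/wiki/History_of_Berlin">Berlin again</a>
  <a href="/wiki/File:Map.png">a file</a>
  <a href="https://example.org/">external</a>
  <a name="anchor-without-href">no href</a>
</div>
"""


def article_names_in_content(html):
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="content")  # ignore navigation outside this div
    names = set()  # a set removes duplicate links
    for anchor in content.find_all("a", href=True):
        href = anchor["href"]
        if not href.startswith("/wiki/"):
            continue  # external link or non-article URL
        # Drop the section fragment, then decode percent-escapes.
        name = unquote(href[len("/wiki/"):].split("#", 1)[0])
        if not name or ":" in name:
            continue  # assumption: a colon marks a namespace page
        names.add(name)
    return sorted(names)


print(article_names_in_content(SAMPLE_HTML))
# ['History_of_Berlin', 'History_of_Europe']
```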

In [7]:
def get_links(article_name, language_edition):
    page_content = get_article(article_name, language_edition)
    # Parse the cached HTML string.
    soup = BeautifulSoup(page_content, 'html.parser')
    # Note: this collects full URLs from the whole page, whereas the task
    # asks for article names from the 'content' div only.
    links = set()  # a set removes duplicates
    for anchor in soup.find_all("a"):
        href = str(anchor.get('href'))
        # Keep internal links; drop project, index.php and file pages.
        if href.startswith("/w") and ('Wikipedia' not in href) and ('index' not in href) and ("File" not in href):
            links.add(f"https://{language_edition}.wikipedia.org" + href)
    return sorted(links)
    

In [8]:
get_links("History_of_Germany","en")

s://en.wikipedia.org/wiki/History_of_Anglo-Saxon_England',
 'https://en.wikipedia.org/wiki/History_of_Armenia',
 'https://en.wikipedia.org/wiki/History_of_Austria',
 'https://en.wikipedia.org/wiki/History_of_Azerbaijan',
 'https://en.wikipedia.org/wiki/History_of_Baden',
 'https://en.wikipedia.org/wiki/History_of_Belarus',
 'https://en.wikipedia.org/wiki/History_of_Belgium',
 'https://en.wikipedia.org/wiki/History_of_Berlin',
 'https://en.wikipedia.org/wiki/History_of_Bosnia_and_Herzegovina',
 'https://en.wikipedia.org/wiki/History_of_Bulgaria',
 'https://en.wikipedia.org/wiki/History_of_Christianity',
 'https://en.wikipedia.org/wiki/History_of_Cologne',
 'https://en.wikipedia.org/wiki/History_of_Croatia',
 'https://en.wikipedia.org/wiki/History_of_Cyprus',
 'https://en.wikipedia.org/wiki/History_of_Denmark',
 'https://en.wikipedia.org/wiki/History_of_East_Germany',
 'https://en.wikipedia.org/wiki/History_of_Estonia',
 'https://en.wikipedia.org/wiki/History_of_Europe',
 'https://en.wik

## c) Getting the same article in a different language edition (1.0)
Write a function `switch_language(article_name, old_language_edition, new_language_edition)` that returns the name of the Wikipedia article in the new language edition.
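
As a rough sketch of one option, the article name can be read from the `href` of the interlanguage link. The snippet below uses hand-written sample markup instead of a real page; the helper name `name_from_interlanguage_link` is hypothetical, and the assumption is that the link points to `https://{edition}.wikipedia.org/wiki/{name}`.

```python
from urllib.parse import unquote, urlparse

from bs4 import BeautifulSoup

# Hand-written sample markup, not taken from a real page.
SAMPLE_HTML = """
<ul>
  <li><a href="https://de.wikipedia.org/wiki/Geschichte_Deutschlands"
         lang="de" hreflang="de"
         class="interlanguage-link-target">Deutsch</a></li>
  <li><a href="https://fr.wikipedia.org/wiki/Histoire_de_l%27Allemagne"
         lang="fr" hreflang="fr"
         class="interlanguage-link-target">Français</a></li>
</ul>
"""


def name_from_interlanguage_link(html, new_language_edition):
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", class_="interlanguage-link-target",
                     lang=new_language_edition)
    if link is None:
        raise ValueError(f"No link to edition {new_language_edition!r}")
    path = urlparse(link["href"]).path  # e.g. '/wiki/Geschichte_Deutschlands'
    # Everything after '/wiki/' is the article name; decode percent-escapes.
    return unquote(path.split("/wiki/", 1)[1])


print(name_from_interlanguage_link(SAMPLE_HTML, "de"))
# Geschichte_Deutschlands
```

Reading the name from the URL keeps the underscores, so the result can be passed on directly as an `article_name`.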

In [21]:
def switch_language(article_name, old_language_edition, new_language_edition):
    page_content = get_article(article_name, old_language_edition)
    # Parse the cached HTML string.
    soup = BeautifulSoup(page_content, 'html.parser')
    # Interlanguage links carry the target edition in their lang attribute.
    wiki_new_lang = soup.find_all(class_="interlanguage-link-target", lang=new_language_edition)
    # Drop the last two tokens of the title (separator and language name).
    # This assumes the language name is a single word.
    new_name_article = " ".join(wiki_new_lang[0]['title'].split()[:-2])
    return new_name_article

In [22]:
switch_language("History_of_Germany","en","de")

'Geschichte Deutschlands'

## d) Using regular expressions to extract years from a page content (1.5)
Write a function `extract_years(string_input)` that gets a string and returns how often a certain year number from 1000-2019 occurs. You can assume that each number between 1000 and 2020 is a year number. The result is a dictionary with year numbers as keys, and the number of occurrences as values.

Example: `extract_years("The king reigned from 1245 to 1268. He died in 1268. His favorite number was 12689")` should return the dictionary `{1245: 1, 1268: 2}`.
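
A minimal sketch of how such a pattern could look, assuming the range is 1000 to 2019 inclusive. The function name `count_years` is hypothetical. The lookarounds `(?<!\d)` and `(?!\d)` make sure that a four-digit match is not part of a longer number such as 12689.

```python
import re
from collections import Counter

# 1000-1999 via 1\d{3}, 2000-2019 via 20[01]\d; no digit directly before or after.
YEAR_PATTERN = re.compile(r"(?<!\d)(?:1\d{3}|20[01]\d)(?!\d)")


def count_years(text):
    """Count how often each year from 1000 to 2019 occurs in text."""
    return dict(Counter(int(match) for match in YEAR_PATTERN.findall(text)))


example = ("The king reigned from 1245 to 1268. He died in 1268. "
           "His favorite number was 12689")
print(count_years(example))
# {1245: 1, 1268: 2}
```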

## e) Aggregate years for articles (0.5)
Write a function `extract_years_for_articles(article_names, language_edition)` that extracts the year counts for all the articles in that particular language edition and aggregates them into a single dictionary. This aggregated dictionary is returned.
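
The aggregation step on its own can be sketched with `collections.Counter`. The helper name `merge_year_counts` is hypothetical; it takes per-article dictionaries that are assumed to have been computed already and adds up the counts per year.

```python
from collections import Counter


def merge_year_counts(per_article_counts):
    """Sum several {year: count} dictionaries into one."""
    total = Counter()
    for counts in per_article_counts:
        total.update(counts)  # update() adds counts instead of overwriting
    return dict(total)


print(merge_year_counts([{1900: 1, 1950: 2}, {1950: 3}]))
# {1900: 1, 1950: 5}
```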

## f) Bringing it all together (1.5)
Write a function `get_all_years(base_article, base_language_edition, n=None, target_language_edition=None)` that does the following:

- determines the 'real' base article (if `target_language_edition` is specified, the article in the target language edition is used)
- extracts the first n links (all if n is None) from that 'real' base article
- aggregates the year counts across the base article **and** all the n articles that the base article links to
- returns that dictionary
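
The control flow of these steps can be sketched as follows. To keep the example runnable without network access, the three helper functions are passed in as keyword arguments and replaced by small in-memory stand-ins. The name `get_all_years_sketch`, the stand-ins, and their data are all hypothetical, and `get_links` is assumed to return article names as the task specifies.

```python
from collections import Counter


def get_all_years_sketch(base_article, base_language_edition, n=None,
                         target_language_edition=None, *,
                         switch_language, get_links, extract_years_for_articles):
    # Step 1: determine the 'real' base article.
    article, edition = base_article, base_language_edition
    if target_language_edition is not None:
        article = switch_language(base_article, base_language_edition,
                                  target_language_edition)
        edition = target_language_edition
    # Step 2: take the first n links (all of them if n is None).
    links = get_links(article, edition)
    if n is not None:
        links = links[:n]
    # Steps 3 and 4: aggregate over the base article and its links.
    return extract_years_for_articles([article] + links, edition)


# Made-up in-memory stand-ins for the real helper functions.
FAKE_LINKS = {("A", "en"): ["B", "C", "D"], ("A_de", "de"): ["B_de"]}
FAKE_YEARS = {"A": {1900: 1}, "B": {1900: 2, 1950: 1}, "C": {2000: 5},
              "D": {1000: 1}, "A_de": {1871: 2}, "B_de": {1871: 1}}


def fake_switch_language(name, old_edition, new_edition):
    return f"{name}_{new_edition}"


def fake_get_links(name, edition):
    return FAKE_LINKS[(name, edition)]


def fake_extract_years_for_articles(names, edition):
    total = Counter()
    for name in names:
        total.update(FAKE_YEARS[name])
    return dict(total)


stubs = dict(switch_language=fake_switch_language,
             get_links=fake_get_links,
             extract_years_for_articles=fake_extract_years_for_articles)

print(get_all_years_sketch("A", "en", n=2, **stubs))
# {1900: 3, 1950: 1, 2000: 5}
print(get_all_years_sketch("A", "en", target_language_edition="de", **stubs))
# {1871: 3}
```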

## g) Visualize and Interpret your results (1 + 1)
Visualize the numbers of occurrences for the article History_of_Germany in both English (en) and German (de) in a timeline. Use `n=20`. Show the results here in the notebook and also save the plot to 'timeline.png' in code. Make sure the plot has a legend, axis labels, ...
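
A minimal plotting sketch with `matplotlib` is shown below. The two dictionaries contain placeholder values that are invented purely to exercise the plotting calls; they are not results. In a real solution they would come from `get_all_years`.

```python
import matplotlib.pyplot as plt

# Placeholder counts, invented only for this illustration (not real results).
counts_en = {1871: 3, 1914: 5, 1945: 9, 1990: 6}
counts_de = {1871: 4, 1914: 2, 1945: 7, 1990: 8}

fig, ax = plt.subplots(figsize=(10, 4))
for label, counts in (("en", counts_en), ("de", counts_de)):
    years = sorted(counts)  # plot the years in chronological order
    ax.plot(years, [counts[year] for year in years], marker="o", label=label)

ax.set_xlabel("Year")
ax.set_ylabel("Number of occurrences")
ax.set_title("Year mentions per language edition")
ax.legend(title="Language edition")

# A relative path keeps the file inside the working directory.
fig.savefig("timeline.png", dpi=150, bbox_inches="tight")
```

In a notebook the figure is typically displayed below the cell in addition to being written to disk.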

Describe the visualization, and give reasons for possible differences between the German and English timelines. Write that string to a file 'timeline.txt'.