# Scrape Stories from Zusammen Gegen Corona

Scrape German-language Corona stories from Zusammen Gegen Corona ("Together Against Corona"), a website of the German Ministry of Health: https://www.zusammengegencorona.de/informieren/ichhattecorona/.

The stories contain highlighted quotes which are stored separately from the main text.

The stories are stored in dictionaries with the following fields:

- `link`: The link of the story page (string)
- `title`: The title of the story page (string; is the same for all stories)
- `author`: The author of the story (string)
- `headline`: The title of the story (string; same as author)
- `intro_text`: The introductory text giving some background information on the story (string)
- `quote_text`: The highlighted quotes from the story (list of strings)
- `story_text`: The main text in between the quotes (list of strings)
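
As an illustration of this record layout, here is a minimal sketch of one story dictionary with type annotations. The names `StoryDoc` and `full_text` are hypothetical and do not appear in the notebook; the field values are taken from the example output further down, with the longer texts replaced by placeholders.

```python
from typing import List, TypedDict


class StoryDoc(TypedDict):
    """Hypothetical type describing one scraped story record."""
    link: str
    title: str
    author: str
    headline: str
    intro_text: str
    quote_text: List[str]
    story_text: List[str]


def full_text(doc: StoryDoc) -> str:
    """Join the main-text paragraphs of a story into one string."""
    return "\n".join(doc["story_text"])


example: StoryDoc = {
    "link": "https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg/",
    "title": "Ich hatte Corona",
    "author": "Siegfried, 94, aus Lüneburg",
    "headline": "Siegfried, 94, aus Lüneburg",
    "intro_text": "(placeholder for the introductory text)",
    "quote_text": ["„Ich nehme es hin, wie es ist“"],
    "story_text": ["(placeholder paragraph 1)", "(placeholder paragraph 2)"],
}

print(full_text(example))
```

Keeping `quote_text` and `story_text` as lists preserves the paragraph boundaries of the page, so the text can still be joined later in whichever way an analysis needs.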

The web pages containing the stories are saved as `.html` files in the folder at `DATA_DIR`, which the notebook creates if it does not exist yet.

In [1]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from string import capwords
import os

In [2]:
# Set URL of the main page containing the links to the story pages

URL = "https://www.zusammengegencorona.de/informieren/ichhattecorona"

# Pages 2 and 3 cannot easily be reached by the web scraper, so their story links are hand-coded

HREFS_PAGE_2 = ["vanessa-41-aus-wetter", "nora-23-aus-brand-erbisdorf", "thomas-53-aus-hamburg", 
                "markus-49-aus-bad-harzburg", "joachim-56-aus-homburg", "anton-11-aus-crawinkel", 
                "laura-38-aus-hamburg", "eine-familie-aus-baden-wuerttemberg", "iris-54-aus-lueneburg"]
HREFS_PAGE_3 = ["monika-54-aus-oberbayern", "peter-und-martina-63-und-62-jahre-alt-aus-brieselang", "peggy-46-aus-appenweier", 
                "bianca-34-aus-berlin", "mateo-24-aus-durmersheim", "julia-26-aus-amberg", "dieter-61-aus-braunschweig"]

# Set directory for storing web pages

DATA_DIR = "../zusammengegencorona.de/scraped/"

In [3]:
# Function to check if directory for saving web page exists

def check_dir(DATA_DIR):
    if not os.path.isdir(DATA_DIR):
        os.makedirs(DATA_DIR)
        print(f"Created saving directory at {DATA_DIR}")

In [4]:
# Function to load a web page from disk if it is already cached; otherwise fetch it and save it as .html to disk

def load_web_page(URL, file_name, DATA_DIR):
    check_dir(DATA_DIR)
    if os.path.exists(file_name):
        # Use a context manager so the file handle is closed after reading
        with open(file_name, "r", encoding="utf-8") as file:
            page = file.read()
        print(f"Loaded web page from {file_name}")
    else:
        req = Request(URL, headers = {"User-Agent": "Mozilla/5.0"})
        # Decode once so that both branches return a str (assumes the page is UTF-8 encoded)
        page = urlopen(req).read().decode("utf-8")
        with open(file_name, "w", encoding="utf-8") as file:
            file.write(page)
        print(f"Saved web page to {file_name}")
    return page

In [5]:
# Function to extract links to the story pages from main page

def extract_hrefs_from_url(URL):
    hrefs = []
    
    main_page = BeautifulSoup(urlopen(Request(URL + "/?", headers = {"User-Agent": "Mozilla/5.0"})).read(), "html.parser")

    for instance in main_page.find_all("a", attrs = {"class": "o-button link o-button--tertiary o-button--reverse"}):
        new_link = instance.get("href")
        hrefs.append("https://www.zusammengegencorona.de" + new_link)
        print("Extracted link: {}".format(hrefs[-1]))
    
    print("Done")
    
    return hrefs
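
The link filter above can be tried without network access. The following is a rough sketch that swaps BeautifulSoup for the standard library's `html.parser` and runs on a made-up HTML snippet; `StoryLinkParser`, `SAMPLE_HTML`, and the second link in the snippet are assumptions for illustration, and only the class string and the story path come from the notebook. It compares the full `class` attribute as one string, which is how the multi-word class value in the cell above is assumed to match.

```python
from html.parser import HTMLParser

BASE = "https://www.zusammengegencorona.de"
BUTTON_CLASS = "o-button link o-button--tertiary o-button--reverse"


class StoryLinkParser(HTMLParser):
    """Collect hrefs of <a> tags whose class attribute equals BUTTON_CLASS."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("class") == BUTTON_CLASS and attr_map.get("href"):
            # Story links on the page are relative, so prepend the site root
            self.hrefs.append(BASE + attr_map["href"])


# Made-up markup: one story button and one unrelated link
SAMPLE_HTML = (
    '<a class="o-button link o-button--tertiary o-button--reverse" '
    'href="/informieren/ichhattecorona/siegfried-94-aus-lueneburg/">Story</a>'
    '<a class="o-button" href="/some-other-page/">Other</a>'
)

parser = StoryLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.hrefs)
```

Only the first link is collected, because the second one does not carry the full button class string.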

In [6]:
# Function to add hand coded links

def add_extra_links(hrefs, page_list, URL):
    for page in page_list:
        for link in page:
            hrefs.append(URL + "/" + link + "/")
            print("Added extra link: {}".format(hrefs[-1])) 
    
    print("Done")
    
    return hrefs
            

In [7]:
# Function to extract text from story pages

def extract_text_from_url(URL):
    new_file_name = DATA_DIR + URL.split("/")[-2] + ".html"
    new_page = BeautifulSoup(load_web_page(URL, new_file_name, DATA_DIR), "html.parser")
    new_title = new_page.find("a", attrs = {"class": "o-breadcrumbs__link", "href": "/informieren/ichhattecorona/"})
    new_headline = new_page.find("h1", attrs = {"class": "o-headline o-headline--1 o-intro__headline"})
    new_intro_text = new_page.find("div", attrs = {"class": "o-copy o-intro__copy o-copy--intro o-copy--html"})
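    # Skip quotes whose .string is None (e.g. headings with nested tags) so that quote_text only contains strings
    new_quote_text = [text.string for text in new_page.find_all("h2", attrs = {"class": "o-headline o-headline--2"}) if text.string is not None]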
    new_inline_text = new_page.find_all("p", attrs = {"class": "o-copy o-copy--article"})
    new_stripped_text = [text.string for text in new_inline_text]
    new_story_text = [text.replace("\xa0", " ") for text in new_stripped_text if text is not None]
    new_doc = {
        "link": URL,
        "title": new_title.string if new_title is not None else "",
        "author": new_headline.string if new_headline is not None else "",
        "headline": new_headline.string if new_headline is not None else "",
        "intro_text": new_intro_text.string if new_intro_text is not None else "",
        "quote_text": new_quote_text,
        "story_text": new_story_text
    }
    print("Extracted text from: {}".format(URL))
    
    return new_doc
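
The file name in the function above is taken from `URL.split("/")[-2]`, which relies on every story URL ending with a slash; without it, the second-to-last segment would be `ichhattecorona` for every story. A small sketch of a variant that tolerates both forms is shown below. The helper name `cache_file_name` is hypothetical and not part of the notebook.

```python
import os
from urllib.parse import urlparse


def cache_file_name(url: str, data_dir: str) -> str:
    """Return the local .html path for a story URL (hypothetical helper)."""
    # Strip trailing slashes so the last path segment is always the story slug
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    return os.path.join(data_dir, slug + ".html")


data_dir = "../zusammengegencorona.de/scraped/"
story = "https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg"

# Both spellings of the URL map to the same cache file
print(cache_file_name(story + "/", data_dir))
print(cache_file_name(story, data_dir))
```

Because both URL forms resolve to the same path, a page cached under one spelling is also found under the other.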

In [8]:
# Wrapper for a list of story pages

def extract_texts_from_url_list(URL_list):
    docs = []
    for URL in URL_list:
        docs.append(extract_text_from_url(URL))
    
    print("Done")
    
    return docs

In [9]:
# Function to print doc

def print_doc(doc):
    for field in doc.keys():
        if isinstance(doc[field], list):
            print(field + ": " + "".join([s + "\n" + "\t" for s in doc[field]]) + "\n")
        else:
            print(field + ": " + doc[field] + "\n")
        
    return

In [10]:
# Extract links from main page

hrefs = add_extra_links(extract_hrefs_from_url(URL), [HREFS_PAGE_2, HREFS_PAGE_3], URL)

Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-52-aus-stuttgart/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/lasse-36-aus-braunschweig/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-39-aus-dem-muensterland/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/hans-juergen-61-aus-koeln/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/sigmar-und-gabriele-beide-79-berlin/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/eren-18-aus-meerbusch/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/revan-29-aus-duesseldorf/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/carsten-53-aus-buckenhof-in-bayern/
Done
Added extra link: https://www.

In [11]:
# Extract texts from links

docs = extract_texts_from_url_list(hrefs)

Created saving directory at ../zusammengegencorona.de/scraped/
Saved web page to ../zusammengegencorona.de/scraped/siegfried-94-aus-lueneburg.html
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg/
Saved web page to ../zusammengegencorona.de/scraped/daniela-52-aus-stuttgart.html
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-52-aus-stuttgart/
Saved web page to ../zusammengegencorona.de/scraped/lasse-36-aus-braunschweig.html
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/lasse-36-aus-braunschweig/
Saved web page to ../zusammengegencorona.de/scraped/daniela-39-aus-dem-muensterland.html
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-39-aus-dem-muensterland/
Saved web page to ../zusammengegencorona.de/scraped/hans-juergen-61-aus-koeln.html
Extracted text from: https://www.zusammengegencorona.de/informieren/ic

In [12]:
# Show example doc

print_doc(docs[0])

link: https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg/

title: Ich hatte Corona

author: Siegfried, 94, aus Lüneburg

headline: Siegfried, 94, aus Lüneburg

intro_text: Im November bricht in seinem Altenheim das Coronavirus aus. Das Haus wird abgeriegelt, die Bewohner müssen in ihren Zimmern bleiben. Den Rentner erwischt es trotzdem.

quote_text: „Ich nehme es hin, wie es ist“
	„Wir haben hier alle schon eine Menge durchgestanden. Das hilft jetzt”
	„Der Wald ist 30 Schritte entfernt – und doch zurzeit unerreichbar“
	

story_text: Wie es mir heute geht? Dreiviertel gut würde ich sagen. Bloß: Dass ich nicht mehr ganz fit bin, Wehwehchen, auch einmal Atemnot habe, ist wohl ganz normal mit 94 Jahren. Unmöglich zu sagen, ob das nun an meiner Corona-Infektion liegt oder nicht. Ich nehme es hin, wie es ist. Diese Haltung hat mich immer gut durchs Leben getragen – und durchs vergangene Jahr.
	Corona hat mich von Anfang an interessiert: Dass ein winziges