# Scrape Stories from Zusammen Gegen Corona

Scrape German-language Corona stories from Zusammen Gegen Corona ("Together Against Corona"), a website by the German Ministry of Health: https://www.zusammengegencorona.de/informieren/ichhattecorona/.

The stories contain highlighted quotes which are stored separately from the main text.

The stories are stored in dictionaries with the following fields:

- `link`: The link of the story page (string)
- `title`: The title of the story page (string; is the same for all stories)
- `author`: The author of the story (string)
- `headline`: The title of the story (string; same as author)
- `intro_text`: The introductory text giving some background info on the story (string)
- `quote_text`: The highlighted quotes from the story (list of strings)
- `story_text`: The main text passages between the quotes (list of strings)
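
The field list above can be written down as a typed structure. The following is only a sketch: the notebook itself uses plain dictionaries, and the names `StoryDoc`, `REQUIRED_FIELDS`, and `missing_fields`, as well as all values in `example_doc`, are hypothetical placeholders rather than scraped data.

```python
from typing import List, TypedDict


class StoryDoc(TypedDict):
    """Hypothetical type describing one scraped story."""
    link: str
    title: str
    author: str
    headline: str
    intro_text: str
    quote_text: List[str]
    story_text: List[str]


# Placeholder values for illustration only, not scraped data
example_doc: StoryDoc = {
    "link": "https://example.org/story/",
    "title": "Page title",
    "author": "Name, age, city",
    "headline": "Name, age, city",
    "intro_text": "Short background on the story.",
    "quote_text": ["First highlighted quote"],
    "story_text": ["First paragraph.", "Second paragraph."],
}

# Field names taken from the type definition above
REQUIRED_FIELDS = set(StoryDoc.__annotations__)


def missing_fields(doc: dict) -> set:
    """Return the names of expected fields that are absent from doc."""
    return REQUIRED_FIELDS - set(doc)
```

A check such as `missing_fields(doc)` could be run on each scraped dictionary to spot pages whose layout differs from the expected one.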

In [1]:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from string import capwords

In [2]:
# Set URL of the main page containing the links to the story pages

URL = "https://www.zusammengegencorona.de/informieren/ichhattecorona"

# PAGES = ["", "page=2", "page=3"]  # Query strings for looping over the subsequent pages of the main page

# Fetch page 3 of the main page
main_page = BeautifulSoup(
    urlopen(Request(URL + "/?page=3", headers={"User-Agent": "Mozilla/5.0"})).read(),
    "html.parser",
)

In [3]:
# Function to extract links to the story pages from main page

def extract_hrefs_from_url(URL, PAGES):
    hrefs = []
    
    for PAGE in PAGES:
        main_page = BeautifulSoup(urlopen(Request(URL + "/?" + PAGE, headers = {"User-Agent": "Mozilla/5.0"})).read(), "html.parser")

        for instance in main_page.find_all("a", attrs = {"class": "o-button link o-button--tertiary o-button--reverse"}):
            new_link = instance.get("href")
            if new_link is None:
                # Skip anchors without an href attribute
                continue
            hrefs.append("https://www.zusammengegencorona.de" + new_link)
            print("Extracted link: {}".format(hrefs[-1]))
    
    print("Done")
    
    return hrefs

In [4]:
# Function to extract text from story pages

def extract_text_from_url(URL):
    new_req = Request(URL, headers = {"User-Agent": "Mozilla/5.0"})
    new_page = BeautifulSoup(urlopen(new_req).read(), "html.parser")
    new_title = new_page.find("a", attrs = {"class": "o-breadcrumbs__link", "href": "/informieren/ichhattecorona/"})
    new_headline = new_page.find("h1", attrs = {"class": "o-headline o-headline--1 o-intro__headline"})
    new_intro_text = new_page.find("div", attrs = {"class": "o-copy o-intro__copy o-copy--intro o-copy--html"})
    # Skip headings whose .string is None (e.g. headings with nested tags)
    new_quote_text = [text.string for text in new_page.find_all("h2", attrs = {"class": "o-headline o-headline--2"}) if text.string is not None]
    new_inline_text = new_page.find_all("p", attrs = {"class": "o-copy o-copy--article"})
    new_stripped_text = [text.string for text in new_inline_text]
    new_story_text = [text.replace("\xa0", "") for text in new_stripped_text if text is not None]
    new_doc = {
        "link": URL,
        "title": new_title.string if new_title is not None else "",
        "author": new_headline.string if new_headline is not None else "",
        "headline": new_headline.string if new_headline is not None else "",
        "intro_text": new_intro_text.string if new_intro_text is not None else "",
        "quote_text": new_quote_text,
        "story_text": new_story_text
    }
    print("Extracted text from: {}".format(URL))
    
    return new_doc

In [5]:
# Wrapper to extract text from a list of story pages

def extract_texts_from_url_list(URL_list):
    docs = []
    for URL in URL_list:
        docs.append(extract_text_from_url(URL))
    
    print("Done")
    
    return docs

In [6]:
# Function to print doc

def print_doc(doc):
    for field, value in doc.items():
        if isinstance(value, list):
            # Show only the first list entry; an empty list prints as ""
            value = value[0] if value else ""
        print(field + ": " + value + "\n")

In [7]:
# Extract links from main page

hrefs = extract_hrefs_from_url(URL, [""])

Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-52-aus-stuttgart/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/lasse-36-aus-braunschweig/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-39-aus-dem-muensterland/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/hans-juergen-61-aus-koeln/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/sigmar-und-gabriele-beide-79-berlin/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/eren-18-aus-meerbusch/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/revan-29-aus-duesseldorf/
Extracted link: https://www.zusammengegencorona.de/informieren/ichhattecorona/carsten-53-aus-buckenhof-in-bayern/
Done


In [8]:
# Extract texts from links

docs = extract_texts_from_url_list(hrefs)

Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-52-aus-stuttgart/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/lasse-36-aus-braunschweig/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/daniela-39-aus-dem-muensterland/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/hans-juergen-61-aus-koeln/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/sigmar-und-gabriele-beide-79-berlin/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/eren-18-aus-meerbusch/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/revan-29-aus-duesseldorf/
Extracted text from: https://www.zusammengegencorona.de/informieren/ichhattecorona/carsten-53-aus-buckenhof-i

In [9]:
# Show example doc

print_doc(docs[0])

link: https://www.zusammengegencorona.de/informieren/ichhattecorona/siegfried-94-aus-lueneburg/

title: Ich hatte Corona

author: Siegfried, 94, aus Lüneburg

headline: Siegfried, 94, aus Lüneburg

intro_text: Im November bricht in seinem Altenheim das Coronavirus aus. Das Haus wird abgeriegelt, die Bewohner müssen in ihren Zimmern bleiben. Den Rentner erwischt es trotzdem.

quote_text: „Ich nehme es hin, wie es ist“

story_text: Wie es mir heute geht? Dreiviertel gut würde ich sagen. Bloß: Dass ich nicht mehr ganz fit bin, Wehwehchen, auch einmal Atemnot habe, ist wohl ganz normal mit 94 Jahren. Unmöglich zu sagen, ob das nun an meiner Corona-Infektion liegt oder nicht. Ich nehme es hin, wie es ist. Diese Haltung hat mich immer gut durchs Leben getragen – und durchs vergangene Jahr.



## Missing

The links on pages 2 and 3 cannot be extracted yet: the HTML returned for these pages seems to contain the same links as page 1.
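
One way to check this observation is to compare the links returned for each page and keep only those not seen on an earlier page. The helper below is a minimal sketch, not part of the notebook: the name `new_links_per_page` is hypothetical, and the sample input uses made-up paths. In practice the input could be built by calling `extract_hrefs_from_url(URL, [page])` once per entry of `PAGES`.

```python
from typing import Dict, List


def new_links_per_page(links_by_page: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For each page, in insertion order, keep only links not seen before."""
    seen = set()
    fresh: Dict[str, List[str]] = {}
    for page, links in links_by_page.items():
        fresh[page] = []
        for link in links:
            if link not in seen:
                seen.add(link)
                fresh[page].append(link)
    return fresh


# Made-up sample input: page 2 repeats page 1, page 3 adds one new link
sample = {
    "": ["/a/", "/b/"],
    "page=2": ["/a/", "/b/"],
    "page=3": ["/a/", "/c/"],
}
result = new_links_per_page(sample)
```

A page that maps to an empty list contributed nothing new, which would match the behaviour described above for pages 2 and 3.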