# Scrape Stories from HSPV NRW

Scrape German-language Corona stories from HSPV NRW (University of Applied Sciences for the Police and Public Administration Nordrhein-Westfalen, a German state): https://www.hspv.nrw.de/services/corona-krise/corona-geschichten.

The stories are written by staff and students of the university.

The stories are stored in dictionaries with the following fields:

- `link`: The link of the story page (string)
- `title`: The title of the story page (string; is the same for all stories)
- `author`: The author of the story (string)
- `date`: The date when the story was published (string)
- `description`: The introductory text giving some background info on the story (string)
- `story_text`: The main text of the story (string)
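
As a rough illustration of the record layout listed above, the fields could be typed as in the sketch below. The names `StoryDoc` and `is_story_doc` are hypothetical and do not appear in the notebook, which uses plain dictionaries.

```python
from typing import TypedDict


class StoryDoc(TypedDict):
    """Hypothetical typed view of one scraped story record."""
    link: str
    title: str
    author: str
    date: str
    description: str
    story_text: str


def is_story_doc(doc: dict) -> bool:
    """Return True if doc has exactly the six expected fields, all strings."""
    expected = set(StoryDoc.__annotations__)
    return set(doc) == expected and all(isinstance(v, str) for v in doc.values())
```

A check like this could be run on each dictionary after scraping to catch pages where a field was not found.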

The web pages containing the stories are stored as `.html` files in the folder `DATA_DIR`. The notebook creates this folder if it does not exist.
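
A minimal sketch of how a story URL can map to its cached `.html` file, assuming the story id is the part of the URL after the last hyphen. The helper name `cache_path` is hypothetical; the notebook builds the path inline with string concatenation.

```python
import os


def cache_path(url: str, data_dir: str) -> str:
    """Build the local file path for a story page from its URL."""
    # Assumption: the story id is everything after the last "-" in the URL
    story_id = url.rsplit("-", 1)[-1]
    return os.path.join(data_dir, story_id + ".html")


path = cache_path(
    "https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-01",
    "../hspvnrw.de/scraped/",
)
print(path)
```

Using `os.path.join` instead of concatenation means the result does not depend on whether `data_dir` ends with a separator.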

In [1]:
""" Scrape stories from HSPV NRW """

import os
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

In [2]:
# Set URL of the main page containing the links to the story pages

URL = "https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten"

# Define story page ids

ID_STORIES = ["01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12", "13"]

# Set directory for storing web pages

DATA_DIR = "../hspvnrw.de/scraped/"

In [3]:
def check_dir(data_dir):
    """ Check if directory for saving web page exists """
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
        print(f"Created saving directory at {data_dir}")

In [4]:
def load_web_page(url, file_name, data_dir):
    """ Check if web page can be loaded from disk;
    otherwise fetch website and save as .html to disk """
    check_dir(data_dir)
    if os.path.exists(file_name):
        with open(file_name, "r", encoding="utf-8") as file:
            page = file.read()
        print(f"Loaded web page from {file_name}")
    else:
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req) as response:
            # Decode once so that both branches return a str
            page = response.read().decode("utf-8")
        with open(file_name, "w", encoding="utf-8") as file:
            file.write(page)
        print(f"Saved web page to {file_name}")
    return page

In [5]:
def extract_text_from_url(url):
    """ Extract text from story pages """
    new_file_name = DATA_DIR + url.split("-")[-1] + ".html"
    new_page = BeautifulSoup(load_web_page(
        url, new_file_name, DATA_DIR), "html.parser")
    new_title = new_page.title.string
    new_author = new_page.find("meta", property="og:author")["content"]
    new_date = new_page.find("time", attrs={"itemprop": "date"}).string
    new_description = new_page.find("meta", property="og:description")["content"]
    new_story_text = new_page.find("div", attrs={"class": "ce-textpic ce-h-left ce-v-above"}).text
    
    new_doc = {
        "link": url,
        "title": new_title.split("#WirmeisterndieKrise ")[-1].split(" | HSPV NRW")[0],
        "author": new_author,
        "date": new_date,
        "description": new_description,
        "story_text": new_story_text
    }
    print(f"Extracted text from: {url}")

    return new_doc

In [6]:
def extract_texts_from_url_ids(url, ids):
    """ Wrapper for a list of story pages """
    docs = []
    # Use story_id rather than id to avoid shadowing the built-in
    for story_id in ids:
        docs.append(extract_text_from_url(f"{url}-{story_id}"))

    print("Done")

    return docs

In [7]:
def print_doc(doc):
    """ Print doc """
    for field, value in doc.items():
        print(field + ": " + value + "\n")

In [8]:
# Extract texts from links

docs = extract_texts_from_url_ids(URL, ID_STORIES)

Loaded web page from ../hspvnrw.de/scraped/01.html
Extracted text from: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-01
Loaded web page from ../hspvnrw.de/scraped/02.html
Extracted text from: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-02
Loaded web page from ../hspvnrw.de/scraped/03.html
Extracted text from: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-03
Loaded web page from ../hspvnrw.de/scraped/04.html
Extracted text from: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-04
Loaded web page from ../hspvnrw.de/scraped/05.html
Extracted text from: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-05
Loaded web page from ../hspvnrw.de/scraped/06.html
Extracted text from: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-06
Loaded web page from ../hspvnrw.de/scraped/07.html
Extracted text from: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-07
Loaded web page from ../hspvnrw.de

In [9]:
# Show example doc

print_doc(docs[0])

link: https://www.hspv.nrw.de/nachrichten/artikel/corona-geschichten-01

title: Ich freue mich, wenn wir uns wiedersehen!

author: Prof. Dr. Dorothee Dienstbühl

date: 29. Januar 2021

description: Prof. Dr. Dorothee Dienstbühl, hauptamtlich Lehrende am Studienort Mülheim an der Ruhr, berichtet, wie sie ihren Alltag während der Corona-Krise meistert und den Wechsel zur Online-Lehre erlebt hat.

story_text: Diese Zeit ist wirklich verrückt. Noch vor einem Jahr war für mich nicht absehbar, was eine Pandemie bedeutet und wie sehr sie in unser aller Leben eingreift. Der erste Lockdown und die Umstellung auf Online-Lehre fielen mir persönlich sehr schwer, denn ich gehöre nicht zu den besonders technikversierten Menschen, absolut nicht. Entsprechend besorgt war ich, ob ich online eine vernünftige Lehre gestalten kann. Irgendwie haben wir, die Studierenden und ich, uns gemeinsam durch diese neue Erfahrung gearbeitet.Im Sommer wieder in Präsenz zu prüfen und im September mit Präsenzlehre zu st