# Scrape Stories from Stadt Frankfurt

Scrape German Corona stories from the website of the city of Frankfurt (Stadt Frankfurt am Main): https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten.

Each story is an interview made up of questions and answers. Most of the narrative is in the answers, but both questions and answers are scraped and stored for completeness.

The stories are stored in dictionaries with the following fields:

- `link`: The link of the story page (string)
- `title`: The title of the story page on the main page (string)
- `headline`: The title of the story by the author (string)
- `subline`: The subtitle of the story describing the interviewee (string)
- `intro_text`: The introductory text giving some background info on the interviewee and story (string)
- `question_text`: The interview questions (list of strings)
- `answer_text`: The interview answers (list of strings)
- `interviewee`: The name of the interviewee (string)
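
The field list above can also be written down as a type, which makes the expected shape of each document explicit. This is only a sketch: the name `StoryDoc` is hypothetical, and the values in `example_doc` are placeholders rather than scraped content.

```python
from typing import List, TypedDict


class StoryDoc(TypedDict):
    """Hypothetical type for one scraped story (the name is illustrative)."""
    link: str
    title: str
    headline: str
    subline: str
    intro_text: str
    question_text: List[str]
    answer_text: List[str]
    interviewee: str


# Placeholder values only; nothing here is real scraped content.
example_doc: StoryDoc = {
    "link": "https://example.org/story",
    "title": "Placeholder title",
    "headline": "Placeholder headline",
    "subline": "Placeholder subline",
    "intro_text": "",
    "question_text": ["Placeholder question?"],
    "answer_text": ["Placeholder answer."],
    "interviewee": "Jane Doe",
}

print(sorted(example_doc))
```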

In [52]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
from string import capwords

In [2]:
# Set URL of the main page containing the links to the story pages

URL = "https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten"

In [3]:
# Function to extract links to the story pages from main page

def extract_hrefs_from_url(URL):
    main_page = BeautifulSoup(urlopen(URL).read(), "html.parser")
    
    hrefs = []

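    # Each story tile on the main page is an <a> tag with these classes
    for instance in main_page.find_all("a", attrs = {"class": "contentTile noprint"}):
        new_link = instance.get("href")
        if new_link is None:
            continue
        # Drop the explicit ":443" port if present (str.index would raise ValueError if it were missing)
        hrefs.append(new_link.replace(":443", "", 1))
        print("Extracted link: {}".format(hrefs[-1]))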
    
    print("Done")
    
    return hrefs
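
The link extraction above removes the `:443` port by string manipulation. As an alternative sketch, the port can be removed by parsing the URL with the standard library. The helper name `strip_default_port` and the sample URLs are assumptions for illustration; the function is not part of the notebook. Note that `hostname` is lower-cased and drops any user info, which should not matter for links of this kind.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_default_port(url):
    """Return url without an explicit ':443' port; other URLs are unchanged."""
    parts = urlsplit(url)
    if parts.port == 443 and parts.hostname:
        # Rebuild the network location from the host name alone
        parts = parts._replace(netloc=parts.hostname)
    return urlunsplit(parts)


# Hypothetical sample links, modelled on the link format the code above handles
print(strip_default_port("https://frankfurt.de:443/service-und-rathaus/presse"))
print(strip_default_port("https://frankfurt.de/service-und-rathaus/presse"))
```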

In [53]:
# Function to extract text from story pages

def extract_text_from_url(URL):
    new_page = BeautifulSoup(urlopen(URL).read(), "html.parser")
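    # Page title up to the " |" separator (the whole title if the separator is missing)
    page_title = new_page.title.string if new_page.title is not None and new_page.title.string else ""
    new_title = page_title.split(" |")[0]
    new_headline = new_page.find("h4", attrs = {"class": "_headline"})
    new_subline = new_page.find("p", attrs = {"class": "_subline"})
    new_intro_text = new_page.find("p", attrs = {"class": "_introTxt"})
    new_inline_text = new_page.find("span", attrs = {"class": "_inlineTxt"})
    # Flatten the interview body into a list of whitespace-stripped strings
    new_stripped_text = list(new_inline_text.stripped_strings) if new_inline_text is not None else []
    # An all-uppercase string marks the interviewee's name:
    # the string before it is a question, the string after it is an answer
    new_question_text = [new_stripped_text[i-1] for i in range(1, len(new_stripped_text)) if new_stripped_text[i].isupper()]
    new_answer_text = [new_stripped_text[i] for i in range(1, len(new_stripped_text)) if new_stripped_text[i-1].isupper()]
    # Drop the trailing character of the name marker and normalise the capitalisation
    new_interviewee = [capwords(text[:-1]) for text in new_stripped_text if text.isupper()]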
    new_doc = {
        "link": URL,
        "title": new_title,
        "headline": new_headline.get_text() if new_headline is not None else "",
        "subline": new_subline.get_text() if new_subline is not None else "",
        "intro_text": new_intro_text.get_text() if new_intro_text is not None else "",
        "question_text": new_question_text,
        "answer_text": new_answer_text,
        "interviewee": new_interviewee[0] if len(new_interviewee) > 0 else ""
    }
    print("Extracted text from: {}".format(URL))
    
    return new_doc
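
The question/answer split in the function above depends on one convention: an all-uppercase string marks the interviewee's name, the string before it is a question, and the string after it is an answer. The sketch below replays that logic on a made-up list of strings. The helper name `split_interview` and the sample data are hypothetical, and the trailing colon on the name marker is an assumption about the page markup.

```python
from string import capwords


def split_interview(strings):
    """Split a flat list of strings into (questions, answers, speakers)."""
    questions, answers, speakers = [], [], []
    for i in range(1, len(strings)):
        if strings[i].isupper():
            # The string right before a name marker is the question
            questions.append(strings[i - 1])
            # Drop the trailing character and normalise the capitalisation
            speakers.append(capwords(strings[i][:-1]))
        if strings[i - 1].isupper():
            # The string right after a name marker is the answer
            answers.append(strings[i])
    return questions, answers, speakers


# Made-up sample; the name marker is assumed to end with a colon
sample = [
    "How was the past year?",
    "JANE DOE:",
    "It was challenging.",
    "What comes next?",
    "JANE DOE:",
    "We keep going.",
]

questions, answers, speakers = split_interview(sample)
print(questions)
print(answers)
print(speakers[0])
```

A string that happens to be all uppercase for another reason, such as an acronym on its own line, would be treated as a name marker by this rule, so the output is worth spot-checking.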

In [54]:
# Wrapper to extract texts for a list of story pages

def extract_texts_from_url_list(URL_list):
    docs = []
    for URL in URL_list:
        docs.append(extract_text_from_url(URL))
    
    print("Done")
    
    return docs

In [55]:
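# Function to print a doc (only the first entry of each list field is shown)

def print_doc(doc):
    for field, value in doc.items():
        if isinstance(value, list):
            # Preview the first entry; an empty list prints as an empty string
            print(field + ": " + (value[0] if value else "") + "\n")
        else:
            print(field + ": " + value + "\n")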

In [56]:
# Extract links from main page

hrefs = extract_hrefs_from_url(URL)

Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/die-vhs-mitarbeiterin
Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/die-kfz-zulassungsstelle
Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/das-buergeramt
Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/abteilungsleiterin-strassenverkehrsamt
Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/der-pfarrer
Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/das-amtsblatt-team
Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/katja-hartung
Extracted link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/magnus-welkerling
Extracted li

In [57]:
# Extract texts from links

docs = extract_texts_from_url_list(hrefs)

Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/die-vhs-mitarbeiterin
Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/die-kfz-zulassungsstelle
Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/das-buergeramt
Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/abteilungsleiterin-strassenverkehrsamt
Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/der-pfarrer
Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/das-amtsblatt-team
Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/katja-hartung
Extracted text from: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-ge

In [58]:
# Show example doc

print_doc(docs[1])

link: https://frankfurt.de/service-und-rathaus/presse/texte-und-kampagnen/corona-geschichten/die-kfz-zulassungsstelle

title: Die KFZ-Zulassungsstelle

headline: „Wir mussten uns immer wieder neu erfinden!“

subline: In der neuesten Frankfurter Corona-Geschichte berichtet Kai Günther, wie Corona die Arbeit der Kfz-Zulassungsstelle verändert hat.

intro_text: 

question_text: Herr Günther, lassen Sie uns auf die vergangenen zwölf Monate
zurückblicken. Wie haben Sie diese erlebt?

answer_text: Am 17. März 2020 kam der Shutdown. Das
hieß, wir hatten komplett geschlossen und lediglich einen Notbetrieb für
Privatkunden aufrechterhalten. Wem etwa sein Kennzeichen gestohlen wurde oder
Fahrzeugpapiere verloren hatte, konnte sich auch in dieser Zeit an uns wenden.
Klar war auch, dass systemrelevante Organisationen wie Rettungsdienste oder
Katastrophenschutz ihre Fahrzeuge trotz Lockdown zulassen müssen. Ab Mitte
April hatten wir angefangen, über Terminvergabe per E-Mail oder Telefon in
kleinen 