 # Scraping _derstandard.at_

 **TODO:** talk about some legal stuff with scraping

 First we define the function `get_standard_soup()`, which sends a request to derstandard, along with a cookie that derstandard checks if a user has accepted their DSGVO notice. If this cookie is not sent, a banner is displayed and the html is only partially loaded.

 Secondly, we define `get_frontpage_articles()`, which expects a bs4 soup object of a frontpage. While derstandard no longer offers an archive, the frontpage articles of a given day can be conveniently accessed with the pattern `frontpage/y/m/d`. Each frontpage contains article sections which contain the (sub)heading, lead, number of comments, storylabels etc.

 We will start by pulling the frontpage of december 20th 2023.


In [1]:
from bs4 import BeautifulSoup
import requests

# fetch the html content of a derstandard.at page
def get_standard_soup(link):
    response = requests.get(link, cookies={'DSGVO_ZUSAGE_V1': 'true'})
    return BeautifulSoup(response.content, 'html.parser')

# generate a dictionary of articles with title as key and the bs4 element as value
def get_frontpage_articles(soup):
    articles_dict = {}
    articles = soup.select('div.chronological>section article')
    for article in articles:
        title_tag = article.find('a')
        if title_tag and title_tag.has_attr('title'):
            title = title_tag['title']
            articles_dict[title] = article
    return articles_dict

# Generate the articles dictionary for an arbitrary frontpage
soup = get_standard_soup('https://www.derstandard.at/frontpage/2023/12/22')
articles_dict = get_frontpage_articles(soup)

print(f'We have fetched {len(articles_dict)} articles\n')


We have fetched 137 articles



 In the next step, lets look at the information we can get from those article sections on the frontpage. By inspecting the html, we have already identified various elements that we will use in the subsequent steps:
 * title
 * subtitle
 * article type
 * link
 * datetime
 * kicker (like an additional tag, not 100% about its meaning yet)
 * postingcount
 * storylabels

 while playing with the data, we noticed that not every article contains storylabels. We will check this in the following step, as well as if every article tag has a type.

In [2]:
# Function to analyze attributes of specified tags and their attributes
def analyze_tag_attributes(articles_dict):
    no_data_type = set()
    no_story_label = set()

    for title, article in articles_dict.items():
        # Check if every article tag has a data-type attribute - basically the type of the article
        if not article.has_attr('data-type'):
            no_data_type.add(title)
        # search for <div class="storylabels"> in articles - the story labels
        if not article.find('div', class_='storylabels'):
            no_story_label.add(title)

    return no_data_type, no_story_label

no_data_type, no_story_label = analyze_tag_attributes(articles_dict)
print(f'Number of articles without data-type attribute: {len(no_data_type)}')
print(f'Number of articles without storylabels: {len(no_story_label)}')
# get articles that have a story label
has_label = set(articles_dict.keys()).difference(no_story_label)
print(f'Number of articles with story attribute: {len(has_label)}')

# a lot of articles do not have a story label, maybe an interesting goal for machine learning



Number of articles without data-type attribute: 0
Number of articles without storylabels: 104
Number of articles with story attribute: 33


 All articles have a data-type, but only a few articles have story attributes. This could be an interesting labeling task for our machine learning project later.
 Next, we print out the html of two articles to show the data we are interested in.

In [3]:
# example of an article without story label
print(f'No storylabel:\n{articles_dict[list(no_story_label)[0]]}\n')
print(80*'-')

# articles with story label 
print(f'With storylabel:\n{articles_dict[list(has_label)[0]]}')


No storylabel:
<article class="fig" data-dg="p1-43" data-dt="7x2" data-mg="p1-43" data-mt="4x4" data-type="story">
<div class="teaser-inner">
<a href="/story/3000000200853/trump-uebte-laut-medienbericht-druck-auf-wahlpruefer-in-michigan-aus" title="Trump übte laut Medienbericht Druck auf Wahlprüfer in Michigan aus">
<figure data-type="image">
<picture>
<source data-lazy-srcset="https://i.ds.at/V3WkIw/c:1200:800:fp:0.500:0.500/rs:fill:280:187/g:fp:0.54:0.29/plain/lido-images/2023/12/22/3f9fca60-5573-4475-8e22-0cd35d00484b.jpeg" media="(min-width: 960px)"/>
<source data-lazy-srcset="https://i.ds.at/KnrHlA/c:1200:800:fp:0.500:0.500/rs:fill:750:375/g:fp:0.54:0.29/plain/lido-images/2023/12/22/3f9fca60-5573-4475-8e22-0cd35d00484b.jpeg" media="(max-width: 959px)"/>
<img alt="Election_2024-President-New_Mexico_69156" data-lazy-src="https://i.ds.at/WDo_zA/rs:fill:600:400/plain/lido-images/2023/12/22/3f9fca60-5573-4475-8e22-0cd35d00484b.jpeg" referrerpolicy="unsafe-url"/>
</picture>
</figure>
<h

 This is the information one can get from the frontpage. Later we will also follow the links and scrape additional data from the articles, but lets first focus on getting a dataset just based on the information that can be attained from the frontpage.

 To this end, we define `extract_article_data()`, which uses the dictionary following the pattern `article_title: article_section_soup`. From the html (soup), we will extract and clean:

 * title
 * teaser-subtitle
 * link
 * time
 * teaser-kicker
 * n_posts
 * storylabels

In [7]:
# Function to extract specific data from each article
def extract_article_data(articles_dict):
    HOST = 'https://www.derstandard.at'
    article_data = []
    
    for title, article in articles_dict.items():
        data = {
            'title': title,
            'teaser-subtitle': None,
            'link': None,
            'time': None,
            'teaser-kicker': None,
            'n_posts': None,
            'storylabels': None
        }

        # most links are relative, so we need to add the host
        link = article.find('a')['href']
        if not link.startswith(HOST):
            link = HOST + link
        data['link'] = link
        
        # for live articles, there is a second time tag with the duration of the live post
        # however, we only care about the time of publication here
        time = [tag for tag in article.find_all('time') if 'datetime' in tag.attrs][0]
        data['time'] = time['datetime'].rstrip('\r\n')

        # if there are no comments, the string is empty so set it to 0
        n_posts = article.find('div', 'teaser-postingcount')
        try: data['n_posts'] = int(n_posts.rstrip('Posting').get_text(strip=True).replace('.', ''))
        except: data['n_posts'] = 0
        
        # Extracting other specified tags
        for tag, class_name in [('p', 'teaser-kicker'), 
                                ('p', 'teaser-subtitle'), 
                                ('div', 'storylabels')]:
            found_tag = article.find(tag, class_=class_name)
            if found_tag:
                data[class_name] = found_tag.get_text(strip=True)

        article_data.append(data)

    return article_data

article_data = extract_article_data(articles_dict)
# last 5 articles, of which some have a story label
article_data[-5:]


[{'title': 'Stadtforscher: "Architektur ist Teil unserer Wegwerfgesellschaft geworden"',
  'teaser-subtitle': 'Jetzt anhören: In Zukunft müssen Städte wieder dichter und dauerhafter werden, sagt der renommierte italienische Architekt undStadtwissenschafterVittorio Lampugnani',
  'link': 'https://www.derstandard.at/story/3000000200499/stadtforscher-architektur-ist-teil-unserer-wegwerfgesellschaft-geworden',
  'time': '2023-12-22T06:00',
  'teaser-kicker': 'Edition Zukunft',
  'n_posts': 0,
  'storylabels': 'Podcast'},
 {'title': 'Machen uns höhere Löhne weniger konkurrenzfähig?',
  'teaser-subtitle': 'Die Inflation in Österreich wird höher bleiben als in der Eurozone, eine Folge davon sind stärkereLohnsteigerungen.Aber wie sehr wird der Kostendruck der Industrie zusetzen? Ökonomen sind sich weniger einig, als es scheint',
  'link': 'https://www.derstandard.at/story/3000000200711/machen-uns-h246here-l246hne-weniger-konkurrenzf228hig',
  'time': '2023-12-22T06:00',
  'teaser-kicker': 'Wir

The various attributes will be analyzed once we convert our data to a dataframe.

But before we start scraping like mad, lets check robots.txt such that we can comply with derstandards scraping policies.

In [5]:
print(get_standard_soup('https://www.derstandard.at/robots.txt'))

User-agent: *

Disallow: /profil/

Sitemap: https://www.derstandard.at/sitemaps/news.xml
Sitemap: https://www.derstandard.at/sitemaps/sitemap.xml

Crawl-delay: 1


  return BeautifulSoup(response.content, 'html.parser')


`Craw-delay: 1`, so lets be nice and wait 1 second between requests. Then we'll scrape the frontpage of every day in 2023 until the 20th of december and save the data as a csv.

**Caution:** you might not want to run this cell, as it takes about ~13 minutes to run. The data has already been extracted once and has been saved to `data/derstandard_frontpage_data.csv`.

In [8]:
from datetime import datetime, timedelta
from time import sleep

def scrape_frontpage(start_date: str, end_date: str):
    # Validate that dates follow the pattern YYYY-MM-DD
    try:
        start = datetime.strptime(start_date, '%Y-%m-%d')
        end = datetime.strptime(end_date, '%Y-%m-%d')
    except ValueError:
        print("Invalid date format. Please use YYYY-MM-DD.")
        return

    data = []
    # all dates between start and end (inclusive)
    delta = end - start
    for i in range(delta.days + 1):
        # generate link for each day
        day = start + timedelta(days=i)
        date = day.strftime('%Y/%m/%d')
        link = f'https://www.derstandard.at/frontpage/{date}'
        # make a request to the link and extract the data
        article_dict = get_frontpage_articles(get_standard_soup(link))
        articles = extract_article_data(article_dict)
        print(f'Fetched {len(articles)} articles from {date}')
        data += articles
        # wait almost a second before next request, our data processing takes a bit of time as well
        sleep(0.8)
    
    return data

# scrape the data for an entire year
data = scrape_frontpage('2022-12-20', '2023-12-20')


Fetched 149 articles from 2022/12/20
Fetched 143 articles from 2022/12/21
Fetched 131 articles from 2022/12/22
Fetched 113 articles from 2022/12/23
Fetched 53 articles from 2022/12/24
Fetched 48 articles from 2022/12/25
Fetched 80 articles from 2022/12/26
Fetched 108 articles from 2022/12/27
Fetched 124 articles from 2022/12/28
Fetched 114 articles from 2022/12/29
Fetched 113 articles from 2022/12/30
Fetched 72 articles from 2022/12/31
Fetched 82 articles from 2023/01/01
Fetched 111 articles from 2023/01/02
Fetched 140 articles from 2023/01/03
Fetched 129 articles from 2023/01/04
Fetched 125 articles from 2023/01/05
Fetched 92 articles from 2023/01/06
Fetched 77 articles from 2023/01/07
Fetched 84 articles from 2023/01/08
Fetched 135 articles from 2023/01/09
Fetched 180 articles from 2023/01/10
Fetched 157 articles from 2023/01/11
Fetched 153 articles from 2023/01/12
Fetched 153 articles from 2023/01/13
Fetched 92 articles from 2023/01/14
Fetched 91 articles from 2023/01/15
Fetched 140

Okay, this cell took a while to run obviously, but we finally have our precious data. Lets convert it to a dataframe and see what we have.

In [9]:
import pandas as pd

df = pd.DataFrame(data)
df.columns = df.columns.str.replace('teaser-', '')
df


Unnamed: 0,title,subtitle,link,time,kicker,n_posts,storylabels
0,Putin räumt in Videobotschaft Probleme in der ...,Die USA verlangen ein stärkeres Vorgehen gegen...,https://www.derstandard.at/jetzt/livebericht/2...,2022-12-20T23:25,Krieg in der Ukraine,0,NachleseLivebericht
1,EU-Kommission genehmigt Milliardenhilfen für d...,Der Konzern muss sich bis Ende 2026 unter ande...,https://www.derstandard.at/story/2000141983415...,2022-12-20T22:52,Gasimporteur,0,
2,Stefan Bachmann wird laut ORF neuer Burgtheate...,Der Schweizer ist derzeit Intendant des Schaus...,https://www.derstandard.at/story/2000141975610...,2022-12-20T22:26,Kušej-Nachfolge,0,
3,Weiter Streit um Androsch-Villa in Altaussee,Auch dasMauthausen-Komiteeschlägt die Errichtu...,https://www.derstandard.at/story/2000141982994...,2022-12-20T20:07,Panorama,0,
4,Eingeschränkter OP-Betrieb auf Urologie am AKH...,Pflegemangel an der Urologie verunmögliche Ver...,https://www.derstandard.at/story/2000141957008...,2022-12-20T20:02,Spitalsengpässe,0,
...,...,...,...,...,...,...,...
44696,Ärger über Touchscreens: Volkswagen baut nun w...,Der deutsche Autohersteller VW will künftig wi...,https://www.derstandard.at/story/3000000200347...,2023-12-20T06:00,"Einen Touch weniger, bitte",0,
44697,Warum es in Österreich kein Kopftuchverbot bei...,LautEU-Höchstgerichtist ein Kopftuchverbot im ...,https://www.derstandard.at/story/3000000200411...,2023-12-20T06:00,EU-Urteil,0,
44698,Werden Sie Guru!,In der besinnlichen Zeit lässt es sich gut übe...,https://www.derstandard.at/story/3000000200408...,2023-12-20T06:00,Renate Graber,0,Einserkastl
44699,"Magisch oder toxisch? Wir haben ""Tatsächlich.....",Mittlerweile hat sich sogar Regisseur Richard ...,https://www.derstandard.at/story/3000000200191...,2023-12-20T06:00,Wiedergesehen,0,


over 44 thousand rows, this should give us plenty data to analyze!

Lets save it to a csv, totalling 14.6MB of pure textual data

In [10]:
df.to_csv('data/derstandard_frontpage_data.csv', index=False)