 # Scraping _derstandard.at_

 **TODO:** talk about some legal stuff with scraping

 First we define the function `get_standard_soup()`, which sends a request to derstandard together with the cookie that derstandard uses to check whether a user has accepted its DSGVO (GDPR) notice. If this cookie is not sent, a banner is displayed and the HTML is only partially loaded.

 Second, we define `get_frontpage_articles()`, which expects a bs4 soup object of a frontpage. While derstandard no longer offers an archive, the frontpage articles of a given day can be conveniently accessed with the pattern `frontpage/y/m/d`. Each frontpage contains article sections, which hold the (sub)heading, lead, number of comments, storylabels, etc.

 We will start by pulling the frontpage of December 22nd, 2023.


In [17]:
from bs4 import BeautifulSoup
import requests

# fetch the html content of a derstandard.at page
def get_standard_soup(link):
    response = requests.get(link, cookies={'DSGVO_ZUSAGE_V1': 'true'})
    return BeautifulSoup(response.content, 'html.parser')

# generate a dictionary of articles with title as key and the bs4 element as value
def get_frontpage_articles(soup):
    articles_dict = {}
    articles = soup.select('div.chronological>section article')
    for article in articles:
        title_tag = article.find('a')
        if title_tag and title_tag.has_attr('title'):
            title = title_tag['title']
            articles_dict[title] = article
    return articles_dict

# Generate the articles dictionary for an arbitrary frontpage
soup = get_standard_soup('https://www.derstandard.at/frontpage/2023/12/22')
articles_dict = get_frontpage_articles(soup)

print(f'We have fetched {len(articles_dict)} articles\n')


We have fetched 137 articles
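
 To make the selector logic easier to follow, here is a minimal offline sketch that runs the same CSS selector and `title` check on a tiny hand-written snippet. The HTML in `SAMPLE_HTML` is invented for illustration and only loosely mimics the real frontpage markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the frontpage structure
SAMPLE_HTML = """
<div class="chronological">
  <section>
    <article data-type="story">
      <a href="/story/1/first" title="First headline"><h1>First headline</h1></a>
    </article>
    <article data-type="story">
      <a href="/story/2/second"><h1>Link without a title attribute</h1></a>
    </article>
  </section>
</div>
<aside>
  <article><a href="/x" title="Outside the chronological list">x</a></article>
</aside>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Same selector as in get_frontpage_articles(): only <article> tags nested in a
# <section> that is a direct child of <div class="chronological"> are matched
articles = soup.select('div.chronological>section article')

# Keep only articles whose first link carries a title attribute
titles = [
    article.find('a')['title']
    for article in articles
    if article.find('a') and article.find('a').has_attr('title')
]

print(len(articles), titles)
```

 Because the dictionary is keyed by title, two articles with an identical title would collapse into one entry, and articles whose first link has no `title` attribute are skipped.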



 In the next step, let's look at the information we can get from those article sections on the frontpage. By inspecting the HTML, we have already identified various elements that we will use in the subsequent steps:
 * title
 * subtitle
 * article type
 * link
 * datetime
 * kicker (like an additional tag, not 100% sure about its meaning yet)
 * postingcount
 * storylabels

 While playing with the data, we noticed that not every article contains storylabels. We will check this in the following step, as well as whether every article tag has a type.

In [18]:
# Function to analyze attributes of specified tags and their attributes
def analyze_tag_attributes(articles_dict):
    no_data_type = set()
    no_story_label = set()

    for title, article in articles_dict.items():
        # Check if every article tag has a data-type attribute - basically the type of the article
        if not article.has_attr('data-type'):
            no_data_type.add(title)
        # search for <div class="storylabels"> in articles - the story labels
        if not article.find('div', class_='storylabels'):
            no_story_label.add(title)

    return no_data_type, no_story_label

no_data_type, no_story_label = analyze_tag_attributes(articles_dict)
print(f'Number of articles without data-type attribute: {len(no_data_type)}')
print(f'Number of articles without storylabels: {len(no_story_label)}')
# get articles that have a story label
has_label = set(articles_dict.keys()).difference(no_story_label)
print(f'Number of articles with story attribute: {len(has_label)}')

# a lot of articles do not have a story label, maybe an interesting goal for machine learning



Number of articles without data-type attribute: 0
Number of articles without storylabels: 104
Number of articles with story attribute: 33


 All articles have a data-type, but only a minority (33 of 137) have storylabels. This could be an interesting labeling task for our machine learning project later.
 Next, we print out the HTML of two articles to show the data we are interested in.

In [19]:
# example of an article without story label
print(f'No storylabel:\n{articles_dict[list(no_story_label)[0]]}\n')
print(80*'-')

# articles with story label 
print(f'With storylabel:\n{articles_dict[list(has_label)[0]]}')


No storylabel:
<article class="fig" data-dg="p1-43" data-dt="7x2" data-mg="p1-43" data-mt="4x4" data-type="story">
<div class="teaser-inner">
<a href="/story/3000000200853/trump-uebte-laut-medienbericht-druck-auf-wahlpruefer-in-michigan-aus" title="Trump übte laut Medienbericht Druck auf Wahlprüfer in Michigan aus">
<figure data-type="image">
<picture>
<source data-lazy-srcset="https://i.ds.at/V3WkIw/c:1200:800:fp:0.500:0.500/rs:fill:280:187/g:fp:0.54:0.29/plain/lido-images/2023/12/22/3f9fca60-5573-4475-8e22-0cd35d00484b.jpeg" media="(min-width: 960px)"/>
<source data-lazy-srcset="https://i.ds.at/KnrHlA/c:1200:800:fp:0.500:0.500/rs:fill:750:375/g:fp:0.54:0.29/plain/lido-images/2023/12/22/3f9fca60-5573-4475-8e22-0cd35d00484b.jpeg" media="(max-width: 959px)"/>
<img alt="Election_2024-President-New_Mexico_69156" data-lazy-src="https://i.ds.at/WDo_zA/rs:fill:600:400/plain/lido-images/2023/12/22/3f9fca60-5573-4475-8e22-0cd35d00484b.jpeg" referrerpolicy="unsafe-url"/>
</picture>
</figure>
<h

 This is the information one can get from the frontpage. Later we will also follow the links and scrape additional data from the articles, but let's first focus on building a dataset based only on the information that can be obtained from the frontpage.

 To this end, we define `extract_article_data()`, which takes the dictionary following the pattern `article_title: article_section_soup`. From the HTML (soup), we will extract and clean:

 * title
 * teaser-subtitle
 * link
 * time
 * teaser-kicker
 * n_posts
 * storylabels

In [20]:
# Function to extract specific data from each article
def extract_article_data(articles_dict):
    HOST = 'https://www.derstandard.at'
    article_data = []
    
    for title, article in articles_dict.items():
        data = {
            'title': title,
            'teaser-subtitle': None,
            'link': None,
            'time': None,
            'teaser-kicker': None,
            'n_posts': None,
            'storylabels': None
        }

        # most links are relative, so we need to add the host
        link = article.find('a')['href']
        if not link.startswith(HOST):
            link = HOST + link
        data['link'] = link
        
        # for live articles, there is a second time tag with the duration of the live post
        # however, we only care about the time of publication here
        time = [tag for tag in article.find_all('time') if 'datetime' in tag.attrs][0]
        data['time'] = time['datetime'].rstrip('\r\n')

        # if there are no comments, the element is missing or its text is empty, so set it to 0
        n_posts = article.find('div', 'teaser-postingcount')
        try:
            data['n_posts'] = int(n_posts.get_text(strip=True).rstrip('Posting').replace('.', ''))
        except (AttributeError, ValueError):
            data['n_posts'] = 0
        
        # Extracting other specified tags
        for tag, class_name in [('p', 'teaser-kicker'), 
                                ('p', 'teaser-subtitle'), 
                                ('div', 'storylabels')]:
            found_tag = article.find(tag, class_=class_name)
            if found_tag:
                data[class_name] = found_tag.get_text(strip=True)

        article_data.append(data)

    return article_data

article_data = extract_article_data(articles_dict)
# last 5 articles, of which some have a story label
article_data[-5:]


[{'title': 'Jesus-Geburt unter Palmen',
  'teaser-subtitle': 'Auch der Koran erzählt über die Geburt von Jesus Christus. Nicht als Sohn Gottes, aber als Sohn Marias, die als Frau im Koran eine besondere Stellung einnimt',
  'link': 'https://www.derstandard.at/story/3000000200744/jesus-geburt-unter-palmen',
  'time': '2023-12-22T06:00',
  'teaser-kicker': 'Wussten Sie schon?',
  'n_posts': 1,
  'storylabels': None},
 {'title': '"Aquaman and the Lost Kingdom" scheitert an versuchter Schadensbegrenzung',
  'teaser-subtitle': 'Der Erfolg derSuperheldenfilmeleidet immer öfter unter den privaten Problemen ihrerHauptdarsteller.Der Fortsetzung von "Aquaman" droht auch deshalb ein Flop',
  'link': 'https://www.derstandard.at/story/3000000200724/aquaman-and-the-lost-kingdom-scheitert-an-versuchter-schadensbegrenzung',
  'time': '2023-12-22T06:00',
  'teaser-kicker': 'Im Kino',
  'n_posts': 242,
  'storylabels': None},
 {'title': 'One-Man-Show mit fatalen Folgen für die Demokratie in Serbien',
  

The various attributes will be analyzed once we convert our data to a dataframe.
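
Two of the cleaning steps above can also be written with the standard library alone. The following is a sketch, not a replacement for the cell above: `parse_posting_count()` and `absolute_link()` are hypothetical helper names, and the posting text format (for example `22.127Postings`) is assumed from the cleaning code, not confirmed against the live site.

```python
import re
from urllib.parse import urljoin

HOST = 'https://www.derstandard.at'


def parse_posting_count(text: str) -> int:
    """Return the posting count found in the teaser text, or 0 if there is none.

    Dots are treated as thousands separators (assumed format: '22.127Postings').
    """
    match = re.search(r'\d[\d.]*', text)
    if match is None:
        return 0
    return int(match.group().replace('.', ''))


def absolute_link(href: str) -> str:
    """Resolve a relative link against HOST; absolute URLs are returned unchanged."""
    return urljoin(HOST, href)


print(parse_posting_count('22.127Postings'))
print(parse_posting_count(''))
print(absolute_link('/story/3000000200744/jesus-geburt-unter-palmen'))
```

Unlike `str.rstrip('Posting')`, which strips any trailing run of the characters `P`, `o`, `s`, `t`, `i`, `n`, `g`, the regular expression extracts the number directly. `urljoin` also leaves links to other hosts untouched, whereas the `startswith(HOST)` check would prepend the host to them.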

But before we start scraping like mad, let's check robots.txt so that we can comply with derstandard's scraping policies.

In [21]:
print(get_standard_soup('https://www.derstandard.at/robots.txt'))

User-agent: *

Disallow: /profil/

Sitemap: https://www.derstandard.at/sitemaps/news.xml
Sitemap: https://www.derstandard.at/sitemaps/sitemap.xml

Crawl-delay: 1


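
These rules can also be checked programmatically with the standard library's `urllib.robotparser`. The sketch below parses a copy of the directives printed above, so it runs offline. The blank lines are left out on purpose, since the parser treats a blank line as the end of a rule group.

```python
from urllib.robotparser import RobotFileParser

# Directives copied from the robots.txt output above, without blank lines
ROBOTS_TXT = """\
User-agent: *
Disallow: /profil/
Sitemap: https://www.derstandard.at/sitemaps/news.xml
Sitemap: https://www.derstandard.at/sitemaps/sitemap.xml
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

frontpage = 'https://www.derstandard.at/frontpage/2023/12/22'
profile = 'https://www.derstandard.at/profil/'

print(parser.can_fetch('*', frontpage))  # frontpage URLs are not disallowed
print(parser.can_fetch('*', profile))    # /profil/ is disallowed
print(parser.crawl_delay('*'))           # delay in seconds requested by the site
```

To check the live file instead, `parser.set_url('https://www.derstandard.at/robots.txt')` followed by `parser.read()` fetches and parses it in one step.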


`Crawl-delay: 1`, so let's be nice and wait about 1 second between requests. Then we'll scrape the frontpage of every day from December 22nd, 2021 to December 22nd, 2023 and save the data as a CSV.

**Caution:** you might not want to run this cell, as it takes about 24 minutes to run. The data has already been extracted once and has been saved to `data/derstandard_frontpage_data.csv`.

In [22]:
from datetime import datetime, timedelta
from time import sleep
from tqdm import tqdm


def scrape_frontpage(start_date: str, end_date: str, logging=False):
    # Validate that dates follow the pattern YYYY-MM-DD
    try:
        start = datetime.strptime(start_date, '%Y-%m-%d')
        end = datetime.strptime(end_date, '%Y-%m-%d')
    except ValueError:
        print("Invalid date format. Please use YYYY-MM-DD.")
        return

    data = []
    # all dates between start and end (inclusive)
    delta = end - start
    for i in tqdm(range(delta.days + 1)):
        # generate link for each day
        day = start + timedelta(days=i)
        date = day.strftime('%Y/%m/%d')
        link = f'https://www.derstandard.at/frontpage/{date}'
        # make a request to the link and extract the data
        article_dict = get_frontpage_articles(get_standard_soup(link))
        articles = extract_article_data(article_dict)
        if logging:
            print(f'Fetched {len(articles)} articles from {date}')
        data += articles
        # wait almost a second before next request, our data processing takes a bit of time as well
        sleep(0.8)
    
    return data

# scrape the data for 2 years
data = scrape_frontpage('2021-12-22', '2023-12-22')


100%|██████████| 731/731 [24:13<00:00,  1.99s/it]
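
The loop above sleeps a fixed 0.8 seconds and relies on processing time to fill the rest of the crawl delay. One possible alternative is to measure the time since the previous request and sleep only for the remainder. The `Throttle` class and the `frontpage_urls()` generator below are hypothetical helpers written for this sketch, not part of the notebook's code.

```python
import time
from datetime import date, timedelta


def frontpage_urls(start: date, end: date):
    """Yield one frontpage URL per day between start and end (inclusive)."""
    for offset in range((end - start).days + 1):
        day = start + timedelta(days=offset)
        yield f'https://www.derstandard.at/frontpage/{day:%Y/%m/%d}'


class Throttle:
    """Ensure at least min_interval seconds pass between consecutive wait() calls."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last = None  # monotonic timestamp of the previous call

    def wait(self):
        if self._last is not None:
            remaining = self.min_interval - (time.monotonic() - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


urls = list(frontpage_urls(date(2023, 12, 30), date(2024, 1, 2)))
print(urls)

# Usage sketch (no requests are sent here):
# throttle = Throttle(1.0)
# for url in frontpage_urls(date(2021, 12, 22), date(2023, 12, 22)):
#     throttle.wait()
#     soup = get_standard_soup(url)
```

With this approach the delay between requests stays close to the configured interval regardless of how long parsing takes.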


Okay, this cell obviously took a while to run, but we finally have our precious data. Let's convert it to a dataframe and see what we have.

In [24]:
import pandas as pd

df = pd.DataFrame(data)
df.columns = df.columns.str.replace('teaser-', '')
df.rename(columns={'time': 'datetime'}, inplace=True)
df


Unnamed: 0,title,subtitle,link,datetime,kicker,n_posts,storylabels
0,Liverpool nach großem Zittern im Ligacup-Halbf...,Minamino rettete die Reds in der Nachspielzeit...,https://www.derstandard.at/story/2000132122809...,2021-12-22T23:47,Premier League,7,
1,USA lassen Pfizers Covid-Tablette Paxlovid für...,Am Dienstag wurden in Österreich 2.269 Neuinfe...,https://www.derstandard.at/jetzt/livebericht/2...,2021-12-22T23:39,Omikron-Welle,22127,NachleseLivebericht
2,Real Madrid baut Vorsprung in LaLiga aus,Madrilenen nach 2:1 in Bilbao acht Punkte vor ...,https://www.derstandard.at/story/2000132122694...,2021-12-22T23:38,Primera Division,40,
3,Paris Saint-Germain wendet zweite Liga-Niederl...,Icardi besorgteAusgleichstreffergegen Grbic-Te...,https://www.derstandard.at/story/2000132122616...,2021-12-22T23:30,Fußball,3,
4,OSZE verkündet Einigung auf Waffenstillstand i...,"DieKonfliktparteiensollen zugestimmt haben, da...",https://www.derstandard.at/story/2000132122215...,2021-12-22T22:33,Weihnachtsfrieden,44,
...,...,...,...,...,...,...,...
90979,Jesus-Geburt unter Palmen,Auch der Koran erzählt über die Geburt von Jes...,https://www.derstandard.at/story/3000000200744...,2023-12-22T06:00,Wussten Sie schon?,1,
90980,"""Aquaman and the Lost Kingdom"" scheitert an ve...",Der Erfolg derSuperheldenfilmeleidet immer öft...,https://www.derstandard.at/story/3000000200724...,2023-12-22T06:00,Im Kino,242,
90981,One-Man-Show mit fatalen Folgen für die Demokr...,Die Wahlen in Serbien waren eine Farce. Die Eu...,https://www.derstandard.at/story/3000000200698...,2023-12-22T06:00,Vedran Džihić,113,Kommentar der anderen
90982,David Alaba zum zehnten Mal zu Österreichs Fuß...,Zehn von zwölf Trainern wählten den derzeit ve...,https://www.derstandard.at/story/3000000200745...,2023-12-22T05:46,Fußball,33,


In [None]:
# the 'time' column was renamed to 'datetime' above
df['datetime']

Over 90 thousand rows; this should give us plenty of data to analyze!

Let's save it to a CSV file, totalling almost 30 MB.

In [25]:
df.to_csv('data/derstandard_frontpage_data.csv', index=False)
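
When the CSV is loaded again later, the `datetime` column comes back as plain strings unless pandas is asked to parse it. The sketch below shows the round trip on two made-up rows and an in-memory buffer, so it does not touch the real file. The rows are illustrative, not taken from the scraped data.

```python
import io

import pandas as pd

# Two invented rows with the same keys that extract_article_data() produces
rows = [
    {'title': 'A', 'teaser-subtitle': 'x', 'time': '2023-12-22T06:00',
     'n_posts': 1, 'storylabels': None},
    {'title': 'B', 'teaser-subtitle': 'y', 'time': '2023-12-22T05:46',
     'n_posts': 33, 'storylabels': 'Kommentar der anderen'},
]

sample = pd.DataFrame(rows)
sample.columns = sample.columns.str.replace('teaser-', '', regex=False)
sample = sample.rename(columns={'time': 'datetime'})

# Write to an in-memory buffer instead of data/derstandard_frontpage_data.csv
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype; missing storylabels come back as NaN
restored = pd.read_csv(buffer, parse_dates=['datetime'])
print(restored.dtypes)
```

For the real dataset, the same `parse_dates=['datetime']` argument can be passed when reading `data/derstandard_frontpage_data.csv`.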