# Web Scraping Exercise

Web scraping lets you gather large volumes of data from diverse, frequently updated online sources. This data can enrich existing datasets, fill gaps, and supply current information that improves the quality and relevance of your analysis. It also gives you access to data that is not readily available through official APIs or databases. Because collection is automated, it saves time and resources and makes it practical to keep a dataset up to date as it grows.

Ethical web scraping involves respecting website terms of service, avoiding overloading servers, and ensuring that the collected data is used responsibly and in compliance with privacy laws and regulations.
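One concrete way to respect a site's wishes is to read its ```robots.txt``` before scraping. The sketch below uses the standard-library ```urllib.robotparser``` on a made-up ```robots.txt``` so it runs offline; the rules, the ```my-scraper``` user agent, and the ```example.com``` URLs are all assumptions for illustration, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Create a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a placeholder user agent name.
allowed_home = parser.can_fetch("my-scraper", "https://example.com/?page=1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
delay = parser.crawl_delay("my-scraper")  # seconds requested by the site, or None

print(allowed_home, allowed_private, delay)
```

Against a live site, ```parser.set_url(...)``` followed by ```parser.read()``` downloads the file instead of parsing an inline string.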

Use Python, ```requests```, ```BeautifulSoup``` and/or ```pandas``` to scrape web data:

## Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import json
import time
import csv
import pandas as pd

## Define the Target URL

In [2]:
BASE_URL = "https://www.preisjaeger.at/?page="
HEADERS = {
    "User-Agent": "Mozilla/5.0"
}
deals = []
seen_ids = set()
page = 1
MAX_DEALS = 1000  # Upper limit on collected deals so the scraper does not run indefinitely

## Send a Request to the Website

Do not forget to check the response status code

In [3]:
# A timeout keeps the request from hanging indefinitely if the server does not answer
response = requests.get(BASE_URL + str(page), headers=HEADERS, timeout=10)
if response.status_code != 200:
    print(f"Error fetching page {page}: status code {response.status_code}")
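A single status check reports a failure but does not recover from temporary ones. The following is a minimal sketch of retrying with exponential backoff. The helper ```fetch_with_retry```, the set of retryable status codes, and the delay values are assumptions made for illustration; the ```get``` argument is any callable that returns an object with a ```status_code``` attribute.

```python
import time

# Status codes treated as temporary failures; this selection is an assumption.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def fetch_with_retry(get, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call get(url) until it returns status 200 or the attempts are used up."""
    for attempt in range(1, max_attempts + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        is_last_attempt = attempt == max_attempts
        if response.status_code not in RETRYABLE_STATUS or is_last_attempt:
            raise RuntimeError(
                f"Request to {url} failed with status {response.status_code}"
            )
        # Wait 1x, 2x, 4x, ... the base delay before the next attempt
        sleep(base_delay * 2 ** (attempt - 1))
```

With the names from this notebook, a call could look like ```fetch_with_retry(lambda u: requests.get(u, headers=HEADERS, timeout=10), BASE_URL + str(page))```.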

## Parse the HTML Content

Use a library to access the HTML content.

In [4]:
soup = BeautifulSoup(response.text, 'html.parser')
article = soup.find("article", class_="thread")
print(article)

<article class="thread cept-thread-item thread--newCard thread--shadow thread--type-list imgFrame-container--scale thread--deal" data-handler="history thread-click" data-history='{"endpoint":"https://www.preisjaeger.at","replace":true,"data":{"scrollTo":"#thread_346813","offset":70,"scrollContainer":"#main"},"events":["history","click"],"delegate":true}' data-ocular='{"thread_ids":346813}' data-t="thread" data-t-d='{"id":346813}' data-t-view="" data-t-view-twig="" id="thread_346813"><div aria-busy="true" class="js-vue2" data-handler="vue2" data-vue2='{"name":"ThreadMainListItemNormalizer","props":{"thread":{"threadId":"346813","titleSlug":"braun-series-9-pro-scherkopf-30-bei-bipa-und-10eur-cashback-lokal","title":"Braun Series 9 (Pro) Scherkopf: -30 % bei Bipa und 10\u20ac Cashback (Lokal)","currentUserVoteDirection":null,"commentCount":0,"status":"Activated","isExpired":false,"isNew":false,"isPinned":false,"isTrending":null,"bookmarked":false,"isLocal":true,"temperature":109.02,"tempe

In [5]:
vue_data = article.find("div", attrs={"data-vue2": True})
vue_json = json.loads(vue_data["data-vue2"])
thread = vue_json.get("props", {}).get("thread", {})

print(thread.get("title"))
print(thread.get("price"))
print(thread.get("nextBestPrice"))
print(thread.get("temperature"))

Braun Series 9 (Pro) Scherkopf: -30 % bei Bipa und 10€ Cashback (Lokal)
31.99
41.99
109.02
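The ```data-vue2``` attribute holds a JSON document, so a malformed or unexpectedly shaped value would raise an exception. A minimal sketch of a defensive parser follows. The helper name ```parse_thread``` is hypothetical, and the sample string is a shortened stand-in modelled on the attribute printed above.

```python
import json


def parse_thread(vue_attr):
    """Return the thread dict from a data-vue2 attribute value, or {} on failure."""
    if not vue_attr:
        return {}
    try:
        payload = json.loads(vue_attr)
    except json.JSONDecodeError:
        return {}
    if not isinstance(payload, dict):
        return {}
    props = payload.get("props")
    thread = props.get("thread") if isinstance(props, dict) else None
    return thread if isinstance(thread, dict) else {}


# Shortened sample; the real attribute contains many more keys.
sample = (
    '{"name":"ThreadMainListItemNormalizer",'
    '"props":{"thread":{"threadId":"346813","temperature":109.02}}}'
)
thread = parse_thread(sample)
print(thread.get("threadId"), thread.get("temperature"))
```

Returning an empty dict keeps the later ```thread.get(...)``` calls valid, so the caller only has to check whether a ```threadId``` is present.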


The prices are sometimes zero because the corresponding deals are labeled "KOSTENLOS" (free) on the website.
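Because a zero here may mark a free deal rather than a real amount, it can be worth separating those entries before averaging prices. The sketch below shows one way to do that with plain Python. The helper ```split_prices``` is hypothetical, and the sample values are five prices taken from this notebook's scraped output.

```python
from statistics import mean

# Five sample prices from the scraped output, including two zero-priced deals.
prices = [31.99, 0, 69.58, 0, 699]


def split_prices(values):
    """Return the non-zero prices and the number of zero or missing entries."""
    priced = [value for value in values if value]
    zero_count = len(values) - len(priced)
    return priced, zero_count


priced, zero_count = split_prices(prices)
mean_all = mean(prices)     # zeros pull the average down
mean_priced = mean(priced)  # average over deals with an actual price

print(zero_count, round(mean_all, 2), round(mean_priced, 2))
```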

## Identify the Data to be Scraped

Write a couple of sentences on the data you want to scrape.

I want to scrape product deals from preisjaeger.at, including the title of the deal, the discounted price, the original (comparison) price, and the popularity score (referred to as "Grad"). The goal is to collect approximately 1000 unique deals across multiple paginated pages for further analysis or storage.

## Extract Data

Find specific elements and extract text or attributes from elements (handle pagination if necessary)

In [6]:
while len(deals) < MAX_DEALS:
    print(f"Scraping page {page}...")
    response = requests.get(BASE_URL + str(page), headers=HEADERS, timeout=10)
    if response.status_code != 200:
        print(f"Error fetching page {page}: status code {response.status_code}")
        break

    soup = BeautifulSoup(response.text, 'html.parser')

    articles = soup.find_all("article", class_="thread")
    if not articles:
        print("No more articles found. Stopping.")
        break

    new_deals_found = False

    for article in articles:
        try:
            vue_data = article.find("div", attrs={"data-vue2": True})
            if not vue_data:
                continue

            vue_json = json.loads(vue_data["data-vue2"])
            thread = vue_json.get("props", {}).get("thread", {})
            thread_id = thread.get("threadId")
            if thread_id is None or thread_id in seen_ids:
                continue  # no ID, or deal already collected
            seen_ids.add(thread_id)  # first occurrence: remember the ID

            title = thread.get("title")
            price = thread.get("price")
            next_best_price = thread.get("nextBestPrice")
            # "temperature" may be missing or null; fall back to 0
            temperature = round(thread.get("temperature") or 0)

            deals.append({
                "Titel": title,
                "Preis (€)": price,
                "Vergleichspreis (€)": next_best_price,
                "Grad": f"{temperature}"
            })

            new_deals_found = True

            if len(deals) >= MAX_DEALS:
                break
        except Exception as e:
            print(f"Error parsing an entry: {e}")

    if not new_deals_found:
        print("No new deals on this page. Stopping.")
        break

    page += 1
    time.sleep(0.5)  # pause between requests to limit the load on the server

Scraping page 1...
Scraping page 2...
Scraping page 3...
Scraping page 4...
Scraping page 5...
Scraping page 6...
Scraping page 7...
Scraping page 8...
Scraping page 9...
Scraping page 10...
Scraping page 11...
Scraping page 12...
Scraping page 13...
Scraping page 14...
Scraping page 15...
Scraping page 16...
Scraping page 17...
Scraping page 18...
Scraping page 19...
Scraping page 20...
Scraping page 21...
Scraping page 22...
Scraping page 23...
Scraping page 24...
Scraping page 25...
Scraping page 26...
Scraping page 27...
Scraping page 28...
Scraping page 29...
Scraping page 30...
Scraping page 31...
Scraping page 32...
Scraping page 33...
Scraping page 34...
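The loop above combines three concerns: fetching pages, skipping deals that were already seen, and deciding when to stop. The sketch below isolates the last two so they can be checked with plain lists instead of live requests. The function ```collect_unique``` and the sample pages are hypothetical; only the ```threadId``` key is taken from the scraped data.

```python
def collect_unique(pages, max_items, key="threadId"):
    """Collect items with unseen IDs until max_items is reached or a page adds nothing new."""
    seen = set()
    collected = []
    for threads in pages:
        new_on_page = 0
        for thread in threads:
            thread_id = thread.get(key)
            if thread_id is None or thread_id in seen:
                continue  # skip entries without an ID and duplicates
            seen.add(thread_id)
            collected.append(thread)
            new_on_page += 1
            if len(collected) >= max_items:
                return collected
        if new_on_page == 0:
            break  # a page without new items ends the run
    return collected


# Made-up pages: the third page only repeats a known ID, so the fourth is never read.
sample_pages = [
    [{"threadId": "1"}, {"threadId": "2"}],
    [{"threadId": "2"}, {"threadId": "3"}],
    [{"threadId": "3"}],
    [{"threadId": "4"}],
]
collected_ids = [t["threadId"] for t in collect_unique(sample_pages, max_items=10)]
print(collected_ids)
```

Passing a generator that fetches and parses one page per iteration as ```pages``` would keep the network code separate from this logic.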


## Store Data in a Structured Format

Give a brief overview of the data collected (e.g. count, fields, ...)

In [7]:
print(f"Number of deals collected: {len(deals)}")
if deals:
    print("Fields per deal:", ", ".join(deals[0].keys()))
# Show the first five deals
print()
for deal in deals[:5]:
    print(deal)

Number of deals collected: 1000
Fields per deal: Titel, Preis (€), Vergleichspreis (€), Grad

{'Titel': 'Braun Series 9 (Pro) Scherkopf: -30 % bei Bipa und 10€ Cashback (Lokal)', 'Preis (€)': 31.99, 'Vergleichspreis (€)': 41.99, 'Grad': '109'}
{'Titel': '1200 Yums bei TheFork', 'Preis (€)': 0, 'Vergleichspreis (€)': 0, 'Grad': '127'}
{'Titel': 'Samsung S36GD Essential Curved Monitor 24 Zoll, Full HD, 4 ms Reaktionszeit, 100 Hz, Eco Saving Plus, Flicker Free, Schwarz, S24D364GAU', 'Preis (€)': 69.58, 'Vergleichspreis (€)': 89.43, 'Grad': '119'}
{'Titel': '-25% auf die Crisp n Cream von NEOH', 'Preis (€)': 0, 'Vergleichspreis (€)': 0, 'Grad': '114'}
{'Titel': 'DE LONGHI Pinguino PAC EX105 Klimagerät (Max. Raumgröße: 100 m³, EEK: A+++, 10000 BTU/h, Weiß)', 'Preis (€)': 699, 'Vergleichspreis (€)': 883, 'Grad': '150'}


In summary, we scraped 1000 deals (the limit is set by ```MAX_DEALS```) and kept the most important fields:
- Title (the title of the deal)
- Price (the current price)
- Comparison Price (the reference price, taken from the ```nextBestPrice``` field)
- Heat (the popularity of the deal, stored as "Grad")
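As an illustration of how the two price fields relate, the sketch below computes the saving of a deal relative to its comparison price. The helper ```discount_percent``` is hypothetical and not part of the notebook; it returns ```None``` when either price is zero or missing, since no meaningful percentage exists in that case.

```python
def discount_percent(price, comparison_price):
    """Return the saving relative to the comparison price in percent, or None."""
    if not price or not comparison_price or comparison_price <= price:
        return None
    return round((comparison_price - price) / comparison_price * 100, 1)


# Prices of the first scraped deal: 31.99 against a comparison price of 41.99.
first_deal_discount = discount_percent(31.99, 41.99)
# A deal stored with zero prices has no defined discount.
free_deal_discount = discount_percent(0, 0)

print(first_deal_discount, free_deal_discount)
```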

## Save the Data

In [8]:
with open("preisjaeger_deals.csv", mode="w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["Titel", "Preis (€)", "Vergleichspreis (€)", "Grad"])
    writer.writeheader()
    writer.writerows(deals)

print("CSV file 'preisjaeger_deals.csv' was created successfully.")

CSV file 'preisjaeger_deals.csv' was created successfully.


The data has been exported to a structured CSV file and is available for further use.
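One detail to keep in mind when reusing the file: CSV stores every value as text, so numbers have to be converted again after reading with the ```csv``` module. The sketch below shows the round trip with an in-memory buffer instead of a file on disk. The single sample row is copied from the scraped output; the buffer and variable names are assumptions for illustration.

```python
import csv
import io

FIELDS = ["Titel", "Preis (€)", "Vergleichspreis (€)", "Grad"]
rows = [
    {"Titel": "1200 Yums bei TheFork", "Preis (€)": 0, "Vergleichspreis (€)": 0, "Grad": "127"},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read the same content back; every value now arrives as a string
buffer.seek(0)
restored = list(csv.DictReader(buffer))

price_text = restored[0]["Preis (€)"]
price_value = float(price_text)  # explicit conversion back to a number
heat_value = int(restored[0]["Grad"])

print(repr(price_text), price_value, heat_value)
```

```pandas.read_csv``` infers numeric column types on its own, which is why ```describe()``` can summarize the "Grad" column even though it was written as text.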

In [9]:
df = pd.read_csv("preisjaeger_deals.csv", encoding="utf-8")
df.describe()

| | Preis (€) | Vergleichspreis (€) | Grad |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 114.02866 | 141.33316 | 284.245 |
| std | 263.350352 | 327.30766 | 233.992828 |
| min | 0.0 | 0.0 | 101.0 |
| 25% | 0.0 | 0.0 | 163.75 |
| 50% | 18.05 | 26.985 | 218.5 |
| 75% | 84.22 | 109.21 | 317.0 |
| max | 2365.43 | 3199.0 | 2745.0 |


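The table above gives a statistical overview of the scraped dataset, produced with pandas ```describe()```. The 25% quantile of 0 in both price columns shows that at least a quarter of the deals have a price of zero.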