# Web Scraping Exercise

Web scraping lets you gather large volumes of data from diverse, frequently updated online sources. This data can enrich your datasets, fill gaps, and provide current information that improves the quality and relevance of your analysis. It also gives you access to data that is not readily available through traditional APIs or databases. Because collection is automated, it saves time and resources and makes it practical to keep your datasets up to date.

Ethical web scraping involves respecting website terms of service, avoiding overloading servers, and ensuring that the collected data is used responsibly and in compliance with privacy laws and regulations.
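
One way to act on these points is to consult a site's `robots.txt` before requesting a page and to pause between requests. The sketch below uses the standard-library `urllib.robotparser`. The robots.txt rules, the `example.com` URLs, the user agent name `course-scraper`, and the one-second delay are illustrative assumptions, not the policy of any real site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Assumed pause between consecutive requests, in seconds
REQUEST_DELAY_SECONDS = 1.0


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, user_agent: str = "course-scraper") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_pause() -> None:
    """Sleep between requests so the server is not flooded."""
    time.sleep(REQUEST_DELAY_SECONDS)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(allowed(parser, "https://example.com/weather/amsterdam"))
print(allowed(parser, "https://example.com/private/data"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the real rules instead of the sample text. A robots.txt check does not replace reading the site's terms of service.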

Use Python with `requests`, `BeautifulSoup`, and/or `pandas` to scrape web data:

## Import Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import re
import json
import pandas as pd
from datetime import datetime

## Define the Target URL

In [2]:
url = "https://www.timeanddate.de/wetter/niederlande/amsterdam/rueckblick?month=5&year=2015"
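
The URL encodes the city, month, and year, so it can also be assembled from parts. The sketch below shows one way to do that with the standard library. `BASE_URL` and the helper `build_url` are hypothetical names introduced for illustration, and the URL pattern is simply inferred from the address used in this notebook.

```python
from urllib.parse import urlencode

# Common prefix of the review pages, as seen in the URL used in this notebook
BASE_URL = "https://www.timeanddate.de/wetter"


def build_url(city_path: str, month: int, year: int) -> str:
    """Assemble the monthly review URL for one city (hypothetical helper)."""
    query = urlencode({"month": month, "year": year})
    return f"{BASE_URL}/{city_path}/rueckblick?{query}"


print(build_url("niederlande/amsterdam", 5, 2015))
```

Keeping the parts separate makes it easy to loop over several cities or months later without editing the string by hand.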

## Send a Request to the Website

Do not forget to check the response status code.

In [3]:
response = requests.get(url)
html = response.text
print("Status Code:", response.status_code)

Status Code: 200
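
Printing the status code only helps if someone reads it. A scraper that runs unattended may prefer to stop when the request fails. The sketch below shows that idea with a hypothetical helper, `text_if_ok`, and uses stand-in objects instead of real network responses so that it runs offline. It assumes only that the response has `status_code` and `text` attributes, as a `requests` response does.

```python
from types import SimpleNamespace


def text_if_ok(response):
    """Return the response body, or raise if the status code is not 200."""
    if response.status_code != 200:
        raise RuntimeError(f"Unexpected status code: {response.status_code}")
    return response.text


# Stand-in objects with the same two attributes as a requests response
ok = SimpleNamespace(status_code=200, text="<html>ok</html>")
blocked = SimpleNamespace(status_code=403, text="")

print(text_if_ok(ok))
try:
    text_if_ok(blocked)
except RuntimeError as err:
    print(err)
```

With `requests` itself, calling `response.raise_for_status()` is a built-in alternative: it raises `requests.HTTPError` for 4xx and 5xx responses.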


## Parse the HTML Content

Use a library to access the HTML content.

In [4]:
soup = BeautifulSoup(response.text, 'html.parser')
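
As a sketch of what the parsed tree makes possible, the example below narrows a page down to the `<script>` elements that mention a `"temp"` key. The HTML is an invented sample. Whether the real page embeds its data in exactly this way is an assumption.

```python
from bs4 import BeautifulSoup

# Invented sample page; the real page layout may differ
SAMPLE_HTML = """
<html><head><title>Weather</title>
<script>var data={"temp":[{"date":1430439900000,"temp":6}]};</script>
<script>console.log("unrelated");</script>
</head><body><h1>Amsterdam</h1></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the <script> elements whose text mentions the "temp" key
candidates = [
    script.string
    for script in soup.find_all("script")
    if script.string and '"temp"' in script.string
]

print(len(candidates))
print(soup.title.string)
```

Searching only the matching script text, rather than the whole page source, reduces the chance that a regular expression matches an unrelated part of the document.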

## Identify the Data to be Scraped

Write a couple of sentences on the data you want to scrape.

The goal is to scrape historical temperature data for Amsterdam from timeanddate.de, where each entry includes a timestamp and the corresponding temperature in °C.

## Extract Data

Find specific elements and extract text or attributes from them (handle pagination if necessary).

In [9]:
# Search the page source for the weather data embedded as JSON (regular expression)
match = re.search(r'"temp":(\[.*?\])', html)

if match:
    temp_data = json.loads(match.group(1))

    # Convert each entry into a record
    records = []
    for entry in temp_data:
        timestamp = entry["date"] / 1000  # ms -> s
        dt = datetime.fromtimestamp(timestamp)  # interpreted in the local time zone
        records.append({"Date": dt.strftime('%Y-%m-%d %H:%M'), "Temperature": entry["temp"]})

    # Store the records in a DataFrame
    df = pd.DataFrame(records)
    print(df.describe())
    print(df.head())

       Temperature
count  1465.000000
mean     12.242321
std       3.041491
min       3.000000
25%      10.000000
50%      12.000000
75%      14.000000
max      24.000000
               Date  Temperature
0  2015-05-01 02:25            6
1  2015-05-01 02:55            7
2  2015-05-01 03:25            7
3  2015-05-01 03:55            7
4  2015-05-01 04:25            6
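
The extraction step can be tried offline on a small sample. The sketch below wraps the same regular expression and JSON parsing in a hypothetical function, `extract_temperatures`, and runs it on an invented HTML snippet. Unlike the cell above, it converts timestamps in UTC (an assumption made here so that the result does not depend on the machine's time zone), which is why the times differ from the local-time output shown above.

```python
import json
import re
from datetime import datetime, timezone

# Invented sample; the structure mirrors what the regular expression targets
SAMPLE_HTML = (
    '<script>var data={"temp":['
    '{"date":1430439900000,"temp":6},'
    '{"date":1430441700000,"temp":7}'
    ']};</script>'
)


def extract_temperatures(html: str) -> list:
    """Return a list of {"Date", "Temperature"} records, or [] if none are found."""
    match = re.search(r'"temp":(\[.*?\])', html)
    if match is None:
        return []
    records = []
    for entry in json.loads(match.group(1)):
        # Timestamps are in milliseconds; convert to seconds and read as UTC
        dt = datetime.fromtimestamp(entry["date"] / 1000, tz=timezone.utc)
        records.append({
            "Date": dt.strftime("%Y-%m-%d %H:%M"),
            "Temperature": entry["temp"],
        })
    return records


records = extract_temperatures(SAMPLE_HTML)
print(records)
```

Returning an empty list when nothing matches lets the caller decide how to handle a page without data, instead of failing later on an undefined variable.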


## Store Data in a Structured Format

Give a brief overview of the data collected (e.g. count, fields, ...)

The data consists of 2 columns:
- `Date`: date and time in the format `YYYY-MM-DD HH:MM`
- `Temperature`: temperature in °C

In total, **1465** records were collected. They cover May 2015 at roughly half-hourly intervals and form a time series.
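
An overview like this can also be computed rather than typed by hand, which keeps the text and the data in agreement. The sketch below does so for a five-row stand-in frame copied from the `df.head()` output above. The `overview` dictionary and its keys are names chosen for illustration.

```python
import pandas as pd

# Stand-in for the scraped DataFrame: the five rows shown by df.head() above
df = pd.DataFrame({
    "Date": [
        "2015-05-01 02:25",
        "2015-05-01 02:55",
        "2015-05-01 03:25",
        "2015-05-01 03:55",
        "2015-05-01 04:25",
    ],
    "Temperature": [6, 7, 7, 7, 6],
})

# Parse the text timestamps so that min/max compare as dates, not strings
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d %H:%M")

overview = {
    "rows": len(df),
    "columns": list(df.columns),
    "first": df["Date"].min(),
    "last": df["Date"].max(),
    "missing_temperatures": int(df["Temperature"].isna().sum()),
}

for key, value in overview.items():
    print(f"{key}: {value}")
```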

## Save the Data

In [6]:
# Save as CSV
if match:
    df.to_csv("amsterdam_weather_may2015.csv", index=False)
    print("Data saved to 'amsterdam_weather_may2015.csv'")

Data saved to 'amsterdam_weather_may2015.csv'


# Gather Data for Presentation

In [1]:
import requests
import re
import json
import pandas as pd
from datetime import datetime

# Cities to scrape, mapped to their URL paths
cities = {
    "Amsterdam": "niederlande/amsterdam",
    "Berlin": "deutschland/berlin",
    "Paris": "frankreich/paris",
    "Madrid": "spanien/madrid",
    "Wien": "oesterreich/wien"
}

# Year to scrape; used in both the URLs and the output file name
year = 2015

# Records collected across all cities and months
all_records = []

for city_name, city_path in cities.items():
    for month in range(1, 13):
        url = f"https://www.timeanddate.de/wetter/{city_path}/rueckblick?month={month}&year={year}"
        response = requests.get(url)
        html = response.text

        # Extract the temperature data embedded as JSON in the HTML
        match = re.search(r'"temp":(\[.*?\])', html)
        if match:
            try:
                temp_data = json.loads(match.group(1))
                for entry in temp_data:
                    timestamp = entry["date"] / 1000  # ms -> s
                    dt = datetime.fromtimestamp(timestamp)  # interpreted in the local time zone

                    # Only store entries that contain a temperature
                    if "temp" in entry:
                        all_records.append({
                            "City": city_name,
                            "Date": dt.strftime('%Y-%m-%d %H:%M'),
                            "Temperature": entry["temp"]
                        })
            except Exception as e:
                print(f"⚠️ Error while processing {city_name}, month {month}: {e}")
        else:
            print(f"❌ No temperature data for {city_name}, month {month}")

# Convert to a DataFrame
df = pd.DataFrame(all_records)

# Save as CSV
output_file = f"weather_europe_{year}.csv"
df.to_csv(output_file, index=False)
print(f"✅ Done! Data saved to '{output_file}'")


✅ Done! Data saved to 'weather_europe_2015.csv'


In [16]:
df = pd.DataFrame(all_records)
print(df.head())
print("Total records:", len(df))


        City              Date  Temperature
0  Amsterdam  2015-01-01 01:25            4
1  Amsterdam  2015-01-01 01:55            4
2  Amsterdam  2015-01-01 02:25            3
3  Amsterdam  2015-01-01 02:55            3
4  Amsterdam  2015-01-01 03:25            2
Total records: 78768
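
For a presentation, tens of thousands of half-hourly readings usually need to be condensed first. The sketch below shows one possible aggregation, the mean temperature per city and month, on a tiny stand-in frame. The Amsterdam rows are taken from the output above. The Berlin rows are invented purely for illustration, and the `Month` column is a name introduced here.

```python
import pandas as pd

# Stand-in data: Amsterdam rows from the output above, Berlin rows invented
sample = pd.DataFrame({
    "City": ["Amsterdam"] * 4 + ["Berlin"] * 2,
    "Date": [
        "2015-01-01 01:25",
        "2015-01-01 01:55",
        "2015-01-01 02:25",
        "2015-01-01 02:55",
        "2015-01-01 01:20",
        "2015-01-01 01:50",
    ],
    "Temperature": [4, 4, 3, 3, 1, 3],
})

# Parse the timestamps and derive the month number for grouping
sample["Date"] = pd.to_datetime(sample["Date"], format="%Y-%m-%d %H:%M")
sample["Month"] = sample["Date"].dt.month

# Mean temperature per city and month
monthly = sample.groupby(["City", "Month"], as_index=False)["Temperature"].mean()
print(monthly)
```

The same two lines of grouping would apply to the full `df` once its `Date` column has been parsed.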
