# WEB SCRAPING

In [12]:
import requests 
import pandas as pd
from bs4 import BeautifulSoup as bs
import time
import json

`requests` is used to fetch HTML content from websites

`BeautifulSoup` (imported as `bs`) is used to parse and navigate HTML documents

In [8]:
response = requests.get("https://climate.ec.europa.eu/news-your-voice/news_en")

Sends a `GET` request to the EU climate news page and stores the reply in the `response` object.

This fetches the entire HTML document from the page.

200

`status_code` returns the HTTP status code (200 = OK)

`reason` gives a textual explanation of the status code ("OK")
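As a small offline illustration of how a status code pairs with its reason phrase, the sketch below uses Python's standard `http.HTTPStatus` enum instead of a live request. On the `response` object from the cell above, the same information is available as `response.status_code` and `response.reason`. The list of codes is chosen only for illustration.

```python
from http import HTTPStatus

# A few common status codes, picked for illustration.
codes = [200, 404, 500]

# HTTPStatus(code).phrase is the standard reason phrase for that code.
reasons = {code: HTTPStatus(code).phrase for code in codes}

for code, reason in reasons.items():
    print(code, reason)
```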

Stores the raw HTML and prints the first 1000 characters to get an overview of how the page is structured.

Converts the raw HTML into a BeautifulSoup object, which makes it possible to search and navigate the HTML elements in a structured way.

`prettify()` displays the parsed HTML with consistent indentation, which makes the HTML structure easier to read and understand.
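The parsing and pretty-printing steps can be sketched on a small inline HTML string, so the example runs without a network request. The HTML snippet and the names `html`, `soup` and `pretty` are assumptions made for illustration; in the notebook the input would be the raw HTML of the response (for example `response.text`).

```python
from bs4 import BeautifulSoup as bs

# Hypothetical stand-in for the raw HTML of a fetched page.
html = "<html><body><h1>News</h1><p>First paragraph.</p></body></html>"

# Peek at the start of the raw markup (here the whole string is shorter than 1000 characters).
print(html[:1000])

# Parse the markup into a navigable BeautifulSoup object.
soup = bs(html, "html.parser")

# prettify() returns the parsed tree as an indented string.
pretty = soup.prettify()
print(pretty)
```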

Finds the first `<p>` (paragraph) element on the page. This is useful for locating where specific content sits.

Finds all `<p>` elements on the page and returns them as a list. This gives an overview of all text held in paragraphs.

The first line extracts only the text (without HTML tags) from the first paragraph element.

The `for` loop iterates over all paragraphs and prints their text.
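A minimal sketch of these paragraph lookups, run on a made-up HTML fragment rather than the live page. The fragment and the variable names are assumptions; only `find`, `find_all` and `get_text` come from the steps described above.

```python
from bs4 import BeautifulSoup as bs

# Made-up fragment with two paragraphs, one containing nested markup.
html = "<div><p>First <b>paragraph</b>.</p><p>Second paragraph.</p></div>"
soup = bs(html, "html.parser")

first_p = soup.find("p")      # the first <p> element, or None if there is none
all_p = soup.find_all("p")    # every <p> element, in document order

# get_text() drops the tags and keeps only the text.
print(first_p.get_text())

for p in all_p:
    print(p.get_text())

paragraph_texts = [p.get_text() for p in all_p]
```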

Builds a `set` of all unique HTML tag names on the page. This shows which elements the page contains and helps identify the tags that are relevant for scraping.

Finds all elements that have an `id` attribute and collects the unique ID values. IDs are often useful for targeting specific sections of a web page.
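One way to collect the tag names and IDs is sketched below on an invented HTML fragment. The fragment and the names `tag_names` and `ids` are assumptions; passing `True` to `find_all` matches every tag, and `id=True` matches only tags that carry an `id` attribute.

```python
from bs4 import BeautifulSoup as bs

# Invented fragment for illustration.
html = """
<div id="main">
  <h1 id="title">News</h1>
  <p>Intro</p>
  <p id="lead">Lead text</p>
</div>
"""
soup = bs(html, "html.parser")

# find_all(True) matches every tag; the set keeps each name once.
tag_names = {tag.name for tag in soup.find_all(True)}

# id=True matches only elements that have an id attribute.
ids = {tag["id"] for tag in soup.find_all(id=True)}

print(sorted(tag_names))
print(sorted(ids))
```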

In [29]:
soup.find("div", class_ = "ecl-content-block__title")

<div class="ecl-content-block__title" data-ecl-title-link=""><a class="ecl-link ecl-link--standalone" href="/news-other-reads/news/call-become-european-climate-pact-ambassador-or-partner-now-open-2025-09-19_en">The call to become European Climate Pact Ambassador or Partner is now open!</a></div>

Finds the first `<div>` element with the CSS class "ecl-content-block__title". Note the use of `class_` (with an underscore), since `class` is a reserved word in Python.

> "`<div>` stands for "division" and is a generic container element in HTML. It is used to group other HTML elements together and to structure content on web pages."


Both code snippets find all `div` elements that contain news titles.

Extracts the link elements (`<a>` tags) from each title `div`.

Prints the text of each title.

Extracts the `href` attributes (the URLs) of all news links. Links to pages on the same site are relative paths (for example `/news-other-reads/news/...`) and must be joined with the site's base URL to get full links; links to external sites are already complete.

Finds all `<time>` elements and extracts their `datetime` attributes. These hold structured timestamps for when the news items were published.

Finds all `div` elements that contain news summaries and extracts their text. This gives a short description of each news article.
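The extraction steps above could look roughly like the sketch below. It runs on a hand-written HTML fragment that only imitates the class names seen on the EU page (`ecl-content-block__title`, `ecl-content-block__description`); the article contents, the `/news/example-one_en` path and the `example.org` link are invented. `urljoin` is used to turn relative links into full URLs.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

base_url = "https://climate.ec.europa.eu/news-your-voice/news_en"

# Hand-written fragment imitating the page structure; the content is invented.
html = """
<article class="ecl-content-item">
  <time datetime="2025-09-19T12:00:00Z">19 September 2025</time>
  <div class="ecl-content-block__title"><a href="/news/example-one_en">Example one</a></div>
  <div class="ecl-content-block__description"> First summary. </div>
</article>
<article class="ecl-content-item">
  <time datetime="2025-09-17T12:00:00Z">17 September 2025</time>
  <div class="ecl-content-block__title"><a href="https://example.org/two">Example two</a></div>
  <div class="ecl-content-block__description">Second summary.</div>
</article>
"""
soup = bs(html, "html.parser")

# Title divs and the <a> element inside each of them.
title_divs = soup.find_all("div", class_="ecl-content-block__title")
links = [div.find("a") for div in title_divs]

titles = [a.get_text() for a in links]
# urljoin leaves absolute URLs untouched and resolves relative ones against base_url.
hrefs = [urljoin(base_url, a["href"]) for a in links]
dates = [t["datetime"] for t in soup.find_all("time")]
summaries = [
    d.get_text(strip=True)
    for d in soup.find_all("div", class_="ecl-content-block__description")
]

print(titles, hrefs, dates, summaries, sep="\n")
```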

In [13]:
article_rows_soup = soup.find_all("article", class_ = "ecl-content-item")

article_list = []

for row in article_rows_soup:
    article_dict = {}
    
    article_title_soup = row.find("div", class_ = "ecl-content-block__title").find("a")
    article_title = article_title_soup.get_text()
    article_link = article_title_soup['href']
    
    article_date = row.find("time")["datetime"]
    
    article_summary_soup = row.find("div", class_ = "ecl-content-block__description")
    try:
        article_summary = article_summary_soup.get_text(strip = True)
    except:
        article_summary = ""
    
    article_dict['title'] = article_title
    article_dict['link'] = article_link
    article_dict['date'] = article_date
    article_dict['summary'] = article_summary
    
    article_list.append(article_dict)

1. Finds all `article` elements on the page
2. For each article, systematically extracts `title, link, date and summary`
3. Organizes the data in dictionaries and collects them in a list
4. `try-except` handles articles that may lack a summary
5. The `strip=True` parameter removes surplus whitespace characters
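An explicit `None` check is one possible alternative to the `try-except` in step 4, since `find()` returns `None` when no matching element exists. The helper name `text_or_default` and the HTML fragment are hypothetical and used only for this sketch.

```python
from bs4 import BeautifulSoup as bs

def text_or_default(element, default=""):
    """Return the stripped text of a bs4 element, or `default` if the element is None."""
    if element is None:
        return default
    return element.get_text(strip=True)

# Invented fragment: the second article has no summary div.
html = """
<article class="ecl-content-item">
  <div class="ecl-content-block__description">  Has a summary.  </div>
</article>
<article class="ecl-content-item"></article>
"""
soup = bs(html, "html.parser")

summaries = [
    text_or_default(row.find("div", class_="ecl-content-block__description"))
    for row in soup.find_all("article", class_="ecl-content-item")
]
print(summaries)
```

Compared with a bare `except`, this only covers the "element is missing" case and lets other, unexpected errors surface.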

## Purpose-Built Practice Sites:

* http://quotes.toscrape.com - Simple quotes with author info, pagination
* http://books.toscrape.com - Bookstore with categories, ratings, prices
* https://scrapethissite.com - Multiple challenges from basic to advanced
* http://httpbin.org - API testing and HTTP request practice

## [http://quotes.toscrape.com](http://quotes.toscrape.com)

In [18]:
url = "http://quotes.toscrape.com"
response = requests.get(url)
soup = bs(response.content, 'html.parser')

In [19]:
quotes = []
    
# Find all quote containers
for quote in soup.find_all('div', class_='quote'):
    text = quote.find('span', class_='text').get_text()
    author = quote.find('small', class_='author').get_text()
    tags = [tag.get_text() for tag in quote.find_all('a', class_='tag')]
        
    quotes.append({
        'text': text,
        'author': author,
        'tags': ', '.join(tags)
    })

In [20]:
quotes = pd.DataFrame.from_records(quotes)

In [23]:
quotes.head(10)

Unnamed: 0,text,author,tags
0,“The world as we have created it is a process ...,Albert Einstein,"change, deep-thoughts, thinking, world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,"abilities, choices"
2,“There are only two ways to live your life. On...,Albert Einstein,"inspirational, life, live, miracle, miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,"aliteracy, books, classic, humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,"be-yourself, inspirational"
5,“Try not to become a man of success. Rather be...,Albert Einstein,"adulthood, success, value"
6,“It is better to be hated for what you are tha...,André Gide,"life, love"
7,"“I have not failed. I've just found 10,000 way...",Thomas A. Edison,"edison, failure, inspirational, paraphrased"
8,“A woman is like a tea bag; you never know how...,Eleanor Roosevelt,misattributed-eleanor-roosevelt
9,"“A day without sunshine is like, you know, nig...",Steve Martin,"humor, obvious, simile"


## http://books.toscrape.com

In [24]:
books = []
page = 1
    
while True:
    url = f"http://books.toscrape.com/catalogue/page-{page}.html"
    response = requests.get(url)
        
    if response.status_code != 200:
        break
            
    soup = bs(response.content, 'html.parser')
    book_containers = soup.find_all('article', class_='product_pod')
        
    if not book_containers:
        break
            
    for book in book_containers:
        title = book.find('h3').find('a')['title']
        price = book.find('p', class_='price_color').get_text()
        rating = book.find('p', class_='star-rating')['class'][1]
            
        books.append({
            'title': title,
            'price': price,
            'rating': rating
        })
        
    page += 1
    if page > 3:  # Limit so the loop does not run for too long
        break
            
    time.sleep(1)

In [25]:
books = pd.DataFrame.from_records(books)

In [28]:
books.head(20)

Unnamed: 0,title,price,rating
0,A Light in the Attic,£51.77,Three
1,Tipping the Velvet,£53.74,One
2,Soumission,£50.10,One
3,Sharp Objects,£47.82,Four
4,Sapiens: A Brief History of Humankind,£54.23,Five
5,The Requiem Red,£22.65,One
6,The Dirty Little Secrets of Getting Your Dream...,£33.34,Four
7,The Coming Woman: A Novel Based on the Life of...,£17.93,Three
8,The Boys in the Boat: Nine Americans and Their...,£22.60,Four
9,The Black Maria,£52.15,One
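The scraped `price` and `rating` values are still strings. A possible clean-up step, sketched here on two rows copied from the output above, converts them to numbers. The helper names `parse_price` and `parse_rating` and the `RATING_WORDS` mapping are assumptions for illustration, not part of the original notebook.

```python
import re

# Rating words as they appear in the star-rating CSS class.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(price_text):
    """Convert a price string such as '£51.77' to a float by dropping non-numeric characters."""
    return float(re.sub(r"[^0-9.]", "", price_text))

def parse_rating(rating_word):
    """Convert a rating word such as 'Three' to an integer."""
    return RATING_WORDS[rating_word]

# Two rows taken from the scraped output above.
sample = [
    {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three"},
    {"title": "Tipping the Velvet", "price": "£53.74", "rating": "One"},
]

cleaned = [
    {
        "title": book["title"],
        "price": parse_price(book["price"]),
        "rating": parse_rating(book["rating"]),
    }
    for book in sample
]
print(cleaned)
```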


# CRAWLERS 

In [36]:
import requests
import scrapy
import pandas as pd
from scrapy.crawler import CrawlerProcess
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

AttributeError: module 'cryptography.utils' has no attribute 'DeprecatedIn40'

This Scrapy crawler works as an automated web scraper that systematically walks through the EU climate news site.

Scrapy is an open-source framework for web scraping and web crawling, written in Python. It is designed to automate navigating websites, extracting data and handling complex scraping tasks in a structured and efficient way.

The crawler settings use a short custom User-Agent that presents the crawler as a Mozilla-compatible client and names the course scraper. Many modern websites block or distrust short, non-browser User-Agent strings like this.


In [29]:
class eu_crawler(scrapy.Spider): #intentional error to avoid mass crawling
    name = "eu_crawler"
    main_url = 'https://climate.ec.europa.eu/news-your-voice/news_en'
    start_urls = ['https://climate.ec.europa.eu/news-your-voice/news_en']
    
    def parse(self, response):
        soup = bs(response.text, "html.parser") # Note: a Scrapy response exposes the HTML content as .text
        
        article_rows_soup = soup.find_all("article", class_ = "ecl-content-item")
        
        for row in article_rows_soup:
            article_dict = {}

            article_title_soup = row.find("div", class_ = "ecl-content-block__title").find("a")
            article_title = article_title_soup.get_text()
            article_link = article_title_soup['href']
            
            article_date = row.find("time")["datetime"]
            
            article_summary_soup = row.find("div", class_ = "ecl-content-block__description")
            try:
                article_summary = article_summary_soup.get_text(strip = True)
            except:
                article_summary = ""

            article_dict['title'] = article_title
            article_dict['link'] = article_link
            article_dict['date'] = article_date
            article_dict['summary'] = article_summary

            article_list.append(article_dict)
        
        try:
            next_page_url = urljoin(self.main_url, soup.find("a", attrs = {'aria-label': "Go to next page"})['href'])
        except:
            next_page_url = None
            
        if next_page_url is not None:
            yield scrapy.Request(url = next_page_url, callback=self.parse)

article_list = []
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (compatible; SDS Course Web Scraper; Aalborg University Research)',
    'ROBOTSTXT_OBEY': True,  # Respect robots.txt
    'DOWNLOAD_DELAY': 1,     # Be polite to the server
})

process.crawl(eu_crawler)
process.start()

NameError: name 'scrapy' is not defined

In [52]:
print(article_list)




In [53]:
df = pd.DataFrame.from_records(article_list)



In [54]:
df.head(10)



Unnamed: 0,title,link,date,summary
0,The call to become European Climate Pact Ambas...,/news-other-reads/news/call-become-european-cl...,2025-09-19T12:00:00Z,Scale up local efforts into Europe-wide change...
1,EU allocates €100m-worth of ETS allowances to ...,/news-other-reads/news/eu-allocates-eu100m-wor...,2025-09-17T12:00:00Z,"On 12 September 2025, the Commission adopted a..."
2,New study provides toolbox for early decarboni...,/news-other-reads/news/new-study-provides-tool...,2025-09-09T12:00:00Z,The Commission has published a study to help M...
3,Africa Climate Summit 2: European Commission E...,https://www.eeas.europa.eu/delegations/african...,2025-09-05T12:00:00Z,EU Executive Vice-President Teresa Ribera head...
4,5 things you should know about carbon pricing,/news-other-reads/news/5-things-you-should-kno...,2025-09-05T12:00:00Z,5 things you should know about carbon pricing
5,EU–ASEAN Joint Side Event on Carbon Pricing an...,https://www.eeas.europa.eu/eu-asean-carbonpric...,2025-09-04T12:00:00Z,The EU and ASEAN joined forces at a high-level...
6,Cooling hot cities by giving LIFE to buildings,https://cinea.ec.europa.eu/news-events/news/co...,2025-08-29T12:00:00Z,With heatwaves becoming more intense due to cl...
7,New EU studies explore purchasing programme to...,/news-other-reads/news/new-eu-studies-explore-...,2025-08-27T12:00:00Z,DG CLIMA is exploring the design of an EU-wide...
8,Revised 2025 and 2026 EU ETS auction calendars...,/news-other-reads/news/revised-2025-and-2026-e...,2025-07-28T12:00:00Z,"Today, the European Energy Exchange (EEX) publ..."
9,Joint EU-China press statement on climate,https://ec.europa.eu/commission/presscorner/de...,2025-07-24T12:00:00Z,"At the EU-China Summit in Beijing, the followi..."




# How does this Scrapy crawler work?

The crawler above walks through the EU climate news site page by page. This section explains how it runs and when it stops.

## How the crawler runs automatically

#### 1. Start and initialization

```python
start_urls = ['https://climate.ec.europa.eu/news-your-voice/news_en']
```

The crawler starts from a single URL, and Scrapy automatically passes the response to the `parse()` method.

#### 2. Automatic pagination

The crawler continues to the next page through this logic:

```python
next_page_url = urljoin(self.main_url, soup.find("a", attrs = {'aria-label': "Go to next page"})['href'])

if next_page_url is not None:
    yield scrapy.Request(url = next_page_url, callback=self.parse)
```

* On each page it looks for the *"Go to next page"* link
* If the link exists, it creates a new request for the next page
* `yield scrapy.Request()` tells Scrapy to fetch the new URL and call `parse()` on its response
* Each parsed page schedules the next one, so the loop continues through all pages
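The role of `urljoin` in this logic can be seen in isolation. The sketch below resolves three kinds of `href` values against the crawler's `main_url`; the `href` values themselves are hypothetical, since the exact form of the "next page" link on the real site is not shown here.

```python
from urllib.parse import urljoin

main_url = "https://climate.ec.europa.eu/news-your-voice/news_en"

# Hypothetical href values; the real "next page" link may look different.
query_only = urljoin(main_url, "?page=1")                             # keeps the base path, replaces the query
site_relative = urljoin(main_url, "/news-your-voice/news_en?page=2")  # path resolved against the host
absolute = urljoin(main_url, "https://example.org/other")             # absolute URLs pass through unchanged

print(query_only)
print(site_relative)
print(absolute)
```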

### The crawler stops automatically when one of these conditions is met:

1. **No more pages (primary stop condition)**

```python
try:
    next_page_url = urljoin(self.main_url, soup.find("a", attrs = {'aria-label': "Go to next page"})['href'])
except:
    next_page_url = None
    
if next_page_url is not None:  # If None, no further request is yielded
    yield scrapy.Request(url = next_page_url, callback=self.parse)
```

When no "next page" link is found, `next_page_url` is set to `None` and the crawler stops naturally.

2. **Errors or exceptions**

If a page cannot be fetched (network failure, server shutdown, etc.), `parse()` is not called for it, so no next-page request is scheduled and the crawl ends.

3. **Manual stop**

The crawler can be stopped manually with Ctrl+C or by killing the process.
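Besides these conditions, Scrapy's built-in CloseSpider extension offers settings that cap a crawl. The fragment below is a sketch of a settings dictionary that could be passed to `CrawlerProcess` in place of the one used earlier; the limits of 5 pages and 60 seconds are arbitrary values chosen for illustration, and requests already in flight may still finish after a limit is reached.

```python
# Sketch of crawl settings with explicit upper bounds (values are arbitrary examples).
crawl_settings = {
    "ROBOTSTXT_OBEY": True,        # respect robots.txt
    "DOWNLOAD_DELAY": 1,           # wait between requests
    "CLOSESPIDER_PAGECOUNT": 5,    # close the spider after this many responses have been crawled
    "CLOSESPIDER_TIMEOUT": 60,     # close the spider after it has been open this many seconds
}

print(crawl_settings)
```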


# Comparison: Scrapy vs requests + BeautifulSoup

## Requests + BeautifulSoup


```{python}
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # needed to resolve relative links
import time

def scrape_manually():
    session = requests.Session()
    urls_to_visit = ['https://example.com/page1']
    visited = set()
    results = []
    
    while urls_to_visit:
        url = urls_to_visit.pop()
        if url in visited:
            continue
            
        time.sleep(1)  # Manual rate limiting
        
        try:
            response = session.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Manual data extraction
            for article in soup.find_all('article'):
                results.append({
                    'title': article.find('h2').text,
                    'content': article.find('p').text
                })
            
            # Manual link following
            for link in soup.find_all('a', href=True):
                next_url = urljoin(url, link['href'])
                if 'example.com' in next_url:
                    urls_to_visit.append(next_url)
                    
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            
        visited.add(url)
    
    return results
```

## Scrapy equivalent

```{python}
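import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']  # stay on the site, like the domain check in the manual version
    start_urls = ['https://example.com/page1']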
    
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 1
    }
    
    def parse(self, response):
        # Automatic data extraction
        for article in response.css('article'):
            yield {
                'title': article.css('h2::text').get(),
                'content': article.css('p::text').get()
            }
        
        # Automatic link following with duplicate filtering
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, callback=self.parse)
```

# APIs

## How are APIs used?

APIs (for databases) differ a lot, but they typically share the following basic components:

**Request:** Like all other communication over the internet, using an API involves sending a *request* (GET or POST) to a server.

**Endpoint:** APIs typically offer several *endpoints*. Think of them as "taps" you can connect to in order to draw data. Endpoints are the URLs that a request is sent to.

**Parameters:** Parameters are the arguments or settings that an endpoint accepts or expects. Some are required to get any data back at all; others are optional.

**Authentication:** Most APIs require some form of authentication. Two forms are common: HTTP basic authentication (username and password) and authentication via tokens. Tokens are unique "keys": a string of letters, digits and symbols that uniquely identifies to the server who is sending the request.
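To make these components concrete, the sketch below assembles a request URL from an endpoint and parameters and builds the two kinds of authentication headers, using only the standard library and without sending anything. The endpoint `https://api.example.com/v1/records`, the parameter names, the credentials and the token are all hypothetical; real APIs document their own endpoints, parameters and authentication scheme.

```python
import base64
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only.
endpoint = "https://api.example.com/v1/records"
params = {"q": "climate", "limit": 10}

# A GET request carries its parameters in the query string of the URL.
request_url = f"{endpoint}?{urlencode(params)}"
print(request_url)

# HTTP basic authentication: "username:password", base64-encoded, in the Authorization header.
credentials = base64.b64encode(b"my-user:my-password").decode("ascii")
basic_auth_header = {"Authorization": f"Basic {credentials}"}

# Token authentication: many APIs expect a bearer token, but the exact scheme varies by API.
token_auth_header = {"Authorization": "Bearer my-secret-token"}

print(basic_auth_header)
print(token_auth_header)
```

With `requests`, the same pieces map onto the `params=`, `auth=` and `headers=` arguments of `requests.get`.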

## Example 1: Using the Statistikbanken API

Link to the API documentation: https://www.dst.dk/en/Statistik/brug-statistikken/muligheder-i-statistikbanken/api

Data in Statistikbanken can be accessed through an API.

The following shows some examples of using the API from Python.

*Note*: This API does not require authentication

In [57]:
import requests
from io import StringIO
import pandas as pd


In [58]:
statbank_api = "https://api.statbank.dk/v1/data"  #Endpoint of the data API

param = {'table': 'folk1c',
        'format': 'CSV',
        'variables': [{'code': 'OMRÅDE', 'values': ['101', '851']},  
        {'code': 'ALDER', 'values': ['20-24', '25-29']}]
        }

data_req = requests.post(statbank_api, json=param)  #Sending requests

print(data_req.text)  # Printing the raw text output



OMRÅDE;ALDER;TID;INDHOLD
København;20-24 år;2025K3;62167
København;25-29 år;2025K3;92173
Aalborg;20-24 år;2025K3;20674
Aalborg;25-29 år;2025K3;22301



In [60]:
dstdata = StringIO(data_req.text)  
dstdf = pd.read_csv(dstdata, sep=";")  
print(dstdf)  



      OMRÅDE     ALDER     TID  INDHOLD
0  København  20-24 år  2025K3    62167
1  København  25-29 år  2025K3    92173
2    Aalborg  20-24 år  2025K3    20674
3    Aalborg  25-29 år  2025K3    22301


In [61]:
dstdf.groupby('OMRÅDE')[['INDHOLD']].sum()  # sum only the numeric count column; summing ALDER and TID would concatenate their strings

OMRÅDE,INDHOLD
Aalborg,42975
København,154340


## Example 2: Using Danmarks Adresser Web API (DAWA)

Link to the API documentation: https://dawadocs.dataforsyningen.dk/dok/api/

Danmarks Adresser Web API (DAWA) can be used to look up and search for addresses in Denmark.

*Note:* This API does not require authentication.

In [68]:
adress_end = 'https://api.dataforsyningen.dk/adresser' # endpoint for searching addresses (the API has other endpoints)

# parameters/settings for the search (see the documentation)
parameters = {'q': 'Solsort*',  # search for addresses matching Solsort (* is a wildcard)
              'kommunekode': '0851'} # restrict to municipality code 0851 (Aalborg)

data_req = requests.get(adress_end, params = parameters)



In [69]:
data_req.status_code # did the request go through?



200



In [114]:
# Check the actual response
print(f"Status code: {data_req.status_code}")
print(f"Content-Type: {data_req.headers.get('Content-Type')}")
print(f"Response size: {len(data_req.content)} bytes")
print(f"First 1000 characters: {data_req.text[:1000]}")



Status code: 200
Content-Type: application/json; charset=UTF-8
Response size: 120429 bytes
First 1000 characters: [
{
  "id": "0a3f50ca-baff-32b8-e044-0003ba298018",
  "status": 1,
  "darstatus": 3,
  "oprettet": "2000-02-05T21:22:49.000",
  "ændret": "2000-02-05T21:22:49.000",
  "ikrafttrædelse": "2000-02-05T21:22:49.000",
  "nedlagt": null,
  "vejkode": "7688",
  "vejnavn": "Solsortvej",
  "adresseringsvejnavn": "Solsortvej",
  "husnr": "2",
  "etage": null,
  "dør": null,
  "supplerendebynavn": null,
  "postnr": "9000",
  "postnrnavn": "Aalborg",
  "stormodtagerpostnr": null,
  "stormodtagerpostnrnavn": null,
  "kommunekode": "0851",
  "kommunenavn": "Aalborg",
  "ejerlavkode": 2005058,
  "ejerlavnavn": "Sohngårdsholm Hgd., Aalborg Jorder",
  "matrikelnr": "14bh",
  "esrejendomsnr": "0",
  "etrs89koordinat_øst": 556511.84,
  "etrs89koordinat_nord": 6321216.29,
  "wgs84koordinat_bredde": 57.03094927,
  "wgs84koordinat_længde": 9.93105827,
  "nøjagtighed": "A",
  "kilde": 5,
  "teknis

In [71]:
data = data_req.json() # the data is returned as JSON by default (parsed this way)



In [73]:
print(data[:1])



[{'id': '0a3f50ca-baff-32b8-e044-0003ba298018', 'kvhx': '08517688___2_______', 'status': 1, 'darstatus': 3, 'href': 'https://api.dataforsyningen.dk/adresser/0a3f50ca-baff-32b8-e044-0003ba298018', 'historik': {'oprettet': '2000-02-05T21:22:49.000', 'ændret': '2000-02-05T21:22:49.000', 'ikrafttrædelse': '2000-02-05T21:22:49.000', 'nedlagt': None}, 'etage': None, 'dør': None, 'adressebetegnelse': 'Solsortvej 2, 9000 Aalborg', 'adgangsadresse': {'href': 'https://api.dataforsyningen.dk/adgangsadresser/0a3f509c-b9d5-32b8-e044-0003ba298018', 'id': '0a3f509c-b9d5-32b8-e044-0003ba298018', 'adressebetegnelse': 'Solsortvej 2, 9000 Aalborg', 'kvh': '08517688___2', 'status': 1, 'darstatus': 3, 'vejstykke': {'href': 'https://api.dataforsyningen.dk/vejstykker/851/7688', 'navn': 'Solsortvej', 'adresseringsnavn': 'Solsortvej', 'kode': '7688'}, 'husnr': '2', 'navngivenvej': {'href': 'https://api.dataforsyningen.dk/navngivneveje/f2d548ba-4501-427d-8b28-089df535780c', 'id': 'f2d548ba-4501-427d-8b28-089df5

In [74]:
len(data) # how many hits?



42



In [75]:
data[0].keys() # which fields/variables?



dict_keys(['id', 'kvhx', 'status', 'darstatus', 'href', 'historik', 'etage', 'dør', 'adressebetegnelse', 'adgangsadresse'])
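Several of these keys, such as `adgangsadresse`, hold nested dictionaries, so a record does not map directly onto one table row. As a rough sketch of what flattening means, the hypothetical helper `flatten` below turns a nested record into a single-level dictionary. The sample record is a heavily trimmed copy of the first hit printed above, and the resulting key names are this sketch's own convention, not necessarily the names the API itself uses for flat output.

```python
def flatten(record, parent_key="", sep="_"):
    """Recursively flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Descend into nested dicts, carrying the key path along.
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

# Trimmed-down record shaped like the nested response above (most fields omitted).
record = {
    "id": "0a3f50ca-baff-32b8-e044-0003ba298018",
    "adressebetegnelse": "Solsortvej 2, 9000 Aalborg",
    "adgangsadresse": {
        "husnr": "2",
        "vejstykke": {"navn": "Solsortvej", "kode": "7688"},
    },
}

flat_record = flatten(record)
print(flat_record)
```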



DAWA can return data in different formats via the `struktur` parameter:

In [76]:
adress_end = 'https://api.dataforsyningen.dk/adresser' # endpoint for searching addresses (the API has other endpoints)

# parameters/settings for the search (see the documentation)
parameters = {'q': 'Solsort*', # search for addresses matching Solsort (* is a wildcard)
              'kommunekode': '0851', # restrict to municipality code 0851 (Aalborg)
              'struktur': 'flad'} # return data in "flat" structure (not nested, so it converts directly to a table)

data_req = requests.get(adress_end, params = parameters)



In [77]:
data = data_req.json() # still parsed as JSON; now it is a flat list of records



In [78]:
data_df = pd.DataFrame(data)



In [79]:
data_df.head()



Unnamed: 0,id,status,darstatus,oprettet,ændret,ikrafttrædelse,nedlagt,vejkode,vejnavn,adresseringsvejnavn,husnr,etage,dør,supplerendebynavn,postnr,postnrnavn,stormodtagerpostnr,stormodtagerpostnrnavn,kommunekode,kommunenavn,ejerlavkode,ejerlavnavn,matrikelnr,esrejendomsnr,etrs89koordinat_øst,etrs89koordinat_nord,wgs84koordinat_bredde,wgs84koordinat_længde,nøjagtighed,kilde,tekniskstandard,tekstretning,ddkn_m100,ddkn_km1,ddkn_km10,adressepunktændringsdato,adgangsadresseid,adgangsadresse_status,adgangsadresse_darstatus,adgangsadresse_oprettet,adgangsadresse_ændret,adgangsadresse_ikrafttrædelse,adgangsadresse_nedlagt,regionskode,regionsnavn,jordstykke_ejerlavnavn,jordstykke_ejerlavkode,jordstykke_matrikelnr,jordstykke_esrejendomsnr,højde,adgangspunktid,vejpunkt_x,vejpunkt_y,vejpunkt_id,vejpunkt_kilde,vejpunkt_nøjagtighed,vejpunkt_tekniskstandard,vejpunkt_ændret,sognekode,sognenavn,politikredskode,politikredsnavn,retskredskode,retskredsnavn,opstillingskredskode,opstillingskredsnavn,zone,afstemningsområdenummer,afstemningsområdenavn,menighedsrådsafstemningsområdenummer,menighedsrådsafstemningsområdenavn,brofast,supplerendebynavn_dagi_id,navngivenvej_id,storkredsnummer,storkredsnavn,valglandsdelsbogstav,valglandsdelsnavn,landsdelsnuts3,landsdelsnavn,kvhx,kvh,betegnelse
0,0a3f50ca-baff-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,,7688,Solsortvej,Solsortvej,2,,,,9000,Aalborg,,,851,Aalborg,2005058,"Sohngårdsholm Hgd., Aalborg Jorder",14bh,0,556511.84,6321216.29,57.030949,9.931058,A,5,TN,131.7,100m_63212_5565,1km_6321_556,10km_632_55,2007-03-30T23:59:00.000,0a3f509c-b9d5-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2021-01-19T12:37:01.982,2000-02-05T21:22:49.000,,1081,Region Nordjylland,"Sohngårdsholm Hgd., Aalborg Jorder",2005058,14bh,0,17.1,0a3f509c-b9d5-32b8-e044-0003ba298018,9.930774,57.031029,1f5d7019-af45-11e7-847e-066cff24d637,Ekstern,B,V0,2018-05-03T14:08:02.125,8373,Hans Egedes,1460,Nordjyllands Politi,1178,Retten i Aalborg,90,Aalborg Øst,Udfaset,36,Aalborghus Gymnasium,32,Hans Egedes,True,,f2d548ba-4501-427d-8b28-089df535780c,10,Nordjylland,C,Midtjylland-Nordjylland,DK050,Nordjylland,08517688___2_______,08517688___2,"Solsortvej 2, 9000 Aalborg"
1,0a3f50ca-bb01-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,,7688,Solsortvej,Solsortvej,4,,,,9000,Aalborg,,,851,Aalborg,2005058,"Sohngårdsholm Hgd., Aalborg Jorder",13t,0,556518.96,6321244.5,57.031202,9.931182,A,5,TN,132.0,100m_63212_5565,1km_6321_556,10km_632_55,2007-03-30T23:59:00.000,0a3f509c-b9d7-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2021-01-19T12:37:01.982,2000-02-05T21:22:49.000,,1081,Region Nordjylland,"Sohngårdsholm Hgd., Aalborg Jorder",2005058,13t,0,17.3,0a3f509c-b9d7-32b8-e044-0003ba298018,9.930995,57.03126,1f5d701b-af45-11e7-847e-066cff24d637,Ekstern,B,V0,2018-05-03T14:08:02.125,8373,Hans Egedes,1460,Nordjyllands Politi,1178,Retten i Aalborg,90,Aalborg Øst,Udfaset,36,Aalborghus Gymnasium,32,Hans Egedes,True,,f2d548ba-4501-427d-8b28-089df535780c,10,Nordjylland,C,Midtjylland-Nordjylland,DK050,Nordjylland,08517688___4_______,08517688___4,"Solsortvej 4, 9000 Aalborg"
2,0a3f50ca-bb02-32b8-e044-0003ba298018,1,3,2000-02-05T21:25:36.000,2000-02-05T21:25:36.000,2000-02-05T21:25:36.000,,7688,Solsortvej,Solsortvej,5,,,,9000,Aalborg,,,851,Aalborg,2005058,"Sohngårdsholm Hgd., Aalborg Jorder",14en,0,556490.25,6321241.08,57.031175,9.930708,A,5,TN,132.7,100m_63212_5564,1km_6321_556,10km_632_55,2007-03-30T23:59:00.000,0a3f509c-b9d8-32b8-e044-0003ba298018,1,3,2000-02-05T21:25:36.000,2021-01-19T12:37:01.982,2000-02-05T21:25:36.000,,1081,Region Nordjylland,"Sohngårdsholm Hgd., Aalborg Jorder",2005058,14en,0,14.2,0a3f509c-b9d8-32b8-e044-0003ba298018,9.930869,57.03113,1f5d701c-af45-11e7-847e-066cff24d637,Ekstern,B,V0,2018-05-03T14:08:02.125,8373,Hans Egedes,1460,Nordjyllands Politi,1178,Retten i Aalborg,90,Aalborg Øst,Udfaset,36,Aalborghus Gymnasium,32,Hans Egedes,True,,f2d548ba-4501-427d-8b28-089df535780c,10,Nordjylland,C,Midtjylland-Nordjylland,DK050,Nordjylland,08517688___5_______,08517688___5,"Solsortvej 5, 9000 Aalborg"
3,0a3f50ca-bb03-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,,7688,Solsortvej,Solsortvej,6,,,,9000,Aalborg,,,851,Aalborg,2005058,"Sohngårdsholm Hgd., Aalborg Jorder",13s,0,556528.44,6321259.69,57.031337,9.931341,A,5,TN,131.4,100m_63212_5565,1km_6321_556,10km_632_55,2007-03-30T23:59:00.000,0a3f509c-b9d9-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2021-01-19T12:37:01.982,2000-02-05T21:22:49.000,,1081,Region Nordjylland,"Sohngårdsholm Hgd., Aalborg Jorder",2005058,13s,0,17.5,0a3f509c-b9d9-32b8-e044-0003ba298018,9.931141,57.031399,1f5d701d-af45-11e7-847e-066cff24d637,Ekstern,B,V0,2018-05-03T14:08:02.125,8373,Hans Egedes,1460,Nordjyllands Politi,1178,Retten i Aalborg,90,Aalborg Øst,Udfaset,36,Aalborghus Gymnasium,32,Hans Egedes,True,,f2d548ba-4501-427d-8b28-089df535780c,10,Nordjylland,C,Midtjylland-Nordjylland,DK050,Nordjylland,08517688___6_______,08517688___6,"Solsortvej 6, 9000 Aalborg"
4,0a3f50ca-bb04-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,2000-02-05T21:22:49.000,,7688,Solsortvej,Solsortvej,7,,,,9000,Aalborg,,,851,Aalborg,2005058,"Sohngårdsholm Hgd., Aalborg Jorder",13af,0,556502.07,6321258.5,57.03133,9.930907,A,5,TN,131.0,100m_63212_5565,1km_6321_556,10km_632_55,2007-03-30T23:59:00.000,0a3f509c-b9da-32b8-e044-0003ba298018,1,3,2000-02-05T21:22:49.000,2021-01-19T12:37:01.982,2000-02-05T21:22:49.000,,1081,Region Nordjylland,"Sohngårdsholm Hgd., Aalborg Jorder",2005058,13af,0,14.0,0a3f509c-b9da-32b8-e044-0003ba298018,9.931029,57.031292,1f5d701e-af45-11e7-847e-066cff24d637,Ekstern,B,V0,2018-05-03T14:08:02.125,8373,Hans Egedes,1460,Nordjyllands Politi,1178,Retten i Aalborg,90,Aalborg Øst,Udfaset,36,Aalborghus Gymnasium,32,Hans Egedes,True,,f2d548ba-4501-427d-8b28-089df535780c,10,Nordjylland,C,Midtjylland-Nordjylland,DK050,Nordjylland,08517688___7_______,08517688___7,"Solsortvej 7, 9000 Aalborg"
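
The `Unnamed: 0` column in the output above usually appears when a DataFrame is written with `to_csv` including its row index and then read back without `index_col`. The following is a minimal sketch of that behaviour and two ways around it. It assumes that this is what happened here; the `sample` table is a hypothetical stand-in that reuses only three column names and two rows from the output, and it works in memory rather than with the notebook's actual file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the address table above (three columns, two rows).
sample = pd.DataFrame(
    {
        "vejnavn": ["Solsortvej", "Solsortvej"],
        "husnr": [2, 4],
        "postnr": [9000, 9000],
    }
)

# to_csv writes the row index by default, as a first column with an empty header.
buffer = io.StringIO()
sample.to_csv(buffer)

# Reading it back without index_col turns that column into "Unnamed: 0".
buffer.seek(0)
with_extra = pd.read_csv(buffer)
print(list(with_extra.columns))

# Option 1: tell read_csv that the first column holds the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))

# Option 2: leave the index out when writing the file.
clean_buffer = io.StringIO()
sample.to_csv(clean_buffer, index=False)
clean_buffer.seek(0)
clean = pd.read_csv(clean_buffer)
print(list(clean.columns))
```

Either option keeps the saved table limited to the real data columns; which one fits depends on whether the row index carries any meaning that should survive the round trip.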




# Bluesky


# Telegram