# CC5 Scraper Task: Population Density in Swedish counties (län)

This notebook scrapes population density data for the Swedish counties from [Wikipedia - Sveriges län](https://sv.wikipedia.org/wiki/Sveriges_län). We use `pandas.read_html()` to extract the HTML tables directly from the page, then tidy the result into a CSV for use with a Vega-Lite choropleth map.

In [None]:
# Import packages
import pandas as pd
import numpy as np

In [None]:
import requests
from io import StringIO

# Fetch tables from Wikipedia
url = "https://sv.wikipedia.org/wiki/Sveriges_län"

# Add a User-Agent header to mimic a web browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Use requests to get the HTML content with the header
response = requests.get(url, headers=headers)
response.raise_for_status()

# pd.read_html() returns a list of all tables found on the page
tables = pd.read_html(StringIO(response.text))
print(f"Found {len(tables)} tables on the page")

Found 15 tables on the page


In [None]:
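# Find the main counties table: the first table with 20 to 25 rows
# (21 counties plus header and summary rows)
df = None
for table in tables:
    if 20 <= len(table) <= 25:
        df = table
        print(f"Found counties table with {len(df)} rows")
        break

# Fail early with a clear message instead of an AttributeError further down
if df is None:
    raise ValueError("No table with 20-25 rows found; the page layout may have changed")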

print(f"\nRaw table preview:")
df.head()

# Skip the two header rows (rows 0 and 1) and the last row (national total)
df = df.iloc[2:-1].reset_index(drop=True)

print(f"Data rows after skipping headers and national total: {len(df)}")
print(f"\nFirst few rows:")
df.head()

Found counties table with 24 rows

Raw table preview:
Data rows after skipping headers and national total: 21

First few rows:


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,Blekinge län,K,10,2 931,156 880,535,Karlskrona,5,1683,20
1,Dalarnas län,W,20,28 029,286 143,102,Falun,15,1634[5],6
2,Gotlands län,I,9,3 135,61 032,195,Visby,1,1645,16
3,Gävleborgs län,X,21,18 118,283 862,157,Gävle,10,1762,5
4,Hallands län,N,13,5 427,346 065,638,Halmstad,6,1658,18
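Selecting the table by row count works for the current page, but it could pick the wrong table if Wikipedia adds another one of similar length. A possible alternative, sketched below, is to pick the table by content: the first table whose first column has several cells ending in "län". The helper `find_counties_table`, its `min_matches` threshold, and the two small DataFrames are hypothetical stand-ins, not part of the notebook above.

```python
import pandas as pd

def find_counties_table(tables, min_matches=3):
    """Return the first table whose first column has at least
    min_matches cells ending in 'län' (hypothetical helper)."""
    for table in tables:
        if table.shape[1] == 0:
            continue  # skip tables without columns
        first_col = table.iloc[:, 0].astype(str).str.strip()
        if first_col.str.endswith("län").sum() >= min_matches:
            return table
    raise ValueError("No table with county names found")

# Hypothetical stand-ins for the list returned by pd.read_html()
other = pd.DataFrame({0: ["Landskap", "Skåne", "Småland"], 1: ["a", "b", "c"]})
counties = pd.DataFrame({
    0: ["Län", "Blekinge län", "Dalarnas län", "Gotlands län"],
    1: ["Kod", "K", "W", "I"],
})

found = find_counties_table([other, counties])
print(found.iloc[:, 0].tolist())
```

The match is case-sensitive, so a header cell such as "Län" does not count towards the threshold.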


In [None]:
# Extract columns and fix density (divide by 10)
df_clean = pd.DataFrame()

# Column 0: County name
df_clean['name'] = df.iloc[:, 0].astype(str)

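# Column 5: Density. Parsed values are 10x too high (535 should be 53.5),
# most likely because the page writes 53,5 with a decimal comma and
# read_html drops the comma as a thousands separator. Dividing by 10
# assumes every value has exactly one decimal place.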
df_clean['density'] = pd.to_numeric(df.iloc[:, 5], errors='coerce') / 10

print(f"Raw density values: {df.iloc[:, 5].head(5).tolist()}")
print(f"Fixed density values: {df_clean['density'].head(5).tolist()}")
display(df_clean)

Raw density values: ['535', '102', '195', '157', '638']
Fixed density values: [53.5, 10.2, 19.5, 15.7, 63.8]


Unnamed: 0,name,density
0,Blekinge län,53.5
1,Dalarnas län,10.2
2,Gotlands län,19.5
3,Gävleborgs län,15.7
4,Hallands län,63.8
5,Jämtlands län,2.7
6,Jönköpings län,35.5
7,Kalmar län,22.0
8,Kronobergs län,24.1
9,Norrbottens län,2.6
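Dividing by 10 relies on every density having exactly one decimal place. If the raw cell text can be kept as written on the page (`pd.read_html()` has `thousands` and `decimal` arguments that may help here), the numbers could instead be parsed explicitly. The sketch below assumes Swedish number formatting, with a space or non-breaking space as the thousands separator and a comma as the decimal mark; `parse_decimal_comma` is a hypothetical helper and the sample values are illustrative.

```python
import pandas as pd

def parse_decimal_comma(series):
    """Convert strings such as '2 931' or '53,5' to floats.
    Values that cannot be parsed become NaN."""
    cleaned = (
        series.astype(str)
        .str.replace("\u00a0", "", regex=False)  # drop non-breaking spaces
        .str.replace(" ", "", regex=False)       # drop regular spaces
        .str.replace(",", ".", regex=False)      # decimal comma -> decimal point
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Illustrative input in the assumed page format
raw = pd.Series(["53,5", "10,2", "380,9", "2 931", "28\u00a0029", "n/a"])
parsed = parse_decimal_comma(raw)
print(parsed.tolist())
```

With this approach a value written without a decimal part would still be parsed correctly, which the divide-by-10 fix cannot guarantee.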


In [None]:
# Clean county names to match se.json GeoJSON format

def clean_county_name(name):
    name = str(name).strip()
    if name.endswith('s län'):
        name = name[:-5]
    elif name.endswith(' län'):
        name = name[:-4]
    return name

df_clean['name'] = df_clean['name'].apply(clean_county_name)

print(f"Cleaned county names:")
print(df_clean['name'].tolist())

Cleaned county names:
['Blekinge', 'Dalarna', 'Gotland', 'Gävleborg', 'Halland', 'Jämtland', 'Jönköping', 'Kalmar', 'Kronoberg', 'Norrbotten', 'Skåne', 'Stockholm', 'Södermanland', 'Uppsala', 'Värmland', 'Västerbotten', 'Västernorrland', 'Västmanland', 'Västra Götaland', 'Örebro', 'Östergötland']
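Because the choropleth joins the CSV to the map by name, any mismatch between the cleaned names and the names in the GeoJSON would leave a county blank. A quick set comparison can surface such mismatches. In the sketch below, `compare_names` is a hypothetical helper and `geo_names` is made-up sample data; the real names would have to be read from `se.json`, whose structure is not shown in this notebook.

```python
def compare_names(data_names, geo_names):
    """Return (names only in the data, names only in the GeoJSON), both sorted."""
    data_set, geo_set = set(data_names), set(geo_names)
    return sorted(data_set - geo_set), sorted(geo_set - data_set)

data_names = ["Blekinge", "Dalarna", "Gotland"]
geo_names = ["Blekinge", "Dalarna", "Gotlands"]  # hypothetical GeoJSON names

only_in_data, only_in_geo = compare_names(data_names, geo_names)
print("Only in data:", only_in_data)
print("Only in GeoJSON:", only_in_geo)
```

Two empty lists would mean that every county in the data has a matching shape and vice versa.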


In [None]:
# Final cleanup
df_final = df_clean[['name', 'density']].copy()
df_final = df_final.dropna(subset=['density'])

print(f"Final dataset: {len(df_final)} counties")
print(f"\nTop 5 by density:")
print(df_final.nlargest(5, 'density').to_string(index=False))
print(f"\nBottom 5 by density:")
print(df_final.nsmallest(5, 'density').to_string(index=False))

Final dataset: 21 counties

Top 5 by density:
           name  density
      Stockholm    380.9
          Skåne    130.7
Västra Götaland     74.7
        Halland     63.8
    Västmanland     54.9

Bottom 5 by density:
          name  density
    Norrbotten      2.6
      Jämtland      2.7
  Västerbotten      5.1
       Dalarna     10.2
Västernorrland     11.2
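Before exporting, it may be worth checking the dataset programmatically rather than by eye. The sketch below shows one way to do that; `validate_counties` and its checks are assumptions about what a valid dataset looks like (expected row count, unique names, positive non-missing densities), and the three-row sample reuses values from the output above.

```python
import pandas as pd

def validate_counties(df, expected_rows=21):
    """Return a list of problems found in df (empty if none).
    Hypothetical helper; the checks are illustrative."""
    problems = []
    if len(df) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(df)}")
    if df["name"].duplicated().any():
        problems.append("duplicate county names")
    if df["density"].isna().any():
        problems.append("missing density values")
    if (df["density"] <= 0).any():
        problems.append("non-positive density values")
    return problems

sample = pd.DataFrame({
    "name": ["Blekinge", "Dalarna", "Gotland"],
    "density": [53.5, 10.2, 19.5],
})
print(validate_counties(sample, expected_rows=3))

broken = sample.copy()
broken.loc[2, "name"] = "Dalarna"
print(validate_counties(broken, expected_rows=3))
```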


In [None]:
# Export to CSV and download
df_final.to_csv('cc5_data.csv', index=False)

print("Data exported to cc5_data.csv")
print(f"\nAll rows:")
print(df_final.to_string(index=False))

from google.colab import files
files.download('cc5_data.csv')

Data exported to cc5_data.csv

All rows:
           name  density
       Blekinge     53.5
        Dalarna     10.2
        Gotland     19.5
      Gävleborg     15.7
        Halland     63.8
       Jämtland      2.7
      Jönköping     35.5
         Kalmar     22.0
      Kronoberg     24.1
     Norrbotten      2.6
          Skåne    130.7
      Stockholm    380.9
   Södermanland     49.5
        Uppsala     50.1
       Värmland     16.2
   Västerbotten      5.1
 Västernorrland     11.2
    Västmanland     54.9
Västra Götaland     74.7
         Örebro     36.3
   Östergötland     44.8
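The export cell imports `google.colab`, so it only runs inside Colab. If the notebook is also run locally, the download step could be made optional. The sketch below is one way to do that; `export_csv` is a hypothetical helper, the sample rows reuse values from the output above, and the file is written to a temporary directory purely for illustration.

```python
import os
import tempfile

import pandas as pd

def export_csv(df, path):
    """Write df to path. Trigger a browser download only inside Colab.
    Returns True if a download was requested, False otherwise."""
    df.to_csv(path, index=False)
    try:
        from google.colab import files  # only importable inside Google Colab
    except ImportError:
        return False
    files.download(path)
    return True

sample = pd.DataFrame({"name": ["Blekinge", "Dalarna"], "density": [53.5, 10.2]})
out_path = os.path.join(tempfile.mkdtemp(), "cc5_data.csv")

downloaded = export_csv(sample, out_path)
round_trip = pd.read_csv(out_path)  # read the file back as a basic check
print(round_trip)
```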



In [None]:
# Export to CSV and download
df_final.to_csv('cc5_data2.csv', index=False)

print("Data exported to cc5_data2.csv")
print(f"\nFirst 10 rows:")
print(df_final.head(10).to_string(index=False))

from google.colab import files

files.download('cc5_data2.csv')

Data exported to cc5_data2.csv

First 10 rows:
municipality code  density
         Ale 1440   102.82
    Alingsås 1489    90.70
     Alvesta 0764    20.29
       Aneby 0604    13.24
      Arboga 1984    42.83
    Arjeplog 2506     0.20
  Arvidsjaur 2505     1.07
      Arvika 1784    15.43
   Askersund 1882    13.97
      Avesta 2084    36.55

