# DAIOE SSYK2012

This notebook imports and processes the raw DAIOE dataset, specifically the SSYK 2012 version.


## Setup and data sources
Set up local paths and define the DAIOE CSV and SCB Parquet sources used throughout the notebook.


In [40]:
import polars as pl
from pathlib import Path


ROOT = Path.cwd().resolve()
DATA_DIR = ROOT / "data"
DATA_DIR.mkdir(parents=True, exist_ok=True)

DAIOE_SOURCE: str = (
    "https://raw.githubusercontent.com/joseph-data/07_translate_ssyk/main/"
    "03_translated_files/daioe_ssyk2012_translated.csv"
)

SCB_SOURCE: str = (
    "https://raw.githubusercontent.com/joseph-data/daioe-explorer-years/development/"
    "data/processed/ssyk12_aggregated_ssyk4_to_ssyk1.parquet"
)

## Load data lazily
Load both datasets as Polars LazyFrames so the pipeline stays memory-efficient.


In [41]:
daioe_lazy_lf = pl.scan_csv(
    DAIOE_SOURCE
)

scb_lazy_lf = pl.scan_parquet(
    SCB_SOURCE
)

## Quick sanity checks
Preview a few rows from each source to confirm the schemas and data look as expected.


In [42]:
print(daioe_lazy_lf.head(5).collect())

shape: (5, 27)
┌────────────┬──────┬────────────┬────────────┬───┬────────────┬───────────┬───────────┬───────────┐
│ ssyk2012_4 ┆ year ┆ daioe_alla ┆ daioe_stra ┆ … ┆ pctl_rank_ ┆ ssyk2012_ ┆ ssyk2012_ ┆ ssyk2012_ │
│ ---        ┆ ---  ┆ pps        ┆ tgames     ┆   ┆ genai      ┆ 1         ┆ 2         ┆ 3         │
│ str        ┆ i64  ┆ ---        ┆ ---        ┆   ┆ ---        ┆ ---       ┆ ---       ┆ ---       │
│            ┆      ┆ f64        ┆ f64        ┆   ┆ f64        ┆ str       ┆ str       ┆ str       │
╞════════════╪══════╪════════════╪════════════╪═══╪════════════╪═══════════╪═══════════╪═══════════╡
│ 0110 Commi ┆ 2010 ┆ null       ┆ null       ┆ … ┆ null       ┆ 0 Armed   ┆ 01        ┆ 011 Commi │
│ ssioned    ┆      ┆            ┆            ┆   ┆            ┆ forces    ┆ Officers  ┆ ssioned   │
│ armed      ┆      ┆            ┆            ┆   ┆            ┆ occupatio ┆           ┆ armed     │
│ forces…    ┆      ┆            ┆            ┆   ┆            ┆ ns        ┆

In [43]:
print(scb_lazy_lf.head(5).collect())

shape: (5, 7)
┌───────┬───────────┬───────┬───────┬──────┬───────┬─────────────────────────────────┐
│ level ┆ ssyk_code ┆ age   ┆ sex   ┆ year ┆ count ┆ occupation                      │
│ ---   ┆ ---       ┆ ---   ┆ ---   ┆ ---  ┆ ---   ┆ ---                             │
│ str   ┆ str       ┆ str   ┆ str   ┆ i64  ┆ i64   ┆ str                             │
╞═══════╪═══════════╪═══════╪═══════╪══════╪═══════╪═════════════════════════════════╡
│ SSYK4 ┆ 7116      ┆ 45-49 ┆ women ┆ 2018 ┆ 8     ┆ Scaffold builders               │
│ SSYK4 ┆ 3514      ┆ 25-29 ┆ men   ┆ 2019 ┆ 1065  ┆ Computer network and systems t… │
│ SSYK4 ┆ 3522      ┆ 16-24 ┆ women ┆ 2022 ┆ 30    ┆ Light, sound and stage technic… │
│ SSYK4 ┆ 7131      ┆ 45-49 ┆ women ┆ 2017 ┆ 55    ┆ Painters and related workers    │
│ SSYK4 ┆ 7233      ┆ 16-24 ┆ women ┆ 2020 ┆ 253   ┆ Agricultural and industrial ma… │
└───────┴───────────┴───────┴───────┴──────┴───────┴─────────────────────────────────┘


In [44]:
# daioe_lazy_lf.collect_schema()

In [45]:
# scb_lazy_lf.collect_schema()  # resolves the schema without collecting the data

## Derive SSYK levels and align years
Split `ssyk2012_4` into 1- to 4-digit codes (`code_1` to `code_4`), drop the original `ssyk2012_*` columns, and keep the SSYK12 era (2014 onward).
If SCB covers later years than DAIOE, extend the DAIOE series by carrying its latest year forward.


In [46]:
daioe_lazy_lf_ssyk12 = (
    daioe_lazy_lf
    # Derive the 1- to 4-digit codes from the leading digits of the SSYK4 label
    .with_columns(
        pl.col("ssyk2012_4").str.slice(0, 1).alias("code_1"),
        pl.col("ssyk2012_4").str.slice(0, 2).alias("code_2"),
        pl.col("ssyk2012_4").str.slice(0, 3).alias("code_3"),
        pl.col("ssyk2012_4").str.slice(0, 4).alias("code_4"),
    )
    # Drop every original column whose name starts with "ssyk2012"
    .drop(pl.col("^ssyk2012.*$"))
    # Keep the years from the first SSYK12 publication (2014) onward
    .filter(pl.col("year") >= 2014)
)

In [47]:
base = daioe_lazy_lf_ssyk12

daioe_max = base.select(pl.max("year")).collect().item()
scb_max   = scb_lazy_lf.select(pl.max("year")).collect().item()

missing = list(range(daioe_max + 1, scb_max + 1))

daioe_lazy_lf_extended = (
    base
    if not missing
    else pl.concat(
        [
            base,
            base
            .filter(pl.col("year") == daioe_max)
            .drop("year")
            .join(pl.LazyFrame({"year": missing}), how="cross")
            .select(base.collect_schema().names()),  # ensure same column order/schema
        ],
        how="vertical",
    )
)



In [48]:
def inspect_lazy(lf: pl.LazyFrame) -> None:
    """
    Print the shape of a Polars LazyFrame in a memory-efficient manner.

    This function computes the number of rows using a lazy row-count
    aggregation (`pl.len()`) and retrieves the number of columns from
    the resolved schema without materializing the full dataset.

    Parameters
    ----------
    lf : pl.LazyFrame
        The LazyFrame to inspect.

    Notes
    -----
    - The row count triggers execution of the lazy query plan,
      but avoids collecting all columns into memory.
    - The column count is obtained from the schema metadata and
      does not require data materialization.
    - Intended for debugging and validation of large lazy pipelines.
    """
    n_rows = lf.select(pl.len()).collect().item()
    n_cols = len(lf.collect_schema())
    print(f"Rows: {n_rows:,}")
    print(f"Columns: {n_cols}")


In [49]:
inspect_lazy(daioe_lazy_lf_extended)


Rows: 4,719
Columns: 27


## Build SCB SSYK4 counts
Aggregate SCB to 4-digit SSYK by year to create employment counts used as weights.


In [50]:
scb_lazy_lf_level4 = (
    scb_lazy_lf
        .filter(pl.col("ssyk_code").str.len_chars() == 4)
        .group_by(["year", "ssyk_code"])
        .agg(pl.col("count").sum().alias("total_count"))
)



In [51]:
inspect_lazy(scb_lazy_lf_level4)

Rows: 4,719
Columns: 3


## Merge and filter
Join DAIOE rows to SCB counts by year and SSYK4 code, check for DAIOE codes without an SCB match, and remove the armed forces group (`code_1 == "0"`).


In [52]:
daioe_lazy_lf_extended.head(5).collect()

year,daioe_allapps,daioe_stratgames,daioe_videogames,daioe_imgrec,daioe_imgcompr,daioe_imggen,daioe_readcompr,daioe_lngmod,daioe_translat,daioe_speechrec,daioe_genai,pctl_rank_allapps,pctl_rank_stratgames,pctl_rank_videogames,pctl_rank_imgrec,pctl_rank_imgcompr,pctl_rank_imggen,pctl_rank_readcompr,pctl_rank_lngmod,pctl_rank_translat,pctl_rank_speechrec,pctl_rank_genai,code_1,code_2,code_3,code_4
i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str,str,str,str
2014,,,,,,,,,,,,,,,,,,,,,,,"""0""","""01""","""011""","""0110"""
2015,,,,,,,,,,,,,,,,,,,,,,,"""0""","""01""","""011""","""0110"""
2016,,,,,,,,,,,,,,,,,,,,,,,"""0""","""01""","""011""","""0110"""
2017,,,,,,,,,,,,,,,,,,,,,,,"""0""","""01""","""011""","""0110"""
2018,,,,,,,,,,,,,,,,,,,,,,,"""0""","""01""","""011""","""0110"""


In [53]:
scb_lazy_lf_level4.head(5).collect()

year,ssyk_code,total_count
i64,str,i64
2022,"""0110""",2577
2023,"""4224""",11424
2017,"""5223""",103709
2015,"""8181""",5387
2014,"""3116""",1176


In [54]:
daioe_scb_years = daioe_lazy_lf_extended\
    .join(
        scb_lazy_lf_level4,
        left_on=["year", "code_4"],
        right_on=["year", "ssyk_code"],
        how="left"
    )

In [55]:
inspect_lazy(daioe_scb_years)

Rows: 4,719
Columns: 28


In [56]:
# DAIOE codes with no SCB match
daioe_scb_years_unmatched = daioe_lazy_lf_extended\
    .join(
        scb_lazy_lf_level4,
        left_on=["year", "code_4"],
        right_on=["year", "ssyk_code"],
        how="anti"
    )



In [57]:
inspect_lazy(daioe_scb_years_unmatched)

Rows: 0
Columns: 27


In [58]:
daioe_scb_years.collect_schema()

Schema([('year', Int64),
        ('daioe_allapps', Float64),
        ('daioe_stratgames', Float64),
        ('daioe_videogames', Float64),
        ('daioe_imgrec', Float64),
        ('daioe_imgcompr', Float64),
        ('daioe_imggen', Float64),
        ('daioe_readcompr', Float64),
        ('daioe_lngmod', Float64),
        ('daioe_translat', Float64),
        ('daioe_speechrec', Float64),
        ('daioe_genai', Float64),
        ('pctl_rank_allapps', Float64),
        ('pctl_rank_stratgames', Float64),
        ('pctl_rank_videogames', Float64),
        ('pctl_rank_imgrec', Float64),
        ('pctl_rank_imgcompr', Float64),
        ('pctl_rank_imggen', Float64),
        ('pctl_rank_readcompr', Float64),
        ('pctl_rank_lngmod', Float64),
        ('pctl_rank_translat', Float64),
        ('pctl_rank_speechrec', Float64),
        ('pctl_rank_genai', Float64),
        ('code_1', String),
        ('code_2', String),
        ('code_3', String),
        ('code_4', String),
        ('tot

In [59]:
daioe_scb_years.collect_schema().names()

['year',
 'daioe_allapps',
 'daioe_stratgames',
 'daioe_videogames',
 'daioe_imgrec',
 'daioe_imgcompr',
 'daioe_imggen',
 'daioe_readcompr',
 'daioe_lngmod',
 'daioe_translat',
 'daioe_speechrec',
 'daioe_genai',
 'pctl_rank_allapps',
 'pctl_rank_stratgames',
 'pctl_rank_videogames',
 'pctl_rank_imgrec',
 'pctl_rank_imgcompr',
 'pctl_rank_imggen',
 'pctl_rank_readcompr',
 'pctl_rank_lngmod',
 'pctl_rank_translat',
 'pctl_rank_speechrec',
 'pctl_rank_genai',
 'code_1',
 'code_2',
 'code_3',
 'code_4',
 'total_count']

In [60]:
# Omit the armed forces occupations (code_1 == "0") from the data

daioe_scb_filtered = daioe_scb_years\
    .filter(pl.col("code_1") != "0")

In [61]:
inspect_lazy(daioe_scb_filtered)

Rows: 4,686
Columns: 28


## Identify DAIOE measure columns
Collect all DAIOE indicator columns and define the weight column used for averaging.


In [62]:
daioe_cols = [
    c for c in daioe_scb_filtered.collect_schema().names()
    if c.startswith("daioe_")
]

w = pl.col("total_count")


## Aggregate to SSYK3/2/1
Compute simple and employment-weighted averages for each higher SSYK level.


In [63]:

daioe_scb_lv3 = (
    daioe_scb_filtered
    .select(["code_3", "year", "total_count", *daioe_cols])
    .group_by(["year", "code_3"])
    .agg(
        [
            w.sum().alias("weight_sum"),

            # simple averages
            *[pl.col(c).mean().alias(f"{c}_avg") for c in daioe_cols],

            # employment-weighted averages
            *[
                pl.when(w.sum() > 0)
                  .then((pl.col(c) * w).sum() / w.sum())
                  .otherwise(None)
                  .alias(f"{c}_wavg")
                for c in daioe_cols
            ],
        ]
    )
    .with_columns(pl.lit("SSYK3").alias("level"))
    .rename({"code_3": "ssyk_code"})
)

# preview
daioe_scb_lv3.limit(10).collect()


year,ssyk_code,weight_sum,daioe_allapps_avg,daioe_stratgames_avg,daioe_videogames_avg,daioe_imgrec_avg,daioe_imgcompr_avg,daioe_imggen_avg,daioe_readcompr_avg,daioe_lngmod_avg,daioe_translat_avg,daioe_speechrec_avg,daioe_genai_avg,daioe_allapps_wavg,daioe_stratgames_wavg,daioe_videogames_wavg,daioe_imgrec_wavg,daioe_imgcompr_wavg,daioe_imggen_wavg,daioe_readcompr_wavg,daioe_lngmod_wavg,daioe_translat_wavg,daioe_speechrec_wavg,daioe_genai_wavg,level
i64,str,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str
2015,"""131""",12036,10.082962,0.265683,2.910249,0.217128,,0.033146,0.006921,0.02851,0.003783,0.252389,0.102772,10.205928,0.269076,2.942594,0.220362,0.0,0.03363,0.007011,0.028893,0.00383,0.25554,0.104207,"""SSYK3"""
2021,"""334""",17448,31.711133,0.297017,4.758023,0.474671,0.173088,0.7185,0.257265,0.497106,0.055468,0.588016,2.170395,31.511932,0.292851,4.779009,0.462283,0.168886,0.707053,0.250944,0.490015,0.05578,0.596242,2.137162,"""SSYK3"""
2020,"""511""",7139,21.654744,0.224079,4.158179,0.302405,0.071579,0.408729,0.144921,0.254007,0.031204,0.360517,1.148163,21.449291,0.223942,4.24379,0.302709,0.07046,0.403061,0.138183,0.239028,0.029441,0.341331,1.109823,"""SSYK3"""
2023,"""323""",150,29.499712,0.255629,4.773472,0.36096,0.189297,0.555734,0.323761,0.791526,0.031492,0.620364,2.208142,29.499712,0.255629,4.773472,0.36096,0.189297,0.555734,0.323761,0.791526,0.031492,0.620364,2.208142,"""SSYK3"""
2019,"""952""",141,21.963699,0.232202,4.333576,0.246238,0.067198,0.377367,0.198835,0.143277,0.040808,0.417719,0.871231,21.963699,0.232202,4.333576,0.246238,0.067198,0.377367,0.198835,0.143277,0.040808,0.417719,0.871231,"""SSYK3"""
2014,"""172""",6457,4.340442,0.16692,1.861195,0.099235,,,0.002187,0.008943,,0.032316,0.008943,4.340442,0.16692,1.861195,0.099235,0.0,0.0,0.002187,0.008943,0.0,0.032316,0.008943,"""SSYK3"""
2024,"""321""",30778,33.085011,0.285715,5.361805,0.456064,0.23029,0.670565,0.337148,0.835958,0.032991,0.642516,2.477967,33.912202,0.295117,5.446787,0.476181,0.238926,0.697894,0.349179,0.859965,0.033545,0.643735,2.563218,"""SSYK3"""
2015,"""171""",1290,6.861485,0.16499,2.115214,0.138707,,0.020555,0.003765,0.016506,0.002361,0.171425,0.061712,6.861485,0.16499,2.115214,0.138707,0.0,0.020555,0.003765,0.016506,0.002361,0.171425,0.061712,"""SSYK3"""
2020,"""211""",6284,31.118139,0.355452,4.773976,0.494157,0.117805,0.754028,0.276673,0.444693,0.048478,0.466074,2.073284,30.860926,0.354611,4.959237,0.481086,0.11395,0.746307,0.263443,0.422859,0.045667,0.441624,2.018145,"""SSYK3"""
2015,"""753""",2244,9.52848,0.200762,3.914311,0.165104,,0.020186,0.002593,0.01138,0.00156,0.128208,0.051625,9.626892,0.205913,3.936686,0.168816,0.0,0.021379,0.002676,0.011623,0.001568,0.127053,0.053935,"""SSYK3"""


In [64]:
inspect_lazy(daioe_scb_lv3)

Rows: 1,595
Columns: 26


In [65]:

daioe_scb_lv2 = (
    daioe_scb_filtered
    .select(["code_2", "year", "total_count", *daioe_cols])
    .group_by(["year", "code_2"])
    .agg(
        [
            w.sum().alias("weight_sum"),

            # simple averages
            *[pl.col(c).mean().alias(f"{c}_avg") for c in daioe_cols],

            # employment-weighted averages
            *[
                pl.when(w.sum() > 0)
                  .then((pl.col(c) * w).sum() / w.sum())
                  .otherwise(None)
                  .alias(f"{c}_wavg")
                for c in daioe_cols
            ],
        ]
    )
    .with_columns(pl.lit("SSYK2").alias("level"))
    .rename({"code_2": "ssyk_code"})
)

# preview
daioe_scb_lv2.limit(10).collect()

year,ssyk_code,weight_sum,daioe_allapps_avg,daioe_stratgames_avg,daioe_videogames_avg,daioe_imgrec_avg,daioe_imgcompr_avg,daioe_imggen_avg,daioe_readcompr_avg,daioe_lngmod_avg,daioe_translat_avg,daioe_speechrec_avg,daioe_genai_avg,daioe_allapps_wavg,daioe_stratgames_wavg,daioe_videogames_wavg,daioe_imgrec_wavg,daioe_imgcompr_wavg,daioe_imggen_wavg,daioe_readcompr_wavg,daioe_lngmod_wavg,daioe_translat_wavg,daioe_speechrec_wavg,daioe_genai_wavg,level
i64,str,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str
2022,"""14""",17704,28.661194,0.256797,3.627298,0.3979,0.166104,0.721434,0.275656,0.857831,0.044841,0.472465,2.689522,28.401272,0.25508,3.540188,0.395993,0.165489,0.722513,0.27655,0.857228,0.044586,0.467842,2.690405,"""SSYK2"""
2014,"""22""",184159,4.900888,0.199026,1.998374,0.125763,,,0.002944,0.011388,,0.037848,0.011388,4.819378,0.195335,1.981483,0.12121,0.0,0.0,0.002844,0.010984,0.0,0.036609,0.010984,"""SSYK2"""
2017,"""41""",174205,17.986941,0.331663,3.343741,0.25867,0.085011,0.390323,0.191925,0.055725,0.018884,0.3862,0.660474,18.410376,0.34387,3.445345,0.266098,0.08696,0.396785,0.194542,0.05655,0.019079,0.389609,0.671109,"""SSYK2"""
2016,"""94""",77315,12.279564,0.220978,3.581027,0.137045,0.032271,0.20958,0.071533,0.021236,0.005275,0.176611,0.330492,11.827262,0.209054,3.688176,0.13635,0.030571,0.195233,0.058271,0.017358,0.004221,0.146249,0.299694,"""SSYK2"""
2019,"""91""",89203,19.155047,0.221923,5.198164,0.235953,0.053136,0.283174,0.108796,0.075123,0.018824,0.211297,0.578372,19.089253,0.217379,5.335997,0.239496,0.05294,0.268549,0.100915,0.069598,0.017297,0.197583,0.544674,"""SSYK2"""
2021,"""31""",145409,28.189569,0.290318,5.199152,0.465574,0.155478,0.714877,0.180831,0.331247,0.032133,0.345302,1.82304,28.06428,0.292341,5.253365,0.461967,0.153823,0.701072,0.179247,0.32626,0.031488,0.334986,1.791942,"""SSYK2"""
2017,"""52""",249056,13.432956,0.229226,2.994268,0.178198,0.057999,0.277953,0.111036,0.033742,0.012077,0.266715,0.451455,13.32115,0.231296,3.043985,0.183794,0.057758,0.285976,0.102756,0.031317,0.010628,0.239041,0.454186,"""SSYK2"""
2020,"""52""",237985,22.802933,0.229452,4.34126,0.303107,0.074389,0.440471,0.159327,0.282369,0.035748,0.393733,1.250983,22.518551,0.231387,4.442495,0.314319,0.074099,0.454267,0.145619,0.258981,0.030905,0.348028,1.228315,"""SSYK2"""
2023,"""95""",260,32.952858,0.232309,4.38107,0.370014,0.219624,0.612759,0.413124,1.074724,0.048462,0.91601,2.73825,32.952858,0.232309,4.38107,0.370014,0.219624,0.612759,0.413124,1.074724,0.048462,0.91601,2.73825,"""SSYK2"""
2018,"""42""",67496,18.59711,0.278199,3.478149,0.228782,0.071474,0.394708,0.162153,0.111515,0.023218,0.413538,0.813338,19.164634,0.289911,3.532226,0.240807,0.074783,0.429082,0.170093,0.115198,0.023618,0.407423,0.870127,"""SSYK2"""


In [66]:
inspect_lazy(daioe_scb_lv2)

Rows: 473
Columns: 26


In [67]:

daioe_scb_lv1 = (
    daioe_scb_filtered
    .select(["code_1", "year", "total_count", *daioe_cols])
    .group_by(["year", "code_1"])
    .agg(
        [
            w.sum().alias("weight_sum"),

            # simple averages
            *[pl.col(c).mean().alias(f"{c}_avg") for c in daioe_cols],

            # employment-weighted averages
            *[
                pl.when(w.sum() > 0)
                  .then((pl.col(c) * w).sum() / w.sum())
                  .otherwise(None)
                  .alias(f"{c}_wavg")
                for c in daioe_cols
            ],
        ]
    )
    .with_columns(pl.lit("SSYK1").alias("level"))
    .rename({"code_1": "ssyk_code"})
)

# preview
daioe_scb_lv1.limit(10).collect()

year,ssyk_code,weight_sum,daioe_allapps_avg,daioe_stratgames_avg,daioe_videogames_avg,daioe_imgrec_avg,daioe_imgcompr_avg,daioe_imggen_avg,daioe_readcompr_avg,daioe_lngmod_avg,daioe_translat_avg,daioe_speechrec_avg,daioe_genai_avg,daioe_allapps_wavg,daioe_stratgames_wavg,daioe_videogames_wavg,daioe_imgrec_wavg,daioe_imgcompr_wavg,daioe_imggen_wavg,daioe_readcompr_wavg,daioe_lngmod_wavg,daioe_translat_wavg,daioe_speechrec_wavg,daioe_genai_wavg,level
i64,str,i64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,str
2021,"""9""",239382,22.765384,0.22139,5.211372,0.345287,0.112233,0.459916,0.112948,0.220988,0.023036,0.271673,1.186927,22.277239,0.218067,5.379357,0.339883,0.10794,0.429242,0.101903,0.199146,0.020099,0.243012,1.094453,"""SSYK1"""
2020,"""3""",593828,25.998325,0.28527,4.775846,0.377033,0.088372,0.543334,0.195447,0.322775,0.037203,0.390414,1.496045,26.291625,0.289053,4.647714,0.383803,0.090891,0.553179,0.205654,0.340992,0.039473,0.408396,1.547965,"""SSYK1"""
2015,"""7""",385178,9.126059,0.1983,3.632182,0.164148,,0.020246,0.002718,0.01158,0.001621,0.131513,0.052242,8.918352,0.195749,3.525873,0.162469,0.0,0.020315,0.002689,0.011373,0.001609,0.129819,0.051989,"""SSYK1"""
2019,"""4""",364056,25.121251,0.295991,4.698125,0.32175,0.080641,0.453838,0.225223,0.160009,0.043288,0.454407,1.020646,25.772734,0.309839,4.92113,0.33434,0.082966,0.466006,0.22909,0.160871,0.042688,0.438213,1.040282,"""SSYK1"""
2018,"""8""",317621,15.919269,0.246893,4.604306,0.198016,0.051548,0.286954,0.073528,0.046725,0.009141,0.174037,0.502825,16.005425,0.244558,4.544924,0.205939,0.053733,0.298741,0.075525,0.047931,0.009502,0.177868,0.521336,"""SSYK1"""
2024,"""2""",1273548,36.43596,0.300564,4.399623,0.47834,0.260963,0.816614,0.487964,1.17935,0.046742,0.838765,3.267108,37.614616,0.31157,4.411262,0.490161,0.268846,0.851257,0.510163,1.239915,0.048799,0.878971,3.423775,"""SSYK1"""
2021,"""7""",399245,24.166788,0.248149,5.71469,0.375726,0.119025,0.534775,0.11084,0.206973,0.020461,0.241668,1.272799,23.674488,0.244359,5.529758,0.371823,0.117563,0.536624,0.10955,0.20315,0.020285,0.238143,1.267686,"""SSYK1"""
2018,"""2""",1094556,18.470385,0.300338,3.460411,0.24227,0.073991,0.450894,0.174386,0.109693,0.020347,0.320905,0.885633,18.721711,0.306073,3.429893,0.244394,0.07506,0.461841,0.179081,0.113578,0.021009,0.333683,0.910635,"""SSYK1"""
2023,"""1""",325195,33.409918,0.271655,3.912004,0.437919,0.241536,0.75333,0.444917,1.09101,0.044111,0.796681,3.021413,34.363217,0.280235,3.977586,0.454216,0.250032,0.783522,0.461122,1.126364,0.04522,0.813577,3.129474,"""SSYK1"""
2016,"""9""",235399,12.368514,0.21964,3.56791,0.154175,0.035831,0.225512,0.072541,0.020552,0.005171,0.167438,0.347305,12.211664,0.21614,3.689088,0.151661,0.034402,0.209998,0.065042,0.018443,0.004502,0.149821,0.32139,"""SSYK1"""


In [68]:
inspect_lazy(daioe_scb_lv1)

Rows: 99
Columns: 26


## Generalized aggregation + percentiles
Use a reusable function to aggregate all SSYK levels and add within-year percentiles for each DAIOE metric.


In [69]:
import polars as pl

def aggregate_daioe_level(
    lf: pl.LazyFrame,
    code_col: str,
    level_label: str,
    weight_col: str = "total_count",
    prefix: str = "daioe_",
    add_percentiles: bool = True,
    pct_scale: int = 100,
    descending: bool = False,
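) -> pl.LazyFrame:
    """Aggregate DAIOE metrics to one SSYK level, optionally adding percentiles.

    Groups by year and `code_col`, computes the simple (`_avg`) and
    `weight_col`-weighted (`_wavg`) mean of every column starting with
    `prefix`, and, if `add_percentiles` is true, adds `pctl_`-prefixed
    within-year percentile ranks scaled to the range 0 to `pct_scale`.
    """
    daioe_cols = [c for c in lf.collect_schema().names() if c.startswith(prefix)]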
    w = pl.col(weight_col)

    out = (
        lf
        .group_by(["year", code_col])
        .agg(
            w.sum().alias("weight_sum"),
            pl.col(daioe_cols).mean().name.suffix("_avg"),
            ((pl.col(daioe_cols) * w).sum() / w.sum()).name.suffix("_wavg"),
        )
        .with_columns(pl.lit(level_label).alias("level"))
        .rename({code_col: "ssyk_code"})
    )

    if not add_percentiles:
        return out

    group_keys = ["year", "level"]

    rank_expr = (
        pl.col(f"^{prefix}.*_(avg|wavg)$")
        .rank(method="average", descending=descending)
        .over(group_keys)
    )

    n_expr = pl.len().over(group_keys)

    return out.with_columns(
        (
            pl.when(n_expr > 1)
            .then((rank_expr - 1) / (n_expr - 1))
            .otherwise(0.0)
            * pct_scale
        ).name.prefix("pctl_")
    )


In [70]:
levels = {
    "code_4": "SSYK4",
    "code_3": "SSYK3",
    "code_2": "SSYK2",
    "code_1": "SSYK1",
}

aggregated = [
    aggregate_daioe_level(daioe_scb_filtered, col, label)
    for col, label in levels.items()
]

daioe_all_levels = (
    pl.concat(aggregated)
    .sort(["level", "year", "ssyk_code"])
)


In [71]:
inspect_lazy(daioe_all_levels)

Rows: 6,853
Columns: 48


In [72]:
print(
    daioe_all_levels
    .group_by("level")
    .len()
    .collect()
)


shape: (4, 2)
┌───────┬──────┐
│ level ┆ len  │
│ ---   ┆ ---  │
│ str   ┆ u32  │
╞═══════╪══════╡
│ SSYK1 ┆ 99   │
│ SSYK3 ┆ 1595 │
│ SSYK2 ┆ 473  │
│ SSYK4 ┆ 4686 │
└───────┴──────┘


In [73]:
daioe_all_levels.collect_schema()

Schema([('year', Int64),
        ('ssyk_code', String),
        ('weight_sum', Int64),
        ('daioe_allapps_avg', Float64),
        ('daioe_stratgames_avg', Float64),
        ('daioe_videogames_avg', Float64),
        ('daioe_imgrec_avg', Float64),
        ('daioe_imgcompr_avg', Float64),
        ('daioe_imggen_avg', Float64),
        ('daioe_readcompr_avg', Float64),
        ('daioe_lngmod_avg', Float64),
        ('daioe_translat_avg', Float64),
        ('daioe_speechrec_avg', Float64),
        ('daioe_genai_avg', Float64),
        ('daioe_allapps_wavg', Float64),
        ('daioe_stratgames_wavg', Float64),
        ('daioe_videogames_wavg', Float64),
        ('daioe_imgrec_wavg', Float64),
        ('daioe_imgcompr_wavg', Float64),
        ('daioe_imggen_wavg', Float64),
        ('daioe_readcompr_wavg', Float64),
        ('daioe_lngmod_wavg', Float64),
        ('daioe_translat_wavg', Float64),
        ('daioe_speechrec_wavg', Float64),
        ('daioe_genai_wavg', Float64),
        

In [74]:
scb_lazy_lf.collect_schema()

Schema([('level', String),
        ('ssyk_code', String),
        ('age', String),
        ('sex', String),
        ('year', Int64),
        ('count', Int64),
        ('occupation', String)])

In [75]:
inspect_lazy(scb_lazy_lf)

Rows: 125,334
Columns: 7


In [76]:
inspect_lazy(daioe_all_levels)

Rows: 6,853
Columns: 48


In [77]:
final_merge = scb_lazy_lf.join(
    daioe_all_levels,
    on=["year", "ssyk_code"],  # same key names on both sides
    how="left",
)

In [78]:
inspect_lazy(final_merge)

Rows: 125,334
Columns: 53


## Export
Write the combined dataset to Parquet for downstream use.


In [79]:
output_path = DATA_DIR / "daioe_scb_years_all_levels.parquet"

final_merge.sink_parquet(output_path)