This notebook:
- Creates wage table merged with county codes
- Creates unemployment tables merged with county codes:
    - (a) By sex-age
    - (b) By sex-education
- Explores multiple definitions of the unemployment measure

In [278]:
import sys
from pathlib import Path

p = Path.cwd().resolve()
repo_root = next((parent for parent in [p] + list(p.parents) if (parent / ".git").exists()), None)
if repo_root is None:
    raise RuntimeError("Repo root not found. Open the repo folder in VS Code.")

sys.path.insert(0, str(repo_root))
print("Repo root:", repo_root)

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

Repo root: C:\Users\harri\OneDrive - Imperial College London\Year 3 Group Project\Group_Project_Y3


In [279]:
# Get the county codes table
county_codes = pd.read_csv(repo_root / "cleaned/00_codes/county_codes.csv")
print(county_codes.shape)
county_codes.head()

(380, 3)


Unnamed: 0,county_code,county_kts,county_name
0,201,10030210101000,Powiat bolesławiecki
1,202,10030210302000,Powiat dzierżoniowski
2,203,10030210203000,Powiat głogowski
3,204,10030210204000,Powiat górowski
4,205,10030210105000,Powiat jaworski


**A - Wage Outcome:**

Takes wage by powiat (p2497)

In [280]:
wages = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/wage_powiat_p2497.csv", index_col=0)

wages["merge_code"] = wages["code"].apply(lambda x: int(str(x)[:-3]))
wages.merge_code.nunique()

print(wages.shape)
wages.head()

(9144, 6)


Unnamed: 0,code,powiat,type,year,value,merge_code
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202
2,203000,Powiat głogowski,grand total,2002,1868.6,203
3,204000,Powiat górowski,grand total,2002,1730.53,204
4,205000,Powiat jaworski,grand total,2002,1705.19,205
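
The `merge_code` column above is the powiat `code` with its last three digits dropped. Since the same step is repeated for several tables in this notebook, it could be wrapped in a small helper. A minimal sketch, assuming `code` is always an integer ending in `000` (the helper name `to_merge_code` is hypothetical):

```python
import pandas as pd


def to_merge_code(code: pd.Series) -> pd.Series:
    """Drop the last three digits of an integer powiat code."""
    return code.astype("int64") // 1000


# Toy frame using codes that appear in this notebook
demo = pd.DataFrame({"code": [201000, 263000, 3262000]})
demo["merge_code"] = to_merge_code(demo["code"])

# Same result as the string-slicing version used in the cells
assert demo["merge_code"].tolist() == [int(str(c)[:-3]) for c in demo["code"]]
print(demo)
```

Integer division avoids the round trip through strings, but it would give a different answer from the string slice if a code were ever stored as a float or a string, so the cast is kept explicit.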


Merge with county codes:

In [281]:
wages = wages.merge(
    county_codes,
    how = "left", 
    left_on = "merge_code", 
    right_on = "county_code"
)

wages.head()

Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201,201.0,10030210000000.0,Powiat bolesławiecki
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202,202.0,10030210000000.0,Powiat dzierżoniowski
2,203000,Powiat głogowski,grand total,2002,1868.6,203,203.0,10030210000000.0,Powiat głogowski
3,204000,Powiat górowski,grand total,2002,1730.53,204,204.0,10030210000000.0,Powiat górowski
4,205000,Powiat jaworski,grand total,2002,1705.19,205,205.0,10030210000000.0,Powiat jaworski
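
Before patching unmatched rows by hand, it can help to list exactly which codes failed to match. A minimal sketch on toy frames (the `_demo` names are hypothetical), using the `indicator` and `validate` options of `DataFrame.merge`:

```python
import pandas as pd

# Toy stand-ins for county_codes and wages
county_codes_demo = pd.DataFrame({
    "county_code": [201, 265],
    "county_name": ["Powiat bolesławiecki", "Powiat m. Wałbrzych"],
})
wages_demo = pd.DataFrame({
    "code": [201000, 263000, 265000],
    "merge_code": [201, 263, 265],
})

audit = wages_demo.merge(
    county_codes_demo,
    how="left",
    left_on="merge_code",
    right_on="county_code",
    validate="many_to_one",  # raises if county_code is duplicated on the right
    indicator=True,          # adds a _merge column: both / left_only
)

# Codes present in the wage table but absent from the county codes table
unmatched = audit.loc[audit["_merge"] == "left_only", "merge_code"].unique().tolist()
print(unmatched)
```

Listing the unmatched codes first makes it explicit that a blanket `isna()` patch only touches the rows it is meant to.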


In [282]:
missing = wages["county_code"].isna()
wages.loc[missing, "county_code"] = 265
wages.loc[missing, "county_kts"] = 10030210365000
wages.loc[missing, "county_name"] = "Powiat m. Wałbrzych"

Quick analysis:

In [283]:
wage_summary = wages.groupby("county_code")

table_a = wage_summary["year"].agg(["min", "max", "count"])

print(table_a.describe())
print()
print(table_a["count"].value_counts())
print()
print(table_a[table_a["count"]>24])

wages[wages["county_code"] == 265]

          min     max       count
count   380.0   380.0  380.000000
mean   2002.0  2025.0   24.063158
std       0.0     0.0    1.231174
min    2002.0  2025.0   24.000000
25%    2002.0  2025.0   24.000000
50%    2002.0  2025.0   24.000000
75%    2002.0  2025.0   24.000000
max    2002.0  2025.0   48.000000

count
24    379
48      1
Name: count, dtype: int64

              min   max  count
county_code                   
265.0        2002  2025     48


Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name
28,263000,City with powiat status Wałbrzych to 2002,grand total,2002,2117.52,263,265.0,10030210000000.0,Powiat m. Wałbrzych
30,265000,City with powiat status Wałbrzych since 2013,grand total,2002,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
409,263000,City with powiat status Wałbrzych to 2002,grand total,2003,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
411,265000,City with powiat status Wałbrzych since 2013,grand total,2003,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
790,263000,City with powiat status Wałbrzych to 2002,grand total,2004,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
792,265000,City with powiat status Wałbrzych since 2013,grand total,2004,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
1171,263000,City with powiat status Wałbrzych to 2002,grand total,2005,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
1173,265000,City with powiat status Wałbrzych since 2013,grand total,2005,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
1552,263000,City with powiat status Wałbrzych to 2002,grand total,2006,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
1554,265000,City with powiat status Wałbrzych since 2013,grand total,2006,,265,265.0,10030210000000.0,Powiat m. Wałbrzych


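County 265 has a lot of missing values and twice as many rows as the other counties - why? Because two source codes are mapped to it: 263 (Wałbrzych to 2002) and 265 (Wałbrzych since 2013). Between 2003 and 2013 the city was merged with powiat wałbrzyski, so neither code has data for those years. We can ignore this given we likely don't care about pre 2013.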
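
If the duplicate rows for county 265 ever become a problem (for example when merging on `county_code` and `year`), one option is to collapse the two source codes into a single row per year. A minimal sketch on toy rows copied from the output above, relying on `GroupBy.first()` returning the first non-null value in each group:

```python
import numpy as np
import pandas as pd

# Two source codes (263 and 265) mapped to the same county_code
demo = pd.DataFrame({
    "code": [263000, 265000, 263000, 265000],
    "county_code": [265, 265, 265, 265],
    "year": [2002, 2002, 2003, 2003],
    "value": [2117.52, np.nan, np.nan, np.nan],
})

# One row per county-year; keeps the non-missing wage where one exists
collapsed = demo.groupby(["county_code", "year"], as_index=False)["value"].first()
print(collapsed)
```

Years where both codes are missing stay missing, so this does not fill the 2003 to 2012 gap; it only removes the duplication.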

Now we need to add a population column so that powiats can be combined into our labour market clusters. To be consistent, this should match the population measure used for the unemployment measures, e.g. the NC 2021 population or the yearly powiat population. These are added to wages as NC_population and YR_population.

NC population
- From sex_age (p4181) or sex_ed (p4315) and summed to powiat total
- Doing both to check they give consistent population

YR population
- From sex_agegr (p2137)
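
To illustrate why the population column is needed, here is a minimal sketch of a population-weighted cluster wage. The wages and NC populations are copied from the tables in this notebook, but the `cluster` assignment is made up purely for illustration; the real labour market clusters are defined elsewhere.

```python
import pandas as pd

demo = pd.DataFrame({
    "county_code": [201, 202, 203],
    "year": [2002, 2002, 2002],
    "wage": [1873.59, 1703.68, 1868.60],
    "NC_population": [76739, 86543, 75191],
    "cluster": ["A", "A", "B"],  # hypothetical cluster labels
})

# Rows with a missing wage should be dropped first so they carry no weight
demo = demo.dropna(subset=["wage"]).copy()
demo["wage_x_pop"] = demo["wage"] * demo["NC_population"]

sums = demo.groupby(["cluster", "year"])[["wage_x_pop", "NC_population"]].sum()
cluster_wage = (sums["wage_x_pop"] / sums["NC_population"]).rename("wage").reset_index()
print(cluster_wage)
```

Swapping `NC_population` for `YR_population` gives the yearly-weighted version, which is the reason both columns are carried through to the output table.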

First National Census Population merged:

In [284]:
nc_pop_sa = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/pop_nc_sex_age_p4181.csv", index_col=0)
nc_pop_sa.head()

nc_pop_se = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/pop_nc_sex_ed_p4315.csv", index_col=0)
nc_pop_se.head()

# Filter to ages 13 upwards and total for sex
age_filter = ['total', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12']
nc_pop_sa = nc_pop_sa[
    ~(nc_pop_sa["age"].isin(age_filter)) & (nc_pop_sa["sex"]=="total")
]
nc_pop_sa

# Filter to just total education and sex
nc_pop_se = nc_pop_se[
    (nc_pop_se["education"]=="total") & (nc_pop_se["sex"]=="total")
]
nc_pop_se

# Check they give the same powiat totals
nc_pop_1 = nc_pop_sa.groupby("code")["count"].sum()
nc_pop_2 = nc_pop_se.groupby("code")["count"].sum()

(nc_pop_1 == nc_pop_2).sum()

np.int64(380)
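
The check above returns the number of matching powiats, which then has to be compared with the total by eye. A minimal sketch of a stricter version on toy series (the names `s1` and `s2` are hypothetical stand-ins for `nc_pop_1` and `nc_pop_2`):

```python
import pandas as pd

s1 = pd.Series({201000: 76739, 202000: 86543}, name="count")
s2 = pd.Series({201000: 76739, 202000: 86543}, name="count")

# == between Series raises if the labels differ, so check the index first
same_index = s1.index.equals(s2.index)
all_equal = same_index and bool((s1 == s2).all())
print(all_equal)
```

This makes the result a single True/False instead of a count that depends on knowing there are 380 powiats.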

In [285]:
# They are consistent so will go with the first
nc_pop = pd.DataFrame(nc_pop_1).reset_index()

nc_pop["merge_code"] = nc_pop["code"].apply(lambda x: int(str(x)[:-3]))

nc_pop = nc_pop.merge(
    county_codes,
    how="left",
    left_on="merge_code",
    right_on="county_code"
)

nc_pop

Unnamed: 0,code,count,merge_code,county_code,county_kts,county_name
0,201000,76739,201,201,10030210101000,Powiat bolesławiecki
1,202000,86543,202,202,10030210302000,Powiat dzierżoniowski
2,203000,75191,203,203,10030210203000,Powiat głogowski
3,204000,29042,204,204,10030210204000,Powiat górowski
4,205000,42493,205,205,10030210105000,Powiat jaworski
...,...,...,...,...,...,...
375,3217000,44491,3217,3217,10023216417000,Powiat wałecki
376,3218000,30007,3218,3218,10023216418000,Powiat łobeski
377,3261000,93083,3261,3261,10023216361000,Powiat m. Koszalin
378,3262000,349790,3262,3262,10023216562000,Powiat m. Szczecin


In [286]:
wages = wages.merge(
    nc_pop[["county_code", "count"]],
    how="left",
    left_on="county_code",
    right_on="county_code"
)

wages = wages.rename(
    columns={"count": "NC_population"}
)

wages

Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name,NC_population
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201,201.0,1.003021e+13,Powiat bolesławiecki,76739
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202,202.0,1.003021e+13,Powiat dzierżoniowski,86543
2,203000,Powiat głogowski,grand total,2002,1868.60,203,203.0,1.003021e+13,Powiat głogowski,75191
3,204000,Powiat górowski,grand total,2002,1730.53,204,204.0,1.003021e+13,Powiat górowski,29042
4,205000,Powiat jaworski,grand total,2002,1705.19,205,205.0,1.003021e+13,Powiat jaworski,42493
...,...,...,...,...,...,...,...,...,...,...
9139,3217000,Powiat wałecki,grand total,2025,,3217,3217.0,1.002322e+13,Powiat wałecki,44491
9140,3218000,Powiat łobeski,grand total,2025,,3218,3218.0,1.002322e+13,Powiat łobeski,30007
9141,3261000,City with powiat status Koszalin,grand total,2025,,3261,3261.0,1.002322e+13,Powiat m. Koszalin,93083
9142,3262000,City with powiat status Szczecin,grand total,2025,,3262,3262.0,1.002322e+13,Powiat m. Szczecin,349790


Now yearly population (p2137):

In [287]:
yr_pop = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/pop_yr_sex_agegr_p2137.csv", index_col=0)

yr_pop

Unnamed: 0,code,powiat,year,sex,age_group,count
0,201000,Powiat bolesławiecki,1995,total,total,89407.0
1,202000,Powiat dzierżoniowski,1995,total,total,113810.0
2,203000,Powiat głogowski,1995,total,total,91373.0
3,204000,Powiat górowski,1995,total,total,37826.0
4,205000,Powiat jaworski,1995,total,total,54914.0
...,...,...,...,...,...,...
721975,3217000,Powiat wałecki,2024,females,0-14,3279.0
721976,3218000,Powiat łobeski,2024,females,0-14,2178.0
721977,3261000,City with powiat status Koszalin,2024,females,0-14,6713.0
721978,3262000,City with powiat status Szczecin,2024,females,0-14,24100.0


In [288]:
print(yr_pop.age_group.unique())
print(yr_pop.sex.unique())

# Filter to just relevant (total sex and age group 15+)
age_filter = ['total', '0-4', '5-9', '10-14', '0-14']
yr_pop = yr_pop[
    ~(yr_pop["age_group"].isin(age_filter)) & (yr_pop["sex"]=="total")
].copy()
yr_pop

['total' '0-4' '5-9' '10-14' '15-19' '20-24' '25-29' '30-34' '35-39'
 '40-44' '45-49' '50-54' '55-59' '60-64' '65-69' '70 and more' '70-74'
 '75-79' '80-84' '85 and more' '0-14']
['total' 'males' 'females']


Unnamed: 0,code,powiat,year,sex,age_group,count
137520,201000,Powiat bolesławiecki,1995,total,15-19,7577.0
137521,202000,Powiat dzierżoniowski,1995,total,15-19,9301.0
137522,203000,Powiat głogowski,1995,total,15-19,9744.0
137523,204000,Powiat górowski,1995,total,15-19,3236.0
137524,205000,Powiat jaworski,1995,total,15-19,4761.0
...,...,...,...,...,...,...
664675,3217000,Powiat wałecki,2024,total,85 and more,846.0
664676,3218000,Powiat łobeski,2024,total,85 and more,673.0
664677,3261000,City with powiat status Koszalin,2024,total,85 and more,3031.0
664678,3262000,City with powiat status Szczecin,2024,total,85 and more,10557.0


In [289]:
yr_pop.age_group.unique()

array(['15-19', '20-24', '25-29', '30-34', '35-39', '40-44', '45-49',
       '50-54', '55-59', '60-64', '65-69', '70 and more', '70-74',
       '75-79', '80-84', '85 and more'], dtype=object)

In [290]:
yr_pop["merge_code"] = yr_pop["code"].apply(lambda x: int(str(x)[:-3]))

yr_pop = yr_pop.merge(
    county_codes,
    how="left",
    left_on="merge_code",
    right_on="county_code"
)

yr_pop

Unnamed: 0,code,powiat,year,sex,age_group,count,merge_code,county_code,county_kts,county_name
0,201000,Powiat bolesławiecki,1995,total,15-19,7577.0,201,201.0,1.003021e+13,Powiat bolesławiecki
1,202000,Powiat dzierżoniowski,1995,total,15-19,9301.0,202,202.0,1.003021e+13,Powiat dzierżoniowski
2,203000,Powiat głogowski,1995,total,15-19,9744.0,203,203.0,1.003021e+13,Powiat głogowski
3,204000,Powiat górowski,1995,total,15-19,3236.0,204,204.0,1.003021e+13,Powiat górowski
4,205000,Powiat jaworski,1995,total,15-19,4761.0,205,205.0,1.003021e+13,Powiat jaworski
...,...,...,...,...,...,...,...,...,...,...
183355,3217000,Powiat wałecki,2024,total,85 and more,846.0,3217,3217.0,1.002322e+13,Powiat wałecki
183356,3218000,Powiat łobeski,2024,total,85 and more,673.0,3218,3218.0,1.002322e+13,Powiat łobeski
183357,3261000,City with powiat status Koszalin,2024,total,85 and more,3031.0,3261,3261.0,1.002322e+13,Powiat m. Koszalin
183358,3262000,City with powiat status Szczecin,2024,total,85 and more,10557.0,3262,3262.0,1.002322e+13,Powiat m. Szczecin


In [291]:
yr_pop[yr_pop["county_code"].isna()]

Unnamed: 0,code,powiat,year,sex,age_group,count,merge_code,county_code,county_kts,county_name
28,263000,City with powiat status Wałbrzych to 2002,1995,total,15-19,11449.0,263,,,
168,1431000,Powiat warszawski,1995,total,15-19,117361.0,1431,,,
410,263000,City with powiat status Wałbrzych to 2002,1996,total,15-19,11621.0,263,,,
550,1431000,Powiat warszawski,1996,total,15-19,116256.0,1431,,,
792,263000,City with powiat status Wałbrzych to 2002,1997,total,15-19,11807.0,263,,,
...,...,...,...,...,...,...,...,...,...,...
182382,1431000,Powiat warszawski,2022,total,85 and more,,1431,,,
182624,263000,City with powiat status Wałbrzych to 2002,2023,total,85 and more,,263,,,
182764,1431000,Powiat warszawski,2023,total,85 and more,,1431,,,
183006,263000,City with powiat status Wałbrzych to 2002,2024,total,85 and more,,263,,,


In [292]:
yr_pop[yr_pop["merge_code"]==1431].head(10)

Unnamed: 0,code,powiat,year,sex,age_group,count,merge_code,county_code,county_kts,county_name
168,1431000,Powiat warszawski,1995,total,15-19,117361.0,1431,,,
550,1431000,Powiat warszawski,1996,total,15-19,116256.0,1431,,,
932,1431000,Powiat warszawski,1997,total,15-19,116149.0,1431,,,
1314,1431000,Powiat warszawski,1998,total,15-19,117247.0,1431,,,
1696,1431000,Powiat warszawski,1999,total,15-19,128590.0,1431,,,
2078,1431000,Powiat warszawski,2000,total,15-19,121749.0,1431,,,
2460,1431000,Powiat warszawski,2001,total,15-19,114527.0,1431,,,
2842,1431000,Powiat warszawski,2002,total,15-19,,1431,,,
3224,1431000,Powiat warszawski,2003,total,15-19,,1431,,,
3606,1431000,Powiat warszawski,2004,total,15-19,,1431,,,


In [293]:
missing = yr_pop["merge_code"]==263 # as before
yr_pop.loc[missing, "county_code"] = 265
yr_pop.loc[missing, "county_kts"] = 10030210365000
yr_pop.loc[missing, "county_name"] = "Powiat m. Wałbrzych"

# 1431 only exists up to 2001 - we will just drop
yr_pop = yr_pop.dropna(subset=["county_code"])

In [294]:
yr_pop_sum = pd.DataFrame(yr_pop.groupby(["county_code", "year"])["count"].sum()).reset_index()
# yr_pop_sum = yr_pop_sum.merge(
#     county_codes,
#     how="left",
#     left_on="county_code",
#     right_on="county_code"
# )
yr_pop_sum

Unnamed: 0,county_code,year,count
0,201.0,1995,68528.0
1,201.0,1996,69108.0
2,201.0,1997,69943.0
3,201.0,1998,70777.0
4,201.0,1999,69443.0
...,...,...,...
11395,3263.0,2020,41252.0
11396,3263.0,2021,41035.0
11397,3263.0,2022,40940.0
11398,3263.0,2023,40897.0


In [295]:
wages = wages.merge(
    yr_pop_sum,
    how="left", 
    left_on=["county_code", "year"],
    right_on=["county_code", "year"]
)

wages = wages.rename(columns={"count": "YR_population"})

wages

Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name,NC_population,YR_population
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201,201.0,1.003021e+13,Powiat bolesławiecki,76739,78228.0
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202,202.0,1.003021e+13,Powiat dzierżoniowski,86543,100315.0
2,203000,Powiat głogowski,grand total,2002,1868.60,203,203.0,1.003021e+13,Powiat głogowski,75191,76992.0
3,204000,Powiat górowski,grand total,2002,1730.53,204,204.0,1.003021e+13,Powiat górowski,29042,32252.0
4,205000,Powiat jaworski,grand total,2002,1705.19,205,205.0,1.003021e+13,Powiat jaworski,42493,47991.0
...,...,...,...,...,...,...,...,...,...,...,...
9139,3217000,Powiat wałecki,grand total,2025,,3217,3217.0,1.002322e+13,Powiat wałecki,44491,
9140,3218000,Powiat łobeski,grand total,2025,,3218,3218.0,1.002322e+13,Powiat łobeski,30007,
9141,3261000,City with powiat status Koszalin,grand total,2025,,3261,3261.0,1.002322e+13,Powiat m. Koszalin,93083,
9142,3262000,City with powiat status Szczecin,grand total,2025,,3262,3262.0,1.002322e+13,Powiat m. Szczecin,349790,


Comparison of population measures for wages:

In [296]:
rel = wages[wages["year"]==2021]

rel.describe() 

Unnamed: 0,code,year,value,merge_code,county_code,county_kts,NC_population,YR_population
count,381.0,381.0,380.0,381.0,381.0,381.0,381.0,381.0
mean,1716853.0,2021.0,5211.628921,1716.853018,1716.858268,10040110000000.0,86554.08,96562.59
std,943888.6,0.0,639.608635,943.888628,943.880526,19722160000.0,107355.4,121857.4
min,201000.0,2021.0,4244.56,201.0,201.0,10011210000000.0,16844.0,18912.0
25%,1004000.0,2021.0,4837.56,1004.0,1004.0,10023020000000.0,46502.0,51580.0
50%,1611000.0,2021.0,5070.22,1611.0,1611.0,10040420000000.0,64482.0,71498.0
75%,2475000.0,2021.0,5407.175,2475.0,2475.0,10060610000000.0,94355.0,106748.0
max,3263000.0,2021.0,10076.64,3263.0,3263.0,10071430000000.0,1608993.0,1825641.0


In [297]:
diff = (rel["NC_population"] - rel["YR_population"]) / rel["YR_population"]

diff.describe()

count    381.000000
mean      -0.099994
std        0.013831
min       -0.168743
25%       -0.108652
50%       -0.098101
75%       -0.090018
max       -0.068653
dtype: float64

In [298]:
rel[["NC_population", "YR_population"]].corr()

Unnamed: 0,NC_population,YR_population
NC_population,1.0,0.999858
YR_population,0.999858,1.0


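So these are extremely highly correlated measures (0.9999), both measuring powiat population. In 2021 NC_population is on average about 10% below YR_population (median 9.8%). Note that this is the opposite of what the age coverage would suggest: NC_population covers ages 13+ while YR_population covers 15+, so YR_population should be somewhat smaller. One possible cause, still to be checked, is that the yr_pop age groups kept after filtering include both '70 and more' and the finer '70-74' to '85 and more' bands, so ages 70+ may be counted twice in years where both are populated.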
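
A minimal sketch of how overlapping age bands in yr_pop could be detected, since '70 and more' sits alongside '70-74', '75-79', '80-84' and '85 and more' in the age_group list printed earlier. The counts below are made up; only the band labels come from the data.

```python
import numpy as np
import pandas as pd

bands = ["70 and more", "70-74", "75-79", "80-84", "85 and more"]
demo = pd.DataFrame({
    "year": [2000] * 5 + [2021] * 5,
    "age_group": bands * 2,
    # made-up counts: 2000 has only the coarse band, 2021 has both
    "count": [80.0, np.nan, np.nan, np.nan, np.nan, 90.0, 40.0, 25.0, 15.0, 10.0],
})

fine_bands = ["70-74", "75-79", "80-84", "85 and more"]
is_fine = demo["age_group"].isin(fine_bands)

# min_count=1 keeps an all-missing year as NaN instead of 0
coarse = demo[demo["age_group"] == "70 and more"].groupby("year")["count"].sum(min_count=1)
fine = demo[is_fine].groupby("year")["count"].sum(min_count=1)

both = pd.concat({"coarse": coarse, "fine": fine}, axis=1).dropna()
overlap_years = both.index.tolist()
print(overlap_years)
```

Any year listed in `overlap_years` would have ages 70+ counted twice if all bands were summed, in which case one of the two sets of bands would need to be dropped before aggregating.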

Now the issue is whether these should instead be (a) the economically active or (b) the employed population, i.e. do we want average wage among employed people or average wage among economically active people - **need to check literature on this.** I think ideally it would be person-hours worked but this is not available.

- I cannot find a measure of employed people that is both powiat-level and long term. Best I can find is employed people by powiat 2022-2025.
- If we think it worthwhile doing another measure, we could model the activity rate based on national/region/powiat for the data we have (2022-) and extrapolate backwards - but still not ideal
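
One very simple way the backward extrapolation mentioned above might look is a constant-ratio method: take the average ratio of the powiat rate to the national rate over the observed years and apply it to the national series. This is only a sketch under that assumption, and every number below is made up.

```python
import pandas as pd

# Made-up national activity rates, available for all years
national = pd.Series({2020: 0.56, 2021: 0.57, 2022: 0.58, 2023: 0.58}, name="national_rate")
# Made-up powiat activity rates, observed only from 2022
powiat = pd.Series({2022: 0.522, 2023: 0.522}, name="powiat_rate")

# Average powiat/national ratio over the years where both exist
ratio = (powiat / national.loc[powiat.index]).mean()

# Scale the national series, then keep the observed powiat values where available
estimated = (national * ratio).rename("powiat_rate_est")
estimated.update(powiat)
print(estimated)
```

The method assumes the powiat's gap to the national rate is stable over time, which is exactly the part that would need checking against the literature.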

Also **need to check**, but according to chat the wage measure covers only firms with 10+ employees, which may not be very representative of low-skilled immigration. Ukrainian immigration is high skilled, so this is less of an issue and more something to note as a limitation.

In [299]:
wage_output_cols = ["county_code", "county_kts", "county_name", "type", "year", "value", "NC_population", "YR_population"]
wages_output = wages[wage_output_cols].rename(
    columns={"value": "wage"}
)
wages_output.to_csv(repo_root / "cleaned/03_01_outcome_tables/wage_yr_table.csv")
wages_output

Unnamed: 0,county_code,county_kts,county_name,type,year,wage,NC_population,YR_population
0,201.0,1.003021e+13,Powiat bolesławiecki,grand total,2002,1873.59,76739,78228.0
1,202.0,1.003021e+13,Powiat dzierżoniowski,grand total,2002,1703.68,86543,100315.0
2,203.0,1.003021e+13,Powiat głogowski,grand total,2002,1868.60,75191,76992.0
3,204.0,1.003021e+13,Powiat górowski,grand total,2002,1730.53,29042,32252.0
4,205.0,1.003021e+13,Powiat jaworski,grand total,2002,1705.19,42493,47991.0
...,...,...,...,...,...,...,...,...
9139,3217.0,1.002322e+13,Powiat wałecki,grand total,2025,,44491,
9140,3218.0,1.002322e+13,Powiat łobeski,grand total,2025,,30007,
9141,3261.0,1.002322e+13,Powiat m. Koszalin,grand total,2025,,93083,
9142,3262.0,1.002322e+13,Powiat m. Szczecin,grand total,2025,,349790,


**B - Unemployment Outcome:**

Three steps:
- Read in unemployment tables
- Read in population / activity data and construct powiat-year data 
- Merge this to unemployment tables
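
A minimal sketch of the final merge step, computing registered unemployed as a share of population. This is just one possible definition; the denominator could equally be the economically active population if that becomes available. All numbers are made up.

```python
import pandas as pd

# Made-up powiat-year totals
unemployed = pd.DataFrame({
    "county_code": [201, 202],
    "year": [2021, 2021],
    "count": [1500.0, 2000.0],
})
population = pd.DataFrame({
    "county_code": [201, 202],
    "year": [2021, 2021],
    "YR_population": [75000.0, 80000.0],
})

rates = unemployed.merge(
    population,
    on=["county_code", "year"],
    how="left",
    validate="one_to_one",  # raises if either side has duplicate county-years
)
rates["unemp_per_pop"] = rates["count"] / rates["YR_population"]
print(rates)
```

The `validate` argument is there because a duplicated county-year on either side would silently multiply rows in the merged table.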

**B1 - Unemployment data:**

P1946 - Registered unemployment by sex and age

In [63]:
unemploy_sa = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/ru_sex_age_p1946.csv", index_col=0)
unemploy_sa

Unnamed: 0,code,powiat,year,sex,age,count
0,201000,Powiat bolesławiecki,2000,total,total,8886.0
1,202000,Powiat dzierżoniowski,2000,total,total,10625.0
2,203000,Powiat głogowski,2000,total,total,7752.0
3,204000,Powiat górowski,2000,total,total,3833.0
4,205000,Powiat jaworski,2000,total,total,6254.0
...,...,...,...,...,...,...
207475,3217000,Powiat wałecki,2025,females,55 and more,79.0
207476,3218000,Powiat łobeski,2025,females,55 and more,75.0
207477,3261000,City with powiat status Koszalin,2025,females,55 and more,146.0
207478,3262000,City with powiat status Szczecin,2025,females,55 and more,343.0


P1947 - Registered unemployment by sex and education

In [64]:
unemploy_se = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/ru_sex_ed_p1947.csv", index_col=0)
unemploy_se

Unnamed: 0,code,powiat,year,sex,education,count
0,201000,Powiat bolesławiecki,2000,total,total,8886.0
1,202000,Powiat dzierżoniowski,2000,total,total,10625.0
2,203000,Powiat głogowski,2000,total,total,7752.0
3,204000,Powiat górowski,2000,total,total,3833.0
4,205000,Powiat jaworski,2000,total,total,6254.0
...,...,...,...,...,...,...
187715,3217000,Powiat wałecki,2025,females,"lower secondary, primary and incomplete primary",297.0
187716,3218000,Powiat łobeski,2025,females,"lower secondary, primary and incomplete primary",280.0
187717,3261000,City with powiat status Koszalin,2025,females,"lower secondary, primary and incomplete primary",394.0
187718,3262000,City with powiat status Szczecin,2025,females,"lower secondary, primary and incomplete primary",647.0


Merge to county codes

In [65]:
for df in [unemploy_sa, unemploy_se]:
    df["merge_code"] = df["code"].apply(lambda x: int(str(x)[:-3]))

In [66]:
unemploy_sa = unemploy_sa.merge(
    county_codes,
    how="left",
    left_on="merge_code",
    right_on="county_code"
)
unemploy_sa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207480 entries, 0 to 207479
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   code         207480 non-null  int64  
 1   powiat       207480 non-null  object 
 2   year         207480 non-null  int64  
 3   sex          207480 non-null  object 
 4   age          207480 non-null  object 
 5   count        206871 non-null  float64
 6   merge_code   207480 non-null  int64  
 7   county_code  207480 non-null  int64  
 8   county_kts   207480 non-null  int64  
 9   county_name  207480 non-null  object 
dtypes: float64(1), int64(5), object(4)
memory usage: 15.8+ MB


In [69]:
unemploy_se = unemploy_se.merge(
    county_codes,
    how="left",
    left_on="merge_code",
    right_on="county_code"
)
unemploy_se.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187720 entries, 0 to 187719
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   code         187720 non-null  int64  
 1   powiat       187720 non-null  object 
 2   year         187720 non-null  int64  
 3   sex          187720 non-null  object 
 4   education    187720 non-null  object 
 5   count        162318 non-null  float64
 6   merge_code   187720 non-null  int64  
 7   county_code  187720 non-null  int64  
 8   county_kts   187720 non-null  int64  
 9   county_name  187720 non-null  object 
dtypes: float64(1), int64(5), object(4)
memory usage: 14.3+ MB


Quick analysis of tables:

In [67]:
table_b = unemploy_sa.groupby("code")[["count", "year"]].agg(["min", "max", "count"])
table_b

Unnamed: 0_level_0,count,count,count,year,year,year
Unnamed: 0_level_1,min,max,count,min,max,count
code,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
201000,42.0,9218.0,546,2000,2025,546
202000,20.0,12436.0,546,2000,2025,546
203000,32.0,9203.0,546,2000,2025,546
204000,7.0,4561.0,546,2000,2025,546
205000,23.0,7166.0,546,2000,2025,546
...,...,...,...,...,...,...
3217000,16.0,6965.0,546,2000,2025,546
3218000,33.0,5922.0,504,2000,2025,546
3261000,43.0,10479.0,546,2000,2025,546
3262000,90.0,29423.0,546,2000,2025,546


In [68]:
table_b.agg(["min", "max"])

Unnamed: 0_level_0,count,count,count,year,year,year
Unnamed: 0_level_1,min,max,count,min,max,count
min,0.0,1641.0,273,2000,2025,546
max,289.0,65177.0,546,2000,2025,546


In [70]:
table_c = unemploy_se.groupby("code")[["count", "year"]].agg(["min", "max", "count"])
table_c

Unnamed: 0_level_0,count,count,count,year,year,year
Unnamed: 0_level_1,min,max,count,min,max,count
code,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
201000,15.0,9218.0,428,2000,2025,494
202000,39.0,12436.0,428,2000,2025,494
203000,51.0,9203.0,428,2000,2025,494
204000,17.0,4561.0,428,2000,2025,494
205000,34.0,7166.0,428,2000,2025,494
...,...,...,...,...,...,...
3217000,22.0,6965.0,428,2000,2025,494
3218000,19.0,5922.0,412,2000,2025,494
3261000,86.0,10479.0,428,2000,2025,494
3262000,193.0,29423.0,428,2000,2025,494


In [71]:
table_c.agg(["min", "max"])

Unnamed: 0_level_0,count,count,count,year,year,year
Unnamed: 0_level_1,min,max,count,min,max,count
min,6.0,1641.0,234,2000,2025,494
max,770.0,65177.0,428,2000,2025,494


**B2 - Population Data:**

From the wage section we have the yearly population table, yr_pop (p2137), which is by sex and age group