This notebook:
- Creates wage table merged with county codes
- Creates unemployment tables merged with county codes:
    - (a) By sex-age
    - (b) By sex-education
- Explores multiple definitions of the unemployment measure

In [35]:
import sys
from pathlib import Path

p = Path.cwd().resolve()
repo_root = next((parent for parent in [p] + list(p.parents) if (parent / ".git").exists()), None)
if repo_root is None:
    raise RuntimeError("Repo root not found. Open the repo folder in VS Code.")

sys.path.insert(0, str(repo_root))
print("Repo root:", repo_root)

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt

Repo root: C:\Users\harri\OneDrive - Imperial College London\Year 3 Group Project\Group_Project_Y3


In [36]:
# Get the county codes table
county_codes = pd.read_csv(repo_root / "cleaned/00_codes/county_codes.csv")
print(county_codes.shape)
county_codes.head()

(380, 3)


Unnamed: 0,county_code,county_kts,county_name
0,201,10030210101000,Powiat bolesławiecki
1,202,10030210302000,Powiat dzierżoniowski
2,203,10030210203000,Powiat głogowski
3,204,10030210204000,Powiat górowski
4,205,10030210105000,Powiat jaworski


**A - Wage Outcome:**

Takes wage by powiat (p2497)

In [37]:
wages = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/wage_powiat_p2497.csv", index_col=0)

wages["merge_code"] = wages["code"].apply(lambda x: int(str(x)[:-3]))
wages.merge_code.nunique()

print(wages.shape)
wages.head()

(9144, 6)


Unnamed: 0,code,powiat,type,year,value,merge_code
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202
2,203000,Powiat głogowski,grand total,2002,1868.6,203
3,204000,Powiat górowski,grand total,2002,1730.53,204
4,205000,Powiat jaworski,grand total,2002,1705.19,205
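The string slice above drops the last three digits of `code` to get the powiat-level merge key. A minimal sketch of the same step using integer floor division, assuming every `code` is a non-negative integer (the toy values are copied from the tables in this notebook):

```python
import pandas as pd

# Toy frame using codes that appear in the notebook outputs.
toy = pd.DataFrame({"code": [201000, 202000, 3262000]})

# String-based version, as used in the cell above.
by_slice = toy["code"].apply(lambda x: int(str(x)[:-3]))

# Vectorised alternative: floor division by 1000 drops the last three digits.
by_floordiv = toy["code"] // 1000

print(by_floordiv.tolist())
```

Both give the same keys on these values; the floor-division form avoids a Python-level loop and does not fail on codes shorter than four digits.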


Merge with county codes:

In [38]:
wages = wages.merge(
    county_codes,
    how = "left", 
    left_on = "merge_code", 
    right_on = "county_code"
)

wages.head()

Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201,201.0,10030210000000.0,Powiat bolesławiecki
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202,202.0,10030210000000.0,Powiat dzierżoniowski
2,203000,Powiat głogowski,grand total,2002,1868.6,203,203.0,10030210000000.0,Powiat głogowski
3,204000,Powiat górowski,grand total,2002,1730.53,204,204.0,10030210000000.0,Powiat górowski
4,205000,Powiat jaworski,grand total,2002,1705.19,205,205.0,10030210000000.0,Powiat jaworski


In [39]:
missing = wages["county_code"].isna()
wages.loc[missing, "county_code"] = 265
wages.loc[missing, "county_kts"] = 10030210365000
wages.loc[missing, "county_name"] = "Powiat m. Wałbrzych"
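The patch above assigns 265 to every row that failed to match. A hedged sketch of how the unmatched codes could be inspected first with the `indicator` option of `merge`, and how the integer dtype lost in the left merge could be restored. The frames are toy stand-ins, not the real tables:

```python
import pandas as pd

# Toy stand-ins; codes and names are taken from the outputs in this notebook.
county_codes_toy = pd.DataFrame({
    "county_code": [201, 265],
    "county_name": ["Powiat bolesławiecki", "Powiat m. Wałbrzych"],
})
wages_toy = pd.DataFrame({"merge_code": [201, 263, 265]})

merged = wages_toy.merge(
    county_codes_toy,
    how="left",
    left_on="merge_code",
    right_on="county_code",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)

# Which source codes found no match? Inspect before overwriting anything.
unmatched_codes = merged.loc[merged["_merge"] == "left_only", "merge_code"].unique()
print(unmatched_codes)

# Patch only the known case (263 maps to 265), then restore an integer dtype.
is_old_walbrzych = merged["merge_code"] == 263
merged.loc[is_old_walbrzych, "county_code"] = 265
merged.loc[is_old_walbrzych, "county_name"] = "Powiat m. Wałbrzych"
merged["county_code"] = merged["county_code"].astype("int64")
```

Selecting on `merge_code == 263` rather than on `isna()` means a new, unexpected unmatched code would stay visible instead of being silently relabelled as Wałbrzych.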

Quick analysis:

In [40]:
wage_summary = wages.groupby("county_code")

table_a = wage_summary["year"].agg(["min", "max", "count"])

print(table_a.describe())
print()
print(table_a["count"].value_counts())
print()
print(table_a[table_a["count"]>24])

wages[wages["county_code"] == 265]

          min     max       count
count   380.0   380.0  380.000000
mean   2002.0  2025.0   24.063158
std       0.0     0.0    1.231174
min    2002.0  2025.0   24.000000
25%    2002.0  2025.0   24.000000
50%    2002.0  2025.0   24.000000
75%    2002.0  2025.0   24.000000
max    2002.0  2025.0   48.000000

count
24    379
48      1
Name: count, dtype: int64

              min   max  count
county_code                   
265.0        2002  2025     48


Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name
28,263000,City with powiat status Wałbrzych to 2002,grand total,2002,2117.52,263,265.0,10030210000000.0,Powiat m. Wałbrzych
30,265000,City with powiat status Wałbrzych since 2013,grand total,2002,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
409,263000,City with powiat status Wałbrzych to 2002,grand total,2003,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
411,265000,City with powiat status Wałbrzych since 2013,grand total,2003,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
790,263000,City with powiat status Wałbrzych to 2002,grand total,2004,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
792,265000,City with powiat status Wałbrzych since 2013,grand total,2004,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
1171,263000,City with powiat status Wałbrzych to 2002,grand total,2005,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
1173,265000,City with powiat status Wałbrzych since 2013,grand total,2005,,265,265.0,10030210000000.0,Powiat m. Wałbrzych
1552,263000,City with powiat status Wałbrzych to 2002,grand total,2006,,263,265.0,10030210000000.0,Powiat m. Wałbrzych
1554,265000,City with powiat status Wałbrzych since 2013,grand total,2006,,265,265.0,10030210000000.0,Powiat m. Wałbrzych


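County 265 has a lot of missing values. Why? Because both Wałbrzych codes (263, "to 2002", and 265, "since 2013") are mapped to it, and between 2003 and 2013 the city was merged with wałbrzyski. We can ignore this, given we likely don't care about the pre-2013 period.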
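If one row per county and year is needed later, the two Wałbrzych rows could be collapsed. A minimal sketch on toy data, assuming at most one of the two rows carries a wage in any given year (true for the rows shown above, not checked for every year):

```python
import numpy as np
import pandas as pd

# Toy rows modelled on the output above.
toy = pd.DataFrame({
    "county_code": [265, 265, 265, 265, 201],
    "year":        [2002, 2002, 2003, 2003, 2002],
    "value":       [2117.52, np.nan, np.nan, np.nan, 1873.59],
})

# GroupBy.first() skips missing values by default, so each county-year keeps
# its single reported wage, or NaN when neither row has one.
collapsed = toy.groupby(["county_code", "year"], as_index=False)["value"].first()
print(collapsed)
```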

Now we need to add a population column so that powiats can be combined into our labour market clusters. To stay consistent, this should match the population measure used for the unemployment measures: either the 2021 National Census (NC) population or the yearly powiat population. These are stored in wages as NC_population and YR_population.

NC population
- From sex_age (p4181) or sex_ed (p4315) and summed to powiat total
- Doing both to check they give consistent population

YR population
- From sex_agegr (p2137)
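A rough sketch of how a population column would be used to combine powiats into clusters, as a population-weighted mean wage. The `cluster` column is hypothetical (the cluster mapping is not defined in this notebook); the wage and population figures are copied from the notebook outputs:

```python
import pandas as pd

toy = pd.DataFrame({
    "county_code":   [201, 202, 203],
    "year":          [2002, 2002, 2002],
    "value":         [1873.59, 1703.68, 1868.60],
    "NC_population": [353740, 390884, 346672],
    "cluster":       ["A", "A", "B"],  # hypothetical labour-market grouping
})

weight_col = "NC_population"  # swap for "YR_population" to use yearly weights

# Drop rows with a missing wage or weight so they do not distort the average.
valid = toy.dropna(subset=["value", weight_col]).copy()
valid["wage_x_weight"] = valid["value"] * valid[weight_col]

sums = valid.groupby(["cluster", "year"])[["wage_x_weight", weight_col]].sum()
cluster_wage = sums["wage_x_weight"] / sums[weight_col]
print(cluster_wage)
```

Only relative weights within a cluster matter here, so any measure that is proportional to the true population gives the same cluster wage.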

First, merge the National Census population:

In [41]:
nc_pop_sa = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/pop_nc_sex_age_p4181.csv", index_col=0)
nc_pop_sa.head()

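# NOTE: this reads the sex-age file (p4181) again, so the check below is trivially
# true; point it at the sex-education (p4315) file to make the comparison meaningful.
nc_pop_se = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/pop_nc_sex_age_p4181.csv", index_col=0)
nc_pop_se.head()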

# Check they give the same powiat totals
nc_pop_1 = nc_pop_sa.groupby("code")["count"].sum()
nc_pop_2 = nc_pop_se.groupby("code")["count"].sum()

(nc_pop_1 == nc_pop_2).sum()

np.int64(380)
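The count of matches is only informative next to the number of powiats, and the element-wise comparison raises if the two indexes differ. A small sketch of a more explicit check, using toy totals (the two values are copied from the NC table in this notebook):

```python
import pandas as pd

totals_a = pd.Series({201000: 353740, 202000: 390884}, name="count")
totals_b = pd.Series({201000: 353740, 202000: 390884}, name="count")

# Fails loudly if the two sources cover different powiats.
assert totals_a.index.equals(totals_b.index)

# Rows where the totals disagree; an empty result means the sources agree.
mismatches = totals_a[totals_a != totals_b]
print(len(mismatches), "of", len(totals_a), "powiats differ")
```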

In [42]:
# They are consistent so will go with the first
nc_pop = pd.DataFrame(nc_pop_1).reset_index()

nc_pop["merge_code"] = nc_pop["code"].apply(lambda x: int(str(x)[:-3]))

nc_pop = nc_pop.merge(
    county_codes,
    how="left",
    left_on="merge_code",
    right_on="county_code"
)

nc_pop

Unnamed: 0,code,count,merge_code,county_code,county_kts,county_name
0,201000,353740,201,201,10030210101000,Powiat bolesławiecki
1,202000,390884,202,202,10030210302000,Powiat dzierżoniowski
2,203000,346672,203,203,10030210203000,Powiat głogowski
3,204000,133268,204,204,10030210204000,Powiat górowski
4,205000,194012,205,205,10030210105000,Powiat jaworski
...,...,...,...,...,...,...
375,3217000,204068,3217,3217,10023216417000,Powiat wałecki
376,3218000,137856,3218,3218,10023216418000,Powiat łobeski
377,3261000,423532,3261,3261,10023216361000,Powiat m. Koszalin
378,3262000,1584672,3262,3262,10023216562000,Powiat m. Szczecin


In [43]:
wages = wages.merge(
    nc_pop[["county_code", "count"]],
    how="left",
    left_on="county_code",
    right_on="county_code"
)

wages = wages.rename(
    columns={"count": "NC_population"}
)

wages

Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name,NC_population
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201,201.0,1.003021e+13,Powiat bolesławiecki,353740
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202,202.0,1.003021e+13,Powiat dzierżoniowski,390884
2,203000,Powiat głogowski,grand total,2002,1868.60,203,203.0,1.003021e+13,Powiat głogowski,346672
3,204000,Powiat górowski,grand total,2002,1730.53,204,204.0,1.003021e+13,Powiat górowski,133268
4,205000,Powiat jaworski,grand total,2002,1705.19,205,205.0,1.003021e+13,Powiat jaworski,194012
...,...,...,...,...,...,...,...,...,...,...
9139,3217000,Powiat wałecki,grand total,2025,,3217,3217.0,1.002322e+13,Powiat wałecki,204068
9140,3218000,Powiat łobeski,grand total,2025,,3218,3218.0,1.002322e+13,Powiat łobeski,137856
9141,3261000,City with powiat status Koszalin,grand total,2025,,3261,3261.0,1.002322e+13,Powiat m. Koszalin,423532
9142,3262000,City with powiat status Szczecin,grand total,2025,,3262,3262.0,1.002322e+13,Powiat m. Szczecin,1584672


Now yearly population (p2137):

In [44]:
yr_pop = pd.read_csv(repo_root / "cleaned/03_01_outcome_data/pop_yr_sex_agegr_p2137.csv", index_col=0)

yr_pop.head()

Unnamed: 0,code,powiat,year,sex,age_group,count
0,201000,Powiat bolesławiecki,1995,total,total,89407.0
1,202000,Powiat dzierżoniowski,1995,total,total,113810.0
2,203000,Powiat głogowski,1995,total,total,91373.0
3,204000,Powiat górowski,1995,total,total,37826.0
4,205000,Powiat jaworski,1995,total,total,54914.0


In [45]:
yr_pop["merge_code"] = yr_pop["code"].apply(lambda x: int(str(x)[:-3]))

yr_pop = yr_pop.merge(
    county_codes,
    how="left",
    left_on="merge_code",
    right_on="county_code"
)

yr_pop

Unnamed: 0,code,powiat,year,sex,age_group,count,merge_code,county_code,county_kts,county_name
0,201000,Powiat bolesławiecki,1995,total,total,89407.0,201,201.0,1.003021e+13,Powiat bolesławiecki
1,202000,Powiat dzierżoniowski,1995,total,total,113810.0,202,202.0,1.003021e+13,Powiat dzierżoniowski
2,203000,Powiat głogowski,1995,total,total,91373.0,203,203.0,1.003021e+13,Powiat głogowski
3,204000,Powiat górowski,1995,total,total,37826.0,204,204.0,1.003021e+13,Powiat górowski
4,205000,Powiat jaworski,1995,total,total,54914.0,205,205.0,1.003021e+13,Powiat jaworski
...,...,...,...,...,...,...,...,...,...,...
721975,3217000,Powiat wałecki,2024,females,0-14,3279.0,3217,3217.0,1.002322e+13,Powiat wałecki
721976,3218000,Powiat łobeski,2024,females,0-14,2178.0,3218,3218.0,1.002322e+13,Powiat łobeski
721977,3261000,City with powiat status Koszalin,2024,females,0-14,6713.0,3261,3261.0,1.002322e+13,Powiat m. Koszalin
721978,3262000,City with powiat status Szczecin,2024,females,0-14,24100.0,3262,3262.0,1.002322e+13,Powiat m. Szczecin


In [46]:
yr_pop[yr_pop["county_code"].isna()]

Unnamed: 0,code,powiat,year,sex,age_group,count,merge_code,county_code,county_kts,county_name
28,263000,City with powiat status Wałbrzych to 2002,1995,total,total,139219.0,263,,,
168,1431000,Powiat warszawski,1995,total,total,1635112.0,1431,,,
410,263000,City with powiat status Wałbrzych to 2002,1996,total,total,138597.0,263,,,
550,1431000,Powiat warszawski,1996,total,total,1628505.0,1431,,,
792,263000,City with powiat status Wałbrzych to 2002,1997,total,total,137829.0,263,,,
...,...,...,...,...,...,...,...,...,...,...
721002,1431000,Powiat warszawski,2022,females,0-14,,1431,,,
721244,263000,City with powiat status Wałbrzych to 2002,2023,females,0-14,,263,,,
721384,1431000,Powiat warszawski,2023,females,0-14,,1431,,,
721626,263000,City with powiat status Wałbrzych to 2002,2024,females,0-14,,263,,,


In [47]:
yr_pop[yr_pop["merge_code"]==1431].head(10)

Unnamed: 0,code,powiat,year,sex,age_group,count,merge_code,county_code,county_kts,county_name
168,1431000,Powiat warszawski,1995,total,total,1635112.0,1431,,,
550,1431000,Powiat warszawski,1996,total,total,1628505.0,1431,,,
932,1431000,Powiat warszawski,1997,total,total,1624843.0,1431,,,
1314,1431000,Powiat warszawski,1998,total,total,1618468.0,1431,,,
1696,1431000,Powiat warszawski,1999,total,total,1677316.0,1431,,,
2078,1431000,Powiat warszawski,2000,total,total,1672418.0,1431,,,
2460,1431000,Powiat warszawski,2001,total,total,1671727.0,1431,,,
2842,1431000,Powiat warszawski,2002,total,total,,1431,,,
3224,1431000,Powiat warszawski,2003,total,total,,1431,,,
3606,1431000,Powiat warszawski,2004,total,total,,1431,,,


In [48]:
missing = yr_pop["merge_code"]==263 # as before
yr_pop.loc[missing, "county_code"] = 265
yr_pop.loc[missing, "county_kts"] = 10030210365000
yr_pop.loc[missing, "county_name"] = "Powiat m. Wałbrzych"

# 1431 only exists up to 2001 - we will just drop
yr_pop = yr_pop.dropna(subset=["county_code"])

In [49]:
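# NOTE: this sums every sex/age_group row, including the "total" rows, so the
# result is a multiple of the head count (201 in 1995: 357628 = 4 x 89407).
yr_pop_sum = pd.DataFrame(yr_pop.groupby(["county_code", "year"])["count"].sum()).reset_index()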
# yr_pop_sum = yr_pop_sum.merge(
#     county_codes,
#     how="left",
#     left_on="county_code",
#     right_on="county_code"
# )
yr_pop_sum

Unnamed: 0,county_code,year,count
0,201.0,1995,357628.0
1,201.0,1996,357644.0
2,201.0,1997,358384.0
3,201.0,1998,358360.0
4,201.0,1999,350960.0
...,...,...,...
11395,3263.0,2020,182932.0
11396,3263.0,2021,181234.0
11397,3263.0,2022,179612.0
11398,3263.0,2023,177950.0
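In the output above, county 201 in 1995 sums to 357628, exactly four times the 89407 in its total/total row, which suggests the sum runs over the `total` rows as well as their breakdowns. A minimal sketch of restricting to the total/total rows before aggregating. Only the 89407 figure is from the data; the other counts are illustrative, not real figures:

```python
import pandas as pd

toy = pd.DataFrame({
    "county_code": [201, 201, 201, 201],
    "year":        [1995, 1995, 1995, 1995],
    "sex":         ["total", "males", "females", "total"],
    "age_group":   ["total", "total", "total", "0-14"],
    "count":       [89407.0, 44000.0, 45407.0, 20000.0],  # breakdowns are made up
})

# Naive sum: adds the headline total to its own breakdowns.
naive = toy.groupby(["county_code", "year"])["count"].sum()

# Keep only the headline row per powiat-year, then aggregate.
totals_only = toy[(toy["sex"] == "total") & (toy["age_group"] == "total")]
head_count = totals_only.groupby(["county_code", "year"])["count"].sum()

print(naive.loc[(201, 1995)], head_count.loc[(201, 1995)])
```

If the population is only ever used as a relative weight, a constant multiple is harmless, but the multiple is only constant when every powiat-year has the same set of breakdown rows filled in.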


In [50]:
wages = wages.merge(
    yr_pop_sum,
    how="left", 
    left_on=["county_code", "year"],
    right_on=["county_code", "year"]
)

wages = wages.rename(columns={"count": "YR_population"})

wages

Unnamed: 0,code,powiat,type,year,value,merge_code,county_code,county_kts,county_name,NC_population,YR_population
0,201000,Powiat bolesławiecki,grand total,2002,1873.59,201,201.0,1.003021e+13,Powiat bolesławiecki,353740,365496.0
1,202000,Powiat dzierżoniowski,grand total,2002,1703.68,202,202.0,1.003021e+13,Powiat dzierżoniowski,390884,447728.0
2,203000,Powiat głogowski,grand total,2002,1868.60,203,203.0,1.003021e+13,Powiat głogowski,346672,360812.0
3,204000,Powiat górowski,grand total,2002,1730.53,204,204.0,1.003021e+13,Powiat górowski,133268,153556.0
4,205000,Powiat jaworski,grand total,2002,1705.19,205,205.0,1.003021e+13,Powiat jaworski,194012,219840.0
...,...,...,...,...,...,...,...,...,...,...,...
9139,3217000,Powiat wałecki,grand total,2025,,3217,3217.0,1.002322e+13,Powiat wałecki,204068,
9140,3218000,Powiat łobeski,grand total,2025,,3218,3218.0,1.002322e+13,Powiat łobeski,137856,
9141,3261000,City with powiat status Koszalin,grand total,2025,,3261,3261.0,1.002322e+13,Powiat m. Koszalin,423532,
9142,3262000,City with powiat status Szczecin,grand total,2025,,3262,3262.0,1.002322e+13,Powiat m. Szczecin,1584672,


Comparison of population measures for wages:

In [55]:
rel = wages[wages["year"]==2021]
diff = rel["YR_population"] - rel["NC_population"]

diff.describe()

count    3.810000e+02
mean     5.434953e+04
std      7.061015e+04
min      9.788000e+03
25%      2.712800e+04
50%      4.004000e+04
75%      5.859400e+04
max      1.077074e+06
dtype: float64
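Absolute differences scale with powiat size, so a relative difference may be easier to read. A sketch on toy rows; the figures are copied from the 2002 rows shown earlier purely as placeholders, not from the 2021 comparison:

```python
import pandas as pd

toy = pd.DataFrame({
    "county_code":   [201, 202, 203],
    "NC_population": [353740, 390884, 346672],
    "YR_population": [365496.0, 447728.0, 360812.0],
})

# Relative gap of the yearly measure over the census measure.
toy["rel_diff"] = toy["YR_population"] / toy["NC_population"] - 1
print(toy["rel_diff"].describe())
```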

The open question is whether these weights should instead count (a) economically active or (b) employed people, i.e. do we want the average wage among employed people or among economically active people? **Need to check the literature on this.**

On reflection (still need to check the literature), the weight should ideally be person-hours worked. If that is not available, it should be the number employed.

Also **need to check**: according to chat, the wage measure covers only firms with 10+ employees, which may not be very representative of low-skilled immigration. Ukrainian immigration is relatively high skilled, so this is less of an issue and more a limitation to note.