# <span style="color:#d3d1df">Ms Thesis Wage Data Discovery</span>

## <span style="color:#f1c232">Overview</span>

**To-Do's:** <br>
*Data Discovery for:*

 1. <span style="color:#00ff21">Wage Data</span>
 2. Employment Data
 3. Trade Data
 4. FDI Data
 5. Sectoral Output Data
 6. Penn Table *(Controls)*
 7. Price Indexes *(Optional, depending on data availability in the Penn Table)*


## <span style="color:#f1c232">Environment</span>

In [1]:
# Packages

import pandas as pd
from Utilities import wages
import eurostat

# Variables

# Mean hourly earnings by sex, age, occupation (2002-2018)
EARN_SES_14 = ["EARN_SES_AGT14", "EARN_SES06_14", "EARN_SES10_14", "EARN_SES14_14", "EARN_SES18_14"]

# Mean hourly earnings by sex, economic activity and occupation (2006-2018)
EARN_SES_47 = ["EARN_SES06_47", "EARN_SES10_47", "EARN_SES14_47", "EARN_SES18_47"]

EARN_SES_16 = ["EARN_SES_AGT16", "EARN_SES06_16", "EARN_SES10_16", "EARN_SES14_16", "EARN_SES18_16"]


## <span style="color:#f1c232">Wage Data</span>

Before exploring the data, it is useful to briefly describe the datasets that will be used in the analysis.
Studies of wage premia generally rely on population surveys, which allow premiums to be analyzed across categories such as gender, age group, economic activity, and, most importantly, occupation. The European-level survey, the ***"Structure of Earnings Survey (SES)"***, is conducted every four years, and its data are available from 2002 to 2018. The original (raw) survey data is disaggregated *(individual level)* and not open to the public. This so-called **microdata** can be obtained on request, but because of the long delivery time and the short thesis deadline, this research uses the aggregated *(by certain categories)* datasets published by **Eurostat**.

After browsing the Eurostat database, I ended up with the following three datasets for the years 2002, 2006, 2010, 2014, and 2018:

* Mean hourly earnings by sex, economic activity and occupation *(Not available for 2002)*,
* Mean hourly earnings by sex, age, occupation,
* Mean hourly earnings by economic activity, sex, educational attainment level

Among the three datasets listed above, the first is the most suitable for my research because it contains both necessary categories: economic activity and occupation. While the need for the occupation category is straightforward, including economic activity in the analysis is motivated by the following hypothesis:

* Offshoring, one of the explanatory variables examined in the empirical part, might vary in scope across sectors as well as across occupational titles. In other words, besides occupation, the sector of economic activity might be useful for investigating offshoring activity.

      



### <span style="color:#b70101">**Mean hourly earnings by sex, economic activity and occupation (EARN_SES_47)**</span>

##### **Overview**


| Dataset Code  | Year | Description                                                   | Main Source                            | Source Definition                                                     |
|---------------|------|---------------------------------------------------------------|----------------------------------------|-----------------------------------------------------------------------|
| EARN_SES06_47 | 2006 | Mean hourly earnings by economic activity, sex, occupation    | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses06_esms.htm   |
| EARN_SES10_47 | 2010 | Mean hourly earnings by sex, economic activity and occupation | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2010_esms.htm |
| EARN_SES14_47 | 2014 | Mean hourly earnings by sex, economic activity and occupation | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2014_esms.htm |
| EARN_SES18_47 | 2018 | Mean hourly earnings by sex, economic activity and occupation | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2018_esms.htm |

In [7]:
wages.cat_explorer(EARN_SES_47).pivot(index="Category", columns=["Dataset","Year"], values=["#ofUniques","Uniques"])

Unnamed: 0_level_0,#ofUniques,#ofUniques,#ofUniques,#ofUniques,Uniques,Uniques,Uniques,Uniques
Dataset,EARN_SES06_47,EARN_SES10_47,EARN_SES14_47,EARN_SES18_47,EARN_SES06_47,EARN_SES10_47,EARN_SES14_47,EARN_SES18_47
Year,2006,2010,2014,2018,2006,2010,2014,2018
Category,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
currency,,3.0,4.0,,,"[EUR, PPS, PC]","[EUR, PPS, PC, NAC]",
freq,1.0,1.0,1.0,1.0,[A],[A],[A],[A]
geo,35.0,39.0,40.0,41.0,"[EU27_2007, EU25, EU15, EA16, EA13, BE, BG, CZ...","[EU28, EU27_2007, EU25, EU15, EA17, EA13, BE, ...","[EU28, EU27_2007, EA19, EA18, EA17, BE, BG, CZ...","[EU27_2020, EU28, EU27_2007, EA19, EA18, EA17,..."
indic_se,4.0,4.0,4.0,3.0,"[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, OPAY, OP_E_PC]"
isco08,,13.0,13.0,14.0,,"[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6, O...","[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6, O...","[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6-8,..."
isco88,14.0,,,,"[TOTAL, ISCO1-5, ISCO1, ISCO2, ISCO3, ISCO4, I...",,,
nace_r1,22.0,,,,"[C-O, C-O_X_L, C-K, C-F, C-E, C, D, E, F, G-K,...",,,
nace_r2,,29.0,29.0,29.0,,"[B-S, B-S_X_O, B-N, B-F, B-E, B, C, D, E, F, G...","[B-S, B-S_X_O, B-N, B-F, B-E, B, C, D, E, F, G...","[B-S, B-S_X_O, B-N, B-F, B-E, B, C, D, E, F, G..."
sex,3.0,3.0,3.0,3.0,"[T, M, F]","[T, M, F]","[T, M, F]","[T, M, F]"
sizeclas,2.0,2.0,2.0,2.0,"[TOTAL, GE10]","[TOTAL, GE10]","[TOTAL, GE10]","[TOTAL, GE10]"


The *cat_explorer* function lists the categories available in each of the given Eurostat datasets, here EARN_SES_47. Pivoting its output shows how many unique values each category has in each year (equivalently, in each dataset). Before examining the more "problematic" categories in detail, it is useful to briefly list the findings and further actions for the trivial ones.
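The idea behind such a category summary can be sketched with plain pandas. The helper `summarize_categories`, the dataset codes `DATASET_A`/`DATASET_B`, and the toy values below are hypothetical and only illustrate the technique; this is not the actual implementation of `wages.cat_explorer`.

```python
import pandas as pd


def summarize_categories(frames, category_cols):
    """Count and list the unique values of each category column per dataset.

    frames: dict mapping a dataset code to its DataFrame (hypothetical input shape).
    """
    rows = []
    for code, frame in frames.items():
        for col in category_cols:
            if col not in frame.columns:
                # A category may be absent from some datasets (e.g. unit vs. currency).
                continue
            uniques = frame[col].dropna().unique().tolist()
            rows.append(
                {
                    "Dataset": code,
                    "Category": col,
                    "#ofUniques": len(uniques),
                    "Uniques": uniques,
                }
            )
    return pd.DataFrame(rows)


# Toy data, invented for illustration only.
toy = {
    "DATASET_A": pd.DataFrame({"sex": ["T", "M", "F"], "unit": ["EUR", "EUR", "PPS"]}),
    "DATASET_B": pd.DataFrame({"sex": ["T", "M", "F"], "currency": ["EUR", "PPS", "PC"]}),
}

summary = summarize_categories(toy, ["sex", "unit", "currency"])

# One row per category, one column per dataset; absent categories show up as NaN.
wide = summary.pivot(index="Category", columns="Dataset", values="#ofUniques")
print(wide)
```

Categories that exist in only some datasets appear as NaN in the pivoted table, which is the same pattern visible for `currency`, `isco08`, `isco88`, `nace_r1` and `nace_r2` in the output above.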


##### **currency-unit:**

In [15]:
wages.cat_describer(EARN_SES_47,['currency','unit']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Error: currency not in EARN_SES06_47
Error: unit not in EARN_SES10_47
Error: unit not in EARN_SES14_47
Error: currency not in EARN_SES18_47


Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(EUR, Euro)",unit,currency,currency,unit
"(NAC, National currency)",,,currency,unit
"(PC, Percentage)",unit,currency,currency,unit
"(PPS, Purchasing Power Standard)",,currency,currency,
"(PPS, Purchasing power standard (PPS))",unit,,,unit


The output of *cat_describer* clarifies some steps for the data preparation:
* NAC is the only code that is missing in some years (2006 and 2010).
* The two categories, currency and unit, share the same codes and can be merged without mapping.
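A minimal sketch of that merge, assuming each survey year is held in its own DataFrame. The frame names, the `value` and `year` columns, and the numbers are invented for illustration; only the column names `unit` and `currency` come from the output above.

```python
import pandas as pd

# Toy frames: one wave labels the column "unit", the other "currency".
df_2006 = pd.DataFrame({"unit": ["EUR", "PPS"], "value": [10.0, 12.0]})
df_2010 = pd.DataFrame({"currency": ["EUR", "PPS"], "value": [11.0, 13.0]})


def harmonize_unit(frame):
    """Rename 'currency' to 'unit' so that all waves share one column name."""
    if "currency" in frame.columns and "unit" not in frame.columns:
        return frame.rename(columns={"currency": "unit"})
    return frame


combined = pd.concat(
    [
        harmonize_unit(df_2006).assign(year=2006),
        harmonize_unit(df_2010).assign(year=2010),
    ],
    ignore_index=True,
)
print(combined)
```

Because the codes (EUR, PPS, PC, NAC) are identical under both column names, a rename is enough and no value-level mapping is required.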


##### **freq:**

In [14]:
wages.cat_describer(EARN_SES_47,['freq']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(A, Annual)",freq,freq,freq,freq


We observe the same value "[A]" in every year, implying that the data is annual.
- It will not be used in the final table(s).

##### **geo:** <br>
 

In [13]:
wages.cat_describer(EARN_SES_47,['geo']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(AL, Albania)",,,,geo
"(AT, Austria)",geo,geo,geo,geo
"(BE, Belgium)",geo,geo,geo,geo
"(BG, Bulgaria)",geo,geo,geo,geo
"(CH, Switzerland)",,geo,geo,geo
"(CY, Cyprus)",geo,geo,geo,geo
"(CZ, Czechia)",geo,geo,geo,geo
"(DE, Germany (until 1990 former territory of the FRG))",geo,geo,geo,geo
"(DK, Denmark)",geo,geo,geo,geo
"(EA13, Euro area - 13 countries (2007))",geo,geo,,


* Certain descriptions are irrelevant to our analysis:<br>
    - EA,EA12,EA13,EA16,EA17,EA18,EA19,EU15,EU25,EU27_2007,EU27_2020,EU28
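Dropping these aggregates can be sketched as a simple filter on the `geo` column. The set of codes is the list given above; the toy DataFrame and its `value` column are invented for illustration.

```python
import pandas as pd

# Aggregate codes listed above as irrelevant to the analysis.
AGGREGATES = {
    "EA", "EA12", "EA13", "EA16", "EA17", "EA18", "EA19",
    "EU15", "EU25", "EU27_2007", "EU27_2020", "EU28",
}

# Toy data, invented for illustration only.
toy = pd.DataFrame({"geo": ["EU28", "BE", "EA19", "AT"], "value": [1.0, 2.0, 3.0, 4.0]})

# Keep only rows whose geo code is not an aggregate.
countries_only = toy[~toy["geo"].isin(AGGREGATES)].reset_index(drop=True)
print(countries_only)
```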

In [20]:
geo_EARN_SES_47=wages.cat_describer(EARN_SES_47,['geo']).groupby(['Descriptions']).count()
geo_EARN_SES_47.head()

Unnamed: 0_level_0,Dataset,Year,Category
Descriptions,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"(AL, Albania)",1,1,1
"(AT, Austria)",4,4,4
"(BE, Belgium)",4,4,4
"(BG, Bulgaria)",4,4,4
"(CH, Switzerland)",3,3,3


The dataframe *geo_EARN_SES_47* stores the geographic entities and the number of datasets (years) in which each of them is observed in EARN_SES_47.<br>
A similar approach will be applied to the other datasets during the project to narrow down the country list while gaining consistency in the data.
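Narrowing the country list to entities observed in every dataset could look like the sketch below. The `coverage` frame mimics the long format with `Descriptions` and `Dataset` columns seen above, but its contents and the dataset labels `D06`/`D10` are invented.

```python
import pandas as pd

# Toy coverage table: one row per (entity, dataset) pair in which the entity appears.
coverage = pd.DataFrame(
    {
        "Descriptions": ["AT", "AT", "BE", "BE", "AL"],
        "Dataset": ["D06", "D10", "D06", "D10", "D10"],
    }
)

n_datasets = coverage["Dataset"].nunique()

# Number of distinct datasets in which each entity is observed.
counts = coverage.groupby("Descriptions")["Dataset"].nunique()

# Keep only the entities that are present in all datasets.
always_present = sorted(counts[counts == n_datasets].index)
print(always_present)
```

Relaxing the condition to, say, `counts >= n_datasets - 1` would trade some consistency for a longer country list.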

##### **indic_se:**

In [12]:
wages.cat_describer(EARN_SES_47,['indic_se']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(ERN, Gross earnings)",indic_se,indic_se,indic_se,indic_se
"(E_F_M_PC, Gross earnings of women as a percentage of those of men)",indic_se,indic_se,indic_se,
"(OPAY, Overtime pay)",indic_se,indic_se,indic_se,indic_se
"(OP_E_PC, Overtime pay as a percentage of earnings)",indic_se,indic_se,indic_se,indic_se


This category identifies the earnings indicator.
* We will only use "ERN".

##### **isco08-isco88:**

In [16]:
wages.cat_describer(EARN_SES_47,['isco08','isco88']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Error: isco08 not in EARN_SES06_47
Error: isco88 not in EARN_SES10_47
Error: isco88 not in EARN_SES14_47
Error: isco88 not in EARN_SES18_47


Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(ISCO0, Armed forces)",isco88,,,
"(ISCO1, Legislators, senior officials and managers)",isco88,,,
"(ISCO1-5, Non manual workers)",isco88,,,
"(ISCO2, Professionals)",isco88,,,
"(ISCO3, Technicians and associate professionals)",isco88,,,
"(ISCO4, Clerks)",isco88,,,
"(ISCO5, Service workers and shop and market sales workers)",isco88,,,
"(ISCO6, Skilled agricultural and fishery workers)",isco88,,,
"(ISCO7, Craft and related trades workers)",isco88,,,
"(ISCO7-9, Manual workers)",isco88,,,


Eurostat's "Comparability ISCO_08-ISCO_88" report, which can be found in the Resources folder, explains the intention and logic behind the change in the classification method from ISCO-88 (used up to the 2006 wave) to ISCO-08 (used from 2010 onward). Although the document states that the changes were designed to allow time series analysis, we still need to map and merge the two classifications in the data.
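A possible shape for that mapping step is sketched below. The one-to-one correspondence between `ISCOn` and `OCn` codes at major-group level is an assumption made purely for illustration and must be verified against the comparability report. The toy frame and its values are invented.

```python
import pandas as pd

# ASSUMPTION: major groups map one-to-one (ISCO1 -> OC1, ..., ISCO0 -> OC0).
# This must be checked against the comparability report before use.
ISCO88_TO_ISCO08 = {f"ISCO{i}": f"OC{i}" for i in range(10)}
ISCO88_TO_ISCO08["TOTAL"] = "TOTAL"

# Toy data, invented for illustration only.
toy_2006 = pd.DataFrame(
    {
        "isco88": ["TOTAL", "ISCO1", "ISCO2", "ISCO1-5"],
        "value": [15.0, 30.0, 25.0, 20.0],
    }
)

# Add the harmonized code; codes without a mapping become NaN.
mapped = toy_2006.assign(isco08=toy_2006["isco88"].map(ISCO88_TO_ISCO08))

# Collect unmapped codes (e.g. aggregates such as ISCO1-5) for manual review.
unmapped = mapped[mapped["isco08"].isna()]
print(mapped)
print(unmapped["isco88"].tolist())
```

Leaving the aggregate codes unmapped on purpose makes them show up in `unmapped`, so they can be handled explicitly instead of being silently merged.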

##### **nace_r1-nace_r2:**

In [17]:
wages.cat_describer(EARN_SES_47,['nace_r1','nace_r2']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Error: nace_r2 not in EARN_SES06_47
Error: nace_r1 not in EARN_SES10_47
Error: nace_r1 not in EARN_SES14_47
Error: nace_r1 not in EARN_SES18_47


Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(B, Mining and quarrying)",,nace_r2,nace_r2,nace_r2
"(B-E, Industry (except construction))",,nace_r2,nace_r2,nace_r2
"(B-F, Industry and construction)",,nace_r2,nace_r2,nace_r2
"(B-N, Business economy)",,nace_r2,nace_r2,nace_r2
"(B-S, Industry, construction and services (except activities of households as employers and extra-territorial organisations and bodies))",,nace_r2,nace_r2,nace_r2
"(B-S_X_O, Industry, construction and services (except public administration, defense, compulsory social security))",,nace_r2,nace_r2,nace_r2
"(C, Manufacturing)",,nace_r2,nace_r2,nace_r2
"(C, Mining and quarrying)",nace_r1,,,
"(C-E, Industry (except construction))",nace_r1,,,
"(C-F, Industry)",nace_r1,,,


##### **sex:**

In [22]:
wages.cat_describer(EARN_SES_47,['sex']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(F, Females)",sex,sex,sex,sex
"(M, Males)",sex,sex,sex,sex
"(T, Total)",sex,sex,sex,sex


##### **sizeclas:**

In [23]:
wages.cat_describer(EARN_SES_47,['sizeclas']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(GE10, 10 employees or more)",sizeclas,sizeclas,sizeclas,sizeclas
"(TOTAL, Total)",sizeclas,sizeclas,sizeclas,sizeclas


##### **Notes for Data Preparation:**

### <span style="color:#b70101">**Mean hourly earnings by sex, age, occupation**</span>

As mentioned above, the datasets are as follows:<br>
<br>

| Dataset Code  | Year | Description                                                                 | Main Source                                 | Source Definition                                                                |
|---------------|------|--------------------------------------------------------------------------------|----------------------------------------|-----------------------------------------------------------------------|
| EARN_SES_AGT14 | 2002 | Mean hourly earnings by sex, age, occupation    | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses_esms.htm   |
| EARN_SES06_14 | 2006 | Mean hourly earnings by sex, age, occupation - NACE Rev. 1.1, C-O excluding L   | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses06_esms.htm   |
| EARN_SES10_14 | 2010 | Mean hourly earnings by sex, age and occupation - NACE Rev. 2, B-S excluding O | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2010_esms.htm |
| EARN_SES14_14 | 2014 | Mean hourly earnings by sex, age and occupation - NACE Rev. 2, B-S excluding O | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2014_esms.htm |
| EARN_SES18_14 | 2018 | Mean hourly earnings by sex, age and occupation - NACE Rev. 2, B-S excluding O | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2018_esms.htm |

<br>
We can describe the categories in these datasets as follows:


In [2]:
wages.cat_explorer(EARN_SES_14).pivot(index="Category", columns=["Dataset","Year"], values=["#ofUniques","Uniques"])

Unnamed: 0_level_0,#ofUniques,#ofUniques,#ofUniques,#ofUniques,#ofUniques,Uniques,Uniques,Uniques,Uniques,Uniques
Dataset,EARN_SES_AGT14,EARN_SES06_14,EARN_SES10_14,EARN_SES14_14,EARN_SES18_14,EARN_SES_AGT14,EARN_SES06_14,EARN_SES10_14,EARN_SES14_14,EARN_SES18_14
Year,2002,2006,2010,2014,2018,2002,2006,2010,2014,2018
Category,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3
age,7.0,7.0,6.0,8.0,5.0,"[TOTAL, Y_LT30, Y30-39, Y40-49, Y50-59, Y_GE60...","[TOTAL, Y_LT30, Y30-39, Y40-49, Y50-59, Y_GE60...","[TOTAL, Y_LT30, Y30-39, Y40-49, Y50-59, Y_GE60]","[TOTAL, Y_LT30, Y30-39, Y30-49, Y40-49, Y50-59...","[TOTAL, Y_LT30, Y30-49, Y_GE50, UNK]"
currency,,,3.0,4.0,,,,"[EUR, PPS, PC]","[EUR, PPS, PC, NAC]",
freq,1.0,1.0,1.0,1.0,1.0,[A],[A],[A],[A],[A]
geo,34.0,35.0,40.0,40.0,41.0,"[EU25, EU15, NMS10, EA, EA12, BE, BG, CZ, DK, ...","[EU27_2007, EU25, EU15, EA16, EA13, BE, BG, CZ...","[EU28, EU27_2007, EU25, EU15, EA17, EA16, EA13...","[EU28, EU27_2007, EA19, EA18, EA17, BE, BG, CZ...","[EU27_2020, EU28, EU27_2007, EA19, EA18, EA17,..."
indic_se,4.0,4.0,4.0,4.0,3.0,"[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, OPAY, OP_E_PC]"
isco08,,,13.0,13.0,14.0,,,"[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6, O...","[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6, O...","[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6-8,..."
isco88,13.0,14.0,,,,"[TOTAL, ISCO1-5, ISCO1, ISCO2, ISCO3, ISCO4, I...","[TOTAL, ISCO1-5, ISCO1, ISCO2, ISCO3, ISCO4, I...",,,
sex,3.0,3.0,3.0,3.0,3.0,"[T, M, F]","[T, M, F]","[T, M, F]","[T, M, F]","[T, M, F]"
sizeclas,,2.0,2.0,2.0,2.0,,"[TOTAL, GE10]","[TOTAL, GE10]","[TOTAL, GE10]","[TOTAL, GE10]"
unit,3.0,3.0,,,4.0,"[EUR, PPS, PC]","[EUR, PPS, PC]",,,"[EUR, NAC, PPS, PC]"


* age
* currency-unit
* freq
* geo
* indic_se
* isco08-isco88
* sex
* sizeclas

In [5]:
wages.cat_describer(EARN_SES_14,['age']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>
<class 'str'>


Unnamed: 0_level_0,Category,Category,Category,Category,Category
Year,2002,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
"(TOTAL, Total)",age,age,age,age,age
"(UNK, Unknown)",age,age,,,age
"(Y30-39, From 30 to 39 years)",age,age,age,age,
"(Y30-49, From 30 to 49 years)",,,,age,age
"(Y40-49, From 40 to 49 years)",age,age,age,age,
"(Y50-59, From 50 to 59 years)",age,age,age,age,
"(Y_GE50, 50 years or over)",,,,age,age
"(Y_GE60, 60 years or over)",age,age,age,age,
"(Y_LT30, Less than 30 years)",age,age,age,age,age


Error: currency not in EARN_SES_AGT14
Error: currency not in EARN_SES06_14
Error: unit not in EARN_SES10_14
Error: unit not in EARN_SES14_14
Error: currency not in EARN_SES18_14


Unnamed: 0_level_0,Category,Category,Category,Category,Category
Year,2002,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
"(EUR, Euro)",unit,unit,currency,currency,unit
"(NAC, National currency)",,,,currency,unit
"(PC, Percentage)",unit,unit,currency,currency,unit
"(PPS, Purchasing Power Standard)",,,currency,currency,
"(PPS, Purchasing power standard (PPS))",unit,unit,,,unit


In [3]:
wages.cat_describer(EARN_SES_14,['isco88','isco08']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Error: isco08 not in EARN_SES_AGT14
Error: isco08 not in EARN_SES06_14
Error: isco88 not in EARN_SES10_14
Error: isco88 not in EARN_SES14_14
Error: isco88 not in EARN_SES18_14


Unnamed: 0_level_0,Category,Category,Category,Category,Category
Year,2002,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
"(ISCO0, Armed forces)",isco88,isco88,,,
"(ISCO1, Legislators, senior officials and managers)",isco88,isco88,,,
"(ISCO1-5, Non manual workers)",isco88,isco88,,,
"(ISCO2, Professionals)",isco88,isco88,,,
"(ISCO3, Technicians and associate professionals)",isco88,isco88,,,
"(ISCO4, Clerks)",isco88,isco88,,,
"(ISCO5, Service workers and shop and market sales workers)",isco88,isco88,,,
"(ISCO6, Skilled agricultural and fishery workers)",,isco88,,,
"(ISCO7, Craft and related trades workers)",isco88,isco88,,,
"(ISCO7-9, Manual workers)",isco88,isco88,,,


In [4]:
wages.cat_describer(EARN_SES_14,['geo']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category,Category
Year,2002,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
"(AL, Albania)",,,,,geo
"(AT, Austria)",geo,geo,geo,geo,geo
"(BE, Belgium)",geo,geo,geo,geo,geo
"(BG, Bulgaria)",geo,geo,geo,geo,geo
"(CH, Switzerland)",,,geo,geo,geo
"(CY, Cyprus)",geo,geo,geo,geo,geo
"(CZ, Czechia)",geo,geo,geo,geo,geo
"(DE, Germany (until 1990 former territory of the FRG))",geo,geo,geo,geo,geo
"(DK, Denmark)",geo,geo,geo,geo,geo
"(EA, Euro area (EA11-1999, EA12-2001, EA13-2007, EA15-2008, EA16-2009, EA17-2011, EA18-2014, EA19-2015, EA20-2023))",geo,,,,


* Certain descriptions are irrelevant to our analysis:<br>
    - EA,EA12,EA13,EA16,EA17,EA18,EA19,EU15,EU25,EU27_2007,EU27_2020,EU28

In [24]:
wages.cat_describer(EARN_SES_14,['geo']).groupby(['Descriptions']).count().query('Dataset>=4')

Unnamed: 0_level_0,Dataset,Year,Category
Descriptions,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"(AT, Austria)",5,5,5
"(BE, Belgium)",5,5,5
"(BG, Bulgaria)",5,5,5
"(CY, Cyprus)",5,5,5
"(CZ, Czechia)",5,5,5
"(DE, Germany (until 1990 former territory of the FRG))",5,5,5
"(DK, Denmark)",5,5,5
"(EE, Estonia)",5,5,5
"(EL, Greece)",5,5,5
"(ES, Spain)",5,5,5


In [25]:
wages.cat_describer(EARN_SES_14,['geo']).groupby(['Descriptions']).count().query('Dataset>=4').count()

Dataset     31
Year        31
Category    31
dtype: int64

In [3]:

eurostat.get_dic('EARN_SES06_14','indic_se', full=False)

[('ERN', 'Gross earnings'),
 ('E_F_M_PC', 'Gross earnings of women as a percentage of those of men'),
 ('OPAY', 'Overtime pay'),
 ('OP_E_PC', 'Overtime pay as a percentage of earnings')]

### Flag Check

To catch potential problems, it is good practice to check the data quality first. We therefore import the data including the flags, which are:
 * b = break in time series, c = confidential, d = definition differs, see metadata, e = estimated, f = forecast, n = not significant, p = provisional, r = revised, s = Eurostat estimate, u = low reliability, z = not applicable.

I will check the ***n, p, u*** flags for the wage and employment data from Eurostat and record the results in the data catalog.
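Such a check could be sketched as follows, assuming the flags sit in a single column of a long-format table. The column names `geo`, `value` and `flag` and all rows are hypothetical; the real data may name the flag column differently.

```python
import pandas as pd

# Flags of interest: n = not significant, p = provisional, u = low reliability.
FLAGS_TO_CHECK = ["n", "p", "u"]

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {
        "geo": ["BE", "BE", "AT", "AT", "EU15"],
        "value": [12.0, 13.0, 14.0, 15.0, 16.0],
        "flag": ["u", None, "p", "u", "s"],
    }
)

# Keep only observations carrying one of the flags of interest.
flagged = toy[toy["flag"].isin(FLAGS_TO_CHECK)]

# Count flagged observations per entity and flag type.
counts = flagged.groupby(["geo", "flag"]).size().unstack(fill_value=0)
print(counts)
```

The resulting table, with one row per entity and one column per flag, is in a form that can be recorded directly in the data catalog.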

In [67]:
df.query('f == "s"').groupby([r'geo\TIME_PERIOD'])['unit'].count()

geo\TIME_PERIOD
EA12    672
EU15    672
EU25    672
Name: unit, dtype: int64