# <span style="color:#d3d1df">Ms Thesis Wage Data Discovery</span>

## <span style="color:#f1c232">Environment</span>

In [19]:
#Packages

import pandas as pd
import numpy as np 
from Utilities import wages 
import eurostat

#Variables

EARN_SES_14=["EARN_SES_AGT14","EARN_SES06_14","EARN_SES10_14","EARN_SES14_14","EARN_SES18_14"]

EARN_SES_47=["EARN_SES06_47","EARN_SES10_47","EARN_SES14_47","EARN_SES18_47"]

EARN_SES_16=["EARN_SES_AGT16","EARN_SES06_16","EARN_SES10_16","EARN_SES14_16","EARN_SES18_16"]


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## <span style="color:#f1c232">Data Discovery</span>

Before exploring the data, it is useful to briefly describe the datasets that will be used in the analysis.
Studies of wage premia generally rely on population surveys to analyze premiums across categories such as gender, age group, economic activity, and, most importantly, occupation. The European-level survey is conducted every four years and is available from 2002 to 2018 under the name ***"Structure of Earnings Survey (SES)"***. The original (raw) survey data is disaggregated *(individual level)* and not open for public access. This so-called **microdata** can be obtained on request, but because of the long delivery time and the short thesis deadline, this research uses the aggregated *(by certain categories)* datasets published by **Eurostat**.

After browsing the Eurostat Database, I ended up with the following three datasets for the years 2002, 2006, 2010, 2014, and 2018:

* Mean hourly earnings by sex, economic activity and occupation *(Not available for 2002)*,
* Mean hourly earnings by sex, age, occupation,
* Mean hourly earnings by economic activity, sex, educational attainment level

Among the three datasets listed above, the first is the most suitable for my research since it contains both necessary categories: economic activity and occupation. While the need for the occupation category is straightforward, including economic activity in the analysis is motivated by the following hypothesis:

* Offshoring, one of the explanatory variables examined in the empirical part, might be closely tied to sectors as well as to occupational titles. In other words, the sector of economic activity, in addition to the occupation, might be useful for investigating offshoring activity.

      



### <span style="color:#909a07">**Mean hourly earnings by sex, economic activity and occupation (EARN_SES_47)**</span>


##### **Overview**


| Dataset Code  | Year | Description                                                   | Main Source                            | Source Definition                                                     |
|---------------|------|---------------------------------------------------------------|----------------------------------------|-----------------------------------------------------------------------|
| EARN_SES06_47 | 2006 | Mean hourly earnings by economic activity, sex, occupation    | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses06_esms.htm   |
| EARN_SES10_47 | 2010 | Mean hourly earnings by sex, economic activity and occupation | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2010_esms.htm |
| EARN_SES14_47 | 2014 | Mean hourly earnings by sex, economic activity and occupation | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2014_esms.htm |
| EARN_SES18_47 | 2018 | Mean hourly earnings by sex, economic activity and occupation | The Structure of Earnings Survey (SES) | https://ec.europa.eu/eurostat/cache/metadata/en/earn_ses2018_esms.htm |

In [3]:
wages.cat_explorer(EARN_SES_47).pivot(index="Category", columns=["Dataset","Year"], values=["#ofUniques","Uniques"])

Unnamed: 0_level_0,#ofUniques,#ofUniques,#ofUniques,#ofUniques,Uniques,Uniques,Uniques,Uniques
Dataset,EARN_SES06_47,EARN_SES10_47,EARN_SES14_47,EARN_SES18_47,EARN_SES06_47,EARN_SES10_47,EARN_SES14_47,EARN_SES18_47
Year,2006,2010,2014,2018,2006,2010,2014,2018
Category,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
currency,,3.0,4.0,,,"[EUR, PPS, PC]","[EUR, PPS, PC, NAC]",
freq,1.0,1.0,1.0,1.0,[A],[A],[A],[A]
geo,35.0,39.0,40.0,41.0,"[EU27_2007, EU25, EU15, EA16, EA13, BE, BG, CZ...","[EU28, EU27_2007, EU25, EU15, EA17, EA13, BE, ...","[EU28, EU27_2007, EA19, EA18, EA17, BE, BG, CZ...","[EU27_2020, EU28, EU27_2007, EA19, EA18, EA17,..."
indic_se,4.0,4.0,4.0,3.0,"[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, E_F_M_PC, OPAY, OP_E_PC]","[ERN, OPAY, OP_E_PC]"
isco08,,13.0,13.0,14.0,,"[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6, O...","[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6, O...","[TOTAL, OC1-5, OC1, OC2, OC3, OC4, OC5, OC6-8,..."
isco88,14.0,,,,"[TOTAL, ISCO1-5, ISCO1, ISCO2, ISCO3, ISCO4, I...",,,
nace_r1,22.0,,,,"[C-O, C-O_X_L, C-K, C-F, C-E, C, D, E, F, G-K,...",,,
nace_r2,,29.0,29.0,29.0,,"[B-S, B-S_X_O, B-N, B-F, B-E, B, C, D, E, F, G...","[B-S, B-S_X_O, B-N, B-F, B-E, B, C, D, E, F, G...","[B-S, B-S_X_O, B-N, B-F, B-E, B, C, D, E, F, G..."
sex,3.0,3.0,3.0,3.0,"[T, M, F]","[T, M, F]","[T, M, F]","[T, M, F]"
sizeclas,2.0,2.0,2.0,2.0,"[TOTAL, GE10]","[TOTAL, GE10]","[TOTAL, GE10]","[TOTAL, GE10]"


The *cat_explorer* function lists the categories (dimensions) available in each of the given Eurostat datasets, here EARN_SES_47. Pivoting its output shows how many unique values each category has in each year or, equivalently, in each dataset. Before examining the "problematic" categories in more detail, it is useful to briefly list the findings and follow-up actions for the trivial ones.
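The `wages.cat_explorer` helper lives in the local `Utilities` module and is not shown in this notebook. Purely as an illustration of the idea, the sketch below assumes each dataset is already loaded as a DataFrame with one column per category. The function name `explore_categories` and the toy frames are hypothetical and are not the real helper's API.

```python
import pandas as pd


def explore_categories(frames):
    """Count unique codes per category column for each dataset.

    `frames` maps a dataset code to a DataFrame whose columns are
    category dimensions (hypothetical input, not the real helper's API).
    """
    rows = []
    for dataset, frame in frames.items():
        for category in frame.columns:
            uniques = frame[category].dropna().unique().tolist()
            rows.append({
                "Dataset": dataset,
                "Category": category,
                "#ofUniques": len(uniques),
                "Uniques": uniques,
            })
    return pd.DataFrame(rows)


# Toy stand-ins for two survey waves (made-up rows, real column names).
toy_frames = {
    "EARN_SES06_47": pd.DataFrame(
        {"sex": ["T", "M", "F"], "isco88": ["TOTAL", "ISCO1", "ISCO2"]}
    ),
    "EARN_SES10_47": pd.DataFrame(
        {"sex": ["T", "M", "F"], "isco08": ["TOTAL", "OC1", "OC1"]}
    ),
}

summary = explore_categories(toy_frames)
# One row per category, one column per dataset; NaN marks an absent category.
overview = summary.pivot(index="Category", columns="Dataset", values="#ofUniques")
print(overview)
```

A category that exists in only one wave (such as `isco88`) shows up as NaN in the other columns, which is the same pattern visible in the pivoted output above.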


##### **currency-unit:**

In [4]:
wages.cat_describer(EARN_SES_47,['currency','unit']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Error: currency not in EARN_SES06_47
Error: unit not in EARN_SES10_47
Error: unit not in EARN_SES14_47
Error: currency not in EARN_SES18_47


Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(EUR, Euro)",unit,currency,currency,unit
"(NAC, National currency)",,,currency,unit
"(PC, Percentage)",unit,currency,currency,unit
"(PPS, Purchasing Power Standard)",,currency,currency,
"(PPS, Purchasing power standard (PPS))",unit,,,unit


https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Purchasing_power_standard_(PPS)

The output of *cat_describer* clarifies some steps for the data preparation part:
* The same codes appear under `unit` (2006, 2018) and `currency` (2010, 2014); NAC is the only code that is missing in some years (2006 and 2010).
* The two categories can therefore be merged without mapping.


##### **freq:**

In [5]:
wages.cat_describer(EARN_SES_47,['freq']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(A, Annual)",freq,freq,freq,freq


Every dataset has the single value "A", meaning the data is annual.
- This category will not be used in the final table(s).

##### **geo:**
 

In [6]:
wages.cat_describer(EARN_SES_47,['geo']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(AL, Albania)",,,,geo
"(AT, Austria)",geo,geo,geo,geo
"(BE, Belgium)",geo,geo,geo,geo
"(BG, Bulgaria)",geo,geo,geo,geo
"(CH, Switzerland)",,geo,geo,geo
"(CY, Cyprus)",geo,geo,geo,geo
"(CZ, Czechia)",geo,geo,geo,geo
"(DE, Germany (until 1990 former territory of the FRG))",geo,geo,geo,geo
"(DK, Denmark)",geo,geo,geo,geo
"(EA13, Euro area - 13 countries (2007))",geo,geo,,


* Certain codes denote euro-area or EU aggregates rather than countries and are irrelevant to our analysis:<br>
    - EA,EA12,EA13,EA16,EA17,EA18,EA19,EU15,EU25,EU27_2007,EU27_2020,EU28
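Dropping these aggregates could look roughly like the following. This is a minimal sketch in plain Python; the helper name `keep_countries` is hypothetical, and the sample codes are taken from the 2006 `geo` output shown earlier.

```python
# Aggregate codes listed above; they describe groups of countries, not countries.
AGGREGATES = {
    "EA", "EA12", "EA13", "EA16", "EA17", "EA18", "EA19",
    "EU15", "EU25", "EU27_2007", "EU27_2020", "EU28",
}


def keep_countries(geo_codes):
    """Return the codes that are not EU/euro-area aggregates, order preserved."""
    return [code for code in geo_codes if code not in AGGREGATES]


# A few codes from the 2006 geo list.
sample = ["EU27_2007", "EU25", "EU15", "EA16", "EA13", "BE", "BG", "CZ"]
countries = keep_countries(sample)
print(countries)  # ['BE', 'BG', 'CZ']
```

On a DataFrame, the same set could be used with `Series.isin` on the country-code column.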

In [7]:
geo_EARN_SES_47=wages.cat_describer(EARN_SES_47,['geo']).groupby(['Descriptions']).count()
geo_EARN_SES_47.head()

Unnamed: 0_level_0,Dataset,Year,Category
Descriptions,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"(AL, Albania)",1,1,1
"(AT, Austria)",4,4,4
"(BE, Belgium)",4,4,4
"(BG, Bulgaria)",4,4,4
"(CH, Switzerland)",3,3,3


The dataframe *geo_EARN_SES_47* stores the geographic entities and the number of years in which each is observed in the EARN_SES_47 datasets.<br>
A similar approach will be applied to the other datasets during the project to narrow down the country list while keeping the data consistent.
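One way to use these counts, sketched under the assumption that "consistent" means observed in all four waves, is to keep only the codes with a count of 4. The dictionary below copies the five rows of the `head()` output; the helper name is hypothetical.

```python
# Number of waves in which each geo code appears (values from the head() output).
waves_observed = {"AL": 1, "AT": 4, "BE": 4, "BG": 4, "CH": 3}

N_WAVES = 4  # 2006, 2010, 2014, 2018


def observed_in_all_waves(counts, n_waves=N_WAVES):
    """Keep only the codes that appear in every wave, sorted alphabetically."""
    return sorted(code for code, n in counts.items() if n == n_waves)


balanced = observed_in_all_waves(waves_observed)
print(balanced)  # ['AT', 'BE', 'BG']
```

Whether a country with three waves (such as CH here) should be kept is a modelling choice; lowering `n_waves` relaxes the filter.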

##### **indic_se:**

In [8]:
wages.cat_describer(EARN_SES_47,['indic_se']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(ERN, Gross earnings)",indic_se,indic_se,indic_se,indic_se
"(E_F_M_PC, Gross earnings of women as a percentage of those of men)",indic_se,indic_se,indic_se,
"(OPAY, Overtime pay)",indic_se,indic_se,indic_se,indic_se
"(OP_E_PC, Overtime pay as a percentage of earnings)",indic_se,indic_se,indic_se,indic_se


This category identifies the earnings indicator.
* We will only use "ERN" (gross earnings).

##### **isco08-isco88:**

In [10]:
wages.cat_describer(EARN_SES_47,['isco08','isco88']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Error: isco08 not in EARN_SES06_47
Error: isco88 not in EARN_SES10_47
Error: isco88 not in EARN_SES14_47
Error: isco88 not in EARN_SES18_47


Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(ISCO0, Armed forces)",isco88,,,
"(ISCO1, Legislators, senior officials and managers)",isco88,,,
"(ISCO1-5, Non manual workers)",isco88,,,
"(ISCO2, Professionals)",isco88,,,
"(ISCO3, Technicians and associate professionals)",isco88,,,
"(ISCO4, Clerks)",isco88,,,
"(ISCO5, Service workers and shop and market sales workers)",isco88,,,
"(ISCO6, Skilled agricultural and fishery workers)",isco88,,,
"(ISCO7, Craft and related trades workers)",isco88,,,
"(ISCO7-9, Manual workers)",isco88,,,


Eurostat's "Comparability ISCO_08-ISCO_88" report, which can be found in the Resources folder, explains the intention and logic behind the change in the occupational classification, which in these datasets appears as ISCO-88 codes in 2006 and ISCO-08 codes from 2010 onward. Although the document states that the changes were designed to allow time series analysis, we still need to map and merge the two classifications in the data.

##### **nace_r1-nace_r2:**

In [None]:
wages.cat_describer(EARN_SES_47,['nace_r1','nace_r2']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Error: nace_r2 not in EARN_SES06_47
Error: nace_r1 not in EARN_SES10_47
Error: nace_r1 not in EARN_SES14_47
Error: nace_r1 not in EARN_SES18_47


Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(B, Mining and quarrying)",,nace_r2,nace_r2,nace_r2
"(B-E, Industry (except construction))",,nace_r2,nace_r2,nace_r2
"(B-F, Industry and construction)",,nace_r2,nace_r2,nace_r2
"(B-N, Business economy)",,nace_r2,nace_r2,nace_r2
"(B-S, Industry, construction and services (except activities of households as employers and extra-territorial organisations and bodies))",,nace_r2,nace_r2,nace_r2
"(B-S_X_O, Industry, construction and services (except public administration, defense, compulsory social security))",,nace_r2,nace_r2,nace_r2
"(C, Manufacturing)",,nace_r2,nace_r2,nace_r2
"(C, Mining and quarrying)",nace_r1,,,
"(C-E, Industry (except construction))",nace_r1,,,
"(C-F, Industry)",nace_r1,,,


##### **sex:**

In [None]:
wages.cat_describer(EARN_SES_47,['sex']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(F, Females)",sex,sex,sex,sex
"(M, Males)",sex,sex,sex,sex
"(T, Total)",sex,sex,sex,sex


##### **sizeclas:**

In [None]:
wages.cat_describer(EARN_SES_47,['sizeclas']).pivot(index="Descriptions", columns=["Year"], values=["Category"])

Unnamed: 0_level_0,Category,Category,Category,Category
Year,2006,2010,2014,2018
Descriptions,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
"(GE10, 10 employees or more)",sizeclas,sizeclas,sizeclas,sizeclas
"(TOTAL, Total)",sizeclas,sizeclas,sizeclas,sizeclas


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## <span style="color:#f1c232">Data Preparation</span>

Get the datasets via the Eurostat API and **CONCAT** them to obtain an initial dataset covering the four survey years.

In [None]:
df=pd.concat([eurostat.get_data_df(j, flags=True).assign(year=2006+4*i) for i, j in enumerate(EARN_SES_47)],ignore_index=True)
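The `year=2006+4*i` expression depends on the order of the list. As a cross-check, the year can also be read from the dataset code itself. The sketch below assumes that the two digits right after `SES` always denote a year in the 2000s, which holds for the four codes in `EARN_SES_47` but not for codes such as `EARN_SES_AGT14`; the helper `wave_year` is hypothetical.

```python
import re

EARN_SES_47 = ["EARN_SES06_47", "EARN_SES10_47", "EARN_SES14_47", "EARN_SES18_47"]


def wave_year(dataset_code):
    """Extract the survey year from a code such as 'EARN_SES06_47'."""
    match = re.search(r"SES(\d{2})_", dataset_code)
    if match is None:
        raise ValueError(f"no wave year found in {dataset_code!r}")
    return 2000 + int(match.group(1))


# Position-based years (as in the cell above) versus name-based years.
by_position = {code: 2006 + 4 * i for i, code in enumerate(EARN_SES_47)}
by_name = {code: wave_year(code) for code in EARN_SES_47}
print(by_name)
```

Both dictionaries agree for this list, so the positional shortcut is safe as long as the list stays sorted and complete.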

Now we need to filter out the variables and/or observations that will not be used in the analysis.

In [21]:
#Gender, earning type, and company size will not be examined in the research.  
df=df[(df['sex']=='T')&(df['indic_se']=='ERN')&(df['sizeclas']=='GE10')]
df=df.drop(['freq','sex','indic_se','sizeclas'], axis=1)

#Rename country code column (raw string so that the backslash is not read as an escape)
df=df.rename(columns={r'geo\TIME_PERIOD':'code'})

df

Unnamed: 0,unit,nace_r1,isco88,code,2006_value,2006_flag,year,currency,isco08,nace_r2,2010_value,2010_flag,2014_value,2014_flag,2018_value,2018_flag
1151,EUR,C,ISCO0,NO,,: c,2006,,,,,,,,,
1193,EUR,C,ISCO1,BG,2.44,,2006,,,,,,,,,
1194,EUR,C,ISCO1,CY,23.55,,2006,,,,,,,,,
1195,EUR,C,ISCO1,CZ,11.43,,2006,,,,,,,,,
1196,EUR,C,ISCO1,DE,32.44,,2006,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1080852,PPS,,,NO,,,2018,,TOTAL,S,,,,,20.95,
1080853,PPS,,,PL,,,2018,,TOTAL,S,,,,,7.71,
1080854,PPS,,,SI,,,2018,,TOTAL,S,,,,,8.51,
1080855,PPS,,,SK,,,2018,,TOTAL,S,,,,,6.14,


Now, we will need to define and handle *(merge)* the variables with different labels in different years.

In [22]:
#Merging unit and currency columns, eliminating unnecessary currencies
df['unit']=df['unit'].replace(np.nan, '', regex=True)+df['currency'].replace(np.nan, '', regex=True)
df=df.drop(['currency'], axis=1)
df=df[(df['unit']!= 'NAC')]
df['unit'].unique()

array(['EUR', 'PPS'], dtype=object)
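The string concatenation above works because each row has a code in exactly one of the two columns. An alternative way to express the same coalescing step is `Series.combine_first`, sketched here on a made-up toy frame (the rows are illustrative, not Eurostat data).

```python
import numpy as np
import pandas as pd

# Each row carries its code either in `unit` or in `currency`, never in both.
toy = pd.DataFrame({
    "unit": ["EUR", np.nan, "PPS", np.nan],
    "currency": [np.nan, "EUR", np.nan, "NAC"],
})

# Take `unit` where present, otherwise fall back to `currency`.
toy["unit"] = toy["unit"].combine_first(toy["currency"])
toy = toy.drop(columns="currency")

# Drop national-currency rows, as in the cell above.
toy = toy[toy["unit"] != "NAC"]

print(toy["unit"].tolist())  # ['EUR', 'EUR', 'PPS']
```

Unlike concatenation, this variant would not glue two codes together if a row ever had both columns filled.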

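Collapse the year-specific value and flag columns into a single `value` and a single `flag` column, so that each row carries one observation identified by `year`.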

In [23]:
#Arranging value and flag columns
df['value']=df['2006_value'].replace(np.nan, 0, regex=True)+df['2010_value'].replace(np.nan, 0, regex=True)+df['2014_value'].replace(np.nan, 0, regex=True)+df['2018_value'].replace(np.nan, 0, regex=True)
df=df.drop(['2006_value','2010_value','2014_value','2018_value'], axis=1)
df['flag']=df['2006_flag'].replace(np.nan, '', regex=True)+df['2010_flag'].replace(np.nan, '', regex=True)+df['2014_flag'].replace(np.nan, '', regex=True)+df['2018_flag'].replace(np.nan, '', regex=True)
df=df.drop(['2006_flag','2010_flag','2014_flag','2018_flag'], axis=1)
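Because the cell above replaces NaN with 0 before adding, a row with no published value (for example a cell flagged `: c`) ends up with `value` equal to 0.00 rather than missing. If missing values should stay missing, one option is a row-wise sum with `min_count=1`. The sketch below uses a made-up two-column toy frame to contrast the two behaviours.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout after concatenation: each row has a value
# in at most one of the year-specific columns (made-up numbers).
toy = pd.DataFrame({
    "2006_value": [2.44, np.nan, np.nan],
    "2010_value": [np.nan, 3.10, np.nan],
})
value_cols = ["2006_value", "2010_value"]

# Approach used in the notebook: NaN -> 0, then add. Missing rows become 0.0.
zero_filled = toy["2006_value"].fillna(0) + toy["2010_value"].fillna(0)

# Alternative: row-wise sum that stays NaN when every column is missing.
nan_preserving = toy[value_cols].sum(axis=1, min_count=1)

print(zero_filled.tolist())
print(nan_preserving.tolist())
```

Keeping NaN makes it possible to tell an unpublished wage apart from a genuine zero in later averages or regressions.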

Afterward, we need to map the occupational **(isco88-->isco08)** and sectoral **(nace_r1-->nace_r2)** classification standards onto a common scheme.

In [24]:
occupation=pd.DataFrame({
    'isco88': ['ISCO0', 'ISCO1', 'ISCO1-5', 'ISCO2', 'ISCO3', 'ISCO4', 'ISCO5', 'ISCO6', 'OC6-8', 'ISCO7', 'ISCO7-9', 'ISCO8', 'ISCO9','TOTAL'],
    'isco08': ['OC0', 'OC1', 'OC1-5', 'OC2', 'OC3', 'OC4', 'OC5', 'OC6', 'OC6-8', 'OC7', 'OC7-9', 'OC8', 'OC9','TOTAL'],
    'Description': ['Armed forces occupations', 'Managers', 'Non manual workers', 'Professionals', 'Technicians and associate professionals', 'Clerical support workers', 'Service and sales workers', 'Skilled agricultural, forestry and fishery workers', 'Skilled manual workers', 'Craft and related trades workers', 'Manual workers', 'Plant and machine operators and assemblers', 'Elementary occupations','Total']
            })
df['isco88']=df['isco88'].map(occupation.set_index('isco88')['isco08'])
df['isco08']=df['isco88'].replace(np.nan, '', regex=True)+df['isco08'].replace(np.nan, '', regex=True)
df=df.drop(['isco88'], axis=1)
df

Unnamed: 0,unit,nace_r1,isco88,code,year,isco08,nace_r2,value,flag
1151,EUR,C,OC0,NO,2006,OC0,,0.00,: c
1193,EUR,C,OC1,BG,2006,OC1,,2.44,
1194,EUR,C,OC1,CY,2006,OC1,,23.55,
1195,EUR,C,OC1,CZ,2006,OC1,,11.43,
1196,EUR,C,OC1,DE,2006,OC1,,32.44,
...,...,...,...,...,...,...,...,...,...
1080852,PPS,,,NO,2018,TOTAL,S,20.95,
1080853,PPS,,,PL,2018,TOTAL,S,7.71,
1080854,PPS,,,SI,2018,TOTAL,S,8.51,
1080855,PPS,,,SK,2018,TOTAL,S,6.14,


In [25]:
#Define sector dataframe containing Nace Mapping and descriptions
S = [['A', 'A', 'Agriculture, forestry and fishing'],['C', 'B', 'Mining and quarrying'],['D', 'C', 'Manufacturing'],['E', 'D', 'Electricity, gas, steam and air conditioning supply'],['E', 'E', 'Water supply, sewerage, waste management and remediation activities'],['F', 'F', 'Construction'],['G', 'G', 'Wholesale and retail trade; repair of motor vehicles and motorcycles'],['H', 'I', 'Accommodation and food service activities'],['I', 'H', 'Transportation and storage'],['I', 'J', 'Information and communication'],['J', 'K', 'Financial and insurance activities'],['K', 'L', 'Real estate activities'],['K', 'M', 'Professional, scientific and technical activities'],['K', 'N', 'Administrative and support service activities'],['L', 'O', 'Public administration and defence; compulsory social security'],['M', 'P', 'Education'],['N', 'Q', 'Human health and social work activities'],['O', 'R', 'Arts, entertainment and recreation'],['O', 'S', 'Other service activities'],['P', 'T', 'Activities of households as employers; undifferentiated goods- and services-producing activities of households for own use'],['Q', 'U', 'Activities of extraterritorial organisations and bodies']]
sector = pd.DataFrame(S, columns=['nace_r1', 'nace_r2', 'description'])

In [26]:
df['dummy']=df['nace_r1'].map({item[0]: item[1] for item in S})
df['nace_r2']=df['dummy'].replace(np.nan, '', regex=True)+df['nace_r2'].replace(np.nan, '', regex=True)
df=df.drop(['nace_r1','dummy'], axis=1)
df

Unnamed: 0,unit,isco88,code,year,isco08,nace_r2,value,flag
1151,EUR,OC0,NO,2006,OC0,B,0.00,: c
1193,EUR,OC1,BG,2006,OC1,B,2.44,
1194,EUR,OC1,CY,2006,OC1,B,23.55,
1195,EUR,OC1,CZ,2006,OC1,B,11.43,
1196,EUR,OC1,DE,2006,OC1,B,32.44,
...,...,...,...,...,...,...,...,...
1080852,PPS,,NO,2018,TOTAL,S,20.95,
1080853,PPS,,PL,2018,TOTAL,S,7.71,
1080854,PPS,,SI,2018,TOTAL,S,8.51,
1080855,PPS,,SK,2018,TOTAL,S,6.14,
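The list `S` contains several NACE Rev. 1 sections that correspond to more than one NACE Rev. 2 section. When such a list is turned into a dictionary, later pairs overwrite earlier ones, so each Rev. 1 code ends up with only its last listed Rev. 2 target. The sketch below reuses the code pairs from `S` (descriptions omitted) to show which sections are affected.

```python
from collections import defaultdict

# (nace_r1, nace_r2) pairs copied from the list S, descriptions omitted.
PAIRS = [
    ("A", "A"), ("C", "B"), ("D", "C"), ("E", "D"), ("E", "E"), ("F", "F"),
    ("G", "G"), ("H", "I"), ("I", "H"), ("I", "J"), ("J", "K"), ("K", "L"),
    ("K", "M"), ("K", "N"), ("L", "O"), ("M", "P"), ("N", "Q"), ("O", "R"),
    ("O", "S"), ("P", "T"), ("Q", "U"),
]

# Same construction as in the notebook: later pairs overwrite earlier ones.
last_wins = {r1: r2 for r1, r2 in PAIRS}

# Group every Rev. 2 target under its Rev. 1 source to expose one-to-many cases.
targets = defaultdict(list)
for r1, r2 in PAIRS:
    targets[r1].append(r2)

ambiguous = {r1: r2s for r1, r2s in targets.items() if len(r2s) > 1}
print(ambiguous)
print({r1: last_wins[r1] for r1 in ambiguous})
```

With this mapping, 2006 rows for Rev. 1 sections E, I, K, and O are assigned to Rev. 2 sections E, J, N, and S respectively. Rev. 1 codes that are absent from the mapping, such as the aggregate `C-E`, are returned as NaN by `Series.map`.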


In [31]:
df=df.set_index(['code','nace_r2','year'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,unit,isco88,isco08,value,flag
code,nace_r2,year,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
NO,B,2006,EUR,OC0,OC0,0.00,: c
BG,B,2006,EUR,OC1,OC1,2.44,
CY,B,2006,EUR,OC1,OC1,23.55,
CZ,B,2006,EUR,OC1,OC1,11.43,
DE,B,2006,EUR,OC1,OC1,32.44,
...,...,...,...,...,...,...,...
NO,S,2018,PPS,,TOTAL,20.95,
PL,S,2018,PPS,,TOTAL,7.71,
SI,S,2018,PPS,,TOTAL,8.51,
SK,S,2018,PPS,,TOTAL,6.14,


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## <span style="color:#f1c232">Function Dump</span>

In [35]:
def wage_getter():
    EARN_SES_47=["EARN_SES06_47","EARN_SES10_47","EARN_SES14_47","EARN_SES18_47"]
    df=pd.concat([eurostat.get_data_df(j, flags=True).assign(year=2006+4*i) for i, j in enumerate(EARN_SES_47)],ignore_index=True)
    df=df[(df['sex']=='T')&(df['indic_se']=='ERN')&(df['sizeclas']=='TOTAL')]
    df=df.drop(['freq','sex','indic_se','sizeclas'], axis=1)
    df=df.rename(columns={r'geo\TIME_PERIOD':'code'})
    df['unit']=df['unit'].replace(np.nan, '', regex=True)+df['currency'].replace(np.nan, '', regex=True)
    df=df.drop(['currency'], axis=1)
    df=df[(df['unit']!= 'NAC')]
    df['value']=df['2006_value'].replace(np.nan, 0, regex=True)+df['2010_value'].replace(np.nan, 0, regex=True)+df['2014_value'].replace(np.nan, 0, regex=True)+df['2018_value'].replace(np.nan, 0, regex=True)
    df=df.drop(['2006_value','2010_value','2014_value','2018_value'], axis=1)
    df['flag']=df['2006_flag'].replace(np.nan, '', regex=True)+df['2010_flag'].replace(np.nan, '', regex=True)+df['2014_flag'].replace(np.nan, '', regex=True)+df['2018_flag'].replace(np.nan, '', regex=True)
    df=df.drop(['2006_flag','2010_flag','2014_flag','2018_flag'], axis=1)
    occupation=pd.DataFrame({
    'isco88': ['ISCO0', 'ISCO1', 'ISCO1-5', 'ISCO2', 'ISCO3', 'ISCO4', 'ISCO5', 'ISCO6', 'OC6-8', 'ISCO7', 'ISCO7-9', 'ISCO8', 'ISCO9','TOTAL'],
    'isco08': ['OC0', 'OC1', 'OC1-5', 'OC2', 'OC3', 'OC4', 'OC5', 'OC6', 'OC6-8', 'OC7', 'OC7-9', 'OC8', 'OC9','TOTAL'],
    'Description': ['Armed forces occupations', 'Managers', 'Non manual workers', 'Professionals', 'Technicians and associate professionals', 'Clerical support workers', 'Service and sales workers', 'Skilled agricultural, forestry and fishery workers', 'Skilled manual workers', 'Craft and related trades workers', 'Manual workers', 'Plant and machine operators and assemblers', 'Elementary occupations','Total']
            })
    df['isco88']=df['isco88'].map(occupation.set_index('isco88')['isco08'])
    df['isco08']=df['isco88'].replace(np.nan, '', regex=True)+df['isco08'].replace(np.nan, '', regex=True)
    sector = pd.DataFrame([['A', 'A', 'Agriculture, forestry and fishing'],['C', 'B', 'Mining and quarrying'],['D', 'C', 'Manufacturing'],['E', 'D', 'Electricity, gas, steam and air conditioning supply'],['E', 'E', 'Water supply, sewerage, waste management and remediation activities'],['F', 'F', 'Construction'],['G', 'G', 'Wholesale and retail trade; repair of motor vehicles and motorcycles'],['H', 'I', 'Accommodation and food service activities'],['I', 'H', 'Transportation and storage'],['I', 'J', 'Information and communication'],['J', 'K', 'Financial and insurance activities'],['K', 'L', 'Real estate activities'],['K', 'M', 'Professional, scientific and technical activities'],['K', 'N', 'Administrative and support service activities'],['L', 'O', 'Public administration and defence; compulsory social security'],['M', 'P', 'Education'],['N', 'Q', 'Human health and social work activities'],['O', 'R', 'Arts, entertainment and recreation'],['O', 'S', 'Other service activities'],['P', 'T', 'Activities of households as employers; undifferentiated goods- and services-producing activities of households for own use'],['Q', 'U', 'Activities of extraterritorial organisations and bodies']],
                           columns=['nace_r1', 'nace_r2', 'description'])
    # Build the mapping from `sector`; the list `S` is not defined inside this function
    df['deneme']=df['nace_r1'].map(dict(zip(sector['nace_r1'], sector['nace_r2'])))
    df['nace_r2']=df['deneme'].replace(np.nan, '', regex=True)+df['nace_r2'].replace(np.nan, '', regex=True)
    df=df.drop(['nace_r1','deneme'], axis=1)
    df=df.set_index(['code','nace_r2','year'])

    return df
    

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------