# <span style="color:#d3d1df">Ms Thesis Employment Data Discovery</span>

## <span style="color:#f1c232">Environment</span>

For the analysis, we need employment figures by country, sector, and occupation so that we can exclude labor supply effects on wages *(Heckscher–Ohlin)*. The Eurostat dataset **"LFSA_EISN2"** meets these requirements. We retrieve it with the Python **eurostat** package, then explore its categories and prepare the imported data so that it becomes a meaningful, workable dataset.

In [31]:
#Packages
import pandas as pd
import eurostat

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## <span style="color:#f1c232">Data Discovery</span>

We start by listing the parameters of the dataset and the values each parameter can take.

In [32]:
for par in eurostat.get_pars('LFSA_EISN2'):
    print(par, eurostat.get_dic('LFSA_EISN2', par, full=False))

freq [('A', 'Annual')]
age [('Y_GE15', '15 years or over'), ('Y20-64', 'From 20 to 64 years')]
sex [('T', 'Total'), ('M', 'Males'), ('F', 'Females')]
nace_r2 [('TOTAL', 'Total - all NACE activities'), ('A', 'Agriculture, forestry and fishing'), ('B', 'Mining and quarrying'), ('C', 'Manufacturing'), ('D', 'Electricity, gas, steam and air conditioning supply'), ('E', 'Water supply; sewerage, waste management and remediation activities'), ('F', 'Construction'), ('G', 'Wholesale and retail trade; repair of motor vehicles and motorcycles'), ('H', 'Transportation and storage'), ('I', 'Accommodation and food service activities'), ('J', 'Information and communication'), ('K', 'Financial and insurance activities'), ('L', 'Real estate activities'), ('M', 'Professional, scientific and technical activities'), ('N', 'Administrative and support service activities'), ('O', 'Public administration and defence; compulsory social security'), ('P', 'Education'), ('Q', 'Human health and social work activit

**Observations:** <br>

* The **freq** column is redundant (its only value is *A*, annual) and will be dropped.
* The **age** column is only needed as a filter: we keep the *Y20-64* observations and then drop the column.
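
The two observations can be checked mechanically before dropping anything. The following is a minimal sketch on a made-up toy frame (the `geo` and `value` columns and all numbers are illustrative assumptions, not Eurostat data): it finds columns with a single distinct value, filters on the age band, and drops what is no longer needed.

```python
import pandas as pd

# Toy stand-in for the imported frame; values are invented for illustration.
toy = pd.DataFrame({
    "freq": ["A", "A", "A", "A"],
    "age": ["Y_GE15", "Y20-64", "Y_GE15", "Y20-64"],
    "geo": ["DE", "DE", "TR", "TR"],
    "value": [10.0, 8.0, 5.0, 4.0],
})

# A column with exactly one distinct value carries no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Keep only the 20-64 age band, then drop 'age' and the constant columns.
selected = toy[toy["age"] == "Y20-64"].drop(columns=["age", *constant_cols])
print(constant_cols)
print(selected)
```

Running the same `nunique` check on the real frame is a cheap way to confirm that a column such as **freq** is safe to remove.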


Let's look at the structure of the imported dataframe.

In [33]:
eurostat.get_data_df("LFSA_EISN2", True).columns

Index(['freq', 'age', 'sex', 'nace_r2', 'isco08', 'unit', 'geo\TIME_PERIOD',
       '2008_value', '2008_flag', '2009_value', '2009_flag', '2010_value',
       '2010_flag', '2011_value', '2011_flag', '2012_value', '2012_flag',
       '2013_value', '2013_flag', '2014_value', '2014_flag', '2015_value',
       '2015_flag', '2016_value', '2016_flag', '2017_value', '2017_flag',
       '2018_value', '2018_flag', '2019_value', '2019_flag', '2020_value',
       '2020_flag', '2021_value', '2021_flag', '2022_value', '2022_flag'],
      dtype='object')

**Observations:** <br>

* The table is in wide form (one *value* and one *flag* column per year), so it needs to be reshaped into long form (one row per year).
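
Because every year occupies its own `*_value` and `*_flag` column, one way to get a single row per year is to melt the year columns, split the column name into year and kind, and pivot the kind back into columns. This is only a sketch on an invented two-country frame (the `geo` column name and all numbers are assumptions), and it is one of several possible approaches; passing a list as `index` to `pivot` assumes pandas 1.1 or later.

```python
import pandas as pd

# Toy wide frame mimicking the '<year>_value' / '<year>_flag' layout.
wide = pd.DataFrame({
    "geo": ["DE", "TR"],
    "2008_value": [5.4, 3.7],
    "2008_flag": ["b", None],
    "2009_value": [6.1, 5.3],
    "2009_flag": ["u", None],
})

# 1) Stack every year column into rows.
melted = wide.melt(id_vars="geo", var_name="col", value_name="cell")

# 2) Split '2008_value' into year='2008' and kind='value'.
melted[["year", "kind"]] = melted["col"].str.split("_", expand=True)

# 3) Turn 'kind' back into two columns: 'flag' and 'value'.
tidy = melted.pivot(index=["geo", "year"], columns="kind", values="cell").reset_index()
tidy.columns.name = None
tidy["year"] = tidy["year"].astype(int)
tidy["value"] = tidy["value"].astype(float)
print(tidy)
```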

Let us check the *nace_r2* and *isco08* columns to see some familiar (or unfamiliar) descriptions and values.

In [34]:
eurostat.get_dic("LFSA_EISN2","nace_r2", full=False)

[('TOTAL', 'Total - all NACE activities'),
 ('A', 'Agriculture, forestry and fishing'),
 ('B', 'Mining and quarrying'),
 ('C', 'Manufacturing'),
 ('D', 'Electricity, gas, steam and air conditioning supply'),
 ('E', 'Water supply; sewerage, waste management and remediation activities'),
 ('F', 'Construction'),
 ('G', 'Wholesale and retail trade; repair of motor vehicles and motorcycles'),
 ('H', 'Transportation and storage'),
 ('I', 'Accommodation and food service activities'),
 ('J', 'Information and communication'),
 ('K', 'Financial and insurance activities'),
 ('L', 'Real estate activities'),
 ('M', 'Professional, scientific and technical activities'),
 ('N', 'Administrative and support service activities'),
 ('O', 'Public administration and defence; compulsory social security'),
 ('P', 'Education'),
 ('Q', 'Human health and social work activities'),
 ('R', 'Arts, entertainment and recreation'),
 ('S', 'Other service activities'),
 ('T',
  'Activities of households as employers; und

In [35]:
eurostat.get_dic("LFSA_EISN2","isco08", full=False)

[('TOTAL', 'Total'),
 ('OC1', 'Managers'),
 ('OC2', 'Professionals'),
 ('OC3', 'Technicians and associate professionals'),
 ('OC4', 'Clerical support workers'),
 ('OC5', 'Service and sales workers'),
 ('OC6', 'Skilled agricultural, forestry and fishery workers'),
 ('OC7', 'Craft and related trades workers'),
 ('OC8', 'Plant and machine operators and assemblers'),
 ('OC9', 'Elementary occupations'),
 ('OC0', 'Armed forces occupations'),
 ('NRP', 'No response')]
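
The `(code, label)` pairs returned above convert directly into a lookup table. As a small sketch, the snippet below hard-codes a few of the pairs shown in the output (so it runs without a network call) and attaches readable labels to a toy frame. Dropping the *TOTAL* and *NRP* rows at the end is an assumption about what the analysis needs, not something the dataset requires.

```python
import pandas as pd

# A subset of the (code, label) pairs printed above.
isco_pairs = [
    ("TOTAL", "Total"),
    ("OC1", "Managers"),
    ("OC2", "Professionals"),
    ("NRP", "No response"),
]
isco_labels = dict(isco_pairs)

# Toy frame with occupation codes only.
toy = pd.DataFrame({"isco08": ["OC1", "OC2", "NRP", "TOTAL"]})
toy["isco08_label"] = toy["isco08"].map(isco_labels)

# Assumption: keep only the individual occupation groups, since summing
# them together with the TOTAL aggregate would double count.
detail = toy[~toy["isco08"].isin(["TOTAL", "NRP"])]
print(detail)
```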

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## <span style="color:#f1c232">Data Preparation</span>

Get the dataset via Eurostat API.

In [36]:
df=eurostat.get_data_df("LFSA_EISN2", True)
df.head(5)

Unnamed: 0,freq,age,sex,nace_r2,isco08,unit,geo\TIME_PERIOD,2008_value,2008_flag,2009_value,...,2018_value,2018_flag,2019_value,2019_flag,2020_value,2020_flag,2021_value,2021_flag,2022_value,2022_flag
0,A,Y20-64,F,A,NRP,THS_PER,CH,,: bu,,...,,: u,,: u,,: u,,: bu,,: u
1,A,Y20-64,F,A,NRP,THS_PER,DE,,: bu,,...,,:,,: u,,: bu,,: bu,,: u
2,A,Y20-64,F,A,NRP,THS_PER,DK,,:,,...,,: u,,: u,,:,,: bu,,:
3,A,Y20-64,F,A,NRP,THS_PER,EA20,,: bu,,...,,: u,,: u,,: u,,: bu,,: u
4,A,Y20-64,F,A,NRP,THS_PER,EU27_2020,,: bu,,...,,: u,,: u,,: u,,: bu,,: u


Exclude the observations and variables that will not be used in the analysis.

In [37]:
df = df[(df['sex'] == 'T') & (df['age'] == 'Y20-64')]
df = df.drop(['freq', 'age', 'sex'], axis=1)
# Raw string: '\T' is not a valid escape sequence in a normal string literal.
df = df.rename(columns={r'geo\TIME_PERIOD': 'code'})
df

Unnamed: 0,nace_r2,isco08,unit,code,2008_value,2008_flag,2009_value,2009_flag,2010_value,2010_flag,...,2018_value,2018_flag,2019_value,2019_flag,2020_value,2020_flag,2021_value,2021_flag,2022_value,2022_flag
17537,A,NRP,THS_PER,CH,,: bu,,: u,,: bu,...,,: u,,: u,,: u,,: bu,1.2,u
17538,A,NRP,THS_PER,DE,,: bu,,: u,5.4,b,...,,:,,: u,,: bu,,: bu,,: u
17539,A,NRP,THS_PER,DK,,:,,:,,:,...,,: u,,: u,,: u,,: bu,,: u
17540,A,NRP,THS_PER,EA20,,: bu,,: u,6.1,u,...,,: u,,: u,,: u,13.7,bu,,: u
17541,A,NRP,THS_PER,EU27_2020,,: bu,,: u,,: u,...,,: u,,: u,,: u,14.1,bu,15.5,u
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26612,U,TOTAL,THS_PER,SE,,: bu,,: u,1.4,u,...,,: bu,,: u,,: u,,: bu,,: u
26613,U,TOTAL,THS_PER,SI,,: bu,,:,,:,...,,:,,: u,,: u,,: bu,,: u
26614,U,TOTAL,THS_PER,SK,,: bu,,: bu,,: u,...,,: u,,: u,,: u,,: bu,,: u
26615,U,TOTAL,THS_PER,TR,,:,5.3,,3.7,,...,8.1,,5.8,,6.8,,,:,,:


Transform the data from wide form (one column pair per year) into long form (one row per year).

In [38]:
# Stack all year columns into rows; 'Cols' holds names such as '2008_value'.
df_temp = df.melt(id_vars=['nace_r2', 'isco08', 'unit', 'code'], var_name='Cols')
df_temp['year'] = df_temp['Cols'].str[:4]   # '2008_value' -> '2008'
df_temp['Cols'] = df_temp['Cols'].str[5:]   # '2008_value' -> 'value'

# Pair each value row with the flag row that shares the same keys.
keys = ['nace_r2', 'isco08', 'unit', 'code', 'year']
df = (df_temp[df_temp['Cols'] == 'value']
      .merge(df_temp[df_temp['Cols'] == 'flag'], on=keys, how='outer')
      .rename(columns={'value_x': 'value', 'value_y': 'flag'})
      .drop(['Cols_x', 'Cols_y'], axis=1))
del df_temp
df

Unnamed: 0,nace_r2,isco08,unit,code,value,year,flag
0,A,NRP,THS_PER,CH,,2008,: bu
1,A,NRP,THS_PER,DE,,2008,: bu
2,A,NRP,THS_PER,DK,,2008,:
3,A,NRP,THS_PER,EA20,,2008,: bu
4,A,NRP,THS_PER,EU27_2020,,2008,: bu
...,...,...,...,...,...,...,...
136195,U,TOTAL,THS_PER,SE,,2022,: u
136196,U,TOTAL,THS_PER,SI,,2022,: u
136197,U,TOTAL,THS_PER,SK,,2022,: u
136198,U,TOTAL,THS_PER,TR,,2022,:


Check the data types, convert them where needed, and set a new index.



In [39]:
df.dtypes

nace_r2    object
isco08     object
unit       object
code       object
value      object
year       object
flag       object
dtype: object

In [40]:
df['value'] = df['value'].astype(float)
df['year'] = df['year'].astype(int)
df = df.set_index(['code', 'year', 'nace_r2'])
df

code,year,nace_r2,isco08,unit,value,flag
CH,2008,A,NRP,THS_PER,,: bu
DE,2008,A,NRP,THS_PER,,: bu
DK,2008,A,NRP,THS_PER,,:
EA20,2008,A,NRP,THS_PER,,: bu
EU27_2020,2008,A,NRP,THS_PER,,: bu
...,...,...,...,...,...,...
SE,2022,U,TOTAL,THS_PER,,: u
SI,2022,U,TOTAL,THS_PER,,: u
SK,2022,U,TOTAL,THS_PER,,: u
TR,2022,U,TOTAL,THS_PER,,:
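
With `code`, `year`, and `nace_r2` in the index, subsets can be pulled out by label. The sketch below uses a small invented frame with the same index levels (all numbers are made up). In the real data `isco08` is not part of the index, so a full key matches several rows and returns a Series rather than a single number.

```python
import pandas as pd

# Toy frame with the same index levels as above; values are invented.
toy = (
    pd.DataFrame({
        "code": ["DE", "DE", "TR", "TR"],
        "year": [2008, 2009, 2008, 2009],
        "nace_r2": ["A", "A", "C", "C"],
        "isco08": ["OC1", "OC1", "OC2", "OC2"],
        "value": [5.4, 6.1, 3.7, 5.3],
    })
    .set_index(["code", "year", "nace_r2"])
    .sort_index()  # a sorted MultiIndex makes label lookups efficient
)

de_rows = toy.loc["DE"]                           # all rows for one country
year_2009 = toy.xs(2009, level="year")            # one year, all countries
tr_c_2008 = toy.loc[("TR", 2008, "C"), "value"]   # a single cell (unique key here)
print(de_rows, year_2009, tr_c_2008, sep="\n\n")
```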


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

## <span style="color:#f1c232">Function Dump</span>

In [1]:
import eurostat

def lsupply_getter():
    """Return LFSA_EISN2 employment (sex=T, age=Y20-64) in long form."""
    df = eurostat.get_data_df("LFSA_EISN2", True)
    df = df[(df['sex'] == 'T') & (df['age'] == 'Y20-64')]
    df = df.drop(['freq', 'age', 'sex'], axis=1)
    df = df.rename(columns={r'geo\TIME_PERIOD': 'code'})
    # Wide -> long: one row per (nace_r2, isco08, unit, code, year).
    keys = ['nace_r2', 'isco08', 'unit', 'code', 'year']
    df_temp = df.melt(id_vars=keys[:-1], var_name='Cols')
    df_temp['year'] = df_temp['Cols'].str[:4]
    df_temp['Cols'] = df_temp['Cols'].str[5:]
    df = (df_temp[df_temp['Cols'] == 'value']
          .merge(df_temp[df_temp['Cols'] == 'flag'], on=keys, how='outer')
          .rename(columns={'value_x': 'value', 'value_y': 'flag'})
          .drop(['Cols_x', 'Cols_y'], axis=1))
    df['value'] = df['value'].astype(float)
    df['year'] = df['year'].astype(int)
    # Note: this index order differs from the exploratory cell (code, year, nace_r2).
    return df.set_index(['code', 'nace_r2', 'year'])
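
The returned frame keeps the `flag` column, which can be used to screen observations afterwards. The sketch below assumes the standard Eurostat flag meanings (`:` not available, `u` low reliability, `b` break in time series) and works on an invented toy frame. Whether low-reliability values should be dropped at all is a choice for the analysis, so treat the final filter as an illustration only.

```python
import pandas as pd

# Toy rows using flag strings like those seen in the outputs above.
toy = pd.DataFrame({
    "value": [None, 5.4, 6.1, 13.7, 5.3],
    "flag": [": bu", "b", "u", "bu", None],
})

flags = toy["flag"].fillna("")  # treat a missing flag as "no flag"

# Assumed meanings: ':' not available, 'u' low reliability, 'b' series break.
toy["missing"] = flags.str.startswith(":")
toy["low_reliability"] = flags.str.contains("u", regex=False)
toy["series_break"] = flags.str.contains("b", regex=False)

# Illustrative filter: keep available values not marked as low reliability.
reliable = toy[~toy["missing"] & ~toy["low_reliability"]]
print(reliable)
```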
    
    

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------