# <span style="color:#d3d1df">Data Discovery & Preparation for Other Indicators</span>

## <span style="color:#f1c232">Environment</span>


For the analysis, we need to understand the patterns of task outsourcing. The model developed in this thesis defines output as the aggregation of tasks, so we are interested in outsourcing (or trade in services) as a percentage of output. The main arguments of the study rest on the claim that the share of outsourcing affects wages. We therefore need output figures for European countries and, more specifically, industry-level output figures. Fortunately, Eurostat's dataset **NAMA_10_A64** provides the required data.

In [44]:
#Packages
import pandas as pd
import eurostat  


## <span style="color:#f1c232">National accounts aggregates by industry</span>

### <span style="color:#909a07">**Data Discovery**</span>

Let us start by listing the parameters of the dataset and the values each parameter can take.

In [45]:
for i in eurostat.get_pars('NAMA_10_A64'):
    print(i, eurostat.get_dic('NAMA_10_A64', i, full=False))

freq [('A', 'Annual')]
unit [('CLV_I15', 'Chain linked volumes, index 2015=100'), ('CLV_I10', 'Chain linked volumes, index 2010=100'), ('PC_TOT', 'Percentage of total'), ('CP_MEUR', 'Current prices, million euro'), ('CP_MNAC', 'Current prices, million units of national currency'), ('CLV15_MEUR', 'Chain linked volumes (2015), million euro'), ('CLV10_MEUR', 'Chain linked volumes (2010), million euro'), ('CLV05_MEUR', 'Chain linked volumes (2005), million euro'), ('CLV15_MNAC', 'Chain linked volumes (2015), million units of national currency'), ('CLV10_MNAC', 'Chain linked volumes (2010), million units of national currency'), ('CLV05_MNAC', 'Chain linked volumes (2005), million units of national currency'), ('CLV_PCH_PRE', 'Chain linked volumes, percentage change on previous period'), ('PYP_MEUR', 'Previous year prices, million euro'), ('PYP_MNAC', 'Previous year prices, million units of national currency'), ('PD10_EUR', 'Price index (implicit deflator), 2010=100, euro'), ('PD15_NAC', 'Pr

In [46]:
eurostat.get_dic("NAMA_10_A64","unit", full=False)

[('CLV_I15', 'Chain linked volumes, index 2015=100'),
 ('CLV_I10', 'Chain linked volumes, index 2010=100'),
 ('PC_TOT', 'Percentage of total'),
 ('CP_MEUR', 'Current prices, million euro'),
 ('CP_MNAC', 'Current prices, million units of national currency'),
 ('CLV15_MEUR', 'Chain linked volumes (2015), million euro'),
 ('CLV10_MEUR', 'Chain linked volumes (2010), million euro'),
 ('CLV05_MEUR', 'Chain linked volumes (2005), million euro'),
 ('CLV15_MNAC',
  'Chain linked volumes (2015), million units of national currency'),
 ('CLV10_MNAC',
  'Chain linked volumes (2010), million units of national currency'),
 ('CLV05_MNAC',
  'Chain linked volumes (2005), million units of national currency'),
 ('CLV_PCH_PRE', 'Chain linked volumes, percentage change on previous period'),
 ('PYP_MEUR', 'Previous year prices, million euro'),
 ('PYP_MNAC', 'Previous year prices, million units of national currency'),
 ('PD10_EUR', 'Price index (implicit deflator), 2010=100, euro'),
 ('PD15_NAC', 'Price ind

In [47]:
eurostat.get_dic("NAMA_10_A64","na_item", full=False)

[('B1G', 'Value added, gross'),
 ('P1', 'Output'),
 ('P2', 'Intermediate consumption'),
 ('D1', 'Compensation of employees'),
 ('D11', 'Wages and salaries'),
 ('P51C', 'Consumption of fixed capital'),
 ('B2A3N', 'Operating surplus and mixed income, net'),
 ('D29X39', 'Other taxes less other subsidies on production')]

**Observations:** <br>

* **freq** column is redundant (its only value is *A*, annual), so it will be dropped.
* **unit** column will not be used as a variable: only *CP_MEUR* (current prices, million euro) observations will be selected, after which the column can be removed.
* **na_item** column will not be used as a variable: only *B1G* (value added, gross) observations will be selected, after which the column can be removed.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


### <span style="color:#909a07">**Data Preparation**</span>

Get the dataset via the Eurostat API.

In [48]:
df=eurostat.get_data_df('NAMA_10_A64', flags=True)
df.head(10)

Unnamed: 0,freq,unit,nace_r2,na_item,geo\TIME_PERIOD,1975_value,1975_flag,1976_value,1976_flag,1977_value,...,2018_value,2018_flag,2019_value,2019_flag,2020_value,2020_flag,2021_value,2021_flag,2022_value,2022_flag
0,A,CLV05_MEUR,A,B1G,AL,,:,,:,,...,1813.6,,1824.9,p,1849.6,p,1882.9,p,,:
1,A,CLV05_MEUR,A,B1G,AT,,:,,:,,...,3913.9,,3827.8,,3727.3,,3896.8,,,:
2,A,CLV05_MEUR,A,B1G,BE,,:,,:,,...,2743.1,,2777.2,,3020.2,p,2905.9,p,,:
3,A,CLV05_MEUR,A,B1G,BG,,:,,:,,...,1488.4,,1549.4,,1498.3,,1930.4,,,:
4,A,CLV05_MEUR,A,B1G,CH,,:,,:,,...,3002.4,,2804.2,,2744.5,,,:,,:
5,A,CLV05_MEUR,A,B1G,CY,,:,,:,,...,271.9,,288.7,,285.1,,285.4,p,,:
6,A,CLV05_MEUR,A,B1G,CZ,,:,,:,,...,2235.8,,2340.7,,2599.4,,2212.8,,,:
7,A,CLV05_MEUR,A,B1G,DE,,:,,:,,...,16258.3,,18403.0,p,19879.5,p,20199.4,p,,:
8,A,CLV05_MEUR,A,B1G,DK,1031.5,,912.3,,1053.7,...,2376.8,,2422.1,,2581.7,,2187.9,,,:
9,A,CLV05_MEUR,A,B1G,EA,,:,,:,,...,157022.3,,158426.3,,158708.6,,158679.4,,,:


Apply the necessary filtering, dropping, and renaming operations to the dataset.

In [49]:
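#Keep only current-price (million euro) gross value added observations
df=df[(df['unit']=='CP_MEUR')&(df['na_item']=='B1G')]
#Drop the columns that are no longer informative
df=df.drop(['freq','na_item'], axis=1)
#Raw string: '\T' is an invalid escape sequence in a regular string literal
df=df.rename(columns={r'geo\TIME_PERIOD':'code'})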
df

Unnamed: 0,unit,nace_r2,code,1975_value,1975_flag,1976_value,1976_flag,1977_value,1977_flag,1978_value,...,2018_value,2018_flag,2019_value,2019_flag,2020_value,2020_flag,2021_value,2021_flag,2022_value,2022_flag
84408,CP_MEUR,A,AL,,:,,:,,:,,...,2364.1,,2529.3,p,2559.0,p,2782.6,p,,:
84409,CP_MEUR,A,AT,,:,,:,,:,,...,4355.9,,4179.6,,4136.3,,4923.3,,,:
84410,CP_MEUR,A,BA,,:,,:,,:,,...,992.4,b,997.2,b,1050.2,b,1004.4,b,,:
84411,CP_MEUR,A,BE,,:,,:,,:,,...,2774.3,,3207.2,,3452.1,p,3331.2,p,,:
84412,CP_MEUR,A,BG,,:,,:,,:,,...,1903.0,,1995.3,,2150.3,,3103.9,,,:
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
112578,CP_MEUR,U,RO,,:,,:,,:,,...,0.0,,0.0,,0.0,,0.0,p,,:
112579,CP_MEUR,U,RS,,:,,:,,:,,...,,:,,:,,:,,:,,:
112580,CP_MEUR,U,SE,,:,,:,,:,,...,0.0,,0.0,,0.0,,0.0,,,:
112581,CP_MEUR,U,SI,,:,,:,,:,,...,0.0,,0.0,,0.0,,0.0,,,:


Transform the data from the wide form (one value column and one flag column per year) into the long form (one row per industry, country, and year).

In [50]:
#Switch the data from wide format to long format for ease of use
df_temp = df.melt(id_vars=['nace_r2','code','unit'], var_name='Cols')
#Column names look like '2018_value': the first four characters are the year, the rest is 'value' or 'flag'
df_temp['year']=df_temp['Cols'].str[:4]
df_temp['Cols']=df_temp['Cols'].str[5:]
#Put the value and the flag of each observation side by side
df=df_temp[(df_temp['Cols']=='value')].merge(df_temp[(df_temp['Cols']=='flag')],on=['nace_r2','code','unit','year'],how='outer').rename(columns={'value_x':'value','value_y':'flag'})
df=df.drop(['Cols_x','Cols_y'], axis=1)
#Restrict the sample to 2002-2018; copy so that later column assignments do not act on a slice
df=df[df['year'].astype(int).between(2002,2018)].copy()
df

Unnamed: 0,nace_r2,code,unit,value,year,flag
104706,A,AL,CP_MEUR,1015.9,2002,
104707,A,AT,CP_MEUR,3543.4,2002,
104708,A,BA,CP_MEUR,634.4,2002,
104709,A,BE,CP_MEUR,2823.1,2002,
104710,A,BG,CP_MEUR,1697.2,2002,
...,...,...,...,...,...,...
170627,U,RO,CP_MEUR,0.0,2018,
170628,U,RS,CP_MEUR,,2018,:
170629,U,SE,CP_MEUR,0.0,2018,
170630,U,SI,CP_MEUR,0.0,2018,
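
The same wide-to-long reshaping can be tried out on a small frame before running it on the full dataset. The sketch below is only an illustration: the frame `wide`, its numbers, and the names `long` and `tidy` are made up, and it uses `unstack` instead of the merge used above, which should give an equivalent layout as long as each (industry, country, unit, year) key is unique.

```python
import pandas as pd

# Made-up wide frame mimicking the "<year>_value" / "<year>_flag" column layout.
wide = pd.DataFrame({
    "nace_r2": ["A", "A"],
    "code": ["AT", "BE"],
    "unit": ["CP_MEUR", "CP_MEUR"],
    "2017_value": [10.0, 20.0],
    "2017_flag": [None, None],
    "2018_value": [11.0, 21.0],
    "2018_flag": [None, "p"],
})

ids = ["nace_r2", "code", "unit"]

# Stack every "<year>_<kind>" column into rows.
long = wide.melt(id_vars=ids, var_name="col", value_name="val")

# Split e.g. "2018_value" into year "2018" and kind "value".
long[["year", "kind"]] = long["col"].str.split("_", n=1, expand=True)
long["year"] = long["year"].astype(int)

# Spread "value" and "flag" back into two columns, one row per key.
tidy = (
    long.set_index(ids + ["year", "kind"])["val"]
        .unstack("kind")
        .reset_index()
)
tidy.columns.name = None
tidy["value"] = tidy["value"].astype(float)

print(tidy)
```

Each row of `tidy` then holds one industry, country, and year together with its value and flag.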


Check and set the data types, then rearrange the index.

In [51]:
df['value']=df['value'].astype(float)
df['year']=df['year'].astype(int)
df=df.set_index(['code','year','nace_r2'])
df

code,year,nace_r2,unit,value,flag
AL,2002,A,CP_MEUR,1015.9,
AT,2002,A,CP_MEUR,3543.4,
BA,2002,A,CP_MEUR,634.4,
BE,2002,A,CP_MEUR,2823.1,
BG,2002,A,CP_MEUR,1697.2,
...,...,...,...,...,...
RO,2018,U,CP_MEUR,0.0,
RS,2018,U,CP_MEUR,,:
SE,2018,U,CP_MEUR,0.0,
SI,2018,U,CP_MEUR,0.0,
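
With `code`, `year`, and `nace_r2` as index levels, individual observations and cross-sections can be selected directly. The following is a minimal sketch on a made-up frame (`toy` and its numbers are hypothetical, not Eurostat figures) that mirrors the index layout above.

```python
import pandas as pd

# Made-up frame with the same index layout as the prepared dataset.
toy = pd.DataFrame({
    "code": ["AT", "AT", "BE", "BE"],
    "year": [2017, 2018, 2017, 2018],
    "nace_r2": ["A", "A", "A", "A"],
    "unit": ["CP_MEUR"] * 4,
    "value": [10.0, 11.0, 20.0, 21.0],
    "flag": [None, None, None, "p"],
}).set_index(["code", "year", "nace_r2"]).sort_index()

# A single observation: country AT, year 2018, industry A.
at_2018 = toy.loc[("AT", 2018, "A"), "value"]

# All countries and years for one industry; the nace_r2 level is dropped.
industry_a = toy.xs("A", level="nace_r2")["value"]

# Years as rows, countries as columns.
by_country = industry_a.unstack("code")

print(at_2018)
print(by_country)
```

Sorting the index first keeps label-based lookups on the MultiIndex efficient.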


---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### <span style="color:#909a07">**Function Dump**</span>

In [1]:
import eurostat

def output_getter():
    """Return NAMA_10_A64 gross value added (B1G, CP_MEUR) for 2002-2018 in long format."""
    df=eurostat.get_data_df('NAMA_10_A64', flags=True)
    df=df[(df['unit']=='CP_MEUR')&(df['na_item']=='B1G')]
    df=df.drop(['freq','na_item'], axis=1)
    #Raw string: '\T' is an invalid escape sequence in a regular string literal
    df=df.rename(columns={r'geo\TIME_PERIOD':'code'})
    #Wide to long: one row per industry, country and year
    df_temp = df.melt(id_vars=['nace_r2','code','unit'], var_name='Cols')
    df_temp['year']=df_temp['Cols'].str[:4]
    df_temp['Cols']=df_temp['Cols'].str[5:]
    df=df_temp[(df_temp['Cols']=='value')].merge(df_temp[(df_temp['Cols']=='flag')],on=['nace_r2','code','unit','year'],how='outer').rename(columns={'value_x':'value','value_y':'flag'})
    df=df.drop(['Cols_x','Cols_y'], axis=1)
    df=df[df['year'].astype(int).between(2002,2018)].copy()
    df['value']=df['value'].astype(float)
    df['year']=df['year'].astype(int)
    df=df.set_index(['code','nace_r2','year'])
    return df
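
The returned frame still carries the `flag` column, so flagged observations can be handled downstream. The sketch below is an assumption-laden illustration on a made-up frame: it presumes the flag meanings `:` (not available) and `p` (provisional), which should be checked against the Eurostat documentation, and the column name `provisional` is hypothetical.

```python
import pandas as pd

# Made-up observations; the numbers are not Eurostat figures.
toy = pd.DataFrame({
    "code": ["AT", "BE", "RS"],
    "year": [2018, 2018, 2018],
    "value": [11.0, 21.0, float("nan")],
    "flag": [None, "p", ":"],
})

# Treat a missing flag as an empty string so string methods work on every row.
flags = toy["flag"].fillna("").str.strip()

# Keep only rows that actually carry a number.
available = toy[toy["value"].notna()].copy()

# Mark figures flagged as provisional (assumed meaning of "p").
available["provisional"] = flags.loc[available.index].str.contains("p", regex=False)

print(available)
```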

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------