# Official data indicators

Here we collect data and create indicators of industrial activity in NUTS2 areas, based on BRES and IDBR data from NOMIS.

<!-- We have already collected the data by running `make data` in the terminal. This has stored the BRES and IDBR data in the `data/external/` folder, and processed it into Nesta segments (a shorter number of industrial categories) in `data/processed`
 -->
 
This involves running the script `make_dataset.py`, which already produces some of the indicators relevant to the project, such as complexity.
 
We will also create a clean table and an indicator of the share of employment in sectors with high median salaries, based on the ASHE data we process in `0-jmg-ashe_sectoral`.


## Preamble

In [1]:
%run ../notebook_preamble.ipy

In [3]:
#Need to put this in utils
import os

def make_dirs(name,dirs = ['raw','processed','interim']):
    '''
    Utility that creates a `name` subdirectory in each of the data folders
    
    '''
    
    for d in dirs:
        #exist_ok avoids an error if the directory is already there
        os.makedirs(f'../../data/{d}/{name}',exist_ok=True)

In [28]:
def create_local_industry_dataset(path,salary_lookup,cluster_name,year,save=False):
    '''
    This creates a wide dataset with industry activity per NUTS area, plus extra variables with the share of activity
    in the top two and bottom two deciles of the salary distribution. It returns the salary shares only.
    
    Arguments:
        path (str) path to a tidy dataframe with the industrial activity information (could be employment or establishments)
        salary_lookup (dict) a lookup between industry segments and position in the salary distribution
        cluster_name (str) name of the cluster variable in the industry df
        year (int) year of the data, added as a column to the output
        save (bool) if True, save the combined industry and salary table in the interim folder
    
    '''
    #Read the data
    industry = pd.read_csv(path,dtype={'SIC4':str})
    
    #Label with salary info (use the lookup passed as an argument, not the global one)
    industry['median_salary_decile'] = industry[cluster_name].map(salary_lookup)
    
    #Create wide dataset with industry activity per geography
    industry_long = industry.groupby(
        ['geo_nm','geo_cd',cluster_name])['value'].sum().reset_index(drop=False).pivot_table(
        index=['geo_nm','geo_cd'],columns=cluster_name,values='value')
    
    #Share of activity in top and bottom of salary distribution
    salary_long = industry.groupby(
        ['geo_nm','geo_cd','median_salary_decile'])['value'].sum().reset_index(drop=False).pivot_table(
        index=['geo_nm','geo_cd'],columns='median_salary_decile',values='value')
    
    #Top of distro
    high_salary = salary_long.apply(lambda x: x/x.sum(),axis=1)[[8,9]].sum(axis=1)
    
    #Bottom of distro
    low_salary = salary_long.apply(lambda x: x/x.sum(),axis=1)[[0,1]].sum(axis=1)
    
    salary_stats = pd.concat([high_salary,low_salary],axis=1)
    
    #Names
    salary_stats.columns = ['top_20_salary_share','bottom_20_salary_share']
    
    #Concatenate
    combined = pd.concat([industry_long,salary_stats],axis=1)
    
    if save:
        
        #Take the informative bit of the name
        name = '_'.join(path.split('_')[1:3])
        
        combined.to_csv(f'../../data/interim/industry/{today_str}_{name}_industry_salary.csv')
        
    
    #Return only the salary shares, labelled with the year
    salary_stats['year']=year
    return salary_stats

In [77]:
def extract_segment(path,sector_list,sector_variable,sector_name):
    '''
    This function takes official data from a path and returns a segment of interest.
    We will use it to produce indicators about cultural activities in different NUTS2 regions.
    
    Arguments:
        path (str) is the path to the official data file
        sector_list (list) is the list of codes we are interested in - could be segments or sectors
        sector_variable (str) is the variable that we use to identify sectors. It could be 
            the sic code or the Nesta segment.
        sector_name (str) is the name we give to the aggregated variable
    
    '''
    
    #Read data
    all_sectors = pd.read_csv(path,dtype={'SIC4':str})
    
    #Activity in sector
    sector = all_sectors.loc[all_sectors[sector_variable].isin(sector_list)].reset_index(
        drop=True)
    
    #Regroup and aggregate
    sector_agg = sector.groupby(['geo_nm','geo_cd','year'])['value'].sum()
    
    #Add the name
    sector_agg.name = sector_name
    
    #Return a dataframe (the year is already part of the index)
    return pd.DataFrame(sector_agg)
    
    

In [95]:

def make_indicator(table,target_path,var_lookup,year_var,nuts_var='nuts_code',nuts_spec=2018,decimals=3):
    '''
    We use this function to create and save indicators using our standardised format.
    
    Args:
        table (df) is a df with relevant information
        target_path (str) is the name of the subdirectory of the processed data folder where we want to save the indicator
        var_lookup (dict) is a lookup to rename the variable into our standardised name
        year_var (str) is the name of the year variable
        nuts_var (str) is the name of the NUTS code variable. We assume it is nuts_code
        nuts_spec (int) is the year of the NUTS specification. We assume we are working with 2018 NUTS
        decimals (int) is the number of decimals to keep. With 0 the values are stored as integers
    
    '''
    #Reset the index so that the NUTS code, variable and year become columns
    #(we assume they are in the index - this might need to be changed)
    t = table.reset_index(drop=False)
    
    
    #Process the interim data into an indicator
    
    #This is the variable name and code
    var_name = list(var_lookup.keys())[0]
    
    var_code = list(var_lookup.values())[0]
    
    #Focus on those (copy so that we do not modify a view of the original table)
    t = t[[year_var,nuts_var,var_name]].copy()
    
    #Add the nuts specification
    t['nuts_year_spec'] = nuts_spec
    
    #Rename variables
    t = t.rename(columns={var_name:var_code,year_var:'year',nuts_var:'region_id'})

    #Round variables (to the nearest integer if no decimals are requested)
    if decimals>0:
        t[var_code] = t[var_code].round(decimals)
    else:
        t[var_code] = t[var_code].round().astype(int)
    
    
    #Reorder variables
    t = t[['year','region_id','nuts_year_spec',var_code]]
    
    print(t.head())
    
    #Save in the processed folder
    t.to_csv(f'../../data/processed/{target_path}/{var_code}.csv',index=False)

In [4]:
make_dirs('industry')

## Read data

### Sector names

In [55]:
#Cultural industries 

cultural = ['services_cultural','services_recreation','services_entertainment']

### Metadata (ASHE)

This is a lookup giving the position of each industry segment in the salary distribution, based on the analysis in the `ashe` notebook.
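
As a minimal sketch of how such a lookup behaves, the toy example below builds a dictionary from a small ranking table and maps it onto an industry table. The column names `cluster`, `ashe_median_salary_rank` and `cluster_name` follow the ones used in this notebook; the values and the segments `services_finance` and `services_unknown` are invented for illustration and are not actual ASHE results.

```python
import pandas as pd

# Toy ranking table: values are invented for illustration only
toy_ashe = pd.DataFrame({
    'cluster': ['manufacture_food', 'services_recreation', 'services_finance'],
    'ashe_median_salary_rank': [2, 1, 9],
})

# Lookup from industry segment to its position in the salary distribution
toy_lookup = toy_ashe.set_index('cluster')['ashe_median_salary_rank'].to_dict()

# Toy industry table; 'services_unknown' is deliberately absent from the lookup
toy_industry = pd.DataFrame({
    'cluster_name': ['manufacture_food', 'services_finance', 'services_unknown'],
    'value': [10, 200, 5],
})

# Segments missing from the lookup are mapped to NaN
toy_industry['median_salary_decile'] = toy_industry['cluster_name'].map(toy_lookup)
```

Segments without a rank end up as `NaN`, which turns the mapped column into floats and means those rows are dropped by a default `groupby` on the decile column.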

In [6]:
#Read ashe and turn it into a lookup
ashe = pd.read_csv('../../data/interim/industry/2020_02_18_ashe_rankings.csv')

ashe_lookup = ashe.set_index('cluster')['ashe_median_salary_rank'].to_dict()

In [14]:
#bres
bres_2018 = pd.read_csv('../../data/interim/industry/nomis_BRES_2018_TYPE450.csv',dtype={'SIC4':str},
                       index_col=None)

bres_2018['sal'] = bres_2018['cluster_name'].map(ashe_lookup)

bres_2018

Unnamed: 0.1,Unnamed: 0,year,geo_type,geo_nm,geo_cd,SIC4,value,cluster_name,sal
0,0,2018,nuts 2013 level 2,Tees Valley and Durham,UKC1,0161,10,manufacture_food,2.0
1,1,2018,nuts 2013 level 2,Northumberland and Tyne and Wear,UKC2,0161,15,manufacture_food,2.0
2,2,2018,nuts 2013 level 2,Cumbria,UKD1,0161,200,manufacture_food,2.0
3,3,2018,nuts 2013 level 2,Greater Manchester,UKD3,0161,175,manufacture_food,2.0
4,4,2018,nuts 2013 level 2,Lancashire,UKD4,0161,300,manufacture_food,2.0
...,...,...,...,...,...,...,...,...,...
20470,20470,2018,nuts 2013 level 2,East Wales,UKL2,9609,2000,services_recreation,1.0
20471,20471,2018,nuts 2013 level 2,Eastern Scotland,UKM2,9609,2500,services_recreation,1.0
20472,20472,2018,nuts 2013 level 2,South Western Scotland,UKM3,9609,3500,services_recreation,1.0
20473,20473,2018,nuts 2013 level 2,North Eastern Scotland,UKM5,9609,700,services_recreation,1.0


### Make indicators

#### Level of employment in the cultural industries

In [78]:
bres_cult = pd.concat([extract_segment(
    f'../../data/interim/industry/nomis_BRES_{y}_TYPE450.csv',cultural,'cluster_name',
    'culture_entertainment_recreation') for y in [2016,2017,2018]])

In [96]:
make_indicator(bres_cult,
               'industry',
               {'culture_entertainment_recreation':'employment_culture_entertainment_recreation'},year_var='year',
              nuts_spec=2013,nuts_var='geo_cd',decimals=0)

   year region_id  nuts_year_spec  employment_culture_entertainment_recreation
0  2016      UKH2            2013                                        39390
1  2016      UKJ1            2013                                        53725
2  2016      UKD6            2013                                        18320
3  2016      UKK3            2013                                        10435
4  2016      UKD1            2013                                         8555
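
The reshaping into the standardised format can be sketched on a toy table as below. This is an illustration under assumptions rather than a replacement for `make_indicator`: the region names, the codes `UKX1` and `UKX2` and the values are invented, and the table is assumed to be indexed by `geo_nm`, `geo_cd` and `year` like the output of `extract_segment`.

```python
import pandas as pd

# Toy aggregated table (invented values), indexed by region name, region code and year
toy = pd.DataFrame(
    {'culture_entertainment_recreation': [100, 250]},
    index=pd.MultiIndex.from_tuples(
        [('Region A', 'UKX1', 2018), ('Region B', 'UKX2', 2018)],
        names=['geo_nm', 'geo_cd', 'year']),
)

var_code = 'employment_culture_entertainment_recreation'

indicator = (
    toy.reset_index()                      # index levels become columns
       .rename(columns={'geo_cd': 'region_id',
                        'culture_entertainment_recreation': var_code})
       .assign(nuts_year_spec=2013)        # NUTS specification used by the source data
       [['year', 'region_id', 'nuts_year_spec', var_code]]
)
```

The result has one row per region and year, with the columns in the order `year`, `region_id`, `nuts_year_spec` and the indicator itself.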


#### Level of employment and business activity in sectors with different salaries

We are not saving these for now, as they are not key to the project.
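
To make the calculation concrete, here is a minimal sketch of the top and bottom salary shares on toy data. The region codes `UKX1` and `UKX2` and all values are invented, and the toy only contains deciles 0, 1, 8 and 9, so the two shares add up to one here; with all ten deciles present they generally would not.

```python
import pandas as pd

# Toy data: activity by region and salary decile (invented values)
toy = pd.DataFrame({
    'geo_cd': ['UKX1'] * 4 + ['UKX2'] * 4,
    'median_salary_decile': [0, 1, 8, 9] * 2,
    'value': [10, 20, 30, 40, 40, 40, 10, 10],
})

# One row per region, one column per decile
by_decile = toy.pivot_table(index='geo_cd', columns='median_salary_decile',
                            values='value', aggfunc='sum')

# Convert counts into within-region shares
shares = by_decile.div(by_decile.sum(axis=1), axis=0)

# Share of activity in the top two and bottom two deciles
salary_stats = pd.DataFrame({
    'top_20_salary_share': shares[[8, 9]].sum(axis=1),
    'bottom_20_salary_share': shares[[0, 1]].sum(axis=1),
})
```

Dividing each row by its total before summing the decile columns gives the same result as the row-wise `apply` used in `create_local_industry_dataset`.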

In [56]:
bres_nuts,idbr_nuts = [pd.concat([create_local_industry_dataset(
    f'../../data/interim/industry/nomis_{data}_{y}_TYPE450.csv',ashe_lookup,'cluster_name',y) 
                       for y in [2016,2017,2018]]) for data in ['BRES','IDBR']]

#### Complexity

In [88]:
compl = pd.read_csv('../../data/interim/industry/nomis_ECI.csv')

In [92]:
#make_indicator(compl.loc[compl['source']=='BRES'],
#               'industry',
#               {'eci':'economic_complexity_index'},year_var='year',nuts_spec=2013,nuts_var='geo_cd')