# Official data indicators

Here we create indicators of industrial activity in NUTS2 areas based on BRES and IDBR data from NOMIS.

We have already collected the data by running `make data` in the terminal. This stored the BRES and IDBR data in the `data/external/` folder and processed it into Nesta segments (a smaller set of industrial categories) in `data/processed/`.

Here we create a clean table for the most recent year, and an indicator of the share of employment in industry segments with a high median salary, according to the ASHE data we processed in `0-jmg-ashe_sectoral`.
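
As a rough illustration of the indicator described above, the sketch below computes the share of activity in the top two salary deciles for two made-up areas. The area names and employment values are invented for the example, and it assumes deciles are coded 0 (lowest) to 9 (highest), which is how the function later in this notebook indexes them.

```python
import pandas as pd

# Toy data (invented): employment by area and by the salary decile
# of the industry segment, with 0 = lowest and 9 = highest
toy = pd.DataFrame({
    "geo_nm": ["Area A", "Area A", "Area A", "Area B", "Area B", "Area B"],
    "median_salary_decile": [0, 5, 9, 1, 8, 9],
    "value": [100, 300, 100, 50, 25, 25],
})

# One row per area, one column per decile that appears in the data
by_decile = toy.pivot_table(
    index="geo_nm",
    columns="median_salary_decile",
    values="value",
    aggfunc="sum",
    fill_value=0,
)

# Convert counts into row-wise shares
shares = by_decile.div(by_decile.sum(axis=1), axis=0)

# Share of employment in the top two deciles
top_20_share = shares[[8, 9]].sum(axis=1)
print(top_20_share)
```

In this toy case Area A has 100 of 500 jobs in the top deciles (0.2) and Area B has 50 of 100 (0.5).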


## Preamble

In [None]:
%run ../notebook_preamble.ipy

In [None]:
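#Need to put this in utils
import os

def make_dirs(name,dirs = ['raw','processed']):
    '''
    Utility that creates the directories where we save the data
    
    Arguments:
        name (str) name of the subdirectory to create
        dirs (list) data folders in which to create it
    
    '''
    
    for d in dirs:
        #exist_ok avoids an error if the directory is already there
        os.makedirs(f'../../data/{d}/{name}',exist_ok=True)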

In [None]:
make_dirs('industry')

## Read data

### Metadata (ASHE)

This is a lookup giving the position of each industry segment in the salary distribution, based on the analysis in the `ashe` notebook.

In [None]:
#Read ashe and turn it into a lookup
ashe = pd.read_csv('../../data/processed/official/2019_11_15_ashe_rankings.csv')

ashe_lookup = ashe.set_index('cluster')['ashe_median_salary_rank'].to_dict()

In [None]:
#Read BRES and label each segment with its position in the salary distribution
bres_2018 = pd.read_csv('../../data/processed/official/nomis_BRES_2018_TYPE450.csv',dtype={'SIC4':str})

bres_2018['sal'] = bres_2018['cluster_name'].map(ashe_lookup)
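
One thing worth checking after a `.map` like the one above is whether every segment found a match, since `Series.map` returns `NaN` for keys missing from the dictionary. The following is a minimal sketch on made-up data; `toy_lookup` and `toy_bres` are hypothetical stand-ins for `ashe_lookup` and `bres_2018`, and the segment names are invented.

```python
import pandas as pd

# Hypothetical stand-ins for ashe_lookup and bres_2018
toy_lookup = {"segment_a": 9, "segment_b": 0}
toy_bres = pd.DataFrame({
    "cluster_name": ["segment_a", "segment_b", "segment_c"],
    "value": [10, 20, 30],
})

# Same labelling step as in the cell above
toy_bres["sal"] = toy_bres["cluster_name"].map(toy_lookup)

# Segments without an entry in the lookup come back as NaN
unmapped = sorted(toy_bres.loc[toy_bres["sal"].isna(), "cluster_name"].unique())
print(unmapped)
```

Any segment listed here would be silently dropped from the salary shares, because `groupby` excludes missing keys by default.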

In [None]:
def create_local_industry_dataset(path,salary_lookup,cluster_name,save=True):
    '''
    This creates a wide dataset with industry activity per NUTS area (one column per industry segment) and extra
    variables with the share of activity in the top two deciles of salary and in the bottom two deciles of salary
    
    Arguments:
        path (str) path to a tidy dataframe with the industrial activity information (could be employment or establishments)
        salary_lookup (dict) a lookup between industry segments and position in the salary distribution
        cluster_name (str) name of the cluster variable in the industry df
        save (bool) if True, save the combined table in data/processed/industry
    
    '''
    #Read the data
    industry = pd.read_csv(path,dtype={'SIC4':str})
    
    #Label with salary info
    industry['median_salary_decile'] = industry[cluster_name].map(salary_lookup)
    
    #Create wide dataset with industry activity per geography
    industry_long = industry.groupby(
        ['geo_nm','geo_cd',cluster_name])['value'].sum().reset_index(drop=False).pivot_table(
        index=['geo_nm','geo_cd'],columns=cluster_name,values='value')
    
    #Share of activity in top and bottom of salary distribution
    salary_long = industry.groupby(
        ['geo_nm','geo_cd','median_salary_decile'])['value'].sum().reset_index(drop=False).pivot_table(
        index=['geo_nm','geo_cd'],columns='median_salary_decile',values='value')
    
    #Normalise each row so that the decile columns sum to 1
    salary_shares = salary_long.div(salary_long.sum(axis=1),axis=0)
    
    #Top of distro
    high_salary = salary_shares[[8,9]].sum(axis=1)
    
    #Bottom of distro
    low_salary = salary_shares[[0,1]].sum(axis=1)
    
    salary_stats = pd.concat([high_salary,low_salary],axis=1)
    
    #Names
    salary_stats.columns = ['top_20_salary_share','bottom_20_salary_share']
    
    #Concatenate
    combined = pd.concat([industry_long,salary_stats],axis=1)
    
    if save:
        
        #Take the informative bit of the file name (e.g. BRES_2018)
        name = '_'.join(os.path.basename(path).split('_')[1:3])
        
        combined.to_csv(f'../../data/processed/industry/{today_str}_{name}_industry_salary.csv')
        
    
    #Return the combined table
    return combined

In [None]:
bres_nuts = create_local_industry_dataset('../../data/processed/official/nomis_BRES_2018_TYPE450.csv',ashe_lookup,'cluster_name')

idbr_nuts = create_local_industry_dataset('../../data/processed/official/nomis_IDBR_2018_TYPE450.csv',ashe_lookup,'cluster_name')

In [None]:
bres_nuts.head()
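
Beyond eyeballing the first rows, a quick sanity check on the two share columns can catch problems early. The sketch below is one possible check, not part of the original pipeline: `check_salary_shares` is a hypothetical helper, and the toy table (including the area codes) is invented to mimic the shape of `bres_nuts`.

```python
import pandas as pd

def check_salary_shares(df):
    """Hypothetical helper: check that both share columns are valid proportions."""
    shares = df[["top_20_salary_share", "bottom_20_salary_share"]]
    # Each share should lie between 0 and 1
    in_range = bool(((shares >= 0) & (shares <= 1)).all().all())
    # Top and bottom deciles are disjoint, so together they cannot exceed 1
    not_overlapping = bool((shares.sum(axis=1) <= 1 + 1e-9).all())
    return in_range and not_overlapping

# Invented table with the same index and share columns as the function output
toy_output = pd.DataFrame(
    {"top_20_salary_share": [0.2, 0.5], "bottom_20_salary_share": [0.2, 0.5]},
    index=pd.MultiIndex.from_tuples(
        [("Area A", "UKX1"), ("Area B", "UKX2")], names=["geo_nm", "geo_cd"]
    ),
)

print(check_salary_shares(toy_output))
```

The same call could be applied to `bres_nuts` and `idbr_nuts`, assuming they keep the column names set in `create_local_industry_dataset`.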