# ASHE places

We collect data about median salaries in a NUTS2 area. This is an indicator in its own right, and we will also use it to calculate the House Affordability index.

Our strategy will be to collect the data from [Nomis](https://www.nomisweb.co.uk/query/construct/apilinks.asp?menuopt=201) for LEPs.

Unfortunately the data is not available at the NUTS2 level, so we will have to use an alternative source for those areas.



## Preamble

In [None]:
%run ../notebook_preamble.ipy

In [None]:
from io import BytesIO
from zipfile import ZipFile

In [None]:
def make_dirs(name,dirs = ['raw','processed']):
    '''
    Utility that creates directories to save the data
    
    '''
    
    for d in dirs:
        if name not in os.listdir(f'../../data/{d}'):
            os.mkdir(f'../../data/{d}/{name}')
            
def flat_freq(a_list):
    '''
    Return value counts for categories in a nested list
    
    '''
    return(pd.Series([x for el in a_list for x in el]).value_counts())

        

def flatten_list(a_list):
    
    return([x for el in a_list for x in el])

        

In [None]:
def save_data(df,name,path,today=today_str):
    '''
    Utility to save processed data quicker
    
    Arguments:
        df (df) is the dataframe we want to save
        name (str) is the name of the file
        path (str) is the path where we want to save the file
        today (str) is the day when the data is saved
    
    '''
    
    #Use the today argument (the original ignored it and always used the global today_str)
    df.to_csv(f'{path}/{today}_{name}.csv')
    

In [None]:
def get_process_ashe_place(api_link,var_name):
    '''
    This function collects and processes ashe place data
    
    Arguments:
        api_link (str) is the endpoint we get the data from
        var_name (str) is the name for the observed value variable
    
    
    '''
    
    #Get the data
    nomis_table = pd.read_csv(api_link)
    
    #tidy variable names
    nomis_table.columns = [x.lower() for x in nomis_table.columns]
    
    #Some subseting of rows (ie we only keep the values)
    nomis_values = nomis_table.loc[nomis_table['measures_name']=='Value']
    
    #Some subsetting of columns
    nomis_filtered = nomis_values[['date_name','geography_name','geography_code','obs_value']]
    
    #Rename the observed value column; rename returns a new dataframe, so we reassign it
    nomis_filtered = nomis_filtered.rename(columns={'obs_value':var_name})
    
    return(nomis_filtered)
    
    

In [None]:
def parse_ashe_dump_data(path,file,occupation_list):
    '''
    This function collects and parses data from an ASHE occupation salary dump
    
    Arguments:
        path (str) is the path where we have stored the excel files
        file (str) is the name of the file
        occupation_list (list) is the list of occupations that we will focus on    
    
    '''
    #Extract the year from the file name
    year = file.split(' ')[-1][:-4]
    
    print(year)
    
    
    #Read the file. We are focusing on Full-Time to keep the indicator comparable with the LEPs.
    #We are also subsetting to remove some information at the top / bottom / sides
    
    table = pd.read_excel(path+'/'+file,
                    sheet_name='Full-Time',skiprows=4,na_values='x').iloc[:-5,:4]
    
    #Extract NUTS and occupations from the 'Descriptionc' field
    
    #We will use the fact that occupations are all Uppercase
    
    place_names = []
    occ_names = []
    
    #We go through every description and if a word is all uppercase we put it in an occupation container,
    #otherwise in a place container
    
    for category in table['Descriptionc']:
        
        split = category.split(' ')
        
        place =[]
        occ = []
        
        for word in split:
            if not word.isupper():
                place.append(word)
                
            else:
                occ.append(word)
                
        place_names.append(' '.join(place))
        occ_names.append(' '.join(occ))
        
    #Assign the words we identified as places to NUTS2 removing a trailing comma
    table['nuts_2'] = [x[:-1] for x in place_names]
    
    #Assign occupations
    table['occupation'] = occ_names
    
    #Assign years
    table['year']=year
    
    #Focus on occupations of interest (copy so the edits below do not act on a view of table)
    table_filter = table.loc[[x in occupation_list for x in table['occupation']]].copy()
    
    #Clean the occupation name
    table_filter['occupation'] =[x.lower() for x in table_filter['occupation']]
    
    #Rename the median variable
    table_filter.rename(columns={'Median':'gross_annual_salary_median'},inplace=True)
    
    return(table_filter[['year','nuts_2','occupation','gross_annual_salary_median']])
    

In [None]:
#dirs

if 'ashe_place' not in os.listdir('../../data/raw'):
    os.makedirs('../../data/raw/ashe_place')

if 'ashe_place' not in os.listdir('../../data/processed/'):
    os.makedirs('../../data/processed/ashe_place')

#Path to save data:

proc_path ='../../data/processed/ashe_place'

## 1. Collect data

We collect the data from NOMIS.

Note that we are collecting **annual gross salary** for full-time workers.

### LEPS

The LEP case will be easy as the information is already available at the LEP level.
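As a rough aid for reading the Nomis API link used in the next cell, the sketch below splits a shortened version of that link into its query parameters. The URL is abbreviated to two geography codes for illustration only, and the parameter values are copied from the link rather than interpreted.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened, illustrative version of the API link (only two geography codes kept)
example_link = ('https://www.nomisweb.co.uk/api/v01/dataset/NM_30_1.data.csv'
                '?geography=1925185537,1925185575&date=latestMINUS4-latest'
                '&sex=8&item=2&pay=7&measures=20100,20701')

# parse_qs returns a dict mapping each parameter name to a list of values
params = parse_qs(urlsplit(example_link).query)

for key, value in params.items():
    print(key, value)
```

The link requests two measures; `get_process_ashe_place` later keeps only the rows where `measures_name` is `'Value'`.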

In [None]:
api_lep_link = 'https://www.nomisweb.co.uk/api/v01/dataset/NM_30_1.data.csv?geography=1925185537,1925185575,1925185538...1925185543,1925185572,1925185544,1925185570,1925185545,1925185577,1925185553,1925185547...1925185549,1925185571,1925185569,1925185551,1925185552,1925185554,1925185558,1925185555...1925185557,1925185559,1925185560,1925185550,1925185576,1925185562,1925185573,1925185563...1925185568&date=latestMINUS4-latest&sex=8&item=2&pay=7&measures=20100,20701'

In [None]:
ashe_lep = get_process_ashe_place(api_lep_link,'gross_annual_salary_median')

In [None]:
ashe_lep.head()

### NUTS2

ASHE data are not available at the NUTS2 level and it is not trivial to convert LAD data into NUTS as we have done in other places (eg House Affordability) because the information is only available as median salaries. We could have used number of jobs & average salaries to calculate wage bills and recalculate salaries at the NUTS2 level but this would mean reporting averages rather than medians. 

For all these reasons, we end up using an ASHE data dump published by the ONS, available [here](https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/adhocs/009571annualsurveyofhoursandearningsasheestimatesofannualandhourlyearningsforindustryandoccupationbynuts2andnuts3uk2011to2017).

Note that there are some concerns about the reliability of these indicators given small sample sizes etc. so any indicators built using this data should be treated with caution.
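To illustrate why LAD medians cannot simply be recombined into a NUTS2 median, here is a minimal sketch with made-up salaries for two hypothetical local authorities. All numbers are invented for the example and do not come from ASHE.

```python
from statistics import median

# Invented salaries for two hypothetical local authorities in the same region
lad_a = [20_000, 22_000, 24_000]
lad_b = [30_000, 60_000, 90_000, 120_000, 150_000]

median_a = median(lad_a)
median_b = median(lad_b)

# Job-weighted average of the two published medians
weighted_median = (median_a * len(lad_a) + median_b * len(lad_b)) / (len(lad_a) + len(lad_b))

# Median of the pooled salaries, which is what a regional median should be
pooled_median = median(lad_a + lad_b)

print(weighted_median, pooled_median)
```

The two figures differ (64,500 against 45,000 here), so a weighted combination of area medians is not the median of the combined area.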

In [None]:
#Download and extract the ASHE data
data_link = 'https://www.ons.gov.uk/file?uri=/employmentandlabourmarket/peopleinwork/earningsandworkinghours/adhocs/009571annualsurveyofhoursandearningsasheestimatesofannualandhourlyearningsforindustryandoccupationbynuts2andnuts3uk2011to2017/k42forpublishing.zip'

ashe_req = requests.get(data_link)
ashe_zip = ZipFile(BytesIO(ashe_req.content))
ashe_zip.extractall(path='../../data/raw/ashe_place/download')

## 2. Processing

The extracted data are a set of Excel files with median data by occupation and industry between 2011 and 2017.

We will focus on science, research, engineering and technology professionals.
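The parser defined in the preamble separates place names from occupations by relying on occupations being written in uppercase. The sketch below applies the same idea to a single description string. The helper `split_description` and the example string are hypothetical, and the exact format of the descriptions in the ONS files is an assumption based on how the parser treats them.

```python
def split_description(description):
    '''
    Hypothetical helper: split a description into (place, occupation)
    by sending uppercase words to the occupation and the rest to the place
    '''
    place, occ = [], []

    for word in description.split(' '):
        if word.isupper():
            occ.append(word)
        else:
            place.append(word)

    # Drop the last character of the place, assumed to be a trailing comma
    return ' '.join(place)[:-1], ' '.join(occ)


# Assumed format: place name, a comma, then the occupation in uppercase
example = 'Tees Valley and Durham, SCIENCE, RESEARCH, ENGINEERING AND TECHNOLOGY PROFESSIONALS'

place, occupation = split_description(example)
print(place)
print(occupation)
```

A place name containing a word in all capitals would end up in the occupation under this heuristic, so it is worth checking the parsed names.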

In [None]:
my_dir = os.listdir('../../data/raw/ashe_place/download/K42a - NUTS2 by occupation')

#Files we want to consider
my_files = [x for x in my_dir if ('Annual pay' in x) and (' CV' not in x)]

my_files

In [None]:
path = '../../data/raw/ashe_place/download/K42a - NUTS2 by occupation'
occ_list = ['SCIENCE, RESEARCH, ENGINEERING AND TECHNOLOGY PROFESSIONALS']


In [None]:
sci_median_salaries = pd.concat([parse_ashe_dump_data(path,file,occ_list) for file in my_files]).reset_index(drop=True)

In [None]:
sci_median_salaries

In [None]:
#We fix a typo in one of the geographies (they switched the order of Bristol and Bath)
sci_median_salaries['nuts_2'] = ['Gloucestershire, Wiltshire and Bath/Bristol area' 
                                 if x=='Gloucestershire, Wiltshire and Bristol/Bath area' else x for x in sci_median_salaries['nuts_2']]

### Final processing

Add NUTS2 codes to the table
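The lookup works by building a name-to-code dictionary from the `properties` of each GeoJSON feature. A minimal sketch, using a small hand-written stand-in for the downloaded features (the two entries are illustrative and should be checked against the real file):

```python
# Illustrative stand-in for the 'features' list in the GeoJSON response
features = [
    {'properties': {'NUTS218CD': 'UKC1', 'NUTS218NM': 'Tees Valley and Durham'}},
    {'properties': {'NUTS218CD': 'UKK1',
                    'NUTS218NM': 'Gloucestershire, Wiltshire and Bath/Bristol area'}},
]

# Map area names to codes
names_to_codes = {f['properties']['NUTS218NM']: f['properties']['NUTS218CD'] for f in features}

# Names as they might appear in the salary table, including one with swapped word order
area_names = ['Tees Valley and Durham', 'Gloucestershire, Wiltshire and Bristol/Bath area']

# dict.get returns None when a name has no match
codes = [names_to_codes.get(name) for name in area_names]

# Names that failed to match and need manual attention
unmatched = sorted({name for name, code in zip(area_names, codes) if code is None})

print(codes)
print(unmatched)
```

Matching on names is exact, so even a swapped word order leaves an area without a code.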



In [None]:
nuts_codes_url = 'https://opendata.arcgis.com/datasets/ded3b436114440e5a1561c1e53400803_0.geojson'

nuts_codes_names = requests.get(nuts_codes_url).json()['features']

In [None]:
#Add NUTS2 codes
nuts_names_to_codes = {x['properties']['NUTS218NM']:x['properties']['NUTS218CD'] for x in nuts_codes_names}

#Label the table with 2018 NUTS codes. 
sci_median_salaries['nuts_2_codes'] = [nuts_names_to_codes.get(x, np.nan) for x in sci_median_salaries['nuts_2']]


In [None]:
set(sci_median_salaries.loc[sci_median_salaries['nuts_2_codes'].isna()]['nuts_2'])

There are a small number of mismatched areas due to changes in NUTS, plus aggregate non-NUTS London codes. We need to decide what to do about these.



## 3. Save data

In [None]:
save_data(ashe_lep,'ashe_lep_all_occupations',proc_path)

save_data(sci_median_salaries,'ashe_nuts_2_sci_tech',proc_path)