# Auto NUTS

There have been several versions of the NUTS geocode standard: 2003, 2006, 2010, 2013 and 2016.

Each of these versions has an [associated enforcement date](https://ec.europa.eu/eurostat/web/nuts/history), which can lag by around 2 years from the date of introduction.

Organisations releasing data aggregated at the NUTS geographies are not required to use the latest version until the enforcement date. This leaves a period of around two years during which it is unclear which version an organisation is using.
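The ambiguity window can be made concrete with a small sketch. It uses the enforcement years from the Eurostat history page and assumes, for illustration only, that a dataset is published in the same year it describes. `candidate_versions` is a hypothetical helper, not part of any library.

```python
# NUTS version -> year it became mandatory (2003 is treated as enforced from 2003)
nuts_enforced = {2003: 2003, 2006: 2008, 2010: 2012, 2013: 2015, 2016: 2018}


def candidate_versions(data_year):
    """Versions an organisation could legitimately be using in data_year.

    Assumption: the data are published in data_year, so versions introduced
    later are excluded.
    """
    # Versions whose enforcement date has already passed
    enforced = [v for v, e in nuts_enforced.items() if e <= data_year]
    # The newest enforced version is the oldest one still permitted
    earliest = max(enforced) if enforced else min(nuts_enforced)
    return [v for v in sorted(nuts_enforced) if earliest <= v <= data_year]


print(candidate_versions(2011))  # [2006, 2010]
print(candidate_versions(2012))  # [2010]
```

In 2011 the 2010 version already exists but is not enforced until 2012, so both 2006 and 2010 are possible. A year later only 2010 remains.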

Here we create a function that takes a dataset with NUTS region codes and automatically infers the version year.

In [None]:
%run ../notebook_preamble.ipy

In [None]:
from collections import defaultdict
import geopandas as gpd
import numpy as np
import os
import pandas as pd
from itertools import chain

# NUTS versions covered in this notebook
nuts_years = np.array([2003, 2006, 2010, 2013, 2016])

The region IDs for each NUTS version are read from the Eurostat boundary shapefiles, one file per version year, keeping only the UK level 2 regions.

In [None]:
nuts_ids = {}

for nuts_year in nuts_years:
    # One level 2 shapefile per NUTS version
    file = f'{data_path}/raw/gis/eurostat/NUTS_RG_01M_{nuts_year}_4326_LEVL_2.shp/NUTS_RG_01M_{nuts_year}_4326_LEVL_2.shp'
    eu_regions = gpd.read_file(file)
    # Keep only the UK region codes for this version
    nuts_ids[nuts_year] = set(eu_regions[eu_regions['CNTR_CODE'] == 'UK']['NUTS_ID'].values)

We can see that each NUTS version has a unique set of regions.

In [None]:
for y in nuts_years:
    s = nuts_ids[y]
    following = [v for k, v in nuts_ids.items() if k > y]
    deprecating = s.difference(*following)
    preceding = [v for k, v in nuts_ids.items() if k < y]
    introduced = s.difference(*preceding)
    print(y, len(s))
    if y < 2016:
        print('Deprecating:', deprecating)
    if y > 2003:
        print('Introduced:', introduced)

**NUTS Level 2 Properties:**

- 2003
  - n_regions: 37
  - deprecating: 'UKM4', 'UKM1'
- 2006
  - n_regions: 37
  - new: 'UKM6', 'UKM5'
  - deprecating: 'UKD2', 'UKD5'
  - enforced: 2008
- 2010
  - n_regions: 37
  - new: 'UKD7', 'UKD6'
  - deprecating: 'UKI1', 'UKI2'
  - enforced: 2012
- 2013
  - n_regions: 40
  - new: 'UKI3', 'UKI6', 'UKI7', 'UKI4', 'UKI5'
  - deprecating: 'UKM3', 'UKM2'
  - enforced: 2015
- 2016
  - n_regions: 41
  - new: 'UKM9', 'UKM7', 'UKM8'
  - enforced: 2018
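The introduced and deprecated codes listed above act as fingerprints: a code introduced in version X implies the data use version X or later, and a code last seen in version Y implies version Y or earlier. The following is a minimal sketch of that reasoning. The names `first_seen`, `last_seen` and `versions_consistent_with` are hypothetical and chosen for this example only.

```python
versions = [2003, 2006, 2010, 2013, 2016]

# Codes that first appear in a version (the "new" entries above)
first_seen = {
    2006: {'UKM5', 'UKM6'},
    2010: {'UKD6', 'UKD7'},
    2013: {'UKI3', 'UKI4', 'UKI5', 'UKI6', 'UKI7'},
    2016: {'UKM7', 'UKM8', 'UKM9'},
}

# Codes that appear in a version for the last time (the "deprecating" entries)
last_seen = {
    2003: {'UKM1', 'UKM4'},
    2006: {'UKD2', 'UKD5'},
    2010: {'UKI1', 'UKI2'},
    2013: {'UKM2', 'UKM3'},
}


def versions_consistent_with(ids):
    """Return the versions not ruled out by any fingerprint code in ids."""
    ids = set(ids)
    # The newest introduced code present sets the lower bound
    lower = max((y for y, codes in first_seen.items() if ids & codes),
                default=min(versions))
    # The oldest deprecated code present sets the upper bound
    upper = min((y for y, codes in last_seen.items() if ids & codes),
                default=max(versions))
    return [v for v in versions if lower <= v <= upper]


print(versions_consistent_with(['UKD6', 'UKI1']))  # [2010]
print(versions_consistent_with(['UKM2']))          # [2003, 2006, 2010, 2013]
```

An empty result means the codes mix incompatible versions, for example 'UKM9' (new in 2016) alongside 'UKM1' (last seen in 2003).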

In [None]:
nuts_2_deprecating = {
    2003: ['UKM1', 'UKM4'],
    2006: ['UKD2', 'UKD5'],
    2010: ['UKI2', 'UKI1'],
    2013: ['UKM3', 'UKM2'],
    2016: []
}

nuts_2_introduced = {
    2003: [],
    2006: ['UKM6', 'UKM5'],
    2010: ['UKD7', 'UKD6'],
    2013: ['UKI3', 'UKI6', 'UKI7', 'UKI4', 'UKI5'],
    2016: ['UKM9', 'UKM7', 'UKM8']
}

nuts_enforced = {
    2003: 2003,
    2006: 2008,
    2010: 2012,
    2013: 2015,
    2016: 2018
}

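def nuts_earliest(year):
    '''nuts_earliest
    Returns the earliest possible NUTS version for a year
    based on the enforcement date.
    '''
    # Fall back to the first version for years before any enforcement date
    earliest = min(nuts_enforced)
    # Iterate in version order so the latest enforced version wins
    for version in sorted(nuts_enforced):
        if year >= nuts_enforced[version]:
            earliest = version
    return earliest

def set_containment(a, b):
    '''set_containment
    Returns the fraction of the unique elements of a that are also in b.
    '''
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)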
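Containment is asymmetric, which is what makes it suitable here: a dataset that reports only some regions can still match a version perfectly. A short worked example, using illustrative subsets of UK NUTS 2 codes rather than full versions:

```python
def set_containment(a, b):
    """Fraction of the unique items in a that also appear in b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a)


# Illustrative subsets only, not complete NUTS versions
reported = ['UKC1', 'UKC2']
version_ids = {'UKC1', 'UKC2', 'UKI3', 'UKI4'}

forward = set_containment(reported, version_ids)   # 2 of 2 reported codes found
backward = set_containment(version_ids, reported)  # 2 of 4 version codes found
print(forward, backward)  # 1.0 0.5
```

The direction from dataset to version is the useful one, since datasets often cover a subset of regions.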

In [None]:
def year_containments(ids, years):
    containments = []
    for year in years:
        year_ids = nuts_ids[year]
        containments.append(set_containment(ids, year_ids))
    containments = np.array(containments)
    return containments
    
def infer_nuts_year(ids, year):
    '''infer_nuts_year
    Infers the NUTS version for a collection of region IDs observed in a year.

    Candidate versions run from the earliest version permitted by the
    enforcement dates onwards. The candidate containing the largest share
    of the IDs is returned; ties go to the earliest candidate.
    '''
    years = nuts_years[nuts_years >= nuts_earliest(year)]
    # if only one year is possible, return it
    if len(years) == 1:
        return years[0]

    # if not, calculate containments between region IDs from possible years
    containments = year_containments(ids, years)
    # np.argmax returns the first maximum, so a perfect match wins when
    # there is one and ties resolve to the earliest candidate year
    return years[np.argmax(containments)]

In [None]:
nuts_regions = []
years = []

for k, v in nuts_ids.items():
    nuts_regions.extend(v)
    years.extend([k] * len(v))
    
values = np.random.random(len(years))

df = pd.DataFrame({'nuts_region': nuts_regions,
                   'year': years,
                   'value': values})

In [None]:
def auto_nuts(df, year='year', nuts_id='nuts_id'):
    '''auto_nuts
    Auto generates values for nuts_year_spec if they are not provided.
    
    Args:
        df (:obj:`pd.DataFrame`): Dataframe with indicator values.
        year (:obj:`str`): Column name for the indicator value year.
        nuts_id (:obj:`str`): Column name for the NUTS region IDs.
        
    Returns:
        df (:obj:`pd.DataFrame`): Modified dataframe with new column
            for NUTS region years, `nuts_year_spec`.
    '''
    dfs = []
    # Infer one NUTS version per data year and attach it to that year's rows
    for group_year, group in df.groupby(year):
        nuts_year = infer_nuts_year(group[nuts_id], group_year)
        group = group.assign(nuts_year_spec=nuts_year)
        dfs.append(group)

    df = pd.concat(dfs, axis=0)
    return df
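The selection rule applied to each year group (score every candidate version by containment, take the best, and let ties go to the earliest candidate) can be traced on toy data. This is a pure-Python sketch: `toy_ids` holds hypothetical, heavily truncated ID sets, and `pick_version` is an illustrative name rather than a function from the package.

```python
# Hypothetical truncated ID sets; the real ones come from the shapefiles
toy_ids = {
    2010: {'UKD6', 'UKD7', 'UKI1', 'UKI2'},
    2013: {'UKD6', 'UKD7', 'UKI3', 'UKI4'},
    2016: {'UKD6', 'UKD7', 'UKI3', 'UKI4', 'UKM7'},
}


def containment(ids, reference):
    """Fraction of the unique ids that appear in the reference set."""
    ids = set(ids)
    return len(ids & reference) / len(ids)


def pick_version(ids, candidate_years):
    """Return the best-matching candidate year and all the scores."""
    scores = [containment(ids, toy_ids[y]) for y in candidate_years]
    # list.index finds the first maximum, so ties go to the earliest year
    best = scores.index(max(scores))
    return candidate_years[best], scores


candidates = [2010, 2013, 2016]
print(pick_version(['UKD6', 'UKI1'], candidates))  # (2010, [1.0, 0.5, 0.5])
print(pick_version(['UKD6', 'UKI3'], candidates))  # (2013, [0.5, 1.0, 1.0])
```

The second call shows the tie case: the IDs fit both 2013 and 2016 perfectly, and the earlier version is chosen.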

In [None]:
# Average only the numeric indicator column; nuts_region holds strings
df.groupby('year')['value'].mean()

In [None]:
!pip install -e ../../.

In [None]:
from beis_indicators.utils.nuts_utils import auto_nuts

In [None]:
auto_nuts(df, year='year', nuts_id='nuts_region')

In [None]:
np.array(list(nuts_ids.keys()))