# GtR analysis

We use Gateway to Research data about public funding for R&D in the UK to answer two questions about AI:

1. What is the geography of AI research, and how does it relate to the geography of automation?
2. How has AI research diffused across industries, and what drives that diffusion?

We tackle these questions in turn after loading the data.


## Preamble

In [None]:
%run notebook_preamble.ipy

In [None]:
# Add imports here

from ast import literal_eval

# Add functions here

## 1. Load data

We will be working with three datasets:

1. An enriched Gateway to Research dataset including information about:
   * Project description including whether a project has been classified as AI or not
   * Project metadata such as year, amount of funding, funder
   * Project labels about the disciplinary, industry and SDG focus
   * Location of organisations involved in the project
   * Sectors of the organisations involved in the project
   * See notebook `02` for more detail
2. A df with topic probabilities for all projects, based on a corex topic analysis of project abstracts (see notebook `3` for more)
3. A df with automation probabilities and shares for local authority districts in the UK produced by the ONS (see notebook `aux_2` for more)

#### a. GtR data

In [None]:
gtr = pd.read_csv('../data/processed/19_7_2019_gtr_processed.csv',compression='zip')

In [None]:
gtr.head()

In [None]:
# A few variables hold lists that were stored as strings in the csv; we parse them back into lists

to_parse = ['lead_lad_code','lead_lad_name','all_lad_code','all_lad_name','involved_lad_code','involved_lad_name','cluster']

#For each of those variables
for top in to_parse:
    
    #Skip variables that have already been parsed (e.g. if the cell is re-run)
    if not isinstance(gtr[top][0],list):
        
        #Parse the strings; missing values (and anything already parsed) are left as they are
        gtr[top] = [literal_eval(x) if isinstance(x,str) else x for x in gtr[top]]
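
As a minimal illustration of the parsing step above, the sketch below applies `literal_eval` to a toy column of stringified lists and passes missing values through unchanged. The cell values and the helper name `parse_list_cell` are made up for the example; they are not taken from the GtR file.

```python
from ast import literal_eval
import math

def parse_list_cell(cell):
    """Hypothetical helper: parse a stringified list, pass other values through."""
    # Missing values arrive as float NaN when a csv is read with pandas
    if isinstance(cell, float) and math.isnan(cell):
        return cell
    # Already a list: nothing to do
    if isinstance(cell, list):
        return cell
    return literal_eval(cell)

# Toy column: a stringified list, a missing value and an empty list
toy_column = ["['E09000007', 'E08000003']", float('nan'), "[]"]
parsed = [parse_list_cell(c) for c in toy_column]
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for strings read from a file.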


#### b. Topic mix

How should we interpret the coefficients? Each one is the probability that a document contains a topic, based on the words in the document.

In [None]:
topic_df = pd.read_csv('../data/processed/19_7_2019_gtr_corex_topic_mix.csv',compression='zip')

In [None]:
topic_df.head()

#### c. Automation risks by local authority district

In [None]:
aut = pd.read_csv('../data/processed/19_7_2019_ons_automation_clean.csv',index_col=None)

#### d. Additional secondary data here

We will want to use additional secondary data for example when modelling the link between AI specialisation and automation. We load it here.

In [None]:
# APS data about occupational distribution in the workforce

aps = pd.read_csv('https://www.nomisweb.co.uk/api/v01/dataset/NM_100_1.data.csv?geography=1820327937...1820328318&date=latestMINUS1&cell=404881665,404881921,404882177,404882433,404882689,404882945,404883201,404883457,404883713,404883969,404884225,404884481,404884737,404884993,404885249,404885505,404885761,404886017,404886273,404886529,404886785,404887041,404887297,404887553,404887809,404888065,404888321,404888577,404888833,404889089,404889345,404889601,404889857,404890113,404890369,404890625,404890881,404891137,404891393,404891649,404891905,404892161,404892417,404892673,404892929,404893185,404893441,404893697,404893953,404894209,404894465,404894721,404894977,404895233,404895489,404895745,404896001,404896257,404896513,404896769,404897025,404897281,404897537,404897793,404898049,404898305,404898561,404898817,404899073,404899329,404899585,404899841,404900097,404900353,404900609&measures=20100,20701')

In [None]:
aps.head()

In [None]:
#We need to do some processing of the APS data before working with it. 
# I want to create a df where every row is a LAD and the columns are levels of overall employment in different occupations

aps.columns = [x.lower() for x in aps.columns]

#Focus on values rather than confidence intervals
#Keep variables of interest (geography, variable name ie occupational group and value)

aps_sub = aps.loc[aps['measures_name']=='Value',['geography_name','cell_name','obs_value']].reset_index(drop=True)

#We are interested in 'all people', not the gender distribution
aps_sub = aps_sub.loc[['All people' in x for x in aps_sub['cell_name']]]

aps_sub.head()

In [None]:
#Extract occupation names

aps_sub['cell_name'] = [x.split('(')[1][:-2].strip().split('-')[1].strip() for x in aps_sub['cell_name']]

In [None]:
#Pivot
aps_piv = aps_sub.pivot_table(index='geography_name',columns='cell_name',values='obs_value')

aps_piv.head()

In [None]:
# We will assume that the missing values (probably due to low sample sizes) are zero

aps_piv.fillna(0,inplace=True)
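
The pivot and zero-filling steps above can be sketched on a toy long-format table. The area names, occupation labels and values below are invented for illustration; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy long-format table: one row per (area, occupation) pair
toy_long = pd.DataFrame({
    'geography_name': ['Area A', 'Area A', 'Area B'],
    'cell_name': ['managers', 'sales', 'managers'],
    'obs_value': [100.0, 50.0, 80.0],
})

# Wide format: one row per area, one column per occupation
toy_wide = toy_long.pivot_table(index='geography_name', columns='cell_name', values='obs_value')

# Area B has no 'sales' row, so the pivot leaves a NaN there; here it is filled with zero
toy_wide_filled = toy_wide.fillna(0)
```

Filling with zero treats a suppressed or missing estimate as no employment, which is the assumption stated in the cell above.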

## 2. Geographical analysis

Here we analyse the geography of AI research and its link with the geography of automation. Our working assumption is that locations with a high concentration of AI research have workforces at relatively low risk of automation.

We hypothesise that this relationship holds after we control for the occupational distribution of the workforce in different locations.

If it does, investing in AI research in a location would act as a form of insurance against automation, although the mechanism is unclear.

#### a. Create measures of AI activity

* Number of projects and levels of funding for projects led in a LAD
* Number of projects involving organisations from the LAD
* AI as a share of all projects
* LQ for AI research
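
A location quotient (LQ) compares the share of an activity in an area with the share of that activity across all areas, so a value above 1 suggests relative specialisation. The sketch below works this through on invented project counts; the frame `toy` and its column names are illustrative only.

```python
import pandas as pd

# Invented counts of projects by area (rows) and category (columns)
toy = pd.DataFrame({'non_ai': [90, 40, 70], 'ai': [10, 10, 0]},
                   index=['Area A', 'Area B', 'Area C'])

# Share of each category in the total across all areas
overall_shares = toy.sum(axis=0) / toy.values.sum()

# Share of each category within each area
area_shares = toy.div(toy.sum(axis=1), axis=0)

# LQ: within-area share relative to the overall share
toy_lq = area_shares.div(overall_shares, axis=1)
```

In this toy case AI accounts for 20 of 220 projects overall, so Area B, where AI is 10 of 50 projects, has an AI LQ of 2.2.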

In [None]:
from scipy.stats import zscore

In [None]:
def flatten_list(my_list):
    '''
    Flattens a list
    '''
    
    return([x for el in my_list for x in el])

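def create_lq_df(df):
    '''
    Takes a df where rows are areas, columns are activities and cells are levels of activity,
    and returns a df of the same shape where cells are location quotients (LQs)
    
    '''
    
    # Share of each activity (column) in the total across all areas
    activity_totals = df.sum(axis=0)
    activity_shares = activity_totals/activity_totals.sum()
    
    # Share of each activity within an area (row), relative to its overall share
    lqs = df.apply(lambda x: (x/x.sum())/activity_shares, axis=1)
    return(lqs)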


In [None]:
# Create a list of lad names
lad_names = list(set(flatten_list(gtr['involved_lad_name'])))

In [None]:
#Leading

# Number of projects
ai_lead_projects = pd.crosstab(gtr['lead_lad_value'],gtr['ai_mod'])
ai_lead_projects.columns = ['non_ai_lead_p','ai_lead_p']

#Also calculate the share
ai_lead_projects['ai_lead_p_share'] = ai_lead_projects['ai_lead_p']/ai_lead_projects.sum(axis=1)

In [None]:
#Levels of funding
ai_lead_funding  = gtr.groupby(['lead_lad_value','ai_mod'])['amount'].sum().reset_index(
    drop=False).pivot_table(index='lead_lad_value',columns='ai_mod',values='amount').fillna(0)

ai_lead_funding.columns = ['non_ai_lead_f','ai_lead_f']
ai_lead_funding['ai_lead_f_share'] = ai_lead_funding['ai_lead_f']/ai_lead_funding.sum(axis=1)

In [None]:
#Involved
# Here we count the number of AI / non-AI projects that a LAD has organisations involved in
ai_involved_projects = pd.concat(
    [gtr.loc[[lad in x for x in gtr['involved_lad_name']],'ai_mod'].value_counts() for lad in lad_names],
    axis=1).fillna(0)

# Sort the False / True index so that non-AI always comes first, then put LADs in rows
ai_involved_projects = ai_involved_projects.sort_index().T
ai_involved_projects.index = lad_names
ai_involved_projects.columns = ['non_ai_inv_p','ai_inv_p']

ai_involved_projects['ai_inv_p_share'] = ai_involved_projects['ai_inv_p']/ai_involved_projects.sum(axis=1)

In [None]:
ai_activity = pd.concat([ai_lead_projects,ai_lead_funding,ai_involved_projects],axis=1,sort=False)
ai_activity.head()

In [None]:
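# Also create some LQd versions
# The LQs are calculated on the count / funding columns only, leaving out the share columns added above

ai_activity['ai_lead_p_lq'] = create_lq_df(ai_lead_projects[['non_ai_lead_p','ai_lead_p']])['ai_lead_p']
ai_activity['ai_lead_f_lq'] = create_lq_df(ai_lead_funding[['non_ai_lead_f','ai_lead_f']])['ai_lead_f']
ai_activity['ai_inv_p_lq'] = create_lq_df(ai_involved_projects[['non_ai_inv_p','ai_inv_p']])['ai_inv_p']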

In [None]:
ai_activity.fillna(0,inplace=True)

In [None]:
#Also create standardised version
ai_activity_z = ai_activity.apply(zscore)    
ai_activity_z.columns = ['z_'+x for x in ai_activity_z.columns]

ai_activity = pd.concat([ai_activity,ai_activity_z],axis=1)
ai_activity.head()
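
For reference, the standardisation above subtracts each column's mean and divides by its standard deviation. `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, which the sketch below reproduces with pandas alone on invented values.

```python
import pandas as pd

# Invented activity measures for four areas
toy = pd.DataFrame({'ai_lead_p': [2.0, 4.0, 6.0, 8.0],
                    'ai_lead_f': [10.0, 10.0, 30.0, 30.0]})

# Column-wise z-scores with the population standard deviation (ddof=0)
toy_z = (toy - toy.mean()) / toy.std(ddof=0)
toy_z.columns = ['z_' + c for c in toy_z.columns]
```

Note that a column with no variation has a standard deviation of zero and would produce NaN z-scores.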

#### b. Create measures of automation

`aut_prob` means share of jobs with some tasks at risk of automation

`aut_high` means the share of jobs at high risk of automation

`number_high` is the number of jobs at high risk of automation

In [None]:
aut_z = aut.iloc[:,2:].fillna(0).apply(zscore)
aut_z.columns = ['z_'+x for x in aut_z.columns]

aut = pd.concat([aut,aut_z],axis=1)

aut = aut.set_index('lad_name').drop('lad_code',axis=1)

In [None]:
aut.head()

#### c. Combine

In [None]:
ai_aut = pd.concat([ai_activity,aut],axis=1,join='inner')
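
An inner join on the index silently drops any area that appears in only one of the two tables, for example because of differences in how a LAD name is spelled. The sketch below, on invented frames, shows the join and one way to list the areas that would be lost.

```python
import pandas as pd

# Invented tables indexed by area name
left = pd.DataFrame({'ai_lead_p': [3, 1]}, index=['Area A', 'Area B'])
right = pd.DataFrame({'aut_prob': [0.4, 0.5]}, index=['Area B', 'Area C'])

# Keep only the areas present in both tables
combined = pd.concat([left, right], axis=1, join='inner')

# Areas present in exactly one table, i.e. dropped by the inner join
dropped = sorted(set(left.index) ^ set(right.index))
```

Checking `dropped` before analysis is a cheap way to confirm how many areas fail to match.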

#### Exploratory data analysis

In [None]:
import seaborn as sns

How do the shares of activity in AI research compare with the shares of activity in other areas, and with the shares of automation risk?

In [None]:
# Create a df with the values for that analysis

In [None]:
#Here we consider the % of research activity and automation risk accounted for by locations with different % of the workforce at risk of automation

ai_total_vars = ['non_ai_lead_p','ai_lead_p','non_ai_lead_f','ai_lead_f','non_ai_inv_p','ai_inv_p','number_high']

#This maps each LAD to its quintile (0 = lowest) in the automation distribution
discr = {lad:pos for lad,pos in zip(ai_aut.index,pd.qcut(ai_aut['aut_prob'],q=np.arange(0,1.1,0.2),labels=False))}

lads_auto_sorted = ai_aut.sort_values('aut_high',ascending=True).index
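
To make the binning step concrete, the sketch below uses `pd.qcut` to assign ten invented areas to quintiles of a made-up risk measure. Passing `q=5` requests five equal-sized groups and `labels=False` returns integer codes from 0 (lowest) to 4 (highest).

```python
import pandas as pd

# Invented risk scores for ten areas
toy_risk = pd.Series([0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75],
                     index=['lad_%d' % i for i in range(10)])

# Quintile code for each area
toy_bins = pd.qcut(toy_risk, q=5, labels=False)

# Lookup from area name to quintile
toy_map = toy_bins.to_dict()
```

With distinct values, each quintile receives the same number of areas; heavy ties in real data can make the groups uneven.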

In [None]:
ai_discr = ai_aut[ai_total_vars].copy()

#We focus on the share of each variable in the total
ai_discr = ai_discr.apply(lambda x: x/x.sum())

ai_discr['automation_rank'] = ai_discr.index.map(discr)

fig,ax = plt.subplots(figsize=(10,3))

ax = (100*ai_discr.groupby('automation_rank')[ai_total_vars].sum()).plot.bar(cmap='Accent_r',edgecolor='grey',ax=ax,
                                                                            title='Distribution of research activity vs risk of automation')

ax.legend(bbox_to_anchor=(1,1))

In [None]:
ax = ai_discr.loc[lads_auto_sorted,ai_total_vars].apply(np.cumsum).plot(figsize=(10,5),title='Research funding and Workforce at risk of automation')
ax.set_xticklabels([])

In [None]:
# #Explore correlation pairs (focusing on normalised scores)

# zs = ai_aut[[x for x in ai_aut.columns if ('z_' in x) and ('_lq' not in x)]]


# sns.pairplot(zs)


In [None]:
fig,ax = plt.subplots(figsize=(8,9))

sns.heatmap(ai_aut.corr().loc[ai_activity.columns,aut.columns],cmap='seismic',center=0,ax=ax,annot=True)

ax.set_title('Correlation between research activity and automation risk')

**Next steps**

Model the link between workforce automation risks and level of local AI research
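
One possible starting point, sketched here on purely synthetic data, is a linear regression of automation risk on an AI research measure with and without an occupational control. Everything below is an assumption made for illustration: the variable names, the data-generating process and the choice of ordinary least squares are not results or decisions from this analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic control: share of the workforce in professional occupations
prof_share = rng.uniform(0.1, 0.5, n)
# Synthetic AI research measure, correlated with the control
ai_lq = 2 * prof_share + rng.normal(0, 0.1, n)
# Synthetic outcome that depends on the control only
aut_risk = 0.6 - 0.5 * prof_share + rng.normal(0, 0.01, n)

# Design matrices with an intercept column
X_simple = np.column_stack([np.ones(n), ai_lq])
X_control = np.column_stack([np.ones(n), ai_lq, prof_share])

# Least-squares coefficient estimates
beta_simple, *_ = np.linalg.lstsq(X_simple, aut_risk, rcond=None)
beta_control, *_ = np.linalg.lstsq(X_control, aut_risk, rcond=None)
```

In this synthetic setup the AI coefficient is clearly negative without the control and shrinks towards zero once the control is added, which is the kind of pattern the hypothesis in this section would need to rule out.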

## 3. Drivers of diffusion

First step (exploratory) - find sectors with highest levels of activity in relevant topics

In [None]:
topic_df.columns

I will focus on four topics: 0 (ethical), 1 (legal), 2 (data) and 30 (measurement and prediction)

#### Quantify importance of the topics in different sectors

In [None]:
#Turn the topic probabilities into dummies to simplify analysis
# A topic is present in a project if it has a weight above 0.9
# (a vectorised comparison gives the same result as an element-wise applymap)
topic_dummy = topic_df.iloc[:,1:] > 0.9
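
The thresholding above can be illustrated on a toy table of topic probabilities. The project ids, topic names and weights below are invented; only the 0.9 cut-off comes from this notebook.

```python
import pandas as pd

# Invented topic probabilities for three projects
toy_topics = pd.DataFrame({'project_id': ['p1', 'p2', 'p3'],
                           'topic_0': [0.95, 0.10, 0.91],
                           'topic_1': [0.05, 0.99, 0.50]})

# A topic counts as present when its probability is above 0.9
toy_dummy = toy_topics.iloc[:, 1:] > 0.9

# Number of projects in which each topic is present
toy_counts = toy_dummy.sum()
```

A high cut-off like this keeps only confident assignments, at the cost of dropping projects where a topic is present but weakly expressed.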

In [None]:
topic_selected = pd.concat([topic_df['project_id'],topic_dummy.iloc[:,[0,1,3,30]]],axis=1)

#Simpler column names
topic_selected.columns = ['project_id','topic_ethics','topic_social_legal','topic_data','topic_measurement']

topic_vars = list(topic_selected.columns[1:])

gtr_comb = pd.merge(gtr,topic_selected,left_on='project_id',right_on='project_id')



In [None]:
#Topic presence
topic_presence = gtr_comb.groupby('sel_industry')[topic_vars].sum()

#Normalise by the number of projects in each industry

topic_presence_norm = topic_presence.div(gtr_comb['sel_industry'].value_counts(),axis=0)
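
The normalisation above turns counts into the share of an industry's projects in which a topic is present. A minimal sketch on invented data, where the industry labels and topic flags are illustrative only:

```python
import pandas as pd

# Invented projects with an industry label and a topic dummy
toy = pd.DataFrame({'sel_industry': ['health', 'health', 'health', 'energy'],
                    'topic_data': [True, False, True, True]})

# Number of projects per industry in which the topic is present
presence = toy.groupby('sel_industry')[['topic_data']].sum()

# Total number of projects per industry
n_projects = toy['sel_industry'].value_counts()

# Share of each industry's projects in which the topic is present
presence_norm = presence.div(n_projects, axis=0)
```

Dividing along `axis=0` aligns the two objects on the industry labels, so the result does not depend on the order in which industries appear.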


In [None]:
fig,ax = plt.subplots(figsize=(5,15))

topic_presence_norm.sort_values('topic_data',ascending=True).plot.barh(ax=ax)

**Next steps**

* Discretise the labels for the sectors
* Compare levels of AI activity across discretised groups
* Model diffusion of AI

In [None]:
# Correlate the share of AI projects in each industry with the normalised presence of each topic
pd.concat([pd.crosstab(gtr_comb['sel_industry'],gtr_comb['ai_mod'],normalize=0)[True],
           topic_presence_norm],axis=1).corr()