# Analysis of GtR data for diffusion of AI paper

**Draft pre-abstract**

This paper analyses the diffusion of AI-related methods and technologies across UK research fields, using an open dataset on research funding covering the period 2006-2018. We are particularly interested in differential rates of diffusion among research projects with different 'industrial orientations' (which we predict using a machine learning model trained on a large corpus of business descriptions) before and after 2012, a landmark year in the development of AI research, and in the drivers of that diffusion. Here, we focus on three potential explanations of AI diffusion:

* The 'data intensity' of a field, which we estimate using the propensity of related research fields to generate data outputs
* The 'prediction intensity' of a field, which we measure through the use of terms related to prediction, uncertainty and risk in its project descriptions
* The 'ethical and legal' risks of a field, which we measure through the semantic similarity between projects in the field and academic research on ethical risks from AI.

**Activities**
1. Load the Gateway to Research (GtR) data
2. Identify AI projects.
  * Approach a. Use a keyword based approach
  * Approach b (robustness)
    * Train a model on GtR data
    * Use topic modelling
3. Identify data-intensive sectors.
  * Approach a. Use the data-productivity of different sectors
  * Approach b. Use project semantic similarity to documents about data, data processing and infrastructure etc.
4. Identify prediction-intensive sectors
  * Approach a. Calculate semantic similarity between projects and documents describing relevant concepts such as prediction, risk, uncertainty, decision-making
  * Approach b. Measure the 'prediction intensity' of different sectors based on their occupational distribution
5. Identify ethical and legal issues.
  * Approach a. Calculate the semantic similarity between projects in the domain and projects related to ethics and legal issues
  * Approach b. Use social media data... somehow
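
Several of the approaches above (3b, 4a and 5a) depend on a similarity score between a project and a set of reference documents. As a rough illustration only, the sketch below computes a bag-of-words cosine between two token lists. This captures lexical overlap rather than meaning, so an embedding-based measure would be closer to 'semantic' similarity; the function name `bow_cosine` and the toy token lists are assumptions made for the example.

```python
import math
from collections import Counter


def bow_cosine(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy documents, invented for illustration
project = ['forecast', 'flood', 'risk', 'under', 'uncertainty']
reference = ['risk', 'uncertainty', 'prediction', 'decision']

print(round(bow_cosine(project, reference), 3))
```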


## 0. Preamble




In [None]:
%run notebook_preamble.ipy
%run lda_pipeline.py
%run text_classifier.py
%run keyword_searches.py
%run utils.py

In [None]:
# Put functions and classes here

def flatten_list(a_list):
    return([x for el in a_list for x in el])


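def random_check(corpus,num,length):
    '''
    Prints up to num random examples from corpus, each truncated to length characters
    
    '''
    
    #Sample positions without replacement so duplicates cannot reduce the number of examples
    selected = np.random.choice(len(corpus),size=min(num,len(corpus)),replace=False)
    
    texts  = [text for pos,text in enumerate(corpus) if pos in selected]
    
    for t in texts:
        print(t[:length])
        print('====')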

In [None]:
def label_data(corpus_to_label,corpus_tokenised,w2v,seed_list,threshold,name,occ_threshold):
    '''
    This function queries a word2vec model to identify synonyms for an initial seed vocabulary, finds documents in the data
    that contain words from the expanded vocabulary, and labels a df with them.
    
    Arguments
    
        -corpus_to_label: a df where every row is a document. We want to label them
        -corpus_tokenised: a bag of words of elements corresponding to the documents
        -w2v: word2vec model used for the expansion
        -seed_list: the list of terms we want to expand
        -threshold: similarity threshold when expanding the keywords
        -name: name used to label the relevant documents
        -occ_threshold: a document is flagged when its number of keyword occurrences is above this value
        
    Returns
         
        A list with the final set of keywords used for labelling, and a labelled df (labels include number of occurrences of words in the expanded seed and
        a boolean indicating if the number of occurrences is above occ_threshold)
        
    
    '''
    
    #Initialise the keywordExpansion object
    kw_exp = keywordExpander(corpus_to_label,corpus_tokenised,w2v)
    kw_exp.keyword_expansion(seed_list,thres=threshold)
    
    labeller = keywordLabeller(kw_exp)
    labeller.label_data(name=name)

    labelled_output = labeller.projects_labelled
    labelled_output['has_'+name] = labelled_output[name]>occ_threshold
    
    
    out = [kw_exp.expanded_keywords,labelled_output]
    
    return(out)
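
The expansion step described in the docstring can be sketched without the project's `keywordExpander` class, whose internals are not shown here. The version below is a minimal sketch that assumes expansion means "keep every vocabulary term whose cosine similarity to at least one seed reaches the threshold"; the function names and the toy two-dimensional vectors are invented for illustration and stand in for a trained word2vec model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seed_list, vectors, threshold):
    """Return the known seeds plus every term similar enough to one of them.

    Seeds that are missing from `vectors` are skipped (an assumption).
    """
    seeds = [s for s in seed_list if s in vectors]
    expanded = set(seeds)
    for seed in seeds:
        for term, vec in vectors.items():
            if cosine(vectors[seed], vec) >= threshold:
                expanded.add(term)
    return sorted(expanded)


# Toy vectors, invented purely for illustration
toy_vectors = {
    'machine_learning': [1.0, 0.1],
    'neural_network': [0.9, 0.2],
    'archaeology': [0.0, 1.0],
}

print(expand_keywords(['machine_learning', 'ai'], toy_vectors, threshold=0.85))
```

A higher threshold keeps the expanded vocabulary closer to the seeds, at the cost of missing looser synonyms.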

In [None]:
#Import here so this cell does not depend on a later cell having been run
import datetime

today_str = datetime.datetime.strftime(datetime.datetime.today(),'%d-%m-%Y')

In [None]:
from utils import get_latest_file
import datetime
from gensim.models import Word2Vec

In [None]:
#Don't want to print all the info logs
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

## 1. Load the data

In [None]:
input_data = [x for x in os.listdir('../data/raw/') if 'labelled' in x]

latest_file = get_latest_file(input_data)
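
The implementation of `get_latest_file` lives in `utils` and is not shown in this notebook. One plausible sketch, assuming file names start with a `dd-mm-YYYY` date followed by an underscore (the format this notebook uses when saving), is given below; the function name and the example file names are hypothetical. Parsing the date matters because day-first strings do not sort chronologically.

```python
import datetime


def latest_by_date_prefix(filenames, fmt='%d-%m-%Y'):
    """Return the file name whose leading date (before the first '_') is most recent."""
    def parse(name):
        return datetime.datetime.strptime(name.split('_')[0], fmt)
    return max(filenames, key=parse)


# Hypothetical file names, for illustration only
files = ['30-11-2018_gtr_labelled.csv', '02-01-2019_gtr_labelled.csv']

print(latest_by_date_prefix(files))  # picks the January 2019 file
print(max(files))                    # plain string comparison picks the older file
```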

In [None]:
gtr = pd.read_csv('../data/raw/'+latest_file,index_col=False)

In [None]:
gtr.head()

We are going to focus on research grants awarded after 2006 and before 2019

In [None]:
#Define a couple of column sets

discs, outputs = [[x for x in gtr.columns if w in x] for w in ['disc_','out_']]


#### Focus on grants

In [None]:
grants = gtr.query('grant_category == "Research Grant" & year > 2006 & year < 2019').reset_index(drop=True)

In [None]:
grants.shape

33k grant awards with abstracts since 2006.

In [None]:
grants['top_disc'] = grants[discs].idxmax(axis=1)
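
For readers less familiar with `idxmax(axis=1)`: for each row it returns the name of the column holding the largest value, taking the first such column when there is a tie. A plain-Python sketch of the same idea, with hypothetical `disc_` column names and made-up scores:

```python
def top_column(row, columns):
    """Name of the column with the largest value in `row` (first one wins ties)."""
    return max(columns, key=lambda c: row[c])


# Hypothetical discipline scores for a single project
disc_columns = ['disc_eng_tech', 'disc_life_sci', 'disc_social']
project = {'disc_eng_tech': 0.7, 'disc_life_sci': 0.2, 'disc_social': 0.1}

print(top_column(project, disc_columns))
```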

In [None]:
grants.top_disc.value_counts().plot.bar(color='blue',title='Discipline distribution')

In [None]:
fig,ax = plt.subplots(figsize=(10,10),nrows=2)

pd.crosstab(grants.year,grants.top_disc).rolling(window=3).mean().plot(ax=ax[0],title='Discipline funding evolution')

pd.pivot_table(
    grants.groupby(['year','top_disc'])['amount'].sum().reset_index(drop=False),
    index='year',columns='top_disc',aggfunc='sum').rolling(window=3).mean().plot(ax=ax[1],legend=False)

ax[0].legend(bbox_to_anchor=(1,0.2))

## 2. Find AI projects

In [None]:
ai_seed = ['machine_learning','artificial_intelligence','deep_learning','ai','machine_vision','text_mining','data_mining']

In [None]:
#Create sentence corpus
sentence_corpus = flatten_list([x.split('. ') for x in grants['abstract']])

In [None]:
#Tokenize etc using the classes above
sentence_tokenised = CleanTokenize(sentence_corpus).clean().bigram()

#Also tokenise by documents so we can query them later
corpus_tokenised = CleanTokenize(grants['abstract']).clean().bigram()
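
The seed terms use underscores (`machine_learning`) because the `bigram()` step joins word pairs that frequently occur together into single tokens. `CleanTokenize` is defined in the project's own modules and is not reproduced here; the sketch below only illustrates the idea with a raw co-occurrence count, whereas real phrase detectors typically score pairs rather than just counting them. The function name and `min_count` parameter are assumptions.

```python
from collections import Counter


def merge_frequent_bigrams(tokenised_docs, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times with '_'."""
    pair_counts = Counter(
        pair for tokens in tokenised_docs for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for tokens in tokenised_docs:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
                merged.append(tokens[i] + '_' + tokens[i + 1])
                i += 2  # skip the second half of the merged pair
            else:
                merged.append(tokens[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


docs = [
    ['machine', 'learning', 'for', 'health'],
    ['new', 'machine', 'learning', 'methods'],
]

print(merge_frequent_bigrams(docs))
```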

In [None]:
#Training W2V
w2v = Word2Vec(sentence_tokenised.tokenised,window=3)

In [None]:
occ_thres=0

In [None]:
ai_labelling_outputs = label_data(grants,corpus_tokenised.tokenised,w2v,ai_seed,0.85,'ai',occ_threshold=occ_thres)

We use the keyword classes and functions in `keyword_searches` above
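
Once the keyword list is fixed, the labelling itself reduces to counting keyword occurrences per document and comparing the count with `occ_threshold`. The sketch below mirrors the two columns that `label_data` produces (`ai` and `has_ai`); it is an illustration under that assumption, not the code in `keywordLabeller`, and the function name and toy documents are invented.

```python
def label_documents(tokenised_docs, keywords, name, occ_threshold=0):
    """Return one dict per document with a keyword count and a boolean flag."""
    keyword_set = set(keywords)
    rows = []
    for tokens in tokenised_docs:
        count = sum(1 for tok in tokens if tok in keyword_set)
        rows.append({name: count, 'has_' + name: count > occ_threshold})
    return rows


# Toy tokenised abstracts, invented for illustration
docs = [
    ['we', 'apply', 'machine_learning', 'and', 'deep_learning'],
    ['a', 'study', 'of', 'medieval', 'trade'],
]

labels = label_documents(docs, ['machine_learning', 'deep_learning'], 'ai')
print(labels)
```

With `occ_threshold=0`, a single keyword occurrence is enough to flag a project; raising it trades recall for precision.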

In [None]:
#Extract the labelled df from the outputs
grants_labelled_ai = ai_labelling_outputs[1]

grants_labelled_ai['has_ai'].sum()

Almost 1600 'AI' projects

In [None]:
random_check(grants_labelled_ai.loc[grants_labelled_ai['ai']>0]['abstract'],10,length=1000)

In [None]:
def labelled_data_plots(labelled_df,var_name,
                        ax,
                        cross_tab_against=['top_disc','funder'],
                        do_random_check=True,**kwargs):
    '''
    
    Produces some plots of the data focusing on the labelled observations

    Arguments
    
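        -labelled_df: the labelled dataset
        -var_name: name of the variable to report on
        -ax: array of matplotlib axes (one for the share line chart plus two per crosstab variable)
        -cross_tab_against: variables to crosstab against
        -do_random_check: print a random sample of labelled data
        -kwargs: 'number' and 'length' control how many examples are printed and how many characters of each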

    Returns
    
        -Line chart of activity as share of the total (number of projects and total funding)
        -For each crosstab:
            -Barchart with distribution of projects and funding
            -Linechart with project activity trends
        
    '''
    
    df = labelled_df

    has_var = f'has_{var_name}'
        
    is_df = df.loc[df[has_var]==True]
    

    
    print(f'{var_name} has {len(is_df)} projects ({np.round(100*(len(is_df)/len(df)),2)}% of the total)')
    print(f'It has received £{is_df.amount.sum()}, ({np.round(100*is_df.amount.sum()/df.amount.sum(),2)}% of the total)')
    
    #Example projects
    
    if do_random_check==True:
        print('\n')
        print(f'{var_name} EXAMPLES')
        print('=======')
        random_check(df.loc[df[has_var]>0]['abstract'],kwargs['number'],kwargs['length'])
        
    #Plots
    
    #Linechart
        #Projects
    
    #The first mean below is calculating the share of a boolean. The second is for the rolling mean
    df.groupby('year')[has_var].mean().rolling(window=2).mean().plot(title=f'{var_name} activity as a share of the total',ax=ax[0])
    
        #Funding
    
    fund_y = pd.pivot_table(df.groupby(['year',has_var])['amount'].sum().reset_index(drop=False),
                   index='year',columns=has_var,values='amount')
    fund_share = fund_y[True]/(fund_y[False]+fund_y[True])
    
    fund_share.rolling(window=2).mean().plot(ax=ax[0])
    
    ax[0].legend(labels=['projects','funding'])
    
    
    #Shares
    
    for num,var in enumerate(cross_tab_against):
        df.groupby(var)[has_var].mean().sort_values(ascending=False).plot.bar(color='blue',ax=ax[1+num],title=f'Share of {var} in {has_var}')
        
    #Trends

    
    for num,var in enumerate(cross_tab_against):
        pd.crosstab(is_df['year'],is_df[var],normalize=1).rolling(window=3).mean().plot(
            ax=ax[3+num],title=f'Share of {var_name} activity by {var} / year')

        ax[3+num].legend(bbox_to_anchor=(1,1))
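
The comment in `labelled_data_plots` about the two means is worth unpacking: the mean of a boolean column within a year is the share of flagged projects in that year, and the rolling mean then smooths that yearly series. A plain-Python sketch with invented records (the helper names are assumptions):

```python
from collections import defaultdict


def yearly_share(records, flag):
    """Share of records per year for which `flag` is True."""
    totals, hits = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec['year']] += 1
        hits[rec['year']] += int(rec[flag])
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def rolling_mean(values, window):
    """Trailing mean over `window` values; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out


# Toy records, invented for illustration
records = [
    {'year': 2011, 'has_ai': False}, {'year': 2011, 'has_ai': False},
    {'year': 2012, 'has_ai': True}, {'year': 2012, 'has_ai': False},
    {'year': 2013, 'has_ai': True}, {'year': 2013, 'has_ai': True},
]

shares = yearly_share(records, 'has_ai')
smoothed = rolling_mean(list(shares.values()), window=2)
print(shares)
print(smoothed)
```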
    
    
    
    
    

### Some preliminary explorations

#### AI Trends

In [None]:
fig,ax = plt.subplots(nrows=5,figsize=(10,20))

labelled_data_plots(grants_labelled_ai,'ai',ax=ax,**{'number':5,'length':1000})

plt.tight_layout()

### Identify prediction intensity

We will use the same approach as above but focusing on terms related to prediction.

In [None]:
pred_seed = ['prediction', 'uncertainty', 'risk', 'decision', 'probability']

In [None]:
#NB label the AI labelled set
pred_labelling_outputs = label_data(grants_labelled_ai,corpus_tokenised.tokenised,w2v,pred_seed,0.7,'prediction',occ_threshold=occ_thres)

In [None]:
pred_labelling_outputs[0]

In [None]:
fig,ax = plt.subplots(nrows=5,figsize=(10,20))

labelled_data_plots(pred_labelling_outputs[1],'prediction',ax=ax,**{'number':5,'length':1000})

plt.tight_layout()

### Identify projects related to data

In [None]:
data_seed = ['data','dataset','data_sets']

labelled_df = pred_labelling_outputs[1]

#NB label the set already labelled for AI and prediction
data_labelling_outputs = label_data(labelled_df,corpus_tokenised.tokenised,w2v,data_seed,0.7,'data',occ_threshold=occ_thres)


In [None]:
data_labelling_outputs[0]

In [None]:
fig,ax = plt.subplots(nrows=5,figsize=(10,20))

labelled_data_plots(data_labelling_outputs[1],'data',ax=ax,**{'number':5,'length':1000})

plt.tight_layout()

### Identify projects related to ethics

In [None]:
ethical_seed = ['legal','ethical','ethics','privacy','tort']

labelled_df = data_labelling_outputs[1]

#NB label the set already labelled for AI, prediction and data
ethical_labelling_outputs = label_data(labelled_df,corpus_tokenised.tokenised,w2v,ethical_seed,0.8,'ethics',occ_threshold=occ_thres)


In [None]:
ethical_labelling_outputs[0]

In [None]:
fig,ax = plt.subplots(nrows=5,figsize=(10,20))

labelled_data_plots(ethical_labelling_outputs[1],'ethics',ax=ax,**{'number':5,'length':1000})

plt.tight_layout()

#### Identify projects related 

In [None]:
labelled_df = ethical_labelling_outputs[1]

labelled_df['has_db']= labelled_df['out_db']>0

labelled_df[['has_ai','has_prediction','has_data','has_ethics','has_db']].corr()
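
Applied to boolean flags, the Pearson correlation is the correlation of two 0/1 indicators (sometimes called the phi coefficient): it is positive when two labels tend to appear in the same projects. A self-contained sketch with invented flags, where the `pearson` helper is written out only for illustration:

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric (or boolean) sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy flags for four projects, invented for illustration
has_ai = [True, True, False, False]
has_data = [True, False, False, False]

print(round(pearson(has_ai, has_data), 3))
```

Note that the helper divides by zero if a flag is constant across all projects, so it assumes each label is present in some projects and absent in others.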

In [None]:
random_check(labelled_df.loc[(labelled_df['has_ai']==True)&(labelled_df['has_ethics']==True)]['abstract'],num=30,length=1000)

## Save data

In [None]:
labelled_df.to_csv(f'../data/interim/{today_str}_projects_all_labels.csv')

In [None]:
6*5*750