# 1. GtR analysis

Analysis of AI activity in UK research funding data.

We will load a df with projects and another with IDs for AI projects.

We want to do the following analyses on it:

#### 1. Analysis of AI diffusion in the GtR data.

Here, we have two research questions + a stretch goal

##### Research questions

1. What are the levels of diffusion of AI activity in different disciplines and industries?
2. How has this evolved over time?
3. What are its drivers?

Our prior is that there will be strong concentration of activity in digital industries and (perhaps) an underrepresentation in public service oriented sectors. We will explore the determinants of activity. Do sectors with a stronger propensity to generate database outputs also see faster growth in AI activity?

#### 2. Analysis of concentration

##### Research questions

1. What are the levels of spatial concentration in AI research funding and how have they changed over time?
2. What are the levels of organisational concentration? (Here it would be nice to have the company-level information; ask Alex.)
3. Do we see organisations/teams that started early moving across application domains?

Question 3 may require loading the link table and the person data.

#### 3. Analysis of purpose

##### Research questions

1. What are the levels of activity in SDGs and how have they evolved over time?
2. How do they compare to the population of activity overall?








## Preamble

In [None]:
%run notebook_preamble.ipy

In [None]:
# Imports

import pickle  # used below to load the AI project IDs
from ast import literal_eval

## Load GtR

In [None]:
# GtR data
gtr = pd.read_csv('../data/external/14_6_2019_gtr_cleaned.csv',compression='zip')



In [None]:
#We remove irrelevant columns
gtr = gtr.loc[:,['Unnamed' not in x for x in gtr.columns]]

In [None]:
gtr.head()

In [None]:
#Parse the lists 

lad_vars = [x for x in gtr.columns if 'lad_' in x]

for ladvar in lad_vars:
    
    gtr[ladvar] = [literal_eval(x) for x in gtr[ladvar]]


In [None]:
# Load AI IDs
with open('../data/external/ai_ids.p','rb') as infile:
    ai_ids = pickle.load(infile)


In [None]:
gtr['ai'] = [x in ai_ids for x in gtr['project_id']]

One problem: We only identified AI projects for grants.

#### We will train an AI classifier on the GtR data and use it to predict the rest.

## Preamble

In [None]:
import ast
import os
import random

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
# %load lda_pipeline.py
from gensim import corpora, models
from string import punctuation
from string import digits
import re
import pandas as pd
import numpy as np

#Characters to drop
drop_characters = re.sub('-','',punctuation)+digits

#Stopwords
from nltk.corpus import stopwords

#NLTK names its stopword files in lowercase
stop = stopwords.words('english')

#Stem functions
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

In [None]:
# Utility functions

def flatten_list(my_list):
    '''
    Flattens a list
    '''
    
    return([x for el in my_list for x in el])

def flatten(my_list):
    '''
    Flattens a list
    '''
    
    return([x for el in my_list for x in el])


def clean_tokenise(string,drop_characters=drop_characters,stopwords=stop):
    '''
    Takes a string, lowercases it, removes symbols, digits and stopwords,
    and returns a list of tokens.
    
    '''
    
    #Lowercase
    str_low = string.lower()
    
    #Remove symbols and numbers (escaped so they are treated literally inside the character class)
    str_letters = re.sub('[{drop}]'.format(drop=re.escape(drop_characters)),'',str_low)
    
    #Remove stopwords and empty tokens, using the stopword list passed as an argument
    clean = [x for x in str_letters.split(' ') if (x not in stopwords) and (x!='')]
    
    return(clean)


class CleanTokenize():
    '''
    This class takes a list of strings and returns a tokenised, clean list of token lists ready
    to be processed with the LdaPipeline
    
    It has a clean method to remove symbols and stopwords
    
    It has a bigram method to detect collocated words
    
    It has a stem method to stem words
    
    '''
    
    def __init__(self,corpus):
        '''
        Takes a corpus (list where each element is a string)
        '''
        
        #Store
        self.corpus = corpus
        
    def clean(self,drop=drop_characters,stopwords=stop):
        '''
        Removes symbols, numbers and stopwords
        
        '''
        
        cleaned = [clean_tokenise(doc,drop_characters=drop,stopwords=stopwords) for doc in self.corpus]
        
        self.tokenised = cleaned
        return(self)
    
    def stem(self):
        '''
        Optional: stems words
        
        '''
        #Stems each word in each tokenised sentence
        stemmed = [[stemmer.stem(word) for word in sentence] for sentence in self.tokenised]
    
        self.tokenised = stemmed
        return(self)
        
    
    def bigram(self,threshold=10):
        '''
        Optional: creates bigrams.
        
        '''
        
        #Collocation detector trained on the data
        phrases = models.Phrases(self.tokenised,threshold=threshold)
        
        bigram = models.phrases.Phraser(phrases)
        
        #Store the result as a list of token lists (gensim returns a lazy transformed corpus)
        self.tokenised = list(bigram[self.tokenised])
        
        return(self)
        
        
        
        

class LdaPipeline():
    '''
    This class processes lists of keywords.
    How does it work?
    -It is initialised with a list where every element is a collection of keywords
    -It has a method to filter keywords removing those that appear less than a set number of times
    
    -It has a method to process the filtered df into an object that gensim can work with
    -It has a method to train the LDA model with the right parameters
    -It has a method to predict the topics in a corpus
    
    '''
    
    def __init__(self,corpus):
        '''
        Takes the list of terms
        '''
        
        #Store the corpus
        self.tokenised = corpus
        
    def filter(self,minimum=5):
        '''
        Removes keywords that appear `minimum` times or fewer (by default, keeps those that appear more than 5 times).
        
        '''
        
        #Load
        tokenised = self.tokenised
        
        #Count tokens
        token_counts = pd.Series([x for el in tokenised for x in el]).value_counts()
        
        #Tokens to keep
        keep = token_counts.index[token_counts>minimum]
        
        #Filter
        tokenised_filtered = [[x for x in el if x in keep] for el in tokenised]
        
        #Store
        self.tokenised = tokenised_filtered
        self.empty_groups = np.sum([len(x)==0 for x in tokenised_filtered])
        
        return(self)
    
    def clean(self):
        '''
        Remove symbols and numbers
        
        '''
        
        
        
    
        
    def process(self):
        '''
        This creates the bag of words we use in the gensim analysis
        
        '''
        #Load the list of keywords
        tokenised = self.tokenised
        
        #Create the dictionary
        dictionary = corpora.Dictionary(tokenised)
        
        #Create the Bag of words. This converts keywords into ids
        corpus = [dictionary.doc2bow(x) for x in tokenised]
        
        self.corpus = corpus
        self.dictionary = dictionary
        return(self)
        
    def tfidf(self):
        '''
        This is optional: We extract the term-frequency inverse document frequency of the words in
        the corpus. The idea is to identify those keywords that are more salient in a document by normalising over
        their frequency in the whole corpus
        
        '''
        #Load the corpus
        corpus = self.corpus
        
        #Fit a TFIDF model on the data
        tfidf = models.TfidfModel(corpus)
        
        #Transform the corpus and save it
        self.corpus = tfidf[corpus]
        
        return(self)
    
    def fit_lda(self,num_topics=20,passes=5,iterations=75,random_state=1803):
        '''
        
        This fits the LDA model taking a set of keyword arguments.
        #Number of passes, iterations and random state for reproducibility. We will have to consider
        reproducibility eventually.
        
        '''
        
        #Load the corpus
        corpus = self.corpus
        
        #Train the LDA model with the parameters we supplied
        lda = models.LdaModel(corpus,id2word=self.dictionary,
                              num_topics=num_topics,passes=passes,iterations=iterations,random_state=random_state)
        
        #Save the outputs
        self.lda_model = lda
        self.lda_topics = lda.show_topics(num_topics=num_topics)
        

        return(self)
    
    def predict_topics(self):
        '''
        This predicts the topic mix for every observation in the corpus
        
        '''
        #Load the attributes we will be working with
        lda = self.lda_model
        corpus = self.corpus
        
        #Now we create a df
        predicted = lda[corpus]
        
        #Convert this into a dataframe
        predicted_df = pd.concat([pd.DataFrame({x[0]:x[1] for x in topics},
                                              index=[num]) for num,topics in enumerate(predicted)]).fillna(0)
        
        self.predicted_df = predicted_df
        
        return(self)
    

In [None]:
def create_lq_df(df):
    '''
    Takes a df with cells = activity in col in row and returns a df with cells = lq
    
    '''
    
    area_activity = df.sum(axis=0)
    area_shares = area_activity/area_activity.sum()
    
    lqs = df.apply(lambda x: (x/x.sum())/area_shares, axis=1)
    return(lqs)
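
As a sanity check on the orientation of `create_lq_df`, here is a minimal sketch on made-up counts (the place and sector names are hypothetical). Rows are places and columns are sectors; a location quotient above 1 means the place is more specialised in that sector than the table as a whole.

```python
import pandas as pd

def create_lq_df(df):
    """Same logic as above: row shares divided by overall column shares."""
    col_shares = df.sum(axis=0) / df.sum(axis=0).sum()
    return df.apply(lambda row: (row / row.sum()) / col_shares, axis=1)

# Toy activity counts (hypothetical places and sectors)
counts = pd.DataFrame(
    {"software": [30, 10], "health": [10, 50]},
    index=["place_a", "place_b"],
)

lq = create_lq_df(counts)
print(lq.round(2))
```

Here `place_a` puts 75% of its activity in software against 40% overall, so its software LQ is 0.75 / 0.40 = 1.875.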


### Classification

In [None]:
#ML imports
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score,GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelBinarizer

import warnings

warnings.simplefilter('ignore',UserWarning)

In [None]:
# Utility functions

def flatten_list(my_list):
    '''
    Flattens a list
    '''
    
    return([x for el in my_list for x in el])


def dummies_from_list(list_of_categories):
    '''
    This function takes a list of category lists (one per observation) and returns a df where every column is a dummy
    for each unique category. Admittedly, the function could be nicer.
    
    '''

    #We concatenate a bunch of series whose indices are the names of the variables.
    #We could have done something similar by creating DFs with one row
    
    cats = [x for x in set(flatten_list(list_of_categories))]

    df = pd.DataFrame()

    for category in cats:
    
        var = [category in x for x in list_of_categories]

        df[category] = var

    
    
    
    #dummy_df = pd.concat([pd.Series({v:1 for v in obs}) for obs in list_of_categories],axis=1).T.fillna(0)
    return(df)
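
A quick usage sketch for `dummies_from_list`, with hypothetical tags. The function is restated here in a compact form so the snippet runs on its own; sorting the categories is an addition for a deterministic column order and is not part of the original.

```python
import pandas as pd

def dummies_from_list(list_of_categories):
    """Compact equivalent of the function above, for illustration only."""
    cats = sorted({c for obs in list_of_categories for c in obs})
    return pd.DataFrame({c: [c in obs for obs in list_of_categories] for c in cats})

# Hypothetical project tags: one list per project, possibly empty
tags = [["ai", "health"], ["health"], []]

dummies = dummies_from_list(tags)
print(dummies)
```

Each row is one observation and each column flags whether that category appears in the observation's list, so a project with no tags gets a row of `False`.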

In [None]:
# %load text_classifier.py
# Classes

#One class for text classification based on text inputs

class TextClassification():
    '''
    This class takes a corpus (could be a list of strings or a tokenised corpus) and a target (could be multiclass or single class).
    
    When it is initialised it vectorises the list of tokens using sklearn's count vectoriser.
    
    It has a grid search method that takes a list of models and parameters and trains the model.
    
    It returns the output of grid search for diagnosis
    
    '''
    
    def __init__(self,corpus,target):
        '''
        
        Initialise. The class will recognise if we are feeding it a list of strings or a list of
        tokenised documents and vectorise accordingly. 
        
        It will also recognise whether this is a multiclass or single-class problem based on the dimensions of the target array
        
        Later on, it will use control flow to modify model parameters depending on the type of data we have
        
        '''
        
        #Is this a multiclass classification problem or a single class classification problem?
        if target.shape[1]>1:
            self.mode = 'multiclass'
            
        else:
            self.mode = 'single_class'
    
    
        #Store the target
        self.Y = target
    
        #Did we feed the model a bunch of strings or a list of tokenised docs? If the former, we clean and tokenise.
        
        if isinstance(corpus[0],str):
            #corpus = CleanTokenize(corpus).clean().bigram().tokenised
            corpus = CleanTokenize(corpus).clean().tokenised
            
        #Turn every list of tokens into a string for count vectorising
        corpus_string =  [' '.join(words) for words in corpus]
        
        
        #And then we count vectorise in a hacky way.
        count_vect = CountVectorizer(stop_words='english',min_df=5).fit(corpus_string)
        
        #Store the features
        self.X = count_vect.transform(corpus_string)
        
        #Store the count vectoriser (we will use it later on for prediction on new data)
        self.count_vect = count_vect
        
    def grid_search(self,models):
        '''
        The grid search method takes a list with models and their parameters and it does grid search crossvalidation.
        
        '''
        
        #Load inputs and targets into the model
        Y = self.Y
        X = self.X
        
        if self.mode=='multiclass':
            #If the problem is multiclass we wrap each model in a one-vs-rest classifier
            #and add the matching prefix to the model params
        
            for mod in models:
                #Make ovr
                mod[0] = OneVsRestClassifier(mod[0])
                
                #Add the estimator prefix
                mod[1] = {'estimator__'+k:v for k,v in mod[1].items()}
                
        
        #Container with results
        results = []

        #For each model, run the analysis.
        for num,mod in enumerate(models):
            print(num)

            #Run the classifier
            clf = GridSearchCV(mod[0],mod[1])

            #Fit
            clf.fit(X,Y)

            #Append results
            results.append(clf)
        
        self.results = results
        return(self)

    
#Class to visualise the outputs of multilabel models.

#I call it OrangeBrick after YellowBrick, the package for ML output visualisation 
#(which currently doesn't support multilabel classification)


class OrangeBrick():
    '''
    This class takes a df with the true classes for a multilabel classification exercise and produces some charts visualising findings.
    
    The methods include:
    
        .make_metrics: estimates confusion matrices and precision/recall scores by class
        .confusion_chart: creates a stacked barchart with the confusion matrices stacked by category, sorting classes by performance
        .prec_rec_chart: creates a barchart showing each class precision and recall;
        #To be done: Consider mixes between classes?
    
    '''
    
    def __init__(self,true_labels,predicted_labels,var_names):
        '''
        Initialise with a true labels, predicted labels and the variable names
        '''
         
        self.true_labels = true_labels
        self.predicted_labels = predicted_labels
        self.var_names = var_names
    
    def make_metrics(self):
        '''
        Estimates performance metrics (for now just confusion charts by class and precision/recall scores for the 0.5 
        decision rule).
        
        '''
        #NB sklearn's confusion_matrix(y_true, y_pred) puts its first argument on the rows and its second on the columns.
        #Below we pass the predictions first, so rows index the predicted class and columns index the ground truth.
        #This means that:
            #cf[0,0]-> TN
            #cf[1,1]-> TP
            #cf[0,1]-> FN (prediction is false, ground truth is true)
            #cf[1,0]-> FP (prediction is true, ground truth is false)



        #Predictions and true labels
        true_labels = self.true_labels
        pred_labels = self.predicted_labels

        #Variable names
        var_names = self.var_names

        #Store confusion matrices
        score_store = []


        for num in np.arange(len(var_names)):

            #This is the confusion matrix
            cf = confusion_matrix(pred_labels[:,num],true_labels[:,num])

            #This is a melted confusion matrix
            melt_cf = pd.melt(pd.DataFrame(cf).reset_index(drop=False),id_vars='index')['value']
            melt_cf.index = ['true_negative','false_positive','false_negative','true_positive']
            melt_cf.name = var_names[num]
            
            #Order variables to separate failed vs correct predictions
            melt_cf = melt_cf.loc[['true_positive','true_negative','false_positive','false_negative']]

            #We are also interested in precision and recall
            prec = cf[1,1]/(cf[1,1]+cf[1,0])
            rec = cf[1,1]/(cf[1,1]+cf[0,1])

            prec_rec = pd.Series([prec,rec],index=['precision','recall'])
            prec_rec.name = var_names[num]
            score_store.append([melt_cf,prec_rec])
    
        self.score_store = score_store
        
        return(self)
    
    def confusion_chart(self,ax):
        '''
        Plot the confusion charts
        
        
        '''
        
        #Visualise confusion matrix outputs
        cf_df = pd.concat([x[0] for x in self.score_store],axis=1)

        #This ranks categories by the error rates
        failure_rate = cf_df.apply(lambda x: x/x.sum(),axis=0).loc[['false' in x for x in cf_df.index]].sum().sort_values(
            ascending=False).index

        
        #Plot and add labels
        cf_df.T.loc[failure_rate,:].plot.bar(stacked=True,ax=ax,width=0.8,cmap='Accent')

        ax.legend(bbox_to_anchor=(1.01,1))
        #ax.set_title('Stacked confusion matrix for disease areas',size=16)
    
    
    def prec_rec_chart(self,ax):
        '''
        
        Plot a precision-recall chart
        
        '''
    

        #Again, we sort them here to assess model performance across classes
        prec_rec = pd.concat([x[1] for x in self.score_store],axis=1).T.sort_values('precision')
        prec_rec.plot.bar(ax=ax)

        #Add legend and title
        ax.legend(bbox_to_anchor=(1.01,1))
        #ax.set_title('Precision and Recall by disease area',size=16)
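
The cell indexing in `make_metrics` depends on which labels go on the rows. A minimal sketch on made-up labels, using a hand-rolled (hypothetical) helper `pred_by_truth_matrix` that puts predictions on the rows and ground truth on the columns, which is the layout we get when predictions are passed first:

```python
import numpy as np

def pred_by_truth_matrix(pred, true):
    """2x2 count matrix: predicted class on rows, ground truth on columns."""
    cf = np.zeros((2, 2), dtype=int)
    for p, t in zip(pred, true):
        cf[int(p), int(t)] += 1
    return cf

# Toy binary labels (made up)
true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
pred = np.array([1, 1, 0, 0, 0, 1, 0, 0])

cf = pred_by_truth_matrix(pred, true)
tn, fn, fp, tp = cf[0, 0], cf[0, 1], cf[1, 0], cf[1, 1]

precision = tp / (tp + fp)  # share of positive predictions that are correct
recall = tp / (tp + fn)     # share of true positives that are recovered
print(cf, precision, recall)
```

With this layout, precision reads along the predicted-positive row and recall reads down the ground-truth-positive column, matching the formulas used in the class.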

In [None]:
gtr_grants = gtr.loc[gtr['grant_category']=='Research Grant'].reset_index(drop=True)

In [None]:
#Here is the corpus
corpus = list(gtr_grants['abstract'])

#We use a utility function to create a df for a one vs rest classification
target = pd.get_dummies(gtr_grants['ai'])

In [None]:
#Run grid search with these model parameters
my_models = [
    [RandomForestClassifier(),
     {'class_weight':['balanced',None],'min_samples_leaf':[1,5]}],
    
    #liblinear supports both the l1 and l2 penalties searched over below
    [LogisticRegression(solver='liblinear'),
     {'class_weight':['balanced',None],'penalty':['l1','l2'],
      'C':[0.1,1,100]}]]

In [None]:
# Predict groups

#Initialise the TextClassification class
gtr_t = TextClassification(corpus,target)

In [None]:
# import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [None]:
gtr_t.grid_search(my_models)

In [None]:
#Check scores and best estimators
for res in gtr_t.results:
    print(res.best_score_)
    print(res.best_estimator_)

#We carry forward the best estimator from the second model (the logistic regression)
best_est = gtr_t.results[1].best_estimator_

In [None]:
gtr_diag = OrangeBrick(true_labels=np.array(target),
                      predicted_labels=best_est.predict(gtr_t.X),
                      var_names=target.columns).make_metrics()

In [None]:
fig,ax = plt.subplots(nrows=2,figsize=(10,7.5))

gtr_diag.confusion_chart(ax=ax[0])
gtr_diag.prec_rec_chart(ax=ax[1])

#fig.suptitle('Model evaluation for GTR disciplines',y=1.01,size=16)

plt.tight_layout()

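Not bad, although these scores are computed on the same data the model was fitted on, so they will overstate out-of-sample performance.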

In [None]:
ai_mask = pd.DataFrame(best_est.predict_proba(gtr_t.X)>0.75)

### Label all the data

In [None]:
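#Reset the index so that row labels match the row positions of the predictions we add below
gtr_all = pd.concat([gtr_grants,gtr.loc[gtr['grant_category']!='Research Grant']],axis=0).reset_index(drop=True)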

#Extract the lead lad - there is only one per project
gtr_all['lead_lad_value']= [x[0] if len(x)>0 else np.nan for x in gtr_all['lead_lad_name']]

abstract_all_transformed = gtr_t.count_vect.transform(gtr_all['abstract'])

In [None]:
gtr_all['ai_mod'] = pd.DataFrame(best_est.predict_proba(abstract_all_transformed))[1]>0.6

gtr_all = gtr_all.loc[(gtr_all['year']>2006)&(gtr_all['year']<2019)]
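
Assigning a Series built from a positional array (such as the output of `predict_proba`) to a DataFrame column aligns on index labels, not on row position. A minimal sketch with toy frames and made-up scores shows why the stacked frame needs a clean 0..n-1 index for rows and predictions to line up:

```python
import numpy as np
import pandas as pd

# Two toy frames; the second keeps non-contiguous index labels from its source
a = pd.DataFrame({"abstract": ["a0", "a1"]})                # labels 0, 1
b = pd.DataFrame({"abstract": ["b0", "b1"]}, index=[3, 2])  # labels 3, 2

scores = np.array([0.9, 0.1, 0.2, 0.8])  # one made-up score per stacked row

# Labels are kept as they are: the Series is matched by label, so rows get swapped
misaligned = pd.concat([a, b], axis=0)
misaligned["flag"] = pd.Series(scores) > 0.6

# Labels are rebuilt as 0..n-1: label and position now coincide
aligned = pd.concat([a, b], axis=0, ignore_index=True)
aligned["flag"] = pd.Series(scores) > 0.6

print(misaligned["flag"].tolist())
print(aligned["flag"].tolist())
```

In the first frame the row `b0` (third in position, label 3) picks up the score stored at label 3 rather than its own, which is the kind of silent mismatch to watch for.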

### What does this look like?

#### Funder distribution

In [None]:
pd.crosstab(gtr_all['ai_mod'],gtr_all['funder'],normalize=0).T.plot.bar()

#### Timeframe

In [None]:
pd.crosstab(gtr_all['year'],gtr_all['ai_mod'],normalize=0)[True].rolling(window=4).mean().dropna().plot(title='Share of total')
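
To make the smoothing step concrete, here is a small sketch on a made-up table. `normalize='index'` (equivalent to `normalize=0`) turns each year's row into shares, and the rolling mean then averages those shares over consecutive years; the window of 2 is chosen only to keep the toy example short.

```python
import pandas as pd

# Toy project table (values are made up)
toy = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011, 2012, 2012, 2013, 2013],
    "ai_mod": [False, False, True, False, True, False, True, True],
})

# Share of projects flagged as AI within each year
share = pd.crosstab(toy["year"], toy["ai_mod"], normalize="index")[True]

# Rolling mean over two consecutive years; the first year has no full window and is dropped
smoothed = share.rolling(window=2).mean().dropna()

print(share)
print(smoothed)
```

Note that the rolling mean averages yearly shares rather than pooling projects, so years with few projects weigh as much as years with many.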

#### Discipline

In [None]:
pd.crosstab(gtr_all['sel_disc'],gtr_all['ai_mod'],normalize=1).plot.bar()

Discipline evolution

In [None]:
gtr_ai = gtr_all.loc[gtr_all['ai_mod']==True]

disc_sorted = gtr_ai['sel_disc'].value_counts().index

In [None]:
ax = (pd.crosstab(gtr_ai['year'],gtr_ai['sel_disc'],normalize=0)[disc_sorted[::-1]]
      .rolling(window=3).mean().dropna()
      .plot.bar(stacked=True,width=0.9,cmap='Accent',edgecolor='lightgrey',title='Discipline shares'))
ax.legend(bbox_to_anchor=(1,1))

#### Industry

In [None]:
ax = pd.crosstab(gtr_all['sel_industry'],gtr_all['ai_mod'],normalize=1).plot.bar(figsize=(10,5))

In [None]:
ind_sorted = gtr_ai['sel_industry'].value_counts().index

In [None]:
ax = (pd.crosstab(gtr_ai['year'],gtr_ai['sel_industry'],normalize=0)[ind_sorted[:8]]
      .rolling(window=3).mean().dropna()
      .plot.bar(stacked=True,width=0.9,cmap='Accent_r',edgecolor='lightgrey',title='Industry shares'))

ax.legend(bbox_to_anchor=(1,1))

#### SDG

In [None]:
ax = pd.crosstab(gtr_all['sel_sdg'],gtr_all['ai_mod'],normalize=1).plot.bar(figsize=(10,5))

In [None]:
sdg_sorted = gtr_ai['sel_sdg'].value_counts().index

In [None]:
ax = (pd.crosstab(gtr_ai['year'],gtr_ai['sel_sdg'],normalize=0)[sdg_sorted[:8]]
      .rolling(window=3).mean().dropna()
      .plot.bar(stacked=True,width=0.9,cmap='Accent_r',edgecolor='lightgrey',title='SDG shares'))

ax.legend(bbox_to_anchor=(1,1))

In [None]:
#gtr_ai.loc[gtr_ai['sel_sdg']=='sdg_13_climate_action'].sort_values('sdg_13_climate_action',ascending=False)

The SDG classification is underperforming compared with the others. Its training corpus is 'furthest away' from the GtR data. We will have to review this carefully and decide what to do.

## Analysis of organisations / places


#### Counts of activity

In [None]:
#Extract the lead lad - there is only one per project
gtr_all['lead_lad_value']= [x[0] if len(x)>0 else np.nan for x in gtr_all['lead_lad_name']]

In [None]:
# Crosstab
ai_counts = pd.crosstab(gtr_all['lead_lad_value'],gtr_all['ai_mod']).sort_values(True,ascending=False)
ai_counts[:10]

In [None]:
100*pd.crosstab(gtr_all['lead_lad_value'],gtr_all['ai_mod'],normalize=0).loc[ai_counts.index[:20]].sort_values(True,ascending=False)

#### What is the level of concentration? Is it increasing over time?

#### In what industries do different locations specialise?

In [None]:
city_industry_counts = gtr_ai.groupby(['lead_lad_value','sel_industry'])['ai_mod'].sum().reset_index(
    drop=False).pivot_table(index='lead_lad_value',columns='sel_industry',values='ai_mod').fillna(0).loc[ai_counts.index[:20],ind_sorted[:7]]

In [None]:
city_industry_spec = create_lq_df(city_industry_counts).apply(lambda x: pd.qcut(x,q=np.arange(0,1.1,0.2),labels=False,duplicates='drop'),axis=1)

sns.heatmap(city_industry_spec,cmap='Oranges',edgecolor='lightgray',linewidth=0.001)

How are they evolving over time?

In [None]:
city_year_counts = gtr_ai.groupby(['lead_lad_value','year'])['ai_mod'].sum().reset_index(
    drop=False).pivot_table(index='lead_lad_value',columns='year',values='ai_mod').fillna(0)

#city_year_counts.T.plot.bar(stacked=True)
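
A place-by-year table like `city_year_counts` could feed a simple concentration measure. One option, which this notebook does not yet implement, is the Herfindahl-Hirschman index: the sum of squared shares across places in each year. The sketch below uses made-up counts and hypothetical place names, and the choice of index is an assumption rather than a settled part of the analysis.

```python
import pandas as pd

# Toy counts of AI projects by place (rows) and year (columns); values are made up
city_year = pd.DataFrame(
    {2016: [8, 1, 1], 2017: [5, 3, 2]},
    index=["place_a", "place_b", "place_c"],
)

# Each place's share of that year's total
shares = city_year / city_year.sum(axis=0)

# Sum of squared shares per year: 1/n when evenly spread, 1 when one place has everything
hhi = (shares ** 2).sum(axis=0)

print(hhi)
```

In this toy table the index falls from 0.66 to 0.38, which would read as activity becoming less concentrated between the two years.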

## Add industries

In [None]:
projects_industries = pd.read_csv('../data/processed/6_8_2019_gtr_organisations_industries_2.csv',compression='zip')

In [None]:
gtr_all_industries = pd.merge(gtr_all,projects_industries,left_on='project_id',right_on='project_id',how='left')

In [None]:
gtr_all_industries.head()

In [None]:
gtr_all_industries.to_csv(f'../data/processed/{today_str}_gtr_processed.csv',compression='zip')