# AI predictive analysis using MAG data


**Background**

The first step towards monitoring the diffusion of AI in various research fields is identifying AI. `01_jmg_` uses a keyword expansion strategy to identify projects that mention terms related to AI. This approach has a couple of limitations: 

1. it uses a relatively small number of terms, with the risk of low recall (we miss projects that don't mention our relatively limited vocabulary). 
2. (and relatedly) the low frequency of those terms across the corpus makes it hard to create scores that capture our confidence in the results, which could lower our precision.

**Goals**

Here, we use an alternative strategy to identify AI projects: a labelled dataset from Microsoft Academic Graph (MAG) that we obtained for another project. The idea is to train a model that predicts which of the papers in the dataset have AI as a label, using the words in the abstract as features. We will then transfer that model to the GtR projects and compare the results with the keyword-based approach.

**Activities**
1. Load, process and briefly explore the MAG data
2. Train model
3. Evaluate model
4. Load GtR data and compare model performances



## 0. Preamble

In [None]:
%run notebook_preamble.ipy
%run lda_pipeline.py
%run text_classifier.py
%run keyword_searches.py
%run utils.py

In [None]:
#Other imports

from ast import literal_eval

import random

random.seed(8)

In [None]:
#Put functions here

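def random_check(corpus,num,length):
    '''
    Prints num random examples from corpus, truncated to length characters
    
    '''
    
    #Sample positions without replacement so we always get num distinct examples
    selected = set(np.random.choice(len(corpus),size=min(num,len(corpus)),replace=False))
    
    texts  = [text for pos,text in enumerate(corpus) if pos in selected]
    
    for t in texts:
        print(t[:length])
        print('====')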

## 1. Load data

In [None]:
mag_data = []

mag_path = '../data/raw/mag_data/'

In [None]:
years = os.listdir(mag_path)

#For each year (folder in the directory) we extract items and put them in the mag_data list
for y in years:
    
    print(y)
    
    dir_in_y = os.listdir(os.path.join(mag_path,y))
    
    for item in dir_in_y:
        with open(os.path.join(mag_path,y,item),'r') as infile:
            file = json.load(infile)
            mag_data.append(file)

In [None]:
mag_flat = flatten_list(mag_data)

In [None]:
mag_flat[0]

Tasks to work with the above:

* Extract the labels ('F')
* Reconstruct the abstract ('E')
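
The second task can be sketched on a toy example. MAG stores each abstract as an inverted index (a word mapped to the positions where it appears), so rebuilding the text means pairing words with positions and sorting by position. The helper name `rebuild_abstract` and the toy index below are illustrative assumptions, not part of the MAG data or of our utilities.

```python
def rebuild_abstract(inverted_index):
    """Rebuild token order from a {word: [positions]} inverted index."""
    # Pair every position with its word, then sort by position
    positioned = [(pos, word)
                  for word, positions in inverted_index.items()
                  for pos in positions]
    return [word for pos, word in sorted(positioned)]


# Toy index for illustration only (not taken from the MAG files)
toy_index = {'learning': [1, 4], 'deep': [0], 'improves': [2], 'machine': [3]}

tokens = rebuild_abstract(toy_index)
print(' '.join(tokens))
```

A word that appears several times simply contributes several positions, which is why the index stores lists rather than single values.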

### Parse the MAG data

I want to keep the topics, the years and the abstracts

In [None]:
def parse_mag(mag_object):
    '''
    Parses the mag data
    
    Arguments:
        -mag_object: a dict with various fields of interest. We want to extract the year, the topics and reconstruct
        the abstract from its inverted index (this is the way that Microsoft stores its abstract data)
    
    Returns:
        -A dict with the three elements (topic, year, abstract)
    
    Observations:
        -If there is a failure we return a string flagging the failed item, so it can be filtered out later.
    
    '''
    
    my_id = mag_object['Id']
    
    try:
        topic = [x['FN'] for x in mag_object['F']]
        year = int(mag_object['D'].split('-')[0])
        ia = literal_eval(mag_object['E'])['IA']['InvertedIndex']
        
        #Pair every word with each of its positions, removing stopwords as we go
        positions = [(k.lower(),n) for k,v in ia.items() if k.lower() not in stop for n in v]
        
        #Sort by position to recover the original word order
        a = [word for word,n in sorted(positions,key=lambda x: x[1])]

        #Note: we store the parsed year (not the y left over from the loading loop)
        return({'topic':topic,'year':year,'abstract':a})
    
    except Exception:
        return(f'failed item {my_id}')

In [None]:
mag_parsed = []

for n,i in enumerate(mag_flat):
    
    if n%5000==0:
        print(n)
    
    
    mag_parsed.append(parse_mag(i))

In [None]:
mag_clean = [x for x in mag_parsed if type(x)!=str]

In [None]:
len(mag_clean)

We have lost quite a few papers that are missing their abstracts

In [None]:
pd.Series(flatten_list([x['topic'] for x in mag_clean])).value_counts()[:20]

In [None]:
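#Share of papers labelled with AI, computed from the data instead of a hard-coded count
100*sum('artificial intelligence' in x['topic'] for x in mag_clean)/len(mag_clean)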

6000 papers with AI (3.6% of the total). Not bad!
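
With positives this rare, a classifier could score well by predicting "not AI" everywhere, which is presumably why the grid searches in this notebook set `class_weight='balanced'`. Below is a minimal sketch of the weighting formula that scikit-learn documents for that option, `n_samples / (n_classes * class_count)`. The helper name and the label counts are made up for illustration; they are not the MAG figures.

```python
from collections import Counter


def balanced_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative labels with 4% positives (made-up numbers, not the MAG counts)
toy_labels = [1] * 4 + [0] * 96

weights = balanced_weights(toy_labels)
print(weights)
```

The rare class receives a much larger weight, so errors on AI papers cost more during training.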

### Modelling approach 1: text classifier

This approach is quite crude and will probably take a long time (we have a big corpus).

In [None]:
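#We draw a random sample to speed things up. np.random.randint samples positions with replacement,
#so the number of unique papers kept is smaller than the len(mag_clean)/1.5 draws requested
chosen = set(np.random.randint(0,high=len(mag_clean),
                  size=int(len(mag_clean)/1.5)))

mag_chosen = [x for num,x in enumerate(mag_clean) if num in chosen]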
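
`np.random.randint` draws positions with replacement, so repeated positions collapse and the sample ends up smaller than the number of draws. A minimal sketch of the difference using only the standard library; the population size and sample size are illustrative assumptions.

```python
import random

rng = random.Random(8)          # seeded for repeatability
population = list(range(1000))  # stand-in for the positions of mag_clean

# With replacement: repeated positions collapse, so fewer unique items survive
with_repl = {rng.randrange(len(population)) for _ in range(250)}

# Without replacement: exactly the requested number of distinct items
without_repl = rng.sample(population, 250)

print(len(with_repl), len(set(without_repl)))
```

If an exact sample size matters, sampling without replacement is the safer choice.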

In [None]:
target = pd.get_dummies(pd.Series([1 if 'artificial intelligence' in x['topic'] else 0 for x in mag_chosen]))

#Turn this tokenised set into strings for additional pre-processing
corpus = [' '.join(x['abstract']) for x in mag_chosen]

In [None]:
tc = TextClassification(corpus=corpus,target=target)

In [None]:
#Run grid search with these model parameters
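models = [
    [RandomForestClassifier(),
     {'class_weight':['balanced'],'min_samples_leaf':[1,5]}],
    
    #liblinear supports both the l1 and l2 penalties (the default solver in recent scikit-learn does not support l1)
    [LogisticRegression(),
     {'class_weight':['balanced'],'penalty':['l1','l2'],
      'C':[0.1,1,100],'solver':['liblinear']}]]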

In [None]:
tc.grid_search(models)

In [None]:
#Check scores and best estimators
for res in tc.results:
    print(res.best_score_)
    print(res.best_estimator_)
    
#This is the best estimator (the second model in the grid search)
best_est = tc.results[1].best_estimator_

In [None]:
ai_diag = OrangeBrick(true_labels=np.array(target),
                      predicted_labels=best_est.predict(tc.X),
                      var_names=target.columns).make_metrics()

In [None]:
fig,ax = plt.subplots(nrows=2,figsize=(10,7.5))

ai_diag.confusion_chart(ax=ax[0])
ai_diag.prec_rec_chart(ax=ax[1])

#fig.suptitle('Model evaluation for GTR disciplines',y=1.01,size=16)

plt.tight_layout()

In [None]:
pd.DataFrame(best_est.predict_proba(tc.X)>0.5).sum()

Excellent model performance! Let's check some of the errors.
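
Checking errors amounts to splitting papers by the combination of actual and predicted label. A minimal sketch of that split, with precision and recall derived from it; the function name and the toy labels are illustrative assumptions and do not come from our model.

```python
def confusion_quadrants(actual, predicted):
    """Count true/false positives and negatives for binary labels."""
    counts = {'tp': 0, 'fp': 0, 'fn': 0, 'tn': 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts['tp'] += 1   # AI paper, predicted AI
        elif p:
            counts['fp'] += 1   # not AI, predicted AI
        elif a:
            counts['fn'] += 1   # AI paper, missed by the model
        else:
            counts['tn'] += 1   # not AI, predicted not AI
    return counts


# Toy labels for illustration only
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

q = confusion_quadrants(actual, predicted)
precision = q['tp'] / (q['tp'] + q['fp'])
recall = q['tp'] / (q['tp'] + q['fn'])
print(q, precision, recall)
```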

In [None]:
ai_comb = pd.concat([pd.DataFrame(target),pd.DataFrame(best_est.predict(tc.X)),pd.Series(corpus)],axis=1)

In [None]:
ai_comb.columns = ['actual_no_ai','actual_ai','pred_no_ai','pred_ai','abstract']

In [None]:
random_check(ai_comb.loc[(ai_comb.actual_ai==1) & (ai_comb.pred_ai==1)]['abstract'],length=1000,num=5)

In [None]:
random_check(ai_comb.loc[(ai_comb.actual_ai==0) & (ai_comb.pred_ai==1)]['abstract'],length=1000,num=5)

False positives (papers predicted as AI without the AI label) seem to be quantitative papers with no AI activity.

In [None]:
#random_check(ai_comb.loc[(ai_comb.actual_ai==1) & (ai_comb.pred_ai==0)]['abstract'],length=1000,num=5)

### Modelling approach application b. Predict ethical / legal vocabularies

### Modelling approach 2: Using document vectors

We will use document vectors (in 350-dimensional space, as set in the Doc2Vec call) to predict the labels.

In [None]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [None]:
#Get the tokenised corpus

corpus_tokenised = CleanTokenize([' '.join(x['abstract']) for x in mag_clean]).clean().bigram().tokenised

#Create the tagged documents
tagged_docs = [TaggedDocument(w,[i]) for i,w in enumerate(corpus_tokenised)]

#Train the doc2vec model (vector_size replaces the older size argument, which gensim 4 no longer accepts)
d2v = Doc2Vec(documents=tagged_docs,vector_size=350,window=10,min_count=2)

In [None]:
#Run grid search with these model parameters
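models = [
    [RandomForestClassifier(),
     {'class_weight':['balanced'],'min_samples_leaf':[1,5]}],
    
    #liblinear supports both the l1 and l2 penalties (the default solver in recent scikit-learn does not support l1)
    [LogisticRegression(),
     {'class_weight':['balanced'],'penalty':['l1','l2'],
      'C':[0.1,1,100],'solver':['liblinear']}]]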

In [None]:
def grid_search(target,features,models):
        '''
        Grid search over models with different parameters. 
        
        Arguments:
            target: the variable(s) we want to predict
            features: the predictor
            models: dicts with parameters we will grid search over
            
        returns:
            The results of the grid search
        
        
        '''
        
        #Load inputs and targets into the model
        Y = target
        X = features
        
        for mod in models:
            #Make ovr
            mod[0] = OneVsRestClassifier(mod[0])
                
            #Add the estimator prefix
            mod[1] = {'estimator__'+k:v for k,v in mod[1].items()}
        
        #Container with results
        results = []

        #For each model, run the analysis.
        for num,mod in enumerate(models):
            print(num)

            #Run the classifier
            clf = GridSearchCV(mod[0],mod[1])

            #Fit
            clf.fit(X,Y)

            #Append results
            results.append(clf)
        
        return(results)
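
The grid search tries every combination of the listed parameter values, and the `estimator__` prefix routes each parameter to the model wrapped inside `OneVsRestClassifier` (scikit-learn's `<component>__<parameter>` convention). A minimal sketch of that expansion using only the standard library; `expand_grid` is a hypothetical helper that mimics the idea and is not scikit-learn's own implementation.

```python
from itertools import product


def expand_grid(param_grid, prefix='estimator__'):
    """List every parameter combination, with names prefixed for a wrapped estimator."""
    keys = sorted(param_grid)
    combos = product(*(param_grid[k] for k in keys))
    return [{prefix + k: v for k, v in zip(keys, combo)} for combo in combos]


grid = {'class_weight': ['balanced'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 100]}

candidates = expand_grid(grid)
print(len(candidates))
print(candidates[0])
```

The number of candidates is the product of the list lengths, which is why adding values to a grid quickly increases run time.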

In [None]:
target = pd.get_dummies(pd.Series([1 if 'artificial intelligence' in x['topic'] else 0 for x in mag_clean]))

In [None]:
#In gensim 4 and later the same array is exposed as d2v.dv.vectors
doc2vec_features = np.array(d2v.docvecs.vectors_docs)

doc_models = grid_search(target=target,features=doc2vec_features,models=models)

In [None]:
#Check scores and best estimators
for res in doc_models:
    print(res.best_score_)
    print(res.best_estimator_)
    
#This is the best estimator (the second model in the grid search)
best_est = doc_models[1].best_estimator_

In [None]:
ai_doc2vec = OrangeBrick(true_labels=np.array(target),
                      predicted_labels=best_est.predict(doc2vec_features),
                      var_names=target.columns).make_metrics()

In [None]:
fig,ax = plt.subplots(nrows=2,figsize=(10,7.5))

ai_doc2vec.confusion_chart(ax=ax[0])
ai_doc2vec.prec_rec_chart(ax=ax[1])

#fig.suptitle('Model evaluation for GTR disciplines',y=1.01,size=16)

plt.tight_layout()

In [None]:
pd.DataFrame(best_est.predict(doc2vec_features)).sum()

## Load the GtR data

In [None]:
gtr = pd.read_csv('../data/interim/04-04-2019_projects_all_labels.csv')

In [None]:
gtr_vect = tc.count_vect.transform(gtr['abstract'])

In [None]:
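#best_est was overwritten by the doc2vec model above, so we retrieve the text classifier
#that matches the features produced by tc.count_vect
text_est = tc.results[1].best_estimator_

out = text_est.predict_proba(gtr_vect)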

In [None]:
#Column 1 holds the predicted probability of the AI class
gtr['is_ai_model'] = out[:,1]
gtr['is_ai_model_bin'] = gtr['is_ai_model']>0.5

In [None]:
pd.crosstab(gtr['is_ai_model_bin'],gtr.has_ai)

In [None]:
random_check(gtr.loc[(gtr['is_ai_model_bin']==1) & (gtr.has_ai==1)]['abstract'],10,1000)

In [None]:
random_check(gtr.loc[(gtr['is_ai_model_bin']==0) & (gtr.has_ai==1)]['abstract'],10,1000)

In [None]:
random_check(gtr.loc[(gtr['is_ai_model_bin']==1) & (gtr.has_ai==0)]['abstract'],10,1000)

### Results comparison

The two classification strategies produce quite different results. Let's explore both to decide how to classify projects as AI.
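
Comparing the two strategies means placing each project in one of four groups, depending on which method flags it. A minimal sketch of that rule on made-up flags; the function name and the toy pairs are illustrative assumptions.

```python
from collections import Counter


def classify(has_kw, has_model):
    """Label a project by which strategy (keywords, model) flags it as AI."""
    if has_kw and has_model:
        return 'both'
    if has_kw:
        return 'only_kw'
    if has_model:
        return 'only_model'
    return 'no_ai'


# Toy (keyword flag, model flag) pairs for illustration only
pairs = [(True, True), (True, False), (False, True), (False, False), (False, True)]

labels = [classify(kw, model) for kw, model in pairs]
print(Counter(labels))
```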

In [None]:
gtr['classification'] = ['both' if (x==True) & (y==True) else 'no_ai' if (x==False) & (y==False) else 'only_kw' if (x==True) & (y==False) else 'only_model'
                         for x,y in zip(gtr['has_ai'],gtr['is_ai_model_bin'])]


In [None]:
gtr_ai = gtr.loc[gtr['classification']!='no_ai']

#Filter with a mask built from gtr_ai itself so the index always matches
disc_counts = pd.concat([gtr_ai.loc[gtr_ai['top_disc']==disc]['classification'].value_counts(normalize=True) for disc in [var for var in gtr_ai.columns if 'disc_' in var]],axis=1)

#gtr[['has_ai','is_ai_model_bin']].sum()

disc_counts.columns = [var for var in gtr_ai.columns if 'disc_' in var]

disc_counts.T.plot.bar(stacked=True)


In [None]:
pd.crosstab(gtr_ai['year'],gtr_ai['classification']).rolling(window=3).mean().plot()
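
The three-year rolling mean smooths yearly counts by averaging each year with the two before it, which leaves the first two years undefined (pandas returns missing values there by default). A minimal sketch of the same calculation in plain Python; the counts are made up for illustration.

```python
def rolling_mean(values, window=3):
    """Trailing mean over `window` observations; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough observations yet
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


yearly_counts = [3, 6, 9, 12, 30]  # made-up counts, not GtR data

smoothed = rolling_mean(yearly_counts)
print(smoothed)
```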

## What are the differences between the vocabularies?

1. Calculate word freqs for all docs
2. Calculate word freqs for all ai docs
3. Calculate normalised word freqs for ai classes vs all and vs all ai (written as a function)
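
The three steps above can be sketched on a toy corpus: a word is salient in a subset when its share of that subset's tokens is larger than its share of all tokens. This is a simplified stand-in, assuming the subset is drawn from the full corpus and skipping the frequency threshold; the function names and documents are illustrative.

```python
from collections import Counter


def relative_freqs(docs):
    """Share of all tokens accounted for by each word."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def salience(subset_docs, all_docs):
    """Rank words by share in the subset divided by share in the full corpus."""
    sub = relative_freqs(subset_docs)
    ref = relative_freqs(all_docs)  # assumes every subset word is also in all_docs
    ratios = {tok: sub[tok] / ref[tok] for tok in sub}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)


# Toy tokenised documents for illustration only
all_docs = [['neural', 'network', 'data'], ['data', 'policy'],
            ['history', 'data'], ['neural', 'data']]
ai_docs = [all_docs[0], all_docs[3]]

ranking = salience(ai_docs, all_docs)
print(ranking)
```

Ratios above 1 mark words over-represented in the subset; a generic word such as `data` falls below 1.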


In [None]:
gtr_tokenised = CleanTokenize(gtr['abstract']).clean().bigram()
gtr_tokens = gtr_tokenised.tokenised

All frequencies

In [None]:
all_freqs = pd.Series(flatten_list(gtr_tokens)).value_counts()

all_freqs.head()

AI frequencies

In [None]:
gtr['has_ai_either'] = gtr[['has_ai','is_ai_model_bin']].sum(axis=1)>0

#A set makes the membership checks below fast
ai_indices = {n for n,e in enumerate(gtr['has_ai_either']) if e}

ai_freqs = pd.Series(flatten_list([x for n,x in enumerate(gtr_tokens) if n in ai_indices])).value_counts()

ai_freqs.head()

Function that returns frequencies

In [None]:
def get_salient_words(corpus,normaliser,threshold):
    '''
    Function to normalise word frequencies in a corpus by another and return top ones (controlling for frequency to remove noise)
    
    Arguments
        corpus: the tokenised corpus where we are looking for salient terms
        normaliser: the word frequencies to normalise with
        threshold: a word has to appear more than this number of times in the corpus to be included
    
    Return
        A df with each word's share of occurrences in the corpus and in the normaliser, and their ratio
        (the normalised frequency), sorted from most to least salient.
    
    '''
    
    freqs = pd.Series(flatten_list(corpus)).value_counts()
    
    freqs_selected = freqs.loc[freqs>threshold]
    
    #Turn both sets of counts into shares
    to_norm = pd.concat([freqs_selected,normaliser],axis=1).apply(lambda x: x/x.sum())
    
    to_norm.columns = ['in_corpus','for_norm']
    
    #Calculate normalised frequency
    to_norm['freq_normalised'] = to_norm['in_corpus']/to_norm['for_norm']
    
    #Drop words missing from either side and sort by salience
    return(to_norm.dropna(axis=0).sort_values('freq_normalised',ascending=False))


In [None]:
all_ai_corpus = [x for n,x in enumerate(gtr_tokens) if n in ai_indices]

salient_all_ai = get_salient_words(all_ai_corpus,all_freqs,threshold=50)

In [None]:
#The order of the labels has to match the order of the variables on the left
ai_both,ai_model,ai_kw = [[w for i,w in enumerate(gtr_tokens) if i in indices] 
                           for indices in [{n for n,e in enumerate(gtr['classification']) if e ==val} for val in 
                           ['both','only_model','only_kw']]]

In [None]:
processed = []

for docs,term in zip([ai_both,ai_model,ai_kw],['both','model','kw']):
    
    salient_all = get_salient_words(docs,all_freqs,threshold=40)
    salient_ai = get_salient_words(docs,ai_freqs,threshold=40)
    
    #Suffix the column names with the normaliser used (all docs or ai docs)
    salient_all.columns,salient_ai.columns = [[x+f'_{suffix}' for x in df.columns] for suffix,df in zip(['all','ai'],[salient_all,salient_ai])]
    
    
    out = pd.concat([salient_all,salient_ai],axis=1)
    
    out['category'] = term
    #out['pos'] = np.arange(0,len(out))
    
    processed.append(out)
    
extracted_freqs = pd.concat(processed,axis=0)

In [None]:
for x in ['both','model','kw']:
    
    print(x)
    print('===')
    
    rel = extracted_freqs.loc[extracted_freqs['category']==x]

    print('all_norm')
    print(rel.sort_values('freq_normalised_all',ascending=False)[:10].index)

    print('-------')
    print('ai_norm')
    print(rel.sort_values('freq_normalised_ai',ascending=False)[:10].index)

    print('\n')


## Correlations

In [None]:

#Rebuild the classification column with vectorised conditions. Looping over enumerate(gtr)
#would iterate over column names, and & binds tighter than ==, so we build the masks first
kw = gtr['has_ai']==1
model = gtr['is_ai_model_bin']==1

gtr['classification'] = np.select(
    [kw & model, kw & ~model, ~kw & model],
    ['both','only_kw','only_model'],
    default='no_ai')

In [None]:
gtr[['has_ai','is_ai_model_bin']].head()


In [None]:
gtr[['has_ai','is_ai_model_bin','has_prediction','has_ethics','has_db','has_data']].corr()
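
By default `.corr()` computes Pearson correlations, which for 0/1 indicator columns like these measure how often two flags switch on together. A minimal sketch of the calculation for one pair of columns; the function name and values are illustrative and are not GtR results.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy indicator columns for illustration only
has_ai = [1, 1, 0, 0, 1, 0]
is_ai_model = [1, 0, 0, 0, 1, 1]

r = pearson(has_ai, is_ai_model)
print(round(r, 3))
```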