# Why was this cited? Explainable Machine Learning Applied to COVID-19 Research Articles

This notebook contains the code for all models built to produce the results of the article **Why was this cited? Explainable Machine Learning Applied to COVID-19 Research Articles**.

# Expected directory structure

* cache/Entity_matrix <- cached entity extraction results, the folder must exist even if empty
* cache/FeatureImportance <- feature importance results used to prune rule learning matrices and for Shapley plots
* cache/ConceptNet <- cached ConceptNet API responses, written when the ConceptNet matrix is regenerated
* inputs/PUBMED <- article records
* inputs/metadata_with_opencitations.csv <- abstracts and article metadata for dataset version 1 with citations
* inputs/df_sw_tok_low_punc_lemm_v5.csv <- abstracts and article metadata for dataset version 2
* inputs/biblio.json <- journal information
* inputs/author_names_info.csv <- author names and their nationality (dataset version 1)
* inputs/citationcounts_oci_revised.csv <- citations for dataset version 2
* results/ <- results of processing
* Gollam/ <- several internally used code routines
* cordParam.R <- code for CBA models
* cordParamVisualization.R <- code for visualizing CBA models
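
Because the notebook reads and writes these locations through relative paths, it can help to verify the layout before running anything. The following is a minimal sketch and not part of the original notebook: the helper name `check_layout` is hypothetical, and the path lists are copied from the structure above. It creates the cache and results folders when they are missing and returns the inputs that are absent.

```python
from pathlib import Path

# Folders the notebook writes to; safe to create when absent.
REQUIRED_DIRS = ["cache/Entity_matrix", "cache/FeatureImportance", "results"]

# Inputs that the user has to supply (taken from the list above).
EXPECTED_INPUTS = [
    "inputs/PUBMED",
    "inputs/metadata_with_opencitations.csv",
    "inputs/df_sw_tok_low_punc_lemm_v5.csv",
    "inputs/biblio.json",
    "inputs/author_names_info.csv",
    "inputs/citationcounts_oci_revised.csv",
]


def check_layout(root="."):
    """Create the writable folders and return the list of missing inputs."""
    root = Path(root)
    for rel in REQUIRED_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    return [rel for rel in EXPECTED_INPUTS if not (root / rel).exists()]
```

Calling `check_layout()` from the notebook folder returns an empty list when every input is in place.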

# Inputs of this notebook - description 

## DATASET VERSION 1 
### 1) metadata_with_opencitations.csv
- This is the main source of articles used in this analysis: the CORD-19 corpus (downloaded from https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge) after reduction by PUBMED (downloaded from https://github.com/fhircat/CORD-19-on-FHIR/tree/master/datasets/Pubtator_RDF/CORD-19-Abstracts), with citation counts extracted from OpenCitations added. The whole process of obtaining this data from the sources and cleaning it is described in the notebook preprocessing_dataset_v1.

### 2) biblio.json
- This is the input source for the bibliometric features; it is used only with dataset version 1.

### 3) author_names_info.csv
- This is the list of names used in the author_names feature matrix after reduction for dataset version 1, with the nationality of each name obtained from https://www.name-prism.com/.


## DATASET VERSION 2
### 1) df_sw_tok_low_punc_lemm_v5.csv
- This is a newer version of the CORD-19 corpus after abstract cleaning and reduction, as described in the notebook preprocessing_dataset_v2.ipynb.

### 2) citationcounts_oci_revised.csv
- Extracted citations for dataset version 2. 
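
The loading cell later in this notebook reads the count column of this file under the header `count_opencitations;;;;;;`, which suggests trailing semicolons in the raw CSV. Below is a minimal sketch of that cleaning step on made-up rows; the DOIs and values are invented for illustration, and the raw cell format is an assumption.

```python
import pandas as pd

# Toy rows mimicking the raw file; DOIs and counts are invented.
df_cit = pd.DataFrame({
    "doi": ["10.1000/a", "10.1000/b", "10.1000/c"],
    "count_opencitations;;;;;;": ["12;;;;;;", "0;;;;;;", ";;;;;;"],
})

df_cit = df_cit.rename(columns={"count_opencitations;;;;;;": "OpenCitations"})
# Keep the first run of digits; rows without any digit become NaN.
df_cit["OpenCitations"] = df_cit["OpenCitations"].str.extract(r"(\d+)", expand=False)
# Drop rows without a usable count, then convert to integers.
df_cit = df_cit[["doi", "OpenCitations"]].dropna()
df_cit["OpenCitations"] = df_cit["OpenCitations"].astype(int)
print(df_cit)
```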

# RESULTS FOR DATASET VERSION 1 

- This notebook produces the results of the article for two versions of the dataset.
- To run it for a specific version, set the parameter DATASET_VERSION.

In [None]:
DATASET_VERSION = 1  # Other choice: 2

## Libraries

In [None]:
import pandas as pd
import numpy as np
import os
import pickle
import json
import requests
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
import re
import sys
import numpy
import hashlib
import itertools  
import ast

import sklearn.ensemble
from sklearn.utils import class_weight, shuffle
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score, roc_curve,classification_report,confusion_matrix,mean_absolute_error, mean_squared_error
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.pipeline import make_pipeline


#BERT random forest
import sys, setuptools, tokenize
import torch
import tensorflow
from tensorflow import keras
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')
from ipywidgets import IntProgress

# BERT neural network
# TensorFlow 2.x is required. Google Colab does not run scripts on TensorFlow 2.x by default, so it is selected explicitly here.
try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras import layers
import bert
import math
import random

from Gollam.Data_Clean.CheckLanguage import isEnglish
from Gollam.Ngrams.ExtractNgramsEntityList import extract_ngrams_entity_list
from Gollam.Data_Clean.IsNotAscii import is_not_ascii

import shap

from rdflib import Graph
import itertools
from corels import CorelsClassifier

## FOR CBA
import en_core_sci_lg
import en_ner_bionlp13cg_md
import en_ner_jnlpba_md
import en_ner_craft_md

In [None]:
#os.environ["R_HOME"] = "/usr/lib/R" 
os.environ["R_HOME"] = r"C:\Program Files\R\R-4.1.2"
#os.environ["PATH"] = "/usr/lib/R"+  ";" + os.environ["PATH"] 
os.environ["PATH"] = r"C:\Program Files\R\R-4.1.2\bin\x64" + ";" + os.environ["PATH"]

import rpy2
from rpy2.robjects import pandas2ri, packages
pandas2ri.activate()
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
import rpy2.robjects as robjects
from rpy2.robjects.conversion import localconverter

In [None]:
from __future__ import print_function
from io import StringIO
from lime import lime_text
from lime.lime_text import LimeTextExplainer

In [None]:
import warnings
warnings.filterwarnings('ignore')

## Parameters

In [None]:
# False - regenerate matrices (may be extremely slow)
# True - use cached matrices
CSV_matrix_PubTator_conceptnet = True
CSV_matrix_PubTator = True
CSV_matrix_scispacy_conceptnet = True
CSV_matrix_scispacy = True

# Reduction of BOW and TF-IDF by parameter min_df
MIN_DF = 32

# Reduction of SCISPACY and PUBTATOR matrices by feature importance from random forest
MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT = 1500

# Use the random forest parameters found by grid search instead of the defaults (the preset parameters are faster and the results are the same)
USE_GRID = True

RUN_SEARCHING_OPTIMAL_MATRIX_SIZE = False
RUN_GRID_FOR_BERT = False
RUN_GRID_RANDOM_FOREST = False

## Source data

The documents come from https://allenai.org/data/cord-19
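
The loading cell below derives the prediction target by dividing each article's citation count by its age in years and splitting the result at the median into `low` and `high`. Here is a small sketch of that step on invented numbers, assuming the reference year 2020 that the cell uses for dataset version 1; the data frame `toy` is hypothetical.

```python
import pandas as pd

# Invented toy data: publication year and citation count of four articles.
toy = pd.DataFrame({
    "year": [2010, 2015, 2018, 2019],
    "OpenCitations": [100, 10, 40, 1],
})

toy["age"] = 2020 - toy["year"]
toy["NormCitations"] = toy["OpenCitations"] / toy["age"]  # citations per year
# q=2 splits the articles at the median of NormCitations.
toy["Target"] = pd.qcut(toy["NormCitations"], q=2, labels=["low", "high"])
print(toy)
```

In pandas, an article whose year equals the reference year has age 0, so the division yields `inf` (or `NaN` when it has no citations); the toy data avoids such rows.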

In [None]:
if DATASET_VERSION == 1:
    
    DATA_PATH = "./inputs/metadata_with_opencitations.csv"
    PubTator_DATA_PATH="./inputs/PUBMED/"
    documents = pd.read_csv(DATA_PATH, error_bad_lines=False,index_col="doi").reset_index()
    documents = documents.drop_duplicates(subset=['pubmed_id'], keep=False)
    documents = documents.drop_duplicates(subset=['doi','journal'], keep=False)
    documents = documents.set_index("doi")

    # impact factor features
    with open('inputs/biblio.json') as f:
         data = json.load(f)

    #drop documents with short abstract
    documents = documents.drop(documents[documents['abstract'].map(len) < 3].index)

    # lowercase the abstracts and remove the word "abstract" from them
    documents["abstract"] = documents["abstract"].str.lower().str.replace('abstract', '', regex=False)
    documents['age']  = 2020- documents['year']
    documents["NormCitations"] = documents["OpenCitations"]/documents['age']

    DISCRETIZATION_NUMBER_OF_CATEGORIES=2
    DISCRETIZATION_PRECISION=0
    DISCRETIZATION_LABELS=["low","high"]
    documents["Target"]=pd.qcut(documents["NormCitations"], q=DISCRETIZATION_NUMBER_OF_CATEGORIES, precision=DISCRETIZATION_PRECISION, labels=DISCRETIZATION_LABELS)
    documents = documents.dropna()
    print(len(documents))
    
if DATASET_VERSION == 2:
    
    df_all = pd.read_csv("inputs/df_sw_tok_low_punc_lemm_v5.csv").rename(columns = {'doi_x':'doi'})
    df_cit = pd.read_csv("inputs/citationcounts_oci_revised.csv",error_bad_lines=False,encoding="utf-8")
    df_cit = df_cit.rename(columns={'count_opencitations;;;;;;': 'OpenCitations'})
    df_cit['OpenCitations'] =  df_cit['OpenCitations'].str.extract(r'(\d+)', expand=False)
    df_cit = df_cit[['doi','OpenCitations']].dropna()
    df_cit['OpenCitations'] = df_cit['OpenCitations'].astype(int)

    df_merged = df_cit.merge(df_all, on="doi",how="inner") # want to have articles from Kaggle with opencitations
    df_merged = df_merged[pd.notnull(df_merged['Year'])] # for target - needed to have not null year
    documents = df_merged
    documents['year']=documents['Year']

    #drop documents with short abstract
    documents = documents.drop(documents[documents['abstract'].map(len) < 3].index)
    
    # lowercase the abstracts and remove the word "abstract" from them
    documents["abstract"] = documents["abstract"].str.lower().str.replace('abstract', '', regex=False)
    documents['age']  = 2021 - documents['year']
    documents["NormCitations"]= documents["OpenCitations"]/documents['age']
    DISCRETIZATION_NUMBER_OF_CATEGORIES=2
    DISCRETIZATION_PRECISION=0
    DISCRETIZATION_LABELS=["low","high"]
    documents["Target"]=pd.qcut(documents["NormCitations"], q=DISCRETIZATION_NUMBER_OF_CATEGORIES,  precision=DISCRETIZATION_PRECISION,  labels=DISCRETIZATION_LABELS)
    documents = documents.dropna()
    documents[documents["Target"]=="low"][["Target","OpenCitations"]].value_counts()
    print(len(documents))

In [None]:
if DATASET_VERSION ==1:
    # impact factor features
    with open('inputs/biblio.json') as f:
         data = json.load(f)

In [None]:
if DATASET_VERSION ==1:
    if CSV_matrix_PubTator_conceptnet == True:
        matrix_PubTator_conceptnet_pd=pd.read_csv('cache/Entity_matrix/matrix_fhir_2940_conceptnet_new.csv',engine="python",index_col="doi", error_bad_lines=False)
        print(matrix_PubTator_conceptnet_pd.shape)
    
    if CSV_matrix_PubTator == True:    
        matrix_PubTator_pd=pd.read_csv('cache/Entity_matrix/matrix_fhir_2940.csv',engine="python",index_col="doi", error_bad_lines=False)
        print(matrix_PubTator_pd.shape)
    
    if CSV_matrix_scispacy == True:
        matrix_scispacy_pd=pd.read_csv('cache/Entity_matrix/matrix_scispacy_2940.csv',engine="python",index_col="doi", error_bad_lines=False)
        print(matrix_scispacy_pd.shape)
    
    if CSV_matrix_scispacy_conceptnet == True:  
        matrix_scispacy_conceptnet_pd=pd.read_csv('cache/Entity_matrix/matrix_scispacy_2940_conceptnet.csv',engine="python",index_col="doi", error_bad_lines=False)
        print(matrix_scispacy_conceptnet_pd.shape)

# Plot citations

In [None]:
num_bins = len(documents['year'].unique())

fig, ax = plt.subplots()

# the histogram of the data
n, bins, patches = ax.hist(documents['year'], num_bins)

ax.set_xlabel('Year of publication')
ax.set_ylabel('Articles')

# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.show()

In [None]:
# Warning: this plot uses the data before the documents without biblio data have been removed
fig, ax = plt.subplots()

#ax.scatter(documents[documents.Target=="low"]["year"], documents[documents.Target=="low"]["Target"])
ax.hist(documents[documents.Target=="low"]["year"], num_bins, alpha=0.5)
ax.hist(documents[documents.Target=="high"]["year"], num_bins, alpha=0.5, color="red")

plt.legend(['Low citations', 'High citations']) 
ax.set_xlabel('Year of publication')
ax.set_ylabel('Number of articles')

# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.show()

In [None]:
# Warning: this plot uses the data before the documents without biblio data have been removed
documents["Target_OpenCitations"]=pd.qcut(documents["OpenCitations"],
                            q=DISCRETIZATION_NUMBER_OF_CATEGORIES, 
                            precision=DISCRETIZATION_PRECISION, 
                            labels=DISCRETIZATION_LABELS)
fig, ax = plt.subplots()
ax.hist(documents[documents.Target_OpenCitations=="low"]["year"], num_bins, alpha=0.5)
ax.hist(documents[documents.Target_OpenCitations=="high"]["year"], num_bins, alpha=0.5, color="red")
plt.legend(['Low citations', 'High citations']) 
ax.set_xlabel('Year of publication')
ax.set_ylabel('Number of articles')
# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.show()
documents=documents.drop(columns=['Target_OpenCitations'])

## Bibliometric features
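
The journal records in `biblio.json` are nested, and the cell below flattens each one into a single row whose column names join the keys and list positions with underscores. A standalone sketch of the same idea on an invented record follows; the values are made up, and only the key names `impact` and `ais` are taken from the column list used in that cell.

```python
def flatten_json(record):
    """Flatten nested dicts/lists into one dict with underscore-joined keys."""
    out = {}

    def flatten(value, prefix=""):
        if isinstance(value, dict):
            for key, item in value.items():
                flatten(item, prefix + key + "_")
        elif isinstance(value, list):
            for position, item in enumerate(value):
                flatten(item, prefix + str(position) + "_")
        else:
            out[prefix[:-1]] = value  # drop the trailing underscore

    flatten(record)
    return out


# Invented record; real entries in biblio.json may contain more fields.
record = {"impact": [{"impact": 2.5}, {"impact": 3.1}], "ais": [{"ais": 0.8}]}
flat = flatten_json(record)
print(flat)
```

Each flattened key, such as `impact_0_impact`, becomes one column once the dictionaries are passed to `pd.DataFrame`.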

In [None]:
if DATASET_VERSION == 1:
    
    def flatten_json(y):
        out = {}

        def flatten(x, name=''):
            if type(x) is dict:
                for a in x:
                    flatten(x[a], name + a + '_')
            elif type(x) is list:
                i = 0
                for a in x:
                    flatten(a, name + str(i) + '_')
                    i += 1
            else:
                out[name[:-1]] = x

        flatten(y)
        return out

    dic_flattened = (flatten_json(item[1]) for item in data.items())
    impacts_ais_jcats_df = pd.DataFrame(dic_flattened)
    # use pubmedids as index
    impacts_ais_jcats_df.index = list(data.keys())
    # use only the first WoS and FORD category
    impacts_ais_jcats_df=impacts_ais_jcats_df[[ 'impact_0_impact', 'impact_1_impact', 'ais_0_ais','ais_1_ais', 'WoSkategorie_0_obor', 
                                           'WoSkategorie_0_aisQ', 'WoSkategorie_0_impactQ', 'FORD_0_ford', 'FORD_0_aisQ', 'FORD_0_impactQ']]
    impacts_ais_jcats_df2=impacts_ais_jcats_df.reset_index().rename(columns={'index': 'pubmed_id'})

In [None]:
if DATASET_VERSION == 1:
    bibliometric_features=documents.reset_index()[["doi","Target","author_count","journal","pubmed_id","license"]]
    bibliometric_features['pubmed_id']=bibliometric_features['pubmed_id'].astype(int)
    bibliometric_features['pubmed_id']=bibliometric_features['pubmed_id'].astype(str)
    bibliometric_features=bibliometric_features.merge(impacts_ais_jcats_df2,on='pubmed_id', how='left').dropna().set_index("doi")
    del bibliometric_features['pubmed_id']
    
    #target
    target_bib_features = bibliometric_features.loc[:, bibliometric_features.columns == 'Target']
    bibliometric_features = bibliometric_features.loc[:, bibliometric_features.columns != 'Target']
    
    # Reduce documents
    new_ids =  bibliometric_features.reset_index()[["doi"]].set_index("doi")
    documents = new_ids.merge(documents, how='inner', left_index=True, right_index=True)

# Create several versions of matrices

This operation can be performed on the complete dataset without risk of leakage, because the vectorization performed is binary.
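
To make the vectorizer settings concrete, here is a small sketch on an invented three-document corpus. With `binary=True`, `CountVectorizer` records only whether a term occurs in a document, and `min_df` drops terms found in fewer than the given number of documents. The corpus and the value `min_df=2` are illustrative only; the notebook itself uses `MIN_DF = 32`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus.
corpus = [
    "virus infection virus spread",
    "virus vaccine trial",
    "vaccine trial results",
]

vectorizer = CountVectorizer(binary=True, min_df=2)
matrix = vectorizer.fit_transform(corpus)

# get_feature_names_out() exists in scikit-learn >= 1.0;
# older releases provide get_feature_names() instead.
if hasattr(vectorizer, "get_feature_names_out"):
    terms = list(vectorizer.get_feature_names_out())
else:
    terms = list(vectorizer.get_feature_names())

print(terms)
print(matrix.toarray())
```

The repeated word "virus" in the first document still produces a 1, not a 2. Term selection through `min_df` depends on the documents passed to `fit_transform`.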

## TF-IDF for Random forest

In [None]:
vect_tfidf = TfidfVectorizer(analyzer = "word", 
                       tokenizer = None, 
                       ngram_range=(1,3), 
                       lowercase = True,
                       strip_accents = "ascii", 
                       binary= True,
                       stop_words='english',
                       min_df=MIN_DF) 

matrix_tfidf= vect_tfidf.fit_transform(documents['abstract'])
tokens_tfidf = vect_tfidf.get_feature_names()
matrix_tfidf_pd = pd.DataFrame.sparse.from_spmatrix(matrix_tfidf, columns = tokens_tfidf,index=documents.reset_index().doi)
matrix_tfidf_pd.shape

### BOW matrix

In [None]:
cvec = CountVectorizer(analyzer = "word", 
                       tokenizer = None, 
                       ngram_range=(1,3), 
                       lowercase = True,
                       strip_accents = "ascii", 
                       binary= True,
                       stop_words='english',
                       min_df=MIN_DF)    

matrix_bow = cvec.fit_transform(documents['abstract'])
tokens_bow = cvec.get_feature_names()
matrix_bow_pd = pd.DataFrame.sparse.from_spmatrix(matrix_bow, columns = tokens_bow,index=documents.reset_index().doi)
matrix_bow_pd.shape

### Binary document matrix with scispacy entities

This operation can be performed on the complete dataset without risk of leakage.
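
The cell below fills a document-by-entity matrix with 1 wherever an entity was recognised in an abstract. The sketch here shows the same construction without the scispaCy models, using an invented mapping from DOI to already-extracted entity strings; `doc_entities` and the DOIs are hypothetical.

```python
import pandas as pd

# Hypothetical, already-extracted entities per document (DOIs are invented).
doc_entities = {
    "10.1000/a": {"coronavirus", "lung"},
    "10.1000/b": {"lung", "ace2"},
    "10.1000/c": set(),
}

# One row per document, one column per entity, 1 = entity present.
columns = sorted(set().union(*doc_entities.values()))
matrix = pd.DataFrame(0, index=list(doc_entities), columns=columns)
for doi, entities in doc_entities.items():
    for entity in entities:
        matrix.at[doi, entity] = 1
print(matrix)
```

Unlike the notebook cell, which leaves absent entities as missing values, this sketch fills them with 0.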

In [None]:
if CSV_matrix_scispacy == False:
 
    #Scispacy imports
    #pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_core_sci_lg-0.3.0.tar.gz
    #pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_bionlp13cg_md-0.3.0.tar.gz
    #pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_jnlpba_md-0.3.0.tar.gz
    #pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.3.0/en_ner_craft_md-0.3.0.tar.gz

    nlp = en_core_sci_lg.load()
    bionlp = en_ner_bionlp13cg_md.load()
    jnlpba = en_ner_jnlpba_md.load()
    craft = en_ner_craft_md.load()

    matrix_scispacy_pd = pd.DataFrame(index = documents.index, columns=[])
    for index, document in documents.iterrows():
        print("processing " + str(document.name))
        abstract_lower = document['abstract']
        ent_nlp = nlp(abstract_lower)
        ent_bionlp = bionlp(abstract_lower)
        ent_craft = craft(abstract_lower)
        ent_jnlpba = jnlpba(abstract_lower)
        all_ents=set(ent_nlp.ents + ent_bionlp.ents + ent_craft.ents + ent_jnlpba.ents)
        #new - lemmatization
        all_ents = map(lambda x: x.lemma_, all_ents)
    
        all_ents = filter(lambda x: not(is_not_ascii(str(x))), all_ents)
        for ent in all_ents:
            matrix_scispacy_pd.at[document.name,str(ent)]=1

    matrix_scispacy_pd.shape  
    matrix_scispacy_pd.to_csv('cache/Entity_matrix/matrix_scispacy_2940_new.csv')

# Add ConceptNet entities
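
The expansion cell below stores each ConceptNet response in `cache/ConceptNet/` under the MD5 hash of the queried term, so repeated runs do not call the API again. A minimal sketch of that caching pattern follows. The names `cached_lookup` and `fetch` are hypothetical, and the stand-in `fake_fetch` replaces the real HTTP request so that the example runs offline.

```python
import hashlib
import json
import os
import tempfile


def cached_lookup(term, fetch, cache_dir):
    """Return fetch(term), reading from and writing to a JSON file cache."""
    name = hashlib.md5(term.encode("utf-8")).hexdigest() + ".json"
    path = os.path.join(cache_dir, name)
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as cache_file:
            return json.load(cache_file)
    result = fetch(term)
    with open(path, "w", encoding="utf-8") as cache_file:
        json.dump(result, cache_file)
    return result


calls = []


def fake_fetch(term):
    """Stand-in for the API call; records how often it is invoked."""
    calls.append(term)
    return {"edges": [{"start": {"label": term}, "end": {"label": "disease"}}]}


cache_dir = tempfile.mkdtemp()
first = cached_lookup("virus", fake_fetch, cache_dir)
second = cached_lookup("virus", fake_fetch, cache_dir)  # served from disk
print(len(calls))
```

Passing the fetch function as an argument keeps the cache logic testable without network access.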

In [None]:
if CSV_matrix_scispacy_conceptnet == False:  

    def conceptNetExpansion(entitiesToExpand, debug=False):
        entityDict = {}
        i=0
        for entity in entitiesToExpand:
            expandedEntities = set()
            if debug:
                print(str(i) + " processing " + entity)
            else:
                # progress marker only; "filename" is not assigned until the inner loop
                print("e", end='')
            i=i+1
            if i==5:
                print("Max processed entities for TEST run reached")
                break
            str_entity = entity.lower()
            if len(str_entity.split(" "))>1:
                entityList = extract_ngrams_entity_list(str_entity)  + [str_entity]
            else:
                entityList = [str_entity]    
                
            for internal_entity in entityList:
                print("i", end='')
                print(" processing internal entity: " + internal_entity)
                if type(internal_entity) == int:
                    continue
                hash_object = hashlib.md5(internal_entity.encode('utf-8'))
                filename = hash_object.hexdigest()
                try: 
                    with open('cache/ConceptNet/'+filename+".json", 'r') as cachefile:
                        obj = json.load(cachefile)
                        if debug:
                            print("Read from cache " + filename)
                        else:
                            print("c" + filename, end='')
                except IOError:
                    url = 'http://api.conceptnet.io/c/en/'+internal_entity
                    obj = requests.get(url).json()
                    with open('cache/ConceptNet/'+filename+".json", 'w') as outfile:
                        json.dump(obj, outfile)
                        if debug:
                            print("Saved to disk as " + filename )
                        else:
                            print("s" + filename, end='')
                

                for edge in obj['edges']:
                    entityLinkedAsStart = str(edge['start']['label']) 
                    entityLinkedAsEnd = str(edge['end']['label'])# str(obj['edges'][i]['end']['label'])
                    if isEnglish(entityLinkedAsStart):
                        expandedEntities.add(entityLinkedAsStart)
                    if isEnglish(entityLinkedAsEnd):
                        expandedEntities.add(entityLinkedAsEnd)
            entityDict[entity] = expandedEntities
        return entityDict

    # This no longer works reliably because the ConceptNet API takes too long to respond. Possible workarounds: download ConceptNet and query it locally, or let the expansion run overnight and cache the results.
    def generateConceptNetExpansionMatrix(matrix, conceptNetExpansionDict):
        matrix_expanded = pd.DataFrame(index = matrix.index, columns=[])
        for inputEntity in conceptNetExpansionDict:
            for conceptNetEntity in conceptNetExpansionDict[inputEntity]:
                matrix_expanded.loc[matrix[inputEntity]==1,conceptNetEntity]=1
        return(matrix_expanded)

    scispacyToConceptNetExpansionDict=conceptNetExpansion(matrix_scispacy_pd.columns)
    matrix_scispacy_conceptnet = generateConceptNetExpansionMatrix(matrix_scispacy_pd,scispacyToConceptNetExpansionDict)
    matrix_scispacy_conceptnet.shape

    matrix_scispacy_conceptnet.to_csv('cache/Entity_matrix/matrix_scispacy_2940_conceptnet.csv')
    scispacyToConceptNetExpansionDict

### PubTator

Note that "full data" was renamed to "full_data".
PubTator processing uses all ASCII-only text nodes shorter than 40 characters.
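
The entity filter applied in the cell that follows can be illustrated in isolation. This is a minimal sketch, not the notebook's own code: `is_plain_ascii` and `keep_entity` are hypothetical helpers (the notebook relies on its own `is_not_ascii`, defined elsewhere), the candidate strings are made up, and the limit of 40 mirrors the `len(extractedEntity) < 40` check in that cell.

```python
def is_plain_ascii(text: str) -> bool:
    # Hypothetical stand-in for the negation of the notebook's is_not_ascii helper.
    return all(ord(ch) < 128 for ch in text)


def keep_entity(text: str, max_len: int = 40) -> bool:
    # Keep short, ASCII-only text nodes; longer strings are rarely single entities.
    return len(text) < max_len and is_plain_ascii(text)


# Made-up candidates: one short entity, one overly long phrase, one non-ASCII string.
candidates = [
    "SARS-CoV-2",
    "angiotensin-converting enzyme 2 receptor binding domain",
    "na\u00efve T cells",
]
kept = [c for c in candidates if keep_entity(c)]
print(kept)
```

Only entities that pass both checks become columns of the PubTator matrix.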

In [None]:
if CSV_matrix_PubTator_conceptnet == False and CSV_matrix_PubTator == False:
     
    FHIR_DATA_PATH = 'inputs/PUBMED/'
        
    matrix_PubTator = pd.DataFrame(index = documents.index, columns=[])
    entities_doc=set()

    for index, document in documents.iterrows():
        
        path = FHIR_DATA_PATH+ str(int(document.pubmed_id)) + ".ttl"
        if os.path.isfile(path):
            print("processing " + path)
            g = Graph()
            # Use a context manager so the file handle is closed after parsing
            with open(path, 'r', encoding="utf8") as pubmed_record:
                g.parse(pubmed_record, format='turtle')
            qres = g.query(
        """SELECT DISTINCT ?text
           WHERE {
              ?a pm:text ?text .
           }""")
            for row in qres:
                extractedEntity = str(row.asdict()['text'].toPython())
                if len(extractedEntity) < 40 and not(is_not_ascii(str(extractedEntity))):                 
                    matrix_PubTator.at[document.name,extractedEntity]=1


    PubTatorToConceptNetExpansionDict=conceptNetExpansion(matrix_PubTator.columns)
    matrix_PubTator_conceptnet = generateConceptNetExpansionMatrix(matrix_PubTator,PubTatorToConceptNetExpansionDict)
    matrix_PubTator_conceptnet
    
    matrix_PubTator_conceptnet.to_csv('cache/Entity_matrix/matrix_fhir_2940_conceptnet_new_zkouska.csv') 
    matrix_PubTator.to_csv('cache/Entity_matrix/matrix_fhir_2940_zkouska.csv')

# Plot citations

In [None]:
if DATASET_VERSION == 1:
    docSubsetWithMatchinbBiData = documents[documents.index.isin(target_bib_features.index)]
if DATASET_VERSION == 2:
    docSubsetWithMatchinbBiData = documents

In [None]:
fig, ax = plt.subplots()
ax.hist(docSubsetWithMatchinbBiData[docSubsetWithMatchinbBiData.Target=="low"]["year"], num_bins, alpha=0.5)
ax.hist(docSubsetWithMatchinbBiData[docSubsetWithMatchinbBiData.Target=="high"]["year"], num_bins, alpha=0.5, color="red")

plt.legend(['Low citations', 'High citations']) 
ax.set_xlabel('Year of publication')
ax.set_ylabel('Number of articles')
# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.savefig("results/HistAfterAgeNorm.png", dpi=100)
plt.show()

In [None]:
# Warning: this plot uses the data before the documents without bibliometric data were removed
docSubsetWithMatchinbBiData["Target_OpenCitations"]=pd.qcut(docSubsetWithMatchinbBiData["OpenCitations"],
                            q=DISCRETIZATION_NUMBER_OF_CATEGORIES, 
                            precision=DISCRETIZATION_PRECISION, 
                            labels=DISCRETIZATION_LABELS)
fig, ax = plt.subplots()
ax.hist(docSubsetWithMatchinbBiData[docSubsetWithMatchinbBiData.Target_OpenCitations=="low"]["year"], num_bins, alpha=0.5)
ax.hist(docSubsetWithMatchinbBiData[docSubsetWithMatchinbBiData.Target_OpenCitations=="high"]["year"], num_bins, alpha=0.5, color="red")
plt.legend(['Low citations', 'High citations']) 
ax.set_xlabel('Year of publication')
ax.set_ylabel('Number of articles')
# Tweak spacing to prevent clipping of ylabel
fig.tight_layout()
plt.savefig("results/HistBeforeAgeNorm.png")
plt.show()
docSubsetWithMatchinbBiData=docSubsetWithMatchinbBiData.drop(columns=['Target_OpenCitations'])

#### Author names matrix

- Author names are best treated as text, because we want to capture the impact of individual authors as well as of combinations of authors.
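
Treating the author string as text means that single name tokens and sequences of adjacent name tokens both become binary features. The sketch below shows the idea in plain Python. It is a simplification under stated assumptions: `author_ngrams` is a hypothetical helper, the author strings are invented, and the tokenization only approximates scikit-learn's default (no stop words, accent stripping, or `min_df` filtering).

```python
import re


def author_ngrams(authors: str, max_n: int = 3) -> set:
    # Lowercase and keep alphabetic tokens of at least two characters.
    tokens = re.findall(r"[a-z]{2,}", authors.lower())
    # Collect every run of 1..max_n adjacent tokens as one feature.
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


# Invented author strings for two documents.
docs = {
    "doc1": "Smith, John; Lee, Anna",
    "doc2": "Lee, Anna; Novak, Petr",
}
features = {doc_id: author_ngrams(a, max_n=2) for doc_id, a in docs.items()}

# Features shared by both documents: the common author as tokens and as a bigram.
shared = features["doc1"] & features["doc2"]
print(sorted(shared))
```

Unigrams capture individual names, while longer n-grams capture authors who appear next to each other in the author list.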


In [None]:
if DATASET_VERSION == 1:
    cvec_authors = CountVectorizer(analyzer = "word", tokenizer = None, ngram_range=(1,50), min_df=MIN_DF, lowercase = True,strip_accents = "ascii", binary= True,stop_words='english')
    matrix_authors = cvec_authors.fit_transform(documents['authors'])
    tokens = cvec_authors.get_feature_names()
    matrix_authors_pd=pd.DataFrame(data=matrix_authors.toarray(), index=documents.index,columns=tokens)

    vect_authors_tfidf = TfidfVectorizer( tokenizer = None, ngram_range=(1,50), min_df=MIN_DF, lowercase = True,strip_accents = "ascii",stop_words='english')
    matrix_authors_tfidf= vect_authors_tfidf.fit_transform(documents['authors'])
    tokens_authors_tfidf = vect_authors_tfidf.get_feature_names()
    matrix_authors_tfidf_pd=pd.DataFrame(data=matrix_authors_tfidf.toarray(), index=documents.index,columns=tokens_authors_tfidf)

    matrix_authors_pd.shape,matrix_authors_tfidf_pd.shape

### 1) Bibliometric features matrix for Random forest

In [None]:
if DATASET_VERSION == 1:
    bibliomet_matrix_randomforest = bibliometric_features.copy()
    cols = ["impact_0_impact","impact_1_impact","ais_0_ais","ais_1_ais","FORD_0_ford"] 
    for col in cols:
        bibliomet_matrix_randomforest[col] = bibliomet_matrix_randomforest[col].astype(float)
    
    lb_make = LabelEncoder()
    cols = ["journal","license","WoSkategorie_0_obor","WoSkategorie_0_aisQ","WoSkategorie_0_impactQ","FORD_0_ford","FORD_0_aisQ","FORD_0_impactQ"]
    for col in cols:
        bibliomet_matrix_randomforest[col] = lb_make.fit_transform(bibliomet_matrix_randomforest[col])

    bibliomet_matrix_randomforest =bibliomet_matrix_randomforest.merge(matrix_authors_pd, how='inner', left_index=True, right_index=True)
    bibliomet_matrix_tfidf =bibliomet_matrix_randomforest.merge(matrix_authors_tfidf_pd, how='inner', left_index=True, right_index=True)
    bibliomet_matrix_randomforest.shape, bibliomet_matrix_tfidf.shape, 

### 2) Bibliometric features matrix for rule mining

In [None]:
if DATASET_VERSION == 1:
    bibliomet_matrix_rulemining = bibliometric_features.copy()

    del bibliomet_matrix_rulemining["impact_1_impact"]
    del bibliomet_matrix_rulemining["impact_0_impact"]
    del bibliomet_matrix_rulemining["ais_0_ais"]
    del bibliomet_matrix_rulemining["ais_1_ais"]
    del bibliomet_matrix_rulemining["author_count"]

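    # Without regex=True / .str.replace, replace() only matches whole cell values, not substrings
    bibliomet_matrix_rulemining = bibliomet_matrix_rulemining.replace("\xe9", "_", regex=True)
    bibliomet_matrix_rulemining["journal"] = bibliomet_matrix_rulemining["journal"].str.replace(" ", "_", regex=False)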
    bibliomet_matrix_rulemining = pd.get_dummies(bibliomet_matrix_rulemining.astype(str))
    bibliomet_matrix_rulemining =bibliomet_matrix_rulemining.merge(matrix_authors_pd, how='inner', left_index=True, right_index=True)
    bibliomet_matrix_rulemining = bibliomet_matrix_rulemining.astype(np.int64)

    u = bibliomet_matrix_rulemining.select_dtypes(object)
    bibliomet_matrix_rulemining[u.columns] = u.apply(lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))
    # Raw strings and explicit regex=True keep the pattern-based replacements working across pandas versions
    bibliomet_matrix_rulemining.columns = ["".join(l) for l in bibliomet_matrix_rulemining.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]
    bibliomet_matrix_rulemining.columns = bibliomet_matrix_rulemining.columns.str.replace(r'\s+', '_', regex=True).str.replace(r'\W+', '', regex=True).str.replace('\xed', '_').str.replace('\xf1', '').str.replace('\xfc', '_').str.replace('\xe9', '_').str.replace( '\xe7', '')

    bibliomet_matrix_rulemining.shape

## Connected matrices

Reduce the number of rows so that every matrix covers the same set of documents as the bibliometric features.
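
The row reduction amounts to an inner join on the document index. The following dependency-free sketch shows the idea with plain dictionaries standing in for DataFrames; `restrict_to_ids` and the toy data are hypothetical. In the notebook the same effect is obtained with `merge(new_ids, how="inner", left_index=True, right_index=True)`.

```python
def restrict_to_ids(matrices: dict, keep_ids: set) -> dict:
    """Keep, in every matrix, only the rows whose document id is in keep_ids."""
    return {
        name: {doc_id: row for doc_id, row in rows.items() if doc_id in keep_ids}
        for name, rows in matrices.items()
    }


# Toy data: each "matrix" maps a document id to its feature row.
matrices = {
    "bow": {"d1": [1, 0], "d2": [0, 1], "d3": [1, 1]},
    "authors": {"d1": [1], "d2": [0], "d3": [1]},
}
# Ids of documents for which bibliometric features exist (invented).
ids_with_bibliometrics = {"d1", "d3"}

aligned = restrict_to_ids(matrices, ids_with_bibliometrics)
row_ids = {name: sorted(rows) for name, rows in aligned.items()}
print(row_ids)
```

After this step every matrix has the same rows, so matrices can be joined column-wise without introducing missing values.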

In [None]:
if DATASET_VERSION == 1:
    matrix_scispacy_pd = matrix_scispacy_pd.merge(new_ids, how="inner",left_index=True, right_index=True)
    matrix_PubTator_conceptnet_pd = matrix_PubTator_conceptnet_pd.merge(new_ids, how="inner",left_index=True, right_index=True)
    matrix_scispacy_conceptnet_pd= matrix_scispacy_conceptnet_pd.merge(new_ids, how="inner",left_index=True, right_index=True)
    matrix_PubTator_pd=matrix_PubTator_pd.merge(new_ids, how="inner",left_index=True, right_index=True)
    matrix_authors_pd=matrix_authors_pd.merge(new_ids, how="inner",left_index=True, right_index=True)
    matrix_bow_pd=matrix_bow_pd.merge(new_ids, how="inner",left_index=True, right_index=True)
    matrix_tfidf_pd=matrix_tfidf_pd.merge(new_ids, how="inner",left_index=True, right_index=True)

In [None]:
if DATASET_VERSION == 1:
    bow_plus_bibliometric_features = bibliomet_matrix_randomforest.merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
    tfidf_plus_bibliometric_features = bibliomet_matrix_tfidf.merge(matrix_tfidf_pd, how='inner', left_index=True, right_index=True)
    bow_plus_scispacy_pd = matrix_scispacy_pd.merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
    bow_plus_PubTator_conceptnet_pd = matrix_PubTator_conceptnet_pd.merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
    bow_plus_PubTator_pd = matrix_PubTator_pd.merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
    bow_plus_scispacy_conceptnet_pd = matrix_scispacy_conceptnet_pd.merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
    bow_plus_PubtConc_plus_BibFeat_pd = bow_plus_PubTator_conceptnet_pd.merge(bibliomet_matrix_randomforest, how='inner', left_index=True, right_index=True)
    bow_plus_bibliometric_features_rule = bibliomet_matrix_rulemining.merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
    bow_plus_bibliometric_features_rule = bow_plus_bibliometric_features_rule.loc[:, bow_plus_bibliometric_features_rule.columns != 'Target']
    bow_plus_PubtConc_plus_BibFeat_pd_rule = bibliomet_matrix_rulemining.merge(bow_plus_PubTator_conceptnet_pd , how='inner', left_index=True, right_index=True)

# Optimal SIZE of matrices

In [None]:
if RUN_SEARCHING_OPTIMAL_MATRIX_SIZE:
    dfs = []
    names = []
    results = []

    min_df_matrix = [ 
             ('min_df_1', 1),('min_df_4', 4),('min_df_8',8),('min_df_12',12),('min_df_16',16),('min_df_20',20),('min_df_24',24), ('min_df_28',28),('min_df_32',32),('min_df_36',36),     ('min_df_40',40),
        ('min_df_44',44),('min_df_48',48), ('min_df_52',52),('min_df_90',90),('min_df_120',120),('min_df_150',150),('min_df_200',200),('min_df_250',250) ]
      
    for name, min_df_value in min_df_matrix:    
        # min_df must use the loop variable min_df_value (min_value was undefined)
        cvec = CountVectorizer(analyzer = "word", tokenizer = None, ngram_range=(1,3), lowercase = True,strip_accents = "ascii", binary= True,stop_words='english',min_df=min_df_value)    

        matrix_bow = cvec.fit_transform(documents['abstract'])
        tokens = cvec.get_feature_names()
        matrix_bow_pd=pd.DataFrame(data=matrix_bow.toarray(), index=documents.index,columns=tokens)
        rf = RandomForestClassifier()
        X_train, X_test, y_train, y_test = train_test_split(matrix_bow_pd, documents.Target, test_size=0.3, random_state=1)
        kfold = model_selection.KFold(n_splits=3, shuffle=True, random_state=90210)
        cv_results = model_selection.cross_validate(rf, X_train, y_train, cv=kfold, scoring="accuracy")
        rf.fit(X_train, y_train)        
        results.append(cv_results)
        names.append(name)
        this_df = pd.DataFrame(cv_results)
        this_df['model'] = name
        this_df['number_of_col_in_mtrx'] = matrix_bow_pd.shape[1]
        dfs.append(this_df)

    final = pd.concat(dfs, ignore_index=True)
    final.to_csv("results/TUNING_MIN_DF_new.csv") 

# BERT 

### BERT TOKENIZATION - FOR RANDOM FOREST

In [None]:
if DATASET_VERSION == 1:
    model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')
    tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
    model = model_class.from_pretrained(pretrained_weights)

### BERT tokenizer

In [None]:
if DATASET_VERSION == 1:
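    # documents_bert_1 is an alias of documents (not a copy), so the truncation below also changes documents
    documents_bert_1  = documents
    # Keep the first 512 characters (not tokens) of each abstract, which in practice keeps the encoded sequence under BERT's 512-token limit
    documents_bert_1['abstract'] = documents_bert_1['abstract'].str[:512]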
    tokenized = documents_bert_1['abstract'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)

    padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
    attention_mask = np.where(padded != 0, 1, 0)
    padded = torch.tensor(padded).to(torch.int64)

### BERT model (BERT embeddings)

In [None]:
if DATASET_VERSION == 1:
    input_ids = torch.tensor(padded)  
    attention_mask = torch.tensor(attention_mask)

    with torch.no_grad():
        last_hidden_states = model(input_ids, attention_mask=attention_mask)
    
    features = last_hidden_states[0][:,0,:].numpy()
    labels = documents['Target']

In [None]:
if DATASET_VERSION == 1:
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.3,random_state=1)

### Random forest with BERT - classifier

In [None]:
if RUN_GRID_FOR_BERT:
    from sklearn.model_selection import GridSearchCV
    param_grid = { 'bootstrap': [True],'max_depth': [10,150, 500,1000],'max_features': [30,500,3000], 'min_samples_leaf': [1,10,100], 'min_samples_split': [2,10,100],  'n_estimators': [10, 100]} 
    rf = RandomForestClassifier()  
    grid_search = GridSearchCV(estimator = rf, param_grid = param_grid,  cv = 3, n_jobs = -1, verbose = 2)
    grid_search.fit(X_train, y_train)
    print(grid_search.best_params_)

In [None]:
if DATASET_VERSION == 1:
    rf = RandomForestClassifier(bootstrap= True, max_depth= 500, max_features= 30, min_samples_leaf= 10, min_samples_split= 2, n_estimators= 100)
    rf.fit(X_train,y_train)

In [None]:
if DATASET_VERSION == 1:
    y_pred = rf.predict(X_test)

In [None]:
if DATASET_VERSION == 1:
    clsf_report = pd.DataFrame(classification_report(y_true = y_test, y_pred = y_pred, output_dict=True)).transpose()
    if DATASET_VERSION ==1: 
        clsf_report.to_csv('results/randomforest/results_bert_randomforest_datasetv1.csv', index= True)
    if DATASET_VERSION ==2: 
        clsf_report.to_csv('results/randomforest/results_bert_randomforest_datasetv2.csv', index= True)
    clsf_report

### Random forest with BERT - regression

In [None]:
if DATASET_VERSION == 1:
    labels = documents['NormCitations']
    X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size = 0.3,random_state=1)

In [None]:
if DATASET_VERSION == 1:
    rf = RandomForestRegressor() 
    rf.fit(X_train, y_train)  

In [None]:
if DATASET_VERSION == 1:
    y_pred = rf.predict(X_test)
    MAE = mean_absolute_error(y_test, y_pred)
    MSE = mean_squared_error(y_test, y_pred)
    y_test_bin = [0 if i <=2.0 else 1 for i in y_test]
    y_pred_bin = [0 if i <=2.0 else 1 for i in y_pred]
    clsf_report = pd.DataFrame(classification_report(y_true = y_test_bin, y_pred = y_pred_bin, output_dict=True)).transpose()
    clsf_report['MAE'] = MAE
    clsf_report['MSE'] = MSE
    print(clsf_report)
    if DATASET_VERSION ==1: 
        clsf_report.to_csv('results/randomforest/results_bert_randomforest_datasetv1_regression.csv', index= True)
    if DATASET_VERSION ==2: 
        clsf_report.to_csv('results/randomforest/results_bert_randomforest_datasetv2_regression.csv', index= True)
    clsf_report

## BERT tokenizer for neural network

- Previous version of the model: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1 (the version now available is https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2)

In [None]:
# if you get: module 'bert' has no attribute 'bert_tokenization'
# pip install bert-tensorflow 
# pip install bert-for-tf2
BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2",trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
#watch out we override tokenizer that has been created before with a BERT tokenizer
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In [None]:
def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

In [None]:
documents_bert_1  = documents
documents_bert_1['abstract'] = documents_bert_1['abstract'].str[:512]

In [None]:
bert_texts = []
sentences = list(documents_bert_1['abstract'])
for sen in sentences:
    bert_texts.append(preprocess_text(sen))

In [None]:
def tokenize_texts(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))
tokenized_texts = [tokenize_texts(text) for text in bert_texts]

## Neural Network modelling

Preprocess data:

In [None]:
def run_preprocessing(classification = True):
    
    if classification == True:
        y = documents_bert_1['Target']
        y = np.array(list(map(lambda x: 1 if x=="high" else 0, y)))

    if classification ==False:
        y = documents_bert_1['NormCitations']
        y = np.array(list(y))
        
    add_id=documents_bert_1.reset_index()['doi']
    add_id = np.array(list(add_id))

    reviews_with_len = [[review, y[i], len(review),  add_id[i]]
                     for i, review in enumerate(tokenized_texts)]
    for_max_size = [len(review)
                     for i, review in enumerate(tokenized_texts)]

    max_size=max(for_max_size)

    reviews_with_len = [[[review[cnt] if cnt < len(review) else 0 for cnt in range(max_size)], y[i], len(review), add_id[i]]
    for i, review in enumerate(tokenized_texts)]

    random.shuffle(reviews_with_len)

    sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]
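    # Labels are integer classes for classification but real-valued NormCitations for regression
    label_dtype = tf.int32 if classification else tf.float32
    processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, label_dtype))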

    BATCH_SIZE = 32
    batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

    next(iter(batched_dataset))
    
    TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
    TEST_BATCHES = TOTAL_BATCHES // 30
    test_data = batched_dataset.take(TEST_BATCHES)
    train_data = batched_dataset.skip(TEST_BATCHES)

    return test_data, train_data

In [None]:
class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 regression,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters, kernel_size=2,padding="valid", activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters, kernel_size=3, padding="valid",activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,kernel_size=4,padding="valid",activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        
        if not regression:
            self.last_dense = layers.Dense(units=model_output_classes,activation="softmax")
        if regression:
            self.last_dense = layers.Dense(units=model_output_classes,activation="relu") 
    
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

In [None]:
def run_nn(regression = True):
    param1 = {'EMB_DIM' : 200, 
          'CNN_FILTERS' : 100}
    param2 = {'EMB_DIM' : 1400, 
          'CNN_FILTERS' : 130}
    param3 = {'EMB_DIM' : 500, 
          'CNN_FILTERS' : 200}
    param4 = {'EMB_DIM' : 1300, 
          'CNN_FILTERS' : 50}

    if regression:
        OUTPUT_CLASSES = 1
    if not regression:
        OUTPUT_CLASSES = 2

    params_mat = [  
              ('param1', param1),
              ('param2',param2),
              ('param3',param3),
              ('param4',param4)
             ]    

    for names,params in params_mat:
        text_model = TEXT_MODEL(regression,vocabulary_size=len(tokenizer.vocab),
                        embedding_dimensions=params.get('EMB_DIM'),
                        cnn_filters=params.get('CNN_FILTERS'),
                        dnn_units=256,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=0.2)

        if not regression:
            text_model.compile(loss="sparse_categorical_crossentropy",optimizer="adam",metrics=["sparse_categorical_accuracy"])
            test_data, train_data = run_preprocessing(classification = True)
        if regression:
            text_model.compile(loss="mean_squared_error",optimizer="adam",metrics=["mean_squared_error"])
            test_data, train_data = run_preprocessing(classification = False)
            
        traning = text_model.fit(train_data, epochs=5)
        results = text_model.evaluate(test_data)
        print("results_test_data")
        print(results)
    
        y_test_bert = np.concatenate([y for x, y in test_data], axis=0)
        y_prob = text_model.predict(test_data) 
        predicted = np.argmax(y_prob, axis=1)

        if not regression:
            results_pd = pd.DataFrame(classification_report(y_test_bert, predicted, output_dict=True)).transpose()
            print(results_pd)
            if DATASET_VERSION ==1:
                results_pd.to_csv('results/neural_network/results_bow_bert_neural_network_classification_datasetv1_'+str(names)+'.csv', index= True)
            if DATASET_VERSION ==2:
                results_pd.to_csv('results/neural_network/results_bow_bert_neural_network_classification_datasetv2_'+str(names)+'.csv', index= True)
        
        if regression: 
            MAE = mean_absolute_error(y_test_bert, y_prob)
            MSE = mean_squared_error(y_test_bert, y_prob)
            list_metrics = [MAE, MSE]
            keys =["MAE", "MSE"]
            dictionary = dict(zip(keys, list_metrics))
            results_pd = pd.DataFrame.from_dict(dictionary,orient ='index').T 
            results_pd['model'] = "BERT"
            results_pd['number_of_col_in_mtrx'] = X_train.shape[1]
            if DATASET_VERSION ==2:
                y_test_bin = [0 if i <=2.0 else 1 for i in y_test_bert]
                y_pred_bin = [0 if i <=2.0 else 1 for i in y_prob] 
            if DATASET_VERSION ==1:
                y_test_bin = [0 if i <=2.0 else 1 for i in y_test_bert]
                y_pred_bin = [0 if i <=2.0 else 1 for i in y_prob] 
            clsf_report = pd.DataFrame(classification_report(y_true = y_test_bin, y_pred = y_pred_bin, output_dict=True)).transpose()
            accuracy = clsf_report.iat[2,0]
            results_pd['accuracy'] = accuracy
            print(results_pd)
            if DATASET_VERSION ==1:
                results_pd.to_csv('results/neural_network/results_bow_bert_neural_network_regresion_datasetv1'+str(names)+'.csv', index= True)
            if DATASET_VERSION ==2:
                results_pd.to_csv('results/neural_network/results_bow_bert_neural_network_regresion_datasetv2'+str(names)+'.csv', index= True)
          
    return print("done")

In [None]:
run_nn(regression=True)

In [None]:
run_nn(regression=False)

## Random forest

In [None]:
if DATASET_VERSION == 1:
    matrices = [  ('Scispacy', matrix_scispacy_pd),
              ('PubTator_Conceptnet',matrix_PubTator_conceptnet_pd),
              ('Scispacy_Conceptnet',matrix_scispacy_conceptnet_pd),
              ('PubTator',matrix_PubTator_pd),
              ('AuthorsNames',matrix_authors_pd),
              ('TF-IDF', matrix_tfidf_pd),
              ('Bow',matrix_bow_pd),
            # Connected matrices
              ('Bow_Scispacy',bow_plus_scispacy_pd), 
              ('Bow_Scispacy_Conceptnet',bow_plus_scispacy_conceptnet_pd), 
              ('Bow_PubTator', bow_plus_PubTator_pd),
              ('Bow_PubTator_Conceptnet', bow_plus_PubTator_conceptnet_pd),
            #Random forest matrices
              ('TF-IDF_BibliometricFeatures',tfidf_plus_bibliometric_features),
            # Rule mining matrices
              ('BibliometricFeatures',bibliomet_matrix_rulemining),
              ('Bow_BibliometricFeatures', bow_plus_bibliometric_features_rule),
              ('Bow_Pubtator_Conceptnet_BibliometricFeatures', bow_plus_PubtConc_plus_BibFeat_pd_rule)
                ]  
    
if DATASET_VERSION == 2:
    matrices = [
              ('TF-IDF', matrix_tfidf_pd),
              ('Bow',matrix_bow_pd)
            ]

### Grid

In [None]:
if RUN_GRID_RANDOM_FOREST:
    
    for name, matrix in matrices: 
    
        X_train, X_test, y_train, y_test = train_test_split(matrix, target_bib_features.Target, test_size=0.3, random_state=1)
        from sklearn.model_selection import GridSearchCV

        param_grid = {
        'bootstrap': [True],
        'max_depth': [10,150, 500,1000],
        'max_features': [30,500,3000],
        'min_samples_leaf': [1,10,100],
        'min_samples_split': [2,10,100],
        'n_estimators': [10, 100]
        }

        rf = RandomForestClassifier()

        grid_search_rf = GridSearchCV(estimator = rf, param_grid = param_grid,scoring='accuracy')
        grid_search_rf.fit(X_train, y_train)
        print(name)
        print(grid_search_rf.best_params_)

In [None]:
def run_random_forest(CLASSIFIER:bool,USE_FI_FOR_REDUCTION:bool, USE_GRID:bool, matrices_rf:list,matrix_scispacy_pd=None,
                      matrix_PubTator_conceptnet_pd=None,matrix_scispacy_conceptnet_pd=None,matrix_PubTator_pd=None,matrix_authors_pd=None,
                      matrix_tfidf_pd=None,matrix_bow_pd=None,bow_plus_scispacy_pd=None,bow_plus_scispacy_conceptnet_pd=None,
                      bow_plus_PubTator_pd=None,bow_plus_PubTator_conceptnet_pd=None,tfidf_plus_bibliometric_features=None,
                      bibliomet_matrix_randomforest=None,bow_plus_bibliometric_features=None,bow_plus_PubtConc_plus_BibFeat_pd=None,
                      bibliomet_matrix_rulemining=None,bow_plus_bibliometric_features_rule=None,bow_plus_PubtConc_plus_BibFeat_pd_rule=None):
    dfs = []
    results = []
    names = []
    target_names = ['low', 'high']  
    
    for name, matrix in matrices_rf:
        
        print(name)
        print(matrix.shape)
        
        if USE_GRID == False:
            rf=RandomForestClassifier()  
        if USE_GRID == True:  
            
# # #   Applying tuning parameters  
            if CLASSIFIER == True:
                if name=="Scispacy":
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 10, max_features= 30, min_samples_leaf= 10, min_samples_split= 100, n_estimators= 1000) 
                if name=='PubTator_Conceptnet':
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 150, max_features= 30, min_samples_leaf= 1, min_samples_split= 100, n_estimators= 10) 
                if name=='Scispacy_Conceptnet':
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 10, max_features= 30, min_samples_leaf= 10, min_samples_split= 100, n_estimators= 100)
                if name=="PubTator":
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 150, max_features= 30, min_samples_leaf= 1, min_samples_split=100, n_estimators= 100)
                if name=="AuthorsNames":
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 1000, max_features= 30, min_samples_leaf= 1, min_samples_split= 10, n_estimators= 100)
                if name=="Bow":
                    rf=RandomForestClassifier(bootstrap=True, max_depth= 150, max_features= 30, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_Scispacy':
                    rf=RandomForestClassifier(bootstrap= True, max_depth=150, max_features= 30, min_samples_leaf= 1, min_samples_split= 10, n_estimators= 100)
                if name=='Bow_Scispacy_Conceptnet':
                    rf=RandomForestClassifier(bootstrap= True, max_depth=500, max_features= 500, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_PubTator':
                    rf=RandomForestClassifier(bootstrap= True, max_depth=1000, max_features= 30, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_PubTator_Conceptnet':
                    rf=RandomForestClassifier(bootstrap=True, max_depth= 150, max_features= 30, min_samples_leaf=1, min_samples_split= 2, n_estimators= 100)
                if name=='TF-IDF':
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 1000, max_features= 30, min_samples_leaf= 1, min_samples_split= 100, n_estimators= 100)
                if name=='TF-IDF_BibliometricFeatures':
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 150, max_features= 30, min_samples_leaf=1, min_samples_split= 100, n_estimators= 100)
                if name=='BibliometricFeatures':
                    rf=RandomForestClassifier(bootstrap=True, max_depth= 10, max_features= 30, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_BibliometricFeatures':
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 500, max_features=30, min_samples_leaf=1, min_samples_split= 100, n_estimators= 100)
                if name=='Bow_Pubtator_Conceptnet_BibliometricFeatures':
                    rf=RandomForestClassifier(bootstrap= True, max_depth= 500, max_features=500, min_samples_leaf= 1, min_samples_split=2, n_estimators= 100)
        
            if CLASSIFIER == False:
                if name=="Scispacy":
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 10, max_features= 30, min_samples_leaf= 10, min_samples_split= 100, n_estimators= 1000) 
                if name=='PubTator_Conceptnet':
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 150, max_features= 30, min_samples_leaf= 1, min_samples_split= 100, n_estimators= 10) 
                if name=='Scispacy_Conceptnet':
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 10, max_features= 30, min_samples_leaf= 10, min_samples_split= 100, n_estimators= 100)
                if name=="PubTator":
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 150, max_features= 30, min_samples_leaf= 1, min_samples_split=100, n_estimators= 100)
                if name=="AuthorsNames":
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 1000, max_features= 30, min_samples_leaf= 1, min_samples_split= 10, n_estimators= 100)
                if name=="Bow":
                    rf=RandomForestRegressor(bootstrap=True, max_depth= 150, max_features= 30, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_Scispacy':
                    rf=RandomForestRegressor(bootstrap= True, max_depth=150, max_features= 30, min_samples_leaf= 1, min_samples_split= 10, n_estimators= 100)
                if name=='Bow_Scispacy_Conceptnet':
                    rf=RandomForestRegressor(bootstrap= True, max_depth=500, max_features= 500, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_PubTator':
                    rf=RandomForestRegressor(bootstrap= True, max_depth=1000, max_features= 30, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_PubTator_Conceptnet':
                    rf=RandomForestRegressor(bootstrap=True, max_depth= 150, max_features= 30, min_samples_leaf=1, min_samples_split= 2, n_estimators= 100)
                if name=='TF-IDF':
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 1000, max_features= 30, min_samples_leaf= 1, min_samples_split= 100, n_estimators= 100)
                if name=='TF-IDF_BibliometricFeatures':
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 150, max_features= 30, min_samples_leaf=1, min_samples_split= 100, n_estimators= 100)
                if name=='BibliometricFeatures':
                    rf=RandomForestRegressor(bootstrap=True, max_depth= 10, max_features= 30, min_samples_leaf= 1, min_samples_split= 2, n_estimators= 100)
                if name=='Bow_BibliometricFeatures':
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 500, max_features=30, min_samples_leaf=1, min_samples_split= 100, n_estimators= 100)
                if name=='Bow_Pubtator_Conceptnet_BibliometricFeatures':
                    rf=RandomForestRegressor(bootstrap= True, max_depth= 500, max_features=500, min_samples_leaf= 1, min_samples_split=2, n_estimators= 100)

            
        if not USE_FI_FOR_REDUCTION:
            # This model is fitted only to obtain feature importances (FI); it is not evaluated and no results are saved.
            if CLASSIFIER == False:
                X_train, X_test, y_train, y_test = train_test_split(matrix, documents.NormCitations, test_size=0.3, random_state=1) 
            
            if CLASSIFIER == True:
                # The target column is the same for both dataset versions
                X_train, X_test, y_train, y_test = train_test_split(matrix, documents.Target, test_size=0.3, random_state=1)
          
                
            rf.fit(X_train, y_train)  
            importance = pd.Series(rf.feature_importances_, index=X_train.columns).nlargest(10000)
            importance.to_csv("cache/FeatureImportance/importance_RF_" +name +'.csv')  
            final = "Done"
        
        if USE_FI_FOR_REDUCTION:   
            importance = pd.read_csv("cache/FeatureImportance/importance_RF_"+name+".csv", index_col=False, header=0).sort_values(by=["0"],ascending=False)
            list_of_features = importance['Unnamed: 0'].to_list()
            selected_features = itertools.islice(list_of_features, MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT) # grab the first x elements
            
            if name == "Scispacy":
                matrix = matrix_scispacy_pd[selected_features]   #  reduce the matrix     
            if name == "PubTator_Conceptnet":
                matrix= matrix_PubTator_conceptnet_pd[selected_features]   #  reduce the matrix 
            if name == "Scispacy_Conceptnet":
                matrix = matrix_scispacy_conceptnet_pd[selected_features]#  reduce the matrix 
            if name == "PubTator":
                matrix = matrix_PubTator_pd[selected_features] #  reduce the matrix 
            if name =="AuthorsNames":
                matrix = matrix_authors_pd
            if name =="Bow":
                matrix = matrix_bow_pd
            if name =="TF-IDF":
                matrix = matrix_tfidf_pd
            if name =="Bow_Scispacy":
                importance = pd.read_csv("cache/FeatureImportance/importance_RF_Scispacy.csv", index_col=False, header=0).sort_values(by=["0"],ascending=False)
                list_of_features = importance['Unnamed: 0'].to_list()
                selected_features = itertools.islice(list_of_features, MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT) # grab the first x elements
                matrix = matrix_scispacy_pd[selected_features].merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
            if name =="Bow_Scispacy_Conceptnet":
                importance = pd.read_csv("cache/FeatureImportance/importance_RF_Scispacy_Conceptnet.csv", index_col=False, header=0).sort_values(by=["0"],ascending=False)
                list_of_features = importance['Unnamed: 0'].to_list()
                selected_features = itertools.islice(list_of_features, MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT) # grab the first x elements
                matrix = matrix_scispacy_conceptnet_pd[selected_features].merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
            if name =="Bow_PubTator":
                importance = pd.read_csv("cache/FeatureImportance/importance_RF_PubTator.csv", index_col=False, header=0).sort_values(by=["0"],ascending=False)
                list_of_features = importance['Unnamed: 0'].to_list()
                selected_features = itertools.islice(list_of_features, MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT) # grab the first x elements
                matrix = matrix_PubTator_pd[selected_features].merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)       
            if name =="Bow_PubTator_Conceptnet":
                importance = pd.read_csv("cache/FeatureImportance/importance_RF_PubTator_Conceptnet.csv", index_col=False, header=0).sort_values(by=["0"],ascending=False)
                list_of_features = importance['Unnamed: 0'].to_list()
                selected_features = itertools.islice(list_of_features, MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT) # grab the first x elements
                matrix = matrix_PubTator_conceptnet_pd[selected_features].merge(matrix_bow_pd, how='inner', left_index=True, right_index=True)
            if name =="TF-IDF_BibliometricFeatures":
                matrix = tfidf_plus_bibliometric_features
            if name =="BibliometricFeatures_rf":
                matrix = bibliomet_matrix_randomforest
            if name =="Bow_BibliometricFeatures_rf":
                matrix = bow_plus_bibliometric_features
            if name =="BibliometricFeatures_rule":
                matrix = bibliomet_matrix_rulemining
            if name =="Bow_BibliometricFeatures_rule":
                matrix = bow_plus_bibliometric_features_rule    
            if name=="Bow_Pubtator_Conceptnet_BibliometricFeatures_rf":
                importance = pd.read_csv("cache/FeatureImportance/importance_RF_PubTator_Conceptnet.csv", index_col=False, header=0).sort_values(by=["0"],ascending=False)
                list_of_features = importance['Unnamed: 0'].to_list()
                selected_features = itertools.islice(list_of_features, MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT) # grab the first x elements
                matrix = matrix_PubTator_conceptnet_pd[selected_features].merge(bow_plus_bibliometric_features, how='inner', left_index=True, right_index=True)
            if name=="Bow_Pubtator_Conceptnet_BibliometricFeatures_rule":
                importance = pd.read_csv("cache/FeatureImportance/importance_RF_PubTator_Conceptnet.csv", index_col=False, header=0).sort_values(by=["0"],ascending=False)
                list_of_features = importance['Unnamed: 0'].to_list()
                selected_features = itertools.islice(list_of_features, MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT) # grab the first x elements
                matrix = matrix_PubTator_conceptnet_pd[selected_features].merge(bow_plus_bibliometric_features_rule, how='inner', left_index=True, right_index=True)
                
            if CLASSIFIER == True:
                X_train, X_test, y_train, y_test = train_test_split(matrix, documents.Target, test_size=0.3, random_state=1)
                print(X_train.shape)
                X_train.to_csv("results/matrices/X_train_"+name+"_rule.csv") # save for rule mining
                y_train.to_csv("results/matrices/y_train_"+name+"_rule.csv")
                X_test.to_csv("results/matrices/X_test_"+name+"_rule.csv")
                y_test.to_csv("results/matrices/y_test_"+name+"_rule.csv")  
            
            if CLASSIFIER == False:
                X_train, X_test, y_train, y_test = train_test_split(matrix, documents.NormCitations, test_size=0.3, random_state=1)
            
            
            rf.fit(X_train, y_train)
            
            if CLASSIFIER == True:
                ## Evaluation of model
                y_pred = rf.predict(X_test)     
                clsf_report = pd.DataFrame(classification_report(y_true = y_test, y_pred = y_pred, output_dict=True)).transpose()
                accuracy = clsf_report.iat[2,0]
                precision = clsf_report.iat[4,0]
                recall = clsf_report.iat[4,1]
                f1_score = clsf_report.iat[4,2]
                list_metrics = [accuracy, precision, recall, f1_score]
                keys =["accuracy", "precision", "recall", "f1_score"]
                dictionary = dict(zip(keys, list_metrics))
                results_pd = pd.DataFrame.from_dict(dictionary,orient ='index').T 
                results_pd['model'] = name
                results_pd['number_of_col_in_mtrx'] = X_train.shape[1]
                dfs.append(results_pd)
                print(classification_report(y_test, y_pred, target_names=target_names))   
            
            if CLASSIFIER == False:
                y_pred = rf.predict(X_test)
                MAE = mean_absolute_error(y_test, y_pred)
                MSE = mean_squared_error(y_test, y_pred)
                list_metrics = [MAE, MSE]
                keys =["MAE", "MSE"]
                dictionary = dict(zip(keys, list_metrics))
                results_pd = pd.DataFrame.from_dict(dictionary,orient ='index').T 
                results_pd['model'] = name
                results_pd['number_of_col_in_mtrx'] = X_train.shape[1]
                y_test_bin = [0 if i <=2.0 else 1 for i in y_test]
                y_pred_bin = [0 if i <=2.0 else 1 for i in y_pred]
                X_train, X_test, y_train, y_test = train_test_split(matrix, documents.Target, test_size=0.3, random_state=1)
                clsf_report = pd.DataFrame(classification_report(y_true = y_test_bin, y_pred = y_pred_bin, output_dict=True)).transpose()
                accuracy = clsf_report.iat[2,0]
                results_pd['accuracy'] = accuracy
                
                dfs.append(results_pd)
        
    if USE_FI_FOR_REDUCTION:
        final = pd.concat(dfs, ignore_index=True)
        if DATASET_VERSION ==1:
            if CLASSIFIER ==True:
                final.to_csv("results/randomforest/results_RF_datasetv1.csv") 
            if CLASSIFIER ==False:
                final.to_csv("results/randomforest/results_RF_datasetv1_regression.csv") 
        if DATASET_VERSION ==2:
            if CLASSIFIER ==True:
                final.to_csv("results/randomforest/results_RF_datasetv2.csv")
            if CLASSIFIER ==False:
                final.to_csv("results/randomforest/results_RF_datasetv2_regression.csv") 
           
    return print("Done")

#### Run to obtain feature importance (FI) for all matrices
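
The call below is the first of two passes: with `USE_FI_FOR_REDUCTION=False` the function fits a random forest on each matrix and caches the largest feature importances as a CSV file, and the later pass reads that file and keeps only the top-ranked columns. The following is a minimal sketch of that round trip, assuming a toy importance vector and an in-memory buffer in place of the cache file. The names `toy_importance`, `toy_matrix` and `TOP_N` are illustrative and do not appear in the notebook.

```python
import io
import pandas as pd

# Toy stand-in for rf.feature_importances_ indexed by column name (made-up values).
toy_importance = pd.Series([0.10, 0.55, 0.35], index=["a", "b", "c"])

# Pass 1: cache the largest importances. An unnamed Series is written with the
# header ",0", which is why the reader later sees the columns "Unnamed: 0" and "0".
buffer = io.StringIO()  # stands in for cache/FeatureImportance/importance_RF_<name>.csv
toy_importance.nlargest(10000).to_csv(buffer)
buffer.seek(0)

# Pass 2: read the cache back, rank by importance, and keep the top features.
cached = pd.read_csv(buffer, index_col=False, header=0).sort_values(by=["0"], ascending=False)
TOP_N = 2  # plays the role of MAX_NO_OF_FEATURES_SCISPACY_AND_PUBT
selected_features = cached["Unnamed: 0"].to_list()[:TOP_N]

# Reduce a toy feature matrix to the selected columns.
toy_matrix = pd.DataFrame({"a": [1, 0], "b": [0, 1], "c": [1, 1]})
reduced = toy_matrix[selected_features]
print(list(reduced.columns))
```

Slicing the list (`[:TOP_N]`) is used here instead of `itertools.islice` only because a list can be reused, whereas an iterator is exhausted after one pass.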

In [None]:
run_random_forest(CLASSIFIER = True,USE_FI_FOR_REDUCTION=False, USE_GRID=True, matrices_rf=matrices)

#### Run to obtain results for the reduced matrices
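
The evaluation inside `run_random_forest` reads its metrics from the `classification_report` table by position (`iat[2,0]`, `iat[4,0]` and so on). The sketch below shows which cells those positions refer to in a two-class report, assuming a hand-written dictionary shaped like the `output_dict=True` result. The numbers and the name `toy_report` are made up for illustration.

```python
import pandas as pd

# Hand-written dictionary shaped like classification_report(..., output_dict=True)
# for a two-class problem; all numbers are invented.
toy_report = {
    "high": {"precision": 0.70, "recall": 0.60, "f1-score": 0.65, "support": 40},
    "low": {"precision": 0.80, "recall": 0.86, "f1-score": 0.83, "support": 60},
    "accuracy": 0.76,
    "macro avg": {"precision": 0.75, "recall": 0.73, "f1-score": 0.74, "support": 100},
    "weighted avg": {"precision": 0.77, "recall": 0.76, "f1-score": 0.76, "support": 100},
}
report = pd.DataFrame(toy_report).transpose()

# Positional access, as in the evaluation code: row 2 is "accuracy",
# row 4 is "weighted avg"; column 0 is "precision".
accuracy_by_position = report.iat[2, 0]
precision_by_position = report.iat[4, 0]

# Label-based access to the same cells, which does not depend on the row order.
accuracy_by_label = report.loc["accuracy", "precision"]
precision_by_label = report.loc["weighted avg", "precision"]
print(report)
```

The positional form therefore reports weighted-average precision, recall and F1. It relies on there being exactly two class rows before the `accuracy` row, so the label-based form is the safer choice if the number of classes ever changes.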

In [None]:
if DATASET_VERSION ==1:
    run_random_forest(CLASSIFIER = True,USE_FI_FOR_REDUCTION=True, USE_GRID=True, matrices_rf=matrices,
                  matrix_scispacy_pd= matrix_scispacy_pd,
                  matrix_PubTator_conceptnet_pd=matrix_PubTator_conceptnet_pd,
                  matrix_scispacy_conceptnet_pd=matrix_scispacy_conceptnet_pd,
                  matrix_PubTator_pd=matrix_PubTator_pd,
                  matrix_authors_pd=matrix_authors_pd,
                  matrix_tfidf_pd=matrix_tfidf_pd,
                  matrix_bow_pd=matrix_bow_pd,
                  bow_plus_scispacy_pd=bow_plus_scispacy_pd,
                  bow_plus_scispacy_conceptnet_pd=bow_plus_scispacy_conceptnet_pd,
                  bow_plus_PubTator_pd=bow_plus_PubTator_pd,
                  bow_plus_PubTator_conceptnet_pd=bow_plus_PubTator_conceptnet_pd,
                  tfidf_plus_bibliometric_features=tfidf_plus_bibliometric_features,
                  bibliomet_matrix_randomforest= bibliomet_matrix_randomforest,
                  bow_plus_bibliometric_features=bow_plus_bibliometric_features,
                  bow_plus_PubtConc_plus_BibFeat_pd=bow_plus_PubtConc_plus_BibFeat_pd,
                  bibliomet_matrix_rulemining=bibliomet_matrix_rulemining,
                  bow_plus_bibliometric_features_rule=bow_plus_bibliometric_features_rule,
                  bow_plus_PubtConc_plus_BibFeat_pd_rule=bow_plus_PubtConc_plus_BibFeat_pd_rule)
    
    run_random_forest(CLASSIFIER = False,USE_FI_FOR_REDUCTION=True, USE_GRID=True, matrices_rf=matrices,
                  matrix_scispacy_pd= matrix_scispacy_pd,
                  matrix_PubTator_conceptnet_pd=matrix_PubTator_conceptnet_pd,
                  matrix_scispacy_conceptnet_pd=matrix_scispacy_conceptnet_pd,
                  matrix_PubTator_pd=matrix_PubTator_pd,
                  matrix_authors_pd=matrix_authors_pd,
                  matrix_tfidf_pd=matrix_tfidf_pd,
                  matrix_bow_pd=matrix_bow_pd,
                  bow_plus_scispacy_pd=bow_plus_scispacy_pd,
                  bow_plus_scispacy_conceptnet_pd=bow_plus_scispacy_conceptnet_pd,
                  bow_plus_PubTator_pd=bow_plus_PubTator_pd,
                  bow_plus_PubTator_conceptnet_pd=bow_plus_PubTator_conceptnet_pd,
                  tfidf_plus_bibliometric_features=tfidf_plus_bibliometric_features,
                  bibliomet_matrix_randomforest= bibliomet_matrix_randomforest,
                  bow_plus_bibliometric_features=bow_plus_bibliometric_features,
                  bow_plus_PubtConc_plus_BibFeat_pd=bow_plus_PubtConc_plus_BibFeat_pd,
                  bibliomet_matrix_rulemining=bibliomet_matrix_rulemining,
                  bow_plus_bibliometric_features_rule=bow_plus_bibliometric_features_rule,
                  bow_plus_PubtConc_plus_BibFeat_pd_rule=bow_plus_PubtConc_plus_BibFeat_pd_rule)
    
if DATASET_VERSION ==2:
    run_random_forest(CLASSIFIER = True,USE_FI_FOR_REDUCTION=True, USE_GRID=False,
                  matrices_rf = matrices,
                  matrix_tfidf_pd=matrix_tfidf_pd,
                  matrix_bow_pd=matrix_bow_pd
               )
    run_random_forest(CLASSIFIER = False,USE_FI_FOR_REDUCTION=True, USE_GRID=False,
                  matrices_rf = matrices,
                  matrix_tfidf_pd=matrix_tfidf_pd,
                  matrix_bow_pd=matrix_bow_pd
               )

# SHAP

In [None]:
for name, matrix in matrices: 
    y = documents.Target  # the target column is the same for both dataset versions
    rf=RandomForestClassifier()
    importance = pd.read_csv("cache/FeatureImportance/importance_RF_"+name+".csv", index_col=False, header=0)
    importance = importance.sort_values(by=["0"],ascending=False)
    list_of_features = importance['Unnamed: 0'].to_list()
    selected_features = itertools.islice(list_of_features, 25) 
    matrix_reduced=matrix[selected_features]   #  reduce the matrix 
    X = matrix_reduced
    X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.3,random_state=1)  
    rf.fit(X_train, Y_train)  
    shap_values = shap.TreeExplainer(rf).shap_values(X)
    print(name)
    f = plt.figure()
    shap.summary_plot(shap_values[0], X)
    f.savefig("results/randomforest/summary_"+name+".pdf", bbox_inches='tight')       

## LIME

In [None]:
# Choose one :
#TEXT = documents["authors"]
TEXT = documents["abstract"]

#vectorizer = TfidfVectorizer(min_df= 10, stop_words={'english'},ngram_range=(1,3))
vectorizer =  CountVectorizer(analyzer = "word", tokenizer = None, ngram_range=(1,3),lowercase = True,strip_accents = "ascii", binary= True, stop_words='english',min_df=MIN_DF)

list_labels = documents["Target"].tolist()
list_corpus = TEXT.tolist()
X_train, X_test, y_train, y_test = train_test_split(list_corpus, list_labels, test_size=0.3, random_state=1)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)
rf = RandomForestClassifier()
rf.fit(train_vectors, y_train)   
pred = rf.predict(test_vectors)
c = make_pipeline(vectorizer, rf)
class_names=list(documents.Target.unique())
explainer = LimeTextExplainer(class_names=class_names)

idx = 55   # 400 , 404 , 24, 28, 135

exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=100, labels=[0, 1])
exp.show_in_notebook(text=True, labels=(1,))  # `text` is a boolean flag: also render the explained text

exp =  explainer.explain_instance(X_test[idx], c.predict_proba, num_features=20, labels=(1,))

exp.save_to_file('results/lime/lime4.html')

exp.as_pyplot_figure()

# RULE MINING MODELS 

These models are run for dataset version 1 only.

In [None]:
if DATASET_VERSION ==1:
    matrices_rule = [     
              ('Scispacy', matrix_scispacy_pd),
              ('PubTator_Conceptnet',matrix_PubTator_conceptnet_pd),
              ('Scispacy_Conceptnet',matrix_scispacy_conceptnet_pd),
              ('PubTator',matrix_PubTator_pd),
              ('AuthorsNames',matrix_authors_pd),  
              ('Bow',matrix_bow_pd),
              ('Bow_Scispacy',bow_plus_scispacy_pd), 
              ('Bow_Scispacy_Conceptnet',bow_plus_scispacy_conceptnet_pd), 
              ('Bow_PubTator',bow_plus_PubTator_pd),
              ('Bow_PubTator_Conceptnet',bow_plus_PubTator_conceptnet_pd),
              ('BibliometricFeatures',bibliomet_matrix_rulemining),
              ('Bow_BibliometricFeatures', bow_plus_bibliometric_features_rule),
              ('Bow_Pubtator_Conceptnet_BibliometricFeatures', bow_plus_PubtConc_plus_BibFeat_pd_rule)
          ]      
    target_names = ['low', 'high']  # NOTE: the mapping of these names to the target classes has not been verified

## CORELS
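
Besides the classification metrics, the CORELS cell below records the number of rules and the average antecedent length of the learned rule list. The following is a minimal sketch of those two statistics, assuming a hypothetical rule list in which every rule is a dictionary with an `"antecedents"` sequence, as the cell's `r["antecedents"]` lookup implies. The `"prediction"` key and all values are illustrative assumptions.

```python
# Hypothetical rule list; only the "antecedents" key is taken from the notebook code.
toy_rules = [
    {"antecedents": [3, -7], "prediction": True},   # rule with two conditions
    {"antecedents": [12], "prediction": False},     # rule with one condition
    {"antecedents": [0], "prediction": True},       # assumed fallback rule
]

# Length of each rule's antecedent, then the summary statistics.
ant_lengths = [len(rule["antecedents"]) for rule in toy_rules]
avg_rule_length = sum(ant_lengths) / len(ant_lengths)
rule_count = len(toy_rules)
print(ant_lengths, avg_rule_length, rule_count)
```

Note that the cell stores only `antLengths[0]`, the length of the first rule, in the `antLengths` column, while the average covers every rule in the list.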

In [None]:
rules_list  = []
dfs = []
results = []
names = []

In [None]:
if DATASET_VERSION ==1:
    scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted', 'roc_auc']
    corels=CorelsClassifier(n_iter=50, c=0.0000, max_card=2, min_support=0.3)

    for name, matrix in matrices_rule: 
        print(name)
        X_train = pd.read_csv("results/matrices/X_train_"+name+"_rule.csv" ,index_col='doi')
        y_train = pd.read_csv("results/matrices/y_train_"+name+"_rule.csv" ,index_col='doi')
        X_test = pd.read_csv("results/matrices/X_test_"+name+"_rule.csv",index_col='doi')
        y_test = pd.read_csv("results/matrices/y_test_"+name+"_rule.csv",index_col='doi')  
        y_train = np.array(list(map(lambda x: 1 if x=="high" else 0, y_train['Target'])))
        y_test =  np.array(list(map(lambda x: 1 if x=="high" else 0, y_test['Target'])))
    
        X_train = X_train.reset_index()
        del X_train['doi']
        X_train.columns = X_train.columns.map(str)    
        features = list(X_train.columns)
                 
        corels=corels.fit(X_train, y_train, features=features, prediction_name="high_citation")       
        y_pred = corels.predict(X_test) 
        antLengths = list(map(lambda r: len(r["antecedents"]), corels.rl().rules))
        avgRuleLengthCorels = sum(antLengths) / len(antLengths)
        ruleCountCorels = len(corels.rl().rules)
   
        names.append(name)
        clsf_report = pd.DataFrame(classification_report(y_true = y_test, y_pred = y_pred, output_dict=True)).transpose()
        print(name)
        print(clsf_report)
        print(type(antLengths))
        print(antLengths)
        print(type(avgRuleLengthCorels))
        print(type(ruleCountCorels))
    
        accuracy = clsf_report.iat[2,0]
        precision = clsf_report.iat[4,0]
        recall = clsf_report.iat[4,1]
        f1_score = clsf_report.iat[4,2]
        list_metrics = [accuracy, precision, recall, f1_score]
        keys =["accuracy", "precision", "recall", "f1_score"]
        dictionary = dict(zip(keys, list_metrics))
        results_pd = pd.DataFrame.from_dict(dictionary,orient ='index').T 
        results_pd['model'] = name
        results_pd['number_of_col_in_mtrx'] = X_train.shape[1]
    
        results_pd['antLengths'] = antLengths[0]
        results_pd['avgRuleLengthCorels'] = avgRuleLengthCorels
        results_pd['ruleCountCorels'] = ruleCountCorels
        
        rules=corels.rl()
        with open("results/rule_lists/results_"+name+"_rules_corels.csv", "w") as f:
            f.write(str(rules))
        rules_pd = pd.read_csv("results/rule_lists/results_"+name+"_rules_corels.csv")
        rules_pd['model'] = name
    
        rules_list.append(rules_pd)
        dfs.append(results_pd)

In [None]:
if DATASET_VERSION ==1:
    final = pd.concat(dfs, ignore_index=True)
    final.to_csv("results/corels/RESULTS_CORELS.csv")   
    rules_final = pd.concat(rules_list, ignore_index=True)
    rules_final.to_csv("results/corels/RULES_CORELS.csv")   

## CBA

In [None]:
if DATASET_VERSION ==1:
    
    rules_list  = []
    dfs = []

    for name, matrix in matrices_rule: 
        X_train = pd.read_csv("results/matrices/X_train_"+name+"_rule.csv" ,index_col='doi')
        y_train = pd.read_csv("results/matrices/y_train_"+name+"_rule.csv" ,index_col='doi')
        X_test = pd.read_csv("results/matrices/X_test_"+name+"_rule.csv",index_col='doi')
        y_test = pd.read_csv("results/matrices/y_test_"+name+"_rule.csv",index_col='doi') 
    
        X_train.shape, y_train.shape, X_test.shape, y_test.shape
        print(X_train.shape)
        print(X_test.shape)
        print(y_test.shape)
        print(y_train.shape)
    
        with localconverter(ro.default_converter + pandas2ri.converter):
            X_train_R = ro.conversion.py2rpy(X_train)
            X_test_R = ro.conversion.py2rpy(X_test)
            y_test_R = ro.conversion.py2rpy(y_test)
            y_train_R = ro.conversion.py2rpy(y_train)
        
            print(type(X_train_R))
                    
        robjects.globalenv["trainFold_X"] = X_train_R
        robjects.globalenv["trainFold_Y"] = y_train_R
        robjects.globalenv["testFold_X"] = X_test_R
        robjects.globalenv["testFold_Y"] = y_test_R
        robjects.globalenv["exp_name"] = name

        # 500 for fast learning, 50000 for best results
        robjects.globalenv["candidate_rule_limit"] = 50000
    
        print(name)
        
        # Execute the CBA pipeline
        # read R program with CBA 
        with open('cordParam.R', 'r') as r_file:
            rscript = r_file.read()
        result = robjects.r(rscript)
    
        pred_perf_df = robjects.globalenv["pred_perf_df"]
        print(pred_perf_df)
        pred_perf_df.to_csv("results/rule_lists/results_cba_50000_" +name +'.csv')
        results_pd = pd.read_csv("results/rule_lists/results_cba_50000_" +name +'.csv')
        results_pd['model'] = name
        results_pd['number_of_col_in_mtrx'] = X_train.shape[1]
        rules_length = robjects.globalenv["rules_length"][0]
        avgRuleLengthCBA = robjects.globalenv["avgRuleLengthCBA"][0]
        results_pd['rules_length']=rules_length
        results_pd['avgRuleLengthCBA']=avgRuleLengthCBA  
        print(results_pd)
        dfs.append(results_pd) 
    
        confusion_matrix_cba = robjects.globalenv["confusion_matrix_df"] #also on disk
        print(confusion_matrix_cba)
    
        rules_cba = robjects.globalenv["rules_df"]
        rules_cba.to_csv("results/rule_lists/rules_cba_50000_" +name +'.csv')
        rules_pd = pd.read_csv("results/rule_lists/rules_cba_50000_" +name +'.csv')
        rules_pd['model'] = name
        rules_list.append(rules_pd)

In [None]:
if DATASET_VERSION ==1:
    final = pd.concat(dfs, ignore_index=True)
    final.to_csv("results/cba/RESULTS_CBA.csv")  
    rules_final = pd.concat(rules_list, ignore_index=True)
    rules_final.to_csv("results/cba/RULES_CBA.csv")   

In [None]:
if DATASET_VERSION ==1:
    with open('cordParamVisualization.R', 'r') as r_file:  # file name as listed in the expected directory structure
        rscript = r_file.read()
    result = robjects.r(rscript)

# Author names analysis

### Try to detect the number of unique author names

Conclusion: the number of unique author names cannot be determined reliably, because the same name is often written in several different ways.
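
The problem can be illustrated with a small sketch, assuming a few invented spellings of one author. The helper `crude_key` is hypothetical and is not used anywhere in this notebook; it only shows that even after stripping accents, case, punctuation and word order, abbreviated first names still produce a separate entry.

```python
import unicodedata

def crude_key(name: str) -> str:
    """Hypothetical normalisation: strip accents, lowercase, drop punctuation, sort tokens."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    cleaned = "".join(ch if ch.isalnum() else " " for ch in ascii_name.lower())
    return " ".join(sorted(cleaned.split()))

# Invented spellings of the same person.
variants = ["Müller, Hans", "Muller, Hans", "Hans Müller", "Muller, H."]

raw_unique = set(variants)
normalised_unique = {crude_key(v) for v in variants}
print(len(raw_unique), len(normalised_unique))
print(sorted(normalised_unique))
```

The initial-only spelling cannot be merged with the full name without risking false matches between different people, which is why a plain count of distinct strings is not a trustworthy estimate.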

In [None]:
if DATASET_VERSION == 1:
    
    authors_series = documents['authors']
    authors_series_pd = authors_series.reset_index()
    authors_new = []
    for row in list(authors_series_pd.doi.values):
        authors_series_pd_f = authors_series_pd[authors_series_pd['doi']==row]
    
        if not authors_series_pd_f.authors.values[0].startswith("["):
            new_authors_value = authors_series_pd_f.authors.values[0]
            new_authors_value = new_authors_value.replace(",","_").split(";")
            new_authors_value = [word for word in new_authors_value]
             
        if authors_series_pd_f.authors.values[0].startswith("["):
            new_authors_value = authors_series_pd_f.authors.values[0]
            new_authors_value = ast.literal_eval(new_authors_value)
            new_authors_value = [word.replace(',','_') for word in new_authors_value]
         
        authors_new.append(new_authors_value)
  
    authors_series_pd["authors_new"] = authors_new

    authors_series_pd["authors_new"] = authors_series_pd["authors_new"].astype(str).str.replace(']','', regex=False).str.replace('[','', regex=False)
    dummies = authors_series_pd["authors_new"].str.get_dummies(',')
    print("Names")
    print(list(dummies.columns))
    print("Number of names:"+str(len(list(dummies.columns))))

In [None]:
if DATASET_VERSION == 1:
    cvec_authors = CountVectorizer(analyzer = "word", tokenizer = None, ngram_range=(1,50), min_df=MIN_DF, lowercase = True,strip_accents = "ascii", binary= True,stop_words='english')
    matrix_authors = cvec_authors.fit_transform(documents['authors'])
    tokens = cvec_authors.get_feature_names()
    matrix_authors_pd=pd.DataFrame(data=matrix_authors.toarray(), index=documents.index,columns=tokens)

    vect_authors_tfidf = TfidfVectorizer( tokenizer = None, ngram_range=(1,50), min_df=MIN_DF, lowercase = True,strip_accents = "ascii",stop_words='english')
    matrix_authors_tfidf= vect_authors_tfidf.fit_transform(documents['authors'])
    tokens_authors_tfidf = vect_authors_tfidf.get_feature_names()
    matrix_authors_tfidf_pd=pd.DataFrame(data=matrix_authors_tfidf.toarray(), index=documents.index,columns=tokens_authors_tfidf)

    matrix_authors_pd.shape,matrix_authors_tfidf_pd.shape
    matrix_authors_pd.sum().reset_index().sort_values(0,ascending=False).hist()

In [None]:
if DATASET_VERSION == 1:
    authors_target = matrix_authors_pd.join(documents[["Target"]]).groupby("Target").sum().transpose()
    authors_target.columns = authors_target.columns.astype(str)
    authors_target = authors_target.reset_index()
    authors_target["Total"] = authors_target["low"]+authors_target["high"]
    authors_target["frac"] = authors_target["low"] / authors_target["Total"]
    authors_target.sort_values("Total",ascending=False).head(5)

In [None]:
if DATASET_VERSION == 1:   
    jmena = pd.read_csv("inputs/author_names_info.csv",delimiter = ";")
    jmena["frac"] = jmena["frac"].str.replace(",",".").astype(float) 
    jmena["frac"] = jmena["high"]/jmena["Total"]  # fraction of highly cited articles per name (overwrites the value read from the file)
    print(jmena.head(50))
    print(jmena[  (jmena["W_E"]!="Arabia") ][["W_E"]].value_counts().reset_index().sort_values("W_E"))

## Plot 1 

In [None]:
if DATASET_VERSION == 1:
    fig, ax = plt.subplots(figsize=(7,5))
    num_bins = len(list(set(list(jmena["frac"].values))))
    ax.hist(jmena[jmena.W_E=="Europe"]["frac"], alpha=0.5,color="red",bins=20)
    ax.hist(jmena[jmena.W_E=="Asia"]["frac"], alpha=0.5,bins=20)

    plt.legend(['European names', 'Asian names'],fontsize=12) 
    plt.axvline(jmena[jmena.W_E=="Asia"]["frac"].mean(), color='b', linestyle='dashed')
    plt.text(jmena[jmena.W_E=="Asia"]["frac"].mean(),11,'Mean (Asia): {:.2f}'.format(jmena[jmena.W_E=="Asia"]["frac"].mean()),fontsize=12)
    plt.axvline(jmena[jmena.W_E=="Asia"]["frac"].median(), color='b', linestyle='dashed')
    plt.text(jmena[jmena.W_E=="Asia"]["frac"].median(),10,'Median (Asia): {:.2f}'.format(jmena[jmena.W_E=="Asia"]["frac"].median()),fontsize=12)
    plt.axvline(jmena[jmena.W_E=="Europe"]["frac"].mean(), color='r', linestyle='dashed')
    plt.text(jmena[jmena.W_E=="Europe"]["frac"].mean(),6,'Mean (Europe): {:.2f}'.format(jmena[jmena.W_E=="Europe"]["frac"].mean()),fontsize=12)
    plt.text(jmena[jmena.W_E=="Europe"]["frac"].median(),7,'Median (Europe): {:.2f}'.format(jmena[jmena.W_E=="Europe"]["frac"].median()),fontsize=12)
    plt.axvline(jmena[jmena.W_E=="Europe"]["frac"].median(), color='r', linestyle='dashed')
    ax.set_xlabel('Percentage of highly cited articles for each name',fontsize=13)
    ax.set_ylabel('Number of names',fontsize=12)
    fig.tight_layout()
    plt.show()

#### Statistical tests:
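
The cells below compare the two groups of names with a Mann-Whitney U test and a t-test. As a reminder of what the first statistic measures, here is a minimal pure-Python sketch of U on made-up values. It counts, over all cross-group pairs, how often a value from one sample exceeds a value from the other (ties count one half). It does not compute a p-value, and the function name `mann_whitney_u` and the sample values are illustrative only.

```python
def mann_whitney_u(x, y):
    """Return (U_x, U_y): pairwise wins of each sample, ties counted as one half."""
    u_x = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    u_y = len(x) * len(y) - u_x
    return u_x, u_y

# Made-up fractions for two groups of names (illustrative values only).
group_a = [0.1, 0.4, 0.5]
group_b = [0.2, 0.3, 0.6, 0.7]

u_a, u_b = mann_whitney_u(group_a, group_b)
print(u_a, u_b)
```

Because U depends only on the ordering of the values, the test makes no normality assumption, whereas the t-test compares the group means.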

In [None]:
if DATASET_VERSION == 1:
    from scipy.stats import mannwhitneyu
    x = jmena[jmena.W_E=="Asia"]["frac"]
    y = jmena[jmena.W_E=="Europe"]["frac"]
    U1, p = mannwhitneyu(x, y)
    print(p)

In [None]:
if DATASET_VERSION == 1:
    from scipy.stats import ttest_ind
    stat, p = ttest_ind(x, y)
    print('stat=%.3f, p=%.3f' % (stat, p))
    if p > 0.05:
        print('No significant difference in means at the 0.05 level')
    else:
        print('Means differ significantly at the 0.05 level')

## Plot 2

In [None]:
if DATASET_VERSION == 1:
    df_1 = matrix_authors_pd.where(matrix_authors_pd.eq(1)).stack().reset_index(level=1)['level_1'].reset_index().join(documents[['OpenCitations']],on="doi",how="left")
    df_2 = pd.merge(jmena[['name',"W_E"]],df_1, left_on='name', right_on='level_1')
    df_2 = df_2[df_2["OpenCitations"]<250]

    num_bins = len(list(set(list(df_2["OpenCitations"].values))))

    fig, ax = plt.subplots(figsize=(7,5))
    ax.hist(df_2[df_2.W_E=="Europe"]["OpenCitations"],alpha=0.5,color="r",bins=30)
    ax.hist(df_2[df_2.W_E=="Asia"]["OpenCitations"], alpha=0.5,bins=30)
    plt.legend(['European names', 'Asian names'],fontsize=12) 
    plt.axvline(df_2[df_2.W_E=="Asia"]["OpenCitations"].mean(), color='b', linestyle='dashed')
    plt.text(df_2[df_2.W_E=="Asia"]["OpenCitations"].mean(),400,'Mean (Asia): {:.2f}'.format(df_2[df_2.W_E=="Asia"]["OpenCitations"].mean()),fontsize=12)
    plt.axvline(df_2[df_2.W_E=="Asia"]["OpenCitations"].median(), color='b', linestyle='dashed')
    plt.text(df_2[df_2.W_E=="Asia"]["OpenCitations"].median(),500,'Median (Asia): {:.2f}'.format(df_2[df_2.W_E=="Asia"]["OpenCitations"].median()),fontsize=12)
    plt.axvline(df_2[df_2.W_E=="Europe"]["OpenCitations"].mean(), color='r', linestyle='dashed')
    plt.text(df_2[df_2.W_E=="Europe"]["OpenCitations"].mean(),600,'Mean (Europe): {:.2f}'.format(df_2[df_2.W_E=="Europe"]["OpenCitations"].mean()),fontsize=12)
    plt.text(df_2[df_2.W_E=="Europe"]["OpenCitations"].median(),700,'Median (Europe): {:.2f}'.format(df_2[df_2.W_E=="Europe"]["OpenCitations"].median()),fontsize=12)
    plt.axvline(df_2[df_2.W_E=="Europe"]["OpenCitations"].median(), color='r', linestyle='dashed')
    ax.set_xlabel('Number of citations of articles',fontsize=12)
    ax.set_ylabel('Number of articles by author continent',fontsize=12)
    fig.tight_layout()
    plt.show()

In [None]:
if DATASET_VERSION == 1:
    x = df_2[df_2.W_E=="Asia"]["OpenCitations"]
    y = df_2[df_2.W_E=="Europe"]["OpenCitations"]
    U1, p = mannwhitneyu(x, y)
    print(p)

In [None]:
if DATASET_VERSION == 1:
    stat, p = ttest_ind(x, y)
    print('stat=%.3f, p=%.3f' % (stat, p))
    if p > 0.05:
        print('No significant difference in means at the 0.05 level')
    else:
        print('Means differ significantly at the 0.05 level')