# Baseline model

The project proposes to leverage machine learning (ML) and natural language processing (NLP) techniques to build a text classifier that automates the identification of evidence of social impact in research documents. The task is framed as a binary classification problem: the model takes a sentence from a research document as input and outputs 1 (True) if the sentence contains evidence of social impact and 0 (False) otherwise.

Among all research fields, this project focuses on medical, health, and biological sciences because the ultimate goal is to understand the social impact of the research projects of the Spanish National Institute of Bioinformatics (INB, by its Spanish acronym), an institution that conducts medical and biological research.

The goal of this notebook is to develop a baseline model against which to compare the performance of the machine learning classifier. The baseline model is a vocabulary-based classifier that searches each sentence for occurrences of entries in the vocabulary. A sentence is classified as containing evidence of social impact if it includes at least one vocabulary entry.
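As a rough illustration of the idea, the following sketch classifies a sentence by looking for a verb followed later by a noun from a tiny vocabulary. It is deliberately simplified: the vocabulary and the `classify` function are hypothetical, and the implementation further down in this notebook additionally lemmatizes the words and checks syntactic dependencies.

```python
# Hypothetical vocabulary of (verb, noun) pairs, for illustration only;
# the real vocabulary is built from data/dictionary.csv.
VOCABULARY = {('raise', 'awareness'), ('improve', 'health')}


def classify(sentence):
    """Return 1 if a vocabulary verb is followed later by its noun, else 0."""
    tokens = sentence.lower().split()
    for verb, noun in VOCABULARY:
        # the noun must appear somewhere after the first occurrence of the verb
        if verb in tokens and noun in tokens[tokens.index(verb) + 1:]:
            return 1
    return 0


print(classify('The campaign helped raise public awareness'))
print(classify('We sequenced the genome'))
```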

## Load libraries

In [57]:
import csv
import numpy as np
import nltk
import pandas as pd
import pathlib
import re
import spacy
import sys

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

s_nlp = spacy.load('en')

from collections import defaultdict
from nltk import corpus, pos_tag, word_tokenize
from sklearn import metrics
from tqdm import tqdm
from utils import lemmatize_words, normalize, remove_non_ascii, remove_extra_spaces

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/Life/jsaldiva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/Life/jsaldiva/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/Life/jsaldiva/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
data_dir = 'data'

## Load data

In [6]:
train_data = pd.read_csv('./data/train_test_data.csv')
validation_data = pd.read_csv('./data/validation_data.csv')

In [8]:
data = pd.concat([train_data, validation_data], ignore_index=True)

In [11]:
data.shape

(1006, 2)

In [9]:
data.head()

Unnamed: 0,sentence,label
0,widely featured in the national press and rad...,1
1,indeed one of these projects has been select...,1
2,impact public engagement and education influen...,1
3,reach worldwide dolly became a scientific icon...,1
4,the educational tools have been used by 11 000...,1


### Separate labels from data

In [10]:
sentences, labels = data['sentence'], data['label']

## Build dictionary

The first step in implementing the model is to build a vocabulary of words that are commonly used to indicate the social impact of research. Again, descriptions of the social impact of research available in REF were used to construct the dictionary. In particular, I went through the first 230 impact summaries published [here](https://impact.ref.ac.uk/casestudies/Results.aspx?Type=I&Tag=5085) and tagged the verbs and nouns used in sentences that contain evidence of social impact of research. The identified verbs and nouns were collected in the file `data/dictionary.csv`. What I discovered during this process is that sentences with evidence of social impact are formed with combinations of these verbs and nouns. The next function reads the dictionary file, lemmatizes its entries, and creates a list with every verb-noun combination, skipping pairs in which the verb and the noun are the same word.

In [12]:
def build_dictionary(data_dir):
    impact_words_file_name = pathlib.Path(data_dir, 'dictionary.csv')
    impact_words, i_verbs, i_nouns = [], [], []
    with open(str(impact_words_file_name), 'r') as f:
        file = csv.DictReader(f, delimiter=',')
        for line in file:
            if line['pos'] == 'verb':
                lemma_words = ' '.join(lemmatize_words(normalize(line['word']), pos=corpus.wordnet.VERB))
                i_verbs.append(lemma_words)
            if line['pos'] == 'noun':
                lemma_words = ' '.join(lemmatize_words(normalize(line['word']), pos=corpus.wordnet.NOUN))
                i_nouns.append(lemma_words)
    impact_words = [i_verb + ' ' + i_noun for i_verb in i_verbs for i_noun in i_nouns if i_verb != i_noun]
    return impact_words

In [13]:
dictionary = build_dictionary(data_dir)

Let's look at some entries of the dictionary list.

In [14]:
dictionary[:10]

['ensure accessibility',
 'ensure accreditation',
 'ensure agenda',
 'ensure aggression',
 'ensure audience',
 'ensure awareness',
 'ensure basis',
 'ensure behavior',
 'ensure benefit',
 'ensure campaign']
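The combination step amounts to a Cartesian product of the verb list and the noun list. The sketch below reproduces it on made-up word lists (the real lists come from `data/dictionary.csv`), which also shows how quickly the dictionary grows: it holds roughly as many entries as the number of verbs times the number of nouns.

```python
from itertools import product

# Hypothetical lemmatized entries, for illustration only.
verbs = ['ensure', 'improve', 'benefit']
nouns = ['awareness', 'benefit', 'policy']

# Pair every verb with every noun, skipping pairs where both words are identical,
# mirroring the list comprehension at the end of build_dictionary.
pairs = [f'{verb} {noun}' for verb, noun in product(verbs, nouns) if verb != noun]

print(len(pairs))
print(pairs[:3])
```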

## Find evidence of social impact in sentences

In [16]:
def extract_sentence_dependencies(sentence):
    sentence_dependencies = defaultdict(list)
    for token in sentence:
        token_text, token_tag, token_dependency_type, token_dependent_text, token_dependent_tag = \
            token.text, token.tag_, token.dep_, token.head.text, token.head.tag_
        # keep only nouns that are the direct object (dobj) of a verb
        if token_tag[0] == 'N' and token_dependency_type == 'dobj' and token_dependent_tag[0] == 'V':
            lemma_dependent = ' '.join(lemmatize_words(token_dependent_text))
            lemma_token = ' '.join(lemmatize_words(token_text, pos=corpus.wordnet.NOUN))            
            sentence_dependencies[lemma_dependent].append(lemma_token)
    return sentence_dependencies

In [17]:
def extract_sentence_nouns_and_verbs(sentence):
    # POS-tag the tokens of the (already normalized) sentence and keep nouns and verbs
    tagged_tokens = pos_tag(sentence)
    lemma_tokens = []
    for tagged_token in tagged_tokens:
        token, tag = tagged_token
        if tag[0] == 'N':
            lemma_token = lemmatize_words(token, pos=corpus.wordnet.NOUN)
            lemma_tokens.append(' '.join(lemma_token))
        elif tag[0] == 'V':
            lemma_token = lemmatize_words(token, pos=corpus.wordnet.VERB)
            lemma_tokens.append(' '.join(lemma_token))
    return lemma_tokens

In [18]:
def find_occurrences(dictionary, sentence, sentence_dependencies):
    occurrences = set()
    for entry in dictionary:
        entry_tokens = word_tokenize(entry)
        reg_verb, reg_noun = entry_tokens[0], ' '.join(entry_tokens[1:])
        reg_exp = r'^[\w\s]+{verb}\s[\w\s]*{noun}[\w\s]*$'.format(verb=reg_verb, noun=reg_noun)
        if re.search(reg_exp, sentence):
            if sentence_dependencies.get(reg_verb):
                if reg_noun in sentence_dependencies[reg_verb]:
                    occurrences.add(entry)
    return occurrences
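To make the behaviour of the regular expression in `find_occurrences` easier to follow, the sketch below applies the same pattern shape to a few made-up lemma sentences. The entry and the sentences are hypothetical. Note that the pattern works on raw characters, so it can match inside longer words and requires at least one character before the verb; in `find_occurrences` it acts only as a first filter, and the dependency lookup decides whether the entry is counted.

```python
import re

# Same pattern shape as in find_occurrences; the entry below is made up.
verb, noun = 'raise', 'awareness'
pattern = r'^[\w\s]+{verb}\s[\w\s]*{noun}[\w\s]*$'.format(verb=verb, noun=noun)

# verb followed later by the noun: the pattern matches
in_order = bool(re.search(pattern, 'campaign raise public awareness disease'))
# the verb is the very first word: no match, because [\w\s]+ needs a character before it
verb_first = bool(re.search(pattern, 'raise awareness disease'))
# noun appears before the verb: no match
wrong_order = bool(re.search(pattern, 'awareness campaign raise fund'))
# 'raise' is matched inside 'praise': the regex alone cannot tell the difference
substring = bool(re.search(pattern, 'people praise awareness'))

print(in_order, verb_first, wrong_order, substring)
```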

In [33]:
processed_labels = []
with tqdm(total=len(sentences), file=sys.stdout) as pbar:
    for sentence in sentences:
        pbar.update(1)
        # clean and preprocess sentence
        normalized_sentence = normalize(sentence)
        clean_sentence = remove_non_ascii(sentence)
        clean_sentence = remove_extra_spaces(clean_sentence)
        nlp_sentence = s_nlp(' '.join(clean_sentence))
        # extract sentence dependencies
        sentence_dependencies = extract_sentence_dependencies(nlp_sentence)
        # extract nouns and verbs
        lemma_tokens = extract_sentence_nouns_and_verbs(normalized_sentence)
        lemma_sentence = ' '.join(lemma_tokens)
        # iterate over the dictionary entries and find occurrences of 
        # the sentence nouns and verbs
        occurrences = find_occurrences(dictionary, lemma_sentence, sentence_dependencies)
        if len(occurrences) > 0:
            processed_labels.append(1)
        else:
            processed_labels.append(0)

100%|██████████| 1006/1006 [32:55<00:00,  1.96s/it]


Check number of items in processed_labels

In [41]:
assert len(processed_labels)==data.shape[0], 'Total items are incorrect!'

## Test model

### Confusion Matrix

In [63]:
cm_results = metrics.confusion_matrix(labels, processed_labels)

In [64]:
cm_results

array([[791,   9],
       [143,  63]])

### Performance metrics

In [66]:
print(f'Balanced Accuracy: {round(metrics.balanced_accuracy_score(labels, processed_labels),2)}')

Balanced Accuracy: 0.65


In [67]:
print(f'Recall: {round(metrics.recall_score(labels, processed_labels),2)}')

Recall: 0.31


In [68]:
print(f'ROC-AUC: {round(metrics.roc_auc_score(labels, processed_labels),2)}')

ROC-AUC: 0.65


In [69]:
print(f'Precision: {round(metrics.precision_score(labels, processed_labels),2)}')

Precision: 0.88


In [70]:
print(f'F1: {round(metrics.f1_score(labels, processed_labels),2)}')

F1: 0.45
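The reported metrics can be cross-checked by hand from the confusion matrix. scikit-learn lays the matrix out with true labels in rows and predicted labels in columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`. The short sketch below recomputes the scores from those four counts; the variable names are only illustrative.

```python
# Counts copied from the confusion matrix above
# (rows: true label, columns: predicted label).
tn, fp = 791, 9
fn, tp = 143, 63

precision = tp / (tp + fp)        # share of predicted positives that are correct
recall = tp / (tp + fn)           # share of actual positives that are found
specificity = tn / (tn + fp)      # share of actual negatives that are found
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(round(precision, 2), round(recall, 2), round(f1, 2), round(balanced_accuracy, 2))
```

Because the baseline outputs hard 0/1 labels rather than scores, its ROC curve has a single operating point, and the resulting ROC-AUC equals the balanced accuracy, which is why both are reported as 0.65. The numbers describe a conservative classifier: it rarely flags a sentence wrongly (9 false positives) but misses most sentences with evidence of social impact (143 false negatives).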
