# Dataset Preparation

The project proposes to leverage machine learning (ML) and natural language processing (NLP) techniques to build a text classifier that automates the processing and identification of evidence of social impact in research documents. The proposal frames this as a binary classification problem: the model takes a sentence from a research document as input and outputs a binary answer (1=True, 0=False) stating whether the sentence contains evidence of social impact. Training a machine-learning algorithm for this task requires a dataset with examples of both kinds, namely sentences that provide evidence of impact and general sentences.

Of all research fields, this project focuses on Medical, Health, and Biological science because the ultimate goal is to understand the social impact of the research projects of the Spanish National Institute of Bioinformatics (INB, by its Spanish acronym), an institution that conducts medical and biological research.

The plan is to collect and process general sentences that commonly appear in scientific documents in the fields of health and biology. The processed sentences will then be used to complement the dataset of evidence of social impact. The source is the full-text dataset of 29,437 health and biology articles produced by Ye et al. as part of their publication [SparkText: Biomedical Text Mining on Big Data Framework](https://www.researchgate.net/publication/308759738_SparkText_Biomedical_Text_Mining_on_Big_Data_Framework).

## Load dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag

In [3]:
project_dir = os.getcwd()
print('Directory of project: {}'.format(project_dir))

Directory of project: /home/jorge/Dropbox/Development/impact-classifier


## Collection of medical and biological sentences

Here, I collect sentences from health and biology academic articles. This part of the notebook uses the full-text dataset of 29,437 health and biology articles produced by Ye et al. as part of their publication [SparkText: Biomedical Text Mining on Big Data Framework](https://www.researchgate.net/publication/308759738_SparkText_Biomedical_Text_Mining_on_Big_Data_Framework). Every row in the dataset holds the full text of one article.

## Load data

Let's load the dataset. I am using Python's built-in `csv` module instead of Pandas because in this case I do not need a full dataframe, only a list of strings representing the articles.

In [3]:
import csv
texts = []
with open('data/source/SparkText_SampleDataset_29437Fulltexts.csv', encoding='utf-8', errors='ignore') as csvfile:
    reader = csv.DictReader(csvfile, fieldnames=['code','text'], delimiter= ' ')
    for row in reader:
        if row['text'] == 'text': # header
            continue
        texts.append(row['text'])

### Sample

In [4]:
texts[0]

'In this case, apoptosis-deficient Becn1+/+ iMMECs stably expressing ERBB2 under low (1%) oxygen and glucose-deprivation conditions showed decreased number of autophagosomes compared with their non-ERBB2-expressing Becn1+/+ counterparts (Fig. 2F and G). Interestingly, the level of autophagy induction in metabolically stressed ERBB2-expressing Becn1+/+ iMMECs was similar to that of vector-expressing Becn1+/? iMMECs (Fig. 2F and G), confirming that ERBB2 overexpression renders mammary epithelial cells partially autophagy-deficient under stress. To further investigate the impact of ERBB2 overexpression on stress-induced autophagy in an alternate system and in an apoptosis-competent background, we used a transient ERBB2 expression system.40 To this intent, Becn1+/+ iMMECs stably overexpressing EGFP-LC3B were transiently transfected with a ERBB2-expressing or vector control plasmid and, after overnight recovery in regular culture medium, were incubated in Hanks medium for up to 3.5 h. Simil

## Split text into sentences

In [5]:
%%time
general_sentences = [sentence for text in texts for sentence in sent_tokenize(text)]

In [6]:
print('There are {0:,} sentences contained in the 29,437 articles in the dataset.'.format(len(general_sentences)))

There are 2,318,588 sentences contained in the 29,437 articles in the dataset.
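
The nested comprehension used for the split flattens the per-article sentence lists into one list. A minimal sketch of that flattening on toy data follows. Note the assumptions: `naive_split` is a hypothetical regex-based stand-in for `sent_tokenize`, used only to keep the example free of NLTK data downloads, and `toy_texts` is invented for illustration.

```python
import re

def naive_split(text):
    """Split on whitespace that follows '.', '!' or '?'.

    Hypothetical stand-in for nltk's sent_tokenize, for illustration only.
    """
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

# Invented toy articles (not taken from the SparkText dataset)
toy_texts = [
    'Cells were incubated overnight. Autophagy was then measured.',
    'Patients were followed for two years.',
]

# Same flattening pattern as in the notebook: one list for all articles
toy_sentences = [sentence for text in toy_texts for sentence in naive_split(text)]
print(len(toy_sentences))
```

This stand-in would wrongly split after abbreviations such as "Fig.", which is one reason to prefer a trained tokenizer for scientific text.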


## Sentence selection

Now, I will randomly select 3,000 sentences to complement the sentences that contain evidence of social impact. The number 3,000 was chosen so that the combined dataset used to train and test the machine learning model has approximately 5,000 sentences. More than 3,000 sentences could be selected, but this would increase the class imbalance in the dataset. **The big assumption here is that the selection is composed of general sentences that do not contain evidence of social impact**. While selecting sentences, I check whether each candidate is grammatically complete. For the purpose of this project, a sentence is considered complete if it has at least two nouns (subject, object) and a verb.
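
The trade-off between sample size and class balance can be made concrete with a little arithmetic. The sketch below assumes roughly 2,000 impact sentences, a figure inferred here from the approximate 5,000 total minus the 3,000 general sentences; the exact count is not stated in this notebook. `class_shares` is a hypothetical helper.

```python
def class_shares(n_general, n_impact):
    """Return the fraction of general and impact sentences in the combined set."""
    total = n_general + n_impact
    return n_general / total, n_impact / total

# Assumption: about 2,000 impact sentences (inferred from 5,000 - 3,000)
n_impact = 2000

for n_general in (2000, 3000, 6000):
    general_share, impact_share = class_shares(n_general, n_impact)
    print(f'{n_general:,} general sentences: '
          f'{general_share:.0%} general vs {impact_share:.0%} impact')
```

Under this assumption, doubling the general sample from 3,000 to 6,000 would shift the split from 60/40 to 75/25.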

In [7]:
def is_sentence_complete(sentence):
    sentence_pos_tags = pos_tag(word_tokenize(sentence))
    num_nouns, num_verb = 2, 1
    nouns_counter, verbs_counter = 0, 0
    for s_pos_tag in sentence_pos_tags:
        s_tag = s_pos_tag[1]
        if s_tag[:2] == 'NN':
            nouns_counter += 1
        if s_tag[:2] == 'VB':
            verbs_counter += 1
    return nouns_counter >= num_nouns and verbs_counter >= num_verb
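
To see the rule in isolation, here is a minimal sketch that applies the same counting to tokens that are already tagged. The tags are hand-assigned Penn Treebank tags (an assumption: a real tagger such as `pos_tag` may label these tokens differently), and `is_complete_from_tags` is a hypothetical helper, not part of the notebook.

```python
def is_complete_from_tags(tagged_tokens, min_nouns=2, min_verbs=1):
    """Apply the completeness rule to (token, tag) pairs.

    Tags starting with 'NN' cover NN, NNS, NNP and NNPS;
    tags starting with 'VB' cover VB, VBD, VBG, VBN, VBP and VBZ.
    """
    nouns = sum(1 for _, tag in tagged_tokens if tag.startswith('NN'))
    verbs = sum(1 for _, tag in tagged_tokens if tag.startswith('VB'))
    return nouns >= min_nouns and verbs >= min_verbs

# Hand-assigned tags, for illustration only
complete = [('Patients', 'NNS'), ('received', 'VBD'), ('treatment', 'NN'), ('.', '.')]
fragment = [('Table', 'NN'), ('2', 'CD'), ('.', '.')]

print(is_complete_from_tags(complete))
print(is_complete_from_tags(fragment))
```

The rule counts tags rather than grammatical roles, so it only approximates the presence of a subject and an object.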

In [8]:
# Draw random indices in [0, len(general_sentences)) until 3,000 complete
# sentences have been collected; each index is used at most once.
total_sentences = 3000
selected_gral_sentences = []
seen_idxs = set()
while len(selected_gral_sentences) < total_sentences:
    remaining = total_sentences - len(selected_gral_sentences)
    random_idxs = np.random.randint(low=0, high=len(general_sentences), size=remaining)
    for idx in random_idxs:
        if idx in seen_idxs:
            continue  # already drawn in this or an earlier round
        seen_idxs.add(idx)
        selected_sentence = general_sentences[idx]
        if is_sentence_complete(selected_sentence):
            selected_gral_sentences.append(selected_sentence)
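
Because no random seed is set, the selection changes on every run. A minimal sketch of a reproducible variant is shown below, using only the standard library. The names `sample_complete` and `has_three_words` and the toy data are hypothetical; in the notebook the predicate would be `is_sentence_complete`.

```python
import random

def sample_complete(sentences, k, is_complete, seed=42):
    """Pick up to k distinct sentences that satisfy is_complete.

    Indices are shuffled once with a seeded generator, so the result is
    reproducible, no index is visited twice, and the loop always ends.
    """
    rng = random.Random(seed)
    idxs = list(range(len(sentences)))
    rng.shuffle(idxs)
    selected = []
    for idx in idxs:
        if is_complete(sentences[idx]):
            selected.append(sentences[idx])
            if len(selected) == k:
                break
    return selected

def has_three_words(sentence):
    """Toy stand-in for the completeness check (assumption, illustration only)."""
    return len(sentence.split()) >= 3

toy_pool = ['a b c', 'x', 'd e f g', 'y z', 'h i j', 'k l m n o']
picked = sample_complete(toy_pool, 3, has_three_words, seed=0)
print(picked)
```

If fewer than `k` sentences pass the check, this variant returns what it found instead of looping forever.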

### Sanity check

In [9]:
print('Out of the {0:,} sentences {1:,} of them were selected'.format(len(general_sentences), len(selected_gral_sentences)))

Out of the 2,318,588 sentences 3,000 of them were selected


### Sample

In [10]:
selected_gral_sentences[50]

'Univariate analysis revealed that NSCLC patients with serum high KLK11 had a longer overall survival (OS) and progression-free survival (PFS) than those with low KLK11 (HR of 0.36, P?=?0.002; HR of 0.46, P?=?0.009).'

### Save sentences

In [11]:
pd.DataFrame(selected_gral_sentences, columns=['sentence']).to_csv('data/sentences/gral_sentences_3000.csv')
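
A minimal sketch of how the saved general sentences might later be combined with the impact sentences, using the 1/0 labels described in the introduction. Everything here is an assumption for illustration: `build_labeled_rows`, the example sentences, and the `sentence`/`label` column layout are hypothetical, and the output is written to an in-memory buffer rather than to a project file.

```python
import csv
import io

def build_labeled_rows(impact_sentences, general_sentences):
    """Label impact sentences with 1 and general sentences with 0."""
    rows = [(sentence, 1) for sentence in impact_sentences]
    rows += [(sentence, 0) for sentence in general_sentences]
    return rows

# Invented example sentences, not taken from either dataset
impact_examples = ['The screening programme reduced hospital admissions in the region.']
general_examples = ['Samples were stored at low temperature before analysis.']

labeled_rows = build_labeled_rows(impact_examples, general_examples)

# Write to an in-memory buffer so the sketch does not touch the file system
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['sentence', 'label'])
writer.writerows(labeled_rows)
print(buffer.getvalue())
```

Labeling every general sentence with 0 relies on the assumption stated earlier that the random selection contains no evidence of social impact.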