# Generation of Social Impact Dataset (Semi-automatic Approach)

In this project, I propose to use machine learning and natural language processing techniques to build a text classifier that automates the identification of evidence of social impact in research documents. The task is framed as binary classification: the model takes a sentence as input and outputs 1 (True) if the sentence contains evidence of social impact and 0 (False) otherwise.

Because the model operates on individual sentences of research documents, training it requires a dataset of example sentences labeled by whether or not they provide evidence of social impact.

In this notebook, I try two techniques, sentiment analysis and k-means clustering, to see whether part of the process of generating the dataset used to train and test the classifier can be automated. Sentiment analysis is used to restrict manual processing to sentences with a positive tone, on the assumption that sentences with evidence of social impact are written in a positive tone.

The final dataset will include both sentences that contain evidence of the social impact of research and sentences that do not.
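
As a rough illustration of the target layout, the sketch below stores each sentence with its binary label and counts the examples per class. The `LabeledSentence` name and the three sentences are made up for illustration and are not taken from the REF data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LabeledSentence:
    """Hypothetical record type: one sentence plus its binary label."""
    text: str
    label: int  # 1 = contains evidence of social impact, 0 = does not


# Made-up sentences, used only to show the shape of the dataset
examples = [
    LabeledSentence("the screening programme reduced mortality by 20%", 1),
    LabeledSentence("the guidelines were adopted by national health services", 1),
    LabeledSentence("the study was conducted between 2008 and 2013", 0),
]

# Count how many examples fall in each class
balance = Counter(example.label for example in examples)
print(balance)
```

Checking the class balance early is useful because a classifier trained on very few negative (or positive) examples tends to be hard to evaluate.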

The dataset will be built from summaries of the societal impact of Medical, Health, and Biological research published by the Research Excellence Framework (REF), the evaluation system used in the United Kingdom to measure the research quality of its higher education institutions.

Of all research fields, this project focuses on Medical, Health, and Biological science because the ultimate goal is to understand the social impact of the research projects of the Spanish National Institute of Bioinformatics (INB, by its Spanish acronym), an institution that conducts medical and biological research.

## Load dependencies

In [22]:
import pandas as pd
import numpy as np
import os
import re

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize, word_tokenize

In [10]:
project_dir = os.getcwd()
print('Directory of project: {}'.format(project_dir))

Directory of project: /home/jorge/Dropbox/Development/impact-classifier


## Load data

The data were extracted from the impact case studies published by the REF (Research Excellence Framework) and contain the summaries of the societal impact of Medical, Health, and Biological research. The summaries can be accessed [here](https://impact.ref.ac.uk/casestudies/Results.aspx?Type=I&Tag=5085).

In [11]:
data = pd.read_csv('data/social_impact_ref_bio_medicine.csv')

In [12]:
data.head()

Unnamed: 0,Case Study Id,Institution,Unit of Assessment,Title,Summary of the impact
0,942,\n University of Kent and University of Gre...,"Allied Health Professions, Dentistry, Nursing ...",\n The Guide to Receptors and Channels: a k...,\n The Guide to Receptors and Channels has ...
1,3405,\nCardiff University\n,"Geography, Environmental Studies and Archaeology",\nChanging people's perceptions of the human:a...,\nThe Cardiff Osteological Research Group (COR...
2,15620,\n Royal Veterinary College\n,"Agriculture, Veterinary and Food Science",\n Giant Animals: evolution and biomechanic...,\n Professor Hutchinson's team has pursued ...
3,16710,\n University of Cambridge\n,Physics,\n Mainstreaming Biological Physics in the ...,\n Material has been prepared for the Insti...
4,18226,\n University of Chester\n,"Psychology, Psychiatry and Neuroscience",\n Informing Global Improvements on the Wel...,\n This research programme has provided con...


The column of interest here is **Summary of the impact**.

In [15]:
print('Data set dimension. Rows: {0}, Columns: {1}'.format(data.shape[0], data.shape[1]))

Data set dimension. Rows: 571, Columns: 5


## Clean and Transform Data

Lowercase the text, replace every character other than ASCII letters, digits, and the percent sign with a space, and split the summaries of impact into sentences.

In [25]:
def process_data(text):
    # lowercase, then replace any character except ASCII letters, digits,
    # and the percent sign with a space (this also covers tabs and newlines)
    text = re.sub(r"[^a-zA-Z0-9%]", " ", text.lower())
    # collapse runs of spaces of any length into a single space
    text = re.sub(r" +", " ", text)
    return text.strip()  # remove leading and trailing spaces
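
The effect of the cleaning step can be checked on a made-up string. The sketch below uses a compact variant, `normalize_text` (a hypothetical name, not part of the notebook). A single pass that replaces a fixed two-space pattern can leave doubled spaces behind when three or more replaced characters are adjacent, so the sketch collapses whitespace runs of any length instead.

```python
import re


def normalize_text(text):
    """Hypothetical compact variant of the cleaning step."""
    # lowercase and replace everything except ASCII letters, digits, and % with a space
    text = re.sub(r"[^a-zA-Z0-9%]", " ", text.lower())
    # collapse whitespace runs of any length into one space, then trim the ends
    return re.sub(r"\s+", " ", text).strip()


sample = "\n  Uptake rose by 25%, (2008-2013).\n"  # made-up text, not from the REF data
print(normalize_text(sample))  # uptake rose by 25% 2008 2013
```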

In [50]:
rows = []
for i in data.index:
    title = process_data(data.loc[i, 'Title'])
    # split into sentences before cleaning, since cleaning removes punctuation
    sentences = sent_tokenize(data.loc[i, 'Summary of the impact'])
    for sentence in sentences:
        rows.append({'title': title, 'impact_sentence': process_data(sentence)})
# DataFrame.append was removed in pandas 2.0; build the frame once from the rows
clean_data = pd.DataFrame(rows, columns=['title', 'impact_sentence'])

In [51]:
clean_data.head()

Unnamed: 0,title,impact_sentence
0,the guide to receptors and channels a key tool...,the guide to receptors and channels has contri...
1,the guide to receptors and channels a key tool...,the key tools it provides have influenced appr...
2,the guide to receptors and channels a key tool...,it is used widely as a teaching aid for under...
3,the guide to receptors and channels a key tool...,it led to the formation of the guide to pharm...
4,changing people s perceptions of the human ani...,the cardiff osteological research group corg h...
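
The order of operations in the loop matters: each summary is split into sentences first and cleaned afterwards, because the cleaning step strips the punctuation that a sentence tokenizer relies on. The sketch below illustrates this with a naive regex splitter standing in for `sent_tokenize` (the helper names and the sample text are assumptions made for illustration).

```python
import re


def naive_split(text):
    # Stand-in for a real sentence tokenizer: split after ., ! or ? followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def clean(text):
    # Same idea as the notebook's cleaning step: keep ASCII letters, digits, and %
    text = re.sub(r"[^a-zA-Z0-9%]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


raw = "The guide is used widely. It led to a new database."  # made-up summary

split_then_clean = [clean(s) for s in naive_split(raw)]
clean_then_split = naive_split(clean(raw))

print(len(split_then_clean))  # 2
print(len(clean_then_split))  # 1
```

Cleaning first leaves a single run of text with no sentence boundaries, so the second variant returns one item instead of two.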


## Apply Sentiment Analysis

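Sentiment analysis is used to restrict manual processing to sentences with a positive tone, since sentences with evidence of social impact are expected to be written in a positive tone. Each sentence is scored with VADER, and its compound score is mapped to a positive, neutral, or negative label using cutoffs of 0.05 and -0.05.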

In [53]:
analyzer = SentimentIntensityAnalyzer()

In [54]:
def analyze_sentiment(row):
    # VADER's compound score is a normalized value between -1 and 1
    score = analyzer.polarity_scores(row['impact_sentence'])
    ret_dict = {'title': row['title'], 'sentence': row['impact_sentence']}
    if score['compound'] >= 0.05:
        ret_dict['sentiment'] = 'positive'
    elif score['compound'] > -0.05:
        ret_dict['sentiment'] = 'neutral'
    else:
        # compound <= -0.05
        ret_dict['sentiment'] = 'negative'
    ret_dict['score'] = score['compound']
    return ret_dict
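
The thresholding logic can be separated from the scoring so that it can be checked without running VADER. The sketch below is one possible way to do this. `label_from_compound` and its `threshold` parameter are hypothetical names, and the cutoffs follow the convention documented for VADER (positive at 0.05 and above, negative at -0.05 and below, neutral in between).

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label (hypothetical helper)."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# Boundary and interior values
for value in (0.6, 0.05, 0.0, -0.05, -0.7):
    print(value, label_from_compound(value))
```

In this sketch a score of exactly -0.05 counts as negative and a score of exactly 0.05 counts as positive.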

In [55]:
# apply() with a dict-returning function yields a Series of dicts, one per sentence
sa_impact_text = clean_data.apply(analyze_sentiment, axis=1)

In [56]:
sa_impact_text.head()

0    {'title': 'the guide to receptors and channels...
1    {'title': 'the guide to receptors and channels...
2    {'title': 'the guide to receptors and channels...
3    {'title': 'the guide to receptors and channels...
4    {'title': 'changing people s perceptions of th...
dtype: object
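
Since the result above is a Series of dictionaries rather than a table, one possible next step is to expand it into a DataFrame and keep only the positive sentences for manual review. The sketch below assumes that goal. The `records` Series is a hand-written stand-in for `sa_impact_text`, and its sentences and scores are made up (they are not VADER output).

```python
import pandas as pd

# Hypothetical stand-in for the Series of dicts produced by the apply() call
records = pd.Series([
    {'title': 't1', 'sentence': 'the tool improved patient care',
     'sentiment': 'positive', 'score': 0.6},
    {'title': 't1', 'sentence': 'the study ran from 2008 to 2013',
     'sentiment': 'neutral', 'score': 0.0},
    {'title': 't2', 'sentence': 'the disease causes severe pain',
     'sentiment': 'negative', 'score': -0.7},
])

# Turn the dicts into columns: one row per sentence
sa_table = pd.DataFrame(records.tolist())

# Keep only the sentences labeled as positive
positive_only = sa_table[sa_table['sentiment'] == 'positive'].reset_index(drop=True)
print(positive_only[['sentence', 'score']])
```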