# Tutorial: Creating Recommender Systems Datasets in Scientific Fields

- [1. Data retrieval and cleaning](##1.-Data-retrieval-and-cleaning)
    - [1.1. Import libraries](###1.1.-Import-libraries)
    - [1.2. Retrieve CORD-19](###1.2.-Retrieve-CORD-19)
    - [1.3. Exploring the articles of the dataset](###1.3.-Exploring-the-articles-of-the-dataset)
    - [1.4. Selecting a sample of articles to build our scientific recommendation dataset](###1.4.-Selecting-a-sample-of-articles-to-build-our-scientific-recommendation-dataset)
- [2. Named Entity Recognition (NER) + Named Entity Linking (NEL)](#2.)
    - [2.1. Import libraries](###2.1.-Import-libraries)
    - [2.2. Configure MER](#2.2.-Configure-MER)
    - [2.3. Extract the entities in a single file](###2.3.-Extract-the-entities-in-a-single-file)
    - [2.4. Create entity files](###2.4.-Create-entity-files)
- [3. Creating the recommendation dataset](##3.-Creating-the-recommendation-dataset) 

## 1. Data retrieval + cleaning

**Objective**: To retrieve the [COVID-19 Open Research Dataset (CORD-19)](https://www.semanticscholar.org/cord19) and to select a sample of complete English articles (authors' info, title, body text) to build a scientific recommendation dataset.

CORD-19 includes coronavirus-related research articles extracted from several sources, such as PubMed, bioRxiv, medRxiv, and WHO.

### 1.1. Import libraries

In [None]:
import json
import os
import pandas as pd
import requests
from googletrans import Translator

### 1.2. Retrieve CORD-19

CORD-19 is a large dataset, so this tutorial uses a smaller version of it, located in the directory "cord19_small".

However, if you want to retrieve the entire dataset, you can run the following code:

In [None]:
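version = 'cord-19_2020-05-12.tar.gz'

url = 'https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/' + version

# Download the archive in chunks and save it to disk (the file is large)
response = requests.get(url, stream=True)
response.raise_for_status()

with open(version, 'wb') as archive_file:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        archive_file.write(chunk)

# Extract the downloaded archive into the current directory
os.system('tar -xzf ' + version)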

Let's explore the contents of the dataset directory, particularly, the metadata file.

In [None]:
dataset_dir = 'cord19_small/'
metadata_filepath = dataset_dir + 'metadata.csv'

metadata = pd.read_csv(metadata_filepath, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str) 

Now we have a DataFrame with the contents of the metadata file.

To print column names and first row:

In [None]:
metadata.head(1)

To access individual rows/articles:

In [None]:
metadata.loc[0]

To access an individual column, such as 'title':

In [None]:
metadata['title']

Let's count the non-missing values in each column:

In [None]:
metadata.count()

By looking at these counts, we can see that the number of records with data for the column 'cord_uid' is higher than the number of records with data for the column 'authors', which means that some articles lack author information.

For our dataset, we only want to include articles with the following characteristics:
- authors' information
- available title
- available body text
- article text expressed in English
- non-duplicate articles
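The criteria on authors, titles, and duplicates can be checked directly on a DataFrame. The following is a minimal sketch on a toy DataFrame: the rows are invented for illustration, only the column names come from the metadata file, and using 'cord_uid' as the duplicate key is an assumption made here.

```python
import pandas as pd

# Toy metadata with the same column names as metadata.csv (values are made up)
toy_metadata = pd.DataFrame({
    'cord_uid': ['a1', 'a2', 'a2', 'a3', 'a4'],
    'title': ['Title A', 'Title B', 'Title B', None, 'Title D'],
    'authors': ['Doe, J.', 'Roe, R.', 'Roe, R.', 'Poe, E.', None],
})

# Keep rows that have both a title and authors, then drop repeated identifiers
filtered = (
    toy_metadata
    .dropna(subset=['title', 'authors'])
    .drop_duplicates(subset='cord_uid')
)

print(filtered['cord_uid'].tolist())
```

The language and body-text criteria cannot be checked this way, because they require inspecting the title text and the article file itself.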

### 1.3. Exploring the articles of the dataset

Let's consider the first article appearing in the metadata file.
To check if information about the article's author is available:

In [None]:
metadata.loc[0]['authors']

To check if there is an available title:

In [None]:
metadata.loc[0]['title']

We can see that the title is written in English, but it would not be efficient to check the language of every article in the dataset manually, so we will apply a language detection tool, the Python library [Googletrans](https://pypi.org/project/googletrans/):

In [None]:
translator = Translator()

title1 = metadata.loc[0]['title']

title1_lang = translator.detect(title1).lang

print("Title:", title1, "\nLanguage:", title1_lang)

The tool detects English as the language of the title.

Let's check another article:

In [None]:
title2 = metadata.loc[107]['title']
print(metadata.loc[107])
title2_lang = translator.detect(title2).lang
print("Title:", title2, "\nLanguage:", title2_lang)


In this case, the tool detects German ('de') as the language of the title.
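Googletrans relies on a remote service, so a detection call can fail or be slow when it is repeated for many titles. The following is a minimal sketch of a defensive wrapper with a small cache. The names `detect_language` and `fake_detect` are hypothetical and introduced only for this illustration; the wrapper accepts any callable, so it can be tried without network access.

```python
def detect_language(text, detect, cache=None):
    """Return a language code for `text`, or None if detection fails.

    `detect` is any callable that maps a string to a language code,
    for example: lambda t: translator.detect(t).lang
    """
    if not isinstance(text, str) or not text.strip():
        return None  # missing values (NaN) and empty titles have no language
    if cache is not None and text in cache:
        return cache[text]
    try:
        lang = detect(text)
    except Exception:
        lang = None  # treat any detection error as "unknown"
    if cache is not None:
        cache[text] = lang
    return lang


# Stand-in detector used only for this illustration (not a real language model)
def fake_detect(text):
    if 'und' in text.split():
        return 'de'
    return 'en'


cache = {}
print(detect_language('Diagnostik und Therapie', fake_detect, cache))
print(detect_language('Clinical features of pneumonia', fake_detect, cache))
print(detect_language(float('nan'), fake_detect, cache))
```

With the real library, the second argument could be a small function wrapping `translator.detect`, as suggested in the docstring.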

Now, we want to check if there is an available abstract for the first article:

In [None]:
metadata.loc[0]['abstract']

The metadata file does not contain the article's text besides abstract and title, but we can access the file associated with the article using the provided information in the columns 'pdf_json_files' or 'pmc_json_files':

In [None]:
metadata.loc[0]['pdf_json_files']
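A record may reference more than one JSON file. The sketch below assumes that multiple paths are separated by semicolons within the same field, and that a missing value is read by pandas as NaN. The function name `split_json_paths` and the example paths are hypothetical.

```python
def split_json_paths(field):
    """Split a metadata path field into a list of individual file paths."""
    if not isinstance(field, str):
        return []  # missing values are read by pandas as NaN (a float)
    return [path.strip() for path in field.split(';') if path.strip()]


# Illustrative values only; real paths come from the metadata file
single = 'document_parses/pdf_json/aaa.json'
multiple = 'document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json'

print(split_json_paths(single))
print(split_json_paths(multiple))
print(split_json_paths(float('nan')))
```

Checking the length of the returned list is one way to tell apart records with no file, exactly one file, or several files.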

Let's open the file, which is in [JSON](https://www.json.org/json-en.html) format:

In [None]:
article1_filepath = dataset_dir + metadata.loc[0]['pdf_json_files']

with open(article1_filepath, encoding='utf-8') as article1_file:
    article1_data = json.load(article1_file)

Now we have the file content stored in a dictionary with the following keys:

In [None]:
article1_data.keys()

The article content is the following:

In [None]:
article1_data

Let's check the body text:

In [None]:
body_text = article1_data['body_text']
body_text

We have a list of dictionaries, where each dictionary is a paragraph belonging to a given section of the article. We want to join the scattered text into a single string:

In [None]:
article1_text = str()

for paragraph in body_text:
    article1_text += paragraph['text'] + '\n'

print(article1_text)
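The same joining step can be written with `str.join`, and the paragraphs can also be grouped by section. The sketch below uses hand-written toy paragraphs and assumes that each paragraph dictionary carries a 'section' key next to 'text'.

```python
# Toy paragraphs mimicking the structure of 'body_text' (content is made up)
toy_body_text = [
    {'section': 'Introduction', 'text': 'First paragraph.'},
    {'section': 'Introduction', 'text': 'Second paragraph.'},
    {'section': 'Methods', 'text': 'Third paragraph.'},
]

# Join all paragraphs into one string (same text as the loop, minus the trailing newline)
full_text = '\n'.join(paragraph['text'] for paragraph in toy_body_text)

# Group the paragraphs by section, keeping the order in which sections appear
sections = {}
for paragraph in toy_body_text:
    sections.setdefault(paragraph.get('section', ''), []).append(paragraph['text'])

print(full_text)
print(list(sections.keys()))
```

Grouping by section is optional for this tutorial, which only needs the full text, but it can help when inspecting where in an article a given passage occurs.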

All good! We will include this article in our dataset.

### 1.4. Selecting a sample of articles to build our scientific recommendation dataset

Instead of repeating each operation for each file individually, let us adapt our code to automatically select a sample containing 100 preprocessed articles.

First, create the output directory:

In [None]:
out_dir = 'cord19_sample/'

if not os.path.exists(out_dir):
    os.mkdir(out_dir)

Then, initialize the necessary variables:

In [None]:
dataset_dir = '2020-05-19/'
metadata_filepath = dataset_dir + 'metadata.csv'
max_articles = 100
valid_articles_count = int()
out_articles_ids = list()
translator = Translator()

Open the metadata file:

In [None]:
metadata = pd.read_csv(metadata_filepath, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str) 

Then iterate over the records in the metadata file and choose only the relevant ones:

In [None]:
import shutil

valid_articles_count = 0
total_articles = 0

for index, record in metadata.iterrows():
    total_articles = index + 1

    # Skip records without authors, title, or an associated JSON file
    # (missing values are read as NaN, which is truthy, so test them explicitly)
    if pd.isna(record['authors']) or pd.isna(record['title']) \
            or pd.isna(record['pdf_json_files']):
        continue

    # Skip duplicates, identified by PubMed ID when one is available
    pubmed_id = record['pubmed_id']
    if pd.notna(pubmed_id) and pubmed_id in out_articles_ids:
        continue

    article_filepath = record['pdf_json_files']

    # Keep only records that point to a single JSON file
    if article_filepath.count("document") != 1:
        continue

    # Keep only articles whose title is detected as English
    if translator.detect(record['title']).lang != 'en':
        continue

    article_filepath_up = dataset_dir + article_filepath

    with open(article_filepath_up, encoding='utf-8') as article_file:
        article_data = json.load(article_file)

    if 'body_text' in article_data:
        # Copy the article to the output directory, named after its SHA
        shutil.copy(article_filepath_up, out_dir + record['sha'] + '.json')
        if pd.notna(pubmed_id):
            out_articles_ids.append(pubmed_id)
        valid_articles_count += 1

    # Stop as soon as the sample is complete
    if valid_articles_count == max_articles:
        break

print("TOTAL", str(total_articles))

Let's check if the output dir contains 100 articles:

In [None]:
article_count = len(os.listdir(out_dir))
assert article_count == max_articles, 'Invalid number of article(s): {}! Expected number: {}'.format(article_count, max_articles)

At the end of this section, we now have a sample including 100 articles that will be the basis of our scientific recommendation dataset.

## 2. Named Entity Recognition (NER) + Named Entity Linking (NEL)

**Objective**: To recognize chemical and disease entities in the retrieved articles and to link them to the respective ontology identifiers.

We are going to use the [Disease Ontology](https://disease-ontology.org/) (DO), and the [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) (ChEBI) ontology.

To perform NER and NEL, we are going to apply the Minimal Named-Entity Recognizer ([MER](https://pypi.org/project/merpy/)) tool.

### 2.1. Import libraries
<a id='#2.2'></a>

In [None]:
import json
import os
import merpy
from collections import Counter

### 2.2. Configure MER

First, we need to download the .owl file associated with ChEBI:

In [None]:
merpy.download_lexicon("ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl",
                       "chebi", ltype="owl")

Then, we need to process the downloaded file into a lexicon that MER can use:

In [None]:
merpy.process_lexicon("chebi", ltype="owl")

We are going to delete obsolete concepts still present in the ontology file:

In [None]:
merpy.delete_obsolete("chebi")

We need to repeat the operations for the DO:

In [None]:
merpy.download_lexicon("http://purl.obolibrary.org/obo/doid.owl", 
                        "do", ltype="owl")
            
merpy.process_lexicon("do", ltype="owl")

merpy.delete_obsolete("do")

Let's check the lexicons available for MER:

In [None]:
merpy.show_lexicons()

### 2.3. Extract the entities in a single file

Let's retrieve a file from the articles sample:

In [None]:
dataset_dir = 'cord19_small/'

with open(dataset_dir + '348055649b6b8cf2b9a376498df9bf41f7123605.json') as article1_file:
    article_data = json.load(article1_file)

Let's check the contents of the article:

In [None]:
article_data.keys()

We want to recognize the entities present in title, abstract, and body. 

First, let's retrieve the title, which is a value associated with the key 'metadata':

In [None]:
title = article_data['metadata']['title']
title

Then, we apply MER to the title in order to recognize disease entities and to link them to DO concepts:

In [None]:
merpy.get_entities(title, 'do')

Let's check if the annotations make sense. For instance, access the link http://purl.obolibrary.org/obo/DOID_850.

The entity 'lung disease' in the article was linked to the DO concept 'lung disease' with the ID 'DOID:850', which seems correct!

Let's apply MER to recognize chemical entities and to link them to ChEBI concepts:

In [None]:
merpy.get_entities(title, 'chebi')

Accessing the link http://purl.obolibrary.org/obo/CHEBI_16480, we can see that the entity 'nitric oxide' was linked to the ChEBI concept 'nitric oxide', which has the ID 'CHEBI:16480'.
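Judging from how the annotations are indexed later in this tutorial (`entity[3]`), each annotation is assumed here to be a list of four strings: start offset, end offset, matched text, and concept URI. The sketch below converts such a URI into the compact identifier form used above. The helper `uri_to_id` is hypothetical, and the sample annotations are hand-written with illustrative offsets rather than actual MER output.

```python
def uri_to_id(uri):
    """Convert an OBO PURL such as .../obo/DOID_850 into 'DOID:850'."""
    local_name = uri.rstrip('/').rsplit('/', 1)[-1]
    prefix, _, number = local_name.partition('_')
    return prefix + ':' + number if number else local_name


# Hand-written annotations in the assumed [start, end, text, URI] layout
sample_annotations = [
    ['10', '22', 'lung disease', 'http://purl.obolibrary.org/obo/DOID_850'],
    ['30', '42', 'nitric oxide', 'http://purl.obolibrary.org/obo/CHEBI_16480'],
]

for start, end, text, uri in sample_annotations:
    print(text, '->', uri_to_id(uri))
```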

We add the disease and chemical entities to a single list:

In [None]:
title_entities = merpy.get_entities(title, 'do') + merpy.get_entities(title, 'chebi')

title_entities

Now, we are going to apply MER to recognize entities in the abstract:

In [None]:
abstract = article_data['abstract'][0]['text']

abstract_entities = merpy.get_entities(abstract, 'do') + merpy.get_entities(abstract, 'chebi')

abstract_entities

Let's apply MER to the text associated with the body of the article:

In [None]:
body_text = str()

for section in article_data['body_text']:
    body_text += section['text'] + "\n"

body_entities = merpy.get_entities(body_text, "do") + merpy.get_entities(body_text, "chebi")

body_entities

Finally, we need to obtain information about the frequency of each ontology identifier in the document:

In [None]:
total_entities = title_entities + abstract_entities + body_entities

all_uris = [entity[3] for entity in total_entities if len(entity) == 4]

entity_counter = Counter(all_uris)

entity_counter

To sort the URIs by descending frequency:

In [None]:
entity_counter = {
    k: v 
    for k, v in sorted(entity_counter.items(), key=lambda item: item[1], reverse=True)
    }

entity_counter 
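As an alternative to sorting the items by hand, `Counter` offers the `most_common()` method, which returns the (item, count) pairs ordered from the most to the least frequent. The following is a minimal sketch with a hand-written list of URIs:

```python
from collections import Counter

# Hand-written list of URIs standing in for the extracted annotations
toy_uris = [
    'http://purl.obolibrary.org/obo/DOID_850',
    'http://purl.obolibrary.org/obo/CHEBI_16480',
    'http://purl.obolibrary.org/obo/DOID_850',
]

toy_counter = Counter(toy_uris)

# most_common() yields pairs sorted by descending count; dict() keeps that order
sorted_counts = dict(toy_counter.most_common())

print(sorted_counts)
```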

### 2.4. Create entity files

We need to adapt our code to perform NER and NEL in all documents of our sample.

First, create the output dir:

In [None]:
out_dir = 'cord19_sample_entities/'

if not os.path.exists(out_dir):
    os.mkdir(out_dir)

Then, initialize the necessary variables:

In [None]:
article_dir = 'cord19_small/'

output = {'id': str(), 'entities': {}, 'sections': {'title': [], 'abstract': [], 'body': []}}

Next, we are going to iterate over each file present in the sample directory, annotate it, and create the respective entity file:

In [None]:
for article in os.listdir(article_dir):
    
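    # Skip anything that is not a JSON file (for example, a metadata file)
    if not article.endswith('.json'):
        continue

    # Open the article file
    with open(article_dir + article, encoding='utf-8') as input_file:
        article_data = json.load(input_file)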
    
    output['id'] = article_data['paper_id']
    
    # Annotate the title
    title = article_data['metadata']['title']
    title_entities = merpy.get_entities(title, 'do') + merpy.get_entities(title, 'chebi')
    output['sections']['title'] = title_entities
    
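    # Annotate the abstract (the list is empty when an article has none)
    abstract_paragraphs = article_data.get('abstract', [])
    abstract = abstract_paragraphs[0]['text'] if abstract_paragraphs else ''
    abstract_entities = []
    if abstract:
        abstract_entities = merpy.get_entities(abstract, 'do') + merpy.get_entities(abstract, 'chebi')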
    output['sections']['abstract'] = abstract_entities
    
    # Combine the body text and annotate it
    body_text = str()

    for section in article_data['body_text']:
        body_text += section['text'] + '\n'

    body_entities = merpy.get_entities(body_text, 'do') + merpy.get_entities(body_text, 'chebi')
    output['sections']['body'] = body_entities

    # Count URIs frequencies and sort them
    total_entities = title_entities + abstract_entities + body_entities
    all_uris = [entity[3] for entity in total_entities if len(entity)==4]
    entity_counter = Counter(all_uris)
    
    output['entities'] = {
        k: v 
        for k, v in sorted(entity_counter.items(), key=lambda item: item[1], reverse=True)
        }

    # Generate JSON file with output
    out_filepath = out_dir + output['id'] + '_entities.json'
    
    with open(out_filepath, 'w') as out_file:
        out_file.write(json.dumps(output, indent=4))

Now we have both the article files ('cord19_sample' dir) and the respective entity files ('cord19_sample_entities' dir). The next step is to generate the scientific recommendation dataset.
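Before moving on, it can be useful to read the entity files back and inspect them as (article, entity, frequency) triples. The sketch below is one possible way to do so and is not necessarily how the recommendation dataset is built. The function `load_entity_rows` is hypothetical, and the demonstration writes a single hand-written entity file to a temporary directory instead of reading the real output.

```python
import json
import os
import tempfile


def load_entity_rows(entities_dir):
    """Return (article id, URI, frequency) triples from every entity file."""
    rows = []
    for filename in sorted(os.listdir(entities_dir)):
        if not filename.endswith('_entities.json'):
            continue
        with open(os.path.join(entities_dir, filename), encoding='utf-8') as entity_file:
            data = json.load(entity_file)
        for uri, frequency in data['entities'].items():
            rows.append((data['id'], uri, frequency))
    return rows


# Demonstration on a temporary directory with one hand-written entity file
with tempfile.TemporaryDirectory() as tmp_dir:
    example = {
        'id': 'article1',
        'entities': {'http://purl.obolibrary.org/obo/DOID_850': 2},
        'sections': {'title': [], 'abstract': [], 'body': []},
    }
    example_path = os.path.join(tmp_dir, 'article1_entities.json')
    with open(example_path, 'w', encoding='utf-8') as out_file:
        json.dump(example, out_file)

    rows = load_entity_rows(tmp_dir)

print(rows)
```

On the real output, the call would be `load_entity_rows('cord19_sample_entities/')`.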

## 3. Creating the recommendation dataset