# Tutorial: Creating Recommender Systems Datasets in Scientific Fields

- [1. Data retrieval and cleaning](##1.-Data-retrieval-and-cleaning)
    - [1.1. Import libraries](###1.1.-Import-libraries)
    - [1.2. Retrieve CORD-19](###1.2.-Retrieve-CORD-19)
    - [1.3. Exploring the articles of the dataset](###1.3.-Exploring-the-articles-of-the-dataset)
    - [1.4. Selecting a sample of articles to build our scientific recommendation dataset](###1.4.-Selecting-a-sample-of-articles-to-build-our-scientific-recommendation-dataset)
- [2. Named Entity Recognition (NER) + Named Entity Linking (NEL)](#2.)
    - [2.1. Import libraries](###2.1.-Import-libraries)
    - [2.2. Configure MER](#2.2.-Configure-MER)
    - [2.3. Extract the entities in a single file](###2.3.-Extract-the-entities-in-a-single-file)
    - [2.4. Create entity files](###2.4.-Create-entity-files)
- [3. Creating the recommendation dataset](##3.-Creating-the-recommendation-dataset) 

## 1. Data retrieval + cleaning

**Objective**: To retrieve the [COVID-19 Open Research Dataset (CORD-19)](https://www.semanticscholar.org/cord19) and to select a sample of complete English articles (authors' info, title, body text) to build a scientific recommendation dataset.

CORD-19 includes coronavirus-related research articles extracted from several sources, such as PubMed, bioRxiv, medRxiv, and the WHO.

### 1.1. Import libraries

In [1]:
import json
import os
import pandas as pd
import requests
from langdetect import detect

### 1.2. Retrieve CORD-19

CORD-19 is a large dataset, so this tutorial uses a smaller version of it, located in the directory "cord19_small".

However, if you want to retrieve the entire dataset, you can run the following code:

In [None]:
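version = 'cord-19_2020-05-12.tar.gz'

url = 'https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/' + version

# Download the archive and save it to disk in chunks
response = requests.get(url, stream=True)
response.raise_for_status()

with open(version, 'wb') as archive_file:
    for chunk in response.iter_content(chunk_size=1024 * 1024):
        archive_file.write(chunk)

# Extract the archive (-x extracts, whereas -c would create a new archive)
os.system('tar -xzf ' + version)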

Let's explore the contents of the dataset directory, particularly, the metadata file.

In [2]:
dataset_dir = 'cord19_small/'
metadata_filepath = dataset_dir + 'metadata.csv'

metadata = pd.read_csv(metadata_filepath, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str) 

Now we have a DataFrame with the contents of the metadata file.

To print column names and first row:

In [3]:
metadata.head(1)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,


To access individual rows/articles:

In [4]:
metadata.loc[0]

cord_uid                                                     ug7v899j
sha                          d1aafb70c066a2068b02786f8929fd9c900897fb
source_x                                                          PMC
title               Clinical features of culture-proven Mycoplasma...
doi                                             10.1186/1471-2334-1-6
pmcid                                                        PMC35282
pubmed_id                                                    11472636
license                                                         no-cc
abstract            OBJECTIVE: This retrospective chart review des...
publish_time                                               2001-07-04
authors                           Madani, Tariq A; Al-Ghamdi, Aisha A
journal                                                BMC Infect Dis
mag_id                                                            NaN
who_covidence_id                                                  NaN
arxiv_id            

To access the individual column 'title':

In [5]:
metadata['title']

0      Clinical features of culture-proven Mycoplasma...
1      Nitric oxide: a pro-inflammatory mediator in l...
2        Surfactant protein-D and pulmonary host defense
3                   Role of endothelin-1 in lung disease
4      Gene expression in epithelial cells in respons...
                             ...                        
195            Families and clans of cysteine peptidases
196                           Viral cysteine proteinases
197                       Akute Bronchitis und Influenza
198                          Akute Exazerbation bei COPD
199    Außerhalb des Krankenhauses erworbene Pneumoni...
Name: title, Length: 200, dtype: object

Let's count the non-missing values in each column:

In [6]:
metadata.count()

cord_uid            200
sha                 200
source_x            200
title               200
doi                 199
pmcid               199
pubmed_id           198
license             200
abstract            170
publish_time        200
authors             188
journal             199
mag_id                0
who_covidence_id      0
arxiv_id              1
pdf_json_files      200
pmc_json_files      142
url                 200
s2_id                 1
dtype: int64

The counts show that some columns are incomplete: all 200 records have a value for 'cord_uid', but only 188 have a value for 'authors', so some articles lack author information.

For our dataset, we only want to include articles with the following characteristics:
- authors' information
- available title
- available body text
- article text expressed in English
- non-duplicate articles

### 1.3. Exploring the articles of the dataset

Let's consider the first article appearing in the metadata file.
To check if information about the article's author is available:

In [7]:
metadata.loc[0]['authors']

'Madani, Tariq A; Al-Ghamdi, Aisha A'

To check if there is an available title:

In [8]:
metadata.loc[0]['title']

'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia'

We can see that the title is expressed in English, but it would not be efficient to check the language of every article in the dataset, so we will apply the language detection tool [langdetect](https://pypi.org/project/langdetect/):

In [9]:
title1 = metadata.loc[0]['title']

title1_lang = detect(title1)

print("Title:", title1, "\nLanguage:", title1_lang)

Title: Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia 
Language: en


The tool detects English as the language of the title.

Let's check another article:

In [10]:
title2 = metadata.loc[144]['title']

title2_lang = detect(title2)

print("Title:", title2, "\nLanguage:", title2_lang)

Title: Deutsche Gesellschaft für Pharmakologie und Toxikologie Abstracts of the 34th Spring Meeting 16–18 March 1993, Mainz 
Language: de


In this case, the tool detects German ('de') as the language of the title.
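
Language detectors can fail on unusual input: a missing title is not a string, and, as far as we know, `detect` raises an exception when the text has no usable features (for example, only digits). A defensive wrapper, sketched below, returns `None` in those cases. The name `detect_or_none` and the idea of passing the detector as an argument are choices made for this illustration; in the notebook one would pass `detect` from langdetect instead of the toy detector.

```python
def detect_or_none(text, detect_fn):
    """Return the language code for text, or None if it cannot be detected.

    detect_fn is any callable mapping a string to a language code,
    e.g. langdetect's detect (this wrapper is not part of langdetect).
    """
    if not isinstance(text, str) or not text.strip():
        return None
    try:
        return detect_fn(text)
    except Exception:
        # The detector could not decide on a language for this input
        return None


# Stand-in detector used only to keep this sketch self-contained
def toy_detector(text):
    if not any(ch.isalpha() for ch in text):
        raise ValueError('no features in text')
    return 'de' if 'ü' in text else 'en'


print(detect_or_none('Deutsche Gesellschaft für Pharmakologie', toy_detector))
print(detect_or_none('12345', toy_detector))
print(detect_or_none(float('nan'), toy_detector))
```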

Now, we want to check if there is an available abstract for the first article:

In [11]:
metadata.loc[0]['abstract']

'OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and malaise (58.8%) were 

The metadata file does not contain the article's text besides abstract and title, but we can access the file associated with the article using the provided information in the columns 'pdf_json_files' or 'pmc_json_files':

In [12]:
metadata.loc[0]['pdf_json_files']

'document_parses/pdf_json/d1aafb70c066a2068b02786f8929fd9c900897fb.json'

Let's open the file, which is in [JSON](https://www.json.org/json-en.html) format:

In [13]:
article1_filepath = dataset_dir + metadata.loc[0]['pdf_json_files']

with open(article1_filepath, encoding='utf-8') as article1_file:
    article1_data = json.load(article1_file)

Now we have the file content stored in a dictionary with the following keys:

In [14]:
article1_data.keys()

dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])

The article content is the following:

In [15]:
article1_data

{'paper_id': 'd1aafb70c066a2068b02786f8929fd9c900897fb',
 'metadata': {'title': 'Clinical features of culture-proven Mycoplasma pneumoniae infections at King',
  'authors': [{'first': 'Tariq',
    'middle': ['A'],
    'last': 'Madani',
    'suffix': '',
    'affiliation': {},
    'email': ''},
   {'first': 'Aisha',
    'middle': ['A'],
    'last': 'Al-Ghamdi',
    'suffix': '',
    'affiliation': {},
    'email': ''}]},
 'abstract': [{'text': 'Objective: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia.',
   'cite_spans': [],
   'ref_spans': [],
   'section': 'Abstract'},
  {'text': 'Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed.',
   'cite_spans': [],
   'ref_spans': [],
   'secti
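
Note that the JSON file stores the authors as a list of dictionaries, whereas the metadata file stores them as one semicolon-separated string. A possible conversion between the two is sketched below. It assumes every name follows the 'Last, First Middle' pattern seen in the first record, which may not hold for the whole dataset, and the function name `parse_authors` is hypothetical.

```python
def parse_authors(authors_str):
    """Convert 'Last, First Middle; ...' into a list of author dictionaries."""
    authors = []
    for name in authors_str.split(';'):
        name = name.strip()
        if not name:
            continue
        # Everything before the first comma is taken as the last name
        last, _, given = name.partition(',')
        given_parts = given.split()
        authors.append({
            'first': given_parts[0] if given_parts else '',
            'middle': given_parts[1:],
            'last': last.strip(),
            'suffix': '',
            'affiliation': {},
            'email': '',
        })
    return authors


parsed = parse_authors('Madani, Tariq A; Al-Ghamdi, Aisha A')
print([author['last'] for author in parsed])
```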

Let's check the body text:

In [16]:
body_text = article1_data['body_text']
body_text

[{'text': 'Mycoplasma pneumoniae is a common cause of upper and lower respiratory tract infections. It remains one of the most frequent causes of atypical pneumonia particu-larly among young adults. [1, 2, 3, 4, 5] Although it is highly transmissible, most infections caused by this organism are relatively minor and include pharyngitis, tracheobronchitis, bronchiolitis, and croup with one fifth of in-fections being asymptomatic. [6, 7] Only 3 -10% of infected subjects develop symptoms consistent with bronchopneumonia and mortality from infection is rare. [6, 7] The organism is fastidious and difficult to grow on cultures. Therefore, diagnosis of infections caused by this organism is usually confirmed with serological tests or polymerase chain reaction-gene amplification techniques. At King Abdulaziz University Hospital (KAUH), Jeddah, Saudi Arabia, the facility to perform Mycoplasma culture has been available since January 1997. As published information concerning M. pneumoniae infectio

We have a list of dictionaries, where each dictionary is a paragraph belonging to a given section of the article. We want to join the scattered text into a single string:

In [17]:
article1_text = str()

for paragraph in body_text:
    article1_text += paragraph['text'] + '\n'

article1_text

'Mycoplasma pneumoniae is a common cause of upper and lower respiratory tract infections. It remains one of the most frequent causes of atypical pneumonia particu-larly among young adults. [1, 2, 3, 4, 5] Although it is highly transmissible, most infections caused by this organism are relatively minor and include pharyngitis, tracheobronchitis, bronchiolitis, and croup with one fifth of in-fections being asymptomatic. [6, 7] Only 3 -10% of infected subjects develop symptoms consistent with bronchopneumonia and mortality from infection is rare. [6, 7] The organism is fastidious and difficult to grow on cultures. Therefore, diagnosis of infections caused by this organism is usually confirmed with serological tests or polymerase chain reaction-gene amplification techniques. At King Abdulaziz University Hospital (KAUH), Jeddah, Saudi Arabia, the facility to perform Mycoplasma culture has been available since January 1997. As published information concerning M. pneumoniae infections in Saud
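
An equivalent way to build the string, sketched here on toy data, is `str.join`, which avoids repeated string concatenation. Unlike the loop above, it does not leave a trailing newline.

```python
# Toy paragraphs with the same shape as the entries of 'body_text'
body_text = [
    {'text': 'First paragraph.', 'section': 'Introduction'},
    {'text': 'Second paragraph.', 'section': 'Methods'},
]

# Join the text of every paragraph, separated by newlines
article_text = '\n'.join(paragraph['text'] for paragraph in body_text)
print(article_text)
```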

All good! We will include this article in our dataset.

### 1.4. Selecting a sample of articles to build our scientific recommendation dataset

Instead of repeating each operation for each file individually, let us adapt our code to automatically select a sample containing 100 preprocessed articles.

First, create the output directory:

In [24]:
out_dir = 'sample/'

if not os.path.exists(out_dir):
    os.mkdir(out_dir)

Then, initialize the necessary variables:

In [19]:
max_articles = 100 #number of articles to include in the sample
dataset_dir = 'cord19_small/'
metadata_filepath = dataset_dir + 'metadata.csv'
valid_articles_count = int()
out_articles_ids = list()

Open the metadata file:

In [20]:
metadata = pd.read_csv(metadata_filepath, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str) 

Then iterate over the records in the metadata file and choose only the relevant ones:

In [25]:
valid_articles_count = int()

for index, record in metadata.iterrows():
    
    if valid_articles_count <= max_articles:
        
        if record['pubmed_id'] not in out_articles_ids:
            
            if type(record['sha']) != float:

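                # with dtype=str, missing values are read as NaN, not as ''
                if pd.notna(record['authors']) and record['authors'] != '':

                    if pd.notna(record['title']) and record['title'] != '':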
                        title = record['title']
                        title_lang = detect(title)
                        article_filepath = record['pdf_json_files']

                        if title_lang == 'en'  \
                            and type(article_filepath) != float  \
                            and article_filepath.count("document") == 1:

                            article_filepath_up = dataset_dir + record['pdf_json_files']

                            with open(article_filepath_up, encoding='utf-8') as article_file:
                                article_data = json.load(article_file)

                            if 'body_text' in article_data.keys(): # the article is valid
                                valid_articles_count += 1
                                
                                # remember the article so that later duplicates are skipped
                                if pd.notna(record['pubmed_id']):
                                    out_articles_ids.append(record['pubmed_id'])
                                
                                # correct the info of the article with info present in metadata file
                                changed_article = False
                                
                                if article_data['metadata']['title'] == '':
                                    article_data['metadata']['title'] = record['title']
                                    changed_article = True
                                
                                if article_data['metadata']['authors'] == []:
                                    article_data['metadata']['authors'] = record['authors']
                                    changed_article = True
                                
                                # output or copy article file to out_dir
                                if changed_article:
    
                                    with open(out_dir + record['sha'] + '.json', 'w') as out_file:
                                        out_file.write(json.dumps(article_data, indent=4, ensure_ascii=False))
                                    
                                else:
                                    command = 'cp '  \
                                              + article_filepath_up + ' ' \
                                              + out_dir  \
                                              + record['sha'] + '.json'
                                    
                                    os.system(command)
                                
                    else:
                        print(record['title'])
                            
    if valid_articles_count == max_articles:
        total_articles = index + 1
        break
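
The `cp` command used above is only available on Unix-like systems. A portable alternative, sketched here with temporary files so that it can run on its own, is `shutil.copy` from the standard library. The file and directory names are placeholders.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder source file standing in for an article JSON
    src_path = os.path.join(tmp_dir, 'article.json')
    with open(src_path, 'w', encoding='utf-8') as src_file:
        src_file.write('{"paper_id": "example"}')

    # Placeholder output directory standing in for 'sample/'
    sample_dir = os.path.join(tmp_dir, 'sample')
    os.makedirs(sample_dir, exist_ok=True)

    # Copy the file without calling an external command
    dst_path = os.path.join(sample_dir, 'example.json')
    shutil.copy(src_path, dst_path)

    copied = os.path.exists(dst_path)

print(copied)
```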

Let's check if the output dir contains the desired number of articles (max_articles):

In [26]:
article_count = len(os.listdir(out_dir))
assert article_count == max_articles, 'Invalid number of article(s): {}! Expected number: {}'.format(article_count, max_articles)

At the end of this section, we now have a sample including 100 articles that will be the basis of our scientific recommendation dataset.

## 2. Named Entity Recognition (NER) + Named Entity Linking (NEL)

**Objective**: To recognize chemical and disease entities in the retrieved articles and to link them to the respective ontology identifiers.

We are going to use the [Disease Ontology](https://disease-ontology.org/) (DO), and the [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) (ChEBI) ontology.

To perform NER and NEL, we are going to apply the Minimal Named-Entity Recognizer ([MER](https://pypi.org/project/merpy/)) tool.

### 2.1. Import libraries
<a id='#2.2'></a>

In [27]:
import merpy
import multiprocessing
from collections import Counter

### 2.2. Configure MER

First, we need to download the .owl file associated with ChEBI:

In [None]:
merpy.download_lexicon("ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi.owl",
                       "chebi", ltype="owl")

Then, we need to process the downloaded file into a lexicon that MER can use:

In [None]:
merpy.process_lexicon("chebi", ltype="owl")

We are going to delete obsolete concepts still present in the ontology file:

In [None]:
merpy.delete_obsolete("chebi")

We need to repeat the operations for the DO:

In [None]:
merpy.download_lexicon("http://purl.obolibrary.org/obo/doid.owl", 
                        "do", ltype="owl")
            
merpy.process_lexicon("do", ltype="owl")

merpy.delete_obsolete("do")

Let's check the lexicons available for MER:

In [28]:
merpy.show_lexicons()

lexicons preloaded:
['lexicon', 'chebi', 'do']

lexicons loaded ready to use:
['chebi', 'do']

lexicons with linked concepts:
['do', 'lexicon', 'chebi']


### 2.3. Extract the entities in a single file

Let's retrieve a file from the articles sample:

In [34]:
dataset_dir = 'sample/'

with open(dataset_dir + '87390d2ae28407b3e03e60a6b24a7fd99ed7229a.json') as article1_file:
    article_data = json.load(article1_file)

Let's check the contents of the article:

In [35]:
article_data.keys()

dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter'])

We want to recognize the entities present in title, abstract, and body. 

First, let's retrieve the title, which is a value associated with the key 'metadata':

In [36]:
title = article_data['metadata']['title']
title

'Pro/con clinical debate: Steroids are a key component in the treatment of SARS'

Then, we apply MER to the title in order to recognize disease entities and to link them to DO concepts:

In [37]:
merpy.get_entities(title, 'do')

[['74', '78', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945']]

Let's check if the annotations make sense. For instance, access the link http://purl.obolibrary.org/obo/DOID_2945.

The entity 'SARS' in the article was linked to the DO concept 'severe acute respiratory syndrome' with the ID 'DOID:2945', which seems correct!

Let's apply MER to recognize chemical entities and to link them to ChEBI concepts:

In [38]:
merpy.get_entities(title, 'chebi')

[['25', '33', 'Steroids', 'http://purl.obolibrary.org/obo/CHEBI_35341']]

Accessing the link http://purl.obolibrary.org/obo/CHEBI_35341, we can see that the entity 'Steroids' was linked to the ChEBI concept 'steroid', which has the ID 'CHEBI:35341'.
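
The identifiers mentioned above ('DOID:2945', 'CHEBI:35341') can be derived from the URIs returned by MER. A small sketch, assuming every URI ends in a `PREFIX_NUMBER` segment as in the outputs shown here; the function name `uri_to_curie` is hypothetical:

```python
def uri_to_curie(uri):
    """Turn '.../obo/DOID_2945' into 'DOID:2945' (hypothetical helper)."""
    # Take the last path segment, then split it at the first underscore
    local_name = uri.rstrip('/').rsplit('/', 1)[-1]
    prefix, _, number = local_name.partition('_')
    return prefix + ':' + number


print(uri_to_curie('http://purl.obolibrary.org/obo/DOID_2945'))
print(uri_to_curie('http://purl.obolibrary.org/obo/CHEBI_35341'))
```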

We add the disease and chemical entities to a single list:

In [39]:
title_entities = merpy.get_entities(title, 'do') + merpy.get_entities(title, 'chebi')

title_entities

[['74', '78', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['25', '33', 'Steroids', 'http://purl.obolibrary.org/obo/CHEBI_35341']]

Now, we are going to apply MER to recognize entities in the abstract:

In [40]:
abstract = article_data['abstract'][0]['text']

abstract_entities = merpy.get_entities(abstract, 'do') + merpy.get_entities(abstract, 'chebi')

abstract_entities

[['4', '8', 'ARDS', 'http://purl.obolibrary.org/obo/DOID_11394'],
 ['38', '46', 'syndrome', 'http://purl.obolibrary.org/obo/DOID_225'],
 ['93', '97', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['125', '133', 'syndrome', 'http://purl.obolibrary.org/obo/DOID_225'],
 ['224', '228', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['255', '263', 'syndrome', 'http://purl.obolibrary.org/obo/DOID_225'],
 ['386', '390', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['525', '529', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['11',
  '46',
  'acute respiratory distress syndrome',
  'http://purl.obolibrary.org/obo/DOID_11394'],
 ['100',
  '133',
  'severe acute respiratory syndrome',
  'http://purl.obolibrary.org/obo/DOID_2945'],
 ['230',
  '263',
  'severe acute respiratory syndrome',
  'http://purl.obolibrary.org/obo/DOID_2945'],
 ['530', '538', 'steroids', 'http://purl.obolibrary.org/obo/CHEBI_35341'],
 ['625', '632', 'steroid', 'http://purl.obolibrary.org/obo
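
The output shows that MER reports nested mentions separately: 'syndrome' (38 to 46) lies inside 'acute respiratory distress syndrome' (11 to 46), so both DOID_225 and DOID_11394 are counted for the same stretch of text. This tutorial keeps all mentions. If only the longest mention is wanted, one option (not part of MER; `drop_nested` is a hypothetical helper) is to discard any span contained in another one:

```python
def drop_nested(entities):
    """Keep only mentions that are not contained in another kept mention.

    A mention with exactly the same offsets as a kept one is also dropped.
    """
    # Sort by start offset, longest span first when starts are equal
    ordered = sorted(entities, key=lambda e: (int(e[0]), -(int(e[1]) - int(e[0]))))
    kept = []
    for entity in ordered:
        start, end = int(entity[0]), int(entity[1])
        contained = any(int(k[0]) <= start and end <= int(k[1]) for k in kept)
        if not contained:
            kept.append(entity)
    return kept


# The first mentions of the abstract output shown above
mentions = [
    ['4', '8', 'ARDS', 'http://purl.obolibrary.org/obo/DOID_11394'],
    ['38', '46', 'syndrome', 'http://purl.obolibrary.org/obo/DOID_225'],
    ['11', '46', 'acute respiratory distress syndrome',
     'http://purl.obolibrary.org/obo/DOID_11394'],
]

for mention in drop_nested(mentions):
    print(mention[2])
```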

Let's apply MER to the text associated with the body of the article:

In [41]:
body_text = str()

for section in article_data['body_text']:
    body_text += section['text'] + "\n"

body_entities = merpy.get_entities(body_text, "do") + merpy.get_entities(body_text, "chebi")

body_entities

[['0', '4', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['39', '46', 'disease', 'http://purl.obolibrary.org/obo/DOID_4'],
 ['72', '76', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['113', '120', 'disease', 'http://purl.obolibrary.org/obo/DOID_4'],
 ['515', '522', 'disease', 'http://purl.obolibrary.org/obo/DOID_4'],
 ['612', '616', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['723', '727', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['1260', '1264', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945'],
 ['1300', '1308', 'syndrome', 'http://purl.obolibrary.org/obo/DOID_225'],
 ['1310', '1314', 'ARDS', 'http://purl.obolibrary.org/obo/DOID_11394'],
 ['1369', '1373', 'ARDS', 'http://purl.obolibrary.org/obo/DOID_11394'],
 ['1458', '1462', 'ARDS', 'http://purl.obolibrary.org/obo/DOID_11394'],
 ['1504', '1508', 'ARDS', 'http://purl.obolibrary.org/obo/DOID_11394'],
 ['1545', '1549', 'ARDS', 'http://purl.obolibrary.org/obo/DOID_11394'],
 ['1691', '1695', 'A

Finally, we need to obtain the frequency of each ontology identifier in the document:

In [43]:
total_entities = title_entities + abstract_entities + body_entities

all_uris = [entity[3] for entity in total_entities]

entity_counter = Counter(all_uris)

entity_counter

Counter({'http://purl.obolibrary.org/obo/DOID_2945': 30,
         'http://purl.obolibrary.org/obo/CHEBI_35341': 28,
         'http://purl.obolibrary.org/obo/DOID_11394': 17,
         'http://purl.obolibrary.org/obo/DOID_225': 4,
         'http://purl.obolibrary.org/obo/DOID_4': 6,
         'http://purl.obolibrary.org/obo/DOID_552': 2,
         'http://purl.obolibrary.org/obo/DOID_8659': 1,
         'http://purl.obolibrary.org/obo/DOID_13564': 1,
         'http://purl.obolibrary.org/obo/DOID_423': 1,
         'http://purl.obolibrary.org/obo/DOID_1389': 1,
         'http://purl.obolibrary.org/obo/DOID_934': 1,
         'http://purl.obolibrary.org/obo/DOID_11162': 3,
         'http://purl.obolibrary.org/obo/DOID_10533': 1,
         'http://purl.obolibrary.org/obo/CHEBI_50858': 9,
         'http://purl.obolibrary.org/obo/CHEBI_22587': 2,
         'http://purl.obolibrary.org/obo/CHEBI_15379': 2,
         'http://purl.obolibrary.org/obo/CHEBI_50906': 3,
         'http://purl.obolibrary.org/o

To sort the URIs by descending frequency:

In [44]:
entity_counter = {
    k: v 
    for k, v in sorted(entity_counter.items(), key=lambda item: item[1], reverse=True)
    }

entity_counter 

{'http://purl.obolibrary.org/obo/DOID_2945': 30,
 'http://purl.obolibrary.org/obo/CHEBI_35341': 28,
 'http://purl.obolibrary.org/obo/DOID_11394': 17,
 'http://purl.obolibrary.org/obo/CHEBI_50858': 9,
 'http://purl.obolibrary.org/obo/DOID_4': 6,
 'http://purl.obolibrary.org/obo/DOID_225': 4,
 'http://purl.obolibrary.org/obo/DOID_11162': 3,
 'http://purl.obolibrary.org/obo/CHEBI_50906': 3,
 'http://purl.obolibrary.org/obo/CHEBI_6888': 3,
 'http://purl.obolibrary.org/obo/DOID_552': 2,
 'http://purl.obolibrary.org/obo/CHEBI_22587': 2,
 'http://purl.obolibrary.org/obo/CHEBI_15379': 2,
 'http://purl.obolibrary.org/obo/DOID_8659': 1,
 'http://purl.obolibrary.org/obo/DOID_13564': 1,
 'http://purl.obolibrary.org/obo/DOID_423': 1,
 'http://purl.obolibrary.org/obo/DOID_1389': 1,
 'http://purl.obolibrary.org/obo/DOID_934': 1,
 'http://purl.obolibrary.org/obo/DOID_10533': 1,
 'http://purl.obolibrary.org/obo/CHEBI_58972': 1,
 'http://purl.obolibrary.org/obo/CHEBI_34935': 1,
 'http://purl.obolibrary.
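
`Counter` also provides `most_common()`, which returns the items already sorted by descending count. A short sketch on toy data (the shortened identifiers are placeholders for the full URIs):

```python
from collections import Counter

# Toy identifiers standing in for the annotations of one article
all_uris = ['DOID_2945', 'CHEBI_35341', 'DOID_2945', 'DOID_4', 'DOID_2945', 'CHEBI_35341']

entity_counter = Counter(all_uris)

# most_common() yields (uri, count) pairs, highest count first
sorted_counts = dict(entity_counter.most_common())
print(sorted_counts)
```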

### 2.4. Create entity files

We need to adapt our code to perform NER and NEL in all documents of our sample.

First, create the output dir:

In [45]:
out_dir = 'sample_entities/'

if not os.path.exists(out_dir):
    os.mkdir(out_dir)

Next, we are going to iterate over each file present in the sample directory, annotate it, and create the respective entity file:

In [46]:

def annotate_doc(article):
    
    article_filepath = 'sample/' + article  

    with open(article_filepath) as input_file:
        article_data = json.load(input_file)

    doc_output = {'id': str(), 'entities': {}, 'sections': {'title': [], 'abstract': [], 'body': []}}
    
    doc_output['id'] = article_data['paper_id']
   
    # Annotate the title
    title = article_data['metadata']['title']
    title_entities = merpy.get_entities(title, 'do') + merpy.get_entities(title, 'chebi')
    doc_output['sections']['title'] = title_entities
    
    # Annotate the abstract
    if article_data['abstract'] != []:
        abstract = article_data['abstract'][0]['text']
        abstract_entities = merpy.get_entities(abstract, 'do') + merpy.get_entities(abstract, 'chebi')
        doc_output['sections']['abstract'] = abstract_entities
    
    else:
        abstract_entities = []

    # Combine the several paragraphs of the body text and annotate it
    body_text = str()

    for section in article_data['body_text']:
        body_text += section['text'] + '\n'

    body_entities = merpy.get_entities(body_text, 'do') + merpy.get_entities(body_text, 'chebi')
    doc_output['sections']['body'] = body_entities

    # Count URIs frequencies and sort them
    total_entities = title_entities + abstract_entities + body_entities
    all_uris = [entity[3] for entity in total_entities if len(entity)==4]
    entity_counter = Counter(all_uris)
    
    doc_output['entities'] = {
        k: v 
        for k, v in sorted(entity_counter.items(), key=lambda item: item[1], reverse=True)
        }
    
    # Generate JSON file with output
    out_filepath = out_dir + doc_output['id'] + '_entities.json'
    
    with open(out_filepath, 'w') as out_file:
        out_file.write(json.dumps(doc_output, indent=4))
        

article_dir = 'sample/'
        
with multiprocessing.Pool(processes=10) as pool: 
    # change the number of processes according to number of cpus/threads
    outputs = pool.map(annotate_doc, [article for article in os.listdir(article_dir)], chunksize=10)
    pool.close()

e4b48ce0579f908da6dd3289bb2cc262dbdeae72.json
65cc7496f21429d81b3ae129d9c39764e2a1f568.json
c5131e5f5c6000ec84139edc64778a6f1d391b83.json
def1cf77e1ef84f4373a342e23145be05ec5e226.json
467694c7a219031c9be1734c7ab3bc42bfa07590.json
87390d2ae28407b3e03e60a6b24a7fd99ed7229a.json
5ae641a5bf24b53a895f5f2f04254cc00e909c08.json
d1aafb70c066a2068b02786f8929fd9c900897fb.json
a945fe15ef46edadf3f4712668dfc7ee8e5c821d.json
09fc4c5e368a43d21f5130cb8474d61d65191dd9.json
8a7d5de5ea680e784ab2bd877240bf09e4c1c02d.json
894e7274776479bb39a84e5dc363640cd435b369.json
5f48792a5fa08bed9f56016f4981ae2ca6031b32.json
7ff45096210eeb392d51f646f5c7fe011079aaf3.json
505f56215f18a8d205927dd48898f22a336b5b4b.json
d617306cda56236d02117ae7a5fc5e7fcd015554.json
3bb07ea10432f7738413dff9816809cc90f03f99.json
5806726a24dc91de3954001effbdffd7a82d54e2.json
6b18c718ecf5fb496443591ba267b2ccae0c2863.json
dde02f11923815e6a16a31dd6298c46b109c5dfa.json
c63c4d58d170136b8d3b5a66424b5ac3f73a92d9.json
348055649b6b8cf2b9a376498df9bf41f7
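
The pool above uses a fixed number of 10 processes. One way to derive the number from the machine instead is sketched below. `pick_worker_count` is a hypothetical helper, and capping the pool at the number of tasks is a design choice, not a requirement of `multiprocessing`.

```python
import os


def pick_worker_count(n_tasks, max_workers=None):
    """Choose a pool size between 1 and the number of available CPUs."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    limit = cpu_count if max_workers is None else min(cpu_count, max_workers)
    # Never use more workers than tasks, and always use at least one
    return max(1, min(limit, n_tasks))


print(pick_worker_count(n_tasks=100))
print(pick_worker_count(n_tasks=3, max_workers=10))
```

The result can then be passed as the `processes` argument when creating the pool.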

Now we have both the article files ('sample' dir) and the respective entity files ('sample_entities' dir). The next step is the generation of the scientific recommendation dataset.

## 3. Creating the recommendation dataset