# Tutorial: Creating Recommender Systems Datasets in Scientific Fields


<ul>
    <li><a href="#1">1. Data retrieval and cleaning</a></li>
</ul>
   
<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#1.1">1.1. Import libraries</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#1.2">1.2. Retrieve CORD-19</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#1.3">1.3. Exploring the articles of the dataset</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#1.4">1.4. Selecting a sample of articles to build our scientific recommendation dataset</a></li>
</ul>

<ul>
   <li><a href="#2">2. Named Entity Recognition (NER) + Named Entity Linking (NEL)</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#2.1">2.1. Import libraries</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#2.2">2.2. Configure MER</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#2.3">2.3. Import stop words vocabulary and tokenizer</a></li>
</ul>


<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#2.4">2.4. Extract the entities in a single file</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#2.5">2.5. Create entity files</a></li>
</ul>

<ul>
   <li><a href="#3">3. Creating the recommendation dataset</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#3.1">3.1. Import libraries</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#3.2">3.2. Get the IDs of all articles that cannot be considered in the use case: blacklist</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#3.3">3.3. Create the dataset like < user, item, rating, year> </a></li>
</ul>


<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#3.4">3.4. Get entities labels</a></li>
</ul>

<ul>
   <li>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href="#3.5">3.5. Save data</a></li>
</ul>

<ul>
   <li><a href="#4">4. Data statistics</a></li>
</ul>

















<a id="1"></a>

## 1. Data retrieval and cleaning

**Goal**: To retrieve the [COVID-19 Open Research Dataset (CORD-19)](https://www.semanticscholar.org/cord19) and to select a sample of complete English articles (authors' info, title, body text) to build a scientific recommendation dataset.

CORD-19 includes coronavirus-related research articles extracted from several sources, such as PubMed, bioRxiv, medRxiv, and WHO.

<a id="1.1"></a>
### 1.1 Import libraries

In [1]:
import json
import os
import sys

import pandas as pd
import requests
from langdetect import detect

sys.path.append("./")

If this cell fails with `ModuleNotFoundError: No module named 'langdetect'`, install the package from the notebook and run the imports again. Shell commands need the `!` prefix inside a notebook cell; without it, the cell raises a `SyntaxError`:

In [6]:
!pip3 install --upgrade pip
!pip3 install langdetect

<a id="1.2"></a>
### 1.2. Retrieve CORD-19

CORD-19 is a large dataset, so this tutorial uses a smaller version of it.
This version is shipped as the archive 'cord19_small.tar.xz' in the 'data' directory and is extracted to 'data/cord19_small':

In [2]:
os.chdir('data')
os.system('tar -xvf cord19_small.tar.xz')
os.chdir('../')

However, if you want to retrieve the entire dataset, you can run the following code:

In [None]:
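os.chdir('data')

version = 'cord-19_2020-05-12.tar.gz'
url = 'https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/' + version

# Stream the archive to disk; requests.get() alone only keeps the response in memory
with requests.get(url, stream=True) as response:
    response.raise_for_status()
    with open(version, 'wb') as archive_file:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            archive_file.write(chunk)

os.system('tar -xvf ' + version)
os.chdir('../')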

Let's explore the contents of the dataset directory, particularly, the metadata file.

In [3]:
dataset_dir = 'data/cord19_small/'
metadata_filepath = dataset_dir + 'metadata.csv'

metadata = pd.read_csv(metadata_filepath, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str) 

Now we have a DataFrame with the contents of the metadata file.

To print column names and first row:

In [None]:
metadata.head(1)

To access individual rows/articles:

In [None]:
metadata.loc[0]

To access an individual column, such as 'title':

In [None]:
metadata['title']

Let's check the summary statistics:

In [None]:
metadata.count()

By looking at the statistics, we can see that the number of records with data for the column 'cord_uid' is higher than the number of records with data for the column 'authors', which means that some records lack author information.

For our dataset, we only want to include articles with the following characteristics:
- authors' information
- available title
- available body text
- article text expressed in English
- non-duplicate articles
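
As a rough illustration, most of the criteria above can be expressed as a single check on one metadata record. This is a minimal sketch, not the code used later in the tutorial: `is_candidate` and `seen_ids` are hypothetical names, the record is assumed to be a plain dictionary keyed by the metadata column names, the presence of a `pdf_json_files` path stands in for "available body text", duplicates are identified by `cord_uid` (an assumption), and the language check is left out because it needs an external tool.

```python
def is_candidate(record, seen_ids):
    """Return True if a metadata record meets the non-language criteria."""
    required = ('authors', 'title', 'pdf_json_files')

    # every required field must be a non-empty string (missing values may be None or NaN)
    for field in required:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return False

    # reject duplicates, identified here by the 'cord_uid' column (assumption)
    if record.get('cord_uid') in seen_ids:
        return False

    seen_ids.add(record.get('cord_uid'))
    return True


# hypothetical records for illustration
records = [
    {'cord_uid': 'a1', 'authors': 'Madani, Tariq A', 'title': 'A title',
     'pdf_json_files': 'document_parses/pdf_json/x.json'},
    {'cord_uid': 'a1', 'authors': 'Madani, Tariq A', 'title': 'A title',
     'pdf_json_files': 'document_parses/pdf_json/x.json'},   # duplicate
    {'cord_uid': 'b2', 'authors': float('nan'), 'title': 'No authors',
     'pdf_json_files': 'document_parses/pdf_json/y.json'},   # missing authors
]

seen = set()
kept = [r['cord_uid'] for r in records if is_candidate(r, seen)]
print(kept)
```

Only the first record passes: the second is a duplicate and the third has no author information.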

<a id="1.3"></a>

### 1.3. Exploring the articles of the dataset

Let's consider the first article appearing in the metadata file.
To check if information about the article's author is available:

In [None]:
metadata.loc[0]['authors']

To check if there is an available title:

In [None]:
metadata.loc[0]['title']

We can see that the title is expressed in English, but it would not be practical to check the language of every article in the dataset by hand, so we will apply the language detection tool [langdetect](https://pypi.org/project/langdetect/):

In [None]:
title1 = metadata.loc[0]['title']

title1_lang = detect(title1)

print("Title:", title1, "\nLanguage:", title1_lang)

The tool detects English as the language of the title.

Let's check another article:

In [None]:
title2 = metadata.loc[199]['title']

title2_lang = detect(title2)

print("Title:", title2, "\nLanguage:", title2_lang)

In this case, the tool detects German ('de') as the language of the title.
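
When language detection is applied to a whole column, some titles may be missing (pandas reads them as NaN, which is a float) or may contain no letters at all, in which case the detector can raise an exception. The sketch below wraps the detector so that such cases return a default value instead of stopping the loop. `detect_language` is a hypothetical helper, and `fake_detect` is only a stand-in so that the example runs without langdetect; in the notebook one would pass langdetect's `detect` instead.

```python
def detect_language(text, detect_fn, default='unknown'):
    """Return the language code for text, or default when detection is not possible."""
    # missing values read by pandas are floats (NaN), not strings
    if not isinstance(text, str) or not text.strip():
        return default
    try:
        return detect_fn(text)
    except Exception:
        # the detector may fail on text without usable features (e.g. only digits)
        return default


# Stand-in detector for illustration only (not langdetect)
def fake_detect(text):
    if not any(ch.isalpha() for ch in text):
        raise ValueError('No features in text.')
    return 'de' if 'der' in text.lower().split() else 'en'


print(detect_language('Clinical features of pneumonia', fake_detect))
print(detect_language(float('nan'), fake_detect))
print(detect_language('12345', fake_detect))
```

The langdetect documentation also notes that results on short or ambiguous text can vary between runs unless a seed is set, which is worth keeping in mind when detection is based on titles only.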

Now, we want to check if there is an available abstract for the first article:

In [None]:
metadata.loc[0]['abstract']

The metadata file does not contain the article's text besides abstract and title, but we can access the file associated with the article using the provided information in the columns 'pdf_json_files' or 'pmc_json_files':

In [None]:
metadata.loc[0]['pdf_json_files']
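
The selection loop in section 1.4 only keeps records whose 'pdf_json_files' value refers to exactly one file. A minimal sketch of how such a field could be split into individual paths is shown below. It assumes that several paths are separated by ';', which is not stated in this tutorial; `split_paths` and the file names are hypothetical.

```python
def split_paths(field):
    """Split a 'pdf_json_files' value into individual paths."""
    # missing values are not strings (pandas reads them as NaN)
    if not isinstance(field, str):
        return []
    # assumption: several parses of one article are separated by ';'
    return [path.strip() for path in field.split(';') if path.strip()]


single = 'document_parses/pdf_json/aaa.json'
double = 'document_parses/pdf_json/aaa.json; document_parses/pdf_json/bbb.json'

print(split_paths(single))
print(len(split_paths(double)))
print(split_paths(float('nan')))
```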

Let's open the file, which is in [JSON](https://www.json.org/json-en.html) format:

In [None]:
article1_filepath = dataset_dir + metadata.loc[0]['pdf_json_files']

with open(article1_filepath, encoding='utf-8') as article1_file:
    article1_data = json.load(article1_file)

Now we have the file content stored in a dictionary with the following keys:

In [None]:
article1_data.keys()

The article content is the following:

In [None]:
article1_data

Let's check the body text:

In [None]:
body_text = article1_data['body_text']
body_text

We have a list of dictionaries, where each dictionary is a paragraph belonging to a given section of the article. We want to join the text of the different paragraphs into a single string:

In [None]:
article1_text = str()

for paragraph in body_text:
    article1_text += paragraph['text'] + '\n'

article1_text
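
The same result can be obtained with `str.join`, and the paragraphs can also be grouped by section when the structure of the article matters. The following is a minimal sketch on a hypothetical excerpt shaped like `article1_data['body_text']`; the 'section' key is an assumption about the JSON schema, so `.get()` is used in case it is absent. Unlike the loop above, `join` does not add a trailing newline.

```python
from collections import defaultdict

# Hypothetical excerpt shaped like article1_data['body_text'];
# the 'section' key is an assumption about the CORD-19 JSON schema
body_text = [
    {'text': 'First paragraph.', 'section': 'Introduction'},
    {'text': 'Second paragraph.', 'section': 'Introduction'},
    {'text': 'Third paragraph.', 'section': 'Methods'},
]

# str.join avoids building many intermediate strings with +=
full_text = '\n'.join(paragraph['text'] for paragraph in body_text)

# group paragraphs by section, using .get() in case the key is absent
text_by_section = defaultdict(list)
for paragraph in body_text:
    text_by_section[paragraph.get('section', '')].append(paragraph['text'])

print(full_text)
print(dict(text_by_section))
```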

All good! We will include this article in our dataset.

<a id="1.4"></a>

### 1.4. Selecting a sample of articles to build our scientific recommendation dataset

Instead of repeating each operation for each file individually, let us adapt our code to automatically select a sample containing 100 preprocessed articles.

First, create the output directory:

In [None]:
out_dir = 'data/sample/'

if not os.path.exists(out_dir):
    os.mkdir(out_dir)
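
Note that `os.mkdir` creates only the last directory of the path and fails if the parent directory ('data' in this case) does not exist. A minimal alternative sketch uses `os.makedirs` with `exist_ok=True`, which creates any missing parents and does not complain if the directory is already there. A temporary directory stands in for the project root so that the example leaves no files behind.

```python
import os
import tempfile

# a temporary directory stands in for the project root in this sketch
with tempfile.TemporaryDirectory() as root:
    nested_dir = os.path.join(root, 'data', 'sample')

    # creates 'data' and 'data/sample' in one call; no error if they already exist
    os.makedirs(nested_dir, exist_ok=True)
    os.makedirs(nested_dir, exist_ok=True)

    created = os.path.isdir(nested_dir)

print(created)
```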

Then, initialize the necessary variables:

In [None]:
max_articles = 100 #number of articles to include in the sample
dataset_dir = 'data/cord19_small/'
metadata_filepath = dataset_dir + 'metadata.csv'
valid_articles_count = int()
out_articles_ids = list()

Open the metadata file:

In [None]:
metadata = pd.read_csv(metadata_filepath, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str) 

Then iterate over the records in the metadata file and choose only the relevant ones:

In [None]:
valid_articles_count = int()
blacklist = str()
blacklist_count = int()

for index, record in metadata.iterrows():
    invalid_article = True
    
    if valid_articles_count <= max_articles:
        
        if record['pubmed_id'] not in out_articles_ids:
            
            if type(record['sha']) != float:

                if record['authors'] != '':    

                    if record['title'] != '':
                        title = record['title']
                        title_lang = detect(title)
                        article_filepath = record['pdf_json_files']
                        
                        if article_filepath != '': # to consider only articles from the pdf_json directory

                            if title_lang == 'en'  \
                                and type(article_filepath) != float  \
                                and article_filepath.count("document") == 1:

                                article_filepath_up = dataset_dir + record['pdf_json_files']

                                with open(article_filepath_up, encoding='utf-8') as article_file:
                                    article_data = json.load(article_file)

                                if 'body_text' in article_data.keys(): # the article is valid
                                    valid_articles_count += 1
                                    invalid_article = False

                                    # article_data was already loaded above, so the file is not read again

                                    # correct the info of the article with info present in metadata file
                                    changed_article = False

                                    if article_data['metadata']['title'] == '':
                                        article_data['metadata']['title'] = record['title']
                                        changed_article = True

                                    if article_data['metadata']['authors'] == []:
                                        authors = record['authors'].split(';')   
                                        
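                                        authors_up = list()

                                        for author in authors:
                                            # build a new dict per author; reusing a single dict
                                            # would make every entry describe the last author
                                            add_author = {
                                                         'first': '', 'middle': [], 'last': '',
                                                         'suffix': '', 'affiliation': {}, 'email': ''
                                                         }
                                            name_parts = author.split(',')
                                            add_author['last'] = name_parts[0].strip()

                                            # given names follow the comma, e.g. "Madani, Tariq A"
                                            given_names = name_parts[1].split() if len(name_parts) > 1 else []

                                            if given_names:
                                                add_author['first'] = given_names[0]
                                                add_author['middle'] = given_names[1:]

                                            authors_up.append(add_author)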
                                            
                                        article_data['metadata']['authors'] = authors_up
                                        changed_article = True
                                        
                                    # output or copy article file to out_dir
                                    if changed_article:
                                      
                                        with open(out_dir + record['sha'] + '.json', 'w') as out_file:
                                            out_file.write(json.dumps(article_data, indent=4, ensure_ascii=False))

                                    else:
                                        command = 'cp '  \
                                                  + article_filepath_up + ' ' \
                                                  + out_dir  \
                                                  + record['sha'] + '.json'

                                        os.system(command)
        
        if invalid_article: # store article pubmed id in blacklist file
            if isinstance(record['pubmed_id'], str): # pubmed_id is NaN (a float) when missing
                blacklist += record['pubmed_id'] + "\n"
            blacklist_count += 1
            
    if valid_articles_count == max_articles:
        total_articles = index + 1
        break

#Create blacklist file with info about invalid articles
os.makedirs('data/blacklist', exist_ok=True) # make sure the output directory exists

with open('data/blacklist/blacklist_articles.txt', 'w') as blacklist_file:
    blacklist_file.write(blacklist)

print("Invalid articles:", str(blacklist_count))
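
The author-correction step in the loop above turns the `authors` string of the metadata file into a list of author dictionaries. As a minimal standalone sketch of that step, assuming the "Last, First Middle; Last, First Middle" layout seen in the metadata file, the parsing could look as follows. `parse_authors` is a hypothetical helper, not part of the tutorial code.

```python
def parse_authors(authors_field):
    """Parse 'Last, First Middle; Last, First' into a list of author dicts."""
    parsed = []
    for author in authors_field.split(';'):
        author = author.strip()
        if not author:
            continue
        # everything before the first comma is the surname, the rest are given names
        last, _, given = author.partition(',')
        given_names = given.split()
        parsed.append({
            'first': given_names[0] if given_names else '',
            'middle': given_names[1:],
            'last': last.strip(),
            'suffix': '', 'affiliation': {}, 'email': '',
        })
    return parsed


authors = parse_authors('Madani, Tariq A; Al-Ghamdi, Aisha A')
print([(a['last'], a['first'], a['middle']) for a in authors])
```

Each author gets its own dictionary, and names without a comma or without given names do not raise an error; they simply end up with an empty 'first' field.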

If you were not able to run the code, you can uncompress the file 'sample.tar.xz' in the 'data' directory.

Let's check if the output directory contains the desired number of articles (max_articles):

In [None]:
article_count = len(os.listdir(out_dir))
assert article_count == max_articles, 'Invalid number of article(s): {}! Expected number: {}'.format(article_count, max_articles)

At the end of this section, we now have a sample including 100 articles that will be the basis of our scientific recommendation dataset.

<a id="2"></a>

## 2. Named Entity Recognition (NER) + Named Entity Linking (NEL)

**Goal**: To recognize chemical and disease entities in the retrieved articles and to link them to the respective ontology identifiers.

We are going to use the [Disease Ontology](https://disease-ontology.org/) (DO), and the [Chemical Entities of Biological Interest](https://www.ebi.ac.uk/chebi/) (ChEBI) ontology.

To perform NER and NEL, we are going to apply the Minimal Named-Entity Recognizer ([MER](https://pypi.org/project/merpy/)) tool.

<a id="2.1"></a>
### 2.1. Import libraries

In [4]:
import merpy
import multiprocessing
from collections import Counter

<a id="2.2"></a>
### 2.2. Configure MER

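First, we are going to download the ontology files.

**IMPORTANT:** The next cell is only needed on macOS (OS X), where the download can fail because of an SSL certificate problem. It disables HTTPS certificate verification for the current session, so run it only if you trust the download sources.
See: https://github.com/prisma-labs/python-graphql-client/issues/13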

In [5]:
import urllib.request
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
response = urllib.request.urlopen('https://www.python.org')

In [6]:
os.chdir('data/ontologies')

#Download DO (2021-06-03 version)
os.system('wget https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-06-03/src/ontology/releases/doid.owl')
merpy.download_lexicon('https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/v2021-06-03/src/ontology/releases/doid.owl',
                       'doid', ltype='owl')

#Download ChEBI (2021-07-01 version)
os.system('wget ftp://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel201/ontology/chebi_lite.owl.gz')
os.system('gzip -d chebi_lite.owl.gz')
merpy.download_lexicon('ftp://ftp.ebi.ac.uk/pub/databases/chebi/archive/rel201/ontology/chebi_lite.owl',
                       'chebi', ltype='owl')

os.chdir('../../')

wrote doid lexicon
wrote chebi lexicon


Then, we need to process the downloaded files into lexicons that MER can use:

In [11]:
merpy.process_lexicon("doid", ltype="owl")

merpy.process_lexicon("chebi", ltype="owl")


We are going to delete obsolete concepts still present in the ontology files:

In [12]:
merpy.delete_obsolete("chebi")

merpy.delete_obsolete("doid")

Let's check the lexicons available for MER:

In [13]:
merpy.show_lexicons()

lexicons preloaded:
['chebi', 'doid', 'lexicon']

lexicons loaded ready to use:
['chebi', 'doid']

lexicons with linked concepts:
['lexicon', 'chebi', 'doid']


<a id="2.3"></a>
### 2.3. Import stop words vocabulary and tokenizer

Stop words are common words of a given language (for example the words 'the', 'and', 'in'). A typical pre-processing step is to tokenize the text and remove the stop words. For that, we are going to import NLTK's list of English stop words and use the NLTK tokenizer.

In [None]:
import nltk 
nltk.download('punkt')

from nltk.corpus import stopwords

nltk.download('stopwords')

from nltk.tokenize import word_tokenize

all_stopwords = stopwords.words('english')

We are going to extend the stop words vocabulary by adding stop words associated with ChEBI and DO:

In [None]:
kbs_stopwords = list()
blacklist_dir = 'data/blacklist/'
filenames = ['chebi.txt', 'doid.txt']

for filename in filenames:
    
    with open(blacklist_dir + filename, 'r') as blacklist_file:
        # use a name other than 'stopwords' so the NLTK module imported above is not overwritten
        kb_stopwords = [content.strip('\n') for content in blacklist_file.readlines()]
        kbs_stopwords.extend(kb_stopwords)

#Extend stop words vocabulary with the retrieved KBs stopwords
all_stopwords.extend(kbs_stopwords)
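
Two details are worth keeping in mind when this vocabulary is used for filtering. Membership tests on a list scan the whole list, so converting the vocabulary to a set makes lookups faster on long texts. Also, NLTK's English stop words are lowercase, so a case-sensitive comparison keeps a capitalised token such as 'The'. The sketch below illustrates both points with small stand-ins: `example_stopwords`, the regular-expression tokenizer, and the title are hypothetical and replace the NLTK resources only so that the example runs on its own.

```python
import re

# small stand-in for the stop word vocabulary built above
example_stopwords = {'the', 'and', 'in', 'of'}


def remove_stopwords(text, stopwords_set):
    """Tokenize text on word characters and drop stop words, ignoring case."""
    # simple tokenizer standing in for nltk.tokenize.word_tokenize
    tokens = re.findall(r"\w+(?:-\w+)*", text)
    return [token for token in tokens if token.lower() not in stopwords_set]


title = 'The role of steroids in the treatment of SARS'
print(remove_stopwords(title, example_stopwords))
```

Whether lowercasing before the comparison is desirable depends on the vocabulary: the ontology-specific stop words may be case-sensitive, so this is a design choice rather than a required change.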

<a id="2.4"></a>
### 2.4. Extract the entities in a single file

Let's retrieve a file from the articles sample:

In [None]:
dataset_dir = 'data/sample/'

with open(dataset_dir + '87390d2ae28407b3e03e60a6b24a7fd99ed7229a.json') as article1_file:
    article_data = json.load(article1_file)

Let's check the contents of the article:

In [None]:
article_data.keys()

We want to recognize the entities present in title, abstract, and body. 

First, let's retrieve the title, which is a value associated with the key 'metadata':

In [None]:
title = article_data['metadata']['title']
title

Let's tokenize the title:

In [None]:
title_tokens = word_tokenize(title)
title_tokens

Then remove the tokens that are stop words:

In [None]:
title_tokens_up = [word for word in title_tokens if not word in all_stopwords]
title_tokens_up

And rebuild the title without the stop words:

In [None]:
title_up = (' ').join(title_tokens_up)
title_up

Then, we apply MER to the preprocessed title in order to recognize disease entities and to link them to DO concepts:

In [None]:
merpy.get_entities(title_up, 'doid')

Let's check if the annotations make sense. For instance, access the link http://purl.obolibrary.org/obo/DOID_2945.

The entity 'SARS' in the article was linked to the DO concept 'severe acute respiratory syndrome' with the identifier 'DOID:2945', which seems correct!

Let's apply MER to recognize chemical entities and to link them to ChEBI concepts:

In [None]:
merpy.get_entities(title_up, 'chebi')

Accessing the link http://purl.obolibrary.org/obo/CHEBI_35341, we can see that the entity 'Steroids' was linked to the ChEBI concept 'steroid', which has the identifier 'CHEBI:35341'.
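
The relation between the URIs and the ontology identifiers mentioned above ('DOID:2945', 'CHEBI:35341') is purely textual, so it can be derived without looking up each link. The following is a minimal sketch; `uri_to_id` is a hypothetical helper, and the annotation is assumed to have the shape [start, end, text, URI] with made-up offsets.

```python
def uri_to_id(uri):
    """Convert an OBO PURL such as .../obo/DOID_2945 into the identifier DOID:2945."""
    local_name = uri.rstrip('/').rsplit('/', 1)[-1]
    prefix, _, number = local_name.partition('_')
    return prefix + ':' + number


# hypothetical annotation in the shape [start, end, text, URI]; offsets are made up
entity = ['0', '4', 'SARS', 'http://purl.obolibrary.org/obo/DOID_2945']

print(uri_to_id(entity[3]))
print(uri_to_id('http://purl.obolibrary.org/obo/CHEBI_35341'))
```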

We add the disease and chemical entities to a single list:

In [None]:
title_entities = merpy.get_entities(title_up, 'doid') + merpy.get_entities(title_up, 'chebi')
title_entities_up = [entity for entity in title_entities if entity != ['']]

title_entities_up

Now, we are going to apply MER to recognize entities in abstract:

In [None]:
abstract = article_data['abstract'][0]['text']

#Tokenize and remove stop words
abstract_tokens = word_tokenize(abstract)
abstract_tokens_up = [word for word in abstract_tokens if not word in all_stopwords]
abstract_up = (' ').join(abstract_tokens_up)

#Entity recognition and linking 
abstract_entities =  merpy.get_entities(abstract_up, 'doid') + merpy.get_entities(abstract_up, 'chebi')
abstract_entities_up = [entity for entity in abstract_entities if entity != ['']]

abstract_entities_up

Let's apply MER to the text associated with the body of the article:

In [None]:
body = str()

for section in article_data['body_text']:
    body += section['text'] + "\n"
    
#Tokenize and remove stop words
body_tokens = word_tokenize(body)
body_tokens_up = [word for word in body_tokens if not word in all_stopwords]
body_up = (' ').join(body_tokens_up)

#Entity recognition and linking 
body_entities =  merpy.get_entities(body_up, "doid") + merpy.get_entities(body_up, "chebi")
body_entities_up = [entity for entity in body_entities if entity != ['']]

body_entities_up

Finally, we need to obtain information about the frequency of each ontology identifier in the document:

In [None]:
total_entities = title_entities_up + abstract_entities_up + body_entities_up

all_uris = [entity[3] for entity in total_entities]

entity_counter = Counter(all_uris)

entity_counter

To sort the URIs by descending frequency:

In [None]:
entity_counter = {
    k: v 
    for k, v in sorted(entity_counter.items(), key=lambda item: item[1], reverse=True)
    }

entity_counter 
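
`Counter` also offers `most_common()`, which returns the (URI, count) pairs already sorted by descending count, so the sorting step can be written more compactly. A minimal sketch with hypothetical URIs standing in for `all_uris`:

```python
from collections import Counter

# hypothetical URIs standing in for all_uris
all_uris = [
    'http://purl.obolibrary.org/obo/DOID_2945',
    'http://purl.obolibrary.org/obo/CHEBI_35341',
    'http://purl.obolibrary.org/obo/DOID_2945',
]

entity_counter = Counter(all_uris)

# most_common() returns (uri, count) pairs sorted by descending count
ranked = dict(entity_counter.most_common())
print(ranked)
```

Because dictionaries preserve insertion order, `ranked` keeps the most frequent URI first.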

<a id="2.5"></a>
### 2.5. Create entity files

We need to adapt our code to perform NER and NEL in all documents of our sample.

First, create the output dir:

In [None]:
out_dir = 'data/sample_entities/'

if not os.path.exists(out_dir):
    os.mkdir(out_dir)

Next, we are going to iterate over each file present in the sample directory, annotate it, and create the respective entity file:

In [None]:
def annotate_doc(article):
    
    article_filepath = 'data/sample/' + article  

    with open(article_filepath) as input_file:
        article_data = json.load(input_file)

    doc_output = {'id': str(), 'entities': {}, 'sections': {'title': [], 'abstract': [], 'body': []}}
    
    doc_output['id'] = article_data['paper_id']
   
    #Annotate the title
    title = article_data['metadata']['title']
    
    #Tokenize and remove stop words
    title_tokens = word_tokenize(title)
    title_tokens_up = [word for word in title_tokens if not word in all_stopwords]
    title_up = (' ').join(title_tokens_up)
    
    #Entity recognition and linking
    title_entities = merpy.get_entities(title_up, 'doid') + merpy.get_entities(title_up, 'chebi')
    doc_output['sections']['title'] = [entity for entity in title_entities if entity != ['']]
    
    #Annotate the abstract
    if article_data['abstract'] != []:
        abstract = article_data['abstract'][0]['text']
        
        #Tokenize and remove stop words
        abstract_tokens = word_tokenize(abstract)
        abstract_tokens_up = [word for word in abstract_tokens if not word in all_stopwords]
        abstract_up = (' ').join(abstract_tokens_up)
        
        #Entity recognition and linking
        abstract_entities = merpy.get_entities(abstract_up, 'doid') + merpy.get_entities(abstract_up, 'chebi')
        doc_output['sections']['abstract'] = [entity for entity in abstract_entities if entity != ['']]
    
    else:
        abstract_entities = []

    #Combine the several paragraphs of the body text and annotate it
    body = str()

    for section in article_data['body_text']:
        body += section['text'] + '\n'
    
    #Tokenize and remove stop words
    body_tokens = word_tokenize(body)
    body_tokens_up = [word for word in body_tokens if not word in all_stopwords]
    body_up = (' ').join(body_tokens_up)    
        
    #Entity recognition and linking
    body_entities = merpy.get_entities(body_up, 'doid') + merpy.get_entities(body_up, 'chebi')
    doc_output['sections']['body'] = [entity for entity in body_entities if entity != ['']]

    # Count URIs frequencies and sort them
    total_entities = title_entities + abstract_entities + body_entities
    all_uris = [entity[3] for entity in total_entities if len(entity)==4]
    entity_counter = Counter(all_uris)
    
    doc_output['entities'] = {
        k: v 
        for k, v in sorted(entity_counter.items(), key=lambda item: item[1], reverse=True)
        }
    
    #Generate JSON file with output
    out_filepath = out_dir + doc_output['id'] + '_entities.json'
    
    with open(out_filepath, 'w') as out_file:
        out_file.write(json.dumps(doc_output, indent=4))
        

article_dir = 'data/sample/'
        
with multiprocessing.Pool(processes=10) as pool: 
    # change the number of processes according to number of available cores
    outputs = pool.map(annotate_doc, os.listdir(article_dir), chunksize=10)

**IMPORTANT:** If you were not able to run the previous code, you can extract the file 'sample_entities.tar.xz' in the 'data' directory:

In [7]:
os.chdir('data')
os.system('tar -xvf sample_entities.tar.xz')
os.chdir('../')

Now we have both the article files ('data/sample' directory) and the respective entity files ('data/sample_entities' directory). The next step is to generate the scientific recommendation dataset using the LIBRETTI algorithm.

<a id="3"></a>
## 3. Creating the recommendation dataset

<a id="3.1"></a>
### 3.1. Import libraries

In [4]:
import numpy as np
import unidecode

from pathlib import Path
import rdflib
from rdflib import URIRef

In [5]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Paths to the folder with the original JSON files, the folder with the entity JSON files, the blacklist file, and the metadata file:

In [6]:
dataset_dir = 'data/cord19_small/document_parses/pdf_json/'
entities_dir = 'data/sample_entities/'
metadata_filepath = 'data/cord19_small/metadata.csv'
blacklist_filepath = 'data/blacklist/blacklist_articles.txt'

List the names of the entity files in the entities directory:

In [11]:
entities_list_of_json_files = os.listdir(entities_dir)
print(entities_list_of_json_files)

['d0c6b0c2d387baae89eb2898969913218b3bedff_entities.json', '8d7400a2b387820cd391d7df8194642e50402a0c_entities.json', 'faaf1022ccfe93b032c5608097a53543ba24aedb_entities.json', 'ab1ac4f7b9c57dad5e3d61f72e8e3d552059fa09_entities.json', '40e41f0c52b39669dee24e37875d7a9fabc38636_entities.json', '8a7d5de5ea680e784ab2bd877240bf09e4c1c02d_entities.json', '72ace5af731fdf4c384e912b074193d13902b7a1_entities.json', '93c6eef32a1a511ee989a259eab0e12174dc6859_entities.json', 'e3af2ca43010f59c3d1bb731abd011e3dd0fc51c_entities.json', '894e7274776479bb39a84e5dc363640cd435b369_entities.json', '03203ab50eb64271a9e825f94a1b1a6c46ea14b3_entities.json', '23bc55d6f63fab18b02004483888db2b6a0bfa48_entities.json', '98a3b0606a67d829816c1d934e2d1a7196985151_entities.json', '6a3fa8ed278df0d05c5e009521de11c72308f60b_entities.json', 'bb348e16b7b1390554883a8ad0815ec6965c8b2f_entities.json', '7ff45096210eeb392d51f646f5c7fe011079aaf3_entities.json', '3bda17d21aee670c29e22635d622b780820572af_entities.json', '18c21db94a22

<a id="3.2"></a>
### 3.2. Get the IDs of all articles that cannot be considered in the use case: blacklist

In [8]:
# Load the IDs of all articles to be removed because of errors found in them
# (each entry keeps its trailing newline, as stored in the blacklist file)

with open(blacklist_filepath, 'r') as f:
    articles_blacklist = f.readlines()

articles_blacklist

['10921875\n', '10921875\n', '10921875\n']
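
As the output shows, every entry still ends with a newline character and the same ID can appear several times. Both matter when the list is later used for membership tests: a bare ID never equals an entry that ends with a newline. A minimal sketch of a more robust representation is given below, using an in-memory file that mirrors the output above instead of the real blacklist file; `blacklist_ids` is a hypothetical name.

```python
import io

# in-memory stand-in for the blacklist file, mirroring the output shown above
blacklist_file = io.StringIO('10921875\n10921875\n10921875\n')

raw_lines = blacklist_file.readlines()
print('10921875' in raw_lines)   # entries still end with a newline

# strip whitespace and use a set to drop duplicates and speed up lookups
blacklist_ids = {line.strip() for line in raw_lines if line.strip()}
print('10921875' in blacklist_ids)
print(len(blacklist_ids))
```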

In [None]:
# Load the same blacklist entries into a separate list used for authors

with open(blacklist_filepath, 'r') as f:
    authors_blacklist = f.readlines()

authors_blacklist

Open metadata file:

In [9]:
metadata = pd.read_csv(metadata_filepath, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str) 
metadata.head(1)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,


<a id="3.3"></a>
### 3.3. Create the dataset like <user, item, rating, year> 

In [12]:
user_item_rating_all = []
count = 0

#e.g.
#entities_list_of_json_files=['23bc55d6f63fab18b02004483888db2b6a0bfa48_entities.json']

for file in entities_list_of_json_files:
    
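    # blacklist entries may end with '\n', so compare against the stripped values
    if file.replace('_entities.json','') in [entry.strip() for entry in articles_blacklist]:
        continue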
     
    print(count, "-", len(entities_list_of_json_files))            
    print(file)
        
    # check valid json file, i.e. contains values
    try:
        j_file_entities = pd.read_json(entities_dir + file, orient = 'index')    
    except Exception as e:
        print(f'Json file does not contain values. Error message {e}')
        # Add the article to the blacklist, which collects all invalid
        # articles (e.g. without authors or without entities).
        # append() adds the id as one entry; extend() would add it character by character
        articles_blacklist.append(file.replace('_entities.json','').rstrip('\r\n') + '\n')
        continue  

    my_dict = j_file_entities.loc['entities'][0]  
    df_entities = pd.DataFrame(my_dict.items(), columns=['entities', 'count'])
    
    df_entities['entities_id'] = df_entities.entities.str.split(pat="/").str[-1]
   
    if df_entities.empty:
        print('Json file does not contain values.')
        # append() adds the id as one entry; extend() would add it character by character
        articles_blacklist.append(file.replace('_entities.json','').rstrip('\r\n') + '\n')
        continue     
    
    #print(df_entities)
    
    article_id = j_file_entities.loc['id'].values[0]
    
    # check that the original article file exists and is valid JSON
    try:
        # Load the original article JSON file into a dictionary
        with open(dataset_dir + article_id + '.json', encoding='utf-8') as json_file:
            j_file_original = json.load(json_file)
        #print(j_file_original)
    except Exception as e:
        print(f'Original json file does not exist. Error message {e}')
        # append() adds the id as one entry; extend() would add it character by character
        articles_blacklist.append(file.replace('_entities.json','').rstrip('\r\n') + '\n')
        continue    
    
    # check if json file contains AUTHORS, otherwise try to find them in metadata.csv
    # if the value remains empty, then put this article in the blacklist file
    
    list_of_authors = []
    for p in j_file_original['metadata']['authors']:
        
        if len(p['first']) == 0 or len(p['last']) == 0:
            continue
        else:
            # keep only alphabetic characters, then transliterate them to ASCII with unidecode
            first = unidecode.unidecode( ''.join(m for m in p['first'] if m.isalpha()))
            middle = unidecode.unidecode( ''.join(m for m in p['middle'] if m.isalpha()))
            last = unidecode.unidecode( ''.join(m for m in p['last'] if m.isalpha()))

            list_of_authors.append(last + ', '+ first + ' '+ middle)
    
###  ---- Begin of Metadata file ----
    
    # CHECK AUTHORS
    
    # if no authors were found in the JSON file, look them up in metadata.csv
    if len(list_of_authors) == 0:
        # the lookup fails if the article is not in the metadata or its authors value is NaN
        try:
            authors = metadata[metadata.sha == article_id].authors.values[0].split(';')
            for a in authors:
                a = a.split(',')
                first = unidecode.unidecode(''.join(m for m in a[1] if m.isalpha()))
                last = unidecode.unidecode(''.join(m for m in a[0] if m.isalpha()))
                list_of_authors.append(last + ', ' + first)

        except Exception as e:
            print(f'Empty values {e}')
            articles_blacklist.append(file.replace('_entities.json','').rstrip('\r\n') + '\n')
            # skip the article only when the authors could not be recovered
            continue
    
    # skip the article if no valid author name could be recovered
    if len(list_of_authors) == 0:
        print('Json file does not contain valid author names.')
        # The blacklist collects all invalid articles, such as those
        # without authors or without entities
        articles_blacklist.append(file.replace('_entities.json','').rstrip('\r\n') + '\n')
        continue  
        
    # CHECK YEAR
    
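    publish_date = ''
    try:
        # keep only the year part of the publication date, e.g. '2004-05-01' -> '2004'
        publish_date = metadata[metadata.sha == article_id].\
            publish_time.map(lambda v: v.split('-')[0]).tolist()[0]
    except Exception as e:
        print(f'Publication year not found. Error message {e}')
    #print(publish_date)

    # blacklist the article when no publication year could be read
    if not publish_date:
        articles_blacklist.append(file.replace('_entities.json','').rstrip('\r\n') + '\n')
        continue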

    # prepend the blacklisted article ids (one per line) to the blacklist file
    with open(blacklist_filepath, 'r+') as blacklist_file:
        content = blacklist_file.read()
        blacklist_file.seek(0, 0)
        blacklist_file.write(''.join(map(str, articles_blacklist)))
        blacklist_file.write(content)
            
    count+=1
    
###  ---- End of Metadata file ----            
    
    # Append <user, item, rating> tuple
    user_item_rating = []
    for author in list_of_authors:
        for entity in df_entities.entities_id:
            user_item_rating.append([author, entity, 1])

    #print(user_item_rating)
    
    ## add publish_date in array in index column = 3
    user_item_rating = np.insert(user_item_rating, 3, publish_date, axis=1)
    #print(user_item_rating)
    
    user_item_rating_all.append(user_item_rating)

0 - 100
d0c6b0c2d387baae89eb2898969913218b3bedff_entities.json
1 - 100
8d7400a2b387820cd391d7df8194642e50402a0c_entities.json
2 - 100
faaf1022ccfe93b032c5608097a53543ba24aedb_entities.json
3 - 100
ab1ac4f7b9c57dad5e3d61f72e8e3d552059fa09_entities.json
4 - 100
40e41f0c52b39669dee24e37875d7a9fabc38636_entities.json
5 - 100
8a7d5de5ea680e784ab2bd877240bf09e4c1c02d_entities.json
6 - 100
72ace5af731fdf4c384e912b074193d13902b7a1_entities.json
7 - 100
93c6eef32a1a511ee989a259eab0e12174dc6859_entities.json
8 - 100
e3af2ca43010f59c3d1bb731abd011e3dd0fc51c_entities.json
9 - 100
894e7274776479bb39a84e5dc363640cd435b369_entities.json
10 - 100
03203ab50eb64271a9e825f94a1b1a6c46ea14b3_entities.json
11 - 100
23bc55d6f63fab18b02004483888db2b6a0bfa48_entities.json
12 - 100
98a3b0606a67d829816c1d934e2d1a7196985151_entities.json
13 - 100
6a3fa8ed278df0d05c5e009521de11c72308f60b_entities.json
14 - 100
bb348e16b7b1390554883a8ad0815ec6965c8b2f_entities.json
15 - 100
7ff45096210eeb392d51f646f5c7fe011079aaf3_

In [13]:
flat_list = []
for sublist in user_item_rating_all:
    for item in sublist:
        flat_list.append(item)
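
The per-article construction of the rows and the flattening step above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the author names, entity IDs, and years are made up, and `build_rows` is a hypothetical helper that is not part of the notebook.

```python
from itertools import chain, product


def build_rows(authors, entities, year):
    """Return one [user, item, rating, year] row per (author, entity) pair."""
    return [[author, entity, 1, year] for author, entity in product(authors, entities)]


# Hypothetical articles: (authors, entity IDs, publication year)
articles = [
    (["Doe, Jane", "Roe, Rick"], ["CHEBI_35341", "DOID_4"], "2004"),
    (["Doe, Jane"], ["DOID_4"], "2010"),
]

# One list of rows per article, as in user_item_rating_all
rows_per_article = [build_rows(a, e, y) for a, e, y in articles]

# Flatten the per-article lists into a single list of rows
flat_rows = list(chain.from_iterable(rows_per_article))
print(len(flat_rows))
```

`chain.from_iterable` does the same job as the nested loop: it walks each sublist in order and yields its rows one by one.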

Anonymizing the authors: assign a numeric user ID to each author name, so that each row becomes <author_name, item, rating, year, user>

In [14]:
final_data = pd.DataFrame(np.array(flat_list),  columns=['user', 'item', 'rating', 'year'])
sum_df = final_data.groupby(['user', 'item', 'year']).size().reset_index().rename(columns={0: 'rating'})   

# map each author name to a consecutive integer ID, starting at 0
df_user_index = pd.DataFrame(sum_df.user.unique(), columns=["user"])
df_user_index["new_index"] = np.arange(0, len(sum_df.user.unique()))

sum_df["index_user"] = sum_df["user"].map(df_user_index.set_index('user')["new_index"]).fillna(0) 
df_with_user_id = sum_df

In [15]:
df_with_user_id.rename(columns={'user': 'author_name', 'index_user': 'user'}, inplace = True)
df_with_user_id

Unnamed: 0,author_name,item,year,rating,user
0,", Ng",CHEBI_17891,2004,1,0
1,", Ng",CHEBI_2511,2004,1,0
2,", Ng",CHEBI_33232,2004,1,0
3,", Ng",CHEBI_36976,2004,1,0
4,", Ng",CHEBI_37958,2004,1,0
5,", Ng",CHEBI_50406,2004,1,0
6,", Ng",CHEBI_5054,2004,1,0
7,", Ng",CHEBI_50906,2004,1,0
8,", Ng",CHEBI_7754,2004,1,0
9,", Ng",DOID_11247,2004,1,0
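
A similar name-to-ID mapping can be obtained with `pandas.factorize`, which assigns consecutive integer codes in order of first appearance. The sketch below uses a small made-up DataFrame with invented author names, not the notebook's data.

```python
import pandas as pd

# Hypothetical interactions; the author names are invented for illustration
df = pd.DataFrame({
    "author_name": ["Doe, Jane", "Doe, Jane", "Roe, Rick", "Poe, Ann", "Roe, Rick"],
    "item": ["CHEBI_35341", "DOID_4", "DOID_4", "CHEBI_2511", "CHEBI_2511"],
})

# codes[i] is the integer ID of the i-th author name;
# uniques lists the names in ID order
codes, uniques = pd.factorize(df["author_name"])
df["user"] = codes

print(df)
print(list(uniques))
```

Keeping `uniques` makes it possible to translate a user ID back to the author name when needed.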


<a id="3.4"></a>
### 3.4. Get entities labels

In [16]:
list_of_entities = df_with_user_id.item.unique()
print(list_of_entities) 

['CHEBI_17891' 'CHEBI_2511' 'CHEBI_33232' 'CHEBI_36976' 'CHEBI_37958'
 'CHEBI_50406' 'CHEBI_5054' 'CHEBI_50906' 'CHEBI_7754' 'DOID_11247'
 'DOID_2237' 'DOID_2945' 'DOID_552' 'DOID_6132' 'DOID_614' 'DOID_615'
 'DOID_8469' 'CHEBI_131189' 'CHEBI_15347' 'CHEBI_16382' 'CHEBI_17790'
 'CHEBI_22587' 'CHEBI_27998' 'CHEBI_30212' 'CHEBI_36080' 'DOID_0050152'
 'DOID_0110740' 'DOID_10533' 'DOID_11394' 'DOID_8566' 'DOID_874'
 'CHEBI_23888' 'DOID_0050012' 'DOID_12365' 'DOID_12384' 'DOID_12385'
 'DOID_1498' 'DOID_3482' 'DOID_4' 'DOID_635' 'CHEBI_33731' 'CHEBI_75830'
 'DOID_0111084' 'DOID_0111627' 'DOID_225' 'CHEBI_15035' 'CHEBI_17234'
 'CHEBI_2509' 'CHEBI_35341' 'DOID_0050185' 'DOID_0060317' 'DOID_0080750'
 'DOID_10247' 'DOID_10554' 'DOID_10923' 'DOID_11162' 'DOID_11335'
 'DOID_12375' 'DOID_1588' 'DOID_1673' 'DOID_1787' 'DOID_1969' 'DOID_2275'
 'DOID_2355' 'DOID_2841' 'DOID_2942' 'DOID_583' 'DOID_6000' 'DOID_820'
 'DOID_848' 'DOID_934' 'DOID_9395' 'CHEBI_35143' 'DOID_162' 'DOID_2043'
 'DOID_5844' 'DOI

Now we create a graph representation of each ontology with the **RDFLib** library.

**Concept of RDF**

RDF (Resource Description Framework) is a graph-based data model, standardized by the W3C, for publishing and interchanging data on the Web. It stores semantic facts as subject-predicate-object triples and is also the data model of semantic graph databases (also known as RDF triplestores).

RDF triplestore databases are successfully used for managing **Linked Open Data** datasets.

**Linked Open Data** is a set of design principles for sharing machine-readable interlinked Open Data on the Web.

<center><img src="img/Linked-Data.png" width="300" ><br> (Source: https://www.ontotext.com)</center>

As the HTTP protocol provides a simple mechanism for retrieving resources, when things can be identified by URIs in conjunction with this protocol, they become easier to find. This expedites publishing any kind of data and adding it to the global data space.

The Uniform Resource Identifier (URI) is a unique sequence of characters used for giving unique names to anything – from digital content available on the Web to real-world objects and abstract concepts. With the help of URIs, we can distinguish between different things or know that one thing from one dataset is the same as another in a different dataset.

Linked Data is one of the core pillars of the **Semantic Web**:

<center><img src="img/semanticweb.png" width="250" ><br> (Source: wikipedia)</center>
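
To make the triple model concrete, here is a minimal sketch in plain Python rather than RDFLib: each fact is a (subject, predicate, object) tuple whose subject and predicate are URIs. The `objects` helper is a hypothetical name introduced for this illustration; the 'steroid' label is the one used later in this section, and the predicate URIs are the standard `rdfs:label`, `rdf:type`, and `owl:Class` identifiers.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"
OBO = "http://purl.obolibrary.org/obo/"

# A tiny in-memory "triplestore": a set of (subject, predicate, object) facts
triples = {
    (OBO + "CHEBI_35341", RDFS_LABEL, "steroid"),
    (OBO + "CHEBI_35341", RDF_TYPE, OWL_CLASS),
}


def objects(store, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the store."""
    return [o for s, p, o in store if s == subject and p == predicate]


print(objects(triples, OBO + "CHEBI_35341", RDFS_LABEL))
```

Because every subject is a URI, two datasets that use the same URI are talking about the same thing, which is what makes the data linkable.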


In [18]:
os.chdir('data/ontologies')

In [19]:
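# Graph.parse() replaces the deprecated Graph.load(); OWL files are read as RDF/XML
chebi = rdflib.Graph()
print('Loading ... chebi')
chebi.parse('chebi_lite.owl', format='xml')

doid = rdflib.Graph()
print('Loading ... doid')
doid.parse('doid.owl', format='xml')

print('Successful loading!')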

Loading ... chebi
Loading ... doid
Successful loading!


In [None]:
os.chdir('../../')

The entity labels are available under http://purl.obolibrary.org/obo/, with the ontology determined by the prefix of the item ID.

For example, the item 'CHEBI:35341' corresponds to http://purl.obolibrary.org/obo/CHEBI_35341, whose label is 'steroid'. That is, the identifier 'CHEBI:35341' is linked to the ChEBI concept 'steroid'.

<center><img src="img/steroid_chebi.png" width=400><img src="img/steroid_graphview.png" width="400" ><br> </center>
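
The prefix-based lookup described above can be sketched as two small helpers: one builds the PURL for an item ID, the other names the ontology to query. `entity_uri` and `ontology_for` are hypothetical names introduced for this illustration and are not part of the notebook.

```python
OBO_BASE = "http://purl.obolibrary.org/obo/"


def entity_uri(item_id):
    """Build the OBO PURL for an ID such as 'CHEBI_35341' or 'CHEBI:35341'."""
    return OBO_BASE + item_id.replace(":", "_")


def ontology_for(item_id):
    """Return the ontology implied by the ID prefix, or None if it is unknown."""
    prefix = item_id.replace(":", "_").split("_", 1)[0]
    return {"CHEBI": "chebi", "DOID": "doid"}.get(prefix)


print(entity_uri("CHEBI:35341"))
print(ontology_for("DOID_11247"))
```

Returning `None` for an unknown prefix lets the caller decide whether to skip the item or report it.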

Now look up the label of each entity in the ontology graph that matches its prefix.

In [17]:
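entities_label = []
for entity_id in list_of_entities:
    uri = URIRef('http://purl.obolibrary.org/obo/' + entity_id)
    # Graph.value() with rdfs:label is used instead of the deprecated Graph.label()
    if entity_id.startswith('CHEBI'):
        lab = chebi.value(uri, rdflib.RDFS.label)
    elif entity_id.startswith('DOID'):
        lab = doid.value(uri, rdflib.RDFS.label)
    else:
        lab = None  # unknown prefix: no label available
    entities_label.append(lab)

entities_label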


In [None]:
df_entities = pd.DataFrame(list_of_entities, columns=["item_id"])
df_entities["entity_name"] = np.array(entities_label)
df_entities

Map the labels onto the dataset to obtain <author_name, item, year, rating, user, item_name>

In [None]:
df_with_user_id["item_name"] = df_with_user_id["item"].map(df_entities.set_index('item_id')["entity_name"]).fillna(0)
df_with_user_id

<a id="3.5"></a>
### 3.5. Save data

First, create the output dir 

In [None]:
cord_ds_dir = 'data/results/sample_cord-19_ds.csv'
cord_userid_dir = 'data/results/sample_cord-19_ds_userid.csv'

# create the output directory if it does not exist yet
Path(cord_ds_dir).parent.mkdir(parents=True, exist_ok=True)

Now save the dataset and the user-ID mapping to CSV files

In [None]:
df_with_user_id[['user', 'item', 'rating', 'item_name', 'year']].to_csv(cord_ds_dir, index=False, header=False)

# one row per user: drop the repeats that come from a user having many items
df_with_user_id[['user', 'author_name']].drop_duplicates().to_csv(cord_userid_dir, index=False, header=False)

In [None]:
sample_cord = pd.read_csv(cord_ds_dir, sep = ',', quotechar = '"',  encoding = 'utf-8', dtype=str, header=None) 
sample_cord.head(1)

<a id="4"></a>
## 4. Data statistics 

In [None]:
import pandas as pd

dataset = pd.read_csv('data/results/sample_cord-19_ds.csv', names=['user', 'item', 'rating', 'item_name', 'year'])
print(dataset.head(20))

In [None]:
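# number of unique users
# number of unique items
# number of ratings (one per row)
# sparsity: share of user-item pairs without a rating

n_users = dataset.user.nunique()
n_items = dataset.item.nunique()
n_ratings = len(dataset)  # dataset.size would count cells (rows x columns), not rows
# the same user-item pair can occur in several years, so count distinct pairs
n_pairs = len(dataset[['user', 'item']].drop_duplicates())

print('n users: ', n_users)
print('n items: ', n_items)
print('n ratings: ', n_ratings)

print('sparsity: ', 1 - (n_pairs / (n_users * n_items)))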
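
As a worked example of the sparsity computation: sparsity is one minus the fraction of all possible user-item pairs that actually have a rating. The sketch below uses a made-up five-row dataset and counts distinct (user, item) pairs, which is an assumption about how pairs repeated across years should be treated.

```python
import pandas as pd

# Hypothetical ratings, invented for illustration
toy = pd.DataFrame({
    "user": [0, 0, 1, 2, 2],
    "item": ["CHEBI_35341", "DOID_4", "DOID_4", "CHEBI_2511", "DOID_4"],
    "rating": [1, 2, 1, 1, 1],
})

n_users = toy["user"].nunique()
n_items = toy["item"].nunique()
# number of distinct user-item pairs that have at least one rating
n_pairs = len(toy[["user", "item"]].drop_duplicates())

# 3 users x 3 items = 9 possible pairs, of which 5 are observed
sparsity = 1 - n_pairs / (n_users * n_items)
print(round(sparsity, 3))
```

A value close to 1 means that most users have interacted with only a small share of the items, which is typical for recommendation datasets.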

In [None]:
# items by ontology

print("CHEBI items: ", dataset[dataset.item.str.startswith('CHEBI')].item.unique().shape[0])

print("DOID items: ", dataset[dataset.item.str.startswith('DOID')].item.unique().shape[0])

In [None]:
# items by user
import matplotlib as mpl
import matplotlib.pyplot as plt
#%matplotlib inline

unique_users = dataset.user.unique()
items_by_user = dataset.groupby(['user'])["item"].count().reset_index()
items_by_user = items_by_user.sort_values(by=['item'], ascending=False)
print(items_by_user)
items_by_user.user = items_by_user.user.astype('str')

print('max items by user: ', items_by_user.item.max())
print('min items by user: ', items_by_user.item.min())
print('mean items by user: ', items_by_user.item.mean())

In [None]:
plt.scatter(items_by_user.user, items_by_user.item)
plt.axhline(y=items_by_user.item.mean(), color='r', linestyle='-')
plt.ylabel('number of items')
plt.xlabel('user')
plt.show()

In [None]:
# users by item

unique_items = dataset.item.unique()
users_by_item = dataset.groupby(['item'])["user"].count().reset_index()
users_by_item = users_by_item.sort_values(by=['user'], ascending=False)

print('max users by item: ', users_by_item.user.max())
list_of_max_items = users_by_item[users_by_item.user == users_by_item.user.max()].item.values
print(list_of_max_items)

print(dataset[dataset.item == list_of_max_items[0]])

print('min users by item: ', users_by_item.user.min())
print('mean users by item: ', users_by_item.user.mean())

In [None]:
dataset_doid = dataset[dataset.item.str.startswith('DOID')]
unique_items_d = dataset_doid.item.unique()
users_by_item_d = dataset_doid.groupby(['item'])["user"].count().reset_index()
users_by_item_d = users_by_item_d.sort_values(by=['user'], ascending=False)

print('max users by item: ', users_by_item_d.user.max())
list_of_max_items_d = users_by_item_d[users_by_item_d.user == users_by_item_d.user.max()].item.values
print(list_of_max_items_d)

print(dataset_doid[dataset_doid.item == list_of_max_items_d[0]])

print('min users by item: ', users_by_item_d.user.min())
print('mean users by item: ', users_by_item_d.user.mean())

In [None]:
dataset_chebi = dataset[dataset.item.str.startswith('CHEBI')]
unique_items_c = dataset_chebi.item.unique()
users_by_item_c = dataset_chebi.groupby(['item'])["user"].count().reset_index()
users_by_item_c = users_by_item_c.sort_values(by=['user'], ascending=False)

print('max users by item: ', users_by_item_c.user.max())
list_of_max_items_c = users_by_item_c[users_by_item_c.user == users_by_item_c.user.max()].item.values
print(list_of_max_items_c)

print(dataset_chebi[dataset_chebi.item == list_of_max_items_c[0]])

print('min users by item: ', users_by_item_c.user.min())
print('mean users by item: ', users_by_item_c.user.mean())

In [None]:
#%matplotlib qt

plt.scatter(users_by_item.item, users_by_item.user)
plt.axhline(y=users_by_item.user.mean(), color='r', linestyle='-')
#plt.yscale('log')
plt.ylabel('number of users')
plt.xlabel('items')
plt.show()