# Create CORD19 MongoDB Database Entries

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/covid-paper-browser/blob/master/notebooks/create_db_entries_pickles.ipynb)

This notebook is intended to be run on Google Colab or another GPU provider's platform. It is a step-by-step replacement for the `create_db` script: the expensive embedding work runs on the GPU, so the part that is run locally doesn't require an excessive amount of time.

**Disclaimer**: The procedure requires access to a Google Drive account with at least 13 GB of storage available. The total runtime in Colab is approximately 3 hours.

## Steps

1. Open this notebook in Colab by clicking the button above.

2. Connect to a GPU runtime and mount your Drive (on the left, Files -> Mount Drive). You should see the path `drive/My Drive` containing your files on the left section.

3. Run the cells below. These will download the data, install Python libraries, define methods and ultimately run the `create_pickles` method for the two database entry types. Instead of actually creating the database, this saves the database entries in `X` files named `{out_name}X.pkl` on your Google Drive (as configured below, one file `overview0.pkl` for overview entries and five files `detailsX.pkl` for details entries). The total size is approximately 12.45 GB for all details entries (including both `title_abstract_embeddings` and `paragraphs_embeddings`) and 364 MB for overview entries only. The latter are also much faster to generate (10 min 31 sec for me).

4. Download the pickled files to your PC.

5. Run a MongoDB session in the background, open a Python session in the same folder as the downloaded files and run:

```python
import os
import pickle

from pymongo import MongoClient
from tqdm import tqdm

YOUR_DB_NAME = 'coviddb'
YOUR_OVERVIEW_COLLECTION_NAME = 'cord19scibertoverview'
YOUR_DETAILS_COLLECTION_NAME = 'cord19scibertdetails'
YOUR_OVERVIEW_OUTNAME = 'overview'
YOUR_DETAILS_OUTNAME = 'details'

overview_files = [f for f in os.listdir() if f.startswith(YOUR_OVERVIEW_OUTNAME)]
details_files = [f for f in os.listdir() if f.startswith(YOUR_DETAILS_OUTNAME)]
client = MongoClient()
db = client[YOUR_DB_NAME]
overview_col = db[YOUR_OVERVIEW_COLLECTION_NAME]
details_col = db[YOUR_DETAILS_COLLECTION_NAME]
for fname in tqdm(overview_files):
    with open(fname, 'rb') as f:
        entries = pickle.load(f)
    overview_col.insert_many(entries)
for fname in tqdm(details_files):
    with open(fname, 'rb') as f:
        entries = pickle.load(f)
    details_col.insert_many(entries)
```

Note that doing this may exhaust your machine's memory for the details entries (it did for me on a machine with 16 GB of RAM), so consider loading and inserting one pickled file at a time.
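
One way to keep memory bounded is to handle a single file per call, so that its entries are released before the next file is loaded, and to send the documents to MongoDB in smaller groups. The following is a minimal sketch, not part of the original script: `insert_pickle_in_chunks` and `chunk_size` are hypothetical names, and `collection` is assumed to be any object exposing `insert_many`, such as a pymongo collection.

```python
import pickle


def insert_pickle_in_chunks(path, collection, chunk_size=1000):
    """Load one pickle file and insert its entries chunk_size at a time.

    `collection` is assumed to expose insert_many (e.g. a pymongo collection).
    Returns the number of inserted entries.
    """
    with open(path, 'rb') as f:
        entries = pickle.load(f)
    inserted = 0
    for start in range(0, len(entries), chunk_size):
        chunk = entries[start:start + chunk_size]
        collection.insert_many(chunk)
        inserted += len(chunk)
    # `entries` is local, so it can be garbage collected once the function
    # returns, before the next file is loaded.
    return inserted
```

For example, `insert_pickle_in_chunks('details0.pkl', details_col)` could be called once per file with the `details_col` collection from the snippet above. Each pickle file still has to fit in memory while it is being processed.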


## Download Data from AI2 Servers

In [0]:
%%bash
mkdir -p data

DATE=2020-03-27
DATA_DIR=data

wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/comm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/noncomm_use_subset.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/custom_license.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/biorxiv_medrxiv.tar.gz -P "${DATA_DIR}"
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/"${DATE}"/metadata.csv -P "${DATA_DIR}"

tar -zxvf "${DATA_DIR}"/comm_use_subset.tar.gz -C "${DATA_DIR}"
tar -zxvf "${DATA_DIR}"/noncomm_use_subset.tar.gz -C "${DATA_DIR}"
tar -zxvf "${DATA_DIR}"/custom_license.tar.gz -C "${DATA_DIR}"
tar -zxvf "${DATA_DIR}"/biorxiv_medrxiv.tar.gz -C "${DATA_DIR}"

## Install Python Libraries

In [0]:
%%capture
!pip install -U transformers pandas sentence_transformers tqdm

## Define the PaperDatabaseEntry Classes (Overview & Details)

In [0]:
class PaperDatabaseEntryOverview:
    """ Defines the PaperDatabaseEntryOverview object stored in the database used to retrieve the list of papers. """
    def __init__(self, x):
        self.cord_id = x['cord_uid']
        self.title = x['title'] if x['title'] not in FILTER_TITLES else ''
        self.license = x['license']
        self.abstract = x['abstract'] if x['abstract'] not in FILTER_ABSTRACTS else ''
        self.publish_time = x['publish_time']
        self.authors = x['authors'].split('; ')
        self.journal = x['journal']
        self.title_abstract_embeddings = []

    def as_dict(self):
        return {
            'cord_id': self.cord_id,
            'title': self.title,
            'license': self.license,
            'abstract': self.abstract,
            'publish_time': self.publish_time,
            'authors': self.authors,
            'journal': self.journal,
            'title_abstract_embeddings': self.title_abstract_embeddings,
        }

    def compute_title_abstract_embeddings(self, model):
        if self.title != '' or self.abstract != '':
            title_abstract = self.title + ' ' + self.abstract
            embedding = model.encode([title_abstract], show_progress_bar=False)
            self.title_abstract_embeddings = embedding[0].tolist()


class PaperDatabaseEntryDetails(PaperDatabaseEntryOverview):
    """ Defines the PaperDatabaseEntryDetails object stored in the database containing additional information for single-paper view. """
    def __init__(self, x):
        super().__init__(x)
        self.url = x['url']
        self.sha = x['sha'].split(';')[0]
        self.source = x['source_x']
        self.doi = x['doi']
        self.pmc_id = x['pmcid']
        self.pubmed_id = x['pubmed_id']
        self.microsoft_id = x['Microsoft Academic Paper ID']
        self.who_id = x['WHO #Covidence']
        self.paragraphs = [] # List of tuples (section_name, text)
        self.bibliography = [] # List of dictionaries
        self.paragraphs_embeddings = []

    def as_dict(self):
        return {
            'cord_id': self.cord_id,
            'url': self.url,
            'sha': self.sha,
            'title': self.title,
            'source': self.source,
            'doi': self.doi,
            'pmc_id': self.pmc_id,
            'pubmed_id': self.pubmed_id,
            'license': self.license,
            'abstract': self.abstract,
            'publish_time': self.publish_time,
            'authors': self.authors,
            'journal': self.journal,
            'microsoft_id': self.microsoft_id,
            'who_id': self.who_id,
            'paragraphs': self.paragraphs,
            'bibliography': self.bibliography,
            'title_abstract_embeddings': self.title_abstract_embeddings,
            'paragraphs_embeddings': self.paragraphs_embeddings,
        }
    
    def compute_paragraphs_embeddings(self, model):
        if len(self.paragraphs) > 0:
            paragraphs_text = [tup[1] for tup in self.paragraphs]
            paragraph_embeddings = model.encode(paragraphs_text, show_progress_bar=False)
            self.paragraphs_embeddings = [e.tolist() for e in paragraph_embeddings]

FILTER_TITLES = ['Index', 'Subject Index', 'Subject index', 'Author index', 'Contents', 
        'Articles of Significant Interest Selected from This Issue by the Editors',
        'Information for Authors', 'Graphical contents list', 'Table of Contents',
        'In brief', 'Preface', 'Editorial Board', 'Author Index',
        'Volume Contents', 'Research brief', 'Abstracts', 'Keyword index',
        'In This Issue', 'Department of Error', 'Contents list', 'Highlights of this issue',
        'Abbreviations', 'Introduction', 'Cumulative Index', 'Positions available',
        'Index of Authors', 'Editorial', 'Journal Watch', 'QUIZ CORNER', 'Foreword', 'Table of contents',
        'Quiz Corner', 'INDEX', 'Bibliography of the current world literature',
        'Index of Subjects', '60 Seconds', 'Contributors',
        'Public Health Watch', 'Commentary', 'Chapter 1 Introduction',
        'Facts and ideas from anywhere', 'Erratum', 'Contents of Volume', 'Patent reports',
        'Oral presentations', 'Abkürzungen', 'Abstracts cont.', 'Related elsevier virology titles contents alert',
        'Keyword Index', 'Volume contents', 'Articles of Significant Interest in This Issue',
        'Appendix', 'Abkürzungsverzeichnis', 'List of Abbreviations', 'Editorial Board and Contents',
        'Instructions for Authors', 'Corrections', 'II. Sachverzeichnis', '1 Introduction',  'List of abbreviations',
        'Response', 'Feedback', 'Poster Sessions', 'News Briefs', 'Commentary on the Feature Article',
        'Papers to Appear in Forthcoming Issues', 'TOC', 'Glossary', 'Letter from the editor', 'Croup',
        'Acronyms and Abbreviations', 'Highlights', 'Forthcoming papers', 'Poster presentations', 'Authors',
        'Journal Roundup', 'Index of authors', 'Table des mots-clés', 'Posters', 'Cumulative Index 2004', 
        'A Message from the Editor', 'Contents and Editorial Board', 'SUBJECT INDEX', 'Contents page 1',
]

FILTER_ABSTRACTS = ['Unknown', '[Image: see text]']

## Methods to Load Model and Rank Abstracts/Paragraphs

In [0]:
import numpy as np
from scipy.spatial.distance import cdist
from sentence_transformers import models, SentenceTransformer


def load_sentence_transformer(
    name: str = 'gsarti/scibert-nli', 
    max_seq_length: int  = 128, 
    do_lower_case: bool  = True) -> SentenceTransformer:
    """ Loads a SentenceTransformer from HuggingFace AutoModel bestiary """
    word_embedding_model = models.BERT(
            name,
            max_seq_length=max_seq_length,
            do_lower_case=do_lower_case
        )
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
            pooling_mode_mean_tokens=True,
            pooling_mode_cls_token=False,
            pooling_mode_max_tokens=False
        )
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])


def match_query(
    query: str,
    model: SentenceTransformer,
    corpus: list,
    corpus_embed: list,
    top_k: int = 5) -> list:
    """ Matches query and paragraph embeddings, returning top scoring paragraphs ids and scores """
    query_embed = model.encode([query], show_progress_bar=False)[0].reshape(1,-1)
    # cdist returns cosine distances, so 1 - distance is the cosine similarity
    similarities = 1 - cdist(query_embed, corpus_embed, "cosine")
    results = zip(corpus, similarities.reshape(-1))
    results = sorted(results, key=lambda x: x[1], reverse=True)
    return results[:top_k]
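
To illustrate the ranking step without loading a model, here is a dependency-free sketch of cosine-similarity ranking on toy two-dimensional vectors. The helper names and the toy data are assumptions made for the example; `match_query` above performs the same kind of computation with `cdist` on real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus, corpus_vecs, top_k=5):
    """Return the top_k (document, score) pairs, highest score first."""
    scored = [(doc, cosine_similarity(query_vec, vec))
              for doc, vec in zip(corpus, corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: three "papers" with hand-written 2D embeddings
corpus = ['paper A', 'paper B', 'paper C']
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = rank_by_similarity([1.0, 0.1], corpus, corpus_vecs, top_k=2)
print([doc for doc, _ in ranked])  # ['paper A', 'paper C']
```

The query vector points almost along the first axis, so `paper A` ranks first, followed by `paper C`, which shares that direction only partly.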

## Methods to Create Database Entries and Pickle Files

In [0]:
import os
import sys
import json
import pandas as pd
import numpy as np
import pickle
from tqdm import tqdm
from sentence_transformers import SentenceTransformer


def create_db_entry(
    data_path: str,
    csv_entry: dict, 
    model: SentenceTransformer,
    data_type):
    """ Creates a single DB entry from a csv entry using the model for creating embeddings """
    db_entry = data_type(csv_entry)
    db_entry.compute_title_abstract_embeddings(model)
    if csv_entry['has_full_text'] == True and data_type == PaperDatabaseEntryDetails:
        foldername = csv_entry['full_text_file']
        # Format is e.g. 'data/biorxiv_medrxiv/file.json'
        path = os.path.join(data_path, foldername, f'{db_entry.sha}.json')
        with open(path, 'r') as f:
            file = json.load(f)
        paragraphs = []
        # Order is: abstracts, body, back_matter, ref_entries
        parts = [file['abstract'], file['body_text'], file['back_matter']]
        for part in parts:
            for paragraph in part:
                paragraphs.append((paragraph['section'], paragraph['text']))
        for key, paragraph in file['ref_entries'].items():
            paragraphs.append((paragraph['type'].title(), paragraph['text']))
        db_entry.paragraphs = paragraphs
        db_entry.compute_paragraphs_embeddings(model)
        db_entry.bibliography = [file['bib_entries'][entry] for entry in file['bib_entries']]
    return db_entry


def create_pickles(
    input_file_path: str = 'data/metadata.csv',
    out_name: str = 'db_entries',
    model_name: str = 'gsarti/scibert-nli',
    n_batches: int = 1,
    data_type = PaperDatabaseEntryOverview) -> None:
    """ Creates pickle files of database entries from input_file_path, using model model_name """
    model = load_sentence_transformer(model_name)
    df = pd.read_csv(input_file_path)
    df = df.fillna('')
    df_batches = np.array_split(df, n_batches)
    inserted = 0
    for i, batch in enumerate(df_batches):
        print(f'Processing batch {i}')
        db_entries = []
        for _, row in tqdm(batch.iterrows()):
            db_entry = create_db_entry('data', row, model, data_type)
            # Only add entries with at least one between title and abstract to enable search
            if len(db_entry.title_abstract_embeddings) > 0:
                db_entries.append(db_entry.as_dict())
        print('Saving entries to', f'drive/My Drive/{out_name}{i}.pkl')
        with open(f'drive/My Drive/{out_name}{i}.pkl', 'wb') as f:
            pickle.dump(db_entries, f)
        inserted += len(db_entries)
        print(f'Inserted {len(db_entries)} new entries.')
    print(f'Done. {len(df)} processed, {inserted} inserted.')
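
`np.array_split` divides the rows into `n_batches` nearly equal parts, giving one extra row to the first `len(df) % n_batches` parts. As a quick sanity check, the sketch below reproduces that arithmetic in plain Python for the 45774 metadata rows reported in this notebook's run log, split into five batches. `batch_sizes` is a hypothetical helper, not part of the notebook.

```python
def batch_sizes(n_rows, n_batches):
    """Sizes of the parts produced by splitting n_rows into n_batches,
    following the same rule as np.array_split."""
    base, extra = divmod(n_rows, n_batches)
    return [base + 1 if i < extra else base for i in range(n_batches)]


sizes = batch_sizes(45774, 5)
print(sizes)  # [9155, 9155, 9155, 9155, 9154]

# One pickle file is written per batch, named {out_name}{i}.pkl
print([f'details{i}.pkl' for i in range(len(sizes))])
```

Each file may hold fewer entries than its batch size, because rows with neither a title nor an abstract are skipped.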

## Run the Creation of Pickle Files

This is the part that takes approximately 3 hours to run for `PaperDatabaseEntryDetails` objects and 11 minutes for `PaperDatabaseEntryOverview` ones.

In [16]:
create_pickles(
    input_file_path='data/metadata.csv',
    out_name='overview',
    model_name='gsarti/scibert-nli',
    n_batches=1,
    data_type=PaperDatabaseEntryOverview
)

5it [00:00, 48.20it/s]

Processing batch 0


45774it [10:31, 72.52it/s]


Saving entries to drive/My Drive/overview0.pkl
Inserted 45444 new entries.
Done. 45774 processed, 45444 inserted.


In [0]:
create_pickles(
    input_file_path='data/metadata.csv',
    out_name='details',
    model_name='gsarti/scibert-nli',
    n_batches=5,
    data_type=PaperDatabaseEntryDetails
)

## Check that Files were Generated Correctly

You should see roughly 12.45 GB in total for the details files and 364 MB for the overview file.

In [17]:
!du -h "drive/My Drive/overview"*

364M	drive/My Drive/overview0.pkl


In [0]:
!du -h "drive/My Drive/details"*
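
File sizes alone do not show whether the pickles are readable. As an optional extra check, the sketch below loads every matching file and counts its entries. `count_pickled_entries` is a hypothetical helper, and loading the details files this way needs enough RAM to hold one file at a time.

```python
import os
import pickle


def count_pickled_entries(directory, prefix):
    """Map each '{prefix}*.pkl' file in directory to its number of entries."""
    counts = {}
    for fname in sorted(os.listdir(directory)):
        if fname.startswith(prefix) and fname.endswith('.pkl'):
            with open(os.path.join(directory, fname), 'rb') as f:
                counts[fname] = len(pickle.load(f))
    return counts
```

Based on the run log above, `count_pickled_entries('drive/My Drive', 'overview')` would be expected to report 45444 entries for `overview0.pkl`.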