#  COVID-19 Open Research Dataset (CORD-19)

- https://pages.semanticscholar.org/coronavirus-research
- https://www.kaggle.com/acmiyaguchi/cord-19-citation-network-with-deduping/output
- https://lg-covid-19-hotp.cs.duke.edu/

### Stats

Papers in CORD-19:
- has valid DOI
External papers cited by CORD-19 papers:
- total
- with DOI
- scraped


### Schema

https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-03-13/json_schema.txt
```
# JSON schema of full text documents
{
    "paper_id": <str>,                      # 40-character sha1 of the PDF
    "metadata": {
        "title": <str>,
        "authors": [                        # list of author dicts, in order
            {
                "first": <str>,
                "middle": <list of str>,
                "last": <str>,
                "suffix": <str>,
                "affiliation": <dict>,
                "email": <str>
            },
            ...
        ],
        "abstract": [                       # list of paragraphs in the abstract
            {
                "text": <str>,
                "cite_spans": [             # list of character indices of inline citations
                                            # e.g. citation "[7]" occurs at positions 151-154 in "text"
                                            #      linked to bibliography entry BIBREF3
                    {
                        "start": 151,
                        "end": 154,
                        "text": "[7]",
                        "ref_id": "BIBREF3"
                    },
                    ...
                ],
                "ref_spans": <list of dicts similar to cite_spans>,     # e.g. inline reference to "Table 1"
                "section": "Abstract"
            },
            ...
        ],
        "body_text": [                      # list of paragraphs in full body
                                            # paragraph dicts look the same as above
            {
                "text": <str>,
                "cite_spans": [],
                "ref_spans": [],
                "eq_spans": [],
                "section": "Introduction"
            },
            ...
            {
                ...,
                "section": "Conclusion"
            }
        ],
        "bib_entries": {
            "BIBREF0": {
                "ref_id": <str>,
                "title": <str>,
                "authors": <list of dict>       # same structure as earlier,
                                                # but without `affiliation` or `email`
                "year": <int>,
                "venue": <str>,
                "volume": <str>,
                "issn": <str>,
                "pages": <str>,
                "other_ids": {
                    "DOI": [
                        <str>
                    ]
                }
            },
            "BIBREF1": {},
            ...
            "BIBREF25": {}
        },
        "ref_entries":
            "FIGREF0": {
                "text": <str>,                  # figure caption text
                "type": "figure"
            },
            ...
            "TABREF13": {
                "text": <str>,                  # table caption text
                "type": "table"
            }
        },
        "back_matter": <list of dict>           # same structure as body_text
    }
}
```

### Target output

- With text `text,text_b,label`
- With IDs `doc_id,doc_id_b,label`

In [1]:
import pandas as pd
import os
import math
import random
import pickle
import json
import re
import numpy as np
from tqdm import tqdm_notebook as tqdm
from collections import defaultdict
import requests
import time
from sklearn.model_selection import StratifiedKFold
from fuzzywuzzy import fuzz

In [2]:
from experiments.environment import get_env

env = get_env()

n_splits = 4

scraper_dir = './output/cord19/'
cord19_dir = os.path.join(env['datasets_dir'], 'cord-19')
dummy_id = '21a4369f83891bf6975dd916c0aa495d5df8709e'

/mnt/hdd/experiments/mostendorff/acl-anthology/environments
Environment detected: gpu_server (in default.yml)


In [3]:
meta_df = pd.read_csv(os.path.join(cord19_dir, 'metadata.csv'), index_col=0, dtype={'doi': str})

In [4]:
meta_df.tail()

Unnamed: 0_level_0,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_full_text,full_text_file,url
cord_uid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
4360s2yu,289deae0b2050aa259a05ba84565a4df82fa099a,Elsevier,Personal Protective Equipment: Protecting Heal...,10.1016/j.clinthera.2015.07.007,PMC4661082,26452427.0,els-covid,Abstract Purpose The recent Ebola epidemic tha...,2015-11-01,"Fischer, William A.; Weber, David J.; Wohl, Da...",Clinical Therapeutics,,,True,custom_license,https://doi.org/10.1016/j.clinthera.2015.07.007
66jumbir,21a4369f83891bf6975dd916c0aa495d5df8709e,Elsevier,Viruses and asthma,10.1016/j.bbagen.2011.01.012,PMC3130828,21291960.0,els-covid,Abstract Background Viral respiratory infectio...,2011-11-30,"Dulek, Daniel E.; Peebles, R. Stokes",Biochimica et Biophysica Acta (BBA) - General ...,,,True,custom_license,https://doi.org/10.1016/j.bbagen.2011.01.012
3wk36h9p,,Elsevier,Why the WHO won't use the p-word,10.1016/s0262-4079(20)30474-7,,,els-covid,"There are no criteria for a pandemic, but covi...",2020-03-07,"MacKenzie, Debora",New Scientist,,#5716,False,custom_license,https://doi.org/10.1016/s0262-4079(20)30474-7
0ujw0gak,,WHO,"Communication, transparency key as Canada face...",10.1503/cmaj.1095846,PMC7030882,32071113.0,unk,,2020-02-17,"Glauser, Wendy",Canadian Medical Association Journal,1953688000.0,#4117,False,,https://doi.org/10.1503/cmaj.1095846
28vx9w58,3369a14e1d116943f48b3a33597796c9802de279; f523...,PMC,Searching for animal models and potential targ...,10.1016/j.onehlt.2017.03.001,PMC5454147,28616501.0,cc-by-nc-nd,Emerging and re-emerging pathogens represent a...,2017-03-03,"Vergara-Alert, Júlia; Vidal, Enric; Bensaid, A...",One Health,,,True,noncomm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...


In [5]:
len(meta_df['doi'].unique()) / len(meta_df)

0.9271638921658584

In [6]:
id2meta = {row['sha']: row for idx, row in meta_df.iterrows() if row['sha']}
len(id2meta)

31745

In [7]:
id2meta[dummy_id]['doi']

'10.1016/j.bbagen.2011.01.012'

# Load paper data

In [8]:
subsets = ['biorxiv_medrxiv', 'comm_use_subset', 'custom_license', 'noncomm_use_subset']
id2paper = {}

has_doi = 0
bib_count = 0  
cits = [] # from_doi, to_doi, <section title>
    
for ss in subsets:
    ss_dir = os.path.join(cord19_dir, ss)
    
    # iterate over files
    for fn in os.listdir(ss_dir):
        if not fn.endswith('.json'):
            continue
            
        fp = os.path.join(ss_dir, fn)
        with open(fp, 'r') as f:
            paper = json.load(f)            
            
            if paper['paper_id'] not in id2meta:
                continue
            
            meta = id2meta[paper['paper_id']]
            
            paper['_meta'] = dict(meta)
            
            id2paper[paper['paper_id']] = paper
            
            # has valid DOI
            if isinstance(meta['doi'], str) and len(meta['doi']) > 10:
                # iterate over body text            
                for paragraph in paper['body_text']:
                    # iterate over each citation marker
                    for cit in paragraph['cite_spans']:
                        # find corresponding bib entry
                        if cit['ref_id'] in paper['bib_entries']:
                            bib = paper['bib_entries'][cit['ref_id']]
                            bib_count += 1

                            # only use bib entries with DOI
                            if 'DOI' in bib['other_ids']:
                                has_doi += 1

                                for out_doi in bib['other_ids']['DOI']:
                                    cits.append((
                                        meta['doi'],
                                        out_doi,
                                        paragraph['section']
                                    ))
        #break
    #break

In [9]:
id2paper[dummy_id]['metadata']['title']

'Viruses and asthma ☆'

In [10]:
# Load paper data from disk (scraped)
if os.path.exists(os.path.join(scraper_dir, 'doi2paper.json')):
    with open(os.path.join(scraper_dir, 'doi2paper.json'), 'r') as f:
        doi2s2paper = json.load(f)

    print(f'Loaded {len(doi2s2paper):,} from disk')
else:
    doi2s2paper = None

Loaded 65,309 from disk


In [11]:
for k in doi2s2paper:
    print(k)
    break

10.1142/S1793048008000708


In [12]:
# Load paper data from CORD-19
doi2paper = {id2meta[pid]['doi']: paper for pid, paper in id2paper.items() if pid in id2meta}

print(f'Loaded {len(doi2paper)} from CORD-19')

Loaded 29891 from CORD-19


In [13]:
# Merge CORD-19 + S2


In [13]:
print(f'Paper count: {len(id2paper)}')
print(f'DOI exists: {has_doi/bib_count} (total: {bib_count}; doi: {has_doi})')
print(f'Citation pairs: {len(cits)}')

Paper count: 30197
DOI exists: 0.166464416275948 (total: 1971154; doi: 328127)
Citation pairs: 328127


In [14]:
#cits_with_doi = [c for c in cits if c[0] in doi2paper and c[1] in doi2paper]
cits_with_doi = [c for c in cits if (c[0] in doi2paper or c[0] in doi2s2paper) and (c[1] in doi2paper or c[1] in doi2s2paper)]

In [15]:
# CORD-19 only: Citations with DOI: 30655 (0.09342419246206499)
# + S2: Citations with DOI: 170454 (0.5194756908148369)

print(f'Citations with DOI: {len(cits_with_doi)} ({len(cits_with_doi)/len(cits)})')

Citations with DOI: 170454 (0.5194756908148369)


In [16]:
missing_papers = [c[0] for c in cits if c[0] not in doi2paper]
missing_papers += [c[1] for c in cits if c[1] not in doi2paper]

print(f'Missing paper data, but DOI: {len(missing_papers)}')

Missing paper data, but DOI: 297472


In [17]:
unique_missing_papers = set(missing_papers)

print(f'Unique DOIs of missing papers: {len(unique_missing_papers)}')

Unique DOIs of missing papers: 140682


In [18]:
unique_cits = {(c[0], c[1]) for c in cits_with_doi}
len(unique_cits)

108001

In [32]:
# section titles
sect_titles_count = defaultdict(int)

for from_doi, to_doi, sect_title in cits_with_doi:
    for t in normalize_section(sect_title).split(' and '):
        sect_titles_count[t] += 1

In [36]:
import operator
top_sect_titles = sorted(sect_titles_count.items(), key=operator.itemgetter(1))
top_sect_titles.reverse()
top_sect_titles


[('discussion', 28879),
 ('introduction', 27012),
 ('', 9706),
 ('conclusion', 1971),
 ('results', 1941),
 ('background', 1192),
 ('methods', 1192),
 ('materials', 894),
 ('virus', 452),
 ('future work', 332),
 ('autophagy', 291),
 ('epidemiology', 250),
 ('cells', 243),
 ('treatment', 216),
 ('pathogenesis', 210),
 ('the study', 186),
 ('figure 2.', 172),
 ('influenza virus', 172),
 ('statistical analysis', 172),
 ('author summary', 171),
 ('phylogenetic analysis', 148),
 ('government-published information (n = 33; 15%)', 141),
 ('perspectives', 141),
 ('rna virus', 132),
 ('innate immunity', 128),
 ('influenza', 127),
 ('structure', 125),
 ('apoptosis', 120),
 ('vaccines', 117),
 ('evolution', 111),
 ('trpml1-3 channels', 110),
 ('expression', 108),
 ('prevention', 108),
 ('macrophages', 107),
 ('classification', 106),
 ('plasmids', 105),
 ('quantification', 102),
 ('summary', 97),
 ('replication', 97),
 ('mda5', 96),
 ('clinical features', 91),
 ('disease', 91),
 ('sequencing', 91),

In [37]:
# normalize section title
def normalize_section(title):
    return title.strip().lower()\
        .replace('conclusions', 'conclusion')\
        .replace('concluding remarks', 'conclusion')\
        .replace('future perspectives', 'future work')\
        .replace('future directions', 'future work')\
        .replace('viruses.', 'virus')\
        .replace('viruses', 'virus')            
        #.replace('conclusion and future perspectives', 'conclusion')\
        #.replace('materials and methods', 'methods')

In [38]:
# resolve 'and' titles
def resolve_and_sect_titles(cits):
    for from_doi, to_doi, sect_title in cits:
        for t in normalize_section(sect_title).split(' and '):
            yield (from_doi, to_doi, t)

normalized_cits_with_doi = resolve_and_sect_titles(cits_with_doi)

In [39]:
list(resolve_and_sect_titles(cits_with_doi[:3]))

[('10.1101/001727', '10.1038/nature11711', 'introduction'),
 ('10.1101/003889',
  '10.1111/irv.12226',
  'validity of networked metapopulation'),
 ('10.1101/006866', '10.4049/jimmunol.1000445', 'introduction')]

In [40]:
cits_df = pd.DataFrame(normalized_cits_with_doi, columns=['from_doi', 'to_doi', 'citing_section'])
cits_df

Unnamed: 0,from_doi,to_doi,citing_section
0,10.1101/001727,10.1038/nature11711,introduction
1,10.1101/003889,10.1111/irv.12226,validity of networked metapopulation
2,10.1101/006866,10.4049/jimmunol.1000445,introduction
3,10.1101/006866,10.1073/pnas.1107498108,introduction
4,10.1101/006866,10.4049/jimmunol.1102097,introduction
...,...,...,...
199767,10.1093/jtm/taaa021,10.2139/ssrn.3525558,results
199768,10.3348/kjr.2020.0096,10.3348/kjr.2020.0078,
199769,10.2478/jccm-2020-0013,10.1016/s0140-6736(20)30183-5,
199770,10.2478/jccm-2020-0013,10.1016/s0140-6736(20)30183-5,


In [41]:
print(f'After normalization: {len(cits_df):,} (before: {len(cits_with_doi):,})')

After normalization: 199,772 (before: 170,454)


In [42]:
cits_df['citing_section'] = [normalize_section(t) for t in cits_df['citing_section'].values]

In [43]:
top_sections = 10

cits_df['citing_section'].value_counts()[:top_sections]

discussion      28879
introduction    27012
                 9706
conclusion       1971
results          1941
methods          1192
background       1192
materials         894
virus             452
future work       332
Name: citing_section, dtype: int64

In [44]:
labels = list(filter(lambda t: t, cits_df['citing_section'].value_counts()[:top_sections].keys()))
labels

['discussion',
 'introduction',
 'conclusion',
 'results',
 'methods',
 'background',
 'materials',
 'virus',
 'future work']

In [45]:
def to_label(t, labels):
    t = normalize_section(t)
    
    if t in labels:
        return t
    else:
        return 'other'

In [46]:
label_col = 'label'

In [47]:
cits_df[label_col] = [to_label(t, labels) for t in cits_df['citing_section']]
cits_df.drop_duplicates(['from_doi', 'to_doi', 'label'], keep='first', inplace=True)
len(cits_df)

122598

In [48]:
tmp_df = cits_df.groupby(['from_doi', 'to_doi']).label.agg([('label_count', 'count'), (label_col, ','.join)]).reset_index()
tmp_df

Unnamed: 0,from_doi,to_doi,label_count,label
0,10.1002/1873-3468.13478,10.1128/JVI.01240-17,1,other
1,10.1002/ame2.12017,10.3201/eid2206.152113,1,other
2,10.1002/bdm.2056,10.3102/10769986031004437,1,other
3,10.1002/cbf.3182,10.1182/blood-2011-01-329060,1,other
4,10.1002/cbf.3182,10.1186/s12977-015-0178-0,1,other
...,...,...,...,...
107996,10.9745/ghsp-d-19-00188,10.1371/journal.pntd.0006007,1,introduction
107997,10.9745/ghsp-d-19-00188,10.1371/journal.pone.0129054,1,other
107998,10.9745/ghsp-d-19-00188,10.15406/ogij.2016.05.00180,1,other
107999,10.9745/ghsp-d-19-00188,10.2471/BLT.17.201541,1,introduction


In [49]:
tmp_df['label_count'].value_counts()

1    95524
2    10633
3     1609
4      196
5       37
6        2
Name: label_count, dtype: int64

In [50]:
for k, p in doi2paper.items():
    break
    
p['abstract'][0]['text']
#p['metadata']['title']

'Next-generation sequencing is increasingly being used to study samples composed of mixtures of organisms, such as in clinical applications where the presence of a pathogen at very low abundance may be highly important. We present an analytical method (SIANN: Strain Identification by Alignment to Near Neighbors) specifically designed to rapidly detect a set of target organisms in mixed samples that achieves a high degree of species-and strain-specificity by aligning short sequence reads to the genomes of near neighbor organisms, as well as that of the target. Empirical benchmarking alongside the current state-of-the-art methods shows an extremely high Positive Predictive Value, even at very low abundances of the target organism in a mixed sample. SIANN is available as an Illumina BaseSpace app, as well as through Signature Science, LLC. SIANN results are presented in a streamlined report designed to be comprehensible to the non-specialist user, providing a powerful tool for rapid speci

In [51]:
for k, p in doi2s2paper.items():
    break
    
p['title']
p['abstract']

"The spread of epidemics is inevitably entangled with human behavior, social contacts, and population flows among different geographical regions. The collection and analysis of datasets which trace the activities and interactions of individuals, social patterns, transportation infrastructures and travel fluxes, have unveiled the presence of connectivity patterns characterized by complex features encoded in large-scale heterogeneities and unbounded statistical fluctuations. These features dramatically affect the behavior of dynamical processes occurring on networks, and are responsible for the observed statistical properties of the processes' dynamics and evolution patterns. Here we will present a large-scale stochastic computational approach for the study of the global spread of emergent infectious diseases which explicitly incorporates real world transportation networks and census data."

# Generate output as TSV

In [63]:
def get_text_from_doi(doi):
    text = ''
    sep = '\n'
    
    if doi in doi2s2paper:
        # from s2 scraper
        #text += doi2s2paper[doi]['title']
        
        if doi2s2paper[doi]['abstract']:
            #text += '\n' + doi2s2paper[doi]['abstract']
            text = doi2s2paper[doi]['title'] + sep + doi2s2paper[doi]['abstract']

    elif doi in doi2paper:
        #text += doi2paper[doi]['metadata']['title']
        
        if len(doi2paper[doi]['abstract']) > 0:
            #text += doi2paper[doi]['metadata']['title'] + '\n' + doi2paper[doi]['abstract'][0]['text'] 
            text = doi2paper[doi]['metadata']['title'] + sep + doi2paper[doi]['abstract'][0]['text'] 
    else:
        raise ValueError('DOI not found')
        
    return text


In [64]:
# Positive samples
pos_rows = []

for idx, r in tmp_df.iterrows():
    text = get_text_from_doi(r['from_doi'])
    text_b = get_text_from_doi(r['to_doi'])
    
    # Filter out empty texts
    if text != '' and text_b != '':
        pos_rows.append((r['from_doi'], r['to_doi'], text, text_b, r[label_col]))

# Negative sampling

Requirements:
- no connected with citation
- no co-citation
- no shared author
- no shared venue

In [54]:
all_dois = list(map(str, set(list(doi2s2paper.keys()) + list(doi2paper.keys()))))

print(f'Total DOIs: {len(all_dois):,}')

Total DOIs: 95,200


In [55]:
def get_cit_pair(a, b):
    # ensure citation pair is always in same order
    if a > b:
        return (a, b)
    else:
        return (b, a)


cits_set = set([get_cit_pair(from_doi, to_doi) for from_doi, to_doi, label in cits_with_doi])

print(f'Total citation count: {len(cits_set):,}')

Total citation count: 108,001


In [56]:
# co cits
from_to_cits = defaultdict(set)

for from_doi, to_doi, label in cits_with_doi:
    from_to_cits[from_doi].add(to_doi)


cocits_set = set()

for from_cit, to_cits in from_to_cits.items():
    for a in to_cits:
        for b in to_cits:
            cocits_set.add(get_cit_pair(a,b))
            
print(f'total co-citation count: {len(cocits_set):,}')

total co-citation count: 2,339,001


In [57]:
# shared author
def get_authors(doi):
    if doi in doi2s2paper:
        s2paper = doi2s2paper[doi]
        last_names = [a['name'].split()[-1].lower() for a in s2paper['authors']]
        return last_names
    elif doi in doi2paper:
        paper = doi2paper[doi]
        last_names = [a['last'].lower() for a in paper['metadata']['authors']]
        return last_names
    else:
        raise ValueError(f'DOI not found: {doi}')

def have_no_shared_authors(a_doi, b_doi):
    try:
        a_authors = set(get_authors(a_doi))
        b_authors = set(get_authors(b_doi))
        
        overlap = a_authors & b_authors
        
        if len(overlap) == 0:
            return True
        else:
            return False
        
    except ValueError:
        return False
    

In [58]:
# has same venue
def get_venue(doi):
    if doi in doi2s2paper:
        s2paper = doi2s2paper[doi]
        return s2paper['venue'].lower().strip()
    
    elif doi in doi2paper:
        paper = doi2paper[doi]
        venue = paper['_meta']['journal']
        
        if isinstance(venue, float) and math.isnan(venue):
            return ''
        else:
            return venue.lower().strip()
    else:
        raise ValueError(f'DOI not found: {doi}')

def have_not_same_venue(a_doi, b_doi):
    a_venue = get_venue(a_doi)
    b_venue = get_venue(b_doi)
    
    if a_venue == "" or b_venue == "":
        # cant answer if venue is not set
        return False
        
    if fuzz.ratio(a_venue, b_venue) < 0.75:
        # fuzzy string matching score must be low!
        return True
    else:
        return False
    

In [66]:
negative_label = 'none'
#negative_needed = 10000 #105492  # len(df)
negative_ratio = 0.5
negative_needed = math.ceil(len(pos_rows) * 0.5)
negative_rows = []
negative_pairs = set()
tries = 0

# Negatives needed: 52,746 (ratio: 0.5)
print(f'Negatives needed: {negative_needed:,} (ratio: {negative_ratio})')

while len(negative_pairs) < negative_needed:
    a = random.choice(all_dois)
    b = random.choice(all_dois)
    
    if a == b:
        tries += 1
        continue
        
    pair = tuple((a,b))
    
    if pair in negative_pairs:
        continue
            
    cit_pair = get_cit_pair(a,b)
    if cit_pair in cits_set:
        tries += 1
        continue
    
    if cit_pair in cocits_set:
        tries += 1
        continue
    
    if not have_no_shared_authors(a, b):
        tries += 1
        continue
    
    if not have_not_same_venue(a, b):
        tries += 1
        continue
        
    text = get_text_from_doi(a)
    text_b = get_text_from_doi(b)
    
    if text == '' or text_b == '':
        continue

    # None of the criteria above matches...
    negative_pairs.add(pair)
    negative_rows.append((
        a,
        b,
        text,
        text_b,
        negative_label,
    ))
    
# Found 45,923 negative rows (tried 16,069,718 random samples)
print(f'Found {len(negative_rows):,} negative rows (tried {tries:,} random samples)')

Negatives needed: 45,923 (ratio: 0.5)
Found 45,923 negative rows (tried 16,176,619 random samples)


# Merge pos + neg samples

In [67]:
# construct
df = pd.DataFrame(pos_rows + negative_rows, columns=['from_doi', 'to_doi', 'text', 'text_b', label_col])

print(f'Total df rows: {len(df)}')

df.drop_duplicates(['text', 'text_b'], keep='first', inplace=True)

print(f'After drop_duplicates - df rows: {len(df)}')

df

Total df rows: 137769
After drop_duplicates - df rows: 137769


Unnamed: 0,from_doi,to_doi,text,text_b,label
0,10.1002/1873-3468.13478,10.1128/JVI.01240-17,Mechanisms and biomedical implications of -1 p...,HIV-1 Exploits a Dynamic Multi-aminoacyl-tRNA ...,other
1,10.1002/ame2.12017,10.3201/eid2206.152113,The battle against SARS and MERS coronaviruses...,MERS-CoV Infection of Alpaca in a Region Where...,other
2,10.1002/bdm.2056,10.3102/10769986031004437,Suffering a Loss Is Good Fortune: Myth or Real...,Computational Tools for Probing Interactions i...,other
3,10.1002/cbf.3182,10.1182/blood-2011-01-329060,Chloroquine could be used for the treatment of...,Hydroxychloroquine drastically reduces immune ...,other
4,10.1002/cbf.3182,10.1186/s12977-015-0178-0,Chloroquine could be used for the treatment of...,Chloroquine and beyond: exploring anti-rheumat...,other
...,...,...,...,...,...
137764,10.1136/bmj.g6120,10.1007/s11262-010-0528-x,Spanish authorities investigate how nurse cont...,Molecular analysis of infectious bronchitis vi...,none
137765,10.1001/jama.2014.14601,10.1016/j.virol.2013.12.002,Molecular findings among patients referred for...,Genome rearrangement of a mycovirus Rosellinia...,none
137766,10.1016/S0140-6736(10)62356-2,10.1128/mbio.00077-11,Development assistance for health: trends and ...,An Insect Nidovirus Emerging from a Primary Tr...,none
137767,10.1016/j.cell.2017.12.023,10.1261/rna.039438.113,Transmembrane Pickets Connect Cyto- and Perice...,Automated classification of RNA 3D motifs and ...,none


In [68]:
df[label_col].value_counts()

other                                 52317
none                                  45923
introduction                          13447
discussion                            13366
introduction,other                     2434
                                      ...  
results,materials,methods,other           1
materials,methods,other,discussion        1
background,methods,discussion             1
introduction,materials                    1
results,conclusion                        1
Name: label, Length: 161, dtype: int64

In [69]:
# Sample data (for debugging & development)
sample_df = df.sample(n=1000, weights=df[label_col].value_counts()[df[label_col]].values.tolist())

display(sample_df[label_col].value_counts())

sample_kf = StratifiedKFold(n_splits=4, random_state=0, shuffle=True)
for train_index, test_index in sample_kf.split(sample_df.index.tolist(), sample_df[label_col].values.tolist()):
    k = 'sample_1k'
    split_dir = os.path.join(cord19_dir, 'splits', str(k))
    
    if not os.path.exists(split_dir):
        os.makedirs(split_dir)
        
    split_train_df = sample_df.iloc[train_index]
    split_test_df = sample_df.iloc[test_index]

    print(f'Total: {len(sample_df):,}; Train: {len(split_train_df):,}; Test: {len(split_test_df):,}')

    split_train_df.to_csv(os.path.join(split_dir, 'train.tsv'), sep='\t', index=False)
    split_test_df.to_csv(os.path.join(split_dir, 'test.tsv'), sep='\t', index=False)
    
    break # we only need one sample set!

del sample_kf


other                 525
none                  405
introduction           41
discussion             27
other,discussion        1
introduction,other      1
Name: label, dtype: int64

Total: 1,000; Train: 747; Test: 253




In [70]:
labels

['discussion',
 'introduction',
 'conclusion',
 'results',
 'methods',
 'background',
 'materials',
 'virus',
 'future work']

In [71]:
# Full training and test set
kf = StratifiedKFold(n_splits=n_splits, random_state=0, shuffle=True)

# Stratified K-Folds cross-validator
for k, (train_index, test_index) in enumerate(kf.split(df.index.tolist(), df[label_col].values.tolist()), 1):
    split_dir = os.path.join(cord19_dir, 'splits', str(k))
    
    if not os.path.exists(split_dir):
        os.makedirs(split_dir)
    
    split_train_df = df.iloc[train_index]
    split_test_df = df.iloc[test_index]
    
    print(f'Total: {len(df):,}; Train: {len(split_train_df):,}; Test: {len(split_test_df):,}')
    
    split_train_df.to_csv(os.path.join(split_dir, 'train.tsv'), sep='\t', index=False)
    split_test_df.to_csv(os.path.join(split_dir, 'test.tsv'), sep='\t', index=False)
    #break



Total: 137,769; Train: 103,293; Test: 34,476
Total: 137,769; Train: 103,326; Test: 34,443
Total: 137,769; Train: 103,323; Test: 34,446
Total: 137,769; Train: 103,365; Test: 34,404


In [147]:
# ...

# Scrape missing paper data from S2 API with DOI

In [5]:
with open(os.path.join(scraper_dir, 'unique_missing_papers.json'), 'r') as f:
    unique_missing_papers = set(json.load(f))  

In [6]:
len(unique_missing_papers)  # list of DOIs

140682

In [21]:
with open(os.path.join(scraper_dir, 'unique_missing_papers.json'), 'w') as f:
    json.dump(list(unique_missing_papers), f)     

In [87]:
errors = set()
notfound = set()
doi2paper = {}

In [7]:
with open(os.path.join(scraper_dir, 'doi2paper.json'), 'r') as f:
    doi2paper = json.load(f)
with open(os.path.join(scraper_dir, 'errors.json'), 'r') as f:
    errors = set(json.load(f))
with open(os.path.join(scraper_dir, 'notfound.json'), 'r') as f:
    notfound = set(json.load(f))

In [15]:
print(f'doi2paper: {len(doi2paper)}')  # 12451, 22938
print(f'errors: {len(errors)}')
print(f'notfound: {len(notfound)}')

doi2paper: 23715
errors: 98
notfound: 1526


In [16]:
check_points = []
api_url = 'http://api.semanticscholar.org/v1/paper/'
offset = 0 

for i, doi in enumerate(tqdm(unique_missing_papers, total=len(unique_missing_papers))):
    if i < offset:  # skip 
        continue
        
    if doi in doi2paper or doi in errors or doi in notfound:
        continue
        
    res = requests.get(api_url + doi)    
    
    if res.status_code == 200:
        try:
            doi2paper[doi] = res.json()
        except ValueError:
            print(f'Error cannot parse JSON: {doi}')
            errors.add(doi)
    elif res.status_code == 429:
        print(f'Stop! Rate limit reached at: {i}')
        break
    elif res.status_code == 403:
        print(f'Stop! Forbidden / rate limit reached at: {i}')
        break
    elif res.status_code == 404:
        notfound.add(doi)
    else:
        print(f'Error status: {res.status_code} - {doi}')
        errors.add(doi)
    
    if (i % 10000) == 0 and i > 0:
        with open(os.path.join(scraper_dir, 'doi2paper.json'), 'w') as f:
            json.dump(doi2paper, f)
        with open(os.path.join(scraper_dir, 'errors.json'), 'w') as f:
            json.dump(list(errors), f)
        with open(os.path.join(scraper_dir, 'notfound.json'), 'w') as f:
            json.dump(list(notfound), f)
            
            
    time.sleep(2.5)
        

HBox(children=(IntProgress(value=0, max=140682), HTML(value='')))

Error status: 500 - 10.1667/0033-7587(2003)159[0484:RROCDA]2.0.CO;2
Error status: 500 - 10.1002/(sici)1520-6300(1996)8:4<497::Aid-ajhb10>3.0.Co;2-h
Error status: 500 - 10.1002/1097-0142(19920115)69:2<537::AID-CNCR2820690242>3.0.CO;2-3
Error status: 500 - 10.1002/1522-2683(200203)23:5<677::AID-ELPS677>3.0.CO;2â•ﬁ8
Error status: 500 - 10.1128/CMR.14.2.430&ndash;445.2001
Error status: 500 - 10.1002/1521-4141(200209)32:9<2635::AID-IMMU2635>3.0.CO;2-N
Error status: 500 - 10.1073/pnas.1104306108;10.1073/pnas.1104306108
Error status: 500 - 10.1175/1520-0450(1984)023<1674:AUSOAT>2.0.CO;2
Error status: 500 - 10.1002/(SICI)1097-4547(19980201)51:3<403::AID-JNR13>3.0.CO;2-7


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1603/0022-2585(2005)042[0473:DORBAB]2.0.CO;2
Error status: 500 - 10.1002/(SICI)1521-1878(199911)21:11<932::AID-BIES5>3.0.CO;2-N
Error status: 500 - 10.1002/(SICI)1097-0290(19990720)64:2<135::AID-BIT2>3.3.CO;2-H
Error status: 500 - 10.1128/CMR.14.1.129&ndash;149.2001


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1002/(SICI)1096-9071(199706)52:2<121::AID-JMV1>3.0.CO;2-5
Error status: 500 - 10.1128/JVI.79.20.12989&ndash;12998.2005
Error status: 500 - 10.1038/nmeth.1923
Error status: 500 - 10.1128/IAI.70.11.6365&ndash;6372.2002
Error status: 500 - 10.1002/(SICI)1097-4598(199904)22:4<460::AID-MUS6>3.0.CO;2-L
Error status: 500 - 10.1002/(SICI)1522-2683(19991201)20:18%3C3551::AID-ELPS3551%3E3.0.CO;2-2
Error status: 500 - 10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2
Error status: 500 - 10.1093/molbev/msr121
Error status: 500 - 10.1002/(SICI)1099-1573(199702)11:1<42::AID-PTR940>3.0.CO;2-5
Error status: 500 - 10.1007/s00705&ndash;012&ndash;1591&ndash;5
Error status: 500 - 10.1002/1521-4141(200207)32:7<2004::AID-IMMU2004>3.0.CO;2-5
Error status: 500 - 10.1002/(SICI)1098-1004(200001)15:1<57::AID-HUMU12>3.0.CO;2-G
Error status: 500 - 10.1002/(SICI)1096-9071(199912)59:4<552::AID-JMV21>3.0.CO;2-A
Error status: 500 - 10.1002/(SICI)1097-0290(19970605)54:5<468::AID-BIT7>3.0.CO;2-C
Error 

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1002/(SICI)1521-4095(199903)11:3<253::AID-ADMA253>3.0.CO;2-7
Error status: 500 - 10.1093/bioinformatics/btp352
Error status: 500 - 10.1002/(SICI)1097-0010(199601)70:1<55::AID-JSFA471>3.0.CO;2-X
Error status: 500 - 10.1002/(SICI)1096-9071(199810)56:2<159::AID-JMV10>3.0.CO;2-B
Error status: 500 - 10.1002/1531-8249(200002)47:2ï»¿<ï»¿276::AID-ANA28ï»¿>ï»¿3.3.CO;2-T


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1002/(SICI)1096-9071(199605)49:1<1::AID-JMV1>3.0.CO;2-A
Error status: 500 - 10.3398/1527-0904(2006)66[390:BIBCCL]2.0.CO;2
Error status: 500 - 10.1128/IAI.01539-07


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1002/1521-4141(200201)32:1<97::AID-IMMU97>3.0.CO;2-Y
Error status: 500 - 10.1002/(SICI)1096-9071(200005)61:1%6052::AID-JMV8%623.0.CO;2-L
Error status: 500 - 10.1128/JVI.75.8.4014&ndash;4018.2001
Error status: 500 - 10.1202/0002-8894(2000)061<0056:IOBMEF>2.0.CO;2
Error status: 500 - 10.1002/1521-3773(20001002)39:19<3430::AID-ANIE3430>3.0.CO;2-3
Error status: 500 - 10.1016/0021-9681(87)90171-8
Error status: 500 - 10.1099/vir.0.056341&ndash;0


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1128/JB.183.22.6573&ndash;6578.2001
Error status: 500 - 10.1128/JVI.79.15.9439&ndash;9448.2005
Error status: 500 - 10.1002/(SICI)1096-9071(199811)56:3<186::AID-JMV2>3.0.CO;2-3
Error status: 500 - 10.1099/vir.0.043935&ndash;0
Error status: 500 - 10.1637/0005-2086(2002)046[0053:poahko]2.0.co;2
Error status: 500 - 10.1002/(SICI)1097-0010(199909)79:12<1601::AID-JSFA407>3.0.CO;2-1
Error status: 500 - 10.1290/1071-2690(2002)038%3c0123:MIECIP%3e2.0.CO;2
Error status: 500 - 10.1002/1521-4141(200107)31:7<2104::AID-IMMU2104>3.0.CO;2-3
Error status: 500 - 10.1002/(SICI)1098-1136(19990101)25:1<21::AID-GLIA3>3.0.CO;2-R


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1002/(ISSN)1521-414110.1002/1521-4141(2000)30:8<>1.0.CO;2-010.1002/1521-4141(2000)30:8<2372::AID-IMMU2372>3.0.CO;2-D
Error status: 500 - 10.1002/(SICI)1099-1573(199905)13:3<222::AID-PTR447>3.0.CO;2-P
Error status: 500 - 10.1086/519795


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1002/9781119062585&copy;2017
Error status: 500 - 10.1002/(SICI)1521-4141(199804)28:04<1251::AID-IMMU1251>3.0.CO;2-O
Error status: 500 - 10.1017/S1049023&times;13008601
Error status: 500 - 10.1002/(SICI)1520-667X(1998)10:4%3c313:AID-MCS1%3e3.0.CO;2-J
Error status: 500 - 10.1002/(SICI)1096-9071(199708)52:4<406::AID-JMV11>3.0.CO;2-E
Error status: 500 - 10.1099/vir.0.040287&ndash;0
Error status: 500 - 10.1645/0022-3395(2000)086[0627:POTGAI]2.0.CO;2
Error status: 500 - 10.1002/1098-1136(200012)32:3%3C214::AID-GLIA20%3E3.0.CO;2-7
Error status: 500 - 10.1637/0005-2086(2007)51[725:IOSDOV]2.0.CO;2
Error status: 500 - 10.1638/1042-7260(2000)031[0353:ALSOEC]2.0.CO;2


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Error status: 500 - 10.1638/1042-7260(2000)031[0335:ASOCCD]2.0.CO;2
Error status: 500 - 10.1002/(SICI)1097-4547(19960915)45:6%3C735::AID-JNR10%3E3.0.CO;2-V
Error status: 500 - 10.1038/nature10404
Error status: 500 - 10.1093/jpids/piw091
Error status: 500 - 10.1002/rmv.483
Error status: 500 - 10.1073/pnas.85.21.7972
Error status: 500 - 10.3390/ijerph17020428
Error status: 500 - 10.3201/eid2402.171216
Error status: 500 - 10.1097/INF.0b013e31814536ba
Error status: 500 - 10.1002/(SICI)1096-9071(199902)57:2<186::AID-JMV17>3.0.CO;2-Q
Error status: 500 - 10.1890/0012-9658(2000)081[0654:LHAEPP]2.0.CO;2


IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)



KeyboardInterrupt: 

In [17]:
i

54680

In [28]:
res.json()

{'abstract': "Context: Qigong, Tai‐chi and dancing have all been proven effective for Parkinson's disease (PD); however, no study has yet assessed the efficacy of Turo, a hybrid qigong dancing program developed to relieve symptoms in PD patients. Objective: To determine whether Turo may provide benefit in addressing the symptoms of PD patients. Design: Randomized, assessor blind, waiting‐list control, partial crossover study. Setting: Kyung Hee University Korean Medicine Hospital, Seoul, Republic of Korea. Participants: A total of 32 PD patients (mean age 65.7 ± 6.8). Intervention: Participants were assigned to the Turo group or the waiting‐list control group. The Turo group participated in an 8‐week Turo training program (60‐minute sessions twice a week). The waiting‐list control group received no additional treatment during the same period; then underwent the same 8‐week Turo training. Outcome Measures: The primary outcome was a score on the Unified Parkinson's Disease Rating Scale (

In [95]:
requests.get(api_url + '10.4314/ovj.v8i1.5').json()

{'abstract': 'Cancer constitutes the major health problem both in human and veterinary medicine. Comparative oncology as an integrative approach offers to learn more about naturally occurring cancers across different species. Canine models have many advantages as they experience spontaneous disease, have many genes similar to human genes, five to seven-fold accelerated ageing compared to humans, respond to treatments similarly as humans do and health care levels second only to humans. Also, the clinical trials in canines could generate more robust data, as their spontaneous nature mimics real-life situations and could be translated to humans.',
 'arxivId': None,
 'authors': [{'authorId': '47472141',
   'name': 'Faheem Sultan',
   'url': 'https://www.semanticscholar.org/author/47472141'},
  {'authorId': '35458013',
   'name': 'Bilal Ahmad Ganaie',
   'url': 'https://www.semanticscholar.org/author/35458013'}],
 'citationVelocity': 0,
 'citations': [{'arxivId': None,
   'authors': [{'auth

In [19]:
with open(os.path.join(scraper_dir, 'doi2paper.json'), 'w') as f:
    json.dump(doi2paper, f)
with open(os.path.join(scraper_dir, 'errors.json'), 'w') as f:
    json.dump(list(errors), f)
with open(os.path.join(scraper_dir, 'notfound.json'), 'w') as f:
    json.dump(list(notfound), f)

In [20]:
len(doi2paper) # 22938

65309

In [21]:
len(errors) # 98

264

In [22]:
len(notfound) # 1526

4147

# Citation network dedup

In [6]:
cit_df = pd.read_parquet(os.path.join(cord19_dir, 't60_citation_index.parquet'))

In [8]:
for idx, row in cit_df.iterrows():
    break

In [9]:
row

citation_id                     ff5b04afab04334df136f78f8fee03af5dd565c5
approx_citation_ids    [4073b3fad31dd01667e794bb7db1dccce400ea92, 1bc...
near_duplicates                                                     1068
rank                                                                   1
Name: 0, dtype: object

In [10]:
row.approx_citation_ids

['4073b3fad31dd01667e794bb7db1dccce400ea92',
 '1bcfcc6d113f9225467954ed880dc9d1dddc584d',
 'b91c07de17f79cfbdb4046602c377e590393cc8e',
 '2074224e10bf884718b7373665ff372525cbb830',
 '1936c179aecd5b3895a9aac32221ff03230a7a63',
 '95497926021f939c17b2c267773cd20756370fed',
 '90308637bf6ffb3a626dd19975bb4ee287a3ce62',
 'ab1c38d2c087c20f6784984a23604cc882b4a8ac',
 'e02547b34463accb464d91d8de93ae95f50d4f66',
 '70fdc2c9713c71c74591bda72ee9af638e4dc695',
 'bedeaaeee17a6166ef5bca20b4d65f10ad6d47dc',
 'b5b84516e3e2154ebbf424298223e98512d7f76b',
 '13ee2b0843adeab2103dd67ec4d65ca793cf5687',
 'e49191b299e7434174401161c9860eae8c5ae481',
 '0fd656403484903391493e46c1e1047f27c8a4da',
 'b075f06f8b901152c10206a84dbff1d267c88063',
 '25757bba0841229b6690343b20ce252e5e3a1af4',
 'd8515865b23baeaccda2068b54a07abb8fa41ac6',
 'd442a361e7d777e8675f7a9464feec176515f26c',
 'c9732ee56d2491f1acc37bb65c165b4d4f1dc6c7',
 'b97eed61f6edee125a19346a63be5a319c77a0df',
 'ed642bdbec29cc1738fcf628e58cac77257b475f',
 'a0d5def2

In [83]:
has_shared_authors(a,b)

False

In [43]:
last_names = [a['name'].split()[-1].lower() for a in s2paper['authors']]
last_names

['colizza', 'barrat', 'barthelemy', 'vespignani']

In [72]:
for doi, paper in doi2paper.items():
    break

In [73]:
paper

{'paper_id': 'f056da9c64fbf00a4645ae326e8a4339d015d155',
 'metadata': {'title': 'SIANN: Strain Identification by Alignment to Near Neighbors',
  'authors': [{'first': 'Samuel',
    'middle': ['S'],
    'last': 'Minot',
    'suffix': '',
    'affiliation': {},
    'email': ''},
   {'first': 'Stephen',
    'middle': ['D'],
    'last': 'Turner',
    'suffix': '',
    'affiliation': {},
    'email': ''},
   {'first': 'Krista',
    'middle': ['L'],
    'last': 'Ternus',
    'suffix': '',
    'affiliation': {},
    'email': ''},
   {'first': 'Dana',
    'middle': ['R'],
    'last': 'Kadavy',
    'suffix': '',
    'affiliation': {},
    'email': ''}]},
 'abstract': [{'text': 'Next-generation sequencing is increasingly being used to study samples composed of mixtures of organisms, such as in clinical applications where the presence of a pathogen at very low abundance may be highly important. We present an analytical method (SIANN: Strain Identification by Alignment to Near Neighbors) specifica

In [34]:
last_names = [a['last'].lower() for a in paper['metadata']['authors']]
last_names

['minot', 'turner', 'ternus', 'kadavy']

In [65]:
s2paper['venue']

''

In [79]:
math.isnan(paper['_meta']['journal'])

True

# Validate train/test

In [72]:
error_pairs = []

for k in range(1, n_splits+1):
    
    split_dir = os.path.join(cord19_dir, 'splits', str(k))
    split_train_df = pd.read_csv(os.path.join(split_dir, 'train.tsv'), sep='\t')
    split_test_df = pd.read_csv(os.path.join(split_dir, 'test.tsv'), sep='\t')

    train_pairs = split_train_df[['text', 'text_b']].values.tolist()
    test_pairs = split_test_df[['text', 'text_b']].values.tolist()

    for p in test_pairs:
        if p in train_pairs:
            #raise ValueError('ERROR - Test pair exists also in train!')
            error_pairs.append(p)
            
len(error_pairs)

0

In [74]:
error_pairs

[]

In [13]:
p

['Rapid communication',
 'Transmission of 2019-nCoV Infection from an Asymptomatic Contact in Germany.\n2019-nCoV Transmission from Asymptomatic Patient In this report, investigators in Germany detected the spread of the novel coronavirus (2019-nCoV) from a person who had recently traveled from China...']

4

# Meta data

- title
- year
- venue
- authors
- arxiv_id
- s2_id
- doi
- cord19_id

In [120]:
meta_rows = []

# Cord 19
for i, (doi, p) in enumerate(doi2paper.items()):
    m = p['_meta']
    meta_rows.append({
        'doi': doi,
        'title': m['title'],
        'authors': m['authors'],
        'year': int(m['publish_time'].split('-')[0]),
        'venue': '' if isinstance(m['journal'], float) and math.isnan(m['journal']) else m['journal'],
        's2_id': p['paper_id'],
        'arxiv_id': '',
        'in_citations_count': -1,
        'out_citations_count': len(p['bib_entries']),
    })   

# S2 paper
for i, (doi, p) in enumerate(doi2s2paper.items()):

    meta_rows.append({
        'doi': doi,
        'title': p['title'],
        'authors': '; '.join([a['name'] for a in p['authors']]),
        'year': int(p['year'] or 0),
        'venue': p['venue'] or '',
        's2_id': p['paperId'],
        'arxiv_id': p['arxivId'] or '',
        'in_citations_count': len(p['citations']),
        'out_citations_count': len(p['references']),
    })   
    
    
meta_df = pd.DataFrame(meta_rows) #, dtype={'year': 'number'})
meta_df

Unnamed: 0,doi,title,authors,year,venue,s2_id,arxiv_id,in_citations_count,out_citations_count
0,10.1101/001727,SIANN: Strain Identification by Alignment to N...,Samuel Minot; Stephen D Turner; Krista L Ternu...,2014,,f056da9c64fbf00a4645ae326e8a4339d015d155,,-1,10
1,10.1101/003889,Spatial epidemiology of networked metapopulati...,Lin WANG; Xiang Li,2014,,daf32e013d325a6feb80e83d15aabc64a48fae33,,-1,149
2,10.1101/006866,Sequencing of the human IG light chain loci fr...,Corey T Watson; Karyn Meltz Steinberg; Tina A ...,2014,,f33c6d94b0efaa198f8f3f20e644625fa3fe10d2,,-1,62
3,10.1101/007476,Bayesian mixture analysis for metagenomic comm...,Sofia Morfopoulou; Vincent Plagnol,2014,,4da8a87e614373d56070ed272487451266dce919,,-1,29
4,10.1101/010389,Mapping a viral phylogeny onto outbreak trees ...,Stephen P Velsko; Jonathan E Allen,2014,,eccef80cfbe078235df22398f195d5db462d8000,,-1,30
...,...,...,...,...,...,...,...,...,...
95195,10.1016/S0006-291X(67)80055-X,On the size of the active site in proteases. I...,Israel Schechter; Abe Berger,1967,Biochemical and biophysical research communica...,1ad6237a340cc6be5043f1b63f165512ece17df4,,772,0
95196,10.1038/srep36160,HCV core protein inhibits polarization and act...,Qianqian Zhang; Yang Wang; Naicui Zhai; Hongxi...,2016,Scientific reports,959a8d9c934bbe74d0d55b6da33f1a712e3eec3f,,12,49
95197,10.1128/AAC.04447-14,Relationship between azithromycin susceptibili...,Begoña Euba; Javier Moleres; Cristina Viadas; ...,2015,Antimicrobial agents and chemotherapy,9a1e4f5ac24aa47eb958bc493899974840ccc1f8,,9,0
95198,10.1016/j.jep.2012.04.053,Effect of Sophora flavescens Aiton extract on ...,Hyungwoo Kim; Mi Ran Lee; Guem San Lee; Won Gu...,2012,Journal of ethnopharmacology,fc4c38837d0718a0d7c2c0e8bba0457ba6aa3e29,,24,35


In [121]:
meta_df['year'].value_counts()

2015    7002
2014    6936
2013    6589
2016    6518
2017    6058
        ... 
1945       1
1935       1
1936       1
1937       1
1938       1
Name: year, Length: 91, dtype: int64

In [124]:
meta_df.to_csv(os.path.join(cord19_dir, 'meta.csv'), index=False)

In [111]:
json.dumps(meta_rows)

'[{"doi": "10.1101/001727", "title": "SIANN: Strain Identification by Alignment to Near Neighbors", "authors": "Samuel Minot; Stephen D Turner; Krista L Ternus; Dana R Kadavy", "year": 2014, "venue": null, "s2_id": "f056da9c64fbf00a4645ae326e8a4339d015d155", "arxiv_id": null, "in_citations_count": null, "out_citations_count": 10}, {"doi": "10.1101/003889", "title": "Spatial epidemiology of networked metapopulation: An overview", "authors": "Lin WANG; Xiang Li", "year": 2014, "venue": null, "s2_id": "daf32e013d325a6feb80e83d15aabc64a48fae33", "arxiv_id": null, "in_citations_count": null, "out_citations_count": 149}, {"doi": "10.1101/006866", "title": "Sequencing of the human IG light chain loci from a hydatidiform mole BAC library reveals locus-specific signatures of genetic diversity", "authors": "Corey T Watson; Karyn Meltz Steinberg; Tina A Graves-Lindsay; Rene L Warren; Maika Malig; Jacqueline E Schein; Richard K Wilson; Rob Holt; Evan Eichler; Felix Breden", "year": 2014, "venue": 

In [112]:
#meta_rows = []


    if i > 10:
        break
    
#p['abstract'][0]['text']
#p['metadata']['title']
meta_rows

[{'doi': '10.1142/S1793048008000708',
  'title': 'EPIDEMIC PREDICTIONS AND PREDICTABILITY IN COMPLEX ENVIRONMENTS',
  'authors': 'Vittoria Colizza; Alain Barrat; Marc Barthelemy; Alessandro Vespignani',
  'year': 2008,
  'venue': '',
  's2_id': '9c4f57a877f8bd1eab094d42dd9adef6b7023dde',
  'arxiv_id': None,
  'in_citations_count': 2,
  'out_citations_count': 0},
 {'doi': '10.1186/1743-422X-4-102',
  'title': 'Cloning of the canine RNA polymerase I promoter and establishment of reverse genetics for influenza A and B in MDCK cells',
  'authors': 'Zhaoti Wang; Gregory M. Duke',
  'year': 2007,
  'venue': 'Virology Journal',
  's2_id': '383311a7b538cde429515869677fbda4826f9faa',
  'arxiv_id': None,
  'in_citations_count': 18,
  'out_citations_count': 18},
 {'doi': '10.1016/S0168-1702(99)00032-5',
  'title': 'The S gene of canine coronavirus, strain UCD-1, is more closely related to the S gene of transmissible gastroenteritis virus than to that of feline infectious peritonitis virus.',
  'a

In [95]:
p.keys()

dict_keys(['abstract', 'arxivId', 'authors', 'citationVelocity', 'citations', 'corpusId', 'doi', 'fieldsOfStudy', 'influentialCitationCount', 'is_open_access', 'is_publisher_licensed', 'paperId', 'references', 'title', 'topics', 'url', 'venue', 'year'])

'Sally Roberts, Craig P. Delury, Elizabeth Kate Marsh'

In [79]:
p.keys()

dict_keys(['paper_id', 'metadata', 'abstract', 'body_text', 'bib_entries', 'ref_entries', 'back_matter', '_meta'])

In [80]:
p['_meta']['authors']

{'sha': 'f056da9c64fbf00a4645ae326e8a4339d015d155',
 'source_x': 'biorxiv',
 'title': 'SIANN: Strain Identification by Alignment to Near Neighbors',
 'doi': '10.1101/001727',
 'pmcid': nan,
 'pubmed_id': nan,
 'license': 'biorxiv',
 'abstract': 'Next-generation sequencing is increasingly being used to study samples composed of mixtures of organisms, such as in clinical applications where the presence of a pathogen at very low abundance may be highly important. We present an analytical method (SIANN: Strain Identification by Alignment to Near Neighbors) specifically designed to rapidly detect a set of target organisms in mixed samples that achieves a high degree of species- and strain-specificity by aligning short sequence reads to the genomes of near neighbor organisms, as well as that of the target. Empirical benchmarking alongside the current state-of-the-art methods shows an extremely high Positive Predictive Value, even at very low abundances of the target organism in a mixed sampl

In [83]:
p['paper_id']

'f056da9c64fbf00a4645ae326e8a4339d015d155'