In [None]:
!pip install cord-19-tools

In [2]:
import cotools
from pprint import pprint
import os
import sys

# Downloading the data

First we download and extract the data. The `download` function currently fetches the four sets of JSON data and extracts them into a directory:

In [3]:
downloaded = True  # set to False if you haven't downloaded the data yet

if not downloaded:
    cotools.download(dir = 'data')
    
print(os.listdir('data'))

['download.sh', 'comm_use_subset.tar.gz', 'noncomm_use_subset', 'biorxiv_medrxiv', 'pmc_custom_license', 'comm_use_subset']
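
Rather than flipping the `downloaded` flag by hand, one option is to check whether the target directory already holds data. The following is a minimal sketch: `needs_download` is a hypothetical helper (not part of `cotools`), and the rule "an existing, non-empty directory means the data is present" is an assumption.

```python
import os
import tempfile


def needs_download(data_dir):
    """Return True when data_dir is missing or empty (hypothetical helper)."""
    return not os.path.isdir(data_dir) or not os.listdir(data_dir)


# Demonstrate on a throwaway directory so nothing is actually downloaded.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "data")
    before = needs_download(data_dir)   # the directory does not exist yet
    os.makedirs(os.path.join(data_dir, "comm_use_subset"))
    after = needs_download(data_dir)    # the directory now has one entry

print(before, after)
```

In the cell above, the flag could then become `if needs_download('data'): cotools.download(dir='data')`.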


# Loading and using the data

Next, we can load the data! We will use the `Paperset` class. We initialize it by specifying the directory of JSON data:

In [4]:
comm_use = cotools.Paperset('data/comm_use_subset')
print(sys.getsizeof(comm_use))

64


We access the data through indexing. Indices can either be ints or slices. Integer indices return dicts:

In [5]:
# laziness: a paper is only read when it is indexed
print(sys.getsizeof(comm_use[0]))
# an integer index returns a dict
print(type(comm_use[0]))
# uncomment to print the full document
# pprint(comm_use[0])

376
<class 'dict'>


Slice indices return lists of dicts:

In [6]:
print(sys.getsizeof(comm_use[15:65]))
print(type(comm_use[15:65]))
print([type(x) for x in comm_use[15:65]])

536
<class 'list'>
[<class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>, <class 'dict'>]
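
How `Paperset` works internally is not shown here, but the indexing behavior above can be sketched with a container that keeps only file names in memory and parses a JSON file when it is indexed. `LazyPapers` is a hypothetical stand-in written for illustration, not the library's class. Note also that `sys.getsizeof` reports only the shallow size of an object, so the numbers above do not include the contents of the dicts.

```python
import json
import os
import tempfile


class LazyPapers:
    """Hypothetical stand-in for a lazily loaded paper collection."""

    def __init__(self, directory):
        self.directory = directory
        # Only the sorted file names are kept in memory, not the parsed papers.
        self.filenames = sorted(os.listdir(directory))

    def _load(self, filename):
        with open(os.path.join(self.directory, filename)) as handle:
            return json.load(handle)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice parses each selected file and returns a list of dicts.
            return [self._load(name) for name in self.filenames[index]]
        # An int parses a single file and returns one dict.
        return self._load(self.filenames[index])

    def __len__(self):
        return len(self.filenames)


# Build three tiny JSON "papers" in a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"paper{i}.json"), "w") as handle:
            json.dump({"paper_id": f"p{i}"}, handle)
    papers = LazyPapers(tmp)
    single = papers[0]
    several = papers[1:3]

print(type(single), [doc["paper_id"] for doc in several])
```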


# Accessing the text

To access the text, we have three methods:

1. `cotools.text` gets the text of a single document
2. `cotools.texts` gets the text of a list of documents (created through slicing)
3. `cotools.Paperset.texts` gets a list of the text of all the documents!
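
To give a feel for what extracting the text involves, here is a minimal sketch. It assumes that a document's `body_text` entry is a list of paragraph dicts, each with a `"text"` field. `join_paragraphs` is a hypothetical helper, and joining with newlines is a choice made for this example, not necessarily what `cotools.text` does.

```python
def join_paragraphs(doc, field="body_text"):
    """Join the "text" of every paragraph in doc[field] (hypothetical helper)."""
    return "\n".join(paragraph["text"] for paragraph in doc.get(field, []))


# A made-up document that mimics the assumed layout.
doc = {
    "paper_id": "example",
    "body_text": [
        {"text": "First paragraph.", "section": "Introduction"},
        {"text": "Second paragraph.", "section": "Methods"},
    ],
}

body = join_paragraphs(doc)
print(body)
```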

In [7]:
pprint(cotools.text(comm_use[0]))

('hypothesis-free approaches, like genome-wide association studies 15 , '
 'managed to spark a progress of the field, indicated as a lack of replication '
 'of results from such studies 10, [16] [17] [18] . Unfortunately, insights '
 'derived from the fields of clinical infectology, microbiology, immunology, '
 'epidemiology, as well as clinical, evolutionary and population genetics '
 'remained largely isolated from one another [19] [20] [21] , preventing '
 'systematic understanding of the entire field. The interest in host factors '
 'associated with respiratory infectious diseases stems from their ability to '
 'cause epidemics and pandemics. The infamous Spanish influenza of 1918 has '
 'been implicated as the largest ever recorded pandemic, causing an estimated '
 '25-100 millions of deaths 22, 23 . On the other hand, tuberculosis is an '
 'example of highly adaptive pathogen, which managed to coevolve with humans, '
 'possibly as far as the initial waves of human migrations out 

In [8]:
pprint(cotools.texts(comm_use[:12]))

['hypothesis-free approaches, like genome-wide association studies 15 , '
 'managed to spark a progress of the field, indicated as a lack of replication '
 'of results from such studies 10, [16] [17] [18] . Unfortunately, insights '
 'derived from the fields of clinical infectology, microbiology, immunology, '
 'epidemiology, as well as clinical, evolutionary and population genetics '
 'remained largely isolated from one another [19] [20] [21] , preventing '
 'systematic understanding of the entire field. The interest in host factors '
 'associated with respiratory infectious diseases stems from their ability to '
 'cause epidemics and pandemics. The infamous Spanish influenza of 1918 has '
 'been implicated as the largest ever recorded pandemic, causing an estimated '
 '25-100 millions of deaths 22, 23 . On the other hand, tuberculosis is an '
 'example of highly adaptive pathogen, which managed to coevolve with humans, '
 'possibly as far as the initial waves of human migrations out 

 'very common for authors to use the term "healthy" controls, but almost '
 'universally used dissimilar diagnostic process for classification of cases '
 'and controls. Thus, selection of controls and diagnostics process must be '
 'fundamentally improved and harmonized (Supplementary Table 10) . A '
 'substantial effort must go towards inclusion of controls that were exposed '
 'to a pathogen, which is necessary for the disease development (but may not '
 'be sufficient). Data analysis must also be improved, by adhering to the best '
 'practices, including the use and report of odds ratios and confidence '
 'intervals as measures of association, and multiple testing correction 72 . '
 'Improvements can be made in the study design, where availability of finely '
 'phenotyped life-long cohorts and biobanks provides an opportunity to use '
 'this data in an understanding of life-long risk of developing respiratory '
 'infectious disease, such as tuberculosis. The use of novel molecular 

 'irrespectively of KSHV lytic infection and only five proteins had a '
 'significant reduction (ratio cut-off <0.5) in NEs of reactivated cells. In '
 'contrast, 216 proteins showed a significant increase (ratio cut-off >1.9) '
 'during lytic replication. Importantly, multiple cellular proteins that are '
 'known or expected to localize within herpesvirus RTCs; such as those '
 'associated with KSHV ori-Lyt, the HCMV transactivator IE2-p86 protein or the '
 'herpes simplex virus-1 (HSV-1) ICP8 protein were found significantly '
 'increased in the NE regions of reactivated cells using the less stringent '
 'ratio cut-off of 1.5 (S1 Table) . Some of these cellular proteins included '
 'CSNK2A1 [45] , BLM [32] , topoisomerases I and II [31, 46, 47] and DEAD box '
 'helicases DDX5 [32] and DDX17 [45] . Thus, LC-MS/MS results confirmed the '
 'correct isolation of the NE region and accompanying RTCs. Importantly, many '
 'of the 216 identified proteins most likely represent novel cellular 

 'independent transfections were averaged with error bars as standard '
 'deviation. (E) HEK-293T cells were transiently co-transfected with pPAN-WT '
 'and either pRTA-EGFP or control pEGFP. 24 h post-transfection either vehicle '
 'drug DMSO (0.1%) or 45 μM VER-155008 was added and incubated for a further '
 '24 h. Total RNA was then extracted and qRT-PCR performed. RTA-mediated '
 'promoter transactivation and subsequent synthesis of PAN RNA occurred at a '
 'similar rate in the presence of VER-155008 or control DMSO. The results of '
 'three independent transfections were averaged with error bars as standard '
 'deviation. (F) The stability of PAN RNA was determined in HEK-293T cells '
 'that had been co-transfected with pPAN-WT and pRTA-EGFP. Following 24 h '
 'post-transfection, actinomycin D (AcD) (5 μg/ml) or DMSO control (0.5%) was '
 'added. Cells were collected over the time points indicated and total RNA was '
 'extracted followed by qRT-PCR. The average of two biological r

 'Ebola outbreaks have been recurring since the first major case in DR Congo '
 'in 1995, which affected more than 260 humans and caused 186 deaths, was '
 'reported [14] . The most recent outbreak of Ebola in West Africa in 2014 '
 'infecting about 27,898 people and causing 11,296 reported deaths [17] raised '
 'public apprehension about human-bat interactions. This is because the '
 'outbreak has been traced back to a single incident of a young child playing '
 'in the vicinity of a hollow tree frequented by bats [18] , now identified as '
 'a potential source of spillover of the EBV [11, 14, 19] . The ability of '
 'bats to spread zoonotic diseases and the fact they live close to humans '
 'heighten the potential threat of disease spillover. Notwithstanding these '
 'negative perceptions, bats play vital roles in providing ecosystems '
 'services. They are well known for their roles in seed dispersal, '
 'pollination, maintaining soil fertility, and aiding in nutrient distribution '

 'Fortunately, NAATs can be implemented in a microfluidics system for manual, '
 'semi-automated, or fully-automated operation that can be readily adapted for '
 'point-of-care use. A broad aim of microfluidics POC efforts is to make '
 'molecular diagnostics nearly as easy to use as lateral flow strip '
 'immunoassays. Microfluidics refers to various methods and technologies '
 'wherein small quantities (1 nL to 1 mL) of liquids are manipulated, '
 'processed, and analyzed using miniaturized, e.g., palm-sized, devices. In '
 'biotechnology, microfluidic components and devices have been demonstrated '
 'for, among other things, serial dilutions; cell fractionation, sorting, and '
 'enrichment; cell and virus lysis; isolation, concentration, and purification '
 'of nucleic acids; purification of proteins; immunoassays; reverse '
 'transcription; labeling of biomolecules with reporters; enzymatic '
 'amplification (e.g., PCR); electrophoresis; Micro total analytical systems '
 '(µTAS) co

 'evidence of glomerular pathology, which is considered to be a consequence of '
 'systemic inflammatory response in the context of multi-organ failure, rather '
 'than a specific effect of viral infection of the kidney [17] . SARS-CoV has '
 'never been successfully isolated from post mortem kidney tissue of infected '
 'patients [21] . In-vitro studies of SARS-CoV with immortalized human '
 'proximal tubular epithelial cells showed replication without cell impairment '
 'as seen in our study, while no infection of podocyte cell lines and only '
 'low-level replication in glomerular mesangial cells (MC) was seen, providing '
 'further evidence against specific involvement of the renal tract in SARS-CoV '
 'infection [22] . In contrast, MERS-CoV was shown to efficiently replicate in '
 'a broad range of bat, primate and also human kidney epithelial cell lines '
 'that are commonly used as laboratory models [9] . The abundant expression of '
 "both viruses' entry receptors in kidney epi

 'sequence (nt 752-731) of FIP-1 and sequence (nt 212-191) of FIP-2 were in '
 'reverse orientation, whereas the second sequence (nt 689-706) and sequence '
 '(nt 142-161) of the primers were in the forward direction. In BIP-1 and '
 'BIP-2, the forward sequence (nt 759-780) and sequence (nt 215-235) is '
 'followed by the reverse sequence (nt 813-729) and sequence (nt 293-274). For '
 'the 72 kDa gene: the first sequence (nt 813-729) of FIP-3 and sequence (nt '
 '1428-1407) of FIP-4 were in reverse orientation, whereas the second sequence '
 '(nt748-767) and sequence (nt 1355-1372) of the primers were in the forward '
 'direction. In BIP-3 and BIP-4, the forward sequence (nt 814-833) and '
 'sequence (nt 1429-1450) is followed by the reverse sequence (nt 889-871) and '
 'sequence (nt 1498-1479). Their relative location in the virus genomes were '
 'shown in Table 1 . To select the appropriate primer set, RT-LAMP using four '
 'sets of primers was carried out with two healthy controls 

 'protein saturated in the intracellular compartment, but was not found at the '
 'plasma membrane, showing this tyrosine-based sorting motif to be important '
 'for its trafficking to the cell surface. The di-acidic motif, which is a '
 'canonical ER export signal, did not show any role in the intracellular '
 'trafficking of the 3a protein. The YXXΦ motif in 3a protein is responsible '
 'for its Golgi to plasma membrane sorting The subcellular distribution of the '
 'ΔYXXΦ protein was very different from that of the wild type 3a protein. To '
 'further characterize the compartments to which the mutant 3a proteins '
 'localize, we expressed them in Huh7 cells together with markers for the ER '
 'and Golgi compartments. The wild type and ΔEXD proteins were distributed to '
 'the cell surface and punctate structures that are likely to represent '
 'trafficking vesicles; these also showed minor colocalization with the Golgi '
 'compartment ( Figure 3A ). However, the ΔYXXΦ mutant protein

In [9]:
all_text = comm_use.texts()
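
Once the texts are in a plain list of strings, ordinary Python is enough for simple exploration. The sketch below runs a case-insensitive keyword search over a few stand-in strings. `find_matches` and `sample_texts` are illustrative names; in the notebook the list would be `all_text`.

```python
def find_matches(texts, keyword):
    """Return the indices of the texts that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [i for i, text in enumerate(texts) if needle in text.lower()]


# Stand-in strings; in the notebook this would be the list held in all_text.
sample_texts = [
    "Ebola outbreaks have been recurring since 1995.",
    "Microfluidics refers to various methods and technologies.",
    "The most recent outbreak of Ebola in West Africa.",
]

hits = find_matches(sample_texts, "ebola")
print(hits)  # [0, 2]
```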

# Dealing with abstracts

We can get abstracts in a similar manner to getting texts:

In [10]:
cotools.abstract(comm_use[0])

'One single-nucleotide polymorphism from IL4 gene was significant for pooled respiratory infections (rs2070874; 1.66 [1.29-2.14]). We also detected an association of TLR2 gene with tuberculosis (rs5743708; 3.19 [2.03-5.02]). Subset analyses identified CCL2 as an additional risk factor for tuberculosis (rs1024611; OR = 0.79 [0.72-0.88]). The IL4-TLR2-CCL2 axis could be a highly interesting target for translation towards clinical use. However, this conclusion is based on low credibility of evidence -almost 95% of all identified studies had strong risk of bias or confounding. Future studies must build upon larger-scale collaborations, but also strictly adhere to the highest evidence-based principles in study design, in order to reduce research waste and provide clinically translatable evidence.'

In [11]:
cotools.abstracts(comm_use[14:18])

['␥ 9 ␦ 2 T cells provide a natural bridge between innate and adaptive immunity, rapidly and potently respond to pathogen infection in mucosal tissues, and are prominently induced by both tuberculosis (TB) infection and bacillus Calmette Guérin (BCG) vaccination. Mycobacterium-expanded ␥ 9 ␦ 2 T cells represent only a subset of the phosphoantigen {isopentenyl pyrophosphate [IPP] and (E)-4-hydroxy-3-methyl-but-2-enylpyrophosphate [HMBPP]}-responsive ␥ 9 ␦ 2 T cells, expressing an oligoclonal set of T cell receptor (TCR) sequences which more efficiently recognize and inhibit intracellular Mycobacterium tuberculosis infection. Based on this premise, we have been searching for M. tuberculosis antigens specifically capable of inducing a unique subset of mycobacterium-protective ␥ 9 ␦ 2 T cells. Our screening strategy includes the identification of M. tuberculosis fractions that expand ␥ 9 ␦ 2 T cells with biological functions capable of inhibiting intracellular mycobacterial replication. Ch

In [12]:
all_abstracts = comm_use.abstracts()

# Manipulating the documents

The `Paperset.apply` method applies a function to every document. For example, to get the keys of each document, we could do:

In [13]:
keys = comm_use.apply(lambda x: list(x.keys()))
# then let's combine them into a set

set(sum(keys, []))

{'abstract',
 'back_matter',
 'bib_entries',
 'body_text',
 'metadata',
 'paper_id',
 'ref_entries'}
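
The same pattern works on any list of document dicts. The sketch below uses made-up documents to show two variations: `set().union(*...)` as an alternative to `sum(keys, [])` that avoids repeatedly copying lists, and a `Counter` that reports how many documents carry each key. The `docs` list is a stand-in, not real CORD-19 data.

```python
from collections import Counter

# Stand-in documents; real ones would come from indexing a Paperset.
docs = [
    {"paper_id": "a", "abstract": [], "body_text": []},
    {"paper_id": "b", "body_text": []},
]

# Plain-Python equivalent of mapping a function over every document.
keys_per_doc = [list(doc.keys()) for doc in docs]

# Merge all key lists into one set without concatenating lists step by step.
all_keys = set().union(*keys_per_doc)

# Count how many documents carry each key.
key_counts = Counter(key for keys in keys_per_doc for key in keys)

print(sorted(all_keys))
print(key_counts["abstract"], key_counts["paper_id"])
```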

Pull requests that make this more useful are welcome! To be added:

* Metadata
* Other datasets??

# Other Data

## Johns Hopkins Data

We can get the Johns Hopkins data with the `get_hopkins` function:

In [14]:
confirmed, deaths, recoveries = cotools.get_hopkins()

In [15]:
print(confirmed.keys())
print(type(confirmed))
print([len(x) for x in confirmed.values()])
[type(x[0]) for x in confirmed.values()]

dict_keys(['Province/State', 'Country/Region', 'Lat', 'Long', '1/22/20', '1/23/20', '1/24/20', '1/25/20', '1/26/20', '1/27/20', '1/28/20', '1/29/20', '1/30/20', '1/31/20', '2/1/20', '2/2/20', '2/3/20', '2/4/20', '2/5/20', '2/6/20', '2/7/20', '2/8/20', '2/9/20', '2/10/20', '2/11/20', '2/12/20', '2/13/20', '2/14/20', '2/15/20', '2/16/20', '2/17/20', '2/18/20', '2/19/20', '2/20/20', '2/21/20', '2/22/20', '2/23/20', '2/24/20', '2/25/20', '2/26/20', '2/27/20', '2/28/20', '2/29/20', '3/1/20', '3/2/20', '3/3/20', '3/4/20', '3/5/20', '3/6/20', '3/7/20', '3/8/20', '3/9/20', '3/10/20', '3/11/20', '3/12/20', '3/13/20', '3/14/20', '3/15/20', '3/16/20', '3/17/20'])
<class 'dict'>
[460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460, 460]


[str,
 str,
 float,
 float,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int,
 int]
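
The output above shows that `confirmed` is a column-oriented dict: each key is a column name and each value is a list with one entry per row. A minimal sketch of aggregating such a table by country follows. The numbers are made up, and `totals_by_country` is a hypothetical helper, not part of `cotools`.

```python
from collections import defaultdict

# Made-up numbers in the same column-oriented layout as `confirmed`.
confirmed_sample = {
    "Province/State": ["Hubei", "", "Ontario", "Quebec"],
    "Country/Region": ["China", "Italy", "Canada", "Canada"],
    "3/16/20": [10, 20, 3, 4],
    "3/17/20": [11, 25, 5, 6],
}


def totals_by_country(table, date):
    """Sum the counts in one date column across each country's rows."""
    totals = defaultdict(int)
    for country, count in zip(table["Country/Region"], table[date]):
        totals[country] += count
    return dict(totals)


latest = totals_by_country(confirmed_sample, "3/17/20")
print(latest)
```

Because every column has the same length, `zip` pairs each row's country with its count without needing an explicit row index.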