# Data Cleaning

Create and export a DataFrame containing each paper ID (sha) and its corresponding full text as a string. This will be used in the COVID-19 Research Papers Text Extraction notebook to obtain the most "important" sentences in the relevant papers.


In [1]:
import numpy as np
import pandas as pd
import json
import itertools
from io import StringIO  # wraps JSON strings so pd.read_json treats them as file-like input

Split metadata.csv into separate DataFrames by license type, then drop rows whose sha field lists more than one paper ID (separated by ";").

In [2]:
df = pd.read_csv("data/metadata.csv")
df_with_full = df.loc[df['has_full_text'] == True].drop(columns = 'has_full_text')

df_custom_license = df_with_full.loc[df_with_full['full_text_file'].str.match("custom_license")]
df_noncomm_use_subset = df_with_full.loc[df_with_full['full_text_file'].str.match("noncomm_use_subset")]
df_comm_use_subset = df_with_full.loc[df_with_full['full_text_file'].str.match("comm_use_subset")]
df_biorxiv_medrxiv = df_with_full.loc[df_with_full['full_text_file'].str.match("biorxiv_medrxiv")]

def drop_multi_sha_rows(df_license):
    # Some rows list several paper IDs in one sha field, separated by ";".
    # Keep only rows with a single sha so each row maps to exactly one JSON file.
    has_multiple = df_license['sha'].str.contains(";", regex=False, na=False)
    return df_license.loc[~has_multiple]

df_custom_license = drop_multi_sha_rows(df_custom_license)
df_noncomm_use_subset = drop_multi_sha_rows(df_noncomm_use_subset)
df_comm_use_subset = drop_multi_sha_rows(df_comm_use_subset)
df_biorxiv_medrxiv = drop_multi_sha_rows(df_biorxiv_medrxiv)
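`Series.str.match` anchors the pattern at the start of each string (it follows `re.match` semantics), which is why the `comm_use_subset` filter does not also pick up `noncomm_use_subset` rows. A minimal sketch of that behaviour with the standard `re` module; the `rows` list and its sha values are made up for illustration and stand in for the metadata frame:

```python
import re

# Hypothetical (full_text_file, sha) pairs standing in for metadata.csv rows.
rows = [
    ("comm_use_subset", "aaa111"),
    ("noncomm_use_subset", "bbb222"),
    ("comm_use_subset", "ccc333; ddd444"),  # two shas in one field
]

# re.match only succeeds at position 0, so "noncomm_use_subset" is not matched.
comm_rows = [r for r in rows if re.match("comm_use_subset", r[0])]

# re.search matches anywhere in the string and would also include the noncomm row.
search_rows = [r for r in rows if re.search("comm_use_subset", r[0])]

# Keep only rows whose sha field names a single paper.
single_sha_rows = [r for r in comm_rows if ";" not in r[1]]
```

Swapping `str.match` for `str.contains` in the cell above would behave like `re.search` here, so the four subsets would no longer be disjoint.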

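Define `json_to_body_string(body_json)`, which takes the full-text file of a paper (its JSON read into a DataFrame with `orient="index"`) and returns all the body text as a single string. Any `text` value that is not a string is skipped, and its type is appended to the global list `not_strings`.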

In [42]:
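def json_to_body_string(body_json):
    # Column 0 holds the paper's fields; "body_text" is a list of paragraph dicts.
    # Looking it up by label avoids relying on the field order in the file.
    body_text = body_json[0]["body_text"]
    # Round-trip through JSON so pandas builds one row per paragraph.
    body_text_json = json.dumps(body_text)
    body_text_df = pd.read_json(StringIO(body_text_json), orient="records")
    body_string = ""
    for i in range(len(body_text_df)):
        if isinstance(body_text_df["text"][i], str):
            body_string = body_string + " " + body_text_df["text"][i]
        else:
            # Record the type of any non-string entry (not_strings is a global list).
            not_strings.append(type(body_text_df["text"][i]))
    return body_string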
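Because `body_text` is already a list of dicts once the JSON is parsed, the same string can probably be built without the round trip through `json.dumps` and `pd.read_json`. A minimal sketch under that assumption; `body_text_to_string` and the sample paragraphs are illustrative names, not part of the notebook:

```python
def body_text_to_string(body_text):
    """Join the "text" field of each paragraph dict, skipping non-string values."""
    parts = []
    skipped_types = []
    for paragraph in body_text:
        text = paragraph.get("text")
        if isinstance(text, str):
            parts.append(" " + text)  # leading space mirrors json_to_body_string
        else:
            skipped_types.append(type(text))
    return "".join(parts), skipped_types


# Made-up paragraphs in the shape of a paper's body_text list.
sample_body_text = [
    {"text": "First paragraph.", "section": "Intro"},
    {"text": None, "section": "Intro"},
    {"text": "Second paragraph.", "section": "Methods"},
]
body_string, skipped = body_text_to_string(sample_body_text)
```

Returning the skipped types alongside the string avoids the dependency on a global `not_strings` list, at the cost of a slightly different call signature.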

Define `compile_full_texts(df_license, license_type)`, which takes in
- df_license (one of the four license dfs) and
- license_type (its license type as a string)  

and adds the contents of each paper into full_texts_dict.

In [43]:
def compile_full_texts(df_license, license_type):
    for i in df_license["sha"]:
        temp_json = pd.read_json("data/" + license_type +
                                 "/" + license_type + "/" + i + ".json", orient="index")
        full_text_string = json_to_body_string(temp_json)
        full_texts_dict[i] = full_text_string
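A sketch of the same loop using `json.load` and `pathlib`, returning a dictionary instead of filling a global one. It assumes each paper file has a top-level `body_text` list, as the sample output in this notebook suggests; `load_full_texts`, `data_dir`, and the temporary file used in the demo are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_full_texts(shas, license_type, data_dir="data"):
    """Map each sha to its concatenated body text."""
    texts = {}
    for sha in shas:
        path = Path(data_dir) / license_type / license_type / f"{sha}.json"
        if not path.exists():
            continue  # assumption: tolerate a sha with no file on disk
        with path.open(encoding="utf-8") as f:
            paper = json.load(f)
        paragraphs = paper.get("body_text", [])
        texts[sha] = "".join(
            " " + p["text"] for p in paragraphs if isinstance(p.get("text"), str)
        )
    return texts


# Usage with a throwaway directory that mimics the data/ layout.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "custom_license" / "custom_license"
    folder.mkdir(parents=True)
    paper = {"paper_id": "abc123", "body_text": [{"text": "Hello."}, {"text": "World."}]}
    (folder / "abc123.json").write_text(json.dumps(paper), encoding="utf-8")
    demo_texts = load_full_texts(["abc123", "missing"], "custom_license", data_dir=tmp)
```

The per-license results could then be merged with `dict.update`, which keeps the last text seen if the same sha appears in more than one subset.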

In [5]:
# full_texts_dict = {}
# not_strings = []
# for i in df_custom_license["sha"]:
#     temp_json = pd.read_json("CORD-19-research-challenge/custom_license/custom_license/" + i + ".json", orient="index")
#     full_text_string = json_to_body_string(temp_json)
#     full_texts_dict[i] = full_text_string

Compile the full texts into the dictionary `full_texts_dict`, which maps each paper sha to its contents as a string.

# TESTING

In [40]:
# Required to make read_json work with this data for Pandas 1.1
from io import StringIO

In [14]:
body_json = pd.read_json("data/custom_license/custom_license/00a0ab182dc01b6c2e737dfae585f050dcf9a7a5.json", orient="index")
body_json

Unnamed: 0,0
paper_id,00a0ab182dc01b6c2e737dfae585f050dcf9a7a5
metadata,{'title': 'Middle East respiratory syndrome: A...
abstract,[]
body_text,[{'text': 'The world was made aware of a newly...
bib_entries,"{'BIBREF0': {'ref_id': 'b0', 'title': 'Effecti..."
ref_entries,{'FIGREF0': {'text': 'A timeline of key scient...
back_matter,[]


In [26]:
body_text.toJSONString()

AttributeError: 'list' object has no attribute 'toJSONString'

In [27]:
body_text = body_json[0][3]
#body_text
body_text_json = json.dumps(body_text)
type(body_text_json)

str

In [34]:
pd.DataFrame.from_dict(body_text, orient="records")

ValueError: only recognize index or columns for orient

In [38]:
from io import StringIO
#newdf = pd.read_json(StringIO(temp))

In [39]:
body_text_df = pd.read_json(StringIO(body_text_json), orient="records")

In [12]:
full_text_string = json_to_body_string(temp_json)

ValueError: Protocol not known: [{"text": "The world was made aware of a newly discovered coronavirus via an email from Dr. Ali Mohamed Zaki, ...
[remainder of the paper's body_text JSON omitted]

In [44]:
full_texts_dict = {}
not_strings = []
compile_full_texts(df_custom_license, "custom_license")
compile_full_texts(df_noncomm_use_subset, "noncomm_use_subset")
compile_full_texts(df_comm_use_subset, "comm_use_subset")
compile_full_texts(df_biorxiv_medrxiv, "biorxiv_medrxiv")
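`not_strings` collects the type of every `text` value that was skipped, so a quick tally shows whether anything was dropped during compilation. A minimal sketch with `collections.Counter`; the sample list below is made up for illustration and does not reflect the actual run:

```python
from collections import Counter

# Made-up stand-in for the not_strings list filled by json_to_body_string.
sample_not_strings = [float, float, type(None)]

# Count skipped values by type name.
skipped_summary = Counter(t.__name__ for t in sample_not_strings)
```

An empty summary would mean every paragraph contributed to its paper's string.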

In [45]:
len(full_texts_dict)

27678

In [46]:
dict_df = pd.DataFrame.from_dict(full_texts_dict, orient='index')
dict_df.head()

Unnamed: 0,0
aecbc613ebdab36753235197ffb4f35734b5ca63,"The patient (Fo, ) was a 58 year old mentally..."
212e990b378e8d267042753d5f9d4a64ea5e9869,Pathogenesis and Risk Factors J. ROBERT CANTE...
bf5d344243153d58be692ceb26f52c08e2bd2d2f,"In the pathogenesis of rheumatoid arthritis, ..."
ddd2ecf42ec86ad66072962081e1ce4594431f9c,Respiratory Tract Infections JERROLD J. ELLNE...
a55cb4e724091ced46b5e55b982a14525eea1c7e,"A cute bronchitis, an illness frequently enco..."


In [47]:
dict_df.to_csv("full_texts.csv")
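The exported file has an unnamed index column (the sha) followed by a column named `0` (the text). A minimal sketch of reading that layout back into a dictionary with the standard `csv` module; `sample_csv` is a small in-memory stand-in for `full_texts.csv`, and its contents are invented:

```python
import csv
import io

# Stand-in for full_texts.csv: an empty header cell for the index, then "0".
sample_csv = io.StringIO(
    ',0\n'
    'sha_a," First paragraph. Second paragraph."\n'
    'sha_b," Only paragraph, with a comma."\n'
)

reader = csv.reader(sample_csv)
header = next(reader)  # skip the header row
full_texts = {sha: text for sha, text in reader}
```

For the real file, note that the `csv` module rejects fields longer than its default limit of 131072 characters, so `csv.field_size_limit` may need raising before reading full papers; in pandas, `pd.read_csv("full_texts.csv", index_col=0)` is the likely equivalent.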