## Process non-commercial published papers

**Name: Vidhi Gupta (vg5vc)**

Explore and preprocess the non-commercial subset of CORD-19, combine it into a single dataframe, and export it as .csv.

Adjusted to data from 2020-04-17.

# Load Packages

In [1]:
import numpy as np 
import pandas as pd

import glob
import json

# Load and Prepare Data

To read the JSON files we follow [COVID EDA: Initial Exploration Tool](https://www.kaggle.com/ivanegapratama/covid-eda-initial-exploration-tool).

In [2]:
root_path = '../dataset/CORD-19-research-challenge/'
meta_df = pd.read_csv(root_path+'metadata.csv', dtype={
    'pubmed_id': str,
    'Microsoft Academic Paper ID': str, 
    'doi': str
})
meta_df.head(2)

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
1,gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,True,True,custom_license,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...


In [3]:
all_json = glob.glob(f'{root_path}noncomm_use_subset/noncomm_use_subset/pdf_json/*.json', recursive=True)
len(all_json)

2466

In [4]:
all_json_pmc = glob.glob(f'{root_path}noncomm_use_subset/noncomm_use_subset/pmc_json/*.json', recursive=True)
len(all_json_pmc)

2212
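A side note on the `glob` calls above: `recursive=True` only has an effect when the pattern contains `**`. The following is a minimal sketch on a throwaway temporary directory (the folder and file names are made up for illustration and are not part of the dataset) that shows the difference.

```python
import glob
import os
import tempfile

# Build a throwaway directory tree; all names below are illustrative only.
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, 'pdf_json', 'batch_1')
    os.makedirs(nested)
    for path in (os.path.join(root, 'pdf_json', 'a.json'),
                 os.path.join(nested, 'b.json')):
        with open(path, 'w') as fh:
            fh.write('{}')

    # Without '**' the flag changes nothing: only the top-level file matches.
    flat = glob.glob(os.path.join(root, 'pdf_json', '*.json'), recursive=True)
    # With '**' the search also descends into sub-folders.
    deep = glob.glob(os.path.join(root, 'pdf_json', '**', '*.json'), recursive=True)

    flat_names = sorted(os.path.basename(p) for p in flat)
    deep_names = sorted(os.path.basename(p) for p in deep)

print(flat_names)  # ['a.json']
print(deep_names)  # ['a.json', 'b.json']
```

Since the JSON files here sit directly inside `pdf_json/` and `pmc_json/`, the flat pattern used in the notebook finds all of them.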

# Non-commercial use pdf_json

In [5]:
methods = ['methods','method','statistical methods','materials','materials and methods',
                'data collection','the study','study design','experimental design','objective',
                'objectives','procedures','data collection and analysis', 'methodology',
                'material and methods','the model','experimental procedures','main text']

In [6]:
# [''.join(x.lower() for x in m if x.isalpha()) for m in methods]

# for m in methods:
#     print(''.join(x.lower() for x in m if x.isalpha()))

In [7]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.abstract = []
            self.body_text = []
            self.methods = []
            self.results = []

            # Abstract
            for entry in content['abstract']:
                self.abstract.append(entry['text'])
            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            # Methods
            methods = ['methods','method','statistical methods','materials','materials and methods',
                'data collection','the study','study design','experimental design','objective',
                'objectives','procedures','data collection and analysis', 'methodology',
                'material and methods','the model','experimental procedures','main text']
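            # Normalise the keywords once (lowercase, letters only) rather than for every paragraph
            method_keys = [''.join(x.lower() for x in m if x.isalpha()) for m in methods]
            for entry in content['body_text']:
                # lowercase, letters only: drops numbers, spaces and punctuation
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())
                if any(key in section_title for key in method_keys):
                    self.methods.append(entry['text'])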
            # Results
            results_synonyms = ['result']
            for entry in content['body_text']:
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())
                if any(r in section_title for r in results_synonyms) :
                    self.results.append(entry['text'])
                    
            self.abstract = '\n'.join(self.abstract)
            self.body_text = '\n'.join(self.body_text)
            self.methods = '\n'.join(self.methods)
            self.results = '\n'.join(self.results)

    def __repr__(self):
        return f'{self.paper_id}: {self.abstract[:200]}... {self.body_text[:200]}...'
first_row = FileReader(all_json[0])
print(first_row)

b2f67d533f2749807f2537f3775b39da3b186051: ... There is a disproportionate number of individuals with mental and somatic illnesses among persons in detention (Bhugra, 2020; Ginn, 2012) . It is also known that infections which are transmitted human...
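The section matching used in `FileReader` can be tried in isolation. This is a minimal sketch, assuming a shortened keyword list and made-up section headings; the helper names `normalise` and `is_methods_section` are hypothetical and not part of the notebook.

```python
def normalise(title):
    """Lowercase a heading and keep letters only (drops digits, spaces, punctuation)."""
    return ''.join(ch.lower() for ch in title if ch.isalpha())

# A shortened keyword list, for illustration only.
method_keys = [normalise(m) for m in ['methods', 'materials and methods', 'study design']]

def is_methods_section(title):
    """True if any normalised keyword occurs inside the normalised heading."""
    cleaned = normalise(title)
    return any(key in cleaned for key in method_keys)

# Made-up section headings
print(normalise('2.1 Materials and Methods'))                      # materialsandmethods
print(is_methods_section('2.1 Materials and Methods'))             # True
print(is_methods_section('3. Study Design'))                       # True
print(is_methods_section('Introduction'))                          # False
print(is_methods_section('Discussion of methods used elsewhere'))  # True
```

The last heading shows that substring matching is permissive: any heading that merely mentions a keyword is treated as a methods section.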


In [8]:
dict_ = {'paper_id': [], 'abstract': [], 'body_text': [], 'methods': [], 'results': []}
for idx, entry in enumerate(all_json):
    if idx % (len(all_json) // 10) == 0:
        print(f'Processing index: {idx} of {len(all_json)}')
    content = FileReader(entry)
    dict_['paper_id'].append(content.paper_id)
    dict_['abstract'].append(content.abstract)
    dict_['body_text'].append(content.body_text)
    dict_['methods'].append(content.methods)
    dict_['results'].append(content.results)

Processing index: 0 of 2466
Processing index: 246 of 2466
Processing index: 492 of 2466
Processing index: 738 of 2466
Processing index: 984 of 2466
Processing index: 1230 of 2466
Processing index: 1476 of 2466
Processing index: 1722 of 2466
Processing index: 1968 of 2466
Processing index: 2214 of 2466
Processing index: 2460 of 2466


In [9]:
papers = pd.DataFrame(dict_, columns=['paper_id', 'abstract', 'body_text', 'methods', 'results'])
papers.head()

Unnamed: 0,paper_id,abstract,body_text,methods,results
0,b2f67d533f2749807f2537f3775b39da3b186051,,There is a disproportionate number of individu...,,
1,ad98979eada6e333a276d39efdce21779d538625,While noncanonic xanthine nucleotides XMP/dXMP...,The concentration and ratio of purine nucleoti...,Guanine-based phosphonate (0.5 mmol) was disso...,
2,464f7d3a460eb51dbc25bd12639b22079a73f85a,Long non-coding RNAs (lncRNAs) are found not o...,Viruses are important infectious agents that i...,,
3,c436139975d97ef929b5d8452595de40bda0c11c,on behalf of the IRC002 Study Team Summary Bac...,Pandemic influenza remains a global health thr...,"This was a randomized, open-label, multicenter...","Between January 2011 and April 2015, a total o..."
4,634128ea7d7736750e1c3cd0a48bb37843d06dac,The majority of emerging zoonoses originate in...,"A total of 12,793 consensus PCR assays were pe...","Samples and PCR screening. Samples (n ϭ 1,897)...",


In [10]:
papers[(papers.results.str.len() != 0) | (papers.methods.str.len() != 0)].shape

(1131, 5)

In [11]:
df = pd.merge(papers, meta_df, left_on='paper_id', right_on='sha', how='left').drop('sha', axis=1)

In [12]:
df.columns

Index(['paper_id', 'abstract_x', 'body_text', 'methods', 'results', 'cord_uid',
       'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license',
       'abstract_y', 'publish_time', 'authors', 'journal',
       'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse',
       'has_pmc_xml_parse', 'full_text_file', 'url'],
      dtype='object')
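The left merge above keeps every parsed paper, even when no metadata row shares its `sha`, and it suffixes the overlapping `abstract` column with `_x` (left) and `_y` (right). A small sketch on toy frames (ids, texts and the `*_demo` names are invented) makes this visible; the `indicator=True` argument is an addition for illustration and is not used in the notebook.

```python
import pandas as pd

# Toy stand-ins for `papers` and `meta_df`; ids and texts are invented.
papers_demo = pd.DataFrame({'paper_id': ['aaa', 'bbb'],
                            'abstract': ['from json', '']})
meta_demo = pd.DataFrame({'sha': ['aaa'],
                          'abstract': ['from metadata'],
                          'title': ['Some title']})

merged = pd.merge(papers_demo, meta_demo, left_on='paper_id', right_on='sha',
                  how='left', indicator=True).drop('sha', axis=1)

print(merged.columns.tolist())
# ['paper_id', 'abstract_x', 'abstract_y', 'title', '_merge']
print(merged['_merge'].tolist())
# ['both', 'left_only']
```

Rows marked `left_only` are the ones whose metadata columns come out empty, as for index 0 in the `df.head()` output further down.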

# Non-commercial use pmc_json

These files contain only the full text, no abstracts!

In [13]:
class FileReader:
    def __init__(self, file_path):
        with open(file_path) as file:
            content = json.load(file)
            self.paper_id = content['paper_id']
            self.body_text = []
            self.methods = []
            self.results = []

            # Body text
            for entry in content['body_text']:
                self.body_text.append(entry['text'])
            # Methods
            methods = ['methods','method','statistical methods','materials','materials and methods',
                'data collection','the study','study design','experimental design','objective',
                'objectives','procedures','data collection and analysis', 'methodology',
                'material and methods','the model','experimental procedures','main text']
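            # Normalise the keywords once (lowercase, letters only) rather than for every paragraph
            method_keys = [''.join(x.lower() for x in m if x.isalpha()) for m in methods]
            for entry in content['body_text']:
                # lowercase, letters only: drops numbers, spaces and punctuation
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())
                if any(key in section_title for key in method_keys):
                    self.methods.append(entry['text'])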
            # Results
            results_synonyms = ['result']
            for entry in content['body_text']:
                section_title = ''.join(x.lower() for x in entry['section'] if x.isalpha())
                if any(r in section_title for r in results_synonyms) :
                    self.results.append(entry['text'])
                    
            self.body_text = '\n'.join(self.body_text)
            self.methods = '\n'.join(self.methods)
            self.results = '\n'.join(self.results)

    def __repr__(self):
        return f'{self.paper_id}: {self.body_text[:200]}...'
first_row = FileReader(all_json_pmc[0])
print(first_row)

PMC4834006: In 2009, a novel type A influenza (H1N1) virus was first identified in patients from Mexico and has since, spread globally.[1] During peak periods of seasonal influenza, the pandemic strain of H1N1 vi...


In [14]:
dict_ = {'paper_id': [], 'body_text': [], 'methods': [], 'results': []}
for idx, entry in enumerate(all_json_pmc):
    if idx % (len(all_json_pmc) // 10) == 0:
        print(f'Processing index: {idx} of {len(all_json_pmc)}')
    content = FileReader(entry)
    dict_['paper_id'].append(content.paper_id)
    dict_['body_text'].append(content.body_text)
    dict_['methods'].append(content.methods)
    dict_['results'].append(content.results)

Processing index: 0 of 2212
Processing index: 221 of 2212
Processing index: 442 of 2212
Processing index: 663 of 2212
Processing index: 884 of 2212
Processing index: 1105 of 2212
Processing index: 1326 of 2212
Processing index: 1547 of 2212
Processing index: 1768 of 2212
Processing index: 1989 of 2212
Processing index: 2210 of 2212


In [15]:
pmc_text = pd.DataFrame(dict_, columns=['paper_id', 'body_text', 'methods', 'results'])
pmc_text.head()

Unnamed: 0,paper_id,body_text,methods,results
0,PMC4834006,"In 2009, a novel type A influenza (H1N1) virus...",This study and the use of patient case files w...,"The demographic data, clinical data, and radio..."
1,PMC6780997,"There are many pathogens, such as HIV-1, respi...",Blood was obtained by venipuncture from health...,To generate anti-idiotype antibodies to iglb12...
2,PMC4142007,The recognition of newly described infectious ...,,
3,PMC5508335,"\nE.H. Chapel1, B.A. Scansen2, K.E. Schober1, ...","\nA.E.M. Gonçalves1, P. Itikawa1, G.T. Goldfed...","\nR. de Oliveira Alves Carvalho1, A.P.A. Costa..."
4,PMC4706628,There are more than 150 species of Candida but...,,


In [16]:
pmc_text.shape

(2212, 4)

Careful: some of the PMC body texts are empty strings!

In [17]:
pmc_text[pmc_text.body_text == '']

Unnamed: 0,paper_id,body_text,methods,results
45,PMC6255065,,,
57,PMC2186492,,,
58,PMC2291173,,,
70,PMC2289203,,,
75,PMC2289628,,,
...,...,...,...,...
2109,PMC2288887,,,
2128,PMC2121027,,,
2129,PMC7088696,,,
2186,PMC6808672,,,


In [18]:
pmc_text = pmc_text[pmc_text.body_text != '']

In [19]:
pmc_text.shape

(2056, 4)

In [20]:
df.head()

Unnamed: 0,paper_id,abstract_x,body_text,methods,results,cord_uid,source_x,title,doi,pmcid,...,abstract_y,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url
0,b2f67d533f2749807f2537f3775b39da3b186051,,There is a disproportionate number of individu...,,,,,,,,...,,,,,,,,,,
1,ad98979eada6e333a276d39efdce21779d538625,While noncanonic xanthine nucleotides XMP/dXMP...,The concentration and ratio of purine nucleoti...,Guanine-based phosphonate (0.5 mmol) was disso...,,hdpanetr,PMC,Xanthine-based acyclic nucleoside phosphonates...,10.1177/2040206618813050,PMC6287304,...,While noncanonic xanthine nucleotides XMP/dXMP...,2018-11-29,"Baszczyňski, Ondřej; Kaiser, Martin Maxmilian;...",Antivir Chem Chemother,,,True,True,noncomm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
2,464f7d3a460eb51dbc25bd12639b22079a73f85a,Long non-coding RNAs (lncRNAs) are found not o...,Viruses are important infectious agents that i...,,,,,,,,...,,,,,,,,,,
3,c436139975d97ef929b5d8452595de40bda0c11c,on behalf of the IRC002 Study Team Summary Bac...,Pandemic influenza remains a global health thr...,"This was a randomized, open-label, multicenter...","Between January 2011 and April 2015, a total o...",,,,,,...,,,,,,,,,,
4,634128ea7d7736750e1c3cd0a48bb37843d06dac,The majority of emerging zoonoses originate in...,"A total of 12,793 consensus PCR assays were pe...","Samples and PCR screening. Samples (n ϭ 1,897)...",,6lobyyj4,PMC,A Strategy To Estimate Unknown Viral Diversity...,10.1128/mbio.00598-13,PMC3760253,...,The majority of emerging zoonoses originate in...,2013-09-03,"Anthony, Simon J.; Epstein, Jonathan H.; Murra...",mBio,,,True,True,noncomm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...


In [21]:
df = pd.merge(df, pmc_text, left_on='pmcid', right_on='paper_id', how='left')

In [22]:
df.head(3)

Unnamed: 0,paper_id_x,abstract_x,body_text_x,methods_x,results_x,cord_uid,source_x,title,doi,pmcid,...,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y,body_text_y,methods_y,results_y
0,b2f67d533f2749807f2537f3775b39da3b186051,,There is a disproportionate number of individu...,,,,,,,,...,,,,,,,,,,
1,ad98979eada6e333a276d39efdce21779d538625,While noncanonic xanthine nucleotides XMP/dXMP...,The concentration and ratio of purine nucleoti...,Guanine-based phosphonate (0.5 mmol) was disso...,,hdpanetr,PMC,Xanthine-based acyclic nucleoside phosphonates...,10.1177/2040206618813050,PMC6287304,...,,,True,True,noncomm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6287304,The concentration and ratio of purine nucleoti...,Starting compounds and other chemicals were pu...,
2,464f7d3a460eb51dbc25bd12639b22079a73f85a,Long non-coding RNAs (lncRNAs) are found not o...,Viruses are important infectious agents that i...,,,,,,,,...,,,,,,,,,,


In [23]:
df.columns

Index(['paper_id_x', 'abstract_x', 'body_text_x', 'methods_x', 'results_x',
       'cord_uid', 'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license',
       'abstract_y', 'publish_time', 'authors', 'journal',
       'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse',
       'has_pmc_xml_parse', 'full_text_file', 'url', 'paper_id_y',
       'body_text_y', 'methods_y', 'results_y'],
      dtype='object')

# Exploration/Cleaning

### Different Abstract in Metadata and JSON files

abstract_x comes from the JSON files, abstract_y from the metadata.

In [24]:
df[df.abstract_x != df.abstract_y].shape

(2212, 26)

In [25]:
df[df.abstract_x != df.abstract_y][['abstract_x', 'abstract_y', 'url']].tail(10)

Unnamed: 0,abstract_x,abstract_y,url
2456,"Abbreviations used in this paper: CSF, cytosta...",How cells shape and remodel organelles in resp...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
2457,,Introduction: The use of antibiotics is based ...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
2458,Retrovirus Moloney murine leukemia virus (M-Mu...,Retrovirus Moloney murine leukemia virus (M-Mu...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2...
2459,Web-based social media is increasingly being u...,Web-based social media is increasingly being u...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
2460,,INTRODUCTION: Kawasaki disease (KD) most commo...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
2461,All positive-strand RNA viruses induce membran...,All positive-strand RNA viruses induce membran...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...
2462,J Jo ou ur rn na al l o of f C Ca an nc ce er ...,The advancement of high throughput omic techno...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
2463,The One Health initiative is increasingly beco...,The One Health initiative is increasingly beco...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
2464,The Feline coronavirus (FCoV) can lead to Feli...,The Feline coronavirus (FCoV) can lead to Feli...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
2465,"With over 4,500 deaths and counting, and new c...","With over 4,500 deaths and counting, and new c...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...


In [26]:
df.abstract_x.isnull().sum(), (df.abstract_x =='').sum() # missing abstracts in json files

(0, 689)

In [27]:
df.abstract_y.isnull().sum(), (df.abstract_y=='').sum() # missing abstracts in metadata

(799, 0)

The abstracts from the metadata seem more reliable (see row 2462 above, where the JSON abstract is garbled), so we generally use them and fill missing values with the abstract extracted from the JSON file.

In [28]:
df.loc[df.abstract_y.isnull() & (df.abstract_x != ''), 'abstract_y'] = df[(df.abstract_y.isnull()) & (df.abstract_x != '')].abstract_x

In [29]:
df.abstract_y.isnull().sum()

318
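The fill rule used here can be checked on a toy frame. The sketch below is an equivalent formulation with `fillna` rather than the `.loc` assignment in the notebook; the rows are invented and cover the three cases (both sources present, JSON only, neither).

```python
import numpy as np
import pandas as pd

# Invented rows covering the three cases
demo = pd.DataFrame({
    'abstract_x': ['json abstract', 'json only', ''],     # parsed from JSON, '' when missing
    'abstract_y': ['metadata abstract', np.nan, np.nan],  # from metadata, NaN when missing
})

# Treat empty JSON abstracts as missing, then use them only where the metadata has none.
json_abstract = demo['abstract_x'].replace('', np.nan)
demo['abstract'] = demo['abstract_y'].fillna(json_abstract)

print(demo['abstract'].tolist()[:2])   # ['metadata abstract', 'json only']
print(demo['abstract'].isna().sum())   # 1
```

The one remaining missing value corresponds to a paper with no abstract in either source.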

The remaining missing values are also empty in the JSON files.

In [30]:
(df.abstract_y.isnull() & (df.abstract_x!='')).sum()

0

In [31]:
df.rename(columns = {'abstract_y': 'abstract'}, inplace=True)
df.drop('abstract_x', axis=1, inplace=True)

In [32]:
df.columns

Index(['paper_id_x', 'body_text_x', 'methods_x', 'results_x', 'cord_uid',
       'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'body_text_y', 'methods_y',
       'results_y'],
      dtype='object')

We still have to compare the text bodies from the pdf and pmc files.

In [33]:
df.shape

(2466, 25)

# Quick comparison of both texts

In [34]:
df.columns

Index(['paper_id_x', 'body_text_x', 'methods_x', 'results_x', 'cord_uid',
       'source_x', 'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'body_text_y', 'methods_y',
       'results_y'],
      dtype='object')

In [35]:
df[['methods_x', 'methods_y', 'url']][df.methods_y.notnull()]

Unnamed: 0,methods_x,methods_y,url
1,Guanine-based phosphonate (0.5 mmol) was disso...,Starting compounds and other chemicals were pu...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
4,"Samples and PCR screening. Samples (n ϭ 1,897)...","Samples (n = 1,897) were collected from appare...",https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...
5,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
7,,Calu3 cells were utilized as previously descri...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
9,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...
...,...,...,...
2461,,Vero E6 cells were maintained in Dulbecco’s mo...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...
2462,Statistical methods test scientific theories w...,Statistical methods test scientific theories w...,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4...
2463,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...
2464,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...


In [36]:
(df.methods_x == '').sum(), df.methods_x.isnull().sum()

(1568, 0)

In [37]:
(df.methods_y == '').sum(), df.methods_y.isnull().sum()

(634, 783)

In [38]:
# use methods_y (from pmc) when it's available
mask = (df.methods_y.notnull()) & (df.methods_y != '')
df.loc[mask, 'methods_x'] = df.loc[mask, 'methods_y']

# same for results
mask = (df.results_y.notnull()) & (df.results_y != '')
df.loc[mask, 'results_x'] = df.loc[mask, 'results_y']
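The preference rule in the cell above (take the PMC value when it is neither missing nor empty, otherwise keep the PDF value) can be written as a small reusable function. This is only a sketch; `prefer_second` is a hypothetical helper name and the values are invented.

```python
import numpy as np
import pandas as pd

def prefer_second(first, second):
    """Return `second` where it is non-null and non-empty, otherwise `first`."""
    usable = second.notnull() & (second != '')
    return first.mask(usable, second)

# Invented values: methods_x mimics the PDF parse, methods_y the PMC parse
demo = pd.DataFrame({
    'methods_x': ['pdf methods', 'pdf methods', ''],
    'methods_y': ['pmc methods', '', np.nan],
})
demo['methods'] = prefer_second(demo['methods_x'], demo['methods_y'])

print(demo['methods'].tolist())  # ['pmc methods', 'pdf methods', '']
```

The same call would cover the results columns, which avoids repeating the mask logic.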

In [39]:
(df.results_x == '').sum(), df.results_x.isnull().sum()

(1261, 0)

In [40]:
(df.results_y == '').sum(), df.results_y.isnull().sum()

(672, 783)

In [41]:
df.rename(columns = {'methods_x': 'methods', 'results_x': 'results'}, inplace=True)
df.drop(columns=['methods_y', 'results_y'], inplace=True)

In [42]:
df.rename(columns = {'paper_id_x': 'paper_id', 'source_x': 'source'}, inplace=True)

In [43]:
df.columns

Index(['paper_id', 'body_text_x', 'methods', 'results', 'cord_uid', 'source',
       'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'body_text_y'],
      dtype='object')

In [44]:
df.head()

Unnamed: 0,paper_id,body_text_x,methods,results,cord_uid,source,title,doi,pmcid,pubmed_id,...,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y,body_text_y
0,b2f67d533f2749807f2537f3775b39da3b186051,There is a disproportionate number of individu...,,,,,,,,,...,,,,,,,,,,
1,ad98979eada6e333a276d39efdce21779d538625,The concentration and ratio of purine nucleoti...,Starting compounds and other chemicals were pu...,,hdpanetr,PMC,Xanthine-based acyclic nucleoside phosphonates...,10.1177/2040206618813050,PMC6287304,30497281.0,...,"Baszczyňski, Ondřej; Kaiser, Martin Maxmilian;...",Antivir Chem Chemother,,,True,True,noncomm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6287304,The concentration and ratio of purine nucleoti...
2,464f7d3a460eb51dbc25bd12639b22079a73f85a,Viruses are important infectious agents that i...,,,,,,,,,...,,,,,,,,,,
3,c436139975d97ef929b5d8452595de40bda0c11c,Pandemic influenza remains a global health thr...,"This was a randomized, open-label, multicenter...","Between January 2011 and April 2015, a total o...",,,,,,,...,,,,,,,,,,
4,634128ea7d7736750e1c3cd0a48bb37843d06dac,"A total of 12,793 consensus PCR assays were pe...","Samples (n = 1,897) were collected from appare...","A total of 12,793 consensus PCR assays were pe...",6lobyyj4,PMC,A Strategy To Estimate Unknown Viral Diversity...,10.1128/mbio.00598-13,PMC3760253,24003179.0,...,"Anthony, Simon J.; Epstein, Jonathan H.; Murra...",mBio,,,True,True,noncomm_use_subset,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,PMC3760253,The majority of emerging infectious diseases (...


# Duplicates

Some paper ids may be duplicated, so we compare the row count with the number of unique ids.

In [45]:
len(df)

2466

In [46]:
df.paper_id.nunique()

2466

In [47]:
df[df.duplicated(subset=['paper_id'], keep=False)][['paper_id', 'body_text_x']]

Unnamed: 0,paper_id,body_text_x


Where paper ids are duplicated, the rows luckily also have the same text body, so we just keep one article per paper_id. In this subset the check above returns no rows.
Check for example [https://www.sciencedirect.com/science/article/pii/S1386653209701295?via%3Dihub](https://www.sciencedirect.com/science/article/pii/S1386653209701295?via%3Dihub) and [https://www.sciencedirect.com/science/article/pii/S1386653209701325?via%3Dihub](https://www.sciencedirect.com/science/article/pii/S1386653209701325?via%3Dihub) - they have the same content.

In [48]:
df[df.duplicated(subset=['paper_id', 'body_text_x'], keep=False)].shape

(0, 23)

In [49]:
df.drop_duplicates(['paper_id', 'body_text_x'], inplace=True)

In [50]:
len(df)

2466

In [51]:
df[df.duplicated(['paper_id'], keep=False)].head(2)

Unnamed: 0,paper_id,body_text_x,methods,results,cord_uid,source,title,doi,pmcid,pubmed_id,...,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,full_text_file,url,paper_id_y,body_text_y


In [52]:
df.drop_duplicates(['paper_id'], inplace=True)

In [53]:
df.paper_id.nunique()

2466

In [54]:
df.shape

(2466, 23)

Now the paper_id is unique.
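Because this subset contains no duplicated ids, the checks above return empty frames. A toy example (ids and texts invented) shows what they would report when duplicates do exist, including the case where duplicated ids disagree on the text.

```python
import pandas as pd

# Invented ids and texts
demo = pd.DataFrame({
    'paper_id': ['aaa', 'aaa', 'bbb', 'ccc', 'ccc'],
    'body_text': ['same text', 'same text', 'text b', 'version 1', 'version 2'],
})

# keep=False flags every member of a duplicated group, not just the repeats
dup_ids = demo[demo.duplicated(subset=['paper_id'], keep=False)]

# ids whose duplicated rows disagree on the text
text_variants = dup_ids.groupby('paper_id')['body_text'].nunique()
conflicting = text_variants[text_variants > 1].index.tolist()

# keeps the first row per id
deduped = demo.drop_duplicates(['paper_id'])

print(len(dup_ids))   # 4
print(conflicting)    # ['ccc']
print(len(deduped))   # 3
```

An empty `conflicting` list would mean that dropping duplicates by `paper_id` alone loses no distinct text.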

In [55]:
df.isnull().sum()

paper_id                          0
body_text_x                       0
methods                           0
results                           0
cord_uid                        648
source                          648
title                           648
doi                             807
pmcid                           671
pubmed_id                       714
license                         648
abstract                        318
publish_time                    648
authors                         760
journal                         660
Microsoft Academic Paper ID    2439
WHO #Covidence                 2429
has_pdf_parse                   648
has_pmc_xml_parse               648
full_text_file                  648
url                             648
paper_id_y                      783
body_text_y                     783
dtype: int64

# Some new columns for convenience

In [56]:
# some new columns for convenience
df['publish_year'] = df.publish_time.str[:4].fillna(-1).astype(int) # missing publish_time becomes -1
# df['link'] = 'http://dx.doi.org/' + df.doi #dataset now has url column
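The year extraction can be illustrated on a few invented `publish_time` values. Note that this sketch uses `pd.to_numeric` with `errors='coerce'` instead of the notebook's `fillna(-1)` on strings, as one way to also map unparsable entries to -1; that is an assumption about desired behaviour, not what the cell above does.

```python
import pandas as pd

# Invented publish_time values, including a missing one and a year-only one
publish_time = pd.Series(['2003-01-13', None, '2020'], dtype=object)

# The first four characters hold the year; missing or unparsable entries become -1
publish_year = (pd.to_numeric(publish_time.str[:4], errors='coerce')
                  .fillna(-1)
                  .astype(int))

print(publish_year.tolist())  # [2003, -1, 2020]
```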

In [57]:
df.columns

Index(['paper_id', 'body_text_x', 'methods', 'results', 'cord_uid', 'source',
       'title', 'doi', 'pmcid', 'pubmed_id', 'license', 'abstract',
       'publish_time', 'authors', 'journal', 'Microsoft Academic Paper ID',
       'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse',
       'full_text_file', 'url', 'paper_id_y', 'body_text_y', 'publish_year'],
      dtype='object')

In [58]:
df['is_covid19'] = df.body_text_x.str.contains('COVID-19|covid|sars cov 2|SARS-CoV-2|2019-nCov|2019 ncov|SARS Coronavirus 2|2019 Novel Coronavirus|coronavirus 2019|Wuhan coronavirus|wuhan pneumonia|wuhan virus', case=False)

In [59]:
df.is_covid19.sum()

179
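A long alternation pattern is easier to maintain when it is built from a list of terms. The sketch below uses an illustrative subset of terms and invented sentences; it is not the full list used for `is_covid19`.

```python
import re
import pandas as pd

# Illustrative subset of search terms; not the full list used above
terms = ['COVID-19', 'SARS-CoV-2', '2019-nCoV', 'Wuhan coronavirus']
# re.escape guards against characters with a special regex meaning
pattern = '|'.join(re.escape(t) for t in terms)

texts = pd.Series([
    'We report 12 cases of covid-19 pneumonia.',
    'MERS-CoV was first identified in 2012.',
    'Sequencing of 2019-ncov isolates.',
])
flags = texts.str.contains(pattern, case=False)

print(flags.tolist())  # [True, False, True]
```

With `case=False` each term needs to appear only once in the list, regardless of capitalisation.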

# Language detection to remove non-English articles and abstracts

In [60]:
from IPython.utils import io

with io.capture_output() as captured:
    # scispaCy, its large biomedical model (v0.2.4) and the language-detection plug-in
    !pip install scispacy
    !pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.4/en_core_sci_lg-0.2.4.tar.gz
    !pip install spacy-langdetect

In [61]:
import scispacy
import spacy
import en_core_sci_lg
from spacy_langdetect import LanguageDetector

In [62]:
# large scispaCy model (en_core_sci_lg); tagger and NER are disabled
nlp = en_core_sci_lg.load(disable=["tagger", "ner"])
nlp.max_length = 2000000
nlp.add_pipe(LanguageDetector(), name='language_detector', last=True)

Check language of each text body (only use the first 2000 characters).

In [63]:
df['text_language'] = df.body_text_x.apply(lambda x: nlp(str(x[:2000]))._.language['language'])

df.text_language.value_counts()

en         2445
es            8
UNKNOWN       2
de            2
it            2
fr            2
et            1
ko            1
da            1
ca            1
zh-cn         1
Name: text_language, dtype: int64

## Number of non-English texts to drop

In [64]:
df.loc[df[df.text_language != 'en'].index].shape

(21, 26)

In [65]:
df = df.drop(df[df.text_language != 'en'].index)

In [66]:
# Check language of all abstracts

# df['abstract_lang'] = df.abstract.apply(lambda x: nlp(str(x))._.language['language'])

#  df[df.abstract.isnull()]

In [67]:
# Number of non-english abstracts

# df[(df.abstract_lang != 'en') & (df.abstract.notnull())].abstract_lang.value_counts()

# Keep all english abstracts and those without abstract

# df = df[(df.abstract_lang == 'en') | (df.abstract.isnull())]

# df.shape

# df.paper_id.nunique()

# Analyze title/text body of the papers without abstract

# temp = df[df.abstract.isnull()].copy()

# def remove_non_english_sentences(doc):
#     doc = nlp(doc)
#     doc_engl = ''
#     for s in doc.sents:
#         if (s._.language['language'] == 'en'):
#             doc_engl += s.text 
#     return doc_engl

# remove_non_english_sentences(df[df.paper_id == '1a8a4dbbaa94ced4ef6af69ec7a09d3fa4c0eece'].body_text.iloc[0])

# temp['text_length'] = temp.body_text.apply(lambda x: len(x))

# temp['english_text'] = temp.body_text.apply(remove_non_english_sentences)

# temp['english_length'] = temp.english_text.apply(lambda x: len(x))

# temp.to_csv('df_english.csv', index=False)

# (temp.english_length/temp.text_length).hist()

# ((temp.english_length/temp.text_length)<0.8).sum()

# temp[((temp.english_length/temp.text_length)<0.8)].head()

# temp[temp.paper_id == '7925057cfe0cb75ae6079879cb2d22d23e42dfa5'].body_text.values[0][:500]

# temp[temp.paper_id == '617197cc751a9208cb0af1b4e31baeddc8d2e985'].body_text.values[0]

# temp[temp.paper_id == 'ca51b53fa512085e1aa166d5308602ff1666a90c'].body_text.values[0][:500]

# df = df.drop(temp[((temp.english_length/temp.text_length)<0.8)].index)

In [68]:
# temp['title_lang'] = df.title.apply(lambda x: nlp(str(x))._.language['language'])

# temp.title_lang.value_counts()

# Too many false positives.

# temp[temp.paper_id == '6f6b7b1efffae7f3765f29fe801ab63dd35110bb'].body_text.values[0]

# temp[temp.title_lang == 'de']

# We check the beginning of each text body instead.

# temp['text_lang'] = df.body_text.apply(lambda x: nlp(str(x[:2000]))._.language['language'])

# temp.text_lang.value_counts()

# Number of non-english texts to drop.

# df.loc[temp[temp.text_lang != 'en'].index].shape

# df = df.drop(temp[temp.text_lang != 'en'].index)

In [69]:
df.drop(columns=['cord_uid', 'pmcid', 'pubmed_id', 'full_text_file', 'license', 'text_language'], inplace=True)

In [70]:
df.head()

Unnamed: 0,paper_id,body_text_x,methods,results,source,title,doi,abstract,publish_time,authors,journal,Microsoft Academic Paper ID,WHO #Covidence,has_pdf_parse,has_pmc_xml_parse,url,paper_id_y,body_text_y,publish_year,is_covid19
0,b2f67d533f2749807f2537f3775b39da3b186051,There is a disproportionate number of individu...,,,,,,,,,,,,,,,,,-1,True
1,ad98979eada6e333a276d39efdce21779d538625,The concentration and ratio of purine nucleoti...,Starting compounds and other chemicals were pu...,,PMC,Xanthine-based acyclic nucleoside phosphonates...,10.1177/2040206618813050,While noncanonic xanthine nucleotides XMP/dXMP...,2018-11-29,"Baszczyňski, Ondřej; Kaiser, Martin Maxmilian;...",Antivir Chem Chemother,,,True,True,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6...,PMC6287304,The concentration and ratio of purine nucleoti...,2018,False
2,464f7d3a460eb51dbc25bd12639b22079a73f85a,Viruses are important infectious agents that i...,,,,,,Long non-coding RNAs (lncRNAs) are found not o...,,,,,,,,,,,-1,False
3,c436139975d97ef929b5d8452595de40bda0c11c,Pandemic influenza remains a global health thr...,"This was a randomized, open-label, multicenter...","Between January 2011 and April 2015, a total o...",,,,on behalf of the IRC002 Study Team Summary Bac...,,,,,,,,,,,-1,False
4,634128ea7d7736750e1c3cd0a48bb37843d06dac,"A total of 12,793 consensus PCR assays were pe...","Samples (n = 1,897) were collected from appare...","A total of 12,793 consensus PCR assays were pe...",PMC,A Strategy To Estimate Unknown Viral Diversity...,10.1128/mbio.00598-13,The majority of emerging zoonoses originate in...,2013-09-03,"Anthony, Simon J.; Epstein, Jonathan H.; Murra...",mBio,,,True,True,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,PMC3760253,The majority of emerging infectious diseases (...,2013,False


In [71]:
df.columns

Index(['paper_id', 'body_text_x', 'methods', 'results', 'source', 'title',
       'doi', 'abstract', 'publish_time', 'authors', 'journal',
       'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse',
       'has_pmc_xml_parse', 'url', 'paper_id_y', 'body_text_y', 'publish_year',
       'is_covid19'],
      dtype='object')

In [72]:
df = df.drop(columns = ['methods', 'body_text_x', 'source', 'doi', 'Microsoft Academic Paper ID', 'WHO #Covidence', 'has_pdf_parse', 'has_pmc_xml_parse', 'url', 'paper_id_y', 'body_text_y'])

In [73]:
df.columns

Index(['paper_id', 'results', 'title', 'abstract', 'publish_time', 'authors',
       'journal', 'publish_year', 'is_covid19'],
      dtype='object')

# Export as .csv

In [74]:
df['abstract'] = df['abstract'].replace('', np.nan)  # assign back; in-place replace on a selected column is unreliable in newer pandas

In [75]:
df['results'] = df['results'].replace('', np.nan)  # assign back; in-place replace on a selected column is unreliable in newer pandas

In [76]:
df = df.dropna()
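Note that `dropna()` without arguments removes a row if any column is missing, including title, authors or journal. The toy example below (rows invented) contrasts this with restricting the check to the two text columns via `subset`; the `subset` variant is shown as an alternative only and is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Invented rows
demo = pd.DataFrame({
    'paper_id': ['aaa', 'bbb', 'ccc'],
    'abstract': ['an abstract', '', 'another abstract'],
    'results':  ['some results', 'more results', 'final results'],
    'journal':  ['mBio', 'Cell Cycle', np.nan],
})

# Turn empty strings into missing values, assigning the result back
for col in ['abstract', 'results']:
    demo[col] = demo[col].replace('', np.nan)

strict = demo.dropna()                                  # any missing column drops the row
targeted = demo.dropna(subset=['abstract', 'results'])  # only the text columns matter

print(strict['paper_id'].tolist())    # ['aaa']
print(targeted['paper_id'].tolist())  # ['aaa', 'ccc']
```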

In [77]:
results = df.drop('abstract', axis = 1)

In [78]:
abstract = df.drop('results', axis = 1)

In [79]:
results['section_num'] = 1

In [80]:
abstract['section_num'] = 0

In [81]:
results = results.rename({'results' : 'text'}, axis = 1)
results

Unnamed: 0,paper_id,text,title,publish_time,authors,journal,publish_year,is_covid19,section_num
4,634128ea7d7736750e1c3cd0a48bb37843d06dac,"A total of 12,793 consensus PCR assays were pe...",A Strategy To Estimate Unknown Viral Diversity...,2013-09-03,"Anthony, Simon J.; Epstein, Jonathan H.; Murra...",mBio,2013,False,1
7,fef0bb9eaac69559d0ff2f92ff83e0affd4435f0,ISG expression varies based on cell and tissue...,Pathogenic Influenza Viruses and Coronaviruses...,2014-05-20,"Menachery, Vineet D.; Eisfeld, Amie J.; Schäfe...",mBio,2014,False,1
14,246f59ddffedd4a166b9317dee38bdf6077b2f3f,The X-ray crystallography studies showed that ...,Identification of novel multitargeted PPARα/γ/...,2014-11-07,"Wang, Xue-Jiao; Zhang, Jun; Wang, Shu-Qing; Xu...",Drug Des Devel Ther,2014,False,1
17,9cda860c97d430aea207a063d13e8612e023320c,The efficacies of the BVD control measures wer...,Assessment of the cost effectiveness of compul...,2019-03-01,"ISODA, Norikazu; ASANO, Akihiro; ICHIJO, Michi...",J Vet Med Sci,2019,False,1
18,b66704a03a688c4065abff41c4977c4c9939c230,"Of the 50 enrolled patients, 43 had completed ...","Lobeglitazone, a Novel Thiazolidinedione, Impr...",2016-11-08,"Lee, Yong-ho; Kim, Jae Hyeon; Kim, So Ra; Jin,...",J Korean Med Sci,2016,False,1
...,...,...,...,...,...,...,...,...,...
2452,d00f4dabc58eca254f3b2ca4efafda00d671c3da,"As noted above, a motif based search removes t...",Cell cycle control (and more) by programmed −1...,2015-01-13,"Belew, Ashton T; Dinman, Jonathan D",Cell Cycle,2015,False,1
2454,4c3f357d50bfeede5dd0eec81a4b5b7f116c628a,"We quantified a total of 1109 plasma proteins,...",Plasma biomarker screening for liver fibrosis ...,2011-05-15,"Li, ShuLong; Liu, Xin; Wei, Lai; Wang, HuiFen;...",Sci China Life Sci,2011,False,1
2456,2ffacfd58f57a95344119c15d27560cdeaea2285,Metaphase-arrested egg extracts from X. laevis...,The calcium-dependent ribonuclease XendoU prom...,2014-10-13,"Schwarz, Dianne S.; Blower, Michael D.",J Cell Biol,2014,False,1
2457,4c5c841e4ad3fbf9b31d6c0c282dfec035f716bb,"With respect to the exclusion criteria, only 7...",The association between blood eosinophil perce...,2019-05-06,"Choi, Juwhan; Oh, Jee Youn; Lee, Young Seok; H...",Int J Chron Obstruct Pulmon Dis,2019,False,1


In [82]:
abstract = abstract.rename({'abstract' : 'text'}, axis = 1)
abstract

Unnamed: 0,paper_id,title,text,publish_time,authors,journal,publish_year,is_covid19,section_num
4,634128ea7d7736750e1c3cd0a48bb37843d06dac,A Strategy To Estimate Unknown Viral Diversity...,The majority of emerging zoonoses originate in...,2013-09-03,"Anthony, Simon J.; Epstein, Jonathan H.; Murra...",mBio,2013,False,0
7,fef0bb9eaac69559d0ff2f92ff83e0affd4435f0,Pathogenic Influenza Viruses and Coronaviruses...,The broad range and diversity of interferon-st...,2014-05-20,"Menachery, Vineet D.; Eisfeld, Amie J.; Schäfe...",mBio,2014,False,0
14,246f59ddffedd4a166b9317dee38bdf6077b2f3f,Identification of novel multitargeted PPARα/γ/...,The thiazolidinedione class peroxisome prolife...,2014-11-07,"Wang, Xue-Jiao; Zhang, Jun; Wang, Shu-Qing; Xu...",Drug Des Devel Ther,2014,False,0
17,9cda860c97d430aea207a063d13e8612e023320c,Assessment of the cost effectiveness of compul...,Bovine viral diarrhea (BVD) is a chronic disea...,2019-03-01,"ISODA, Norikazu; ASANO, Akihiro; ICHIJO, Michi...",J Vet Med Sci,2019,False,0
18,b66704a03a688c4065abff41c4977c4c9939c230,"Lobeglitazone, a Novel Thiazolidinedione, Impr...",Despite the rapidly increasing prevalence of n...,2016-11-08,"Lee, Yong-ho; Kim, Jae Hyeon; Kim, So Ra; Jin,...",J Korean Med Sci,2016,False,0
...,...,...,...,...,...,...,...,...,...
2452,d00f4dabc58eca254f3b2ca4efafda00d671c3da,Cell cycle control (and more) by programmed −1...,"Like most basic molecular mechanisms, programm...",2015-01-13,"Belew, Ashton T; Dinman, Jonathan D",Cell Cycle,2015,False,0
2454,4c3f357d50bfeede5dd0eec81a4b5b7f116c628a,Plasma biomarker screening for liver fibrosis ...,A non-invasive diagnostic approach is crucial ...,2011-05-15,"Li, ShuLong; Liu, Xin; Wei, Lai; Wang, HuiFen;...",Sci China Life Sci,2011,False,0
2456,2ffacfd58f57a95344119c15d27560cdeaea2285,The calcium-dependent ribonuclease XendoU prom...,How cells shape and remodel organelles in resp...,2014-10-13,"Schwarz, Dianne S.; Blower, Michael D.",J Cell Biol,2014,False,0
2457,4c5c841e4ad3fbf9b31d6c0c282dfec035f716bb,The association between blood eosinophil perce...,Introduction: The use of antibiotics is based ...,2019-05-06,"Choi, Juwhan; Oh, Jee Youn; Lee, Young Seok; H...",Int J Chron Obstruct Pulmon Dis,2019,False,0
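
The split into `abstract` and `results` frames with a `section_num` flag is a wide-to-long reshape. As a minimal sketch of the same idea, `DataFrame.melt` can do it in one step. The `toy` frame and its values are made up for illustration, and the intermediate `section` column is a hypothetical helper name, not something used in this notebook.

```python
import pandas as pd

# Toy stand-in for the merged frame; the values are invented for illustration.
toy = pd.DataFrame({
    'paper_id': ['a1', 'b2'],
    'title': ['Paper A', 'Paper B'],
    'abstract': ['Abstract of A', 'Abstract of B'],
    'results': ['Results of A', 'Results of B'],
})

# Stack the two text columns into a single 'text' column, keeping the identifiers.
long_df = toy.melt(
    id_vars=['paper_id', 'title'],
    value_vars=['abstract', 'results'],
    var_name='section',      # hypothetical helper column
    value_name='text',
)

# Same coding as in the cells above: 0 for abstract, 1 for results.
long_df['section_num'] = long_df['section'].map({'abstract': 0, 'results': 1})
long_df = long_df.drop(columns='section')
```

Each paper then contributes two rows, one per section, which matches the shape produced by concatenating the two frames.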


In [83]:
import datetime

In [84]:
# The two frames list their columns in different orders; sort=True keeps the
# alphabetical column order and avoids the FutureWarning about default sorting.
non_commercial_df = pd.concat([abstract, results], sort=True)


In [85]:
non_commercial_df['publish_month'] = pd.DatetimeIndex(non_commercial_df['publish_time']).month
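
`pd.DatetimeIndex(...)` raises an error if any value cannot be parsed as a date. That should not happen here as long as every remaining `publish_time` is a well-formed date string, which is an assumption about the data rather than something checked above. A more defensive sketch uses `pd.to_datetime` with `errors='coerce'`; the `dates` series below is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample; the last value is deliberately not a date.
dates = pd.Series(['2013-09-03', '2019-05-06', 'not a date'])

# Unparseable entries become NaT instead of raising an exception.
parsed = pd.to_datetime(dates, errors='coerce')

# Month number per row; rows that failed to parse end up as NaN.
months = parsed.dt.month
```

Rows whose month is missing can then be inspected or dropped explicitly instead of stopping the whole cell.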

In [86]:
non_commercial_df = non_commercial_df.drop(['authors', 'publish_time'], axis = 1)

In [87]:
non_commercial_df

Unnamed: 0,is_covid19,journal,paper_id,publish_year,section_num,text,title,publish_month
4,False,mBio,634128ea7d7736750e1c3cd0a48bb37843d06dac,2013,0,The majority of emerging zoonoses originate in...,A Strategy To Estimate Unknown Viral Diversity...,9
7,False,mBio,fef0bb9eaac69559d0ff2f92ff83e0affd4435f0,2014,0,The broad range and diversity of interferon-st...,Pathogenic Influenza Viruses and Coronaviruses...,5
14,False,Drug Des Devel Ther,246f59ddffedd4a166b9317dee38bdf6077b2f3f,2014,0,The thiazolidinedione class peroxisome prolife...,Identification of novel multitargeted PPARα/γ/...,11
17,False,J Vet Med Sci,9cda860c97d430aea207a063d13e8612e023320c,2019,0,Bovine viral diarrhea (BVD) is a chronic disea...,Assessment of the cost effectiveness of compul...,3
18,False,J Korean Med Sci,b66704a03a688c4065abff41c4977c4c9939c230,2016,0,Despite the rapidly increasing prevalence of n...,"Lobeglitazone, a Novel Thiazolidinedione, Impr...",11
...,...,...,...,...,...,...,...,...
2452,False,Cell Cycle,d00f4dabc58eca254f3b2ca4efafda00d671c3da,2015,1,"As noted above, a motif based search removes t...",Cell cycle control (and more) by programmed −1...,1
2454,False,Sci China Life Sci,4c3f357d50bfeede5dd0eec81a4b5b7f116c628a,2011,1,"We quantified a total of 1109 plasma proteins,...",Plasma biomarker screening for liver fibrosis ...,5
2456,False,J Cell Biol,2ffacfd58f57a95344119c15d27560cdeaea2285,2014,1,Metaphase-arrested egg extracts from X. laevis...,The calcium-dependent ribonuclease XendoU prom...,10
2457,False,Int J Chron Obstruct Pulmon Dis,4c5c841e4ad3fbf9b31d6c0c282dfec035f716bb,2019,1,"With respect to the exclusion criteria, only 7...",The association between blood eosinophil perce...,5


In [88]:
non_commercial_df.to_csv('../dataset/processed-files/non_commercial.csv', index=False)
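
As a quick sanity check on the export, the sketch below writes a small frame with `index=False` and reads it back. It uses an in-memory buffer and a made-up `toy` frame instead of the real file, so nothing here touches `non_commercial.csv`.

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, like the one left behind by dropna().
toy = pd.DataFrame({
    'paper_id': ['a1', 'b2'],
    'section_num': [0, 1],
    'text': ['Abstract of A', 'Results of B'],
}, index=[4, 7])

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False: the labels 4 and 7 are not written
buffer.seek(0)

# Reading back yields a fresh 0..n-1 index and only the data columns.
restored = pd.read_csv(buffer)
```

Because the index is not written, the duplicated row labels produced by concatenating the two frames do not appear in the exported file.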