This notebook is an exploration of the structure of the CORD-19 data files.

# Imports

In [1]:
import csv
import pandas as pd
from pprint import pprint as pp

# metadata.csv

In [2]:
with open('metadata.csv', 'r', newline='') as metadataFile:
    csvReader = csv.reader(metadataFile, delimiter=',', quotechar='"')
    Header = next(csvReader)  # the first row holds the column names

pp(Header)

['cord_uid',
 'sha',
 'source_x',
 'title',
 'doi',
 'pmcid',
 'pubmed_id',
 'license',
 'abstract',
 'publish_time',
 'authors',
 'journal',
 'mag_id',
 'who_covidence_id',
 'arxiv_id',
 'pdf_json_files',
 'pmc_json_files',
 'url']


## Columns

Starting from the readme provided with the data (metadata.readme), we can outline what each column should cover and, importantly for loading it into a database, whether it is single valued or potentially multi-valued.

'cord_uid' - This is a persistent identifier for the article within the CORD-19 ecosystem. __single valued__, __unique__.

'sha' - The hash of the article's PDF. There may be multiple PDFs associated to one article (supporting materials, ?preprints?). __multi-valued__.

'source_x' - The source of the article data. Sources covered include CZI (Chan-Zuckerberg Initiative), PMC (PubMed Central), bioRxiv and medRxiv. Confirm sources from data. Should be __single valued__.

'title' - The article's title. __single valued__.

'doi' - The article's doi. __single valued__.

'pmcid' - The article's PubMed Central id, if it has one. __single valued__, __unique__.

'pubmed_id' - The article's PubMed id. For distinction between this and the previous, see [here](https://publicaccess.nih.gov/include-pmcid-citations.htm#Difference). __single valued__, __unique__.

'license' - The license under which the data is being shared. Should be __single valued__.

'abstract' - The article's abstract. __single valued__.

'publish_time' - The date the article was published. __single valued__.

'authors' - The article's authors. Coverage questionable. __multi-valued__.

'journal' - The journal in which the article was published. If it was published in a journal. What happens with preprints? __single valued__.

'mag_id' - The article's Microsoft Academic Graph id. __single valued__, __unique__.

'who_covidence_id' - The article's WHO #Covidence id. Should only be populated for CZI source articles. __single valued__, __unique__.

'arxiv_id' - The article's arXiv id. __single valued__, __unique__.

'pdf_json_files' - Relative path of the json(s) parsed from pdf file(s). __multi-valued__, __unique__.

'pmc_json_files' - Relative path of the json(s) parsed from pmc file(s). __multi-valued__, __unique__.

'url' - The article's URL. Should be __single valued__, __unique__.
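
For the multi-valued columns, the sample rows further down suggest that several values are packed into one cell separated by `; ` (sha and authors both look like this). A minimal sketch of unpacking such a cell, assuming that separator holds across the whole file; the helper name `split_multi` and the toy frame are made up for illustration:

```python
import numpy as np
import pandas as pd

def split_multi(cell, sep=';'):
    """Split a packed cell into a list of stripped values; missing cells give []."""
    if pd.isna(cell):
        return []
    return [part.strip() for part in str(cell).split(sep) if part.strip()]

# Toy frame standing in for metadata.csv (all values are invented).
toy = pd.DataFrame({
    'cord_uid': ['aaaa1111', 'bbbb2222', 'cccc3333'],
    'sha': ['abc; def', 'ghi', np.nan],
})

toy['sha_list'] = toy['sha'].apply(split_multi)

# One row per (cord_uid, sha) pair, the shape a link table in a database would need.
# Rows with an empty list explode to NaN and are dropped.
sha_links = (
    toy[['cord_uid', 'sha_list']]
    .explode('sha_list')
    .dropna(subset=['sha_list'])
)
print(sha_links)
```

The single valued columns can stay on the main table as they are; only the multi-valued ones need this treatment.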

To explore the columns, it is probably easier to work with the csv via pandas.

In [3]:
metadataFrame = pd.read_csv('metadata.csv', low_memory=False)
metadataFrame

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url
0,zjufx4fo,b2897e1277f56641193a6db73825f707eed3e4c9,PMC,Sequence requirements for RNA strand transfer ...,10.1093/emboj/20.24.7220,PMC125340,11742998.0,green-oa,Nidovirus subgenomic mRNAs contain a leader se...,2001-12-17,"Pasternak, Alexander O.; van den Born, Erwin; ...",The EMBO Journal,,,,document_parses/pdf_json/b2897e1277f56641193a6...,document_parses/pmc_json/PMC125340.xml.json,http://europepmc.org/articles/pmc125340?pdf=re...
1,ymceytj3,e3d0d482ebd9a8ba81c254cc433f314142e72174,PMC,"Crystal structure of murine sCEACAM1a[1,4]: a ...",10.1093/emboj/21.9.2076,PMC125375,11980704.0,green-oa,CEACAM1 is a member of the carcinoembryonic an...,2002-05-01,"Tan, Kemin; Zelus, Bruce D.; Meijers, Rob; Liu...",The EMBO Journal,,,,document_parses/pdf_json/e3d0d482ebd9a8ba81c25...,document_parses/pmc_json/PMC125375.xml.json,http://europepmc.org/articles/pmc125375?pdf=re...
2,wzj2glte,00b1d99e70f779eb4ede50059db469c65e8c1469,PMC,Synthesis of a novel hepatitis C virus protein...,10.1093/emboj/20.14.3840,PMC125543,11447125.0,no-cc,Hepatitis C virus (HCV) is an important human ...,2001-07-16,"Xu, Zhenming; Choi, Jinah; Yen, T.S.Benedict; ...",EMBO J,,,,document_parses/pdf_json/00b1d99e70f779eb4ede5...,document_parses/pmc_json/PMC125543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
3,2sfqsfm1,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,PMC,Structure of coronavirus main proteinase revea...,10.1093/emboj/cdf327,PMC126080,12093723.0,green-oa,The key enzyme in coronavirus polyprotein proc...,2002-07-01,"Anand, Kanchan; Palm, Gottfried J.; Mesters, J...",The EMBO Journal,,,,document_parses/pdf_json/cf584e00f637cbd8f1bb3...,document_parses/pmc_json/PMC126080.xml.json,http://europepmc.org/articles/pmc126080?pdf=re...
4,i0zym7iq,dde02f11923815e6a16a31dd6298c46b109c5dfa,PMC,Discontinuous and non-discontinuous subgenomic...,10.1093/emboj/cdf635,PMC136939,12456663.0,green-oa,"Arteri-, corona-, toro- and roniviruses are ev...",2002-12-01,"van Vliet, A.L.W.; Smits, S.L.; Rottier, P.J.M...",The EMBO Journal,,,,document_parses/pdf_json/dde02f11923815e6a16a3...,document_parses/pmc_json/PMC136939.xml.json,http://europepmc.org/articles/pmc136939?pdf=re...
5,xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263.0,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,,document_parses/pdf_json/1e1286db212100993d03c...,document_parses/pmc_json/PMC140314.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
6,gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001.0,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,,document_parses/pdf_json/8ae137c8da1607b3a8e4c...,document_parses/pmc_json/PMC156578.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
7,0m32ecnu,23bc55d6f63fab18b02004483888db2b6a0bfa48,PMC,Prokaryotic-style frameshifting in a plant tra...,10.1093/emboj/cdg365,PMC169038,12881428.0,green-oa,Ribosomal frameshifting signals are found in m...,2003-08-01,"Napthine, Sawsan; Vidakovic, Marijana; Girnary...",The EMBO Journal,,,,document_parses/pdf_json/23bc55d6f63fab18b0200...,document_parses/pmc_json/PMC169038.xml.json,http://europepmc.org/articles/pmc169038?pdf=re...
8,le0ogx1s,,PMC,A new recruit for the army of the men of death,10.1186/gb-2003-4-7-113,PMC193621,12844350.0,no-cc,"The army of the men of death, in John Bunyan's...",2003-06-27,"Petsko, Gregory A",Genome Biol,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
9,3oxzzxnd,7ff45096210eeb392d51f646f5c7fe011079aaf3,PMC,"SseG, a virulence protein that targets Salmone...",10.1093/emboj/cdg517,PMC204495,14517239.0,green-oa,Intracellular replication of the bacterial pat...,2003-10-01,"Salcedo, Suzana P.; Holden, David W.",The EMBO Journal,,,,document_parses/pdf_json/7ff45096210eeb392d51f...,document_parses/pmc_json/PMC204495.xml.json,http://europepmc.org/articles/pmc204495?pdf=re...
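
In the preview, pubmed_id shows up as a float (11742998.0) because the column contains missing values. If the ids are only ever used as labels, reading them as strings avoids that. A sketch using a small in-memory CSV in place of metadata.csv; which columns to treat this way is my assumption:

```python
import io
import pandas as pd

# Stand-in for metadata.csv with just enough columns to show the effect.
sample = io.StringIO(
    "cord_uid,pubmed_id,mag_id\n"
    "zjufx4fo,11742998,\n"
    "le0ogx1s,,\n"
)

id_columns = ['pubmed_id', 'mag_id']

# Forcing str keeps the ids exactly as written in the file; empty cells stay missing.
frame = pd.read_csv(sample, dtype={column: str for column in id_columns})
print(frame)
```

The rest of the notebook keeps the default dtypes, so the outputs below still show the float form.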


First start by making sure the columns that are supposed to be single valued and unique are actually so.

In [4]:
SingleValued_and_Unique = ['cord_uid', 'pmcid', 'pubmed_id', 'mag_id', 'who_covidence_id', 'arxiv_id', 'url']

In [5]:
for Column in SingleValued_and_Unique:
    print(Column, metadataFrame[Column].dropna().is_unique)

cord_uid False
pmcid True
pubmed_id False
mag_id False
who_covidence_id True
arxiv_id True
url False


All except pmcid, who_covidence_id and arxiv_id appear to be non-unique. Isn't that just wonderful...
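
A True/False flag hides how bad each column actually is. A small sketch that also counts the repeated values and the rows they cover; the function name `uniqueness_report` and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd

def uniqueness_report(frame, columns):
    """For each column, count non-null values that repeat and the rows they cover."""
    rows = []
    for column in columns:
        values = frame[column].dropna()
        repeated = values[values.duplicated(keep=False)]
        rows.append({
            'column': column,
            'is_unique': values.is_unique,
            'repeated_values': repeated.nunique(),
            'rows_affected': len(repeated),
        })
    return pd.DataFrame(rows).set_index('column')

# Toy frame (invented values): one repeated cord_uid, pmcid unique apart from gaps.
toy = pd.DataFrame({
    'cord_uid': ['a', 'a', 'b', 'c'],
    'pmcid': ['PMC1', np.nan, 'PMC2', np.nan],
})

report = uniqueness_report(toy, ['cord_uid', 'pmcid'])
print(report)
```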

Start by checking the cord_uid.

In [6]:
ValueCounts = metadataFrame['cord_uid'].value_counts()
ValueCounts[ValueCounts>1]

fzq71ghi    8
jsk1oztb    3
0klupmep    2
hox2xwjg    2
c4u0gxp5    2
j7swau26    2
adygntbe    2
5ei7iwu0    2
79mzwv1c    2
4fbr8fx8    2
o4r34pff    2
vp5358rr    2
qhftb6d7    2
m6q8kbjg    2
vqbreyna    2
6z5f2gz3    2
21htepa1    2
huablvd1    2
8fwa2c24    2
laq5ze8o    2
3ury4hnv    2
7y8fd521    2
sv9mdgek    2
xjpev4jw    2
e9pwguwm    2
mmls866r    2
0z5wacxs    2
21qu87oh    2
eich19nx    2
j3b964oz    2
6hdoap81    2
2maferew    2
4hlvrfeh    2
5kzx5hgg    2
brz1fn2h    2
30duqivi    2
940au47y    2
Name: cord_uid, dtype: int64

Ugh. As this is supposed to be the persistent identifier, it looks like some duplication is happening.

The easiest way to start is to check whether these are true duplicates and, if they are, just eliminate the doubled records.

In [7]:
metadataFrame[metadataFrame['cord_uid'].duplicated(keep=False)].sort_values(by=['cord_uid'])

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url
30977,0klupmep,,Elsevier,Infectious disease surveillance update,10.1016/s1473-3099(19)30075-1,,30833065.0,els-covid,,2019-03-31,"Zwizwai, Ruth",The Lancet Infectious Diseases,,,,,,https://doi.org/10.1016/s1473-3099(19)30075-1
16454,0klupmep,,PMC,Infectious disease surveillance update,10.1016/s1473-3099(19)30075-1,PMC7129894,30833064.0,no-cc,,2019-02-27,"Zwizwai, Ruth",Lancet Infect Dis,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30885,0z5wacxs,7e787fd2ae5b544add6281d3d40ad322de26aa17,Elsevier,Transportation capacity for patients with high...,10.1111/1469-0691.12290,,24750421.0,els-covid,Abstract Highly infectious diseases (HIDs) are...,2019-04-30,"Schilling, S.; Maltezou, H.C.; Fusco, F.M.; De...",Clinical Microbiology and Infection,,,,document_parses/pdf_json/7e787fd2ae5b544add628...,,https://doi.org/10.1111/1469-0691.12290
16452,0z5wacxs,7e787fd2ae5b544add6281d3d40ad322de26aa17,PMC,Transportation capacity for patients with high...,10.1111/1469-0691.12290,PMC7128608,25636943.0,no-cc,Highly infectious diseases (HIDs) are defined ...,2015-06-22,"Schilling, S.; Maltezou, H.C.; Fusco, F.M.; De...",Clin Microbiol Infect,,,,document_parses/pdf_json/7e787fd2ae5b544add628...,document_parses/pmc_json/PMC7128608.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30416,21htepa1,a25e212b03cc65c44dcc336775b101934e30f041,Elsevier,Panspermia—true or false?,10.1016/s0140-6736(03)14040-8,,12907025.0,els-covid,,2003-08-02,"de Leon, Samuel Ponce; Lazcano, Antonio",The Lancet,,,,document_parses/pdf_json/a25e212b03cc65c44dcc3...,,https://doi.org/10.1016/s0140-6736(03)14040-8
16473,21htepa1,,PMC,Panspermia—true or false?,10.1016/s0140-6736(03)14040-8,PMC7135165,12907026.0,no-cc,,2003-08-02,"de Leon, Samuel Ponce; Lazcano, Antonio",Lancet,,,,,document_parses/pmc_json/PMC7135165.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30156,21qu87oh,68249d769e1926678af8d52d2484e36787e13525,Elsevier,Bystander CD8 T-Cell-Mediated Demyelination is...,10.1016/s0002-9440(10)63126-4,,14742242.0,els-covid,Mice infected with the coronavirus mouse hepat...,2004-02-29,"Dandekar, Ajai A.; Anghelina, Daniela; Perlman...",The American Journal of Pathology,,,,document_parses/pdf_json/68249d769e1926678af8d...,,https://doi.org/10.1016/s0002-9440(10)63126-4
42065,21qu87oh,,PMC,Bystander CD8 T-Cell-Mediated Demyelination is...,,PMC1602263,14742242.0,unk,Mice infected with the coronavirus mouse hepat...,2004-02-12,"Dandekar, Ajai A.; Anghelina, Daniela; Perlman...",,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
16467,2maferew,,PMC,Virologie : l’apport de la biologie moléculair...,10.1016/s0929-693x(07)78706-7,PMC7133300,17182229.0,no-cc,The conventionnal tools used for virological d...,2008-02-15,"Brouard, J.; Vabret, A.; Perrot, S.; Nimal, D....",Arch Pediatr,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30766,2maferew,6744bc52b1b29d2ab28cbaeef8942eebece5175b,Elsevier,Virologie : l’apport de la biologie moléculair...,10.1016/s0929-693x(07)78706-7,,18280911.0,els-covid,Résumé Les outils traditionnels du diagnostic ...,2007-12-31,"Brouard, J.; Vabret, A.; Perrot, S.; Nimal, D....",Archives de Pédiatrie,,,,document_parses/pdf_json/6744bc52b1b29d2ab28cb...,,https://doi.org/10.1016/s0929-693x(07)78706-7


Given how few there are, it would probably be possible to merge them manually or with a rule-based approach.

I tried this in a previous version, but after messing with it for 15 minutes realized it is likely better to just drop these records and keep moving forward. The 37 duplicated identifiers cover only 81 of the 63571 rows, which won't make a difference in the analysis that will be carried out later.
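
For reference, one rule-based merge could look like the sketch below: for each cord_uid, keep the first non-null value found in every column. This is only my guess at a reasonable rule, not what this notebook does, and it silently picks one side whenever the two records disagree (as they do on pubmed_id and publish_time in the rows above). The toy data are invented:

```python
import numpy as np
import pandas as pd

# Two partial records for x1 that complement each other, plus one clean record.
toy = pd.DataFrame({
    'cord_uid': ['x1', 'x1', 'y2'],
    'source_x': ['Elsevier', 'PMC', 'PMC'],
    'pmcid': [np.nan, 'PMC111', 'PMC222'],
    'sha': ['abc', np.nan, 'def'],
})

# GroupBy.first() takes the first non-null entry per column within each group,
# so gaps in one record are filled from the other.
merged = toy.groupby('cord_uid', sort=False).first().reset_index()
print(merged)
```

Here x1 ends up with source_x from the first row and pmcid from the second, which is exactly the kind of quiet choice that made dropping feel safer.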

In [8]:
print('Rows before: {}'.format(metadataFrame.shape[0]))
metadataFrame = metadataFrame.drop_duplicates('cord_uid', keep=False) # drop all duplicates. keep='first' or keep='last' would keep the first/last
print('Rows after: {}'.format(metadataFrame.shape[0]))

Rows before: 63571
Rows after: 63490


In [9]:
metadataFrame['cord_uid'].dropna().is_unique

True

So that worked.

In [10]:
for Column in SingleValued_and_Unique:
    print(Column, metadataFrame[Column].dropna().is_unique)

cord_uid True
pmcid True
pubmed_id False
mag_id False
who_covidence_id True
arxiv_id True
url False


pmcid was already unique before this step, so cord_uid is the only column this cleared up. Would have been nice if it cleared up a few of the others too though. Let's do a quick check of what discrepancies are left.

In [11]:
for Column in SingleValued_and_Unique:
    ValueCounts = metadataFrame[Column].value_counts()
    print(ValueCounts[ValueCounts>1])

Series([], Name: cord_uid, dtype: int64)
Series([], Name: pmcid, dtype: int64)
32117569.0    2
25957460.0    2
31903811.0    2
12932399.0    2
27381971.0    2
32305024.0    2
16194517.0    2
26766408.0    2
14550714.0    2
Name: pubmed_id, dtype: int64
3.006646e+09    14
3.004791e+09    11
3.006304e+09    10
3.005943e+09     9
2.604381e+09     8
3.006643e+09     8
3.005657e+09     7
3.001119e+09     7
3.004897e+09     6
3.005080e+09     6
3.005847e+09     6
3.006356e+09     6
3.006338e+09     6
3.005478e+09     5
3.005539e+09     5
3.005680e+09     5
3.004824e+09     5
3.005689e+09     4
3.006008e+09     4
3.002539e+09     4
3.004398e+09     4
3.004450e+09     4
3.003886e+09     4
3.006114e+09     3
3.006448e+09     3
2.002765e+09     3
3.005490e+09     3
3.005929e+09     3
3.005656e+09     3
3.004511e+09     3
                ..
3.006078e+09     2
3.006111e+09     2
3.006116e+09     2
3.006128e+09     2
3.006178e+09     2
3.005811e+09     2
3.006211e+09     2
3.005803e+09     2
3.0057

pubmed_id has only 9 duplicated values. I am inclined to just keep these as is, but let's have a look.

In [12]:
metadataFrame[metadataFrame['pubmed_id'].duplicated(keep=False) & metadataFrame['pubmed_id'].notnull()].sort_values(by=['pubmed_id'])

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url
47312,xztmxztr,,Elsevier; PMC,Questions about comparative genomics of SARS c...,10.1016/s0140-6736(03)14130-x,PMC7134733,12932399.0,no-cc,,2003-08-16,"Wood, Lowell",Lancet,,,,,document_parses/pmc_json/PMC7134733.xml.json,
51437,4cs3h4xv,,Elsevier; PMC,Questions about comparative genomics of SARS c...,10.1016/s0140-6736(03)14131-1,PMC7135576,12932399.0,els-covid,,2003-08-16,"Liu, Edison",The Lancet,,,,,,
54607,8kw4y1f2,,Elsevier; PMC,Treatment of SARS with human interferons,10.1016/s0140-6736(03)14482-0,PMC7134736,14550714.0,no-cc,,2003-10-04,"Antonelli, Guido; Scagnolari, Carolina; Vicenz...",Lancet,,,,,document_parses/pmc_json/PMC7134736.xml.json,
57434,jy3t5z8g,,Elsevier; PMC,Treatment of SARS with human interferons,10.1016/s0140-6736(03)14483-2,PMC7134646,14550714.0,no-cc,,2003-10-04,"Cinatl, J; Chandra, P; Rabenau, H; Doerr, HW",Lancet,,,,,document_parses/pmc_json/PMC7134646.xml.json,
53629,sjnb5q3z,a8d38ca8527d7796d5d31a6e1f9317cb724aed26,Elsevier; PMC,Neumomediastino espontáneo: estudio descriptiv...,10.1157/13078656,PMC7131659,16194517.0,els-covid,El neumomediastino espontáneo se define como l...,2005-09-30,"Campillo-Soto, A.; Coll-Salinas, A.; Soria-Ale...",Archivos de Bronconeumología,,,,document_parses/pdf_json/a8d38ca8527d7796d5d31...,,
61122,jl4tnzg6,95f6fbabdb2b2fb7d014935e18739101752dc528,Elsevier; PMC,Spontaneous Pneumomediastinum: Descriptive Stu...,10.1016/s1579-2129(06)60274-7,PMC7129626,16194517.0,els-covid,Spontaneous pneumomediastinum is defined as a ...,2005-09-30,"Campillo-Soto, A.; Coll-Salinas, A.; Soria-Ale...",Archivos de Bronconeumología ((English Edition)),,,,document_parses/pdf_json/95f6fbabdb2b2fb7d0149...,,
61028,jdjdeeh1,2123a7bb916a6200697beff6b7fe1f57f1680406,Elsevier; PMC,Review of Non-bacterial Infections in Respirat...,10.1016/j.arbr.2015.09.015,PMC7105177,25957460.0,els-covid,Abstract Although bacteria are the main pathog...,2015-11-30,"Galván, José María; Rajas, Olga; Aspa, Javier",Archivos de Bronconeumología (English Edition),,,,document_parses/pdf_json/2123a7bb916a6200697be...,document_parses/pmc_json/PMC7105177.xml.json,
52897,q0aasznp,28f309f78ae68a7ad40bf2fb1b4cebece70c36e1,Elsevier; PMC,Revisión sobre las infecciones no bacterianas ...,10.1016/j.arbres.2015.02.015,PMC7130696,25957460.0,els-covid,Resumen Aunque las bacterias son los principal...,2015-11-30,"Galván, José María; Rajas, Olga; Aspa, Javier",Archivos de Bronconeumología,,,,document_parses/pdf_json/28f309f78ae68a7ad40bf...,document_parses/pmc_json/PMC7130696.xml.json,
47800,1nl7q6cy,a5399231d85b304d316778dbe87a845d81db311a,Elsevier; PMC,Pediatric Asthma and Viral Infection,10.1016/j.arbr.2016.03.010,PMC7105201,26766408.0,els-covid,"Abstract Respiratory viral infections, particu...",2016-05-31,"Luz Garcia-Garcia, M.; Calvo Rey, Cristina; de...",Archivos de Bronconeumología (English Edition),,,,document_parses/pdf_json/a5399231d85b304d31677...,document_parses/pmc_json/PMC7105201.xml.json,
59846,npy4cdk9,fd5cb975c746ac1f4b98d9f4b3b03eaabb997dc7,Elsevier; PMC,Asma y virus en el niño,10.1016/j.arbres.2015.11.008,PMC7131251,26766408.0,els-covid,Resumen Las infecciones por virus respiratorio...,2016-05-31,"Garcia-Garcia, M. Luz; Calvo Rey, Cristina; de...",Archivos de Bronconeumología,,,,document_parses/pdf_json/fd5cb975c746ac1f4b98d...,document_parses/pmc_json/PMC7131251.xml.json,


It looks like some of these are merged sources (Elsevier; PMC). I don't like how there are discrepancies on the authors in some cases though.

For now I am making the decision to just drop them.

In [13]:
# This requires a bit more care than when we dealt with cord_uid above, because here NaN is a valid value.
# drop_duplicates counts all NaN as equal, so it would also drop every record without a pubmed_id.
print('Rows before: {}'.format(metadataFrame.shape[0]))
metadataFrame = metadataFrame[~(metadataFrame['pubmed_id'].duplicated(keep=False) & metadataFrame['pubmed_id'].notnull())]
print('Rows after: {}'.format(metadataFrame.shape[0]))

Rows before: 63490
Rows after: 63472
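
The same null-aware filter is needed again for url, so it could be wrapped in a helper. A sketch, with `drop_repeated_non_null` as a hypothetical name and a toy frame in place of the real data; the comparison with drop_duplicates shows why the plain call is not enough here:

```python
import numpy as np
import pandas as pd

def drop_repeated_non_null(frame, column):
    """Drop every row whose non-null value in `column` occurs more than once."""
    repeated = frame[column].duplicated(keep=False) & frame[column].notnull()
    return frame[~repeated]

# Toy frame (invented values): one repeated id, two missing ids, one clean id.
toy = pd.DataFrame({
    'cord_uid': ['a', 'b', 'c', 'd', 'e'],
    'pubmed_id': [1.0, 1.0, np.nan, np.nan, 2.0],
})

naive = toy.drop_duplicates('pubmed_id', keep=False)  # also loses the two NaN rows
careful = drop_repeated_non_null(toy, 'pubmed_id')    # keeps them

print(naive['cord_uid'].tolist())
print(careful['cord_uid'].tolist())
```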


Turning to mag_id.

In [14]:
metadataFrame[metadataFrame['mag_id'].duplicated(keep=False) & metadataFrame['mag_id'].notnull()].sort_values(by=['mag_id'])

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url
50637,5mo330s2,,Elsevier; PMC; WHO,Managing neonates with respiratory failure due...,10.1016/s2352-4642(20)30073-0,PMC7128679,32151320.0,no-cc,In their Comment in The Lancet Child & Adolesc...,2020-03-06,"De Luca, Daniele",Lancet Child Adolesc Health,2.002765e+09,#5421,,,document_parses/pmc_json/PMC7128679.xml.json,
31102,qzbqxjdi,c630ebcdf30652f0422c3ec12a00b50241dc9bd9,CZI,Angiotensin-converting enzyme 2 (ACE2) as a SA...,10.1007/s00134-020-05985-9,,32125455.0,cc-by-nc,,2020,"Zhang, Haibo; Penninger, Josef M.; Li, Yimin; ...",Intensive Care Med,2.002765e+09,#3252,,document_parses/pdf_json/c630ebcdf30652f0422c3...,,https://doi.org/10.1007/s00134-020-05985-9
31118,0gier0lu,,WHO,Angiotensin receptor blockers as tentative SAR...,10.1002/ddr.21656,,32129518.0,bronze-oa,At the time of writing this commentary (Februa...,2020-03-04,"Gurwitz, David",Drug Development Research,2.002765e+09,#4134,,,,https://doi.org/10.1002/ddr.21656
31202,497z31h6,,WHO,Chest CT Findings in Coronavirus Disease-19 (C...,10.1148/radiol.2020200463,,32077789.0,bronze-oa,"In this retrospective study, chest CTs of 121 ...",2020-02-20,"Bernheim, Adam; Mei, Xueyan; Huang, Mingqian; ...",Radiology,2.055652e+09,#1690,,,,https://doi.org/10.1148/radiol.2020200463
31237,97x8q2j9,,WHO,Relation Between Chest CT Findings and Clinica...,10.2214/ajr.20.22976,,32125873.0,unk,OBJECTIVE. The increasing number of cases of c...,2020-03-03,"Zhao, Wei; Zhong, Zheng; Xie, Xingzhi; Yu, Qiz...",American Journal of Roentgenology,2.055652e+09,#3238,,,,https://doi.org/10.2214/ajr.20.22976
41831,0tetqt33,,WHO,Pharma mobilizes to combat the coronavirus,10.1021/cen-09805-buscon4,,,unk,The World Health Organization has declared the...,2020-02-03,,C&EN Global Enterprise,2.095773e+09,#807,,,,https://doi.org/10.1021/cen-09805-buscon4
41838,w11wop27,,WHO,Diagnosing COVID-19,10.1021/cen-09808-scicon8,,,unk,hina continues to fight the outbreak of a nove...,2020-02-24,,C&EN Global Enterprise,2.095773e+09,#1882,,,,https://doi.org/10.1021/cen-09808-scicon8
31246,wk4zxmz9,,WHO,Lack of Vertical Transmission of Severe Acute ...,10.3201/eid2606.200287,,32134381.0,gold-oa,A woman with 2019 novel coronavirus disease in...,2020-06-01,"Li, Y.; Zhao, R.; Zheng, S.; Chen, X.; Wang, J...",Emerging Infectious Diseases,2.116733e+09,#4834,,,,https://doi.org/10.3201/eid2606.200287
31245,sonxopa0,,WHO,Community Transmission of Severe Acute Respira...,10.3201/eid2606.200239,,32125269.0,cc-by,"Since early January 2020, after the outbreak o...",2020-06-01,"Liu, Jiaye; Liao, Xuejiao; Qian, Shen; Yuan, J...",Emerging Infectious Diseases,2.116733e+09,#3346,,,,https://doi.org/10.3201/eid2606.200239
31162,lu0tni6a,,WHO,Isolation of a novel coronavirus from a man wi...,10.1056/nejmoa1211721,,23075143.0,bronze-oa,A previously unknown coronavirus was isolated ...,2012-11-08,"Zaki, Ali M.; van Boheemen, Sander; Bestebroer...",New England Journal of Medicine,2.166868e+09,#1347,,,,https://doi.org/10.1056/nejmoa1211721


mag_id is just wrong a lot of the time. I will leave these records as is. I don't plan on using mag_id in any case, so these errors shouldn't matter.

Now looking at url.

In [15]:
metadataFrame[metadataFrame['url'].duplicated(keep=False) & metadataFrame['url'].notnull()].sort_values(by=['url'])

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url
22899,dprfxamv,a1ffb2fbd3396e59c4cd1f85e6ac3e127b50862e,PMC,Effectiveness of Border Screening for Detectin...,10.2105/ajph.2012.300761r,PMC4561613,26313050.0,green-oa,Objectives. We measured symptom and influenza ...,2015-10-01,"Priest, Patricia C.; Jennings, Lance C.; Dunca...",American Journal of Public Health,,,,document_parses/pdf_json/a1ffb2fbd3396e59c4cd1...,,http://ajph.aphapublications.org/doi/pdf/10.21...
23252,bzb7cgrq,,PMC,Effectiveness of Border Screening for Detectin...,10.2105/ajph.2012.300761,PMC4007855,23237174.0,green-oa,Objectives. We measured symptom and influenza ...,2013-08-01,"Priest, Patricia C.; Jennings, Lance C.; Dunca...",American Journal of Public Health,,,,,,http://ajph.aphapublications.org/doi/pdf/10.21...
23161,pf3xdouw,,PMC,SHEA/APIC Guideline: Infection Prevention and ...,10.1086/592416,PMC3319407,18767983.0,green-oa,,2008-09-01,"Smith, Philip W.; Bennett, Gail; Bradley, Suza...",Infection Control & Hospital Epidemiology,,,,,,http://digitalcommons.unl.edu/cgi/viewcontent....
52285,8p9jw3gl,1aa2cc6f509964807d510b48a63180e09ee083de; 6eae...,Elsevier; PMC,SHEA/APIC Guideline: Infection Prevention and ...,10.1016/j.ajic.2008.06.001,PMC3375028,18786461.0,green-oa,,2008-09-01,"Smith, Philip W.; Bennett, Gail; Bradley, Suza...",American Journal of Infection Control,,,,document_parses/pdf_json/1aa2cc6f509964807d510...,document_parses/pmc_json/PMC3375028.xml.json,http://digitalcommons.unl.edu/cgi/viewcontent....
21418,2d6nptjf,,PMC,Infection Prevention in the Emergency Department,10.1016/j.annemergmed.2014.02.024,PMC4143473,24721718.0,green-oa,Infection prevention remains a major challenge...,2014-09-01,"Liang, Stephen Y.; Theodoro, Daniel L.; Schuur...",Annals of Emergency Medicine,,,,,,http://europepmc.org/articles/pmc4143473?pdf=r...
57643,gttpnxvv,8ea6e8c5dc57d3f65014a7f2a835279ae8280d71; adb3...,Elsevier; PMC,Infection Prevention for The Emergency Departm...,10.1016/j.emc.2018.06.013,PMC6203442,30297010.0,green-oa,,2018-11-01,"Liang, Stephen Y.; Riethman, Madison; Fox, Jos...",Emergency Medicine Clinics of North America,,,,document_parses/pdf_json/8ea6e8c5dc57d3f65014a...,document_parses/pmc_json/PMC6203442.xml.json,http://europepmc.org/articles/pmc4143473?pdf=r...
20659,imne9d3m,,PMC,Community-Acquired Pneumonia Requiring Hospita...,10.1056/nejmoa1500245,PMC4728150,26172429.0,bronze-oa,BACKGROUND: Community-acquired pneumonia is a ...,2015-07-30,"Jain, S.; Self, W.H.; Wunderink, R.G.; Fakhran...",New England Journal of Medicine,,,,,,http://www.hcup-us.ahrq.gov/reports/statbriefs...
21946,hwnet5jv,,PMC,Community-Acquired Pneumonia Requiring Hospita...,10.1056/nejmoa1405870,PMC4697461,25714161.0,bronze-oa,BACKGROUND: U.S. incidence estimates of pediat...,2015-02-26,"Jain, Seema; Williams, Derek J.; Arnold, Sandr...",New England Journal of Medicine,,,,,,http://www.hcup-us.ahrq.gov/reports/statbriefs...
29755,0gamepww,,PMC,Sars-CoV-2 (COVID-19) Outbreak and Breast Canc...,10.5152/ejbh.2020.300320,PMC7138359,,bronze-oa,,2020-04-06,"Çakmak, Güldeniz Karadeniz; Özmen, Vahit",European Journal of Breast Health,,,,,,https://www.eurjbreasthealth.com/Content/files...
29764,vknvwn57,,PMC,Lung Changes on Chest CT During 2019 Novel Cor...,10.5152/ejbh.2020.010420,PMC7138361,,bronze-oa,,2020-04-06,"Çinkooğlu, Akın; Bayraktaroğlu, Selen; Savaş, ...",European Journal of Breast Health,,,,,,https://www.eurjbreasthealth.com/Content/files...


I'm inclined to just drop these.

In [16]:
print('Rows before: {}'.format(metadataFrame.shape[0]))
metadataFrame = metadataFrame[~(metadataFrame['url'].duplicated(keep=False) & metadataFrame['url'].notnull())]
print('Rows after: {}'.format(metadataFrame.shape[0]))

Rows before: 63472
Rows after: 63462


We have arrived at a more or less clean data set of 63462 metadata records.