This notebook is an exploration of the structure of the CORD-19 data files.

# Imports

In [9]:
import csv
import pandas as pd
from pprint import pprint as pp

# metadata.csv

In [7]:
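with open('metadata.csv', 'r', newline='') as metadataFile:
    # newline='' lets the csv module handle line breaks inside quoted fields
    csvReader = csv.reader(metadataFile, delimiter=',', quotechar='"')
    Header = next(csvReader)  # the first row holds the column names
pp(Header)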

['cord_uid',
 'sha',
 'source_x',
 'title',
 'doi',
 'pmcid',
 'pubmed_id',
 'license',
 'abstract',
 'publish_time',
 'authors',
 'journal',
 'mag_id',
 'who_covidence_id',
 'arxiv_id',
 'pdf_json_files',
 'pmc_json_files',
 'url']


## Columns

Starting from the readme provided with the data (metadata.readme), we can outline what each column should cover and, importantly for loading it into a database, whether it is single valued or potentially multi-valued.

'cord_uid' - This is a persistent identifier for the article within the CORD-19 ecosystem. __single valued__, __unique__.

'sha' - The hash of the article's PDF. There may be multiple PDFs associated with one article (supporting materials, preprints?). __multi-valued__.

'source_x' - The source of the article data. Sources covered include CZI (Chan Zuckerberg Initiative), PMC (PubMed Central), bioRxiv and medRxiv. The full list of sources still needs to be confirmed from the data. Should be __single valued__.

'title' - The article's title. __single valued__.

'doi' - The article's doi. __single valued__.

'pmcid' - The article's PubMed Central id, if it has one. __single valued__, __unique__.

'pubmed_id' - The article's PubMed id. For the distinction between this and the pmcid, see [here](https://publicaccess.nih.gov/include-pmcid-citations.htm#Difference). __single valued__, __unique__.

'license' - The license under which the data is being shared. Should be __single valued__.

'abstract' - The article's abstract. __single valued__.

'publish_time' - The date the article was published. __single valued__.

'authors' - The article's authors. Coverage questionable. __multi-valued__.

'journal' - The journal in which the article was published, if it was published in one. What happens with preprints? __single valued__.

'mag_id' - The article's Microsoft Academic Graph id. __single valued__, __unique__.

'who_covidence_id' - The article's WHO #Covidence id. Should only be populated for CZI source articles. __single valued__, __unique__.

'arxiv_id' - The article's arXiv id. __single valued__, __unique__.

'pdf_json_files' - Relative path of the json(s) parsed from pdf file(s). __multi-valued__, __unique__.

'pmc_json_files' - Relative path of the json(s) parsed from pmc file(s). __multi-valued__, __unique__.

'url' - The article's URL. Should be __single valued__, __unique__.

To explore the columns, it is probably easier to work with the csv via pandas.

In [11]:
metadataFrame = pd.read_csv('metadata.csv', low_memory=False)
metadataFrame

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url
0,zjufx4fo,b2897e1277f56641193a6db73825f707eed3e4c9,PMC,Sequence requirements for RNA strand transfer ...,10.1093/emboj/20.24.7220,PMC125340,11742998.0,green-oa,Nidovirus subgenomic mRNAs contain a leader se...,2001-12-17,"Pasternak, Alexander O.; van den Born, Erwin; ...",The EMBO Journal,,,,document_parses/pdf_json/b2897e1277f56641193a6...,document_parses/pmc_json/PMC125340.xml.json,http://europepmc.org/articles/pmc125340?pdf=re...
1,ymceytj3,e3d0d482ebd9a8ba81c254cc433f314142e72174,PMC,"Crystal structure of murine sCEACAM1a[1,4]: a ...",10.1093/emboj/21.9.2076,PMC125375,11980704.0,green-oa,CEACAM1 is a member of the carcinoembryonic an...,2002-05-01,"Tan, Kemin; Zelus, Bruce D.; Meijers, Rob; Liu...",The EMBO Journal,,,,document_parses/pdf_json/e3d0d482ebd9a8ba81c25...,document_parses/pmc_json/PMC125375.xml.json,http://europepmc.org/articles/pmc125375?pdf=re...
2,wzj2glte,00b1d99e70f779eb4ede50059db469c65e8c1469,PMC,Synthesis of a novel hepatitis C virus protein...,10.1093/emboj/20.14.3840,PMC125543,11447125.0,no-cc,Hepatitis C virus (HCV) is an important human ...,2001-07-16,"Xu, Zhenming; Choi, Jinah; Yen, T.S.Benedict; ...",EMBO J,,,,document_parses/pdf_json/00b1d99e70f779eb4ede5...,document_parses/pmc_json/PMC125543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
3,2sfqsfm1,cf584e00f637cbd8f1bb35f3f09f5ed07b71aeb0,PMC,Structure of coronavirus main proteinase revea...,10.1093/emboj/cdf327,PMC126080,12093723.0,green-oa,The key enzyme in coronavirus polyprotein proc...,2002-07-01,"Anand, Kanchan; Palm, Gottfried J.; Mesters, J...",The EMBO Journal,,,,document_parses/pdf_json/cf584e00f637cbd8f1bb3...,document_parses/pmc_json/PMC126080.xml.json,http://europepmc.org/articles/pmc126080?pdf=re...
4,i0zym7iq,dde02f11923815e6a16a31dd6298c46b109c5dfa,PMC,Discontinuous and non-discontinuous subgenomic...,10.1093/emboj/cdf635,PMC136939,12456663.0,green-oa,"Arteri-, corona-, toro- and roniviruses are ev...",2002-12-01,"van Vliet, A.L.W.; Smits, S.L.; Rottier, P.J.M...",The EMBO Journal,,,,document_parses/pdf_json/dde02f11923815e6a16a3...,document_parses/pmc_json/PMC136939.xml.json,http://europepmc.org/articles/pmc136939?pdf=re...
5,xqhn0vbp,1e1286db212100993d03cc22374b624f7caee956,PMC,Airborne rhinovirus detection and effect of ul...,10.1186/1471-2458-3-5,PMC140314,12525263.0,no-cc,"BACKGROUND: Rhinovirus, the most common cause ...",2003-01-13,"Myatt, Theodore A; Johnston, Sebastian L; Rudn...",BMC Public Health,,,,document_parses/pdf_json/1e1286db212100993d03c...,document_parses/pmc_json/PMC140314.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
6,gi6uaa83,8ae137c8da1607b3a8e4c946c07ca8bda67f88ac,PMC,Discovering human history from stomach bacteria,10.1186/gb-2003-4-5-213,PMC156578,12734001.0,no-cc,Recent analyses of human pathogens have reveal...,2003-04-28,"Disotell, Todd R",Genome Biol,,,,document_parses/pdf_json/8ae137c8da1607b3a8e4c...,document_parses/pmc_json/PMC156578.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
7,0m32ecnu,23bc55d6f63fab18b02004483888db2b6a0bfa48,PMC,Prokaryotic-style frameshifting in a plant tra...,10.1093/emboj/cdg365,PMC169038,12881428.0,green-oa,Ribosomal frameshifting signals are found in m...,2003-08-01,"Napthine, Sawsan; Vidakovic, Marijana; Girnary...",The EMBO Journal,,,,document_parses/pdf_json/23bc55d6f63fab18b0200...,document_parses/pmc_json/PMC169038.xml.json,http://europepmc.org/articles/pmc169038?pdf=re...
8,le0ogx1s,,PMC,A new recruit for the army of the men of death,10.1186/gb-2003-4-7-113,PMC193621,12844350.0,no-cc,"The army of the men of death, in John Bunyan's...",2003-06-27,"Petsko, Gregory A",Genome Biol,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
9,3oxzzxnd,7ff45096210eeb392d51f646f5c7fe011079aaf3,PMC,"SseG, a virulence protein that targets Salmone...",10.1093/emboj/cdg517,PMC204495,14517239.0,green-oa,Intracellular replication of the bacterial pat...,2003-10-01,"Salcedo, Suzana P.; Holden, David W.",The EMBO Journal,,,,document_parses/pdf_json/7ff45096210eeb392d51f...,document_parses/pmc_json/PMC204495.xml.json,http://europepmc.org/articles/pmc204495?pdf=re...
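
Before trusting the single/multi-valued labels above, it may help to count how many cells in each column actually hold more than one value. The sketch below is only illustrative: it runs on a small made-up frame rather than on metadata.csv, the helper name `count_multi_valued` is hypothetical, and it assumes that multiple values are joined with `'; '`, as the authors column in the preview suggests.

```python
import pandas as pd

# Toy rows standing in for metadata.csv; the values are invented for illustration.
sample = pd.DataFrame({
    'cord_uid': ['aaa11111', 'bbb22222', 'ccc33333'],
    'sha': ['h1; h2', 'h3', None],
    'authors': ['Doe, Jane; Roe, Richard', 'Poe, Ann', None],
    'license': ['no-cc', 'green-oa', 'unk'],
})

SEPARATOR = '; '  # assumption: multi-valued cells are joined with '; '


def count_multi_valued(frame, separator=SEPARATOR):
    """Return, per column, how many non-null cells contain the separator."""
    counts = {}
    for column in frame.columns:
        values = frame[column].dropna().astype(str)
        # regex=False treats the separator as a literal string
        counts[column] = int(values.str.contains(separator, regex=False).sum())
    return counts


multi_counts = count_multi_valued(sample)
print(multi_counts)
```

For free-text columns such as title and abstract, a `'; '` can be ordinary punctuation, so a non-zero count there is not evidence of multiple values.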


First, make sure that the columns that are supposed to be single valued and unique actually are.

In [15]:
SingleValued_and_Unique = ['cord_uid', 'pmcid', 'pubmed_id', 'mag_id', 'who_covidence_id', 'arxiv_id', 'url']

In [27]:
for Column in SingleValued_and_Unique:
    print(Column, metadataFrame[Column].dropna().is_unique)

cord_uid False
pmcid True
pubmed_id False
mag_id False
who_covidence_id True
arxiv_id True
url False


Only pmcid, who_covidence_id and arxiv_id are actually unique; cord_uid, pubmed_id, mag_id and url all contain repeated values. Isn't that just wonderful...
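
A True/False flag does not say how bad the problem is. A possible refinement, sketched here on invented data with a hypothetical helper `uniqueness_summary`, is to report for each column how many distinct values repeat and how many rows are involved.

```python
import pandas as pd

# Toy frame; the real metadataFrame is not used here.
toy = pd.DataFrame({
    'cord_uid': ['a', 'a', 'b', 'c'],
    'pmcid': ['PMC1', None, 'PMC2', None],
    'pubmed_id': [1.0, 1.0, 2.0, None],
})


def uniqueness_summary(frame, columns):
    """Summarise repeated non-null values for each of the given columns."""
    rows = []
    for column in columns:
        values = frame[column].dropna()
        # keep=False marks every occurrence of a repeated value, not just the later ones
        repeated = values[values.duplicated(keep=False)]
        rows.append({
            'column': column,
            'non_null': len(values),
            'repeated_values': repeated.nunique(),
            'rows_involved': len(repeated),
        })
    return pd.DataFrame(rows).set_index('column')


summary = uniqueness_summary(toy, ['cord_uid', 'pmcid', 'pubmed_id'])
print(summary)
```

On the real data the same call could be made with `metadataFrame` and the `SingleValued_and_Unique` list.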

Start by checking the cord_uid.

In [20]:
ValueCounts = metadataFrame['cord_uid'].value_counts()
ValueCounts[ValueCounts>1]

fzq71ghi    8
jsk1oztb    3
m6q8kbjg    2
5kzx5hgg    2
4hlvrfeh    2
adygntbe    2
c4u0gxp5    2
8fwa2c24    2
e9pwguwm    2
4fbr8fx8    2
xjpev4jw    2
eich19nx    2
3ury4hnv    2
5ei7iwu0    2
j7swau26    2
hox2xwjg    2
o4r34pff    2
vqbreyna    2
mmls866r    2
huablvd1    2
qhftb6d7    2
j3b964oz    2
940au47y    2
vp5358rr    2
21htepa1    2
30duqivi    2
6hdoap81    2
0klupmep    2
sv9mdgek    2
6z5f2gz3    2
7y8fd521    2
2maferew    2
brz1fn2h    2
21qu87oh    2
laq5ze8o    2
0z5wacxs    2
79mzwv1c    2
Name: cord_uid, dtype: int64

Ugh. As this is supposed to be the persistent identifier, it looks like there are some duplicate records.

The easiest way to start is to check whether these are true duplicates and, if they are, simply eliminate the repeated records.

In [32]:
metadataFrame[metadataFrame['cord_uid'].duplicated(keep=False)].sort_values(by=['cord_uid'])

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url
30977,0klupmep,,Elsevier,Infectious disease surveillance update,10.1016/s1473-3099(19)30075-1,,30833065.0,els-covid,,2019-03-31,"Zwizwai, Ruth",The Lancet Infectious Diseases,,,,,,https://doi.org/10.1016/s1473-3099(19)30075-1
16454,0klupmep,,PMC,Infectious disease surveillance update,10.1016/s1473-3099(19)30075-1,PMC7129894,30833064.0,no-cc,,2019-02-27,"Zwizwai, Ruth",Lancet Infect Dis,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30885,0z5wacxs,7e787fd2ae5b544add6281d3d40ad322de26aa17,Elsevier,Transportation capacity for patients with high...,10.1111/1469-0691.12290,,24750421.0,els-covid,Abstract Highly infectious diseases (HIDs) are...,2019-04-30,"Schilling, S.; Maltezou, H.C.; Fusco, F.M.; De...",Clinical Microbiology and Infection,,,,document_parses/pdf_json/7e787fd2ae5b544add628...,,https://doi.org/10.1111/1469-0691.12290
16452,0z5wacxs,7e787fd2ae5b544add6281d3d40ad322de26aa17,PMC,Transportation capacity for patients with high...,10.1111/1469-0691.12290,PMC7128608,25636943.0,no-cc,Highly infectious diseases (HIDs) are defined ...,2015-06-22,"Schilling, S.; Maltezou, H.C.; Fusco, F.M.; De...",Clin Microbiol Infect,,,,document_parses/pdf_json/7e787fd2ae5b544add628...,document_parses/pmc_json/PMC7128608.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30416,21htepa1,a25e212b03cc65c44dcc336775b101934e30f041,Elsevier,Panspermia—true or false?,10.1016/s0140-6736(03)14040-8,,12907025.0,els-covid,,2003-08-02,"de Leon, Samuel Ponce; Lazcano, Antonio",The Lancet,,,,document_parses/pdf_json/a25e212b03cc65c44dcc3...,,https://doi.org/10.1016/s0140-6736(03)14040-8
16473,21htepa1,,PMC,Panspermia—true or false?,10.1016/s0140-6736(03)14040-8,PMC7135165,12907026.0,no-cc,,2003-08-02,"de Leon, Samuel Ponce; Lazcano, Antonio",Lancet,,,,,document_parses/pmc_json/PMC7135165.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30156,21qu87oh,68249d769e1926678af8d52d2484e36787e13525,Elsevier,Bystander CD8 T-Cell-Mediated Demyelination is...,10.1016/s0002-9440(10)63126-4,,14742242.0,els-covid,Mice infected with the coronavirus mouse hepat...,2004-02-29,"Dandekar, Ajai A.; Anghelina, Daniela; Perlman...",The American Journal of Pathology,,,,document_parses/pdf_json/68249d769e1926678af8d...,,https://doi.org/10.1016/s0002-9440(10)63126-4
42065,21qu87oh,,PMC,Bystander CD8 T-Cell-Mediated Demyelination is...,,PMC1602263,14742242.0,unk,Mice infected with the coronavirus mouse hepat...,2004-02-12,"Dandekar, Ajai A.; Anghelina, Daniela; Perlman...",,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1...
16467,2maferew,,PMC,Virologie : l’apport de la biologie moléculair...,10.1016/s0929-693x(07)78706-7,PMC7133300,17182229.0,no-cc,The conventionnal tools used for virological d...,2008-02-15,"Brouard, J.; Vabret, A.; Perrot, S.; Nimal, D....",Arch Pediatr,,,,,,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7...
30766,2maferew,6744bc52b1b29d2ab28cbaeef8942eebece5175b,Elsevier,Virologie : l’apport de la biologie moléculair...,10.1016/s0929-693x(07)78706-7,,18280911.0,els-covid,Résumé Les outils traditionnels du diagnostic ...,2007-12-31,"Brouard, J.; Vabret, A.; Perrot, S.; Nimal, D....",Archives de Pédiatrie,,,,document_parses/pdf_json/6744bc52b1b29d2ab28cb...,,https://doi.org/10.1016/s0929-693x(07)78706-7


Given how few there are, it is probably easiest to merge them manually.
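
To support a manual merge, it can be useful to list, for each duplicated cord_uid, the columns on which its rows disagree. The following is a minimal sketch on made-up rows; `conflicting_columns` is a hypothetical helper, not part of the notebook. A group with no conflicting columns is an exact duplicate and can be dropped safely; the rest need a decision per column.

```python
import pandas as pd

# Toy rows loosely modelled on the Elsevier/PMC pairs above; values are invented.
toy = pd.DataFrame({
    'cord_uid': ['x1', 'x1', 'y2', 'y2', 'z3'],
    'source_x': ['Elsevier', 'PMC', 'PMC', 'PMC', 'PMC'],
    'title': ['T1', 'T1', 'T2', 'T2', 'T3'],
    'pmcid': [None, 'PMC9', 'PMC5', 'PMC5', 'PMC7'],
})


def conflicting_columns(frame, key='cord_uid'):
    """Map each duplicated key to the columns whose values differ between its rows."""
    duplicated = frame[frame[key].duplicated(keep=False)]
    conflicts = {}
    for value, group in duplicated.groupby(key):
        # dropna=False counts a missing value as a distinct value of its own,
        # so "present in one row, missing in the other" is reported as a conflict
        distinct = group.drop(columns=[key]).nunique(dropna=False)
        conflicts[value] = list(distinct[distinct > 1].index)
    return conflicts


conflicts = conflicting_columns(toy)
print(conflicts)
```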

__Restart here, merge manually__

In [21]:
for Column in SingleValued_and_Unique:
    ValueCounts = metadataFrame[Column].value_counts()
    print(ValueCounts[ValueCounts>1])
    #print(Column, metadataFrame[Column].is_unique)

fzq71ghi    8
jsk1oztb    3
m6q8kbjg    2
5kzx5hgg    2
4hlvrfeh    2
adygntbe    2
c4u0gxp5    2
8fwa2c24    2
e9pwguwm    2
4fbr8fx8    2
xjpev4jw    2
eich19nx    2
3ury4hnv    2
5ei7iwu0    2
j7swau26    2
hox2xwjg    2
o4r34pff    2
vqbreyna    2
mmls866r    2
huablvd1    2
qhftb6d7    2
j3b964oz    2
940au47y    2
vp5358rr    2
21htepa1    2
30duqivi    2
6hdoap81    2
0klupmep    2
sv9mdgek    2
6z5f2gz3    2
7y8fd521    2
2maferew    2
brz1fn2h    2
21qu87oh    2
laq5ze8o    2
0z5wacxs    2
79mzwv1c    2
Name: cord_uid, dtype: int64
Series([], Name: pmcid, dtype: int64)
32158035.0    2
32305024.0    2
31903811.0    2
14550714.0    2
16237214.0    2
32117569.0    2
14742242.0    2
26766408.0    2
16400004.0    2
32100487.0    2
11335175.0    2
12466134.0    2
27381971.0    2
15635312.0    2
15743792.0    2
10751347.0    2
12932399.0    2
6134086.0     2
25957460.0    2
16252244.0    2
15172785.0    2
17357067.0    2
32153144.0    2
16194517.0    2
16049331.0    2
12788592.0    2

This is annoying, to say the least. It is also not very informative, as the counts alone cannot tell us whether the same records are duplicated across these different columns. Other than for mag_id, this is quite plausible; if it is the case, we will just sort out those duplicate records and discard mag_id.
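
One way to test whether the duplicates coincide across columns is to flag, row by row, which columns carry a repeated value and then compare the flags. This is a sketch under assumptions: the data is invented, and `duplicate_flags` is a hypothetical helper.

```python
import pandas as pd

# Toy frame; the real metadataFrame is not used here.
toy = pd.DataFrame({
    'cord_uid': ['a', 'a', 'b', 'c', 'd'],
    'pubmed_id': [1.0, 1.0, 2.0, 2.0, None],
    'url': ['u1', 'u2', 'u3', 'u3', 'u4'],
})


def duplicate_flags(frame, columns):
    """One boolean column per input column: True where the non-null value repeats."""
    flags = pd.DataFrame(index=frame.index)
    for column in columns:
        # notna() stops missing values from being counted as duplicates of each other
        flags[column] = frame[column].notna() & frame[column].duplicated(keep=False)
    return flags


flags = duplicate_flags(toy, ['cord_uid', 'pubmed_id', 'url'])

# Co-occurrence counts: entry (i, j) is the number of rows flagged in both columns.
overlap = flags.astype(int).T.dot(flags.astype(int))
print(overlap)

# Rows whose pubmed_id repeats although their cord_uid does not.
unexplained = toy[flags['pubmed_id'] & ~flags['cord_uid']]
print(unexplained)
```

If `unexplained` came back empty on the real data, the pubmed_id duplicates would be fully accounted for by the cord_uid duplicates.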

In [13]:
metadataFrame['cord_uid'].is_unique

False