This notebook just creates a paper trail of where the data came from and how it was unpacked.

Note that on 2.6.2020 this underwent a significant revision. Specifically:
* CORD is now releasing metadata.csv, document_parses.tar.gz and cord_19_embeddings.tar.gz all as one .tar.gz. That is now the starting point for this notebook.
* I began storing the data in a Google Drive folder, thus making the exact version I am working with accessible to all.

The data I am working with in this notebook is available in Google Drive [here](https://drive.google.com/drive/folders/1jtwgw7b4ad75yzz1cwpIQUg3ur6Sn-UK?usp=sharing). It is the 1.6.2020 release of CORD-19. Release history can be checked [here](https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases.html).

The .tar.gz file this notebook deals with contains three data files:
* article metadata (metadata.csv)
* parsed articles (document_parses.tar.gz)
* article embeddings based on the SPECTER model (cord_19_embeddings.tar.gz)

This notebook simply unzips and reorganizes them.

I also do a quick check on file sizes, but the contents of each file will be explored in more detail in ExploringDataStructures.ipynb.

__Note__: This just uses command line tools. Unfortunately that means that it is unlikely to port to macOS and almost certainly will not port to Windows. It is just intended to be quick and dirty. This is to be corrected at a later date.

### cord-19_2020-06-01.tar.gz

The assumption is that you have downloaded the .tar.gz from the Google Drive folder linked above and that it is in the current working directory.

In [1]:
!du -h cord-19_2020-06-01.tar.gz

So the full .tar.gz is 2.9GB.

Now unzip it to a local directory.

In [2]:
!mkdir ./cord-19_2020-06-01
!tar -xzf cord-19_2020-06-01.tar.gz -C ./cord-19_2020-06-01/ --strip-components=1

In [3]:
!ls cord-19_2020-06-01

changelog  cord_19_embeddings.tar.gz  document_parses.tar.gz  metadata.csv


There are 4 files here, three of which (cord_19_embeddings.tar.gz, document_parses.tar.gz and metadata.csv) will be dealt with below.
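
Since the command-line approach is flagged above as unlikely to port, here is a minimal sketch of the same extraction step using only Python's standard `tarfile` module. The function name `extract_strip_top` is hypothetical. It assumes the archive wraps everything in a single top-level directory (which is what `--strip-components=1` relies on) and that the archive comes from a trusted source.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_strip_top(archive, dest):
    """Extract a .tar.gz into dest, dropping the first path component.

    Hypothetical helper, roughly mirroring
    `tar -xzf archive -C dest --strip-components=1`.
    Assumes a trusted archive; no path sanitising is done here.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        members = []
        for member in tar.getmembers():
            parts = PurePosixPath(member.name).parts
            if len(parts) < 2:
                continue  # skip the top-level directory entry itself
            # Rename the member so it lands directly under dest.
            member.name = str(PurePosixPath(*parts[1:]))
            members.append(member)
        tar.extractall(dest, members=members)
    return dest
```

Something like `extract_strip_top("cord-19_2020-06-01.tar.gz", "cord-19_2020-06-01")` should then stand in for the `mkdir` and `tar` cell above.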

### metadata.csv

In [4]:
!wc -l < ./cord-19_2020-06-01/metadata.csv

140533


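So it contains 140533 lines. It stands to reason that each line holds the metadata of a single article, although one of those lines is presumably the CSV header, and any quoted field that contains a newline would make the line count overstate the record count.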
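
A line count is only a proxy for a record count in a CSV file. As a rough sketch, the records could be counted with Python's standard `csv` module instead, which handles quoted fields that span several lines. The function name `count_csv_records` is hypothetical, and the code assumes the file is UTF-8 and that its first row is a header.

```python
import csv


def count_csv_records(path):
    """Count data records in a CSV file, excluding the header row.

    Hypothetical helper; assumes UTF-8 and a header in the first row.
    """
    # newline="" lets the csv module handle newlines inside quoted fields.
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the (assumed) header row
        return sum(1 for _ in reader)
```

If some field is longer than the `csv` module's default field size limit, `csv.field_size_limit` would need to be raised before reading.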

In [5]:
!du -h ./cord-19_2020-06-01/metadata.csv

203M	./cord-19_2020-06-01/metadata.csv


It is also 203MB, which is neither here nor there, but good to keep track of.

### document_parses.tar.gz

In [6]:
!du -h ./cord-19_2020-06-01/document_parses.tar.gz

1.9G	./cord-19_2020-06-01/document_parses.tar.gz


So the tar.gz is 1.9GB.

Now unzip it.

In [7]:
!tar -xzf ./cord-19_2020-06-01/document_parses.tar.gz -C ./cord-19_2020-06-01/

In [8]:
!ls

cord-19_2020-06-01	       ProjectOverview.ipynb  Unzipping.ipynb
cord-19_2020-06-01.tar.gz      PubmedDownload.ipynb
ExploringDataStructures.ipynb  README.md


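Nothing new shows up here because this lists the working directory. The archive unzipped to a directory document_parses inside cord-19_2020-06-01.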

In [9]:
!du -h ./cord-19_2020-06-01/document_parses

6.7G	./cord-19_2020-06-01/document_parses/pdf_json
4.7G	./cord-19_2020-06-01/document_parses/pmc_json
12G	./cord-19_2020-06-01/document_parses


At 12GB this is quite a bit larger but surprisingly manageable for full text.
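
For a size check that does not depend on `du`, a directory tree can be walked in Python. This is only a sketch: the helper names `tree_size` and `human` are hypothetical, and the code sums apparent file sizes rather than disk blocks, so its figures will not match `du -h` exactly, especially for directories full of small files.

```python
import os


def tree_size(root):
    """Sum the apparent sizes (in bytes) of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # do not follow symlinks
                total += os.path.getsize(path)
    return total


def human(n):
    """Format a byte count with binary units, e.g. 1536 -> '1.5K'."""
    for unit in ("B", "K", "M", "G", "T"):
        if n < 1024 or unit == "T":
            return f"{n:.1f}{unit}"
        n /= 1024
```

For example, `human(tree_size("./cord-19_2020-06-01/document_parses"))` would give a figure comparable to the `du -h` total above.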

In [10]:
!ls ./cord-19_2020-06-01/document_parses/pdf_json/ | wc -l

65782


So that is a bit gnarly: 65782 files in a single directory, I guess one per parsed article PDF.

In [11]:
!ls ./cord-19_2020-06-01/document_parses/pmc_json/ | wc -l

48151


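Again a lot of full text parse files. Together the two directories hold 113933 files (65782 + 48151), and presumably some articles have both a PDF and a PMC parse. Given that metadata.csv has 140533 lines, I guess we are missing full text for a lot of articles.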
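
The `ls | wc -l` counts can also be reproduced without a shell. A minimal sketch, with a hypothetical helper `count_files`: by default it counts every regular file in a directory, and the optional `suffix` argument (for instance `".json"`, on the assumption that the parse files use that extension) narrows the count.

```python
import os


def count_files(directory, suffix=""):
    """Count regular files directly inside directory.

    Hypothetical helper; suffix="" counts every file, while e.g.
    suffix=".json" counts only files whose names end that way.
    """
    with os.scandir(directory) as entries:
        return sum(
            1 for entry in entries
            if entry.is_file() and entry.name.endswith(suffix)
        )
```

Unlike `ls | wc -l`, this skips subdirectories, which should make no difference here if pdf_json and pmc_json contain only files.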

### cord_19_embeddings.tar.gz

In [12]:
!du -h ./cord-19_2020-06-01/cord_19_embeddings.tar.gz

897M	./cord-19_2020-06-01/cord_19_embeddings.tar.gz


This tar is almost 900MB.

Unzipping it.

In [13]:
!tar -xzf ./cord-19_2020-06-01/cord_19_embeddings.tar.gz -C ./cord-19_2020-06-01/

In [14]:
!ls ./cord-19_2020-06-01/

changelog			   document_parses
cord_19_embeddings_2020-06-01.csv  document_parses.tar.gz
cord_19_embeddings.tar.gz	   metadata.csv


It has unzipped to a single file, cord_19_embeddings_2020-06-01.csv.

In [15]:
!du -h ./cord-19_2020-06-01/cord_19_embeddings_2020-06-01.csv

2.0G	./cord-19_2020-06-01/cord_19_embeddings_2020-06-01.csv


2GB of embeddings data.

In [16]:
!wc -l < ./cord-19_2020-06-01/cord_19_embeddings_2020-06-01.csv

140532


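Nearly the same number of lines (and presumably, records) as metadata.csv: 140532 here versus 140533 there. The difference of one would be consistent with metadata.csv having a header row and this file having none.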
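
For a 2GB file it helps to count lines without loading everything into memory. The sketch below is a rough Python stand-in for `wc -l`: it reads the file in binary chunks and counts newline bytes. The function name `count_lines` and the 1MiB default chunk size are arbitrary choices for illustration.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading it in fixed-size chunks.

    Like `wc -l`, a final line without a trailing newline is not counted.
    """
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
    return total
```

Running it on both metadata.csv and the embeddings file would be one way to repeat the comparison above on a system without `wc`.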