<a href="https://colab.research.google.com/github/iued-uni-heidelberg/cord19/blob/main/Cord19_v02_download2text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading and reading CORD19 corpus
This notebook downloads the free CORD19 corpus and reads it into one file. The notebook is hosted in the GitHub repository of IÜD, Heidelberg University: https://github.com/iued-uni-heidelberg/cord19

The CORD19 (COVID-19) open-source corpus is available from https://www.semanticscholar.org/cord19/download.

Documentation is available at https://github.com/allenai/cord19

The original files are in json format. The output file is in plain text format; documents are separated (by default) by \<doc id="doc1000001"> ... \</doc> tags
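
To illustrate the output format, here is a minimal sketch of how the tagged plain text could be read back into separate documents. The helper name `iter_docs` and the sample string are hypothetical and only serve as an illustration; the pattern assumes the default tag name `doc`. It reads the whole string into memory, so it is meant for small samples rather than the full corpus files.

```python
import re

# Matches one <doc id="..."> ... </doc> block (default tag name "doc" assumed)
DOC_PATTERN = re.compile(r'<doc id="([^"]+)">\n(.*?)\n</doc>', re.DOTALL)

def iter_docs(corpus_text):
    """Yield (doc_id, text) pairs from a tagged corpus string."""
    for match in DOC_PATTERN.finditer(corpus_text):
        yield match.group(1), match.group(2)

# Made-up sample in the same layout as the output file
sample = (
    '<doc id="doc1000001">\nFirst title\n\nFirst paragraph.\n</doc>\n\n'
    '<doc id="doc1000002">\nSecond title\n</doc>\n\n'
)
docs = list(iter_docs(sample))
print(docs[0][0], len(docs))
```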

The plain text file is intended for further processing, e.g., generating linguistic annotation using the TreeTagger or the Stanford parser for part-of-speech annotation or dependency / constituency parsing.



## Downloading CORD19 corpus

The corpus is downloaded and extracted from https://www.semanticscholar.org/cord19/download

Check the link above if you need the latest release of the corpus or would like to choose another release. Currently the 2022-06-02 release is downloaded.

File size: ~11 GB (v2021-08-30), ~18 GB (v2022-06-02)

Expected download time: ~9 min
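
If you switch releases often, the archive URL can be built from the release date. This is a small sketch: the helper name `release_url` is hypothetical, and the URL pattern is inferred only from the two release URLs used in this notebook, so other dates may not exist on the server.

```python
# Base path taken from the wget commands in this notebook
BASE_URL = "https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases"

def release_url(release_date):
    """Build the archive URL for a release date given as 'YYYY-MM-DD'."""
    return f"{BASE_URL}/cord-19_{release_date}.tar.gz"

print(release_url("2022-06-02"))
```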


In [1]:
# !wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2021-08-30.tar.gz
!wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz

--2022-10-04 06:58:59--  https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz
Resolving ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)... 52.92.145.162
Connecting to ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)|52.92.145.162|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18657952487 (17G) [binary/octet-stream]
Saving to: ‘cord-19_2022-06-02.tar.gz’


2022-10-04 07:07:04 (36.7 MB/s) - ‘cord-19_2022-06-02.tar.gz’ saved [18657952487/18657952487]



Extracting cord-19 corpus, approximate time ~ 4 min
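
The same single-member extraction can also be done from Python with the standard `tarfile` module, which may be convenient outside Colab. This is a sketch only; the function name `extract_member` is hypothetical, and the usage line assumes the archive has already been downloaded into the working directory.

```python
import tarfile

def extract_member(archive_path, member_name, dest_dir="."):
    """Extract one named member from a .tar.gz archive into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as archive:
        member = archive.getmember(member_name)  # raises KeyError if the member is absent
        archive.extract(member, path=dest_dir)

# Usage in the notebook setting (uncomment once the archive is downloaded):
# extract_member("cord-19_2022-06-02.tar.gz", "2022-06-02/document_parses.tar.gz")
```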

In [2]:
# !tar -xvzf cord-19_2021-08-30.tar.gz
!tar -xvzf cord-19_2022-06-02.tar.gz 2022-06-02/document_parses.tar.gz

2022-06-02/document_parses.tar.gz


Removing initial archive to free some disk space

In [3]:
# !rm cord-19_2021-08-30.tar.gz
!rm cord-19_2022-06-02.tar.gz
!mv 2022-06-02/document_parses.tar.gz ./document_parses.tar.gz

Removing more files to save space:

In [4]:
# removing more files to save space
# !rm --recursive 2021-08-30
# !rm --recursive document_parses/pdf_json
!rm --recursive 2022-06-02

Extracting document parses, which contain individual articles in separate json files. This is expected to take ~ 9+ min.
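
For orientation, the sketch below shows the parts of one article file that this notebook reads later: the title under `metadata`, and the `section` and `text` fields of each entry in `body_text`. The miniature document is made up; real files contain further fields that are not used here.

```python
import json

# A made-up miniature document with only the fields this notebook reads
raw = json.dumps({
    "metadata": {"title": "An example title"},
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Introduction", "text": "Second paragraph."},
        {"section": "Methods", "text": "Third paragraph."},
    ],
})

doc = json.loads(raw)
title = doc["metadata"]["title"]
paragraphs = [p["text"] for p in doc["body_text"]]
sections = sorted({p["section"] for p in doc["body_text"]})
print(title, len(paragraphs), sections)
```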

In [None]:
# !tar -tvf document_parses.tar.gz >document_parses.txt

In [None]:
# !tar -xvzf 2021-08-30/document_parses.tar.gz
!tar -xvzf document_parses.tar.gz document_parses/pmc_json

## output:

PMC1054884.xml.json

PMC1065028.xml.json

PMC1065064.xml.json

PMC1065120.xml.json

PMC1065257.xml.json

PMC1072802.xml.json

PMC1072806.xml.json

PMC1072807.xml.json

PMC1074505.xml.json

PMC1074749.xml.json
...

~ 28G of json files


In [None]:
# ls /content/document_parses/pmc_json >pms_json.txt

In [6]:
# !rm --recursive document_parses/pmc_json
# !rm --recursive document_parses/pdf_json
!rm document_parses.tar.gz

In [7]:
!du -sh /content/document_parses/pmc_json

28G	/content/document_parses/pmc_json


In [10]:
!mkdir /content/document_parses/pmc_json_sample

## Reading json directory and merging into text file(s)

Run this cell to create the class; then run the next cell to execute on the directory "document_parses/pmc_json"

This is a class for reading a directory of json files and writing them to a single text file, or splitting them into several text files with "split_by_docs=N" (N documents in each file).
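
To make the splitting easier to follow, here is a sketch of how the part file names relate to the position of an input file. The function `part_file_name` is hypothetical and only mirrors the naming logic of the class below; the counter covers every input file that is visited, including files that a text filter later skips.

```python
def part_file_name(i, split_by_docs, output_file="cord19.txt"):
    """Return the output file that receives the i-th input file (1-based)."""
    if not split_by_docs:
        return output_file  # everything goes into one file
    # files 1..N go to part1000000, files N+1..2N to part(1000000+N), and so on
    offset = ((i - 1) // split_by_docs) * split_by_docs
    return "part" + str(1000000 + offset) + output_file

for i in (1, 40000, 40001, 280001):
    print(i, part_file_name(i, 40000))
```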

In [2]:
# -*- coding: utf-8 -*-
# Python script to open each file, read json input and copy to one text file for subsequent processing
import os, re, sys
import json
from collections import defaultdict

class clJsonDir2txt(object):
    '''
    @author Bogdan Babych, IÜD, Heidelberg University
    @email bogdan [dot] babych [at] iued [dot] uni-heidelberg [dot] de
    a script for processing covid-19 corpus:
    @url https://www.semanticscholar.org/cord19 @url https://www.semanticscholar.org/cord19/download
        recursively reads files from a directory, and glues them together into a single corpus file

    @todo:
        working with sections - collect titles of all sections; frequent sections; select argumentative sections (e.g., discussion, analysis...)
        - to compare descriptive and argumentative parts of the corpus

        experimenting with different annotations (pos, parsing... ); MT quality evaluation...
    '''
    def __init__(self, SDirName, output_file = 'corpus_out.txt', textfilter=None, include_title = True, include_sectionNames = True, include_refs = True, include_authors = True, tag='doc', id=1000000, split_by_docs = 0, copy_docs = 0): # initialising by opening the directories
        self.SOutput_file = output_file
        self.STextFilter = textfilter
        # compile the filter only when one is given: re.compile(None) raises a TypeError
        self.RFilter = re.compile(textfilter, re.IGNORECASE | re.MULTILINE) if textfilter else None
        self.BInclTitle = include_title # implemented
        self.BInclSectionNames = include_sectionNames # implemented
        self.BInclRefs = include_refs # not implemented yet
        self.BInclAuth = include_authors # not implemented yet
        self.STag = tag
        self.ID = id
        self.ISplitByDocs = int(split_by_docs)
        self.ICopyDocs = int(copy_docs)
        # global dictionary of section names (to check and make rules...)
        self.DSectNames = defaultdict(int)
        # print(self.ISplitByDocs)
        self.openDir(SDirName)
        self.printDictionary(self.DSectNames, 'corpus-section-names.txt')
        return


    def openDir(self, path): # recursively walks the directories below the given root directory and reads each file into a string
        i = 0
        path_sample = path + '_sample'
        if self.ISplitByDocs:
            SPartFile = "part1000000" + self.SOutput_file
            FOut = open(SPartFile, 'w')
        else:
            FOut = open(self.SOutput_file, 'w')

        for root,d_names,f_names in os.walk(path):
            for f in f_names:
                i+=1
                if i%10000==0: print(str(i) + '. Processing: ' + f)
                fullpath = os.path.join(root, f)
                # print(fullpath)
                try:
                    with open(fullpath,'r') as FIn: # closes the file even if the document is filtered out
                        SIn = FIn.read()
                    # apply text filter, if not None
                    if self.STextFilter and (re.search(self.RFilter, SIn) is None): continue
                    SText2Write = self.procFile(SIn,f,i)
                    if SText2Write: FOut.write(SText2Write) # if the string is not empty then write to file
                except Exception: # keeps Ctrl-C / kernel interrupts working
                    print(f'file {f} cannot be read or processed')
                finally:
                    # splitting output into chunks of "split_by_docs" size
                    if self.ISplitByDocs and (i % self.ISplitByDocs == 0): # if self.ISplitByDocs == 0 then everything goes into one file; if this > 0 then
                        SPartFile = "part" + str(1000000 + i) + self.SOutput_file # generate new file name
                        FOut.flush()
                        FOut.close()
                        FOut = open(SPartFile, 'w')
                    if self.ICopyDocs and (i >= self.ICopyDocs) and (i < (self.ICopyDocs + self.ISplitByDocs)):
                        try:
                            SOutputDirN = root + '_sample'
                            SOutputFN = os.path.join(SOutputDirN, f)
                            os.system(f'cp {fullpath} {SOutputFN}')
                        except:
                            print('.')

        FOut.flush()
        FOut.close()

        return


    def procFile(self, SIn,SFNameIn,i): # sending each json string for extraction of text and attaching the correct tags to each output string
        STagOpen = '<' + self.STag + ' id="' + self.STag + str(self.ID + i)  + '">\n'
        STagClose = '\n</' + self.STag + '>\n\n'
        SText4Corpus = self.getJson(SIn, SFNameIn)
        if SText4Corpus:
            return STagOpen + SText4Corpus + STagClose
        else:
            print('\tNo data read from: ' + SFNameIn)
            return None


    def getJson(self, SIn, SFNameIn): # for each file-level string read from a file: managing internal structure of the covid-19 json file
        LOut = [] # collecting a list of strings
        try:
            DDoc = json.loads(SIn)
        except:
            print('\t\t' + SFNameIn + ' => error reading json2dictionary')
            return None
        # metadata:
        try:
            DMetaData = DDoc['metadata']
            if DMetaData:
                SMetaData = self.getJson_Metadata(DMetaData)
                if SMetaData: LOut.append(SMetaData)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no metadata')
            DMetaData = None
        # body text
        try:
            LBodyText = DDoc['body_text']
            if LBodyText:
                SBodyText = self.getJson_BodyText(LBodyText)
                LOut.append(SBodyText)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no body_text')
            LBodyText = None
        # further: to implement references

        SText = '\n\n'.join(LOut)
        return SText


    def getJson_Metadata(self, DIn): # converts interesting parts of metadata into a string
        SMetadata = ''
        LMetadata = []
        try: STitle = DIn["title"]
        except: STitle = None
        if STitle and self.BInclTitle:
            LMetadata.append(STitle)

        # to implement reading of authors' names

        if LMetadata: SMetadata = '\n\n'.join(LMetadata)
        return SMetadata


    def getJson_BodyText(self, LIn): # converts interesting parts of the body texts into a string
        SBodyText = ''
        LBodyText = []
        SSectionName0 = '' # current section name set to empty for a new text
        # todo: later, in post-processing stage for the whole corpus, maybe after lemmatization...
        # ISampleNumber = 0 # samples of 100 words in text; they do not cross section boundaries 

        for DParagraph in LIn:
            # sections added 2022-09-28
            try:
                # DParagraphs["section"] ## distinction between different sections....
                SSectionName = DParagraph["section"]
                # normalizing new section name (1)
                SSectionName = SSectionName.replace("\n", " ")

                if self.BInclSectionNames and SSectionName: # if we opted to include section names and section name is not empty
                    if SSectionName != SSectionName0: # if we found a new section name
                        # processing section name
                        SSectionName0 = SSectionName # change the current section name

                        # normalizing section name (2)
                        SSectionNameNorm = SSectionName.lower()
                        SSectionNameNorm = re.sub('[0-9\.]+', ' ', SSectionNameNorm)
                        SSectionNameNorm = re.sub('[ ]+', ' ', SSectionNameNorm)
                        SSectionNameNorm = SSectionNameNorm.strip()
                        
                        self.DSectNames[SSectionNameNorm] += 1
                        SSect4text = f'<section sName="{SSectionNameNorm}">\n{SSectionName}\n</section>'

                        LBodyText.append(SSect4text)
            except:
                print('S!',)
                continue
            # first original section (we extract text after extracting section name)
            try:
                ## DParagraphs[section] ## -- later on >> distinction between different sections....
                SParagraph = DParagraph["text"]
                LBodyText.append(SParagraph)
            except:
                print('!',)
                continue



        SBodyText = '\n\n'.join(LBodyText)
        return SBodyText

    def printDictionary(self, DFreq, SFOutDict):
        FOutDict = open(SFOutDict, 'w')
        for key, val in sorted(DFreq.items(), key=lambda x: x[1], reverse=True):
            FOutDict.write(f'{key}\t{str(val)}\n')
        FOutDict.flush()
        FOutDict.close()
    



# arguments:
'''
        sys.argv[1], # obligatory: input directory name;
            other arguments optional:
            output_file = 'covid19corpus.txt',
            textfilter = None, # if this is string, only texts containing it are collected, e.g., covid
            include_title = True, # include or exclude title
            include_refs = False, # not implemented yet: include or exclude references
            split_by_docs=0 # split by groups of n documents; if 0 then write to one file

'''

'''if __name__ == '__main__':
    OJsonDir2txt = clJsonDir2txt(sys.argv[1], output_file = 'covid19corpus.txt', textfilter=None, include_title = True, include_sectionNames = True, include_refs = False, split_by_docs=0, copy_docs=240000)
'''



## numbers from previous version
This cell will read the json files into a single file (or multiple files).

Change the value of "split_by_docs=0" to "split_by_docs=10000" or any other number; this will create several corpus files with 10000 (or the required number of) documents per file.


Approximate execution time ~10 min


File size to download ~4.3 GB

It contains ~198.000 documents,

~ 671.578.587 words

~ 19.381.647 paragraphs (including empty lines, i.e., ~10M real paragraphs)

~ 4.619.100.883 characters

## numbers for the last version
Approximate execution time ~20 min

8 BNC-size (100MW) files are generated, containing together

~ 315.000 documents

  827.118.629 words

   22.094.653 paragraphs

5.650.921.315 characters


Download time can take up to 1 hour depending on your connection speed.

To split into ~BNC size chunks (100MW), split into groups of ~40000 documents (in the following cell set "split_by_docs=40000")
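
The figure of roughly 40000 documents per chunk can be checked with simple arithmetic on the counts listed above for the last version. This is a back-of-the-envelope sketch; the variable names are illustrative and the document count is the approximate value quoted above.

```python
total_words = 827_118_629      # word count reported above for the last version
total_docs = 315_000           # approximate document count reported above
target_words = 100_000_000     # BNC-size chunk (100MW)

words_per_doc = total_words / total_docs
docs_per_chunk = round(target_words / words_per_doc)
print(round(words_per_doc), docs_per_chunk)
```

The result is a little over 38000 documents per 100MW chunk, which is close to the rounded value of 40000.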


In [None]:
# remove the parameter textfilter='covid' to return all documents
# change split_by_docs=40000 to split_by_docs=0 to write a single file instead of several parts of up to 40000 documents each
OJsonDir2txt = clJsonDir2txt("document_parses/pmc_json", output_file = 'cord19.txt', textfilter='covid', include_title = True, include_sectionNames = True, include_refs = False, split_by_docs=40000, copy_docs=240000)


The run also writes the section names with their frequencies to corpus-section-names.txt (ordered by descending frequency, from highest to 1)
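
The section-name file has one tab-separated `name<TAB>count` pair per line, so it can be loaded back for inspection with a few lines of Python. This is a sketch under that assumption; `read_section_counts` is a hypothetical helper, and the demo writes a small made-up file instead of using the real one.

```python
import os
import tempfile

def read_section_counts(path, top_n=None):
    """Read 'name<TAB>count' lines into a list of (name, count) tuples."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # split on the last tab, so an empty section name is still handled
            name, _, count = line.rstrip("\n").rpartition("\t")
            if count.isdigit():
                rows.append((name, int(count)))
            if top_n and len(rows) >= top_n:
                break
    return rows

# Demo on a small made-up file in the same format
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "corpus-section-names.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("introduction\t3\ndiscussion\t2\n\t1\n")
    top_two = read_section_counts(demo_path, top_n=2)
print(top_two)
```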

In [4]:
!ls document_parses/pmc_json_sample | wc -l
!du -sh document_parses/pmc_json_sample

40000
3.6G	document_parses/pmc_json_sample


In [None]:
!tar cvzf document_parses_sample.tar.gz document_parses/pmc_json_sample

In [None]:
!head --lines=75 /content/corpus-section-names.txt

In [7]:
!head --lines=1000 /content/corpus-section-names.txt > corpus-selection-names-top1000.txt

To see the number of words and paragraphs in your corpus, you can use this command:

In [None]:
# !wc covid19corpus.txt
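
If `wc` is not available (for example on Windows), similar counts can be obtained in Python. This is a minimal sketch; the function name `count_lines_words_bytes` is hypothetical. It reads the file in binary mode so that the third number is a byte count, like the last column of the default `wc` output; words are whitespace-separated tokens, which may differ slightly from `wc` on unusual whitespace.

```python
def count_lines_words_bytes(path):
    """Return (lines, words, bytes) for a file, reading it line by line."""
    lines = words = size = 0
    with open(path, "rb") as handle:
        for line in handle:
            lines += 1
            words += len(line.split())  # whitespace-separated tokens
            size += len(line)           # bytes, including the newline
    return lines, words, size

# Usage (uncomment once the corpus file exists):
# print(count_lines_words_bytes("covid19corpus.txt"))
```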

If you have split the text into parts, you can see the number of words in each part using this command:

In [None]:
!wc part*

   4046267  109976546  758142306 part1000000cord19.txt
   4071521  109845759  758310406 part1040000cord19.txt
   4113561  110019721  759706810 part1080000cord19.txt
   4099076  109716696  757381748 part1120000cord19.txt
   4059810  109054692  752368164 part1160000cord19.txt
   4138586  110435615  762458277 part1200000cord19.txt
   4125613  110233019  761062284 part1240000cord19.txt
   3676023   98148739  678626560 part1280000cord19.txt
  32330457  867430787 5988056555 total


In [None]:
!gzip part1000000cord19.txt
!gzip part1040000cord19.txt
!gzip part1080000cord19.txt
!gzip part1120000cord19.txt
!gzip part1160000cord19.txt
!gzip part1200000cord19.txt
!gzip part1240000cord19.txt
!gzip part1280000cord19.txt

In [None]:
!head --lines=500000 part1240000cord19.txt >part1240000cord19-500k.txt

selecting some examples to experiment with...

In [None]:
!cp /content/document_parses/pmc_json/PMC9034168.xml.json PMC9034168.xml.json
!cp /content/document_parses/pmc_json/PMC7128104.xml.json PMC7128104.xml.json
!cp /content/document_parses/pmc_json/PMC8769777.xml.json PMC8769777.xml.json
!cp /content/document_parses/pmc_json/PMC7926205.xml.json PMC7926205.xml.json
!cp /content/document_parses/pmc_json/PMC8799642.xml.json PMC8799642.xml.json
!cp /content/document_parses/pmc_json/PMC7124374.xml.json PMC7124374.xml.json
!cp /content/document_parses/pmc_json/PMC8812323.xml.json PMC8812323.xml.json
!cp /content/document_parses/pmc_json/PMC7446676.xml.json PMC7446676.xml.json
!cp /content/document_parses/pmc_json/PMC7436596.xml.json PMC7436596.xml.json
!cp /content/document_parses/pmc_json/PMC8808276.xml.json PMC8808276.xml.json