<a href="https://colab.research.google.com/github/iued-uni-heidelberg/cord19/blob/main/Cord19_v02_download2text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading and reading CORD19 corpus
This notebook downloads the freely available CORD-19 corpus and converts it into a single plain-text file (or, optionally, several files). The notebook is hosted in the GitHub repository of IÜD, Heidelberg University: https://github.com/iued-uni-heidelberg/cord19

The CORD-19 (COVID-19) open-source corpus is available from https://www.semanticscholar.org/cord19/download.

Documentation is available at https://github.com/allenai/cord19

The original files are in JSON format. The output file is in plain text format; documents are separated (by default) by \<doc id="doc1000001"> ... \</doc> tags.

The plain text file is intended for further processing, e.g., generating linguistic annotation with the TreeTagger or the Stanford parser for part-of-speech annotation or dependency / constituency parsing.
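
As a minimal sketch of how the plain text file could be read back for such processing, the following splits a stream of lines on the default `<doc id="...">` ... `</doc>` tags. The helper name `iter_docs` is hypothetical, and the sketch assumes that the default tag name `doc` was used and that no document text contains a line consisting only of `</doc>`.

```python
import io
import re

# Opening tag as written with the default tag name, e.g. <doc id="doc1000001">
DOC_OPEN = re.compile(r'^<doc id="([^"]+)">$')

def iter_docs(lines):
    """Yield (doc_id, text) pairs from an iterable of lines (hypothetical helper)."""
    doc_id, buffer = None, []
    for line in lines:
        line = line.rstrip('\n')
        match = DOC_OPEN.match(line)
        if match:
            # start of a new document
            doc_id, buffer = match.group(1), []
        elif line == '</doc>' and doc_id is not None:
            # end of the current document
            yield doc_id, '\n'.join(buffer).strip()
            doc_id = None
        elif doc_id is not None:
            buffer.append(line)

# Small in-memory sample in the same layout as the output file
sample = io.StringIO('<doc id="doc1000001">\nA title\n\nA paragraph.\n</doc>\n\n')
docs = list(iter_docs(sample))
print(docs)
```

For a real corpus file, an open file object can be passed instead of the sample, so that the file is read line by line rather than loaded into memory at once.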



## Downloading CORD19 corpus

The corpus is downloaded and extracted from https://www.semanticscholar.org/cord19/download

Check the link above if you need the latest release of the corpus or would like to choose a different release. Currently the 2022-06-02 release is downloaded.

File size: ~11 GB (v2021-08-30); ~18 GB (v2022-06-02).

Expected download time: ~9 min.
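
Given the archive size, it may be worth checking the free disk space before starting the download. The following is a minimal sketch using the standard library; the helper name `has_free_space` and the 18 GB threshold are illustrative, and the extracted files need additional space on top of the archive itself.

```python
import shutil

def has_free_space(path, required_gb):
    """Return True if the file system containing `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# ~18 GB is the approximate size of the v2022-06-02 archive
print(has_free_space('.', 18))
```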


In [None]:
# !wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2021-08-30.tar.gz
!wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz

--2022-09-27 09:14:35--  https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz
Resolving ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)... 52.218.137.105
Connecting to ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)|52.218.137.105|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18657952487 (17G) [binary/octet-stream]
Saving to: ‘cord-19_2022-06-02.tar.gz’

          cord-19_2  11%[=>                  ]   2.03G  33.0MB/s    eta 7m 38s 

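Extracting the CORD-19 archive, approximate time ~4 min. Note that the cells below use the file and directory names of the 2021-08-30 release; if you downloaded a different release (e.g., 2022-06-02), replace the date in these names accordingly.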

In [None]:
!tar -xvzf cord-19_2021-08-30.tar.gz

2021-08-30/changelog
2021-08-30/cord_19_embeddings.tar.gz
2021-08-30/document_parses.tar.gz
2021-08-30/metadata.csv


Removing initial archive to free some disk space

In [None]:
!rm cord-19_2021-08-30.tar.gz

Extracting the document parses, which contain the individual articles in separate JSON files. This is expected to take ~12 min.

In [None]:
!tar -xvzf 2021-08-30/document_parses.tar.gz
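
As a possible alternative to extracting everything and deleting `pdf_json` afterwards, only one subdirectory could be extracted with Python's `tarfile` module. This is a sketch, not what the notebook does: the helper name `extract_subdir` is hypothetical, and it assumes that member paths inside the archive start with `document_parses/pmc_json/`, which should be checked against the actual archive.

```python
import tarfile

def extract_subdir(archive_path, prefix, dest='.'):
    """Extract only the members whose path starts with `prefix`; return how many were extracted."""
    count = 0
    with tarfile.open(archive_path, 'r:gz') as tar:
        for member in tar:
            if member.name.startswith(prefix):
                tar.extract(member, path=dest)
                count += 1
    return count

# Example call (paths as used in this notebook; adjust the date to your release):
# extract_subdir('2021-08-30/document_parses.tar.gz', 'document_parses/pmc_json/')
```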

Removing more files to save space: ~ 9 seconds

In [None]:
# removing more files to save space
!rm --recursive 2021-08-30
!rm --recursive document_parses/pdf_json

## Reading json directory and merging into text file(s)

Run this cell to create the class; then run the next cell to execute it on the directory "document_parses/pmc_json".

The class reads a directory of JSON files and writes them either to a single text file or, with "split_by_docs=N", to several text files with N documents in each.
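
To illustrate the input the class works on, here is a simplified sketch of the extraction step on a tiny, made-up document. It assumes the layout that the class accesses: a `metadata` dictionary with a `title`, and a `body_text` list of paragraphs, each with a `text` field. The helper name `json_to_text` and the sample content are hypothetical.

```python
import json

# A tiny, made-up document in the layout the class reads:
# metadata -> title, and body_text -> list of paragraphs with a "text" field.
sample_json = json.dumps({
    "metadata": {"title": "An example title"},
    "body_text": [
        {"text": "First paragraph.", "section": "Introduction"},
        {"text": "Second paragraph.", "section": "Discussion"},
    ],
})

def json_to_text(json_string):
    """Join the title and all paragraph texts with blank lines (simplified, hypothetical helper)."""
    doc = json.loads(json_string)
    parts = []
    title = doc.get("metadata", {}).get("title")
    if title:
        parts.append(title)
    # keep only paragraphs that actually have a "text" field
    parts.extend(p["text"] for p in doc.get("body_text", []) if "text" in p)
    return "\n\n".join(parts)

print(json_to_text(sample_json))
```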

In [None]:
# -*- coding: utf-8 -*-
# Python script to open each file, read json input and copy to one text file for subsequent processing
import os, re, sys
import json

class clJsonDir2txt(object):
    '''
    @author Bogdan Babych, IÜD, Heidelberg University
    @email bogdan [dot] babych [at] iued [dot] uni-heidelberg [dot] de
    a script for processing covid-19 corpus:
    @url https://www.semanticscholar.org/cord19 @url https://www.semanticscholar.org/cord19/download
        recursively reads files from a directory, and glues them together into a single corpus file

    @todo:
        working with sections - collect titles of all sections; frequent sections; select argumentative sections (e.g., discussion, analysis...)
        - to compare descriptive and argumentative parts of the corpus

        experimenting with different annotations (pos, parsing... ); MT quality evaluation...
    '''
    def __init__(self, SDirName, output_file = 'corpus_out.txt', textfilter=None, include_title = True, include_refs = True, include_authors = True, tag='doc', id=1000000, split_by_docs = 0): # initialising by opening the directories
        self.SOutput_file = output_file
        self.STextFilter = textfilter
        # compile the filter only if one is given: re.compile(None) raises a TypeError
        self.RFilter = re.compile(textfilter, re.IGNORECASE | re.MULTILINE) if textfilter else None
        self.BInclTitle = include_title # implemented
        self.BInclRefs = include_refs # not implemented yet
        self.BInclAuth = include_authors # not implemented yet
        self.STag = tag
        self.ID = id
        self.ISplitByDocs = int(split_by_docs)
        # print(self.ISplitByDocs)
        self.openDir(SDirName)
        return


    def openDir(self, path): # recursively walks the directories below the given root directory and reads each file into a string
        i = 0
        if self.ISplitByDocs:
            SPartFile = "part1000000" + self.SOutput_file
            FOut = open(SPartFile, 'w')
        else:
            FOut = open(self.SOutput_file, 'w')

        for root,d_names,f_names in os.walk(path):
            for f in f_names:
                i+=1
                if i%1000==0: print(str(i) + '. Processing: ' + f)
                fullpath = os.path.join(root, f)
                # print(fullpath)
                try:
                    # 'with' closes the input file even if the document is skipped by the filter
                    with open(fullpath, 'r', encoding='utf-8') as FIn:
                        SIn = FIn.read()
                    # apply text filter, if not None
                    if self.STextFilter and (self.RFilter.search(SIn) is None): continue
                    SText2Write = self.procFile(SIn,f,i)
                    if SText2Write: FOut.write(SText2Write) # if the string is not empty then write to file
                except Exception:
                    print(f'file {f} cannot be read or processed')
                finally:
                    # splitting output into chunks of "split_by_docs" size
                    if self.ISplitByDocs and (i % self.ISplitByDocs == 0): # if self.ISplitByDocs == 0 then everything goes into one file; if this > 0 then
                        SPartFile = "part" + str(1000000 + i) + self.SOutput_file # generate new file name
                        FOut.flush()
                        FOut.close()
                        FOut = open(SPartFile, 'w')
        FOut.flush()
        FOut.close()

        return


    def procFile(self, SIn,SFNameIn,i): # sends each json string for text extraction and wraps the resulting output string in the correct tags
        STagOpen = '<' + self.STag + ' id="' + self.STag + str(self.ID + i)  + '">\n'
        STagClose = '\n</' + self.STag + '>\n\n'
        SText4Corpus = self.getJson(SIn, SFNameIn)
        if SText4Corpus:
            return STagOpen + SText4Corpus + STagClose
        else:
            print('\tNo data read from: ' + SFNameIn)
            return None


    def getJson(self, SIn, SFNameIn): # for each file-level string read from a file: managing internal structure of the covid-19 json file
        LOut = [] # collecting a list of strings
        try:
            DDoc = json.loads(SIn)
        except:
            print('\t\t' + SFNameIn + ' => error reading json2dictionary')
            return None
        # metadata:
        try:
            DMetaData = DDoc['metadata']
            if DMetaData:
                SMetaData = self.getJson_Metadata(DMetaData)
                if SMetaData: LOut.append(SMetaData)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no metadata')
            DMetaData = None
        # body text
        try:
            LBodyText = DDoc['body_text']
            if LBodyText:
                SBodyText = self.getJson_BodyText(LBodyText)
                LOut.append(SBodyText)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no body_text')
            LBodyText = None
        # further: to implement references

        SText = '\n\n'.join(LOut)
        return SText


    def getJson_Metadata(self, DIn): # converts interesting parts of metadata into a string
        SMetadata = ''
        LMetadata = []
        try: STitle = DIn["title"]
        except: STitle = None
        if STitle and self.BInclTitle:
            LMetadata.append(STitle)

        # to implement reading of authors' names

        if LMetadata: SMetadata = '\n\n'.join(LMetadata)
        return SMetadata


    def getJson_BodyText(self, LIn): # converts interesting parts of the body texts into a string
        SBodyText = ''
        LBodyText = []
        for DParagraph in LIn:
            try:
                ## DParagraphs[section] ## -- later on >> distinction between different sections....
                SParagraph = DParagraph["text"]
                LBodyText.append(SParagraph)
            except:
                print('!',)
                continue

        SBodyText = '\n\n'.join(LBodyText)
        return SBodyText

# arguments:
'''
        sys.argv[1], # obligatory: input directory name;
            other arguments optional:
            output_file = 'covid19corpus.txt',
            textfilter = None, # if this is string, only texts containing it are collected, e.g., covid
            include_title = True, # include or exclude title
            include_refs = False, # not implemented yet: include or exclude references
            split_by_docs=0 # split by groups of n documents; if 0 then write to one file

'''

'''if __name__ == '__main__':
    OJsonDir2txt = clJsonDir2txt(sys.argv[1], output_file = 'covid19corpus.txt', textfilter=None, include_title = True, include_refs = False, split_by_docs=0)
'''


This cell reads the JSON files into a single file (or multiple files).

Change the value of "split_by_docs=0" to "split_by_docs=10000" or any other number; this will create several corpus files, each with 10000 (or the chosen number of) documents.
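
The names of the resulting part files follow from the class code: the first part is named with the prefix "part1000000", and a new part file is started after every N files visited, named with 1000000 plus the running file count. The sketch below reproduces this naming; the helper name `part_file_names` is hypothetical and the file count is the approximate figure given in this notebook.

```python
def part_file_names(n_files, split_by_docs, output_file):
    """List the output file names produced for `n_files` input files (hypothetical helper)."""
    if not split_by_docs:
        # split_by_docs=0: everything goes into one file
        return [output_file]
    names = ['part1000000' + output_file]
    # a new part is opened whenever the running file counter is a multiple of split_by_docs
    for i in range(split_by_docs, n_files + 1, split_by_docs):
        names.append('part' + str(1000000 + i) + output_file)
    return names

print(part_file_names(198000, 40000, 'covid19corpus.txt'))
```

Note that the counter advances for every file visited, including files skipped by a text filter, so a part may contain fewer than N documents.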


Approximate execution time ~10 min

File size to download ~4.3 GB

It contains ~198,000 documents,

~671,578,587 words

~19,381,647 paragraphs (including empty lines, i.e., ~10M real paragraphs)

~4,619,100,883 characters

Download time can take up to 1 hour depending on your connection speed.

To split into chunks of roughly BNC size (100MW), split into groups of ~40000 documents (in the following cell set "split_by_docs=40000")
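
A rough back-of-the-envelope check of the chunk size can be made from the totals quoted above. The variable names are illustrative and the estimate applies to the unfiltered corpus only. When a text filter is set, skipped files still advance the splitting counter, so each part holds fewer words than this estimate suggests (the `wc` output at the end of this notebook shows about 87M words per filtered part with 40000 files per part).

```python
total_words = 671_578_587    # word count quoted above
total_docs = 198_000         # approximate document count quoted above
target_words = 100_000_000   # BNC-sized chunk (100MW)

words_per_doc = total_words / total_docs
docs_per_chunk = round(target_words / words_per_doc)
print(round(words_per_doc), docs_per_chunk)
```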


In [None]:
# remove parameter textfilter='covid', to return all documents
# change parameter split_by_docs=40000 to split_by_docs=0 to return a single file instead of ~5 parts with <40000 in each
OJsonDir2txt = clJsonDir2txt("document_parses/pmc_json", output_file = 'covid19corpFiltCOVID.txt', textfilter='covid', include_title = True, include_refs = False, split_by_docs=40000)


1000. Processing: PMC8328126.xml.json
2000. Processing: PMC8378831.xml.json
3000. Processing: PMC7885334.xml.json
4000. Processing: PMC7123675.xml.json
5000. Processing: PMC8237894.xml.json
6000. Processing: PMC4629194.xml.json
7000. Processing: PMC7423848.xml.json
8000. Processing: PMC8284347.xml.json
9000. Processing: PMC8278374.xml.json
10000. Processing: PMC1570461.xml.json
11000. Processing: PMC7744420.xml.json
12000. Processing: PMC7158367.xml.json
13000. Processing: PMC7869758.xml.json
14000. Processing: PMC7585733.xml.json
15000. Processing: PMC7871039.xml.json
16000. Processing: PMC7119026.xml.json
17000. Processing: PMC7705351.xml.json
18000. Processing: PMC3657891.xml.json
19000. Processing: PMC7443920.xml.json
20000. Processing: PMC7152080.xml.json
21000. Processing: PMC8012999.xml.json
22000. Processing: PMC8204832.xml.json
23000. Processing: PMC8330914.xml.json
24000. Processing: PMC7156231.xml.json
25000. Processing: PMC7195042.xml.json
26000. Processing: PMC8341344.xml.
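
Note on the `textfilter` parameter: in the class above it is compiled as a case-insensitive regular expression and searched in the raw JSON string of each file, so a match anywhere in the file keeps the document, including in fields that are not written to the output. A minimal sketch with two made-up JSON strings:

```python
import re

# Same flags as in the class above; 'covid' is the filter string used in the cell above
covid_filter = re.compile('covid', re.IGNORECASE | re.MULTILINE)

# Two made-up raw JSON strings standing in for file contents
raw_docs = [
    '{"metadata": {"title": "COVID-19 in children"}, "body_text": []}',
    '{"metadata": {"title": "Influenza surveillance"}, "body_text": []}',
]
# keep only the strings in which the filter matches somewhere
kept = [s for s in raw_docs if covid_filter.search(s) is not None]
print(len(kept))
```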

To see the number of lines (paragraphs, including empty lines), words and bytes in your corpus, you can use this command:

In [None]:
!wc covid19corpus.txt

  11353308  428465423 2922833953 covid19corpus.txt
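
If `wc` is not available, similar counts can be obtained in Python. This is a sketch with a hypothetical helper name; it counts characters rather than bytes, so the third number can differ from the `wc` output for text containing non-ASCII characters.

```python
import io

def count_lines_words_chars(lines):
    """Count lines, whitespace-separated words and characters in an iterable of lines."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += 1
        n_words += len(line.split())
        n_chars += len(line)
    return n_lines, n_words, n_chars

# Small in-memory sample; pass an open file object to count a real corpus file
sample = io.StringIO('A title\n\nA short paragraph.\n')
print(count_lines_words_chars(sample))
```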


If you have split the text into parts, you can see the number of words in each part using this command:

In [None]:
!wc part*

    2330707    87381483   596272676 part1000000covid19corpFiltCOVID.txt
    2330707    87381483   596272676 part1000000covid19corpusFilterCOVID.txt
    3899868   136307720   936593762 part1000000covid19corpus.txt
    2263357    86800310   591503574 part1040000covid19corpFiltCOVID.txt
    2263357    86800310   591503574 part1040000covid19corpusFilterCOVID.txt
    3861089   135312676   930249471 part1040000covid19corpus.txt
    2296496    86650535   591056082 part1080000covid19corpFiltCOVID.txt
    2296496    86650535   591056082 part1080000covid19corpusFilterCOVID.txt
    3966522   136553801   939515657 part1080000covid19corpus.txt
    2282277    86724659   590829890 part1120000covid19corpFiltCOVID.txt
    2282277    86724659   590829890 part1120000covid19corpusFilterCOVID.txt
    3941044   136273170   937283286 part1120000covid19corpus.txt
    2180471    80908436   553171731 part1160000covid19corpFiltCOVID.txt
    2180471    80908436   553171731 part1160000covid19corpusFilterCOVID.txt
