<a href="https://colab.research.google.com/github/iued-uni-heidelberg/cord19/blob/main/Cord19_v04_download2text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading and reading CORD19 corpus
This notebook downloads the free CORD19 corpus and converts it into one plain-text file. The notebook is hosted in the GitHub repository of IÜD, Heidelberg University: https://github.com/iued-uni-heidelberg/cord19

The open-source CORD19 (COVID-19) corpus is available from https://www.semanticscholar.org/cord19/download.

Documentation is available at https://github.com/allenai/cord19

The original files are in JSON format. The output file is in plain-text format; by default, documents are delimited by \<doc id="doc1000001"> ... \</doc> tags.
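
As an illustration of the output format, the following minimal sketch reads documents back from a string in that format. It assumes the default tag name `doc` and the layout written by the converter class later in this notebook (opening tag, newline, text, newline, closing tag). The names `DOC_RE` and `iter_docs` are hypothetical, and reading a whole corpus file into memory this way is only practical for small samples.

```python
import re

# One document: opening tag with an id attribute, the text, and the closing tag
DOC_RE = re.compile(r'<doc id="([^"]+)">\n(.*?)\n</doc>', re.DOTALL)

def iter_docs(corpus_text):
    """Yield (doc_id, text) pairs from a string in the plain-text corpus format."""
    for match in DOC_RE.finditer(corpus_text):
        yield match.group(1), match.group(2)

# Hypothetical two-document sample, not taken from the corpus
sample = ('<doc id="doc1000001">\nTitle one\n\nBody one\n</doc>\n\n'
          '<doc id="doc1000002">\nTitle two\n</doc>\n\n')
for doc_id, text in iter_docs(sample):
    print(doc_id, len(text.split()), 'words')
```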

The plain-text file is intended for further processing, e.g., generating linguistic annotation with TreeTagger or the Stanford parser (part-of-speech tagging, dependency or constituency parsing).



## Downloading CORD19 corpus

The corpus is downloaded and extracted from https://www.semanticscholar.org/cord19/download

Check the link above if you need the latest release of the corpus or would like to choose another release. Currently the 2022-06-02 release is downloaded.

File size: ~11GB (v2021-08-30), ~18GB (v2022-06-02)

Expected download time: ~9 min


In [None]:
# !wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2021-08-30.tar.gz
!wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz

--2022-10-06 06:33:31--  https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/cord-19_2022-06-02.tar.gz
Resolving ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)... 52.92.194.90
Connecting to ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com (ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com)|52.92.194.90|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 18657952487 (17G) [binary/octet-stream]
Saving to: ‘cord-19_2022-06-02.tar.gz’


2022-10-06 06:40:59 (39.8 MB/s) - ‘cord-19_2022-06-02.tar.gz’ saved [18657952487/18657952487]
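
The download time can be estimated from the two figures in the log above: the file length in bytes and the average transfer speed. The sketch below assumes that wget reports speed in units of 2^20 bytes per second, which is consistent with the same log showing 18657952487 bytes as 17G. The function name `estimate_minutes` is hypothetical.

```python
def estimate_minutes(size_bytes, speed_mib_per_s):
    """Return the transfer time in minutes for a given size and average speed."""
    bytes_per_second = speed_mib_per_s * 1024 * 1024
    return size_bytes / bytes_per_second / 60

# Values taken from the wget log of the 2022-06-02 release
print(round(estimate_minutes(18657952487, 39.8), 1))
```

The result is close to the interval between the two timestamps in the log (06:33:31 to 06:40:59).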



Extracting cord-19 corpus, approximate time ~ 4 min

In [None]:
# !tar -xvzf cord-19_2021-08-30.tar.gz
!tar -xvzf cord-19_2022-06-02.tar.gz 2022-06-02/document_parses.tar.gz

2022-06-02/document_parses.tar.gz
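
The tar call above names a single member, so only `document_parses.tar.gz` is unpacked from the release archive. A rough Python equivalent with the standard `tarfile` module could look like the sketch below; the helper name `extract_member` is an assumption, and the notebook itself uses the shell command.

```python
import tarfile

def extract_member(archive_path, member_name, dest_dir='.'):
    """Extract one named member from a .tar.gz archive into dest_dir."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        member = tar.getmember(member_name)  # raises KeyError if the member is absent
        tar.extract(member, path=dest_dir)
    return member.name

# Usage with the names from this notebook (not executed here):
# extract_member('cord-19_2022-06-02.tar.gz', '2022-06-02/document_parses.tar.gz')
```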


Removing initial archive to free some disk space

In [None]:
# !rm cord-19_2021-08-30.tar.gz
!rm cord-19_2022-06-02.tar.gz
!mv 2022-06-02/document_parses.tar.gz ./document_parses.tar.gz

Removing more files to save space:

In [None]:
# removing more files to save space
# !rm --recursive 2021-08-30
# !rm --recursive document_parses/pdf_json
!rm --recursive 2022-06-02
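
Because the archives and the extracted files are large, it can help to check the free disk space before each extraction step. The following is a minimal sketch using `shutil.disk_usage`; the helper name `has_free_space` is hypothetical, and the 28 GiB threshold is the approximate size of the extracted `pmc_json` directory reported further down in this notebook.

```python
import shutil

def has_free_space(path, needed_bytes):
    """Return True if the file system holding `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes

GIB = 1024 ** 3
# ~28G is the size of the extracted pmc_json directory
print(has_free_space('.', 28 * GIB))
```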

Extracting the document parses, which contain the individual articles in separate JSON files. This is expected to take 9+ min.

In [None]:
# !tar -tvf document_parses.tar.gz >document_parses.txt

In [None]:
# !tar -xvzf 2021-08-30/document_parses.tar.gz
!tar -xvzf document_parses.tar.gz document_parses/pmc_json

## output:

PMC1054884.xml.json

PMC1065028.xml.json

PMC1065064.xml.json

PMC1065120.xml.json

PMC1065257.xml.json

PMC1072802.xml.json

PMC1072806.xml.json

PMC1072807.xml.json

PMC1074505.xml.json

PMC1074749.xml.json
...

~ 28G of json files


In [None]:
!cp /content/document_parses/pmc_json/PMC9034168.xml.json PMC9034168.xml.json
!cp /content/document_parses/pmc_json/PMC7128104.xml.json PMC7128104.xml.json
!cp /content/document_parses/pmc_json/PMC8769777.xml.json PMC8769777.xml.json
!cp /content/document_parses/pmc_json/PMC7926205.xml.json PMC7926205.xml.json
!cp /content/document_parses/pmc_json/PMC8799642.xml.json PMC8799642.xml.json
!cp /content/document_parses/pmc_json/PMC7124374.xml.json PMC7124374.xml.json
!cp /content/document_parses/pmc_json/PMC8812323.xml.json PMC8812323.xml.json
!cp /content/document_parses/pmc_json/PMC7446676.xml.json PMC7446676.xml.json
!cp /content/document_parses/pmc_json/PMC7436596.xml.json PMC7436596.xml.json
!cp /content/document_parses/pmc_json/PMC8808276.xml.json PMC8808276.xml.json

In [None]:
!wc /content/document_parses/pmc_json/PMC9034168.xml.json PMC9034168.xml.json
!wc /content/document_parses/pmc_json/PMC7128104.xml.json PMC7128104.xml.json
!wc /content/document_parses/pmc_json/PMC8769777.xml.json PMC8769777.xml.json
!wc /content/document_parses/pmc_json/PMC7926205.xml.json PMC7926205.xml.json
!wc /content/document_parses/pmc_json/PMC8799642.xml.json PMC8799642.xml.json
!wc /content/document_parses/pmc_json/PMC7124374.xml.json PMC7124374.xml.json
!wc /content/document_parses/pmc_json/PMC8812323.xml.json PMC8812323.xml.json
!wc /content/document_parses/pmc_json/PMC7446676.xml.json PMC7446676.xml.json
!wc /content/document_parses/pmc_json/PMC7436596.xml.json PMC7436596.xml.json
!wc /content/document_parses/pmc_json/PMC8808276.xml.json PMC8808276.xml.json

In [None]:
# ls /content/document_parses/pmc_json >pms_json.txt

In [None]:
# !rm --recursive document_parses/pmc_json
# !rm --recursive document_parses/pdf_json
!rm document_parses.tar.gz

In [None]:
!du -sh /content/document_parses/pmc_json

28G	/content/document_parses/pmc_json


In [None]:
!mkdir /content/document_parses/pmc_json_sample

## Alternative: working with sample 100mw

In [None]:
!wget https://heibox.uni-heidelberg.de/f/b420d407463d4a728feb/?dl=1
!mv index.html?dl=1 document_parses_sample0.tar.gz


--2022-10-06 20:09:29--  https://heibox.uni-heidelberg.de/f/b420d407463d4a728feb/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/999fa69c-301d-4614-ad48-30cc4b5b15b6/document_parses_sample.tar.gz [following]
--2022-10-06 20:09:30--  https://heibox.uni-heidelberg.de/seafhttp/files/999fa69c-301d-4614-ad48-30cc4b5b15b6/document_parses_sample.tar.gz
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 200 OK
Length: 611303079 (583M) [application/octet-stream]
Saving to: ‘index.html?dl=1’


2022-10-06 20:10:09 (14.9 MB/s) - ‘index.html?dl=1’ saved [611303079/611303079]



In [None]:
!tar xvzf document_parses_sample0.tar.gz

In [None]:
!rm document_parses_sample0.tar.gz

In [None]:
!du -sh /content/document_parses/pmc_json_sample/

## Reading json directory and merging into text file(s)

Run this cell to create the class; then run the next cell to execute it on the directory "document_parses/pmc_json".

The class reads a directory of JSON files and writes their text either to a single text file or, with "split_by_docs=N", to several text files with N documents each.
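
To make the conversion easier to follow, here is a simplified sketch of what happens to one JSON file. It uses the same fields as the class below (`metadata` / `title`, and `body_text` entries with `section` and `text`), but it omits the `<doc>` and `<section>` tags, the section-name statistics and the error handling. The function name `json_to_text` and the example document are hypothetical.

```python
import json

def json_to_text(raw_json):
    """Return the title, section names and paragraphs of one article as plain text."""
    doc = json.loads(raw_json)
    parts = []
    title = doc.get('metadata', {}).get('title')
    if title:
        parts.append(title)
    current_section = ''
    for paragraph in doc.get('body_text', []):
        section = paragraph.get('section', '')
        # A section name is emitted once, when it differs from the previous one
        if section and section != current_section:
            current_section = section
            parts.append(section)
        parts.append(paragraph.get('text', ''))
    return '\n\n'.join(parts)

# Hypothetical miniature document, not taken from the corpus
example = json.dumps({
    'metadata': {'title': 'A short title'},
    'body_text': [
        {'section': 'Introduction', 'text': 'First paragraph.'},
        {'section': 'Introduction', 'text': 'Second paragraph.'},
        {'section': 'Methods', 'text': 'Third paragraph.'},
    ],
})
print(json_to_text(example))
```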

In [None]:
# -*- coding: utf-8 -*-
# Python script to open each file, read json input and copy to one text file for subsequent processing
import os, re, sys
import json
from collections import defaultdict

class clJsonDir2txt(object):
    '''
    @author Bogdan Babych, IÜD, Heidelberg University
    @email bogdan [dot] babych [at] iued [dot] uni-heidelberg [dot] de
    a script for processing covid-19 corpus:
    @url https://www.semanticscholar.org/cord19 @url https://www.semanticscholar.org/cord19/download
        recursively reads files from a directory, and glues them together into a single corpus file

    @todo:
        working with sections - collect titles of all sections; frequent sections; select argumentative sections (e.g., discussion, analysis...)
        - to compare descriptive and argumentative parts of the corpus

        experimenting with different annotations (pos, parsing... ); MT quality evaluation...
    '''
    def __init__(self, SDirName, output_file = 'corpus_out.txt', textfilter=None, include_title = True, include_sectionNames = True, include_refs = True, include_authors = True, tag='doc', id=1000000, split_by_docs = 0, copy_docs = 0): # initialising by opening the directories
        self.SOutput_file = output_file
        self.STextFilter = textfilter
        # compile the filter only if one is given: re.compile(None) raises a TypeError
        self.RFilter = re.compile(textfilter, re.IGNORECASE | re.MULTILINE) if textfilter else None
        self.BInclTitle = include_title # implemented
        self.BInclSectionNames = include_sectionNames # implemented
        self.BInclRefs = include_refs # not implemented yet
        self.BInclAuth = include_authors # not implemented yet
        self.STag = tag
        self.ID = id
        self.ISplitByDocs = int(split_by_docs)
        self.ICopyDocs = int(copy_docs)
        # global dictionary of section names (to check and make rules...)
        self.DSectNames = defaultdict(int)
        # print(self.ISplitByDocs)
        self.openDir(SDirName)
        self.printDictionary(self.DSectNames, 'corpus-section-names.txt')
        return


    def openDir(self, path): # recursively walks the directories under the given root directory and reads each file into a string
        i = 0
        path_sample = path + '_sample'
        if self.ISplitByDocs:
            SPartFile = "part1000000" + self.SOutput_file
            FOut = open(SPartFile, 'w')
        else:
            FOut = open(self.SOutput_file, 'w')

        for root,d_names,f_names in os.walk(path):
            for f in f_names:
                i+=1
                if i%10000==0: print(str(i) + '. Processing: ' + f)
                fullpath = os.path.join(root, f)
                # print(fullpath)
                try:
                    # the with-block closes the input file even if the document is filtered out
                    with open(fullpath,'r') as FIn:
                        SIn = FIn.read()
                    # apply text filter, if not None
                    if self.STextFilter and (re.search(self.RFilter, SIn) == None): continue
                    SText2Write = self.procFile(SIn,f,i)
                    if SText2Write: FOut.write(SText2Write) # if the string is not empty then write to file
                except Exception:
                    print(f'file {f} cannot be read or processed')
                finally:
                    # splitting output into chunks of "split_by_docs" size
                    if self.ISplitByDocs and (i % self.ISplitByDocs == 0): # if self.ISplitByDocs == 0 then everything goes into one file; if this > 0 then
                        SPartFile = "part" + str(1000000 + i) + self.SOutput_file # generate new file name
                        FOut.flush()
                        FOut.close()
                        FOut = open(SPartFile, 'w')
                    if self.ICopyDocs and (i >= self.ICopyDocs) and (i < (self.ICopyDocs + self.ISplitByDocs)):
                        try:
                            SOutputDirN = root + '_sample'
                            SOutputFN = os.path.join(SOutputDirN, f)
                            os.system(f'cp {fullpath} {SOutputFN}')
                        except:
                            print('.')

        FOut.flush()
        FOut.close()

        return


    def procFile(self, SIn,SFNameIn,i): # sending each json string for extraction of text and attaching the opening and closing tags to the output string
        STagOpen = '<' + self.STag + ' id="' + self.STag + str(self.ID + i)  + '">\n'
        STagClose = '\n</' + self.STag + '>\n\n'
        SText4Corpus = self.getJson(SIn, SFNameIn)
        if SText4Corpus:
            return STagOpen + SText4Corpus + STagClose
        else:
            print('\tNo data read from: ' + SFNameIn)
            return None


    def getJson(self, SIn, SFNameIn): # for each file-level string read from a file: managing internal structure of the covid-19 json file
        LOut = [] # collecting a list of strings
        try:
            DDoc = json.loads(SIn)
        except:
            print('\t\t' + SFNameIn + ' => error reading json2dictionary')
            return None
        # metadata:
        try:
            DMetaData = DDoc['metadata']
            if DMetaData:
                SMetaData = self.getJson_Metadata(DMetaData)
                if SMetaData: LOut.append(SMetaData)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no metadata')
            DMetaData = None
        # body text
        try:
            LBodyText = DDoc['body_text']
            if LBodyText:
                SBodyText = self.getJson_BodyText(LBodyText)
                LOut.append(SBodyText)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no body_text')
            LBodyText = None
        # further: to implement references

        SText = '\n\n'.join(LOut)
        return SText


    def getJson_Metadata(self, DIn): # converts interesting parts of metadata into a string
        SMetadata = ''
        LMetadata = []
        try: STitle = DIn["title"]
        except: STitle = None
        if STitle and self.BInclTitle:
            LMetadata.append(STitle)

        # to implement reading of authors' names

        if LMetadata: SMetadata = '\n\n'.join(LMetadata)
        return SMetadata


    def getJson_BodyText(self, LIn): # converts interesting parts of the body texts into a string
        SBodyText = ''
        LBodyText = []
        SSectionName0 = '' # current section name set to empty for a new text
        # todo: later, in post-processing stage for the whole corpus, maybe after lemmatization...
        # ISampleNumber = 0 # samples of 100 words in text; they do not cross section boundaries 

        for DParagraph in LIn:
            # sections added 2022-09-28
            try:
                # DParagraphs["section"] ## distinction between different sections....
                SSectionName = DParagraph["section"]
                # normalizing new section name (1)
                SSectionName = SSectionName.replace("\n", " ")

                if self.BInclSectionNames and SSectionName: # if we opted to include section names and section name is not empty
                    if SSectionName != SSectionName0: # if we found a new section name
                        # processing section name
                        SSectionName0 = SSectionName # change the current section name

                        # normalizing section name (2)
                        SSectionNameNorm = SSectionName.lower()
                        SSectionNameNorm = re.sub(r'[0-9.]+', ' ', SSectionNameNorm) # raw string avoids the invalid escape sequence '\.'
                        SSectionNameNorm = re.sub('[ ]+', ' ', SSectionNameNorm)
                        SSectionNameNorm = SSectionNameNorm.strip()
                        
                        self.DSectNames[SSectionNameNorm] += 1
                        SSect4text = f'<section sName="{SSectionNameNorm}">\n{SSectionName}\n</section>'

                        LBodyText.append(SSect4text)
            except:
                print('S!',)
                continue
            # first original section (we extract text after extracting section name)
            try:
                ## DParagraphs[section] ## -- later on >> distinction between different sections....
                SParagraph = DParagraph["text"]
                LBodyText.append(SParagraph)
            except:
                print('!',)
                continue



        SBodyText = '\n\n'.join(LBodyText)
        return SBodyText

    def printDictionary(self, DFreq, SFOutDict):
        FOutDict = open(SFOutDict, 'w')
        for key, val in sorted(DFreq.items(), key=lambda x: x[1], reverse=True):
            FOutDict.write(f'{key}\t{str(val)}\n')
        FOutDict.flush()
        FOutDict.close()
    



# arguments:
'''
        sys.argv[1], # obligatory: input directory name;
            other arguments optional:
            output_file = 'covid19corpus.txt',
            textfilter = None, # if this is string, only texts containing it are collected, e.g., covid
            include_title = True, # include or exclude title
            include_refs = False, # not implemented yet: include or exclude references
            split_by_docs=0 # split by groups of n documents; if 0 then write to one file

'''

'''if __name__ == '__main__':
    OJsonDir2txt = clJsonDir2txt(sys.argv[1], output_file = 'covid19corpus.txt', textfilter=None, include_title = True, include_sectionNames = True, include_refs = False, split_by_docs=0, copy_docs=240000)
'''

## full corpus:

In [None]:
# remove parameter textfilter='covid', to return all documents
# change parameter split_by_docs=40000 to split_by_docs=0 to return a single file instead of ~5 parts with <40000 in each
OJsonDir2txt = clJsonDir2txt("document_parses/pmc_json", output_file = 'cord19.txt', textfilter='covid', include_title = True, include_sectionNames = True, include_refs = False, split_by_docs=40000, copy_docs=240000)
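
The `textfilter` argument is compiled as a regular expression with the `re.IGNORECASE` and `re.MULTILINE` flags and searched in the raw content of each JSON file, so a match anywhere in the file keeps the document. A minimal sketch of that test is given below; the function name `passes_filter` and the example strings are hypothetical.

```python
import re

def passes_filter(raw_text, textfilter):
    """Return True if no filter is set or the pattern occurs anywhere in raw_text."""
    if not textfilter:
        return True
    return re.search(textfilter, raw_text, re.IGNORECASE | re.MULTILINE) is not None

print(passes_filter('{"title": "COVID-19 outcomes"}', 'covid'))
print(passes_filter('{"title": "Influenza in 2009"}', 'covid'))
```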


## numbers from previous version
The cell above reads the JSON files into a single file (or multiple files).

Change the value of "split_by_docs=0" to "split_by_docs=10000" or any other number; this will create several corpus files with 10000 (or the chosen number of) documents per file.
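
The names of the part files follow from the file counter used in the class: a new file is opened after every `split_by_docs` input files, and its name contains 1000000 plus the counter value at that point. The sketch below reproduces this naming scheme; the function name `part_file_name` is hypothetical. Note that the counter includes files rejected by the text filter, so a part can contain fewer documents than `split_by_docs`.

```python
def part_file_name(i, split_by_docs, output_file):
    """Return the name of the output file that receives the i-th input file (1-based)."""
    if not split_by_docs:
        return output_file  # everything goes into one file
    offset = ((i - 1) // split_by_docs) * split_by_docs
    return 'part' + str(1000000 + offset) + output_file

for i in (1, 40000, 40001, 280001):
    print(i, part_file_name(i, 40000, 'cord19.txt'))
```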


Approximate execution time ~10 min


File size to download ~4.3 GB

It contains ~198.000 documents,

~ 671.578.587 words

~ 19.381.647 paragraphs (including empty lines, i.e., ~10M real paragraphs)

~ 4.619.100.883 characters

## numbers for the last version
Approximate execution time ~20 min

8 BNC-size (100mw) files are generated, together containing

~ 315.000 documents

  827.118.629 words

   22.094.653 paragraphs

5.650.921.315 characters


Download time can take up to 1 hour depending on your connection speed.

To split into ~BNC-size chunks (100MW), split into groups of ~40000 documents (set "split_by_docs=40000" in the cell that runs the conversion)


Frequent section names are written to corpus-section-names.txt (ordered by descending frequency, from highest to 1)
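
The section names are normalised before they are counted: line breaks are replaced by spaces, the name is lower-cased, digits and dots are removed, and repeated spaces are collapsed. The following sketch repeats these steps outside the class and counts a few hypothetical section names with `collections.Counter`; the function name `normalize_section_name` is an assumption.

```python
import re
from collections import Counter

def normalize_section_name(name):
    """Lower-case a section name and strip numbering, dots and extra spaces."""
    name = name.replace('\n', ' ').lower()
    name = re.sub(r'[0-9.]+', ' ', name)
    name = re.sub(r' +', ' ', name)
    return name.strip()

# Hypothetical section names, not taken from the corpus
names = ['1. Introduction', '2.1 Materials and Methods', 'INTRODUCTION', '3. Results']
counts = Counter(normalize_section_name(n) for n in names)
for name, freq in counts.most_common():
    print(f'{name}\t{freq}')
```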

In [None]:
!head --lines=500000 part1240000cord19.txt >part1240000cord19-500k.txt

In [None]:
!ls document_parses/pmc_json_sample | wc -l
!du -sh document_parses/pmc_json_sample

40000
3.6G	document_parses/pmc_json_sample


In [None]:
!tar cvzf document_parses_sample.tar.gz document_parses/pmc_json_sample

In [None]:
!head --lines=75 /content/corpus-section-names.txt

In [None]:
!head --lines=1000 /content/corpus-section-names.txt > corpus-selection-names-top1000.txt

If you have split the text into parts, you can see the number of words in each part using this command:

In [None]:
!wc part*

   4085687  108914903  752390539 part1000000cord19.txt
   4083605  109685695  756986287 part1040000cord19.txt
   4117431  110113881  760911812 part1080000cord19.txt
   4085114  109782614  757565426 part1120000cord19.txt
   4156737  110454560  764015682 part1160000cord19.txt
   4051419  109593018  756536541 part1200000cord19.txt
   4110526  110115301  761073243 part1240000cord19.txt
   3639938   98300665  678104874 part1280000cord19.txt
  32330457  866960637 5987584404 total


In [None]:
!gzip part1000000cord19.txt
!gzip part1040000cord19.txt
!gzip part1080000cord19.txt
!gzip part1120000cord19.txt
!gzip part1160000cord19.txt
!gzip part1200000cord19.txt
!gzip part1240000cord19.txt
!gzip part1280000cord19.txt

## Alternative: working with sample 100mw

In [None]:
OJsonDir2txt = clJsonDir2txt("document_parses/pmc_json_sample", output_file = 'cord19.txt', textfilter='covid', include_title = True, include_sectionNames = True, include_refs = False, split_by_docs=0, copy_docs=0)


10000. Processing: PMC7531324.xml.json
20000. Processing: PMC8588348.xml.json
30000. Processing: PMC8897106.xml.json
40000. Processing: PMC8634281.xml.json


In [None]:
!head --lines=10000 cord19.txt >cord19_10k.txt

To see the number of lines (paragraphs), words and bytes in your corpus, you can use this command:

In [None]:
!wc cord19_10k.txt
!wc cord19.txt

  10000  278029 1915103 cord19_10k.txt
  4110512 110114834 761070060 cord19.txt
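
The three numbers printed by `wc` are the counts of newline characters, whitespace-separated words and bytes. If the shell command is not available, the same counts can be approximated in Python. This is a sketch: the function names are hypothetical, and the word count may differ slightly from `wc` for unusual whitespace characters.

```python
def wc_counts(binary_lines):
    """Return (lines, words, bytes) for an iterable of byte strings, one per line."""
    n_lines = n_words = n_bytes = 0
    for line in binary_lines:
        n_lines += line.count(b'\n')
        n_words += len(line.split())
        n_bytes += len(line)
    return n_lines, n_words, n_bytes

def wc_file(path):
    """Stream a file in binary mode, so that large corpus files are not loaded at once."""
    with open(path, 'rb') as f:
        return wc_counts(f)

print(wc_counts([b'a b c\n', b'\n', b'd e\n']))
```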


## TreeTagger run on corpus

In [None]:
%%bash
rm -r treetagger/

In [None]:
%%bash
# downloading and testing TreeTagger
mkdir treetagger
cd treetagger
# Download the tagger package for your system (PC-Linux, Mac OS-X, ARM64, ARMHF, ARM-Android, PPC64le-Linux).
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tree-tagger-linux-3.2.4.tar.gz
tar -xzvf tree-tagger-linux-3.2.4.tar.gz
# Download the tagging scripts into the same directory.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/tagger-scripts.tar.gz
gunzip tagger-scripts.tar.gz
# Download the installation script install-tagger.sh.
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/install-tagger.sh
# Download the parameter files for the languages you want to process.
# list of all files (parameter files) https://cis.lmu.de/~schmid/tools/TreeTagger/#parfiles
wget https://cis.lmu.de/~schmid/tools/TreeTagger/data/english.par.gz
sh install-tagger.sh
cd ..
sudo pip install treetaggerwrapper
# changing options: no-unknown, sgml, lemma
mv /content/treetagger/cmd/tree-tagger-english /content/tree-tagger-english0
awk '{ if (NR == 9) print "OPTIONS=\"-token -lemma -sgml -no-unknown\""; else print $0}' /content/tree-tagger-english0 > /content/treetagger/cmd/tree-tagger-english
chmod a+x ./treetagger/cmd/tree-tagger-english

# downloading German and Georgian 
wget https://heibox.uni-heidelberg.de/f/ec8226edebb64a359407/?dl=1
mv index.html?dl=1 /content/treetagger/lib/german-utf8.par
wget https://heibox.uni-heidelberg.de/f/9183090d2bdb41e09055/?dl=1
mv index.html?dl=1 /content/treetagger/lib/georgian.par

wget https://heibox.uni-heidelberg.de/f/9cafab0509d64ed1ac4b/?dl=1
mv index.html?dl=1 /content/treetagger/cmd/tree-tagger-georgian2
# German2 = -no-unknown
# note: tree-tagger-german will not work, as parameter files have not been downloaded, only use tree-tagger-german2
wget https://heibox.uni-heidelberg.de/f/acb9b8a2fa4f40e08f8a/?dl=1
mv index.html?dl=1 /content/treetagger/cmd/tree-tagger-german2
chmod a+x /content/treetagger/cmd/tree-tagger-georgian2
chmod a+x /content/treetagger/cmd/tree-tagger-german2

# test text download
wget https://heibox.uni-heidelberg.de/f/cdf240db84ca4718b718/?dl=1
mv index.html?dl=1 go1984en.txt
wget https://heibox.uni-heidelberg.de/f/ea06aa47fe2d49959a62/?dl=1
mv index.html?dl=1 go1984de.txt
wget https://heibox.uni-heidelberg.de/f/318b32556cdc44d38238/?dl=1
mv index.html?dl=1 go1984ka.txt
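
The awk command in the installation cell replaces line 9 of the `tree-tagger-english` script with a new `OPTIONS=` assignment, which only works while that assignment stays on line 9. A slightly more robust variant, sketched below in Python, replaces whichever line starts with `OPTIONS=`. That the script contains such a line is an assumption based on the awk command; the function name `set_options` and the example script text are hypothetical.

```python
def set_options(script_text, options):
    """Replace every line starting with 'OPTIONS=' by a new OPTIONS assignment."""
    new_line = f'OPTIONS="{options}"'
    lines = script_text.split('\n')
    return '\n'.join(new_line if line.startswith('OPTIONS=') else line
                     for line in lines)

# Hypothetical script fragment, not the real tree-tagger-english file
script = '#!/bin/sh\nOPTIONS="-token -lemma -sgml"\necho tagging\n'
print(set_options(script, '-token -lemma -sgml -no-unknown'))
```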


In [None]:
!pwd

In [None]:
# sample - Tagging
!./treetagger/cmd/tree-tagger-english go1984en.txt >go1984en_2_vert.txt
!./treetagger/cmd/tree-tagger-german2 go1984de.txt >go1984de_2_vert.txt
!./treetagger/cmd/tree-tagger-georgian2 go1984ka.txt >go1984ka_2_vert.txt

In [None]:
!./treetagger/cmd/tree-tagger-english cord19_10k.txt >cord19_10k.vert

In [None]:
!./treetagger/cmd/tree-tagger-english cord19.txt >cord19.vert

	reading parameters ...
	tagging ...
126567000	 finished.


In [None]:
!tar cvzf cord19-vert.tgz cord19.vert

cord19.vert


In [None]:
!awk -F '\t' '{if(NF==3) printf "%s ", $3; else printf "\n%s\n", $0}' < cord19_10k.vert >cord19_10k.lem
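
The awk command above turns the vertical TreeTagger output into running lemmatised text: a line with three tab-separated fields (token, tag, lemma) contributes its lemma followed by a space, and any other line (such as an SGML tag) is printed on a line of its own. The same logic as a Python sketch, with the hypothetical name `vert_to_lemmas`:

```python
def vert_to_lemmas(lines):
    """Convert vertical-format lines (token, tag, lemma) into lemmatised text."""
    out = []
    for line in lines:
        line = line.rstrip('\n')
        fields = line.split('\t')
        if len(fields) == 3:
            out.append(fields[2] + ' ')       # keep only the lemma
        else:
            out.append('\n' + line + '\n')    # tags and other lines stay intact
    return ''.join(out)

# Hypothetical tagger output, not taken from the corpus
vert = ['<doc id="doc1">\n', 'Viruses\tNNS\tvirus\n', 'spread\tVBP\tspread\n', '</doc>\n']
print(vert_to_lemmas(vert))
```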

In [None]:
# !awk -F '\t' '(NF==3){printf "%s ", $3; if(FNR % 10000 == 0){printf "\n"}}' < cord19_1k.vert >cord19_1k.lem
# !awk -F '\t' '{if(NF==3) printf "%s ", $3; else printf "\n%s\n", $0}' < cord19_1k.vert >cord19_1k02.lem

!awk -F '\t' '{if(NF==3) printf "%s ", $3; else printf "\n%s\n", $0}' < cord19.vert >cord19.lem

In [None]:
!tar cvzf cord19-lem.tgz cord19.lem

cord19.lem


In [None]:
# normalising xml: each <section ...> element is closed before the next section or at the end of the document
import os, re, sys
BStartDoc = False # True from a <doc> tag until the first <section> tag of that document
# for the full corpus use 'cord19.lem' and 'cord19.lems' instead
with open('cord19_10k.lem', 'r') as file, open('cord19_10k.lems', 'w') as FOut:
    for line in file:
        if line.startswith('<doc'): BStartDoc = True
        if line.startswith('</section>'): continue
        if line.startswith('<section') and BStartDoc == True: BStartDoc = False; FOut.write(line); continue
        if line.startswith('<section') and BStartDoc == False: FOut.write('</section>\n' + line); continue
        # close the last section only if the document had at least one section
        if line.startswith('</doc>') and BStartDoc == False: FOut.write('</section>\n' + line); continue
        FOut.write(line)


In [None]:
!head --lines=20 cord19.lems

In [None]:
!tar cvzf cord19-lems.tgz cord19.lems

cord19.lems


## downloading the file which is ready...

In [None]:
!wget https://heibox.uni-heidelberg.de/f/3bf283c19c5742d1ac3b/?dl=1
!mv index.html?dl=1 cord19-lems.tgz

In [None]:
!wget https://heibox.uni-heidelberg.de/f/69e0c866bf3c4ccc8435/?dl=1
!mv index.html?dl=1 cord19-10k.lems

In [None]:
!tar xvzf cord19-lems.tgz
!wc cord19.lems

cord19.lems
  1437158 128869077 762343572 cord19.lems


In [None]:
!wc cord19-10k.lems
!head --lines=20 cord19-10k.lems

## end: lemmatization

## Sections and 100-word samples




In [None]:
# downloading mapping rules
!wget https://heibox.uni-heidelberg.de/f/32342a3aa0d04a259bf9/?dl=1
!mv index.html?dl=1 covid-sections-and-keywords.zip


In [None]:

!unzip covid-sections-and-keywords.zip
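
The mapping rules are read by the class below from tab-separated files in which the first field is a word or pattern and the second field is its category. For each category the patterns are joined with `|` into one regular expression, and the categories are tried in a fixed priority order. The sketch below shows this idea on hypothetical data. The names `build_rules` and `classify` are assumptions, and unlike the class, the sketch skips categories without patterns, because an empty regular expression would match every string.

```python
import re

def build_rules(tsv_lines, priority):
    """Return a list of (category, compiled regex) pairs in priority order."""
    words = {category: [] for category in priority}
    for line in tsv_lines:
        fields = line.strip().split('\t')
        if len(fields) < 2:
            continue
        pattern, category = fields[0].strip(), fields[1].strip()
        if category in words:
            words[category].append(pattern)
    return [(c, re.compile('|'.join(words[c]))) for c in priority if words[c]]

def classify(section_name, rules):
    """Return the first category whose regex matches, or None."""
    for category, regex in rules:
        if regex.search(section_name):
            return category
    return None

# Hypothetical rules, not the content of the downloaded files
lines = ['introduc\tIntroduction', 'method\tMethods', 'material\tMethods']
rules = build_rules(lines, ['Introduction', 'Methods', 'Results'])
print(classify('materials and methods', rules))
```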

In [20]:
import os, re, sys
from bs4 import BeautifulSoup
class clXml2Stat(object):
    '''
    The class will create statistics for documents and sections marked in the file
    '''
    def __init__(self, SFXmlInput, output_file = 'corpus_out_stat.txt'):
        # FXmlInput = open(SFXmlInput, 'r')
        with open(SFXmlInput, 'r') as file:
            data = file.read()
        FTsvOutput = open(output_file, 'w')
        FTsvOutputNum = open(output_file + '_n.txt', 'w')

        print('file read into memory')
        FSectionMap = open('covid-sections.tsv', 'r')
        self.DSectionMap = self.readTsv2dict(FSectionMap, 0, 2)
        FSectionRules = open('covid_section_rules.tsv', 'r')
        FSectionLexRND = open('covid-random.tsv', 'r')
        FSectionLexMI = open('covid-mi.tsv', 'r')
        self.LTMapRules = self.readTsv2reRules(FSectionRules, 0, 1)
        self.LTMapLexRND = self.readTsv2reRules(FSectionLexRND, 0, 1, priority = ['KEY_N', 'M_ARG', 'EVAL'], prefix = ' ')
        self.LTMapLexMI = self.readTsv2reRules(FSectionLexMI, 0, 1, priority = ['KEY_N', 'M_ARG', 'EVAL'], prefix = ' ')
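        count = self.runMain(data, FTsvOutput, samplesize = 100)
        print(str(count))
        # closing the files, so that all buffered rows are written to disk
        for F in (FTsvOutput, FTsvOutputNum, FSectionMap, FSectionRules, FSectionLexRND, FSectionLexMI): F.close()

        return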

    def readTsv2dict(self, FInTSV, IFieldLeft, IFieldRight):
        DMap = {}
        for SLine in FInTSV:
            SLine = SLine.strip()
            LLine = re.split('\t', SLine)
            try:
                k = LLine[IFieldLeft]
                v = LLine[IFieldRight]
                DMap[k] = v
            except:
                continue
        return DMap

    def readTsv2reRules(self, FInTSV, IFieldLeft, IFieldRight, priority = ['Introduction', 'Statements', 'Methods', 'Conclusion', 'Discussion', 'Presentation', 'Background', 'Results', 'Remove'], prefix = None, suffix = None):
        LTMap = []
        DRe = {}
        # initialising dictionary where the mapping will be done, then reversing
        for el in priority:
            DRe[el] = [] # first the list is empty
        for SLine in FInTSV:
            SLine = SLine.strip()
            LLine = re.split('\t', SLine)
            try:
                v = LLine[IFieldLeft]; v = v.strip()
                k = LLine[IFieldRight]; k = k.strip()
                if prefix:
                    v = prefix + v
                if suffix:
                    v = v + suffix
            except:
                sys.stdout.write('f')
                continue

            try:
                LWords2map = DRe[k]
                LWords2map.append(v) # adding new word for the mapping
                DRe[k] = LWords2map
            except:
                sys.stdout.write('d')
                continue

        for el in priority:
            print('el:' + el)
            LRe = DRe[el]
            SRe2map = '|'.join(LRe)
            print('\t' + SRe2map)
            RE4map = re.compile(SRe2map)
            TMap = (el, RE4map)
            LTMap.append(TMap)

        return LTMap

    # use instead the function readTsv2reRules (above)
    def readTsv2lexMatch(self, FInTSV, IFieldLeft, IFieldRight, priority = ['KEY_N', 'EVAL', 'M_ARG']):
        LTMap = []
        return




    def runMain(self, data, FTsvOutput, samplesize = 100):
        soup = BeautifulSoup(data)
        i = 0
        gnorm = 0
        grule = 0
        gnomatch = 0
        # FTsvOutput.write(str(docnumber) + '\t' + str(section_number) + '\t' + str(count_samples) + '\t' + str(lensample) + '\t' + str(lendict)  + '\t' + str(SCatRnd)  + '\t' + str(SCatMI) + '\t' + SCatNorm + '\t' + SSectNameNorm + '\t' + sw + str(SCatRndWords)  + '\t' + str(SCatMIWords) + '\n')
        # new string (found words in front)
        # FTsvOutput.write(str(docnumber) + '\t' + str(section_number) + '\t' + str(count_samples) + '\t' + str(lensample) + '\t' + str(lendict)  + '\t' + str(SCatRnd)  + '\t' + str(SCatMI) + '\t' + SCatNorm + '\t' + SSectNameNorm + '\t' + str(SCatRndWords)  + '\t' + str(SCatMIWords) + sw + '\n')

        FTsvOutput.write('docnumber' + '\t' + 'section_number' + '\t' + 'count_samples' + '\t' + 'lensample' + '\t' + 'lendict'  + '\t' + 'nKN_Rnd'  + '\t' + 'nMA_Rnd'  + '\t' + 'nEV_Rnd'  + '\t' + 'nKN_MI'  + '\t' + 'nMA_MI'  + '\t' + 'nEV_MI' + '\t' + 'SCatNorm' + '\t' + 'SSectNameNorm' + '\t' + 'wKN_RndWords' + '\t' + 'wMA_RndWords' + '\t' + 'wEV_RndWords'  + '\t' + 'wKN_MIWords' + '\t' + 'wMA_MIWords' + '\t' + 'wEV_MIWords' + '\t' + 'sw' + '\n')
        for doc in soup.find_all('doc'):
            # print(type(str(doc)))
            i+=1
            norm, rule, nomatch = self.procDoc(str(doc), FTsvOutput, i, samplesize)
            gnorm += norm; grule += rule; gnomatch += nomatch
            if i%1000 == 0:
                print(str(i))
            # print(str(i))
        print(f'\tgnorm={gnorm}\tgrule={grule}\tgnonorm={gnomatch}')
        return i

    def procDoc(self, SDoc, FTsvOutput, docnumber, samplesize = 100):
        SDoc = SDoc.replace('\n', ' ')
        SDoc = SDoc.replace('\t', ' ')
        # LSections = re.findall('<section sname=\"([^\"]+)\">', SDoc, re.DOTALL | re.MULTILINE | re.IGNORECASE)
        LTSections = re.findall('<section sname=\"([^\"]+)\">([^<]+)</section>', SDoc, re.DOTALL | re.MULTILINE | re.IGNORECASE)
        # LSections = re.findall('<section([^<]+)</section>', SDoc, re.DOTALL | re.MULTILINE | re.IGNORECASE)
        norm = 0
        nomatch = 0
        rule = 0
        section_number = 0
        for TSection in LTSections:
            SSectName, SSectionText = TSection
            SCatFound = False
            section_number += 1

            if SSectName in self.DSectionMap.keys():
                SSectNameNorm = self.DSectionMap[SSectName]
                norm += 1
                SCatNorm = 'MAP'
                SCatFound = True

            else:
                for category, regexpression in self.LTMapRules:
                    if re.search(regexpression, SSectName):
                        SSectNameNorm = category
                        SCatNorm = 'RULE'
                        SCatFound = True
                        rule += 1
                        break
                if SCatFound == False:
                    SSectNameNorm = SSectName
                    SCatNorm = 'NONORM'
                    nomatch += 1
            LSectionText = SSectionText.split(' ')
            GLLSectionSamples = self.divide_chunks(LSectionText, samplesize)
            LLSectionSamples = list(GLLSectionSamples)
            LSectionSamplesLast = LLSectionSamples[-1]
            try:
                if len(LSectionSamplesLast) < samplesize:
                    LLSectionSamples[-2].extend(LSectionSamplesLast)
                    LLSectionSamples.pop()
            except:
                pass

            '''
            print(LLSectionSamples)
            for lw in LLSectionSamples:
                print(len(lw))
            print(' ')
            '''
            count_samples = 0
            for lw in LLSectionSamples:
                count_samples += 1
                # print(len(lw))
                # find the size of dictionary for this sample of 100 words
                setlw = set(lw)
                lendict = str(len(setlw))

                lensample = str(len(lw))

                # create string from list of 100 words
                sw = ' '.join(lw)
                # count key_n here!!!
                LCatRnd = []
                LCatMI = []
                LCatRndWords = []
                LCatMIWords = []

                for category, regexpression in self.LTMapLexRND:
                    LCatXr = re.findall(regexpression, sw)
                    LCatRnd.append(str(len(LCatXr)))
                    SCatXr = ' '.join(LCatXr)
                    LCatRndWords.append(SCatXr)
                for category, regexpression in self.LTMapLexMI:
                    LCatXm = re.findall(regexpression, sw)
                    LCatMI.append(str(len(LCatXm)))  # MI counts belong in LCatMI, not LCatRnd
                    SCatXm = ' '.join(LCatXm)
                    LCatMIWords.append(SCatXm)

                SCatRnd = '\t'.join(LCatRnd)
                SCatMI = '\t'.join(LCatMI)
                SCatRndWords = '\t'.join(LCatRndWords)
                SCatMIWords = '\t'.join(LCatMIWords)

                # one tab-separated record per sample; the sample text is the last column
                LRecord = [str(docnumber), str(section_number), str(count_samples), lensample, lendict, SCatRnd, SCatMI, SCatNorm, SSectNameNorm, SCatRndWords, SCatMIWords, sw]
                FTsvOutput.write('\t'.join(LRecord) + '\n')



            # FTsvOutput.write(str(docnumber) + '\t' + str(section_number) + '\t' + SCatNorm + '\t' + SSectNameNorm + '\t' + SSectionText + '\n')
            # print(TSection)
        print(f'norm={norm}\trule={rule}\tnonorm={nomatch}')
        return norm, rule, nomatch


    def divide_chunks(self, l, n):
        # yield successive chunks of n items from list l; the last chunk may be shorter
        for i in range(0, len(l), n):
            yield l[i:i + n]

    def procSection(self, SSectionText):
        LSectionSamples = []
        return LSectionSamples


    '''
    def runMain(self, data, FTsvOutput):
        soup = BeautifulSoup(data)
        i = 0
        for doc in soup.find_all('doc'):
            # print(doc)
            sections = soup.find_all('section', doc)
            # children = soup.findChildren()
            k = 0
            for section in sections:
                k += 1
                # print(str(k))
                # print(section)
            print('\t' + str(k))

            i += 1
            print('.')

        return i
        '''

# end: class

'''if __name__ == '__main__':
    OXml2Stat = clXml2Stat('/content/cord19_1k.lem', output_file = '/content/cord19_1k_stat.txt')

'''

In [None]:
OXml2Stat = clXml2Stat('/content/cord19-10k.lems', output_file = '/content/cord19_10k_stat.txt')
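
The statistics class above normalises each section name in two steps: an exact lookup in `DSectionMap`, then the first matching regular expression in `LTMapRules`; names that match neither are kept unchanged. A minimal standalone sketch of that lookup order follows. The table and the rules in it are hypothetical and serve only as an illustration; the real ones are built elsewhere in the notebook.

```python
import re

# Hypothetical lookup table and rules, for illustration only.
section_map = {"materials and methods": "methods", "conclusions": "conclusion"}
map_rules = [
    ("introduction", re.compile(r"\b(introduction|background)\b")),
    ("discussion", re.compile(r"\bdiscussion\b")),
]

def normalise_section_name(name):
    """Return (normalised_name, source); source is 'MAP', 'RULE' or 'NONORM'."""
    if name in section_map:                  # 1. exact dictionary match
        return section_map[name], "MAP"
    for category, pattern in map_rules:      # 2. first matching rule wins
        if pattern.search(name):
            return category, "RULE"
    return name, "NONORM"                    # 3. keep the name unchanged

for heading in ["conclusions", "general discussion", "funding"]:
    print(heading, "->", normalise_section_name(heading))
```

Because the rules are tried in order, more specific patterns should come before more general ones.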

In [22]:
!head --lines=74 cord19_10k_stat.txt >cord19_1ks_stat.txt

In [None]:
OXml2Stat = clXml2Stat('/content/cord19.lems', output_file = '/content/cord19_stat.txt')

In [None]:
!head --lines=30 cord19_stat.txt
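
Each row of the statistics file describes one fixed-size sample of a section. The class above produces the samples with `divide_chunks` and then merges a short trailing sample into the previous one. The sketch below reproduces that behaviour outside the class; the function name `split_into_samples` is an assumption made for this example.

```python
def split_into_samples(words, sample_size):
    """Split a list of words into samples of sample_size words.

    A trailing sample shorter than sample_size is merged into the
    previous sample, so only a section that is itself shorter than
    sample_size produces a short sample.
    """
    samples = [words[i:i + sample_size] for i in range(0, len(words), sample_size)]
    if len(samples) > 1 and len(samples[-1]) < sample_size:
        last = samples.pop()
        samples[-1].extend(last)
    return samples

words = [f"w{i}" for i in range(250)]
print([len(s) for s in split_into_samples(words, 100)])
```

With 250 words and a sample size of 100, the third chunk of 50 words is appended to the second, giving samples of 100 and 150 words.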

# Route via JSON: test approach
## Working with the 100MW sample (40k texts)
### Selecting 100-word samples, writing them to a JSON dictionary, mapping section names, recording

In [None]:
# output directory
!mkdir document_parses/pmc_json_sample02/

In [None]:
# modifying script to select 100-word long samples
'''
algorithm:
    1. form sections as strings
    2. process sections (map names), create a record
    3. write samples in a python dictionary

    ? do we need xml output to a file ?

    architecture:
      - create list of sections from list of paragraphs
      - process each section, splitting it into samples of a pre-defined size

'''
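
The three steps of the algorithm above can be sketched on a toy document as follows. This is only an illustration under stated assumptions: the paragraph keys `section` and `text` follow the CORD19 `body_text` layout used in this notebook, while the function names and the keys of the output records are hypothetical.

```python
import json

def paragraphs_to_sections(paragraphs):
    """Step 1: group consecutive paragraphs that share a section name."""
    sections = []  # list of (section_name, section_text) pairs
    for par in paragraphs:
        name = par.get("section", "")
        text = par.get("text", "")
        if sections and sections[-1][0] == name:
            sections[-1] = (name, sections[-1][1] + "\n" + text)
        else:
            sections.append((name, text))
    return sections

def sections_to_records(sections, sample_size=100):
    """Steps 2 and 3: split each section into samples and store them as dictionaries."""
    records = []
    for sect_no, (name, text) in enumerate(sections, start=1):
        words = text.split()
        starts = range(0, len(words), sample_size)
        for sample_no, start in enumerate(starts, start=1):
            records.append({
                "section_number": sect_no,   # hypothetical key
                "section": name,             # hypothetical key
                "sample_number": sample_no,  # hypothetical key
                "text": " ".join(words[start:start + sample_size]),
            })
    return records

body_text = [
    {"section": "Introduction", "text": "a b c"},
    {"section": "Introduction", "text": "d e"},
    {"section": "Methods", "text": "f g h i j"},
]
sections = paragraphs_to_sections(body_text)
records = sections_to_records(sections, sample_size=2)
print(json.dumps(records[0], indent=4))
```

Samples never cross a section boundary here, because each section is split on its own.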

In [None]:
# -*- coding: utf-8 -*-
# Python script to open each file, read json input and copy to one text file for subsequent processing
import os, re, sys
import json
from collections import defaultdict

class clJsonDir2txtSamples(object):
    '''
    @author Bogdan Babych, IÜD, Heidelberg University
    @email bogdan [dot] babych [at] iued [dot] uni-heidelberg [dot] de
    a script for processing covid-19 corpus:
    @url https://www.semanticscholar.org/cord19 @url https://www.semanticscholar.org/cord19/download
        recursively reads files from a directory, and glues them together into a single corpus file

    @todo:
        working with sections - collect titles of all sections; frequent sections; select argumentative sections (e.g., discussion, analysis...)
        - to compare descriptive and argumentative parts of the corpus

        experimenting with different annotations (pos, parsing... ); MT quality evaluation...
    '''
    def __init__(self, SDirName, output_file = 'corpus_out.txt', textfilter=None, include_title = True, include_sectionNames = True, include_refs = True, include_authors = True, tag='doc', id=1000000, split_by_docs = 0, copy_docs = 0, sample_size = 0, outjsondir = '02'): # initialising by opening the directories
        self.SOutput_file = output_file
        self.SOutput_file_stat = 'stat_' + output_file
        self.STextFilter = textfilter
        # compile the filter only if one is given: re.compile(None) raises a TypeError
        self.RFilter = re.compile(textfilter, re.IGNORECASE | re.MULTILINE) if textfilter else None
        self.BInclTitle = include_title # implemented
        self.BInclSectionNames = include_sectionNames # implemented
        self.BInclRefs = include_refs # not implemented yet
        self.BInclAuth = include_authors # not implemented yet
        self.STag = tag
        self.ID = id
        self.ISplitByDocs = int(split_by_docs)
        self.ICopyDocs = int(copy_docs)
        self.ISampleSize = int(sample_size)
        self.SDirJsonOutput = outjsondir

        # global dictionary of section names (to check and make rules...)
        self.DSectNames = defaultdict(int)
        # print(self.ISplitByDocs)
        self.openDir(SDirName)
        self.printDictionary(self.DSectNames, 'corpus-section-names.txt')
        return


    def openDir(self, path): # recursively walks the directories under a given root directory and reads each file into a string
        i = 0
        # directory for the json output: outjsondir itself if it is an existing directory,
        # otherwise the input path extended with '_sample' + outjsondir
        if os.path.isdir(self.SDirJsonOutput):
            path_sample_out = self.SDirJsonOutput
        else:
            path_sample_out = path + '_sample' + self.SDirJsonOutput
        os.makedirs(path_sample_out, exist_ok=True)
        FOutStat = open(self.SOutput_file_stat, 'w')
        if self.ISplitByDocs:
            SPartFile = "part1000000" + self.SOutput_file
            FOut = open(SPartFile, 'w')
        else:
            FOut = open(self.SOutput_file, 'w')

        for root,d_names,f_names in os.walk(path):
            for f in f_names:
                i+=1
                if i%10000==0: print(str(i) + '. Processing: ' + f)
                fullpath = os.path.join(root, f)
                ## output of Json files
                fullpathout = os.path.join(path_sample_out, f)
                # print(fullpath)
                print('full path output: ' + fullpathout)
                try:
                    ## implement : processing output file; statistics output file...
                    with open(fullpath,'r') as FIn:
                        SIn = FIn.read()
                    # apply text filter, if not None
                    if self.STextFilter and (self.RFilter.search(SIn) is None): continue
                    DData, SText2Write = self.procFile(SIn,f,i)
                    if SText2Write: FOut.write(SText2Write) # if the string is not empty then write to file

                    ## writing json files with new dictionary structure; the with-block closes the file
                    with open(fullpathout, 'w', encoding='utf-8') as jf:
                        json.dump(DData, jf, ensure_ascii=False, indent=4)

                except Exception as e:
                    print(f'file {f} cannot be read or processed: {e}')
                finally:
                    # splitting output into chunks of "split_by_docs" size
                    if self.ISplitByDocs and (i % self.ISplitByDocs == 0): # if self.ISplitByDocs == 0 then everything goes into one file; if > 0 then a new part file starts every ISplitByDocs documents
                        SPartFile = "part" + str(1000000 + i) + self.SOutput_file # generate new file name
                        FOut.flush()
                        FOut.close()
                        FOut = open(SPartFile, 'w')
                    if self.ICopyDocs and (i >= self.ICopyDocs) and (i < (self.ICopyDocs + self.ISplitByDocs)):
                        try:
                            SOutputDirN = root + '_sample'
                            SOutputFN = os.path.join(SOutputDirN, f)
                            os.system(f'cp {fullpath} {SOutputFN}')
                        except:
                            print('.')

        FOut.flush()
        FOut.close()
        FOutStat.close()


        return


    def procFile(self, SIn,SFNameIn,i): # sends each json string for text extraction and wraps the extracted text in the document tags
        STagOpen = '<' + self.STag + ' id="' + self.STag + str(self.ID + i)  + '">\n'
        STagClose = '\n</' + self.STag + '>\n\n'
        DData, SText4Corpus = self.getJson(SIn, SFNameIn)
        if SText4Corpus:
            # concatenate only when there is text: getJson returns None for the text on a json error
            STagsAndText = STagOpen + SText4Corpus + STagClose
            return DData, STagsAndText
        else:
            print('\tNo data read from: ' + SFNameIn)
            return DData, None


    def getJson(self, SIn, SFNameIn): # for each file-level string read from a file: managing internal structure of the covid-19 json file
        LOut = [] # collecting a list of strings

        LBodyTextSections = [] ## list of sections: Json body text
        ### DJsonOut = {} # this will be copied from the input json structure and enriched

        try:
            DDoc = json.loads(SIn)
        except:
            print('\t\t' + SFNameIn + ' => error reading json2dictionary')
            return {}, None

        # metadata:
        try:
            DMetaData = DDoc['metadata']
            if DMetaData:
                DMetaDataOut, SMetaData = self.getJson_Metadata(DMetaData)
                if SMetaData: LOut.append(SMetaData)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no metadata')
            DMetaData = None
        # body text
        try:
            LBodyText = DDoc['body_text']
            if LBodyText:
                ## implementing sampling here ~ returning list of sections first (with a paired list of section names); then sampling for sample size...

                LBodyTextSections, LBodyTextSectNames, SBodyText = self.getJson_BodyText(LBodyText) ## modified function, returns 3 arguments
                LOut.append(SBodyText)
        except:
            print('\t\t\t' + SFNameIn + ' ====> no body_text')
            LBodyText = None
        # further: to implement references

        SText = '\n\n'.join(LOut)
        ### DJsonOut["body_text"] = LBodyTextSections
        # returning an enriched data structure
        return DDoc, SText


    def getJson_Metadata(self, DIn): # converts interesting parts of metadata into a string
        SMetadata = ''
        LMetadata = []
        DMetaDataOut = {}
        try: STitle = DIn["title"]
        except: STitle = None
        if STitle and self.BInclTitle:
            LMetadata.append(STitle)

        # to implement reading of authors' names
        '''
        try:
            SPaperID = DIn["authors"]
            DMetaDataOut["authors"] = "SAuthors"
        except:
            print('au!')
        '''

        if LMetadata: SMetadata = '\n\n'.join(LMetadata)
        return DMetaDataOut, SMetadata

    ## updated function ~
    def getJson_BodyText(self, LIn): # converts interesting parts of the body texts into a string
        ## modified ~ for json output
        LBodyTextSections = [] ## output - list of sections
        LBodyTextSectionPars = [] ## element of the LBodyTextSections[] list, one section-long        
        LBodyTextSectNames = [] ##
        LBodyTextSectNamesM = [] ##


        SBodyText = '' # output (old)
        LBodyText = []
        SSectionName0 = '' # current section name set to empty for a new text
        # todo: later, in post-processing stage for the whole corpus, maybe after lemmatization...
        # ISampleNumber = 0 # samples of 100 words in text; they do not cross section boundaries 

        for DParagraph in LIn:
            # sections added 2022-09-28
            # updating and adding the section name
            try:
                # DParagraphs["section"] ## distinction between different sections....
                SSectionName = DParagraph["section"]
                # normalizing new section name (1)
                SSectionName = SSectionName.replace("\n", " ")

                if self.BInclSectionNames and SSectionName: # if we opted to include section names and section name is not empty
                    if SSectionName != SSectionName0: # if we found a new section name
                        ## recording the previous section and its name, if the section is not empty
                        if LBodyTextSectionPars:
                            LBodyTextSectNames.append(SSectionName0)
                            ## to implement here: call a function for mapping the section names
                            LBodyTextSectNamesM.append(SSectionName0)
                            LBodyTextSections.append('\n'.join(LBodyTextSectionPars))
                            LBodyTextSectionPars = []

                        SSectionName0 = SSectionName # change the current section name

                        # normalizing section name (2)
                        SSectionNameNorm = SSectionName.lower()
                        SSectionNameNorm = re.sub(r'[0-9.]+', ' ', SSectionNameNorm)
                        SSectionNameNorm = re.sub(r'[ ]+', ' ', SSectionNameNorm)
                        SSectionNameNorm = SSectionNameNorm.strip()

                        self.DSectNames[SSectionNameNorm] += 1
                        SSect4text = f'<section sName="{SSectionNameNorm}">\n{SSectionName}\n</section>'

                        LBodyText.append(SSect4text)
            except Exception:
                print('S!')
                continue
            # first original section (we extract text after extracting section name)
            try:
                ## DParagraphs[section] ## -- later on >> distinction between different sections....
                SParagraph = DParagraph["text"]
                LBodyText.append(SParagraph)
                LBodyTextSectionPars.append(SParagraph) ## collect the paragraph for the current section
            except Exception:
                print('!')
                continue

        ## recording the last section: no later change of section name would flush it
        if LBodyTextSectionPars:
            LBodyTextSectNames.append(SSectionName0)
            LBodyTextSectNamesM.append(SSectionName0)
            LBodyTextSections.append('\n'.join(LBodyTextSectionPars))

        SBodyText = '\n\n'.join(LBodyText)
        return LBodyTextSections, LBodyTextSectNamesM, SBodyText

    def printDictionary(self, DFreq, SFOutDict):
        # write the entries sorted by descending frequency, one "key<TAB>count" pair per line
        with open(SFOutDict, 'w') as FOutDict:
            for key, val in sorted(DFreq.items(), key=lambda x: x[1], reverse=True):
                FOutDict.write(f'{key}\t{val}\n')
    



# arguments:
'''
        sys.argv[1], # obligatory: input directory name;
            other arguments optional:
            output_file = 'covid19corpus.txt',
            textfilter = None, # if this is string, only texts containing it are collected, e.g., covid
            include_title = True, # include or exclude title
            include_refs = False, # not implemented yet: include or exclude references
            split_by_docs=0 # split by groups of n documents; if 0 then write to one file

'''

'''if __name__ == '__main__':
    OJsonDir2txt = clJsonDir2txtSamples(sys.argv[1], output_file = 'covid19corpus.txt', textfilter=None, include_title = True, include_sectionNames = True, include_refs = False, split_by_docs=0, copy_docs=240000, sample_size = 100, outjsondir = 'document_parses/pmc_json_sample02')
'''

In [None]:
OJsonDir2txtSamples = clJsonDir2txtSamples("document_parses/pmc_json_sample", output_file = 'cord19.txt', textfilter=None, include_title = True, include_sectionNames = True, include_refs = False, split_by_docs=0, copy_docs=0, sample_size = 100, outjsondir = 'document_parses/pmc_json_sample02')
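
Besides the corpus file, the run above writes `corpus-section-names.txt`, a frequency list of normalised section names. The normalisation in `getJson_BodyText` lowercases the heading, replaces digits and dots with spaces, and collapses repeated spaces. A minimal standalone sketch of that step follows; the function name `normalise_heading` and the sample headings are assumptions made for this example.

```python
import re
from collections import defaultdict

def normalise_heading(heading):
    """Lowercase a heading, drop numbering and collapse whitespace."""
    heading = heading.replace("\n", " ").lower()
    heading = re.sub(r"[0-9.]+", " ", heading)  # section numbers such as "2.1"
    heading = re.sub(r" +", " ", heading)       # runs of spaces
    return heading.strip()

counts = defaultdict(int)
headings = ["1. Introduction", "2.1 Materials and\nMethods", "INTRODUCTION", "3 Results"]
for heading in headings:
    counts[normalise_heading(heading)] += 1

# same layout as the frequency file: name, tab, count, most frequent first
for name, freq in sorted(counts.items(), key=lambda x: x[1], reverse=True):
    print(f"{name}\t{freq}")
```

Merging the numbered and the upper-case variants of a heading in this way keeps the frequency list short enough to inspect by hand when writing mapping rules.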
