<a href="https://colab.research.google.com/github/iued-uni-heidelberg/DAAD-Training-2021/blob/main/compLingProject110CorpusCollectCleanV02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with corpus

# Corpus collection and preparation methods
## Recursively download a website

Possible sources for corpus collection include:
- Aerzteblatt.de: medical texts in de and en, https://www.aerzteblatt.de/int/archive/
- Georgian Medical News journal: summaries in ka and en, https://www.geomednews.com/
- Yerevan State Medical University: parallel texts in en and hy, https://ysmu.am/

Tasks:
- 'recursively' download (crawl) the website
- remove the HTML structure (or convert PDF files) and find the content text
- save the result as a text file and prepare it for Part-of-Speech tagging & alignment

wget manual and advice: 
- https://www.gnu.org/software/wget/manual/html_node/Recursive-Retrieval-Options.html
- https://stackoverflow.com/questions/273743/using-wget-to-recursively-fetch-a-directory-with-arbitrary-files-in-it

Run the following cell for a few minutes, then stop it (by clicking the rotating button) once you think you have enough data.

In [None]:
%%bash
wget --recursive --no-parent https://www.aerzteblatt.de/int/archive/
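
The `--no-parent` option keeps the crawl below the starting directory. As a rough illustration of that scoping rule, here is a minimal Python sketch. It assumes a simple "same host and same path prefix" check, which is a simplification of what wget actually does, and `in_scope` is a hypothetical helper, not part of wget or of this notebook.

```python
from urllib.parse import urljoin, urlparse

def in_scope(start_url: str, link: str) -> bool:
    """Return True if `link` stays on the start host and below the start directory.

    Simplified assumption: scope = same host + same directory prefix.
    """
    start = urlparse(start_url)
    target = urlparse(urljoin(start_url, link))  # resolve relative links
    # Directory part of the start URL, e.g. '/int/archive/'
    if start.path.endswith('/'):
        base_dir = start.path
    else:
        base_dir = start.path.rsplit('/', 1)[0] + '/'
    return target.netloc == start.netloc and target.path.startswith(base_dir)

START = 'https://www.aerzteblatt.de/int/archive/'
print(in_scope(START, 'article/219616'))
print(in_scope(START, '/archiv/219550/Diabetes-im-Krankenhaus'))
print(in_scope(START, 'https://www.example.org/int/archive/x'))
```

Under this simplified rule, a link that points outside `/int/archive/` is treated as out of scope, so pages reachable only through such links may need a separate download.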

In [None]:
!tar -cvzf www.aerzteblatt.de.tgz www.aerzteblatt.de

In [None]:
!zip -r www.aerzteblatt.de.zip www.aerzteblatt.de
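
The same packing step can also be done from Python, which may be convenient when the archive is created inside a longer script. This is a minimal sketch using the standard library; `pack_directory` is a hypothetical helper and the `site` directory is made-up demo data created in a temporary folder.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path

def pack_directory(parent: Path, dirname: str) -> Path:
    """Create <dirname>.tar.gz inside `parent` and return the archive path."""
    archive = shutil.make_archive(
        str(parent / dirname),   # archive name without extension
        'gztar',                 # gzip-compressed tar; 'zip' is also supported
        root_dir=str(parent),
        base_dir=dirname,
    )
    return Path(archive)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Build a tiny stand-in for a crawled site
    (tmp / 'site' / 'int').mkdir(parents=True)
    (tmp / 'site' / 'int' / 'page').write_text('hello', encoding='utf-8')
    archive = pack_directory(tmp, 'site')
    # List the archive members to check what was packed
    with tarfile.open(archive) as tar:
        names = sorted(tar.getnames())
    print(archive.name, names)
```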

In [None]:
# Alternatively, download a ready-made collection instead of crawling yourself.
# Longer collection (wget --recursive was running for ~45 min):
!wget 'https://heibox.uni-heidelberg.de/f/00bbb48ee1c046c896af/?dl=1'
# Smaller collection (~6 files):
# !wget 'https://heibox.uni-heidelberg.de/f/c38d0b9f7f744c5aae16/?dl=1'

In [3]:
!mv 'index.html?dl=1' www.aerzteblatt.de.tgz

In [None]:
!tar -xvzf www.aerzteblatt.de.tgz

## Installing and using *Lynx*

In [None]:
!apt-get install lynx

In [None]:
!lynx https://www.uni-heidelberg.de/en

In [6]:
!lynx --dump https://www.uni-heidelberg.de/en >uni-heidelberg.txt

In [7]:
!cp /content/www.aerzteblatt.de/int/archive/article/219616 /content/219616.html

In [10]:
!lynx --dump /content/219616.html > /content/219616.txt
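
If Lynx is not available, the HTML-to-text step can be approximated in pure Python. The sketch below uses the standard library's `html.parser` instead of Lynx, so it is a different technique and does not reproduce the layout of `lynx --dump`. `TextExtractor` and `html_to_text` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text chunks that are outside script/style
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return '\n'.join(parser.parts)

sample = ('<html><head><style>p {color: red}</style></head>'
          '<body><h1>Title</h1><p>First <b>bold</b> text.</p></body></html>')
print(html_to_text(sample))
```

Each text chunk ends up on its own line here, including inline elements such as `<b>`, so real pages would need extra handling to rebuild paragraphs.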

In [12]:
!iconv -f UTF-8 -t UTF-8//IGNORE ./219616.txt > 219616_V2.txt
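
The `iconv` call above converts from UTF-8 to UTF-8 with `//IGNORE`, which appears intended to drop byte sequences that are not valid UTF-8. A minimal sketch of a comparable clean-up in Python, assuming the input is read as raw bytes; `drop_invalid_utf8` is a hypothetical helper and the sample bytes are made up.

```python
def drop_invalid_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, silently discarding sequences that are not valid."""
    return raw.decode('utf-8', errors='ignore')

# A valid string with one stray invalid byte (0xFF) in the middle
raw = 'Ärzteblatt'.encode('utf-8') + b'\xff' + b' archive'
print(drop_invalid_utf8(raw))
```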

## Recursively processing the corpus collected by crawling the website

In [20]:
!rm -rf ./texts/

In [21]:
!mkdir ./texts/

In [22]:
!rm -f *_2.txt

In [18]:
# Example of the link to the German version, as it appears in an English article page:
# <li><a class="deLink" href="/archiv/219550/Diabetes-im-Krankenhaus">German version</a></li>

In [None]:
# -*- coding: utf-8 -*-
# Python script to walk a crawled directory, convert each HTML file to text and append it to one text file for subsequent processing
import os, re, sys

class clHtmlDir2txt(object):
    '''
    @author Bogdan Babych, IÜD, Heidelberg University, 2021
    @email bogdan [dot] babych [at] iued [dot] uni-heidelberg [dot] de
    '''
    def __init__(self, SDirName, output_file = 'corpus_text', tag='doc', id=1000000, find_parallel=False): # initialise settings, then process the directory
        self.SOutput_file = output_file
        self.STag = tag
        self.ID = id
        self.BFindParallel = find_parallel # must be set before openDir(), which reads it
        self.openDir(SDirName)
        return

    def openDir(self, path): # recursively walk the directories below the given root directory and convert each file to a string
        i = 0
        FOut = open(self.SOutput_file + '.txt', 'w')
        if self.BFindParallel:
            FOutPara = open(self.SOutput_file + '_para.txt', 'w') # open for writing

        for root,d_names,f_names in os.walk(path):
            for f in f_names:
                # keep only files whose names are purely numeric (e.g. 219616);
                # remove this filter if using the script on another corpus
                if not re.match(r'^[0-9]+$', f):
                    # print(f'Skipped: {f}')
                    continue

                fullpath = os.path.join(root, f)
                i+=1
                if i%1==0: # reports every file; raise the modulus (e.g. i%100) to report less often
                    print(str(i) + '. Processing: ' + f)
                    print(fullpath)

                # convert the file; on failure report it and continue with the next one
                try:
                    SText2Write = self.procFile(fullpath, f, i) # returns converted string + tags
                    if SText2Write: FOut.write(SText2Write) # if the string is not empty then write to file
                except Exception:
                    print(f'file {f} cannot be read or processed')
        
        FOut.flush()
        FOut.close()
        if self.BFindParallel:
            FOutPara.close()

        return


    def procFile(self, fullpath, SFNameIn, i): # extract the text of one file and wrap it in opening and closing tags
        STagOpen = '<' + self.STag + ' id="' + self.STag + str(self.ID + i)  + '">\n'
        STagClose = '\n</' + self.STag + '>\n\n'
        SText4Corpus = self.getString(fullpath, SFNameIn)
        if SText4Corpus:
            return STagOpen + SText4Corpus + STagClose
        else:
            print('\tNo data read from: ' + SFNameIn)
            return None


    def getString(self, fullpath, SFNameIn):
        '''
        Uses system commands to copy the file, convert it to text with lynx and clean the encoding with iconv
        '''
        # iconv variants tried:
        #   iconv -c -t UTF-8 < input.txt > output.txt
        #   iconv -f UTF-8 -t UTF-8//IGNORE 219624.txt > 219624_V2.txt
        # the second variant works

        SFNameHTML = SFNameIn + '_1.html'
        SFNameTXT2 = SFNameIn + '_2.txt'
        SFNameTXT3 = SFNameIn + '_3.txt'

        SCommand = 'cp ' + fullpath + ' ./texts/' + SFNameHTML
        os.system(SCommand)

        SCommand2 = 'lynx --dump ./texts/' + SFNameHTML + ' > ./texts/' + SFNameTXT2
        os.system(SCommand2)

        # SCommand2 = 'iconv -c -t UTF-8 < ' + SFNameHTML + ' > ./texts/' + SFNameHTML2
        SCommand3 = 'iconv -f UTF-8 -t UTF-8//IGNORE ./texts/' + SFNameTXT2 + ' > ./texts/' + SFNameTXT3
        os.system(SCommand3)

        # stream = os.popen('lynx --dump ./' + SFNameHTML2)
        # SFileContent = stream.read()

        with open('./texts/' + SFNameTXT3, 'r', encoding="utf8", errors="surrogateescape") as F2Read:
            SFileContent = F2Read.read()

        # split into paragraphs at blank lines
        LFileContent = re.split('\n\n+', SFileContent)
        print(len(LFileContent))
        print(LFileContent[0])

        # join the wrapped lines of each paragraph and collapse runs of spaces
        LFileContent0 = []
        for el in LFileContent:
            el = re.sub('\n', ' ', el)
            el = re.sub(' +', ' ', el)
            LFileContent0.append(el)

        SFileContent2 = '\n\n'.join(LFileContent0)
        

        return SFileContent2

# calling the class
OHtmlDir2txt = clHtmlDir2txt('/content/www.aerzteblatt.de')


In [25]:
!wc corpus_text.txt

  11920  315735 2862875 corpus_text.txt


In [None]:
!head --lines=20 corpus_text.txt
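
For later steps such as tagging or alignment, the documents can be read back from `corpus_text.txt` by splitting on the tags written by `procFile`. This is a minimal sketch, assuming the default `doc` tag and that no article text itself contains `</doc>`. `read_docs` is a hypothetical helper and the sample string stands in for the real file content.

```python
import re

# One <doc id="..."> ... </doc> block per article, as written by procFile
DOC_RE = re.compile(r'<doc id="([^"]+)">\n(.*?)\n</doc>', re.DOTALL)

def read_docs(corpus: str) -> dict:
    """Map each document id to its list of non-empty paragraphs."""
    docs = {}
    for doc_id, body in DOC_RE.findall(corpus):
        docs[doc_id] = [p for p in re.split(r'\n\n+', body) if p.strip()]
    return docs

sample = (
    '<doc id="doc1000001">\nFirst paragraph.\n\nSecond paragraph.\n</doc>\n\n'
    '<doc id="doc1000002">\nOnly paragraph.\n</doc>\n\n'
)
docs = read_docs(sample)
print(len(docs), docs['doc1000001'])
```

With the real corpus, `sample` would be replaced by the content of `corpus_text.txt` read with `encoding='utf-8'`.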

## A parallel corpus can also be created
- by changing the input parameter to `find_parallel=True`
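
The class above accepts `find_parallel`, but the step that pairs an English page with its German counterpart is not shown. One possible starting point, sketched under the assumption that English pages contain a link with `class="deLink"` like the example earlier in this notebook, is to extract that link from the raw HTML. `find_german_url` is a hypothetical helper, and the attribute order in the regular expression is assumed to match that example.

```python
import re
from urllib.parse import urljoin

# Assumes the attribute order seen in the example: class first, then href
DE_LINK_RE = re.compile(r'<a\s+class="deLink"\s+href="([^"]+)"')

def find_german_url(html: str, base_url: str = 'https://www.aerzteblatt.de/'):
    """Return the absolute URL of the German version, or None if no link is found."""
    m = DE_LINK_RE.search(html)
    return urljoin(base_url, m.group(1)) if m else None

sample = ('<li><a class="deLink" href="/archiv/219550/Diabetes-im-Krankenhaus">'
          'German version</a></li>')
print(find_german_url(sample))
```

The returned URL could then be downloaded and converted with the same Lynx and iconv steps, with the resulting text written to the `_para.txt` output file.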