<a href="https://colab.research.google.com/github/iued-uni-heidelberg/corpusdev/blob/main/s101wikipedia_corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading a Wikipedia corpus
- one corpus per language, built from that language's Wikipedia dump
- creating frequency lists
- bootstrapping the internet corpus (e.g., via SketchEngine)

## WikiExtractor (Apertium)
- the script that processes Wikipedia dumps and writes the extracted text to `wiki.txt`
- documentation: https://wiki.apertium.org/wiki/Wikipedia_Extractor


In [None]:
!wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py

In [None]:
# python3 WikiExtractor.py --infn xyzwiki-20210620-pages-articles.xml.bz2 --compress
!python3 WikiExtractor.py

### Creating a frequency dictionary from the corpus
- tokenizing
- writing the frequency dictionary to a file, most frequent words first
- based on the example in https://stackabuse.com/python-for-nlp-creating-bag-of-words-model-from-scratch/
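
The tokenizing and counting steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the NLTK-based method used in this notebook: it applies the same lowercasing and `\W` replacement, then splits on whitespace. The name `count_tokens` is hypothetical.

```python
import re
from collections import Counter


def count_tokens(lines):
    """Return a Counter of lowercased tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # lowercase and replace every non-word character with a space
        cleaned = re.sub(r'\W', ' ', line.lower())
        # str.split() with no argument splits on runs of whitespace
        counts.update(cleaned.split())
    return counts


sample = ["The cat sat.", "The dog sat, too!"]
# print 'word<TAB>frequency', most frequent first
for word, freq in count_tokens(sample).most_common():
    print(f'{word}\t{freq}')
```

A `Counter` avoids the explicit "is the key already present" check and provides `most_common()` for the sorted output.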


In [None]:
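import re, sys, os

import nltk
# tokenizer models for sent_tokenize / word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')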

In [4]:
# based on the example in
# https://stackabuse.com/python-for-nlp-creating-bag-of-words-model-from-scratch/


def normalizeSent(SentIn):
    """Lowercase a sentence and replace non-word characters with single spaces."""
    SentIn = SentIn.lower()
    SentIn = re.sub(r'\W', ' ', SentIn)
    SentIn = re.sub(r'\s+', ' ', SentIn)
    return SentIn


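def printFrqDict(frqdict, FOut = 'output-frq-dict.txt'):
    """Write 'word<TAB>frequency' lines, most frequent first, to a path or open file."""
    if not frqdict:
        print('frqdict not found')
        return

    # a path string has no .write(), so open it and write through the file object
    if isinstance(FOut, str):
        with open(FOut, 'w', encoding='utf-8') as f:
            printFrqDict(frqdict, f)
        return

    for k, v in sorted(frqdict.items(), key=lambda item: item[1], reverse=True):
        FOut.write(f'{k}\t{v}\n')
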

def corp2frq(FIn = None):
    """Count token frequencies over an iterable of text lines (e.g. an open file)."""
    wordfreq = {}

    if not FIn:
        print('input file not specified')
        return

    for j, SLine in enumerate(FIn, start=1):
        # progress report every 50,000 lines
        if j % 50000 == 0:
            print(j)

        # split the line into sentences, then lowercase and strip punctuation
        LLineTok = [normalizeSent(s) for s in nltk.sent_tokenize(SLine)]

        for sentence in LLineTok:
            for token in nltk.word_tokenize(sentence):
                wordfreq[token] = wordfreq.get(token, 0) + 1

    return wordfreq





## Georgian

In [None]:
# URL pattern on the official dump server:
# wget https://dumps.wikimedia.org/xyzwiki/20210620/xyzwiki-20210620-pages-articles.xml.bz2
# the Georgian (ka) dump is fetched from a mirror:
!wget https://mirror.accum.se/mirror/wikimedia.org/dumps/kawiki/20221101/kawiki-20221101-pages-articles.xml.bz2

In [None]:
!python3 WikiExtractor.py --infn kawiki-20221101-pages-articles.xml.bz2

In [None]:
!mv wiki.txt kawiki-20221101-pages-articles.txt

In [None]:
!wc kawiki-20221101-pages-articles.txt

  1072294  23896746 521002360 kawiki-20221101-pages-articles.txt


In [None]:
!head -n 10 kawiki-20221101-pages-articles.txt

In [None]:
FIn = open('/content/kawiki-20221101-pages-articles.txt', 'r', encoding='utf-8')

In [None]:
FOut = open('/content/kawiki-20221101-frq-dict.txt', 'w', encoding='utf-8')

In [None]:
wordfreq = corp2frq(FIn)

In [None]:
printFrqDict(wordfreq, FOut)
FOut.close()  # flush the frequency list to disk before inspecting it

In [None]:
!wc kawiki-20221101-frq-dict.txt

 1071610  2143218 31160888 kawiki-20221101-frq-dict.txt


In [None]:
!head -n 20 kawiki-20221101-frq-dict.txt

In [None]:
!awk '$2 >=5 {print $0}' <kawiki-20221101-frq-dict.txt >kawiki-20221101-frq-dict-gt5.txt

In [None]:
!gzip kawiki-20221101-pages-articles.txt

In [None]:
!gzip kawiki-20221101-frq-dict.txt

## Armenian

In [None]:
!wget https://mirror.accum.se/mirror/wikimedia.org/dumps/hywiki/20221101/hywiki-20221101-pages-articles.xml.bz2

In [None]:
!python3 WikiExtractor.py --infn hywiki-20221101-pages-articles.xml.bz2

In [None]:
!mv wiki.txt hywiki-20221101-pages-articles.txt

In [None]:
!wc hywiki-20221101-pages-articles.txt

  2446411  56341167 803098410 hywiki-20221101-pages-articles.txt


In [None]:
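FIn = open('/content/hywiki-20221101-pages-articles.txt', 'r', encoding='utf-8')
FOut = open('/content/hywiki-20221101-frq-dict.txt', 'w', encoding='utf-8')
wordfreq = corp2frq(FIn)
printFrqDict(wordfreq, FOut)
# close both files so the frequency list is fully written before `wc` reads it
FIn.close()
FOut.close()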


In [None]:
!wc hywiki-20221101-frq-dict.txt

 1653538  3307076 33397739 hywiki-20221101-frq-dict.txt


In [None]:
!gzip hywiki-20221101-pages-articles.txt

In [None]:
!gzip hywiki-20221101-frq-dict.txt

In [None]:
!gunzip hywiki-20221101-frq-dict.txt.gz

In [None]:
!awk '$2 >=5 {print $0}' <hywiki-20221101-frq-dict.txt >hywiki-20221101-frq-dict-gt5.txt
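
The `awk` filter above keeps entries whose second column (the frequency) is at least 5. A rough Python equivalent is sketched below, assuming the tab-separated `word<TAB>frequency` layout written by `printFrqDict`. The function name `filter_frq_dict` and the `min_freq` parameter are hypothetical, and unlike `awk` this sketch splits on tabs only and skips malformed lines.

```python
def filter_frq_dict(path_in, path_out, min_freq=5):
    """Copy 'word<TAB>frequency' lines with frequency >= min_freq; return how many were kept."""
    kept = 0
    with open(path_in, encoding='utf-8') as fin, \
         open(path_out, 'w', encoding='utf-8') as fout:
        for line in fin:
            parts = line.rstrip('\n').split('\t')
            # skip malformed lines instead of failing on them
            if len(parts) != 2 or not parts[1].isdigit():
                continue
            if int(parts[1]) >= min_freq:
                fout.write(line)
                kept += 1
    return kept
```

It could be called as `filter_frq_dict('hywiki-20221101-frq-dict.txt', 'hywiki-20221101-frq-dict-gt5.txt')`; the return value gives the line count that `wc -l` would report for the filtered file.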

## Ukrainian
https://mirror.accum.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2


In [5]:
!wget https://mirror.accum.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2

--2022-11-21 10:13:09--  https://mirror.accum.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2
Resolving mirror.accum.se (mirror.accum.se)... 194.71.11.165, 194.71.11.173, 194.71.11.163, ...
Connecting to mirror.accum.se (mirror.accum.se)|194.71.11.165|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://laotzu.ftp.acc.umu.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2 [following]
--2022-11-21 10:13:10--  https://laotzu.ftp.acc.umu.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2
Resolving laotzu.ftp.acc.umu.se (laotzu.ftp.acc.umu.se)... 194.71.11.166, 2001:6b0:19::166
Connecting to laotzu.ftp.acc.umu.se (laotzu.ftp.acc.umu.se)|194.71.11.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1972921737 (1.8G) [application/x-bzip2]
Saving to: ‘ukwiki-20221101-pages-articles.xml.bz2’


2022-11-21 10:15:02 (17.1 MB/s) - ‘ukw

In [None]:
!python3 WikiExtractor.py --infn ukwiki-20221101-pages-articles.xml.bz2


* 4714028 Вінтер (стадіон)
* 4714037 Гетеренгелеві
* 4714045 BMW X3 (F25)
* 4714048 Lustra (band)
* 4714053 Брайкович
* 4714067 Обвалення пішохідного мосту у Морві
* 4714073 Вікі Перец
* 4714074 Микитенко Валерій Миколайович
* 4714077 The Witcher (серія відеоігор)
* 4714083 Втрати кадирівців
* 4714088 Москович Вольф Абрамович
* 4714089 Шалом Тіква
* 4714094 Isuzu Gemini
* 4714097 Sterling Trucks
* 4714098 Муніципальний стадіон (Герцлія)
* 4714103 Кобзар Федір Федорович
* 4714107 BMW X3 (G01)
* 4714109 Юрісанді Ернандес
* 4714116 Карлос Муньйос (борець)
* 4714118 Бібліографія Браяна Олдіса
* 4714127 Філатов Іван Миколайович
* 4714129 Письмова мова
* 4714130 Жан Ґрібомон
* 4714136 Паяса
* 4714140 Длугашевський Костянтин Наумович
* 4714145 Дачан-Хуейський автономний повіт
* 4714146 Лужицька міфологія
* 4714149 Дике поле (пісня)
* 4714154 Їтун-Маньчжурський автономний повіт
* 4714157 Dig, Lazarus, Dig!!!
* 4714166 Первісний потяг



In [7]:
!mv wiki.txt ukwiki-20221101-pages-articles.txt

In [8]:
!wc ukwiki-20221101-pages-articles.txt

  11762870  250966568 3298225702 ukwiki-20221101-pages-articles.txt


In [None]:
!cp ukwiki-20221101-pages-articles.txt ukwiki-20221101-pages-articles2.txt

In [None]:
!gzip ukwiki-20221101-pages-articles2.txt

In [None]:
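FIn = open('/content/ukwiki-20221101-pages-articles.txt', 'r', encoding='utf-8')
FOut = open('/content/ukwiki-20221101-frq-dict.txt', 'w', encoding='utf-8')
wordfreq = corp2frq(FIn)
printFrqDict(wordfreq, FOut)
# close both files so the frequency list is fully written before `wc` reads it
FIn.close()
FOut.close()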

In [10]:
!wc ukwiki-20221101-frq-dict.txt

 3761114  7522228 73900534 ukwiki-20221101-frq-dict.txt


In [11]:
!awk '$2 >=5 {print $0}' <ukwiki-20221101-frq-dict.txt >ukwiki-20221101-frq-dict-gt5.txt

In [12]:
!wc ukwiki-20221101-frq-dict-gt5.txt

  926785  1853570 18587176 ukwiki-20221101-frq-dict-gt5.txt


In [13]:
!gzip ukwiki-20221101-frq-dict.txt
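
Once a frequency list has been gzipped, it can still be read from Python without unpacking it first. The sketch below uses `gzip.open` in text mode from the standard library. The helper name `read_frq_dict_gz` and its `limit` parameter are hypothetical, and the `word<TAB>frequency` layout is assumed from `printFrqDict`.

```python
import gzip


def read_frq_dict_gz(path, limit=None):
    """Load up to `limit` (word, frequency) pairs from a gzipped frequency list."""
    entries = []
    # 'rt' opens the compressed file as text rather than bytes
    with gzip.open(path, 'rt', encoding='utf-8') as fin:
        for line in fin:
            if limit is not None and len(entries) >= limit:
                break
            word, _, freq = line.rstrip('\n').partition('\t')
            # keep only well-formed lines with a numeric frequency
            if freq.isdigit():
                entries.append((word, int(freq)))
    return entries
```

For example, `read_frq_dict_gz('ukwiki-20221101-frq-dict.txt.gz', limit=20)` would return roughly what `head -n 20` shows on the uncompressed file, as a list of tuples.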