<a href="https://colab.research.google.com/github/iued-uni-heidelberg/corpusdev/blob/main/s101wikipedia_corpus.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading a Wikipedia corpus
- for each language (Georgian, Armenian, Ukrainian)
- creating frequency lists
- bootstrapping the internet corpus (e.g., via SketchEngine)
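
The dumps for the individual languages follow one naming pattern on the mirror used in this notebook. A minimal sketch, assuming that the pattern holds for every language code and dump date; the helper name `dump_url` is hypothetical and not part of any library:

```python
# Hypothetical helper: builds a dump URL from a language code and a dump date,
# following the pattern of the mirror URLs used in this notebook.
MIRROR = 'https://mirror.accum.se/mirror/wikimedia.org/dumps'

def dump_url(lang, date):
    name = f'{lang}wiki-{date}-pages-articles.xml.bz2'
    return f'{MIRROR}/{lang}wiki/{date}/{name}'

for lang in ['ka', 'hy', 'uk']:  # Georgian, Armenian, Ukrainian
    print(dump_url(lang, '20221101'))
```

The printed URLs can be passed to `wget` one by one.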

## wiki extractor - Apertium
- the script which processes wiki dumps
- documentation link: https://wiki.apertium.org/wiki/Wikipedia_Extractor


In [None]:
!wget https://raw.githubusercontent.com/apertium/WikiExtractor/master/WikiExtractor.py

In [None]:
# python3 WikiExtractor.py --infn xyzwiki-20210620-pages-articles.xml.bz2 --compress
!python3 WikiExtractor.py

### creating a frequency dictionary from a corpus
- tokenizing
- printing the frequency dictionary
- example in https://stackabuse.com/python-for-nlp-creating-bag-of-words-model-from-scratch/
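
The two steps listed above can be sketched without NLTK. This is a simplified stand-in, not the code used in this notebook: it finds words with a regular expression instead of calling the NLTK tokenizers, and the names `count_words` and `sample` are made up for illustration.

```python
import re
from collections import Counter

def count_words(lines):
    """Count lowercased word tokens in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # \w+ matches runs of letters, digits and underscores (Unicode-aware in Python 3)
        counts.update(re.findall(r'\w+', line.lower()))
    return counts

sample = ['The cat sat.', 'The dog sat, too.']
# print word<TAB>frequency, most frequent words first
for word, freq in count_words(sample).most_common():
    print(f'{word}\t{freq}')
```

`Counter.most_common()` already returns the entries sorted by descending frequency, so no separate sorting step is needed.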


In [6]:
import nltk
nltk.download('punkt')
import re, sys, os

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
# using example in
# https://stackabuse.com/python-for-nlp-creating-bag-of-words-model-from-scratch/



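def normalizeSent(SentIn):
    # lowercase the sentence
    SentIn = SentIn.lower()
    # replace every non-word character (punctuation, symbols) with a space
    SentIn = re.sub(r'\W',' ', SentIn)
    # collapse runs of whitespace into a single space
    SentIn = re.sub(r'\s+',' ', SentIn)
    return SentIn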


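def printFrqDict(frqdict, FOut = 'output-frq-dict.txt'):
    # FOut is either a file name or an already opened, writable file object
    if not frqdict:
        print('frqdict not found')
        return

    opened_here = isinstance(FOut, str)
    if opened_here:
        FOut = open(FOut, 'w', encoding='utf-8')

    # write word<TAB>frequency, most frequent words first
    for k, v in sorted(frqdict.items(), key=lambda item: item[1], reverse=True):
        FOut.write(f'{k}\t{v}\n')

    # make sure everything is on disk before shell commands read the file
    if opened_here:
        FOut.close()
    else:
        FOut.flush()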
    

def corp2frq(FIn = None):
    wordfreq = {}

    if not FIn:
        print('input file not specified')
        return
    # count tokens line by line; print the line number every 50,000 lines
    for j, SLine in enumerate(FIn, start=1):
        if j%50000 == 0: print(j)

        # split the line into sentences and normalize each of them
        LLineTok = [normalizeSent(s) for s in nltk.sent_tokenize(SLine)]

        for sentence in LLineTok:
            for token in nltk.word_tokenize(sentence):
                wordfreq[token] = wordfreq.get(token, 0) + 1

    return wordfreq
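
One side effect of `normalizeSent` is worth checking before building a frequency list: `\W` also matches the apostrophe, so a word written with an apostrophe (common in Ukrainian) is split into two tokens. The sketch below repeats the function so that it runs on its own. `keepApostrophe` is a hypothetical variant, not part of the pipeline in this notebook, and plain whitespace splitting stands in for `nltk.word_tokenize`.

```python
import re

def normalizeSent(SentIn):
    # same steps as in the cell above
    SentIn = SentIn.lower()
    SentIn = re.sub(r'\W', ' ', SentIn)
    SentIn = re.sub(r'\s+', ' ', SentIn)
    return SentIn

def keepApostrophe(SentIn):
    # hypothetical variant: leaves the apostrophes ' and ’ in place
    SentIn = SentIn.lower()
    SentIn = re.sub(r"[^\w'’]", ' ', SentIn)
    SentIn = re.sub(r'\s+', ' ', SentIn)
    return SentIn

print(normalizeSent("Це м'який знак.").split())   # the word with the apostrophe is split
print(keepApostrophe("Це м'який знак.").split())  # the word stays in one piece
```

Whether the split matters depends on the language; for Georgian and Armenian text the original function may be sufficient.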





## Georgian

In [None]:
# general pattern of the dump URL on the official server:
# wget https://dumps.wikimedia.org/xyzwiki/20210620/xyzwiki-20210620-pages-articles.xml.bz2
# here: the Georgian dump of 2022-11-01, downloaded from a mirror
!wget https://mirror.accum.se/mirror/wikimedia.org/dumps/kawiki/20221101/kawiki-20221101-pages-articles.xml.bz2

In [None]:
!python3 WikiExtractor.py --infn kawiki-20221101-pages-articles.xml.bz2

In [5]:
!mv wiki.txt kawiki-20221101-pages-articles.txt

In [8]:
!wc kawiki-20221101-pages-articles.txt

  1072294  23896746 521002360 kawiki-20221101-pages-articles.txt


In [None]:
!head -n 10 kawiki-20221101-pages-articles.txt

In [48]:
FIn = open('/content/kawiki-20221101-pages-articles.txt', 'r', encoding='utf-8')

In [49]:
FOut = open('/content/kawiki-20221101-frq-dict.txt', 'w', encoding='utf-8')

In [None]:
wordfreq = corp2frq(FIn)

In [51]:
printFrqDict(wordfreq, FOut)
# close the files so that the list is fully written before wc and head read it
FIn.close()
FOut.close()

In [52]:
!wc kawiki-20221101-frq-dict.txt

 1071610  2143218 31160888 kawiki-20221101-frq-dict.txt


In [None]:
!head -n 20 kawiki-20221101-frq-dict.txt

In [None]:
!awk '$2 >=5 {print $0}' <kawiki-20221101-frq-dict.txt >kawiki-20221101-frq-dict-gt5.txt

In [55]:
!gzip kawiki-20221101-pages-articles.txt

In [56]:
!gzip kawiki-20221101-frq-dict.txt
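
The `gzip` cells in this section compress the corpus and the frequency list. Compressed files can still be read from Python without unpacking them first. A minimal sketch using the standard `gzip` module; the file name `demo-frq-dict.txt.gz` and its two entries are invented so that the example runs on its own:

```python
import gzip
import os
import tempfile

# invented file name and entries, used only to make the sketch runnable
path = os.path.join(tempfile.mkdtemp(), 'demo-frq-dict.txt.gz')
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write('და\t10\nარის\t7\n')

# 'rt' opens the compressed file in text mode, so lines come back as str
with gzip.open(path, 'rt', encoding='utf-8') as f:
    rows = [line.rstrip('\n').split('\t') for line in f]

print(rows)
```

The same `gzip.open(..., 'rt')` call should work for the compressed corpus file, which can then be iterated line by line like an ordinary text file.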

## Armenian

In [None]:
!wget https://mirror.accum.se/mirror/wikimedia.org/dumps/hywiki/20221101/hywiki-20221101-pages-articles.xml.bz2

In [None]:
!python3 WikiExtractor.py --infn hywiki-20221101-pages-articles.xml.bz2

In [5]:
!mv wiki.txt hywiki-20221101-pages-articles.txt

In [9]:
!wc hywiki-20221101-pages-articles.txt

  2446411  56341167 803098410 hywiki-20221101-pages-articles.txt


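In [None]:
FIn = open('/content/hywiki-20221101-pages-articles.txt', 'r', encoding='utf-8')
FOut = open('/content/hywiki-20221101-frq-dict.txt', 'w', encoding='utf-8')
wordfreq = corp2frq(FIn)
printFrqDict(wordfreq, FOut)
# close the files so that the list is fully written before it is read again
FIn.close()
FOut.close()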


In [10]:
!wc hywiki-20221101-frq-dict.txt

 1653538  3307076 33397739 hywiki-20221101-frq-dict.txt


In [12]:
!gzip hywiki-20221101-pages-articles.txt

In [13]:
!gzip hywiki-20221101-frq-dict.txt

In [14]:
!gunzip hywiki-20221101-frq-dict.txt.gz

In [15]:
!awk '$2 >=5 {print $0}' <hywiki-20221101-frq-dict.txt >hywiki-20221101-frq-dict-gt5.txt
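
The `awk` filter keeps the lines whose second field (the frequency) is at least 5; despite the `gt5` suffix in the file name, words that occur exactly five times are kept as well. A rough Python equivalent is sketched below. The function name `filter_frq` and the demo entries are invented for illustration.

```python
def filter_frq(lines, min_freq=5):
    """Keep 'word<TAB>frequency' lines whose frequency is at least min_freq."""
    kept = []
    for line in lines:
        fields = line.split()
        # skip malformed lines that have no numeric second field
        if len(fields) >= 2 and fields[1].isdigit() and int(fields[1]) >= min_freq:
            kept.append(line)
    return kept

demo = ['և\t120\n', 'է\t5\n', 'բառ\t4\n']  # invented frequencies
print(filter_frq(demo))
```

Applied to an open frequency file, the returned lines can be written to the filtered file with `writelines`.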

## Ukrainian
https://mirror.accum.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2


In [16]:
!wget https://mirror.accum.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2

--2022-11-20 20:35:46--  https://mirror.accum.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2
Resolving mirror.accum.se (mirror.accum.se)... 194.71.11.165, 194.71.11.173, 194.71.11.163, ...
Connecting to mirror.accum.se (mirror.accum.se)|194.71.11.165|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://laotzu.ftp.acc.umu.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2 [following]
--2022-11-20 20:35:47--  https://laotzu.ftp.acc.umu.se/mirror/wikimedia.org/dumps/ukwiki/20221101/ukwiki-20221101-pages-articles.xml.bz2
Resolving laotzu.ftp.acc.umu.se (laotzu.ftp.acc.umu.se)... 194.71.11.166, 2001:6b0:19::166
Connecting to laotzu.ftp.acc.umu.se (laotzu.ftp.acc.umu.se)|194.71.11.166|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1972921737 (1.8G) [application/x-bzip2]
Saving to: ‘ukwiki-20221101-pages-articles.xml.bz2’


2022-11-20 20:37:34 (17.9 MB/s) - ‘ukw

In [None]:
!python3 WikiExtractor.py --infn ukwiki-20221101-pages-articles.xml.bz2

Streaming output truncated to the last 5000 lines.
1585187 Дражевник
1585188 Двор-при-Полховем Градцу
1585189 Габрє (Доброва-Полхов Градець)
1585190 Храстениці
1585191 Хрушево
1585192 Команія
1585193 Лог-при-Полховем Градцу
1585194 Осредек-при-Доброві
1585195 Планина-над-Хорюлом
1585196 Подребер (Доброва-Полхов Градець)
1585198 Подсмрека (Доброва-Полхов Градець)
1585200 Прапроче (Доброва-Полхов Градець)
1585202 Пристава-при-Полховем Градцу
1585203 Разорі
1585204 Ровт (Доброва-Полхов Градець)
1585205 Село-над-Полховим Градцем
1585206 Сетниця (Доброва-Полхов Градець)
1585207 Сетник
1585208 Смолник (Доброва-Полхов Градець)
1585209 Средня Вас-при-Полховем Градцу
1585210 Средній Врх (Доброва-Полхов Градець)
1585211 Шентйошт-над-Хорюлом
1585212 Странська Вас (Доброва-Полхов Градець)
1585213 Шуїця
1585215 Подільські вісті
1585216 Дол-при-Любляні (община)
1585221 Дол-при-Любляні
1585222 Міжнародна панорама
1585226 Фехтування на літній Універсіаді 2013
1585228 Мале Стахове
1585245

In [None]:
!mv wiki.txt ukwiki-20221101-pages-articles.txt

In [None]:
!wc ukwiki-20221101-pages-articles.txt

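In [None]:
FIn = open('/content/ukwiki-20221101-pages-articles.txt', 'r', encoding='utf-8')
FOut = open('/content/ukwiki-20221101-frq-dict.txt', 'w', encoding='utf-8')
wordfreq = corp2frq(FIn)
printFrqDict(wordfreq, FOut)
# close the files so that the list is fully written before it is read again
FIn.close()
FOut.close()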