We're building a language model from the Swahili Wikipedia. To do this, you'll need to download an archive of the Swahili Wikipedia. You can either download it from [here](https://dumps.wikimedia.org/), or use `get-wikimedia.sh` from [Facebook's fastText repo](https://raw.githubusercontent.com/facebookresearch/fastText/master/get-wikimedia.sh). You can save the script anywhere, ideally in a folder you'll reuse for Wikipedia datasets. Name the folder something like `wikimedia-datasets`, since it will come in handy again if you want to build more language models from Wikipedia data. You'll also need these two Unix tools:

+ awk
+ perl

After you've read the script, run the command `chmod u+x get-wikimedia.sh` to make the script executable.
Run the script like so: `./get-wikimedia.sh`

Enter `sw` at the first prompt and `y` at the second, then wait until it's finished. The Swahili archive is rather small, which means (a) it won't take long to download, and (b) it won't take too long to train. This makes it a nice testbed for the technique.

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle



    Only loading the 'en' tokenizer.



In [2]:
PATH='data/sw_wiki/'

TRN_PATH = 'train/'
VAL_PATH = 'test/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {TRN}

[0m[01;32mwiki.sw.txt[0m*


In [3]:
!head {TRN}/wiki.sw.txt

ho rac\, , aliandika mengi na vitabu vingi . utafiti wake kuhusu kiswahili ulikazia hasa utamaduni , ushairi na dini ya waswahili . lugha ya kiawyu ya kusini kwenye multitree ' kitsat ' ( au ' kihuihui ' ) ni lugha ya kiaustronesia nchini uchina inayozungumzwa na watsat . mwaka wa idadi ya wasemaji wa kitsat imehesabiwa kuwa watu . kufuatana na uainishaji wa lugha kwa ndani zaidi , kitsat iko katika kundi la kichamiki . study text of the divine liturgy of saint basil the great cambodia , http //www . ethnologue . com/language/shr wakazi ( ) ' jamhuri ya kamerun ' ( pia cameroon ) ni jamhuri ya muungano katika afrika ya magharibi . diospyros dichrophylla . jpg majani na matunda ya mjoho sumu - brooklyn museum , marekani . makala hii inahusu mwaka ' bk ' ( baada ya kristo ) . viungo vya nje jake kilrain siku hizi ( ) byamugisha anafanya kazi na shirika la word vision international pia pamoja na kasisi jape heath kutoka afrika kusini ameunda umoja wa viongozi wa kidini wa afrika wanaoish

In [9]:
!head /tmp/wiki.sw.txt

katika kuendeleza dhana ya udhibiti wa mradi kuna kujumuisha usimamizi wenye msingi wa mchakato eneo hili limekuwa likiendeshwa na matumizi ya mifano iliyokomaa kama cmmi ( jumuisho la mifano iliyokomaa ya kupima uwezo ) na iso/iec ( spice - uboreshaji wa michakato ya programu na upimaji uwezo ) . 
kihistoria ilikuwa chakula muhimu cha watu wa nchi za kaskazini kwa sababu inavumilia hali ya hewa baridi kiasi na mvua nyingi tofauti na ngano . waroma waliona wagermanik wa kale walikula hasa oti kwenye kisiwa cha britania waskoti hupenda oti lakini waingereza hulisha nafaka hii kwa wanyama hasa farasi . kupunguka kwa farasi kama wanyama wa kazi kwenye kilimo kumemaaanisha pia kurudi nyuma kwa kilimo cha oti . 
 toleo la ronan keating 
 kwa modeli 
 mommsen , t . and paul m . meyer , eds . theodosiani libri xvi cum constitutionibus sirmondianis et leges novellae ad theodosianum pertinentes ( in latin ) . berlin weidmann , . complied by nicholas palmer , revised by tony honoré for oxford te

In [4]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

380774


In [5]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

95195
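
The two word counts above work out to roughly an 80/20 split between `train/` and `test/`. The notebook does not show how that split was produced, so here is a minimal sketch of one way to do it, assuming the cleaned dump is a single text file with one passage per line. The function name `split_corpus`, the 0.8 ratio, and the demo file contents are illustrative assumptions, not part of the original workflow.

```python
import random
import tempfile
from pathlib import Path

def split_corpus(src, out_dir, train_frac=0.8, seed=42):
    """Shuffle the lines of `src` and write them to train/ and test/ under `out_dir`.

    Hypothetical helper: the original notebook does not show its split step.
    """
    src = Path(src)
    lines = src.read_text(encoding="utf-8").splitlines()
    random.Random(seed).shuffle(lines)          # reproducible shuffle
    cut = int(len(lines) * train_frac)          # index where the test part starts
    for name, chunk in (("train", lines[:cut]), ("test", lines[cut:])):
        target = Path(out_dir) / name
        target.mkdir(parents=True, exist_ok=True)
        (target / src.name).write_text("\n".join(chunk) + "\n", encoding="utf-8")
    return cut, len(lines) - cut

# Small demo on a throwaway file so the sketch can be run anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_src = Path(tmp) / "wiki.sw.txt"
    demo_src.write_text("\n".join(f"mstari {i}" for i in range(10)) + "\n", encoding="utf-8")
    n_train, n_test = split_corpus(demo_src, Path(tmp) / "sw_wiki")
    train_exists = (Path(tmp) / "sw_wiki" / "train" / "wiki.sw.txt").exists()

print(n_train, n_test)
```

Shuffling by line is only reasonable if each line is a self-contained passage; if the dump keeps articles contiguous across lines, splitting at a single cut point without shuffling may be the safer choice.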


In [15]:
TEXT = data.Field(lower=True, tokenize=spacy_tok) # word-level

In [6]:
TEXT = data.Field(lower=True, tokenize=list) # character-level: list('abc') = ['a', 'b', 'c']
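
To make the difference between the two `TEXT` definitions concrete, here is a small plain-Python sketch. It uses `str.split` as a stand-in for `spacy_tok` (an assumption made only to keep the example dependency-free), and the sample sentence is made up for illustration.

```python
sentence = "Lugha ya Kiswahili"

# Word-level: one token per whitespace-separated word
# (str.split stands in for spacy_tok here).
word_tokens = sentence.lower().split()

# Character-level: one token per character, spaces included.
char_tokens = list(sentence.lower())

print(word_tokens)
print(char_tokens[:6])
print(len(set(word_tokens)), "distinct words;", len(set(char_tokens)), "distinct characters")
```

A character-level model has a tiny vocabulary and never meets an out-of-vocabulary word, at the cost of much longer sequences for the same text.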

In [7]:
bs=64; bptt=70

In [8]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

In [9]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

In [10]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(475, 130, 1, 2133155)
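
The tuple above reads as: number of training batches, vocabulary size, number of training documents (the whole corpus is concatenated into one), and number of tokens in that document. The batch count can be sanity-checked with a little arithmetic. This sketch assumes the loader lays the token stream out in `bs` parallel columns and then walks down them in steps of about `bptt`; the exact off-by-one behaviour depends on the loader's implementation.

```python
n_tokens = 2_133_155   # len(md.trn_ds[0].text) from the output above
bs, bptt = 64, 70

# Split the stream into bs parallel columns; leftover tokens are dropped.
rows = n_tokens // bs

# Each batch consumes roughly bptt rows from every column.
approx_batches = rows // bptt

print(rows, approx_batches)
```

The estimate lands within one of the 475 batches reported by `len(md.trn_dl)`.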

In [11]:
TEXT.vocab.itos[:12]

['<unk>', '<pad>', ' ', 'a', 'i', 'n', 'k', 'u', 'e', 'o', 'm', 't']

In [12]:
TEXT.vocab.stoi['wanawake']

0
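
The lookup returns `0` because the vocabulary is character-level: a whole word like `'wanawake'` is not an entry, so it falls back to index 0, which is `<unk>`. Below is a minimal plain-Python sketch of the `itos`/`stoi` idea with a `min_freq` cutoff. `build_vocab` and `lookup` are hypothetical helpers written for illustration, not torchtext's implementation.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi): specials first, then tokens by descending frequency."""
    counts = Counter(tokens)
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq),
        key=lambda tok: (-counts[tok], tok),   # frequent first, ties alphabetical
    )
    itos = list(specials) + kept
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# Character-level tokens from a short sample string.
itos, stoi = build_vocab(list("kata ndani ya jiji la zanzibar"))

def lookup(token, unk_index=0):
    """Map a token to its index, falling back to <unk> for anything unseen."""
    return stoi.get(token, unk_index)

print(itos[:4])
print(lookup("a"), lookup("wanawake"))
```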

In [23]:
md.trn_ds[0].text[:12]

["'",
 'urusi',
 "'",
 'ni',
 'kata',
 'ndani',
 'ya',
 'jiji',
 'la',
 'zanzibar',
 'katika',
 'mkoa']

In [24]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
    7
  436
    7
   13
   51
   56
    3
 1028
   14
  612
   12
   63
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

In [15]:
next(iter(md.trn_dl))

(Variable containing:
    29     4     2  ...      6     5     3
     2     2    24  ...      4     3    16
     7    21     2  ...     13     2     4
        ...          ⋱          ...       
     2    24     3  ...      2    10     4
    10     2     5  ...      3     3    25
     3    29     4  ...      7     2     7
 [torch.cuda.LongTensor of size 67x64 (GPU 0)], Variable containing:
   2
   2
  24
  ⋮ 
   2
  25
   5
 [torch.cuda.LongTensor of size 4288 (GPU 0)])
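
Each batch is a pair: an input matrix of shape sequence length x batch size (67 x 64 here; the sequence length varies around `bptt`) and a flattened target vector of 67 * 64 = 4288 entries. The target is the input shifted by one token, which is visible above: the second input row starts `2, 2, 24` and so does the target. The following is a toy sketch of that layout in plain Python; `make_batch` is a hypothetical helper, not the fastai loader.

```python
def make_batch(stream, bs, seq_len, start=0):
    """Build one (input, target) pair from a flat token stream."""
    rows = len(stream) // bs
    # Column j holds the j-th contiguous chunk of the stream.
    cols = [stream[j * rows:(j + 1) * rows] for j in range(bs)]
    # Input: seq_len rows, each row holds one time step across all columns.
    x = [[cols[j][start + t] for j in range(bs)] for t in range(seq_len)]
    # Target: the same positions shifted one step ahead, flattened row by row.
    y = [cols[j][start + t + 1] for t in range(seq_len) for j in range(bs)]
    return x, y

stream = list(range(40))            # stand-in for numericalized characters
x, y = make_batch(stream, bs=4, seq_len=3)

for row in x:
    print(row)
print(y)
```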

In [13]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

In [16]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

In [17]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3

In [18]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)

[ 0.       1.84804  1.8027 ]                                
[ 1.       1.46551  1.4337 ]                                
[ 2.       1.36759  1.36478]                                
[ 3.       1.3522   1.34719]                                
[ 4.       1.29151  1.29601]                                
[ 5.       1.21493  1.26029]                                
[ 6.       1.18884  1.2551 ]                                
[ 7.       1.29116  1.29547]                                
[ 8.       1.23742  1.27176]                                
[ 9.       1.20459  1.25499]                                
[ 10.        1.17083   1.23465]                             
[ 11.        1.13461   1.22151]                             
[ 12.        1.10217   1.21966]                             
[ 13.        1.1055    1.21128]                             
[ 14.        1.09189   1.21604]                             
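
The call above asks for 4 cycles with `cycle_len=1` and `cycle_mult=2`, so the cycles last 1, 2, 4 and 8 epochs, which accounts for the 15 rows (epochs 0 to 14) in the log. The learning rate restarts at the beginning of each cycle, which is consistent with the loss bump at epoch 7. Here is a small sketch of that schedule, assuming plain cosine annealing within each cycle; the helper names are illustrative, and the real schedule is applied per iteration rather than per epoch.

```python
import math

def cycle_lengths(n_cycles, cycle_len=1, cycle_mult=2):
    """Length in epochs of each cycle: cycle_len, cycle_len*mult, ..."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycles)]

def cosine_lr(lr_max, frac):
    """Cosine-annealed learning rate; frac is progress through the cycle in [0, 1]."""
    return lr_max * 0.5 * (1 + math.cos(math.pi * frac))

lengths = cycle_lengths(4)
total_epochs = sum(lengths)
restart_epochs = [sum(lengths[:i]) for i in range(len(lengths))]

print(lengths, total_epochs, restart_epochs)
print([round(cosine_lr(3e-3, f), 5) for f in (0.0, 0.5, 1.0)])
```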



In [19]:
learner.save_encoder('adam3_10_enc')

In [20]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')

[ 0.       1.09038  1.21278]                                
[ 1.       1.20757  1.25553]                                
[ 2.       1.18651  1.25224]                                
[ 3.       1.19886  1.24667]                                
[ 4.       1.17607  1.23936]                                
[ 5.       1.17065  1.2383 ]                                
[ 6.       1.13709  1.22987]                                
[ 7.       1.13116  1.22081]                                
[ 8.       1.11804  1.21652]                                
[ 9.       1.10086  1.21648]                                
[ 10.        1.0866    1.21221]                             
[ 11.        1.07416   1.20837]                             
[ 12.        1.0595    1.20681]                             
[ 13.        1.03628   1.20742]                             
[ 14.        1.02444   1.20475]                             
[ 15.        1.02398   1.20191]                              
[ 16.        1.01508   

In [21]:
learner.fit(3e-3, 2, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')

[ 0.       0.99101  1.20673]                                 
[ 1.       1.14045  1.23171]                                
[ 2.       1.13623  1.23394]                                
[ 3.       1.13346  1.22846]                                
[ 4.       1.12269  1.22489]                                
[ 5.       1.11968  1.2172 ]                                
[ 6.       1.10427  1.22015]                                
[ 7.       1.10122  1.21713]                                
[ 8.       1.08419  1.20421]                                
[ 9.       1.05875  1.20812]                                
[ 10.        1.0471    1.21016]                             
[ 11.        1.05137   1.19651]                             
[ 12.        1.01176   1.209  ]                             
[ 13.        1.01766   1.19803]                              
[ 14.        1.01231   1.19872]                              
[ 15.        0.98938   1.20526]                              
[ 16.        0.97248

In [68]:
learner.fit(3e-4, 2, wds=1e-6, cycle_len=20, cycle_save_name='adam3_20')

[ 0.       0.96005  1.21016]                                 
[ 1.       0.95857  1.20463]                                 
[ 2.       0.95164  1.20848]                                 
[ 3.       0.94882  1.20636]                                 
[ 4.       0.96099  1.2022 ]                                 
[ 5.       0.9501   1.21285]                                 
[ 6.       0.96187  1.21163]                                 
[ 7.       0.91749  1.21205]                                 
[ 8.       0.93302  1.21019]                                 
[ 9.       0.9417   1.21119]                                 
[ 10.        0.94324   1.21344]                              
[ 11.        0.92494   1.214  ]                              
[ 12.        0.95103   1.21247]                              
[ 13.        0.91194   1.2134 ]                              
[ 14.        0.94848   1.21651]                              
[ 15.        0.92945   1.21664]                              
[ 16.   
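
Across these runs the training loss keeps falling (to roughly 0.92) while the validation loss stays around 1.20 to 1.21, which suggests further training mostly overfits. To put the validation number in more familiar units, the sketch below converts it to per-character perplexity and bits per character. It assumes the logged loss is a mean cross-entropy in nats per character, which is the usual convention but is not stated in the notebook.

```python
import math

val_loss = 1.20   # approximate validation loss from the logs above

# Perplexity: the effective number of equally likely next characters.
perplexity = math.exp(val_loss)

# Bits per character: the same quantity in base 2.
bits_per_char = val_loss / math.log(2)

print(round(perplexity, 2), round(bits_per_char, 2))
```

Under that assumption the model is about as uncertain as if it were choosing uniformly among a little over three characters at each step.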

In [69]:
m=learner.model
ss=""", aina zote mbili za kawaida za utathmini wa hatua ni ule wa kliniki na upasuaji"""
s = [spacy_tok(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

', aina zote mbili za kawaida za utathmini wa hatua ni ule wa kliniki na upasuaji'

In [70]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

In [71]:
nexts = torch.topk(res[-1], 50)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['<unk>',
 ' ',
 'י',
 'ա',
 'o',
 'k',
 'n',
 '\u200e',
 '/',
 't',
 'x',
 'l',
 'y',
 'c',
 'ā',
 'a',
 'ō',
 'ş',
 's',
 'ς',
 'α',
 'ر',
 'ي',
 'á',
 'e',
 '²',
 'ν',
 'ι',
 'ο',
 'i',
 'ı',
 'τ',
 'م',
 'ε',
 'ç',
 'v',
 'ة',
 'g',
 ')',
 'b',
 'κ',
 'ه',
 'w',
 '(',
 'μ',
 'د',
 'ρ',
 'ğ',
 'δ',
 'π']

In [75]:
print(ss,"\n")
for i in range(400):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end='')
    res,*_ = m(n[0].unsqueeze(0))
print('...')

, aina zote mbili za kawaida za utathmini wa hatua ni ule wa kliniki na upasuaji 

ifa ya marekani ilikuwa na matatizo ya kitaifa ya kisasa ya kisasa ya kisasa ya kisayansi katika maeneo ya kisasa . katika mashindano ya masomo ya kimataifa ya kisasa ni kupitia mashine ya mataifa ya kikristo . katika mashindano ya masomo ya kibinadamu , kinachotumika kutoka kwa maji ya maji ya maji ya kujitegemea . katika maeneo ya kisasa ni kupitia mashine ya matatizo ya kitaifa ya marekani . ma...
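
The generation loop above is greedy decoding with one twist: it takes the top two candidates and falls back to the second whenever the best one is `<unk>` (index 0). Always picking the most likely character is also why the output tends to fall into loops such as repeated phrases. The sketch below reproduces the same selection rule on a toy lookup table in plain Python; the table, `itos` list and `greedy_generate` are made up for illustration and stand in for the trained model.

```python
def greedy_generate(scores_fn, start, n_steps, unk=0):
    """Repeatedly pick the best-scoring next token, skipping <unk>."""
    out, cur = [], start
    for _ in range(n_steps):
        scores = scores_fn(cur)
        # Indices of the two highest scores, best first.
        top2 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
        cur = top2[1] if top2[0] == unk else top2[0]
        out.append(cur)
    return out

itos = ["<unk>", "y", "a", " "]
# Toy "model": next-token scores given the current token index.
table = {
    1: [0.1, 0.0, 0.9, 0.0],   # after 'y' the best next token is 'a'
    2: [0.5, 0.0, 0.0, 0.4],   # after 'a', <unk> scores highest, so ' ' is used
    3: [0.0, 0.8, 0.1, 0.1],   # after ' ' the best next token is 'y'
}

generated = greedy_generate(lambda i: table[i], start=1, n_steps=6)
text = "".join(itos[i] for i in generated)
print(repr(text))
```

Sampling from the predicted distribution instead of always taking the maximum is a common way to reduce this kind of repetition.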

