<a href="https://colab.research.google.com/github/iam-kevin/nlp-workshop/blob/master/Demos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Data Collection

Obtaining sample data that will be used for swahili showcasing

[OSCAR Corpus](https://oscar-corpus.com/)

In [2]:
!wget https://traces1.inria.fr/oscar/files/compressed-orig/sw.txt.gz 
!gzip -d sw.txt.gz

--2020-06-13 12:05:04--  https://traces1.inria.fr/oscar/files/compressed-orig/sw.txt.gz
Resolving traces1.inria.fr (traces1.inria.fr)... 128.93.193.43
Connecting to traces1.inria.fr (traces1.inria.fr)|128.93.193.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3745590 (3.6M) [application/gzip]
Saving to: ‘sw.txt.gz’


2020-06-13 12:05:06 (4.23 MB/s) - ‘sw.txt.gz’ saved [3745590/3745590]



## Data Preparation

We load up the `sw.txt` file, make the clean and store the clean format in `sw-clean.txt`. The clean format is what we would then use in the processes to follow.

The Cleaning processes invloves
- Converting all characters to lower case,
- Skip any lines with odd punctuations. (i.e. anything that is not `. , ( ) : " '`)
- Removing the known punctuation

In [0]:
# load the data
sw_lines = None
with open('sw.txt', 'r', encoding='utf-8') as txtf:
    sw_lines = txtf.readlines()

# change all characters to lower case
sw_lines = [line.lower() for line in sw_lines]

In [0]:
import re

# only take lines with allowed characters
allowed_punctuations = r'\.\,\(\)\:\"\''

sw_clean_lines = []
for line in sw_lines:
    # take only the lines with the allowed characters
    if not bool(re.findall('((?![{}\s]+)\W+)'.format(allowed_punctuations), line)):
        sw_clean_lines.append(line)

# remove the punctuation characters
sw_clean_lines = [re.sub(f'[{allowed_punctuations}]', '', line) for line in sw_clean_lines]

In [0]:
import os
# save the file in sw-clean
with open('sw-clean.txt', 'w', encoding='utf-8') as wf:
    wf.write("".join(sw_clean_lines))

## Embeddings

These forms of representing a character, word or even a sentence using a vector, so that it can be used by a machine learning algorithm to be able to perform other tasks.

3 different forms of algorithms will be used
1. `Word2Vec` Embeddings
2. `FastText` Embeddings
3. `Flair` Embeddings

In [6]:
import re

# load the clean data to train on
with open('sw-clean.txt', 'r', encoding='utf-8') as cf:
    corpus = cf.readlines()

corpus = [ re.split('\s+', text) for text in corpus ]
corpus[0]

['miripuko',
 'hiyo',
 'inakuja',
 'mwanzoni',
 'mwa',
 'wiki',
 'takatifu',
 'kuelekea',
 'pasaka',
 'na',
 'ikiwa',
 'ni',
 'wiki',
 'chache',
 'tu',
 'kabla',
 'ya',
 'papa',
 'francis',
 'kuanza',
 'ziara',
 'yake',
 'katika',
 'nchi',
 'hiyo',
 'yenye',
 'idadi',
 'kubwa',
 'kabisa',
 'ya',
 'watu',
 'katika',
 'ulimwengu',
 'wa',
 'nchi',
 'za',
 'kiarabu',
 '']

### Using Word2Vec: (CBOW: Continuos Bag-Of-Words)

In [0]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(corpus, size=100, window=5, workers=4)

In [8]:
w2v_model.wv.most_similar('kiswahili')

  if np.issubdtype(vec.dtype, np.int):


[('kiingereza', 0.8887909650802612),
 ('lugha', 0.80258709192276),
 ('biblia', 0.7429032325744629),
 ('matusi', 0.7223591804504395),
 ('idhaa', 0.7117130160331726),
 ('maandiko', 0.7038394212722778),
 ('tafsiri', 0.6838766932487488),
 ('fizikia', 0.6656526923179626),
 ('kigiriki', 0.66070955991745),
 ('kingereza', 0.6592473387718201)]

### Using FastText

In [0]:
from gensim.models import FastText

ft_model = FastText(size=100, window=3, min_count=1)  # instantiate
ft_model.build_vocab(sentences=corpus)
ft_model.train(sentences=corpus, total_examples=len(corpus), epochs=10)

In [10]:
ft_model.wv.most_similar('kiswahili')

  if np.issubdtype(vec.dtype, np.int):


[('swahili', 0.892357349395752),
 ('kiingreza', 0.8167586922645569),
 ('kiingereza', 0.8099594712257385),
 ('sahili', 0.803668200969696),
 ('kiinjili', 0.801956057548523),
 ('kiswa', 0.7989442944526672),
 ('udahili', 0.7952920794487),
 ('jahili', 0.7944253087043762),
 ('kinawli', 0.7877073287963867),
 ('kingereza', 0.7729043364524841)]

### Using Flair

Installing the `flair` package

In [0]:
!pip install flair

Collecting flair
[?25l  Downloading https://files.pythonhosted.org/packages/f1/0e/38d173e7a5b595e108c7d7a31f7b4d88fb93192f3b12a78998ff500c5203/flair-0.5-py3-none-any.whl (334kB)
[K     |█                               | 10kB 16.0MB/s eta 0:00:01[K     |██                              | 20kB 2.9MB/s eta 0:00:01[K     |███                             | 30kB 3.9MB/s eta 0:00:01[K     |████                            | 40kB 4.3MB/s eta 0:00:01[K     |█████                           | 51kB 3.3MB/s eta 0:00:01[K     |█████▉                          | 61kB 3.9MB/s eta 0:00:01[K     |██████▉                         | 71kB 4.1MB/s eta 0:00:01[K     |███████▉                        | 81kB 4.4MB/s eta 0:00:01[K     |████████▉                       | 92kB 4.7MB/s eta 0:00:01[K     |█████████▉                      | 102kB 4.5MB/s eta 0:00:01[K     |██████████▊                     | 112kB 4.5MB/s eta 0:00:01[K     |███████████▊                    | 122kB 4.5MB/s eta 0:00:01

Flair has a unique way of preparing data before it is being trained

the folder structure must look like this:
```
corpus/
corpus/train/
corpus/train/train_split_1
corpus/train/train_split_2
corpus/train/...
corpus/train/train_split_X
corpus/test.txt
corpus/valid.txt
```

In [0]:
def split_train(n_splits: int, train_text_set: list):
    ll = len(train_text_set)
    sectors = list(range(ll // n_splits, ll - n_splits + 2, ll // n_splits))

    for i, x in enumerate(sectors):
        if i == 0:
            # first item
            yield (train_text_set[:x])
            continue
        
        yield (train_text_set[sectors[i - 1]:x])

    # for the last one
    yield (train_text_set[sectors[-1]:])

In [0]:
from pathlib import Path

train_test_val_ratio = 0.7
test_val_ratio = 0.6

train_split_count = int(train_test_val_ratio * len(corpus))
train_text_set, test_val_text_set = corpus[:train_split_count], corpus[train_split_count:]

test_split_count = int(test_val_ratio * len(test_val_text_set))
test_text_set, val_text_set = test_val_text_set[:test_split_count], test_val_text_set[test_split_count:]

corpus_path = Path('./corpus').joinpath('./train')
corpus_path.mkdir(parents=True, exist_ok=True)

# save the val and test corpus
with open(corpus_path.parent.joinpath(f'valid.txt'), 'w', encoding='utf-8') as tf:
    tf.write("".join([" ".join(text) for text in val_text_set]))

with open(corpus_path.parent.joinpath(f'test.txt'), 'w', encoding='utf-8') as tf:
    tf.write("".join([" ".join(text) for text in val_text_set]))

# save the train corpus
for ix, tsdata in enumerate(split_train(10, train_text_set)):
    with open(corpus_path.joinpath(f'./train_split_{ix + 1}'), 'w', encoding='utf-8') as tf:
        tf.write("".join([" ".join(text) for text in val_text_set]))

#### Training flair's language model

In [0]:
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# are you training a forward or backward LM?
is_forward_lm = True

# load the default character dictionary
dictionary: Dictionary = Dictionary.load('chars')

# get your corpus, process forward and at the character level
corpus = TextCorpus('/content/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=128,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/language_model',
              sequence_length=10,
              mini_batch_size=10,
              max_epochs=10,
              num_workers=8)

2020-06-08 16:04:01,567 https://s3.eu-central-1.amazonaws.com/alan-nlp/resources/models/common_characters not found in cache, downloading to /tmp/tmpb0zu5ttg


100%|██████████| 2887/2887 [00:00<00:00, 341405.09B/s]

2020-06-08 16:04:02,176 copying /tmp/tmpb0zu5ttg to cache at /root/.flair/datasets/common_characters
2020-06-08 16:04:02,183 removing temp file /tmp/tmpb0zu5ttg
2020-06-08 16:04:02,228 read text file with 1 lines





2020-06-08 16:04:09,761 read text file with 1 lines
2020-06-08 16:04:17,275 read text file with 1 lines
2020-06-08 16:04:17,275 read text file with 1 lines
2020-06-08 16:04:17,286 shuffled
2020-06-08 16:04:17,288 shuffled
2020-06-08 16:04:30,166 read text file with 1 lines
2020-06-08 16:04:30,167 read text file with 1 lines
2020-06-08 16:04:30,171 shuffled
2020-06-08 16:04:30,181 shuffled
2020-06-08 16:04:30,206 Sequence length is 10
2020-06-08 16:04:30,217 Split 1	 - (16:04:30)
2020-06-08 16:04:31,557 | split   1 / 10 |   100/11855 batches | ms/batch 13.35 | loss  2.93 | ppl    18.79
2020-06-08 16:04:32,732 | split   1 / 10 |   200/11855 batches | ms/batch 11.67 | loss  2.36 | ppl    10.59
2020-06-08 16:04:33,943 | split   1 / 10 |   300/11855 batches | ms/batch 12.05 | loss  2.25 | ppl     9.45
2020-06-08 16:04:35,196 | split   1 / 10 |   400/11855 batches | ms/batch 12.41 | loss  2.17 | ppl     8.77
2020-06-08 16:04:36,393 | split   1 / 10 |   500/11855 batches | ms/batch 11.93 | lo