# Intro

In this lab, I tried to build a language model that could write my diploma.

Scroll down to see how it turned out :)

Key achievements


*   Explored the tokenizers available in NLTK and wrote custom tokenizers that combine several processing steps
*   Implemented two spellcheckers
*   Wrote an n-gram language model
*   Implemented add-one, linear interpolation and backoff smoothing
*   Implemented several strategies for selecting the next token: most probable, random, random weighted, random weighted with temperature
*   Implemented beam search
*   Experimented with different strategies for choosing the first token for generation: random token, sentence delimiter token, token from the user, OOV (misspelled) token



# Load data

For this lab I use my coursework: the second-year coursework as the train text and the third-year coursework as the test text, because for some reason the more recent coursework is shorter than the one from the previous year.

Let's train a language model, so that it writes my diploma for me next year.

In [2]:
from pathlib import Path

In [3]:
with open(Path().absolute() / 'coursework_2.txt') as f:
    train_text = f.read()

with open(Path().absolute() / 'coursework_3.txt') as f:
    test_text = f.read()

# Tokenizers

In [4]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## How to tokenize a sentence?

Let's examine the options available in the nltk library.

In [8]:
from nltk import word_tokenize, TreebankWordTokenizer, wordpunct_tokenize, TweetTokenizer, regexp_tokenize

In [9]:
text_sample = "Hello-hello, (world!) How are @you?. I'm fine, but laba is killing me)"

In [10]:
# standard tokenizers
print('word_tokenize:\n', word_tokenize(text_sample))
print('wordpunct_tokenize:\n', wordpunct_tokenize(text_sample))

# something special
print('TreebankWordTokenizer:\n', TreebankWordTokenizer().tokenize(text_sample))
print('TweetTokenizer\n', TweetTokenizer().tokenize(text_sample))

word_tokenize:
 ['Hello-hello', ',', '(', 'world', '!', ')', 'How', 'are', '@', 'you', '?', '.', 'I', "'m", 'fine', ',', 'but', 'laba', 'is', 'killing', 'me', ')']
wordpunct_tokenize:
 ['Hello', '-', 'hello', ',', '(', 'world', '!)', 'How', 'are', '@', 'you', '?.', 'I', "'", 'm', 'fine', ',', 'but', 'laba', 'is', 'killing', 'me', ')']
TreebankWordTokenizer:
 ['Hello-hello', ',', '(', 'world', '!', ')', 'How', 'are', '@', 'you', '?', '.', 'I', "'m", 'fine', ',', 'but', 'laba', 'is', 'killing', 'me', ')']
TweetTokenizer
 ['Hello-hello', ',', '(', 'world', '!', ')', 'How', 'are', '@you', '?', '.', "I'm", 'fine', ',', 'but', 'laba', 'is', 'killing', 'me', ')']


With `regexp_tokenize` it is possible to customize the tokenizer's behaviour in detail. I'm not sure all the options listed below make sense, though. I could come up with even more combinations, but decided this is enough :)

In [11]:
# full control
print(regexp_tokenize(text_sample, pattern=r'\w+'))                 # all sequences of alphanumeric chars

print(regexp_tokenize(text_sample, pattern=r'\w'))                  # all alphanumeric chars, one by one

print(regexp_tokenize(text_sample, pattern=r'\W'))                  # all non-alphanumeric chars, one by one (punctuation marks and spaces)

print(regexp_tokenize(text_sample, pattern=r'\S'))                  # all non-space chars, one by one (letters + punctuation)

print(regexp_tokenize(text_sample, pattern=r'\S+'))                 # all sequences of non-space chars (letters + punctuation)

print(regexp_tokenize(text_sample, pattern=r'\w+\S\w+'))            # words of 3+ chars, possibly with one punctuation mark inside; shorter words are dropped

print(regexp_tokenize(text_sample, pattern=r'\w+\S\w+|\w'))         # same, but shorter words are split into single chars

print(regexp_tokenize(text_sample, pattern=r'\w+|[^\w\s]'))         # words and punctuation marks separately

print(regexp_tokenize(text_sample, pattern=r'\w+\S\w+|[^\w\s]'))    # words of 3+ chars (punctuation allowed inside) plus standalone punctuation marks

['Hello', 'hello', 'world', 'How', 'are', 'you', 'I', 'm', 'fine', 'but', 'laba', 'is', 'killing', 'me']
['H', 'e', 'l', 'l', 'o', 'h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd', 'H', 'o', 'w', 'a', 'r', 'e', 'y', 'o', 'u', 'I', 'm', 'f', 'i', 'n', 'e', 'b', 'u', 't', 'l', 'a', 'b', 'a', 'i', 's', 'k', 'i', 'l', 'l', 'i', 'n', 'g', 'm', 'e']
['-', ',', ' ', '(', '!', ')', ' ', ' ', ' ', '@', '?', '.', ' ', "'", ' ', ',', ' ', ' ', ' ', ' ', ' ', ')']
['H', 'e', 'l', 'l', 'o', '-', 'h', 'e', 'l', 'l', 'o', ',', '(', 'w', 'o', 'r', 'l', 'd', '!', ')', 'H', 'o', 'w', 'a', 'r', 'e', '@', 'y', 'o', 'u', '?', '.', 'I', "'", 'm', 'f', 'i', 'n', 'e', ',', 'b', 'u', 't', 'l', 'a', 'b', 'a', 'i', 's', 'k', 'i', 'l', 'l', 'i', 'n', 'g', 'm', 'e', ')']
['Hello-hello,', '(world!)', 'How', 'are', '@you?.', "I'm", 'fine,', 'but', 'laba', 'is', 'killing', 'me)']
['Hello-hello', 'world', 'How', 'a

## How to split a corpus into sentences

In [12]:
from nltk.tokenize import sent_tokenize

In [13]:
sent_tokenize(text_sample)

['Hello-hello, (world!)',
 'How are @you?.',
 "I'm fine, but laba is killing me)"]

## Combine multiple functions in CustomTokenizer

`CustomTokenizer` is based on `regexp_tokenize` and `sent_tokenize` and applies several processing steps at once.

Some options for whether to keep punctuation and whether to cast to lower case. Note that in the outputs below the default pattern splits two-letter words (`is`, `me`) into single letters:

In [14]:
from tokenizer import CustomTokenizer

In [15]:
custom_word_tokenizer = CustomTokenizer()
custom_word_tokenizer.tokenize_corpus(text_sample)

['hello-hello',
 'world',
 '<EOS>',
 'how',
 'are',
 'you',
 '<EOS>',
 "i'm",
 'fine',
 'but',
 'laba',
 'i',
 's',
 'killing',
 'm',
 'e',
 '<EOS>']

In [16]:
custom_word_tokenizer_uncased = CustomTokenizer(do_lower_case=False)
custom_word_tokenizer_uncased.tokenize_corpus(text_sample)

['Hello-hello',
 'world',
 '<EOS>',
 'How',
 'are',
 'you',
 '<EOS>',
 "I'm",
 'fine',
 'but',
 'laba',
 'i',
 's',
 'killing',
 'm',
 'e',
 '<EOS>']

In [17]:
custom_word_punct_tokenizer = CustomTokenizer(pattern=r'\w+|[^\w\s]')
custom_word_punct_tokenizer.tokenize_corpus(text_sample)

['hello',
 '-',
 'hello',
 ',',
 '(',
 'world',
 '!',
 ')',
 '<EOS>',
 'how',
 'are',
 '@',
 'you',
 '?',
 '.',
 '<EOS>',
 'i',
 "'",
 'm',
 'fine',
 ',',
 'but',
 'laba',
 'is',
 'killing',
 'me',
 ')',
 '<EOS>']

In [18]:
custom_word_punct_tokenizer_uncased = CustomTokenizer(pattern=r'\w+|[^\w\s]', do_lower_case=False)
custom_word_punct_tokenizer_uncased.tokenize_corpus(text_sample)

['Hello',
 '-',
 'hello',
 ',',
 '(',
 'world',
 '!',
 ')',
 '<EOS>',
 'How',
 'are',
 '@',
 'you',
 '?',
 '.',
 '<EOS>',
 'I',
 "'",
 'm',
 'fine',
 ',',
 'but',
 'laba',
 'is',
 'killing',
 'me',
 ')',
 '<EOS>']
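
The source of `CustomTokenizer` is not shown in this notebook, so here is only a minimal sketch of how such a class could look. Everything in it is an assumption: the class name `SimpleTokenizer` is hypothetical, a regex-based sentence splitter stands in for `sent_tokenize`, and the default pattern is my own choice (it keeps two-letter words such as `is` whole, unlike the outputs above).

```python
import re

EOS = "<EOS>"

# Split on whitespace that follows ., ! or ?, optionally with one closing
# bracket or quote in between. A rough stand-in for nltk's sent_tokenize.
_SENTENCE_BOUNDARY = re.compile(r"(?:(?<=[.!?])|(?<=[.!?][)\]\"']))\s+")


class SimpleTokenizer:
    """Hypothetical stand-in for CustomTokenizer (not the notebook's code)."""

    def __init__(self, pattern=r"\w+(?:[-']\w+)*", do_lower_case=True):
        # Assumed default: words, optionally joined by hyphens or apostrophes.
        self.pattern = re.compile(pattern)
        self.do_lower_case = do_lower_case

    def tokenize_corpus(self, text):
        tokens = []
        for sentence in _SENTENCE_BOUNDARY.split(text):
            words = self.pattern.findall(sentence)
            if not words:
                continue  # skip fragments that contain no token at all
            if self.do_lower_case:
                words = [word.lower() for word in words]
            tokens.extend(words)
            tokens.append(EOS)  # mark the sentence boundary
        return tokens
```

Passing `pattern=r"\w+|[^\w\s]"` switches the sketch to the "keep punctuation" behaviour, and `do_lower_case=False` keeps the original casing.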

# Spellcheckers

Let's compare two spellcheckers: one based on Levenshtein distance, the other on fastText vector representations of OOV words.

In [19]:
from spellchecking import EditDistanceSpellchecker, FastTextSpellChecker

In [20]:
tokenized_train = custom_word_tokenizer.tokenize_corpus(train_text)

## Edit distance spellchecker

In [21]:
edit_distance_spellchecker = EditDistanceSpellchecker()
edit_distance_spellchecker.set_vocabulary(tokenized_train)

In [22]:
edit_distance_spellchecker.find_closest('програма')

'программа'

In [23]:
edit_distance_spellchecker.find_closest('иследовательский')

'исследовательский'

In [24]:
edit_distance_spellchecker.find_closest('ленгвистика')

'лингвистика'

In [25]:
edit_distance_spellchecker.find_closest('езык')

'язык'
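
For reference, the idea behind an edit-distance spellchecker can be sketched in a few lines. This is not the code of `EditDistanceSpellchecker` (which is not shown here); the function names `levenshtein` and `find_closest` below are my own, and the lookup is a plain linear scan over the vocabulary.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def find_closest(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to `word`."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))
```

A linear scan is fine for a vocabulary of a few thousand tokens; for larger vocabularies one would probably want an index or a length-based pre-filter.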

## FastText spellchecker

In [2]:
# download model
!wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ru.300.bin.gz
!gzip -d cc.ru.300.bin.gz

Collecting fasttext
  Downloading fasttext-0.9.2.tar.gz (68 kB)
  Preparing metadata (setup.py) ... done
Collecting pybind11>=2.2 (from fasttext)
  Using cached pybind11-2.12.0-py3-none-any.whl (234 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.2-cp310-cp310-linux_x86_64.whl size=4227142 sha256=76fe683338269725bd24a2227178799da68cc873981f128379d1014deec8eb44
  Stored in directory: /root/.cache/pip/wheels/a5/13/75/f811c84a8ab36eedbaef977a6a58a98990e8e0f1967f98f394
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.2 pybind11-2.12.0


In [28]:
fasttext_spell_check = FastTextSpellChecker()
fasttext_spell_check.set_vocabulary(tokenized_train)



In [29]:
fasttext_spell_check.find_closest('иследовательский')

'исследовательский'

In [30]:
fasttext_spell_check.find_closest('програма')

'программа'

In [32]:
fasttext_spell_check.find_closest('ленгвистика')

In [33]:
fasttext_spell_check.find_closest('езык')

In practice, the fastText-based spellchecker works much worse. FastText can handle OOV tokens, but it seems that a single misspelled letter strongly affects the resulting word embedding.
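
The general shape of a vector-based spellchecker is a nearest-neighbour lookup by cosine similarity. The sketch below illustrates only that lookup: to stay self-contained it uses a toy embedding (counts of character trigrams) instead of fastText vectors, so it says nothing about how well fastText itself performs. All function names here are hypothetical, and `FastTextSpellChecker` may work differently.

```python
import math
from collections import Counter


def char_ngram_vector(word, n=3):
    """Toy embedding (NOT fastText): counts of character n-grams of the padded word."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_closest_by_vector(word, vocabulary, embed=char_ngram_vector):
    """Return the vocabulary entry whose vector is most similar to that of `word`."""
    target = embed(word)
    return max(vocabulary, key=lambda candidate: cosine(target, embed(candidate)))
```

With a real model, `embed` would be replaced by a function that returns the model's word vector, and `cosine` by its dense equivalent.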

# Language model

In [26]:
from language_model import LanguageModel

## Test model on toy data

In [56]:
three_model = LanguageModel(3)

tokenized_text_sample = custom_word_punct_tokenizer.tokenize_corpus(text_sample)
tokenized_text_sample

['hello',
 '-',
 'hello',
 ',',
 '(',
 'world',
 '!',
 ')',
 '<EOS>',
 'how',
 'are',
 '@',
 'you',
 '?',
 '.',
 '<EOS>',
 'i',
 "'",
 'm',
 'fine',
 ',',
 'but',
 'laba',
 'is',
 'killing',
 'me',
 ')',
 '<EOS>']

### Perplexity with different smoothing techniques

In [57]:
three_model.fit(tokenized_text_sample)

In [58]:
# test on train data just for debugging purposes

three_model.estimate_perplexity(tokenized_text_sample)

0.5

In [59]:
three_model.estimate_perplexity(tokenized_text_sample, smoothing='linear_interpolation')

1.0859560723351365

In [60]:
three_model.estimate_perplexity(tokenized_text_sample, smoothing='backoff')

1.0270180507087725
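
For reference, here is a minimal sketch of how perplexity over bigrams is usually computed, with optional add-one smoothing. It is written under assumptions: the internals of `LanguageModel` are not shown in this notebook, the function names are mine, and the vocabulary size counts training types only (no separate slot for unknown words).

```python
import math
from collections import Counter


def fit_bigrams(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def bigram_prob(prev, word, unigrams, bigrams, add_one=False):
    """P(word | prev), either maximum likelihood or add-one smoothed."""
    if add_one:
        # Simplification: the vocabulary is the set of training types.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))
    if unigrams[prev] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return bigrams[(prev, word)] / unigrams[prev]


def perplexity(tokens, unigrams, bigrams, add_one=False):
    """exp of the average negative log-probability of each bigram."""
    log_sum, count = 0.0, 0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word, unigrams, bigrams, add_one)
        if p == 0.0:
            return math.inf  # a single zero-probability bigram gives inf
        log_sum += math.log(p)
        count += 1
    return math.exp(-log_sum / count)
```

With this definition every probability is at most 1, so the result can never drop below 1; the `0.5` reported above suggests that `LanguageModel` presumably normalises somewhat differently.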

### Test generations

In [61]:
# default (random token from vocab)

for i, token in enumerate(three_model.generate()):
    print(token)
    if i > 10:
        break

i
'
m
fine
,
but
laba
is
killing
me
)
<EOS>


In [62]:
# with start/end of sentence token

for i, token in enumerate(three_model.generate(prompt='<EOS>')):
    print(token)
    if i > 10:
        break

i
'
m
fine
,
but
laba
is
killing
me
)
<EOS>


In [63]:
# with prompt

for i, token in enumerate(three_model.generate(prompt='killing')):
    print(token)
    if i > 10:
        break

me
)
<EOS>
how
are
@
you
?
.
<EOS>
i
'


In [64]:
# prompt with spelling mistake (the same result as cell above)

for i, token in enumerate(three_model.generate(prompt='kilin')):
    print(token)
    if i > 10:
        break

me
)
<EOS>
how
are
@
you
?
.
<EOS>
i
'


In [65]:
# beam search

three_model.beam_generate(prompt='<EOS>')

"<EOS> i ' m fine , ( world ! ) <EOS> how"

## Test model on my corpus

### Estimate perplexity

#### Bi-gram model, lower_case, clean punctuation

In [142]:
tokenized_train = custom_word_tokenizer.tokenize_corpus(train_text)
tokenized_test = custom_word_tokenizer.tokenize_corpus(test_text)

In [143]:
bi_model = LanguageModel(2)
bi_model.fit(tokenized_train)
len(bi_model.vocabulary)

3816

In [144]:
bi_model.estimate_perplexity(tokenized_test)

inf

In [145]:
bi_model.estimate_perplexity(tokenized_test, smoothing ='add_one')

2700.9760747540154

In [146]:
bi_model.estimate_perplexity(tokenized_test, smoothing ='backoff')

inf

In [147]:
bi_model.estimate_perplexity(tokenized_test[:500], smoothing ='linear_interpolation')

inf

Both `linear_interpolation` and `backoff` smoothing require that at least every unigram from the test text is present in the train corpus, which is not always the case, so the perplexity is `inf` again.

Overall, the result is not great, but the perplexity is at least lower than the number of unique tokens in the vocabulary (about 2701 vs 3816), so it is better than a random guess.
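
The reason an unseen unigram breaks both techniques is visible in their textbook forms. The sketch below is an assumption about what such smoothing looks like, not the notebook's implementation: the interpolation weight `lam`, the backoff discount `alpha` (in the style of "stupid backoff", which does not yield a normalised distribution) and all function names are mine.

```python
from collections import Counter


def count_ngrams(tokens):
    """Count unigrams and bigrams in a list of tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))


def unigram_prob(word, unigrams):
    total = sum(unigrams.values())
    return unigrams[word] / total if total else 0.0


def interpolated_prob(prev, word, unigrams, bigrams, lam=0.7):
    """lam * P(word | prev) + (1 - lam) * P(word)."""
    p_bigram = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bigram + (1 - lam) * unigram_prob(word, unigrams)


def backoff_prob(prev, word, unigrams, bigrams, alpha=0.4):
    """Use the bigram estimate if the bigram was seen, else a discounted unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return alpha * unigram_prob(word, unigrams)
```

In both functions the lowest-order term is the unigram probability, so a word that never occurred in the train text still gets probability 0 and the perplexity becomes infinite.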

#### Bi-gram, keep upper case, clean punctuation

In [148]:
tokenized_train = custom_word_tokenizer_uncased.tokenize_corpus(train_text)
tokenized_test = custom_word_tokenizer_uncased.tokenize_corpus(test_text)

In [149]:
bi_model = LanguageModel(2)
bi_model.fit(tokenized_train)
len(bi_model.vocabulary)

4022

In [150]:
bi_model.estimate_perplexity(tokenized_test, smoothing ='add_one')

2911.0632837917447

#### Bi-gram, lower case, keep punctuation

In [151]:
tokenized_train = custom_word_punct_tokenizer.tokenize_corpus(train_text)
tokenized_test = custom_word_punct_tokenizer.tokenize_corpus(test_text)

In [152]:
bi_model = LanguageModel(2)
bi_model.fit(tokenized_train)
len(bi_model.vocabulary)

3892

In [153]:
bi_model.estimate_perplexity(tokenized_test, smoothing ='add_one')

2320.516229375627

Seems like this is the best we can do.

#### Bi-gram, keep upper case, keep punctuation

In [154]:
tokenized_train = custom_word_punct_tokenizer_uncased.tokenize_corpus(train_text)
tokenized_test = custom_word_punct_tokenizer_uncased.tokenize_corpus(test_text)

In [155]:
bi_model = LanguageModel(2)
bi_model.fit(tokenized_train)
len(bi_model.vocabulary)

4102

In [156]:
bi_model.estimate_perplexity(tokenized_test, smoothing ='add_one')

2467.5332264513154

#### Three-gram model, lower_case, clean punctuation

In [157]:
tokenized_train = custom_word_tokenizer.tokenize_corpus(train_text)
tokenized_test = custom_word_tokenizer.tokenize_corpus(test_text)

In [158]:
three_model = LanguageModel(3)
three_model.fit(tokenized_train)
len(three_model.vocabulary)

3816

In [159]:
three_model.estimate_perplexity(tokenized_test, smoothing ='add_one')

3561.7082897252994

### Generate text

In [215]:
# random token, most probable
for i, token in enumerate(bi_model.generate()):
    print(token)
    if i > 10:
        break

предложений
с
помощью
модели
машинного
обучения
,
а
также
способы
их
формирования


In [117]:
# start/end of sentence token
for i, token in enumerate(bi_model.generate('<EOS>')):
    print(token)
    if i > 10:
        break

Использование
разговорной
лексики
Использование
разговорной
лексики
Использование
разговорной
лексики
Использование
разговорной
лексики


In [261]:
# prompt from user
for i, token in enumerate(bi_model.generate('язык')):
    print(token)
    if i > 10:
        break

программирования
Python
для
текстов
,
а
также
способы
их
формирования
и
пунктуации


In [262]:
# prompt with a spelling mistake (same result as the cell above)
for i, token in enumerate(bi_model.generate('езык')):
    print(token)
    if i > 10:
        break

программирования
Python
для
текстов
,
а
также
способы
их
формирования
и
пунктуации


In [107]:
# start/end of sentence token, add randomness
for i, token in enumerate(bi_model.generate('<EOS>', mode='random_weighted')):
    print(token)
    if i > 10:
        break

Применение
моделей
.
И
вот
казалось
бы
,
синтаксиса
и
не
ну


In [258]:
# start/end of sentence token, add randomness,
# pay less attention to weights with temperature set to 0.5

for i, token in enumerate(bi_model.generate('<EOS>',
                                            mode='random_weighted',
                                            temperature=0.5)):
    print(token)
    if i > 10:
        break

Полученный
датасет
был
куплен
тогда
,
относящейся
к
многоклассовой
классификации
текстов
/


In [247]:
# start/end of sentence token; with temperature set to 1, weights are effectively ignored

for i, token in enumerate(bi_model.generate('<EOS>',
                                            mode='random_weighted',
                                            temperature=1)):
    print(token)
    if i > 10:
        break

Тогда
первое
предложение
получит
векторное
представление
[
1
/
Известия
отделения
русского


In [108]:
for i, token in enumerate(bi_model.generate('<EOS>', mode='random')):
    print(token)
    if i > 10:
        break

Обучая
классификатор
на
это
,
основанные
на
физике
и
запись
файлов
,


#### Beam search

In [220]:
bi_model.beam_generate('<EOS>')

'<EOS> В . <EOS> В . <EOS> В . <EOS> В .'

In [222]:
bi_model.beam_generate()

'возникло даже с помощью модели машинного обучения , а не , а'

In [223]:
bi_model.beam_generate()

'нарушение орфографических , а не , а не , а не ,'

In [242]:
bi_model.beam_generate()

'встречаются крайне редко . <EOS> В . <EOS> В . <EOS> Использование'

In [246]:
bi_model.beam_generate()

'<EOS> User identification of short texts 2017 . <EOS> В . <EOS>'

It seems that beam search does not produce more natural text. For some reason, the sequence `<EOS> В . <EOS> В . <EOS> В . <EOS> В .` gets a very high probability.
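
For reference, beam search over a bigram table can be sketched as follows. This is a simplified illustration, not `LanguageModel.beam_generate`: the function name, the `next_probs` format (`{token: {next_token: probability}}`) and the default parameters are assumptions.

```python
import math


def beam_search(next_probs, start, steps=5, beam_width=3):
    """Keep the `beam_width` most probable sequences at every step."""
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for sequence, score in beams:
            options = next_probs.get(sequence[-1], {})
            if not options:
                candidates.append((sequence, score))  # dead end: keep unchanged
                continue
            for token, p in options.items():
                if p > 0:
                    candidates.append((sequence + [token], score + math.log(p)))
        # Keep only the best few candidates for the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```

Because the score is a product of probabilities, a short cycle of very likely transitions keeps winning over more varied continuations, which may be what happens with the repeated `<EOS> В .` here.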

# Almost the end

I experimented with different tokenizers. The lowest perplexity was achieved with `custom_word_punct_tokenizer`, which emits punctuation marks and alphanumeric sequences as separate tokens. Note, however, that perplexity tends to grow with vocabulary size, and since different tokenizers produce different vocabulary sizes, the comparison may not be entirely fair.

Of the two spellcheckers, I use the one based on edit distance in my text generation pipeline. I found that for matching a misspelled word to one present in the vocabulary, rule-based methods still work best. However, if we were to use context when correcting spelling mistakes, Transformers would, I believe, be the best choice.

I use only add-one smoothing. My test text contains some unigrams that are not present in the train corpus. Therefore, perplexity cannot be estimated without smoothing, and the other smoothing techniques implemented in my language model fail for the same reason.

As for generation, adding randomness (starting with a random token, choosing the next token randomly) seems to make the generated texts more interesting and natural. Beam search doesn't show great results in my setting. I suppose beam search works well when the model can accurately predict next-token probabilities, but that's not my case. It is also rather greedy and therefore reduces randomness.





# THE END!