Inspired by [paper](https://www.researchgate.net/publication/237135894_A_unifying_framework_for_complexity_measures_of_finite_systems)

In this notebook I implement the TSE complexity and excess entropy calculations and test them on the following datasets:



1.   Wikipedia
2.   Simple English Wikipedia
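
As a point of reference for the two measures named above, here is a minimal sketch on toy data. It assumes the finite-system definitions as I read them from the linked paper: excess entropy $E = \sum_i H(X_{V \setminus i}) - (N-1) H(X_V)$ and TSE complexity $C = \sum_{k=1}^{N-1} \left( \langle H(X_Y) \rangle_{|Y|=k} - \tfrac{k}{N} H(X_V) \right)$. Normalisation conventions may differ from the paper, and the function names `entropy`, `excess_entropy` and `tse_complexity` are hypothetical helpers, not the implementation used later in the notebook.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(samples, subset):
    """Empirical Shannon entropy (bits) of the variables indexed by `subset`."""
    if not subset:
        return 0.0
    counts = Counter(tuple(row[i] for i in subset) for row in samples)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def excess_entropy(samples):
    """E = sum_i H(X without i) - (N - 1) * H(X), assumed definition."""
    n = len(samples[0])
    full = tuple(range(n))
    h_full = entropy(samples, full)
    h_rest = sum(entropy(samples, full[:i] + full[i + 1:]) for i in range(n))
    return h_rest - (n - 1) * h_full


def tse_complexity(samples):
    """C = sum over k < N of (mean entropy of size-k subsets - k/N * H(X))."""
    n = len(samples[0])
    h_full = entropy(samples, tuple(range(n)))
    total = 0.0
    for k in range(1, n):
        subsets = list(combinations(range(n), k))
        mean_h = sum(entropy(samples, s) for s in subsets) / len(subsets)
        total += mean_h - (k / n) * h_full
    return total


# Toy data: each row is one equally weighted observation of two binary variables.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
copied = [(0, 0), (1, 1)]
print(excess_entropy(independent), tse_complexity(independent))
print(excess_entropy(copied), tse_complexity(copied))
```

For two independent fair bits both measures are zero. For two identical bits the excess entropy reduces to the mutual information, which is 1 bit.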


# Preparation

In [1]:
!python3 -m pip install sentencepiece > /dev/null && echo 'OK'

OK


In [2]:
!python3 -m pip install tensorflow_text > /dev/null && echo 'OK'

OK


In [3]:
!python3 -m pip install tensorflow_datasets > /dev/null && echo 'OK'

OK


In [4]:
!python3 -m pip install tf_sentencepiece > /dev/null && echo 'OK'

OK


### Imports

In [53]:
import sentencepiece as spm
import tensorflow_datasets as tfds
from tqdm.notebook import tqdm
import numpy as np
from typing import List, Tuple
import nltk
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow.keras.preprocessing.text as text
import io
import timeit
import time

In [45]:
print('Tensorflow version:', tf.__version__)

Tensorflow version: 2.3.0


# Datasets

## Wikipedia

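[Link](https://www.tensorflow.org/datasets/catalog/wiki40b#wiki40ben_default_config) to the dataset catalog. The URL points at the `wiki40b` entry, whereas the code below loads `wikipedia/20200301.en`.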

In [12]:
DATASET_NAME = 'wikipedia/20200301.en'

In [14]:
dataset, dataset_info = tfds.load(
    name=DATASET_NAME,
    data_dir='wiki',
    with_info=True,
    split=tfds.Split.TRAIN,
    shuffle_files=False
)

local data directory. If you'd instead prefer to read directly from our public
GCS bucket (recommended if you're running on GCP), you can instead pass
`try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.

Downloading and preparing dataset wikipedia/20200301.en/1.0.0 (download: 16.73 GiB, generated: 17.05 GiB, total: 33.77 GiB) to wiki/wikipedia/20200301.en/1.0.0...

Dataset wikipedia downloaded and prepared to wiki/wikipedia/20200301.en/1.0.0. Subsequent calls will reuse this data.


In [19]:
print(dataset_info)

tfds.core.DatasetInfo(
    name='wikipedia',
    version=1.0.0,
    description='Wikipedia dataset containing cleaned articles of all languages.
The datasets are built from the Wikipedia dump
(https://dumps.wikimedia.org/) with one split per language. Each example
contains the content of one full Wikipedia article with cleaning to strip
markdown and unwanted sections (references, etc.).',
    homepage='https://dumps.wikimedia.org',
    features=FeaturesDict({
        'text': Text(shape=(), dtype=tf.string),
        'title': Text(shape=(), dtype=tf.string),
    }),
    total_num_examples=6033151,
    splits={
        'train': 6033151,
    },
    supervised_keys=None,
    citation="""@ONLINE {wikidump,
        author = "Wikimedia Foundation",
        title  = "Wikimedia Downloads",
        url    = "https://dumps.wikimedia.org"
    }""",
    redistribution_info=license: "This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this l

In [27]:
print('First article','\n======\n')
for example in dataset.take(1):
    print('Title:','\n------')
    print(example['title'].numpy().decode('utf-8'))
    print()

    print('Text:', '\n------')
    print(example['text'].numpy().decode('utf-8'))

First article 

Title: 
------
Joseph Greenberg

Text: 
------
Joseph Harold Greenberg (May 28, 1915 – May 7, 2001) was an American linguist, known mainly for his work concerning linguistic typology and the genetic classification of languages.

Life

Early life and education 

Joseph Greenberg was born on May 28, 1915 to Jewish parents in Brooklyn, New York. His first great interest was music. At the age of 14, he gave a piano concert in Steinway Hall. He continued to play the piano frequently throughout his life.

After finishing high school, he decided to pursue a scholarly career rather than a musical one. He enrolled at Columbia University in New York. During his senior year, he attended a class taught by Franz Boas concerning American Indian languages. With references from Boas and Ruth Benedict, he was accepted as a graduate student by Melville J. Herskovits at Northwestern University in Chicago. During the course of his graduate studies, Greenberg did fieldwork among the Hausa p

In [31]:
def article_to_text(text_tensor):
    # Decode a scalar tf.string tensor into a Python str.
    return text_tensor.numpy().decode('utf-8')

In [32]:
dataset_text = dataset.map(
    lambda article: tf.py_function(func=article_to_text, inp=[article['text']], Tout=tf.string)
)

In [34]:
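# Use a name other than `text` so the Keras `text` module alias is not overwritten.
for article_text in dataset_text.take(2):
    print(article_text.numpy())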
    print('\n')

b'Joseph Harold Greenberg (May 28, 1915 \xe2\x80\x93 May 7, 2001) was an American linguist, known mainly for his work concerning linguistic typology and the genetic classification of languages.\n\nLife\n\nEarly life and education \n\nJoseph Greenberg was born on May 28, 1915 to Jewish parents in Brooklyn, New York. His first great interest was music. At the age of 14, he gave a piano concert in Steinway Hall. He continued to play the piano frequently throughout his life.\n\nAfter finishing high school, he decided to pursue a scholarly career rather than a musical one. He enrolled at Columbia University in New York. During his senior year, he attended a class taught by Franz Boas concerning American Indian languages. With references from Boas and Ruth Benedict, he was accepted as a graduate student by Melville J. Herskovits at Northwestern University in Chicago. During the course of his graduate studies, Greenberg did fieldwork among the Hausa people of Nigeria, where he learned the Hau
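
The output above is a `bytes` literal rather than readable text, because a `tf.string` tensor stores UTF-8 bytes and `.numpy()` returns them undecoded. A minimal sketch of the decoding step, using a shortened copy of the first line of that output as sample data:

```python
# Shortened copy of the bytes printed above; \xe2\x80\x93 is a UTF-8 en dash.
raw = b'Joseph Harold Greenberg (May 28, 1915 \xe2\x80\x93 May 7, 2001)'

decoded = raw.decode('utf-8')
print(decoded)
```

This suggests the `str` returned by `article_to_text` is encoded back to bytes when `tf.py_function` wraps it as `tf.string`, so consumers of `dataset_text` still need to decode.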

## SentencePiece training

In [42]:
# In-memory buffer that receives the trained SentencePiece model.
model = io.BytesIO()

In [None]:
start_time = timeit.default_timer()
spm.SentencePieceTrainer.train(
    sentence_iterator=dataset_text.as_numpy_iterator(),
    model_writer=model,
    vocab_size=10000
)
end_time = timeit.default_timer()

In [None]:
print(f'time to train spm = {end_time - start_time} seconds')
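
The training cell above passes each whole article to the trainer as a single "sentence". To my knowledge, SentencePiece skips inputs longer than its `max_sentence_length` option (4192 bytes by default), so many full articles may be dropped. The sketch below is one possible workaround under that assumption: a hypothetical helper, `iter_training_lines`, that splits articles into lines before they reach the trainer. It is not part of the original notebook.

```python
def iter_training_lines(articles, max_bytes=4192):
    """Yield stripped, non-empty lines of each article that fit in max_bytes.

    `max_bytes` mirrors the assumed default of SentencePiece's
    `max_sentence_length`; adjust it if the trainer is configured differently.
    """
    for article in articles:
        # as_numpy_iterator() yields bytes; accept str as well for convenience.
        if isinstance(article, bytes):
            article = article.decode('utf-8')
        for line in article.splitlines():
            line = line.strip()
            if line and len(line.encode('utf-8')) <= max_bytes:
                yield line


# Tiny stand-in for dataset_text.as_numpy_iterator().
sample = [b'Life\n\nEarly life and education \n\nJoseph Greenberg was born on May 28, 1915.']
lines = list(iter_training_lines(sample))
print(lines)
```

In the notebook this would be used as `sentence_iterator=iter_training_lines(dataset_text.as_numpy_iterator())`. Whether it changes the resulting vocabulary noticeably has not been checked here.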

In [None]:
# Serialize the model to a file.
with open('spm_tokenizer_en_wiki.model', 'wb') as f:
    f.write(model.getvalue())

# Directly load the model from serialized model.
sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
print(sp.encode('this is test'))