# Import and Read the data

In [1]:
# read the input data file
with open('raw-text-data.txt', 'r', encoding='utf-8') as f:
    text_data = f.read()

In [2]:
# length of the data in chars
print("length of the data in chars:", len(text_data))

# let's look at the first 100 characters
print(text_data[:100])

length of the data in chars: 15795757
This page allows users to search multiple sources for a book given a 10- or 13-digit International S


# Randomize the text contents

In [3]:
# let's look at how many lines there are in the data
len(text_data.splitlines())

2000

In [4]:
# from this analysis, we see that each source file was scraped into one single line
# we had 1000 source files, separated by '\n\n', hence 2000 line splits
# we should mix up the content; otherwise the train split will only contain
# the data sources that come first, and the model will learn only from those

In [5]:
# let's try counting sentence ends, using the full-stop punctuation mark
len(text_data.split('. '))

68059

In [6]:
# looks good! let's split it this way, and randomize!
rand_text_data = []
for content in text_data.splitlines():
    if content:
        sentence_list = content.split('. ')
        stripped_list = [sentence.strip() for sentence in sentence_list]
        rand_text_data.extend(stripped_list)

In [7]:
print("length: ", len(rand_text_data))
rand_text_data[8:12]

length:  69058


['Google books often lists other editions of a book and related books under the "about this book" link',
 'You can convert between 10 and 13 digit ISBNs with these tools: Get free access to research! Research tools and services Outreach Get involved',
 "Successful military and police takeover \xa0Royal Thai Armed Forces Royal Thai Police The 1991 Thai coup d'état was a military coup against the democratic Chatichai Choonhavan government, carried out by Thai military leaders on 23 February",
 'Although the figure head was Sunthorn Kongsompong, there was a military influence from military leaders, Chavalit Yongchaiyudh, Suchinda Kraprayoon, and Kaset Rojananil in the conflict']
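
Splitting on `'. '` drops the full stop and ignores `?` and `!`. As a possible alternative, here is a minimal sketch that splits on the whitespace following a sentence terminator, so the punctuation stays attached. The helper name `split_sentences` is made up for this illustration, and the approach still breaks on abbreviations such as `U.S.`, so treat it as a rough heuristic only.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    # (?<=[.!?]) is a lookbehind: the terminator is required but not consumed
    parts = re.split(r'(?<=[.!?])\s+', text)
    return [part.strip() for part in parts if part.strip()]

sample = "The coup took place in 1991. Who led it? It was a military coup!"
print(split_sentences(sample))

# known limitation: abbreviations are still treated as sentence ends
print(split_sentences("a U.S. visa"))
```

With the terminators kept, the sentences could later be joined with a plain `'\n'` instead of re-adding the full stop.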

In [8]:
# let's try to randomly shuffle the elements of a list 
import random
random.seed(2023)

mylist = list(range(20))
print(mylist)

random.shuffle(mylist)  # seed works!
print(mylist)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[2, 7, 3, 0, 5, 6, 11, 9, 8, 4, 18, 13, 17, 1, 15, 16, 10, 19, 14, 12]


In [9]:
# let's randomly shuffle the lines in rand_text_data
random.seed(2023)
random.shuffle(rand_text_data)

# let's check with the same piece of code above for rand_text_data
print("length: ", len(rand_text_data))
rand_text_data[8:12]

length:  69058


['A chord may be built upon any note of a musical scale',
 'While the first derivative test identifies points that might be extrema, this test does not distinguish a point that is a minimum from one that is a maximum or one that is neither',
 '"Playfair, John"',
 'The commission supported increasing enforcement against undocumented migrants and their employers, eliminating visa preferences for siblings and adult children of U.S']

In [10]:
# let's form our dataset by joining the lines together
# since the sentences would be disjointed and out of context from each other,
# it's better to join with '. \n' (this also restores the full stop removed by the split)
text_data = '. \n'.join(rand_text_data)

In [11]:
# (yes, we would lose context between sentences, but now the data won't be biased)
# we can decide to choose any of the techniques later

# length of the data in chars
print("length of the data in chars:", len(text_data))

# let's look at the first 100 characters
print(text_data[:100])

length of the data in chars: 15861052
The highly miniaturized product, about the size of a cigarette lighter and with a 4.6-inch long USB 
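
One of the other techniques hinted at above could be to shuffle whole documents instead of single sentences, which removes the ordering bias but keeps the context inside each document. The sketch below is only an illustration under that assumption: `shuffle_documents` and the toy text are hypothetical, and it uses a local `random.Random` instance so the global seed is left untouched.

```python
import random

def shuffle_documents(raw_text, seed=2023):
    """Shuffle whole documents (non-empty lines) while keeping each one intact."""
    documents = [line for line in raw_text.splitlines() if line.strip()]
    rng = random.Random(seed)  # local generator, global random state is not changed
    rng.shuffle(documents)
    return documents

# toy stand-in for the scraped data: documents separated by '\n\n'
toy_text = "Doc A. Second sentence of A.\n\nDoc B. Second sentence of B.\n\nDoc C."
shuffled_docs = shuffle_documents(toy_text)
print(shuffled_docs)
```

The trade-off is coarser mixing: with only 1000 documents, a train/validation split may still differ in topic more than a sentence-level shuffle would.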


# Encoders and Tokenizers

some popular tokenizers and embedding tools to look out for:
- [sentencepiece](https://github.com/google/sentencepiece) (google)
- [tiktoken](https://github.com/openai/tiktoken) (openai)
- [word2vec](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/word2vec.ipynb) (google tf)
- [code2vec](https://github.com/tech-srl/code2vec) (code2vec.org)
- [GloVe](https://nlp.stanford.edu/projects/glove/) (nlp stanford)
- [spaCy](https://spacy.io/usage/linguistic-features#vectors-similarity) (explosion.ai)

we can look up the Hugging Face course, [chapter 6](https://huggingface.co/course/chapter6/1), for more tokenizers, and the [tokenizer_summary](https://huggingface.co/docs/transformers/tokenizer_summary) on their transformers docs page. some of the useful subword tokenization algorithms are:
- Byte Pair Encoding (BPE)
- WordPiece tokenization
- Unigram tokenization
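
To give a feel for the first item, here is a toy sketch of how BPE learns its merges: start from single characters and repeatedly merge the most frequent adjacent pair. It is a simplified illustration, not the implementation used by any of the libraries above; the function names and the tiny corpus are made up for this example.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs over a {tuple_of_symbols: frequency} mapping."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair, first seen wins ties
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

Real tokenizers add details this sketch skips, such as byte-level fallback, end-of-word markers, and pre-tokenization rules.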


for tokenizing and embedding code (say for example, C++), we can use:
- [Subword tokenizers](https://www.tensorflow.org/text/guide/subwords_tokenizer)
- Syntax-aware Tokenization
- Code-specific Tokenization
- Clang
- TreeSitter
- CodeBERT
- Code2Vec
- Graph-based Neural Networks
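
As a small illustration of syntax-aware tokenization, the sketch below uses Python's standard-library `tokenize` module as a stand-in lexer. This is an assumption made to keep the example dependency-free: for C++ one would reach for a real parser such as Clang or TreeSitter, whose APIs are not shown here. The helper name `lex_python` is hypothetical.

```python
import io
import tokenize

# layout-only token types that we skip to keep the output short
LAYOUT_TOKENS = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
                 tokenize.INDENT, tokenize.DEDENT)

def lex_python(source):
    """Return (token_type_name, token_string) pairs from Python's own lexer."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in LAYOUT_TOKENS:
            continue
        tokens.append((tokenize.tok_name[tok.type], tok.string))
    return tokens

print(lex_python("total = price * 2  # double it\n"))
```

Unlike a character-level or subword split, each token here carries a lexical category (name, operator, number, comment), which is the extra signal syntax-aware approaches try to exploit.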

In [12]:
# let's build a basic, simple char-to-int encoding to start with.

In [13]:
# let's look at the unique chars in the text data
char_vocab = sorted(list(set(text_data)))
vocab_size = len(char_vocab)
print("char vocabulary:", ''.join(char_vocab))
print("vocabulary size:", vocab_size)

char vocabulary: 	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¥§¬­®¯°±²³´µ·º»¼½ÀÁÄÅÆÇÉÎÑÓÕÖ×ØÜÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþĀāăąćċČčĐđēęěĝğġħĩĪīİıľŁłńņňŋŌōőŒœřŚśŞşŠšţũūŵŷźŻżŽžƒơưƿǂǎǐǒǔǣǥǧǫǵǷȘșȚțȟȳɐɑɒɔɕəɛɜɡɣɤɦɪɬɯɲɴɾʁʂʃʊʌʎʏʐʑʒʔʕʰʷʻʼʾʿˈˌːˤ̸̩̪̯̰̲̀́̃̄̊̌ͤͭ̚͡ΆΐΑΓΔΕΘΙΚΛΜΟΠΣΤΦΧΩάέήίαβγδεζηθικλμνξοπρςστυφχψωϊόύώϕϵАБГДКНОПРСТУФХабвгдезийклмнопрстуфхцчъьяїјҳְֲִֵֶַָֹֻּׁׂ֣אבגדהוחטיךכלםמןנסעפצקרשת׳،ءآأإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىيَُِّْٰٱچکیचतदनबमरवशहाीे्ংকগজঞঠতথদনবরলািীু্ംമയലളാกขฃคฅงจฉญณดตถทธนบปผฝพฟภมยรลวษสหอฮะัาำิีืุูเแโใไๆ็่้ကငစညတထဒနပမယသာိီုေံး္်ြᚼᛏᛒḇḍḏḗḤḥḱḳṁṃṅṆṇṓṛṟṣṭṯṵẊẋẓẚạảấầẩậắếềểễệịọốồổộớờởỦủứữỳἀἁἄἈἐἑἔἕἙἠἡἦἰἴἶἷἸὁὅὐὑὖὤὰὲὴὶὸᾱῆῐῖῡῦῶ  ​‌‎‐–—―‖‘’“”†‡•… ‰′″‿⁄⁠⁡€₹ℏℓℝ™ℵℶ←→↔↦⇒∀∂∃∅∆∇∈∉∋∑−∗∘∙√∝∞∣∥∧∨∫∴∼≈≠≡≤≥≪≫⊂⊆⊚⊢⊨⋃⋅⋆⋯⌈⌉⌘▼★♭♮♯✓⟨⟩かじんアイガザスズベボマムユラリルンーㅇ一下业中主义之乐产人仁代企伙会但体佛促俗儒八公共兴农利华合名启命和嘉国國圳墨士天夫嫻子字孝学學宗定家密富實寶封小左市帝平康建当彬德忠成戰抓投教文新日春時會期本果桃權正武民汇法津深源漢爲王现現理生產由當百皇相眠睿矿社神禮秋简管約組經繁网義翼能自致芒花芳莊華融视訓詁語諧諸谐贾资跃運道鍾鑫银陰陽革音韻频體국리아어역카프한ﬁﺟ﻿（），𐤁𐤋𐤍𓂋𓈉𓈖𓏠
vocabulary size: 953


In [14]:
# 953 chars!
# I know we should do data cleaning, unwanted char/word removal, etc etc...
# but let's just go with it and see what happens...
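
For reference, one possible cleaning pass could look like the sketch below: apply Unicode NFKC normalization (which, for example, turns the `ﬁ` ligature seen in the vocabulary into `fi`) and replace characters that occur fewer than `min_count` times with a placeholder. The helper names, the threshold, and the choice of U+FFFD as placeholder are all assumptions for illustration. NFKC also rewrites other characters (superscripts, full-width forms), so it may not suit every dataset.

```python
import unicodedata
from collections import Counter

def rare_chars(text, min_count=2):
    """Return the set of characters that occur fewer than `min_count` times."""
    counts = Counter(text)
    return {ch for ch, n in counts.items() if n < min_count}

def clean_text(text, min_count=2, placeholder='\ufffd'):
    """Normalize with NFKC, then replace rare characters with a placeholder."""
    normalized = unicodedata.normalize('NFKC', text)
    rare = rare_chars(normalized, min_count)
    return ''.join(placeholder if ch in rare else ch for ch in normalized)

# toy sample: two 'fi' ligatures (U+FB01) and one snowman (U+2603)
sample = "\ufb01\ufb01 aa bb \u2603"
cleaned = clean_text(sample)
print(repr(cleaned))
print("vocabulary size before:", len(set(sample)), "after:", len(set(cleaned)))
```

On the real data, a threshold like this would shrink the vocabulary at the cost of losing the rare symbols entirely.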

In [15]:
# a simple char to int tokenization strategy (encoding & decoding)

# mapping between chars and ints based on data vocab
c2i = { ch:i for i,ch in enumerate(char_vocab) }
i2c = { i:ch for i,ch in enumerate(char_vocab) }

# encode: text string --> 1D num vector
encode = lambda txt_str: [c2i[ch] for ch in txt_str]

# decode: encoded 1D num vector --> text string
decode = lambda num_vect: ''.join([i2c[i] for i in num_vect])

In [16]:
# let's test our encoder and decoder
print(encode("hello LLM o/"))
print(decode([74, 71, 78, 78, 81, 2, 46, 46, 47, 2, 81, 17]))

[74, 71, 78, 78, 81, 2, 46, 46, 47, 2, 81, 17]
hello LLM o/


In [17]:
# let's see how it looks with the first 100 chars from our shuffled text data
print(text_data[:100])
print(encode(text_data[:100]))

The highly miniaturized product, about the size of a cigarette lighter and with a 4.6-inch long USB 
[54, 74, 71, 2, 74, 75, 73, 74, 78, 91, 2, 79, 75, 80, 75, 67, 86, 87, 84, 75, 92, 71, 70, 2, 82, 84, 81, 70, 87, 69, 86, 14, 2, 67, 68, 81, 87, 86, 2, 86, 74, 71, 2, 85, 75, 92, 71, 2, 81, 72, 2, 67, 2, 69, 75, 73, 67, 84, 71, 86, 86, 71, 2, 78, 75, 73, 74, 86, 71, 84, 2, 67, 80, 70, 2, 89, 75, 86, 74, 2, 67, 2, 22, 16, 24, 15, 75, 80, 69, 74, 2, 78, 81, 80, 73, 2, 55, 53, 36, 2]


In [18]:
# so our basic encoding works!
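
One caveat: the dictionary lookup `c2i[ch]` raises a `KeyError` for any character that is not in the data vocabulary. A possible workaround, sketched below with a hypothetical `build_codec` helper and U+FFFD as an assumed "unknown" token, is to reserve one id for unseen characters. It builds its own small vocabulary so the example runs on its own.

```python
def build_codec(text, unk_token='\ufffd'):
    """Build encode/decode functions that map unseen chars to `unk_token`."""
    vocab = sorted(set(text) | {unk_token})
    c2i = {ch: i for i, ch in enumerate(vocab)}
    i2c = {i: ch for ch, i in c2i.items()}
    unk_id = c2i[unk_token]

    def encode(txt_str):
        # .get falls back to the unknown id instead of raising KeyError
        return [c2i.get(ch, unk_id) for ch in txt_str]

    def decode(num_vect):
        return ''.join(i2c[i] for i in num_vect)

    return encode, decode

safe_encode, safe_decode = build_codec("hello LLM o/")
round_trip = safe_decode(safe_encode("hello LLM o/"))
with_unknown = safe_decode(safe_encode("hello LLM :)"))
print(round_trip)
print(with_unknown)
```

Known text still round-trips exactly; unknown characters are lossy by design and come back as the placeholder.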

# Using the Encoder/Tokenizer for Input Embedding

In [19]:
# let's now encode the entire text dataset and store it into torch.Tensor
# (blindly following Karpathy here), using PyTorch (https://pytorch.org)
# could have used TF, or other techniques, but will see in future

import torch

In [20]:
# use our encoder and wrap it into PyTorch tensors
data = torch.tensor(encode(text_data), dtype=torch.long)

# let's take a look at shape and dtype
print(data.shape, data.dtype)

# and the first 100 encoded chars, matching the list printed earlier
print(data[:100])

torch.Size([15861052]) torch.int64
tensor([54, 74, 71,  2, 74, 75, 73, 74, 78, 91,  2, 79, 75, 80, 75, 67, 86, 87,
        84, 75, 92, 71, 70,  2, 82, 84, 81, 70, 87, 69, 86, 14,  2, 67, 68, 81,
        87, 86,  2, 86, 74, 71,  2, 85, 75, 92, 71,  2, 81, 72,  2, 67,  2, 69,
        75, 73, 67, 84, 71, 86, 86, 71,  2, 78, 75, 73, 74, 86, 71, 84,  2, 67,
        80, 70,  2, 89, 75, 86, 74,  2, 67,  2, 22, 16, 24, 15, 75, 80, 69, 74,
         2, 78, 81, 80, 73,  2, 55, 53, 36,  2])
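
A likely next step, following the usual character-level language-model recipe, is to split the encoded data into train and validation parts and to sample (input, target) windows where the target is the input shifted by one position. The sketch below uses plain Python lists and a toy id sequence so it runs without PyTorch; the function names, the 90/10 split, and the block and batch sizes are assumptions, not values taken from this notebook.

```python
import random

def train_val_split(token_ids, train_fraction=0.9):
    """Split a sequence into a leading train part and a trailing validation part."""
    cut = int(train_fraction * len(token_ids))
    return token_ids[:cut], token_ids[cut:]

def sample_batch(token_ids, block_size, batch_size, rng):
    """Pick random windows; each target is its input shifted one position right."""
    inputs, targets = [], []
    for _ in range(batch_size):
        # leave room for the extra target element at the end of the window
        start = rng.randrange(len(token_ids) - block_size)
        inputs.append(token_ids[start:start + block_size])
        targets.append(token_ids[start + 1:start + block_size + 1])
    return inputs, targets

toy_ids = list(range(100))  # stand-in for the encoded dataset
train_ids, val_ids = train_val_split(toy_ids)
xb, yb = sample_batch(train_ids, block_size=8, batch_size=4, rng=random.Random(2023))
print(len(train_ids), len(val_ids))
print(xb[0], yb[0])
```

The same slicing (`data[:cut]`, `data[cut:]`) works on the `data` tensor above, and because the text was shuffled first, the validation tail is no longer dominated by the last few source files.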


# References

Reading Materials and References for future:
- https://towardsdatascience.com/word-embeddings-with-code2vec-glove-and-spacy-5b26420bf632
- https://huggingface.co/course/chapter6/8
- https://neptune.ai/blog/tokenization-in-nlp
- https://research.aimultiple.com/large-language-model-training/
- https://towardsdatascience.com/word-embeddings-in-2020-review-with-code-examples-11eb39a1ee6d
