# Import and Read the data

In [1]:
# read the input data file
with open('raw-text-data.txt', 'r', encoding='utf-8') as f:
    text_data = f.read()

In [2]:
# length of the data in chars
print("length of the data in chars:", len(text_data))

# let's look at the first 100 characters
print(text_data[:100])

length of the data in chars: 15795757
This page allows users to search multiple sources for a book given a 10- or 13-digit International S


# Encoders and Tokenizers

some popular tokenizers and embedding tools to look out for (the first two are tokenizers; the rest are embedding models or NLP libraries):
- [sentencepiece](https://github.com/google/sentencepiece) (google)
- [tiktoken](https://github.com/openai/tiktoken) (openai)
- [word2vec](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/word2vec.ipynb) (google tf)
- [code2vec](https://github.com/tech-srl/code2vec) (code2vec.org)
- [GloVe](https://nlp.stanford.edu/projects/glove/) (nlp stanford)
- [spaCy](https://spacy.io/usage/linguistic-features#vectors-similarity) (explosion.ai)

for more tokenizers, see [chapter 6](https://huggingface.co/course/chapter6/1) of the Hugging Face course and the [tokenizer summary](https://huggingface.co/docs/transformers/tokenizer_summary) in their Transformers docs. some useful subword tokenization algorithms:
- Byte Pair Encoding (BPE)
- WordPiece tokenization
- Unigram tokenization
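
To make the first item concrete, here is a minimal sketch of the core BPE training loop: count adjacent token pairs, merge the most frequent pair into a new id, and repeat. It is a toy illustration only, not how sentencepiece or tiktoken are implemented; the function names `most_frequent_pair` and `merge_pair`, the sample string, and the choice of three merges are all assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(token_ids):
    """Return the most common adjacent pair, or None if there are no pairs."""
    pairs = Counter(zip(token_ids, token_ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(token_ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    merged, i = [], 0
    while i < len(token_ids):
        if i + 1 < len(token_ids) and (token_ids[i], token_ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(token_ids[i])
            i += 1
    return merged

# toy input (assumption): start from raw UTF-8 bytes, ids 0..255,
# so newly created merge ids begin at 256
ids = list("aaabdaaabac".encode("utf-8"))
merges = {}
for step in range(3):  # three merges, purely for illustration
    pair = most_frequent_pair(ids)
    new_id = 256 + step
    ids = merge_pair(ids, pair, new_id)
    merges[pair] = new_id

print(ids)
print(merges)
```

On this toy string the three merges shrink the sequence from 11 byte ids to 5 token ids. A real tokenizer would keep merging until it reaches a target vocabulary size.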


for tokenizing and embedding code (C++, for example), we can use:
- [Subword tokenizers](https://www.tensorflow.org/text/guide/subwords_tokenizer)
- Syntax-aware Tokenization
- Code-specific Tokenization
- Clang
- Tree-sitter
- CodeBERT
- Code2Vec
- Graph-based Neural Networks
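
As a rough illustration of what syntax-aware tokenization means, the sketch below splits source code by lexer rules instead of by characters or whitespace. It is a stand-in: it uses Python's standard-library `tokenize` module on a Python snippet, not Clang or Tree-sitter on C++. The helper name `lex_python` and the sample line are assumptions for the example.

```python
import io
import tokenize

# structural tokens that carry no visible text we care about here
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER,
        tokenize.INDENT, tokenize.DEDENT}

def lex_python(source):
    """Return (token type name, token text) pairs for a Python snippet."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        tokens.append((tokenize.tok_name[tok.type], tok.string))
    return tokens

code_tokens = lex_python("total = price * 2  # double it\n")
for kind, text in code_tokens:
    print(kind, repr(text))
```

Each identifier, operator, number, and comment comes out as one token with a type attached, which is the kind of structure a character-level encoder throws away.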

In [3]:
# let's build a basic char-to-int encoding to start with.

In [4]:
# let's look at the unique chars in the text data
char_vocab = sorted(list(set(text_data)))
vocab_size = len(char_vocab)
print("char vocabulary:", ''.join(char_vocab))
print("vocabulary size:", vocab_size)

char vocabulary: 	
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¥§¬­®¯°±²³´µ·º»¼½ÀÁÄÅÆÇÉÎÑÓÕÖ×ØÜÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþĀāăąćċČčĐđēęěĝğġħĩĪīİıľŁłńņňŋŌōőŒœřŚśŞşŠšţũūŵŷźŻżŽžƒơưƿǂǎǐǒǔǣǥǧǫǵǷȘșȚțȟȳɐɑɒɔɕəɛɜɡɣɤɦɪɬɯɲɴɾʁʂʃʊʌʎʏʐʑʒʔʕʰʷʻʼʾʿˈˌːˤ̸̩̪̯̰̲̀́̃̄̊̌ͤͭ̚͡ΆΐΑΓΔΕΘΙΚΛΜΟΠΣΤΦΧΩάέήίαβγδεζηθικλμνξοπρςστυφχψωϊόύώϕϵАБГДКНОПРСТУФХабвгдезийклмнопрстуфхцчъьяїјҳְֲִֵֶַָֹֻּׁׂ֣אבגדהוחטיךכלםמןנסעפצקרשת׳،ءآأإئابةتثجحخدذرزسشصضطظعغفقكلمنهوىيَُِّْٰٱچکیचतदनबमरवशहाीे्ংকগজঞঠতথদনবরলািীু্ംമയലളാกขฃคฅงจฉญณดตถทธนบปผฝพฟภมยรลวษสหอฮะัาำิีืุูเแโใไๆ็่้ကငစညတထဒနပမယသာိီုေံး္်ြᚼᛏᛒḇḍḏḗḤḥḱḳṁṃṅṆṇṓṛṟṣṭṯṵẊẋẓẚạảấầẩậắếềểễệịọốồổộớờởỦủứữỳἀἁἄἈἐἑἔἕἙἠἡἦἰἴἶἷἸὁὅὐὑὖὤὰὲὴὶὸᾱῆῐῖῡῦῶ  ​‌‎‐–—―‖‘’“”†‡•… ‰′″‿⁄⁠⁡€₹ℏℓℝ™ℵℶ←→↔↦⇒∀∂∃∅∆∇∈∉∋∑−∗∘∙√∝∞∣∥∧∨∫∴∼≈≠≡≤≥≪≫⊂⊆⊚⊢⊨⋃⋅⋆⋯⌈⌉⌘▼★♭♮♯✓⟨⟩かじんアイガザスズベボマムユラリルンーㅇ一下业中主义之乐产人仁代企伙会但体佛促俗儒八公共兴农利华合名启命和嘉国國圳墨士天夫嫻子字孝学學宗定家密富實寶封小左市帝平康建当彬德忠成戰抓投教文新日春時會期本果桃權正武民汇法津深源漢爲王现現理生產由當百皇相眠睿矿社神禮秋简管約組經繁网義翼能自致芒花芳莊華融视訓詁語諧諸谐贾资跃運道鍾鑫银陰陽革音韻频體국리아어역카프한ﬁﺟ﻿（），𐤁𐤋𐤍𓂋𓈉𓈖𓏠
vocabulary size: 953


In [5]:
# 953 chars!
# I know we should do data cleaning, unwanted char/word removal, etc etc...
# but let's just go with it and see what happens...

In [6]:
# a simple char to int tokenization strategy (encoding & decoding)

# mapping between chars and ints based on data vocab
c2i = { ch:i for i,ch in enumerate(char_vocab) }
i2c = { i:ch for i,ch in enumerate(char_vocab) }

# encode: text string --> 1D num vector
def encode(txt_str):
    return [c2i[ch] for ch in txt_str]

# decode: encoded 1D num vector --> text string
def decode(num_vect):
    return ''.join(i2c[i] for i in num_vect)

In [7]:
# let's test our encoder and decoder
print(encode("hello LLM o/"))
print(decode([74, 71, 78, 78, 81, 2, 46, 46, 47, 2, 81, 17]))

[74, 71, 78, 78, 81, 2, 46, 46, 47, 2, 81, 17]
hello LLM o/
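
One property of this scheme worth checking is the round trip, plus what happens with a character that was not in the data when the vocab was built. The sketch below is self-contained and uses a tiny stand-in vocabulary built from one sample string (an assumption; the notebook builds `char_vocab` from the full dataset), so the ids differ from the ones printed above.

```python
# tiny stand-in vocabulary (assumption): built from a single sample string
sample_text = "hello LLM o/"
char_vocab = sorted(set(sample_text))
c2i = {ch: i for i, ch in enumerate(char_vocab)}
i2c = {i: ch for i, ch in enumerate(char_vocab)}

def encode(txt_str):
    return [c2i[ch] for ch in txt_str]

def decode(num_vect):
    return ''.join(i2c[i] for i in num_vect)

# round trip: decoding an encoding should give back the original text
round_trip_ok = decode(encode(sample_text)) == sample_text
print("round trip ok:", round_trip_ok)

# a char that is not in the vocab has no id, so the dict lookup fails
try:
    encode("hello 🙂")
    unseen_char = None
except KeyError as err:
    unseen_char = err.args[0]
print("char with no id:", unseen_char)
```

The lookup raises `KeyError` for any unseen character, so text outside the training data would need either a fallback id or a byte-level vocabulary; which one fits best here is left open.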


In [8]:
# let's see how it looks with first 100 chars from our raw text data
print(text_data[:100])
print(encode(text_data[:100]))

This page allows users to search multiple sources for a book given a 10- or 13-digit International S
[54, 74, 75, 85, 2, 82, 67, 73, 71, 2, 67, 78, 78, 81, 89, 85, 2, 87, 85, 71, 84, 85, 2, 86, 81, 2, 85, 71, 67, 84, 69, 74, 2, 79, 87, 78, 86, 75, 82, 78, 71, 2, 85, 81, 87, 84, 69, 71, 85, 2, 72, 81, 84, 2, 67, 2, 68, 81, 81, 77, 2, 73, 75, 88, 71, 80, 2, 67, 2, 19, 18, 15, 2, 81, 84, 2, 19, 21, 15, 70, 75, 73, 75, 86, 2, 43, 80, 86, 71, 84, 80, 67, 86, 75, 81, 80, 67, 78, 2, 53]


In [9]:
# so our basic encoding works!

# Using the Encoder/Tokenizer for Input Embedding

In [10]:
# let's now encode the entire text dataset and store it into torch.Tensor
# (blindly following Karpathy here), using PyTorch (https://pytorch.org)
# could have used TensorFlow or another framework; may revisit that later

import torch

In [11]:
# use our encoder and wrap it into PyTorch tensors
data = torch.tensor(encode(text_data), dtype=torch.long)

# let's take a look at shape and dtype
print(data.shape, data.dtype)

# and the first 100 chars from earlier, now as token ids
print(data[:100])

torch.Size([15795757]) torch.int64
tensor([54, 74, 75, 85,  2, 82, 67, 73, 71,  2, 67, 78, 78, 81, 89, 85,  2, 87,
        85, 71, 84, 85,  2, 86, 81,  2, 85, 71, 67, 84, 69, 74,  2, 79, 87, 78,
        86, 75, 82, 78, 71,  2, 85, 81, 87, 84, 69, 71, 85,  2, 72, 81, 84,  2,
        67,  2, 68, 81, 81, 77,  2, 73, 75, 88, 71, 80,  2, 67,  2, 19, 18, 15,
         2, 81, 84,  2, 19, 21, 15, 70, 75, 73, 75, 86,  2, 43, 80, 86, 71, 84,
        80, 67, 86, 75, 81, 80, 67, 78,  2, 53])


# References

Reading Materials and References for future:
- https://towardsdatascience.com/word-embeddings-with-code2vec-glove-and-spacy-5b26420bf632
- https://huggingface.co/course/chapter6/8
- https://neptune.ai/blog/tokenization-in-nlp
- https://research.aimultiple.com/large-language-model-training/
- https://towardsdatascience.com/word-embeddings-in-2020-review-with-code-examples-11eb39a1ee6d
