<a href="https://colab.research.google.com/github/karna-charan/LLM/blob/main/Tokenization_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Whitespace Tokenization**

In [1]:
# Take input text from user
text = input("Enter text: ")
# Split text on any run of whitespace (spaces, tabs, newlines)
tokens = text.split()
# Display tokens
print("Whitespace Tokens:", tokens)

Enter text:  Natural language processing is fun. 
Whitespace Tokens: ['Natural', 'language', 'processing', 'is', 'fun.']
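
Note that `split()` leaves punctuation attached to the neighbouring word (`'fun.'` in the output above). One possible alternative, sketched here on the assumption that a simple regular expression is good enough for the text at hand, emits punctuation as separate tokens. The helper name `regex_tokenize` is illustrative and not part of the original notebook.

```python
import re

def regex_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+      matches a run of letters, digits or underscores
    # [^\w\s]  matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Natural language processing is fun."
whitespace_tokens = sample.split()
regex_tokens = regex_tokenize(sample)

print("Whitespace Tokens:", whitespace_tokens)
print("Regex Tokens:", regex_tokens)
```

With this pattern the final period becomes its own token, which is closer to what `word_tokenize` does in the next section.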


**Word Tokenization (NLP Library – NLTK)**

In [2]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize
# Take input text from user
text = input("Enter text: ")
# Tokenize text into words using NLTK
tokens = word_tokenize(text)
# Display tokens
print("Word Tokens:", tokens)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Enter text:  NLP helps computers understand human language
Word Tokens: ['NLP', 'helps', 'computers', 'understand', 'human', 'language']


**Character-Level Tokenization**

In [4]:
# Take input text from user
text = input("Enter text: ")
# Convert text into character-level tokens
tokens = list(text)
# Display tokens
print("Character Tokens:", tokens)

Enter text: NLP
Character Tokens: ['N', 'L', 'P']


**Sentence Tokenization**

In [6]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
# Take input text from user
text = input("Enter text: ")
# Tokenize text into sentences
sentences = sent_tokenize(text)
# Display sentences
print("Sentence Tokens:")
for s in sentences:
    print(s)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Enter text: NLP is interesting. It helps machines understand language. AI is the future.
Sentence Tokens:
NLP is interesting.
It helps machines understand language.
AI is the future.
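
For comparison, a naive rule-based splitter is sketched below. It assumes that every `.`, `!` or `?` followed by whitespace ends a sentence, which is only a rough heuristic. The function name `naive_sent_split` is hypothetical and not part of NLTK.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    # The lookbehind keeps the punctuation mark with the sentence it ends.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "NLP is interesting. It helps machines understand language. AI is the future."
sentences = naive_sent_split(text)
for s in sentences:
    print(s)

# The heuristic breaks on abbreviations: "Dr." is treated as a sentence end.
tricky = naive_sent_split("Dr. Smith teaches NLP. It is fun.")
print(tricky)
```

The second call shows why a trained tokenizer such as the one behind `sent_tokenize` is usually preferred: the naive rule cannot tell an abbreviation from the end of a sentence.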


**Stopword Removal**

In [8]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Take input text from user
text = input("Enter text: ")
# Tokenize text into words
words = word_tokenize(text)
# Load English stopwords
stop_words = set(stopwords.words('english'))
# Remove stopwords
filtered_words = [word for word in words if word.lower() not in stop_words]
# Display results
print("Original Tokens:", words)
print("After Stopword Removal:", filtered_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Enter text: NLP is very useful for understanding human language.
Original Tokens: ['NLP', 'is', 'very', 'useful', 'for', 'understanding', 'human', 'language', '.']
After Stopword Removal: ['NLP', 'useful', 'understanding', 'human', 'language', '.']
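
The filtered list above still contains the `'.'` token, because punctuation is not in the stopword list. A minimal sketch of one way to drop it as well is shown below. The token list is copied from the output above, and the three-word stopword set is a hand-picked stand-in for illustration only (an assumption, not the NLTK list).

```python
# Tokens copied from the output above; in the notebook they come from word_tokenize.
words = ['NLP', 'is', 'very', 'useful', 'for', 'understanding', 'human', 'language', '.']

# Tiny illustrative stopword set (assumption); the notebook uses NLTK's English list.
stop_words = {"is", "very", "for"}

# Keep a token only if it is purely alphabetic and not a stopword.
filtered_words = [
    w for w in words
    if w.isalpha() and w.lower() not in stop_words
]
print("After Stopword and Punctuation Removal:", filtered_words)
```

Note that `str.isalpha()` also discards tokens containing digits (for example `word2vec`), so `str.isalnum()` may be the better check when such tokens matter.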


**Subword Tokenization**

In [10]:
from transformers import AutoTokenizer
# Load BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Take input text from user
text = input("Enter text: ")
# Perform subword tokenization
tokens = tokenizer.tokenize(text)
# Encode the text to token IDs (encode() also adds [CLS]=101 and [SEP]=102)
token_ids = tokenizer.encode(text)
# Display tokens and IDs
print("Subword Tokens:", tokens)
print("Token IDs:", token_ids)

Enter text: Tokenisation is important in LLM
Subword Tokens: ['token', '##isation', 'is', 'important', 'in', 'll', '##m']
Token IDs: [101, 19204, 6648, 2003, 2590, 1999, 2222, 2213, 102]
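
The `##` prefix marks a piece that continues the previous one. The splitting itself can be sketched as a greedy longest-match-first search over a vocabulary, which is how WordPiece tokenization is commonly described. The code below is a simplified illustration, not the Hugging Face implementation; `wordpiece_tokenize` and `toy_vocab` are hypothetical names, and the vocabulary is a hand-made subset rather than the real `vocab.txt`.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # nothing matched: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hand-made toy vocabulary (assumption), just large enough for this sentence.
toy_vocab = {"token", "##isation", "is", "important", "in", "ll", "##m"}

subword_tokens = [
    piece
    for word in "tokenisation is important in llm".split()
    for piece in wordpiece_tokenize(word, toy_vocab)
]
print("Subword Tokens:", subword_tokens)
```

With this toy vocabulary the sketch reproduces the seven subword tokens shown above. The ID list has nine entries because `encode()` wraps the sequence in the special tokens `[CLS]` (101) and `[SEP]` (102).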


**1B. Text Embedding Techniques**


**One-Hot Encoding:**

In [11]:
from sklearn.preprocessing import OneHotEncoder
# Take input sentence
text = input("Enter a sentence: ")
# Split sentence into words
words = text.lower().split()
# Initialize One-Hot Encoder
encoder = OneHotEncoder(sparse_output=False)
# Fit and transform words
one_hot = encoder.fit_transform([[w] for w in words])
# Display result
print("Words:", words)
print("One-Hot Encoding:")
print(one_hot)

Enter a sentence: Deep learning is very powerful
Words: ['deep', 'learning', 'is', 'very', 'powerful']
One-Hot Encoding:
[[1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 [0. 0. 0. 1. 0.]]
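
The columns in the matrix above do not follow the order of the sentence: `OneHotEncoder` sorts the categories, so the columns correspond to `deep, is, learning, powerful, very` (available as `encoder.categories_`). A minimal pure-Python sketch of the same mapping, with illustrative variable names, makes this explicit.

```python
words = ['deep', 'learning', 'is', 'very', 'powerful']

# Sorted unique words define the column order.
vocabulary = sorted(set(words))
index = {word: i for i, word in enumerate(vocabulary)}

# One row per word: 1.0 in the column of that word, 0.0 elsewhere.
one_hot = [
    [1.0 if col == index[w] else 0.0 for col in range(len(vocabulary))]
    for w in words
]

print("Columns:", vocabulary)
for w, row in zip(words, one_hot):
    print(w, "->", row)
```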


**Bag of Words**

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
# Take input sentence
text = input("Enter a sentence: ")
# Create CountVectorizer object
vectorizer = CountVectorizer()
# Generate BoW representation
bow = vectorizer.fit_transform([text])
# Display vocabulary and vector
print("Vocabulary:", vectorizer.get_feature_names_out())
print("BoW Vector:")
print(bow.toarray())

Enter a sentence: Deep learning makes learning easy
Vocabulary: ['deep' 'easy' 'learning' 'makes']
BoW Vector:
[[1 1 2 1]]
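
The vector above counts how often each vocabulary term occurs, with `learning` appearing twice. A rough pure-Python equivalent is sketched below. It mirrors the `CountVectorizer` defaults as far as this example needs (lowercasing and a token pattern that keeps only words of two or more characters) and is not a full reimplementation.

```python
import re
from collections import Counter

text = "Deep learning makes learning easy"

# Lowercase, then keep tokens of at least two word characters,
# matching CountVectorizer's default token pattern.
tokens = re.findall(r"\b\w\w+\b", text.lower())

counts = Counter(tokens)
vocabulary = sorted(counts)                         # alphabetical column order
bow_vector = [counts[term] for term in vocabulary]  # one count per column

print("Vocabulary:", vocabulary)
print("BoW Vector:", bow_vector)
```

Because of the two-character minimum, single-letter words such as "a" or "I" never enter the vocabulary unless `token_pattern` is changed.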


**Word2Vec**

In [15]:
!pip install gensim

from gensim.models import Word2Vec
import re
# Take input sentence
text = input("Enter a sentence: ")
# Tokenize sentence using regex
tokens = re.findall(r'\b\w+\b', text.lower())
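# Train a skip-gram (sg=1) Word2Vec model on the single input sentence;
# with this little data the vectors stay close to their random initialization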
model = Word2Vec(
    [tokens],
    vector_size=50,
    window=3,
    min_count=1,
    sg=1
)
# Display embeddings for each word
print("Word2Vec Embeddings:")
for word in tokens:
    print(word, "->", model.wv[word])

Enter a sentence: Word2Vec Embeddings
Word2Vec Embeddings:
word2vec -> [-0.01631583  0.0089916  -0.00827415  0.00164907  0.01699724 -0.00892435
  0.009035   -0.01357392 -0.00709698  0.01879702 -0.00315531  0.00064274
 -0.00828126 -0.01536538 -0.00301602  0.00493959 -0.00177605  0.01106732
 -0.00548595  0.00452013  0.01091159  0.01669191 -0.00290748 -0.01841629
  0.0087411   0.00114357  0.01488382 -0.00162657 -0.00527683 -0.01750602
 -0.00171311  0.00565313  0.01080286  0.01410531 -0.01140624  0.00371764
  0.01217773 -0.0095961  -0.00621452  0.01359526  0.00326295  0.00037983
 

**GloVe Embedding**

In [16]:
import gensim.downloader as api
import re
# Load pre-trained GloVe model (50-dimensional)
glove_model = api.load("glove-wiki-gigaword-50")
# Take input sentence
text = input("Enter a sentence: ")
# Tokenize sentence
tokens = re.findall(r'\b\w+\b', text.lower())
# Display embeddings
print("GloVe Embeddings:")
for word in tokens:
    if word in glove_model:
        print(word, "->", glove_model[word])
    else:
        print(word, "-> Not in vocabulary")

Enter a sentence: Deep learning uses neural networks
GloVe Embeddings:
deep -> [ 0.31445   1.2024    0.066651 -0.20096  -0.049636  0.66882  -0.049386
  0.44174   0.1799   -0.10196  -0.43674   0.12076  -0.12495   0.43378
 -0.87784   0.010281  0.54592  -0.28928  -0.46115  -0.32058  -0.69094
  0.49733   0.40657  -0.90062   0.69699  -1.1536   -0.12229   1.0657
  0.93207   0.20439   3.3004    0.14223   0.46493   0.075359 -0.56755
  0.30769  -1.1251   -0.37871   0.57479  -0.12629   0.13589   0.10633
  0.058432  0.40321   0.10243   0.12004   0.41383   0.051987 -0.5835
 -1.1159  ]
learning -> [ 0.20461   0.48659  -0.55308  -0.27019   0.26336   0.15751  -0.28994
 -0.51824   0.051829  0.36225   0.37077   0.1322   -0.061377 -0.53606
 -0.34733  -0.043981 -0.086744  0.78305   0.41422   0.027996  0.23433
  0.98844  -0.41049   0.6206    1.3966   -0.65427  -0.18221  -1.0293
 -0.014741 -0.25384   3.227     0.39509  -0.33042  -1.229     0.29048
  0.33654  -0.24817   0.47105   0.32964   0.23997   0.08830
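
Embedding vectors such as those printed above are usually compared with cosine similarity. A minimal sketch follows, using made-up 3-dimensional vectors (hypothetical values, not real GloVe entries) so that the arithmetic is easy to check by hand.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors (assumption): vec_b points the same way as vec_a, vec_c is orthogonal.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]
vec_c = [0.0, 0.0, 3.0]

sim_parallel = cosine_similarity(vec_a, vec_b)
sim_orthogonal = cosine_similarity(vec_a, vec_c)
print("Same direction:", round(sim_parallel, 6))
print("Orthogonal:", round(sim_orthogonal, 6))
```

With the loaded gensim model, the same comparison is available directly, for example `glove_model.similarity("deep", "learning")`.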

**BERT Embedding:**

In [20]:
import torch
from transformers import BertTokenizer, BertModel
# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Take input sentence
text = input("Enter a sentence: ")
# Convert sentence into BERT input format
inputs = tokenizer(text, return_tensors="pt")
# Get model outputs (no gradient tracking is needed for inference)
with torch.no_grad():
    outputs = model(**inputs)
# Extract CLS token embedding
cls_embedding = outputs.last_hidden_state[:, 0, :]
# Display embedding
print("BERT CLS Embedding Shape:", cls_embedding.shape)
print("BERT Embedding:")
print(cls_embedding)

Enter a sentence: Transformers are powerful language models
BERT CLS Embedding Shape: torch.Size([1, 768])
BERT Embedding:
tensor([[-2.2781e-01,  1.3411e-02,  3.8629e-01, -6.3446e-02, -3.3132e-01,
         -4.2333e-01,  3.3524e-01,  4.6735e-01, -3.3935e-02, -2.2304e-01,
         -1.3062e-01, -2.7309e-02,  6.3044e-02,  1.5671e-01, -1.2636e-01,
         -8.1111e-02, -5.8163e-01,  5.5437e-01,  1.3238e-01,  3.3685e-02,
         -2.2062e-01, -4.5429e-01,  2.7165e-01, -2.1725e-01, -1.7911e-01,
         -1.7770e-01,  1.6396e-01, -2.7592e-01,  5.9301e-02,  8.4795e-02,
          1.1383e-01,  4.9886e-01, -5.7623e-02, -6.8120e-02,  1.0200e-01,
         -1.3418e-01,  2.8313e-01, -6.5388e-02,  2.1492e-01,  6.3221e-02,
          1.0210e-01, -2.1340e-01,  3.1965e-01,  3.9024e-02, -4.7008e-01,
         -1.2937e-01, -2.5419e+00, -1.2216e-01, -1.2809e-01, -4.4822e-01,
         -2.2159e-01,  2.4788e-01,  1.0951e-01,  6.3015e-01,  1.7443e-02,
          2.3771e-01, -5.4257e-02,  2.3744e-01,  2.1086e-01,  2