# Tokens, Word Embeddings & Language Representation

## Learning Objectives:
- Understand what tokens are.
- Learn how to tokenize text using the Hugging Face Transformers library.
- Explore what embeddings are, their role in representing meaning, and how to extract them from a pretrained neural network model.
- Work with a real dataset to see these ideas in action.

## Introduction to Tokens

In Natural Language Processing (NLP), **tokens** are the basic units into which text is split. They can be words, subwords, or characters. Tokenization is essential because:
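- **Standardization:** It converts raw text into a consistent form (for example, lowercasing for an uncased model).
- **Input for Models:** Neural networks operate on numbers, so text must first be split into tokens and each token mapped to an integer ID.
- **Handling Vocabulary:** Splitting rare words into subword pieces keeps the vocabulary finite while still covering most out-of-vocabulary words.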
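
As a quick illustration of the three granularities mentioned above, the sketch below splits one sentence by word, by character, and into subwords. The subword pieces are hand-written assumptions for illustration, not the output of a trained tokenizer.

```python
# Three ways to split the same sentence.
text = "Tokenization matters"

# Word-level: split on whitespace
word_tokens = text.split()

# Character-level: one token per non-space character
char_tokens = list(text.replace(" ", ""))

# Subword-level: hypothetical pieces, written by hand for illustration;
# a real subword tokenizer learns its pieces from data
subword_tokens = ["token", "##ization", "matters"]

print("Words:     ", word_tokens)
print("Subwords:  ", subword_tokens)
print("Characters:", len(char_tokens), "tokens")
```

Word-level splitting gives short sequences but an open-ended vocabulary, character-level gives a tiny vocabulary but long sequences, and subwords sit between the two.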

Below, we will see how to tokenize text using a pretrained tokenizer.

In [39]:
from transformers import AutoTokenizer, AutoModel
from datasets import load_dataset

# Load a pretrained tokenizer (we'll use 'bert-base-uncased' as an example).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Example text for tokenization
example_text = "Generative AI is transforming the way we work with data."

# Tokenize the text
tokens = tokenizer.tokenize(example_text)
token_ids = tokenizer(example_text)["input_ids"]

print("Original Text:\n", example_text)
print("\nTokens:\n", tokens)
print("\nToken IDs:\n", token_ids)
print("\nDecoded Text:\n", tokenizer.decode(token_ids))


Original Text:
 Generative AI is transforming the way we work with data.

Tokens:
 ['genera', '##tive', 'ai', 'is', 'transforming', 'the', 'way', 'we', 'work', 'with', 'data', '.']

Token IDs:
 [101, 11416, 6024, 9932, 2003, 17903, 1996, 2126, 2057, 2147, 2007, 2951, 1012, 102]

Decoded Text:
 [CLS] generative ai is transforming the way we work with data. [SEP]
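
The `##` prefix in the output above marks a piece that continues the previous word, which is how `genera` and `##tive` decode back to `generative`. A minimal sketch of that merging step in plain Python might look like the following. `merge_wordpieces` is a hypothetical helper, not part of the Transformers API, and it ignores the punctuation clean-up that `tokenizer.decode` also performs.

```python
def merge_wordpieces(tokens):
    """Join WordPiece-style tokens into words by attaching '##' pieces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation: glue onto the previous word
        else:
            words.append(token)     # start of a new word
    return words

# Tokens copied from the tokenizer output shown above
tokens = ['genera', '##tive', 'ai', 'is', 'transforming', 'the', 'way',
          'we', 'work', 'with', 'data', '.']

merged = merge_wordpieces(tokens)
print(" ".join(merged))
```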


### Main Tasks of a Tokenizer

At its core, a tokenizer is a **lookup table** (the vocabulary) that maps each token to an integer ID. Around that table it performs the steps needed to turn raw text into model input:

1. **Tokenization**: Splits text into tokens (e.g., words, subwords, or characters).
2. **Token-to-ID Mapping**: Converts tokens into numerical IDs.
3. **Special Tokens**: Adds tokens like `[CLS]` or `[SEP]` for specific models.
4. **OOV Handling**: Splits unknown words into known subword pieces where possible, and falls back to `[UNK]` when no pieces match.
5. **Padding/Truncation**: Adjusts sequence lengths with `[PAD]` or truncates.
6. **Attention Masks**: Marks real tokens vs. padding (e.g., `[1, 1, 0, 0]`).
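
The six tasks above can be sketched with a toy tokenizer in plain Python. This is a simplified illustration, not the Transformers implementation: the vocabulary `VOCAB` and the helpers `wordpiece` and `encode` are hypothetical, words are split on whitespace only, and subwords are found by greedy longest-match lookup.

```python
# Toy vocabulary (hypothetical); real vocabularies hold tens of thousands of entries.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "token": 4, "##s": 5, "are": 6, "fun": 7}

def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark continuation pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # task 4: no piece matches, so the word becomes [UNK]
        pieces.append(piece)
        start = end
    return pieces

def encode(text, vocab, max_length):
    tokens = ["[CLS]"]                                 # task 3: special tokens
    for word in text.lower().split():                  # task 1: tokenization
        tokens.extend(wordpiece(word, vocab))
    tokens = tokens[:max_length - 1] + ["[SEP]"]       # task 5: truncation
    attention_mask = [1] * len(tokens)                 # task 6: real tokens
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad                          # task 5: padding
    attention_mask += [0] * pad                        # task 6: padding positions
    input_ids = [vocab[t] for t in tokens]             # task 2: token-to-ID mapping
    return {"tokens": tokens, "input_ids": input_ids,
            "attention_mask": attention_mask}

encoded = encode("Tokens are zany fun", VOCAB, max_length=8)
for key, value in encoded.items():
    print(key, value)
```

Here `tokens` splits into `token` and `##s`, while `zany` has no matching pieces in the toy vocabulary and maps to `[UNK]`.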


In [24]:
import pandas as pd

# Get the vocabulary as a dict mapping token string -> integer ID
vocabulary = tokenizer.get_vocab()

# Convert the vocabulary dictionary to a DataFrame
vocab_df = pd.DataFrame(list(vocabulary.items()), columns=["Token", "Index"])

# Print the vocabulary size, then display the first 10 rows in tabular format
print("Vocabulary Size:", len(vocabulary))
print("\nSample Vocabulary:")
vocab_df.head(10)


Vocabulary Size: 30522

Sample Vocabulary:


| | Token | Index |
|---|---|---|
| 0 | artisans | 26818 |
| 1 | duration | 9367 |
| 2 | euclidean | 25826 |
| 3 | ind | 27427 |
| 4 | sbs | 21342 |
| 5 | alloys | 28655 |
| 6 | realization | 12393 |
| 7 | ##emia | 17577 |
| 8 | employed | 4846 |
| 9 | bahamas | 17094 |


## Working with a Real Dataset

To see tokenization on real text, we load the first 1,000 training examples of the IMDb movie review dataset from the Hugging Face Hub. IMDb is a widely used benchmark for sentiment classification.


In [11]:
# Load a small subset of the IMDb dataset
dataset = load_dataset("imdb", split="train[:1000]")

# Inspect a sample review
sample_review = dataset[0]["text"]
print("Sample Review:\n", sample_review[:500], "...\n")


Sample Review:
 I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attent ...



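We can tokenize the whole dataset efficiently by passing `batched=True` to `map`, which hands the tokenizer many reviews per call instead of one. With `padding='max_length'` and `truncation=True`, every review is padded or cut to the model's maximum length (512 tokens for `bert-base-uncased`).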

In [41]:
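# Tokenize a batch of reviews, padding or truncating each to the model's maximum length
def tokenize_batch(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# Apply the tokenizer to the whole dataset, one batch at a time
tokenized_dataset = dataset.map(tokenize_batch, batched=True)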

# Inspect the tokenized dataset
print(tokenized_dataset)

# Convert the tokenized dataset to a pandas DataFrame
tokenized_df = tokenized_dataset.to_pandas()[["text", "input_ids"]]

# Display the DataFrame
tokenized_df.head()

Dataset({
    features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 1000
})


| | text | input_ids |
|---|---|---|
| 0 | I rented I AM CURIOUS-YELLOW from my video sto... | [101, 1045, 12524, 1045, 2572, 8025, 1011, 375... |
| 1 | "I Am Curious: Yellow" is a risible and preten... | [101, 1000, 1045, 2572, 8025, 1024, 3756, 1000... |
| 2 | If only to avoid making this type of film in t... | [101, 2065, 2069, 2000, 4468, 2437, 2023, 2828... |
| 3 | This film was probably inspired by Godard's Ma... | [101, 2023, 2143, 2001, 2763, 4427, 2011, 2643... |
| 4 | Oh, brother...after hearing about this ridicul... | [101, 2821, 1010, 2567, 1012, 1012, 1012, 2044... |
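
To make the batched call more concrete: with `batched=True`, `map` passes the function a dictionary of lists (one list per column) and expects a dictionary of lists back. The sketch below mimics that contract without downloading a model. `toy_tokenize_batch` is a hypothetical stand-in for the real tokenizer, and its "IDs" are just word lengths, chosen only so the example runs on its own.

```python
def toy_tokenize_batch(batch, max_length=6):
    """Stand-in for tokenizer(batch["text"], padding="max_length", truncation=True)."""
    input_ids, attention_mask = [], []
    for text in batch["text"]:
        # Fake token IDs (word lengths), truncated to max_length
        ids = [len(word) for word in text.split()][:max_length]
        pad = max_length - len(ids)
        input_ids.append(ids + [0] * pad)                  # pad with 0
        attention_mask.append([1] * len(ids) + [0] * pad)  # 1 = real, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# A batch is a dict of columns, each holding one value per example
batch = {"text": ["a short review",
                  "this one is a much longer review than the first"]}

out = toy_tokenize_batch(batch)
for ids, mask in zip(out["input_ids"], out["attention_mask"]):
    print(ids, mask)
```

Every row comes back with the same length, which is why the short review ends in zeros and the long one is cut off. Summing a row of the attention mask recovers how many real tokens it contains.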
