<a href="https://www.kaggle.com/code/mrafraim/dl-day-28-nlp-preprocessing?scriptVersionId=290982701" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Day 28: NLP Preprocessing

Welcome to Day 28!

Today you’ll learn:
1. What Natural Language Processing (NLP) is and why it matters in AI
2. Why raw text is challenging for neural networks
3. Types of NLP problems with real-world use cases
4. Tokenization strategies:
   - Word-level
   - Character-level
   - Subword-level (BPE/WordPiece)
5. Vocabulary creation and integer encoding
6. Why neural networks require embeddings rather than raw indices
7. How embeddings capture semantic relationships
8. The necessity of padding for batch processing
9. How all these steps integrate into a robust training pipeline

By the end of this notebook, you'll be able to transform raw text into numerically usable, semantically rich, batchable tensors ready for deep learning.


---


# What is NLP?

Natural Language Processing (NLP) is the branch of AI focused on enabling machines to understand, interpret, and generate human language. 

Key points:
- Language is inherently ambiguous, contextual, and structured at multiple levels (words, syntax, semantics, discourse)
- NLP allows AI to extract meaning, sentiment, or patterns from text

**Examples of NLP Applications:**
- **Text classification:** Spam detection, topic tagging
- **Sentiment analysis:** Product reviews, social media analysis
- **Named Entity Recognition (NER):** Extracting people, locations, organizations
- **Machine translation:** Google Translate, DeepL
- **Conversational AI:** Chatbots, virtual assistants
- **Text generation:** GPT, story generation, code generation


## Challenges in NLP

**Text characteristics:**
- **Unstructured:** Unlike images, which arrive as fixed grids of numeric pixels, text is symbolic and varies in form
- **Variable length:** Sentences and paragraphs differ in size
- **Ambiguous:** Words can have multiple meanings (polysemy)
- **Context-dependent:** Meaning depends on surrounding words

**Neural network requirements:**
- Expect **numerical input**
- Prefer **fixed-size tensors**
- Often work in **batches** for efficiency

Preprocessing bridges **human-readable text → machine-readable numerical tensors**


## Types of NLP Tasks

| Task | Example | Description |
|------|---------|-------------|
| Text Classification | Spam vs Ham | Assign a label to entire text |
| Sequence Prediction | Next word/character | Predict next token given context |
| Named Entity Recognition | "John" → Person | Identify entities in text |
| Machine Translation | English → French | Translate text from one language to another |
| Text Generation | GPT-like models | Generate coherent text given a prompt |
| Question Answering | SQuAD | Answer questions from a passage |
| Summarization | News article → Summary | Condense information while preserving meaning |


## Raw Text Example


In [1]:
sentences = [
    "Deep learning is powerful",
    "NLP is fascinating",
    "Transformers revolutionized NLP"
]

sentences

['Deep learning is powerful',
 'NLP is fascinating',
 'Transformers revolutionized NLP']

Observation:

- Variable-length sequences
- Words are strings, not usable directly by neural networks

# Tokenization

Tokenization = breaking text into atomic units (tokens) that models can process

**Why it’s needed:**
- Neural networks cannot process raw strings
- Tokens act as the vocabulary units

**Tokenization strategies:**
1. **Word-level:** Each word → one token
2. **Character-level:** Each character → one token
3. **Subword-level:** Frequent character sequences become tokens, so rare words split into several pieces (e.g., BPE, WordPiece, SentencePiece)

**Trade-offs:**
- Word-level: simple, interpretable, large vocabulary
- Character-level: small vocab, handles unknown words, longer sequences
- Subword-level: balance between vocab size & generalization (used in Transformers)
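
To make the subword idea concrete, here is a minimal sketch of a single byte-pair-encoding (BPE) merge step. The toy word counts and the helper names `count_pairs` and `merge_pair` are made up for illustration; real tokenizers repeat this step many times and handle details omitted here.

```python
from collections import Counter

# Hypothetical toy corpus: word -> frequency (illustrative numbers only)
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}

# Start from character-level symbols for every word
splits = {word: list(word) for word in word_freqs}

def count_pairs(splits, word_freqs):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, symbols in splits.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += word_freqs[word]
    return pairs

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol list with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

pair_counts = count_pairs(splits, word_freqs)
best_pair = max(pair_counts, key=pair_counts.get)  # most frequent adjacent pair
splits = {word: merge_pair(symbols, best_pair) for word, symbols in splits.items()}

print("Merged pair:", best_pair)
print("Splits after one merge:", splits)
```

After the merge, the new symbol is treated like any other token, which is how a subword vocabulary grows from characters toward frequent word pieces.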


## Word-Level Tokenization

In [2]:
tokenized_sentences = [s.lower().split() for s in sentences]

print("Tokenized Sentences:")
tokenized_sentences

Tokenized Sentences:


[['deep', 'learning', 'is', 'powerful'],
 ['nlp', 'is', 'fascinating'],
 ['transformers', 'revolutionized', 'nlp']]

Notes:

- Converted text to lowercase for consistency
- Split on whitespace with `str.split()` → basic tokenization (any punctuation would stay attached to the neighbouring word)
- Still strings, need mapping to integers
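
As a hedged illustration of how the tokenization choice changes the result, the sketch below compares whitespace splitting, a simple regex tokenizer, and character-level tokenization. The example sentence and the regex pattern are assumptions chosen for the demo, not a production tokenizer.

```python
import re

text = "NLP is fascinating, right?"

# Plain whitespace split: punctuation stays glued to the neighbouring word
whitespace_tokens = text.lower().split()

# Regex split: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Character-level: every character (including spaces) is a token
char_tokens = list(text.lower())

print(whitespace_tokens)
print(regex_tokens)
print(len(whitespace_tokens), len(regex_tokens), len(char_tokens))
```

The character-level sequence is several times longer than the word-level one, which is the sequence-length cost paid for a tiny vocabulary.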

## Vocabulary Creation

In [3]:
# Flatten all token lists to build a set of unique words (vocabulary)

vocab = set()

for sentence in tokenized_sentences:
    for word in sentence:
        vocab.add(word)

print(vocab)

vocab_size = len(vocab)
print("Vocabulary size: ", vocab_size)

{'learning', 'transformers', 'fascinating', 'revolutionized', 'is', 'powerful', 'deep', 'nlp'}
Vocabulary size:  8


In [4]:
# Map words to unique integer IDs
# (set iteration order is arbitrary, so the exact IDs may differ between runs)
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

print("Word to index mapping:")
word_to_idx

Word to index mapping:


{'learning': 0,
 'transformers': 1,
 'fascinating': 2,
 'revolutionized': 3,
 'is': 4,
 'powerful': 5,
 'deep': 6,
 'nlp': 7}

In [5]:
# Convert each token in the sentences to its corresponding integer index
encoded_sentences = [
    [word_to_idx[word] for word in sentence]
    for sentence in tokenized_sentences
]

print("Encoded sentences:")
encoded_sentences


Encoded sentences:


[[6, 0, 4, 5], [7, 4, 2], [1, 3, 7]]
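
A common refinement, sketched here under assumptions, is to reserve IDs for special tokens and to sort the vocabulary so the mapping is reproducible. The token names `<PAD>` and `<UNK>` and the helper `encode` are illustrative conventions, not part of the cells above.

```python
# Same toy corpus as above, tokenized by lowercasing and whitespace splitting
corpus = [
    "Deep learning is powerful",
    "NLP is fascinating",
    "Transformers revolutionized NLP",
]
tokenized = [s.lower().split() for s in corpus]

# Reserve fixed IDs for special tokens (names are a common convention, assumed here)
PAD, UNK = "<PAD>", "<UNK>"
special_tokens = [PAD, UNK]

# sorted() makes the mapping reproducible across runs
words = sorted({word for sentence in tokenized for word in sentence})
word_to_id = {token: i for i, token in enumerate(special_tokens + words)}

def encode(tokens, mapping):
    """Map tokens to IDs, falling back to <UNK> for words outside the vocabulary."""
    return [mapping.get(token, mapping[UNK]) for token in tokens]

print(word_to_id)
print(encode("deep learning is fun".split(), word_to_id))
```

With this layout, ID 0 never refers to a real word, and an unseen word such as "fun" maps to the `<UNK>` ID instead of raising a `KeyError`.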

## Why Not Use Word Indices Directly?

**Problem with raw indices:**
- Imply a false ordinal relationship (e.g., 2 < 7 would suggest 'fascinating' < 'nlp', which is meaningless)
- No semantic similarity captured
- Poor generalization

**Solution:** **Word Embeddings**
- Map words → dense vectors in ℝ^d
- Vectors capture semantic relationships (king - man + woman ≈ queen)
- Learned during training or taken from pre-trained models (static: Word2Vec, GloVe, FastText; contextual: BERT)
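
The analogy arithmetic can be checked on hand-built vectors. The 2-D vectors below are invented for illustration (axis 0 standing for "royalty", axis 1 for "gender"); they are not learned embeddings, and real embeddings only approximate this behaviour.

```python
import math

# Hand-made 2-D "embeddings" (hypothetical values, not learned):
# axis 0 ~ royalty, axis 1 ~ gender
vectors = {
    "king":  (1.0,  1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0,  1.0),
    "woman": (0.0, -1.0),
}

def cosine(a, b):
    """Cosine similarity between two vectors given as tuples."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# king - man + woman, computed component-wise
target = tuple(
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
)

# Pick the word whose vector is most similar to the target
nearest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(target, nearest)
```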


# Embedding Layer

- Converts word indices → dense vectors
- Captures semantic similarity
- Learned end-to-end with task

Mathematically:
$$
\text{word index} \rightarrow \mathbf{v} \in \mathbb{R}^d
$$

- `v` = embedding vector
- Words with similar context → similar vectors

*Real-world note: Pre-trained embeddings reduce training time and improve generalization.*
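
One way to read the mapping above: an embedding lookup gives the same result as multiplying a one-hot vector by a weight matrix, only without building the one-hot vector. The sketch below checks this numerically; the sizes and the example indices are assumptions matching the toy vocabulary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embedding_dim = 8, 5            # sizes chosen to match the toy example
embedding = nn.Embedding(vocab_size, embedding_dim)

indices = torch.tensor([6, 0, 4, 5])        # one encoded sentence

# Route 1: direct table lookup
looked_up = embedding(indices)

# Route 2: one-hot vectors multiplied by the same weight matrix
one_hot = F.one_hot(indices, num_classes=vocab_size).float()
multiplied = one_hot @ embedding.weight

print(looked_up.shape)                      # (sequence_length, embedding_dim)
print(torch.allclose(looked_up, multiplied))
```

The lookup avoids materializing a `vocab_size`-wide vector per token, which matters once the vocabulary has tens of thousands of entries.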

## Embedding Layer in PyTorch

In [6]:
import torch
import torch.nn as nn

# Define embedding dimension
embedding_dim = 5

# Create a PyTorch embedding layer
embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# Convert first encoded sentence to tensor
sample_input = torch.tensor(encoded_sentences[0])

# Pass through embedding layer
embedded_output = embedding(sample_input)

print("Embedded output shape:", embedded_output.shape)
print("Embedded output tensor:")
embedded_output


Embedded output shape: torch.Size([4, 5])
Embedded output tensor:


tensor([[ 0.0294,  1.8543, -1.0368,  0.3157,  0.0081],
        [-0.2989,  0.7121,  0.0983, -0.0576,  0.1752],
        [ 0.6905, -0.7170, -1.0517, -0.2305, -1.1802],
        [-0.4986,  1.2611,  1.2982,  0.5773, -0.7170]],
       grad_fn=<EmbeddingBackward0>)

- Each word in our vocabulary gets mapped to a vector of size `embedding_dim`
- `num_embeddings = vocab_size`: total number of unique tokens (words) in our vocab.
- PyTorch initializes this table randomly (from a standard normal distribution by default), so the values above will differ between runs. During training, these vectors are updated.
- Output Shape = `(sequence_length, embedding_dim)`
- Ready to feed into RNN, LSTM, GRU, or Transformer

# Padding

Padding is the process of adding special placeholder tokens to sequences so that all sequences in a batch have the same length.  

In NLP, sequences are usually sentences represented as lists of word IDs, and sentences naturally vary in length:

- Sentence 1: $[12, 5, 9]$ (length 3)
- Sentence 2: $[7, 2]$ (length 2)
- Sentence 3: $[3, 8, 1, 4, 9]$ (length 5)


Neural networks require uniform input shapes, so we add `<PAD>` tokens to the shorter sequences:

Padded Sentences (max length = 5):

- $[12, 5, 9, 0, 0]$ (padded with two 0s)
- $[7, 2, 0, 0, 0]$ (padded with three 0s)
- $[3, 8, 1, 4, 9]$ (no padding needed)

Here, `0` is typically used as the **PAD token**.
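
The padding rule above fits in a few lines of plain Python. This is a minimal sketch; the helper name `pad_to_longest` is made up for illustration.

```python
def pad_to_longest(sequences, pad_id=0):
    """Right-pad every sequence with `pad_id` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[12, 5, 9], [7, 2], [3, 8, 1, 4, 9]]
padded = pad_to_longest(batch)

for row in padded:
    print(row)
```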


**Key Notes**

- **PAD token value**: Usually `0`, but can be any integer not representing a real word.  
- **Masking**: Some models use attention masks or sequence masks to ignore PAD tokens during training.  
- **Dynamic padding**: Many libraries support padding each batch to the longest sentence in that batch to reduce computation waste.
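
Masking and dynamic padding can be combined in a custom collate function. The sketch below is one possible arrangement, assuming ID 0 is reserved for padding; the names `PAD_ID` and `collate` are illustrative.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

PAD_ID = 0  # assumed to be reserved for padding, never used by a real word

# A list of 1-D tensors works as a simple map-style dataset
dataset = [torch.tensor(ids) for ids in ([12, 5, 9], [7, 2], [3, 8, 1, 4, 9])]

def collate(batch):
    """Pad only to the longest sequence in this batch and build a matching mask."""
    padded = pad_sequence(batch, batch_first=True, padding_value=PAD_ID)
    mask = padded != PAD_ID          # True for real tokens, False for padding
    return padded, mask

loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=collate)

for padded, mask in loader:
    # Batch shape and the number of real tokens per sequence
    print(padded.shape, mask.sum(dim=1).tolist())
```

The first batch is padded only to length 3, while the second keeps length 5, so no batch carries more padding than its own longest sequence requires.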


## Why Padding is Needed

1. **Uniform sequence length for batches**

   - Deep learning frameworks process data in batches for efficiency.  
   - All sequences in a batch must have the same length to form a 2D tensor: `[batch_size, seq_length]`.  

2. **Vectorized computation**

   - Neural networks operate on tensors, not Python lists of varying lengths.  
   - Without padding, sequences of different lengths would create “jagged arrays” that cannot be efficiently processed.  

3. **Compatibility with RNNs, LSTMs, Transformers**

   - These models expect fixed-length inputs per batch (or require masking to ignore PAD tokens).
   - Padding allows these models to compute forward and backward passes without shape errors.


## Padding Sequences

In [7]:
from torch.nn.utils.rnn import pad_sequence

# Convert encoded sentences → tensors
tensor_sentences = [torch.tensor(seq) for seq in encoded_sentences]

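# Pad sequences. Caution: in this toy vocabulary, ID 0 already belongs to a real word
# ('learning' in the run shown above), so padding with 0 is ambiguous here.
# In practice, reserve a dedicated ID for <PAD> when building the vocabulary.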
padded_sequences = pad_sequence(
    tensor_sentences,
    batch_first=True,  # output shape = (batch_size, seq_len)
    padding_value=0
)

padded_sequences


tensor([[6, 0, 4, 5],
        [7, 4, 2, 0],
        [1, 3, 7, 0]])
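
When an ID is reserved for padding, `nn.Embedding` can be told about it through its `padding_idx` argument. The sketch below assumes ID 0 is reserved for `<PAD>` and uses hypothetical word IDs starting at 1; it is not the mapping produced by the cells above.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumption: the vocabulary reserves ID 0 for <PAD>, real words start at 1

# Hypothetical IDs for three sentences, already padded to length 4
padded_ids = torch.tensor([
    [7, 1, 5, 6],
    [8, 5, 3, PAD_ID],
    [2, 4, 8, PAD_ID],
])

# 8 real words + 1 padding entry; the padding_idx row starts at zero
# and is not updated by gradients during training
embedding = nn.Embedding(num_embeddings=9, embedding_dim=5, padding_idx=PAD_ID)

embedded = embedding(padded_ids)
print(embedded.shape)      # (batch_size, seq_len, embedding_dim)
print(embedded[1, 3])      # vector at a padded position
```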

# NLP Preprocessing Pipeline

```mermaid
flowchart TD
    A[Raw Text] --> B[Lowercasing & Cleaning]
    B --> C[Tokenization]
    C --> D[Vocabulary Mapping]
    D --> E[Integer Encoding]
    E --> F[Padding]
    F --> G[Batching]
    G --> H[Embedding]
    H --> I[Neural Network Input]

    %% Node styles
    classDef startEnd fill:#ffcc00,stroke:#333,stroke-width:2px,color:#000
    classDef process fill:#00ccff,stroke:#333,stroke-width:2px,color:#000

    class A,I startEnd
    class B,C,D,E,F,G,H process
```


1. **Raw text**
2. **Lowercasing & Cleaning:** Remove punctuation, special characters, stopwords if needed
3. **Tokenization:** Word, character, or subword
4. **Vocabulary mapping:** Build word → index dictionary
5. **Integer encoding:** Convert tokens → indices
6. **Padding:** Equalize the lengths of the integer sequences within each batch
7. **Batching:** Stack the padded sequences into a tensor of shape `(batch_size, seq_len)`
8. **Embedding:** Convert indices → dense vectors (usually the first layer of the model)
9. **Neural Network Input:** Ready for RNN/LSTM/Transformer

*Professional tip: Some pipelines also include lemmatization, stemming, or subword tokenization for better generalization.*
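
The whole path from raw text to embedded batch can be sketched in one place. This is a compact illustration under assumptions, not a production pipeline: the helper names (`tokenize`, `build_vocab`, `to_batch`), the special-token names, and the regex are all chosen for the example.

```python
import re
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

PAD, UNK = "<PAD>", "<UNK>"   # special-token names are an assumed convention

def tokenize(text):
    """Lowercase, then keep runs of word characters (drops punctuation)."""
    return re.findall(r"\w+", text.lower())

def build_vocab(token_lists):
    """Sorted word -> ID mapping with IDs 0 and 1 reserved for <PAD> and <UNK>."""
    words = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate([PAD, UNK] + words)}

def to_batch(texts, token_to_id):
    """Encode each text as IDs and pad the batch to its longest sequence."""
    encoded = [
        torch.tensor([token_to_id.get(tok, token_to_id[UNK]) for tok in tokenize(text)])
        for text in texts
    ]
    return pad_sequence(encoded, batch_first=True, padding_value=token_to_id[PAD])

texts = [
    "Deep learning is powerful!",
    "NLP is fascinating.",
    "Transformers revolutionized NLP",
]
token_to_id = build_vocab([tokenize(t) for t in texts])
batch_ids = to_batch(texts, token_to_id)                 # (batch_size, seq_len)

embedding = nn.Embedding(len(token_to_id), 5, padding_idx=token_to_id[PAD])
batch_vectors = embedding(batch_ids)                     # (batch_size, seq_len, 5)

print(batch_ids)
print(batch_vectors.shape)
```

Padding is applied to the integer IDs, and the embedding layer then maps the whole padded batch to vectors in a single call.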

# Key Takeaways from Day 28

- NLP preprocessing transforms unstructured text → structured, numeric tensors
- Preprocessing is required: neural networks cannot consume raw strings
- Tokenization defines the granularity of language representation
- Embeddings encode semantic meaning beyond raw indices
- Padding allows batch training and efficient computation
- Preprocessing decisions significantly affect model performance

---


<p style="text-align:center; font-size:18px;">
© 2026 Mostafizur Rahman
</p>
