Part 1 — Foundations of NLP in Deep Learning
1️⃣ Text → Numbers

Neural networks can't process raw words; they operate on numbers, so text has to be converted first.

Key steps:

Tokenisation – breaking text into smaller pieces (tokens)
e.g. "I love cats" → ["I", "love", "cats"]

Vocabulary – a mapping from tokens → integer IDs
e.g. "I"=1, "love"=2, "cats"=3

Padding/truncation – make all sequences in a batch the same length, by filling short ones with a special padding token and cutting long ones off
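
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how production tokenisers work: it splits on whitespace only, and the reserved IDs `PAD_ID` and `UNK_ID`, the function names, and the two-sentence corpus are all assumptions made for the example. Because two IDs are reserved, the numbering differs from the "I"=1 example above.

```python
# Toy text-to-numbers pipeline: tokenise, map to IDs, pad/truncate.
PAD_ID = 0  # assumed convention: ID 0 is reserved for padding
UNK_ID = 1  # assumed convention: ID 1 stands for tokens not in the vocabulary

def tokenise(text):
    # Simplest possible tokeniser: split on whitespace.
    return text.split()

def build_vocab(sentences):
    # Assign the next free integer to each new token, in order of appearance.
    vocab = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    for sentence in sentences:
        for token in tokenise(sentence):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab, max_len):
    ids = [vocab.get(token, UNK_ID) for token in tokenise(text)]
    ids = ids[:max_len]                      # truncate sequences that are too long
    ids += [PAD_ID] * (max_len - len(ids))   # pad sequences that are too short
    return ids

corpus = ["I love cats", "I love small dogs too"]
vocab = build_vocab(corpus)
batch = [encode(sentence, vocab, max_len=4) for sentence in corpus]
print(batch)  # [[2, 3, 4, 0], [2, 3, 5, 6]]
```

The first sentence is padded with one `0`, and the second loses its last token to truncation, so both rows end up with length 4.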

2️⃣ Representing Words: From One-Hot to Embeddings

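One-hot encoding: each word becomes a vector as long as the vocabulary, with a single 1 and 0s everywhere else. Simple but inefficient (very sparse, no relationships)

Every pair of distinct words is equally far apart, so "dog" is no closer to "cat" than to any other word: there is no notion of similarity.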

Word embeddings: dense, learned representations of words in a continuous vector space.

Words with similar meanings have similar embeddings.

Learned by dedicated models like Word2Vec or GloVe, or jointly with the rest of a deep network as its embedding layer.

💡 Example:
king - man + woman ≈ queen

Embeddings are how word meaning is represented inside a model, and every architecture below takes them as input.
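
To make the contrast concrete, here is a small sketch comparing one-hot vectors with embeddings. The 2-dimensional embedding values are hand-picked for illustration (roughly "royalty" and "gender" axes) and are not taken from Word2Vec, GloVe, or any trained model; real embeddings have hundreds of dimensions and are learned from data.

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy embeddings (assumption, for illustration only).
embeddings = {
    "king":  [0.9, 0.3],
    "queen": [0.9, -0.3],
    "man":   [0.1, 0.3],
    "woman": [0.1, -0.3],
    "apple": [-0.5, 0.0],
}
words = list(embeddings)

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in words]

# One-hot: every pair of distinct words has similarity 0.
print(cosine(one_hot("king"), one_hot("queen")))
print(cosine(one_hot("king"), one_hot("apple")))

# Embeddings: related words score higher than unrelated ones.
print(cosine(embeddings["king"], embeddings["queen"]))
print(cosine(embeddings["king"], embeddings["apple"]))

# Analogy: king - man + woman, then find the nearest remaining word.
target = [k - m + w for k, m, w in
          zip(embeddings["king"], embeddings["man"], embeddings["woman"])]
candidates = [w for w in words if w not in ("king", "man", "woman")]
best = max(candidates, key=lambda w: cosine(target, embeddings[w]))
print(best)
```

With these toy values the nearest word to the analogy vector is "queen". In trained embeddings the match is only approximate, which is why the example uses ≈ rather than =.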

3️⃣ Sequence Models: RNNs

Once words are vectors, you can feed them into a model that handles sequences.

RNN (Recurrent Neural Network) processes tokens one by one, keeping a hidden state that carries context forward.

Good for short dependencies (“I love cats”)

Struggles with long sentences: gradients shrink as they are propagated back through many time steps (vanishing gradients), so the network has trouble learning long-range context
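
Both points can be seen in a scalar toy RNN. The sketch below assumes the common update rule h_t = tanh(w_x·x_t + w_h·h_(t-1) + b), with hypothetical hand-picked weights rather than trained ones. It feeds a signal at the first step, then tracks how strongly the final hidden state still depends on that first hidden state.

```python
import math

# Hypothetical scalar weights (assumption, not trained values).
w_x, w_h, b = 1.0, 0.5, 0.0

def rnn_step(x, h_prev):
    # One recurrent update: mix the new input with the previous hidden state.
    return math.tanh(w_x * x + w_h * h_prev + b)

inputs = [1.0] + [0.0] * 19  # a signal at step 0, then 19 empty steps

h = 0.0
grad_wrt_first = 1.0  # d h_t / d h_1, starting at d h_1 / d h_1 = 1
for t, x in enumerate(inputs):
    h = rnn_step(x, h)
    if t > 0:
        # Chain rule through tanh: d h_t / d h_(t-1) = w_h * (1 - h_t^2)
        grad_wrt_first *= w_h * (1.0 - h * h)

print(h, grad_wrt_first)
```

Each step multiplies the gradient by a factor of at most 0.5 here, so after 19 steps it is smaller than 0.5^19 (about 2e-6). That shrinking product is the vanishing gradient problem in miniature.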

4️⃣ Improvement: LSTM (Long Short-Term Memory)

An LSTM is a special kind of RNN designed to remember longer-term information.

It adds a separate cell state (the memory) and three gates that control it:

Input gate: decide what new info to store

Forget gate: decide what to throw away

Output gate: decide what to pass on

🧠 LSTMs = “RNNs with memory and control.”
They were the workhorse for early neural translation and speech models before Transformers.
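
Here is a minimal scalar sketch of one LSTM step using the standard gate equations. The parameter names (`w_*`, `u_*`, `b_*`) and their values are hypothetical, and the "candidate" value `g` is part of the standard formulation even though it isn't listed among the gates above. The biases are chosen so the forget gate stays nearly open and the input gate nearly closed, to show the cell state surviving many steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate: what to store
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate: what to keep
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate: what to pass on
    g = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])  # candidate new memory
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # hidden state: a gated view of the memory
    return h, c

# Hypothetical parameters (assumption): all weights 0.5, biases set per gate.
params = {f"{kind}_{gate}": 0.5 for kind in ("w", "u") for gate in "ifog"}
params.update({"b_i": -5.0, "b_f": 5.0, "b_o": 0.0, "b_g": 0.0})

h, c = 0.0, 1.0  # start with something already stored in memory
for _ in range(20):
    h, c = lstm_step(0.0, h, c, params)

print(c)
```

After 20 steps with no input, most of the initial cell state is still there, because the update to `c` is additive and the forget gate stays close to 1. Compare that with a plain RNN, where the state is squashed through tanh at every step.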

5️⃣ Encoder–Decoder Architecture (Seq2Seq)

To handle tasks like translation, you need two parts:

Encoder: reads the source sentence (e.g. English)

Decoder: generates the target sentence (e.g. French)

Both can be built using LSTMs.

💬 Example:

Input: “I love cats.”
Output: “J’aime les chats.”

But pure encoder–decoder LSTMs still struggle with long sentences, because the encoder has to squash the entire meaning into a single fixed-size vector, however long the input is.
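
The bottleneck is easy to see in code. The sketch below is a toy stand-in for an encoder, not a real LSTM: `embed` is a made-up deterministic function instead of a learned embedding, the recurrence is a simple element-wise update, and `HIDDEN_SIZE` is an arbitrary choice. The point is only the shape of what comes out.

```python
import math

HIDDEN_SIZE = 4  # hypothetical size of the context vector

def embed(token_id):
    # Stand-in for a learned embedding: a deterministic toy vector per token ID.
    return [math.sin(token_id * (k + 1)) for k in range(HIDDEN_SIZE)]

def encode(token_ids):
    # Toy recurrent encoder: fold each token into a running hidden state.
    h = [0.0] * HIDDEN_SIZE
    for token_id in token_ids:
        x = embed(token_id)
        h = [math.tanh(0.5 * h_k + 0.5 * x_k) for h_k, x_k in zip(h, x)]
    return h  # the single vector handed to the decoder

short_context = encode([2, 3, 4])             # a 3-token sentence
long_context = encode(list(range(2, 52)))     # a 50-token sentence
print(len(short_context), len(long_context))  # 4 4
```

A 3-token sentence and a 50-token sentence are both compressed into the same 4 numbers. The decoder sees nothing else, so the longer the input, the more information has to be thrown away.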

6️⃣ Attention Mechanism

Attention was a major breakthrough that predates Transformers: it was first added on top of encoder–decoder RNNs.

Instead of relying on one “summary vector”, the attention mechanism lets the decoder “look back” at the encoder’s outputs directly, deciding which parts of the input sentence are most relevant at each translation step.

💡 Example:
When translating “cats”, the decoder can pay attention to the word “cats” in the input rather than relying on memory alone.
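
A minimal sketch of that "look back" step, assuming dot-product scoring (one of several scoring functions used in practice; the original formulation may differ). The encoder outputs and the decoder state are hand-picked toy vectors, chosen so that the decoder state resembles the vector for "cats".

```python
import math

def softmax(scores):
    # Turn raw scores into weights that are positive and sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    # Score each source position by its dot product with the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_outputs]
    weights = softmax(scores)
    # Context vector: weighted average of the encoder outputs.
    dim = len(encoder_outputs[0])
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context

source_tokens = ["I", "love", "cats"]
# Hand-picked toy encoder outputs, one vector per source token (assumption).
encoder_outputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hand-picked decoder state at the step that should produce "chats".
decoder_state = [0.1, 0.2, 2.0]

weights, context = attend(decoder_state, encoder_outputs)
focus = source_tokens[weights.index(max(weights))]
print(focus)  # cats
```

The decoder gets a fresh context vector at every output step, built mostly from the source positions that matter for the word it is about to generate.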

7️⃣ Leading to the Transformer

Transformers remove recurrence entirely and rely purely on:

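Self-Attention (each word attends to every word in the sentence, itself included)

Positional encoding (attention on its own ignores word order, so position information is added to the embeddings)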

This gives:

Parallelisation (faster training)

Better handling of long-range dependencies

State-of-the-art performance in translation and beyond.
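
The two ingredients, self-attention and positional encoding, can be sketched together. This is a simplified illustration: it uses the sinusoidal positional encoding and scaled dot-product attention, but skips the learned query/key/value projections and multiple heads that a real Transformer has, so each input vector acts as its own query, key, and value. The token embeddings are hand-picked toy values.

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal encoding: even dimensions use sin, odd dimensions use cos.
    pe = []
    for k in range(d_model):
        angle = position / (10000 ** ((2 * (k // 2)) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    # Simplification (assumption): no learned projections, so Q = K = V = vectors.
    d = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * value[c] for w, value in zip(weights, vectors))
                        for c in range(d)])
    return outputs

# Hand-picked toy embeddings for a 3-token sentence (assumption).
token_embeddings = [[0.1, 0.2, 0.3, 0.4],
                    [0.5, 0.1, 0.0, 0.2],
                    [0.3, 0.3, 0.1, 0.0]]

# Add position information, then let every token attend to every token.
inputs = [[e + p for e, p in zip(emb, positional_encoding(pos, 4))]
          for pos, emb in enumerate(token_embeddings)]
outputs = self_attention(inputs)
print(len(outputs), len(outputs[0]))  # 3 4
```

Notice that the loop over `query` has no dependency between iterations: each token's output can be computed independently, which is where the parallelisation comes from.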

So the “bilingual Transformer-based NMT” section you’re entering builds on all of this:

Concept	Role in the journey
Tokenisation	Turns text into tokens and integer IDs
Embeddings	Turn token IDs into meaningful vectors
RNNs	Process sequences in order
LSTMs	Improve memory over time
Seq2Seq	Encode → decode for translation
Attention	Focus on important words
Transformer	Pure attention → modern NMT models