In [40]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config

In [41]:
# !pip install sentencepiece

# Option 1: Using T5 Model

In [42]:
#initialize the model and tokenizer
model_name = 't5-small'
tokenizer = T5Tokenizer.from_pretrained(model_name)
config = T5Config.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name, config=config)

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


T5 models expect the input text to be prefixed with a task-specific prompt, such as "summarize: " for summarization tasks. This helps the model understand what kind of output is expected.

In [None]:
text="""We all know that OpenAI actually started the trend of AI tools after releasing ChatGPT.
After that, we saw everyone shift to AI. Developers and big companies began building AI tools, and even individuals started learning about artificial intelligence.
Thanks to that, we have tons of popular AI tools like RunwayML, Lovable, Claude, Gemini, Perplexity, Cursor, Stitch, NotebookLM, Leonardo AI, Framer AI, and the list goes on.
That's not all. Every day, tons of new AI tools are launched, which makes it difficult for people to find the best ones for their needs.
That's why I spend a lot of time testing some of the best new AI tools and write a couple of posts every month to share the ones that truly stand out."""

In [44]:
import re
import html
# modified with GPT for best practices
def clean_text(text):
    text = html.unescape(text)
    text = text.lower()
    text = re.sub(r'<.*?>', ' ', text)             # remove HTML
    text = re.sub(r'http\S+|www\.\S+', ' ', text)  # remove URLs
    text = re.sub(r'\S+@\S+', ' ', text)           # remove emails
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # soooo -> soo

    # keep ! and ? (sentiment)
    text = re.sub(r'[^a-z!? ]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

In [45]:
cleaned_text=clean_text(text).split()
len(cleaned_text)

127

In [46]:
tokenized_text=tokenizer.encode("summarize: " + clean_text(text), return_tensors="pt", max_length=512, truncation=True)
print(tokenized_text)

tensor([[21603,    10,    62,    66,   214,    24,   539,     9,    23,   700,
           708,     8,  4166,    13,     3,     9,    23,  1339,   227,     3,
         16306,  3582,   122,   102,    17,   227,    24,    62,  1509,   921,
          4108,    12,     3,     9,    23,  5564,    11,   600,   688,  1553,
           740,     3,     9,    23,  1339,    11,   237,  1742,   708,  1036,
            81,  7353,  6123,  2049,    12,    24,    62,    43,  8760,    13,
          1012,     3,     9,    23,  1339,   114, 22750,    51,    40,     3,
          5850,   179,     3,    75, 12513,    15,   873,  7619,   399,  9247,
           485,  8385,   127, 12261, 16638,    40,    51,    90,   106,   986,
            32,     3,     9,    23,  2835,    52,     3,     9,    23,    11,
             8,   570,  1550,    30,    24,     3,     7,    59,    66,   334,
           239,  8760,    13,   126,     3,     9,    23,  1339,    33,  3759,
            84,   656,    34,  1256,    21,   151,  

In [47]:
summary_ids=model.generate(tokenized_text, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)
summary=tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:")
print(summary)
print(len(summary.split()))

Summary:
openai has tons of popular ai tools like runwayml lovable claude gemini perplexity cursor stitch notebooklm leonardo ai framer ai. the list goes on that s not all every day tons of new ai tools are launched which makes it difficult for people to find the best ones for their needs.
51


# Option 2: Using Pipeline

In [48]:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(text, max_length=130, min_length=30)

print(summary)




Device set to use cpu


[{'summary_text': "Every day, tons of new AI tools are launched, which makes it difficult for people to find the best ones for their needs. That's why I spend a lot of time testing some of the best newAI tools and write a couple of posts every month to share the ones that truly stand out."}]


# Option 3: LSTM/GRU/RNN

In [49]:
import os
from glob import glob
from pathlib import Path

news_files = Path("E:\\70 Days 70 Project\\Text Summarization\\data\\BBC News Summary\\News Articles")
summaries_files = Path("E:\\70 Days 70 Project\\Text Summarization\\data\\BBC News Summary\\Summaries")

news_categories=os.listdir(news_files)
summary_categories=os.listdir(summaries_files)
news_categories,summary_categories

(['business', 'entertainment', 'politics', 'sport', 'tech'],
 ['business', 'entertainment', 'politics', 'sport', 'tech'])

In [50]:
for files in os.listdir(Path(news_files/news_categories[0])):
    print(files)

001.txt
002.txt
003.txt
004.txt
005.txt
006.txt
007.txt
008.txt
009.txt
010.txt
011.txt
012.txt
013.txt
014.txt
015.txt
016.txt
017.txt
018.txt
019.txt
020.txt
021.txt
022.txt
023.txt
024.txt
025.txt
026.txt
027.txt
028.txt
029.txt
030.txt
031.txt
032.txt
033.txt
034.txt
035.txt
036.txt
037.txt
038.txt
039.txt
040.txt
041.txt
042.txt
043.txt
044.txt
045.txt
046.txt
047.txt
048.txt
049.txt
050.txt
051.txt
052.txt
053.txt
054.txt
055.txt
056.txt
057.txt
058.txt
059.txt
060.txt
061.txt
062.txt
063.txt
064.txt
065.txt
066.txt
067.txt
068.txt
069.txt
070.txt
071.txt
072.txt
073.txt
074.txt
075.txt
076.txt
077.txt
078.txt
079.txt
080.txt
081.txt
082.txt
083.txt
084.txt
085.txt
086.txt
087.txt
088.txt
089.txt
090.txt
091.txt
092.txt
093.txt
094.txt
095.txt
096.txt
097.txt
098.txt
099.txt
100.txt
101.txt
102.txt
103.txt
104.txt
105.txt
106.txt
107.txt
108.txt
109.txt
110.txt
111.txt
112.txt
113.txt
114.txt
115.txt
116.txt
117.txt
118.txt
119.txt
120.txt
121.txt
122.txt
123.txt
124.txt
125.txt


Ref: https://www.kaggle.com/code/mallaavinash/text-summarization

In [8]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                            "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}


In [9]:
import re
import html
# modified with GPT for best practices
def clean_text(text):
    text = html.unescape(text)
    text = text.lower()
    text = ' '.join(contraction_mapping.get(word, word) for word in text.split())  # expand contractions
    text = re.sub(r'<.*?>', ' ', text)             # remove HTML
    text = re.sub(r'http\S+|www\.\S+', ' ', text)  # remove URLs
    text = re.sub(r'\S+@\S+', ' ', text)           # remove emails
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # soooo -> soo

    # keep ! and ? (sentiment)
    text = re.sub(r'[^a-z!? ]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


#### Run once

In [10]:
# import pandas as pd
# from tqdm import tqdm

# dataframe={'news':[], 'summary':[]}

# for category in news_categories:
#     for file in tqdm(os.listdir(Path(news_files/category)),desc=f"News Category: {category}"):
#         with open(Path(news_files/category/file), 'r', encoding='utf-8', errors='replace') as news_file:
#             news_content=news_file.read()
#             dataframe['news'].append(news_content)
#     for file in tqdm(os.listdir(Path(summaries_files/category)),desc=f"Summary Category: {category}"):
#         with open(Path(summaries_files/category/file), 'r', encoding='utf-8', errors='replace') as summary_file:
#             summary_content=summary_file.read()
#             dataframe['summary'].append(summary_content)
# df=pd.DataFrame(dataframe)
# df.head()

In [11]:
# df.shape

In [12]:
# df.to_csv("bbc_news_summary_dataset.csv", index=False)

In [13]:
import pandas as pd

df=pd.read_csv("bbc_news_summary_dataset.csv")
df.head()

Unnamed: 0,news,summary
0,Ad sales boost Time Warner profit\n\nQuarterly...,TimeWarner said fourth quarter sales rose 2% t...
1,Dollar gains on Greenspan speech\n\nThe dollar...,The dollar has hit its highest level against t...
2,Yukos unit buyer faces loan claim\n\nThe owner...,Yukos' owner Menatep Group says it will ask Ro...
3,High fuel prices hit BA's profits\n\nBritish A...,"Rod Eddington, BA's chief executive, said the ..."
4,Pernod takeover talk lifts Domecq\n\nShares in...,Pernod has reduced the debt it took on to fund...


In [14]:
df["news"]=df["news"].apply(lambda x: clean_text(x))
df["summary"]=df["summary"].apply(lambda x: clean_text(x))
df.head()

Unnamed: 0,news,summary
0,ad sales boost time warner profit quarterly pr...,timewarner said fourth quarter sales rose to b...
1,dollar gains on greenspan speech the dollar ha...,the dollar has hit its highest level against t...
2,yukos unit buyer faces loan claim the owners o...,yukos owner menatep group says it will ask ros...
3,high fuel prices hit ba s profits british airw...,rod eddington ba s chief executive said the re...
4,pernod takeover talk lifts domecq shares in uk...,pernod has reduced the debt it took on to fund...


In [15]:
df["summary"]='<sos> '+df["summary"]+' <eos>'
df.head()

Unnamed: 0,news,summary
0,ad sales boost time warner profit quarterly pr...,<sos> timewarner said fourth quarter sales ros...
1,dollar gains on greenspan speech the dollar ha...,<sos> the dollar has hit its highest level aga...
2,yukos unit buyer faces loan claim the owners o...,<sos> yukos owner menatep group says it will a...
3,high fuel prices hit ba s profits british airw...,<sos> rod eddington ba s chief executive said ...
4,pernod takeover talk lifts domecq shares in uk...,<sos> pernod has reduced the debt it took on t...


In [16]:
PAD = "<pad>"
SOS = "<sos>"
EOS = "<eos>"
UNK = "<unk>"

Ref: GPT

In [17]:
from collections import Counter

class Vocab:
    def __init__(self, texts, max_size=30000, min_freq=2):
        counter = Counter()
        for text in texts:
            counter.update(text.split())

        self.itos = [PAD, SOS, EOS, UNK]
        for word, freq in counter.most_common():
            if freq >= min_freq and len(self.itos) < max_size:
                self.itos.append(word)

        self.stoi = {word: idx for idx, word in enumerate(self.itos)}

    def encode(self, text):
        return [self.stoi.get(w, self.stoi[UNK]) for w in text.split()]

    def __len__(self):
        return len(self.itos)
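
A small sketch of how this vocabulary behaves on a toy corpus (the sentences are made up for illustration). It repeats the class so it runs on its own. One thing it makes visible: when the texts already contain the literal `<sos>` and `<eos>` markers, those strings are counted like ordinary words and appended a second time, so `stoi` ends up pointing at the later position. That would be consistent with `<sos>` appearing as index 22 rather than 1 in the batches further down.

```python
from collections import Counter

PAD, SOS, EOS, UNK = "<pad>", "<sos>", "<eos>", "<unk>"

class Vocab:
    """Same logic as the class above, repeated so this sketch runs on its own."""
    def __init__(self, texts, max_size=30000, min_freq=2):
        counter = Counter()
        for text in texts:
            counter.update(text.split())
        self.itos = [PAD, SOS, EOS, UNK]
        for word, freq in counter.most_common():
            if freq >= min_freq and len(self.itos) < max_size:
                self.itos.append(word)
        # Later duplicates overwrite earlier entries in this dict
        self.stoi = {word: idx for idx, word in enumerate(self.itos)}

    def encode(self, text):
        return [self.stoi.get(w, self.stoi[UNK]) for w in text.split()]

# Toy corpus (hypothetical); the markers are already part of the text.
toy_texts = ["<sos> the cat sat <eos>", "<sos> the dog sat <eos>"]
toy_vocab = Vocab(toy_texts, min_freq=2)

print(toy_vocab.itos)
print(toy_vocab.stoi[SOS], toy_vocab.stoi[EOS])

# "bird" never appears in the corpus, so it maps to the <unk> index
encoded = toy_vocab.encode("<sos> the bird sat <eos>")
print(encoded)
```

Words seen only once ("cat", "dog") fall below `min_freq=2` and are also encoded as `<unk>`.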


In [18]:
article_vocab = Vocab(df["news"], max_size=30000)
summary_vocab = Vocab(df["summary"], max_size=15000)
print(f"Article Vocab Size: {len(article_vocab)}")
print(f"Summary Vocab Size: {len(summary_vocab)}")

Article Vocab Size: 18730
Summary Vocab Size: 12388


In [None]:
from torch.utils.data import Dataset,DataLoader
import torch
class SummaryDataset(Dataset):
    def __init__(self,df,article_vocab,summary_vocab,max_article_len=512,max_summary_len=100):
        super().__init__()
        self.df=df
        self.article_vocab=article_vocab
        self.summary_vocab=summary_vocab
        self.max_article_len=max_article_len
        self.max_summary_len=max_summary_len
        
    def __len__(self):
        return len(self.df)
    
    def pad(self, sequence, max_length, pad_idx):
        return sequence[:max_length] + [pad_idx] * (max_length - len(sequence))
    
    def __getitem__(self, idx):
        article=self.df.iloc[idx]["news"] # get the article text of the given index
        summary=self.df.iloc[idx]["summary"] # get the summary text of the given index
        print(f"Summary: {summary}")
        enc=self.article_vocab.encode(article) # encode the article text
        dec=self.summary_vocab.encode(summary) # encode the summary text
        
        print(f"Encoded Summary: {dec}")
        print(f"Length before padding: {len(dec)}")
        enc=self.pad(enc,self.max_article_len,self.article_vocab.stoi[PAD]) # pad the encoded article
        dec=self.pad(dec,self.max_summary_len,self.summary_vocab.stoi[PAD]) # pad the encoded summary
        print(f"Padded Summary: {dec}")
        print(f"Length after padding: {len(dec)}")
        return torch.tensor(enc), torch.tensor(dec) # return the padded tensors

---

## 🔍 Understanding the Padding Logic

The `pad()` function does **TWO things**:

```python
def pad(self, sequence, max_length, pad_idx):
    return sequence[:max_length] + [pad_idx] * (max_length - len(sequence))
```

### 1️⃣ **If sequence is LONGER than max_length: TRUNCATE**
```python
sequence = [1, 2, 3, 4, 5, 6, 7, 8]  # length = 8
max_length = 5
pad_idx = 0

# Step 1: sequence[:max_length] = [1, 2, 3, 4, 5]  (first 5 only)
# Step 2: [0] * (5 - 5) = []  (no padding needed, already 5)

# Result: [1, 2, 3, 4, 5]  ✂️ CUT OFF!
```

### 2️⃣ **If sequence is SHORTER than max_length: PAD**
```python
sequence = [1, 2, 3]  # length = 3
max_length = 5
pad_idx = 0

# Step 1: sequence[:max_length] = [1, 2, 3]  (all of it)
# Step 2: [0] * (5 - 3) = [0, 0]  (add 2 padding tokens)

# Result: [1, 2, 3, 0, 0]  ✏️ PADDED!
```

---

## 📊 Your Examples

### Example 1: LONG SEQUENCE (gets truncated)
```
Encoded Summary length: 254
max_length: 50

# Sequence is much longer than 50!
sequence[:50] = [22, 4, 5045, ..., 2578, 40, 15]  (first 50)
[0] * (50 - 50) = []  (nothing to add)

Result: [22, 4, 5045, ..., 2578, 40, 15]  ✂️ TRUNCATED to 50
# You only see the FIRST 50 tokens, rest is discarded!
```

### Example 2: SHORTER SEQUENCE (still gets truncated)
```
Encoded Summary length: 84
max_length: 50

# Sequence is still longer than 50
sequence[:50] = [22, 45, 59, ..., 926, 433, 570]  (first 50)
[0] * (50 - 50) = []  (nothing to add)

Result: [22, 45, 59, ..., 926, 433, 570]  ✂️ TRUNCATED to 50
# Again, only first 50 tokens kept!
```

---

## 🔑 Key Point

**The padding in your dataset is mostly TRUNCATION, not padding!**

Why? Because actual BBC news summaries are usually **longer than 50 tokens**.

To see actual padding in action, you'd need a shorter summary:

```python
sequence = [22, 45, 59, 7864, ..., 941, 82]  # length = 30
max_length = 50
pad_idx = 0

# Result: [22, 45, 59, 7864, ..., 941, 82, 0, 0, 0, ..., 0]
#          └──────── 30 real tokens ───────┘  └─ 20 pads ─┘
```

---

## 💡 Why Truncate?

The model needs **consistent input shapes**:
- All encoder inputs: `(batch_size, 512)` tokens
- All decoder inputs: `(batch_size, 50)` tokens

If summaries are longer than 50, we **truncate** to the first 50 tokens (keep the beginning). Note that this also cuts off the closing `<eos>` marker of every truncated summary.

If summaries are shorter than 50, we **pad** with `<pad>` tokens to reach 50.
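
To check how much of a dataset is affected by truncation, a rough sketch like the one below can help. The helper name `truncation_share` and the toy summaries are assumptions made for illustration; the idea is simply to count whitespace tokens the same way `Vocab.encode` splits them.

```python
def truncation_share(texts, max_len):
    """Fraction of texts whose whitespace token count exceeds max_len."""
    lengths = [len(t.split()) for t in texts]
    if not lengths:
        return 0.0
    return sum(length > max_len for length in lengths) / len(lengths)

# Hypothetical stand-ins for df["summary"]
toy_summaries = [
    "<sos> short summary <eos>",          # 4 tokens
    "<sos> " + "word " * 60 + "<eos>",    # 62 tokens, would be truncated
    "<sos> " + "word " * 10 + "<eos>",    # 12 tokens
    "<sos> " + "word " * 48 + "<eos>",    # 50 tokens, fits exactly
]

share = truncation_share(toy_summaries, max_len=50)
print(f"{share:.0%} of summaries would be truncated")
```

On the notebook's data the same call would be `truncation_share(df["summary"], 50)`; a high share suggests raising the maximum summary length or accepting that the model rarely sees `<eos>`.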

In [29]:
dataset=SummaryDataset(df,article_vocab,summary_vocab)
loader=DataLoader(dataset,batch_size=2,shuffle=True)

In [30]:
next(iter(loader))

Summary: <sos> the explosion in consumer technology is to continue into delegates at the world s largest gadget show in las vegas have been told the consumer electronics show ces featured the pick of s products the portable technologies on show also reflected one of the buzzwords of ces which was the time and place shifting of multimedia content being able to watch and listen to video and music anywhere at any time another disappointment was the lack of exposure sony s new portable games device the psp had at the show a sony representative told the bbc news website this was because sony did not consider it to be part of their consumer technology offering he unveiled new ways of letting people take tv shows recorded on personal video recorders and watch them back on portable devices everything is going digital kirsten pfeifer from the consumer electronics association told the bbc news website hybrid devices which combine a number of multimedia functions were also in evidence on the show

[tensor([[1014,  121,    5,  ...,  445,   36,   17],
         [8791,  536,   57,  ...,    0,    0,    0]]),
 tensor([[  22,    4, 5045,    8,  465,  144,   14,    5,  582,   80, 3185,   25,
             4,   76,   12,  693,  775,  132,    8, 2683, 2819,   28,   41,   96,
             4,  465, 1232,  132, 2578, 4406,    4, 2385,    6,   12,  867,    4,
           853,  895,   16,  132,   54, 5569,   55,    6,    4, 9183,    6, 2578,
            40,   15],
         [  22,   45,   59, 7864, 1394, 1277,  570,   32,    5, 1156,    7,  926,
           433,  947,   16,  838,  510,   66,   42, 7864,  570, 2012,   57,   29,
             4,  838, 3734,   15,   94, 7864,   18,   54, 2323,  307, 2315,    5,
           570,   63,   39,   32,  947,  344,   59,    5, 3479,   11,    4,  926,
           433,  570]])]

---

## 🎬 Complete Data Flow Visualization

Let's trace how a **real batch** flows through the entire system!

### 📊 Input Batch Example:
```python
enc_batch.shape  # (4, 512): 4 articles, each 512 tokens
dec_batch.shape  # (4, 50):  4 summaries, each 50 tokens
```

---

### 🔄 **TRAINING FLOW** (forward function)

```
┌────────────────────────────────────────────────────────────┐
│ STEP 1: ENCODER - "Understanding the Article"              │
└────────────────────────────────────────────────────────────┘

enc_batch: (4, 512)  → [4052, 187, 700, 71, ...]
    ↓
Embedding: (4, 512, 256)  → Each token becomes a 256-d vector
    ↓
LSTM processes all 512 tokens:
    Token 0: [4052] → hidden state update
    Token 1: [187]  → hidden state update
    ...
    Token 511: [...] → Final hidden state
    ↓
encoder_hidden (h): (1, 4, 512)  ← Compressed understanding!
encoder_cell (c):   (1, 4, 512)  ← Memory!


┌────────────────────────────────────────────────────────────┐
│ STEP 2: DECODER - "Generating Summary Word by Word"        │
└────────────────────────────────────────────────────────────┘

Initial Input: dec_in_seq[:, 0] = <SOS> token for all 4 samples
              Shape: (4,)  → [22, 22, 22, 22]

╔════════════════════════════════════════════════════════════╗
║ TIME STEP 0: Generate 1st word                             ║
╚════════════════════════════════════════════════════════════╝

current_input: (4,) = [22, 22, 22, 22]  (<SOS>)
    ↓
unsqueeze(1): (4, 1)
    ↓
Embedding: (4, 1, 256)
    ↓
LSTM(embedding, h, c):
    - Uses context from encoder (h, c)
    - Output: (4, 1, 512)
    ↓
squeeze(1): (4, 512)
    ↓
FC Layer: (4, vocab_size)  → [0.01, 0.05, 0.8, 0.02, ...]
    ↓
Store in all_decoder_outputs[:, 0, :]

Teacher Forcing Decision:
    random() < 0.5?
    ├─ YES: next_input = dec_in_seq[:, 1] (real target word)
    └─ NO:  next_input = argmax(predictions) (model's guess)

╔════════════════════════════════════════════════════════════╗
║ TIME STEP 1: Generate 2nd word                             ║
╚════════════════════════════════════════════════════════════╝

current_input: (4,) = [506, 506, 336, 506] (from previous step)
    ↓
unsqueeze(1): (4, 1)
    ↓
Embedding: (4, 1, 256)
    ↓
LSTM(embedding, h_updated, c_updated):  ← h and c evolve!
    - Output: (4, 1, 512)
    ↓
FC Layer: (4, vocab_size)
    ↓
Store in all_decoder_outputs[:, 1, :]

... Repeat for all 50 time steps ...

╔════════════════════════════════════════════════════════════╗
║ TIME STEP 49: Generate 50th word                           ║
╚════════════════════════════════════════════════════════════╝

Final all_decoder_outputs: (4, 50, vocab_size)


┌────────────────────────────────────────────────────────────┐
│ STEP 3: LOSS CALCULATION                                   │
└────────────────────────────────────────────────────────────┘

outputs: (4, 50, 5000) → Reshape → (200, 5000)
targets: (4, 50)       → Reshape → (200,)
    ↓
CrossEntropyLoss:
    Compare 200 predictions with 200 target words
    Calculate average loss
    ↓
Backward: Update all weights to minimize loss!
```

---

### 🎯 **INFERENCE FLOW** (after training)

```
┌────────────────────────────────────────────────────────────┐
│ User provides new article (not in training data)           │
└────────────────────────────────────────────────────────────┘

Article: "The economy grew by 5% last quarter..."
    ↓
Encode + Tokenize: [1234, 5678, ...]
    ↓
Encoder: h, c (context)
    ↓
Decoder Loop:
    Start: <SOS>
    Time 0: <SOS> → "economy"
    Time 1: "economy" → "grows"
    Time 2: "grows" → "5%"
    Time 3: "5%" → <EOS> (STOP!)
    ↓
Summary: "economy grows 5%"
```
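
The decoder loop above can be sketched in plain Python. This is an illustration only: `greedy_decode`, `step_fn` and the toy transition table are hypothetical names standing in for one real decoder step followed by an argmax, so that the control flow (feed the prediction back in, stop at `<eos>` or after a step limit) is easy to follow.

```python
def greedy_decode(step_fn, sos_id, eos_id, max_steps=50):
    """Feed each predicted token back in until <eos> or max_steps.

    step_fn(token_id, state) -> (next_token_id, new_state) is a hypothetical
    stand-in for one decoder step followed by argmax.
    """
    tokens = []
    current, state = sos_id, None
    for _ in range(max_steps):
        current, state = step_fn(current, state)
        if current == eos_id:
            break                      # stop as soon as <eos> is produced
        tokens.append(current)
    return tokens

# Toy "decoder": a fixed transition table instead of a trained model.
SOS_ID, EOS_ID = 1, 2
id_to_word = {3: "economy", 4: "grows", 5: "5%"}
transitions = {SOS_ID: 3, 3: 4, 4: 5, 5: EOS_ID}

def toy_step(token_id, state):
    return transitions[token_id], state

summary_ids = greedy_decode(toy_step, SOS_ID, EOS_ID)
print(" ".join(id_to_word[i] for i in summary_ids))  # economy grows 5%
```

The `max_steps` limit matters in practice: a model that never emits `<eos>` would otherwise loop forever.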

---


## Architecture Components

### 1. **Embedding Layer** - Converting Words to Numbers

**Problem:** Neural networks don't understand words like "cat" or "dog"

**Solution:** Convert each word to a dense vector of numbers

```
Word: "scientist" (ID=4052)
    ↓
Embedding Layer
    ↓
Vector: [0.23, -0.45, 0.67, ..., 0.12]  # 256 dimensions
```

**Why 256 dimensions?**
- Captures semantic meaning (similar words have similar vectors)
- Learned during training
- Trade-off: Higher = more capacity, but slower training

**Mathematical Operation:**
```python
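import torch
import torch.nn as nn

# nn.Embedding takes num_embeddings (vocabulary size) and embedding_dim
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=256)
x = torch.tensor([4052])  # Word ID
embedded = embedding(x)   # Shape: (1, 256)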
```

This is essentially a **lookup table**:
```
embedding.weight[4052] → [0.23, -0.45, 0.67, ...]
```
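
The lookup-table claim can be checked directly. The sketch below assumes PyTorch is installed and uses the same illustrative sizes as above; it also shows what `padding_idx=0` does, since the encoder and decoder in this notebook pass that argument.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only so repeated runs produce the same weights

embedding = nn.Embedding(num_embeddings=10000, embedding_dim=256)
word_id = torch.tensor([4052])

embedded = embedding(word_id)     # shape: (1, 256)
row = embedding.weight[4052]      # shape: (256,)

print(embedded.shape)
print(torch.equal(embedded[0], row))  # forward pass is a row lookup

# With padding_idx=0 the row for the <pad> id is initialised to zeros
padded_embedding = nn.Embedding(10000, 256, padding_idx=0)
pad_vector = padded_embedding(torch.tensor([0]))
print(bool((pad_vector == 0).all()))
```

The padding row also receives no gradient updates, so `<pad>` tokens keep contributing a zero vector throughout training.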

---

### 2. **LSTM (Long Short-Term Memory)** - The Memory Unit

**Why LSTM and not simple RNN?**

**Problem with Simple RNNs:**
```
Sentence: "The cat, which was very fluffy and cute, sat on the mat"

When processing "sat", simple RNN forgets "cat" (vanishing gradient)
```

**In Our Code:**
```python
self.lstm = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)

# input_size=256: Input size (word embedding dimension, embed_dim)
# hidden_size=512: Size of hidden state and cell state (hidden_dim)
# batch_first=True: Input shape is (batch, seq_len, features)
```

---

### 3. **Encoder Architecture**

**Purpose:** Compress the entire input sequence into a fixed-size representation

```python
class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        emb = self.embedding(x)              # (batch, seq, embed)
        outputs, (h, c) = self.lstm(emb)     # Process sequence
        return h, c                          # Return final states only
```

**Key Point:** We **discard `outputs`** and only keep **`h`** and **`c`**!

**Why?**
- `outputs`: Hidden state at **every** time step (batch, seq_len, hidden_dim)
- `h`: Hidden state at the **last** time step (1, batch, hidden_dim)
- `c`: Cell state at the **last** time step (1, batch, hidden_dim)

The final `h` and `c` are the **compressed representation** of the entire article!

**Information Flow:**
```
Token 0: "The"        → h₀, c₀
Token 1: "scientist"  → h₁, c₁ (remembers "The" + "scientist")
Token 2: "discovered" → h₂, c₂ (remembers all previous)
...
Token 511: "."        → h₅₁₁, c₅₁₁ (remembers ENTIRE article!)

We use h₅₁₁ and c₅₁₁ as context for decoder!
```

---

### 4. **Decoder Architecture**

**Purpose:** Generate output sequence one word at a time, conditioned on encoder's context

```python
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h, c):
        x = x.unsqueeze(1)                    # (batch,) → (batch, 1)
        emb = self.embedding(x)               # (batch, 1, embed)
        output, (h, c) = self.lstm(emb, (h, c))  # Use encoder's h, c!
        pred = self.fc(output.squeeze(1))     # (batch, vocab_size)
        return pred, h, c
```

**Key Differences from Encoder:**

1. **Input Shape:** 
   - Encoder: Processes entire sequence (batch, 512)
   - Decoder: Processes ONE token at a time (batch, 1)

2. **Initial Hidden State:**
   - Encoder: Starts with zeros
   - Decoder: Starts with encoder's final h and c

3. **Output:**
   - Encoder: Returns only h, c
   - Decoder: Returns predictions + updated h, c

**The FC Layer:**
```python
self.fc = nn.Linear(in_features=512, out_features=5000)  # hidden_dim → vocab_size
```

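Converts LSTM's hidden state → one score (logit) per vocabulary word. The layer itself applies no softmax; during training `CrossEntropyLoss` applies it internally, and at inference an argmax over the logits picks the same word as an argmax over the probabilities.

```
LSTM output: [0.23, -0.45, 0.67, ...]  (512 values)
    ↓
FC Layer (linear transformation) → logits  (5000 values)
    ↓
Softmax (inside the loss, or applied explicitly)
    ↓
Probabilities: [0.001, 0.002, 0.8, 0.001, ...]  (5000 values)
                               ↑
                          "scientist" (highest probability!)
```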

---

### 5. **Seq2Seq Model - Connecting Everything**

This is the **orchestrator** that combines encoder and decoder:
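
A minimal sketch of such a wrapper, reconstructed from the training flow diagram earlier (start from `dec_in_seq[:, 0]`, store one set of predictions per time step, decide on teacher forcing with `random() < 0.5`). It is not the notebook's own class: the name `Seq2Seq`, the `teacher_forcing_ratio` argument and the toy sizes at the bottom are assumptions. The encoder and decoder are repeated so the block runs on its own.

```python
import random
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        _, (h, c) = self.lstm(self.embedding(x))
        return h, c

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h, c):
        emb = self.embedding(x.unsqueeze(1))       # (batch, 1, embed)
        output, (h, c) = self.lstm(emb, (h, c))
        return self.fc(output.squeeze(1)), h, c    # logits: (batch, vocab)

class Seq2Seq(nn.Module):
    """Hypothetical wrapper sketched from the flow diagram."""
    def __init__(self, encoder, decoder, teacher_forcing_ratio=0.5):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.teacher_forcing_ratio = teacher_forcing_ratio

    def forward(self, enc_batch, dec_in_seq):
        h, c = self.encoder(enc_batch)             # context from the article
        current_input = dec_in_seq[:, 0]           # <sos> for every sample
        num_steps = dec_in_seq.size(1)
        step_logits = []
        for t in range(num_steps):
            logits, h, c = self.decoder(current_input, h, c)
            step_logits.append(logits)
            if t + 1 < num_steps:
                if random.random() < self.teacher_forcing_ratio:
                    current_input = dec_in_seq[:, t + 1]   # ground-truth token
                else:
                    current_input = logits.argmax(dim=1)   # model's own guess
        return torch.stack(step_logits, dim=1)     # (batch, steps, vocab)

# Toy sizes, chosen only to keep the example fast
model = Seq2Seq(Encoder(100, 16, 32), Decoder(80, 16, 32))
enc_batch = torch.randint(1, 100, (4, 20))
dec_in_seq = torch.randint(1, 80, (4, 6))
outputs = model(enc_batch, dec_in_seq)
print(outputs.shape)  # torch.Size([4, 6, 80])
```

The teacher forcing decision here is made once per time step for the whole batch; deciding per sample is another possible design.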

## Training Concepts

### **Teacher Forcing** - The Training Trick

**Problem:** During training, when the model's own prediction is fed back in as the next input, one early mistake puts every later step in the wrong context!

```
Target:  "scientist" "discover" "planet"
Predict: "scientist" "found"    "star"    ← One mistake cascades!
                        ↑
                  Wrong prediction makes next prediction harder
```

**Solution: Teacher Forcing**

With probability 0.5, use the **real target word** instead of prediction:

```
Time 0: Model predicts "scientist" ✓
        Next input = "scientist" (ground truth)

Time 1: Model predicts "found" ✗ (wrong)
        Next input = "discover" (ground truth - teacher forcing!)

Time 2: Model predicts "planet" ✓ (correct because input was right)
```

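**Trade-off:**
- Too much teacher forcing (1.0): Model never sees its own mistakes during training, so it does not learn to recover from them at inference
- Too little teacher forcing (0.0): Training is slow and unstable, because early predictions are mostly wrong
- Common starting point: 0.5 - 0.6 (a hyperparameter worth tuning, not a fixed rule)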
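
The per-step decision can be written as a tiny function. This is a sketch with hypothetical names (`choose_next_input`, the `rng` argument), using plain strings instead of token ids to keep it readable.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability teacher_forcing_ratio."""
    if rng.random() < teacher_forcing_ratio:
        return target_token      # teacher forcing: feed the real word
    return predicted_token       # free running: feed the model's guess

rng = random.Random(0)  # seeded only to make the run repeatable
choices = [choose_next_input("discover", "found", 0.5, rng) for _ in range(1000)]
forced = choices.count("discover")
print(f"ground truth used in {forced} of 1000 steps")
```

With a ratio of 1.0 the function always returns the target, and with 0.0 it always returns the prediction, which matches the two extremes in the list above.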

---

### **Loss Calculation**

```python
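import torch.nn as nn

criterion = nn.CrossEntropyLoss()      # the loss is a module: create it, then call it

outputs = model(enc_batch, dec_input)  # (4, 50, 5000) logits
targets = dec_batch[:, 1:]             # targets shifted by one token; (4, 50) in this illustration

# Reshape for CrossEntropyLoss
outputs = outputs.reshape(-1, 5000)    # (200, 5000)
targets = targets.reshape(-1)          # (200,)

loss = criterion(outputs, targets)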
```

**Why reshape?**

CrossEntropyLoss expects:
- **Predictions:** (N, C) where N=number of predictions, C=number of classes
- **Targets:** (N,) with class indices

We have 4 samples × 50 words = 200 predictions total!

**Mathematical Formula:**
```
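Loss = -(1/N) ∑ log(p(target_word))     (N = number of predictions, 200 here)

For each position, penalize if model didn't give high probability to target word,
then average over all positions (the default for CrossEntropyLoss)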
```
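
To make the formula concrete, here is a hand-rolled version in plain Python. It is a sketch for understanding, not a replacement for `nn.CrossEntropyLoss`; the function name and the toy logits are made up. It applies a log-softmax to each row of raw scores and averages the negative log-probability of the target ids.

```python
import math

def cross_entropy(logits_per_step, targets):
    """Mean negative log-likelihood of the target ids, computed from raw logits."""
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        total += log_sum_exp - logits[target]  # equals -log softmax(logits)[target]
    return total / len(targets)

# Two toy predictions over a 3-word vocabulary (made-up numbers)
logits = [[2.0, 0.5, 0.1], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(round(cross_entropy(logits, targets), 4))

# A confident correct prediction costs less than a confident wrong one
confident_right = cross_entropy([[5.0, 0.0, 0.0]], [0])
confident_wrong = cross_entropy([[5.0, 0.0, 0.0]], [1])
print(confident_right < confident_wrong)
```

A row of equal logits gives a loss of `log(vocab_size)`, which is a useful sanity value: an untrained model should start near it.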

---

In [65]:
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        emb = self.embedding(x)
        outputs, (h, c) = self.lstm(emb)
        return h, c

### 📖 Encoder Explained

The **Encoder** reads the entire article and creates a "memory" of it.

**Components:**
- **Embedding Layer**: Converts word IDs → dense vectors (numbers the model can work with)
- **LSTM Layer**: Processes the sequence and creates hidden state (h) and cell state (c)

**Data Flow Example:**
```
Input: [4052, 187, 700, ...]  # Word IDs (batch_size=4, seq_len=512)
    ↓
Embedding: [[0.2, 0.5, ...], [0.1, 0.3, ...], ...]  # Shape: (4, 512, 256)
    ↓
LSTM: Processes sequence step-by-step
    ↓
Output:
  - h (hidden state): (1, 4, 512) - The "summary" of the article
  - c (cell state): (1, 4, 512) - The "memory" LSTM keeps
```

**Why h and c?**
- **h (hidden)**: What the model "understood" from the article
- **c (cell)**: What the model "remembers" for later use

These become the starting point for the decoder!
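
The shapes in the data flow above can be checked with a quick sketch. It rebuilds the two encoder layers with toy sizes (assumptions chosen so it runs instantly) and random word IDs instead of a real article.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only to keep the example fast.
vocab_size, embed_dim, hidden_dim = 100, 8, 16
batch_size, seq_len = 4, 12

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

word_ids = torch.randint(1, vocab_size, (batch_size, seq_len))
emb = embedding(word_ids)     # (batch, seq_len, embed_dim)
outputs, (h, c) = lstm(emb)   # outputs: (batch, seq_len, hidden_dim)

print("embeddings:", tuple(emb.shape))
print("outputs:   ", tuple(outputs.shape))
print("h:         ", tuple(h.shape))  # (num_layers, batch, hidden_dim)
print("c:         ", tuple(c.shape))
```

For a single-layer, one-directional LSTM like this one, `h[0]` is the same as the last time step of `outputs`, which is why the encoder can return only `h` and `c`.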

In [None]:
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, h, c):
        x = x.unsqueeze(1)  # (batch,) -> (batch, 1): a sequence of length 1
        emb = self.embedding(x)  # (batch, 1, embed_dim), the 3D input the LSTM expects
        output, (h, c) = self.lstm(emb, (h, c))
        pred = self.fc(output.squeeze(1))  # drop the length-1 sequence dimension -> (batch, vocab_size) logits
        return pred, h, c

### Decoder Explained

The **Decoder** generates the summary one word at a time using the encoder's context.

**Key Difference from Encoder:**
- Encoder processes the **entire sequence at once**
- Decoder processes **one word at a time** (autoregressive)

**Components:**
- **Embedding Layer**: Converts target word ID → vector
- **LSTM Layer**: Takes embedding + previous hidden/cell states → generates next state
- **Fully Connected (fc)**: Converts LSTM output → one raw score (logit) per vocabulary word; softmax turns these into probabilities

**Data Flow Example:**
```
Step 1:
  Input: <SOS> token (ID=22)  # Shape: (4,) - batch of 4
    ↓
  unsqueeze(1): (4, 1)  # Add sequence dimension (because we process 1 word at a time)
    ↓
  Embedding: (4, 1, 256)
    ↓
  LSTM (with h, c from encoder): (4, 1, 512)
    ↓
  squeeze(1): (4, 512)
    ↓
  fc layer: (4, vocab_size)  # One logit (raw score) for each word in vocabulary
    ↓
  Output after softmax: [0.01, 0.05, 0.8, ...]  # Model predicts the most probable next word

Step 2:
  Input: Predicted word from Step 1 (or ground truth if teacher forcing)
  [Same process repeats...]
```

**Why unsqueeze(1)?**
The decoder receives one token ID per sequence, so `x` has shape `(batch_size,)`. `x.unsqueeze(1)` turns it into `(batch_size, 1)`, a sequence of length 1. The embedding layer then adds the feature dimension, giving `(batch_size, 1, embed_dim)`, which matches the `(batch_size, sequence_length, features)` layout that an LSTM with `batch_first=True` expects.
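
A single decoder step can be traced shape by shape with the sketch below. It uses the same three layer types as the `Decoder` class, but with toy sizes and zero-filled states standing in for the encoder output (both are assumptions for this example).

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim, batch_size = 50, 8, 16, 4  # toy sizes

embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, vocab_size)

x = torch.randint(1, vocab_size, (batch_size,))  # one token ID per sequence
h = torch.zeros(1, batch_size, hidden_dim)       # stand-in for the encoder's h
c = torch.zeros(1, batch_size, hidden_dim)       # stand-in for the encoder's c

x = x.unsqueeze(1)                   # (batch,) -> (batch, 1)
emb = embedding(x)                   # (batch, 1, embed_dim)
output, (h, c) = lstm(emb, (h, c))   # (batch, 1, hidden_dim)
logits = fc(output.squeeze(1))       # (batch, vocab_size)

print(tuple(x.shape), tuple(emb.shape), tuple(output.shape), tuple(logits.shape))
```

The states `h` and `c` keep their `(1, batch, hidden_dim)` shape from step to step, so they can be passed straight into the next call.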

In [None]:
class SeqtoSeqModel(nn.Module):
    def __init__(self, encoder_model, decoder_model, computation_device):
        super(SeqtoSeqModel, self).__init__()
        
        self.encoder_model  = encoder_model
        self.decoder_model = decoder_model
        self.device = computation_device
        
    def forward(self, enc_in_seq, dec_in_seq, teacher_forcing_ratio=0.5):
        # First we get the dimensions and initialize the tensor to hold all decoder outputs
        batch_size, decoder_sequence_length = dec_in_seq.shape
        decoder_vocabulary_size = self.decoder_model.fc.out_features
        
        # INITIALIZE A TENSOR TO HOLD ALL DECODER OUTPUTS (Size : batch_size x decoder_sequence_length x decoder_vocabulary_size)
        all_decoder_outputs = torch.zeros(batch_size, decoder_sequence_length, decoder_vocabulary_size, device=self.device)
        
        # ENCODER BLOCK ( GETTING CELL AND HIDDEN STATES FROM ENCODER)
        enc_hidden, enc_cell = self.encoder_model(enc_in_seq)
        current_decoder_input_token = dec_in_seq[:, 0]  # first token (<SOS>) of every sequence in the batch
        
        # DECODER BLOCK (ITERATING OVER EACH TIME STEP IN THE DECODER SEQUENCE)
        for time_stamp in range(decoder_sequence_length):
            predicted_token, enc_hidden, enc_cell = self.decoder_model(current_decoder_input_token, enc_hidden, enc_cell)
            
            all_decoder_outputs[:, time_stamp, :] = predicted_token  # store this time step's logits for the whole batch
            use_teacher_forcing = (torch.rand(1).item() < teacher_forcing_ratio) and (time_stamp + 1 < decoder_sequence_length) 
            
            if use_teacher_forcing:
                current_decoder_input_token = dec_in_seq[:, time_stamp + 1]
            else:
                current_decoder_input_token = predicted_token.argmax(dim=1)
                
        return all_decoder_outputs
        


**What happens in each loop iteration:**

**Time=0:** Generate 1st word
```
current_input: <SOS>
    ↓
Decoder(input=<SOS>, h, c)
    ↓
predicted_logits: [0.01, 0.8, 0.05, ...]  # One score per vocabulary word
    ↓
Store in all_decoder_outputs[:, 0, :]
    ↓
Teacher Forcing Decision:
  - If random() < teacher_forcing_ratio: Use REAL next word from ground truth (dec_in_seq[:, 1])
  - Else: Use PREDICTED word (argmax of logits)
```

**Time=1:** Generate 2nd word
```
current_input: Word from time=0 (either real or predicted)
    ↓
Decoder(input=word1, h, c)  # h and c are updated from previous step!
    ↓
predicted_logits for 2nd word
    ↓
Store in all_decoder_outputs[:, 1, :]
```

...continues for every position in `dec_in_seq` (49 time steps when the summaries hold 50 tokens)...

#### **Step 5: Return All Predictions**
```python
return all_decoder_outputs  # Shape: (4, 49, 5000)
```
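
The whole loop can be traced without any neural network. In this sketch the decoder is replaced by a fixed list of "predictions" so we can see exactly which token is fed in at each step; `trace_decoder_inputs` and the example words are hypothetical, made up for illustration.

```python
import random

def trace_decoder_inputs(dec_in_seq, predictions, teacher_forcing_ratio, rng):
    """Return the token fed to the decoder at every time step.

    dec_in_seq  : ground-truth decoder inputs, starting with <SOS>
    predictions : what the (stubbed) model predicts at each step
    """
    fed_inputs = []
    current = dec_in_seq[0]  # always start from <SOS>
    for t in range(len(dec_in_seq)):
        fed_inputs.append(current)
        has_next = t + 1 < len(dec_in_seq)
        if has_next and rng.random() < teacher_forcing_ratio:
            current = dec_in_seq[t + 1]  # ground truth
        else:
            current = predictions[t]     # model's own output
    return fed_inputs

dec_in_seq = ["<SOS>", "scientist", "discover"]
predictions = ["scientist", "found", "star"]

always_forced = trace_decoder_inputs(dec_in_seq, predictions, 1.0, random.Random(0))
never_forced = trace_decoder_inputs(dec_in_seq, predictions, 0.0, random.Random(0))
print("ratio 1.0:", always_forced)
print("ratio 0.0:", never_forced)
```

With a ratio of 1.0 the decoder only ever sees ground-truth tokens; with 0.0 the wrong prediction "found" is fed back in, which is the cascading-error situation described earlier.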

In [68]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

embed_dim = 256
hidden_dim = 512

encoder = Encoder(len(article_vocab), embed_dim, hidden_dim)
decoder = Decoder(len(summary_vocab), embed_dim, hidden_dim)

sequence_to_sequence_model = SeqtoSeqModel(encoder, decoder, device).to(device)

In [69]:
pad_idx = summary_vocab.stoi[PAD]
loss_function = nn.CrossEntropyLoss(ignore_index=pad_idx)
optimizer = torch.optim.Adam(
    sequence_to_sequence_model.parameters(),
    lr=0.001
)

### Model Instantiation

Creating the actual model with specific dimensions:

```
article_vocab size: ~10,000 words
summary_vocab size: ~5,000 words
embed_dim: 256 (each word → 256-dimensional vector)
hidden_dim: 512 (LSTM internal state size)
```

The model is moved to GPU (if available) for faster training.
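
To get a feel for how big this model is, the parameter count can be worked out by hand. The sketch below assumes the rounded vocabulary sizes listed above (10,000 and 5,000); the real numbers depend on `len(article_vocab)` and `len(summary_vocab)`, and the helper names are made up for this example.

```python
def lstm_param_count(input_dim, hidden_dim):
    # A single-layer PyTorch LSTM stores weight_ih (4H x I), weight_hh (4H x H)
    # and two bias vectors of length 4H each.
    return 4 * hidden_dim * (input_dim + hidden_dim + 2)

def encoder_param_count(vocab_size, embed_dim, hidden_dim):
    embedding = vocab_size * embed_dim
    return embedding + lstm_param_count(embed_dim, hidden_dim)

def decoder_param_count(vocab_size, embed_dim, hidden_dim):
    embedding = vocab_size * embed_dim
    fc = hidden_dim * vocab_size + vocab_size  # weights + bias
    return embedding + lstm_param_count(embed_dim, hidden_dim) + fc

# Assumed round vocabulary sizes; substitute the real len(...) values.
enc = encoder_param_count(10_000, 256, 512)
dec = decoder_param_count(5_000, 256, 512)
print(f"encoder: {enc:,}  decoder: {dec:,}  total: {enc + dec:,}")
```

Under these assumptions the model has roughly 9.6 million parameters, and most of them sit in the two embedding tables and the final `fc` layer rather than in the LSTMs.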

In [None]:
from torch.nn.utils import clip_grad_norm_
from tqdm import tqdm

num_epochs = 20
teacher_forcing_ratio = 0.6

sequence_to_sequence_model.train()

for epoch in range(num_epochs):
    epoch_loss = 0.0
    for enc_batch, dec_batch in tqdm(loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
        enc_batch = enc_batch.to(device)
        dec_batch = dec_batch.to(device)

        dec_input = dec_batch[:, :-1] # all tokens except the last
        targets = dec_batch[:, 1:] # all tokens except the first

        optimizer.zero_grad()
        
        outputs = sequence_to_sequence_model(enc_batch, dec_input, teacher_forcing_ratio=teacher_forcing_ratio)
        
        outputs = outputs.reshape(-1, outputs.size(-1)) # flatten the outputs for loss calculation
        targets = targets.reshape(-1) # flatten the targets for loss calculation

        loss = loss_function(outputs, targets)
        loss.backward()
        clip_grad_norm_(sequence_to_sequence_model.parameters(), 1.0)
        optimizer.step()

        epoch_loss += loss.item()

    avg_loss = epoch_loss / len(loader)
    print(f"Epoch {epoch+1}/{num_epochs} - Loss: {avg_loss:.4f}")

### 🎓 Training Loop Explained

This is where the model learns! Let's break down what happens in each batch:

#### **Data Preparation:**
```python
dec_input = dec_batch[:, :-1]   # All tokens except last:  [<SOS>, word1, word2, ..., wordN]
targets = dec_batch[:, 1:]      # All tokens except first: [word1, word2, ..., wordN, <EOS>]
```

**Why this split?**
- Decoder gets: `<SOS> cat on` → Should predict: `cat on mat`
- We shift by 1 position to create input-target pairs
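
The shift is easy to see on a plain Python list. The token strings here are placeholders for this sketch; the notebook works with integer IDs, but the slicing is the same.

```python
sos_token, eos_token = "<SOS>", "<EOS>"  # placeholder strings for this sketch

summary = [sos_token, "cat", "on", "mat", eos_token]

dec_input = summary[:-1]  # everything except the last token
targets = summary[1:]     # everything except the first token

for step, (inp, tgt) in enumerate(zip(dec_input, targets)):
    print(f"step {step}: input={inp!r:10} target={tgt!r}")
```

At every step the target is simply the token that comes right after the input, so the two lists always have the same length.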

#### **Forward Pass:**
```python
outputs = model(enc_batch, dec_input, teacher_forcing_ratio=0.6)
# Shape: (batch=4, seq_len=49, vocab_size=5000)
```

#### **Reshape for Loss Calculation:**
```python
outputs = outputs.reshape(-1, outputs.size(-1))  # (4*49, 5000) = (196, 5000)
targets = targets.reshape(-1)                     # (4*49,) = (196,)
```
**Why?** CrossEntropyLoss expects:
- Predictions: (num_predictions, num_classes)
- Targets: (num_predictions,)

We treat each word prediction independently!

#### **Backward Pass:**
```python
loss.backward()                                  # Calculate gradients
clip_grad_norm_(model.parameters(), 1.0)        # Prevent exploding gradients
optimizer.step()                                 # Update weights
```
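
Gradient clipping by norm is simple enough to sketch by hand. The function below works on a flat list of numbers and follows the same idea as `clip_grad_norm_`: measure the combined L2 norm of all gradients and scale them down if it exceeds the limit. The name `clip_by_global_norm` is made up for this example.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The small constant avoids division by zero when all gradients are 0.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(f"norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

Gradients that are already small pass through unchanged, so clipping only steps in when an update would otherwise be very large.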

#### **Visual Example of One Training Step:**

```
Batch Input:
  Article: [4052, 187, 700, ...]  (batch=4, seq=512)
  Summary: [22, 506, 336, 168, 603, ...]  (batch=4, seq=50)

Split Summary:
  dec_input: [22, 506, 336, 168, ...]  (first 49 tokens)
  targets:   [506, 336, 168, 603, ...] (last 49 tokens)

Model Forward:
  Time 0: Input=22(<SOS>)  → Predict: 506  vs Target: 506 ✓
  Time 1: Input=506        → Predict: 330  vs Target: 336 ✗ (loss!)
  Time 2: Input=336        → Predict: 170  vs Target: 168 ✗ (loss!)
  ...

Loss: Average of -log p(target) over all non-padding positions
Backward: Adjust weights to reduce loss
```

**Over 20 epochs**, the average loss printed after each epoch should fall as the model gets better at predicting the reference summaries.

### 📉 Loss Function & Optimizer

**CrossEntropyLoss:** Compares predicted word scores (logits) with the actual words
- `ignore_index=pad_idx`: Positions whose target is the padding token are left out of the loss, so padding does not affect training

**Adam Optimizer:** Adjusts model weights to minimize loss
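
The effect of `ignore_index` can be checked directly. In this sketch the padding ID is assumed to be 0 and the logits are random; the loss over four positions (two of them padding) should match the loss over just the two real positions.

```python
import torch
import torch.nn as nn

pad_idx = 0  # assumed padding ID for this sketch
loss_function = nn.CrossEntropyLoss(ignore_index=pad_idx)

logits = torch.randn(4, 6)  # 4 positions, 6-word toy vocabulary
targets = torch.tensor([3, 1, pad_idx, pad_idx])

loss_with_padding = loss_function(logits, targets)
loss_real_only = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(loss_with_padding.item(), loss_real_only.item())
```

Both numbers agree because the padded positions contribute nothing and are also left out of the average.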

In [None]:
def summarize_text(model, article, article_vocab, summary_vocab, device, max_summary_len=50):
    model.eval()
    with torch.no_grad():
        enc_input = torch.tensor(article_vocab.encode(article)).unsqueeze(0).to(device)
        enc_hidden, enc_cell = model.encoder_model(enc_input)

        current_decoder_input_token = torch.tensor([summary_vocab.stoi[SOS]]).to(device)
        summary_tokens = []

        for _ in range(max_summary_len):
            predicted_token, enc_hidden, enc_cell = model.decoder_model(current_decoder_input_token, enc_hidden, enc_cell)
            predicted_token_id = predicted_token.argmax(dim=1).item()
            if predicted_token_id == summary_vocab.stoi[EOS]:
                break
            summary_tokens.append(predicted_token_id)
            current_decoder_input_token = torch.tensor([predicted_token_id]).to(device)

        summary_words = [summary_vocab.itos[token_id] for token_id in summary_tokens]
        return ' '.join(summary_words)

def summarize_text_beam_search(model, article, article_vocab, summary_vocab, device, max_summary_len=50, beam_width=3):
    """Beam search with repetition penalty and length normalization"""
    model.eval()
    with torch.no_grad():
        enc_input = torch.tensor(article_vocab.encode(article)).unsqueeze(0).to(device)
        enc_hidden, enc_cell = model.encoder_model(enc_input)
        
        # Initialize beams: (log_prob, tokens, hidden, cell, is_finished)
        beams = [(0.0, [summary_vocab.stoi[SOS]], enc_hidden, enc_cell, False)]
        finished_beams = []
        
        for step in range(max_summary_len):
            new_beams = []
            
            for log_prob, tokens, h, c, is_finished in beams:
                if is_finished:
                    finished_beams.append((log_prob, tokens))
                    continue
                    
                current_token = torch.tensor([tokens[-1]]).to(device)
                logits, h_new, c_new = model.decoder_model(current_token, h, c)
                log_probs = torch.log_softmax(logits, dim=1)[0]
                
                # Repetition penalty: penalize tokens that appear frequently in the sequence
                for token_id in set(tokens[1:]):  # Skip SOS token
                    count = tokens[1:].count(token_id)
                    if count > 2:
                        log_probs[token_id] -= 1.0 * (count - 1)  # Penalty increases with frequency
                
                # Block PAD token
                log_probs[summary_vocab.stoi[PAD]] = float('-inf')
                
                # Get top-k candidates
                top_k = torch.topk(log_probs, min(beam_width, log_probs.numel()), largest=True)
                
                for candidate_log_prob, candidate_id in zip(top_k.values, top_k.indices):
                    if torch.isinf(candidate_log_prob):
                        continue
                    new_log_prob = log_prob + candidate_log_prob.item()
                    new_tokens = tokens + [candidate_id.item()]
                    is_eos = (candidate_id.item() == summary_vocab.stoi[EOS])
                    
                    new_beams.append((new_log_prob, new_tokens, h_new, c_new, is_eos))
            
            # Sort by normalized log probability (length penalty)
            new_beams.sort(reverse=True, key=lambda x: x[0] / max(len(x[1]), 1) ** 0.7)
            beams = new_beams[:beam_width]
            
            # Separate finished and unfinished beams
            finished_beams.extend([b for b in beams if b[4]])
            beams = [b for b in beams if not b[4]]
            
            if not beams:
                break
        
        # Combine and sort all beams
        all_beams = finished_beams + beams
        all_beams.sort(reverse=True, key=lambda x: x[0] / max(len(x[1]), 1) ** 0.7)
        
        # Return best beam
        if all_beams:
            best_tokens = all_beams[0][1][1:]  # Remove SOS token
            summary_words = [summary_vocab.itos[token_id] for token_id in best_tokens if token_id != summary_vocab.stoi[EOS]]
            return ' '.join(summary_words)
        return ""

### 🎯 Inference: Generating Summaries

Now that the model is trained, let's use it to generate summaries!

#### **Two Strategies:**

---

### 1️⃣ **Greedy Decoding** (`summarize_text`)

**The Simple Approach:** Always pick the most likely word.

#### **Step-by-Step Flow:**

```
Input Article: "Scientists discovered a new planet orbiting a distant star..."
    ↓
Encode: encoder(article) → h, c (context vectors)
    ↓
Start with: <SOS> token

Loop (max 50 times):
  ├─ Time 0: Input=<SOS>        → Decoder → Probabilities → Pick "scientists" (highest)
  ├─ Time 1: Input="scientists" → Decoder → Probabilities → Pick "discover" (highest)
  ├─ Time 2: Input="discover"   → Decoder → Probabilities → Pick "new" (highest)
  ├─ Time 3: Input="new"        → Decoder → Probabilities → Pick "planet" (highest)
  └─ Time 4: Input="planet"     → Decoder → Probabilities → Pick <EOS> (stop!)

Output: "scientists discover new planet"
```

**Key Difference from Training:**
- **Training**: Uses teacher forcing (the real target word is fed in with probability `teacher_forcing_ratio`)
- **Inference**: Uses its own predictions (no ground truth available!)
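
The greedy loop can be sketched without a trained model by swapping the decoder for a lookup table. The table below is entirely made up, and it conditions only on the previous word (a real decoder also carries `h` and `c`), so treat it as a simplification for illustration.

```python
# Hypothetical next-word probabilities, keyed by the previous word.
NEXT_WORD_PROBS = {
    "<SOS>":      {"scientists": 0.7, "new": 0.3},
    "scientists": {"discover": 0.6, "find": 0.4},
    "discover":   {"new": 0.8, "<EOS>": 0.2},
    "new":        {"planet": 0.9, "<EOS>": 0.1},
    "planet":     {"<EOS>": 1.0},
    "find":       {"<EOS>": 1.0},
}

def greedy_decode(max_len=50):
    current, words = "<SOS>", []
    for _ in range(max_len):
        probs = NEXT_WORD_PROBS[current]
        current = max(probs, key=probs.get)  # always take the most likely word
        if current == "<EOS>":
            break
        words.append(current)
    return " ".join(words)

print(greedy_decode())
```

Just like `summarize_text`, the loop stops either when `<EOS>` is picked or when `max_len` words have been generated.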

---

### 2️⃣ **Beam Search** (`summarize_text_beam_search`)

**The Smart Approach:** Explore multiple possibilities simultaneously.

#### **How Beam Search Works:**

Instead of keeping only the best word, we keep the **top-K (beam_width=3) sequences**.

```
Start: <SOS>
    ↓
Time 0: Generate 3 best options
  Beam 1: <SOS> "scientists"  (prob=0.5)
  Beam 2: <SOS> "new"         (prob=0.3)
  Beam 3: <SOS> "researchers" (prob=0.1)
    ↓
Time 1: From EACH beam, generate 3 options (9 total), keep best 3
  Beam 1: <SOS> "scientists" "discover" (prob=0.5*0.6=0.30)
  Beam 2: <SOS> "new" "planet"          (prob=0.3*0.8=0.24)
  Beam 3: <SOS> "scientists" "find"     (prob=0.5*0.3=0.15)
    ↓
Time 2: Continue... keep best 3 paths
    ↓
...
    ↓
Final: Pick the sequence with the highest length-normalized score
```
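
Here is a stripped-down beam search over a made-up probability table, to show a case where greedy decoding misses the best sequence. It is a sketch under simplifying assumptions: the table conditions only on the previous word, and there is no repetition penalty or length normalization, unlike `summarize_text_beam_search`.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word only.
NEXT_WORD_PROBS = {
    "<SOS>":      {"scientists": 0.6, "new": 0.4},
    "scientists": {"discover": 0.4, "find": 0.3, "<EOS>": 0.3},
    "new":        {"planet": 0.9, "<EOS>": 0.1},
    "discover":   {"<EOS>": 1.0},
    "find":       {"<EOS>": 1.0},
    "planet":     {"<EOS>": 1.0},
}

def beam_search(beam_width, max_len=10):
    beams = [(0.0, ["<SOS>"])]  # (summed log-probability, tokens so far)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for word, prob in NEXT_WORD_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [word]))
        # Keep only the best `beam_width` candidates.
        candidates.sort(key=lambda beam: beam[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == "<EOS>":
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    best_score, best_tokens = max(finished + beams, key=lambda beam: beam[0])
    words = [t for t in best_tokens if t not in ("<SOS>", "<EOS>")]
    return words, math.exp(best_score)

greedy_words, greedy_prob = beam_search(beam_width=1)  # width 1 = greedy
beam_words, beam_prob = beam_search(beam_width=2)
print(greedy_words, round(greedy_prob, 2))
print(beam_words, round(beam_prob, 2))
```

With a width of 1 the search commits to "scientists" because it looks best at the first step, while a width of 2 keeps "new" alive long enough to find the more probable sequence overall.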

**Enhancements in the Code:**

1. **Repetition Penalty:**
   ```python
   if count > 2:
       log_probs[token_id] -= 1.0 * (count - 1)
   ```
   Penalizes any token that already appears more than twice in the sequence (avoids "the the the...")

2. **Length Normalization:**
   ```python
   new_log_prob / max(len(new_tokens), 1) ** 0.7
   ```
   Offsets the bias toward shorter sequences: every extra token adds a negative log-probability, so raw scores favor short outputs

3. **PAD Token Blocking:**
   ```python
   log_probs[summary_vocab.stoi[PAD]] = float('-inf')
   ```
   Never predict padding tokens
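
The first two enhancements can be pulled out into small standalone functions to see what they do to the numbers. These helpers are hypothetical (the notebook inlines the logic) and work on plain Python lists instead of tensors; the constants mirror the ones in the code above.

```python
def apply_repetition_penalty(log_probs, generated_tokens, penalty=1.0, max_repeats=2):
    """Lower the score of any token already generated more than `max_repeats` times."""
    adjusted = list(log_probs)
    for token_id in set(generated_tokens):
        count = generated_tokens.count(token_id)
        if count > max_repeats:
            adjusted[token_id] -= penalty * (count - 1)
    return adjusted

def length_normalized_score(log_prob, num_tokens, alpha=0.7):
    """Divide the summed log-probability by length ** alpha."""
    return log_prob / max(num_tokens, 1) ** alpha

# Token 1 has already been generated three times, token 2 only once.
scores = apply_repetition_penalty([-1.0, -0.5, -2.0], generated_tokens=[1, 1, 1, 2])
print(scores)

# A longer sequence with a lower raw log-probability can still rank first.
short_score = length_normalized_score(-4.0, num_tokens=2)
long_score = length_normalized_score(-6.0, num_tokens=6)
print(short_score, long_score)
```

Only token 1 is penalized because it crossed the repeat threshold, and the 6-token sequence ends up with the better normalized score even though its raw log-probability is lower.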

**Greedy vs Beam Search:**
- **Greedy**: Fast, but can miss better sequences
- **Beam Search**: Slower, but explores alternatives and often produces better summaries

In [None]:
print("Greedy Decoding:")
print(summarize_text(sequence_to_sequence_model, df.iloc[0]["news"], article_vocab, summary_vocab, device))

print("\nBeam Search Decoding (beam_width=3):")
print(summarize_text_beam_search(sequence_to_sequence_model, df.iloc[0]["news"], article_vocab, summary_vocab, device, beam_width=3))