# 📰 Reddit TIFU Summarization with BART  
**Fine-tuning a seq2seq model to generate concise summaries of "Today I F***ed Up" posts**  

This notebook walks through loading the dataset, preprocessing, training a BART model, evaluating with ROUGE, and generating example summaries.


## 📋 Table of Contents
1. [Setup & Imports](#setup)  
2. [Dataset Loading & Exploration](#exploration)  
3. [Preprocessing](#preprocessing)  
4. [Model & Training](#training)  
5. [Evaluation](#evaluation)  
6. [Inference Examples](#inference)  
7. [Next Steps & Resources](#next-steps)


## <a name="setup"></a>1. Setup & Imports  
Install packages and import libraries.


In [None]:
!pip install datasets transformers evaluate rouge_score --quiet

In [None]:
!pip install --upgrade transformers --quiet

## <a name="exploration"></a>2. Dataset Loading & Exploration  
- Load the Reddit TIFU dataset  
- Peek at a few examples  
- Basic statistics (number of posts, average length)


In [None]:
from datasets import load_dataset

dataset = load_dataset("reddit_tifu", "long")
dataset = dataset['train'].train_test_split(test_size=0.1, seed=42)
print(dataset)


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

df = dataset['train'].to_pandas()
# Whitespace word counts: a rough proxy for length, not BART token counts.
df['doc_len'] = df['documents'].apply(lambda x: len(x.split()))
df['tldr_len'] = df['tldr'].apply(lambda x: len(x.split()))

print("Document Stats:\n", df['doc_len'].describe())
print("\nSummary Stats:\n", df['tldr_len'].describe())

plt.hist(df['doc_len'], bins=50)
plt.title('Document Length Distribution')
plt.xlabel('Words per document')
plt.ylabel('Number of posts')
plt.show()

plt.hist(df['tldr_len'], bins=50, color='orange')
plt.title('Summary Length Distribution')
plt.xlabel('Words per summary')
plt.ylabel('Number of posts')
plt.show()
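
The length statistics above inform the choice of maximum input and target lengths in the next section. As a minimal sketch, assuming whitespace word counts as a rough stand-in for token counts (a subword tokenizer such as BART's usually produces more tokens than words), the share of examples that fits a given budget can be estimated as follows. The helper `share_within_budget` and the sample lengths are hypothetical and not part of the notebook.

```python
def share_within_budget(lengths, budget):
    """Return the fraction of examples whose length is <= budget."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n <= budget) / len(lengths)


# Hypothetical word counts; in the notebook these would come from df['doc_len'].
sample_doc_lengths = [120, 340, 410, 505, 512, 780, 1024, 95]

coverage = share_within_budget(sample_doc_lengths, budget=512)
print(f"{coverage:.1%} of documents fit without truncation")
```

If a large share of posts exceeds the budget, truncation drops the end of many documents, which may matter if the key detail of a post tends to come late.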

## <a name="preprocessing"></a>3. Preprocessing  
- Define tokenizer and max lengths  
- Tokenize the train and held-out (`test`) splits  
- Leave batching and padding to the data collator used during training


In [None]:
from transformers import AutoTokenizer

model_checkpoint = "facebook/bart-base"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 512
max_target_length = 64

def preprocess_function(examples):
    """Tokenize posts as inputs and TL;DRs as labels, truncating both."""
    model_inputs = tokenizer(
        examples["documents"], max_length=max_input_length, truncation=True
    )

    # `text_target` replaces the deprecated `as_target_tokenizer()` context manager.
    labels = tokenizer(
        text_target=examples["tldr"], max_length=max_target_length, truncation=True
    )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

In [None]:
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


In [None]:
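import evaluate
import numpy as np

rouge = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Some models return a tuple; the generated token IDs come first.
    if isinstance(preds, tuple):
        preds = preds[0]
    # Padded positions hold -100, which the tokenizer cannot decode.
    preds = np.where(preds != -100, preds, tokenizer.pad_token_id)
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    # Report ROUGE as percentages rounded to two decimals.
    return {k: round(v * 100, 2) for k, v in result.items()}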
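
For intuition about what the ROUGE scores report, here is a simplified sketch of ROUGE-1 F1 as unigram overlap between a prediction and a reference. It is not the `rouge_score` implementation: it assumes lowercase whitespace tokenization and applies no stemming, so its numbers will differ from those of `evaluate`. The function name `rouge1_f1` and the example strings are hypothetical.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Unigram-overlap F1 between two strings (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1(
    "spilled coffee on my hoodie",
    "i spilled coffee on my white hoodie",
)
print(round(score, 4))
```

Here every predicted word appears in the reference (precision 1.0), but two reference words are missing from the prediction (recall 5/7), so the F1 lands between the two.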


## <a name="training"></a>4. Model & Training  
- Load pretrained BART model  
- Configure training hyperparameters  
- Kick off training loop  


In [None]:
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

training_args = Seq2SeqTrainingArguments(
    output_dir="./bart-finetuned-reddit-tifu",
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    learning_rate=2e-5,
    logging_dir='./logs',
    save_total_limit=2,
    predict_with_generate=True,
)

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,  # recent transformers releases deprecate this in favor of `processing_class`
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
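
The summaries in a batch have different lengths, so the collator pads the `labels` to a common length. By default `DataCollatorForSeq2Seq` pads labels with `-100`, the value that PyTorch's cross-entropy loss ignores by default, so padded positions do not contribute to the loss. The pure-Python sketch below only illustrates that idea and assumes right-padding; `pad_labels` and the token IDs are hypothetical, and this is not the collator's actual code.

```python
LABEL_PAD_ID = -100  # positions holding this value are skipped by the loss


def pad_labels(label_batch, pad_id=LABEL_PAD_ID):
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in label_batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in label_batch]


# Hypothetical token IDs for two summaries of different lengths.
batch = [[0, 713, 16, 2], [0, 100, 2]]
padded = pad_labels(batch)
print(padded)
```

Because `-100` is not a real token ID, such positions need to be mapped back to a valid ID (for example the pad token) before label sequences are decoded into text.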




In [None]:
trainer.train()

## <a name="inference"></a>6. Inference Examples  
Generate a few sample summaries to see the model in action.


In [None]:
def summarize(text):
    # Reuse the length limits from preprocessing so inference matches training.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_input_length).to(model.device)
    summary_ids = model.generate(**inputs, max_length=max_target_length, num_beams=4, length_penalty=2.0)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Example
print(summarize("Today I woke up late and got late to work and then I slipped on the footpath and spilled my coffee all over me and ruined my white hoodie soo my day was really shitty."))
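
The `length_penalty=2.0` argument in the `generate` call affects how finished beams are ranked. In beam search, a hypothesis score is its summed log-probability divided by its length raised to `length_penalty`. Since log-probabilities are negative, values above 0 favor longer outputs. The sketch below illustrates that formula with made-up numbers; `beam_score` and the two candidate hypotheses are hypothetical and this is not the library's code.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalized score of a finished beam hypothesis."""
    return sum_logprobs / (length ** length_penalty)


# Two made-up hypotheses: a short one and a longer, less probable one.
short = {"sum_logprobs": -4.0, "length": 5}
long = {"sum_logprobs": -6.0, "length": 10}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short["sum_logprobs"], short["length"], penalty)
    l = beam_score(long["sum_logprobs"], long["length"], penalty)
    winner = "short" if s > l else "long"
    print(f"length_penalty={penalty}: short={s:.3f}, long={l:.3f}, preferred={winner}")
```

With no penalty the shorter hypothesis wins on raw log-probability; once the score is divided by length, the longer one ranks higher.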


## <a name="next-steps"></a>7. Next Steps & Resources  
- Push model to Hugging Face Hub  
- Experiment with different max_length/min_length  
- Try beam search vs. sampling  
- References & further reading
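
As a starting point for the decoding experiments listed above, the toy sketch below contrasts a deterministic choice with a sampled one for a single decoding step. Beam search extends the deterministic idea by tracking several candidate sequences at once, while `generate` switches to sampling with `do_sample=True`. The vocabulary, logits, and helper names here are hypothetical and no model is involved.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(tokens, logits):
    """Always choose the highest-scoring token."""
    best = max(range(len(logits)), key=logits.__getitem__)
    return tokens[best]


def pick_sampled(tokens, logits, temperature, rng):
    """Draw one token at random according to its probability."""
    weights = softmax(logits, temperature)
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token candidates and scores.
tokens = ["coffee", "tea", "hoodie"]
logits = [2.0, 1.0, 0.5]

rng = random.Random(0)  # fixed seed so the run is repeatable
print("greedy: ", pick_greedy(tokens, logits))
print("sampled:", [pick_sampled(tokens, logits, 1.0, rng) for _ in range(5)])
```

Deterministic decoding returns the same output for the same input, whereas sampling trades some likelihood for variety across runs.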
