# 📝 Multilingual Text Summarization (French + English)

## 📘 Context

Text summarization is a core NLP task that condenses long documents into their key points. Transformer-based encoder-decoder models such as BART and T5 can now generate high-quality summaries in several languages.

This notebook demonstrates how to perform automatic summarization in:
- 🇬🇧 **English**, using `facebook/bart-large-cnn`
- 🇫🇷 **French**, using `plguillou/t5-base-fr-sum-cnndm`

## 🎯 Objectives

- Load and compare language-specific summarization models
- Generate and display summaries for both English and French input texts
- Test edge cases and observe model behavior

## Packages

In [None]:
!pip install transformers sentencepiece

In [1]:
# import loguru

from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
import textwrap  # Text wrapping and filling

## 🧠 Model Descriptions

### 🇬🇧 `facebook/bart-large-cnn` — English Text Summarization

**BART (Bidirectional and Auto-Regressive Transformers)** is a sequence-to-sequence model developed by Facebook AI that pairs a **bidirectional encoder** (as in BERT) with an **auto-regressive decoder** (as in GPT). The `bart-large-cnn` checkpoint is **fine-tuned** on the **CNN/DailyMail dataset**, which consists of news articles paired with reference summaries.

- **Use Case**: Best suited to **journalistic** text, its fine-tuning domain; it can also be tried on **informal** or **structured opinion texts**, where quality may vary.
- **Type of Summary**: **Abstractive** (paraphrasing, not just extraction).

**Architecture**:
- 12 layers of encoder + 12 layers of decoder
- Bidirectional attention for encoding, causal attention for decoding
- Around **406M parameters**

---

### 🇫🇷 `plguillou/t5-base-fr-sum-cnndm` — French Text Summarization

Based on **T5 (Text-to-Text Transfer Transformer)**, developed by Google. This model is **fine-tuned** for **French text summarization** on a dataset inspired by CNN/DailyMail.

- **Use Case**: Best for **formal** or **structured** texts: **news articles**, **reports**, or **official documents**.
- **Type of Summary**: **Abstractive** (rephrasing the input text in its own words).

**Architecture**:
- **T5-base**: Around **220M parameters**
- Multilingual, but fine-tuned specifically for **French**.

---

### 🌍 `facebook/mbart-large-50-one-to-many-mmt` — Multilingual Text Summarization

**mBART (Multilingual BART)** is a variant of the BART model pre-trained on **multiple languages**. This particular checkpoint is fine-tuned for **translation** from English into the other supported languages, so it would need further fine-tuning before being used for **summarization**.

- **Use Case**: Suitable for summarizing text in multiple languages, making it a versatile tool for multilingual applications.
- **Type of Summary**: **Abstractive**.

**Architecture**:
- 12 layers of encoder + 12 layers of decoder
- Multilingual model trained on 50 languages
- Around **610M parameters**

---

### 🔄 `google/t5-base-xxl-tlm` — T5 for Multilingual Tasks

**T5** (Text-to-Text Transfer Transformer) is a model that frames all NLP tasks as a text-to-text problem, making it highly adaptable. It has been fine-tuned for multiple tasks including **summarization**.

- **Use Case**: General-purpose text-to-text tasks such as summarization, translation, and question answering. The original T5 checkpoints were pre-trained mostly on English text; mT5 is the multilingual variant.
- **Type of Summary**: **Abstractive** (like all T5-based models).

**Architecture**:
- **T5-base**: Around **220M parameters**
- **T5-XXL**: Much larger, up to **11B parameters**
- Trained on a multi-task mixture that includes summarization and translation

---

### 🚀 `google/flan-t5-xl` — Fine-tuned T5 for Better Generalization

**FLAN-T5** is a version of T5 that is **fine-tuned on a variety of tasks** to improve generalization. It aims to perform better on a wide range of NLP tasks, including summarization, when compared to regular T5.

- **Use Case**: Ideal for **high-quality summarization** tasks in multiple languages, with improved robustness.
- **Type of Summary**: **Abstractive**.

**Architecture**:
- **FLAN-T5-XL**: Around **3B parameters** (the 11B variant is FLAN-T5-XXL).
- Fine-tuned on a wide variety of tasks, improving the model's ability to generalize across domains.

---

### 📊 Quick Comparison

| Model                                     | Language     | Architecture     | Fine-Tuning                          | Type of Summary |
|-------------------------------------------|--------------|------------------|--------------------------------------|-----------------|
| `facebook/bart-large-cnn`                 | English      | BART             | CNN/DailyMail                        | Abstractive     |
| `plguillou/t5-base-fr-sum-cnndm`          | French       | T5 (Base)        | CNN/DailyMail FR                     | Abstractive     |
| `facebook/mbart-large-50-one-to-many-mmt` | Multilingual | mBART            | Multilingual (50 languages)          | Abstractive     |
| `google/t5-base-xxl-tlm`                  | Multilingual | T5 (Base or XXL) | Multilingual                         | Abstractive     |
| `google/flan-t5-xl`                       | Multilingual | T5 (Fine-tuned)  | Fine-tuned for better generalization | Abstractive     |
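
The table can be mirrored in code as a small lookup from language code to checkpoint. This is a minimal sketch: `MODEL_BY_LANG` and `model_for` are illustrative names that do not appear elsewhere in this notebook, and only the two checkpoints the notebook actually loads are registered.

```python
# Hypothetical registry: language code -> checkpoint loaded in this notebook.
MODEL_BY_LANG = {
    "en": "facebook/bart-large-cnn",
    "fr": "plguillou/t5-base-fr-sum-cnndm",
}


def model_for(lang: str) -> str:
    """Return the checkpoint name for a language code, or raise if unsupported."""
    try:
        return MODEL_BY_LANG[lang.lower()]
    except KeyError:
        raise ValueError(f"No summarization model registered for language '{lang}'") from None


print(model_for("fr"))
```

Keeping the mapping in one place means a new language only needs one extra entry, not another `if` branch.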


In [2]:
# English summarization model (BART)
summarizer_en = pipeline("summarization", model="facebook/bart-large-cnn")

# French summarization model (T5 fine-tuned for summarization)
fr_model_name = "plguillou/t5-base-fr-sum-cnndm"
tokenizer_fr = AutoTokenizer.from_pretrained(fr_model_name)
model_fr = AutoModelForSeq2SeqLM.from_pretrained(fr_model_name)
summarizer_fr = pipeline("summarization", model=model_fr, tokenizer=tokenizer_fr)

Device set to use cpu
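
Loading both checkpoints up front is slow on CPU. One alternative is to build each pipeline only when its language is first requested. The sketch below is an assumption-laden illustration, not the notebook's method: `get_summarizer`, `LAZY_MODEL_NAMES`, and the `factory` parameter are hypothetical names, and the default factory imports `transformers` only when it is called.

```python
from typing import Callable, Dict

# Hypothetical mapping; the checkpoint names are the ones used in this notebook.
LAZY_MODEL_NAMES = {
    "en": "facebook/bart-large-cnn",
    "fr": "plguillou/t5-base-fr-sum-cnndm",
}

_loaded: Dict[str, Callable] = {}


def _default_factory(model_name: str) -> Callable:
    # Imported here so that nothing is downloaded or imported until first use.
    from transformers import pipeline
    return pipeline("summarization", model=model_name)


def get_summarizer(lang: str, factory: Callable[[str], Callable] = _default_factory) -> Callable:
    """Return a cached summarizer for `lang`, building it on first request.

    Note: the cache is keyed by language only, so the factory passed on the
    first call for a language is the one that wins.
    """
    if lang not in LAZY_MODEL_NAMES:
        raise ValueError(f"Unsupported language: {lang}")
    if lang not in _loaded:
        _loaded[lang] = factory(LAZY_MODEL_NAMES[lang])
    return _loaded[lang]
```

Calling `get_summarizer("fr")` twice builds the French pipeline once and reuses it afterwards.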


## 🧪 Application

In [3]:
text_en = """
Artificial Intelligence is revolutionizing many industries such as healthcare, finance, and transportation.
Machine learning techniques now enable systems to analyze vast amounts of data and make decisions with minimal human input.
However, these advances raise concerns over data privacy, algorithmic transparency, and job displacement.
"""

text_fr = """
L’intelligence artificielle transforme profondément des secteurs comme la santé, les transports et l’éducation.
Grâce à l’apprentissage automatique, les systèmes peuvent analyser de grandes quantités de données et prendre des décisions complexes.
Cependant, cela soulève des enjeux éthiques majeurs sur la transparence, l’emploi et la confidentialité.
"""

In [5]:
# help(summarizer_en)

In [6]:
print("🔹 Original English Text:\n")
print(textwrap.fill(text_en, width=100))

summary_en = summarizer_en(text_en, max_length=100, min_length=30, do_sample=False)

print("\n✅ English Summary:\n")
print(textwrap.fill(summary_en[0]["summary_text"], width=100))

Your max_length is set to 100, but your input_length is only 63. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=31)


🔹 Original English Text:

 Artificial Intelligence is revolutionizing many industries such as healthcare, finance, and
transportation.  Machine learning techniques now enable systems to analyze vast amounts of data and
make decisions with minimal human input. However, these advances raise concerns over data privacy,
algorithmic transparency, and job displacement.

✅ English Summary:

Artificial Intelligence is revolutionizing many industries such as healthcare, finance, and
transportation. However, these advances raise concerns over data privacy, algorithmic transparency,
and job displacement.
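
The warning above appears because `max_length=100` exceeds the 63-token input. One way to avoid it is to derive both bounds from the input's token count. The sketch below is a heuristic and not part of the notebook: `suggest_lengths`, `max_cap`, and `min_floor` are hypothetical names, and the halving rule simply mirrors the value the warning itself proposes (31 for a 63-token input).

```python
from typing import Tuple


def suggest_lengths(input_tokens: int, max_cap: int = 100, min_floor: int = 30) -> Tuple[int, int]:
    """Return (min_length, max_length) scaled to the number of input tokens."""
    if input_tokens < 1:
        raise ValueError("input_tokens must be at least 1")
    # Aim for a summary of at most half the input, never above the cap.
    max_length = max(1, min(max_cap, input_tokens // 2))
    # Keep min_length well below max_length so generation has room to stop early.
    min_length = min(min_floor, max(1, max_length // 2))
    return min_length, max_length


print(suggest_lengths(63))    # (15, 31)
print(suggest_lengths(1000))  # (30, 100)
```

With the notebook's objects, the token count could be obtained as `len(summarizer_en.tokenizer(text_en)["input_ids"])`, assuming the pipeline exposes its tokenizer as `.tokenizer`.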


In [7]:
print("🔹 Texte original en français:\n")
print(textwrap.fill(text_fr, width=100))

summary_fr = summarizer_fr(text_fr, max_length=100, min_length=30, do_sample=False)

print("\n✅ Résumé en français:\n")
print(textwrap.fill(summary_fr[0]["summary_text"], width=100))

🔹 Texte original en français:

 L’intelligence artificielle transforme profondément des secteurs comme la santé, les transports et
l’éducation.  Grâce à l’apprentissage automatique, les systèmes peuvent analyser de grandes
quantités de données et prendre des décisions complexes.  Cependant, cela soulève des enjeux
éthiques majeurs sur la transparence, l’emploi et la confidentialité.

✅ Résumé en français:

L'intelligence artificielle transforme profondément des secteurs tels que la santé, les transports
et l'éducation. Cependant, cela soulève des enjeux éthiques majeurs sur la transparence, l’emploi
et la confidentialité. L'apprentissage automatique permet aux systèmes de prendre des décisions
complexes.


In [15]:
print("🧪 Edge Case (Empty input):")

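# Empty or near-empty input should be guarded with an if/else before calling the summarizer;
# passed straight through, the model returns unrelated text (see the output below).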

Your max_length is set to 50, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)


🧪 Edge Case (Empty input):
CNN.com will feature iReporter photos in a weekly Travel Snapshots gallery. Please submit your best shots of New York for next week. Visit CNN.com/Travel next Wednesday for a new gallery of snapshots.
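
The output above shows the model inventing CNN-style text for an essentially empty input. A minimal guard might look like the sketch below. It assumes a `summarizer` callable with the same call shape as the pipelines in this notebook; `safe_summarize`, `fake_summarizer`, and the `min_words` threshold are illustrative choices, not part of the original code.

```python
from typing import Callable, Optional


def safe_summarize(text: Optional[str], summarizer: Callable, min_words: int = 5, **gen_kwargs) -> Optional[str]:
    """Return a summary, or None when the input is empty or too short to summarize."""
    if text is None or not text.strip():
        return None
    if len(text.split()) < min_words:
        return None
    result = summarizer(text, **gen_kwargs)
    return result[0]["summary_text"]


# Stand-in summarizer (keeps the first sentence) so the sketch runs without a model.
def fake_summarizer(text, **kwargs):
    return [{"summary_text": text.split(".")[0].strip() + "."}]


print(safe_summarize("", fake_summarizer))       # None
print(safe_summarize("   \n", fake_summarizer))  # None
```

Returning `None` leaves it to the caller to decide how to report the skipped input.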


In [17]:
texts_fr = [
    "Le réchauffement climatique provoque des événements météorologiques extrêmes dans le monde entier.",
    "La France accueille chaque année des millions de touristes attirés par sa culture et sa gastronomie.",
    "Les véhicules autonomes utilisent des capteurs et de l'IA pour se déplacer sans conducteur humain."
]

print("🔁 Résumés français (batch):\n")
for t in texts_fr:
    s = summarizer_fr(t, max_length=60, min_length=20, do_sample=False)
    print(f"📌 Texte: {t}\n➡️ Résumé: {s[0]['summary_text']}\n")

Your max_length is set to 60, but your input_length is only 24. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


🔁 Résumés français (batch):



Your max_length is set to 60, but your input_length is only 32. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


📌 Texte: Le réchauffement climatique provoque des événements météorologiques extrêmes dans le monde entier.
➡️ Résumé: Le réchauffement climatique provoque des événements météorologiques extrêmes dans le monde entier.



Your max_length is set to 60, but your input_length is only 33. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=16)


📌 Texte: La France accueille chaque année des millions de touristes attirés par sa culture et sa gastronomie.
➡️ Résumé: La France accueille chaque année des millions de touristes attirés par la culture et la gastronomie. La culture française est l'une des principales attractions touristiques françaises.

📌 Texte: Les véhicules autonomes utilisent des capteurs et de l'IA pour se déplacer sans conducteur humain.
➡️ Résumé: Les véhicules autonomes utilisent des capteurs et de l'IA pour se déplacer sans conducteur humain.
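
Two of the three summaries above are verbatim copies of their one-sentence inputs, and the loop calls the model once per text. The sketch below passes the whole list in one call and flags summaries that merely echo the input. It assumes the summarizer accepts a list and returns one `{"summary_text": ...}` dict per input, which matches the `transformers` summarization pipeline as far as we know; `summarize_batch` and `echo_summarizer` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple


def summarize_batch(texts: Sequence[str], summarizer: Callable, **gen_kwargs) -> List[Tuple[str, str, bool]]:
    """Summarize all texts in one call; return (text, summary, is_copy) rows."""
    results = summarizer(list(texts), **gen_kwargs)
    rows = []
    for text, res in zip(texts, results):
        summary = res["summary_text"]
        # Compare after collapsing whitespace so layout differences do not hide a copy.
        is_copy = " ".join(summary.split()) == " ".join(text.split())
        rows.append((text, summary, is_copy))
    return rows


# Stand-in that echoes its input, imitating the copy-through behaviour seen above.
def echo_summarizer(texts, **kwargs):
    return [{"summary_text": t} for t in texts]


for text, summary, is_copy in summarize_batch(["Une phrase courte."], echo_summarizer):
    print(f"copy={is_copy} | {summary}")
```

Rows flagged as copies are candidates for skipping summarization altogether.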



## 🔍 Automatic Language Detection

In [21]:
!pip install langdetect



In [22]:
from langdetect import detect

texts = [
    "Bonjour, comment allez-vous ?",           # French
    "Hello, how are you doing?",               # English
    "Hola, ¿cómo estás?",                      # Spanish
    "Guten Tag, wie geht's Ihnen?",            # German
    "",                                        # Empty
    "こんにちは、お元気ですか？",                # Japanese
    "1234567890 $$$ ???",                      # Gibberish
]

for text in texts:
    try:
        lang = detect(text)
        print(f"📝 Text: {text}\n➡️ Language detected: {lang}\n")
    except Exception:  # langdetect raises LangDetectException when it finds no usable features
        print(f"📝 Text: {text}\n❌ Could not detect language\n")

📝 Text: Bonjour, comment allez-vous ?
➡️ Language detected: fr

📝 Text: Hello, how are you doing?
➡️ Language detected: en

📝 Text: Hola, ¿cómo estás?
➡️ Language detected: es

📝 Text: Guten Tag, wie geht's Ihnen?
➡️ Language detected: de

📝 Text: 
❌ Could not detect language

📝 Text: こんにちは、お元気ですか？
➡️ Language detected: ja

📝 Text: 1234567890 $$$ ???
❌ Could not detect language
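
Both failures above are inputs that contain no letters at all. A cheap pre-check can filter such inputs before the detector is called. This is a sketch under an assumption: counting alphabetic characters is treated as a rough proxy for "has detectable text", and `has_detectable_text` and its `min_letters` threshold are hypothetical.

```python
def has_detectable_text(text: str, min_letters: int = 3) -> bool:
    """Return True if the text contains at least `min_letters` alphabetic characters."""
    return sum(ch.isalpha() for ch in text) >= min_letters


samples = [
    "Bonjour, comment allez-vous ?",
    "",
    "こんにちは、お元気ですか？",
    "1234567890 $$$ ???",
]

for sample in samples:
    print(repr(sample), has_detectable_text(sample))
```

`str.isalpha` is Unicode-aware, so the Japanese sample passes the check while the digits-and-symbols sample does not.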



## Functions

In [34]:
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM
from langdetect import detect

# Load English and French summarizers
summarizer_en = pipeline("summarization",
                         model="facebook/bart-large-cnn")
fr_model_name = "plguillou/t5-base-fr-sum-cnndm"
tokenizer_fr = AutoTokenizer.from_pretrained(fr_model_name)
model_fr = AutoModelForSeq2SeqLM.from_pretrained(fr_model_name)
summarizer_fr = pipeline("summarization",
                         model=model_fr, tokenizer=tokenizer_fr)


def detect_language(text):
    try:
        return detect(text)
    except Exception:  # detection fails on empty or letter-free input
        return "unknown"




Device set to use cpu
Device set to use cpu


In [41]:
def summarize_text(text):
    if not text or text.strip() == "":
        return "⚠️ No input text provided."

    if len(text.split()) < 5:
        return f"⚠️ Text too short to summarize. Returning input:\n{text}"

    lang = detect_language(text)

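    n_words = len(text.split())
    # max() keeps max_length at 130 or more, so short inputs trigger the
    # "max_length ... input_length" warning; min() lowers min_length for short inputs.
    max_len = max(130, n_words)
    min_len = min(30, n_words)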

    if lang == "en":
        summary = summarizer_en(text, max_length=max_len, min_length=min_len, do_sample=False)
        return f"🇬🇧 English Summary:\n{summary[0]['summary_text']}"

    elif lang == "fr":
        summary = summarizer_fr(text, max_length=max_len, min_length=min_len, do_sample=False)
        return f"🇫🇷 Résumé Français :\n{summary[0]['summary_text']}"

    else:
        return f"❌ Unsupported or undetected language: '{lang}'"


text = "Bonjour, voici un exemple de texte que nous allons résumer automatiquement."
print(summarize_text(text))

text = "This is an example of a paragraph that we want to summarize using LLMs."
print(summarize_text(text))

Your max_length is set to 130, but your input_length is only 19. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)
Your max_length is set to 130, but your input_length is only 18. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)


🇫🇷 Résumé Français :
Voici un exemple de texte que nous allons résumer automatiquement.
🇬🇧 English Summary:
This is an example of a paragraph that we want to summarize using LLMs.
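
`summarize_text` returns display strings, which mixes language routing with formatting. An alternative is to return a structured result and route through a dictionary. The sketch below is one possible shape, not the notebook's implementation: `SummaryResult` and `route_summary` are hypothetical, and the detector and summarizers are passed in so the logic can be exercised without loading models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SummaryResult:
    lang: str
    summary: Optional[str]
    error: Optional[str] = None


def route_summary(text: str, detect_fn: Callable[[str], str],
                  summarizers: Dict[str, Callable], **gen_kwargs) -> SummaryResult:
    """Detect the language, pick the matching summarizer, and return a structured result."""
    if not text or not text.strip():
        return SummaryResult("unknown", None, "empty input")
    lang = detect_fn(text)
    summarizer = summarizers.get(lang)
    if summarizer is None:
        return SummaryResult(lang, None, f"unsupported language: {lang}")
    output = summarizer(text, **gen_kwargs)
    return SummaryResult(lang, output[0]["summary_text"])
```

With the notebook's objects this could be called as `route_summary(text, detect_language, {"en": summarizer_en, "fr": summarizer_fr}, max_length=130, min_length=30, do_sample=False)`; the caller then decides how to format the flag and label.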


In [36]:
text = "Bonjour, voici un exemple de texte que nous allons résumer automatiquement."
detect_language(text)

'fr'

In [42]:
!pip install gradio

In [44]:
import gradio as gr

In [45]:

def summarize_interface(text):
    return summarize_text(text)

# Gradio interface
demo = gr.Interface(
    fn=summarize_interface,
    inputs=gr.Textbox(lines=10, placeholder="Paste your email, article or paragraph here...", label="📝 Input Text"),
    outputs=gr.Textbox(label="🧾 Summary"),
    title="Multilingual Text Summarizer (EN/FR)",
    description="Paste any English or French text below and get an automatic summary. Language is detected automatically.",
    examples=[
        ["Bonjour, ceci est un exemple de mail professionnel à résumer pour un usage interne."],
        ["This is a long English article that explains how machine learning models are trained using large datasets."]
    ]
)

if __name__ == "__main__":
    demo.launch()

It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://ffa1a05b16f061c165.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
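
The interface invites users to paste whole articles, but the models accept a bounded number of input tokens (`facebook/bart-large-cnn`, for example, has 1,024 input positions). One option is to split long inputs into pieces and summarize each piece. The sketch below is a rough illustration only: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words, which merely approximates the token count a tokenizer would produce.

```python
from typing import List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


long_text = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(long_text, max_words=4):
    print(chunk)
```

Each chunk could then be passed to `summarize_text` and the partial summaries joined; whether that preserves coherence depends on the text.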


In [46]:
demo.close()


Closing server running on port: 7860
