### Practical Example: Document Summarization Using Transformers

#### Objective
The goal of this example is to summarize a document using a pre-trained summarization model from the Hugging Face `transformers` library. We will use `facebook/bart-large-cnn`, a BART checkpoint fine-tuned on the CNN/DailyMail news summarization dataset.

#### Steps and Explanation

1. **Install Required Libraries**
   First, ensure you have the necessary libraries installed. You need `transformers` and `torch`. Install them using pip:

In [None]:
pip install transformers torch

2. **Import Libraries**

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

   - **`BartTokenizer`**: Tokenizer for the BART model.
   - **`BartForConditionalGeneration`**: BART model for generating summaries.

3. **Load the Pre-Trained Model and Tokenizer**

In [None]:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

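   - **`from_pretrained`**: Downloads the named checkpoint from the Hugging Face Hub on the first call and loads it from the local cache afterwards.
   - **`tokenizer`**: Converts text into the token IDs that the BART model expects.
   - **`model`**: The pre-trained BART model, used here for conditional generation (summarization).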

4. **Prepare the Document**
   Define the document that you want to summarize. For this example, we’ll use a single-paragraph passage about artificial intelligence.

In [None]:

document = """
Artificial Intelligence (AI) is a broad field of computer science focused on creating systems capable of performing tasks that typically require human intelligence. These tasks include understanding natural language, recognizing patterns, solving complex problems, and making decisions. AI is divided into various subfields, including machine learning, neural networks, robotics, and natural language processing. Machine learning, a core component of AI, involves training algorithms to recognize patterns and make predictions based on data. Neural networks, inspired by the human brain, are used to model complex relationships in data. Robotics involves creating machines that can perform tasks autonomously or semi-autonomously. Natural language processing focuses on enabling machines to understand and interact using human language. AI technology has rapidly evolved over the past few decades, leading to advancements in areas such as self-driving cars, virtual assistants, and medical diagnostics. As AI continues to develop, it holds the potential to transform various industries and improve the quality of life for people around the world.
"""


   - **`document`**: The text that we want to summarize.

5. **Tokenize the Document**


In [None]:
inputs = tokenizer(document, return_tensors="pt", max_length=1024, truncation=True)

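   - **`tokenizer`**: Converts the document into token IDs and an attention mask, returned as PyTorch tensors (`return_tensors="pt"`). `max_length=1024` matches the maximum input length of `facebook/bart-large-cnn`, and `truncation=True` drops any tokens beyond that limit, so text past the cutoff never reaches the model.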
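Because truncation is silent, it can help to check beforehand whether a document exceeds the limit. The following is a minimal sketch, not part of the original walkthrough: `count_tokens` and `will_be_truncated` are hypothetical helper names, and the only assumption about `tokenizer` is that calling it returns a mapping with an `"input_ids"` list, as Hugging Face tokenizers do for a single string.

```python
def count_tokens(text, tokenizer) -> int:
    """Return how many token IDs the tokenizer produces with truncation disabled."""
    # Without return_tensors, a single string yields a flat list of IDs.
    return len(tokenizer(text, truncation=False)["input_ids"])


def will_be_truncated(text, tokenizer, max_length: int = 1024) -> bool:
    """Return True if tokenizing with truncation=True would drop part of the text."""
    return count_tokens(text, tokenizer) > max_length
```

With the tokenizer loaded in step 3, `will_be_truncated(document, tokenizer)` indicates whether any of the document would be cut off at the 1024-token limit.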

6. **Generate the Summary**


In [None]:
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=150,
    min_length=40,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True
)

   - **`model.generate`**: Generates the summary token IDs from the input tokens using beam search. Key parameters include:
     - **`max_length`**: Maximum length of the generated summary, in tokens.
     - **`min_length`**: Minimum length of the generated summary, in tokens.
     - **`length_penalty`**: Exponent applied to the sequence length when beam scores are normalized. Values above 0 favor longer summaries; values below 0 favor shorter ones.
     - **`num_beams`**: Number of candidate sequences kept at each step of beam search. More beams explore more alternatives at a higher compute cost.
     - **`early_stopping`**: When `True`, beam search stops as soon as `num_beams` complete candidate sequences have been found.
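To see how `length_penalty` shifts the ranking of beam candidates, here is a small standalone sketch. It assumes the scoring rule described in the `transformers` generation documentation, where a candidate's summed log-probability is divided by its length raised to `length_penalty`. The function name `beam_score` and the two candidate values are made up for illustration and do not come from the model.

```python
def beam_score(sum_logprobs: float, length: int, length_penalty: float) -> float:
    """Length-normalized beam score: summed log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical candidates: a short one and a longer, less probable one.
short = {"sum_logprobs": -4.0, "length": 5}
longer = {"sum_logprobs": -6.0, "length": 10}

for penalty in (0.0, 1.0, 2.0):
    s = beam_score(short["sum_logprobs"], short["length"], penalty)
    l = beam_score(longer["sum_logprobs"], longer["length"], penalty)
    preferred = "longer" if l > s else "short"
    print(f"length_penalty={penalty}: short={s:.3f}, longer={l:.3f}, preferred={preferred}")
```

Log-probabilities are negative, so dividing by a larger power of the length moves the score of a long candidate closer to zero. With these numbers the short candidate ranks first at `length_penalty=0.0`, while the longer one ranks first at `1.0` and `2.0`.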

7. **Decode and Print the Summary**


In [None]:
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)

   - **`tokenizer.decode`**: Converts the token IDs back into a human-readable string. The `skip_special_tokens` parameter ensures that special tokens used by the model are not included in the final output.

#### Key Points

1. **Pre-Trained Models**: We use `facebook/bart-large-cnn`, a BART model fine-tuned for summarization. BART (Bidirectional and Auto-Regressive Transformers) is effective for generating summaries because it combines bidirectional context (like BERT) with auto-regressive generation (like GPT).

2. **Tokenization**: Tokenizing the document ensures that it is in a format that the model can process. We handle long documents by truncating them to fit within the model’s maximum input length.

3. **Generation Parameters**: `max_length` and `min_length` bound the summary length in tokens, `length_penalty` shifts beam search toward longer or shorter outputs, and `num_beams` trades compute for a wider search. Adjust these parameters based on the document and the desired summary length.

4. **Handling Long Documents**: For very long documents, you may need to split the text into smaller chunks and summarize each chunk separately, as the model has a maximum input length it can handle.
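The chunk-and-summarize idea from the last point could be sketched as follows. This is one possible approach under stated assumptions, not the method used in this example: `split_into_chunks` and `summarize_long` are hypothetical helpers, the sentence splitting is a simple regular expression, and chunk size is measured in whitespace-separated words by default as a rough stand-in for tokens. Pass a token-counting function as `count` for a tighter bound.

```python
import re
from typing import Callable, List


def split_into_chunks(
    text: str,
    max_units: int,
    count: Callable[[str], int] = lambda s: len(s.split()),
) -> List[str]:
    """Group consecutive sentences into chunks of at most max_units each.

    A single sentence larger than max_units is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    current_size = 0
    for sentence in sentences:
        if not sentence:
            continue
        size = count(sentence)
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and current_size + size > max_units:
            chunks.append(" ".join(current))
            current, current_size = [], 0
        current.append(sentence)
        current_size += size
    if current:
        chunks.append(" ".join(current))
    return chunks


def summarize_long(
    text: str,
    summarize_fn: Callable[[str], str],
    max_units: int = 400,
) -> str:
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    partial = [summarize_fn(chunk) for chunk in split_into_chunks(text, max_units)]
    return " ".join(partial)
```

Here `summarize_fn` would wrap the tokenize, generate, and decode steps shown earlier. Summaries of separate chunks can repeat or miss cross-chunk context, so a second summarization pass over the joined result is a common follow-up.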

#### Complete Code Example

Here is the complete code snippet for summarizing a document using BART:

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Load pre-trained model and tokenizer
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

# Define the document to summarize
document = """
Artificial Intelligence (AI) is a broad field of computer science focused on creating systems capable of performing tasks that typically require human intelligence. These tasks include understanding natural language, recognizing patterns, solving complex problems, and making decisions. AI is divided into various subfields, including machine learning, neural networks, robotics, and natural language processing. Machine learning, a core component of AI, involves training algorithms to recognize patterns and make predictions based on data. Neural networks, inspired by the human brain, are used to model complex relationships in data. Robotics involves creating machines that can perform tasks autonomously or semi-autonomously. Natural language processing focuses on enabling machines to understand and interact using human language. AI technology has rapidly evolved over the past few decades, leading to advancements in areas such as self-driving cars, virtual assistants, and medical diagnostics. As AI continues to develop, it holds the potential to transform various industries and improve the quality of life for people around the world.
"""

# Tokenize the document
inputs = tokenizer(document, return_tensors="pt", max_length=1024, truncation=True)

# Generate the summary
summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=150,
    min_length=40,
    length_penalty=2.0,
    num_beams=4,
    early_stopping=True
)

# Decode and print the summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)


This example demonstrates how to use a pre-trained summarization model to generate a concise summary of a longer document. The same steps can serve applications such as condensing articles, generating report overviews, or extracting the key points of a text.