# USING HUGGING FACE TRANSFORMERS

Based on Hugging Face NLP Course Chapter 2

We will use the pretrained facebook/bart-large-cnn summarization model for now.

In [30]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)






- tokenizer.tokenize(seq) returns a list of strings, one string per token.  
- tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")  
The output of this call is a dictionary-like object containing the token IDs ('input_ids') and the attention masks ('attention_mask').  
E.g. (these IDs appear to come from the course's BERT-style checkpoint, where 101 and 102 are the [CLS] and [SEP] tokens and 0 is padding; facebook/bart-large-cnn uses a different vocabulary, so its IDs will differ):  
tokens = {  
'input_ids': tensor([[  101,  1045,  1005,  2310,  2042,  3402,  2005,  1037,  8574,  2607,  2026,  2878,  2166,  1012,   102],  
[  101,  2061,  2031,  1045,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0,     0]]),  
'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
}

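When we batch inputs of different lengths, three things have to happen:  
- Padding: shorter sequences are filled with a padding token so every row has the same length.  
- Attention masking: a mask of 1s and 0s tells the model which positions are real tokens and which are padding to ignore.  
- Truncation: sequences longer than the maximum length are cut down to it.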
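
To make these three steps concrete, here is a minimal pure-Python sketch that works on plain lists of token IDs. The function name pad_and_truncate and its parameters are made up for illustration and are not part of the Transformers API. Unlike a real tokenizer, this toy version just cuts off the tail when truncating and does not keep the closing special token.

```python
def pad_and_truncate(batch, pad_id=0, max_length=None):
    """Toy stand-in for padding=True, truncation=True on lists of token IDs."""
    # Truncation: cut every sequence to at most max_length tokens.
    if max_length is not None:
        batch = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the length of the longest one.
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch]
    # Attention mask: 1 for real tokens, 0 for padding positions.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three sequences of different lengths (the IDs are arbitrary examples).
batch = [[101, 1045, 2031, 102], [101, 999, 102], [101, 2061, 2031, 1045, 999, 102]]
encoded = pad_and_truncate(batch, pad_id=0, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Every row comes out with the same length, which is what lets the batch be stacked into a single tensor.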

In [31]:
# Define the texts to summarize
sequences = [
    "Artificial intelligence (AI) is a field of computer science that aims to create machines capable of intelligent behavior. It has become an essential part of the technology industry, providing the heavy lifting for many of the most challenging problems in computer science. AI is the simulation of human intelligence processes by machines, especially computer systems. These processes include learning (the acquisition of information and rules for using the information), reasoning (using rules to reach approximate or definite conclusions), and self-correction. Particular applications of AI include expert systems, speech recognition, and machine vision. AI can be categorized into two types: narrow AI, which is designed to perform a narrow task (e.g., facial recognition or internet searches), and general AI, which is designed to perform any intellectual task that a human can do. Researchers are working on developing technologies that can help address global challenges such as climate change, healthcare, and education. AI has the potential to revolutionize many sectors of the economy, including finance, healthcare, and transportation, by enabling more efficient and effective operations. However, AI also raises ethical and societal concerns, such as job displacement, privacy, and the potential for biased decision-making. To address these concerns, it is essential to establish robust frameworks for the development and deployment of AI technologies, ensuring that they are used responsibly and ethically.",
    "Climate change is one of the most pressing issues of our time. It refers to significant changes in global temperatures and weather patterns over time. While climate change is a natural phenomenon, scientific evidence shows that human activities, particularly the burning of fossil fuels, deforestation, and industrial processes, have accelerated the rate of climate change. These activities increase the concentration of greenhouse gases in the atmosphere, leading to global warming and associated impacts such as rising sea levels, more frequent and severe weather events, and disruptions to ecosystems and biodiversity. Addressing climate change requires a comprehensive approach that includes reducing greenhouse gas emissions, transitioning to renewable energy sources, improving energy efficiency, and implementing sustainable land use and forest management practices. International cooperation is crucial in tackling climate change, as it is a global challenge that transcends national borders. Agreements such as the Paris Agreement aim to unite countries in their efforts to mitigate and adapt to climate change by setting targets for reducing emissions and providing support to developing countries. Public awareness and education are also vital in promoting sustainable practices and encouraging individuals and communities to take action. By working together, we can reduce the impact of climate change and protect the planet for future generations.",
    "The history of the Internet is a fascinating journey of technological innovation and collaboration that has transformed the way we communicate, access information, and conduct business. The origins of the Internet can be traced back to the 1960s when the United States Department of Defense developed the ARPANET, a network that allowed researchers to share data and resources securely. ARPANET introduced key technologies such as packet switching and the TCP/IP protocol suite, which became the foundation of the modern Internet. During the 1970s and 1980s, academic institutions and government organizations began to develop their own networks, which eventually interconnected, forming a larger network of networks. The introduction of the World Wide Web in the early 1990s by Tim Berners-Lee revolutionized the Internet by providing a user-friendly interface for accessing and sharing information. The Web's rapid growth was fueled by the development of web browsers, search engines, and e-commerce platforms, making the Internet accessible to the general public. As the Internet continued to evolve, it enabled the rise of social media, cloud computing, and the Internet of Things (IoT), connecting billions of devices and creating new opportunities for innovation and economic growth. Today, the Internet is an integral part of our daily lives, impacting everything from communication and education to entertainment and healthcare. Despite its many benefits, the Internet also presents challenges such as cybersecurity threats, privacy concerns, and digital divide issues. Addressing these challenges requires ongoing efforts to enhance security, protect user data, and ensure equitable access to digital technologies."
]
# Tokenize the input text
tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



The order of the language modeling pipeline:  
The tokenizer handles text and returns IDs.  
The model handles these IDs and outputs a prediction.   
The tokenizer can then be used once again to convert these predictions back to some text.  
The tokenizer handles both directions: text to IDs (calling tokenizer(...), which builds on tokenizer.tokenize()) and IDs back to text (tokenizer.decode() or tokenizer.batch_decode()).
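
The round trip above can be sketched with a toy whitespace tokenizer. Everything here is hypothetical: the five-word vocabulary, the encode and decode helpers, and the toy_model function, which only keeps the first and last token as a stand-in for a real model's prediction. A real tokenizer uses a learned subword vocabulary instead.

```python
# Hypothetical toy vocabulary mapping tokens to IDs.
vocab = {"<unk>": 0, "the": 1, "model": 2, "summarizes": 3, "text": 4}
id_to_token = {idx: token for token, idx in vocab.items()}


def encode(text):
    """Text -> IDs: split on whitespace and look each word up in the vocabulary."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def toy_model(ids):
    """Stand-in for a model: 'predicts' a shorter ID sequence (first and last token)."""
    return [ids[0], ids[-1]]


def decode(ids):
    """IDs -> text: map each ID back to its token and join with spaces."""
    return " ".join(id_to_token[idx] for idx in ids)


ids = encode("The model summarizes text")   # step 1: tokenizer returns IDs
predicted_ids = toy_model(ids)              # step 2: model maps IDs to predicted IDs
result = decode(predicted_ids)              # step 3: tokenizer turns IDs back into text
print(ids, predicted_ids, result)
```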

In [38]:
# Generate the summaries with specified parameters
summary_ids = model.generate(
    **tokens,
    max_length=100,
    min_length=50,
    num_beams=4,
    length_penalty=0.5,
    no_repeat_ngram_size=2,
    early_stopping=True
)

# Decode the summary
summary = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)

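output = model(**tokens)  
Calling the model directly (instead of model.generate) returns raw outputs with various attributes such as logits, the unnormalized scores over the vocabulary at each position.  
Applying a softmax over the last dimension of the logits turns them into probabilities.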
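
As a small illustration of what the softmax does to logits, here is a pure-Python sketch on a made-up vector of three scores. The softmax helper and the example values are assumptions for illustration only. On real model outputs the PyTorch equivalent would be torch.softmax(output.logits, dim=-1).

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # made-up scores for three vocabulary entries
probs = softmax(logits)
print(probs)
```

The ordering of the scores is preserved; only the scale changes, so the largest logit still gets the largest probability.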

In [39]:
# Print the summaries
for i, text in enumerate(summary):  # "text" avoids shadowing the built-in sum()
    print(f"Summary {i + 1}: {text}")

Summary 1: Artificial intelligence is the simulation of human intelligence processes by machines. Particular applications of AI include expert systems, speech recognition, and machine vision. AI has the potential to revolutionize many sectors of the economy. It also raises ethical and societal concerns, such as job displacement and privacy.
Summary 2: Climate change is one of the most pressing issues of our time. It refers to significant changes in global temperatures and weather patterns. Human activities, particularly the burning of fossil fuels, deforestation, and industrial processes, have accelerated the rate of climate change.
Summary 3: The Internet is an integral part of our daily lives, impacting everything from communication and education to entertainment and healthcare. Despite its many benefits, the Internet also presents challenges such as cybersecurity threats and privacy concerns. Addressing these challenges requires ongoing efforts to enhance security, protect user data

Note that decreasing length_penalty to 0.5 really helped; otherwise most of the outputs ended abruptly, mid-sentence. (Summary 3 above still ends mid-sentence.)
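
One possible explanation, sketched under an assumption: the Transformers documentation describes length_penalty as an exponent applied to the sequence length, which then divides the (negative) log-likelihood score of a beam. The beam_score and pick_best helpers and the two candidate hypotheses below are made up to show the effect. They are not the library's code.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Assumed ranking rule: summed log-probability divided by length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


# Two hypothetical beam candidates: a short one and a longer one.
candidates = {
    "short": {"sum_logprobs": -4.0, "length": 10},
    "long": {"sum_logprobs": -8.5, "length": 20},
}


def pick_best(length_penalty):
    """Return the name of the candidate with the highest score."""
    return max(
        candidates,
        key=lambda name: beam_score(
            candidates[name]["sum_logprobs"], candidates[name]["length"], length_penalty
        ),
    )


for penalty in (0.5, 2.0):
    print(penalty, pick_best(penalty))
```

Under this rule a smaller length_penalty favors the shorter candidate, which would fit the observation that summaries finish their sentences before reaching max_length more often.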