Welcome to the *Deep Learning Lab* focused on *Natural Language Processing (NLP)*! In this session, we will explore cutting-edge techniques in NLP using powerful transformer-based models provided by **Hugging Face**. Deep learning has revolutionized the field of NLP, enabling computers to understand, generate, and manipulate human language with unprecedented accuracy and flexibility. Through a series of hands-on exercises, you will delve into various NLP tasks, ranging from sentiment analysis and text generation to text summarization, question answering, translation, and advanced classification.

**Lab Objectives:**

1. **Sentiment Analysis**: Harness the power of pre-trained sentiment analysis models to analyze the sentiment of textual data.
2. **Text Generation**: Explore the capabilities of transformer-based language models to generate coherent and contextually relevant text.
3. **Text Summarization**: Learn how to automatically generate concise summaries of long documents or articles using advanced summarization techniques.
4. **Question Answering**: Build systems capable of understanding and answering questions based on textual data, mimicking human-like comprehension.
5. **Translation**: Utilize Hugging Face's translation pipelines to translate text between different languages, demonstrating the versatility of transformer models.
6. **Building a Small Transformer Model**: Gain insights into the inner workings of transformer models by constructing a small-scale version from scratch, understanding the architecture and components.
7. **Classification with BLOOM or T5**: Explore advanced classification tasks using either the BLOOM or T5 model from Hugging Face, leveraging BLOOM's autoregressive (decoder-only) language modeling or T5's text-to-text encoder-decoder formulation.

Installing transformers

In [1]:
# Transformers installation
! pip install transformers
# To install from source instead of the latest release, comment out the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Collecting transformers
  Downloading transformers-4.39.3-py3-none-any.whl.metadata (134 kB)
Collecting huggingface-hub<1.0,>=0.19.3 (from transformers)
  Downloading huggingface_hub-0.22.2-py3-none-any.whl.metadata (12 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers)
  Downloading tokenizers-0.15.2-cp311-none-win_amd64.whl.metadata (6.8 kB)
Collecting safetensors>=0.4.1 (from transformers)
  Downloading safetensors-0.4.2-cp311-none-win_amd64.whl.metadata (3.9 kB)
Downloading transformers-4.39.3-py3-none-any.whl (8.8 M

Use a pipeline to perform sentiment analysis on the sentence below. (Import `pipeline` before performing any task.)

In [1]:
# Import the function for loading Hugging Face pipelines
from transformers import pipeline

prompt = "The food was good, but service at the restaurant was a bit slow"

# Load the pipeline for sentiment classification
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

# Pass the customer review to the model for prediction
prediction = classifier(prompt)
print(prediction)

  from .autonotebook import tqdm as notebook_tqdm
2024-04-05 20:08:59.517642: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


[{'label': 'NEGATIVE', 'score': 0.9981649518013}]


By default, the model downloaded for this pipeline is "distilbert-base-uncased-finetuned-sst-2-english". Now, instead of the default model, use "***bert-base-multilingual-uncased-sentiment***" to predict the sentiment of the sentences below. The first sentence is in French and the second is in Italian. The full list of available models is at https://huggingface.co/models.


1. Ce semestre, mes cours sont très intéressants.
2. Tutti si godono l'estate. Mi piace il tempo, quindi posso andare nei parchi.

In [2]:
# Predict the sentiment of the two sentences above with the
# bert-base-multilingual-uncased-sentiment model.
sentences = [
    "Ce semestre, mes cours sont très intéressants.",
    "Tutti si godono l'estate. Mi piace il tempo, quindi posso andare nei parchi.",
]
classifier = pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')
# Use a separate loop variable so the list is not overwritten while iterating
for sentence in sentences:
    ans = classifier(sentence)
    print("Sentiment of sentence is: ", ans)

  return self.fget.__get__(instance, owner)()


Sentiment of sentence is:  [{'label': '5 stars', 'score': 0.5586809515953064}]
Sentiment of sentence is:  [{'label': '4 stars', 'score': 0.43630707263946533}]
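
Unlike the first model, which answers POSITIVE or NEGATIVE, this model returns a star rating as its label. If a coarse sentiment class is needed, the label can be post-processed. The sketch below is one possible mapping; the function name `stars_to_sentiment` and the thresholds (1-2 stars negative, 3 neutral, 4-5 positive) are assumptions made for illustration, not part of the model.

```python
def stars_to_sentiment(label: str) -> str:
    """Map a label such as '4 stars' to a coarse sentiment class.

    The thresholds are an illustrative assumption, not defined by the model.
    """
    stars = int(label.split()[0])  # '4 stars' -> 4, '1 star' -> 1
    if stars <= 2:
        return "NEGATIVE"
    if stars == 3:
        return "NEUTRAL"
    return "POSITIVE"


# Labels copied from the output above
for label in ["5 stars", "4 stars"]:
    print(label, "->", stars_to_sentiment(label))
```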


Learning how to use Pipeline in Text Generation

In [3]:
# Import the function for loading Hugging Face pipelines
from transformers import pipeline

prompt = "The Indianapolis is a great city located with "

# Load the pipeline for text generation
llm = pipeline("text-generation", model="gpt2")

# Pass the prompt and cap the total length (prompt plus generated tokens) at 200
outputs = llm(prompt, max_length=200)

print(outputs[0]['generated_text'])


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


The Indianapolis is a great city located with iced drinks, so I had to get my shots with a bunch of other cool stuff as well. The only gripe I had with the glass was that the lid was too tight, and that I was not sure if it was because of the metal parts, or because of the fact that while the drink was on top, the bottle was still on the glass, which was just a problem for me at least.

It was only worth a 10. But we were actually really excited to try the coffee. We bought two coffees on the 5th day of our trip, and the first one was at the bar. Not only is both of them extremely cheap, they're very flavorful and tasty too. The second one I really ordered, was the $16 drink at the local bar (which I couldn't remember how to tell what was in the glass). With its big enough size and price, and the large size of the glass, a
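
The second message printed before the generated text appears because GPT-2 has no padding token, so generation falls back to the end-of-sequence token. Passing `pad_token_id` explicitly avoids that fallback message, and `max_new_tokens` bounds only the continuation rather than prompt plus continuation. The wrapper below is a minimal sketch; `generate_text` is a hypothetical helper name, and it assumes the object passed in behaves like the `llm` pipeline above (callable, with a `tokenizer` attribute).

```python
def generate_text(generator, prompt, max_new_tokens=50):
    """Generate a continuation with an explicit pad token and a bounded length.

    `generator` is assumed to be a text-generation pipeline such as `llm` above.
    """
    outputs = generator(
        prompt,
        max_new_tokens=max_new_tokens,                  # limit only the newly generated tokens
        pad_token_id=generator.tokenizer.eos_token_id,  # reuse EOS as the padding token
    )
    return outputs[0]["generated_text"]


# Usage with the pipeline defined above:
# print(generate_text(llm, prompt, max_new_tokens=50))
```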


Learning how to use Pipeline in Text Summarization

In [4]:
# Import the function for loading Hugging Face pipelines
from transformers import pipeline

long_text= """Indianapolis, home to Indiana University-Purdue University Indianapolis (IUPUI), is a vibrant city that seamlessly blends urban excitement with academic excellence. Situated in the heart of downtown Indianapolis, IUPUI offers students a unique collegiate experience surrounded by cultural attractions, bustling businesses, and a rich sports scene. The campus itself is a dynamic hub of learning, innovation, and diversity, with state-of-the-art facilities and a wide array of academic programs spanning disciplines from business and engineering to arts and sciences. Beyond the classroom, students at IUPUI have ample opportunities for internships, research collaborations, and community engagement, leveraging the city's resources to enrich their educational journey. Whether exploring the renowned Indianapolis Museum of Art, cheering on the Indianapolis Colts at Lucas Oil Stadium, or immersing themselves in the city's thriving music and culinary scene, students at IUPUI find themselves at the intersection of academic growth and urban adventure."""


# Load the summarization pipeline with the facebook/bart-large-cnn model
llm = pipeline("summarization", model="facebook/bart-large-cnn")

# Pass the long text, cap the summary at 60 tokens, and clean up tokenization spaces
outputs = llm(long_text, max_length=60, clean_up_tokenization_spaces=True)

print(outputs[0]['summary_text'])


Indiana University-Purdue University Indianapolis (IUPUI) offers students a unique collegiate experience surrounded by cultural attractions, bustling businesses, and a rich sports scene. The campus itself is a dynamic hub of learning, innovation, and diversity, with state-of-the-art facilities and
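
The summary above stops mid-sentence because generation is cut off once `max_length` tokens have been produced. One option is to raise `max_length`; another is to post-process the text. The helper below is a minimal sketch of the second option. The name `trim_to_last_sentence` is hypothetical, and the rule is deliberately naive: it treats any `.`, `!` or `?` as a sentence end, so abbreviations would confuse it.

```python
def trim_to_last_sentence(text: str) -> str:
    """Drop a trailing partial sentence.

    Returns the text unchanged when no sentence-ending punctuation is found.
    """
    end = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    return text[: end + 1] if end != -1 else text


# Shortened stand-in for a truncated summary
truncated = (
    "IUPUI offers students a unique collegiate experience. "
    "The campus itself is a dynamic hub with state-of-the-art facilities and"
)
print(trim_to_last_sentence(truncated))
```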


Learning how to use Pipeline in Question-Answering.

In [5]:
from transformers import pipeline

context= "The history of deep learning traces back to the 1940s when the groundwork for artificial neural networks was laid down by researchers like Warren McCulloch and Walter Pitts, who proposed a computational model inspired by the biological neural networks in the brain. However, progress was slow due to limitations in computational power and data availability. It wasn't until the 1980s that significant advancements occurred, with the development of backpropagation algorithms by Geoffrey Hinton, David Rumelhart, and Ronald Williams, which allowed neural networks to efficiently learn from data by adjusting the weights of connections between neurons. Despite these breakthroughs, deep learning faced challenges in scaling networks with more layers due to the vanishing gradient problem. In the 2000s, with the rise of big data and improvements in computational resources, deep learning experienced a resurgence. Notable milestones include the introduction of convolutional neural networks (CNNs) by Yann LeCun and others in the 1990s, which revolutionized image recognition tasks, and the success of deep learning methods in the ImageNet Large Scale Visual Recognition Challenge in 2012, spearheaded by AlexNet, a CNN developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Since then, deep learning has become the cornerstone of artificial intelligence research, powering applications ranging from natural language processing and speech recognition to autonomous vehicles and healthcare diagnostics, and continues to evolve with ongoing innovations in algorithms, architectures, and applications"
question= "What major event in 2012 marked a significant breakthrough for deep learning, propelling it to the forefront of artificial intelligence research?"

# Load the pipeline for question answering
llm = pipeline("question-answering", model="distilbert-base-uncased-distilled-squad")

# Pass the question and the context to the pipeline
outputs = llm(question=question, context=context)

# Print the answer to the question
print(outputs['answer'])

ImageNet Large Scale Visual Recognition Challenge
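
Besides `answer`, the question-answering pipeline result also carries a `score` and the `start` and `end` character offsets of the answer within the context. The sketch below shows how those offsets can be used to display the answer in its surrounding text. The helper name `answer_with_context` is hypothetical, and `demo_result` is a hand-built stand-in with a placeholder score, not real model output.

```python
def answer_with_context(result: dict, context: str, window: int = 40) -> str:
    """Show the extracted answer in brackets with some surrounding context."""
    start, end = result["start"], result["end"]
    left = context[max(0, start - window):start]
    right = context[end:end + window]
    return f"...{left}[{context[start:end]}]{right}..."


# Hand-built stand-in for a pipeline result (illustrative values only)
demo_context = "AlexNet won the ImageNet challenge in 2012."
demo_answer = "ImageNet challenge"
demo_start = demo_context.find(demo_answer)
demo_result = {
    "answer": demo_answer,
    "start": demo_start,
    "end": demo_start + len(demo_answer),
    "score": 1.0,  # placeholder, not a real confidence value
}
print(answer_with_context(demo_result, demo_context, window=12))
```

With the real pipeline, the same call would be `answer_with_context(outputs, context)`.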


In [6]:
pip install sentencepiece


Defaulting to user installation because normal site-packages is not writeable

Note: you may need to restart the kernel to use updated packages.


Learning how to use pipeline in Translation

In [7]:
from transformers import pipeline

input_text = "El tiempo lo cura todo"

# Define pipeline for Spanish-to-English translation, use this model "Helsinki-NLP/opus-mt-es-en" instead of default
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

# Translate the input text
translations = translator(input_text)

# Access the output to print the translated text in English
print(translations[0]['translation_text'])


Time heals everything




PyTorch's nn.Transformer class provides a full transformer architecture with pre-built encoder and decoder stacks.

The simplest way to manually create a skeleton nn.Transformer model is by specifying its main structural hyperparameters: model dimensionality (embedding size), number of attention heads, number of encoder layers, and number of decoder layers. PyTorch does the rest of the job for you, assigning default modules inside the encoder and decoder layers.

torch.nn has already been imported for you.

Note: take a close look at the print(model) output; it gives an insightful view of the layers inside the transformer model you have built.

In [8]:
from torch.nn import Transformer
# Set transformer model hyperparameters
d_model = 512
n_heads = 8
num_encoder_layers = 6
num_decoder_layers = 6

# Create the transformer model and assign hyperparameters
model = Transformer(
    d_model=d_model,
    nhead=n_heads,
    num_encoder_layers=num_encoder_layers,
    num_decoder_layers=num_decoder_layers
)

print(model)

Transformer(
  (encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, out_features=512, bias=True)
        )
        (linear1): Linear(in_features=512, out_features=2048, bias=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (linear2): Linear(in_features=2048, out_features=512, bias=True)
        (norm1): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
    (norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=512, o
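
The printed module tree shows which layers exist, but not how tensors flow through them. Note that `nn.Transformer` contains no embedding layer and no output projection: it expects inputs that are already embedded to `d_model` dimensions. The sketch below runs a forward pass on random tensors. The sizes are deliberately tiny and chosen only to keep the example fast; they are not the values used in the cell above.

```python
import torch
from torch.nn import Transformer

# Small illustrative sizes; d_model must be divisible by nhead
tiny = Transformer(
    d_model=32,
    nhead=4,
    num_encoder_layers=1,
    num_decoder_layers=1,
    dim_feedforward=64,
)
tiny.eval()  # disable dropout so the forward pass is deterministic

# With the default batch_first=False, tensors are (sequence length, batch, d_model)
src = torch.rand(10, 2, 32)  # encoder input: 10 source positions, batch of 2
tgt = torch.rand(5, 2, 32)   # decoder input: 5 target positions, batch of 2

with torch.no_grad():
    out = tiny(src, tgt)

# The output has the same shape as the decoder input
print(out.shape)

# Total number of trainable parameters in the tiny model
n_params = sum(p.numel() for p in tiny.parameters())
print(n_params)
```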

Now, let us dive into fine-tuning an LLM on a custom dataset.

For this exercise, you may use either BLOOM (https://huggingface.co/docs/transformers/model_doc/bloom#bloom) or T5 (https://huggingface.co/docs/transformers/model_doc/t5#t5) for the text classification problem. The dataset is SMSSpamCollection, which contains text messages and their corresponding labels, spam or ham.
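
If T5 is chosen, one way to frame classification is as text-to-text: the model reads a prefixed input string and is trained to generate the label as text. The sketch below only shows how such training pairs could be built. The helper name `to_text_to_text` and the prefix `"classify sms: "` are assumptions made for illustration; the fine-tuning cell later in this lab takes the alternative route of attaching a classification head.

```python
def to_text_to_text(message: str, label: int, prefix: str = "classify sms: "):
    """Turn a (message, numeric label) pair into (input text, target text).

    Assumes the convention ham=0, spam=1. The prefix is a hypothetical choice.
    """
    target = "spam" if label == 1 else "ham"
    return prefix + message, target


pairs = [
    to_text_to_text("Win a free prize now", 1),
    to_text_to_text("See you at lunch", 0),
]
for source_text, target_text in pairs:
    print(repr(source_text), "->", repr(target_text))
```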

General Steps:


1. Data Preprocessing.
2. Model Fine-Tuning.
3. Training.
4. Evaluation.
5. Inference.




In [9]:
! pip install git+https://github.com/huggingface/transformers.git

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-txvwuwiv
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-txvwuwiv
  Resolved https://github.com/huggingface/transformers.git to commit 76fa17c1663a0efeca7208c20579833365584889
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done



In [10]:
from transformers import AutoTokenizer
import torch
from sklearn.model_selection import train_test_split

# Each line of the file has the form "<label><TAB><message>"
with open("SMSSpamCollection.txt", "r") as file:
    lines = file.readlines()

messages = []
labels = []

for line in lines:
    # Split on the first tab only, in case a message itself contains a tab
    label, message = line.split("\t", 1)
    messages.append(message.strip())
    labels.append(0 if label == "ham" else 1)  # Convert labels: ham=0, spam=1

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the dataset
encoding = tokenizer(messages, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Split the token IDs and labels into training and validation sets
train_ids, val_ids, train_labels, val_labels = train_test_split(encoding["input_ids"], labels, test_size=0.2, random_state=42)

train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)
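
Spam is the minority class in this dataset, so a purely random split can leave the validation set with a different ham/spam ratio than the training set. Passing `stratify` to `train_test_split` keeps the ratio the same in both subsets. The sketch below uses toy data (the `toy_` names and values are made up for illustration) rather than the real file.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-in data: 20 messages, 4 of them labeled spam (1), 16 ham (0)
toy_messages = [f"message {i}" for i in range(20)]
toy_labels = [1 if i % 5 == 0 else 0 for i in range(20)]

train_x, val_x, train_y, val_y = train_test_split(
    toy_messages,
    toy_labels,
    test_size=0.25,
    random_state=42,
    stratify=toy_labels,  # preserve the ham/spam ratio in both subsets
)

print("train:", Counter(train_y))
print("val:  ", Counter(val_y))
```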


In [None]:
import torch
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
# Do not also import Dataset from `datasets` here: it would shadow
# torch.utils.data.Dataset, which SMSDataset below must subclass.

# Load dataset
with open("SMSSpamCollection.txt", "r") as file:
    lines = file.readlines()

messages = []
labels = []

for line in lines:
    # Split on the first tab only, in case a message itself contains a tab
    label, message = line.split("\t", 1)
    messages.append(message.strip())
    labels.append(0 if label == "ham" else 1)  # Convert labels: ham=0, spam=1

# Split dataset
train_texts, val_texts, train_labels, val_labels = train_test_split(messages, labels, test_size=0.2, random_state=42)

# Define dataset class
class SMSDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        # Tokenize all texts once; padding=True pads to the longest text in this split
        self.encodings = tokenizer(texts, padding=True, truncation=True, max_length=512)
        self.labels = labels

    def __getitem__(self, idx):
        # Return one example as a dict of tensors, in the format Trainer expects
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Load the T5 tokenizer and T5 with a two-class classification head
tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSequenceClassification.from_pretrained("t5-base", num_labels=2)

# Prepare datasets
train_dataset = SMSDataset(train_texts, train_labels, tokenizer)
val_dataset = SMSDataset(val_texts, val_labels, tokenizer)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset
)

# Train model
trainer.train()

# Evaluate model
trainer.evaluate()
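
Without a metrics function, `trainer.evaluate()` reports the evaluation loss but no accuracy. The sketch below is one possible `compute_metrics` function that could be passed to `Trainer` through its `compute_metrics` argument. The metric names in the returned dictionary are arbitrary choices, and the tuple check is a defensive assumption in case the model returns extra outputs alongside the logits.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy plus precision and recall for the spam class (label 1)."""
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    if isinstance(logits, tuple):
        # Assumption: when several outputs are returned, the logits come first
        logits = logits[0]
    preds = np.argmax(logits, axis=-1)
    labels = np.asarray(labels)

    tp = int(((preds == 1) & (labels == 1)).sum())  # spam correctly flagged
    fp = int(((preds == 1) & (labels == 0)).sum())  # ham wrongly flagged as spam
    fn = int(((preds == 0) & (labels == 1)).sum())  # spam that was missed

    return {
        "accuracy": float((preds == labels).mean()),
        "spam_precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "spam_recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Usage: Trainer(model=model, args=training_args, train_dataset=train_dataset,
#                eval_dataset=val_dataset, compute_metrics=compute_metrics)
```

Precision and recall are reported for the spam class because accuracy alone can look high on an imbalanced dataset even when most spam is missed.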


In [None]:
def predict(text, model, tokenizer):
    """Return the predicted label ("ham" or "spam") and its probability for one message."""
    model.eval()  # Disable dropout for inference
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    # Move the inputs to the same device as the model
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    # Convert logits to class probabilities and pick the most likely class
    probs = torch.nn.functional.softmax(logits, dim=-1)
    predicted_label = torch.argmax(probs, dim=-1).item()
    label_map = {0: "ham", 1: "spam"}
    return label_map[predicted_label], probs[0][predicted_label].item()

sample_text = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"
prediction, probability = predict(sample_text, model, tokenizer)
print(f"Predicted label: {prediction} with probability {probability}")


In [12]:
pip install datasets

Defaulting to user installation because normal site-packages is not writeable
Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl.metadata (20 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl.metadata (3.6 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.2.0,>=2023.1.0 (from fsspec[http]<=2024.2.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.2.0-py3-none-any.whl.metadata (6.8 kB)
Collecting aiohttp (from datasets)
  Downloading aiohttp-3.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting aiosignal>=1.1.2 (from aiohttp->datasets)
  Downloading aiosignal-1.

In [14]:
pip install accelerate -U


Defaulting to user installation because normal site-packages is not writeable
Collecting accelerate
  Downloading accelerate-0.29.1-py3-none-any.whl.metadata (18 kB)
Downloading accelerate-0.29.1-py3-none-any.whl (297 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.29.1

Note: you may need to restart the kernel to use updated packages.
