###**Lab 1 exercise on dialogue summarization using BART; the dataset used is SAMSum from Hugging Face**

**BART for Summarization:**

Used BART (facebook/bart-large-cnn), a model pre-trained specifically for summarization tasks.

Implemented a function generate_summary using BART to create more focused and accurate summaries of dialogues.


This exercise provided practical insights into using advanced NLP models for summarization tasks. It highlights the importance of choosing the right model for a specific task, such as BART for summarization.


**Task of dialogue summarization. Here's a brief overview of what the code entails:**

Dialogue Summarization Examples:

The code begins by printing sample dialogues and their human-written summaries. This introduces the concept of summarizing conversations, a common application of natural language processing.

Model Loading and Tokenization:

The code demonstrates how to load a pre-trained model (google/flan-t5-base) and tokenize text, an essential step for preparing data for processing by language models.

Encoding and Decoding Sentences:

It includes examples of encoding a sentence into a format understandable by the model and then decoding it back into human-readable text.

Summarization Techniques:

The script explores different techniques for summarization, including zero-shot inference (where the model summarizes without prior examples) and few-shot inference (where the model uses a few examples to understand the context before summarizing).

Prompt Engineering:

There is a focus on how to effectively craft prompts to guide the model in generating accurate and relevant summaries.
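The difference between zero-shot and few-shot prompting comes down to how the input string is assembled. The sketch below illustrates that with plain string handling and no model. The template wording ("Dialogue:" / "What was going on?") and the function name `build_prompt` are assumptions made for illustration, not necessarily the exact prompt used in the lab.

```python
def build_prompt(target_dialogue, examples=()):
    """Assemble a summarization prompt.

    examples: iterable of (dialogue, summary) pairs shown to the model first.
    An empty iterable gives a zero-shot prompt; one or more pairs give a
    one-shot or few-shot prompt.
    """
    parts = []
    for dialogue, summary in examples:
        # Each worked example is a complete dialogue plus its answer.
        parts.append(f"Dialogue:\n{dialogue}\n\nWhat was going on?\n{summary}\n")
    # The target dialogue ends with the question and leaves the answer open.
    parts.append(f"Dialogue:\n{target_dialogue}\n\nWhat was going on?\n")
    return "\n".join(parts)


zero_shot = build_prompt("Amanda: I baked cookies. Do you want some?\nJerry: Sure!")
few_shot = build_prompt(
    "Amanda: I baked cookies. Do you want some?\nJerry: Sure!",
    examples=[("Kelly: Lunch at noon?\nAlice: Works for me.",
               "Kelly and Alice agree to have lunch at noon.")],
)
print(zero_shot)
print(few_shot)
```

Because the examples occupy part of the model's input window, adding more of them leaves less room for the dialogue that actually needs summarizing.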

Experimentation with Different Settings:

The code allows for experimentation with different generation configurations (like the number of tokens, sampling temperature) to see how they affect the model's output.

Overall, this lab exercise offers a hands-on experience in using a language model for the practical task of summarizing dialogues, demonstrating the importance of prompt engineering and parameter tuning in generative AI.

###**Lab 2 exercise on producing creative narratives using GPT-2; the dataset used is SAMSum from Hugging Face**

Steps Undertaken
**Initial Setup with GPT-2:**

Used GPT-2 to generate creative, lengthy narratives.


**Adjusting Generation Parameters:**

Experimented with various parameters (like max_length, temperature, top_p, etc.) to improve the relevance and conciseness of GPT-2's outputs.


**Quantitative and Qualitative Evaluation:**

Random Dialogue Selection and Comparison:

Updated the generate_summary function to select a random dialogue from the dataset or use a specific index for generating summaries.
The function also fetches the corresponding human-written summary for direct comparison.

**Challenges and Learnings:**

Encountered challenges with narrative generation, including understanding model outputs and adjusting generation parameters, and was able to tune the parameters to get the expected output.

Learned about the different capabilities and suitability of GPT-2 and BART for specific NLP tasks like summarization.

###**Lab 3 Fine-Tune FLAN-T5 with Reinforcement Learning (PPO) and PEFT to Generate Less-Toxic Summaries**


Brief Explanation:

Model Parameter Analysis:

Functions to calculate the number of trainable parameters in a model.

Dialogue Summarization:

Using pre-trained models (like FLAN-T5) to summarize dialogues from a dataset. This involves generating prompts, tokenizing inputs, and decoding model outputs.

Tokenization for Training:

A function to tokenize dialogues and summaries, preparing them for model training.

Fine-Tuning the Model:

Using Hugging Face's Trainer class to fine-tune the model with the tokenized dataset.

Model Comparison and Evaluation:

Comparing the performance of different models (original, instruct, and PEFT) on the task of dialogue summarization. This involves generating summaries with each model and evaluating them using the ROUGE metric.

PEFT/LoRA Model Fine-Tuning:

Setting up a PEFT/LoRA model with new layer/parameter adapters for fine-tuning, focusing only on the adapter training.
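Training only the adapters is cheap because of simple arithmetic: LoRA keeps a weight matrix of shape `d_out x d_in` frozen and learns two small matrices of shapes `d_out x r` and `r x d_in` instead. The sketch below works through that count. The dimensions (768 x 768) and the rank (32) are illustrative assumptions, not values taken from the lab's configuration.

```python
def lora_added_params(d_out, d_in, rank):
    """Parameters added by one LoRA adapter: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def lora_trainable_fraction(d_out, d_in, rank):
    """Adapter parameters as a fraction of the frozen weight matrix they adapt."""
    return lora_added_params(d_out, d_in, rank) / (d_out * d_in)


# Hypothetical layer size and rank, chosen only to make the arithmetic concrete.
d_out, d_in, rank = 768, 768, 32
added = lora_added_params(d_out, d_in, rank)
fraction = lora_trainable_fraction(d_out, d_in, rank)
print(f"frozen weights: {d_out * d_in}, adapter weights: {added} ({fraction:.2%})")
```

Over a whole model the trainable share is usually smaller still, since adapters are typically attached to only some of the weight matrices.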

Quantitative Evaluation:

Comparing model-generated summaries with human-written baselines using the ROUGE metric to assess the performance improvements of the PEFT model over baseline and fine-tuned models.



Conclusion


This exercise provided practical insights into using advanced NLP models for summarization tasks. It highlighted the importance of choosing the right model for specific tasks (GPT-2 for creative generation vs. BART for summarization) and the significance of fine-tuning generation parameters. Additionally, it underscored the necessity of combining quantitative metrics with qualitative analysis to comprehensively evaluate NLP model outputs.


Note: Below is the Lab2 implementation first and then Lab 1

In [None]:
import transformers

In [None]:
pip install transformers



In [None]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0  --quiet


In [None]:
# Import the necessary libraries
from datasets import load_dataset  # Import the load_dataset function from the datasets library
from transformers import AutoModelForSeq2SeqLM  # Import the AutoModelForSeq2SeqLM class from the transformers library
from transformers import AutoTokenizer  # Import the AutoTokenizer class from the transformers library
from transformers import GenerationConfig  # Import the GenerationConfig class from the transformers library


In [None]:
# Import the GPT2Tokenizer and GPT2LMHeadModel classes from the transformers library
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load the pre-trained GPT2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')



In [None]:
pip install py7zr


In [None]:
# Define the name of the Hugging Face dataset
huggingface_dataset_name = "samsum"

# Load the dataset from Hugging Face
dataset = load_dataset(huggingface_dataset_name, split='train')




In [None]:
# Define a function to generate a summary from a dialogue
def generate_summary(dialogue, max_length=150, no_repeat_ngram_size=2):
    # Preprocess the input dialogue text
    preprocessed_dialogue = "Summarize: " + dialogue

    # Encode the preprocessed dialogue text into tensors
    encoding = tokenizer(preprocessed_dialogue, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Generate the summary using the GPT-2 model
    summary_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=1,
        early_stopping=True
    )

    # Decode the generated summary tokens into a string
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Return the generated summary
    return summary


In [None]:
# Set the pad token to EOS token
tokenizer.pad_token = tokenizer.eos_token


In [None]:
def generate_summary(dialogue, max_length=150, no_repeat_ngram_size=2):
    # Encode the input dialogue text and generate the attention mask
    encoding = tokenizer("Summarize: " + dialogue, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Generate the narrative extension with GPT-2
    summary_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        no_repeat_ngram_size=no_repeat_ngram_size,
        num_return_sequences=1,
        early_stopping=True
    )

    # Decode the generated tokens to a string
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summary

# Generate the summary for a single dialogue
dialogue = 'Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: Ill bring you tomorrow :-)'
generated_summary = generate_summary(dialogue)
print("Generated Summary:", generated_summary)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Generated Summary: Summarize: Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: Ill bring you tomorrow :-)

The next day, Amanda and Jerry went to the store to buy some cookies and then went back to their car to get some more cookies for the kids. They were so excited that they went out to eat and they were like, "Oh my God, I'm so happy!"
.
 (The kids were eating cookies.) Amanda was like "I'm going to go to a party and I want to have a cookie for you!" Jerry was just like that. He was so proud of them. Amanda said,
"I love you so much, Jerry. I love your cookies."
, and


Define the generate_summary function: This function takes a dialogue as input and generates a summary of the dialogue. It has additional parameters for controlling the generation process:

temperature: Adjusts the randomness of the generated text by scaling the probability distribution over tokens.

top_p: Applies nucleus sampling, keeping only the smallest set of most probable tokens whose cumulative probability reaches top_p.

top_k: Restricts sampling to the k most probable tokens at each step.

no_repeat_ngram_size: Prevents repetition of n-grams (sequences of n consecutive tokens) to avoid redundancy.

Preprocess the input dialogue: The dialogue is preprocessed by adding the prefix "Provide a brief summary of the following dialogue: " to guide the model.

Encode the dialogue: The preprocessed dialogue is encoded into tensors using the tokenizer.

Generate the summary: The summary is generated using the model's generate method, passing the encoded dialogue, attention mask, and adjusted parameters.

Decode the generated summary: The generated summary tokens are decoded back into a string using the tokenizer.

Return the summary: The generated summary is returned as a string.

Generate and print the summary for a single dialogue: An example dialogue is defined, and its summary is generated using the generate_summary function. The generated summary is then printed to the console.
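To make the sampling parameters described above more concrete, here is a small model-free sketch of what temperature, top_k and top_p do to a next-token distribution. It is a simplified illustration in plain Python, not the transformers implementation, and the function names and the toy logits are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw scores to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(kept):
    """Rescale the surviving probabilities so they sum to 1."""
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Return {token_index: probability} after top-k, then top-p filtering."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most probable tokens
    kept = renormalize({i: probs[i] for i in ranked})
    nucleus, cumulative = {}, 0.0
    for i in ranked:
        # Add tokens in order of probability until the mass reaches top_p.
        nucleus[i] = kept[i]
        cumulative += kept[i]
        if cumulative >= top_p:
            break
    return renormalize(nucleus)


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four tokens
probs = softmax_with_temperature(toy_logits, temperature=0.7)
print(filter_top_k_top_p(probs, top_k=50, top_p=0.85))
```

A sampler would then draw the next token from the returned distribution. Without sampling, the decoder simply picks the most probable token, and none of these three settings changes the result.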

In [None]:
# Define a function to generate a summary from a dialogue
def generate_summary(dialogue, max_length=150, temperature=0.7, top_p=0.85, top_k=50, no_repeat_ngram_size=1):
    # Preprocess the input dialogue text
    preprocessed_dialogue = "Provide a brief summary of the following dialogue: " + dialogue

    # Encode the preprocessed dialogue text into tensors
    encoding = tokenizer(preprocessed_dialogue, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    input_ids = encoding['input_ids']
    attention_mask = encoding['attention_mask']

    # Generate the text using the GPT-2 model with adjusted parameters.
    # Note: temperature, top_p and top_k only take effect when do_sample=True is passed.
    # As written (do_sample left at its default), generate() decodes greedily and ignores them.
    summary_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,
        max_length=max_length,
        temperature=temperature,  # Scales the logits before sampling (lower = less random)
        top_p=top_p,  # Nucleus sampling: keep the smallest token set whose cumulative probability reaches top_p
        top_k=top_k,  # Keep only the k most probable tokens at each step
        no_repeat_ngram_size=no_repeat_ngram_size,  # Prevents repetition of n-grams
        num_return_sequences=1,  # Generates only one sequence
        early_stopping=True  # Only relevant to beam search; no effect with a single beam
    )

    # Decode the generated summary tokens into a string
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Return the generated summary
    return summary

# Generate the summary for a single dialogue
dialogue = 'Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: Ill bring you tomorrow :-)'
generated_summary = generate_summary(dialogue)
print("Generated Summary:", generated_summary)


Generated Summary: Provide a brief summary of the following dialogue. Do you want some cookies? Jerry: Sure! Amanda, Ill bring them tomorrow :-) "I'm sorry I didn't make it in time for dinner," she says to her husband after he had already eaten all his chocolate-covered walnuts and  the cookie crusts were left on their plate by mistake when they got home from work that day.""Oh my gawd!" is what comes out as an exasperated look at your son's reaction before asking him if there are any more candies?" He responds with:"Yes", then asks whether or not 'they' would like another batch - which will be made later this afternoon (or even today). The conversation continues


In [None]:
def lexical_diversity(text):
    words = text.split()
    unique_words = set(words)
    return len(unique_words) / len(words) if words else 0


In [None]:
# Define a function to calculate the lexical diversity of a text
def lexical_diversity(text):
    # Split the text into words
    words = text.split()

    # Create a set of unique words from the word list
    unique_words = set(words)

    # Check if the word list is empty
    if words:
        # Calculate the lexical diversity by dividing the number of unique words by the total number of words
        diversity = len(unique_words) / len(words)
        return diversity
    else:
        # If the word list is empty, return 0
        return 0

# Original text
original_text = 'Amanda: I baked cookies. Do you want some? Jerry: Sure! Amanda: Ill bring you tomorrow :-)'

# Generated summary
generated_summary = ("Provide a brief summary of the following dialogue. Do you want some cookies? Jerry: Sure! "
                     "Amanda, Ill bring them tomorrow :-) 'I'm sorry I didn't make it in time for dinner,' she says "
                     "to her husband after he had already eaten all his chocolate-covered walnuts and the cookie crusts "
                     "were left on their plate by mistake when they got home from work that day. 'Oh my gawd!' is what "
                     "comes out as an exasperated look at your son's reaction before asking him if there are any more candies? "
                     "'Yes', then asks whether or not 'they' would like another batch - which will be made later this "
                     "afternoon (or even today). The conversation continues")

# Calculate the scores
original_diversity_score = lexical_diversity(original_text)
generated_diversity_score = lexical_diversity(generated_summary)

# Print the scores for comparison
print(f"Lexical Diversity Score for Original Text: {original_diversity_score:.3f}")
print(f"Lexical Diversity Score for Generated Summary: {generated_diversity_score:.3f}")



Lexical Diversity Score for Original Text: 0.875
Lexical Diversity Score for Generated Summary: 0.991


**Results and Evaluation**

####Lexical diversity

Lexical diversity is a measure of how many unique words are used in a given text compared to the total number of words. A higher score indicates a greater variety of words, suggesting rich and varied language use. Scores closer to 1 imply almost no repetition of words, which is rare in natural language, especially in longer texts.

**Analysis of Scores**

Original Text (Score: 0.875):

**Interpretation:** This is a high score, especially for a dialogue-based text, indicating a wide range of vocabulary with minimal repetition. Such diversity is typical in conversational language where different ideas are expressed succinctly.
Context: In dialogues, especially short ones, a higher lexical diversity is expected because each line usually introduces new information or a new aspect of the conversation.

Generated Summary (Score: 0.991):

**Interpretation:** This exceptionally high score indicates almost no repeated words, which is unusual for a text of this length. It is at least partly a by-product of the generation settings: with no_repeat_ngram_size=1 the model is prevented from emitting any token twice, so the variety is enforced by the decoder rather than chosen by the model.

**Implications:**
Creativity and Expansion: If the goal is creative writing or narrative expansion, this high score reflects success in introducing a wide range of concepts and vocabulary, thereby enriching the narrative.

**Deviation from Original Content:** A score this high also implies that the summary has introduced many new elements or concepts, potentially diverging significantly from the original text's content or intent.

**Naturalness and Coherence:** With such a high diversity score, it's important to assess whether the language still feels natural and whether the summary remains coherent and contextually relevant.

Considerations in Interpretation

**Purpose of the Text:** The appropriateness of the scores depends on the intended purpose of the generated text. For concise summaries, a lower lexical diversity might be expected as the text condenses existing information. For creative expansions, a higher diversity is a positive attribute.
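When reading the score for the generated text, it helps to see how n-gram blocking works, since the generation call above used no_repeat_ngram_size=1. The sketch below is a simplified, model-free illustration of the idea. The function name `banned_next_tokens` and the toy token list are assumptions, and real decoders apply this to token ids rather than whitespace-separated words.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`.

    A decoder using n-gram blocking gives these tokens zero probability
    at the next step.
    """
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the prefix the next token would extend.
    prefix = tuple(tokens[len(tokens) - (n - 1):])
    banned = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i:i + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = ["I", "baked", "cookies", "I"]  # toy sequence generated so far
print(banned_next_tokens(history, 1))  # with n=1, every token already used is banned
print(banned_next_tokens(history, 2))  # with n=2, only tokens that followed "I" before
```

With n=1 nothing may ever be repeated, which pushes a unique-words-over-total-words ratio towards 1 regardless of how coherent the text is.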



**LAB 1**

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from transformers import GenerationConfig

# Load the pre-trained BART model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/bart-base')
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-base')

In [None]:
import random

def generate_summary(index=None, max_length=50, min_length=10, length_penalty=2.0, num_beams=4):
    # Select a random dialogue if no index is provided
    if index is None:
        index = random.randint(0, len(dataset) - 1)

    # Fetch the dialogue
    dialogue = dataset[index]['dialogue']

    # Encode and generate summary
    input_ids = tokenizer.encode("summarize: " + dialogue, return_tensors='pt', truncation=True, max_length=1024)
    summary_ids = model.generate(input_ids, max_length=max_length, min_length=min_length, length_penalty=length_penalty, num_beams=num_beams, early_stopping=True)

    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Print the dialogue and its summary
    print(f"Input Dialogue {index}: {dialogue}")
    print(f"Generated Summary: {summary}\n")

# Example usage
generate_summary()


Input Dialogue 5468: Juliette: Sup?
Phillip: Going to bed have a fucking long flight tomorrow
Juliette: How long? And to where?
Phillip: 4 hours there and 4 back
Juliette: Yeah that's long
Phillip: Going to Some place in the south west of Central Afrique république. Fucking long
Juliette: ;) Take me with u on this journey ;)
Phillip: What good will you be to me ?
Juliette: Very good
Phillip: Can you describe that?
Juliette: What?
Phillip: Explain
Juliette: I will make ur flight nicer. I will be talking to you all the flight long hahhaha
Phillip: Haha I have a co-pilot to talk to
Juliette: lol ok so I'm not needed lol
Generated Summary: Summarize: Juliette: Sup? Going to bed have a long flight tomorrow. Phillip: 4 hours there and 4 back.



In [None]:
def generate_and_evaluate_summary(index=None, max_length=50, min_length=10, length_penalty=2.0, num_beams=4):
    # Select a random dialogue if no index is provided
    if index is None:
        index = random.randint(0, len(dataset) - 1)

    # Fetch the dialogue and its human-generated summary
    dialogue = dataset[index]['dialogue']
    reference_summary = dataset[index]['summary']  # Assuming 'summary' is the key for human-generated summary

    # Encode and generate summary
    input_ids = tokenizer.encode("summarize: " + dialogue, return_tensors='pt', truncation=True, max_length=1024)
    summary_ids = model.generate(input_ids, max_length=max_length, min_length=min_length, length_penalty=length_penalty, num_beams=num_beams, early_stopping=True)

    generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Print the dialogue, generated summary, and human-generated summary
    print(f"Input Dialogue {index}: {dialogue}")
    print(f"Generated Summary: {generated_summary}")
    print(f"Human-Generated Summary: {reference_summary}\n")

    # Evaluate the summary (evaluate_summary is defined in a later cell, which must be run first)
    evaluate_summary(generated_summary, reference_summary)

# Example usage
generate_and_evaluate_summary(5468)


Input Dialogue 5468: Juliette: Sup?
Phillip: Going to bed have a fucking long flight tomorrow
Juliette: How long? And to where?
Phillip: 4 hours there and 4 back
Juliette: Yeah that's long
Phillip: Going to Some place in the south west of Central Afrique république. Fucking long
Juliette: ;) Take me with u on this journey ;)
Phillip: What good will you be to me ?
Juliette: Very good
Phillip: Can you describe that?
Juliette: What?
Phillip: Explain
Juliette: I will make ur flight nicer. I will be talking to you all the flight long hahhaha
Phillip: Haha I have a co-pilot to talk to
Juliette: lol ok so I'm not needed lol
Generated Summary: Summarize: Juliette: Sup? Going to bed have a long flight tomorrow. Phillip: 4 hours there and 4 back.
Human-Generated Summary: Phillip has a long flight to Central Afrique République tomorrow. Juliete wants to go with him, but he has a co-pilot to talk to.



In [None]:
pip install rouge-score



In [None]:
from rouge_score import rouge_scorer

def evaluate_summary(generated_summary, reference_summary):
    # Initialize the ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    # Compute the scores
    scores = scorer.score(reference_summary, generated_summary)

    # Print the scores
    print("ROUGE-1: ", scores['rouge1'].fmeasure)
    print("ROUGE-2: ", scores['rouge2'].fmeasure)
    print("ROUGE-L: ", scores['rougeL'].fmeasure)

# Your summaries
generated_summary = "Juliette: Sup? Going to bed have a long flight tomorrow. Phillip: 4 hours there and 4 back."
reference_summary = "Phillip has a long flight to Central Afrique République tomorrow. Juliette wants to go with him, but he has a co-pilot to talk to."

# Evaluate the summaries
evaluate_summary(generated_summary, reference_summary)


ROUGE-1:  0.372093023255814
ROUGE-2:  0.09756097560975609
ROUGE-L:  0.186046511627907


The ROUGE scores obtained are measures of the overlap between the generated summary and the human-generated summary, with each variant of ROUGE focusing on a different aspect of the texts. Let's analyze each of the scores:

**ROUGE-1 (0.372093023255814):**

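This score measures the overlap of unigrams (individual words) between the generated and reference summaries. The value reported here is the F-measure, the harmonic mean of unigram precision (the share of generated words that appear in the reference) and recall (the share of reference words that appear in the generated summary). A ROUGE-1 score of approximately 0.372 therefore suggests a moderate level of word-level overlap, rather than meaning that exactly 37% of the generated words appear in the reference.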

While this indicates some level of similarity, there's still a significant portion of the content that doesn't overlap, suggesting differences in the specific words chosen or details included in each summary.

**ROUGE-2 (0.09756097560975609):**

ROUGE-2 measures the overlap of bigrams (pairs of consecutive words). A score of about 0.098 indicates a low degree of overlap in terms of consecutive word pairs between the two summaries. This lower score compared to ROUGE-1 suggests that the summaries have less similarity in their phrasing and sentence structure.

In other words, even though some individual words are the same, the way they are combined into phrases differs significantly.

**ROUGE-L (0.186046511627907):** This score is based on the longest common subsequence between the generated and reference summaries, that is, the longest sequence of words that appears in both summaries in the same order, though not necessarily contiguously.

A score of approximately 0.186 indicates a limited degree of overlap in terms of longer sequences of words. This suggests that the overall structure and flow of the generated summary might differ significantly from that of the human-generated summary.

In summary, these scores reflect a certain level of basic vocabulary overlap between the generated and reference summaries (as indicated by ROUGE-1), but a notable difference in more complex aspects such as phrasing, structure, and the sequencing of ideas (as indicated by the lower ROUGE-2 and ROUGE-L scores). These insights could be used to further refine the summarization model, potentially focusing on producing summaries that paraphrase the dialogue the way the human reference does, instead of copying fragments of it.
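To show where the precision, recall and F-measure behind these ROUGE figures come from, here is a minimal from-scratch sketch. It is a simplification under stated assumptions: it tokenizes on whitespace and does no stemming, so its numbers will not exactly match those of the rouge_score package, and the example sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def precision_recall_f(overlap, candidate_total, reference_total):
    """Turn an overlap count into (precision, recall, F-measure)."""
    p = overlap / candidate_total if candidate_total else 0.0
    r = overlap / reference_total if reference_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def rouge_n(candidate, reference, n=1):
    """ROUGE-N: clipped n-gram overlap between candidate and reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand & ref).values())  # each n-gram counts at most as often as in both
    return precision_recall_f(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence (order kept, gaps allowed)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L: scores based on the longest common subsequence."""
    c, r = candidate.lower().split(), reference.lower().split()
    return precision_recall_f(lcs_length(c, r), len(c), len(r))


candidate = "the cat lay on the mat"   # invented example
reference = "the cat sat on the mat"   # invented example
print("ROUGE-1:", rouge_n(candidate, reference, 1))
print("ROUGE-2:", rouge_n(candidate, reference, 2))
print("ROUGE-L:", rouge_l(candidate, reference))
```

Changing one word breaks two bigrams but only one unigram, which is why ROUGE-2 drops faster than ROUGE-1 as the phrasing diverges.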

In [None]:
!pip install bert_score


Collecting bert_score
  Downloading bert_score-0.3.13-py3-none-any.whl (61 kB)
Installing collected packages: bert_score
Successfully installed bert_score-0.3.13

In [None]:
from bert_score import score

def evaluate_with_bertscore(candidates, references):
    P, R, F1 = score(candidates, references, lang="en", verbose=True)
    print(f"BERTScore Precision: {P.mean()}")
    print(f"BERTScore Recall: {R.mean()}")
    print(f"BERTScore F1: {F1.mean()}")

# Example usage
candidates = ["Juliette: Sup? Going to bed have a long flight tomorrow. Phillip: 4 hours there and 4 back."]
references = ["Phillip has a long flight to Central Afrique République tomorrow. Juliette wants to go with him, but he has a co-pilot to talk to."]
evaluate_with_bertscore(candidates, references)


Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


  0%|          | 0/1 [00:00<?, ?it/s]

computing greedy matching.


  0%|          | 0/1 [00:00<?, ?it/s]

done in 0.85 seconds, 1.17 sentences/sec
BERTScore Precision: 0.8556202054023743
BERTScore Recall: 0.8545243740081787
BERTScore F1: 0.8550719618797302


BERTScore is a metric for assessing generated text against a reference text. Instead of counting exact word matches, it compares contextual token embeddings (here from roberta-large) using cosine similarity, and reports Precision, Recall, and F1:

1. **BERTScore Precision (0.8556)**: Each token in the candidate (generated) text is matched to its most similar token in the reference, and the similarities are averaged. The value 0.8556 is an average cosine similarity, not a statement that 85.56% of tokens match. Higher precision means that more of what the generated text says is supported by the reference.

2. **BERTScore Recall (0.8545)**: Each token in the reference is matched to its most similar token in the candidate, and the similarities are averaged. Higher recall means that the generated text covers more of the content of the reference.

3. **BERTScore F1 (0.8551)**: The F1 score is the harmonic mean of precision and recall, giving a single balanced figure.

In summary, BERTScore indicates more semantic similarity between the generated and reference summaries than the low ROUGE-2 and ROUGE-L scores suggest, because it rewards paraphrase rather than exact overlap. The absolute values should be read with care: raw BERTScore values tend to fall in a narrow, high range, so 0.855 is best interpreted relative to the scores of other summaries or models rather than as evidence of high quality on its own.
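The greedy matching behind BERTScore precision and recall can be sketched without a language model. In the example below, the two-dimensional vectors stand in for contextual token embeddings and are purely hypothetical. The sketch also leaves out optional extras such as idf weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using best-match cosine similarity.

    Precision averages, over candidate tokens, the similarity to the closest
    reference token. Recall does the same in the other direction.
    """
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy "embeddings": one vector per token.
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0], [1.0, 1.0]]
print(greedy_match_scores(candidate_vecs, reference_vecs))
```

Because every token is paired with its best match, even loosely related texts score well above zero, which is one reason raw values cluster in a high range.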

In [None]:
def generate_summary(index=None, max_length=50, min_length=10, length_penalty=2.0, num_beams=4):
    # Select a random dialogue if no index is provided
    if index is None:
        index = random.randint(0, len(dataset) - 1)

    # Fetch the dialogue
    dialogue = dataset[index]['dialogue']

    # Encode and generate summary
    input_ids = tokenizer.encode("summarize: " + dialogue, return_tensors='pt', truncation=True, max_length=1024)
    summary_ids = model.generate(input_ids, max_length=max_length, min_length=min_length, length_penalty=length_penalty, num_beams=num_beams, early_stopping=True)

    generated_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Retrieve human-written summary for comparison
    human_summary = dataset[index]['summary']  # Assuming 'summary' is the field name

    # Print the dialogue, generated summary, and human-written summary
    print(f"Input Dialogue {index}: {dialogue}")
    print(f"Generated Summary: {generated_summary}")
    print(f"Human-Written Summary: {human_summary}\n")

# Example usage
generate_summary(index=8697)  # Replace with the desired index


Input Dialogue 8697: Alice: I've just watched Eat pray love. It's amazing! 
Kelly: the one with julia roberts? 
Alice: yeah! you have to watch it! it's rated 3 stars out of 5 but you gonna love it!
Kelly: it's quite old, isn't it? 2012? 
Alice: 2010 actually but it doesn't matter. it's a must!
Kelly: yeah, sure, thanks hon ;) 
Generated Summary: "Eat pray love" is rated 3 stars out of 5 but you gonna love it, says Alice.
Human-Written Summary: Alice advises Kelly to watch a 2010 movie "Eat pray love". 



**LAB 3**

In [None]:
%pip install --upgrade pip
%pip install --disable-pip-version-check \
    torch==1.13.1 \
    torchdata==0.5.1 --quiet

%pip install \
    transformers==4.27.2 \
    datasets==2.11.0 \
    evaluate==0.4.0 \
    rouge_score==0.1.2 \
    peft==0.3.0 --quiet

# Installing the Reinforcement Learning library directly from github.
%pip install git+https://github.com/lvwerra/trl.git@25fa1bd

Collecting pip
  Downloading pip-23.3.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-23.3.1

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, AutoModelForSeq2SeqLM, GenerationConfig
from datasets import load_dataset
from peft import PeftModel, PeftConfig, LoraConfig, TaskType

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

import torch
import evaluate

import numpy as np
import pandas as pd

# tqdm library makes the loops show a smart progress meter.
from tqdm import tqdm
tqdm.pandas()

In [None]:
# py7zr is needed to unpack the 7z archive that the SAMSum dataset is distributed in.
%pip install py7zr

Collecting py7zr
  Downloading py7zr-0.20.8-py3-none-any.whl.metadata (16 kB)
Collecting texttable (from py7zr)
  Downloading texttable-1.7.0-py2.py3-none-any.whl.metadata (9.8 kB)
Collecting pycryptodomex>=3.16.0 (from py7zr)
  Downloading pycryptodomex-3.19.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting pyzstd>=0.15.9 (from py7zr)
  Downloading pyzstd-0.15.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting pyppmd<1.2.0,>=1.1.0 (from py7zr)
  Downloading pyppmd-1.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.7 kB)
Collecting pybcj<1.1.0,>=1.0.0 (from py7zr)
  Downloading pybcj-1.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting multivolumefile>=0.2.3 (from py7zr)
  Downloading multivolumefile-0.2.3-py3-none-any.whl (17 kB)
Collecting inflate64<1.1.0,>=1.0.0 (from py7zr)
  Downloading inflate64-1.0.0-cp310-cp310-manylinux_2_17_x86_64.man

In [None]:
model_name = "facebook/bart-large-cnn"
huggingface_dataset_name = "samsum"


dataset = load_dataset(huggingface_dataset_name)

dataset

Downloading and preparing dataset samsum/samsum to /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e...


Downloading data:   0%|          | 0.00/2.94M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/14732 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/819 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/818 [00:00<?, ? examples/s]

Dataset samsum downloaded and prepared to /root/.cache/huggingface/datasets/samsum/samsum/0.0.0/f1d7c6b7353e6de335d444e424dc002ef70d1277109031327bc9cc6af5d3d46e. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14732
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
})

In [None]:
!pip install transformers

In [None]:
from transformers import BartForConditionalGeneration


In [None]:
from transformers import BartTokenizer

In [None]:
# Load the model and tokenizer
model_name = 'facebook/bart-large-cnn'
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

In [None]:
def preprocess_example(example):
    # SAMSum stores each dialogue as one string; join defensively in case a list is passed
    dialogue = example['dialogue']
    dialogue_str = ' '.join(dialogue) if isinstance(dialogue, list) else dialogue

    # Tokenize the prompt and the reference summary into plain Python lists (no tensors, no padding yet)
    input_ids = tokenizer.encode("Summarize this dialogue: " + dialogue_str, truncation=True, padding=False, max_length=1024)
    labels = tokenizer.encode(example['summary'], truncation=True, padding=False, max_length=128)

    # With padding=False no pad tokens are produced here, so this mapping is only a safeguard;
    # the actual -100 label padding happens later in the DataLoader's collate_fn
    labels = [label if label != tokenizer.pad_token_id else -100 for label in labels]

    return {'input_ids': input_ids, 'labels': labels}
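
The value -100 is used for label padding because PyTorch's cross-entropy loss skips any target equal to its `ignore_index`, which defaults to -100. The sketch below reproduces that idea in plain Python on made-up scores; `masked_cross_entropy` and the toy numbers are illustrative assumptions and are not part of the lab code.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over positions whose label is not ignore_index.

    logits: list of per-position score lists (one score per vocabulary id)
    labels: list of target ids, with ignore_index marking padded positions
    """
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # padded position: contributes nothing to the loss
        # Negative log-softmax of the target id, computed in a numerically stable way
        max_score = max(scores)
        log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_norm - scores[label]
        counted += 1
    return total / counted if counted else 0.0


# Toy example: vocabulary of 3 ids, 3 positions, last position is padding.
toy_logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [9.0, 9.0, 9.0]]
toy_labels = [0, 1, IGNORE_INDEX]

loss_with_padding = masked_cross_entropy(toy_logits, toy_labels)
loss_without_padding = masked_cross_entropy(toy_logits[:2], toy_labels[:2])
print(loss_with_padding, loss_without_padding)
```

Both printed values should match, since the padded position is skipped no matter what scores the model assigns there.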



In [None]:
# Apply preprocessing to the dataset
processed_dataset = dataset.map(preprocess_example, batched=False)

# Check if preprocessing was successful
print(processed_dataset['train'][0])  # Replace 'train' with the correct split name if different



Map:   0%|          | 0/14732 [00:00<?, ? examples/s]

Map:   0%|          | 0/819 [00:00<?, ? examples/s]

Map:   0%|          | 0/818 [00:00<?, ? examples/s]

{'id': '13818513', 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)", 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.', 'input_ids': [0, 38182, 3916, 2072, 42, 6054, 35, 10641, 35, 38, 17241, 1437, 15269, 4, 1832, 47, 236, 103, 116, 50121, 50118, 39237, 35, 9136, 328, 50121, 50118, 10127, 5219, 35, 38, 581, 836, 47, 3859, 48433, 2], 'labels': [0, 10127, 5219, 17241, 15269, 8, 40, 836, 6509, 103, 3859, 4, 2]}
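
The printed record shows that a SAMSum dialogue is a single string whose turns are separated by `\r\n`, each in `Speaker: utterance` form. When inspecting the data it can help to split a dialogue back into turns. The helper below is a hypothetical sketch for exploration only; `split_turns` is not used by the lab's preprocessing.

```python
def split_turns(dialogue):
    """Split a SAMSum-style dialogue string into (speaker, utterance) pairs."""
    turns = []
    for line in dialogue.splitlines():  # handles both "\r\n" and "\n"
        line = line.strip()
        if not line:
            continue
        speaker, sep, utterance = line.partition(":")
        if not sep:
            # No "Speaker:" prefix; treat the line as a continuation of the previous turn.
            if turns:
                prev_speaker, prev_text = turns[-1]
                turns[-1] = (prev_speaker, f"{prev_text} {line}")
            continue
        # Collapse repeated whitespace inside the utterance.
        turns.append((speaker.strip(), " ".join(utterance.split())))
    return turns


example = "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
for speaker, text in split_turns(example):
    print(f"{speaker} -> {text}")
```

Only the first colon on each line is treated as the speaker separator, so emoticons such as `:-)` stay inside the utterance.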


In [None]:
import torch
from torch.utils.data import DataLoader

from torch.nn.utils.rnn import pad_sequence

In [None]:
from torch.nn.utils.rnn import pad_sequence

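def collate_fn(batch):
    # Pad inputs with the tokenizer's pad token; pad labels with -100 so the loss ignores those positions
    input_ids = pad_sequence([torch.tensor(item['input_ids']) for item in batch], batch_first=True, padding_value=tokenizer.pad_token_id)
    labels = pad_sequence([torch.tensor(item['labels']) for item in batch], batch_first=True, padding_value=-100)
    # Mark real tokens with 1 and padding with 0 so the encoder does not attend to padding
    attention_mask = (input_ids != tokenizer.pad_token_id).long()
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}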

# Now create the DataLoader with the collate_fn
train_loader = DataLoader(processed_dataset['train'], batch_size=4, collate_fn=collate_fn, shuffle=True)
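
Padding only works as intended when the model is also told which positions are padding, which is the job of an attention mask (1 for real tokens, 0 for padding). The plain-Python sketch below shows the padding and mask logic on two toy examples. `pad_batch` is a hypothetical helper, and `pad_id=1` is an assumption taken to match the BART tokenizer's pad token ID.

```python
def pad_batch(batch, pad_id=1, label_pad_id=-100):
    """Right-pad each field to the longest sequence in the batch."""
    max_inputs = max(len(item["input_ids"]) for item in batch)
    max_labels = max(len(item["labels"]) for item in batch)

    input_ids, attention_mask, labels = [], [], []
    for item in batch:
        n_in = len(item["input_ids"])
        n_lab = len(item["labels"])
        input_ids.append(item["input_ids"] + [pad_id] * (max_inputs - n_in))
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * n_in + [0] * (max_inputs - n_in))
        # Label padding uses -100 so the loss skips those positions
        labels.append(item["labels"] + [label_pad_id] * (max_labels - n_lab))
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


toy_batch = [
    {"input_ids": [0, 38182, 3916, 2], "labels": [0, 10127, 2]},
    {"input_ids": [0, 38182, 2], "labels": [0, 10127, 5219, 17241, 2]},
]
padded = pad_batch(toy_batch)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```

Inputs and labels are padded independently, so the two fields can end up with different widths within the same batch.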



In [None]:
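from torch.optim import AdamW
from transformers import BartForConditionalGeneration

# Load the pre-trained checkpoint
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
# model.to(device)  # Uncomment to move the model to the GPU once `device` is defined
# transformers.AdamW is deprecated in favour of torch.optim.AdamW;
# weight_decay=0.0 keeps the default of the deprecated class
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.0)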




In [None]:
def build_dataset(model_name,
                  dataset_name,
                  input_min_text_length,
                  input_max_text_length):

    """
    Preprocess the dataset and split it into train and test parts.

    Parameters:
    - model_name (str): Tokenizer model name.
    - dataset_name (str): Name of the dataset to load.
    - input_min_text_length (int): Minimum length of the dialogues.
    - input_max_text_length (int): Maximum length of the dialogues.

    Returns:
    - dataset_splits (datasets.dataset_dict.DatasetDict): Preprocessed dataset containing train and test parts.
    """

    # load dataset (only "train" part will be enough for this lab).
    dataset = load_dataset(dataset_name, split="train")

    # Filter the dialogues of length between input_min_text_length and input_max_text_length characters.
    dataset = dataset.filter(lambda x: len(x["dialogue"]) > input_min_text_length and len(x["dialogue"]) <= input_max_text_length, batched=False)

    # Prepare the tokenizer. Tokenizers run on the CPU, so no device placement is needed here.
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(sample):

        # Wrap each dialogue with the instruction.
        prompt = f"""
Summarize the following conversation.

{sample["dialogue"]}

Summary:
"""
        sample["input_ids"] = tokenizer.encode(prompt)

        # This must be called "query", which is a requirement of our PPO library.
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    # Tokenize each dialogue.
    dataset = dataset.map(tokenize, batched=False)
    dataset.set_format(type="torch")

    # Split the dataset into train and test parts.
    dataset_splits = dataset.train_test_split(test_size=0.2, shuffle=False, seed=42)

    return dataset_splits

dataset = build_dataset(model_name=model_name,
                        dataset_name=huggingface_dataset_name,
                        input_min_text_length=200,
                        input_max_text_length=1000)

print(dataset)



Filter:   0%|          | 0/14732 [00:00<?, ? examples/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Map:   0%|          | 0/9814 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'query'],
        num_rows: 7851
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary', 'input_ids', 'query'],
        num_rows: 1963
    })
})
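
The counts in this output follow from the two steps inside `build_dataset`: the character-length filter keeps 9,814 of the 14,732 training dialogues, and the unshuffled 80/20 split then yields 7,851 and 1,963 rows. The sketch below reproduces that logic in plain Python on a toy list. `filter_by_length` and `ordered_split` are hypothetical helpers, and rounding the test size up is an assumption chosen because it matches the numbers shown.

```python
import math


def filter_by_length(dialogues, min_chars, max_chars):
    """Keep dialogues with min_chars < len(dialogue) <= max_chars."""
    return [d for d in dialogues if min_chars < len(d) <= max_chars]


def ordered_split(items, test_size=0.2):
    """Split without shuffling: the last block becomes the test set."""
    n_test = math.ceil(test_size * len(items))  # assumption: test size rounds up
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]


# Toy data: strings of increasing length stand in for dialogues.
toy_dialogues = ["x" * n for n in (50, 200, 201, 600, 1000, 1001)]
kept = filter_by_length(toy_dialogues, min_chars=200, max_chars=1000)
train, test = ordered_split(kept)
print([len(d) for d in kept], len(train), len(test))

# Same arithmetic applied to the filtered row count reported above.
n_test = math.ceil(0.2 * 9814)
print(9814 - n_test, n_test)
```

Because the lower bound is strict and the upper bound is inclusive, a dialogue of exactly 200 characters is dropped while one of exactly 1,000 characters is kept.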
