## LLM Evaluation: 4 Methods
-   *Perplexity*
    -   Perplexity measures how well a language model predicts a piece of text: it is the exponential of the average negative log-likelihood per token. Lower values mean the model found the text less surprising.
-   *BLEU* (Bilingual Evaluation Understudy)
    -   BLEU scores the generated text by its n-gram precision against the reference text (the share of generated n-grams that also occur in the reference), multiplied by a brevity penalty for outputs shorter than the reference.
-   *ROUGE* (Recall-Oriented Understudy for Gisting Evaluation)
    -   ROUGE is a set of metrics for evaluating the quality of summaries. It compares the generated summary with one or more reference summaries by n-gram or longest-common-subsequence overlap and reports **precision**, **recall** and **F1 score**.
-   *Diversity*
    -   Measures the variety and uniqueness of the generated responses, using metrics such as n-gram diversity or the semantic similarity between generated responses.
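
Diversity is the only metric in this list that is not computed later in the notebook. As a rough illustration, n-gram diversity is often reported as "distinct-n": the number of unique n-grams divided by the total number of n-grams. The sketch below assumes whitespace tokenization and lowercasing; the function name `distinct_n` is made up for this example and is not part of `evals`.

```python
def distinct_n(texts, n=2):
    """Share of unique n-grams among all n-grams in a list of texts.

    Assumes simple whitespace tokenization; returns 0.0 when no n-gram exists.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        # Slide a window of size n over the tokens
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# A repetitive text scores lower than a varied one
print(distinct_n(["a b a b"], n=1))  # 2 unique unigrams out of 4
print(distinct_n(["a b c d"], n=1))  # 4 unique unigrams out of 4
```

Values close to 1 indicate little repetition; values close to 0 indicate that the same n-grams keep recurring.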

## TextGen Models:
1.  *GPT-2* (OpenAI)
    -   GPT-2 is a transformer-based language model. Its largest variant has 1.5B parameters; the `gpt2` checkpoint used below is the smallest variant, with 124M parameters.
    -   GPT-2 was trained with a causal language modeling objective and is therefore strong at predicting the next token in a sequence
2.  *Phi-1* (Microsoft)
    -   A 1.3B-parameter decoder-only language model specialized for Python code generation
3.  *Distil-GPT2*
    -   A distilled version of GPT-2 with only 82 million parameters
4.  *T5-Base*
    -   Developed by Google
    -   Has 220 million parameters

In [6]:
import pickle 
import pandas as pd
import nltk
import sacrebleu
from rouge_score import rouge_scorer
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline
from evals import *

In [7]:
import pickle

# Load all the data
medicine_texts = []
politics_texts = []
sports_texts = []

for i in range(4):
    with open(f"medicine_text/medicine_text{i}.txt", "rb") as file:
        medicine_texts.append(pickle.load(file))

for i in range(4):
    with open(f"politics_text/politics_text{i}.txt", "rb") as file:
        politics_texts.append(pickle.load(file))

for i in range(4):
    with open(f"sports_text/sports_text{i}.txt", "rb") as file:
        sports_texts.append(pickle.load(file))


In [8]:
print(medicine_texts[0]['df'].columns)

Index(['paragraphs'], dtype='object')


In [9]:
#GPT-2
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(32)

gpt_2_medicine_texts = []
gpt_2_politics_texts = []
gpt_2_sports_texts = []

for medicine_text in medicine_texts:
    # Assuming 'headline' is a key in your dictionary
    generated_sequence = generator(f'Tell me about {medicine_text["headline"]}', max_length=60, num_return_sequences=1)
    gpt_2_medicine_texts.append(generated_sequence[0]['generated_text'])

for politics_text in politics_texts:
    # Assuming 'headline' is a key in your dictionary
    generated_sequence = generator(f'Tell me about {politics_text["headline"]}', max_length=60, num_return_sequences=1)
    gpt_2_politics_texts.append(generated_sequence[0]['generated_text'])

for sports_text in sports_texts:
    # Assuming 'headline' is a key in your dictionary
    generated_sequence = generator(f'Tell me about {sports_text["headline"]}', max_length=60, num_return_sequences=1)
    gpt_2_sports_texts.append(generated_sequence[0]['generated_text'])





Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [10]:
print(gpt_2_medicine_texts)

['Tell me about doctors with limited vacation have increased burnout risk? In response to this question, there is one answer. In a small study of 23 physicians with long-term chronic illness from five states, physicians were asked if they received medical treatment for the absence of an acute episode of post-traumatic', "Tell me about obstructive sleep apnea may promote early bone loss?\n\nA lot of studies show that this syndrome often develops, but few studies have identified a link between sleep apnea and bone loss in older adults. The current understanding of this topic is complex - perhaps it's due to some", 'Tell me about  fruit juices watch out for the impact on weight loss as a change will be needed, there is plenty more to do but a good way to keep at it is to try to find a low calorie meal that will allow you to keep up. It will make that happen. The goal', 'Tell me about healthcare workers face increased risks during the pandemic. We have received hundreds of reports and inve

In [11]:
#DistilGPT-2
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='distilgpt2')
set_seed(32)

distilgpt_2_medicine_texts = []
distilgpt_2_politics_texts = []
distilgpt_2_sports_texts = []

for medicine_text in medicine_texts:
    # Assuming 'headline' is a key in your dictionary
    generated_sequence = generator(f'Tell me about {medicine_text["headline"]}', max_length=60, num_return_sequences=1)
    distilgpt_2_medicine_texts.append(generated_sequence[0]['generated_text'])

for politics_text in politics_texts:
    # Assuming 'headline' is a key in your dictionary
    generated_sequence = generator(f'Tell me about {politics_text["headline"]}', max_length=60, num_return_sequences=1)
    distilgpt_2_politics_texts.append(generated_sequence[0]['generated_text'])

for sports_text in sports_texts:
    # Assuming 'headline' is a key in your dictionary
    generated_sequence = generator(f'Tell me about {sports_text["headline"]}', max_length=60, num_return_sequences=1)
    distilgpt_2_sports_texts.append(generated_sequence[0]['generated_text'])


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

In [12]:
distilgpt_2_medicine_texts

['Tell me about doctors with limited vacation have increased burnout risk than placebo in that. In fact there is evidence out there to date that the effect of the placebo on brain functioning is greater in older volunteers than in older patients (see Table 2 in Table 3 here). I do not know what the effect',
 'Tell me about obstructive sleep apnea may promote early bone loss and possibly improve sleep quality\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
 'Tell me about  fruit juices watch out for the impact on weight loss and health during my pregnancy. I was eating more and I noticed that a little bit of vitamin D helped to keep a lot of my low calorie intake low while helping people lose weight. Also, a lot of this stuff is made',
 'Tell me about healthcare workers face increased risks during the pandemic. We have a lot more work to perform.”']

In [13]:
#Microsoft phi-1
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.set_default_device("cuda")

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1", trust_remote_code=True)

phi_medicine_texts = []
phi_politics_texts = []
phi_sports_texts = []

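for medicine_text in medicine_texts:
    inputs = tokenizer(f'Tell me about {medicine_text["headline"]}', return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length=200)
    # outputs[0] is a 1-D tensor of token ids, so batch_decode returns one string per token;
    # each appended entry is therefore a list of token strings that must be joined before scoring
    generated_sequence = tokenizer.batch_decode(outputs[0])
    phi_medicine_texts.append(generated_sequence)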
for politics_text in politics_texts:
    inputs = tokenizer(f'Tell me about {politics_text["headline"]}', return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length = 200)
    generated_sequence = tokenizer.batch_decode(outputs[0])
    phi_politics_texts.append(generated_sequence)
for sports_text in sports_texts:
    inputs = tokenizer(f'Tell me about {sports_text["headline"]}', return_tensors="pt", return_attention_mask=False)
    outputs = model.generate(**inputs, max_length = 200)
    generated_sequence = tokenizer.batch_decode(outputs[0])
    phi_sports_texts.append(generated_sequence)



In [14]:
phi_medicine_texts

[['Tell',
  ' me',
  ' about',
  ' doctors',
  ' with',
  ' limited',
  ' vacation',
  ' have',
  ' increased',
  ' burn',
  'out',
  ' risk',
  '."',
  '\n',
  '    ',
  'el',
  'if',
  ' "',
  'limited',
  '"',
  ' in',
  ' doctor',
  '.',
  'lower',
  '()',
  ' and',
  ' "',
  'v',
  'ac',
  'ation',
  '"',
  ' in',
  ' doctor',
  '.',
  'lower',
  '():',
  '\n',
  '        ',
  'return',
  ' "',
  'Tell',
  ' me',
  ' about',
  ' doctors',
  ' with',
  ' limited',
  ' vacation',
  ' have',
  ' increased',
  ' risk',
  '."',
  '\n',
  '    ',
  'else',
  ':',
  '\n',
  '        ',
  'return',
  ' "',
  'Tell',
  ' me',
  ' about',
  ' doctors',
  ' with',
  ' limited',
  ' vacation',
  ' have',
  ' increased',
  ' burn',
  'out',
  ' risk',
  '."',
  '\n\n',
  '<|endoftext|>',
  '\n',
  '\n',
  'from',
  ' typing',
  ' import',
  ' List',
  '\n',
  '\n',
  'def',
  ' count',
  '_',
  'same',
  '_',
  'adj',
  'acent',
  '_',
  'p',
  'airs',
  '(',
  'li',
  ':',
  ' List',
  '[',
 

In [15]:
def combine_text(list_of_text):
    combined_text = ''.join(list_of_text)
    return combined_text
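
Because `''.join` inserts no separator, `combine_text` works well for token lists whose items already carry their leading spaces, but when it is applied to a list of paragraphs the last word of one paragraph fuses with the first word of the next. The sketch below shows the difference on made-up strings; `combine_paragraphs` is a hypothetical alternative, not something the notebook defines.

```python
def combine_text(list_of_text):
    # Same behavior as the notebook helper: items are joined with no separator
    return ''.join(list_of_text)


def combine_paragraphs(paragraphs, sep=' '):
    # Hypothetical alternative: keep a separator so boundary words stay apart
    return sep.join(paragraphs)


paragraphs = ['first paragraph', 'second paragraph']  # made-up example data
print(combine_text(paragraphs))        # first paragraphsecond paragraph
print(combine_paragraphs(paragraphs))  # first paragraph second paragraph
```

Fused boundary words count as unseen tokens for n-gram based metrics, so the choice of separator can slightly change BLEU and ROUGE values.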

In [16]:
medicine_reference_texts = []
for medicine_text in medicine_texts: 
    medicine_reference_texts.append(medicine_text['df']['paragraphs'].apply(lambda x: ''.join(x)))

In [17]:
medicine_reference_texts[0]

0                                                      
1                                                      
2                                         nadine eckert
3                                            january   
4     a recent study sheds light on the heightened r...
5     conducted by the american medical association ...
6     christine a sinsky md study author and senior ...
7     a significant proportion  of respondents repor...
8     doctors who took less vacation and worked duri...
9     administrative tasks though no longer confined...
10    courses and tutorials on ehr inbox management ...
11    many physicians lack coverage for their ehr in...
12    difficulty in finding coverage whether for the...
13    further analysis showed that doctors who took ...
14    however these benefits applied only when physi...
15    the vacation behavior observed in this study l...
16    systemlevel measures must be implemented to en...
17    this article was translated from the medsc

In [18]:
medicine_reference_texts

[0                                                      
 1                                                      
 2                                         nadine eckert
 3                                            january   
 4     a recent study sheds light on the heightened r...
 5     conducted by the american medical association ...
 6     christine a sinsky md study author and senior ...
 7     a significant proportion  of respondents repor...
 8     doctors who took less vacation and worked duri...
 9     administrative tasks though no longer confined...
 10    courses and tutorials on ehr inbox management ...
 11    many physicians lack coverage for their ehr in...
 12    difficulty in finding coverage whether for the...
 13    further analysis showed that doctors who took ...
 14    however these benefits applied only when physi...
 15    the vacation behavior observed in this study l...
 16    systemlevel measures must be implemented to en...
 17    this article was transla

In [19]:
politics_reference_texts = []
for politics_text in politics_texts: 
    politics_reference_texts.append(politics_text['df']['paragraphs'].apply(lambda x: ''.join(x)))

In [31]:
politics_reference_texts

[0    the gender gap is growing between supporters o...
 1    and thats good news for the democratic incumbe...
 2    more women said they would support biden over ...
 3    the numbers were relatively unchanged for men ...
 4    the gender demographic tells a story to keep a...
 5    its a different story for former south carolin...
 6    haleys support comes largely from independents...
 7    in a headtohead matchup against biden haley ou...
 8    the poll surveyed  selfidentified registered v...
 Name: paragraphs, dtype: object,
 0     washington the united states on wednesday attr...
 1     the attribution comes as iran threatened on we...
 2     national security council spokesman john kirby...
 3     kirby said biden was continuing to weigh retal...
 4     kirby dismissed a statement by iraqi militia k...
 5     biden meanwhile is set to attend the somber re...
 6     any additional american strikes could further ...
 7     violence has erupted across the mideast with i...
 8    

In [44]:
sports_reference_texts = []
for sports_text in sports_texts: 
    sports_reference_texts.append(sports_text['df']['paragraphs'].apply(lambda x: ''.join(x)))

In [45]:
# Assuming sports_reference_texts is a list of Series
sports_reference_texts = [series.tolist() for series in sports_reference_texts]
medicine_reference_texts = [series.tolist() for series in medicine_reference_texts]
politics_reference_texts = [series.tolist() for series in politics_reference_texts]
# sports_reference_texts_list is now a list of lists


In [46]:
sports_reference_texts

[['congratulatory messages poured in for indias  tennis star rohan bopanna who scripted a new world record by winning the mens doubles title at the australian open alongside his australian partner matthew ebden with this victory bopanna is now the oldest male player to win a grand slam title breaking the record of dutch player jeanjulien rojer',
  'bopanna won the australian open mens doubles title which also happens to be his maiden mens doubles grand slam triumph by beating the italian pair of simone bolelli and andrea vavassori',
  'following his triumph indians from all fields ranging from prime minister narendra modi to cricketers and authors congratulated bopanna on his remarkable achievement here is how the nation celebrated the victory',
  ' huge applause to rohanbopanna and mattebden for clinching their  grand slam as a duo  mens doubles title rohan bopanna defying odds at  winning his maiden ausopen title ',
  'rohan bopanna legend thats the tweet rohanbopanna ',
  'world num

In [23]:
gpt_2_politics_texts

['Tell me about  gender gap expands between biden and trump new poll shows  gender gap in US political arena: " https://www.washingtonpost.com/post-politics/wp/2017/08/31/the-differences-of-a-gender-gap-between',
 'Tell me about  islamic resistance in iraq group is to blame for jordan drone strike that killed  troops us says  @dalmahazul al-Qarqawi @diaz_bari #ISLAM_INCISORS pic.twitter.com/',
 'Tell me about  biden says us knows how it will respond to jordan attack  b1k: :p we have learned how its going to respond. will this be handled by force. if not. is there an update on this? b2k: :p yes but if not',
 'Tell me about  trump pledges to block us steel sale !!!?? Is what they tell us because these  expensive bnts do not need  to be sold!!!!??\n\n\nIf you will make  this pledge so that  does not come back to  us we can continue to support']

In [25]:
str(politics_reference_texts[0])

'0    the gender gap is growing between supporters o...\n1    and thats good news for the democratic incumbe...\n2    more women said they would support biden over ...\n3    the numbers were relatively unchanged for men ...\n4    the gender demographic tells a story to keep a...\n5    its a different story for former south carolin...\n6    haleys support comes largely from independents...\n7    in a headtohead matchup against biden haley ou...\n8    the poll surveyed  selfidentified registered v...\nName: paragraphs, dtype: object'

In [26]:
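# NOTE: str() of a pandas Series returns its truncated repr (index labels, '...' and the dtype footer),
# so this score is computed against that repr rather than the full article text
perplexity(str(politics_reference_texts[0]), gpt_2_politics_texts[0])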

array([48.6044222])
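
The implementation of `perplexity` in `evals` is not shown here. As an illustration of the idea only, the sketch below fits an add-one smoothed unigram model on the reference text and reports the perplexity of the generated text under it. The function name, the whitespace tokenization and the choice of a unigram model are all assumptions made for this example, so its numbers are not comparable with the value above.

```python
import math
from collections import Counter


def unigram_perplexity(reference, generated):
    """Perplexity of `generated` under an add-one smoothed unigram model of `reference`."""
    ref_tokens = reference.lower().split()
    gen_tokens = generated.lower().split()
    if not gen_tokens:
        raise ValueError("generated text has no tokens")

    counts = Counter(ref_tokens)
    # The vocabulary covers both texts so unseen generated words get non-zero probability
    vocab = set(ref_tokens) | set(gen_tokens)
    denom = len(ref_tokens) + len(vocab)

    # Sum of log-probabilities of the generated tokens
    log_prob = sum(math.log((counts[tok] + 1) / denom) for tok in gen_tokens)
    # Perplexity = exp(-average log-probability per token)
    return math.exp(-log_prob / len(gen_tokens))


# Words that occur in the reference are less "surprising" than unseen ones
print(unigram_perplexity("a a b", "a a") < unigram_perplexity("a a b", "z z"))  # True
```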

In [50]:
bleu = bleu_score(str(combine_text(politics_reference_texts[0])), gpt_2_politics_texts[0])

In [52]:
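# .bp is only the brevity penalty; assuming bleu_score returns a sacrebleu BLEUScore, the BLEU score itself is bleu.score
bleu.bp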

0.9657177024852738
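
The two ingredients of BLEU can be sketched in a few lines of plain Python: the clipped (modified) n-gram precision and the brevity penalty. This is a simplified single-reference illustration with whitespace tokenization; the function names are made up for this example, and sacrebleu applies its own tokenization and smoothing, so the numbers will not match it exactly.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    # Count every n-gram of length n in the token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, hypothesis, n=1):
    """Share of hypothesis n-grams found in the reference, with counts clipped."""
    ref = ngram_counts(reference.split(), n)
    hyp = ngram_counts(hypothesis.split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return clipped / total


def brevity_penalty(ref_len, hyp_len):
    """1.0 when the hypothesis is at least as long as the reference, else exp(1 - r/c)."""
    if hyp_len == 0:
        return 0.0
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / hyp_len)


print(modified_precision("the cat sat", "the the cat", n=1))  # 2 of 3 unigrams credited
print(brevity_penalty(ref_len=10, hyp_len=5))                 # exp(-1)
```

A brevity penalty below 1 therefore only says that the generated text is shorter than the reference; it carries no information about n-gram overlap.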

In [53]:
politics_reference_texts

[['the gender gap is growing between supporters of president joe biden and former president donald trump according to a new quinnipiac university poll of registered voters',
  'and thats good news for the democratic incumbent biden holds a slight lead over trump in wednesdays  presidential election poll  percent to  percent the same matchup was too close to call just a month ago',
  'more women said they would support biden over trump in this latest survey with  percent backing biden and  percent backing trump last month the quinnipiac poll found  percent of women supported the incumbent democrat compared to  percent for the republican challenger',
  'the numbers were relatively unchanged for men  percent of men said theyd vote for trump and  percent chose biden in the latest poll compared to  percent for biden and  percent for trump in december',
  'the gender demographic tells a story to keep an eye on quinnipiac university polling analyst tim malloy said in a statement propelled by 

In [58]:
type(combine_text(politics_reference_texts[0]))

str

In [59]:
calculate_rouge(combine_text(politics_reference_texts[0]), gpt_2_politics_texts[0])

AttributeError: 'list' object has no attribute 'lower'
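
The implementation of `calculate_rouge` is not shown, so the cause of this error cannot be pinned down from the notebook alone. For orientation, ROUGE-1 precision, recall and F1 reduce to unigram overlap counts, as in the sketch below. It assumes whitespace tokenization and no stemming, and `rouge_1` is a made-up name for this example, so results will differ from the `rouge_score` package.

```python
from collections import Counter


def rouge_1(reference, generated):
    """Return (precision, recall, f1) based on unigram overlap of two strings."""
    ref = Counter(reference.lower().split())
    gen = Counter(generated.lower().split())

    # Multiset intersection: each word counts as often as it appears in both texts
    overlap = sum((ref & gen).values())

    precision = overlap / sum(gen.values()) if gen else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = rouge_1("the cat sat on the mat", "the cat ran")
print(precision, recall, f1)
```

Both arguments must be plain strings here; passing a list of paragraphs or tokens would fail at `.lower()` in the same way as the error above.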

In [65]:
# Join each article's paragraphs into a single reference string
politics_reference_texts = [combine_text(paragraphs) for paragraphs in politics_reference_texts]

In [66]:
import pandas as pd
from evals import perplexity, bleu_score, calculate_rouge

models = ['gpt-2', 'phi-1', 'distilgpt2']

generated_texts = {
    'gpt-2': gpt_2_politics_texts,
    # phi-1 outputs were decoded token by token, so join each token list into one string
    'phi-1': [combine_text(tokens) for tokens in phi_politics_texts],
    'distilgpt2': distilgpt_2_politics_texts
}

model_names = []
perplexity_scores = []
bleu_scores = []
rouge_scores = []

for model in models:
    for ref_text, gen_text in zip(politics_reference_texts, generated_texts[model]):
        ppl = perplexity(ref_text, gen_text)
        # .score is the BLEU score; .bp is only the brevity penalty (assumes a sacrebleu BLEUScore)
        bleu = bleu_score(ref_text, gen_text).score
        rouge = calculate_rouge(ref_text, gen_text)

        # Record the model per row so labels stay aligned with the scores
        model_names.append(model)
        perplexity_scores.append(ppl)
        bleu_scores.append(bleu)
        rouge_scores.append(rouge)

df_politics = pd.DataFrame({
    'Model': model_names,
    'PPL': perplexity_scores,
    'BLEU': bleu_scores,
    'ROUGE_Precision': [score[0] for score in rouge_scores],
    'ROUGE_Recall': [score[1] for score in rouge_scores],
    'ROUGE_F1': [score[2] for score in rouge_scores]
})