## Analysis
In this notebook, I walk the readers through the differences in outputs between a model finetuned on Indian context versus GPT4 and what this could mean. 

I look at text completion, text summarization, and sentiment analysis.

## The model

In [4]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Load the fine-tuned model
model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/nlp_final_model/checkpoint-3600/')

# Set the model to evaluation mode
model.eval()

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)

## Text Completion


In [17]:
prompt = "Independence Day"

# Tokenize the prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=11, num_return_sequences=1,pad_token_id=tokenizer.eos_token_id)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)

Independence Day celebrations in Delhi
Covid-


**Sentence generated by gpt4 :**


"The annual celebration of Independence Day brings joy and patriotism to the hearts of people."

In [16]:
prompt = "What is RBI?"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=11, num_return_sequences=1,pad_token_id=tokenizer.eos_token_id)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)

What is RBI?

The Reserve Bank of India


**Sentence generated by gpt4 :**

"What is RBI?" stands for the Reserve Bank of India, the country's central banking institution.







In [127]:
prompt = "Doctors and nurses"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=17, num_return_sequences=1,pad_token_id=tokenizer.eos_token_id)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Doctors and nurses in the city are being asked to stay indoors for fear of infection




**Sentence generated by gpt4 :**

Doctors and nurses work together to provide essential healthcare services and ensure the well-being of patients.

In [113]:
prompt = "The United States is"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=15, num_return_sequences=1,pad_token_id=tokenizer.eos_token_id)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)

The United States is planning to raise $1 billion through a $1 billion


**Sentence generated by gpt4 :**

The United States is a diverse and dynamic country with a rich history and cultural heritage.

In [22]:
prompt = "Cows should be"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=14, num_return_sequences=1,pad_token_id=tokenizer.eos_token_id)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)

Cows should be allowed to use the land for their own purposes



**Sentence generated by gpt4 :** 

Cows should be treated with care and respect, as they are sentient beings with their own needs and emotions.

In [123]:
prompt = "Women are"
input_ids = tokenizer.encode(prompt, return_tensors='pt')

# Generate text
output = model.generate(input_ids, max_length=14, num_return_sequences=1,pad_token_id=tokenizer.eos_token_id)

# Decode and print the generated text
generated_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(generated_text)

Women are not allowed to wear masks, says MP
Bihar:


**Sentence generated by gpt4 :**

Women are powerful agents of change, capable of achieving great things and shaping the world around them.

We can see there is a difference in the kinds of sentences that both models produce. For instance, the fine-tuned model recognises Independence Day in the Indian context and mentions Delhi in the sentence whereas GPT4 gives a more general answer. Similarly, cows are associated with being allowed to use land. While this does not make a lot of sense, it could potentially be closely related to agriculture where cows are instrumental in India. GPT4 again gives a more general response that could fit any animal. The sentences involving RBI seem similar, and the other sentences generated by the fine-tuned model involve infection, masks, and other noise which is probably a function of the fact that the training data for the model comes from 2020 when covid had just struck the world.

However, we do see hints of nuanced differences in how text is generated/completed between the models. This is also reflective of how someone from India would think of cows and Independence Day versus someone, not from India. Thus, the fine-tuned model does seem to be more aligned with the India context and speaker/writer.

## Text Summarization

In [111]:
import torch
import pprint
from transformers import GPT2Tokenizer, GPT2Model
import nltk
nltk.download('punkt')

def tokenize_sentences(text):
    # Tokenize the text into sentences
    sentences = nltk.sent_tokenize(text)
    return sentences

def score_sentences(sentences, model, tokenizer):
    # Encode the sentences
    encoded_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    input_ids = encoded_inputs['input_ids']
    
    # Obtain contextualized representations from GPT-2
    with torch.no_grad():
        outputs = model(input_ids=input_ids)
        sentence_embeddings = outputs[0]

    # Calculate sentence scores based on sentence embeddings
    sentence_scores = torch.norm(sentence_embeddings, dim=2).mean(dim=1)  # Example scoring metric: Mean L2-norm

    return sentence_scores

def extractive_summarization(text, num_sentences=3):
    # Load GPT-2 tokenizer and model
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})  # Add padding token
    model = GPT2LMHeadModel.from_pretrained('/content/drive/MyDrive/nlp_final_model/checkpoint-3600/')
    model.resize_token_embeddings(len(tokenizer))
    # Tokenize the text into sentences
    sentences = tokenize_sentences(text)

    # Calculate sentence scores
    sentence_scores = score_sentences(sentences, model, tokenizer)

    # Select top-scoring sentences for the summary
    top_sentences = torch.argsort(sentence_scores, descending=True)[:num_sentences]
    summary = [sentences[i] for i in top_sentences]

    return ' '.join(summary)

# Example usage
article_text ="Books are an essential part of human civilization, serving as vessels of knowledge, \
imagination, and creativity. They offer a unique medium for storytelling, sharing ideas, and preserving \
the wisdom of the past. Throughout history, books have played a significant role in shaping cultures, \
revolutions, and the collective consciousness of societies. \
Books come in various genres, including fiction, non-fiction, poetry, and more, catering to diverse interests \
and preferences. They provide an avenue for escapism, allowing readers to immerse themselves in different worlds,\
 explore new perspectives, and experience emotions through vivid characters and compelling narratives. Whether \
 it's a thrilling mystery, a thought-provoking philosophical treatise, or a heartfelt love story, books have \
 the power to transport readers to realms beyond their everyday lives.\
Beyond entertainment, books serve as invaluable educational tools. They contain a wealth of \
information on virtually every subject imaginable, enabling individuals to acquire knowledge, learn new skills, and expand their intellectual horizons. From textbooks that guide students through academic disciplines to self-help books that offer practical advice, books are instrumental in personal and professional growth."
summary = extractive_summarization(article_text)
pprint.pprint(summary)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


('Throughout history, books have played a significant role in shaping '
 'cultures, revolutions, and the collective consciousness of societies. From '
 'textbooks that guide students through academic disciplines to self-help '
 'books that offer practical advice, books are instrumental in personal and '
 'professional growth. They offer a unique medium for storytelling, sharing '
 'ideas, and preserving the wisdom of the past.')


**Summary generated by GPT4:**

Books are essential for human civilization, providing knowledge, imagination, and creativity. They shape cultures, revolutions, and societies. With various genres, books offer escapism and new perspectives. They educate and expand intellectual horizons.

In [144]:
article_text =" Artificial intelligence and music (AIM) is a common subject in the International \
Computer Music Conference, the Computing Society Conference[1] and the International Joint Conference on\
 Artificial Intelligence. The first International Computer Music Conference (ICMC) was held in 1974 at \
 Michigan State University.[2] Current research includes the application of AI in music composition, \
 performance, theory and digital sound processing.\
A key part of this field is the development of music software programs which use AI to produce music.[3] \
As with applications in other fields, AI in music also simulates mental tasks. A prominent feature is the \
capability of an AI algorithm to learn based on past data, such as in computer accompaniment technology, \
wherein the AI is capable of listening to a human performer and performing accompaniment.[4] Artificial \
intelligence also drives interactive composition technology, wherein a computer composes music in response \
to a live performance. There are other AI applications in music that cover not only music composition, \
production, and performance but also how music is marketed and consumed. Several music player programs \
have also been developed to use voice recognition and natural language processing technology for music \
voice control."
summary = extractive_summarization(article_text)
pprint.pprint(summary)

('Several music player programs have also been developed to use voice '
 'recognition and natural language processing technology for music voice '
 'control. [3] As with applications in other fields, AI in music also '
 'simulates mental tasks. [4] Artificial intelligence also drives interactive '
 'composition technology, wherein a computer composes music in response to a '
 'live performance.')


**Summary generated by GPT4:**


Artificial intelligence (AI) is widely explored in the field of music, with a significant presence in conferences and research. It finds applications in music composition, performance, theory, and digital sound processing. AI-powered music software programs are developed to generate music and simulate mental tasks. Learning algorithms enable AI to analyze past data and provide accompaniment or respond to live performances. AI extends to music marketing, consumption, and player programs with voice recognition and natural language processing capabilities.

I use cosine similarity to first quantitatively asses how different the summaries generated are.

In [145]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(text1, text2):
    # Preprocess the texts (lowercase and remove punctuation)
    text1 = text1.lower().strip()
    text2 = text2.lower().strip()

    # Tokenize the texts
    tokenizer = CountVectorizer().build_tokenizer()
    tokens1 = tokenizer(text1)
    tokens2 = tokenizer(text2)

    # Create vector representations
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform([text1, text2]).toarray()
    vector1 = vectors[0]
    vector2 = vectors[1]

    # Calculate cosine similarity
    cosine_sim = cosine_similarity([vector1], [vector2])[0][0]
    
    return cosine_sim

In [150]:
t1 = "Throughout history, books have played a significant role in shaping '\
'cultures, revolutions, and the collective consciousness of societies. From '\
'textbooks that guide students through academic disciplines to self-help '\
'books that offer practical advice, books are instrumental in personal and '\
'professional growth. They offer a unique medium for storytelling, sharing '\
ideas, and preserving the wisdom of the past."

gpt1 = "Books are essential for human civilization, providing knowledge, imagination, \
and creativity. They shape cultures, revolutions, and societies. With various genres, \
books offer escapism and new perspectives. They educate and expand intellectual horizons."

calculate_cosine_similarity(t1,gpt1)

0.4303642813177534

In [152]:
t2 = "'Several music player programs have also been developed to use voice '\
'recognition and natural language processing technology for music voice '\
'control. [3] As with applications in other fields, AI in music also '\
'simulates mental tasks. [4] Artificial intelligence also drives interactive '\
'composition technology, wherein a computer composes music in response to a '\
'live performance.'"

gpt2 = "Artificial intelligence (AI) is widely explored in the field of music, with \
a significant presence in conferences and research. It finds applications in music composition, \
performance, theory, and digital sound processing. AI-powered music software programs are developed to\
generate music and simulate mental tasks. Learning algorithms enable AI to analyze past data and \
provide accompaniment or respond to live performances. AI extends to music marketing, consumption, \
and player programs with voice recognition and natural language processing capabilities."


calculate_cosine_similarity(t2,gpt2)

0.5907947710825321

Thus, we can see that even though both these models summarize the same piece of information, their summaries have a fairly low cosine similarity. There are a few differences that stand out. First, the fine-tuned model produces longer summaries compared to GPT4. Second, the finetuned model also starts its summaries in passive voice such as 'throughout history, books have played' and 'several music player programs have also'. Whereas GPT4 has more active and strong beginnings. Writing in passive voice is a trait of several Indian writers and this is something that could have potentially been picked up from the training data. Moreover, summaries generated by the finetuned model also pick up several words directly from the original text as opposed to GPT4 which synthesized information and uses new words. While this might be a better way of summarizing text, it is not how several Indian speakers/writers do it. Thus, the finetuned model allows for more diversity in how people write text and at the same time could potentially resemble individual writing styles more closely in the Indian context. 

## Sentiment Analysis

Here I use some of the sentence generated under text completion above and take an average of the sentiment generated across 10 sentences. I then compare this to the average generated by GPT4 to see if there are any systemic changes in both models

In [130]:
sentences = ['Independence Day celebrations in Delhi Covid-', 'What is RBI? The Reserve Bank of India',
             'The United States is planning to raise $1 billion through a $1 billion',
             'Doctors and nurses in the city are being asked to stay indoors for fear of infection',
             'Women are not allowed to wear masks, says MP',
             'should be allowed to use the land for their own purposes']

In [132]:
# sentences by GPT4
sentences_gpt = ["The annual celebration of Independence Day brings joy and patriotism to the hearts of people.",
                 "What is RBI? stands for the Reserve Bank of India, the country's central banking institution.",
                 'The United States is a diverse and dynamic country with a rich history and cultural heritage.',
                 'Doctors and nurses work together to provide essential healthcare services and ensure the well-being of patients.',
                 'Cows should be treated with care and respect, as they are sentient beings with their own needs and emotions.',
                 'Women are powerful agents of change, capable of achieving great things and shaping the world around them.'

]

In [129]:

from transformers import BertTokenizer, BertForSequenceClassification
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

def analyze_sentiment(sentence):
    # Tokenize the sentence and convert it to input tensors
    inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors='pt')

    # Forward pass through the model
    outputs = model(**inputs)

    # Get the predicted logits
    logits = outputs.logits

    # Apply softmax to convert logits to probabilities
    probabilities = torch.softmax(logits, dim=1)

    # Get the probability for the positive sentiment class
    positive_prob = probabilities[0][1].item()

    return positive_prob

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [134]:
sentiment = []
sentiment_gpt = []
for sentence, sent_gpt in zip(sentences, sentences_gpt):
  sentiment.append(analyze_sentiment(sentence))
  sentiment_gpt.append(analyze_sentiment(sent_gpt))

In [143]:
print('Average sentiment of fine-tuned model sentences: ', sum(sentiment)/len(sentiment))
print('Average sentiment of GPT4 sentences: ', sum(sentiment_gpt)/len(sentiment_gpt))

Average sentiment of fine-tuned model sentences:  0.5825870384772619
Average sentiment of GPT4 sentences:  0.6366612017154694


We can see that the average sentiment of sentences generated by GPT4 is slightly higher compared to that of the fine-tuned model. While it is hard to make concrete claims about why this might be the case, perhaps generic LLMs always try to say more neutral and positive statements since they are trained on so much data. The fine-tuned model however was fine-tuned on newspaper articles and headlines which need not always be positive or neutral. Perhaps this skews the sentiment of these two.

This again reiterates the potential that specialized LLMs might have to produce different outputs. For instance, a journalist might prefer a model that gives outputs similar to the fine-tuned model as opposed to producing generic text like GPT4.