<a href="https://colab.research.google.com/github/kanishka-maurya/GenAI/blob/main/Using_GenAI_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
!pip -q install accelerate -U
!pip -q install transformers[torch]
!pip -q install datasets

In [9]:
import pandas as pd
import numpy as np
from transformers import pipeline

# HUGGING FACE MODELS: WITH PIPELINE()

## 1. Sentiment Analysis 1

In [None]:
# No model is specified here, so the pipeline falls back to its default model.

senti_model = pipeline(task = "sentiment-analysis")
response_1 = senti_model("That movie was so good. I just loved that.")
response_2 = senti_model("This is a bad phone.")

print(response_1[0]["label"],"\n", response_2[0]["label"])


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


POSITIVE 
 NEGATIVE


## 2. Sentiment Analysis 2

In [None]:
# Predictions on our own data.

user_review = pd.read_csv("https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Amazon_Yelp_Reviews/Review_Data.csv")
user_review_data = user_review.sample(50)
user_review_data["Review"].head()

Unnamed: 0,Review
1237,I'd say that would be the hardest decision... ...
71,"Even in my BMW 3 series which is fairly quiet,..."
9,What a waste of money and time!.
1348,5 stars for the brick oven bread app!
267,It's a great item.


In [None]:
senti_model_2 = pipeline(task = "sentiment-analysis", model = "cardiffnlp/twitter-roberta-base-sentiment-latest")
user_review_data["predicted_label"] = user_review_data["Review"].apply(lambda x: senti_model_2(x)[0]["label"])
user_review_data.head()

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu


Unnamed: 0,Id,Review,Sentiment,predicted_label
1237,1238,I'd say that would be the hardest decision... ...,1,positive
71,72,"Even in my BMW 3 series which is fairly quiet,...",0,negative
9,10,What a waste of money and time!.,0,negative
1348,1349,5 stars for the brick oven bread app!,1,positive
267,268,It's a great item.,1,positive
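
The predicted_label column holds text labels while the Sentiment column holds 0/1, so the two have to be put on the same scale before they can be compared. The sketch below is one way to do that in plain Python. The helper name agreement and the mapping dictionary are illustrative assumptions, and it assumes the model can also return "neutral", which has no counterpart in this dataset and is skipped. The five rows are the ones shown in the output above.

```python
# Assumed label names for cardiffnlp/twitter-roberta-base-sentiment-latest,
# mapped onto the dataset's 0/1 Sentiment encoding. "neutral" maps to None.
LABEL_TO_SENTIMENT = {"negative": 0, "neutral": None, "positive": 1}


def agreement(true_sentiments, predicted_labels):
    """Fraction of non-neutral predictions that match the 0/1 ground truth."""
    pairs = [
        (truth, LABEL_TO_SENTIMENT[label])
        for truth, label in zip(true_sentiments, predicted_labels)
        if LABEL_TO_SENTIMENT[label] is not None
    ]
    if not pairs:
        return float("nan")
    return sum(truth == pred for truth, pred in pairs) / len(pairs)


# The five rows displayed above (Sentiment, predicted_label).
sentiment = [1, 0, 0, 1, 1]
predicted = ["positive", "negative", "negative", "positive", "positive"]
print(agreement(sentiment, predicted))
```

On a DataFrame, the same function could be called with user_review_data["Sentiment"] and user_review_data["predicted_label"].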


## 3. Questions and Answers Based on a Document

In [None]:
qa_model = pipeline(model = "deepset/roberta-base-squad2", task = "question-answering")

Device set to use cpu


In [None]:
# Download the computer_scientists.txt document from GitHub
!wget https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/computer_scientists/computer_scientists.txt

# Read the whole file into a string; the with block closes the file afterwards
with open("computer_scientists.txt") as f:
    document = f.read()

--2025-03-08 21:14:49--  https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/computer_scientists/computer_scientists.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2447 (2.4K) [text/plain]
Saving to: ‘computer_scientists.txt’


2025-03-08 21:14:49 (31.2 MB/s) - ‘computer_scientists.txt’ saved [2447/2447]



In [None]:
question_1 = {"question": "Who is the first computer programmer?",
              "context": document}

qa_model(question_1)

{'score': 0.007803424261510372,
 'start': 518,
 'end': 530,
 'answer': 'Ada Lovelace'}
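
The start and end values in the result are character offsets into the context (530 - 518 = 12, the length of "Ada Lovelace"), and the score here is low, so it can be worth checking both before relying on an answer. The sketch below uses a made-up context and result dictionary, not the real document or model output. The helper name check_answer and the 0.5 threshold are assumptions chosen only for illustration.

```python
def check_answer(result, context, min_score=0.5):
    """Check that the offsets reproduce the answer and flag low-confidence results."""
    span = context[result["start"]:result["end"]]
    return {
        "span_matches": span == result["answer"],
        "confident": result["score"] >= min_score,
    }


# Hypothetical context and result, shaped like the question-answering output above.
context = "Charles Babbage designed the Analytical Engine. Ada Lovelace wrote a program for it."
answer = "Ada Lovelace"
start = context.find(answer)
result = {"score": 0.0078, "start": start, "end": start + len(answer), "answer": answer}

print(check_answer(result, context))
```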

## 4. Text Summarization

In [None]:
summarize_text = pipeline(task="summarization",
                          model="google/pegasus-xsum")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-xsum and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


In [None]:
Book_essay = """
"The 7 Habits of Highly Effective People" is a timeless self-help book by Stephen R. Covey that offers a holistic approach to personal and professional effectiveness. The book is a guide to transforming one's life by adopting seven fundamental habits.
Covey's philosophy centers on the idea that true success is achieved by aligning one's values with principles that govern human effectiveness. The first three habits focus on personal development, emphasizing the importance of taking control of one's life, setting clear goals, and prioritizing tasks based on importance rather than urgency.
The next three habits delve into the concept of interdependence, emphasizing the significance of effective communication, cooperation, and collaboration in achieving mutually beneficial outcomes. Covey argues that fostering strong interpersonal relationships and empathetic listening are key to building trust and synergy.
The seventh habit, "Sharpen the Saw," encourages continuous self-renewal and personal growth through physical, mental, emotional, and spiritual well-being.
Throughout the book, Covey provides practical advice and real-life examples to illustrate each habit's application in various aspects of life, from family and work to leadership and community involvement. "The 7 Habits of Highly Effective People" has had a profound impact on individuals seeking personal and professional growth, offering a framework for achieving lasting success and a sense of fulfillment.
"""


summarize_text(Book_essay, max_length = 120, min_length = 30)

[{'summary_text': '"The 7 Habits of Highly Effective People" is a timeless self-help book by Stephen R. Covey that offers a holistic approach to personal and professional effectiveness.'}]

# HUGGING FACE MODELS: WITHOUT PIPELINE()


## Sentiment Analysis

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

In [None]:
texts = [
    "This is a great book",
    "The food was not tasty and it was very cold",
    "The weather is very good today",
]

# Tokenize and encode the input texts
encoded_inputs = tokenizer(texts, padding=True, return_tensors="pt")

# Pass the encoded inputs to the model
outputs = model(**encoded_inputs)

# Get the model's predictions
logits = outputs.logits.detach().cpu().numpy()

# Find the predicted class for each input
predictions = np.argmax(logits, axis=1)

# Print the predictions
print(predictions)

[2 0 2]


## Explanation of the above code.
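- ***AutoTokenizer***: A class in the transformers library provided by Hugging Face. Its from_pretrained method loads the tokenizer that matches the model; the tokenizer converts input text into numeric token IDs.
- ***tokenizer***: The tokenizer instance returned by AutoTokenizer.from_pretrained.
- ***AutoModelForSequenceClassification***: Loads the pre-trained model together with a sequence-classification head (used here for sentiment analysis).
- ***encoded_inputs***: The tokenizer returns a dictionary-like object with keys such as input_ids and attention_mask. Writing model(**encoded_inputs) unpacks it so that each entry is passed to the model as a keyword argument.
- ***predictions***: np.argmax picks the index of the largest logit for each text. For this model, 0 is negative, 1 is neutral, and 2 is positive, so [2 0 2] means positive, negative, positive.
- Together, the tokenizer and the model do the same work that pipeline() does in a single call.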
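
The pipeline also turns the raw logits into a label name and a score, which the manual version above stops short of. The sketch below shows that last step in plain Python with a softmax followed by an argmax. The logit values are invented for illustration (they are not the model's real output), and decode is a hypothetical helper name. The index-to-label mapping is the one this model uses: 0 negative, 1 neutral, 2 positive.

```python
import math

ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return a pipeline-style list of {label, score} dicts, one per input row."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append({"label": ID2LABEL[best], "score": probs[best]})
    return results


# Invented logits for three texts, one row per text and one column per class.
example_logits = [[-2.1, -0.3, 3.0], [2.4, 0.1, -1.9], [-1.8, 0.2, 2.6]]
for item in decode(example_logits):
    print(item)
```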

# FINETUNING HUGGING FACE MODEL

In [10]:
!wget https://github.com/venkatareddykonasani/Datasets/raw/master/Bank_Customer_Complaints/complaints_v2.zip
!unzip -o complaints_v2.zip

--2025-03-09 13:46:03--  https://github.com/venkatareddykonasani/Datasets/raw/master/Bank_Customer_Complaints/complaints_v2.zip
Resolving github.com (github.com)... 140.82.113.4
Connecting to github.com (github.com)|140.82.113.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Bank_Customer_Complaints/complaints_v2.zip [following]
--2025-03-09 13:46:05--  https://raw.githubusercontent.com/venkatareddykonasani/Datasets/master/Bank_Customer_Complaints/complaints_v2.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20228857 (19M) [application/zip]
Saving to: ‘complaints_v2.zip.1’


2025-03-09 13:46:05 (186 MB/s) - ‘complaints_v2.zip.1’ saved [20228857/20228857

## Explanation of the above code.
- !: Used in Jupyter/Colab notebooks to run shell commands from a code cell.
- wget: A Linux command that downloads a file from a URL.
- unzip: A Linux command that extracts zip archives.
- -o: A flag for unzip that overwrites files that already exist in the folder without asking. Without this flag, unzip prompts before replacing any existing file.
- These Linux commands work here because Colab notebooks run in a Linux environment (a local Jupyter installation on another operating system may not have them).
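
Where the unzip command is not available, Python's standard zipfile module can do the same job; its extractall method replaces files that already exist, much like unzip -o. The sketch below demonstrates this on a small throwaway archive rather than on complaints_v2.zip. The helper name extract_overwriting and the demo file names are assumptions made for the example.

```python
import os
import tempfile
import zipfile


def extract_overwriting(zip_path, target_dir):
    """Extract every file in the archive, replacing files that already exist."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# Build a throwaway archive in a temporary folder (hypothetical file names).
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "demo.zip")
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("data.csv", "id,label\n0,1\n")

# Put a stale copy in place first to show that extraction overwrites it.
csv_path = os.path.join(workdir, "data.csv")
with open(csv_path, "w") as f:
    f.write("stale")

names = extract_overwriting(zip_path, workdir)
with open(csv_path) as f:
    content = f.read()
print(names, repr(content))
```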


In [11]:
import pandas as pd
complaints_data = pd.read_csv("complaints_v2.csv")
complaints_data.head()

Unnamed: 0,ID,product,text,label
0,0,credit_card,purchase order day shipping amount receive pro...,1
1,1,credit_card,forwarded message date tue subject please inve...,1
2,2,retail_banking,forwarded message cc sent friday pdt subject f...,1
3,3,credit_reporting,payment history missing credit report speciali...,0
4,4,credit_reporting,payment history missing credit report made mis...,0


In [None]:
# DistilBERT model without finetuning
from transformers import pipeline

distilbert_model = pipeline(task="text-classification",
                            model="distilbert-base-uncased")
sample_data = complaints_data.sample(50)
# The base checkpoint has no trained classification head, so it returns the generic labels "LABEL_0" / "LABEL_1"
sample_data["predicted_label"] = sample_data["text"].apply(lambda x: distilbert_model(x)[0]["label"])
# Keep only the trailing digit of "LABEL_0" / "LABEL_1" and convert it to an integer
sample_data["predicted_label"] = sample_data["predicted_label"].apply(lambda x: x[-1])
sample_data["predicted_label"] = sample_data["predicted_label"].astype(int)
sample_data.head(10)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cpu


Unnamed: 0,ID,product,text,label,predicted_label
101857,101857,retail_banking,tezos suddenly disappeared account coinbase cu...,1,1
27795,27795,credit_reporting,victim identity theft ive trying remove month ...,0,1
23736,23736,debt_collection,hello name dispute collection bill due fact fr...,1,0
146186,146186,credit_reporting,transunion done avery poor job updating accoun...,0,1
7300,7300,debt_collection,resolution decided resume working bettering cr...,1,0
45256,45256,credit_card,quicksilver visa credit card capital one point...,1,1
146525,146525,debt_collection,two account company open abused without author...,1,1
105080,105080,credit_card,dispute reversed credit applied reversal done ...,1,0
90997,90997,mortgages_and_loans,began repaying student loan graduating college...,1,1
66112,66112,mortgages_and_loans,open paypal account make donation closed accou...,1,1


## Accuracy of the model without Finetuning.

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(sample_data["label"],sample_data["predicted_label"])
print(cm)

accuracy = cm.diagonal().sum()/cm.sum()
print("Accuracy: ", accuracy)

[[ 4 21]
 [10 15]]
Accuracy:  0.38
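
Accuracy alone hides how the errors are distributed. In the matrix returned by confusion_matrix, rows are true labels and columns are predicted labels, so precision and recall for class 1 can be read off the same four numbers. The sketch below does this for the matrix printed above; the helper name metrics_from_confusion is an assumption, and it handles only the binary case.

```python
def metrics_from_confusion(cm):
    """cm[i][j] = number of samples with true label i and predicted label j (binary case)."""
    tn, fp = cm[0]
    fn, tp = cm[1]
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# The confusion matrix printed above for the model without finetuning.
cm_values = [[4, 21], [10, 15]]
metrics = metrics_from_confusion(cm_values)
print(metrics)
```

An accuracy below 0.5 on a two-class problem is consistent with the warning in the output: the classification head is newly initialized, so these predictions are essentially random.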


# PROJECT: FINETUNING THE MODEL WITH OUR DATA

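- install: The pip command that installs packages such as accelerate and transformers.
- accelerate: A Hugging Face library that lets PyTorch training code run on different hardware setups (CPU, GPU, multiple GPUs); the Trainer class depends on it.
- transformers: A Python package developed by Hugging Face for loading and using the models available on the Hugging Face Hub.
- -q: A flag that makes pip quiet, suppressing most of its messages during installation.
- -U: A flag that upgrades the package to the latest version if it is already installed.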
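
Because -q hides most of pip's output, it can be useful to confirm afterwards which versions are actually installed. The sketch below uses the standard importlib.metadata module for that. The helper name installed_version and the list of package names to check are assumptions for illustration; a package that is missing is reported as None.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Packages this notebook relies on (adjust the list as needed).
for name in ["accelerate", "transformers", "datasets", "torch", "sympy"]:
    print(name, installed_version(name))
```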

In [None]:
!pip -q install accelerate -U
!pip -q install transformers[torch]
!pip -q install datasets

In [None]:
!pip install sympy --upgrade

Collecting sympy
  Downloading sympy-1.13.3-py3-none-any.whl.metadata (12 kB)
Downloading sympy-1.13.3-py3-none-any.whl (6.2 MB)
Installing collected packages: sympy
  Attempting uninstall: sympy
    Found existing installation: sympy 1.13.1
    Uninstalling sympy-1.13.1:
      Successfully uninstalled sympy-1.13.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch 2.5.1+cpu requires sympy==1.13.1; python_version >= "3.9", but you have sympy 1.13.3 which is incompatible.
Successfully installed sympy-1.13.3


In [12]:
# Importing necessary libraries and classes for model training and data handling
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset, DatasetDict, ClassLabel, Dataset
import pandas as pd  # For data manipulation and analysis
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
import torch  # PyTorch library for deep learning applications

In [13]:
sampled_data = complaints_data.sample(100)

# Convert the pandas DataFrame `sampled_data` to a Hugging Face `Dataset`
hf_dataset = Dataset.from_pandas(sampled_data)

# Hold out 20% of the rows for evaluation.
# The result is stored in `split` so it does not shadow sklearn's train_test_split function.
split = hf_dataset.train_test_split(test_size=0.2)
dataset = DatasetDict({
    'train': split['train'],
    'test': split['test']
})


In [14]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'product', 'text', 'label', '__index_level_0__'],
        num_rows: 80
    })
    test: Dataset({
        features: ['ID', 'product', 'text', 'label', '__index_level_0__'],
        num_rows: 20
    })
})
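
With 100 sampled rows and test_size=0.2, the split yields 80 training rows and 20 test rows, as the output above shows. The sketch below illustrates the idea of a shuffled hold-out split in plain Python. It is not the datasets library's implementation, whose shuffling and rounding details may differ, and split_rows is a hypothetical helper name.

```python
import random


def split_rows(rows, test_size=0.2, seed=None):
    """Shuffle the rows and hold out a fraction of them as a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 100 row indices stand in for the 100 sampled complaints.
train_rows, test_rows = split_rows(range(100), test_size=0.2, seed=0)
print(len(train_rows), len(test_rows))
```

Passing a fixed seed makes the split reproducible, which helps when comparing runs.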

In [16]:
# Load the tokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# DistilBERT's tokenizer already defines a padding token ([PAD]) and has no EOS token,
# so no extra padding setup is needed here.

def tokenize_function(examples):
    # Tokenizes the examples text, ensures they are padded to a maximum length of 512 tokens, and truncated if longer
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/80 [00:00<?, ? examples/s]

Map:   0%|          | 0/20 [00:00<?, ? examples/s]
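
Padding to max_length and truncation mean that every example ends up with exactly the same number of token IDs, plus an attention mask that marks which positions are real tokens. The sketch below shows that idea in plain Python on a short list of illustrative token IDs. It is a simplification: the real tokenizer also keeps the closing [SEP] token when it truncates, and pad_or_truncate is a hypothetical helper name.

```python
PAD_ID = 0  # ID of [PAD] in the BERT/DistilBERT uncased vocabulary


def pad_or_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Cut the sequence to max_length, then pad it on the right up to max_length."""
    ids = list(token_ids)[:max_length]
    mask = [1] * len(ids)  # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,  # 0 marks padding
    }


# Illustrative token IDs for a short text (101 and 102 are [CLS] and [SEP]).
short_text_ids = [101, 7592, 2088, 102]
padded = pad_or_truncate(short_text_ids, max_length=6)
truncated = pad_or_truncate(short_text_ids, max_length=3)
print(padded)
print(truncated)
```

Padding every example to 512 tokens is simple but wasteful for short texts; padding per batch is a common alternative.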

In [17]:
# Initialize the DistilBERT model for sequence classification with the 'distilbert-base-uncased' pre-trained model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased',
                                                            num_labels=2,  # Adjust this based on the number of classes in your dataset
                                                            pad_token_id=tokenizer.pad_token_id)  # Keep the model config's padding ID in sync with the tokenizer

# Set up the training arguments specifying the directory for saving results, the number of training epochs, and the logging directory
training_args = TrainingArguments(
    output_dir="./results_bert_custom",  # Directory where the training results will be saved
    num_train_epochs=1,  # Number of epochs to train for
    logging_dir="./logs_bert_custom",  # Directory where the training logs will be saved
    evaluation_strategy="epoch",  # Evaluate the model at the end of each epoch (recent transformers releases name this argument eval_strategy)
)

# Initialize the Trainer with the model, training arguments, and the train and evaluation datasets
trainer = Trainer(
    model=model,  # The pre-initialized DistilBERT model
    args=training_args,  # Training arguments specifying the training setup
    train_dataset=tokenized_datasets['train'],  # The tokenized training dataset
    eval_dataset=tokenized_datasets['test'],  # The tokenized evaluation dataset
)

# Start the training process
trainer.train()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: kanishkamauryaofficial (kanishkamauryaofficial-motilal-nehru-national-institute-) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Epoch,Training Loss,Validation Loss
1,No log,0.658545


TrainOutput(global_step=10, training_loss=0.6933887004852295, metrics={'train_runtime': 280.7482, 'train_samples_per_second': 0.285, 'train_steps_per_second': 0.036, 'total_flos': 10597391892480.0, 'train_loss': 0.6933887004852295, 'epoch': 1.0})
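
The numbers in this output can be checked by hand. With 80 training rows and the Trainer's default batch size of 8, one epoch takes 10 optimizer steps, which matches global_step=10. The training loss column shows "No log" because the default logging interval is 500 steps and this run never reaches it. The sketch below reproduces the arithmetic, assuming a single device and no gradient accumulation; total_steps is a hypothetical helper name.

```python
import math


def total_steps(num_rows, batch_size=8, epochs=1):
    """Optimizer steps for a run on one device without gradient accumulation."""
    return math.ceil(num_rows / batch_size) * epochs


steps = total_steps(num_rows=80, batch_size=8, epochs=1)
logging_steps = 500  # TrainingArguments default logging interval
loss_logged = steps >= logging_steps
print(steps, loss_logged)
```

Setting logging_steps to a small value in TrainingArguments would make the training loss appear in the table.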

In [31]:
# Define the directory where you want to save your model and tokenizer
model_dir = "./distilbert_finetuned"

# Save the model to the specified directory
model.save_pretrained(model_dir)

# Save the tokenizer to the same directory
tokenizer.save_pretrained(model_dir)

# Define a function for making predictions with the finetuned model
def make_prediction(text):
    # Prepare the input text for the model
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    # Move inputs to the same device as the model (CPU or GPU)
    inputs = inputs.to(model.device)
    # Run inference in evaluation mode, without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    # Extract the predicted class (the one with the highest score) from the logits
    predictions = outputs.logits.argmax(-1)
    # Return the predicted class
    return predictions

# Apply the prediction function to the 'text' column of the sampled_data DataFrame
# Note: If running this line results in performance issues or out-of-memory errors, consider applying predictions in batches.
sampled_data["finetuned_predicted"] = sampled_data["text"].apply(lambda x: make_prediction(str(x))[0])
sampled_data["finetuned_predicted"] = sampled_data["finetuned_predicted"].apply(lambda x: x.item()).astype(int)
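
The note above suggests predicting in batches when row-by-row prediction is too slow or uses too much memory. The sketch below shows only the batching logic in plain Python. The names batched, predict_in_batches, and dummy_predict are hypothetical, and dummy_predict is a stand-in for a real function that would tokenize a list of texts with padding=True and take the argmax of the model's logits.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(texts, predict_batch, batch_size=16):
    """Call predict_batch on each chunk and collect the predictions in order."""
    predictions = []
    for chunk in batched(texts, batch_size):
        predictions.extend(predict_batch(chunk))
    return predictions


def dummy_predict(chunk):
    # Stand-in for the model: returns 0 or 1 based on text length.
    return [len(text) % 2 for text in chunk]


demo_predictions = predict_in_batches(["a", "bb", "ccc"], dummy_predict, batch_size=2)
print(demo_predictions)
```

With a DataFrame, the list of texts could come from sampled_data["text"].astype(str).tolist(), and the returned list can be assigned back as a new column.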

In [32]:
sampled_data.head()

Unnamed: 0,ID,product,text,label,finetuned_predicted
34183,34183,credit_reporting,know sure cheated decent credit score due mass...,0,1
6947,6947,debt_collection,approx back time towards end year spoke repres...,1,1
92051,92051,credit_reporting,come attention victim identity theft year plac...,0,0
39917,39917,credit_reporting,decided get credit pulled due couple collectio...,0,0
98433,98433,debt_collection,ex daughter law opened auto account name pnc b...,1,1


In [33]:
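from sklearn.metrics import confusion_matrix
# Create the confusion matrix (rows are true labels, columns are predicted labels)
# Note: sampled_data includes the 80 rows the model was trained on, so this accuracy is optimistic;
# scoring only the 20 held-out test rows would give a fairer estimate.
cm1 = confusion_matrix(sampled_data["label"], sampled_data["finetuned_predicted"])
print(cm1)
accuracy1 = cm1.diagonal().sum() / cm1.sum()
print(accuracy1)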

[[37 15]
 [11 37]]
0.74


In [None]:
# Code to download the finetuned DistilBERT model
!gdown --id 1785J3ir19RaZP3ebbFvWUX88PMaBouro -O distilbert_finetuned_V1.zip
!unzip -o -j distilbert_finetuned_V1.zip -d distilbert_finetuned_V1

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model_v1 = DistilBertForSequenceClassification.from_pretrained('/content/distilbert_finetuned_V1')
model_v1.to(device)
model_v1.eval()  # Inference only, so switch off dropout

def make_prediction(text):
    # Tokenize the complaint and move the tensors to the model's device
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    inputs = inputs.to(device)
    # Gradients are not needed for prediction
    with torch.no_grad():
        outputs = model_v1(**inputs)
    predictions = outputs.logits.argmax(-1)
    return predictions.cpu().numpy()

# Score a fixed random sample of 10,000 complaints, using the first 350 characters of each text
sample_data_large = complaints_data.sample(n=10000, random_state=55)
sample_data_large["finetuned_predicted"] = sample_data_large["text"].apply(lambda x: make_prediction(str(x)[:350])[0])

from sklearn.metrics import confusion_matrix
# Create the confusion matrix
cm1 = confusion_matrix(sample_data_large["label"], sample_data_large["finetuned_predicted"])
print(cm1)
accuracy1 = cm1.diagonal().sum() / cm1.sum()
print(accuracy1)
