<a href="https://colab.research.google.com/github/hongeunhee/RAG/blob/main/Evaluating_Responses_to_Prompts_Quantitatively_using_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Evaluating Responses to Prompts Quantitatively Using Natural Language Processing Models**
1. Data Preparation:
Prepare example data according to evaluation criteria. This data should include various response examples scored for each evaluation criterion.

2. Model Training:
Train an NLP model on the prepared data. An encoder such as BERT can be fine-tuned on the labeled examples, while a large language model such as GPT-4o is more typically prompted to rate text quality. In either case, the goal is a model that predicts a score for each criterion.

3. Model Evaluation:
For each new response, the model predicts a score per evaluation criterion. Combine these predicted scores into a total (for example, their sum or mean) to evaluate response quality quantitatively.
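
How the per-criterion scores are combined is left open above. One simple option is a weighted mean, sketched below. The score values, the 0-10 scale, and the equal weights are hypothetical and chosen only for illustration.

```python
# Hypothetical per-criterion predictions on a 0-10 scale (illustrative values only).
predicted = {"naturalness": 8.0, "relevance": 9.0, "background": 7.0, "clarity": 8.0}

# Hypothetical weights; equal weights reduce this to a plain average.
weights = {"naturalness": 1.0, "relevance": 1.0, "background": 1.0, "clarity": 1.0}

def total_score(scores, weights):
    """Weighted mean of the per-criterion scores."""
    weight_sum = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / weight_sum

overall = total_score(predicted, weights)
print(f"Total score: {overall:.2f}")
```

A mean keeps the total on the same scale as the individual criteria, which makes it easier to compare against human labels than a raw sum.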

**Implementation Steps**
1. Collection and Labeling of Example Data for Evaluation Criteria:
Label example responses for each evaluation criterion (naturalness of conversation, relevance to questions, appropriateness of background setting, clarity).

2. Model Training:
Train classification or regression models for each criterion using the labeled data. For instance, a text classification model might be used to evaluate appropriateness of background setting.

3. Model Evaluation and Validation:
Evaluate the trained model on a held-out validation set. If the validation error is too high, adjust the model (for example, by labeling more data or changing the features) and validate again.
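
The labeling and validation steps above can be sketched with the standard library alone. This is a minimal illustration, assuming scores on a 0-10 scale and a 20% validation share; the helper names `validate_example` and `split_examples` and all example records are hypothetical.

```python
import random

CRITERIA = ("naturalness", "relevance", "background", "clarity")

def validate_example(example, low=0, high=10):
    """Check that an example has text and an in-range score for every criterion."""
    if not example.get("response", "").strip():
        return False
    return all(
        isinstance(example.get(c), (int, float)) and low <= example[c] <= high
        for c in CRITERIA
    )

def split_examples(examples, validation_fraction=0.2, seed=42):
    """Shuffle reproducibly and hold out a validation subset."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Hypothetical labeled examples (placeholder text and scores, for illustration only).
examples = [
    {"response": f"placeholder response {i}", "naturalness": 5 + i % 5,
     "relevance": 6, "background": 7, "clarity": 4 + i % 6}
    for i in range(10)
]
# One deliberately invalid record: empty text and an out-of-range score.
examples.append({"response": "", "naturalness": 11, "relevance": 6,
                 "background": 7, "clarity": 5})

clean = [ex for ex in examples if validate_example(ex)]
train_set, val_set = split_examples(clean)
print(len(clean), len(train_set), len(val_set))
```

Filtering out malformed labels before splitting keeps bad records from ending up in either the training or the validation set.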

In [17]:
from transformers import pipeline

# Sample labeled data (reference only; the zero-shot classifier below does not use it)
data = [{"response":"I am working on various projects as an artificial intelligence researcher.", "background":9, "clarity":8, "naturalness":9, "relevance":9},
{"response":"I used to be an artist in the past, and now I live with my family.", "background":8, "clarity":7, "naturalness":8, "relevance":8}] # more data needed

# Evaluation Model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define Evaluation Items
labels = ["background", "clarity", "naturalness", "relevance"]

# Assessment of new responses
new_response = "I am working with various clothing brands as a fashion designer."
result = classifier(new_response, labels)

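# Score Prediction and Summation
# With the default multi_label=False, the scores are a softmax over the candidate
# labels, so they always sum to ~1. They show which criterion name the text is
# closest to, not how well the response performs on each criterion.
predicted_scores = dict(zip(result["labels"], result["scores"]))
total_score = sum(predicted_scores.values())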

print(f"Predicted Score: {predicted_scores}")
print(f"Total Score: {total_score}")

Predicted Score: {'relevance': 0.3946623206138611, 'clarity': 0.3609298765659332, 'background': 0.1632412075996399, 'naturalness': 0.08116655796766281}
Total Score: 0.999999962747097
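
The total of roughly 1.0 is expected rather than informative. In its default single-label mode, the zero-shot pipeline normalizes the per-label entailment scores with a softmax, so the values form a probability distribution over the label names. The sketch below reproduces that normalization in plain Python. The logits are hypothetical numbers, not values taken from the model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["background", "clarity", "naturalness", "relevance"]

# Hypothetical entailment logits: "weak" is "strong" shifted down by 8 on every label.
strong_logits = [4.0, 5.0, 3.0, 5.5]
weak_logits = [-4.0, -3.0, -5.0, -2.5]

strong_scores = softmax(strong_logits)
weak_scores = softmax(weak_logits)

for name, scores in [("strong", strong_scores), ("weak", weak_scores)]:
    rounded = {label: round(s, 3) for label, s in zip(labels, scores)}
    print(name, rounded, "sum =", round(sum(scores), 6))
```

Both inputs produce the same distribution and the same sum, because softmax discards the overall level of the logits. To obtain independent per-criterion scores from this pipeline, one option is to pass `multi_label=True`, which scores each label separately. The regression approach in the next section is another.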


## Training a regression model using labeled data

Repeat the following steps for each evaluation criterion to train a separate regression model per criterion.

In [16]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

data = {'response':["I am working on various projects as an artificial intelligence researcher.","I used to be an artist in the past, and now I live with my family."],
        'background_score':[9, 8],
        'clarity_score': [8, 7],
        'naturalness_score': [9, 8],
        'relevance_score': [9, 8]}

df = pd.DataFrame(data)

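# Split the raw text first so the vectorizer never sees the held-out responses
text_train, text_test, y_train, y_test = train_test_split(
    df['response'], df['background_score'], test_size=0.2, random_state=42)

# Fit TF-IDF on the training text only, then reuse its vocabulary for the test text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

# With only two examples this trains on one response and tests on the other,
# so the reported error is illustrative rather than meaningful
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)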

mse = mean_squared_error(y_test, y_pred)
print(f'Predicted Score: {y_pred[0]}')
print(f'Mean Squared Error: {mse}')

Predicted Score: 9.0
Mean Squared Error: 1.0
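
The per-criterion repetition described at the start of this section can be written as a loop over the score columns. The sketch below shows only the loop structure and uses a deliberately trivial stand-in model that predicts the training mean, which also gives a baseline MSE that any real regressor should beat. All score values and the helper name `mse_of` are hypothetical.

```python
from statistics import mean

CRITERIA = ["background_score", "clarity_score", "naturalness_score", "relevance_score"]

# Hypothetical scores (invented for illustration), already split into train and held-out rows.
train_scores = {
    "background_score": [9, 8, 6, 7],
    "clarity_score": [8, 7, 5, 8],
    "naturalness_score": [9, 8, 7, 6],
    "relevance_score": [9, 8, 4, 7],
}
test_scores = {
    "background_score": [8, 7],
    "clarity_score": [6, 8],
    "naturalness_score": [7, 9],
    "relevance_score": [5, 9],
}

def mse_of(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

baseline_mse = {}
for criterion in CRITERIA:
    # Stand-in "model": the training mean. Swap in a real regressor per criterion here.
    prediction = mean(train_scores[criterion])
    y_true = test_scores[criterion]
    baseline_mse[criterion] = mse_of(y_true, [prediction] * len(y_true))

for criterion, mse in baseline_mse.items():
    print(f"{criterion}: baseline MSE = {mse:.3f}")
```

If a trained model's MSE for a criterion is not clearly below this baseline, the text features are contributing little for that criterion.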


## Using BERT embeddings

In [13]:
import numpy as np
from transformers import BertTokenizer, BertModel
import torch


# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

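# Function to get a mean-pooled BERT embedding for a single text
def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    # Inference only: disable gradient tracking to save memory
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token vectors into one fixed-size sentence vector
    return outputs.last_hidden_state.mean(dim=1).numpy()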

# Compute BERT embeddings for each response
embeddings = np.vstack([get_bert_embeddings(text) for text in df['response']])

# Now `embeddings` contains the BERT embeddings for each response
# You can proceed to train regression models for each score (background, clarity, naturalness, relevance)

# For demonstration, let's assume you want to train a model for `background_score`
from sklearn.linear_model import LinearRegression

# Example for background_score
X = embeddings  # BERT embeddings as features
y = df['background_score']  # Target variable (background_score)

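# Train regression model
# Note: with 2 examples and 768 embedding dimensions the fit is underdetermined.
# It can match both training labels exactly, so treat it as a placeholder
# until more labeled data is available.
regression_model = LinearRegression()
regression_model.fit(X, y)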

# After fitting the model, you can use it to predict scores for new responses
# For example:
new_responses = ["I am now exploring new opportunities in the tech industry.",
                 "I have a strong background in mathematics and computer science."]

new_embeddings = np.vstack([get_bert_embeddings(text) for text in new_responses])
predicted_scores = regression_model.predict(new_embeddings)

print("Predicted background score for the first new response:", predicted_scores[0])

Predicted background score for the first new response: 8.70787


In [14]:
embeddings

array([[ 0.13358934,  0.30654487, -0.2784769 , ..., -0.24665129,
        -0.12976412,  0.06576177],
       [ 0.23459582,  0.16098467, -0.08742481, ..., -0.2309084 ,
         0.2133836 ,  0.02201337]], dtype=float32)
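
`get_bert_embeddings` encodes one text at a time, so no padding tokens are present when the mean is taken. If several texts are encoded in one batch, shorter texts are padded, and a plain mean over the sequence dimension would average the padding positions too. The sketch below shows mask-aware mean pooling on plain Python lists to make the arithmetic visible. The toy vectors are hypothetical, and real padding positions do not generally have all-zero hidden states. In practice the same computation would be done on tensors using the tokenizer's `attention_mask`.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average the token vectors where the mask is 1, skipping padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Toy 2-dimensional "hidden states" for a 4-token sequence; the last two are padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean_pool(hidden, mask)
naive = [sum(vec[i] for vec in hidden) / len(hidden) for i in range(2)]

print(pooled)  # [2.0, 3.0]
print(naive)   # [1.0, 1.5]
```

The naive average is pulled toward the padding vectors, so the embedding of the same text would change depending on what else is in the batch.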