# ML Workflow Instructions for Generative Chat Assistant

This notebook outlines how to set up an MLflow workflow that handles experiment tracking, model management, and prompt evaluation for a custom GPT model.

The key aspects considered are:

*   Logged parameters: learning rate, number of layers in a neural network, batch size, data augmentation techniques
*   Evaluation metrics: accuracy, precision and recall, F1 score, loss value, execution time
*   Artifacts: trained model weights, model architecture diagrams, training logs, evaluation reports, visualizations (e.g., confusion matrices, ROC curves)





## Installing dependencies (MLflow)

In [None]:
!pip install mlflow langchain-core langchain-openai openai nltk rouge-score bert_score transformers torch

## Selection of evaluation metrics

Some metrics that are important for evaluating the generated text include:

*   Exact Match: Measures if the generated text matches the ground truth exactly.
*   Token Overlap: Measures the overlap between tokens in the generated text and the ground truth.
*   BLEU Score: Measures the similarity between the generated text and the ground truth using n-gram overlaps.
*   ROUGE Score: Measures n-gram and longest-common-subsequence overlap between the generated text and the ground truth, with an emphasis on recall.
*   BERTScore: Uses pre-trained BERT embeddings to measure the similarity of the generated text to the reference text. Captures semantic meaning better than surface-level n-gram overlap.
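
To make the surface-level metrics concrete, here is a minimal standard-library sketch of exact match, token overlap, and the n-gram precision and recall that BLEU and ROUGE-N are built on. The function names and the example sentences are hypothetical; the full BLEU score additionally combines several n-gram orders and applies a brevity penalty, so these helpers are building blocks rather than replacements for the library implementations.

```python
from collections import Counter


def exact_match(candidate: str, reference: str) -> bool:
    """True when both texts are identical, ignoring surrounding whitespace."""
    return candidate.strip() == reference.strip()


def token_overlap(candidate: str, reference: str) -> float:
    """Share of distinct reference tokens that also appear in the candidate."""
    reference_tokens = set(reference.split())
    if not reference_tokens:
        return 0.0
    return len(set(candidate.split()) & reference_tokens) / len(reference_tokens)


def ngram_counts(tokens: list, n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision, the quantity BLEU aggregates."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(count, ref[gram]) for gram, count in cand.items()) / total


def ngram_recall(candidate: str, reference: str, n: int = 1) -> float:
    """n-gram recall, the idea behind ROUGE-N."""
    cand = ngram_counts(candidate.split(), n)
    ref = ngram_counts(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum(min(count, cand[gram]) for gram, count in ref.items()) / total


reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print(exact_match(candidate, reference))
print(token_overlap(candidate, reference))
print(ngram_precision(candidate, reference, n=2))
print(ngram_recall(candidate, reference, n=1))
```

A single changed word leaves the unigram scores high but cuts bigram precision noticeably, which is why n-gram metrics penalize paraphrases that a semantic metric would accept.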

With these metrics in mind, the next cell implements the first four; BERTScore is covered in its own section further down.

In [None]:
import mlflow
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAI
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# Initialize your custom GPT model (placeholder key; prefer reading it from an environment variable)
gpt_model = OpenAI(api_key="your_api_key")

# Define the prompt template; every name in input_variables must appear in the template
prompt_template = PromptTemplate(
    template="Your custom prompt template here: {variable1} {variable2}",
    input_variables=["variable1", "variable2"]
)

# Ground truth data
ground_truths = {
    "prompt1": "expected_response1",
    "prompt2": "expected_response2",
    # Add more ground truth data
}

def evaluate_response(prompt, response, ground_truth, step=0):
    # Exact Match, ignoring surrounding whitespace; stored as 0/1 so it can be logged as a metric
    exact_match = int(response.strip() == ground_truth.strip())

    # Token Overlap: share of distinct ground-truth tokens found in the response
    response_tokens = set(response.split())
    ground_truth_tokens = set(ground_truth.split())
    token_overlap = (
        len(response_tokens & ground_truth_tokens) / len(ground_truth_tokens)
        if ground_truth_tokens else 0.0
    )

    # BLEU Score
    bleu_score = sentence_bleu([ground_truth.split()], response.split())

    # ROUGE Score
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
    rouge_scores = rouge.score(ground_truth, response)

    # Log to MLflow. Params cannot be overwritten within a run, so each prompt
    # gets its own key; metrics share a key and are distinguished by step.
    mlflow.log_param(f"prompt_{step}", prompt)
    mlflow.log_metric("exact_match", exact_match, step=step)
    mlflow.log_metric("token_overlap", token_overlap, step=step)
    mlflow.log_metric("bleu_score", bleu_score, step=step)
    mlflow.log_metric("rouge1", rouge_scores['rouge1'].fmeasure, step=step)
    mlflow.log_metric("rougeL", rouge_scores['rougeL'].fmeasure, step=step)

    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"Ground Truth: {ground_truth}")
    print(f"Exact Match: {exact_match}")
    print(f"Token Overlap: {token_overlap}")
    print(f"BLEU Score: {bleu_score}")
    print(f"ROUGE-1 F1 Score: {rouge_scores['rouge1'].fmeasure}")
    print(f"ROUGE-L F1 Score: {rouge_scores['rougeL'].fmeasure}")

# Start the MLflow run; the context manager ends it even if an error occurs
with mlflow.start_run(run_name="prompt_evaluation_with_ground_truth"):
    # Generate and evaluate responses
    for step, (prompt, ground_truth) in enumerate(ground_truths.items()):
        generated_prompt = prompt_template.format(variable1=prompt, variable2="static_value")
        response = gpt_model.invoke(generated_prompt)
        evaluate_response(generated_prompt, response, ground_truth, step=step)


## Define Thresholds or Criteria

To determine if the generated text is aligned with the ground truth, you need to define thresholds for the metrics. These thresholds will depend on your specific use case and requirements.

Example Thresholds:

*   Exact Match: You might require an exact match for critical applications.
*   Token Overlap: You might set a threshold of 0.7, meaning 70% of the tokens should overlap with the ground truth.
*   BLEU Score: A BLEU score above 0.5 might be considered acceptable.
*   ROUGE Scores: ROUGE-1 and ROUGE-L scores above 0.5 might be considered acceptable.
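
The example thresholds above can be sketched as a small standalone check that needs no MLflow connection. The `THRESHOLDS` dictionary and the `is_aligned` helper are hypothetical names, and the values are only the illustrative ones from the list; the rule treats an exact match as sufficient on its own and otherwise requires every other metric to reach its threshold.

```python
# Illustrative thresholds taken from the list above; tune them per use case.
THRESHOLDS = {
    "token_overlap": 0.7,
    "bleu_score": 0.5,
    "rouge1": 0.5,
    "rougeL": 0.5,
}


def is_aligned(metrics: dict, thresholds: dict = THRESHOLDS) -> bool:
    """Return True when a response counts as aligned with the ground truth."""
    # An exact match is accepted without looking at the other metrics.
    if metrics.get("exact_match", 0) == 1:
        return True
    # Otherwise every metric must reach its threshold; missing metrics count as 0.
    return all(metrics.get(name, 0.0) >= minimum for name, minimum in thresholds.items())


sample = {"exact_match": 0, "token_overlap": 0.8, "bleu_score": 0.55, "rouge1": 0.6, "rougeL": 0.6}
print(is_aligned(sample))
```

Keeping the thresholds in one dictionary makes it easy to log them as run parameters alongside the metrics they are applied to.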

## Interpret Results

In [None]:
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = "your_run_id"  # Replace with your run ID

# Retrieve logged metrics (for a key logged several times, this holds the latest value)
metrics = client.get_run(run_id).data.metrics

# Check against thresholds
exact_match = metrics.get('exact_match', 0) == 1
token_overlap = metrics.get('token_overlap', 0) >= 0.7
bleu_score = metrics.get('bleu_score', 0) >= 0.5
rouge1 = metrics.get('rouge1', 0) >= 0.5
rougeL = metrics.get('rougeL', 0) >= 0.5

# Determine if the response is aligned with the ground truth
is_aligned = exact_match or (token_overlap and bleu_score and rouge1 and rougeL)

if is_aligned:
    print("The generated text is aligned with the ground truth.")
else:
    print("The generated text is not aligned with the ground truth.")


## Evaluating using the BERTScore

BERTScore computes precision, recall, and F1 from BERT token embeddings, giving a semantic similarity measure between the generated text and the ground truth. Because it aligns more closely with human judgment than surface-level n-gram metrics, it tends to be a better fit for evaluating current LLMs.
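
As a rough illustration of the matching step behind these three numbers, the sketch below scores two lists of token vectors with greedy cosine matching: each candidate token is matched to its most similar reference token (precision), and each reference token to its most similar candidate token (recall). The 2-D vectors are toy stand-ins for BERT embeddings and the function names are hypothetical; the `bert_score` library also offers IDF weighting and baseline rescaling, which this sketch leaves out.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(candidate_vecs: list, reference_vecs: list) -> tuple:
    """Return (precision, recall, f1) from greedy best-match cosine similarity."""
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy embeddings: the candidate has one token matching the reference and one unrelated token.
candidate_vecs = [[1.0, 0.0], [0.0, 1.0]]
reference_vecs = [[1.0, 0.0]]
precision, recall, f1 = greedy_match_scores(candidate_vecs, reference_vecs)
print(precision, recall, f1)
```

The extra, unrelated candidate token lowers precision while recall stays perfect, mirroring how a verbose response is scored against a short ground truth.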

In [None]:
import mlflow
from bert_score import score
from openai import OpenAI

# Create the OpenAI client (placeholder key; this interface requires openai>=1.0)
openai_client = OpenAI(api_key="your_openai_api_key")

# Ground truth data
ground_truths = {
    "prompt1": "expected_response1",
    "prompt2": "expected_response2",
    # Add more ground truth data
}

def generate_text(prompt, model="gpt-3.5-turbo"):
    # Send a single-turn chat request and return the text of the first choice
    response = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=50
    )
    return response.choices[0].message.content

def evaluate_response(prompt, response, ground_truth, step=0):
    # BERTScore calculation
    P, R, F1 = score([response], [ground_truth], lang="en", verbose=True)

    # Convert tensor values to float for logging
    precision = P.mean().item()
    recall = R.mean().item()
    f1_score = F1.mean().item()

    # Log to MLflow. Params cannot be overwritten within a run, so each prompt
    # gets its own key; metrics share a key and are distinguished by step.
    mlflow.log_param(f"prompt_{step}", prompt)
    mlflow.log_metric("bert_precision", precision, step=step)
    mlflow.log_metric("bert_recall", recall, step=step)
    mlflow.log_metric("bert_f1", f1_score, step=step)

    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print(f"Ground Truth: {ground_truth}")
    print(f"BERT Precision: {precision}")
    print(f"BERT Recall: {recall}")
    print(f"BERT F1 Score: {f1_score}")

# Start the MLflow run; the context manager ends it even if an error occurs
with mlflow.start_run(run_name="prompt_evaluation_with_bertscore_and_gpt3.5"):
    # Generate and evaluate responses
    for step, (prompt, ground_truth) in enumerate(ground_truths.items()):
        response = generate_text(prompt, model="gpt-3.5-turbo")
        evaluate_response(prompt, response, ground_truth, step=step)


## Useful resources:

*   [The complete guide to string similarity algorithms](https://yassineelkhal.medium.com/the-complete-guide-to-string-similarity-algorithms-1290ad07c6b7) - useful for understanding the range of available metrics and which ones are most relevant for a given kind of text content and prompt.
*   [Similarity Coefficients: A Beginner’s Guide to Measuring String Similarity](https://medium.com/@igniobydigitate/similarity-coefficients-a-beginners-guide-to-measuring-string-similarity-d84da77e8c5a) - a higher-level explanation of some metrics for evaluating text similarity.
*   [About BERTScore](https://medium.com/@abonia/bertscore-explained-in-5-minutes-0b98553bfb71)
*   [BERTScore usage and interactive testing](https://huggingface.co/spaces/evaluate-metric/bertscore).

# Additional Insights on Chat Assistants vs Chat Completions

OpenAI's GPT models offer two main ways of handling conversations with users.

On one hand, [Chat Assistants](https://platform.openai.com/docs/assistants/overview) provide persistent Threads (conversation sessions) as a key feature, so the history of the conversation is kept for you. They also handle more complex data formats in the messages exchanged between the user and the system: messages can include text, images, and other files.

On the other hand, [Chat Completions](https://platform.openai.com/docs/guides/chat-completions) do not keep a persistent conversation history; instead, any earlier messages that should remain part of the ongoing conversation have to be added manually as "context". Chat Completions accept text and images as input, but output only text content (e.g., code and JSON).

**Recommendation:** Choose between the two with pricing and token usage in mind as well. Test with real users to learn how long chat sessions tend to be and whether storing a history is actually worthwhile (e.g., new questions in the same session that refer back to a very early message). If questions are mostly direct and focused on specific pieces of content, including just a few recent messages as context is likely enough, so Chat Completions should be fine.
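
A minimal sketch of the manual context handling that Chat Completions require, assuming the history is kept as a list of `{"role": ..., "content": ...}` messages that alternate between user and assistant. The `build_context` helper and its `max_turns` parameter are hypothetical; the function only assembles the message list and makes no API call.

```python
def build_context(history: list, new_user_message: str,
                  max_turns: int = 3, system_prompt: str = "") -> list:
    """Assemble the messages for a stateless chat request.

    Only the most recent `max_turns` user/assistant exchanges are kept,
    which bounds the number of tokens sent with each request.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    # Each turn is two messages (user + assistant); guard against max_turns == 0,
    # because history[-0:] would return the whole list.
    if max_turns > 0:
        messages.extend(history[-2 * max_turns:])
    messages.append({"role": "user", "content": new_user_message})
    return messages


# Four earlier turns of placeholder conversation.
history = []
for i in range(1, 5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

messages = build_context(history, "question 5", max_turns=1, system_prompt="Be concise.")
for message in messages:
    print(message["role"], "-", message["content"])
```

With `max_turns=1` the request carries only the last exchange plus the new question, so a follow-up that depends on a much earlier message would lose its context; that trade-off is what the user tests mentioned above are meant to reveal.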