A modular Python framework for evaluating log summaries using Large Language Models (LLMs) and semantic similarity metrics.
- Multiple LLM Providers: Support for OpenAI GPT models and Hugging Face transformers (BART, Flan-T5, etc.)
- Semantic Similarity: Uses sentence transformers for accurate similarity scoring
- Additional Metrics: Optional support for ROUGE, BERTScore, TF-IDF, and KL divergence
- Modular Design: Clean, extensible architecture following Python best practices
- Easy to Use: Simple API for both single evaluations and batch processing
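The semantic-similarity idea behind the scoring can be sketched as cosine similarity between embedding vectors. The toy below uses hand-built vectors and a hypothetical `cosine_similarity` helper purely for illustration; the framework itself embeds texts with a sentence-transformer model before comparing them.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (toy illustration).

    In practice the vectors would be sentence-transformer embeddings of the
    log text and the summary, not hand-written lists.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal directions score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```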
```
pip install -r requirements.txt
```

Or install as a package:

```
pip install -e .
```

For OpenAI support:

```
pip install openai
```

For additional metrics (ROUGE, BERTScore, etc.):

```
pip install scikit-learn rouge-score bert-score nltk
```

Or install everything:

```
pip install -e ".[all]"
```

```python
from reflex import LogSummaryEvaluator, HuggingFaceProvider

# Initialize provider and evaluator
llm_provider = HuggingFaceProvider(model_name="facebook/bart-large-cnn")
evaluator = LogSummaryEvaluator(llm_provider)

# Evaluate a log summary
log = "2025-06-30 12:45:03 ERROR: AuthService failed to validate token. JWT expired."
user_summary = "JWT expired during token check in AuthService"
result = evaluator.evaluate(log, user_summary)

print(f"Similarity Score: {result['similarity_score']}")
print(f"LLM Summary: {result['llm_summary']}")
```

```python
from reflex import LogSummaryEvaluator, OpenAIProvider
import os

# Set your API key (or use the OPENAI_API_KEY environment variable)
llm_provider = OpenAIProvider(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4")
evaluator = LogSummaryEvaluator(llm_provider)

result = evaluator.evaluate(log, user_summary)
```

```python
from reflex import LogSummaryEvaluator, HuggingFaceProvider
from reflex.utils import parse_log_summary_pairs
import csv

llm_provider = HuggingFaceProvider(model_name="facebook/bart-large-cnn")
evaluator = LogSummaryEvaluator(llm_provider)

# Parse log-summary pairs from file
pairs = parse_log_summary_pairs("data/logs.txt")

# Evaluate and save results
with open("results.csv", "w", newline="") as f:
    # extrasaction="ignore" skips any keys in the result dict
    # beyond the three columns written to the CSV
    writer = csv.DictWriter(
        f,
        fieldnames=["llm_summary", "user_summary", "similarity_score"],
        extrasaction="ignore",
    )
    writer.writeheader()
    for log, user_summary in pairs:
        result = evaluator.evaluate(log, user_summary)
        writer.writerow(result)
```

```
Reflex/
├── reflex/              # Main package
│   ├── core/            # Core evaluation components
│   ├── providers/       # LLM provider implementations
│   ├── metrics/         # Additional evaluation metrics
│   └── utils/           # Utility functions
├── examples/            # Example scripts
├── scripts/             # Utility scripts
├── tests/               # Unit tests
├── data/                # Sample data files
├── docs/                # Documentation
├── requirements.txt     # Python dependencies
├── setup.py             # Package setup
└── README.md            # This file
```
The framework expects log-summary pairs in the following format:
```
#1#
<log text here>
#summary:#
<summary text here>
#2#
<log text here>
#summary:#
<summary text here>
...
```
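The bundled `reflex.utils.parse_log_summary_pairs` handles this format. As an illustration of the idea only, a minimal parser for the delimiters shown above might look like the following (`parse_pairs` is a hypothetical name; the bundled function may behave differently):

```python
import re

def parse_pairs(text):
    """Parse '#N#' / '#summary:#' delimited records into (log, summary) tuples.

    Sketch for illustration; reflex.utils.parse_log_summary_pairs is the
    real entry point and may handle edge cases differently.
    """
    pairs = []
    # Split the file on record markers like '#1#', '#2#', ...
    records = re.split(r"#\d+#", text)
    for record in records:
        if "#summary:#" not in record:
            continue  # skip anything before the first marker
        log, summary = record.split("#summary:#", 1)
        pairs.append((log.strip(), summary.strip()))
    return pairs
```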
See the examples/ directory for more detailed examples:
- `basic_usage.py`: Simple evaluation examples
- `batch_evaluation.py`: Batch processing from files
- `metrics_comparison.py`: Comparing different metrics
`LogSummaryEvaluator`: main evaluator class that generates summaries and computes similarity scores.

```python
evaluator = LogSummaryEvaluator(llm_provider, embedding_model="all-MiniLM-L6-v2")
result = evaluator.evaluate(log_text, user_summary)
```

`OpenAIProvider`: provider for OpenAI models.

```python
provider = OpenAIProvider(api_key="your-key", model="gpt-4")
```

`HuggingFaceProvider`: provider for Hugging Face transformer models.

```python
provider = HuggingFaceProvider(model_name="facebook/bart-large-cnn")
```

The `SimilarityMetrics` class provides additional evaluation metrics:
```python
from reflex.metrics import SimilarityMetrics

# TF-IDF cosine similarity
tfidf_scores = SimilarityMetrics.tfidf_cosine(logs, summaries)

# ROUGE scores
rouge_scores = SimilarityMetrics.rouge_scores(logs, summaries)

# BERTScore
bert_scores = SimilarityMetrics.bert_scores(logs, summaries)

# KL divergence
kl_scores = SimilarityMetrics.kl_divergence(logs, summaries)
```

- `OPENAI_API_KEY`: OpenAI API key (optional; can be passed directly to the provider)
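To make the KL-divergence metric concrete, here is a toy version over unigram word distributions. This standalone `kl_divergence` function is a sketch, not `SimilarityMetrics.kl_divergence`; the tokenization and smoothing choices here are assumptions.

```python
from collections import Counter
from math import log

def kl_divergence(p_text, q_text, epsilon=1e-9):
    """KL divergence D(P || Q) between unigram distributions of two texts.

    Toy illustration: identical texts give 0.0, and the score grows as the
    word distributions diverge. epsilon guards against log(0) for words
    that appear only in p_text.
    """
    p_counts = Counter(p_text.lower().split())
    q_counts = Counter(q_text.lower().split())
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    score = 0.0
    for token in set(p_counts) | set(q_counts):
        p = p_counts[token] / p_total
        q = max(q_counts[token] / q_total, epsilon)
        if p > 0:
            score += p * log(p / q)
    return score

print(kl_divergence("jwt expired in authservice", "jwt expired in authservice"))  # 0.0
```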
Hugging Face models:
- `facebook/bart-large-cnn`: Good for summarization
- `google/flan-t5-xl`: General purpose
- `google/pegasus-xsum`: Specialized for summarization

OpenAI models:
- `gpt-4`: Best quality (requires API key)
- `gpt-3.5-turbo`: Faster and cheaper alternative
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License - see LICENSE file for details
If you use REFLEX in your research, please cite:
```bibtex
@INPROCEEDINGS{11405982,
  author={Mudgal, Priyanka},
  booktitle={2025 1st International Conference on Emerging Trends in Information Systems and Informatics (ICETISI)},
  title={REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment},
  year={2025},
  pages={1-7},
  keywords={Measurement;Training;Feedback loop;Protocols;Large language models;Perturbation methods;Semantics;Market research;Real-time systems;Informatics;LLM-as-a-judge;Log summarization;Log summary score;Log analysis},
  doi={10.1109/ICETISI67983.2025.11405982}
}
```
- Built with sentence-transformers
- Uses Hugging Face Transformers
- Inspired by research in log analysis and summarization