TaskEval: AI-Powered Answer Validation System

This project involves developing a Streamlit application for model evaluation teams to select and evaluate test cases from the GAIA dataset against the OpenAI model. The application allows users to send context data and questions to the OpenAI model and compare the model's answers with the final answers from the metadata file.

Project Overview

This application automates task validation based on predefined criteria stored in a MySQL database. It leverages OpenAI's API to generate answers for given tasks and compares them with the expected answers stored in the database using cosine similarity. The app also tracks the number of attempts made and visualizes task progress through bar charts and pie charts for better understanding of validation performance.

Features:

Task selection and submission via Streamlit interface.
Fetch answers from AWS S3 and MySQL.
Cosine similarity-based validation of answers using Sentence Transformers.
Tracks number of attempts per task.
Various visualizations for performance analytics (correct attempts, validation status, etc.).

Technologies Used

Streamlit: Frontend framework
MySQL: Backend database for task metadata and answers
AWS S3: Storage for external files (audio, images, PDFs)
OpenAI API: For generating task responses
Sentence Transformers: For embedding and comparing sentences
Matplotlib & Seaborn: For data visualizations
Boto3: AWS S3 integration
PyPDF2: For PDF file reading
Pandas: Data handling and display

Installation

Clone the Repository:

git clone https://github.com/your-username/TaskEval.git
cd TaskEval

Setup Virtual Env

1.  python -m venv venvsource venv/bin/activate

2.  \# On Windows use \`venv\\Scripts\\activate\`

3.  pip install -r requirements.txt

MySQL Setup:
- Ensure you have MySQL installed and running.
- Create the GAIA database and populate it with task metadata (see the /db folder for SQL schema).
AWS S3 Setup:
- Make sure your AWS credentials (aws_access_key_id, aws_secret_access_key) are set correctly to access the S3 bucket containing the task files.
OpenAI API Key:
- Update the openai.api_key in the script with your OpenAI API key.

Application Workflow

Step 1: Streamlit Display

On the homepage, the user selects a task from a dropdown that fetches questions stored in the MySQL database.

Step 2: Generate Answers

Once a task is selected, the application uses the OpenAI API to generate an answer for the selected question.

Step 3: Cosine Similarity Check

The generated answer is compared with the correct answer from the database using cosine similarity. If the similarity is above a predefined threshold, the answer is validated.

Step 4: Regenerate Answers

If the initial answer doesn't meet the threshold, the user can attempt to regenerate the answer. The number of attempts is tracked and updated in the MySQL database.

Step 5: Update Attempts & Validation

The application updates the number of attempts and the validation status (whether the answer matches the correct one) in the MySQL database.

Step 6: Visualizations

A separate page displays various visualizations such as:
- Number of Attempts per Task (Bar Chart)
- Correct Responses: First Attempt vs Regenerated Attempts (Bar Chart)
- Validation Status of Task IDs (Pie Chart)

Data Flow and Backend Processes

Frontend Input: Users interact with the app through Streamlit's UI, selecting tasks and generating answers.
Data Storage:
- Task metadata (questions, correct answers, validation status, etc.) is stored in a MySQL database.
- Files such as PDFs, images, and audios related to tasks are stored in AWS S3.
Processing:
- The application generates answers using OpenAI and compares them with the correct answer using cosine similarity from the Sentence Transformers library.
Output:
- The result (validated or not) is stored back into MySQL.
- The user can view the validation status, number of attempts, and other metrics through data visualizations.

Usage Instructions

streamlit run main.py

Task Selection:
- Select a task from the dropdown to see the question.
Answer Generation:
- Click on the "Generate Answer" button to receive an AI-generated answer from OpenAI.
Regenerate Answer:
- If the initial answer is incorrect, click "Regenerate" to try again.
View Visualizations:
- Visit the Visualizations page to see analytics on the number of attempts, validation status, etc.

Visualizations

Number of Attempts per Task:Displays a horizontal bar chart showing the number of attempts taken for each task.
Correct Responses: First Attempt vs Regenerated Attempts:Bar chart comparing the number of correct answers on the first attempt versus those that required regeneration.
Validation Status of Task IDs:Pie chart showing the percentage of tasks that have been validated versus those that haven’t.

License

This project is licensed under the MIT License. See the LICENSE file for more details.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Code		Code
Diagrams		Diagrams
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TaskEval: AI-Powered Answer Validation System

Table of Contents

Project Overview

Features:

Technologies Used

Installation

Application Workflow

Step 1: Streamlit Display

Step 2: Generate Answers

Step 3: Cosine Similarity Check

Step 4: Regenerate Answers

Step 5: Update Attempts & Validation

Step 6: Visualizations

Data Flow and Backend Processes

Usage Instructions

Visualizations

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TaskEval: AI-Powered Answer Validation System

Table of Contents

Project Overview

Features:

Technologies Used

Installation

Application Workflow

Step 1: Streamlit Display

Step 2: Generate Answers

Step 3: Cosine Similarity Check

Step 4: Regenerate Answers

Step 5: Update Attempts & Validation

Step 6: Visualizations

Data Flow and Backend Processes

Usage Instructions

Visualizations

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages