<a href="https://colab.research.google.com/github/huishingchong/agile_llm/blob/main/pipeline_eval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook was used to investigate RQ2 and RQ3: it evaluates the final pipeline with the UniEval framework and saves the model outputs to CSV files for manual evaluation. The UniEval metric outputs from running the experiment (on a V100 GPU runtime) are kept so that they can be viewed for transparency. The model outputs are in Appendix C of the paper.

## Set up
### Import Packages and set up environment with API keys

In [None]:
!pip install transformers datasets torch torchvision torchaudio langchain-community faiss-cpu sentence-transformers langchain gradio evaluate langchain_experimental


In [None]:
# Get a HuggingFace token: https://huggingface.co/docs/api-inference/quicktour#get-your-api-token
from getpass import getpass
import os
HUGGINGFACEHUB_API_TOKEN = getpass()

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

### Load Model

In [None]:
from langchain_community.llms import HuggingFaceEndpoint
model_name = "tiiuae/falcon-7b-instruct"
llm = HuggingFaceEndpoint(
    repo_id=model_name,
    model=model_name,
    task="text-generation",
    temperature=0.5,
    max_new_tokens=200
)

## Retrieval-Augmented Generation (with Reed API as external retrieval source)

### Set up prompts, embeddings and retriever

In [None]:
from requests import get
import csv
import re
import pandas as pd
from langchain_community.llms import HuggingFaceEndpoint
from langchain.chains import LLMChain, RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.document_loaders.csv_loader import CSVLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

In [None]:
#PROMPT1: Prompt with context
template = """Use the following context to answer the question at the end.
If you don't know the answer, please think rationally and answer from your own knowledge base.
Context: {context}

Question: {question}
Answer:
"""
QA_CHAIN_PROMPT = PromptTemplate(template=template, input_variables=["context", "question"])

#PROMPT2: Normal prompting
template= """
        Please answer the question.
        Answer professionally, and where appropriate, in a Computer Science educational context.
        Question: {question}
        Response:
        """
prompt = PromptTemplate(template=template, input_variables=["question"])

In [None]:
modelPath = "sentence-transformers/gtr-t5-base" # Using t5 sentence transformer model to generate embeddings
model_kwargs = {'device':'cpu'}
encode_kwargs = {'normalize_embeddings': True} # Normalizing embeddings may help improve similarity metrics by ensuring that embeddings magnitude does not affect the similarity scores

# Initialise an instance of HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
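
To illustrate the comment on `normalize_embeddings`, the sketch below uses small made-up 2-D vectors (not real `gtr-t5-base` embeddings) to show that, once vectors are scaled to unit length, the plain dot product equals cosine similarity and no longer depends on magnitude. The helper names are hypothetical and only serve the illustration.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as a, ten times the magnitude
c = [4.0, 3.0]    # different direction

a_n, b_n, c_n = l2_normalise(a), l2_normalise(b), l2_normalise(c)

print(dot(a, b))                      # raw dot product is inflated by magnitude
print(dot(a_n, b_n))                  # after normalisation: identical direction
print(dot(a_n, c_n), cosine(a, c))    # normalised dot product matches cosine
```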

In [None]:
# Initialize text splitter
text_split = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
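
As a rough picture of what `chunk_size=1000` and `chunk_overlap=50` mean, the sketch below cuts a string into fixed windows of 1000 characters that share 50 characters with their neighbour. This is a simplification: `RecursiveCharacterTextSplitter` tries to break at natural separators (paragraphs, lines, spaces) first, so its real chunks are usually shorter and do not overlap this exactly. The function `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=50):
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

# A 2000-character dummy "job description"
text = "".join(str(i % 10) for i in range(2000))
chunks = sliding_chunks(text)
print([len(chunk) for chunk in chunks])
```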

### Functions to connect to the API

In [None]:
def clean_html(raw_html):
    """Helper function to clean HTML tags from text."""
    CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext
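
A quick check of what `clean_html` does, using a made-up snippet (not an actual Reed API response). Note that tags and entities are deleted rather than replaced with spaces, so words separated only by markup can run together.

```python
import re

# Same pattern as in clean_html above: HTML tags and character entities
CLEANR = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')

def clean_html(raw_html):
    """Remove HTML tags and entities from text."""
    return re.sub(CLEANR, '', raw_html)

sample = "<p>Python &amp; SQL</p><ul><li>Remote</li></ul>"
cleaned = clean_html(sample)
print(repr(cleaned))
```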

In [None]:
# Enter your REED API token
reed_key = getpass("Enter the Reed API token: ")

In [None]:
def query_job_listings(job_name, location, reed_key):
    """Function to query job listings from the API."""
    BASE_URL = 'https://www.reed.co.uk/api/1.0/search'
    # Let requests URL-encode the query parameters (job titles often contain spaces)
    params = {'keywords': job_name, 'locationName': location}
    # The API key is sent as the basic-auth username, with the password left empty
    search_response = get(BASE_URL, params=params, auth=(reed_key, ''))

    # Check if the request was successful
    if search_response.status_code == 200:
        job_listings = search_response.json()["results"]
        return job_listings
    else:
        print(f'Error: {search_response.status_code}')
        return []

In [None]:
def create_jobs_csv(job_listings, reed_key):
    """Function to create a CSV file with details of job listings."""
    with open('job_listings.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Job Title', 'Job Description', 'Location', 'Part-time', 'Full-time']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()

        # Iterate through job listings to collect required details of each job into csv file
        for job in job_listings:
            job_id = job["jobId"]
            details_url = f'https://www.reed.co.uk/api/1.0/jobs/{job_id}'
            detail_response = get(details_url, auth=(reed_key, ''))
            detail = detail_response.json()
            job_title = detail.get("jobTitle", "")
            job_description = clean_html(detail.get("jobDescription", ""))
            location = detail.get("locationName", "")
            part_time = detail.get("partTime", "")
            full_time = detail.get("fullTime", "")
            # Write job details to CSV
            writer.writerow({'Job Title': job_title, 'Job Description': job_description, 'Location': location, "Part-time": part_time, "Full-time": full_time})


In [None]:
def get_job(query):
    """Helper function that returns the Computer Science subject of the sentence to feed into job search."""
    helper_template = """
    [INST]Output only the Computer Science job title of the sentence, give one or two words.
    For example, the output of "What programming skills would IT managers require to possess?" is "IT manager".
    The output of "What are some software tools that a data scientist need to know?" is "data scientist". [/INST]
    Sentence: {query}
    The output is:
    """
    prompt = PromptTemplate(template=helper_template, input_variables=["query"])
    model_name = "mistralai/Mistral-7B-Instruct-v0.1"
    llm = HuggingFaceEndpoint(
        repo_id=model_name,
        model=model_name,
        task="text-generation",
        temperature=0.5,
        max_new_tokens=200
    )
    helper_llm = LLMChain(llm=llm, prompt=prompt)
    response = helper_llm.invoke(input=query)
    text = response["text"]
    print(text)
    return text
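
The helper model's raw output can carry leading whitespace, quotation marks or extra lines (the recorded outputs further down show the job title indented on its own line). One possible way to tidy it before it is used as a search keyword is sketched below. `normalise_job_title` is a hypothetical helper that is not part of the evaluated pipeline, and the sample strings are made up.

```python
def normalise_job_title(raw_text):
    """Return the first non-empty line of the model output, without padding or quotes."""
    for line in raw_text.splitlines():
        candidate = line.strip().strip('"\'').strip()
        if candidate:
            return candidate
    return ""  # nothing usable in the output

print(normalise_job_title("\n    Software Engineer\n\n"))
print(normalise_job_title('  "IT manager"\nSentence: something else'))
```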

In [None]:
# Same implementation as the final pipeline, except that it returns the retrieved contexts alongside the response for evaluation purposes
def pipeline(query):
        subject = get_job(query) # Find keywords to search jobs in API
        location = ""
        job_listings = query_job_listings(clean_html(subject), location, reed_key)
        create_jobs_csv(job_listings, reed_key)

        with open('job_listings.csv', 'r', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile)
            next(reader)  # Skip header
            first_row = next(reader, None)
            if first_row:
                loader = CSVLoader(file_path="job_listings.csv")
                documents = loader.load() # Load data for retrieval

                d = text_split.split_documents(documents)
                db = FAISS.from_documents(d, embeddings)

                chain_type_kwargs = {"prompt": QA_CHAIN_PROMPT}
                qa = RetrievalQA.from_chain_type(llm=llm,
                    retriever=db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"score_threshold": .5, "k": 3}),
                    return_source_documents=True,
                    chain_type_kwargs=chain_type_kwargs, verbose=True)

                input_dict = {'query': query}
                result = qa.invoke(input_dict)
                documents = result.get("source_documents", [])
                for doc in documents:
                    print(doc)
                text = result['result']
                return documents, text
            else: # If no jobs are found, fall back to plain prompting without retrieval
                llm_chain = LLMChain(prompt=prompt, llm=llm, verbose=True)
                input_dict = {'question': query}
                response_dict = llm_chain.invoke(input_dict)
                response = response_dict['text']
                return [], response
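
The retriever in `pipeline` is configured with `search_type="similarity_score_threshold"`, `score_threshold` 0.5 and `k` 3. The sketch below shows the selection logic we understand this to imply, applied to made-up relevance scores: keep documents scoring at least the threshold, then return at most `k` of them, best first. It does not reproduce how LangChain derives relevance scores from FAISS distances, and `filter_by_threshold` is a hypothetical name.

```python
def filter_by_threshold(scored_docs, score_threshold=0.5, k=3):
    """Keep at most k documents whose relevance score meets the threshold."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # most relevant first
    return [doc for doc, _ in kept[:k]]

# Hypothetical (document, relevance score) pairs
candidates = [("listing A", 0.9), ("listing B", 0.4), ("listing C", 0.7),
              ("listing D", 0.55), ("listing E", 0.51)]
selected = filter_by_threshold(candidates)
print(selected)
```

If no candidate reaches the threshold, the list is empty, in which case the context passed to the prompt would be empty as well.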

## Evaluation

In [None]:
# make sure your runtime type is V100 GPU
!pip install torch tiktoken textstat

Prepare evaluation dataset

In [None]:
pipeline_eval = pd.DataFrame({
    "question": [
        "Can you give me a typical job description of graduate role for Software Engineering?",
        "Can you give an example job description for a Software Engineer intern?",
        "What particular skills do recruiters look for in a Web developer?",
        "What are cybersecurity analyst qualifications that recruiters look for?",
        "What are some software tools that an IT consultant need to know?",
        "What are skill descriptions most recruiters look for in a software engineer?",
        "What are some programming languages demanded in web developer jobs?",
        "Can you give common job requirements in Computer Vision?",
        "Please explain the responsibilities of a software architect.",
        "What qualifications does a full-stack engineer need?",
        "What responsibilities would I have as an AI solutions architect?",
        "What are recent topics a Cybersecurity consultant should learn about?",
        "Give me some fairly recent topics in the realm of data science.",
        "Can you explain the skills or experiences that recruiters look for in a Software Architect?",
        "What programming skills would IT managers require to possess?",
    ],

    "ground_truth": [
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",

    ]
})

pipeline_eval2 = pd.DataFrame({
    "question": [
        "What programming languages should a software engineer know these days?",
        "Please explain to me skills I need to learn to become a cloud engineer.",
        "What are the responsibilities for a typical IT manager?",
        "What are examples of frameworks I would be working on as a Database Administrator?",
        "Can you explain some skills or qualifications to become a game developer?",
        "What are the special skills a professional working in cybersecurity should have?",
        "What does a typical UX designer do?",
        "What do recruiters look for in an AI engineer?",
        "If I want to get a graduate role in game developing, what skills should I expand on?",
        "Based on job descriptions, what does a software architect do?",
        "If I want to become a software anlayst, please advise me where to start.",
        "Explain several frameworks an AI engineer need to know how to use.",
        "Please describe the responsibilities of a UX designer",
        "Advise on what to learn to become a proficient cloud engineer.",
        "What are characteristics that recruiters typically look for in a database administrator"
    ],

    "ground_truth": [
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
        "",
    ]
})

Run the pipeline over each evaluation dataset once and save the results to its respective CSV file (for static evaluation)

In [None]:
# Helper function: To prevent having to run predictions again, save output to the dataset as a new column 'predictions'
import datasets

def chain_predictions_from_data(eval_data):
  data = eval_data.copy()
  predictions = []  # List to store predictions for all questions
  source_documents = []
  for question in data['question']:
    # Invoke the llm_chain model with the current question
    documents, response = pipeline(question)
    predictions.append(response)  # Append the generated text to the predictions list
    print(question)
    print(documents)
    print(response)
    concatenated_page_content = ""
    for document in documents:
      # Concatenate the page_content of each retrieved Document, one per line
      concatenated_page_content += document.page_content + "\n"
    source_documents.append(concatenated_page_content)
  data['source_documents'] = source_documents
  data['predictions'] = predictions  # Assign the predictions list to a new column
  return data

In [None]:
rag_pipeline_eval = chain_predictions_from_data(pipeline_eval)
os.makedirs("industry", exist_ok=True)
rag_pipeline_eval.to_csv("industry/pipeline_evaluation.csv", index=False)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful

    Software Engineer


> Entering new RetrievalQA chain...

> Finished chain.
page_content='Job Title: Software Engineer' metadata={'source': 'job_listings.csv', 'row': 0}
page_content='Job Title: Software Engineer' metadata={'source': 'job_listings.csv', 'row': 1}
page_content='Job Title: Software Engineer' metadata={'source': 'job_listings.csv', 'row': 2}
Can you give me a typical job description of graduate role for Software Engineering?
[Document(page_content='Job Title: Software Engineer', metadata={'source': 'job_listings.csv', 'row': 0}), Document(page_content='Job Title: Software Engineer', metadata={'source': 'job_listings.csv', 'row': 1}), Document(page_content='Job Title: Software Engineer', metadata={'source': 'job_

In [None]:
rag_pipeline_eval2 = chain_predictions_from_data(pipeline_eval2)
os.makedirs("industry", exist_ok=True)
rag_pipeline_eval2.to_csv("industry/pipeline_evaluation2.csv", index=False)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful

    software engineer


> Entering new RetrievalQA chain...

> Finished chain.
page_content="Job Description: Software Engineer / Developer (Graduate Python JavaScript) WFH / Cambridge to 60k Are you a bright, ambitious Software Engineer with a strong record of academic achievement looking for an opportunity to progress your career? You could be joining a tech start-up that are producing a cutting edge Digital Twins and Information Management platform for the construction industry and highways agencies. As a Software Engineer / Developer you'll join a small team, collaborating with the CEO, CTO, Chief Scientist and other world leading researchers to design and develop new features and enhancements on the core platform. You'll be

UniEval: multi-dimensional metric and factual consistency

In [None]:
!git clone https://github.com/maszhongming/UniEval.git
%cd UniEval
!pip install -r requirements.txt

In [None]:
import torch
import nltk
from utils import convert_to_json
from metric.evaluator import get_evaluator

nltk.download('punkt')
torch.cuda.is_available()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Factual consistency score
task = 'fact'

src_list = rag_pipeline_eval["source_documents"]
output_list = rag_pipeline_eval["predictions"]

# Prepare data for pre-trained evaluators
data = convert_to_json(output_list=output_list, src_list=src_list)
# Initialise evaluator for a specific task
evaluator = get_evaluator(task)
# Get factual consistency scores
eval_scores = evaluator.evaluate(data, print_result=True)

Evaluating consistency of 15 samples !!!


100%|██████████| 12/12 [00:04<00:00,  2.49it/s]


Evaluation scores are shown below:
+-------------+---------+
|  Dimensions |  Score  |
+-------------+---------+
| consistency | 0.65651 |
+-------------+---------+
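
As far as we can tell from the UniEval source, the factual consistency evaluator splits each output into sentences, scores every sentence against the source documents, and averages the sentence scores per sample. That would explain why the progress bar counts batches of sentences rather than the 15 samples. The sketch below only illustrates this aggregation: the regex sentence splitter stands in for `nltk.sent_tokenize`, and `toy_scorer` is a made-up exact-match check, not the T5-based scorer UniEval actually uses.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (stand-in for nltk.sent_tokenize)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def consistency_score(output, score_sentence):
    """Average the per-sentence scores of one model output."""
    sentences = split_sentences(output)
    if not sentences:
        return 0.0
    return sum(score_sentence(s) for s in sentences) / len(sentences)

# Hypothetical source and output, for illustration only
source = "The role requires Python. The team is remote."
output = "The role requires Python. The salary is high."

def toy_scorer(sentence):
    # 1.0 if the sentence appears verbatim in the source, else 0.0
    return 1.0 if sentence in source else 0.0

score = consistency_score(output, toy_scorer)
print(score)
```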





In [None]:
task = 'fact'

src_list = rag_pipeline_eval2["source_documents"]
output_list = rag_pipeline_eval2["predictions"]

data = convert_to_json(output_list=output_list, src_list=src_list)
evaluator = get_evaluator(task)
eval_scores = evaluator.evaluate(data, print_result=True)

Evaluating consistency of 15 samples !!!


100%|██████████| 22/22 [00:09<00:00,  2.33it/s]


Evaluation scores are shown below:
+-------------+----------+
|  Dimensions |  Score   |
+-------------+----------+
| consistency | 0.667881 |
+-------------+----------+





In [None]:
# Multi-dimensional scores
task = 'dialogue'

src_list = rag_pipeline_eval["question"]
context_list = rag_pipeline_eval["source_documents"]
output_list = rag_pipeline_eval["predictions"]

data = convert_to_json(output_list=output_list,
                       src_list=src_list, context_list=context_list)
evaluator = get_evaluator(task)
eval_scores = evaluator.evaluate(data, print_result=True)

Evaluating naturalness of 15 samples !!!


100%|██████████| 2/2 [00:00<00:00,  7.04it/s]


Evaluating coherence of 15 samples !!!


100%|██████████| 2/2 [00:00<00:00,  6.73it/s]


Evaluating engagingness of 15 samples !!!


100%|██████████| 12/12 [00:05<00:00,  2.35it/s]


Evaluating groundedness of 15 samples !!!


100%|██████████| 2/2 [00:01<00:00,  1.45it/s]


Evaluating understandability of 15 samples !!!


100%|██████████| 2/2 [00:00<00:00,  7.57it/s]


Evaluation scores are shown below:
+-------------------+----------+
|     Dimensions    |  Score   |
+-------------------+----------+
|    naturalness    | 0.772858 |
|     coherence     | 0.986734 |
|    engagingness   | 3.750929 |
|    groundedness   | 0.96767  |
| understandability | 0.790636 |
|      overall      | 1.453765 |
+-------------------+----------+





In [None]:
task = 'dialogue'

src_list = rag_pipeline_eval2["question"]
context_list = rag_pipeline_eval2["source_documents"]
output_list = rag_pipeline_eval2["predictions"]

data = convert_to_json(output_list=output_list,
                       src_list=src_list, context_list=context_list)
evaluator = get_evaluator(task)
eval_scores = evaluator.evaluate(data, print_result=True)

Evaluating naturalness of 15 samples !!!


100%|██████████| 2/2 [00:00<00:00,  7.60it/s]


Evaluating coherence of 15 samples !!!


100%|██████████| 2/2 [00:00<00:00,  7.22it/s]


Evaluating engagingness of 15 samples !!!


100%|██████████| 22/22 [00:09<00:00,  2.21it/s]


Evaluating groundedness of 15 samples !!!


100%|██████████| 2/2 [00:01<00:00,  1.51it/s]


Evaluating understandability of 15 samples !!!


100%|██████████| 2/2 [00:00<00:00,  7.60it/s]


Evaluation scores are shown below:
+-------------------+----------+
|     Dimensions    |  Score   |
+-------------------+----------+
|    naturalness    | 0.608406 |
|     coherence     | 0.995535 |
|    engagingness   | 6.014042 |
|    groundedness   | 0.961856 |
| understandability | 0.637322 |
|      overall      | 1.843432 |
+-------------------+----------+
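
The `overall` rows in the two tables above are consistent with an unweighted mean of the five dimension scores, which the sketch below checks using the reported values. Note that engagingness is not bounded by 1 (to our understanding, UniEval sums it over sentences instead of averaging), so it dominates `overall`, and the two `overall` figures should not be read as scores on a 0 to 1 scale.

```python
def overall(scores):
    """Unweighted mean of the dimension scores."""
    return sum(scores.values()) / len(scores)

# Values copied from the two score tables above
run1 = {"naturalness": 0.772858, "coherence": 0.986734, "engagingness": 3.750929,
        "groundedness": 0.96767, "understandability": 0.790636}
run2 = {"naturalness": 0.608406, "coherence": 0.995535, "engagingness": 6.014042,
        "groundedness": 0.961856, "understandability": 0.637322}

print(round(overall(run1), 6))  # compare with the reported 1.453765
print(round(overall(run2), 6))  # compare with the reported 1.843432
```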



