# **Ground Truth Dataset Creation Using GPT-3.5-turbo and GPT-4**

## **Setup**

In [1]:
import os
os.getcwd()

'c:\\Users\\neilp\\Documents\\College\\F20PA\\f20pa-2023-24-student-revision-system\\test\\AI Makerspace'

## **Loading Documents**

### **Arxiv**

In [1]:
from langchain_community.document_loaders.arxiv import ArxivLoader

loader = ArxivLoader("Vaswani Attention Is All You Need", load_max_docs=1)
docs = loader.load()

In [2]:
docs[0].metadata

{'Published': '2023-08-02',
 'Title': 'Attention Is All You Need',
 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin',
 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, 

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(docs)

In [4]:
len(chunks)

50

In [5]:
chunks[36].page_content

'Parser [29] even when training only on the WSJ training set of 40K sentences.\n7\nConclusion\nIn this work, we presented the Transformer, the first sequence transduction model based entirely on\nattention, replacing the recurrent layers most commonly used in encoder-decoder architectures with\nmulti-headed self-attention.\nFor translation tasks, the Transformer can be trained significantly faster than architectures based\non recurrent or convolutional layers. On both WMT 2014 English-to-German and WMT 2014\nEnglish-to-French translation tasks, we achieve a new state of the art. In the former task our best\nmodel outperforms even all previously reported ensembles.\nWe are excited about the future of attention-based models and plan to apply them to other tasks. We\nplan to extend the Transformer to problems involving input and output modalities other than text and\nto investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs'

In [7]:
contexts = [chunk.page_content for chunk in chunks[:37]]    # [:37] to exclude the references section of the paper
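
The hard-coded `[:37]` cutoff only holds for this paper and these splitter settings. As a rough alternative, the sketch below searches for the first chunk that contains a line reading exactly `References`. The helper name `find_references_start` and the toy chunks are hypothetical, and the heuristic assumes the heading survives PDF extraction on a line of its own, which is not verified here.

```python
from typing import List, Optional

def find_references_start(texts: List[str], heading: str = "References") -> Optional[int]:
    """Return the index of the first text with a line equal to `heading`, else None."""
    for index, text in enumerate(texts):
        if any(line.strip() == heading for line in text.splitlines()):
            return index
    return None

# Toy stand-ins for chunk.page_content values (not taken from the paper).
toy_chunks = [
    "1\nIntroduction\nSome text.",
    "7\nConclusion\nSome text.",
    "closing sentence.\nReferences\n[1] First entry.",
    "[2] Second entry.",
]
cutoff = find_references_start(toy_chunks)
kept = toy_chunks if cutoff is None else toy_chunks[:cutoff]
print(cutoff, len(kept))
```

Note that the chunk containing the heading is dropped whole, so any body text sharing that chunk is lost; inspect the boundary chunk before relying on the result.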

## **Ground Truth Dataset Creation Using GPT-3.5-turbo and GPT-4**

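This section makes one API call per context for each model, so it can take a few minutes to run. A pre-built evaluation dataset is provided if you would rather skip it.

The basic idea: use LangChain to prompt one model (GPT-3.5-turbo) for a question about each context, then prompt a second model (GPT-4) to answer each question from that same context.

Let's look at how that works in the code!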
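
Before the LangChain version, here is a minimal offline sketch of the same two-stage flow. The function `build_qac_triples` and the `fake_ask` / `fake_answer` stand-ins are hypothetical names introduced for illustration; in the notebook the two stages are LLM calls.

```python
from typing import Callable, Dict, List

def build_qac_triples(
    contexts: List[str],
    ask: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Generate one question per context, then answer it from that same context."""
    triples = []
    for context in contexts:
        question = ask(context)                    # stage 1: question generation
        ground_truth = answer(question, context)   # stage 2: answer generation
        triples.append({"question": question, "context": context, "answer": ground_truth})
    return triples

# Hypothetical stand-ins for the two model calls so the sketch runs without an API key.
def fake_ask(context: str) -> str:
    return f"What does this passage claim? ({len(context)} characters)"

def fake_answer(question: str, context: str) -> str:
    return context[:20]

demo = build_qac_triples(["The Transformer relies solely on attention."], fake_ask, fake_answer)
print(demo[0]["question"])
```

Keeping the two stages as separate callables mirrors the notebook's choice of a cheaper model for questions and a stronger one for answers.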

In [8]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [9]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()

In [10]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai.chat_models import ChatOpenAI

# question_generation_model = ChatOpenAI(model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q5_K_M.gguf", base_url="http://localhost:1234/v1", api_key="lm-studio", temperature=0)
question_generation_model = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.
context: {context}

{format_instructions}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=contexts[6],
    format_instructions=format_instructions
)

# Calling the model object directly is deprecated; use .invoke() instead.
response = question_generation_model.invoke(messages)
output_dict = question_output_parser.parse(response.content)


In [11]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
How does the Transformer model reduce the number of operations required to relate signals from two arbitrary input or output positions compared to models like ConvS2S and ByteNet?
context
The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [16], ByteNet [18] and ConvS2S [9], all of which use convolutional neural networks as basic building block, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows in the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [12]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as des

In [12]:
from tqdm import tqdm

qac_triples = []

for context in tqdm(contexts):
    messages = prompt_template.format_messages(
        context=context,
        format_instructions=format_instructions
    )
    response = question_generation_model.invoke(messages)
    try:
        output_dict = question_output_parser.parse(response.content)
    except Exception:
        # Skip contexts whose response cannot be parsed into the expected JSON.
        continue
    output_dict["context"] = context
    qac_triples.append(output_dict)

100%|██████████| 37/37 [01:08<00:00,  1.84s/it]
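
The progress bar here covers 37 contexts, while the answer loop further down reports 35 items, which suggests two replies failed to parse and were skipped without any record. A minimal sketch of a more transparent approach, using only the standard library rather than LangChain's parser, keeps a list of failures alongside the parsed results. The names `parse_json_reply` and `parse_all`, and the sample replies, are hypothetical.

```python
import json
import re
from typing import Dict, List, Tuple

FENCE = "`" * 3  # markdown code-fence marker, built indirectly to keep this listing readable
FENCED_JSON = re.compile(re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL)

def parse_json_reply(text: str) -> Dict[str, str]:
    """Parse a reply that is raw JSON or JSON wrapped in a markdown code fence."""
    match = FENCED_JSON.search(text)
    payload = match.group(1) if match else text
    data = json.loads(payload.strip())
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    return data

def parse_all(replies: List[str], required_key: str) -> Tuple[List[Dict[str, str]], List[Tuple[int, str]]]:
    """Return (parsed, failures); each failure holds the reply index and the error text."""
    parsed, failures = [], []
    for index, reply in enumerate(replies):
        try:
            data = parse_json_reply(reply)
            if required_key not in data:
                raise KeyError(required_key)
        except (ValueError, KeyError) as err:  # json.JSONDecodeError is a ValueError
            failures.append((index, repr(err)))
            continue
        parsed.append(data)
    return parsed, failures

# Made-up replies: raw JSON, fenced JSON, and a refusal that is not JSON at all.
replies = [
    '{"question": "Q1"}',
    FENCE + 'json\n{"question": "Q2"}\n' + FENCE,
    "Sorry, I cannot produce a question for this context.",
]
parsed, failures = parse_all(replies, required_key="question")
print(len(parsed), "parsed;", "failed indices:", [i for i, _ in failures])
```

Logging the failed indices makes it possible to retry or inspect those contexts instead of losing them.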


In [14]:
# primary_ground_truth_llm = ChatOpenAI(model="TheBloke/Mistral-7B-Instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q5_K_M.gguf", base_url="http://localhost:1234/v1", api_key="lm-studio", temperature=0)
primary_ground_truth_llm = ChatOpenAI(model_name="gpt-4", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer to the question.
question: {question}
context: {context}

{format_instructions}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

response = primary_ground_truth_llm.invoke(messages)  # .invoke() replaces the deprecated direct call
output_dict = answer_output_parser.parse(response.content)

In [15]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer
Google grants permission to reproduce the tables and figures in this paper under the condition that proper attribution is provided and the reproduction is solely for use in journalistic or scholarly works.


In [16]:
for triple in tqdm(qac_triples):
    messages = prompt_template.format_messages(
        context=triple["context"],
        question=triple["question"],
        format_instructions=format_instructions
    )
    response = primary_ground_truth_llm.invoke(messages)
    try:
        output_dict = answer_output_parser.parse(response.content)
    except Exception:
        # On a parse failure the triple is left without an "answer" key.
        continue
    triple["answer"] = output_dict["answer"]

100%|██████████| 35/35 [02:23<00:00,  4.09s/it]
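
Because the loop above skips any reply it cannot parse, a triple can end up with no `"answer"` key, and pandas will then fill that cell with `NaN` when the DataFrame is built. A minimal sketch of filtering out incomplete triples first is shown below; `keep_complete`, `REQUIRED_KEYS`, and the toy data are hypothetical names, not part of the notebook.

```python
from typing import Dict, Iterable, List, Sequence

REQUIRED_KEYS = ("question", "context", "answer")

def keep_complete(
    triples: Iterable[Dict[str, str]],
    required: Sequence[str] = REQUIRED_KEYS,
) -> List[Dict[str, str]]:
    """Keep only triples where every required key is present and non-empty."""
    return [t for t in triples if all(t.get(key) for key in required)]

# Toy data: the second triple mimics an answer that failed to parse.
toy_triples = [
    {"question": "Q1", "context": "C1", "answer": "A1"},
    {"question": "Q2", "context": "C2"},
]
complete = keep_complete(toy_triples)
print(f"kept {len(complete)} of {len(toy_triples)} triples")
```

In the notebook this would be applied as `qac_triples = keep_complete(qac_triples)` before constructing the DataFrame.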


In [17]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(str)
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer": "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)


In [18]:
eval_dataset.to_csv("data/groundtruth_eval_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 166.69ba/s]


51192