# Experimenting with OpenAI's GPT-3.5 to Improve LLM-Based Answers to Prompts, Compared with the LLaMA-2 Experiments

**Reference**: Hui Liu, Wenya Wang, Haoru Li, and Haoliang Li. 2024. TELLER: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection. arXiv:2402.07776

As the experiments for returning truth values of predicates (fact-checking question templates) about news texts were not very successful with the LLaMA-2 model, I decided to try the commercial GPT-3.5 model from OpenAI, which requires paid access, albeit at a lower price than later GPT models. The TELLER authors note, however, that because GPT-3.5 is a "closed" LLM, it is not possible to access the pre-softmax logits of the answer, which are needed to build the vector representation of a text used for training the second (decision) system. Instead, a more involved method is needed to obtain the required probability score for each question: the model is asked the same question multiple times, and the number of "yes" responses is divided by the total number of times the question was asked. This greatly increases the time and computational resources needed to use the model for the cognition system. However, since LLaMA-2 could not be made to answer "no" even to obviously false statements, this approach may still yield more accurate answers.
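
The repeated-sampling idea described above can be sketched as follows. This is only a minimal illustration, not the TELLER authors' code: `estimate_yes_probability` and the stand-in `fake_model` are hypothetical names, and the rule that an answer counts as "yes" when it starts with that word is an assumption.

```python
from itertools import cycle
from typing import Callable


def estimate_yes_probability(ask: Callable[[str], str], prompt: str, n_samples: int = 10) -> float:
    """Asks the same prompt n_samples times and returns the fraction of "yes" answers."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    yes_count = 0
    for _ in range(n_samples):
        # Assumption: an answer counts as "yes" if it starts with that word
        answer = ask(prompt).strip().lower()
        if answer.startswith("yes"):
            yes_count += 1
    return yes_count / n_samples


# Hypothetical stand-in for a real model call: cycles through fixed answers
_answers = cycle(["Yes", "No", "yes.", "NO"])


def fake_model(prompt: str) -> str:
    return next(_answers)


print(estimate_yes_probability(fake_model, "Does this news text exhibit bias?", n_samples=4))  # 0.5
```

With a real model, `ask` would wrap one API call per sample, so the cost per question grows linearly with `n_samples`.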

I tried to implement the TELLER pipeline using the LangChain framework, as it is designed for prompting and reasoning tasks with LLMs such as GPT-3.5. Furthermore, it provides functions for creating prompt templates and connecting to the OpenAI API.

## Environment Setup

In [None]:
# Mounts the Google Drive for Colab access to the data file
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
# Installs the required packages to use LangChain
!pip install langchain langchain_community python-dotenv
!pip install -U langchain-openai

Collecting langchain_community
  Downloading langchain_community-0.3.14-py3-none-any.whl.metadata (2.9 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain_community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain_community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain
  Downloading langchain-0.3.14-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.25 (from langchain)
  Downloading langchain_core-0.3.29-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain_community)
  Downloading pydantic_settings-2.7.1-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain_community)
  Downloading marshmallow-3.25.1-py3-none-any.whl.metadata (7.3 kB)
Collecting ty

In [None]:
# Imports all the required libraries
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
# For importing the HuggingFace token
from google.colab import userdata
# For tracking progress of model answers about news texts
tqdm.pandas() 
import time
# For working with LLM prompting and constructing an entire workflow
import langchain
import transformers
import openai
from huggingface_hub import login
# LangChain functionality
from langchain.prompts import PromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
# OpenAI model
from langchain_openai import ChatOpenAI
# For loading in access tokens
from dotenv import load_dotenv, find_dotenv

In [None]:
# Sets up the filepath to the WELFake example/training dataset
root_welfake_path = "/content/drive/My Drive/LangDetect/FPData/WELFake"
train_path = os.path.join(root_welfake_path, "clean_train_wf.csv")

# Sets up the filepath to the API keys .env file for OpenAI access
api_path = "/content/drive/My Drive/LangDetect/api_keys.env"

In [1]:
# Loads the .env file to get the API keys and access tokens
# (the file contents are deliberately not printed, to avoid exposing the keys in the notebook output)
load_dotenv(api_path)

# Retrieves the environment variables from the file
hf_access_token = os.getenv("HUGGINGFACEHUB_API_TOKEN")
langchain_api_key = os.getenv("LANGCHAIN_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")  # Needed below when initializing ChatOpenAI
langchain_tracing = os.getenv("LANGCHAIN_TRACING_V2")
langchain_project = os.getenv("LANGCHAIN_PROJECT")

In [None]:
# Loads in the example WELFake dataset as a pandas DataFrame
wf_train_df = pd.read_csv(train_path)
wf_train_df.head(10)

Unnamed: 0,id,title,text,label
0,56051,"The Politics of Death: Cancer and Politics, a ...",License DMCA This is not about how politics co...,1
1,30084,Governor-Elect Of Kentucky Tells The EPA To Go...,States have rights too! We love the new conser...,1
2,40781,ARE YOU READY FOR JOE? 91% Of Obama-Biden Bund...,"Bernie, Hillary and Joe a low information vote...",1
3,64772,"Trump win, Democratic setbacks cloud Pelosi's ...",WASHINGTON (Reuters) - Nancy Pelosi may face a...,0
4,67872,Investigators ask White House for details on F...,WASHINGTON (Reuters) - The special counsel inv...,0
5,68842,EU affirms support for Lebanon stability,BEIRUT (Reuters) - The European Union on Wedne...,0
6,34654,"‘Game of Thrones’ Season 6, Episode 7: Never T...",“Never too late to come back. ” So sayeth the ...,0
7,66410,Trump '100 percent' committed to take on tax r...,WASHINGTON (Reuters) - President Donald Trump ...,0
8,3189,Gorka: Trump Fired FBI Director to Make a Stat...,"On Wednesday’s Breitbart News Daily, Dr. Sebas...",0
9,739,Etihad flight from Abu Dhabi makes emergency l...,SYDNEY (Reuters) - An Etihad Airways flight tr...,0


In [None]:
# Prints the number of rows in WELFake training data
len(wf_train_df)

41560

## Setting up the LLM Model and Prompt Templates with OpenAI and LangChain

In [None]:
# Extracts the first 10 news articles for examples to test out the question-template answering system
news_example_texts = wf_train_df["text"][0:10].to_list() # Converts this from pd.Series to a list of strings for easier usage
news_example_labels = wf_train_df["label"][0:10].to_list()

In [None]:
# Initializes the GPT-3.5-turbo LLM, configured for less random, more fact-based responses (low temperature)
llm = ChatOpenAI(
    api_key=openai_api_key,
    model="gpt-3.5-turbo",
    temperature=0.25, # Low temperature limits randomness; outputs are more consistent but not fully deterministic
    max_tokens=16,  # Caps the output at 16 tokens, as only a short floating-point number is wanted as output
    top_p=0.9,      # Nucleus sampling: samples only from the smallest set of tokens whose cumulative probability reaches 0.9
)

In [None]:
def createPromptTemplate(question):
    """
      Creates a prompt template containing a question (based on the TELLER question templates) with the LangChain PromptTemplate class.
      To mitigate the fact that the GPT-3.5 LLM is a "closed" model (no access to logits/probabilities of answers), asks it to output
      a score between -1 and 1 (floating-point number) based on how close the answer to the question is to NO or YES.

      Input Parameters:
          question (str): the question for the prompt template

      Output:
          prompt_template (langchain_core.prompts.prompt.PromptTemplate): a PromptTemplate containing the inputted question
          
    """

    prompt = f"""
    You will now be shown a question related to an example news text.
    You MUST respond to the question by outputting a value which is between -1 and 1 (inclusive).
    - -1 means that you are completely certain the answer to the question about this news text is NO.
    - +1 means that you are completely certain that the answer to the question about this news text is YES.

    The closer the answer is to -1, the more confident you are the final answer is NO.
    The closer the answer is to 1, the more confident you are the final answer is YES.

    QUESTION: {question}

    This is the sample news text: ```{{sample_news_text}}```

    Output:
    """
    # Creates the LangChain prompt template, using a dynamic placeholder for the news text
    prompt_template = PromptTemplate.from_template(template=prompt)

    return prompt_template
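
The function above relies on two-stage formatting: the f-string fills in `{question}` immediately, while the doubled braces leave a literal `{sample_news_text}` placeholder to be filled later. The sketch below illustrates this with plain `str.format` standing in for the LangChain `PromptTemplate.format` call; the shortened template and the example news text are made up for illustration.

```python
# Stage 1: the f-string substitutes {question} right away; the doubled braces
# collapse into a single-brace placeholder that survives in the string.
question = "Does this news text exhibit bias?"
template_text = f"QUESTION: {question}\nNews text: {{sample_news_text}}"
print(template_text)

# Stage 2: the remaining placeholder is filled once per news text.
# (Hypothetical example text, not taken from the dataset.)
filled = template_text.format(sample_news_text="Example news text.")
print(filled)
```

One consequence of this design is that a question containing literal curly braces would be read as another placeholder in the second stage, so such questions would need their braces escaped.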

In [None]:
# Creates a list of questions to prompt GPT-3.5 about the news text with, based on the TELLER paper
questions = [
    "Does the news text contain high-quality background information to support the claims being made in it?",
    "Does the news text contain content that seems intentionally omitted or distorted?",
    "Does the news text show improper intentions (e.g., political bias, commercial motives) that suggest it could be fake or misleading?",
    "Is this news text based on facts, or does it primarily rely on speculation or opinion?",
    "Are there any logical fallacies or misleading arguments present in this news text?",
    "Does this news text exhibit bias?",
    "Are there any grammatical or spelling errors in this news text that make the text seem unprofessional?",
    "Does this news text use inflammatory language or personal attacks?",
    "Is the main topic of this news text a full-length news text rather than a short headline?",
    "Is the main topic of this news text primarily about government and politics?",
    "Is the main topic of this news text primarily about business, economy and finance?",
    "Is the main topic of this news text primarily about celebrity gossip and entertainment?",
    "Is the main topic of this news text primarily about sports?",
    "Is the main topic of this news text primarily about technology?",
    "Is the main topic of this news text primarily about science?",
    "Is the main topic of this news text primarily about the environment?",
    "Is the main topic of this news text primarily about culture and the arts?"
]

# Wraps the questions inside the LangChain PromptTemplate
prompt_templates = [createPromptTemplate(question) for question in questions]

## Testing one Question Template Out on 10 Example News Texts

In [None]:
# Tests the prompts out with the 10 example samples from the WELFake dataset:
for example_text, example_label in zip(news_example_texts, news_example_labels):
    
  # Prints the first 100 characters of the news text
  print(example_text[0:100])

  # Uses the first question from the templates to test the GPT-3.5 model outputs
  # "Does the news text contain high-quality background information to support the claims being made in it?"
  example_prompt = prompt_templates[0].format(sample_news_text=example_text)

  # Calls the .invoke LangChain method on the prompt to generate the model's answer
  example_answer = llm.invoke(example_prompt)

  # Prints the model's answer and the ground truth label (0 = real, 1 = fake) for the current news text
  print(f"LLM Answer: {example_answer.content}, News Label: {example_label}\n\n")

License DMCA This is not about how politics controls research on cancer and other diseases. It is ab
LLM Answer: 0.8, News Label: 1


States have rights too! We love the new conservative governor of Kentucky! He means business and it 
LLM Answer: -0.8, News Label: 1


Bernie, Hillary and Joe a low information voter dream ticket Ninety-one percent of the hundreds of i
LLM Answer: 0.5, News Label: 1


WASHINGTON (Reuters) - Nancy Pelosi may face a challenge to her 14-year-old role as the leading Demo
LLM Answer: 0.5, News Label: 0


WASHINGTON (Reuters) - The special counsel investigating Russian interference in the U.S. presidenti
LLM Answer: 0.8, News Label: 0


BEIRUT (Reuters) - The European Union on Wednesday said it reaffirmed support for Lebanon s stabilit
LLM Answer: 0.8, News Label: 0


“Never too late to come back. ” So sayeth the High Swearengen, ministering to his followers but spea
LLM Answer: 0.2, News Label: 0


WASHINGTON (Reuters) - President Donald Trump is committed to

In [None]:
def getLogicAtomVectorsForNewsText(row, prompt_templates, llm_model):
  """
    Generates LLM answers to all the prompt questions for a specific news text in a DataFrame and stores them
    as the TELLER "logic-atom vector" of truth values (-1 = NO to +1 = YES)

        Input Parameters:
        
          row (DataFrame row = pd Series): row from a news dataset including a "text" field
          prompt_templates (list of LangChain PromptTemplate instances): list of the prompts containing the
                                                                         relevant questions about the news text
          llm_model (LangChain ChatOpenAI instance): in this case, the GPT-3.5 model

        Output:
            logic_vect (list): a list of floats representing "logic atoms" (scores -1 to 1 representing degree to which
                               the LLM agrees with a NO/YES answer to the prompt/question template)
  """

  # Creates a list to store the logic vector of LLM "answers"/scores for YES/NO answers to the question templates
  logic_vect = []

  # Iterates over the (question) prompt-template list
  for prompt_template in prompt_templates:

        # Injects the specific news text (context) into the current  prompt template
        prompt_with_news_text = prompt_template.format(sample_news_text=row["text"])

        # Extracts the LLM's answer using the model.invoke method
        response = llm_model.invoke(prompt_with_news_text)

        # Checks whether the LLM's response can be read as a floating-point number;
        # otherwise stores a default of 0.0 (i.e. unknown), so that every vector has the same length and contains only floats
        try:
            # Attempts to convert the response content to a float
            truth_value = float(response.content)

            # Checks if the number is between -1 and 1
            if -1 <= truth_value <= 1:
                logic_vect.append(truth_value)
            # Else, if the value is outside the -1 to +1, stores the answer as 0.0 as default (unknown)
            else:
                logic_vect.append(0.0)

        # If the response could not be converted to a float, also stores 0.0 (in TELLER, this means a default/unknown truth value)
        except ValueError:
            logic_vect.append(0.0)

  return logic_vect
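
The function above only accepts a response that is exactly a number, so an answer such as `Output: 0.8` falls back to 0.0. A more tolerant parser could be sketched as below. This is a hypothetical helper (`parse_truth_value` is not part of the notebook), and taking the first number found in the text is an assumption that can misfire on responses containing other numbers.

```python
import re

# Matches an optionally signed integer or decimal number, e.g. "0.8", "-1", "+.5"
_NUMBER_PATTERN = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)")


def parse_truth_value(text: str, default: float = 0.0) -> float:
    """Returns the first number in text if it lies in [-1, 1], else the default."""
    match = _NUMBER_PATTERN.search(text)
    if match is None:
        # No number at all: treat the truth value as unknown
        return default
    value = float(match.group())
    # Out-of-range values are also treated as unknown
    return value if -1.0 <= value <= 1.0 else default


print(parse_truth_value("0.8"))
print(parse_truth_value("Output: -0.5."))
print(parse_truth_value("I am not sure"))
```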

In [None]:
# Measures the time taken to generate answers to all prompts/questions for the first 10 samples from the WELFake dataset
start_time = time.time()

# Tests with the first 10 rows from the dataset
example_rows = wf_train_df[0:10].copy()

# Creates a new column storing the logic atom vectors for each of the 10 sample news texts 
example_rows["logic_atom_vectors"] = example_rows.progress_apply(
    lambda row: getLogicAtomVectorsForNewsText(row, prompt_templates, llm), axis=1
)

# Stops timing
end_time = time.time()

# Calculates the time elapsed
time_elapsed = end_time - start_time

# Prints the time taken to get all answers for the 10 news samples
print(f"\nTime elapsed: {time_elapsed} seconds")

100%|██████████| 10/10 [01:30<00:00,  9.04s/it]


Time elapsed: 90.40247845649719 seconds





Getting all of the question-template answers for ten news texts took just over 90 seconds, i.e. about 9 seconds per news text. The WELFake training dataset alone contains 41,560 samples, so encoding it would take about 41,560 × 9 s = 374,040 s, or roughly 103 hours and 54 minutes (over 4 days). The Fakeddit dataset is immense, with over 700,000 short news text samples, so encoding it can be expected to take even longer. Implementing the cognition part of the TELLER system alone could therefore take weeks, which is a serious limitation of this approach.
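
The estimate above can be reproduced with a few lines of arithmetic. The helper name `format_hours_minutes` is made up for this sketch, and the 9 seconds per text is the rounded rate from the ten-sample run, so the result is only a rough projection.

```python
def format_hours_minutes(total_seconds: float) -> str:
    """Formats a duration in seconds as whole hours and minutes."""
    total_minutes = round(total_seconds / 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes} min"


n_samples = 41560         # Size of the WELFake training set
seconds_per_sample = 9    # Rounded rate observed on the 10 example texts

total_seconds = n_samples * seconds_per_sample
print(total_seconds)                        # 374040
print(format_hours_minutes(total_seconds))  # 103 h 54 min
print(round(total_seconds / 86400, 2))      # Duration in days
```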

## Testing All Question Templates Out on WELFake Training Data

In [None]:
# The "example_rows" variable consists of the set of 10 first rows from the WELFake training data
# "questions" is the list of questions to get logic atoms for

# Iterates over each news sample in the 10 example rows
for index, row in example_rows.iterrows():

    # Prints the number of the current article, counting from 1
    print(f"News Article Number #{index+1}")

    # Prints the ground-truth label for each example
    label = "FAKE" if row["label"] == 1 else "REAL"
    print(f"LABEL: {label}")

    # Displays the LLM's response (a float from -1, "No", to 1, "Yes") to each
    # question in the question-prompt template
    for question, answer in zip(questions, row["logic_atom_vectors"]):
        print(f"Question: {question}")
        print(f"Answer: {answer}")

    # Prints a separator line of dashes between news articles
    print("-" * 160)

News Article Number #1
LABEL: FAKE
Question: Does the news text contain high-quality background information to support the claims being made in it?
Answer: 0.8
Question: Does the news text contain content that seems intentionally omitted or distorted?
Answer: 0.8
Question: Does the news text show improper intentions (e.g., political bias, commercial motives) that suggest it could be fake or misleading?
Answer: 0.8
Question: Is this news text based on facts, or does it primarily rely on speculation or opinion?
Answer: -0.8
Question: Are there any logical fallacies or misleading arguments present in this news text?
Answer: 0.8
Question: Does this news text exhibit bias?
Answer: 0.8
Question: Are there any grammatical or spelling errors in this news text that make the text seem unprofessional?
Answer: -0.8
Question: Does this news text use inflammatory language or personal attacks?
Answer: 0.8
Question: Is the main topic of this news text a full-length news text rather than a short headli

In [None]:
# Applies the vectorization function to the train WELFake dataset and checks how long this takes
wf_train_df["logic_atom_vectors"] = wf_train_df.progress_apply(
    lambda row: getLogicAtomVectorsForNewsText(row, prompt_templates, llm),
    axis=1
)

  1%|▏         | 551/41560 [1:28:36<109:54:45,  9.65s/it]

KeyboardInterrupt



The process is very slow: it would take days to encode the entire dataset. It took 1 hour, 28 minutes and 36 seconds to process only 551 of the 41,560 samples (approximately 1.3%)!
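
As a rough cross-check, the figures from the interrupted progress bar can be extrapolated to the full dataset. This sketch assumes the average rate over the first 551 samples would hold for the rest, which may not be true given varying text lengths and API latency.

```python
# Figures read from the interrupted progress bar
elapsed_seconds = 1 * 3600 + 28 * 60 + 36   # 1:28:36
processed = 551
total = 41560

# Average time per sample so far, and the projected time for all samples
seconds_per_sample = elapsed_seconds / processed
projected_hours = seconds_per_sample * total / 3600

print(f"{seconds_per_sample:.2f} s per sample")
print(f"{projected_hours:.1f} hours for the full dataset")
```

The projection lands at well over four days, in line with the earlier estimate from the ten-sample run.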