In [8]:
import sys
sys.path.append('..')

import torch
import json
from PIL import Image
from transformers import BitsAndBytesConfig, LlavaForConditionalGeneration, AutoProcessor
from utils.external_retrieval import get_matching_urls, get_webpage_title
from utils.data import get_data

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

In [4]:
MODEL_NAME = "llava-hf/llava-1.5-13b-hf"

model1 = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, quantization_config=quantization_config)
model2 = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, quantization_config=quantization_config)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

`low_cpu_mem_usage` was None, now set to True since model is quantized.


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
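
A rough back-of-the-envelope check on why 4-bit loading matters here, given that two copies of the model are held at once. The sketch assumes roughly 13 billion parameters (inferred from the model name, not measured) and counts weights only; it ignores quantization constants, any layers left unquantized, activations, and the KV cache, so the real footprint will be higher. `estimate_weight_gib` is a hypothetical helper, not part of the notebook.

```python
def estimate_weight_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a parameter count and bit width."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 2**30


# Assumption: ~13e9 parameters, inferred from the "13b" in the model name.
ASSUMED_PARAMS = 13e9

per_model_4bit = estimate_weight_gib(ASSUMED_PARAMS, 4)
per_model_fp16 = estimate_weight_gib(ASSUMED_PARAMS, 16)

print(f"one model, 4-bit : {per_model_4bit:.1f} GiB")
print(f"two models, 4-bit: {2 * per_model_4bit:.1f} GiB")
print(f"one model, fp16  : {per_model_fp16:.1f} GiB")
```

Under these assumptions, two 4-bit copies together still need about half the weight memory of a single fp16 copy.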


In [6]:
def initial_prediction_prompt(history, caption):
    prompt = """
    CHAT HISTORY: {}
    USER: <image>
    TEXT: "{}"

    Your task is to determine whether the given text and image are from the same context (e.g., the same news article or event) or if the image is being used out of context. 
    To aid your analysis, you have the option to request an Internet search based on the image to gather more information about the context in which the image is used.

    Provide a detailed analysis explaining your reasoning behind your decision. Consider the content of the text and the objects, actions, or scenes depicted in the image. 
    Analyze whether they align and provide context for each other or if they appear to be unrelated. 
    If you need more information about the context of the image, state it separately after "INFO REQUIRED: ".

    ASSISTANT:""".format(history, caption)
    return prompt
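
The prompt above asks the model to flag missing context with an "INFO REQUIRED: " marker. Because the decoded generation echoes the prompt, which itself contains that marker, the check has to run on the model's reply alone. A minimal sketch of such a check follows; `parse_info_request` and `INFO_MARKER` are hypothetical names, not part of the notebook.

```python
from typing import Optional

INFO_MARKER = "INFO REQUIRED:"


def parse_info_request(reply: str) -> Optional[str]:
    """Return the text that follows the marker, or None if the marker is absent.

    `reply` must be the model's answer only, with the echoed prompt removed.
    """
    _, marker, tail = reply.partition(INFO_MARKER)
    if not marker:
        return None
    return tail.strip()


# Toy replies standing in for real model output
with_request = "The scene looks unrelated. INFO REQUIRED: where the photo was taken"
without_request = "The image and caption appear to match."

print(parse_info_request(with_request))
print(parse_info_request(without_request))
```

Returning the request text rather than a bare boolean leaves room to use it later, for example as a search query.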

In [7]:
def debate_prompt(history, caption, agent_response):
    prompt = """
    CHAT HISTORY: {}
    USER: <image> 
    TEXT: "{}"

    This is the answer another AI agent generated for the same image and text pair:
    "{}"

    Your task is to critically analyze the other agent's response and provide a refined answer based on this new information. 
    To strengthen your analysis and argument, you have the option to request an Internet search based on the image to gather more information 
    about the context in which the image is used.

    However, instead of blindly agreeing or repeating yourself, focus on the following:

    1. Identify any potential inconsistencies, flaws, or counterarguments in the other agent's reasoning or analysis regarding whether the image and text are from the same context or not.
    2. Determine if there are any gaps or missing information that could lead to a more comprehensive understanding of the image-text relationship and their contextual alignment (or misalignment). 
    3. If you disagree with the other agent's assessment, respectfully point out the specific areas of disagreement and provide evidence or reasoning to support your stance on whether the image and text are from the same context or not.
    4. If you agree with the other agent's assessment, explain why their reasoning is valid and how it complements or strengthens your initial analysis of the image-text context.
    5. If you need more information to strengthen your argument or verify the other agent's argument, state it separately after "INFO REQUIRED: ".
    The goal is to have a constructive debate that challenges each other's perspectives, uncovers potential blind spots, and ultimately leads to a more robust and well-reasoned conclusion about whether the image and text are from the same context or if the image is being used out of context. Use the option to request an Internet search if you need additional information to strengthen your argument. Avoid simply repeating or agreeing without critical evaluation.

    ASSISTANT:""".format(history, caption, agent_response)
    return prompt
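
Every prompt in this notebook ends with "ASSISTANT:", and the reply is recovered by slicing the decoded text after that marker. A small helper can centralise the slice and handle a missing marker explicitly; with `str.find`, a missing marker returns -1 and the slice silently starts at the wrong offset. This is a sketch only, and `extract_reply` is a hypothetical name.

```python
def extract_reply(decoded: str, marker: str = "ASSISTANT:") -> str:
    """Return the text after the first occurrence of `marker`.

    Falls back to the whole string when the marker is missing, instead of
    slicing from an arbitrary offset.
    """
    _, found, tail = decoded.partition(marker)
    return tail.strip() if found else decoded.strip()


# Toy decoded output: echoed prompt followed by the model's answer
decoded = 'USER: <image>\nTEXT: "a caption"\n\nASSISTANT: The image matches the caption.'
print(extract_reply(decoded))
```

Splitting on the first occurrence mirrors the notebook's use of `find`; if a reply were ever expected to repeat the marker, that choice would need revisiting.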

In [11]:
def retrieval_prompt(history, caption, search_results):
    prompt = """
    CHAT HISTORY: {}
    USER: <image>
    TEXT: "{}"

    Search Results: This image is from an article titled: {}

    Based on the additional context and information gathered from the Internet search, 
    please reevaluate your initial prediction and provide an updated analysis on whether the given text and image are from the same context 
    or if the image is being used out of context.

    Incorporate the relevant information from the search results into your analysis and explain how it either supports or contradicts your initial assessment. 
    If the search results provide new insights or perspectives, discuss how they impact your understanding of the image-text relationship and 
    their contextual alignment (or misalignment).

    If you previously disagreed with the other agent's assessment, use the search results to strengthen or revise your counterarguments. 
    If you agreed with the other agent, explain how the search results further validate or complement their reasoning.

    The goal is to leverage the additional information from the Internet search to refine your analysis and provide a more comprehensive and 
    well-reasoned conclusion about whether the image and text are from the same context or if the image is being used out of context.

    ASSISTANT:""".format(history, caption, search_results)
    return prompt
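
The debate loop in this notebook caps each agent's chat history at five turns. One way to get that behaviour without manual bookkeeping is `collections.deque` with `maxlen`, which discards the oldest entry automatically on append. The sketch below uses toy turns; `make_history` is a hypothetical helper, not part of the notebook.

```python
from collections import deque


def make_history(max_turns: int = 5) -> deque:
    """Create a turn buffer that keeps only the most recent `max_turns` entries."""
    return deque(maxlen=max_turns)


history = make_history()
for turn in range(7):
    # Toy turns standing in for real prompt/reply pairs
    history.append({"user": f"question {turn}", "assistant": f"answer {turn}"})

# Convert to a list before formatting into a prompt, so the text shows a
# plain list rather than the deque wrapper.
history_for_prompt = list(history)
print(len(history_for_prompt), history_for_prompt[0]["user"])
```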

In [12]:
num_iters = 3
data_sample = 5
image, caption,_ = get_data(data_sample)
print("data loaded!")

chat_history1 = []
chat_history2 = []

print("running llm-1...")
prompt = initial_prediction_prompt(chat_history1, caption)
inputs = processor(text=prompt, images=image, return_tensors="pt")
result1 = model1.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
result1 = processor.batch_decode(result1, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
chat_history1.append({'user':prompt[prompt.find("TEXT:")+len("TEXT:"):prompt.find("ASSISTANT")], 'assistant':result1[result1.find("ASSISTANT:")+len("ASSISTANT:"):]})
print("AGENT-1: {}\n\n".format(result1[result1.find("ASSISTANT:")+len("ASSISTANT:"):]))
# Inspect only the generated reply: the decoded output echoes the prompt, which already contains "INFO REQUIRED"
if "INFO REQUIRED" in result1[result1.find("ASSISTANT:")+len("ASSISTANT:"):]:
    urls = get_matching_urls(data_sample)
    info = get_webpage_title(urls[0])
    post_retrieval_prompt = retrieval_prompt(chat_history1, caption, info)
    # Generate from the retrieval prompt; a separate variable leaves `inputs` (the initial prompt) untouched
    retrieval_inputs = processor(text=post_retrieval_prompt, images=image, return_tensors="pt")
    result1 = model1.generate(**retrieval_inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    result1 = processor.batch_decode(result1, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    chat_history1.append({'user':post_retrieval_prompt[post_retrieval_prompt.find("TEXT:")+len("TEXT:"):post_retrieval_prompt.find("ASSISTANT")], 'assistant':result1[result1.find("ASSISTANT:")+len("ASSISTANT:"):]})
    print("AGENT-1 after internet access: {}\n\n".format(result1[result1.find("ASSISTANT:")+len("ASSISTANT:"):]))


print("running llm-2...")
# Build agent 2's own prompt and inputs rather than reusing agent 1's
prompt = initial_prediction_prompt(chat_history2, caption)
inputs = processor(text=prompt, images=image, return_tensors="pt")
result2 = model2.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
result2 = processor.batch_decode(result2, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
chat_history2.append({'user':prompt[prompt.find("TEXT:")+len("TEXT:"):prompt.find("ASSISTANT")], 'assistant':result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):]})
print("AGENT-2: {}\n\n".format(result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):]))
# Inspect only the generated reply: the decoded output echoes the prompt, which already contains "INFO REQUIRED"
if "INFO REQUIRED" in result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):]:
    urls = get_matching_urls(data_sample)
    info = get_webpage_title(urls[0])
    post_retrieval_prompt = retrieval_prompt(chat_history2, caption, info)
    # Generate from the retrieval prompt, not the initial prompt
    retrieval_inputs = processor(text=post_retrieval_prompt, images=image, return_tensors="pt")
    result2 = model2.generate(**retrieval_inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    result2 = processor.batch_decode(result2, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    chat_history2.append({'user':post_retrieval_prompt[post_retrieval_prompt.find("TEXT:")+len("TEXT:"):post_retrieval_prompt.find("ASSISTANT")], 'assistant':result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):]})
    print("AGENT-2 after internet access: {}\n\n".format(result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):]))

print("COMMENCING DEBATE NOW...")
temp = result1

for i in range(num_iters):

    print("=======================================================================================")
    print("\t\t\t\tDEBATE ROUND - ", i+1)
    print("=======================================================================================")

    # Agent 1 critiques agent 2's latest answer, not its own
    prompt = debate_prompt(chat_history1, caption, result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):])
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    result1 = model1.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    result1 = processor.batch_decode(result1, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    chat_history1.append({'user': prompt[prompt.find("TEXT:")+len("TEXT:"):prompt.find("ASSISTANT")], 'assistant':result1[result1.find("ASSISTANT:")+len("ASSISTANT:"):]})
    print("AGENT-1: {}\n\n ".format(result1[result1.find("ASSISTANT:")+len("ASSISTANT:"):]))

    prompt = debate_prompt(chat_history2, caption, temp[temp.find("ASSISTANT:")+len("ASSISTANT:"):])
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    result2 = model2.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
    # Decode before slicing: generate() returns token ids, not text
    result2 = processor.batch_decode(result2, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    chat_history2.append({'user': prompt[prompt.find("TEXT:")+len("TEXT:"):prompt.find("ASSISTANT")], 'assistant':result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):]})
    # Agent 2 sees agent 1's answer from this round in the next round
    temp = result1
    print("AGENT-2: {}\n\n".format(result2[result2.find("ASSISTANT:")+len("ASSISTANT:"):]))
    
    # Keep only the 5 most recent turns per agent, dropping the oldest first
    if len(chat_history1) > 5:
        chat_history1.pop(0)
    if len(chat_history2) > 5:
        chat_history2.pop(0)

DATA SAMPLE
Caption:  The Brandenburg Gate stands illuminated during celebrations on the 25th anniversary of the fall of the Berlin Wall
Misinformation (Ground Truth): True
data loaded!
running llm-1...




AGENT-1:  Based on the image, it shows a large crowd of people gathered in a stadium, with some of them holding up cell phones to take pictures. There is also a large bridge in the background, with its lights on. The image does not have any clear indication of the Brandenburg Gate or the fall of the Berlin Wall. Therefore, it is highly unlikely that the image and the text are related.

The fact that the image does not show the Brandenburg Gate or any other significant landmarks from the Berlin Wall's fall further supports this conclusion. It seems that the image has been used out of context or is a generic image of a large stadium event with people taking pictures.

INFO REQUIRED: The image's content, specifically the large bridge with its lights on, might provide some context for the event taking place in the stadium. However, without more information, it is impossible to confidently determine the context of the event or the image's


DATA SAMPLE
Caption:  The Brandenburg Gate stands 



AGENT-1 after internet access:  The image depicts a large crowd with people holding up colored flags. There is a large lit-up sign visible as well. The text suggests that the scene takes place at the Brandenburg Gate during celebrations on the 25th anniversary of the fall of the Berlin Wall. The gathering of people holding flags and the lit-up sign are indicative of a celebratory event.

While the image and the text appear to be related, there is room for improvement in terms of context or additional information. It would be helpful to know more about the specific event or the people involved, as there is no clear indication of who or what the flags represent. Moreover, the image does not provide direct information about the Brandenburg Gate, but the reference to the Berlin Wall suggests that the event might have some historical significance.

INFO REQUIRED: Details about the specific event or celebration taking place, including information on the flags held by the people, the signific



AGENT-2 after internet access:  The image shows a large crowd of people gathered in a stadium, watching a bridge with lights on it. The text describes an event celebrating the 25th anniversary of the fall of the Berlin Wall, with the Brandenburg Gate being illuminated. It appears that the image is being used out of context, as the stadium full of people and the bridge with lights on it do not directly relate to the 25th anniversary celebration. The image could be from a different event or a different time.


COMMENCING DEBATE NOW...
				DEBATE ROUND -  1
AGENT-1:  After analyzing the image and the provided text, it is evident that the image does not contain any reference to the Brandenburg Gate or the fall of the Berlin Wall. Instead, the image shows a crowd of people holding up colored flags, with a large lit-up sign in the background. The lack of any clear indication connecting the image to the Brandenburg Gate or the historical context of the Berlin Wall makes it highly improbable t

KeyboardInterrupt: 