In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Instructions: Setup Before Running This Notebook

To successfully run the notebook, follow these steps to prepare your environment:

### 1. Enable Colab Pro for A100 GPU

For optimal performance, you will need a **Colab Pro** subscription to run this notebook on an A100 GPU.  
- Colab Pro costs **$9.99/month** and provides access to higher-performance GPUs like the **NVIDIA A100**.  
- To subscribe, visit: [https://colab.research.google.com/signup](https://colab.research.google.com/signup)

Once you have Colab Pro, select the **A100** GPU in Colab under **Runtime > Change runtime type > Hardware accelerator**.
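
After switching the runtime, it can help to confirm which GPU the session actually received. The following is a minimal sketch, not part of the original notebook: `gpu_name` is a hypothetical helper that queries `nvidia-smi` and returns `None` when no NVIDIA GPU is visible.

```python
import shutil
import subprocess

def gpu_name():
    """Return the name of the first visible NVIDIA GPU, or None if none is found."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    names = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return names[0] if result.returncode == 0 and names else None

name = gpu_name()
if name is None:
    print("No GPU detected; check Runtime > Change runtime type.")
elif "A100" not in name:
    print(f"Running on {name}, not an A100.")
else:
    print(f"Running on {name}.")
```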

### 2. Log in to Hugging Face
This notebook loads the gated **`meta-llama/Llama-2-7b-chat-hf`** model, so your account needs access to it.  
1. Request access to the **Llama 2** model by following the instructions on its model card, and wait for approval.  
2. Generate a Hugging Face token at: [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)  
3. Run the `huggingface-cli login` cell below and paste the token when prompted.

### 3. Update Dataset Paths
You need to modify the following file paths to match your Google Drive file structure:
- **`conc_dataset_path`**: Path to the `Concreteness_ratings_Brysbaert_et_al_BRM.xlsx` file.  
   - Download it from [https://github.com/ArtsEngine/concreteness](https://github.com/ArtsEngine/concreteness)
- **`generations_file_path`**: Path to the `example_gen_file.csv` file.  
   - Download it from [https://github.com/assafbk/mocha_code/tree/main/OpenCHAIR](https://github.com/assafbk/mocha_code/tree/main/OpenCHAIR)
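
A wrong path usually surfaces only later as a `FileNotFoundError`, so a quick existence check can save a restart. This is a minimal sketch under the assumption that both paths point to regular files; `missing_paths` is a hypothetical helper, and the path used here is a placeholder rather than a real file.

```python
from pathlib import Path

def missing_paths(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]

# In the notebook, pass the two path variables once they are defined:
# missing = missing_paths([conc_dataset_path, generations_file_path])
missing = missing_paths(["/content/drive/My Drive/does_not_exist.csv"])  # placeholder path
for p in missing:
    print(f"File not found: {p}")
```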


In [4]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: write

In [5]:
conc_dataset_path = '/content/drive/My Drive/CS 5787/Project/OpenCHAIR-Adjectives/Concreteness_ratings_Brysbaert_et_al_BRM.xlsx'
generations_file_path = '/content/drive/My Drive/CS 5787/Project/OpenCHAIR-Adjectives/example_gen_file.csv'

In [6]:
!pip install datasets
!pip install -U bitsandbytes
!pip install -U transformers
!pip install -U accelerate
!pip install torch --upgrade

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
   480.6/480.6 kB 34.5 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   116.3/116.3 kB 13.8 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl 

In [7]:
import pandas as pd
from datasets import load_dataset

word_conc = pd.read_excel(conc_dataset_path)[['Word','Conc.M']].set_index("Word").to_dict()['Conc.M']

print("Loading Dataset\n")
och_dataset = load_dataset("moranyanuka/OpenCHAIR")['test']

Loading Dataset



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md:   0%|          | 0.00/1.10k [00:00<?, ?B/s]

test-00000-of-00002.parquet:   0%|          | 0.00/375M [00:00<?, ?B/s]

test-00001-of-00002.parquet:   0%|          | 0.00/376M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4863 [00:00<?, ? examples/s]

In [8]:
df = pd.read_csv(generations_file_path)
df['ground_truth_caption'] = och_dataset['text'][:len(df)]

In [9]:
df = df.head(20)

In [10]:
import spacy
from tqdm.auto import tqdm
spacy.require_gpu()

def is_concrete(word, concreteness, t=2.5):
    # A word counts as concrete only if it has a rating above the threshold t.
    if word in concreteness:
        return concreteness[word] > t
    return False

def extract_adjs(captions, concreteness):
    # Return, for each caption, the lemmas of its concrete adjectives.
    nlp = spacy.load("en_core_web_sm")
    adjs = []
    for caption in tqdm(captions):
        doc = nlp(caption.lower())
        cur_adjs = [token.lemma_ for token in doc if token.pos_ == 'ADJ' and is_concrete(token.lemma_, concreteness)]
        adjs.append(cur_adjs)
    return adjs

df['generated_adjs'] = extract_adjs(df.generated_caption.tolist(), word_conc)

  0%|          | 0/20 [00:00<?, ?it/s]
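
The cell above keeps an adjective only when its concreteness rating exceeds 2.5, and words with no rating are dropped. A minimal sketch of that filter on toy data follows; the ratings in `toy_ratings` are made up for illustration and are not taken from the Brysbaert et al. file.

```python
def is_concrete(word, concreteness, t=2.5):
    """True only if `word` has a rating strictly above the threshold `t`."""
    return concreteness.get(word, float("-inf")) > t

# Hypothetical ratings on the 1-5 scale, made up for illustration only.
toy_ratings = {"red": 4.2, "wooden": 4.5, "nice": 1.9}

candidates = ["red", "nice", "wooden", "fluffy"]  # "fluffy" has no rating
kept = [w for w in candidates if is_concrete(w, toy_ratings)]
print(kept)  # ['red', 'wooden']
```

Unrated words are treated the same as abstract ones, so a rare but concrete adjective is silently excluded from the evaluation.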

In [11]:
import bitsandbytes as bnb
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    pipeline,
    BitsAndBytesConfig,
)

def load_llm_pipe():
    llm_ckpt = "meta-llama/Llama-2-7b-chat-hf"
    tokenizer = AutoTokenizer.from_pretrained(llm_ckpt)
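    # Pad with EOS. Setting pad_token = "[PAD]" afterwards would override this
    # with a string that was never added to the tokenizer's vocabulary.
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"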

    bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                    bnb_4bit_quant_type="nf4",
                                    bnb_4bit_compute_dtype=torch.float16,
                                    bnb_4bit_use_double_quant=True)

    model = AutoModelForCausalLM.from_pretrained(llm_ckpt,
                                                 quantization_config=bnb_config,device_map="auto",
                                                 cache_dir=None)
    pipe = pipeline("text-generation",
                    model=model,
                    tokenizer=tokenizer,
                    trust_remote_code=True,
                    device_map="auto",
                    batch_size=32)
    return pipe

print("\nLoading LLM\n")
llm_pipe = load_llm_pipe()

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]


Loading LLM



tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Device set to use cuda:0
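
As a rough sanity check on what 4-bit loading saves, the shard sizes in the progress bars above can be turned into a back-of-the-envelope estimate. This sketch assumes the downloaded shards store fp16 weights (2 bytes per parameter) and ignores quantization constants, activations, and any layers left unquantized, so the last figure is only a lower bound.

```python
# Shard sizes reported by the download progress bars, in GB.
fp16_gb = 9.98 + 3.50
params_billion = fp16_gb / 2   # assumption: fp16 stores 2 bytes per parameter
nf4_gb = params_billion * 0.5  # 4 bits = 0.5 bytes per parameter

print(f"fp16 checkpoint: {fp16_gb:.2f} GB")
print(f"approx. parameters: {params_billion:.2f} B")
print(f"4-bit weights (lower bound): {nf4_gb:.2f} GB")
```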


In [12]:
from functools import lru_cache
from torch.utils.data import Dataset

class ListDataset(Dataset):
    def __init__(self, original_list):
        self.original_list = original_list

    def __len__(self):
        return len(self.original_list)

    def __getitem__(self, i):
        return self.original_list[i]

def parse_ans(ans):
    ans_word_list = ans.lower().replace(',','').replace('.','').replace(';','').replace('\n',' ').split(' ')
    if 'yes' in ans_word_list:
        return 'yes'
    elif 'no' in ans_word_list or 'not' in ans_word_list:
        return 'no'
    elif 'unsure' in ans_word_list:
        return 'unsure'
    else:
        return 'ERROR: '+';'.join(ans_word_list)

def make_prompt(cap, adj, tokenizer):
    _prompt = f'''Here are a few descriptions of an image: {cap}\nDoes the image contain the following adjective: {adj}?\nAnswer yes/no/unsure.\n The answer is: '''
    prompt = tokenizer.apply_chat_template([{'role':'user', "content":_prompt}], tokenize=False)
    return prompt

@lru_cache(maxsize=None)
def get_answers(caps_flat, adjs_flat, pipe):
    prompts = [make_prompt(cap, adj, pipe.tokenizer) for cap,adj in zip(caps_flat, adjs_flat)]
    dataset = ListDataset(prompts)

    outputs = []
    with tqdm(total=len(prompts)) as pbar:
        for out in pipe(dataset, max_new_tokens=8, do_sample=False, num_return_sequences=1):
            outputs.append(out)
            pbar.update(1)

    outputs = [outputs[i][0]['generated_text'][len(prompts[i]):].strip() for i in range(len(outputs))]
    outputs = [parse_ans(out) for out in outputs]
    return outputs

def flatten_data(df):
    caps_flat, adjs_flat = [], []
    for cap, adjs in zip(df.ground_truth_caption, df.generated_adjs):
        for adj in adjs:
            caps_flat.append(cap)
            adjs_flat.append(adj)
    return tuple(caps_flat), tuple(adjs_flat)

def unflatten_responses(responses_flat, df):
    responses_unflat = []
    i=0
    for adjs in df.generated_adjs:
        cur_responses = []
        for adj in adjs:
            cur_responses.append(responses_flat[i])
            i+=1
        responses_unflat.append(cur_responses)

    assert(len(responses_unflat) == len(df.generated_adjs))
    return responses_unflat

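def apply_ignore_words(responses_flat, adjs_flat):
    ignore_words = ['painting', 'drawing', 'photo', 'picture', 'portrait', 'photograph']
    # Build a new list: responses_flat is the list cached by lru_cache in get_answers,
    # so mutating it in place would alter the cached result.
    return ['ignore' if adj in ignore_words else resp
            for resp, adj in zip(responses_flat, adjs_flat)]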

def get_llm_responses(df, llm_pipe):

    caps_flat, adjs_flat = flatten_data(df)
    responses_flat = get_answers(caps_flat, adjs_flat, llm_pipe)
    responses_flat = apply_ignore_words(responses_flat, adjs_flat)
    responses = unflatten_responses(responses_flat,df)

    return responses

print("\nGetting LLM Responses\n")
llm_responses = get_llm_responses(df, llm_pipe)
hallucinated = [[item == 'no' for item in sublist] for sublist in llm_responses]


Getting LLM Responses



  0%|          | 0/4 [00:00<?, ?it/s]
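
The answer parser checks for `yes` first, then `no`/`not`, then `unsure`, which affects how free-form replies are labelled. The sketch below repeats that precedence in a standalone function so it can be tried on a few hypothetical replies; the reply strings are made up for illustration.

```python
def parse_ans(ans):
    """Same precedence as the notebook's parser: yes, then no/not, then unsure."""
    words = (ans.lower().replace(',', '').replace('.', '')
             .replace(';', '').replace('\n', ' ').split(' '))
    if 'yes' in words:
        return 'yes'
    if 'no' in words or 'not' in words:
        return 'no'
    if 'unsure' in words:
        return 'unsure'
    return 'ERROR: ' + ';'.join(words)

for reply in ["Yes, it does.", "No.", "I am not sure", "Unsure", "Maybe"]:
    print(repr(reply), "->", parse_ans(reply))
```

Note that a reply such as "I am not sure" is labelled `no` because it contains the word `not`, so it counts as a hallucination rather than as `unsure`.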



In [13]:
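def get_och_score(llm_responses):
    # Flatten the per-caption responses and count each verdict.
    responses = [resp for resp_per_cap in llm_responses for resp in resp_per_cap]
    counts = pd.Series(responses, dtype=str).str.lower().str.strip().value_counts().to_dict()
    # Edge cases: with no 'yes' the score is 1 (also when nothing was scored);
    # with no 'no' the score is 0.
    if not counts.get('yes'):
        return 1
    if not counts.get('no'):
        return 0
    # Fraction of adjectives judged absent from the ground-truth caption.
    return counts['no'] / (counts['yes'] + counts['no'])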

OpenCHAIR_score = get_och_score(llm_responses)
print("\nOpenCHAIR Score: \n")
print(OpenCHAIR_score)


OpenCHAIR Score: 

1
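
The score is the share of `no` verdicts among all `yes` and `no` verdicts; `unsure`, `ignore`, and error responses do not enter the ratio. A minimal worked example on made-up verdicts (not results from this run) could look like this:

```python
from collections import Counter

# Hypothetical per-caption verdicts, made up for illustration only.
toy_responses = [["yes", "no"], ["unsure"], ["yes", "yes", "ignore"]]

counts = Counter(r for per_caption in toy_responses for r in per_caption)
score = counts["no"] / (counts["yes"] + counts["no"])
print(score)  # 0.25
```

A score of exactly 1, as printed above, therefore means that no adjective received a `yes`, which on a 20-row sample can also happen when every response is `unsure` or unparseable.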


In [14]:
columns_order = ['ground_truth_caption'] + [col for col in df.columns if col != 'ground_truth_caption']
# Take an explicit copy so the column assignments below do not raise SettingWithCopyWarning.
df = df[columns_order].copy()

df['adj_exist_in_gt?'] = llm_responses
df['hallucinated'] = hallucinated

# Save the results to method1_results.csv
df.to_csv('method1_results.csv', index=False)


## Download Results

Once the notebook finishes running, you can download the results file:

- **`method1_results.csv`** will be saved in the Colab session's files.  
- Locate it in the **Files** panel on the left side of the Colab interface.  
- Right-click the file and select **Download** to save it to your local machine.
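
The download can also be triggered from a cell instead of the Files panel. The sketch below uses `google.colab.files.download`, which is only available inside Colab; `download_if_colab` is a hypothetical wrapper that falls back to a message elsewhere.

```python
def download_if_colab(path):
    """Trigger a browser download of `path` when running inside Colab.

    Returns True if the download was requested, False outside Colab.
    """
    try:
        from google.colab import files  # only importable inside Colab
    except ImportError:
        return False
    files.download(path)
    return True

if not download_if_colab("method1_results.csv"):
    print("Not running in Colab; look for the file in the current working directory.")
```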