# Notebook 3: Datasets
In this notebook, we'll go over how to administer different measures to your AI model using datasets. Specifically, we aim to adapt measures from cognitive psychology so they can be administered via the API. This requires making each item in the measure (text only, or a text/image pair) callable by our API function through a dataset.

**Make sure to download the necessary zip file and upload it to JupyterLab before running this script**

In [None]:
import os

# make sure nb3_files.zip exists
if not os.path.exists('nb3_files.zip'):
    # raise an error instead of calling exit(), which would shut down the notebook kernel
    raise FileNotFoundError('nb3_files.zip not found. Please make sure it exists in the current directory.')

In [None]:
# First unzip tutorial contents
import shutil
shutil.unpack_archive('nb3_files.zip', 'nb3_files')

notebook_files_path = 'nb3_files/nb3_files/'

### 1 Text-only vs Text-Image pair dataset

Based on the task you select, it might make more sense to use text-only data or text/image pair data.

Text-only data is preferable if your task is evaluating the linguistic abilities of AI systems. For example, maybe your idea of intelligence is highly influenced by someone's ability to do verbal reasoning. In that case, you may want to find certain reasoning puzzles from cognitive psychology to administer through the API.

Text/image data is preferable if your task evaluates both the linguistic and visual abilities of AI systems. For example, spatial reasoning tasks require subjects to look at images, videos, or other visual stimuli and reason about spatial properties of the scene.

### 2 JSON files

To make all the course projects as accessible as possible, we will provide code that allows you to call the API and evaluate a dataset.

Since this function is already written, all your datasets need to be in a specific format: a special kind of text file called a JSON file. This format is especially convenient because it stores text in a structured, human-readable way and can be loaded directly into Python (for example, into a dataframe) for analysis!

JSON files are similar to the dictionaries (the mapping object type) we learned about in notebook 1. They contain key-value pairs:

```{JSON}
[
    {
        "id": "250",
        "image": "000000000370.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nIs the broccoli to the left or right from person's perspective?"
            },
            {
                "from": "gpt",
                "value": "The broccoli is to the left from the person's perspective"
            }
        ]
    },
    ...
]
```

This is an example of some fine-tuning data:
- `id` is a unique identifier for the item
- `image` is the path to the image the item refers to
- `conversations` is the actual data used during fine-tuning, where "from": "human" is the prompt and "from": "gpt" is the ideal response
- this is one item of many, all stored in the same format
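
To make the dictionary analogy concrete, here is a minimal sketch that parses a shortened copy of the item above with Python's built-in `json` module. The text values are abbreviated, and the variable names `raw_text`, `items`, and `first_item` are only for illustration.

```python
import json

# A shortened copy of the fine-tuning item above, stored as plain text
raw_text = """
[
    {
        "id": "250",
        "image": "000000000370.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\\nIs the broccoli to the left or right?"},
            {"from": "gpt", "value": "The broccoli is to the left."}
        ]
    }
]
"""

# json.loads turns JSON text into Python lists and dictionaries
items = json.loads(raw_text)

first_item = items[0]        # the outer [...] becomes a Python list
print(type(first_item))      # each {...} becomes a Python dict
print(first_item["id"])      # values are looked up by key, like any dict
print(first_item["conversations"][1]["value"])
```

Once the text is parsed, everything from notebook 1 about lists and dictionaries applies to it.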

Our datasets will look somewhat similar to this:

```{JSON}
[
    {
        "id": "unique_id",
        "image": "image_path" or null (for text-only),
        "prompt": "dataset_text"
    },
    ...
]
```
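
One detail worth knowing for text-only items: Python's `None` is written to a JSON file as `null`, and `null` is read back in as `None`. The short sketch below shows the round trip using the built-in `json` module; the item contents are placeholders.

```python
import json

# A text-only item: in Python the missing image is None
item = {"id": "00", "image": None, "prompt": "dataset_text"}

# json.dumps writes None as the JSON value null
as_text = json.dumps(item)
print(as_text)  # {"id": "00", "image": null, "prompt": "dataset_text"}

# json.loads reads null back in as None
round_trip = json.loads(as_text)
print(round_trip["image"] is None)  # True
```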

In [6]:
# to make things easy, let's define a function that formats a json item
def format_json_item(id, image, prompt):
    return {
        "id": id,
        "image": image,
        "prompt": prompt
    }
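
Because every item has to follow the same format, it can help to check items before saving them. The following is only a sketch: `check_item` and `REQUIRED_KEYS` are hypothetical names that are not part of the course code, and the prompts are made-up examples.

```python
# Copy of the helper defined above so this sketch runs on its own
def format_json_item(id, image, prompt):
    return {"id": id, "image": image, "prompt": prompt}

REQUIRED_KEYS = {"id", "image", "prompt"}

def check_item(item):
    """Hypothetical helper: return True if an item has the expected shape."""
    has_keys = set(item.keys()) == REQUIRED_KEYS
    # image may be None (text-only) or a string (path or file name)
    image_ok = item.get("image") is None or isinstance(item.get("image"), str)
    return has_keys and image_ok and isinstance(item.get("prompt"), str)

text_item = format_json_item("00", None, "Is this argument valid?")
image_item = format_json_item("01", "picture.jpg", "What is in the picture?")

print(check_item(text_item))     # True
print(check_item(image_item))    # True
print(check_item({"id": "02"}))  # False: image and prompt are missing
```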

### 3 Example: Making a text-only dataset

Here we use a syllogistic reasoning task to make a text-only dataset.

Here are the items:

Valid Reasoning (Logically Correct)

1. All fruits have seeds. An apple is a fruit. → Therefore, an apple has seeds. (Valid)

2. All squares are rectangles. All rectangles have four sides. → Therefore, all squares have four sides. (Valid)

3. No reptiles are warm-blooded. All snakes are reptiles. → Therefore, no snakes are warm-blooded. (Valid)

4. All birds lay eggs. Penguins are birds. → Therefore, penguins lay eggs. (Valid)

5. Some humans are musicians. All musicians can read music. → Therefore, some humans can read music. (Valid)

Invalid Reasoning (Logical Fallacy)

1. All dogs are animals. All cats are animals. → Therefore, all dogs are cats. (Invalid – Undistributed Middle Fallacy)

2. Some tall people are basketball players. Michael is tall. → Therefore, Michael is a basketball player. (Invalid – Incorrect Generalization)

3. No insects are mammals. A spider is not an insect. → Therefore, a spider is a mammal. (Invalid – Incorrect Negative Inference)

4. All fish live in water. Dolphins live in water. → Therefore, dolphins are fish. (Invalid – Undistributed Middle Fallacy)

5. All roses are flowers. Some flowers are red. → Therefore, all roses are red. (Invalid – False Distribution Fallacy)

#### Selecting Instructions

For this task, we will use the same instructions for all items:

Each question contains two premises and a conclusion. Your task is to determine whether the conclusion logically follows from the premises. If the conclusion is logically valid, select "Valid". If the conclusion does not logically follow, select "Invalid".

In [7]:
# \n is a newline character
instructions = "Each question contains two premises and a conclusion. Your task is to determine whether the conclusion logically follows from the premises.\nIf the conclusion is logically valid, select 'Valid'.\nIf the conclusion does not logically follow, select 'Invalid'."

In [8]:
print(instructions)

Each question contains two premises and a conclusion. Your task is to determine whether the conclusion logically follows from the premises.
If the conclusion is logically valid, select 'Valid'.
If the conclusion does not logically follow, select 'Invalid'.


In [9]:
item_1 =  "Premise 1: All fruits have seeds.\nPremise 2: An apple is a fruit.\nConclusion: Therefore, an apple has seeds."
print(item_1)

Premise 1: All fruits have seeds.
Premise 2: An apple is a fruit.
Conclusion: Therefore, an apple has seeds.


In [10]:
# rest of the items
item_2 =  "Premise 1: All squares are rectangles.\nPremise 2: All rectangles have four sides.\nConclusion: Therefore, all squares have four sides."
item_3 =  "Premise 1: No reptiles are warm-blooded.\nPremise 2: All snakes are reptiles.\nConclusion: Therefore, no snakes are warm-blooded."
item_4 =  "Premise 1: All birds lay eggs.\nPremise 2: Penguins are birds.\nConclusion: Therefore, penguins lay eggs."
item_5 =  "Premise 1: Some humans are musicians.\nPremise 2: All musicians can read music.\nConclusion: Therefore, some humans can read music."
item_6 = "Premise 1: All dogs are animals.\nPremise 2: All cats are animals.\nConclusion: Therefore, all dogs are cats."
item_7 = "Premise 1: Some tall people are basketball players.\nPremise 2: Michael is tall.\nConclusion: Therefore, Michael is a basketball player."
item_8 = "Premise 1: No insects are mammals.\nPremise 2: A spider is not an insect.\nConclusion: Therefore, a spider is a mammal."
item_9 = "Premise 1: All fish live in water.\nPremise 2: Dolphins live in water.\nConclusion: Therefore, dolphins are fish."
item_10 = "Premise 1: All roses are flowers.\nPremise 2: Some flowers are red.\nConclusion: Therefore, all roses are red."
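
Typing the `\n` characters by hand for every item is easy to get wrong. As an alternative, the sketch below builds the same layout from the three parts of each syllogism. The helper `format_syllogism` is hypothetical, and only two of the ten items are shown.

```python
def format_syllogism(premise_1, premise_2, conclusion):
    """Hypothetical helper: lay out one syllogism like the items above."""
    return (
        f"Premise 1: {premise_1}\n"
        f"Premise 2: {premise_2}\n"
        f"Conclusion: {conclusion}"
    )

# Each tuple holds (premise 1, premise 2, conclusion)
syllogisms = [
    ("All fruits have seeds.", "An apple is a fruit.", "Therefore, an apple has seeds."),
    ("All dogs are animals.", "All cats are animals.", "Therefore, all dogs are cats."),
]

generated_items = [format_syllogism(*parts) for parts in syllogisms]
print(generated_items[0])
```

The first generated string is identical to `item_1`, so either approach produces the same dataset.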

In [11]:
all_items = [item_1, item_2, item_3, item_4, item_5, item_6, item_7, item_8, item_9, item_10]

json_items = []

# let's loop through each item and format it as a json item
for i, item in enumerate(all_items):
    # id is the item number with leading zeros so that each id is two digits
    item_id = f"{i:02d}"
    image = None  # no image for this task
    # our actual prompt for the model will be the instructions followed by the item
    prompt = instructions + "\n" + item

    # now add this formatted item to our json_items list
    json_items.append(format_json_item(item_id, image, prompt))

In [12]:
# Let's check that the first three items are formatted correctly
json_items[:3]

[{'id': '00',
  'image': None,
  'prompt': "Each question contains two premises and a conclusion. Your task is to determine whether the conclusion logically follows from the premises.\nIf the conclusion is logically valid, select 'Valid'.\nIf the conclusion does not logically follow, select 'Invalid'.\nPremise 1: All fruits have seeds.\nPremise 2: An apple is a fruit.\nConclusion: Therefore, an apple has seeds."},
 {'id': '01',
  'image': None,
  'prompt': "Each question contains two premises and a conclusion. Your task is to determine whether the conclusion logically follows from the premises.\nIf the conclusion is logically valid, select 'Valid'.\nIf the conclusion does not logically follow, select 'Invalid'.\nPremise 1: All squares are rectangles.\nPremise 2: All rectangles have four sides.\nConclusion: Therefore, all squares have four sides."},
 {'id': '02',
  'image': None,
  'prompt': "Each question contains two premises and a conclusion. Your task is to determine whether the con

In [13]:
# Looks good! Let's save this as a json file
output_path = 'nb3_textonly_dataset.json'

import json
with open(output_path, 'w') as f: # open file in write mode
    # indent=4 is not necessary but makes it more readable for us
    json.dump(json_items, f, indent=4) # write our list of json items to file
    print(f"Saved {len(json_items)} items to {output_path}")

Saved 10 items to nb3_textonly_dataset.json
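
A quick way to confirm a save worked is to load the file back in and compare it with what was written. The sketch below uses stand-in items and a temporary file so it runs on its own; in the notebook you would open `nb3_textonly_dataset.json` instead.

```python
import json
import os
import tempfile

# Stand-in items so this sketch is self-contained
example_items = [
    {"id": "00", "image": None, "prompt": "first prompt"},
    {"id": "01", "image": None, "prompt": "second prompt"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_dataset.json")
    with open(path, "w") as f:
        json.dump(example_items, f, indent=4)

    # Read the file back in
    with open(path, "r") as f:
        loaded_items = json.load(f)

print(len(loaded_items))              # number of items read from the file
print(loaded_items == example_items)  # True if nothing changed on the way
```

If pandas is installed, passing the loaded list to `pd.DataFrame(...)` should give a table with one row per item, which is one way to get the dataframe mentioned earlier.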


### 4 Example: Making a text/image pair dataset

Here we'll use the RMET (Reading the Mind in the Eyes Test) data to make a dataset. We already have the RMET materials available locally, in the files we unzipped at the start of this notebook. We'll use these to build our dataset instead of manually entering every item like before!

In [14]:
instructions = "Choose which word best describes what the person in the picture is thinking or feeling based on their eyes alone.\nEven if you feel like you cannot tell based on their eyes alone, please select the best word.\nYou may feel that more than one word is applicable, but please choose just one word, the word which you consider to be most suitable.\nYour 4 choices are: "
print(instructions)

Choose which word best describes what the person in the picture is thinking or feeling based on their eyes alone.
Even if you feel like you cannot tell based on their eyes alone, please select the best word.
You may feel that more than one word is applicable, but please choose just one word, the word which you consider to be most suitable.
Your 4 choices are: 


In [16]:
with open(f'{notebook_files_path}word_choices.txt', "r") as f:
    word_choices = f.readlines()
word_choices = [x.strip() for x in word_choices]

word_choices[:3]

['playful comforting irritated bored',
 'terrified upset arrogant annoyed',
 'joking flustered desire convinced']
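
Each line of the file holds the four choices for one item, separated by spaces. The notebook appends the raw line to the instructions, but if you want the choices as separate words (for example, to check there are exactly four, or to present them differently), a sketch like this may help. It assumes every choice is a single word, which holds for the three lines shown above.

```python
# The first three lines shown above
word_choices = [
    "playful comforting irritated bored",
    "terrified upset arrogant annoyed",
    "joking flustered desire convinced",
]

# str.split() with no arguments splits on whitespace
options = [line.split() for line in word_choices]
print(options[0])  # ['playful', 'comforting', 'irritated', 'bored']

# One possible alternative presentation: a comma-separated list
pretty = ", ".join(options[0])
print(pretty)  # playful, comforting, irritated, bored

# Every item should offer exactly four choices
print(all(len(opts) == 4 for opts in options))  # True
```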

In [None]:
import os
# the images I want to use are also in my github repo at this location: bridgetleonard2/Eyes-Mind-Model/main/task_materials/regular
# here we read the local copies from the folder we unzipped earlier
# get all image files in the images directory
images = os.listdir(f'{notebook_files_path}images')
# sort images so names are ordered
images.sort()
images[:3]

**Note:** If you are creating a dataset using data you have locally, make sure things are in the **order** that you want. You don't want to create wrong pairs of data!
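
One way to catch wrong pairs early is to check each image against its word choices before building the dataset. The sketch below assumes the image file names contain the four choice words separated by hyphens, as in the example names used here. The helper `pair_matches` is hypothetical, and your own files may need a different check.

```python
# Example values in the style of this notebook's data (file names are an assumption)
images = [
    "01-playful-comforting-irritated-bored-300x175.jpg",
    "02-terrified-upset-arrogant-annoyed-300x175.jpg",
]
word_choices = [
    "playful comforting irritated bored",
    "terrified upset arrogant annoyed",
]

def pair_matches(image_name, choices_line):
    """Hypothetical check: every choice word appears in the image file name."""
    name_parts = image_name.split("-")
    return all(word in name_parts for word in choices_line.split())

# The two lists must be the same length before pairing them up
assert len(images) == len(word_choices)

mismatches = [
    (img, words)
    for img, words in zip(images, word_choices)
    if not pair_matches(img, words)
]
print(mismatches)  # [] means every image lines up with its word choices
```

Also keep in mind that sorting file names is alphabetical, so names without leading zeros can end up out of numeric order (`'10.jpg'` sorts before `'2.jpg'`).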

In [17]:
json_items = [] # clear out this list

for i, image in enumerate(images):  # image is the image file name
    # id is the item number with leading zeros so that each id is two digits
    item_id = f"{i:02d}"
    # our actual prompt for the model will be the instructions followed by the word choices
    item_words = word_choices[i]
    prompt = instructions + item_words

    # now add this formatted item to our json_items list
    json_items.append(format_json_item(item_id, image, prompt))

In [18]:
json_items[:3]

[{'id': '00',
  'image': 'https://raw.githubusercontent.com/bridgetleonard2/Eyes-Mind-Model/main/task_materials/regular/01-playful-comforting-irritated-bored-300x175.jpg',
  'prompt': 'Choose which word best describes what the person in the picture is thinking or feeling based on their eyes alone.\nEven if you feel like you cannot tell based on their eyes alone, please select the best word.\nYou may feel that more than one word is applicable, but please choose just one word, the word which you consider to be most suitable.\nYour 4 choices are: playful comforting irritated bored'},
 {'id': '01',
  'image': 'https://raw.githubusercontent.com/bridgetleonard2/Eyes-Mind-Model/main/task_materials/regular/02-terrified-upset-arrogant-annoyed-300x175.jpg',
  'prompt': 'Choose which word best describes what the person in the picture is thinking or feeling based on their eyes alone.\nEven if you feel like you cannot tell based on their eyes alone, please select the best word.\nYou may feel that m

In [19]:
# Looks good! Let's save this as a json file
output_path = 'nb3_imagetext_dataset.json'

import json
with open(output_path, 'w') as f: # open file in write mode
    # indent=4 is not necessary but makes it more readable for us
    json.dump(json_items, f, indent=4) # write our list of json items to file
    print(f"Saved {len(json_items)} items to {output_path}")

Saved 36 items to nb3_imagetext_dataset.json
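
The `image` field stores whatever string you pass in. If the function that calls the API expects a full URL rather than a local file name (an assumption about your setup), one possible approach is to join a base URL with each file name. The base URL below is the GitHub location that appears earlier in this notebook, and the file name is just an example.

```python
# Base location of the hosted images (the GitHub path that appears earlier in this notebook)
base_url = "https://raw.githubusercontent.com/bridgetleonard2/Eyes-Mind-Model/main/task_materials/regular/"

# Example file name; in the notebook this would be the sorted images list
image_names = ["01-playful-comforting-irritated-bored-300x175.jpg"]

# Build one URL per image by joining the base URL and the file name
image_urls = [base_url + name for name in image_names]
print(image_urls[0])
```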
