## Tokenizer + BERT → Data-processing → OpenAI

Running this notebook in an isolated environment (for example, one created with conda) is recommended. It last ran without issues on Python 3.8.18.

In [132]:
!pip install --upgrade accelerate transformers numpy pandas nltk

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
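
The warning above suggests setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch of one way to do that, assuming it runs in the kernel before `transformers` or `tokenizers` is first used:

```python
import os

# Set this before the first use of `tokenizers`/`transformers` in the process.
# "false" turns off tokenizer parallelism, so a later fork (for example a
# `!pip install ...` cell) should no longer trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```

Setting the variable in the shell that launches Jupyter should have the same effect.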




In [133]:
from transformers import AutoTokenizer, DistilBertModel
import torch

import numpy as np
import pandas as pd
import nltk

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

In [134]:
# load in coco classes from 'coco-classes.json'
import json
with open('coco-classes.json') as f:
  coco_classes = json.load(f)
print(coco_classes)
print(len(coco_classes))

['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush']
80


### Datasets

**PennTreebank Dataset**

The Penn Treebank POS tagset 

1. CC  Coordinating conjunction  
2. CD  Cardinal number           
3. DT  Determiner                
4. EX  Existential there  	     
5. FW  Foreign word              
6. IN  Preposition/subord. conjunction 	   
7. JJ  Adjective                  
8. JJR Adjective, comparative    
9. JJS Adjective, superlative    
10. LS  List item marker          
11. MD  Modal                     
12. NN  Noun, singular or mass    
13. NNS Noun, plural              
14. NNP Proper noun, singular     
15. NNPS Proper noun, plural      
16. PDT Predeterminer             
17. POS Possessive ending         
18. PRP Personal pronoun          
19. PRP$ Possessive pronoun       
20. RB  Adverb                    
21. RBR Adverb, comparative       
22. RBS Adverb, superlative       
23. RP  Particle                  
24. SYM Symbol 			             
25. TO  to 
26. UH  Interjection 
27. VB  Verb, base form 
28. VBD Verb, past tense 
29. VBG Verb, gerund/present participle 
30. VBN Verb, past participle 
31. VBP Verb, non-3rd ps. sing. present
32. VBZ Verb, 3rd ps. sing. present 
33. WDT wh-determiner 
34. WP  wh-pronoun 
35. WP$ Possessive wh-pronoun 
36. WRB wh-adverb 
37. \#  Pound sign 
38. $  Dollar sign 
39. .  Sentence-final punctuation 
40. ,  Comma 
41. :  Colon, semi-colon 
42. (  Left bracket character 
43. )  Right bracket character 
44. "  Straight double quote 
45. `  Left open single quote 
46. ``  Left open double quote 
47. '  Right close single quote 
48. ''  Right close double quote

For examples: https://www.sketchengine.eu/penn-treebank-tagset/

In [135]:
from nltk.corpus import treebank

In [136]:
# Download the Penn Treebank dataset
nltk.download('treebank')

# Load the Penn Treebank dataset
ptb_sentences = treebank.sents()
ptb_tagged_words = treebank.tagged_words()

# Display some information about the dataset
print(f"Number of sentences: {len(ptb_sentences)}")
print(f"Number of tagged words: {len(ptb_tagged_words)}")

# Display the first few sentences and their POS tags
print("\nSample of the dataset:")
current_word_position = 0
for i in range(3):
    print(f"Sentence {i + 1}: {ptb_sentences[i]}")
    ending_word_position = current_word_position + len(ptb_sentences[i])
    print(f"POS tags: {ptb_tagged_words[current_word_position:ending_word_position]}")
    current_word_position = ending_word_position
    print()

# Convert the dataset to Pandas DataFrame for exploration
columns = ['Word', 'POS']
ptb_df = pd.DataFrame(data={'Word': [word for (word, _) in ptb_tagged_words],
                            'POS': [pos for (_, pos) in ptb_tagged_words]}, columns=columns)

# Display the first few rows of the DataFrame
print("\nPandas DataFrame:")
print(ptb_df.head())

[nltk_data] Downloading package treebank to
[nltk_data]     /Users/mattelim/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


Number of sentences: 3914
Number of tagged words: 100676

Sample of the dataset:
Sentence 1: ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
POS tags: [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

Sentence 2: ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.']
POS tags: [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')]

Sentence 3: ['Rudolph', 'Agnew', ',', '55', 'years', 'old', 'and', 'former', 'chairman', 'of', 'Consolidated', 
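
The cell above recovers per-sentence tags by walking through the flat `tagged_words` list with a running offset. The same idea can be sketched as a small helper; `regroup_by_sentence` is a hypothetical name, and the toy data is a shortened copy of the first two sentences shown above:

```python
from itertools import islice


def regroup_by_sentence(sentences, tagged_words):
    """Split a flat list of (word, tag) pairs back into per-sentence lists."""
    remaining = iter(tagged_words)
    # Each sentence consumes exactly as many pairs as it has tokens.
    return [list(islice(remaining, len(sentence))) for sentence in sentences]


# Shortened toy data based on the sample output above.
sentences = [["Pierre", "Vinken", ","], ["Mr.", "Vinken", "is"]]
tagged_words = [
    ("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","),
    ("Mr.", "NNP"), ("Vinken", "NNP"), ("is", "VBZ"),
]

grouped = regroup_by_sentence(sentences, tagged_words)
for sentence, tags in zip(sentences, grouped):
    print(sentence, tags)
```

NLTK's corpus reader also exposes `treebank.tagged_sents()`, which returns the pairs already grouped by sentence and avoids the offset bookkeeping.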

In [137]:
print(ptb_df.shape)

(100676, 2)


In [138]:
# save the dataframe to csv
ptb_df.to_csv('ptb_df.csv', index=False)

In [139]:
# reduce dataframe to unique words
# ptb_df_unique = ptb_df.drop_duplicates(subset=['Word'])
# print(ptb_df_unique.shape)

# reduce dataframe to unique words, count the number of times each word appears and add as a column, while keeping the POS column
ptb_df_unique = ptb_df.groupby(['Word', 'POS']).size().reset_index(name='count')
# ptb_df_unique = ptb_df.drop_duplicates(subset=['Word'])
print(ptb_df_unique.shape)

# save the dataframe to csv
ptb_df_unique.to_csv('ptb_df_unique.csv', index=False)


(13781, 3)


In [140]:
# Remove '-NONE-' POS tags
ptb_df_unique = ptb_df_unique[ptb_df_unique['POS'] != '-NONE-']
print(ptb_df_unique.shape)

# Remove '-LRB-' '-RRB-' POS tags
ptb_df_unique = ptb_df_unique[ptb_df_unique['POS'] != '-LRB-']
ptb_df_unique = ptb_df_unique[ptb_df_unique['POS'] != '-RRB-']
print(ptb_df_unique.shape)

(13341, 3)
(13337, 3)


In [141]:
# Create a new dataframe that contains the 5 most frequent words of each POS tag
ptb_df_top5 = ptb_df_unique.groupby('POS').apply(lambda x: x.nlargest(5, 'count')).reset_index(drop=True)
print(ptb_df_top5.shape)
print(ptb_df_top5.head(10))

# save the dataframe to csv
ptb_df_top5.to_csv('ptb_df_top5.csv', index=False)

(176, 3)
  Word POS  count
0    #   #     16
1    $   $    718
2  US$   $      4
3   C$   $      2
4   ''  ''    684
5    '  ''     10
6    ,   ,   4885
7   Wa   ,      1
8    .   .   3828
9    ?   .     40
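
For readers who want to see the count, filter, and top-k steps without pandas, here is a plain-Python sketch of the same pipeline. The helper name `top_words_per_tag` and the toy input are made up for illustration; the excluded tags are the ones removed in the cells above.

```python
from collections import Counter, defaultdict
from heapq import nlargest

# Tags filtered out in the notebook before ranking.
EXCLUDED_TAGS = {"-NONE-", "-LRB-", "-RRB-"}


def top_words_per_tag(tagged_words, k=5):
    """Return the k most frequent (word, count) pairs for each POS tag."""
    counts = Counter(tagged_words)  # maps (word, tag) -> occurrences
    by_tag = defaultdict(list)
    for (word, tag), n in counts.items():
        if tag not in EXCLUDED_TAGS:
            by_tag[tag].append((word, n))
    return {
        tag: nlargest(k, pairs, key=lambda pair: pair[1])
        for tag, pairs in by_tag.items()
    }


# Invented toy input, not real Treebank counts.
toy = [
    ("the", "DT"), ("the", "DT"), ("a", "DT"),
    ("*", "-NONE-"),
    ("board", "NN"), ("board", "NN"), ("group", "NN"),
]
result = top_words_per_tag(toy, k=1)
print(result)
```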


In [142]:
# # Get 100 most common words
# ptb_df_unique.sort_values(by=['count'], ascending=False, inplace=True)
# ptb_df_unique_top100 = ptb_df_unique.head(100)
# print(ptb_df_unique_top100.shape)

# # save the dataframe to csv
# ptb_df_unique_top100.to_csv('ptb_df_unique_top100.csv', index=False)

In [143]:
# Generated using ChatGPT 3.5, not foolproof

verb_equivalents = {
    'person': 'walk',
    'bicycle': 'ride',
    'car': 'drive',
    'motorcycle': 'ride',
    'airplane': 'fly',
    'bus': 'transit',
    'train': 'commute',
    'truck': 'haul',
    'boat': 'sail',
    'traffic light': 'control',
    'fire hydrant': 'extinguish',
    'stop sign': 'halt',
    'parking meter': 'measure',
    'bench': 'sit',
    'bird': 'soar',
    'cat': 'purr',
    'dog': 'bark',
    'horse': 'gallop',
    'sheep': 'graze',
    'cow': 'moo',
    'elephant': 'trumpet',
    'bear': 'roar',
    'zebra': 'stride',
    'giraffe': 'graze',
    'backpack': 'carry',
    'umbrella': 'shield',
    'handbag': 'tote',
    'tie': 'fasten',
    'suitcase': 'carry',
    'frisbee': 'throw',
    'skis': 'descend',
    'snowboard': 'glide',
    'sports ball': 'throw',
    'kite': 'flutter',
    'baseball bat': 'swing',
    'baseball glove': 'catch',
    'skateboard': 'skate',
    'surfboard': 'surf',
    'tennis racket': 'hit',
    'bottle': 'uncork',
    'wine glass': 'sip',
    'cup': 'drink',
    'fork': 'poke',
    'knife': 'slice',
    'spoon': 'scoop',
    'bowl': 'eat',
    'banana': 'peel',
    'apple': 'bite',
    'sandwich': 'devour',
    'orange': 'peel',
    'broccoli': 'munch',
    'carrot': 'chop',
    'hot dog': 'consume',
    'pizza': 'devour',
    'donut': 'crave',
    'cake': 'indulge',
    'chair': 'sit',
    'couch': 'lounge',
    'potted plant': 'nurture',
    'bed': 'rest',
    'dining table': 'dine',
    'toilet': 'flush',
    'tv': 'watch',
    'laptop': 'type',
    'mouse': 'click',
    'remote': 'control',
    'keyboard': 'type',
    'cell phone': 'call',
    'microwave': 'heat',
    'oven': 'bake',
    'toaster': 'toast',
    'sink': 'rinse',
    'refrigerator': 'chill',
    'book': 'read',
    'clock': 'tick',
    'vase': 'hold',
    'scissors': 'cut',
    'teddy bear': 'hug',
    'hair drier': 'dry',
    'toothbrush': 'brush'
}

abstract_equivalents = {
    'person': 'individuality',
    'bicycle': 'mobility',
    'car': 'transportation',
    'motorcycle': 'vibration',
    'airplane': 'flight',
    'bus': 'transit',
    'train': 'journey',
    'truck': 'shipment',
    'boat': 'voyage',
    'traffic light': 'signal',
    'fire hydrant': 'safety',
    'stop sign': 'pause',
    'parking meter': 'measurement',
    'bench': 'reflection',
    'bird': 'song',
    'cat': 'companion',
    'dog': 'loyalty',
    'horse': 'grace',
    'sheep': 'conformity',
    'cow': 'moo',
    'elephant': 'majesty',
    'bear': 'solitude',
    'zebra': 'pattern',
    'giraffe': 'elegance',
    'backpack': 'adventure',
    'umbrella': 'protection',
    'handbag': 'accessory',
    'tie': 'formality',
    'suitcase': 'journey',
    'frisbee': 'recreation',
    'skis': 'glide',
    'snowboard': 'descent',
    'sports ball': 'competition',
    'kite': 'soar',
    'baseball bat': 'swing',
    'baseball glove': 'protection',
    'skateboard': 'thrill',
    'surfboard': 'excitement',
    'tennis racket': 'score',
    'bottle': 'containment',
    'wine glass': 'celebration',
    'cup': 'containment',
    'fork': 'prong',
    'knife': 'sharpness',
    'spoon': 'cutlery',
    'bowl': 'container',
    'banana': 'softness',
    'apple': 'crunch',
    'sandwich': 'combination',
    'orange': 'acidity',
    'broccoli': 'nutrient',
    'carrot': 'health',
    'hot dog': 'indulgence',
    'pizza': 'aroma',
    'donut': 'indulgence',
    'cake': 'celebration',
    'chair': 'support',
    'couch': 'comfort',
    'potted plant': 'growth',
    'bed': 'rest',
    'dining table': 'gathering',
    'toilet': 'sanitation',
    'tv': 'entertainment',
    'laptop': 'productivity',
    'mouse': 'navigation',
    'remote': 'control',
    'keyboard': 'input',
    'cell phone': 'communication',
    'microwave': 'heating',
    'oven': 'Thanksgiving',
    'toaster': 'breakfast',
    'sink': 'drainage',
    'refrigerator': 'cooling',
    'book': 'knowledge',
    'clock': 'time',
    'vase': 'decoration',
    'scissors': 'cutting',
    'teddy bear': 'childhood',
    'hair drier': 'drying',
    'toothbrush': 'hygiene'
}

# create a unique set using the values from both dictionaries
unique_equivalents = set(list(verb_equivalents.values()) + list(abstract_equivalents.values()))
print(len(unique_equivalents))
print(unique_equivalents)

139
{'companion', 'toast', 'lounge', 'watch', 'haul', 'combination', 'majesty', 'extinguish', 'nutrient', 'bark', 'rest', 'hold', 'drainage', 'chill', 'time', 'drink', 'eat', 'journey', 'surf', 'shipment', 'measure', 'devour', 'cutlery', 'catch', 'drive', 'cooling', 'sail', 'flutter', 'moo', 'descend', 'commute', 'score', 'indulgence', 'cutting', 'containment', 'communication', 'transportation', 'glide', 'slice', 'drying', 'tote', 'nurture', 'breakfast', 'cut', 'throw', 'brush', 'competition', 'signal', 'carry', 'mobility', 'dine', 'conformity', 'Thanksgiving', 'comfort', 'call', 'soar', 'pause', 'input', 'hygiene', 'click', 'health', 'entertainment', 'halt', 'crunch', 'softness', 'control', 'loyalty', 'formality', 'individuality', 'gallop', 'protection', 'gathering', 'pattern', 'walk', 'productivity', 'purr', 'sip', 'decoration', 'bite', 'accessory', 'reflection', 'rinse', 'uncork', 'skate', 'grace', 'support', 'scoop', 'tick', 'flight', 'type', 'read', 'bake', 'navigation', 'poke', '

In [144]:
# OpenAI ChatGPT 3.5 generated POS tags

word_pos_dict = {
    'companion': 'NN',
    'toast': 'NN',
    'lounge': 'NN',
    'watch': 'VB',
    'haul': 'VB',
    'combination': 'NN',
    'majesty': 'NN',
    'extinguish': 'VB',
    'nutrient': 'NN',
    'bark': 'NN',
    'rest': 'NN',
    'hold': 'VB',
    'drainage': 'NN',
    'chill': 'VB',
    'time': 'NN',
    'drink': 'NN',
    'eat': 'VB',
    'journey': 'NN',
    'surf': 'VB',
    'shipment': 'NN',
    'measure': 'NN',
    'devour': 'VB',
    'cutlery': 'NN',
    'catch': 'NN',
    'drive': 'VB',
    'cooling': 'VBG',
    'sail': 'VB',
    'flutter': 'NN',
    'moo': 'NN',
    'descend': 'VB',
    'commute': 'NN',
    'score': 'NN',
    'indulgence': 'NN',
    'cutting': 'NN',
    'containment': 'NN',
    'communication': 'NN',
    'transportation': 'NN',
    'glide': 'VB',
    'slice': 'NN',
    'drying': 'NN',
    'tote': 'NN',
    'nurture': 'NN',
    'breakfast': 'NN',
    'cut': 'NN',
    'throw': 'VB',
    'brush': 'NN',
    'competition': 'NN',
    'signal': 'NN',
    'carry': 'VB',
    'mobility': 'NN',
    'dine': 'VB',
    'conformity': 'NN',
    'Thanksgiving': 'NNP',
    'comfort': 'NN',
    'call': 'VB',
    'soar': 'VB',
    'pause': 'NN',
    'input': 'NN',
    'hygiene': 'NN',
    'click': 'NN',
    'health': 'NN',
    'entertainment': 'NN',
    'halt': 'NN',
    'crunch': 'NN',
    'softness': 'NN',
    'control': 'NN',
    'loyalty': 'NN',
    'formality': 'NN',
    'individuality': 'NN',
    'gallop': 'NN',
    'protection': 'NN',
    'gathering': 'NN',
    'pattern': 'NN',
    'walk': 'VB',
    'productivity': 'NN',
    'purr': 'NN',
    'sip': 'VB',
    'decoration': 'NN',
    'bite': 'NN',
    'accessory': 'NN',
    'reflection': 'NN',
    'rinse': 'VB',
    'uncork': 'VB',
    'skate': 'VB',
    'grace': 'NN',
    'support': 'NN',
    'scoop': 'NN',
    'tick': 'NN',
    'flight': 'NN',
    'type': 'VB',
    'read': 'VB',
    'bake': 'VB',
    'navigation': 'NN',
    'poke': 'VB',
    'descent': 'NN',
    'voyage': 'NN',
    'heating': 'VBG',
    'thrill': 'NN',
    'consume': 'VB',
    'fly': 'VB',
    'hug': 'NN',
    'knowledge': 'NN',
    'shield': 'NN',
    'transit': 'NN',
    'solitude': 'NN',
    'heat': 'NN',
    'fasten': 'VB',
    'celebration': 'NN',
    'hit': 'VB',
    'flush': 'VB',
    'adventure': 'NN',
    'peel': 'VB',
    'song': 'NN',
    'elegance': 'NN',
    'recreation': 'NN',
    'roar': 'NN',
    'trumpet': 'NN',
    'container': 'NN',
    'aroma': 'NN',
    'childhood': 'NN',
    'measurement': 'NN',
    'sanitation': 'NN',
    'vibration': 'NN',
    'dry': 'VB',
    'growth': 'NN',
    'safety': 'NN',
    'swing': 'NN',
    'crave': 'VB',
    'ride': 'VB',
    'sharpness': 'NN',
    'stride': 'NN',
    'graze': 'NN',
    'excitement': 'NN',
    'sit': 'VB',
    'indulge': 'VB',
    'acidity': 'NN',
    'prong': 'NN',
    'chop': 'VB',
    'munch': 'NN'
}


In [145]:
# create a dataframe from the dictionary, with the key as 'Word' and the value as 'POS'
# for now, set the 'count' column to 0
word_pos_df = pd.DataFrame(list(word_pos_dict.items()), columns=['Word', 'POS'])
word_pos_df['count'] = 0
print(word_pos_df.shape)
word_pos_df.head()

(139, 3)


Unnamed: 0,Word,POS,count
0,companion,NN,0
1,toast,NN,0
2,lounge,NN,0
3,watch,VB,0
4,haul,VB,0


In [146]:
# Create dataframe using coco classes, with POS tags set to NN and count set to 0
coco_classes_df = pd.DataFrame(coco_classes, columns=['Word'])
coco_classes_df['POS'] = 'NN'
coco_classes_df['count'] = 0
print(coco_classes_df.shape)
coco_classes_df.head()

(80, 3)


Unnamed: 0,Word,POS,count
0,person,NN,0
1,bicycle,NN,0
2,car,NN,0
3,motorcycle,NN,0
4,airplane,NN,0


In [147]:
# Combine the word_pos_df, coco_classes_df dataframes
combined_df = pd.concat([word_pos_df, coco_classes_df])

# Remove duplicates
combined_df = combined_df.drop_duplicates(subset=['Word'])

# Combine the combined_df and ptb_df_top5, prioritizing the ptb_df_top5 dataframe if there are duplicates
combined_df = pd.concat([combined_df, ptb_df_top5])
print(combined_df.shape)
print(combined_df.head())
combined_df = combined_df.drop_duplicates(subset=['Word'], keep='last')
print(combined_df.shape)
print(combined_df.head())

# save the dataframe to csv
combined_df.to_csv('combined_df.csv', index=False)

(395, 3)
        Word POS  count
0  companion  NN      0
1      toast  NN      0
2     lounge  NN      0
3      watch  VB      0
4       haul  VB      0
(383, 3)
        Word POS  count
0  companion  NN      0
1      toast  NN      0
2     lounge  NN      0
3      watch  VB      0
4       haul  VB      0
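
The priority rule in the cell above comes from concatenating the Treebank rows last and then keeping the last duplicate. A plain-Python sketch of that "last occurrence wins" behaviour follows; `drop_duplicates_keep_last` is a hypothetical helper and the counts are invented:

```python
def drop_duplicates_keep_last(rows, key="Word"):
    """Keep only the last row for each key, preserving the order of kept rows."""
    seen = set()
    kept = []
    # Walk backwards so the first row seen for a key is its last occurrence.
    for row in reversed(rows):
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    kept.reverse()
    return kept


# Invented rows: generated words first, Treebank rows appended afterwards.
generated = [
    {"Word": "time", "POS": "NN", "count": 0},
    {"Word": "haul", "POS": "VB", "count": 0},
]
treebank_top = [{"Word": "time", "POS": "NN", "count": 12}]

merged = drop_duplicates_keep_last(generated + treebank_top)
print(merged)
```

Because the Treebank rows come last, their POS tag and count replace the placeholder values of any generated word with the same spelling.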


In [148]:
# # create a dataframe from the unique set, with 'Word', 'POS', and 'count' columns
# # for now, set 'POS' to 'TBD' and 'count' to 0
# equivalents_df = pd.DataFrame(unique_equivalents, columns=['Word'])
# equivalents_df['POS'] = 'TBD'
# equivalents_df['count'] = 0
# print(equivalents_df.shape)
# print(equivalents_df.head())

In [149]:
# # Combine equivalents_df and ptb_df_top5
# combined_df = pd.concat([ptb_df_top5, equivalents_df])
# print(combined_df.shape)
# print(combined_df.head())

# # save the dataframe to csv
# combined_df.to_csv('combined_df.csv', index=False)

In [150]:
# # create a superset of set of unique words from both dictionaries and the top 5 words from each POS tag and the original coco classes
# superset = set(list(unique_equivalents) + list(ptb_df_top5['Word']) + list(coco_classes))
# print(len(superset))
# print(superset)

### Tokenizer + BERT

In [151]:
# Convert combined_df['Word'] to list
combined_list = combined_df['Word'].tolist()
print(len(combined_list))
print(combined_list[:5])

383
['companion', 'toast', 'lounge', 'watch', 'haul']


In [152]:
# get embedding for each word in combined_list
# ❗️ note: each word is encoded on its own (no sentence context); if a word splits into several
#    sub-word tokens, their vectors are averaged into one embedding
# ❓ question: are we interested in the final contextual embedding for each class? currently, we're looking at the final hidden state.
embeddings = []
model.eval()  # inference mode (dropout disabled)
with torch.no_grad():  # gradients are not needed for feature extraction
    for word in combined_list:
        input_ids = torch.tensor(tokenizer.encode(word)).unsqueeze(0)
        outputs = model(input_ids)
        last_hidden_states = outputs[0]
        # skip the first and last tokens, which are the [CLS] and [SEP] tokens
        # take the mean of the other tokens (that form the word)
        embeddings.append(torch.mean(last_hidden_states[0][1:-1], dim=0).tolist())
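
To make the pooling step concrete, here is a dependency-free sketch of what the slice `[1:-1]` followed by a mean does to the token vectors. `mean_pool_word` is a hypothetical helper and the two-dimensional vectors are made up; real DistilBERT vectors have 768 dimensions.

```python
def mean_pool_word(token_vectors):
    """Average the sub-word vectors, skipping the first ([CLS]) and last ([SEP])."""
    inner = token_vectors[1:-1]
    if not inner:
        raise ValueError("no sub-word tokens between [CLS] and [SEP]")
    dim = len(inner[0])
    return [sum(vec[d] for vec in inner) / len(inner) for d in range(dim)]


# Made-up vectors for [CLS], two word pieces, and [SEP].
toy = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
pooled = mean_pool_word(toy)
print(pooled)  # [2.0, 3.0]
```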

In [153]:
print(len(embeddings))

383


In [158]:
# round each val in embedding to 3 decimal places (tolist() yields plain Python floats)
embeddings = [np.around(np.array(e), 3).tolist() for e in embeddings]

# create string of all classes and their embeddings & save to text file
# ❗️ note: only the first axis (index 0) is written for now due to context window length;
#    the commented-out variant below wrote the first 10 axes
# with open("output.txt", "w") as text_file:
#     for i in range(len(combined_list)):
#         class_str = f"{combined_list[i]}: {embeddings[i][:10]}\n"
#         text_file.write(class_str)
with open("output.txt", "w") as text_file:
    for i in range(len(combined_list)):
        class_str = f"{combined_list[i]}: {embeddings[i][0]}\n"
        text_file.write(class_str)

In [155]:
# convert embedding list to DataFrame (one row per word, one column per dimension)
df = pd.DataFrame(embeddings)
df.insert(0, 'word', combined_list)
print(df.shape)
df.head()  # Display the first 5 rows to check the structure

(383, 769)


Unnamed: 0,word,0,1,2,3,4,5,6,7,8,...,758,759,760,761,762,763,764,765,766,767
0,companion,0.128,-0.082,0.109,-0.012,0.317,0.262,-0.016,0.128,0.357,...,0.395,-0.107,-0.026,-0.061,0.358,-0.138,-0.067,0.236,0.385,-0.115
1,toast,0.209,0.411,0.026,-0.0,-0.243,-0.325,0.148,0.238,-0.12,...,0.373,-0.174,0.119,-0.036,0.445,-0.029,0.145,0.169,0.364,-0.097
2,lounge,0.759,-0.116,0.116,0.113,0.46,-0.212,0.115,-0.032,0.137,...,0.259,0.07,-0.059,-0.114,0.441,-0.096,-0.074,0.196,0.215,-0.254
3,watch,0.401,-0.003,-0.061,-0.406,0.933,-0.14,-0.186,0.286,-0.17,...,0.459,-0.067,-0.266,-0.318,0.173,-0.109,0.219,-0.01,0.404,-0.161
4,haul,0.58,0.031,0.311,0.048,-0.102,0.081,0.047,0.423,-0.187,...,0.505,0.096,0.095,-0.009,0.453,0.003,0.337,0.259,0.064,0.092
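
The prompt used later in this notebook expects a labels list followed by one list of values per embedding dimension, passed in as `slice_0`, which is not defined in the cells shown. The sketch below is only an assumption about how such a string could be built; `build_dimension_slice` is a hypothetical helper, and the numbers are the first three dimensions of the first three rows in the table above.

```python
def build_dimension_slice(labels, embeddings, start, stop):
    """Format the labels list followed by one list per embedding dimension."""
    lines = [str(labels)]
    for d in range(start, stop):
        # Collect dimension d across all words (a column of the table above).
        lines.append(str([vec[d] for vec in embeddings]))
    return "\n".join(lines)


labels = ["companion", "toast", "lounge"]
embeddings = [
    [0.128, -0.082, 0.109],
    [0.209, 0.411, 0.026],
    [0.759, -0.116, 0.116],
]

slice_text = build_dimension_slice(labels, embeddings, 0, 2)
print(slice_text)
```

Keeping `stop - start` small is one way to stay within the model's context window, at the cost of needing several requests to cover all 768 dimensions.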


In [157]:
# OpenAI ChatGPT 3.5 generated test set

daily_life_objects = ['coffee mug', 'newspaper', 'shoes', 'headphones', 'umbrella stand', 'trash can', 'escalator', 'delivery van', 'gardening hose', 'street sign', 'mailbox', 'garage door', 'picnic table', 'seagull sculpture', 'houseplant', 'lap desk', 'home office chair', 'calendar', 'wallet', 'sunglasses', 'notebook', 'desktop computer', 'printer', 'office desk lamp', 'USB drive', 'water bottle', 'wine opener', 'mason jar', 'serving spoon', 'chopsticks', 'plate', 'napkin', 'apple slicer', 'cooking spatula', 'baking pan', 'cookie jar', 'tea kettle', 'candle', 'throw pillow', 'blanket', 'house slippers', 'bathroom scale', 'vanity mirror', 'alarm clock', 'picture frame', 'cactus plant', 'bookshelf', 'wall clock', 'wristwatch', 'reading glasses', 'hairbrush', 'hair tie', 'hand mirror', 'shaving razor', 'toilet paper', 'tissue box', 'paper towel holder', 'flashlight', 'laptop sleeve', 'computer mouse pad', 'USB cable', 'keyboard cover', 'wireless router', 'smartphone stand', 'kitchen apron', 'oven mitts', 'pot holder', 'cutting board', 'salt and pepper shakers', 'napkin holder', 'dish rack', 'wine rack', 'picture album', 'canvas tote bag', 'office phone', 'desk organizer', 'magnetic board']

### OpenAI API (optional)

Using the ChatGPT web interface may be simpler for one-off exploration. Manipulating the outputs programmatically will likely require careful prompt engineering so that replies follow a fixed format. OpenAI's API also offers a JSON output mode that may be worth looking into.
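
If the reply follows the `Row n: <encoded concept>. <one sentence rationale>` format requested by the prompt further down, it can be parsed with regular expressions. This is a rough sketch: `parse_reply` is a hypothetical helper, and the sample reply is an abridged version of the model output recorded at the end of this notebook.

```python
import re

# "Row 3: Edibility. The values ..." -> (row number, concept, rationale)
ROW_PATTERN = re.compile(r"^Row (\d+): (.+?)\. (.+)$", re.MULTILINE)
COUNT_PATTERN = re.compile(r"There are (\d+) dimensions")


def parse_reply(text):
    """Return the reported dimension count and a {row: (concept, rationale)} dict."""
    count_match = COUNT_PATTERN.search(text)
    n = int(count_match.group(1)) if count_match else None
    rows = {
        int(index): (concept, rationale)
        for index, concept, rationale in ROW_PATTERN.findall(text)
    }
    return n, rows


# Abridged sample reply (rationales shortened).
reply = (
    "There are 6 dimensions.\n\n"
    "Row 1: 3D shape. The values for this dimension vary significantly across the labels.\n"
    "Row 2: Mobility. The values for this dimension are mostly positive.\n"
)

n, rows = parse_reply(reply)
print(n, rows[1][0], rows[2][0])
```

The pattern splits at the first ". " after the row number, so a concept that itself contains a period followed by a space (such as "Natural vs. Man-made") would be cut short. A stricter delimiter in the prompt would avoid that.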

In [98]:
# install OpenAI api
!pip install --upgrade openai


Collecting openai
  Downloading openai-1.2.3-py3-none-any.whl.metadata (16 kB)
Collecting anyio<4,>=3.5.0 (from openai)
  Downloading anyio-3.7.1-py3-none-any.whl.metadata (4.7 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.8.0-py3-none-any.whl (20 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.1-py3-none-any.whl.metadata (7.1 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.4.2-py3-none-any.whl.metadata (158 kB)
Collecting sniffio>=1.1 (from anyio<4,>=3.5.0->openai)
  Downloading sniffio-1.3.0-py3-none-any.whl (10 kB)
Collecting exceptiongroup (from anyio<4,>=3.5.0->openai)
  Downloading exceptiongroup-1.1.3-py3-none-any.whl.metadata (6.1 kB)
Collecting httpcore (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl.metadata (20 kB)
Collecting annotated-types>=0.4.0 (from py

In [None]:
# load api key from secrets.json
import openai

try:
    with open("secrets.json") as f:
        secrets = json.load(f)
    my_api_key = secrets["openai"]
    print("API key loaded.")
    openai.api_key = my_api_key
except FileNotFoundError:
    print("Secrets file not found. You need it to run this section.")
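
A slightly more defensive variant of the key loading above is sketched here. `load_api_key` is a hypothetical helper; it assumes the same `secrets.json` layout with an `"openai"` field and falls back to the `OPENAI_API_KEY` environment variable. The demo uses a throwaway file and a placeholder key.

```python
import json
import os
import tempfile


def load_api_key(path="secrets.json", field="openai", env_var="OPENAI_API_KEY"):
    """Return the key from the JSON file, else from the environment, else None."""
    try:
        with open(path) as f:
            return json.load(f)[field]
    except (FileNotFoundError, KeyError):
        # Missing file or missing field: fall back to the environment variable.
        return os.environ.get(env_var)


# Demo with a temporary secrets file and a placeholder (not a real key).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "secrets.json")
    with open(demo_path, "w") as f:
        json.dump({"openai": "sk-placeholder"}, f)
    key = load_api_key(demo_path)

print(key)
```

Returning `None` instead of only printing a message lets later cells check for the key before making a request.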

In [None]:
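# NOTE: `openai.ChatCompletion.create` is the pre-1.0 interface of the `openai` package.
# With openai>=1.0 (such as the 1.2.3 release installed above) the equivalent call is
# `openai.chat.completions.create`.
# `slice_0` (the labels list followed by the per-dimension value lists) is not defined in
# the cells shown here and must be built before running this cell.
completion = openai.ChatCompletion.create(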
  model="gpt-3.5-turbo",
  messages=[
    # {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": f"For the following lists, the first list contains words that have been put into DistilBERT – let us call this a label. Each of the subsequent lists contains the embedding value from DistilBERT for one dimension (out of 768) across the labels. By comparing the values for each label for each list, please interpret the likely concepts that each list, that is the dimension/axis of the embedding, encodes. Each of the 50 rows should encode a different concept. \n\n First, count the number of lists excluding the first (the labels list) and report on the number. \n '''There are <N> dimensions''' \n\n Then, the main output should take this form for row 'n', from 'Row 1' to 'Row N': \n '''Row n: <encoded concept>. <one sentence rationale for interpretation>''' \n\n {slice_0}"}
  ]
)

print(completion.choices[0].message)

# log the stringified output into a txt file by appending it to the end of the file
with open("output.txt", "a") as f:
  f.write(str(completion.choices[0].message))

{
  "role": "assistant",
  "content": "There are 6 dimensions.\n\nRow 1: 3D shape. The values for this dimension vary significantly across the labels, indicating that it encodes information about the three-dimensional shape of the objects.\nRow 2: Mobility. The values for this dimension are mostly positive, suggesting that it encodes information about the mobility or movement associated with the objects.\nRow 3: Edibility. The values for this dimension are a mix of positive and negative, but they are generally low, indicating that it encodes information about the edibility of the objects.\nRow 4: Size. The values for this dimension range from negative to positive, suggesting that it encodes information about the size or scale of the objects.\nRow 5: Consumer goods. The values for this dimension are mostly negative, indicating that it encodes information about whether the objects are commonly used consumer goods.\nRow 6: Natural vs. Man-made. The values for this dimension vary significa